Introduction to Pandas#
Pandas works hand-in-hand with other Python libraries (e.g. matplotlib and numpy) to make manipulating data (what the pandas team calls "data munging") easy. With pandas it is easy to:

- Access data using variable names, while keeping the full linear algebra capabilities of numpy
- Group data by values (for example, calculate the mean income by state)
- Plot and summarize values
- Join (combine) different sources of data
- Use powerful time series and panel data capabilities (beyond the scope of this course)
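As a quick taste of the "group data by values" bullet, here is a minimal sketch using a toy DataFrame (the state/income numbers are made up for illustration):

```python
import pandas as pd

# Synthetic example data (hypothetical values, for illustration only)
df = pd.DataFrame({
    "state":  ["NC", "NC", "VA", "VA", "VA"],
    "income": [50000, 60000, 55000, 65000, 60000],
})

# Group rows by state and take the mean income within each group
mean_income = df.groupby("state")["income"].mean()
print(mean_income)
```

The result is a Series indexed by state, with one mean per group.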
Note: If you want to reinforce some of these concepts on your own, I recommend this superb YouTube video (https://www.youtube.com/watch?v=5JnMutdy6Fw) and accompanying coursework (brandon-rhodes/pycon-pandas-tutorial) by Brandon Rhodes.
There are numerous ways to get data into pandas:

- Import Excel and comma-delimited (CSV) data
- Import Stata, SAS, MATLAB, and other datasets
- Import data from a SQL server
- Import data scraped from the web
- Build data manually (by typing in values)

In this tutorial we will focus on the first two methods for reading data, but just know that there is probably a way to get your data into pandas irrespective of its format.
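Each of these routes has a dedicated `pd.read_*` function. A minimal sketch of the pattern (using an in-memory string in place of a file on disk; the file names in the comments are hypothetical):

```python
import io
import pandas as pd

# pd.read_csv accepts a path or any file-like object; here we use an
# in-memory string so the example is self-contained
csv_text = "id,st,ffdays12\n1,37,6\n2,51,0\n"
df = pd.read_csv(io.StringIO(csv_text))

# The same pattern works for other formats, e.g. (file names hypothetical):
#   pd.read_excel("trips.xlsx")     # Excel
#   pd.read_stata("trips.dta")      # Stata
#   pd.read_sas("trips.sas7bdat")   # SAS
print(df.shape)
```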
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sbn
Loading and Cleaning Data#
We will be loading 2018 "Trip" data from the NOAA Fisheries Service Recreational Fisheries Statistics Survey (called MRIP). More detail here: https://www.fisheries.noaa.gov/topic/recreational-fishing-data

The MRIP data needs to be downloaded from my website (you only need to do this once).
##
## Load some data
##
trips_ = pd.read_pickle("../_static/lectures/appendix/mrips_trips.pickle")
trips_.head()
(output: the first 5 rows of the DataFrame; the display truncates most of the 83 columns, showing `add_hrs` through `arx_method` and `time` through `intercept_date`)

5 rows × 83 columns
`trips_` is a very big dataset with lots of columns we'll never use. Let's trim it down:
trips_.columns
Index(['index', 'add_hrs', 'add_ph', 'age', 'alt_flag', 'area', 'area_nc',
'area_x', 'art_reef', 'arx_method', 'asg_code', 'boat_hrs', 'catch',
'celltype', 'cntrbtrs', 'cnty', 'cnty_res', 'coastal', 'compflag',
'county', 'date1', 'dist', 'distkeys', 'f_by_p', 'ffdays12', 'ffdays2',
'first', 'fl_reg', 'fshinsp_a', 'gear', 'gender', 'hrs_dtd', 'hrsf',
'id_code', 'intsite', 'kod', 'leader', 'license', 'mode2001',
'mode_asg', 'mode_f', 'mode_fx', 'monitor', 'month', 'muni_res',
'muni_trp', 'new_list', 'num_fish_a', 'num_typ2', 'num_typ3',
'num_typ4', 'num_typ6', 'num_typ9', 'on_list', 'party', 'prim1',
'prim1_common', 'prim2', 'prim2_common', 'prt_code', 'psu_id',
'pvt_res', 'reefcode', 'reg_res', 'region', 'rig', 'sep_fish', 'st',
'st_res', 'strat_id', 'strat_interval', 'sub_reg', 'telefon', 'time',
'tourn', 'tsn1', 'tsn2', 'turtle', 'var_id', 'wave', 'wp_int', 'year',
'intercept_date'],
dtype='object')
This reduces the columns and creates a new data frame called `trips` from `trips_`:
trips = trips_[['id_code','year','wave','intercept_date','st','prim1',
'prim1_common','prim2','prim2_common','cnty','ffdays12',
'ffdays2']]
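Indexing a DataFrame with a list of column names returns a new DataFrame containing only those columns, in the order given. A small self-contained sketch of the same pattern (the values here are made up; only the column names mirror the tutorial):

```python
import pandas as pd

# Toy stand-in for trips_ (values are synthetic)
trips_demo = pd.DataFrame({
    "id_code": ["a", "b"], "year": [2018, 2018],
    "ffdays12": [6.0, 0.0], "add_hrs": [None, None],
})

# Selecting with a list of names keeps only those columns, in that order
keep = ["id_code", "year", "ffdays12"]
subset = trips_demo[keep]
print(list(subset.columns))
```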
Here are summary statistics for the numeric data columns:
trips.describe()
| | year | wave | intercept_date | st | cnty | ffdays12 | ffdays2 |
|---|---|---|---|---|---|---|---|
| count | 31340.0 | 31340.000000 | 31319 | 31340.000000 | 31319.000000 | 31319.000000 | 31319.000000 |
| mean | 2018.0 | 3.876324 | 2018-07-24 23:34:12.364379648 | 33.676835 | 127.602574 | 36.734953 | 5.647945 |
| min | 2018.0 | 1.000000 | 2018-01-02 00:00:00 | 10.000000 | 1.000000 | 0.000000 | 0.000000 |
| 25% | 2018.0 | 3.000000 | 2018-06-07 00:00:00 | 24.000000 | 25.000000 | 1.000000 | 0.000000 |
| 50% | 2018.0 | 4.000000 | 2018-07-19 00:00:00 | 37.000000 | 55.000000 | 6.000000 | 1.000000 |
| 75% | 2018.0 | 5.000000 | 2018-09-07 00:00:00 | 37.000000 | 55.000000 | 25.000000 | 5.000000 |
| max | 2018.0 | 6.000000 | 2018-12-31 00:00:00 | 51.000000 | 810.000000 | 999.000000 | 99.000000 |
| std | 0.0 | 1.143137 | NaN | 11.965540 | 231.156385 | 127.058290 | 13.331434 |
Using Pandas DataFrames#
Referencing columns in data:#
# copy the data frame into t (note: this is different from assignment [t = trips])
t = trips.copy()
t.ffdays12.head()
0 0.0
1 6.0
2 0.0
3 6.0
4 6.0
Name: ffdays12, dtype: float64
t['ffdays12'].head()
0 0.0
1 6.0
2 0.0
3 6.0
4 6.0
Name: ffdays12, dtype: float64
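The two access styles above are equivalent, but they are not interchangeable in general: attribute access (`t.ffdays12`) only works when the column name is a valid Python identifier that doesn't collide with a DataFrame method, while bracket access works for any name. A small sketch (synthetic data):

```python
import pandas as pd

df = pd.DataFrame({"ffdays12": [0.0, 6.0], "trip length": [1, 2]})

# Attribute access works for simple names like ffdays12...
col_a = df.ffdays12

# ...but a name with a space (or one that shadows a method, like "mean")
# requires bracket access
col_b = df["trip length"]

# Both styles return the same Series for a given column
print(col_a.equals(df["ffdays12"]))
```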
Referencing rows in data:#
We can use numpy-like slicing:
# rows with index labels 51 through 55 (loc is label-based and includes both endpoints)
t.loc[51:55]
| | id_code | year | wave | intercept_date | st | prim1 | prim1_common | prim2 | prim2_common | cnty | ffdays12 | ffdays2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 51 | 1109620180120003 | 2018.0 | 1.0 | 2018-01-20 | 37.0 | | | | | 55.0 | 3.0 | 0.0 |
| 52 | 1109620180120004 | 2018.0 | 1.0 | 2018-01-20 | 37.0 | 8835440901 | RED DRUM | | | 55.0 | 998.0 | 98.0 |
| 53 | 1109620180120005 | 2018.0 | 1.0 | 2018-01-20 | 37.0 | 8835440901 | RED DRUM | | | 55.0 | 100.0 | 30.0 |
| 54 | 1109620180120006 | 2018.0 | 1.0 | 2018-01-20 | 37.0 | | | | | 55.0 | 50.0 | 0.0 |
| 55 | 1109620180120007 | 2018.0 | 1.0 | 2018-01-20 | 37.0 | 8835440801 | BLACK DRUM | | | 55.0 | 60.0 | 15.0 |
# the first two rows (positions 0 and 1) and the first 5 columns (iloc is position-based and excludes the endpoint)
t.iloc[0:2,0:5]
| | id_code | year | wave | intercept_date | st |
|---|---|---|---|---|---|
| 0 | 1105920180111001 | 2018.0 | 1.0 | 2018-01-11 | 37.0 |
| 1 | 1105920180121001 | 2018.0 | 1.0 | 2018-01-21 | 37.0 |
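The `loc`/`iloc` distinction is easy to see when the index labels don't line up with positions. A minimal sketch (synthetic data, labels chosen deliberately):

```python
import pandas as pd

# Index labels (51, 52, 53) intentionally differ from positions (0, 1, 2)
df = pd.DataFrame({"x": [10, 20, 30]}, index=[51, 52, 53])

by_label    = df.loc[51:52]   # label-based: endpoint INCLUDED -> labels 51, 52
by_position = df.iloc[0:2]    # position-based: endpoint EXCLUDED -> rows 0, 1

# Here the two selections happen to coincide
print(by_label.equals(by_position))
```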
We can select rows of data based on column values. Let's select all the rows where `st` is 51 (Virginia):
t[t.st == 51].head()
| | id_code | year | wave | intercept_date | st | prim1 | prim1_common | prim2 | prim2_common | cnty | ffdays12 | ffdays2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3623 | 1662020180511001 | 2018.0 | 3.0 | 2018-05-11 | 51.0 | | | | | 1.0 | 0.0 | 0.0 |
| 3624 | 1662020180511002 | 2018.0 | 3.0 | 2018-05-11 | 51.0 | | | | | 1.0 | 0.0 | 0.0 |
| 3625 | 1662020180511003 | 2018.0 | 3.0 | 2018-05-11 | 51.0 | | | | | 1.0 | 1.0 | 1.0 |
| 3626 | 1662020180511004 | 2018.0 | 3.0 | 2018-05-11 | 51.0 | 8835440801 | BLACK DRUM | | | 1.0 | 6.0 | 1.0 |
| 3627 | 1662020180511005 | 2018.0 | 3.0 | 2018-05-11 | 51.0 | | | | | 1.0 | 6.0 | 1.0 |
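Conditions like this can be combined with `&` (and) or `|` (or); each condition needs its own parentheses because `&` binds tighter than comparison operators. A sketch with synthetic data:

```python
import pandas as pd

# Synthetic stand-in data
df = pd.DataFrame({"st": [51, 51, 37], "ffdays12": [3.0, 20.0, 15.0]})

# Virginia trips by anglers who fished more than 10 days in the past year;
# the parentheses around each condition are required
va_active = df[(df.st == 51) & (df.ffdays12 > 10)]
print(len(va_active))
```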
Find all rows where days spent fishing during past 12 months exceeds 10:
t[t.ffdays12>10].head()
| | id_code | year | wave | intercept_date | st | prim1 | prim1_common | prim2 | prim2_common | cnty | ffdays12 | ffdays2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 1105920180127001 | 2018.0 | 1.0 | 2018-01-27 | 37.0 | 8835440901 | RED DRUM | | | 133.0 | 100.0 | 2.0 |
| 6 | 1105920180127002 | 2018.0 | 1.0 | 2018-01-27 | 37.0 | | | | | 133.0 | 100.0 | 2.0 |
| 7 | 1105920180127003 | 2018.0 | 1.0 | 2018-01-27 | 37.0 | 8857030300 | LEFTEYE FLOUNDER GENUS | 8835440901 | RED DRUM | 133.0 | 30.0 | 0.0 |
| 8 | 1105920180127004 | 2018.0 | 1.0 | 2018-01-27 | 37.0 | 8835440901 | RED DRUM | | | 133.0 | 50.0 | 5.0 |
| 9 | 1105920180127005 | 2018.0 | 1.0 | 2018-01-27 | 37.0 | | | | | 133.0 | 12.0 | 12.0 |
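An equivalent, often more readable way to write such filters is `DataFrame.query`, which takes the condition as a string and refers to columns by name. A sketch with synthetic data:

```python
import pandas as pd

# Synthetic stand-in data
df = pd.DataFrame({"st": [51, 37, 37], "ffdays12": [3.0, 20.0, 5.0]})

# Same filter as df[df.ffdays12 > 10], expressed as a query string
heavy = df.query("ffdays12 > 10")
print(len(heavy))
```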
Math on columns:#
# we can create a new column and use numpy functions:
t['temp'] = t.ffdays12 + .5*np.random.randn(t.shape[0])
t.head()
| | id_code | year | wave | intercept_date | st | prim1 | prim1_common | prim2 | prim2_common | cnty | ffdays12 | ffdays2 | temp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1105920180111001 | 2018.0 | 1.0 | 2018-01-11 | 37.0 | | | | | 129.0 | 0.0 | 0.0 | -0.360327 |
| 1 | 1105920180121001 | 2018.0 | 1.0 | 2018-01-21 | 37.0 | | | | | 129.0 | 6.0 | 2.0 | 7.362805 |
| 2 | 1105920180121002 | 2018.0 | 1.0 | 2018-01-21 | 37.0 | | | | | 129.0 | 0.0 | 0.0 | 0.269195 |
| 3 | 1105920180121003 | 2018.0 | 1.0 | 2018-01-21 | 37.0 | | | | | 129.0 | 6.0 | 2.0 | 6.871520 |
| 4 | 1105920180121004 | 2018.0 | 1.0 | 2018-01-21 | 37.0 | | | | | 129.0 | 6.0 | 3.0 | 5.407346 |
t.temp.mean()
36.74097889645163
# or we could do this all in one step if we didn't want to create a new column in t:
(t.ffdays12 + .5*np.random.randn(t.shape[0])).mean()
36.73324074328389
Note: You can combine pandas with numpy. This is the standard deviation of the column `ffdays12`:
np.std(t.ffdays12)
127.05626187891463
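One caveat worth knowing: `np.std` defaults to the population standard deviation (`ddof=0`), while the pandas `Series.std` method defaults to the sample version (`ddof=1`), so the two can differ slightly (compare the 127.056 above with the 127.058 in the `describe()` output). A small sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

pop      = np.std(s)        # ddof=0: population std, sqrt(1.25)
sample   = s.std()          # ddof=1: sample std, sqrt(5/3)
also_pop = s.std(ddof=0)    # matches np.std(s)

print(pop, sample)
```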
We can perform matrix/vector operations on pandas data.
Here, we can transpose a slice of the data:
t[['ffdays12','ffdays2']].head(10).T
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| ffdays12 | 0.0 | 6.0 | 0.0 | 6.0 | 6.0 | 100.0 | 100.0 | 30.0 | 50.0 | 12.0 |
| ffdays2 | 0.0 | 2.0 | 0.0 | 2.0 | 3.0 | 2.0 | 2.0 | 0.0 | 5.0 | 12.0 |
We can do matrix multiplication:
np.dot(t[['ffdays12','ffdays2']].head(10).T,t[['ffdays12','ffdays2']].head(10))
array([[23652., 836.],
[ 836., 194.]])
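The same product can be written with the `@` operator (or `DataFrame.dot`), which keeps the row/column labels instead of dropping to a bare numpy array. A sketch with synthetic data:

```python
import pandas as pd

# Synthetic stand-in for a two-column slice of the trips data
df = pd.DataFrame({"ffdays12": [0.0, 6.0, 0.0], "ffdays2": [0.0, 2.0, 0.0]})

# X'X via the @ operator; the result is a labeled 2x2 DataFrame
xtx = df.T @ df
print(xtx)
```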