Introduction to Pandas#

Pandas works hand-in-hand with other Python libraries (e.g. matplotlib and numpy) to make manipulating data (what the pandas team calls “data munging”) easy. With pandas you can:

  1. Access data using variable names while keeping the full linear algebra capabilities of numpy

  2. Group data by values (for example, calculate the mean income by state; see the sketch after this list)

  3. Plot and summarize values

  4. Join (combine) different sources of data

  5. Use powerful time series and panel data capabilities (beyond the scope of this course)
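
As promised in item 2, here is a minimal sketch of grouping; the state and income values below are invented purely for illustration:

import pandas as pd

# hypothetical income data; values are made up for illustration
df = pd.DataFrame({'state': ['NC', 'NC', 'VA', 'VA'],
                   'income': [52000, 48000, 61000, 59000]})

# mean income by state
df.groupby('state')['income'].mean()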

Note: If you want to reinforce some of these concepts on your own, I recommend this superb YouTube video (https://www.youtube.com/watch?v=5JnMutdy6Fw) and the accompanying coursework (brandon-rhodes/pycon-pandas-tutorial) by Brandon Rhodes.

There are numerous ways to get data into Pandas:

  • Import Excel and comma-delimited (CSV) data

  • Import Stata, SAS, MATLAB, and other datasets

  • Import data from an SQL server

  • Import data scraped from the web

  • Build data manually (by typing in values)

In this tutorial we will focus on the first two methods for reading data, but there is probably a way to get your data into pandas no matter what format it is in (see the reader sketch after the imports below).

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sbn
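
Each of the data sources above maps to a pandas read_* function. A quick sketch: the CSV is built in memory so the first call runs as-is, while the commented file names are placeholders:

import io

# comma-delimited data (here read from an in-memory string)
csv_text = 'st,ffdays12\n37,6\n51,12\n'
pd.read_csv(io.StringIO(csv_text))

# the same pattern works for other formats (paths are placeholders):
# pd.read_excel('data.xlsx')      # Excel workbooks
# pd.read_stata('data.dta')       # Stata datasets
# pd.read_sql(query, connection)  # SQL databases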

Loading and Cleaning Data#

We will be loading 2018 “Trip” data from the NOAA Fisheries Service Recreational Fisheries Statistics Survey (called MRIP). More detail here: https://www.fisheries.noaa.gov/topic/recreational-fishing-data

The MRIP data needs to be downloaded from my website (you only need to do this once).
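
A minimal sketch of that one-time download (the URL below is a placeholder; substitute the actual course link):

import os
import urllib.request

url = 'https://example.com/mrips_trips.pickle'            # placeholder URL
dest = '../_static/lectures/appendix/mrips_trips.pickle'
if not os.path.exists(dest):                              # only download once
    urllib.request.urlretrieve(url, dest)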

##
## Load some data
##
trips_ = pd.read_pickle("../_static/lectures/appendix/mrips_trips.pickle")
trips_.head()
index add_hrs add_ph age alt_flag area area_nc area_x art_reef arx_method ... time tourn tsn1 tsn2 turtle var_id wave wp_int year intercept_date
0 828 NaN 1 75.0 0.0 1 1 1 2.0 ... 1205.0 None 3.0 2018013733wdS:8AM-8PM 1.0 352.092753 2018.0 2018-01-11
1 829 NaN 2 32.0 0.0 1 1 1 1.0 ... 1140.0 None 3.0 2018013733weS:8AM-8PM 1.0 252.349394 2018.0 2018-01-21
2 830 NaN 2 25.0 0.0 1 1 1 1.0 ... 1142.0 None 3.0 2018013733weS:8AM-8PM 1.0 252.349394 2018.0 2018-01-21
3 831 NaN 2 35.0 0.0 1 1 1 1.0 ... 1144.0 None 3.0 2018013733weS:8AM-8PM 1.0 252.349394 2018.0 2018-01-21
4 832 NaN 1 34.0 0.0 1 1 1 1.0 ... 1230.0 None 3.0 2018013733weS:8AM-8PM 1.0 252.349394 2018.0 2018-01-21

5 rows × 83 columns

trips_ is a very wide dataset (83 columns), most of which we’ll never use. Let’s trim it down:

trips_.columns
Index(['index', 'add_hrs', 'add_ph', 'age', 'alt_flag', 'area', 'area_nc',
       'area_x', 'art_reef', 'arx_method', 'asg_code', 'boat_hrs', 'catch',
       'celltype', 'cntrbtrs', 'cnty', 'cnty_res', 'coastal', 'compflag',
       'county', 'date1', 'dist', 'distkeys', 'f_by_p', 'ffdays12', 'ffdays2',
       'first', 'fl_reg', 'fshinsp_a', 'gear', 'gender', 'hrs_dtd', 'hrsf',
       'id_code', 'intsite', 'kod', 'leader', 'license', 'mode2001',
       'mode_asg', 'mode_f', 'mode_fx', 'monitor', 'month', 'muni_res',
       'muni_trp', 'new_list', 'num_fish_a', 'num_typ2', 'num_typ3',
       'num_typ4', 'num_typ6', 'num_typ9', 'on_list', 'party', 'prim1',
       'prim1_common', 'prim2', 'prim2_common', 'prt_code', 'psu_id',
       'pvt_res', 'reefcode', 'reg_res', 'region', 'rig', 'sep_fish', 'st',
       'st_res', 'strat_id', 'strat_interval', 'sub_reg', 'telefon', 'time',
       'tourn', 'tsn1', 'tsn2', 'turtle', 'var_id', 'wave', 'wp_int', 'year',
       'intercept_date'],
      dtype='object')

This selects the columns we want and creates a new DataFrame called trips from trips_:

trips = trips_[['id_code','year','wave','intercept_date','st','prim1',
                    'prim1_common','prim2','prim2_common','cnty','ffdays12',
                    'ffdays2']]
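
One caution: selecting columns this way may return a view tied to trips_, and later writes can trigger pandas’ SettingWithCopyWarning. A defensive variant (same result, just an explicit copy):

keep_cols = ['id_code', 'year', 'wave', 'intercept_date', 'st', 'prim1',
             'prim1_common', 'prim2', 'prim2_common', 'cnty', 'ffdays12',
             'ffdays2']
trips = trips_[keep_cols].copy()   # .copy() makes trips independent of trips_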

describe() reports summary statistics for the numeric columns:

trips.describe()
year wave intercept_date st cnty ffdays12 ffdays2
count 31340.0 31340.000000 31319 31340.000000 31319.000000 31319.000000 31319.000000
mean 2018.0 3.876324 2018-07-24 23:34:12.364379648 33.676835 127.602574 36.734953 5.647945
min 2018.0 1.000000 2018-01-02 00:00:00 10.000000 1.000000 0.000000 0.000000
25% 2018.0 3.000000 2018-06-07 00:00:00 24.000000 25.000000 1.000000 0.000000
50% 2018.0 4.000000 2018-07-19 00:00:00 37.000000 55.000000 6.000000 1.000000
75% 2018.0 5.000000 2018-09-07 00:00:00 37.000000 55.000000 25.000000 5.000000
max 2018.0 6.000000 2018-12-31 00:00:00 51.000000 810.000000 999.000000 99.000000
std 0.0 1.143137 NaN 11.965540 231.156385 127.058290 13.331434
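
By default describe() covers only numeric (and datetime) columns. To summarize the string columns too, pass include (standard pandas options):

# count, number of unique values, most frequent value, and its frequency
trips.describe(include='object')

# or summarize every column at once
trips.describe(include='all')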

Using Pandas DataFrames#

Referencing columns in data:#

# copy the data frame into t (unlike plain assignment [t = trips], .copy() creates an independent object)
t = trips.copy()

t.ffdays12.head()
0    0.0
1    6.0
2    0.0
3    6.0
4    6.0
Name: ffdays12, dtype: float64
t['ffdays12'].head()
0    0.0
1    6.0
2    0.0
3    6.0
4    6.0
Name: ffdays12, dtype: float64
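
The attribute form (t.ffdays12) only works when the column name is a valid Python identifier and doesn’t collide with a DataFrame method; the bracket form always works. Passing a list of names returns a DataFrame instead of a Series:

# double brackets: a list of column names, returning a DataFrame
t[['ffdays12', 'ffdays2']].head()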

Referencing rows in data:#

We can use numpy-like slicing:

# rows labeled 51 through 55 (.loc slices by label and includes the endpoint)
t.loc[51:55]
id_code year wave intercept_date st prim1 prim1_common prim2 prim2_common cnty ffdays12 ffdays2
51 1109620180120003 2018.0 1.0 2018-01-20 37.0 55.0 3.0 0.0
52 1109620180120004 2018.0 1.0 2018-01-20 37.0 8835440901 RED DRUM 55.0 998.0 98.0
53 1109620180120005 2018.0 1.0 2018-01-20 37.0 8835440901 RED DRUM 55.0 100.0 30.0
54 1109620180120006 2018.0 1.0 2018-01-20 37.0 55.0 50.0 0.0
55 1109620180120007 2018.0 1.0 2018-01-20 37.0 8835440801 BLACK DRUM 55.0 60.0 15.0
# first two rows and first five columns (.iloc slices by position and excludes the endpoint)
t.iloc[0:2, 0:5]
id_code year wave intercept_date st
0 1105920180111001 2018.0 1.0 2018-01-11 37.0
1 1105920180121001 2018.0 1.0 2018-01-21 37.0
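
.loc also accepts labels on both axes, so you can mix a row slice with named columns:

# label-based: rows 0 through 4 (inclusive) for two named columns
t.loc[0:4, ['id_code', 'intercept_date']]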

We can select rows of data based on column values. Let’s select all the rows where st is 51 (Virginia):

t[t.st == 51].head()
id_code year wave intercept_date st prim1 prim1_common prim2 prim2_common cnty ffdays12 ffdays2
3623 1662020180511001 2018.0 3.0 2018-05-11 51.0 1.0 0.0 0.0
3624 1662020180511002 2018.0 3.0 2018-05-11 51.0 1.0 0.0 0.0
3625 1662020180511003 2018.0 3.0 2018-05-11 51.0 1.0 1.0 1.0
3626 1662020180511004 2018.0 3.0 2018-05-11 51.0 8835440801 BLACK DRUM 1.0 6.0 1.0
3627 1662020180511005 2018.0 3.0 2018-05-11 51.0 1.0 6.0 1.0
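
To match several values at once, .isin() is the idiomatic alternative to chaining comparisons:

# rows where st is 37 (North Carolina, by FIPS code) or 51 (Virginia)
t[t.st.isin([37, 51])].head()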

Find all rows where the number of days spent fishing during the past 12 months (ffdays12) exceeds 10:

t[t.ffdays12>10].head()
id_code year wave intercept_date st prim1 prim1_common prim2 prim2_common cnty ffdays12 ffdays2
5 1105920180127001 2018.0 1.0 2018-01-27 37.0 8835440901 RED DRUM 133.0 100.0 2.0
6 1105920180127002 2018.0 1.0 2018-01-27 37.0 133.0 100.0 2.0
7 1105920180127003 2018.0 1.0 2018-01-27 37.0 8857030300 LEFTEYE FLOUNDER GENUS 8835440901 RED DRUM 133.0 30.0 0.0
8 1105920180127004 2018.0 1.0 2018-01-27 37.0 8835440901 RED DRUM 133.0 50.0 5.0
9 1105920180127005 2018.0 1.0 2018-01-27 37.0 133.0 12.0 12.0
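
Conditions can be combined with & (and) and | (or); wrap each condition in parentheses, since these operators bind more tightly than the comparisons:

# Virginia trips with more than 10 days fished in the past 12 months
t[(t.st == 51) & (t.ffdays12 > 10)].head()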

Math on columns:#

# create a new column by combining ffdays12 with numpy-generated gaussian noise:
t['temp'] = t.ffdays12 + .5*np.random.randn(t.shape[0])
t.head()
id_code year wave intercept_date st prim1 prim1_common prim2 prim2_common cnty ffdays12 ffdays2 temp
0 1105920180111001 2018.0 1.0 2018-01-11 37.0 129.0 0.0 0.0 -0.360327
1 1105920180121001 2018.0 1.0 2018-01-21 37.0 129.0 6.0 2.0 7.362805
2 1105920180121002 2018.0 1.0 2018-01-21 37.0 129.0 0.0 0.0 0.269195
3 1105920180121003 2018.0 1.0 2018-01-21 37.0 129.0 6.0 2.0 6.871520
4 1105920180121004 2018.0 1.0 2018-01-21 37.0 129.0 6.0 3.0 5.407346
t.temp.mean()
36.74097889645163
# or we could do this all in one step if we didn't want to create a new column in t:
(t.ffdays12 + .5*np.random.randn(t.shape[0])).mean()
36.73324074328389

Note: You can apply numpy functions directly to pandas columns. Here is the standard deviation of ffdays12:

np.std(t.ffdays12)
127.05626187891463
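
This differs slightly from the std reported by describe() above because np.std defaults to the population formula (ddof=0), while pandas uses the sample formula (ddof=1). To make them agree:

# sample standard deviation, matching pandas' default
np.std(t.ffdays12, ddof=1)

# identical to the pandas method
t.ffdays12.std()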

We can perform matrix/vector operations on pandas data.

Here, we can transpose a slice of the data:

t[['ffdays12','ffdays2']].head(10).T
0 1 2 3 4 5 6 7 8 9
ffdays12 0.0 6.0 0.0 6.0 6.0 100.0 100.0 30.0 50.0 12.0
ffdays2 0.0 2.0 0.0 2.0 3.0 2.0 2.0 0.0 5.0 12.0

We can do matrix multiplication:

np.dot(t[['ffdays12','ffdays2']].head(10).T,t[['ffdays12','ffdays2']].head(10))
array([[23652.,   836.],
       [  836.,   194.]])
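
The same product can be written with pandas objects directly via the @ (matrix-multiply) operator, which preserves the row/column labels:

# labeled 2x2 result: X'X for the first ten rows
X = t[['ffdays12', 'ffdays2']].head(10)
X.T @ X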