Introduction to Pandas#
Pandas works hand-in-hand with other Python libraries (e.g. matplotlib and numpy) to make manipulating data (what the pandas team calls "data munging") easy. With pandas it is easy to:

- Access data using variable names, while keeping the full linear algebra capabilities of numpy
- Group data by values (for example, calculate the mean income by state)
- Plot and summarize values
- Join (combine) different sources of data
- Use powerful time series and panel data capabilities (beyond the scope of this course)
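As a quick taste of the "group data by values" bullet, here is a minimal sketch using a toy DataFrame (the state/income numbers are made up for illustration):

```python
import pandas as pd

# Synthetic example data (hypothetical values, for illustration only)
df = pd.DataFrame({
    "state":  ["NC", "NC", "VA", "VA", "VA"],
    "income": [50000, 60000, 55000, 65000, 60000],
})

# Group rows by state and take the mean income within each group
mean_income = df.groupby("state")["income"].mean()
print(mean_income)
```

The result is a Series indexed by state, with one mean per group.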
Note: If you want to reinforce some of these concepts on your own, I recommend this superb YouTube video (https://www.youtube.com/watch?v=5JnMutdy6Fw) and accompanying coursework (brandon-rhodes/pycon-pandas-tutorial) by Brandon Rhodes.
There are numerous ways to get data into pandas:

- Import Excel and comma-delimited (CSV) data
- Import Stata, SAS, MATLAB, and other datasets
- Import data from a SQL server
- Import data scraped from the web
- Build data manually (by typing in values)

In this tutorial we will focus on the first two methods for reading data, but just know that there is probably a way to get your data into pandas irrespective of its format.
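Each of these routes has a dedicated `pd.read_*` function. A minimal sketch of the pattern (using an in-memory string in place of a file on disk; the file names in the comments are hypothetical):

```python
import io
import pandas as pd

# pd.read_csv accepts a path or any file-like object; here we use an
# in-memory string so the example is self-contained
csv_text = "id,st,ffdays12\n1,37,6\n2,51,0\n"
df = pd.read_csv(io.StringIO(csv_text))

# The same pattern works for other formats, e.g. (file names hypothetical):
#   pd.read_excel("trips.xlsx")     # Excel
#   pd.read_stata("trips.dta")      # Stata
#   pd.read_sas("trips.sas7bdat")   # SAS
print(df.shape)
```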
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sbn
Loading and Cleaning Data#
We will be loading 2018 "Trip" data from the NOAA Fisheries Service Recreational Fisheries Statistics Survey (called MRIP). More detail here: https://www.fisheries.noaa.gov/topic/recreational-fishing-data

The MRIP data needs to be downloaded from my website (you only need to do this once).
##
## Load some data
##
trips_ = pd.read_pickle("../_static/lectures/appendix/mrips_trips.pickle")
trips_.head()
(output: the first 5 rows of the DataFrame; the display truncates most of the 83 columns, showing `add_hrs` through `arx_method` and `time` through `intercept_date`)

5 rows × 83 columns
`trips_` is a very big dataset with lots of columns we'll never use. Let's trim it down:
trips_.columns
Index(['index', 'add_hrs', 'add_ph', 'age', 'alt_flag', 'area', 'area_nc',
'area_x', 'art_reef', 'arx_method', 'asg_code', 'boat_hrs', 'catch',
'celltype', 'cntrbtrs', 'cnty', 'cnty_res', 'coastal', 'compflag',
'county', 'date1', 'dist', 'distkeys', 'f_by_p', 'ffdays12', 'ffdays2',
'first', 'fl_reg', 'fshinsp_a', 'gear', 'gender', 'hrs_dtd', 'hrsf',
'id_code', 'intsite', 'kod', 'leader', 'license', 'mode2001',
'mode_asg', 'mode_f', 'mode_fx', 'monitor', 'month', 'muni_res',
'muni_trp', 'new_list', 'num_fish_a', 'num_typ2', 'num_typ3',
'num_typ4', 'num_typ6', 'num_typ9', 'on_list', 'party', 'prim1',
'prim1_common', 'prim2', 'prim2_common', 'prt_code', 'psu_id',
'pvt_res', 'reefcode', 'reg_res', 'region', 'rig', 'sep_fish', 'st',
'st_res', 'strat_id', 'strat_interval', 'sub_reg', 'telefon', 'time',
'tourn', 'tsn1', 'tsn2', 'turtle', 'var_id', 'wave', 'wp_int', 'year',
'intercept_date'],
dtype='object')
This reduces the columns and creates a new data frame called `trips` from `trips_`:
trips = trips_[['id_code','year','wave','intercept_date','st','prim1',
'prim1_common','prim2','prim2_common','cnty','ffdays12',
'ffdays2']]
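Indexing a DataFrame with a list of column names returns a new DataFrame containing only those columns, in the order given. A small self-contained sketch of the same pattern (the values here are made up; only the column names mirror the tutorial):

```python
import pandas as pd

# Toy stand-in for trips_ (values are synthetic)
trips_demo = pd.DataFrame({
    "id_code": ["a", "b"], "year": [2018, 2018],
    "ffdays12": [6.0, 0.0], "add_hrs": [None, None],
})

# Selecting with a list of names keeps only those columns, in that order
keep = ["id_code", "year", "ffdays12"]
subset = trips_demo[keep]
print(list(subset.columns))
```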
Here are summary statistics for the numeric data columns:
trips.describe()
| | year | wave | intercept_date | st | cnty | ffdays12 | ffdays2 |
|---|---|---|---|---|---|---|---|
| count | 31340.0 | 31340.000000 | 31319 | 31340.000000 | 31319.000000 | 31319.000000 | 31319.000000 |
| mean | 2018.0 | 3.876324 | 2018-07-24 23:34:12.364379648 | 33.676835 | 127.602574 | 36.734953 | 5.647945 |
| min | 2018.0 | 1.000000 | 2018-01-02 00:00:00 | 10.000000 | 1.000000 | 0.000000 | 0.000000 |
| 25% | 2018.0 | 3.000000 | 2018-06-07 00:00:00 | 24.000000 | 25.000000 | 1.000000 | 0.000000 |
| 50% | 2018.0 | 4.000000 | 2018-07-19 00:00:00 | 37.000000 | 55.000000 | 6.000000 | 1.000000 |
| 75% | 2018.0 | 5.000000 | 2018-09-07 00:00:00 | 37.000000 | 55.000000 | 25.000000 | 5.000000 |
| max | 2018.0 | 6.000000 | 2018-12-31 00:00:00 | 51.000000 | 810.000000 | 999.000000 | 99.000000 |
| std | 0.0 | 1.143137 | NaN | 11.965540 | 231.156385 | 127.058290 | 13.331434 |
Using Pandas DataFrames#
Referencing columns in data:#
# copy the data frame into t (note: this is different from assignment [t = trips])
t = trips.copy()
t.ffdays12.head()
0 0.0
1 6.0
2 0.0
3 6.0
4 6.0
Name: ffdays12, dtype: float64
t['ffdays12'].head()
0 0.0
1 6.0
2 0.0
3 6.0
4 6.0
Name: ffdays12, dtype: float64
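The two access styles above are equivalent, but they are not interchangeable in general: attribute access (`t.ffdays12`) only works when the column name is a valid Python identifier that doesn't collide with a DataFrame method, while bracket access works for any name. A small sketch (synthetic data):

```python
import pandas as pd

df = pd.DataFrame({"ffdays12": [0.0, 6.0], "trip length": [1, 2]})

# Attribute access works for simple names like ffdays12...
col_a = df.ffdays12

# ...but a name with a space (or one that shadows a method, like "mean")
# requires bracket access
col_b = df["trip length"]

# Both styles return the same Series for a given column
print(col_a.equals(df["ffdays12"]))
```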
Referencing rows in data:#
We can use numpy-like slicing:
# rows with index labels 51 through 55 (loc is label-based and includes both endpoints)
t.loc[51:55]
| | id_code | year | wave | intercept_date | st | prim1 | prim1_common | prim2 | prim2_common | cnty | ffdays12 | ffdays2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 51 | 1109620180120003 | 2018.0 | 1.0 | 2018-01-20 | 37.0 | | | | | 55.0 | 3.0 | 0.0 |
| 52 | 1109620180120004 | 2018.0 | 1.0 | 2018-01-20 | 37.0 | 8835440901 | RED DRUM | | | 55.0 | 998.0 | 98.0 |
| 53 | 1109620180120005 | 2018.0 | 1.0 | 2018-01-20 | 37.0 | 8835440901 | RED DRUM | | | 55.0 | 100.0 | 30.0 |
| 54 | 1109620180120006 | 2018.0 | 1.0 | 2018-01-20 | 37.0 | | | | | 55.0 | 50.0 | 0.0 |
| 55 | 1109620180120007 | 2018.0 | 1.0 | 2018-01-20 | 37.0 | 8835440801 | BLACK DRUM | | | 55.0 | 60.0 | 15.0 |
# the first two rows (positions 0 and 1) and the first 5 columns (iloc is position-based and excludes the endpoint)
t.iloc[0:2,0:5]
| | id_code | year | wave | intercept_date | st |
|---|---|---|---|---|---|
| 0 | 1105920180111001 | 2018.0 | 1.0 | 2018-01-11 | 37.0 |
| 1 | 1105920180121001 | 2018.0 | 1.0 | 2018-01-21 | 37.0 |
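The `loc`/`iloc` distinction is easy to see when the index labels don't line up with positions. A minimal sketch (synthetic data, labels chosen deliberately):

```python
import pandas as pd

# Index labels (51, 52, 53) intentionally differ from positions (0, 1, 2)
df = pd.DataFrame({"x": [10, 20, 30]}, index=[51, 52, 53])

by_label    = df.loc[51:52]   # label-based: endpoint INCLUDED -> labels 51, 52
by_position = df.iloc[0:2]    # position-based: endpoint EXCLUDED -> rows 0, 1

# Here the two selections happen to coincide
print(by_label.equals(by_position))
```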
We can select rows of data based on column values. Let's select all the rows where `st` is 51 (Virginia):
t[t.st == 51].head()
| | id_code | year | wave | intercept_date | st | prim1 | prim1_common | prim2 | prim2_common | cnty | ffdays12 | ffdays2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3623 | 1662020180511001 | 2018.0 | 3.0 | 2018-05-11 | 51.0 | | | | | 1.0 | 0.0 | 0.0 |
| 3624 | 1662020180511002 | 2018.0 | 3.0 | 2018-05-11 | 51.0 | | | | | 1.0 | 0.0 | 0.0 |
| 3625 | 1662020180511003 | 2018.0 | 3.0 | 2018-05-11 | 51.0 | | | | | 1.0 | 1.0 | 1.0 |
| 3626 | 1662020180511004 | 2018.0 | 3.0 | 2018-05-11 | 51.0 | 8835440801 | BLACK DRUM | | | 1.0 | 6.0 | 1.0 |
| 3627 | 1662020180511005 | 2018.0 | 3.0 | 2018-05-11 | 51.0 | | | | | 1.0 | 6.0 | 1.0 |
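Conditions like this can be combined with `&` (and) or `|` (or); each condition needs its own parentheses because `&` binds tighter than comparison operators. A sketch with synthetic data:

```python
import pandas as pd

# Synthetic stand-in data
df = pd.DataFrame({"st": [51, 51, 37], "ffdays12": [3.0, 20.0, 15.0]})

# Virginia trips by anglers who fished more than 10 days in the past year;
# the parentheses around each condition are required
va_active = df[(df.st == 51) & (df.ffdays12 > 10)]
print(len(va_active))
```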
Find all rows where days spent fishing during past 12 months exceeds 10:
t[t.ffdays12>10].head()
| | id_code | year | wave | intercept_date | st | prim1 | prim1_common | prim2 | prim2_common | cnty | ffdays12 | ffdays2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 1105920180127001 | 2018.0 | 1.0 | 2018-01-27 | 37.0 | 8835440901 | RED DRUM | | | 133.0 | 100.0 | 2.0 |
| 6 | 1105920180127002 | 2018.0 | 1.0 | 2018-01-27 | 37.0 | | | | | 133.0 | 100.0 | 2.0 |
| 7 | 1105920180127003 | 2018.0 | 1.0 | 2018-01-27 | 37.0 | 8857030300 | LEFTEYE FLOUNDER GENUS | 8835440901 | RED DRUM | 133.0 | 30.0 | 0.0 |
| 8 | 1105920180127004 | 2018.0 | 1.0 | 2018-01-27 | 37.0 | 8835440901 | RED DRUM | | | 133.0 | 50.0 | 5.0 |
| 9 | 1105920180127005 | 2018.0 | 1.0 | 2018-01-27 | 37.0 | | | | | 133.0 | 12.0 | 12.0 |
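An equivalent, often more readable way to write such filters is `DataFrame.query`, which takes the condition as a string and refers to columns by name. A sketch with synthetic data:

```python
import pandas as pd

# Synthetic stand-in data
df = pd.DataFrame({"st": [51, 37, 37], "ffdays12": [3.0, 20.0, 5.0]})

# Same filter as df[df.ffdays12 > 10], expressed as a query string
heavy = df.query("ffdays12 > 10")
print(len(heavy))
```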
Math on columns:#
# we can create a new column and use numpy functions:
t['temp'] = t.ffdays12 + .5*np.random.randn(t.shape[0])
t.head()
| | id_code | year | wave | intercept_date | st | prim1 | prim1_common | prim2 | prim2_common | cnty | ffdays12 | ffdays2 | temp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1105920180111001 | 2018.0 | 1.0 | 2018-01-11 | 37.0 | | | | | 129.0 | 0.0 | 0.0 | -0.360327 |
| 1 | 1105920180121001 | 2018.0 | 1.0 | 2018-01-21 | 37.0 | | | | | 129.0 | 6.0 | 2.0 | 7.362805 |
| 2 | 1105920180121002 | 2018.0 | 1.0 | 2018-01-21 | 37.0 | | | | | 129.0 | 0.0 | 0.0 | 0.269195 |
| 3 | 1105920180121003 | 2018.0 | 1.0 | 2018-01-21 | 37.0 | | | | | 129.0 | 6.0 | 2.0 | 6.871520 |
| 4 | 1105920180121004 | 2018.0 | 1.0 | 2018-01-21 | 37.0 | | | | | 129.0 | 6.0 | 3.0 | 5.407346 |
t.temp.mean()
36.74097889645163
# or we could do this all in one step if we didn't want to create a new column in t:
(t.ffdays12 + .5*np.random.randn(t.shape[0])).mean()
36.73324074328389
Note: You can combine pandas with numpy. This is the standard deviation of the column `ffdays12`:
np.std(t.ffdays12)
127.05626187891463
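One caveat worth knowing: `np.std` defaults to the population standard deviation (`ddof=0`), while the pandas `Series.std` method defaults to the sample version (`ddof=1`), so the two can differ slightly (compare the 127.056 above with the 127.058 in the `describe()` output). A small sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

pop      = np.std(s)        # ddof=0: population std, sqrt(1.25)
sample   = s.std()          # ddof=1: sample std, sqrt(5/3)
also_pop = s.std(ddof=0)    # matches np.std(s)

print(pop, sample)
```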
We can perform matrix/vector operations on pandas data.
Here, we can transpose a slice of the data:
t[['ffdays12','ffdays2']].head(10).T
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| ffdays12 | 0.0 | 6.0 | 0.0 | 6.0 | 6.0 | 100.0 | 100.0 | 30.0 | 50.0 | 12.0 |
| ffdays2 | 0.0 | 2.0 | 0.0 | 2.0 | 3.0 | 2.0 | 2.0 | 0.0 | 5.0 | 12.0 |
We can do matrix multiplication:
np.dot(t[['ffdays12','ffdays2']].head(10).T,t[['ffdays12','ffdays2']].head(10))
array([[23652., 836.],
[ 836., 194.]])
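The same product can be written with the `@` operator (or `DataFrame.dot`), which keeps the row/column labels instead of dropping to a bare numpy array. A sketch with synthetic data:

```python
import pandas as pd

# Synthetic stand-in for a two-column slice of the trips data
df = pd.DataFrame({"ffdays12": [0.0, 6.0, 0.0], "ffdays2": [0.0, 2.0, 0.0]})

# X'X via the @ operator; the result is a labeled 2x2 DataFrame
xtx = df.T @ df
print(xtx)
```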