Panel Data Estimation in R
Contents
Panel Data Estimation in R#
This document provides a brief description of how to implement panel data models in both R and Stata.
# allows us to open stata datasets
library(foreign)
library(lmtest)
library(stargazer)
library(plm)
We will persist with Tobias and Koop but this time will use the entire dataset. Load data and summarize:
tk.df = read.dta("http://rlhick.people.wm.edu/econ407/data/tobias_koop.dta")
stargazer(tk.df,header=FALSE,title="Tobias and Koop Summary Statistics",type="text")
Tobias and Koop Summary Statistics
==================================================
Statistic N Mean St. Dev. Min Max
--------------------------------------------------
id 17,919 1,081.852 629.846 1 2,178
educ 17,919 12.676 1.922 9 20
ln_wage 17,919 2.297 0.528 0.070 4.570
pexp 17,919 8.363 4.128 0 22
time 17,919 8.197 3.956 0 14
ability 17,919 0.052 0.926 -4.040 2.010
meduc 17,919 11.472 2.989 0 20
feduc 17,919 11.709 3.767 0 20
broken_home 17,919 0.154 0.361 0 1
siblings 17,919 3.156 2.121 0 18
pexp2 17,919 86.970 75.263 0 484
--------------------------------------------------
We will be estimating the returns to education equation and will ignore any endogeneity issues for the purposes of illustration.
Pooled OLS#
In R pooled OLS, this is simply an OLS model and is easily run using
wage.ols = lm(ln_wage~educ+pexp+pexp2+broken_home, data=tk.df)
summary(wage.ols)
Call:
lm(formula = ln_wage ~ educ + pexp + pexp2 + broken_home, data = tk.df)
Residuals:
Min 1Q Median 3Q Max
-2.53137 -0.27441 0.02012 0.31121 2.11371
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.6822885 0.0286842 23.786 < 2e-16 ***
educ 0.0890684 0.0019390 45.934 < 2e-16 ***
pexp 0.0907662 0.0034375 26.405 < 2e-16 ***
pexp2 -0.0030460 0.0001895 -16.071 < 2e-16 ***
broken_home -0.0561547 0.0100131 -5.608 2.08e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4808 on 17914 degrees of freedom
Multiple R-squared: 0.1717, Adjusted R-squared: 0.1715
F-statistic: 928.6 on 4 and 17914 DF, p-value: < 2.2e-16
or we could do it using the plm package after setting up our panel dataset (tk.panel.df
):
tk.panel.df = plm.data(tk.df, index = c("id","time"))
wage.pool = plm(ln_wage~educ+pexp+pexp2+broken_home, data=tk.panel.df,model="pooling")
summary(wage.pool)
Warning message:
“use of 'plm.data' is discouraged, better use 'pdata.frame' instead”
Pooling Model
Call:
plm(formula = ln_wage ~ educ + pexp + pexp2 + broken_home, data = tk.panel.df,
model = "pooling")
Unbalanced Panel: n = 2178, T = 1-15, N = 17919
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-2.531374 -0.274411 0.020121 0.311211 2.113707
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
(Intercept) 0.68228847 0.02868415 23.7863 < 2.2e-16 ***
educ 0.08906836 0.00193904 45.9343 < 2.2e-16 ***
pexp 0.09076619 0.00343747 26.4050 < 2.2e-16 ***
pexp2 -0.00304601 0.00018953 -16.0715 < 2.2e-16 ***
broken_home -0.05615469 0.01001310 -5.6081 2.076e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 4999.7
Residual Sum of Squares: 4141.1
R-Squared: 0.17173
Adj. R-Squared: 0.17155
F-statistic: 928.576 on 4 and 17914 DF, p-value: < 2.22e-16
Random Effects#
In R we again use the plm
package (note that this implements something closest to the Swamy-Aurora version of the random effects model, which in Stata is xtreg ln_wage educ pexp pexp2 broken_home , sa th
):
wage.re = plm(ln_wage~educ+pexp+pexp2+broken_home, data=tk.panel.df,model="random")
summary(wage.re)
Oneway (individual) effect Random Effect Model
(Swamy-Arora's transformation)
Call:
plm(formula = ln_wage ~ educ + pexp + pexp2 + broken_home, data = tk.panel.df,
model = "random")
Unbalanced Panel: n = 2178, T = 1-15, N = 17919
Effects:
var std.dev share
idiosyncratic 0.1080 0.3287 0.469
individual 0.1223 0.3498 0.531
theta:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.3152 0.6847 0.7151 0.7007 0.7478 0.7642
Residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.36051 -0.15586 0.03123 0.00648 0.19569 2.21809
Coefficients:
Estimate Std. Error z-value Pr(>|z|)
(Intercept) 0.57764593 0.04297643 13.4410 < 2.2e-16 ***
educ 0.09053539 0.00333714 27.1296 < 2.2e-16 ***
pexp 0.10252019 0.00261397 39.2202 < 2.2e-16 ***
pexp2 -0.00362720 0.00014322 -25.3263 < 2.2e-16 ***
broken_home -0.06411881 0.02178336 -2.9435 0.003245 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 2693.3
Residual Sum of Squares: 1960.1
R-Squared: 0.27508
Adj. R-Squared: 0.27492
Chisq: 4998.17 on 4 DF, p-value: < 2.22e-16
Fixed Effects#
In R we again use the plm
package (and R plm
surpresses the model constant in a way more in line with our class notes). R also automatically drops covariates not varying within a cross sectional unit (although it doesn’t tell you explicitly the way Stata does).
wage.fe = plm(ln_wage~educ+pexp+pexp2+broken_home, data=tk.panel.df, model="within")
summary(wage.fe)
Oneway (individual) effect Within Model
Call:
plm(formula = ln_wage ~ educ + pexp + pexp2 + broken_home, data = tk.panel.df,
model = "within")
Unbalanced Panel: n = 2178, T = 1-15, N = 17919
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-2.452999 -0.145082 0.010925 0.167903 2.543757
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
educ 0.07619867 0.00594136 12.825 < 2.2e-16 ***
pexp 0.10703306 0.00277281 38.601 < 2.2e-16 ***
pexp2 -0.00381882 0.00014899 -25.632 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 2204
Residual Sum of Squares: 1700.2
R-Squared: 0.22862
Adj. R-Squared: 0.12176
F-statistic: 1554.76 on 3 and 15738 DF, p-value: < 2.22e-16
Hypothesis Testing and Model Comparison#
We have various tests to run:
OLS (Pooled) versus Random Effects#
In R we can run the Breusch-Pagan Lagrange Multiplier test using,
plmtest(wage.pool,type='bp')
Lagrange Multiplier Test - (Breusch-Pagan)
data: ln_wage ~ educ + pexp + pexp2 + broken_home
chisq = 19339, df = 1, p-value < 2.2e-16
alternative hypothesis: significant effects
OLS (Pooled) versus Fixed Effects#
In R, use this (note there is a slight difference in the F statistic (and degrees of freedom) due to Stata using a model constant):
pFtest(wage.fe, wage.pool)
F test for individual effects
data: ln_wage ~ educ + pexp + pexp2 + broken_home
F = 10.384, df1 = 2176, df2 = 15738, p-value < 2.2e-16
alternative hypothesis: significant effects
Random Effects versus Fixed Effects#
In R, we use the Hausman test:
phtest(wage.fe, wage.re)
Hausman Test
data: ln_wage ~ educ + pexp + pexp2 + broken_home
chisq = 49.388, df = 3, p-value = 1.079e-10
alternative hypothesis: one model is inconsistent
Endogeneity#
IV regression on panel data (where x4
is endogenous and instrumented by z1
) can be performed by
summary(plm(y~x1+x2+x3+x4 | . + x1+x2+x3+z1, data=panel.df,model="within"))
Note: Random Effects with instruments seemingly doesn’t work in R (doesn’t come close to matching stata’s xtivreg command). Fixed effects (shown above), not verified.
Robust Standard Errors#
In R and noting that only the standard errors, t-stats, and p values change, we have very similar results. Note the vcovHC
and coeftest
functions are being drawn from the plm
package and not sandwich
:
coeftest(wage.fe,vcov=vcovHC(wage.fe,type="sss"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
educ 0.07619867 0.00991101 7.6883 1.579e-14 ***
pexp 0.10703306 0.00421265 25.4076 < 2.2e-16 ***
pexp2 -0.00381882 0.00021928 -17.4155 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1