IV Regression in R
We quickly explore the problem of endogeneity and how to estimate this class of models in R. Recall that the OLS estimator requires the exogeneity assumption \(E[\mathbf{x}'\epsilon]=0\). This code shows how to overcome estimation problems where this assumption fails but where we can identify an instrument, letting us implement instrumental variables regression (IV regression).
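As a reminder (the notation here is ours, summarizing the standard result rather than the notes document): OLS consistency hinges on exogeneity of the regressors, and IV replaces that assumption with an analogous moment condition on the instruments:

```latex
% OLS requires exogeneity of the regressors:
E[\mathbf{x}'\epsilon] = 0
% If instead we have instruments \mathbf{z} satisfying
% E[\mathbf{z}'\epsilon] = 0 that are correlated with \mathbf{x},
% the just-identified IV estimator is
\hat{\beta}_{IV} = (\mathbf{Z}'\mathbf{X})^{-1}\mathbf{Z}'\mathbf{y}
```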
# foreign allows us to open Stata datasets
library(foreign)
# robust (sandwich) covariance estimators
library(sandwich)
# coefficient tests with user-supplied vcov
library(lmtest)
# summary statistics and regression tables
library(stargazer)
# bootstrap resampling
library(boot)
# ivreg for instrumental variables regression
library(AER)
# companion to applied regression (diagnostics)
library(car)
# marginal effects, similar to Stata's margins
library(margins)
We will persist with a cross-sectional version of the Tobias and Koop data, focusing on 1983. Load the data and summarize:
tk.df = read.dta("http://rlhick.people.wm.edu/econ407/data/tobias_koop.dta")
tk4.df = subset(tk.df, time == 4)
stargazer(tk4.df,header=FALSE,title="Tobias and Koop Summary Statistics", type="text")
Tobias and Koop Summary Statistics
=================================================
Statistic N Mean St. Dev. Min Max
-------------------------------------------------
id 1,034 1,090.952 634.892 4 2,177
educ 1,034 12.275 1.567 9 19
ln_wage 1,034 2.138 0.466 0.420 3.590
pexp 1,034 4.815 2.190 0 12
time 1,034 4.000 0.000 4 4
ability 1,034 0.017 0.921 -3.140 1.890
meduc 1,034 11.403 3.027 0 20
feduc 1,034 11.585 3.736 0 20
broken_home 1,034 0.169 0.375 0 1
siblings 1,034 3.200 2.127 0 15
pexp2 1,034 27.980 22.599 0 144
-------------------------------------------------
OLS
If we ignore any potential endogeneity problem, we can estimate OLS as described in the OLS document (ols.Rmd). Running this in R is done this way:
ols.lm = lm(ln_wage ~ pexp + pexp2 + broken_home + educ, data=tk4.df)
summary(ols.lm)
Call:
lm(formula = ln_wage ~ pexp + pexp2 + broken_home + educ, data = tk4.df)
Residuals:
Min 1Q Median 3Q Max
-1.69490 -0.24818 -0.00079 0.25417 1.51139
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.460333 0.137294 3.353 0.000829 ***
pexp 0.203521 0.023586 8.629 < 2e-16 ***
pexp2 -0.012413 0.002283 -5.438 6.73e-08 ***
broken_home -0.008725 0.035711 -0.244 0.807020
educ 0.085273 0.009290 9.179 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4265 on 1029 degrees of freedom
Multiple R-squared: 0.1664, Adjusted R-squared: 0.1632
F-statistic: 51.36 on 4 and 1029 DF, p-value: < 2.2e-16
Examining the elasticity of education in R is more complicated than it is in Stata (in Stata, just run margins, eyex(educ) after your model). We have to use the definition of an elasticity and manually compute it using:
eyex = mean(coef(ols.lm)['educ'] * tk4.df$educ)
cat(sprintf("Elasticity of log(wage) with respect to education from OLS: %1.4f", eyex))
Elasticity of log(wage) with respect to education from OLS: 1.0467
which is appropriate for this case since ln_wage has wages in logs and educ is in levels (not log transformed).
The endogeneity problem
We assume that education is endogenous; that is, it is correlated with the regression errors, which means OLS estimates of \(\beta\) are biased. We hypothesize that the variable feduc (father's education) is a good instrument, having all the properties we describe in detail in the notes document.
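Before turning to estimation, a quick informal relevance check is the raw correlation between the instrument and the endogenous regressor (this is only suggestive; the formal weak-instrument diagnostic is run below):

```r
# Informal relevance check: correlation between the instrument (feduc)
# and the endogenous regressor (educ); assumes tk4.df is loaded as above
cor(tk4.df$feduc, tk4.df$educ)
```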
Estimation in R
The ivreg function from the AER package runs the estimation. In the formula, the structural regressors appear before the | and the full instrument set (the exogenous regressors plus the instrument feduc) appears after it:
ivmodel <- ivreg(ln_wage ~ pexp + pexp2 + broken_home + educ |
pexp + pexp2 + broken_home + feduc, data= tk4.df)
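To see what ivreg is doing, its point estimates can be reproduced by two stages of OLS. This is a sketch for intuition only: the standard errors from the manual second stage are invalid, so use ivreg for inference.

```r
# Manual two-stage least squares (point estimates only; reported SEs are wrong)
# Stage 1: regress the endogenous variable on instrument + exogenous regressors
stage1 <- lm(educ ~ pexp + pexp2 + broken_home + feduc, data = tk4.df)
tk4.df$educ_hat <- fitted(stage1)
# Stage 2: replace educ with its stage-1 fitted values
stage2 <- lm(ln_wage ~ pexp + pexp2 + broken_home + educ_hat, data = tk4.df)
coef(stage2)  # coefficients match ivreg above
```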
These are the unadjusted standard errors, which match the output from the Stata command ivregress 2sls ln_wage pexp pexp2 broken_home (educ=feduc), small:
summary(ivmodel)
Call:
ivreg(formula = ln_wage ~ pexp + pexp2 + broken_home + educ |
pexp + pexp2 + broken_home + feduc, data = tk4.df)
Residuals:
Min 1Q Median 3Q Max
-1.8472 -0.2326 0.0194 0.2541 1.6113
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.406439 0.436664 -0.931 0.352
pexp 0.214752 0.024715 8.689 < 2e-16 ***
pexp2 -0.011745 0.002357 -4.984 7.30e-07 ***
broken_home 0.024471 0.039815 0.615 0.539
educ 0.149503 0.032079 4.661 3.57e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4363 on 1029 degrees of freedom
Multiple R-Squared: 0.1277, Adjusted R-squared: 0.1243
Wald test: 34.38 on 4 and 1029 DF, p-value: < 2.2e-16
Here is the robust version of the model, matching Stata's output from ivregress 2sls ln_wage pexp pexp2 broken_home (educ=feduc), robust:
summary(ivmodel, vcov=sandwich)
Call:
ivreg(formula = ln_wage ~ pexp + pexp2 + broken_home + educ |
pexp + pexp2 + broken_home + feduc, data = tk4.df)
Residuals:
Min 1Q Median 3Q Max
-1.8472 -0.2326 0.0194 0.2541 1.6113
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.406439 0.440450 -0.923 0.356
pexp 0.214752 0.023863 8.999 < 2e-16 ***
pexp2 -0.011745 0.002359 -4.978 7.53e-07 ***
broken_home 0.024471 0.033503 0.730 0.465
educ 0.149503 0.032908 4.543 6.20e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4363 on 1029 degrees of freedom
Multiple R-Squared: 0.1277, Adjusted R-squared: 0.1243
Wald test: 37.63 on 4 and 1029 DF, p-value: < 2.2e-16
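Since sandwich and lmtest are loaded, the same robust coefficient table can also be produced with coeftest, which is convenient when you only want the coefficient block:

```r
# Robust coefficient table, equivalent to summary(ivmodel, vcov=sandwich)
coeftest(ivmodel, vcov = sandwich)
```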
Inference
We have more work to do to determine which model we should report:

1. Test for relevant and strong instruments
2. Test for endogeneity
3. Test for overidentification (not relevant here: with one instrument for one endogenous variable, the model is just identified)

In R, we obtain these diagnostics this way:
summary(ivmodel, vcov=sandwich,diagnostics = TRUE)
Call:
ivreg(formula = ln_wage ~ pexp + pexp2 + broken_home + educ |
pexp + pexp2 + broken_home + feduc, data = tk4.df)
Residuals:
Min 1Q Median 3Q Max
-1.8472 -0.2326 0.0194 0.2541 1.6113
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.406439 0.440450 -0.923 0.356
pexp 0.214752 0.023863 8.999 < 2e-16 ***
pexp2 -0.011745 0.002359 -4.978 7.53e-07 ***
broken_home 0.024471 0.033503 0.730 0.465
educ 0.149503 0.032908 4.543 6.20e-06 ***
Diagnostic tests:
df1 df2 statistic p-value
Weak instruments 1 1029 80.649 <2e-16 ***
Wu-Hausman 1 1028 4.376 0.0367 *
Sargan 0 NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4363 on 1029 degrees of freedom
Multiple R-Squared: 0.1277, Adjusted R-squared: 0.1243
Wald test: 37.63 on 4 and 1029 DF, p-value: < 2.2e-16
These results tell us we have a relevant and strong instrument (the weak-instruments F statistic of 80.6 comfortably exceeds the usual rule-of-thumb threshold of 10) and that education is likely endogenous (the Wu-Hausman test rejects exogeneity at the 5% level). Revisiting our elasticity estimate,
eyex_iv = mean(coef(ivmodel)['educ'] * tk4.df$educ)
cat(sprintf("Elasticity of log(wage) with respect to education from OLS: %1.4f", eyex), "\n")
cat(sprintf("Elasticity of log(wage) with respect to education from IV: %1.4f", eyex_iv))
Elasticity of log(wage) with respect to education from OLS: 1.0467
Elasticity of log(wage) with respect to education from IV: 1.8351
This is a big difference from OLS: assuming our IV model is justified, OLS underestimates the effect of education on wages by quite a lot. Note that we would need to extend this further to compute standard errors or confidence intervals for our elasticities (which Stata's margins command does automatically).
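One route (and a use for the boot package loaded above) is a nonparametric bootstrap of the elasticity. This is a sketch, and the helper name iv.elasticity is our own:

```r
# Bootstrap standard error for the IV elasticity of wages w.r.t. education
# (iv.elasticity is a hypothetical helper, not from the original notes)
iv.elasticity <- function(data, indices) {
  d <- data[indices, ]  # resampled rows
  m <- ivreg(ln_wage ~ pexp + pexp2 + broken_home + educ |
               pexp + pexp2 + broken_home + feduc, data = d)
  mean(coef(m)['educ'] * d$educ)
}
set.seed(123)
b <- boot(tk4.df, iv.elasticity, R = 500)
b                          # bootstrap SE of the elasticity
boot.ci(b, type = "perc")  # percentile confidence interval
```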