IV Regression in R#

We quickly explore the problem of endogeneity and how to estimate this class of models in R. Recall that the OLS estimator requires

\[ E(\mathbf{x'\epsilon}) = 0 \]

This document shows how to overcome estimation problems when this assumption fails but we can identify an instrument, allowing us to implement instrumental variables regression (IV Regression).
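
Recall that when we can find instruments \(\mathbf{z}\) satisfying \(E(\mathbf{z'\epsilon}) = 0\) that are also correlated with the endogenous regressors, the just-identified IV estimator is

\[ \mathbf{b_{IV}} = \mathbf{(Z'X)^{-1}Z'y} \]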

# allows us to open Stata datasets
library(foreign)
# heteroskedasticity-robust (sandwich) covariance estimators
library(sandwich)
# coefficient tests using alternative covariance matrices
library(lmtest)
# formatted summary statistics and regression tables
library(stargazer)
# bootstrap resampling
library(boot)
# provides ivreg for instrumental variables / 2SLS estimation
library(AER)
# regression diagnostics and hypothesis tests
library(car)
# marginal effects, similar to Stata's margins command
library(margins)

We will continue with a cross-sectional version of the Tobias and Koop data that focuses on 1983. Load the data and summarize:

tk.df = read.dta("http://rlhick.people.wm.edu/econ407/data/tobias_koop.dta")
tk4.df = subset(tk.df, time == 4)
stargazer(tk4.df,header=FALSE,title="Tobias and Koop Summary Statistics", type="text")
Tobias and Koop Summary Statistics
=================================================
Statistic     N     Mean    St. Dev.  Min    Max 
-------------------------------------------------
id          1,034 1,090.952 634.892    4    2,177
educ        1,034  12.275    1.567     9     19  
ln_wage     1,034   2.138    0.466   0.420  3.590
pexp        1,034   4.815    2.190     0     12  
time        1,034   4.000    0.000     4      4  
ability     1,034   0.017    0.921   -3.140 1.890
meduc       1,034  11.403    3.027     0     20  
feduc       1,034  11.585    3.736     0     20  
broken_home 1,034   0.169    0.375     0      1  
siblings    1,034   3.200    2.127     0     15  
pexp2       1,034  27.980    22.599    0     144 
-------------------------------------------------

OLS#

If we ignore any potential endogeneity problem, we can estimate the model by OLS as described in the OLS document (ols.Rmd). Running this in R:

ols.lm = lm(ln_wage ~ pexp + pexp2 + broken_home + educ, data=tk4.df)
summary(ols.lm)
Call:
lm(formula = ln_wage ~ pexp + pexp2 + broken_home + educ, data = tk4.df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.69490 -0.24818 -0.00079  0.25417  1.51139 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.460333   0.137294   3.353 0.000829 ***
pexp         0.203521   0.023586   8.629  < 2e-16 ***
pexp2       -0.012413   0.002283  -5.438 6.73e-08 ***
broken_home -0.008725   0.035711  -0.244 0.807020    
educ         0.085273   0.009290   9.179  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4265 on 1029 degrees of freedom
Multiple R-squared:  0.1664,	Adjusted R-squared:  0.1632 
F-statistic: 51.36 on 4 and 1029 DF,  p-value: < 2.2e-16

Examining the elasticity of education in R is more complicated than it is in Stata (where you can just run margins, eyex(educ) after your model). We have to use the definition of an elasticity and compute it manually:

\[ E_{educ} = \sum_{i=1}^N \frac{educ_i b_{educ}}{N} \]

eyex = mean(coef(ols.lm)['educ'] * tk4.df$educ)
cat(sprintf("Elasticity of log(wage) with respect to education from OLS: %1.4f", eyex))
Elasticity of log(wage) with respect to education from OLS: 1.0467

which is appropriate for this case since ln_wage measures wages in logs while educ is in levels (not log transformed).
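
To see why, note that for this log-level specification the elasticity for observation \(i\) is

\[ \frac{\partial \ln(wage_i)}{\partial \ln(educ_i)} = \frac{\partial \ln(wage_i)}{\partial educ_i} \, educ_i = b_{educ} \, educ_i \]

which we average over the sample to obtain \(E_{educ}\).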

The endogeneity problem#

We assume that education is endogenous; that is, it is correlated with the regression errors, so OLS estimates of \(\beta\) are biased. We hypothesize that the variable feduc (father's education) is a good instrument, having all the properties we describe in detail in the notes document.
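
Before estimating the full model, a quick informal check of instrument relevance is to run the first stage directly and inspect the coefficient on feduc; the formal weak-instruments diagnostic appears below. A minimal sketch (the object name first.lm is ours):

first.lm = lm(educ ~ pexp + pexp2 + broken_home + feduc, data=tk4.df)
# look for a large t statistic on feduc
summary(first.lm)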

Estimation in R#

This runs the estimation in R. In ivreg, the formula has two parts separated by |: the structural equation on the left, and the full set of exogenous variables and instruments on the right.

ivmodel <- ivreg(ln_wage ~ pexp + pexp2 + broken_home + educ |
   pexp + pexp2 + broken_home + feduc, data= tk4.df)

These are the unadjusted standard errors, matching the output from the Stata command ivregress 2sls ln_wage pexp pexp2 broken_home (educ=feduc), small:

summary(ivmodel)
Call:
ivreg(formula = ln_wage ~ pexp + pexp2 + broken_home + educ | 
    pexp + pexp2 + broken_home + feduc, data = tk4.df)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8472 -0.2326  0.0194  0.2541  1.6113 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.406439   0.436664  -0.931    0.352    
pexp         0.214752   0.024715   8.689  < 2e-16 ***
pexp2       -0.011745   0.002357  -4.984 7.30e-07 ***
broken_home  0.024471   0.039815   0.615    0.539    
educ         0.149503   0.032079   4.661 3.57e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4363 on 1029 degrees of freedom
Multiple R-Squared: 0.1277,	Adjusted R-squared: 0.1243 
Wald test: 34.38 on 4 and 1029 DF,  p-value: < 2.2e-16 
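
To build intuition for what ivreg computes, here is a minimal manual 2SLS sketch reusing first.lm from the first-stage check above (educ_hat is our own variable name). The point estimates match ivreg, but the standard errors from this manual second stage are not correct, which is why we use ivreg in practice:

# fitted values of educ from the first stage
tk4.df$educ_hat = fitted(first.lm)
# second stage: OLS with the fitted values in place of educ
stage2.lm = lm(ln_wage ~ pexp + pexp2 + broken_home + educ_hat, data=tk4.df)
coef(stage2.lm)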

Here is the robust version of the model, matching Stata's ivregress output from ivregress 2sls ln_wage pexp pexp2 broken_home (educ=feduc), robust:

summary(ivmodel, vcov=sandwich)
Call:
ivreg(formula = ln_wage ~ pexp + pexp2 + broken_home + educ | 
    pexp + pexp2 + broken_home + feduc, data = tk4.df)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8472 -0.2326  0.0194  0.2541  1.6113 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.406439   0.440450  -0.923    0.356    
pexp         0.214752   0.023863   8.999  < 2e-16 ***
pexp2       -0.011745   0.002359  -4.978 7.53e-07 ***
broken_home  0.024471   0.033503   0.730    0.465    
educ         0.149503   0.032908   4.543 6.20e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4363 on 1029 degrees of freedom
Multiple R-Squared: 0.1277,	Adjusted R-squared: 0.1243 
Wald test: 37.63 on 4 and 1029 DF,  p-value: < 2.2e-16 
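
Since lmtest is already loaded, an equivalent way to display just the robust coefficient table is coeftest:

coeftest(ivmodel, vcov=sandwich)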

Inference#

We have more work to do to determine what model we should report:

  1. Test for relevant and strong instruments

  2. Test for endogeneity

  3. Test for overidentification (not relevant for this example: with one instrument for one endogenous variable, the model is just identified)

In R, we do it this way:

summary(ivmodel, vcov=sandwich,diagnostics = TRUE)
Call:
ivreg(formula = ln_wage ~ pexp + pexp2 + broken_home + educ | 
    pexp + pexp2 + broken_home + feduc, data = tk4.df)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8472 -0.2326  0.0194  0.2541  1.6113 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.406439   0.440450  -0.923    0.356    
pexp         0.214752   0.023863   8.999  < 2e-16 ***
pexp2       -0.011745   0.002359  -4.978 7.53e-07 ***
broken_home  0.024471   0.033503   0.730    0.465    
educ         0.149503   0.032908   4.543 6.20e-06 ***

Diagnostic tests:
                  df1  df2 statistic p-value    
Weak instruments    1 1029    80.649  <2e-16 ***
Wu-Hausman          1 1028     4.376  0.0367 *  
Sargan              0   NA        NA      NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4363 on 1029 degrees of freedom
Multiple R-Squared: 0.1277,	Adjusted R-squared: 0.1243 
Wald test: 37.63 on 4 and 1029 DF,  p-value: < 2.2e-16 

These results tell us we have relevant and strong instruments (the weak-instruments F statistic of 80.6 is well above the common rule-of-thumb of 10) and that education is likely endogenous (the Wu-Hausman test rejects exogeneity at the 5% level). Revisiting our elasticity estimate,

eyex_iv = mean(coef(ivmodel)['educ'] * tk4.df$educ)
cat(sprintf("Elasticity of log(wage) with respect to education from OLS: %1.4f", eyex), "\n")
cat(sprintf("Elasticity of log(wage) with respect to education from IV: %1.4f", eyex_iv))
Elasticity of log(wage) with respect to education from OLS: 1.0467 
Elasticity of log(wage) with respect to education from IV: 1.8351

This is a big difference from OLS. Assuming our IV model is justified, an OLS model underestimates the effect of education on wages by quite a lot. Note that we would need to extend this further to compute standard errors or confidence intervals for our elasticities (which Stata's margins command does automatically).
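
For example, since the boot library is loaded, we could bootstrap the IV elasticity. A minimal sketch (the function name boot.eyex is ours):

# statistic: re-estimate the IV model on a resample and return the
# average elasticity of log(wage) with respect to education
boot.eyex <- function(data, indices) {
  d <- data[indices, ]
  fit <- ivreg(ln_wage ~ pexp + pexp2 + broken_home + educ |
               pexp + pexp2 + broken_home + feduc, data = d)
  mean(coef(fit)['educ'] * d$educ)
}
set.seed(42)  # for reproducibility
boot.out <- boot(data = tk4.df, statistic = boot.eyex, R = 1000)
# percentile confidence interval for the elasticity
boot.ci(boot.out, type = "perc")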