OLS in R

This document shows how to implement the topics we cover in cross-section econometrics in R.

# allows us to open stata datasets
library(foreign)
# for heteroskedasticity corrected variance/covariance matrices
library(sandwich)
# for White Heteroskedasticity Test
library(lmtest)
# for very pretty tables
library(stargazer)
# for bootstrapping
library(boot)

Load data and summarize:

tk.df = read.dta("http://rlhick.people.wm.edu/econ407/data/tobias_koop.dta")
tk4.df = subset(tk.df, time == 4)
stargazer(tk4.df,header=FALSE,title="Tobias and Koop Summary Statistics", type='text')
Tobias and Koop Summary Statistics
=================================================
Statistic     N     Mean    St. Dev.  Min    Max 
-------------------------------------------------
id          1,034 1,090.952 634.892    4    2,177
educ        1,034  12.275    1.567     9     19  
ln_wage     1,034   2.138    0.466   0.420  3.590
pexp        1,034   4.815    2.190     0     12  
time        1,034   4.000    0.000     4      4  
ability     1,034   0.017    0.921   -3.140 1.890
meduc       1,034  11.403    3.027     0     20  
feduc       1,034  11.585    3.736     0     20  
broken_home 1,034   0.169    0.375     0      1  
siblings    1,034   3.200    2.127     0     15  
pexp2       1,034  27.980    22.599    0     144 
-------------------------------------------------

Focusing on our dependent variable:

hist(tk4.df$ln_wage,main = "Log Wage")
[Figure: histogram of ln_wage, titled "Log Wage"]

The OLS Model

In R we estimate the model using lm:

ols.lm = lm(ln_wage~educ + pexp + pexp2 + broken_home, data=tk4.df)
summary(ols.lm)
Call:
lm(formula = ln_wage ~ educ + pexp + pexp2 + broken_home, data = tk4.df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.69490 -0.24818 -0.00079  0.25417  1.51139 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.460333   0.137294   3.353 0.000829 ***
educ         0.085273   0.009290   9.179  < 2e-16 ***
pexp         0.203521   0.023586   8.629  < 2e-16 ***
pexp2       -0.012413   0.002283  -5.438 6.73e-08 ***
broken_home -0.008725   0.035711  -0.244 0.807020    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4265 on 1029 degrees of freedom
Multiple R-squared:  0.1664,	Adjusted R-squared:  0.1632 
F-statistic: 51.36 on 4 and 1029 DF,  p-value: < 2.2e-16
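
As a quick check on lm, we can also compute \(\mathbf{b=(x'x)^{-1}x'y}\) by hand. A minimal sketch, using model.matrix to recover the design matrix (including the intercept):

X = model.matrix(ols.lm)           # design matrix, including the intercept
y = tk4.df$ln_wage                 # dependent variable
b = solve(t(X) %*% X, t(X) %*% y)  # b = (x'x)^{-1} x'y
print(b)                           # matches coef(ols.lm)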

Robust Standard Errors

Note that since the \(\mathbf{b}\)'s are unbiased estimators for \(\beta\), those continue to be the ones to report. However, our variance-covariance matrix is

\[ VARCOV_{HC1} = \mathbf{(x'x)^{-1}\hat{V}(x'x)^{-1}} \]

Defining \(\mathbf{e=y-xb}\),

\[ \hat{\mathbf{V}} = \mathbf{x}'\, diag(\mathbf{ee}')\, \mathbf{x} \]

Implementing this in R, we can use the sandwich package's vcovHC function with the HC1 option (Stata's default; HC1 additionally scales the estimator by \(N/(N-k)\) as a small-sample correction):

VARCOV = vcovHC(ols.lm, 'HC1')
se = sqrt(diag(VARCOV))
print(se)
(Intercept)        educ        pexp       pexp2 broken_home 
0.131541623 0.009929445 0.022683530 0.002244678 0.031238081 
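
To see what vcovHC is doing, here is a minimal by-hand sketch of the HC1 estimator built from the formulas above; it should reproduce the standard errors just printed:

X = model.matrix(ols.lm)                     # design matrix
e = residuals(ols.lm)                        # OLS residuals
n = nrow(X); k = ncol(X)
bread = solve(t(X) %*% X)                    # (x'x)^{-1}
meat = t(X) %*% diag(e^2) %*% X              # x' diag(ee') x
VARCOV.manual = (n/(n - k)) * bread %*% meat %*% bread  # HC1 scales by N/(N-k)
print(sqrt(diag(VARCOV.manual)))             # robust standard errors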

Note that the F-statistic reported by OLS is inappropriate in the presence of heteroskedasticity; we need a different form of the F-test. Stata does this automatically, but in R we need to use waldtest to obtain the correct model F-statistic:

waldtest(ols.lm, vcov = vcovHC(ols.lm,'HC1'))
  Res.Df Df        F       Pr(>F)
1   1029
2   1033 -4 64.81968 6.411661e-49

We can put this together to make a more useful summary output:

# copy summary object 
robust.summ = summary(ols.lm)
# overwrite all coefficients with robust counterparts
robust.summ$coefficients = unclass(coeftest(ols.lm, vcov. = VARCOV))
# overwrite f stat with correct robust value
f_robust = waldtest(ols.lm, vcov = VARCOV)
robust.summ$fstatistic[1] = f_robust$F[2]
print(robust.summ)
Call:
lm(formula = ln_wage ~ educ + pexp + pexp2 + broken_home, data = tk4.df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.69490 -0.24818 -0.00079  0.25417  1.51139 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.460333   0.131542   3.500 0.000486 ***
educ         0.085273   0.009929   8.588  < 2e-16 ***
pexp         0.203521   0.022684   8.972  < 2e-16 ***
pexp2       -0.012413   0.002245  -5.530 4.06e-08 ***
broken_home -0.008725   0.031238  -0.279 0.780057    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4265 on 1029 degrees of freedom
Multiple R-squared:  0.1664,	Adjusted R-squared:  0.1632 
F-statistic: 64.82 on 4 and 1029 DF,  p-value: < 2.2e-16
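
We can also report the robust standard errors in a publication-style table. A minimal sketch using stargazer's se argument and the robust se vector computed above:

stargazer(ols.lm, se=list(se), type='text', header=FALSE,
          title="OLS with HC1 Robust Standard Errors")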

The Bootstrapping Approach

In what follows, I show you how to use R to bootstrap regression coefficients, which gives confidence intervals that are robust to any heteroskedasticity problems we might have.

bs_ols <- function(formula, data, indices) {
  d <- data[indices,] # allows boot to select sample 
  fit <- lm(formula, data=d)
  return(coef(fit)) 
} 

results <- boot(data=tk4.df, statistic=bs_ols,
                R=1000, formula=ln_wage~educ + pexp + pexp2 + broken_home)

boot.ci(results,conf=0.95,type='perc',index=1)
boot.ci(results,conf=0.95,type='perc',index=2)
boot.ci(results,conf=0.95,type='perc',index=3)
boot.ci(results,conf=0.95,type='perc',index=4)
boot.ci(results,conf=0.95,type='perc',index=5)

colnames(results$t) = c("Intercept","educ","pexp","pexp2","broken_home")
summary(results$t)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = results, conf = 0.95, type = "perc", index = 1)

Intervals : 
Level     Percentile     
95%   ( 0.2106,  0.7247 )  
Calculations and Intervals on Original Scale
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = results, conf = 0.95, type = "perc", index = 2)

Intervals : 
Level     Percentile     
95%   ( 0.0645,  0.1035 )  
Calculations and Intervals on Original Scale
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = results, conf = 0.95, type = "perc", index = 3)

Intervals : 
Level     Percentile     
95%   ( 0.1593,  0.2475 )  
Calculations and Intervals on Original Scale
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = results, conf = 0.95, type = "perc", index = 4)

Intervals : 
Level     Percentile     
95%   (-0.0168, -0.0080 )  
Calculations and Intervals on Original Scale
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = results, conf = 0.95, type = "perc", index = 5)

Intervals : 
Level     Percentile     
95%   (-0.0695,  0.0489 )  
Calculations and Intervals on Original Scale
   Intercept            educ              pexp            pexp2         
 Min.   :0.04771   Min.   :0.05564   Min.   :0.1316   Min.   :-0.01988  
 1st Qu.:0.37305   1st Qu.:0.07856   1st Qu.:0.1877   1st Qu.:-0.01401  
 Median :0.46360   Median :0.08527   Median :0.2035   Median :-0.01236  
 Mean   :0.46230   Mean   :0.08513   Mean   :0.2035   Mean   :-0.01241  
 3rd Qu.:0.55261   3rd Qu.:0.09202   3rd Qu.:0.2196   3rd Qu.:-0.01081  
 Max.   :0.84488   Max.   :0.12228   Max.   :0.2846   Max.   :-0.00548  
  broken_home       
 Min.   :-0.107363  
 1st Qu.:-0.029781  
 Median :-0.009216  
 Mean   :-0.009060  
 3rd Qu.: 0.011689  
 Max.   : 0.083809  
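
Because results$t holds the full replicate matrix (one column per coefficient), we can also compute bootstrap standard errors and all of the percentile intervals in one step. A minimal sketch (call set.seed() before boot() if you want reproducible replicates):

# bootstrap standard errors: std. dev. of each coefficient across replicates
apply(results$t, 2, sd)
# 95% percentile intervals for every coefficient at once
t(apply(results$t, 2, quantile, probs=c(0.025, 0.975)))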

Testing for Heteroskedasticity

The two dominant heteroskedasticity tests are the White test and the Cook-Weisberg test (both are variants of the Breusch-Pagan test). The White test is particularly useful if you believe the error variances have a potentially non-linear relationship with the model covariates.

In R we perform the White test with:

bptest(ols.lm,ln_wage ~ educ*pexp + educ*pexp2 + educ*broken_home + 
                        pexp*pexp2 + pexp*broken_home + pexp2*broken_home + 
                        I(educ^2) + I(pexp^2) + I(pexp2^2) + I(broken_home^2),
       data=tk4.df)
	studentized Breusch-Pagan test

data:  ols.lm
BP = 29.196, df = 12, p-value = 0.003685

Notice that in R we need to specify the form of the heteroskedasticity. To mirror the White test, we include interaction and second-order terms. Both Stata's imtest and R's bptest allow you to specify more esoteric forms of heteroskedasticity, although in R this is more apparent since we must supply the form of the heteroskedasticity for the test to run.
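
For comparison, calling bptest on the fitted model alone runs the studentized Breusch-Pagan test with the error variances assumed linear in the model's own regressors:

# studentized Breusch-Pagan test, linear in the original regressors
bptest(ols.lm)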

A more basic test, one assuming that heteroskedasticity is a linear function of the \(\mathbf{x}\)'s, is the Cook-Weisberg Breusch-Pagan test. R doesn't implement this test directly, but it can be performed manually following the steps we discussed in class (a sketch in code follows the list):

  1. Recover residuals as above, denote as \(\mathbf{e}\)

  2. Scale the squared residuals by the mean squared residual: \( \mathbf{r} = \frac{diag(\mathbf{ee}')}{\mathbf{e'e}/N} \)

  3. Run the regression \(\mathbf{r} = \gamma_0 + \gamma_1 \mathbf{xb} + \mathbf{v}\). Recall that \(\mathbf{xb} = \mathbf{\hat{y}}\).

  4. Recover ESS, defined as \( ESS = (\mathbf{\hat{r}} - \bar{r})'(\mathbf{\hat{r}} - \bar{r}) \)

  5. ESS/2 is the test statistic distributed \(\chi^2_{(1)}\)
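
Putting these steps together, a minimal sketch of the manual Cook-Weisberg test in R (reusing ols.lm from above):

# step 1: recover the residuals
e = residuals(ols.lm)
N = length(e)
# step 2: scale the squared residuals by the mean squared residual (e'e/N)
r = e^2 / (sum(e^2)/N)
# step 3: regress r on the fitted values (xb = yhat)
aux.lm = lm(r ~ fitted(ols.lm))
# step 4: explained sum of squares of the auxiliary regression
ESS = sum((fitted(aux.lm) - mean(r))^2)
# step 5: ESS/2 is distributed chi-square(1) under homoskedasticity
ESS/2
pchisq(ESS/2, df=1, lower.tail=FALSE)  # p-value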