OLS in Stata#

Here we show how to implement many of these ideas in Stata. Note, this example uses data from a panel dataset (multiple time periods per individual) and we arbitrarily restrict the analysis to a cross section dataset by analyzing only records where time is 4. We do this for demonstration purposes only since we motivate this model for a pure cross section case. It is almost always a horrible idea to throw away data like this, and later in the course we will learn more about panel data techniques.

Running the OLS Model#

Using ln_wage as our dependent variable, here is the code for running a simple OLS model in Stata. Since we haven’t loaded the data yet, let’s do that and run the model:

webuse set "https://rlhick.people.wm.edu/econ407/data/"
webuse tobias_koop
keep if time==4
regress ln_wage educ pexp pexp2 broken_home
. webuse set "https://rlhick.people.wm.edu/econ407/data/"
(prefix now "https://rlhick.people.wm.edu/econ407/data")

. webuse tobias_koop

. keep if time==4
(16,885 observations deleted)

. regress ln_wage educ pexp pexp2 broken_home

      Source |       SS           df       MS      Number of obs   =     1,034
-------------+----------------------------------   F(4, 1029)      =     51.36
       Model |  37.3778146         4  9.34445366   Prob > F        =    0.0000
    Residual |   187.21445     1,029  .181938241   R-squared       =    0.1664
-------------+----------------------------------   Adj R-squared   =    0.1632
       Total |  224.592265     1,033  .217417488   Root MSE        =    .42654

------------------------------------------------------------------------------
     ln_wage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        educ |   .0852725   .0092897     9.18   0.000     .0670437    .1035014
        pexp |   .2035214   .0235859     8.63   0.000     .1572395    .2498033
       pexp2 |  -.0124126   .0022825    -5.44   0.000    -.0168916   -.0079336
 broken_home |  -.0087254   .0357107    -0.24   0.807    -.0787995    .0613488
       _cons |   .4603326    .137294     3.35   0.001     .1909243    .7297408
------------------------------------------------------------------------------

. 

Robust Standard Errors#

Note that since the \(\mathbf{b}\)’s are unbiased estimators for \(\beta\), those continue to be the ones to report. However, our variance covariance matrix is

\[ VARCOV_{HC1} = \mathbf{(x'x)^{-1}x'Vx(x'x)^{-1}} \]

Defining \(\mathbf{e=y-xb}\),

\[ \mathbf{V} = diag(\mathbf{ee}') \]

In stata, the robust option is a very easy way to implement the robust standard error correction. Note: this is not robust regression.

regress ln_wage educ pexp pexp2 broken_home, robust
Linear regression                               Number of obs     =      1,034
                                                F(4, 1029)        =      64.82
                                                Prob > F          =     0.0000
                                                R-squared         =     0.1664
                                                Root MSE          =     .42654

------------------------------------------------------------------------------
             |               Robust
     ln_wage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        educ |   .0852725   .0099294     8.59   0.000     .0657882    .1047568
        pexp |   .2035214   .0226835     8.97   0.000     .1590101    .2480326
       pexp2 |  -.0124126   .0022447    -5.53   0.000    -.0168173   -.0080079
 broken_home |  -.0087254   .0312381    -0.28   0.780     -.070023    .0525723
       _cons |   .4603326   .1315416     3.50   0.000     .2022121     .718453
------------------------------------------------------------------------------

The bootstrapping approach to robust standard errors#

Given the advent of very fast computation even with laptop computers, we can use boostrapping to correct our standard errrors for any non-iid error process. In Stata, we just run (for 1000 replicates) using the following code. I believe this is doing what is called a parametric bootstrap (confidence intervals assume normality).

regress ln_wage educ pexp pexp2 broken_home, vce(bootstrap,rep(1000))
(running regress on estimation sample)

Bootstrap replications (1,000)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
..................................................    50
..................................................   100
..................................................   150
..................................................   200
..................................................   250
..................................................   300
..................................................   350
..................................................   400
..................................................   450
..................................................   500
..................................................   550
..................................................   600
..................................................   650
..................................................   700
..................................................   750
..................................................   800
..................................................   850
..................................................   900
..................................................   950
.................................................. 1,000

Linear regression                                       Number of obs =  1,034
                                                        Replications  =  1,000
                                                        Wald chi2(4)  = 252.66
                                                        Prob > chi2   = 0.0000
                                                        R-squared     = 0.1664
                                                        Adj R-squared = 0.1632
                                                        Root MSE      = 0.4265

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
     ln_wage | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        educ |   .0852725   .0100503     8.48   0.000     .0655743    .1049707
        pexp |   .2035214    .022688     8.97   0.000     .1590538     .247989
       pexp2 |  -.0124126   .0022409    -5.54   0.000    -.0168047   -.0080205
 broken_home |  -.0087254   .0320888    -0.27   0.786    -.0716183    .0541676
       _cons |   .4603326   .1340845     3.43   0.001     .1975318    .7231333
------------------------------------------------------------------------------

We will return to why this works later in the course.

Testing for Heteroskedasticity#

The two dominant heteroskedasticity tests are the White test and the Cook-Weisberg (both Breusch Pagan tests). The White test is particularly useful if you believe the error variances have a potentially non-linear relationship with model covariates. For stata, we need to make sure the basic OLS regression (rather than robust) is run just prior to these commands and then we can use:

quietly regress ln_wage educ pexp pexp2 broken_home
imtest, white
. quietly regress ln_wage educ pexp pexp2 broken_home

. imtest, white

White's test
H0: Homoskedasticity
Ha: Unrestricted heteroskedasticity

   chi2(12) =  29.20
Prob > chi2 = 0.0037

Cameron & Trivedi's decomposition of IM-test

--------------------------------------------------
              Source |       chi2     df         p
---------------------+----------------------------
  Heteroskedasticity |      29.20     12    0.0037
            Skewness |      14.14      4    0.0068
            Kurtosis |      16.19      1    0.0001
---------------------+----------------------------
               Total |      59.53     17    0.0000
--------------------------------------------------

. 

Stata’s imtest allows you to specify more esoteric forms of heteroskedasticity. A more basic test outlined in Testing for Heteroskedasticity, that assumes heteroskedastity is a linear function of the \(\mathbf{x}\)’s- is the Cook-Weisberg Breusch Pagan test. In Stata, we use

hettest
Breusch–Pagan/Cook–Weisberg test for heteroskedasticity 
Assumption: Normal error terms
Variable: Fitted values of ln_wage

H0: Constant variance

    chi2(1) =  19.57
Prob > chi2 = 0.0000