import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pymc as pm
import numpy as np

Ordinary least squares using PyMC#

Recall the ordinary least squares model you might be familiar with from a previous statistics or econometrics class. We are interested in estimating the parameter vector \(\beta\) from the following population regression function:

\[ \mathbf{y = x \beta + \epsilon} \]

where \(\mathbf{y}\) and \(\mathbf{\epsilon}\) are \(N \times 1\) vectors of dependent (outcome) and unobservable random variables, respectively. The \(N \times K\) matrix \(\mathbf{x}\) contains the independent variables (sometimes referred to as covariates), which we treat as non-random for this notebook. The least squares approach yields the “ols” estimator

\[ \mathbf{b} = \left( \mathbf{x'x} \right)^{-1} \mathbf{x'y} \]

The frequentist approach performs inference based on the variance-covariance matrix of \(\mathbf{b}\), which is estimated as

\[ cov(\mathbf{b}) = \hat{\sigma}^2 \left( \mathbf{x'x} \right)^{-1} \]

where \(\hat{\sigma}^2 = \frac{(\mathbf{y-xb})'(\mathbf{y-xb})}{N-K}\)

The square roots of the diagonal elements of this matrix are the standard errors of the estimated coefficients, as reported by programs like Stata. These standard errors are used for inference.
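To make these formulas concrete, here is a minimal NumPy sketch (using simulated x and y that are not part of the Tobias and Koop data used later) computing \(\mathbf{b}\), \(cov(\mathbf{b})\), and the standard errors directly:

rng = np.random.default_rng(0)
N, K = 500, 3
x = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])   # design matrix with a constant
beta_true = np.array([1.0, 0.5, -0.25])                          # made-up coefficients
y = x @ beta_true + rng.normal(scale=0.5, size=N)

b = np.linalg.solve(x.T @ x, x.T @ y)        # b = (x'x)^{-1} x'y
resid = y - x @ b
sigma2_hat = resid @ resid / (N - K)         # estimated error variance
cov_b = sigma2_hat * np.linalg.inv(x.T @ x)  # cov(b)
se_b = np.sqrt(np.diag(cov_b))               # standard errors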

The Bayesian approach to OLS is different. Consider the model

\[\begin{split} \begin{align} \sigma &\sim Uniform(a=0.1, b=10) \nonumber \\ \mathbf{\beta}_k &\sim N(\mu_b = 0, \sigma_b = 10) \text{ } \forall k \in \{1, \ldots, K\} \nonumber\\ \mathbf{y} &\sim N(\mathbf{x\beta}, \sigma^2 I) \nonumber \end{align} \end{split}\]

We can set up the likelihood and priors needed to learn about the posterior. Assembling the pieces,

  1. Likelihood:

    \[ prob(\mathbf{y}| \mathbf{x}, \mathbf{\beta}, \sigma) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y_i-\mathbf{x}_i \mathbf{\beta})^2}{2\sigma^2}} \]
  2. Prior on \(\sigma\):

    \[ prob(\sigma | a=0.1, b=10) = \frac{1}{(10 - 0.1)} \text{ for } 0.1 \le \sigma \le 10 \text{, 0 otherwise} \]
  3. Prior on \(\mathbf{\beta}\)

    \[ prob(\mathbf{\beta} | \mu_b = 0, \sigma_b = 10) = \prod_{k = 1}^{K} \frac{1}{\sqrt{2\pi \sigma^2_b}} e^{-\frac{(\beta_k - \mu_b)^2}{2\sigma^2_b}} \]

Let \(\theta\) be the vector of all unknown parameters, \(\begin{bmatrix} \beta_1 \\ \vdots \\ \beta_k \\ \vdots \\ \beta_K \\ \sigma \end{bmatrix}\). Then the posterior is proportional to

\[ p(\theta | \mathbf{y}, \mathbf{x}, a, b, \mu_b, \sigma_b) \propto prob(\mathbf{y}| \mathbf{x}, \mathbf{\beta}, \sigma) \times prob(\sigma | a, b) \times prob(\mathbf{\beta} | \mu_b, \sigma_b) \]
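As a sketch of how these pieces combine, the (unnormalized) log posterior can be written as a small function; the arrays x and y and the candidate values beta and sigma here are placeholders you would supply:

from scipy import stats

def log_posterior(beta, sigma, x, y, a=0.1, b=10., mu_b=0., sigma_b=10.):
    # zero posterior density outside the support of the uniform prior on sigma
    if not (a <= sigma <= b):
        return -np.inf
    log_like = stats.norm.logpdf(y, loc=x @ beta, scale=sigma).sum()
    log_prior_beta = stats.norm.logpdf(beta, loc=mu_b, scale=sigma_b).sum()
    log_prior_sigma = stats.uniform.logpdf(sigma, loc=a, scale=b - a)
    return log_like + log_prior_beta + log_prior_sigma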

An example#

We will use the Tobias and Koop dataset for this exercise. We’ll estimate a simple model using both frequentist and Bayesian methods. The dataset has the following fields:

| Variable | Definition |
| --- | --- |
| id | person id, ranging from 1 to 2,178 |
| educ | education (years) |
| ln_wage | log of hourly wage |
| pexp | potential experience |
| time | time trend |
| ability | ability |
| meduc | mother’s education (years) |
| feduc | father’s education |
| broken_home | dummy variable for residence in a broken home |
| siblings | number of siblings |
| pexp2 | squared potential experience |

tobias_koop = pd.read_stata('https://rlhick.people.wm.edu/econ407/data/tobias_koop.dta')
tobias_koop.head()
id educ ln_wage pexp time ability meduc feduc broken_home siblings pexp2
0 1.0 13.0 1.82 1.0 0.0 1.0 12.0 12.0 0.0 1.0 1.0
1 1.0 18.0 3.29 3.0 7.0 1.0 12.0 12.0 0.0 1.0 9.0
2 1.0 18.0 3.21 5.0 9.0 1.0 12.0 12.0 0.0 1.0 25.0
3 1.0 18.0 3.06 6.0 10.0 1.0 12.0 12.0 0.0 1.0 36.0
4 2.0 15.0 2.14 4.0 6.0 1.5 12.0 12.0 0.0 1.0 16.0

Note that this is a panel dataset with the following time periods:

tobias_koop.time.value_counts()
time
12.0    1570
11.0    1535
10.0    1520
13.0    1517
14.0    1499
9.0     1443
8.0     1398
7.0     1284
6.0     1248
5.0     1093
4.0     1034
3.0      964
2.0      745
1.0      615
0.0      454
Name: count, dtype: int64

For the purposes of this lecture we will ignore the panel aspects of this dataset and estimate an ordinary least squares model. Furthermore, due to memory constraints that students may encounter on jupyterhub for some of the features below, we will restrict our sample to time==4 (1983). In general we don’t recommend discarding data like this, but we do so here for the aforementioned reasons. Reducing the data:

print("Records in original dataset: ", tobias_koop.shape[0])
tobias_koop = tobias_koop[tobias_koop.time == 4]
print("Records in reduced dataset: ", tobias_koop.shape[0])
Records in original dataset:  17919
Records in reduced dataset:  1034

Let’s estimate a simple wage model:

\[ ln\_wage_i = \beta_0 + \beta_1 educ_i + \beta_2 pexp_i + \beta_3 broken\_home_i + \epsilon_i \]

Using statsmodels, we have

res = smf.ols(formula='ln_wage ~ educ + pexp + broken_home', 
              data=tobias_koop).fit()
print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                ln_wage   R-squared:                       0.142
Model:                            OLS   Adj. R-squared:                  0.140
Method:                 Least Squares   F-statistic:                     57.04
Date:                Fri, 12 Jan 2024   Prob (F-statistic):           4.09e-34
Time:                        10:52:15   Log-Likelihood:                -598.31
No. Observations:                1034   AIC:                             1205.
Df Residuals:                    1030   BIC:                             1224.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       0.6789      0.133      5.101      0.000       0.418       0.940
educ            0.0874      0.009      9.290      0.000       0.069       0.106
pexp            0.0804      0.007     12.034      0.000       0.067       0.093
broken_home    -0.0031      0.036     -0.085      0.933      -0.074       0.068
==============================================================================
Omnibus:                       49.553   Durbin-Watson:                   1.734
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               95.369
Skew:                          -0.327   Prob(JB):                     1.95e-21
Kurtosis:                       4.336   Cond. No.                         132.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Bayesian Estimation#

We will continue using the model setup as specified above,

\[\begin{split} \begin{align} \sigma &\sim Uniform(a=0.1, b=10) \nonumber \\ \mathbf{\beta} &\sim N(\mu_b = 0, \sigma_b = 10) \nonumber \\ \mathbf{y} &\sim N(\mathbf{x\beta}, \sigma^2 I) \nonumber \end{align} \end{split}\]

The implementation in PyMC code is very close to this model definition:

with pm.Model() as ols_model:
    
    # priors
    b0 = pm.Normal('constant', mu=0., sigma=10.)
    b1 = pm.Normal('educ', mu=0., sigma=10.)
    b2 = pm.Normal('pexp', mu=0., sigma=10.)
    b3 = pm.Normal('broken_home', mu = 0., sigma=10.)
    sigma = pm.Uniform('sigma', lower=.1, upper=10.)

    # yhat (or the mean for each i's likelihood contrib)
    mu_i = b0 + b1 * tobias_koop.educ + b2 * tobias_koop.pexp +\
                b3 * tobias_koop.broken_home
        
    # likelihood
    like = pm.Normal('likelihood', mu=mu_i, sigma=sigma, 
                      observed = tobias_koop.ln_wage)
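Before sampling, one optional sanity check (not part of the workflow below, just a sketch) is to draw from the prior predictive distribution and confirm that the priors imply plausible values on the log-wage scale:

with ols_model:
    prior_draws = pm.sample_prior_predictive()

# prior predictive draws of the observed variable (log-wage scale)
prior_lnwage = prior_draws.prior_predictive['likelihood']
print(float(prior_lnwage.mean()), float(prior_lnwage.std()))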

Note that the MAP estimate (the maximum of the posterior) is very close to the OLS estimates:

with ols_model:
    startvals = pm.find_MAP()
    
print(startvals)
100.00% [44/44 00:00<00:00 logp = -613.5, ||grad|| = 9.6493]

{'constant': array(0.67877072), 'educ': array(0.0874166), 'pexp': array(0.08036808), 'broken_home': array(-0.00305453), 'sigma_interval__': array(-3.36234812), 'sigma': array(0.43158197)}
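The sigma_interval__ entry is sigma expressed on PyMC’s unconstrained sampling scale. Assuming the default interval transform \(\log\left(\frac{\sigma - lower}{upper - \sigma}\right)\) for a bounded variable, we can verify it by hand:

lower, upper = 0.1, 10.
sigma_map = startvals['sigma']
print(np.log((sigma_map - lower) / (upper - sigma_map)))   # approximately -3.362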

Note

On our jupyterhub server, it is necessary to include two function arguments, which in normal cases you might not want to use:

  1. Each user is allocated 2 CPU cores, so for PyMC to run properly you must use the cores=2 argument below. While the code will run without this argument, results may be unreliable, particularly for this notebook. On a typical PC, you would want to omit the cores argument and let PyMC use the maximum number of cores available for quickest execution.

  2. The progressbar=False is included so that these notes will look better on this webpage. In general, you would probably want to view the progress and wouldn’t need this option.

Sampling from the posterior for the model is accomplished using

with ols_model:
    step = pm.Metropolis()
    samples = pm.sample(1000, step=step, tune=500, initvals=startvals, cores=2, chains=4,
                       return_inferencedata=True, progressbar=False)
Multiprocess sampling (4 chains in 2 jobs)
CompoundStep
>Metropolis: [constant]
>Metropolis: [educ]
>Metropolis: [pexp]
>Metropolis: [broken_home]
>Metropolis: [sigma]
Sampling 4 chains for 500 tune and 1_000 draw iterations (2_000 + 4_000 draws total) took 2 seconds.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters.  A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details

Summarizing the trace:

pm.summary(samples)
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
constant 0.751 0.081 0.649 0.932 0.035 0.026 6.0 15.0 1.85
educ 0.083 0.006 0.072 0.093 0.002 0.002 7.0 14.0 1.70
pexp 0.077 0.006 0.066 0.087 0.002 0.001 13.0 87.0 1.24
broken_home -0.007 0.035 -0.080 0.051 0.003 0.002 169.0 323.0 1.04
sigma 0.432 0.010 0.414 0.449 0.001 0.000 241.0 186.0 1.03

The summary above shows

  1. Results very similar to OLS

  2. That we have very doubtful chain convergence, as the Gelman-Rubin statistic (r_hat) is large (these diagnostics can also be computed directly with ArviZ, as shown below)
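The same convergence diagnostics can be computed directly with ArviZ, the library PyMC uses for its summaries:

import arviz as az

print(az.rhat(samples))   # Gelman-Rubin statistic per parameter
print(az.ess(samples))    # effective sample size per parameter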

Examining autocorrelation, we have

pm.plot_autocorr(samples, combined=True);
[autocorrelation plots for each parameter]

which shows very high levels of autocorrelation, particularly for educ and the model constant.

The acceptance rate is

accept = samples.sample_stats.accepted.mean().to_numpy()
print("Acceptance Rate: ", accept.round(3))
Acceptance Rate:  0.126

To obtain better results, we need to either

  1. Increase sample size (and tuning), or

  2. Choose a different step method

At this point in the course, our only option is to increase sampling and tuning:

with ols_model:
    step = pm.Metropolis()
    samples = pm.sample(10000, step=step, tune=10000, initvals=startvals, cores=2, chains=4,
                       return_inferencedata=True, progressbar=False)
Multiprocess sampling (4 chains in 2 jobs)
CompoundStep
>Metropolis: [constant]
>Metropolis: [educ]
>Metropolis: [pexp]
>Metropolis: [broken_home]
>Metropolis: [sigma]
Sampling 4 chains for 10_000 tune and 10_000 draw iterations (40_000 + 40_000 draws total) took 32 seconds.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters.  A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details

We get better, albeit not acceptable, convergence diagnostics:

pm.summary(samples)
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
constant 0.662 0.124 0.432 0.900 0.015 0.010 73.0 156.0 1.05
educ 0.089 0.009 0.072 0.105 0.001 0.001 76.0 162.0 1.05
pexp 0.081 0.007 0.068 0.093 0.000 0.000 240.0 910.0 1.02
broken_home -0.002 0.036 -0.069 0.067 0.001 0.000 4752.0 7001.0 1.00
sigma 0.433 0.009 0.415 0.451 0.000 0.000 7234.0 8344.0 1.00

but we still have high levels of autocorrelation:

pm.plot_autocorr(samples, combined=True);
[autocorrelation plots for each parameter]

Our acceptance rate is also improved:

accept = samples.sample_stats.accepted.mean().to_numpy()
print("Acceptance Rate: ", accept.round(3))
Acceptance Rate:  0.317

Here is the trace plot:

pm.plot_trace(samples);
[trace plots for each parameter]

Adding “policy relevant” measures to your model#

It is possible to add policy-relevant measures to your model and sample from their posteriors quite easily. For demonstration purposes, suppose we want to sample (1) the elasticity of education, and (2) the predicted wage. This is fairly straightforward, but we need to redefine our PyMC model to include these items.

Recall that, given the \(t^{th}\) sampled coefficient \(b_{kt}\) in our trace, the elasticity in a semi-log model such as ours for individual \(i\) and variable \(k\) is

\[ E_{ikt} = b_{kt} x_{ik} \]

The average elasticity for variable \(k\) and sample \(t\) is

\[ E_{kt} = \frac{\sum_{i=1}^{N} E_{ikt}}{N} \]

The expected elasticity of variable \(k\) as reflected by the underlying uncertainty in our posterior is

\[ E_{k} = \int E_{kt} p(\theta | x_i, h^0) d\theta \]

where \(\theta\) is the parameter vector, \(p(\theta|x_i, h^0)\) is our posterior, and \(h^0\) are the hyperparameters describing the priors.

We can approximate this integral using samples from the posterior. Since we have numerous samples for \(b_k\), we estimate the integral above using

\[ \hat{E}_k = \frac{\sum_{t=1}^T E_{kt}}{T} \]

where \(T\) is the number of samples in our trace.
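Before redefining the model, note that \(\hat{E}_k\) for education can also be approximated after the fact from the samples we already have (a quick sketch, useful as a cross-check):

b_educ = samples.posterior['educ']                        # dims: (chain, draw)
elas_educ_hat = float((b_educ * tobias_koop.educ.mean()).mean())
print('Approximate elasticity of education: ', elas_educ_hat)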

To implement this in code, we will ask PyMC to store these calculations in our sampled chain, which we can then average to produce the desired measure. Note the use of pytensor.tensor for performing these calculations; wrapping a calculation in pm.Deterministic tells PyMC to include the calculated values in the trace.

import pytensor.tensor as tt

with pm.Model() as ols_model:
    ## priors
    b0 = pm.Normal('constant', mu=0., sigma=10.)
    b1 = pm.Normal('educ', mu=0., sigma=10.)
    b2 = pm.Normal('pexp', mu=0., sigma=10.)
    b3 = pm.Normal('broken_home', mu=0., sigma=10.)
    sigma = pm.Uniform('sigma', lower=.1, upper=10.)
 
    ## yhat (or the mean for each i's likelihood contrib)
    mu_i = b0 + b1 * tobias_koop.educ + b2 * tobias_koop.pexp +\
                b3 * tobias_koop.broken_home
        
    ## include some policy relevant measures in our samples
    # for predicted wage, note the model is in log wage units
    yhat = pm.Deterministic('wage_hat', tt.exp(mu_i))
    # this is E_{ikt} for a semi log model for each
    # sample from our posterior (N x T):
    elas_educ = b1 * tobias_koop.educ
    # note this will be calculated for **each** observation
    # for **each** sampled parameter vector.  What we want is
    # the average over all observations for each sampled parameter
    # vector (E_{kt}) which we can calculate as 
    elas = pm.Deterministic('elas_educ', tt.mean(elas_educ, axis=-1))
 
    # likelihood
    like = pm.Normal('likelihood', mu=mu_i, sigma=sigma, 
                      observed=tobias_koop.ln_wage)
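If the python graphviz package is installed, you can optionally visualize the model graph and confirm that wage_hat and elas_educ enter as Deterministic nodes rather than free parameters:

pm.model_to_graphviz(ols_model)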

Now let’s sample from our posterior with these additional elements added in:

with ols_model:
    step = pm.Metropolis()
    samples = pm.sample(10000, step=step, tune=10000, initvals=startvals, cores=2, chains=4,
                       return_inferencedata=True, progressbar=False)
Multiprocess sampling (4 chains in 2 jobs)
CompoundStep
>Metropolis: [constant]
>Metropolis: [educ]
>Metropolis: [pexp]
>Metropolis: [broken_home]
>Metropolis: [sigma]
Sampling 4 chains for 10_000 tune and 10_000 draw iterations (40_000 + 40_000 draws total) took 33 seconds.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters.  A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details

Note

On some computers, this may not run due to memory limitations, or it could take a long time. If it fails on your machine, lower 10000 to some smaller value.
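If memory is a binding constraint, one alternative (a sketch, not what we do in this notebook) is to drop the wage_hat Deterministic from the model and reconstruct predicted wages after sampling, using xarray broadcasting over the stored coefficient draws:

import xarray as xr

post = samples.posterior
mu_post = (post['constant']
           + post['educ'] * xr.DataArray(tobias_koop.educ.values, dims='obs')
           + post['pexp'] * xr.DataArray(tobias_koop.pexp.values, dims='obs')
           + post['broken_home'] * xr.DataArray(tobias_koop.broken_home.values, dims='obs'))
wage_hat_posthoc = np.exp(mu_post)   # dims: (chain, draw, obs)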

If we examine the samples object, note we have additional items:

samples.posterior
<xarray.Dataset>
Dimensions:         (chain: 4, draw: 10000, wage_hat_dim_0: 1034)
Coordinates:
  * chain           (chain) int64 0 1 2 3
  * draw            (draw) int64 0 1 2 3 4 5 6 ... 9994 9995 9996 9997 9998 9999
  * wage_hat_dim_0  (wage_hat_dim_0) int64 0 1 2 3 4 ... 1030 1031 1032 1033
Data variables:
    constant        (chain, draw) float64 0.7862 0.7862 0.7919 ... 0.7091 0.7022
    educ            (chain, draw) float64 0.08177 0.08177 ... 0.08651 0.08651
    pexp            (chain, draw) float64 0.07368 0.07368 ... 0.07455 0.07455
    broken_home     (chain, draw) float64 -0.05324 -0.05324 ... 0.04271 0.04271
    sigma           (chain, draw) float64 0.4054 0.4526 0.4353 ... 0.4195 0.4195
    wage_hat        (chain, draw, wage_hat_dim_0) float64 6.433 10.05 ... 7.409
    elas_educ       (chain, draw) float64 1.004 1.004 1.004 ... 1.062 1.062
Attributes:
    created_at:                 2024-01-12T10:54:02.394867
    arviz_version:              0.17.0
    inference_library:          pymc
    inference_library_version:  5.10.3
    sampling_time:              33.444486141204834
    tuning_steps:               10000

In particular, note that we have a new dimension (or coordinate) in our chain, wage_hat_dim_0, which tracks observations in our data. Note that predicted wage is a large array:

samples.posterior.wage_hat.to_numpy().shape
(4, 10000, 1034)

Since we store only one elasticity measure per sampled parameter vector, its shape is

samples.posterior.elas_educ.to_numpy().shape
(4, 10000)

in a similar manner to our other parameters:

samples.posterior.constant.to_numpy().shape
(4, 10000)

In our calculations, we elected not to include individual-specific elasticities in our samples, so there is no dimension over the observations in our data.
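If individual-specific elasticities were needed, they could be recovered from the trace after the fact with the same kind of xarray broadcasting (a sketch):

import xarray as xr

# E_{ikt} = b_{educ,t} * educ_i, with dims (chain, draw, obs)
elas_i = samples.posterior['educ'] * xr.DataArray(tobias_koop.educ.values, dims='obs')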

Analyzing the Trace#

Summarizing the trace (and omitting our predicted wage) yields:

pm.summary(samples, var_names = ['constant', 'educ', 'pexp', 'broken_home', 'sigma', 'elas_educ'])
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
constant 0.664 0.139 0.393 0.925 0.020 0.014 51.0 53.0 1.08
educ 0.088 0.010 0.070 0.107 0.001 0.001 52.0 59.0 1.07
pexp 0.081 0.007 0.069 0.093 0.001 0.000 169.0 751.0 1.02
broken_home -0.003 0.036 -0.071 0.065 0.001 0.000 3076.0 7115.0 1.00
sigma 0.433 0.010 0.416 0.452 0.000 0.000 7719.0 8559.0 1.00
elas_educ 1.085 0.121 0.854 1.316 0.017 0.012 52.0 59.0 1.07

We can even see the marginal posterior for our elasticity estimate:

pm.plot_trace(samples, var_names = ['elas_educ']);
[trace plot for elas_educ]

The predicted wage is included in the trace and has dimension (number of chains) x (number of samples) x (number of individuals). This is a lot of information. To illustrate what we have stored in our posterior, let’s focus on only the first 50 individuals:

individs = [i for i in range(50)]
pm.plot_trace(samples.sel(wage_hat_dim_0 = individs), var_names=['wage_hat']);
[trace plots of wage_hat for the first 50 individuals]

By storing these values in our trace, we can calculate policy-relevant measures that immediately reflect the uncertainty in our parameter estimates, since their distributions are driven by the posterior itself. This is evident in the posterior distributions of predicted wage for each individual.
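For example, per-individual summaries of predicted wage (posterior means and a central 94% interval) can be pulled straight from the stored draws:

wage_draws = samples.posterior['wage_hat']
wage_mean = wage_draws.mean(dim=('chain', 'draw'))                      # one value per individual
wage_interval = wage_draws.quantile([.03, .97], dim=('chain', 'draw'))  # equal-tailed 94% interval
print(wage_mean.to_numpy()[:5])
print(wage_interval.to_numpy()[:, :5])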

Summary#

This workbook has demonstrated how to use PyMC for a linear regression model. We examined some convergence criteria and saw that we had to increase the sample size quite a lot to approach the Gelman-Rubin benchmark of r_hat \(\approx 1\). Even after increasing the number of samples substantially, the values above fail to fully satisfy this criterion for some of the parameters in our model.

In the next chapter we turn our attention to more modern step methods for achieving convergence with many fewer samples.