Introduction to Bayesian Econometrics#
This course will cover the application of Bayesian statistical methods for econometric inference. Broadly speaking we will
Briefly discuss sampling methods for classical statistics
Introduce Bayes Rule and Provide an Application
Examine the use of Monte Carlo Markov Chains
Link to Bayes Rule
Metropolis-Hastings and other Samplers
Chain “convergence” and diagnostics
Application: OLS, Time-Series Econometrics, Heirarchical Models
The frequentist paradigm, arguably the dominant statistical paradigm in the social sciences (and what you all have studied), relies on the following notions:
\(\beta\) is not random but neither is it known. It is a fixed quantity.
To uncover information about \(\beta\), we observe part of some process (e.g. \(\mathbf{y=x\beta+\epsilon}\)).
For statistical inference, we rely on repeated trials of \(\mathbf{y}\) and \(\mathbf{x}\), even if this repetition rarely (if ever) occurs in the social science context.
\(\mathbf{y}\) and \(\mathbf{x}\) are considered random
The model typically attempts to uncover information about \(\mathbf{\beta}\) by examining the likelihood function
\[ prob(\mathbf{y}|\mathbf{b},\mathbf{x}) \]where \(\mathbf{b}\) are our estimates of \(\beta\)
The bayesian paradigm tackles the issue of estimating \(\beta\) by
Treating \(\beta\) as random and unknown
For a process (\(\mathbf{y=x\beta+\epsilon}\)) uncover information about \(\beta\)
Treating \(\mathbf{y}\) and \(\mathbf{x}\) as fixed and non-random (at least once they are recorded in your dataset)
Uncovers information about \(\mathbf{\beta}\) by examining the posterior likelihood
\[ prob(\mathbf{b}|\mathbf{y},\mathbf{x}) \]where \(\mathbf{b}\) are our estimates of \(\beta\)
In quite a lot of instances, these two approaches give you the same estimate for \(\beta\). Until recently, Bayesian Statistical modeling wasn’t used because calculating the posterior likelihood was computationally challenging, but recent advances in the theory and construction of Monte Carlo Markov Chains and computational ability has really opened the door for Bayesian analysis for problems that might not be estimated using the frequentist paradigm (ie. Maximim Likelihood). There is an ongoing holy war in the two statistical camps, during the semester I will attempt to highlite the pros and cons of each paradigm without taking a position on which one is better. My philosophy is that if it gets the job done, use it while being aware of limitations and advantages.
Repeated trials in frequentist statistics#
A good jumping off point for this course is to understand the use of sampling techniques in a classical statistical paradigm.
Underlying all statistical inference that you have learned in statistics and econometrics is the idea of repeated trials. Bootstrapping highlites this really well.
We will begin with an exploration of bootstrapping and see the implementation steps.
# load python libraries for this ipython notebook:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sbn
import statsmodels.formula.api as smf
import warnings
warnings.filterwarnings('ignore')
plt.style.use('ggplot')
np.random.seed(12578)
Load Tobias and Koop data for time==4
#
tobias_koop=pd.read_csv('https://rlhick.people.wm.edu/econ407/data/tobias_koop_t_4.csv')
tobias_koop.head()
id | educ | ln_wage | pexp | time | ability | meduc | feduc | broken_home | siblings | pexp2 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4 | 12 | 2.14 | 2 | 4 | 0.26 | 12 | 10 | 1 | 4 | 4 |
1 | 6 | 15 | 1.91 | 4 | 4 | 0.44 | 12 | 16 | 0 | 2 | 16 |
2 | 8 | 13 | 2.32 | 8 | 4 | 0.51 | 12 | 15 | 1 | 2 | 64 |
3 | 11 | 14 | 1.64 | 1 | 4 | 1.82 | 16 | 17 | 1 | 2 | 1 |
4 | 12 | 13 | 2.16 | 6 | 4 | -1.30 | 13 | 12 | 0 | 5 | 36 |
And a data description,
tobias_koop.describe()
id | educ | ln_wage | pexp | time | ability | meduc | feduc | broken_home | siblings | pexp2 | |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 1034.000000 | 1034.000000 | 1034.000000 | 1034.000000 | 1034.0 | 1034.000000 | 1034.000000 | 1034.000000 | 1034.000000 | 1034.000000 | 1034.000000 |
mean | 1090.951644 | 12.274662 | 2.138259 | 4.815280 | 4.0 | 0.016596 | 11.403288 | 11.585106 | 0.169246 | 3.200193 | 27.979691 |
std | 634.891728 | 1.566838 | 0.466280 | 2.190298 | 0.0 | 0.920963 | 3.027277 | 3.735833 | 0.375150 | 2.126575 | 22.598790 |
min | 4.000000 | 9.000000 | 0.420000 | 0.000000 | 4.0 | -3.140000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 537.250000 | 12.000000 | 1.820000 | 3.000000 | 4.0 | -0.550000 | 11.000000 | 10.000000 | 0.000000 | 2.000000 | 9.000000 |
50% | 1081.500000 | 12.000000 | 2.120000 | 5.000000 | 4.0 | 0.170000 | 12.000000 | 12.000000 | 0.000000 | 3.000000 | 25.000000 |
75% | 1666.500000 | 13.000000 | 2.450000 | 6.000000 | 4.0 | 0.720000 | 12.000000 | 14.000000 | 0.000000 | 4.000000 | 36.000000 |
max | 2177.000000 | 19.000000 | 3.590000 | 12.000000 | 4.0 | 1.890000 | 20.000000 | 20.000000 | 1.000000 | 15.000000 | 144.000000 |
Next, visualize the distribution of ln_wage
data, our “dependent
variable” for this problem.
plt.figure(figsize=(8,6))
tobias_koop.ln_wage.plot(kind='hist',bins=30)
plt.xlabel('ln_wage')
sbn.despine(offset=3.)
plt.show()
And using statsmodels
, we can run this OLS regression:
formula = 'ln_wage ~ educ + pexp + broken_home'
mod = smf.ols(formula, data=tobias_koop)
res=mod.fit()
res.summary()
Dep. Variable: | ln_wage | R-squared: | 0.142 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.140 |
Method: | Least Squares | F-statistic: | 57.04 |
Date: | Fri, 12 Jan 2024 | Prob (F-statistic): | 4.09e-34 |
Time: | 10:51:22 | Log-Likelihood: | -598.31 |
No. Observations: | 1034 | AIC: | 1205. |
Df Residuals: | 1030 | BIC: | 1224. |
Df Model: | 3 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 0.6789 | 0.133 | 5.101 | 0.000 | 0.418 | 0.940 |
educ | 0.0874 | 0.009 | 9.290 | 0.000 | 0.069 | 0.106 |
pexp | 0.0804 | 0.007 | 12.034 | 0.000 | 0.067 | 0.093 |
broken_home | -0.0031 | 0.036 | -0.085 | 0.933 | -0.074 | 0.068 |
Omnibus: | 49.553 | Durbin-Watson: | 1.734 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 95.369 |
Skew: | -0.327 | Prob(JB): | 1.95e-21 |
Kurtosis: | 4.336 | Cond. No. | 132. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
These standard errors aren’t robust (although we could ask for that). As we saw in Cross-Section, we can non-parametrically bootstrap confidence intervals of our parameters. While bootstrapping is implemented in many ols routines, we implement them manually to see how it works:
Steps for bootstrapping:
Loop \(R\) times (\(R\) = # of boostrapped replicates)
For each \(r\in R\), sample \(N\) times with replacement from rows in \(\mathbf{Y}\) and \(\mathbf{X}\), noting that \(N\) is equal to the number of observations in the original dataset \(\mathbf{Y}\) and \(\mathbf{X}\). Denote each of these samples as \(\mathbf{X}_r\) and \(\mathbf{Y}_r\).
Estimate an OLS model and recover estimate for replicate \(r\): \(\hat{\beta}_r = (\mathbf{X}_r'\mathbf{X}_r)^{-1}\mathbf{X}_r'\mathbf{Y}_r\)
Store results
Calculate the standard deviation (our estimate for the standard errors) of each parameter estimate, or construct non-parametric confidence intervals (preferred)
Note, for our purposes we don’t need to see the regression results.
We only need to store them and move on to the next replicate. Let’s
investigate further the res
object we created above.
How to Recover OLS Estimates from Statsmodels#
# The parameter values can be accessed here
beta = res.params
print(beta)
print("\nExtract the education parameter")
print(beta['educ'])
Intercept 0.678893
educ 0.087409
pexp 0.080364
broken_home -0.003058
dtype: float64
Extract the education parameter
0.08740853040171812
Sampling with replacement#
Fundamental to the idea of bootstrapping is to sample with replacement to preserve the shape of the underlying distribution of our parameters. We can do this quite easily using numpy tools.
To do this, suppose we want to sample with replacement from 5 rows of
data indexed by [0,1,2,3,4]
.
So we can sample with replacement from row numbers (index) and then use that to select the rows we need for that replicate. For each replicate, run the model or calculate the statistic we are interested in, store the results, and then repeat for all subsequent replicates.
row_id = np.arange(5)
print(row_id)
for r in range(3):
this_sample = np.random.choice(row_id,size=row_id.shape[0],replace=True)
print("\nReplicate", r+1)
print("Rows to sample for this replicate")
print(this_sample)
[0 1 2 3 4]
Replicate 1
Rows to sample for this replicate
[3 3 0 2 2]
Replicate 2
Rows to sample for this replicate
[1 2 4 1 0]
Replicate 3
Rows to sample for this replicate
[0 1 0 4 4]
The Bootstrap for our OLS Model#
Note
Note that the boostrap is a method leaning on the frequentist idea of sampling from the data for inference about unknown parameters. In this case we will be using it to learn about possible heterogeneity in individual-specific error standard variances \(\sigma^2_i\) that is systematically related to our independent variables \(\mathbf{x}\).
The following code uses the ideas above to perform the following steps for each replicate \(r\) of \(R\) total replicates:
Sample with replacement from rows of our dataset
Calculate the statistic of interest (our OLS parameter estimates)
Store results
# Number of replicates
R = 2000
# store each r of the R replicates here in rows:
results_boot = np.zeros((R,res.params.shape[0]))
row_id = range(0,tobias_koop.shape[0])
for r in range(R):
# this samples with replacement from rows in the Tobias and Koop dataset
this_sample = np.random.choice(row_id, size=tobias_koop.shape[0],
replace=True) # gives sampled row numbers
# Define data for this replicate:
tobias_koop_r = tobias_koop.iloc[this_sample]
# Estimate model
results_r = smf.ols(formula,data=tobias_koop_r).fit().params
# Store in row r of results_boot:
results_boot[r,:] = results_r
Let’s see what is being stored in results_boot
:
# Convert results to pandas dataframe for easier analysis:
results_boot = pd.DataFrame(results_boot,columns=['b_Intercept', 'b_educ',
'b_pexp', 'b_broken_home'])
results_boot.head(10)
b_Intercept | b_educ | b_pexp | b_broken_home | |
---|---|---|---|---|
0 | 0.721343 | 0.087336 | 0.073638 | -0.019011 |
1 | 0.736026 | 0.084456 | 0.076059 | 0.001683 |
2 | 0.754465 | 0.084692 | 0.073544 | -0.053538 |
3 | 0.662644 | 0.086421 | 0.084544 | 0.024570 |
4 | 0.853913 | 0.072120 | 0.081522 | 0.016549 |
5 | 0.571139 | 0.096741 | 0.073106 | 0.001449 |
6 | 0.805548 | 0.079790 | 0.075882 | -0.008706 |
7 | 0.514763 | 0.097322 | 0.088359 | -0.047815 |
8 | 0.751332 | 0.083861 | 0.079385 | -0.071247 |
9 | 0.632651 | 0.090178 | 0.083682 | 0.023307 |
and looking at the summary statistics:
results_boot.describe(percentiles=[.005,.025,.05,.5,.95,.975,.995])
b_Intercept | b_educ | b_pexp | b_broken_home | |
---|---|---|---|---|
count | 2000.000000 | 2000.000000 | 2000.000000 | 2000.000000 |
mean | 0.675511 | 0.087646 | 0.080503 | -0.003090 |
std | 0.131140 | 0.009872 | 0.006195 | 0.031879 |
min | 0.228924 | 0.057217 | 0.058955 | -0.103214 |
0.5% | 0.327764 | 0.061143 | 0.063925 | -0.090730 |
2.5% | 0.415075 | 0.068354 | 0.068366 | -0.063346 |
5% | 0.461648 | 0.071911 | 0.070596 | -0.056133 |
50% | 0.675157 | 0.087676 | 0.080379 | -0.003017 |
95% | 0.887589 | 0.103901 | 0.090750 | 0.049725 |
97.5% | 0.932553 | 0.106997 | 0.092291 | 0.058456 |
99.5% | 1.021891 | 0.113232 | 0.096221 | 0.080701 |
max | 1.109086 | 0.120323 | 0.100580 | 0.113913 |
From the above results, we see that the 95% confidence interval for education is in the range [0.0668,0.10517] (approximately, as these numbers will change each time you run the code above). It is tempting to interpret the confidence intervals above as There is a 95% chance that \(\beta\) is in the range [0.0668,0.10517]. However, recall that in the frequentist paradigm \(\beta\) is fixed and non-random. So it is either in the range [0.0668,0.10517] or it isn’t. What is random is the range [0.0668,0.10517]. So a better way to verbalize the confidence interval is that if you repeated your regression analysis many times, 95% of your calculated confidence interval would contain the true parameter \(\beta\).
Let’s plot our bootstrap replicates, by parameter. Note that the values of each of these replicates are completely independent (or should be). Replicate 3 doesn’t in any way inform us about replicate 4. For each parameter, we show the estimated parameter (solid black line) and two parameter estimates:
The standard 95% confidence interval (dashed red line):
\[ CI_i = \hat{\beta}_i \pm 1.96 * \hat{\sigma}_{\beta_i} \]The 95% confidence interval calculated based off the 2.5 and 97.5 percentiles from our boostrap database (dashed blue line).
Focusing on educ
#
plt.figure(figsize=(10,8))
plt.xlabel('b_educ')
lw = 2
plt.axvline(beta['educ'],color='k',linestyle='dashed',lw=lw,
label='OLS Estimate')
# Method 1. 95% Parametric CI's
plt.axvline(beta['educ'] -1.96*np.std(results_boot.b_educ),color='b',
linestyle='dashed',
lw=lw,label='Par. Lower 95%')
plt.axvline(beta['educ'] +1.96*np.std(results_boot.b_educ),color='b',
linestyle='dashed',
lw=lw,label='Par. Lower 95%')
# Method 2. Non-Parametric 95% CI's
plt.axvline(np.percentile(results_boot.b_educ,2.5),color='r',
linestyle='dashed',
lw=lw,label='Non-Par. Lower 95%')
plt.axvline(np.percentile(results_boot.b_educ,97.5),color='r',
linestyle='dashed',
lw=lw,label='Non-Par. Upper Lower 95%')
# scootch the upper limit of the x-axis a bit to the right for
# non-overlapping legend
plt.xlim([.05,.15])
results_boot.b_educ.plot(kind='hist',bins=20,alpha=.5)
sbn.despine(offset=3.)
plt.legend()
plt.show()
We can see similar information for all parameters:
# set x axis values
replicate= np.arange(R)
# plot point estimate and confidence intervals:
plt.figure(figsize=(14, 20), dpi=200)
lw = 1
plt.subplot(4,2,1)
plt.ylabel("Intercept")
plt.plot(replicate, results_boot.b_Intercept, label="Intercept", lw=lw)
plt.subplot(4,2,2)
plt.hist(results_boot.b_Intercept,lw=lw, label="b_intercept", orientation='horizontal')
plt.subplot(4,2,3)
plt.ylabel("b_educ")
plt.plot(replicate, results_boot.b_educ, label="b_educ", lw=lw)
plt.subplot(4,2,4)
plt.hist(results_boot.b_educ,lw=lw, label="b_educ", orientation='horizontal')
plt.subplot(4,2,5)
plt.ylabel("b_pexp")
plt.plot(replicate, results_boot.b_pexp, label="b_pexp", lw=lw)
plt.subplot(4,2,6)
plt.hist(results_boot.b_pexp,lw=lw, label="b_pexp", orientation='horizontal')
plt.subplot(4,2,7)
plt.ylabel("b_broken_home")
plt.plot(replicate, results_boot.b_broken_home, label="b_broken_home", lw=lw)
plt.subplot(4,2,8)
plt.hist(results_boot.b_broken_home,lw=lw, label="b_broken_home", orientation='horizontal')
plt.show()
Visualizing Covariances \(\beta\):#
plt.figure(figsize=(10,8));
sbn.jointplot(results_boot['b_educ'], results_boot['b_Intercept'], kind='hex');
plt.show()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[13], line 2
1 plt.figure(figsize=(10,8));
----> 2 sbn.jointplot(results_boot['b_educ'], results_boot['b_Intercept'], kind='hex');
3 plt.show()
TypeError: jointplot() takes from 0 to 1 positional arguments but 2 positional arguments (and 1 keyword-only argument) were given
<Figure size 1000x800 with 0 Axes>
or for all parameters:
g = sbn.PairGrid(results_boot)
g.map_diag(sbn.kdeplot)
g.map_offdiag(sbn.kdeplot, cmap="Blues_d")
plt.show()
Note
An important point about bootstrapping is that we are sampling with replacement from data to uncover better information about our error structure (something we need to estimate for the purposes of inference).