%matplotlib inline
import re
import pandas as pd
import numpy as np
import seaborn as sbn
from scipy.stats import norm
import matplotlib.pyplot as plt

sbn.set_style('white')
sbn.set_context('talk')

np.random.seed(12345678)

Introduction to PyMC#

Note

The descriptions of software in this introduction are my personal opinions, and I'm sure plenty of people would disagree with my assessment of the available packages.

We have been working on getting under the hood of Metropolis-Hastings and Markov Chain Monte Carlo (MCMC). But there are some very accessible MCMC software packages out there. These are my personal views on what is available:

  1. Stata (bayesmh), newly available in Stata 14. To be honest, I have never heard of anyone using Stata for Bayesian sampling methods, although I haven't really looked.

  2. Matlab. There are numerous user-written and official Matlab routines. For example, a version of emcee is implemented there (more on emcee later in the course). Matlab suits people who may want to tweak their own sampler code and who need the fastest possible computation.

  3. SAS. A decent set of packages, not used very much in the social sciences (perhaps more in the medical sciences), but well suited for big data.

  4. R. For a long time R was the de facto standard and probably still is, with packages like RStan, which interfaces with the Stan C++ library. Another big advantage of RStan is that Gelman is an active contributor to that project. JAGS, rbugs, and OpenBUGS have also been around for a long time. I almost elected to teach this course in R.

  5. Python. The libraries aren't as rich as R's, but you might view Python as sitting between Matlab and R: probably faster than R (in my view), and also easier for coding your own samplers. Python is the de facto language for many applied physicists, who need speed and the latest samplers. Additionally, Python is stronger in the big-data environment than any of the other packages mentioned above. The major players in Python are:

    • PyMC

    • PyStan

    • emcee

    • TensorFlow Probability

    • PyTorch

    • JAX (not exactly a Bayesian library, but often useful)

These final three packages are also optimized to use graphics processing units (GPUs), which can result in massive speedups for computationally expensive models.

Today, we are going to focus on PyMC, which is a very easy-to-use package now that we have a solid understanding of how posteriors are constructed.

We will start with our very simple one parameter model and then move to slightly more complicated settings:

sigma = 3 # Note this is the std of our data
data = norm(10,sigma).rvs(100)

# hyperparameters on prior for mu
mu_prior = 8
sigma_prior = 1.5

plt.hist(data,bins=20)
sbn.despine(offset=3.)
plt.show()
[Figure: histogram of the simulated data]

PyMC#

Let’s use PyMC to model this:

This is the model statement describing the priors and the likelihood. Here, mu is defined as a stochastic variable (we want a chain of sampled values for this variable), and we provide a prior distribution and hyperparameters for it. The likelihood function is chosen to be Normal, with one parameter to be estimated (mu), and we use a known \(\sigma\) (denoted as sigma). Our "dependent variable" is given by observed=data, where data is generated above and shown in the histogram. In more formal terms, the code below sets up a basic_model having the following form:

\[\begin{align}
Prob(\mu|\sigma,\mathbf{data}) &\propto Prob(\mathbf{data}|\mu,\sigma) \times Prob(\mu \mid \mu^0_\mu, \sigma^0_\mu) \nonumber \\
&\propto \phi \left( \frac{\mathbf{data} - \mu}{\sigma} \right) \times \phi \left( \frac{\mu - \mu^0_\mu}{\sigma^0_\mu} \right) \nonumber \\
&\propto \phi \left( \frac{\mathbf{data} - \mu}{3} \right) \times \phi \left( \frac{\mu - 8}{1.5} \right)
\end{align}\]

where \(\phi(.)\) is the standard normal pdf.
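To connect this to the sampler mechanics we built by hand, here is a minimal sketch (not part of the PyMC model; the function name log_posterior is just illustrative) that evaluates the unnormalized log-posterior for a candidate value of \(\mu\), using the norm object imported above:

# Illustrative only: evaluate the unnormalized log-posterior by hand,
# mirroring the equation above, for the simulated `data`.
def log_posterior(mu_candidate, data, sigma=3, mu_prior=8, sigma_prior=1.5):
    log_like = norm(mu_candidate, sigma).logpdf(data).sum()       # log Prob(data | mu, sigma)
    log_prior = norm(mu_prior, sigma_prior).logpdf(mu_candidate)  # log Prob(mu | mu0, sigma0)
    return log_like + log_prior

print(log_posterior(10.0, data))

This is exactly the quantity a Metropolis-Hastings sampler compares (in ratio form) between the current and proposed values of \(\mu\).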

This code sets up the model (with name basic_model) but doesn’t sample from it:

import pymc as pm

with pm.Model() as basic_model:

    # Priors for unknown model parameters
    mu = pm.Normal('Mean of Data',mu_prior,sigma_prior)
    
    # Likelihood (sampling distribution) of observations
    data_in = pm.Normal('Y_obs', mu=mu, sigma=sigma, observed=data)
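Before sampling, it can be reassuring to check which variables the model registered. A small optional sketch (free_RVs and observed_RVs are standard attributes of a PyMC Model object):

# Optional sanity check: parameters to be sampled vs. observed likelihood terms.
print(basic_model.free_RVs)      # the stochastic parameter 'Mean of Data'
print(basic_model.observed_RVs)  # the observed variable 'Y_obs'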

The following uses our PyMC model and proceeds with sampling. We will set starting values at the MAP estimate and use a Metropolis step method (the random-walk Metropolis-Hastings algorithm we have studied so far).

chain_length = 10000 

with basic_model:
    # obtain starting values via MAP
    startvals = pm.find_MAP(model=basic_model)
    print(startvals)
    # instantiate sampler
    step = pm.Metropolis() 

    # draw posterior samples
    trace = pm.sample(chain_length, step=step, initvals=startvals, tune=1000,
                       return_inferencedata=True, progressbar=False) 
100.00% [3/3 00:00<00:00 logp = -290.67, ||grad|| = 31.153]

{'Mean of Data': array(10.69594596)}
Multiprocess sampling (2 chains in 2 jobs)
Metropolis: [Mean of Data]
Sampling 2 chains for 1_000 tune and 10_000 draw iterations (2_000 + 20_000 draws total) took 2 seconds.
We recommend running at least 4 chains for robust computation of convergence diagnostics

A few things to note about this code:

  1. We tune and burn in our sampler using the first 1000 samples. Note that the option discard_tuned_samples is set to True by default, so each chain actually draws 11,000 samples but only returns 10,000.

  2. We are starting our chains at the MAP estimate. This may or may not be a good idea depending on your problem. If we don't supply starting values, PyMC makes intelligent decisions about them.

  3. We are using random-walk Metropolis-Hastings via the Metropolis() step method. This method auto-tunes the proposal width and discards the samples drawn during the tuning process (the first 1000 samples).

  4. Note that the sampler automatically runs multiple independent chains (needed for the Gelman-Rubin convergence criterion). By default 4 chains are sampled unless fewer processor cores are available (in which case the number of chains equals the number of cores, but is never fewer than 2). These settings can also be chosen explicitly, as shown in the sketch after this list.
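If you want explicit control over this behaviour, pm.sample accepts chains and cores arguments. A minimal sketch (the variable name trace_4chains is just illustrative):

# Sketch: request four chains explicitly; `cores` controls how many run in parallel.
with basic_model:
    trace_4chains = pm.sample(chain_length, step=pm.Metropolis(), tune=1000,
                              chains=4, cores=2, initvals=startvals,
                              return_inferencedata=True, progressbar=False)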

In addition to providing an easy way to set up and sample from a model without writing the accept/reject algorithm yourself, PyMC offers a full suite of tools for visualizing and assessing the convergence properties of your chains.

  • Visualize the traces. For each of the sampled chains (two in this run):

pm.plot_trace(trace,figsize=(20,5));
[Figure: trace plots (posterior density and sampled values), one line per chain]

or we can combine the information across all chains, so that the histogram in the left panel (the visualization of the posterior) pools all samples:

pm.plot_trace(trace, figsize=(20,5), combined=True);
[Figure: combined trace plot across all chains]
  • Summarize the posterior. Note that by default this combines samples across all chains:

pm.summary(trace)
              mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
Mean of Data  10.69  0.297  10.104   11.215      0.005    0.003    3783.0    4091.0    1.0

Note that the summary includes useful convergence diagnostics for each parameter: r_hat is the Gelman-Rubin statistic, and the effective sample size measures are given by ess_bulk and ess_tail.

Understanding the PyMC Results Object#

All the results are contained in the trace variable. Since we specified return_inferencedata=True in the pm.sample call above, this returns an ArviZ InferenceData object. What you need to know is that InferenceData is a data standard shared by multiple Bayesian software packages, and understanding the structure of this object is useful in numerous ways.
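One practical benefit of this shared standard is that the entire object can be written to disk and reloaded later, or opened directly with ArviZ. A small sketch, assuming ArviZ is installed (it ships with PyMC); the file name below is just an example:

import arviz as az

# Save the full InferenceData object (posterior, sample_stats, observed_data) ...
trace.to_netcdf("basic_model_trace.nc")

# ... and reload it later, e.g. in another notebook or directly through ArviZ.
trace_reloaded = az.from_netcdf("basic_model_trace.nc")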

First, notice that the object organizes information for our simple model around the following keys (identifiers):

trace.keys()
KeysView(Inference data with groups:
	> posterior
	> sample_stats
	> observed_data)

For our purposes, the posterior and sample_stats are the most important items to focus on. Look inside the object to examine the estimated mean inside the posterior:

trace.posterior
<xarray.Dataset>
Dimensions:       (chain: 2, draw: 10000)
Coordinates:
  * chain         (chain) int64 0 1
  * draw          (draw) int64 0 1 2 3 4 5 6 ... 9994 9995 9996 9997 9998 9999
Data variables:
    Mean of Data  (chain, draw) float64 10.78 10.78 10.78 ... 10.51 10.51 10.51
Attributes:
    created_at:                 2024-01-12T10:52:02.605841
    arviz_version:              0.17.0
    inference_library:          pymc
    inference_library_version:  5.10.3
    sampling_time:              2.3172712326049805
    tuning_steps:               1000

Further, we can drill down to examine only the sampled Mean of Data parameter:

trace.posterior['Mean of Data']
<xarray.DataArray 'Mean of Data' (chain: 2, draw: 10000)>
array([[10.78335751, 10.78335751, 10.78335751, ..., 10.31184733,
        10.31184733, 11.17041015],
       [10.66553476, 10.66553476, 10.86596014, ..., 10.5130954 ,
        10.5130954 , 10.5130954 ]])
Coordinates:
  * chain    (chain) int64 0 1
  * draw     (draw) int64 0 1 2 3 4 5 6 7 ... 9993 9994 9995 9996 9997 9998 9999

Notice that the name of each parameter tracked inside the posterior is exactly as specified in the PyMC model. If you want to extract your samples and place them in a NumPy array, use

trace_np = trace.posterior['Mean of Data'].to_numpy()

This trace has the following shape:

trace_np.shape
(2, 10000)

Here the first dimension is the number of chains and the second is the number of samples per chain.
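If you simply want one long vector of posterior draws, pooling the chains, a reshape does the job. A minimal sketch:

# Pool both chains into a single vector of 20,000 draws.
all_draws = trace_np.reshape(-1)
print(all_draws.shape)                    # (20000,)
print(all_draws.mean(), all_draws.std())  # quick manual posterior summary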

Selecting subsets of chains/samples from the Inference Data object#

Sometimes you want to

  1. Only examine a single chain, or

  2. Omit burn-in observations

before running diagnostics or visualizing the results. The sel method is useful in this regard. Suppose we only want to visualize a single chain:

pm.plot_trace(trace.sel(chain=[0]));
[Figure: trace plot for chain 0 only]

This produces a trace plot for only the first chain, rather than for all of them. You can also "drop" samples if you need a longer burn-in:

keep_samples = [i for i in range(5000,10000)]
trace_burn_in = trace.sel(draw=keep_samples)
pm.plot_trace(trace_burn_in);
[Figure: trace plot for the last 5000 draws of each chain]

This shows a traceplot for the last 5000 samples for each chain (dropping the first 5000). Using similar techniques, we can also compute summaries only on these last 5000 samples:

pm.summary(trace_burn_in)
               mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
Mean of Data  10.685  0.294  10.104   11.207      0.007    0.005    2013.0    2019.0    1.0

Diagnostics#

Many of the key diagnostics discussed in the previous chapter are already included in the summary call:

pm.summary(trace)
              mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
Mean of Data  10.69  0.297  10.104   11.215      0.005    0.003    3783.0    4091.0    1.0

Note that:

  1. r_hat is the Gelman-Rubin statistic

  2. The two effective sample size statistics (beginning with ess) are as discussed in the prior chapter

  3. The Monte Carlo standard error of the posterior mean is provided by mcse_mean. Each of these diagnostics can also be computed individually, as shown in the sketch after this list.
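These diagnostics are available on their own through ArviZ, which pm.summary uses under the hood. A short sketch, assuming ArviZ is installed:

import arviz as az

# Compute the diagnostics individually rather than through pm.summary.
print(az.rhat(trace))   # Gelman-Rubin statistic per parameter
print(az.ess(trace))    # (bulk) effective sample size
print(az.mcse(trace))   # Monte Carlo standard error of the mean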

Autocorrelation Plots#

pm.plot_autocorr(trace,figsize=(17,5));
[Figure: autocorrelation plots, one panel per chain]

For me, this is better viewed combined, particularly after you have examined the Gelman-Rubin statistic and confirmed convergence:

pm.plot_autocorr(trace, figsize=(12,9), combined=True);
[Figure: combined autocorrelation plot across chains]

Acceptance Rate#

The acceptance rate can also be computed from our trace object:

accept = trace.sample_stats.accepted.mean().to_numpy()
print("Acceptance Rate: ", accept.round(3))
Acceptance Rate:  0.329
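Because sample_stats is an xarray Dataset, it is just as easy to check the acceptance rate chain by chain, which can reveal a poorly mixing chain that the pooled average hides. A minimal sketch:

# Acceptance rate for each chain separately (averaging only over the draws).
accept_by_chain = trace.sample_stats.accepted.mean(dim='draw').to_numpy()
print("Acceptance rate by chain:", accept_by_chain.round(3))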