Stata, Jupyter, and Reproducible Research

Wikipedia defines reproducible research as follows:

The term reproducible research refers to the idea that the ultimate product of academic research is the paper along with the full computational environment used to produce the results in the paper such as the code, data, etc. that can be used to reproduce the results and create new work based on the research.

The reproducible research movement (especially in the statistical sciences) takes this a step further by advocating for dynamic documents. The idea is that a researcher should provide a single file (the dynamic document) that executes the statistical analysis, generates the figures, and contains the accompanying narrative text. This file can be executed to produce the academic paper. The researcher shares this file with other researchers rather than only the paper. It is my view that within 20 years nearly every scientific journal in applied statistics will require this approach.

This document shows how to use Jupyter Notebook or JupyterLab with Markdown syntax to produce reproducible research and dynamic documents for work in Stata. The idea behind Jupyter is that you share your research by sharing your ipynb notebook file. This file performs the full suite of statistical analysis and can produce the PDF manuscript describing your analysis. You will use this workflow for all class problem set assignments.

For every problem set, you will turn in the Jupyter notebook (ipynb) file, which plays a role similar to a do-file. It should contain all the commands, descriptive text, and embedded handwritten responses that produce your problem set.

Getting Started

You will be using https://jupyterhub.wm.edu. When you log in initially, choose either the R or Stata option.

Some Features of Markdown in Jupyter

Jupyter supports most features of Markdown, a lightweight and readable text-based language that lets files be converted easily to nice-looking PDF, HTML, or even Word documents. Some features you will likely want to use:

  • Equations and math notation using LaTeX math

  • Headers

  • Emphasizing text (bold and italics)

  • Numbered and bulleted lists

  • Turning Stata output on and off

  • Adding page breaks for PDF output using \pagebreak in a Markdown cell
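
For the list item on turning Stata output on and off, one option is Stata's own quietly prefix, sketched below. This is an assumption about what that item refers to; the notebook kernel may offer its own cell-level switch, which is not shown here.

```stata
* Load the example data (clear replaces any data already in memory)
sysuse auto, clear

* quietly runs the command but suppresses its printed output
quietly regress price mpg foreign

* Show only selected stored results from the hidden regression
display "Observations: " e(N)
display "R-squared:    " %6.4f e(r2)
```

This keeps a long results table out of the finished document while still letting you report the numbers you care about.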

A simple example analysis using Markdown syntax

Below we’ll be modeling the following regression equation for cars back in the day:

\[ price_i = \beta_0 + \beta_1 mpg_i + \beta_2 foreign_i + \epsilon_i \]

Load Data and Summarize

Load and summarize a dataset:

sysuse auto
sum

At this point you are free to execute Stata commands interactively in your notebook. If you encounter any problems, open an issue at the issue-tracker.

Loading and Summarizing Data

Running summarize in a Stata code cell shows the variables we can consider in our analysis:

. sysuse auto
(1978 automobile data)

. sum

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
        make |          0
       price |         74    6165.257    2949.496       3291      15906
         mpg |         74     21.2973    5.785503         12         41
       rep78 |         69    3.405797    .9899323          1          5
    headroom |         74    2.993243    .8459948        1.5          5
-------------+---------------------------------------------------------
       trunk |         74    13.75676    4.277404          5         23
      weight |         74    3019.459    777.1936       1760       4840
      length |         74    187.9324    22.26634        142        233
        turn |         74    39.64865    4.399354         31         51
displacement |         74    197.2973    91.83722         79        425
-------------+---------------------------------------------------------
  gear_ratio |         74    3.014865    .4562871       2.19       3.89
     foreign |         74    .2972973    .4601885          0          1 
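
Note that make shows zero observations because it is a string variable, and summarize reports statistics only for numeric variables. A minimal sketch of a few follow-up commands that may help when choosing variables (the particular variables picked here are for illustration only):

```stata
sysuse auto, clear

* Storage types and labels, including the string variable make
describe make price mpg foreign

* Percentiles, skewness, and kurtosis for the dependent variable
summarize price, detail

* Counts of domestic and foreign cars
tabulate foreign
```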

We might also want to look at a histogram of our dependent variable, price:

hist price
[Figure: histogram of price]
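
A slightly fuller version of the histogram command, as a sketch. The title text and the file name price_hist.svg are hypothetical choices, not part of the original notebook.

```stata
sysuse auto, clear

* Histogram of price with counts on the y-axis instead of densities
histogram price, frequency title("Distribution of car prices")

* Save the current graph to disk (file name is illustrative only)
graph export "price_hist.svg", replace
```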

Regression Model

Here are the regression results:

reg price mpg foreign
      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(2, 71)        =     14.07
       Model |   180261702         2  90130850.8   Prob > F        =    0.0000
    Residual |   454803695        71  6405685.84   R-squared       =    0.2838
-------------+----------------------------------   Adj R-squared   =    0.2637
       Total |   635065396        73  8699525.97   Root MSE        =    2530.9

------------------------------------------------------------------------------
       price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         mpg |  -294.1955   55.69172    -5.28   0.000    -405.2417   -183.1494
     foreign |   1767.292    700.158     2.52   0.014     371.2169    3163.368
       _cons |   11905.42   1158.634    10.28   0.000     9595.164    14215.67
------------------------------------------------------------------------------

Discussion of Results

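We can now proceed to describe our results and add narrative to the document: it looks like, back in the day, foreign cars sold for more! Holding mpg fixed, the estimated price of a foreign car is about 1,767 higher than that of a domestic car, and each additional mile per gallon is associated with a price about 294 lower.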
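
To make the comparison concrete, the estimates in the regression table can be written as a fitted equation and evaluated at an illustrative fuel economy of 20 mpg (the value 20 is chosen here only as an example):

\[ \widehat{price}_i = 11905.42 - 294.1955 \, mpg_i + 1767.292 \, foreign_i \]

\[ \text{Domestic, 20 mpg: } 11905.42 - 294.1955(20) = 6021.51 \]

\[ \text{Foreign, 20 mpg: } 6021.51 + 1767.292 \approx 7788.80 \]

The gap between the two predictions equals the coefficient on foreign, 1767.292, at any value of mpg.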

Jupyter and Mata

Mata is the matrix programming environment built into Stata. In a notebook, Mata code runs in an ordinary Stata code cell; just wrap the code between mata and end:

Define \(\mathbf{A}_{2 \times 2}\) as

\[\begin{split} \mathbf{A}=\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \end{split}\]
mata
A = (1,2\3,4)
A
end
. mata
------------------------------------------------- mata (type end to exit) -----
: 
: A = (1,2\3,4)

: A
       1   2
    +---------+
  1 |  1   2  |
  2 |  3   4  |
    +---------+

: end
-------------------------------------------------------------------------------

. 
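
As a further sketch of what Mata can do, the regression from earlier can be recomputed with matrix algebra using \(\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}\). The names y, X, and b are chosen here for illustration; st_data(), J(), cross(), and invsym() are built-in Mata functions.

```stata
sysuse auto, clear
mata
// Pull the dependent variable and regressors from the Stata dataset
y = st_data(., "price")
X = st_data(., ("mpg", "foreign"))

// Append a column of ones for the intercept
X = X, J(rows(X), 1, 1)

// OLS coefficients: (X'X)^(-1) X'y
b = invsym(cross(X, X)) * cross(X, y)
b
end
```

The three entries of b should line up with the mpg, foreign, and _cons coefficients reported by reg price mpg foreign.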

Producing PDFs from your notebook

It is possible to export your notebook in a variety of formats, including PDF. For this web page, click the download link in the top corner of the page and choose pdf. For your own notebook, click File -> Export Notebook as ... and choose pdf. Exporting to PDF may require additional configuration steps and is not required for this course.

A reproducible version of this notebook

Because Stata is not open source, it was not available when this website was built. To fully replicate these results, you need to run this ipynb notebook[^2] on a campus lab computer.