Stata, Jupyter, and Reproducible Research
Wikipedia defines reproducible research as follows:
The term reproducible research refers to the idea that the ultimate product of academic research is the paper along with the full computational environment used to produce the results in the paper such as the code, data, etc. that can be used to reproduce the results and create new work based on the research.
The reproducible research movement (especially for the statistical sciences) takes this a step further by advocating for dynamic documents. The idea is that a researcher should provide a file (the dynamic document) that executes the statistical analysis, generates figures, and contains the accompanying text narrative. This file can be executed to produce the academic paper. The researcher shares this file with other researchers rather than only the paper. It is my view that within 20 years nearly every scientific journal in applied statistics will require this approach.
This document shows how to use jupyter notebook or lab and Markdown syntax for reproducible research and dynamic documents for work in Stata. The idea behind jupyter is that you share your research by sharing your ipynb notebook file. This file performs the full suite of statistical analysis and can produce the pdf manuscript describing your analysis. You will use this workflow for all class problem set assignments.
For every problem set, you will turn in the jupyter notebook ipynb file (similar to a do file) containing all commands, descriptive text, and embedded handwritten responses that produce your problem set.
Getting Started¶
You will be using https://jupyterhub.wm.edu. When you log in initially, choose either the R or Stata option.
Some Features of Markdown in Jupyter¶
Jupyter supports most features of Markdown, a lightweight and readable text-based language that allows files to be easily converted to nice-looking pdf, html, or even Word documents. Some features you will likely want to use:
Equations and math notation using LaTeX math
Headers
Emphasizing text (bold and italics)
Numbered and bulleted lists
Turning Stata output on and off
Adding page breaks for pdf output using \pagebreak in a Markdown cell
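As a rough illustration of the text-formatting items in this list, a Markdown cell might contain something like the sketch below. The header text, list items, and the equation are placeholders invented for the example; only standard Markdown syntax, LaTeX math delimiters, and the \pagebreak command mentioned above are used.

```markdown
# A top-level header
## A second-level header

This sentence has **bold** and *italic* text, and inline math such as $\hat{\beta}_1$.

$$
y_i = \beta_0 + \beta_1 x_i + \epsilon_i
$$

1. First numbered item
2. Second numbered item

- A bulleted item
- Another bulleted item

\pagebreak
```

The \pagebreak line only has an effect when the notebook is exported to pdf; in the notebook itself it is shown as plain text or ignored.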
A simple example analysis using Markdown syntax¶
Below we’ll be modeling the following regression equation for cars back in the day:
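Assuming the model is the one estimated later in this document with reg price mpg foreign, the equation could be typed into a Markdown cell using LaTeX math roughly as follows. The coefficient symbols and the error term are notation chosen for this sketch, not taken from the original.

```latex
\[
\text{price}_i = \beta_0 + \beta_1 \, \text{mpg}_i + \beta_2 \, \text{foreign}_i + \epsilon_i
\]
```

Here \(\beta_1\) is the change in price associated with one additional mile per gallon, and \(\beta_2\) is the price difference between foreign and domestic cars at the same mpg.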
Load Data and Summarize¶
Load and summarize a dataset:
sysuse auto
sum
At this point you are free to execute Stata commands interactively in your notebook. If you encounter any problems, open an issue on the issue tracker.
Loading and Summarizing Data¶
Summarizing the data using a Stata code cell shows the variables we can consider in our analysis:
. sysuse auto
(1978 automobile data)
. sum
    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
        make |          0
       price |         74    6165.257    2949.496       3291      15906
         mpg |         74     21.2973    5.785503         12         41
       rep78 |         69    3.405797    .9899323          1          5
    headroom |         74    2.993243    .8459948        1.5          5
-------------+---------------------------------------------------------
       trunk |         74    13.75676    4.277404          5         23
      weight |         74    3019.459    777.1936       1760       4840
      length |         74    187.9324    22.26634        142        233
        turn |         74    39.64865    4.399354         31         51
displacement |         74    197.2973    91.83722         79        425
-------------+---------------------------------------------------------
  gear_ratio |         74    3.014865    .4562871       2.19       3.89
     foreign |         74    .2972973    .4601885          0          1
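Two rows in this table deserve a second look: make reports 0 observations because it is a string variable, and rep78 reports 69 observations because 5 of the 74 cars have a missing value. A minimal sketch of how one might check both points in a Stata code cell, using only built-in commands:

```stata
* Reload the example data so this cell runs on its own
sysuse auto, clear

* make is a string variable, so summarize has no numeric values to report
describe make

* rep78 has 69 observations out of 74, so count the missing values
count if missing(rep78)

* Restrict the summary to the variables used in the regression
summarize price mpg foreign
```

None of the three regression variables has missing values, so the regression uses all 74 observations.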
We might also want to look at a histogram of our dependent variable, price:
hist price
Regression Model¶
Here are the regression results:
reg price mpg foreign
      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(2, 71)        =     14.07
       Model |   180261702         2  90130850.8   Prob > F        =    0.0000
    Residual |   454803695        71  6405685.84   R-squared       =    0.2838
-------------+----------------------------------   Adj R-squared   =    0.2637
       Total |   635065396        73  8699525.97   Root MSE        =    2530.9
------------------------------------------------------------------------------
       price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         mpg |  -294.1955   55.69172    -5.28   0.000    -405.2417   -183.1494
     foreign |   1767.292    700.158     2.52   0.014     371.2169    3163.368
       _cons |   11905.42   1158.634    10.28   0.000     9595.164    14215.67
------------------------------------------------------------------------------
Discussion of Results¶
We can now proceed to describe our results and add narrative to the document: it looks like, back in the day, foreign cars sold for more! Holding mpg constant, the estimated coefficient on foreign is 1767.292 (p = 0.014).
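When writing narrative like this, it can help to pull numbers from Stata's stored results rather than retyping them from the table. The sketch below also uses the quietly prefix, which is one plain-Stata way of turning output off for a command; whether your notebook setup offers other ways to hide output is not covered here.

```stata
* Reload the example data so this cell runs on its own
sysuse auto, clear

* quietly suppresses the regression table; the estimates are still stored
quietly regress price mpg foreign

* Report the foreign coefficient and its standard error from stored results
display "Foreign premium: " %9.2f _b[foreign] " (std. err. " %9.2f _se[foreign] ")"

* Number of observations used in the estimation
display "N = " e(N)
```

Because the displayed values come from the estimation itself, the narrative stays in sync if the data or the model specification changes.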
Jupyter and Mata¶
Mata is the matrix algebra environment in Stata. Mata code runs in a Stata code cell just like any other Stata code, as long as it is wrapped between mata and end:
Define \(\mathbf{A}_{2 \times 2}\) as
mata
A = (1,2\3,4)
A
end
. mata
------------------------------------------------- mata (type end to exit) -----
:
: A = (1,2\3,4)
: A
1 2
+---------+
1 | 1 2 |
2 | 3 4 |
+---------+
: end
-------------------------------------------------------------------------------
.
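To connect Mata back to the regression above, the sketch below computes the least-squares coefficients directly as \((X'X)^{-1}X'y\). It assumes the auto data can be loaded with sysuse; the names y, X, and b are chosen for this example only.

```stata
* Reload the example data so this cell runs on its own
sysuse auto, clear

mata
// Pull the dependent variable and regressors from the Stata dataset
y = st_data(., "price")
X = (st_data(., ("mpg", "foreign")), J(rows(y), 1, 1))  // constant in last column

// OLS coefficients: (X'X)^(-1) X'y
b = invsym(cross(X, X)) * cross(X, y)
b
end
```

The three elements of b should line up with the mpg, foreign, and _cons coefficients reported by reg price mpg foreign.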
Producing pdfs from your notebook¶
It is possible to export your notebook in a variety of formats, including pdf. To download this page as a pdf, click on the download link in the top corner of this page and choose pdf. To create a pdf from your own notebook, click on File -> Export Notebook as ... and choose pdf. This may require additional configuration steps, which are not required for this course.
A reproducible version of this notebook¶
Because Stata is not open source and was not available when this website was built, the output shown on this page cannot be regenerated from the site itself. To fully replicate these results, use this ipynb notebook[^2] and run it on a campus lab computer.