Introduction to R

This page is intended for students in Cross-section who are unfamiliar with R. While not a complete introduction, it should give you enough background to complete course assignments in R. There are several ways to run R code:

1. From the terminal/command line
2. Using RStudio or some other integrated development environment
3. Using JupyterLab and Jupyter notebooks

In this course, you will be using the William and Mary JupyterHub server, choosing the R kernel, and creating Jupyter notebooks (method 3).

Loading data into R

The foreign package lets us open many different types of data files, including Stata, SAS, and SPSS, to name a few. Comma-delimited data can be read with base R's read.csv, while Excel files need a separate package such as readxl.

library(foreign)
mroz = read.dta("https://rlhick.people.wm.edu/econ407/data/mroz.dta")
summary(mroz)

      lfp              whrs             kl6              k618      
 Min.   :0.0000   Min.   :   0.0   Min.   :0.0000   Min.   :0.000  
 1st Qu.:0.0000   1st Qu.:   0.0   1st Qu.:0.0000   1st Qu.:0.000  
 Median :1.0000   Median : 288.0   Median :0.0000   Median :1.000  
 Mean   :0.5684   Mean   : 740.6   Mean   :0.2377   Mean   :1.353  
 3rd Qu.:1.0000   3rd Qu.:1516.0   3rd Qu.:0.0000   3rd Qu.:2.000  
 Max.   :1.0000   Max.   :4950.0   Max.   :3.0000   Max.   :8.000  
       wa              we              ww              rpwg           hhrs     
 Min.   :30.00   Min.   : 5.00   Min.   : 0.000   Min.   :0.00   Min.   : 175  
 1st Qu.:36.00   1st Qu.:12.00   1st Qu.: 0.000   1st Qu.:0.00   1st Qu.:1928  
 Median :43.00   Median :12.00   Median : 1.625   Median :0.00   Median :2164  
 Mean   :42.54   Mean   :12.29   Mean   : 2.375   Mean   :1.85   Mean   :2267  
 3rd Qu.:49.00   3rd Qu.:13.00   3rd Qu.: 3.788   3rd Qu.:3.58   3rd Qu.:2553  
 Max.   :60.00   Max.   :17.00   Max.   :25.000   Max.   :9.98   Max.   :5010  
       ha              he              hw              faminc     
 Min.   :30.00   Min.   : 3.00   Min.   : 0.4121   Min.   : 1500  
 1st Qu.:38.00   1st Qu.:11.00   1st Qu.: 4.7883   1st Qu.:15428  
 Median :46.00   Median :12.00   Median : 6.9758   Median :20880  
 Mean   :45.12   Mean   :12.49   Mean   : 7.4822   Mean   :23081  
 3rd Qu.:52.00   3rd Qu.:15.00   3rd Qu.: 9.1667   3rd Qu.:28200  
 Max.   :60.00   Max.   :17.00   Max.   :40.5090   Max.   :96000  
      mtr              wmed             wfed              un        
 Min.   :0.4415   Min.   : 0.000   Min.   : 0.000   Min.   : 3.000  
 1st Qu.:0.6215   1st Qu.: 7.000   1st Qu.: 7.000   1st Qu.: 7.500  
 Median :0.6915   Median :10.000   Median : 7.000   Median : 7.500  
 Mean   :0.6789   Mean   : 9.251   Mean   : 8.809   Mean   : 8.624  
 3rd Qu.:0.7215   3rd Qu.:12.000   3rd Qu.:12.000   3rd Qu.:11.000  
 Max.   :0.9415   Max.   :17.000   Max.   :17.000   Max.   :14.000  
      cit               ax       
 Min.   :0.0000   Min.   : 0.00  
 1st Qu.:0.0000   1st Qu.: 4.00  
 Median :1.0000   Median : 9.00  
 Mean   :0.6428   Mean   :10.63  
 3rd Qu.:1.0000   3rd Qu.:15.00  
 Max.   :1.0000   Max.   :45.00  

Loading files from disk is a slight variation on the above command. Supposing that your Stata data file mroz.dta is in the folder /some/place, in Linux or macOS we would use the R command

mroz = read.dta("/some/place/mroz.dta")

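It is also possible to open datasets stored in R format using load("/some/place/mroz.RData"), which restores the saved objects under the names they were saved with (assigning the result, as in mroz = load(...), would store only those names rather than the data). In all cases in this class, though, we’ll be opening Stata or other common data types using foreign.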
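As a rough sketch of loading from disk, the example below writes a small made-up data frame to temporary files and reads it back. The data frame toy and the temporary paths are assumptions for illustration only; they stand in for mroz and /some/place.

```r
library(foreign)

# Hypothetical toy data standing in for mroz
toy <- data.frame(lfp = c(1L, 0L, 1L), whrs = c(1610L, 0L, 1656L))

# Temporary files, so nothing depends on a real folder
dta_path   <- tempfile(fileext = ".dta")
rdata_path <- tempfile(fileext = ".RData")
write.dta(toy, dta_path)        # Stata format
save(toy, file = rdata_path)    # R format

# Stata file: read.dta returns the data frame, so we assign it
from_stata <- read.dta(dta_path)
print(from_stata)

# R file: load() recreates the object itself and returns only its name
rm(toy)
loaded_names <- load(rdata_path)
print(loaded_names)
print(toy)
```

Note the asymmetry: read.dta hands back the data, while load puts the saved objects straight into the workspace.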

Viewing Data in R

Viewing R data at the command line is achieved by the head or tail commands. Here we’ll view the first 5 rows of data:

head(mroz,5)

A data.frame: 5 × 19
	lfp	whrs	kl6	k618	wa	we	ww	rpwg	hhrs	ha	he	hw	faminc	mtr	wmed	wfed	un	cit	ax
	<int>	<int>	<int>	<int>	<int>	<int>	<dbl>	<dbl>	<int>	<int>	<int>	<dbl>	<int>	<dbl>	<int>	<int>	<dbl>	<int>	<int>
1	1	1610	1	0	32	12	3.3540	2.65	2708	34	12	4.0288	16310	0.7215	12	7	5.0	0	14
2	1	1656	0	2	30	12	1.3889	2.65	2310	30	9	8.4416	21800	0.6615	7	7	11.0	1	5
3	1	1980	1	3	35	12	4.5455	4.04	3072	40	12	3.5807	21040	0.6915	12	7	5.0	0	15
4	1	456	0	3	34	12	1.0965	3.25	1920	53	10	3.5417	7300	0.7815	7	7	5.0	0	6
5	1	1568	1	2	31	14	4.5918	3.60	2000	32	12	10.0000	27300	0.6215	12	14	9.5	1	7

Or, the last 6 rows of data:

tail(mroz,6)

A data.frame: 6 × 19
	lfp	whrs	kl6	k618	wa	we	ww	rpwg	hhrs	ha	he	hw	faminc	mtr	wmed	wfed	un	cit	ax
	<int>	<int>	<int>	<int>	<int>	<int>	<dbl>	<dbl>	<int>	<int>	<int>	<dbl>	<int>	<dbl>	<int>	<int>	<dbl>	<int>	<int>
748	0	0	0	2	36	12	0	0	3120	39	12	1.3013	5330	0.7915	7	12	14.0	0	4
749	0	0	0	2	40	13	0	0	3020	43	16	9.2715	28200	0.6215	10	10	9.5	1	5
750	0	0	2	3	31	12	0	0	2056	33	12	4.8638	10000	0.7715	12	12	7.5	0	14
751	0	0	0	0	43	12	0	0	2383	43	12	1.0898	9952	0.7515	10	3	7.5	0	4
752	0	0	0	0	60	12	0	0	1705	55	8	12.4400	24984	0.6215	12	12	14.0	1	15
753	0	0	0	3	39	9	0	0	3120	48	12	6.0897	28363	0.6915	7	7	11.0	1	12

Or specific rows, using what is called “slice” indexing:

mroz[10:15,]

A data.frame: 6 × 19
	lfp	whrs	kl6	k618	wa	we	ww	rpwg	hhrs	ha	he	hw	faminc	mtr	wmed	wfed	un	cit	ax
	<int>	<int>	<int>	<int>	<int>	<int>	<dbl>	<dbl>	<int>	<int>	<int>	<dbl>	<int>	<dbl>	<int>	<int>	<dbl>	<int>	<int>
10	1	1600	0	2	39	12	4.6875	4.15	2100	43	12	5.7143	20425	0.6915	7	7	5.0	0	21
11	1	1969	0	1	33	12	4.0630	4.30	2450	34	12	9.7959	32300	0.5815	12	3	5.0	0	15
12	1	1960	0	1	42	11	4.5918	4.58	2375	47	14	8.0000	28700	0.6215	14	7	5.0	0	14
13	1	240	1	2	30	12	2.0833	0.00	2830	33	16	5.3004	15500	0.7215	16	16	5.0	0	0
14	1	997	0	2	43	12	2.2668	3.50	3317	46	12	4.3413	16860	0.7215	10	10	7.5	1	14
15	1	1848	0	1	43	10	3.6797	3.38	2024	45	17	10.8700	31431	0.5815	7	7	7.5	1	6
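The three viewing approaches above can be tried side by side on a small made-up data frame. This is only a sketch; toy and its columns are hypothetical and not part of the course data.

```r
# Hypothetical toy data frame (not the course data)
toy <- data.frame(id = 1:8, hours = c(40, 0, 35, 20, 0, 45, 10, 30))

first_three <- head(toy, 3)        # rows 1 to 3
last_two    <- tail(toy, 2)        # rows 7 and 8
middle      <- toy[4:6, ]          # rows 4 to 6, all columns
one_column  <- toy[4:6, "hours"]   # rows 4 to 6 of one column, as a plain vector

print(first_three)
print(last_two)
print(middle)
print(one_column)
```

The part before the comma selects rows and the part after it selects columns; leaving the column part empty keeps every column.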

Or rows meeting logical conditions. Let’s look at the first 10 rows where the respondent has kids less than 6 years old:

head(mroz[mroz$kl6>0,],10)

A data.frame: 10 × 19
	lfp	whrs	kl6	k618	wa	we	ww	rpwg	hhrs	ha	he	hw	faminc	mtr	wmed	wfed	un	cit	ax
	<int>	<int>	<int>	<int>	<int>	<int>	<dbl>	<dbl>	<int>	<int>	<int>	<dbl>	<int>	<dbl>	<int>	<int>	<dbl>	<int>	<int>
1	1	1610	1	0	32	12	3.3540	2.65	2708	34	12	4.0288	16310	0.7215	12	7	5.0	0	14
3	1	1980	1	3	35	12	4.5455	4.04	3072	40	12	3.5807	21040	0.6915	12	7	5.0	0	15
5	1	1568	1	2	31	14	4.5918	3.60	2000	32	12	10.0000	27300	0.6215	12	14	9.5	1	7
13	1	240	1	2	30	12	2.0833	0.00	2830	33	16	5.3004	15500	0.7215	16	16	5.0	0	0
25	1	1955	1	1	31	12	2.1545	2.30	2024	31	12	4.0884	12487	0.7515	12	7	5.0	1	4
29	1	1516	1	0	31	17	7.2559	6.00	2390	30	17	6.2762	26100	0.6215	12	12	5.0	0	7
41	1	112	1	2	30	12	2.6786	0.00	4030	33	16	3.8462	15810	0.7215	12	12	3.0	0	1
43	1	583	1	2	31	16	2.5729	9.98	1530	34	16	13.7250	24000	0.6615	14	16	9.5	1	6
74	1	608	2	4	34	10	8.2237	3.00	1304	38	9	3.3742	15200	0.7915	0	0	7.5	1	11
79	1	90	2	2	32	15	1.0000	0.00	2350	31	14	4.8787	13755	0.7515	10	12	7.5	1	9

Note, this is achieved using logical addressing, where only rows having the logical value TRUE are included. So for the first five rows of mroz, only rows 1, 3, and 5 have at least one young child and would be displayed above:

head(mroz$kl6>0,5)

TRUE
FALSE
TRUE
FALSE
TRUE
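To make logical addressing concrete, here is a minimal sketch on made-up data. The data frame toy and its columns kids and age are assumptions for illustration.

```r
# Hypothetical toy data (not the course data)
toy <- data.frame(kids = c(1, 0, 2, 0, 1), age = c(32, 30, 35, 34, 31))

# The comparison produces one TRUE/FALSE value per row
has_kids <- toy$kids > 0
print(has_kids)

# TRUE counts as 1, so sum() counts the matching rows
print(sum(has_kids))

# Conditions can be combined with & (and) or | (or)
young_with_kids <- toy[toy$kids > 0 & toy$age < 33, ]
print(young_with_kids)
```

Rows keep their original row names after subsetting, which is why the mroz output above shows labels like 13, 25, and 29 rather than 1 through 10.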

Creating and Modifying Variables

Creating Variables

In Stata, you need to start a new variable with generate. In R, just assign the variable:

mroz$newvar = mroz$lfp * mroz$ax
print(colnames(mroz))
print(summary(mroz))

 [1] "lfp"    "whrs"   "kl6"    "k618"   "wa"     "we"     "ww"     "rpwg"  
 [9] "hhrs"   "ha"     "he"     "hw"     "faminc" "mtr"    "wmed"   "wfed"  
[17] "un"     "cit"    "ax"     "newvar"
      lfp              whrs             kl6              k618      
 Min.   :0.0000   Min.   :   0.0   Min.   :0.0000   Min.   :0.000  
 1st Qu.:0.0000   1st Qu.:   0.0   1st Qu.:0.0000   1st Qu.:0.000  
 Median :1.0000   Median : 288.0   Median :0.0000   Median :1.000  
 Mean   :0.5684   Mean   : 740.6   Mean   :0.2377   Mean   :1.353  
 3rd Qu.:1.0000   3rd Qu.:1516.0   3rd Qu.:0.0000   3rd Qu.:2.000  
 Max.   :1.0000   Max.   :4950.0   Max.   :3.0000   Max.   :8.000  
       wa              we              ww              rpwg           hhrs     
 Min.   :30.00   Min.   : 5.00   Min.   : 0.000   Min.   :0.00   Min.   : 175  
 1st Qu.:36.00   1st Qu.:12.00   1st Qu.: 0.000   1st Qu.:0.00   1st Qu.:1928  
 Median :43.00   Median :12.00   Median : 1.625   Median :0.00   Median :2164  
 Mean   :42.54   Mean   :12.29   Mean   : 2.375   Mean   :1.85   Mean   :2267  
 3rd Qu.:49.00   3rd Qu.:13.00   3rd Qu.: 3.788   3rd Qu.:3.58   3rd Qu.:2553  
 Max.   :60.00   Max.   :17.00   Max.   :25.000   Max.   :9.98   Max.   :5010  
       ha              he              hw              faminc     
 Min.   :30.00   Min.   : 3.00   Min.   : 0.4121   Min.   : 1500  
 1st Qu.:38.00   1st Qu.:11.00   1st Qu.: 4.7883   1st Qu.:15428  
 Median :46.00   Median :12.00   Median : 6.9758   Median :20880  
 Mean   :45.12   Mean   :12.49   Mean   : 7.4822   Mean   :23081  
 3rd Qu.:52.00   3rd Qu.:15.00   3rd Qu.: 9.1667   3rd Qu.:28200  
 Max.   :60.00   Max.   :17.00   Max.   :40.5090   Max.   :96000  
      mtr              wmed             wfed              un        
 Min.   :0.4415   Min.   : 0.000   Min.   : 0.000   Min.   : 3.000  
 1st Qu.:0.6215   1st Qu.: 7.000   1st Qu.: 7.000   1st Qu.: 7.500  
 Median :0.6915   Median :10.000   Median : 7.000   Median : 7.500  
 Mean   :0.6789   Mean   : 9.251   Mean   : 8.809   Mean   : 8.624  
 3rd Qu.:0.7215   3rd Qu.:12.000   3rd Qu.:12.000   3rd Qu.:11.000  
 Max.   :0.9415   Max.   :17.000   Max.   :17.000   Max.   :14.000  
      cit               ax            newvar     
 Min.   :0.0000   Min.   : 0.00   Min.   : 0.00  
 1st Qu.:0.0000   1st Qu.: 4.00   1st Qu.: 0.00  
 Median :1.0000   Median : 9.00   Median : 4.00  
 Mean   :0.6428   Mean   :10.63   Mean   : 7.41  
 3rd Qu.:1.0000   3rd Qu.:15.00   3rd Qu.:13.00  
 Max.   :1.0000   Max.   :45.00   Max.   :38.00  

Note a new column called newvar is now part of the data.

R aficionados would probably criticize the above code, since strictly speaking the assignment

x = y

is sometimes different than the R recommended way of making an assignment:

x <- y

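which is an artifact of the keyboards in use when R's predecessor, S, was designed. I have never encountered a case where x=y doesn’t work for ordinary assignments, but the two do differ inside a function call: there, = names an argument rather than creating a variable.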
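One place where the two forms behave differently is inside a function call. The sketch below wraps the comparison in a hypothetical helper function, assignment_demo, so that it cannot clash with any x already in your notebook.

```r
# Hypothetical helper, used only to keep the demonstration self-contained
assignment_demo <- function() {
  env <- environment()

  # Here = matches the argument named x; no variable called x is created
  m1 <- median(x = 1:10)
  created_by_equals <- exists("x", envir = env, inherits = FALSE)

  # Here <- first creates x in this function, then passes it to median
  m2 <- median(x <- 1:10)
  created_by_arrow <- exists("x", envir = env, inherits = FALSE)

  list(m1 = m1, m2 = m2,
       created_by_equals = created_by_equals,
       created_by_arrow = created_by_arrow)
}

result <- assignment_demo()
print(result)
```

Both calls return the same median; only the second leaves a variable x behind. For standalone assignments such as those used on this page, either form works.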

Modifying Variables

Unlike Stata, we simply redefine the variable and don’t need to bother with replace:

mroz$newvar = mroz$newvar/10
print(summary(mroz$newvar))

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   0.400   0.741   1.300   3.800 
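Redefinition can also be limited to the rows meeting a condition, roughly what Stata does with replace ... if. The following is a sketch on made-up data; toy and the choice of NA as the replacement value are assumptions for illustration.

```r
# Hypothetical toy data (not the course data)
toy <- data.frame(lfp = c(1, 0, 1, 0), ax = c(14, 5, 15, 6))
toy$newvar <- toy$lfp * toy$ax

# Redefine the whole column
toy$newvar <- toy$newvar / 10

# Redefine only the rows where lfp equals 0
toy$newvar[toy$lfp == 0] <- NA

print(toy)
```

The logical condition inside the square brackets picks out the rows to change; every other row keeps its current value.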

Variable Lists

Sometimes we may want to refer to subsets of variables as lists. To only display lfp and whrs in the head command, build a character vector of column names with c('lfp', 'whrs') like this:

head(mroz[,c('lfp', 'whrs')], 6)

A data.frame: 6 × 2
	lfp	whrs
	<int>	<int>
1	1	1610
2	1	1656
3	1	1980
4	1	456
5	1	1568
6	1	2032
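A variable list can also be stored under a name and reused, which helps when the same set of columns appears in several commands. This is a minimal sketch; toy and hours_vars are hypothetical names.

```r
# Hypothetical toy data (not the course data)
toy <- data.frame(lfp = c(1, 0, 1), whrs = c(1610, 0, 1656), wa = c(32, 30, 35))

# Store the variable names once, then reuse them
hours_vars <- c('lfp', 'whrs')

# Only the listed columns
subset_df <- toy[, hours_vars]

# Every column except the listed ones; drop = FALSE keeps a data frame
# even when a single column remains
others_df <- toy[, !(names(toy) %in% hours_vars), drop = FALSE]

print(names(subset_df))
print(names(others_df))
```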

Peeking inside R objects

Sometimes, we are unaware of the information inside of an R object (like an estimation result). We can use names to see the named items inside the object:

names(mroz)

'lfp'
'whrs'
'kl6'
'k618'
'wa'
'we'
'ww'
'rpwg'
'hhrs'
'ha'
'he'
'hw'
'faminc'
'mtr'
'wmed'
'wfed'
'un'
'cit'
'ax'
'newvar'
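The same approach works on an estimation result. As a sketch, the example below fits a simple regression with lm on made-up data (toy, x, and y are hypothetical) and then looks inside the fitted object.

```r
# Hypothetical toy data (not the course data)
toy <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2.1, 3.9, 6.2, 7.8, 10.1))
fit <- lm(y ~ x, data = toy)

# Named components stored in the fitted model
print(names(fit))

# Pull one component out by name with $
print(fit$coefficients)

# A compact overview of what each component contains
str(fit, max.level = 1)
```

Once names shows what is available, each item can be extracted with $ just like a column of a data frame.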

Installing Packages

Each course module uses add-on functionality (referred to as packages). For example, for the OLS section of the course you will need to install the foreign, sandwich, lmtest, stargazer, and boot packages using:

install.packages(c('foreign', 'sandwich', 'lmtest', 'stargazer', 'boot'))

This might take a while to run, and it may be necessary to re-install these packages if you exit JupyterHub and rejoin later.
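Since installation can be slow, one option is to check which packages are missing before installing anything. The helper below, missing_packages, is a hypothetical function written for this sketch, not part of R or of the course materials.

```r
# Hypothetical helper: returns the names in pkgs that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages that ship with R are always found
print(missing_packages(c('stats', 'utils')))

# For the OLS section, install only what is missing:
# needed <- c('foreign', 'sandwich', 'lmtest', 'stargazer', 'boot')
# to_install <- missing_packages(needed)
# if (length(to_install) > 0) install.packages(to_install)
```

The installation lines are left commented out so that running the cell never triggers a download by accident; uncomment them when you actually need the packages.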

Getting help in R

If you know the function you need help with, just use the help function:

help(tail)

ECON407 Cross Section Econometrics
