Introduction to R

This page is intended for students in Cross-section who are unfamiliar with R. While not a complete introduction, it should give you enough background to complete course assignments in R. There are several ways to run R code:

  1. From the terminal/command line

  2. Using RStudio or some other integrated development environment

  3. Using JupyterLab and Jupyter notebooks

In this course, you will use the William and Mary JupyterHub server, choose the R kernel, and create Jupyter notebooks (method 3).

Loading data into R

The foreign package lets us open many types of data files, including Stata, SAS, and SPSS, to name a few. Comma-delimited files do not need foreign; base R reads them with read.csv.

library(foreign)
mroz = read.dta("https://rlhick.people.wm.edu/econ407/data/mroz.dta")
summary(mroz)
      lfp              whrs             kl6              k618      
 Min.   :0.0000   Min.   :   0.0   Min.   :0.0000   Min.   :0.000  
 1st Qu.:0.0000   1st Qu.:   0.0   1st Qu.:0.0000   1st Qu.:0.000  
 Median :1.0000   Median : 288.0   Median :0.0000   Median :1.000  
 Mean   :0.5684   Mean   : 740.6   Mean   :0.2377   Mean   :1.353  
 3rd Qu.:1.0000   3rd Qu.:1516.0   3rd Qu.:0.0000   3rd Qu.:2.000  
 Max.   :1.0000   Max.   :4950.0   Max.   :3.0000   Max.   :8.000  
       wa              we              ww              rpwg           hhrs     
 Min.   :30.00   Min.   : 5.00   Min.   : 0.000   Min.   :0.00   Min.   : 175  
 1st Qu.:36.00   1st Qu.:12.00   1st Qu.: 0.000   1st Qu.:0.00   1st Qu.:1928  
 Median :43.00   Median :12.00   Median : 1.625   Median :0.00   Median :2164  
 Mean   :42.54   Mean   :12.29   Mean   : 2.375   Mean   :1.85   Mean   :2267  
 3rd Qu.:49.00   3rd Qu.:13.00   3rd Qu.: 3.788   3rd Qu.:3.58   3rd Qu.:2553  
 Max.   :60.00   Max.   :17.00   Max.   :25.000   Max.   :9.98   Max.   :5010  
       ha              he              hw              faminc     
 Min.   :30.00   Min.   : 3.00   Min.   : 0.4121   Min.   : 1500  
 1st Qu.:38.00   1st Qu.:11.00   1st Qu.: 4.7883   1st Qu.:15428  
 Median :46.00   Median :12.00   Median : 6.9758   Median :20880  
 Mean   :45.12   Mean   :12.49   Mean   : 7.4822   Mean   :23081  
 3rd Qu.:52.00   3rd Qu.:15.00   3rd Qu.: 9.1667   3rd Qu.:28200  
 Max.   :60.00   Max.   :17.00   Max.   :40.5090   Max.   :96000  
      mtr              wmed             wfed              un        
 Min.   :0.4415   Min.   : 0.000   Min.   : 0.000   Min.   : 3.000  
 1st Qu.:0.6215   1st Qu.: 7.000   1st Qu.: 7.000   1st Qu.: 7.500  
 Median :0.6915   Median :10.000   Median : 7.000   Median : 7.500  
 Mean   :0.6789   Mean   : 9.251   Mean   : 8.809   Mean   : 8.624  
 3rd Qu.:0.7215   3rd Qu.:12.000   3rd Qu.:12.000   3rd Qu.:11.000  
 Max.   :0.9415   Max.   :17.000   Max.   :17.000   Max.   :14.000  
      cit               ax       
 Min.   :0.0000   Min.   : 0.00  
 1st Qu.:0.0000   1st Qu.: 4.00  
 Median :1.0000   Median : 9.00  
 Mean   :0.6428   Mean   :10.63  
 3rd Qu.:1.0000   3rd Qu.:15.00  
 Max.   :1.0000   Max.   :45.00  

Loading files from disk is a slight variation on the above command. Suppose your Stata data file mroz.dta is in the folder /some/place. On Linux or macOS we would use the R command

mroz = read.dta("/some/place/mroz.dta")

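It is also possible to open datasets stored in R format using load("/some/place/mroz.RData"). Note that load restores the saved objects under the names they were saved with and returns only those names, so its result should not be assigned to mroz. In this class we’ll mostly be opening Stata or other common data formats using foreign.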
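As a minimal sketch of how load behaves, the example below saves a small made-up data frame, removes it, and restores it from disk. The name toy, its values, and the temporary file (standing in for /some/place/mroz.RData) are all assumptions for illustration.

```r
# A small made-up data frame standing in for real course data
toy = data.frame(lfp = c(1, 0, 1), whrs = c(1610, 0, 1980))

# Save it in R's native format to a temporary file
rdata_path = tempfile(fileext = ".RData")
save(toy, file = rdata_path)

# Remove the object, then restore it from disk
rm(toy)
loaded_names = load(rdata_path)

print(loaded_names)  # the names of the restored objects, not the data
print(toy)           # the data frame is back under its original name
```

Because load returns only the object names, the data frame itself is reached through the name it was saved under.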

Viewing Data in R

Viewing R data at the command line is achieved by the head or tail commands. Here we’ll view the first 5 rows of data:

head(mroz,5)
A data.frame: 5 × 19
      lfp  whrs   kl6  k618    wa    we     ww  rpwg  hhrs    ha    he      hw faminc    mtr  wmed  wfed    un   cit    ax
    <int> <int> <int> <int> <int> <int>  <dbl> <dbl> <int> <int> <int>   <dbl>  <int>  <dbl> <int> <int> <dbl> <int> <int>
  1     1  1610     1     0    32    12 3.3540  2.65  2708    34    12  4.0288  16310 0.7215    12     7   5.0     0    14
  2     1  1656     0     2    30    12 1.3889  2.65  2310    30     9  8.4416  21800 0.6615     7     7  11.0     1     5
  3     1  1980     1     3    35    12 4.5455  4.04  3072    40    12  3.5807  21040 0.6915    12     7   5.0     0    15
  4     1   456     0     3    34    12 1.0965  3.25  1920    53    10  3.5417   7300 0.7815     7     7   5.0     0     6
  5     1  1568     1     2    31    14 4.5918  3.60  2000    32    12 10.0000  27300 0.6215    12    14   9.5     1     7

Or, the last 6 rows of data:

tail(mroz,6)
A data.frame: 6 × 19
      lfp  whrs   kl6  k618    wa    we     ww  rpwg  hhrs    ha    he      hw faminc    mtr  wmed  wfed    un   cit    ax
    <int> <int> <int> <int> <int> <int>  <dbl> <dbl> <int> <int> <int>   <dbl>  <int>  <dbl> <int> <int> <dbl> <int> <int>
748     0     0     0     2    36    12      0     0  3120    39    12  1.3013   5330 0.7915     7    12  14.0     0     4
749     0     0     0     2    40    13      0     0  3020    43    16  9.2715  28200 0.6215    10    10   9.5     1     5
750     0     0     2     3    31    12      0     0  2056    33    12  4.8638  10000 0.7715    12    12   7.5     0    14
751     0     0     0     0    43    12      0     0  2383    43    12  1.0898   9952 0.7515    10     3   7.5     0     4
752     0     0     0     0    60    12      0     0  1705    55     8 12.4400  24984 0.6215    12    12  14.0     1    15
753     0     0     0     3    39     9      0     0  3120    48    12  6.0897  28363 0.6915     7     7  11.0     1    12

Or specific rows, using what is called “slice” indexing:

mroz[10:15,]
A data.frame: 6 × 19
      lfp  whrs   kl6  k618    wa    we     ww  rpwg  hhrs    ha    he      hw faminc    mtr  wmed  wfed    un   cit    ax
    <int> <int> <int> <int> <int> <int>  <dbl> <dbl> <int> <int> <int>   <dbl>  <int>  <dbl> <int> <int> <dbl> <int> <int>
 10     1  1600     0     2    39    12 4.6875  4.15  2100    43    12  5.7143  20425 0.6915     7     7   5.0     0    21
 11     1  1969     0     1    33    12 4.0630  4.30  2450    34    12  9.7959  32300 0.5815    12     3   5.0     0    15
 12     1  1960     0     1    42    11 4.5918  4.58  2375    47    14  8.0000  28700 0.6215    14     7   5.0     0    14
 13     1   240     1     2    30    12 2.0833  0.00  2830    33    16  5.3004  15500 0.7215    16    16   5.0     0     0
 14     1   997     0     2    43    12 2.2668  3.50  3317    46    12  4.3413  16860 0.7215    10    10   7.5     1    14
 15     1  1848     0     1    43    10 3.6797  3.38  2024    45    17 10.8700  31431 0.5815     7     7   7.5     1     6
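The bracket notation is data[rows, columns], and leaving one side of the comma empty means "all of them". A minimal sketch on a small made-up data frame (toy and its values are hypothetical, not taken from mroz):

```r
# Made-up data standing in for a larger dataset
toy = data.frame(id = 1:6, hours = c(1610, 1656, 1980, 456, 1568, 2032))

rows_2_to_4 = toy[2:4, ]           # rows 2 through 4, all columns
hours_2_to_4 = toy[2:4, "hours"]   # same rows, one column (returns a vector)
without_first = toy[-1, ]          # a negative index drops that row

print(rows_2_to_4)
print(hours_2_to_4)
print(nrow(without_first))
```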

Or rows meeting logical conditions. Let’s look at the first 10 rows where the respondent has kids less than 6 years old:

head(mroz[mroz$kl6>0,],10)
A data.frame: 10 × 19
      lfp  whrs   kl6  k618    wa    we     ww  rpwg  hhrs    ha    he      hw faminc    mtr  wmed  wfed    un   cit    ax
    <int> <int> <int> <int> <int> <int>  <dbl> <dbl> <int> <int> <int>   <dbl>  <int>  <dbl> <int> <int> <dbl> <int> <int>
  1     1  1610     1     0    32    12 3.3540  2.65  2708    34    12  4.0288  16310 0.7215    12     7   5.0     0    14
  3     1  1980     1     3    35    12 4.5455  4.04  3072    40    12  3.5807  21040 0.6915    12     7   5.0     0    15
  5     1  1568     1     2    31    14 4.5918  3.60  2000    32    12 10.0000  27300 0.6215    12    14   9.5     1     7
 13     1   240     1     2    30    12 2.0833  0.00  2830    33    16  5.3004  15500 0.7215    16    16   5.0     0     0
 25     1  1955     1     1    31    12 2.1545  2.30  2024    31    12  4.0884  12487 0.7515    12     7   5.0     1     4
 29     1  1516     1     0    31    17 7.2559  6.00  2390    30    17  6.2762  26100 0.6215    12    12   5.0     0     7
 41     1   112     1     2    30    12 2.6786  0.00  4030    33    16  3.8462  15810 0.7215    12    12   3.0     0     1
 43     1   583     1     2    31    16 2.5729  9.98  1530    34    16 13.7250  24000 0.6615    14    16   9.5     1     6
 74     1   608     2     4    34    10 8.2237  3.00  1304    38     9  3.3742  15200 0.7915     0     0   7.5     1    11
 79     1    90     2     2    32    15 1.0000  0.00  2350    31    14  4.8787  13755 0.7515    10    12   7.5     1     9

Note, this is achieved using logical addressing, where only rows having the logical value TRUE are included. So for the first five rows of mroz, only rows 1, 3, and 5 have at least one young child and would be displayed above:

head(mroz$kl6>0,5)
  1. TRUE
  2. FALSE
  3. TRUE
  4. FALSE
  5. TRUE
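Logical conditions can also be combined or counted. The sketch below uses a small made-up data frame (toy and its values are hypothetical) to show a compound condition with & and the use of sum() to count the rows that match:

```r
# Made-up data: young children (kl6) and wife's age (wa)
toy = data.frame(kl6 = c(1, 0, 1, 0, 2), wa = c(32, 30, 35, 34, 31))

has_young_kids = toy$kl6 > 0       # logical vector, one entry per row
print(has_young_kids)

# Combine conditions with & (and) or | (or)
young_kids_over_31 = toy[toy$kl6 > 0 & toy$wa > 31, ]
print(young_kids_over_31)

# TRUE counts as 1, so sum() counts the matching rows
print(sum(has_young_kids))
```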

Creating and Modifying Variables

Creating Variables

In Stata, you need to start a new variable with generate. In R, just assign the variable:

mroz$newvar = mroz$lfp * mroz$ax
print(colnames(mroz))
print(summary(mroz))
 [1] "lfp"    "whrs"   "kl6"    "k618"   "wa"     "we"     "ww"     "rpwg"  
 [9] "hhrs"   "ha"     "he"     "hw"     "faminc" "mtr"    "wmed"   "wfed"  
[17] "un"     "cit"    "ax"     "newvar"
      lfp              whrs             kl6              k618      
 Min.   :0.0000   Min.   :   0.0   Min.   :0.0000   Min.   :0.000  
 1st Qu.:0.0000   1st Qu.:   0.0   1st Qu.:0.0000   1st Qu.:0.000  
 Median :1.0000   Median : 288.0   Median :0.0000   Median :1.000  
 Mean   :0.5684   Mean   : 740.6   Mean   :0.2377   Mean   :1.353  
 3rd Qu.:1.0000   3rd Qu.:1516.0   3rd Qu.:0.0000   3rd Qu.:2.000  
 Max.   :1.0000   Max.   :4950.0   Max.   :3.0000   Max.   :8.000  
       wa              we              ww              rpwg           hhrs     
 Min.   :30.00   Min.   : 5.00   Min.   : 0.000   Min.   :0.00   Min.   : 175  
 1st Qu.:36.00   1st Qu.:12.00   1st Qu.: 0.000   1st Qu.:0.00   1st Qu.:1928  
 Median :43.00   Median :12.00   Median : 1.625   Median :0.00   Median :2164  
 Mean   :42.54   Mean   :12.29   Mean   : 2.375   Mean   :1.85   Mean   :2267  
 3rd Qu.:49.00   3rd Qu.:13.00   3rd Qu.: 3.788   3rd Qu.:3.58   3rd Qu.:2553  
 Max.   :60.00   Max.   :17.00   Max.   :25.000   Max.   :9.98   Max.   :5010  
       ha              he              hw              faminc     
 Min.   :30.00   Min.   : 3.00   Min.   : 0.4121   Min.   : 1500  
 1st Qu.:38.00   1st Qu.:11.00   1st Qu.: 4.7883   1st Qu.:15428  
 Median :46.00   Median :12.00   Median : 6.9758   Median :20880  
 Mean   :45.12   Mean   :12.49   Mean   : 7.4822   Mean   :23081  
 3rd Qu.:52.00   3rd Qu.:15.00   3rd Qu.: 9.1667   3rd Qu.:28200  
 Max.   :60.00   Max.   :17.00   Max.   :40.5090   Max.   :96000  
      mtr              wmed             wfed              un        
 Min.   :0.4415   Min.   : 0.000   Min.   : 0.000   Min.   : 3.000  
 1st Qu.:0.6215   1st Qu.: 7.000   1st Qu.: 7.000   1st Qu.: 7.500  
 Median :0.6915   Median :10.000   Median : 7.000   Median : 7.500  
 Mean   :0.6789   Mean   : 9.251   Mean   : 8.809   Mean   : 8.624  
 3rd Qu.:0.7215   3rd Qu.:12.000   3rd Qu.:12.000   3rd Qu.:11.000  
 Max.   :0.9415   Max.   :17.000   Max.   :17.000   Max.   :14.000  
      cit               ax            newvar     
 Min.   :0.0000   Min.   : 0.00   Min.   : 0.00  
 1st Qu.:0.0000   1st Qu.: 4.00   1st Qu.: 0.00  
 Median :1.0000   Median : 9.00   Median : 4.00  
 Mean   :0.6428   Mean   :10.63   Mean   : 7.41  
 3rd Qu.:1.0000   3rd Qu.:15.00   3rd Qu.:13.00  
 Max.   :1.0000   Max.   :45.00   Max.   :38.00  

Note a new column called newvar is now part of the data.
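The same pattern covers most of the variables you will need to build. As a rough sketch on made-up data (toy and the new column names are hypothetical), here are a log transformation, a 0/1 dummy built from a condition, and an interaction term:

```r
# Made-up data standing in for mroz
toy = data.frame(lfp = c(1, 0, 1), faminc = c(16310, 21800, 7300), kl6 = c(1, 0, 2))

toy$log_faminc = log(toy$faminc)           # transformation of one column
toy$any_kl6 = ifelse(toy$kl6 > 0, 1, 0)    # 0/1 dummy from a condition
toy$lfp_x_kl6 = toy$lfp * toy$kl6          # interaction of two columns

print(colnames(toy))
print(toy)
```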

R aficionados would probably criticize the above code, since strictly speaking the assignment

x = y

sometimes behaves differently from the recommended way of making an assignment in R:

x <- y

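The arrow is an artifact of the keyboards in use when R's predecessor, S, was designed. I have never encountered a case where x = y doesn’t work as a stand-alone statement, but the two forms do differ inside a function call, where = names an argument rather than assigning a variable.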
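One place where the two forms clearly differ is inside a function call. The sketch below assumes a fresh session with no existing variable called value, and double_it is a hypothetical helper written only for this illustration:

```r
double_it = function(value) value * 2   # hypothetical helper for illustration

# With =, "value" is only the argument's name; no variable is created
r1 = double_it(value = 3)
created_by_equals = exists("value", inherits = FALSE)

# With <-, the assignment runs in the calling environment when the
# argument is evaluated, so a variable called value now exists
r2 = double_it(value <- 3)
created_by_arrow = exists("value", inherits = FALSE)

print(c(r1, r2))
print(c(created_by_equals, created_by_arrow))
```

Both calls return the same result; the difference is the side effect of leaving a variable behind.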

Modifying Variables

Unlike Stata, we simply redefine the variable and don’t need to bother with replace:

mroz$newvar = mroz$newvar/10
print(summary(mroz$newvar))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   0.400   0.741   1.300   3.800 
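Stata users often pair replace with an if condition. In R, the equivalent is to put a logical condition inside the brackets on the left-hand side of the assignment. A minimal sketch on made-up data (toy and its values are hypothetical):

```r
# Made-up data standing in for mroz
toy = data.frame(lfp = c(1, 0, 1, 0), whrs = c(1610, 0, 456, 0))

# Stata: replace whrs = . if lfp == 0
toy$whrs[toy$lfp == 0] = NA

# Rescale in place; NA values stay NA
toy$whrs = toy$whrs / 1000

print(toy)
print(mean(toy$whrs, na.rm = TRUE))   # na.rm = TRUE skips the missing values
```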

Variable Lists

Sometimes we may want to refer to subsets of variables as lists. To display only lfp and whrs in the head command, build a vector of column names with c('lfp', 'whrs') like this:

head(mroz[,c('lfp', 'whrs')], 6)
A data.frame: 6 × 2
    lfp  whrs
  <int> <int>
1     1  1610
2     1  1656
3     1  1980
4     1   456
5     1  1568
6     1  2032
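It is often convenient to store the variable names once and reuse them. The sketch below does this on made-up data; toy and work_vars are hypothetical names chosen for the example:

```r
# Made-up data standing in for mroz
toy = data.frame(lfp = c(1, 0, 1), whrs = c(1610, 0, 456), wa = c(32, 30, 35))

# Store the variable names once, then reuse them
work_vars = c("lfp", "whrs")

print(head(toy[, work_vars], 2))            # select those columns
print(colMeans(toy[, work_vars]))           # column means for the same list
print(setdiff(colnames(toy), work_vars))    # every column not in the list
```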

Peeking inside R objects

Sometimes, we are unaware of the information inside of an R object (like an estimation result). We can use names to see the named items inside the object:

names(mroz)
  1. 'lfp'
  2. 'whrs'
  3. 'kl6'
  4. 'k618'
  5. 'wa'
  6. 'we'
  7. 'ww'
  8. 'rpwg'
  9. 'hhrs'
  10. 'ha'
  11. 'he'
  12. 'hw'
  13. 'faminc'
  14. 'mtr'
  15. 'wmed'
  16. 'wfed'
  17. 'un'
  18. 'cit'
  19. 'ax'
  20. 'newvar'
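The same approach works on an estimation result. As a sketch, the example below fits a simple regression on mtcars, a dataset that ships with R and stands in for course data here, then inspects the fitted object; the model itself is only an illustration.

```r
# mtcars ships with R; it stands in for course data here
fit = lm(mpg ~ wt, data = mtcars)

print(names(fit))          # components stored in the fitted model
print(fit$coefficients)    # pull one component out by name with $
str(fit$df.residual)       # str() gives a compact look at any object
```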

Installing Packages

Each course module uses add-on functionality (referred to as packages). For example, for the OLS section of the course you will need to install the foreign, sandwich, lmtest, stargazer, and boot packages using:

install.packages(c('foreign', 'sandwich', 'lmtest', 'stargazer', 'boot'))

This might take a while to run, and it may be necessary to re-install these packages if you exit JupyterHub and rejoin later.
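To avoid re-installing packages that are already present, one option is to check first. The sketch below is one possible approach, not a course requirement; missing_packages is a hypothetical helper built on requireNamespace, and the install line is left commented out so the block has no side effects beyond loading namespaces.

```r
# Hypothetical helper: return the packages from pkgs that are not installed
missing_packages = function(pkgs) {
  installed = vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  unname(pkgs[!installed])
}

needed = c("foreign", "sandwich", "lmtest", "stargazer", "boot")
to_install = missing_packages(needed)
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```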

Getting help in R

If you know the function you need help with, just use the help function:

help(tail)
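If you do not know the exact function name, a few other base R tools can help. The sketch below shows two that print directly to the console, with further options listed as comments to try interactively:

```r
# List functions whose names start with "read." in the attached packages
readers = apropos("^read\\.")
print(readers)

# Show a function's arguments without opening its full help page
print(args(tail))

# Other ways to reach the documentation (run these interactively):
# ?tail                      # shorthand for help(tail)
# help.search("regression")  # search installed help pages by keyword
# example(tail)              # run the examples from tail's help page
```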