Introduction to R#
This page is intended for students in Cross-section who are unfamiliar with R. While not a complete introduction, it should give you enough background to complete course assignments in R. There are several ways to run R code:
1. From the terminal/command line
2. Using RStudio or some other integrated development environment
3. Using JupyterLab and Jupyter notebooks
In this course, you will be using the William and Mary JupyterHub server, choosing the R kernel, and creating Jupyter notebooks (method 3).
Loading data into R#
The foreign library lets us open several types of data files, including Stata, SAS, and SPSS, to name a few. Comma-delimited files can be read with base R's read.csv, and Excel files require a separate package.
library(foreign)
mroz = read.dta("https://rlhick.people.wm.edu/econ407/data/mroz.dta")
summary(mroz)
lfp whrs kl6 k618
Min. :0.0000 Min. : 0.0 Min. :0.0000 Min. :0.000
1st Qu.:0.0000 1st Qu.: 0.0 1st Qu.:0.0000 1st Qu.:0.000
Median :1.0000 Median : 288.0 Median :0.0000 Median :1.000
Mean :0.5684 Mean : 740.6 Mean :0.2377 Mean :1.353
3rd Qu.:1.0000 3rd Qu.:1516.0 3rd Qu.:0.0000 3rd Qu.:2.000
Max. :1.0000 Max. :4950.0 Max. :3.0000 Max. :8.000
wa we ww rpwg hhrs
Min. :30.00 Min. : 5.00 Min. : 0.000 Min. :0.00 Min. : 175
1st Qu.:36.00 1st Qu.:12.00 1st Qu.: 0.000 1st Qu.:0.00 1st Qu.:1928
Median :43.00 Median :12.00 Median : 1.625 Median :0.00 Median :2164
Mean :42.54 Mean :12.29 Mean : 2.375 Mean :1.85 Mean :2267
3rd Qu.:49.00 3rd Qu.:13.00 3rd Qu.: 3.788 3rd Qu.:3.58 3rd Qu.:2553
Max. :60.00 Max. :17.00 Max. :25.000 Max. :9.98 Max. :5010
ha he hw faminc
Min. :30.00 Min. : 3.00 Min. : 0.4121 Min. : 1500
1st Qu.:38.00 1st Qu.:11.00 1st Qu.: 4.7883 1st Qu.:15428
Median :46.00 Median :12.00 Median : 6.9758 Median :20880
Mean :45.12 Mean :12.49 Mean : 7.4822 Mean :23081
3rd Qu.:52.00 3rd Qu.:15.00 3rd Qu.: 9.1667 3rd Qu.:28200
Max. :60.00 Max. :17.00 Max. :40.5090 Max. :96000
mtr wmed wfed un
Min. :0.4415 Min. : 0.000 Min. : 0.000 Min. : 3.000
1st Qu.:0.6215 1st Qu.: 7.000 1st Qu.: 7.000 1st Qu.: 7.500
Median :0.6915 Median :10.000 Median : 7.000 Median : 7.500
Mean :0.6789 Mean : 9.251 Mean : 8.809 Mean : 8.624
3rd Qu.:0.7215 3rd Qu.:12.000 3rd Qu.:12.000 3rd Qu.:11.000
Max. :0.9415 Max. :17.000 Max. :17.000 Max. :14.000
cit ax
Min. :0.0000 Min. : 0.00
1st Qu.:0.0000 1st Qu.: 4.00
Median :1.0000 Median : 9.00
Mean :0.6428 Mean :10.63
3rd Qu.:1.0000 3rd Qu.:15.00
Max. :1.0000 Max. :45.00
Loading files from disk is a slight variation on the above command. Supposing that your Stata data file mroz.dta is in the folder /some/place, in Linux or macOS we would use the R command
mroz = read.dta("/some/place/mroz.dta")
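It is also possible to open datasets stored in R format using load("/some/place/mroz.RData"), which restores the saved objects under their original names (load returns only those names, so its result should not be assigned to mroz). In all cases in this class, though, we’ll be opening Stata or other common data types using foreign.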
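As a minimal sketch of the R-format case, the snippet below saves a small made-up data frame to a temporary file and loads it back. The data frame toy and its values are hypothetical stand-ins for mroz, and only base R is assumed.

```r
# A small made-up data frame standing in for mroz (hypothetical values)
toy = data.frame(lfp = c(1, 0, 1), whrs = c(1610, 0, 456))

# Save it in R's native format to a temporary file
path = file.path(tempdir(), "toy.RData")
save(toy, file = path)

# Remove the object from the session, then restore it from disk
rm(toy)
loaded_names = load(path)

print(loaded_names)  # the names of the restored objects, not the data
print(toy)           # the data frame is back under its original name
```

Because load puts objects back under the names they were saved with, there is no need to assign its result to anything.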
Viewing Data in R#
Viewing R data at the command line is achieved with the head or tail functions. Here we’ll view the first 5 rows of data:
head(mroz,5)
| | lfp | whrs | kl6 | k618 | wa | we | ww | rpwg | hhrs | ha | he | hw | faminc | mtr | wmed | wfed | un | cit | ax |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | <int> | <int> | <int> | <int> | <int> | <int> | <dbl> | <dbl> | <int> | <int> | <int> | <dbl> | <int> | <dbl> | <int> | <int> | <dbl> | <int> | <int> |
1 | 1 | 1610 | 1 | 0 | 32 | 12 | 3.3540 | 2.65 | 2708 | 34 | 12 | 4.0288 | 16310 | 0.7215 | 12 | 7 | 5.0 | 0 | 14 |
2 | 1 | 1656 | 0 | 2 | 30 | 12 | 1.3889 | 2.65 | 2310 | 30 | 9 | 8.4416 | 21800 | 0.6615 | 7 | 7 | 11.0 | 1 | 5 |
3 | 1 | 1980 | 1 | 3 | 35 | 12 | 4.5455 | 4.04 | 3072 | 40 | 12 | 3.5807 | 21040 | 0.6915 | 12 | 7 | 5.0 | 0 | 15 |
4 | 1 | 456 | 0 | 3 | 34 | 12 | 1.0965 | 3.25 | 1920 | 53 | 10 | 3.5417 | 7300 | 0.7815 | 7 | 7 | 5.0 | 0 | 6 |
5 | 1 | 1568 | 1 | 2 | 31 | 14 | 4.5918 | 3.60 | 2000 | 32 | 12 | 10.0000 | 27300 | 0.6215 | 12 | 14 | 9.5 | 1 | 7 |
Or, the last 6 rows of data:
tail(mroz,6)
| | lfp | whrs | kl6 | k618 | wa | we | ww | rpwg | hhrs | ha | he | hw | faminc | mtr | wmed | wfed | un | cit | ax |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | <int> | <int> | <int> | <int> | <int> | <int> | <dbl> | <dbl> | <int> | <int> | <int> | <dbl> | <int> | <dbl> | <int> | <int> | <dbl> | <int> | <int> |
748 | 0 | 0 | 0 | 2 | 36 | 12 | 0 | 0 | 3120 | 39 | 12 | 1.3013 | 5330 | 0.7915 | 7 | 12 | 14.0 | 0 | 4 |
749 | 0 | 0 | 0 | 2 | 40 | 13 | 0 | 0 | 3020 | 43 | 16 | 9.2715 | 28200 | 0.6215 | 10 | 10 | 9.5 | 1 | 5 |
750 | 0 | 0 | 2 | 3 | 31 | 12 | 0 | 0 | 2056 | 33 | 12 | 4.8638 | 10000 | 0.7715 | 12 | 12 | 7.5 | 0 | 14 |
751 | 0 | 0 | 0 | 0 | 43 | 12 | 0 | 0 | 2383 | 43 | 12 | 1.0898 | 9952 | 0.7515 | 10 | 3 | 7.5 | 0 | 4 |
752 | 0 | 0 | 0 | 0 | 60 | 12 | 0 | 0 | 1705 | 55 | 8 | 12.4400 | 24984 | 0.6215 | 12 | 12 | 14.0 | 1 | 15 |
753 | 0 | 0 | 0 | 3 | 39 | 9 | 0 | 0 | 3120 | 48 | 12 | 6.0897 | 28363 | 0.6915 | 7 | 7 | 11.0 | 1 | 12 |
Or specific rows, using what is called “slice” indexing:
mroz[10:15,]
| | lfp | whrs | kl6 | k618 | wa | we | ww | rpwg | hhrs | ha | he | hw | faminc | mtr | wmed | wfed | un | cit | ax |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | <int> | <int> | <int> | <int> | <int> | <int> | <dbl> | <dbl> | <int> | <int> | <int> | <dbl> | <int> | <dbl> | <int> | <int> | <dbl> | <int> | <int> |
10 | 1 | 1600 | 0 | 2 | 39 | 12 | 4.6875 | 4.15 | 2100 | 43 | 12 | 5.7143 | 20425 | 0.6915 | 7 | 7 | 5.0 | 0 | 21 |
11 | 1 | 1969 | 0 | 1 | 33 | 12 | 4.0630 | 4.30 | 2450 | 34 | 12 | 9.7959 | 32300 | 0.5815 | 12 | 3 | 5.0 | 0 | 15 |
12 | 1 | 1960 | 0 | 1 | 42 | 11 | 4.5918 | 4.58 | 2375 | 47 | 14 | 8.0000 | 28700 | 0.6215 | 14 | 7 | 5.0 | 0 | 14 |
13 | 1 | 240 | 1 | 2 | 30 | 12 | 2.0833 | 0.00 | 2830 | 33 | 16 | 5.3004 | 15500 | 0.7215 | 16 | 16 | 5.0 | 0 | 0 |
14 | 1 | 997 | 0 | 2 | 43 | 12 | 2.2668 | 3.50 | 3317 | 46 | 12 | 4.3413 | 16860 | 0.7215 | 10 | 10 | 7.5 | 1 | 14 |
15 | 1 | 1848 | 0 | 1 | 43 | 10 | 3.6797 | 3.38 | 2024 | 45 | 17 | 10.8700 | 31431 | 0.5815 | 7 | 7 | 7.5 | 1 | 6 |
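Row indexing in R is 1-based and includes both endpoints, which is why 10:15 returns six rows. The sketch below shows a few variations on a made-up data frame (toy and its values are hypothetical, not taken from mroz).

```r
# Made-up data for illustration only
toy = data.frame(id = 1:8,
                 whrs = c(1610, 1656, 1980, 456, 1568, 2032, 1440, 1020))

print(toy[3:5, ])        # rows 3 through 5, all columns
print(toy[c(1, 8), ])    # non-adjacent rows picked by position
print(toy[3:5, "whrs"])  # one column for those rows (returned as a vector)
print(toy[-(1:6), ])     # negative indices drop rows 1 through 6
```

The empty slot after the comma means "all columns"; leaving the comma out would index columns instead of rows.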
Or rows meeting logical conditions. Let’s look at the first 10 rows where the respondent has children under 6 years old:
head(mroz[mroz$kl6>0,],10)
| | lfp | whrs | kl6 | k618 | wa | we | ww | rpwg | hhrs | ha | he | hw | faminc | mtr | wmed | wfed | un | cit | ax |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | <int> | <int> | <int> | <int> | <int> | <int> | <dbl> | <dbl> | <int> | <int> | <int> | <dbl> | <int> | <dbl> | <int> | <int> | <dbl> | <int> | <int> |
1 | 1 | 1610 | 1 | 0 | 32 | 12 | 3.3540 | 2.65 | 2708 | 34 | 12 | 4.0288 | 16310 | 0.7215 | 12 | 7 | 5.0 | 0 | 14 |
3 | 1 | 1980 | 1 | 3 | 35 | 12 | 4.5455 | 4.04 | 3072 | 40 | 12 | 3.5807 | 21040 | 0.6915 | 12 | 7 | 5.0 | 0 | 15 |
5 | 1 | 1568 | 1 | 2 | 31 | 14 | 4.5918 | 3.60 | 2000 | 32 | 12 | 10.0000 | 27300 | 0.6215 | 12 | 14 | 9.5 | 1 | 7 |
13 | 1 | 240 | 1 | 2 | 30 | 12 | 2.0833 | 0.00 | 2830 | 33 | 16 | 5.3004 | 15500 | 0.7215 | 16 | 16 | 5.0 | 0 | 0 |
25 | 1 | 1955 | 1 | 1 | 31 | 12 | 2.1545 | 2.30 | 2024 | 31 | 12 | 4.0884 | 12487 | 0.7515 | 12 | 7 | 5.0 | 1 | 4 |
29 | 1 | 1516 | 1 | 0 | 31 | 17 | 7.2559 | 6.00 | 2390 | 30 | 17 | 6.2762 | 26100 | 0.6215 | 12 | 12 | 5.0 | 0 | 7 |
41 | 1 | 112 | 1 | 2 | 30 | 12 | 2.6786 | 0.00 | 4030 | 33 | 16 | 3.8462 | 15810 | 0.7215 | 12 | 12 | 3.0 | 0 | 1 |
43 | 1 | 583 | 1 | 2 | 31 | 16 | 2.5729 | 9.98 | 1530 | 34 | 16 | 13.7250 | 24000 | 0.6615 | 14 | 16 | 9.5 | 1 | 6 |
74 | 1 | 608 | 2 | 4 | 34 | 10 | 8.2237 | 3.00 | 1304 | 38 | 9 | 3.3742 | 15200 | 0.7915 | 0 | 0 | 7.5 | 1 | 11 |
79 | 1 | 90 | 2 | 2 | 32 | 15 | 1.0000 | 0.00 | 2350 | 31 | 14 | 4.8787 | 13755 | 0.7515 | 10 | 12 | 7.5 | 1 | 9 |
Note that this is achieved using logical addressing, where only rows having the logical value TRUE are included. So for the first five rows of mroz, only rows 1, 3, and 5 have at least one young child and would be displayed above:
head(mroz$kl6>0,5)
- TRUE
- FALSE
- TRUE
- FALSE
- TRUE
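The same idea can be sketched on a small made-up data frame, which makes it easy to check each step by eye. The names toy, has_young, and young_and_under33 are hypothetical; only the column names kl6 and wa are borrowed from mroz.

```r
# Made-up data: number of young children (kl6) and wife's age (wa)
toy = data.frame(kl6 = c(1, 0, 1, 0, 2),
                 wa  = c(32, 30, 35, 34, 31))

# The comparison produces one TRUE/FALSE value per row
has_young = toy$kl6 > 0
print(has_young)

# Using that vector in the row position keeps only the TRUE rows
print(toy[has_young, ])

# Conditions can be combined with & (and) or | (or)
young_and_under33 = toy[toy$kl6 > 0 & toy$wa < 33, ]
print(young_and_under33)

# Summing a logical vector counts the TRUE values
print(sum(has_young))
```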
Creating and Modifying Variables#
Creating Variables#
In Stata, you need to start a new variable with generate. In R, just assign the variable:
mroz$newvar = mroz$lfp * mroz$ax
print(colnames(mroz))
print(summary(mroz))
[1] "lfp" "whrs" "kl6" "k618" "wa" "we" "ww" "rpwg"
[9] "hhrs" "ha" "he" "hw" "faminc" "mtr" "wmed" "wfed"
[17] "un" "cit" "ax" "newvar"
lfp whrs kl6 k618
Min. :0.0000 Min. : 0.0 Min. :0.0000 Min. :0.000
1st Qu.:0.0000 1st Qu.: 0.0 1st Qu.:0.0000 1st Qu.:0.000
Median :1.0000 Median : 288.0 Median :0.0000 Median :1.000
Mean :0.5684 Mean : 740.6 Mean :0.2377 Mean :1.353
3rd Qu.:1.0000 3rd Qu.:1516.0 3rd Qu.:0.0000 3rd Qu.:2.000
Max. :1.0000 Max. :4950.0 Max. :3.0000 Max. :8.000
wa we ww rpwg hhrs
Min. :30.00 Min. : 5.00 Min. : 0.000 Min. :0.00 Min. : 175
1st Qu.:36.00 1st Qu.:12.00 1st Qu.: 0.000 1st Qu.:0.00 1st Qu.:1928
Median :43.00 Median :12.00 Median : 1.625 Median :0.00 Median :2164
Mean :42.54 Mean :12.29 Mean : 2.375 Mean :1.85 Mean :2267
3rd Qu.:49.00 3rd Qu.:13.00 3rd Qu.: 3.788 3rd Qu.:3.58 3rd Qu.:2553
Max. :60.00 Max. :17.00 Max. :25.000 Max. :9.98 Max. :5010
ha he hw faminc
Min. :30.00 Min. : 3.00 Min. : 0.4121 Min. : 1500
1st Qu.:38.00 1st Qu.:11.00 1st Qu.: 4.7883 1st Qu.:15428
Median :46.00 Median :12.00 Median : 6.9758 Median :20880
Mean :45.12 Mean :12.49 Mean : 7.4822 Mean :23081
3rd Qu.:52.00 3rd Qu.:15.00 3rd Qu.: 9.1667 3rd Qu.:28200
Max. :60.00 Max. :17.00 Max. :40.5090 Max. :96000
mtr wmed wfed un
Min. :0.4415 Min. : 0.000 Min. : 0.000 Min. : 3.000
1st Qu.:0.6215 1st Qu.: 7.000 1st Qu.: 7.000 1st Qu.: 7.500
Median :0.6915 Median :10.000 Median : 7.000 Median : 7.500
Mean :0.6789 Mean : 9.251 Mean : 8.809 Mean : 8.624
3rd Qu.:0.7215 3rd Qu.:12.000 3rd Qu.:12.000 3rd Qu.:11.000
Max. :0.9415 Max. :17.000 Max. :17.000 Max. :14.000
cit ax newvar
Min. :0.0000 Min. : 0.00 Min. : 0.00
1st Qu.:0.0000 1st Qu.: 4.00 1st Qu.: 0.00
Median :1.0000 Median : 9.00 Median : 4.00
Mean :0.6428 Mean :10.63 Mean : 7.41
3rd Qu.:1.0000 3rd Qu.:15.00 3rd Qu.:13.00
Max. :1.0000 Max. :45.00 Max. :38.00
Note that a new column called newvar is now part of the data.
R aficionados would probably criticize the above code, since strictly speaking the assignment
x = y
sometimes behaves differently from the recommended R way of making an assignment:
x <- y
The arrow is an artifact of older keyboards that had a dedicated arrow key, dating back to S, the language R descends from. I have never encountered a case where x = y doesn’t work at the top level of a script, but the two do differ inside a function call, where = names an argument instead of assigning a variable.
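One place where the two operators differ is inside a function call. The sketch below uses a hypothetical helper, double_it, and assumes a fresh session in which no variable called value exists yet.

```r
# At the top level, both forms create a variable
a = 5
b <- 5
print(identical(a, b))

# Hypothetical helper with a single argument called value
double_it = function(value) value * 2

# Inside a call, = only names the argument; no variable is created
r1 = double_it(value = 4)
print(exists("value", inherits = FALSE))

# Inside a call, <- assigns a variable in the calling environment
# and then passes its value to the function
r2 = double_it(value <- 4)
print(exists("value", inherits = FALSE))
print(value)
```

For the simple top-level assignments used on this page, the two forms are interchangeable.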
Modifying Variables#
Unlike Stata, we simply redefine the variable and don’t need to bother with replace:
mroz$newvar = mroz$newvar/10
print(summary(mroz$newvar))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 0.400 0.741 1.300 3.800
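Stata users often modify a variable only for some observations (replace ... if). A rough R equivalent, sketched here on made-up data, combines assignment with the logical indexing shown earlier. The data frame toy and the column high_ax are hypothetical.

```r
# Made-up data using two column names from mroz
toy = data.frame(lfp = c(1, 0, 1, 0),
                 ax  = c(14, 5, 15, 6))

# Create a variable, then overwrite it in place
toy$newvar = toy$lfp * toy$ax
toy$newvar = toy$newvar / 10

# Change only the rows that meet a condition
toy$newvar[toy$lfp == 0] = NA

# ifelse() builds a new variable from a condition, row by row
toy$high_ax = ifelse(toy$ax > 10, 1, 0)

print(toy)
```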
Variable Lists#
Sometimes we may want to refer to subsets of variables as lists. To display only lfp and whrs in the head command, use c('lfp', 'whrs'), which builds a character vector of column names, like this:
head(mroz[,c('lfp', 'whrs')], 6)
| | lfp | whrs |
---|---|---|
| | <int> | <int> |
1 | 1 | 1610 |
2 | 1 | 1656 |
3 | 1 | 1980 |
4 | 1 | 456 |
5 | 1 | 1568 |
6 | 1 | 2032 |
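A variable list can also be stored under a name and reused, much like a Stata local holding a varlist. This is a sketch on made-up data; toy, work_vars, and other_vars are hypothetical names.

```r
# Made-up data using three column names from mroz
toy = data.frame(lfp  = c(1, 1, 0),
                 whrs = c(1610, 1656, 0),
                 kl6  = c(1, 0, 0))

# Store the column names once...
work_vars = c('lfp', 'whrs')

# ...and reuse them wherever a subset of columns is needed
print(head(toy[, work_vars], 2))
print(summary(toy[, work_vars]))

# setdiff() gives every column not in the list
other_vars = setdiff(colnames(toy), work_vars)
print(other_vars)
```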
Peeking inside R objects#
Sometimes, we are unaware of the information inside of an R object (like an estimation result). We can use names to see the named items inside the object:
names(mroz)
- 'lfp'
- 'whrs'
- 'kl6'
- 'k618'
- 'wa'
- 'we'
- 'ww'
- 'rpwg'
- 'hhrs'
- 'ha'
- 'he'
- 'hw'
- 'faminc'
- 'mtr'
- 'wmed'
- 'wfed'
- 'un'
- 'cit'
- 'ax'
- 'newvar'
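The same approach works on an estimation result. As a sketch, the code below fits a simple regression with lm on made-up data (toy and fit are hypothetical names) and then looks inside the fitted object.

```r
# Made-up data for a simple regression
toy = data.frame(x = c(1, 2, 3, 4, 5),
                 y = c(2.1, 3.9, 6.2, 7.8, 10.1))

fit = lm(y ~ x, data = toy)

# List the named pieces stored in the result
print(names(fit))

# Pull out one piece by name with $
print(fit$coefficients)

# str() gives a compact view of what a piece contains
str(fit$residuals)
```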
Installing Packages#
Each course module uses add-on functionality (referred to as packages). For example, for the OLS section of the course you will need to install the foreign, sandwich, lmtest, stargazer, and boot packages using:
install.packages(c('foreign', 'sandwich', 'lmtest', 'stargazer', 'boot'))
This might take a while to run, and it may be necessary to re-install these packages if you exit JupyterHub and rejoin later.
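To avoid re-installing packages that are already present, one option is to check first. This is a sketch rather than a course requirement: the flag do_install is a hypothetical switch, set to FALSE here so that the snippet only reports what is missing.

```r
pkgs = c('foreign', 'sandwich', 'lmtest', 'stargazer', 'boot')

# Names of all packages currently installed
installed = rownames(installed.packages())

# Packages from the list that are not yet installed
to_install = setdiff(pkgs, installed)
print(to_install)

# Hypothetical switch: set to TRUE to actually install the missing packages
do_install = FALSE
if (do_install && length(to_install) > 0) {
  install.packages(to_install)
}
```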
Getting help in R#
If you know the function you need help with, just use the help function:
help(tail)
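If you only half remember a function's name, a couple of base R tools can narrow it down. The sketch below assumes nothing beyond base R; matches is a hypothetical variable name.

```r
# Find objects whose names contain "tail"
matches = apropos("tail")
print(matches)

# Show a function's arguments without opening the full help page
print(args(head))

# In an interactive session, ?tail is shorthand for help(tail),
# and ??regression searches the installed help pages for a keyword.
```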