---
title: "Getting your data into R"
author: "Anthony Staines & Tim Downing"
date: "07/09/2016"
output:
pdf_document:
number_sections: yes
toc: yes
html_document:
toc: yes
---
```{r Setup, message=FALSE,warning=FALSE,echo=FALSE,results='hide'}
rm(list=ls())
set.seed(7645378)
options(digits=4)
# setwd("~/Dropbox/Rcourse2016/2-GettingDataIntoR")
# change this, eg setwd("/home/you/iheartr/")
# install.packages("ggplot2") # run this just once to install the package
library(ggplot2)
# install.packages("plyr") # run this just once to install the package
library(plyr)
# install.packages("dplyr") # run this just once to install the package
library(dplyr)
# install.packages("stringr") # run this just once to install the package
library(stringr)
# install.packages("reshape2") # run this just once to install the package
library(reshape2)
# install.packages("knitr") # run this just once to install the package
library(knitr)
# install.packages("mi") # run this just once to install the package
library(mi)
# install.packages("pander") # run this just once to install the package
library(pander)
# install.packages("hflights") # run this just once to install the package
library(hflights)
# install.packages("car") # run this just once to install the package
library(car)
opts_chunk$set(dev="png", dev.args=list(type="cairo"), dpi=300)
opts_chunk$set(error=TRUE,warning=FALSE,message=FALSE,echo=TRUE,cache=TRUE,digits=2)
```
Getting data into R
====================================
Let's read in a CSV file containing some data similar to the "Davis" dataset:
```{r}
data <- read.csv("~/Dropbox/Rcourse2016/2-GettingDataIntoR/data/Data.csv")
```
Getting data into CSV files
----------------
R can read data in all sorts of formats, from many sources, including SPSS, Stata, SAS (sort of), MySQL, MariaDB, SQLite, Oracle, Excel, ftp servers, web pages and lots more. However, we are only talking about CSV files today. If you need to read in data from another statistics package, look at `foreign`. If you need to use a database, look at `dplyr`, `RODBC`, `ROracle`, and `RMySQL` for some guidance.
We make CSV files, as do most people I know, from spreadsheets. One of us uses OpenOffice, the other Excel, but it makes no difference. Both offer a 'Save as' option, and we pick CSV files from that.
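As a minimal sketch of the options you will use most often with `read.csv()`, here is a round trip through a throwaway temporary file (the data frame is made up for illustration):

```{r}
# Write a small data frame out as CSV, then read it back in.
tmp <- tempfile(fileext = ".csv")          # a throwaway file, just for illustration
write.csv(data.frame(id = 1:3, height = c(165, 172, 180)),
          tmp, row.names = FALSE)
back <- read.csv(tmp,
                 header = TRUE,            # the first row holds the column names
                 na.strings = c("", "NA")) # treat empty cells as missing
str(back)
```

For your own data, replace `tmp` with the path to the CSV file you saved from your spreadsheet.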
# Now get your own data into R
# R in daily use
R is, among other things, a powerful programming language, and a pretty good calculator. We're going to ask you to do a few bits in R now, and see how you get on. If you are already an experienced programmer, R can confuse: it is a functional programming language, more like LISP and Scheme than C or Pascal. R works, at its most fundamental level, on vectors.
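A tiny sketch of what "works on vectors" means: arithmetic and comparisons apply to every element at once, with no loop needed (the values here are made up):

```{r}
heights_in <- c(65, 68, 71)  # three heights, in inches
heights_in * 2.54            # converted to centimetres, all at once
heights_in > 67              # comparisons are vectorised too
mean(heights_in)             # most functions accept whole vectors
```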
## Making dataframes
A data frame is a list, containing one or more vectors, each of which is the same length. Basically it's a table of data. Typically a dataframe is a set of observations on things, people, sites, cells, gels or whatever, with one row per thing, and one column per observation.
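For example, a data frame can be built directly from equal-length vectors (these toy values are made up):

```{r}
name  <- c("a", "b", "c")          # three observations...
ht_cm <- c(165, 172, 180)          # ...and one column per measurement
wt_kg <- c(60, 70, 85)
people <- data.frame(name, ht_cm, wt_kg)
people
str(people)                        # one row per thing, one column per observation
```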
```{r}
library(car)
data(Davis)
View(Davis)
str(Davis)
summary(Davis)
# `tbl_df` is a function that creates a "local data frame".
# This is just a wrapper for a data frame that prints nicely
# convert to local data frame
tbl_df(Davis)
# you can specify that you want to see more rows
print(tbl_df(Davis), n=4)
# convert to a normal data frame to see all of the columns
data.frame(head(Davis))
```
# Missing data
The Davis dataset has some missing data: a few people did not give self-reported heights or weights. There is a useful package called `mi` which allows us to do some very clever things with missing data, but here we are just going to look and see which data are missing.
```{r, fig.width = 3, fig.height = 3}
d <- missing_data.frame(Davis)
image(d)
rm(d)
```
Let's check this in more detail. Missing data in R are represented by the two letters `NA`. There's a useful function called `is.na()`, which returns TRUE for any missing value and FALSE for any non-missing value. With this function we can make tables, and do chi-squared tests and Fisher's exact tests, very easily.
```{r}
is.na(Davis$repwt) # creates a long vector of TRUE or FALSE
length(is.na(Davis$repwt)) # how long is the vector?
# is.na() tests for missingness: give it a vector and it returns TRUE or FALSE
# for each element. (The assignment form, is.na(x) <- index, does the opposite:
# it sets the indexed elements of x to missing.)
sum(is.na(Davis$repwt)) # 17 missing reported weights
sum(is.na(Davis$repht)) # 17 missing reported heights
sum(is.na(Davis$weight)) # no missing weights
table(is.na(Davis$repwt)) # 183 valid reported weights
table(is.na(Davis$repht)) # 183 valid reported heights
table(is.na(Davis$repwt),is.na(Davis$repht)) # 181 valid reported weights and heights
table(Davis$sex,is.na(Davis$repwt)) # Missing reported weights: 11 are F and 6 are M
table(Davis$sex,is.na(Davis$repht)) # Missing reported heights: 11 are F and 6 are M
chisq.test(table(is.na(Davis$repwt),is.na(Davis$repht))) # X-squared = 140, p<2e-16
chisq.test(table(Davis$sex,is.na(Davis$repwt))) # X-squared = 0.25, p=0.6
chisq.test(table(Davis$sex,is.na(Davis$repht))) # X-squared = 0.25, p=0.6
fisher.test(table(is.na(Davis$repwt),is.na(Davis$repht))) # FET p<2e-16
# more at http://blogisticreflections.wordpress.com/2009/09/21/r-function-of-the-day-table/
# Let's try sampling values with sample()
sample(1:100, 20) # draw 20 integers, without replacement, from 1 to 100
sample(1:1000, 20) # draw 20 integers, without replacement, from 1 to 1000
sample(1:1000, 14) # draw 14 integers, without replacement, from 1 to 1000
Davis$repwt[sample(1:100, 20)] # sample 20 of the first 100 reported weights
# let's select the reported missing data to examine it
# reload Davis with data(Davis) if you want to re-load the original data
subset(Davis, is.na(repwt) | is.na(repht))
nrow(subset(Davis, is.na(repwt) | is.na(repht)))
# rows missing a reported weight OR a reported height = 19 rows
subset(Davis, is.na(repwt) & is.na(repht))
nrow(subset(Davis, is.na(repwt) & is.na(repht)))
# rows missing a reported weight AND a reported height = 15 rows
# the useNA argument shows the missing values, too
table(Davis$repwt < 60, useNA = "always")
table(Davis$repht > 160, useNA = "always")
# the table can also be broken down by a further variable, here sex
table(Davis$repwt < 60, useNA = "always", Davis$sex)
# we can even add in new columns to Davis associated with specific criteria
Davis$repwt60 <- Davis$repwt < 60
```
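A related base-R shortcut, shown here on a made-up toy data frame, is `complete.cases()`, which flags the rows with no missing values in any column:

```{r}
toy <- data.frame(x = c(1, NA, 3), y = c(10, 20, NA))
complete.cases(toy)        # TRUE only where every column is observed
toy[complete.cases(toy), ] # keep the fully observed rows
sum(!complete.cases(toy))  # count the rows with at least one NA
```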
Data cleaning
----------------
# Revision from the Introductory session
Now we will look at what we did yesterday so that we are all progressing at the same pace.
```{r, fig.width = 3, fig.height = 3}
# data.frame(Davis$height,Davis$weight)
# create a dataframe from heights and weights only
data.frame(Davis$height,Davis$weight)[1:5,]
# look at dataframe for heights and weights - samples 1-5
# Let's look further at it:
# First subset rows of smallest 50%
# sort rows of smallest 50%
# select columns we want
# create new variables, eg "weight_diff"
# summarise details of interest
Davis %>% filter(height < median(height)) %>% arrange(desc(height)) %>%
  select(sex, weight, repwt) %>% mutate(weight_diff = weight - repwt) %>%
  group_by(sex) %>% summarise(mean = mean(weight))
# Let's use ggplot to examine it:
# assign Davis heights vs weights using ggplot2
x <- ggplot(Davis,aes(x=weight, y=height, colour=sex)) + geom_jitter() + geom_line() + stat_smooth(span=0.5) + ggtitle('heights v weights') + xlab('weight') + ylab('height')
# plot it (this way avoids errors)
ggsave(filename = 'heights-weights-plot1.png', plot=x, dpi=1200)
# create a separate boxplot with flipped axes for fun
x2 <- ggplot(Davis,aes(x=weight, y=height, colour=sex)) + geom_boxplot() + coord_flip() + facet_wrap(~sex) + ggtitle('heights v weights')
# plot it (NB not all plot types are sensible)
ggsave(filename = 'heights-weights-plot2.png', plot=x2, dpi=1200)
# create a function because collaborators insist on using units of nautical miles
scale_h <- function(height) { # the function takes some vector of numbers called "height"
return(height/185200) # and divides it by 185200, and then returns these values
}
# create function to scale heights as nautical miles
Davis$height_nm <-scale_h(Davis$height)
# assign a new variable with nautical mile heights
Davis$height_nm[1:5]
# always check the output
scale_h(c(1,2,3))
# numeric vectors work fine
scale_h(c("abc",2,3))
# character vectors give an error - c() has coerced everything to character
data(Davis) # re-load the original data, in case you changed it
# Calculate observed and reported BMI from these data
Davis <- within(Davis, {BMI <- weight/(height/100)^2
repBMI <- repwt/(repht/100)^2})
# within() lets us compute new columns using the data frame's own variables
# ggplot to see the values
ggplot(Davis,aes(x=BMI,y=repBMI)) + geom_point(shape=1) + ggtitle('Davis - BMI')
ggplot(Davis,aes(x=BMI,y=repBMI,color=sex)) + geom_point(shape=1) + ggtitle('Davis - BMI by sex')
# So what's up with that value?
ggplot(Davis,aes(x=height,y=weight)) + geom_point(shape=1) + ggtitle('Davis - measured height v weight')
ggplot(Davis,aes(x=height,y=weight,color=sex)) + geom_point(shape=1) + ggtitle('Davis - measured height v weight by sex')
```
What is going on here? There is, apparently, one very short woman who weighs quite a lot. A BMI over 500 is not really credible. Such observations are called 'outliers', for obvious enough reasons. The right thing to do is to go back to the original data collection sheets and have a look. However, these data were collected in the late 1980s in the University of York, so that's not feasible. Looking more closely at the data we get this:
```{r}
Davis[Davis$height<100,]
```
These data include reported height and weight, as well as measured height and weight. It looks as if the measured height and weight have been transposed at some time. We can do three things. We can fix it, we can leave out this person, or we can use regression methods not heavily influenced by outliers. Which of these is right depends on what you are doing.
```{r, fig.width = 3, fig.height = 3}
# Never, ever, ever, ever, overwrite or directly edit the data in your original dataset
Davis2 <- Davis
#This is the row we want
Davis2[12,]
# This is the weight and the height
Davis2[12,]$weight
Davis2[12,]$height
# Fix them
Davis2[12,]$weight <- 57
Davis2[12,]$height <- 166
# Check the fix
Davis2[12,]$weight
Davis2[12,]$height
# and check again
Davis[12,]
Davis2[12,]
# Don't forget to redo the BMIs
Davis2 <- within(Davis2, {BMI <- weight/(height/100)^2
repBMI <- repwt/(repht/100)^2 } )
# and to see what we have now
ggplot(Davis2,aes(x=BMI,y=repBMI)) + geom_point(shape=1) + ggtitle('Davis - BMI measured v reported')
ggplot(Davis2,aes(x=BMI,y=repBMI,color=sex)) + geom_point(shape=1)+ ggtitle('Davis - BMI - measured vs reported by sex')
# differences between reported and measured height and weight.
Davis2 <- within(Davis2,{dht <- height - repht
dwt <- weight - repwt})
```
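The second option, leaving the person out, is a one-line subset. A sketch on a made-up toy data frame (with Davis you would drop the offending row in the same way):

```{r}
toy <- data.frame(height = c(166, 57, 170), weight = c(57, 166, 75))
toy[-2, ]                        # drop row 2 by negative indexing
subset(toy, height > 100)        # or keep only the rows passing a sanity check
```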
There is only one grouping variable in this dataset (sex), but it might be interesting to make another: overweight, for example. We use the `cut()` function for this. We have to feed it the numbers at which to cut, so we write a small function, `tertile()`, which in turn uses the `quantile()` function. This makes our code easier to read.
```{r, fig.width = 4, fig.height = 4}
#Make a function to define tertiles
tertile <- function(x,...) {
quantile(x,
probs=c(0/3,1/3,2/3,3/3),
...)
}
tertile(Davis2$BMI)
tertile(Davis2$repBMI,na.rm=TRUE)
Davis2 <- within(Davis2, {
Group <- cut(BMI,breaks=tertile(BMI,na.rm=TRUE),labels=c('light','middling','heavy'))
repGroup <- cut(repBMI,tertile(repBMI,na.rm=TRUE),labels=c('rlight','rmiddling','rheavy'))
})
table(Davis2$repGroup,Davis2$Group,useNA="ifany") # Let's look at what we have
```
# DPLYR - data processing from yesterday
```{r, fig.width = 3, fig.height = 3}
library(dplyr)
Davis2 %>% summarise_each(funs(mean),height,weight,repht,repwt)
# get mean heights and weights
Davis2 %>% summarise_each(funs(mean(.,na.rm=TRUE)),height,weight,repht,repwt)
# get mean heights and weights ignoring missing data
Davis2 %>% group_by(sex) %>% summarise_each(funs(mean(.,na.rm=TRUE)),height,weight,repht,repwt)
# get mean heights and weights ignoring missing data by sex
Davis2 %>% group_by(sex) %>% filter(is.na(repht),is.na(repwt)) %>% summarise_each(funs(mean),height,weight)
# mean measured heights and weights, by sex, for those missing both reported values
```
See more at www.dataschool.io/dplyr-tutorial-for-faster-data-manipulation-in-r/
and at www.youtube.com/watch?v=jWjqLW-u3hc on YouTube
and at http://blog.rstudio.org/2014/01/17/introducing-dplyr/
We don't cover:
- Joins: inner join, left join, semi-join, anti-join
- Window functions for calculating ranking, offsets, and more
A quick glance at linear regression
====================================
Linear regression is the basis of all the more complicated regression analyses that get used. The basic idea is to look for a simple relationship between two or more variables. One is the y variable, or the dependent variable, which you are trying to explain. The others are the x-variables, with which you hope to understand some of the sources of variation in the y-variable.
This is all expressed in a very standard form, which looks like an equation, or a line from a programming language, but is actually neither.
```{r, fig.width = 3, fig.height = 3}
# y depends on one variable x1
y ~ x1
# y depends on three variables x1, x2 and x3
y ~ x1 + x2 + x3
# y depends on two variables x1, and x2, but the effect of x1 depends on the level
# of x2 (and vice versa)
y ~ x1 + x2 + x1:x2
# same thing more compactly
y ~ x1*x2
lm1 <- with(Davis2,lm(weight~height))
summary(lm1) # let's examine weight as a function of height
# Coefficients Estimate Std. Error t value Pr(>|t|)
# (Intercept) -130.9104 11.5279 -11.4 <2e-16 ***
# height 1.1501 0.0675 17.0 <2e-16 ***
# residual standard error: 8.5 on 198 degrees of freedom
# Multiple R-squared: 0.595, Adjusted R-squared: 0.593
# F-statistic: 290 on 1 and 198 DF, p-value: <2e-16
ggplot(Davis2,aes(x=height,y=weight)) + geom_point() + geom_smooth(method='lm')
anova(lm1) # ANOVA = analysis of variance
# Df Sum Sq Mean Sq F value Pr(>F)
# height 1 21001 21001 290 <2e-16 ***
# Residuals 198 14321 72
# Let's compare it to the old data
lm_old <- with(Davis,lm(weight~height))
ggplot(Davis,aes(x=height,y=weight)) + geom_point() + geom_smooth(method='lm')
summary(lm_old) # let's examine weight as a function of height
# Coefficients Estimate Std. Error t value Pr(>|t|)
# (Intercept) 25.2662 14.9504 1.69 0.0926 .
# height 0.2384 0.0877 2.72 0.0072 **
# residual standard error: 14.9 on 198 degrees of freedom
# Multiple R-squared: 0.036, Adjusted R-squared: 0.0311
# F-statistic: 7.39 on 1 and 198 DF, p-value: 0.00715
# plot the old data
ggplot(Davis, aes(x=height, y=weight)) +
geom_point(shape=1) +
scale_colour_hue(l=50) + # Use a slightly darker palette than normal
geom_smooth(method=lm, # Add linear regression lines
se=FALSE) # Don't add shaded confidence region
# plot the revised data
ggplot(Davis2, aes(x=height, y=weight)) +
geom_point(shape=1) +
scale_colour_hue(l=50) + # Use a slightly darker palette than normal
geom_smooth(method=lm, # Add linear regression lines
se=FALSE) # Don't add shaded confidence region
```
This is sometimes called 'leverage'. The outlier is bending the regression line up towards itself, giving a very false impression of what is going on. There are other regression methods, known as robust methods, which are not nearly as badly affected by outliers as linear regression is.
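One such robust method is `rlm()` from the MASS package (which ships with R). This sketch, on simulated data, shows a single outlier dragging the least-squares slope upwards while the robust fit stays near the true value of 2:

```{r}
set.seed(1)
x <- 1:20
y <- 2 * x + rnorm(20)           # the true slope is 2
y[20] <- 100                     # plant one gross outlier
coef(lm(y ~ x))                  # least squares is pulled towards the outlier
coef(MASS::rlm(y ~ x))           # the robust fit largely ignores it
```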
```{r, fig.width = 3, fig.height = 3}
# plot the old data with loess regression
ggplot(Davis, aes(x=height, y=weight)) +
geom_point(shape=1) +
scale_colour_hue(l=50) + # Use a slightly darker palette than normal
geom_smooth(method=loess, # Add a loess regression line
se=TRUE) # Do add a shaded confidence region
# plot the revised data with loess regression
ggplot(Davis2, aes(x=height, y=weight)) +
geom_point(shape=1) +
scale_colour_hue(l=50) + # Use a slightly darker palette than normal
geom_smooth(method=loess, # Add a loess regression line
se=TRUE) # Do add a shaded confidence region
lm_old <- with(Davis,lm(weight~height)) # weak effect of height on weight
lm_old # Intercept=25.266 height=0.238
lm1 <- with(Davis2,lm(weight~height)) # strong effect of height on weight
lm1 # Intercept=-130.91 height=1.15
anova(lm1,lm_old) # RSS is much better (ie smaller) for lm1
# plot it again
# plot the old data
ggplot(Davis, aes(x=height, y=weight, color=sex)) +
geom_point(shape=1) +
scale_colour_hue(l=50) + # Use a slightly darker palette than normal
geom_smooth() # Loess smooth
# plot the revised data
ggplot(Davis2, aes(x=height, y=weight, color=sex)) +
geom_point(shape=1) +
scale_colour_hue(l=50) + # Use a slightly darker palette than normal
geom_smooth() # Loess smooth
# get the 95% confidence intervals
confint(lm_old)
confint(lm1)
```