Quantitative Data Analysis using R Guide

Quantitative Data Analysis using R
By Taddesse Kassahun
Department of Statistics
Addis Ababa University
1/209

2/209
Introduction

3/209
What are quantitative data?
Quantitative data refer to a set of values of observations that
can be counted or measured.
They answer questions of the following form.
▶ How many?
▶ How often?
▶ How much?
Quantitative data are mainly collected for statistical analysis.
Some examples of quantitative data:
▶ Number of times students in a college updated their phones in a
quarter.
▶ Percentage increase in revenue of wholesalers with the inclusion
of a new product.
▶ Price of Teff/kg in different kebeles of a region.

4/209
Analysis methods of quantitative data
Quantitative data can be analyzed by using descriptive and
inferential methods.
Descriptive analysis can be performed using tables, graphs, and
summary measures.
Tables can be classified into two broad classes: simple and
complex tables.
Commonly used graphs: bar, pie, histogram, boxplot, and line.
Some widely used summary measures include mean, median,
mode, frequency, minimum, maximum, total, standard deviation,
range, and percentage.
Inferential methods involve estimation, model fitting, and
hypothesis testing.

5/209
Steps in inferential analysis

6/209
Why R for data management and analysis?

7/209
Where to find R?
R is freely available from the Comprehensive R Archive Network
(CRAN) at http://cran.r-project.org.
Once R is installed, go to
https://www.rstudio.com/products/rstudio/download/
to install RStudio.

8/209
RStudio layout
The location of each pane can be customized by clicking Tools
⇒ Global Options ⇒ Pane Layout

9/209
RStudio layout - console
All the output generated by R except plots goes to the console.
R evaluates all code in the console.
R function calls can be entered into the console to
produce output. For example, try the following.
> print("Hello IFA") gives [1] "Hello IFA"

10/209
RStudio layout - console
We can enter commands one at a time at the command prompt
(>).
We may also use R as a calculator. Try the following:
> 5+4 gives [1] 9
> 5-4 gives [1] 1
> 5*4 gives [1] 20
> 5/4 gives [1] 1.25
Entering longer pieces of code into the console is not
recommended. Instead, use the script editor (source) window.
Create an R script by clicking File ⇒ New File ⇒ R Script.
In general, execute code from saved R scripts for the sake of
reproducibility.

11/209
RStudio layout - source
The R scripts are written in the source pane where we can run a
set of commands at a time.
A command in the source pane can be executed by selecting the
line and pressing Ctrl + Enter or hitting the Run icon.

12/209
RStudio layout - source
The corresponding result should be seen in the console window.
Comments can be added after the hash sign (#).
> print("Hello IFA") # this function prints Hello IFA.
To comment out multiple lines of code, highlight them and
then press Ctrl + Shift + C.
Save R scripts by clicking the save shortcut or pressing Ctrl +
S and giving a file name, e.g., "test.R".
A time stamp can be included in R scripts by writing ts and then
pressing the Shift + Tab keys.
To open R scripts, click on File ⇒ Open File ...
KEEP ALL THE CODE YOU USE IN R SCRIPTS!

13/209
Environment/history/connection/tutorial
Understanding environments is not necessary for day-to-day use
of R.
In the present training, we will only work on the Global
Environment.
All the objects assigned in R are stored and displayed in the
Global Environment.
We can also store output in an object with an arbitrary name,
say x, using the assignment operator (<-), i.e., > x <- value
To display x, type x in the script and execute it, or type x into
the console and press Enter.
If we write x <- 5+5 in the console, x will appear in the
environment pane under the Values section (see figure below).

14/209
It’s important to always start with a clean environment when
working on an R project.
To clean the working environment, select the “Restart R” option
from the “Session” menu at the top of the window or press Ctrl +
Shift + F10.

15/209
The History tab contains a list of commands entered into R
console.
Previously used commands can be searched through the History
tab.
The Connections tab allows us to connect to various data sources
such as external databases.
The Tutorials tab is used to run tutorials in RStudio.

16/209
Files/plots/packages/help/viewer
The Files tab lists external files in the current working directory.
We can open, copy, rename, move and delete files listed in the
window.
The Plots tab is where all the plots we create in R are displayed.
There is an option of exporting plots to an external file using the
Export drop down menu.
The Packages tab lists all of the packages that are installed on
the computer.
We can install new packages & update existing ones by clicking
on the Install and Update buttons.
We view the help coming from R documentation in the Help tab.
The Viewer tab displays local web content such as web graphics
generated by some packages.

17/209
Projects in R
RStudio Projects help to keep things organised.
An RStudio Project keeps all of our R scripts, R functions and
data together in one place.
To create a project, open RStudio and select File ⇒ New
Project... from the menu.

18/209
Projects in R
In the next window select New Project.
Enter the name of the directory we want to create in the
Directory name: field.
Click the Browse... button and navigate to change the location
of the directory.

19/209
Projects in R
Tick the Open in new session box and then hit the Create
Project button.

20/209
R workspace
The working directory is the default location where R looks for
files to load and saves any files it writes.
The file path of the current working directory is shown at the
top of the Console pane and can also be obtained with the
command > getwd()

21/209
R workspace
A directory structure can be created by clicking on the New
Folder button in the Files pane.
> ls() lists the objects in the current workspace.
> rm(objectlist) removes (deletes) one or more objects.
> rm(list = c("Object1", "Object2", "Object3"))
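A minimal sketch of these workspace commands. The object names x, y and z are hypothetical and serve only as an illustration.

```r
# Create three illustrative objects in the current workspace
x <- 5 + 5
y <- "hello"
z <- c(1, 2, 3)

ls()                      # lists the objects, including x, y and z
rm(x)                     # remove a single object
rm(list = c("y", "z"))    # remove several objects by name
ls()                      # x, y and z are no longer listed
```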

22/209
Basics
R is case-sensitive: r is different from R.
An object in R is anything that can be assigned a value, e.g.,
> X = 5
> Name = "Taddesse"
Objects can also be the output of a plot, a summary of a statistical
analysis, or a set of R commands that perform a specific task.
Commands in R can be separated by either a new line or a
semicolon (;).
If a continuation prompt + appears in the console after code is
executed, the command is incomplete (e.g., a closing parenthesis
or quote is missing).
R statements consist of functions and assignments.
In general, R is tolerant of extra spaces inserted into code.

23/209
Basics
Do not put a space inside the assignment operator (<-):
x < - 5 is read as the comparison x < -5.
If the console ‘hangs’ and becomes unresponsive after running a
command, press the escape key (Esc) on the keyboard.
We can also click on the stop icon in the top right of the console
to terminate most current operations.
To save an object to an .RData file use save(nameOfObject, file
= "file_name.RData")
To save all objects in a workspace into a single .RData file use
save.image(file = "file_name.RData")
To load an .RData file back into RStudio use load(file =
"file_name.RData")
We can end the R session that we are working on by typing and
running the command q().
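A small sketch of saving and reloading objects. The objects scores and label are hypothetical, and a temporary file is assumed here so that nothing is left behind on disk.

```r
# Hypothetical objects, used only for illustration
scores <- c(12, 15, 9)
label <- "demo"

# Write both objects to a temporary .RData file
rdata_file <- tempfile(fileext = ".RData")
save(scores, label, file = rdata_file)

rm(scores, label)          # the objects are gone from the workspace
load(file = rdata_file)    # ...and restored under their original names
scores
```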

24/209
Packages
Packages are collections of R functions, data, and compiled code.
The base installation of R comes with many packages as
standard.
The directory where packages are stored on your computer is
called the library.
Standard set of packages in R: base, datasets, utils, grDevices,
graphics, stats, and methods.
Use the install.packages() command to install a package for the
first time.
Example. The package car can be installed by writing
install.packages("car") into the Console window of RStudio.
We need to have a working internet connection to do this.

25/209
Packages
We may be asked to select a CRAN mirror; we can select
0-Cloud or a mirror near our location.
We may include the dependencies = TRUE argument to install
additional required packages.
Packages can be updated using the command
update.packages().
The ask = FALSE argument updates all installed packages
without asking for confirmation.
To use a package in an R session, we first need to load the
package using > library(pkg_name).
We need to load the packages we will be using every time we
start a new R session.
> help(package = "package_name") provides a brief description
of the package.

26/209
Help in R
> help.start() opens the general help pages.

27/209
Help in R
> help("mean") or ?mean opens the help page of the function mean.
> help.search("linear") or ??linear searches the help system
for the string linear.
> example("anova") runs the examples of the function anova.

28/209
Help in R
> apropos("mean", mode = "function") lists all functions
with mean in their name.
Programming help:
StackOverflow (https://stackoverflow.com/) is a Q & A
website focused on programming in all languages.

29/209
Exercise 1

30/209
Data structures, input & output

31/209
Introduction
The base data structures in R can be organized by their
dimensionality (1d, 2d, or nd).
In R, the ID (case identifier) is treated as the row names.
A data structure can be either homogeneous (all contents of the
same type) or heterogeneous (contents of different types).
The FIVE most often used structures:
We can use str() to see what data structures an object is
composed of.

32/209
Notation and naming
Generally, variable names should be nouns and function names
should be verbs.
Use an underscore (_) to separate words within a name.
Avoid using names of existing functions and variables. Some
examples of BAD variable names:
> F <- FALSE
> c <- 10
Place spaces around all infix operators (=, +, -, <, >, etc.).
Always put a space after a comma.
:, ::, and ::: do not need spaces around them.
Place a space before left parentheses, except in a function call.

33/209
Vectors
Vectors are one-dimensional arrays that can hold numeric,
character, or logical data.
Vector is the basic data structure in R.
Scalars are treated as vectors with only one element.
is.atomic(x) tests whether an object x is an atomic vector.
There are four common types of atomic vectors: logical, integer,
double (numeric), and character.
Atomic vectors are usually created with c(), short for
concatenate.
Example
▶ Logical <- c(TRUE, FALSE)
▶ Integer <- c(1L, 3L, 6L, 9L) # the L suffix stores integers
▶ Double <- c(1, 2.4, 3.7)
▶ Character <- c("He", "is", "a", "statistician")

34/209
Vectors ...
Data in a vector must be of only one type or mode: numeric,
character, or logical.
We can check the data type (mode) of a vector using the
command mode().
▶ > mode(Integer) gives "numeric"
We use square brackets [ ] to extract elements of a vector.
Example
▶ > Double[2] gives 2.4
The colon operator (:) generates a sequence of integers, which
can be used to extract a range of elements.
Example
▶ > Character[2:4] gives "is" "a" "statistician"

35/209
Vectors
A summary of logical test and coercion functions
> X <- 5.8
> class(X) gives [1] "numeric"
> is.numeric(X) gives [1] TRUE
> is.character(X) gives [1] FALSE
> X1 <- as.character(X)
> class(X1) gives [1] "character"
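The vector types, indexing, and coercion described above can be sketched as follows. The vector mixed is a hypothetical addition that shows what happens when types are combined in c().

```r
Integer <- c(1L, 3L, 6L, 9L)     # the L suffix creates integers
Double <- c(1, 2.4, 3.7)
Character <- c("He", "is", "a", "statistician")

typeof(Integer)    # storage type of the vector
typeof(Double)
mode(Integer)      # integers and doubles both have mode "numeric"

Double[2]          # second element
Character[2:4]     # elements 2 to 4

# Mixing types in c() coerces every element to the most flexible type
mixed <- c(1, "a", TRUE)
class(mixed)

# Explicit coercion from character back to numeric
as.numeric("5.8")
```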

36/209
Factors
Factors are a specific type of vector used to store values drawn
from a pre-specified set of values.
The function factor() stores categorical values as a vector of
integers in the range 1, . . . , k,
where k is the number of unique categories.
> x <- c("A", "B", "B", "A")
> xf <- factor(x) stores this vector as (1, 2, 2, 1); this can be
checked by > as.numeric(xf)
The variable x is now treated as nominal.
For ordinal variables, we add the parameter ordered=TRUE to
the factor() function.
> y <- c("Good", "Very good", "Excellent", "Good")
> factor(y, ordered=TRUE) is encoded as (2,3,1,2)

37/209
Factors
The variable y is now treated as ordinal.
By default, factor levels for character vectors are created in
alphabetical order.
We can override the default by specifying a levels option.
levels(x) returns the set of allowed values of a factor x.
> factor(y, ordered=TRUE, levels=c("Good", "Very good",
"Excellent")) is encoded as (1,2,3,1)
Numeric variables can be coded as factors using the levels and
labels options.
> Sex <- c(0,1,1,1,0,1,1,0,0)
> factor(Sex, levels=c(0, 1), labels=c("Male", "Female"))
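A short sketch of how the levels option affects the encoding. The vector codes and the choice of 0 = Male, 1 = Female are assumptions made only for this illustration.

```r
y <- c("Good", "Very good", "Excellent", "Good")

# Default: levels are sorted alphabetically (Excellent, Good, Very good)
as.integer(factor(y, ordered = TRUE))

# Explicit levels give the intended order
yf <- factor(y, ordered = TRUE, levels = c("Good", "Very good", "Excellent"))
as.integer(yf)
yf[1] < yf[3]      # comparisons respect the level order

# Values that are not listed in levels are turned into NA,
# so levels must match the codes actually present in the data
codes <- c(0, 1, 1, 0)
f_bad <- factor(codes, levels = c(1, 2), labels = c("Male", "Female"))
f_ok  <- factor(codes, levels = c(0, 1), labels = c("Male", "Female"))
sum(is.na(f_bad))  # number of codes lost
sum(is.na(f_ok))
```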

38/209
List
Lists are the most complex of the R data types.
A list may contain a combination of vectors, matrices, data
frames, and even other lists. Use list() to create lists.
x <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9))
str(x) gives
List of 4
$ : int [1:3] 1 2 3
$ : chr "a"
$ : logi [1:3] TRUE FALSE TRUE
$ : num [1:2] 2.3 5.9
Note that many R functions return lists.
We can turn a list into an atomic vector with unlist().

39/209
Data frames
A data frame is a list of equal-length vectors. It shares
properties of a matrix.
It is more general than a matrix as it can contain different types
of data.
Each row corresponds to an individual observation and each
column corresponds to a different measured or recorded variable.
We create a data frame using data.frame(), which takes named
vectors as input.
> mydata <- data.frame(col1, col2, col3,...)
> DF <- data.frame(x = 1:3, y = c("A", "B", "C"))
We can combine data frames using cbind() and rbind().

40/209
Data frames
The dimensions of a data frame can be determined by the dim()
function.
To access columns of a data frame we can use any of the
following ways:
1 Using a square bracket and column number: DF[2] gives a
one-column data frame
2 Using a square bracket and column name: DF["y"]
3 Using dollar sign: DF$y gives [1] "A" "B" "C"
Typing DF$VarName is somewhat tiresome for a data frame
having several variables.
We can use the attach() function as a shortcut.

41/209
Data frames
The attach() function adds the data frame to the R search path.
Example. > attach(ChickWeight); plot(weight, Diet)
> weight <- c(23,25,20); attach(ChickWeight) gives
The following object is masked by .GlobalEnv: weight
> plot(weight, Diet) gives Error. Why?
We can use the detach() function to remove the data frame
from the search path.
The operator $ can also be used to create a new column.
> DF$Sex <- c("M", "F", "M")
R automatically assigns the character class to the above variable
"y".
We can include the stringsAsFactors = TRUE argument in
data.frame() to convert "y" to a factor.
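One way to avoid the masking problem, sketched here with the with() function, which evaluates an expression inside a data frame without changing the search path. The global vector weight is the same hypothetical object as in the example above.

```r
# A global object with the same name as a column of ChickWeight
weight <- c(23, 25, 20)

# with() looks up names in the data frame first, so the column is used
mean_w <- with(ChickWeight, mean(weight))

# Same value as addressing the column explicitly
mean_w == mean(ChickWeight$weight)

# The global object is untouched
length(weight)
```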

42/209
Data frames
Values stored in a data frame can be retrieved by a square
bracket [i, j]; i for row and j for column.
> DF[3,2] gives C
> DF$y[3] gives C
> DF[,1] gives 1 2 3
> DF[1,] gives 1 A
The function is.data.frame(object) checks if an object is of class
data frame.
> nrow(DF) # to determine number of rows of DF
> ncol(DF) # to determine number of columns of DF
> colnames(DF) # to display column names of DF
> rownames(DF) # to display row names of DF

43/209
Missing values
Missing values in R are marked by NA.
We can use the is.na() function to check whether a value of an R
object is missing.
It returns TRUE if a value is missing.
> myvec <- c(2, 5, NA, 8)
> is.na(myvec) gives [1] FALSE FALSE TRUE FALSE
The sum() function can be used to determine the number of NA
values in a vector.
> sum(is.na(myvec)) gives [1] 1
The anyNA() function can be used to check for any missing values
in a data frame.
> anyNA(DF) gives [1] FALSE
Let us consider the following data frame and check for missing
values.

44/209
Missing values
To verify which observation(s) is/are the cause, we can use
> mydf[!complete.cases(mydf),] gives
Ad Car Mi
3 1 NA 13
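A sketch of the checks above on a small data frame. The values of mydf below are hypothetical; only the incomplete third row (Ad = 1, Car = NA, Mi = 13) follows the output shown.

```r
# Hypothetical data frame with one missing value in row 3
mydf <- data.frame(Ad  = c(0, 1, 1, 0),
                   Car = c(2, 1, NA, 3),
                   Mi  = c(10, 12, 13, 9))

anyNA(mydf)                      # is anything missing at all?
colSums(is.na(mydf))             # number of missing values per column
mydf[!complete.cases(mydf), ]    # rows with at least one NA
complete_rows <- na.omit(mydf)   # drop the incomplete rows
nrow(complete_rows)
```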

45/209
Exercise 2

46/209
Data input
R can import data from a variety of sources.
Entering data from the keyboard:
We can use the edit() function to enter data manually.
> mydata <- data.frame(age=numeric(0),
gender=character(0), weight=numeric(0))
> mydata <- edit(mydata)
A shortcut for mydata <- edit(mydata) is fix(mydata).

47/209
Data input

48/209
Data input
We can embed data directly using the code
> mydatatxt <- "
age gender weight
25 m 166
30 f 115
"
mydata <- read.table(header=TRUE, text=mydatatxt)
Importing data from a delimited text file
Syntax: mydataframe <- read.table(file, options)
Some of the options: header, sep, row.names, col.names,
na.strings, quote, skip, dec.
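A minimal sketch of read.table() with some of these options. The second input (csvtxt, with -9 marking a missing score) is hypothetical and is read from text rather than from a file so that the example is self-contained.

```r
# Whitespace-delimited data embedded in the script
mydatatxt <- "
age gender weight
25 m 166
30 f 115
"
mydata <- read.table(header = TRUE, text = mydatatxt)
str(mydata)

# Hypothetical comma-separated data in which -9 marks a missing value
csvtxt <- "id,score
1,80
2,-9
3,75"
scores <- read.table(text = csvtxt, header = TRUE, sep = ",",
                     na.strings = "-9")
is.na(scores$score)
```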

49/209
Data input
The above data can be imported into a data frame using the
following code:
grades <- read.table(file="studentgrades.csv", header=TRUE,
row.names="StudentID", sep=",", colClasses=c("character",
"character", "character", "numeric", "numeric", "numeric"))
Importing data from Excel
Save the file into Tab-delimited Text (.txt) form: File ⇒ Save as
... ⇒ select the Text (Tab delimited) in the Save as type:

50/209
Data input
Once the Excel file is saved in tab-delimited format we can use
the read.table() function.
We may also save Excel files in comma-separated values (.csv)
format and use the function read.csv() to input the data.
Excel files can be imported directly into R using the xlsx package.
Syntax to import the first worksheet (1) from the workbook:
library(xlsx)
workbook <- "c:/myworkbook.xlsx"
mydataframe <- read.xlsx(workbook, 1)
For large worksheets (say, 100,000+ cells), we can use
read.xlsx2().
In general, it is recommended to save data as tab- or comma-
delimited files and import them into R using read.table().

51/209
Data input
Importing data from SPSS
SPSS datasets can be imported into R via read.spss() function
in the foreign package.
We can also use the spss.get() function in the Hmisc package.
library(Hmisc)
mydataframe <- spss.get("mydata.sav",
use.value.labels=TRUE)
Importing data from Stata
We can employ the following syntax.
library(foreign)
mydataframe <- read.dta("mydata.dta")

52/209
Data input
We can also use the Import Dataset shortcut in the Environment
pane to input data.

53/209
Data input
It is common to get error messages when first importing data,
for one or more of the following reasons.
1 The file name or file path is misspelled.
2 The file extension (.txt, .csv, etc.) is missing from the file
name.
3 An incorrect file path is used.
4 The header = TRUE argument is missing when the first
row of the data file contains variable names.
Always use the function str() after importing the data to see the
structure of the data set.

54/209
Annotating datasets
Annotating includes adding descriptive labels to variable names
and value labels to the codes for categorical variables.
Consider the following data frame
Age Sex
15 2
17 2
16 1
The following code can be used to rename "Age" as "Age at
first marriage".
> names(Mydata)[1] <- "Age at first marriage"
The following code creates value labels for the variable "Sex".
Mydata$Sex <- factor(Mydata$Sex,
levels = c(1,2),
labels = c("male", "female"))
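The annotation steps can be sketched end to end as follows. Note that a column name containing spaces has to be accessed with [[ ]] or backticks.

```r
# The small data frame shown above
Mydata <- data.frame(Age = c(15, 17, 16), Sex = c(2, 2, 1))

# Descriptive variable name and value labels
names(Mydata)[1] <- "Age at first marriage"
Mydata$Sex <- factor(Mydata$Sex, levels = c(1, 2),
                     labels = c("male", "female"))

# Names with spaces need [[ ]] or backticks
mean(Mydata[["Age at first marriage"]])
mean(Mydata$`Age at first marriage`)
table(Mydata$Sex)
```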

55/209
Data output
The main function to export data frames is write.table().
> write.table(DF, file = "file_path/file_name.txt",
col.names=TRUE, row.names=FALSE, sep = "\t")
DF is the data frame we want to export;
file_name.txt is the file name for the exported data frame;
col.names=TRUE indicates that the variable names should be
written in the first row of the file;
row.names = FALSE stops R from including the row names in
the first column of the file.
The above exported file can be opened in any text editor as it is
saved in tab-delimited text form.
We can also export a data frame to csv format by setting sep =
"," in the write.table() function.
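A sketch of an export followed by a re-import, which is a simple way to check that a file was written as intended. A temporary file is assumed here in place of a real file path.

```r
DF <- data.frame(x = 1:3, y = c("A", "B", "C"))

# Export to a tab-delimited text file (temporary file for illustration)
out_file <- tempfile(fileext = ".txt")
write.table(DF, file = out_file, col.names = TRUE,
            row.names = FALSE, sep = "\t")

readLines(out_file)      # raw content: a header line plus one line per row

# Read the file back and compare with the original
DF_back <- read.table(out_file, header = TRUE, sep = "\t")
all.equal(DF, DF_back)
```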

56/209
Data output
The function write.csv() can also be used directly as
write.csv(DF, "filepath/mydf.csv", row.names=FALSE)
library(xlsx)
write.xlsx(df, "filepath/mydf.xlsx") # To export a data frame
"df" to "mydf.xlsx"
We first need to install the haven package to export data to
SPSS and Stata.
library(haven)
write_sav(df, "filepath/mydf.sav") # To SPSS
write_dta(df, "filepath/mydf.dta") # To Stata

57/209
Exercise 3

58/209
Data wrangling

59/209
Introduction
Data wrangling is the most important activity before data
analysis begins.
It includes:
▶ creating new variables,
▶ extracting part of a data frame,
▶ recoding existing variables,
▶ sorting and merging data,
▶ selecting and dropping variables,
▶ working with dates, etc.
The following code creates a new variable, say SUM, in an
existing data frame "mydata".
mydata <- data.frame(x1 = c(2, 2, 6, 4), x2 = c(3, 4, 2, 8))
mydata$SUM <- mydata$x1 + mydata$x2
mydata

60/209
Extracting values
Consider the built in data frame USArrests and check its
structure.
Let us extract the third value of the Assault variable.
USArrests[3,2] gives [1] 294
USArrests$Assault[3] gives the same thing as above.
The data in the first 10 rows and 3 columns of USArrests can be
extracted by:
USArrests[1:10, 1:3]

61/209
Extracting values
We use negative positional indexes to exclude certain rows and
columns from a data frame.
Let us extract all of the rows except the first 40 rows and all
columns except the 2nd and the 4th columns in the USArrests
dataset.
USArrests[-(1:40), -c(2,4)] # gives
Values can also be extracted based on a logical test.

62/209
Extracting values
Let us extract all rows where the value of Assault is 200 or more
and all the columns in the USArrests dataset.
USArrests[USArrests$Assault >= 200, ]
We can use the following to extract all rows where the value of
UrbanPop is 80 and all the columns in the USArrests dataset.
USArrests[USArrests$UrbanPop == 80, ]
Boolean expressions can be used to extract values based on a
combination of logical tests.
To extract rows based on an AND Boolean expression we can
use the & symbol.
Let us extract all columns and rows where Assault > 250 and
UrbanPop > 70.
USArrests[USArrests$Assault > 250 & USArrests$UrbanPop
> 70, ]

63/209
Extracting values
To extract rows based on an OR Boolean expression we can use
the | symbol.
Extract values where UrbanPop is greater than 50 OR less than
60.
USArrests[USArrests$UrbanPop > 50 | USArrests$UrbanPop
< 60, ]
Alternatively, one can use the subset() function to select parts of
a data frame.
subset(USArrests, UrbanPop > 50 & UrbanPop < 60)
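The extraction methods above can be sketched together as follows. The cut-off values are illustrative assumptions, and the select argument of subset() is an addition that keeps only chosen columns.

```r
# Rows where Assault is 200 or more
high_assault <- USArrests[USArrests$Assault >= 200, ]

# AND: both conditions must hold
both <- USArrests[USArrests$Assault > 250 & USArrests$UrbanPop > 70, ]

# OR: at least one condition must hold
either <- USArrests[USArrests$Assault > 250 | USArrests$UrbanPop > 70, ]

# subset() evaluates the condition inside the data frame,
# so column names can be used without the USArrests$ prefix
mid_urban <- subset(USArrests, UrbanPop > 50 & UrbanPop < 60)

# select keeps only the named columns
subset(USArrests, UrbanPop > 50 & UrbanPop < 60,
       select = c(Murder, UrbanPop))

# An AND test can never return more rows than the corresponding OR test
nrow(both) <= nrow(either)
```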

64/209
Ordering data frames
We can use the order() function to sort a data frame.
Let us order the USArrests data frame based on the values of
Murder
USArrests[order(USArrests$Murder), ] #Ascending
USArrests[order(-USArrests$Murder), ] #Descending OR
USArrests[order(USArrests$Murder, decreasing = TRUE), ]
Let us order the USArrests data frame based on the values of
Murder and Rape.
USArrests[order(USArrests$Murder, USArrests$Rape), ]

65/209
Inclusion of extra rows and columns
The function rbind() can be used to append additional rows on a
data frame.
Use cbind() to append columns on an existing data frame.
DF1 <- data.frame(Id = 1:3, Age = c(20, 19, 23), Sex =
c("Male", "Female", "Female"))
DF2 <- data.frame(Id = 4:5, Age = c(27, 25), Sex =
c("Male", "Male"))

66/209
Inclusion of extra rows and columns
DF3 <- data.frame(Height = c(178, 171, 167))

67/209
Inclusion of extra rows and columns
rbind(DF1, DF2) will give
cbind(DF1, DF3) will give
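A runnable sketch of both operations with the three data frames defined above. rbind() needs data frames with the same columns, while cbind() needs the same number of rows.

```r
DF1 <- data.frame(Id = 1:3, Age = c(20, 19, 23),
                  Sex = c("Male", "Female", "Female"))
DF2 <- data.frame(Id = 4:5, Age = c(27, 25), Sex = c("Male", "Male"))
DF3 <- data.frame(Height = c(178, 171, 167))

stacked <- rbind(DF1, DF2)   # DF2 appended as extra rows
widened <- cbind(DF1, DF3)   # Height appended as an extra column

dim(stacked)
dim(widened)

# rbind() of data frames with different columns fails with an error
failed <- inherits(try(rbind(DF1, DF3), silent = TRUE), "try-error")
failed
```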

68/209
Merging data frames
The function merge() merges two data frames horizontally.
We need to have at least one unique identifier that is available
in both data frames.
The following code merges data frame A and data frame B by ID.
total <- merge(dataframeA, dataframeB, by="ID")
The following code merges data frame A and data frame B by ID
and Region.
total <- merge(dataframeA, dataframeB, by=c("ID",
"Region"))
Use the all = TRUE argument if we want to include all data
from both data frames.
total <- merge(dataframeA, dataframeB, all = TRUE)

69/209
Merging data frames
Let us merge the two data frames given below.
Total <- merge(DF1, DF2) gives
Total1 <- merge(DF1, DF2, all = TRUE) gives
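A sketch of how the all argument changes the result. The data frames dfA and dfB are hypothetical and share only the identifiers 2 and 3.

```r
# Hypothetical data frames with a common identifier ID
dfA <- data.frame(ID = c(1, 2, 3), Age = c(20, 19, 23))
dfB <- data.frame(ID = c(2, 3, 4), Height = c(171, 167, 180))

# Default: only IDs found in both data frames are kept
inner <- merge(dfA, dfB, by = "ID")

# all = TRUE: every ID is kept; values that are not available become NA
outer <- merge(dfA, dfB, by = "ID", all = TRUE)

# all.x = TRUE: every row of the first data frame is kept
left <- merge(dfA, dfB, by = "ID", all.x = TRUE)

nrow(inner)
nrow(outer)
sum(is.na(outer))   # number of cells filled with NA
```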

70/209
Recoding variables
Recoding refers to creating new values of a variable from the
existing values.
It may include:
▶ changing a continuous variable into a set of categories,
▶ replacing miscoded values with correct values.
Suppose we want to recode the stopping distance of cars
("dist") in the "cars" dataset to distcat (Small, Medium, Long).
We first recode the value 999 for "dist" to indicate that this
value is missing:
cars$dist[cars$dist == 999] <- NA
cars$distcat[cars$dist < 50] <- "Small"
cars$distcat[cars$dist >= 50 & cars$dist < 100] <-
"Medium"
cars$distcat[cars$dist >= 100] <- "Long"

71/209
Recoding variables
We may also use the within() function as follows.
cars <- within(cars, { distcat1 <- NA
distcat1[dist < 50] <- "Small"
distcat1[dist >= 50 & dist < 100] <- "Medium"
distcat1[dist >= 100] <- "Long" })
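As an alternative that is not used in the slides, the same recoding can be sketched with the base R function cut(). The copy cars2 and the cut-offs 50 and 100 are assumptions made for this illustration.

```r
cars2 <- cars   # work on a copy of the built-in data set

# Recoding with logical indexing
cars2$distcat <- NA
cars2$distcat[cars2$dist < 50] <- "Small"
cars2$distcat[cars2$dist >= 50 & cars2$dist < 100] <- "Medium"
cars2$distcat[cars2$dist >= 100] <- "Long"

# cut() with right = FALSE builds the intervals [0, 50), [50, 100), [100, Inf)
cars2$distcat_cut <- cut(cars2$dist, breaks = c(0, 50, 100, Inf),
                         labels = c("Small", "Medium", "Long"),
                         right = FALSE)

# Both approaches assign every car to the same category
table(cars2$distcat, cars2$distcat_cut)
same <- all(cars2$distcat == as.character(cars2$distcat_cut))
same
```

Unlike the character column, the result of cut() is a factor whose levels are already ordered from Small to Long.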

72/209
Exercise 4

73/209
Packages for wrangling data
Important packages to install: dplyr; tidyr.
Once these packages are successfully installed, attach them to
the working environment.
install.packages("dplyr")
install.packages("tidyr")
library(dplyr)
library(tidyr)
The pipe operator %>% is used to perform a set of procedures
in sequence and get the final result at once.
Let us determine the number of missing values in the Total1 data
frame (see Slide 69).
sum(is.na(Total1)) gives [1] 2
This procedure can also be done using the pipe operator.

74/209
Total1 %>% is.na() %>% sum()
Key functions used for data management in the dplyr package
include:
mutate: to modify and create columns in a data frame.
select: to select columns by name.
filter: to select rows based on a set of logical values.
Let us create a column variable time by dividing dist by speed in
the cars data frame.
mutate(cars, time = dist/speed)

75/209
Let us now list the cars with speed less than 10.
cars %>% filter(speed < 10)
Suppose that we would like to take out only the dist of cars with
speed less than 10. Then we can use:
cars1 <- cars %>% filter(speed < 10)
cars1 %>% select(dist)

76/209
All of the above procedures can be performed in a single
chain by using the pipe operator:
cars %>%
mutate(time = dist/speed) %>%
filter(speed < 10) %>%
select(dist)

77/209
Reshaping data frames
Two main data frame shapes: the long format (sometimes called
stacked) and the wide format.
In the long data format, a separate column represents the name
of the variable and a separate one value of the corresponding
variable.
In the wide data format, each column represents a variable.

78/209
We use the function pivot_longer() to reshape data into long
format.
pivot_longer(Wide_DF, ColNames_in_DF, names_to = " ",
values_to = " ")
Wide_DF is the name of the data frame in wide format.
ColNames_in_DF represents the column names in the wide
format (Test1, Test2, Final in the above data).
The names_to = " " argument specifies the name of the variable
that will be used to store the names of reformatted variables
(Exam in the above data).
The values_to = " " argument specifies the name of the variable
that will be used to store the values (Score in the above data).

79/209
Consider the data frame given in Slide 77 and reshape the data
in wide format to long format.
Wide <- data.frame(Stud.Id = 1:3, Test1 = c(20,23,18),
Test2 = c(26,28,25), Final = c(32,30,34))
library(tidyr)
pivot_longer(Wide, c(Test1, Test2, Final), names_to =
"Exam", values_to = "Score")

80/209
We use the function pivot_wider() to reshape data into wide
format.
pivot_wider(Long_DF, names_from = " ", values_from = " ")
Long_DF is the name of the data frame in long format.
The names_from = " " argument specifies the name of the
variable containing the variable names (Exam in the above data).
The values_from = " " argument specifies the variable
containing the values (Score in the above data).
Consider the data frame given in Slide 77 and reshape the data
in long format to wide format.

81/209
Long <- data.frame(Stud.Id = c(1,1,1,2,2,2,3,3,3), Exam =
rep(c("Test1","Test2","Final"),3), Score =
c(20,26,32,23,28,30,18,25,34))
library(tidyr)
pivot_wider(Long, names_from = "Exam", values_from =
"Score")

82/209
Date values
The function as.Date() converts character strings into date
values.
Syntax is as.Date(x, "input format"). The input format can be:
mydate <- as.Date(c("2021-01-01", "2021-03-02",
"2021-03-29"))

83/209
Date values
We can use the format(x, format="output format") function to
extract portions of dates.
x is a given date value.
format = " " is any one or more of the formats %d, %a, %A,
%m, %b, %B, %y, %Y
today <- Sys.Date()
today gives the current date.
format(today, format="%y") gives the two-digit year, e.g., [1] "22"
in 2022.
We may also extract more than one format using:
format(today, c("%A","%m")) this extracts the
unabbreviated weekday and the month number.
It is also possible to perform arithmetic operations on dates as
follows.

84/209
Date values
For example, how old in days is a person if he was born on 12
October 1950?
DoB <- as.Date("1950-10-12")
today - DoB gives the time difference in terms of days.
Alternatively, the following code can be employed.
difftime(today, DoB, units = "days") gives the time difference
in terms of days.
How old in years is the person if he was born on 12 October
1950?
First install the lubridate package and then load it.
library(lubridate)
trunc(interval(DoB, today) / years(1)) gives the age in
years.
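A base R sketch of the same date arithmetic, without the lubridate package. The fixed reference date ref is a hypothetical stand-in for today, chosen so that the result is reproducible.

```r
# Parse a date written as day/month/year
DoB <- as.Date("12/10/1950", format = "%d/%m/%Y")

# Fixed, hypothetical reference date instead of Sys.Date()
ref <- as.Date("2022-06-01")

as.numeric(ref - DoB)                  # age in days
difftime(ref, DoB, units = "weeks")    # age in weeks

# Completed years: difference of the calendar years,
# minus 1 if the birthday has not yet occurred in the reference year
had_birthday <- format(ref, "%m%d") >= format(DoB, "%m%d")
age_years <- as.numeric(format(ref, "%Y")) -
  as.numeric(format(DoB, "%Y")) - ifelse(had_birthday, 0, 1)
age_years
```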

85/209
Date values
Suppose we would like to limit our analyses to observations
collected between January 1, 2019 and December 31, 2020 in
”Ex” dataset.
Ex <- data.frame(A=1:10, B = c("2019-08-11",
"2020-12-18","2018-05-14","2020-07-26","2018-11-23",
"2020-01-03", "2018-05-05", "2018-11-07",
"2019-03-06","2019-05-08"))
Ex$date <- as.Date(Ex$B)
startdate <- as.Date("2019-01-01")
enddate <- as.Date("2020-12-31")
newdata <- Ex[which(Ex$date >= startdate & Ex$date <=
enddate), ]

86/209
Exercise 5

87/209
Descriptive analysis

88/209
Commonly used mathematical functions

89/209
Commonly used statistical functions

90/209
Descriptive statistics
Load the built in dataset swiss in the workspace.
It is good practice to look at how many observations and
variables are included in a dataset prior to further analysis.
dim(swiss) gives [1] 47 6
Following this, have a look at the structure str(swiss)
Then get basic summary statistics by using the function
summary().

91/209
Missing values may be present in the dataset under analysis.
We can compute summary statistics after removing them, for
example,
mean(swiss$Fertility, na.rm=TRUE)
with(swiss, c(Mean = mean(Fertility, na.rm=TRUE), Sd =
sd(Fertility, na.rm=TRUE)))

92/209
Suppose we would like to calculate summary statistics, e.g.,
mean for each level of a categorical variable.
We can use the tapply() function.
Let us calculate the mean miles per gallon (mpg) for each of the
cylinder types (cyl) in mtcars dataset.
tapply(mtcars$mpg, mtcars$cyl, mean)
We can also apply tapply() over more than one factor.
Let us calculate the mean mpg for each combination of gear and cyl.
tapply(mtcars$mpg, list(Cylinder = mtcars$cyl, Gear
= mtcars$gear), mean)
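A short sketch of what tapply() returns in the two cases above. The object names mpg_by_cyl and mpg_by_cyl_gear are hypothetical.

```r
# One factor: one mean per level of cyl
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
mpg_by_cyl["4"]            # mean mpg of the 4-cylinder cars

# Two factors: a matrix with cyl in the rows and gear in the columns
mpg_by_cyl_gear <- tapply(mtcars$mpg,
                          list(Cylinder = mtcars$cyl, Gear = mtcars$gear),
                          mean)
dim(mpg_by_cyl_gear)

# Combinations that do not occur in the data are reported as NA
mpg_by_cyl_gear["8", "4"]
```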

93/209
aggregate() works in a similar way to tapply() but is more
flexible.
Suppose we want to calculate the mean values of mpg and wt
for each level of cyl.
aggregate(mtcars[, c("mpg", "wt")], by=list(Cylinder =
mtcars$cyl), FUN = mean)

94/209
We can also use the aggregate() function with the formula
method.
aggregate(mpg ~ cyl, FUN = mean, data = mtcars)
Furthermore, we may compute summary statistics for subsets of
the original data.
Let us compute the mean of mpg for each level of cyl only for hp
> 115
aggregate(mpg ~ cyl, FUN = mean, subset = hp > 115, data
= mtcars)
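As a quick check that tapply() and the formula method of aggregate() describe the same group means, the sketch below computes both on mtcars and compares them. The object names by_tapply, by_aggregate, and same_means are illustrative.

```r
# Group means of mpg by cyl, computed two ways on the built-in mtcars data
by_tapply <- tapply(mtcars$mpg, mtcars$cyl, mean)                 # named array
by_aggregate <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)   # data frame

print(by_tapply)
print(by_aggregate)

# The two results hold the same numbers in different containers
same_means <- all.equal(unname(as.vector(by_tapply)), by_aggregate$mpg)
print(same_means)
```

tapply() returns a named array, which is convenient for quick lookups, while aggregate() returns a data frame that can be merged or plotted directly.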

95/209
More descriptive statistics
More descriptive statistics can be obtained by installing and
loading different packages.
install.packages("pastecs") # If it is not already installed
library(pastecs)
myvars <- c("mpg", "hp", "wt")
stat.desc(mtcars[myvars]) # mtcars is a built-in dataset

96/209
More descriptive statistics
install.packages("psych") # If it is not already installed
library(psych)
describe(mtcars[myvars]) # gives

97/209
Descriptive statistics by group
Descriptive statistics can be obtained with respect to different
groups by using summaryBy in the doBy package.
install.packages("doBy") # If it is not already installed
library(doBy)
summaryBy(mpg + hp + wt ~ am, data = mtcars, FUN = c(mean, sd,
min, max))

98/209
Descriptive statistics by group
We may also use the describeBy function in the psych package.
library(psych)
describeBy(mtcars[myvars], list(am=mtcars$am)) # gives

99/209
Frequency tables
Following are functions for creating tables

100/209
Tables
Categorical variables are usually described by frequency tables.
Let us tabulate the values of gear in the mtcars dataset.
table(mtcars$gear)
We can produce a table of proportions instead of counts by using
prop.table(table(mtcars$gear))

101/209
Two-way tables
We can also use the table() function to cross tabulate two
categorical variables.
Let us cross tabulate cyl and gear in mtcars dataset.
with(mtcars, table(cyl, gear))
The above table can be obtained by the more flexible function,
xtabs()
xtabs(~ cyl + gear, data = mtcars)

102/209
Two-way tables
There are times where we want to include row and column sums
of a table.
This can be done by applying addmargins() to the table.
Let us include the row and column sums for the table produced
in Slide 101.
addmargins(xtabs(~ cyl + gear, data = mtcars))

103/209
Two-way tables
We can calculate proportions with respect to row or column
margins as follows.
prop.table(xtabs(~ cyl + gear, data = mtcars), 1) # row-wise
prop.table(xtabs(~ cyl + gear, data = mtcars), 2) # column-wise

104/209
Multidimensional tables
The functions table(), margin.table(), xtabs(), prop.table(), and
addmargins() also extend to more than 2-dimensions.
A 3-way table of cyl, gear and am in mtcars dataset can be
obtained from:
mytable <- xtabs(~ cyl + gear + am, data = mtcars)
The function ftable() prints an attractive multidimensional table.
ftable(mytable)
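To illustrate how margin.table() and prop.table() extend to three dimensions, the following sketch collapses the 3-way table over am and computes proportions within each level of am. The names cyl_by_gear and prop_within_am are illustrative.

```r
# Three-way table of cylinders, gears, and transmission type in mtcars
mytable <- xtabs(~ cyl + gear + am, data = mtcars)

# Collapse over am to recover the two-way cyl-by-gear table
cyl_by_gear <- margin.table(mytable, c(1, 2))
print(cyl_by_gear)

# Proportions within each level of am (third dimension); each slice sums to 1
prop_within_am <- prop.table(mytable, 3)
print(ftable(round(prop_within_am, 2)))
```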

105/209
Exercise 6

106/209
Graphs with base R

107/209
Descriptive analysis through graphs
The base R graphics system is the original plotting system that
comes with R.
Graphs in base R are created by high-level plotting commands,
e.g., plot(), and then more information can be added by using
low-level commands, e.g., lines() and text().
When a plot is created in RStudio it will be displayed in the
Plots tab by default.

108/209
Descriptions through graphs
Previously created plots can be scrolled by clicking on one of the
arrow buttons.
We can save plots in a variety of formats (pdf, png, tiff, jpeg
etc) by clicking on the Export button.
plot() is the most common high-level function to make one or
more plots.
Let us create a scatterplot of mpg in the mtcars dataset.

109/209
with(mtcars, plot(mpg))
We can produce a scatterplot of mpg versus wt in the mtcars
dataset using with(mtcars, plot(mpg, wt))
It is possible to specify the type of graph we wish to plot using
the type = argument.

110/209
Let us apply different values of the type argument to plot x
= {1,2,3,4,5,6,7,8} versus y = {12,21,9,11,8,10,17,18}
x <- 1:8
y <- c(12, 21, 9, 11, 8, 10, 17, 18)
par(mfrow = c(2, 2)) # to plot 4 graphs on one page
plot(x, y, type = "l", main = "l")
plot(x, y, type = "b", main = "b")
plot(x, y, type = "o", main = "o")
plot(x, y, type = "c", main = "c")

111/209
Bar plots
A bar plot displays the distribution (frequency) of a categorical
variable through vertical or horizontal bars.
Vertical bar plot: barplot(height)
Horizontal bar plot: barplot(height, horiz = TRUE)
barplot(height, names.arg = Bar Labels)
The following command gives a bar plot of cyl in mtcars dataset.
barplot(table(mtcars$cyl))

112/209
Bar plots
If height is a matrix rather than a vector, we will have a stacked
or grouped bar plot.
Stacked bar plot: barplot(height)
Grouped bar plot: barplot(height, beside = TRUE)
Let us use the built in dataset VADeaths and represent it by
using bar plots.
barplot(VADeaths)

113/209
Bar plots
barplot(VADeaths, beside = TRUE)

114/209
Pie charts
A pie chart is a circular diagram to present data by using the
function pie().
Let us produce a pie chart from the following data.
slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pie(slices, labels = lbls, main = "Simple Pie Chart")

115/209
Pie charts
Let us include percentages on the above pie chart.
pct <- round(slices / sum(slices) * 100)
lbls2 <- paste(lbls, " ", pct, "%", sep = "")
pie(slices, labels = lbls2, col = rainbow(length(lbls2)),
main = "Pie Chart with Percentages")

116/209
Histograms
Histograms are useful when we want to get an idea about the
distribution of values in a numeric variable.
We use the function hist() to produce a histogram.
Let us generate a histogram for the variable mpg in mtcars
dataset.
hist(mtcars$mpg)

117/209
Histograms
The freq = FALSE argument can be used to display the
histogram on the density scale (bar areas sum to 1) rather than as frequencies.
hist(mtcars$mpg, freq = FALSE)
We can control the breakpoints of a histogram using the breaks
= argument.
hist(mtcars$mpg, breaks = seq(from = 0, to = 35, by = 5))
It is also possible to add a kernel density curve to the histogram
by using the density() and lines() functions.
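What freq = FALSE changes can be checked numerically. The sketch below, which assumes only base R and uses illustrative names (h, bin_widths, total_area, recovered_counts), computes the histogram without drawing it and confirms that the density-scale bars have a total area of 1.

```r
# Compute the histogram of mpg without drawing it
h <- hist(mtcars$mpg, plot = FALSE)

bin_widths <- diff(h$breaks)                 # width of each bin
total_area <- sum(h$density * bin_widths)    # area under the density-scale histogram
print(total_area)

# Counts are recovered from densities: count = density * width * n
recovered_counts <- h$density * bin_widths * length(mtcars$mpg)
print(all.equal(recovered_counts, as.numeric(h$counts)))
```

Because the bars integrate to 1, a kernel density curve drawn with lines(density(...)) is on the same vertical scale as the histogram.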

118/209
Histograms
Kernel density estimation is a nonparametric method for
estimating the probability density function of a random variable.
Let us add a kernel density plot on the histogram of mpg.
Density <- density(mtcars$mpg)
hist(mtcars$mpg, freq = FALSE)
lines(Density)

119/209
Box plots
Boxplots are useful to graphically summarize the distribution of a
variable, identify potential unusual values, and compare
distributions between different groups.
We use the function boxplot() to create a boxplot.
Let us create a boxplot for mpg in the mtcars dataset.
boxplot(mtcars$mpg)

120/209
Box plots
Box plots of a quantitative variable with respect to different
levels of a factor can be produced as follows.
boxplot(mpg ~ cyl, data = mtcars)

121/209
Box plots
We can also have box plots based on more than one grouping
factor, e.g., cyl and am.
Let us first create factors from these variables.
mtcars$cyl.f <- factor(mtcars$cyl, levels = c(4, 6, 8),
labels = c("4", "6", "8")) # to factor cyl
mtcars$am.f <- factor(mtcars$am, levels = c(0, 1),
labels = c("auto", "standard")) # to factor am
boxplot(mpg ~ am.f * cyl.f, data = mtcars)

122/209
Scatterplot matrix
Matrix scatterplots can be created if we want to graphically
explore relationship between more than two variables.
We use the function pairs() to create pairs of scatterplots.
pairs(mtcars[, c("mpg", "disp", "hp", "drat")])

123/209
Low-level plotting functions
Low-level functions add extra information such as points, lines, arrows, or text to
an existing plot.
points(x, y, ...): Adds points to the current plot.
lines(x, y, ...): Adds line segments.
text(x, y, labels, ...): Adds text to the graph.
abline(a, b, ...): Adds the line y = a + bx.
abline(h = y, ...): Adds a horizontal line.
segments(x0, y0, x1, y1, ...): Draws line segments from (x0, y0) to
(x1, y1).
legend("position", fill = , cex = , ...): Displays a legend.
Let us include a legend on the barplot produced in Slide 112.

124/209
Low-level plotting functions
barplot(VADeaths, col = c("green", "yellow", "red", "black",
"blue"))
legend("topright", legend = rownames(VADeaths), fill =
c("green", "yellow", "red", "black", "blue"), cex = 0.3)

125/209
Multiple figure environments
We can create an n by m array of figures on a single page using
the mfrow and mfcol arguments of par().
mfcol = c(n, m): draws n rows and m columns of plots
on one page, filled column by column.
mfrow = c(n, m): the array is filled row by row.
These arguments are set through the par() function, e.g.,
par(mfcol = c(n, m)) or par(mfrow = c(n, m))
Example. Let us produce four different graphs based on the
built-in dataset AirPassengers
par(mfrow = c(2, 2))
plot(AirPassengers, type = "p")
title("Points")

126/209
Multiple figure environments
plot(AirPassengers, type = "l")
title("Lines")
plot(AirPassengers, type = "b")
title("Points & Lines")
plot(AirPassengers, type = "h")
title("High Density")

127/209
Exercise 7

128/209
Customizing plots
Once we are familiar with the base graphics in R, we can add
more information to our graphs.
Labels for the x and y axes can be included by using the xlab = " "
and ylab = " " arguments, respectively, in the plot() function.
plot(mtcars$mpg, mtcars$wt, xlab = "Miles per gallon", ylab =
"Weight of car (1000 lbs)")
There are also some situations where we would like to adjust
figure margins.

129/209
Customizing plots
Figure margins can be adjusted by using the par() function and
the mar = argument before we plot the graph.
par(mar = c(bottom, left, top, right)) – the arguments
bottom, left, top, and right are the sizes of the corresponding
margins.
By default R uses c(5.1, 4.1, 4.1, 2.1), where these numbers
represent the number of lines in each margin.
par(mar = c(5, 4, 4, 6))
plot(mtcars$mpg, mtcars$wt, xlab = "Miles per gallon", ylab =
"Weight of car (1000 lbs)") # will shrink the width

130/209
Customizing plots
We can control the range of axes scales by using the xlim
= c(min, max) and ylim = c(min, max) arguments.
Let us set the x axis scale from 0 to 40 and the range of the y
axis scale from 0 to 8 and see the difference.
plot(mtcars$mpg, mtcars$wt, xlab = "Miles per gallon", ylab =
"Weight of car (1000 lbs)", xlim = c(0, 40), ylim = c(0, 8))

131/209
Customizing plots
We can also change the color and the size of the symbol used in
plotting by using the col = and cex = arguments, respectively.
col = can take either an integer value to specify the color or a
character string (col = "green") giving the color name.
The colors() function gives a list of all 657 preset colors in
base R.
cex = requires a numeric value to indicate the proportional
increase or decrease in size relative to the default value of 1.
Let us change the color of the dots to "green" and decrease the
size of the symbol by 40% in the above plot.
plot(mtcars$mpg, mtcars$wt, xlab = "Miles per gallon", ylab =
"Weight of car (1000 lbs)", xlim = c(0, 40), ylim = c(0, 8),
col = "green", cex = 0.6)

132/209
Customizing plots
The function text() can be used to add a text label at a specific
(x, y) coordinate of the plot.
Let us add the label "(30, 2)" at the point (30, 2) in the above plot.
plot(mtcars$mpg, mtcars$wt, xlab = "Miles per gallon", ylab =
"Weight of car (1000 lbs)", xlim = c(0, 40), ylim = c(0, 8),
col = "green", cex = 0.6)
text(30, 2, label = "(30, 2)")
The background of a plot can be changed by par(bg = "color") –
color can be "red", "white", "green", etc.

133/209
Customizing plots
In R, it is possible to give data points different symbols and colors
depending on the level of a factor variable by using the
low-level function points().
Suppose we want to produce a scatterplot of mpg versus wt
for each level of cyl in the mtcars dataset.
Step 1. Let us include the type = "n" argument to create an empty
plot region.
plot(mtcars$mpg, mtcars$wt, xlab = "Miles per gallon",
ylab = "Weight (1000 lbs)", xlim = c(0, 40), ylim = c(0, 8), bty = "l",
type = "n")

134/209
Customizing plots
Step 2. Plot for cyl == 4
points(x = mtcars$mpg[mtcars$cyl == 4], y =
mtcars$wt[mtcars$cyl == 4], pch = 2, col = "red")
Next, plot for cyl == 6
points(x = mtcars$mpg[mtcars$cyl == 6], y =
mtcars$wt[mtcars$cyl == 6], pch = 3, col = "yellow")

135/209
Customizing plots
Then, plot for cyl == 8
points(x = mtcars$mpg[mtcars$cyl == 8], y =
mtcars$wt[mtcars$cyl == 8], pch = 4, col = "green")

136/209
Customizing plots
Let us finally include a legend to describe what each symbol and
color designate in the plot.
Cols <- c("red", "yellow", "green")
Symbol <- c(2, 3, 4)
Label <- c("4 Cylinder", "6 Cylinder", "8 Cylinder")
legend(x = 10, y = 4, col = Cols, pch = Symbol, legend =
Label, cex = 0.25)
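The same grouped scatterplot can also be drawn in a single plot() call by indexing the color and symbol vectors with the group of each point. This is an alternative sketch, not the stepwise points() approach shown in the slides; the helper name grp is illustrative.

```r
# One colour and one plotting symbol per cylinder group (same choices as in the slides)
Cols <- c("red", "yellow", "green")
Symbol <- c(2, 3, 4)

# Map each car to its group index: 4 cyl -> 1, 6 cyl -> 2, 8 cyl -> 3
grp <- as.integer(factor(mtcars$cyl, levels = c(4, 6, 8)))

# A single plot() call; col and pch are looked up per point
plot(mtcars$mpg, mtcars$wt, xlab = "Miles per gallon",
     ylab = "Weight (1000 lbs)", xlim = c(0, 40), ylim = c(0, 8),
     col = Cols[grp], pch = Symbol[grp], bty = "l")
legend("topright", legend = c("4 Cylinder", "6 Cylinder", "8 Cylinder"),
       col = Cols, pch = Symbol)
```

The stepwise version makes each layer explicit, while the indexed version scales more easily when a factor has many levels.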

137/209
Exporting plots
Plots in R can be exported to different formats such as jpeg,
pdf, and bmp.
The first option is to click on the Export button in the Plots tab.
The second option is to write code in an R script.
To save a plot in pdf format we use the pdf() function.
Similarly, we use the functions jpeg() and bmp() for jpeg and bmp
formats.
After running the plotting code we need to close the
plotting device using the dev.off() function.

138/209
Exporting plots
Let us export the plot we had in Slide 132 to jpeg and png
formats.
jpeg('my_plot.jpeg')
plot(mtcars$mpg, mtcars$wt, xlab = "Miles per gallon",
ylab = "Weight (1000 lbs)", xlim = c(0, 40), ylim = c(0, 8), col = "green",
cex = 0.6)
text(30, 2, label = "(30, 2)")
dev.off()
png('my_plot.png')
plot(mtcars$mpg, mtcars$wt, xlab = "Miles per gallon",
ylab = "Weight (1000 lbs)", xlim = c(0, 40), ylim = c(0, 8), col = "green",
cex = 0.6)
text(30, 2, label = "(30, 2)")
dev.off()
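The open-device, plot, close-device pattern can be sketched with pdf() as follows. The file is written to a temporary location through tempfile(), and the name out_file and the chosen width and height are assumptions made for this illustration.

```r
# Write a plot to a temporary PDF file (tempfile() avoids cluttering the working directory)
out_file <- tempfile(fileext = ".pdf")

pdf(out_file, width = 6, height = 4)   # open the PDF device; size is in inches
plot(mtcars$mpg, mtcars$wt, xlab = "Miles per gallon",
     ylab = "Weight (1000 lbs)")
dev.off()                              # close the device so the file is written

print(file.exists(out_file))
```

If dev.off() is skipped, the file may remain empty or incomplete because the device is still open.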

139/209
Exercise 8

140/209
Basic inferential analysis

141/209
Basic inference
After performing data cleaning and descriptive analysis, we need
to perform inferential analysis.
This includes estimation of model parameters and tests of
hypotheses.
Tests of independence: The chi-square test of independence
tests whether two categorical variables are independent.
We use the function chisq.test() to perform the test.
Let us consider the Arthritis dataset in vcd package and test if
there is association between Treatment and Improved.
library(vcd)
mytable <- xtabs(~ Treatment + Improved, data = Arthritis)
chisq.test(mytable)

142/209
Basic inference
The null hypothesis, i.e., there is no association between the two
variables, will be rejected since the p-value = 0.001463 < 0.05.
We may also use the prop.test() function to analyze this dataset.
The proportions of patients in the Placebo group within each
improvement category are 0.69, 0.50, and 0.25 for None,
Some, and Marked, respectively.
These proportions differ, but is the difference statistically supported?
Let us consider the vector of counts in the Placebo
group, i.e., (29, 7, 7), and the totals = (42, 14, 28).

143/209
Basic inference
Row_Total <- margin.table(mytable, 2) # totals for None, Some, Marked
Placebo <- mytable[1, ]
prop.test(Placebo, Row_Total)
We can conclude that the proportions differ statistically since
the p-value = 0.001463 < 0.05.
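The agreement between the two tests can be reproduced without the vcd package by typing in the counts quoted in the slides. In this sketch the Treated row is derived as the totals minus the Placebo counts, and the object names (tab, p_chisq, p_prop) are illustrative.

```r
# Counts as quoted in the slides: Placebo (29, 7, 7) and totals (42, 14, 28)
# for None, Some, Marked; the Treated row is totals minus Placebo
Placebo <- c(None = 29, Some = 7, Marked = 7)
Total <- c(None = 42, Some = 14, Marked = 28)
tab <- rbind(Placebo = Placebo, Treated = Total - Placebo)

p_chisq <- chisq.test(tab)$p.value            # test of independence on the 2 x 3 table
p_prop <- prop.test(Placebo, Total)$p.value   # equality of three proportions

print(round(Placebo / Total, 2))
print(c(chisq = p_chisq, prop = p_prop))
```

Both calls compute the same Pearson chi-square statistic here, which is why the p-values coincide.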

144/209
Correlations
We use correlation coefficients to measure the association
between two quantitative variables.
The Pearson, Spearman, and Kendall correlation coefficients can
be obtained using: cor(x, use = , method = )
x = matrix or data frame, use = option to handle missing data,
method = "pearson", "spearman", or "kendall".
A partial correlation is a correlation between two quantitative
variables, controlling for one or more other quantitative variables.
Code to use:
library(ggm)
pcor(u, S) # First two numbers in u = indices of the variables to
be correlated; remaining numbers = indices of the variables partialled out.
S is the covariance matrix.
Let us consider the built-in dataset state.x77 to illustrate the
procedure.

145/209
Correlations
states <- state.x77[, 1:6]
colnames(states) gives
[1] "Population" "Income" "Illiteracy" "Life Exp" "Murder"
"HS Grad"
cor(states) gives
Let us find the partial correlation coefficient between var1
(Population) and var5 (Murder rate), keeping the effects of var2,
var3, and var6 constant.

146/209
Correlations
library(ggm)
pcor(c(1, 5, 2, 3, 6), cov(states)) # gives [1] 0.346
0.346 = partial correlation between Population (var1) and Murder rate
(var5)
cov(states) is the covariance matrix among variables in states.
The function cor.test(x, y, alternative = , method = ) tests the
significance of a single correlation coefficient.
corr.test() in the psych package tests all pairwise correlations, and
print(corr.p(Cor$r, n = ), short = FALSE) prints the p-values
and confidence intervals.
Let us test which of the correlations given above are statistically
significant.
library(psych)
Cor <- corr.test(states, use = "complete")
print(corr.p(Cor$r, n = 50), short = FALSE)

147/209
Correlations
library(psych)
Cor <- corr.test(states, use = "complete")
print(corr.p(Cor$r, n = 50), short = FALSE)
There is a strong (0.7) and statistically significant
(p-value = 0.00) correlation between Illiteracy and
Murder rate
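A partial correlation can also be computed in base R, which helps to see what "controlling for" means. The sketch below is not the ggm implementation: it correlates the residuals of Population and Murder after regressing each on the control variables, and cross-checks the result against the inverse correlation matrix. It should agree with the value of about 0.346 reported by pcor(). All object names are illustrative.

```r
states <- state.x77[, 1:6]

# Remove the linear effect of Income (2), Illiteracy (3), and HS Grad (6)
# from Population (1) and Murder (5), then correlate what is left
controls <- states[, c(2, 3, 6)]
res_pop <- resid(lm(states[, 1] ~ controls))
res_murder <- resid(lm(states[, 5] ~ controls))
partial_r <- cor(res_pop, res_murder)

# Equivalent route through the inverse of the correlation matrix
P <- solve(cor(states[, c(1, 5, 2, 3, 6)]))
partial_r_inv <- -P[1, 2] / sqrt(P[1, 1] * P[2, 2])

print(round(c(residuals = partial_r, inverse = partial_r_inv), 3))
```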

148/209
T-tests
Tests to compare the mean of continuous data either against an a
priori stipulated value or between two groups.
One Sample t Test
H0: µ = µ0 vs H1: µ < µ0, µ > µ0, or µ ≠ µ0
We can use the function t.test() to perform a t-test.
t.test(data vector, mu = proposed mean value, optional arguments)
Options: alternative = "greater"; alternative = "less"; and
conf.level = X.
Example. Use the following data to test whether the mean value is 7500.
Y <- c(5660, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515,
8230, 8770, 8800, 8000, 7750, 6950)
t.test(Y, mu = 7500)
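To show what t.test() computes for this example, the following sketch evaluates the one-sample t statistic and its two-sided p-value by hand and compares them with the function's output. The names mu0, t_stat, p_value, and res are illustrative.

```r
Y <- c(5660, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515,
       8230, 8770, 8800, 8000, 7750, 6950)
mu0 <- 7500

n <- length(Y)
t_stat <- (mean(Y) - mu0) / (sd(Y) / sqrt(n))   # t = (ybar - mu0) / (s / sqrt(n))
p_value <- 2 * pt(-abs(t_stat), df = n - 1)     # two-sided p-value with n - 1 df

print(c(t = t_stat, p = p_value))

# The same numbers as reported by t.test()
res <- t.test(Y, mu = mu0)
print(c(t = unname(res$statistic), p = res$p.value))
```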

149/209
T-tests
There is not sufficient statistical evidence to reject the null
hypothesis (p-value = 0.1636 > 0.05).
The mean value is not significantly different from 7500.
We can apply the logic of the one-sample t-test to test whether
two population means are different.
We may need to test for mean differences in two dependent
or independent samples.

150/209
T-tests
The independent t-test assumes that the two groups are independent
and that the data are sampled from normal populations.
It can be performed by:
t.test(y ~ x, option = , data) # y is numeric and x is a
dichotomous variable.
t.test(y1, y2, option = ) # both y1 and y2 are numeric
The option can be var.equal = TRUE, alternative = "less",
alternative = "greater"
A dependent t-test assumes that the difference between groups
is normally distributed.
A dependent t-test can be performed by:
t.test(y1, y2, paired = TRUE) # y1 and y2 are numeric vectors
for the two dependent groups.

151/209
T-tests
Let us consider part of the built in dataset PlantGrowth and test
whether the mean weight for trt1 is the same as that of trt2.
Weight1 <-
PlantGrowth$weight[PlantGrowth$group == "trt1"]
Weight2 <-
PlantGrowth$weight[PlantGrowth$group == "trt2"]
t.test(Weight1, Weight2)

152/209
T-tests
The null hypothesis is rejected (p-value = 0.009298 < 0.05) and
we conclude that the mean weight for trt1 is statistically
different from that of trt2.
Consider a dataset (marg) on a study to check whether
cholesterol was reduced after using a certain brand of margarine
as part of a low fat, low cholesterol diet. This data set contains
information on 18 people using margarine to reduce cholesterol
over two time points.
Test whether the mean cholesterol level before using
the margarine is the same as that after using the margarine.
Marg <-
read.csv("C:/Users/User/OneDrive/Desktop/QuantDataAna/marg.csv", header = TRUE)
with(Marg, t.test(Before, After4weeks, paired = TRUE))

153/209
T-tests
The null hypothesis will be rejected (p-value = 1.958 × 10^-11 < 0.05)
and the conclusion is that the mean
cholesterol level before using the margarine differs from that after
using the margarine.

154/209
Nonparametric tests
The t-test is dependent on the normality assumption of the data
being analyzed.
It may happen that the distribution of the data under consideration
is not normal.
In such cases, we can use the nonparametric tests as alternatives
to t-tests.
We can use the function wilcox.test() to perform rank-based
(nonparametric) tests.
Let us recall the dataset given in Slide 148 and test whether the
median value = 7500 or not.
wilcox.test(Y, mu=7500)

155/209
Nonparametric tests
The p-value = 0.2219 > 0.05 and hence the null hypothesis will
be retained.
The Wilcoxon rank sum test for two independent groups can be
performed by:
wilcox.test(y ~ x, data) # y is numeric and x is dichotomous.
wilcox.test(y1, y2) # y1 and y2 are outcome variables.
The Wilcoxon signed rank test is a nonparametric alternative to the
dependent sample t-test:
wilcox.test(y1, y2, paired = TRUE)

156/209
Exercise 9

157/209
Intermediate statistical methods

158/209
Comparing more than two groups - ANOVA
ANOVA is an extension of the t-test, and compares means of
two or more groups.
One-way ANOVA: y ~ A
Two-way factorial ANOVA: y ~ A * B
Randomized block ANOVA: y ~ B + A; B is a blocking factor.
y is the dependent variable and the letters A and B represent
factors.
Example. Consider the cholesterol dataset in the multcomp
package. i) Find the group means, ii) Produce and interpret
boxplots, iii) Test for group mean differences.
i) library(multcomp)
attach(cholesterol)
aggregate(response, by = list(trt), FUN = mean)

159/209
ANOVA
ii) boxplot(response ~ trt, data = cholesterol)
The boxplots indicate that the mean responses differ between
the different groups.

160/209
ANOVA
fit <- aov(response ~ trt, data = cholesterol)
summary(fit)
The above ANOVA table shows that the mean response in at
least one of the groups differs from that of the others (p-value =
9.82 × 10^-13 < 0.05).
Let us further proceed to test which pairs of mean responses
differ statistically.

161/209
ANOVA
The function TukeyHSD() performs Tukey's HSD test, the most commonly
used test to compare all pairwise differences between group means.
Let us apply it to the present case.
TukeyHSD(fit)

162/209
One-way ANOVA
Testing model assumptions
Assumptions: the dependent variable is normally distributed with
equal variance in each group.
Use a Q-Q plot to assess the normality assumption:
library(car)
qqPlot(lm(response ~ trt, data = cholesterol))
Observe that qqPlot() requires an lm() fit.
The normality assumption is not violated as the points are
close to the reference line.

163/209
One-way ANOVA
The constant (homogeneity) variance assumption can be
assessed by Bartlett’s test:
bartlett.test(response ~ trt, data = cholesterol)
There is not sufficient statistical evidence (p-value = 0.9653
> 0.05) to reject the null hypothesis of constant variance.
The ANOVA model appears to fit the data correctly since
the above assumptions are fulfilled.

164/209
Two-way ANOVA
Two-way ANOVA simultaneously evaluates the effect of two grouping
variables on a response variable.
These grouping variables are also known as factors.
Three possible effects in this design: two main effects and one
interaction effect.
Example. Sixty guinea pigs are randomly assigned to receive
one of three dose levels of vitamin C (0.5, 1, or 2 mg/day) and
one of two delivery methods (orange juice or ascorbic acid), and
tooth length was measured (see ToothGrowth dataset).
i) Find the group means of tooth length, ii) Produce and
interpret box plots, iii) Test whether the main effects (supp and
dose) and the interaction between these factors are significant.

165/209
Two-way ANOVA
attach(ToothGrowth)
i) aggregate(len, by=list(supp, dose), FUN=mean)
ii) boxplot(len ~ supp:dose, data = ToothGrowth)

166/209
Two-way ANOVA
iii) ToothGrowth$dose <- factor(ToothGrowth$dose) # Converts "dose" to a factor in the data frame used by aov()
fit <- aov(len ~ supp * dose, data = ToothGrowth)
summary(fit)
Each of the main effects and the interaction are statistically
significant.
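Whether dose enters the model as a number or as a factor changes the ANOVA table, because aov() looks up variables in the data frame passed through data = first. The sketch below fits both versions and compares the degrees of freedom; the local copy tg and the other object names are illustrative.

```r
# dose kept numeric: aov() fits a single linear trend in dose (1 df)
fit_numeric <- aov(len ~ supp * dose, data = ToothGrowth)

# dose converted to a factor: one mean per dose level (3 levels -> 2 df)
tg <- transform(ToothGrowth, dose = factor(dose))   # local copy; the original is untouched
fit_factor <- aov(len ~ supp * dose, data = tg)

df_numeric <- summary(fit_numeric)[[1]]$Df
df_factor <- summary(fit_factor)[[1]]$Df
print(rbind(numeric = df_numeric, factor = df_factor))
```

With the factor coding, dose and the interaction each use 2 degrees of freedom, which is the two-way factorial design described in the example.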

167/209
Two-way ANOVA
Checking for normality assumption
library(car)
qqPlot(lm(len ~ supp * dose, data = ToothGrowth))
What can you say about the normality assumption based on the
above plot?

168/209
Two-way ANOVA
Checking constant variance
library(car)
leveneTest(len ~ factor(supp) * factor(dose), data = ToothGrowth)
What can be said about the constant variance
assumption?

169/209
Regression
Regression analysis can be used to:
1 identify the explanatory variables that are related to a response
variable,
2 describe the type of the relationships (if any),
3 predict the value of the response variable from the explanatory
variables.
Some questions for which a regression model is suitable include:
What is the relationship between surface stream salinity and
paved road surface area?
Which qualities of an educational environment are most strongly
related to higher student achievement scores?
What is the form of the relationship between blood pressure, salt
intake, and age? Is it the same for men and women?

170/209
Regression

171/209
Regression
A function for fitting a linear model:
myfit <- lm(formula, data) # formula = Y ~ X1 + ... + Xk
Some symbols to be used in the formula:
~ : Separates the response variable on the left from the explanatory
variables on the right.
+ : Separates predictor variables.
: : Denotes an interaction between predictor variables.
* : A shortcut for denoting all possible interactions.
-1 : Suppresses the intercept. y ~ x - 1 fits a regression of y on
x without an intercept.
Other functions:
summary(): Detailed results for the fitted model.
coefficients(): Lists the intercept and slopes for the fitted model.
confint(): Confidence intervals for the model parameters.

172/209
Regression
fitted(): Lists the predicted values in a fitted model.
residuals(): Lists the residual values in a fitted model.
anova(): An ANOVA table for a fitted model.
vcov(): Lists the covariance matrix for model parameters.
AIC(): Prints Akaike’s Information Criterion.
plot(): Diagnostic plots for evaluating the fit of a model.
predict(): Uses a fitted model to predict response values for a
new dataset.
Code for polynomial regression of degree n:
fit1 <- lm(y ~ x + I(x^2) + ... + I(x^n), data = )
A scatter plot matrix can be generated with (car package):
scatterplotMatrix(data, smoother.args = list(lty = 2))

173/209
Regression
Example. Consider the built in dataset ”women” which
provides the height and weight for a set of 15 women, and fit a
regression model.
Let us first give the scatterplot of weight versus height.
plot(women$height,women$weight)
This scatterplot suggests a linear relationship between weight
and height of women.

174/209
Regression
fit <- lm(weight ~ height, data = women)
summary(fit) # Model summary

175/209
Regression
According to the above result, height is found to be a
statistically significant factor for weight of women.
When height of a woman increases by 1 unit the weight
increases by 3.45 units on average.
Let us now superimpose the fitted regression line on the
scatterplot.
abline(fit)

176/209
Regression
Let us rerun the regression with a quadratic term (that is, X^2):
fit2 <- lm(weight ~ height + I(height^2), data = women)
lines(women$height, fitted(fit2))
Does the quadratic term improve prediction?
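One way to approach that question is to compare the two nested models numerically. The sketch below uses adjusted R-squared and an F-test through anova(); these are common choices rather than the only ones, and the names adj_r2 and cmp are illustrative.

```r
fit <- lm(weight ~ height, data = women)                  # straight line
fit2 <- lm(weight ~ height + I(height^2), data = women)   # adds a quadratic term

# Adjusted R-squared for each model (higher is better)
adj_r2 <- c(linear = summary(fit)$adj.r.squared,
            quadratic = summary(fit2)$adj.r.squared)
print(round(adj_r2, 4))

# F-test of the nested models: does the quadratic term reduce the residual error?
cmp <- anova(fit, fit2)
print(cmp)
```

A small p-value in the second row of the anova() table suggests that the quadratic term explains variation that the straight line leaves behind.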

177/209
Regression
Categorical Independent Variables:
These are recoded into a set of separate binary variables before
we enter them into a regression model.
Such recoding is known as “dummy coding”.
For example, R will automatically create a dummy variable
SexMale from the factor variable Sex.
1 if a person is Male
0 if a person is Female
The default option in R is to use the first level of the factor as a
reference and interpret the remaining levels relative to this level.
We can use the function contrasts() to see the coding that R
has used to create the dummy variables.
contrasts(Sex)

178/209
Regression
Suppose that the coefficient corresponding to a male is found to
be 2.5 in a fitted regression model where the response is score.
Note here that sex = Female is the reference category.
The interpretation will be: a Male would get 2.5 points more
than a female on average.
We can use the function relevel() to change the reference category
to Male as follows:
mutate(Sex = relevel(Sex, ref = "Male"))
A dummy variable SexFemale will be created once we execute
the above code.
In general, a categorical variable with d levels will be
transformed into d - 1 variables each with two levels.
Suppose for example that a variable Education has 4 levels:
None, Primary, Secondary, Tertiary.

179/209
Regression
The three dummy variables: Primary, Secondary and Tertiary.
▶ If Education = Primary, then the column Primary would be
coded with a 1 while Secondary and Tertiary would be with a 0.
▶ If Education = Secondary, then the column Secondary would be
coded with a 1 while Primary and Tertiary would be with a 0.
▶ If Education = None, then each of the columns Primary,
Secondary and Tertiary would be coded with a 0.
▶ Note that we should first convert a character vector to a factor
so as to have such dummy coding in R.
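The dummy coding described above can be inspected directly with contrasts() and model.matrix(). The Education vector below is a small hypothetical example created for this illustration; it is not taken from any dataset in the slides.

```r
# Hypothetical Education variable with the four levels used above
Education <- factor(c("None", "Primary", "Secondary", "Tertiary", "Primary"),
                    levels = c("None", "Primary", "Secondary", "Tertiary"))

# Coding R applies by default: the first level (None) is the reference
print(contrasts(Education))

# The design matrix that lm() would build: an intercept plus d - 1 = 3 dummy columns
X <- model.matrix(~ Education)
print(X)
```

The first row (Education = None) has zeros in all three dummy columns, which matches the reference-level rule.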

180/209
Regression
Multiple linear regression is an extension of the simple linear
regression.
Example. Use the built in dataset ”state.x77” to explore the
relationship between a state’s murder rate and other
characteristics including population, illiteracy rate, average
income, and frost levels.
Let us extract the variables that we are interested in.
states <- as.data.frame(state.x77[, c("Murder",
"Population", "Illiteracy", "Income", "Frost")])
cor(states) # Gives pairwise correlations
Scatter plot matrix
library(car)
scatterplotMatrix(states)

181/209
Regression
Let us fit the multiple linear regression model:
fit <- lm(Murder ~ Population + Illiteracy + Income +
Frost, data = states)
summary(fit)

182/209
Regression
Population and Illiteracy are significant predictors of Murder.
For a one unit increase in Illiteracy the Murder rate increases by
4.14 units on average keeping the other factors fixed.

183/209
Regression diagnostics
Normality can be assessed by the qqPlot() function.
Example.
library(car)
states <- as.data.frame(state.x77[, c("Murder",
"Population", "Illiteracy", "Income", "Frost")])
fit <- lm(Murder ~ Population + Illiteracy + Income +
Frost, data = states)
qqPlot(fit, labels = row.names(states), id.method = "identify",
simulate = TRUE, main = "Q-Q Plot")
simulate = TRUE adds a 95% confidence envelope using a
parametric bootstrap.
id.method = "identify" allows us to interactively add labels to
the graph using the mouse.

184/209
The studentized residuals approximately follow a normal distribution,
so the normality assumption appears reasonable.
Independence can be checked by the Durbin–Watson test.
library(car)
durbinWatsonTest(fit)

185/209
Homoscedasticity can be tested by the ncvTest() function.
library(car)
ncvTest(fit)

186/209
Multicollinearity can be assessed by variance inflation factor as:
library(car)
vif(fit)
Since the variance inflation factors for each of the above
variables are small (less than 5), we can conclude that
multicollinearity is not a problem here.
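The variance inflation factor of predictor j is 1 / (1 - R2_j), where R2_j is the R-squared from regressing that predictor on the remaining predictors. The sketch below computes this in base R as an illustration of the definition; it is not the car implementation, and the names predictors, vif_manual, and vif_inverse are illustrative.

```r
states <- as.data.frame(state.x77[, c("Murder", "Population",
                                      "Illiteracy", "Income", "Frost")])
predictors <- c("Population", "Illiteracy", "Income", "Frost")

# VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing predictor j on the others
vif_manual <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = states))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))

# Cross-check: the VIFs are the diagonal of the inverse correlation matrix
vif_inverse <- diag(solve(cor(states[, predictors])))
print(round(vif_inverse, 2))
```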

187/209
Logistic Regression — Binary
Observe that a linear regression model is employed when the
dependent variable is quantitative.
However, we may encounter a categorical response variable with
a success/failure scenario, such as democracy / autocracy, war /
peace, trade agreement / no trade agreement, underweight/
normal.
We can use a binary logistic regression to model such response
variable as a function of one or more independent variables.
Unlike in a linear regression model, in logistic regression we predict
the probabilities of the response variable as a function of the
independent variable(s).
We can use the function glm() to fit a logistic regression model.

188/209
The hypothesis of interest in logistic regression:
H0: An independent variable has no effect on the probability of
success of the response
H1: An independent variable has an effect on the probability of
success of the response
Note that the dependent variable should be coded as 0/1 (or as a
two-level factor) before we proceed to use glm().
Example. Let us consider the data on the passengers of the
Titanic in 1912 and investigate whether age influenced the
probability of traveling in first class.

189/209
Load the data into R:
Titanic <- read.csv("…/Titanic.csv", header = TRUE)
Let us first recode the response variable (pclass) into binary.
library(dplyr)
Titanic <- Titanic %>%
mutate(class = as.numeric(recode(pclass, '1st' = '1', '2nd' = '0',
'3rd' = '0')))
We use the following code to fit a simple binary logistic
regression model of class on age.
Lfit <- glm(class ~ age, data = Titanic, na.action =
na.exclude, family = "binomial")
summary(Lfit)

190/209
The coefficient for age is highly significant (p-value <
2 × 10^-16).
The odds ratios for the regression coefficients can be obtained by
exp(coef(Lfit))

191/209
The odds ratio corresponding to age = exp(0.067767) = 1.07
Interpretation:
▶ For every unit (one year) increase in age, the odds of traveling
in first class increase by a factor of 1.07.
▶ Let us clarify this by taking specific age values:
At age = 30, the predicted log-odds of traveling in first class =
−3.187456 + 0.067767 × 30 = −1.154446.
Taking the exponential of −1.154446 yields odds of
0.315232127.
At age = 31, the predicted log-odds of traveling in first class =
−3.187456 + 0.067767 × 31 = −1.086679.
exp(−1.086679) yields odds of 0.337334925.
Dividing the odds at age = 31 by the odds at age = 30,
i.e., 0.337334925/0.315232127, gives 1.07.
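The arithmetic above can be reproduced in a few lines of R. The coefficients are typed in from the values reported in the slides rather than taken from a fitted object, and the helper log_odds() and the other names are illustrative.

```r
# Coefficients as reported in the slides for the fitted model
b0 <- -3.187456   # intercept
b1 <- 0.067767    # slope for age

log_odds <- function(age) b0 + b1 * age

odds_30 <- exp(log_odds(30))
odds_31 <- exp(log_odds(31))

print(c(odds_30 = odds_30, odds_31 = odds_31))
print(odds_31 / odds_30)   # ratio of odds for a one-year increase in age
print(exp(b1))             # the same quantity obtained directly from the slope
```

The ratio does not depend on the starting age, which is why a single odds ratio summarizes the effect of age.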

192/209
Note
When the odds ratio for a given predictor is less than one, an
increase in the value of the predictor leads to decreased odds
of success on the response.
If the odds ratio for a given predictor is exactly 1, the odds of
success on the response do not change when the value of the
predictor changes.
Odds can be converted to a probability by using the following
formula: probability = odds / (1 + odds).
For example, the predicted probability of traveling in first class
when a passenger is 30 years old =
0.315232127 / (1 + 0.315232127) = 0.23967794.
We can also predict the probabilities for a sequence of age
values created beforehand.
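As a small sketch of this conversion, the code below turns the predicted log-odds at age 30 into a probability in two ways. The helper odds_to_prob() is a hypothetical name used only here; plogis() is the inverse-logit function in base R.

```r
# Convert odds to a probability: p = odds / (1 + odds)
odds_to_prob <- function(odds) odds / (1 + odds)

# Predicted log-odds at age 30, using the estimates quoted earlier
log_odds_30 <- -3.187456 + 0.067767 * 30
p_30 <- odds_to_prob(exp(log_odds_30))

# plogis() maps log-odds straight to a probability
p_30_direct <- plogis(log_odds_30)

round(c(p_30 = p_30, p_30_direct = p_30_direct), 4)
```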

193/209
Sq_age <- seq(0, 80, 1)
Prob_age <- predict(Lfit, newdata = data.frame(age = Sq_age),
                    type = "response")
Prob_age  # predicted probabilities for ages 0 to 80
Let us plot age versus the probability of traveling in first class.

194/209
plot(Sq_age, Prob_age, xlab = "Age",
     ylab = "Probability of traveling in first class", type = "l")
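When Titanic.csv is not at hand, the same steps can be tried on simulated data. The sketch below is only an illustration: the data frame sim, the sample size, and the use of the coefficients quoted earlier as "true" values are all assumptions, not part of the Titanic analysis.

```r
set.seed(123)                                # reproducible simulation

# Simulated (hypothetical) passengers
n   <- 1000
age <- round(runif(n, 1, 80))
p   <- plogis(-3.187456 + 0.067767 * age)    # assumed true probabilities
sim <- data.frame(age = age, class = rbinom(n, 1, p))

# Fit the binary logistic regression and predict over an age grid
sim_fit  <- glm(class ~ age, data = sim, family = "binomial")
Sq_age   <- seq(0, 80, 1)
Prob_age <- predict(sim_fit, newdata = data.frame(age = Sq_age),
                    type = "response")

head(cbind(Sq_age, Prob_age))
```

Passing Sq_age and Prob_age to plot() with type = "l" then draws the fitted curve for the simulated data.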

195/209
How good is the model?
A model is considered good when it correctly predicts a high
proportion of successes (1) and failures (0).
We can use the ROC (receiver operating characteristic) curve to
assess how well the model predicts 1s and 0s.
The y-axis of the ROC curve is sensitivity: the probability of
correctly predicting a 1.
The x-axis of the ROC curve is specificity: the probability of
correctly predicting a 0. The pROC package used below draws this
axis reversed, from 1 to 0.
The farther the curve lies above the diagonal, the better the
model predicts 1s and 0s.
The area under the curve (AUC) is 1 (100%) if the model
predicts everything correctly; an AUC of 0.5 is no better than
random guessing.
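To make sensitivity and specificity concrete, here is a small sketch with made-up values; actual, prob and the 0.5 cut-off are hypothetical and chosen only for illustration. Each point on a ROC curve corresponds to one such cut-off.

```r
# Hypothetical observed outcomes and predicted probabilities
actual <- c(1, 1, 1, 0, 0, 0, 0, 1)
prob   <- c(0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7)

cutoff    <- 0.5                          # assumed classification threshold
predicted <- as.numeric(prob >= cutoff)

# Sensitivity: share of actual 1s that are predicted as 1
sensitivity <- sum(predicted == 1 & actual == 1) / sum(actual == 1)
# Specificity: share of actual 0s that are predicted as 0
specificity <- sum(predicted == 0 & actual == 0) / sum(actual == 0)

c(sensitivity = sensitivity, specificity = specificity)
```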

196/209
install.packages("pROC")  # needed only once
library(pROC)
# With na.exclude, predictions are padded with NA for rows with missing age
Prob_trav <- predict(Lfit, type = "response")
Titanic$Prob_trav <- Prob_trav
ROC <- roc(Titanic$class, Titanic$Prob_trav)
auc(ROC)
plot(ROC, print.auc = TRUE, col = "blue")
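The AUC can also be read as the probability that a randomly chosen 1 receives a higher predicted probability than a randomly chosen 0. The base-R sketch below computes it that way from ranks, using made-up values (actual and prob are hypothetical). On real data it should agree with auc(ROC), though that is worth checking rather than assuming.

```r
# Hypothetical observed outcomes and predicted probabilities
actual <- c(1, 1, 1, 0, 0, 0, 0, 1)
prob   <- c(0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7)

n1 <- sum(actual == 1)   # number of 1s
n0 <- sum(actual == 0)   # number of 0s

# Rank-sum (Mann-Whitney) form of the AUC; tied values share an average rank
auc_rank <- (sum(rank(prob)[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
auc_rank
```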

197/209
Logistic Regression — Multinomial
Multinomial Logistic Regression (MLR) is conducted when the
outcome variable is nominal with more than two levels.
Example. In an election, voters may choose any one of Party A,
Party B or Party C. Their choice might be affected by the
party’s economic policy, foreign policy, educational levels of
candidates, etc.
In MLR, the log odds of each outcome category relative to a
baseline category are modeled as a linear combination of the
predictor variables.
MLR can be viewed as estimating a separate binary logistic
regression model for each non-baseline category of the outcome.
If the outcome variable has M levels, then we will fit M-1 binary
logistic regression models.
Each model has its own intercept and regression coefficients: the
predictors can affect each category differently.
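With M = 3 outcome levels there are two linear predictors, one per non-baseline category, and the category probabilities follow from them. The sketch below uses the election example with Party A taken as the baseline; both that choice and the log-odds in eta are made up for illustration.

```r
# Hypothetical log-odds of each party relative to the baseline, Party A
eta <- c(PartyB = -0.5, PartyC = -1.0)

# The baseline has log-odds 0 by construction, hence the leading 1
denom <- 1 + sum(exp(eta))
probs <- c(PartyA = 1 / denom, exp(eta) / denom)

round(probs, 3)
sum(probs)   # the three probabilities add up to 1
```

Dividing any probability by the baseline probability recovers exp() of the corresponding log-odds, which is what the M-1 fitted equations describe.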

198/209
Consider the data set Highschool.csv.
▶ Outcome variable: Program_Type = {academic, general,
vocational}
▶ Predictors: Writing_Score, Math_Score, Sch_Type, Sex, Ses.
Import the data into R:
Mlog <- read.csv("…Data/Highschool.csv", header = TRUE)
Suppose we choose academic as the baseline category for the
outcome variable.
Mlog$Program_Type <- relevel(factor(Mlog$Program_Type),
                             ref = "academic")

199/209
library(nnet)
Mlogit <- multinom(Program_Type ~ Writing_Score + Math_Score +
                     Sch_Type + Sex + Ses, data = Mlog)
summary(Mlogit)

200/209
Let us now calculate the z-scores and p-values.
z <- summary(Mlogit)$coefficients / summary(Mlogit)$standard.errors
z
Note that the coefficients in the row "general" compare
Program_Type = "general" with the baseline
Program_Type = "academic".

201/209
p <- (1 - pnorm(abs(z), 0, 1)) * 2  # two-sided p-values
p
Only coefficients with p-values < 0.05 should be interpreted.
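The same calculation on a single made-up coefficient shows what the code does; the estimate and standard error below are hypothetical.

```r
est <- 0.98   # hypothetical coefficient
se  <- 0.50   # hypothetical standard error

z_one <- est / se                        # Wald z-score
p_one <- 2 * (1 - pnorm(abs(z_one)))     # two-sided p-value, standard normal

c(z = z_one, p = p_one)
```

A z-score of about 1.96 in absolute value corresponds to a two-sided p-value of about 0.05, so larger absolute z-scores fall below the 0.05 threshold.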
