Quantitative Data Analysis using R
By Taddesse Kassahun
Department of Statistics
Addis Ababa University
Introduction
What are quantitative data?
Quantitative data refer to a set of values of observations that
can be counted or measured.
They answer questions of the following form.
▶ How many?
▶ How often?
▶ How much?
Quantitative data are mainly collected for statistical analysis.
Some examples of quantitative data:
▶ Number of times students in a college updated their phones in a
quarter.
▶ Percentage increase in revenue of wholesalers with the inclusion
of a new product.
▶ Price of Teff/kg in different kebeles of a region.
Analysis methods of quantitative data
Quantitative data can be analyzed by using descriptive and
inferential methods.
Descriptive analysis can be performed using tables, graphs, and
summary measures.
Tables can be classified into two broad classes: simple and
complex tables.
Commonly used graphs: bar, pie, histogram, boxplot, and line.
Some widely used summary measures include mean, median,
mode, frequency, minimum, maximum, total, standard deviation,
range, and percent.
Inferential methods involve estimation, model fitting, and
hypothesis testing.
Steps in inferential analysis
Why R for data management and analysis?
Where to find R?
R is freely available from the Comprehensive R Archive Network
(CRAN) at http://cran.r-project.org.
Once R is installed, go to
https://www.rstudio.com/products/rstudio/download/
to install RStudio.
RStudio layout
The location of each pane can be customized by clicking Tools
⇒ Global Options ⇒ Pane Layout.
RStudio layout - console
All the output generated by R, except plots, goes to the console.
R evaluates all code in the console.
R function calls can be entered into the console to
produce output. For example, try the following.
> print("Hello IFA") gives [1] "Hello IFA"
We can enter commands one at a time at the command prompt
(>).
We may also use R as a calculator. Try the following.
> 5+4 gives [1] 9
> 5-4 gives [1] 1
> 5*4 gives [1] 20
> 5/4 gives [1] 1.25
It is not recommended to enter longer pieces of code into the
console. Instead, use the script editor (source) window.
Create an R script by clicking File ⇒ New File ⇒ R Script.
In general, execute code from saved R scripts for the sake of
reproducibility.
RStudio layout - source
The R scripts are written in the source pane where we can run a
set of commands at a time.
A command in the source pane can be executed by selecting the
line and pressing Ctrl + Enter or hitting the Run icon.
The corresponding result should be seen in the console window.
Comments can be incorporated after the hash sign #.
> print("Hello IFA") # this function prints Hello IFA.
To comment out multiple lines of code: highlight the lines and
then press Ctrl + Shift + C.
Save the R scripts by clicking the save shortcut or pressing Ctrl +
S and giving a file name, e.g., "test.R".
A time stamp can be included in R scripts by writing ts and then
pressing the Shift + Tab keys.
To open R scripts click on File ⇒ Open File ...
KEEP ALL THE CODE YOU USE IN R SCRIPTS!
Environment/history/connection/tutorial
Understanding environments is not necessary for day-to-day use
of R.
In the present training, we will only work on the Global
Environment.
All the objects assigned in R are stored and displayed in the
Global Environment.
We can also store an output in an object of arbitrary name,
say x, using the assignment operator (<-).
If we write x <- 5+5 in the console, x will appear in the
environment pane under the Values section (see figure below).
To display its value, type x in the script and execute it, or type x
into the console and press Enter.
It’s important to always start with a clean environment when
working on an R project.
To clean the working environment, select the “Restart R” option
from the “Session” menu at the top of the window or press Ctrl +
Shift + F10.
The History tab contains a list of commands entered into R
console.
Previously used commands can be searched through the History
tab.
The Connections tab allows us to connect to various data sources
such as external databases.
The Tutorials tab is used to run tutorials in RStudio.
Files/plots/packages/help/viewer
The Files tab lists external files in the current working directory.
We can open, copy, rename, move and delete files listed in the
window.
The Plots tab is where all the plots we create in R are displayed.
There is an option of exporting plots to an external file using the
Export drop down menu.
The Packages tab lists all of the packages that are installed on
the computer.
We can install new packages & update existing ones by clicking
on the Install and Update buttons.
We view the help coming from R documentation in the Help tab.
The Viewer tab displays local web content such as web graphics
generated by some packages.
Projects in R
RStudio Projects help to keep things organised.
An RStudio Project keeps all of our R scripts, R functions and
data together in one place.
To create a project, open RStudio and select File ⇒ New
Project... from the menu.
In the next window select New Project.
Enter the name of the directory we want to create in the
Directory name: field.
Click the Browse... button and navigate to change the location
of the directory.
Tick the Open in new session box and then hit the Create
Project button.
R workspace
The working directory is the default location where R looks for
files to load and saves any files we write.
The file path of the current working directory is shown at the
top of the Console pane and can also be obtained with > getwd()
A directory structure can be created by clicking on the New
folder button in the Files pane.
> ls() — Lists the objects in the current workspace.
> rm(objectlist) — Removes (deletes) one or more objects.
> rm(list = c("Object1", "Object2", "Object3"))
Basics
R is a case-sensitive language: r is different from R.
An object in R is anything that can be assigned a value, e.g.,
> X = 5
> Name = "Taddesse"
Objects can also be the output of a plot, a summary of a statistical
analysis, or a set of R commands that perform a specific task.
Commands in R can be separated by either a new line or a
semicolon (;).
If a continuation prompt + appears in the console after code is
executed, the command is incomplete (for example, a closing
parenthesis or quote is missing).
R statements consist of functions and assignments.
In general, R is tolerant of extra spaces inserted into code.
Do not insert a space inside the assignment operator (<-):
x < - 5 is read as a comparison of x with -5, not as an assignment.
If the console ‘hangs’ and becomes unresponsive after running a
command, press the escape key (Esc) on the keyboard.
We can also click on the stop icon in the top right of the console
to terminate most current operations.
To save an object to an .RData file use save(nameOfObject, file
= "file_name.RData")
To save all objects in a workspace into a single .RData file use
save.image(file = "file_name.RData")
To load an .RData file back into RStudio use load(file =
"file_name.RData")
We can end the R session that we are working on by typing and
running the command q().
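The save and load commands above can be tried with a short round trip. This is only a sketch: the object name scores and the file name scores.RData are made up for illustration, and the file is written to a temporary folder so that nothing in the project is touched.

```r
# Hypothetical object and file name, used only for illustration.
scores <- c(12, 15, 9)
rdata_file <- file.path(tempdir(), "scores.RData")

save(scores, file = rdata_file)  # write the object to disk
rm(scores)                       # remove it from the workspace

load(file = rdata_file)          # restore it under its original name
print(scores)
```

Note that load() restores objects under the names they had when saved, so it may silently overwrite an object with the same name in the current workspace.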
Packages
Packages are collections of R functions, data, and compiled code.
The base installation of R comes with many packages as
standard.
The directory where packages are stored on your computer is
called the library.
Standard set of packages in R: base, datasets, utils, grDevices,
graphics, stats, and methods.
Use the install.packages() command to install a package for the
first time.
Example. The package car can be installed by writing
install.packages("car") into the Console window of RStudio.
We need to have a working internet connection to do this.
We may be asked to select a CRAN mirror; we can select
0-Cloud or a mirror near our location.
We may include the dependencies = TRUE argument to also install
the additional packages that the requested package requires.
Packages can be updated using the command
update.packages().
The ask = FALSE argument updates all installed
packages without prompting for each one.
To use a package in an R session, we first need to load the
package using > library(pkg_name).
We need to load the packages we will be using every time we
start a new R session.
> help(package = "package_name") provides a brief description
of the package.
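A common pattern, sketched here as one possible approach, is to install a package only when it is missing and then load it. The sketch uses the stats package, which ships with R, purely so that it runs without an internet connection; in practice pkg would be a contributed package such as car.

```r
# 'pkg' is a placeholder; "stats" is chosen only because it is always installed.
pkg <- "stats"

# requireNamespace() returns FALSE when the package is not installed.
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg, dependencies = TRUE)
}

# character.only = TRUE lets library() accept the name stored in a variable.
library(pkg, character.only = TRUE)
loaded <- pkg %in% loadedNamespaces()
print(loaded)
```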
Help in R
> help.start() — General help.
> help("mean") or ?mean # help on the function mean.
> help.search("linear") or ??linear # searches the help system for the string linear.
> example("anova") # runs the examples of the function anova.
> apropos("mean", mode = "function") # lists all functions with mean in their name.
Programming help:
Stack Overflow (https://stackoverflow.com/) is a Q&A
website focused on programming in all languages.
Exercise 1
Data structures, input & output
Introduction
The base data structures in R can be organized by their
dimensionality (1d, 2d, or nd).
In R, the ID (case identifier) is treated as row names.
A data structure can be either homogeneous (all contents are of the
same type) or heterogeneous (contents can be of different types).
The FIVE most often used structures: atomic vector, matrix, and
array (homogeneous); list and data frame (heterogeneous).
We can use str() to understand what data structures an object is
composed of.
Notation and naming
Generally, variable names should be nouns and function names
should be verbs.
Use an underscore (_) to separate words within a name.
Avoid using names of existing functions and variables. Some
examples of BAD variable names:
> F <- FALSE
> c <- 10
Place spaces around all infix operators (=, +, -, <-, <, >, etc.).
Always put a space after a comma.
The operators :, :: and ::: do not need spaces around them.
Place a space before a left parenthesis, except in a function call.
Vectors
Vectors are one-dimensional arrays that can hold numeric,
character, or logical data.
The vector is the basic data structure in R.
Scalars are treated as vectors with only one element.
is.atomic(x) tests whether an object x is an atomic vector.
There are four common types of atomic vectors: logical, integer,
double (numeric), and character.
Atomic vectors are usually created with c(), short for
concatenate.
Example
▶ Logical <- c(TRUE, FALSE)
▶ Integer <- c(1L, 3L, 6L, 9L) (the L suffix makes the values
integers; without it they are stored as doubles)
▶ Double <- c(1, 2.4, 3.7)
▶ Character <- c("He", "is", "a", "statistician")
All data in a vector must be of one type or mode: numeric,
character, or logical.
We can check the data type (mode) of a vector using the
command mode()
▶ > mode(Integer) gives "numeric"
We use square brackets [ ] to extract elements of a vector.
Example
▶ > Double[2] gives 2.4
The colon operator (:) generates a sequence of consecutive
integers, which can be used to extract a range of elements.
Example
▶ > Character[2:4] gives "is" "a" "statistician"
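The points about vector types and indexing can be checked with a few lines. This is a minimal sketch; the object names (int_vec, dbl_vec, chr_vec, mixed) are illustrative and not from the slides.

```r
int_vec <- c(1L, 3L, 6L, 9L)                      # integer vector (L suffix)
dbl_vec <- c(1, 2.4, 3.7)                         # double vector
chr_vec <- c("He", "is", "a", "statistician")     # character vector

typeof(int_vec)   # "integer"
typeof(dbl_vec)   # "double"
mode(int_vec)     # "numeric": mode() does not distinguish integer from double

# Mixing types in c() coerces everything to the most general type.
mixed <- c(1, "a", TRUE)
typeof(mixed)     # "character"

chr_vec[2:4]      # elements 2 to 4
dbl_vec[-1]       # everything except the first element
```

typeof() reports the storage type, which is why it separates integer from double while mode() reports both as "numeric".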
A summary of logical test and coercion functions
> X <- 5.8
> class(X) gives [1] "numeric"
> is.numeric(X) gives [1] TRUE
> is.character(X) gives [1] FALSE
> X1 <- as.character(X)
> class(X1) gives [1] "character"
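Coercion can fail quietly, which matters when imported numbers arrive as text. The sketch below uses a made-up vector ages_chr to show that values which cannot be converted become NA.

```r
# Hypothetical input: numbers stored as text, with one non-numeric entry.
ages_chr <- c("21", "35", "unknown")

# as.numeric() converts what it can; "unknown" becomes NA (with a warning,
# suppressed here to keep the output clean).
ages_num <- suppressWarnings(as.numeric(ages_chr))
print(ages_num)
is.na(ages_num)

# as.integer() drops the fractional part rather than rounding.
as.integer(5.8)
```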
Factors
Factors are a specific type of vector used to store values drawn
from a pre-specified set of categories.
The function factor() stores categorical values as a vector of
integers in the range 1, . . . , k,
where k is the number of unique categories.
> x <- c("A", "B", "B", "A")
> xf <- factor(x) stores this vector as (1, 2, 2, 1); this can be
checked by > as.numeric(xf)
The variable xf is now treated as nominal.
For ordinal variables, we add the argument ordered = TRUE to
the factor() function.
> y <- c("Good", "Very good", "Excellent", "Good")
> factor(y, ordered = TRUE) is encoded as (2, 3, 1, 2)
The variable y is now treated as ordinal.
By default, factor levels for character vectors are created in
alphabetical order.
We can override the default by specifying the levels option.
levels(x) returns the set of allowed values of a factor x.
> factor(y, ordered = TRUE, levels = c("Good", "Very good",
"Excellent")) is encoded as (1, 2, 3, 1)
Numeric variables can be coded as factors using the levels and
labels options. The levels must match the codes that occur in the
data; codes that are not listed become NA.
> Sex <- c(0,1,1,1,0,1,1,0,0)
> factor(Sex, levels = c(0, 1), labels = c("Male", "Female"))
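The factor examples can be verified as follows. This is a sketch with illustrative object names (rating, rating_f, sex_code, sex_f); the mapping of 0 to "Male" and 1 to "Female" is an assumption made for the example.

```r
rating <- c("Good", "Very good", "Excellent", "Good")

# Explicit levels fix the order; ordered = TRUE makes the factor ordinal.
rating_f <- factor(rating,
                   levels = c("Good", "Very good", "Excellent"),
                   ordered = TRUE)
levels(rating_f)
as.integer(rating_f)      # internal integer codes
rating_f < "Excellent"    # comparisons are meaningful for ordered factors

# Numeric codes turned into labelled categories (assumed coding: 0 = Male).
sex_code <- c(0, 1, 1, 0)
sex_f <- factor(sex_code, levels = c(0, 1), labels = c("Male", "Female"))
table(sex_f)
```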
List
Lists are the most complex of the R data types.
A list may contain a combination of vectors, matrices, data
frames, and even other lists. Use list() to create lists.
> x <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9))
> str(x) gives
List of 4
$ : int [1:3] 1 2 3
$ : chr "a"
$ : logi [1:3] TRUE FALSE TRUE
$ : num [1:2] 2.3 5.9
Note that many R functions return their results as lists.
We can turn a list into an atomic vector with unlist().
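Elements of a list are usually accessed by name or position. The sketch below uses a hypothetical list called person to contrast single brackets, double brackets, and the dollar sign.

```r
# Hypothetical list with named elements of different types.
person <- list(name = "Abebe", scores = c(80, 92.5), passed = TRUE)

person$name               # extract one element by name
person[["scores"]][2]     # second value of the scores element
class(person["scores"])   # single brackets return a (shorter) list
class(person[["scores"]]) # double brackets return the element itself

# unlist() flattens a list into an atomic vector (here: double).
flat <- unlist(list(1:3, c(2.3, 5.9)))
print(flat)
```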
Data frames
A data frame is a list of equal-length vectors. It shares
properties of both a matrix and a list.
It is more general than a matrix as it can contain different types
of data.
Each row corresponds to an individual observation and each
column corresponds to a different measured or recorded variable.
We create a data frame using data.frame(), which takes named
vectors as input.
> mydata <- data.frame(col1, col2, col3, ...)
> DF <- data.frame(x = 1:3, y = c("A", "B", "C"))
We can combine data frames using cbind() and rbind().
The dimensions of a data frame can be determined by the dim()
function.
To access columns of a data frame we can use any of the
following ways:
1 Using a square bracket and column number: DF[2] gives a data
frame containing only the second column
2 Using a square bracket and column name: DF["y"] gives the same
one-column data frame
3 Using the dollar sign: DF$y gives the vector [1] "A" "B" "C"
Typing DF$VarName is somewhat tiresome for a data frame
having several variables.
We can use the attach() function as a shortcut.
The attach() function adds the data frame to the R search path.
Example. > attach(ChickWeight); plot(weight, Diet)
> weight <- c(23,25,20); attach(ChickWeight) gives
The following object is masked by .GlobalEnv: weight
> plot(weight, Diet) gives Error. Why?
The name weight now refers to the vector of length 3 in the Global
Environment rather than to the column of ChickWeight, so weight
and Diet differ in length.
We can use the detach() function to remove the data frame
from the search path.
The operator $ can also be used to create a new column.
> DF$Sex <- c("M", "F", "M")
R automatically assigns the character class to the above variable
"y".
We can include the stringsAsFactors = TRUE argument in
data.frame() to convert "y" to a factor.
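One way to avoid the masking problem described above is the base R function with(), which is not covered on the slides and is offered here only as an alternative sketch. It evaluates an expression inside the data frame without changing the search path.

```r
# A global object with the same name as a column of ChickWeight.
weight <- c(23, 25, 20)

# Inside with(), names are looked up in the data frame first,
# so 'weight' refers to the ChickWeight column, not the global vector.
n_in_data <- with(ChickWeight, length(weight))
mean_weight <- with(ChickWeight, mean(weight))

n_in_data        # number of rows of ChickWeight
length(weight)   # the global vector is untouched
```

Because nothing is attached, there is also nothing to detach() afterwards.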
Values stored in a data frame can be retrieved by a square
bracket [i, j]; i for row and j for column.
> DF[3,2] gives C
> DF$y[3] gives C
> DF[,1] gives 1 2 3
> DF[1,] gives 1 A
The function is.data.frame(object) checks if an object is of class
data frame.
> nrow(DF) # to determine number of rows of DF
> ncol(DF) # to determine number of columns of DF
> colnames(DF) # to display column names of DF
> rownames(DF) # to display row names of DF
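The indexing rules for data frames can be tried on the small DF used above. This sketch recreates it (with stringsAsFactors = FALSE stated explicitly so the result does not depend on the R version).

```r
DF <- data.frame(x = 1:3, y = c("A", "B", "C"), stringsAsFactors = FALSE)

dim(DF)          # rows, then columns
DF[3, 2]         # row 3, column 2
DF[, 1]          # all rows of column 1, returned as a vector
DF[1, ]          # first row, returned as a one-row data frame

class(DF[2])     # single bracket: still a data frame
class(DF[[2]])   # double bracket: the column vector itself

DF[DF$x >= 2, ]  # rows selected by a logical condition
```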
Missing values
Missing values in R are marked by NA.
We can use is.na() function to check whether a value of an R
object is missing.
It returns TRUE if a value is missing.
> myvec <- c(2, 5, NA, 8)
> is.na(myvec) gives [1] FALSE FALSE TRUE FALSE
The sum() function can be used to count the NA
values in a vector.
> sum(is.na(myvec)) gives [1] 1
The anyNA() function can be used to check for any missing values
in a data frame.
> anyNA(DF) gives [1] FALSE
Let us consider the following data frame and check for missing
values.
To verify which observation(s) is/are the cause, we can use
> mydf[!complete.cases(mydf),] gives
Ad Car Mi
3 1 NA 13
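The checks above can be reproduced on a small data frame. The values below are invented for illustration; only the third row (Ad = 1, Car = NA, Mi = 13) follows the output shown on the slide.

```r
# Hypothetical data; row 3 mirrors the incomplete case shown above.
mydf <- data.frame(Ad  = c(1, 0, 1, 0),
                   Car = c(2, 1, NA, 3),
                   Mi  = c(10, 12, 13, 9))

anyNA(mydf)                              # is anything missing?
colSums(is.na(mydf))                     # missing values per column
incomplete <- mydf[!complete.cases(mydf), ]
print(incomplete)

mean(mydf$Car)                           # NA: missing values propagate
mean(mydf$Car, na.rm = TRUE)             # computed from the observed values

clean <- na.omit(mydf)                   # drop rows with any missing value
nrow(clean)
```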
Exercise 2
Data input
R can import data from a variety of sources.
Entering data from the keyboard:
We can use the edit() function to enter data manually.
> mydata <- data.frame(age = numeric(0),
gender = character(0), weight = numeric(0))
> mydata <- edit(mydata)
A shortcut for mydata <- edit(mydata) is fix(mydata).
We can embed data directly using the code
> mydatatxt <- "
age gender weight
25 m 166
30 f 115
"
> mydata <- read.table(header = TRUE, text = mydatatxt)
Importing data from a delimited text file
Syntax: mydataframe <- read.table(file, options)
Some of the options: header, sep, row.names, col.names,
na.strings, quote, skip, dec.
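To see how some of these options interact, the sketch below reads comma-separated text in which -99 is used as a missing-value code. The data and the -99 convention are assumptions made for the example.

```r
# Hypothetical comma-separated data; -99 stands for "missing".
raw_txt <- "age,gender,weight
25,m,166
30,f,-99
"

mydata <- read.table(text = raw_txt,
                     header = TRUE,        # first row holds variable names
                     sep = ",",            # values are separated by commas
                     na.strings = "-99",   # treat -99 as NA
                     stringsAsFactors = FALSE)
str(mydata)
```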
The above data can be imported into a data frame using the
following code:
> grades <- read.table(file = "studentgrades.csv", header = TRUE,
row.names = "StudentID", sep = ",", colClasses = c("character",
"character", "character", "numeric", "numeric", "numeric"))
Importing data from Excel
Save the file in Tab-delimited Text (.txt) form: File ⇒ Save As
... ⇒ select Text (Tab delimited) in the Save as type: box.
Once the Excel file is saved in tab-delimited format we can use
the read.table() function.
We may also save Excel files in comma-separated values (.csv)
format and use the function read.csv() to input the data.
Excel files can be directly imported into R using the xlsx package.
Syntax to import the first worksheet (1) from the workbook:
> library(xlsx)
> workbook <- "c:/myworkbook.xlsx"
> mydataframe <- read.xlsx(workbook, 1)
For large worksheets (say, 100,000+ cells), we can use
read.xlsx2().
In general, it is recommended to save data as tab-delimited or
comma-delimited files and import them into R using read.table().
Importing data from SPSS
SPSS datasets can be imported into R via the read.spss() function
in the foreign package.
We can also use the spss.get() function in the Hmisc package.
> library(Hmisc)
> mydataframe <- spss.get("mydata.sav",
use.value.labels = TRUE)
Importing data from Stata
We can employ the following syntax.
> library(foreign)
> mydataframe <- read.dta("mydata.dta")
We can also use the Import Dataset shortcut in the Environment
pane of RStudio to input data.
It is common to get error messages when first importing
data, for one or more of the following reasons.
1 A mistake in the spelling of either the file name or the file path.
2 Forgetting to include the file extension (.txt, .csv, etc.) in the file
name.
3 Using a file path that does not point to the folder where the file
is stored.
4 Forgetting to include the header = TRUE argument when the first
row of the file contains variable names.
Always use the function str() after importing the data to see the
structure of the data set.
Annotating datasets
Annotating includes adding descriptive labels to variable names
and value labels to the codes for categorical variables.
Consider the following data frame
Age Sex
15 2
17 2
16 1
The following code can be used to rename "Age" as "Age at
first marriage".
> names(Mydata)[1] <- "Age at first marriage"
The following code creates value labels for the variable "Sex".
> Mydata$Sex <- factor(Mydata$Sex,
levels = c(1, 2),
labels = c("male", "female"))
Data output
The main function to export data frames is write.table().
> write.table(DF, file = "file_path/file_name.txt",
col.names = TRUE, row.names = FALSE, sep = "\t")
DF is the data frame we want to export;
file_name.txt is the file name for the exported data frame;
col.names = TRUE indicates that the variable names should be
written in the first row of the file;
row.names = FALSE stops R from including the row names in
the first column of the file;
sep = "\t" separates the values with tabs.
The above exported file can be opened in any text editor as it is
saved in tab-delimited text form.
We can also export a data frame to csv format by setting sep =
"," in the write.table() function.
The function write.csv() can also be used directly as
> write.csv(DF, "filepath/mydf.csv", row.names = FALSE)
> library(xlsx)
> write.xlsx(df, "filepath/mydf.xlsx") # To export a data frame "df" to "mydf.xlsx"
We need to first install the package haven to export data to
SPSS and Stata.
> library(haven)
> write_sav(df, "filepath/mydf.sav") # To SPSS
> write_dta(df, "filepath/mydf.dta") # To Stata
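A quick way to confirm that an export worked is to read the file back in. The sketch below uses only base R; the data frame and the file name grades.txt are hypothetical, and the file is written to a temporary folder.

```r
# Hypothetical data frame and file name, for illustration only.
DF <- data.frame(id = 1:3, grade = c("A", "B", "C"), stringsAsFactors = FALSE)
out_file <- file.path(tempdir(), "grades.txt")

# Export as tab-delimited text with a header row and no row names.
write.table(DF, file = out_file, col.names = TRUE,
            row.names = FALSE, sep = "\t")

# Import the same file again and inspect its structure.
DF_back <- read.table(out_file, header = TRUE, sep = "\t",
                      stringsAsFactors = FALSE)
str(DF_back)
```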
Exercise 3
Data wrangling
Introduction
Data wrangling is the most important activity before data
analysis begins.
It includes:
▶ creating new variables,
▶ extracting part of a data frame,
▶ recoding existing variables,
▶ sorting and merging data,
▶ selecting and dropping variables,
▶ working with dates, etc.
The following code creates a new variable, say SUM, in an
existing data frame "mydata".
> mydata <- data.frame(x1 = c(2, 2, 6, 4), x2 = c(3, 4, 2, 8))
> mydata$SUM <- mydata$x1 + mydata$x2
> mydata
Extracting values
Consider the built in data frame USArrests and check its
structure.
Let us extract the third value of the Assault variable.
> USArrests[3,2] gives [1] 294
> USArrests$Assault[3] gives the same result as above.
The data in the first 10 rows and 3 columns of USArrests can be
extracted by:
> USArrests[1:10, 1:3]
We use negative positional indexes to exclude certain rows and
columns from a data frame.
Let us extract all of the rows except the first 40 rows and all
columns except the 2nd and the 4th columns in the USArrests
dataset.
> USArrests[-(1:40), -c(2,4)]
Values can also be extracted based on a logical test.
Let us extract all rows where the value of Assault is 200 or more
and all the columns in the USArrests dataset.
> USArrests[USArrests$Assault >= 200, ]
We can use the following to extract all rows where the value of
UrbanPop is 80 and all the columns in the USArrests dataset.
> USArrests[USArrests$UrbanPop == 80, ]
Boolean expressions can be used to extract values based on a
combination of logical tests.
To extract rows based on an AND Boolean expression we can
use the & symbol.
Let us extract all columns and rows where Assault > 250 and
UrbanPop > 70.
> USArrests[USArrests$Assault > 250 & USArrests$UrbanPop >
70, ]
To extract rows based on an OR Boolean expression we can use
the | symbol.
Extract values where UrbanPop is greater than 50 OR less than
60.
> USArrests[USArrests$UrbanPop > 50 | USArrests$UrbanPop <
60, ]
Alternatively, one can use the subset() function to select parts of
a data frame.
> subset(USArrests, UrbanPop > 50 | UrbanPop < 60)
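The difference between AND and OR conditions is easy to check on USArrests. In this sketch the thresholds (Assault > 250, UrbanPop > 70, and the 50 to 60 band) are assumptions chosen for illustration, and the object names are made up.

```r
# AND: both conditions must hold. Bracket indexing and subset() agree.
and_rows   <- USArrests[USArrests$Assault > 250 & USArrests$UrbanPop > 70, ]
and_subset <- subset(USArrests, Assault > 250 & UrbanPop > 70)
identical(rownames(and_rows), rownames(and_subset))

# OR: at least one condition must hold. Every number is either above 50
# or below 60, so this particular condition keeps all rows.
or_rows <- subset(USArrests, UrbanPop > 50 | UrbanPop < 60)
nrow(or_rows) == nrow(USArrests)

# To keep only values strictly between 50 and 60, AND is needed instead.
band_rows <- subset(USArrests, UrbanPop > 50 & UrbanPop < 60)
range(band_rows$UrbanPop)
```

subset() also accepts a select argument, for example select = c(Assault, UrbanPop), to keep only some columns.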
Ordering data frames
We can use the order() function to sort a data frame.
Let us order the USArrests data frame based on the values of
Murder.
> USArrests[order(USArrests$Murder), ] # Ascending
> USArrests[order(-USArrests$Murder), ] # Descending, OR
> USArrests[order(USArrests$Murder, decreasing = TRUE), ]
Let us order the USArrests data frame based on the values of
Murder and Rape.
> USArrests[order(USArrests$Murder, USArrests$Rape), ]
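How order() breaks ties with a second variable is easier to see on a tiny example. The data frame below is invented for illustration.

```r
# Hypothetical data with a tie in 'murder' (rows a and c).
df <- data.frame(name   = c("a", "b", "c", "d"),
                 murder = c(5, 2, 5, 1),
                 rape   = c(10, 30, 8, 20))

# Ascending by murder; ties are broken by rape (ascending).
asc <- df[order(df$murder, df$rape), ]
asc$name    # "d" "b" "c" "a"

# Descending by murder (note the minus sign), ties still ascending by rape.
desc <- df[order(-df$murder, df$rape), ]
desc$name   # "c" "a" "b" "d"
```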
Inclusion of extra rows and columns
The function rbind() can be used to append additional rows to a
data frame.
Use cbind() to append columns to an existing data frame.
> DF1 <- data.frame(Id = 1:3, Age = c(20, 19, 23), Sex =
c("Male", "Female", "Female"))
> DF2 <- data.frame(Id = 4:5, Age = c(27, 25), Sex =
c("Male", "Male"))
> DF3 <- data.frame(Height = c(178, 171, 167))
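> rbind(DF1, DF2) will give a data frame with 5 rows and 3 columns
(DF1 and DF2 have the same columns).
> cbind(DF1, DF3) will give a data frame with 3 rows and 4 columns
(Id, Age, Sex, Height).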
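The three data frames defined above can be combined as follows. This sketch also shows, as an illustration, that rbind() refuses data frames whose columns do not match.

```r
DF1 <- data.frame(Id = 1:3, Age = c(20, 19, 23),
                  Sex = c("Male", "Female", "Female"))
DF2 <- data.frame(Id = 4:5, Age = c(27, 25), Sex = c("Male", "Male"))
DF3 <- data.frame(Height = c(178, 171, 167))

more_rows <- rbind(DF1, DF2)   # same columns: rows are stacked
more_cols <- cbind(DF1, DF3)   # same number of rows: columns are added
dim(more_rows)
dim(more_cols)

# rbind() needs matching columns; DF1 and DF3 do not match,
# so the call fails and the error message is captured here.
msg <- tryCatch(rbind(DF1, DF3), error = function(e) conditionMessage(e))
print(msg)
```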
Merging data frames
The function merge() merges two data frames horizontally.
We need to have at least one unique identifier which is available
in both data frames.
The following code merges data frame A and data frame B by ID.
> total <- merge(dataframeA, dataframeB, by = "ID")
The following code merges data frame A and data frame B by ID
and Region.
> total <- merge(dataframeA, dataframeB, by = c("ID",
"Region"))
Use the all = TRUE argument if we want to include all data
from both data frames.
> total <- merge(dataframeA, dataframeB, all = TRUE)
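The effect of the all argument is clearest when the two data frames share only some identifiers. The data frames dfA and dfB below are hypothetical.

```r
# Hypothetical data: IDs 2 and 3 appear in both data frames.
dfA <- data.frame(ID = c(1, 2, 3), Age = c(20, 19, 23))
dfB <- data.frame(ID = c(2, 3, 4), Height = c(171, 167, 180))

inner <- merge(dfA, dfB, by = "ID")                # only IDs found in both
full  <- merge(dfA, dfB, by = "ID", all = TRUE)    # all IDs, NA where missing
left  <- merge(dfA, dfB, by = "ID", all.x = TRUE)  # all rows of dfA

inner$ID
full
sum(is.na(full))   # one NA for ID 1 (Height) and one for ID 4 (Age)
```

The all.x and all.y arguments give the one-sided versions, which keep every row of the first or the second data frame respectively.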
Let us merge the two data frames given below.
> Total <- merge(DF1, DF2) gives
> Total1 <- merge(DF1, DF2, all = TRUE) gives
Recoding variables
Recoding refers to creating new values of a variable from the
existing values.
It may include:
▶ changing a continuous variable into a set of categories,
▶ replacing miscoded values with correct values.
Suppose we want to recode the stopping distance of cars
("dist") in the "cars" dataset to distcat (Small, Medium, Long).
We first recode the value 999 for "dist" to indicate that this
value is missing:
> cars$dist[cars$dist == 999] <- NA
We then create distcat as an empty column, so that it has one
entry per row, and fill in the categories:
> cars$distcat <- NA
> cars$distcat[cars$dist < 50] <- "Small"
> cars$distcat[cars$dist >= 50 & cars$dist < 100] <-
"Medium"
> cars$distcat[cars$dist >= 100] <- "Long"
We may also use the within() function as follows.
> cars <- within(cars, { distcat1 <- NA
distcat1[dist < 50] <- "Small"
distcat1[dist >= 50 & dist < 100] <- "Medium"
distcat1[dist >= 100] <- "Long" })
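The recoding can be checked on a copy of the built-in cars data. In this sketch the name cars2 is an assumption, and cut() is shown only as an alternative that the slides do not use.

```r
cars2 <- cars  # work on a copy so the built-in data stay unchanged

# Recoding with within(), as on the slide
cars2 <- within(cars2, {
  distcat <- NA
  distcat[dist < 50] <- "Small"
  distcat[dist >= 50 & dist < 100] <- "Medium"
  distcat[dist >= 100] <- "Long"
})

# Alternative (not from the slides): cut() with left-closed intervals
cars2$distcat_cut <- cut(cars2$dist,
                         breaks = c(-Inf, 50, 100, Inf),
                         labels = c("Small", "Medium", "Long"),
                         right = FALSE)

# Frequency of each category
print(table(cars2$distcat))
```

Setting right = FALSE makes the intervals [50, 100) and [100, Inf), which matches the >= and < conditions used above.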
Exercise 4
Packages for wrangling data
Important packages to install: dplyr; tidyr.
Once these packages are successfully installed, attach them to
the working environment.
> install.packages("dplyr")
> install.packages("tidyr")
> library(dplyr)
> library(tidyr)
The pipe operator %>% is used to perform a set of procedures
in a sequential manner and get the result at once.
Let us determine the number of missing values in the Total1 data
frame created in the merge example above.
> sum(is.na(Total1)) gives [1] 2
This procedure can also be done using the pipe operator.
> Total1 %>% is.na() %>% sum()
Key functions used for data management in the dplyr package
include:
mutate – to modify and create columns in a data frame.
select – to select columns by name.
filter – to select rows based on a set of logical conditions.
Let us create a new column, time, by dividing dist by speed in
the cars data frame.
> mutate(cars, time = dist/speed)
Let us now list the cars with speed less than 10.
> cars %>% filter(speed < 10)
Suppose that we would like to extract only the dist of cars with
speed less than 10. Then we can use:
> cars1 <- cars %>% filter(speed < 10)
> cars1 %>% select(dist)
The above steps can be combined into a single chain
by using the pipe operator:
> cars %>%
  mutate(time = dist/speed) %>%
  filter(speed < 10) %>%
  select(dist)
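The chained version can be run as a self-contained script. This sketch assumes the dplyr package is installed; the object name result is introduced here only for illustration.

```r
library(dplyr)

result <- cars %>%
  mutate(time = dist / speed) %>%  # add the new column time
  filter(speed < 10) %>%           # keep only the slower cars
  select(dist)                     # keep only the dist column

print(result)
```

Note that select(dist) drops the time column created by mutate(). To keep it in the output, one could use select(dist, time) instead.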
Reshaping data frames
Two main data frame shapes: the long format (sometimes called
stacked) and the wide format.
In the long data format, one column holds the names of the
variables and another column holds the corresponding values.
In the wide data format, each column represents a variable.
We use the function pivot_longer() to reshape data into long
format.
pivot_longer(Wide_DF, ColNames_in_DF, names_to = " ",
values_to = " ")
Wide_DF is the name of the data frame in wide format.
ColNames_in_DF represents the column names in the wide
format (Test1, Test2, Final in the above data).
The names_to = " " argument specifies the name of the variable
that will be used to store the names of reformatted variables
(Exam in the above data).
The values_to = " " argument specifies the name of the variable
that will be used to store the values (Score in the above data).
Consider the wide-format data frame of test scores described
above and reshape it to long format.
> Wide <- data.frame(Stud.Id = 1:3, Test1 = c(20,23,18),
Test2 = c(26,28,25), Final = c(32,30,34))
> library(tidyr)
> pivot_longer(Wide, c(Test1, Test2, Final), names_to =
"Exam", values_to = "Score")
We use the function pivot_wider() to reshape data into wide
format.
pivot_wider(Long_DF, names_from = " ", values_from = " ")
Long_DF is the name of the data frame in long format.
The names_from = " " argument specifies the name of the
variable containing the variable names (Exam in the above data).
The values_from = " " argument specifies the variable
containing the values (Score in the above data).
Consider the same test-score data in long format and reshape
them to wide format.
> Long <- data.frame(Stud.Id = c(1,1,1,2,2,2,3,3,3), Exam =
rep(c("Test1","Test2","Final"),3), Score =
c(20,26,32,23,28,30,18,25,34))
> library(tidyr)
> pivot_wider(Long, names_from = "Exam", values_from =
"Score")
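The two reshaping functions are inverses of each other, which can be checked with a round trip. This is a sketch that assumes the tidyr package is installed; the object name Back is introduced here for illustration.

```r
library(tidyr)

Wide <- data.frame(Stud.Id = 1:3, Test1 = c(20, 23, 18),
                   Test2 = c(26, 28, 25), Final = c(32, 30, 34))

# Wide to long: one row per student and exam
Long <- pivot_longer(Wide, c(Test1, Test2, Final),
                     names_to = "Exam", values_to = "Score")

# Long back to wide: one row per student again
Back <- pivot_wider(Long, names_from = "Exam", values_from = "Score")

print(Long)
print(Back)
```

The long version has one row for each of the 3 students times 3 exams, and reshaping it back recovers the original columns.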
Date values
The function as.Date() converts character strings into Date
values.
Syntax is as.Date(x, "input format"). The default input format is
year-month-day, as in:
> mydate <- as.Date(c("2021-01-01", "2021-03-02",
"2021-03-29"))
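When the strings are not in year-month-day order, the input format has to be given explicitly. The following sketch uses made-up date strings and only numeric format codes, so it does not depend on the language settings of the computer.

```r
# Default: the string is already in year-month-day form
d1 <- as.Date("2021-03-02")

# Day/month/year separated by slashes needs an explicit input format
d2 <- as.Date("02/03/2021", format = "%d/%m/%Y")

# Month-day-year with a two-digit year
d3 <- as.Date("03-02-21", format = "%m-%d-%y")

print(c(d1, d2, d3))
```

All three calls describe the same calendar day, 2 March 2021.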
We can use the format(x, format = "output format") function to
extract portions of dates.
x is a given date value.
format = " " is any one or more of the formats %d, %a, %A,
%m, %b, %B, %y, %Y
> today <- Sys.Date()
> today gives the current date.
> format(today, format = "%y") gives [1] "22" (when run in 2022)
We may also extract more than one format using:
> format(today, c("%A", "%m")) this extracts the
unabbreviated weekday and the month number.
It is also possible to perform arithmetic operations in dates as
follows.
For example, consider a person who was born on 12 October 1950.
> DoB <- as.Date("1950-10-12")
> today - DoB gives the time difference in terms of days.
Alternatively, the following code can be employed.
> difftime(today, DoB, units = "days") gives the time difference
in terms of days.
How old in years is the person?
First install the lubridate package and then load it.
> library(lubridate)
> trunc(interval(DoB, today) / years(1)) gives the age in
years.
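The age in completed years can also be computed without lubridate. The helper age_in_years() below is a hypothetical function written for this sketch, and fixed reference dates are used so that the result does not depend on the day the code is run.

```r
# Completed years between a birth date and a reference date (base R only)
age_in_years <- function(dob, ref) {
  d <- as.POSIXlt(dob)
  r <- as.POSIXlt(ref)
  age <- r$year - d$year
  # Subtract one year if the birthday has not yet occurred in the reference year
  not_yet <- (r$mon < d$mon) | (r$mon == d$mon & r$mday < d$mday)
  age - as.integer(not_yet)
}

DoB <- as.Date("1950-10-12")
print(age_in_years(DoB, as.Date("2022-10-11")))  # day before the birthday
print(age_in_years(DoB, as.Date("2022-10-12")))  # on the birthday
```

The age increases by one exactly on the birthday, which is the behaviour expected from trunc(interval(DoB, today) / years(1)).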
Suppose we would like to limit our analyses to observations
collected between January 1, 2019 and December 31, 2020 in the
"Ex" dataset.
> Ex <- data.frame(A = 1:10, B = c("2019-08-11",
"2020-12-18","2018-05-14","2020-07-26","2018-11-23",
"2020-01-03", "2018-05-05", "2018-11-07",
"2019-03-06","2019-05-08"))
> attach(Ex)
> Ex$date <- as.Date(Ex$B)
> startdate <- as.Date("2019-01-01")
> enddate <- as.Date("2020-12-31")
> newdata <- Ex[which(Ex$date >= startdate & Ex$date <=
enddate), ]
Exercise 5
Descriptive analysis
Commonly used mathematical functions
Commonly used statistical functions
Descriptive statistics
Load the built-in dataset swiss into the workspace.
It is good practice to look at how many observations and
variables are included in a dataset prior to further analysis.
> dim(swiss) gives [1] 47 6
Following this, have a look at the structure: > str(swiss)
Then get basic summary statistics by using the function
summary().
Missing values may be present in the dataset under analysis.
We can compute summary statistics after removing them, for
example,
> mean(swiss$Fertility, na.rm = TRUE)
> with(swiss, c(Mean = mean(Fertility, na.rm = TRUE), Sd =
sd(Fertility, na.rm = TRUE)))
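The swiss data contain no missing values, so the effect of na.rm is easier to see on a small vector. The vector x below is made up for illustration.

```r
x <- c(4, 8, NA, 10)

m_default <- mean(x)                # NA, because one value is missing
m_clean   <- mean(x, na.rm = TRUE)  # mean of the three observed values
n_missing <- sum(is.na(x))          # number of missing values

print(m_default)
print(m_clean)
print(n_missing)
```

Without na.rm = TRUE a single NA makes the whole summary NA, which is why the argument is added in the calls above.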
Suppose we would like to calculate a summary statistic, e.g.,
the mean, for each level of a categorical variable.
We can use the tapply() function.
Let us calculate the mean miles per gallon (mpg) for each of the
cylinder types (cyl) in the mtcars dataset.
> tapply(mtcars$mpg, mtcars$cyl, mean)
We can also apply tapply() over more than one factor.
Let us calculate the mean mpg for each combination of gear and cyl.
> tapply(mtcars$mpg, list(Cylinder = mtcars$cyl, Gear =
mtcars$gear), mean)
aggregate() works in a similar way to tapply() but it is more
flexible.
Suppose we want to calculate the mean values of mpg and wt
for each level of cyl.
> aggregate(mtcars[, c("mpg", "wt")], by = list(Cylinder =
mtcars$cyl), FUN = mean)
We can also use the aggregate() function by using the formula
method.
> aggregate(mpg ~ cyl, FUN = mean, data = mtcars)
Furthermore, we may compute summary statistics for subsets of
the original data.
Let us compute the mean of mpg for each level of cyl only for hp
> 115.
> aggregate(mpg ~ cyl, FUN = mean, subset = hp > 115, data
= mtcars)
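tapply() and the formula method of aggregate() compute the same group means and differ only in the shape of the result. A short sketch of that comparison follows; the object names are introduced here for illustration.

```r
# Named vector with one mean per cylinder type
by_tapply <- tapply(mtcars$mpg, mtcars$cyl, mean)

# Data frame with one row per cylinder type
by_formula <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

print(by_tapply)
print(by_formula)

# The numeric results agree
same <- all.equal(as.numeric(by_tapply), by_formula$mpg)
print(same)
```

The data frame returned by aggregate() is often more convenient when the result is to be merged with other data or exported.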
More descriptive statistics
More descriptive statistics can be obtained by installing and
loading different packages.
> install.packages("pastecs") # If it is not already installed
> library(pastecs)
> myvars <- c("mpg", "hp", "wt")
> stat.desc(mtcars[myvars]) # mtcars is a built-in dataset
> install.packages("psych") # If it is not already installed
> library(psych)
> myvars <- c("mpg", "hp", "wt")
> describe(mtcars[myvars])
Descriptive statistics by group
Descriptive statistics can be obtained with respect to different
groups by using summaryBy() in the doBy package.
> install.packages("doBy") # If it is not already installed
> library(doBy)
> summaryBy(mpg + hp + wt ~ am, data = mtcars, FUN = c(mean, sd,
min, max))
We may also use the describeBy() function in the psych package.
> library(psych)
> myvars <- c("mpg", "hp", "wt")
> describeBy(mtcars[myvars], list(am = mtcars$am))
Frequency tables
Following are functions for creating tables
Tables
Categorical variables are usually described by frequency tables.
Let us tabulate the values of gear in the mtcars dataset.
> table(mtcars$gear)
We can produce a table of proportions instead of counts by using
> prop.table(table(mtcars$gear))
Two-way tables
We can also use the table() function to cross tabulate two
categorical variables.
Let us cross tabulate cyl and gear in mtcars dataset.
> with(mtcars, table(cyl, gear))
The above table can be obtained by the more flexible function,
xtabs()
> xtabs(~ cyl + gear, data = mtcars)
There are times when we want to include row and column sums
of a table.
This can be done by applying the addmargins() function to the table.
Let us include the row and column sums for the cyl by gear table
produced above.
> addmargins(xtabs(~ cyl + gear, data = mtcars))
We can calculate proportions with respect to row or column
margins as follows.
> prop.table(xtabs(~ cyl + gear, data = mtcars), 1) # row-wise
> prop.table(xtabs(~ cyl + gear, data = mtcars), 2) # column-wise
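A quick way to confirm which margin was used is to add up the proportions. The sketch below stores the table first; the object names tab, row_prop and col_prop are introduced here for illustration.

```r
tab <- xtabs(~ cyl + gear, data = mtcars)

row_prop <- prop.table(tab, 1)  # each row (cyl level) sums to 1
col_prop <- prop.table(tab, 2)  # each column (gear level) sums to 1

print(round(row_prop, 2))
print(rowSums(row_prop))
print(colSums(col_prop))
```

With margin 1 the rows add up to one, so each entry is the share of a gear type within a given number of cylinders. With margin 2 the roles are reversed.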
Multidimensional tables
The functions table(), margin.table(), xtabs(), prop.table(), and
addmargins() also extend to more than 2-dimensions.
A 3-way table of cyl, gear and am in mtcars dataset can be
obtained from:
> mytable <- xtabs(~ cyl + gear + am, data = mtcars)
The function ftable() prints an attractive multidimensional table.
> ftable(mytable)
Exercise 6
Graphs with base R
Descriptive analysis through graphs
The base R graphics system is the original plotting system that
comes with every installation of R.
Graphs in base R are created by high-level plotting commands,
e.g., plot(), and then more information can be added by using
low-level commands, e.g., lines(), text().
When a plot is created in RStudio it will be displayed in the
Plots tab by default.
Previously created plots can be scrolled through by clicking on one
of the arrow buttons.
We can save plots in a variety of formats (pdf, png, tiff, jpeg,
etc.) by clicking on the Export button.
plot() is the most common high-level function to make one or
more plots.
Let us plot the values of mpg in the mtcars dataset against their
index.
> with(mtcars, plot(mpg))
We can produce a scatterplot of wt against mpg in the mtcars
dataset using > with(mtcars, plot(mpg, wt))
It is possible to specify the type of graph we wish to plot using
the type = argument.
Let us apply different values of the type argument to plot
y = {12,21,9,11,8,10,17,18} against x = {1,2,3,4,5,6,7,8}
> x <- 1:8
> y <- c(12, 21, 9, 11, 8, 10, 17, 18)
> par(mfrow = c(2,2)) # to plot 4 graphs in one page
> plot(x, y, type = "l", main = "l")
> plot(x, y, type = "b", main = "b")
> plot(x, y, type = "o", main = "o")
> plot(x, y, type = "c", main = "c")
Bar plots
A bar plot displays the distribution (frequency) of a categorical
variable through vertical or horizontal bars.
Vertical bar plot: > barplot(height)
Horizontal bar plot: > barplot(height, horiz = TRUE)
Bar labels: > barplot(height, names.arg = Bar_Labels)
The following command gives a bar plot of cyl in the mtcars dataset.
> barplot(table(mtcars$cyl))
If height is a matrix rather than a vector, we will have a stacked
or grouped bar plot.
Stacked bar plot: > barplot(height)
Grouped bar plot: > barplot(height, beside = TRUE)
Let us use the built-in dataset VADeaths and represent it by
using bar plots.
> barplot(VADeaths)
> barplot(VADeaths, beside = TRUE)
Pie charts
A pie chart is a circular diagram for presenting data. It is
produced by using the function pie().
Let us produce a pie chart from the following data.
> slices <- c(10, 12, 4, 16, 8)
> lbls <- c("US", "UK", "Australia", "Germany", "France")
> pie(slices, labels = lbls, main = "Simple Pie Chart")
Let us include percentages on the above pie chart.
> pct <- round(slices/sum(slices)*100)
> lbls2 <- paste(lbls, " ", pct, "%", sep = "")
> pie(slices, labels = lbls2, col = rainbow(length(lbls2)),
main = "Pie Chart with Percentages")
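The percentage labels can be inspected before they are passed to pie(). The sketch below repeats the label construction with paste0(), which is equivalent to paste(..., sep = "").

```r
slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")

# Share of each slice in percent, rounded to whole numbers
pct <- round(slices / sum(slices) * 100)

# Labels of the form "name percent%"
lbls2 <- paste0(lbls, " ", pct, "%")
print(lbls2)
```

Because the percentages are rounded, it is worth checking that they still add up to 100 before using them as labels.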
Histograms
Histograms are useful when we want to get an idea about the
distribution of values in a numeric variable.
We use the function hist() to produce a histogram.
Let us generate a histogram for the variable mpg in mtcars
dataset.
> hist(mtcars$mpg)
The freq = FALSE argument can be used to display the
histogram on the density scale (the bar areas sum to 1) rather
than as frequencies.
> hist(mtcars$mpg, freq = FALSE)
We can control the breakpoints of a histogram using the breaks
= argument.
> hist(mtcars$mpg, breaks = seq(from = 0, to = 35, by = 5))
It is also possible to add a kernel density curve to the histogram
by using the density() and lines() functions.
Kernel density estimation is a nonparametric method for
estimating the probability density function of a random variable.
Let us add a kernel density plot on the histogram of mpg.
> Density <- density(mtcars$mpg)
> hist(mtcars$mpg, freq = FALSE)
> lines(Density)
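The density curve can be overlaid only when the histogram is drawn with freq = FALSE, because both are then on the same scale. The sketch below checks this numerically without drawing anything; the object names h, area and prop are introduced here for illustration.

```r
# Compute the histogram without plotting it
h <- hist(mtcars$mpg, plot = FALSE)

# On the density scale the bar areas (height times width) sum to 1
area <- sum(h$density * diff(h$breaks))

# Proportion of observations in each bin, for comparison
prop <- h$counts / sum(h$counts)

print(area)
print(prop)
```

The density heights equal the bin proportions divided by the bin width, so the two coincide only when the bins have width 1.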
Box plots
Boxplots are useful to graphically summarize the distribution of a
variable, identify potential unusual values and compare
distributions between different groups.
We use the function boxplot() to create a boxplot.
Let us create a boxplot for mpg in the mtcars dataset.
> boxplot(mtcars$mpg)
Box plots of a quantitative variable with respect to different
levels of a factor can be produced as follows.
> boxplot(mpg ~ cyl, data = mtcars)
We can also have box plots based on more than one grouping
factor, e.g., cyl and am.
Let us first create factors from these variables.
> mtcars$cyl.f <- factor(mtcars$cyl, levels = c(4,6,8),
labels = c("4","6","8")) # to factor cyl
> mtcars$am.f <- factor(mtcars$am, levels = c(0,1),
labels = c("auto", "standard")) # to factor am
> boxplot(mpg ~ am.f*cyl.f, data = mtcars)
Scatterplot matrix
Matrix scatterplots can be created if we want to graphically
explore the relationships between more than two variables.
We use the function pairs() to create pairs of scatterplots.
> pairs(mtcars[, c("mpg", "disp", "hp", "drat")])
Low-level plotting functions
Low-level functions add extra information such as points, lines,
arrows or text to an existing plot.
points(x, y, ...): Adds points to the current plot.
lines(x, y, ...): Adds line segments.
text(x, y, labels, ...): Adds text into the graph.
abline(a, b, ...): Adds the line y = a + bx.
abline(h = y, ...): Adds a horizontal line.
segments(x0, y0, x1, y1, ...): Draws line segments from (x0, y0)
to (x1, y1).
legend("position", fill = , cex = , ...): Displays a legend.
Let us include a legend on the stacked VADeaths bar plot produced
earlier.
> barplot(VADeaths, col = c("green", "yellow", "red", "black",
"blue"))
> legend("topright", legend = rownames(VADeaths), fill =
c("green", "yellow", "red", "black", "blue"), cex = 0.3)
Multiple figure environments
We can create an n by m array of figures on a single page using
the mfrow and mfcol arguments of the par() function.
> par(mfcol = c(n, m)): draws n rows and m columns of plots
on one page, filled column by column.
> par(mfrow = c(n, m)): the array is filled row by row.
Example. Let us produce four different graphs based on the
built-in dataset AirPassengers.
> par(mfrow = c(2,2))
> plot(AirPassengers, type = "p")
> title("Points")
> plot(AirPassengers, type = "l")
> title("Lines")
> plot(AirPassengers, type = "b")
> title("Points & Lines")
> plot(AirPassengers, type = "h")
> title("High Density")
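The four panels can also be produced in a loop, and the layout can be reset afterwards so that later plots are not squeezed into a 2 by 2 grid. This is a sketch: pdf(NULL) opens a device that writes no file and is used here only so the code runs outside RStudio, and the object names are introduced for illustration.

```r
pdf(NULL)                        # null device: nothing is written to disk

old_par <- par(mfrow = c(2, 2))  # par() returns the previous settings

# Plot types and the titles used on the slides
types <- c(p = "Points", l = "Lines", b = "Points & Lines", h = "High Density")
for (t in names(types)) {
  plot(AirPassengers, type = t, main = types[[t]])
}

layout_used <- par("mfrow")      # layout while the panels were drawn
par(old_par)                     # restore the previous layout
invisible(dev.off())

print(layout_used)
```

Saving the value returned by par() and passing it back is a common way to undo temporary graphical settings.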
Exercise 7
Customizing plots
Once we are familiar with the base graphics in R, we can add
more information to our graphs.
Labels for the x and y axes can be included by using the xlab = " "
and ylab = " " arguments, respectively, in the plot() function.
> plot(mtcars$mpg, mtcars$wt, xlab = "Miles per gallon", ylab =
"Weight of car (1000 lbs)")
There are also some situations where we would like to adjust
figure margins.
Figure margins can be adjusted by using the par() function and
the mar = argument before we plot the graph.
> par(mar = c(bottom, left, top, right)), where the arguments
bottom, left, top and right are the sizes of the corresponding
margins.
By default R uses c(5.1, 4.1, 4.1, 2.1), where these numbers
represent the number of lines in each margin.
> par(mar = c(5, 4, 4, 6))
> plot(mtcars$mpg, mtcars$wt, xlab = "Miles per gallon", ylab =
"Weight of car (1000 lbs)") # the wider right margin shrinks the plot width
We can control the range of axes scales by using the xlim
=c(min, max) and ylim = c(min, max) arguments.
Let us set the x axis scale from 0 to 40 and the range of the y
axis scale from 0 to 8 and see the difference.
> plot(mtcars$mpg, mtcars$wt, xlab = "Miles per gallon", ylab =
"Weight of car (1000 lbs)", xlim = c(0,40), ylim = c(0,8))
We can also change the color and the size of the symbol used in
plotting by using the col = and cex = arguments, respectively.
col = can either take an integer value to specify the color or a
character string (col = "green") giving the color name.
The colors() function gives a list of all 657 preset colors in
base R.
cex = requires a numeric value to indicate the proportional
increase or decrease in size relative to the default value of 1.
Let us change the color of the dots to "green" and decrease the
size of the symbol by 40% in the above plot.
> plot(mtcars$mpg, mtcars$wt, xlab = "Miles per gallon", ylab =
"Weight of car (1000 lbs)", xlim = c(0,40), ylim = c(0,8),
col = "green", cex = 0.6)
The function text() can be used to add a text label at a specific
(x, y) coordinate of the plot.
Let us add the label "(30, 2)" at the point (30, 2) in the above plot.
> plot(mtcars$mpg, mtcars$wt, xlab = "Miles per gallon", ylab =
"Weight of car (1000 lbs)", xlim = c(0,40), ylim = c(0,8),
col = "green", cex = 0.6)
> text(30, 2, labels = "(30, 2)")
The background of a plot can be changed by par(bg = "color"), where
color can be "red", "white", "green", etc.
In R, it is possible to use different symbols and colors for data
points depending on the level of a factor variable by using the
low-level function points().
Suppose we want to produce a scatterplot of wt against mpg
with a separate symbol for each level of cyl in the mtcars dataset.
Step 1. Let us include the type = "n" argument to create an empty
plot region.
> plot(mtcars$mpg, mtcars$wt, xlab = "Miles per gallon",
ylab = "Weight (1000 lbs)", xlim = c(0,40), ylim = c(0,8), bty = "l",
type = "n")
Step 2. Plot for cyl == 4
> points(x = mtcars$mpg[mtcars$cyl==4], y =
mtcars$wt[mtcars$cyl==4], pch = 2, col = "red")
Step 3. Plot for cyl == 6
> points(x = mtcars$mpg[mtcars$cyl==6], y =
mtcars$wt[mtcars$cyl==6], pch = 3, col = "yellow")
Step 4. Plot for cyl == 8
> points(x = mtcars$mpg[mtcars$cyl==8], y =
mtcars$wt[mtcars$cyl==8], pch = 4, col = "green")
Let us finally include a legend to describe what each symbol and
color designate in the plot.
> Cols <- c("red", "yellow", "green")
> Symbol <- c(2,3,4)
> Label <- c("4 Cylinder", "6 Cylinder", "8 Cylinder")
> legend(x = 10, y = 4, col = Cols, pch = Symbol, legend =
Label, cex = 0.25)
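Steps 1 to 4 and the legend can be collected into one script by looping over the levels of cyl. This is a sketch under a few assumptions: pdf(NULL) is used only so that the code runs without a screen device, the legend is placed with the keyword "topright" instead of coordinates, and cyl_levels is a name introduced here.

```r
pdf(NULL)  # null device: nothing is written to disk

cyl_levels <- c(4, 6, 8)
Cols   <- c("red", "yellow", "green")
Symbol <- c(2, 3, 4)

# Step 1: empty plot region with the final axes
plot(mtcars$mpg, mtcars$wt, type = "n", bty = "l",
     xlab = "Miles per gallon", ylab = "Weight (1000 lbs)",
     xlim = c(0, 40), ylim = c(0, 8))

# Steps 2 to 4: add the points of one cylinder group at a time
for (i in seq_along(cyl_levels)) {
  keep <- mtcars$cyl == cyl_levels[i]
  points(mtcars$mpg[keep], mtcars$wt[keep], pch = Symbol[i], col = Cols[i])
}

# Legend linking each symbol and color to its group
legend("topright", legend = paste(cyl_levels, "Cylinder"),
       col = Cols, pch = Symbol)

invisible(dev.off())
```

Keeping the colors and symbols in vectors means the same vectors drive both the points and the legend, so the two cannot get out of step.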
Exporting plots
Plots in R can be exported to different formats such as jpeg,
pdf, bmp.
The first option is to click on the Export button in the Plots tab.
The second option is to write code in an R script.
To save a plot in pdf format we will use the pdf() function.
Similarly, we use the functions jpeg() and bmp() for jpeg and bmp
formats.
Once we have run the plotting code we need to close the
plotting device using the dev.off() function.
Let us export the labelled scatterplot produced earlier to jpeg
and png formats.
> jpeg("my_plot.jpeg")
> plot(mtcars$mpg, mtcars$wt, xlab = "Miles per gallon",
ylab = "Weight (1000 lbs)", xlim = c(0,40), ylim = c(0,8), col = "green",
cex = 0.6)
> text(30, 2, labels = "(30, 2)")
> dev.off()
> png("my_plot.png")
> plot(mtcars$mpg, mtcars$wt, xlab = "Miles per gallon",
ylab = "Weight (1000 lbs)", xlim = c(0,40), ylim = c(0,8), col = "green",
cex = 0.6)
> text(30, 2, labels = "(30, 2)")
> dev.off()
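The same open, plot, close pattern applies to pdf(). The sketch below writes to a temporary file so that nothing is left in the working directory; the name out_file is introduced here for illustration.

```r
# Temporary file name ending in .pdf
out_file <- tempfile(fileext = ".pdf")

pdf(out_file)                        # open the pdf device
plot(mtcars$mpg, mtcars$wt,
     xlab = "Miles per gallon", ylab = "Weight (1000 lbs)",
     xlim = c(0, 40), ylim = c(0, 8), col = "green", cex = 0.6)
text(30, 2, labels = "(30, 2)")      # low-level call after plot()
invisible(dev.off())                 # close the device to finish the file

print(file.exists(out_file))
```

The file is complete only after dev.off() has been called, so a plot that appears to be missing or empty is often a sign that the device was left open.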
Exercise 8
Basic inferential analysis
Basic inference
After performing data cleaning and descriptive analysis, we need
to perform inferential analysis.
This includes estimation of model parameters and tests of
hypotheses.
Tests of independence: The chi-square test of independence
tests whether two categorical variables are independent.
We use the function chisq.test() to perform the test.
Let us consider the Arthritis dataset in the vcd package and test if
there is an association between Treatment and Improved.
> library(vcd)
> mytable <- xtabs(~ Treatment + Improved, data = Arthritis)
> chisq.test(mytable)
The null hypothesis, i.e., that there is no association between the
two variables, is rejected since the p-value = 0.001463 < 0.05.
We may also use the prop.test() function to analyze this dataset.
The proportions of Placebo patients within the improvement
categories None, Some, and Marked are 0.69, 0.50, and 0.25,
respectively.
These proportions differ, but is the difference statistically
supported?
Let us consider the vector of counts in the Placebo
group, i.e., (29, 7, 7), and the category totals, (42, 14, 28).
> Col_Total <- margin.table(mytable, 2)
> Placebo <- mytable[1, ]
> prop.test(Placebo, Col_Total)
We can conclude that the proportions differ significantly since
the p-value = 0.001463 < 0.05.
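Both tests can be reproduced without the vcd package by typing in the counts quoted above. In this sketch the Treated row (13, 7, 21) is obtained by subtracting the Placebo counts from the category totals, and the object names are introduced for illustration.

```r
# 2 x 3 table of counts: Placebo row and category totals as quoted above
counts <- matrix(c(29, 7, 7,
                   13, 7, 21),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Treatment = c("Placebo", "Treated"),
                                 Improved = c("None", "Some", "Marked")))

# Chi-square test of independence on the full table
chi <- chisq.test(counts)

# Test of equal Placebo proportions across the three categories
pt <- prop.test(counts["Placebo", ], colSums(counts))

print(chi$p.value)
print(pt$p.value)
```

For a table with two rows the two procedures are based on the same chi-square statistic, which is why they report the same p-value.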
Correlations
We use correlation coefficients to measure the association
between two quantitative variables.
The Pearson, Spearman, and Kendall correlation coefficients can
be obtained using: cor(x, use = , method = )
x = matrix or data frame, use = option to handle missing data,
method = "pearson", "spearman", or "kendall".
A partial correlation is a correlation between two quantitative
variables, controlling for one or more other quantitative variables.
Code to use:
> library(ggm)
> pcor(u, S) # the first two numbers in u are the indices of the
variables to be correlated; the remaining numbers are the indices
of the variables to be partialled out
S is the covariance matrix.
Let us consider the built-in dataset state.x77 to illustrate the
procedure.
> states <- state.x77[, 1:6]
> colnames(states) gives
[1] "Population" "Income" "Illiteracy" "Life Exp" "Murder"
"HS Grad"
> cor(states) gives the 6 x 6 correlation matrix.
Let us find the partial correlation coefficient between var1
(Population) and var5 (Murder rate) keeping the effects of var2,
var3, and var6 constant.
> library(ggm)
> pcor(c(1,5,2,3,6), cov(states)) # gives [1] 0.346
0.346 = partial correlation between Population (var1) and Murder
rate (var5)
cov(states) is the covariance matrix among variables in states.
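A partial correlation can also be obtained in base R as the correlation between two sets of regression residuals. The sketch below uses this standard identity as an alternative to ggm::pcor(), not as the method of the slides; the object names are introduced for illustration.

```r
states <- as.data.frame(state.x77[, 1:6])
# Make the column names usable in formulas ("HS Grad" becomes "HS.Grad")
names(states) <- make.names(names(states))

# Remove the linear effect of Income, Illiteracy and HS Grad from each variable
res_pop <- resid(lm(Population ~ Income + Illiteracy + HS.Grad, data = states))
res_mur <- resid(lm(Murder ~ Income + Illiteracy + HS.Grad, data = states))

# Correlation of what is left over = partial correlation
partial_r <- cor(res_pop, res_mur)
print(round(partial_r, 3))
```

The result should agree with the value reported by pcor() above, since both remove the same three variables before correlating Population and Murder rate.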
The function cor.test(x, y, alternative = , method = ) tests
a single correlation coefficient for significance.
corr.test() in the psych package tests all pairs of variables at
once, and print(corr.p(Cor$r, n = ), short = FALSE) prints the
p-values and confidence intervals.
Let us test which of the correlations given above are statistically
significant.
> library(psych)
> Cor <- corr.test(states, use = "complete")
> print(corr.p(Cor$r, n = 50), short = FALSE)
There is a strong (r = 0.70) and statistically significant
(p-value < 0.001, displayed as 0.00) correlation between Illiteracy and
Murder rate.
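For a single pair of variables, base R's cor.test() gives the estimate, p-value, and confidence interval directly. The sketch below applies it to Illiteracy and Murder; the object name ct is an assumption.

```r
states <- as.data.frame(state.x77[, 1:6])

# Test H0: the true correlation between Illiteracy and Murder is zero
ct <- cor.test(states$Illiteracy, states$Murder,
               alternative = "two.sided", method = "pearson")

ct$estimate   # sample correlation coefficient
ct$p.value    # p-value of the test
ct$conf.int   # 95% confidence interval (available for Pearson)
```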
T-tests
T-tests compare the mean of continuous data either against an a
priori stipulated value or between two groups.
One Sample t Test
H0: µ = µ0 vs H1: µ < µ0, µ > µ0, or µ ≠ µ0
We can use the function t.test() to perform a t-test.
t.test(x, mu = , optional arguments) # x = data vector, mu = proposed mean value
Options: alternative = "greater", alternative = "less", and
conf.level = X.
Example. Use the following data to test whether the mean value is 7500.
> Y <- c(5660, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515,
8230, 8770, 8800, 8000, 7750, 6950)
> t.test(Y, mu = 7500)
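The optional arguments can be sketched on the same data as follows. The object names two_sided, one_sided, and ci_99 are assumptions introduced for illustration.

```r
Y <- c(5660, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515,
       8230, 8770, 8800, 8000, 7750, 6950)

# Two-sided test of H0: mu = 7500 (the default alternative)
two_sided <- t.test(Y, mu = 7500)

# One-sided test of H1: mu < 7500
one_sided <- t.test(Y, mu = 7500, alternative = "less")

# 99% confidence interval for the mean instead of the default 95%
ci_99 <- t.test(Y, mu = 7500, conf.level = 0.99)$conf.int

two_sided$p.value
one_sided$p.value
ci_99
```

Because the sample mean lies below 7500, the one-sided p-value for alternative = "less" is half of the two-sided p-value.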
There is not sufficient statistical evidence to reject the null
hypothesis (p-value = 0.1636 > 0.05).
The mean value is not significantly different from 7500.
We can apply the logic of the one-sample t-test to test whether
two population means are different.
We may need to test for a mean difference between two dependent
or two independent samples.
The independent t-test assumes that the two groups are independent
and that the data are sampled from normal populations.
It can be performed by:
> t.test(y ~ x, data = , option = ) # y is numeric and x is a dichotomous variable.
> t.test(y1, y2, option = ) # both y1 and y2 are numeric vectors.
The option can be var.equal = TRUE, alternative = "less", or
alternative = "greater".
By default (var.equal = FALSE), R performs the Welch test, which does
not assume equal variances.
A dependent (paired) t-test assumes that the differences between the
paired observations are normally distributed.
The dependent t-test can be performed by:
> t.test(y1, y2, paired = TRUE) # y1 and y2 are numeric vectors
for the two dependent groups.
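A short sketch of both forms on built-in datasets. The datasets mtcars and sleep are stand-ins chosen for illustration and are not part of the slides.

```r
# Independent samples, formula interface: mpg by transmission type (am = 0 or 1)
welch  <- t.test(mpg ~ am, data = mtcars)                    # Welch test (default)
pooled <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)  # pooled-variance test

# Dependent samples: the same 10 subjects measured under two drugs
drug1 <- sleep$extra[sleep$group == 1]
drug2 <- sleep$extra[sleep$group == 2]
paired <- t.test(drug1, drug2, paired = TRUE)

# A paired t-test is a one-sample t-test on the within-pair differences
diffs <- t.test(drug1 - drug2, mu = 0)
c(paired = paired$p.value, differences = diffs$p.value)
```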
Let us consider part of the built-in dataset PlantGrowth and test
whether the mean weight for trt1 is the same as that of trt2.
> Weight1 <- PlantGrowth$weight[PlantGrowth$group == "trt1"]
> Weight2 <- PlantGrowth$weight[PlantGrowth$group == "trt2"]
> t.test(Weight1, Weight2)
The null hypothesis is rejected (p-value = 0.009298 < 0.05) and
we conclude that the mean weight for trt1 is statistically
different from that of trt2.
Consider a dataset (marg) from a study to check whether
cholesterol was reduced after using a certain brand of margarine
as part of a low-fat, low-cholesterol diet. The dataset contains
information on 18 people using margarine to reduce cholesterol,
measured at two time points.
Test whether the mean cholesterol level before using the margarine
is the same as that after using the margarine.
> Marg <- read.csv("C:/Users/User/OneDrive/Desktop/QuantDataAna/marg.csv", header = TRUE)
> with(Marg, t.test(Before, After4weeks, paired = TRUE))
The null hypothesis is rejected (p-value = 1.958 × 10^-11 < 0.05),
and we conclude that the mean cholesterol level before using the
margarine differs from that after using the margarine.
Nonparametric tests
The t-test depends on the assumption that the data being analyzed
are normally distributed.
Sometimes the distribution of the data under consideration
is not normal.
In such cases, we can use nonparametric tests as alternatives
to t-tests.
We can use the function wilcox.test() to perform rank-based
(nonparametric) tests.
Let us recall the dataset Y from the one-sample t-test example and
test whether the median value is 7500 or not.
> wilcox.test(Y, mu = 7500)
The p-value = 0.2219 > 0.05 and hence the null hypothesis is
retained.
The Wilcoxon rank sum test for two independent groups can be
performed by:
> wilcox.test(y ~ x, data = ) # y is numeric and x is a dichotomous variable.
> wilcox.test(y1, y2) # y1 and y2 are the outcome variables.
The Wilcoxon signed-rank test, a nonparametric alternative to the
dependent-sample t-test, can be performed by:
> wilcox.test(y1, y2, paired = TRUE)
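The two-sample forms can be tried on built-in data as sketched below. PlantGrowth and sleep are used as stand-ins, and exact = FALSE is set here only to request the normal approximation, which avoids warnings when the data contain ties.

```r
# Two independent groups: Wilcoxon rank sum test
trt1 <- PlantGrowth$weight[PlantGrowth$group == "trt1"]
trt2 <- PlantGrowth$weight[PlantGrowth$group == "trt2"]
rank_sum <- wilcox.test(trt1, trt2, exact = FALSE)

# Two dependent groups: Wilcoxon signed-rank test
drug1 <- sleep$extra[sleep$group == 1]
drug2 <- sleep$extra[sleep$group == 2]
signed_rank <- wilcox.test(drug1, drug2, paired = TRUE, exact = FALSE)

rank_sum$p.value
signed_rank$p.value
```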
Exercise 9
Intermediate statistical methods
Comparing more than two groups - ANOVA
ANOVA is an extension of the t-test that compares the means of
two or more groups.
One-way ANOVA: y ~ A
Two-way factorial ANOVA: y ~ A * B
Randomized block ANOVA: y ~ B + A, where B is a blocking factor.
y is the dependent variable and the letters A and B represent
factors.
Example. Consider the cholesterol dataset in the multcomp
package. i) Find the group means, ii) produce and interpret
boxplots, iii) test for group mean differences.
i) > library(multcomp)
> attach(cholesterol)
> aggregate(response, by = list(trt), FUN = mean)
ii) > boxplot(response ~ trt, data = cholesterol)
The boxplots suggest that the typical response differs between
the groups.
iii) > fit <- aov(response ~ trt, data = cholesterol)
> summary(fit)
The above ANOVA table shows that the mean response in at
least one of the groups differs from that of the others (p-value =
9.82 × 10^-13 < 0.05).
Let us proceed to test which pairs of mean responses
differ statistically.
The function TukeyHSD() performs Tukey's Honest Significant
Difference test, a commonly used method to
compare all pairwise differences between group means.
Let us apply it to the present case.
> TukeyHSD(fit)
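The same workflow can be run without extra packages. The sketch below uses the built-in PlantGrowth data as a stand-in for the cholesterol data; the names fit_pg and tukey are assumptions.

```r
# One-way ANOVA of plant weight on treatment group (ctrl, trt1, trt2)
fit_pg <- aov(weight ~ group, data = PlantGrowth)
summary(fit_pg)

# All pairwise comparisons with Tukey's HSD
tukey <- TukeyHSD(fit_pg)
tukey

# Each row gives the difference in means, its confidence limits,
# and the adjusted p-value for one pair of groups
tukey$group
```

Calling plot(tukey) draws the confidence intervals; intervals that exclude zero correspond to pairs that differ significantly.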
One-way ANOVA
Testing model assumptions
Assumptions: the dependent variable is normally distributed with
equal variance in each group.
Use a Q-Q plot to assess the normality assumption:
> library(car)
> qqPlot(lm(response ~ trt, data = cholesterol))
Here qqPlot() is applied to an lm() fit of the same model.
The normality assumption does not appear to be violated, as the
points lie close to the reference line.
The constant variance (homogeneity of variance) assumption can be
assessed by Bartlett's test:
> bartlett.test(response ~ trt, data = cholesterol)
There is not sufficient statistical evidence (p-value = 0.9653 >
0.05) to reject the null hypothesis of constant variance.
The ANOVA model appears appropriate for the data since
the above assumptions are not violated.
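A base-R sketch of the same two checks, again with PlantGrowth as a stand-in dataset. Using the Shapiro-Wilk test on the residuals is an addition for illustration and is not the method used in the slides.

```r
fit_pg <- aov(weight ~ group, data = PlantGrowth)

# Normality: Q-Q plot of the residuals plus a Shapiro-Wilk test
res <- residuals(fit_pg)
qqnorm(res)
qqline(res)
sw <- shapiro.test(res)

# Equal variances across groups: Bartlett's test
bt <- bartlett.test(weight ~ group, data = PlantGrowth)

# Large p-values give no evidence against the respective assumption
c(shapiro = sw$p.value, bartlett = bt$p.value)
```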
Two-way ANOVA
Two-way ANOVA simultaneously evaluates the effect of two grouping
variables on a response variable.
These grouping variables are also known as factors.
Three possible effects in this design: two main effects and one
interaction effect.
Example. Sixty guinea pigs are randomly assigned to receive
one of three dose levels of vitamin C (0.5, 1, or 2 mg/day) and
one of two delivery methods (orange juice or ascorbic acid), and
tooth length was measured (see ToothGrowth dataset).
i) Find the group means of tooth length, ii) produce and
interpret box plots, iii) test whether the main effects (supp and
dose) and the interaction between these factors are significant.
> attach(ToothGrowth)
i) > aggregate(len, by = list(supp, dose), FUN = mean)
ii) > boxplot(len ~ supp:dose, data = ToothGrowth)
iii) > ToothGrowth$dose <- factor(ToothGrowth$dose) # Converts dose to a factor in the data frame used by aov()
> fit <- aov(len ~ supp * dose, data = ToothGrowth)
> summary(fit)
Both main effects and the interaction are statistically
significant.
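The three steps can also be run without attach(), working on a copy of the data. This is a sketch; the names tg, fit_tg, and tab are assumptions.

```r
# Work on a copy so the built-in dataset is left unchanged
tg <- ToothGrowth
tg$dose <- factor(tg$dose)   # treat the three dose levels as categories

# i) Cell means for each supp x dose combination
aggregate(len ~ supp + dose, data = tg, FUN = mean)

# iii) Two-way factorial ANOVA with interaction
fit_tg <- aov(len ~ supp * dose, data = tg)
tab <- summary(fit_tg)[[1]]
tab

# Interaction plot: non-parallel lines suggest an interaction
interaction.plot(tg$dose, tg$supp, tg$len,
                 xlab = "Dose (mg/day)", ylab = "Mean tooth length",
                 trace.label = "supp")
```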
Checking the normality assumption
> library(car)
> qqPlot(lm(len ~ supp * dose, data = ToothGrowth))
What can you say about the normality assumption based on the
above plot?
Checking constant variance
> library(car)
> leveneTest(len ~ factor(supp) * factor(dose), data = ToothGrowth)
What can be said about the constant variance
assumption?
Regression
Regression analysis can be used to:
1 identify the explanatory variables that are related to a response
variable,
2 describe the form of the relationships (if any),
3 predict the value of the response variable from the explanatory
variables.
Some questions for which a regression model is suitable include:
What is the relationship between surface stream salinity and
paved road surface area?
Which qualities of an educational environment are most strongly
related to higher student achievement scores?
What is the form of the relationship between blood pressure, salt
intake, and age? Is it the same for men and women?
A function for fitting a linear model:
> myfit <- lm(formula, data) # formula = Y ~ X1 + ... + Xk
Some symbols to be used in the formula:
~ : Separates the response variable on the left from the explanatory
variables on the right.
+ : Separates predictor variables.
: : Denotes an interaction between predictor variables, e.g. y ~ x + z + x:z.
* : A shortcut for all main effects and their interactions; y ~ x * z
expands to y ~ x + z + x:z.
-1 : Suppresses the intercept. y ~ x - 1 fits a regression of y on
x without an intercept.
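The formula symbols can be compared side by side on a built-in dataset. The sketch below uses mtcars with mpg, wt, and hp as stand-in variables; the model names are assumptions.

```r
f_add   <- lm(mpg ~ wt + hp, data = mtcars)          # two predictors
f_int   <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)  # main effects plus interaction
f_star  <- lm(mpg ~ wt * hp, data = mtcars)          # shortcut for the line above
f_noint <- lm(mpg ~ wt - 1, data = mtcars)           # no intercept

names(coef(f_add))
names(coef(f_int))
names(coef(f_star))    # same terms as f_int
names(coef(f_noint))   # only the slope for wt
```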
Other functions:
summary(): Detailed results for the fitted model.
coefficients(): Lists the intercept and slopes for the fitted model.
confint(): Confidence intervals for the model parameters.
fitted(): Lists the predicted values in a fitted model.
residuals(): Lists the residual values in a fitted model.
anova(): An ANOVA table for a fitted model.
vcov(): Lists the covariance matrix for model parameters.
AIC(): Prints Akaike’s Information Criterion.
plot(): Diagnostic plots for evaluating the fit of a model.
predict(): Uses a fitted model to predict response values for a
new dataset.
Code for polynomial regression of degree n:
> fit1 <- lm(y ~ x + I(x^2) + ... + I(x^n), data = )
A scatter plot matrix can be generated with scatterplotMatrix() from
the car package:
> scatterplotMatrix(data, smoother.args = list(lty = 2))
Example. Consider the built-in dataset women, which
provides the height and weight for a set of 15 women, and fit a
regression model.
Let us first draw the scatterplot of weight versus height.
> plot(women$height, women$weight)
This scatterplot suggests a linear relationship between the weight
and height of women.
> fit <- lm(weight ~ height, data = women)
> summary(fit) # Model summary
According to the above result, height is a
statistically significant predictor of the weight of women.
When the height of a woman increases by 1 inch, the weight
increases by 3.45 pounds on average.
Let us now superimpose the fitted regression line on the
scatterplot.
> plot(women$height, women$weight)
> abline(fit)
Let us rerun the regression with a quadratic term (that is, X^2):
> fit2 <- lm(weight ~ height + I(height^2), data = women)
> plot(women$height, women$weight)
> lines(women$height, fitted(fit2))
Does the quadratic term improve prediction?
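One way to examine this question is to compare the two nested models with anova(), adjusted R-squared, and AIC(), all of which are listed among the helper functions earlier. The sketch below is an illustration; the object name cmp and the heights used for prediction are assumptions.

```r
fit  <- lm(weight ~ height, data = women)
fit2 <- lm(weight ~ height + I(height^2), data = women)

# F-test for the extra quadratic term (nested model comparison)
cmp <- anova(fit, fit2)
cmp

# Adjusted R-squared and AIC of both models
c(linear = summary(fit)$adj.r.squared,
  quadratic = summary(fit2)$adj.r.squared)
AIC(fit, fit2)

# Predicted weights for new heights (in inches) from the quadratic model
predict(fit2, newdata = data.frame(height = c(60, 65, 70)))
```

A small p-value in the anova() table and a lower AIC for fit2 indicate that the quadratic term improves the fit.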
Categorical Independent Variables:
These are recoded into a set of separate binary variables before
we enter them into a regression model.
Such recoding is known as "dummy coding".
For example, R will automatically create a dummy variable
SexMale from the factor variable Sex:
SexMale = 1 if a person is Male
SexMale = 0 if a person is Female
The default option in R is to use the first level of the factor as the
reference and interpret the remaining levels relative to this level.
We can use the function contrasts() to see the coding that R
has used to create the dummy variables.
> contrasts(Sex)
Suppose that the coefficient corresponding to SexMale is found to
be 2.5 in a fitted regression model where the response is score.
Note here that Sex = Female is the reference category.
The interpretation will be: a male would get 2.5 points more
than a female on average.
We can use the function relevel() to change the reference category
to Male as follows:
> Sex <- relevel(Sex, ref = "Male") # inside a dplyr pipeline: mutate(Sex = relevel(Sex, ref = "Male"))
A dummy variable SexFemale will then be created when Sex is used
in a model.
In general, a categorical variable with d levels will be
transformed into d - 1 dummy variables, each taking the values 0 and 1.
Suppose for example that a variable Education has 4 levels:
None, Primary, Secondary, Tertiary.
With None as the reference level, the three dummy variables are
Primary, Secondary, and Tertiary.
▶ If Education = Primary, then the column Primary would be
coded with a 1 while Secondary and Tertiary would be coded with a 0.
▶ If Education = Secondary, then the column Secondary would be
coded with a 1 while Primary and Tertiary would be coded with a 0.
▶ If Education = Tertiary, then the column Tertiary would be
coded with a 1 while Primary and Secondary would be coded with a 0.
▶ If Education = None, then each of the columns Primary,
Secondary, and Tertiary would be coded with a 0.
▶ Note that it is good practice to convert a character vector to a
factor first, so that the levels and the reference category can be
controlled.
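The coding described above can be inspected directly with contrasts() and model.matrix(). The small education vector below is made-up illustration data, not taken from the slides.

```r
# Hypothetical Education values for five people
education <- factor(c("None", "Primary", "Secondary", "Tertiary", "Primary"),
                    levels = c("None", "Primary", "Secondary", "Tertiary"))

# Dummy coding R will use: rows are levels, columns are dummy variables
contrasts(education)

# Design matrix actually passed to a regression (intercept + 3 dummies)
X <- model.matrix(~ education)
X

# Changing the reference category to Tertiary
education2 <- relevel(education, ref = "Tertiary")
levels(education2)
```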
Multiple linear regression is an extension of the simple linear
regression.
Example. Use the built-in dataset state.x77 to explore the
relationship between a state's murder rate and other
characteristics including population, illiteracy rate, average
income, and frost levels.
Let us extract the variables that we are interested in.
> states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
> cor(states) # Gives pairwise correlations
Scatter plot matrix:
> library(car)
> scatterplotMatrix(states)
Let us fit the multiple linear regression model:
> fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
> summary(fit)
Population and Illiteracy are statistically significant predictors of Murder.
For a one-unit (one percentage point) increase in Illiteracy, the
Murder rate increases by 4.14 units (murders per 100,000 population)
on average, keeping the other predictors fixed.
Regression diagnostics
Normality can be assessed by the qqPlot() function.
Example.
> library(car)
> states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
> fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
> qqPlot(fit, labels = row.names(states), id.method = "identify", simulate = TRUE, main = "Q-Q Plot")
simulate = TRUE adds a 95% confidence envelope using a
parametric bootstrap.
id.method = "identify" lets us add labels to points on the graph
interactively with the mouse (recent versions of the car package use
the id argument instead, e.g. id = list(method = "identify")).
The residuals approximately follow a normal distribution.
Independence of the errors can be checked by the Durbin–Watson test.
> library(car)
> durbinWatsonTest(fit)
Homoscedasticity (constant error variance) can be tested by the
ncvTest() function.
> library(car)
> ncvTest(fit)
Multicollinearity can be assessed by the variance inflation factor (VIF):
> library(car)
> vif(fit)
Since the variance inflation factors for all of the above
variables are small (less than 5), we can conclude that
multicollinearity is not a problem here.
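The VIF of a predictor equals 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The base-R sketch below computes it by hand as an illustration of what the values mean; the name vif_manual is an assumption.

```r
states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy",
                                      "Income", "Frost")])
predictors <- c("Population", "Illiteracy", "Income", "Frost")

# For each predictor: regress it on the remaining predictors, take R-squared
vif_manual <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = states))$r.squared
  1 / (1 - r2)
})

round(vif_manual, 2)
```

A VIF of 1 means the predictor is uncorrelated with the others; larger values mean its coefficient variance is inflated by collinearity.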
Logistic Regression — Binary
Recall that a linear regression model is employed when the
dependent variable is quantitative.
However, we may encounter a categorical response variable with
a success/failure scenario, such as democracy/autocracy,
war/peace, trade agreement/no trade agreement, or
underweight/normal.
We can use binary logistic regression to model such a response
variable as a function of one or more independent variables.
Unlike linear regression, logistic regression predicts the probability
of success of the response variable as a function of the independent
variable(s), by modeling the log-odds of success as a linear function of them.
We can use the function glm() to fit a logistic regression model.
The hypotheses of interest in logistic regression:
H0: An independent variable has no effect on the probability of
success of the response.
H1: An independent variable has an effect on the probability of
success of the response.
Note that the dependent variable should be coded as 0/1 (or as a
two-level factor) before we proceed to use glm().
Example. Let us consider the data on the passengers of the
Titanic in 1912 and investigate whether age influenced the
probability of traveling in first class.
Load the data into R:
> Titanic <- read.csv("C:/Users/User/OneDrive/Desktop/QuantDataAna/Titanic.csv", header = TRUE)
Let us first recode the response variable (pclass) into a binary variable.
> library(dplyr) # provides mutate(), recode(), and %>%
> Titanic <- Titanic %>%
mutate(class = as.numeric(recode(pclass, "1st" = "1", "2nd" = "0", "3rd" = "0")))
We use the following code to fit a simple binary logistic
regression model of class on age.
> Lfit <- glm(class ~ age, data = Titanic, na.action = na.exclude, family = "binomial")
> summary(Lfit)
The coefficient for age is highly significant (p-value < 2 × 10^-16).
The odds ratios for the regression coefficients can be obtained by
> exp(coef(Lfit))
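Since Titanic.csv is a local file, the same steps can be tried on built-in data. The sketch below is a stand-in example, not the slides' analysis: it models the probability that a car in mtcars has a manual transmission (am = 1) as a function of its weight (wt).

```r
# am: 1 = manual, 0 = automatic; wt: weight in 1000 lbs
lfit <- glm(am ~ wt, data = mtcars, family = binomial)
summary(lfit)$coefficients

# Odds ratios: multiplicative change in the odds of am = 1 per unit of wt
or <- exp(coef(lfit))
or

# Predicted probabilities over a grid of weights
wt_seq <- seq(1.5, 5.5, by = 0.5)
p_hat <- predict(lfit, newdata = data.frame(wt = wt_seq), type = "response")
round(p_hat, 3)
```

Here the odds ratio for wt is below 1, so heavier cars have lower odds of a manual transmission.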
The odds ratio corresponding to age = exp(0.067767) = 1.07
Interpretation:
▶ For every one-unit (one-year) increase in age, the odds of traveling
in first class increase by a factor of 1.07.
▶ Let us clarify this by taking specific age values:
At age = 30, the predicted log-odds of traveling in first class =
−3.187456 + 0.067767 × 30 = −1.154446.
Taking the exponential, exp(−1.154446) yields odds of
0.315232127.
At age = 31, the predicted log-odds of traveling in first class =
−3.187456 + 0.067767 × 31 = −1.086679.
exp(−1.086679) yields odds of 0.337334925.
Dividing the odds at age = 31 by the odds at age = 30,
i.e., 0.337334925/0.315232127, gives 1.07.
Note
When the odds ratio for a given predictor is less than one, an
increase in the value of the predictor leads to a decrease in the odds
of success on the response.
If the odds ratio for a given predictor is exactly 1, the odds of
success on the response do not change when the value of the
predictor changes.
Odds can be converted to a probability by using the following
formula: probability = odds / (1 + odds).
For example, the predicted probability of traveling in first class
when a passenger is 30 years old = 0.315232127 / (1 + 0.315232127)
= 0.23967794.
We can also predict the probabilities for a sequence of age values.
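The arithmetic above can be reproduced in a few lines. The coefficients are the ones reported in the slides; the helper log_odds() and the other names are assumptions introduced for illustration.

```r
# Intercept and slope for age, as reported in the slides
b0 <- -3.187456
b1 <- 0.067767

# Predicted log-odds of traveling in first class at a given age
log_odds <- function(age) b0 + b1 * age

odds_30 <- exp(log_odds(30))
odds_31 <- exp(log_odds(31))

# Ratio of odds one year apart equals exp(b1), the odds ratio for age
odds_31 / odds_30
exp(b1)

# Odds to probability, by hand and with the built-in logistic function
prob_30 <- odds_30 / (1 + odds_30)
prob_30
plogis(log_odds(30))
```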
> Sq_age <- seq(0, 80, 1)
> Prob_age <- predict(Lfit, list(age = Sq_age), type = "response")
> Prob_age # gives the predicted probabilities for ages 0 to 80
Let us plot age versus probability of traveling in first class.
> plot(Sq_age, Prob_age, xlab = "age", ylab = "Probability to travel in First Class", type = "l")
How good is the model?
A model is considered to be good when the proportions of
correctly predicted successes (1) and failures (0) are high.
We can use the ROC (receiver operating characteristic) curve to
assess how well the model predicts 1s and 0s.
The y-axis of the ROC curve is the probability of correctly predicting
a 1 (sensitivity).
The x-axis of the ROC curve is based on the probability of correctly
predicting a 0 (specificity); it is conventionally drawn as
1 − specificity, while pROC plots specificity on a reversed axis.
The model predicts 1s and 0s well if the curve lies far above
the diagonal.
The area under the curve (AUC) will be 100% if the model
predicts everything correctly; an AUC of 50% corresponds to random guessing.
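These quantities can be computed by hand in base R. The sketch below uses a stand-in logistic model on mtcars and a cut-off of 0.5, both assumptions, and obtains the AUC from the rank (Mann-Whitney) formula rather than from pROC.

```r
lfit  <- glm(am ~ wt, data = mtcars, family = binomial)
p_hat <- fitted(lfit)      # predicted probabilities of am = 1
y     <- mtcars$am         # observed 0/1 outcome

# Classify as 1 when the predicted probability is at least 0.5
pred <- as.integer(p_hat >= 0.5)
sensitivity <- sum(pred == 1 & y == 1) / sum(y == 1)  # correctly predicted 1s
specificity <- sum(pred == 0 & y == 0) / sum(y == 0)  # correctly predicted 0s

# AUC: probability that a random 1 gets a higher predicted value than a random 0
r  <- rank(p_hat)
n1 <- sum(y == 1)
n0 <- sum(y == 0)
auc_manual <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

c(sensitivity = sensitivity, specificity = specificity, auc = auc_manual)
```

Each point of the ROC curve corresponds to one cut-off; varying the 0.5 above traces out the whole curve.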
> install.packages("pROC")
> library(pROC)
> Prob_trav <- predict(Lfit, type = "response")
> Titanic$Prob_trav <- unlist(Prob_trav)
> ROC <- roc(Titanic$class, Titanic$Prob_trav)
> auc(ROC)
> plot(ROC, print.auc = TRUE, col = "blue")
Logistic Regression — Multinomial
Multinomial Logistic Regression (MLR) is conducted when the
outcome variable is nominal with more than two levels.
Example. In an election, voters may choose any one of Party A,
Party B or Party C. Their choice might be affected by the
party’s economic policy, foreign policy, educational levels of
candidates, etc.
In MLR, the log odds of the outcomes are modeled as a linear
combination of the predictor variables.
MLR can be viewed as fitting a separate binary logistic regression
model for each non-baseline category of the outcome, comparing it
with the baseline category.
If the outcome variable has M levels, then we will fit M − 1 binary
logistic regression models.
Each model has its own intercept and regression coefficients: the
predictors can affect each category differently.
Consider the data set Highschool.csv.
▶ Outcome variable: Program_Type = {academic, general,
vocational}
▶ Predictors: Writing_Score, Math_Score, Sch_Type, Sex, Ses.
Import the data to R:
> Mlog <- read.csv("C:/Users/User/OneDrive/Desktop/QuantDataAna/Data/Highschool.csv", header = TRUE)
Suppose we choose academic as the baseline category for the
outcome variable.
> library(foreign)
> Mlog$Program_Type <- relevel(factor(Mlog$Program_Type), ref = "academic")
> library(nnet)
> Mlogit <- multinom(Program_Type ~ Writing_Score + Math_Score + Sch_Type + Sex + Ses, data = Mlog)
> summary(Mlogit)
Let us now calculate z-scores and p-values.
> z <- summary(Mlogit)$coefficients / summary(Mlogit)$standard.errors
> z
Note that the coefficients in the row "general"
compare Program_Type = "general" with
Program_Type = "academic".
> p <- (1 - pnorm(abs(z), 0, 1)) * 2
> p
Only coefficients with p-values < 0.05 are statistically significant
at the 5% level, and interpretations are usually given only for these.
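Since Highschool.csv is a local file, the whole sequence can be tried on built-in data. The sketch below is a stand-in example, not the slides' analysis: it models Species in the iris data from a single predictor, with setosa (the first factor level) as the baseline category.

```r
library(nnet)   # provides multinom()

# Outcome with three levels; setosa is the baseline (first level)
mfit <- multinom(Species ~ Sepal.Length, data = iris, trace = FALSE)
s <- summary(mfit)

# Wald z-scores and two-sided p-values, one row per non-baseline category
z <- s$coefficients / s$standard.errors
p <- 2 * (1 - pnorm(abs(z)))
z
round(p, 4)

# Exponentiated coefficients: relative odds versus the baseline category
exp(coef(mfit))
```

With M = 3 outcome levels there are M − 1 = 2 rows of coefficients, one for versicolor and one for virginica, each compared with setosa.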
Lecture1_R.pptLecture1_R.ppt
Lecture1_R.ppt
 
Lecture1 r
Lecture1 rLecture1 r
Lecture1 r
 
Modeling in R Programming Language for Beginers.ppt
Modeling in R Programming Language for Beginers.pptModeling in R Programming Language for Beginers.ppt
Modeling in R Programming Language for Beginers.ppt
 
Data Science - Part II - Working with R & R studio
Data Science - Part II -  Working with R & R studioData Science - Part II -  Working with R & R studio
Data Science - Part II - Working with R & R studio
 
Best corporate-r-programming-training-in-mumbai
Best corporate-r-programming-training-in-mumbaiBest corporate-r-programming-training-in-mumbai
Best corporate-r-programming-training-in-mumbai
 
1 Installing & getting started with R
1 Installing & getting started with R1 Installing & getting started with R
1 Installing & getting started with R
 
Tableau integration with R
Tableau integration with RTableau integration with R
Tableau integration with R
 
1 installing & Getting Started with R
1 installing & Getting Started with R1 installing & Getting Started with R
1 installing & Getting Started with R
 
Basics of R
Basics of RBasics of R
Basics of R
 
2014 Taverna Tutorial R script
2014 Taverna Tutorial R script2014 Taverna Tutorial R script
2014 Taverna Tutorial R script
 
Poly_introduction_R.pdf
Poly_introduction_R.pdfPoly_introduction_R.pdf
Poly_introduction_R.pdf
 
R-Language-Lab-Manual-lab-1.pdf
R-Language-Lab-Manual-lab-1.pdfR-Language-Lab-Manual-lab-1.pdf
R-Language-Lab-Manual-lab-1.pdf
 
R-Language-Lab-Manual-lab-1.pdf
R-Language-Lab-Manual-lab-1.pdfR-Language-Lab-Manual-lab-1.pdf
R-Language-Lab-Manual-lab-1.pdf
 
R-Language-Lab-Manual-lab-1.pdf
R-Language-Lab-Manual-lab-1.pdfR-Language-Lab-Manual-lab-1.pdf
R-Language-Lab-Manual-lab-1.pdf
 
Reproducible Research in R and R Studio
Reproducible Research in R and R StudioReproducible Research in R and R Studio
Reproducible Research in R and R Studio
 
Talend Open Studio for Big Data | Talend Open Studio Tutorial | Talend Online...
Talend Open Studio for Big Data | Talend Open Studio Tutorial | Talend Online...Talend Open Studio for Big Data | Talend Open Studio Tutorial | Talend Online...
Talend Open Studio for Big Data | Talend Open Studio Tutorial | Talend Online...
 
R Introduction
R IntroductionR Introduction
R Introduction
 

Recently uploaded

PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 

Recently uploaded (20)

PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 

Quantitative Data Analysis using R Guide

  • 1. Quantitative Data Analysis using R. By Taddesse Kassahun, Department of Statistics, Addis Ababa University.
  • 2. Introduction
  • 3. What are quantitative data?
    Quantitative data refer to a set of values of observations that can be counted or measured.
    They answer questions of the following form:
    ▶ How many?
    ▶ How often?
    ▶ How much?
    Quantitative data are mainly collected for statistical analysis.
    Some examples of quantitative data:
    ▶ Number of times students in a college updated their phones in a quarter.
    ▶ Percentage increase in revenue of wholesalers with the inclusion of a new product.
    ▶ Price of Teff/kg in different kebeles of a region.
  • 4. Analysis methods of quantitative data
    Quantitative data can be analyzed by using descriptive and inferential methods.
    Descriptive analysis can be performed using tables, graphs, and summary measures.
    Tables can be classified into two broad classes: simple and complex tables.
    Commonly used graphs: bar, pie, histogram, boxplot, and line.
    Some widely used summary measures include mean, median, mode, frequency, minimum, maximum, total, standard deviation, range, and percent.
    Inferential methods involve estimation, model fitting, and hypothesis testing.
  • 5. Steps in inferential analysis
  • 6. Why R for data management and analysis?
  • 7. Where to find R?
    R is freely available from the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org.
    Once R is installed, go to https://www.rstudio.com/products/rstudio/download/ to install R Studio.
  • 8. R studio layout
    The location of each pane can be customized by clicking Tools ⇒ Global Options ⇒ Pane Layout.
  • 9. R studio layout - console
    All the output generated by R, except plots, goes to the console.
    R evaluates all code in the console.
    Function calls can be entered into the console to produce output. For example, try the following:
    > print("Hello IFA") gives [1] "Hello IFA"
  • 10. R studio layout - console
    We can enter commands one at a time at the command prompt (>).
    We may also use R as a calculator. Try the following:
    > 5+4 gives [1] 9
    > 5-4 gives [1] 1
    > 5*4 gives [1] 20
    > 5/4 gives [1] 1.25
    It is not recommended to enter longer pieces of code into the console. Instead, use the script editor (source) window.
    Create an R script by clicking File ⇒ New File ⇒ R Script.
    In general, execute code from saved R scripts for the sake of reproducibility.
  • 11. R studio layout - source
    R scripts are written in the source pane, where we can run a set of commands at a time.
    A command in the source pane can be executed by selecting the line and pressing Ctrl + Enter or clicking the Run icon.
  • 12. R studio layout - source
    The corresponding result appears in the console window.
    Comments can be added after the hash symbol #.
    > print("Hello IFA") # this function prints Hello IFA.
    To comment out multiple lines of code, highlight them and press Ctrl + Shift + C.
    Save an R script by clicking the save shortcut or pressing Ctrl + S and giving it a file name, e.g., "test.R".
    A time stamp can be included in an R script by writing ts and then pressing Shift + Tab.
    To open an R script, click File ⇒ Open File ...
    KEEP ALL THE CODES YOU USE IN R SCRIPTS!
  • 13. Environment/history/connection/tutorial
    Understanding environments is not necessary for day-to-day use of R.
    In the present training, we will only work in the Global Environment.
    All the objects assigned in R are stored and displayed in the Global Environment.
    We can store an output in an object of arbitrary name, say x, using the assignment operator (<-), i.e., > x <- followed by the expression to store.
    To display x, type x in the script and execute it, or type x into the console and press Enter.
    If we write x <- 5+5 in the console, x will appear in the Environment pane under the Values section (see figure below).
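The assignment step described on this slide can be tried with a minimal sketch like the one below. It only uses the slide's own example (x <- 5+5); the exists() check is an illustrative addition.

```r
# Store the result of an expression under the name x
x <- 5 + 5

# Typing the name alone prints the stored value
x

# The object is now visible in the current environment
exists("x")
```

After running these lines in RStudio, x should also be listed in the Environment pane.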
  • 14. Environment/history/connection/tutorial
    It's important to always start with a clean environment when working on an R project.
    To clean the working environment, select the "Restart R" option from the "Session" menu at the top of the window or press Ctrl + Shift + F10.
  • 15. Environment/history/connection/tutorial
    The History tab contains a list of commands entered into the R console.
    Previously used commands can be searched through the History tab.
    The Connections tab allows us to connect to various data sources such as external databases.
    The Tutorials tab is used to run tutorials for R studio.
  • 16. Files/plots/packages/help/viewer
    The Files tab lists external files in the current working directory.
    We can open, copy, rename, move and delete files listed in the window.
    The Plots tab is where all the plots we create in R are displayed.
    Plots can be exported to an external file using the Export drop-down menu.
    The Packages tab lists all of the packages that are installed on the computer.
    We can install new packages and update existing ones by clicking on the Install and Update buttons.
    Help from the R documentation is shown in the Help tab.
    The Viewer tab displays local web content such as web graphics generated by some packages.
  • 17. Projects in R
    RStudio Projects help to keep things organised.
    An RStudio Project keeps all of our R scripts, R functions and data together in one place.
    To create a project, open RStudio and select File ⇒ New Project... from the menu.
  • 18. Projects in R
    In the next window select New Project.
    Enter the name of the directory we want to create in the Directory name: field.
    Click the Browse... button and navigate to change the location of the directory.
  • 19. Projects in R
    Tick the Open in new session box and then hit the Create Project button.
  • 20. R workspace
    The working directory is the default location where R looks for files to load and where it saves any files.
    The file path of the current working directory is shown at the top of the Console pane and can also be obtained with the command
    > getwd()
  • 21. R workspace
    A directory structure can be created by clicking on the New Folder button in the Files pane.
    > ls() # lists the objects in the current workspace
    > rm(objectlist) # removes (deletes) one or more objects
    > rm(list=c("Object1", "Object2", "Object3"))
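The workspace commands above can be exercised as in the following sketch. The object names Object1 to Object3 come from the slide; the values assigned to them are placeholders chosen only for illustration.

```r
# Create three placeholder objects
Object1 <- 1:3
Object2 <- "text"
Object3 <- TRUE

# List the names of the objects in the current workspace
ls()

# Remove a single object by name
rm(Object1)

# Remove several objects by passing their names as a character vector
rm(list = c("Object2", "Object3"))

# Check that an object is gone
exists("Object2")
```

Note that rm(list = ls()) would remove every object in the workspace, so it should be used with care.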
  • 22. Basics
    R is case-sensitive: r is different from R.
    An object in R is anything that can be assigned a value, e.g.,
    > X = 5
    > Name = "Taddesse"
    Objects can also be the output of a plot, a summary of a statistical analysis, or a set of R commands that perform a specific task.
    Commands in R can be separated by either a new line or a semicolon (;).
    If a continuation prompt + appears in the console after code is executed, the code was incomplete (for example, a closing parenthesis or quote is missing).
    R statements consist of functions and assignments.
    In general, R is tolerant of extra spaces inserted into code.
  • 23. Basics
    Do not put a space inside the assignment operator (<-).
    If the console hangs and becomes unresponsive after running a command, press the escape key (Esc) on the keyboard.
    We can also click on the stop icon in the top right of the console to terminate most current operations.
    To save an object to an .RData file use save(nameOfObject, file = "file_name.RData")
    To save all objects in a workspace into a single .RData file use save.image(file = "file_name.RData")
    To load an .RData file back into RStudio use load(file = "file_name.RData")
    We can end the R session that we are working on by typing and running the command q().
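A save and load round trip might look like the sketch below. The object name scores and the file name scores.RData are hypothetical, and the file is written to a temporary directory so that nothing in the working directory is touched.

```r
# Hypothetical object to save
scores <- c(12, 15, 9)

# Build a file path inside the session's temporary directory
rdata_file <- file.path(tempdir(), "scores.RData")

# Save one object, remove it, then load it back under its original name
save(scores, file = rdata_file)
rm(scores)
load(file = rdata_file)
scores
```

load() restores objects under the names they had when saved, so it can silently overwrite an existing object with the same name.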
  • 24. Packages
    Packages are collections of R functions, data, and compiled code.
    The base installation of R comes with many packages as standard.
    The directory where packages are stored on your computer is called the library.
    Standard set of packages in R: base, datasets, utils, grDevices, graphics, stats, and methods.
    Use the install.packages() command to install a package for the first time.
    Example. The package car can be installed by writing install.packages("car") into the Console window of RStudio.
    We need a working internet connection to do this.
  • 25. Packages
    We may be asked to select a CRAN mirror; we can select 0-Cloud or a mirror near our location.
    We may include the dependencies = TRUE argument to install additional required packages.
    Packages can be updated using the command update.packages().
    The ask = FALSE argument updates all installed packages without prompting for each one.
    To use a package in an R session, we first need to load it using > library(pkg_name).
    We need to load the packages we will be using every time we start a new R session.
    > help(package="package_name") provides a brief description of the package.
  • 26. Help in R
    > help.start() # general help
  • 27. Help in R
    > help("mean") or ?mean # help on the function mean
    > help.search("linear") or ??linear # searches the help system for the string linear
    > example("anova") # examples of the function anova
  • 28. Help in R
    > apropos("mean", mode="function") # lists all functions with mean in their name
    Programming help: StackOverflow (https://stackoverflow.com/) is a Q & A website focused on programming in all languages.
  • 29. Exercise 1
  • 30. Data structures, input & output
  • 31. Introduction
    The base data structures in R can be organized by their dimensionality (1d, 2d, or nd).
    In R, the ID (case identifier) is treated as row names.
    A data structure can be either homogeneous (all contents of the same type) or heterogeneous (contents of different types).
    The FIVE most often used structures:
    We can use str() to understand what data structures an object is composed of.
  • 32. Notation and naming
    Generally, variable names should be nouns and function names should be verbs.
    Use an underscore (_) to separate words within a name.
    Avoid using names of existing functions and variables.
    Some examples of BAD variable names:
    > F <- FALSE
    > c <- 10
    Place spaces around all infix operators (=, +, -, <, >, etc.).
    Always put a space after a comma.
    :, :: and ::: do not need spaces around them.
    Place a space before left parentheses, except in a function call.
  • 33. Vectors
    Vectors are one-dimensional arrays that can hold numeric, character, or logical data.
    The vector is the basic data structure in R.
    Scalars are treated as vectors with only one element.
    is.atomic(x) tests if an object x is an atomic vector.
    There are four common types of atomic vectors: logical, integer, double (numeric), and character.
    Atomic vectors are usually created with c(), short for concatenate.
    Example
    ▶ Logical <- c(TRUE, FALSE)
    ▶ Integer <- c(1L, 3L, 6L, 9L) (the L suffix requests integer storage; without it the values are stored as double)
    ▶ Double <- c(1, 2.4, 3.7)
    ▶ Character <- c("He", "is", "a", "statistician")
  • 34. Vectors ...
    Data in a vector must be of only one type or mode: numeric, character, or logical.
    We can check the data type (mode) of a vector using the command mode()
    ▶ > mode(Integer) gives "numeric"
    We use square brackets [ ] to extract elements of a vector.
    Example
    ▶ > Double[2] gives 2.4
    The colon operator (:) generates a sequence of consecutive integers, which can be used to extract several elements at once.
    Example
    ▶ > Character[2:4] gives "is" "a" "statistician"
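The vector examples from the two slides above can be run together as in this sketch. The vector names are the ones used on the slides; the typeof() calls and the last line on mixed input are illustrative additions.

```r
Logical   <- c(TRUE, FALSE)
Integer   <- c(1L, 3L, 6L, 9L)   # L suffix gives integer storage
Double    <- c(1, 2.4, 3.7)
Character <- c("He", "is", "a", "statistician")

typeof(Integer)     # "integer"
typeof(Double)      # "double"
mode(Integer)       # "numeric": mode() groups integer and double together

Double[2]           # second element: 2.4
Character[2:4]      # elements 2 to 4: "is" "a" "statistician"

# Mixing types in c() coerces everything to a single type
typeof(c(1, "a"))   # "character"
```

typeof() reports the storage type, whereas mode() reports the broader category, which is why both integer and double vectors have mode "numeric".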
  • 35. Vectors
    A summary of logical test and coercion functions
    > X <- 5.8
    > class(X) gives [1] "numeric"
    > is.numeric(X) gives [1] TRUE
    > is.character(X) gives [1] FALSE
    > X1 <- as.character(X)
    > class(X1) gives [1] "character"
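A few more test and coercion calls, sketched under the assumption that the is.*() and as.*() families are what the slide's summary refers to. The inputs other than X are illustrative.

```r
X  <- 5.8
X1 <- as.character(X)     # numeric to character
is.character(X1)          # TRUE
as.numeric(X1)            # back to numeric: 5.8

as.integer(X)             # truncates toward zero: 5

# Text that is not a number becomes NA (with a warning, suppressed here)
bad <- suppressWarnings(as.numeric("abc"))
is.na(bad)                # TRUE

# Only a fixed set of strings is recognised as logical
as.logical(c("TRUE", "yes"))   # TRUE NA
```

Coercion that cannot succeed produces NA rather than an error, so it is worth checking the result with is.na() afterwards.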
  • 36. Factors
    Factors are a specific type of vector used to store values that come from a pre-specified set of categories.
    The function factor() stores categorical values as a vector of integers in the range 1, . . . , k, where k is the number of unique categories.
    > x <- c("A", "B", "B", "A")
    > xf <- factor(x) stores this vector as (1, 2, 2, 1); this can be checked by
    > as.numeric(xf)
    The variable xf is now treated as nominal.
    For ordinal variables, we add the parameter ordered=TRUE to the factor() function.
    > y <- c("Good", "Very good", "Excellent", "Good")
    > factor(y, ordered=TRUE) is encoded as (2,3,1,2)
  • 37. Factors
    The factor created from y is treated as ordinal.
    By default, factor levels for character vectors are created in alphabetical order.
    We can override the default by specifying a levels option.
    levels(x) returns the set of allowed values (levels) of a factor x.
    > factor(y, ordered=TRUE, levels=c("Good", "Very good", "Excellent")) is encoded as (1,2,3,1)
    Numeric variables can be coded as factors using the levels and labels options. The levels must match the values that actually occur in the data (here 0 and 1); values not listed in levels become NA.
    > Sex <- c(0,1,1,1,0,1,1,0,0)
    > factor(Sex, levels=c(0, 1), labels=c("Male", "Female"))
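The factor encodings above can be checked with the following sketch. The vectors y and Sex are taken from the slides; treating 0 as "Male" and 1 as "Female" is an assumption made only to match the order of the labels, and the object names f_default, f_ranked and sex_f are illustrative.

```r
y <- c("Good", "Very good", "Excellent", "Good")

# Default: levels are sorted alphabetically
f_default <- factor(y, ordered = TRUE)
levels(f_default)          # "Excellent" "Good" "Very good"
as.integer(f_default)      # 2 3 1 2

# Explicit levels give the intended ranking
f_ranked <- factor(y, ordered = TRUE,
                   levels = c("Good", "Very good", "Excellent"))
as.integer(f_ranked)       # 1 2 3 1
f_ranked[1] < f_ranked[3]  # TRUE: "Good" ranks below "Excellent"

# Numeric codes turned into labelled categories (0 = Male is assumed)
Sex   <- c(0, 1, 1, 1, 0, 1, 1, 0, 0)
sex_f <- factor(Sex, levels = c(0, 1), labels = c("Male", "Female"))
table(sex_f)
```

Comparisons such as < are only meaningful for ordered factors; on an unordered factor they return NA with a warning.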
  • 38. List
    Lists are the most complex of the R data types.
    A list may contain a combination of vectors, matrices, data frames, and even other lists.
    Use list() to create lists.
    > x <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9))
    > str(x) gives
    List of 4
     $ : int [1:3] 1 2 3
     $ : chr "a"
     $ : logi [1:3] TRUE FALSE TRUE
     $ : num [1:2] 2.3 5.9
    Note that many R functions return lists as their results.
    We can turn a list into an atomic vector with unlist().
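As a small sketch of working with the list x defined above, the code below contrasts the two bracket forms and shows what unlist() does to mixed contents. The name flat is illustrative.

```r
x <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9))

x[[1]]          # the first element itself: 1 2 3
class(x[1])     # single brackets keep the list wrapper: "list"

# unlist() flattens to one atomic vector, coercing to a common type
flat <- unlist(x)
typeof(flat)    # "character"
length(flat)    # 9
```

Because the list contains a character element, every value in the flattened vector becomes a character string.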
  • 39. Data frames
    A data frame is a list of equal-length vectors.
    It shares properties of a matrix.
    It is more general than a matrix as it can contain different types of data.
    Each row corresponds to an individual observation and each column corresponds to a different measured or recorded variable.
    We create a data frame using data.frame(), which takes named vectors as input.
    > mydata <- data.frame(col1, col2, col3,...)
    > DF <- data.frame(x = 1:3, y = c("A", "B", "C"))
    We can combine data frames using cbind() and rbind().
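Combining data frames with cbind() and rbind() could look like the sketch below. DF is the data frame from the slide; extra_cols, extra_rows, wide and long are hypothetical names with made-up values.

```r
DF <- data.frame(x = 1:3, y = c("A", "B", "C"))

# Hypothetical pieces to combine with DF
extra_cols <- data.frame(z = c(10.5, 20.1, 30.7))   # same number of rows
extra_rows <- data.frame(x = 4:5, y = c("D", "E"))  # same column names

wide <- cbind(DF, extra_cols)   # adds columns side by side
long <- rbind(DF, extra_rows)   # stacks rows

dim(wide)   # 3 rows, 3 columns
dim(long)   # 5 rows, 2 columns
```

cbind() expects the same number of rows in each piece, while rbind() expects matching column names.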
  • 40. Data structures, input & output: Data frames
    The dimensions of a data frame can be determined with the dim() function. To access columns of a data frame we can use any of the following:
    1 A square bracket and column number: DF[2] returns a one-column data frame.
    2 A square bracket and column name: DF["y"]
    3 The dollar sign: DF$y gives [1] "A" "B" "C"
    Typing DF$VarName is somewhat tiresome for a data frame with several variables. We can use the attach() function as a shortcut.
  • 41. Data structures, input & output: Data frames
    The attach() function adds the data frame to the R search path. Example:
    > attach(ChickWeight); plot(weight, Diet)
    > weight <- c(23, 25, 20); attach(ChickWeight)
    gives the message: The following object is masked _by_ .GlobalEnv: weight
    > plot(weight, Diet)
    gives an error. Why? (Hint: the global vector weight, of length 3, is found before the column ChickWeight$weight, so the two arguments differ in length.)
    We can use the detach() function to remove the data frame from the search path.
    The operator $ can also be used to create a new column.
    > DF$Sex <- c("M", "F", "M")
    R automatically assigns the character class to the variable "y" above. We can include the stringsAsFactors = TRUE argument in data.frame() to convert "y" to a factor.
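One common way to avoid the masking problem is with(), which evaluates an expression inside the data frame without changing the search path. This is a sketch of that alternative, not part of the original slides; mean_w and n_w are illustrative names.

```r
# A global variable that shares its name with a column of ChickWeight
weight <- c(23, 25, 20)

# with() looks up names in the data frame first, so the column
# ChickWeight$weight is used rather than the global vector
mean_w <- with(ChickWeight, mean(weight))
n_w    <- with(ChickWeight, length(weight))

n_w             # number of rows in ChickWeight, not 3
length(weight)  # the global vector is untouched: 3
```

Because nothing is attached, there is also nothing to detach afterwards.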
  • 42. Data structures, input & output: Data frames
    Values stored in a data frame can be retrieved with square brackets [i, j]: i for the row and j for the column.
    > DF[3, 2] gives "C"
    > DF$y[3] gives "C"
    > DF[, 1] gives 1 2 3
    > DF[1, ] gives 1 A
    The function is.data.frame(object) checks whether an object is of class data frame.
    > nrow(DF) # number of rows of DF
    > ncol(DF) # number of columns of DF
    > colnames(DF) # column names of DF
    > rownames(DF) # row names of DF
  • 43. Data structures, input & output: Missing values
    Missing values in R are marked by NA. We can use the is.na() function to check whether a value of an R object is missing. It returns TRUE if a value is missing.
    > myvec <- c(2, 5, NA, 8)
    > is.na(myvec) gives [1] FALSE FALSE TRUE FALSE
    The sum() function can be used to count the NA values in a vector.
    > sum(is.na(myvec)) gives [1] 1
    The anyNA() function can be used to check for any missing values in a data frame.
    > anyNA(DF) gives [1] FALSE
    Let us consider the following data frame and check for missing values.
  • 44. Data structures, input & output: Missing values
    To identify which observation(s) contain missing values, we can use
    > mydf[!complete.cases(mydf), ]
    which gives
      Ad Car Mi
    3  1  NA 13
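The data frame mydf appears on the slides only as an image, so the sketch below builds a hypothetical stand-in whose third row matches the output shown (Ad = 1, Car = NA, Mi = 13); all other values are invented for illustration.

```r
# Hypothetical stand-in for the slide's mydf; only row 3 is taken
# from the slide output, the remaining values are made up
mydf <- data.frame(Ad  = c(3, 2, 1, 4),
                   Car = c(10, 12, NA, 9),
                   Mi  = c(11, 15, 13, 14))

anyNA(mydf)            # TRUE: at least one value is missing
colSums(is.na(mydf))   # number of NAs per column

# Rows with at least one missing value
incomplete <- mydf[!complete.cases(mydf), ]

# Rows with no missing values
complete <- na.omit(mydf)
```

na.omit() is a convenient shorthand for mydf[complete.cases(mydf), ] when incomplete rows should simply be dropped.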
  • 45. Data structures, input & output: Exercise 2
  • 46. Data structures, input & output: Data input
    R can import data from a variety of sources.
    Entering data from the keyboard: we can use the edit() function to enter data manually.
    > mydata <- data.frame(age = numeric(0), gender = character(0), weight = numeric(0))
    > mydata <- edit(mydata)
    A shortcut for mydata <- edit(mydata) is fix(mydata).
  • 47. Data structures, input & output: Data input
  • 48. Data structures, input & output: Data input
    We can embed data directly in the code:
    > mydatatxt <- "
      age gender weight
      25 m 166
      30 f 115
      "
    mydata <- read.table(header = TRUE, text = mydatatxt)
    Importing data from a delimited text file
    Syntax: mydataframe <- read.table(file, options)
    Some of the options: header, sep, row.names, col.names, na.strings, quote, skip, dec.
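To illustrate one of the listed options, the sketch below extends the embedded data with a third, invented record whose weight is recorded as "." and uses na.strings to turn that code into NA. The extra record and the choice of "." as the missing-value code are assumptions for this example.

```r
# Embedded data; the third record is invented to show na.strings
mydatatxt <- "
age gender weight
25 m 166
30 f 115
41 m .
"

# header = TRUE: first non-blank line holds the variable names
# na.strings = ".": treat a lone dot as a missing value
mydata <- read.table(text = mydatatxt, header = TRUE, na.strings = ".")

str(mydata)            # inspect the structure after importing
is.na(mydata$weight)   # FALSE FALSE TRUE
```

Without na.strings = ".", the weight column would be read as text rather than as numbers.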
  • 49. Data structures, input & output: Data input
    The above data can be imported into a data frame using the following code:
    grades <- read.table(file = "studentgrades.csv", header = TRUE, row.names = "StudentID", sep = ",",
                         colClasses = c("character", "character", "character", "numeric", "numeric", "numeric"))
    Importing data from Excel
    Save the file in Tab-delimited Text (.txt) form: File ⇒ Save as ... ⇒ select Text (Tab delimited) under Save as type.
  • 50. Data structures, input & output: Data input
    Once the Excel file is saved in tab-delimited format we can use the read.table() function. We may also save Excel files in comma-separated values (.csv) format and use the function read.csv() to input the data.
    Excel files can be imported directly into R using the xlsx package. Syntax to import the first worksheet (1) from the workbook:
    library(xlsx)
    workbook <- "c:/myworkbook.xlsx"
    mydataframe <- read.xlsx(workbook, 1)
    For large worksheets (say, 100,000+ cells), we can use read.xlsx2().
    In general, it is recommended to save data as tab- or comma-delimited files and import them into R using read.table().
  • 51. Data structures, input & output: Data input
    Importing data from SPSS
    SPSS datasets can be imported into R via the read.spss() function in the foreign package. We can also use the spss.get() function in the Hmisc package.
    library(Hmisc)
    mydataframe <- spss.get("mydata.sav", use.value.labels = TRUE)
    Importing data from Stata
    We can employ the following syntax.
    library(foreign)
    mydataframe <- read.dta("mydata.dta")
  • 52. Data structures, input & output: Data input
    We can also use the Import Dataset shortcut in the RStudio Environment pane to input data.
  • 53. Data structures, input & output: Data input
    Error messages are common when one first starts importing data, usually for one or more of the following reasons.
    1 Misspelling the file name or file path.
    2 Forgetting to include the file extension (.txt, .csv, etc.) in the file name.
    3 Using an incorrect file path.
    4 Forgetting to include the header = TRUE argument when the first row of the file contains variable names.
    Always use the function str() after importing the data to see the structure of the data set.
  • 54. Data structures, input & output: Annotating datasets
    Annotating includes adding descriptive labels to variable names and value labels to the codes for categorical variables. Consider the following data frame:
    Age Sex
    15  2
    17  2
    16  1
    The following code can be used to rename "Age" as "Age at first marriage".
    > names(Mydata)[1] <- "Age at first marriage"
    The following code creates value labels for the variable "Sex".
    Mydata$Sex <- factor(Mydata$Sex, levels = c(1, 2), labels = c("male", "female"))
  • 55. Data structures, input & output: Data output
    The main function to export data frames is write.table().
    > write.table(DF, file = "filepath/filename.txt", col.names = TRUE, row.names = FALSE, sep = "\t")
    DF is the data frame we want to export; filename.txt is the file name for the exported data frame; col.names = TRUE indicates that the variable names should be written in the first row of the file; row.names = FALSE stops R from including the row names in the first column of the file.
    The exported file can be opened in any text editor, as it is saved in tab-delimited text form.
    We can also export a data frame to csv format by setting sep = "," in the write.table() function.
  • 56. Data structures, input & output: Data output
    The function write.csv() can also be used directly:
    write.csv(DF, "filepath/mydf.csv", row.names = FALSE)
    library(xlsx)
    write.xlsx(df, "filepath/mydf.xlsx") # To export a data frame "df" to "mydf.xlsx"
    We first need to install the haven package to export data to SPSS and Stata.
    library(haven)
    write_sav(df, "filepath/mydf.sav") # To SPSS
    write_dta(df, "filepath/mydf.dta") # To Stata
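A quick way to confirm that an export worked is to read the file back and compare it with the original. The sketch below does this for the csv case using only base R; the temporary file path is an assumption made so the example can run anywhere.

```r
DF <- data.frame(x = 1:3, y = c("A", "B", "C"))

# Temporary file used only for this demonstration
out_file <- tempfile(fileext = ".csv")

# Export without row names, then import again
write.csv(DF, out_file, row.names = FALSE)
DF_back <- read.csv(out_file)

# TRUE when the round trip preserved names, types, and values
all.equal(DF, DF_back)
```

If row.names = FALSE were omitted, the re-imported data frame would gain an extra first column holding the row names.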
  • 57. Data structures, input & output: Exercise 3
  • 58. Data structures, input & output: Data wrangling
  • 59. Data wrangling: Introduction
    Data wrangling is the most important activity before data analysis begins. It includes:
    ▶ creating new variables,
    ▶ extracting part of a data frame,
    ▶ recoding existing variables,
    ▶ sorting and merging data,
    ▶ selecting and dropping variables,
    ▶ working with dates, etc.
    The following code creates a new variable, SUM, in an existing data frame "mydata".
    mydata <- data.frame(x1 = c(2, 2, 6, 4), x2 = c(3, 4, 2, 8))
    mydata$SUM <- mydata$x1 + mydata$x2
    mydata
  • 60. Data wrangling: Extracting values
    Consider the built-in data frame USArrests and check its structure.
    Let us extract the third value of the Assault variable.
    USArrests[3, 2] gives [1] 294
    USArrests$Assault[3] gives the same result.
    The data in the first 10 rows and 3 columns of USArrests can be extracted by:
    USArrests[1:10, 1:3]
  • 61. Data wrangling: Extracting values
    We use negative positional indexes to exclude certain rows and columns from a data frame. Let us extract all rows except the first 40 and all columns except the 2nd and the 4th in the USArrests dataset.
    USArrests[-(1:40), -c(2, 4)]
    Values can also be extracted based on a logical test.
  • 62. Data wrangling: Extracting values
    Let us extract all rows where the value of Assault is 200 or more, and all the columns, in the USArrests dataset.
    USArrests[USArrests$Assault >= 200, ]
    We can use the following to extract all rows where the value of UrbanPop is 80, and all the columns.
    USArrests[USArrests$UrbanPop == 80, ]
    Boolean expressions can be used to extract values based on a combination of logical tests. To extract rows based on an AND Boolean expression we can use the & symbol. Let us extract all columns and the rows where Assault > 250 and UrbanPop > 70.
    USArrests[USArrests$Assault > 250 & USArrests$UrbanPop > 70, ]
  • 63. Data wrangling: Extracting values
    To extract rows based on an OR Boolean expression we can use the | symbol. Extract values where UrbanPop is greater than 50 OR less than 60.
    USArrests[USArrests$UrbanPop > 50 | USArrests$UrbanPop < 60, ]
    Alternatively, one can use the subset() function to select parts of a data frame.
    subset(USArrests, UrbanPop > 50 & UrbanPop < 60)
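Bracket indexing and subset() select the same rows when given the same condition. The sketch below checks this with an AND condition; the thresholds (Assault of 200 or more, UrbanPop above 70) are chosen here purely for illustration.

```r
# Rows where both conditions hold, using bracket indexing
high <- USArrests[USArrests$Assault >= 200 & USArrests$UrbanPop > 70, ]

# The same selection with subset(); column names are used directly
same <- subset(USArrests, Assault >= 200 & UrbanPop > 70)

# Both approaches return the same rows
all.equal(high, same)

# An OR condition keeps rows where at least one test is TRUE
either <- USArrests[USArrests$Assault >= 200 | USArrests$UrbanPop > 70, ]
nrow(either) >= nrow(high)   # TRUE: OR can never select fewer rows than AND
```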
  • 64. Data wrangling: Ordering data frames
    We can use the order() function to sort a data frame. Let us order the USArrests data frame based on the values of Murder.
    USArrests[order(USArrests$Murder), ] # Ascending
    USArrests[order(-USArrests$Murder), ] # Descending
    or
    USArrests[order(USArrests$Murder, decreasing = TRUE), ]
    Let us order the USArrests data frame based on the values of Murder and Rape.
    USArrests[order(USArrests$Murder, USArrests$Rape), ]
  • 65. Data wrangling: Inclusion of extra rows and columns
    The function rbind() can be used to append additional rows to a data frame. Use cbind() to append columns to an existing data frame.
    DF1 <- data.frame(Id = 1:3, Age = c(20, 19, 23), Sex = c("Male", "Female", "Female"))
    DF2 <- data.frame(Id = 4:5, Age = c(27, 25), Sex = c("Male", "Male"))
  • 66. Data wrangling: Inclusion of extra rows and columns
    DF3 <- data.frame(Height = c(178, 171, 167))
  • 67. Data wrangling: Inclusion of extra rows and columns
    rbind(DF1, DF2) appends the two rows of DF2 below DF1 (both data frames must have the same column names).
    cbind(DF1, DF3) appends the Height column to DF1 (both data frames must have the same number of rows).
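The following sketch runs both operations on the three data frames defined in the preceding slides and checks the resulting dimensions; the names rows and cols are introduced here for illustration.

```r
DF1 <- data.frame(Id = 1:3, Age = c(20, 19, 23),
                  Sex = c("Male", "Female", "Female"))
DF2 <- data.frame(Id = 4:5, Age = c(27, 25),
                  Sex = c("Male", "Male"))
DF3 <- data.frame(Height = c(178, 171, 167))

# rbind() stacks rows; DF1 and DF2 share the same column names
rows <- rbind(DF1, DF2)
dim(rows)   # 5 3

# cbind() adds columns; DF1 and DF3 have the same number of rows
cols <- cbind(DF1, DF3)
dim(cols)   # 3 4
```

Note that rbind() on data frames with different column names, such as DF1 and DF3, stops with an error.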
  • 68. Data wrangling: Merging data frames
    The function merge() merges two data frames horizontally. We need at least one unique identifier that is available in both data frames.
    The following code merges data frame A and data frame B by ID.
    total <- merge(dataframeA, dataframeB, by = "ID")
    The following code merges data frame A and data frame B by ID and Region.
    total <- merge(dataframeA, dataframeB, by = c("ID", "Region"))
    Use the all = TRUE argument if we want to include all data from both data frames.
    total <- merge(dataframeA, dataframeB, all = TRUE)
  • 69. Data wrangling: Merging data frames
    Let us merge the two data frames given below.
    Total <- merge(DF1, DF2) gives
    Total1 <- merge(DF1, DF2, all = TRUE) gives
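The two data frames merged on this slide are shown only as images, so the sketch below uses two small hypothetical data frames, A and B, to show how the all argument changes the result. All values are invented for illustration.

```r
# Hypothetical data frames sharing the identifier ID
A <- data.frame(ID = c(1, 2, 3), Age = c(20, 19, 23))
B <- data.frame(ID = c(2, 3, 4), Height = c(171, 167, 180))

# Default: keep only IDs present in both data frames (2 and 3)
inner <- merge(A, B, by = "ID")

# all = TRUE: keep every ID from both; unmatched cells become NA
outer <- merge(A, B, by = "ID", all = TRUE)

# all.x = TRUE: keep every row of the first data frame only
left <- merge(A, B, by = "ID", all.x = TRUE)

nrow(inner)          # 2
nrow(outer)          # 4
sum(is.na(outer))    # 2
```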
  • 70. Data wrangling: Recoding variables
    Recoding refers to creating new values of a variable from the existing values. It may include:
    ▶ changing a continuous variable into a set of categories,
    ▶ replacing miscoded values with correct values.
    Suppose we want to recode the stopping distance of cars ("dist") in the "cars" dataset to distcat (Small, Medium, Long). We first recode the value 999 for "dist" to indicate that this value is missing:
    cars$dist[cars$dist == 999] <- NA
    cars$distcat[cars$dist < 50] <- "Small"
    cars$distcat[cars$dist >= 50 & cars$dist < 100] <- "Medium"
    cars$distcat[cars$dist >= 100] <- "Long"
  • 71. Data wrangling: Recoding variables
    We may also use the within() function as follows.
    cars <- within(cars, {
      distcat1 <- NA
      distcat1[dist < 50] <- "Small"
      distcat1[dist >= 50 & dist < 100] <- "Medium"
      distcat1[dist >= 100] <- "Long"
    })
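The base R function cut() offers a compact alternative for turning a numeric variable into categories. This is a sketch of that alternative rather than the method used on the slides; it assumes the cut-points 50 and 100, with each interval closed on the left, and the name distcat2 is introduced for illustration.

```r
# right = FALSE makes the intervals [-Inf, 50), [50, 100), [100, Inf),
# i.e. dist < 50, 50 <= dist < 100, and dist >= 100
distcat2 <- cut(cars$dist,
                breaks = c(-Inf, 50, 100, Inf),
                labels = c("Small", "Medium", "Long"),
                right  = FALSE)

# Frequency of each category
table(distcat2)
```

cut() returns a factor whose levels follow the order of the labels, which is convenient for tables and plots.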
  • 72. Data wrangling: Exercise 4
  • 73. Data wrangling: Packages for wrangling data
    Important packages to install: dplyr and tidyr. Once these packages are successfully installed, attach them to the working environment.
    install.packages("dplyr")
    install.packages("tidyr")
    library(dplyr)
    library(tidyr)
    The pipe operator %>% chains a set of procedures in sequence, passing the result of each step to the next.
    Let us determine the number of missing values in the Total1 data frame (see Slide 69).
    sum(is.na(Total1)) gives [1] 2
    This procedure can also be done using the pipe operator.
  • 74. Data wrangling: Packages for wrangling data
    Total1 %>% is.na() %>% sum()
    Key functions used for data management in the dplyr package include:
    mutate – to modify and create columns in a data frame.
    select – to select columns by name.
    filter – to select rows based on a set of logical conditions.
    Let us create a column variable time by dividing dist by speed in the cars data frame.
    mutate(cars, time = dist/speed)
  • 75. Data wrangling: Packages for wrangling data
    Let us now list the cars with speed less than 10.
    cars %>% filter(speed < 10)
    Suppose that we would like to keep only the dist of cars with speed less than 10. Then we can use:
    cars1 <- cars %>% filter(speed < 10)
    cars1 %>% select(dist)
  • 76. Data wrangling: Packages for wrangling data
    The above procedures can be performed in a single chain by using the pipe operator:
    cars %>% mutate(time = dist/speed) %>% filter(speed < 10) %>% select(dist)
  • 77. Data wrangling: Reshaping data frames
    There are two main data frame shapes: the long format (sometimes called stacked) and the wide format.
    In the long data format, one column holds the names of the variables and another column holds the corresponding values.
    In the wide data format, each column represents a variable.
  • 78. Data wrangling: Reshaping data frames
    We use the function pivot_longer() to reshape data into long format.
    pivot_longer(Wide_DF, ColNames_in_DF, names_to = " ", values_to = " ")
    Wide_DF is the name of the data frame in wide format.
    ColNames_in_DF represents the column names in the wide format (Test1, Test2, Final in the above data).
    The names_to = " " argument specifies the name of the variable that will store the names of the reformatted variables (Exam in the above data).
    The values_to = " " argument specifies the name of the variable that will store the values (Score in the above data).
  • 79. Data wrangling: Reshaping data frames
    Consider the data frame given in Slide 77 and reshape the data from wide format to long format.
    Wide <- data.frame(Stud.Id = 1:3, Test1 = c(20, 23, 18), Test2 = c(26, 28, 25), Final = c(32, 30, 34))
    library(tidyr)
    pivot_longer(Wide, c(Test1, Test2, Final), names_to = "Exam", values_to = "Score")
  • 80. Data wrangling: Reshaping data frames
    We use the function pivot_wider() to reshape data into wide format.
    pivot_wider(Long_DF, names_from = " ", values_from = " ")
    Long_DF is the name of the data frame in long format.
    The names_from = " " argument specifies the variable containing the variable names (Exam in the above data).
    The values_from = " " argument specifies the variable containing the values (Score in the above data).
    Consider the data frame given in Slide 77 and reshape the data from long format to wide format.
  • 81. Data wrangling: Reshaping data frames
    Long <- data.frame(Stud.Id = c(1, 1, 1, 2, 2, 2, 3, 3, 3), Exam = rep(c("Test1", "Test2", "Final"), 3), Score = c(20, 26, 32, 23, 28, 30, 18, 25, 34))
    library(tidyr)
    pivot_wider(Long, names_from = "Exam", values_from = "Score")
  • 82. Data wrangling: Date values
    The function as.Date() translates character strings into date values.
    Syntax: as.Date(x, "input format"). The input format can be:
    mydate <- as.Date(c("2021-01-01", "2021-03-02", "2021-03-29"))
  • 83. Data wrangling: Date values
    We can use the format(x, format = "output format") function to extract portions of dates. x is a given date value; format = "" is one or more of the formats %d, %a, %A, %m, %b, %B, %y, %Y.
    today <- Sys.Date()
    today gives the current date.
    format(today, format = "%y") gives [1] "22"
    We may also extract more than one format using:
    format(today, c("%A", "%m"))
    This extracts the unabbreviated weekday and the month number.
    It is also possible to perform arithmetic operations on dates, as follows.
  • 84. Data wrangling: Date values
    For example, how old is a person who was born on 12 October 1950?
    DoB <- as.Date("1950-10-12")
    today - DoB gives the time difference in days.
    Alternatively, the following code can be employed.
    difftime(today, DoB, units = "days") gives the time difference in days.
    To get the age in years, first install the lubridate package and then load it.
    library(lubridate)
    trunc(interval(DoB, today) / years(1)) gives the age in completed years.
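The age in completed years can also be worked out in base R without lubridate. The sketch below is one such approach, not the slide's method; it uses a fixed reference date (30 June 2022, an assumption made so the result is reproducible) instead of Sys.Date().

```r
DoB <- as.Date("1950-10-12")
ref <- as.Date("2022-06-30")   # fixed reference date, assumed for this example

# Difference in days as a plain number
age_days <- as.numeric(difftime(ref, DoB, units = "days"))

# Completed years: difference in calendar years, minus one if the
# birthday has not yet occurred in the reference year
birthday_pending <- format(ref, "%m%d") < format(DoB, "%m%d")
age_years <- as.integer(format(ref, "%Y")) -
             as.integer(format(DoB, "%Y")) -
             birthday_pending

age_years   # 71
```

Dividing age_days by 365.25 gives a close approximation, but the calendar comparison above avoids off-by-one errors around birthdays.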
  • 85. Data wrangling: Date values
    Suppose we would like to limit our analyses to observations collected between January 1, 2019 and December 31, 2020 in the "Ex" dataset.
    Ex <- data.frame(A = 1:10, B = c("2019-08-11", "2020-12-18", "2018-05-14", "2020-07-26", "2018-11-23", "2020-01-03", "2018-05-05", "2018-11-07", "2019-03-06", "2019-05-08"))
    Ex$date <- as.Date(Ex$B)
    startdate <- as.Date("2019-01-01")
    enddate <- as.Date("2020-12-31")
    newdata <- Ex[which(Ex$date >= startdate & Ex$date <= enddate), ]
  • 86. Data wrangling: Exercise 5
  • 87. Data wrangling: Descriptive analysis
  • 88. Descriptive analysis: Commonly used mathematical functions
  • 89. Descriptive analysis: Commonly used statistical functions
  • 90. Descriptive analysis: Descriptive statistics
    Load the built-in dataset swiss into the workspace. It is good practice to look at how many observations and variables are included in a dataset prior to further analysis.
    dim(swiss) gives [1] 47 6
    Following this, have a look at the structure:
    str(swiss)
    Then get basic summary statistics by using the function summary().
  • 91. Descriptive analysis: Descriptive statistics
    Missing values may be present in the dataset under analysis. We can compute summary statistics after removing them, for example:
    mean(swiss$Fertility, na.rm = TRUE)
    with(swiss, c(Mean = mean(Fertility, na.rm = TRUE), Sd = sd(Fertility, na.rm = TRUE)))
  • 92. Descriptive analysis: Descriptive statistics
    Suppose we would like to calculate summary statistics, e.g., the mean, for each level of a categorical variable. We can use the tapply() function.
    Let us calculate the mean miles per gallon (mpg) for each of the cylinder types (cyl) in the mtcars dataset.
    tapply(mtcars$mpg, mtcars$cyl, mean)
    We can also apply tapply() over more than one factor. Let us calculate the mean mpg for each combination of gear and cyl.
    tapply(mtcars$mpg, list(Cylinder = mtcars$cyl, Gear = mtcars$gear), mean)
  • 93. Descriptive analysis: Descriptive statistics
    aggregate() works in a similar way to tapply() but is more flexible: it accepts several variables at once and returns a data frame.
    Suppose we want to calculate the mean values of mpg and wt for each level of cyl.
    aggregate(mtcars[, c("mpg", "wt")], by = list(Cylinder = mtcars$cyl), FUN = mean)
  • 94. Descriptive analysis: Descriptive statistics
    We can also use the aggregate() function with the formula method.
    aggregate(mpg ~ cyl, FUN = mean, data = mtcars)
    Furthermore, we may compute summary statistics for subsets of the original data. Let us compute the mean of mpg for each level of cyl only for hp > 115.
    aggregate(mpg ~ cyl, FUN = mean, subset = hp > 115, data = mtcars)
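tapply() and the formula method of aggregate() compute the same group means and differ mainly in the shape of the result. A short sketch comparing the two on mtcars (the names means and agg are introduced for illustration):

```r
# tapply() returns a named array, one element per level of cyl
means <- tapply(mtcars$mpg, mtcars$cyl, mean)
round(means, 2)   # 4: 26.66, 6: 19.74, 8: 15.10

# aggregate() returns a data frame with one row per level of cyl
agg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
agg

# The group means agree
all.equal(as.vector(means), agg$mpg)
```

The data frame returned by aggregate() is often easier to merge with other results or to pass on to plotting functions.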
  • 95. Descriptive analysis: More descriptive statistics
    More descriptive statistics can be obtained by installing and loading different packages.
    install.packages("pastecs") # If it is not already installed
    library(pastecs)
    myvars <- c("mpg", "hp", "wt")
    stat.desc(mtcars[myvars]) # mtcars is a built-in dataset
  • 96. Descriptive analysis: More descriptive statistics
    install.packages("psych") # If it is not already installed
    library(psych)
    myvars <- c("mpg", "hp", "wt")
    describe(mtcars[myvars])
  • 97. Descriptive analysis: Descriptive statistics by group
    Descriptive statistics can be obtained for different groups by using summaryBy() in the doBy package.
    install.packages("doBy") # If it is not already installed
    library(doBy)
    summaryBy(mpg + hp + wt ~ am, data = mtcars, FUN = c(mean, sd, min, max))
  • 98. Descriptive analysis: Descriptive statistics by group
    We may also use the describeBy() function in the psych package.
    library(psych)
    myvars <- c("mpg", "hp", "wt")
    describeBy(mtcars[myvars], list(am = mtcars$am))
  • 99. Descriptive analysis: Frequency tables
    The following are functions for creating tables.
  • 100. Descriptive analysis: Tables
    Categorical variables are usually described by frequency tables. Let us tabulate the values of gear in the mtcars dataset.
    table(mtcars$gear)
    We can produce a table of proportions instead of counts by using
    prop.table(table(mtcars$gear))
  • 101. Descriptive analysis: Two-way tables
    We can also use the table() function to cross-tabulate two categorical variables. Let us cross-tabulate cyl and gear in the mtcars dataset.
    with(mtcars, table(cyl, gear))
    The same table can be obtained with the more flexible function xtabs().
    xtabs(~ cyl + gear, data = mtcars)
  • 102. Descriptive analysis: Two-way tables
    There are times when we want to include the row and column sums of a table. This can be done by applying addmargins() to the table. Let us include the row and column sums for the table produced in Slide 101.
    addmargins(xtabs(~ cyl + gear, data = mtcars))
  • 103. Descriptive analysis: Two-way tables
    We can calculate proportions with respect to row or column margins as follows.
    prop.table(xtabs(~ cyl + gear, data = mtcars), 1) # row-wise
    prop.table(xtabs(~ cyl + gear, data = mtcars), 2) # column-wise
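A simple check on the margin argument is that row-wise proportions sum to 1 across each row and column-wise proportions sum to 1 down each column. The sketch below verifies this for the cyl by gear table; tab, row_prop, and col_prop are illustrative names.

```r
tab <- xtabs(~ cyl + gear, data = mtcars)

# margin = 1: each cell divided by its row total
row_prop <- prop.table(tab, 1)
rowSums(row_prop)   # every row sums to 1

# margin = 2: each cell divided by its column total
col_prop <- prop.table(tab, 2)
colSums(col_prop)   # every column sums to 1

# No margin: each cell divided by the grand total
sum(prop.table(tab))   # 1
```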
  • 104. Descriptive analysis: Multidimensional tables
    The functions table(), margin.table(), xtabs(), prop.table(), and addmargins() also extend to more than two dimensions.
    A 3-way table of cyl, gear, and am in the mtcars dataset can be obtained from:
    mytable <- xtabs(~ cyl + gear + am, data = mtcars)
    The function ftable() prints a compact, readable multidimensional table.
    ftable(mytable)
  • 105. Descriptive analysis: Exercise 6
  • 106. Descriptive analysis: Graphs with base R
  • 107. Descriptive analysis: Descriptive analysis through graphs
    The base R graphics system is the original plotting system that ships with R.
    Graphs in base R are created by high-level plotting commands, e.g., plot(), and more information can then be added by using low-level commands, e.g., lines() and text().
    When a plot is created in RStudio it is displayed in the Plots tab by default.
  • 108. Descriptive analysis: Descriptions through graphs
    Previously created plots can be browsed by clicking on the arrow buttons. We can save plots in a variety of formats (pdf, png, tiff, jpeg, etc.) by clicking on the Export button.
    plot() is the most common high-level plotting function. Let us plot the values of mpg in the mtcars dataset against their index.
  • 109. Descriptive analysis: Descriptions through graphs
    with(mtcars, plot(mpg))
    We can produce a scatterplot of wt against mpg (mpg on the horizontal axis) in the mtcars dataset using
    with(mtcars, plot(mpg, wt))
    It is possible to specify the type of graph we wish to plot using the type = argument.
  • 110. Descriptive analysis: Descriptions through graphs
    Let us apply different values of the type argument to plot x = {1, 2, 3, 4, 5, 6, 7, 8} versus y = {12, 21, 9, 11, 8, 10, 17, 18}.
    x <- 1:8
    y <- c(12, 21, 9, 11, 8, 10, 17, 18)
    par(mfrow = c(2, 2)) # to plot 4 graphs on one page
    plot(x, y, type = "l", main = "l")
    plot(x, y, type = "b", main = "b")
    plot(x, y, type = "o", main = "o")
    plot(x, y, type = "c", main = "c")
  • 111. Descriptive analysis: Bar plots
    A bar plot displays the distribution (frequency) of a categorical variable through vertical or horizontal bars.
    Vertical bar plot: barplot(height)
    Horizontal bar plot: barplot(height, horiz = TRUE)
    Bar labels: barplot(height, names.arg = bar_labels), where bar_labels is a vector of labels.
    The following command gives a bar plot of cyl in the mtcars dataset.
    barplot(table(mtcars$cyl))
  • 112. Descriptive analysis: Bar plots
    If height is a matrix rather than a vector, we obtain a stacked or grouped bar plot.
    Stacked bar plot: barplot(height)
    Grouped bar plot: barplot(height, beside = TRUE)
    Let us use the built-in dataset VADeaths and represent it using bar plots.
    barplot(VADeaths)
  • 113. Descriptive analysis: Bar plots
    barplot(VADeaths, beside = TRUE)
  • 114. Descriptive analysis: Pie charts
    A pie chart is a circular diagram for presenting data; it is produced with the function pie(). Let us produce a pie chart from the following data.
    slices <- c(10, 12, 4, 16, 8)
    lbls <- c("US", "UK", "Australia", "Germany", "France")
    pie(slices, labels = lbls, main = "Simple Pie Chart")
  • 115. 115/209 Descriptive analysis Pie charts
    Let us include percentages on the above pie chart.
      pct <- round(slices / sum(slices) * 100)
      lbls2 <- paste(lbls, " ", pct, "%", sep = "")
      pie(slices, labels = lbls2, col = rainbow(length(lbls2)), main = "Pie Chart with Percentages")
  • 116. 116/209 Descriptive analysis Histograms
    Histograms are useful when we want to get an idea about the distribution of values of a numeric variable.
    We use the function hist() to produce a histogram.
    Let us generate a histogram for the variable mpg in the mtcars dataset.
      hist(mtcars$mpg)
  • 117. 117/209 Descriptive analysis Histograms
    The freq = FALSE argument displays the histogram on the density scale (bar areas sum to one) rather than as frequencies.
      hist(mtcars$mpg, freq = FALSE)
    We can control the breakpoints of a histogram using the breaks = argument.
      hist(mtcars$mpg, breaks = seq(from = 0, to = 35, by = 5))
    It is also possible to add a kernel density curve to the histogram by using the density() and lines() functions.
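A minimal sketch of what the density scale means, assuming the same break points as on the slide: the histogram is computed without being drawn, and the bar heights are compared with the class proportions. The object names (h, bar_width, total_area, proportions) are illustrative only.

```r
# Compute the histogram without drawing it, so the bar heights can be inspected.
h <- hist(mtcars$mpg, breaks = seq(from = 0, to = 35, by = 5), plot = FALSE)

bar_width   <- diff(h$breaks)                # width of each class interval (5 here)
total_area  <- sum(h$density * bar_width)    # area under the density-scale histogram
proportions <- h$counts / sum(h$counts)      # share of cars falling in each class

print(total_area)              # the bar areas add up to one
print(round(proportions, 3))   # density * width reproduces these proportions
```

With equal class widths, each density value is simply the class proportion divided by the width, which is why the shape of the histogram is unchanged by freq = FALSE.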
  • 118. 118/209 Descriptive analysis Histograms
    Kernel density estimation is a nonparametric method for estimating the probability density function of a random variable.
    Let us add a kernel density curve to the histogram of mpg.
      Density <- density(mtcars$mpg)
      hist(mtcars$mpg, freq = FALSE)
      lines(Density)
  • 119. 119/209 Descriptive analysis Box plots
    Box plots are useful to graphically summarize the distribution of a variable, identify potential unusual values, and compare distributions between different groups.
    We use the function boxplot() to create a box plot.
    Let us create a box plot for mpg in the mtcars dataset.
      boxplot(mtcars$mpg)
  • 120. 120/209 Descriptive analysis Box plots
    Box plots of a quantitative variable with respect to the levels of a factor can be produced as follows.
      boxplot(mpg ~ cyl, data = mtcars)
  • 121. 121/209 Descriptive analysis Box plots
    We can also have box plots based on more than one grouping factor, e.g., cyl and am. Let us first create factors from these variables.
      mtcars$cyl.f <- factor(mtcars$cyl, levels = c(4, 6, 8), labels = c("4", "6", "8")) # to factor cyl
      mtcars$am.f <- factor(mtcars$am, levels = c(0, 1), labels = c("auto", "standard")) # to factor am
      boxplot(mpg ~ am.f * cyl.f, data = mtcars)
  • 122. 122/209 Descriptive analysis Scatterplot matrix
    A scatterplot matrix can be created if we want to graphically explore the relationships among more than two variables.
    We use the function pairs() to create pairwise scatterplots.
      pairs(mtcars[, c("mpg", "disp", "hp", "drat")])
  • 123. 123/209 Descriptive analysis Low-level plotting functions
    Low-level plotting functions add extra information such as points, lines, arrows, or text to an existing plot.
    points(x, y, ...): Adds points to the current plot.
    lines(x, y, ...): Adds line segments.
    text(x, y, labels, ...): Adds text to the graph.
    abline(a, b, ...): Adds the line y = a + bx.
    abline(h = y, ...): Adds a horizontal line.
    segments(x0, y0, x1, y1, ...): Draws line segments from (x0, y0) to (x1, y1).
    legend("position", fill = , cex = , ...): Displays a legend at the given position, e.g. "topright".
    Let us include a legend on the bar plot produced in Slide 112.
  • 124. 124/209 Descriptive analysis Low-level plotting functions
      barplot(VADeaths, col = c("green", "yellow", "red", "black", "blue"))
      legend("topright", legend = rownames(VADeaths), fill = c("green", "yellow", "red", "black", "blue"), cex = 0.3)
  • 125. 125/209 Descriptive analysis Multiple figure environments
    We can create an n by m array of figures on a single page using the graphical parameters mfrow and mfcol.
    mfcol = c(nrow, ncol): draws nrow rows and ncol columns of plots on one page, filled column by column.
    mfrow = c(nrow, ncol): the same array, filled row by row.
    These are parameters rather than functions, so they must be set through par(), as in par(mfcol = c(nrow, ncol)) or par(mfrow = c(nrow, ncol)).
    Example. Let us produce four different graphs based on the built-in dataset AirPassengers.
      par(mfrow = c(2, 2))
      plot(AirPassengers, type = "p")
      title("Points")
  • 126. 126/209 Descriptive analysis Multiple figure environments
      plot(AirPassengers, type = "l")
      title("Lines")
      plot(AirPassengers, type = "b")
      title("Points and Lines")
      plot(AirPassengers, type = "h")
      title("High Density")
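The difference between the two fill orders can be seen with a short sketch like the following. It assumes a fresh graphics device and uses panel titles only to show the order in which the panels are drawn; old_par is an illustrative name for the saved settings.

```r
# par() returns the previous values of the parameters it changes,
# so the original layout can be restored afterwards.
old_par <- par(mfrow = c(2, 2))          # 2 x 2 array, filled row by row
for (i in 1:4) {
  plot(AirPassengers, type = "l", main = paste("mfrow panel", i))
}

par(mfcol = c(2, 2))                     # 2 x 2 array, filled column by column
for (i in 1:4) {
  plot(AirPassengers, type = "l", main = paste("mfcol panel", i))
}

par(old_par)                             # back to one figure per page
```

On the first page panel 2 appears to the right of panel 1, while on the second page it appears below it.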
  • 127. 127/209 Descriptive analysis Exercise 7
  • 128. 128/209 Descriptive analysis Customizing plots
    Once we are familiar with the base graphics in R, we can add more information to our graphs.
    Labels for the x and y axes can be included by using the xlab = " " and ylab = " " arguments, respectively, in the plot() function.
      plot(mtcars$mpg, mtcars$wt, xlab = "Miles per gallon", ylab = "Weight of car (1000 lbs)")
    There are also situations where we would like to adjust the figure margins.
  • 129. 129/209 Descriptive analysis Customizing plots
    Figure margins can be adjusted by using the par() function and the mar = argument before we plot the graph.
    par(mar = c(bottom, left, top, right)): the arguments bottom, left, top, and right are the sizes of the corresponding margins.
    By default R uses c(5.1, 4.1, 4.1, 2.1), where these numbers represent the number of lines in each margin.
      par(mar = c(5, 4, 4, 6))
      plot(mtcars$mpg, mtcars$wt, xlab = "Miles per gallon", ylab = "Weight of car (1000 lbs)") # the wider right margin shrinks the plot width
  • 130. 130/209 Descriptive analysis Customizing plots
    We can control the range of the axis scales by using the xlim = c(min, max) and ylim = c(min, max) arguments.
    Let us set the x axis scale from 0 to 40 and the y axis scale from 0 to 8 and see the difference.
      plot(mtcars$mpg, mtcars$wt, xlab = "Miles per gallon", ylab = "Weight of car (1000 lbs)", xlim = c(0, 40), ylim = c(0, 8))
  • 131. 131/209 Descriptive analysis Customizing plots
    We can also change the color and the size of the plotting symbol by using the col = and cex = arguments, respectively.
    col = takes either an integer value to specify the color or a character string (col = "green") giving the color name. The colors() function gives a list of all 657 preset colors in base R.
    cex = requires a numeric value to indicate the proportional increase or decrease in size relative to the default value of 1.
    Let us change the color of the dots to "green" and decrease the size of the symbol by 40% in the above plot.
      plot(mtcars$mpg, mtcars$wt, xlab = "Miles per gallon", ylab = "Weight of car (1000 lbs)", xlim = c(0, 40), ylim = c(0, 8), col = "green", cex = 0.6)
  • 132. 132/209 Descriptive analysis Customizing plots
    The function text() can be used to add a text label at a specific (x, y) coordinate of the plot. It is a low-level function, so it is called after plot() rather than inside it.
    Let us add the label "(30, 2)" at the point (30, 2) in the above plot.
      plot(mtcars$mpg, mtcars$wt, xlab = "Miles per gallon", ylab = "Weight of car (1000 lbs)", xlim = c(0, 40), ylim = c(0, 8), col = "green", cex = 0.6)
      text(30, 2, labels = "(30, 2)")
    The background of a plot can be changed by par(bg = "color"), where color can be "red", "white", "green", etc.
  • 133. 133/209 Descriptive analysis Customizing plots
    In R, it is possible to give data points different symbols and colors depending on the level of a factor variable by using the low-level function points().
    Suppose we want to produce a scatterplot of mpg and wt in which each level of cyl in the mtcars dataset is drawn differently.
    Step 1. Let us include the type = "n" argument to create an empty plot region.
      plot(mtcars$mpg, mtcars$wt, xlab = "Miles per gallon", ylab = "Weight (1000 lbs)", xlim = c(0, 40), ylim = c(0, 8), bty = "l", type = "n")
  • 134. 134/209 Descriptive analysis Customizing plots
    Step 2. Plot for cyl == 4
      points(x = mtcars$mpg[mtcars$cyl == 4], y = mtcars$wt[mtcars$cyl == 4], pch = 2, col = "red")
    Step 3. Plot for cyl == 6
      points(x = mtcars$mpg[mtcars$cyl == 6], y = mtcars$wt[mtcars$cyl == 6], pch = 3, col = "yellow")
  • 135. 135/209 Descriptive analysis Customizing plots
    Step 4. Plot for cyl == 8
      points(x = mtcars$mpg[mtcars$cyl == 8], y = mtcars$wt[mtcars$cyl == 8], pch = 4, col = "green")
  • 136. 136/209 Descriptive analysis Customizing plots
    Let us finally include a legend to describe what each symbol and color designates in the plot.
      Cols <- c("red", "yellow", "green")
      Symbol <- c(2, 3, 4)
      Label <- c("4 Cylinder", "6 Cylinder", "8 Cylinder")
      legend(x = 10, y = 4, col = Cols, pch = Symbol, legend = Label, cex = 0.25)
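The same figure can also be drawn in a single plot() call by indexing the color and symbol vectors with the factor levels. This is only a sketch of an alternative to the step-by-step approach; cyl_f, point_col, and point_pch are illustrative names, and the legend position "topright" is an arbitrary choice.

```r
cyl_f  <- factor(mtcars$cyl)              # levels "4", "6", "8"
Cols   <- c("red", "yellow", "green")     # one color per level of cyl
Symbol <- c(2, 3, 4)                      # one plotting symbol per level of cyl

# as.integer() turns each level into its position (1, 2, 3), which then
# selects the matching color and symbol for every car.
point_col <- Cols[as.integer(cyl_f)]
point_pch <- Symbol[as.integer(cyl_f)]

plot(mtcars$mpg, mtcars$wt,
     xlab = "Miles per gallon", ylab = "Weight (1000 lbs)",
     xlim = c(0, 40), ylim = c(0, 8), bty = "l",
     col = point_col, pch = point_pch)
legend("topright", legend = paste(levels(cyl_f), "Cylinder"),
       col = Cols, pch = Symbol)
```

This form scales to factors with many levels, whereas the points() approach needs one call per level.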
  • 137. 137/209 Descriptive analysis Exporting plots
    Plots in R can be exported to different formats such as jpeg, png, pdf, and bmp.
    The first option is to click on the Export button in the Plots tab.
    The second option is to write code in an R script.
    To save a plot in pdf format we use the pdf() function. Similarly, we use the functions jpeg(), png(), and bmp() for the jpeg, png, and bmp formats.
    After the plotting code has run, we need to close the graphics device using the dev.off() function so that the file is written.
  • 138. 138/209 Descriptive analysis Exporting plots
    Let us export the plot we had in Slide 132 to jpeg and png formats.
      jpeg('my_plot.jpeg')
      plot(mtcars$mpg, mtcars$wt, xlab = "Miles per gallon", ylab = "Weight (1000 lbs)", xlim = c(0, 40), ylim = c(0, 8), col = "green", cex = 0.6)
      text(30, 2, labels = "(30, 2)")
      dev.off()
      png('my_plot.png')
      plot(mtcars$mpg, mtcars$wt, xlab = "Miles per gallon", ylab = "Weight (1000 lbs)", xlim = c(0, 40), ylim = c(0, 8), col = "green", cex = 0.6)
      text(30, 2, labels = "(30, 2)")
      dev.off()
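The open, draw, close pattern is the same for every file device. The sketch below uses pdf() and writes to a temporary folder so that it can be run anywhere; the file location (out_file) and the page size are assumptions made for the illustration.

```r
# Hypothetical output location: a file inside the session's temporary folder.
out_file <- file.path(tempdir(), "my_plot.pdf")

pdf(out_file, width = 7, height = 5)      # 1. open the file device (size in inches)
plot(mtcars$mpg, mtcars$wt,               # 2. draw; nothing appears on screen
     xlab = "Miles per gallon", ylab = "Weight (1000 lbs)",
     xlim = c(0, 40), ylim = c(0, 8), col = "green", cex = 0.6)
text(30, 2, labels = "(30, 2)")
dev.off()                                 # 3. close the device so the file is written

file.exists(out_file)                     # check that the file was created
```

If dev.off() is forgotten, the file stays open and later plots keep going into it instead of the Plots tab.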
  • 139. 139/209 Descriptive analysis Exercise 8
  • 140. 140/209 Basic inferential analysis
  • 141. 141/209 Basic inferential analysis Basic inference
    After performing data cleaning and descriptive analysis, we need to perform inferential analysis. This includes estimation of model parameters and tests of hypotheses.
    Tests of independence: The chi-square test of independence tests whether two categorical variables are independent. We use the function chisq.test() to perform the test.
    Let us consider the Arthritis dataset in the vcd package and test whether there is an association between Treatment and Improved.
      library(vcd)
      mytable <- xtabs(~ Treatment + Improved, data = Arthritis)
      chisq.test(mytable)
  • 142. 142/209 Basic inferential analysis Basic inference
    The null hypothesis, i.e., that there is no association between the two variables, will be rejected since the p-value = 0.001463 < 0.05.
    We may also use the prop.test() function to analyze this dataset.
    The proportions of patients from the Placebo group within the improvement categories None, Some, and Marked are 0.69, 0.50, and 0.25, respectively. These proportions differ, but is the difference statistically supported?
    Let us consider the vector of counts in the Placebo group, i.e., (29, 7, 7), and the vector of category totals, (42, 14, 28).
  • 143. 143/209 Basic inferential analysis Basic inference
      Col_Total <- margin.table(mytable, 2) # totals for None, Some, Marked
      Placebo <- mytable[1, ]
      prop.test(Placebo, Col_Total)
    We can conclude that the proportions differ statistically since the p-value = 0.001463 < 0.05.
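To see where the chi-square statistic comes from, the test can be reproduced by hand. The sketch below types the table in directly so that it runs without the vcd package; the Placebo counts and the category totals are those quoted above, and the Treated counts (13, 7, 21) are obtained by subtraction, which is an assumption of this illustration.

```r
# Treatment by Improved table, entered by hand.
counts <- matrix(c(29, 7,  7,
                   13, 7, 21),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Treatment = c("Placebo", "Treated"),
                                 Improved  = c("None", "Some", "Marked")))

# Expected counts under independence: (row total * column total) / grand total.
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

chi_sq  <- sum((counts - expected)^2 / expected)        # Pearson chi-square statistic
df      <- (nrow(counts) - 1) * (ncol(counts) - 1)      # degrees of freedom
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)  # upper-tail probability

test <- chisq.test(counts)   # the built-in test, for comparison
print(c(by_hand = chi_sq, built_in = unname(test$statistic)))
print(p_value)
```

The hand computation and chisq.test() agree, and prop.test() on the Placebo counts is the same Pearson test applied to this 2 x 3 table, which is why both slides report the same p-value.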
  • 144. 144/209 Basic inferential analysis Correlations
    We use correlation coefficients to measure the association between two quantitative variables.
    The Pearson, Spearman, and Kendall correlation coefficients can be obtained using:
      cor(x, use = , method = )
    x = matrix or data frame; use = option to handle missing data; method = "pearson", "spearman", or "kendall".
    A partial correlation is a correlation between two quantitative variables, controlling for one or more other quantitative variables. Code to use:
      library(ggm)
      pcor(u, S) # the first two entries of u index the variables to be correlated, the remaining entries index the variables partialled out; S is the covariance matrix
    Let us consider the built-in dataset state.x77 to illustrate the procedure.
  • 145. 145/209 Basic inferential analysis Correlations
      states <- state.x77[, 1:6]
      colnames(states)
    gives [1] "Population" "Income" "Illiteracy" "Life Exp" "Murder" "HS Grad"
      cor(states)
    gives the matrix of pairwise correlations among these six variables.
    Let us find the partial correlation coefficient between var1 (Population) and var5 (Murder rate), keeping the effects of var2, var3, and var6 constant.
  • 146. 146/209 Basic inferential analysis Correlations
      library(ggm)
      pcor(c(1, 5, 2, 3, 6), cov(states)) # gives [1] 0.346
    0.346 is the partial correlation between Population (var1) and Murder rate (var5); cov(states) is the covariance matrix of the variables in states.
    The function cor.test(x, y, alternative = , method = ) tests the significance of a single correlation coefficient.
    For a whole correlation matrix we can use corr.test() from the psych package; print(corr.p(Cor$r, n = ), short = FALSE) prints the p-values and confidence intervals.
    Let us test which of the correlations given above are statistically significant.
      library(psych)
      Cor <- corr.test(states, use = "complete")
      print(corr.p(Cor$r, n = 50), short = FALSE)
  • 147. 147/209 Basic inferential analysis Correlations
      library(psych)
      Cor <- corr.test(states, use = "complete")
      print(corr.p(Cor$r, n = 50), short = FALSE)
    There is a strong (r = 0.70) and statistically significant (p-value reported as 0.00 after rounding) correlation between Illiteracy and Murder rate.
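For a single pair of variables the same conclusion can be checked with base R alone. The sketch below tests the Illiteracy and Murder correlation with cor.test() and recomputes the t statistic from r; the names res, r, n, and t_stat are illustrative.

```r
states <- state.x77[, 1:6]

# Pearson correlation test for one pair of variables (base R, no extra package).
res <- cor.test(states[, "Illiteracy"], states[, "Murder"])

r <- unname(res$estimate)   # sample correlation coefficient
n <- nrow(states)           # number of states (50)

# Test statistic for H0: rho = 0, with n - 2 degrees of freedom.
t_stat <- r * sqrt((n - 2) / (1 - r^2))

print(round(r, 2))
print(c(by_hand = t_stat, built_in = unname(res$statistic)))
print(res$p.value)
```

Unlike the corr.test() output, the p-value here is for one test only and is not adjusted for multiple comparisons.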
  • 148. 148/209 Basic inferential analysis T-tests
    T-tests compare the mean of continuous data either against an a priori stipulated value or between two groups.
    One-sample t-test: H0: µ = µ0 vs H1: µ < µ0, µ > µ0, or µ ≠ µ0
    We can use the function t.test() to perform a t-test.
      t.test(x, mu = mu0, ...) # x is the data vector, mu0 the proposed mean value, followed by optional arguments
    Options: alternative = "greater"; alternative = "less"; and conf.level = X.
    Example. Use the following data to test whether the mean value is 7500.
      Y <- c(5660, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770, 8800, 8000, 7750, 6950)
      t.test(Y, mu = 7500)
  • 149. 149/209 Basic inferential analysis T-tests
    There is not sufficient statistical evidence to reject the null hypothesis (p-value = 0.1636 > 0.05), so we cannot conclude that the mean value differs from 7500.
    We can apply the logic of the one-sample t-test to test whether two population means are different.
    We may need to test for mean differences between two dependent or two independent samples.
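The quantities behind this test can be computed step by step. The following sketch uses the data from Slide 148 and compares the hand computation with t.test(); mu0, t_stat, and p_value are illustrative names.

```r
Y <- c(5660, 5470, 5640, 6180, 6390, 6515, 6805, 7515,
       7515, 8230, 8770, 8800, 8000, 7750, 6950)
mu0 <- 7500                      # hypothesised mean under H0
n   <- length(Y)

# t = (sample mean - mu0) / (standard error of the mean)
t_stat <- (mean(Y) - mu0) / (sd(Y) / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

res <- t.test(Y, mu = mu0)       # built-in test, for comparison
print(c(by_hand = t_stat, built_in = unname(res$statistic)))
print(c(by_hand = p_value, built_in = res$p.value))
```

For a one-sided alternative only the last step changes: the factor 2 is dropped and the appropriate tail of the t distribution is used.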
  • 150. 150/209 Basic inferential analysis T-tests
    The independent t-test assumes that the two groups are independent and that the data are sampled from normal populations. It can be performed by:
      t.test(y ~ x, data = , options) # y is numeric and x is a dichotomous variable
      t.test(y1, y2, options) # both y1 and y2 are numeric
    The options can be var.equal = TRUE, alternative = "less", or alternative = "greater".
    A dependent t-test assumes that the difference between groups is normally distributed.
    The dependent t-test can be performed by:
      t.test(y1, y2, paired = TRUE) # y1 and y2 are numeric vectors for the two dependent groups
  • 151. 151/209 Basic inferential analysis T-tests
    Let us consider part of the built-in dataset PlantGrowth and test whether the mean weight for trt1 is the same as that of trt2.
      Weight1 <- PlantGrowth$weight[PlantGrowth$group == "trt1"]
      Weight2 <- PlantGrowth$weight[PlantGrowth$group == "trt2"]
      t.test(Weight1, Weight2)
  • 152. 152/209 Basic inferential analysis T-tests
    The null hypothesis is rejected (p-value = 0.009298 < 0.05) and we conclude that the mean weight for trt1 is statistically different from that of trt2.
    Consider a dataset (marg) from a study to check whether cholesterol was reduced after using a certain brand of margarine as part of a low fat, low cholesterol diet. This dataset contains information on 18 people using margarine to reduce cholesterol over two time points.
    Test whether the mean cholesterol level before using the margarine is the same as that after using the margarine.
      Marg <- read.csv("C:/Users/User/OneDrive/Desktop/QuantDataAna/marg.csv", header = TRUE)
      with(Marg, t.test(Before, After4weeks, paired = TRUE))
  • 153. 153/209 Basic inferential analysis T-tests
    The null hypothesis will be rejected (p-value = 1.958 x 10^-11 < 0.05), and the conclusion is that the mean cholesterol level before using the margarine differs from that after using the margarine.
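A paired t-test is a one-sample t-test applied to the within-subject differences. Since marg.csv is not distributed with R, the sketch below uses the built-in sleep dataset as a stand-in (ten subjects, each measured under two drugs); the substitution is an assumption made only so that the example runs.

```r
# Extra hours of sleep for the same ten subjects under drug 1 and drug 2.
drug1 <- sleep$extra[sleep$group == "1"]
drug2 <- sleep$extra[sleep$group == "2"]

paired_res <- t.test(drug1, drug2, paired = TRUE)   # dependent-samples t-test
diff_res   <- t.test(drug1 - drug2, mu = 0)         # one-sample test on the differences

# Both calls give the same statistic, degrees of freedom, and p-value.
print(c(paired = unname(paired_res$statistic), differences = unname(diff_res$statistic)))
print(c(paired = paired_res$p.value, differences = diff_res$p.value))
```

This is why the normality assumption of the dependent t-test concerns the differences rather than the two original variables.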
  • 154. 154/209 Basic inferential analysis Nonparametric tests
    The t-test depends on the assumption that the data being analyzed are normally distributed.
    The distribution of the data under consideration may not be normal. In such cases, we can use nonparametric tests as alternatives to t-tests.
    We can use the function wilcox.test() to perform rank-based (nonparametric) tests.
    Let us recall the dataset given in Slide 148 and test whether the median value is 7500 or not.
      wilcox.test(Y, mu = 7500)
  • 155. 155/209 Basic inferential analysis Nonparametric tests
    The p-value = 0.2219 > 0.05 and hence the null hypothesis will be retained.
    The Wilcoxon rank sum test for two independent groups can be performed by:
      wilcox.test(y ~ x, data) # y is numeric and x is a dichotomous variable
      wilcox.test(y1, y2) # y1 and y2 are the outcome variables
    A nonparametric alternative to the dependent-sample t-test is the Wilcoxon signed rank test:
      wilcox.test(y1, y2, paired = TRUE)
  • 156. 156/209 Basic inferential analysis Exercise 9
  • 157. 157/209 Intermediate statistical methods
  • 158. 158/209 Intermediate statistical methods Comparing more than two groups - ANOVA
    ANOVA is an extension of the t-test and compares the means of two or more groups.
    One-way ANOVA: y ~ A
    Two-way factorial ANOVA: y ~ A * B
    Randomized block ANOVA: y ~ B + A, where B is a blocking factor.
    y is the dependent variable and the letters A and B represent factors.
    Example. Consider the cholesterol dataset in the multcomp package. i) Find the group means, ii) produce and interpret box plots, iii) test for group mean differences.
    i) library(multcomp)
       attach(cholesterol)
       aggregate(response, by = list(trt), FUN = mean)
  • 159. 159/209 Intermediate statistical methods ANOVA
    ii) boxplot(response ~ trt, data = cholesterol)
    The box plots suggest that the typical response differs between the groups.
  • 160. 160/209 Intermediate statistical methods ANOVA
    iii) fit <- aov(response ~ trt, data = cholesterol)
         summary(fit)
    The above ANOVA table shows that the mean response in at least one of the groups differs from that of the others (p-value = 9.82 x 10^-13 < 0.05).
    Let us proceed to test which pairs of mean responses differ statistically.
  • 161. 161/209 Intermediate statistical methods ANOVA
    The function TukeyHSD() performs Tukey's honest significant difference test, a commonly used procedure to compare all pairwise differences between group means. Let us apply it to the present case.
      TukeyHSD(fit)
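The same two steps (overall F test, then pairwise comparisons) can be tried on a dataset that ships with R. The sketch below uses PlantGrowth instead of cholesterol so that no extra package is needed; fit_pg, tukey, and sig_pairs are illustrative names.

```r
# One-way ANOVA of plant weight on treatment group (ctrl, trt1, trt2).
fit_pg <- aov(weight ~ group, data = PlantGrowth)
summary(fit_pg)

# All pairwise comparisons with family-wise adjusted p-values.
tukey <- TukeyHSD(fit_pg)
print(tukey)

# Pairs whose adjusted p-value is below 0.05.
sig_pairs <- rownames(tukey$group)[tukey$group[, "p adj"] < 0.05]
print(sig_pairs)
```

Each row of the output gives the difference in means (diff), its confidence limits (lwr, upr), and the adjusted p-value (p adj); plot(tukey) draws the same intervals.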
  • 162. 162/209 Intermediate statistical methods One-way ANOVA
    Testing model assumptions
    Assumptions: the dependent variable is normally distributed with equal variance in each group.
    Use a Q-Q plot to assess the normality assumption:
      library(car)
      qqPlot(lm(response ~ trt, data = cholesterol))
    Observe that qqPlot() requires an lm() fit.
    The normality assumption is not violated, as the points are close to the reference line.
  • 163. 163/209 Intermediate statistical methods One-way ANOVA
    The constant variance (homogeneity) assumption can be assessed by Bartlett's test:
      bartlett.test(response ~ trt, data = cholesterol)
    There is not sufficient statistical evidence (p-value = 0.9653 > 0.05) to reject the null hypothesis of constant variance.
    The ANOVA model appears to fit the data adequately since the above assumptions are fulfilled.
  • 164. 164/209 Intermediate statistical methods Two-way ANOVA
    Two-way ANOVA simultaneously evaluates the effect of two grouping variables on a response variable. These grouping variables are also known as factors.
    There are three possible effects in this design: two main effects and one interaction effect.
    Example. Sixty guinea pigs are randomly assigned to receive one of three dose levels of vitamin C (0.5, 1, or 2 mg/day) and one of two delivery methods (orange juice or ascorbic acid), and tooth length was measured (see the ToothGrowth dataset).
    i) Find the group means of tooth length, ii) produce and interpret box plots, iii) test whether the main effects (supp and dose) and the interaction between these factors are significant.
  • 165. 165/209 Intermediate statistical methods Two-way ANOVA
      attach(ToothGrowth)
    i) aggregate(len, by = list(supp, dose), FUN = mean)
    ii) boxplot(len ~ supp:dose, data = ToothGrowth)
  • 166. 166/209 Intermediate statistical methods Two-way ANOVA
    iii) ToothGrowth$dose <- factor(ToothGrowth$dose) # converts dose to a factor in the data frame that aov() uses
         fit <- aov(len ~ supp * dose, data = ToothGrowth)
         summary(fit)
    Each of the main effects and the interaction are statistically significant.
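Whether dose is stored as a number or as a factor changes the model that aov() fits. The sketch below fits both versions on a copy of the data and compares the degrees of freedom; tg, fit_num, and fit_fac are illustrative names.

```r
tg <- ToothGrowth                                # work on a copy of the built-in data

fit_num <- aov(len ~ supp * dose, data = tg)     # dose numeric: a single linear slope

tg$dose <- factor(tg$dose)                       # dose as a factor with 3 levels
fit_fac <- aov(len ~ supp * dose, data = tg)     # dose factor: one mean per level

# Degrees of freedom for supp, dose, supp:dose, and the residuals.
df_num <- summary(fit_num)[[1]]$Df
df_fac <- summary(fit_fac)[[1]]$Df
print(rbind(numeric_dose = df_num, factor_dose = df_fac))
```

With a numeric dose the dose and interaction terms each use one degree of freedom; with a factor they each use two, which is the two-way factorial ANOVA described on the slide.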
  • 167. 167/209 Intermediate statistical methods Two-way ANOVA
    Checking the normality assumption
      library(car)
      qqPlot(lm(len ~ supp * dose, data = ToothGrowth))
    What can you say about the normality assumption based on the above plot?
  • 168. 168/209 Intermediate statistical methods Two-way ANOVA
    Checking constant variance
      library(car)
      leveneTest(len ~ factor(supp) * factor(dose), data = ToothGrowth)
    What can be said about the constant variance assumption?
  • 169. 169/209 Intermediate statistical methods Regression
    Regression analysis can be used to:
    1 identify the explanatory variables that are related to a response variable,
    2 describe the type of the relationships (if any),
    3 predict the value of the response variable from the explanatory variables.
    Some questions for which a regression model is suitable include:
    What is the relationship between surface stream salinity and paved road surface area?
    Which qualities of an educational environment are most strongly related to higher student achievement scores?
    What is the form of the relationship between blood pressure, salt intake, and age? Is it the same for men and women?
  • 170. 170/209 Intermediate statistical methods Regression
  • 171. 171/209 Intermediate statistical methods Regression
    A function for fitting a linear model:
      myfit <- lm(formula, data) # formula = Y ~ X1 + ... + Xk
    Some symbols to be used in the formula:
    ~ : Separates the response variable on the left from the explanatory variables on the right.
    + : Separates predictor variables.
    : : Denotes an interaction between predictor variables.
    * : A shortcut for denoting all possible interactions.
    -1 : Suppresses the intercept. y ~ x - 1 fits a regression of y on x without an intercept.
    Other functions:
    summary(): Detailed results for the fitted model.
    coefficients(): Lists the intercept and slopes of the fitted model.
    confint(): Confidence intervals for the model parameters.
  • 172. 172/209 Intermediate statistical methods Regression
    fitted(): Lists the predicted values in a fitted model.
    residuals(): Lists the residual values in a fitted model.
    anova(): An ANOVA table for a fitted model.
    vcov(): Lists the covariance matrix for model parameters.
    AIC(): Prints Akaike's Information Criterion.
    plot(): Diagnostic plots for evaluating the fit of a model.
    predict(): Uses a fitted model to predict response values for a new dataset.
    Code for a polynomial regression of degree n:
      fit1 <- lm(y ~ x + I(x^2) + ... + I(x^n), data = )
    A scatterplot matrix can be generated with the car package:
      scatterplotMatrix(data, smoother.args = list(lty = 2))
  • 173. 173/209 Intermediate statistical methods Regression
    Example. Consider the built-in dataset women, which provides the height and weight for a set of 15 women, and fit a regression model.
    Let us first give the scatterplot of weight versus height.
      plot(women$height, women$weight)
    This scatterplot suggests a linear relationship between the weight and height of women.
  • 174. 174/209 Intermediate statistical methods Regression
      fit <- lm(weight ~ height, data = women)
      summary(fit) # Model summary
  • 175. 175/209 Intermediate statistical methods Regression
    According to the above result, height is found to be a statistically significant predictor of the weight of women.
    When the height of a woman increases by 1 unit, the weight increases by 3.45 units on average.
    Let us now superimpose the fitted regression line on the scatterplot.
      plot(women$height, women$weight)
      abline(fit)
  • 176. 176/209 Intermediate statistical methods Regression
    Let us rerun the regression with a quadratic term (that is, X^2):
      fit2 <- lm(weight ~ height + I(height^2), data = women)
      plot(women$height, women$weight)
      lines(women$height, fitted(fit2))
    Does the quadratic term improve prediction?
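One way to approach this question is to compare the two nested models directly. The sketch below refits both models so that it is self-contained, compares their adjusted R-squared values, runs an F test with anova(), and predicts weight at three heights chosen only for illustration.

```r
fit  <- lm(weight ~ height, data = women)                  # linear model
fit2 <- lm(weight ~ height + I(height^2), data = women)    # adds the quadratic term

# Adjusted R-squared penalises the extra parameter of the quadratic model.
adj_r2 <- c(linear    = summary(fit)$adj.r.squared,
            quadratic = summary(fit2)$adj.r.squared)
print(adj_r2)

# F test of H0: the coefficient of height^2 is zero (nested model comparison).
cmp <- anova(fit, fit2)
print(cmp)

# Predicted weights from the quadratic model at hypothetical new heights.
new_heights <- data.frame(height = c(60, 66, 72))
pred <- predict(fit2, newdata = new_heights)
print(round(pred, 1))
```

A small p-value in the anova() table together with a higher adjusted R-squared indicates that the quadratic term improves the fit for these data.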
  • 177. 177/209 Intermediate statistical methods Regression
    Categorical independent variables: These are recoded into a set of separate binary variables before we enter them into a regression model. Such recoding is known as "dummy coding".
    For example, R will automatically create a dummy variable SexMale from the factor variable Sex: SexMale = 1 if a person is Male and 0 if a person is Female.
    The default option in R is to use the first level of the factor as the reference and to interpret the remaining levels relative to this level.
    We can use the function contrasts() to see the coding that R has used to create the dummy variables.
      contrasts(Sex)
  • 178. 178/209 Intermediate statistical methods Regression
    Suppose that the coefficient corresponding to SexMale is found to be 2.5 in a fitted regression model where the response is score. Note here that Sex = Female is the reference category.
    The interpretation will be: a male would get 2.5 points more than a female on average.
    We can use the function relevel() to change the reference category to Male as follows:
      Sex <- relevel(Sex, ref = "Male")
    A dummy variable SexFemale will be created once the model is refitted after executing the above code.
    In general, a categorical variable with d levels will be transformed into d - 1 variables, each with two levels.
    Suppose for example that a variable Education has 4 levels: None, Primary, Secondary, Tertiary.
  • 179. 179/209 Intermediate statistical methods Regression
    The three dummy variables are Primary, Secondary, and Tertiary.
    ▶ If Education = Primary, then the column Primary would be coded with a 1 while Secondary and Tertiary would be coded with a 0.
    ▶ If Education = Secondary, then the column Secondary would be coded with a 1 while Primary and Tertiary would be coded with a 0.
    ▶ If Education = None, then each of the columns Primary, Secondary, and Tertiary would be coded with a 0.
    ▶ Note that we should first convert a character vector to a factor so as to have such dummy coding in R.
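The coding described above can be inspected with contrasts() and model.matrix(). The sketch below uses a small made-up Education vector of five people; the data are hypothetical and serve only to show the columns R creates.

```r
# Hypothetical education levels for five people, stored as a factor
# whose first level (None) acts as the reference category.
Education <- factor(c("None", "Primary", "Secondary", "Tertiary", "Primary"),
                    levels = c("None", "Primary", "Secondary", "Tertiary"))

coding <- contrasts(Education)        # 4 levels coded by 3 dummy variables
print(coding)

# Design matrix that lm() would build: an intercept plus the 3 dummy columns.
dummies <- model.matrix(~ Education)
print(dummies)
```

The first row (Education = None) has zeros in all three dummy columns, so its mean is captured by the intercept and the other coefficients are differences from that reference level.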
  • 180. 180/209 Intermediate statistical methods Regression
    Multiple linear regression is an extension of simple linear regression to more than one explanatory variable.
    Example. Use the built-in dataset state.x77 to explore the relationship between a state's murder rate and other characteristics including population, illiteracy rate, average income, and frost levels.
    Let us extract the variables that we are interested in.
      states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
      cor(states) # Gives pairwise correlations
    Scatterplot matrix:
      library(car)
      scatterplotMatrix(states)
  • 181. Intermediate statistical methods: Regression
    Let us fit the multiple linear regression model:
      fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
      summary(fit)
  • 182. Intermediate statistical methods: Regression
    According to the model summary, Population and Illiteracy are significant predictors of Murder.
    For a one-unit (one percentage point) increase in Illiteracy, the Murder rate (per 100,000 population) increases by 4.14 units on average, keeping the other predictors fixed.
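The statement above can be checked by extracting the coefficient table from the fitted model. This is a minimal sketch using the same built-in data; the 0.05 cut-off and the names `coef_table` and `significant` are choices made for this illustration.

```r
states <- as.data.frame(state.x77[, c("Murder", "Population",
                                      "Illiteracy", "Income", "Frost")])
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)

# Matrix of estimates, standard errors, t values and p-values
coef_table <- coef(summary(fit))
print(round(coef_table, 4))

# Predictors (excluding the intercept) with p-value below 0.05
significant <- rownames(coef_table)[coef_table[, "Pr(>|t|)"] < 0.05]
significant <- setdiff(significant, "(Intercept)")
print(significant)

# 95% confidence intervals for all coefficients
print(confint(fit))
```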
  • 183. Intermediate statistical methods: Regression diagnostics
    Normality can be assessed with the qqPlot() function.
    Example.
      library(car)
      states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
      fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
      qqPlot(fit, labels = row.names(states), id.method = "identify", simulate = TRUE, main = "Q-Q Plot")
    simulate = TRUE adds a 95% confidence envelope using a parametric bootstrap.
    id.method = "identify" allows us to add the "labels" to the graph interactively with the mouse. In car version 3.0 and later these options are supplied through the id argument instead, e.g. id = list(method = "identify", labels = row.names(states)).
  • 184. Intermediate statistical methods: Regression diagnostics
    The Q-Q plot suggests that the residuals approximately follow a normal distribution, so the normality assumption is reasonable.
    Independence of the errors can be checked with the Durbin–Watson test.
      library(car)
      durbinWatsonTest(fit)
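To make the Durbin–Watson check less of a black box, the statistic itself can be computed by hand from the residuals. This is only a sketch of the test statistic; it does not reproduce the p-value reported by durbinWatsonTest(), and the name `dw` is illustrative.

```r
states <- as.data.frame(state.x77[, c("Murder", "Population",
                                      "Illiteracy", "Income", "Frost")])
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)

# Residuals in the row order of the data
e <- residuals(fit)

# Durbin-Watson statistic: sum of squared successive differences
# divided by the sum of squared residuals
dw <- sum(diff(e)^2) / sum(e^2)
print(dw)
```

The statistic always lies between 0 and 4; values close to 2 indicate little first-order autocorrelation. Because it depends on the order of the observations, it is most meaningful when the rows have a natural ordering such as time.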
  • 185. Intermediate statistical methods: Regression diagnostics
    Homoscedasticity (constant error variance) can be tested with the ncvTest() function.
      library(car)
      ncvTest(fit)
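The idea behind a score test for non-constant variance can be sketched in base R. The code below is an assumption-laden illustration of a Breusch–Pagan / Cook–Weisberg type test against the fitted values; it is meant to convey the logic, not to replace ncvTest(), and all object names are illustrative.

```r
states <- as.data.frame(state.x77[, c("Murder", "Population",
                                      "Illiteracy", "Income", "Frost")])
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)

n  <- nrow(states)
e2 <- residuals(fit)^2

# Squared residuals scaled so that their mean is 1
u <- e2 / (sum(e2) / n)

# Auxiliary regression of the scaled squared residuals on the fitted values
aux <- lm(u ~ fitted(fit))

# Score statistic: half the regression sum of squares of the auxiliary model
reg_ss  <- sum((fitted(aux) - mean(u))^2)
chisq   <- reg_ss / 2
p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)

print(c(chisq = chisq, p_value = p_value))
```

A large p-value gives no evidence against constant error variance; a small one suggests that the spread of the residuals changes with the fitted values.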
  • 186. Intermediate statistical methods: Regression diagnostics
    Multicollinearity can be assessed with the variance inflation factor (VIF):
      library(car)
      vif(fit)
    Since the variance inflation factors of all the predictors are small (less than 5), we can conclude that multicollinearity is not a problem here.
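The variance inflation factor of a predictor equals 1 / (1 - R²), where R² comes from regressing that predictor on the remaining predictors. The sketch below computes it by hand in base R and cross-checks it against the diagonal of the inverse correlation matrix; the names `vif_manual` and `vif_from_cor` are illustrative.

```r
states <- as.data.frame(state.x77[, c("Murder", "Population",
                                      "Illiteracy", "Income", "Frost")])
X <- states[, c("Population", "Illiteracy", "Income", "Frost")]

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the others
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))

# Equivalent shortcut: diagonal of the inverse of the predictor correlation matrix
vif_from_cor <- diag(solve(cor(X)))
print(round(vif_from_cor, 2))
```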
  • 187. Intermediate statistical methods: Logistic Regression (Binary)
    Recall that a linear regression model is employed when the dependent variable is quantitative.
    However, we may encounter a categorical response variable with a success/failure structure, such as democracy/autocracy, war/peace, trade agreement/no trade agreement, or underweight/normal.
    We can use binary logistic regression to model such a response variable as a function of one or more independent variables.
    Unlike in a linear regression model, in logistic regression we predict the probability of success of the response variable as a function of the independent variable(s).
    We can use the function glm() to fit a logistic regression model.
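For reference, the binary logistic regression model can be written as follows. The notation is an assumption made for this illustration (it is not taken from the slides): p is the probability of success, x_1, ..., x_k are the independent variables, and the betas are the regression coefficients.

```latex
\[
\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k,
\qquad
p = P(Y = 1 \mid x_1, \dots, x_k)
  = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}}
         {1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}} .
\]
```

The left-hand side is the log-odds of success, which is linear in the predictors; the right-hand expression always lies between 0 and 1, which is why the model is suitable for probabilities.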
  • 188. Intermediate statistical methods: Logistic Regression (Binary)
    The hypotheses of interest in logistic regression:
      H0: The independent variable has no effect on the probability of success of the response.
      H1: The independent variable has an effect on the probability of success of the response.
    Note that the dependent variable should be coded as a two-level factor or as 0/1 before we use glm().
    Example. Let us consider the data on the passengers of the Titanic in 1912 and investigate whether age influenced the probability of traveling in first class.
  • 189. Intermediate statistical methods: Logistic Regression (Binary)
    Load the data into R:
      Titanic <- read.csv("C:/Users/User/OneDrive/Desktop/QuantDataAna/Titanic.csv", header = TRUE)
    Let us first recode the variable pclass into a binary response (1 = first class, 0 = second or third class). The functions mutate() and recode() come from the dplyr package.
      library(dplyr)
      Titanic <- Titanic %>% mutate(class = as.numeric(recode(pclass, '1st' = '1', '2nd' = '0', '3rd' = '0')))
    We use the following code to fit a simple binary logistic regression model of class on age.
      Lfit <- glm(class ~ age, data = Titanic, na.action = na.exclude, family = "binomial")
      summary(Lfit)
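Because Titanic.csv is a local file, the recoding and fitting steps above can be tried out on simulated data instead. Everything in this sketch is hypothetical: the data frame `sim`, the sample size, and the "true" coefficients are chosen only for illustration, and the recoding uses base R's ifelse() rather than dplyr.

```r
set.seed(123)
n   <- 5000
age <- runif(n, min = 0, max = 80)

# Hypothetical true coefficients, chosen only for this simulation
true_b0 <- -3.2
true_b1 <- 0.07
p_first <- plogis(true_b0 + true_b1 * age)

# Simulate a passenger class label in the same style as the pclass column
first  <- rbinom(n, size = 1, prob = p_first) == 1
pclass <- ifelse(first, "1st", sample(c("2nd", "3rd"), n, replace = TRUE))
sim    <- data.frame(age = age, pclass = pclass)

# Base-R recoding: 1 = first class, 0 = second or third class
sim$class <- ifelse(sim$pclass == "1st", 1, 0)

# Fit the binary logistic regression of class on age
sim_fit <- glm(class ~ age, data = sim, family = "binomial")
print(coef(summary(sim_fit)))
```

With a sample this large, the estimated age coefficient should land close to the value used to generate the data, which is a quick way to confirm that the recoding and the model call behave as intended.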
  • 190. Intermediate statistical methods: Logistic Regression (Binary)
    The coefficient for age is highly significant (p-value < 2 × 10^(-16)).
    The odds ratios corresponding to the regression coefficients can be obtained by
      exp(coef(Lfit))
  • 191. Intermediate statistical methods: Logistic Regression (Binary)
    The odds ratio corresponding to age = exp(0.067767) = 1.07.
    Interpretation:
    ▶ For every one-unit (one-year) increase in age, the odds of traveling in first class are multiplied by 1.07, i.e. they increase by about 7%.
    ▶ Let us clarify this by taking specific age values:
      At age = 30, the predicted log-odds of traveling in first class = −3.187456 + 0.067767 × 30 = −1.154446. Taking the exponential of −1.154446 yields odds of 0.315232127.
      At age = 31, the predicted log-odds of traveling in first class = −3.187456 + 0.067767 × 31 = −1.086679. Taking the exponential of −1.086679 yields odds of 0.337334925.
      Dividing the odds at age = 31 by the odds at age = 30, i.e., 0.337334925 / 0.315232127, gives 1.07.
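The arithmetic above can be reproduced in a few lines of R. The coefficients are copied from the slide; the helper `log_odds()` and the other object names are introduced here only for illustration.

```r
# Coefficients reported on the slide
b0 <- -3.187456   # intercept
b1 <- 0.067767    # coefficient of age

# Predicted log-odds of traveling in first class at a given age
log_odds <- function(age) b0 + b1 * age

odds_30 <- exp(log_odds(30))
odds_31 <- exp(log_odds(31))
odds_ratio <- odds_31 / odds_30

print(c(odds_30 = odds_30, odds_31 = odds_31, odds_ratio = odds_ratio))

# The ratio of odds for a one-year difference is exp(b1) at any starting age
print(exp(b1))
print(exp(log_odds(51)) / exp(log_odds(50)))
```

The last two lines show that the odds ratio does not depend on which pair of consecutive ages is compared, which is why a single number summarizes the effect of age.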
  • 192. Intermediate statistical methods: Logistic Regression (Binary)
    Note
    When the odds ratio for a given predictor is less than 1, an increase in the value of the predictor decreases the odds of success on the response.
    If the odds ratio for a given predictor is exactly 1, the odds of success on the response do not change when the value of the predictor changes.
    Odds can be converted to a probability by using the following formula: probability = odds / (1 + odds).
    For example, the predicted probability of traveling in first class when a passenger is 30 years old = 0.315232127 / (1 + 0.315232127) = 0.23967794.
    We can also predict the probabilities for a sequence of age values.
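The conversion from odds to probability can be written as a small helper and cross-checked against base R's plogis(), which maps log-odds directly to probabilities. The function name `odds_to_prob()` is hypothetical; the coefficients are those reported on the earlier slide.

```r
# Convert odds to a probability: probability = odds / (1 + odds)
odds_to_prob <- function(odds) odds / (1 + odds)

b0 <- -3.187456   # intercept from the fitted model on the slides
b1 <- 0.067767    # coefficient of age

odds_30 <- exp(b0 + b1 * 30)
prob_30 <- odds_to_prob(odds_30)
print(prob_30)

# Same result in one step: plogis() is the inverse of the logit function
prob_30_direct <- plogis(b0 + b1 * 30)
print(prob_30_direct)
```

Odds of 1 correspond to a probability of 0.5, so odds below 1 (as here) always translate into a probability below one half.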
  • 193. Intermediate statistical methods: Logistic Regression (Binary)
      Sq_age <- seq(0, 80, 1)
      Prob_age <- predict(Lfit, list(age = Sq_age), type = "response")
      Prob_age   # gives the predicted probabilities
    Let us plot age versus the probability of traveling in first class.
  • 194. Intermediate statistical methods: Logistic Regression (Binary)
      plot(Sq_age, Prob_age, xlab = "age", ylab = "Probability to travel in First Class", type = "l")
  • 195. Intermediate statistical methods: Logistic Regression (Binary)
    How good is the model?
    A model is considered good when the proportions of correctly predicted successes (1) and failures (0) are both high.
    We can use the ROC (receiver operating characteristic) curve to assess how well the model predicts 1s and 0s across all classification thresholds.
    The y-axis of the ROC curve is the probability of correctly predicting a 1 (sensitivity).
    The x-axis of the ROC curve is based on the probability of correctly predicting a 0 (specificity). The pROC package plots specificity on a reversed axis running from 1 to 0, which is equivalent to the more common choice of plotting 1 − specificity (the false positive rate) from 0 to 1.
    The model predicts 1s and 0s well if the curve lies far above the diagonal, toward the top-left corner.
    The area under the curve (AUC) is 100% if the model predicts everything correctly; an AUC of 50% corresponds to random guessing.
  • 196. Intermediate statistical methods: Logistic Regression (Binary)
      install.packages("pROC")
      library(pROC)
      Prob_trav <- predict(Lfit, type = "response")
      Titanic$Prob_trav <- unlist(Prob_trav)
      ROC <- roc(Titanic$class, Titanic$Prob_trav)
      auc(ROC)
      plot(ROC, print.auc = TRUE, col = "blue")
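To see what sensitivity, specificity and the AUC measure, they can be computed by hand on a tiny made-up example. The vectors `y` and `score` below are hypothetical, and the AUC is obtained from the rank-sum (Mann–Whitney) formula rather than from pROC.

```r
# Hypothetical observed classes and predicted probabilities
y     <- c(0, 0, 1, 1)
score <- c(0.10, 0.40, 0.35, 0.80)

# Classify as 1 when the predicted probability is at least 0.5
pred <- as.integer(score >= 0.5)

# Sensitivity: proportion of true 1s predicted as 1
sensitivity <- sum(pred == 1 & y == 1) / sum(y == 1)
# Specificity: proportion of true 0s predicted as 0
specificity <- sum(pred == 0 & y == 0) / sum(y == 0)
print(c(sensitivity = sensitivity, specificity = specificity))

# AUC via ranks: probability that a randomly chosen 1 has a higher
# score than a randomly chosen 0
n1 <- sum(y == 1)
n0 <- sum(y == 0)
auc_manual <- (sum(rank(score)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
print(auc_manual)
```

Sensitivity and specificity depend on the chosen threshold (0.5 here), whereas the AUC summarizes performance over all thresholds at once.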
  • 197. Intermediate statistical methods: Logistic Regression (Multinomial)
    Multinomial Logistic Regression (MLR) is conducted when the outcome variable is nominal with more than two levels.
    Example. In an election, voters may choose any one of Party A, Party B or Party C. Their choice might be affected by the party's economic policy, foreign policy, educational levels of candidates, etc.
    In MLR, the log-odds of each outcome category relative to a baseline category are modeled as a linear combination of the predictor variables.
    MLR can be viewed as estimating a separate binary logistic regression model for each non-baseline category (each dummy variable of the outcome).
    If the outcome variable has M levels, then we fit M − 1 binary logistic regression models.
    Each model has its own intercept and regression coefficients, so the predictors can affect each category differently.
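The link between the M − 1 sets of coefficients and the predicted probabilities can be illustrated with a short calculation. The coefficient values, the single predictor `x`, and the party labels used as row names are all hypothetical; Party A plays the role of the baseline category.

```r
# Hypothetical coefficients for M = 3 outcomes, baseline = Party A:
# one row (intercept and slope) per non-baseline category
coefs <- rbind(
  PartyB = c(intercept =  0.5, x = -0.8),
  PartyC = c(intercept = -1.0, x =  0.3)
)

# Value of the predictor for one voter
x_value <- 1.2

# Log-odds of choosing Party B or Party C rather than Party A
log_odds <- coefs[, "intercept"] + coefs[, "x"] * x_value

# Convert the M - 1 log-odds into M probabilities that sum to 1
denom <- 1 + sum(exp(log_odds))
probs <- c(PartyA = 1 / denom, exp(log_odds) / denom)
print(round(probs, 4))
print(sum(probs))
```

The baseline category has no coefficients of its own; its probability is whatever remains after the other categories are accounted for.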
  • 198. Intermediate statistical methods: Logistic Regression (Multinomial)
    Consider the data set Highschool.csv.
    ▶ Outcome variable: Program_Type = {academic, general, vocational}
    ▶ Predictors: Writing_Score, Math_Score, Sch_Type, Sex, Ses.
    Import the data into R:
      Mlog <- read.csv("C:/Users/User/OneDrive/Desktop/QuantDataAna/Data/Highschool.csv", header = TRUE)
    Suppose we choose academic as the baseline category for the outcome variable.
      library(foreign)
      Mlog$Program_Type <- relevel(factor(Mlog$Program_Type), ref = "academic")
  • 199. Intermediate statistical methods: Logistic Regression (Multinomial)
      library(nnet)
      Mlogit <- multinom(Program_Type ~ Writing_Score + Math_Score + Sch_Type + Sex + Ses, data = Mlog)
      summary(Mlogit)
  • 200. Intermediate statistical methods: Logistic Regression (Multinomial)
    Let us now calculate the z-scores and p-values.
      z <- summary(Mlogit)$coefficients / summary(Mlogit)$standard.errors
      z
    Note that the coefficients in the row "general" are used to compare Program_Type = "general" with the baseline Program_Type = "academic".
  • 201. Intermediate statistical methods: Logistic Regression (Multinomial)
      p <- (1 - pnorm(abs(z), 0, 1)) * 2
      p
    Interpretations are given only for those coefficients with p-values < 0.05.
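The z-score and p-value calculation works on whole matrices at once. Since Highschool.csv is a local file, the sketch below uses small hypothetical coefficient and standard-error matrices laid out like summary(Mlogit)$coefficients and summary(Mlogit)$standard.errors; none of the numbers come from the slides.

```r
# Hypothetical estimates and standard errors (rows = non-baseline categories)
row_names <- c("general", "vocational")
col_names <- c("(Intercept)", "Writing_Score")

coefs <- matrix(c(0.98, -0.06,
                  2.50, -0.11),
                nrow = 2, byrow = TRUE,
                dimnames = list(row_names, col_names))
ses   <- matrix(c(1.00, 0.020,
                  1.00, 0.025),
                nrow = 2, byrow = TRUE,
                dimnames = list(row_names, col_names))

# Wald z-scores: estimate divided by its standard error
z <- coefs / ses
print(round(z, 2))

# Two-sided p-values from the standard normal distribution
p <- 2 * pnorm(abs(z), lower.tail = FALSE)
print(round(p, 4))

# Coefficients that would be interpreted at the 5% level
significant <- p < 0.05
print(significant)
```

Using lower.tail = FALSE gives the same values as (1 - pnorm(abs(z))) * 2 but is numerically more accurate when |z| is large.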