2. In this session
What does SPSS look like?
Types of data (revision)
Data Entry in SPSS
Simple charts in SPSS
Summary statistics
Contingency tables and crosstabulations
Scatterplots and correlations
Tests of differences of means
4. Aspects of SPSS
Menus, especially Analyze and Graphs
Spreadsheet view of data
Rows are cases (people, respondents etc.)
Columns are Variables
Variable view of data
Shows detail of each variable type
6. In SPSS
We change ticks etc. on a questionnaire into
numbers
One number for each variable for each case
How we do this depends on the type of
variable/data
7. Types of data
Nominal
Ranked
Scales/measures
Mixed types
Text answers (open ended questions)
8. Nominal (categorical)
order is arbitrary
e.g. sex, country of birth, personality type, yes or no.
Use numeric in SPSS and give value labels.
(e.g. 1=Female, 2=Male, 99=Missing)
(e.g. 1=Yes, 2=No, 99=Missing)
(e.g. 1=UK, 2=Ireland, 3=Pakistan, 4=India, 5=other,
99=Missing)
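The idea of storing numbers and attaching value labels can be sketched outside SPSS. This is an illustration in Python, not SPSS syntax. The labels are copied from the country-of-birth example above; the names `COUNTRY_LABELS`, `MISSING` and `label_for`, and the five stored responses, are made up for the example.

```python
# Value labels for a nominal variable (from the country-of-birth example).
COUNTRY_LABELS = {1: "UK", 2: "Ireland", 3: "Pakistan", 4: "India", 5: "other"}
MISSING = 99  # code reserved for missing answers


def label_for(code):
    """Return the label for a stored code, or None if the answer is missing."""
    if code == MISSING:
        return None
    return COUNTRY_LABELS[code]


responses = [1, 3, 99, 1, 5]  # hypothetical stored values for five cases
labels = [label_for(code) for code in responses]
valid = [code for code in responses if code != MISSING]

print(labels)
print(len(valid), "valid cases out of", len(responses))
```

Keeping the missing code separate from the real categories means missing answers can be left out of counts and percentages.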
9. Ranks or Ordinal
in order, 1st, 2nd, 3rd etc.
e.g. status, social class
Use numeric in SPSS with value labels
E.g. 1=Working class, 2=Middle class, 3=Upper
class
E.g. Class of degree, 1=First, 2=Upper second,
3=Lower second, 4=Third, 5=Ordinary,
99=Missing
10. Measures, scales
1. Interval - equal units
e.g. IQ
2. Ratio - equal units, zero on scale
e.g. height, income, family size, age
Makes sense to say one value is twice another
Use numeric (or comma, dot or scientific) in
SPSS
E.g. family size, 1, 2, 3, 4 etc.
E.g. income per year, 25000, 14500, 18650 etc.
11. Mixed type
Categorised data
Actually ranked, but used to identify categories or groups
e.g. age groups = ratio data put into groups
Use numeric in SPSS and use value labels.
E.g. Age group, 1=‘Under 18’, 2=‘18-24’, 3=‘25-34’, 4=‘35-44’, 5=‘45-54’, 6=‘55 or greater’
12. Text answers
E.g. answers to open-ended questions
Either enter text as given (Use String in SPSS)
Or
Code or classify answers into one of a small number of types. (Use numeric/nominal in SPSS)
13. Quantifying Data
Before we can do any kind of analysis, we
need to quantify our data
“Quantification” is the process of converting
data to a numeric format
Convert social science data into a “machine-readable” form, a form that can be read &
manipulated by computer programs
14. Quantifying Data
Some transformations are simple:
Assign numeric representations to nominal
or ordinal variables:
Turning male into “1” and female into “2”
Assigning “3” to Very Interested, “2” to
Somewhat Interested, “1” to Not Interested
Assign numeric values to continuous
variables:
Turning “born in 1973” into an age of “35”
15. Developing Code Categories
Some data are more challenging. Open-ended
responses must be coded.
Two basic approaches:
Begin with a coding scheme derived from the
research purpose.
Generate codes from the data.
16. Coding Quantitative Data
Goal – reduce a wide variety of information to
a more limited set of variable attributes:
“What is your occupation?”
Use pre-established scheme: Professional,
Managerial, Clerical, Semi-skilled, etc.
Create a scheme after reviewing the data
Assign value to each category in the scheme:
Professional = 1, Managerial = 2, etc.
Classify the response: “Secretary” is “clerical” and is
coded as “3”
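A minimal Python sketch of the classify-then-code step, assuming a pre-established scheme. Only Professional = 1, Managerial = 2 and Clerical = 3 come from the slide. The code for Semi-skilled, the lookup table `CATEGORY_OF` and the function `code_response` are hypothetical.

```python
# Coding scheme: the first three codes are from the slide,
# the code for "Semi-skilled" is an assumption for illustration.
SCHEME = {"Professional": 1, "Managerial": 2, "Clerical": 3, "Semi-skilled": 4}

# Hypothetical lookup built by the coder after reviewing the answers.
CATEGORY_OF = {
    "secretary": "Clerical",
    "solicitor": "Professional",
    "shop manager": "Managerial",
}


def code_response(answer):
    """Classify a free-text answer, then return its numeric code."""
    category = CATEGORY_OF[answer.strip().lower()]
    return SCHEME[category]


print(code_response("Secretary"))
```

In practice the lookup grows as new answers appear, which is one reason to record every decision in the codebook.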
17. Coding Quantitative Data
Points to remember:
If the data are coded to maintain a good amount
of detail, they can always be combined (reduced)
later
However, if you start off with too little detail, you
can’t get it back
If you’re using a survey / questionnaire, it’s a
good idea to do your coding on the form so that it
can be entered properly (i.e. create a “codebook”)
18. Codebook Construction
Purposes:
Primary guide used in the coding process.
Should note the value assigned to each variable
attribute (response)
Guide for locating variables and interpreting
codes in the data file during analysis.
If you’re doing your own input, this will also
guide data set construction
19. Data Entry in SPSS
Video by Andy Field
https://www.youtube.com/watch?v=b163iBByycw&index=1&list=PL25257A24840423AE
21. Data Entry into SPSS
There are 2 ways to enter data into SPSS:
1. Enter directly into SPSS by typing in Data View
2. Enter into other software such as Excel, then import into SPSS
Let’s start with the second option, using data in Excel.
24. Importing data from an Excel spreadsheet into SPSS.
In SPSS, go to:
File, Open, Data
Select Type of file (for example, Excel) you want to open
Select File name you want to open
25. Exporting data from SPSS to Excel.
In SPSS, go to:
File, Save As
Select Type of file (for example, Excel) you want to save as
Give the File name you want to save to
26. Frequency counts
Used with categorical and ranked variables
e.g. gender of students taking Health and Illness option

Sex of student
                Frequency   Percent   Valid Percent   Cumulative Percent
Valid  Female          25      73.5            73.5                 73.5
       Male             9      26.5            26.5                100.0
       Total           34     100.0           100.0
27. e.g. Number of GCSEs passed by students taking Health and Illness option

Number of GCSEs
                Frequency   Percent   Valid Percent   Cumulative Percent
Valid  0                1       2.9             2.9                  2.9
       1                1       2.9             2.9                  5.9
       2                4      11.8            11.8                 17.6
       3                6      17.6            17.6                 35.3
       4                4      11.8            11.8                 47.1
       5                2       5.9             5.9                 52.9
       6                6      17.6            17.6                 70.6
       7                3       8.8             8.8                 79.4
       8                2       5.9             5.9                 85.3
       9                3       8.8             8.8                 94.1
       13               1       2.9             2.9                 97.1
       14               1       2.9             2.9                100.0
       Total           34     100.0           100.0
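The Percent and Cumulative Percent columns can be reproduced by hand. The sketch below rebuilds the 34 GCSE values from the frequencies in the table and recomputes the columns in Python. It illustrates the arithmetic and is not how SPSS works internally; the variable names are made up.

```python
from collections import Counter

# Frequencies copied from the Number of GCSEs table (34 students).
counts_from_table = {0: 1, 1: 1, 2: 4, 3: 6, 4: 4, 5: 2, 6: 6,
                     7: 3, 8: 2, 9: 3, 13: 1, 14: 1}

# Expand the table back into one value per student.
gcses = [value for value, n in counts_from_table.items() for _ in range(n)]

freq = Counter(gcses)
total = len(gcses)

rows = []
running = 0
for value in sorted(freq):
    running += freq[value]                  # cases at or below this value
    percent = 100 * freq[value] / total
    cumulative = 100 * running / total
    rows.append((value, freq[value], round(percent, 1), round(cumulative, 1)))

for row in rows:
    print(row)
```

Each row is (value, frequency, percent, cumulative percent). With no missing cases, Percent and Valid Percent are the same.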
28. Central Tendency
Mean
= average value
sum of all the values divided by the number of values
Mode
= the most frequent value in a distribution
(N.B. it is possible to have 2 or more modes, e.g. bimodal
distribution)
Median
= the half-way value, or the value that divides the ordered
distribution in the middle
The middle score when scores are ordered
N.B. need to put values into order first
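The three measures can be tried out with Python's standard `statistics` module. This is a small sketch on made-up scores; the list `scores` is illustrative and deliberately unsorted.

```python
import statistics

scores = [3, 1, 4, 3, 2]  # illustrative scores, not in order

mean = statistics.mean(scores)        # sum of the values / number of values
median = statistics.median(scores)    # sorts the values first, then takes the middle
modes = statistics.multimode(scores)  # a list, because there can be more than one mode

print(mean, median, modes)

# A bimodal example: two values are equally frequent.
print(statistics.multimode([1, 1, 2, 5, 5]))
```

`statistics.median` does the ordering for you, but when working by hand the values must be sorted first.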
29. Dispersion and variability
Quartiles
The three values that split the sorted data into four equal parts.
Second quartile = median
Lower quartile = median of lower half of the data
Upper quartile = median of upper half of the data
Need to order the individuals first
One quarter of the individuals fall in each of the four parts
The inter-quartile range (upper quartile minus lower quartile) covers the middle half of the individuals
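A sketch of the "median of each half" method in Python. When the number of cases is odd, this version leaves the middle value out of both halves, which is one common convention. Statistical packages, SPSS included, may use slightly different rules and give slightly different quartiles. The function `quartiles` and the ages are hypothetical.

```python
import statistics


def quartiles(values):
    """Return (lower quartile, median, upper quartile) by the median-of-halves method."""
    ordered = sorted(values)  # the data must be ordered first
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # With an odd number of cases, skip the middle value itself.
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return (statistics.median(lower_half),
            statistics.median(ordered),
            statistics.median(upper_half))


ages = [22, 19, 35, 20, 21, 28, 19, 24]  # hypothetical ages
q1, q2, q3 = quartiles(ages)
iqr = q3 - q1  # the range covering the middle half of the cases

print(q1, q2, q3, iqr)
```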
30. Used on Box Plot
[Figure: box plot of Age of Health and Illness students, with the upper quartile, median and lower quartile marked]

Statistics: Age
N   Valid     34
    Missing    0
Mean          24.03
Median        21.00
31. Variance
Average of the squared deviations from the mean

Score   Mean   Deviation   Squared Deviation
1       2.6    -1.6        2.56
2       2.6    -0.6        0.36
3       2.6     0.4        0.16
3       2.6     0.4        0.16
4       2.6     1.4        1.96
Total                      5.20

5.20 is the Sum of Squares
This depends on the number of individuals, so we divide by n (5)
Gives 1.04, which is the variance
32. Standard Deviation
The variance has one problem: it is
measured in units squared.
This isn’t a very meaningful metric so we take
the square root value.
This is the Standard Deviation
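The variance and standard deviation steps can be checked in a few lines of Python, using the five scores from the variance example (1, 2, 3, 3, 4). The variable names are made up. The last line shows the n-1 version of the variance, which is what `statistics.variance` computes and, as far as I know, what SPSS reports in its descriptive statistics. It gives 1.30 rather than the 1.04 obtained by dividing by n.

```python
import math
import statistics

scores = [1, 2, 3, 3, 4]

mean = sum(scores) / len(scores)
deviations = [x - mean for x in scores]
sum_of_squares = sum(d ** 2 for d in deviations)

variance = sum_of_squares / len(scores)  # divide by n, as in the example
sd = math.sqrt(variance)                 # back in the original units

sample_variance = statistics.variance(scores)  # divides by n - 1 instead

print(round(sum_of_squares, 2), round(variance, 2), round(sd, 2))
print(round(sample_variance, 2))
```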
34. Charts in SPSS
Use ‘Chart Builder’ from the ‘Graphs’ menu or the Legacy Dialogs menu
And/or double click chart to edit it.
E.g. double click to edit bars (e.g. to change
from colour to fill pattern).
Do this in SPSS first before cut and paste to
Word
Label the chart (in SPSS or in Word)
35. Stem and leaf plots
e.g. age of students taking Health and Illness
option
good at showing
distribution of data
outliers
range
36. Stem and leaf plots e.g.
Age Stem-and-Leaf Plot
Frequency Stem & Leaf
6.00 1 . 999999
17.00 2 . 00000000001111134
5.00 2 . 55678
3.00 3 . 123
1.00 3 . 5
2.00 Extremes (>=36)
Stem width: 10
Each leaf: 1 case(s)
44. Analysing relationships
Contingency tables or crosstabulations
Compares nominal/categorical variables
But can include ordinal variables
N.B. table contains counts (= frequency data)
One variable on horizontal axis
One variable on vertical axis
Row and column total counts known as marginals
45. Example
In the Health and
Illness class, are
women more
likely to be under
21 than men?
48. Chi-square (χ²)
Cross tabulations and Chi-square are tests that
can be used to look for a relationship between
two variables:
When the variables are categorical so the
data are nominal (or frequency).
For example, if we wanted to look at the
relationship between gender and age.
There are several different types of Chi-square
(χ²); we will be using the 2 x 2 Chi-square
52. Chi-Square analysis on SPSS
http://www.youtube.com/watch?v=Ahs8jS5mJKk 4m15s
http://www.youtube.com/watch?v=IRCzOD27NQU From 6m:30s to 9m:50s
http://www.youtube.com/watch?v=532QXt1PM-Q&feature=plcp&context=C3ba91a4UDOEgsToPDskJ-ABupdp-Yfvuf4j4fJGzV 12m30s
53. Low values in cells
Get SPSS to output expected values
Look where these are <5
Consider recoding to combine cols or rows
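A sketch of what the expected values are and how the 2 x 2 Chi-square is built from them, in plain Python. The cell counts are HYPOTHETICAL: they were chosen only so that the row and column totals match the class totals reported in these slides (25 women, 9 men, 16 under 21, 18 aged 21 and over). No continuity correction is applied, and the p-value shortcut with `math.erfc` is valid only for 1 degree of freedom, i.e. a 2 x 2 table.

```python
import math

# HYPOTHETICAL cell counts: rows = women, men; columns = under 21, 21 and over.
observed = [[12, 13],
            [4, 5]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count = row total x column total / grand total.
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

# Chi-square = sum over cells of (observed - expected)^2 / expected.
chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))

# Cells with an expected count below 5.
low_cells = sum(1 for exp_row in expected for e in exp_row if e < 5)

# p-value for 1 degree of freedom only (2 x 2 table).
p_value = math.erfc(math.sqrt(chi2 / 2))

print([[round(e, 2) for e in row] for row in expected])
print(round(chi2, 3), round(p_value, 3), low_cells)
```

Two of the four expected counts here are below 5, which is the situation where combining rows or columns is worth considering.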
54. Tabulating questionnaire responses
Categorical survey data are often “collapsed” for purposes of data analysis

Original category   Frequency   Collapsed category   Frequency
White British       284         White                304
White Irish           7
Other White          13
Indian               40         South Asian          105
Pakistani            32
Bangladeshi          33
Chinese              16         Chinese               16
Black British        30         Black                 44
Afro-Caribbean       12
African               2
An analysis on a sample of 2 (e.g. Black African) would not have been very meaningful!
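Collapsing is a recode followed by a re-count. The Python sketch below uses the frequencies from the table; the dictionary names `original`, `recode` and `collapsed` are made up for the illustration.

```python
# Frequencies for the original categories, copied from the table.
original = {
    "White British": 284, "White Irish": 7, "Other White": 13,
    "Indian": 40, "Pakistani": 32, "Bangladeshi": 33,
    "Chinese": 16,
    "Black British": 30, "Afro-Caribbean": 12, "African": 2,
}

# Which collapsed category each original category goes into.
recode = {
    "White British": "White", "White Irish": "White", "Other White": "White",
    "Indian": "South Asian", "Pakistani": "South Asian", "Bangladeshi": "South Asian",
    "Chinese": "Chinese",
    "Black British": "Black", "Afro-Caribbean": "Black", "African": "Black",
}

collapsed = {}
for category, count in original.items():
    group = recode[category]
    collapsed[group] = collapsed.get(group, 0) + count

print(collapsed)
```

The total number of respondents is unchanged by the recode; only the detail is reduced.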
56. Scatterplots and correlations
Looks for association between variables, e.g.
Population size and GDP
crime and unemployment rates
height and weight
Both variables must be rank, interval or
ratio (scale or ordinal in SPSS).
Thus cannot use variables like, gender,
ethnicity, town of birth, occupation.
58. Interpretation
As Y increases, X increases
Called a (positive) correlation
Regression line model shown in red on the scatterplot
59. Correlation measures
association not causation
The older the child the better s/he is at reading
The less your income the greater the risk of
schizophrenia
Height correlates with weight
But weight does not cause height
Height is one of the causes of weight (also body
shape, diet, fitness level etc.)
Numbers of ice creams sold is correlated with
the rate of drowning
Ice creams do not cause drowning (nor vice versa)
Third variable involved – people swim more and buy
more ice creams when it’s warm
60. Scatterplot in SPSS
Use the Graphs menu
http://www.youtube.com/watch?v=74BjgPQvIEg 8m34s
http://www.youtube.com/watch?v=blfflA-34pQ&feature=related 4m04s
http://www.youtube.com/watch?v=UVylQoG4hZM 1m50s, ignore polynomial regression
62. If mixed data sets
Change point icon and/or colour to see
different subsets.
Overall data may have no relationship but
subsets might.
E.g. show male and female respondents.
Use Chart builder
63. Correlation
Correlation coefficient = measure of strength of relationship, e.g. Pearson’s r
ranges from -1 to +1: the size (0 to 1) shows strength, the plus or minus sign shows direction

Correlations
                                        Number of GCSEs   Age
Number of GCSEs   Pearson Correlation   1                 -.415*
                  Sig. (2-tailed)                         .015
                  N                     34                34
Age               Pearson Correlation   -.415*            1
                  Sig. (2-tailed)       .015
                  N                     34                34
*. Correlation is significant at the 0.05 level (2-tailed).
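For readers who want to see where a value such as -.415 comes from, here is a sketch of the Pearson formula in plain Python. The raw class data are not given in the slides, so the heights and weights below are hypothetical, and `pearson_r` is a name made up for this example.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r: covariation of x and y divided by their separate variation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


heights = [150, 160, 165, 170, 180]  # hypothetical, in cm
weights = [52, 58, 63, 70, 77]       # hypothetical, in kg

r = pearson_r(heights, weights)
print(round(r, 3))       # strength and direction of the relationship
print(round(r ** 2, 3))  # proportion of variation accounted for
```

Reversing the direction of one variable flips the sign of r but leaves its size unchanged.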
68. Interpretation cont.
r² is a measure of the degree of variation in
one variable accounted for by variation
in the other.
E.g. If r=0.7 then r²=0.49, i.e. just under half
the variation is accounted for (rest
accounted for by other factors).
If r=0.3 then r²=0.09, so 91% of the
variation is explained by other things.
69. Significance of r
SPSS reports if r is significant at α=0.05
N.B. this is dependent on sample size to a
large extent.
Other things being equal, larger samples
more likely to be significant.
Usually, size of r is more important than
its significance
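The dependence on sample size can be seen from the usual test statistic for r, t = r × √((n − 2) / (1 − r²)) with n − 2 degrees of freedom. The sketch below applies it to the r of -.415 from the GCSEs and age example, once with the real n of 34 and once with a hypothetical n of 10. The critical values are the two-tailed 0.05 values from standard t tables, and the names `t_for_r` and `T_CRIT` are made up.

```python
import math


def t_for_r(r, n):
    """t statistic for testing whether a correlation r differs from zero."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Two-tailed critical t values at alpha = 0.05 (from standard t tables),
# keyed by degrees of freedom (n - 2).
T_CRIT = {8: 2.306, 32: 2.037}

r = -0.415
results = {}
for n in (10, 34):  # n = 10 is hypothetical, n = 34 is the class size
    t = t_for_r(r, n)
    results[n] = (round(t, 3), abs(t) > T_CRIT[n - 2])

print(results)
```

The same r is significant with 34 cases but not with 10, which is why the size of r deserves attention in its own right.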
70. Pearson’s r in SPSS
http://www.youtube.com/watch?v=loFLqZmvfzU 6m57s
71. Parametric and non-parametric
Some statistics rely on the variables being
investigated following a normal distribution:
called parametric statistics
Others can be used if variables are not
distributed normally: called non-parametric
statistics.
Pearson’s r is a parametric statistic
Kendall’s tau and Spearman’s rho (rank
correlation) are non-parametric.
73. Use statistical test
SPSS provides two formal tests for normality:
Kolmogorov-Smirnov (K-S) and Shapiro-Wilk (S-W)
But, there is debate about KS
Extremely sensitive to departure from normality
May erroneously imply parametric test not
suitable – especially in small sample
So, always use a histogram as well.
74. Often can use parametric tests
Parametric tests (e.g. Pearson’s r) are robust
to departures from normality
Small, non-normal samples OK
But use non-parametric if
Data are skewed (questionnaire data often are)
Data are bimodal
76. So far…
Looked at relationships between nominal
variables
Gender vs age group
Looked at relationships between scale
variables
Height vs. Weight
Now combine the two
Groups vs a scale variable
E.g. Gender vs income
77. Reminder – IV vs DV
IV = independent variable
What makes a difference, causes effects, is responsible
for differences.
DV = dependent variable
What is affected by things, what is changed by the IV.
Gender vs income. Gender = IV, income = DV
So we investigate the effect of gender on income
78. Example 1
Age group vs. no. of GCSEs
Using the Health and Illness class data
Age group defines 2 groups
Under 21
21 and over
Just two groups
Can use independent samples t-test
Independent because the two groups consist
of different people.
t-test compares the means of the 2 groups.
79. Difference of means
Do under 21s have more or fewer GCSEs
than 21 and overs?
Means are different (6.44 & 4.28) but is that
significant?

Group Statistics
                  Age group     N    Mean   Std. Deviation   Std. Error Mean
Number of GCSEs   Under 21      16   6.44   3.140            .785
                  21 and over   18   4.28   2.906            .685
80. Independent Samples Test: Number of GCSEs

Levene's Test for Equality of Variances
                              F      Sig.
Equal variances assumed       .164   .689

t-test for Equality of Means
                                                                Mean         Std. Error   95% Confidence Interval of the Difference
                              t       df       Sig. (2-tailed)  Difference   Difference   Lower    Upper
Equal variances assumed       2.082   32       .045             2.160        1.037        .047     4.272
Equal variances not assumed   2.073   30.789   .047             2.160        1.042        .034     4.285

Levene's test (Sig. = .689): no significant difference, therefore assume equal variances
t-test (Sig. 2-tailed = .045): means are statistically significantly different
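The t values in the output can be rebuilt from the Group Statistics alone. The Python sketch below uses the rounded means and standard deviations from the slides, so the results agree with SPSS only to about two decimal places. It computes both rows: the pooled-variance test (equal variances assumed) and the Welch version (equal variances not assumed). All variable names are made up.

```python
import math

# Summary figures from the Group Statistics table (rounded as printed).
n1, mean1, sd1 = 16, 6.44, 3.140  # under 21
n2, mean2, sd2 = 18, 4.28, 2.906  # 21 and over

diff = mean1 - mean2

# Equal variances assumed: pool the two variances.
pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
se_pooled = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_pooled = diff / se_pooled
df_pooled = n1 + n2 - 2

# Equal variances not assumed (Welch): keep the variances separate.
v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
se_welch = math.sqrt(v1 + v2)
t_welch = diff / se_welch
df_welch = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

print(round(t_pooled, 2), df_pooled, round(se_pooled, 3))
print(round(t_welch, 2), round(df_welch, 2), round(se_welch, 3))
```

The p-values themselves need the t distribution, which is not in Python's standard library, so they are left to SPSS here.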
81. Parametric vs non-parametric
Just as in the case of correlations, there are
both kinds of tests.
Need to check if DV is normally distributed.
Do this visually
Also use statistical tests
82. Tests for normality
Kolmogorov-Smirnov and Shapiro-Wilk
If n>50 use KS
If n≤50 use SW
Null hypothesis is ‘data are normally distributed’.
So if p<0.05 then data are significantly different
from a normal distribution: use non-parametric tests
If p≥0.05 then no significant difference – use
parametric tests
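The rule of thumb on this slide can be written as a small decision function. This is only a restatement of the rule in Python: the function name `normality_decision` and the p-values in the usage lines are hypothetical, and the result should still be checked against a histogram.

```python
def normality_decision(n, p_ks, p_sw, alpha=0.05):
    """Pick the normality test by sample size, then the family of tests by its p-value."""
    if n > 50:
        test, p = "Kolmogorov-Smirnov", p_ks
    else:
        test, p = "Shapiro-Wilk", p_sw
    # Null hypothesis: the data are normally distributed.
    family = "non-parametric" if p < alpha else "parametric"
    return test, family


# Hypothetical p-values as they might be read off SPSS output.
print(normality_decision(n=34, p_ks=0.20, p_sw=0.01))
print(normality_decision(n=120, p_ks=0.20, p_sw=0.01))
```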
84. t-test
Identify your two groups.
Determine what values in the data indicate
those two groups (e.g. 1=female, 2=male)
Select Analyze: Compare Means: Independent samples t-test
http://www.youtube.com/watch?v=_KHI3ScO8sc 9m40s
85. Mann-Whitney U test
Use this when comparing two groups and the
DV is not normally distributed
http://www.youtube.com/watch?v=7iTvv3m9d_g 3m45s
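The U statistic is based on order rather than on means. A minimal Python sketch, assuming two small made-up groups: for every pair of one case from each group, count which group has the higher value (ties count half). The function `mann_whitney_u` and the data are hypothetical, and the sketch stops at U; the significance level is left to SPSS.

```python
def mann_whitney_u(group_a, group_b):
    """Return the smaller of the two U values for two independent groups."""
    u_a = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u_a += 1      # a case from group A beats a case from group B
            elif a == b:
                u_a += 0.5    # ties count half
    u_b = len(group_a) * len(group_b) - u_a
    return min(u_a, u_b)


under_21 = [9, 7, 6, 8, 3]     # hypothetical numbers of GCSEs
over_21 = [4, 2, 6, 1, 5]      # hypothetical numbers of GCSEs

u = mann_whitney_u(under_21, over_21)
print(u)
```

A U close to 0 means the two groups hardly overlap; a U close to half the number of pairs means they overlap almost completely.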
86. Comparing 3 or more groups
ANOVA = Analysis of Variance
Analyze: Compare Means: One-way ANOVA
http://www.youtube.com/watch?v=wFq1b3QjI1U 4m04s
Useful to get table of means (descriptives) and
means plots from ANOVA options.
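What one-way ANOVA computes can be sketched in plain Python: the variation between the group means is compared with the variation inside the groups, giving the F ratio. The three groups below are hypothetical, and `one_way_anova_f` is a name made up for this illustration. The p-value for F is left to SPSS.

```python
def one_way_anova_f(groups):
    """Return (F, df between groups, df within groups) for a one-way ANOVA."""
    all_values = [x for group in groups for x in group]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(group) / len(group) for group in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(group) * (m - grand_mean) ** 2
                     for group, m in zip(groups, means))
    # Variation of the cases around their own group mean.
    ss_within = sum((x - m) ** 2
                    for group, m in zip(groups, means)
                    for x in group)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]  # hypothetical scores for 3 groups
f, df_between, df_within = one_way_anova_f(groups)
print(round(f, 2), df_between, df_within)
```

A large F says the group means differ by more than the spread within groups would suggest, but not which groups differ, which is why the table of means and the means plot are useful.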