Intro to SPSS.ppt

Introduction to SPSS
Data types and SPSS
data entry and analysis

In this session
 What does SPSS look like?
 Types of data (revision)
 Data Entry in SPSS
 Simple charts in SPSS
 Summary statistics
 Contingency tables and crosstabulations
 Scatterplots and correlations
 Tests of differences of means

Aspects of SPSS
 Menus, especially Analyze and Graphs
 Spreadsheet view of data
 Rows are cases (people, respondents etc.)
 Columns are Variables
 Variable view of data
 Shows detail of each variable type

In SPSS
 We change ticks etc. on a questionnaire into
numbers
 One number for each variable for each case
 How we do this depends on the type of
variable/data

Types of data
 Nominal
 Ranked
 Scales/measures
 Mixed types
 Text answers (open ended questions)

Nominal (categorical)
 order is arbitrary
 e.g. sex, country of birth, personality type, yes or no.
 Use numeric in SPSS and give value labels.
(e.g. 1=Female, 2=Male, 99=Missing)
(e.g. 1=Yes, 2=No, 99=Missing)
(e.g. 1=UK, 2=Ireland, 3=Pakistan, 4=India, 5=other,
99=Missing)
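The value-label idea can also be sketched outside SPSS. The snippet below is a minimal Python illustration (not SPSS syntax) that holds the country-of-birth codes from this slide in a dictionary and treats 99 as missing. The names `COUNTRY_LABELS`, `MISSING` and `decode`, and the sample responses, are invented for the example.

```python
# Value labels for a nominal variable, using the codes from the slide.
COUNTRY_LABELS = {1: "UK", 2: "Ireland", 3: "Pakistan", 4: "India", 5: "other"}
MISSING = 99  # code reserved for missing answers


def decode(code):
    """Return the label for a numeric code, or None if the answer is missing."""
    if code == MISSING:
        return None
    return COUNTRY_LABELS[code]


responses = [1, 3, 99, 2, 1]  # hypothetical data: one number per case
labels = [decode(code) for code in responses]
print(labels)  # ['UK', 'Pakistan', None, 'Ireland', 'UK']
```

Because the order of nominal codes is arbitrary, the numbers only serve as names; averaging them would be meaningless.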

Ranks or Ordinal
 in order, 1st, 2nd, 3rd etc.
 e.g. status, social class
 Use numeric in SPSS with value labels
 E.g. 1=Working class, 2=Middle class, 3=Upper
class
 E.g. Class of degree, 1=First, 2=Upper second,
3=Lower second, 4=Third, 5=Ordinary,
99=Missing

Measures, scales
1. Interval - equal units
 e.g. IQ
2. Ratio - equal units and a true zero on the scale
 e.g. height, income, family size, age
 Makes sense to say one value is twice another
 Use numeric (or comma, dot or scientific) in
SPSS
 E.g. family size, 1, 2, 3, 4 etc.
 E.g. income per year, 25000, 14500, 18650 etc.

Mixed type
 Categorised data
 Actually ranked, but used to identify
categories or groups
 e.g. age groups
 = ratio data put into groups
 Use numeric in SPSS and use value
labels.
 E.g. Age group, 1=‘Under 18’, 2=‘18-24’, 3=‘25-34’, 4=‘35-44’, 5=‘45-54’, 6=‘55 or greater’
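Putting ratio data into groups is a simple recoding step. As a rough sketch in Python (not SPSS syntax), the function below turns an age in whole years into the group codes listed on this slide. The function name `age_group` and the sample ages are assumptions made for illustration.

```python
# Age-group codes and labels as given on the slide.
AGE_GROUP_LABELS = {
    1: "Under 18", 2: "18-24", 3: "25-34",
    4: "35-44", 5: "45-54", 6: "55 or greater",
}


def age_group(age):
    """Map an age in whole years to its group code (1 to 6)."""
    upper_bounds = [17, 24, 34, 44, 54]  # highest age in groups 1 to 5
    for code, upper in enumerate(upper_bounds, start=1):
        if age <= upper:
            return code
    return 6  # 55 or greater


ages = [17, 18, 24, 25, 40, 54, 55, 70]  # hypothetical ages
codes = [age_group(a) for a in ages]
print(codes)  # [1, 2, 2, 3, 4, 5, 6, 6]
```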

Text answers
 E.g. answers to open-ended questions
 Either enter text as given (Use String in SPSS)
 Or
 Code or classify answers into one of a small number of types. (Use numeric/nominal in SPSS)

Quantifying Data
 Before we can do any kind of analysis, we
need to quantify our data
 “Quantification” is the process of converting
data to a numeric format
 Convert social science data into a “machine-readable” form, a form that can be read & manipulated by computer programs

Quantifying Data
Some transformations are simple:
 Assign numeric representations to nominal
or ordinal variables:
 Turning male into “1” and female into “2”
 Assigning “3” to Very Interested, “2” to
Somewhat Interested, “1” to Not Interested
 Assign numeric values to continuous
variables:
 Turning “born in 1973” into an age, “35”

Developing Code Categories
Some data are more challenging. Open-ended
responses must be coded.
 Two basic approaches:
 Begin with a coding scheme derived from the
research purpose.
 Generate codes from the data.

Coding Quantitative Data
 Goal – reduce a wide variety of information to
a more limited set of variable attributes:
 “What is your occupation?”
 Use pre-established scheme: Professional,
Managerial, Clerical, Semi-skilled, etc.
 Create a scheme after reviewing the data
 Assign value to each category in the scheme:
Professional = 1, Managerial = 2, etc.
 Classify the response: “Secretary” is “clerical” and is
coded as “3”
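The classify-then-code step can be sketched as a pair of lookups. This is only an illustration in Python: the codes 1 to 3 come from the slide, while the code 4 for Semi-skilled, the `JOB_TO_CATEGORY` lookup and the function name are assumptions.

```python
# Codes from the slide: Professional = 1, Managerial = 2, Clerical = 3.
# Semi-skilled = 4 is an assumption that continues the slide's "etc.".
CATEGORY_CODES = {"Professional": 1, "Managerial": 2, "Clerical": 3, "Semi-skilled": 4}

# Hypothetical lookup a researcher might build after reviewing the answers.
JOB_TO_CATEGORY = {
    "secretary": "Clerical",
    "doctor": "Professional",
    "office manager": "Managerial",
}


def code_occupation(answer):
    """Return the numeric code for an open-ended answer, or None if unclassified."""
    category = JOB_TO_CATEGORY.get(answer.strip().lower())
    return CATEGORY_CODES.get(category)


print(code_occupation("Secretary"))  # 3
```

Answers that return None still need a human decision: either extend the scheme or assign an "other" code.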

Coding Quantitative Data
 Points to remember:
 If the data are coded to maintain a good amount
of detail, they can always be combined (reduced)
later
 However, if you start off with too little detail, you
can’t get it back
 If you’re using a survey / questionnaire, it’s a
good idea to do your coding on the form so that it
can be entered properly (i.e. create a “codebook”)

Codebook Construction
Purposes:
 Primary guide used in the coding process.
 Should note the value assigned to each variable
attribute (response)
 Guide for locating variables and interpreting
codes in the data file during analysis.
 If you’re doing your own input, this will also
guide data set construction

Data Entry in SPSS
 Video by Andy Field
 https://www.youtube.com/watch?v=b163iBByycw&index=1&list=PL25257A24840423AE

Data Entry into SPSS
There are 2 ways to enter data into SPSS:
1. Directly enter into SPSS by typing in Data View
2. Enter into other database software such as
Excel then import into SPSS
Let’s start with the second option, using data in Excel.

Importing data from Excel spreadsheet into SPSS.
In SPSS, go to:
File, Open, Data
Select Type of file (for example, Excel) you want to open
Select File name you want to open

Exporting data from SPSS to Excel.
In SPSS, go to:
File, Save As
Select Type of file (for example, Excel) you want to save into
Give File name you want to save into

Frequency counts
 Used with categorical and ranked variables
 e.g. gender of students taking Health and
Illness option
Sex of student
             Frequency  Percent  Valid Percent  Cumulative Percent
Valid  Female       25     73.5           73.5                73.5
       Male          9     26.5           26.5               100.0
       Total        34    100.0          100.0

e.g. Number of GCSEs passed by students taking Health and Illness option
Number of GCSEs
             Frequency  Percent  Valid Percent  Cumulative Percent
Valid  0             1      2.9            2.9                 2.9
       1             1      2.9            2.9                 5.9
       2             4     11.8           11.8                17.6
       3             6     17.6           17.6                35.3
       4             4     11.8           11.8                47.1
       5             2      5.9            5.9                52.9
       6             6     17.6           17.6                70.6
       7             3      8.8            8.8                79.4
       8             2      5.9            5.9                85.3
       9             3      8.8            8.8                94.1
       13            1      2.9            2.9                97.1
       14            1      2.9            2.9               100.0
       Total        34    100.0          100.0
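To see where the Percent and Cumulative Percent columns come from, the counts in the Number of GCSEs table can be run through a few lines of Python. This is a minimal sketch rather than what SPSS does internally; the variable names are chosen for the example.

```python
# Counts of students by number of GCSEs passed, taken from the table above.
gcse_counts = {0: 1, 1: 1, 2: 4, 3: 6, 4: 4, 5: 2, 6: 6, 7: 3, 8: 2, 9: 3, 13: 1, 14: 1}

total = sum(gcse_counts.values())  # number of valid cases
running = 0                        # running count for the cumulative column
rows = []
for value in sorted(gcse_counts):
    count = gcse_counts[value]
    running += count
    percent = round(100 * count / total, 1)
    cumulative = round(100 * running / total, 1)
    rows.append((value, count, percent, cumulative))

for value, count, percent, cumulative in rows:
    print(f"{value:>5} {count:>9} {percent:>7.1f} {cumulative:>10.1f}")
print(f"Total {total:>8}")
```

Percent and Valid Percent are identical here because there are no missing cases.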

Central Tendency
 Mean
 = average value
 sum of all the values divided by the number of values
 Mode
 = the most frequent value in a distribution
 (N.B. it is possible to have 2 or more modes, e.g. bimodal
distribution)
 Median
 = the half-way value, or the value that divides the ordered
distribution in the middle
 The middle score when scores are ordered
 N.B. need to put values into order first
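As an illustration, the three measures can be computed in Python for the Number of GCSEs data from the frequency table earlier in this deck. The sketch uses the standard `statistics` module; `multimode` needs Python 3.8 or later.

```python
import statistics

# Counts of students by number of GCSEs passed (from the frequency table).
gcse_counts = {0: 1, 1: 1, 2: 4, 3: 6, 4: 4, 5: 2, 6: 6, 7: 3, 8: 2, 9: 3, 13: 1, 14: 1}

# Expand the counts into one value per student, already in order.
values = [v for v, n in sorted(gcse_counts.items()) for _ in range(n)]

mean = statistics.mean(values)        # sum of the values / number of values
median = statistics.median(values)    # middle of the ordered values
modes = statistics.multimode(values)  # every value tied for most frequent

print(round(mean, 2), median, modes)  # 5.29 5.0 [3, 6]
```

The data turn out to be bimodal: 3 and 6 GCSEs each occur six times.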

Dispersion and variability
 Quartiles
 The three values that split the sorted data into
four equal parts.
 Second Quartile = median.
 Lower quartile = median of lower half of the data
 Upper quartile = median of upper half of the data
 Need to order the individuals first
 One quarter of the individuals are in each of the four parts; the inter-quartile range (lower quartile to upper quartile) contains the middle half
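The "median of each half" rule described on this slide can be sketched in Python as below. The function name and the sample scores are made up for the example, and SPSS may use a different interpolation rule, so its quartiles can differ slightly from these.

```python
import statistics


def quartiles(values):
    """Lower quartile, median, upper quartile via the 'median of each half' rule.

    Needs at least two values. With an odd number of values the middle
    value is left out of both halves.
    """
    ordered = sorted(values)      # the data must be put in order first
    half = len(ordered) // 2
    lower_half = ordered[:half]   # values below the middle
    upper_half = ordered[-half:]  # values above the middle
    return (statistics.median(lower_half),
            statistics.median(ordered),
            statistics.median(upper_half))


scores = [7, 1, 3, 9, 5, 11, 13, 15]  # hypothetical data
print(quartiles(scores))  # (4.0, 8.0, 12.0)
```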

Used on Box Plot
Statistics: Age
N        Valid      34
         Missing     0
Mean             24.03
Median           21.00
[Box plot: Age of Health and Illness students, with the upper quartile, median and lower quartile labelled]

Variance
 Average of the squared deviations from the mean
 5.20 is the Sum of Squares
 This depends on the number of individuals, so we divide by n (5)
 Gives 1.04, which is the variance
Score  Mean  Deviation  Squared Deviation
    1   2.6       -1.6               2.56
    2   2.6       -0.6               0.36
    3   2.6        0.4               0.16
    3   2.6        0.4               0.16
    4   2.6        1.4               1.96
Total                                5.20

Standard Deviation
 The variance has one problem: it is
measured in units squared.
 This isn’t a very meaningful metric so we take
the square root value.
 This is the Standard Deviation
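The arithmetic from the variance slide, followed by the square root, can be checked with a short Python sketch. It divides by n as the slide does; note that SPSS reports the sample variance, which divides by n - 1, so the last line shows that version as well.

```python
import math

scores = [1, 2, 3, 3, 4]  # the five scores from the variance slide
n = len(scores)
mean = sum(scores) / n                                 # 2.6
sum_of_squares = sum((x - mean) ** 2 for x in scores)  # about 5.20
variance = sum_of_squares / n                          # about 1.04 (divide by n)
sd = math.sqrt(variance)                               # about 1.02, in the original units
sample_variance = sum_of_squares / (n - 1)             # about 1.30 (divide by n - 1)

print(f"SS={sum_of_squares:.2f} variance={variance:.2f} SD={sd:.2f}")
```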

Using SPSS
 ‘Analyze > Descriptive Statistics > Explore’ menu.
 Gives mean, median, SD, variance, min,
max, range, skew and kurtosis.
 Can also produce stem and leaf, and
histogram.

Charts in SPSS
 Use ‘Chart Builder’ from the ‘Graphs’ menu, or the ‘Legacy Dialogs’ submenu
 And/or double click chart to edit it.
 E.g. double click to edit bars (e.g. to change
from colour to fill pattern).
 Do this in SPSS first before cut and paste to
Word
 Label the chart (in SPSS or in Word)

Stem and leaf plots
 e.g. age of students taking Health and Illness
option
 good at showing
 distribution of data
 outliers
 range

Stem and leaf plots e.g.
Age Stem-and-Leaf Plot
 Frequency    Stem &  Leaf
     6.00        1 .  999999
    17.00        2 .  00000000001111134
     5.00        2 .  55678
     3.00        3 .  123
     1.00        3 .  5
     2.00 Extremes    (>=36)
 Stem width:  10
 Each leaf:   1 case(s)

Box Plot
Statistics: Age
N        Valid      34
         Missing     0
Mean             24.03
Median           21.00

Box Plot
Fill colour changed.
N.B. numbers refer to case numbers.

Histograms and bar charts
 Length/height of bar indicates frequency

Histogram
Fill pattern suitable for black and white printing

Changing the bin size
Bin size made smaller to show more bars

Pie chart
 angle of segment indicates proportion of the
whole

Pie Chart
Shadow and one slice moved out for emphasis

Analysing relationships
 Contingency tables or crosstabulations
 Compares nominal/categorical variables
 But can include ordinal variables
 N.B. table contains counts (= frequency data)
 One variable on horizontal axis
 One variable on vertical axis
 Row and column total counts known as marginals

Example
 In the Health and Illness class, are women more likely to be under 21 than men?

Crosstabulations
 e.g.
 Use column and row percentages to look for
relationships

Chi-square (χ²)
Crosstabulations and the chi-square test can be used to look for a relationship between two variables:
 When the variables are categorical, so the data are nominal (or frequency).
 For example, if we wanted to look at the relationship between gender and age group.
 There are several different types of chi-square (χ²) test; we will be using the 2 x 2 chi-square
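To make the 2 x 2 calculation concrete, here is a minimal Python sketch of the Pearson chi-square statistic without continuity correction. The cell counts are hypothetical: only the row and column totals (25 women, 9 men, 16 under 21, 18 aged 21 and over) match figures given elsewhere in this deck.

```python
import math

# Hypothetical 2 x 2 counts: rows = sex, columns = age group.
observed = [[13, 12],  # Female: under 21, 21 and over
            [3, 6]]    # Male:   under 21, 21 and over

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell = row total * column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi_square = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
                 for i in range(2) for j in range(2))

# A 2 x 2 table has 1 degree of freedom, where the p-value has a closed form.
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"chi-square = {chi_square:.3f}, p = {p_value:.3f}")
print("smallest expected count:", round(min(min(row) for row in expected), 2))
```

With these made-up counts two expected values fall below 5, which is the situation where the chi-square result should be treated with caution.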

2x2 Chi-square results in SPSS

Another example
 The Bank employees data

Bank Employees
Chi-Square tests

Chi-Square analysis on SPSS
 http://www.youtube.com/watch?v=Ahs8jS5mJKk 4m15s
 http://www.youtube.com/watch?v=IRCzOD27NQU
 From 6m:30s to 9m:50s
 http://www.youtube.com/watch?v=532QXt1PM-Q&feature=plcp&context=C3ba91a4UDOEgsToPDskJ-ABupdp-Yfvuf4j4fJGzV 12m30s

Low values in cells
 Get SPSS to output expected values
 Look for cells where the expected count is below 5
 Consider recoding to combine columns or rows

Tabulating questionnaire responses
 Categorical survey data often “collapsed” for purposes of data analysis
Original category  Frequency  Collapsed category  Frequency
White British            284  White                     304
White Irish                7
Other White               13
Indian                    40  South Asian               105
Pakistani                 32
Bangladeshi               33
Chinese                   16  Chinese                    16
Black British             30  Black                      44
Afro-Caribbean            12
African                    2
An analysis on a sample of 2 (e.g. Black African) would not have been very meaningful!
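Collapsing categories amounts to mapping each original category onto a broader one and adding up the counts. The Python sketch below reproduces the collapsed frequencies from the table; the dictionary names are chosen for the example.

```python
# Original categories and frequencies from the table above.
original = {
    "White British": 284, "White Irish": 7, "Other White": 13,
    "Indian": 40, "Pakistani": 32, "Bangladeshi": 33,
    "Chinese": 16,
    "Black British": 30, "Afro-Caribbean": 12, "African": 2,
}

# Recoding scheme: each original category maps to one collapsed category.
collapse = {
    "White British": "White", "White Irish": "White", "Other White": "White",
    "Indian": "South Asian", "Pakistani": "South Asian", "Bangladeshi": "South Asian",
    "Chinese": "Chinese",
    "Black British": "Black", "Afro-Caribbean": "Black", "African": "Black",
}

collapsed = {}
for category, count in original.items():
    group = collapse[category]
    collapsed[group] = collapsed.get(group, 0) + count

print(collapsed)
# {'White': 304, 'South Asian': 105, 'Chinese': 16, 'Black': 44}
```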

Recoding variables
 http://www.youtube.com/watch?v=uzQ_522F2SM&feature=related
 Ignore t-test for now 6m11s
 http://www.youtube.com/watch?v=FUoYZ_f6Lxc
 Uses old version of SPSS, no submenu now. 6m

Scatterplots and correlations
 Looks for association between variables, e.g.
 Population size and GDP
 crime and unemployment rates
 height and weight
 Both variables must be ranked, interval or ratio (scale or ordinal in SPSS).
 Thus cannot use variables like gender, ethnicity, town of birth, occupation.

Scatterplots
 e.g. age (in years) versus Number of GCSEs

Interpretation
 As Y increases X increases
 Called correlation
 Regression line model in red

Correlation measures association, not causation
 The older the child the better s/he is at reading
 The less your income the greater the risk of
schizophrenia
 Height correlates with weight
 But weight does not cause height
 Height is one of the causes of weight (also body
shape, diet, fitness level etc.)
 Numbers of ice creams sold is correlated with
the rate of drowning
 Ice creams do not cause drowning (nor vice versa)
 Third variable involved – people swim more and buy
more ice creams when it’s warm

Scatterplot in SPSS
 Use Graphs menu
 http://www.youtube.com/watch?v=74BjgPQvIEg 8m34s
 http://www.youtube.com/watch?v=blfflA-34pQ&feature=related 4m04s
 http://www.youtube.com/watch?v=UVylQoG4hZM 1m50s, ignore polynomial regression

Modifying the Scatterplot
 http://www.youtube.com/watch?v=803YCYA2AoQ&feature=related 4m04s
 http://www.youtube.com/watch?v=vPzvuMuVXk8&feature=related 3m40s

If mixed data sets
 Change point icon and/or colour to see
different subsets.
 Overall data may have no relationship but
subsets might.
 E.g. show male and female respondents.
 Use Chart builder

Correlation
 Correlation coefficient = measure of strength of relationship, e.g. Pearson’s r
 varies from -1 to +1: the size gives the strength, the sign gives the direction
Correlations
                                       Number of GCSEs     Age
Number of GCSEs  Pearson Correlation                 1  -.415*
                 Sig. (2-tailed)                          .015
                 N                                  34      34
Age              Pearson Correlation            -.415*       1
                 Sig. (2-tailed)                  .015
                 N                                  34      34
*. Correlation is significant at the 0.05 level (2-tailed).

Positive correlation
 as x increases, y increases
r = 0.7

Negative correlation
 as x increases, y decreases
r = -0.7

Strong correlation (i.e. close to 1)
r = 0.9

Weak correlation (i.e. close to 0)
r = 0.2

Interpretation cont.
 r² is a measure of the degree of variation in one variable accounted for by variation in the other.
 E.g. If r=0.7 then r²=0.49, i.e. just under half the variation is accounted for (rest accounted for by other factors).
 If r=0.3 then r²=0.09, so 91% of the variation is explained by other things.
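For readers who want to see the calculation behind r and r squared, here is a small Python sketch of Pearson's r written out from its definition. The height and weight figures are hypothetical, and `pearson_r` is a name chosen for the example.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r: co-variation of x and y scaled by their separate variation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


heights = [150, 160, 165, 172, 180, 185]  # hypothetical, in cm
weights = [52, 58, 66, 65, 80, 79]        # hypothetical, in kg
r = pearson_r(heights, weights)
print(f"r = {r:.2f}, r squared = {r * r:.2f}")

# The two worked examples from the slide: r = 0.7 and r = 0.3.
for r_example in (0.7, 0.3):
    print(r_example, round(r_example ** 2, 2))  # 0.49, then 0.09
```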

Significance of r
 SPSS reports if r is significant at α=0.05
 N.B. this is dependent on sample size to a large extent.
 Other things being equal, larger samples are more likely to give a significant r.
 Usually, size of r is more important than its significance

Pearson’s r in SPSS
 http://www.youtube.com/watch?v=loFLqZmvfzU 6m57s

Parametric and non-parametric
 Some statistics rely on the variables being investigated following a normal distribution. Called parametric statistics.
 Others can be used if variables are not distributed normally. Called non-parametric statistics.
 Pearson’s r is a parametric statistic
 Kendall’s tau and Spearman’s rho (rank correlation) are non-parametric.

Assessing normality
 Produce histogram and normal plot

Use statistical test
 SPSS provides two formal tests for normality: Kolmogorov-Smirnov (K-S) and Shapiro-Wilk (S-W)
 But, there is debate about K-S
 Extremely sensitive to departure from normality
 May erroneously imply parametric test not suitable, especially in large samples
 So, always use a histogram as well.

Often can use parametric tests
 Parametric tests (e.g. Pearson’s r) are robust to departures from normality
 Small, non-normal samples OK
 But use non-parametric if
 Data are skewed (questionnaire data often are)
 Data are bimodal

Spearman’s rho
 http://www.youtube.com/watch?v=r_WQe2c-ISU From 4.14 to 4.56
 http://www.youtube.com/watch?v=POkFi5vKvI8&feature=fvwrel 6m16s

So far…
 Looked at relationships between nominal
variables
 Gender vs age group
 Looked at relationships between scale
variables
 Height vs. Weight
 Now combine the two
 Groups vs a scale variable
 E.g. Gender vs income

Reminder – IV vs DV
 IV = independent variable
 What makes a difference, causes effects, is responsible
for differences.
 DV = dependent variable
 What is affected by things, what is changed by the IV.
 Gender vs income. Gender = IV, income = DV
 So we investigate the effect of gender on income

Example 1
Age group vs. no. of GCSEs
 Using the Health and Illness class data
 Age group defines 2 groups
 Under 21
 21 and over
 Just two groups
 Can use independent samples t-test
 Independent because the two groups consist
of different people.
 t-test compares the means of the 2 groups.

Difference of means
 Do under 21s have more or fewer GCSEs than 21 and overs?
 Means are different (6.44 & 4.28) but is that significant?
Group Statistics
                 Age group     N   Mean  Std. Deviation  Std. Error Mean
Number of GCSEs  Under 21     16   6.44           3.140             .785
                 21 and over  18   4.28           2.906             .685

Independent Samples Test (Number of GCSEs)
Levene's Test for Equality of Variances
                                 F   Sig.
Equal variances assumed       .164   .689
t-test for Equality of Means
                                 t      df  Sig. (2-tailed)  Mean Difference  Std. Error Difference  95% CI Lower  95% CI Upper
Equal variances assumed      2.082  32                 .045            2.160                  1.037          .047         4.272
Equal variances not assumed  2.073  30.789             .047            2.160                  1.042          .034         4.285
Levene's test: no significant difference (Sig. = .689), therefore assume equal variances
t-test: means are statistically significantly different (Sig. = .045)
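The t values in the SPSS output can be approximately reproduced from the Group Statistics alone. The Python sketch below uses the rounded means and standard deviations from the slide, so results agree with SPSS only to about two decimal places; it does not compute the Sig. column, which needs the t distribution.

```python
import math

# Group statistics from the slide (Number of GCSEs by age group).
n1, mean1, sd1 = 16, 6.44, 3.140  # Under 21
n2, mean2, sd2 = 18, 4.28, 2.906  # 21 and over

mean_diff = mean1 - mean2

# Equal variances assumed: pool the two variances.
df = n1 + n2 - 2
pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
se_pooled = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_pooled = mean_diff / se_pooled

# Equal variances not assumed (Welch): separate variances, adjusted df.
v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
se_welch = math.sqrt(v1 + v2)
t_welch = mean_diff / se_welch
df_welch = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

print(f"pooled: t = {t_pooled:.2f}, df = {df}, SE = {se_pooled:.3f}")
print(f"Welch:  t = {t_welch:.2f}, df = {df_welch:.1f}, SE = {se_welch:.3f}")
```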

Parametric vs non-parametric
 Just as in the case of correlations, there are
both kinds of tests.
 Need to check if DV is normally distributed.
 Do this visually
 Also use statistical tests

Tests for normality
 Kolmogorov-Smirnov and Shapiro-Wilk
 If n>50 use KS
 If n≤50 use SW
 Null hypothesis is ‘data are normally distributed’.
 So if p<0.05 then data are significantly different from a normal distribution – use non-parametric tests
 If p≥0.05 then no significant difference – use
parametric tests
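The decision rules on this slide can be written down as two tiny functions. This is only a sketch of the rule of thumb in Python; the function names are invented, and the p-value in the usage lines is hypothetical rather than taken from any output in this deck.

```python
def choose_normality_test(n):
    """Rule of thumb from the slide: K-S if n > 50, Shapiro-Wilk if n <= 50."""
    return "Kolmogorov-Smirnov" if n > 50 else "Shapiro-Wilk"


def choose_test_family(p_value, alpha=0.05):
    """p < alpha rejects 'data are normally distributed', so go non-parametric."""
    return "non-parametric" if p_value < alpha else "parametric"


print(choose_normality_test(34))  # Shapiro-Wilk
print(choose_test_family(0.20))   # parametric (hypothetical p-value)
```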

Checking normality
 Produce histogram of DV
 Tick box to undertake statistical test
 Interpret results.

t-test
 Identify your two groups.
 Determine what values in the data indicate those two groups (e.g. 1=female, 2=male)
 Select Analyze: Compare Means: Independent-Samples T Test
 http://www.youtube.com/watch?v=_KHI3ScO8sc 9m40s

Mann-Whitney U test
 Use this when comparing two groups and the DV is not normally distributed
 http://www.youtube.com/watch?v=7iTvv3m9d_g 3m45s

Comparing 3 or more groups
 ANOVA = Analysis of Variance
 Analyze: Compare Means: One-way ANOVA
 http://www.youtube.com/watch?v=wFq1b3QjI1U 4m04s
Useful to get table of means (descriptives) and means plots from ANOVA options.
