Using descriptive statistics when exploring a dataset is fine. But what if you want to go further and gain more insight into the relationships between variables? Join our fellow Camila Salazar from Costa Rica to learn all about linear regression and improve your investigative skills.
2. Outline
1. Target audience
2. A step beyond descriptive statistics
3. What is regression analysis?
4. Example: the effect of education on wages
5. Other types of regression analysis useful in data journalism
6. Using regression models in data journalism
6. So you are in the newsroom...
There’s a big debate in your country about the importance of education.
Your editor asks you to write a story about the importance of education.
7. First step: descriptive statistics
You find data about education in your country and start
calculating the descriptive statistics.
9. And...
You interview young people who are still in high school
and don’t want to go to college. With your story, you want to
show them how much they could improve their future
earnings if they go to college.
You can’t answer this question using descriptive
statistics :(
10. But...
You can calculate how much an extra year of schooling
increases wages using regression analysis!
12. What is regression analysis?
Regression analysis is a statistical tool for the
investigation of relationships between variables.
13. What is regression analysis?
It helps you explain how the value of a dependent
variable (Y) changes when an independent variable
(X) is varied, holding all other variables fixed.
14. What is regression analysis?
For example:
Dependent variable: health (Y)
Independent variables: vegetable consumption (X), exercise (X), sleep (X)
15. The linear regression
It’s a method for modeling the linear relationship between a
dependent variable Y and one or more explanatory variables.
[Equation diagram with labeled parts: dependent variable, independent variable, coefficient, error term]
We are interested in
estimating B (the
coefficient). It captures
the effect X has on Y,
holding all other factors
fixed.
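A minimal sketch of how B can be estimated, assuming the simplest case of one explanatory variable and the usual form Y = A + B*X + error. The intercept A, the function name and the toy numbers below are assumptions for illustration, not material from the talk. With several explanatory variables, a statistics package does the same job while holding the other variables fixed.

```python
# Hypothetical toy data: years of education (X) and monthly wage (Y).
education = [6, 9, 11, 12, 16, 18]
wage = [400, 520, 610, 650, 830, 900]


def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariation of X and Y, and variation of X, around their means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = simple_ols(education, wage)
print(f"Estimated B: {slope:.2f} per extra year of education")
```

The slope is the estimate of B: the average change in Y associated with a one-unit change in X in this toy sample.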
16. The linear regression
For example, you want to explain the effect of education on
wages.
[Diagram labels: Wage, Education, Experience; variation in wage that has to do with education; variation in wage that has to do with experience]
17. What is a linear regression?
• You have to formulate a hypothesis about the
relationships of interest.
• Have some theory behind your assumptions.
• There are some essential assumptions and
statistical properties of the regression that you
have to consider.
21. Example: coefficients
Wage
An additional year of education increases
wage by $161.68, holding all other factors
fixed.
An additional year of experience increases
wage by $16.54, holding all other factors
fixed.
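As a worked example of reading these coefficients, the sketch below applies them to a change in schooling. The two coefficients come from the slide; the function name and the four-year scenario are assumptions for illustration.

```python
# Coefficients reported on the slide (change in wage per extra year).
COEF_EDUCATION = 161.68
COEF_EXPERIENCE = 16.54


def predicted_wage_change(extra_education_years, extra_experience_years=0):
    """Predicted change in wage for the given changes in X, per the linear model."""
    return (COEF_EDUCATION * extra_education_years
            + COEF_EXPERIENCE * extra_experience_years)


# Four more years of schooling with the same experience (hypothetical scenario).
print(round(predicted_wage_change(4), 2))  # prints 646.72
```

Because the model is linear, the effect of four extra years is simply four times the effect of one year.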
23. Example: p-value
With statistics you can’t be 100% certain.
A relatively simple way to interpret p-values is
to think of them as representing how likely a result
like yours would be to occur by chance alone if there were no real effect.
24. Example: p-value
Null hypothesis: a hypothesis that the researcher tries to
disprove, reject or nullify.
“Education has NO explanatory power over wages”
“Men are NOT taller than women on average”
To test the null-hypothesis we use the p-value.
25. Example: p-value
The p-value is the probability of seeing a result at least as extreme
as yours if the null hypothesis were true.
If your p-value is small (typically < 0.05), you have strong evidence to
reject the null hypothesis.
“Men are significantly taller than women, p=0.01.” That means that if men were
NOT actually taller than women, a difference at least this large would show up
in only 1% of samples purely because of random chance.
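One way to see what a p-value measures is to simulate the null hypothesis directly. The sketch below uses a permutation test, which is a swapped-in technique for illustration and not the method used in the talk; the height samples and the function name are hypothetical.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """One-sided p-value for 'mean(group_a) > mean(group_b)'."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable,
        # so shuffle them and recompute the difference in means.
        rng.shuffle(pooled)
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a)
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations


# Hypothetical height samples in centimeters.
men = [178, 182, 175, 180, 185, 177, 181, 179]
women = [165, 170, 162, 168, 172, 166, 169, 171]
print(f"p-value: {permutation_p_value(men, women):.4f}")
```

The returned value is the share of shuffled datasets that show a difference at least as large as the observed one, which is what a small p-value says is rare when the null hypothesis holds.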
26. Example
P-Value
It tells you if the coefficient is statistically significant.
With a low p-value (less than 10%, 5% or 1%) you can reject the null hypothesis
that the coefficient is equal to zero (it has no explanatory power). In this case,
the coefficients are significant. That means that education and experience have
explanatory power on wage.
27. Example
R-squared: This indicates
how well the explanatory
variables explain the
variability of the
dependent variable.
In this case: 33.8% of the variability of wage is
explained by the years of education and years of
experience.
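R-squared can be computed from the observed values and the model's predictions. This is a minimal sketch using the standard definition (one minus the ratio of unexplained to total variation); the function name and the numbers are hypothetical.

```python
def r_squared(actual, predicted):
    """Share of the variation in `actual` that the predictions account for."""
    mean_y = sum(actual) / len(actual)
    # Total variation of Y around its mean.
    ss_total = sum((y - mean_y) ** 2 for y in actual)
    # Variation left unexplained by the model (squared residuals).
    ss_residual = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1 - ss_residual / ss_total


# Hypothetical observed wages and model predictions.
observed_wages = [400, 520, 610, 650, 830, 900]
predicted_wages = [430, 500, 590, 680, 800, 910]
print(f"R-squared: {r_squared(observed_wages, predicted_wages):.3f}")
```

A value of 1 means the predictions match the data exactly; a value of 0 means the model does no better than always predicting the average.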
29. The logistic regression
Imagine you want to estimate the probability that a
person with a college degree is employed.
The linear regression wouldn’t be very useful here, because the outcome is a yes/no category rather than a continuous number.
30. The logistic regression
It is a regression model where the dependent variable (Y) is
categorical. For example (binary):
1 = unemployed, 0 = employed
It is used to estimate the probability of a binary response based
on one or more independent variables.
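A minimal sketch of how a logistic model turns explanatory variables into a probability, assuming the standard logistic (sigmoid) function. The intercept, coefficients and input values are hypothetical and not estimates from the talk.

```python
import math


def logistic(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1 / (1 + math.exp(-z))


def predicted_probability(intercept, coefficients, values):
    """Probability that Y = 1 given the explanatory variable values."""
    z = intercept + sum(b * x for b, x in zip(coefficients, values))
    return logistic(z)


# Hypothetical model: probability of being unemployed (Y = 1)
# from years of education and age.
prob = predicted_probability(1.5, [-0.2, -0.01], [16, 30])
print(f"Estimated probability: {prob:.3f}")
```

Unlike a straight line, the logistic curve never produces a prediction below 0 or above 1, which is why it suits a binary response.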
31. The logistic regression
Explanatory variables:
-Age
-Education
-Family income
-Occupation
[Diagram: the explanatory variables feed a logistic regression, which classifies a person as Employed or Unemployed]
The model would tell you, for example, that a person with a college degree is three times
more likely to be employed than a person who only went to high school.
32. The logistic regression
• The coefficients cannot be interpreted as the rate
of change in the dependent variable.
• You check the sign of the coefficients.
• You can calculate marginal effects or odds ratios
(logit).
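For a logit model, the odds ratio of a variable is the exponential of its coefficient. The sketch below shows that conversion; the coefficient value is hypothetical, chosen only so that the odds ratio comes out as 3.

```python
import math


def odds_ratio(coefficient):
    """Convert a logit coefficient into an odds ratio."""
    return math.exp(coefficient)


# Hypothetical coefficient for "has a college degree".
coef_college = math.log(3)

print(f"Odds ratio: {odds_ratio(coef_college):.2f}")
# The sign of the coefficient gives the direction of the effect:
# positive means the odds go up, negative means they go down.
print("Raises the odds" if coef_college > 0 else "Lowers the odds")
```

An odds ratio above 1 corresponds to a positive coefficient, below 1 to a negative one, and exactly 1 to no effect.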
34. Some examples
"Does School Pay Off? How Much?" - El Financiero (Costa Rica),
winner of the Data Journalism Awards 2014.
http://www.elfinancierocr.com/gnfactory/especiales/2015/calculadorasalarial/
36. Some advice
• Statistical analysis can be complex. If you’re not
sure, seek advice from an expert!
• Be transparent with your methodology.
• Study a lot!
• Free courses: https://www.coursera.org/
37. References
-Wooldridge (2010). Introductory Econometrics.
-Long (1997). Regression Models for Categorical and Limited Dependent Variables.
-Costa Rica National Survey of Income and Spending (2004).