Tutorial for beginning graduate students. Basic exploration of multivariate experimental data can be done with freely downloadable software. We also discuss the use of Excel because it is so widely used.
2. Executive summary
• This is about visualizing sets of experimental lab data, usually smallish, for use in science
– This is NOT about beautiful, impressive, artistic, and emotionally enticing “infographics” aimed at average consumers
• Emphasis is on insights: creating or corroborating hypotheses, or assessing research questions
3. The purpose of visualization
• Sight is your most important sense
– One look at an image provides lots of information very rapidly: a picture is worth a thousand words
– You are inherently good at detecting patterns visually
• Seeing an “outlier” in a table is very difficult, but it is often easy to see in a graph
• The real purpose is “insights”: getting a higher-level summary that may be useful
4. Fun quotes to know
• The purpose of computing is insight, not numbers. (Richard Hamming)
• The purpose of the numbers computed is not yet in sight. (unknown computer simulation expert)
5. What kinds of insights are useful?
• Confirmatory (often “corroborating” would be a better word)
– You already have a hypothesis, or an expectation of a pattern; now you look for corroborating evidence in NEW data
• Creating new hypotheses
– Observing patterns that serve as “research questions”: will this pattern continue or repeat, and is it present in other datasets (from another time, another experiment, or another region)?
– YOU CAN’T CONFIRM A HYPOTHESIS FROM THE SAME DATA THAT GAVE IT
• This is a problem, for example, in climate science, since we only have one history. With the last 10 years of data we can at best confirm a hypothesis that was made from similar data at least ten years ago. And for “climate”, ten years is too short anyway…
• So if I create a hypothesis from available data and publish it, and then you go “confirm it” in the same data, this is fundamentally nonsense.
• Recall that predictive models need three data sets: training, validation, and test sets.
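The three-way split mentioned above can be sketched in a few lines of Python. This is only an illustration: the function name `three_way_split`, the 60/20/20 fractions, and the fixed seed are assumptions made for the example, not something prescribed here.

```python
import random

def three_way_split(rows, frac_train=0.6, frac_val=0.2, seed=0):
    """Shuffle rows and cut them into training, validation, and test sets.

    The fractions and the seed are illustrative assumptions.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * frac_train)
    n_val = round(len(shuffled) * frac_val)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever is left over
    return train, val, test

# Fit on train, choose between models on val, report once on test.
train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))
```

The key point is that the three sets never overlap, so a pattern found in one set is checked against data that did not suggest it.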
6. More about confirmatory evidence
• The only way to prove causation:
– You can turn factor A on/off. Every time you turn it on, B happens soon after. This convinces you that “A on” causes B. Note that this requires experiments with manipulation of A!
• Observational data (no manipulated variables) cannot prove causation
– It can disprove it, though. If B happened first, then it was not caused by A. This is the foundation of “Granger causality”.
– If you are observing a game between intelligent opponents, then B could anticipate moves by A and try to counter them ahead of time. So the future opportunities of A can cause anticipatory reactions by B. This is not how natural phenomena work, so Granger should be OK for us.
• In any case, a game should not be modeled by simple one-time-step rules, whereas natural phenomena mostly obey such difference or differential models. In other words, the simple concept of “causation” is not appropriate for games between intelligent players.
7. So keep in mind…
• Statistical tests that claim to prove A affects B usually prove no such thing at all. The causation comes from your understanding of the world, and statistics helps you convince others…
– There is a bit of black magic there
– For example, the p-value is this:
• Assuming given statistical distributions (often Gaussian) and that your null hypothesis is correct, the probability that the chosen statistic summarizing your data would be at least as extreme as the value you actually observed.
• It takes a lot of magic to now say: small p clearly means causation. Statistics only shows correlations… it can’t tell if A caused B or the other way around, or if C influenced both A and B
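To make the p-value definition concrete, here is a minimal sketch that estimates it by brute-force simulation instead of a formula. The measurements, the null hypothesis (Gaussian with mean 10 and standard deviation 1), and the function name `monte_carlo_p_value` are all made up for illustration.

```python
import random
import statistics

def monte_carlo_p_value(sample, null_mean, null_sd, n_sim=20000, seed=1):
    """Two-sided p-value for the sample mean under a Gaussian null.

    Simulates datasets of the same size from N(null_mean, null_sd) and
    counts how often the simulated mean lies at least as far from
    null_mean as the observed mean does.
    """
    rng = random.Random(seed)
    n = len(sample)
    observed = abs(statistics.fmean(sample) - null_mean)
    hits = 0
    for _ in range(n_sim):
        sim = [rng.gauss(null_mean, null_sd) for _ in range(n)]
        if abs(statistics.fmean(sim) - null_mean) >= observed:
            hits += 1
    return hits / n_sim

# Hypothetical measurements; the null hypothesis says mean 10, sd 1.
data = [10.8, 11.2, 9.9, 10.7, 11.5, 10.4, 11.0, 10.9]
p = monte_carlo_p_value(data, null_mean=10.0, null_sd=1.0)
print(f"estimated p-value: {p:.4f}")
```

Note what the number does and does not say: it is a statement about how surprising the data would be if the null hypothesis and the assumed distribution were true. Nothing in the calculation refers to causes.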
8. So let’s get to visualizations…
• In Excel, Home > Conditional Formatting lets you see the numbers in a table at a glance.
• You can also play with Insert > Sparklines, which lets you make tiny graphs within cells
• Easy to spot the smallest and largest values and get some impression of the distribution
Random numbers
0.731033584
0.806055053
0.988103245
0.809884417
0.756027069
0.462190297
0.910670142
0.906566945
0.501780587
0.181802984
0.659130022
0.32821301
0.111329819
0.617390297
0.252291447
0.990308253
0.274208995
0.614407383
0.298483381
0.526614001
0.01251721
9. Scatter plots in Excel
• Illustration of Simpson’s paradox (from Wikipedia)
– Ignoring a factor can give a completely wrong trend
• Seppo’s paradox
– One single failed experiment can give a high R²
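Both effects on this slide are easy to reproduce with a handful of numbers. The sketch below uses small made-up data sets (not the ones plotted in the slide) and a hand-written Pearson correlation, so every value in it is an assumption chosen only to show the mechanism.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Simpson's paradox: two made-up groups, each with a falling trend.
group_a = ([1, 2, 3], [3, 2, 1])
group_b = ([6, 7, 8], [8, 7, 6])
r_a = pearson_r(*group_a)
r_b = pearson_r(*group_b)
# Pooling the groups (ignoring the factor) flips the sign of the trend.
r_pooled = pearson_r(group_a[0] + group_b[0], group_a[1] + group_b[1])
print(f"within groups: r = {r_a:.2f}, {r_b:.2f}; pooled: r = {r_pooled:.2f}")

# One failed experiment: a trendless cluster plus a single far-off point.
cluster_x = [1, 2, 3, 4, 5]
cluster_y = [3, 1, 2, 3, 1]
r2_cluster = pearson_r(cluster_x, cluster_y) ** 2
r2_with_outlier = pearson_r(cluster_x + [30], cluster_y + [30]) ** 2
print(f"R2 without outlier: {r2_cluster:.2f}, with outlier: {r2_with_outlier:.2f}")
```

In both cases a scatter plot shows the problem immediately, while the summary number alone hides it.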
10. Trouble with Excel
• Even making a plot showing Simpson’s paradox is difficult: Excel does not let you format the markers by a factor…
• However, most people can manipulate data in Excel, do some basic transformations, and delete an outlier that would spoil the analysis (i.e., a failed experiment)
– Statisticians can make up theories and criteria for what is an outlier. For an experimentalist, if you trust the experiment, then the point can be an interesting special case… What counts is whether the data is real or corrupted. So one person’s outlier can be another’s important special case.
• Remember to keep your raw data safe. Do the analysis, including deleting outliers, in a separate file, preferably in a separate folder altogether!
11. Pivot tables and charts
• These are excellent for inspecting the effects of multiple factors, especially when each factor has only two or three levels
• Note: often you want to “paste special”, choosing “values” and maybe also “transpose”
– Copying formulas instead of values can cause trouble
– Next page has a data table, explore it in Excel…
13. Reproduce this pivot chart…
• Note that you can sort the “axis fields”, and this affects the grouping
– You can select a primary comparison
14. How about fitting a model?
• There are very basic model options ready-made as trendlines in Excel
• What you typically really have to do is this:
– Put your inputs and the targeted model output y in columns
– Guess starting values for the model parameters, and calculate the model output ỹ with these
– For every data point, form the squared error (y − ỹ)²
– Sum the column of squared errors, then minimize the sum using Data > Solver, which adjusts the model parameters
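The Solver recipe above can be mirrored outside Excel to see what is going on. The sketch below is only an illustration under stated assumptions: the exponential-decay model, the synthetic data, and the crude pattern search standing in for Solver are all made up for the example and are not what Excel does internally.

```python
import math

def model(x, a, b):
    """Hypothetical example model: exponential decay y = a * exp(-b * x)."""
    return a * math.exp(-b * x)

def sum_squared_error(params, xs, ys):
    """The cell Solver would minimize: sum over rows of (y - y_model)^2."""
    a, b = params
    return sum((y - model(x, a, b)) ** 2 for x, y in zip(xs, ys))

def minimize_by_pattern_search(objective, start, step=0.5, tol=1e-9):
    """Very simple derivative-free minimizer (coordinate pattern search).

    Tries a step up and down in each parameter, keeps any improvement,
    and halves the step when nothing improves.
    """
    params = list(start)
    best = objective(params)
    while step > tol:
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                trial = list(params)
                trial[i] += delta
                value = objective(trial)
                if value < best:
                    params, best, improved = trial, value, True
        if not improved:
            step /= 2
    return params, best

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [model(x, 2.0, 0.7) for x in xs]  # synthetic data, known a=2.0, b=0.7

# Guessed starting values, as in the spreadsheet recipe.
fitted, sse = minimize_by_pattern_search(
    lambda p: sum_squared_error(p, xs, ys), start=[1.0, 1.0])
print("fitted a, b:", fitted, "SSE:", sse)
```

As with Solver, the result can depend on the starting guess, so it is worth trying a few different ones and plotting the fitted curve over the data.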
15. Note about the basic solver
• There is a better option freely available for download: search for DirectOptimizer (you need to install it as an add-in)
– It comes with a small manual that helps you get started
• The point
– If you need to fit the Arrhenius law, or any other model from physics or physical chemistry, then you pretty much have to do “nonlinear least squares” fitting
• Even if there is a “linearizing transformation”, the error sum also gets transformed, and the results can be poor because of this
– Much of the time you can do this in Excel…
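A small numerical check shows why a linearizing transformation changes the fit. The sketch assumes a log transform (as one would use for an exponential law) and made-up values; it compares the same absolute miss at a large and at a small y.

```python
import math

def squared_errors(y_true, y_model):
    """Return (raw squared error, squared error after a log transform)."""
    raw = (y_true - y_model) ** 2
    logged = (math.log(y_true) - math.log(y_model)) ** 2
    return raw, logged

# The same absolute miss of 0.1 at a large and at a small value of y.
results = {}
for y_true in (10.0, 0.2):
    raw, logged = squared_errors(y_true, y_true + 0.1)
    results[y_true] = (raw, logged)
    print(f"y = {y_true:5.1f}: raw = {raw:.4f}, after log = {logged:.6f}")
```

In the raw error sum both points count equally, but after the log transform the small-y point dominates. A straight-line fit to the transformed data therefore answers a different least-squares question than the nonlinear fit to the original data.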
16. Free statistics packages
• Check out JASP or JAMOVI
– The two are very similar; JASP has some special Bayesian statistics that are unconventional
– Note again that while people think of Bayesian probability as causation, NO statistical test actually proves anything about causality! (Bayesian networks are sometimes called “causal networks”, which sounds good but is absolutely misleading. JASP doesn’t do them, though.)
• The current JAMOVI version is 0.8.0.5
– It appears to get more frequent updates than JASP
17. Hands-on exploration of JAMOVI
• Basic functionality for
– Importing data
– Adjusting metadata on variables (type, levels)
– Inspecting basic statistics
– Plotting the correlation matrix
• Note
– You can’t get a matrix scatter plot of multiple variables from Excel…
18. Iris data in JAMOVI
• It is easy to generate fancy plots of how the data are distributed.
• However, you can’t create classifiers in JAMOVI…
19. Significances of correlations
Correlation Matrix
              Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
Sepal.Length  —             -0.118       0.872 ***     0.818 ***
Sepal.Width                 —            -0.428 ***    -0.366 ***
Petal.Length                             —             0.963 ***
Petal.Width                                            —
Note. * p < .05, ** p < .01, *** p < .001
Note: copy/paste from JAMOVI to Word works well, but not so well to PowerPoint. I used OneNote as an intermediate step to get this into PowerPoint… Less than perfect.
20. The point of correlations?
• IF some variable is assumed causal, then the trends of the effects are important
– B increases or decreases with the manipulated variable A
• If two independently measured variables have a high correlation, then neither is badly corrupted by noise
– Correlation indicates there is mutual information; a variable that carries no information about anything else might as well be noise
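The link between correlation and noise can be checked with a quick simulation. The sketch below assumes a hidden quantity that is measured twice with independent Gaussian noise; the noise levels, sample size, and function names are illustrative choices only.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_of_two_measurements(noise_sd, n=2000, seed=0):
    """Measure the same hidden quantity twice with independent noise."""
    rng = random.Random(seed)
    truth = [rng.gauss(0.0, 1.0) for _ in range(n)]
    first = [t + rng.gauss(0.0, noise_sd) for t in truth]
    second = [t + rng.gauss(0.0, noise_sd) for t in truth]
    return pearson_r(first, second)

r_clean = correlation_of_two_measurements(noise_sd=0.1)
r_noisy = correlation_of_two_measurements(noise_sd=3.0)
print(f"low noise:  r = {r_clean:.2f}")
print(f"high noise: r = {r_noisy:.2f}")
```

With little noise the two measurements agree almost perfectly; with heavy noise the correlation collapses even though both still measure the same thing. A high correlation is therefore evidence that neither measurement is mostly noise.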
22. A first look at DataWarrior
• Current version 4.6.1 from Openmolecules.org
• Even if you run 64-bit Windows, take the 32-bit version; it can handle large enough data sets
• This is a freely available, professional-quality software package
– Too many features to cover… several tutorials are available on YouTube
23. Iris data again
• I selected all columns in the data view of JAMOVI, copied them, pasted them into Excel, and put back the column labels
• Then I did “paste special” with headers to get the data into DataWarrior
24.
• In DataWarrior it is easy to assign marker color, size, etc., to a feature or variable, so one plot can display multiple dimensions.
26. What I encourage is this…
• Get yourself free software
– Then learning to use it is a safe investment, because you are not cut off by fees or licensing
• The first thing to do with new data is to look at it. Let the data guide you more than your own prior assumptions.
– It is the big effects that are important, and you should be able to see them
– Statistically significant differences almost always emerge if you just collect enough samples; checking significance is to a large extent a ritual without much practical meaning
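The claim about sample size can be illustrated with a back-of-the-envelope calculation. The sketch assumes a two-sided z-test with a known, equal standard deviation in both groups, and it treats the observed difference as fixed at a tiny true effect; the effect size and sample sizes are made up.

```python
import math

def two_sided_p_for_mean_difference(diff, sd, n_per_group):
    """Two-sided z-test p-value for a difference between two group means.

    Assumes both groups have the same known standard deviation sd and
    n_per_group samples each (a simplification for illustration).
    """
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = abs(diff) / standard_error
    return math.erfc(z / math.sqrt(2.0))

# A tiny, practically irrelevant effect: 1% of one standard deviation.
p_values = {}
for n in (100, 10_000, 1_000_000):
    p_values[n] = two_sided_p_for_mean_difference(diff=0.01, sd=1.0, n_per_group=n)
    print(f"n per group = {n:>9}: p = {p_values[n]:.3g}")
```

The effect stays the same size throughout; only the sample count changes. With enough samples even this negligible difference becomes “highly significant”, which is why the size of an effect matters more than its p-value.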
27. Conclusions
• Most people are handy with Excel and use it to collect and manipulate data
– It has some ability for visualization, but very limited. See how far it can take you… maybe it is enough
– It is good for transforming data by calculating new columns
• There is now free software for some basic exploratory plotting and statistics
– JASP and JAMOVI appear convenient for a non-statistician
• For industrial-strength visualizations, DataWarrior is a free desktop application
• None of the above is for learning classifiers or for doing nonlinear regression… but you can do basic nonlinear regression easily in Excel, with some manual labor
– Get the DirectOptimizer add-in, at no cost