1
A NEW CORRELATION
COEFFICIENT AND A
DECOMPOSITION OF THE
PEARSON COEFFICIENT
Savas Papadopoulos
Department of Financial Stability
Bank of Greece
10/2/23
The views expressed are those of the author and do
not necessarily reflect those of Bank of Greece
Copyright © 2022 Savas Papadopoulos,
www.protectmywork.com. All rights reserved.
2
CONTENTS
ØABSTRACT
ØINTRODUCTION: What, Why, and How we do it
ØLITERATURE REVIEW: Pearson, Spearman, and Kendall
ØREFORMULATION & DECOMPOSITION OF PEARSON'S COEFFICIENT
ØPROPOSED MEASURES & METHODOLOGY
ØMADE-UP EXAMPLES: 2 & 3-Parallel Lines - Robustness to Outliers
ØSIMULATIONS WITH NORMAL DATA
ØAN APPLICATION TO GDP PER CAPITA
ØCONCLUSIONS
3
ABSTRACT [1]
We decompose the Pearson correlation coefficient into two components. We recommend the first component for detecting linear relationships and the second for recognizing patterns of two parallel lines, and we provide versions of both that are robust to outliers. A significant Pearson coefficient without significant components indicates a fit of three parallel lines. Thus, we reveal an unknown aspect of the Pearson coefficient: it can identify a two- or three-group association rather than a single group, producing Type I or Type III [2] errors. Finally, we apply the proposed coefficients with permutation tests to simulated and real data. The proposed coefficients identify two- or three-parallel-line correlations, or weaker but significant relationships not recognized by the Pearson coefficient. In testing correlation hypotheses with normal data, the proposed methodology, in an objective simulation study, reduces the total error of the Pearson coefficient by 4%.
Keywords: correlation coefficients; Pearson decomposition; permutation tests; computational statistics; outlier-robust correlation coefficients; parallel regression lines; cluster analysis
JEL CLASSIFICATION CODES: C10, C12, C14
[1] Papadopoulos, Savas, A Practical, Powerful, Robust and Interpretable Family of Correlation Coefficients (May 28, 2022). Available at SSRN: https://ssrn.com/abstract=4114080 or http://dx.doi.org/10.2139/ssrn.4114080
[2] https://en.wikipedia.org/wiki/Type_III_error
4
INTRODUCTION: Motivation
If we held a competition for the most valuable statistical quantity in exploratory data analysis, the winner would most likely be the correlation coefficient, by a wide margin over its closest competitor. It is striking that although the three classic correlation coefficients, Pearson, Spearman, and Kendall, were developed in the late 19th and early 20th centuries, and despite the rapid development of computers, these three coefficients still dominate in practice.
5
INTRODUCTION: What we do
Estimation & Hypothesis Testing for Correlation Coefficients
Let the hypotheses H0: ρ = 0, H1: ρ ≠ 0 and
α = P(Type I error), β = P(Type II error), γ = P(Type III error)
Type I Error: Reject a TRUE H0
Type II Error: Fail to reject a FALSE H0
Type III Error [3]: Reject a FALSE H0 for the WRONG reason (correct by accident)
Note that the errors are due to the data and not the correlation measures; e.g., in random data, by chance, a significant percentage of the data could be close to a line (Type I error).
[3] https://en.wikipedia.org/wiki/Type_III_error
6
INTRODUCTION: What we do
Ø We propose new correlation coefficients r_1 & v_1 with permutation tests for computing the p-values.
Ø The proposed correlation coefficient, r_1, finds linear relationships not detected by the Pearson coefficient (r_P).
Ø The proposed coefficient r_1 is more robust to multivariate outliers than the Pearson coefficient (r_P).
Ø We reveal Type III errors made by using the Pearson coefficient (r_P). That is, r_P can be significant, p-value(r_P) < α, with an insignificant linear relationship but a significant 2- or 3-parallel-line association.
Ø Combining the coefficients r_1, v_1, and r_P significantly reduces the total error = α + β + γ compared to r_P and r_1 separately, e.g., for normal data by 4% (see Figure 1).
7
INTRODUCTION: What we do
Figure 1: Global Simulation Results with Normal Data
Total error for Pearson (r_P) = 0.21, for r_1 = 0.20, and for the combination (r_1, v_1, r_P) = 0.17.
Bar values: P(Type I Error) = 0.051, 0.051, 0.100; P(Type II Error) = 0.096, 0.150, 0.073; P(Type III Error) = 0.066, 0.000, 0.000, for Pearson, r_1, and the combination, respectively.
8
INTRODUCTION: Why are the results important?
Ø By applying r_1, we minimize the overall error compared to the error of the Pearson coefficient (r_P), and thus make more correct decisions about whether two variables are linearly dependent or not.
Ø In some cases, researchers and analysts may have the illusion that there is a significant linear relationship, while the real situation is a significant fit of 2 or 3 parallel lines.
Ø The correlation coefficient r_1 recognizes significant linear relationships not detected by the Pearson coefficient (see the two graphs in the 1st column of Figure 5).
Ø In cases with multivariate outliers, r_1 is more powerful than r_P; that is, it leads to more accurate decisions.
9
INTRODUCTION: How we do it
Ø Decompose the Pearson coefficient (r_P) into two components (the mathematical proof is in the appendix):
r_P ≈ 0.64 · r_2 + 0.36 · v_2
Ø We use outlier-robust versions, r_1 & v_1, of the two components r_2 & v_2. The r_2 is large only for a 1-line fit, and v_2 indicates a 1- or 2-parallel-line fit; thus their linear combination r_P detects a 1-, 2-, or 3-parallel-line fit.
Ø Viewing 0.64 & 0.36 as weights, at least 64% of r_P reflects a 1-line fit, and at most 36% reflects a 2-line fit.
10
INTRODUCTION: How we do it
Ø Myths and Realities
Ø A linear relationship ⇒ significant Pearson coefficient, p-value(r_P) < α
Ø Significant Pearson coefficient ⇏ linear association
Ø In this paper we show that:
Significant Pearson coefficient (r_P) ⇒ fit with 1, 2, or 3 parallel lines.
11
INTRODUCTION: How we do it
Ø Methodology:
Ø If p-value(r_1) < α ⇒ a 1-line fit (see the flowchart)
Ø If p-value(r_1) ≥ α & p-value(v_1) < α ⇒ a 2-parallel-line fit
Ø If p-value(r_P) < α & p-value(r_1) ≥ α & p-value(v_1) ≥ α ⇒ a 3-parallel-line fit (based on the mathematical results)
Note: The methodology only detects some 2- or 3-parallel-line fits. Significance in the method indicates a 1-, 2-, or 3-line fit, but not the other way around; that is, if there is a 2- or 3-line fit, our methodology is not guaranteed to detect it.
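The three decision rules above can be sketched as a small helper; the function name and the returned labels are illustrative only, and the p-values are assumed to come from the permutation tests described later:

```python
def classify_fit(p_r1, p_v1, p_rP, alpha=0.05):
    """Apply the three rules above in order (hypothetical helper)."""
    if p_r1 < alpha:
        return "1 line"                # r1 significant
    if p_v1 < alpha:
        return "2 parallel lines"      # r1 insignificant, v1 significant
    if p_rP < alpha:
        return "3 parallel lines"      # only Pearson significant
    return "no detected fit"
```

For instance, p-values of (0.09, 0.002, 0.02) would be labeled a 2-parallel-line fit, and (0.06, 0.06, 0.03) a 3-parallel-line fit.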
12
Figure 2: Flowchart of the Methodology
13
LITERATURE REVIEW
Ø Existing literature recommends the Pearson
correlation for normal data and the Spearman correlation
for nonnormal data. But some global simulations,
including ours, indicate that the Kendall coefficient
outperforms the Pearson and Spearman coefficients.
Ø Data-analysis software typically computes three
classic correlation coefficients, Pearson’s, Spearman’s,
and Kendall’s.
14
LITERATURE REVIEW: Rankings
Ø There is certainly a lot of overlap between significant correlation in face values and significant correlation in rankings.
Ø But significant correlation in rankings ⇏ significant correlation in face values.
Ø And significant correlation in face values ⇏ significant correlation in rankings.
Ø Therefore, we lose power by using rankings.
Ø We can plug rankings into r_1 [4] and change the denominator accordingly.
Ø Denote R_1 = [r_1 with rankings].
Ø Our simulation studies, not presented here, show that R_1 is more powerful and robust than Spearman's coefficient.
[4] https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4276146
15
LITERATURE REVIEW: Kendall Coefficient
The Kendall coefficient (r_K) simply shows:
the percentage of monotonic pairs = (1 + |r_K|) / 2 (a novel approach)
out of all possible point pairs, n·(n − 1)/2 = i + d, where
r_K = 2 · (i − d) / (n·(n − 1)),
i = pairs with an increasing pattern,
d = pairs with a decreasing pattern.
Ø Thus, we correct the record: r_K is not only a rank correlation coefficient; r_K is the same for face values and for rankings without ties.
Ø r_K does not show how close the points are to a curve; the points could be close to multiple curves.
Ø Our simulations [5] show that r_K reveals relationships with one-sided multivariate outliers not signaled by the Pearson or Spearman coefficients.
[5] https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4276146
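The pair-counting view of r_K above can be sketched directly from the formula (an O(n²) pure-Python sketch; names are illustrative). Because only order relations enter the counts, any strictly increasing transformation of the values leaves r_K unchanged, which is the rank-invariance noted above:

```python
def kendall_tau(x, y):
    """Kendall coefficient r_K = 2*(i - d) / (n*(n - 1)):
    i = concordant (increasing) pairs, d = discordant (decreasing) pairs."""
    n = len(x)
    inc = dec = 0
    for a in range(n):
        for b in range(a + 1, n):
            s = (x[b] - x[a]) * (y[b] - y[a])
            if s > 0:
                inc += 1
            elif s < 0:
                dec += 1
    return 2 * (inc - dec) / (n * (n - 1))

x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]
tau = kendall_tau(x, y)                # 8 concordant, 2 discordant pairs
monotonic_share = (1 + abs(tau)) / 2   # share of monotonic pairs
```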
16
LITERATURE REVIEW: Outliers
Ø The Pearson coefficient (r_P) is known to be sensitive to outliers, so some researchers suggest using rank coefficients, e.g., Spearman & Kendall.
Ø Alternatively, we recommend removing univariate outliers, e.g., using the simple k·IQR criterion, thus eliminating influential points.
Ø After removing univariate outliers, we cannot have
extreme multivariate outliers. Near the most extreme cases
are the made-up examples in Figure 4.
Ø Note that with rankings, there are never univariate
outliers.
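The k·IQR rule above can be sketched in pure Python. The quartile interpolation used here is one common convention, so treat the details as an assumption rather than the paper's exact recipe:

```python
def remove_univariate_outliers(values, k=1.5):
    """Drop points outside [Q1 - k*IQR, Q3 + k*IQR] (the k*IQR rule)."""
    s = sorted(values)
    n = len(s)

    def quantile(q):
        # simple linearly interpolated quantile (an assumed convention)
        pos = q * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        return s[lo] + (pos - lo) * (s[hi] - s[lo])

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lo <= v <= hi]
```

On clean data the rule keeps everything; an extreme value such as 100 appended to 1..10 is removed.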
17
MAIN RESULT: THE DECOMPOSITION
We proved the following meaningful decomposition:

Pearson = r_P = (2/π) · r_{±,2} + ((π − 2)/π) · v_2 ≈ 0.64 · r_{±,2} + 0.36 · v_2

r_{±,2} = ±[ 1 − ( mean_i |x_i^(2) ∓ y_i^(2)| )² / (4/π) ]

v_{±,2} = ±[ 1 − Var_i( |x_i^(2) ∓ y_i^(2)| ) / ( 2 · (1 − 2/π) ) ]

(The mathematical proof is in the appendix.)

STANDARDIZED VARIABLES: z_i^(s) = (z_i − mean(z)) / ( mean_i |z_i − mean(z)|^s )^(1/s), s = 1, 2; i = 1, 2, ⋯, n

The sign + corresponds to a positive and − to a negative relationship. We denote by mean(·) the sample mean, by |a| the absolute value, and by Var(·) the population variance.
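The decomposition can be checked numerically: with population-SD standardization, r_P = (2/π)·r_{±,2} + ((π − 2)/π)·v_{±,2} holds exactly for any sample (shown here with the positive-sign component choice, i.e., distances from the line y = x). A pure-Python sketch:

```python
import math
import random

random.seed(0)
n = 200
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.5 * xi + math.sqrt(1 - 0.25) * random.gauss(0, 1) for xi in x]

def standardize2(z):
    """s = 2 standardization: center, then divide by the population SD."""
    m = sum(z) / len(z)
    sd = math.sqrt(sum((v - m) ** 2 for v in z) / len(z))
    return [(v - m) / sd for v in z]

xs, ys = standardize2(x), standardize2(y)
rP = sum(a * b for a, b in zip(xs, ys)) / n            # Pearson coefficient

d = [abs(a - b) for a, b in zip(xs, ys)]               # distances from y = x
m1 = sum(d) / n
var = sum((v - m1) ** 2 for v in d) / n                # population variance
r2 = 1 - m1 ** 2 / (4 / math.pi)                       # first component
v2 = 1 - var / (2 * (1 - 2 / math.pi))                 # second component
recomposed = (2 / math.pi) * r2 + ((math.pi - 2) / math.pi) * v2
```

The identity follows from mean(d²) = (mean d)² + Var(d) and r_P = 1 − mean(d²)/2.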
18
MAIN RESULT: The Proposed Coefficients
Instead of π‘ŸΒ±,& and 𝑣±,& we use the following robust versions:
Proposed corr. Coef. π‘Ÿ! for one line: π‘ŸΒ±,! = Β± ;1 βˆ’
12"
(,)
βˆ“3"
(,)
1
444444444444444$
&
>
Corr. Coef. 𝑣!for two-parallel lines: 𝑣±,! = Β± A1 βˆ’
MAD
&'
(
(,)
βˆ“*
(
(,)
&
$
@
B
𝑐 = 0.73161 =
Μ‡ 𝐸 mMAD
12(
(,)
βˆ“3(
(,)
1
&
o (Numerical result)
π‘Ÿ! =
⎩
βŽͺ
⎨
βŽͺ
βŽ§π‘Ÿ#,!, if uπ‘₯B
(!)
βˆ’ 𝑦B
(!)
u
xxxxxxxxxxxxxxxx
< uπ‘₯B
(!)
+ 𝑦B
(!)
u
xxxxxxxxxxxxxxxx
& uπ‘₯B
(!)
βˆ’ 𝑦B
(!)
u
xxxxxxxxxxxxxxxx
< √2,
π‘Ÿ*,!, if uπ‘₯B
(!)
+ 𝑦B
(!)
u
xxxxxxxxxxxxxxxx
< uπ‘₯B
(!)
βˆ’ 𝑦B
(!)
u
xxxxxxxxxxxxxxxx
& uπ‘₯B
(!)
+ 𝑦B
(!)
u
xxxxxxxxxxxxxxxx
< √2,
0, otherwise
19
v_1 = v_{+,1}, if mean_i |x_i^(1) − y_i^(1)| < mean_i |x_i^(1) + y_i^(1)| and MAD( |x_i^(1) − y_i^(1)| ) < √c;
v_1 = v_{−,1}, if mean_i |x_i^(1) + y_i^(1)| < mean_i |x_i^(1) − y_i^(1)| and MAD( |x_i^(1) + y_i^(1)| ) < √c;
v_1 = 0, otherwise.

MAD = mean absolute deviation over i.
First, we standardize the x and y values.
Second, we infer whether there is a positive or negative relationship by comparing the mean distances from the y=x and y=−x lines (see Figure 8).
Third, we rescale the squared mean and the variance of the distances by dividing by their expected values and then subtracting the results from 1.
Finally, we define r_1 and v_1 so that we have the following properties:
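As a concrete sketch of the construction steps, treating the MAD² reading of v_1 and the constant c = 0.73161 from this slide as given (the function names are hypothetical):

```python
import math

C = 0.73161  # slide value: E[MAD(|x(1) -/+ y(1)|)^2] under normality

def standardize1(z):
    """s = 1 standardization: center, then divide by the mean absolute deviation."""
    m = sum(z) / len(z)
    d = sum(abs(v - m) for v in z) / len(z)
    return [(v - m) / d for v in z]

def mad(v):
    """Mean absolute deviation over i."""
    m = sum(v) / len(v)
    return sum(abs(u - m) for u in v) / len(v)

def r1_v1(x, y):
    """Sketch of the proposed r_1 and v_1 under the reading above."""
    xs, ys = standardize1(x), standardize1(y)
    dm = [abs(a - b) for a, b in zip(xs, ys)]   # distances from y = x
    dp = [abs(a + b) for a, b in zip(xs, ys)]   # distances from y = -x
    mm, mp = sum(dm) / len(dm), sum(dp) / len(dp)
    r1 = v1 = 0.0
    if mm < mp:                                  # positive relationship
        if mm < math.sqrt(2):
            r1 = 1 - mm ** 2 / 2
        if mad(dm) < math.sqrt(C):
            v1 = 1 - mad(dm) ** 2 / C
    elif mp < mm:                                # negative relationship
        if mp < math.sqrt(2):
            r1 = -(1 - mp ** 2 / 2)
        if mad(dp) < math.sqrt(C):
            v1 = -(1 - mad(dp) ** 2 / C)
    return r1, v1
```

A perfect positive line gives r_1 = v_1 = 1 and a perfect negative line gives r_1 = v_1 = −1, matching the properties that follow.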
20
PROPERTIES FOR r_1 & v_1 (two parallel lines)
Ø −1 ≤ r_1 ≤ 1 and −1 ≤ v_1 ≤ 1, due to the restrictions in the definitions.
Ø An exact value of +1 or −1 for r_1 indicates a perfect positive or negative relationship and vice versa, since the mean distance of the standardized points is 0 and so all points fall on the line y=x or on the line y=−x.
Ø An exact value of +1 or −1 for v_1 implies (two parallel lines):
|v_1| = 1 ⟺ MAD( |x_i^(1) ∓ y_i^(1)| ) = 0 ⟺ ∀ i, |x_i^(1) ∓ y_i^(1)| = c, a constant ⟺
21
⟺ ∀ i, y_i^(1) = ±x_i^(1) + c or y_i^(1) = ±x_i^(1) − c (continued)
That is, two symmetrical lines, y=x+c and y=x−c, around the line y=x, or two symmetrical lines, y=−x+c and y=−x−c, around the line y=−x. In the special case c=0, the points fall on the line y=x or on the line y=−x.
Ø A correlation value equal or close to 0 indicates no relationship.
Ø The closer the coefficient r_1 is to +1 or −1, the stronger the one-line association, assuming no influential points.
Ø The closer the coefficient v_1 is to +1 or −1, the stronger the one-line or two-line association, assuming no influential points.
22
Two- or Three parallel lines
The significance of the Pearson correlation coefficient may be misleading if some standardized points are close to the line y=x or y=−x, and other standardized points are close to the lines y=x+c and y=x−c or close to the lines y=−x+c and y=−x−c. The 1st set of points pushes the 1st component r_2 towards 1 or −1, and the 2nd set pushes the 2nd component v_2 towards 1. The linear combination r_P ≈ 0.64 · r_2 + 0.36 · v_2 can give a significant value for the Pearson coefficient r_P because most points are close to 2 or 3 lines and not necessarily close to 1 line, as is assumed in the literature. We justify this with simulations and real data later.
23
Two- or Three parallel lines (cont.)
Without loss of generality, we assume univariate-outlier removal, variable standardization, and a positive relationship.
If r_2 ⟶ 1 (close to 1), then mean_i |x_i^(2) − y_i^(2)| ⟶ 0, and thus the points are close to the line y=x.
If v_2 ⟶ 1, then Var_i( |x_i^(2) − y_i^(2)| ) ⟶ 0 with |x_i^(2) − y_i^(2)| ≈ c, and thus the points are close to the parallel lines y=x+c and y=x−c, symmetric about the line y=x (shown earlier).
If r_P ⟶ 1, then either r_2 ⟶ 1 & v_2 ↛ 1 (a 1-line fit only),
or r_2 ↛ 1 & v_2 ⟶ 1 (a 2-line fit only); in this case, we cannot have a single line because otherwise r_2 ⟶ 1,
24
Two- or Three parallel lines (cont.)
or π‘Ÿ' ⟢ 1 & 𝑣' ⟢ 1 (for c = 0, we have a 1-line, & for c>>0,
have a 3-line fit)
Example with 1 line: A perfect single-line fit with
𝒓𝑷 = π’“πŸ = π’—πŸ = 1.
Example with 3 parallel lines: In Figure 3, the 2nd example
with 3 lines. In this example, all three coefficients are large 𝒓𝑷 =
0.71, π’“πŸ = 0.66, π’—πŸ = 0.79. We have three points exactly to the
middle line and six points to two symmetric lines around the
middle line, 3 points to each line.
25
Three-parallel lines (cont.)
The significance of the Pearson coefficient with insignificant
components indicates a fit of three parallel lines. The logic behind
this is that the insignificance of the two components indicates that
the points are not close to one or two parallel lines; otherwise,
they would be significant. Therefore, the significance of the
Pearson coefficient, in this case, is that some points are close to a
straight line, and others are close to two symmetrical lines around
the first line. Another example is the 3rd plot of Figure 5: r_P = 0.16, r_1 = 0.12, v_1 = 0.25, with p-values 0.02, 0.07, and 0.07, respectively. The existence of such cases is shown in Figure 10.
26
Two- or Three parallel lines
Mean and STD Relationship for positive numbers
In general, the Mean (M) and the Standard Deviation (SD) sizes
are independent of each other. But for positive numbers, as in the case of the Vertical Projections (VP) |x_i^(2) − y_i^(2)|, a small M implies a small SD, but not vice versa, in the absence of outliers. A small M indicates small positive numbers close together and, therefore, a small SD. A small SD can also occur with large positive numbers, so M can be large.
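The claim that, for positive numbers without outliers, a small M forces a small SD can be made precise by the bound Var(X) ≤ M·(max − M) for X in [0, max], since E[X²] ≤ max·E[X]. A quick numerical illustration (the data here are synthetic, for illustration only):

```python
import random

random.seed(1)
vals = [random.uniform(0.0, 0.1) for _ in range(1000)]  # positive, small M, no outliers
M = sum(vals) / len(vals)
var = sum((v - M) ** 2 for v in vals) / len(vals)

# For numbers in [0, max]: Var <= M * (max - M), since E[X^2] <= max * E[X].
bound = M * (max(vals) - M)

# The converse fails: shifting gives the same SD with a large mean.
shifted = [v + 100.0 for v in vals]
M_shift = sum(shifted) / len(shifted)
```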
27
Ø A small M implies small VPs, |x_i^(2) − y_i^(2)| ≈ M, or points close to the lines y = x − M and y = x + M. Since M is small, the points can be considered close to the line y = x.
Ø A small SD (significant v_2) implies VPs close to M, |x_i^(2) − y_i^(2)| ≈ M = mean_i |x_i^(2) − y_i^(2)|, or points close to the lines y = x − M and y = x + M. If M is large (insignificant r_2), then we have a two-line fit.
Ø For large M & SD (insignificant r_2 & v_2) with significant r_P, a large portion of the data minimizes M & SD and therefore maximizes r_P. In addition, another large portion of points maximizes M & SD, otherwise r_2 & v_2 would be significant, and therefore minimizes r_2 & v_2; see Figure 9.
28
Figure 3: MADE-UP EXAMPLES: Two & Three parallel lines
In both cases, the Pearson coefficient is significant (p-values of r_P: 0.02 & 0.03), giving us the impression that there is a significant linear relationship. The proposed coefficients r_1 and v_1 show an insignificant linear relationship (p-values of r_1: 0.09 & 0.06), a significant two-parallel-line fit in Case 1 (p-value of v_1 = 0.002), and a significant three-parallel-line fit in Case 2 (insignificant r_1 and v_1, p-values 0.06 & 0.06, and a significant Pearson coefficient, p-value = 0.03).
29
Figure 4: MADE-UP EXAMPLES: Perfect Linear Relationships with 1 & 2 outliers
The Pearson correlation coefficient is not robust in the presence of outliers [6]. We consider a perfect linear relationship with ONE & TWO multivariate outliers. We use the Pearson coefficient & the proposed r_1 coefficient. Only the proposed correlation, r_1, recognizes the linear relationships: p-values = 0.04 & 0.02 < 0.05.
[6] https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
30
Constructing a Correlation Measure
Let G be a distance measure between the standardized values of two paired variables, and let
r_{±,G} = ±(1 − G/L), where G →p L (convergence in probability).
r_G = r_{+,G}, if positive relationship & G < L; r_{−,G}, if negative relationship & G < L; 0, otherwise.
Note that the desired properties apply.
Ø E.g., in r_{±,g} and v_{±,g} for g = 1, 2, the denominators of the fractions are the expected values of the numerators under the assumptions of normality and independence.
31
AN INTERPRETATION OF THE CORRELATION COEFFICIENT
Ø The correlation coefficient r_1 can be interpreted as the percentage change between the squared observed distance of the standardized values and the squared limiting distance for independent normal x and y.
Ø For example, a value of r_1 = 0.5 implies a 50% reduction of the squared observed distance relative to the squared limiting distance under independence and normality.
Ø In the literature, it is known that Pearson's correlation can be viewed as a rescaled variance of the difference between standardized scores [7].
[7] Rodgers, J. L., & Nicewander, W. A. (1988). "Thirteen ways to look at the correlation coefficient" (PDF). The American Statistician, 42(1), 59-66.
32
PERMUTATION TESTS: computing the p-values
Permutation tests [8] have been used for hypothesis testing of correlation coefficients between two variables, x and y. First, we calculate the correlation coefficient repeatedly after shuffling the observations of the variable y while keeping the order of the observations of the variable x fixed. Then, we derive p-values from the distribution of the computed correlation coefficients. Permutation tests [9] enjoy the following merits over other standard statistical tests:
• They approximate p-values very satisfactorily.
• They do not assume any particular distribution (distribution-free).
• They are suitable for small samples.
• They are applicable to non-random samples, e.g., time-series data.
[8] https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
[9] Berry, K. J., Johnston, J. E., & Mielke, P. W., Jr. (2018). The measurement of association: a permutation statistical approach. Springer International Publishing.
33
SIMULATION DESIGN WITH NORMAL DATA
Ø 100,000 simulations in Python (NumPy library)
Ø Picking randomly one of the following 6 pairs (n, r) with equal probabilities (1/6):
n: 20, 35, 50, 100, 150, 200
r: 0.7, 0.6, 0.5, 0.35, 0.3, 0.25
Ø Generating n correlated bivariate data, x_i and y_i, with Pearson coefficient ρ = r. All the data are correlated as follows:
Ø x_i ∼ Normal(0,1), e_i ∼ Normal(0,1),
Ø y_i = r · x_i + √(1 − r²) · e_i
Ø Applied in Figure 5, 1st row, & Tables 1a & 1b.
34
AN APPLICATION TO GDP PER CAPITA
Ø Publicly available data [10] from the WORLD BANK
Ø N = 62 countries with GDP per capita > 10,000$ in 2020 and full annual data for 1982-2020
Ø T = 39 for the period 1982-2020. We analyze growth rates (%)
Ø (62² − 62)/2 = 1891 pairs (x, y), i.e., correlation cases
Ø No causality. Lurking variables: the global or continental economy
Ø We compare the economic growth of a country with its correlated countries via regression residuals.
Ø Applied in Figure 5, 2nd row, & Table 2.
[10] https://data.worldbank.org/indicator/NY.GDP.PCAP.CD
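A small sketch of the bookkeeping above; the growth-rate transformation as a simple percent change is an assumption on our part (the slide does not spell it out), and the level series below is hypothetical:

```python
N, T = 62, 39               # countries; annual growth observations (1982-2020)
pairs = (N * N - N) // 2    # unordered country pairs to correlate

def growth_rates(levels):
    """Percent growth rates from a level series (simple percent changes)."""
    return [100 * (b / a - 1) for a, b in zip(levels, levels[1:])]

g = growth_rates([100.0, 110.0, 99.0])   # hypothetical GDP-per-capita levels
```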
35
Result Presentation in a Graph (see Figure 5)
Ø Simulated data in the 1st row and application data in the 2nd row.
Ø 1-, 2-, and 3-line fits in the 1st, 2nd, & 3rd columns, respectively.
Ø In the application, the initial sample size is n=39, and we remove univariate outliers by the 1.5·IQR criterion. In the simulation plots, the sample size is n=200.
Ø By p, p1, & p2 we denote the p-values of the Pearson coefficient, r_1, and v_1, respectively.
Ø By significance, we mean that the p-value < 0.05.
36
Conclusions from Figure 5
Ø In the 1st column, only r_1 shows a significant linear relationship. The application graph shows the robustness of r_1 to non-normal data.
Ø In the 2nd column, in the 1st graph with simulated data, the Pearson coefficient is significant, and the insignificance of r_1 together with the significance of v_1 indicates a 2-line fit. By the existing literature, there is a significant 1-line fit; this is a Type III error.
Ø In the 2nd column, in the 2nd graph with the application data, the Pearson coefficient is insignificant, and the insignificance of r_1 with the significance of v_1 indicates a 2-line fit. By the existing literature, there is no significant 1-line fit.
37
Conclusions from Figure 5 (continued)
Ø In the 3rd column, in both graphs, the Pearson coefficient is significant, and both r_1 and v_1 are insignificant. Thus, we conclude that there is a significant 3-line fit. By the existing literature, there is a significant 1-line fit; this is a Type III error.
38
Figure 5: Simulated and Real-Data Evidence
39
Simulation Results (Tables 1a & 1b)
Ø See the simulation design earlier.
Ø Tables 1a (ρ = 0) & 1b (ρ ≠ 0) are 3-way tables for the 3 correlation coefficients r_P (Pearson), r_1, and v_1, with 2 categories each: significant or not.
Ø For ρ = 0, the α = P(Type I error) for r_P is α = 5.1% (close to 5%, as expected), and for r_1 it is α = 5.1% = 3.2 + 1.9. Combining r_P, r_1, and v_1, α = 10% = 1 − 0.90.
Ø For ρ ≠ 0, the β = P(Type II error) for r_P is β = 9.6%, while for r_1 it is β = 15% = 6.6 + 8.4. Combining r_P, r_1, and v_1, β = 7.3%.
40
Simulation Results (Tables 1a & 1b)
Ø For ρ ≠ 0, the γ = P(Type III error) for r_P is γ = 6.6% = 4.3 + 2.3, while combining r_P, r_1, and v_1 gives γ = 0.
Ø With the proposed methodology, the α = P(Type I error) doubles from 5% to 10%, as it detects parallel lines in addition to linear fits, but the β = P(Type II error) is significantly smaller than that of Pearson's correlation (7.3% vs. 9.6%), and there is no γ = P(Type III error).
Ø The above results appear in Figure 1.
Ø Thus, by combining r_1, v_1, and r_P, we minimize the total error compared to the total error of r_P and r_1 separately.
Ø In this case, we get the smallest total error if we use the Pearson significance and exclude the Type III errors, which the significance of r_P, r_1, and v_1 can account for. For non-normal data, we can use the combination of r_1, v_1, and r_P.
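The error totals reported in Figure 1 can be recomputed from the α, β, and γ values above:

```python
# Error rates (in %) read off Figure 1 and Tables 1a-1b
alpha = {"pearson": 5.1, "r1": 5.1, "combined": 10.0}   # P(Type I error)
beta  = {"pearson": 9.6, "r1": 15.0, "combined": 7.3}   # P(Type II error)
gamma = {"pearson": 6.6, "r1": 0.0,  "combined": 0.0}   # P(Type III error)

total = {k: alpha[k] + beta[k] + gamma[k] for k in alpha}
# total error: Pearson ~21.3%, r1 ~20.1%, combination ~17.3%
```

The Pearson-minus-combination gap of about 4 percentage points is the "4%" reduction quoted in the abstract.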
41
Table 1a: Simulation (ρ = 0), N = 100,000 (cell entries in %)
Pearson        r_1            v_1 Significant     v_1 Not Signif.     Total
Significant    Significant    1.1, 1 line         2.1, 1 line         3.2
Significant    Not Signif.    1.0, 2 lines        0.8, 3 lines        1.8
Significant    Total          2.1                 2.9                 5.0
Not Signif.    Significant    0.0, 1 line         1.9, 1 line         1.9
Not Signif.    Not Signif.    3.1, 2 lines        90.0, no lines      93.1
Not Signif.    Total          3.1                 91.9                95.0
42
Table 1b: Simulation (ρ ≠ 0), N = 100,000 (cell entries in %)
Pearson        r_1            v_1 Significant     v_1 Not Signif.     Total
Significant    Significant    66.1, 1 line        17.7, 1 line        83.8
Significant    Not Signif.    4.3, 2 lines        2.3, 3 lines        6.6
Significant    Total          70.4                20.0                90.4
Not Signif.    Significant    0.0, 1 line         1.2, 1 line         1.2
Not Signif.    Not Signif.    1.1, 2 lines        7.3, no lines       8.4
Not Signif.    Total          1.1                 8.5                 9.6
43
Application Results (Table 2)
Ø See the application information earlier.
Ø Table 2 is a 3-way table for the 3 correlation coefficients r_P (Pearson), r_1, and v_1, with 2 categories each: significant or not.
Ø The initial sample size is n=39, and we remove univariate outliers by the 1.5·IQR criterion.
Ø In 4.4% = 0.4 + 4.0 of the 1891 cases, r_1 recognizes a significant linear relationship not detected by the Pearson coefficient r_P.
Ø Combining r_1 and v_1, we find a significant 2-line fit in 4.8% = 0.6 + 4.2 of the 1891 cases, not detected otherwise.
Ø Combining r_1, v_1, and r_P, we detect a significant 3-line fit in 1.5% of the 1891 cases, not recognized otherwise.
44
Table 2: AN APPLICATION TO GDP PER CAPITA, N = 1891 (cell entries in %)
Pearson        r_1            v_1 Significant     v_1 Not Signif.       Total
Significant    Significant    40.8, 1 line        16.9, 1 line          57.7
Significant    Not Signif.    0.6, 2 lines        1.5, 3 lines          2.1
Significant    Total          41.4                18.4                  59.8
Not Signif.    Significant    0.4, 1 line         4.0, 1 line           4.4
Not Signif.    Not Signif.    4.2, 2 lines        31.6, no 1-3 lines    35.8
Not Signif.    Total          4.6                 35.6                  40.2
45
Figure 6: Venn Diagrams [11] for the Simulation (Tables 1a & 1b): ρ = 0 and ρ ≠ 0
[11] For understanding the diagrams, see: https://www.easycalculation.com/algebra/venn-diagram-3sets.php
46
Figure 7: Venn Diagrams [12] for the Application (Table 2)
1-line = r_1 = 40.8 + 16.9 + 4.0 + 0.4
2-lines = Pearson ∩ v_1 ∩ r_1^c = 0.6
3-lines = Pearson ∩ r_1^c ∩ v_1^c = 1.5
e.g., 57.7 = 16.9 + 40.8; Pearson ∩ r_1 ∩ v_1 = 40.8; Pearson^c ∩ r_1^c ∩ v_1^c = 31.6
[12] For understanding the diagrams, see: https://www.easycalculation.com/algebra/venn-diagram-3sets.php
47
Comments on Figures 6 & 7 (Venn Diagrams)
Ø The Pearson r_P and the proposed coefficient for a linear fit, r_1, AGREE by 93.2% = 90 + 1.1 + 2.1, by 91.1% = 66.1 + 17.7 + 7.3, and by 89.3% = 40.8 + 16.9 + 31.6 in the cases ρ = 0 and ρ ≠ 0 and in the application, respectively. Roughly, there is 90% agreement and 10% disagreement.
Ø A significant TWO- or THREE-LINE FIT appears in 4.1% = 3.1 + 1.0 & 0.8% in the case ρ = 0, in 5.4% = 4.3 + 1.1 & 2.3% in the case ρ ≠ 0, and in 4.8% = 4.2 + 0.6 & 1.5% in the application, respectively.
Ø The robust version of the 2nd component of the Pearson decomposition, the coefficient v_1, does not recognize the linear relationship in a large share of cases: 17.7% in the case ρ ≠ 0 and 16.9% in the application. Thus, we recommend r_1 for linear relationships and v_1 for 2- and 3-parallel-line fits in combination with r_1 and r_P.
48
OVERALL CONCLUSIONS
Ø The significance of the Pearson correlation coefficient may imply a 1-, 2-, or 3-parallel-line fit. There is a Type III error: we may think there is a significant fit with 1 line, when in fact there is a fit of 2 or 3 parallel lines.
Ø We provide a methodology to detect 1-, 2-, or 3-line fits.
Ø We prove a meaningful decomposition of the Pearson coefficient.
Ø The proposed r_1 coefficient detects more linear relationships and has a smaller overall error than the Pearson coefficient.
49
OVERALL CONCLUSIONS
Ø π‘Ÿ! is more robust to non-normality & outliers than the Pearson
coefficient.
Ø The robust version of the 2nd component of the Pearson
decomposition, the coefficient 𝑣!, does not recognize many linear
relationships since it detects two parallel lines. Thus, we recommend
only π‘Ÿ! for single-line detection.
Ø The two robust version components, π‘Ÿ!, and 𝑣!, of the Pearson
decomposition combined with the Pearson coefficient, can be applied by
all scientists to detect accurate significant fits of 1 or 2, or 3 parallel
lines.
50
APPENDIX: THE DECOMPOSITION
Pearson = (2/π) · r_{±,2} + ((π − 2)/π) · v_2 ≈ 0.64 · r_{±,2} + 0.36 · v_2

r_{±,2} = ±[ 1 − ( mean_i |x_i^(2) ∓ y_i^(2)| )² / (4/π) ]

v_{±,2} = ±[ 1 − Var_i( |x_i^(2) ∓ y_i^(2)| ) / ( 2 · (1 − 2/π) ) ]

With standardized values: z_i^(s) = (z_i − mean(z)) / ( mean_i |z_i − mean(z)|^s )^(1/s), s = 1, 2; i = 1, 2, ⋯, n

We denote by π the known mathematical constant, by mean(·) the sample mean, by |a| the absolute value, and by Var(·) the population variance.

Define r_2 = r_{+,2} for a positive relationship and r_{−,2} for a negative relationship, and v_2 = v_{+,2} for a positive relationship and v_{−,2} for a negative relationship.
51
APPENDIX: A Proof of the Decomposition
We write the Pearson coefficient as r_P = mean_i( x_i^(2) · y_i^(2) ).
The mean square of the standardized values equals 1: mean_i( (x_i^(2))² ) = mean_i( (y_i^(2))² ) = 1, and from the identity (a ∓ b)² = a² + b² ∓ 2ab we get the following reformulation of the Pearson coefficient:

r_P = ±[ 1 − mean_i( (x_i^(2) ∓ y_i^(2))² ) / 2 ]

Using the known identity [13] mean(z²) = ( mean|z| )² + Var(|z|), for z_i = x_i^(2) ∓ y_i^(2), and after some calculations, we get the result.
[13] https://en.wikipedia.org/wiki/Root_mean_square
52
π‘ŸDG = Β± β€’1 βˆ’
1
2
βˆ™ β€˜uπ‘₯B
(&)
βˆ“ 𝑦B
(&)
u
xxxxxxxxxxxxxxxx&
+ 𝜎
12(
($)
βˆ“3(
($)
1
&
β€™β€œ =
= Β± ”1 βˆ’
1
2
βˆ™ β€’
4
πœ‹
βˆ™
uπ‘₯B
(&)
βˆ“ 𝑦B
(&)
u
xxxxxxxxxxxxxxxx&
4/πœ‹
+ 2 βˆ™ m1 βˆ’
2
πœ‹
o βˆ™
𝜎
12(
($)
βˆ“3(
($)
1
&
2 βˆ™ β€Ή1 βˆ’
2
πœ‹
Ε’
Λœβ„’ =
= Β± ”
2
πœ‹
+ 1 βˆ’
2
πœ‹
βˆ’
2
πœ‹
βˆ™
uπ‘₯B
(&)
βˆ“ 𝑦B
(&)
u
xxxxxxxxxxxxxxxx&
4
πœ‹
βˆ’ m1 βˆ’
2
πœ‹
o βˆ™
𝜎
12(
($)
βˆ“3(
($)
1
&
2 βˆ™ β€Ή1 βˆ’
2
πœ‹
Ε’
β„’ =
= Β±
2
πœ‹
βˆ™ β€’1 βˆ’
uπ‘₯B
(&)
βˆ“ 𝑦B
(&)
u
xxxxxxxxxxxxxxxx&
4/πœ‹
˜ Β± m1 βˆ’
2
πœ‹
o βˆ™ A1 βˆ’
𝜎
12(
($)
βˆ“3(
($)
1
&
2 βˆ™ β€Ή1 βˆ’
2
πœ‹
Ε’
B
=
&
7
βˆ™ π‘ŸΒ±,& +
7*&
7
βˆ™ 𝑣&. ∎
53
APPENDIX: Notes on the Decomposition:
Ø If we assume independent x_i ~ N(μ_x, σ_x²) and y_i ~ N(μ_y, σ_y²), then the combination of the standardized values x_i^(2) ∓ y_i^(2) ∼ N(0, 2), and therefore
|x_i^(2) ∓ y_i^(2)| ~ Half-Normal with mean 2/√π and variance 2·(1 − 2/π) (half-normal [14] distribution).
We used that if x ~ N(0, σ²), then |x| ~ Half-Normal with mean σ·√(2/π) and variance σ²·(1 − 2/π).
Ø In r_{±,2} and v_{±,2}, the denominators of the fractions are the expected values of the numerators under the assumptions of normality and independence.
[14] https://en.wikipedia.org/wiki/Half-normal_distribution
54
APPENDIX: Notes on the Normality Assumption
Ø Despite using the normality assumption to obtain the limiting distances in the proposed coefficients, this assumption does not affect the p-values computed by permutation tests.
Ø For non-normal data, we may have some bias in estimating the correlation coefficients r_1 and v_1. However, the estimation of their p-values does not depend on the normality assumption. So, with non-normal data, we look at the p-values of r_1 and v_1.
Ø For large sample sizes and normal data, we could use already calculated critical values if computation time were an issue.
Ø For large sample sizes and non-normal data, we could use already calculated critical values of the respective rank coefficients of r_1 and v_1, denoted R_1 and V_1, with a power compensation, if computation time were an issue.
55
APPENDIX:
Figure 8: Standardized Values around the lines y=x and y=-x
56
APPENDIX: A Tip: The Parallelism between
Hypothesis Testing and Proof by Contradiction
Ø In proof by contradiction, we establish a proposition by assuming that the proposition is false and arriving at a contradiction.
Ø In hypothesis testing, we accept the alternative hypothesis, H1, by assuming a false H1 (a true H0) and concluding that the observed result is improbable (a small p-value).
57
APPENDIX: THE CRITICAL VALUES AND P-VALUES
FOR r_1 ARE COMPUTED BY PERMUTATION TESTS
Cut-Off Points for r_1 (two-sided α=0.05 or one-sided α=0.025), under normality
n  c      n  c      n  c      n  c      n   c      n     c
5  0.924  12 0.662  19 0.543  30 0.437  80  0.278  500   0.113
6  0.868  13 0.637  20 0.528  35 0.409  90  0.262  1000  0.080
7  0.819  14 0.617  21 0.517  40 0.386  100 0.249  2000  0.057
8  0.780  15 0.602  22 0.505  45 0.363  150 0.205  5000  0.036
9  0.742  16 0.582  23 0.496  50 0.346  200 0.179  10^4  0.026
10 0.715  17 0.570  24 0.487  60 0.318  300 0.146  10^5  0.008
11 0.687  18 0.556  25 0.477  70 0.296  400 0.126  10^6  0.001
Example: In a case with data not rejected as normal and n=90, we observe r_1 = 0.21. Then we cannot reject H0: ρ = 0 against H1: ρ ≠ 0 at α=0.05, since |r_1| < c = 0.262.
58
APPENDIX: THE CRITICAL VALUES AND P-VALUES
FOR v2 ARE COMPUTED BY PERMUTATION TESTS
Cut-Off Points for v2 (two-sided Ξ±=0.05 or one-sided Ξ±=0.025)
n    c      n    c      n    c      n    c      n     c      n      c
5    0.962  12   0.751  19   0.629  30   0.522  80    0.337  500    0.141
6    0.932  13   0.732  20   0.616  35   0.490  90    0.319  1000   0.100
7    0.893  14   0.709  21   0.605  40   0.461  100   0.303  2000   0.072
8    0.860  15   0.692  22   0.595  45   0.435  150   0.251  5000   0.046
9    0.830  16   0.675  23   0.582  50   0.415  200   0.219  10^4   0.033
10   0.801  17   0.658  24   0.573  60   0.384  300   0.180  10^5   0.011
11   0.775  18   0.644  25   0.560  70   0.358  400   0.158  10^6   0.003
Under normal data
59
APPENDIX: THE CRITICAL VALUES FOR PEARSON
COMPUTED BY PERMUTATION TESTS
Cut-Off Points for rP (two-sided Ξ±=0.05 or one-sided Ξ±=0.025)
n    c      n    c      n    c      n    c      n     c      n      c
5    0.923  12   0.640  19   0.510  30   0.408  80    0.250  500    0.100
6    0.866  13   0.618  20   0.497  35   0.379  90    0.236  1000   0.071
7    0.816  14   0.591  21   0.488  40   0.353  100   0.223  2000   0.050
8    0.771  15   0.576  22   0.475  45   0.332  150   0.182  5000   0.031
9    0.730  16   0.558  23   0.468  50   0.317  200   0.158  10^4   0.022
10   0.698  17   0.540  24   0.457  60   0.291  300   0.130  10^5   0.007
11   0.668  18   0.528  25   0.446  70   0.267  400   0.112  10^6   0.002
Under normal data
60
APPENDIX: EXACT CRITICAL VALUES FOR PEARSON
Cut-Off Points for rP (two-sided Ξ±=0.05 or one-sided Ξ±=0.025)
n    c      n    c      n    c      n    c      n     c      n      c
5    0.878  12   0.576  19   0.456  30   0.361  80    0.220  500    0.088
6    0.811  13   0.553  20   0.444  35   0.334  90    0.207  1000   0.062
7    0.754  14   0.532  21   0.433  40   0.312  100   0.197  2000   0.044
8    0.707  15   0.514  22   0.423  45   0.294  150   0.160  5000   0.028
9    0.666  16   0.497  23   0.413  50   0.279  200   0.139  10^4   0.020
10   0.632  17   0.482  24   0.404  60   0.254  300   0.113  10^5   0.006
11   0.602  18   0.468  25   0.396  70   0.235  400   0.098  10^6   0.002
π‘Ÿ =
𝑑;/,
t𝑛 βˆ’ 2 + 𝑑;/,
,
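The exact-cutoff formula can be checked numerically against the table. A sketch, assuming scipy is available (`exact_pearson_cutoff` is an illustrative name):

```python
import math
from scipy.stats import t

def exact_pearson_cutoff(n, alpha=0.05):
    """Exact two-sided cutoff for |rP| under H0: rho = 0, via
    r = t_{alpha/2} / sqrt(n - 2 + t_{alpha/2}^2), t with n - 2 df."""
    tq = t.ppf(1.0 - alpha / 2.0, n - 2)
    return tq / math.sqrt(n - 2 + tq ** 2)
```

For example, n=10 gives 0.632 and n=30 gives 0.361, matching the table entries.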
61
Figure 9. Example n=1000, r =0.5, standardized points
Blue points (near the y=x line) maximize the 1st component; red points
maximize the 2nd component. In the histogram, the selected points are the
red points. The red and blue sets may overlap or be disjoint.
62
Figure 10. A significant Pearson coefficient with
insignificant components
e.g., n=50: r2 < 0.335, v2 < 0.404, v2 > 0.7512 βˆ’ 1.751938Β·r2, and
0.279 < 0.64Β·r2 + 0.36Β·v2 < 1.
This case occurs inside the triangle, so it is possible.
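One concrete point inside that triangle, using the n=50 bounds quoted above (the pair (r2, v2) = (0.30, 0.38) is hypothetical, chosen only to show the region is non-empty):

```python
# Both components below their cutoffs, yet the Pearson combination
# 0.64*r2 + 0.36*v2 clears the Pearson bound (n=50 bounds from the slide).
r2, v2 = 0.30, 0.38
rP = 0.64 * r2 + 0.36 * v2   # = 0.3288
inside = (r2 < 0.335) and (v2 < 0.404) \
         and (v2 > 0.7512 - 1.751938 * r2) and (0.279 < rP < 1.0)
```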
63
PYTHON CODE FOR THE COEFFICIENTS & PERMUTATION TESTS
# =============================================================================
import random
import numpy as np
from statsmodels.distributions.empirical_distribution import ECDF
# =============================================================================
def r1(x,y): # Computes the coefficient r1
nxy = x.shape[0]
xm = np.mean(x)*np.ones(nxy); ym = np.mean(y)*np.ones(nxy)
dmx = np.mean(abs(x-xm)); dmy = np.mean(abs(y-ym))
x = (x-xm)/dmx; y = (y-ym)/dmy; del xm, ym, dmx, dmy
dxyp = (np.mean(abs(x-y))); dxyn = (np.mean(abs(x+y)))
if ( (dxyp < dxyn) & (dxyp < np.sqrt(2)) ):
rc1 = 1-dxyp**2 / 2
elif ( (dxyp > dxyn) & (dxyn < np.sqrt(2)) ):
rc1 = -( 1-dxyn**2 / 2 )
else:
rc1 = 0
return rc1
# =============================================================================
def pvr1(x,y): # p-value for r1 by permutation tests
rcc1 = r1(x,y)
nper=1000
dr = np.zeros(nper)
yh = y.tolist()
for ii in range(0,nper): # Permutation
yr = np.array( random.sample( yh, len(yh) ) )
dr[ii] = r1(x,yr)
ecdfr = ECDF(dr[:])
return ecdfr(-abs(rcc1)) + 1 - ecdfr(abs(rcc1))
# =============================================================================
def v1(x,y): # Computes the coefficient v1
nxy = x.shape[0]
xm = np.mean(x)*np.ones(nxy); ym = np.mean(y)*np.ones(nxy)
dmx = np.mean(abs(x-xm)); dmy = np.mean(abs(y-ym))
64
x = (x-xm)/dmx; y = (y-ym)/dmy; del xm, ym, dmx, dmy
dvp = np.mean( abs( abs(x-y) - np.mean(abs(x-y)) ) )
dvn = np.mean( abs( abs(x+y) - np.mean(abs(x+y)) ) )
plim = 0.731606694191509
if ( (dvp < dvn) & (dvp < np.sqrt(plim)) ):
vc1 = 1 - ( dvp )**2 /plim
elif ( (dvp > dvn) & (dvn < np.sqrt(plim)) ):
vc1 = -( 1 - ( dvn )**2 /plim )
else:
vc1 = 0
return vc1
# =============================================================================
def pvv1(x,y): # p-value for v1 by permutation tests
vcc1 = v1(x,y)
nper=1000
dr = np.zeros(nper)
yh = y.tolist()
for ii in range(0,nper): # Permutation
yr = np.array( random.sample( yh, len(yh) ) )
dr[ii] = v1(x,yr)
ecdfr = ECDF(dr[:])
return ecdfr(-abs(vcc1)) + 1 - ecdfr(abs(vcc1))
# ===============================================================================
def outlier(var,kk):
Q1, Q3 = np.percentile(var, [25,75])
IQR = Q3 - Q1
ul = Q3+kk*IQR
ll = Q1-kk*IQR
outliers = ((var > ul) | (var < ll))
return outliers
# ==============================================================================
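A standalone check of the kΒ·IQR rule, restating `outlier` so this sketch runs on its own (the toy data array is illustrative):

```python
import numpy as np

def outlier(var, kk):
    # Same k*IQR rule as in the listing above.
    Q1, Q3 = np.percentile(var, [25, 75])
    IQR = Q3 - Q1
    return (var > Q3 + kk * IQR) | (var < Q1 - kk * IQR)

data = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 100.0])
mask = outlier(data, 1.5)   # flags only the extreme value 100
```

Removing such univariate outliers before computing r1 and v1 is what makes the proposed coefficients robust to influential points, as discussed in the outliers slide.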
65
https://www.linkedin.com/in/savas-papadopoulos-ph-d-564a1899/

More Related Content

Similar to A NEW CORRELATION COEFFICIENT AND A DECOMPOSITION OF THE PEARSON COEFFICIENT

Assessing relative importance using rsp scoring to generate
Assessing relative importance using rsp scoring to generateAssessing relative importance using rsp scoring to generate
Assessing relative importance using rsp scoring to generateDaniel Koh
Β 
Assessing Relative Importance using RSP Scoring to Generate VIF
Assessing Relative Importance using RSP Scoring to Generate VIFAssessing Relative Importance using RSP Scoring to Generate VIF
Assessing Relative Importance using RSP Scoring to Generate VIFDaniel Koh
Β 
Les5e ppt 09
Les5e ppt 09Les5e ppt 09
Les5e ppt 09Subas Nandy
Β 
Scatterplots, Correlation, and Regression
Scatterplots, Correlation, and RegressionScatterplots, Correlation, and Regression
Scatterplots, Correlation, and RegressionLong Beach City College
Β 
correlation-analysis.pptx
correlation-analysis.pptxcorrelation-analysis.pptx
correlation-analysis.pptxSoujanyaLk1
Β 
correlation-analysis-160424020323.pptx
correlation-analysis-160424020323.pptxcorrelation-analysis-160424020323.pptx
correlation-analysis-160424020323.pptxSoujanyaLk1
Β 
Correlation and Regression ppt
Correlation and Regression pptCorrelation and Regression ppt
Correlation and Regression pptSantosh Bhaskar
Β 
Correlation analysis
Correlation analysisCorrelation analysis
Correlation analysisShiela Vinarao
Β 
Correlation and regression
Correlation and regressionCorrelation and regression
Correlation and regressionAbdelaziz Tayoun
Β 
Comparing the methods of Estimation of Three-Parameter Weibull distribution
Comparing the methods of Estimation of Three-Parameter Weibull distributionComparing the methods of Estimation of Three-Parameter Weibull distribution
Comparing the methods of Estimation of Three-Parameter Weibull distributionIOSRJM
Β 
special-correlation.ppt
special-correlation.pptspecial-correlation.ppt
special-correlation.pptLovelyCamposano1
Β 
Artificial Intelligence (Unit - 8).pdf
Artificial Intelligence   (Unit  -  8).pdfArtificial Intelligence   (Unit  -  8).pdf
Artificial Intelligence (Unit - 8).pdfSathyaNarayanan47813
Β 
Correlation.pptx
Correlation.pptxCorrelation.pptx
Correlation.pptxShivakumar B N
Β 
What is Karl Pearson Correlation Analysis and How Can it be Used for Enterpri...
What is Karl Pearson Correlation Analysis and How Can it be Used for Enterpri...What is Karl Pearson Correlation Analysis and How Can it be Used for Enterpri...
What is Karl Pearson Correlation Analysis and How Can it be Used for Enterpri...Smarten Augmented Analytics
Β 
Course pack unit 5
Course pack unit 5Course pack unit 5
Course pack unit 5Rai University
Β 

Similar to A NEW CORRELATION COEFFICIENT AND A DECOMPOSITION OF THE PEARSON COEFFICIENT (20)

Correlation Analysis
Correlation AnalysisCorrelation Analysis
Correlation Analysis
Β 
Assessing relative importance using rsp scoring to generate
Assessing relative importance using rsp scoring to generateAssessing relative importance using rsp scoring to generate
Assessing relative importance using rsp scoring to generate
Β 
Assessing Relative Importance using RSP Scoring to Generate VIF
Assessing Relative Importance using RSP Scoring to Generate VIFAssessing Relative Importance using RSP Scoring to Generate VIF
Assessing Relative Importance using RSP Scoring to Generate VIF
Β 
Les5e ppt 09
Les5e ppt 09Les5e ppt 09
Les5e ppt 09
Β 
Scatterplots, Correlation, and Regression
Scatterplots, Correlation, and RegressionScatterplots, Correlation, and Regression
Scatterplots, Correlation, and Regression
Β 
Regression ppt
Regression pptRegression ppt
Regression ppt
Β 
correlation-analysis.pptx
correlation-analysis.pptxcorrelation-analysis.pptx
correlation-analysis.pptx
Β 
correlation-analysis-160424020323.pptx
correlation-analysis-160424020323.pptxcorrelation-analysis-160424020323.pptx
correlation-analysis-160424020323.pptx
Β 
Correlation and Regression ppt
Correlation and Regression pptCorrelation and Regression ppt
Correlation and Regression ppt
Β 
Correlation analysis
Correlation analysisCorrelation analysis
Correlation analysis
Β 
Correlation and regression
Correlation and regressionCorrelation and regression
Correlation and regression
Β 
Comparing the methods of Estimation of Three-Parameter Weibull distribution
Comparing the methods of Estimation of Three-Parameter Weibull distributionComparing the methods of Estimation of Three-Parameter Weibull distribution
Comparing the methods of Estimation of Three-Parameter Weibull distribution
Β 
special-correlation.ppt
special-correlation.pptspecial-correlation.ppt
special-correlation.ppt
Β 
Artificial Intelligence (Unit - 8).pdf
Artificial Intelligence   (Unit  -  8).pdfArtificial Intelligence   (Unit  -  8).pdf
Artificial Intelligence (Unit - 8).pdf
Β 
Correlation.pptx
Correlation.pptxCorrelation.pptx
Correlation.pptx
Β 
What is Karl Pearson Correlation Analysis and How Can it be Used for Enterpri...
What is Karl Pearson Correlation Analysis and How Can it be Used for Enterpri...What is Karl Pearson Correlation Analysis and How Can it be Used for Enterpri...
What is Karl Pearson Correlation Analysis and How Can it be Used for Enterpri...
Β 
Ijetcas14 608
Ijetcas14 608Ijetcas14 608
Ijetcas14 608
Β 
Course pack unit 5
Course pack unit 5Course pack unit 5
Course pack unit 5
Β 
Measure of Association
Measure of AssociationMeasure of Association
Measure of Association
Β 
Multiple linear regression
Multiple linear regressionMultiple linear regression
Multiple linear regression
Β 

Recently uploaded

zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzohaibmir069
Β 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...jana861314
Β 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
Β 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 sciencefloriejanemacaya1
Β 
Neurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trNeurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trssuser06f238
Β 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfnehabiju2046
Β 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxAArockiyaNisha
Β 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxAleenaTreesaSaji
Β 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSΓ©rgio Sacani
Β 
Genomic DNA And Complementary DNA Libraries construction.
Genomic DNA And Complementary DNA Libraries construction.Genomic DNA And Complementary DNA Libraries construction.
Genomic DNA And Complementary DNA Libraries construction.k64182334
Β 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...SΓ©rgio Sacani
Β 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
Β 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )aarthirajkumar25
Β 
Scheme-of-Work-Science-Stage-4 cambridge science.docx
Scheme-of-Work-Science-Stage-4 cambridge science.docxScheme-of-Work-Science-Stage-4 cambridge science.docx
Scheme-of-Work-Science-Stage-4 cambridge science.docxyaramohamed343013
Β 
Call Us ≽ 9953322196 β‰Ό Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 β‰Ό Call Girls In Mukherjee Nagar(Delhi) |Call Us ≽ 9953322196 β‰Ό Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 β‰Ό Call Girls In Mukherjee Nagar(Delhi) |aasikanpl
Β 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physicsvishikhakeshava1
Β 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.PraveenaKalaiselvan1
Β 
Call Girls in Munirka Delhi πŸ’―Call Us πŸ”9953322196πŸ” πŸ’―Escort.
Call Girls in Munirka Delhi πŸ’―Call Us πŸ”9953322196πŸ” πŸ’―Escort.Call Girls in Munirka Delhi πŸ’―Call Us πŸ”9953322196πŸ” πŸ’―Escort.
Call Girls in Munirka Delhi πŸ’―Call Us πŸ”9953322196πŸ” πŸ’―Escort.aasikanpl
Β 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Nistarini College, Purulia (W.B) India
Β 

Recently uploaded (20)

zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistan
Β 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Β 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
Β 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 science
Β 
Neurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trNeurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 tr
Β 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdf
Β 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Β 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptx
Β 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
Β 
Genomic DNA And Complementary DNA Libraries construction.
Genomic DNA And Complementary DNA Libraries construction.Genomic DNA And Complementary DNA Libraries construction.
Genomic DNA And Complementary DNA Libraries construction.
Β 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Β 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Β 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
Β 
Scheme-of-Work-Science-Stage-4 cambridge science.docx
Scheme-of-Work-Science-Stage-4 cambridge science.docxScheme-of-Work-Science-Stage-4 cambridge science.docx
Scheme-of-Work-Science-Stage-4 cambridge science.docx
Β 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
Β 
Call Us ≽ 9953322196 β‰Ό Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 β‰Ό Call Girls In Mukherjee Nagar(Delhi) |Call Us ≽ 9953322196 β‰Ό Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 β‰Ό Call Girls In Mukherjee Nagar(Delhi) |
Β 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physics
Β 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
Β 
Call Girls in Munirka Delhi πŸ’―Call Us πŸ”9953322196πŸ” πŸ’―Escort.
Call Girls in Munirka Delhi πŸ’―Call Us πŸ”9953322196πŸ” πŸ’―Escort.Call Girls in Munirka Delhi πŸ’―Call Us πŸ”9953322196πŸ” πŸ’―Escort.
Call Girls in Munirka Delhi πŸ’―Call Us πŸ”9953322196πŸ” πŸ’―Escort.
Β 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Β 

A NEW CORRELATION COEFFICIENT AND A DECOMPOSITION OF THE PEARSON COEFFICIENT

  • 1. 1 A NEW CORRELATION COEFFICIENT AND A DECOMPOSITION OF THE PEARSON COEFFICIENT Savas Papadopoulos Department of Financial Stability Bank of Greece 10/2/23 The views expressed are those of the author and do not necessarily reflect those of Bank of Greece Copyright Β© 2022 Savas Papadopoulos, www.protectmywork.com. All rights reserved.
  • 2. 2 CONTENTS ØABSTRACT ØINTRODUCTION: What, Why, and How we do it ØLITERATURE REVIEW: Pearson, Spearman, and Kendall ØREFORMULATION & DECOMPOSITION OF PEARSON'S COEFFICIENT ØPROPOSED MEASURES & METHODOLOGY ØMADE-UP EXAMPLES: 2 & 3-Parallel Lines - Robustness to Outliers ØSIMULATIONS WITH NORMAL DATA ØAN APPLICATION TO GDP PER CAPITA ØBut in fact CONCLUSIONS
  • 3. 3 ABSTRACT1 We decompose the Pearson correlation coefficient into two components. We recommend the first component for detecting linear relationships and the second for recognizing patterns of two parallel lines, providing robust versions to outliers. The significance of the Pearson coefficient without significant components indicates a fit of three parallel lines. Thus, we reveal the unknown aspect of the Pearson coefficient that identifies a two- or three-group association other than a group, producing Type I or III2 errors. Finally, we apply the proposed coefficients with permutation tests to simulated and real data. The proposed coefficients identify two- or three- parallel line correlations or weaker but significant relationships not recognized by the Pearson coefficient. In testing correlation hypotheses with normal data, the proposed methodology, in an objective simulation study, minimizes the total error resulting from the Person coefficient by 4%. Keywords: coefficients of correlation; Pearson decomposition; permutation tests; computational statistics; robust correlation coefficient to outliers; parallel regression lines; cluster analysis JEL CLASSIFICATION CODES: C10, C12, C14 1 [1] Papadopoulos, Savas, A Practical, Powerful, Robust and Interpretable Family of Correlation Coefficients (May 28, 2022). Available at SSRN: https://ssrn.com/abstract=4114080 or http://dx.doi.org/10.2139/ssrn.4114080 2 https://en.wikipedia.org/wiki/Type_III_error
  • 4. 4 INTRODUCTION: Motivation If we conducted a competition for which statistical quantity would be the most valuable in exploratory data analysis, the winner would most likely be the correlation coefficient with a significant difference from its first competitor. It is very striking that although the three correlation coefficients, Pearson, Spearman, and Kendall, were developed in the late 19th and early 20th centuries, and despite the rapid development of computers, the three coefficients still dominate the use.
  • 5. 5 INTRODUCTION: What we do Estimation & Hypothesis Testing for Correlation Coefficients Let the hypotheses 𝐻!: 𝜌 = 0, 𝐻": 𝜌 β‰  0 and Ξ± = P(Type I error), Ξ² = P(Type II error), Ξ³ = P(Type III error) Type I Error: Reject a TRUE H0 Type II Error: NOT Reject a FALSE H0 Type III Error3: Reject a FALSE H0 for the WRONG Reason (Correct by accident) Note that the errors are due to the data and not the correlation measures, e.g., in random data, by chance, a significant percentage of the data could be close to a line (type I error). 3 https://en.wikipedia.org/wiki/Type_III_error
  • 6. 6 INTRODUCTION: What we do Ø We propose new correlation coefficients π‘Ÿ! & 𝑣! with permutation tests for computing the p-values. Ø The proposed correlation coefficient, π‘Ÿ!, finds linear relationships not detected by the Pearson coefficient (π‘Ÿ"). Ø The proposed correlation coefficient is more robust due to multivariate outliers than the Pearson coefficient (π‘Ÿ"). Ø We reveal type III errors by using the Pearson coefficient (π‘Ÿ"). That is, the π‘Ÿ" can be significant, p-value(π‘Ÿ") < 𝛼, with an insignificant linear relationship but a significant 2 or 3 parallel-line association. Ø The proposed combination coefficients π‘Ÿ!, 𝑣!, and π‘Ÿ", significantly eliminate the total error =Ξ±+Ξ²+Ξ³ compared to π‘Ÿ" and π‘Ÿ! separately, e.g., for normal data by 4% (See Figure 1).
  • 7. 7 INTRODUCTION: What we do Figure 1: Global Simulation Results with Normal Data 0.051 0.051 0.100 0.096 0.150 0.073 0.066 0.000 0.000 0.00 0.05 0.10 0.15 0.20 0.25 Pearson r1 Combining r1, v1 & rP Total Error for Pearson=rP=0.21, for r1=0.20 & for (r1,v1,rP)=0.17 P(Type I Error) P(Type II Error) P(Type III Error)
  • 8. 8 INTRODUCTION: Why are the results important? Ø By applying π‘Ÿ", we minimize the overall error, compared to the error using the Pearson coefficient (π‘Ÿ#), and thus make more correct decisions about whether two variables are linearly dependent or not. Ø In some cases, researchers and analysts may have the illusion that there is a significant linear relationship, but the real situation is a significant 2 or 3 parallel lines. Ø The correlation coefficient, π‘Ÿ", recognizes significant linear relationships not detected by the Pearson coefficient (see the two graphs in the 1st column of Figure 5). Ø In cases with the presence of multivariate outliers π‘Ÿ" is more powerful than π‘Ÿ#. That is, more accurate decisions.
  • 9. 9 INTRODUCTION: How we do it Ø Decompose the Pearson coefficient (π‘Ÿ#) into two components (The math proof is in the appendix). 𝒓𝑷 β‰ˆ 𝟎. πŸ”πŸ’ βˆ™ π’“πŸ + 𝟎. πŸ‘πŸ” βˆ™ π’—πŸ Ø We use robust to outliers’ versions of the two components π’“πŸ & π’—πŸ. The π‘Ÿ" shows only for 1-line fit & 𝑣" indicates 1 or 2 parallel-line fit, and thus their linear combination π‘Ÿ# detects 1 or 2 or 3 parallel-line fit. Ø Viewing 0.64 & 0.36 as weights, at least 64% of π‘Ÿ# shows a 1-line fit, and at most 36% indicates a 2-line fit.
  • 10. 10 INTRODUCTION: How we do it Ø Myths and Realities Ø A Linear relationship Þ Significant Pearson coefficient, p-value(π‘Ÿ#) < 𝛼 Ø Significant Pearson coefficient ⇏ Linear Association Ø In this paper we show that: Significant Pearson coefficient (π‘Ÿ#) Þ Fit with 1 or 2 or 3 parallel lines.
  • 11. 11 INTRODUCTION: How we do it Ø Methodology: Ø If p-value(π‘Ÿ") < 𝛼 Þ a 1 line fit (see the flowchart) Ø If p-value(π‘Ÿ") β‰₯ 𝛼 & If p-value(𝑣") < 𝛼 Þ a 2 parallel- line fit Ø If p-value(π‘Ÿ#) < 𝛼 & p-value(π‘Ÿ") β‰₯ 𝛼 & If p-value(𝑣") β‰₯ 𝛼 Þ 3 parallel lines fit (based on math results) Note: The methodology only detects some 2 or 3 parallel-line fits. The significance of the method shows 1 or 2, or 3 parallel lines and not the other way around. That is if there is a 2- or 3- line fit, our methodology is not guaranteed to likely detect it.
  • 12. 12 Figure2: Flowchart of the Methodology
  • 13. 13 LITERATURE REVIEW Ø Existing literature recommends the Pearson correlation for normal data and the Spearman correlation for nonnormal data. But some global simulations, including ours, indicate that the Kendall coefficient outperforms the Pearson and Spearman coefficients. Ø Data-analysis software typically computes three classic correlation coefficients, Pearson’s, Spearman’s, and Kendall’s.
  • 14. 14 LITERATURE REVIEW: Rankings Ø There is certainly a lot of overlap between significant correlation in face values and significant correlation in rankings Ø But significant correlation in rankings ⇏ Significant correlation in face values Ø And significant correlation in face values ⇏ Significant correlation in rankings Ø Therefore, we lose power by using rankings Ø We can plugin rankings to π‘Ÿ! 4 and change the denominator accordingly. Ø Denote 𝑅!= [π‘Ÿ! with rankings] Ø Our simulation studies, not presented here, show that 𝑅! is more powerful and robust than Spearman’s coefficient. 4 https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4276146
  • 15. 15 LITERATURE REVIEW: Kendall Coefficient The Kendall coefficient (π‘Ÿ") just shows: The Percentage of monotonic pairs= !#|%!| & (a novel approach) from all possible point pairs 'βˆ™('*!) & = 𝑖 + 𝑑, where π‘Ÿ" = 2 βˆ™ ,*- 'βˆ™('*!) 𝑖 = pairs with increasing pattern 𝑑 = pairs with decreasing pattern Ø Thus, we correct that the π‘Ÿ" is not only a rank correlation coefficient; π‘Ÿ" is the same for nominal values and no-repeat rankings. Ø π‘Ÿ" does not show how close the points are to a curve; the points could be close to multi-curves. Ø Our simulations5 show that π‘Ÿ" reveals relationships with one-sided multivariate outliers not signaled by Pearson or Spearman coefficients. 5 https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4276146
  • 16. 16 LITERATURE REVIEW: Outliers Ø The Pearson coefficient (π‘Ÿ#) is known to be sensitive to outliers, so some researchers suggest using rank coefficients, e.g., Spearman & Kendall. Ø Alternatively, we recommend removing univariate outliers, e.g., using the simple π‘˜ βˆ™ 𝐼𝑄𝑅 criterion, thus eliminating influential points. Ø After removing univariate outliers, we cannot have extreme multivariate outliers. Near the most extreme cases are the made-up examples in Figure 4. Ø Note that with rankings, there are never univariate outliers.
  • 17. 17 MAIN RESULT: THE DECOMPOSITION We proved the following meaningful decomposition: Pearson = 𝒓𝑷 = 𝟐 𝝅 βˆ™ 𝒓±,𝟐 + π…βˆ’πŸ 𝝅 βˆ™ π’—πŸ β‰ˆ 𝟎. πŸ”πŸ’ βˆ™ 𝒓±,𝟐 + 𝟎. πŸ‘πŸ” βˆ™ π’—πŸ π‘ŸΒ±,& = Β± ;1 βˆ’ 12" ($) βˆ“3" ($) 1 444444444444444$ 5/7 >, 𝑣±,& = Β± A1 βˆ’ 8 &' ( ($) βˆ“* ( ($) & $ &βˆ™9!* $ + : B (The mathematical proof is in the appendix). STANDARDIZED VARIABLES: 𝑧, (;) = <(*< |<"*<|𝑠 44444444441/𝑠 , 𝑠 = 1,2; 𝑖 = 1,2, β‹― , 𝑛 Symbolize + for positive and – for negative relationship Symbolize by 𝑧 the sample mean, by |π‘Ž| ≑ absolute value and by s the population variance.
  • 18. 18 MAIN RESULT: The Proposed Coefficients Instead of π‘ŸΒ±,& and 𝑣±,& we use the following robust versions: Proposed corr. Coef. π‘Ÿ! for one line: π‘ŸΒ±,! = Β± ;1 βˆ’ 12" (,) βˆ“3" (,) 1 444444444444444$ & > Corr. Coef. 𝑣!for two-parallel lines: 𝑣±,! = Β± A1 βˆ’ MAD &' ( (,) βˆ“* ( (,) & $ @ B 𝑐 = 0.73161 = Μ‡ 𝐸 mMAD 12( (,) βˆ“3( (,) 1 & o (Numerical result) π‘Ÿ! = ⎩ βŽͺ ⎨ βŽͺ βŽ§π‘Ÿ#,!, if uπ‘₯B (!) βˆ’ 𝑦B (!) u xxxxxxxxxxxxxxxx < uπ‘₯B (!) + 𝑦B (!) u xxxxxxxxxxxxxxxx & uπ‘₯B (!) βˆ’ 𝑦B (!) u xxxxxxxxxxxxxxxx < √2, π‘Ÿ*,!, if uπ‘₯B (!) + 𝑦B (!) u xxxxxxxxxxxxxxxx < uπ‘₯B (!) βˆ’ 𝑦B (!) u xxxxxxxxxxxxxxxx & uπ‘₯B (!) + 𝑦B (!) u xxxxxxxxxxxxxxxx < √2, 0, otherwise
  • 19. 19 𝑣! = ⎩ βŽͺ ⎨ βŽͺ βŽ§π‘£#,!, if uπ‘₯B (!) βˆ’ 𝑦B (!) u xxxxxxxxxxxxxxxx < uπ‘₯B (!) + 𝑦B (!) u xxxxxxxxxxxxxxxx & 𝑀𝐴𝐷12( (,) *3( (,) 1 < βˆšπ‘, 𝑣*,!, if uπ‘₯B (!) + 𝑦B (!) u xxxxxxxxxxxxxxxx < uπ‘₯B (!) βˆ’ 𝑦B (!) u xxxxxxxxxxxxxxxx & 𝑀𝐴𝐷12( (,) #3( (,) 1 < βˆšπ‘, 0, otherwise MAD=mean absolute deviation over i. First, we standardize the x and y values, Second, we infer whether there is a positive or negative relationship by comparing the mean distances from the y=x and y=-x lines (see Figure 8). Third, we rescale the mean squared and variance of the distances by dividing by their expected values and then subtracting the results from 1. Finally, we define π‘Ÿ!and 𝑣! so that we have the following properties:
  • 20. 20 PROPERTIES FOR π’“πŸ & π’—πŸ (two-parallel lines) Ø βˆ’1 ≀ π‘Ÿ& ≀ 1 & βˆ’1 ≀ 𝑣& ≀ 1 due to the restrictions in the definitions. Ø An exact value of +1 or -1 for π‘Ÿ& indicates a perfect positive or negative relationship and vice versa, since the mean distance of the standardized points is 0 and so all points fall on the line y=x or on the line y=-x. Ø An exact value of +1 or -1 for π’—πŸ implies (two-parallel lines): |𝑣&| = 1 ⟺ MAD()" ($) βˆ“+" ($) ( , = 0 ⟺ βˆ€ 𝑖, ?π‘₯- (&) βˆ“ 𝑦- (&) ? = 𝑐 ⟺
  • 21. 21 ⟺ βˆ€ 𝑖, 𝑦- (&) = Β±π‘₯0,- (&) + 𝑐 or 𝑦0,- (&) = Β±π‘₯0,- (&) βˆ’ 𝑐 (continued) Two symmetrical lines, y=x+c and y=x-c around the line y=x or two symmetrical lines, y=-x+c and y=-x-c around the line y=-x. In the special case for c=0 the points fall on the line y=x or on the line y=-x. Ø A correlation value close to 0 or 0 indicates no relationship. Ø The closer to +1 or -1 the coefficient, π‘Ÿ!, the stronger a one-line association, assuming no influential points. Ø The closer to +1 or -1 the coefficient, 𝑣!, the stronger a one-line or a two-line association, assuming no influential points.
  • 22. 22 Two- or Three parallel lines The significance of the Pearson correlation coefficient may be misleading if some standardized points are close to the line y=x or y=-x, and other standardized points are close to the lines y=x+c and y=x-c or close to the lines y=-x+c and y=-x-c. The 1st set of points magnifies the 1st component π‘Ÿ& towards 1 or -1, and the 2nd set stretches the 2nd component 𝑣& towards 1. The linear combination 𝒓𝑷 β‰ˆ 𝟎. πŸ”πŸ’ βˆ™ π’“πŸ + 𝟎. πŸ‘πŸ” βˆ™ π’—πŸ can give a significant value for the Pearson coefficient 𝒓𝑷 because most points are close to 2 or 3 lines and not necessarily close to 1 line, as known in the literature. We justify this with simulations and real data later.
  • 23. 23 Two- or Three-Parallel Lines (cont.)
Without loss of generality, we assume univariate outlier removal, variable standardization, and a positive relationship.
If r2 ⟶ 1 (close to 1), then mean|x^(2) − y^(2)| ⟶ 0, and thus the points are close to the line y = x.
If v2 ⟶ 1, then σ(|x^(2) − y^(2)|) ⟶ 0, i.e., |x_i^(2) − y_i^(2)| ≈ c, and thus the points are close to the parallel lines y = x + c and y = x − c, symmetric to the line y = x (shown earlier).
If r_P ⟶ 1, then r2 ⟶ 1 & v2 ↛ 1 (a 1-line fit only) or r2 ↛ 1 & v2 ⟶ 1 (a 2-line fit only). In the latter case, we cannot have a single line because otherwise r2 ⟶ 1.
  • 24. 24 Two- or Three-Parallel Lines (cont.)
or r2 ⟶ 1 & v2 ⟶ 1 (for c = 0, we have a 1-line fit, and for c >> 0, a 3-line fit).
Example with 1 line: A perfect single-line fit with r_P = r1 = v1 = 1.
Example with 3 parallel lines: In Figure 3, see the 2nd example with 3 lines. In this example, all three coefficients are large: r_P = 0.71, r1 = 0.66, v1 = 0.79. We have three points exactly on the middle line and six points on two symmetric lines around the middle line, 3 points on each line.
  • 25. 25 Three-Parallel Lines (cont.)
The significance of the Pearson coefficient with insignificant components indicates a fit of three parallel lines. The logic is that the insignificance of the two components indicates that the points are not close to one line or to two parallel lines; otherwise, the components would be significant. Therefore, the significance of the Pearson coefficient in this case means that some points are close to a straight line, and others are close to two symmetrical lines around the first line. Another example is the 3rd plot of Figure 5: r_P = 0.16, r1 = 0.12, v1 = 0.25, with p-values 0.02, 0.07, and 0.07, respectively. The existence of such cases is shown in Figure 10.
  • 26. 26 Two- or Three-Parallel Lines: Mean and SD Relationship for Positive Numbers
In general, the sizes of the Mean (M) and the Standard Deviation (SD) are independent of each other. But for positive numbers, as is the case with the vertical projections (VP) |x_i^(1) − y_i^(1)|, a small M implies a small SD (in the absence of outliers), but not vice versa. A small M indicates small positive numbers close together and, therefore, a small SD. A small SD can also occur with large positive numbers, so M can be large.
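This asymmetry can be illustrated numerically (a minimal sketch of my own; the samples and seed are invented for the example). For nonnegative data, Var = E[x²] − M² ≤ max(x)·M, so a small M forces a small SD, while a small SD says nothing about the size of M:

```python
import numpy as np

rng = np.random.default_rng(0)
small = rng.uniform(0.0, 0.1, size=1000)          # small positives: small M
large = 100.0 + rng.uniform(0.0, 0.1, size=1000)  # same spread, large M

for v in (small, large):
    M, SD = v.mean(), v.std()
    # Variance bound for nonnegative data: SD <= sqrt(M * max(x))
    assert SD <= np.sqrt(M * v.max())

# Both samples have (nearly) the same small SD but very different means,
# showing that a small SD does not imply a small M.
```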
  • 27. 27
Ø A small M implies small VPs, |x_i^(1) − y_i^(1)| ≈ M, or points close to the lines y = x − M and y = x + M. Since M is small, the points can be considered close to the line y = x.
Ø A small SD (significant v1) implies VPs close to M, |x_i^(1) − y_i^(1)| ≈ M = mean|x^(1) − y^(1)|, or points close to the lines y = x − M and y = x + M. If M is large (insignificant r1), then we have a two-line fit.
Ø For large M & SD (insignificant r1 & v1) with significant r_P, a large portion of the data minimizes M & SD and therefore maximizes r_P. In addition, another large portion of points maximizes M & SD, since otherwise r1 & v1 would be significant, and therefore minimizes r1 & v1; see Figure 9.
  • 28. 28 Figure 3: MADE-UP EXAMPLES: Two & Three Parallel Lines
In both cases, the Pearson coefficient is significant (p-values of r_P: 0.02 & 0.03), giving the impression of a significant linear relationship. The proposed coefficients r1 and v1 show an insignificant linear relationship (p-values of r1: 0.09 & 0.06), a significant two-parallel-line fit in Case 1 (p-value of v1 = 0.002), and a significant three-parallel-line fit in Case 2 (insignificant r1 and v1, 0.06 & 0.06, with a significant Pearson coefficient, p-value = 0.03).
  • 29. 29 Figure 4: MADE-UP EXAMPLES: Perfect Linear Relationships with 1 & 2 Outliers
The Pearson correlation coefficient is not robust in the presence of outliers6. We consider a perfect linear relationship with ONE & TWO multivariate outliers. We use the Pearson coefficient & the proposed r1 coefficient. Only the proposed correlation, r1, recognizes the linear relationships: p-values = 0.04 & 0.02 < 0.05. 6 https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
  • 30. 30 Constructing a Correlation Measure
Let G be a distance measure between the standardized values of two paired variables, and let r±,g = ±(1 − G/L), where G →p L (convergence in probability) under independence.
r_g = { r+,g, if positive relationship & G < L,
        r−,g, if negative relationship & G < L,
        0, otherwise.
Note that the desired properties apply.
Ø e.g., in r±,g and v±,g, for g = 1, 2, the denominators of the fractions are the expected values of the numerators under the assumptions of normality and independence.
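A minimal sketch of this recipe for g = 2 (my own illustration, not the author's code): here G is the mean squared distance of the standardized points from the line y = x (or y = −x) and L = 2 is its limiting value under independence, in which case the recipe reproduces the Pearson coefficient exactly:

```python
import numpy as np

def r_from_distance(x, y):
    """General recipe r_g = +/-(1 - G/L), sketched for the g = 2 case:
    G = mean squared distance of the standardized points from y = x
    (or y = -x), and L = 2 is its limiting value under independence."""
    xs = (x - x.mean()) / x.std()   # population-SD standardization
    ys = (y - y.mean()) / y.std()
    Gp = np.mean((xs - ys) ** 2)    # squared distance from y = x
    Gn = np.mean((xs + ys) ** 2)    # squared distance from y = -x
    if Gp < Gn and Gp < 2:
        return 1 - Gp / 2           # positive relationship
    if Gn < Gp and Gn < 2:
        return -(1 - Gn / 2)        # negative relationship
    return 0.0

rng = np.random.default_rng(3)
x = rng.normal(size=300)
y = 0.4 * x + rng.normal(size=300)
# For g = 2, the recipe coincides with the Pearson coefficient:
gap = abs(r_from_distance(x, y) - np.corrcoef(x, y)[0, 1])
```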
  • 31. 31 AN INTERPRETATION OF THE CORRELATION COEFFICIENT
Ø The correlation coefficient r2 can be interpreted as the percentage change between the squared distance of the standardized values and the squared limiting distance for independent normal x and y.
Ø For example, a value of r2 = 0.5 implies a 50% reduction of the observed squared distance relative to the squared limiting distance under independence and normality.
Ø In the literature, it is known that Pearson's correlation can be viewed as a rescaled variance of the difference between standardized scores7. 7 Rodgers; Nicewander (1988). "Thirteen ways to look at the correlation coefficient" (PDF). The American Statistician. 42 (1): 59–66.
  • 32. 32 PERMUTATION TESTS: computing the p-values
Permutation tests8 have been used for hypothesis testing of correlation coefficients between two variables, x and y. First, we calculate the correlation coefficient repeatedly after shuffling the observations of the variable y while keeping the order of the observations of the variable x fixed. Then, we derive p-values from the distribution of the computed correlation coefficients. Permutation tests9 enjoy the following merits over other standard statistical tests:
• They approximate p-values very satisfactorily.
• They do not assume any particular distribution (they are distribution-free).
• They are suitable for small samples.
• They are applicable to non-random samples, e.g., time-series data.
8 https://en.wikipedia.org/wiki/Pearson_correlation_coefficient 9 Berry, K. J., Johnston, J. E., & Mielke, Jr. (Paul W.). (2018). The measurement of association: a permutation statistical approach. Springer International Publishing.
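The procedure can be sketched as follows (a minimal NumPy illustration, not the author's exact code; the data, seed, and the helper name perm_pvalue are invented for the example):

```python
import numpy as np

def perm_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a correlation coefficient:
    shuffle y, keep the order of x fixed, and compare each shuffled
    |r| with the observed |r|."""
    rng = np.random.default_rng(seed)
    r_obs = np.corrcoef(x, y)[0, 1]
    hits = 0
    for _ in range(n_perm):
        y_perm = rng.permutation(y)
        if abs(np.corrcoef(x, y_perm)[0, 1]) >= abs(r_obs):
            hits += 1
    return (hits + 1) / (n_perm + 1)   # add-one correction avoids p = 0

rng = np.random.default_rng(1)
x = rng.normal(size=60)
y = 0.6 * x + np.sqrt(1 - 0.6**2) * rng.normal(size=60)
p = perm_pvalue(x, y)   # small p: the correlation is significant
```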
  • 33. 33 SIMULATION DESIGN WITH NORMAL DATA
Ø 100,000 simulations in Python (NumPy library)
Ø Picking randomly one of the following 6 pairs (n, r) with equal probabilities (1/6):
n: 20, 35, 50, 100, 150, 200
r: 0.7, 0.6, 0.5, 0.35, 0.3, 0.25
Ø Generating n correlated bivariate data, x_i and y_i, with Pearson's coefficient r_P = r. All the data are generated as follows:
Ø x_i ∼ Normal(0, 1), u_i ∼ Normal(0, 1), independent
Ø y_i = r·x_i + √(1 − r²)·u_i
Ø Applied in Figure 5, 1st row, & Tables 1a & 1b.
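The data-generating step can be sketched as follows (a quick check that the construction yields Pearson correlation r; the sample size and seed are chosen only to make the sample estimate tight):

```python
import numpy as np

rng = np.random.default_rng(42)
r, n = 0.5, 200_000                 # large n so the sample correlation is tight
x = rng.normal(size=n)
u = rng.normal(size=n)              # independent noise
y = r * x + np.sqrt(1 - r**2) * u   # Var(y) = 1 and Corr(x, y) = r
r_hat = np.corrcoef(x, y)[0, 1]     # close to 0.5
```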
  • 34. 34 AN APPLICATION TO GDP PER CAPITA
Ø Publicly available data10 WORLD BANK
Ø N = 62 countries with GDP per capita > 10,000$ in 2020, and full annual data for 1982-2020
Ø T = 39 for the period 1982-2020. We analyze growth rates (%)
Ø (62² − 62)/2 = 1891 pairs (x, y), i.e., correlation cases
Ø No causality. Lurking variables: the global or continental economy
Ø We compare the economic growth of a country with its correlated countries by regression residuals.
Ø Applied in Figure 5, 2nd row, & Table 2. 10 https://data.worldbank.org/indicator/NY.GDP.PCAP.CD
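The pair count and the growth-rate transformation can be sketched as follows (the GDP series below is a made-up toy vector, not World Bank data):

```python
import numpy as np

N = 62
pairs = (N * N - N) // 2            # unordered (x, y) country pairs -> 1891

gdp = np.array([10_000.0, 10_400.0, 10_200.0, 11_000.0])  # toy GDP-per-capita series
growth = 100 * (gdp[1:] / gdp[:-1] - 1)   # annual growth rates (%)
```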
  • 35. 35 Result Presentation in a Graph, see Figure 5
Ø Simulated data in the 1st row and application data in the 2nd row.
Ø 1-, 2-, and 3-line fits in the 1st, 2nd, & 3rd columns, respectively.
Ø In the application, the initial sample size is n = 39, and we remove univariate outliers by the 1.5·IQR criterion. In the simulation plots, the sample size is n = 200.
Ø By p, p1, & p2 we denote the p-values of Pearson, r1, and v1, respectively.
Ø By significance, we mean that the p-value < 0.05.
  • 36. 36 Conclusions from Figure 5
Ø In the 1st column, only r1 shows a significant linear relationship. The application graph shows the robustness of r1 to non-normal data.
Ø In the 2nd column, in the 1st graph with simulated data, the Pearson coefficient is significant, and the insignificance of r1 together with the significance of v1 indicates a 2-line fit. According to the existing literature, there is a significant 1-line fit; this is a Type III error.
Ø In the 2nd column, in the 2nd graph with the application data, the Pearson coefficient is insignificant, and the insignificance of r1 together with the significance of v1 indicates a 2-line fit. According to the existing literature, there is no significant 1-line fit.
  • 37. 37 Conclusions from Figure 5 (continued)
Ø In the 3rd column, in both graphs, the Pearson coefficient is significant, and both r1 and v1 are insignificant. Thus, we conclude that there is a significant 3-line fit. According to the existing literature, there is a significant 1-line fit; this is a Type III error.
  • 38. 38 Figure 5: Simulated and Real-Data Evidence
  • 39. 39 Simulation Results (Tables 1a & 1b)
Ø See the simulation design earlier.
Ø Tables 1a (ρ = 0) & 1b (ρ ≠ 0) are 3-way tables for the 3 correlation coefficients r_P (Pearson), r1, and v1, with 2 categories each: significant or not.
Ø For ρ = 0, α = P(Type I error) for r_P is α = 5.1% (close to 5%, as expected), and for r1 it is α = 5.1% = 3.2 + 1.9. By combining r_P, r1, and v1, α = 10% = 100 − 90.0.
Ø For ρ ≠ 0, β = P(Type II error) for r_P is β = 9.6%, while for r1 it is β = 15% = 6.6 + 8.4. By combining r_P, r1, and v1, β = 7.3%.
  • 40. 40 Simulation Results (Tables 1a & 1b)
Ø For ρ ≠ 0, γ = P(Type III error) for r_P is γ = 6.6% = 4.3 + 2.3, while combining r_P, r1, and v1 gives γ = 0.
Ø With the proposed methodology, α = P(Type I error) doubles from 5% to 10%, as it detects parallel lines in addition to linear fits, but β = P(Type II error) is significantly smaller than that of Pearson's correlation, dropping from 9.6% to 7.3%, and there is no γ = P(Type III error).
Ø The above results appear in Figure 1.
Ø Thus, by combining r1, v1, and r_P, we minimize the total error compared to the total error of r_P or r1 separately.
Ø In this case, we get the smallest total error if we use the Pearson significance and exclude the Type III error, which the significance of r_P, r1, and v1 can account for. In the case of non-normal data, we can use the combination of r1, v1, and r_P.
  • 41. 41 Table 1a: Simulation (ρ = 0), N = 100,000, entries in %
Pearson       r1             v1 Significant    v1 Not Signif.     Total
Significant   Significant    1.1, 1 line       2.1, 1 line        3.2
Significant   Not Signif.    1.0, 2 lines      0.8, 3 lines       1.8
Significant   Total          2.1               2.9                5.0
Not Signif.   Significant    0.0, 1 line       1.9, 1 line        1.9
Not Signif.   Not Signif.    3.1, 2 lines      90.0, no lines     93.1
Not Signif.   Total          3.1               91.9               95.0
  • 42. 42 Table 1b: Simulation (ρ ≠ 0), N = 100,000, entries in %
Pearson       r1             v1 Significant    v1 Not Signif.     Total
Significant   Significant    66.1, 1 line      17.7, 1 line       83.8
Significant   Not Signif.    4.3, 2 lines      2.3, 3 lines       6.6
Significant   Total          70.4              20.0               90.4
Not Signif.   Significant    0.0, 1 line       1.2, 1 line        1.2
Not Signif.   Not Signif.    1.1, 2 lines      7.3, no lines      8.4
Not Signif.   Total          1.1               8.5                9.6
  • 43. 43 Application Results (Table 2)
Ø See the application information earlier.
Ø Table 2 is a 3-way table for the 3 correlation coefficients r_P (Pearson), r1, and v1, with 2 categories each: significant or not.
Ø The initial sample size is n = 39, and we remove univariate outliers by the 1.5·IQR criterion.
Ø In 4.4% = 0.4 + 4.0 of the 1891 cases, r1 recognizes a significant linear relationship not detected by the Pearson coefficient r_P.
Ø Combining r1 and v1, we show a significant 2-line fit in 4.8% = 0.6 + 4.2 of the 1891 cases, not detected otherwise.
Ø By combining r1, v1, and r_P, we detect a significant 3-line fit in 1.5% of the 1891 cases, not recognized otherwise.
  • 44. 44 Table 2: AN APPLICATION TO GDP PER CAPITA, N = 1891, entries in %
Pearson       r1             v1 Significant    v1 Not Signif.       Total
Significant   Significant    40.8, 1 line      16.9, 1 line         57.7
Significant   Not Signif.    0.6, 2 lines      1.5, 3 lines         2.1
Significant   Total          41.4              18.4                 59.8
Not Signif.   Significant    0.4, 1 line       4.0, 1 line          4.4
Not Signif.   Not Signif.    4.2, 2 lines      31.6, no 1-3 lines   35.8
Not Signif.   Total          4.6               35.6                 40.2
  • 45. 45 Figure 6: Venn Diagrams11 for the Simulation (Tables 1a & 1b): ρ = 0 and ρ ≠ 0. 11 For understanding the diagrams, see the link: https://www.easycalculation.com/algebra/venn-diagram-3sets.php
  • 46. 46 Figure 7: Venn Diagrams12 for the Application (Table 2)
1-line = r1 significant = 40.8 + 16.9 + 4.0 + 0.4
2-lines = Pearson ∩ v1 ∩ r1ᶜ = 0.6
3-lines = Pearson ∩ r1ᶜ ∩ v1ᶜ = 1.5
e.g., 57.7 = 16.9 + 40.8; Pearson ∩ r1 ∩ v1 = 40.8; Pearsonᶜ ∩ r1ᶜ ∩ v1ᶜ = 31.6
(the superscript c denotes the complement, i.e., insignificance)
12 For understanding the diagrams, see the link: https://www.easycalculation.com/algebra/venn-diagram-3sets.php
  • 47. 47 Comments on Figures 6 & 7 (Venn Diagrams)
Ø The Pearson r_P and the proposed coefficient for a linear fit, r1, AGREE by 93.2% = 90.0 + 1.1 + 2.1, by 91.1% = 66.1 + 17.7 + 7.3, and by 89.3% = 40.8 + 16.9 + 31.6 in the cases ρ = 0, ρ ≠ 0, and in the application, respectively. Roughly, there is 90% agreement and 10% disagreement.
Ø A significant TWO- or THREE-LINE FIT appears in 4.1% = 3.1 + 1.0 & 0.8% for ρ = 0, in 5.4% = 4.3 + 1.1 & 2.3% for ρ ≠ 0, and in 4.8% = 4.2 + 0.6 & 1.5% in the application, respectively.
Ø The robust version of the 2nd component of the Pearson decomposition, the coefficient v1, fails to recognize a large share of linear relationships: 17.7% in the case ρ ≠ 0 and 16.9% in the application. Thus, we recommend r1 for linear relationships and v1 for 2- and 3-parallel-line fits, in combination with r1 and r_P.
  • 48. 48 OVERALL CONCLUSIONS
Ø The significance of the Pearson correlation coefficient may imply a 1-, 2-, or 3-parallel-line fit. There is a Type III error: we may think there is a significant fit with 1 line when, in fact, a fit of 2 or 3 parallel lines exists.
Ø We provide a methodology to detect 1-, 2-, or 3-line fits.
Ø We prove a meaningful decomposition of the Pearson coefficient.
Ø The proposed r1 coefficient detects more linear relationships and has a smaller overall error than the Pearson coefficient.
  • 49. 49 OVERALL CONCLUSIONS
Ø r1 is more robust to non-normality & outliers than the Pearson coefficient.
Ø The robust version of the 2nd component of the Pearson decomposition, the coefficient v1, does not recognize many linear relationships, since it detects two parallel lines. Thus, we recommend only r1 for single-line detection.
Ø The two robust component versions, r1 and v1, of the Pearson decomposition, combined with the Pearson coefficient, can be applied by all scientists to detect accurate, significant fits of 1, 2, or 3 parallel lines.
  • 50. 50 APPENDIX: THE DECOMPOSITION
Pearson = (2/π)·r±,2 + ((π − 2)/π)·v2 ≈ 0.64·r±,2 + 0.36·v2
r±,2 = ±(1 − mean(|x^(2) ∓ y^(2)|)² / (4/π)),
v±,2 = ±(1 − σ²(|x^(2) ∓ y^(2)|) / (2·(1 − 2/π)))
With standardized values: z_i^(g) = (z_i − z̄) / (mean|z_i − z̄|^g)^(1/g), i = 1, 2, ..., n.
We denote by π the known mathematical constant, by z̄ the sample mean, by |a| the absolute value, and by σ² the population variance.
Define r2 = r+,2 for a positive relationship and r2 = r−,2 for a negative relationship, and likewise v2 = v+,2 for a positive relationship and v2 = v−,2 for a negative relationship.
  • 51. 51 APPENDIX: A Proof of the Decomposition
We write the Pearson coefficient as r_P = mean(x^(2)·y^(2)). The mean square of the standardized values is equal to 1: mean((x^(2))²) = mean((y^(2))²) = 1, and from the identity (a ∓ b)² = a² + b² ∓ 2ab we get the following reformulation of the Pearson coefficient:
r_P = mean(x^(2)·y^(2)) = ±(1 − mean((x^(2) ∓ y^(2))²)/2).
Using the known identity13 mean(z²) = mean(|z|)² + σ²(|z|), for z_i = x_i^(2) ∓ y_i^(2), and after some calculations, we get the result. 13 https://en.wikipedia.org/wiki/Root_mean_square
  • 52. 52 π‘ŸDG = Β± β€’1 βˆ’ 1 2 βˆ™ β€˜uπ‘₯B (&) βˆ“ 𝑦B (&) u xxxxxxxxxxxxxxxx& + 𝜎 12( ($) βˆ“3( ($) 1 & β€™β€œ = = Β± ”1 βˆ’ 1 2 βˆ™ β€’ 4 πœ‹ βˆ™ uπ‘₯B (&) βˆ“ 𝑦B (&) u xxxxxxxxxxxxxxxx& 4/πœ‹ + 2 βˆ™ m1 βˆ’ 2 πœ‹ o βˆ™ 𝜎 12( ($) βˆ“3( ($) 1 & 2 βˆ™ β€Ή1 βˆ’ 2 πœ‹ Ε’ Λœβ„’ = = Β± ” 2 πœ‹ + 1 βˆ’ 2 πœ‹ βˆ’ 2 πœ‹ βˆ™ uπ‘₯B (&) βˆ“ 𝑦B (&) u xxxxxxxxxxxxxxxx& 4 πœ‹ βˆ’ m1 βˆ’ 2 πœ‹ o βˆ™ 𝜎 12( ($) βˆ“3( ($) 1 & 2 βˆ™ β€Ή1 βˆ’ 2 πœ‹ Ε’ β„’ = = Β± 2 πœ‹ βˆ™ β€’1 βˆ’ uπ‘₯B (&) βˆ“ 𝑦B (&) u xxxxxxxxxxxxxxxx& 4/πœ‹ ˜ Β± m1 βˆ’ 2 πœ‹ o βˆ™ A1 βˆ’ 𝜎 12( ($) βˆ“3( ($) 1 & 2 βˆ™ β€Ή1 βˆ’ 2 πœ‹ Ε’ B = & 7 βˆ™ π‘ŸΒ±,& + 7*& 7 βˆ™ 𝑣&. ∎
  • 53. 53 APPENDIX: Notes on the Decomposition
Ø If we assume independent x_i ~ N(μx, σx²) & y_i ~ N(μy, σy²), then for the standardized values: x_i^(2) ∓ y_i^(2) ∼ N(0, 2), and therefore |x_i^(2) ∓ y_i^(2)| ~ Half-N with mean 2/√π and variance 2·(1 − 2/π) (half-normal14 distribution). We used that if x ~ N(0, σ²), then |x| ~ Half-N with mean σ·√(2/π) and variance σ²·(1 − 2/π).
Ø In r±,2 and v±,2, the denominators of the fractions are the expected values of the numerators under the assumptions of normality and independence.
14 https://en.wikipedia.org/wiki/Half-normal_distribution
  • 54. 54 APPENDIX: Notes on the Normality Assumption
Ø Although we use the normality assumption to obtain the limiting distances in the proposed coefficients, this assumption does not affect the p-values computed by permutation tests.
Ø For non-normal data, we may have some bias in estimating the correlation coefficients r1 and v1. However, the estimation of their p-values does not depend on the normality assumption. So, with non-normal data, we look at the p-values of r1 and v1.
Ø For large sample sizes and normal data, we could use already calculated critical values if computation time were an issue.
Ø For large sample sizes and non-normal data, we could use already calculated critical values of the respective rank coefficients of r1 and v1, denoted as R1 and V1, with a power compensation, if computation time were an issue.
  • 55. 55 APPENDIX: Figure 8: Standardized Values around the lines y=x and y=-x
  • 56. 56 APPENDIX: A Tip: The Parallelism between Hypothesis Testing and Proof by Contradiction
Ø In proof by contradiction, we establish a proposition by assuming that the proposition is false and arriving at a contradiction.
Ø In hypothesis testing, we accept the alternative hypothesis, H1, when the data are improbable (small p-value) under the assumption that H1 is false (i.e., that H0 is true).
  • 57. 57 APPENDIX: THE CRITICAL VALUES AND P-VALUES FOR r1 ARE COMPUTED BY PERMUTATION TESTS
Cut-off points for r1 (two-sided α = 0.05 or one-sided α = 0.025), under normality:
n     c        n     c        n     c        n     c        n      c        n      c
5     0.924    12    0.662    19    0.543    30    0.437    80     0.278    500    0.113
6     0.868    13    0.637    20    0.528    35    0.409    90     0.262    1000   0.080
7     0.819    14    0.617    21    0.517    40    0.386    100    0.249    2000   0.057
8     0.780    15    0.602    22    0.505    45    0.363    150    0.205    5000   0.036
9     0.742    16    0.582    23    0.496    50    0.346    200    0.179    10^4   0.026
10    0.715    17    0.570    24    0.487    60    0.318    300    0.146    10^5   0.008
11    0.687    18    0.556    25    0.477    70    0.296    400    0.126    10^6   0.001
Example: In a case with data not rejected as normal and n = 90, we observe r1 = 0.21. Then we cannot reject H0: ρ = 0 against H1: ρ ≠ 0 at α = 0.05, since |r1| = 0.21 < c1 = 0.262.
  • 58. 58 APPENDIX: THE CRITICAL VALUES AND P-VALUES FOR v1 ARE COMPUTED BY PERMUTATION TESTS
Cut-off points for v1 (two-sided α = 0.05 or one-sided α = 0.025), under normal data:
n     c        n     c        n     c        n     c        n      c        n      c
5     0.962    12    0.751    19    0.629    30    0.522    80     0.337    500    0.141
6     0.932    13    0.732    20    0.616    35    0.490    90     0.319    1000   0.100
7     0.893    14    0.709    21    0.605    40    0.461    100    0.303    2000   0.072
8     0.860    15    0.692    22    0.595    45    0.435    150    0.251    5000   0.046
9     0.830    16    0.675    23    0.582    50    0.415    200    0.219    10^4   0.033
10    0.801    17    0.658    24    0.573    60    0.384    300    0.180    10^5   0.011
11    0.775    18    0.644    25    0.560    70    0.358    400    0.158    10^6   0.003
  • 59. 59 APPENDIX: THE CRITICAL VALUES FOR PEARSON COMPUTED BY PERMUTATION TESTS
Cut-off points for r_P (two-sided α = 0.05 or one-sided α = 0.025), under normal data:
n     c        n     c        n     c        n     c        n      c        n      c
5     0.923    12    0.640    19    0.510    30    0.408    80     0.250    500    0.100
6     0.866    13    0.618    20    0.497    35    0.379    90     0.236    1000   0.071
7     0.816    14    0.591    21    0.488    40    0.353    100    0.223    2000   0.050
8     0.771    15    0.576    22    0.475    45    0.332    150    0.182    5000   0.031
9     0.730    16    0.558    23    0.468    50    0.317    200    0.158    10^4   0.022
10    0.698    17    0.540    24    0.457    60    0.291    300    0.130    10^5   0.007
11    0.668    18    0.528    25    0.446    70    0.267    400    0.112    10^6   0.002
  • 60. 60 APPENDIX: EXACT CRITICAL VALUES FOR PEARSON
Cut-off points for r_P (two-sided α = 0.05 or one-sided α = 0.025):
n     c        n     c        n     c        n     c        n      c        n      c
5     0.878    12    0.576    19    0.456    30    0.361    80     0.220    500    0.088
6     0.811    13    0.553    20    0.444    35    0.334    90     0.207    1000   0.062
7     0.754    14    0.532    21    0.433    40    0.312    100    0.197    2000   0.044
8     0.707    15    0.514    22    0.423    45    0.294    150    0.160    5000   0.028
9     0.666    16    0.497    23    0.413    50    0.279    200    0.139    10^4   0.020
10    0.632    17    0.482    24    0.404    60    0.254    300    0.113    10^5   0.006
11    0.602    18    0.468    25    0.396    70    0.235    400    0.098    10^6   0.002
r = t_{α/2} / √(n − 2 + t²_{α/2})
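The formula can be evaluated directly (a sketch; it uses scipy, which the deck does not otherwise rely on):

```python
import numpy as np
from scipy.stats import t

def pearson_cutoff(n, alpha=0.05):
    """Exact two-sided cutoff for r_P under normality:
    r = t_{alpha/2} / sqrt(n - 2 + t_{alpha/2}^2), with df = n - 2."""
    tc = t.ppf(1 - alpha / 2, df=n - 2)
    return tc / np.sqrt(n - 2 + tc ** 2)
```

For instance, pearson_cutoff(10) reproduces the table value 0.632 and pearson_cutoff(100) reproduces 0.197.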
  • 61. 61 Figure 9. Example: n = 1000, r = 0.5, standardized points. Blue points (near the y = x line) maximize the 1st component; red points maximize the 2nd component. In the histogram, the selected points are the red points. Red and blue points may overlap or be disjoint.
  • 62. 62 Figure 10. A significant Pearson coefficient with insignificant components. e.g., n = 50: r2 < 0.335, v2 < 0.404, v2 > 0.7512 − 1.751938·r2, and 0.279 < 0.64·r2 + 0.36·v2 < 1. This case occurs in the triangle, so it is possible.
  • 63. 63 PYTHON CODE FOR THE COEFFICIENTS & PERMUTATION TESTS
# =============================================================================
import random
import numpy as np
from statsmodels.distributions.empirical_distribution import ECDF
# =============================================================================
def r1(x, y):  # Computes the coefficient r1
    nxy = x.shape[0]
    xm = np.mean(x)*np.ones(nxy); ym = np.mean(y)*np.ones(nxy)
    dmx = np.mean(abs(x-xm)); dmy = np.mean(abs(y-ym))
    x = (x-xm)/dmx; y = (y-ym)/dmy; del xm, ym, dmx, dmy
    dxyp = np.mean(abs(x-y)); dxyn = np.mean(abs(x+y))
    if (dxyp < dxyn) & (dxyp < np.sqrt(2)):
        rc1 = 1 - dxyp**2 / 2
    elif (dxyp > dxyn) & (dxyn < np.sqrt(2)):
        rc1 = -(1 - dxyn**2 / 2)
    else:
        rc1 = 0
    return rc1
# =============================================================================
def pvr1(x, y):  # p-value for r1 by permutation tests
    rcc1 = r1(x, y)
    nper = 1000
    dr = np.zeros(nper)
    yh = y.tolist()
    for ii in range(0, nper):  # Permutation: shuffle y, keep x fixed
        yr = np.array(random.sample(yh, len(yh)))
        dr[ii] = r1(x, yr)
    ecdfr = ECDF(dr[:])
    return ecdfr(-abs(rcc1)) + 1 - ecdfr(abs(rcc1))
# =============================================================================
def v1(x, y):  # Computes the coefficient v1
    nxy = x.shape[0]
    xm = np.mean(x)*np.ones(nxy); ym = np.mean(y)*np.ones(nxy)
    dmx = np.mean(abs(x-xm)); dmy = np.mean(abs(y-ym))
    x = (x-xm)/dmx; y = (y-ym)/dmy; del xm, ym, dmx, dmy
    dvp = np.mean(abs(abs(x-y) - np.mean(abs(x-y))))
    dvn = np.mean(abs(abs(x+y) - np.mean(abs(x+y))))
    plim = 0.731606694191509
    if (dvp < dvn) & (dvp < np.sqrt(plim)):
        vc1 = 1 - dvp**2 / plim
    elif (dvp > dvn) & (dvn < np.sqrt(plim)):
        vc1 = -(1 - dvn**2 / plim)
    else:
        vc1 = 0
    return vc1
# =============================================================================
def pvv1(x, y):  # p-value for v1 by permutation tests
    vcc1 = v1(x, y)
    nper = 1000
    dr = np.zeros(nper)
    yh = y.tolist()
    for ii in range(0, nper):  # Permutation: shuffle y, keep x fixed
        yr = np.array(random.sample(yh, len(yh)))
        dr[ii] = v1(x, yr)
    ecdfr = ECDF(dr[:])
    return ecdfr(-abs(vcc1)) + 1 - ecdfr(abs(vcc1))
# =============================================================================
def outlier(var, kk):  # Flags univariate outliers by the kk*IQR criterion
    Q1, Q3 = np.percentile(var, [25, 75])
    IQR = Q3 - Q1
    ul = Q3 + kk*IQR
    ll = Q1 - kk*IQR
    outliers = ((var > ul) | (var < ll))
    return outliers
# =============================================================================