CONTENTS
• ABSTRACT
• INTRODUCTION: What, Why, and How we do it
• LITERATURE REVIEW: Pearson, Spearman, and Kendall
• REFORMULATION & DECOMPOSITION OF PEARSON'S COEFFICIENT
• PROPOSED MEASURES & METHODOLOGY
• MADE-UP EXAMPLES: 2- & 3-Parallel Lines; Robustness to Outliers
• SIMULATIONS WITH NORMAL DATA
• AN APPLICATION TO GDP PER CAPITA
• CONCLUSIONS
ABSTRACT [1]
We decompose the Pearson correlation coefficient into two components. We recommend the first component for detecting linear relationships and the second for recognizing patterns of two parallel lines, and we provide versions of both that are robust to outliers. A significant Pearson coefficient whose two components are both insignificant indicates a fit of three parallel lines. We thus reveal an unknown aspect of the Pearson coefficient: it may identify a two- or three-group association rather than a single linear one, producing Type I or III [2] errors. Finally, we apply the proposed coefficients with permutation tests to simulated and real data. The proposed coefficients identify two- or three-parallel-line correlations, as well as weaker but significant relationships not recognized by the Pearson coefficient. In testing correlation hypotheses with normal data, the proposed methodology, in an objective simulation study, reduces the total error of the Pearson coefficient by 4 percentage points.
Keywords: coefficients of correlation; Pearson decomposition; permutation tests; computational
statistics; robust correlation coefficient to outliers; parallel regression lines; cluster analysis
JEL CLASSIFICATION CODES: C10, C12, C14
[1] Papadopoulos, Savas, A Practical, Powerful, Robust and Interpretable Family of Correlation Coefficients (May 28, 2022). Available at SSRN: https://ssrn.com/abstract=4114080 or http://dx.doi.org/10.2139/ssrn.4114080
[2] https://en.wikipedia.org/wiki/Type_III_error
INTRODUCTION: Motivation
If we held a competition for the most valuable statistical quantity in exploratory data analysis, the winner would most likely be the correlation coefficient, by a wide margin over its closest competitor. It is very striking that although the three correlation coefficients, Pearson, Spearman, and Kendall, were developed in the late 19th and early 20th centuries, and despite the rapid development of computers, these three coefficients still dominate in practice.
INTRODUCTION: What we do
Estimation & Hypothesis Testing for Correlation Coefficients
Let the hypotheses be H0: ρ = 0, H1: ρ ≠ 0, and
α = P(Type I error), β = P(Type II error), γ = P(Type III error)
Type I error: rejecting a TRUE H0
Type II error: NOT rejecting a FALSE H0
Type III error [3]: rejecting a FALSE H0 for the WRONG reason (correct by accident)
Note that the errors are due to the data and not to the correlation measures; e.g., in random data, by chance, a significant percentage of the data could be close to a line (Type I error).
[3] https://en.wikipedia.org/wiki/Type_III_error
INTRODUCTION: What we do
• We propose new correlation coefficients r1 & v1 with permutation tests for computing the p-values.
• The proposed correlation coefficient r1 finds linear relationships not detected by the Pearson coefficient (rP).
• The proposed correlation coefficient is more robust to multivariate outliers than the Pearson coefficient (rP).
• We reveal Type III errors made by using the Pearson coefficient (rP). That is, rP can be significant, p-value(rP) < α, with an insignificant linear relationship but a significant 2- or 3-parallel-line association.
• The combination of the proposed coefficients r1, v1, and rP significantly reduces the total error = α + β + γ compared to rP and r1 separately, e.g., by 4 percentage points for normal data (see Figure 1).
INTRODUCTION: What we do
Figure 1: Global Simulation Results with Normal Data
Total error: for Pearson rP = 0.21, for r1 = 0.20, and for combining (r1, v1, rP) = 0.17.
[Bar chart of error probabilities:]
              P(Type I)   P(Type II)   P(Type III)
Pearson rP    0.051       0.096        0.066
r1            0.051       0.150        0.000
(r1, v1, rP)  0.100       0.073        0.000
INTRODUCTION: Why are the results important?
• By applying r1, we minimize the overall error compared to the error using the Pearson coefficient (rP), and thus make more correct decisions about whether two variables are linearly dependent or not.
• In some cases, researchers and analysts may have the illusion that there is a significant linear relationship, while the real situation is a significant fit of 2 or 3 parallel lines.
• The correlation coefficient r1 recognizes significant linear relationships not detected by the Pearson coefficient (see the two graphs in the 1st column of Figure 5).
• In cases with multivariate outliers, r1 is more powerful than rP; that is, it yields more accurate decisions.
INTRODUCTION: How we do it
• Decompose the Pearson coefficient (rP) into two components (the mathematical proof is in the appendix):
rP ≈ 0.64 · r2 + 0.36 · v2
• We use robust-to-outliers versions, r1 & v1, of the two components r2 & v2. The first component detects only a 1-line fit, and the second indicates a 1- or 2-parallel-line fit; thus their linear combination rP can reflect a 1-, 2-, or 3-parallel-line fit.
• Viewing 0.64 & 0.36 as weights, at least 64% of rP reflects a 1-line fit, and at most 36% indicates a 2-line fit.
INTRODUCTION: How we do it
• Myths and Realities
• A linear relationship ⇒ a significant Pearson coefficient, p-value(rP) < α
• A significant Pearson coefficient ⇏ a linear association
• In this paper we show that:
A significant Pearson coefficient (rP) ⇒ a fit with 1, 2, or 3 parallel lines.
INTRODUCTION: How we do it
• Methodology:
• If p-value(r1) < α ⇒ a 1-line fit (see the flowchart)
• If p-value(r1) ≥ α & p-value(v1) < α ⇒ a 2-parallel-line fit
• If p-value(rP) < α & p-value(r1) ≥ α & p-value(v1) ≥ α ⇒ a 3-parallel-line fit (based on the mathematical results)
Note: The methodology only detects some 2- or 3-parallel-line fits. Significance under the method indicates 1, 2, or 3 parallel lines, but not the other way around; that is, if there is a 2- or 3-line fit, our methodology is not guaranteed to detect it.
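The three rules above can be written as a small decision function (a sketch; the function and argument names are ours, with p1, p2, and pP standing for the permutation p-values of r1, v1, and Pearson's rP):

```python
def line_fit_decision(p1, p2, pP, alpha=0.05):
    """Classify the fit from the p-values of r1, v1, and Pearson's rP."""
    if p1 < alpha:                      # r1 significant
        return "1-line fit"
    if p2 < alpha:                      # r1 insignificant, v1 significant
        return "2-parallel-line fit"
    if pP < alpha:                      # only the Pearson coefficient significant
        return "3-parallel-line fit"
    return "no significant fit"
```

The ordering of the checks encodes the "≥ α" conditions implicitly: each later branch is reached only when the earlier coefficients were insignificant.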
LITERATURE REVIEW
• The existing literature recommends the Pearson correlation for normal data and the Spearman correlation for non-normal data. But some global simulations, including ours, indicate that the Kendall coefficient outperforms the Pearson and Spearman coefficients.
• Data-analysis software typically computes the three classic correlation coefficients: Pearson's, Spearman's, and Kendall's.
LITERATURE REVIEW: Rankings
• There is certainly a lot of overlap between significant correlation in face values and significant correlation in rankings.
• But significant correlation in rankings ⇏ significant correlation in face values.
• And significant correlation in face values ⇏ significant correlation in rankings.
• Therefore, we lose power by using rankings.
• We can plug rankings into r1 [4] and change the denominator accordingly.
• Denote R1 = [r1 with rankings].
• Our simulation studies, not presented here, show that R1 is more powerful and robust than Spearman's coefficient.
[4] https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4276146
LITERATURE REVIEW: Kendall Coefficient
The Kendall coefficient (τK) simply shows:
the percentage of monotonic pairs = (1 + |τK|) / 2 (a novel view)
out of all n(n−1)/2 = P + Q possible point pairs, where τK = 2(P − Q) / (n(n−1)),
P = number of pairs with an increasing pattern,
Q = number of pairs with a decreasing pattern.
• Thus, we correct the view that τK is only a rank correlation coefficient; τK is the same for the face values and their rankings without ties.
• τK does not show how close the points are to a curve; the points could be close to multiple curves.
• Our simulations [5] show that τK reveals relationships with one-sided multivariate outliers not signaled by the Pearson or Spearman coefficients.
[5] https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4276146
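The percentage-of-monotonic-pairs view can be checked directly with a tiny data set (a sketch; the data and variable names are ours, and we assume no ties):

```python
from itertools import combinations

x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]
n = len(x)
pairs = list(combinations(range(n), 2))                        # all n(n-1)/2 point pairs
P = sum((x[j] - x[i]) * (y[j] - y[i]) > 0 for i, j in pairs)   # increasing (concordant) pairs
Q = sum((x[j] - x[i]) * (y[j] - y[i]) < 0 for i, j in pairs)   # decreasing (discordant) pairs
tau = 2 * (P - Q) / (n * (n - 1))                              # Kendall's tau (no ties)
monotonic_share = max(P, Q) / len(pairs)                       # dominant-direction share
# monotonic_share equals (1 + |tau|) / 2
```

Here P = 8 and Q = 2, so τK = 0.6 and 80% of the pairs are monotonic, which is exactly (1 + 0.6)/2.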
LITERATURE REVIEW: Outliers
• The Pearson coefficient (rP) is known to be sensitive to outliers, so some researchers suggest using rank coefficients, e.g., Spearman & Kendall.
• Alternatively, we recommend removing univariate outliers, e.g., using the simple k · IQR criterion, thus eliminating influential points.
• After removing univariate outliers, we cannot have extreme multivariate outliers. Near the most extreme cases are the made-up examples in Figure 4.
• Note that with rankings, there are never univariate outliers.
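A minimal sketch of such univariate screening (the function name is ours; we use the multiplier k = 1.5, the value adopted later in the application section):

```python
import numpy as np

def remove_univariate_outliers(x, y, k=1.5):
    """Keep pairs whose x and y both lie within [Q1 - k*IQR, Q3 + k*IQR]."""
    def inliers(z):
        q1, q3 = np.percentile(z, [25, 75])
        iqr = q3 - q1
        return (z >= q1 - k * iqr) & (z <= q3 + k * iqr)
    keep = inliers(x) & inliers(y)       # drop a pair if either coordinate is an outlier
    return x[keep], y[keep]
```

Dropping the whole pair keeps x and y aligned, which matters because the correlation coefficients are computed on paired observations.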
MAIN RESULT: THE DECOMPOSITION
We proved the following meaningful decomposition:
Pearson = rP = (2/π) · r±,2 + ((π − 2)/π) · v2 ≈ 0.64 · r±,2 + 0.36 · v2
with
r±,2 = ±(1 − (mean_i |x_i^(2) − y_i^(2)|)² / (4/π)),
v±,2 = ±(1 − s²(|x^(2) − y^(2)|) / (2 − 4/π)).
(The mathematical proof is in the appendix.)
STANDARDIZED VARIABLES: z_i^(g) = (z_i − z̄) / (mean_j |z_j − z̄|^g)^(1/g), g = 1, 2; i = 1, 2, ..., n
The sign + is used for a positive and − for a negative relationship. We symbolize by z̄ the sample mean, by |z| the absolute value, and by s² the population variance.
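The decomposition can be verified numerically (a sketch; we read the (2) superscript as standardization by the population standard deviation, as in the formula above, and the helper name is ours):

```python
import numpy as np

def pearson_components(x, y):
    """Return (rP, r2, v2) with rP = (2/pi)*r2 + ((pi-2)/pi)*v2."""
    xs = (x - x.mean()) / x.std()            # g = 2 standardization
    ys = (y - y.mean()) / y.std()
    d = np.abs(xs - ys)                      # vertical projections onto y = x
    r2 = 1 - d.mean() ** 2 / (4 / np.pi)     # first component (1-line fit)
    v2 = 1 - d.var() / (2 - 4 / np.pi)       # second component (2-line fit)
    rP = np.mean(xs * ys)                    # Pearson coefficient
    return rP, r2, v2

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)
rP, r2, v2 = pearson_components(x, y)
# rP equals (2/pi)*r2 + ((pi-2)/pi)*v2, i.e. roughly 0.64*r2 + 0.36*v2
```

This version covers the positive branch (distances to y = x); the ± definitions use |x + y| for negative relationships. The identity itself is algebraic, so it holds exactly for any data.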
PROPERTIES FOR r_g & v_g (two parallel lines)
• −1 ≤ r_g ≤ 1 & −1 ≤ v_g ≤ 1, due to the restrictions in the definitions.
• An exact value of +1 or −1 for r_g indicates a perfect positive or negative relationship and vice versa, since the mean distance of the standardized points is 0, and so all points fall on the line y = x or on the line y = −x.
• An exact value of +1 or −1 for v_g implies two parallel lines:
|v_g| = 1 ⟺ MAD(|x_i^(g) − y_i^(g)|) = 0 ⟺ ∀ i, |x_i^(g) − y_i^(g)| = c ⟺ ∀ i, y_i^(g) = ±x_i^(g) + c or y_i^(g) = ±x_i^(g) − c,
i.e., two symmetrical lines, y = x + c and y = x − c, around the line y = x, or two symmetrical lines, y = −x + c and y = −x − c, around the line y = −x. In the special case c = 0, the points fall on the line y = x or on the line y = −x.
• A correlation value equal or close to 0 indicates no relationship.
• The closer the coefficient r_g is to +1 or −1, the stronger the one-line association, assuming no influential points.
• The closer the coefficient v_g is to +1 or −1, the stronger the one-line or two-line association, assuming no influential points.
Two- or Three Parallel Lines
The significance of the Pearson correlation coefficient may be misleading if some standardized points are close to the line y = x or y = −x, and other standardized points are close to the lines y = x + c and y = x − c, or close to the lines y = −x + c and y = −x − c. The first set of points pushes the first component r2 towards 1 or −1, and the second set pushes the second component v2 towards 1. The linear combination rP ≈ 0.64 · r2 + 0.36 · v2 can then give a significant value for the Pearson coefficient rP because most points are close to 2 or 3 lines, and not necessarily close to 1 line, as assumed in the literature. We justify this with simulations and real data later.
Two- or Three Parallel Lines (cont.)
Without loss of generality, we assume univariate outlier removal, variable standardization, and a positive relationship.
If r2 → 1 (close to 1), then mean_i |x_i^(2) − y_i^(2)| → 0, and thus the points are close to the line y = x.
If v2 → 1, then s²(|x^(2) − y^(2)|) → 0, i.e., |x_i^(2) − y_i^(2)| ≈ c, and thus the points are close to the parallel lines y = x + c and y = x − c, symmetric around the line y = x (shown earlier).
If rP → 1, then r2 → 1 & v2 ↛ 1 (a 1-line fit only), or r2 ↛ 1 & v2 → 1 (a 2-line fit only; in this case, we cannot have a single line because otherwise r2 → 1),
Two- or Three Parallel Lines (cont.)
or r2 → 1 & v2 → 1 (for c = 0 we have a 1-line fit, and for c ≫ 0 a 3-line fit).
Example with 1 line: a perfect single-line fit has rP = r2 = v2 = 1.
Example with 3 parallel lines: see the 2nd example of Figure 3, with 3 lines. In this example, all three coefficients are large: rP = 0.71, r2 = 0.66, v2 = 0.79. Three points lie exactly on the middle line, and six points lie on two lines symmetric around the middle line, 3 points on each line.
Three Parallel Lines (cont.)
The significance of the Pearson coefficient with insignificant components indicates a fit of three parallel lines. The logic behind this is that the insignificance of the two components indicates that the points are not close to one line or to two parallel lines; otherwise, the components would be significant. Therefore, the significance of the Pearson coefficient in this case means that some points are close to a straight line, and others are close to two lines symmetrical around the first line. Another example is the 3rd plot of Figure 5: rP = 0.16, r1 = 0.12, v1 = 0.25, with p-values 0.02, 0.07, and 0.07, respectively. The existence of such cases is shown in Figure 10.
Two- or Three Parallel Lines
Mean and SD Relationship for Positive Numbers
In general, the sizes of the mean (M) and the standard deviation (SD) are independent of each other. But for positive numbers, as in the case of the vertical projections (VP) |x_i^(2) − y_i^(2)|, a small M implies a small SD, but not vice versa, in the absence of outliers. A small M indicates small positive numbers close together and, therefore, a small SD. A small SD can also occur for large positive numbers, in which case M can be large.
• A small M implies small VPs, |x_i^(2) − y_i^(2)| ≈ c, i.e., points close to the lines y = x − c and y = x + c. Since M is small, the points can be considered close to the line y = x.
• A small SD (significant v2) implies VPs close to M, |x_i^(2) − y_i^(2)| ≈ c = mean_i |x_i^(2) − y_i^(2)|, i.e., points close to the lines y = x − c and y = x + c. If M is large (insignificant r2), then we have a two-line fit.
• For large M & SD (insignificant r2 & v2) with significant rP, a large portion of the data minimizes M & SD and therefore maximizes rP. In addition, another large portion of points maximizes M and SD, otherwise r2 & v2 would be significant, and therefore minimizes r2 & v2; see Figure 9.
Figure 3: MADE-UP EXAMPLES: Two & Three Parallel Lines
In both cases, the Pearson coefficient is significant (p-values of rP: 0.02 & 0.03), giving the impression that there is a significant linear relationship. The proposed coefficients r1 and v1 show an insignificant linear relationship (p-values of r1: 0.09 & 0.06), a significant two-parallel-line fit in Case 1 (p-value of v1 = 0.002), and a significant three-parallel-line fit in Case 2 (insignificant r1 and v1, p-values 0.06 & 0.06, and a significant Pearson coefficient, p-value = 0.03).
Figure 4: MADE-UP EXAMPLES: Perfect Linear Relationships with 1 & 2 Outliers
The Pearson correlation coefficient is not robust in the presence of outliers [6]. We consider a perfect linear relationship with ONE & TWO multivariate outliers and apply the Pearson coefficient & the proposed r1 coefficient. Only the proposed correlation, r1, recognizes the linear relationships, with p-values 0.04 & 0.02 < 0.05.
[6] https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
Constructing a Correlation Measure
Let G be a distance measure between the standardized values of two paired variables, and let
r±,G = ±(1 − G/δ), where G →p δ (convergence in probability under independence),
r_G = { r+,G, if positive relationship & G < δ; r−,G, if negative relationship & G < δ; 0, otherwise }.
Note that the desired properties apply.
• E.g., in r±,g and v±,g, for g = 1, 2, the denominators of the fractions are the expected values of the numerators under the assumptions of normality and independence.
AN INTERPRETATION OF THE CORRELATION COEFFICIENT
• The correlation coefficient r1 can be interpreted via the percentage change between the squared mean distance of the standardized values and the squared limiting distance for independent normal x and y.
• For example, a value of r1 = 0.5 implies a 50% reduction of the squared observed distance from the squared limiting distance under independence and normality.
• In the literature, it is known that Pearson's correlation can be viewed as a rescaled variance of the difference between standardized scores [7].
[7] Rodgers, J. L.; Nicewander, W. A. (1988). "Thirteen Ways to Look at the Correlation Coefficient". The American Statistician, 42(1), 59-66.
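Numerically, the interpretation can be illustrated as follows (a sketch based on the r1 formula used in the code appendix, r1 = 1 − d²/2, where d² is the squared mean distance of the g = 1 standardized values; under independence and normality d² tends to 2):

```python
import numpy as np

def squared_distance(x, y):
    """Squared mean absolute distance of the g = 1 standardized values."""
    xs = (x - x.mean()) / np.mean(np.abs(x - x.mean()))
    ys = (y - y.mean()) / np.mean(np.abs(y - y.mean()))
    return np.mean(np.abs(xs - ys)) ** 2

# r1 = 1 - squared_distance/2, so a 50% reduction from the limiting value 2
# (reached by independent normal data) gives r1 = 0.5.
rng = np.random.default_rng(2)
x, y = rng.normal(size=(2, 100_000))   # independent => squared distance near 2
```

Plugging correlated data into `squared_distance` shrinks the value below 2, and r1 reports that shrinkage as a percentage.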
PERMUTATION TESTS: Computing the p-values
Permutation tests [8] have been used for hypothesis testing of correlation coefficients between two variables, x and y. We first calculate the correlation coefficient repeatedly after shuffling the observations of the variable y while keeping the order of the observations of the variable x fixed. Then, we derive p-values from the distribution of the computed correlation coefficients. Permutation tests [9] enjoy the following merits over other standard statistical tests:
• They approximate p-values very satisfactorily.
• They do not assume any particular distribution (they are distribution-free).
• They are suitable for small samples.
• They are applicable to non-random samples, e.g., time-series data.
[8] https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
[9] Berry, K. J., Johnston, J. E., & Mielke, P. W., Jr. (2018). The Measurement of Association: A Permutation Statistical Approach. Springer International Publishing.
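The shuffling scheme described above can be sketched generically (a sketch; the function names are ours, and any correlation coefficient can be passed in as `stat`):

```python
import numpy as np

def perm_pvalue(x, y, stat, n_perm=1000, seed=0):
    """Two-sided permutation p-value: shuffle y, keep the order of x fixed."""
    rng = np.random.default_rng(seed)
    observed = stat(x, y)
    perms = np.array([stat(x, rng.permutation(y)) for _ in range(n_perm)])
    return np.mean(np.abs(perms) >= abs(observed))

def pearson(x, y):
    return np.corrcoef(x, y)[0, 1]
```

Usage: `perm_pvalue(x, y, pearson)` returns the share of shuffles whose coefficient is at least as extreme as the observed one; the document's own `pvr1` function (code appendix) implements the same idea for r1 via an empirical CDF.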
SIMULATION DESIGN WITH NORMAL DATA
• 100,000 simulations in Python (NumPy library).
• Randomly picking one of the following 6 pairs (n, r) with equal probabilities (1/6):
n  20   35   50   100   150   200
r  0.7  0.6  0.5  0.35  0.3   0.25
• Generating n correlated bivariate data, x_i and y_i, with Pearson's coefficient ρ = r. All the data are correlated as follows:
• x_i ∼ Normal(0,1), e_i ∼ Normal(0,1),
• y_i = r · x_i + √(1 − r²) · e_i
• Applied in Figure 5, 1st row, & Tables 1a & 1b.
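The generating step above can be sketched as follows (a sketch; the seed and the pair-selection details are our choices):

```python
import numpy as np

rng = np.random.default_rng(123)
pairs = [(20, 0.7), (35, 0.6), (50, 0.5), (100, 0.35), (150, 0.3), (200, 0.25)]
n, r = pairs[rng.integers(len(pairs))]       # one (n, r) pair, probability 1/6 each
x = rng.normal(size=n)
e = rng.normal(size=n)
y = r * x + np.sqrt(1 - r ** 2) * e          # population Corr(x, y) = r, Var(y) = 1
```

The √(1 − r²) factor keeps Var(y) = 1, so the population Pearson correlation between x and y is exactly r.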
AN APPLICATION TO GDP PER CAPITA
• Publicly available data [10] from the WORLD BANK.
• N = 62 countries with GDP per capita > $10,000 in 2020 and full annual data for 1982-2020.
• T = 39 for the period 1982-2020. We analyze growth rates (%).
• (62² − 62)/2 = 1891 pairs (x, y), i.e., correlation cases.
• No causality. Lurking variables: the global or continental economy.
• Compare the economic growth of a country with its correlated countries by regression residuals.
• Applied in Figure 5, 2nd row, & Table 2.
[10] https://data.worldbank.org/indicator/NY.GDP.PCAP.CD
Result Presentation in a Graph (see Figure 5)
• Simulated data in the 1st row and application data in the 2nd row.
• 1-, 2-, and 3-line fits in the 1st, 2nd, & 3rd columns, respectively.
• In the application, the initial sample size is n = 39, and we remove univariate outliers by the 1.5 · IQR criterion. In the simulation plots, the sample size is n = 200.
• By p, p1, & p2 we denote the p-values of the Pearson coefficient, r1, and v1, respectively.
• By significance, we mean that the p-value < 0.05.
Conclusions from Figure 5
• In the 1st column, only r1 shows a significant linear relationship. The application graph shows the robustness of r1 to non-normal data.
• In the 2nd column, in the 1st graph with simulated data, the Pearson coefficient is significant, and the insignificance of r1 together with the significance of v1 indicates a 2-line fit. Under the existing literature, one would declare a significant 1-line fit; this is a Type III error.
• In the 2nd column, in the 2nd graph with the application data, the Pearson coefficient is insignificant, and the insignificance of r1 together with the significance of v1 indicates a 2-line fit. Under the existing literature, there is no significant 1-line fit.
Conclusions from Figure 5 (continued)
• In the 3rd column, in both graphs, the Pearson coefficient is significant, and both r1 and v1 are insignificant. Thus, we conclude that there is a significant 3-line fit. Under the existing literature, one would declare a significant 1-line fit; this is a Type III error.
Simulation Results (Tables 1a & 1b)
• See the simulation design earlier.
• Tables 1a (ρ = 0) & 1b (ρ ≠ 0) are 3-way tables for the 3 correlation coefficients rP (Pearson), r1, and v1, with 2 categories each: significant or not.
• For ρ = 0, the α = P(Type I error) for rP is α = 5.1% (close to 5%, as expected), and for r1 it is α = 5.1% = 3.2 + 1.9. Combining rP, r1, and v1, α = 10% = 1 − 0.90.
• For ρ ≠ 0, the β = P(Type II error) for rP is β = 9.6%, while for r1 it is β = 15% = 6.6 + 8.4. Combining rP, r1, and v1, β = 7.3%.
Simulation Results (Tables 1a & 1b)
• For ρ ≠ 0, the γ = P(Type III error) for rP is γ = 6.6% = 4.3 + 2.3, while combining rP, r1, and v1 gives γ = 0.
• With the proposed methodology, α = P(Type I error) doubles from 5% to 10%, as it detects parallel lines in addition to linear fits, but β = P(Type II error) is significantly smaller than that of Pearson's correlation (7.3% vs. 9.6%), and there is no γ = P(Type III error).
• The above results appear in Figure 1.
• Thus, by combining r1, v1, and rP, we minimize the total error compared to the total error of rP and r1 separately.
• In this case, we get the smallest total error if we use the Pearson significance and exclude the Type III error, which the significance of rP, r1, and v1 can account for. In the case of non-normal data, we can use the combination of r1, v1, and rP.
Table 1a: Simulation (ρ = 0), N = 100,000, entries in %

                                v1
Pearson      r1            Significant      Not Signif.       Total
Significant  Significant   1.1, 1 line      2.1, 1 line       3.2
             Not Signif.   1.0, 2 lines     0.8, 3 lines      1.8
             Total         2.1              2.9               5.0
Not Signif.  Significant   0.0, 1 line      1.9, 1 line       1.9
             Not Signif.   3.1, 2 lines     90.0, no lines    93.1
             Total         3.1              91.9              95.0
Table 1b: Simulation (ρ ≠ 0), N = 100,000, entries in %

                                v1
Pearson      r1            Significant      Not Signif.       Total
Significant  Significant   66.1, 1 line     17.7, 1 line      83.8
             Not Signif.   4.3, 2 lines     2.3, 3 lines      6.6
             Total         70.4             20.0              90.4
Not Signif.  Significant   0.0, 1 line      1.2, 1 line       1.2
             Not Signif.   1.1, 2 lines     7.3, no lines     8.4
             Total         1.1              8.5               9.6
Application Results (Table 2)
• See the application information earlier.
• Table 2 is a 3-way table for the 3 correlation coefficients rP (Pearson), r1, and v1, with 2 categories each: significant or not.
• The initial sample size is n = 39, and we remove univariate outliers by the 1.5 · IQR criterion.
• In 4.4% = 0.4 + 4.0 of the 1891 cases, r1 recognizes a significant linear relationship not detected by the Pearson coefficient rP.
• Combining r1 and v1, we show a significant 2-line fit in 4.8% = 0.6 + 4.2 of the 1891 cases, not detected otherwise.
• By combining r1, v1, and rP, we detect a significant 3-line fit in 1.5% of the 1891 cases, not recognized otherwise.
Table 2: AN APPLICATION TO GDP PER CAPITA, N = 1891, entries in %

                                v1
Pearson      r1            Significant      Not Signif.           Total
Significant  Significant   40.8, 1 line     16.9, 1 line          57.7
             Not Signif.   0.6, 2 lines     1.5, 3 lines          2.1
             Total         41.4             18.4                  59.8
Not Signif.  Significant   0.4, 1 line      4.0, 1 line           4.4
             Not Signif.   4.2, 2 lines     31.6, no 1-3 lines    35.8
             Total         4.6              35.6                  40.2
Figure 6: Venn Diagrams [11] for the Simulation (Tables 1a & 1b): ρ = 0 and ρ ≠ 0
[11] For understanding the diagrams, see: https://www.easycalculation.com/algebra/venn-diagram-3sets.php
Comments on Figures 6 & 7 (Venn Diagrams)
• The Pearson rP and the proposed coefficient for a linear fit, r1, AGREE in 93.2% = 90 + 1.1 + 2.1, in 91.1% = 66.1 + 17.7 + 7.3, and in 89.3% = 40.8 + 16.9 + 31.6 of the cases for ρ = 0, ρ ≠ 0, and the application, respectively. Roughly, there is 90% agreement and 10% disagreement.
• A significant TWO- or THREE-LINE FIT appears in 4.1% = 3.1 + 1 & 0.8% for ρ = 0, in 5.4% = 4.3 + 1.1 & 2.3% for ρ ≠ 0, and in 4.8% = 4.2 + 0.6 & 1.5% in the application, respectively.
• The robust version of the 2nd component of the Pearson decomposition, the coefficient v1, fails to recognize the linear relationship in a large share of cases: 17.7% for ρ ≠ 0 and 16.9% in the application. Thus, we recommend r1 for linear relationships and v1 for 2- and 3-parallel-line fits, in combination with r1 and rP.
OVERALL CONCLUSIONS
• The significance of the Pearson correlation coefficient may imply a 1-, 2-, or 3-parallel-line fit. There is a Type III error: we may think there is a significant fit with 1 line, but in fact there exists a fit of 2 or 3 parallel lines.
• We provide a methodology to detect 1-, 2-, or 3-line fits.
• We prove a meaningful decomposition of the Pearson coefficient.
• The proposed r1 coefficient detects more linear relationships and has a smaller overall error than the Pearson coefficient.
OVERALL CONCLUSIONS
• r1 is more robust to non-normality & outliers than the Pearson coefficient.
• The robust version of the 2nd component of the Pearson decomposition, the coefficient v1, does not recognize many linear relationships, since it detects two parallel lines. Thus, we recommend only r1 for single-line detection.
• The two robust components, r1 and v1, of the Pearson decomposition, combined with the Pearson coefficient, can be applied by all scientists to detect accurate significant fits of 1, 2, or 3 parallel lines.
APPENDIX: THE DECOMPOSITION
Pearson = rP = (2/π) · r±,2 + ((π − 2)/π) · v2 ≈ 0.64 · r±,2 + 0.36 · v2
with
r±,2 = ±(1 − (mean_i |x_i^(2) − y_i^(2)|)² / (4/π)),
v±,2 = ±(1 − s²(|x^(2) − y^(2)|) / (2 − 4/π)),
and standardized values: z_i^(g) = (z_i − z̄) / (mean_j |z_j − z̄|^g)^(1/g), g = 1, 2; i = 1, 2, ..., n.
We symbolize by π the known mathematical constant, by z̄ the sample mean, by |z| the absolute value, and by s² the population variance.
Define r_g = { r+,g, for a positive relationship; r−,g, for a negative relationship }
and v_g = { v+,g, for a positive relationship; v−,g, for a negative relationship }.
APPENDIX: A Proof of the Decomposition
We write the Pearson coefficient as rP = mean_i (x_i^(2) · y_i^(2)).
The mean square of the standardized values equals 1:
mean_i ((x_i^(2))²) = mean_i ((y_i^(2))²) = 1,
and from the identity (a − b)² = a² + b² − 2ab we get the following reformulation of the Pearson coefficient:
rP = mean_i (x_i^(2) · y_i^(2)) = ±(1 − mean_i ((x_i^(2) ∓ y_i^(2))²) / 2),
with the difference x − y for a positive and the sum x + y for a negative relationship.
Using the known identity [13] mean(z_i²) = (mean|z_i|)² + s²(|z_i|), for z_i = x_i^(2) − y_i^(2), and some calculations, we get the result.
[13] https://en.wikipedia.org/wiki/Root_mean_square
APPENDIX: Notes on the Decomposition
• If we assume independent x_i ∼ N(μx, σx²) & y_i ∼ N(μy, σy²), then the difference of the standardized values satisfies x_i^(2) − y_i^(2) ∼ N(0, 2), and therefore
|x_i^(2) − y_i^(2)| ∼ HalfNorm(2/√π, 2 · (1 − 2/π)) (half-normal [14] distribution).
We used that if x ∼ N(0, σ²), then |x| ∼ HalfNorm(σ · √(2/π), σ² · (1 − 2/π)).
• In r±,2 and v±,2, the denominators of the fractions are the expected values of the numerators under the assumptions of normality and independence.
[14] https://en.wikipedia.org/wiki/Half-normal_distribution
APPENDIX: Notes on the Normality Assumption
• Despite using the normality assumption to obtain the limiting distances in the proposed coefficients, this assumption does not affect the p-values computed by permutation tests.
• For non-normal data, we may have some bias in estimating the correlation coefficients r1 and v1. However, the estimation of their p-values does not depend on the normality assumption. So, with non-normal data, we look at the p-values of r1 and v1.
• For large sample sizes and normal data, we could use already calculated critical values if computation time were an issue.
• For large sample sizes and non-normal data, we could use already calculated critical values of the respective rank coefficients of r1 and v1, denoted R1 and V1, with a power compensation, if computation time were an issue.
APPENDIX: A Tip: The Parallel between Hypothesis Testing and Proof by Contradiction
• In proof by contradiction, we establish a proposition by assuming that the proposition is false and deriving a contradiction.
• In hypothesis testing, we accept the alternative hypothesis, H1, by assuming a false H1 (i.e., a true H0) and concluding that the observed outcome is improbable (a small p-value).
APPENDIX: THE CRITICAL VALUES AND P-VALUES FOR r1 ARE COMPUTED BY PERMUTATION TESTS
Cut-off points c for r1 (two-sided α = 0.05 or one-sided α = 0.025), under normality:
n  c      n  c      n  c      n  c      n   c      n     c
5  0.924  12 0.662  19 0.543  30 0.437  80  0.278  500   0.113
6  0.868  13 0.637  20 0.528  35 0.409  90  0.262  1000  0.080
7  0.819  14 0.617  21 0.517  40 0.386  100 0.249  2000  0.057
8  0.780  15 0.602  22 0.505  45 0.363  150 0.205  5000  0.036
9  0.742  16 0.582  23 0.496  50 0.346  200 0.179  10^4  0.026
10 0.715  17 0.570  24 0.487  60 0.318  300 0.146  10^5  0.008
11 0.687  18 0.556  25 0.477  70 0.296  400 0.126  10^6  0.001
Example: For data not rejected as normal and n = 90, suppose we observe r1 = 0.21. Then we cannot reject H0: ρ = 0 against H1: ρ ≠ 0 at α = 0.05, since |r1| < c = 0.262.
APPENDIX: THE CRITICAL VALUES AND P-VALUES FOR v1 ARE COMPUTED BY PERMUTATION TESTS
Cut-off points c for v1 (two-sided α = 0.05 or one-sided α = 0.025), under normal data:
n  c      n  c      n  c      n  c      n   c      n     c
5  0.962  12 0.751  19 0.629  30 0.522  80  0.337  500   0.141
6  0.932  13 0.732  20 0.616  35 0.490  90  0.319  1000  0.100
7  0.893  14 0.709  21 0.605  40 0.461  100 0.303  2000  0.072
8  0.860  15 0.692  22 0.595  45 0.435  150 0.251  5000  0.046
9  0.830  16 0.675  23 0.582  50 0.415  200 0.219  10^4  0.033
10 0.801  17 0.658  24 0.573  60 0.384  300 0.180  10^5  0.011
11 0.775  18 0.644  25 0.560  70 0.358  400 0.158  10^6  0.003
APPENDIX: THE CRITICAL VALUES FOR PEARSON COMPUTED BY PERMUTATION TESTS
Cut-off points c for rP (two-sided α = 0.05 or one-sided α = 0.025), under normal data:
n  c      n  c      n  c      n  c      n   c      n     c
5  0.923  12 0.640  19 0.510  30 0.408  80  0.250  500   0.100
6  0.866  13 0.618  20 0.497  35 0.379  90  0.236  1000  0.071
7  0.816  14 0.591  21 0.488  40 0.353  100 0.223  2000  0.050
8  0.771  15 0.576  22 0.475  45 0.332  150 0.182  5000  0.031
9  0.730  16 0.558  23 0.468  50 0.317  200 0.158  10^4  0.022
10 0.698  17 0.540  24 0.457  60 0.291  300 0.130  10^5  0.007
11 0.668  18 0.528  25 0.446  70 0.267  400 0.112  10^6  0.002
APPENDIX: EXACT CRITICAL VALUES FOR PEARSON
Cut-off points c for rP (two-sided α = 0.05 or one-sided α = 0.025):
n  c      n  c      n  c      n  c      n   c      n     c
5  0.878  12 0.576  19 0.456  30 0.361  80  0.220  500   0.088
6  0.811  13 0.553  20 0.444  35 0.334  90  0.207  1000  0.062
7  0.754  14 0.532  21 0.433  40 0.312  100 0.197  2000  0.044
8  0.707  15 0.514  22 0.423  45 0.294  150 0.160  5000  0.028
9  0.666  16 0.497  23 0.413  50 0.279  200 0.139  10^4  0.020
10 0.632  17 0.482  24 0.404  60 0.254  300 0.113  10^5  0.006
11 0.602  18 0.468  25 0.396  70 0.235  400 0.098  10^6  0.002
c = t_{α/2} / √(n − 2 + t²_{α/2}),
where t_{α/2} is the upper α/2 critical value of the t distribution with n − 2 degrees of freedom.
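The closed form above can be evaluated directly (a sketch; SciPy's t distribution is assumed, and the function name is ours):

```python
import math

from scipy.stats import t

def pearson_cutoff(n, alpha=0.05):
    """Two-sided exact critical value for Pearson's r under H0: rho = 0."""
    tc = t.ppf(1 - alpha / 2, df=n - 2)      # t critical value with n - 2 df
    return tc / math.sqrt(n - 2 + tc ** 2)
```

For example, `pearson_cutoff(10)` reproduces the table entry 0.632 for n = 10.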
Figure 9. Example: n = 1000, r = 0.5, standardized points
Blue points (near the line y = x) maximize the 1st component; red points maximize the 2nd component. In the histogram, the selected points are the red points. The red and blue sets of points may overlap or be disjoint.
Figure 10. A Significant Pearson Coefficient with Insignificant Components
E.g., for n = 50: r2 < 0.335, v2 < 0.404, v2 > 0.7512 − 1.751938 · r2, and 0.279 < 0.64 · r2 + 0.36 · v2 < 1.
This case occurs in the triangle defined by these constraints, so it is possible.
PYTHON CODE FOR THE COEFFICIENTS & PERMUTATION TESTS

import random
import numpy as np
from statsmodels.distributions.empirical_distribution import ECDF

# =============================================================================
def r1(x, y):  # Computes the coefficient r1
    nxy = x.shape[0]
    xm = np.mean(x)*np.ones(nxy); ym = np.mean(y)*np.ones(nxy)
    dmx = np.mean(abs(x-xm)); dmy = np.mean(abs(y-ym))
    x = (x-xm)/dmx; y = (y-ym)/dmy; del xm, ym, dmx, dmy
    dxyp = np.mean(abs(x-y)); dxyn = np.mean(abs(x+y))
    if (dxyp < dxyn) & (dxyp < np.sqrt(2)):
        rc1 = 1 - dxyp**2 / 2
    elif (dxyp > dxyn) & (dxyn < np.sqrt(2)):
        rc1 = -(1 - dxyn**2 / 2)
    else:
        rc1 = 0
    return rc1

# =============================================================================
def pvr1(x, y):  # p-value for r1 by permutation tests
    rcc1 = r1(x, y)
    nper = 1000
    dr = np.zeros(nper)
    yh = y.tolist()
    for ii in range(0, nper):  # Permutation: shuffle y, keep x fixed
        yr = np.array(random.sample(yh, len(yh)))
        dr[ii] = r1(x, yr)
    ecdfr = ECDF(dr[:])
    return ecdfr(-abs(rcc1)) + 1 - ecdfr(abs(rcc1))

# =============================================================================
def v1(x, y):  # Computes the coefficient v1
    nxy = x.shape[0]
    xm = np.mean(x)*np.ones(nxy); ym = np.mean(y)*np.ones(nxy)
    dmx = np.mean(abs(x-xm)); dmy = np.mean(abs(y-ym))