1. Application of Parallel Hierarchical Matrices and Low-Rank Tensors in Spatial Statistics and Parameter Identification
Alexander Litvinenko,
(joint work with M. Genton, Y. Sun and D. Keyes)
SIAM PP 2018, Tokyo, Japan
Bayesian Computational Statistics & Modeling, KAUST
https://bayescomp.kaust.edu.sa/
Stochastic Numerics Group, KAUST
http://sri-uq.kaust.edu.sa/
Extreme Computing Research Center, KAUST
https://ecrc.kaust.edu.sa/
2. The structure of the talk
Part 1: Parallel H-matrices in spatial statistics
1. Motivation: improve statistical model
2. Tools: Hierarchical matrices [Hackbusch 1999]
3. Matérn covariance function and joint Gaussian likelihood
4. Identification of unknown parameters via maximizing the Gaussian log-likelihood
5. Implementation with HLIBPro.
Part 2: Low-rank Tucker tensor methods in spatial statistics
3. Motivation
Tasks: 1) improve the statistical model that predicts soil moisture; 2) use the improved model to forecast missing values.
Given: n ≈ 2.5 million observations of soil moisture data
[Figure: map of high-resolution daily soil moisture data (longitude vs. latitude, values 0.15-0.50) at the top layer of the Mississippi basin, U.S.A., 01.01.2014 (Chaney et al., in review).]
Important for agriculture, defense, ...
4. Goal: Estimation of unknown parameters
[Figure: (left) dependence of the boxplots for ν on the H-matrix rank k ∈ {3, 7, 9, 12} when n = 16,000; (right) convergence of the boxplots for ν with increasing n ∈ {1000, 2000, 4000, 8000, 16000, 32000}; 100 replicates.]
5. What will change after the H-matrix approximation?
Approximate C by an H-matrix C_H.
1. How do the eigenvalues of C and C_H differ?
2. How does det(C) differ from det(C_H)? [Below]
3. How does L differ from L_H? [Mario Bebendorf et al.]
4. How does C^{-1} differ from (C_H)^{-1}? [Mario Bebendorf et al.]
5. How does L̃(θ, k) differ from L(θ)? [Below]
6. What is the optimal H-matrix rank? [Below]
7. How does θ_H differ from θ? [Below]
For theory and estimates of the rank and accuracy, see the works of Bebendorf, Grasedyck, Le Borne, Hackbusch, ...
8. H-matrix approximations of C, L and C^{-1}
[Figure: examples of a cluster tree T_I (left) and a block cluster tree T_{I×I} (right) for an index set I of 1600 indices. The decomposition of the matrix into sub-blocks is defined by T_{I×I} and the admissibility condition.]
11. Identifying unknown parameters
Given: Z = {Z(s₁), ..., Z(sₙ)}, where Z(s) is a Gaussian random field indexed by a spatial location s ∈ R^d, d ≥ 1.
Assumption: Z has mean zero and a stationary parametric covariance function, e.g. Matérn:

$$C(r; \theta) = \frac{2^{1-\nu}\sigma^2}{\Gamma(\nu)} \left(\frac{r}{\ell}\right)^{\nu} K_\nu\!\left(\frac{r}{\ell}\right), \qquad \theta = (\sigma^2, \nu, \ell).$$

To identify: the unknown parameters θ := (σ², ν, ℓ).
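For reference, a small Python sketch of this Matérn family using SciPy's modified Bessel function K_ν; the dense matrix below is only a stand-in for the H-matrix approximation built in HLIBPro, and the helper name matern_cov is ours.

```python
import numpy as np
from scipy.special import gamma, kv
from scipy.spatial.distance import cdist

def matern_cov(locs, sigma2=1.0, nu=0.5, ell=0.2337):
    """Dense Matérn covariance matrix C(r; theta) for locations in R^d."""
    r = cdist(locs, locs)
    C = np.full_like(r, sigma2)              # C(0) = sigma^2 on the diagonal
    z = r[r > 0] / ell
    C[r > 0] = sigma2 * 2.0**(1.0 - nu) / gamma(nu) * z**nu * kv(nu, z)
    return C

locs = np.random.rand(500, 2)                # 500 random locations in [0,1]^2
C = matern_cov(locs)                         # nu = 0.5 gives the exponential kernel
print(C.shape, np.allclose(C, C.T))
```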
12. Identifying unknown parameters
Statistical inference about θ is then based on the Gaussian log-likelihood function:

$$\mathcal{L}(\theta) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log|C(\theta)| - \frac{1}{2}\, Z^\top C(\theta)^{-1} Z, \qquad (3)$$

where the covariance matrix C(θ) has entries C(sᵢ − sⱼ; θ), i, j = 1, ..., n. The maximum likelihood estimator of θ is the value θ̂ that maximizes (3).
13. Details of the identification
To maximize the log-likelihood function we use Brent's method [Brent '73], which combines the bisection method, the secant method and inverse quadratic interpolation.
1. H-matrix approximation: C(θ) ≈ C̃(θ, k) (fixed rank k) or C̃(θ, ε) (fixed block accuracy ε).
2. H-Cholesky factorization: C̃(θ, k) = L̃(θ, k) L̃(θ, k)ᵀ.
3. Zᵀ C̃⁻¹ Z = Zᵀ (L̃L̃ᵀ)⁻¹ Z = vᵀ v, where v solves L̃(θ, k) v(θ) = Z.
$$\log\det C = \log\det(LL^\top) = \log\prod_{i=1}^{n}\lambda_i^2 = 2\sum_{i=1}^{n}\log\lambda_i,$$

where λᵢ = Lᵢᵢ are the diagonal entries of the Cholesky factor, so that

$$\widetilde{\mathcal{L}}(\theta, k) = -\frac{N}{2}\log(2\pi) - \sum_{i=1}^{N}\log \tilde{L}_{ii}(\theta, k) - \frac{1}{2}\, v(\theta)^\top v(\theta). \qquad (4)$$
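A runnable sketch of steps 1-3 and of eq. (4), with a dense Cholesky factor standing in for the H-Cholesky factor and SciPy's bounded minimizer (which implements Brent's method on an interval) maximizing the likelihood over ν alone; it reuses the hypothetical matern_cov helper from the sketch above, and the tiny nugget anticipates the stabilization remark of slide 16.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
locs = rng.random((300, 2))                      # 300 synthetic locations
C0 = matern_cov(locs, 1.0, 0.5, 0.2337)          # "true" parameters
Z = cholesky(C0 + 1e-10 * np.eye(300), lower=True) @ rng.standard_normal(300)

def neg_log_likelihood(nu):
    C = matern_cov(locs, 1.0, nu, 0.2337)        # step 1 (dense stand-in)
    C += 1e-10 * np.eye(len(C))                  # tiny nugget, cf. slide 16
    L = cholesky(C, lower=True)                  # step 2: C = L L^T
    v = solve_triangular(L, Z, lower=True)       # step 3: L v = Z
    logdet = 2.0 * np.log(np.diag(L)).sum()      # log det C = 2 sum log L_ii
    return 0.5 * (len(Z) * np.log(2 * np.pi) + logdet + v @ v)

res = minimize_scalar(neg_log_likelihood, bounds=(0.1, 2.0), method="bounded")
print("estimated nu:", res.x)                    # should be close to 0.5
```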
14. Convergence of the optimization method
In the 3D case with n = 128,000, we may need about one hour and 150-250 iterations.
15. Dependence of the log-likelihood and its ingredients on the parameters
[Figure: n = 4225; k = 8 in the first row and k = 16 in the second. First column: ℓ = 0.2337 fixed; second column: ν = 0.5 fixed.]
16. Remark: stabilization with a nugget
To avoid instability in computing the Cholesky factorization, we add a nugget: C_m = C + δ²I. If λᵢ are the eigenvalues of C, then the eigenvalues of C_m are λᵢ + δ², and

$$\log\det(C_m) = \log\prod_{i=1}^{n}(\lambda_i + \delta^2) = \sum_{i=1}^{n}\log(\lambda_i + \delta^2).$$

[Figure: (left) dependence of the negative log-likelihood on the parameter ℓ with nuggets {0.01, 0.005, 0.001} for the Gaussian covariance; (right) zoom of the left figure near the minimum. n = 2000 random points from the moisture example, rank k = 14, σ² = 1, ν = 0.5.]
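A short sketch of the nugget trick: without the δ²I term, the Cholesky factorization of a Gaussian covariance matrix of this size typically fails numerically, while any of the three nugget values from the figure keeps it alive (illustrative Python, ℓ = 0.5 assumed).

```python
import numpy as np
from scipy.spatial.distance import cdist

x = np.random.rand(2000, 2)
C = np.exp(-(cdist(x, x) / 0.5) ** 2)            # Gaussian covariance, near-singular
for delta2 in [1e-2, 5e-3, 1e-3]:
    Cm = C + delta2 * np.eye(len(C))             # C_m = C + delta^2 I
    L = np.linalg.cholesky(Cm)                   # would fail for delta2 = 0
    print(f"delta^2 = {delta2:g}: log det(C_m) = {2*np.log(np.diag(L)).sum():.2f}")
```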
17. Error analysis
Theorem (1)
Let C̃ be an H-matrix approximation of the matrix C ∈ R^{n×n} such that ρ(C⁻¹C̃ − I) ≤ ε < 1. Then

$$\big|\log|\tilde{C}| - \log|C|\big| \le -n\log(1-\varepsilon). \qquad (5)$$
Proof: See [Ballani, Kressner 14] and [Ipsen 05].
Remark: the factor n is pessimistic and is not actually observed numerically.
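One can probe the bound (5) numerically; in this small sketch (a random SPD matrix with an artificial symmetric perturbation standing in for the H-matrix error), the observed log-determinant error is indeed orders of magnitude below −n·log(1 − ε), in line with the remark.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
A = rng.standard_normal((n, n))
C = A @ A.T + n * np.eye(n)                      # SPD "exact" matrix
E = 1e-6 * rng.standard_normal((n, n))
Ct = C + (E + E.T) / 2                           # perturbed stand-in for C~
eps = np.max(np.abs(np.linalg.eigvals(np.linalg.solve(C, Ct) - np.eye(n))))
err = abs(np.linalg.slogdet(Ct)[1] - np.linalg.slogdet(C)[1])
print(f"eps = {eps:.2e}, error = {err:.2e}, bound = {-n*np.log1p(-eps):.2e}")
```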
18. Error in the log-likelihood
Theorem (2)
Let C̃ ≈ C ∈ R^{n×n} and let Z be a vector with ‖Z‖ ≤ c₀ and ‖C̃⁻¹‖ ≤ c₁. Let ρ(C̃⁻¹C − I) ≤ ε < 1. Then it holds that

$$|\widetilde{\mathcal{L}}(\theta) - \mathcal{L}(\theta)| \le \frac{1}{2}\,\big|\log|\tilde{C}| - \log|C|\big| + \frac{1}{2}\,\big|Z^\top\big(\tilde{C}^{-1} - C^{-1}\big)Z\big|$$
$$\le -\frac{1}{2}\, n\log(1-\varepsilon) + \frac{1}{2}\,\big|Z^\top \tilde{C}^{-1}\big(C - \tilde{C}\big)\,C^{-1} Z\big| \le -\frac{1}{2}\, n\log(1-\varepsilon) + \frac{1}{2}\, c_0^2\, c_1\, \varepsilon.$$
19. H-matrix approximation
Denote the accuracy in each sub-block by ε; n = 16641, ν = 0.5, c.r. = compression ratio. log|C| = 2.63 for ℓ = 0.0334 and log|C| = 6.36 for ℓ = 0.2337.
ε        |log|C| − log|C̃||   |log|C| − log|C̃|| / |log|C||   ‖C − C̃‖_F   ‖C − C̃‖₂/‖C‖₂   ‖I − (L̃L̃ᵀ)⁻¹C̃‖₂   c.r. in %
ℓ = 0.0334:
1e-1     3.2e-4               1.2e-4                         7.0e-3       7.6e-3           2.9                 9.16
1e-2     1.6e-6               6.0e-7                         1.0e-3       6.7e-4           9.9e-2              9.4
1e-4     1.8e-9               7.0e-10                        1.0e-5       7.3e-6           2.0e-3              10.2
1e-8     4.7e-13              1.8e-13                        1.3e-9       6e-10            2.1e-7              12.7
ℓ = 0.2337:
1e-4     9.8e-5               1.5e-5                         8.1e-5       1.4e-5           2.5e-1              9.5
1e-8     1.45e-9              2.3e-10                        1.1e-8       1.5e-9           4e-5                11.3
20. Boxplots for unknown parameters
Moisture data example. Boxplots for the 100 estimates of (ℓ, ν, σ²), respectively, when n = 32K, 16K, 8K, 4K, 2K; H-matrix with fixed rank k = 11. Green horizontal lines denote the 25% and 75% quantiles for n = 32K.
21. How much memory is needed?
[Figure: (left) dependence of the H-matrix size (in bytes) on the covariance length ℓ, with ν = 0.325, σ² = 0.98; (right) dependence on the smoothness ν, with ℓ = 0.58, σ² = 0.98; two accuracies in the H-matrix sub-blocks, ε ∈ {10⁻⁴, 10⁻⁶}; n = 2,000 locations in the domain [32.4, 43.4] × [−84.8, −72.9].]
23. How does the log-likelihood depend on n?
Figure: dependence of the negative log-likelihood function on the number of locations n ∈ {2000, 4000, 8000, 16000, 32000}, in log-scale.
24. Time and memory for parallel H-matrix approximation
Maximal number of cores is 40; ν = 0.325, ℓ = 0.64, σ² = 0.98.
n           C̃: time (sec)   C̃: size (MB)   kB/dof   L̃L̃ᵀ: time (sec)   L̃L̃ᵀ: size (MB)   ‖I − (L̃L̃ᵀ)⁻¹C̃‖₂
32,000      3.3              162             5.1      2.4                172.7             2.4·10⁻³
128,000     13.3             776             6.1      13.9               881.2             1.1·10⁻²
512,000     52.8             3420            6.7      77.6               4150              3.5·10⁻²
2,000,000   229              14790           7.4      473                18970             1.4·10⁻¹
25. Parameter identification
[Figure: boxplots for ℓ (left) and ν (right) for n = 1,000 × {64, 32, 16, 8, 4, 2}; synthetic data with known parameters (ℓ*, ν*, σ²*) = (0.5, 1, 0.5); 100 replicates.]
26. Conclusion of Part 1
Matérn covariance matrices can be approximated in the H-matrix format.
We provided an estimate of ‖L − L̃‖.
We studied the influence of the H-matrix approximation error on the estimated parameters (boxplots).
With H-matrices we extend the class of covariance functions we can work with, and we allow non-regular discretizations of the covariance function on large spatial grids.
With the maximization algorithm we are able to identify three parameters: the covariance length ℓ, the smoothness ν and the variance σ².
27. Part 2: Low-rank tensor methods for spatial statistics
[Figure: Tucker decomposition of a full tensor A (mode sizes n₁, n₂, n₃) into a small core tensor B of size r₁ × r₂ × r₃ and factor matrices V⁽¹⁾, V⁽²⁾, V⁽³⁾.]
See more in [A. Litvinenko, D. Keyes, V. Khoromskaia, B. N. Khoromskij, H. G. Matthies, Tucker tensor analysis of Matérn functions in spatial statistics, preprint arXiv:1711.06874, 2017]; accepted for CMAM, March 9, 2018.
28. Five tasks in statistics to solve
Task 1. Approximate a Matérn covariance function in a low-rank tensor format:

$$\Big\| c(x, y) - \sum_{i=1}^{r} \prod_{\mu=1}^{d} c_{i\mu}(x_\mu, y_\mu) \Big\| \le \varepsilon \quad \text{for some given } \varepsilon > 0.$$

Alternatively, we may look for matrix factors C_{iμ} such that

$$\Big\| C - \sum_{i=1}^{r} \bigotimes_{\mu=1}^{d} C_{i\mu} \Big\| \le \varepsilon.$$

Here the matrices C_{iμ} correspond to the one-dimensional functions c_{iμ}(x_μ, y_μ) in direction μ.
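For d = 2 the second formulation can be tested directly: the Kronecker rank of C equals the matrix rank of the Van Loan-Pitsianis rearrangement R of C, so the singular values of R reveal the r needed for a given ε. A small sketch with a Matérn ν = 1/2 (exponential) kernel; the grid sizes and ℓ = 0.3 are illustrative.

```python
import numpy as np

n1 = n2 = 20
x = np.linspace(0.0, 1.0, n1)
X1, X2 = np.meshgrid(x, x, indexing="ij")
pts = np.column_stack([X1.ravel(), X2.ravel()])
r = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
C = np.exp(-r / 0.3)                              # Matérn nu = 1/2 on a 20x20 grid

# Van Loan-Pitsianis rearrangement: C[(i1,i2),(j1,j2)] -> R[(i1,j1),(i2,j2)]
R = C.reshape(n1, n2, n1, n2).transpose(0, 2, 1, 3).reshape(n1 * n1, n2 * n2)
s = np.linalg.svd(R, compute_uv=False)
for eps in [1e-2, 1e-4, 1e-8]:
    print(f"eps = {eps:g}: Kronecker rank r = {int((s > eps * s[0]).sum())}")
```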
Task 2. Computing the square root C^{1/2}.
Task 3. Kriging estimate ŝ = C_{sy} C_{yy}⁻¹ y.
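A minimal kriging sketch for Task 3 with an exponential covariance, a small nugget, and synthetic data; the names obs, tgt and the parameter values are illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(2)
obs, tgt = rng.random((200, 2)), rng.random((50, 2))   # measured / to predict
cov = lambda A, B: np.exp(-cdist(A, B) / 0.3)          # exponential covariance
C_yy = cov(obs, obs) + 1e-6 * np.eye(len(obs))         # nugget for stability
C_sy = cov(tgt, obs)
y = np.linalg.cholesky(C_yy) @ rng.standard_normal(len(obs))  # synthetic data
s_hat = C_sy @ np.linalg.solve(C_yy, y)                # kriging: C_sy C_yy^{-1} y
print(s_hat[:5])
```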
29. Five tasks in statistics to solve (continued)
Task 4. Geostatistical design:

$$\varphi_A = N^{-1}\,\mathrm{trace}\{C_{ss|y}\}, \qquad \varphi_C = z^\top C_{ss|y}\, z, \qquad \text{where } C_{ss|y} := C_{ss} - C_{sy}\, C_{yy}^{-1} C_{ys}.$$
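Continuing the kriging sketch above, the two design criteria of Task 4 follow directly from the conditional covariance C_{ss|y}; the direction z here is an arbitrary illustrative choice.

```python
import numpy as np

C_ss = cov(tgt, tgt)                                   # from the sketch above
C_ss_y = C_ss - C_sy @ np.linalg.solve(C_yy, C_sy.T)   # C_{ss|y}
phi_A = np.trace(C_ss_y) / len(tgt)                    # A-criterion
z = np.ones(len(tgt))                                  # illustrative direction z
print(phi_A, z @ C_ss_y @ z)                           # phi_A and phi_C
```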
Task 5. Computing the joint Gaussian log-likelihood function:

$$\mathcal{L}(\theta) = -\frac{N}{2}\log(2\pi) - \frac{1}{2}\log\det\{C(\theta)\} - \frac{1}{2}\, z^\top C(\theta)^{-1} z.$$
30. Two ways to find a low-rank tensor approximation of a Matérn covariance
Task 1: approximate the Matérn covariance.
31. Trace, diagonal, and determinant of C
Let C ≈ C̃ = ∑_{i=1}^r ⊗_{μ=1}^d C_{iμ}. Then

$$\mathrm{diag}(\tilde{C}) = \mathrm{diag}\Big(\sum_{i=1}^{r}\bigotimes_{\mu=1}^{d} C_{i\mu}\Big) = \sum_{i=1}^{r}\bigotimes_{\mu=1}^{d}\mathrm{diag}(C_{i\mu}), \qquad (6)$$

$$\mathrm{trace}(\tilde{C}) = \mathrm{trace}\Big(\sum_{i=1}^{r}\bigotimes_{\mu=1}^{d} C_{i\mu}\Big) = \sum_{i=1}^{r}\prod_{\mu=1}^{d}\mathrm{trace}(C_{i\mu}), \qquad (7)$$

and for the determinant it holds only for r = 1:

$$\log\det(C_1 \otimes C_2 \otimes C_3) = n_2 n_3 \log\det C_1 + n_1 n_3 \log\det C_2 + n_1 n_2 \log\det C_3. \qquad (8)$$
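Identities (6)-(8) are easy to verify numerically for small factors (here rank r = 1, d = 3, random SPD matrices); np.kron assembles the full matrix only for this sanity check.

```python
import numpy as np

rng = np.random.default_rng(3)
spd = lambda n: (lambda A: A @ A.T + n * np.eye(n))(rng.standard_normal((n, n)))
n1, n2, n3 = 4, 5, 6
C1, C2, C3 = spd(n1), spd(n2), spd(n3)
C = np.kron(np.kron(C1, C2), C3)                       # rank-1 (r = 1) case

ld = lambda M: np.linalg.slogdet(M)[1]
print(np.isclose(ld(C), n2*n3*ld(C1) + n1*n3*ld(C2) + n1*n2*ld(C3)))    # (8)
print(np.isclose(np.trace(C), np.trace(C1)*np.trace(C2)*np.trace(C3)))  # (7)
print(np.allclose(np.diag(C),
                  np.kron(np.kron(np.diag(C1), np.diag(C2)), np.diag(C3))))  # (6)
```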
32. Existence of the canonical low-rank tensor approximation
A scheme of the proof of the existence of the canonical low-rank tensor approximation is given in [A. Litvinenko et al., Tucker tensor analysis of Matérn functions in spatial statistics, preprint arXiv:1711.06874, 2017].
It can be easier to apply the Laplace transform to the Fourier transform of a Matérn covariance matrix than to the Matérn covariance itself. To approximate the resulting Laplace integral we apply the sinc quadrature.
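The mechanism can be seen on the simplest Laplace integral: substituting t = e^u in 1/x = ∫₀^∞ e^{−xt} dt and applying the trapezoidal (sinc) rule turns 1/x into a short sum of exponentials, which is what makes the resulting approximation separable. A sketch with illustrative step size and truncation:

```python
import numpy as np

def inv_by_sinc(x, h=0.35, K=60):
    """1/x = int_0^inf exp(-x t) dt, with t = exp(u) and trapezoidal rule in u."""
    k = np.arange(-K, K + 1)
    t = np.exp(k * h)                                # quadrature nodes
    w = h * t                                        # weights h * exp(k h)
    return np.exp(-np.outer(x, t)) @ w               # sum of 2K+1 exponentials

x = np.array([0.5, 1.0, 3.0, 10.0])
print(inv_by_sinc(x))                                # ~ [2.0, 1.0, 0.3333, 0.1]
print(np.abs(inv_by_sinc(x) - 1.0 / x))              # small quadrature error
```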
33. Likelihood in d dimensions
Example: N = 6000³. Using MATLAB on a MacBook Pro with 16 GB RAM, the time required to set up the matrices C₁, C₂ and C₃ is 11 seconds; it takes 4 seconds to compute L₁, L₂ and L₃.
In previous work we used H-matrices to approximate Cᵢ and its H-Cholesky factor Lᵢ for n = 2·10⁶ in 2 minutes. Here, assuming C = C₁ ⊗ ... ⊗ C_d, we approximate C for N = (2·10⁶)^d in 2d minutes.
$$\mathcal{L} \approx \widetilde{\mathcal{L}} = -\frac{1}{2}\prod_{\nu=1}^{d} n_\nu\,\log(2\pi) - \frac{1}{2}\sum_{j=1}^{d}\Big(\log\det C_j \prod_{i=1,\, i\neq j}^{d} n_i\Big) - \frac{1}{2}\sum_{i=1}^{r}\sum_{j=1}^{r}\prod_{\nu=1}^{d} u_{i,\nu}^\top u_{j,\nu}.$$
34. Tensor approximation of Gaussian covariance
Let cov(x, y) = e^{−|x−y|²} be the Gaussian covariance function, where x = (x₁, ..., x_d), y = (y₁, ..., y_d) ∈ [a, b]^d. Let C be the Gaussian covariance matrix defined on a tensor grid in [a, b]^d, d ≥ 1.
Then C can be written as a Kronecker product of one-dimensional covariance matrices, i.e., C = C₁ ⊗ ... ⊗ C_d.
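The claim is exact, not approximate, and can be checked in a few lines for d = 2 (illustrative grid of 15 points per direction):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 15)                        # same 1D grid per direction
C1 = np.exp(-(x[:, None] - x[None, :]) ** 2)         # 1D Gaussian factor
X1, X2 = np.meshgrid(x, x, indexing="ij")
pts = np.column_stack([X1.ravel(), X2.ravel()])
r2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)
C = np.exp(-r2)                                      # full e^{-|x-y|^2} on the grid
print(np.allclose(C, np.kron(C1, C1)))               # True: C = C1 kron C2
```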
35. Tensor Cholesky
If the d Cholesky decompositions exist, i.e., Cᵢ = Lᵢ Lᵢᵀ for i = 1, ..., d, then

$$C_1 \otimes \cdots \otimes C_d = (L_1 L_1^\top) \otimes \cdots \otimes (L_d L_d^\top) \qquad (9)$$
$$= (L_1 \otimes \cdots \otimes L_d)\cdot(L_1^\top \otimes \cdots \otimes L_d^\top) =: L\, L^\top, \qquad (10)$$

where L := L₁ ⊗ ... ⊗ L_d and Lᵀ := L₁ᵀ ⊗ ... ⊗ L_dᵀ are lower and upper triangular, respectively.
The computational complexity drops from O(N log N), N = n^d, to O(d n log n), where n is the number of mesh points in one direction. Further research is required for non-Gaussian covariance functions.
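A quick numerical check of (9)-(10) for d = 2; the tiny nugget on the flat Gaussian factor echoes the stabilization remark of Part 1, and by uniqueness of the Cholesky factor, L₁ ⊗ L₂ is the Cholesky factor of C₁ ⊗ C₂:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 12)
C1 = np.exp(-(x[:, None] - x[None, :]) ** 2) + 1e-8 * np.eye(12)  # nugget added
C2 = np.exp(-np.abs(x[:, None] - x[None, :]) / 0.5)               # exponential
L1, L2 = np.linalg.cholesky(C1), np.linalg.cholesky(C2)
L = np.kron(L1, L2)                                  # again lower triangular
print(np.allclose(L @ L.T, np.kron(C1, C2)))         # True: eqs. (9)-(10)
```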
36. Tensor inverse and determinant
If the inverse matrices Cᵢ⁻¹, i = 1, ..., d, exist, then

$$(C_1 \otimes \cdots \otimes C_d)^{-1} = C_1^{-1} \otimes \cdots \otimes C_d^{-1}, \qquad (11)$$

$$\log\det(C_1 \otimes C_2 \otimes C_3) = n_2 n_3 \log\det C_1 + n_1 n_3 \log\det C_2 + n_1 n_2 \log\det C_3. \qquad (12)$$

The cost again drops from O(N log N), N = n^d, to O(d n log n).
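Sketch of how (11)-(12) are used in practice: a system with the N × N matrix C₁ ⊗ C₂ is solved through the two small factors alone (via the standard vec identity), and the log-determinant is assembled from the factor determinants; np.kron appears only to verify the result.

```python
import numpy as np

rng = np.random.default_rng(4)
spd = lambda n: (lambda A: A @ A.T + n * np.eye(n))(rng.standard_normal((n, n)))
n1, n2 = 30, 40
C1, C2 = spd(n1), spd(n2)
b = rng.standard_normal(n1 * n2)

# (C1 kron C2)^{-1} b = vec(C1^{-1} B C2^{-1}), with B = b reshaped to n1 x n2
B = b.reshape(n1, n2)
z = np.linalg.solve(C2, np.linalg.solve(C1, B).T).T.reshape(-1)
print(np.allclose(np.kron(C1, C2) @ z, b))           # True: eq. (11)
logdet = n2 * np.linalg.slogdet(C1)[1] + n1 * np.linalg.slogdet(C2)[1]
print(np.isclose(logdet, np.linalg.slogdet(np.kron(C1, C2))[1]))  # True: eq. (12)
```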
37. Literature
1. A. Litvinenko, HLIBCov: Parallel Hierarchical Matrix Approximation of Large Covariance Matrices and Likelihoods with Applications in Parameter Identification, preprint arXiv:1709.08625, 2017.
2. A. Litvinenko, Y. Sun, M. G. Genton, D. Keyes, Likelihood Approximation with Hierarchical Matrices for Large Spatial Datasets, preprint arXiv:1709.04419, 2017.
3. B. N. Khoromskij, A. Litvinenko, H. G. Matthies, Application of hierarchical matrices for computing the Karhunen-Loève expansion, Computing 84 (1-2), 49-67, 2009.
4. H. G. Matthies, A. Litvinenko, O. Pajonk, B. V. Rosić, E. Zander, Parametric and uncertainty computations with tensor product representations, Uncertainty Quantification in Scientific Computing, 139-150, 2012.
5. W. Nowak, A. Litvinenko, Kriging and spatial design accelerated by orders of magnitude: Combining low-rank covariance approximations with FFT-techniques, Mathematical Geosciences 45 (4), 411-435, 2013.
6. P. Waehnert, W. Hackbusch, M. Espig, A. Litvinenko, H. Matthies, Efficient low-rank approximation of the stochastic Galerkin matrix in the tensor format, Computers & Mathematics with Applications 67 (4), 818-829, 2014.
7. M. Espig, W. Hackbusch, A. Litvinenko, H. G. Matthies, E. Zander, Efficient analysis of high dimensional data in tensor formats, Sparse Grids and Applications, 31-56, Springer, Berlin, 2013.