1. Application of Parallel Hierarchical Matrices and Low-Rank Tensors in Spatial Statistics and Parameter Identification
Alexander Litvinenko,
(joint work with M. Genton, Y. Sun and D. Keyes)
SIAM PP 2018, Tokyo, Japan
Bayesian Computational Statistics & Modeling, KAUST
https://bayescomp.kaust.edu.sa/
Stochastic Numerics Group, KAUST
http://sri-uq.kaust.edu.sa/
Extreme Computing Research Center, KAUST
https://ecrc.kaust.edu.sa/
2. The structure of the talk
Part 1: Parallel H-matrices in spatial statistics
1. Motivation: improve statistical model
2. Tools: Hierarchical matrices [Hackbusch 1999]
3. Matérn covariance function and joint Gaussian likelihood
4. Identification of unknown parameters via maximizing the Gaussian log-likelihood
5. Implementation with HLIBPro.
Part 2: Low-rank Tucker tensor methods in spatial statistics
3. Motivation
Tasks: 1) improve the statistical model that predicts soil moisture; 2) use the improved model to forecast missing values.
Given: n ≈ 2.5 million observations of soil moisture data
[Figure: map of high-resolution daily soil moisture data (longitude vs. latitude, values 0.15-0.50) at the top layer of the Mississippi basin, U.S.A., 01.01.2014 (Chaney et al., in review).]
Important for agriculture, defense, ...
4. Goal: Estimation of unknown parameters
[Figure: (left) dependence of the boxplots for ν on the H-matrix rank k ∈ {3, 7, 9, 12} when n = 16,000; (right) convergence of the boxplots for ν with increasing n ∈ {1000, 2000, 4000, 8000, 16000, 32000}; 100 replicates.]
5. What will change after the H-matrix approximation?
Approximate C by an H-matrix C_H.
1. How do the eigenvalues of C and C_H differ?
2. How does det(C) differ from det(C_H)? [Below]
3. How does L differ from L_H? [Mario Bebendorf et al.]
4. How does C^{-1} differ from (C_H)^{-1}? [Mario Bebendorf et al.]
5. How does L̃(θ, k) differ from L(θ)? [Below]
6. What is the optimal H-matrix rank? [Below]
7. How does θ_H differ from θ? [Below]
For theory and estimates of the rank and accuracy, see the works of Bebendorf, Grasedyck, Le Borne, Hackbusch, ...
8. H-matrix approximations of C, L and C^{-1}
[Figure: examples of a cluster tree T_I (left) and a block cluster tree T_{I×I} (right) for an index set I of 1600 indices. The decomposition of the matrix into sub-blocks is defined by T_{I×I} and the admissibility condition.]
11. Identifying unknown parameters
Given: Z = {Z(s₁), ..., Z(sₙ)}, where Z(s) is a Gaussian random field indexed by a spatial location s ∈ R^d, d ≥ 1.
Assumption: Z has mean zero and a stationary parametric covariance function, e.g. Matérn:

$$C(r; \theta) = \frac{2^{1-\nu}\sigma^2}{\Gamma(\nu)} \left(\frac{r}{\ell}\right)^{\nu} K_\nu\!\left(\frac{r}{\ell}\right), \qquad \theta = (\sigma^2, \nu, \ell).$$

To identify: the unknown parameters θ := (σ², ν, ℓ).
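For reference, a small Python sketch of this Matérn family using SciPy's modified Bessel function K_ν; the dense matrix below is only a stand-in for the H-matrix approximation built in HLIBPro, and the helper name matern_cov is ours.

```python
import numpy as np
from scipy.special import gamma, kv
from scipy.spatial.distance import cdist

def matern_cov(locs, sigma2=1.0, nu=0.5, ell=0.2337):
    """Dense Matérn covariance matrix C(r; theta) for locations in R^d."""
    r = cdist(locs, locs)
    C = np.full_like(r, sigma2)              # C(0) = sigma^2 on the diagonal
    z = r[r > 0] / ell
    C[r > 0] = sigma2 * 2.0**(1.0 - nu) / gamma(nu) * z**nu * kv(nu, z)
    return C

locs = np.random.rand(500, 2)                # 500 random locations in [0,1]^2
C = matern_cov(locs)                         # nu = 0.5 gives the exponential kernel
print(C.shape, np.allclose(C, C.T))
```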
12. Identifying unknown parameters
Statistical inference about θ is then based on the Gaussian log-likelihood function:

$$\mathcal{L}(\theta) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log|C(\theta)| - \frac{1}{2}\, Z^\top C(\theta)^{-1} Z, \qquad (3)$$

where the covariance matrix C(θ) has entries C(sᵢ − sⱼ; θ), i, j = 1, ..., n. The maximum likelihood estimator of θ is the value θ̂ that maximizes (3).
13. Details of the identification
To maximize the log-likelihood function we use Brent's method [Brent '73], which combines the bisection method, the secant method and inverse quadratic interpolation.
1. H-matrix approximation: C(θ) ≈ C̃(θ, k) (fixed rank k) or C̃(θ, ε) (fixed block accuracy ε).
2. H-Cholesky factorization: C̃(θ, k) = L̃(θ, k) L̃(θ, k)ᵀ.
3. Zᵀ C̃⁻¹ Z = Zᵀ (L̃L̃ᵀ)⁻¹ Z = vᵀ v, where v solves L̃(θ, k) v(θ) = Z.
$$\log\det C = \log\det(LL^\top) = \log\prod_{i=1}^{n}\lambda_i^2 = 2\sum_{i=1}^{n}\log\lambda_i,$$

where λᵢ = Lᵢᵢ are the diagonal entries of the Cholesky factor, so that

$$\widetilde{\mathcal{L}}(\theta, k) = -\frac{N}{2}\log(2\pi) - \sum_{i=1}^{N}\log \tilde{L}_{ii}(\theta, k) - \frac{1}{2}\, v(\theta)^\top v(\theta). \qquad (4)$$
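A runnable sketch of steps 1-3 and of eq. (4), with a dense Cholesky factor standing in for the H-Cholesky factor and SciPy's bounded minimizer (which implements Brent's method on an interval) maximizing the likelihood over ν alone; it reuses the hypothetical matern_cov helper from the sketch above, and the tiny nugget anticipates the stabilization remark of slide 16.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
locs = rng.random((300, 2))                      # 300 synthetic locations
C0 = matern_cov(locs, 1.0, 0.5, 0.2337)          # "true" parameters
Z = cholesky(C0 + 1e-10 * np.eye(300), lower=True) @ rng.standard_normal(300)

def neg_log_likelihood(nu):
    C = matern_cov(locs, 1.0, nu, 0.2337)        # step 1 (dense stand-in)
    C += 1e-10 * np.eye(len(C))                  # tiny nugget, cf. slide 16
    L = cholesky(C, lower=True)                  # step 2: C = L L^T
    v = solve_triangular(L, Z, lower=True)       # step 3: L v = Z
    logdet = 2.0 * np.log(np.diag(L)).sum()      # log det C = 2 sum log L_ii
    return 0.5 * (len(Z) * np.log(2 * np.pi) + logdet + v @ v)

res = minimize_scalar(neg_log_likelihood, bounds=(0.1, 2.0), method="bounded")
print("estimated nu:", res.x)                    # should be close to 0.5
```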
14. Convergence of the optimization method
In the 3D case with n = 128,000, we may need about one hour and 150-250 iterations.
15. Dependence of the log-likelihood and its ingredients on the parameters
[Figure: n = 4225; k = 8 in the first row and k = 16 in the second. First column: ℓ = 0.2337 fixed; second column: ν = 0.5 fixed.]
16. Remark: stabilization with a nugget
To avoid instability in computing the Cholesky factorization, we add a nugget: C_m = C + δ²I. If λᵢ are the eigenvalues of C, then the eigenvalues of C_m are λᵢ + δ², and

$$\log\det(C_m) = \log\prod_{i=1}^{n}(\lambda_i + \delta^2) = \sum_{i=1}^{n}\log(\lambda_i + \delta^2).$$

[Figure: (left) dependence of the negative log-likelihood on the parameter ℓ with nuggets {0.01, 0.005, 0.001} for the Gaussian covariance; (right) zoom of the left figure near the minimum. n = 2000 random points from the moisture example, rank k = 14, σ² = 1, ν = 0.5.]
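A short sketch of the nugget trick: without the δ²I term, the Cholesky factorization of a Gaussian covariance matrix of this size typically fails numerically, while any of the three nugget values from the figure keeps it alive (illustrative Python, ℓ = 0.5 assumed).

```python
import numpy as np
from scipy.spatial.distance import cdist

x = np.random.rand(2000, 2)
C = np.exp(-(cdist(x, x) / 0.5) ** 2)            # Gaussian covariance, near-singular
for delta2 in [1e-2, 5e-3, 1e-3]:
    Cm = C + delta2 * np.eye(len(C))             # C_m = C + delta^2 I
    L = np.linalg.cholesky(Cm)                   # would fail for delta2 = 0
    print(f"delta^2 = {delta2:g}: log det(C_m) = {2*np.log(np.diag(L)).sum():.2f}")
```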
17. Error analysis
Theorem (1)
Let C̃ be an H-matrix approximation of the matrix C ∈ R^{n×n} such that ρ(C⁻¹C̃ − I) ≤ ε < 1. Then

$$\big|\log|\tilde{C}| - \log|C|\big| \le -n\log(1-\varepsilon). \qquad (5)$$
Proof: See [Ballani, Kressner 14] and [Ipsen 05].
Remark: the factor n is pessimistic and is not actually observed numerically.
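One can probe the bound (5) numerically; in this small sketch (a random SPD matrix with an artificial symmetric perturbation standing in for the H-matrix error), the observed log-determinant error is indeed orders of magnitude below −n·log(1 − ε), in line with the remark.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
A = rng.standard_normal((n, n))
C = A @ A.T + n * np.eye(n)                      # SPD "exact" matrix
E = 1e-6 * rng.standard_normal((n, n))
Ct = C + (E + E.T) / 2                           # perturbed stand-in for C~
eps = np.max(np.abs(np.linalg.eigvals(np.linalg.solve(C, Ct) - np.eye(n))))
err = abs(np.linalg.slogdet(Ct)[1] - np.linalg.slogdet(C)[1])
print(f"eps = {eps:.2e}, error = {err:.2e}, bound = {-n*np.log1p(-eps):.2e}")
```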
18. Error in the log-likelihood
Theorem (2)
Let C̃ ≈ C ∈ R^{n×n} and let Z be a vector with ‖Z‖ ≤ c₀ and ‖C̃⁻¹‖ ≤ c₁. Let ρ(C̃⁻¹C − I) ≤ ε < 1. Then it holds that

$$|\widetilde{\mathcal{L}}(\theta) - \mathcal{L}(\theta)| \le \frac{1}{2}\,\big|\log|\tilde{C}| - \log|C|\big| + \frac{1}{2}\,\big|Z^\top\big(\tilde{C}^{-1} - C^{-1}\big)Z\big|$$
$$\le -\frac{1}{2}\, n\log(1-\varepsilon) + \frac{1}{2}\,\big|Z^\top \tilde{C}^{-1}\big(C - \tilde{C}\big)\,C^{-1} Z\big| \le -\frac{1}{2}\, n\log(1-\varepsilon) + \frac{1}{2}\, c_0^2\, c_1\, \varepsilon.$$
19. H-matrix approximation
Denote the accuracy in each sub-block by ε; n = 16641, ν = 0.5, c.r. = compression ratio. log|C| = 2.63 for ℓ = 0.0334 and log|C| = 6.36 for ℓ = 0.2337.
ε        |log|C| − log|C̃||   |log|C| − log|C̃|| / |log|C||   ‖C − C̃‖_F   ‖C − C̃‖₂/‖C‖₂   ‖I − (L̃L̃ᵀ)⁻¹C̃‖₂   c.r. in %
ℓ = 0.0334:
1e-1     3.2e-4               1.2e-4                         7.0e-3       7.6e-3           2.9                 9.16
1e-2     1.6e-6               6.0e-7                         1.0e-3       6.7e-4           9.9e-2              9.4
1e-4     1.8e-9               7.0e-10                        1.0e-5       7.3e-6           2.0e-3              10.2
1e-8     4.7e-13              1.8e-13                        1.3e-9       6e-10            2.1e-7              12.7
ℓ = 0.2337:
1e-4     9.8e-5               1.5e-5                         8.1e-5       1.4e-5           2.5e-1              9.5
1e-8     1.45e-9              2.3e-10                        1.1e-8       1.5e-9           4e-5                11.3
20. Boxplots for unknown parameters
Moisture data example. Boxplots for the 100 estimates of (ℓ, ν, σ²), respectively, when n = 32K, 16K, 8K, 4K, 2K; H-matrix with fixed rank k = 11. Green horizontal lines denote the 25% and 75% quantiles for n = 32K.
21. How much memory is needed?
[Figure: (left) dependence of the H-matrix size (in bytes) on the covariance length ℓ, with ν = 0.325, σ² = 0.98; (right) dependence on the smoothness ν, with ℓ = 0.58, σ² = 0.98; two accuracies in the H-matrix sub-blocks, ε ∈ {10⁻⁴, 10⁻⁶}; n = 2,000 locations in the domain [32.4, 43.4] × [−84.8, −72.9].]
23. How does the log-likelihood depend on n?
Figure: dependence of the negative log-likelihood function on the number of locations n ∈ {2000, 4000, 8000, 16000, 32000}, in log-scale.
24. Time and memory for parallel H-matrix approximation
Maximal number of cores is 40; ν = 0.325, ℓ = 0.64, σ² = 0.98.
n           C̃: time (sec)   C̃: size (MB)   kB/dof   L̃L̃ᵀ: time (sec)   L̃L̃ᵀ: size (MB)   ‖I − (L̃L̃ᵀ)⁻¹C̃‖₂
32,000      3.3              162             5.1      2.4                172.7             2.4·10⁻³
128,000     13.3             776             6.1      13.9               881.2             1.1·10⁻²
512,000     52.8             3420            6.7      77.6               4150              3.5·10⁻²
2,000,000   229              14790           7.4      473                18970             1.4·10⁻¹
25. Parameter identification
[Figure: boxplots for ℓ (left) and ν (right) for n = 1,000 × {64, 32, 16, 8, 4, 2}; synthetic data with known parameters (ℓ*, ν*, σ²*) = (0.5, 1, 0.5); 100 replicates.]
26. Conclusion of Part 1
Matérn covariance matrices can be approximated in the H-matrix format.
We provided an estimate of ‖L − L̃‖.
We studied the influence of the H-matrix approximation error on the estimated parameters (boxplots).
With H-matrices we extend the class of covariance functions we can work with, and we allow non-regular discretizations of the covariance function on large spatial grids.
With the maximization algorithm we are able to identify three parameters: the covariance length ℓ, the smoothness ν and the variance σ².
27. Part 2: Low-rank tensor methods for spatial statistics
[Figure: Tucker decomposition of a full tensor A (mode sizes n₁, n₂, n₃) into a small core tensor B of size r₁ × r₂ × r₃ and factor matrices V⁽¹⁾, V⁽²⁾, V⁽³⁾.]
See more in [A. Litvinenko, D. Keyes, V. Khoromskaia, B. N. Khoromskij, H. G. Matthies, Tucker tensor analysis of Matérn functions in spatial statistics, preprint arXiv:1711.06874, 2017]; accepted for CMAM, March 9, 2018.
28. Five tasks in statistics to solve
Task 1. Approximate a Matérn covariance function in a low-rank tensor format:

$$\Big\| c(x, y) - \sum_{i=1}^{r} \prod_{\mu=1}^{d} c_{i\mu}(x_\mu, y_\mu) \Big\| \le \varepsilon \quad \text{for some given } \varepsilon > 0.$$

Alternatively, we may look for matrix factors C_{iμ} such that

$$\Big\| C - \sum_{i=1}^{r} \bigotimes_{\mu=1}^{d} C_{i\mu} \Big\| \le \varepsilon.$$

Here the matrices C_{iμ} correspond to the one-dimensional functions c_{iμ}(x_μ, y_μ) in direction μ.
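For d = 2 the second formulation can be tested directly: the Kronecker rank of C equals the matrix rank of the Van Loan-Pitsianis rearrangement R of C, so the singular values of R reveal the r needed for a given ε. A small sketch with a Matérn ν = 1/2 (exponential) kernel; the grid sizes and ℓ = 0.3 are illustrative.

```python
import numpy as np

n1 = n2 = 20
x = np.linspace(0.0, 1.0, n1)
X1, X2 = np.meshgrid(x, x, indexing="ij")
pts = np.column_stack([X1.ravel(), X2.ravel()])
r = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
C = np.exp(-r / 0.3)                              # Matérn nu = 1/2 on a 20x20 grid

# Van Loan-Pitsianis rearrangement: C[(i1,i2),(j1,j2)] -> R[(i1,j1),(i2,j2)]
R = C.reshape(n1, n2, n1, n2).transpose(0, 2, 1, 3).reshape(n1 * n1, n2 * n2)
s = np.linalg.svd(R, compute_uv=False)
for eps in [1e-2, 1e-4, 1e-8]:
    print(f"eps = {eps:g}: Kronecker rank r = {int((s > eps * s[0]).sum())}")
```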
Task 2. Computing the square root C^{1/2}.
Task 3. Kriging estimate ŝ = C_{sy} C_{yy}⁻¹ y.
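A minimal kriging sketch for Task 3 with an exponential covariance, a small nugget, and synthetic data; the names obs, tgt and the parameter values are illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(2)
obs, tgt = rng.random((200, 2)), rng.random((50, 2))   # measured / to predict
cov = lambda A, B: np.exp(-cdist(A, B) / 0.3)          # exponential covariance
C_yy = cov(obs, obs) + 1e-6 * np.eye(len(obs))         # nugget for stability
C_sy = cov(tgt, obs)
y = np.linalg.cholesky(C_yy) @ rng.standard_normal(len(obs))  # synthetic data
s_hat = C_sy @ np.linalg.solve(C_yy, y)                # kriging: C_sy C_yy^{-1} y
print(s_hat[:5])
```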
29. Five tasks in statistics to solve (continued)
Task 4. Geostatistical design:

$$\varphi_A = N^{-1}\,\mathrm{trace}\{C_{ss|y}\}, \qquad \varphi_C = z^\top C_{ss|y}\, z, \qquad \text{where } C_{ss|y} := C_{ss} - C_{sy}\, C_{yy}^{-1} C_{ys}.$$
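Continuing the kriging sketch above, the two design criteria of Task 4 follow directly from the conditional covariance C_{ss|y}; the direction z here is an arbitrary illustrative choice.

```python
import numpy as np

C_ss = cov(tgt, tgt)                                   # from the sketch above
C_ss_y = C_ss - C_sy @ np.linalg.solve(C_yy, C_sy.T)   # C_{ss|y}
phi_A = np.trace(C_ss_y) / len(tgt)                    # A-criterion
z = np.ones(len(tgt))                                  # illustrative direction z
print(phi_A, z @ C_ss_y @ z)                           # phi_A and phi_C
```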
Task 5. Computing the joint Gaussian log-likelihood function:

$$\mathcal{L}(\theta) = -\frac{N}{2}\log(2\pi) - \frac{1}{2}\log\det\{C(\theta)\} - \frac{1}{2}\, z^\top C(\theta)^{-1} z.$$
30. Two ways to find a low-rank tensor approximation of a Matérn covariance
Task 1: approximate the Matérn covariance.
31. Trace, diagonal, and determinant of C
Let C ≈ C̃ = ∑_{i=1}^r ⊗_{μ=1}^d C_{iμ}. Then

$$\mathrm{diag}(\tilde{C}) = \mathrm{diag}\Big(\sum_{i=1}^{r}\bigotimes_{\mu=1}^{d} C_{i\mu}\Big) = \sum_{i=1}^{r}\bigotimes_{\mu=1}^{d}\mathrm{diag}(C_{i\mu}), \qquad (6)$$

$$\mathrm{trace}(\tilde{C}) = \mathrm{trace}\Big(\sum_{i=1}^{r}\bigotimes_{\mu=1}^{d} C_{i\mu}\Big) = \sum_{i=1}^{r}\prod_{\mu=1}^{d}\mathrm{trace}(C_{i\mu}), \qquad (7)$$

and for the determinant it holds only for r = 1:

$$\log\det(C_1 \otimes C_2 \otimes C_3) = n_2 n_3 \log\det C_1 + n_1 n_3 \log\det C_2 + n_1 n_2 \log\det C_3. \qquad (8)$$
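Identities (6)-(8) are easy to verify numerically for small factors (here rank r = 1, d = 3, random SPD matrices); np.kron assembles the full matrix only for this sanity check.

```python
import numpy as np

rng = np.random.default_rng(3)
spd = lambda n: (lambda A: A @ A.T + n * np.eye(n))(rng.standard_normal((n, n)))
n1, n2, n3 = 4, 5, 6
C1, C2, C3 = spd(n1), spd(n2), spd(n3)
C = np.kron(np.kron(C1, C2), C3)                       # rank-1 (r = 1) case

ld = lambda M: np.linalg.slogdet(M)[1]
print(np.isclose(ld(C), n2*n3*ld(C1) + n1*n3*ld(C2) + n1*n2*ld(C3)))    # (8)
print(np.isclose(np.trace(C), np.trace(C1)*np.trace(C2)*np.trace(C3)))  # (7)
print(np.allclose(np.diag(C),
                  np.kron(np.kron(np.diag(C1), np.diag(C2)), np.diag(C3))))  # (6)
```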
32. Existence of the canonical low-rank tensor approximation
A scheme of the proof of the existence of the canonical low-rank tensor approximation is given in [A. Litvinenko et al., Tucker tensor analysis of Matérn functions in spatial statistics, preprint arXiv:1711.06874, 2017].
It can be easier to apply the Laplace transform to the Fourier transform of a Matérn covariance matrix than to the Matérn covariance itself. To approximate the resulting Laplace integral we apply the sinc quadrature.
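The mechanism can be seen on the simplest Laplace integral: substituting t = e^u in 1/x = ∫₀^∞ e^{−xt} dt and applying the trapezoidal (sinc) rule turns 1/x into a short sum of exponentials, which is what makes the resulting approximation separable. A sketch with illustrative step size and truncation:

```python
import numpy as np

def inv_by_sinc(x, h=0.35, K=60):
    """1/x = int_0^inf exp(-x t) dt, with t = exp(u) and trapezoidal rule in u."""
    k = np.arange(-K, K + 1)
    t = np.exp(k * h)                                # quadrature nodes
    w = h * t                                        # weights h * exp(k h)
    return np.exp(-np.outer(x, t)) @ w               # sum of 2K+1 exponentials

x = np.array([0.5, 1.0, 3.0, 10.0])
print(inv_by_sinc(x))                                # ~ [2.0, 1.0, 0.3333, 0.1]
print(np.abs(inv_by_sinc(x) - 1.0 / x))              # small quadrature error
```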
33. Likelihood in d dimensions
Example: N = 6000³. Using MATLAB on a MacBook Pro with 16 GB RAM, the time required to set up the matrices C₁, C₂ and C₃ is 11 seconds; it takes 4 seconds to compute L₁, L₂ and L₃.
In previous work we used H-matrices to approximate Cᵢ and its H-Cholesky factor Lᵢ for n = 2·10⁶ in 2 minutes. Here, assuming C = C₁ ⊗ ... ⊗ C_d, we approximate C for N = (2·10⁶)^d in 2d minutes.
$$\mathcal{L} \approx \widetilde{\mathcal{L}} = -\frac{1}{2}\prod_{\nu=1}^{d} n_\nu\,\log(2\pi) - \frac{1}{2}\sum_{j=1}^{d}\Big(\log\det C_j \prod_{i=1,\, i\neq j}^{d} n_i\Big) - \frac{1}{2}\sum_{i=1}^{r}\sum_{j=1}^{r}\prod_{\nu=1}^{d} u_{i,\nu}^\top u_{j,\nu}.$$
34. Tensor approximation of Gaussian covariance
Let cov(x, y) = e^{−|x−y|²} be the Gaussian covariance function, where x = (x₁, ..., x_d), y = (y₁, ..., y_d) ∈ [a, b]^d. Let C be the Gaussian covariance matrix defined on a tensor grid in [a, b]^d, d ≥ 1.
Then C can be written as a Kronecker product of one-dimensional covariance matrices, i.e., C = C₁ ⊗ ... ⊗ C_d.
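The claim is exact, not approximate, and can be checked in a few lines for d = 2 (illustrative grid of 15 points per direction):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 15)                        # same 1D grid per direction
C1 = np.exp(-(x[:, None] - x[None, :]) ** 2)         # 1D Gaussian factor
X1, X2 = np.meshgrid(x, x, indexing="ij")
pts = np.column_stack([X1.ravel(), X2.ravel()])
r2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)
C = np.exp(-r2)                                      # full e^{-|x-y|^2} on the grid
print(np.allclose(C, np.kron(C1, C1)))               # True: C = C1 kron C2
```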
35. Tensor Cholesky
If the d Cholesky decompositions exist, i.e., Cᵢ = Lᵢ Lᵢᵀ for i = 1, ..., d, then

$$C_1 \otimes \cdots \otimes C_d = (L_1 L_1^\top) \otimes \cdots \otimes (L_d L_d^\top) \qquad (9)$$
$$= (L_1 \otimes \cdots \otimes L_d)\cdot(L_1^\top \otimes \cdots \otimes L_d^\top) =: L\, L^\top, \qquad (10)$$

where L := L₁ ⊗ ... ⊗ L_d and Lᵀ := L₁ᵀ ⊗ ... ⊗ L_dᵀ are lower and upper triangular, respectively.
The computational complexity drops from O(N log N), N = n^d, to O(d n log n), where n is the number of mesh points in one direction. Further research is required for non-Gaussian covariance functions.
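A quick numerical check of (9)-(10) for d = 2; the tiny nugget on the flat Gaussian factor echoes the stabilization remark of Part 1, and by uniqueness of the Cholesky factor, L₁ ⊗ L₂ is the Cholesky factor of C₁ ⊗ C₂:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 12)
C1 = np.exp(-(x[:, None] - x[None, :]) ** 2) + 1e-8 * np.eye(12)  # nugget added
C2 = np.exp(-np.abs(x[:, None] - x[None, :]) / 0.5)               # exponential
L1, L2 = np.linalg.cholesky(C1), np.linalg.cholesky(C2)
L = np.kron(L1, L2)                                  # again lower triangular
print(np.allclose(L @ L.T, np.kron(C1, C2)))         # True: eqs. (9)-(10)
```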
36. Tensor inverse and determinant
If the inverse matrices Cᵢ⁻¹, i = 1, ..., d, exist, then

$$(C_1 \otimes \cdots \otimes C_d)^{-1} = C_1^{-1} \otimes \cdots \otimes C_d^{-1}, \qquad (11)$$

$$\log\det(C_1 \otimes C_2 \otimes C_3) = n_2 n_3 \log\det C_1 + n_1 n_3 \log\det C_2 + n_1 n_2 \log\det C_3. \qquad (12)$$

The cost again drops from O(N log N), N = n^d, to O(d n log n).
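Sketch of how (11)-(12) are used in practice: a system with the N × N matrix C₁ ⊗ C₂ is solved through the two small factors alone (via the standard vec identity), and the log-determinant is assembled from the factor determinants; np.kron appears only to verify the result.

```python
import numpy as np

rng = np.random.default_rng(4)
spd = lambda n: (lambda A: A @ A.T + n * np.eye(n))(rng.standard_normal((n, n)))
n1, n2 = 30, 40
C1, C2 = spd(n1), spd(n2)
b = rng.standard_normal(n1 * n2)

# (C1 kron C2)^{-1} b = vec(C1^{-1} B C2^{-1}), with B = b reshaped to n1 x n2
B = b.reshape(n1, n2)
z = np.linalg.solve(C2, np.linalg.solve(C1, B).T).T.reshape(-1)
print(np.allclose(np.kron(C1, C2) @ z, b))           # True: eq. (11)
logdet = n2 * np.linalg.slogdet(C1)[1] + n1 * np.linalg.slogdet(C2)[1]
print(np.isclose(logdet, np.linalg.slogdet(np.kron(C1, C2))[1]))  # True: eq. (12)
```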
37. Literature
1. A. Litvinenko, HLIBCov: Parallel Hierarchical Matrix Approximation of Large Covariance Matrices and Likelihoods with Applications in Parameter Identification, preprint arXiv:1709.08625, 2017.
2. A. Litvinenko, Y. Sun, M. G. Genton, D. Keyes, Likelihood Approximation with Hierarchical Matrices for Large Spatial Datasets, preprint arXiv:1709.04419, 2017.
3. B. N. Khoromskij, A. Litvinenko, H. G. Matthies, Application of hierarchical matrices for computing the Karhunen-Loève expansion, Computing 84 (1-2), 49-67, 2009.
4. H. G. Matthies, A. Litvinenko, O. Pajonk, B. V. Rosić, E. Zander, Parametric and uncertainty computations with tensor product representations, Uncertainty Quantification in Scientific Computing, 139-150, 2012.
5. W. Nowak, A. Litvinenko, Kriging and spatial design accelerated by orders of magnitude: Combining low-rank covariance approximations with FFT-techniques, Mathematical Geosciences 45 (4), 411-435, 2013.
6. P. Waehnert, W. Hackbusch, M. Espig, A. Litvinenko, H. Matthies, Efficient low-rank approximation of the stochastic Galerkin matrix in the tensor format, Computers & Mathematics with Applications 67 (4), 818-829, 2014.
7. M. Espig, W. Hackbusch, A. Litvinenko, H. G. Matthies, E. Zander, Efficient analysis of high dimensional data in tensor formats, Sparse Grids and Applications, 31-56, Springer, Berlin, 2013.