Integrating Cache Oblivious Approach
with Modern Processor Architecture:
The Case of Floyd-Warshall Algorithm
Toshio Endo (遠藤敏夫)
GSIC, Tokyo Institute of Technology (東京工業大学)
Supported by NEDO and RWBC-OIL, AIST
Architecture Trends
• Each processor has more and more cores
– Recent Xeon/EPYC have up to 56/64 cores
• Each core gains higher FLOPs with SIMD
instructions
– AVX-2, AVX-512, SVE…
• To mitigate the memory-wall problem, modern architectures tend to have
– Deeper cache hierarchies
• L1 → L2 → L3 → main memory
– Hybrid memory, including high-bandwidth memory or NVM
→ Algorithm kernels have been, and need to be, reconsidered
Cache Blocking
• Cache blocking is one of the standard techniques to improve locality
• Used to accelerate
– Dense/sparse linear algebra
– Stencil computation
– Graph algorithms, etc.
[Figure: access pattern with cache blocking; block size < cache size]
Issues of Cache Blocking
• Block sizes need to be architecture-aware
– Size of each cache level
– Number of cache levels
• cf. typical HPC CPUs have 3 levels, while Xeon Phi has 2 levels
• Supporting multi-level blocking makes programming harder
[Figure: memory hierarchy: registers, L1 instruction cache and L1 data cache, L2 cache, L3 cache, main memory]
Cache-Oblivious Approach
• The cache-oblivious approach has been proposed [Frigo et al. 99]
• Recursive divide & conquer is used so that the working set eventually fits in each cache level
This approach makes algorithms more architecture-independent
– Applied to linear algebra kernels, stencils, graph algorithms, FFT…
Locality is a Big Issue, But We Have More
• The cache-oblivious approach improves locality for multiple levels of cache
• However, we need to investigate whether it works well together with other features of modern processors
– SIMD parallelism
– Multi/many core parallelism
[Figure: optimization techniques to be combined: cache-oblivious approach, aligned access for SIMD, data layout transformation, multi-core parallelism]
Our Target Algorithm:
Floyd-Warshall Algorithm
Floyd-Warshall (FW) algorithm:
• A well-known algorithm for the All-Pairs Shortest Path (APSP) problem in graph analysis
[Figure: input and output of the algorithm, with size N]
Summary of This Work
• A high-performance FW implementation is given
– Works with AVX-512 SIMD instructions
– Supports multi-core
– Based on the cache-oblivious approach
• 1.1 TFlops on dual Skylake Xeon
• 700 GFlops on Xeon Phi KNL
– In single precision
• https://github.com/toshioendo/hoalgos
Non-Blocked FW Algorithm
D: a distance matrix of size N × N
Complexity: O(N^3)
Input: D[i,j] is the weight of the edge from i to j
Output: D[i,j] is the length of the shortest path from i to j
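A minimal Python sketch of the non-blocked algorithm described above, for illustration only. The function name `floyd_warshall`, the list-of-lists matrix, and the 4-vertex example graph are assumptions made for this sketch and are not taken from the slides. Missing edges are represented by infinity.

```python
INF = float("inf")  # marks "no edge" (assumption of this sketch)

def floyd_warshall(d):
    """Update the N x N matrix d in place so that d[i][j] becomes the
    length of the shortest path from i to j."""
    n = len(d)
    for k in range(n):              # k must be the outermost loop
        for i in range(n):
            for j in range(n):
                # c = min(c, a + b) with c = d[i][j], a = d[i][k], b = d[k][j]
                d[i][j] = min(d[i][j], d[i][k] + d[k][j])
    return d

# Hypothetical 4-vertex input: graph[i][j] is the weight of edge i -> j.
graph = [
    [0,   3,   INF, 7],
    [8,   0,   2,   INF],
    [5,   INF, 0,   1],
    [2,   INF, INF, 0],
]
dist = floyd_warshall([row[:] for row in graph])
```

The three nested loops over N give the O(N^3) complexity, and every iteration of k sweeps the whole matrix, which is why locality becomes the main concern for large N.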
(Non-Recursive) Blocked FW Algorithm
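Main algorithm:
procedure FW-Blocking(D)
  for k = 0 … N/BS-1
    FW-BASE(D[k,k], D[k,k], D[k,k])
    for all D[k,j]
      FW-BASE(D[k,k], D[k,j], D[k,j])
    for all D[i,k]
      FW-BASE(D[i,k], D[k,k], D[i,k])
    for all D[i,j]
      FW-BASE(D[i,k], D[k,j], D[i,j])
Base kernel (block-wise): FW-BASE(A, B, C) reads blocks A and B and updates block C with c = min(c, a+b)
[Figure: matrix D divided into BS × BS blocks, e.g. block D11]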
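The blocked algorithm above can be sketched in Python as follows. This is an illustration under stated assumptions, not the author's implementation: blocks are addressed by their (row, column) origin inside one list-of-lists matrix, and the names `fw_base`, `fw_blocked`, `fw_naive` and `random_graph` are hypothetical helpers. The last phase updates block D[i,j] from D[i,k] and D[k,j].

```python
import random

INF = float("inf")

def fw_naive(d):
    """Reference non-blocked algorithm, used only to check the result."""
    n = len(d)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                d[i][j] = min(d[i][j], d[i][k] + d[k][j])
    return d

def fw_base(d, a, b, c, bs):
    """Base kernel: C = min(C, A + B) on bs x bs blocks, k outermost.
    a, b, c are the (row, col) origins of blocks A, B, C inside d."""
    (ai, aj), (bi, bj), (ci, cj) = a, b, c
    for k in range(bs):
        for i in range(bs):
            for j in range(bs):
                s = d[ai + i][aj + k] + d[bi + k][bj + j]
                if s < d[ci + i][cj + j]:
                    d[ci + i][cj + j] = s

def fw_blocked(d, bs):
    n = len(d)
    assert n % bs == 0, "this sketch assumes N is a multiple of BS"
    nb = n // bs
    for k in range(nb):
        kk = (k * bs, k * bs)
        fw_base(d, kk, kk, kk, bs)                 # diagonal block
        for j in range(nb):                        # blocks in row k
            if j != k:
                fw_base(d, kk, (k * bs, j * bs), (k * bs, j * bs), bs)
        for i in range(nb):                        # blocks in column k
            if i != k:
                fw_base(d, (i * bs, k * bs), kk, (i * bs, k * bs), bs)
        for i in range(nb):                        # all remaining blocks
            for j in range(nb):
                if i != k and j != k:
                    fw_base(d, (i * bs, k * bs), (k * bs, j * bs),
                            (i * bs, j * bs), bs)
    return d

def random_graph(n, seed=0, density=0.3):
    """Hypothetical test input: integer weights 1..9, zero diagonal."""
    rng = random.Random(seed)
    return [[0 if i == j else
             (rng.randint(1, 9) if rng.random() < density else INF)
             for j in range(n)] for i in range(n)]

g = random_graph(8)
blocked_ok = fw_blocked([r[:] for r in g], 2) == fw_naive([r[:] for r in g])
```

Only the last phase works on blocks where C differs from both A and B; the first three phases update a block that is also one of the operands.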
Recursive Blocking FW Algorithm
[Park et al. 04]
[Figure: recursive algorithm listing; blocks A, B and C are each divided into quadrants 00, 01, 10 and 11, which are processed in a 1st half and a 2nd half; the base case stops the recursion]
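A hedged Python sketch of a recursive blocked formulation in the style of [Park et al. 04]. The quadrant ordering below is a reconstruction of the commonly described eight-call scheme (four calls per half) and is not copied from the slide. The argument order (A, B, C), with C being the block that is updated, is an assumption, and all function names are hypothetical. The sketch assumes N = BS × 2^m.

```python
import random

INF = float("inf")

def fw_naive(d):
    """Reference non-blocked algorithm, used only to check the result."""
    n = len(d)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                d[i][j] = min(d[i][j], d[i][k] + d[k][j])
    return d

def fw_base(d, a, b, c, size):
    """Base kernel C = min(C, A + B); a, b, c are block origins in d."""
    (ai, aj), (bi, bj), (ci, cj) = a, b, c
    for k in range(size):
        for i in range(size):
            for j in range(size):
                s = d[ai + i][aj + k] + d[bi + k][bj + j]
                if s < d[ci + i][cj + j]:
                    d[ci + i][cj + j] = s

def fw_rec(d, a, b, c, size, bs):
    if size <= bs:                      # stop recursion
        fw_base(d, a, b, c, size)
        return
    h = size // 2

    def q(origin, r, col):              # origin of quadrant (r, col)
        return (origin[0] + r * h, origin[1] + col * h)

    # 1st half: intermediate vertices from the first half of the k range
    fw_rec(d, q(a, 0, 0), q(b, 0, 0), q(c, 0, 0), h, bs)
    fw_rec(d, q(a, 0, 0), q(b, 0, 1), q(c, 0, 1), h, bs)
    fw_rec(d, q(a, 1, 0), q(b, 0, 0), q(c, 1, 0), h, bs)
    fw_rec(d, q(a, 1, 0), q(b, 0, 1), q(c, 1, 1), h, bs)
    # 2nd half: intermediate vertices from the second half of the k range
    fw_rec(d, q(a, 1, 1), q(b, 1, 1), q(c, 1, 1), h, bs)
    fw_rec(d, q(a, 1, 1), q(b, 1, 0), q(c, 1, 0), h, bs)
    fw_rec(d, q(a, 0, 1), q(b, 1, 1), q(c, 0, 1), h, bs)
    fw_rec(d, q(a, 0, 1), q(b, 1, 0), q(c, 0, 0), h, bs)

def fw_recursive(d, bs):
    origin = (0, 0)
    fw_rec(d, origin, origin, origin, len(d), bs)   # A = B = C = D
    return d

def random_graph(n, seed=0, density=0.3):
    """Hypothetical test input: integer weights 1..9, zero diagonal."""
    rng = random.Random(seed)
    return [[0 if i == j else
             (rng.randint(1, 9) if rng.random() < density else INF)
             for j in range(n)] for i in range(n)]

g = random_graph(16)
recursive_ok = fw_recursive([r[:] for r in g], 2) == fw_naive([r[:] for r in g])
```

No cache size appears anywhere in the recursion; the only tunable value is the base-case size `bs`.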
Integration with Optimizations for
Modern Processors
• So far, the cache-oblivious approach has been adopted
• In addition, we need to introduce optimizations for modern processors
– SIMD parallelism
– Data layout transformation
– Multi-core parallelism
– Kernel optimization with register blocking
Acceleration with
AVX-512 SIMD Instructions
[Rucci et al. 17]
With AVX-512, 16 single-precision (SP) values are computed at once
“c = min(c, a+b)” is implemented using a mask register
BS should be a multiple of 16
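The following Python sketch emulates the lane-wise update in scalar code. It is not AVX-512 code and uses no intrinsics. It assumes that the mask comes from a lane-wise comparison and then controls which lanes are stored, which is one plausible reading of "using a mask register". The name `masked_min_update` and the example values are hypothetical.

```python
LANES = 16  # one 512-bit register holds 16 single-precision values

def masked_min_update(c_row, a_ik, b_row):
    """Emulate c = min(c, a + b) on 16-wide chunks of a row of C.
    a_ik is a single value broadcast to all lanes."""
    assert len(c_row) == len(b_row)
    assert len(c_row) % LANES == 0, "row length must be a multiple of 16"
    for j0 in range(0, len(c_row), LANES):
        c_vec = c_row[j0:j0 + LANES]                      # vector load of C
        s_vec = [a_ik + b for b in b_row[j0:j0 + LANES]]  # broadcast add
        mask = [s < c for s, c in zip(s_vec, c_vec)]      # compare -> 16-bit mask
        for lane in range(LANES):                         # masked store
            if mask[lane]:
                c_row[j0 + lane] = s_vec[lane]
    return c_row

c = [10.0] * 32
b = [float(j) for j in range(32)]
masked_min_update(c, 2.0, b)
```

Because each chunk covers 16 consecutive elements, a block edge that is not a multiple of 16 would need a remainder loop, which motivates the constraint on BS.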
Introducing Block Data Layout
• With cache blocking, the memory access pattern is improved
• However, we may still suffer from cache conflict misses with the standard column-major format
Block data layout is adopted
Layout transformation is done before and after the FW computation (performance measurements include this overhead)
Note: recursive division sizes should be a multiple of BS
[Figure: column-major format vs. block data layout; in the block data layout, each BS × BS block occupies consecutive memory addresses]
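A small Python sketch of the layout transformation, assuming a flat column-major array as input. The ordering of blocks relative to each other (column-major over blocks) and the ordering inside a block (column-major) are assumptions of this sketch, since the slide only shows that each block is contiguous. The function names are hypothetical.

```python
def to_block_layout(src, n, bs):
    """src: flat column-major n x n array, element (i, j) at src[j*n + i].
    Returns a flat array in which every bs x bs block is contiguous."""
    assert n % bs == 0, "this sketch assumes N is a multiple of BS"
    nb = n // bs
    dst = [None] * (n * n)
    for j in range(n):
        for i in range(n):
            bi, bj = i // bs, j // bs          # which block
            li, lj = i % bs, j % bs            # position inside the block
            block_start = (bj * nb + bi) * bs * bs
            dst[block_start + lj * bs + li] = src[j * n + i]
    return dst

def from_block_layout(blk, n, bs):
    """Inverse transformation, applied after the FW computation."""
    nb = n // bs
    dst = [None] * (n * n)
    for j in range(n):
        for i in range(n):
            bi, bj = i // bs, j // bs
            li, lj = i % bs, j % bs
            dst[j * n + i] = blk[(bj * nb + bi) * bs * bs + lj * bs + li]
    return dst

src = list(range(16))                 # 4 x 4 matrix, column-major
blk = to_block_layout(src, 4, 2)
back = from_block_layout(blk, 4, 2)
```

In column-major storage, consecutive columns of one block are N elements apart, so for some values of N they can compete for the same cache sets. In the block layout, a block spans BS × BS consecutive addresses regardless of N.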
Acceleration with
Multi-Core Parallelism
• OpenMP is used
Non-recursive blocked algorithm:
procedure FW-Blocking(D)
  for k = 0 … N/BS-1
    FW-BASE(D[k,k], D[k,k], D[k,k])
    for all D[k,j]   (omp for)
      FW-BASE(D[k,k], D[k,j], D[k,j])
    for all D[i,k]   (omp for)
      FW-BASE(D[i,k], D[k,k], D[i,k])
    for all D[i,j]   (omp for)
      FW-BASE(D[i,k], D[k,j], D[i,j])
Recursive blocked algorithm:
[Figure: recursive listing annotated with the sequence omp task, omp task, omp taskwait, appearing twice]
Re-visiting Base Kernel (1)
Every element of C is read from and written to memory BS times
This “inefficiency” is required to preserve data dependencies
– Data written to C in the k-th step may be read (as A or B) in the k’-th step (k’>k)
→ Loop interchange is illegal in such cases
[Figure: base kernel code with the k loop outermost; each inner step reads 16 elements from C and writes 16 elements to C]
Re-visiting Base Kernel (2)
Do we always need to preserve the dependency? No!
Blocks A and B are read, and C is written
Aliased cases
If (A=C and/or B=C), we have to preserve the data dependency
(1) A=B=C
(2) A!=B=C
(3) B!=A=C
Non-aliased cases
If (A!=C and B!=C), we have opportunities for loop interchange optimization
(4) A!=C && B!=C
Optimized Kernel with Loop
Interchange and Register Blocking
Now the k loop is the inner loop
16 accumulators are used
Memory reads and writes to C occur only once per element, after the k loop finishes
This kernel can be used only when A!=C and B!=C
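A Python sketch contrasting the two kernel shapes for the non-aliased case. A single scalar accumulator `acc` stands in for the 16 vector accumulators mentioned above. The function names and the random test blocks are hypothetical and serve only as illustration.

```python
import random

def kernel_k_outer(a, b, c):
    """Dependency-preserving kernel: every element of C is read and
    written once per k step, i.e. BS times in total."""
    bs = len(c)
    for k in range(bs):
        for i in range(bs):
            for j in range(bs):
                c[i][j] = min(c[i][j], a[i][k] + b[k][j])

def kernel_k_inner(a, b, c):
    """Loop-interchanged kernel, valid only when A != C and B != C."""
    bs = len(c)
    for i in range(bs):
        for j in range(bs):
            acc = c[i][j]                 # read C once into an accumulator
            for k in range(bs):           # k loop is now innermost
                acc = min(acc, a[i][k] + b[k][j])
            c[i][j] = acc                 # write C once

def random_block(bs, rng):
    return [[rng.randint(1, 99) for _ in range(bs)] for _ in range(bs)]

rng = random.Random(0)
BS = 8
a, b, c0 = random_block(BS, rng), random_block(BS, rng), random_block(BS, rng)
c1 = [row[:] for row in c0]
c2 = [row[:] for row in c0]
kernel_k_outer(a, b, c1)
kernel_k_inner(a, b, c2)
same_result = (c1 == c2)
```

When A and B are distinct from C, neither operand changes during the kernel, so each element of C is simply the minimum of its old value and all a[i][k] + b[k][j]. The loop order therefore does not affect the result.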
Floyd-Warshall Implementations
Technique                  Park et al. 04   Rucci et al. 17   Ours
Cache blocking             Yes              Yes               Yes
Recursive cache blocking   Yes              -                 Yes
SIMD parallelism           -                Yes               Yes
Block data layout          Yes              ?                 Yes
Multi-core parallelism     -                Yes               Yes
Register blocking          -                -                 Yes
Experimental Environments
Two machines are used, both of which support AVX-512
• 2-socket Xeon Skylake
• Xeon Phi KNL
Block Size Configuration
• Even with the cache-oblivious approach, we still have to determine a single parameter, the block size (BS)
Based on the results of a preliminary evaluation, we use BS=64
[Figure: preliminary evaluation comparing BS=16, BS=32, BS=64 and BS=128]
Performance Evaluation:
1-Core SkyLake
• The non-recursive version is good, but the recursive version achieves the fastest speed
• Register blocking contributes a +30% speed-up
• Without the layout change, we see a slowdown when N is a multiple of 1024
[Figure: performance plot; an arrow marks the faster direction]
Performance Evaluation:
1-Core KNL
• We suffer from heavier impacts of conflict misses
• Register blocking contributes a +100% speed-up
• Matrix D is placed in MCDRAM; using DDR4 showed similar performance (refer to the paper)
[Figure: performance plot; an arrow marks the faster direction]
Performance Evaluation:
(16+16)-Core SkyLake
• The recursive version exceeds 1.1 TFlops
• On the other hand, our recursive version gets slower with smaller N
– Overhead of “omp task”?
[Figure: performance plot with a cross point marked; an arrow marks the faster direction]
Performance Evaluation:
64-Core KNL
• ~0.7 TFlops
• The slowdown with smaller N is heavier
• Conflict misses make execution completely impractical
[Figure: performance plot with a cross point marked; an arrow marks the faster direction]
Peak Performance Ratio
• 2-socket Skylake:
– Measured: 1.117 TFlops
– Peak (SP): 5.304 TFlops
– Peak performance ratio = 21%
– If we do not count FMAD in the peak, ratio = 42%
• KNL:
– Measured: 0.687 TFlops
– Peak (SP): 5.324 TFlops
– Peak performance ratio = 13%
– If we do not count FMAD in the peak, ratio = 26%
NOTE: In FW, FMAD cannot be used efficiently
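The ratios above can be reproduced with a few lines of Python. The sketch assumes that "not counting FMAD" halves the nominal peak, because a fused multiply-add is counted as two floating-point operations. That assumption matches the doubling of the ratios reported above. The function name `peak_ratio` is hypothetical.

```python
# (measured TFlops, nominal single-precision peak TFlops), from the slide
MACHINES = {
    "2-socket Skylake": (1.117, 5.304),
    "KNL": (0.687, 5.324),
}

def peak_ratio(measured, peak, count_fma=True):
    """Fraction of peak achieved; without FMA the usable peak is halved
    (assumption: FMA counts as 2 flops per lane in the nominal peak)."""
    effective_peak = peak if count_fma else peak / 2.0
    return measured / effective_peak

ratios = {
    name: (peak_ratio(m, p), peak_ratio(m, p, count_fma=False))
    for name, (m, p) in MACHINES.items()
}

for name, (with_fma, without_fma) in ratios.items():
    print(f"{name}: {with_fma:.0%} of peak, {without_fma:.0%} without FMA")
```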
Summary
• A high-performance FW implementation is given
– The cache-oblivious approach is integrated with
• SIMD parallelism
• Multi-core parallelism
• Data layout transformation
• Register blocking with careful algorithm analysis
• Exceeds the performance of state-of-the-art implementations
– 1.1 TFlops on dual Skylake Xeon
– 700 GFlops on Xeon Phi KNL
• In this experiment, MCDRAM and DDR4 performed similarly
https://github.com/toshioendo/hoalgos
Future Work
• Evaluation of other heterogeneous memory
– DIMM-type 3D XPoint
• Towards a higher-performance implementation
– Reducing the overhead of task creation by “omp task”
– Improving memory affinity
• Recursion + task creation works worse in this respect
→ An improved multi-task runtime is needed
• Towards a more “architecture-independent” implementation
– Our current version is free from cache-size parameters, but
– The base kernel depends on SIMD type and width
→ ARM SVE (Scalable Vector Extension) looks attractive

Integrating Cache Oblivious Approach with Modern Processor Architecture: The Case of Floyd-Warshall Algorithm

  • 1. Integrating Cache Oblivious Approach with Modern Processor Architecture: The Case of Floyd-Warshall Algorithm Toshio Endo (遠藤敏夫) GSIC, Tokyo Institute of Technology (東京工業大学) Supported by NEDO and RWBC-OIL, AIST
  • 2. Architecture Trends • Each processor has more and more cores – Recent Xeon/EPYC have up to 56/64 cores • Each core gains higher FLOPs with SIMD instructions – AVX-2, AVX-512, SVE… • In order to mitigate memory-wall problem, modern architecture tends to have – Deeper cache hierarchy • L1  L2  L3  Main memory – Hybrid memory including High-bandwidth memory or NVM  Algorithm kernels has been & need to be reconsidered
  • 3. Cache Blocking • Cache blocking is one of standard techniques to improve locality • Used to accelerate – Dense/sparse linear algebra – Stencil computation – Graph algorithms, etc. Block size < Cache size access pattern
  • 4. Issues of Cache Blocking • Block sizes need to be architecture aware – Sizes of each cache level – Number of cache levels • cf: Typical HPC CPUs have 3 level, while Xeon Phi have 2 level • If we support multi-level blocking, programming gets harder Registers Main Memory L3 cache L2 Cache L1 Inst Cache L1 Data Cache
  • 5. Cache-Oblivious Approach • Cache-oblivious approach has been proposed [Frigo et al. 99] • Recursive divide & conquer is used to make the working set size fit size of each cache level This approach makes algorithms more architecture independent – Applied to linear algebra kernel, stencil, graph, FFT…
  • 6. Locality is a Big Issue, But We Have More • Cache oblivious approach improve locality for multiple level of caches • However, we need to investigate whether it works well with considerations of other features in modern processors – SIMD parallelism – Multi/many core parallelism Cache oblivious approach Aligned access for SIMD Data layout transformation Multi-core Parallelism Optimization Techniques
  • 7. Our Target Algorithm: Floyd-Warshall Algorithm • Floyd-Warshall (FW) algorithm: a well-known algorithm for the All-Pairs Shortest Path (APSP) problem in graph analysis [Figure: input and output of the algorithm, with size N]
  • 8. Summary of This Work • A high-performance FW implementation is given – Works with AVX-512 SIMD instructions – Supports multi-core – Based on the cache-oblivious approach • 1.1 TFlops on dual Skylake Xeon • 700 GFlops on Xeon Phi KNL – In single precision • https://github.com/toshioendo/hoalgos
  • 9. Non-Blocked FW Algorithm • D: a distance matrix of size N × N • Complexity: O(N^3) • Input: D[i,j] is the weight of the edge from i to j • Output: D[i,j] is the length of the shortest path from i to j
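The non-blocked algorithm can be sketched as the classic triple loop below. This is a plain-Python illustration of the loop structure only, not the slide's own code; the function name and the 4-vertex example graph are made up for the sketch.

```python
INF = float("inf")  # marks "no edge"

def floyd_warshall(D):
    """In-place all-pairs shortest paths on an N x N distance matrix."""
    n = len(D)
    for k in range(n):              # k must be the outermost loop
        for i in range(n):
            for j in range(n):
                # Is the path i -> k -> j shorter than the best known i -> j?
                D[i][j] = min(D[i][j], D[i][k] + D[k][j])
    return D

# Hypothetical 4-vertex input: D[i][j] is the weight of edge i -> j.
D = [[0,   3,   INF, 7],
     [8,   0,   2,   INF],
     [5,   INF, 0,   1],
     [2,   INF, INF, 0]]
floyd_warshall(D)   # D now holds shortest-path lengths
```

On input, `D` holds edge weights; on return, the same storage holds the shortest-path lengths, which is the in-place update the slide describes.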
  • 10. (Non-Recursive) Blocked FW Algorithm • Main algorithm (block-wise): procedure FW-Blocking(D): for k = 0 … N/BS-1: FW-BASE(Dkk, Dkk, Dkk); for all Dkj: FW-BASE(Dkk, Dkj, Dkj); for all Dik: FW-BASE(Dik, Dkk, Dik); for all Dij: FW-BASE(Dik, Dkj, Dij) • FW-BASE is the base kernel that works on blocks [Figure: matrix D divided into blocks of size BS, e.g. block D11]
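A minimal Python sketch of the blocked procedure follows. It assumes that FW-BASE(A, B, C) updates block C from blocks A and B, and that the last phase writes block D_ij (the standard form of blocked FW). Blocks are passed as (row, col) corner offsets into D, which is a choice made for this sketch only.

```python
def fw_base(D, a, b, c, bs):
    """Base kernel: C[i][j] = min(C[i][j], A[i][k] + B[k][j]) for BS x BS
    blocks of D whose top-left corners are a, b, c = (row, col)."""
    (ar, ac), (br, bc), (cr, cc) = a, b, c
    for k in range(bs):                  # k outermost: blocks may alias
        for i in range(bs):
            for j in range(bs):
                via_k = D[ar + i][ac + k] + D[br + k][bc + j]
                if via_k < D[cr + i][cc + j]:
                    D[cr + i][cc + j] = via_k

def fw_blocked(D, bs):
    """Non-recursive blocked Floyd-Warshall; N must be a multiple of bs."""
    n = len(D)
    assert n % bs == 0
    nb = n // bs
    blk = lambda i, j: (i * bs, j * bs)  # corner of block (i, j)
    for k in range(nb):
        fw_base(D, blk(k, k), blk(k, k), blk(k, k), bs)      # diagonal block
        others = [x for x in range(nb) if x != k]
        for j in others:                                     # block row k
            fw_base(D, blk(k, k), blk(k, j), blk(k, j), bs)
        for i in others:                                     # block column k
            fw_base(D, blk(i, k), blk(k, k), blk(i, k), bs)
        for i in others:                                     # all remaining blocks
            for j in others:
                fw_base(D, blk(i, k), blk(k, j), blk(i, j), bs)
    return D
```

Calling `fw_blocked(D, bs)` should give the same matrix as the non-blocked triple loop for any block size that divides N.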
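  • 11. Recursive Blocking FW Algorithm [Park et al. 04] [Figure: recursive procedure with the point where recursion stops marked; A, B and C are each divided into 2 × 2 sub-blocks (A00, A01, A10, A11; B00, B01, B10, B11; C00, C01, C10, C11), which are processed in a 1st half and a 2nd half]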
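The recursive scheme can be sketched as follows. The order of the eight recursive calls is reconstructed from the standard formulation of [Park et al. 04], using the argument convention FW-BASE(A, B, C) with C as the written block; the slide's figure is not available in the transcript, so treat the exact order as an assumption. Blocks are again (row, col) offsets into D, and N is assumed to be BS times a power of two.

```python
def fw_base(D, a, b, c, n):
    """C[i][j] = min(C[i][j], A[i][k] + B[k][j]) on n x n blocks of D."""
    (ar, ac), (br, bc), (cr, cc) = a, b, c
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = D[ar + i][ac + k] + D[br + k][bc + j]
                if via_k < D[cr + i][cc + j]:
                    D[cr + i][cc + j] = via_k

def fw_rec(D, a, b, c, n, bs):
    if n <= bs:                          # stop recursion: run the base kernel
        fw_base(D, a, b, c, n)
        return
    h = n // 2
    q = lambda p, r, s: (p[0] + r * h, p[1] + s * h)   # corner of sub-block (r, s)
    # 1st half: k ranges over the first half (A's block column 0, B's block row 0)
    fw_rec(D, q(a, 0, 0), q(b, 0, 0), q(c, 0, 0), h, bs)
    fw_rec(D, q(a, 0, 0), q(b, 0, 1), q(c, 0, 1), h, bs)
    fw_rec(D, q(a, 1, 0), q(b, 0, 0), q(c, 1, 0), h, bs)
    fw_rec(D, q(a, 1, 0), q(b, 0, 1), q(c, 1, 1), h, bs)
    # 2nd half: k ranges over the second half, visiting C in reverse order
    fw_rec(D, q(a, 1, 1), q(b, 1, 1), q(c, 1, 1), h, bs)
    fw_rec(D, q(a, 1, 1), q(b, 1, 0), q(c, 1, 0), h, bs)
    fw_rec(D, q(a, 0, 1), q(b, 1, 1), q(c, 0, 1), h, bs)
    fw_rec(D, q(a, 0, 1), q(b, 1, 0), q(c, 0, 0), h, bs)

def fw_recursive(D, bs):
    """Top-level call: A, B and C are all the whole matrix D."""
    fw_rec(D, (0, 0), (0, 0), (0, 0), len(D), bs)
    return D
```

No cache size appears anywhere in the recursion; only the stopping size `bs` remains as a parameter.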
  • 12. Integration with Optimizations for Modern Processors • So far, the cache-oblivious approach has been adopted • Furthermore, we need to introduce optimizations for modern processors – SIMD parallelism – Data layout transformation – Multi-core parallelism – Kernel optimization with register blocking
  • 13. Acceleration with AVX-512 SIMD Instructions [Rucci et al. 17] • With AVX-512, 16 single-precision (SP) values are computed at once • “c = min(c, a+b)” is implemented by using a mask register • BS should be a multiple of 16
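The lane-wise update can be illustrated with a scalar emulation in Python. This is not AVX-512 code: each list stands in for one 512-bit register of 16 single-precision lanes, and the compare-then-blend sequence is an assumed reading of "using a mask register". The function name is hypothetical.

```python
LANES = 16  # one 512-bit register holds 16 single-precision values

def masked_min_update(c, a, b):
    """Emulate one vector step of c = min(c, a + b) over 16 lanes.

    Returns the updated lanes of c and the mask that selected them."""
    assert len(a) == len(b) == len(c) == LANES
    s = [x + y for x, y in zip(a, b)]              # vector add
    mask = [si < ci for si, ci in zip(s, c)]       # vector compare -> 16-bit mask
    new_c = [si if m else ci                       # masked update of c
             for si, ci, m in zip(s, c, mask)]
    return new_c, mask
```

Because one step handles 16 values, the kernel's vectorized loop advances in steps of 16, which is why BS should be a multiple of 16.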
  • 14. Introducing Block Data Layout • With cache blocking, the memory access pattern is improved • However, we may still suffer from conflict cache misses with the standard column-major format → Block data layout is adopted • Layout transformation is done before and after the FW computation (performance measurements include this overhead) • Note: recursive division sizes should be multiples of BS [Figure: column-major format vs. block data layout, in which each block of size BS occupies consecutive memory addresses]
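A layout transformation of this kind might look like the sketch below. The slide does not specify the order of blocks or of elements inside a block, so the column-major ordering at both levels is an assumption, and both function names are hypothetical.

```python
def to_block_layout(col_major, n, bs):
    """Copy an N x N column-major array into a layout where every
    BS x BS block occupies bs*bs consecutive elements."""
    assert n % bs == 0
    nb = n // bs
    out = [0] * (n * n)
    for j in range(n):
        for i in range(n):
            bi, ii = divmod(i, bs)                 # block row, row inside block
            bj, jj = divmod(j, bs)                 # block col, col inside block
            start = (bj * nb + bi) * bs * bs       # first element of block (bi, bj)
            out[start + jj * bs + ii] = col_major[j * n + i]
    return out

def from_block_layout(blocked, n, bs):
    """Inverse transformation, applied after the FW computation."""
    nb = n // bs
    out = [0] * (n * n)
    for j in range(n):
        for i in range(n):
            bi, ii = divmod(i, bs)
            bj, jj = divmod(j, bs)
            out[j * n + i] = blocked[(bj * nb + bi) * bs * bs + jj * bs + ii]
    return out
```

With this layout, a base-kernel call touches three contiguous regions of memory instead of BS strided column segments.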
  • 15. Acceleration with Multi-Core Parallelism • OpenMP is used • Non-recursive blocked algorithm: the three “for all” loops are annotated with “omp for” – procedure FW-Blocking(D): for k = 0 … N/BS-1: FW-BASE(Dkk, Dkk, Dkk); for all Dkj: FW-BASE(Dkk, Dkj, Dkj); for all Dik: FW-BASE(Dik, Dkk, Dik); for all Dij: FW-BASE(Dik, Dkj, Dij) • Recursive blocked algorithm: recursive calls are annotated in the order “omp task”, “omp task”, “omp taskwait”, “omp task”, “omp task”, “omp taskwait”
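The structure of the non-recursive parallel version can be sketched in Python with a thread pool standing in for OpenMP. This is only an analogy: `ThreadPoolExecutor` is not OpenMP and gives no real speed-up for this pure-Python kernel. It shows which block updates are independent within a phase and where the barriers fall; all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def fw_base(D, a, b, c, bs):
    """C[i][j] = min(C[i][j], A[i][k] + B[k][j]) on BS x BS blocks of D."""
    (ar, ac), (br, bc), (cr, cc) = a, b, c
    for k in range(bs):
        for i in range(bs):
            for j in range(bs):
                via_k = D[ar + i][ac + k] + D[br + k][bc + j]
                if via_k < D[cr + i][cc + j]:
                    D[cr + i][cc + j] = via_k

def fw_blocked_parallel(D, bs, workers=4):
    n = len(D)
    nb = n // bs
    blk = lambda i, j: (i * bs, j * bs)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        def run_all(calls):
            # Every call writes a different block, so the calls are independent.
            # Draining the map plays the role of the barrier ending an "omp for".
            list(pool.map(lambda abc: fw_base(D, *abc, bs), calls))
        for k in range(nb):
            kk = blk(k, k)
            fw_base(D, kk, kk, kk, bs)                       # sequential
            others = [x for x in range(nb) if x != k]
            run_all([(kk, blk(k, j), blk(k, j)) for j in others])
            run_all([(blk(i, k), kk, blk(i, k)) for i in others])
            run_all([(blk(i, k), blk(k, j), blk(i, j))
                     for i in others for j in others])
    return D
```

The last phase offers the most parallelism, with (N/BS - 1)^2 independent block updates per k step.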
  • 16. Re-visiting Base Kernel (1) • The k loop is outermost; in each step, 16 elements are read from C and 16 elements are written to C • Every element of C is read from and written to memory BS times • This “inefficiency” is required to preserve data dependencies – Data written to C in the k-th step may be read (as A or B) in the k’-th step (k’ > k) → Loop interchange is illegal in such cases
  • 17. Re-visiting Base Kernel (2) • Do we always need to preserve the dependency? No! • Blocks A and B are read, and C is written • Aliased cases: if A=C and/or B=C, we have to preserve the data dependency – (1) A=B=C (2) A!=B=C (3) B!=A=C • Non-aliased cases: if A!=C and B!=C, we have opportunities for loop interchange optimization – (4) A!=C && B!=C
  • 18. Optimized Kernel with Loop Interchange and Register Blocking • Now the k loop is an inner loop • 16 accumulators are used • After the k loop finishes, memory reads/writes to C occur only once per element • This kernel can be used only when A!=C and B!=C
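The two kernels and the aliasing check can be sketched as below. This is a scalar Python illustration under simplifying assumptions: blocks are separate list-of-lists objects, aliasing is detected with `is`, and a single local variable stands in for the 16 vector accumulators mentioned on the slide. Function names are hypothetical.

```python
def fw_base_k_outer(A, B, C):
    """Safe for any aliasing: k is outermost, so every C element is
    read and written once per k step (BS times in total)."""
    bs = len(C)
    for k in range(bs):
        for i in range(bs):
            for j in range(bs):
                C[i][j] = min(C[i][j], A[i][k] + B[k][j])

def fw_base_k_inner(A, B, C):
    """Valid only when C aliases neither A nor B: k is innermost and the
    running minimum stays in a local (a stand-in for a register)."""
    assert C is not A and C is not B
    bs = len(C)
    for i in range(bs):
        for j in range(bs):
            acc = C[i][j]                  # single read of C[i][j]
            for k in range(bs):
                acc = min(acc, A[i][k] + B[k][j])
            C[i][j] = acc                  # single write of C[i][j]

def fw_base(A, B, C):
    """Dispatch: cases (1)-(3) keep k outermost, case (4) interchanges."""
    if C is A or C is B:
        fw_base_k_outer(A, B, C)
    else:
        fw_base_k_inner(A, B, C)
```

In the non-aliased case A and B never change during the call, so each C[i][j] is simply the minimum over all k, and the order of the loops no longer matters.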
  • 19. Floyd-Warshall Implementations (columns: Park et al. 04 / Rucci et al. 17 / Ours) • Cache Blocking: Yes / Yes / Yes • Recursive Cache Blocking: Yes / - / Yes • SIMD Parallelism: - / Yes / Yes • Block Data Layout: Yes / ? / Yes • Multi-core Parallelism: - / Yes / Yes • Register Blocking: - / - / Yes
  • 20. Experimental Environments • Two machines, both of which support AVX-512, are used – 2-socket Xeon Skylake – Xeon Phi KNL
  • 21. Block Size Configuration • Even with the cache-oblivious approach, we still have to determine a single parameter, the block size (BS) • Based on the results of a preliminary evaluation, we use BS=64 [Figure: preliminary evaluation with BS=16, BS=32, BS=64, BS=128]
  • 22. Performance Evaluation: 1-Core Skylake • Non-Recursive is good, but Recursive achieves the fastest speed! • Register blocking contributes a +30% speed-up • Without the layout change, we see a slowdown when N is a multiple of 1024 [Figure: performance plot with a “Faster” direction marker]
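The slowdown at multiples of 1024 is consistent with simple set-index arithmetic, sketched below. The cache parameters (32 KiB, 8-way, 64-byte lines, matching a Skylake L1 data cache) and the helper name are assumptions of this sketch; the slides themselves do not give this calculation.

```python
LINE = 64                  # bytes per cache line (assumed)
WAYS = 8                   # associativity (assumed)
SIZE = 32 * 1024           # cache capacity in bytes (assumed)
SETS = SIZE // (LINE * WAYS)   # 64 sets

def sets_touched(n, bs=64, elem_bytes=4):
    """Number of distinct cache sets hit by one row of a BS-wide block
    in a column-major N x N single-precision matrix."""
    stride = n * elem_bytes            # bytes between horizontally adjacent elements
    return len({(j * stride // LINE) % SETS for j in range(bs)})
```

For N=1024 the column stride is 4096 bytes, so all 64 elements of a block row map to a single set that can hold only 8 lines; for N=1000 they spread over many sets. A block data layout removes the N-dependent stride.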
  • 23. Performance Evaluation: 1-Core KNL • We suffer from heavier impacts of conflict misses! • Register blocking contributes a +100% speed-up • Matrix D is placed on MCDRAM; using DDR4 showed similar performance (refer to the paper) [Figure: performance plot with a “Faster” direction marker]
  • 24. Performance Evaluation: (16+16)-Core Skylake • The recursive version exceeds 1.1 TFlops! • On the other hand, our recursive version gets slower with smaller N – Overhead of “omp task”? [Figure: performance plot with a “Faster” direction marker and a marked cross point]
  • 25. Performance Evaluation: 64-Core KNL • ~0.7 TFlops! • The slowdown with smaller N is heavier • Conflict misses make execution completely impractical [Figure: performance plot with a “Faster” direction marker and a marked cross point]
  • 26. Peak Performance Ratio • 2-socket Skylake: – Measured: 1.117 TFlops – Peak (SP): 5.304 TFlops – Peak performance ratio = 21% – If we do not count FMAD in the peak, ratio = 42% • KNL: – Measured: 0.687 TFlops – Peak (SP): 5.324 TFlops – Peak performance ratio = 13% – If we do not count FMAD in the peak, ratio = 26% • NOTE: In FW, FMAD cannot be used efficiently
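The ratios above can be reproduced with a short calculation. The sketch assumes that "not counting FMAD" halves the theoretical peak, because a fused multiply-add counts as two floating-point operations per lane; this is consistent with the slide's figures but is not stated there. The function name is hypothetical.

```python
def peak_ratio(measured_tflops, peak_tflops, count_fma=True):
    """Fraction of theoretical peak achieved. With count_fma=False the
    peak is halved (assumption: FMA contributes a factor of 2 to peak)."""
    peak = peak_tflops if count_fma else peak_tflops / 2
    return measured_tflops / peak

for name, measured, peak in [("2-socket Skylake", 1.117, 5.304),
                             ("KNL", 0.687, 5.324)]:
    with_fma = 100 * peak_ratio(measured, peak)
    without_fma = 100 * peak_ratio(measured, peak, count_fma=False)
    print(f"{name}: {with_fma:.0f}% of peak, {without_fma:.0f}% without FMA")
```

The FW update is an add followed by a min, so there is no multiply to fuse; this is why the ratio against the non-FMA peak is the more meaningful one here.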
  • 27. Summary • A high-performance FW implementation is given – The cache-oblivious approach is integrated with • SIMD parallelism • Multi-core parallelism • Data layout transformation • Register blocking with careful algorithm analysis • Exceeds the performance of state-of-the-art implementations – 1.1 TFlops on dual Skylake Xeon – 700 GFlops on Xeon Phi KNL • In this experiment, MCDRAM and DDR4 performed similarly • https://github.com/toshioendo/hoalgos
  • 28. Future Work • Evaluation of other heterogeneous memory – DIMM-type 3D XPoint • Towards a higher-performance implementation – Reducing the overhead of task creation by “omp task” – Improving memory affinity • Recursion + task creation works worse in this respect → Need an improved multi-task runtime • Towards a more “architecture-independent” implementation – Our current version is free from cache-size parameters, but – The base kernel depends on the SIMD type and width → ARM SVE (Scalable Vector Extension) looks attractive