This document discusses heterogeneous systems architecture and its potential to enable technologies for virtual reality environments like holodecks. It provides an overview of holodeck enabling technologies such as computational photography, directional audio, natural user interfaces, and augmented reality. It then discusses how heterogeneous systems architecture can accelerate these technologies by allowing more flexible partitioning of workloads between the CPU and GPU for improved performance and energy efficiency. As an example, it analyzes how HSA could improve the performance of face detection algorithms by offloading certain stages to the GPU. Overall, the document argues that HSA is key to realizing the advanced computing capabilities needed for future immersive virtual environments.
HSA Powers the Holodeck: Heterogeneous Computing Enables Immersive Virtual Environments
1. HETEROGENEOUS SYSTEMS ARCHITECTURE:
THE NEXT AREA OF COMPUTING INNOVATION
CASE STUDY: THE HOLODECK
Dr. Lisa Su
Senior Vice President and GM, Global Business Units,
AMD
ISSCC Conference
February 18, 2013
2. CHALLENGES TO MOORE’S LAW SCALING
[Charts: Area Scaling by Technology Generation (normalized area, 0.0 to 1.0) and Cost Per Transistor Scaling (normalized cost/transistor, 0.0 to 1.0), each plotted for 45nm, 40nm, 32nm, 28nm, 20nm, and 20 FinFET]
Lithography challenges begin severely limiting area scaling at 20nm node
– Fewer 1X metals due to cost
– Less aggressive feature scaling due to lithography challenges
Compounded by rapidly increasing lithography costs
– 28nm to 20nm transition is inflection point with dual exposure
– No cost / transistor crossover for first time at 28nm to 20nm transition
2 | ISSCC Keynote | February 18th, 2013
3. A PARADIGM SHIFT…
[Diagram: Microprocessor Advancement (CPU: Single-Core Era, Multi-Core Era, Heterogeneous Systems Era) against Programmability Advancement (GPU: graphics driver-based programs, OpenCL/DX driver-based programs, high-level programmable), moving from Homogeneous Computing to Heterogeneous Computing. Additional labels: Throughput Performance, Accelerator.]
5. ARCHITECTURES – A HISTORICAL PERSPECTIVE
[Timeline from the Legacy Processing Era to the Surround Computing Era: 1981, 1990s, 2000s, 2010s]
Single Core CPUs
Traditionally Optimized Platforms
Multi-Core CPUs/GPUs
APUs and legacy SOC
Heterogeneous Architectures
6. CHANGING THE THINKING, CHANGING THE GAME
HSA is designed to make the GPU hardware
directly accessible to the software, using the
high-level languages programmers already use on
the CPU
C, C++, Java, Python…even JavaScript, HTML5
ISA agnostic – e.g., x86, 64-bit ARM, Radeon, Mali
GPU becomes a peer processor to the CPU in
terms of system integration
Full programming language features
Shared virtual memory: pointer is a pointer
Coherency
Context switching
HSA Foundation – an
industry-wide initiative
8. EFFECTIVE COMPUTE OFFLOAD
[Diagram: APU (Accelerated Processing Unit) running HSA Accelerated Software Applications, with Data Parallel Workloads alongside Serial and Task Parallel Workloads]
Made easy by HSA
Unleash the best compute elements depending on task
9. BRINGING IT ALL TOGETHER
MOTION DSP 720P
[Charts: Power (0 W to 35 W) and Performance (0 fps to 25 fps), each comparing CPU with CPU+GPU; the power bars are broken down into CPU Cores, NB+GPU, and DRAM]
Synergistic use of GPU compute + shared memory = lower power and higher performance
>4.0X Better Energy Efficiency (note 1)
Note 1: AMD internal testing: AMD E2-3200 APU (2 cores @ 2400MHz, GPU: 2 CU @ 444MHz),
Windows 7 OS, MotionDSP vReveal Applications 720P MP4 input
(http://www.vreveal.com/stabilization)
10. TODAY’S DISCUSSION: FROM SURROUND COMPUTING TO
ENABLING THE HOLODECK
1. A fully featured Holodeck is
still many years away
2. Today our discussion will:
Establish a Holodeck framework
Identify Holodeck enabling technologies
Discuss how Heterogeneous Systems
Architecture (HSA) accelerates these
technologies
Undertake an HSA deep dive on one of
these enabling technologies
Look at how new dedicated processors
will enable Holodeck functionality
11. WHAT IS A HOLODECK?
12. THE HOLODECK FRAMEWORK:
AN EVOLUTION OF SURROUND COMPUTING
Natural User Interfaces
Context Computing
360 Degree Virtual
Environments
13. HOLODECK ENABLING TECHNOLOGIES:
PROFOUND IMPLICATIONS FOR COMPUTER ARCHITECTURE
Computational Photography
Delivering seamless and immersive video environments
Directional Audio
Using audio to enhance immersion and realism of our environments
Natural User Interfaces
Enabling realistic, natural human
communication
Context Computing
Delivering an intuitive understanding
of the user’s needs in real time
Augmented Reality
Bringing it all together – combining the
real and the virtual
14. COMPUTATIONAL PHOTOGRAPHY
360 DEGREE VISUAL ENVIRONMENTS, PHOTOSTITCHING, PERIPHERAL VISION AND HSA
Mapping real life scenes through finite images
Photo stitching of tiled environments and
perceptual correction
Detect interest points & match features
Projecting geometry with point features
using algorithms like RANSAC
Image processing to account for
curved screen surfaces
Modulate brightness to account for
peripheral vision
HSA presents a unified view of the
system with shared memory, so CPU and
GPU acceleration can be applied across the entire process
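The RANSAC step named on this slide can be illustrated with a minimal sketch. It assumes the simplest possible motion model, a pure 2D translation between matched feature points in two overlapping tiles; real stitching pipelines usually fit a richer model such as a homography. The function name `ransac_translation` and all of its parameters are hypothetical and do not come from the talk.

```python
import random

def ransac_translation(matches, iterations=200, tolerance=2.0, seed=0):
    """Estimate a 2D translation (dx, dy) from noisy point matches.

    matches: list of ((x1, y1), (x2, y2)) pairs from feature matching.
    Assumption: the two images differ only by a translation.
    Returns (dx, dy, inlier_count).
    """
    if not matches:
        raise ValueError("need at least one match")
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iterations):
        # A translation has two unknowns, so a single match is a minimal sample.
        (x1, y1), (x2, y2) = rng.choice(matches)
        dx, dy = x2 - x1, y2 - y1
        # Count the matches that agree with this candidate within the tolerance.
        inliers = [
            m for m in matches
            if abs(m[1][0] - m[0][0] - dx) <= tolerance
            and abs(m[1][1] - m[0][1] - dy) <= tolerance
        ]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Refit on the consensus set by averaging the inlier offsets.
    n = len(best_inliers)
    dx = sum(m[1][0] - m[0][0] for m in best_inliers) / n
    dy = sum(m[1][1] - m[0][1] for m in best_inliers) / n
    return dx, dy, n
```

Each candidate is scored against every match independently, which is the kind of data-parallel inner loop the slide suggests handing to the GPU while the CPU drives the sampling.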
15. DIRECTIONAL AUDIO
Couples computationally demanding 3D
audio and spatialization effects with
"always on" background processing like
Voice Activity Detection (VAD)
Voice activity detection is best
implemented with special audio
processors and acceleration
techniques
Spatialization effects such as
“Convolution Reverb” are best
done with GPU acceleration
HSA enables seamless
integration of CPU and GPU
acceleration with other
independent accelerators
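To show why convolution reverb suits GPU acceleration, here is a minimal sketch of the underlying operation: the dry signal is convolved with a room's impulse response. The function name `convolve` and the sample values in the usage note are illustrative assumptions. This is the direct form; production reverbs typically use FFT-based convolution for long impulse responses.

```python
def convolve(dry, impulse_response):
    """Direct-form convolution: out[n] = sum over k of dry[n - k] * ir[k]."""
    if not dry or not impulse_response:
        return []
    out = [0.0] * (len(dry) + len(impulse_response) - 1)
    for n in range(len(out)):
        # Every output sample is an independent sum, so on a GPU each n
        # could be computed by its own work item.
        acc = 0.0
        for k, h in enumerate(impulse_response):
            i = n - k
            if 0 <= i < len(dry):
                acc += dry[i] * h
        out[n] = acc
    return out
```

For example, `convolve([1.0, 0.0, 0.0], [1.0, 0.5])` yields the original click followed by a half-amplitude echo one sample later.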
16. NATURAL USER INTERFACES
Speech Recognition:
Background processing – echo
cancellation & noise suppression
Audio feature extraction
Voice pattern recognition through
Markov model or similar algorithm
Gesture Recognition:
Frame preprocessing & filtering
Optical flow or object tracking
Sophisticated computer vision
algorithms to delineate the hand or
body parts from the background
NUI algorithms all benefit from
CPU/GPU and audio processors to
efficiently perform these functions at
the lowest power
17. CONTEXT COMPUTING
BIOMETRICS EXAMPLE
• Facial Recognition:
• Face detection (is there a face) –
GPU acceleration
• Face identification (pattern
matching through algorithms like
Haar face detection) – CPU and
GPU acceleration
• Validation through blink detection
(make sure it is a real face) –
GPU acceleration
HSA enables mix and match of the best
acceleration for each phase of the
process
18. AUGMENTED REALITY
• Image Registration:
• Relies on robust and fast feature
detection – benefits from
CPU/GPU acceleration
• Object Tracking:
• Relies on “optical flow” algorithm
– benefits from CPU/GPU
acceleration
• Image Composition:
• Once information exists from the
above, becomes a classic
graphics rendering use case
The building blocks of HSA enable the
augmented reality world.
19. THE WAY FORWARD
Many technologies required to
enable our vision
– Heterogeneous engines that
accelerate key client and server
workloads
– Datacenters optimized for
latency, scalability, and
efficiency
– Processors optimized for new
and emerging workloads
– Active research into new
algorithms
20. ENABLING TECHNOLOGY DEEP DIVE:
ACCELERATING NATURAL USER INTERFACES (HAAR
FACE DETECTION) WITH HETEROGENEOUS
SYSTEMS ARCHITECTURE
21. LOOKING FOR FACES IN ALL THE RIGHT PLACES
22. LOOKING FOR FACES IN ALL THE RIGHT PLACES
Quick HD Calculations
Search square = 21 x 21
Pixels = 1920 x 1080 = 2,073,600
Search squares = 1900 x 1060 = ~2 Million
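The search-square count follows from sliding a 21 x 21 window one pixel at a time: there are (width - 21 + 1) positions horizontally and (height - 21 + 1) vertically. A small sketch, with the hypothetical helper name `search_squares`, reproduces the slide's 1900 x 1060 figure:

```python
def search_squares(width, height, window=21):
    """Number of positions a window x window square can occupy in a frame."""
    if width < window or height < window:
        return 0
    return (width - window + 1) * (height - window + 1)

hd_squares = search_squares(1920, 1080)  # 1900 * 1060 positions
```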
23. LOOKING FOR DIFFERENT SIZE FACES
BY SCALING THE VIDEO FRAME
24. LOOKING FOR DIFFERENT SIZE FACES
BY SCALING THE VIDEO FRAME
More HD Calculations
70% scaling in H and V
Total Pixels = 4.07 Million
Search squares = 3.8 Million
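These totals can be approximated by summing over an image pyramid. Scaling by 70% in each dimension leaves 49% of the pixels at each level, so the infinite series 2,073,600 / (1 - 0.49) comes to about 4.07 million pixels. The sketch below is an approximation under stated assumptions: the slide does not say when scaling stops or how sizes are rounded, so the code stops once the frame is smaller than the 21 x 21 window and truncates to whole pixels. The function name `pyramid_totals` is hypothetical.

```python
def pyramid_totals(width=1920, height=1080, window=21, num=7, den=10):
    """Sum pixels and search squares over a pyramid scaled by num/den per level.

    Assumptions: integer truncation at each level, and the pyramid ends
    when the frame can no longer hold a single window.
    """
    total_pixels = 0
    total_squares = 0
    while width >= window and height >= window:
        total_pixels += width * height
        total_squares += (width - window + 1) * (height - window + 1)
        # Integer arithmetic avoids floating-point drift in the 70% scaling.
        width = width * num // den
        height = height * num // den
    return total_pixels, total_squares
```

With the defaults, both totals land close to the slide's rounded figures of 4.07 million pixels and 3.8 million search squares.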
25. HAAR CASCADE STAGES
[Flowchart: Stage N evaluates Feature k, Feature l, and Feature m, then asks "Face still possible?" If yes, Stage N+1 evaluates Feature p, Feature q, and Feature r. If no, REJECT FRAME.]
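The stage-by-stage early out can be sketched as a loop that stops at the first failed stage. Everything named here is a hypothetical stand-in: `run_cascade`, the two toy features, and the thresholds are illustrative only, not trained Haar features from the talk.

```python
def mean_brightness(window):
    """Toy feature: average pixel value of the window."""
    return sum(window) / len(window)

def contrast(window):
    """Toy feature: spread between brightest and darkest pixel."""
    return max(window) - min(window)

# Each stage is (list of features, threshold on their summed score).
STAGES = [([mean_brightness], 100), ([contrast], 40)]

def run_cascade(window, stages=STAGES):
    """Return (is_face, stages_evaluated) for one search square."""
    for index, (features, threshold) in enumerate(stages, start=1):
        score = sum(feature(window) for feature in features)
        if score < threshold:
            return False, index  # early out: a face is no longer possible
    return True, len(stages)
```

Most search squares are rejected in the first stage or two, which keeps the average cost per square far below the cost of evaluating every stage.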
26. 22 CASCADE STAGES, EARLY OUT BETWEEN EACH
[Diagram: STAGE 1, STAGE 2, ..., STAGE 21, STAGE 22 in sequence, ending in FACE CONFIRMED, with a NO FACE exit between stages]
Final HD Calculations
Search squares = 3.8 million
Average features per square = 124
Calculations per feature = 100
Calculations per frame = 47 GCalcs
Calculation Rate
30 frames/sec = 1.4 TCalcs/second
60 frames/sec = 2.8 TCalcs/second
…and this only gets front-facing faces
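The calculation rates follow directly from multiplying the slide's three inputs. A short worked check, using only the numbers quoted above (the constant and function names are hypothetical):

```python
SEARCH_SQUARES = 3_800_000      # from the scaled-frame slide
FEATURES_PER_SQUARE = 124       # average, per the slide
CALCS_PER_FEATURE = 100         # per the slide

# About 47 billion calculations for every video frame.
calcs_per_frame = SEARCH_SQUARES * FEATURES_PER_SQUARE * CALCS_PER_FEATURE

def calcs_per_second(fps):
    """Sustained calculation rate needed at a given frame rate."""
    return calcs_per_frame * fps
```

At 30 and 60 frames per second this rounds to the 1.4 and 2.8 TCalcs/second quoted on the slide.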
28. UNBALANCING DUE TO EXITS IN EARLIER CASCADE STAGES
[Figure: work items marked Live or Dead]
When running on the GPU, we run each search rectangle on a separate
work item
Early out algorithms, like HAAR, exhibit divergence between work items
– Some work items exit early
– Their neighbors continue
– SIMD packing suffers as a result
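The cost of divergence can be estimated with a simple model. The sketch below assumes that work items are packed into fixed-size wavefronts, that a wavefront keeps issuing until its longest-running work item exits, and that every cascade stage costs the same. The wavefront size of 64 and the function name `simd_utilization` are assumptions for illustration, not figures from the talk.

```python
def simd_utilization(exit_stages, wavefront_size=64):
    """Fraction of issued SIMD lane-stages that do useful work.

    exit_stages[i] is the number of cascade stages work item i evaluates
    before it exits (or completes the cascade).
    """
    useful = 0
    issued = 0
    for start in range(0, len(exit_stages), wavefront_size):
        group = exit_stages[start:start + wavefront_size]
        useful += sum(group)
        # Idle lanes still occupy the hardware until the slowest lane is done.
        issued += max(group) * wavefront_size
    return useful / issued if issued else 0.0
```

Under this model, a wavefront where 63 work items exit after stage 1 and one runs all 22 stages uses only 85 of 1,408 lane-stages, roughly 6%.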
30. PERFORMANCE CPU-VS-GPU
AMD A10-4600M APU (6CU@497MHz, 4 cores@2700MHz)
[Chart: Images/Sec (0 to 12) versus Number of Cascade Stages on GPU (0 to 8, and 22); legend: CPU, HSA, GPU]
AMD A10 4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G,
6 compute units, 685MHz; 4GB RAM; Windows 7 (64-bit); OpenCL™ 1.1 (873.1)
31. HAAR SOLUTION
RUN DIFFERENT CASCADES ON GPU AND CPU
By seamlessly sharing data between CPU and GPU,
HSA allows the right processor to handle its appropriate
workload
+2.5x INCREASED PERFORMANCE
-2.5x DECREASED ENERGY PER FRAME
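The split can be sketched as a two-phase filter: the first few stages run over every search square, where the work is uniform and data parallel, and only the survivors continue through the remaining, divergent stages. The code below is a plain-Python simulation with no real GPU dispatch; `split_cascade`, the toy features, and the thresholds are hypothetical.

```python
def mean_brightness(window):
    """Toy feature: average pixel value of the window."""
    return sum(window) / len(window)

def contrast(window):
    """Toy feature: spread between brightest and darkest pixel."""
    return max(window) - min(window)

# Each stage is (list of features, threshold on their summed score).
STAGES = [([mean_brightness], 100), ([contrast], 40)]

def passes(window, stages):
    """True if the window clears every stage in the given list."""
    return all(
        sum(feature(window) for feature in features) >= threshold
        for features, threshold in stages
    )

def split_cascade(windows, stages, gpu_stages):
    """Run the first gpu_stages stages on all windows, the rest on survivors."""
    # Phase 1: stands in for the GPU pass over every search square.
    survivors = [w for w in windows if passes(w, stages[:gpu_stages])]
    # Phase 2: stands in for the CPU finishing the few remaining candidates.
    faces = [w for w in survivors if passes(w, stages[gpu_stages:])]
    return survivors, faces
```

With shared memory, the survivor list produced in phase 1 can be read by the CPU in place rather than copied between address spaces, which is the data-sharing advantage the slide attributes to HSA.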
32. APPLICATION ACCELERATION USING HSA
Acceleration vs. CPU
Gesture recognition: 12x
Photo indexing: 10x
Voice recognition: 10x
Visual Search: 9x
Audio search: 5x
Stereo vision: 4x
Video stabilization: 4x
Face detect: 2x
AMD estimates. Source: AMD Whitepaper, Accelerating Consumer/Prosumer Multimedia with HSA, June 2012
33. HSA EVOLUTION
Llano: Physical Integration
– Integrate CPU & GPU in silicon
– Unified Memory Controller
– Common Manufacturing Technology
Trinity: Optimized Platforms
– GPU Compute C++ support
– User mode scheduling
– Bi-Directional Power Mgmt between CPU and GPU
Kaveri: Architectural Integration
– Unified Address Space for CPU and GPU
– GPU uses pageable system memory via CPU pointers
– Fully coherent memory between CPU & GPU
Next Gen: System Integration
– GPU compute context switch
– GPU graphics pre-emption
– Quality of Service
34. HSA PROGRAMMABILITY ADVANTAGE
[Diagram: HSA Foundation software stack. Unified Programming Models (C, C++, Java …), OpenCL, C++ AMP, Java8 …, DX11, OpenGL …, and Domain-Specific Ext / APIs sit above the HSA Intermediate Language (HSAIL), which sits above Compute Acceleration and Graphics Acceleration.]
• Works with today’s programming models and languages
• Architected to enable CPU-like programmability
• Promotes development and adoption of extended standards
• Write Once Run Anywhere – with Performance
35. CONCLUSION
The age of traditional computing is
dead.
A paradigm shift in processing has
brought about the Heterogeneous
Systems Era
HSA will enable us to dramatically
scale processing power while
increasing power efficiency
The Holodeck is still years away, but
HSA and dedicated hardware
blocks will accelerate and enable
technologies as they emerge
36. ACKNOWLEDGEMENTS
Bill Herz
Phil Rogers
Marty Johnson
Chris Hook
Sumant Subramanian