Stockholm Game Developer Forum 2010 Parallel Futures of a Game Engine (v2.0) Johan Andersson Rendering Architect, DICE
Background DICE Stockholm, Sweden ~250 employees Part of Electronic Arts Battlefield & Mirror’s Edge game series Frostbite Proprietary game engine used at DICE & EA Developed by DICE over the last 5 years 2
http://badcompany2.ea.com/ 3
Outline Current parallelism Futures Q&A Slides will be available online later today http://publications.dice.se  and http://repi.se 5
Current parallelism 6
Levels of code in Frostbite Editor (C#) Pipeline (C++) Game code (C++) System CPU-jobs (C++) System SPU-jobs (C++/asm) Generated shaders (HLSL) Compute kernels (HLSL) Offline CPU Runtime GPU 7
General ”game code” (1/2) This is the majority of our 1.5 million lines of C++ Runs on Win32, Win64, Xbox 360 and PS3 We do not use any scripting language Similar to general application code Huge amount of code & logic to maintain + continue to develop Low compute density ”Glue code” Scattered in memory (pointer chasing) Difficult to efficiently parallelize Out-of-order execution is a big help, but consoles are in-order  Key to be able to quickly iterate & change This is the actual game logic & glue that builds the game C++ not ideal, but has the invested infrastructure 9
General ”game code” (2/2) PS3 is one of the main challenges Standard CPU parallelization doesn’t help (much) CELL only has 2 HW threads on the PPU Split the code in 2: game code & system code Game logic, policy and glue code only on CPU ”If it runs well on the PS3 PPU, it runs well everywhere” Lower-level systems on PS3 SPUs Main goals going forward: Simplify & structure code base Reduce coupling with lower-level systems Increase in task parallelism for PC CELL processor 10
Job-based parallelism Essential to utilize the cores on our target platforms Xbox 360: 6 HW threads PlayStation 3: 2 HW threads + 6 powerful SPUs PC: 2-16 HW threads Divide up system work into Jobs (a.k.a. Tasks) 15-200k C++ code each. 25k is common Can depend on each other (if needed) Dependencies create job graph All HW threads consume jobs  ~200-300 / frame 12
What is a Job for us? An asynchronous function call Function ptr + 4 uintptr_t parameters Cross-platform scheduler: EA JobManager Uses work stealing 2 types of Jobs in Frostbite: CPU job (good) General code moved into job instead of threads SPU job (great!) Stateless pure functions, no side effects Data-oriented, explicit memory DMA to local store Designed to run on the PS3 SPUs = also very fast on in-order CPU Can hot-swap quick iterations  13
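The "function ptr + 4 uintptr_t parameters" shape of a job can be sketched as below. This is a minimal illustration, not the EA JobManager API: the names `Job`, `JobFunc`, `runJob`, `sumJob` and `sumViaJob` are hypothetical, and the job runs inline instead of going through a work-stealing scheduler.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical job record: a function pointer plus four pointer-sized
// parameters, as described on the slide. Not the EA JobManager API.
using JobFunc = void (*)(std::uintptr_t, std::uintptr_t,
                         std::uintptr_t, std::uintptr_t);

struct Job {
    JobFunc func;
    std::uintptr_t params[4];
};

// Example job body: sums `count` floats read from `src` and writes the
// total to `dst`. All inputs and outputs are passed explicitly and no
// global state is touched, in the spirit of the "SPU job" style.
inline void sumJob(std::uintptr_t src, std::uintptr_t count,
                   std::uintptr_t dst, std::uintptr_t /*unused*/) {
    const float* in = reinterpret_cast<const float*>(src);
    float* out = reinterpret_cast<float*>(dst);
    float total = 0.0f;
    for (std::uintptr_t i = 0; i < count; ++i) total += in[i];
    *out = total;
}

// Executes one job on the calling thread. A real scheduler would instead
// hand the record to whichever hardware thread is free.
inline void runJob(const Job& job) {
    assert(job.func != nullptr);
    job.func(job.params[0], job.params[1], job.params[2], job.params[3]);
}

// Packs a sum over `data` into a Job record and runs it.
inline float sumViaJob(const std::vector<float>& data) {
    float result = 0.0f;
    Job job{&sumJob,
            {reinterpret_cast<std::uintptr_t>(data.data()),
             static_cast<std::uintptr_t>(data.size()),
             reinterpret_cast<std::uintptr_t>(&result),
             0}};
    runJob(job);
    return result;
}
```

Because the record is just plain data, it can be copied into a queue or, on PS3, sent to an SPU without any knowledge of C++ objects.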
Frostbite CPU job graph Build big job graphs:
Mix CPU- & SPU-jobs
Future: Mix in low-latency GPU-jobs
Job dependencies determine:
Sync points
Load balancing
i.e. the effective parallelism
Intermixed task- & data-parallelism:
aka Nested Data-Parallelism
aka Tasks and Kernels 14
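How dependencies turn a set of jobs into a graph can be sketched with per-job dependency counters. This is an assumption-laden illustration, not Frostbite code: `JobGraph`, `GraphJob` and `exampleOrder` are hypothetical names, and ready jobs run one after another on the calling thread where a real scheduler would hand them to all hardware threads.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct GraphJob {
    std::function<void()> work;
    std::vector<std::size_t> dependents; // jobs that wait on this one
    int pendingDeps;                     // dependencies not yet finished
};

class JobGraph {
public:
    std::size_t add(std::function<void()> work) {
        jobs_.push_back(GraphJob{std::move(work), {}, 0});
        return jobs_.size() - 1;
    }

    // `after` may only start once `before` has finished.
    void addDependency(std::size_t before, std::size_t after) {
        assert(before < jobs_.size() && after < jobs_.size());
        jobs_[before].dependents.push_back(after);
        ++jobs_[after].pendingDeps;
    }

    // Runs each job once all of its dependencies are done and returns how
    // many jobs ran. Consumes the counters, so a graph is run only once.
    std::size_t run() {
        std::queue<std::size_t> ready;
        for (std::size_t i = 0; i < jobs_.size(); ++i)
            if (jobs_[i].pendingDeps == 0) ready.push(i);

        std::size_t executed = 0;
        while (!ready.empty()) {
            const std::size_t id = ready.front();
            ready.pop();
            jobs_[id].work();
            ++executed;
            // Finishing a job may release the jobs that depend on it.
            for (std::size_t d : jobs_[id].dependents)
                if (--jobs_[d].pendingDeps == 0) ready.push(d);
        }
        return executed;
    }

private:
    std::vector<GraphJob> jobs_;
};

// Two independent jobs (0 and 1) feed a third (2) that needs both.
inline std::vector<int> exampleOrder() {
    std::vector<int> order;
    JobGraph g;
    const std::size_t cull    = g.add([&order] { order.push_back(0); });
    const std::size_t terrain = g.add([&order] { order.push_back(1); });
    const std::size_t cmdBuf  = g.add([&order] { order.push_back(2); });
    g.addDependency(cull, cmdBuf);
    g.addDependency(terrain, cmdBuf);
    g.run();
    return order;
}
```

Jobs 0 and 1 have no edge between them, so a parallel scheduler could run them at the same time; job 2 is the sync point.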
Data-parallel jobs 15
Task-parallel algorithms & coordination 16
Timing View: Battlefield Bad Company 2 (PS3) 17
Rendering jobs Rendering systems are heavily divided up into CPU-& SPU-jobs Jobs: Terrain geometry [3] Undergrowth generation [2] Decal projection [4] Particle simulation Frustum culling Occlusion culling Occlusion rasterization Command buffer generation [6] PS3: Triangle culling [6] Most will move to GPU Eventually.. A few have already! PC CPU<->GPU latency wall  Mostly one-way data flow 18
Occlusion culling job example Problem: Buildings & env occlude large amounts of objects Obscured objects still have to: Update logic & animations Generate command buffer Processed on CPU & GPU   = expensive & wasteful  Difficult to implement full culling: Destructible buildings Dynamic occludees Difficult to precompute  From Battlefield: Bad Company PS3 19
Solution: Software occlusion culling Rasterize coarse zbuffer on SPU/CPU 256x114 float Low-poly occluder meshes 100 m view distance Max 10000 vertices/frame Parallel vertex & raster SPU-jobs Cost: a few milliseconds Cull all objects against zbuffer Screen-space bounding-box test Before passed to all other systems Big performance savings! 20
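The screen-space bounding-box test against the coarse z-buffer can be sketched as follows. Only the 256x114 float resolution comes from the slide. The types `CoarseZBuffer` and `ScreenBox`, the "smaller depth is closer" convention, and the `fillRect` stand-in for occluder rasterization are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

constexpr int kZWidth = 256;   // resolution taken from the slide
constexpr int kZHeight = 114;

// Coarse depth buffer. Assumed convention: smaller depth = closer to camera.
struct CoarseZBuffer {
    std::vector<float> depth;

    CoarseZBuffer() : depth(kZWidth * kZHeight, 1.0f) {} // cleared to far

    float at(int x, int y) const { return depth[y * kZWidth + x]; }

    // Stand-in for rasterizing a low-poly occluder: writes depth `z` into
    // an inclusive rectangle, keeping the closest value per pixel.
    void fillRect(int x0, int y0, int x1, int y1, float z) {
        assert(0 <= x0 && x0 <= x1 && x1 < kZWidth);
        assert(0 <= y0 && y0 <= y1 && y1 < kZHeight);
        for (int y = y0; y <= y1; ++y)
            for (int x = x0; x <= x1; ++x)
                depth[y * kZWidth + x] = std::min(depth[y * kZWidth + x], z);
    }
};

// Screen-space bounds of an object plus the depth of its nearest point.
struct ScreenBox {
    int minX, minY, maxX, maxY;
    float nearestZ;
};

// Conservative test: the object counts as occluded only if every covered
// pixel holds an occluder strictly closer than the object's nearest depth.
inline bool isOccluded(const CoarseZBuffer& zb, const ScreenBox& box) {
    const int x0 = std::max(box.minX, 0);
    const int y0 = std::max(box.minY, 0);
    const int x1 = std::min(box.maxX, kZWidth - 1);
    const int y1 = std::min(box.maxY, kZHeight - 1);
    if (x0 > x1 || y0 > y1) return false; // off-screen: left to frustum culling

    for (int y = y0; y <= y1; ++y)
        for (int x = x0; x <= x1; ++x)
            if (box.nearestZ <= zb.at(x, y)) return false; // visible here
    return true;
}

// A "building" at depth 0.3 with an object placed behind or in front of it.
inline bool exampleCull(float objectZ) {
    CoarseZBuffer zb;
    zb.fillRect(100, 40, 160, 80, 0.3f);
    const ScreenBox crate{110, 50, 120, 60, objectZ};
    return isOccluded(zb, crate);
}
```

Testing only the nearest depth of the box keeps the check cheap and errs on the side of drawing an object that might be hidden, never on the side of hiding a visible one.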
GPU occlusion culling Ideally want to use the GPU, but current APIs are limited: Occlusion queries introduce overhead & latency Conditional rendering only helps GPU Compute Shader impl. possible, but same latency wall Future 1: Low-latency GPU execution context Rasterization and testing done on GPU where it belongs Lockstep with CPU, need to read back within a few ms Should be possible on Larrabee & Fusion, want on all PC Future 2: Move entire cull & rendering to ”GPU” World, cull, systems, dispatch. End goal Very thin GPU graphics driver – GPU feeds itself 21
Shader types Generated shaders [1] Graph-based surface shaders Treated as content, not code Artist created Generates HLSL code Used by all meshes and 3d surfaces Graphics / Compute kernels Hand-coded & optimized HLSL Statically linked in with C++ Pixel- & compute-shaders Lighting, post-processing & special effects Graph-based surface shader in FrostEd 2 23
Futures 24
3 major challenges/goals going forward: How do we make it easier to develop, maintain & parallelize general game code? What do we need to continue to innovate & scale up real-time computational graphics? How can we move & scale up advanced simulation and non-graphics tasks to data-parallel manycore processors? Challenges Most likely the same solution(s)! 25
Challenge 1 “How do we make it easier to develop, maintain & parallelize general game code?” Shared State Concurrency is a killer Not a big believer in Software Transactional Memory either  Because of performance and too ”optimistic” flow A more strict & adapted C++ model Support for true immutable & r/w-only memory access Per-thread/task memory access opt-in Reduce the possibility for side effects in parallel code As much compile-time validation as possible Micro-threads / coroutines as first class citizens More? Other languages? 26
Challenge 1 - Task parallelism Multiple task libraries EA JobManager Dynamic job graphs, on-demand dependencies & continuations Targeted to cross-platform SPU-jobs, key requirement for this generation Not geared towards super-simple to use CPU parallelism MS ConcRT, Apple GCD, Intel TBB All have some good parts! None works on all of our current platforms OpenMP  Just say no. Parallelism can’t be tacked on, needs to be an explicit core part Need C++ enhancements to simplify usage C++0x lambdas / GCD blocks  Glacial C++ development & deployment  Want on all platforms, so lost on this console generation 27
Challenge 2 - Definition ”Real-time interactive graphics & simulation at Pixar level of quality” Needed visual features: Complete anti-aliasing + natural motion blur & DOF Global indirect lighting & reflections Sub-pixel geometry OIT Huge improvements in character animation These require massively more compute, BW and improved model! (animation can’t be solved with just more/better compute, so pretend it doesn’t exist for now) 28
Challenge 2 - Problems Problems & limitations with current GPU model: Fixed rasterization pipeline Compute pipeline not fast enough to replace it GPU is handicapped by being spoon-fed by CPU Irregular workloads are difficult & inefficient Can’t express nested data-parallelism well Current HLSL is a very limited language 29
Challenge 2 - Solutions Absolutely a job for a high-throughput oriented data-parallel processor With a highly flexible programming model The CPU, as we know it, and APIs/drivers are only in the way Pure software solution not practical as next step after DX11 PC  Multi-vendor & multi-architecture marketplace But for future consoles, the more flexible the better (perf. permitting) Long term goal:  Multi-vendor & multi-platform standard/virtual DP ISA Want a rich programmable compute model as next step Nested data-parallelism & GPU self-feeding Low-latency CPU<->GPU interaction Efficiently target varying HW architectures 30
”Pipelined Compute Shaders” Queues as streaming I/O between compute kernels Simple & expressive model supporting irregular workloads Keeps data on chip, supports variable sized caches & cores Can target multiple types of HW & architectures Hybrid graphics/compute user-defined pipelines Language/API defining fixed stages inputs & outputs Pipelines can feed other pipelines (similar to DrawIndirect) Reyes-style Rendering with Ray Tracing Shade Split Raster Sub-D Prims Tess Frame Buffer Trace 31
”Pipelined Compute Shaders” Wanted for next major DirectX and OpenCL/OpenGL As a standard, as soon as possible Run on all: discrete GPU, integrated GPU and CPU Model is also a good fit for many of our CPU/SPU jobs Parts of job graph can be seen as queues between stages Easier to write kernels/jobs with streaming I/O  Instead of explicit fixed-buffers and ”memory passes” Or dynamic memory allocation 32
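The "queues as streaming I/O between kernels" idea can be sketched on the CPU. This is only an illustration of the data flow, not a proposed API: `Patch`, `Fragment`, `splitKernel`, `shadeKernel` and `runPipeline` are hypothetical names, and the stages run to completion one after another where real hardware would run them concurrently with bounded on-chip queues.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

struct Patch    { int size; };    // hypothetical work item
struct Fragment { int shaded; };  // hypothetical final output

// Split kernel: a patch larger than `maxSize` is halved and both halves are
// fed back into the kernel's own input queue; small patches move on. The
// number of outputs per input is not known up front (irregular workload).
inline void splitKernel(std::deque<Patch>& in, std::deque<Patch>& out,
                        int maxSize) {
    assert(maxSize >= 1);
    while (!in.empty()) {
        const Patch p = in.front();
        in.pop_front();
        if (p.size > maxSize) {
            in.push_back(Patch{p.size / 2});
            in.push_back(Patch{p.size - p.size / 2});
        } else {
            out.push_back(p);
        }
    }
}

// Shade kernel: consumes patches from its queue and emits one result each.
inline void shadeKernel(std::deque<Patch>& in, std::vector<Fragment>& out) {
    while (!in.empty()) {
        const Patch p = in.front();
        in.pop_front();
        out.push_back(Fragment{p.size * 2});
    }
}

// Wires the two kernels together with a queue and returns the number of
// fragments produced for the given input patch sizes.
inline std::size_t runPipeline(const std::vector<int>& sizes, int maxSize) {
    std::deque<Patch> toSplit;
    for (int s : sizes) toSplit.push_back(Patch{s});

    std::deque<Patch> toShade;        // queue between the two kernels
    std::vector<Fragment> frameBuffer;
    splitKernel(toSplit, toShade, maxSize);
    shadeKernel(toShade, frameBuffer);
    return frameBuffer.size();
}
```

Neither kernel allocates a fixed-size intermediate buffer or knows how much the other will produce; the queue absorbs the variation, which is the property the slides ask for.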
Language? Future DP language is a big question But the concepts & infrastructure are what is important! Could be an extended HLSL or ”data-parallel C++” Data-oriented imperative language (i.e. not standard C++) Think HLSL would probably be easier & the most explicit Amount of code is small and written from scratch Shader-like implicit SoA vectorization (SIMT) Instead of SSE/LRBni-like explicit vectorization Need to be first class language citizen, not tacked on C++ Easier to target multiple evolving architectures implicitly 33
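The difference between implicit and explicit vectorization can be hinted at in plain C++. In the sketch below the kernel is written for a single element, shader-style, over structure-of-arrays (SoA) data, and the loop that applies it across elements stands in for what a data-parallel compiler or runtime would map onto SIMD lanes. All names (`ParticlesSoA`, `integrateOne`, `integrateAll`, `exampleStep`) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Structure-of-arrays storage: one contiguous array per attribute.
struct ParticlesSoA {
    std::vector<float> posY;
    std::vector<float> velY;
};

// Kernel written for ONE element, as a shader would be. It contains no
// vector width, no intrinsics and no lane masks.
inline void integrateOne(float& posY, float& velY, float accelY, float dt) {
    velY += accelY * dt;
    posY += velY * dt;
}

// Stand-in for the runtime: applies the scalar kernel to every element.
// With SoA data each iteration is independent and touches consecutive
// memory, so the loop can be spread across lanes without kernel changes.
inline void integrateAll(ParticlesSoA& p, float accelY, float dt) {
    assert(p.posY.size() == p.velY.size());
    for (std::size_t i = 0; i < p.posY.size(); ++i)
        integrateOne(p.posY[i], p.velY[i], accelY, dt);
}

// Steps two particles once and returns the new height of the second one.
inline float exampleStep() {
    ParticlesSoA p;
    p.posY = {0.0f, 10.0f};
    p.velY = {1.0f, 0.0f};
    integrateAll(p, -2.0f, 0.5f);
    return p.posY[1];
}
```

An explicitly vectorized version would bake a lane count (4 for SSE, 16 for LRBni) into the kernel itself, which is what makes it harder to retarget.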
Future hardware (1/2) Single main memory & address space Critical to share resources between graphics, simulation and game in immersive dynamic worlds Configurable kernel local stores / cache Similar to Nvidia Fermi & Intel Larrabee Local stores = reliability & good for regular loads Together with user-driven async DMA engines Caches = essential for irregular data structures Cache coherency? Not always important for kernels But essential for general code, can partition? 34
Future hardware (2/2) 2015 = 40 TFLOPS, we would spend it on: 80% graphics 15% simulation 4% misc 1% game (wouldn’t use all 400 GFLOPS for game logic & glue!) OOE CPUs more efficient for the majority of our game code But for the vast majority of our FLOPS these are fully irrelevant Can evolve to a small dot on a sea of DP cores Or run scalar on general DP cores wasting some compute In other words: no need for separate CPU and GPU! -> single heterogeneous processor 35
Conclusions Future is an interleaved mix of task- & data-parallelism On both the HW and SW level But programmable DP is where the massive compute is done Data-parallelism requires data-oriented design Developer productivity can’t be limited by model(s) It should enhance productivity & perf on all levels Tools & language constructs play a critical role We should welcome our parallel future! 36
Thanks to DICE, EA and the Frostbite team The graphics/gamedev community on Twitter Steve McCalla, Mike Burrows Chas Boyd Nicolas Thibieroz, Mark Leather Dan Wexler, Yury Uralsky Kayvon Fatahalian 37
References Previous Frostbite-related talks:
[1] Johan Andersson. ”Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing Techniques”. GDC 2007. http://repi.blogspot.com/2009/01/conference-slides.html
[2] Natalya Tatarchuk & Johan Andersson. ”Rendering Architecture and Real-time Procedural Shading & Texturing Techniques”. GDC 2007. http://developer.amd.com/Assets/Andersson-Tatarchuk-FrostbiteRenderingArchitecture(GDC07_AMD_Session).pdf
[3] Johan Andersson. ”Terrain Rendering in Frostbite using Procedural Shader Splatting”. Siggraph 2007. http://developer.amd.com/media/gpu_assets/Andersson-TerrainRendering(Siggraph07).pdf
[4] Daniel Johansson & Johan Andersson. ”Shadows & Decals – D3D10 techniques from Frostbite”. GDC 2009. http://repi.blogspot.com/2009/03/gdc09-shadows-decals-d3d10-techniques.html
[5] Bill Bilodeau & Johan Andersson. ”Your Game Needs Direct3D 11, So Get Started Now!”. GDC 2009. http://repi.blogspot.com/2009/04/gdc09-your-game-needs-direct3d-11-so.html
[6] Johan Andersson. ”Parallel Graphics in Frostbite”. Siggraph 2009, Beyond Programmable Shading course. http://repi.blogspot.com/2009/08/siggraph09-parallel-graphics-in.html 38
Questions? Email: johan.andersson@dice.se   Blog: http://repi.se    Twitter: @repi 39 For more DICE talks: http://publications.dice.se
Bonus slides 40
Game development 2 year development cycle New IP often takes much longer, 3-5 years Engine is continuously in development & used AAA teams of 70-90 people 40% artists 30% designers 20% programmers 10% audio Budgets $20-40 million Cross-platform development is market reality Xbox 360 and PlayStation 3 PC (DX9, DX10 & DX11 for BC2) Current consoles will stay with us for many more years 41
Game engine requirements (1/2) Stable real-time performance Frame-driven updates, 30 fps  Few threads, instead per-frame jobs/tasks for everything  Predictable memory usage Fixed budgets for systems & content, fail if over Avoid runtime allocations Love unified memory!  Cross-platform The consoles determine our base tech level & focus PS3 is design target, most difficult and good potential Scale up for PC, dual core is min spec (slow!) 42
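"Fixed budgets, fail if over" and "avoid runtime allocations" can be illustrated with a small linear allocator that reserves its whole budget up front. This is a generic sketch, not Frostbite's memory system: `FixedBudgetArena` and its methods are hypothetical, and it assumes the backing buffer is aligned for `std::max_align_t`, which holds for the default `std::vector` allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Linear allocator over a budget that is reserved once at startup.
class FixedBudgetArena {
public:
    explicit FixedBudgetArena(std::size_t budgetBytes)
        : storage_(budgetBytes), used_(0) {}

    // Returns nullptr when the request does not fit in the remaining
    // budget, so exceeding the budget is a visible failure instead of a
    // silent heap allocation. `alignment` must be a power of two no larger
    // than alignof(std::max_align_t) (assumption about the buffer base).
    void* allocate(std::size_t bytes,
                   std::size_t alignment = alignof(std::max_align_t)) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        const std::size_t aligned =
            (used_ + alignment - 1) & ~(alignment - 1);
        if (aligned > storage_.size() || bytes > storage_.size() - aligned)
            return nullptr; // over budget
        used_ = aligned + bytes;
        return storage_.data() + aligned;
    }

    // Called once per frame: everything allocated during the frame is
    // released at once, with no per-object bookkeeping.
    void resetFrame() { used_ = 0; }

    std::size_t used() const { return used_; }

private:
    std::vector<unsigned char> storage_;
    std::size_t used_;
};
```

A system that owns such an arena can allocate freely inside a frame while its peak memory stays bounded by the budget chosen at startup.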

mloc.js 2014 - JavaScript and the browser as a platform for game development
 
Computação acelerada – a era das ap us roberto brandão, ciência
Computação acelerada – a era das ap us   roberto brandão,  ciênciaComputação acelerada – a era das ap us   roberto brandão,  ciência
Computação acelerada – a era das ap us roberto brandão, ciência
 
Haskell Accelerate
Haskell  AccelerateHaskell  Accelerate
Haskell Accelerate
 
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese..."Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
 
The Best Programming Practice for Cell/B.E.
The Best Programming Practice for Cell/B.E.The Best Programming Practice for Cell/B.E.
The Best Programming Practice for Cell/B.E.
 
Kernel Recipes 2014 - The Linux graphics stack and Nouveau driver
Kernel Recipes 2014 - The Linux graphics stack and Nouveau driverKernel Recipes 2014 - The Linux graphics stack and Nouveau driver
Kernel Recipes 2014 - The Linux graphics stack and Nouveau driver
 
Revisiting Co-Processing for Hash Joins on the Coupled Cpu-GPU Architecture
Revisiting Co-Processing for Hash Joins on the CoupledCpu-GPU ArchitectureRevisiting Co-Processing for Hash Joins on the CoupledCpu-GPU Architecture
Revisiting Co-Processing for Hash Joins on the Coupled Cpu-GPU Architecture
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
 
BitSquid Tech: Benefits of a data-driven renderer
BitSquid Tech: Benefits of a data-driven rendererBitSquid Tech: Benefits of a data-driven renderer
BitSquid Tech: Benefits of a data-driven renderer
 

Stockholm Game Developer Forum 2010 Parallel Futures of a Game Engine

  • 1. Stockholm Game Developer Forum 2010 Parallel Futures of a Game Engine (v2.0) Johan Andersson Rendering Architect, DICE
  • 2. Background. DICE: Stockholm, Sweden; ~250 employees; part of Electronic Arts; Battlefield & Mirror’s Edge game series. Frostbite: proprietary game engine used at DICE & EA; developed by DICE over the last 5 years.
  • 5. Outline: Current parallelism; Futures; Q&A. Slides will be available online later today at http://publications.dice.se and http://repi.se
  • 7. Levels of code in Frostbite: Editor (C#), Pipeline (C++), Game code (C++), System CPU-jobs (C++), System SPU-jobs (C++/asm), Generated shaders (HLSL), Compute kernels (HLSL). Diagram labels: Offline, CPU, Runtime, GPU.
  • 9. General “game code” (1/2). This is the majority of our 1.5 million lines of C++. Runs on Win32, Win64, Xbox 360 and PS3. We do not use any scripting language. Similar to general application code: huge amount of code & logic to maintain + continue to develop; low compute density; “glue code”; scattered in memory (pointer chasing). Difficult to efficiently parallelize. Out-of-order execution is a big help, but consoles are in-order. Key to be able to quickly iterate & change: this is the actual game logic & glue that builds the game. C++ not ideal, but has the invested infrastructure.
  • 10. General “game code” (2/2). PS3 is one of the main challenges: standard CPU parallelization doesn’t help (much); CELL only has 2 HW threads on the PPU. Split the code in 2: game code & system code. Game logic, policy and glue code only on CPU: “If it runs well on the PS3 PPU, it runs well everywhere”. Lower-level systems on PS3 SPUs. Main goals going forward: simplify & structure code base; reduce coupling with lower-level systems; increase in task parallelism for PC. Figure: CELL processor.
  • 12. Job-based parallelism. Essential to utilize the cores on our target platforms: Xbox 360: 6 HW threads; PlayStation 3: 2 HW threads + 6 powerful SPUs; PC: 2-16 HW threads. Divide up system work into Jobs (a.k.a. Tasks): 15-200k C++ code each, 25k is common; can depend on each other (if needed); dependencies create job graph. All HW threads consume jobs: ~200-300 / frame.
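    The idea that job dependencies form a graph can be sketched in a few lines of C++. This is only an illustration under stated assumptions: `GraphJob`, `JobGraph`, `add`, `depend` and `run` are hypothetical names, not the EA JobManager API, and the graph is drained on a single thread so the ordering logic stays visible. A job becomes ready when its count of unfinished prerequisites reaches zero.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical job-graph node (not the EA JobManager API).
struct GraphJob {
    std::function<void()> work;
    std::vector<std::size_t> dependents;  // jobs that wait on this one
    int pendingDeps = 0;                  // prerequisites not yet finished
};

class JobGraph {
public:
    // Adds a job and returns its index in the graph.
    std::size_t add(std::function<void()> work) {
        GraphJob job;
        job.work = std::move(work);
        jobs_.push_back(std::move(job));
        return jobs_.size() - 1;
    }

    // Declares that job `after` may only start once job `before` is done.
    void depend(std::size_t after, std::size_t before) {
        assert(after < jobs_.size() && before < jobs_.size());
        jobs_[before].dependents.push_back(after);
        ++jobs_[after].pendingDeps;
    }

    // Runs every job once, in dependency order, on the calling thread and
    // returns how many jobs ran. A real scheduler would hand the ready jobs
    // to all HW threads instead of looping here. Call once per graph, since
    // the dependency counters are consumed.
    std::size_t run() {
        std::queue<std::size_t> ready;
        for (std::size_t i = 0; i < jobs_.size(); ++i)
            if (jobs_[i].pendingDeps == 0) ready.push(i);

        std::size_t executed = 0;
        while (!ready.empty()) {
            const std::size_t index = ready.front();
            ready.pop();
            jobs_[index].work();
            ++executed;
            // Finishing a job may unblock the jobs that depend on it.
            for (std::size_t d : jobs_[index].dependents)
                if (--jobs_[d].pendingDeps == 0) ready.push(d);
        }
        return executed;
    }

private:
    std::vector<GraphJob> jobs_;
};
```

    Jobs with no path between them in the graph are the ones a multi-threaded scheduler could run at the same time; the dependency edges are what create the sync points.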
  • 13. What is a Job for us? An asynchronous function call: function ptr + 4 uintptr_t parameters. Cross-platform scheduler: EA JobManager, uses work stealing. 2 types of Jobs in Frostbite. CPU job (good): general code moved into job instead of threads. SPU job (great!): stateless pure functions, no side effects; data-oriented, explicit memory DMA to local store; designed to run on the PS3 SPUs = also very fast on in-order CPU; can hot-swap for quick iterations.
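    A minimal sketch of the "function ptr + 4 uintptr_t parameters" shape, assuming nothing about the real EA JobManager interface: `Job`, `makeJob`, `runJobs` and `sumJob` are hypothetical names, and `runJobs` simply drains the list on the calling thread rather than implementing work stealing. The example job is written in the stateless style the slide describes, touching only the memory named by its parameters.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// A job as described on the slide: a function pointer plus four
// pointer-sized parameters.
typedef void (*JobFunc)(std::uintptr_t, std::uintptr_t, std::uintptr_t, std::uintptr_t);

struct Job {
    JobFunc func;
    std::uintptr_t params[4];
};

// Hypothetical helper that packs a job; unused parameters default to 0.
inline Job makeJob(JobFunc func, std::uintptr_t p0 = 0, std::uintptr_t p1 = 0,
                   std::uintptr_t p2 = 0, std::uintptr_t p3 = 0) {
    assert(func != nullptr);
    Job job = {func, {p0, p1, p2, p3}};
    return job;
}

// Stand-in for a scheduler: runs the jobs in order on the calling thread.
// This is NOT work stealing; it only shows how a job is invoked.
inline void runJobs(const std::vector<Job>& jobs) {
    for (std::size_t i = 0; i < jobs.size(); ++i) {
        const Job& job = jobs[i];
        job.func(job.params[0], job.params[1], job.params[2], job.params[3]);
    }
}

// Example job: sums `count` floats at `input` and writes the total to
// `output`. It reads only its input and writes only its declared output.
void sumJob(std::uintptr_t input, std::uintptr_t count, std::uintptr_t output, std::uintptr_t) {
    const float* values = reinterpret_cast<const float*>(input);
    float total = 0.0f;
    for (std::uintptr_t i = 0; i < count; ++i) total += values[i];
    *reinterpret_cast<float*>(output) = total;
}
```

    Passing addresses as plain integers is what lets the same job description be handed to a processor with its own local store, where the job first copies the data in.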
  • 15. Mix CPU- & SPU-jobs
  • 21. aka Tasks and Kernels
  • 23. Task-parallel algorithms & coordination
  • 24. Timing View: Battlefield Bad Company 2 (PS3)
  • 25. Rendering jobs. Rendering systems are heavily divided up into CPU- & SPU-jobs. Jobs: terrain geometry [3]; undergrowth generation [2]; decal projection [4]; particle simulation; frustum culling; occlusion culling; occlusion rasterization; command buffer generation [6]; PS3: triangle culling [6]. Most will move to GPU eventually; a few have already! PC CPU<->GPU latency wall. Mostly one-way data flow.
  • 26. Occlusion culling job example. Problem: buildings & env occlude large amounts of objects. Obscured objects still have to: update logic & animations; generate command buffer; be processed on CPU & GPU = expensive & wasteful. Difficult to implement full culling: destructible buildings; dynamic occludees; difficult to precompute. Figure: from Battlefield: Bad Company PS3.
  • 27. Solution: Software occlusion culling. Rasterize coarse zbuffer on SPU/CPU: 256x114 float; low-poly occluder meshes; 100 m view distance; max 10000 vertices/frame; parallel vertex & raster SPU-jobs; cost: a few milliseconds. Cull all objects against zbuffer: screen-space bounding-box test; before passed to all other systems. Big performance savings!
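    The screen-space bounding-box test against a coarse z-buffer can be sketched as follows. Only the 256x114 float resolution comes from the slide; everything else is an assumption for illustration: the names `CoarseZBuffer`, `ScreenBox`, `fillRect` and `isVisible` are hypothetical, depth is assumed to grow with distance and to be cleared to 1.0, and `fillRect` stands in for real occluder triangle rasterization.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Coarse depth buffer. Assumed convention: larger depth = farther away,
// cleared to the far value 1.0.
struct CoarseZBuffer {
    enum { Width = 256, Height = 114 };
    std::vector<float> depth;

    CoarseZBuffer() : depth(static_cast<std::size_t>(Width) * Height, 1.0f) {}

    // Stand-in for occluder rasterization: writes depth z into an inclusive
    // pixel rectangle, keeping the nearest value per texel.
    void fillRect(int x0, int y0, int x1, int y1, float z) {
        assert(x0 >= 0 && y0 >= 0 && x1 < Width && y1 < Height);
        for (int y = y0; y <= y1; ++y)
            for (int x = x0; x <= x1; ++x) {
                float& texel = depth[static_cast<std::size_t>(y) * Width + x];
                texel = std::min(texel, z);
            }
    }
};

// Screen-space bounds of an object (inclusive pixels) and its nearest depth.
struct ScreenBox {
    int x0, y0, x1, y1;
    float nearestZ;
};

// Conservative test: the object counts as visible if its nearest depth is
// at or in front of the stored depth in any texel it covers.
bool isVisible(const CoarseZBuffer& zb, ScreenBox box) {
    const int w = CoarseZBuffer::Width;
    const int h = CoarseZBuffer::Height;
    const int x0 = std::max(box.x0, 0);
    const int y0 = std::max(box.y0, 0);
    const int x1 = std::min(box.x1, w - 1);
    const int y1 = std::min(box.y1, h - 1);
    for (int y = y0; y <= y1; ++y)
        for (int x = x0; x <= x1; ++x)
            if (box.nearestZ <= zb.depth[static_cast<std::size_t>(y) * w + x])
                return true;
    return false;  // fully occluded, or entirely off-screen
}
```

    Testing the whole box with the object's nearest depth errs on the side of drawing too much, which is the safe direction for a culling test.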
  • 28. GPU occlusion culling. Ideally want to use the GPU, but current APIs are limited: occlusion queries introduce overhead & latency; conditional rendering only helps GPU; Compute Shader impl. possible, but same latency wall. Future 1: Low-latency GPU execution context: rasterization and testing done on GPU where it belongs; lockstep with CPU, need to read back within a few ms; should be possible on Larrabee & Fusion, want on all PC. Future 2: Move entire cull & rendering to “GPU”: world, cull, systems, dispatch; end goal; very thin GPU graphics driver, GPU feeds itself.
  • 30. Shader types. Generated shaders [1]: graph-based surface shaders; treated as content, not code; artist created; generates HLSL code; used by all meshes and 3d surfaces. Graphics / Compute kernels: hand-coded & optimized HLSL; statically linked in with C++; pixel- & compute-shaders; lighting, post-processing & special effects. Figure: graph-based surface shader in FrostEd 2.
  • 32. Challenges. 3 major challenges/goals going forward: (1) How do we make it easier to develop, maintain & parallelize general game code? (2) What do we need to continue to innovate & scale up real-time computational graphics? (3) How can we move & scale up advanced simulation and non-graphics tasks to data-parallel manycore processors? Most likely the same solution(s)!
  • 33. Challenge 1: “How do we make it easier to develop, maintain & parallelize general game code?” Shared State Concurrency is a killer. Not a big believer in Software Transactional Memory either, because of performance and too “optimistic” flow. A more strict & adapted C++ model: support for true immutable & r/w-only memory access; per-thread/task memory access opt-in; reduce the possibility for side effects in parallel code; as much compile-time validation as possible. Micro-threads / coroutines as first class citizens. More? Other languages?
  • 34. Challenge 1 - Task parallelism. Multiple task libraries. EA JobManager: dynamic job graphs, on-demand dependencies & continuations; targeted to cross-platform SPU-jobs, key requirement for this generation; not geared towards super-simple to use CPU parallelism. MS ConcRT, Apple GCD, Intel TBB: all have some good parts! None works on all of our current platforms. OpenMP: just say no. Parallelism can’t be tacked on, needs to be an explicit core part. Need C++ enhancements to simplify usage: C++0x lambdas / GCD blocks; glacial C++ development & deployment; want on all platforms, so lost on this console generation.
  • 35. Challenge 2 - Definition. “Real-time interactive graphics & simulation at Pixar level of quality”. Needed visual features: complete anti-aliasing + natural motion blur & DOF; global indirect lighting & reflections; sub-pixel geometry; OIT; huge improvements in character animation. These require massively more compute, BW and improved model! (Animation can’t be solved with just more/better compute, so pretend it doesn’t exist for now.)
  • 36. Challenge 2 - Problems. Problems & limitations with current GPU model: fixed rasterization pipeline; compute pipeline not fast enough to replace it; GPU is handicapped by being spoon-fed by CPU; irregular workloads are difficult & inefficient; can’t express nested data-parallelism well; current HLSL is a very limited language.
  • 37. Challenge 2 - Solutions. Absolutely a job for a high-throughput oriented data-parallel processor with a highly flexible programming model. The CPU, as we know it, and APIs/drivers are only in the way. Pure software solution not practical as next step after DX11 PC: multi-vendor & multi-architecture marketplace; but for future consoles, the more flexible the better (perf. permitting). Long term goal: multi-vendor & multi-platform standard/virtual DP ISA. Want a rich programmable compute model as next step: nested data-parallelism & GPU self-feeding; low-latency CPU<->GPU interaction; efficiently target varying HW architectures.
  • 38. “Pipelined Compute Shaders”. Queues as streaming I/O between compute kernels: simple & expressive model supporting irregular workloads; keeps data on chip, supports variable sized caches & cores; can target multiple types of HW & architectures. Hybrid graphics/compute user-defined pipelines: language/API defining fixed stages inputs & outputs; pipelines can feed other pipelines (similar to DrawIndirect). Figure: Reyes-style Rendering with Ray Tracing (diagram labels: Shade, Split, Raster, Sub-D, Prims, Tess, Frame Buffer, Trace).
  • 39. “Pipelined Compute Shaders”. Wanted for next major DirectX and OpenCL/OpenGL: as a standard, as soon as possible; run on all: discrete GPU, integrated GPU and CPU. Model is also a good fit for many of our CPU/SPU jobs: parts of job graph can be seen as queues between stages; easier to write kernels/jobs with streaming I/O instead of explicit fixed-buffers and “memory passes” or dynamic memory allocation.
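    The "queues between kernels" idea can be illustrated on the CPU with two stages joined by a queue. This is a sketch under assumptions, not a proposed API: `Patch`, `splitStage` and `shadeStage` are hypothetical names loosely borrowed from the diagram labels, and the stages run one after the other on a single thread. The point it shows is the irregular workload: one input item may expand into a variable number of outputs, which a queue absorbs without a fixed-size intermediate buffer.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Hypothetical work item flowing through the pipeline.
struct Patch {
    float size;
};

// Stage 1, "split": a patch larger than maxSize is halved and fed back into
// the stage's own input queue; a small enough patch goes to the next stage.
void splitStage(std::deque<Patch>& in, std::deque<Patch>& out, float maxSize) {
    assert(maxSize > 0.0f);  // otherwise splitting would never terminate
    while (!in.empty()) {
        const Patch p = in.front();
        in.pop_front();
        if (p.size > maxSize) {
            const Patch half = {p.size * 0.5f};
            in.push_back(half);
            in.push_back(half);
        } else {
            out.push_back(p);
        }
    }
}

// Stage 2, "shade": consumes one patch and emits one value to the output.
void shadeStage(std::deque<Patch>& in, std::vector<float>& frameBuffer) {
    while (!in.empty()) {
        frameBuffer.push_back(in.front().size);
        in.pop_front();
    }
}
```

    In a hardware-scheduled version the stages would run concurrently and the queues would stay on chip; here they only show the data flow.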
  • 40. Language? Future DP language is a big question, but the concepts & infrastructure are what is important! Could be an extended HLSL or “data-parallel C++”: data-oriented imperative language (i.e. not standard C++); think HLSL would probably be easier & the most explicit; amount of code is small and written from scratch. Shader-like implicit SoA vectorization (SIMT) instead of SSE/LRBni-like explicit vectorization: need to be first class language citizen, not tacked on C++; easier to target multiple evolving architectures implicitly.
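    The contrast between a shader-like kernel and explicit vectorization can be hinted at in plain C++. The sketch below is an assumption-laden illustration, not a data-parallel language: `ParticlesSoA`, `integrateOne` and `integrateAll` are hypothetical names. The kernel body is written for a single element, the data is stored as a structure of arrays (SoA), and a separate dispatch function applies the kernel across all elements; in the model the slide describes, that dispatch and its mapping to SIMD lanes would be done implicitly by the language and hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Structure-of-arrays storage: one contiguous array per field.
struct ParticlesSoA {
    std::vector<float> posY;
    std::vector<float> velY;
};

// Kernel body written for ONE element, in the style of a shader.
inline void integrateOne(float& posY, float& velY, float gravity, float dt) {
    velY += gravity * dt;
    posY += velY * dt;
}

// Hypothetical "dispatch": applies the scalar kernel to every element.
// A compiler or data-parallel runtime is free to map iterations to lanes.
void integrateAll(ParticlesSoA& p, float gravity, float dt) {
    assert(p.posY.size() == p.velY.size());
    for (std::size_t i = 0; i < p.posY.size(); ++i)
        integrateOne(p.posY[i], p.velY[i], gravity, dt);
}
```

    Because the kernel never mentions a vector width, the same source could target hardware with different SIMD sizes, which is the portability argument made on the slide.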
  • 41. Future hardware (1/2). Single main memory & address space: critical to share resources between graphics, simulation and game in immersive dynamic worlds. Configurable kernel local stores / cache: similar to Nvidia Fermi & Intel Larrabee; local stores = reliability & good for regular loads, together with user-driven async DMA engines; caches = essential for irregular data structures. Cache coherency? Not always important for kernels, but essential for general code, can partition?
  • 42. Future hardware (2/2). 2015 = 40 TFLOPS, we would spend it on: 80% graphics; 15% simulation; 4% misc; 1% game (wouldn’t use all 400 GFLOPS for game logic & glue!). OOE CPUs more efficient for the majority of our game code, but for the vast majority of our FLOPS these are fully irrelevant: can evolve to a small dot on a sea of DP cores, or run scalar on general DP cores wasting some compute. In other words: no need for separate CPU and GPU! -> single heterogeneous processor.
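    Worked out from the percentages on the slide, the 40 TFLOPS budget splits as follows, which is where the 400 GFLOPS figure for game code comes from:

```latex
\begin{align*}
\text{graphics}   &: 0.80 \times 40\ \text{TFLOPS} = 32\ \text{TFLOPS} \\
\text{simulation} &: 0.15 \times 40\ \text{TFLOPS} = 6\ \text{TFLOPS} \\
\text{misc}       &: 0.04 \times 40\ \text{TFLOPS} = 1.6\ \text{TFLOPS} \\
\text{game}       &: 0.01 \times 40\ \text{TFLOPS} = 0.4\ \text{TFLOPS} = 400\ \text{GFLOPS}
\end{align*}
```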
  • 43. Conclusions. Future is an interleaved mix of task- & data-parallelism, on both the HW and SW level, but programmable DP is where the massive compute is done. Data-parallelism requires data-oriented design. Developer productivity can’t be limited by model(s): it should enhance productivity & perf on all levels; tools & language constructs play a critical role. We should welcome our parallel future!
  • 44. Thanks to: DICE, EA and the Frostbite team; the graphics/gamedev community on Twitter; Steve McCalla, Mike Burrows; Chas Boyd; Nicolas Thibieroz, Mark Leather; Dan Wexler, Yury Uralsky; Kayvon Fatahalian.
  • 45. References. Previous Frostbite-related talks: [1] Johan Andersson. “Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing Techniques”. GDC 2007. http://repi.blogspot.com/2009/01/conference-slides.html [2] Natalya Tatarchuk & Johan Andersson. “Rendering Architecture and Real-time Procedural Shading & Texturing Techniques”. GDC 2007. http://developer.amd.com/Assets/Andersson-Tatarchuk-FrostbiteRenderingArchitecture(GDC07_AMD_Session).pdf [3] Johan Andersson. “Terrain Rendering in Frostbite using Procedural Shader Splatting”. Siggraph 2007. http://developer.amd.com/media/gpu_assets/Andersson-TerrainRendering(Siggraph07).pdf [4] Daniel Johansson & Johan Andersson. “Shadows & Decals – D3D10 techniques from Frostbite”. GDC 2009. http://repi.blogspot.com/2009/03/gdc09-shadows-decals-d3d10-techniques.html [5] Bill Bilodeau & Johan Andersson. “Your Game Needs Direct3D 11, So Get Started Now!”. GDC 2009. http://repi.blogspot.com/2009/04/gdc09-your-game-needs-direct3d-11-so.html [6] Johan Andersson. “Parallel Graphics in Frostbite”. Siggraph 2009, Beyond Programmable Shading course. http://repi.blogspot.com/2009/08/siggraph09-parallel-graphics-in.html
  • 46. Questions? Email: johan.andersson@dice.se Blog: http://repi.se Twitter: @repi For more DICE talks: http://publications.dice.se
  • 48. Game development. 2 year development cycle: new IP often takes much longer, 3-5 years; engine is continuously in development & used. AAA teams of 70-90 people: 40% artists, 30% designers, 20% programmers, 10% audio. Budgets $20-40 million. Cross-platform development is market reality: Xbox 360 and PlayStation 3; PC (DX9, DX10 & DX11 for BC2). Current consoles will stay with us for many more years.
  • 49. Game engine requirements (1/2). Stable real-time performance: frame-driven updates, 30 fps; few threads, instead per-frame jobs/tasks for everything. Predictable memory usage: fixed budgets for systems & content, fail if over; avoid runtime allocations; love unified memory! Cross-platform: the consoles determine our base tech level & focus; PS3 is design target, most difficult and good potential; scale up for PC, dual core is min spec (slow!).
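    "Fixed budgets, fail if over" and "avoid runtime allocations" are often implemented with an arena that takes all of its memory up front. The following is a minimal sketch of that pattern, not Frostbite's allocator: `FixedBudgetArena` and its methods are hypothetical names, and the alignment applies to offsets inside the arena (the base address alignment is whatever the underlying `std::vector` allocation provides).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fixed-budget arena: one allocation up front, then simple
// bump allocation that fails instead of growing.
class FixedBudgetArena {
public:
    explicit FixedBudgetArena(std::size_t budgetBytes)
        : storage_(budgetBytes), used_(0) {}

    // Returns nullptr when the request would exceed the budget.
    // `alignment` must be a power of two and applies to the offset
    // within the arena.
    void* allocate(std::size_t bytes, std::size_t alignment = 16) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        const std::size_t start = (used_ + alignment - 1) & ~(alignment - 1);
        if (start > storage_.size() || bytes > storage_.size() - start)
            return nullptr;  // over budget: the caller must handle the failure
        used_ = start + bytes;
        return storage_.data() + start;
    }

    // Releases everything at once, e.g. at the end of a frame.
    void reset() { used_ = 0; }

    std::size_t used() const { return used_; }

private:
    std::vector<unsigned char> storage_;
    std::size_t used_;
};
```

    Failing loudly at the budget boundary keeps memory usage predictable: a system that needs more memory has to have its budget raised explicitly rather than silently growing at runtime.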
  • 50. Game engine requirements (2/2). Full system profiling/debugging: engine is a vertical solution, touches everywhere; PIX, xbtracedump, SN Tuner, ETW, GPUView. Quick iterations: essential in order to be creative; fast building & fast loading, hot-swapping resources; affects both the tools and the game. Middleware: use when it makes sense, cross-platform & optimized; parallelism has to go through our systems.
  • 51. Editor & Pipeline. Editor (“FrostEd 2”): WYSIWYG editor for content; C#, Windows only; basic threading / tasks. Pipeline: offline/background data-processing & conversion; C++, some MC++, Windows only; typically IO-bound; a few compute-heavy steps use CPU-jobs; texture compression uses CUDA, prefer OpenCL or CS. CPU parallelism models are generally not a problem here.
  • 52. EntityRenderCull job example. Frustum culling of dynamic entities in sphere tree. Struct contains all input data needed. Max output data pre-allocated by callee. Single job function, compiled both as CPU & SPU job. Optional struct validation func.

        struct FB_ALIGN(16) EntityRenderCullJobData
        {
            enum
            {
                MaxSphereTreeCount = 2,
                MaxStaticCullTreeCount = 2
            };

            uint sphereTreeCount;
            const SphereNode* sphereTrees[MaxSphereTreeCount];

            u8 viewCount;
            u8 frustumCount;
            u8 viewIntersectFlags[32];
            Frustum frustums[32];

            // .... (cut out 2/3 of struct for display size)

            u32 maxOutEntityCount;

            // Output data, pre-allocated by callee
            u32 outEntityCount;
            EntityRenderCullInfo* outEntities;
        };

        void entityRenderCullJob(EntityRenderCullJobData* data);
        void validate(const EntityRenderCullJobData& data);
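    The same pattern (one struct carrying all inputs, a pre-allocated output array with a maximum count, a single job function, and an optional validation function) can be shown in a self-contained form. This is a simplified sketch, not the Frostbite job: `Plane`, `Sphere`, `SphereCullJobData` and `sphereCullJob` are hypothetical, a flat array of spheres replaces the sphere tree, and planes are assumed to point inward so that a point is inside when its signed distance is non-negative.

```cpp
#include <cassert>
#include <cstdint>

struct Plane {
    float nx, ny, nz, d;  // inside when nx*x + ny*y + nz*z + d >= 0
};

struct Sphere {
    float x, y, z, radius;
};

// All inputs and outputs of the job live in one struct.
struct SphereCullJobData {
    const Sphere* spheres;
    std::uint32_t sphereCount;
    const Plane* planes;
    std::uint32_t planeCount;

    std::uint32_t maxOutCount;  // capacity of outIndices
    std::uint32_t outCount;     // written by the job
    std::uint32_t* outIndices;  // pre-allocated before the job is started
};

// Optional validation of the job data before running.
void validate(const SphereCullJobData& data) {
    assert(data.spheres != nullptr || data.sphereCount == 0);
    assert(data.planes != nullptr || data.planeCount == 0);
    assert(data.outIndices != nullptr || data.maxOutCount == 0);
}

// Single job function: writes the indices of spheres that are inside or
// intersecting every plane. Stops writing once the output is full.
void sphereCullJob(SphereCullJobData* data) {
    validate(*data);
    data->outCount = 0;
    for (std::uint32_t i = 0; i < data->sphereCount; ++i) {
        const Sphere& s = data->spheres[i];
        bool inside = true;
        for (std::uint32_t p = 0; p < data->planeCount && inside; ++p) {
            const Plane& pl = data->planes[p];
            const float dist = pl.nx * s.x + pl.ny * s.y + pl.nz * s.z + pl.d;
            if (dist < -s.radius) inside = false;  // fully behind this plane
        }
        if (inside && data->outCount < data->maxOutCount)
            data->outIndices[data->outCount++] = i;
    }
}
```

    Because the job touches nothing except what the struct points to, the same function body can be compiled for a processor with a separate local store, provided the setup code copies the struct and its arrays in first.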
  • 53. EntityRenderCull SPU setup

        // local store variables
        EntityRenderCullJobData g_jobData;
        float g_zBuffer[256*114];
        u16 g_terrainHeightData[64*64];

        int main(uintptr_t dataEa, uintptr_t, uintptr_t, uintptr_t)
        {
            dmaBlockGet("jobData", &g_jobData, dataEa, sizeof(g_jobData));
            validate(g_jobData);

            if (g_jobData.zBufferTestEnable)
            {
                dmaAsyncGet("zBuffer", g_zBuffer, g_jobData.zBuffer, g_jobData.zBufferResX*g_jobData.zBufferResY*4);
                g_jobData.zBuffer = g_zBuffer;

                if (g_jobData.zBufferShadowTestEnable && g_jobData.terrainHeightData)
                {
                    dmaAsyncGet("terrainHeight", g_terrainHeightData, g_jobData.terrainHeightData, g_jobData.terrainHeightDataSize);
                    g_jobData.terrainHeightData = g_terrainHeightData;
                }

                dmaWaitAll(); // block on both DMAs
            }

            // run the actual job, will internally do streaming DMAs to the output entity list
            entityRenderCullJob(&g_jobData);

            // put back the data because we changed outEntityCount
            dmaBlockPut(dataEa, &g_jobData, sizeof(g_jobData));
            return 0;
        }
  • 54. Timing view. Example: PC, 4 CPU cores, 2 GPUs in AFR (AMD Radeon 4870x2). Real-time in-game overlay: see timing events & effective parallelism; on CPU, SPU & GPU, for all platforms; use to reduce sync-points & optimize load balancing. GPU timing through DX event queries. Our main performance tool!
  • 55. Language (cont.). Requirements: full rich debugging, ideally in Visual Studio; asserts; internal kernel profiling; hot-swapping / edit-and-continue of kernels. Opportunity for IHVs and platform providers to innovate here! Goal: cross-vendor standard. Think of the co-development of Nvidia Cg and HLSL.
  • 56. Unified development environment. Want to debug/profile task- & data-parallel code seamlessly: on all processors! CPU, GPU & manycore; from any vendor = requires standard APIs or ISAs. Visual Studio 2010 looks promising for task-parallel PC code: usable by our offline tools & hopefully PC runtime; want to integrate our own JobManager. Nvidia Nexus looks great for data-parallel GPU code: eventual must have for all HW, how? Huge step forward! Figure: VS2010 Parallel Tasks.

Editor's Notes

  1. Xbox 360 version + offline super-high AA and resolution
  2. “Glue code”
  3. Better code structure! Gustafson’s Law. Fixed 33 ms/f. We don’t use threads for anything that can be a bit computationally heavy, jobs instead
  4. Data-layout
  5. We have a lot of jobs across the engine, too many to go through, so I chose to focus more on some of the rendering. Our intention is to move all of these to the GPU
  6. Conservative
  7. Want to rasterize on GPU, not CPU. CPU <-> GPU job dependencies
  8. Simple keywords such as override & sealed have been a help in other areas
  9. No low-latency GPU contexts. Nested data-parallelism
  10. Would require multi-vendor open ISA standard
  11. With nested data parallelism inside kernels
  12. OpenCL
  13. Isolate systems/areas. We do not use exceptions, nor much error handling
  14. Havok & Kynapse
  15. Bindings to C very interesting!