Slideshow transcript
Slide 1: Game Developers Conference 2008. Optimizing DirectX on Multi-core Architectures
Leigh Davies, Senior Application Engineer, Intel (Leigh.Davies@Intel.com), February 2008
Contributions from: David Potages (Grin*), Jeff Andrews (Intel®), Rita Turkowski (Intel®), Kev Gee (Microsoft*)
*Other names and brands may be claimed as the property of others
Slide 2: Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS. Intel may make changes to specifications and product descriptions at any time, without notice. All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice. Intel, processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Intel, Intel Inside, and the Intel logo are trademarks of Intel Corporation in the United States and other countries.
*Other names and brands may be claimed as the property of others. Copyright © 2008 Intel Corporation.
Slide 3: Agenda
- Graphics and the CPU
- Profiling Graphics and Drivers
- Threading the render thread
- Case Study GRIN*
- Summary
Slide 4: Graphics is CPU Intensive.
[Charts: World in Conflict*, Crysis* CPU Benchmark, Bionic Commando*, Crysis* GPU Benchmark. Legend: Application, D3D Runtime, Driver, Other]
D3D Runtime and Driver account for 25-40% of CPU cycles per frame
Data taken on Intel® QX6700® Processor at 2.67 GHz, NVIDIA 8800GTX GPU, 2Gig memory.
Slide 5: Designing the Rendering Pipeline.
[Diagram: Application -> Render Functions -> Direct3D* Runtime -> Command Buffer -> Software Driver -> Video Card]
• Analyze the whole program
  - Your Application
  - Direct API usage and overheads
  - Video card driver
• Have Defined Performance Goals
  - Use key game play targeted scenarios for perf analysis
  - Build benchmarks / test levels
DX9 API Call**            Cycles count
SetVertexShader           3000-12100
SetPixelShaderConstant    1500-9000
SetTexture                2500-3100
DrawPrimitive             1050-1150
ZFUNC                     510-700
[Image: World in Conflict*]
**Timings taken from msdn2.microsoft.com/en-us/library/bb172234(VS.85).aspx
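To make the cycle table above concrete, here is a minimal C++ sketch that turns per-call cycle ranges into a rough per-frame CPU cost. The cycle ranges come from the slide; the call counts, the `CallCount` and `FrameCost` types and the function names are hypothetical and only illustrate the arithmetic. Real costs depend on driver, state and hardware.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One API call type: the cycle range is taken from the slide's table,
// callsPerFrame is a hypothetical figure you would get from your own profile.
struct CallCount {
    const char* name;
    double minCycles;
    double maxCycles;
    int callsPerFrame;
};

// Estimated CPU time per frame, as a [min, max] range in milliseconds.
struct FrameCost {
    double minMs;
    double maxMs;
};

// Convert a cycle count to milliseconds at the given clock rate (GHz).
double cyclesToMs(double cycles, double clockGHz) {
    return cycles / (clockGHz * 1e9) * 1e3;
}

// Sum the cycle ranges of all calls issued in one frame.
FrameCost estimateFrameCost(const std::vector<CallCount>& calls, double clockGHz) {
    double minCycles = 0.0;
    double maxCycles = 0.0;
    for (const CallCount& c : calls) {
        minCycles += c.minCycles * c.callsPerFrame;
        maxCycles += c.maxCycles * c.callsPerFrame;
    }
    return FrameCost{cyclesToMs(minCycles, clockGHz), cyclesToMs(maxCycles, clockGHz)};
}
```

For example, feeding in 1000 DrawPrimitive calls at 2.67 GHz gives a range of roughly 0.39 to 0.43 ms per frame, before any state changes are counted.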
Slide 6: Balancing Future Workloads
[Diagram: Intel® Roadmap, tick/tock pairs of 2 YEARS each]
- Tick (Compaction/Derivative): Intel Core™ Duo · Pentium-D, 65nm
- Tock (Intel Core™ Microarchitecture): Intel Core™2 Duo, DC Intel Xeon® 5100
- Tick (Compaction/Derivative): PENRYN, 45nm
- Tock (New Microarchitecture): NEHALEM
NEHALEM: Scalable & Configurable
- Scalable Performance: 1 to 8 Threads, 1 to 4 Cores
- Scalable Cache, Interconnects & Memory Controllers
- Graphics
Slide 7: Time is Money
Be realistic, Rendering Costs CPU Time
- Rendering thread is a potential bottleneck for N-Core scaling
- Rendering costs are likely to increase as you add more physics, effects or even AI objects
- Runtime and driver costs are significantly higher on the PC than on the consoles
Use Performance Analysis results to focus development efforts
- Analyze regularly and catch regressions early
Optimise the graphics thread. Offload as much as possible.
Slide 8: Agenda
- Graphics and the CPU
- Profiling Graphics and Drivers
- Threading the render thread
- Case Study GRIN
- Summary
Slide 9: Overview of Graphics Driver Models
Windows* XP Display Model XPDM - DX* - DX9
- The Kernel mode driver controls threading
Windows Vista* Display Driver Model WDDM - DX9
- The D3D9 runtime manages creation of threads
- One is created specifically for the User Mode Driver (UMD)
Windows Vista Display Driver Model WDDM - DX10
- The Driver is responsible for creating threads
- Currently released drivers don’t thread
- Could change in the near future
Graphics driver can have a major impact on performance and multi-core scaling.
Slide 10: Profiling Tools
Need to use a variety of tools;
- Use repeatable workload
CPU Tools;
- VTune™ Performance Analyzer
- Intel® Thread Profiler
- PIX for Windows*
- AMD CodeAnalyst™
GPU Tools;
- PIX for Windows with vendor plugins
- NVIDIA* PerfHUD
- ATI* PerfStudio
Slide 11: Profiling Graphics with VTune™ Analyzer
Select Counter Monitor for a quick overview;
- Not necessary to launch the app
- Disable display of counter data unless running windowed
Profile across a selection of configurations
- Identify different bottlenecks based on h/w limitations
- “Works great on my machine” isn’t good enough
Slide 12: VTune™ Performance Analyzer - Sampling
• Calibration isn’t needed for games
• Delay sampling allows alt-tab or bypass loading
• Tracking core usage needs to be added
• Privileged time shows time inside Kernel
Slide 13: VTune™ Analyzer Views
• Processor Usage
• Memory Usage
• Context Switching
• CPU Frequency
VTune™ Analyzer allows you to add your own counters.
Slide 14: Sampling - Display Model XPDM
[Diagram: Kernel Mode: Session Space, Display Driver, Miniport Driver, Win32k & Dxg, Videoport. User Mode: Application, D3D Runtime]
Slide 15: Sampling - Display Model WDDM
[Diagram: Kernel Mode: Session Space, CDD, Kernel Driver, Win32k, Dxgkrnl. User Mode: Application Process (Application, D3D Runtime, User Mode Driver), DWM Process (DWM)]
Slide 16: Associating Symbols in VTune™ Analyzer
- Configure->Options->Directories->Symbol Repository
- View Symbol Repository->Delete unassociated modules
- In Tuning Browser select "Results" -> "Module Associations..."
- Edit symbol associations
Slide 17: Symbol Information for DX10Core.dll
[Screenshot] Symbols taken while profiling the SoftParticle sample from the SDK
Slide 18: PIX for Windows
[Screenshot: CPU and GPU timelines]
- Gathering GPU events requires Windows Vista
- Cross over between PIX and VTune™ Counters
- Easy to see CPU/GPU headroom
Slide 19: Intel® PIX Plug-in: Beta Available Now
#. Metric Name: Description
1. Frame Time: Instantaneous frame time in milliseconds.
2. Frames per Second: Instantaneous frame rate normalized to seconds (inverted frame time).
3. Driver Time: The amount of time spent in the display driver, normalized to milliseconds.
4. Driver Time Stalled: The amount of time spent in the display driver either busy stalled or in a sleep state, normalized to milliseconds.
5. Graphics Memory Used – MB: The amount of graphics memory currently utilized, normalized to MB.
6. Graphics Memory Used - bytes: The amount of graphics memory currently utilized, normalized to bytes.
7. Texture Memory Used: The amount of texture memory currently utilized, normalized to MB.
8. GPU Busy: The percent utilization of the front end of the GPU. This metric shall describe the incoming command stream and does NOT describe the utilization of the array of execution units (cores).
9. Cores Busy: The percentage of time that any core in the array is either actively executing instructions or stalled.
10. Cores Active: The percentage of time that the core array is actively executing instructions.
11. Vertex Count: The number of vertices that entered the pipeline.
12. Triangle Count: The number of triangles that flowed through the pipeline prior to any clipping or culling.
13. Texel Count: The number of texels that were fetched by the pipeline.
14. Pixels Drawn: The number of pixels that were actually written to the render target.
15. Mathbox Utilization: The aggregated percentage of time that the mathbox was actively executing instructions.
16. Texture Unit(s) Utilization: The aggregated percentage of time that the texture units were actively processing texels.
Provides access to Intel® Counters in PIX
Rollout now to support IIG Profiling
Slide 20: Agenda
- Graphics and the CPU
- Profiling Graphics and Drivers
- Threading the render thread
- Case Study GRIN
- Summary
Slide 21: Starting Points
Common Issues:
- Naive Ports to Windows from console models
- Excessive context switching/synchronization overhead
- Work starvation due to thread sync dependencies
General Rules
- Use only 1 heavy weight thread per Core on Windows
- Manage Job distribution
- The OS scheduler knows best
- Consider memory bandwidth
Multi-core and D3D Usage
- Avoid Use of the D3DCREATE_MULTITHREADED flag
- You CAN manage synch costs better
- Design around a single threaded D3D Device Access model
- Lock resources from main thread, manually protect access
Slide 22: Making the Drivers Work for You!
[Diagram: App, D3D Runtime, D3D Driver]
- Potential 20%+ speed gain.
- Can be disabled by application behaviour.
- Producer & Consumer threads dispatch commands to GPU
- Pack your DrawPrimitive2 calls together
- Frequently creating & destroying shaders, VB, IB, and surfaces will impact performance
- Avoid allocating too many system memory resources
  - DrawPrimitiveUP or DrawIndexedPrimitiveUP
Slide 23: Making the Drivers Work for You!
- Avoid any calls that return GPU state information; they require a CPU thread synchronization
- Driver Queries are OK (calls are asynchronous)
- Do not lock threads to a specific CPU!
- Group all resource updates (Texture and Vertex) together once per frame; beginning or end is fine, just don’t scatter them among drawing calls
- Minimize use of any locks/unlocks
- System Memory Vertex Buffers
  - D3DUSAGE_DYNAMIC, use with D3DUSAGE_WRITEONLY
  - Lock with D3DLOCK_DISCARD or D3DLOCK_NOOVERWRITE
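One common way to apply the D3DLOCK_DISCARD / D3DLOCK_NOOVERWRITE advice is to append into a dynamic buffer and only discard when it is full. The sketch below models just that decision in portable C++. The `LockMode` enum is a stand-in for the two D3D9 lock flags, and `DynamicBufferCursor` and `LockRequest` are hypothetical names, not part of Direct3D. In a real renderer the returned offset, size and mode would be passed to `IDirect3DVertexBuffer9::Lock`.

```cpp
#include <cassert>
#include <cstddef>

// Stand-ins for the D3D9 lock flags named on the slide (assumption: a real
// renderer maps these to D3DLOCK_NOOVERWRITE and D3DLOCK_DISCARD).
enum class LockMode { NoOverwrite, Discard };

// What to ask the vertex buffer for: byte range plus lock mode.
struct LockRequest {
    std::size_t offset;
    std::size_t size;
    LockMode mode;
};

// Hypothetical helper tracking the write position inside one dynamic buffer.
class DynamicBufferCursor {
public:
    explicit DynamicBufferCursor(std::size_t capacity)
        : capacity_(capacity), next_(0) {}

    // Reserve `size` bytes. While the data fits, append after the previous
    // write with NoOverwrite (data already in use by the GPU is untouched).
    // When it no longer fits, restart at offset 0 with Discard so the driver
    // can return fresh memory instead of waiting for the GPU.
    LockRequest reserve(std::size_t size) {
        assert(size <= capacity_);
        LockMode mode = LockMode::NoOverwrite;
        if (next_ + size > capacity_) {
            next_ = 0;
            mode = LockMode::Discard;
        }
        LockRequest request{next_, size, mode};
        next_ += size;
        return request;
    }

private:
    std::size_t capacity_;
    std::size_t next_;
};
```

The design choice here is to keep the lock policy separate from the device calls, which also makes it easy to keep all device access on one thread.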
Slide 24: Threading Issues
[Diagram over Time: Main Thread (Frame n): Move Object X, Delete Object Y. Render Thread (Frame n-1): Render Object X, Render Object Y]
- Race Conditions between threads.
  - Object Updates
  - Creation/deletion of objects
- False sharing of data between threads.
- Accessing hardware resources.
Slide 25: Threading Options
[Diagram, Pipeline: Front-End Logic, Back-end Render, EOF]
[Diagram, Consumer thread: Front-end Logic, Cmd Queue, Back-end Render, EOF]
• Avoiding the Issues
  • Use an update queue, lightweight (lock-free?)
  • Make duplicate objects/double-buffered
  • Reference count objects
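A minimal sketch of the update-queue idea, assuming a mutex-protected queue rather than the lock-free variant the slide hints at. The `UpdateQueue` class and its method names are hypothetical. The main thread records commands, and the render thread swaps the pending list out under the lock and executes it without holding the lock, so the critical section stays short.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical update queue: producer (main thread) pushes commands,
// consumer (render thread) drains them in order.
class UpdateQueue {
public:
    using Command = std::function<void()>;

    // Called on the main thread: record a change instead of touching
    // renderer-owned objects directly.
    void push(Command cmd) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push_back(std::move(cmd));
    }

    // Called on the render thread: take the whole batch while holding the
    // lock, then run it with the lock released. Returns the number of
    // commands executed.
    std::size_t executePending() {
        std::vector<Command> batch;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            batch.swap(pending_);
        }
        for (Command& cmd : batch) {
            cmd();
        }
        return batch.size();
    }

private:
    std::mutex mutex_;
    std::vector<Command> pending_;
};
```

Object moves and deletions queued this way are applied by the render thread at a point of its choosing, which avoids the race shown on the previous slide.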
Slide 26: Buffering Dynamic Data
Partially buffered locks
[Diagram: Main Thread (Frame n): Modify Vertex Buffer0. Main Thread (Frame n+1): Modify Vertex Buffer1. Render Thread (Frame n-1): Render Object from Vertex Buffer1. Render Thread (Frame n): Render Object from Vertex Buffer0]
Fully buffered locks
[Diagram: Main Thread: Lock Buffer, Modify Buffer, Unlock Buffer (Local Data Buffer). Data Queue 0, Data Queue1. Render Thread: Lock Buffer, Copy Data, Unlock Buffer (Video Buffer)]
Partially buffered locks consume more video memory. Fully buffered locks consume more system memory and have an associated CPU cost for memory copying.
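The partially buffered scheme can be sketched as two copies of the data indexed by frame parity. This is a simplified illustration, not the presenters' code: `PartiallyBuffered` is a hypothetical name, and the sketch assumes some external synchronization guarantees the render thread is never more than one frame behind the main thread.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Two copies of a dynamic resource. The main thread working on frame n
// writes slot n % 2 while the render thread, one frame behind, reads slot
// (n - 1) % 2, so the two threads never touch the same copy.
template <typename T>
class PartiallyBuffered {
public:
    // Main thread: copy to modify for the given frame.
    T& writeSlot(unsigned frame) { return slots_[frame % 2]; }

    // Render thread: copy to draw from for the given frame.
    const T& readSlot(unsigned frame) const { return slots_[frame % 2]; }

private:
    std::array<T, 2> slots_{};
};
```

In the fully buffered variant the main thread would instead fill a local system memory buffer and queue it, leaving the copy into the video buffer to the render thread.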
Slide 27: Sub Threading Options
[Diagram: Front-End Logic with Job Queue (Job, Job, Job), EOF, Back-end Render with Job Queue (Job, Job, Job)]
• Job Queue offloads
  • Software Visibility Culling
  • Particle generation
  • Character Skinning
  • Procedural updates
• Reduces path size through both front and back ends
Slide 28: Threading the DX API
[Diagram: DX9 Render System: D3D9Wrapper (D3DDevice9, D3DVertexBuffer9), D3D9 Wrapper (D3DDevice9 Wrapper, D3DVertexBuffer9), Graphics Driver, Graphics Device]
- Similar to DX9 threading in the runtime
  - Potentially repeating the same work
- Potential to move simple API code out of main thread, i.e. state management
- DX10 has lower runtime costs
[Charts: DX9 and DX10. Values as labelled: Main Thread 46.46, 39.08; Main Thread 45.72, 63.84; DX API Thread (15.82%) in DX9 7.38; DX API Thread (28.39% in DX10+Driver) 18.12; NVIDIA driver 23.02; Physics 13.95, 13.95; Physics 10.91, 10.91; Other threads 21.88, 21.88; Other threads 19.35, 19.35; 39% increase*; 16% increase*]
* Theoretical increase based on amount of API work offloaded, does not include threading overhead
Data taken on Intel® QX6700® Processor at 2.67 GHz, NVIDIA 8800GTX GPU, 2Gig memory.
Slide 29: Agenda
- Graphics and the CPU
- Profiling Graphics and Drivers
- Threading the render thread
- Case Study GRIN
- Summary
Slide 30: Case study: Grin’s engine*
David Potages, Senior Engine Architect, GRIN (david.potages@grin.se), February 2008
*Performance figures discussed in this case study refer to a pre release version of the game. They are subject to change before release and are for illustration only.
Slide 31: Quick Engine Overview
- 3rd generation of threaded engine
- 2nd generation of threaded renderer
- Used in several games
Slide 32: Quick Engine Overview
- Not game specific: game code in Lua scripts
  - Allows hot-reload, no link time, custom debugger
  - But single threaded, a lot of memory allocations
- Deferred rendering
- DX9 - DX10 being implemented
- Libraries:
  - PhysX™
  - OpenAL
  - Bink*
All the technology choices have great impact on the possible parallelization!
Slide 33: Why multi-threading?
- Poor CPU usage
  - Can go down to 30%
- A lot of time spent in D3D/driver
  - 35-45%*
- But a lot of the application time is dedicated to rendering
  - Up to 37%*
  - Grand total of 53%* of frame with D3D/driver
[Chart: values shown 17%, 46%, 29%. Legend: Application, D3D Runtime, Driver, Other]
*Data taken on Intel® QX9650® Processor at 2.33 GHz, NVIDIA 8800GTX GPU, 2Gig memory, Windows Vista™ Ultimate.
Slide 34: Why multi-threading the renderer?
[Diagram: Simplified pipeline (ST version): World update (PhysX™), Script update (Lua*), Rendering (Culling, Particles batch optimizations), Sound (OpenAL*), Network]
Rendering is an easy target for multithreading: low system dependencies, 53% of frame time
But easier said than done!
- Some systems or the drivers they use can take advantage of multi-cores
- Rendering has low dependencies with other systems, but big data dependencies
Slide 35: Implementation Details
Main thread
- Entity/World updates, Animations, Input, Network, Lua, SoundSystem, Physics (main)
Renderer thread
- Culling (including software occlusion queries)
- Particle effects batch optimizations
- RenderDevice (D3D)
- Win32 messaging
Other threads
- File streaming
- PhysX™ threads
- Driver threads
Slide 36: Implementation Details
Messages sent to the renderer
- Non blocking: render_scene, render_frame, update_window, etc.
- Blocking: flush_pipe
flush_pipe forces the renderer to execute all the queued jobs => synchronization point
- Used between frames on main thread
- Can be used to ensure that data (e.g. textures) is ready
[Diagram: Front-end Logic and Back-end Render timelines showing Flush, Sync and Idle periods]
Slide 37: Implementation Details
- States need to be mirrored
- State changes are queued, and updated in the freeze
- The proper state is returned depending on the calling thread
- This will avoid contention when data is accessed in the renderer, but mirror only what is required
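State mirroring as described above can be sketched as one copy of the state per thread, with changes applied to the renderer's copy only at the sync point. This is an illustrative simplification, not GRIN's implementation: `MirroredState` and its method names are hypothetical, the caller picks the accessor for its thread explicitly instead of the engine detecting the calling thread, and `applyAtFreeze` assumes both threads are parked (for example during the blocking flush) when it runs.

```cpp
#include <cassert>

// One value mirrored between the main thread and the renderer thread.
template <typename T>
class MirroredState {
public:
    explicit MirroredState(const T& initial)
        : mainCopy_(initial), renderCopy_(initial), dirty_(false) {}

    // Main thread: update its own copy and mark the change as queued.
    void setFromMain(const T& value) {
        mainCopy_ = value;
        dirty_ = true;
    }

    // Each thread reads only its own copy, so reads never contend.
    const T& getOnMain() const { return mainCopy_; }
    const T& getOnRenderer() const { return renderCopy_; }

    // Sync point ("freeze"): publish the queued change to the renderer.
    // Assumption: no other thread touches this object during the call.
    void applyAtFreeze() {
        if (dirty_) {
            renderCopy_ = mainCopy_;
            dirty_ = false;
        }
    }

private:
    T mainCopy_;
    T renderCopy_;
    bool dirty_;
};
```

Because every mirrored value costs memory and a copy per sync, this matches the slide's advice to mirror only what the renderer actually reads.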
Slide 38: Results
- Better CPU usage: 40-60%*
- Better threads workload
*Data taken on Intel® QX6700® Processor at 2.67 GHz, NVIDIA 8800GTX GPU, 2Gig memory.
Slide 39: Results: Rendering Performance
• Effect on a low physics/gameplay workload
[Chart: FPS (axis 0-100) by CPU configuration: 1C, 2C ST, 2C MT, 4C ST, 4C MT]
Better FPS
- 4C MT is 1.88x faster than 1C*
- 4C MT is 1.20x faster than 4C ST*
Analysis
- Remember that the drivers are partially threaded: we save up to 17% + % of D3D/driver time that is not threaded
- Close to 1.20x: if D3D/driver were completely threaded, the new frame time would be 1 - 0.17 = 83% of the old one, and the scale-up is fps_new/fps_old = time_old/time_new = time_old/(time_old * 0.83) = 1.20
- Maximum scale-up vs. 1C is 2.12x
- Context switches, cache misses and contention slow us down.
- Render-thread bound
*Data taken on Intel® QX9650® Processor at 2.33 GHz, NVIDIA 8800GTX GPU, 2Gig memory, Windows Vista™ Ultimate.
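The scale-up arithmetic on this slide is the usual bound of 1 / (1 - offloaded fraction). The short C++ sketch below reproduces it; `maxScaleUp` is a hypothetical helper name. The 17% figure is from this slide. Reading the 2.12x maximum as the same formula applied to the 53% grand total quoted earlier in the case study is an assumption, although the numbers agree (1 / 0.47 is about 2.13).

```cpp
#include <cassert>
#include <cmath>

// Upper bound on the frame-rate scale-up when `offloadedFraction` of the
// frame time is moved off the critical thread and everything else stays
// the same: fps_new / fps_old = time_old / (time_old * (1 - fraction)).
double maxScaleUp(double offloadedFraction) {
    assert(offloadedFraction >= 0.0 && offloadedFraction < 1.0);
    return 1.0 / (1.0 - offloadedFraction);
}
```

With 0.17 this gives about 1.20, and with 0.53 about 2.13. Measured results stay below these bounds because of context switches, cache misses and contention.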
Slide 40: Improvements
- Threading some parts of the render thread
  - E.g.: culling (~9-25%* of the render thread)
- Reducing contentions
  - Mainly memory
- Batch more
  - E.g.: Effects
- Triple buffering?
*Data taken on Intel® QX9650® Processor at 2.33 GHz, NVIDIA 8800GTX GPU, 2Gig memory, Windows Vista™ Ultimate.
Slide 41: Scalability
- While we are render-thread bound, we can push more physics/effects, for instance, or more AI
- But it is hard to find the right balance between CPU and GPU workload!
- Example: falling cars, aka pushing more physics
Slide 42: Scalability
[Chart: FPS (axis 0-50) for 1C, 4C ST, 4C MT]
- ~256 cars falling and bouncing
- 4C MT is 1.42x* faster than 4C ST, and 3.23x* faster than 1C
- PhysX™ helped us a lot to spread the workload, but occupies the other cores quite heavily, thus preventing D3D/drivers from taking advantage of them.
- Rendering overhead was not that big with the additional units since they batch well.
*Data taken on Intel® QX9650® Processor at 2.33 GHz, NVIDIA 8800GTX GPU, 2Gig memory, Windows Vista™ Ultimate.
Slide 43: Issues
- A proper benchmark system is required
  - A fly-through benchmark is not enough!
  - The CPU & GPU workloads vary a lot on different maps
- Easy to forget data that needs to be mirrored
- Lock-free algorithms are nice, but should be used with care
- Memory contention + cache misses + false sharing
- Behaviour of drivers varies quite a lot…
Slide 44: Agenda
- Graphics and the CPU
- Profiling Graphics and Drivers
- Threading the render thread
- Case Study GRIN
- Summary
Slide 45: Summary/Conclusion
- The graphics pipeline is still very CPU intensive
- Future CPUs will have an increasing number of logical processors
- It is worth threading your renderer as much as possible if you want to be able to push more things in your game
  - Hard to balance the workloads though, need to profile the whole system
- Making the most of the graphics driver is essential
Slide 46: References:
- Accurately Profiling Direct3D API Calls: msdn2.microsoft.com/en-us/library/bb172234(VS.85).aspx
- Debugging Tools and Symbols: Getting Started: www.microsoft.com/whdc/devtools/debugging/debugstart.mspx
- Threading the OGRE3D Render System: www.intel.com/cd/ids/developer/asmo-na/eng/dc/games/331359.htm