Slideshow Transcript
- Slide 1: Speed Up Synchronization Locks: A Scaleform Case Study. Abhishek Agrawal, Software Solutions Group
- Slide 2: Legal Disclaimer. INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS. Intel may make changes to specifications and product descriptions at any time, without notice. All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice. Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Intel, Intel Inside, and the Intel logo are trademarks of Intel Corporation in the United States and other countries. *Other names and brands may be claimed as the property of others. Copyright © 2008 Intel Corporation.
- Slide 3: Agenda
  - Common Locking Issues
  - Windows* Locking Methodologies and Associated Performance
  - User-Level Atomic Locks with Scaleform* Case Study
  - Hot Locks and Lock Contention with Flight Simulator* Case Study
  - Locks in Intel TBB®
  - Summary & Call to Action
- Slide 4: Why Care About Locking?
  - Locking code can be the most frequently run code in a multi-threaded application.
  - Choosing which locking methodology to use can be as critical as identifying the parallelism within an application.
  - Improper use of locking mechanisms can lead to lock stuttering, very high contention, and new types of programming bugs.
  - Proper use of locks is crucial for multi-threaded applications.
- Slide 5: Common Lock Pathologies
  - Locks can introduce performance and correctness problems. Some potential problems:
  - Deadlock: happens when tasks try to acquire more than one lock and each holds some of the locks the other tasks need in order to proceed.
  - Convoying: occurs when the operating system interrupts a task that is holding a lock, so the other tasks that need the lock queue up behind it.
  - Priority inversion: a lower-priority task holds a shared resource that is required by a higher-priority task.
- Slide 6: How to Avoid Lock Pathologies
  - Deadlocks:
    - Avoid needing to hold two locks at the same time.
    - Always acquire locks in the same order (e.g. outer container and inner container mutexes).
    - Use atomic operations.
  - Convoying and priority inversion:
    - Use atomic operations instead of locks where possible.
  - Takeaway: use atomic operations and user-level locks.
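
The "always acquire locks in the same order" rule from slide 6 can be sketched in portable C++. This is only an illustration under assumptions: the `Account` type, `transfer`, and `run_transfers` are hypothetical names that do not appear in the slides, and ordering by address is one common way to get a consistent global order, not necessarily the one the speaker had in mind.

```cpp
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical type for illustration; not from the slides.
struct Account {
    std::mutex m;
    long balance = 0;
};

// Always lock the lower-addressed account first, so every thread
// acquires the two mutexes in the same global order.
void transfer(Account& from, Account& to, long amount) {
    if (&from == &to) return;
    Account* first  = std::less<Account*>{}(&from, &to) ? &from : &to;
    Account* second = (first == &from) ? &to : &from;
    std::lock_guard<std::mutex> g1(first->m);
    std::lock_guard<std::mutex> g2(second->m);
    from.balance -= amount;
    to.balance += amount;
}

// Runs opposing transfers concurrently and returns the combined balance.
// Without a consistent lock order, this pattern could deadlock.
long run_transfers(int iterations) {
    Account a, b;
    a.balance = 1000;
    b.balance = 1000;
    std::thread t1([&] { for (int i = 0; i < iterations; ++i) transfer(a, b, 1); });
    std::thread t2([&] { for (int i = 0; i < iterations; ++i) transfer(b, a, 1); });
    t1.join();
    t2.join();
    return a.balance + b.balance;
}
```

In C++17, `std::scoped_lock` over both mutexes is an alternative that applies a deadlock-avoidance algorithm instead of a fixed order.
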
- Slide 8: Windows* Locking Methodologies
  - Interlocked functions: located in kernel32.dll; essentially just use atomic instructions.
  - TryEnterCriticalSection (non-blocking): attempts to get the lock N times in ring 3.
  - EnterCriticalSection (blocking): attempts to get the lock once in ring 3 and then jumps into ring 0.
  - WaitForSingleObject: jumps into ring 0 100% of the time, whether the lock is acquired or not. The mutex and semaphore APIs follow the same path.
- Slide 9: WaitForSingleObject vs. EnterCriticalSection
  - WaitForSingleObject:
    - An overloaded Microsoft API that can be used to check and modify the state of a number of different objects, such as events, jobs, etc.
    - Advantage: it can be processed globally, which enables it to be used for synchronization between processes.
    - Major disadvantage: it always obtains a kernel lock, so it enters privileged mode (ring 0) whether the lock is acquired or not.
  - EnterCriticalSection:
    - Used by surrounding the critical section code with EnterCriticalSection and LeaveCriticalSection API calls.
    - Advantage over WaitForSingleObject: it does not enter the kernel unless there is contention on the lock.
    - Disadvantages: it is a blocking call; it cannot be processed globally; and there is no guarantee on the order in which threads obtain the lock.
- Slide 10: EnterCriticalSection vs. WaitForSingleObject
  - Figures: timings for the sample memory management kernel for 1 and 2 threads, and for 1 to 64 threads.
  - EnterCriticalSection is much faster with 1 thread (no contention), since it does not jump into the kernel if the lock is acquired.
  - WaitForSingleObject and EnterCriticalSection have similar costs under high contention.
  - Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations (http://www.intel.com/performance/resources/limits.htm)
- Slide 11: Where Is the Performance Hit?
  - Windows locking APIs can jump into the operating system kernel.
  - Both EnterCriticalSection and WaitForSingleObject enter the kernel if there is contention on the lock.
  - The transition from user mode to privileged mode can be costly if it happens too often.
  - The impact is greatest for granular locking, where the lock is acquired and released within hundreds of cycles.
  - Use user-level locks for granular operations and in high-contention scenarios.
- Slide 13: User-Level Atomic Locks
  - Use the processor's atomic instructions to atomically update a memory location.
  - An instruction is made atomic by adding a lock prefix, with a memory address as the destination operand.
  - Instructions that can run atomically with a lock prefix on current Intel processors include ADD, ADC, AND, BTC, BTR, CMPXCHG, DEC, INC, SUB, XOR, XADD, and XCHG.
- Slide 14: A Sample User-Level Atomic Lock
  - Figure: the assembly of a simple mutex lock, demonstrating the use of an atomic instruction with a lock prefix to obtain a lock.
  - Is it necessary to write assembly to take advantage of user-land locks that use the lock prefix?
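
The assembly figure from slide 14 did not survive in this transcript. As a rough stand-in, here is a minimal sketch of a user-level test-and-set lock in portable C++. It is not the lock shown on the slide: `SpinLock` and `count_with_spinlock` are hypothetical names, and `std::atomic` is used in place of hand-written assembly. On x86, compilers typically implement `exchange` with XCHG, which is implicitly locked when it has a memory operand.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Minimal test-and-set spin lock (illustrative; not the lock from the figure).
class SpinLock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        // exchange() is a single atomic read-modify-write. It returns the
        // previous value, so "false" means this thread took the lock.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            // Wait on plain loads until the lock looks free, then retry.
            while (locked_.load(std::memory_order_relaxed)) {
                std::this_thread::yield();
            }
        }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};

// Increments a shared counter from several threads under the spin lock
// and returns the final count.
long count_with_spinlock(int threads, int per_thread) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

The lock never enters the kernel to acquire or release; the only system interaction is the optional `yield` while waiting.
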
- Slide 15: Windows Interlocked Functions
  - Windows provides access to the most frequently used atomic instructions for synchronization through the "interlocked" APIs: InterlockedExchange, InterlockedIncrement, InterlockedDecrement, InterlockedCompareExchange, InterlockedExchangeAdd, etc.
  - The APIs reside in kernel32.dll.
  - The interlocked functions never jump into the Windows kernel.
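
For readers without a Windows toolchain, the semantics of the interlocked calls on slide 15 can be approximated with `std::atomic`. The mapping in the comments is an assumption made for illustration, and `RefCount` is a hypothetical wrapper, not a Windows or Scaleform type.

```cpp
#include <atomic>

// Rough portable analogs (assumed mapping, for illustration only):
//   InterlockedIncrement        ~ fetch_add(1) + 1
//   InterlockedDecrement        ~ fetch_sub(1) - 1
//   InterlockedCompareExchange  ~ compare_exchange_strong
struct RefCount {
    std::atomic<long> value{0};

    // Both return the new value, like the interlocked increment/decrement.
    long increment() { return value.fetch_add(1) + 1; }
    long decrement() { return value.fetch_sub(1) - 1; }

    // Stores `desired` only if the current value equals `expected`.
    // Returns the value observed before the attempt, whether or not
    // the store happened.
    long compare_exchange(long desired, long expected) {
        value.compare_exchange_strong(expected, desired);
        return expected;  // unchanged on success, updated to the current value on failure
    }
};
```

The compare-exchange form is the building block for the fast-path lock on slide 20: a return value equal to the comparand means the caller won the race.
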
- Slide 16: Atomic Lock (Performance Comparison)
  - Figure: cost of a user-level atomic lock vs. WaitForSingleObject for the memory management locking kernel example.
  - Under both high and low contention, the user-level atomic lock is several orders of magnitude cheaper. For this reason, a user-level lock is preferable for frequently called granular locking.
- Slide 17: Scaleform*
  - Scaleform GFx: the #1 video game UI solution.
  - GFx is a rich media player that supports Flash.
  - Licensed for Crysis, Mass Effect, and 150+ games.
  - Available on all leading PC and console platforms.
  - Used for menus, HUDs, and animated textures.
  - Recently introduced thread support in GFx for simultaneous playback, optimized loading, ActionScript processing, and other tasks.
- Slide 18: Why Is Threaded UI Important? The future of animated Flash and video textures!
- Slide 19: Scaleform* Case Study Summary
  - Background loading, vector tessellation, Flash playback, and ActionScript execution may require many allocations, which reduce performance.
  - Solution: an innovative allocator that uses about 35 cycles per allocate/free request. That optimization is meaningless, however, if the allocator has to be synchronized with a critical section.
  - In allocation-heavy examples, a system lock can reduce performance by 10-30%.
  - GLock gives about a 50% improvement in locking performance.
  - Based on the "Fast Critical Sections" post by Vladislav Gelfer on Code Project.
- Slide 20: Using Fast Locks in Scaleform*

      #include <windows.h>

      volatile DWORD LockedThreadId = 0;

      void GLock::Lock() {
          DWORD threadId = GetCurrentThreadId();
          if (threadId != LockedThreadId) {
              if ((LockedThreadId == 0) &&
                  (InterlockedCompareExchange((volatile LONG*)&LockedThreadId,
                                              (LONG)threadId, 0) == 0)) {
                  // Single instruction atomic quick-lock was successful.
              } else {
                  // Potentially locked elsewhere, so do a more expensive
                  // lock with system wait on semaphore.
                  PerfLock(threadId);
              }
          }
          // Same-thread re-entry skips the lock and only bumps the count.
          RecursiveLockCount++;
      }

      void GLock::Unlock() {
          if (--RecursiveLockCount == 0) {
              // Release lock does not need atomic op on Intel Architecture!
              LockedThreadId = 0;
              // Release other system semaphore waiters, if any.
          }
      }
- Slide 21: Scaleform GFx* Multi-threaded Demo
  - Plays back multiple files at once on separate threads.
  - ActionScript-intensive Flash file.
- Slide 23: Finding Lock Contention Using Intel Tools
  - Lock contention is another major issue that limits scalability and adds complexity.
  - Intel tools can help find high-contention scenarios:
    - VTune™: collecting the clock ticks event via event-based sampling with the Intel VTune Analyzer can help determine how much contention is occurring.
    - Thread Profiler™: provides an API for instrumenting user synchronization. Spin waits appear as a hashed color in the Thread Profiler GUI.
  - Please refer to the Intel session "Comparative Analysis of Game Parallelization" for more details on Thread Profiler.
- Slide 24: Contention Using VTune™ (Where to Look)
  - EnterCriticalSection: ring 0 ntoskrnl.exe becomes hotter. Under very high contention, ring 0 becomes hot and the number of context switches becomes very high.
  - TryEnterCriticalSection: ntdll.dll becomes hotter as you add threads.
  - WaitForSingleObject: similar behavior to EnterCriticalSection.
  - Interlocked functions: kernel32.dll gets hot.
- Slide 25: Contention in WaitForSingleObject Using VTune™. The example shows the hot functions within the Windows OS kernel, ntdll.dll, and hal.dll under no contention and under high contention for the WaitForSingleObject call.
- Slide 26: Possible Ways to Reduce Lock Contention
  - Lock striping: does your whole array really need to be protected by the same lock, or can you give each element its own lock?
  - Protect data, not code: a common technique is to put a lock around a whole function call. Remember that only the data needs to be protected, not the code.
  - Use reader-writer locks where applicable: they suit cases where many threads read a memory location that is rarely changed, and they let multiple readers hold the lock at the same time.
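
The reader-writer idea from slide 26 can be sketched with the C++17 standard library. This swaps in `std::shared_mutex` rather than a Windows or TBB primitive, and `SettingsTable` is a hypothetical read-mostly structure invented for the example.

```cpp
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>

// Hypothetical read-mostly table; names are illustrative only.
class SettingsTable {
    mutable std::shared_mutex mutex_;
    std::map<std::string, int> values_;
public:
    // Many readers may hold the shared lock at the same time.
    int get(const std::string& key, int fallback) const {
        std::shared_lock<std::shared_mutex> read_lock(mutex_);
        auto it = values_.find(key);
        return it == values_.end() ? fallback : it->second;
    }

    // A writer takes the lock exclusively, blocking readers and other writers.
    void set(const std::string& key, int value) {
        std::unique_lock<std::shared_mutex> write_lock(mutex_);
        values_[key] = value;
    }
};
```

The lock lives inside the object next to the data it guards, which also follows the "protect data, not code" advice: callers cannot reach `values_` without going through a locked accessor.
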
- Slide 27: Microsoft Flight Simulator* Case Study
  - Multi-threading goal: separate terrain processing from rendering.
    - Loading the game once at the beginning.
    - The engine keeps loading content in the background while playing.
    - The main thread runs D3D, physics, etc.
    - All other threads load and pre-process the terrain textures and other content.
  - Load and process textures without slowing down the frame rate.
    - Expected to scale by processing more content as more processors become available.
- Slide 28: Locking Problem
  - Symptoms and thread profiling:
    - Occasional stuttering (figure labels: Main Thread, BKG Thread).
    - Does not scale well from 2 to 4 cores because of very high contention (figure labels: Main Thread, BKG Thread).
- Slide 29: Locking Root-Cause
  - Both cases lead to global hash map access.
    - Only one thread can access the hash map while all other threads are blocked.
    - The entire hash map was protected by a single critical section (probably the worst choice).
  - Solution:
    - Protect each bucket in the hash map instead of the whole hash map. As long as threads access different buckets, they are safe and do not block each other.
    - Use a lock-free library (Microsoft* internal tools). The concept is to have a single thread write while multiple threads read at the same time, as long as the data is not being written. TBB provides a similar locking mechanism.
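
The per-bucket locking fix from slide 29 can be sketched as a striped map. This is not the Flight Simulator code: `StripedMap`, its key and value types, and the stripe count are all assumptions chosen to keep the example short.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical striped map: each stripe has its own mutex and its own table,
// so threads that touch different stripes never block each other.
template <std::size_t Stripes = 16>
class StripedMap {
    struct Stripe {
        std::mutex m;
        std::unordered_map<std::string, int> table;
    };
    std::array<Stripe, Stripes> stripes_;

    // The key's hash selects the stripe, so a key always maps to the same lock.
    Stripe& stripe_for(const std::string& key) {
        return stripes_[std::hash<std::string>{}(key) % Stripes];
    }
public:
    void put(const std::string& key, int value) {
        Stripe& s = stripe_for(key);
        std::lock_guard<std::mutex> guard(s.m);  // locks one stripe only
        s.table[key] = value;
    }

    std::optional<int> get(const std::string& key) {
        Stripe& s = stripe_for(key);
        std::lock_guard<std::mutex> guard(s.m);
        auto it = s.table.find(key);
        if (it == s.table.end()) return std::nullopt;
        return it->second;
    }
};
```

More stripes reduce the chance that two threads collide on the same lock, at the cost of more mutexes; operations that span the whole map (such as a resize or a full iteration) would need to take every stripe lock in a fixed order.
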
- Slide 30: Flight Simulator* Result. Reduced stuttering, lower latency in terrain loading, and better visuals without sacrificing frame rates.
- Slide 31: Synchronization Primitives in Intel TBB®
  - Atomic operations: a high-level abstraction for atomic instructions.
    - OS- and compiler-portable.
    - Support processors with weak memory consistency, such as Itanium.
  - Exception-safe locks:

    | Mutex type       | Scalable     | Fair         | Reentrant | Sleeps |
    |------------------|--------------|--------------|-----------|--------|
    | mutex            | OS dependent | OS dependent | No        | Yes    |
    | spin_mutex       | No           | No           | No        | No     |
    | queuing_mutex    | Yes          | Yes          | No        | No     |
    | spin_rw_mutex    | No           | No           | No        | No     |
    | queuing_rw_mutex | Yes          | Yes          | No        | No     |
- Slide 32: Example TBB® Reader-Writer Lock

      #include "tbb/spin_rw_mutex.h"
      using namespace tbb;

      spin_rw_mutex MyMutex;

      int foo() {
          /* Construction of 'lock' acquires 'MyMutex' as a reader */
          spin_rw_mutex::scoped_lock lock(MyMutex, /*is_writer*/ false);
          …
          if (!lock.upgrade_to_writer()) {
              /* data may have been modified since the last read */
          } else {
              /* data was not modified by another thread */
          }
          return 0; /* Destructor of 'lock' releases 'MyMutex' */
      }

  - If an exception occurs within the protected code block, the destructor automatically releases the lock if it is held, avoiding a deadlock.
  - Any reader lock may be upgraded to a writer lock; the return value of upgrade_to_writer indicates whether the lock had to be released before it could be upgraded.
- Slide 33: General Recommendations for TBB® Locks
  - spin_mutex is VERY FAST in lightly contended situations; use it if you need to protect very few instructions.
  - Use queuing_rw_mutex when scalability and fairness are important.
  - Use a reader-writer mutex to allow non-blocking reads from multiple threads.
  - Please refer to the Intel session "Comparative Analysis of Game Parallelization" for more details on TBB.
- Slide 34: Summary & Call to Action
  - An inefficient synchronization strategy can have a big impact on the performance of your multi-threaded application: if it doesn't hit you today, it surely will tomorrow.
  - Try using user-level atomic locks instead of very expensive kernel locks.
  - Use Intel tools (VTune™ and Thread Profiler™) to help identify potential lock problems.
  - Use locks properly to avoid high-contention scenarios and make your code more scalable.
- Slide 35: Contact Info. For more info, see our Graphics, Game Development and Threading resources at http://softwarecommunity.intel.com/. Feel free to contact me directly: abhishek.r.agrawal@intel.com

