Having fun with GPU resets :D
André Almeida
XDC 2023
Hi!
Kernel developer
Working on the Steam Deck
Oh no, my GPU hung!
You are playing your game on Linux
Something wrong is sent to the device
???
Game over, reboot your machine
Modern GPUs are complex
Really complex
AMD Radeon RX 7900 XTX
96 compute units, 384 texture units, 6 shader engines, 58 B transistors...
Shaders are Turing-complete
Modern GPUs are complex
[Figure: display hardware block diagram. Blocks: DCHUB, HUBP (n), DPP (n), MPC, OPP, OPTC, DIO, DCCG, DMU, AZ, MMHUB, HUBBUB, DWB (n), Monitor. Signals: global sync, pixel data, sideband signal, config. bus, SDP. Code structs: dc_plane, dc_stream, dc_state, dc_link. Notes: floating point calculation, bit-depth reduction/dither.]
Modern GPUs are complex
If you have an infinite loop on the CPU, it's not that bad
CPU programs have virtual memory and a virtual processor
Things might be more bare-bones on the GPU
But on a GPU, the display won't be able to update
Detecting GPU hangs
From the hardware to the application
Detecting GPU hangs
Device to Kernel
Submit a job and wait until it is done (halting problem)
Check fences
Or timeouts
The kernel driver does a GPU reset
This can be a "soft" reset, a single hw engine/context reset, or a full device reset
Discrete amdgpu struggles with per-context resets
More complete resets are more destructive
Total loss of VRAM
Now report to userspace
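The device-to-kernel steps above can be sketched as a tiny simulation. This is only an illustration, not amdgpu or DRM scheduler code: `struct sim_fence`, `job_timed_out()`, `next_reset_level()` and the reset level names are all hypothetical. It shows the two ideas from the slide: give up on a job after a timeout (we can't know if it would ever finish), and escalate towards more destructive resets.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical fence: the GPU bumps `signaled_seqno` when a job finishes. */
struct sim_fence {
    uint64_t wanted_seqno;   /* sequence number of the submitted job */
    uint64_t signaled_seqno; /* last sequence number the GPU completed */
    uint64_t submit_ms;      /* submission time in milliseconds */
};

/* Reset levels, from least to most destructive (names are assumptions). */
enum reset_level {
    RESET_NONE,
    RESET_SOFT,
    RESET_ENGINE, /* one hw engine/context */
    RESET_DEVICE, /* full device reset, VRAM contents are lost */
};

static bool fence_signaled(const struct sim_fence *f)
{
    return f->signaled_seqno >= f->wanted_seqno;
}

/* We can't decide whether a job will ever finish, so declare a hang once
 * the fence is still unsignaled after `timeout_ms`. */
bool job_timed_out(const struct sim_fence *f, uint64_t now_ms,
                   uint64_t timeout_ms)
{
    assert(now_ms >= f->submit_ms);
    return !fence_signaled(f) && (now_ms - f->submit_ms) >= timeout_ms;
}

/* Pick the next, more destructive reset after `tried` failed to recover. */
enum reset_level next_reset_level(enum reset_level tried)
{
    return tried < RESET_DEVICE ? (enum reset_level)(tried + 1) : RESET_DEVICE;
}
```

A real driver would pick the timeout per engine and would report the reset to userspace once recovery finishes.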
Reporting GPU hangs
Kernel to Mesa
DRM has no common API for that, each driver has its own:
I915_GET_RESET_STATS
AMDGPU_CTX_OP_QUERY_STATE2
MSM_PARAM_FAULTS
Return -ERROR for ioctls
Nothing for Raspberry Pi or Vivante (?)
Reporting a reset is not really hw specific
But there's no common DRM context concept
Reporting GPU hangs
Mesa to application
Vulkan
Before submitting commands or waiting on an operation, Mesa asks the kernel if the device is still around
If it isn't, Mesa propagates VK_ERROR_DEVICE_LOST to the app
Then the app (maybe) does something about it
Recreate the context
Or just exit
VK_EXT_device_fault lets the app query info about the reset
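A minimal sketch of the "check before submit" flow described above, assuming a driver-side hook that asks the kernel whether a reset happened. All names (`sketch_device`, `sketch_check_status()`, `sketch_queue_submit()`, the fake hooks) are hypothetical stand-ins; real code would return `VkResult` values from `<vulkan/vulkan.h>`. The lost state is sticky: once a Vulkan device is lost, it stays lost until the app recreates it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for VK_SUCCESS and VK_ERROR_DEVICE_LOST. */
enum sketch_result { SKETCH_SUCCESS, SKETCH_ERROR_DEVICE_LOST };

struct sketch_device {
    bool lost;     /* sticky: once lost, always lost */
    int submitted; /* jobs that actually reached the (fake) kernel */
    /* Hypothetical driver hook: did a reset hit this device? */
    bool (*reset_happened)(struct sketch_device *dev);
};

/* Ask the kernel (through the hook) and remember a lost device. */
enum sketch_result sketch_check_status(struct sketch_device *dev)
{
    assert(dev != NULL);
    if (!dev->lost && dev->reset_happened != NULL && dev->reset_happened(dev))
        dev->lost = true;
    return dev->lost ? SKETCH_ERROR_DEVICE_LOST : SKETCH_SUCCESS;
}

/* Check the device first; only submit if it is still around. */
enum sketch_result sketch_queue_submit(struct sketch_device *dev)
{
    enum sketch_result status = sketch_check_status(dev);
    if (status != SKETCH_SUCCESS)
        return status; /* propagated to the application */
    dev->submitted++;
    return SKETCH_SUCCESS;
}

/* Fake kernel answers, for demonstration only. */
bool fake_no_reset(struct sketch_device *dev) { (void)dev; return false; }
bool fake_reset(struct sketch_device *dev) { (void)dev; return true; }
```

With `fake_reset` installed, `sketch_queue_submit()` returns the lost error without submitting anything, and keeps doing so even if the hook later reports no reset.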
Reporting GPU hangs
Mesa to application
OpenGL
Before context creation and on command stream (cs) flush, Mesa does a reset check
If the app uses GL_ARB_robustness, Mesa propagates the error to the app
Otherwise:
Some drivers just kill the application
Others just block new calls
Reporting GPU hangs
Mesa to application
APIs provide a way to tell apps that a reset happened:
VK_ERROR_DEVICE_LOST
GL_ARB_robustness
Non-robust GL apps are just killed
But even robust apps can misbehave
Mesa/kernel can ban an fd if it keeps resetting the GPU
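The banning idea could look roughly like this. It is a sketch under assumptions: the threshold `MAX_GUILTY_RESETS`, `struct client_state` and both functions are made up for illustration and do not correspond to a real Mesa or kernel interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed policy knob, not a real Mesa or kernel tunable. */
#define MAX_GUILTY_RESETS 3u

/* Hypothetical per-file-descriptor bookkeeping. */
struct client_state {
    uint32_t guilty_resets; /* resets this client was blamed for */
    bool banned;            /* once set, new submissions are refused */
};

/* Record a reset blamed on this client; ban it once it hits the limit. */
void client_note_guilty_reset(struct client_state *c)
{
    assert(c != NULL);
    if (c->guilty_resets < UINT32_MAX)
        c->guilty_resets++;
    if (c->guilty_resets >= MAX_GUILTY_RESETS)
        c->banned = true;
}

/* Submission gate: returns false when the client is banned. */
bool client_may_submit(const struct client_state *c)
{
    assert(c != NULL);
    return !c->banned;
}
```

A real policy would probably also look at how close together the resets were, so that an app that hangs once a week isn't treated like one that hangs on every frame.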
Reporting GPU hangs
Mesa to application
Applications can then recreate the context
Usually that means:
Oh no, my submit command returned a reset error
The GPU state is corrupted, but the CPU is still good!
Let's recreate all contexts, buffers, shaders, etc.
Cool, let's go!
Bad apps may fall into a broken loop
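The recovery steps above, plus a guard against the broken loop, can be sketched on the application side like this. The names (`app_ops`, `submit_with_recovery()`, `fake_gpu`) are hypothetical; `SUBMIT_DEVICE_LOST` stands in for VK_ERROR_DEVICE_LOST or a GL_ARB_robustness reset status. The cap on recoveries is an assumption of this sketch, one simple way to avoid recreating everything forever when the app's own shader is what hangs.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the API's "a reset happened" error. */
enum submit_result { SUBMIT_OK, SUBMIT_DEVICE_LOST };

/* Hypothetical application hooks. */
struct app_ops {
    enum submit_result (*submit)(void *user);
    void (*recreate_all)(void *user); /* contexts, buffers, shaders, ... */
};

/* Submit, recreating GPU state after each reset. Gives up after
 * `max_recoveries`. Returns the number of recoveries performed, or -1 if
 * it gave up. */
int submit_with_recovery(const struct app_ops *ops, void *user,
                         int max_recoveries)
{
    assert(ops != NULL && ops->submit != NULL && ops->recreate_all != NULL);

    for (int recoveries = 0; ; recoveries++) {
        if (ops->submit(user) == SUBMIT_OK)
            return recoveries;
        if (recoveries >= max_recoveries)
            return -1; /* likely our own fault: stop looping */
        ops->recreate_all(user);
    }
}

/* Fake GPU for demonstration: the first `hangs_left` submissions fail. */
struct fake_gpu {
    int hangs_left;
    int recreations;
};

enum submit_result fake_submit(void *user)
{
    struct fake_gpu *gpu = user;
    if (gpu->hangs_left > 0) {
        gpu->hangs_left--;
        return SUBMIT_DEVICE_LOST;
    }
    return SUBMIT_OK;
}

void fake_recreate(void *user)
{
    struct fake_gpu *gpu = user;
    gpu->recreations++;
}
```

An app hit by someone else's hang recovers after one recreation; an app that always hangs stops after `max_recoveries` attempts instead of looping.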
#version 330 core
out vec4 FragColor;
void main()
{
    // This loop never terminates: x starts below 1.0 and never increases
    for (float x = 0.01; x < 1.0; x--) {
        FragColor = vec4(x * 1.0f, 1.0f, 1.0f, 1.0f);
    }
}
What can we do here?
DRM <-> Mesa: it's not really hw specific
How about we have a DRM_GET_RESET_STATE?
drm/doc: Document DRM device reset expectations: DRM documentation explaining what DRM drivers and Mesa should do when a reset happens
DRM_IOCTL_GET_RESET
struct drm_get_reset {
        /** Context ID to query resets (in) */
        __u32 ctx_id; // no global context ID...
        /** Flags (out) */
        __u32 flags;
        /** Global reset counter for this card (out) */
        __u64 reset_count;
        /** Reset counter for this context (out) */
        __u64 reset_count_ctx;
};
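drmGetReset(ctx_id, &reset);
/* This context caused at least one reset */
if (reset.reset_count_ctx)
        return PIPE_GUILTY_CONTEXT_RESET;
/* Some other context caused a reset */
if (reset.reset_count)
        return PIPE_INNOCENT_CONTEXT_RESET;
return PIPE_NO_RESET;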
What happens in practice?
There was no reset check in RADV
+ device->vk.check_status = radv_check_status;
What happens in practice?
For RadeonSI, there was a reset check...
But the driver would return GL_CONTEXT_RESET_ARB forever
So the app would recreate the context, get a reset return, recreate the context, get a reset return, ad infinitum
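One way to avoid reporting the same reset forever is to compare cumulative reset counters against a per-context baseline taken at context creation, and to move the baseline once a reset has been reported. This is a sketch of that idea only, not necessarily what the RadeonSI fix does; `reset_tracker` and its functions are hypothetical, and the two counters are assumed to behave like `reset_count` and `reset_count_ctx` from the proposed struct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for the driver's reset status values. */
enum reset_status { NO_RESET, GUILTY_CONTEXT, INNOCENT_CONTEXT };

/* Hypothetical per-context tracker. The counters are cumulative, so the
 * values seen at the last check serve as the baseline. */
struct reset_tracker {
    uint64_t seen_global; /* device-wide reset counter at last check */
    uint64_t seen_ctx;    /* per-context reset counter at last check */
};

/* Call at context creation: resets from before that are not our problem. */
void reset_tracker_init(struct reset_tracker *t, uint64_t global_now,
                        uint64_t ctx_now)
{
    assert(t != NULL);
    t->seen_global = global_now;
    t->seen_ctx = ctx_now;
}

/* Report each reset once, then move the baseline forward. */
enum reset_status reset_tracker_check(struct reset_tracker *t,
                                      uint64_t global_now, uint64_t ctx_now)
{
    enum reset_status status = NO_RESET;

    assert(t != NULL);
    if (ctx_now != t->seen_ctx)
        status = GUILTY_CONTEXT;
    else if (global_now != t->seen_global)
        status = INNOCENT_CONTEXT;

    t->seen_global = global_now;
    t->seen_ctx = ctx_now;
    return status;
}
```

A context created after seven device-wide resets starts clean, sees the eighth as innocent exactly once, and goes back to reporting no reset on the next check.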
What happens in practice?
For Iris (Intel), there is a reset check...
But it only notifies the guilty application that a reset happened
The guilty application then quits or recreates the context
The rest aren't notified and count on being lucky enough to still be alive
Some resets are more destructive than others...
What happens in practice?
Not much shared code, standardization, tests, validation...
What happens in practice?
Each vendor reacts differently to resets
My focus is on amdgpu
The state for discrete GPUs was that any kind of reset was unrecoverable
Just a black screen and an unresponsive machine. Access via ssh/tty sometimes worked
Pierre-Eric (AMD) and I improved this for the KDE compositor
radeonsi wasn't following the spec
More testing is needed for robustness
What happens in practice?
Other OSs have more control over the stack, so they can be more reliable
In particular on the compositor side, so it's easier to get a standard behavior
Good reporting of GPU hangs
Apart from reporting to userspace that the GPU was reset, it would be nice to tell what happened
Currently Mesa developers have a hard time figuring out what in the game caused the hang
Good reporting of GPU hangs
GPU hangs have two main sources:
Hardware settings (voltage, frequency)
Application errors (infinite loops)
There's no way to distinguish between them right now
Good reporting of GPU hangs
Ideally without overhead, so it can be enabled by default
I proposed AMDGPU_INFO_GUILTY_APP to capture data about the hung app (e.g. buffer in use)
These callbacks need to be platform specific
They read some registers to find which buffer was in use
AMD replied that we can't trust register values after a reset
We probably need some firmware support at this point
Good reporting of GPU hangs
RADV_DEBUG=hang isn't always effective: it changes the ordering of jobs
Challenge: when the GPU hangs, the hardware state can be a bit unreliable.
How to get the right info correctly?
Using the GPU in "debug" mode or inserting fences, barriers and extra information causes overhead
No easy way to deploy to all users
Roadmap for better GPU resets
Standardization:
How DRM reports GPU hangs to userspace
How the usermode driver deals with a hang and with the guilty application
What compositors should do after a hang
Better hang log
Show which buffer caused the hang
Dump hardware state reliably
devcoredump
Tests!
Links
https://lore.kernel.org/lkml/20230501185747.33519-1-andrealmeid@igalia.com/
https://lore.kernel.org/lkml/20230424014324.218531-1-andrealmeid@igalia.com/
https://lore.kernel.org/dri-devel/20230929092509.42042-1-andrealmeid@igalia.com/
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22290
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22253
Thanks!
igalia.com/jobs
Having fun with GPU resets in Linux – XDC 2023

  • 1. Having fun with GPU resets :D André Almeida XDC 2023 1
  • 3. Oh no, my GPU hanged! You are playing your game on Linux Something wrong is sent to the device ??? Game over, reboot your machine 3
  • 4. 4
  • 5. Modern GPUs are complex Really complex AMD Radeon RX 7900 XTX 96 Compute units, 384 texture units, 6 shader engines, 58 B transistors... Shaders are Turing Complete 5
  • 6. Modern GPUs are complex DCHUB HUBP (n) DPP (n) MPC OPTC DIO DCCG DMU AZ MMHUBBUB DWB (n) Global sync Pixel data Sideband signal Config. Bus SDP Monitor OPP dc_plane dc_stream dc_state Code struct dc_link Floating point calculation bit-depth reduction/dither } Notes 6
  • 7. Modern GPUs are complex If you have an infinity loop in the CPU, it's not that bad CPU programs has virtual memory and virtual processor Things might be more barebone in the GPU But in a GPU, the display won't be able to update 7
  • 8. Detecting GPU hangs From the hardware to the application 8
  • 9. Detecting GPU hangs Device to Kernel Submit a job and wait until is done (halting problem) Check fences Or timeouts The kernel driver does a GPU reset This can be "soft" resets, one hw engine/context reset or full device reset Discrete amdgpu struggles with per-context resets More complete resets are more destructive Total lost of VRAM Now report to userspace 9
  • 10. Reporting GPU hangs Kernel to Mesa DRM has no API for that I915_GET_RESET_STATS AMDGPU_CTX_OP_QUERY_STATE2 MSM_PARAM_FAULTS Return -ERROR for ioctls Nothing for Raspberry or Vivante (?) It's not really hw specific But there's no common DRM context concept 10
  • 11. Reporting GPU hangs Mesa to application Vulkan Before submitting commands or wait operation, Mesa asks the kernel if the device is around Otherwise, it propagates VK_ERROR_DEVICE_LOST to the app Then the app (maybe) do something about it Recreate the context Or just exit VK_EXT_device_fault to query info about the reset 11
  • 12. Reporting GPU hangs Mesa to application OpenGL Before context creation and cs flush do a reset check If the app has GL_ARB_robustness support, it propagates an error for the app Otherwise: Some drivers just kill the application Other just block new calls 12
  • 13. Reporting GPU hangs: Mesa to application. APIs provide a way to tell apps that a reset happened: VK_ERROR_DEVICE_LOST, GL_ARB_robustness. Non-robust GL apps are just killed. But even robust apps can misbehave. Mesa/kernel can ban the fd if they keep resetting the GPU.
  • 14. Reporting GPU hangs: Mesa to application. Applications can then recreate the context. Usually that means: "Oh no, my submit command returned a reset error. The GPU state is corrupted, but the CPU is still good! Let's recreate all contexts, buffers, shaders, etc. Cool, let's go!" Bad apps may fall into a broken loop.
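
One way for an application to avoid the broken loop is to cap how many times it recovers before giving up. The sketch below assumes a hypothetical `app_ops` table with `submit` and `recreate_all` callbacks; none of these names come from Vulkan or OpenGL, and the fake GPU exists only to exercise the loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum submit_status { SUBMIT_OK, SUBMIT_RESET };

/* Hypothetical application hooks. */
struct app_ops {
    enum submit_status (*submit)(void *state);
    bool (*recreate_all)(void *state);   /* contexts, buffers, shaders, etc. */
    void *state;
};

/* Returns true once a submission succeeds, false if the app should give up.
 * The cap on recoveries is what keeps a buggy app out of an endless
 * reset -> recreate -> reset cycle. */
static bool submit_with_recovery(const struct app_ops *ops, int max_recoveries)
{
    assert(ops != NULL && ops->submit != NULL && ops->recreate_all != NULL);
    for (int attempt = 0; ; attempt++) {
        if (ops->submit(ops->state) == SUBMIT_OK)
            return true;
        if (attempt >= max_recoveries)
            return false;   /* still resetting: stop and exit cleanly */
        if (!ops->recreate_all(ops->state))
            return false;   /* recovery itself failed */
    }
}

/* Test double: reports a reset for the first `resets_left` submissions. */
struct fake_gpu {
    int resets_left;
    int recreations;
};

static enum submit_status fake_submit(void *state)
{
    struct fake_gpu *gpu = state;
    if (gpu->resets_left > 0) {
        gpu->resets_left--;
        return SUBMIT_RESET;
    }
    return SUBMIT_OK;
}

static bool fake_recreate(void *state)
{
    struct fake_gpu *gpu = state;
    gpu->recreations++;
    return true;
}
```

With a cap of 3, an app that hits two resets recovers twice and continues, while an app whose own shader keeps hanging the GPU stops after three recreations instead of looping forever.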
  • 15. #version 330 core out vec4 FragColor; void main() { for (float x = 0.01; x < 1; x--) { FragColor = vec4(x * 1.0f, 1.0f, 1.0f, 1.0f); } }
  • 16. What can we do here? DRM <-> Mesa is not really hw specific. How about we have a DRM_GET_RESET_STATE? "drm/doc: Document DRM device reset expectations": DRM documentation explaining what DRM drivers and Mesa should do when a reset happens.
  • 17. DRM_IOCTL_GET_RESET struct drm_get_reset { /** Context ID to query resets (in) */ __u32 ctx_id; // no global context ID... /** Flags (out) */ __u32 flags; /** Global reset counter for this card (out) */ __u64 reset_count; /** Reset counter for this context (out) */ __u64 reset_count_ctx; };
  • 18. drmGetReset(ctx_id, &reset); if (reset.reset_count_ctx) return PIPE_GUILTY_CONTEXT; if (reset.reset_count) return PIPE_INNOCENT_CONTEXT;
  • 19. What happens in practice? There was no reset check in RADV: + device->vk.check_status = radv_check_status;
  • 20. What happens in practice? For RadeonSI, there was a reset check... But the driver would return GL_CONTEXT_RESET_ARB forever. So the app would recreate the context, get a reset return, recreate the context, get a reset return, ad infinitum.
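
The "reset forever" problem comes from comparing against an absolute counter rather than against what the context has already seen. A minimal sketch of one way around it, assuming a hypothetical per-context snapshot of a device-wide reset counter (this is an illustration, not the actual RadeonSI fix):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-context state in a userspace driver. */
struct context {
    uint64_t seen_reset_count;   /* device reset counter at context creation */
};

/* Snapshot the device-wide counter when the context is created. */
static void context_init(struct context *ctx, uint64_t device_reset_count)
{
    assert(ctx != NULL);
    ctx->seen_reset_count = device_reset_count;
}

/* True only if a reset happened after this context was created, so a
 * freshly recreated context starts out clean. */
static bool context_was_reset(const struct context *ctx, uint64_t device_reset_count)
{
    assert(ctx != NULL);
    return device_reset_count != ctx->seen_reset_count;
}
```

The old context keeps reporting the reset, which is what a robust app expects, while the replacement context created after the reset reports none until the counter moves again.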
  • 21. What happens in practice? For Iris (Intel), there is a reset check... But it only notifies the guilty application that a reset happened. The guilty application then quits or recreates the context. The rest aren't notified and count on being lucky to still be alive. Some resets are more destructive than others...
  • 22. What happens in practice? Not much shared code, standardization, tests, validation...
  • 23. What happens in practice? Each vendor reacts differently to resets. My focus is on amdgpu. The state for discrete was that any kind of reset would be unrecoverable: just a black screen and no response. Access via ssh/tty sometimes worked. Pierre-Eric (AMD) and improved this for the KDE compositor. radeonsi wasn't following the spec. More testing is needed for robustness.
  • 24. What happens in practice? Other OSs have more control over the stack, so they can be more reliable, in particular on the compositor side, so it's easier to get a standard behavior.
  • 25. Good reporting of GPU hangs. Apart from reporting to userspace that the GPU was reset, it would be nice to tell what happened. Currently Mesa developers have a hard time figuring out what in the game caused the hang.
  • 26. Good reporting of GPU hangs. GPU hangs have two main sources: hardware settings (voltage, frequency) and application errors (infinite loops). There's no way to distinguish these right now.
  • 27. Good reporting of GPU hangs. Ideally without overhead, so it can be enabled by default. I proposed AMDGPU_INFO_GUILTY_APP to capture data about the hung app (e.g. buffer in use). These callbacks need to be platform specific. It reads some registers about which buffer was in use. AMD replied that we can't trust register values after a reset. We probably need some firmware support at this point.
  • 28. Good reporting of GPU hangs. RADV_DEBUG=hang isn't always effective: it changes the ordering of jobs. Challenge: when the GPU hangs, the hardware state can be a bit unreliable. How to get the right info correctly? Using the GPU in "debug" mode or inserting fences, barriers and extra information causes overhead. No easy way to deploy to all users.
  • 29. Roadmap for better GPU resets. Standardization: of how DRM reports GPU hangs to userspace; of how the usermode driver deals with a hang and with the guilty application; of what compositors should do after a hang. Better hang logs: show which buffer caused the hang; dump hardware state reliably; devcoredump. Tests!