Xen Project is a static partitioning hypervisor for embedded deployments (industrial, medical, etc.). Xen enforces strong isolation between domains so that one cannot affect the execution of another. Features such as cache coloring reduce interference and improve interrupt latency and determinism. A real-time workload can run alongside a more complex guest. But can it be used in safety-critical environments? The Xen hypervisor has a microkernel design: services and tools are non-essential and run in unprivileged VMs, while the core is less than 50K LOC. This architecture lends itself well to safety-critical applications, as only the core is critical and needs to go through the certification process. This presentation describes the activities of the Xen FuSa SIG (Special Interest Group) to make Xen easier to safety-certify. It goes through the aspects of Xen that pertain to safety and explains how to set up a mixed-criticality system with Xen. The talk discusses the challenges of making an Open Source project safety-certifiable and the progress that the Xen community has made so far in the areas of documentation and requirements, MISRA C code compliance, and interference reduction.
Xen in Safety-Critical Systems - Critical Summit 2022
1. Xen in Safety-Critical Systems
Stefano Stabellini & Bertrand Marquis
Critical Summit NA 2022
2. What is safety?
• “Safety critical embedded software applications are developed for systems whose
failures contribute to hazards in the system for safety of life”
• Safety certifications (ISO 26262)
• Strict coding guidelines (MISRA C)
• Strict testing and documentation requirements
3. Why Xen matters for safety
• It is common to have a mix of critical and non-critical
components (mixed-criticality)
• Xen has been enabling mixed-criticality workloads for years
• Componentization
• Highly secure environments
• Isolate critical apps from non-critical apps
• Separate the work environment from the personal environment on a laptop,
e.g. Qubes OS and OpenXT
• Xen as static partitioning tool for embedded
• from industrial to medical and automotive
• the real-time (critical) domain must be isolated from the non-critical domains
• the real-time domain cannot miss any deadlines
• Safety-critical systems are mixed-criticality systems
[Diagram: Xen hosting one critical domain and two non-critical domains]
4. Why Xen is a good fit for safety
• Small codebase (less than 50K LOC)
• Micro-kernel architecture
• only the hypervisor requires privileges
• dom0 is now optional
• Xen supports disaggregation and driver domains: large amounts of code run unprivileged
• No need for a “dom0” privileged environment
• Linux is not required
• Supports real-time and cache coloring
• Thorough review process
• Thorough security process
[Diagram: Zephyr and Linux running on Xen]
5. Example Industrial
• Xen static partitioning configuration with
dom0less
• 2 domains with hardware directly assigned
• no dom0
• 1 Linux VM, networking and cloud
• 1 Zephyr VM, motor controller application
with real-time requirements
[Diagram: Xen hosting a Linux domain (networking, cloud APIs) and a Zephyr domain (real-time controller)]
6. Example Automotive
• Xen static partitioning configuration with dom0less and also dom0 for monitoring
• 4 domains created statically at boot time
• 1 minimal dom0 VM (Zephyr) for system monitoring
• 1 Linux VM, infotainment
• 1 Zephyr VM, real-time sensor processing
• 1 Instrument Cluster VM
[Diagram: Xen hosting a Zephyr mini-Dom0, a Linux infotainment domain, an Instrument Cluster domain, and a Zephyr real-time sensors domain]
7. Real-time and Xen
• What is real-time?
• Real-time does not mean fast
• "I will answer in 5 ms on average" is not real-time
• Real-time is a guarantee on the maximum time to respond
• "I will respond to an event in no more than 100 ms"
• Why does it matter in a safety context?
• Safety usually implies time constraints
• If I detect a wall, I must stop before the wall
• How long does it take to apply the brakes?
• If longer ….
[Chart: response-time histogram for "My system"; buckets: < 1 ms, 1-5 ms, 5-95 ms, 95-100 ms, > 100 ms; value axis from 0 to 80]
8.
9. Real-time and Xen
• In safety software: Worst Case Execution Time (WCET)
• By demonstration, not by test
• Usually, the WCET is a case impossible to trigger by test
• For Xen several subjects are being investigated
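The difference between an average response time and a guaranteed maximum can be illustrated with a small sketch. The sample values below are invented for illustration and are not measurements from Xen. Note that a check like this can only disprove a real-time claim; as the slide says, a WCET has to be demonstrated by analysis, not by test.

```python
# Hypothetical response times in milliseconds (illustrative values only).
samples_ms = [1.0] * 98 + [2.0, 300.0]

DEADLINE_MS = 100.0  # the "no more than 100 ms" bound used on the slide

average_ms = sum(samples_ms) / len(samples_ms)
worst_ms = max(samples_ms)


def meets_deadline(samples, deadline):
    """Real-time means every response is within the deadline, not the average."""
    return all(s <= deadline for s in samples)


print(f"average = {average_ms:.1f} ms, worst observed = {worst_ms:.1f} ms")
print("meets deadline:", meets_deadline(samples_ms, DEADLINE_MS))
```

The average here is well under 5 ms, yet a single slow response breaks the 100 ms guarantee, so this hypothetical system is fast but not real-time.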
10. Real-time - Interrupts
• Interrupt latency
• Maximum time until a guest receives an IRQ
• Depends on the time required by the guest
• Time depends on the hardware
• Context of the analysis
• Arm64
• Guest alone on its core
• Zephyr as guest
• Timer interrupt
• Procedure
• Code analysis/inspection
• Confirm with tracing on a real target
[Diagram: a timer IRQ arrives in Xen, which forwards it to the timer IRQ handler in Zephyr]
11. Real-time - Interrupts
• Overall result: 1090 instructions
• Save guest context (cpu and irq controller): 356
• Xen irq handler: 144
• Xen virtual timer: 360
• Xen exit irq handler: 97
• Restore guest: 133
• Assumptions and limitations
• No hypercall from real-time guest
• No interaction with guests on other cores
• After Xen init phase
• Fixed configuration (guest, communication, …etc)
[Diagram: timer IRQ path through Xen: enter IRQ handler, save guest context, virtual timer, restore guest context, exit IRQ handler, then the timer IRQ handler in Zephyr]
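The per-stage instruction counts above add up to the overall figure, and a count like this can be turned into a rough time budget. The sketch below does both. The clock frequency and cycles-per-instruction values are assumptions chosen for illustration; they do not come from the talk, and a real bound depends on the hardware (caches, memory latency, pipeline).

```python
# Instruction counts per stage, as listed on the slide.
stages = {
    "save guest context (cpu and irq controller)": 356,
    "Xen irq handler": 144,
    "Xen virtual timer": 360,
    "Xen exit irq handler": 97,
    "restore guest": 133,
}

total_instructions = sum(stages.values())

# ASSUMPTIONS (hypothetical, not from the talk): core clock and a
# pessimistic average number of cycles per instruction.
ASSUMED_CLOCK_HZ = 1_000_000_000
ASSUMED_CYCLES_PER_INSTRUCTION = 2


def latency_budget_us(instructions, clock_hz, cpi):
    """Convert an instruction count into microseconds under the given assumptions."""
    return instructions * cpi / clock_hz * 1e6


budget_us = latency_budget_us(
    total_instructions, ASSUMED_CLOCK_HZ, ASSUMED_CYCLES_PER_INSTRUCTION
)
print(f"{total_instructions} instructions, budget {budget_us:.2f} us (assumed values)")
```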
12. Real-time - Interrupts
• Issues and future work
• IPI interrupts and Xen RCU tasks
• Limited to guest isolated on its own core
• WFI/WFE handling disabled (impact on power consumption)
• No PV driver
• No hypercalls in real-time guest
• Status: Full analysis is public, link
13. Real-time – MPU support
• MMU hard to use for real-time
• TLB miss -> page table walk
• Influence of other cores (TLB sync)
• Influence of other guests (TLB miss)
[Diagram: MMU translating VA to PA through the TLB and the L1/L2/L3 page-table levels]
14. Real-time – MPU support
[Diagram: MPU mapping VA to PA directly from its region registers (REGs)]
• MPU
• No translation (1 to 1 mapping)
• Register based (no page tables)
• No cache effect
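To make the MPU properties concrete, here is a minimal software model of a register-based region check. It is a sketch only: the `Region` type, the region table, and `mpu_access` are hypothetical names, and real hardware compares all regions in parallel rather than in a loop. What it shows is that the check is bounded by a small fixed number of regions, involves no page-table walk, and returns the address unchanged.

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    base: int
    limit: int  # inclusive upper bound
    writable: bool


# Hypothetical region table; an MPU holds a small, fixed set of these in registers.
REGIONS = [
    Region(0x0000_0000, 0x0FFF_FFFF, False),  # code, read-only
    Region(0x2000_0000, 0x2FFF_FFFF, True),   # data, read-write
]


def mpu_access(addr: int, write: bool) -> Optional[int]:
    """Return the physical address (identical to addr) or None on a fault."""
    for region in REGIONS:  # bounded: one range comparison per region
        if region.base <= addr <= region.limit:
            if write and not region.writable:
                return None  # permission fault
            return addr  # 1-to-1 mapping, no translation
    return None  # no region covers this address


print(hex(mpu_access(0x2000_1000, write=True)))
print(mpu_access(0x0000_1000, write=True))
```

Because the outcome depends only on the region registers, the cost of a check does not vary with what other cores or guests did earlier, which is the property that matters for worst-case timing.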
15. Real-time and Xen – MPU support
• Arm Cortex-R (Cortex-R82)
• Both MMU and MPU
• EL2 (Xen): MPU
• For Xen itself
• Controls the memory each guest is allowed to access
• EL1 (RTOS): MPU
• Real time
• EL1 (Linux): MMU
• Not real-time
• Cohabitation of MPU and MMU guests
• Xen and RTOS real-time
• Linux or other non-real-time OS running on same system
• Status: Proof of concept available
• Upstreaming to Xen is ongoing
[Diagram: Linux and Zephyr running on Xen]
18. Real-time – Cache coloring
• CPU clusters often share an L2 cache
• Interference via the L2 cache affects performance
• App0 running on CPU0 can cause cache entry evictions, which
affect App1 running on CPU1
• App1 running on CPU1 could miss a deadline due to App0’s
behavior
• It can happen between apps running on the same OS and between
VMs on the same hypervisor
• Hypervisor Solution: Cache Partitioning,
AKA Cache Coloring
• Each VM gets its own allocation of cache entries
• No shared cache entries between VMs
• Allows real-time apps to run with deterministic IRQ latency
• 3 µs IRQ latency
[Diagram: Core 1 to Core 4 sharing an L2 cache divided into partitions labeled D, D, R]
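The idea behind cache coloring can be sketched with the generic page-coloring arithmetic: pages whose addresses map to the same cache sets share a "color", and giving each VM a disjoint set of colors keeps their cache entries apart. The cache geometry and the color assignment below are assumptions for illustration (a 1 MiB, 16-way L2 with 4 KiB pages), and the function names are hypothetical; this is not a description of Xen's implementation.

```python
PAGE_SIZE = 4096  # bytes


def num_colors(cache_size, ways, page_size=PAGE_SIZE):
    """Number of page colors: the size of one cache way divided by the page size."""
    return cache_size // (ways * page_size)


def page_color(phys_addr, colors, page_size=PAGE_SIZE):
    """Color of the page containing phys_addr."""
    return (phys_addr // page_size) % colors


# ASSUMED cache geometry (hypothetical): 1 MiB, 16-way set-associative L2.
L2_SIZE = 1 << 20
L2_WAYS = 16
COLORS = num_colors(L2_SIZE, L2_WAYS)

# Hypothetical static assignment: disjoint color sets per domain.
ASSIGNMENT = {"realtime": range(0, 4), "linux": range(4, COLORS)}


def owner_of(phys_addr):
    """Return the domain whose color set contains this page, if any."""
    color = page_color(phys_addr, COLORS)
    for domain, colors in ASSIGNMENT.items():
        if color in colors:
            return domain
    return None


print(COLORS, owner_of(0x3000_0000), owner_of(0x3000_4000))
```

With disjoint color sets, a page owned by one domain can never evict a cache line belonging to the other, at the cost of each domain seeing only its share of the cache.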
19. Static configuration with Xen
• What is it?
• Defining the whole system (guests and communication) statically
• Why does it matter in a safety context?
• No random behaviour
• Same after reboot (target, task, guest, …etc)
• Example: application on same core at same address in memory
• Reduce amount of testing
• Limit possibilities
• Example: only used functions, compile out the rest
• No dynamic behaviour
• Limit complexity
• Example: allocation on boot or static, no free
• Conclusion: reduce certification costs
20. Static configuration – dom0less
• Define the system during design phase
• How many guests
• What characteristics
• memory, device access, CPUs
• Create them directly on boot
• Defined in configuration (device-tree)
• Advantages for safety
• No need for a complex dom0
• No dependency on Linux (not certifiable)
• Faster boot
• Guests start directly on boot
• Reduce system complexity
• No dynamic guest creation
• Status: available
[Diagram: Linux and Zephyr running on Xen]
Device tree:
domU1 {
    #address-cells = <1>;
    #size-cells = <1>;
    compatible = "xen,domain";
    memory = <0 0x20000>;
    cpus = <1>;
    vpl011;

    module@2000000 {
        compatible = "multiboot,kernel", "multiboot,module";
        reg = <0x2000000 0xffffff>;
        bootargs = "console=ttyAMA0";
    };
};
21. Static configuration – memory
• Define the address and size of memory regions
• For guest memory
• For Xen heap
• Internal Xen allocation
• For Xen guest heap
• Xen allocation related to a guest
• Defined in configuration (device-tree)
• Advantages for safety
• System or guest identical upon reboot
• Reduces possible interference
• A guest cannot impact another
• Adding a guest in the future
• Current guests unchanged
• Incremental certification
• Status: available in next Xen release
[Diagram: Linux and Zephyr running on Xen, each with its own statically assigned memory region]
Device tree:
domU1 {
    compatible = "xen,domain";
    #address-cells = <0x2>;
    #size-cells = <0x2>;
    cpus = <2>;
    memory = <0x0 0x80000>;
    #xen,static-mem-address-cells = <0x1>;
    #xen,static-mem-size-cells = <0x1>;
    xen,static-mem = <0x30000000 0x20000000>;
    ...
};
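The two memory properties in the example above should describe the same amount of RAM. The sketch below checks that, assuming (my reading of the Xen dom0less bindings, not stated on the slide) that `memory` is a 64-bit value in KiB and that `xen,static-mem` is an address/size pair in bytes with one cell each. The helper `cells_to_int` is a hypothetical name.

```python
def cells_to_int(cells):
    """Combine 32-bit device-tree cells (most significant first) into one integer."""
    value = 0
    for cell in cells:
        value = (value << 32) | cell
    return value


# memory = <0x0 0x80000>;  ASSUMED to be a 64-bit amount in KiB.
memory_bytes = cells_to_int([0x0, 0x80000]) * 1024

# xen,static-mem = <0x30000000 0x20000000>;  one address cell, one size cell.
static_base = cells_to_int([0x30000000])
static_size = cells_to_int([0x20000000])
static_end = static_base + static_size  # first address after the region

print(f"memory      = {memory_bytes // (1 << 20)} MiB")
print(f"static-mem  = {static_size // (1 << 20)} MiB at {static_base:#x}..{static_end:#x}")
print("consistent:", memory_bytes == static_size)
```

A check like this is cheap to run at build time, which fits the goal of a configuration that is fully known before boot.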
22. Static configuration – communication
[Diagram: Linux and Zephyr running on Xen, connected through a shared memory region (SHM) and an event channel]
Device tree:
shared-mem@10000000 {
    compatible = "xen,domain-shared-memory-v1";
    role = "owner";
    xen,shm-id = <0x0>;
    xen,shared-mem = <0x10000000 0x10000000 0x10000000>;
};
domU1 {
    ….
    domU1-shared-mem@10000000 {
        compatible = "xen,domain-shared-memory-v1";
        role = "borrower";
        xen,shm-id = <0x0>;
        xen,shared-mem = <0x10000000 0x10000000 0x50000000>;
    };
    ….
};
• Xenbus is too complex for safety
• Needs Dom0 or Linux
• Drivers are complex
• One guest accesses another guest's memory
• Several components, useful for safety and beyond
• Static shared memory
• Area of memory accessible by several guests
• Defined in configuration (device-tree)
• Static event channels
• A way for two guests to ping each other
• Defined in configuration (device-tree)
• Any protocol possible to build on top
• Status: Available in next Xen release
• Linux support
• Zephyr support
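As an illustration of "any protocol possible to build on top", here is a minimal one-way mailbox layered on a shared buffer plus a notification. It is an in-process simulation only: the `bytearray` stands in for the static shared memory region, the `notify` callback stands in for a static event channel, and the layout (sequence number, length, payload) is invented for this sketch. Real guests would also need memory barriers and cache maintenance appropriate to the platform.

```python
import struct

SHM_SIZE = 64
HEADER = struct.Struct("<II")  # sequence number, payload length


class Mailbox:
    """One-way mailbox in a shared buffer; `notify` models an event channel."""

    def __init__(self, shm, notify):
        self.shm = shm
        self.notify = notify

    def send(self, payload):
        if len(payload) > len(self.shm) - HEADER.size:
            raise ValueError("payload too large for the shared region")
        seq, _ = HEADER.unpack_from(self.shm, 0)
        # Write the payload first, then publish it by updating the header.
        self.shm[HEADER.size:HEADER.size + len(payload)] = payload
        HEADER.pack_into(self.shm, 0, seq + 1, len(payload))
        self.notify()  # "ping" the other guest


def receive(shm):
    """Read the most recently published payload."""
    _, length = HEADER.unpack_from(shm, 0)
    return bytes(shm[HEADER.size:HEADER.size + length])


shm = bytearray(SHM_SIZE)  # stands in for the shared memory region
events = []                # records notifications instead of raising an interrupt
box = Mailbox(shm, notify=lambda: events.append("ping"))
box.send(b"speed=42")
print(receive(shm), events)
```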
23. Static configuration - cpupools
• Define which core(s) are usable by which guest
• Xen cpupool: a pool with cores
• One or several cores
• Scheduler to use
• A guest can be assigned to a cpupool
• Several guests can be in one cpupool
• Scheduler independent between cpupools
• A core can only be assigned to one cpupool
• Defined in configuration (device-tree)
• Advantages for safety
• Static core assignment
• Isolation between guests
• Scheduler per cpupool
• Status: available in next Xen release
[Diagram: Linux1, Linux2, and Zephyr running on Xen, with cores grouped into Pool-0 and Pool-1]
Device tree:
cp0:cpupool0 {
    compatible = "xen,cpupool";
    cpupool-cpus = <&a72_0 &a72_1>;
    cpupool-sched = "credit2";
};
domU1 {
    #address-cells = <1>;
    #size-cells = <1>;
    compatible = "xen,domain";
    memory = <0 0x20000>;
    cpus = <1>;
    domain-cpupool = <&cp0>;
    …
};
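The rule that a core can only be assigned to one cpupool is easy to check on a static configuration before boot. The sketch below models cpupools as plain dictionaries; `validate_cpupools`, the second pool, and its core and scheduler names are hypothetical and only serve the example.

```python
def validate_cpupools(pools):
    """Return a core -> pool map, or raise if a core is in more than one pool."""
    owner = {}
    for pool, cfg in pools.items():
        for core in cfg["cpus"]:
            if core in owner:
                raise ValueError(f"{core} is in both {owner[core]} and {pool}")
            owner[core] = pool
    return owner


pools = {
    # Mirrors the cpupool0 node from the device tree above.
    "cpupool0": {"cpus": ["a72_0", "a72_1"], "sched": "credit2"},
    # Hypothetical second pool, added for illustration.
    "cpupool1": {"cpus": ["a53_0"], "sched": "null"},
}

core_owner = validate_cpupools(pools)
print(core_owner)
```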
24. Safety Certifications Activities
• Xen can already be safety-certified, but at what cost?
• It has been done already
• Safety experts have analyzed the code and deemed it safety-certifiable
• Requires significant downstream work
• GOAL: make it easier for users to deploy Xen in safety environments
• "safety-certifiable", not safety-certified
• users can fill the gaps
• we can be flexible: it is OK to decide not to follow certain rules
• let's focus on what we do best: robustness of the code
• Clarity: What does Xen support? What's missing?
• Users should be able to estimate precisely the work required
25. Code First
• Robustness and Safety of the code
• Code is Xen Project's primary output
• The most important item for safety-certifications is robustness/safety of the codebase
• Documentation, requirements, and tests can be more easily outsourced
• Main code safety aspects:
• Coding style and MISRA C rules
• Determinism: deterministic IRQ handling and memory allocations
• Enhanced Kconfig for a smaller codebase (less to certify)
• Why MISRA C?
• A de facto standard in all industry sectors
• Maintained and backed by an authoritative organization (MISRA consortium)
• A pragmatic approach and a perfect match for Xen: MISRA documents clearly state that code quality
should never be sacrificed for compliance (deviation process)
26. MISRA C: status
• MISRA C Tailoring completed: ~100 rules considered relevant for Xen
• MISRA C rule adoption in progress: ~15 of the ~100 rules
• Xen is actually already following many MISRA C rules, just not officially
• Add Xen Rules we already follow to CODING_STYLE
• Discuss the others
• Decide we follow a rule, add it to CODING_STYLE, check for it using MISRA C scanners
• Decide we follow a rule with deviations
• deviations are intentional and documented exceptions to the rule
• document deviations with in-code or out-of-code comments so that MISRA C
scanners will “ignore” them
• check for the rule automatically using MISRA C scanners
• Decide not to follow the rule and not to scan for it
• cannot be scanned automatically by static analyzers
27. MISRA C: status
• Benefits:
• Static code analyzers available to check for the rules, from ECLAIR to cppcheck
• Check individual patches in advance, before review even starts
• Ease code reviews and reduce maintainers' work
• Improve code quality
• Reduce defects
• Working with Roberto Bagnara and Bugseng ECLAIR
• Improve existing coding style and coding conventions in Xen
• Improve safety of the code
• Improve code security – defensive programming
• Widen compiler compatibility
• Ensure we do not violate the C99 standard
• Ensure we do not unknowingly use language extensions that may not be available in
other compilers
28. Tooling
• Cppcheck
• Available to any developer without a license
• Good for pre-submission checks
• Open Source
• Good coverage but not 100%
• ECLAIR
• 100% coverage of MISRA C:2012 with very high accuracy
• Automatically adapts to the toolchain to capture all implementation-defined aspects of the
language
• Very detailed and actionable reports
• Made publicly available by BUGSENG at http://eclairit.com