A Robust and Flexible Operating System Compatibility Architecture

A Robust and Flexible Operating System
Compatibility Architecture
Takahiro Shinagawa Shinichi HonidenYuichi NishiwakiTakaya Saeki*
* Mr. Saeki is currently at Microsoft Development Co., Ltd.
The 16th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE 2020, March 17, 2020)

Running Applications on Another OS
• Useful in various cases
(a) Running commercial binary applications
• Only supplied in the binary format of a specific OS
(b) Using a build environment with several toolchains and libraries
• E.g., building Linux applications on macOS
A Robust and Flexible Operating System Compatibility Architecture (VEE 2020, March 17, 2020) 2
OS1 OS2
Application
Binary
for OS1
Application
Binary
for OS1
(a) Running commercial binary applications (b) Building Linux applications on macOS
macOS
Build environment
with toolchains and libraries

Existing Approaches
A) Porting applications
👍 No runtime overhead
👎 Much porting effort (by developers and users)
B) Running applications with the OS in a virtual machine (VM)
👍 No porting effort, strong isolation, …
👎 Challenges in resource sharing due to the existence of two OS kernels
OS1 OS2
Application
Binary
for OS1
Application
Binary
for OS2
much
effort
Guest OS
Host OS
Guest
Application
A) Porting applications B) Running apps and OS in a VM
challenges in seamless
resource sharing

Using OS Compatibility Layers
• No porting effort
• Absorb the differences between the guest and host interfaces
• Seamless resource management
• The host OS manages both the guest and host resources in the same way
• Guest and host applications can communicate seamlessly via host system calls
Host OS
Guest Application
Binary
OS Compatibility Layer ← converting system calls from guest to host
Host ApplicationsHost ApplicationsHost Applications
← managing resources of guest and host apps

Kernel-space v.s. User-space Implementations
• Kernel space
👍 Flexibility to achieve binary compatibility
• System calls and memory management can be easily handled
👎 Vulnerability against bugs in OS compatibility layers
• A bug could lead to system crashes
• User space
👍 Robustness against bugs
• Bugs do not affect the OS stability
👎 Inflexibility to achieve full compatibility
• E.g., copy-on-write not implemented in Cygwin
Host OS Kernel
Guest Application Binary
OS Compatibility Layer
Host OS Kernel
Guest
Application
Binary
OS
Compat.
Layer

Proposed OS Compatibility Architecture
• Running each guest process in a VM (without its OS kernel)
👍 Robustness
• Most part of OS compatibility layers can be implemented in user space
• Bugs do not cause kernel crashes
👍 Flexibility
• Hardware virtualization technology provides low-layer event handling functionalities
• E.g., trapping system calls and page faults, manipulating page tables, …
Host OS Kernel
OS Compatibility Layer
VMHost Process
Standardized Virtualization Interface
Guest Application Process
CPUHardware Virtualization Function

Overall Design
• Three main components
• Guest VM
• VMM module
• Monitor process
monitor process guest process
Guest VMs
kernel
emulate
system calls
User
Space
Kernel
space
trap
system calls &
exceptions
no kernel
upcall
monitor
VMM module
manage
VMs
Host OS

Guest VM
• An instance of the guest execution environment
• Run on CPU directly
• Without overhead
• Provide a virtual address space
• With a host-managed page table
• Trap events
• E.g., system calls, page faults, …
Assume hardware-assisted virtualization
• E.g., Intel VT and AMD SVM
Guest VMs
kernel
emulate
system calls
User
Space
Kernel
space
trap
system calls &
page faults
no kernel
upcall
monitor
VMM module
manage
VMs
Host OS

VMM module
• A small driver of hardware-assisted virtualization
• Provide API to manage VMs
• Create and destroy VMs and vCPUs
• Read and write VM states
• Manipulate page tables
• Manage control transfer
• Assume its own robustness
• Small size
• No kernel dependency
• Widely available
• E.g., Apple Hypervisor.framework, Windows Hypervisor Platform, KVM, …
Guest VMs
kernel
emulate
system calls
User
Space
Kernel
space
trap
system calls &
page faults
no kernel
upcall
monitor
VMM module
manage
VMs
Host OS

Monitor Process
• The main component to implement OS compatibility functions
• Prepare execution environments
• Create a VM and setup the page table
• Load the program
• Emulate guest system calls
• With host system calls
• Clean up on exit
• Free allocated resources
• Destroy the VM
• Terminate itself
Guest VMs
kernel
emulate
system calls
User
Space
Kernel
space
trap
system calls &
page faults
no kernel
upcall
monitor
VMM module
manage
VMs
Host OS

Advantages of Our Architecture
• Robustness
• User-space implementation: bugs of the monitor process will not cause crashes
• Flexibility
• Compatibility: achieve full binary compatibility
• Performance: copy-on-write can be supported
• Others (of OS Compatibility layers)
• Development cost
• Rich host OS functionalities: system calls, libraries, high-level languages, …
• Seamlessness
• Single kernel: share the same file system, IPCs between guest and host, process
scheduling, resource management, ...

Implementation
• Target Linux 4.6 of x86-64 (Intel VT-x)
• Noah: Linux compatibility layer for macOS
• Run on macOS 10.12 Sierra or higher
• Use Apple Hypervisor.framework as the VMM module
• NoahW: Linux compatibility layer for Windows (preliminary)
• Run on Windows 7 or higher
• Use Intel Hardware Accelerated Execution Manager as the VMM module

Boot and Load
(1) Create a VM
• Through the VMM module
(2) Setup the VM state
• Start in x86-64 long mode directly
• Trap SYSCALL instructions
• Use trampoline code in Windows
(3) Load the Linux ELF loader (ld.so)
• Using our original ELF loader
• Using the internal mmap() to map the ELF file
(4) Pass the control to the Linux ELF loader
guest program
(1) Create VM
VM state
Mode: longmode
IA32_EFER.SCE: 0
VMM module
(2) Setup VM state
ld.so
(3) Load the
ELF loader
monitor process
ELF loader

Memory Management
• Two page tables
(a) Guest page table in the VM
(b) Nested page table (EPT) in the VMM
• Fix (a) and modify (b)
• (a) is fixed to the straight mapping
• Virtual address = Physical address
• (b) can be manipulated with the API
• Provided by the VMM module
Limitation: GVA is up to 512 GiB
• 39-bit physical address in Intel CPU
• 48-bit virtual address
• Stack is moved to the lower address
• No kernel area
511 GiB
0
GVA GPA HPA
GVA: Guest Virtual Address
GPA: Guest Physical Address
HPA: Host Physical Address
512 GiB
Guest
page table
(fixed)
Nested
page table
(modified)
1-GiB guest system data area
(page tables, segment descriptors, …)

Process Management (fork)
• Noah (on macOS)
• Implement a subset of clone()
• Apple Hypervisor.framework does not support fork() with a VM
• Save and destroy the VM before fork()
• Restore the VM after fork()
• NoahW (on Windows)
• Implement fork() with copy-on-write using shared memory and virtualization
• Create a memory region shared among monitor processes
• Save, restore, and modify the VM states on fork()
• Trap page faults in the VMs to implement copy-on-write

Evaluation
• Performance
• Primitive benchmark: CPU cycles of the dup() system call on macOS
• Micro benchmark: lmbench-3.0-a9 on macOS
• Macro benchmark: Phoronix Test Suite v7.6.0 on macOS
• Performance comparison of OS compatibility layers on Windows
• Setup
• MacBook Pro (13-inch, 2017) for Noah
• Intel Core i7-7567U, 16GB DDR3 memory, 500GB SSD
• Ubuntu 16.04 on macOS High Sierra 10.13.3
• Surface Pro 2017 for NoahW
• Intel Core i7-7660U, 16GB DDR3 memory, 512GB SSD
• Windows 10 Professional Version 1709 and HAXM 6.2.1

Primitive Benchmark: dup() system call
270
3202520
11091
1330
7044
2770
2118
588
5504
2809
11091
251
297
0
5,000
10,000
15,000
20,000
25,000
30,000
35,000
40,000
macOS Windows
CycleNumber
VM enter
downcall
post-process
host syscall
pre-process
upcall
VM exit

Micro Benchmark: lmbench (processor, process)
410%
310%
175%
172%
13%
329%
239%
256%
-24%
46%
-100% 0% 100% 200% 300% 400%
null call
null I/O
stat
open clos
slct TCP
sig inst
sig hndl
fork proc
exec proc
sh proc

Micro Benchmark: lmbench (File & VM latency)
42%
5%
17%
4%
28%
-45%
-92%
8%
-100% -50% 0% 50%
0K Create
0K Delete
10K Create
10K Delete
Mmap Latency
Prot Fault
Page Fault
100fd selct

Macro Benchmark (Phoronix Test Suite + α)
16%
-23%
-4%
50%
-58%
9%
-200% -100% 0% 100% 200%
Linux kernel build
unpack-linux
postmark
sqlite
openssl
compress-7zip

Comparison of OS Compatibility Layers
Benchmark NoahW Cygwin WSL1
dup2() [call per second] 36,723 556,453 693,309
write() [call per second] 0.30 0.56 0.57
fork() (0 MiB array) [ms] 106.4 219.4 2.06
fork() (512 MiB array) [ms] 338.9 789.9 32.51
fork() (1 GiB array) [ms] 458.4 1531.8 62.66

Conclusion
• Proposed a novel OS compatibility architecture
• Exploited the OS-standard virtualization technology support
• Achieved both robustness and flexibility
• Consist of three components
• VMs to run guest processes
• The VMM module to provide API for hardware virtualization technology
• Monitor processes to implement OS compatibility functions
• Run Linux binary on macOS and Windows (preliminary)
• Noah implemented 172 out of 329 Linux system calls
• The overhead of Linux kernel build time on Noah was 16%

Acknowledgement and Availability
• This work was partially supported by Exploratory IT Human
Resources Project (MITOU Program) of Information technology
Promotion Agency, Japan (IPA) in the fiscal 2016 and JSPS
KAKENHI
• Noah is publicly available from https://github.com/linux-noah/noah
under the MIT / GPL dual licenses

A Robust and Flexible Operating System Compatibility Architecture

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to A Robust and Flexible Operating System Compatibility Architecture

Similar to A Robust and Flexible Operating System Compatibility Architecture (20)

More from Shinagawa Laboratory, The University of Tokyo

More from Shinagawa Laboratory, The University of Tokyo (11)

Recently uploaded

Recently uploaded (20)

A Robust and Flexible Operating System Compatibility Architecture

Editor's Notes