A look at the new enhancements to core storage in vSphere 6.5, including VMFS6, Automated UNMAP, I/O Filters, and much more, as delivered by Cormac Hogan and Cody Hosterman
2. Disclaimer
• This presentation may contain product features that are currently under development.
• This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.
• Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
• Technical feasibility and market demand will affect final delivery.
• Pricing and packaging for any new technologies or features discussed or presented have not been determined.
2#SER1143BU CONFIDENTIAL
7. vSphere 6.5 Scaling and Limits
Paths
• ESXi hosts now support up to 2000 paths
– Increase from the 1024 paths per host supported previously
Devices
• ESXi hosts now support up to 512 devices
– Increase from the 256 devices supported per host previously
– Multiple targets are required to address more than 256 devices
– This does not impact Virtual Volumes (aka VVols), which can address 16,383 VVols per PE (Protocol Endpoint)
8. vSphere 6.5 Scaling and Limits
• 512e Advanced Format Device Support
• Capacity limits are becoming an issue with the 512n (native) sector size currently used in disk drives
• New Advanced Format (AF) drives use a 4K native sector size for higher capacity
• These 4Kn devices are not yet supported on vSphere
• For legacy applications and operating systems that cannot support 4Kn drives, new 4K sector
size drives that run in 512-byte emulation (512e) mode are now available
– These drives have a physical sector size of 4K but a logical sector size of 512 bytes
• These drives are now supported on vSphere 6.5 for VMFS and RDM (Raw Device Mappings)
10. DSNRO
The setting “Disk.SchedNumReqOutstanding” (aka “No of outstanding IOs with
competing worlds”) has changed behavior
• DSNRO can be set to a maximum of:
– 6.0 and earlier: 256
– 6.5 and on: Whatever the HBA Device Queue Depth Limit is
• Allows for extreme levels of performance
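The relationship behind this change (as described in the speaker notes: DQLEN is the minimum of DSNRO and the HBA device queue depth limit) can be sketched as follows. This is an illustration only; the function name is not a vSphere API.

```python
# Illustrative sketch: the effective per-device queue depth (DQLEN) is the
# minimum of DSNRO and the HBA device queue depth limit, so capping DSNRO
# at 256 (pre-6.5) could throttle devices whose HBA limit was higher.
def effective_dqlen(dsnro: int, hba_queue_depth: int) -> int:
    return min(dsnro, hba_queue_depth)

# Pre-6.5: DSNRO capped at 256, so a 1024-deep HBA device is limited to 256.
assert effective_dqlen(256, 1024) == 256
# 6.5 and on: DSNRO may be raised up to the HBA limit itself.
assert effective_dqlen(1024, 1024) == 1024
```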
12. VMFS-6: On-disk Format Changes
• File System Resource Management - File Block Format
• VMFS-6 has two new “internal” block sizes, small file block (SFB) and large file block (LFB)
– The SFB size is set to 1MB; the LFB size is set to 512MB
– These are internal concepts for “files” only; the VMFS block size is still 1MB
• Thin disks created on VMFS-6 are initially backed with SFBs
• Thick disks created on VMFS-6 are allocated LFBs as much as possible
– For the portion of the thick disk which does not fit into an LFB, SFBs are allocated
• These enhancements should result in much faster file creation times
– Especially true with swap file creation so long as the swap file can be created with all LFBs
– Swap files are always thickly provisioned
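A rough sketch of the allocation rule above, assuming the stated SFB (1MB) and LFB (512MB) sizes. This illustrates the concept only; it is not VMware's allocator.

```python
SFB_MB = 1      # small file block size (stated above)
LFB_MB = 512    # large file block size (stated above)

def thick_allocation(size_mb: int) -> tuple:
    """Return (lfb_count, sfb_count) for a thick disk of size_mb megabytes:
    as many LFBs as possible, then SFBs for the portion that does not fit."""
    lfbs, remainder = divmod(size_mb, LFB_MB)
    sfbs = -(-remainder // SFB_MB)  # ceiling division for the tail
    return (lfbs, sfbs)

# A 1300 MB thick disk: 2 LFBs (1024 MB) plus 276 SFBs for the remainder.
assert thick_allocation(1300) == (2, 276)
# An exact multiple of the LFB size needs no SFBs at all.
assert thick_allocation(512) == (1, 0)
```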
13. VMFS-6: On-disk Format Changes
• Dynamic System Resource Files
• System resource files (.fdc.sf, .pbc.sf, .sbc.sf, .jbc.sf) are now extended dynamically
for VMFS-6
– Previously these were static in size
– These may show a much smaller size initially, when compared to previous versions of
VMFS, but they will grow over time
• If the filesystem exhausts any resources, the respective system resource file is
extended to create additional resources
• VMFS-6 can now support millions of files / pointer blocks / sub blocks (as long as
volume has free space)
15. vmkfstools – 500GB VMFS-6 volume
In VMFS-6, Sub Blocks are used for Pointer Blocks, which is why the Ptr Blocks max is shown as 0 here. (Large File Blocks are not displayed in the vmkfstools output.)
16. VMFS-6: On-disk Format Changes
• File System Resource Management - Journaling
• VMFS is a distributed journaling filesystem
• Journals are used on VMFS when performing metadata updates on the filesystem
• Previous versions of VMFS used regular file blocks as journal resource blocks
• In VMFS-6, journal blocks are tracked in a separate system resource file called .jbc.sf
• This was introduced to address VMFS journal-related issues on previous versions of VMFS,
caused by the use of regular file blocks as journal blocks and vice versa
– E.g. full file system, see VMware KB article 1010931
17. New Journal System File Resource
(Figure: comparison of the system resource files on a VMFS-5 volume vs a VMFS-6 volume.)
18. VMFS-6: VM-based Block Allocation Affinity
• Resources for VMs (blocks, file descriptors, etc.) on earlier VMFS versions were
allocated on a per host basis (host-based block allocation affinity)
• Host contention issues arose when a VM/VMDK was created on one host, and then
vMotion was used to migrate the VM to another host
• If additional blocks were allocated to the VM/VMDK by the new host at the same time
as the original host tried to allocate blocks for a different VM in the same resource
group, the different hosts could contend for resource locks on the same resource
• This change introduces VM-based block allocation affinity, which will decrease
resource lock contention
19. VMFS-6: Parallelism/Concurrency Improvements
• Some of the biggest delays on VMFS were in device scanning and filesystem probing
• vSphere 6.5 has new, highly parallel, device discovery and filesystem probing
mechanisms
– Previous versions of VMFS only allowed one transaction at a time per host on a given
filesystem; VMFS-6 supports multiple, concurrent transactions at a time per host
• These improvements are significant for failover events, and Site Recovery Manager
(SRM) should especially benefit
• They were also required to support the higher limits on the number of devices and paths in vSphere 6.5
20. Hot Extend Support
• Prior to ESXi 6.5, VMDKs on a powered on VM
could only be grown if size was less than 2TB
• If the size of a VMDK was 2TB or larger, or the
expand operation caused it to exceed 2TB, the hot
extend operation would fail
• This required administrators to typically shut down
the virtual machine to expand it beyond 2TB
• The behavior has been changed in vSphere 6.5
and hot extend no longer has this limitation
This is a vSphere 6.5 improvement, not specific to VMFS-6.
This will also work on VMFS-5 volumes.
21. “Upgrading” to VMFS-6
• No direct ‘in-place’ upgrade of filesystem to VMFS-6 available.
New datastores only.
• Customers upgrading to vSphere 6.5 release should continue to
use VMFS-5 datastores (or older) until they can create new
VMFS-6 datastores
• Use migration techniques such as Storage vMotion to move
VMs from the old datastore to the new VMFS-6 datastore
25. ATS Miscompare Handling (1 of 3)
• The heartbeat region of VMFS is used
for on-disk locking
• Every host that uses the VMFS volume
has its own heartbeat region
• This region is updated by the host on
every heartbeat
• The region that is updated is the time
stamp, which tells others that this host
is alive
• When the host is down, this region is
used to communicate lock state to
other hosts
26. ATS Miscompare Handling (2 of 3)
• In vSphere 5.5 U2, we started using ATS for maintaining the heartbeat
• ATS is the Atomic Test and Set primitive which is one of the VAAI primitives
• Prior to this release, we only used ATS when the heartbeat state changed
• For example, we would use ATS in the following cases:
– Acquire a heartbeat
– Clear a heartbeat
– Replay a heartbeat
– Reclaim a heartbeat
• We did not use ATS for maintaining the ‘liveness’ of a heartbeat
• This change to using ATS to maintain the ‘liveness’ of a heartbeat appears to have led to issues with
certain storage arrays
27. ATS Miscompare Handling (3 of 3)
• When an ATS Miscompare is received, all outstanding IO is aborted
• This led to additional stress and load being placed on the storage arrays
– In some cases, this led to the controllers crashing on the array
• In vSphere 6.5, there are new heuristics added so that when we get a miscompare event, we
retry the read and verify that there is a miscompare
• If the miscompare is real, then we do the same as before, i.e. abort outstanding I/O
• If the on-disk HB data has not changed, then this is a false miscompare
• In the event of a false miscompare:
– VMFS will not immediately abort IOs
– VMFS will re-attempt ATS HB after a short interval (usually less than 100ms)
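The heuristic above can be sketched as follows. This is an illustrative model of the described behavior, not VMware's code; the function and callback names are hypothetical.

```python
# Sketch of the vSphere 6.5 heuristic: on an ATS miscompare, re-read the
# on-disk heartbeat and only treat the miscompare as real if the on-disk
# data actually differs from what we expected.
def handle_ats_miscompare(read_heartbeat, expected_hb, abort_io, retry_ats):
    on_disk = read_heartbeat()
    if on_disk != expected_hb:
        abort_io()          # real miscompare: abort outstanding I/O as before
        return "real"
    retry_ats()             # false miscompare: re-attempt ATS HB after ~100ms
    return "false"

events = []
# On-disk data changed: a real miscompare, so I/O is aborted.
assert handle_ats_miscompare(lambda: b"A", b"B",
                             lambda: events.append("abort"),
                             lambda: events.append("retry")) == "real"
# On-disk data unchanged: a false miscompare, so the ATS is retried.
assert handle_ats_miscompare(lambda: b"A", b"A",
                             lambda: events.append("abort"),
                             lambda: events.append("retry")) == "false"
assert events == ["abort", "retry"]
```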
28. An Introduction to UNMAP
UNMAP via datastore
• VAAI UNMAP was introduced in vSphere 5.0
• Enables an ESXi host to inform the backing storage that files or VMs have been moved or
deleted from a Thin Provisioned VMFS datastore
• Allows the backing storage to reclaim the freed blocks
• There was no way of doing this previously, resulting in stranded space on Thin Provisioned
VMFS datastores
29. Automated UNMAP in vSphere 6.5
Introducing Automated UNMAP Space Reclamation
• In vSphere 6.5, there is now an automated UNMAP crawler mechanism for
reclaiming dead or stranded space on VMFS datastores
• Now UNMAP will run continuously in the background
• UNMAP granularity on the storage array
– The granularity of the reclaim is set to 1MB chunk
– Automatic UNMAP is not supported on arrays with UNMAP granularity greater
than 1MB
– Auto UNMAP feature support is footnoted in the VMware Hardware
Compatibility Guide (HCL)
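A minimal sketch of the support rule above; the names are illustrative and the values are in MB.

```python
# Automatic UNMAP reclaims in 1MB chunks, so arrays whose UNMAP granularity
# exceeds 1MB are not supported for automatic reclamation.
VMFS_RECLAIM_GRANULARITY_MB = 1

def auto_unmap_supported(array_unmap_granularity_mb: float) -> bool:
    return array_unmap_granularity_mb <= VMFS_RECLAIM_GRANULARITY_MB

assert auto_unmap_supported(1)        # 1MB granularity: supported
assert not auto_unmap_supported(16)   # 16MB granularity: not supported
```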
30. Some Considerations with Automated UNMAP
• UNMAP is only issued to datastores that are VMFS-6 and have powered-on VMs
• Can take 12-24 hours to fully reclaim
• The default behavior is on, but it can be turned off per host (that host won’t
participate) …
– via the EnableVMFS6Unmap setting
• …or per datastore (no hosts will reclaim it)
31. An Introduction to Guest OS UNMAP
UNMAP via Guest OS
• In vSphere 6.0, additional improvements to UNMAP facilitate the reclaiming of
stranded space from within a Guest OS
• Effectively, this gives a Guest OS in a thinly provisioned VM the ability to tell the backing
storage which blocks are no longer in use
• This allows the backing storage to reclaim this capacity, and shrink the size of the VMDK
32. Some Considerations with Guest OS UNMAP
TRIM Handling
• UNMAP works at certain block boundaries on VMFS, whereas TRIM does not have
such restrictions
• While this should be fine on VMFS-6, which is now 4K aligned, certain TRIMs
converted into UNMAPs may fail due to block alignment issues on previous
versions of VMFS
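The alignment issue above can be illustrated with a sketch: a TRIM range converted to UNMAP has to be clipped to VMFS block boundaries, and an unaligned request may shrink to nothing. The boundary value and function are assumptions for illustration only.

```python
ALIGN = 1 << 20  # 1 MiB VMFS unmap boundary (assumed for illustration)

def aligned_unmap_range(offset: int, length: int):
    """Clip [offset, offset+length) to ALIGN boundaries; None if nothing is left."""
    start = -(-offset // ALIGN) * ALIGN          # round the start up
    end = (offset + length) // ALIGN * ALIGN     # round the end down
    return (start, end - start) if end > start else None

# A 4 MiB trim starting mid-block loses its unaligned head and tail:
assert aligned_unmap_range(ALIGN // 2, 4 * ALIGN) == (ALIGN, 3 * ALIGN)
# A sub-MiB trim that straddles no full block is dropped entirely:
assert aligned_unmap_range(ALIGN // 4, ALIGN // 2) is None
```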
Linux Guest OS SPC-4 support
• Initially in-guest UNMAP support to reclaim in-guest dead space natively was
limited to Windows 2012 R2
• Linux distributions check the SCSI version, and unless it is version 5 or greater, they
do not send UNMAPs
• With SPC-4 support introduced in vSphere 6.5, Linux Guest OSes will now also
be able to issue UNMAPs
33. Automated UNMAP Limits and Considerations
Guest OS filesystem alignment
• VMDKs are aligned on 1 MB block boundaries
• However, misalignment may still occur within the guest OS filesystem
• This may also prevent UNMAP from working correctly
• A best practice is to align guest OS partitions to the 1MB granularity
boundary
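The best practice above amounts to a simple check on the partition start offset; the helper below is illustrative only.

```python
# A guest partition starting at a multiple of 1 MiB keeps guest-level frees
# aligned with the VMFS reclaim boundary.
def partition_aligned(start_offset_bytes: int, granularity: int = 1 << 20) -> bool:
    return start_offset_bytes % granularity == 0

assert partition_aligned(2048 * 512)   # modern default: sector 2048 = 1 MiB
assert not partition_aligned(63 * 512) # legacy 63-sector offset is misaligned
```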
34. Known Automated UNMAP Issues
• vSphere 6.5
– Tools in guest operating system might send unmap requests that are not aligned
to the VMFS unmap granularity.
– Such requests are not passed to the storage array for space reclamation.
– Further info in KB article 2148987
– This issue is addressed in vSphere 6.5 P01
• vSphere 6.5 P01
– Certain versions of Windows Guest OS running in a VM may appear
unresponsive if UNMAP is used.
– Further info in KB article 2150591.
– This issue is addressed in vSphere 6.5 U1.
38. The Storage Policy Based Management (SPBM) Paradigm
• SPBM is the foundation of
VMware's Software Defined
Storage vision
• Common framework to allow
storage and host related
capabilities to be consumed
via policies.
• Applies data services (e.g.
protection, encryption,
performance) on a per VM, or
even per VMDK level
39. Creating Policies via Rules and Rule Sets
• Rule
– A Rule references a combination of a metadata tag and a related value, indicating
the quality or quantity of the capability that is desired
– These two items act as a key and a value that, when referenced together through
a Rule, become a condition that must be met for compliance
• Rule Sets
– A Rule Set is comprised of one or more Rules
– A storage policy includes one or more Rule Sets that describe requirements for
virtual machine storage resources
– Multiple “Rule Sets” can be leveraged to allow a single storage policy to define
alternative selection parameters, even from several storage providers
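The Rule / Rule Set semantics above can be sketched as a small data model: every Rule in a Rule Set must be met, while multiple Rule Sets act as alternatives. The names are illustrative, not the actual SPBM API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    key: str     # capability metadata tag
    value: str   # desired quality or quantity

def rule_set_satisfied(rules, capabilities: dict) -> bool:
    # Every Rule in a Rule Set is a condition that must be met for compliance.
    return all(capabilities.get(r.key) == r.value for r in rules)

def policy_satisfied(rule_sets, capabilities: dict) -> bool:
    # Multiple Rule Sets define alternatives: any one matching set suffices.
    return any(rule_set_satisfied(rs, capabilities) for rs in rule_sets)

gold = [Rule("protection", "raid1"), Rule("encryption", "on")]
silver = [Rule("protection", "raid5")]
store = {"protection": "raid5", "encryption": "off"}
assert not rule_set_satisfied(gold, store)     # gold's rules are not all met
assert policy_satisfied([gold, silver], store) # but the silver alternative is
```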
41. SPBM and Common Rules for Data Services provided by hosts
• VM Encryption
• Storage I/O Control v2
42. Two new features are introduced with vSphere 6.5:
• Encryption
• Storage I/O Control v2
Implementation is done via I/O Filters
43. Storage I/O Control v2
• VM Storage Policies in vSphere 6.5 have a new option called “Common Rules”
• These are used for configuring data services provided by hosts, such as Storage I/O Control
and Encryption. It is the same mechanism used for VAIO/IO Filters
44. vSphere VM Encryption
• vSphere 6.5 introduces a new VM encryption mechanism
• It requires an external Key Management Server (KMS). Check the HCL for supported vendors
• This encryption mechanism is implemented in the hypervisor, making vSphere VM encryption
agnostic to the Guest OS
• This not only encrypts the VMDK, but it also encrypts some of the VM Home directory contents,
e.g. VMX file, metadata files, etc.
• Like SIOCv2, vSphere VM Encryption in vSphere 6.5 is policy driven
45. vSphere VM Encryption I/O Filter
• Common Rules must be enabled to add vSphere VM Encryption to a policy.
• The only setting in the custom encryption policy is whether to allow I/O filters before encryption.
49. NFS v4.1 Improvements
• Hardware Acceleration/VAAI-NAS Improvements
– NFS 4.1 client in vSphere 6.5 supports hardware acceleration by offloading certain
operations to the storage array.
– This comes in the form of a plugin to the ESXi host that is developed/provided by the
storage array partner.
– Refer to your NAS storage array vendor for further information.
• Kerberos IPv6 Support
– NFS v4.1 Kerberos adds IPV6 support in vSphere 6.5.
• Kerberos AES Encryption Support
– NFS v4.1 Kerberos adds Advanced Encryption Standards (AES) encryption support in
vSphere 6.5
51. iSCSI Enhancements
• iSCSI Routing and Port Binding
– ESXi 6.5 now supports having the iSCSI initiator and the iSCSI target residing in different
network subnets with port binding
• UEFI iSCSI Boot
– VMware now supports UEFI (Unified Extensible Firmware Interface) iSCSI Boot on Dell
13th generation servers with Intel x540 dual port Network Interface Card (NIC).
53. NVMe (1 of 2)
• Virtual NVMe Device
– New virtual storage HBA for all-flash
SAN/vSAN storage
– Newer operating systems can leverage
multiple queues with NVMe devices
• Virtual NVMe device allows VMs to take
advantage of such in-guest IO stack
improvements
– Improved performance compared to Virtual
SATA device on local PCIe SSD devices
• The virtual NVMe device provides 30-50%
lower CPU cost per I/O
• The virtual NVMe device achieves 30-80%
higher IOPS
54. NVMe (2 of 2)
• Supported configuration information of virtual NVMe device.
– Number of controllers per VM: 4 (enumerated as nvme0, …, nvme3)
– Number of namespaces per controller: 15 (each namespace is mapped to a virtual disk; enumerated as nvme0:0, …, nvme0:15)
– Maximum queues and interrupts: 16 (1 admin + 15 I/O queues)
– Maximum queue depth: 256 (4K in-flight commands per controller)
• Supports NVMe Specification v1.0e mandatory admin and I/O commands
• Interoperability with all existing vSphere features, except SMP-FT
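The "4K in-flight commands per controller" figure is consistent with the queue limits listed above, as a quick check shows:

```python
# 16 queues per controller (1 admin + 15 I/O), each up to 256 commands deep.
queues_per_controller = 1 + 15
max_queue_depth = 256
max_in_flight = queues_per_controller * max_queue_depth
assert max_in_flight == 4096  # matches the "4K in-flight commands" figure
```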
VAAI – vSphere API for Array Integration
SPBM – Storage Policy Based Management
The ability to address > 256 devices on a single target requires a new flat addressing scheme.
Sector Readiness
As part of future-proofing, all metadata on VMFS-6 is aligned on 4KB blocks.
VMFS-6 is ready to fully support the new, larger capacity, 4KN sector disk drives when vSphere supports them.
This change makes a lot of sense: DQLEN was always the minimum of DSNRO and the HBA Device Queue Depth Limit, so there was no reason to allow DSNRO to exceed that limit, nor to cap it at 256.
There is no way to select the block size; it is always 1MB. The SFB and LFB are internal concepts and are used automatically based on the type of VMDK. For example, LZT and EZT disks will always use LFBs as long as they are available for allocation. In addition, swap files also use LFBs. LFBs help reduce provisioning time and VM boot time (because swap file creation is faster for large VMs). Thin VMDKs always use SFBs.
The VMFS volume has to be reasonably big for LFBs to get allocated.
Idea here is that we can create volumes that are small in capacity, but do not consume a huge amount of overhead for metadata.
The other reason for this is so that we do not cap the maximum capacity of a volume with its initial formatting size, but grow it dynamically over time (future-proofing).
Note 1: VMFS-5
Note 2: 1MB file blocks
Note 3: Number of sub blocks created is 32000 (these are 8K in size) on VMFS-5
Note 1: VMFS file blocks are still 1MB on VMFS-6
Note 2: Large File Blocks are not displayed in the vmkfstools output
Note 3: Pointer blocks have been replaced with Sub Blocks
Note 4: Much fewer (about half) the number of sub blocks on this newly created VMFS-6 volume, which is the same size as the VMFS-5 volume created earlier.
Note 5: Sub-blocks went from 64K on VMFS-3 to 8K on VMFS-5 and back to 64K on VMFS-6
Note 6: Sub-blocks back a file initially, but when its size exceeds one sub-block, we switch to file blocks for backing - https://blogs.vmware.com/vsphere/2012/02/something-i-didnt-known-about-vmfs-sub-blocks.html
VMFS allocates space for journal when the file system is first accessed.
Note that the journal resource file can also be dynamically extended.
Tracking journal blocks separately in a new resource file reduces the risk of issues arising due to journal blocks being interpreted as regular file blocks.
Note: As I understand it, if a filesystem filled up, one could not even delete a file, as that would require allocating a journal block for the metadata operation. By moving journal blocks to their own resource file, one should now be able to allocate a journal block to enable the delete operation, making it much easier to deal with full filesystem issues.
In VMFS, every datastore gets its own hidden files to save the file-system structure.
.fbb – file blocks
.fdc – file descriptors
.pbc – pointer blocks
.pb2 – pointer blocks – second level of indirection
.sbc – sub-blocks
.vh – volume header
.sdd – system data directory
It is important to note that this does not require VMFS-6 or Virtual Machine HW version 13 to work.
VMFS-5 will also support this functionality as long as ESXi is version at 6.5.
ATS is a replacement lock mechanism for SCSI reservations on VMFS volumes when doing metadata updates. Basically, ATS locks can be considered a mechanism to modify a disk sector which, when successful, allows an ESXi host to do a metadata update on a VMFS. This includes allocating space to a VMDK during provisioning, as certain characteristics need to be updated in the metadata to reflect the new size of the file. ATS is a T10 standard and uses opcode 0x89 (COMPARE AND WRITE).
We always do ATS with 1 LBA, so each ATS command is 1K in total: 512 bytes for test data + 512 bytes for set data.
What are we checking?
We check the ATS data, which happens to be the entire HB data that is also in the Test-Image and Set-Image, not just one of the fields. If any field is different, then we need to consider doing an HB reclaim.
Main issues:
- The storage returning ATS MISCOMPARE incorrectly. In this scenario, we do not know why the storage returned the miscompare; it is a storage-side issue. One storage vendor said it was because the storage was overloaded; another said it was due to an existing reservation on the LUN, etc.
- VMFS detecting the miscompare incorrectly. In this case, an HB I/O (1) timed out and VMFS aborted it; however, the I/O had already made it to the disk before the abort. VMFS then retried the ATS using the same test image as (1) (because the previous command was aborted, the assumption was that the ATS had not made it to the disk), and since it had in fact made it to disk, the storage returned ATS MISCOMPARE.
- Some storage arrays wrote the ATS data, yet returned a miscompare, so we had to handle this case too.
Basically, when we get a miscompare on an ATS HB, we read the HB image from the disk and compare it against both the Test-Image and the Set-Image used for the ATS command that resulted in the miscompare. The comparison covers the entire HB slot (a memcmp of 512 bytes on VMFS-5 and 4K on VMFS-6).
TRIM is the ATA equivalent of SCSI UNMAP. A TRIM operation gets converted to UNMAP in the I/O stack, which is SCSI. However, there are some issues with TRIM getting converted into UNMAP.
Storage Policy-Based Management (SPBM) is the foundation of the VMware SDS Control Plane, and enables vSphere administrators to overcome upfront storage provisioning challenges, such as capacity planning, differentiated service levels and managing capacity headroom, whether using vSAN or Virtual Volumes (VVols) on external storage arrays. SPBM provides a single unified control plane across a broad range of data services and storage solutions. The framework helps align storage with the application demands of your virtual machines.
SPBM is about ease and agility. Traditional architectural models relied heavily on the capabilities of an independent storage system in order to meet the protection and performance requirements of workloads. Unfortunately, the traditional model was overly restrictive, in part because standalone hardware-based storage solutions were not VM aware and were limited in their ability to apply unique settings to various workloads. Storage Policy Based Management (SPBM) lets you define requirements for a VM or a collection of VMs. This SPBM framework is the same framework used for storage arrays supporting VVols. Therefore, a common approach to managing and protecting data can be employed, regardless of the backing storage.
----------------------------------
Overview:
Key to software defined storage (SDS) architectural model
SPBM is the common framework to abstract traditional storage related settings away from hardware, and into hypervisor
Applies storage related settings for protection and performance on a per VM, or even per VMDK level
----------------------------------
Common Rules – these come from I/O Filters on hosts (VMCrypt, SIOCv2, VAIO)
Rule Sets come from storage, either vSAN or VVols.
Available since vSphere 6.5.
This is before I added the I/O Accelerator from Infinio.
These are provided by default in vSphere.
When the policy has been created, it may be assigned to newly deployed VMs during provisioning, or to already existing VMs by assigning this new policy to the whole VM (or just an individual VMDK) by editing its settings.
One thing to note is that I/O Filter based IOPS limits do not look at the size of the I/O. For example, there is no normalization, so a 64K I/O is not treated as 2 x 32K I/Os. The limit is a fixed number of IOPS irrespective of the size of each I/O.
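The consequence of that can be made concrete with a quick, illustrative calculation: under a per-operation limit, the same amount of data takes longer to move with smaller I/Os.

```python
# An I/O-filter IOPS limit counts operations, not bytes, so one 64K I/O
# consumes the same budget as one 4K I/O.
def seconds_to_issue(num_ios: int, iops_limit: int) -> float:
    return num_ios / iops_limit

# Moving 64 MB under a 1000-IOPS limit:
assert seconds_to_issue(64 * 1024 // 64, 1000) == 1.024   # 1024 x 64K I/Os
assert seconds_to_issue(64 * 1024 // 4, 1000) == 16.384   # 16384 x 4K I/Os
```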
In this initial release of SIOC V2 in vSphere 6.5, there is no support for vSAN or Virtual Volumes. SIOC v2 is only supported with VMs that run on VMFS and NFS datastores.
SIOCV2 policies override SIOCV1 policies.
VAAI-NAS component has four primitives: Full file clone, Fast file clone, Extended Statistics and Reserve space.
The following Advanced Encryption Standards (AES) are now supported:
AES256-CTS-HMAC-SHA1-96
AES128-CTS-HMAC-SHA1-96
The DES-CBC-MD5 encryption type is not supported with NFSv4.1 in vSphere 6.5.
Kerberos Integrity (SEC_KRB5I) Support
vSphere 6.5 introduces Kerberos Integrity (SEC_KRB5I), a new feature that uses checksums to protect NFS data.
Supports the routing of iSCSI connections and sessions; leverage separate gateways per VMkernel interface; use port binding to reach targets in different subnets.
VMware now supports UEFI (Unified Extensible Firmware Interface) iSCSI Boot on Dell 13th generation servers with Intel x540 dual port Network Interface Card (NIC).
On the System BIOS select Network Settings, followed by UEFI iSCSI Device Settings.
In the Connection Settings, you need to populate initiator and target settings, as well as any appropriate VLAN and CHAP settings if required.
This device will now appear in the list of options in the UEFI Boot Settings.
The NIC Configuration must then have its Legacy Boot Protocol set to iSCSI Primary, and also be populated with initiator and target settings.
You can now install your ESXi image to any of the LUNs on the iSCSI target.
Subsequent reboots will boot from the ESXi image on the iSCSI LUN