OSS Presentation DRMC by Keith Brennan
1. More IOPS Please
DRMC’s VMware View Implementation Using
Nexenta
Keith Brennan
October 2011
2. Delano Regional Medical
Center
S 156 bed community hospital in
central California.
S Four satellite clinics.
S Only hospital in a 30 mile
radius.
S Serves approximately 60,000
people spread over several
communities.
S 80%+ of our patients are Medi-
Cal or Medicare.
S Government doesn’t pay
well.
4. The Great Directive of 2009
S Need to deploy 150 new desktops in support
of a Clinical Documentation implementation.
S Do it as cheaply as possible.
S Oh, by the way, you’re losing an FTE due to
budget cuts.
5. “Never let a good crisis go to
waste.” – Rahm Emanuel
S Used this “Opportunity” to justify moving to VDI.
S Users resistant to using something other than a traditional
desktop.
S Perceived lack of freedom.
S Perceived increase in “Big Brother.”
S Why I wanted the transition to VDI
S Ease of management.
S We had a set, well defined, integrated, desktop experience.
S Wanted a way to deliver the same experience in a controlled
manner to a myriad of devices: iOS, Android, etc.
6. I Need Storage!
S My existing EMC CX500 was barely cutting it for 3 ESX
hosts with a combined 32 VM’s.
S Lots of people on the Virtualization forums liked NetApp.
S NetApp had just published a white paper on a 750 View
virtual desktop deployment on a FAS 2050a.
S Near normal desktop load times.
S Seamless user experience.
7. Well That’s Timely!
S The next week another vendor calls letting me know that
IBM is running a huge storage sale.
S It includes their N series of network attached storage.
S Rebadged NetApps.
S Three weeks later an N3600, a rebadged NetApp 2050a,
arrives.
S It is set up identically to the VDI whitepaper’s configuration.
8. Implementation Guidelines
S Linked clones are to be used whenever possible.
S Ease of maintenance
S Ease of provisioning
S No user data to be stored on the VM’s.
S Significant patching shall be done through the Golden Image
and VM’s will be re-provisioned using the updated image.
S AV will run on the VM’s but only in real-time scan mode. No
scheduled system scans.
9. Initial Testing
S Two Hosts with 25 VM’s each.
S One connected to the N3600 via iSCSI
S The other via NFS.
S Test lab of 25 thin clients.
S Good performance.
S Equivalent to a desktop of the previous generation.
S Quick user logins due to the VM’s being always on and waiting.
S The N3600 is maintaining low utilization.
S NFS and iSCSI exhibit similar speed.
10. Go Live!
S Five additional ESX Hosts are deployed.
S Each hosts ~25 VM’s
S Current setup gives me N+2 host redundancy.
S For the first week everything looks good.
S User complaints are primarily with the clinical application.
S N3600 is handling it well. Running at about 35%
utilization.
S ~1.5k IOPS of regular background chatter.
S VM’s report average latency of 12ms.
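As a rough sanity check on those baseline numbers, the arithmetic below turns the background load into a per-desktop figure. It only reuses numbers already on these slides (7 hosts at ~25 VM’s each, ~1.5k IOPS of background chatter); nothing here is a new measurement.

    # Rough per-desktop arithmetic using only figures from this deck.
    hosts = 7                         # 2 pilot hosts + 5 added at go-live
    vms_per_host = 25
    total_vms = hosts * vms_per_host  # ~175 desktops

    background_iops = 1_500           # "~1.5k IOPS of regular background chatter"
    per_vm_iops = background_iops / total_vms

    print(f"{total_vms} VMs -> ~{per_vm_iops:.1f} IOPS per idle desktop")
    # Roughly 8-9 IOPS per idle VM; a simultaneous AV update or patch
    # storm multiplies that many times over, as the next slides show.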
11. Disaster!
For me they seem to happen in threes.
S First AV engine update happens 1 week after go live.
S AV server pushes it to all clients at once.
S The simultaneous update of all the View VM’s brings the
SAN to a crawl for 3 hours.
S Users complain that the Virtual Desktops are unusable.
S Temporarily corrected the problem by only allowing the AV
to update 3 machines at once.
S This worked like a champ until a dot version update on the
AV server a month later broke that setting.
S Another 3 hour “downtime.”
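The fix of letting the AV console update only 3 machines at a time is really just a concurrency throttle. A minimal sketch of the idea, assuming a hypothetical update_av_engine() helper standing in for whatever the AV product actually calls:

    # Illustrative throttle: at most MAX_CONCURRENT desktops pull the AV
    # engine update at the same time. update_av_engine() is hypothetical.
    from concurrent.futures import ThreadPoolExecutor

    MAX_CONCURRENT = 3

    def update_av_engine(vm_name: str) -> None:
        ...  # push the engine update to one desktop (stand-in)

    def rolling_av_update(vm_names: list[str]) -> None:
        # The worker cap is the throttle: only three update write storms
        # can hit the SAN at any moment, regardless of fleet size.
        with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
            list(pool.map(update_av_engine, vm_names))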
12. Disaster (cont)
S Three days later a helpdesk tech forces the reprovisioning
of 60 of the View VM’s at once.
S Was applying an application patch.
S Was trained not to restart more than 5 VM’s at once.
S That obviously didn’t stick!
S That was another hour of the SAN crawling.
S Once again, users complain that the system was unusable
during this time.
13. Disaster! (yet again)
S .net 3.5 service pack is approved for deployment.
S SP is large: >100 MB.
S Set to deploy starting at 2am and only on restart.
S At 04:15 four VM’s restart within one minute of each other.
S N3600 starts to lag.
S Users, seeing their systems running slow, decide to restart.
S At 5am I get the call regarding the issue.
S I immediately disabled the SP deployment.
S Still took an hour for the N3600 to catch up.
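The underlying problem here was clustering: several desktops restarting, and therefore installing a >100 MB package, within the same minute. One hedged way to avoid that is to jitter each desktop’s install time across the maintenance window instead of tying it to restarts; schedule_patch() below is a hypothetical stand-in for the deployment tool.

    # Illustrative jitter: give each desktop a random offset inside the
    # 02:00-05:00 window so large patches never land in a burst.
    import random
    from datetime import datetime, timedelta

    WINDOW_START = datetime(2011, 10, 1, 2, 0)   # example night, 02:00
    WINDOW_SECONDS = 3 * 3600                    # three-hour window

    def schedule_patch(vm_name: str, when: datetime) -> None:
        ...  # hand the install time to the patch tool (stand-in)

    def staggered_schedule(vm_names: list[str]) -> None:
        for vm in vm_names:
            offset = timedelta(seconds=random.uniform(0, WINDOW_SECONDS))
            schedule_patch(vm, WINDOW_START + offset)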
15. What’s Going On???
S Oh $#!+…
S General use chatter is
eating my bandwidth.
S N3600 CPU utilization is
regularly now above
50%.
S Disk utilization rarely
drops below 40%.
S Average disk latency
>18ms.
16. I Have a Problem
S I’m maxing performance with just day to day operations.
S IBM has verified that the appliance is functioning
properly.
S In other words, this is all I’m going to get out of it.
S Adding disks might help some, but it’s too costly!
S Additional Tray would be $15k!
S SAS drives to populate it are almost $1k each!
S Still have CPU limitations.
S NIC limitations (2 × 1 GbE links per head)
S Did I mention that I have no money left in the budget?
17. Nexenta to the Rescue
S Had just installed Nexenta Core for my home file server.
S Time to find some hardware:
S Pulled a box out of the View cluster.
S Installed six Intel SSD’s.
S Installed Nexenta Core. (yeah, I know.. EULA..)
S Created the volume and shared via NFS.
S The next day my poor brain figured out that I could have just
done a Nexenta VM. Doh!
S Over the next week I migrated half the virtual desktops over.
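For context, the test box boiled down to a ZFS pool built from the six SSD’s and exported over NFS. The sketch below shows roughly what that looks like, driven from Python for consistency with the other examples; the pool name, device names, and mirrored layout are assumptions, and on a supported NexentaStor box you would normally do this through NMC/NMV rather than raw zpool/zfs commands.

    # Sketch only: striped mirrors across six SSDs, shared over NFS.
    # Device, pool, and dataset names are made up for illustration.
    import subprocess

    def run(cmd: list[str]) -> None:
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    ssds = ["c1t1d0", "c1t2d0", "c1t3d0", "c1t4d0", "c1t5d0", "c1t6d0"]

    # Three mirrored pairs striped together for IOPS over capacity.
    run(["zpool", "create", "vdi",
         "mirror", ssds[0], ssds[1],
         "mirror", ssds[2], ssds[3],
         "mirror", ssds[4], ssds[5]])

    run(["zfs", "create", "vdi/view"])              # dataset for the View datastore
    run(["zfs", "set", "sharenfs=on", "vdi/view"])  # export it to the ESX hosts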
18. It’s like Night and Day
S Average latency drops
from 18ms to 2ms.
S Write throughput
quadruples.
S Read throughput
doubles.
S 20x improvement on 4K IOPS!
20. Time For a Full Nexenta
Implementation
S I was able to secure $45k capital for the next year.
S Normally this would just draw laughter when talking about
storage.
S I also intend to replace the existing EMC.
S Annual maintenance too costly.
S I despise the fact that I have to call them out every time I want
to connect a new piece of hardware to it.
S Still some questioning from higher-ups on this whole open-
storage thing.
21. Final Solution Hardware
S 2x Supermicro dual Xeon servers with 96 GB RAM.
S 1x DataOn 1600 JBOD
S Houses twenty-one 1 TB nearline SAS drives.
S 1x DataOn 1620 JBOD
S Houses seventeen 300 GB 10k RPM SAS drives
S 2x STEC ZeusRAM
S 8x 160 GB Intel 320 SSD’s
23. Why DataOn?
S Disk Shelf Manager
S One thing Nexenta lacked was a way to monitor the JBODs.
S How could one of my techs know which drive to pull?
S Intuitive slot lighting.
S They’re responsive even
after the sale is made!
24. Why Nexenta?
S It’s good to have on-demand support.
S I am the only member of our technical staff who has a basic
understanding of storage architectures.
S I like to have the ability to go on vacation from time to time!
S It’s good to have experts for unique problems.
S Regular, tested bug-fixes.
S It’s always nice to have someone’s neck to wring!
25. The End Result
S 2ms latency.
S 500 MB/s reads
S 200 MB/s writes
S Happy Users!
S Note: Benchmark was
done on production
system with 175 active
VM’s.
26. To Dedup or Not to Dedup
S Dedup can give you huge storage savings.
S I had a 14x dedup ratio on my VDI volume.
S Inline dedup saves on disk write IO.
S It’ll still hit the ZIL, but won’t be written to disk if it is
determined to be duplicated data.
S Instead of a 4+ KB write you get a sub-256-byte metadata write.
27. To Dedup or Not to Dedup
S RAM hog!
S For good performance you need enough RAM to store the
dedup table.
S Uses ARC for this, which means you will have less room for
cached data.
S Potential for hash collision.
S Odds are astronomical, but still a chance for data corruption.
S Dedup performance penalty.
S Small IOPS suffer.
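The RAM cost can be estimated before turning dedup on. A commonly cited ZFS rule of thumb is a few hundred bytes of dedup-table (DDT) entry per unique block; the inputs below (320 bytes per entry, an 8 KB average block, 2 TB of unique data) are assumptions for illustration, not figures from this deployment.

    # Back-of-the-envelope DDT sizing. Every input is an assumption.
    pool_unique_tb = 2       # unique (post-dedup) data stored
    avg_block_kb   = 8       # average block size on the VDI volume
    ddt_entry_b    = 320     # commonly cited per-entry cost

    unique_blocks = pool_unique_tb * 1024**3 / avg_block_kb   # TB -> KB -> blocks
    ddt_gb = unique_blocks * ddt_entry_b / 1024**3

    print(f"~{ddt_gb:.0f} GB of ARC just to hold the DDT")
    # With these assumptions that is ~80 GB, and every gigabyte the DDT
    # occupies is a gigabyte unavailable for the read cache on slide 31.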
29. Is Dedup Worth it?
S If you’re using a “Golden Image” - No.
S VMDC Plugin provides great efficiency by only storing one
copy of the Golden Image vs one for each pool of VM’s.
S Compression is virtually free and will do a good job of
making up the difference in the “new” blocks.
S Disk is cheap.
S If you’re doing a bunch of P2V desktop migrations -
Maybe.
S If the desktops are poorly configured, or have other aspects
that can cause excessive I/O, then no.
S If the desktops are similar and large, then sure.
30. Compression
S Use it. Unless you’re using a 5 year old processor, there
will be no noticeable performance hit.
S On by default in Nexenta 3.1
S Compresses before write. Saves disk bandwidth!
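The “saves disk bandwidth” point is simple arithmetic: bytes compressed before the write are bytes that never reach the spindles. The ratio below is an assumption for illustration; the 200 MB/s figure is the write rate from the End Result slide.

    # Illustrative arithmetic only; the 1.5x ratio is an assumption,
    # not a measurement from the DRMC pool.
    assumed_ratio = 1.5          # plausible lzjb-class ratio on mixed VM data
    logical_write_mb_s = 200     # what the ESX hosts send (End Result slide)

    physical_write_mb_s = logical_write_mb_s / assumed_ratio
    print(f"~{physical_write_mb_s:.0f} MB/s actually reaches the disks "
          f"instead of {logical_write_mb_s} MB/s")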
31. Cache is Key!
S Between the 70 GB of ARC and 640 GB of L2ARC, the read cache
is hit almost 98% of the time!
S This equates to sub 2ms average disk latency to the end user.
S Beats the crud out of the >15ms average latency of the N3600!
S Know your working set. You could get away with a much
smaller cache or need a much larger one.
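The link between the 98% hit rate and the sub-2 ms average is plain weighted-average arithmetic; the per-tier service times below are assumptions, and only the hit rate comes from this slide.

    # Weighted-average read latency from the cache hit rate.
    hit_rate = 0.98     # from this slide
    cache_ms = 0.5      # ARC/L2ARC service time (assumption)
    disk_ms  = 15.0     # spinning-disk service time on a miss (assumption)

    avg_ms = hit_rate * cache_ms + (1 - hit_rate) * disk_ms
    print(f"average read latency ~{avg_ms:.1f} ms")
    # ~0.8 ms with these numbers; drop the hit rate to 90% and the same
    # formula gives ~1.9 ms, which is why the working set matters so much.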
34. Gig-E vs TenGig-E
S Obvious differences in maximum throughput.
S Small-IOP differences are mainly attributable to network
latency differences.
S If you’re stuck with Gig-E, use 802.3ad trunk groups.
S Each link is still limited to ~100 MB/s of throughput, but no
one ESX host will saturate the link for the rest.
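One subtlety worth spelling out: an 802.3ad trunk raises aggregate bandwidth, but a single flow (one ESX host talking to one NFS target) still rides one member link. Rough numbers, assuming a hypothetical four-link trunk:

    # Illustrative LACP arithmetic; the four-member trunk is an assumption.
    link_mb_s   = 100        # practical throughput of one 1 GbE link
    trunk_links = 4

    aggregate_mb_s = link_mb_s * trunk_links   # all hosts combined
    per_flow_mb_s  = link_mb_s                 # any one host still sees one link

    print(f"aggregate ~{aggregate_mb_s} MB/s, single host still ~{per_flow_mb_s} MB/s")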
35. Gig-E vs TenGig-E - User
Perspective
S Average time from the “Power On VM” command being
issued until the user is able to log in:
S 10gbe: 23 seconds
S 1gbe: 32 seconds
S Time from when the user presses the “login” button until the
desktop is ready to use:
S 10gbe: 5 seconds
S 1gbe: 9 seconds
*Windows 7, 2 procs, 2 GB RAM, DRMC’s Standard Clinical Image
36. Final Thought – All SSD
Goodness
S For deployments of Linked Clones or VM’s off of a Golden
Image.
S Allows you to get rid of the L2ARC.
S Use a good ZIL device (STEC ZeusRAM, DDRdrive)
S Allows for sequential writes to the SSD’s in the pool.
S Saves on write wear, which is an SSD killer.
S My first test box with the X25-M SSD’s started suffering after
about 3 months.
S If you want HA you have to use SAS drives.
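A rough way to see why the first X25-M box suffered is to compare daily write volume against the drives’ rated endurance. Every number below is a placeholder assumption for illustration, not data from DRMC:

    # Back-of-the-envelope SSD wear estimate. All inputs are assumptions.
    daily_writes_gb_per_vm = 2      # patches, AV, logs, page file
    vms                    = 175
    pool_ssd_count         = 6
    drive_endurance_tb     = 15     # rated host writes for a consumer drive
    write_amplification    = 2      # random-write penalty without a good ZIL

    pool_tb_per_day  = daily_writes_gb_per_vm * vms / 1024 * write_amplification
    drive_tb_per_day = pool_tb_per_day / pool_ssd_count
    days_to_limit    = drive_endurance_tb / drive_tb_per_day

    print(f"~{drive_tb_per_day:.2f} TB/day per drive -> rated limit in ~{days_to_limit:.0f} days")
    # A fast SLC ZIL (ZeusRAM, DDRdrive) turning random writes into
    # sequential flushes cuts the amplification term, which is the point
    # of this slide.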