At Thoughtworks, we use virtualization extensively, since a little before "cloud" became a buzzword. Until a few years ago, our platform of choice for Virtualization was VMWare. Since then, we also started to use Xen and KVM. We're yet to investigate HyperV in production use. As a software development company, we use VMWare in standalone mode, where our build agents are spread across a large number of servers (around 80 per server), and our UAT environments are made highly available on a number of VMWare clusters.
post a comment
One of the requirements of a VMWare cluster, and that the VMs which need to be made highly available, should reside on a common storage device, typically a SAN . VMWare provide their own filesystem called VMFS, which is a distributed filesystem. If a SAN device presents a raw data store called a Lun, then VMWare accesses that Lun using the iSCSI protocol or via FibreChannel. If the SAN/NAS device can expose it's storage space via the NFS protocol, then VMWare uses that storage directly over TCP/IP. We initially purchased a Dell MD 3000i SAN device which played a useful role when we were operating at a small scale of around 3 TB. As we increased the number of VMs, the disk space required also grew. We finally stopped needing more storage for a specific project at around 5 TB of utilization (8 TB of usable capacity).
Given the larger number of VMs, and the number of parallel deployments of various applications by our automated deployment scripts (we made everything part of our Go Grid), we started to see performance issues. Another long pending problem was that of being unable to answer why our SAN was sometimes slow even though there was no deployment in progress at all (never mind parallel deployments). For e.g., if we knew that there were six environments running 2 - 3 different builds, we'd not know which particular build caused an increase in disk IO. The VMWare GUI tool gives you only so much data. Also, devices like the MD3000i do not have any such thing as iSCSI analytics.
Around this time, I also wanted to experiment with creating VMs within seconds.
Since I've been closely associated with Belenix and OpenSolaris technologies for years, I decided to give Solaris 10 a spin as a storage box for some time. We used ZFS and it's snapshot feature to set up VM collections in minutes. A typical VMCollection would comprise of a Domain Controller, some IIS servers, an Exchange Server, and some VMs running Outlook 2007. We also got amazing performance. All this on a box with just three disks of 1 TB each configured in what is known in ZFS parlance as RAIDZ. RAIDZ is similar to but better than RAID5, because it avoids the RAID5 write hole problem. Our VMWare servers would access our various ZFS filesystems - some over iSCSI and the others as NFS storage.
So, now that we understood that ZFS could for certain give good performance, we needed to solve the additional problem of identifying which VMWare server was sending how much disk IO to which SAN LUN. Enter DTrace. With the help of a close friend from Sun, we put together some dtrace scripts, and these provided answers to some extent. Why only to some extent ? That's because we often didn't know what exactly to ask DTrace to monitor.
Some months later, we decided to replace our MD3000i with a higher end SAN. Now I'd heard of The famed DTrace Analytics that somes as part of the Sun Fishworks product line. Today, this is called their ZFS Storage Appliance Line. However, since we were going to sink in a lot of money into a SAN, I wanted to be cautious and check out other popular vendors too.
Some colleagues and I got together and attended review sessions at NetApp and at EMC. Both NetApp and EMC offered more than what we'd imagined as part of their standard offering, and this was good to know. Unfortunately, at that time (mid 2010), NetApp did not have any analytics around iSCSI sessions. At the EMC review session at their Bangalore office, we saw a good demo and were pleased with their product overall. But the key feature which is very important for us - iSCSI session analytics - was missing. They did some some raw iSCSI analytics, but I learned that activating this monitoring locks up the controller in processing information, and VMWare servers and VMs get disconnected. This was a clear no.
After some discussions with Oracle Bangalore, we finalized an order for a 7320 Storage box, and life hasn't been the same again !
More in part 2 of this blog post.
Given that the various high end SAN vendors all receive good reviews for their performance, let's see what some of these performance criteria are:
post a comment
a. Excellent Disk IO !
b. Excellent Network IO !
c. Excellent caching mechanisms !
d. Optimized network stacks for various protocols !
e. Fancy reporting for the management !
There are also other interesting features such as:
a. being able to use multiple connectivity mediums (1G and 10G Ethernet, Fibre Channel).
b. Phone home, where the the device diagnoses issues if any, and sends the SAN company a message so that they can take pre-emptive action.
Here are something that we liked about the Oracle SAN (the others have some of these, but not all).
1. A full blown Enterprise grade Operating System.
This SAN runs a version of OpenSolaris, based on the same enterprise grade OS that powers many of the world's performance critical environments. As screenshots in further blog posts will show, this SAN gives great performance + analytics while using neglibile amounts of CPU.
2. An excellent network stack.
Unlike some earlier versions of Solaris which were nicknamed "Slowlaris", Solaris 10 receive a TCP/IP stack rewrite. This rewrite was nicknamed "Fire Engine". Later, the developers went on to improve network performance in many other ways too, with quite some of those benefits making their way into this SAN.
3. The ZFS file system.
This filesystem was designed with some good thinking about and questioning of lessons learned in the past, and whether they need to be applied today or not. ZFS has long been acknowledged as being a really superlative filesystem, with ports to some other operating systems as well. There are efforts to write equivalents for other platforms such as Linux to which ZFS cannot be ported to for legal (licensing) reasons.
Some interesting features here are the modified elevator-seek mechanism, near-platter-speed access rates, end-to-end checksumming to provide you with greater reliability.
One interesting benefit of having ZFS around is the improved caching. ZFS lets you specify read cache devices and write cache devices. These cache devices can be ordinary disks, but practically, everyone uses SSD devices for read and write caching. This means, you can start off as we did with having just one cache, and then using the Analytics to "size your requirements". In terms of storage, this means you could now use Analytics to determine whether you have more read or write operations, and whether you need a read or a write cache, how much, etc.
Sinze ZFS is part of the kernel, it can play with unused RAM and "soft allocate" some RAM for use as a read cache for very frequently accessed blocks of data (in case of LUNs) or file blocks or even entire files (in case of NFS and CIFS). What this means is, most of your frequently accessed data will reside in RAM and be served from there. In case the kernel needs to allocate some memory, it'll take away some from ZFS' RAM cache on a need basis.
So a RAM cache coupled with a Read and/or Write cache depending upon your requirements, can do wonders for your performance, over and above what ZFS acces speeds themselves do.
5. End to end checksums in the file system
This one deserves a point by itself. One problem that had hit me twice with the the MD3000i SANs, is that my applications complained about data errors. This was really dangerous for us at work, since this happened with active source code once, and with a VM in another case. The SAN reported faithfully that it had no corruptions in data, and yet I could see with my own eyes that I was in a bit of a mess with having to restore from backups, and rebuild from individual commits made during the day.
ZFS has this notion of an end to end checksum, where a block of data is checksummed, and then sent to the storage sub-system for a write, and there's a reference block that contains its checksum, with the grand parent block containing that parent block's checksum, etc. The intent is to be able to check whether a block that's been retrieved has been retrieved generates the same checksum or not. In case there's a checksum mismatch - say due to media error - ZFS knows to retrieve an identical block and help you get back your data. There's some documentation + diagrams that explain this better than I have.
Now, end to end checksums in a SAN do not guarantee to you at all that your server (the SAN's consumer/customer) will always get the data that it had sent to the SAN. There are many locations where the data could have got corrupted on the way to/from the SAN - the server's RAM (this is why you need ECC!), the network device driver or the OS' TCP stack may have some bug, the NIC card may be faulty, the network cable and switch may have some problems of their own, etc. You get the point.
But what end to end checksumming _will_ assure you about is, once the data reaches the Oracle SAN's network driver, from that point on till it gets written to the storage medium, it'll receive high fidelity checksum calculation, and this will be used to validate the data just before it's dispatched to the SAN's NIC.
For ideal quality, you should run ZFS on your server and have ECC RAM, but for those of us who have other use cases like having to run VMWare, you can at least rest assured that once your data is written to that pool, you can is even most of the worst cases get some or all of it back.
6. Excellent reporting.
You need to see this for yourself lest you consider me biased. Most other SAN devices provide you with what I call "defensive reporting", where the Storage Administrator gets to show that his SAN's giving great disk I/O operations. If someone were to ask a non-Oracle iSCSI SAN user "please tell me which of my VMWare servers is accessing which of the SAN's LUNs", then the storage admin would likely thrust some more defensive reporting in their face and ask them to get lost. If pushed to the wall, he'll simply call the SAN vendor's tech support team, and they'll ofcourse come scrambling to his aid to throw more jargon.
Apart from the fact that such an attitude doesn't take anyone anywhere, is the fact that better/alternate reporting simply doesn't exist on these other SANs.
With the Oracle SAN, you'll find that you'll be able to design and sketch interactive and drill down reports on demand. Drill down reports are awesome. Here's an everyday scenario "Hi, the VMs are slow, could you tell us what's wrong ?", "Sure... hmm... VMWare's CPU and RAM utilization continues to remain low, let me check the SAN" (By now, I trust the Oracle SAN's reporting since it' way more superlative). "Ok, I see high write operations from two VMWare servers, they're acessing LUN_EnvironmentD and Lun_EnvironmentJ, what're you guys doing on that ?" "Aah, there's a deploy going on" "Well, I see that the storage is well within the IOPS threshold" "Must be those script changes that we put in. Anyway, thanks for helping us arrive at this so soon". Apart from the time required to log on to the VMWare Management console and the SAN's own Web console, all SAN analytics literally takes as much time as it'd take to speak this conversation I've listed.
7. Direct support from Oracle.
Apparently, before Oracle acquired Sun, support would be available via channel partners too. Post the acquisition, Oracle now handles all support cases directly. There are pros and cons to this, I feel. There may be channel partners and their team members who may have wanted a career on configuring SAN storage, while continuing to play the role of a generalist. Now, they need to be lucky to get into a technologically diverse company like mine, or join Oracle !
On the other hand, since Oracle have their reputation on the line, you get to speak to people who have access to all manner of skillsets within the company. When crisis strikes, and your customers are screaming seeking escalation, you can now reply that this is the highest you can escalate. (see below).
8. Phone home.
Since I've seen this on NetApp and EMC too, I presume that all higher end storage vendors have this feature today. I know that even within a company such as Thoughtworks, not everyone is or will be as familiar with ZFS and related topics as I am. So, higher end SAN devices today run a number of self-diagnostics, and in case they find out any errors, they send some diagnostic data back to the SAN support team. Such teams take a judgement on what needs to be done, get in touch with the customer and set fixes in motion. These could be pre-emptively replacing disks, asking the customer to add more cache, recommend a reconfiguration, etc.
Alright, more in another blog post !
During the past few weeks, I've been working a bit on Belenix as and when my illness permits.
2 comments | post a comment
So far, rpm, smart and createrepo have been working very well for me. I don't install from source, or download from openindiana repos any more. Everything that I need, I ensure that I build using spec files and rpmbuild, and install via a custom repository on my local computer.
I've been examining build systems (Jenkins/Hudson, Koji, Go), and am about to explore the Open Build System at OpenSuse. What I'd really like to find is something that lets the developer compose build pipelines, let us trigger builds on downstream components when an upstream component's build goes green, install build dependencies before triggering a build, and optionally, let users "promote" a particular package.
I realize that not all of the above would be directly possible, but I can't help feeling that surely someone somewhere would have wanted such mechanisms in place.