Compiling Ruby 1.9.3-p0 on Belenix

After a few exciting weeks at work, I picked up Belenix work again today. I decided to give Ruby 1.9.3-p0 a try, since this is the latest stable.

I ran into a libelf issue, where the default libelf that came with illumos didn't support large file sizes. People have faced this before already.

I was able to address this by installing a different libelf from here with a prefix of /usr/local.

CFLAGS="-I/usr/local/include/libelf -I/usr/local/include" LDFLAGS="-L/usr/local/lib -R/usr/local/lib" ./configure
Followed by:
make -j 4

I'd installed libelf and libyaml with the prefix of /usr/local, hence the custom CFLAGS and LDFLAGS.

For the longer term, I'm guessing we'll need to closely track ecosystems such as Ruby, Python, PHP and the like via Continuous Integration.

[Update:] Updated the CFLAGS and LDFLAGS to have the correct content, and explained what I did.

About SAN storage at Thoughtworks - Part 3

I'd earlier posted about how and why we selected an Oracle ZFS Storage SAN - the 7320. Read Part 1 and Part 2 for more context.

Having finalized that we want an Oracle ZFS Storage SAN, we placed an order for our SAN and eagerly waited for it. Some of our team received and installed it along with the local vendor, and then even put it to use. After a few days of use, we started to notice performance problems with our most important use case - VMs on VMWare.

Here's a checklist that you should follow:
a. Ensure that everyone understands what you're buying.
Modern day SAN devices include a lot of advanced technology that we usually don't see even on servers.

b. Get the support contract in place.
We had to seek support from Oracle once to understand why we were seeing poor performance (more on that in the next post), and we were really pleased with the skills and competence of the Oracle Premium Support Engineer.

c. Validate the entire setup by Oracle.
Ensure that the SAN is installed and configured by Oracle themselves. Do not involve your local partner for this. You are paying money already, and Oracle wants to take the responsibility of the SAN. Let them do so. There may be commercials involved, so talk to your Oracle representative.

d. Ensure that the Welcome Kit and support are activated.

e. Ensure that the OS is upgraded.

f. Configure and test the Phone home service once.

g. Read the documentation at least once.
This is not your typical SAN, and this is not your typical filesystem. You need to know what you are doing in case you want to deviate from the Oracle recommended configuration. If you want to configure custom disk pools, etc., remember that your various Windows and Linux and other SAN lessons _do not_ apply here. This is especially, especially important because the GUI makes things easy to configure and mis-configure, and if you're used to LVMs and other SANs, you can easily make a mistake.

h. Run the SAN for a week before declaring it ready for production.
This is another important point. The SAN itself can actually work very well right from day one. However, you need to get used to the SAN since there's a lot of power packed in. There's also the GUI which you should get familiar with. If you need to reconfigure the SAN (a likely scenario), you would have the opportunity to do so.

Now that we have the above out of the way, let's move on to Part 4 - the SAN GUI !

Pluribus and upcoming network innovations

One company that I'm going to keep an eye on is Pluribus Networks. They have the byline "Virtualization without Limits". Sunay Tripathi, one of the founders, is also one of the driving forces behind the Solaris 10 TCP stack rewrite, and Crossbow, the network virtualization in the OpenSolaris kernel (and part of the upcoming Solaris 11).

Sunay recently blogged about "Network 2.0: Virtualization without limits", where he's written about a Network OS that controls all switches, treating the network "exactly as one giant resource pool".

Pluribus are quiet about what they're doing, but Sunay has said that they are building "a network hypervisor that has semantics similar to a tight coupled cluster but controls a collection of switches and scales from one instance to hundred plus instances".

This is purely speculation on my part, but I do wonder if OpenVSwitch would be the management layer, or perhaps integrate/interoperate with Pluribus' NetVizor.

In any case, the future does look exciting :)

Interesting progress with Belenix development

During the past few weeks, I've been working a bit on Belenix as and when my illness permits.

So far, rpm, smart and createrepo have been working very well for me. I don't install from source, or download from openindiana repos any more. Everything that I need, I ensure that I build using spec files and rpmbuild, and install via a custom repository on my local computer.

I've been examining build systems (Jenkins/Hudson, Koji, Go), and am about to explore the Open Build Service at openSUSE. What I'd really like to find is something that lets the developer compose build pipelines, lets us trigger builds on downstream components when an upstream component's build goes green, installs build dependencies before triggering a build, and optionally lets users "promote" a particular package.

I realize that not all of the above would be directly possible, but I can't help feeling that surely someone somewhere would have wanted such mechanisms in place.
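To make the "trigger downstream builds" idea concrete, here is a minimal sketch in Python. It is not taken from any of the build systems named above. The package names and their dependencies are hypothetical, and the function only works out which packages to rebuild and in what order; actually running rpmbuild and waiting for a green result is left out.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def rebuild_order(depends_on, changed):
    """Return the changed package plus everything downstream of it,
    ordered so that each package comes after the upstreams it needs.

    depends_on maps a package name to the packages it needs at build time.
    """
    # Invert the map: upstream -> packages that depend on it.
    dependents = {}
    for pkg, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(pkg)

    # Collect everything reachable downstream of the changed package.
    affected, stack = {changed}, [changed]
    while stack:
        for pkg in dependents.get(stack.pop(), ()):
            if pkg not in affected:
                affected.add(pkg)
                stack.append(pkg)

    # Keep only edges between affected packages, then sort topologically.
    graph = {
        pkg: [dep for dep in depends_on.get(pkg, ()) if dep in affected]
        for pkg in affected
    }
    return list(TopologicalSorter(graph).static_order())


# Hypothetical dependency map, for illustration only.
example_deps = {
    "rpm": ["libelf"],
    "smart": ["rpm", "python"],
    "createrepo": ["rpm", "python"],
    "python": [],
}
order = rebuild_order(example_deps, "libelf")
```

A real pipeline would build each entry of `order` in turn and stop at the first failure, so that nothing downstream of a red build gets triggered.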

About SAN storage at Thoughtworks - Part 2

Given that the various high end SAN vendors all receive good reviews for their performance, let's see what some of these performance criteria are:
a. Excellent Disk IO !
b. Excellent Network IO !
c. Excellent caching mechanisms !
d. Optimized network stacks for various protocols !
e. Fancy reporting for the management !

There are also other interesting features such as:
a. being able to use multiple connectivity mediums (1G and 10G Ethernet, Fibre Channel).
b. Phone home, where the device diagnoses issues if any, and sends the SAN company a message so that they can take pre-emptive action.

Here are some things that we liked about the Oracle SAN (the others have some of these, but not all).
1. A full blown Enterprise grade Operating System.
This SAN runs a version of OpenSolaris, based on the same enterprise grade OS that powers many of the world's performance critical environments. As screenshots in further blog posts will show, this SAN gives great performance + analytics while using negligible amounts of CPU.

2. An excellent network stack.
Unlike some earlier versions of Solaris which were nicknamed "Slowlaris", Solaris 10 received a TCP/IP stack rewrite. This rewrite was nicknamed "Fire Engine". Later, the developers went on to improve network performance in many other ways too, with quite a few of those benefits making their way into this SAN.

3. The ZFS file system.
This filesystem was designed with some good thinking about and questioning of lessons learned in the past, and whether they need to be applied today or not. ZFS has long been acknowledged as being a really superlative filesystem, with ports to some other operating systems as well. There are efforts to write equivalents for other platforms such as Linux, to which ZFS cannot be ported for legal (licensing) reasons.

Some interesting features here are the modified elevator-seek mechanism, near-platter-speed access rates, and end-to-end checksumming to provide you with greater reliability.

4. Caching.
One interesting benefit of having ZFS around is the improved caching. ZFS lets you specify read cache devices and write cache devices. These cache devices can be ordinary disks, but practically, everyone uses SSD devices for read and write caching. This means you can start off, as we did, with just one cache, and then use the Analytics to "size your requirements". In terms of storage, this means you could now use Analytics to determine whether you have more read or write operations, and whether you need a read or a write cache, how much, etc.

Since ZFS is part of the kernel, it can play with unused RAM and "soft allocate" some RAM for use as a read cache for very frequently accessed blocks of data (in case of LUNs) or file blocks or even entire files (in case of NFS and CIFS). What this means is, most of your frequently accessed data will reside in RAM and be served from there. In case the kernel needs to allocate some memory, it'll take away some from ZFS' RAM cache as needed.

So a RAM cache coupled with a Read and/or Write cache depending upon your requirements, can do wonders for your performance, over and above what ZFS access speeds themselves do.
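The "soft allocated" RAM cache can be pictured with a small Python sketch. This is a deliberate simplification: it uses plain least-recently-used eviction, whereas the ZFS ARC is an adaptive replacement cache, and the class and method names here are made up for illustration.

```python
from collections import OrderedDict


class ShrinkableReadCache:
    """Toy read cache that hands memory back when asked to shrink."""

    def __init__(self, capacity):
        self.capacity = capacity      # maximum number of cached blocks
        self.blocks = OrderedDict()   # least recently used entry first

    def get(self, addr, read_from_disk):
        if addr in self.blocks:
            # Cache hit: mark the block as most recently used.
            self.blocks.move_to_end(addr)
            return self.blocks[addr]
        # Cache miss: fetch from the slower device and remember the block.
        data = read_from_disk(addr)
        self.blocks[addr] = data
        self._evict()
        return data

    def shrink(self, new_capacity):
        """Called when the rest of the system wants its RAM back."""
        self.capacity = new_capacity
        self._evict()

    def _evict(self):
        # Drop least recently used blocks until we fit the capacity.
        while len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)
```

The point of the sketch is the `shrink` call: the cache is free to use spare memory, but it gives up its coldest blocks first when something else needs that memory.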

5. End to end checksums in the file system
This one deserves a point by itself. One problem that had hit me twice with the MD3000i SANs, is that my applications complained about data errors. This was really dangerous for us at work, since this happened with active source code once, and with a VM in another case. The SAN reported faithfully that it had no corruptions in data, and yet I could see with my own eyes that I was in a bit of a mess with having to restore from backups, and rebuild from individual commits made during the day.

ZFS has this notion of an end to end checksum, where a block of data is checksummed, and then sent to the storage sub-system for a write, and there's a reference block that contains its checksum, with the grand parent block containing that parent block's checksum, etc. The intent is to be able to check whether a block that's been retrieved generates the same checksum or not. In case there's a checksum mismatch - say due to media error - ZFS knows to retrieve an identical block from a redundant copy and help you get back your data. There's some documentation + diagrams that explain this better than I have.
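Here is a minimal Python sketch of that idea, under some loud assumptions: a toy two-way mirror stands in for the storage pool, SHA-256 is picked purely for illustration, and the class is hypothetical rather than anything from ZFS itself. The key detail is that the checksum is handed back to the caller (playing the role of the parent block) instead of being stored next to the data it protects.

```python
import hashlib


def checksum(data):
    return hashlib.sha256(data).hexdigest()


class MirroredStore:
    """Toy two-way mirror: every block is written to both 'disks'."""

    def __init__(self):
        self.disks = [{}, {}]

    def write(self, addr, data):
        for disk in self.disks:
            disk[addr] = data
        # The caller keeps this checksum, as a parent block would.
        return checksum(data)

    def read(self, addr, expected):
        for disk in self.disks:
            data = disk[addr]
            if checksum(data) == expected:
                # Good copy found: repair any copy that disagrees with it.
                for other in self.disks:
                    if other[addr] != data:
                        other[addr] = data
                return data
        raise IOError("every copy of block %r failed its checksum" % (addr,))


store = MirroredStore()
parent_checksum = store.write(0, b"active source code")
store.disks[0][0] = b"active source c0de"   # simulate silent corruption
recovered = store.read(0, parent_checksum)  # served from the good copy
```

Because the expected checksum lives outside the damaged block, a disk that silently returns wrong data cannot also return a matching checksum for it.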

Now, end to end checksums in a SAN do not guarantee to you at all that your server (the SAN's consumer/customer) will always get the data that it had sent to the SAN. There are many locations where the data could have got corrupted on the way to/from the SAN - the server's RAM (this is why you need ECC!), the network device driver or the OS' TCP stack may have some bug, the NIC card may be faulty, the network cable and switch may have some problems of their own, etc. You get the point.

But what end to end checksumming _will_ assure you about is, once the data reaches the Oracle SAN's network driver, from that point on till it gets written to the storage medium, it'll receive high fidelity checksum calculation, and this will be used to validate the data just before it's dispatched to the SAN's NIC.

For ideal quality, you should run ZFS on your server and have ECC RAM, but for those of us who have other use cases like having to run VMWare, you can at least rest assured that once your data is written to that pool, you can, in even most of the worst cases, get some or all of it back.

6. Excellent reporting.
You need to see this for yourself lest you consider me biased. Most other SAN devices provide you with what I call "defensive reporting", where the Storage Administrator gets to show that his SAN's giving great disk I/O operations. If someone were to ask a non-Oracle iSCSI SAN user "please tell me which of my VMWare servers is accessing which of the SAN's LUNs", then the storage admin would likely thrust some more defensive reporting in their face and ask them to get lost. If pushed to the wall, he'll simply call the SAN vendor's tech support team, and they'll of course come scrambling to his aid to throw more jargon.

Apart from the fact that such an attitude doesn't take anyone anywhere, is the fact that better/alternate reporting simply doesn't exist on these other SANs.

With the Oracle SAN, you'll find that you'll be able to design and sketch interactive and drill down reports on demand. Drill down reports are awesome. Here's an everyday scenario: "Hi, the VMs are slow, could you tell us what's wrong ?", "Sure... hmm... VMWare's CPU and RAM utilization continues to remain low, let me check the SAN" (By now, I trust the Oracle SAN's reporting since it's far superior). "Ok, I see high write operations from two VMWare servers, they're accessing LUN_EnvironmentD and Lun_EnvironmentJ, what're you guys doing on that ?" "Aah, there's a deploy going on" "Well, I see that the storage is well within the IOPS threshold" "Must be those script changes that we put in. Anyway, thanks for helping us arrive at this so soon". Apart from the time required to log on to the VMWare Management console and the SAN's own Web console, all SAN analytics literally takes as much time as it'd take to speak this conversation I've listed.

7. Direct support from Oracle.
Apparently, before Oracle acquired Sun, support would be available via channel partners too. Post the acquisition, Oracle now handles all support cases directly. There are pros and cons to this, I feel. There may be channel partners and their team members who may have wanted a career on configuring SAN storage, while continuing to play the role of a generalist. Now, they need to be lucky to get into a technologically diverse company like mine, or join Oracle !

On the other hand, since Oracle have their reputation on the line, you get to speak to people who have access to all manner of skillsets within the company. When crisis strikes, and your customers are screaming seeking escalation, you can now reply that this is the highest you can escalate. (see below).

8. Phone home.
Since I've seen this on NetApp and EMC too, I presume that all higher end storage vendors have this feature today. I know that even within a company such as Thoughtworks, not everyone is or will be as familiar with ZFS and related topics as I am. So, higher end SAN devices today run a number of self-diagnostics, and in case they find out any errors, they send some diagnostic data back to the SAN support team. Such teams take a judgement on what needs to be done, get in touch with the customer and set fixes in motion. These could be pre-emptively replacing disks, asking the customer to add more cache, recommend a reconfiguration, etc.

Alright, more in another blog post !

About SAN storage at Thoughtworks

At Thoughtworks, we've used virtualization extensively since a little before "cloud" became a buzzword. Until a few years ago, our platform of choice for Virtualization was VMWare. Since then, we also started to use Xen and KVM. We're yet to investigate HyperV in production use. As a software development company, we use VMWare in standalone mode, where our build agents are spread across a large number of servers (around 80 per server), and our UAT environments are made highly available on a number of VMWare clusters.

One of the requirements of a VMWare cluster is that the VMs which need to be made highly available should reside on a common storage device, typically a SAN. VMWare provide their own filesystem called VMFS, which is a distributed filesystem. If a SAN device presents a raw data store called a Lun, then VMWare accesses that Lun using the iSCSI protocol or via FibreChannel. If the SAN/NAS device can expose its storage space via the NFS protocol, then VMWare uses that storage directly over TCP/IP. We initially purchased a Dell MD 3000i SAN device which played a useful role when we were operating at a small scale of around 3 TB. As we increased the number of VMs, the disk space required also grew. We finally stopped needing more storage for a specific project at around 5 TB of utilization (8 TB of usable capacity).

Given the larger number of VMs, and the number of parallel deployments of various applications by our automated deployment scripts (we made everything part of our Go Grid), we started to see performance issues. Another long pending problem was that of being unable to answer why our SAN was sometimes slow even though there was no deployment in progress at all (never mind parallel deployments). For e.g., if we knew that there were six environments running 2 - 3 different builds, we'd not know which particular build caused an increase in disk IO. The VMWare GUI tool gives you only so much data. Also, devices like the MD3000i do not have any such thing as iSCSI analytics.

Around this time, I also wanted to experiment with creating VMs within seconds.

Since I've been closely associated with Belenix and OpenSolaris technologies for years, I decided to give Solaris 10 a spin as a storage box for some time. We used ZFS and its snapshot feature to set up VM collections in minutes. A typical VMCollection would comprise a Domain Controller, some IIS servers, an Exchange Server, and some VMs running Outlook 2007. We also got amazing performance. All this on a box with just three disks of 1 TB each configured in what is known in ZFS parlance as RAIDZ. RAIDZ is similar to but better than RAID5, because it avoids the RAID5 write hole problem. Our VMWare servers would access our various ZFS filesystems - some over iSCSI and the others as NFS storage.
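Why do snapshots make it possible to stamp out VM collections in minutes? A writable clone of a snapshot starts out sharing every block with it, and only stores the blocks that are later changed. The Python sketch below illustrates that copy-on-write behaviour; the classes and the block contents are hypothetical and are not how ZFS lays out data on disk.

```python
class Snapshot:
    """Read-only, point-in-time view of a volume's blocks."""

    def __init__(self, blocks):
        self._blocks = dict(blocks)

    def read(self, addr):
        return self._blocks[addr]


class Clone:
    """Writable volume that initially shares all blocks with a snapshot."""

    def __init__(self, origin):
        self.origin = origin
        self.own_blocks = {}  # only blocks written after cloning live here

    def read(self, addr):
        if addr in self.own_blocks:
            return self.own_blocks[addr]
        return self.origin.read(addr)  # unchanged block: served from origin

    def write(self, addr, data):
        # The snapshot is never modified; the clone keeps its own copy.
        self.own_blocks[addr] = data


# One "golden" image, two VMs created from it without copying any data.
golden = Snapshot({0: b"bootloader", 1: b"windows", 2: b"iis"})
vm_a = Clone(golden)
vm_b = Clone(golden)
vm_a.write(2, b"exchange")
```

Creating `vm_a` and `vm_b` costs nothing up front, which is the effect we saw: a new VM is available almost immediately and only grows as it diverges from the golden image.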

So, now that we understood that ZFS could for certain give good performance, we needed to solve the additional problem of identifying which VMWare server was sending how much disk IO to which SAN LUN. Enter DTrace. With the help of a close friend from Sun, we put together some dtrace scripts, and these provided answers to some extent. Why only to some extent ? That's because we often didn't know what exactly to ask DTrace to monitor.
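The question we were asking boils down to an aggregation keyed on (server, LUN). Our real scripts were written in DTrace's D language and are not reproduced here. The Python sketch below only shows the shape of the aggregation, and the event tuples, host names and byte counts are invented for illustration.

```python
from collections import Counter


def bytes_by_initiator_and_lun(events):
    """Sum I/O bytes per (initiator, LUN) pair, busiest pair first.

    events is an iterable of (initiator, lun, nbytes) tuples.
    """
    totals = Counter()
    for initiator, lun, nbytes in events:
        totals[(initiator, lun)] += nbytes
    return totals.most_common()


# Hypothetical sample of observed iSCSI I/O.
sample_events = [
    ("esx-01", "LUN_EnvironmentD", 4096),
    ("esx-02", "LUN_EnvironmentJ", 65536),
    ("esx-01", "LUN_EnvironmentD", 8192),
    ("esx-02", "LUN_EnvironmentJ", 65536),
]
report = bytes_by_initiator_and_lun(sample_events)
```

The hard part in practice was not the summing but knowing which events to capture in the first place, which is exactly where we kept getting stuck.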

Some months later, we decided to replace our MD3000i with a higher end SAN. Now I'd heard of the famed DTrace Analytics that comes as part of the Sun Fishworks product line. Today, this is called their ZFS Storage Appliance Line. However, since we were going to sink a lot of money into a SAN, I wanted to be cautious and check out other popular vendors too.

Some colleagues and I got together and attended review sessions at NetApp and at EMC. Both NetApp and EMC offered more than what we'd imagined as part of their standard offering, and this was good to know. Unfortunately, at that time (mid 2010), NetApp did not have any analytics around iSCSI sessions. At the EMC review session at their Bangalore office, we saw a good demo and were pleased with their product overall. But the key feature which is very important for us - iSCSI session analytics - was missing. They did have some raw iSCSI analytics, but I learned that activating this monitoring locks up the controller in processing information, and VMWare servers and VMs get disconnected. This was a clear no.

After some discussions with Oracle Bangalore, we finalized an order for a 7320 Storage box, and life hasn't been the same again !

More in part 2 of this blog post.

Belenix - what's up ahead

Until a few months ago, the Belenix team had stopped all work on Belenix. We've been having a very good time at our respective day jobs, and that's been a very good trip so far.

Sometime ago, I decided to take a dive into Belenix again.

As of today, I've got a working rpm5 and smart package manager setup on my computer.

I've posted a roadmap here :

I can report that I'm on track, and hope to have an IPS repository for rpm, as well as the beginnings of an rpm repository by the end of this month.

The recent news about Joyent's port of KVM to Illumos, as well as today's announcement of SmartOS are both very exciting. It looks like there's much to achieve ahead !

SmartOS (and also, KVM on Illumos)

Some of the amazing team at Joyent ported KVM onto the Illumos kernel.

Then, they went on to contribute their work to Illumos.

Finally, they've set up SmartOS, with the entire source code available at github.

What this means:
- An alternative to VMWare, Xen and KVM on Linux, where you now have the Illumos kernel (a fork of the OpenSolaris kernel) powering your OS level virtualization.

- If it's just ruby/Python/Java/PHP apps you're running, then these can run within a Zone - you need not necessarily have a full blown OS to run your apps.

- You get the benefits of DTrace to trace performance. With KVM based guest OS, you can trace all manner of performance and resource utilization of the VM instance using DTrace.

- High speed VM setups. With ZFS clones, you'll be able to set up Windows VMs at a very high speed. (I've not done this with KVM on Illumos, but I have done this with Xen on OpenSolaris, as well as with ZFS-iSCSI, so I know that this works out very well).

- As per the FAQ, it'll be possible to use the OpenStack infrastructure to manage SmartOS too. This is something that I hope to try out real soon !

All in all, very exciting times ahead !

"But I didn't complete my training!!"

"But I didn't complete my training!!", yelled the apprentice charged with the most enormous task.

"Neither did I", replied the dying master...

We all have it in us to do awesome stuff. We need to stop thinking about why we can't achieve something, and instead focus our energies on thinking about how we can.