The Problem With Noisy Neighbors in the Cloud
When my daughter was five, I accepted a new position with a longtime client and moved 1,100 miles away. We didn’t want to buy a place in an area that was brand new to us, so we rented an upscale condo up the mountain. Given the location and the price, we expected it to be relatively tranquil. What we didn’t count on was the college students next door, packed two to a room to afford the place and up at all hours of the day and night. We had a problem: noisy neighbors.
This condo tragedy isn’t too far removed from what happens every day in the world of shared storage. We are living in an increasingly cloud-powered world where our workloads are consigned to small slices of enormous compute farms — and that world is increasingly filled with noisy neighbors. Much as the college students next door to my condo didn’t always allow me to sleep at night, the other tenants in a multi-tenant environment may not allow your workload to run smoothly. While CPU and memory have continued massive performance advances thanks to hardware-assisted virtualization, storage remains mired in the realm of incremental advances. With some hard drives now delivering 4 terabytes of capacity without any significant change in I/O performance, you’d be right to ask how long capacities can continue to grow and still have viable performance.
The storage industry isn’t sitting idly by. SSDs, auto-tiering, DRAM cache, and flash cache are all employed to speed storage along. These techniques have significant benefits, but they all suffer from the same problem: no matter how fast you make your storage systems, your compute systems are orders of magnitude faster, so there is always a risk of a noisy neighbor. Even the blazing speed of a pure SSD array can be overwhelmed by extreme utilization from even a modest part of the compute side of a cloud environment. In the 14-year rise of virtualization, huge strides have been made in intelligently segmenting and rationing CPU and memory, and those benefits now accrue to cloud technologies. The storage side of the equation has been far less successful.
The result? Organizations employ tactics to avoid the risks of shared environments. Massive overprovisioning of resources in clouds, dedicated storage platforms attached to shared compute platforms, dedicated shelves in shared storage platforms, or massive horizontal scaling are options used every day. They don’t solve the problem — they avoid the problem, often at great expense or through significant architectural shifts.
Last year I was privy to one particularly nasty noisy neighbor incident with a significant blast radius. Most noisy neighbor problems are much more subtle and insidious than this debacle; they manifest as spotty performance, unexpected hiccups in service, or frustrating scaling difficulties that keep the ops team up too many nights. This time, however, things were really bad.
It began with a shared storage array from a major vendor. The purpose here isn’t to name and shame, but to illustrate how a confluence of events can create a disaster. This array supported auto-tiering between two tiers of storage. One tenant happened to be hitting the array unusually hard. The array launched into its daily auto-tier process, which also added load. To add to the load, a drive then failed. The confluence of those events triggered a bug in the array that caused a hard lockup, and all the LUNs became unreadable. In addition to the interruption to the storage service, all the virtual machines relying on the array as primary storage went read-only and had to be rebooted even after the storage array was fixed.
There was no satisfactory root cause analysis; the vendor had a lot of information but no hard answers. They believed that the anomalously heavy load, the running auto-tier process and the drive failure were all contributing factors. The costs associated with such an outage are huge: dozens of man-hours lost to recovery and post-mortem at the least; interruption of running workloads and all the system administration work to get those systems running again; phone calls to the IT department; tickets opened and closed. If this had been a service provider environment, you could also add service credits and a loss of goodwill to the list.
That’s a worst-case example, but the “little” noisy neighbor impacts happen all day. A database server has a burst of slow performance and has to queue up a lot of extra queries, which consumes a bunch of extra RAM. The web servers that query it back up with requests, the pile-up of web server threads consumes all available RAM while waiting for the database, and those hosts start to go into swap space. Next thing you know, the site has a temporary outage while everything recovers.
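To make that chain reaction concrete, here is a rough sketch of the pile-up as a simple backlog model. Everything in it is an assumption for illustration: the function name `simulate_pileup`, the arrival and service rates, the memory held per waiting thread and the RAM size are all invented, not measurements from any real system.

```python
from typing import List, Tuple


def simulate_pileup(
    arrivals_per_s: int,
    service_per_s: List[int],
    mb_per_thread: int,
    ram_mb: int,
) -> List[Tuple[int, int, bool]]:
    """Return (waiting threads, MB held, swapping?) for each second.

    Hypothetical model: each waiting web thread holds a fixed amount of
    memory until the database answers it.
    """
    waiting = 0
    history = []
    for served in service_per_s:
        # Backlog grows by whatever the database could not serve this second.
        waiting = max(0, waiting + arrivals_per_s - served)
        used_mb = waiting * mb_per_thread
        history.append((waiting, used_mb, used_mb > ram_mb))
    return history


# Illustrative numbers only: the database drops from 200 to 20 queries per
# second for three seconds, then comes back at 200 and later 400.
history = simulate_pileup(
    arrivals_per_s=200,
    service_per_s=[200, 200, 20, 20, 20, 200, 400, 400],
    mb_per_thread=8,
    ram_mb=4096,
)
for second, (waiting, used_mb, swapping) in enumerate(history):
    print(second, waiting, used_mb, "SWAP" if swapping else "ok")
```

With these made-up numbers the web tier tips into swap in the third second of the slowdown and stays there for a second after the database has recovered, because a database that only keeps pace with new arrivals never drains the backlog.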
That’s the problem that confronts us today. The bigger, better arrays are a huge help, but the noisy neighbors are getting louder as well. The need is for QoS controls to help ration access to storage the way that other technologies offer QoS controls for network, CPU and memory. Offering an ironclad QoS guarantee promises to raise the multi-tenant storage array to the levels of reliability and performance previously reserved for dedicated arrays and DAS solutions.
Knowing what questions to ask your cloud service provider can help solve the problem of noisy neighbors and performance variability. For instance, does your CSP work with a storage vendor that offers guaranteed QoS on a storage platform? One storage demo I saw in particular was remarkable. The CSP ran its system with thousands of clients all competing for storage I/O resources, maxing out the capacity of the array. Despite that load, it was able to deliver steady performance to each volume in line with the QoS settings. In addition, it showed instant performance re-allocation by changing the performance of a volume from 500 IOPS to 2000 IOPS, with instantaneous effects. Solutions like this are a game-changer for cloud providers, and those delivering business-critical applications via cloud infrastructure: Guaranteed performance, on demand.
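For readers who want a feel for what a per-volume IOPS limit means, a generic token bucket is one common way to ration a shared resource. The sketch below is only that: a textbook technique swapped in for illustration, not a description of how the array in the demo or any vendor implements QoS. The names `VolumeQoS`, `tick`, `submit`, `set_limit` and `simulate` are hypothetical, and guaranteeing a minimum (as opposed to capping a maximum) takes more machinery than is shown here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VolumeQoS:
    """Token-bucket IOPS cap for one volume (hypothetical, for illustration)."""

    iops_limit: float    # tokens (I/O operations) added per second
    tokens: float = 0.0  # I/Os the volume may issue right now

    def tick(self, seconds: float) -> None:
        # Refill, capping the bucket at one second's worth of tokens so that
        # idle time cannot be banked into an unbounded burst.
        self.tokens = min(self.iops_limit, self.tokens + self.iops_limit * seconds)

    def submit(self, requested: int) -> int:
        # Admit as many I/Os as there are whole tokens; the rest must wait.
        admitted = min(requested, int(self.tokens))
        self.tokens -= admitted
        return admitted

    def set_limit(self, iops_limit: float) -> None:
        # A new limit applies from the next refill onward.
        self.iops_limit = iops_limit


def simulate(volume: VolumeQoS, demand_per_second: int, seconds: int) -> List[int]:
    """Return the I/Os admitted in each one-second interval under constant demand."""
    admitted = []
    for _ in range(seconds):
        volume.tick(1.0)
        admitted.append(volume.submit(demand_per_second))
    return admitted


noisy = VolumeQoS(iops_limit=500)
quiet = VolumeQoS(iops_limit=500)
print(simulate(noisy, demand_per_second=50_000, seconds=3))  # [500, 500, 500]
print(simulate(quiet, demand_per_second=100, seconds=3))     # [100, 100, 100]

noisy.set_limit(2000)  # re-allocate from 500 to 2000 IOPS
print(simulate(noisy, demand_per_second=50_000, seconds=2))  # [2000, 2000]
```

Because each volume draws from its own bucket, the noisy tenant's demand of 50,000 I/Os per second never eats into the quiet tenant's allowance, and raising the limit changes what is admitted from the very next interval.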
The ability to offer such guaranteed performance on a shared platform leapfrogs the utility of DAS and dedicated storage. Cloud environments give you the business agility of service on demand and the flexibility to respond rapidly to changing business needs. Adding resources for a time and then giving them up when they are no longer needed is a major benefit. While the advancement of cloud computing has made those benefits accessible on the compute side, the storage side was left behind by the limitations of rotational disks and the inability to offer ironclad QoS guarantees.
The power of such a solution (such as the storage platform from hot startup SolidFire) lies not only in knowing that you can guarantee a certain number of IOPS on each volume, but in pairing that guarantee with cloud environments, giving the business the agility to burst on the storage array as needed, the way cloud environments already offer that flexibility for compute.
The rapid and automated provisioning world of the cloud demands that storage companies build APIs rich enough to control every aspect of an array. Building the user interface as a layer on top of the API is a demonstration of API and design maturity that shows a solution is future-proofed against demanding cloud orchestration requirements. Designing the solution to be linearly scalable without artificial breakpoints or step functions in performance keeps the provisioning and growth simple and reliable, shutting out the noisy neighbors once and for all.
Matthew Wallace is a 17-year Internet technology veteran and Director of Product Development at ViaWest. He is the co-author of “Securing the Virtual Environment: Defending the Enterprise Against Attack,” published by Wiley in 2012. He was previously a Cloud Solutions Architect at VMware, and the technical founder and Principal Engineer of Exodus Communications’ Managed Security Services team.