A New Storage Paradigm for Big Data
The BIG deal is the disruptive value created by the advent of high-performance digital data capture. This technology has made data collection almost free. Internet-based marketing systems auto-magically capture masses of information about prospective customer preferences. Flash-enabled digital movie cameras can be emptied every night and re-used, a far cry from the film model in which every frame was burned onto expensive media that then had to be manually processed and edited. Digital capture systems also enable capture of exponentially more data per event. Social media sites and businesses increasingly create, store and analyze HD video instead of text, capturing ten to one hundred times more detail per customer or product. The availability of compute power to mine this data for business advantage lets companies like WalMart maintain, and frequently query, multi-petabyte databases to analyze customer patterns and speed decision making. For the first time, solutions that allow productive analysis of video are becoming available as well. Finally, the resulting content is being stored forever: after all, the terabyte per day per oilfield of 3D seismic data being collected today might turn into the next decade’s oil find, and today’s genomic profile might be tomorrow’s cure for cancer.
As a result of this “free” data capture, increasing information granularity, more frequent usage and extended data value, businesses, research institutions and governments are growing enormous stores of large unstructured data (increasingly video) that need to be stored and managed. This growth presents a number of challenging data storage problems: extreme scalability; overall affordability; balancing cost against easy online access; maximizing application and user access; and assuring data durability.
Like the cavalry, a new set of “wide area storage” solutions based on second-generation object storage is arriving just in time to help enterprises manage these issues. Object storage is a distinctive storage architecture that uses a kind of valet parking ticket system to store and retrieve data. The creator of a piece of data (an object) hands the object to the storage system in exchange for an object identifier (the digital equivalent of a valet parking ticket). When the data is later needed, the user hands the system the object identifier, and the data is returned. The power of this model is that it is highly scalable, since many objects can be stored and retrieved in parallel, and retrieval can be independent of the original application or physical location: any application or user with authorized access to the electronic key can use the data. Historically, this technology has been used to store data in large, containerized archives, but these were limited by the size and performance of the object storage “box.”
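The valet-ticket model above can be sketched in a few lines of Python. The names here (`ObjectStore`, `put`, `get`) are illustrative, not any vendor’s actual API:

```python
import uuid


class ObjectStore:
    """A toy object store illustrating the 'valet ticket' model:
    hand over data, receive an opaque identifier, retrieve by identifier."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # The object ID is the "valet ticket": opaque and location-independent.
        object_id = uuid.uuid4().hex
        self._objects[object_id] = data
        return object_id

    def get(self, object_id: str) -> bytes:
        # Any caller holding the ticket can retrieve the object,
        # regardless of which application originally stored it.
        return self._objects[object_id]


store = ObjectStore()
ticket = store.put(b"customer-preference record")
assert store.get(ticket) == b"customer-preference record"
```

Because the ticket is opaque, the store is free to place the bytes anywhere; the caller never deals with a file path or a volume.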
What makes the latest generation architecture different is its ability to copy and disperse objects very efficiently across large numbers of independent processing and storage elements, which can even be distributed across wide geographic areas, providing disaster recovery protection without the need for traditional replication. This wide area storage essentially frees object storage from the constraints of a box or even a site. Wide area storage is similar to storing data in the internet or cloud: data is freed from the traditional boundaries of expensive storage hardware and processes. The result is a set of capabilities perfectly suited to the challenges of Big Data: almost infinite scale; low cost per petabyte; the ability to afford 100 percent online data storage; global multi-application access; and a content store that can live forever without ever needing to endure a disruptive data migration.
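One simple way such dispersal can work, sketched below under assumed site names, is to rank sites by a per-object hash and keep the top few: every object deterministically lands on several geographically separate sites, with no central placement table to consult:

```python
import hashlib

# Hypothetical site names for illustration only.
SITES = ["us-east", "us-west", "eu-central", "ap-south"]
COPIES = 3


def placement(object_id: str, sites=SITES, copies=COPIES):
    # Rank sites by a hash of (site, object id); the top `copies` sites
    # hold this object's replicas. Deterministic and index-free.
    ranked = sorted(
        sites,
        key=lambda site: hashlib.sha256(f"{site}:{object_id}".encode()).hexdigest(),
        reverse=True,
    )
    return ranked[:copies]


print(placement("obj-42"))  # three distinct sites, the same every time
```

Any reader of the object can recompute the same placement from the object ID alone, which is what makes retrieval independent of the original application or location.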
How does it achieve this? First, as previously described, the underlying object storage is natively extremely scalable. Unlike systems with centralized indices, a wide area storage system grows by simply adding more objects, which are dispersed across more scaled-out components to add access, performance or storage. And the hardware architecture is scale-out: if you need more storage, performance or communications bandwidth, you just plug in more pre-packaged storage, processors or network capacity, and the system takes care of the rest. Growth is simple.
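The claim that growth is simple can be illustrated with rendezvous (highest-random-weight) hashing, one common placement scheme for scale-out stores; the node names below are hypothetical. When a node is plugged in, only the objects the new node “wins” move to it, and everything else stays exactly where it was:

```python
import hashlib


def owner(object_id: str, nodes) -> str:
    # Rendezvous hashing: the node with the highest per-object score wins.
    return max(
        nodes,
        key=lambda node: hashlib.sha256(f"{node}:{object_id}".encode()).hexdigest(),
    )


objects = [f"obj-{i}" for i in range(1000)]
old_nodes = ["n1", "n2", "n3", "n4"]
new_nodes = old_nodes + ["n5"]  # scale out by plugging in one more node

before = {o: owner(o, old_nodes) for o in objects}
after = {o: owner(o, new_nodes) for o in objects}
moved = [o for o in objects if before[o] != after[o]]

# Only the objects the new node "wins" move; the rest stay put.
print(f"{len(moved)} of {len(objects)} objects moved, all to n5")
```

With five nodes, roughly a fifth of the objects relocate (all of them to the new node), so capacity grows without a wholesale reshuffle.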
Second, much like the systems we’ve all read about at Google, wide area storage systems are built under the assumption that individual hardware components will fail. Because objects are copied and dispersed across many storage and geographic resources, a multitude of components can fail while data remains continuously available. This “failure-assumed” model allows wide area storage to operate on lower-cost, off-the-shelf disk and processor technologies, which translates to much lower capital cost than traditional disk storage. The ability to defer replacement of failed components also lowers support and operating costs. And because multiple users and sites can share the system, the overhead of storage and disaster protection is spread across all users. No archive silos.
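A minimal sketch of the failure-assumed read path, using toy `Node` objects of our own invention: copies are written to several nodes up front, and a read succeeds as long as any one replica holder survives:

```python
class Node:
    """A toy storage node that can be marked dead to simulate failure."""

    def __init__(self, name: str):
        self.name = name
        self.alive = True
        self.blobs = {}


def write(nodes, object_id: str, data: bytes):
    # Disperse a full copy to every node in the replica set.
    for node in nodes:
        node.blobs[object_id] = data


def read(nodes, object_id: str) -> bytes:
    # Failure-assumed: try each replica holder; any survivor suffices.
    for node in nodes:
        if node.alive and object_id in node.blobs:
            return node.blobs[object_id]
    raise IOError("all replicas unavailable")


nodes = [Node(f"n{i}") for i in range(3)]
write(nodes, "seismic-001", b"trace data")
nodes[0].alive = False
nodes[1].alive = False
assert read(nodes, "seismic-001") == b"trace data"  # still readable
```

Because reads route around dead components, failed disks can simply be left in place until a convenient maintenance window, which is where the deferred-replacement savings come from.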
Third, the native cloud-like access model of these new offerings embeds the capability for “wide” geographic and application access, supporting a broad range of uses – from streaming data (like video, sensor information and genomic sequences) to parallel processing (like Hadoop for analytics). Some solution vendors also provide easier access for existing applications, including policy-based tiering from traditional disk and the ability to appear as a NAS archive.
Finally, and perhaps most intriguing, these new solutions allow content, once stored, never to be migrated again. This may be the most critical element of all. Any user or technology manager who has endured the pain of unloading a broken, filled or obsolescent NAS or traditional block storage array in order to migrate to the next big thing knows that pain; with next-generation object storage, you never need to do it again. That will be critical when your Big Data store holds hundreds of petabytes of high-value data. Forever.
Janae Stow Lee is senior vice president for Quantum’s file system and archive products.