Arik Hesseldahl

Big-Data Start-Up Cloudera Kicks Hadoop Up a Notch

If you’re a company working with big data, there’s a pretty good chance you’re doing some work with Hadoop. Inspired by MapReduce, one of the fundamental technologies that make the Google search experience work so well, the Java software framework was created at Yahoo, which had the foresight to offer it to the open source community.

If you know anything at all about Hadoop, then you probably know a thing or two about Cloudera, the well-funded start-up whose aim is to build a business around Hadoop. In the same way that Red Hat makes money helping companies run Linux, Cloudera wants to help companies wrestling with their own big data projects.

The idea sounded cool enough that Jeff Hammerbacher, the guy who built Facebook’s first data team–and who in that capacity used Hadoop to build powerful data analysis applications–was lured away from Team Zuckerberg in 2008. Now he’s Cloudera’s chief scientist. And venture capitalists have been investing in a big way. The company raised $36 million in three rounds from Accel Partners, Greylock Partners and Meritech Capital Partners, and then another undisclosed amount from In-Q-Tel, the venture capital arm of the U.S. intelligence community. And just last week the company hired Kirk Dunn, the former CEO of PowerFile, as its new COO.

Today Cloudera is kicking it up a notch. It will announce that it has not only developed its own distribution of Hadoop that’s been tested and tuned to run in enterprise environments, but also enhanced it with eight other open source add-ons to make it more useful. Cloudera gave it a very long name: Cloudera’s Distribution including Hadoop v3–CDH3 for short.

As much as companies love Hadoop, it’s missing a lot of pieces that would make it so much more useful, Cloudera CEO Mike Olson told me. “Hadoop is a tremendous data storage and processing engine, but there are features that it is lacking, and which make it difficult to use out of the box,” Olson said. CDH3 takes care of all that, adding tools like HBase, a database for Hadoop, and Hive, a set of tools that enables easy queries. Cloudera has taken Hadoop from the Apache site and bundled it with these other open source tools to solve those deficiencies. The distribution also includes all the relevant patches, updates and bug fixes, and it’s been thoroughly tested.

“We believe this platform is the right way for enterprises to consume Hadoop,” Olson told me. “We’ve taken Hadoop and surrounded it with things that make it ready to do real work.” Olson says it’s the only version of Hadoop that’s enterprise ready. Could you download Hadoop and all the various pieces yourself and get the same result that Cloudera has? Sure, you could try. But why go through all that time-consuming work when Cloudera has done the heavy lifting for you?

There are other ways to consume Hadoop in various states of readiness. Amazon’s Elastic MapReduce service comes to mind, as does IBM’s family of InfoSphere BigInsights products (which are based on Hadoop), as well as new companies like Hadapt and DataStax.

So how will Cloudera make money by working with open source software, which is by its very nature free? Hadoop appeals a great deal to CIOs who want to avoid being locked in to vendors that raise prices from time to time. Cloudera will make its living delivering a suite of management, monitoring and administrative tools that are finely tuned to work with Hadoop. While big companies like Facebook and Yahoo can devote the resources to hire dedicated pros to work in their data centers and handle all the Hadoop management tasks, Cloudera’s tools allow ordinary IT staffers to do everything that’s necessary to run a big Hadoop installation. Add to that 24/7 support and regular updates on a subscription basis, and you’ve got Cloudera’s business model figured out.

Customers include Groupon, Rackspace, ComScore, Trulia, Samsung, LinkedIn and Twitter. Its reputation for having been put through its paces in such demanding enterprise environments gives Cloudera’s distribution of Hadoop a leg up when CIOs are choosing between it and the raw version available from Apache or other sources. “It’s more mature than what’s out there generally,” Olson said. “We want to be seen as the vendor of choice.”
