Five Reasons to Throw Your Data Away


Common wisdom when it comes to Big Data is that you should keep it all. The reality, however, is far different. Keeping all your data is more expensive — and carries more risk — than it first appears. Here are five key reasons to throw your data away.

More data equals more risk.
Take the case of personally identifiable information. Organizations have historically used policy to govern what data to keep and for how long.

But according to market research firm Forrester, organizations nearly double the amount of data they store each year. As the amount of data the typical organization has to manage keeps growing, applying policy consistently becomes far more complex.

The more customer financial records, patient health information and old emails you keep around, the more risk you run in the case of a data breach, whether accidental or intentional.

More data means more noise.
The benefit of having more data is often outweighed by the difficulty of finding the information you need. Just like the old problem of finding a needle in a haystack, the bigger the haystack, the harder it is to find the needle.

It’s true that today’s software can search through millions of documents, emails and other records. But that also means you get far more results returned, results that you then have to comb through to find the one you’re looking for.

All too often, the results you actually need, the relevant ones, are hard to find. Separating signal from noise becomes progressively harder the more noise you have to sort through.

Storing your data isn’t free.
While the cost to store an individual byte of data is declining, total storage costs for organizations are rising. This ironic situation is a result of the Jevons Paradox: The less expensive a given resource becomes, the more of it people consume.

The paradox takes its name from economist William Stanley Jevons, who observed during the Industrial Revolution that as coal became cheaper to use on a per-unit basis, overall consumption, and with it overall spending, went up rather than down.

The Jevons Paradox has been used to understand many economic trends since, from the usage of large-scale energy resources to that of public clouds such as Amazon Web Services and Microsoft Azure.

Historically, one way to address issues around data storage was to look for copies of files and discard the duplicates. Legal requirements, compliance issues and the rapidly increasing volume of data, not to mention a growing number of file formats and data linkages and dependencies, have made it far more difficult to know what to keep and what to discard. Deciding which data to discard now requires sophisticated software tools that can not only evaluate the data itself but can do so in conjunction with stated corporate policies.
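The older approach of finding copies and discarding duplicates can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular product: the function names `file_digest` and `find_duplicates` are hypothetical, and the sketch only detects byte-for-byte identical files. It does nothing to answer the legal and policy questions raised above.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> list[list[Path]]:
    """Group files under root whose contents are byte-for-byte identical."""
    by_digest: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    # Keep only digests shared by more than one file.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

On storage grounds alone, only one file in each returned group needs to be kept. Whether any given copy may actually be deleted remains a policy decision that a hash comparison cannot make.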

To make matters even more challenging, scientists don’t expect storage costs to keep declining at the rate they have in the past. Kryder’s Law, the disk-storage analog of Moore’s Law, describes a history of disk costs dropping about 40 percent each year for the past three decades. Given the longevity of the law, said David Rosenthal of Stanford University, it may be tempting to think that it will keep holding for a few decades more.

But reality, again, is far different. According to Rosenthal, if storage costs are five percent of your IT budget this year, within 10 years storing the same relative amount of data will cost more than 100 percent of your budget. As a result, data storage practices must change.
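To see how sensitive that kind of projection is to the rate of cost decline, here is a rough sketch in Python. It is not Rosenthal's model, whose inputs are not given here. The function `storage_share` and its parameters are hypothetical, the yearly doubling of data and the 40 percent cost decline are borrowed from the Forrester and Kryder's Law figures cited above, and the slower 20 percent decline and the flat budget are assumptions made purely for illustration.

```python
def storage_share(
    initial_share: float,
    data_growth: float,
    unit_cost_decline: float,
    budget_growth: float,
    years: int,
) -> float:
    """Project storage spending as a fraction of the IT budget.

    Each year the data volume grows, the cost per byte falls and the
    budget grows; the share is scaled by the ratio of those effects.
    """
    share = initial_share
    for _ in range(years):
        share *= (1 + data_growth) * (1 - unit_cost_decline) / (1 + budget_growth)
    return share


# Data doubles yearly, cost per byte falls 40 percent a year, flat budget.
with_kryder = storage_share(0.05, 1.0, 0.40, 0.0, 10)

# Same data growth, but cost per byte falls only 20 percent a year (assumed).
with_slowdown = storage_share(0.05, 1.0, 0.20, 0.0, 10)

print(f"Share after 10 years at a 40% yearly decline: {with_kryder:.0%}")
print(f"Share after 10 years at a 20% yearly decline: {with_slowdown:.0%}")
```

Under these assumptions, even the historical rate of decline leaves storage at about 31 percent of the budget after a decade, and the slower rate pushes it to several times the entire budget. The exact numbers matter less than the shape of the curve: when data grows faster than unit costs fall, the gap compounds.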

Data analyzed is a lot more valuable than data stored.
Big Data advocates are fond of saying you should store your data and figure out how to analyze it later. But just storing data doesn’t get you more insights; using the right software to analyze that data does.

Put another way, stored data has only theoretical value, while analyzed data has practical value. Google had to crawl the Web to make it searchable, but the company’s real value is its search capability, not its ability to store Web pages.

In another example, baseball fans had been tracking player and team statistics for years as a hobby. But the teams themselves made player decisions without that data, choosing players with the same non-data-driven techniques they had relied on for years.

It wasn’t until general manager Billy Beane of the Oakland A’s decided to put the analysis of those statistics to work that the data really became valuable, enabling the A’s to build a winning team. Had that data remained stored and nothing more, its value would have been purely theoretical. Its practical value after the A’s analyzed it was far greater.

The more data you have, the less you can trust it.
One need look no further than the financial crisis of 2008 and the years required to sort through the resulting mess of financial records to see how difficult it is to trust your data when you have multiple versions of it.

In an ideal world, you would maintain one copy of data you think is important to run your business, or is required for regulatory compliance. But once again, reality is far different. With multiple people working on different versions of documents and with different versions of those documents moving around in email, on laptops, and in various storage repositories, keeping one version of your data is nearly impossible.

That holds true whether we’re talking about financial transaction records or corporate documents. Multiple versions of content can and will exist. One of the few ways to deal with the issue is to discard versions that are no longer needed.

If 2012 was “the crossover year” for Big Data in general, 2013 is clearly the year of unstructured data. That’s because unstructured data — the kind of data found in tweets, Facebook posts, documents and emails — is the fastest-growing type of Big Data. It’s the kind of data that presents both the biggest challenge and the biggest opportunity for large organizations.

Every day, Twitter users send more than 400 million tweets while Facebook users post some 2.5 billion status updates. By some estimates, people are creating more than half a trillion new words of unstructured data each month through social media. That makes for more communication, but it also makes it harder to separate the signal from the noise, what matters from what doesn’t.

Making sense of unstructured data is something we as humans do every minute of the day. We use patterns to recognize the faces of friends and colleagues, to understand the spoken word and even to decide where to go out to eat. Without consciously thinking about it, we process immense amounts of unstructured data to make decisions about what to do.

But when it comes to business data, such pattern matching isn’t so easy. Unlike choosing where to eat, making decisions about corporate data doesn’t come down to personal preference.

Rather, corporate data is subject to organizational requirements. Which document or email from among the many millions we and our colleagues generate each day might be relevant to a business decision tomorrow — or a litigation issue a year from now? It’s virtually impossible to tell without the right tool set.

While common wisdom may suggest otherwise, when it comes to data storage, adopting a contrarian point of view makes a lot of sense. The changing nature of storage costs is rapidly making it impractical to keep all your data, and the associated risks of doing so are simply far too great to ignore.

Recommind specializes in unstructured data management and analysis technology. Prior to joining Recommind, Bob served as the managing director of Swiftsure Capital and in senior positions in the Java Software division of Sun Microsystems.
