More Data Beats Better Algorithms — Or Does It?

Anand Rajaraman from Walmart Labs had a great post four years ago on why more data usually beats better algorithms. He cited a competition modeled after the Netflix challenge, in which he had his Stanford Data Mining students compete to produce better recommendations based on a data set of 18,000 movies. It turned out that the winning team had a very rudimentary algorithm but won because it appended data about the movies from outside the original data set (they used IMDb). That simple study showed that a simpler algorithm with more data can beat a more sophisticated algorithm with less data.

According to this line of thinking, Google proved this same lesson years ago when it showed that PageRank could outperform keyword extraction techniques (used by other search engines at that time) by leveraging data from outside the page itself (i.e., the votes that page creators made by choosing their outbound links, which defined the network topology of the Web). History is repeating itself now with Facebook, which is using detailed data about friendships (which defines the social network topology of the real world) to give it a leg up over other media companies. It is this same underlying notion that led Alex Rampell, the CEO of TrialPay, to say, “Payment data is more valuable than payment fees.” According to Alex, “Connecting the bank accounts of buyers and sellers will never be as valuable, nor defensible, as connecting buyers and sellers.”
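To make the "links as votes" idea concrete, here is a minimal sketch of a PageRank-style computation by repeated redistribution of rank along outbound links. It is not Google's production method; the four-page toy graph, the damping factor of 0.85 and the iteration count are all illustrative assumptions, and the function name is hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by the 'votes' carried on their inbound links.

    links maps each page to the list of pages it links out to.
    All names and parameter values here are illustrative assumptions.
    """
    # Collect every page that appears as a source or a target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among its outbound links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its vote over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical toy web: three pages link to "c", and "c" links to "a".
toy_web = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Nothing in this sketch looks at the words on a page. Page "c" ranks highest only because other pages chose to link to it, and "a" outranks "b" and "d" only because "c" links to it.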

Is this love of more data really well-founded? When do you know enough? In a world in which the amount of existing data is doubling every year, when do you shift your focus from yet another incremental attribute to writing a better algorithm to help you handle the deluge? How do you avoid being overwhelmed by the noise? Is there a tipping point at which more is less?

To better illustrate the problem, we need only remember Solomon Shereshevsky — a man with an unusual mind, who remembered in great detail everything that happened to him. His problem was that even though he remembered every detail, his brain failed to create high-level connections between the details. For example, if he saw a face, he remembered it exactly, but he failed to connect that memory to a memory of the same face with a different expression. Therefore, he had trouble connecting these faces with the person they belonged to. Similarly, when you spoke to him, he could memorize every word you said, but he would have trouble understanding your point. (A.R. Luria, “The Mind of a Mnemonist: A Little Book about a Vast Memory.”)

A healthy mind, on the other hand, has a more successful strategy for dealing with what could otherwise be a data deluge. According to Leonard Mlodinow, the author of “Subliminal,” our brain processes 11 million inputs a second. It does so gracefully, because when we experience something in the real world, we aren’t just adding an isolated memory into our minds. The brain quickly connects the salient features of the memory to an entire lifetime of connected memories, including images, smells, sounds, touch and emotions. Think about how many times a smell has helped you retrieve a memory about an event. A simple fact can be transformed instantly just by being associated with other facts. That is why someone who records every fact but fails to extract the meaningful connections is at a severe disadvantage.

This leads us to our central point. Algorithms shouldn’t be one-way filters that take data out and put it to use outside of the system. Rather, the algorithm’s output is itself data that enhances the data asset. Even though BlueKai processes one trillion data transactions a month, we believe that the real value isn’t in the raw volume; it is in the degree of connectedness that is analytically overlaid onto the data to make it more interrelated. For example, the addition of data on sport water bottle purchase intent doesn’t just enhance the water bottle category, which on its own might be rather uninteresting. By analyzing the behavior, for instance, of people who purchase water bottles for biking, we learn that these same people tend to own high-end vehicles. Apparently, people who like biking for sport tend to have the drive and money to enjoy a more expensive vehicle. These bicyclists also tend to take more island vacations, and, not surprisingly, so do their friends. Therefore, an isolated behavior, when evaluated and connected, can produce unexpected value.
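One simple way to picture an algorithm whose output becomes new data is a co-occurrence statistic such as lift, computed across behaviors and stored back alongside the raw records. The sketch below is an illustration under stated assumptions, not a description of BlueKai's system: the profiles, attribute names and function name are all invented for the example.

```python
from collections import Counter
from itertools import combinations


def pairwise_lift(profiles):
    """Return the lift for every pair of attributes seen together.

    Lift is P(a and b) / (P(a) * P(b)); values above 1 mean the two
    attributes appear together more often than chance would suggest.
    """
    n = len(profiles)
    single = Counter()
    pair = Counter()
    for attrs in profiles.values():
        single.update(attrs)
        # Sort so each pair is always stored under the same key order.
        pair.update(combinations(sorted(attrs), 2))
    return {
        (a, b): (count / n) / ((single[a] / n) * (single[b] / n))
        for (a, b), count in pair.items()
    }


# Hypothetical profiles, invented purely for illustration.
profiles = {
    "u1": {"bike_water_bottle", "luxury_car"},
    "u2": {"bike_water_bottle", "luxury_car", "island_vacation"},
    "u3": {"bike_water_bottle"},
    "u4": {"island_vacation"},
    "u5": {"economy_car"},
    "u6": {"economy_car"},
}

# The resulting table is itself a new layer of data about the data.
lift = pairwise_lift(profiles)
for (a, b), value in sorted(lift.items()):
    print(a, b, round(value, 2))
```

The point of the sketch is the direction of flow: the lift table is written back as an annotation on the data asset, so the next analysis starts from connected attributes rather than isolated ones.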

This brings us back to the original question. If you have to choose, having more data does indeed trump a better algorithm. However, what is better than just having more data on its own is also having an algorithm that annotates the data with new linkages and statistics which alter the underlying data asset. That way, the addition of each new algorithm radically improves the underlying data asset, just like the addition of a sensory input improves the way we experience the world around us.

Are you living in a world in which more data provides diminishing returns (like Solomon Shereshevsky), or are you living in a world in which more data truly is better?

Omar is the co-founder and CEO of BlueKai, the industry’s leading data activation system that supplies both Fortune 100 companies and leading publishers with solutions for managing and activating first- and third-party data for creating highly effective customer and marketing campaigns. Omar’s previous roles include Chief Advertising Officer for mobile search and advertising solution Medio and Chief Marketing Officer for early behavioral data leader Revenue Science.
