Seven More Questions for Gil Elbaz, CEO of the Data Mercenary Factual

When we last left Gil Elbaz, his company Factual had just landed a $25 million round of venture capital funding from Andreessen Horowitz and Index Ventures.

I came up with the phrase “data mercenary” to describe in a fun way what Factual aims to be. If you’re developing an application or a Web service, and you need lots of data, you’re faced with several big problems up front. Where is that data going to come from? How up to date is it? How will you keep it fresh? These are questions that Factual aims to answer, both by supplying the data and helping ensure that it’s maintained. They’re big, complicated questions, and if you were going to ask someone to try and wrestle with them it would be Elbaz. He sold his first company, Applied Semantics, to Google, which went on to turn it into AdSense. Earlier this week I caught up with Elbaz in advance of his Web 2.0 talk.

NewEnterprise: So it’s been a few months since your funding announcement. How have things been going at Factual since then?

Elbaz: They’ve been going really well. We moved into a larger office. We’ve been bringing in lots of good people every other week, and we’ve accelerated the adoption with lots of leads. The places data is really taking off. We decided on a vertical approach to marketing and improving our data, so local is where we’re putting a lot of our resources. We are dabbling in other verticals and when we feel comfortable we’ll invest heavily in other areas. We just haven’t figured out which ones yet. We did recently launch a database of US physicians, which was a pretty significant effort. That’s an example of seeding the environment and starting conversations around a second vertical.

So everyone is talking a lot about “big data” and your talk at the Web 2.0 Expo is about data “haves” and “have-nots.” What do you mean by that?

The focus is to talk not just about big data as in a set of tools you need to process that data, but about how you get access to that data in the first place. The brand new startup in many cases doesn’t have any access to data, so that’s a big challenge, versus someone like LinkedIn, which has a huge batch of data to work from. But then I think every company really needs to act like they need access to much more data. Because no matter who you are there’s a lot of information you can’t access. The question is how does the ecosystem grease the wheels of efficiency of information movement, so that everyone can build much better information services. It’s still fairly stuck, in my opinion, in terms of easily getting information into your app.

So what do you suggest is a solution?

I break it down into many problems. There are six or seven categories of problems, and there are many solutions for each one. One is findability, that is, finding the information you need to access. The Web was built to make information findable by humans, but that doesn’t necessarily mean it’s easy to find data you want to download. It may be government data, or data you want to license from someone. Or it could be an API. There are no big catalogs of structured data, though there’s been some progress from places like Infochimps and Microsoft Data Marketplace, and it’s just starting to happen. Another key issue is, if you know a resource is available, is it easy to integrate? Many legacy data companies don’t have APIs. A lot of government data you have to request on tape and have it shipped to you. But with the advent of faster and cheaper networks, that’s improving. But it’s a chicken and egg. People have to push for these things or they don’t get fixed.

That’s two problems. Do you think people are figuring out that if they have data they need to make it useful by providing some kind of API support?

I think so. A typical Web site is much more likely to use several data sources than it would have several years ago. But I think the average will become greater and greater each year. Really there’s no limit to how many information services you want to access and integrate. That leads to my third issue, which is standards and semantics. A big reason why developers will usually choose only a few sources to integrate is that they tend to be difficult to merge, unlike APIs, because of the lack of common languages for integrating. So if you have several feeds of business information, there’s no universal public identifier for businesses. You’d have to do a lot of work to integrate that information. At Factual we’re trying to popularize our own unique business identifier that we’re happy to distribute and hope that people use. We’re also trying to publish other people’s identifiers, like Foursquare’s. In some ways we really don’t care which one people use as long as a standard emerges.
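To make the identifier problem concrete, here is a minimal sketch of how a shared identifier lets two feeds be merged mechanically. Everything in it is an assumption for illustration: the feed names, the record fields, the `biz-…` identifiers and the `merge_feeds` helper are invented, and none of it reflects Factual’s or Foursquare’s actual identifier schemes or APIs.

```python
from collections import defaultdict

# Hypothetical crosswalk: (source name, source-local ID) -> shared business ID.
# All identifiers and records here are invented for illustration.
CROSSWALK = {
    ("feed_a", "a-101"): "biz-0001",
    ("feed_b", "9931"): "biz-0001",
    ("feed_b", "9932"): "biz-0002",
}

FEEDS = {
    "feed_a": [{"id": "a-101", "name": "Corner Cafe", "phone": "555-0100"}],
    "feed_b": [
        {"id": "9931", "name": "Corner Café", "hours": "7am-6pm"},
        {"id": "9932", "name": "Main St Books"},
        {"id": "9999", "name": "Unknown Diner"},
    ],
}


def merge_feeds(feeds, crosswalk):
    """Group records from several feeds under one shared identifier.

    Returns (merged, unmatched). `merged` maps shared ID -> combined fields;
    `unmatched` lists (source, record) pairs that have no crosswalk entry
    and would still need manual or fuzzy matching.
    """
    merged = defaultdict(dict)
    unmatched = []
    for source, records in feeds.items():
        for record in records:
            shared_id = crosswalk.get((source, record["id"]))
            if shared_id is None:
                unmatched.append((source, record))
                continue
            for key, value in record.items():
                if key != "id":
                    # Feeds listed earlier win on conflicts; later feeds fill gaps.
                    merged[shared_id].setdefault(key, value)
    return dict(merged), unmatched


merged, unmatched = merge_feeds(FEEDS, CROSSWALK)
```

With the crosswalk in place, the two records for the same cafe collapse into one entry even though their names are spelled differently. The record with no mapping is the case that, per the interview, still takes “a lot of work.”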

That’s three problems. What’s number four?

Another one is the economics of data sharing. While in some cases the data that is moving around can be made free by a government or by an e-commerce site that has a big motivation for sharing it, there are many cases where there aren’t any fully fleshed-out models of sharing data, because a lot of companies are worried that if they share their data they’re not going to get paid for it, and they put effort into collecting it. The data marketplaces I mentioned before are a start. There are sites like Mashery that help you monetize your APIs. At Factual we’re trying to build a new model where companies share data with us and we share it back with the community free for most developers; that is, our API stays free. We charge for high usage rates via service level agreements, but for most developers it ends up not being an issue.

So someone like, say, Starbucks might share data about store locations, and this one closed and this one just opened, and this one was just renovated, etc. They could share that data with you?

When I usually talk about a larger company, I’m usually thinking of an app developer who is going to be doing millions of data lookups a day. But in terms of integrating Starbucks’ own data on their own site, it’s probably more accurate than data from anyone else. Which brings me to a fifth issue, which is how do you test data and decide which data you can trust. It’s easy to decide based on the brand, whether it’s the United Nations or Starbucks. But it’s hard to scale it out and be automated. We have some of our own internal tools. But it’s something people tend to ignore. People assume that if they’re paying for data it’s probably good.

By my count that’s something like five problems you’ve identified, which means we’re somewhere near the bottom of your list.

I’ve covered most of them. Another is ownership and rights. If you’re a search engine and you access data on the Web, it doesn’t scale well to understand the terms and conditions of publishing data because a computer can’t read terms and conditions agreements. If you’re a search engine the fine print can probably be ignored. That’s maybe not surprising, but it is interesting that ignoring them has become the norm because it’s simply impossible for a computer to consider them. Creative Commons created six different designations for how you can use content from a given site, say for commercial use or for non-commercial use with attribution. Flickr is an example of a service that’s put Creative Commons tags to use. But I’d love to see more automation happen around this. There are fewer standards for cases where someone doesn’t want to give their information away for free, and for how they get paid when someone reuses it. I’d love to see more automation around that. And there’s a little of that happening around APIs. But the state of the art today is a lot of phone calls and business development. And that’s fine, but if we’re really going to scale the integration of Web-wide information into information services, there’s going to have to be a better way.
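The kind of automation described here can be sketched in a few lines once usage terms are expressed as machine-readable tags. The example below assumes content is labeled with one of the six standard Creative Commons licenses and encodes a simplified reading of each one. The `LicenseTerms` class, the `may_use` helper and the label strings are hypothetical names chosen for illustration, not part of any Creative Commons or Flickr API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LicenseTerms:
    commercial_use: bool  # may the content be used commercially?
    derivatives: bool     # may modified versions be shared?
    share_alike: bool     # must derivatives carry the same license?


# Simplified summary of the six Creative Commons licenses.
# All six also require attribution, which is not modeled here.
CC_LICENSES = {
    "CC BY": LicenseTerms(True, True, False),
    "CC BY-SA": LicenseTerms(True, True, True),
    "CC BY-NC": LicenseTerms(False, True, False),
    "CC BY-NC-SA": LicenseTerms(False, True, True),
    "CC BY-ND": LicenseTerms(True, False, False),
    "CC BY-NC-ND": LicenseTerms(False, False, False),
}


def may_use(license_label, commercial, modifies):
    """Return True if the tagged license permits the intended use.

    Unknown labels return False, so anything without a recognized tag
    falls back to a human reading the terms.
    """
    terms = CC_LICENSES.get(license_label)
    if terms is None:
        return False
    if commercial and not terms.commercial_use:
        return False
    if modifies and not terms.derivatives:
        return False
    return True
```

A crawler could call `may_use("CC BY-NC", commercial=True, modifies=False)` and skip the item when the answer is `False`. Content with no recognized tag still ends up in the “phone calls and business development” path.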


AllThingsD by Writer