Doug Cutting is the founder of Apache Hadoop and creator of numerous successful open source projects, including Lucene and Nutch. Doug joined Cloudera as an architect in 2009 from Yahoo. Doug is also chairman of the Apache Software Foundation, which supports a broad range of open source projects through free software products available to a large community of users.
We spoke with him about his role with Cloudera and learned more about Big Data, training options for IT professionals interested in Big Data, and how Cloudera compares to Red Hat.
Global Knowledge: As founder of Apache Hadoop, at what point did you realize that the software was going to be such an important breakthrough technology? And was there an "Aha" moment for you?
Doug Cutting (Doug): I realized right away when I first saw that it let us do things that we couldn't do before easily. I was interested in building web search engines, but I didn't really think about applications in other areas. I was just looking to get what I wanted to do done. I thought it would probably be useful for other things if somebody had asked me at the time, but I wasn't thinking about challenging database technologies at all.
Global Knowledge: Obviously there's a huge amount of buzz, and when you say Big Data it means a lot of different things to a lot of different people. What are the biggest misconceptions about the technology and what it can do?
Doug: I think that some people think Big Data is sort of Big Brother, that it's this thing collecting everything out there, analyzing it, and using it. People also confuse it with the cloud. Hadoop technology enables people to collect and analyze more stuff, but it's not actually providing the analysis or doing the collection; it's plumbing that lets people get that stuff done.
Global Knowledge: Is it split in terms of self-hosted or hosted?
Doug: At Cloudera, the vast majority of our customers are self-hosting. I think a lot of folks start out doing things hosted in the cloud, but as their clusters get very big and they start to use them a lot, they will bring them in-house. If you're using your cluster 24/7, then it's really expensive to keep it in the cloud compared to hosting it yourself. In terms of Big Data, cloud tends to be an all-or-nothing thing. It's hard to move data back and forth between your own data center and the cloud, so if you want to keep everything in the cloud then you can, but having some data there and some not is inefficient.
Global Knowledge: Cloudera has been compared to a younger, Big Data version of Red Hat. Do you think that comparison is reasonable, or do you think there are real differences in terms of the businesses?
Doug: I think it's a good analogy. I think we're seeing Hadoop become very much like the kernel of a distributed operating system, and Cloudera is a packager of a distribution that is based around that kernel, along with the core applications as well as some proprietary software and services around that distribution. It's a slightly different area in that we're talking about an operating system that runs on top of Linux and runs across a whole cluster of machines. It's a general-purpose platform that we're selling, not a particular solution. We work with other vendors to make sure that their solutions work on this platform and promote the platform in general.
Global Knowledge: How do you keep the technology embraced by traditional proprietary software vendors like IBM and HP? How's that whole relationship working, and how does it impact Apache Hadoop?
Doug: I'm incredibly pleased with how that's gone. I didn't expect Oracle and Microsoft to so quickly and readily adopt Hadoop as a component for their platforms. I thought maybe they would develop their own proprietary technology and compete. They instead elected to join in and work with the open source community in building out Hadoop. So that's been wonderful to see for the Hadoop community, and we very much welcome their involvement. We also have the hardware vendors starting to better support Hadoop and Cloudera.
Global Knowledge: For the IT professional wanting to learn more about Big Data, what would you suggest as a learning path? Do you have to be a Linux-type of person or a Java person? What are the biggest challenges for learning Hadoop?
Doug: It's still a relatively young technology, so there are rough edges. It can be tricky. Cloudera Manager makes it really easy to get things going. If you just want to bring a cluster up, download the free Cloudera Manager and use that. To get started, there are also books, online tutorials, and courses folks can take. But oftentimes the best way to learn anything is to have something that matters to you, some problem which you think you might be able to solve, and figure out whether you can, how you can, and then ask others who are involved concrete questions to try to find out if this technology can help you do what you need to do. That's always the best way to go rather than just learning technology for technology's sake. Learn something that's relevant to your goals.
Global Knowledge: What are your future goals for Apache Hadoop within the Apache Software Foundation? What do you see as sort of the biggest challenges and future challenges for adoption in Hadoop development that are near and dear to your heart?
Doug: I want to see more projects based around Hadoop. We're seeing a better story around compatibility over time, so it'll be easier to develop projects that run against multiple versions of Hadoop. That'll help the whole platform and ecosystem expand. Compatibility, compatibility of data, compatibility of network protocols are all areas that we're improving in and we need to keep improving. Going forward, I think we're starting to see the beginning. Hadoop started as a batch-processing system able to economically store petabytes and process them in ways that you couldn't before - really get through datasets that large.
With batches, there's high latency. There are a lot of applications where you really want to be able to get interactive results and do queries that take seconds rather than hours or minutes and potentially even serve large numbers of simultaneous users doing those kinds of things.
Hadoop can do incremental things, can do interactive things, can support transactional workloads, and I think the challenge is to see if we can meet that promise and really provide the sort of holy grail of computing, something which is scalable to arbitrarily large numbers of computers, arbitrary sizes of data sets, and arbitrary latencies. We can make it faster and faster by adding more resources, and give transactional consistency along with the scalability. I think there was a sense that this was not possible, that you couldn't have it all. Now, it's starting to look like you can have a whole lot of it and maybe even all of it.