With data, the whole is often greater than the sum of the parts: what you can learn from combining two data sets is often much more valuable than what you can learn from either of them individually. That’s why data lakes and distributed systems like Hadoop have become so popular in recent years. But these systems are complex, and no amount of technology will remove the need for human skill and expertise. Here are some of the things I’ve learned from working with Hadoop.
Hadoop is Scalable and Flexible
Hadoop was originally developed at Yahoo to help build a better search engine. Some of its earliest use cases included understanding user behavior on websites at a very fine-grained level, in order to support various forms of experience optimization as well as commercial advertising models. Its design imperatives focused on allowing a broad range of downstream analysis of massive amounts of data, where you couldn’t necessarily anticipate all the kinds of analyses you were going to do as you were building the system. If you are constantly running new user tests, then you need an extensible system that will allow you to do things like A/B test different versions of an ad by showing them to two different groups of a million people. This kind of digital experimentation started with web properties, but because of the digitization of the entire economy, this approach is now being applied in many other industry domains as well.
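The A/B test described above boils down to a simple statistical question: did the group shown variant B respond at a higher rate than the group shown variant A? A minimal sketch of that check, using a standard two-proportion z-test and entirely hypothetical click counts, might look like this:

```python
from math import sqrt, erf

def ab_test(clicks_a, n_a, clicks_b, n_b):
    """Two-proportion z-test: did variant B convert better than variant A?"""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    # Pooled conversion rate under the null hypothesis (no difference)
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # One-sided p-value from the standard normal CDF
    p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))
    return z, p_value

# Hypothetical numbers: two groups of one million users each
z, p = ab_test(30_000, 1_000_000, 30_600, 1_000_000)
```

The statistics are the easy part; the reason you need a Hadoop-style system is to capture and join the raw impression and click logs, at full granularity, from which counts like these are derived.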
It’s not a far leap, really, from online search-engine optimization (“I’m trying to get people to respond to my ads”) to something like healthcare intervention optimization (“I’m trying to get people to respond to my phone calls asking them to take their blood pressure medication”). Other industries benefit from more general pattern recognition: the United States Postal Service, for instance, can optimize its distribution network based on real-world observations on the street. These days, distributed systems like Hadoop can also be used to optimize very complex manufacturing processes. Merck optimizes its pharmaceutical yield based on a deep analysis of the fermentation processes involved in generating the compounds.
The Power and Promise of Distributed Systems
Despite being made possible by technologies like Hadoop, data products and ecosystems such as an optimized healthcare management portal don’t just fall out of the sky fully formed. They are the fruit of experimentation designed to test, and then optimize, a particular hypothesis. They are the outcomes that can be achieved by what we call an Experimental Enterprise.
A hallmark of these companies is the use of cost-efficient, on-demand cloud architectures to try things out cheaply and quickly iterate toward optimal solutions. In order to do that, there must be a tight feedback loop, so that the conclusions you draw can be used to update your designs and pushed into production rapidly. This allows you to create a streamlined system in which the data, metadata, and analysis you need to run your business at any given point is gathered methodically, rather than duct-taped and paper-clipped together after the fact. It is the companies that get this feedback loop going for their business who ultimately realize the most value.
Pitfalls to Avoid
If the innovations that distributed systems like Hadoop make possible are so self-evidently awesome, then why aren’t more people using them? The answer is that using these technologies for competitive business advantage is not yet a well-understood path, and there are challenges along the way.
Whether in healthcare, manufacturing, or digital properties, there are two significant places where companies get stuck. When they first install a distributed system like Hadoop, many select an initial use case that is suited to the technology (an exercise in understanding “speeds and feeds”) rather than a use case that is valuable to their business. Companies often hear from their Hadoop vendor that they should simply gather all their data in one place, do some analytics, and value will somehow materialize, so they run a test that proves Hadoop can store massive amounts of unstructured data. And that’s great, but what if all your actual business problems have nothing to do with that? Identifying clear strategic objectives for the technology, driven by the business and not by technical interests, is critically important to success. Unless the business has a good idea up front of where value might lie, you can get stuck right from the start. Execute a proof-of-concept that doesn’t provide value to the business, and no one will care.
The second major challenge area concerns operational complexity. People often underestimate the effort required to teach folks who are familiar with managing traditional data technologies like Oracle and DB2 how to work with a Hadoop-style infrastructure. The problems and error conditions exhibited by systems like Hadoop look highly unfamiliar to people who haven’t seen them before. And as usage grows, many of these systems need more active management, or else they will begin to fail. We’ve had clients struggle to realize the value of Hadoop because, operationally speaking, they weren’t able to get a team in place who could manage the technology. This can lead not only to operational problems, but also to political ones; the capability may be great and the business proposition may still be sound, but if the user community thinks the system is unstable, then they will quickly turn on it.
What the Future Holds
There are indeed challenges to getting business value from Hadoop, but it should give you some confidence that other companies have successfully navigated these journeys already. And many others are well positioned to benefit: because the scaling characteristics are such an important part of Hadoop, it’s a natural fit for large companies with huge customer bases that need to work with data at massive scale.
These scaling characteristics also make it a great fit for emerging areas such as the Internet of Things, or IoT, which is about coordinating the physical environment by collecting data from large numbers of physical objects. Whether you’re working with sensors on airplane engine parts or an offshore oil platform, if you can economically store all that data, then there’s a lot to be learned from it. Hadoop is a platform that makes that possible.
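The kind of analysis Hadoop runs over sensor data like this follows the map-and-reduce pattern at its core: map each log record to a key/value pair, group by key, then reduce each group to a summary. A minimal local sketch of that pattern, using hypothetical sensor IDs and readings (a real job would distribute these phases across a cluster), might look like this:

```python
from collections import defaultdict

# Hypothetical sensor log lines: "sensor_id<TAB>temperature_reading"
log_lines = [
    "engine-1\t412.5",
    "engine-2\t398.0",
    "engine-1\t431.2",
    "engine-2\t404.7",
]

def mapper(line):
    # Map phase: parse one record and emit a (key, value) pair
    sensor, reading = line.split("\t")
    yield sensor, float(reading)

def reduce_max(pairs):
    # Shuffle + reduce phase: group values by key, keep the peak per sensor
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: max(values) for key, values in groups.items()}

pairs = [kv for line in log_lines for kv in mapper(line)]
peaks = reduce_max(pairs)  # → {'engine-1': 431.2, 'engine-2': 404.7}
```

The value of Hadoop is that the same mapper and reducer logic keeps working when the log is billions of lines spread across hundreds of machines, not four lines in memory.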
Hadoop is a great tool for understanding complex interactions or processes, because it has characteristics that allow you to experiment quickly and cheaply. Just make sure you’re applying the technology to something of business value. That means choosing your early tests wisely, creating effective feedback loops so that the results of your experiments improve your products, and anticipating the oversight and training needed to successfully manage your new system.