Premise: The Hadoop community turns ten years old this year, and it’s starting to show signs of adolescence. As big data problems evolve, Hadoop is trying to keep up, but it can only be stretched so far.
Big data tools and platforms are widely available, but the path from procurement to business value remains complex and labor intensive. Partly that’s because the business problems being addressed by these tools are complex, but mostly it’s because the big data community — which includes users and vendors — is too focused on tools and not focused enough on applications. We’ve seen this pattern over and over in the tech industry: the tools and platforms that work for one class of applications go through a process of “adaptive stretch,” beyond which they become unproductive for new applications. Once adaptive stretch stops paying off, a new set of tools and platforms emerges for a new class of applications.
Is a changing of the guard taking place in big data and analytics? The answer is “yes.” For the past decade, infrastructure technologies, in the form of data lakes built on clustered servers, commodity storage, and Hadoop with its zoo of engines, garnered most of the attention. But as data lakes, Spark, and cloud deployment patterns become clearer, big data decision makers are starting to focus on new questions, including:
- How can we avoid creating big data legacy technologies? The Hadoop toolkit is setting the trend for new technologies, but it is also reshaping the businesses where it’s deeply embedded. The business value being generated in some accounts is impressive, but so is the sunk cost where Hadoop is being applied to do things it wasn’t engineered to do.
- Why hasn’t Hadoop broken out of the big companies? Hadoop was supposed to bring big data analytics to the masses, but it still requires a variety of scarce, highly specialized administrative skills.
- What are the “safe” paths forward?
Can We Avoid Creating A New “Big Data” Legacy?
As a community of users gains experience with a platform technology, they start to “stretch” it to solve problems the platform wasn’t originally designed for. At some point, the platform encounters diminishing returns — it costs more to stretch than it yields in benefits. At that point, the community has a strong incentive to migrate to a platform technology specifically designed for a new, much larger class of problems. (See Figure 1).
The Hadoop ecosystem is clearly in the grip of adaptive stretch. One of the original attractions of Hadoop — its mix-and-match flexibility of tools — is now exacerbating the adaptive stretch challenge. The organization and governance of the Hadoop ecosystem is also a major contributor to Hadoop’s growing interoperability challenges. Why? Because a large number of vendors, integrators, and customers are free to pursue their own Hadoop visions based on their own efforts to stretch it. As a result, integration costs among Hadoop-related components are rising rapidly. Cloudera told Wikibon that it budgets 50% of the engineering effort for each new component in its distribution to integration; only the remaining 50% goes to adding functionality. This mix-and-match complexity explains why customers often need five different admins with specialized skills to operate a 10-server pilot.
One way to visualize administrative complexity is to look at a matrix of admin tasks by layers in the stack from applications down to datacenter infrastructure (See Figure 2). Precise performance monitoring of all the application components can involve dozens of different consoles. Multiply that by the other admin tasks and the other layers in the stack and it becomes clear how hard it is to retrofit a simple admin model around an infrastructure platform as complex as Hadoop.
Similar complexity confronts Hadoop developers leveraging those same specialized engines and utilities: each tool or engine has a different developer API (see the left picture in Figure 3), as the sketch below illustrates.
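To make that API divergence concrete, here is a minimal word-count sketch written two ways in Python. The file names, input paths, and output paths are hypothetical, and the snippets assume a working Hadoop Streaming setup and a PySpark installation; they illustrate the difference in programming models rather than any particular vendor’s distribution.

```python
# mapper.py: Hadoop Streaming mapper. Emits a (word, 1) pair per token on stdout;
# the framework handles the shuffle/sort between mapper and reducer processes.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py: Hadoop Streaming reducer. Sums counts for each word in the
# already-sorted key/value stream arriving on stdin.
import sys
from itertools import groupby

pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
for word, group in groupby(pairs, key=lambda kv: kv[0]):
    print(f"{word}\t{sum(int(count) for _, count in group)}")
```

```python
# wordcount_spark.py: the same job against Spark's RDD API, expressed as a single
# short pipeline with no separate mapper/reducer processes or stdin/stdout contract.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
counts = (spark.sparkContext.textFile("hdfs:///data/text/")    # hypothetical input path
          .flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs:///tmp/wordcount_out")              # hypothetical output path
spark.stop()
```

The first pair of scripts would typically be launched with the hadoop-streaming JAR, while the Spark version is submitted with spark-submit; each additional engine in the stack (Hive, HBase, Storm, and so on) layers yet another programming model and deployment path on top of these.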
Using Hadoop: Only the Biggest Companies Need Apply
While each class of applications leverages the prior one as part of its underlying platform, customers must master a new set of skills to build and operate the emerging class of more sophisticated applications. But if the emerging set of more sophisticated applications requires correspondingly more sophisticated skills, those applications will appeal to a progressively smaller segment of companies. We are seeing evidence of this skills challenge unfold with Hadoop’s adaptive stretch. Spiderbook, a big data CRM company, used a specialized web crawl to analyze the adoption of big data infrastructure platforms. The analysis shows that only 492 U.S. companies have scaled a Hadoop production system to more than 100 nodes or hit other, similar measures of large-scale production. More remarkably, 486 of the top 713 customers are tech companies, including ad-tech. In other words, most production Hadoop deployments are in the technically sophisticated tech industry.
At this point in Hadoop’s evolution, terms and concepts are multiplying faster than applications and technologies that simplify implementations. Again, this is a natural feature of adaptive stretch, but over the next year Wikibon expects a shakeout of concepts in the community, which will lead to a cleaner picture of Hadoop best practices and architectural conventions and better facilitate tool comparisons and integration. Additionally, we expect a streamlining of key roles to begin, with a clearer understanding of how big data work will be institutionalized among data scientists, data engineers, data stewards, big data administrators, and other roles.
What Is the “Safe” Path Forward?
Mainstream adoption of big data tools and capabilities will require much greater platform simplicity. We see two promising approaches: Spark and cloud native services. Spark is purpose-built to solve many of the problems we discovered as we used Hadoop. Our research shows that it’s not perfect, that it doesn’t stand entirely on its own, and that it may well suffer its own adaptive stretch as reality tries to keep up with the hype. Nonetheless, it is proving an excellent complement to Hadoop; Spark actually extends the usefulness of Hadoop by reminding us where Hadoop naturally fits in the big data ecosystem.
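As a rough illustration of that complementary fit, here is a minimal PySpark sketch in which Spark supplies the in-memory DataFrame engine while HDFS keeps its natural role as the data lake’s storage layer. The paths, column names, and schema are hypothetical, and the job assumes it is being submitted to an existing Hadoop/YARN cluster with PySpark installed.

```python
# A minimal sketch: Spark handles the computation, HDFS handles the storage.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("clickstream-rollup")
         .getOrCreate())  # typically submitted with --master yarn on a Hadoop cluster

# Hypothetical Parquet data already landed in the HDFS-backed data lake.
clicks = spark.read.parquet("hdfs:///lake/raw/clickstream/")

# The aggregation runs in Spark's memory-optimized engine rather than in MapReduce.
daily = (clicks
         .withColumn("day", F.to_date("event_ts"))
         .groupBy("day", "campaign_id")
         .agg(F.count("*").alias("events"),
              F.countDistinct("user_id").alias("uniques")))

# Results land back in the same lake, where Hive or BI tools can pick them up.
daily.write.mode("overwrite").parquet("hdfs:///lake/curated/daily_campaign_rollup/")
spark.stop()
```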
The cloud native services from AWS, Azure, and Google are not yet widely recognized as competitors to Hadoop. But what the cloud native services lack in open source appeal, they make up for in simplicity. The cloud services were designed, built, integrated, tested, and delivered as a single unit, and they continue to operate far more effectively as one than Hadoop’s components do. That unification will make it much easier for mainstream enterprises to build and operate applications with a single, integrated set of development and administrative tools (See Figure 4). Looking further out, self-tuning systems of intelligence are going to need continuous processing and online machine learning. Spark and the cloud native services will need to grow into this new role without accreting so much complexity that they collapse into adaptive stretch themselves.
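To ground what “continuous processing” means in this context, the sketch below uses Spark Structured Streaming with its built-in rate source, so it runs without any external infrastructure. It is an illustrative assumption rather than a description of any particular cloud service, and the online-learning step it would feed is only indicated in the comments.

```python
# A minimal continuous-processing sketch: a windowed aggregate over an unbounded
# stream. A downstream online-learning model would consume these rolling features;
# that step is omitted here.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("continuous-rollup").getOrCreate()

# The built-in "rate" source generates (timestamp, value) rows for testing.
events = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

rolling = (events
           .withWatermark("timestamp", "30 seconds")
           .groupBy(F.window("timestamp", "10 seconds"))
           .agg(F.count("*").alias("events"),
                F.avg("value").alias("avg_value")))

# Print updated windows to the console as the stream progresses.
query = (rolling.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()
```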
Action Items
Enterprises must navigate the universe of big data applications by recognizing that application platforms will alternately empower and constrain them along a path of ever more sophisticated applications. Data lakes, intelligent systems of engagement, and self-tuning systems of intelligence all build on each other in terms of capabilities. But the underlying platforms will evolve in a cycle of simplicity, then complexity from adaptive stretch, then simplicity again via a platform refresh. Because packaged big data applications are proving slow to evolve, mainstream enterprises must align their application development investments with the platforms while those platforms are still simple, before they accrete adaptive stretch. More sophisticated enterprises, particularly tech companies, can build and deploy applications ahead of the rest because of their greater tolerance for the complexity of platforms deep in adaptive stretch, even before the newer, simpler successor platforms emerge.