Premise
Customers are starting to plan how to build out Hadoop-based hybrid cloud platforms to transform their big data and analytic infrastructure into platforms for applications that make ever better decisions ever faster. Despite the original promise of open source platforms curated from a common ecosystem, vendor offerings are turning into what are, for all intents and purposes, proprietary platforms.
As big data and analytics increasingly get embedded in mainstream distributed applications (e.g., IoT), conventional data lakes are becoming less effective. Until now, the distributed Hadoop infrastructure underpinning big data and analytics has typically been deployed on a cluster of servers at a single site. But mainstream applications increasingly distribute data and processing across on-premises data centers, edge networks, and multiple clouds. The open source community, however, is not innovating fast enough to support hybrid cloud architectures. While Hadoop vendors have long since left behind the promise of cross-distro workload portability, each vendor is increasingly adding single-source or proprietary software in order to support hybrid cloud architectures. Each vendor is making different architectural choices, however, reflecting its prioritization of use cases, current challenges, and future possibilities. In other words, each vendor’s offering has different sweet spots for hybrid cloud use cases and interoperability (see Table 1):
- MapR offers interoperability for developers of Web-scale, geographically distributed applications. MapR’s Converged Storage Platform integrates files, tables, and streams exposed through standard Hadoop APIs. Developers can not only move workloads but distribute them seamlessly via stream processing, file and database replication, and distributed database updates.
- Cloudera is bundling specific solutions and starting to layer on cloud-native admin. Cloudera is addressing the issue of complexity in the Hadoop ecosystem in two ways. First, the company is being prescriptive in how it bundles a limited set of services to address specific use cases. Second, Cloudera is starting to add cloud-native management capabilities on top of the existing on-prem physical infrastructure management.
- Hortonworks offers cross-ecosystem data plane administration and edge platform reach. Hortonworks is focused on being the ultimate “steward” for an organization’s data from the edge and across clouds: the data in its own data lakes, the data in others’ repositories, and the streams that fill them.
Table 1: Summary comparison of Wikibon’s analysis of hybrid cloud architectures based on Hadoop distributions from the leading vendors.
MapR offers the best interoperability for developers of geographically distributed applications.
Impact
MapR is best positioned to support applications like connected cars, where each car can generate terabytes of data each day. MapR can store, analyze, and manage models that prescribe actions on the car itself; collect the immense amounts of filtered data from the fleet of cars in the cloud for central machine learning training; and then orchestrate the distribution of the updated models back to the cars.
The enabling technology for this use case is MapR’s proprietary exabyte-scale, location-transparent, geographically replicated, polyglot storage platform. Where the other Hadoop vendors let customers choose between HDFS for analytic speed and scale, HBase for operational applications, and Kafka for stream processing, MapR built a foundation file system that enabled it to build storage services far better than the sum of the Hadoop parts. In fact, MapR built hybrid cloud-ready infrastructure before the cloud vendors themselves built anything of similar scale. And the cloud vendors have yet to match MapR’s versatility. MapR’s unified polyglot storage platform can ingest data through one set of APIs and serve that same data through others. Despite its proprietary underpinnings, the MapR platform presents Hadoop-compatible APIs to developers (see figure 1). More than any Hadoop or cloud vendor, MapR can offer the ability to deploy and execute anywhere.
MapR’s versatile storage platform can support a wide array of hybrid cloud architectures (see table 1). MapR assumes applications built on its platform leverage the integrated file system, database, and stream processor. For example, the connected car application could ingest streaming data from an onboard sensor network into the local file system. When the car next connects to an edge gateway, the filtered data would migrate to the cloud either by file system replication or stream processing. In the cloud, the machine learning model training would operate on the database. The integration across the file system, database, and stream processor would greatly simplify building this application. For example, a single global namespace makes it easy for a developer to address data wherever it resides and to operate on it using a variety of APIs. The same simplicity applies to management: services for high availability, disaster recovery, authorization, and other tasks are common across products.
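To make the global namespace point concrete, the sketch below shows what that developer experience might look like: an in-car process appends filtered readings to a file on an NFS-mounted MapR volume using plain POSIX I/O, and a cloud-side Spark job reads the same data through the Hadoop-compatible maprfs:// API. The mount point, volume paths, and field names are hypothetical placeholders rather than MapR defaults.

```python
# Edge side: append filtered sensor readings using plain POSIX file I/O.
# The volume is assumed to be NFS-mounted at /mapr (hypothetical paths).
import json
import time

def log_reading(car_id, reading, root="/mapr/cluster1/telemetry"):
    path = f"{root}/{car_id}/{time.strftime('%Y%m%d')}.jsonl"
    with open(path, "a") as f:
        f.write(json.dumps(reading) + "\n")

log_reading("car-042", {"engine_temp_c": 97.4, "speed_kph": 88})

# Cloud side: a Spark job addresses the same (replicated) volume through
# the Hadoop-compatible file system API instead of the POSIX mount.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("telemetry-prep").getOrCreate()
telemetry = spark.read.json("maprfs:///telemetry/*/*.jsonl")  # hypothetical path
telemetry.selectExpr("avg(engine_temp_c)").show()
```

The point is not the specific calls but that both sides operate on one namespace; no export/import step separates the edge file system from the cloud analytics job.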
MapR can distribute stateful microservice containers that communicate via its Kafka-compatible stream processor. MapR can also support both geographically replicated, HBase-compatible databases and its HDFS-, NFS-, and POSIX-compatible file system. MapR, however, doesn’t actually manage the data science. Instead, geographically distributed event streaming makes it easy to filter and collect data from the edge, and the exabyte-scale storage enables training models with extreme amounts of data. The same storage can push trained models back to the edge.
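Because the stream processor exposes the standard Kafka producer and consumer APIs, those microservices can exchange events with ordinary Kafka client code. The sketch below uses the open source kafka-python client against a hypothetical topic; on MapR the topic name would be a pathname of the form /apps/stream:topic and the client library would be MapR’s Kafka-API-compatible one, so treat the broker address and names as placeholders.

```python
# Minimal Kafka-API sketch (kafka-python). On MapR, the same send/poll
# pattern works against its Kafka-compatible streams; names are placeholders.
import json
from kafka import KafkaProducer, KafkaConsumer

TOPIC = "car-telemetry"   # on MapR, something like "/apps/telemetry:engine"
BROKERS = "broker1:9092"  # placeholder endpoint

producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"car_id": "car-042", "engine_temp_c": 101.2})
producer.flush()

consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    # A downstream stateful microservice would filter or aggregate here.
    print(message.value)
    break
```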
Challenge
MapR does have some shortcomings in cloud-native operation and for IoT applications with lightweight edge devices. MapR does not yet offer the management capabilities to provision clusters elastically, a key capability for cloud-native software. And, for now, MapR’s edge offering is a 4-5 node cluster. It might work in a fit-for-purpose appliance in a connected car with space and plenty of power, but for other IoT applications, it would only get as far as an edge cluster, not an edge gateway or device (see figure 1).
Cloudera is bundling solutions and starting to layer cloud-native admin over on-prem admin.
Impact
Cloudera is ahead of its peers in assembling use case-specific, bundled solutions. Cloudera is taking specific steps to address the sprawl and complexity that come with provisioning big data solutions from more than two dozen services. Cloudera’s approach so far comprises core Hadoop augmented with a potpourri of relevant services:
- analytics (Impala),
- data science and engineering (Spark, Impala), and
- operational applications (HBase).
Cloudera has also taken the first step in turning one of its solutions, data engineering and science, into a cloud service, named Altus. However, Altus is not yet a cloud-native service because it doesn’t completely hide the complexity of operating the cluster on which it runs. Rather, it is more of a managed service.
Cloudera is furthest along of the three independent distro vendors in making its management software cloud-native. Cloudera’s original management console, Manager, installs, configures, manages, and monitors a cluster of Hadoop services (see figure 2). But Manager assumes a fixed-size physical cluster. To create elastic clusters, Cloudera’s Director runs on top of Manager, dynamically spinning servers up and down in the cloud (see figure 3). This elasticity is the core of Cloudera’s cloud-native administrative functionality.
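A quick way to see the fixed-cluster assumption is through Manager’s own API: what it describes is a concrete inventory of clusters and the services installed on them, the physical view that Director then layers elasticity on top of. The sketch below assumes the older cm_api Python client for the Cloudera Manager REST API; the host, port, and credentials are placeholders.

```python
# List the clusters and services a Cloudera Manager instance knows about,
# via the cm_api Python client. Connection details are placeholders.
from cm_api.api_client import ApiResource

api = ApiResource("cm-host.example.com", 7180, "admin", "admin")

for cluster in api.get_all_clusters():
    for service in cluster.get_all_services():
        # Each entry is a fixed, named service deployed on a known cluster;
        # elasticity (adding/removing nodes on demand) is Director's job.
        print(cluster.name, service.name, service.serviceState)
```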
Cloudera’s distributed application architecture (see table 1) reflects a storage foundation that is more primitive than MapR’s. Developers can build applications on compatible services across hybrid clouds, but they have to work with discrete, less integrated storage and processing services than with MapR. Cloudera’s distro includes Kafka for communicating between many types of services, including microservices. Cloudera also partners with StreamSets for a control plane to manage and monitor geographically distributed streams. Streaming analytics is covered by Spark Structured Streaming. For the distributed file system and DBMS, Cloudera uses HDFS and HBase, respectively. HBase can provide geographically distributed consistency, but at the expense of higher-latency operations. For machine learning, Cloudera focuses on Spark.
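To make the “discrete services” point concrete, the sketch below wires two of those pieces together the way a developer on this stack typically would: Spark Structured Streaming consumes a Kafka topic and lands the payloads in HDFS as Parquet. The broker address, topic, and paths are placeholders, and a real job would parse the payload into typed columns rather than storing raw strings.

```python
# Spark Structured Streaming: read a Kafka topic, persist to HDFS as Parquet.
# Requires the spark-sql-kafka connector; endpoints and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("telemetry-ingest").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "car-telemetry")
       .load())

# Kafka delivers binary key/value columns; keep just the payload as a string.
events = raw.select(col("value").cast("string").alias("payload"))

query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/telemetry/raw")
         .option("checkpointLocation", "hdfs:///checkpoints/telemetry")
         .trigger(processingTime="1 minute")
         .start())

query.awaitTermination()
```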
Challenge
Despite its emphasis on adding value through proprietary administrative functionality, Cloudera still has work to do in making its admin tools fully cloud-native. Director is a step in the direction of managing cloud-native services, but it still exposes the underlying physical clusters because it works through Manager, the on-prem management console. The Navigator governance tool needs even more work to be cloud-native. Cloudera designed Navigator as core technology from the company’s inception to provide data lineage, audit, and governance for on-prem clusters. But on-prem clusters are permanent and cloud-based ones are ephemeral. Cloudera’s Director is ahead of Navigator in becoming cloud-native.
Cloudera still needs work on its IoT edge offerings. The company doesn’t have services capable of operating in memory- and CPU-constrained environments. Consequently, Cloudera is a sub-optimal choice for analytics and machine learning model predictions on edge devices and gateways.
Hortonworks offers the best cross-ecosystem data plane administration and edge platform reach.
Impact
Hortonworks is putting in place a platform that is differentiated in its stewardship of data in a hybrid cloud environment. Hortonworks’ NiFi, part of the DataFlow product line, gives it the best scaffolding for managing geographically distributed IoT data flows. Along with the edge device product, MiNiFi, Hortonworks offers the best solution for managing and controlling secure, real-time IoT stream processing. NiFi manages data flows with governance extending from small footprint edge devices all the way to on-prem and multi-cloud Hadoop services. That scaffolding also enables Hortonworks to filter and collect data for centralized machine learning and to distribute the models back out to the edge (see figure 4).
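For a sense of how lightweight the edge handoff can be, the sketch below shows an edge process pushing locally filtered readings into a NiFi flow whose entry point is a ListenHTTP processor; MiNiFi would more typically handle this collection on the device itself. The gateway URL, port, and filter threshold are hypothetical, and the downstream routing, governance, and delivery to the cloud would be defined in the NiFi flow rather than in this script.

```python
# Edge-side sketch: forward filtered sensor readings to a NiFi flow that
# starts with a ListenHTTP processor. URL, port, and threshold are placeholders.
import json
import requests

NIFI_INGEST_URL = "http://edge-gateway.example.com:9090/telemetry"

def ship_reading(reading: dict) -> None:
    resp = requests.post(
        NIFI_INGEST_URL,
        data=json.dumps(reading),
        headers={"Content-Type": "application/json"},
        timeout=5,
    )
    resp.raise_for_status()

# Filter at the edge, collect centrally: only forward anomalous readings.
reading = {"device": "car-042", "engine_temp_c": 104.7}
if reading["engine_temp_c"] > 100:
    ship_reading(reading)
```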
Hortonworks’ stewardship of the Apache Atlas project puts it in a prime position to control the metadata solution for data audit, lineage, and governance. Unlike Cloudera’s closed-source Navigator, Atlas is starting to be embraced by the broader ecosystem. When IBM deprecated its version of Hadoop, it committed to using Atlas across the rest of its big data product line.
Hortonworks’ upcoming Dataplane administration product will give customers a catalog that organizes a view of their data across on-prem deployments and multiple clouds, and optionally beyond Hadoop. The current Hive catalog is a directory of data in a single location; Hortonworks is building an analog that can manage data distributed across hybrid cloud deployments. In its initial form it won’t support distributed queries, but it will answer questions about which data resides where.
Challenge
Hortonworks has been slow to migrate its offerings to be cloud-native. Microsoft partnered with Hortonworks for the fully cloud-native Azure HDInsight service. But that product appears to be fully controlled by Microsoft. Hortonworks’ initial cloud strategy for AWS, Azure, and GCP is to deliver the on-prem product on cloud server infrastructure with access to separate cloud storage. In other words, customers will see few of the administrative benefits of the cloud for a while. They will still have to manage physical or virtual clusters.
Hortonworks has also made some technology bets that are proving awkward to integrate. While initial SQL workloads on Hadoop were very long-running Hive jobs, the distro vendors went in different directions to deliver interactive SQL queries. Hortonworks built its interactive SQL on a version of Hive on Tez, accelerated the underlying Java processes with LLAP, managed resources with YARN, structured the data into the ORC columnar format, and stored the data in HDFS. This assemblage of technologies requires hundreds of configuration settings to run properly. The other distro vendors have versions of Hive, mainly for batch jobs, but each built an MPP SQL DBMS from the ground up for interactive queries. Not only are those engines easier to configure and administer, but they are optimized first for interactive performance, not long-running query throughput.
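To give a flavor of that configuration surface, the sketch below touches a handful of the session-level engine knobs on the Hive/Tez/LLAP stack and defines an ORC-backed table, issued through the PyHive client. The host, database, the settings chosen, and their values are illustrative of the kinds of switches involved, not a recommended or complete configuration.

```python
# A small taste of the Hive-on-Tez configuration surface, via PyHive.
# Host, database, and the chosen settings/values are illustrative only.
from pyhive import hive

conn = hive.Connection(host="hiveserver2.example.com", port=10000, database="telemetry")
cur = conn.cursor()

# Three of the many engine and optimizer knobs this stack exposes.
for setting in (
    "SET hive.execution.engine=tez",
    "SET hive.vectorized.execution.enabled=true",
    "SET hive.cbo.enable=true",
):
    cur.execute(setting)

# Interactive-SQL data is expected to live in ORC columnar format on HDFS.
cur.execute("""
    CREATE TABLE IF NOT EXISTS trip_events (
        car_id STRING,
        event_time TIMESTAMP,
        engine_temp_c DOUBLE
    )
    STORED AS ORC
""")
```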
Solution-specific product bundles from Hortonworks can include partner products that overlap with each other. Hortonworks appears to select partner products to bundle into solutions tactically rather than strategically. In other words, rather than choosing one partner product for common functions across different bundles, it can bundle several similar partner products in different bundles. Hortonworks’ data warehouse optimization solution includes AtScale for OLAP functionality and IBM’s BigSQL MPP SQL engine, and AtScale and BigSQL have significant overlap. The Splunk-like log analysis product uses Elasticsearch and Hive as back-ends for time-series log data, but the IoT solution uses Druid to manage time-series data.
Action Item
Customers need to prioritize use case practicality over open source purity. What’s taking shape among the Hadoop distro vendors is a set of far more focused solutions from each. The vendors are starting to strike out on proprietary paths that combine OSS and proprietary products or extensions. This is the new normal in what was called the Hadoop ecosystem. For customers who are not bound by the constraints of on-prem software, Qubole, HDInsight, and AWS EMR are strong candidates. What’s more, once you step out of the “open” Hadoop ecosystem, the cloud vendors offer an order of magnitude more choices.