At DataWorks 2018, Hortonworks Accelerates Its Shift Toward Public Cloud Deployments

By James Kobielus | June 20, 2018

DataWorks Summit began as Hadoop Summit in 2008, when it was a developer event hosted by Yahoo! In the intervening decade, the event changed name, grew in size and scope, and became international. The big data market exploded as established data vendors and hot startups spawned a dynamo of innovation. The core technology evolved and eventually gave way to a deeper open source ecosystem.

What has remained the same is both data’s centrality in the global economy and the leading big data solution providers’ orientation toward entirely or predominantly open-source software. Hadoop is the least of it. Among multi-cloud big-data solution providers, Hortonworks remains one of the most active participants and committers in the open-source ecosystem. Its solution portfolio now incorporates 26 open-source codebases. These include widely adopted data and analytics platforms such as Apache Hadoop, Apache Hive, and Apache Spark, as well as a diverse range of code (such as Apache Atlas, Apache Ranger, Apache Ambari, and Apache Knox) for managing, securing, and governing big data environments.

The product focus at DataWorks was on the vendor’s launch of the next generation of its core Hadoop platform, which is built on the Apache Hadoop 3.1 distribution. The new Hortonworks Data Platform (HDP) 3.0 is now available for preview under an early access program and is expected to be generally available in Q3 2018.

HDP 3.0 includes the following new features:

Containerization: For developers building the next generation of cloud-native data applications, HDP 3.0 supports faster building, training, and deployment of containerized advanced analytics, machine learning (ML), deep learning (DL), and artificial intelligence (AI) workloads and microservices all the way to the edges of today’s increasingly distributed cloud environments. With containers running on HDP, developers can move fast, deploy software more efficiently, and operate with increased velocity, which suits the DevOps environments in which more data applications are being built. Scalability enhancements in HDP 3.0 support running very large multitenant clusters and more packaged containerized microservices.
GPU support: Support for graphics processing units (GPUs) within Hadoop 3.1’s YARN scheduler enables AI, DL, and ML workloads to run on supported Hadoop clusters. HDP 3.0 now allows Hortonworks customers to leverage GPUs in the cloud for scalable training and inferencing workloads. When developing and refining containerized TensorFlow applications in the cloud, data scientists can share access to GPU resources through pooling and isolation in HDP 3.0.
Hive 3.0 support: HDP 3.0 includes a real-time database built on Hive 3.0, which now incorporates Tez, LLAP, and Druid to support real-time data warehousing. With this release, Hive has evolved into a full enterprise database with support for high concurrency, low latency, expanded SQL syntax and ACID compliance. Within HDP 3.0, Hive 3.0 provides a unified SQL layer that supports improved query optimization to process more data, both real-time and historical, more rapidly for low-latency and high-throughput applications. It enables scalable, interactive query of data that lives anywhere in private, public, and hybrid clouds.
Heterogeneous cloud-storage optimization: HDP 3.0 includes the ability to separate storage clusters from compute clusters. As an alternative to HDFS when running in the cloud, HDP 3.0 supports data storage in all of the major public-cloud object stores, including Amazon S3, Azure Blob Storage, Azure Data Lake, Google Cloud Storage, and AWS Elastic MapReduce File System. HDP workloads access cloud storage environments via the Hadoop Compatible File System API. The latest storage enhancements include a consistency layer for non-consistent cloud stores. HDP 3.0 also offers improved storage scalability, leveraging enhancements in NameNode to support scale-out persistence of billions of files with lower storage overhead, and it includes storage-efficiency enhancements such as support for erasure coding.
Governance and compliance enhancements: HDP 3.0 enables enhanced governance and compliance with such mandates as GDPR, through support for a full chain of data custody and fine-grained event auditing. Users can now track the lineage of data from its origin all the way to its storage in data lakes built on HDP 3.0. This allows auditors to view data without making changes, enforce time-based policies, and audit events around third parties with encryption protection. HDP 3.0 also supports shared enterprise security and data governance services across public clouds and automatic cluster scaling based on usage or time metrics.
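
To make the storage-efficiency point in the cloud-storage item above concrete, here is a minimal sketch of the overhead arithmetic behind erasure coding. It assumes the usual Apache HDFS defaults of three-way replication and a Reed-Solomon layout with 6 data units and 3 parity units; the article does not say which policy HDP 3.0 customers will actually use, and the function names are hypothetical, chosen only for illustration.

```python
def replication_overhead(replicas: int) -> float:
    """Extra raw storage per byte of user data under n-way replication."""
    return float(replicas - 1)


def erasure_coding_overhead(data_units: int, parity_units: int) -> float:
    """Extra raw storage per byte of user data under a (data, parity) code."""
    return parity_units / data_units


def raw_bytes_needed(logical_bytes: float, overhead: float) -> float:
    """Raw capacity consumed to store logical_bytes at the given overhead."""
    return logical_bytes * (1 + overhead)


if __name__ == "__main__":
    logical_tb = 100.0  # illustrative dataset size, not a figure from the article

    # Assumed policies: 3x replication versus Reed-Solomon with 6 data + 3 parity units.
    rep = replication_overhead(3)
    ec = erasure_coding_overhead(6, 3)

    print(f"3x replication: {rep:.0%} overhead, "
          f"{raw_bytes_needed(logical_tb, rep):.0f} TB raw")
    print(f"RS(6,3) coding: {ec:.0%} overhead, "
          f"{raw_bytes_needed(logical_tb, ec):.0f} TB raw")
```

Under those assumptions, the same logical dataset needs half as much raw capacity with erasure coding as with three-way replication, at the cost of extra computation when data is written or reconstructed.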

There were no specific enhancements announced to Hortonworks DataFlow (HDF) for data in motion, though Hortonworks discussed a new lightweight streaming technology, MiNiFi, that will enable customers to deploy containerized AI/DL/ML to IoT, edge, and embedded endpoints in multiclouds.

Likewise, there was no specific product announcement regarding DPS, which Hortonworks launched as a “single pane of glass” for monitoring, managing, and deploying data applications across complex hybrid-data multiclouds. However, Hortonworks extensively discussed a new compliance-relevant DPS solution, Data Steward Studio, that it rolled out a couple of months ago at DataWorks 2018 Berlin.

As in previous releases of the platform, HDP 3.0 enables customers to build hybrid data multi-clouds that include any and all of the major public cloud providers. Last Friday, Hortonworks released Cloudbreak 2.7, which supports provisioning of HDP clusters into complex hybrid cloud architectures.

Hortonworks now puts its public cloud footprint front and center in its go-to-market message, though only 25 percent of its 1,400 paying customers currently run Hortonworks solutions in the public cloud and just 5 percent run exclusively in the public cloud. By contrast, 95 percent of Hortonworks customers deploy its offerings entirely or predominantly on-premises.

Nevertheless, Hortonworks sees a customer trend toward putting more analytics workloads in public and hybrid clouds, and its entire product roadmap is focused on making that transition as seamless as possible for its customers. This is consistent with Wikibon’s finding from the recent annual update to our big-data market forecast. Our analysts found that hybrid clouds are becoming an intermediate stop for enterprise big data analytics deployments on the way to more complete deployment in public clouds in the coming decade and beyond. Across the big data market, traditionally premises-based platforms are being rearchitected to deploy primarily in public clouds.

One of the implications of the trend toward multi-cloud environments is the need for standard platform technology that can orchestrate containerized microservices across public and private clouds. As Scott Gnau, Hortonworks CTO, told me and Rebecca Knight on theCUBE at DataWorks: “Containerization affords…agility in deploying applications. For the first 30 years [data practitioners] built these enterprise software stacks that were very integrated, hugely complicated systems that could bring together multiple different applications, different workloads and manage all that in a multi-tenant kind of environment. And that was because we had to do that, right? Servers were getting bigger, they were more powerful but not particularly well distributed. Obviously in a containerized world, you now turn that whole paradigm on its head and you say, you know what? I’m just going to collect these three microservices that I need to do this job. I can isolate them. I can have them run in a serverless technology. I can actually allocate in the cloud servers to go run, and when they’re done they go away. And I don’t pay for them anymore.”

With this trend in mind, Hortonworks made several public-cloud partnership announcements this week at DataWorks San Jose, designed to help customers make that transition when they’re ready:

IBM: The partners announced IBM Hosted Analytics for Hortonworks (IHAH), which runs HDP 3.0 instances on IBM Cloud and incorporates IBM Db2, IBM Big SQL and IBM Data Science Experience (DSX). This move builds on last year’s announcement by IBM and Hortonworks that they were incorporating HDP and DSX into a converged solution for the next generation of developers building AI-driven applications for multicloud deployment. IHAH brings that converged data management and analytics offering into the IBM Cloud as a hosted service. It enables quick setup, provisioning, security, and deployment so that data scientists and other developers can rapidly operationalize their applications for production enterprise uses. It lets users run DSX workloads in virtual Python environments on all HDP clusters hosted in IBM Cloud without needing to install Python libraries on those nodes. Within IHAH, DSX workloads can easily consume the data and infrastructure services managed in HDP data lakes in IBM Cloud. The hosted service also enables data scientists to write ANSI SQL to invoke IBM Big SQL directly from DSX, avoiding the need to write Python scripts in order to bring together different types of data from different federated data stores in IBM Cloud.
Microsoft: The partners announced that customers can now deploy the complete Hortonworks portfolio, including HDP, Hortonworks DataFlow (HDF) and Hortonworks DataPlane Service (DPS), natively on Microsoft Azure’s infrastructure-as-a-service public cloud. This gives joint customers greater flexibility in distributing big data workloads throughout complex hybrid multicloud scenarios, including edge deployment in the Internet of Things. Joint customers also retain the choice of running their analytic workloads, such as Hadoop and Spark, purely in the public cloud on the existing HDInsight offering in Microsoft Azure.
Google: The partners announced expanded support for Google Cloud Platform (GCP) public-cloud storage services. Hortonworks customers can now tap into Google Cloud Storage to support HDP, HDF, and DPS workloads that run in diverse private, public, and hybrid cloud environments. In the GCP public cloud, users can run fast, scalable analytics for interactive query, AI/ML/DL, and streaming data analytics. At no upfront cost and in minutes, customers can provision HDP, HDF, and DPS workloads in GCP with unlimited elastic scalability. They can automate and optimize the provisioning of GCP resources while configuring and securing workloads in the cloud. They now have the flexibility to run ephemeral, short-lived workloads in GCP. And they can securely move any data flow from any source between on-premises HDP/HDF/DPS deployments and GCP deployments.

Connecting the dots on HDP 3.0, containerization, and connected communities, which was a core theme of his DataWorks keynote, CEO Rob Bearden told us that “HDP 3.0 is really the foundation for enabling that hybrid architecture natively, and what’s it done is it separated the storage from the compute, and so now we have the ability to deploy those workloads via a container strategy across whichever tier makes the most sense, and to move those application and datasets around, and to be able to leverage each tier in the deployment architectures that are most pragmatic. And then what that lets us do then is be able to bring all of the different data types, whether it be customer data, supply chain data, product data. So imagine as an industrial piece of equipment is, an airplane is flying from Atlanta, Georgia to London, and you want to be able to make sure you really understand how well is that each component performing, so that that plane is going to need service when it gets there, it doesn’t miss the turnaround and leave 300 passengers stranded or delayed, right? Now with our Connected platform, we have the ability to take every piece of data from every component that’s generated and see that in real time, and let the airlines make that real time.”

Arun Murthy, Hortonworks co-founder and chief product officer, tied containerization, as enabled through HDP 3.0, directly to customers’ edge computing strategies: “Containerization [provides] complete agility in terms of how you deploy the applications. You get isolation not only at the resource management level with containers but you also get it at the software level, which means, if two data scientists wanted to use a different version of Python or Scala or Spark or whatever it is, they get that consistently and holistically. That now they can actually go from the test dev cycle into production in a completely consistent manner. So that’s why containers are so big because now we can actually leverage it across the stack and the things like [edge computing technology] MiNiFi showing up….what we’re trying to do with MiNiFi is actually not just collect data from the edge but also push the processing as much as possible to the edge because we really do believe a lot more processing is going to happen at the edge….There will be custom hardware that you can throw and essentially leverage that hardware at the edge to actually do this processing. And we want to do that even if the cost of data not actually landing up at rest because at the end of the day we’re in the insights business not in the data storage business.”