Machine Learning Pipeline: Chinese Menu of Building Blocks

By George Gilbert | November 30, 2016

Premise

Analytic data pipelines for building machine learning applications are coalescing around some well-understood design patterns. Theoretically, these design patterns should make it possible to build pipelines using a Chinese menu of mix-and-match analytic services based on well-defined integration points.

Analytic data pipelines have existed since businesses first turned operational data into reports. Leading-edge big data and machine learning applications have evolved common design patterns so that they can be assembled with mix-and-match flexibility. These applications have to transform data into predictions by ingesting, exploring, and processing data, and then serving the predictions (see Figure 1). Within the pipeline, the need for speed and scale requires moving as much of the compute as possible to the data. In other words, applications are now almost subservient to data, whereas traditionally data was subservient to, and accessed primarily through, applications. The sequence of building blocks differs slightly depending on whether developers or admins are working with the pipelines:

The pipeline starts with ingestion. Unlike with data warehouses, where exploration and database design come before ingestion, there is relatively little work in setting up ingestion.

Exploration includes pattern recognition and data governance. Because ingest is easy, the burden falls on the explore function to make sense of complex data, enrich it, and track where it comes from and where it goes. Once these activities are designed into the pipeline, most of them operate in the processing step at runtime.

Processing data transforms raw events and refines them to be ready for actionable predictions and prescriptions. Processing might bring together, for analysis, a game player’s real-time state of play and the historical data about the pathways she took through levels she played previously. Once together, the data is ready for prediction.

Predict and serve takes the processed data, organizes it, and drives an action or informs a decision. This step uses the most recent data, historical data, and the machine learning model itself to make predictions.

The pipeline starts with ingestion.

When data was locked deep within OLTP applications, complex extraction programs were required to get it out. Creating extractions typically involved significant work to prepare the schemas required to answer questions in the data warehouse. Today, though, ingest means you can load your data lake without figuring out the questions up front. Since no structure is needed at ingest, there’s no need to code the traditionally brittle ETL pipeline. Rather, analytic systems pipe the raw data into the data lake using a stream processor such as Kafka, Kinesis, Azure Event Hubs, or Hortonworks DataFlow. Every data source, or “producer,” is accessible to any sink, or “consumer,” via a hub. The streams can store the data in a temporary landing zone in the data lake for exploration. If you need more data or new sources, add them to the pipeline. There’s rarely a need to disassemble and recode the pipeline.
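The producer, hub, and consumer relationship described above can be sketched in a few lines. This is a toy, in-memory stand-in written for illustration, not the API of Kafka, Kinesis, or any other product; the `Hub` class, the topic name, and the consumer names are all hypothetical. It shows the two properties the text relies on: producers append raw events without declaring a schema, and each consumer reads the same stream at its own pace.

```python
from collections import defaultdict


class Hub:
    """Toy in-memory stand-in for a streaming hub (not a real client library)."""

    def __init__(self):
        self._topics = defaultdict(list)  # topic name -> append-only log of raw events
        self._offsets = defaultdict(int)  # (consumer, topic) -> next position to read

    def publish(self, topic, event):
        # Producers append raw events as-is; no schema is enforced at ingest.
        self._topics[topic].append(event)

    def poll(self, consumer, topic):
        # Each consumer tracks its own offset, so sinks read independently.
        log = self._topics[topic]
        start = self._offsets[(consumer, topic)]
        self._offsets[(consumer, topic)] = len(log)
        return log[start:]


hub = Hub()
hub.publish("game-events", {"player": "p1", "action": "level_start", "level": 3})
hub.publish("game-events", {"player": "p2", "action": "purchase", "sku": "shield"})

# The data lake loader drains what has arrived so far into the landing zone.
landing_zone = hub.poll("data-lake-loader", "game-events")

# A new event arrives; a second consumer added later still sees the full stream.
hub.publish("game-events", {"player": "p1", "action": "level_fail", "level": 3})
analytics_batch = hub.poll("stream-analytics", "game-events")
```

Adding the second consumer required no change to the producers or to the first consumer, which is the sense in which the pipeline rarely needs to be disassembled and recoded.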

Part of what makes stream processors such as Kafka elastically scalable is their ability to partition related data across a cluster. But if the next step in the pipeline analyzes the data as a stream rather than storing it in a database for later analysis, admins have to design the stream processor and the stream analysis to partition the data the same way. Otherwise, the pipeline will have to reshuffle the data between the two steps. In addition, even though the analytic pipeline is far more flexible by design than a traditional ETL pipeline, admins do have to take care that ingested data evolves in a way that won’t break the processing and the predict and serve steps later on.
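A small sketch can make the partitioning concern concrete. It assumes a hypothetical four-partition cluster and uses a deliberately simple byte-sum hash so the result is easy to follow; real systems use stronger hash functions, and none of the names below come from a specific product. The point is only that when ingest and analysis agree on the partitioning key, every event is already where the analysis step expects it.

```python
NUM_PARTITIONS = 4  # hypothetical cluster size, chosen for illustration


def partition_for(key, num_partitions=NUM_PARTITIONS):
    # Toy hash: sum of the key's bytes. Deterministic, so every process
    # maps the same key to the same partition.
    return sum(str(key).encode("utf-8")) % num_partitions


def partition_events(events, key_field):
    # Place each event on the partition chosen by the given key field.
    partitions = {p: [] for p in range(NUM_PARTITIONS)}
    for event in events:
        partitions[partition_for(event[key_field])].append(event)
    return partitions


def needs_reshuffle(partitions, analysis_key):
    # True if any event sits on a partition other than the one the
    # analysis step would expect for its key.
    return any(
        partition_for(event[analysis_key]) != p
        for p, events in partitions.items()
        for event in events
    )


events = [
    {"player": "p1", "level": 3},
    {"player": "p2", "level": 3},
    {"player": "p1", "level": 7},
]

by_player = partition_events(events, "player")  # ingest keyed the same way as the analysis
by_level = partition_events(events, "level")    # ingest keyed differently

aligned = needs_reshuffle(by_player, "player")     # False: data is already co-located
misaligned = needs_reshuffle(by_level, "player")   # True: data must move between steps
```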

Exploration includes pattern recognition and data governance.

Data exploration happens partly at design-time and partly at run-time. And during each phase, data engineers and data scientists have different responsibilities. During design-time, data engineers take data ingested in the landing zone and perform data prep, integration, and curation, ideally making it accessible through a catalog. Data scientists have to map the relevant elements of data to the right algorithms and then train a predictive model, typically with historical data. At run-time, data engineers have to ensure data’s lineage gets cataloged from one end of the pipeline to the other. At the same time, data scientists have to take run-time data about the performance of their predictive models and use it as part of a feedback loop to continually retrain the models.
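To illustrate what cataloging lineage from one end of the pipeline to the other might involve, here is a minimal sketch. The `LineageCatalog` class and the dataset and step names are hypothetical and do not reflect any particular catalog product; the idea is simply that each step records what it read and what it wrote, so any dataset can be traced back to its sources.

```python
from dataclasses import dataclass, field


@dataclass
class LineageCatalog:
    # dataset name -> (step that produced it, list of input datasets)
    edges: dict = field(default_factory=dict)

    def record(self, output, step, inputs):
        # Called by each pipeline step as it writes its output.
        self.edges[output] = (step, list(inputs))

    def upstream(self, dataset):
        """Return every dataset that `dataset` was derived from, nearest first."""
        seen = []
        queue = list(self.edges.get(dataset, (None, []))[1])
        while queue:
            current = queue.pop(0)
            if current not in seen:
                seen.append(current)
                queue.extend(self.edges.get(current, (None, []))[1])
        return seen


catalog = LineageCatalog()
catalog.record("landing/game_events", "ingest", ["source/game_client"])
catalog.record("curated/sessions", "prep", ["landing/game_events"])
catalog.record(
    "features/player_skill",
    "feature_build",
    ["curated/sessions", "curated/player_history"],
)

lineage = catalog.upstream("features/player_skill")
```

A data scientist asking where a model feature came from would get back the curated tables first, then the landing zone, then the original source.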

Tools for managing data exploration for data engineers and scientists can themselves come in a Chinese menu of specialized choices that require integration work. Paxata and Trifacta specialize in data prep and integration but typically work with separate visualization and modeling tools. Pentaho can handle an end-to-end workflow, enabling close collaboration. But if data scientists want to work with Spark for machine learning, they would lose some of that end-to-end integration. The weakest part of the exploration process today is the limited tooling to support the governance function for data scientists. They need to track the features for their models, the models’ data feedback loops, and ongoing deployment of models that are continually being retrained or enhanced. In other words, we will need devops tools for data scientists.

Processing data transforms raw events and refines them to be ready for actionable predictions and prescriptions.

Processing data in the pipeline transforms it and analyzes it so that it’s ready to drive a prediction or inform a decision. Taking an online game as an example, the processing step might track how a user is progressing through her current game session and combine this data with an analysis of how she played in previous sessions. Analyzing all this contextual data would make it easier for the next step in the pipeline, predict and serve, to help adjust the game-play in real-time to better match player skill levels.
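The game example can be sketched as a small enrichment step. This is an illustration under assumed field names (`outcome`, `avg_fail_rate`, and so on are hypothetical), using plain Python rather than any particular stream engine: live events are folded into per-player session state, which is then joined with that player's history to produce the context the next step needs.

```python
from collections import defaultdict


def summarize_session(events):
    """Fold live events into running per-player session state."""
    state = defaultdict(lambda: {"attempts": 0, "failures": 0})
    for event in events:
        s = state[event["player"]]
        s["attempts"] += 1
        if event["outcome"] == "fail":
            s["failures"] += 1
    return state


def enrich(session_state, history):
    """Join current-session state with each player's historical profile."""
    rows = {}
    for player, s in session_state.items():
        # Players with no history get an empty profile rather than an error.
        past = history.get(player, {"avg_fail_rate": None, "sessions": 0})
        rows[player] = {
            "session_fail_rate": s["failures"] / s["attempts"],
            "historical_fail_rate": past["avg_fail_rate"],
            "past_sessions": past["sessions"],
        }
    return rows


live_events = [
    {"player": "p1", "level": 3, "outcome": "fail"},
    {"player": "p1", "level": 3, "outcome": "fail"},
    {"player": "p1", "level": 3, "outcome": "pass"},
    {"player": "p2", "level": 1, "outcome": "pass"},
]
history = {"p1": {"avg_fail_rate": 0.25, "sessions": 12}}

context = enrich(summarize_session(live_events), history)
```

Here player p1 is failing far more often in the current session than in past sessions, which is the kind of signal the predict and serve step could use to adjust difficulty.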

Working with data at the scale of Internet applications presents some unique challenges. Most traditional SQL DBMSs could put all the contextual gameplay data together. But they likely couldn’t handle both speed and scale at this level. And traditional stream analysis products might be able to handle the speed and scale, but they likely couldn’t put all the contextual data together. One possible solution might combine Spark for fast processing and analysis and a NoSQL database like Cassandra for historical storage. Another potential solution is SnappyData, a spin-off from Pivotal, which has built an in-memory DBMS that leverages all the power of Spark as a native analysis engine just as well as it uses SQL.

Predict and serve takes the processed data, organizes it, and drives an action or informs a decision.

Predict and serve is the part of the pipeline that is closest to the function of the data warehouse in traditional business intelligence pipelines. Predict and serve combines the fast, recent data with the big, historical data into views that offer high speed access. The data isn’t organized in normalized form like in an OLTP database. Rather, the views serve up the data for OLAP-style business intelligence that could inform human decision-making. The views also feed the predictive model with the data needed to drive an action by the application. In other words, in order to function properly, the data in predict and serve is organized to support specific perspectives, just like a data warehouse. But unlike a traditional data warehouse, adding additional views to answer new questions doesn’t require disassembling the pipeline. It might just require some additional analysis in the process step on top of data that’s already been ingested.
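The claim that new views can be added without disassembling the pipeline can be illustrated with a small sketch. Everything here is hypothetical (the `view` registry, the view names, the sample rows), and for simplicity each view is recomputed on read, whereas a production system would typically precompute and incrementally maintain its views. What matters is that a new question becomes a new function over data that has already been ingested.

```python
# Big, slow-changing history and fast, recent data (sample rows for illustration).
historical = [
    {"player": "p1", "level": 3, "outcome": "fail"},
    {"player": "p2", "level": 3, "outcome": "pass"},
    {"player": "p2", "level": 4, "outcome": "fail"},
]
recent = [
    {"player": "p1", "level": 3, "outcome": "fail"},
]

VIEWS = {}  # view name -> function that organizes rows for one perspective


def view(name):
    # Decorator that registers a view without touching ingest or processing.
    def register(fn):
        VIEWS[name] = fn
        return fn
    return register


@view("failures_by_level")
def failures_by_level(rows):
    counts = {}
    for row in rows:
        if row["outcome"] == "fail":
            counts[row["level"]] = counts.get(row["level"], 0) + 1
    return counts


def serve(name):
    # Each view sees historical and recent data combined.
    return VIEWS[name](historical + recent)


# A new question arrives later: it is answered by adding one more view.
@view("attempts_by_player")
def attempts_by_player(rows):
    counts = {}
    for row in rows:
        counts[row["player"]] = counts.get(row["player"], 0) + 1
    return counts


failures = serve("failures_by_level")
attempts = serve("attempts_by_player")
```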

The predict function in the pipeline is by far the most immature. The run-time function typically provides a score indicating the likely best answer. But the follow-on workflow involves data scientists incorporating the feedback from their predictive models in the form of model retraining and model redesign. Any predictive model will begin to “drift” in accuracy once it’s in production. Retraining using the latest production data as a feedback loop “recalibrates” the model’s accuracy. Periodically, data scientists will need to redesign the model itself in order to incorporate data that adds a still richer contextual perspective. In both the retraining and redesign phases, data scientists need to be part of the feedback loop, to some extent. And the models data scientists create typically are in a language they work in, such as Python, Scala, or R. Developers typically have to rewrite the models into a production language such as Java or C++ for performance. In addition, developers typically deploy the production model behind a Web service API or as an object in a DBMS, such as a stored procedure. The bottom line is that the entire predict workflow process needs much more mature tooling for data devops or devops for data scientists.
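One piece of the missing tooling, noticing drift and triggering retraining, can be sketched simply. This is a minimal illustration, not a description of any existing product: the `DriftMonitor` class, the tolerance, and the window size are all assumptions. It compares accuracy over a sliding window of recent production predictions against the accuracy measured at training time.

```python
from collections import deque


class DriftMonitor:
    """Flags a model for retraining when recent accuracy falls below baseline."""

    def __init__(self, baseline_accuracy, tolerance=0.05, window=100):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        # Sliding window of outcomes: 1 = correct prediction, 0 = miss.
        self.outcomes = deque(maxlen=window)

    def record(self, predicted, actual):
        # Feedback loop: production outcomes flow back as they become known.
        self.outcomes.append(1 if predicted == actual else 0)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        acc = self.accuracy()
        return acc is not None and acc < self.baseline - self.tolerance


monitor = DriftMonitor(baseline_accuracy=0.90, tolerance=0.05, window=10)

# Nine correct predictions and one miss: accuracy matches the baseline.
for predicted, actual in [(1, 1)] * 9 + [(1, 0)]:
    monitor.record(predicted, actual)
before = monitor.needs_retraining()

# Three more misses push the oldest correct results out of the window.
for predicted, actual in [(1, 0)] * 3:
    monitor.record(predicted, actual)
after = monitor.needs_retraining()
```

A check like this covers only retraining. Deciding when the model itself needs to be redesigned with richer data still requires a data scientist in the loop, as the text notes.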

Action Item: Machine learning pipelines have matured enough that there is widespread agreement on the boundaries between building blocks in order to make it easy for customers to assemble their own pipelines. It’s not quite as simple as mix and match, however. Customers still have to think through which building blocks should come from a single vendor. For example, machine learning model management is still immature and would benefit from single vendor integration for a while.

George Gilbert

George Gilbert, lead data & analytics analyst for theCUBE Research. Former Gartner analyst, former lead enterprise software analyst for Credit Suisse First Boston, one of the top investment banks serving the technology sector. Big Data analyst for Gigaom Research. Co-founded Techalphapartners, a consultancy that advised vendors and institutional investors on market development and product strategy. George has led conference panels with prominent thought leaders in cloud infrastructure and big data. He has been profiled on the front page of the Wall Street Journal and published as a guest author in a major overview of the evolution of cloud computing in The Economist. Prior to being an analyst, George was a product manager on Notes at Lotus Development. George received his BA in economics from Harvard University.
