Premise
Data scientists are rapidly automating every last step of the machine learning (ML) development pipeline. More comprehensive automation is key to developing, optimizing, and deploying ML-based applications at enterprise scale. Data scientists will be swamped with unmanageable workloads if they don’t begin to offload many formerly manual tasks to automated tooling. Automation can also help control the cost of developing, scoring, validating, and deploying a growing scale and variety of models against ever-expanding big-data collections.
Developing, optimizing, and deploying machine learning (ML) models is an exceptionally detail-oriented craft. Every step of the ML pipeline—from preprocessing the data and engineering features through building and evaluating the model—is intricate. Connecting these steps into an end-to-end DevOps pipeline can easily cause the details and dependencies to grow unmanageably complex. Scaling up the pipeline to sustain production of a high volume of high-quality models can magnify the delays, costs, bottlenecks, and other issues with one's existing ML-development workflow.
Automating the data science development pipeline is the key to operating at enterprise scale. Automated ML refers to an emerging practice that accelerates the process of developing, evaluating, and refining ML models. These tools use various approaches—including but not limited to specialized ML models themselves—to automatically sort through a huge range of alternatives relevant to development and deployment of ML models in application projects. As discussed here, the tools help data scientists to assess the comparative impacts of these options on model performance and accuracy. And they recommend the best alternatives so that data scientists can focus their efforts on those rather than waste their time exploring options that are unlikely to pan out.
Wikibon provides the following guidance for enterprises interested in automating their ML pipeline:
- Make the case for automation of the ML pipeline. Advances in ML automation are powering a new DevOps-focused paradigm in the business world. When founded on DevOps practices, greater automation of the ML pipeline enables organizations to support expanding ML development workloads, address stubborn ML development resource constraints, ensure consistent ML practices across the development lifecycle, augment the expertise of ML development personnel, and implement strong ML-pipeline governance.
- Identify the steps in the ML pipeline that can be most readily automated. Though automating the entire ML DevOps pipeline may make sense as a long-term strategy, it is often best to prioritize specific pipeline tasks most suitable for automation in the near term. Typically, an initiative to automate the ML pipeline focuses on one or more of the following DevOps processes: data discovery, data exploration, data preprocessing, feature engineering, algorithm selection, model training, model evaluation, hyperparameter tuning, and model deployment.
- Assess the automation features of available machine learning development tools. Developers have access to a growing range of tools for automating the ML pipeline. Available ML automation solutions include both commercial and open-source tools, and they come with varying degrees of integration with organizations' existing DevOps environments. Developers may wish to explore using multiple automation tools for various stages of the ML pipeline. In some cases, developers may need to write custom apps and glue code that integrate two or more solutions into a comprehensive ML automation toolchain.
- Implement strong governance over the automated machine learning pipeline. The foundations of comprehensive ML pipeline governance are a centralized ML source-control repository, an ML data lake, and an integrated ML development environment. ML quality control is a core governance function. No matter how automated the ML pipeline becomes, manual quality assurance will always remain essential: a core task for which human experts will be responsible.
Make the Case for Automation of the Machine Learning Pipeline
Automation is coming to every segment of the data development, deployment, and management pipeline. Many upfront machine learning (ML) pipeline functions—such as data ingestion, transformation, exploration, and analysis—have long been automated to a considerable degree. The next frontier is automating the processes of building, training, and iterating ML models, such as those that drive predictive applications, and serving them into production environments.
Without greater end-to-end automation of the ML pipeline, these expanding workloads will become increasingly unmanageable for developers who rely largely on manual approaches and workflows. When founded on DevOps practices, greater automation of the ML pipeline can help organizations to:
- Support expanding ML development workloads. ML pipelines everywhere are expanding to include more participants, steps, workflows, milestones, checkpoints, reviews, and metrics; require development of more models in a wider variety of tools, libraries, and languages; incorporate more complex feature sets that include more independent variables; train more models with data from more sources in more formats and schemas; run more concurrent ingest, modeling, training, and deployment jobs; and deploy, monitor, and manage more models in more downstream applications.
- Accelerate ML development and operationalization tasks. Automation can shorten latencies at each step in the ML pipeline. Taking manual processes out of data discovery, ingestion, and training—which constitute the bulk of any ML development workload—can help developers build and evaluate more models more rapidly. Likewise, automating the arcane process of hyperparameter tuning can speed deployment of highly efficient models into production applications. In a DevOps context, these and other automation-driven task accelerations can support real-time release cycles for ML-infused applications.
- Address stubborn ML development resource constraints. Automation can help organizations to boost their ML productivity in spite of staff constraints, skills deficits, budget limitations, and other factors that might otherwise impede their ability to respond to new demands. It can help control the cost of developing, scoring, validating, and deploying a growing scale and variety of ML models against ever-expanding big-data collections.
- Ensure consistent ML practices across the development lifecycle. Automation can ensure consistency in how data sourcing, feature extraction, statistical modeling, training, and other key tasks are handled on every project. It can ensure that the most relevant data visualizations are always generated to help developers explore data sets, benchmark ML model performance, and monitor model deployment and execution in target environments.
- Augment the expertise of ML development personnel. Automation can identify, recommend, and apply optimized hyperparameters to each ML model build, freeing developers from the tedious job of exploring myriad alternatives (a minimal sketch of such an automated search follows this list). There are countless options for executing ML models to achieve their intended outcomes, such as making predictions, performing classifications, or recognizing some image or other phenomenon of interest. Given the finite nature of their time and resources, data scientists and other developers cannot possibly explore every possible modeling alternative relevant to their latest ML project. Even skilled data scientists can't master every last trick of the trade.
- Implement strong ML-pipeline governance. Automation can ensure standard ML development workflows among data scientists, data engineers, business analysts, data administrators, and data-driven application developers. It can speed development by automatically applying the appropriate scripts, rules, orchestrations, and controls to deployed ML assets. It can automate the monitoring of concurrent ML workstreams in process across the organization; the training of each ML build with the best labeled, enterprise-sanctioned training data available; the promotion of the best-fit trained "champion" ML model to in-production status; the maintenance of one or more trained "challenger" models ready for promotion in case the champion becomes less predictively fit for purpose; and the logging of all ML pipeline steps, with the logs archived to a searchable master repository.
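To make this concrete, the sketch below shows the kind of automated hyperparameter search such tooling performs, using scikit-learn's RandomizedSearchCV. The data set, algorithm, and parameter ranges are illustrative assumptions, not a recommended configuration.

```python
# A minimal sketch of automated hyperparameter tuning with scikit-learn;
# the data set, algorithm, and parameter ranges are illustrative only.
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The search space a data scientist would otherwise explore by hand.
param_distributions = {
    "n_estimators": randint(50, 500),
    "max_depth": randint(2, 20),
    "min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=25,   # number of sampled hyperparameter configurations
    cv=5,        # five-fold cross-validation for each configuration
    random_state=42,
)
search.fit(X_train, y_train)

print("Best hyperparameters:", search.best_params_)
print("Held-out accuracy:", search.best_estimator_.score(X_test, y_test))
```

The search samples a bounded number of configurations and cross-validates each one automatically, which is precisely the tedious exploration that the list above argues should be offloaded from human developers.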
Identify The Steps In The Machine Learning Pipeline That Can Be Most Readily Automated
Once the development team has built a strong case for automating the ML pipeline, they will need to identify the key tasks to be automated. Though automating the entire ML DevOps pipeline may make sense as a long-term strategy, it is often best to prioritize specific pipeline tasks most suitable for automation in the near term.
The priority ML pipeline tasks for automation are those where existing manual bottlenecks make it difficult to scale, accelerate, and bring greater repeatability to workflows from one project to the next. For any ML practice, one may automate the built-in DevOps workflows among teams of data scientists and others collaborating at every stage of the development, deployment, administration, and governance pipeline. From a DevOps standpoint, it's best to automate these collaborations around a shared repository of data, algorithms, and other assets used throughout the ML pipeline.
Table 1 provides a handy guide for developers to pinpoint the specific ML pipeline processes that can benefit from automation. Figure 1 shows the flow of these processes in the context of the principal phases of the pipeline: preparation, modeling, and operationalization.
| PHASE | PROCESS | DISCUSSION |
| --- | --- | --- |
| Preparation | Data discovery | Developers should implement solutions that automatically discover, acquire, and ingest the data needed to build and train machine learning models. This should be a standard automation process implemented on the data lake shared by ML developers, thereby ensuring that the most relevant training, validation, and other data sets are always easily discoverable and available for modeling. |
| Preparation | Data exploration | For any ML modeling exercise, this could involve automatically building visualizations of relationships of interest within the source data. This should be an automation priority when ML developers rely on standard visualizations to speed their exploration and modeling of regressions, predictions, and other patterns in the data. |
| Preparation | Data preprocessing | Developers should institute a program of automatically building training data sets through encoding of categorical variables, imputation of missing values, and other necessary programmatic data transformations, corrections, augmentations, and annotations (a sketch follows Table 1). Alternatively, if the requisite data doesn't exist, developers should automate the generation of synthetic data suited to the ML challenge at hand. As with automated data discovery, automation of preprocessing can ensure that the data has been cleansed and otherwise prepared to speed the downstream modeling and training processes. |
| Modeling | Feature engineering | For any given training data set, one may automatically generate alternative feature representations that describe the predictive variables to be included in the resulting machine learning models. Even for experienced ML modelers, iterating through alternative feature representations can be a time-consuming process. Automated generation of candidate representations can help them refine those more rapidly within the modeling segment of the pipeline. Just as important, automated feature engineering can help less experienced ML modelers benefit from the expertise that is baked into the ML automation tools that provide this capability. |
| Modeling | Algorithm selection | For any given feature set, one may automatically identify the statistical algorithms best suited to the learning challenge, such as making predictions, inferring abstractions, and recognizing objects in the data. As developers move toward more sophisticated multi-model ensembles, they will increasingly rely on tools that apply various ML algorithms to auto-generate alternative models that implement any specific feature representation. This is increasingly a must to help ML modelers decide whether to build their models on established ML algorithms (e.g., linear regression, random forests) or on any of the newer, more advanced neural-net algorithms. |
| Modeling | Model training | For any given model, one may automatically run it against a test or validation data set to determine whether it performs a machine learning task (e.g., predicting some future event, classifying some entity, or detecting some anomalous incident) with sufficient accuracy. Alternately, ML developers might use tools that automatically generate and label synthetic training data. Considering that training is a highly structured, time-consuming process, this should be an automation priority for every ML professional. |
| Modeling | Model evaluation | For two or more candidate models, one may automatically generate learning curves, partial dependence plots, and other metrics that illustrate comparative performance in accuracy, efficiency, and other trade-offs among key machine learning metrics. Considering that model evaluation metrics have long been standardized among ML professionals, there is no excuse for not automating their generation, visualization, and use in ML deployment and optimization processes. |
| Modeling | Hyperparameter tuning | For any given statistical model, one may automatically identify the optimal number of hidden layers, learning rate (adjustments made to backpropagated weights at each iteration), regularization (adjustments that help models avoid overfitting), and other hyperparameters necessary for top model performance. Hyperparameter optimization is necessary to boost model accuracy and shorten time-to-value. However, it is beyond the expertise of most ML developers and should be automated as quickly as feasible. |
| Operationalization | Model deployment | For models that have been promoted to production status, one may automatically generate customized REST APIs and Docker images around ML models during the promotion and deployment stages, and deploy the models for execution on private, public, or hybrid multi-cloud platforms (a sketch follows Figure 1). Every ML developer who builds these models into cloud-native applications should be automating these deployment processes thoroughly. |
| Operationalization | Model resource provisioning | For models that have been deployed, one may automatically scale up or down the provisioning of CPU, memory, storage, and other resources, based on changing application requirements, resource availabilities, and priorities. These provisioning functions are also essential to prior steps in the pipeline, including feature engineering, algorithm selection, and hyperparameter tuning. To the extent that the ML developer is working in a public cloud or true private cloud environment, ML resource provisioning should be automated alongside all other microservices. |
| Operationalization | Model governance | For deployed models, one may automate the processes of keeping track of which model version is currently deployed; ensuring that a sufficiently predictive model is always in live production status; and retraining with fresh data prior to redeployment. Governance becomes an automation priority as ML pipelines grow in complexity and as organizations build more ML-infused applications that need to be productionized, monitored, and logged according to compliance, security, and other mandates. |
Table 1: Machine Learning Pipeline Processes That Can Be Automated
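As a concrete illustration of the preparation-phase automation summarized in Table 1, the sketch below uses scikit-learn to impute missing values and encode categorical variables in a single reusable pipeline. The column names and sample records are hypothetical.

```python
# A minimal sketch of automated data preprocessing; the column names
# ("age", "income", "segment", "region") and records are hypothetical.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "income"]
categorical_cols = ["segment", "region"]

# Impute and scale numeric columns; impute and one-hot encode categoricals.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, numeric_cols),
    ("cat", categorical, categorical_cols),
])

raw = pd.DataFrame({
    "age": [34, np.nan, 51],
    "income": [72000, 58000, np.nan],
    "segment": ["retail", "enterprise", np.nan],
    "region": ["emea", "apac", "amer"],
})
features = preprocess.fit_transform(raw)
print(features.shape)
```

Once defined, the same transformer can be applied unattended to every new raw extract, which is the repeatability that automated preprocessing is meant to deliver.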
Figure 1: Machine Learning Pipeline Process Flow
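To ground the model-deployment row of Table 1, here is a minimal sketch of wrapping a trained model in a REST scoring endpoint with Flask. The model file name, payload format, and port are illustrative assumptions; in practice, an automated pipeline would also emit a Docker image that packages this service for the target cloud.

```python
# A minimal sketch of a REST scoring endpoint; "model.pkl" and the JSON
# payload format are illustrative assumptions.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load a previously trained, serialized scikit-learn model.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()  # e.g., {"features": [[5.1, 3.5, 1.4, 0.2]]}
    prediction = model.predict(payload["features"])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```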
Assess The Automation Features Of Available Machine Learning Development Tools
Developers have access to a growing range of tools for automating various stages of the ML pipeline. Available ML automation solutions include both commercial and open-source tools, and they come with varying degrees of integration with organizations' existing DevOps environments. Please refer to this recent Wikibon note for a comprehensive comparison of commercial and open-source tools for ML automation.
Developers may wish to explore using multiple automation tools for various stages of the ML pipeline. Considering the diversity of technical requirements for ML development, training, deployment, and management, it may become necessary for developers to write custom apps and glue code that integrates two or more solutions into a comprehensive ML automation toolchain.
Already, there is a growing range of open-source ML-automation tools in Python or R that are integrated with common libraries. These tools typically focus on automating a subset of the core processes associated with modeling, training, and refinement. They include Scikit-learn, TPOT, Auto-Sklearn, Machine-JS, Auto-Weka, Spearmint, and Sequential Model-based Algorithm Configuration (SMAC).
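As a brief illustration of how these toolkits are typically invoked, the sketch below runs TPOT's genetic-programming search over candidate scikit-learn pipelines; the generation and population settings are arbitrary illustrative choices.

```python
# A minimal sketch of AutoML with TPOT; generations and population_size
# are arbitrary illustrative settings, not tuned values.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# TPOT evolves whole preprocessing-plus-model pipelines automatically.
tpot = TPOTClassifier(generations=5, population_size=20,
                      random_state=42, verbosity=2)
tpot.fit(X_train, y_train)

print(tpot.score(X_test, y_test))
tpot.export("best_pipeline.py")  # emits the winning pipeline as plain Python
```

The export step writes the winning pipeline out as ordinary scikit-learn code, which a developer can review and check into the team's source-control repository.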
Also, several data-science tool vendors support ML automation capabilities. Chief among these are Alteryx, AWS, DataRobot, Domino Data Lab, H2O.ai, PurePredictive, Tellmeplus, and Xpanse AI. Compared with the toolkits discussed above, they automate a broader range of ML pipeline processes, focusing on modeling, training, and refinement but also automating some front-end collaboration, exploration, and preparation as well as some back-end deployment, operationalization, and governance processes.
Furthermore, leading commercial organizations (e.g., Google), nonprofit research institutes (e.g., OpenAI), and universities (e.g., MIT, University of California, Berkeley) have their smartest computer scientists working on ML automation. Consequently, there are myriad specialized ML automation projects under development in the research community, addressing tasks such as object recognition, with which the more adventurous developers may wish to familiarize themselves. Many of these incorporate advances in transfer learning and other sophisticated AI techniques to enable reuse of ML artifacts across projects. Some noteworthy ML-automation projects underway in the commercial and academic worlds include Google AutoML, MIT/MSU's Auto Tune Models, and IBM Cognito.
Implement Strong Governance Over The Automated Machine Learning Pipeline
The foundations of comprehensive ML pipeline governance are a centralized ML source-control repository, an ML data lake, and an integrated ML development environment.
ML quality control is a core governance function. As organizations automate the ML pipeline, the potential glut of models in various states of development, refinement, and production creates the risk of losing control over model quality. How can one distinguish high-quality ML models from the inevitable useless junk, and is it possible to automate this process?
No matter how automated the ML pipeline becomes, manual quality assurance will always remain essential: a core task for which human experts will be responsible. Every one of the ML pipeline automation scenarios these tools support requires a data scientist to set it up, monitor how it proceeds, and evaluate the results. In other words, expert human judgment will remain essential for ensuring that automation of machine learning development doesn't run off the rails.
Under the most likely future scenarios, ML developers will need to review the output of their automated tools in order to ensure the validity and actionability of the results. This is analogous to how quality experts have long recommended that high-throughput manufacturing facilities dedicate personnel to test samples of production runs before they're shipped to the customer. If nothing else, established ML developers should perform manual reviews of automatically generated models prior to putting those assets into production. Organizations dare not automate machine learning pipeline processes any further than they can vouch for the quality of the output.
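A minimal sketch of that sampling discipline follows, assuming hypothetical helper functions (fetch_candidate_models, queue_for_review) that a team would supply from its own tooling.

```python
# A minimal sketch of sampling automated outputs for human review; the
# helpers fetch_candidate_models and queue_for_review are hypothetical.
import random

REVIEW_FRACTION = 0.2  # review a fixed share of auto-generated models

def sample_for_manual_review(candidate_models, fraction=REVIEW_FRACTION):
    """Select a random sample of candidate models for expert sign-off."""
    k = max(1, int(len(candidate_models) * fraction))
    return random.sample(candidate_models, k)

# candidates = fetch_candidate_models()        # hypothetical helper
# for model in sample_for_manual_review(candidates):
#     queue_for_review(model)                  # hypothetical helper
```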
As scale and complexity mount in the ML pipeline, strong governance tools and practices will enable development professionals to track, automatically and in real time, metrics that address the pipeline challenges presented in Table 2 (a logging sketch follows the table).
| CHALLENGE | METRICS |
| --- | --- |
| Workstream concurrency | How many data-science workstreams are in process concurrently across your organization? |
| Training recency | How recently have you evaluated and retrained each in-production data-science build with the best training data available? How recently was the training data associated with each data-science build refreshed in your data lake? How recently was training data labeled and certified as fit for training the associated data-science models? |
| Model promotion | How recently was each trained data-science model build promoted to production status? How recently have you trained the "challenger" models associated with each in-production data-science "champion" model build and evaluated their performance vis-à-vis the champion and each other? Where does responsibility reside in your data-science DevOps pipeline for approving the latest, greatest, and fittest champion data-science model for production deployment? |
| Pipeline tracking | Are you logging all of these data-science DevOps pipeline steps? Are you archiving the logs? How searchable are the archives for process monitoring, tracking, auditing, and e-discovery? |
Table 2: ML Pipeline Governance Challenges
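One way to address the pipeline-tracking questions in Table 2 is to emit a structured, searchable record for every pipeline step. The sketch below is a minimal illustration; the field names and the local log file are assumptions, and a production implementation would write to a centralized, indexed store.

```python
# A minimal sketch of structured pipeline-step logging; field names and
# the log destination are illustrative assumptions.
import json
import time
import uuid

def log_pipeline_step(step, model_id, status, log_path="ml_pipeline_audit.jsonl"):
    """Append one searchable JSON record per pipeline step."""
    record = {
        "event_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "step": step,          # e.g., "training", "evaluation", "promotion"
        "model_id": model_id,
        "status": status,      # e.g., "started", "succeeded", "failed"
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_pipeline_step("promotion", model_id="churn-model-v7", status="succeeded")
```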
Manual ML quality assurance will always remain essential: a core task for which human developers will be responsible, no matter how much their jobs get automated.
Action Item
Wikibon recommends that developers expand their automation efforts to encompass all or most tasks needed to operationalize ML models and then monitor and manage those assets in production applications. In taking this course of action, they will be helping their organizations to keep pace with expanding ML development workloads. At the same time, automation of the ML pipeline will mitigate data-science skills deficits, ensure consistent ML practices, and augment the expertise of ML development personnel.