The Real Challenge in (Useful) Machine Learning isn’t Learning

Last week, I reflected on the past decade of ML systems research and how new abstractions and software frameworks for ML (e.g., scikit-learn, PyTorch, TensorFlow) combined with readily available data and compute to catalyze the ML revolution.

This was the state of the world when I went on the academic job market in 2015. The investment in systems for training models made it clear that “learning” was increasingly “solved” (at least for standard models) and I believed the next big frontier in ML systems would be in how we deliver predictions and respond to feedback. This thesis has driven the research in my group over the last 7 years.

In this blog post, I will summarize what was (and remains) the biggest challenge in production machine learning, what we learned, what we got right, and what we got wrong.

The Next Frontier: Testing? No, Inference!

If the process of training models was solved, what would come next?  What do you do with a model once you train it?  You test it!  And, if the accuracy looks good (perhaps by chance), you write a paper about it, tweet and blog about the performance, and probably resubmit that paper several times with minor revisions until it gets published.

I wish I were joking, but this is the life of the modern ML researcher.  The idea that training and testing are the two paradigms of machine learning is so ingrained in our way of thinking that we mostly discuss only train and test data. It is hard to imagine anything beyond the test dataset. But dear ML graduate student, there is more!

Back in the real world, machine learning practitioners use models to make predictions that solve problems. This process of making predictions is called inference (or sometimes scoring, prediction serving, or model serving).

We in the academic machine learning community have been fixated on training models, but what we’ve forgotten is that models only create value when they’re used to solve problems.

The concluding slide from my 2015 job talk depicts the cyclic interaction between data, models, and the actions we take based on those models.

Historically, companies that have used machine learning successfully have devoted significant engineering effort to inference and the problems around managing the inference process — sometimes even building custom silicon exclusively for inference.  For example, in the widely cited Google paper “Hidden Technical Debt in Machine Learning Systems,” Sculley et al. depict serving as the largest challenge in production machine learning.

The key figure from the paper “Hidden Technical Debt in Machine Learning Systems”, in which Sculley et al. discuss the many challenges of production machine learning at Google.  Notice that the largest box is actually the serving infrastructure, and this doesn’t even include monitoring, resource management, or feedback collection.

But what is so challenging about making predictions from a trained model? After all, in scikit-learn, you just invoke model.predict(x).
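
In a notebook, that really is all it takes. Here is a minimal, self-contained sketch (using synthetic data, purely for illustration) of what inference looks like from the researcher's seat:

```python
# A minimal sketch of "inference" as it appears in a notebook: train once, then
# call predict(). Only standard scikit-learn APIs are used; the data is synthetic.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1_000).fit(X, y)

# "Inference" in one line: no latency budget, no scaling, no feedback loop.
predictions = model.predict(X[:5])
print(predictions)
```

Nothing in this snippet hints at latency budgets, traffic bursts, hardware placement, or feedback collection, which is exactly where the difficulty lives.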

Why is inference so challenging?

When my research group at UC Berkeley embarked on this new inference-focused agenda, we did so from the perspective of technology giants.  My group, like many others, drew inspiration from the challenges faced by sponsors and recent alumni from companies like Amazon, Google, Meta, and Uber.  These large, ML-driven companies had entire teams dedicated to rendering predictions for mission-critical models.  Tasks like real-time click-through rate prediction, content recommendation, fraud detection, and route pricing were central to their business and critically relied on constantly evolving data inputs.  As a result, they were deploying large decision trees and deep neural networks, and they had tight prediction latency requirements (<10ms) for multi-stage prediction pipelines. They needed to support bursts of user traffic and dynamically scale to various tiers of hardware accelerators.  And, all of this had to be done while taking cost, high availability, and large volumes of feedback into account.

The majority of the machine learning world is focused on learning, which combines development and training. However, the opportunity for impact actually comes after learning, when models are deployed to render predictions that drive actions and solve problems.

These were the challenges of exciting systems research papers, and we spent the next 7 years writing those papers. As we’d later learn, that wasn’t quite what the rest of the world needed — but we’ll come back to that in a moment.

Velox: The Proto-Feature Store

Our first project, Velox, bridged our work on machine learning in Apache Spark with the emerging challenges around model deployment. The key insight in Velox stemmed from a common pattern in both classic feature engineering and contemporary deep neural networks: Most models can be decomposed into simple linear models layered over complex-but-slowly-changing features. Velox coupled a fast cache for pre-materialized features with a lightweight real-time serving layer for the simple linear models:

This figure depicts the Velox architecture that bridges training and inference.  To address the cost of slow feature functions (e.g., data lookups or neural network feature transformations), Velox maintained a feature cache.  Low-latency prediction serving was accomplished by applying lightweight linear models on top of the cached features.  This architecture emphasized online learning for fast updates to the linear models while allowing slower updates to the features.
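
To make that decomposition concrete, here is a minimal sketch, not the actual Velox code, of how expensive features can be pre-materialized in a cache while a lightweight linear model stays fresh through online updates. All names and the single-step SGD update are illustrative.

```python
# Illustrative sketch of the Velox-style decomposition: slow, slowly-changing
# features are precomputed and cached; serving is a cache lookup plus a dot
# product; feedback updates only the cheap linear weights.
import numpy as np

feature_cache = {}   # item_id -> precomputed feature vector
user_weights = {}    # user_id -> per-user linear model weights

def precompute_features(item_id, slow_featurizer):
    """Run the expensive feature function offline and cache the result."""
    feature_cache[item_id] = np.asarray(slow_featurizer(item_id), dtype=float)

def predict(user_id, item_id):
    """Low-latency path: no featurization at request time."""
    x = feature_cache[item_id]
    w = user_weights.setdefault(user_id, np.zeros(len(x)))
    return float(w @ x)

def online_update(user_id, item_id, label, lr=0.01):
    """Fast personalization: adjust only the linear weights on new feedback."""
    x = feature_cache[item_id]
    w = user_weights.setdefault(user_id, np.zeros(len(x)))
    user_weights[user_id] = w + lr * (label - w @ x) * x   # one SGD step
```

The point of the split is that the expensive part (the features) can be refreshed slowly and offline, while the cheap part (the weights) can be updated within milliseconds of receiving feedback.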

While this architecture is commonly used in today’s feature stores (see our blog post), it was ahead of its time — few teams needed this level of sophistication in 2015. Instead, by that point, the world was eagerly adopting new Python ML frameworks to develop complex multi-stage models. What people really needed, we thought, was the ability to serve predictions from multiple frameworks with low latency and high throughput.

Shipping Many Models at Once with Clipper

Guided by the lessons learned from Velox, we began to explore a new layered architecture for prediction serving. The idea was to build a middle layer that interposed between prediction requests and the underlying machine learning frameworks used to serve those requests.  This led to the Clipper project.

An early slide depicting the Clipper architecture.  Each model (and its framework) is wrapped in a model worker running in a separate Docker container.  Clipper then acted as a router that mapped endpoint requests to specific containers and horizontally scaled the containers in response to request load.  In addition, Clipper introduced cross-framework online learning, caching, and request batching.  This architecture was novel at a time when most model serving systems were much more monolithic and tightly integrated with the ML framework.  Today, it is the standard architecture adopted by most model serving systems (e.g., TF Serving, SageMaker, MLflow).

Clipper pioneered containerized prediction serving on Kubernetes, enabling low latency and horizontal autoscaling while also addressing the challenges of library compatibility and performance isolation. The middle-layer architecture allowed us to implement performance optimizations (e.g., caching, batching) as well as online learning across ML frameworks.
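
As a rough illustration of what that middle layer does, the sketch below shows adaptive request batching in the style of Clipper. The `model_batch_predict` function is a hypothetical stand-in for the RPC into a containerized model worker; none of this is Clipper's actual API.

```python
# Rough sketch of adaptive request batching in a serving middle tier.
import queue
import threading

request_queue = queue.Queue()   # holds (input, callback) pairs from frontend requests

def model_batch_predict(inputs):
    # Placeholder for one batched call into a model container (e.g., over REST/gRPC).
    return [len(x) for x in inputs]

def batching_loop(max_batch_size=8, timeout_s=0.005):
    """Wait for a first request, gather more for up to timeout_s, then predict once."""
    while True:
        x, cb = request_queue.get()              # block until a request arrives
        inputs, callbacks = [x], [cb]
        while len(inputs) < max_batch_size:
            try:
                x, cb = request_queue.get(timeout=timeout_s)
                inputs.append(x)
                callbacks.append(cb)
            except queue.Empty:
                break                            # timeout hit: send what we have
        for output, callback in zip(model_batch_predict(inputs), callbacks):
            callback(output)                     # hand each prediction back to its caller

threading.Thread(target=batching_loop, daemon=True).start()
# Frontend handlers submit work as: request_queue.put((some_input, some_callback))
```

Batching amortizes per-request overhead and keeps hardware busy, at the cost of a small, bounded amount of added latency.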

Clipper saw adoption by companies like LINE, and many later prediction serving systems, including TensorFlow Serving and SageMaker, adopted the same architecture. The amazing students involved in the Clipper project eventually graduated, and the architecture was folded into other research projects, Cloudflow and Ray Serve. (One of the downsides of being a professor is that your best students always graduate.)

From Models to Model Pipelines and Auto-Scaling

Many early Clipper users ran into the same basic problem: Before or after invoking a model, they needed to apply custom logic for tasks like computing features or translating predicted probabilities into decisions. In some cases, users wanted to compose multiple models into a single pipeline to render more complex predictions.

In the same way that simple abstractions enabled powerful ML models, it became clear that a simple abstraction for composing data processing, featurization, and prediction steps would help enable simpler prediction serving. Serving prediction pipelines was an interesting challenge for two reasons — first, we had to design the right API that enabled users to quickly and intuitively compose their Python operations; second, we needed intelligent infrastructure to automatically manage the scaling, device placement, and queuing overheads of each operator.

This figure from the Cloudflow paper depicts a basic but representative pipeline that we encoded in a simple dataflow API.  Cloudflow then applied a range of strategies to meet tail latencies in a serverless setting.
This figure from the InferLine paper emphasizes the new approach to provisioning CPUs and GPUs for stages in complex pipelines. The small boxes on each arrow depict the queues that became a key challenge for resource provisioning.

We built two research prototypes to tackle these challenges. Cloudflow introduced a simple dataflow abstraction for creating prediction pipelines and leveraged the dataflow structure to optimize their performance. InferLine built a policy engine and resource manager that used real-world workload traces to determine the optimal resource allocation for each stage and to adjust that allocation in response to workload changes.
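
For intuition, here is a minimal sketch of what a dataflow-style pipeline API can look like. The `Pipeline` class and the stage functions are illustrative, not the actual Cloudflow API; the real system also captures per-stage resource hints and the DAG structure needed for scaling and placement.

```python
# Illustrative dataflow-style pipeline: each stage is a plain Python function,
# and the pipeline composes them so the serving layer can reason about stages.
class Pipeline:
    def __init__(self):
        self.stages = []

    def then(self, fn):
        """Append a stage; a real system would also record per-stage resource hints."""
        self.stages.append(fn)
        return self

    def __call__(self, request):
        result = request
        for stage in self.stages:
            result = stage(result)
        return result

def featurize(request):
    return {"features": [len(request["text"])]}        # stand-in feature transform

def model(features):
    return {"score": 0.1 * features["features"][0]}    # stand-in model call

def postprocess(scored):
    return "accept" if scored["score"] > 0.5 else "reject"

pipeline = Pipeline().then(featurize).then(model).then(postprocess)
print(pipeline({"text": "an example request"}))        # -> "accept" for this input
```

Once the pipeline is expressed this way, the infrastructure (rather than the user) can decide how many replicas each stage needs, which device it runs on, and how deep its queues are allowed to grow.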

Why Machine Learning isn’t Useful (Yet)

At the end of 2020, we founded Aqueduct to commercialize our research by building the next generation of easy-to-use infrastructure that connects machine learning models to the world.

Roman aqueducts were simple-but-critical pieces of infrastructure that connected (data?!) lakes to everyone, and they’ve withstood the test of time.

As we’ve been building Aqueduct, we’ve talked to 160+ data science teams. I can confidently say that we were right: Deploying models for inference was the next step in making machine learning useful. But the biggest lesson we’ve learned is that while everyone is struggling with model deployment, it isn’t for any of the reasons we studied (latency, scale, performance, or even cost).

All along, we thought the goal of prediction serving was to build more scalable, higher-performance, and more cost-efficient cloud infrastructure. In reality, no one needed those things (yet) — what everyone needed was an easier way to deliver predictions to data systems, products, and processes.

Meanwhile, the ML systems community continues to build increasingly sophisticated, difficult-to-manage systems designed for armies of software engineers to operate… and tosses them to data science teams. The worst part is that my research group led the charge towards this engineering-centric infrastructure. Clipper pioneered this design, making the case for microservices running in Docker containers on a Kubernetes cluster, with a highly-configurable auto-scaling middle layer. Oops.

This is crazy, and it completely misses the mark for what most teams need. Data scientists aren’t software engineers and shouldn’t have to battle ops tools on a daily basis to do their jobs (neither should software engineers). What data science teams need is simplified prediction infrastructure: a lightweight way to design, deploy, manage, and debug data-intensive prediction pipelines in all settings (batch, streaming, real-time) and without ops complexity.

This is exactly what we’ve been working on at Aqueduct. We’ll share more about what we’re building in our next blog post!