Databricks has announced the launch of the latest generation of its industry-leading machine learning (ML) offering at the Data + AI Summit. With Databricks Machine Learning, new and existing ML capabilities on the Databricks Lakehouse Platform are integrated into a collaborative, purpose-built experience that gives ML engineers everything they need to build, train, deploy, and manage ML models from experimentation to production, combining data and the full ML lifecycle in one place.
A new data-native platform built on top of an open lakehouse architecture, Databricks Machine Learning includes two new capabilities. The first, “Databricks AutoML,” augments the machine learning process by automating the tedious steps that data scientists currently perform manually, while still exposing enough control and transparency.
The second, “Databricks Feature Store,” improves discoverability, reuse, and governance of model features in a system integrated with the enterprise’s data engineering platform.
Databricks Machine Learning provides each member of the data team with the right tools in one collaborative environment. Users can switch between the Data Science / Engineering, SQL Analytics, and new Machine Learning experiences to access tools and features relevant to their everyday workflow. Databricks Machine Learning also provides a new ML-focused start page that surfaces the new ML capabilities and resources, with quick access to Experiments, the Feature Store, and the Model Registry. Built on an open lakehouse foundation, Databricks Machine Learning lets customers work with any type of data, at any scale, for machine learning: from traditional structured tables, to unstructured data like videos and images, to streaming data from real-time applications and IoT sensors. Customers can then move quickly through the ML workflow to get more models to production faster.
Databricks AutoML: Jumpstart new projects and automate tedious ML tasks
AutoML has the potential to let data teams build ML models more quickly by automating much of the heavy lifting involved in the experimentation and training phases. However, enterprises that use AutoML tools today often struggle to get AutoML models to production. This happens because the tools provide no visibility into how they arrive at their final model, which makes it impossible to tune the model or troubleshoot it when edge cases in the data lead to low-confidence predictions. Additionally, organizations can find it difficult to satisfy compliance requirements that oblige them to explain how a model works, because they lack visibility into the model’s code.
The AutoML capabilities within Databricks ML take a unique ‘glass box’ approach instead. AutoML not only allows data teams to quickly produce trained models through either a UI or an API, but also auto-generates the underlying experiments and notebooks with code, so data scientists can easily validate an unfamiliar data set or modify the generated ML project. Data scientists have full transparency into how a model operates and can take control at any time. This transparency is critical in highly regulated environments and for collaboration with expert data scientists.
All AutoML experiments are integrated with the rest of the Databricks Lakehouse Platform, including MLflow, which tracks the parameters, metrics, artifacts, and models associated with every trial run. This makes it easy to compare models and deploy them to production.
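The idea of tracking every trial so that models can be compared afterwards can be sketched in a few lines of plain Python. This is only an illustration of the concept, not the Databricks AutoML or MLflow API: the Trial record, the best_trial helper, the model names, and the metric values are all hypothetical and made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Trial:
    """One candidate model from an automated search (hypothetical record)."""
    name: str
    params: dict
    metrics: dict = field(default_factory=dict)


def best_trial(trials, metric, higher_is_better=True):
    """Return the trial with the best recorded value for `metric`."""
    scored = [t for t in trials if metric in t.metrics]
    if not scored:
        raise ValueError(f"no trial recorded metric {metric!r}")
    pick = max if higher_is_better else min
    return pick(scored, key=lambda t: t.metrics[metric])


# Illustrative values only; these are not real benchmark results.
trials = [
    Trial("logistic_regression", {"C": 1.0}, {"f1": 0.81}),
    Trial("random_forest", {"n_estimators": 200}, {"f1": 0.86}),
    Trial("gradient_boosting", {"learning_rate": 0.1}, {"f1": 0.84}),
]

winner = best_trial(trials, "f1")
print(winner.name, winner.params, winner.metrics)
```

Because each trial keeps its parameters alongside its metrics, the winning configuration stays inspectable and reproducible, which is the property the 'glass box' approach relies on.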
Databricks Feature Store: Streamline ML at scale with simplified feature sharing and discovery
Machine learning models are built using features, which are the attributes a model uses to make a prediction. To work efficiently, data scientists need to be able to discover what features exist within their organization, how they are built, and where they are used, rather than wasting significant time reinventing them. Additionally, feature code needs to be kept consistent across the several teams that participate in the ML workflow. Otherwise, model performance will drift apart between real-time and batch use cases, a problem called online/offline skew.
The Databricks Feature Store is the first of its kind to be co-designed with a data and MLOps platform. Tight integration with the popular open source frameworks Delta Lake and MLflow guarantees that data stored in the Feature Store is open, and that models trained with any ML framework can benefit from the integration of the Feature Store with the MLflow model format. Most importantly, the Feature Store eliminates online/offline skew by packaging feature store references with the model, so that the model itself can look up features from the Feature Store instead of requiring a client application to do so. As a result, features can be updated without any changes to the client application that sends requests to the model. The Feature Store also enables reusability and discoverability through automated lineage tracking, which records the data sources used for feature computation as well as the exact version of the code that was used. With this, a data scientist can find all of the features that have already been defined on the raw data they are planning to use. Finally, the Feature Store knows exactly which models and endpoints consume any given feature, facilitating end-to-end lineage as well as safe decision-making on whether a feature can be updated or deleted.
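The pattern of packaging feature references with the model can be illustrated with a toy, in-memory sketch. This is not the Databricks Feature Store API: the FeatureStore and PackagedModel classes, the table and feature names, and the scoring rule are all hypothetical stand-ins chosen for the example.

```python
class FeatureStore:
    """Toy in-memory store keyed by table name and entity id (hypothetical)."""

    def __init__(self):
        self._tables = {}

    def write(self, table, key, features):
        # Overwrite the feature row for this entity.
        self._tables.setdefault(table, {})[key] = dict(features)

    def read(self, table, key, names):
        row = self._tables[table][key]
        return [row[n] for n in names]


class PackagedModel:
    """A model bundled with references to the features it needs."""

    def __init__(self, predict_fn, store, table, feature_names):
        self._predict_fn = predict_fn
        self._store = store
        self._table = table
        self._feature_names = feature_names

    def predict(self, key):
        # The model, not the client, resolves features from the store.
        values = self._store.read(self._table, key, self._feature_names)
        return self._predict_fn(values)


store = FeatureStore()
store.write("customer_features", "c1", {"orders_30d": 4, "avg_basket": 25.0})

# Hypothetical scoring rule standing in for a trained model.
model = PackagedModel(
    lambda v: v[0] * v[1], store, "customer_features",
    ["orders_30d", "avg_basket"],
)

before = model.predict("c1")  # the client sends only the entity key
store.write("customer_features", "c1", {"orders_30d": 5, "avg_basket": 25.0})
after = model.predict("c1")   # same client call, refreshed feature value
print(before, after)
```

The client code is identical before and after the feature update, and the same lookup path serves every caller, so there is no second copy of the feature logic that could fall out of sync.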
“Humana’s machine learning platform, FlorenceAI, is enabling us to automate and accelerate the delivery lifecycle of ML solutions at scale. Databricks has been an essential underlying technology, with hundreds of our data scientists using the platform to deliver dozens of models in production, so that our teams are able to operate at orders of magnitude faster than before,” said Slawek Kierner, Senior Vice President of Enterprise Data and Analytics at Humana.