In the vast existing landscape of data analytics and machine learning, Databricks stands out as a powerful and innovative platform to build, deploy, share, and maintain enterprise-grade data, analytics, and AI solutions at scale.
Databricks provides an efficient and collaborative environment for data scientists, engineers, and analysts, simplifying the complexity involved in big data processing and analysis, as well as in machine learning projects.
In this series of monthly articles, which we kick off with this one, we will explore different aspects of this tool. We will start with its architecture and continue with ways to create workflows that orchestrate data processing, machine learning, and analytics pipelines. We will also explore Delta Live Tables (DLTs), the framework that helps teams build cost-effective ETL with built-in data quality and error handling. We will then cover the components available on the platform for model governance, such as the model registry or the feature store, and finish with Databricks AutoML, the solution to quickly generate baseline machine learning models, among other interesting content.
It will be a long trip, but we promise that it will pay off :). Let’s go!
Databricks lets data scientists, engineers, and analysts collaborate seamlessly with each other while handling tedious backend tasks for them. Roughly, Databricks operates through two layers that support its functionalities: the so-called control plane and compute plane.
In general terms, the control plane manages all backend services. It’s where notebooks and many other workspace configurations chill out, encrypted and secured. The compute plane, on the other hand, works under the control plane and is where data gets processed. For example, the compute plane can connect Databricks with other services, such as an Azure subscription, to use the network configuration and resources already defined there. In other cases, the compute plane resides inside the Databricks subscription, as is the case for serverless SQL warehouses or native model serving solutions. Databricks also manages connections with external data sources and storage destinations for batch and streaming data. It’s like having a bunch of vessels that link the platform with the outside world.
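For illustration, the snippet below is a minimal, hedged sketch of how compute can be provisioned in the compute plane through the control plane’s APIs, using the databricks-sdk Python package. Authentication is assumed to be already configured (for example, via environment variables), and the cluster name and sizes are purely illustrative.

```python
# Sketch: asking the control plane to provision a cluster in the compute plane.
# Assumes databricks-sdk is installed and authentication is configured.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

cluster = w.clusters.create_and_wait(
    cluster_name="demo-cluster",                                    # illustrative name
    spark_version=w.clusters.select_spark_version(long_term_support=True),
    node_type_id=w.clusters.select_node_type(local_disk=True),
    num_workers=1,
    autotermination_minutes=30,  # shut down idle compute automatically
)
print(cluster.cluster_id)
```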
This is the main Databricks structure. What is built upon this basis? In the next section, we will explore how these layers are used to shape one big continent from which AI/ML solutions are managed, controlled, enhanced, and deployed to the real world.
Let’s start with the formation of a Data Intelligence Platform. A Data Intelligence Platform is a common place to store massive amounts of structured and unstructured data, and it aims to combine the benefits of classic data warehouses and data lakes. Briefly, data warehouses are built to store structured data ready to be consumed by business intelligence (BI) applications. Data lakes, in contrast, provide flexible, scalable storage and admit all data types. Classically, data lakes are used after some ETL process has been executed, for example, to provide a storage and consumption solution for ML & AI applications that require several data transformations organized by layers.
Databricks’ Data Intelligence Platform joins both worlds to mount a unified, open, and scalable data architecture built on metadata layers with rich management features. Thus, the Data Intelligence Platform architecture attempts to bridge the gap between the structured and optimized nature of data warehouses and the flexibility and scalability of data lakes.
This is illustrated in the figure above. As we can see, data is stored, processed, and consumed in the Data Intelligence Platform for all business solutions. To address the problem of how data is represented, the Data Intelligence Platform relies on the open Parquet format, keeping the computational benefits classically offered by data lake and data warehouse solutions.
Let’s continue with the figure presented in the previous section. Following its rows, “Storage”, “Computation”, and “Consumption”, we will discuss next which benefits Databricks’ Data Intelligence Platform offers compared to the other alternatives:
The Data Intelligence Platform architecture enhances AI/ML solution architectures with an optimized ecosystem of already-integrated tools, introduced below.
Databricks’ Data Intelligence Platform serves as the ground on which a whole set of features is built to enable complex ML and AI applications on top of it. Next, we list the main ones, although there are others.
We will dig deeper into each of these components in the following pills. Pinky promise!
Unity Catalog is the unified governance solution for data and AI in the Data Intelligence Platform. Unity Catalog will be the main structure where our solution is built: all of our data, features, models, and pipelines will be stored here. Unity Catalog interoperates with the major cloud services and is capable of ingesting data from these external services. This tool not only gives us the possibility to build a complete AI/ML solution but also provides scalable features for enhanced BI tools that can serve the final client.
Unity Catalog is built on the following structure and components, which we describe below:
The metastore acts as a sentinel, safeguarding crucial metadata for data and AI assets. It is more than a repository, as it also defines access permissions. The catalog, as the foundational layer, orchestrates the organization of the different data assets. The schema (database) contains tables, views, volumes, models, and functions. Thus, schemas are created inside catalogs.
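To make this hierarchy tangible, here is a minimal sketch of the three-level namespace (catalog.schema.object) and of how access can be granted on it, assuming a workspace with Unity Catalog enabled and sufficient privileges; all names are illustrative.

```python
# Sketch: creating Unity Catalog objects and granting access from a Databricks
# notebook, where "spark" is the session provided by the runtime.
spark.sql("CREATE CATALOG IF NOT EXISTS retail_demo")
spark.sql("CREATE SCHEMA IF NOT EXISTS retail_demo.silver")
spark.sql("""
    CREATE TABLE IF NOT EXISTS retail_demo.silver.orders (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DOUBLE
    )
""")
# Governance lives on the same objects, e.g. granting read access to a group.
spark.sql("GRANT SELECT ON TABLE retail_demo.silver.orders TO `data_analysts`")
```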
Databricks usually proposes the medallion architecture to organize data inside the data lake, separated by layers. It is a data design pattern used to logically organize data in the Data Intelligence Platform. This way, data quality and data structure are added incrementally as data flows from layer to layer. Thus, we can differentiate between the bronze layer, the silver layer, and the gold layer.
Finally, on top of these layers, we can find more specific ones created to support a particular business case or scenario, for example, a product recommendation model that we are building. In this final layer, consolidated data for ML models emerges.
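As an illustration, the following sketch shows how data could move from bronze to silver to gold with PySpark in a Databricks notebook (where the spark session is provided by the runtime); table and column names are assumptions made for the example.

```python
# Sketch: incrementally refining data through the medallion layers.
from pyspark.sql import functions as F

# Bronze: raw events ingested as-is.
bronze_df = spark.read.table("retail_demo.bronze.raw_orders")

# Silver: cleaned and conformed records.
silver_df = (
    bronze_df
    .dropDuplicates(["order_id"])
    .filter(F.col("amount") > 0)
    .withColumn("order_date", F.to_date("order_ts"))
)
silver_df.write.mode("overwrite").saveAsTable("retail_demo.silver.orders")

# Gold: business-level aggregates ready for consumption.
gold_df = silver_df.groupBy("customer_id").agg(F.sum("amount").alias("total_spent"))
gold_df.write.mode("overwrite").saveAsTable("retail_demo.gold.customer_spend")
```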
Within each schema, Functions, Features, Models, and Tables may appear; these support data processing and model governance and constitute the main assets related to a business case.
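As a taste of how one of these assets, a Model, ends up inside a schema, here is a minimal sketch that registers an MLflow model in Unity Catalog; the scikit-learn model and the catalog.schema.model name are illustrative assumptions.

```python
# Sketch: registering a model as a Unity Catalog asset via MLflow.
import mlflow
import numpy as np
from sklearn.linear_model import LogisticRegression

mlflow.set_registry_uri("databricks-uc")  # point the MLflow registry at Unity Catalog

X, y = np.random.rand(100, 4), np.random.randint(0, 2, 100)
model = LogisticRegression().fit(X, y)

with mlflow.start_run():
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        input_example=X[:5],  # signature is inferred, as required by the UC registry
        registered_model_name="retail_demo.ml.product_recommender",  # catalog.schema.model
    )
```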
As shown in the figure, the solution is composed of two parts. The first part is in charge of building the medallion architecture seen in the previous section; its final output can be used for multiple business use cases. The second part extracts data from the gold layer to enrich and prepare it for use by a product recommendation model.
The term “Delta” comes from the open-source storage layer “Delta Lake”. This storage layer boosts data lakes’ reliability by adding a transactional layer on top of cloud-stored data (AWS S3, Azure Storage, GCS). It supports ACID transactions, data versioning, and rollback capabilities, facilitating unified processing of both batch and streaming data.
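A minimal sketch of what this buys us in practice, assuming an illustrative table name and the spark session of a Databricks notebook: every write produces a new table version that we can query later.

```python
# Sketch: Delta table versioning and time travel.
df = spark.createDataFrame([(1, "laptop"), (2, "phone")], ["product_id", "product_name"])
df.write.format("delta").mode("overwrite").saveAsTable("retail_demo.silver.products")

# Overwriting with new data is recorded as a new version in the transaction log.
df2 = spark.createDataFrame([(1, "laptop"), (3, "tablet")], ["product_id", "product_name"])
df2.write.format("delta").mode("overwrite").saveAsTable("retail_demo.silver.products")

# Time travel: read the table as it was at version 0 (a rollback-style read).
previous = spark.sql("SELECT * FROM retail_demo.silver.products VERSION AS OF 0")
previous.show()
```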
In Databricks, we have two main assets built on Delta Lake: Delta Tables and Delta Live Tables. Delta Tables is the default table format for all tables stored in Databricks. Delta Live Tables, abbreviated as “DLT” from now on, can be thought of as a “living” version of Delta Tables: tables with built-in transformation pipelines. This feature improves performance by autoscaling resources whether we are working with batch or streaming workloads.
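To give an idea of what a DLT definition looks like, here is a minimal sketch in Python; it only runs when attached to a DLT pipeline, and the source table and data quality expectation are illustrative assumptions.

```python
# Sketch: a Delta Live Tables definition with a built-in data quality rule.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Cleaned orders, refreshed by the DLT pipeline")
@dlt.expect_or_drop("valid_amount", "amount > 0")  # drop rows that fail the expectation
def clean_orders():
    return (
        spark.read.table("retail_demo.bronze.raw_orders")
        .withColumn("order_date", F.to_date("order_ts"))
    )
```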
These two features are not the only ones using the “Delta” name. Databricks optimizes the queries made to these tables with the Delta Engine, pushing computation down to the data. Delta Sharing is an open standard used to securely share data between organizations. And to keep track of everything we have the Delta Log, a place where we can read the whole history of our tables.
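For instance, the Delta Log of a table can be inspected directly; a minimal sketch, assuming the illustrative table from the previous example:

```python
# Sketch: reading the transaction history recorded in the Delta Log.
history = spark.sql("DESCRIBE HISTORY retail_demo.silver.products")
history.select("version", "timestamp", "operation").show(truncate=False)
```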
This article marks the commencement of our exploration into the capabilities of Databricks, a platform that has laid the foundations for our journey. We have introduced fundamental aspects of our metaphorical ship, highlighting its potential as an optimal environment for Machine Learning (ML) and Artificial Intelligence (AI) solutions. Databricks is showcased as a seamlessly integrated solution housed within a singular platform, offering performance, scalability, usability, and reliability.
In the following pills, we will embark on an in-depth investigation, delving into each layer of this integrated framework to unravel the possibilities it presents for advancing ML/AI applications.
[Databricks. Overview (Control & Compute Plane)]
[Databricks. (2020, January 30). What is a Data Lakehouse?]
[Databricks. The Data Lakehouse Platform for Dummies]
[Databricks. Delta Lake Guide. Retrieved]
[Databricks. Delta Live Tables Guide. Retrieved]
[Databricks. (2023, October). The Big Book of MLOps, 2nd Edition]
[Doe, J. (2022, January 16). What is a Data Intelligence Platform? Databricks Blog]
Consultant in AI/ML projects participating in the Data Science practice unit at SDG Group España with experience in the retail and pharmaceutical sectors. Giving value to business by providing end-to-end Data Governance Solutions for multi-affiliate brands. https://www.linkedin.com/in/gzabaladata/
ML architect and specialist lead participating in the architecture and methodology area of the Data Science practice unit at SDG Group España. He has experience in different sectors such as the pharmaceutical, insurance, telecommunications, and utilities sectors, managing different technologies in the Azure and AWS ecosystems. https://www.linkedin.com/in/angelmoras/