Databricks has become a reference product in the field of data platforms, providing a unified environment for engineering and analytical roles. Since not all organizations have the same types of workloads, Databricks offers different plans that let organizations adapt it to their needs, and this has a direct impact on the design of the platform architecture.
This series of articles aims to address the integration of Databricks in AWS environments by analyzing the alternatives the product offers with respect to architectural design. Given the length of the content, it has been split into two parts:
Second part (coming soon):
Databricks was created with the idea of developing a single environment in which profiles such as Data Engineers, Data Scientists and Data Analysts could work collaboratively, without needing external service providers to supply the functionality each of them requires on a daily basis. Databricks was founded by the creators of Spark, and has published Delta Lake and MLflow as Databricks products following the open-source philosophy:
This new collaborative environment made a strong impression at its presentation thanks to the novelties it offered by integrating different technologies:
Before analyzing the different alternatives Databricks provides with respect to infrastructure deployment, it is useful to know the product’s main components:
Control Plane: hosts the Databricks back-end services that expose the graphical interface and the REST APIs for account and workspace management. These services are deployed in an AWS account owned by Databricks.
Data Plane: hosts all the infrastructure needed for data processing: persistence, clusters, logging services, Spark libraries, etc. The Data Plane is deployed in the customer’s AWS account and can be managed by:
DBFS: distributed storage system available to clusters. It is an abstraction over an object storage service, in this case S3, and allows access to files and folders without using URLs.
Databricks REST APIs: REST APIs made available by Databricks through which its resources can be managed programmatically.
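As a sketch of how these APIs can be consumed, the snippet below builds (without sending) an authenticated request against the Clusters API using only the Python standard library. The workspace URL and token are hypothetical placeholders:

```python
import urllib.request

def build_list_clusters_request(host: str, token: str) -> urllib.request.Request:
    """Build an authenticated GET request for the Clusters API
    (/api/2.0/clusters/list) without sending it."""
    url = f"{host}/api/2.0/clusters/list"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})

# Hypothetical workspace URL and personal access token:
req = build_list_clusters_request(
    "https://dbc-example.cloud.databricks.com", "dapi-XXXXXXXX"
)
# urllib.request.urlopen(req) would return the JSON description of the
# workspace clusters (requires network access and a valid token).
```

The same pattern applies to the rest of the resource endpoints (jobs, instance profiles, workspaces, etc.), changing only the path and HTTP verb.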
External Data Sources: possible data sources hosted outside the customer’s AWS account, such as a relational database or object storage service.
The price indicated by Databricks is tied to the DBUs (Databricks Units) consumed by the cluster. This parameter reflects the processing capacity consumed and depends directly on the type of instances selected (an approximate calculation of the DBUs consumed per hour is shown when configuring the cluster).
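As a back-of-the-envelope illustration of this calculation (the per-node DBU consumption and DBU rate below are assumed figures, not real list prices), the hourly DBU charge for a cluster is the number of nodes times their DBU consumption per hour times the price per DBU:

```python
def cluster_dbu_cost_per_hour(num_workers: int,
                              dbu_per_node_hour: float,
                              price_per_dbu: float) -> float:
    """Approximate hourly DBU charge for a cluster (driver + workers).
    Excludes the underlying EC2 instance cost, which AWS bills separately."""
    total_dbu_per_hour = (num_workers + 1) * dbu_per_node_hour  # +1 for the driver
    return total_dbu_per_hour * price_per_dbu

# 4 workers plus a driver, each consuming an assumed 0.75 DBU/hour,
# at an assumed rate of $0.40 per DBU:
hourly_cost = cluster_dbu_cost_per_hour(4, 0.75, 0.40)  # 3.75 DBU/h -> $1.50/h
```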
The price charged per DBU depends on two main factors:
The combination of both computational and architectural factors defines the final cost of each DBU per hour of work.
One platform for your data analytics and ML workloads
Data analytics and ML at scale across your business
Data analytics and ML for your mission critical workloads
Jobs Light Compute
Serverless SQL Compute
Imputed cost per DBU for computational and architectural factors
The following table shows the main characteristics by type of workload:
Jobs Light Compute
Managed Apache Spark
Job scheduling with libraries
Job scheduling with notebooks
Databricks Runtime for ML
Delta Lake with Delta Engine
Notebooks and collaboration
Characteristics by type of workload
The following table reflects the features included in each plan, focusing on the integration with AWS. These features can be a determining factor when selecting a plan.
Data Plane Control
Databricks Managed VPC
Customer Managed VPC
Control Plane Networking
Cluster - Control Plane (Public Connection)
Cluster - Control Plane (Private Link)
AWS S3 Permissions Management
Credentials Passthrough (SCIM)
Control Plane Encryption
Default DMK Key
DMK & CMK Keys
At Rest Encryption
S3 Bucket - EBS
Features related to the subscription plan.
The following diagram shows the communication channels between Control Plane and Data Plane:
This alternative is characterized by the Data Plane being managed by Databricks. The infrastructure is deployed through a cross-account role that allows Databricks to set up and configure the necessary resources. The implemented connections are as follows:
This alternative is characterized by the Data Plane being managed by the customer. The advantages of this alternative are the following:
The details of the connections implementing an internal metastore with Glue, VPC Endpoints for STS and Kinesis and private connections are as follows:
Secure cluster connectivity uses a TLS certificate hosted in a HashiCorp Vault on the Control Plane, both for the Databricks Managed VPC and for the Customer Managed VPC.
The following image shows how, by opting for a Customer Managed VPC, the same VPC can be reused across different workspaces:
It is important to note that a configuration transition from a Databricks Managed VPC to a Customer Managed VPC is not possible.
When opting for a Customer Managed VPC, secure cluster connectivity can be established over an internal channel by provisioning a Private Link between the Control Plane and the Data Plane (back-end). Likewise, a Private Link will be enabled so that the Data Plane can reach all the REST APIs (back-end).
In addition, a transit VPC can be enabled with a Private Link and a Site-to-Site VPN, through which users can connect privately to the Control Plane (front-end).
The requirements to be able to deploy these Private Links are as follows:
All these connections can be seen in the following image:
The official documentation describes the steps needed to establish these connections.
If Private Links are enabled on both the front-end and the back-end, Databricks can be configured to reject any connection made over public channels. In that case the default metastore hosted on the Control Plane cannot be used, because AWS does not yet support JDBC connections through Private Link; an external metastore would have to be used instead, or one implemented with Glue on the Data Plane. See the Metastore Possibilities section in the next article for more information.
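For reference, pointing a cluster at the Glue Data Catalog as its metastore is done through a Spark configuration entry on the cluster, along the lines of the fragment below (check the official documentation for the exact requirements of your runtime version and the IAM permissions involved):

```
spark.databricks.hive.metastore.glueCatalog.enabled true
```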
Single Sign-On (SSO) allows users to authenticate through an Identity Provider (IdP) provided by the organization. SAML 2.0 support is required.
The two possible alternatives when using SSO are as follows:
At the time of writing, the IdPs supported are as follows:
More information can be accessed through the following link.
This section describes the different alternatives that exist to manage user access to the different S3 buckets that exist within the customer’s infrastructure.
This method relies on an instance profile made available to the EC2 instances that make up the cluster. It is characterized by the fact that all users with access to the cluster share the same permissions, namely those granted to the instance profile.
The steps to be completed in order to perform this configuration are as follows:
It is important to note that Databricks checks whether it has the necessary permissions to assume the instance profile when it is registered in Databricks. This check is a dry run in which an attempt is made to launch an instance with this role without actually deploying it. If the cross-account role has tag restrictions (e.g. Vendor: “Databricks”), this check will fail, since the dry run is executed without any tags.
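The role behind such an instance profile needs the standard EC2 trust relationship so that the cluster instances can assume it. As a sketch, the trust policy can be built as a plain Python dictionary:

```python
import json

# Standard AWS trust policy allowing EC2 instances (the cluster nodes)
# to assume the role associated with the instance profile.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

print(json.dumps(trust_policy, indent=2))
```

This JSON document is what would be attached as the role’s trust relationship before registering the instance profile in Databricks.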
This mechanism allows defining permissions at the Databricks user/group level through a meta instance profile associated with the cluster instances. The big difference between this alternative and the one discussed in the previous section is that different permissions can be enabled for different users/groups using the same cluster.
The following image shows this relationship in more detail:
Therefore, the cluster instances assume the meta instance profile that acts as a container for the different data roles that can be assumed. These data roles contain the permissions to access the different buckets. On the other hand, the SCIM API will be used to define which user groups can assume the different data roles embedded in the meta instance profile.
It is important to note that it is not possible to assume several data roles simultaneously in the same notebook. In addition, the instance profile must be registered as a meta instance profile, as opposed to the previous section where it was registered as an instance profile.
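The relationship between the meta instance profile and the data roles can be sketched as an IAM policy attached to the meta role that grants nothing but the right to assume the data roles (the account ID and role names below are hypothetical placeholders):

```python
# Hypothetical ARNs of the data roles holding the S3 bucket permissions.
data_role_arns = [
    "arn:aws:iam::123456789012:role/data-role-marketing",
    "arn:aws:iam::123456789012:role/data-role-finance",
]

# Policy attached to the role behind the meta instance profile: it grants
# no S3 access itself, only the right to assume the data roles. Which
# users/groups may assume each data role is then defined via the SCIM API.
meta_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": data_role_arns,
        }
    ],
}
```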
All the information in this section can be found in the following link.
Credentials passthrough (SSO/SAML)
If the organization has an Identity Provider, Credentials Passthrough can be combined with SSO, using AWS IAM federation to maintain the user-to-IAM-role mapping within the Identity Provider. This can be interesting for organizations that want to keep centralized management of their users and privileges.
The following diagram explains the workflow:
Therefore, this section and the previous one have certain similarities: both need a meta instance profile assumed by the cluster instances, and data roles with the access permissions to the S3 buckets. The difference is that here the SAML response indicates which user groups can assume the different data roles, whereas with Credentials Passthrough (SCIM) it is Databricks itself that defines this.
I started my career with the development, maintenance and operation of multidimensional databases and Data Lakes. From there I became interested in data systems and cloud architectures, earning the AWS Solutions Architect certification and joining the Cloud Native practice within Bluetab.
I am currently working as a Cloud Engineer developing a Data Lake with AWS for a company that organizes international sporting events.
During the last few years I have specialized as a Data Scientist in different sectors (banking, consulting, …), and the decision to switch to Bluetab was motivated by my interest in specializing as a Data Engineer and starting to work with the main cloud providers (AWS, GCP and Azure).
This gave me early access to the Data Engineer practice group and the chance to collaborate on real data projects, such as the current one with the Olympics.