Databricks on AWS - An Architectural Perspective (part 2)

Alberto Jaen

AWS Cloud Engineer

Alfonso Jerez

AWS Cloud Engineer

This article is the second in a two-part series that addresses the integration of Databricks in AWS environments, analyzing the alternatives the product offers with respect to architectural design. The first article covered topics related to architecture and networking; this second installment covers security and general administration.

The contents of each article are as follows:

First article:

  • Workload plans and types
  • High level architecture
  • Networking
  • Identity and Access Management

This article:

  • Security
  • Scalability
  • Persistence
  • Billing
  • Deployment

The first article can be found at the following link [1].

Glossary

  • Control Plane: hosts the Databricks back-end services that expose the graphical interface and the REST APIs for account and workspace management. These services are deployed in an AWS account owned by Databricks. See the first article for more information.
  • Credentials Passthrough: mechanism used by Databricks to manage access to different data sources. See the first article for more information.
  • Cross-account role: role in the customer’s AWS account that Databricks is allowed to assume. It is used to deploy infrastructure and to assume other roles within AWS. See the first article for more information.
  • Data Plane: hosts all the necessary infrastructure for data processing: persistence, clusters, logging services, spark libraries, etc. The Data Plane is deployed on the customer’s AWS account. See first article for more information.
  • Data role: roles with read/write permissions on S3 buckets that will be assumed by the cluster through the meta instance profile. See the first article for more information.
  • DBFS: distributed storage system available for clusters. It is an abstraction on an object storage system, in this case S3, and allows access to files and folders without using URLs. See the first article for more information.
  • EC2 Fleet: configuration required to be able to define the detailed characteristics of a group of instances in AWS.
  • Elastic Block Store (EBS): AWS hard disk storage service.
  • IAM Policies: policies through which access permissions are defined in AWS.
  • Key Management Service (KMS): AWS service that allows you to create and manage encryption keys.
  • Pipelines: series of processes through which a set of data is run.
  • Prepared: data processed from raw that will be used as a basis for creating the Trusted data.
  • Init Script (User Data Script): script that the EC2 instances launched from Databricks clusters can run at start-up to install software updates, download libraries/modules, and so on.
  • Identity Federation: entity that maintains the identity information of individuals within an organization. Its main use is for authorization and authentication.
  • Meta instance profile: role that is provided to the cluster with permissions to assume the data roles. See first article for more information.
  • Mount: to avoid having to load the required data internally, Databricks allows external storage sources, such as S3, to be mounted so that their files and folders appear to be local (simplifying relative paths) while remaining stored in the corresponding external source.
  • Personal Access Token (PAT): token for personal authentication that replaces user/password authentication.
  • Raw: unprocessed ingested data.
  • Root Bucket: root directory for the workspace (DBFS root). Used to host cluster logs, notebook revisions and libraries. See first article for more information.
  • Secret Scope: environment in which to store sensitive information by means of key-value pairs (name – secret).
  • Trusted: data prepared for visualization and study by the different stakeholders.
  • Workflows: sequence of tasks.

Security

This section will analyze the security of the information processed in Databricks from two perspectives: while the information is in transit and while the data is at rest.

Encryption in transit

Encryption on the client (cluster – S3)

This section deals with the encryption of the data sent from the cluster to the buckets, so that policies can be defined on the buckets to reject any incoming unencrypted communication. The configuration required for the root bucket differs from that of the buckets where the data is stored (raw, prepared and trusted), since the root bucket is part of DBFS.

Root bucket

In order to encrypt the communication from the client, we must include a file called core-site.xml on each node of the cluster through an init script, either global or cluster-specific. This configuration is applied at the DBFS level, since the root bucket is part of it.

In this file we must add the algorithm to use and the key in KMS, which can be a customer-managed key. An example of the content of this file could be the following:

<property>
  <name>fs.s3a.server-side-encryption-algorithm</name>
  <value>SSE-KMS</value>
</property>
<property>
  <name>fs.s3a.server-side-encryption.key</name>
  <value>${kms_key_arn}</value>
</property>

Definition of the algorithm and KMS key to be used in the root bucket encryption.
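As an illustration of how this file could be distributed, the following sketch registers a global init script through the Databricks Global Init Scripts API that writes the configuration on every node at start-up. The workspace URL, token, KMS key ARN and, in particular, the destination path of core-site.xml are placeholders and assumptions for illustration, not values prescribed by Databricks.

# Hedged sketch: register a global init script that writes core-site.xml on each
# node at start-up. All identifiers and the destination path are placeholders.
import base64
import requests

DATABRICKS_HOST = "https://<workspace-url>"          # placeholder
TOKEN = "<personal-access-token>"                    # placeholder
KMS_KEY_ARN = "arn:aws:kms:eu-west-1:111122223333:key/example"  # placeholder

init_script = f"""#!/bin/bash
# Assumed destination path, shown only for illustration purposes.
cat > /databricks/hadoop/conf/core-site.xml <<EOF
<configuration>
  <property>
    <name>fs.s3a.server-side-encryption-algorithm</name>
    <value>SSE-KMS</value>
  </property>
  <property>
    <name>fs.s3a.server-side-encryption.key</name>
    <value>{KMS_KEY_ARN}</value>
  </property>
</configuration>
EOF
"""

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/global-init-scripts",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "name": "core-site-sse-kms",
        "script": base64.b64encode(init_script.encode()).decode(),
        "enabled": True,
    },
)
resp.raise_for_status()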

Data lake buckets

In the case of the Data Lake buckets, we must add some Spark configuration to the cluster to set up the encryption.

An example of configuration could be the following, where at the bottom we include the same information as in the previous section, i.e. the encryption algorithm and key.

spark_conf = {
    "spark.databricks.repl.allowedLanguages" : "python,sql",
    "spark.databricks.cluster.profile" : "serverless",
    "spark.databricks.passthrough.enabled" : true,
    "spark.databricks.pyspark.enableProcessIsolation" : true,
    "spark.databricks.hive.metastore.glueCatalog.enabled" : true,
    "spark.hadoop.aws.region" : var.aws_region,
    #caches for glue to run faster
    "spark.hadoop.aws.glue.cache.db.enable" : true,
    "spark.hadoop.aws.glue.cache.db.size" : 1000,
    "spark.hadoop.aws.glue.cache.db.ttl-mins" : 30,
    "spark.hadoop.aws.glue.cache.table.enable" : true,
    "spark.hadoop.aws.glue.cache.table.size" : 1000,
    "spark.hadoop.aws.glue.cache.table.ttl-mins" : 30,
    #encryption
    "spark.hadoop.fs.s3a.server-side-encryption.key" : var.datalake_key_arn
    "spark.hadoop.fs.s3a.server-side-encryption-algorithm" : "SSE-KMS"
  } 

Definition of the algorithm and key to be used in the encryption of the Data Lake buckets.

This configuration can also be included through the Databricks UI in the Spark config section when creating the cluster.

Between cluster nodes

An additional layer of security can be added by encrypting the traffic exchanged between worker nodes whenever tasks require information to be shared between them.

The simplest way to enable this is to use the AWS EC2 instance types that encrypt traffic between instances by default. These are the following:

Breakdown of instance types available with preconfigured internal encryption

Encryption at rest

This section describes how to encrypt the data when it is at rest. The methods for encrypting this data are different for the Data Plane and the Control Plane, since they are managed by different entities (the client and Databricks, respectively).

S3 and EBS encryption (Data Plane)

For the encryption of our persistence layer within the Data Plane, it is necessary to create a key in KMS and register it in Databricks through the Account API. It is possible to choose whether to encrypt only the root bucket, or the root bucket and the EBS volumes. This differentiation is achieved through the policy attached to the key itself, which grants the permissions needed to encrypt the buckets and, optionally, also the EBS volumes.

The features of this functionality are as follows:

  • Only write operations after the addition of the encryption key will be affected, i.e. existing data that has not been updated will remain unencrypted.
  • Control Plane encryption is not affected.
  • This functionality is compatible with the automatic rotation of keys in KMS.
  • It is important to note that this key can be registered after the creation of the workspace in question.

All the information about this type of encryption is in the following link [2].
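As an orientative sketch (to be contrasted with the documentation linked above), the key could be registered through the Account API roughly as follows; the account ID, credentials, key ARN and alias are placeholders, and the payload fields are assumptions about the customer-managed keys endpoint.

# Hedged sketch: register a customer-managed KMS key for workspace storage
# (root bucket and, optionally, EBS volumes) through the Account API.
# All identifiers are placeholders; field names should be checked against [2].
import requests

ACCOUNT_ID = "<databricks-account-id>"
ACCOUNT_API = "https://accounts.cloud.databricks.com"
AUTH = ("<account-admin-user>", "<password>")

payload = {
    "use_cases": ["STORAGE"],
    "aws_key_info": {
        "key_arn": "arn:aws:kms:eu-west-1:111122223333:key/example",
        "key_alias": "alias/databricks-storage-key",
        "reuse_key_for_cluster_volumes": True,   # also encrypt cluster EBS volumes
    },
}

resp = requests.post(
    f"{ACCOUNT_API}/api/2.0/accounts/{ACCOUNT_ID}/customer-managed-keys",
    auth=AUTH,
    json=payload,
)
resp.raise_for_status()
print(resp.json().get("customer_managed_key_id"))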

Control Plane encryption

This section describes how to encrypt all the elements hosted in the Control Plane (notebooks, secrets, SQL queries and query history).

By default the Control Plane hosts a Databricks Managed Key (DMK) for each workspace, an encryption key managed by Databricks. A user-managed Customer Managed Key (CMK) must then be created and registered through the Databricks Account API. Databricks then uses both keys to encrypt and decrypt the Data Encryption Key (DEK), which is stored as an encrypted secret.

The DEK is cached in memory for read and write operations and is evicted from it at certain intervals, so that subsequent operations require a new call to the client’s infrastructure to obtain the CMK again. Therefore, if permissions to this key are lost, access to the elements hosted in the Control Plane will also be lost once the refresh interval expires.

 This can be seen in more detail in the following image:

DMK and CMK encryption in the Control Plane (source: Databricks)

It is important to highlight that, if you want to add an encryption key to an already created workspace, an additional PATCH request must be made through the Account API.

All the information about this type of encryption is in the following link [3].
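That PATCH could look roughly like the sketch below; the workspace ID, key ID and credentials are placeholders, and the field name is an assumption to be contrasted with the documentation linked above.

# Hedged sketch: attach a managed-services (Control Plane) key to an existing
# workspace with a PATCH call to the Account API. Identifiers are placeholders.
import requests

ACCOUNT_ID = "<databricks-account-id>"
WORKSPACE_ID = "<workspace-id>"
KEY_ID = "<customer-managed-key-id>"   # returned when the CMK was registered
ACCOUNT_API = "https://accounts.cloud.databricks.com"
AUTH = ("<account-admin-user>", "<password>")

resp = requests.patch(
    f"{ACCOUNT_API}/api/2.0/accounts/{ACCOUNT_ID}/workspaces/{WORKSPACE_ID}",
    auth=AUTH,
    json={"managed_services_customer_managed_key_id": KEY_ID},
)
resp.raise_for_status()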

Secret Management

The development of pipelines or workflows generally involves the need to pass critical or confidential information between the different phases of the process. As in other cloud environments, this information is usually stored in a Secret Scope so as not to compromise its integrity. An example of the information handled in these Secret Scopes could be the access credentials to a database.

Databricks enables end-to-end management of both Secret Scopes and secrets via the Databricks API. It should be noted that full Secret Scope management permissions (the Manage permission) are granted by default unless otherwise declared at creation time (the permissions assignable to users are Read, Write and Manage).

For Premium or Enterprise subscriptions, Databricks offers the possibility to manage user permissions in a more granular way by means of the ACL (Access Control List), thus being able to assign different levels of permissions depending on the group of users. 

By default, these secrets are stored in an encrypted database managed by Databricks, and access to it can be managed through Secret ACLs. Alternatively, this type of information can be managed directly from AWS KMS, dumping the key-value pairs there and managing access permissions via IAM.
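As a small sketch of this management through the Secrets API (the scope name, secret key and the group used in the ACL are illustrative):

# Hedged sketch: create a Secret Scope, store a secret and restrict access with a
# Secret ACL through the Secrets API. Names and values are illustrative.
import requests

HOST = "https://<workspace-url>"
HEADERS = {"Authorization": "Bearer <personal-access-token>"}

# 1. Create a Databricks-backed Secret Scope.
requests.post(f"{HOST}/api/2.0/secrets/scopes/create", headers=HEADERS,
              json={"scope": "datalake",
                    "initial_manage_principal": "users"}).raise_for_status()

# 2. Store a secret (e.g. database credentials) in the scope.
requests.post(f"{HOST}/api/2.0/secrets/put", headers=HEADERS,
              json={"scope": "datalake", "key": "db-password",
                    "string_value": "<secret-value>"}).raise_for_status()

# 3. Premium/Enterprise: grant READ (instead of the default MANAGE) to a group.
requests.post(f"{HOST}/api/2.0/secrets/acls/put", headers=HEADERS,
              json={"scope": "datalake", "principal": "data-engineers",
                    "permission": "READ"}).raise_for_status()

From a notebook, the secret can then be read with dbutils.secrets.get("datalake", "db-password").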

Service Principal

For all those administration tasks that do not need to be performed by a particular member of the organization, we can use a Service Principal.

A Service Principal is an entity created for use with automated tools, jobs and applications, i.e. a non-nominal administration account. Its access can be restricted through permissions. A Service Principal cannot access Databricks through the UI; its actions are only possible through the APIs.

The recommended way to work with this entity is for an administrator to create a token on behalf of the Service Principal so that the Service Principal can authenticate with the token and perform the necessary actions.
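A hedged sketch of this flow through the Token Management API could look as follows (the application ID, lifetime and exact endpoint usage are assumptions to be contrasted with the documentation):

# Hedged sketch: an admin creates a token on behalf of a Service Principal through
# the Token Management API; the Service Principal then uses it to call other APIs.
import requests

HOST = "https://<workspace-url>"
ADMIN_HEADERS = {"Authorization": "Bearer <admin-personal-access-token>"}

resp = requests.post(
    f"{HOST}/api/2.0/token-management/on-behalf-of/tokens",
    headers=ADMIN_HEADERS,
    json={
        "application_id": "<service-principal-application-id>",
        "lifetime_seconds": 3600,
        "comment": "token for automated deployments",
    },
)
resp.raise_for_status()
sp_token = resp.json()["token_value"]   # used afterwards to authenticate API calls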

The following limitations have been encountered when working with a Service Principal:

  • A PAT token must be created before initializing the token module; otherwise, an error occurs when trying to create the token for the Service Principal.
  • A cluster cannot be created by a Service Principal; it must be created by an administrator.

Scalability

Scaling alternatives

Databricks, like other services that offer the possibility of deploying a serverless cluster, provides auto-scaling. This functionality is completely managed by Databricks, which decides under which conditions to adjust the number of nodes in the cluster (scaling out) or the type of nodes (scaling up).

Databricks stands out from the competition in this area because it offers an optimized auto-scaling that achieves, on average, savings around 30% higher than the rest of the solutions (although it also offers the standard version). The main difference between the optimized version and the standard one is the incorporation of scale-down.

Other solutions do not include scale-down because removing nodes may compromise the process: even if a node is not running any task, it may hold cached data or shuffle files that future processes on other nodes will need. This is a sensitive action, since removing a node that contains data used by other processes would force reprocessing, generating higher costs than the savings obtained from scaling down.

Databricks, in its optimized auto-scaling, implements a validation to identify the least required nodes when the cluster is underutilized, taking into account the following:

  • No tasks are running
  • No shuffle files or cached data needed by future processes are present, even if those processes will run on other nodes.

This allows for a more aggressive scale-down, since the nodes that can be removed without jeopardizing the integrity of the process are identified at all times. Databricks still takes preventive measures in this type of scale-down and applies extra waiting times to validate more safely that the action is correct (Job Cluster: 30 seconds; All-Purpose Cluster: 2 minutes 30 seconds).

When auto-scaling, whether scaling up or down, both the overall degree of CPU saturation/underutilization and that of the individual nodes are taken into account, and the following scenarios can be identified:

  • Up-scaling: if CPU usage exceeds 80%, all nodes are in use and this situation is maintained for at least 30 seconds, the cluster scales up until the situation ceases to occur.
  • Down-scaling: in addition to the aforementioned checks, 90% of the nodes must be idle for at least 10 minutes; if the situation has not changed after that, the cluster scales down.

Databricks’ optimized auto-scaling is especially recommended when workloads have pronounced peaks: scaling up quickly covers the demand, but if the peak is momentary and demand does not hold, without optimized scale-down those extra nodes would remain running even though they no longer process any data:

Comparative illustration of autoscaling implementation
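Although the scaling decisions themselves are managed by Databricks, the user only declares the scaling range when creating the cluster. A minimal sketch through the Clusters API (runtime version, node type and worker limits are illustrative values):

# Hedged sketch: create a cluster with auto-scaling enabled through the Clusters
# API; Databricks then applies the scaling logic described above within this range.
import requests

HOST = "https://<workspace-url>"
HEADERS = {"Authorization": "Bearer <personal-access-token>"}

cluster_spec = {
    "cluster_name": "autoscaling-demo",
    "spark_version": "11.3.x-scala2.12",              # illustrative runtime
    "node_type_id": "i3.xlarge",                      # illustrative node type
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "aws_attributes": {"first_on_demand": 1,
                       "availability": "SPOT_WITH_FALLBACK"},
}

resp = requests.post(f"{HOST}/api/2.0/clusters/create",
                     headers=HEADERS, json=cluster_spec)
resp.raise_for_status()
print(resp.json()["cluster_id"])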

Workers usage packaging

In the search for overall cluster efficiency, Databricks groups nodes by usage so that task assignment is as optimal as possible and other actions, such as auto-scaling, can be carried out more efficiently without compromising the integrity of the processes.

This grouping of nodes according to usage is done for two main reasons: prioritization when assigning tasks to nodes (the priority order is indicated in the following image), and correct identification of saturation or generalized underutilization of the cluster nodes. The grouping is as follows:

  • Low Usage: no containers and CPU usage less than 30%.
  • Medium Usage: CPU usage greater than 30%.
  • High Usage: CPU usage greater than 60%.
Cluster node groupings in relation to their degree of CPU saturation.

As mentioned above, there are two major advantages to be gained from this type of grouping:

  • Identification of underutilized nodes: optimized selection of the nodes to be removed when a scale-down is required
Description of the optimized scaling-down process (source: databricks)
  • Prioritization in the assignment of tasks: as shown in the image, nodes are prioritized when tasks are assigned, in the order “Medium Usage” > “Low Usage” > “High Usage”. This way tasks go first to nodes that are already in use, keeping “free” nodes available in case of a scale-down, while avoiding oversaturating the nodes that are most loaded:
Description of node relocation and tasking efficiency.

As an alternative to this grouping, Databricks offers external shuffle storage (External Shuffle Service, ESS), so that the degree of saturation of a node is the only factor to take into account when deciding which node to remove during scale-down.

Cluster instance family

To further define the characteristics of the cluster and adapt it to the different use cases that may arise, there is an alternative called EC2 Fleet.

It is based on requesting instances with a wide range of options in terms of instance types (unlimited) and the Availability Zones where they should be acquired. It also allows defining whether the instances are On-Demand or Spot, specifying in both cases the maximum purchase price:

Breakdown of capacity of instances (Spot and On-Demand) and advanced typology with EC2 Fleet

The use of EC2 Fleet does not imply an extra cost; you only pay for the instances themselves.

There are the following limitations related to this service:

  • Service available only via API or AWS CLI
  • The application is limited to one region

It should be noted that this solution is in Private-Preview at the time of writing.
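Since the service is only exposed through the API or the AWS CLI, a hedged boto3 sketch of an instant fleet request could look as follows (the launch template, instance types and capacities are placeholders):

# Hedged sketch: request a mixed Spot/On-Demand fleet with the EC2 CreateFleet API
# via boto3. Launch template, instance types and capacities are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

response = ec2.create_fleet(
    Type="instant",   # one-off, synchronous request
    LaunchTemplateConfigs=[{
        "LaunchTemplateSpecification": {
            "LaunchTemplateId": "lt-0123456789abcdef0",
            "Version": "$Latest",
        },
        # Several candidate instance types/AZs; EC2 chooses according to strategy.
        "Overrides": [
            {"InstanceType": "i3.xlarge", "AvailabilityZone": "eu-west-1a"},
            {"InstanceType": "i3.2xlarge", "AvailabilityZone": "eu-west-1b"},
        ],
    }],
    TargetCapacitySpecification={
        "TotalTargetCapacity": 10,
        "OnDemandTargetCapacity": 2,
        "SpotTargetCapacity": 8,
        "DefaultTargetCapacityType": "spot",
    },
)
print([i["InstanceIds"] for i in response.get("Instances", [])])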

Persistence

DBFS

In order to enable the deployment of Spark clusters, Databricks relies on DBFS as a distributed storage system. It in turn relies on third-party object storage, such as S3, both to launch the cluster and to perform the required tasks.

DBFS allows storage sources such as S3 to be linked through mounts, which makes the files appear to be stored locally even though they physically remain in S3. This simplifies both authentication when consulting the files and the paths used to reference them, thus avoiding having to load the files into DBFS each time they are consulted.
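For example, from a notebook an S3 bucket could be mounted roughly as follows (bucket, mount point and paths are illustrative):

# Hedged sketch (Databricks notebook): mount an S3 bucket so its objects can be
# referenced with local-style paths. Bucket and mount point names are illustrative.
dbutils.fs.mount(
    source="s3a://my-raw-bucket",
    mount_point="/mnt/raw",
    extra_configs={"fs.s3a.server-side-encryption-algorithm": "SSE-KMS"},
)

# The files can now be addressed with simple relative paths:
display(dbutils.fs.ls("/mnt/raw"))
df = spark.read.json("/mnt/raw/events/2022/01/")   # illustrative path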

In addition, the information stored by the cluster when it is deployed, such as init scripts, generated tables, metadata, etc., can be kept in storage sources such as S3 so that it remains available even when the cluster is not operational.

Metastore possibilities

When choosing how and where to deploy the metastore, Databricks proposes the following alternatives:

  • Hive metastore (default): use the default Hive metastore deployed in the Control Plane. This metastore is fully managed by Databricks.
  • External Hive metastore: possibility to use an existing external Hive metastore.
  • Glue metastore in the Data Plane: possibility to host the metastore in the Data Plane itself with Glue.

It is important to note that a Customer Managed VPC is required for the external Hive and Glue options. See first article for more information.

The alternatives of an external metastore, either Hive or Glue, can be interesting when access to an existing metastore is needed, centralizing these catalogs, or when more autonomy and control is sought. In addition, the implementation with Glue can facilitate access to this metadata from other AWS services, such as Athena, and it also offers lower access latency, since the communication can take place over the AWS private network.
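With the Glue option enabled through the Spark configuration shown in the Security section, the tables catalogued in Glue can be queried as regular metastore tables; the database and table names below are illustrative:

# Hedged sketch (Databricks notebook): once the Glue catalog is enabled as the
# metastore, its databases and tables are visible to Spark SQL directly.
spark.sql("SHOW DATABASES").show()

# Illustrative database/table registered in Glue:
spark.sql("SELECT * FROM raw_db.events LIMIT 10").show()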

Billing

This section briefly describes how to access detailed information about the costs that Databricks will charge the customer.

In order to access this information, it is necessary to configure the generation and delivery of these logs to the infrastructure deployed in AWS. All this information can be found at the following link [4].

The logs are generated daily (the workspace must have been operational for at least 24 hours) and record the compute hours and the rate applied to them according to the cluster characteristics and the package (SKU) assigned by Databricks.

The fields included in these logs are the following:

  • workspaceId: workspace ID
  • timestamp: established frequency (hourly, daily, …)
  • clusterId: cluster ID
  • clusterName: name assigned to the cluster
  • clusterNodeType: type of node assigned
  • clusterOwnerUserId: ID of the user who created the cluster
  • clusterCustomTags: customizable cluster information tags
  • sku: package assigned by Databricks in relation to the cluster characteristics
  • dbus: DBU consumption per machine hour
  • machineHours: machine hours of the cluster deployment
  • clusterOwnerUserName: username of the cluster creator
  • tags: customizable cluster information tags
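Once the log delivery is configured, the CSV files can be explored directly from a notebook. A minimal sketch (the delivery bucket and prefix are assumptions) that aggregates DBU consumption per workspace and SKU, using the field names listed above:

# Hedged sketch (Databricks notebook): read the billable usage CSVs delivered to
# S3 and aggregate DBU consumption. The bucket/prefix are assumptions; the column
# names follow the field list above.
from pyspark.sql import functions as F

usage = (spark.read
         .option("header", "true")
         .csv("s3a://my-billing-bucket/billable-usage/csv/"))

(usage
 .withColumn("dbus", F.col("dbus").cast("double"))
 .groupBy("workspaceId", "sku")
 .agg(F.sum("dbus").alias("total_dbus"))
 .orderBy(F.desc("total_dbus"))
 .show())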

Deployment

Deployment alternatives

This section will describe the different alternatives and tools that can be used to deploy a functional and secure Databricks environment on AWS.

Databricks-AWS Quickstart

This alternative is characterized by the execution of a CloudFormation template created by Databricks, which gives the option to create a Databricks Managed VPC or a Customer Managed VPC.

For both modalities, Databricks creates a cross-account role with the necessary permissions (which differ for each alternative, those of the Customer Managed VPC option being more limited) and a Lambda function that, through calls to the Databricks APIs, provisions the necessary infrastructure, configures it and registers it in the Control Plane. All these calls make use of the cross-account role to authenticate against the customer’s AWS account.


Infrastructure as Code (IaC)

CloudFormation

CloudFormation is the infrastructure-as-code service from AWS. It would be possible to reuse the code created by Databricks in the Quickstart, but this alternative is very intensive in API calls since there are no CloudFormation modules specific to Databricks.

Terraform

Terraform is an infrastructure-as-code tool that is agnostic to different cloud providers, i.e. it can be used with any of them. 

Terraform has modules available for Databricks that simplify deployment and reduce the need for Databricks API calls but do not completely eliminate them. For example, API calls will have to be made in order to use Credentials Passthrough without an Identity Provider (SSO).

In this initiative, a public GitHub repository [5] has been created in order to deploy an environment using this tool.

The infrastructure to be deployed has been divided into different modules for simplicity, the main ones being the following:

  • VPC: deployment of all networks, endpoints, NAT Gateways, etc… required.
  • S3: creation of the root bucket (bucket where Databricks stores libraries and logs) and the buckets where we are going to store the data to be processed.
  • Databricks IAM roles: creation of the cross-account role, meta instance profile, data roles, role for Glue and role for the cluster to write logs to the bucket.
  • Databricks cluster: module for the creation and deployment of a cluster of EC2 instances for use in Databricks.
  • Databricks provisioning: registration of the entire infrastructure in Databricks (credentials, networks and storage) and creation of workspace within Databricks.
  • Databricks management: creation of user groups, clusters and instance profiles (meta role that is assigned to the cluster that has the capacity to assume the data roles).

Therefore, the high-level architecture diagram is as follows:

Full deployment can be accomplished in less than 5 minutes.

Launch of Clusters

When estimating the average time to launch the instances and have them available, it was considered appropriate to split them into subgroups, given the similarity of the times observed within each of them:

  • Pool: in the case of the Instance Pool, the advantage is not only having instances already started and available for use (although this has a cost impact), but also that this is the option that provides them in the shortest time (there is no need to wait for the minimum number of instances before using them):

    In addition, it should be noted that the times indicated below apply to the first instance launched; for the rest, the time is considerably shorter.
  • Job and All-Purpose Cluster: in the case of the All-Purpose cluster, the High-Concurrency mode has been analyzed, since collaborative work is required. In this case, the times are somewhat longer than in the previous one, taking between 3 and 4 minutes for a standard cluster:

Minimum AWS infrastructure requirements:

The following table specifies the minimum requirements regarding networking, compute and storage that, from our point of view, a client deployment should have. It is important to emphasize that this differs from the GitHub repository presented in the Deployment section.

The differences are as follows:

  • Control Plane – Data Plane communication: the repository implements public communication through a NAT Gateway, while in this table we propose using a Private Link so that the connection goes through the AWS private network. Visit the first article for more information.
  • User – front-end communication (Control Plane): the repository implements public communication over the Internet, while in this table we propose using a VPN together with Private Links so that the communication goes through the AWS private network. See the first article for more information.
Networking

  • Subnets: 3 private subnets across 3 Availability Zones. External needs: N/A.
  • VPC Endpoints: S3 Gateway Endpoint, STS Interface Endpoint and Kinesis Interface Endpoint. External needs: N/A.
  • Private Links: VPC Endpoint for the cluster (Data Plane) – Control Plane connection, VPC Endpoint for the connection with the REST APIs (Control Plane), and VPC Endpoint for the user – front-end connection (Control Plane). External needs: contact with an authorized Databricks support contact for each of them.
  • VPN: site-to-site VPN to connect the customer’s network with the transit VPC. External needs: communication with the customer.

Compute

  • Minimal instances: zero, with one auto-scaling group. External needs: N/A.

Storage

  • Bucket: S3. External needs: N/A.
  • Instances: EBS volumes. External needs: N/A.

Minimum requirements for Databricks deployment on AWS

References

[1] Databricks on AWS – Architectural Perspective (part 1) [link] (January 13, 2022)

[2] EBS and S3 Encryption on Data Plane [link]

[3] Control Plane Encryption [link]

[4] Billing Logs Access [link]

[5] Databricks Deployment on AWS – GitHub Repository Guide [link]

AWS Cloud Engineer

I started my career with the development, maintenance and operation of multidimensional databases and Data Lakes. From there I started to get interested in data systems and cloud architectures, getting certified in AWS as a Solutions Architect and entering the Cloud Native practice within Bluetab.

I am currently working as a Cloud Engineer developing a Data Lake with AWS for a company that organizes international sporting events.

AWS Cloud Engineer

During the last few years I have specialized as a Data Scientist in different sectors (banking, consulting, …), and the decision to switch to Bluetab was motivated by my interest in specializing as a Data Engineer and starting to work with the main cloud providers (AWS, GCP and Azure).

I thus joined the Data Engineer practice group and have been collaborating on real data projects, such as the current one with the Olympics.