A Comparison Study on Azure Compute Platforms

Mahsa Hanifi
7 min read · Aug 18, 2020

Wondering which compute platform is the best fit for your machine learning application on Microsoft Azure? This study can help you pick the most suitable one for your purpose.

Compute Options on Azure

Azure provides different computing platforms for different use cases. This study covers three of the available options on Azure:

  • Azure Databricks
  • Azure Batch
  • Azure ML Compute

These platforms are designed to help users get things up and running rapidly. Apart from Azure Batch, the other two provide pre-packaged software libraries, components, and services, so users can get their applications running with significantly less effort and time. Users can select the most suitable platform based on the features they need and their budget.

It is worth mentioning that Azure Kubernetes Service (AKS) can also run machine learning workloads on Azure. We are not covering it here, since this study is specific to the compute platforms on Azure.

Let’s take a closer look at each one of them and compare them with each other.

Azure Databricks

This Apache Spark-based platform provides a strong interactive workspace for big data analytics on Azure. A typical Azure Databricks pipeline starts with ingesting data from external sources, staging it in persistent storage (such as Azure Blob storage), preparing and harmonizing it, and finally using it for machine learning training. Since the data volume is large, Spark handles the orchestration, data redundancy (via RDDs), and scheduling (with a DAG). The data is stored on Azure Blob storage to be consumed by Databricks for preparation and training, and the output model can then be stored in any of the Azure databases or analytics services.

Sample of an Azure Databricks pipeline

The complete information on how Azure Databricks works is provided in Azure documentation.

Databricks provides two different types of clusters:

  • Interactive Cluster: compute capacity that is provisioned up front and runs continuously.
  • Automated Cluster: compute capacity that is provisioned on demand when a job is submitted. When the job finishes, the cluster's lifecycle is over and it is torn down.
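As a sketch, an automated cluster is described by a cluster specification attached to the job submission through the Databricks Jobs API; something like the following (the Spark version, node type, and worker counts here are illustrative placeholders, not recommendations):

```json
{
  "new_cluster": {
    "spark_version": "6.4.x-scala2.11",
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {
      "min_workers": 2,
      "max_workers": 8
    }
  }
}
```

Because the cluster exists only for the lifetime of the job, this is usually the cheaper option for scheduled, non-interactive workloads.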

Databricks requires a bare minimum of two nodes to operate: a driver node and a worker node. The number of worker nodes depends on the job(s), so it always requires more than one machine. It is fair to say Databricks is designed for massively parallel computing: the jobs run in parallel on different machines, and Databricks handles assembling the output.

Note: Azure Databricks can package (in the MLflow model format) and deploy models to Azure ML via MLflow.

Azure Batch

Azure Batch is also suitable for high-performance parallel computing. It achieves parallelism by splitting a job into tasks and running them in parallel on a pool of compute nodes.

Azure Batch Platform

The above image is a very high-level representation of how Azure Batch works. In a typical scenario, an application connects to Azure Storage to upload the input data and later read the output. Azure Batch divides the input into separate tasks, runs them in parallel on a pool of compute nodes, and then sends the results back to Azure Storage.
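The split-run-merge pattern that Batch applies across compute nodes can be sketched locally with Python's standard library (a toy analogy, not the Azure Batch SDK; the thread pool stands in for the pool of compute nodes, and the word-count task is made up for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def run_task(chunk):
    """A single 'task': count the words in one chunk of the input."""
    return sum(len(line.split()) for line in chunk)

def run_job(lines, n_tasks=4):
    """Split the input into tasks, run them in parallel, merge the results."""
    size = max(1, len(lines) // n_tasks)
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    # The executor plays the role of the compute-node pool.
    with ThreadPoolExecutor() as pool:
        return sum(pool.map(run_task, chunks))

if __name__ == "__main__":
    data = ["azure batch runs tasks", "in parallel", "on compute nodes"] * 4
    print(run_job(data))  # 36
```

In the real service, each task is an arbitrary executable and the merge step typically writes results back to Azure Storage rather than summing in memory.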

Azure Batch documentation covers all the details on how it works, so we are not going into the details in this section.

Azure ML Compute

Azure Machine Learning Compute is part of the Azure Machine Learning service, which provides a comprehensive set of services for machine learning, such as notebooks, experiments, pipelines, and monitoring tools. The key benefits of using a compute instance are well explained in the Azure documentation for ML Compute, so we are not going over them in this section.

A simplified pipeline story in the Azure ML Compute can be like the following:

  • Get the data from storage, for example a container in an Azure storage account.
  • Prepare the dataset.
  • Process the data.
  • Push the output to a dedicated container for the outputs.
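The steps above can be sketched as a plain-Python pipeline (a toy simulation in which in-memory dictionaries stand in for the storage containers; the container and file names are made up for illustration):

```python
# In-memory stand-ins for the input and output storage containers.
input_container = {"raw.csv": "1,2\n3,4\n5,6"}
output_container = {}

def get_data(container, name):
    """Step 1: get the raw data from the 'storage' container."""
    return container[name]

def prepare(raw):
    """Step 2: prepare the dataset by parsing the raw CSV text into rows."""
    return [[int(v) for v in line.split(",")] for line in raw.splitlines()]

def process(rows):
    """Step 3: process the data; here, sum each row."""
    return [sum(row) for row in rows]

def push_output(container, name, result):
    """Step 4: push the result to the dedicated output container."""
    container[name] = result

raw = get_data(input_container, "raw.csv")
push_output(output_container, "sums.txt", process(prepare(raw)))
print(output_container["sums.txt"])  # [3, 7, 11]
```

In a real Azure ML pipeline, each step would be a pipeline step running on a compute target, with datastores in place of the dictionaries.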

Here you can have compute instances or a compute cluster. ML Compute provisions the compute capacity on demand, loads the job, executes it, and then tears the capacity down.

Note: ML Compute pipeline can have Azure Databricks as a compute target.

Now that we have discussed the three platforms at a very high level, let's get into the comparison from different aspects. Afterward, you should be able to choose the most suitable platform for your project.

Primary Purpose

Azure Databricks

A compute platform for large-scale and streaming data processing where data redundancy is required. It bundles a common set of Databricks libraries for data analysis, machine learning, and data processing (with DataFrames and Spark SQL).

Azure Batch

Running high-performance computing (HPC) applications where engineers bring their own libraries and tools to execute large jobs in parallel.

Azure ML Compute

Azure ML Compute bundles the necessary tools and libraries targeted at machine learning. It is packaged as part of the Azure ML service.

Supporting Languages

Azure Databricks

Multiple languages are supported by Databricks which makes it convenient to work with. Python, Scala, R, SQL, and Java are the supported languages.

Azure Batch

Uses the following Azure SDKs to run and manage Azure Batch workloads: REST, .NET, Python, Node.js, and Java.

The task itself can be written in any language, as long as it is executable and its dependencies are available on the node.

Azure ML Compute

Supported languages are Python and R.

Cost

Azure Databricks

Pricing for Azure Databricks is a little different from the other two platforms: aside from the virtual machines, the user pays an additional cost for Databricks Units (DBUs).

Azure Batch

The Batch account itself is free; you are charged based on the number of VMs in the pool and the VM size. It can be viewed as a low-cost, generic compute option. It also provides low-priority VMs to run batch workloads, which lowers the cost further.

Azure ML Compute

Based on the VM instance; the costs for the different virtual machines are listed on the Azure website. This option also provides low-priority VMs that can lower the cost even more.

CI/CD Support

Azure Databricks

CI/CD is fully supported by Azure Databricks. The settings can be different according to the requirements of the project. Here is a link to an overview of a typical CI/CD in Azure Databricks.

Azure Batch

Azure Batch also supports CI/CD. Here is a link to an article that explains the steps to set it up. However, compared to the other two platforms, this one is relatively harder to set up.

Azure ML Compute

CI/CD is fully supported via Machine Learning Operations (MLOps), which is based on DevOps.

Test Support

Azure Databricks

Testing is fully supported by Azure Databricks, and there are different ways to add a test framework. For instance, Nutter makes it easy to test Azure Databricks notebooks.

Azure Batch

As the task itself can be written in any language, each one can have its own tests (unit/integration) implemented in the application. As such, testing is not simple and requires multiple pieces of wiring.

Azure ML Compute

Deployment targets in Azure ML Compute have their own testing and debugging systems.

Minimum Number of Required Nodes

Azure Databricks

Databricks works with a driver node and one-to-many worker nodes, so the minimum number of nodes required is two. As a result, Databricks can be very expensive for small workloads.

Azure Batch

A minimum of one node is required to get the Azure Batch running.

Azure ML Compute

A minimum of one node is required to get the Azure ML Compute running.

Azure Data Lake Storage (ADLS) Gen2 Integration

Azure Databricks

Able to mount an Azure Data Lake Storage Gen2 account to Databricks File System (DBFS), authenticating using a service principal and OAuth 2.0. The mount is a pointer to data lake storage, so the data is never synced locally.
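The mount is typically set up once from a Databricks notebook (where the `dbutils` utility is available); a sketch of the documented pattern looks like the following, with all angle-bracketed values being placeholders you would fill in from your own service principal and storage account:

```python
# OAuth 2.0 configuration for a service principal (placeholders, not real values).
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": "<service-credential>",
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# Mount the ADLS Gen2 filesystem onto DBFS; this only runs inside Databricks.
dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/data",
    extra_configs=configs,
)
```

After mounting, the data is addressable at /mnt/data from any cluster in the workspace, while still living in the data lake.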

Azure Batch

Azure Batch includes built-in support for accessing Azure Blob storage, and tasks can download the files to compute nodes when the tasks are running. More information about the supported storage accounts by Azure Batch can be found in the Azure documentation.

Azure ML Compute

Supports multiple data storage types including ADLS Gen 2, as a Datastore for ML Compute.

Scheduling Capability

Azure Databricks

Azure Databricks has native job scheduling capabilities, and Azure Data Factory can invoke Databricks transformations.

Azure Batch

It has the same native job scheduling capabilities as Azure Databricks, and Azure Data Factory can invoke Azure Batch.

Azure ML Compute

Same as the other two, it has native job scheduling capabilities.

Parquet File Processing

Azure Databricks

Parquet is an Apache file format that is natively supported by Azure Databricks.

Azure Batch

Not supported directly. However, it can be handled in the application prior to running the jobs on Azure Batch.

Azure ML Compute

It supports tabular datasets and hence Parquet files.

Auto Scaling

Azure Databricks

Supports autoscaling. Depending on how you define the cluster size, it can be fixed or have a minimum and a maximum number of nodes.

Azure Batch

Supports autoscaling; nodes can be assigned dynamically based on a formula you define on the pool.
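As a sketch, Batch autoscaling is driven by a formula attached to the pool, which the service periodically evaluates. The following averages the active-task count over a sampling window and targets one dedicated node per task, capped at ten nodes (the window length and the cap are illustrative choices, not defaults):

```
$tasks = avg($ActiveTasks.GetSample(TimeInterval_Minute * 15));
$TargetDedicatedNodes = min(10, $tasks);
$NodeDeallocationOption = taskcompletion;
```

The `taskcompletion` deallocation option lets running tasks finish before a node is removed when the pool scales in.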

Azure ML Compute

Supports autoscaling. At the time of creating a cluster, you can define the minimum and maximum number of nodes; the defaults are 0 nodes for the minimum and 4 nodes for the maximum.

Deployability to a Dedicated Virtual Network

Azure Databricks

Supports it by having all the dedicated resources, including the virtual network (VNet), locked into a resource group that is used by all the clusters.

Azure Batch

Supports it by provisioning the pool into a subnet located in a dedicated VNet.

Azure ML Compute

Supports it by running jobs in a dedicated VNet.


Mahsa Hanifi

Software Engineer at Microsoft. Lives and works in Redmond, Washington.