Databricks is a leading force in the data industry when it comes to big data processing and machine learning. It has many features, quirks and shortcuts that can substantially boost productivity and developer experience for Data Scientists/Engineers.

Here at nOps, we process billions in cloud spending to optimize for our customers; in the process, we’ve figured out some Databricks productivity hacks to cut runtime and speed up development. This blog will share some simple but powerful Databricks tips, tricks, and pitfalls you may not yet be aware of.

Tip: S3 Path Design Efficiency

Consider the example below, which illustrates how a small change in the S3 path design can result in substantial differences in runtime.
Old (~5 hours runtime):

p0 = f"s3://bucketname-{env}/parquet/boto3/service={chosen_service}/method={chosen_method}/year={chosen_year}/month={chosen_month}/day={chosen_day}/hour=*/client_id=*/project_id=*/*.snappy.parquet" 
df0 = spark.read.parquet(p0)

New (~25 min runtime): 

p0 = f"s3://bucketname-{env}/parquet/boto3/service={chosen_service}/method={chosen_method}/year={chosen_year}/month={chosen_month}/day={chosen_day}/"
df0 = spark.read.parquet(p0)


This dramatic difference in runtime comes down to how Databricks reads Parquet files from S3. Omitting the wildcard filtering allows Spark to do what is known as Dynamic Partition Pruning (DPP): rather than globbing every matching object, Spark only scans the partitions that are actually needed to process the query. This sharply reduces the amount of data listed and read at that step, on the assumption that all the data under a partition has the same shape. My advice to anyone setting up an S3 data lake: be intentional about the order of your partitions. For example, if we had saved our data like this:
"s3://bucketname-prod/parquet/boto3/method=ec2/year=2024/month=2/day=23/hour=1/client_id=900/project_id=800/service=describe_ec2_instances/data0.snappy.parquet"
We would be in trouble, and this job would have needed more complicated tricks to speed it up. The rule of thumb for S3 is to order partitions from the most specific attribute (in our case, the boto3 service/method) down to the most common (year, month, day, etc.). There is some wiggle room for readability's sake, e.g. putting method before service; in practice this won't impact performance much, since you'll usually specify both of those parameters together anyway (whereas you'll often need every hour of a given day).
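To make this concrete, here's a minimal sketch of writing a dataset with PySpark's partitionBy in that order. It assumes a Databricks notebook where spark is already defined and df is an existing DataFrame with these columns; the names simply reuse the placeholders from the paths above rather than our actual pipeline code:

# A minimal sketch, assuming `df` is an existing DataFrame with the columns below
# (names reuse the placeholders from the paths above, not our production schema):
(
    df.write
      .mode("overwrite")
      .partitionBy(
          "service", "method",             # most specific: usually filtered to a single value
          "year", "month", "day", "hour",  # time attributes: usually a narrow range
          "client_id", "project_id"        # often read in full for a chosen day
      )
      .parquet(f"s3://bucketname-{env}/parquet/boto3/")
)

# Reading back only the partition directories we need lets Spark prune everything else:
df0 = spark.read.parquet(
    f"s3://bucketname-{env}/parquet/boto3/service={chosen_service}/method={chosen_method}/"
    f"year={chosen_year}/month={chosen_month}/day={chosen_day}/"
)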

Trick: Increase Spark Shuffle Partitions to Cut SQL Runtime

Add spark.conf.set("spark.sql.shuffle.partitions", 1000) to the top of your notebook to speed up almost any SQL JOIN and GROUP BY.

To understand this trick for cutting SQL runtime, we first need to introduce the concept of distributed computing. Suppose I want to re-order every letter of a book alphabetically. One approach would be to split the book into N partitions and hand them out to M people, then re-shuffle their combined answers to get the final desired result.

The fairest distribution of work is when M = N. Also, to avoid the final re-shuffling step in the example above, the optimal number of splits would be 26 (one for every lowercase letter in the English alphabet). If we wanted to handle every Unicode character without that last reordering step, the number of shuffle partitions would be over 100,000 (one for every character). It should be noted, though, that having a large number of partitions shuffling data at exactly the same time brings its own issues.
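To make the analogy concrete, here's a toy PySpark sketch of the letter-sorting example (illustrative only, not production code), with one shuffle partition per lowercase letter:

from pyspark.sql import functions as F

# Toy illustration of the analogy: one shuffle partition per lowercase letter.
spark.conf.set("spark.sql.shuffle.partitions", 26)

book = spark.createDataFrame([("the quick brown fox jumps over the lazy dog",)], ["text"])

letters = (
    book.select(F.explode(F.split(F.lower("text"), "")).alias("letter"))
        .filter(F.col("letter").rlike("[a-z]"))
)

# The groupBy triggers the shuffle; with 26 partitions the letters spread out to
# roughly one "person" per letter (hash collisions aside).
letters.groupBy("letter").count().orderBy("letter").show()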

For most clean datasets and workloads, the number of shuffle partitions can be kept surprisingly low. This is because, in the worst case, Spark can simply keep re-shuffling the combined intermediate results until it arrives at the final result. This is probably why the default setting in Databricks is only 200. The trade-off is that larger datasets with complex joins will take much longer at that low default.

This is why adding spark.conf.set("spark.sql.shuffle.partitions", 1000) to the top of your notebook speeds up almost any SQL JOIN and GROUP BY, especially on large, complex datasets. Admittedly, 1000 may be overkill, and we're looking into finding the optimal number per notebook. So far the results have been astounding, cutting many jobs' total runtime in half. See the official Spark documentation for details.
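For example, a minimal sketch of how this looks in a notebook (the table and column names below are hypothetical, not our actual pipeline):

# Set once, near the top of the notebook; every shuffle that follows
# (JOINs, GROUP BYs, DISTINCTs, window functions) uses 1000 partitions instead of 200.
spark.conf.set("spark.sql.shuffle.partitions", 1000)

usage = spark.table("usage_events")        # hypothetical table
accounts = spark.table("client_accounts")  # hypothetical table

daily_cost = (
    usage.join(accounts, "account_id")
         .groupBy("account_id", "usage_date")
         .sum("cost")
)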

Sources:

https://sparktpoint.com/spark-shuffle-partitions/#Understanding_Shuffle_in_Spark

https://stackoverflow.com/a/53187273

Databricks War Story: Don't Use TEMP VIEWs for STRING Matching

In the process of building our 5-star rated, #1 ranked cloud optimization platform, we’ve also had a mishap or two — check out this pitfall you definitely want to avoid as you leverage Databricks.

By replacing the TEMP VIEW v_import_client_sharesave_accounts with an identical static TABLE, t_sharesave_accounts, we “fixed” a very important failing job, which in turn pointed us to a deeper underlying problem.

Old (could take hours or in some cases not finish):

and

 split(reservation_reservation_a_r_n, '[:]') [4] in (
        select
          import_sharesave_aws_account_number
        from
          v_import_client_sharesave_accounts
      ) THEN pricing_public_on_demand_cost

This is problematic because every time this query runs, the database has to:

  1. Split the string for each row in the reservation_reservation_a_r_n column.
  2. Recompute the contents of the TEMP VIEW v_import_client_sharesave_accounts.
  3. Perform the membership test to see if the split segment is in the dynamically generated list from the TEMP VIEW.

New (takes 3 min):

and

split(reservation_reservation_a_r_n, '[:]') [4] in (
        select
          import_sharesave_aws_account_number
        from
          t_sharesave_accounts
      ) THEN pricing_public_on_demand_cost

By replacing the TEMP VIEW with a static table, the database will:

  1. Directly access the pre-stored data in t_sharesave_accounts without recomputing it.
  2. Utilize indexing and other optimizations to quickly perform the membership test.
  3. Avoid the overhead associated with dynamically generating the TEMP VIEW’s contents each time the query is executed.
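For reference, here is a minimal sketch of the kind of one-time materialization involved; the view and table names mirror the query above, but the write itself is illustrative rather than our exact production job:

# Illustrative sketch: materialize the TEMP VIEW's contents once as a static table,
# so the big query can reference t_sharesave_accounts instead of recomputing the view.
(
    spark.table("v_import_client_sharesave_accounts")
         .select("import_sharesave_aws_account_number")
         .write.mode("overwrite")
         .saveAsTable("t_sharesave_accounts")
)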

The TL;DR is to avoid using TEMP VIEWs for string matching in SQL queries; switching to static tables can drastically reduce job runtimes and prevent lengthy delays.

About nOps

If you’re looking to save on AWS costs, nOps makes it easy and painless for engineers to take action on cloud cost optimization.

 

The nOps all-in-one cloud platform features include:

  • Business Contexts: Understand and allocate 100% of your AWS bill down to the container level
  • Compute Copilot: Intelligent provisioner that helps you save with Spot discounts to reduce On-Demand costs by up to 90%
  • Commitment management: Automatic life-cycle management of your EC2/RDS/EKS commitments with a risk-free guarantee
  • Storage migration: One-Click EBS volume migration
  • Rightsizing: Rightsize EC2 instances and Auto Scaling Groups
  • Resource Scheduling: Automatically schedule and pause idle resources

Join our customers using nOps to understand your cloud costs and leverage automation with complete confidence by booking a demo today!