Databricks is a leading force in the data industry when it comes to big data processing and machine learning. It has many features, quirks and shortcuts that can substantially boost productivity and developer experience for Data Scientists/Engineers.
Here at nOps, we process billions of dollars in cloud spend to optimize it for our customers; in the process, we've figured out some Databricks productivity hacks that cut runtime and speed up development. This blog shares some simple but powerful Databricks tips, tricks, and pitfalls you may not be aware of yet.
Tip: S3 Path Design Efficiency
Old (~5 hours runtime):

p0 = f"s3://bucketname-{env}/parquet/boto3/service={chosen_service}/method={chosen_method}/year={chosen_year}/month={chosen_month}/day={chosen_day}/hour=*/client_id=*/project_id=*/*.snappy.parquet"
df0 = spark.read.parquet(p0)

New (~25 min runtime):

p0 = f"s3://bucketname-{env}/parquet/boto3/service={chosen_service}/method={chosen_method}/year={chosen_year}/month={chosen_month}/day={chosen_day}/"
df0 = spark.read.parquet(p0)

Stopping the read path at the day= prefix lets Spark list that prefix once and discover the remaining partitions on its own, whereas expanding hour=*/client_id=*/project_id=*/*.snappy.parquet forces a glob of every matching level against S3, which is far more expensive. For reference, an example object key looks like this:

"s3://bucketname-prod/parquet/boto3/method=ec2/year=2024/month=2/day=23/hour=1/client_id=900/project_id=800/service=describe_ec2_instances/data0.snappy.parquet"
Trick: Increase Spark Shuffle Partitions to Cut SQL Runtime
Add spark.conf.set("spark.sql.shuffle.partitions", 1000) to the top of your notebook to speed up almost any SQL JOIN and GROUP BY.
To understand this trick for cutting SQL runtime, we first need to introduce the concept of distributed computing. Suppose I want to sort every letter in a book alphabetically. One approach would be to split the book into N partitions, hand them out to M people, and then re-shuffle their combined answers to get the final desired result.
The fairest distribution of work is when M = N. Also, to avoid that final re-shuffling step, the optimal number of splits would be 26 (one for every lowercase letter in the English alphabet, with each person collecting a single letter). If we wanted to cover every Unicode character without that last reordering step, the number of shuffle partitions would be over 100,000 (one for every Unicode character). It should be noted, though, that shuffling a very large number of partitions at exactly the same time has its own issues.
For most clean datasets and workloads, the number of shuffle partitions can be kept surprisingly low. This is because, in the worst case, Spark can simply keep re-shuffling combined intermediate results until it reaches the final result. This is probably why the default value of spark.sql.shuffle.partitions is only 200. However, the trade-off is that larger datasets with complex joins take much longer at that setting.
This is why adding spark.conf.set("spark.sql.shuffle.partitions", 1000) to the top of your notebook will speed up almost any SQL JOIN and GROUP BY, especially on large, complex datasets. Admittedly, 1000 may be overkill, and we're looking into finding the optimal number per notebook. So far the results have been astounding, cutting many jobs' total runtime in half. See the official Spark documentation for details.
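As a concrete sketch (1000 is simply the value that worked for our workloads, not a universal recommendation, and the DataFrames below are hypothetical), this is all it takes at the top of a notebook:

```python
# Raise the shuffle partition count for everything that runs in this notebook.
# 1000 worked well for our large joins/aggregations; tune it per workload.
spark.conf.set("spark.sql.shuffle.partitions", 1000)

# Every subsequent JOIN / GROUP BY now shuffles across 1000 partitions, e.g.:
daily_costs = (
    usage_df.join(accounts_df, on="account_id")      # hypothetical DataFrames
            .groupBy("account_id", "usage_date")
            .sum("unblended_cost")
)
```

Note that on recent Databricks runtimes, Adaptive Query Execution can coalesce some of those partitions back down at runtime, so the effective number shown in the Spark UI may be lower.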
Sources:
https://sparktpoint.com/spark-shuffle-partitions/#Understanding_Shuffle_in_Spark
Databricks War Story: Don't Use TEMP VIEWs for STRING Matching
In the process of building our 5-star rated, #1 ranked cloud optimization platform, we’ve also had a mishap or two — check out this pitfall you definitely want to avoid as you leverage Databricks.
By replacing the TEMP VIEW v_import_client_sharesave_accounts with a static TABLE containing the same data, t_sharesave_accounts, we "fixed" a very important failing job, which perhaps points to another, deeper problem.
Old (could take hours or in some cases not finish):
and
split(reservation_reservation_a_r_n, '[:]')[4] in (
    select import_sharesave_aws_account_number
    from v_import_client_sharesave_accounts
) THEN pricing_public_on_demand_cost
This is problematic because every time this query runs, the database has to:
- Split the string for each row in the reservation_reservation_a_r_n column.
- Recompute the contents of the TEMP VIEW v_import_client_sharesave_accounts (see the sketch after this list).
- Perform the membership test to see if the split segment was in the dynamically generated list from the TEMP VIEW.
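You can see the recomputation for yourself by inspecting the query plan. A minimal, self-contained sketch (using a throwaway view name, not our real one): the temp view's defining query shows up inlined in the plan, meaning it is re-evaluated on every reference rather than read from stored data.

```python
# Hypothetical demo: a TEMP VIEW is just a saved query, so its definition
# is inlined (and recomputed) every time the view is referenced.
(
    spark.range(5)
         .selectExpr("id AS import_sharesave_aws_account_number")
         .createOrReplaceTempView("v_demo_accounts")
)

# The plan printed below contains the view's underlying Range/Project query,
# not a scan of stored table data.
spark.sql("SELECT * FROM v_demo_accounts").explain()
```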
New (takes 3 min):
and
split(reservation_reservation_a_r_n, '[:]')[4] in (
    select import_sharesave_aws_account_number
    from t_sharesave_accounts
) THEN pricing_public_on_demand_cost
By replacing the TEMP VIEW with a static table, the database will:
- Directly access the pre-stored data in t_sharesave_accounts without recomputing it.
- Utilize indexing and other optimizations to quickly perform the membership test.
- Avoid the overhead associated with dynamically generating the TEMP VIEW’s contents each time the query is executed.
The TL;DR is to avoid using TEMP VIEWs for string matching in SQL queries; switching to static tables can drastically reduce job runtimes and prevent lengthy delays.
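If the underlying lookup data changes infrequently, the switch itself is small. A minimal sketch (table and view names taken from the example above; how and how often you refresh the table depends on your pipeline, and we're assuming a Delta table here):

```python
# Materialize the TEMP VIEW's contents into a static table once
# (e.g. at the start of the job or on a schedule), then point the
# string-matching query at t_sharesave_accounts instead of the view.
spark.sql("""
    CREATE OR REPLACE TABLE t_sharesave_accounts AS
    SELECT import_sharesave_aws_account_number
    FROM v_import_client_sharesave_accounts
""")
```

After that, the IN (...) subquery reads pre-stored data instead of re-running the view's query on every execution.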
About nOps
If you’re looking to save on AWS costs, nOps makes it easy and painless for engineers to take action on cloud cost optimization.
The nOps all-in-one cloud platform features include:
- Business Contexts: Understand and allocate 100% of your AWS bill down to the container level
- Compute Copilot: Intelligent provisioner that leverages Spot discounts to reduce your On-Demand costs by up to 90%
- Commitment management: Automatic life-cycle management of your EC2/RDS/EKS commitments with a risk-free guarantee
- Storage migration: One-Click EBS volume migration
- Rightsizing: Rightsize EC2 instances and Auto Scaling Groups
- Resource Scheduling: Automatically schedule and pause idle resources
Join our customers using nOps to understand your cloud costs and leverage automation with complete confidence by booking a demo today!