Apache Spark echo system is about to explode — Again! — this time with Sparks newest major version 3.0.
This article lists the new features and improvements to be introduced with Apache Spark 3.0 — which its preview is already out — very exciting!
Just for the stimulate — Alibaba Group competed with Spark 3.0 on the TPCDS benchmark and achieved the top spot!
Spark 3.0 is said to perform 17x faster compared with current versions on the TPCDS benchmark which is pretty impressive.
If you are a Spark user and you are familiar with Spark skip to the next section for the Spark 3.0 review
For those who are not yet familiar with Spark — it is a lightning fast, fault tolerant and in-memory general distributed data processing engine (kind of a MPP) that supports both batch and streaming processing and analytics.
Spark became very popular in late years and many organisations from small to enterprises adopted it including major cloud vendors where you can find a managed service that includes Spark.
Spark features APIs for distributed processing of structured and unstructured data like the RDD (Resilient Distributed Dataset), DataSet, DataFrame and more. it features SQL Capabilities including hooking up into any Hive metastore supporting catalogs (like AWS Glue) enabling great capabilities of explorations and analytics. On top of that it features Machine Learning capabilities (The Notorious Spark MLlib). Spark was meant to be a processing hub where it can connect many data sources — from relational and NoSQL databases, data lakes and warehouses and more — computing aggregations, data preprocessing and much more.
Spark 2.x introduced many improvements (Like Project Tungsten and the catalyst optimiser) and have made Spark shine out as a great tool and solution for ETL pipelines, Analytics in a Data Lake, engine for distributed machine learning training and serving, Streaming & Structured Streaming (If mini-batches fits you) and more.
Version 3.0 of spark is a major release and introduces major and important features:
Spark 3.0 will move to Python3 and Scala version is upgraded to version 2.12. In addition it will fully support JDK 11. Python 2.x is heavily deprecated .
Adaptive execution of Spark SQL
This feature helps in where statistics on the data source do not exists or are in accurate. So far Spark had some optimizations which could be set only in the planning phase and according to the statistics of data (e.g. the ones captured by the ANALYZE command when deciding weather to perform a Broadcast-hash join over an expensive Sort-merge join. In cases in which these statistics are missing or not accurate BHJ might not kick in. with adaptive execution in Spark 3.0 spark can examine that data at runtime once he had loaded it and opt-in to BHJ at runtime even it could not detect it on the planning phase.
Dynamic Partition Pruning (DPP)
Spark 3.0 introduces Dynamic Partition Pruning which is a major performance improvement for SQL analytics workloads that in term can make integration with BI tools much better. The idea behind DPP is to apply the filter set on the dimension table — mostly small and used in a broadcast hash join — directly on the fact table so it could skip scanning unneeded partitions.
DPP’s Optimisation is implemented both on the logical plan optimization and the physical planning. It showed speedup in many TCPDS queries and works well with star-schemas without the need to denormalize the tables.
Enhanced Support for Deep Learning
Deep Learning on Spark was already possible so far. However Spark MLlib was not focused on Deep Learning and did not offer deep learning algorithms and in particular didn’t offer much for image processing. Existing projects like TensorFlowOnSpark, MMLSpark and some others made it possible somehow but presented significant challenges. For example — given that Spark resiliency is very good and knows to recompute tasks over partitions on failure — in deep learning for if you loose a partition in the middle of a training job and you recompute this individual partition Tensorflow or others will not work well. It requires to train on all partitions in the same time.
Spark 3.0 handles the above challenges much better. In addition it adds support for different GPUs like Nvidia, AMD, Intel and can use multiple types at the same time. In addition Vectorized UDFs can use GPUs for acceleration. For Kubernetes it offers GPU support in a flexible manner when running on Kubernetes.
Better Kubernetes Integration
Spark support for Kubernetes is relatively not matured in the 2.x version and difficult to use in production and performance was lacking in compare with the YARN cluster manager. Spark 3.0 introduces new shuffle service for Spark on Kubernetes that will allow dynamic scale up and down (more precisely out and in)
Spark 3.0 also supports GPU support with pod level isolation for executors which makes scheduling more flexible on a cluster with GPUs. Spark authentication on Kubernetes also has some goodies.
Graph processing can be used in data science for several application including recommendation engine and fraud detections.
Spark 3.0 introduces a whole new module named SparkGraph with major features for Graph processing. These features include the popular Cypher query language developed by Neo4J which is a SQL like for graphs, the Property Graph Model processed by this language and Graph algorithms. This integration is something Neo4J worked on for several years and it’s named Morpheus (formerly named Cypher for Spark) but as said will be named SparkGraph inside the spark components.
Morpheus extends the Cypher language with multiple-graph feature, Graph catalog and Property graph data sources for integration with Neo4j engine, RDMS and more. It allows usage of the cypher language on graphs in a similar way SparkSQL operates over tabular data and it will have its own catalyst optimiser. In addition it would be possible to interoperate between SparkSQL and SparkGraph which can be very useful.
For a deep dive: Check out this session
ACID Transactions with Delta Lake
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark 3.0 and through easy implementation and upgrading of existing Spark applications, it brings reliability to Data Lakes. It announced to join the Linux foundation to grow its community.
It solves issues presented when data in the data lake is modified simultaneously by multiple modifiers and allows you to focus on logic and not worry from inconsistencies. Its very valuable for streaming applications but also very relevant for batch scenarios. Over 3500 organizations already use Delta Lake.
Spark 3.0 supports data lake out of the box and can be used just as it is used for example with parquet. sometimes replacing the read class to the deltalake’s one is enough to start using Delta Lake.
Quick Start: https://docs.databricks.com/delta/quick-start.html
Growing integration with Apache Arrow data format
Apache Arrow is an in-memory columnar data structure for efficient analytical operations. Its has benefits like being cross-language platform, performing zero-copy streaming messaging and interprocess communications without serialization costs which often occur with other systems.
In Spark 3.0 Usage in Apache Arrow takes bigger place and its used to improve the interchange between the Java and Python VMs. This usage enables new features like Arrow accelerated UDFs, TensorFlow being able to read error data in CUDA and more features in the Deep Learning section in Spark 3.0
Binary files data source
Spark 3.0 supports binary file data source. You can use it like this:
val df = spark.read.format(“binaryFile”)
The above will read binary files and converts each one to a single row that contains the raw content and metadata of the file. The DataFrame will contain the following columns and possibly partition columns:
writing back a binary DataFrame/RDD is currently not supported.
DataSource V2 Improvements
Few improvements for the DataSource API are included with Spark 3.0:
- Pluggable catalog integration
- Improved predicate push down for faster queries via reduced data loading
In addition there are many JIRAs to solve many issues existing with the current DataSource API.
- Spark 3.0 can auto discover GPUs on a YARN cluster and schedule tasks specifically on nodes with GPUs.
The above features are somehow the major and more influencing one but Spark 3.0 ships more enhancements and features with it.
Mostly it is clear that Spark 3.0 is a big step up for data scientists and enables them to run Deep Learning with distributed training and serving.
What’s being deprecated or removed
More than a few things are being deprecated or removed. Make sure you read the deprecation and migration notes to see that you are covered the time you want to test your code with Spark 3.0.
Python 2.7 will still work, but not anymore tested in the release lifecycle of Spark. this effectively means — you DON’T use it. This is relevant for all Panadas and NumPy users in Python 2. So from Spark 3.0 PySpark users will see a deprecation warning if Python 2 is used. A migration guide for PySpark users to migrate to Python 3 will be available. Completely removing Python 2 support is expected to occur in 1.1.2020 when PySpark uses will see an error in case Python 2 is used. After this date patches for Python 2 might be rejected for Spark versions 2.4.x
I coudnt totaly figure what will be the future of SparkMLlib with Apache Spark 3.x versions. Ill update this blog asap once I know more.
<I’ll Add more deprecations/removals here when I know>
I hope you liked this thorough review and that it was helpful. Please feel free to leave any questions on the comment sections.
I’de like to thank: