It’s tough to find a big data project that’s had as much impact as Apache Spark over the past five years. The folks at Databricks who develop Spark are keeping the project on the cutting edge with version 2.3.
Apache Spark 2.3 was unveiled by the Apache Spark project on February 28, also forms the underpinning for version 4.0 of the Databricks Runtime, which Databricks unveiled last week during the Strata Data Conference. There are some extra features that customers get in the cloud-based Databricks runtime that they can’t get in Apache Spark, which we’ll cover later.
Spark 2.3 includes more than 1,000 changes from the previous version. With that said, the top three new features Databricks added to Apache Spark with version 2.3 include continuous streaming, support for Kubernetes, and a native Python UDF.
Apache Spark is the ultimate multi-tool for data scientists and data engineers, and Spark Streaming has been one of the most popular libraries in the package. However, Spark Streaming has had one major handicap that hindered adoption for some use cases: a micro-batch approach to processing.
Instead of processing each new piece of data as it arrives, Spark Streaming would bundle up multiple pieces of data and process them in batch. While this approach worked fine for many streaming data analytic use cases, it didn’t meet the requirements of customers demanding sub-second latencies.
That reliance on micro-batch architecture is gone with Spark 2.3, Spark chief architect Reynold Xin told Datanami last week at the Strata show. “This completely eliminates any micro batching from Spark Structured Streaming,” Xin said.
The new continuous streaming execution engine can process queries with sub-millisecond latencies, the project says. Customers can continue to use the micro-batch and batch architectures if they like, but the new option will likely be chosen for many use cases.
Apache Spark 2.3 also gets the experimental Streaming API V2, which can be used to plug in new source and sinks and which works across batch, micro-batch, and continuous execution environments, but it’s not yet ready for prime-time.
Spark customers can now use Kubernetes as a resource manager for their Spark implementations. According to the Spark project, the new Kubernetes scheduler supports native submission of Spark jobs to a cluster managed by Kubernetes.
Databricks has been using Kubernetes for its control plane since 2016. Prior to that, the company used a homegrown scheduler, not Mesos, which came out of the same AMPlab that Spark came out of. “When Kubernetes came out, we really liked it and we actually replaced it with Kubernetes for our control plane,” Xin said.
However, customers should test their Spark-on-Kubernetes integration before deploying it, or to wait a few releases until the integration is fully baked. The Spark 2.3 release notes say that “this [Kubernetes] support is currently experimental and behavioral changes around configurations, container images and entrypoints should be expected.”
Vectorized Python UDFs
Data scientists who use Python and Python-based libraries to develop data processing pipelines or machine learning models that execute in Spark will be thrilled to learn about the new Python support for vectorized user defined functions (UDFs) in Spark 2.3.
“Spark has always supported Python as a language,” Xin said. “But the problem with Python is, if you want to express something that Spark doesn’t do, then you have to do it through a user defined functions, and it runs really slow.”
This new feature, which is the most commonly requested feature by data scientists, will increase the speed that custom Python-based functions run in Spark by 10x to 100x, he said. The change has to do with how Python-based applications and tools exchange data in Spark.
Xin explains: “When you call Python UDF, you have to give the actual code data. We used to give them one row at a time, and this is actually really slow. And now, instead we’re giving them multiple rows at a time, but we group them in a column fashion. So we’re passing one column with say a million data points, and another column with a million data points, and another column.”
By getting the data into Spark via column-oriented chunks from Python-based libraries, including Pandas, Scikit-Learn, and NumPy, the whole experience for data scientists should be greatly improved. “It turns out for most the Python data science tools, the actual implementations are written in C, so they are pretty fast, and all of them operate on columns,” Xin said.
Users can also use Spark to farm out jobs to the Python-based tools for processing, and then bring the data back into Spark in a parallel manner. “So it’s actually a pretty good fit of combining the small data truth with the big data truth,” Xin says.
Looking forward, the next pieces of news out of Databricks will likely revolve around Delta, the streaming data warehouse it unveiled in late 2017, as well as expanded support for more public cloud platforms.