Containers have emerged as legitimate platforms for running data science workloads. But are Docker and Kubernetes here for good, or is it just a passing data science fad? We looked to Anaconda’s SVP of Products and Marketing Mathew Lodge for insights on the matter.
Lodge joined Anaconda last year after spending years at VMware and the Cloud Native Computing Foundation. Those two previous gigs may provide clues as to how Lodge sees containers fitting into the data science discussion.
The most striking comment from Lodge may have been that the default deployment mode for Anaconda – which develops one of the most popular data science libraries for Python – is now Kubernetes. That became the standard with last summer’s launch of Anaconda Enterprise 5.
“There’s a lot of interest from the customer side around containerization,” Lodge tells Datanami. “They see the benefits of containerization around how they can move faster and improve their velocity of deployment and delivering software internally.”
In the past, the data science discussion was very much tied to what big data architecture you were running. That’s when one can’t help but notice the big yellow elephant standing in the corner of the room (or data center, as it were).
“How you do data science was to a large extent determined by where the data was and the data integration, how you work with Hadoop and how you work with Spark,” Lodge says. “All those things were really super important to how you do data science.
“What we’re starting to see now,” he continues, “is data science is starting to become more independent of the underlying data. As long as you can get to the data efficiently, it doesn’t matter where the data is. That gives a lot more freedom to data scientists. That’s the really crucial thing.”
The Devil You Know
Hadoop was the center of the big data discussion for many years because it provided an economical way to store all the data that data scientists needed.
Compared to what came before it – NFS and SMB file systems that can’t scale, relational databases that are extremely rigid, and specialized column-stores that were very expensive – Hadoop was a breath of fresh air from the open source world. A lot of organizations flocked to Hadoop, and it’s been providing economical, scalable, and flexible storage ever since.
It’s not surprising to hear that Anaconda has lots of customers who have deployed Hadoop data lakes. “It’s the devil you know,” Lodge says. “It’s well understood. It has the advantage of maturity. People understand how it works. They understand how to operate it.”
But now a lot of organizations are questioning whether Hadoop is the right platform upon which to conduct data science. They have a lot of data stored in their Hadoop data lakes, but they’re finding that getting useful information out of it is not as easy as they had hoped.
Increasingly, organizations are looking at the cloud vendors and realizing they can economically store their big data there. This is driving the rise of hybrid data lakes, where some data sits on on-premises Hadoop clusters and some sits in the cloud.
Lodge says that customers who haven’t yet deployed a Hadoop lake are questioning whether to even take that step.
“You’ve got these alternative, low-cost storage infrastructures that are starting to challenge the role of Hadoop,” Lodge says. “They’re starting to ask why they should [adopt Hadoop] because they can get a lot of benefits now from other kinds of storage infrastructure, things like object stores, BLOB stores.”
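The freedom Lodge describes, that data access no longer dictates the platform, can be pictured as a thin dispatch layer over the storage URL. The helper below is a hypothetical sketch (not an Anaconda or cloud-vendor API): it resolves local files directly and marks where an S3 or HDFS client would plug in.

```python
# Sketch: storage-agnostic data access. Hypothetical helper, not a real
# library API; the s3:// and hdfs:// branches show where a real client
# (e.g. an object-store or HDFS library) would be plugged in.
from urllib.parse import urlparse


def read_bytes(url: str) -> bytes:
    """Fetch raw bytes from wherever the data lives, based on URL scheme."""
    parsed = urlparse(url)
    scheme = parsed.scheme or "file"
    if scheme == "file":
        # Local filesystem (also handles bare paths with no scheme).
        with open(parsed.path, "rb") as f:
            return f.read()
    if scheme == "s3":
        # Cloud object store: a real implementation would call an S3 client here.
        raise NotImplementedError("plug in an S3 client")
    if scheme == "hdfs":
        # Hadoop data lake: a real implementation would call an HDFS client here.
        raise NotImplementedError("plug in an HDFS client")
    raise ValueError(f"unsupported scheme: {scheme}")
```

The point of the sketch is that the data science code above this layer stays identical whether the bytes come from a Hadoop lake, a BLOB store, or a laptop.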
Compute and Storage
While Hadoop still has value as a data store, Hadoop just isn’t a good fit for many data science workloads – particularly those developed in Python, Lodge says. It’s not a good fit for two primary reasons:
- First, Hadoop was written in Java and largely expects applications to also run in Java;
- Second, compute and storage are tightly integrated in Hadoop.
You can deploy Python code to Hadoop or Spark, and it’s something that Anaconda does with its enterprise software. But it’s not necessarily easy and it increases the complexity, Lodge says.
“When Hadoop was developed in 2005, Python didn’t really exist in the form that it does today,” he says. “When you look at how data science is done today, it’s largely done in Python and to a lesser extent in R, and it’s not done in a JVM language.”
The integration of compute and storage – which the Hadoop community is only now starting to unwind with the adoption of erasure coding – also increases the complexity.
“The original challenge that the Hadoop guys had back in 2005 was I/O was slow, networks were slow, so you had to move the compute closer to the data in order to solve that problem,” he says. “But when you look at the way that Google and Facebook and the mega-infrastructure people run — they don’t do this…They scale storage and compute independently, and part of what enables them to do that is the fact that [network technology] is 1,000x faster than it was in 2005.”
It’s these two architectural decisions, more than anything, that have made Hadoop a poor fit for today’s data scientists, Lodge says. “We could dramatically simplify data science if we could just stop trying to cram non-Java languages into running inside of Hadoop clusters that assume everything runs in Java and is right next to the storage,” he says.
Silos? No Problemo
Anaconda tries to be agnostic to the data storage question, Lodge says. It doesn’t matter if they have Hadoop, a cloud object store, or a traditional data warehouse – the company will find a way to get the data for its data science customers.
While there are some good reasons to build a data lake, it’s by no means necessary today, particularly with the advent of data fabrics that can smooth over the storage differences. “There are a lot of interesting data virtualization products out there that have a way to bring the data together in terms of a catalog and the metadata about your data,” Lodge says. “Data silos are only a problem if you don’t know what’s in there, you can’t get to it, and it’s not integrated in terms of how you manage it.”
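The catalog idea Lodge mentions can be reduced to a toy example. Real data virtualization products are far richer, but the sketch below (dataset names, locations, and fields are all hypothetical) shows the core move: silos stop being a problem once a logical name resolves to a physical location plus metadata.

```python
# Sketch: a toy data catalog. Dataset names, locations, and metadata
# fields are hypothetical placeholders, not a real product's schema.
CATALOG = {
    "clickstream": {
        "location": "hdfs://lake/clickstream/",
        "format": "parquet",
        "owner": "web-analytics",
    },
    "customers": {
        "location": "s3://acme-dw/customers.csv",
        "format": "csv",
        "owner": "crm",
    },
}


def locate(dataset: str) -> str:
    """Resolve a logical dataset name to its physical storage location."""
    entry = CATALOG.get(dataset)
    if entry is None:
        raise KeyError(f"unknown dataset: {dataset}")
    return entry["location"]
```

With a layer like this in front, a data scientist asks for “customers” and never needs to know, or care, which silo answers.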
Processing the data is a different story. Anaconda is clearly encouraging customers to run data science workloads on containerized clusters, running either in-house or in the cloud. It’s the same approach that cloud giants have taken with their own data science offerings. Databricks also uses Kubernetes to run its Spark service on AWS, as do many other data science platforms.
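To make the containerized-cluster approach concrete, here is a minimal sketch of the kind of Kubernetes `apps/v1` Deployment one might submit to run a Python model-serving workload. The image name, app label, port, and replica count are hypothetical placeholders, not anything from Anaconda’s product.

```python
# Sketch: building a minimal Kubernetes apps/v1 Deployment manifest for a
# containerized data science service. All names and values are
# hypothetical placeholders.
import json


def model_deployment(name: str, image: str, replicas: int = 3) -> dict:
    """Build a minimal Deployment object; selector labels must match the pod template."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        "ports": [{"containerPort": 8080}],
                    }],
                },
            },
        },
    }


manifest = model_deployment("churn-model", "registry.example.com/churn-model:1.0")
print(json.dumps(manifest, indent=2))
```

Serialized to YAML or JSON, a manifest like this is all Kubernetes needs to schedule, scale, and restart the workload, which is the deployment simplicity Lodge is pointing at.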
While GPUs (and FPGAs) are on the roadmap for Hadoop 3.x, they don’t run “natively” yet, which means other architectures are training the deep learning models that are at the heart of today’s artificial intelligence revolution. Kubernetes is the odds-on favorite to scoop up these workloads.
“What we’re starting to see is data science is starting to become more independent of the underlying data,” Lodge says. “That freedom is very important for data scientists for things like GPU and specialized hardware for running machine learning, like Google‘s Tensor Processing Unit [TPU].”
You cannot embed that kind of hardware inside of Hadoop clusters, Lodge says. “You’ve got to run those separately, otherwise you’ll have really awful efficiency,” he adds. “You just won’t be able to keep that hardware busy, and it’s expensive hardware.”
“It’s a good story for data scientists because it’s dramatically simpler for you to deploy data science,” he says. “Being able to deploy that to a Kubernetes cluster or a containerized cluster is a really simple answer to ‘how do I get that deployed?'”