I started working on big data infrastructure in 2009 when I joined Cloudera, which at the time was a small startup with about 10 engineers. It was a fun place to work. My colleagues and I got paid to work on open source projects including Apache Hadoop (HDFS, YARN, and MapReduce), Apache Hive, and Apache Spark.
After four years of this I suddenly had the realization that my colleagues and I were building tools, but not actually using them ourselves. So I decided to move to LinkedIn, inspired in part by a Jeff Dean quote that says that people who build systems should sit no more than 50 feet away from their first customers.
What I encountered at LinkedIn was humbling. While the systems I had helped to build at Cloudera were clearly meeting the scaling challenges that LinkedIn’s ever-growing volume of data demanded, it quickly became clear that these same systems placed an enormous tax on the personal productivity of the software engineers, business analysts, and data scientists who were forced to rely on them. Platforms like R and Oracle may not be able to scale out like Hadoop and Spark, but they allow their users to focus on the questions they’re trying to answer, without requiring them to become experts in how these systems work or are implemented.
In contrast, Hadoop and Spark demand more effort from users to answer the same questions, and even more effort and considerable domain expertise in order to do it efficiently. So we were forced to make tradeoffs between the productivity of people using the platform, and the efficient use of the hardware resources necessary to run these jobs. It seemed like we couldn’t have both.
Our first response to this problem was to hold office hours where members of my team, who possessed the domain expertise necessary to analyze and tune the performance of Hadoop and Spark jobs, met with individual users and offered advice. This approach helped, but it was expensive and added significant latency to the development cycle.
So we did what every self-respecting software engineer should do and started to think about ways of automating the process. Our efforts resulted in Dr. Elephant, a rule-based system that automatically detects under-performing jobs, diagnoses the root cause, and guides the owner of the job through the treatment process.
Dr. Elephant makes it easy to identify jobs that are wasting resources, as well as jobs that can achieve better performance without sacrificing efficiency. Perhaps most importantly, she makes it easy to act on these insights by making Hadoop and Spark performance tuning accessible to users regardless of their skill level. In the process, Dr. Elephant has largely eliminated the tension that previously existed between user productivity on one side and cluster efficiency on the other.
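To make the idea of a rule-based diagnosis concrete, here is a minimal Python sketch. It is not Dr. Elephant's actual code or API: the `JobStats` fields, the two rules, and their thresholds are all hypothetical and chosen only for illustration. Each rule inspects a finished job's metrics and either returns a piece of advice or nothing at all.

```python
import statistics
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class JobStats:
    """Hypothetical per-job metrics; field names are illustrative only."""
    job_id: str
    requested_memory_mb: int
    peak_memory_mb: int
    task_runtimes_sec: List[float]


@dataclass
class Finding:
    rule: str
    advice: str


def memory_overallocation(job: JobStats) -> Optional[Finding]:
    # Flag jobs that used less than half of the memory they requested.
    # The 50% threshold is an arbitrary choice for this sketch.
    if job.peak_memory_mb < 0.5 * job.requested_memory_mb:
        return Finding(
            "memory_overallocation",
            f"Peak usage was {job.peak_memory_mb} MB of "
            f"{job.requested_memory_mb} MB requested; consider requesting less.",
        )
    return None


def task_skew(job: JobStats) -> Optional[Finding]:
    # Flag jobs whose slowest task ran more than 4x longer than the median
    # task. The 4x factor is likewise an assumption made for illustration.
    if not job.task_runtimes_sec:
        return None
    median = statistics.median(job.task_runtimes_sec)
    slowest = max(job.task_runtimes_sec)
    if median > 0 and slowest > 4 * median:
        return Finding(
            "task_skew",
            f"Slowest task took {slowest:.0f}s against a median of "
            f"{median:.0f}s; the input may be unevenly partitioned.",
        )
    return None


# The rule set is just a list of functions, so new rules are easy to add.
RULES: List[Callable[[JobStats], Optional[Finding]]] = [
    memory_overallocation,
    task_skew,
]


def diagnose(job: JobStats) -> List[Finding]:
    """Run every rule against the job and collect the advice that applies."""
    findings = []
    for rule in RULES:
        finding = rule(job)
        if finding is not None:
            findings.append(finding)
    return findings


if __name__ == "__main__":
    job = JobStats("example_job", 8192, 2048, [10, 11, 12, 95])
    for finding in diagnose(job):
        print(f"[{finding.rule}] {finding.advice}")
```

Note that `diagnose` only reports findings and never changes the job's configuration; acting on the advice is left to the job's owner.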
Like any physician, Dr. Elephant provides advice but doesn’t force you to follow it. With so many users and different use cases, we didn’t want to build a system that automatically adjusted jobs without a user’s input. After all, this would make Hadoop job tuning a black box, defeating the purpose of using Dr. Elephant as a tool that incrementally teaches our users more about how to tune their jobs through practice.
Furthermore, a system that automatically tunes jobs would have to be nearly 100 percent accurate in all cases, whereas a system that offers guidance but ultimately defers to the user’s discretion combines the best of human and machine.
A lot has happened since we open sourced Dr. Elephant in 2016. Activity on GitHub and the Dr. Elephant mailing list has been strong from day one, and the Dr. Elephant developers at LinkedIn have made it a priority to answer questions and handle pull requests. Most of the development goals listed in the original Dr. Elephant blog post have been accomplished, and many of these (including support for the Oozie and Airflow workflow schedulers, improved metrics, and enhancements to the Spark history fetcher and Spark heuristics) were contributed by developers outside of LinkedIn. We are constantly extending Dr. Elephant, adding rules that allow it to identify new performance pathologies and improving its support for Spark.
Finally, we have been happy to see that many people have benefited from running Dr. Elephant, including companies like Airbnb, Foursquare, Hulu, and Pinterest. Many of these new users have already contributed back to Dr. Elephant, and we’ve even gotten interest from companies that wish to integrate Dr. Elephant into their commercial product offerings, including Pepperdata and their new Application Profiler product.
The set of problems that Dr. Elephant looks for and the specific advice that it gives have no doubt been influenced by the workloads we see at LinkedIn. While these workloads are very diverse, we suspect that other organizations may see workloads that differ from ours, possibly quite significantly. As the use of Dr. Elephant at other companies continues to grow, we’re hoping that exposure to these other workloads will result in better diagnostics and advice.
One last point: five to 10 years from now, I suspect that Dr. Elephant won’t be as necessary as it is today (at least for Hadoop and Spark). Hopefully Spark and Hadoop will evolve to be more self-tuning, and will start to look more like a conventional database that optimizes and tunes the execution plan for you.
About the author: Carl Steinbach is a Senior Staff Software Engineer at LinkedIn, and the Tech Lead for LinkedIn’s Grid Development Team. He’s also a PMC member of the Apache Hive project. Carl has an engineering degree from MIT.