In mid-2017, we were working with one of the world’s largest healthcare companies to put a new data application into production. The customer had grown through acquisition, and to maintain compliance with the FDA they needed to aggregate data in real time from dozens of divisions of the company. The consumers of this application, of course, did not care how we built the data pipeline. They cared greatly, however, that if it broke and end users did not get their data, the company would be out of compliance and could be subject to massive fines.
The data pipeline was built primarily on the Cloudera platform using Apache Spark Streaming, Apache Kudu, and Apache Impala; however, some components relied on automation built in Bash and Python as well. Having supported data products in the past, we knew that data applications need robust support beyond strong up-front architecture and development. Specifically, we needed to ensure that errors would not go unnoticed: if any part of the pipeline had issues, we needed the ability to act proactively. We set up Cloudera Manager to issue alerts for the platform, but how could we be alerted if any part of the application pipeline failed, whether inside or outside the Cloudera platform?
There are many log aggregation and alerting tools on the market. Out of the box, Elasticsearch provides log searching functionality comparable to Apache Solr’s, and Kibana provides a very good web user interface. In our case, though, managing an Elasticsearch cluster would have been extra work: our customer was already using Cloudera Search (i.e., Solr integrated with CDH), and creating an additional cluster was not worth the overhead and cost. Other solutions were prohibitively expensive. For example, one option we evaluated was priced by the volume of data ingested and aggregated. Having been down this road before, we knew that with volume-based pricing you end up spending a lot of time adjusting log verbosity and making sure only important things are logged. This is low-value work, and storage is cheap.
Given that the application handled PHI subject to HIPAA, we also wanted a solution that included role-based access control (RBAC). The Apache Sentry integration with Solr that Cloudera provides enables role-based access control, so our customer can use their existing Sentry roles and processes to protect their log data against unauthorized access.
Pulse is a log aggregation, search, and alert framework that runs on CDH. Pulse builds on Apache Hadoop/Spark technologies to add log aggregation to your existing infrastructure and integrates closely with Cloudera Manager.
When running an application, it’s important to be able to:
- Search and drill down into your logs at the speed of thought to find issues quickly. Searching aggregated logs in a centralized location cuts debugging time drastically.
- Create flexible alerts reacting to your log data as it arrives, in real time. Alerts let you be proactive about your application health, and learn of issues before they are reported downstream.
- Secure your logs against unauthorized access. Hadoop clusters are multi-tenant environments, often with hundreds of users. Log data should be secured like the rest of your data.
- Keep logs for a configurable amount of time. Depending on application requirements logs may need to be kept for a few days or a few years.
Pulse stores logs in Solr, which gives full text search over all log data. Sentry handles role-based access control and works with Solr, so it’s easy to control access to logs. Pulse itself adds functionality like log lifecycle management so logs are kept only as long as needed. It includes log appenders for multiple languages that make it easy to index and search logs in a central location. Pulse also has a built-in alert engine, so you can create alerts that will notify your team, in real time, when things go wrong.
Pulse runs on your existing infrastructure; it is not a cloud-based service. It is packaged as a Cloudera Custom Service Descriptor (CSD) and installed via Cloudera Manager. In addition to making the installation of Pulse painless on your Cloudera Hadoop cluster, Cloudera Manager acts as a supervising process for the Pulse application, providing access to process logging and monitoring.
Pulse consists of four components:
- Log Appenders: a log4j appender is pre-packaged with Pulse; appenders for Python and Bash are also available.
- Log Collector: an HTTP server that listens for log messages from the appenders and puts them into the Apache Solr (Cloud) full-text search engine. The Log Collector can be scaled across your cluster to handle large volumes of logs.
- Alert Engine: a service that runs against near-real-time log data indexed into SolrCloud. Alerts run on an interval and can notify via email or an HTTP hook.
- Collection Roller: handles application log lifecycle and plumbing. Users configure how often to create a new index for logs and how long to keep logs.
Logs stored in Solr can be visualized using existing tools, including Hue, Arcadia, Zoomdata, or Banana. Each log record stored in Pulse contains the original log message timestamp, making it easy to create time-series visualizations of your log data.
The Log Collector is an HTTP REST service that listens for log events, batches them, and writes them to Solr. Because the Log Collector is just a REST API, it’s easy to configure applications in any language to use it. It’s on the Pulse roadmap to flip the log-collector around and read log events from a Kafka topic instead of just listening for messages.
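As a sketch of what talking to that REST API could look like, the snippet below assembles and posts a single JSON log event from Python. The endpoint host, port, path, and field names are assumptions for illustration; the actual wire format is defined by your Pulse deployment:

```python
import json
import time
import urllib.request

# Hypothetical Log Collector endpoint; the real host, port, and path
# are defined by your Pulse deployment.
COLLECTOR_URL = "http://collector-host:9004/v1/log"

def build_event(level, message, application="my-app"):
    """Assemble one log event; these field names are illustrative."""
    return {
        "application": application,
        "level": level,
        "message": message,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }

def send_event(event):
    """POST a single event to the Log Collector as JSON."""
    req = urllib.request.Request(
        COLLECTOR_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Because the protocol is plain HTTP and JSON, the same call is easy to reproduce from Bash with curl or from any other language.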
Log appenders are the application-specific code and configuration used to write log messages to the Log Collector. We’ve built clients that write to the Log Collector for Java (using log4j), Python, and Bash.
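To give a feel for what such a client involves, a minimal Python appender can be sketched as a standard logging.Handler that forwards each record to the Log Collector. This is a simplified illustration, not the packaged Pulse client; the URL and JSON field names are assumptions:

```python
import json
import logging
import urllib.request

class HttpLogHandler(logging.Handler):
    """Illustrative handler that ships each log record to a collector URL."""

    def __init__(self, url):
        super().__init__()
        self.url = url

    def format_record(self, record):
        # Map the standard LogRecord onto a flat JSON document;
        # these field names are illustrative, not Pulse's schema.
        return {
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
            "timestamp": record.created,
        }

    def emit(self, record):
        try:
            payload = json.dumps(self.format_record(record)).encode("utf-8")
            req = urllib.request.Request(
                self.url, data=payload,
                headers={"Content-Type": "application/json"})
            urllib.request.urlopen(req, timeout=5)
        except Exception:
            self.handleError(record)

# Attach the handler to an application logger (hypothetical collector URL).
logger = logging.getLogger("pipeline")
logger.addHandler(HttpLogHandler("http://collector-host:9004/v1/log"))
```

Once attached, every existing logger.error(...) call in the application is shipped to the central collector with no other code changes, which is the appeal of the appender approach.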
The Collection Roller defines applications in Pulse and handles the log lifecycle. The image below describes how the Collection Roller works with collection aliases. There are two collection aliases, one for reads and one for writes.
The write alias (suffixed with _latest internally in Pulse) always points at the most recently created collection. It is used by the Log Collector, which therefore doesn’t have to know about specific collections.
The read alias (suffixed with _all internally in Pulse) always points at all log collections for a single application. It is used by visualization and search tools that need access to all log data.
Every day, a new collection is created, while the oldest collection is deleted. The image below describes this process.
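The daily rotation described above can be sketched against the standard SolrCloud Collections API (its CREATE, CREATEALIAS, and DELETE actions). The host, sharding parameter, and integer day suffix below are simplifications for illustration; Pulse’s Collection Roller manages this for you:

```python
import urllib.parse
import urllib.request

# Hypothetical SolrCloud host; CREATE, CREATEALIAS, and DELETE are
# standard SolrCloud Collections API actions.
SOLR = "http://solr-host:8983/solr/admin/collections"

def solr_admin(**params):
    """Call the SolrCloud Collections API with the given action and params."""
    url = SOLR + "?" + urllib.parse.urlencode(params)
    return urllib.request.urlopen(url).read()

def roll(app, day, keep=7, admin=solr_admin):
    """Create today's collection, repoint both aliases, drop the expired one.

    `day` is a simple integer stand-in for a date-based collection suffix.
    """
    collections = [f"{app}_{day - i}" for i in range(keep)]
    newest = collections[0]
    admin(action="CREATE", name=newest, numShards=1)
    # Write alias (_latest): only the newest collection.
    admin(action="CREATEALIAS", name=f"{app}_latest", collections=newest)
    # Read alias (_all): every collection inside the retention window.
    admin(action="CREATEALIAS", name=f"{app}_all",
          collections=",".join(collections))
    # Delete the collection that just aged out of the retention window.
    admin(action="DELETE", name=f"{app}_{day - keep}")
```

Re-pointing aliases rather than clients means the Log Collector and dashboards never need to change configuration as collections rotate.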
Alert Engine and Visualization
Users interact with Pulse through the Alert Engine and visualization tools.
Because Pulse uses Solr, any visualization tool that works with Solr can be used, including Arcadia Data, Hue, Zoomdata, or Banana.
Here is a screenshot of a dashboard using Banana:
The Pulse Alert Engine is a daemon that continually monitors logs and will alert users when something goes wrong. The alert engine has rules that can use the full power of the Solr/Lucene query language. For example, this query will alert if an ERROR-level message was seen in the last 10 minutes:
timestamp:[NOW-10MINUTES TO NOW] AND level:ERROR
This query will alert if a metric goes out of the range of 50-100:
timestamp:[NOW-10MINUTES TO NOW] AND (metric:[101 TO *] OR metric:[0 TO 49])
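A check like the ones above can be sketched as a periodic Solr query that fires a notification when the hit count crosses a threshold. The collection URL and notification hook below are assumptions for illustration, not the Alert Engine’s actual implementation:

```python
import json
import urllib.parse
import urllib.request

# Hypothetical query endpoint for an application's read alias (_all);
# the host and collection name are assumptions for illustration.
SOLR_QUERY_URL = "http://solr-host:8983/solr/my-app_all/select"

def count_matches(query, fetch=None):
    """Return the number of documents matching a Solr/Lucene query."""
    fetch = fetch or (lambda url: urllib.request.urlopen(url).read())
    url = SOLR_QUERY_URL + "?" + urllib.parse.urlencode(
        {"q": query, "rows": 0, "wt": "json"})
    return json.loads(fetch(url))["response"]["numFound"]

def check_alert(query, threshold=1, notify=print, fetch=None):
    """Fire the notify hook when matches meet or exceed the threshold."""
    hits = count_matches(query, fetch=fetch)
    if hits >= threshold:
        notify(f"ALERT: {hits} match(es) for {query!r}")
    return hits
```

Running check_alert on a timer, with notify wired to email or an HTTP hook, approximates the interval-based behavior described above.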
Pulse lets you keep control of your logs while taking advantage of your existing infrastructure and tools. Pulse is Apache 2.0 licensed and available for use and contributions at https://github.com/phdata/pulse.
Tony Foerster is a Staff Engineer at phData.