Cybersecurity threats are a constant concern for every modern business. Countless examples of breaches make headlines regularly, and they seem to be increasing in number. The 2018 Thales Data Threat Report found that over one-third of global organizations were breached in 2017, a 10% increase over 2016. Moreover, the Herjavec Group claims damage from cyberattacks will double, from $3 trillion in 2015 to $6 trillion by 2021, making it top of mind for any business.
One would think that the greatest minds in technology would have come up with a solution by now, but unfortunately, that’s not the case.
The truth is, detecting and combating cybersecurity threats is extraordinarily difficult, as anyone who has spent time as a cybersecurity pro knows. It’s a constantly evolving discipline where hackers — often backed by organized crime and nation states — play cat and mouse with their targets, whose “attack surface” is increasingly large, complex and distributed. So, while countless products exist to thwart attacks, hackers always find new ways to overcome or sidestep enterprise countermeasures.
Many enterprises are finding that data is the cornerstone to improving cybersecurity. Identifying threats and diagnosing attacks is essentially an exercise in pattern matching and anomaly detection, which can benefit greatly from recent advances in big data, data science and machine learning.
The more data sources available to analyze, the better the chances at success. New, valuable sources include NetFlow session data, data from endpoint computers and smartphones, and logs from servers and security systems such as identity management, firewall, IPS and vulnerability assessment products. Combining these new data feeds with historical data and machine learning techniques can take detection capabilities far beyond where they are today.
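As a minimal sketch of the kind of anomaly detection described above, the example below flags NetFlow-style sessions whose byte volume deviates sharply from the rest, using a median/MAD score (chosen here because, unlike a plain z-score, it isn't masked by the outlier itself). The data and threshold are illustrative assumptions, not from any particular product; production systems would use far richer features and models.

```python
import statistics

def flag_anomalies(values, threshold=3.5):
    """Return indices of values whose modified z-score (based on the
    median and median absolute deviation) exceeds the threshold.
    A deliberately simple baseline, not a production detector."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []  # all values (nearly) identical; nothing to flag
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical per-session byte counts: mostly normal traffic,
# plus one exfiltration-sized outlier at the end.
sessions = [1200, 980, 1100, 1050, 990, 1150, 9_500_000]
print(flag_anomalies(sessions))  # → [6]
```

The same scoring function applies unchanged to other numeric signals, such as login counts per host or DNS queries per minute, which is part of why a centralized store of many data feeds pays off.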
In fact, taking an operational approach to cybersecurity-oriented data flows can help teams quickly understand the causes of recent attacks as well as detect and mitigate future threats. By operationalizing how data is ingested from the various corners of a business into a centralized data lake, it is possible to create new cybersecurity capabilities and to extend the value of existing investments.
Use a Data Lake to Accelerate Time to Insight
Cybersecurity is about proactive prevention and, lacking that, quick remediation. When a business is compromised, it’s all hands on deck to learn what happened, establish root cause and fix the issue.
It is critical that teams quickly have access to data from affected systems so that users can analyze and assess what occurred, and then understand the effects of the attack, eradicate the malicious code and close vulnerabilities. Better yet, and of course more difficult, is to detect vulnerabilities and the early stages of an attack in order to defeat it before damage occurs.
For this to happen, however, teams and machine-learning algorithms must have current and complete data available in order to maximize the chance of detecting anomalies that identify possible attacks in progress and non-obvious vulnerabilities.
The data lake has emerged as an economical solution for making new data available to cybersecurity teams. Data lakes allow for storage and analysis of extremely large volumes of data from numerous systems and in a variety of formats.
At scale, data lakes are a fraction of the cost of traditional enterprise data warehouses and come with a variety of built-in analytic tools. The data lake offers great promise for extending the reach of cybersecurity teams across the business to safeguard key assets. With a data lake, teams can economically obtain more types of data and retain these data sets indefinitely in order to better define normal activity to detect anomalies.
Failure to Launch: The Data Ingestion Issue
Unfortunately, keeping the cybersecurity data lake well-stocked is a major challenge.
The variety of data structures and the varying pace at which different sources generate data make the process of building and operating continuous ingestion pipelines both time-consuming and complex. In addition to understanding the nuances of the data coming from various applications and systems, teams must also learn a host of new tools in order to move the data, which draws out the time it takes to make data available for analysis.
For instance, many of the new sources are valuable logs from server, endpoint, network and security systems. But continuous log shipping can become a complex development effort due to the variety and variability of log formats plus the need for expertise in the software required to move logs around the data architecture.
NetFlow and syslog protocols, for example, have their own formats and rules, requiring cybersecurity teams to build specialized capabilities. Where many log formats exist, the problem becomes acute and chews up precious, hard-to-find and expensive technical resources.
The pain doesn’t end once the initial pipeline is created. Changes to the various systems generating the logs — an example of data drift — lead to pipeline breakage. In the end, teams spend more time building and modifying data ingestion workflows than analyzing the data itself. In short, handcrafted ingestion leads to excessive spend, misplaced resources, and fragile and unreliable pipelines.
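To make the log-format and drift problems above concrete, here is a small sketch that parses classic BSD-style syslog lines and, when a producer's format drifts, preserves the unparsed record rather than breaking the pipeline. The regex and field names are simplified assumptions loosely modeled on RFC 3164, not a complete parser.

```python
import re

# Loose BSD syslog pattern: "<PRI>Mon DD HH:MM:SS host message"
SYSLOG_RE = re.compile(
    r"<(?P<pri>\d{1,3})>"
    r"(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<msg>.*)"
)

def parse_syslog(line):
    """Parse one syslog line; on an unexpected format (data drift),
    keep the raw line and mark it, instead of dropping the record."""
    m = SYSLOG_RE.match(line)
    if not m:
        return {"raw": line, "parse_error": True}
    pri = int(m.group("pri"))
    return {
        "facility": pri // 8,   # standard syslog PRI encoding
        "severity": pri % 8,
        "timestamp": m.group("ts"),
        "host": m.group("host"),
        "message": m.group("msg"),
        "parse_error": False,
    }

rec = parse_syslog("<34>Oct 11 22:14:15 fw01 sshd[201]: Failed password for root")
print(rec["host"], rec["facility"], rec["severity"])  # → fw01 4 2
```

Routing `parse_error` records to a quarantine area for inspection, instead of failing the whole flow, is one way to keep drift from turning into outages.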
What’s required is tooling that provides an abstraction layer that simplifies the process of building and stocking the data lake. Ideally, teams can use drag-and-drop tools to take advantage of big data systems without the overhead of having to learn the ins and outs of various handcrafted data ingestion methods.
In short, hand-coding should be the exception, not the norm. While writing code to take advantage of big data or data science frameworks can add value to any cybersecurity initiative, it’s important that it not become the linchpin upon which your cybersecurity processes rely, as that leads to brittleness and accumulated tech debt.
Beyond simplifying pipeline development, standardizing on a higher-level ingestion approach also allows for collaboration and sharing of data ingestion logic. Once a data flow has been designed for a specific data type, it can be reused, saving time and money while ensuring best practices are followed. This is critical for global companies where numerous local efforts may be under way that can benefit from the reuse of previous work but must also adhere to company-wide governance standards.
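The reuse idea above can be sketched as a simple registry: an ingestion routine for a given data type is defined once and then invoked by name everywhere else. The registry and the `syslog` handler below are hypothetical illustrations, not any vendor's API.

```python
# Hypothetical registry of reusable ingestion routines, keyed by
# data type, so teams share one implementation per format.
PIPELINES = {}

def register(data_type):
    """Decorator that records an ingestion function for a data type."""
    def wrap(fn):
        PIPELINES[data_type] = fn
        return fn
    return wrap

@register("syslog")
def ingest_syslog(record):
    # Placeholder transform; a real routine would parse and normalize.
    return {"source": "syslog", "payload": record.strip()}

def ingest(data_type, record):
    """Dispatch a record to the shared pipeline for its data type."""
    if data_type not in PIPELINES:
        raise ValueError(f"no registered pipeline for {data_type!r}")
    return PIPELINES[data_type](record)

print(ingest("syslog", "  <34>Oct 11 22:14:15 fw01 test \n"))
```

Because the registry is the single point of definition, governance rules (masking, retention tags, schema checks) can be enforced there once rather than re-implemented in every local effort.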
Expect and Architect for Change
It is important that your cybersecurity solution can adapt to the continual change that marks modern IT. As the security and big data markets converge, the number of possible combinations used to solve the cybersecurity problem grows geometrically. It is safe to assume that your chosen solution of today will evolve and change. Products become obsolete and new techniques enter the market faster than ever before.
For this reason, future-proofing your solution is a great approach. The best way to do this is to leverage technology that is only loosely coupled to the underlying data sources and stores. This simplifies upgrading and changing your infrastructure, since you don’t have to rewrite your data movement logic from scratch. By insulating yourself from architectural change, your data ingestion can adapt as your cybersecurity solution evolves. It saves you time and money and keeps you at the forefront of technology.
One area ripe with innovation within the cybersecurity space is the use of machine learning. To enable machine learning, the idea of a common data model has become popular. It enables standardization of the format and structure of various incoming data streams in order to facilitate automated analysis.
One example of a common data model is the open source Apache Spot project, where many big data and security companies are coming together to find new and innovative ways to standardize data in an attempt to simplify and enhance how they combat threats.
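The essence of a common data model is a normalization step that maps each producer's fields onto one shared schema so downstream analytics see uniform records. The sketch below uses invented source formats and field names for illustration; it is not the actual Apache Spot open data model.

```python
def to_common_model(source, event):
    """Map events from heterogeneous sources into one shared schema.
    Source formats and target field names here are hypothetical."""
    if source == "netflow":
        return {"src_ip": event["sa"], "dst_ip": event["da"],
                "bytes": event["ibyt"], "event_type": "flow"}
    if source == "proxy":
        return {"src_ip": event["client"], "dst_ip": event["server"],
                "bytes": event["sc_bytes"], "event_type": "proxy"}
    raise ValueError(f"unknown source: {source!r}")

flow = to_common_model("netflow",
                       {"sa": "10.0.0.5", "da": "8.8.8.8", "ibyt": 4096})
web = to_common_model("proxy",
                      {"client": "10.0.0.5", "server": "1.2.3.4",
                       "sc_bytes": 900})
print(flow["src_ip"], web["src_ip"])  # → 10.0.0.5 10.0.0.5
```

Once both feeds share `src_ip`, `dst_ip` and `bytes`, a single anomaly model or correlation query can run across them without per-source logic.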
Cybersecurity Is an Operation, Not a Project
Finally, building the data ingestion for your cybersecurity effort is only half the battle. You also must ensure continuous operations. Adding a layer of operational oversight and health checks on the data flows that supply your data lake for analysis will help ensure you can identify that next threat and act accordingly.
In particular, wherever possible, add mechanisms for SLAs against data delivery (throughput) and data accuracy (data quality). Doing so ensures that you are always delivering data to key stakeholders and critical processes in an immediately usable format. By adding measures for quality and timely data delivery, you can ensure the investments you make to augment your cybersecurity architecture are delivering value at all times.
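A throughput and quality gate of the kind described above can be as simple as the following sketch. The thresholds, window and required fields are illustrative assumptions; a real deployment would alert or halt downstream jobs when a check fails.

```python
def check_sla(records, window_seconds, min_per_minute=100,
              required=("host", "timestamp")):
    """Hypothetical SLA check over one ingest window: verify delivery
    rate (throughput) and that required fields are populated (a crude
    data-quality measure)."""
    per_minute = len(records) / (window_seconds / 60)
    complete = sum(1 for r in records
                   if all(r.get(f) for f in required))
    quality = complete / len(records) if records else 0.0
    return {
        "throughput_ok": per_minute >= min_per_minute,
        "records_per_minute": per_minute,
        "quality_ok": quality >= 0.99,
        "quality": quality,
    }

# 200 well-formed records arriving over a 60-second window.
batch = [{"host": "fw01", "timestamp": 1538000000}] * 200
print(check_sla(batch, window_seconds=60))
```

Running a check like this on every window, and recording the results, gives stakeholders an auditable record that the cybersecurity data feeds are both timely and usable.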
Cybersecurity is war, and wars are often won or lost based on the quality of your intelligence. Big data ecosystems hold out the potential to unlock new sources of intelligence stored in network flows, server logs and endpoint systems, but only if you can simplify and operationalize the delivery of timely and reliable data. To this end, enterprises would do best with a system that includes a data operations sensibility and tools to address the never-ending build-execute-operate cycle that is today’s reality. By relying on this approach, enterprises can more effectively diagnose cybersecurity attacks and prevent future threats.
About the author: Clarke Patterson is head of product marketing for StreamSets, where he is responsible for product messaging, market intelligence, and evangelism. Clarke brings more than 20 years of big data and data management experience to StreamSets and previously held similar positions at Confluent, Cloudera, and Informatica.