Sqrrl Blog

Jun 16, 2016 4:47:34 PM

An Introduction to Machine Learning for Cybersecurity and Threat Hunting

At BSides Boston 2016, Sqrrl’s Lead Security Technologist, David Bianco, and Director of Data Science, Chris McCubbin, gave a presentation on the importance of machine learning in the field of Cyber Threat Hunting. In this interview, we talk with them about how machine learning relates to tools like UEBA and where they see it taking the world of cybersecurity in the future. When used effectively, machine learning provides more accurate, effective insight into threats of all kinds, and David and Chris predict it will soon become a major influence on organizations’ Security Operations Center workflows. In addition to their presentation, they also provide code for anyone interested in taking a hands-on approach to machine learning.

What is machine learning?

Chris: Very basically, machine learning is the capability of a deployed algorithm to adapt to the data that’s being input into it. A normal algorithm, for example, will run on a particular set of data and give you a result, and if you run it on the same set of data again, it will give you the same result. Machine learning has an adaptive component where if you run it on a piece of data it will do something and then change its behavior based on that data. So, even if you ran it on the same data twice, it might give you a different result because it’s adapting. That’s a very broad definition.

Nowadays, machine learning is broken down into large subfields. Out of these, two of the most important are supervised and unsupervised learning. Supervised machine learning is trying to create a model from data, where the data is labeled. For example, log records may be labeled malicious or benign. You want to build a model based on that data such that you can take unlabeled data later, run it through your model, and be able to predict the label.
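As an illustration of that supervised workflow, here is a minimal sketch using scikit-learn (not the libraries from the talk); the features and labels are invented for the example, not real log data.

```python
# A minimal supervised-learning sketch: train on labeled examples,
# then predict labels for new, unlabeled examples.
# The features and labels here are made up for illustration.
from sklearn.ensemble import RandomForestClassifier

# Each row is a (hypothetical) log record reduced to numeric features,
# e.g. [request length, number of parameters, bytes returned].
X_train = [
    [120, 2, 5000],
    [95, 1, 4200],
    [820, 14, 150],
    [760, 11, 90],
]
y_train = ["benign", "benign", "malicious", "malicious"]  # known labels

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)               # learn a model from the labeled data

X_new = [[105, 2, 4800], [790, 12, 120]]  # unlabeled records seen later
print(model.predict(X_new))               # predicted labels for the new records
```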

Supervised_vs_Unsupervised_ML.png

In supervised machine learning (on the right), known data is labeled into two classes (orange and blue), and unknown data (gray) is run through the model so its label can be predicted. In unsupervised machine learning (on the left), unlabeled data (gray) is grouped into classes that the model determines.

In unsupervised learning you have a bunch of unlabeled data and you run that through a program, and the program itself comes up with what it thinks the classes are. This is useful for tasks like anomaly detection, where you can run an unsupervised algorithm on a group of normal behavior and tell it “learn the boundaries of the normal set of data,” and anything outside those bounds could potentially be malicious.
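A minimal sketch of that idea, assuming scikit-learn’s OneClassSVM as the boundary learner; the “normal” data here is synthetic rather than drawn from real behavior.

```python
# A minimal unsupervised sketch: fit a model on normal behavior only,
# then flag anything that falls outside the learned boundary.
# The numbers are synthetic; in practice these would be features
# extracted from your own logs.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(loc=[50, 10], scale=[5, 2], size=(500, 2))  # baseline behavior

model = OneClassSVM(nu=0.01, gamma="scale")
model.fit(normal)                      # learn the boundary of "normal"

new_points = np.array([[52, 11],       # looks like the baseline
                       [120, 45]])     # far outside it
print(model.predict(new_points))       # +1 = inside the boundary, -1 = potential anomaly
```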

So why is machine learning important to Threat Hunting?

David: If you go back in time, say, 10 years ago, one of the oft-repeated mantras was “the best practice is to review all your logs every day.” You were expected to go through everything and say “hey, this log is kind of suspicious,” but that was never really a scalable approach. It’s certainly not realistic now that we’re generating terabytes of data every day; you just can’t do it.

On the other hand, humans really are very good at finding patterns and noticing odd things. Computers are really good at doing repetitive work and working on a large scale, so one of the big roles for machine learning is to complement what an analyst can do. This allows you to apply your human smarts and pattern-recognition abilities on a big data scale.

For example, in our BSides Boston talk, we presented a tool that can comb through a giant HTTP log file and find the things that it thinks you really need to look at. In our case, greater than 99% of logs were eliminated as unnecessary. We didn’t need to look at them, and it allowed us to get back to that point of having a human look at the logs manually without actually having to analyze every single thing, because that’s pretty much impossible.
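This is not the released tool, just a rough sketch of the same triage idea: score every record with a trained classifier and keep only the small fraction the model flags for human review. The column names and the feature extraction here are hypothetical.

```python
# Sketch of classifier-based log triage (illustrative only, not the BSides tool).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def extract_features(df):
    # Hypothetical numeric features derived from an HTTP log.
    return df[["uri_length", "param_count", "bytes_out"]]

# `labeled` stands in for past records an analyst has already judged;
# `today` stands in for the giant unlabeled log you want to triage.
labeled = pd.DataFrame({
    "uri_length": [30, 42, 310, 295],
    "param_count": [1, 2, 12, 9],
    "bytes_out": [5100, 4800, 120, 95],
    "label": ["benign", "benign", "suspicious", "suspicious"],
})
today = pd.DataFrame({
    "uri_length": [35, 28, 305],
    "param_count": [1, 1, 11],
    "bytes_out": [5000, 5300, 110],
})

model = RandomForestClassifier(random_state=0)
model.fit(extract_features(labeled), labeled["label"])

today["predicted"] = model.predict(extract_features(today))
worth_a_look = today[today["predicted"] == "suspicious"]
print(f"kept {len(worth_a_look)} of {len(today)} records for review")
```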

What are the machine learning-related techniques that hunters should be familiar with and are most useful for finding stuff that’s relevant?

David: The most useful thing that security analysts could take advantage of in a lot of different scenarios is classification, which is what our tool does. There are a lot of places where you’ll want to cull out things you don’t care about, or classify them into multiple categories, like malware families. Getting a good working knowledge of how to use these pre-built classification libraries is pretty easy to do and also very useful in a wide variety of situations.

Chris: The central theme of our talk is that as a security analyst, you don’t have to understand how these things work. You just have to understand each technique’s strengths and weaknesses and how to use them. For example, if you’re going to use linear regression, you’re going to have a hard time classifying non-linear things. In that case you’d want to use a different technique. It’s important to understand what your problem space looks like, what you’re looking for, and what techniques will help. You then have those techniques in your back pocket.
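A quick sketch of that point, using scikit-learn’s make_moons as a stand-in for a non-linearly separable problem and logistic regression as the linear model: the linear technique struggles where a non-linear one does well. The data is synthetic and only meant to illustrate the trade-off.

```python
# Knowing a technique's limits: a linear model vs. a non-linear model
# on data whose classes are not linearly separable. Synthetic data only.
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=1000, noise=0.2, random_state=0)  # non-linear classes
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear = LogisticRegression().fit(X_train, y_train)
nonlinear = RandomForestClassifier(random_state=0).fit(X_train, y_train)

print("linear model accuracy:    ", linear.score(X_test, y_test))     # noticeably lower
print("non-linear model accuracy:", nonlinear.score(X_test, y_test))  # much closer to 1.0
```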

outlier.png

A machine learning-based anomaly detection analytic run on test data. The red dot shows where the analytic detected a clear deviation from the baseline. Analytics like this one can be used to detect malicious activity on particular nodes in a network, or in the network as a whole, across various kinds of data sets.
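This is not the analytic shown in the figure; it is a much simpler statistical stand-in for the same pattern, modeling a rolling baseline and flagging sharp deviations from it. The traffic values below are synthetic, and the window size and threshold are arbitrary tuning choices.

```python
# Sketch of baseline-deviation detection with a rolling window (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
traffic = rng.normal(loc=200, scale=15, size=96)   # e.g. bytes per interval for one node
traffic[80] = 520                                  # inject one clear deviation

window = 24
for i in range(window, len(traffic)):
    baseline = traffic[i - window:i]
    z = (traffic[i] - baseline.mean()) / baseline.std()
    if abs(z) > 4:                                 # threshold is a tuning choice
        print(f"interval {i}: value {traffic[i]:.0f} deviates from baseline (z={z:.1f})")
```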

How do you combat adversaries nowadays who are trying to look like normal endpoint or network traffic?

David: That’s what adversaries do anyway. Few adversaries in the wild know to actively evade machine learning techniques, so while that is a concern in theory, it’s not something we have to worry about right now in practice. It’s more on the security product vendors to worry about how to combat that issue.

Chris: Adversaries may not be actively trying to defeat a machine learning technique, but they do try to emulate normal traffic, just on general principle.

David: They do try to sneak past your detection and your prevention, but the level at which they do that is not always very sophisticated. For example, there is a ton of HTTP-based malware out there. They use HTTP because they know it can get through proxies and firewalls. So that’s a level of trying to use a normal protocol, but a lot of those are not concerned with how they look if you actually inspect the HTTP transactions.

Chris: Another issue is that adversaries will change their techniques over time. Machine learning has advantages, but it also has some things that need to be taken into consideration. In terms of hiding in normal traffic, ML has a long history of attempting to tease out what deviates from normal behavior. That’s kind of the whole point. If things were easy to separate, nobody would care about machine learning. If the attacker is doing anything that distinguishes their traffic from normal traffic, some machine learning technique will probably find it. On the flip side, if they aren’t doing anything that differentiates their traffic from normal traffic, no algorithm will find it. It’s a double-edged thing. As the problem changes, old models will need to be refreshed. What you have to do is actively re-train your models over time as you get new information.
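One way to picture that re-training loop, with synthetic data standing in for newly labeled events: evaluate the current model on a fresh batch and refit when its accuracy drifts. The make_batch helper and the 0.90 threshold are purely illustrative assumptions.

```python
# Sketch of periodic model refresh as the underlying behavior changes.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_batch(shift):
    """Synthetic labeled batch; `shift` stands in for attacker behavior changing."""
    X = rng.normal(size=(300, 3))
    y = (X[:, 0] + shift * X[:, 1] > 0).astype(int)
    return X, y

X_old, y_old = make_batch(shift=0.0)
model = RandomForestClassifier(random_state=0).fit(X_old, y_old)

X_new, y_new = make_batch(shift=2.0)            # the "problem" has changed
score = accuracy_score(y_new, model.predict(X_new))
print(f"accuracy on new data: {score:.2f}")

if score < 0.90:                                # drift detected: refresh the model
    model = RandomForestClassifier(random_state=0).fit(X_new, y_new)
    print("model retrained on the newer labeled batch")
```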

How has machine learning changed in recent years, like from the advent of the IDS to now, and where do you see it going in the future? Can we make predictions as to what the applications are going to be in 5 years?

Chris: There have been a few big changes in the past five years. One is what’s motivating our talk: the ubiquity of easy-to-use, well-maintained, frequently updated libraries, which has been a key to the growth of a lot of these actual applications. The other thing is an increase in the general understanding of algorithms that have been around for a while.

Take deep learning as an example. There were problems in speech and text recognition and natural language processing that had been stagnant for a very long time, around 15 years. Then people applied new spins on neural network machine learning techniques, and there was a huge increase in accuracy and in the potential use of these things in other applications. This reawakened a focus on neural networks and machine learning in general. Things like Siri and speech recognition on a phone are being used by everybody now because it’s feasible. You don’t need a supercomputer anymore.

In terms of the future, we have barely scratched the surface of these things. People love natural language processing and image recognition, but as more people realize these things are truly powerful, they’ll continue to be applied more and more. I don't know if there’s going to be another breakthrough in algorithm technology but we still are exploring the applications of these last couple of breakthroughs.

David: If you look at a CERT or a SOC, the current state has, for a few years, been that any machine learning or adaptive algorithmic capability is almost always delivered through a vendor tool. Yet a surprising amount of the actual workflow in the SOC comes down to an analyst writing code that they run on their workstation to do something with the data. Sometimes it’s a one-time tool, and if they do similar tasks over time they might save it and adapt it each time. In the past it’s been pretty signature- or rule-based (e.g., if you see something that’s on the intel watchlist, trigger an alert). The problem is that these approaches are static and inflexible and give a lot of false positives, causing a lot of work for the analysts as well.

We’re now at the point where it’s feasible for all SOC analysts to bust out a little bit of code to do some basic machine learning tasks without really having a deep understanding of how the algorithms work. Individual security analysts can add machine learning to their everyday repertoire, and that has real potential to change the workflows in the SOC.

How can someone who’s new to the field of machine learning learn more about it?

David: I’ve gone through this myself in the past couple of years. There is so much out there right now that if you have any interest in machine learning or data science topics, you can buy any number of good books that will give you overviews and get you started. There are online courses and tons of blogs that cover a lot of this stuff. I would say the best thing to do is to just get started and try some stuff. Honestly, for basic machine learning, a good start is to take a look at our presentation. It was designed for this particular purpose: so that people who have a little scripting experience but are not machine learning experts or data scientists can get a repeatable methodology for applying machine learning to whatever kind of classification problem they have. This presentation is supposed to be a shortcut to machine learning for a typical analyst.

Chris: Also check out the code we released. It’s open sourced, and we used Python and a minimal number of libraries deliberately. For the core part it’s very simple. The key is to get your hands on the keyboard, get some data, run it through a program - either ours or a program you make yourself - and just get started with it.

There are a lot of security solutions on the market that claim to use machine learning. How can organizations looking for machine learning-based security solutions know that what they see is an effective implementation of machine learning?

David: It’s about transparency. Do the claims seem too good to be true? There are certain things that machine learning is good at, but it’s still not a silver bullet. When you’re evaluating a product, you should be able to ask the vendor: “How do you use machine learning? What are you trying to accomplish with it? What’s the process? Where does your data come from? Supervised or unsupervised? If supervised, how are you generating your data and labeling it appropriately? If unsupervised, how are you using it, and in what sense?” They should be able to give you reasonable answers to all those questions. If not, be suspicious!

Editor’s note: For the answer to these questions and more info on Sqrrl’s approach to machine learning, check out our UEBA eBook below:

Download the UEBA eBook

Topics: Threat Hunting, Threat Detection, Cyber Threat Hunting, Machine Learning, UEBA