October 13, 2017

Machine Learning and AI – What’s the Scoop for Security Monitoring?

Category: Security Information Management,Security Intelligence — Raffael Marty @ 2:22 pm

The other day I presented a Webinar on Big Data and SIEM for IANS research. One of the topics I briefly touched upon was machine learning and artificial intelligence, which resulted in a couple of questions after the Webinar was over. I wanted to pass along my answers here:

Q: Hi, one of the biggest challenges we have is that we have all the data and logs as part of SIEM, but how do we review it effectively and in a timely manner, distinguishing ‘information’ from ‘noise’? Is Artificial Intelligence (AI) the answer for it?

A: AI is an overloaded term. When people talk about AI, they really mean machine learning. Let’s therefore have a look at machine learning (ML). For ML you need sample data, specifically labeled data, which means that you need a data set where you have already classified things into ‘information’ and ‘noise’. From that, machine learning will learn the characteristics of the ‘noisy’ stuff. The problem is getting a good, labeled data set, which is almost impossible. Given that, what else could help? Well, we need a way to characterize or capture the knowledge of experts. That is quite hard and many companies have tried. There is a company, “Respond Software”, which developed a method to run domain experts through a set of scenarios that they have to ‘rate’. Based on that input, they then build a statistical model which distinguishes ‘information’ from ‘noise’. Coming back to the original question, there are methods and algorithms out there, but the thing to look for is a system that captures expert knowledge in a ‘scalable’ way; in a way that generalizes knowledge and doesn’t require constant re-learning.
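To make the dependence on labeled data concrete, here is a minimal sketch of a supervised classifier in Python. It is a toy naive Bayes model, not Respond Software's method, and the event tokens, the labels, and the function names `train` and `classify` are all made up for illustration. The point is only that nothing can be learned until someone has already sorted examples into ‘information’ and ‘noise’.

```python
import math
from collections import Counter, defaultdict


def train(labeled_events):
    """Count token frequencies per label from (tokens, label) pairs."""
    token_counts = defaultdict(Counter)
    label_counts = Counter()
    for tokens, label in labeled_events:
        label_counts[label] += 1
        token_counts[label].update(tokens)
    return token_counts, label_counts


def classify(tokens, model):
    """Return the most likely label (naive Bayes with add-one smoothing)."""
    token_counts, label_counts = model
    vocab = {t for counts in token_counts.values() for t in counts}
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # prior for this label
        denom = sum(token_counts[label].values()) + len(vocab)
        for t in tokens:
            score += math.log((token_counts[label][t] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical, hand-labeled examples: this is the part that is hard to get.
labeled = [
    (["failed", "login", "admin"], "information"),
    (["failed", "login", "admin", "foreign_ip"], "information"),
    (["dhcp", "lease", "renewed"], "noise"),
    (["ntp", "sync", "ok"], "noise"),
]
model = train(labeled)
print(classify(["failed", "login"], model))
print(classify(["dhcp", "lease"], model))
```

With four labeled events the model is obviously useless in practice. A real deployment would need many thousands of consistently labeled events, which is exactly the bottleneck described above.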

Q: Can SIEMs create and maintain baselines using historical logs to help detect statistical anomalies against the baseline?

A: The hardest part about anomalies is that you have to first define what ‘normal’ is. A SIEM can help build up a statistical baseline. In ArcSight, for example, that’s called a moving average data monitor. However, a statistical outlier is not always a security problem. Just because I suddenly download a large file doesn’t mean I am compromised. The question then becomes, how do you separate my ‘download’ from a malicious download? You could show all statistical outliers to an analyst, but that’s a lot of false positives they’d have to deal with. If you can find a way to combine additional signals with those statistical indicators, that could be a good approach. Or combine multiple statistical signals. Be prepared for a decent amount of care and feeding of such a system though! These indicators change over time.
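As a rough illustration of the idea, here is a minimal Python sketch of a moving-average baseline that flags outliers and only alerts when another signal agrees. This is not how ArcSight implements its data monitor. The class name `MovingBaseline`, the window size, the threshold of three standard deviations, and the `should_alert` helper are all assumptions made for the example.

```python
from collections import deque
from statistics import mean, pstdev


class MovingBaseline:
    """Rolling mean and standard deviation over the last `window` values."""

    def __init__(self, window=24, threshold=3.0, min_history=5):
        self.values = deque(maxlen=window)
        self.threshold = threshold
        self.min_history = min_history

    def is_outlier(self, value):
        """Compare value to the current baseline, then add it to the window."""
        outlier = False
        if len(self.values) >= self.min_history:
            mu = mean(self.values)
            sigma = pstdev(self.values)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                outlier = True
        self.values.append(value)
        return outlier


def should_alert(statistical_outlier, other_signals):
    """Alert only if the outlier coincides with at least one other signal."""
    return statistical_outlier and any(other_signals)


# Hypothetical hourly megabytes downloaded by one user.
baseline = MovingBaseline()
for mb in [100, 110, 95, 105, 98, 102, 107, 99]:
    baseline.is_outlier(mb)

spike = baseline.is_outlier(5000)  # a sudden large download
print(spike)
# The outlier alone is not an alert; pair it with, say, a threat-intel hit.
print(should_alert(spike, [False]))
print(should_alert(spike, [True]))
```

Note that the sketch adds every value to the window, outliers included, so a sustained change slowly becomes the new ‘normal’. Whether that is desirable is one of the tuning decisions that makes these systems need ongoing attention.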

Q: Have you seen any successful applications of Deep Learning in UEBA/Hunting?

A: I have not. Deep learning is just a modern machine learning algorithm that suffers from most of the problems that machine learning in general suffers from. To start with, you need large amounts of training data. Deep learning, just like any other machine learning algorithm, also suffers from a lack of explainability, meaning that the algorithm might classify something as bad, but you will never know why it did that. If you can’t explain a detection, how do you verify it? Or how do you make sure it’s a true positive?
Hunting requires people. Focus on enabling hunters. Focus on tools that automate as much as possible in the hunting process and that give hunters as much context as possible, fast data access, fast analytics, etc. You are trying to make the hunters’ jobs easier. This is easier said than done. Such tools don’t really exist out of the box. To get a start though, don’t boil the ocean. You don’t even need a fully staffed hunting team. Have each analyst spend an afternoon a week on hunting. Let them explore your environment. Let them dig into the logs and events. Let them follow up on hunches they have. You will find a ton of misconfigurations in the beginning and the analysts will come up with many more questions than answers, but you will find that through all the exploratory work, you get smarter about your infrastructure. You get better at documenting processes and findings, the analysts will probably automate a bunch of things, and not to forget: this is fun. Your analysts will come to work re-energized and excited about what they do.

Q: What are some of the best tools used for tying the endpoint products into SIEMs?

A: On Windows I can recommend using Sysmon as a data source. On Linux it’s a bit harder, but there are tools that can hook into the audit capability, and in newer kernels eBPF is a great facility to tap into.
If you have an existing endpoint product, you have to work with the vendor to make sure they have some kind of a central console that manages all the endpoints. You want to integrate with that central console to forward the event data from there to your SIEM. You do not want to get into the game of gathering endpoint data directly. The amount of work required can be quite significant. How, for example, do you make sure that you are getting data from all endpoints? What if an endpoint goes offline? How do you track that?
When you are integrating the data, it also matters how you correlate the data to your network data and what correlations you set up around your endpoint data. Work with your endpoint teams to brainstorm around use-cases and leverage a ‘hunting’ approach to explore the data to learn the baseline and then set up triggers from there.
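To show why tracking endpoint coverage is real work, here is a small Python sketch of the bookkeeping involved: compare the inventory of endpoints you expect against the last time each one was seen in the event stream. The function name `find_silent_endpoints`, the host names, and the 24-hour silence limit are hypothetical. A vendor's central console would normally do this for you, which is part of the argument for integrating with it.

```python
from datetime import datetime, timedelta


def find_silent_endpoints(expected, last_seen, now,
                          max_silence=timedelta(hours=24)):
    """Return endpoints that never reported or have been quiet too long."""
    silent = []
    for host in sorted(expected):
        seen = last_seen.get(host)
        if seen is None or now - seen > max_silence:
            silent.append(host)
    return silent


# Hypothetical inventory and last-event timestamps.
now = datetime(2017, 10, 13, 14, 0)
expected = {"ws-001", "ws-002", "ws-003"}
last_seen = {
    "ws-001": now - timedelta(hours=1),  # healthy
    "ws-002": now - timedelta(days=3),   # went quiet
    # ws-003 has never sent an event
}
print(find_silent_endpoints(expected, last_seen, now))
```

Even this simple check depends on an accurate inventory of expected endpoints, and it cannot tell a laptop that is switched off for the weekend from one whose agent was disabled.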

Update: Check out my blog post on Unsupervised machine learning as a follow up to this post.


  1. In addition to Respond Software, check out DarkLight Cyber. They are looking at driving from Semantic Web information, not just event streams.

    For Deep learning, check out Graphistry. It is more of a visualization engine on top of existing MapReduce (i.e., Hadoop), Cassandra, and MLlib (e.g., Spark) capabilities including Hortonworks Cybersecurity Suite (Apache Metron), MapR (Sqrrl sits on top of MapR), and Cloudera (Apache Spot goes on top of Cloudera).

    Endpoint and SIEM is also something I spend a lot of time on. Tanium Connect is one well-known method, but there are many others. My favorite is Invoke-IR/ACE connected to HELK (Hunting in ELK) because both take everything to the next level. eBPF is interesting, but I was thinking that you were going to say Sysdig Falco. There are plenty of others, though: wazuh, LimaCharlie, auditd (which are free and open source) as well as commercial solutions such as Forcepoint Threat Protection for Linux. Intigua is great to command agents and solve some of those hard-to-solve issues you spoke of.

    There are some activities that are nearly always malicious, such as beaconing and fumbling (and even at scale, such as hundreds of thousands of hosts, it can be easy to eliminate false positives because they can be sorted or stacked more easily than in other scenarios). Machine Learning, especially when combined with MapReduce (or something like MapReduce), definitely helps in these areas. Can you think of more?

    Comment by Andre Gironda — October 13, 2017 @ 3:21 pm

  2. Thanks Andre! Good stuff. Let me comment on a few things:

    1. Respond is very different from DarkLight Cyber. Has nothing to do with semantic web information. They create a belief system at Respond. Check it out.
    2. Graphistry does not do any DeepLearning. I know the guys over there. Love their visualization engine! Absolutely top notch. (*hi Leo*)
    3. Need to check out HELK and what it really adds on top of ELK.
    4. Is there a way to run sysdig falco or wazuh to log network connections (in and out) with: process, networking 5-tuple, number of bytes transmitted? That’s the problem with auditd, you only get the system calls logged and they don’t show the process with the IP addresses. You have to manually stitch together file descriptors and that gets expensive as you have to log much more.
    5. Apache Spot has some machine learning models. But in the end, those things are all not that exciting. Building precise profiles for users and devices is where it’s at. But that’s cumbersome, needs good data, needs a lot of data, needs a lot of training, and needs expertise. But I’d love to hear other people’s experiences.

    Comment by Raffael Marty — October 13, 2017 @ 4:04 pm

  3. On Windows, you can just use SRUM — https://www.sans.org/summit-archives/file/summit-archive-1492184583.pdf — to get per-app bytes/packets with src/dst/port/prot info. Would Wazuh help with that from a security-monitoring perspective? I think it does, better than 95 percent or more of other tools, especially commercial ones. If you don’t like SRUM (or don’t have Win8.1 or higher), then you can use ETW. ETW goes back to Windows 2000.

    For Linux, you can use Falco or Wazuh to perform the security-monitoring activities, but sysdig out of the box supports mapping anything to anything. Yes, you can match your 5-tuple sets with byte/packet/flow counts to processes, vice versa, and many other possibilities, all without strain on the system. There are other ways, such as newer ones (e.g., you described eBPF but there is also Systemtap, kprobes, et al) and older ones (i.e., ntop nprobes has a process plugin that does exactly your ask). I prefer sysdig, however. It’s just easy.

    It’s funny because you don’t need premier cloud services (i.e., Google Cloud Platform with Stackdriver monitoring) or top-rate agents (i.e., AppDynamics) to get access to these intrinsics. DigitalOcean even shows you how to do basic stuff you would otherwise pay tons of money for in AWS — https://www.digitalocean.com/community/tutorials/how-to-audit-network-traffic-in-a-lamp-server-with-sysdig-on-centos-7

    If beaconing and fumbling are not that exciting, then what is? I think you might be able to answer your desired questions in a day or two if you control the sources for a network. For example, if you can command an org to move or create new destinations for the NetFlow/ipfix/sFlow/AppFlow exports, send/clone pcap repos, and export Bro, snort, and DNS/pDNS sensor data into AWS, then you can get up and running with HCP (Hortonworks Cybersecurity Suite) on HDC (Hortonworks Data Cloud) in one or two maintenance windows, instead of spending all of the budget on a UEBA solution. I agree that training, expertise, and tuning/customization are key. I’m not sure that I agree that you need a lot of data, but I do agree that it helps. Great insights.

    Comment by Andre Gironda — October 14, 2017 @ 5:26 pm
