{"id":1014,"date":"2017-10-13T14:22:42","date_gmt":"2017-10-13T22:22:42","guid":{"rendered":"http:\/\/raffy.ch\/blog\/?p=1014"},"modified":"2018-01-14T13:53:18","modified_gmt":"2018-01-14T21:53:18","slug":"machine-learning-and-ai-whats-the-scoop-for-security-monitoring","status":"publish","type":"post","link":"https:\/\/raffy.ch\/blog\/2017\/10\/13\/machine-learning-and-ai-whats-the-scoop-for-security-monitoring\/","title":{"rendered":"Machine Learning and AI &#8211; What&#8217;s the Scoop for Security Monitoring?"},"content":{"rendered":"<p>The other day I presented a <a href=\"https:\/\/t.co\/ZR6tkfx42K\">Webinar on Big Data and SIEM<\/a> for IANS research. One of the topics I briefly touched upon was <strong><em>machine learning<\/em><\/strong> and <em><strong>artificial intelligence<\/strong><\/em>, which resulted in a couple of questions after the Webinar was over. I wanted to pass along my answers here:<\/p>\n<p><strong>Q<\/strong>: Hi, one of the biggest challenges we have is that we have all the data and logs as part of SIEM, but how do we review it effectively and in a timely manner &#8211; distinguishing &#8216;information&#8217; from &#8216;noise&#8217;? Is Artificial Intelligence (AI) the answer for it?<\/p>\n<p><strong>A<\/strong>: AI is an overloaded term. When people talk about AI, they usually mean machine learning. Let\u2019s therefore have a look at machine learning (ML). For ML you need sample data, specifically labeled data, which means that you need a data set where you have already classified things into \u201cinformation\u201d and \u201cnoise\u201d. From that, machine learning will learn the characteristics of \u2018noisy\u2019 stuff. The problem is getting a good, labeled data set, which is almost impossible. Given that, what else could help? Well, we need a way to characterize or capture the knowledge of experts. That is quite hard and many companies have tried. 
There is a company, \u201c<a href=\"http:\/\/www.respond-software.com\/\">Respond Software<\/a>\u201d, which developed a method to run domain experts through a set of scenarios that they have to &#8216;rate&#8217;. Based on that input, they then build a statistical model which distinguishes \u2018information\u2019 from \u2018noise\u2019. Coming back to the original question, there are methods and algorithms out there, but the things to look for are systems that <strong><em>capture expert knowledge<\/em><\/strong> in a \u2018scalable\u2019 way; in a way that generalizes knowledge and doesn\u2019t require constant re-learning.<\/p>\n<p><strong>Q<\/strong>: Can SIEMs create and maintain baselines using historical logs to help detect statistical anomalies against the baseline?<\/p>\n<p><strong>A<\/strong>: The hardest part about anomalies is that you have to first define what \u2018normal\u2019 is. A SIEM can help build up a statistical baseline. In ArcSight, for example, that\u2019s called a moving average data monitor. However, a statistical outlier is not always a security problem. Just because I suddenly download a large file doesn\u2019t mean I am compromised. The question then becomes, how do you separate my \u2018download\u2019 from a malicious download? You could show all statistical outliers to an analyst, but that\u2019s a lot of false positives they&#8217;d have to deal with. If you can find a way to combine additional signals with those statistical indicators, that could be a good approach. Or combine multiple statistical signals. Be prepared for a decent amount of care and feeding of such a system though! These indicators change over time.<\/p>\n<p><strong>Q<\/strong>: Have you seen any successful applications of Deep Learning in UEBA\/Hunting?<\/p>\n<p><strong>A<\/strong>: I have not. <em><strong>Deep learning<\/strong><\/em> is just a modern machine learning algorithm that suffers from most of the same problems that machine learning in general suffers from. 
To start with, you need large amounts of training data. Deep learning, just like any other machine learning algorithm, also suffers from a lack of explainability. Meaning that the algorithm might classify something as bad, but you will never know why it did that. If you can\u2019t explain a detection, how do you verify it? Or how do you make sure it\u2019s a true positive?<br \/>\nHunting requires people. Focus on enabling hunters. Focus on tools that automate as much as possible in the hunting process. Give hunters as much context as possible, fast data access, fast analytics, etc. You are trying to make the hunters\u2019 jobs easier. This is easier said than done. Such tools don\u2019t really exist out of the box. To get started, though, don\u2019t boil the ocean. You don\u2019t even need a fully staffed hunting team. Have each analyst spend an afternoon a week on hunting. Let them explore your environment. Let them dig into the logs and events. Let them follow up on hunches they have. You will find a ton of misconfigurations in the beginning and the analysts will come up with many more questions than answers, but you will find that through all the exploratory work, you get smarter about your infrastructure. You get better at documenting processes and findings, the analysts will probably automate a bunch of things, and not to forget: this is fun. Your analysts will come to work re-energized and excited about what they do.<\/p>\n<p><strong>Q<\/strong>: What are some of the best tools used for tying the endpoint products into SIEMs?<\/p>\n<p><strong>A<\/strong>: On Windows I can recommend using Sysmon as a data source. 
On Linux it\u2019s a bit harder, but there are tools that can hook into the audit capability; in newer kernels, <a href=\"http:\/\/www.brendangregg.com\/ebpf.html\">eBPF<\/a> is a great facility to tap into.<br \/>\nIf you have an existing endpoint product, you have to work with the vendor to make sure they have some kind of a central console that manages all the endpoints. You want to integrate with that central console to forward the event data from there to your SIEM. You do not want to get into the game of gathering endpoint data directly. The amount of work required can be quite significant. How, for example, do you make sure that you are getting data from all endpoints? What if an endpoint goes offline? How do you track that?<br \/>\nWhen you are integrating the data, it also matters how you correlate the data to your network data and what correlations you set up around your endpoint data. Work with your endpoint teams to brainstorm around use cases and leverage a \u2018hunting\u2019 approach to explore the data to learn the baseline and then set up triggers from there.<\/p>\n<p>Update: Check out my blog post on <a href=\"http:\/\/raffy.ch\/blog\/2017\/10\/22\/unsupervised-machine-learning-in-cyber-security\/\">Unsupervised machine learning<\/a> as a follow-up to this post. <\/p>\n","protected":false},"excerpt":{"rendered":"<p>The other day I presented a Webinar on Big Data and SIEM for IANS research. One of the topics I briefly touched upon was machine learning and artificial intelligence, which resulted in a couple of questions after the Webinar was over. 
I wanted to pass along my answers here: Q: Hi, one of the biggest [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[8,35],"tags":[],"class_list":["post-1014","post","type-post","status-publish","format-standard","hentry","category-security-information-management","category-security-intelligence"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/raffy.ch\/blog\/wp-json\/wp\/v2\/posts\/1014","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/raffy.ch\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/raffy.ch\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/raffy.ch\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/raffy.ch\/blog\/wp-json\/wp\/v2\/comments?post=1014"}],"version-history":[{"count":8,"href":"https:\/\/raffy.ch\/blog\/wp-json\/wp\/v2\/posts\/1014\/revisions"}],"predecessor-version":[{"id":1101,"href":"https:\/\/raffy.ch\/blog\/wp-json\/wp\/v2\/posts\/1014\/revisions\/1101"}],"wp:attachment":[{"href":"https:\/\/raffy.ch\/blog\/wp-json\/wp\/v2\/media?parent=1014"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/raffy.ch\/blog\/wp-json\/wp\/v2\/categories?post=1014"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/raffy.ch\/blog\/wp-json\/wp\/v2\/tags?post=1014"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}