Once again, at BlackHat Las Vegas, I will be teaching the Visual Analytics for Security Workshop. This is the 5th year in a row that I’ll be teaching this class at BlackHat US. Overall, it’s the 29th! time that I’ll be teaching this workshop. Every year, I spend a significant amount of time to get the class updated with the latest trends and developments.
This year, you should be excited about the following new and updated topics:
– Machine learning in security – what is it really?
– What’s new in the world of big data?
– Hunting to improve your security insights
– The CISO dashboard – a way to visualize your security posture
– User and Entity Behavior Analytics (UEBA) – what it is, what it isn’t, and you use it most effectively
– 10 Challenges with SIEM and Big Data for security
Don’t miss the 5th anniversary of this workshop being taught at BlackHat US. Check out the details and sign up today: http://bit.ly/2kEXDEr
Threat intelligence (TI) feeds, the way we know them today, are going to go away. In the future it’s all about fast and sometimes anonymous sharing of IOCs with trusted (and maybe untrusted) peers. TI feeds will probably stay around, but will be used for contextul information around our data.
For the past couple of years threat intelligence feeds (IOCs) emerged as a theme (hype) to help with threat detection. It almost seems like people got too tired of writing SIEM correlation rules and were hoping that IOCs would help them getting their job done much easier. There are many reasons that this approach isn’t working and won’t work going forward:
Threat intelligence feeds are not complete. A number of people have been analyzing threat feeds and found that there is generally not much overlap in the different feeds (maybe 3-5%). What that also means is that each of the feed providers looks at a very different universe. And if that is true, what are we not seeing? If we saw more, we’d have much more overlap in the feeds.
TI feeds or indicator matching is not the same as authoring correlation rules. Correlation rules can have complex temporal and spatial logic. Use it!
IOCs are generally only valid for seconds or minutes. A machine gets infected, acts maliciously, and after a few minutes, the attacker already abandons it, just to replace it with a new one. What’s the delay within your TI feed? From the vendor actually putting the indicator on the feed to you being able to match it against your data?
So, what’s the future of TI then? Here are my three predictions:
We need to enable real-time sharing of indicators. If I see something bad, I need to be able to share it with my peers immediately. Yes, this is challenging. But start small with a trusted group of peers. In some circumstances you might be okay sharing entire PCAPs. In others, just share the source of the attacks. Use an exchange mechanism that can help keeping the data source anonymous. It’s about speed and freshness of the IOCs.
We can already see signs that threat intel providers are adding ‘contextual‘ information into their feeds. Things like what IP addresses are known TOR exit nodes, for example. Or what machines are part of a CDN. Useful? Probably some. What it really comes back to is asset context for your analytics and correlations. Knowing what a machine is can be incredibly useful. If you know that machine X is a DNS server and that only, why does it suddenly offer up Web traffic? So TI feeds will move more and more into contextual land.
The ISACs will keep operating and help enable the exchange of TI among industry peers. We need to expand that concept outside of the ISACs as well. You will see more and more of the security providers that have a large enough customer base look into cross-customer intelligence. *cough* *cough*.
Keeping all of this in mind and run a serious trial before you buy any TI feed.
information classification and location (tracking movement?)
HR / presonell information
configuration information for each machine, network device, and application
With all this information, what are the different jobs / tasks / themes that need to be covered from a data consumption perspective?
situational awareness / dashboards
metric generation from raw logs / mapping to some kind of risk
incident management / sharing information / collaboration
running models on the data (data science)
reports (PDF, HTML)
real-time matching against high volume threat feeds
remediating security problems
controls verification / audit
What would the right data store look like? What would its capabilities be?
storing any kind of data
kind of schema less but with schema on demand
storing event data (time-stamped data, logs)
fast random access of small amounts of data, aka search
looking for ‘patterns’ in the data – maybe something like a computer science grammar that defines patterns
building dynamic context from the data (e.g., who was on what machine at what time)
Looks like there would probably be different data stores: We’d need an index, probably a graph store for complex queries, a columnar store for the analytical questions, pre-aggregated data to answer some of the common queries, and the raw logs as well. Oh boy 😉
I am sure I am missing a bunch of things here. Care to comment?
Here are a couple of impressions from the conference:
Here I am showing some slides where I motivate why visualization is crucial for security analysts.
And a zoom in of the reason for why visualization is important. Note that emerging blue pattern towards the right of the scatter plot on the left. On the right you can see how context was used to augment the visualization to help identify outliers or interesting areas:
On the left here you see how visualization is used to find patterns and translate what you learn into algorithmic detections. On the right, I am showing a way to set thresholds on periodic data.
Hunting is about learning about, and understand your environment. It is used to build a ‘model’ of your network and applications that you then leverage to configure and tune your detection mechanisms.
I have been preaching about the need to explore our security data for the past 9 or 10 years. I just went back in my slides and I even posted a process diagram about how to integrate exploratory analysis into the threat detection life cycle (sorry for the horrible graph, but back then I had to use PowerPoint ):
Back then, people liked the visualizations they saw in my presentations, but nobody took the process seriously. They felt that the visualizations were pretty, but not that useful. Fast forward 7 or 8 years: Everyone is talking about hunting and how they need better tools an methods to explore their data. Suddenly those visualizations from back then are not just pretty, but people start seeing how they are actually really useful.
The core problem with threat detection is a broken process. Back in the glory days of SIEM, experts (or shall we say engineers) sat in a lab cooking up correlation rules and event prioritization formulas to help implement an event funnel:
This is so incredibly broken. [What do I know about this? I ran the team that did this for a big SIEM.] Not every environment is the same. There is no one formula that will help prioritize events. The correlation rules only scratch the surface, etc. etc. The fallout from this is that dozens of companies are frustrated with their SIEMs. They are looking for alternatives; throw them out and build their own. [Please don’t be one of them!]
Then threat intelligence came about. Instead of relying on signatures that identify common threats, let’s look for adversaries as well. Has that worked? Nope. There is a bunch of research indicating that the ‘external’ threat intelligence we have is not good.
What should we do? Well, we have to get away from thinking that a product will solve our problems. We have to roll-up our own sleeves and start creating our own ‘internal’ threat intelligence – through hunting.
The Hunting Process
Another word for hunting is exploring or learning. We need to understand our infrastructure, our applications to then find when there are things happening that are out of the norm.
Let’s look at the diagram above. What you see is a very simple data pipeline that we find everywhere. We ingest any kind of data. We then apply some kind of intelligence to the data to flag relevant events. In the SIEM case, the intelligence are mostly the correlation rules and the prioritization formula. [We have discussed this: pure voodoo.] Let’s have a look at the other pieces in the diagram:
Context: This is any additional information we can gather about our objects (machines, users, and applications). This information is crucial for good intelligence. The more surprising it is that the SIEMs haven’t put much attention to features around this.
IOCs: This is any kind of external threat intelligence. (see discussion above) We can use the IOCs to flag potentially bad activity.
Interactive Visualization: This is really the heart of the hunting process. This is the interface that a hunter uses to explore and learn about the data. Flexibility, speed, and an amazing user experience are key.
Internal TI: This is the internal threat intelligence that I mentioned above. This is basically anything we learn about our environment from the hunting process. The information is fed back into the overall system in the form of context and rules or patterns.
Models: The other kind of intelligence or knowledge we gather from our hunting activities we can use to optimize and define new models for behavioral analysis, scoring, and finding anomalies. I know, this is kind of vague, but we could spend an entire blog post on just this. Ask if you wanna know more.
To summarize, the hunting process really helps us learn about our environment. It teaches us what is “normal” and how to find “anomalies”. More practically, it informs the rules we write, the patterns we try to find, the behavioral models we build. Without going through this process, we won’t be able to configure / define a system that has a fighting chance of finding advanced adversaries or attackers in our network. Non. Zero. Zilch.
Some more context and elaboration on some of the hunting concepts you can find in my recent darkreading blog post about Threat Intelligence.
Hunting in your security data is the process of using exploratory methods to discover insights and hopefully finding attacks that have previously been concealed. Visualization greatly simplifies and makes the exploratory process more efficient.
In my previous post, I talked about SIEM use-cases. I outlined how I go about defining detection use-cases. It’s not a linear process and it’s not something that is the same for every company. That’s also a reason why I didn’t give you a list of use-cases to implement in your SIEM. There are guidelines, but in the end, you need to build a use-case repository unique to your organization. In this post we are going to explore a bit more what that means and how a ‘hunting‘ capability can help you with that.
There are three main approaches to implementing a use-case in a SIEM:
Rules: Some kind of deterministic set of conditions. For example: Find three consecutive failed logins between the same machines and using the same username.
Simple statistics: Leveraging simple statistical properties, such as standard deviations, means, medians, or correlation measures to characterize attacks or otherwise interesting behavior. A simple example would be to look at the volume of traffic between machines and finding instances where the volume deviates from the norm by more than two standard deviations.
Behavioral models: Often behavioral models are just slightly more complicated statistics. Scoring is often the bases of the models. The models often rely on classifications or regressions. On top of that you then define anomaly detectors that flag outliers. An example would be to look at the behavior of each user in your system. If their behavior changes, the model should flag that. Easier said than done.
I am not going into discussing how effective these above approaches are and what the issues are with them. Let’s just assume they are effective and easy to implement. [I can’t resist: On anomaly detection: Just answer me this: “What’s normal?” Oh, and don’t get lost in complicated data science. Experience shows that simple statistics mostly yield the best output.]
Let’s bring things back to visualization and see how that related to all of this. I like to split visualization into two areas:
Communication: This is where you leverage visualization to communicate some property of your data. You already know what you want the viewer to focus on. This is often closely tied to metrics that help abstract the underlying data into something that can be put into a dashboard. Dashboards are meant to help the viewer gain an overview of what is happening; the overall state. Only in some very limited cases will you be able to use dashboards to actually discover some novel insight into your data. That’s not their main purpose.
Exploration: What is my data about? This is literal exploration where one leverages visualization to dig into the data to quickly understand it. Generally we don’t know what we are going to find. We don’t know our data and want to understand it as quickly as possible. We want insights.
From a visualization standpoint the two are very different as well. Visually exploring data is done using more sophisticated visualizations, such as parallel coordinates, heatmaps, link graphs, etc. Sometimes a bar or a line chart might come in handy, but those are generally not “data-dense” enough. Following are a couple more points about exploratory visualizations:
It important to understand that these approaches need very powerful backends or data stores to drive the visualizations. This is not Excel!
Visualization is not enough. You also need a powerful way of translating your data into visualizations. Often this is simple aggregation, but in some cases, more sophisticated data mining comes in handy. Think clustering.
The exploratory process is all about finding unknowns. If you already knew what you are looking for, visualization might help you identify those instances quicker and easier, but generally you would leverage visualization to find the unknown unknowns. Once you identified them, you can then go ahead and implement those with one of the traditional approaches: rules, statistics, or behaviors in order to automate finding them in the future.
Some of the insights you will discover can’t be described in any of the above ways. The parameters are not clear or change ever so slightly. However, visually those outliers are quite apparent. In these cases, you should extend your analysis process to regularly have someone visualize the data to look for these instances.
You can absolutely try to explore your data without visualization. In some instances that might work out well. But careful; statistical summaries of your data will never tell you the full story (see Anscombe’s Quartet – the four data series all have the same statistical summaries, but looking at the visuals, each of them tells a different story).
In cyber security (or information security), we have started calling the exploratory process “Hunting“. This closes the loop to our SIEM use-cases. Hunting is used to discover and define new detection use-cases. I see security teams leverage hunting capabilities to do exactly that. Extending their threat intelligence capabilities to find the more sophisticated attackers that other tools wouldn’t be able to identify. Or maybe only partly, but then in concert with other data sources, they are able to create a better, more insightful picture.
In the context of hunting, a client lately asked the following questions: How do you measure the efficiency of your hunting team? How do you justify a hunting team opposite an ROI? And how do you assess the efficiency of analyst A versus analyst B? I gave them the following three answers:
When running any red-teaming approach, how quickly is your hunting team able to find the attacks? Are they quicker than your SOC team?
Are your hunters better than your IDS? Or are they finding issues that your IDS already flagged? (substitute IDS with any of your other detection mechanisms)
How many incidents are reported to you from outside the security group? Does your hunting team bring those numbers down? If your hunting team wasn’t in place, would that number be even higher?
For each incident that is reported external to your security team, assess whether the hunt team should have found them. If not, figure out how to enable them to do that in the future.
Unfortunately, there are no great hunting tools out there. Especially when it comes to leveraging visualization to do so. You will find people using some of the BI tools to visualize traffic, build Hadoop-based backends to support the data needs, etc. But in the end; these approaches don’t scale and won’t give you satisfying visualization and in turn insights. But that’s what pixlcloud‘s mission is. Get in touch!
Put your hunting experience, stories, challenges, and insights into the comments! I wanna hear from you!
As announced in the previous blog post, I have been writing a paper about the security big data lake. A topic that starts coming up with more and more organizations lately. Unfortunately, there is a lot uncertainty around the term so I decided to put some structure to the discussion.
Knowing me, you might be able to guess the topic I chose to present: Visual Analytics. I am focussing on not the visualization layer or the data layer, but on the analytics layer. In the presentation I am showing what we have been doing with data analytics and data mining in cyber security. I am showing some examples for three topics:
Exploration and Discovery
At the end, I am presenting a number of challenges to the community; hard problems that we need help with to advance insights into cyber security of infrastructures and applications. The following slide summarizes the challenges I see in data mining for security:
If you have any suggestions on each of the challenges, please contact me or comment on this post!
Big data doesn’t help us to create security intelligence! Big data is like your relational database. It’s a technology that helps us manage data. We still need the analytical intelligence on top of the storage and processing tier to make sense of everything. Visual analytics anyone?
A couple of weeks ago I hung out around the RSA conference and walked the show floor. Hundreds of companies exhibited their products. The big topics this year? Big data and security intelligence. Seems like this was MY conference. Well, not so fast. Marketing does unfortunately not equal actual solutions. Here is an example out of the press. Unfortunately, these kinds of things shine the light on very specific things; in this case, the use of hadoop for security intelligence. What does that even mean? How does it work? People seem to not really care, but only hear the big words.
Here is a quick side-note or anecdote. After the big data panel, a friend of mine comes up to me and tells me that the audience asked the panel a question about how analytics played into the big data environment. The panel huddled, discussed, and said: “Ask Raffy about that“.
Back to the problem. I have been reading a bunch lately about SIEM being replaced or superseded by big data infrastructure. That’s completely and utterly stupid. These are not competing technologies. They are complementary. If anything, SIEM will be replaced by some other analytical capabilities that are leveraging big data infrastructures. Big data is like RDBMS. New analytical capabilities are like the SIEMs (correlation rules, parsed data, etc.) For example, using big data, who is going to write your parsers for you. SIEMs have spent a lot of time and resources on things like parsers, big data solutions will need to do the same! Yes, there are a couple of things that you can do with big data approaches and unparsed data. However, most discussions out there do not discuss those uses.
In the context of big data, people also talk about leveraging multiple data sources and new data sources. What’s the big deal? We have been talking about that for 6 years (or longer). Yes, we want video feeds, but how do you correlate a video with a firewall log? Well, you process the video and generate events from it. We have been doing that all along. Nothing new there.
What HAS changed is that we now have the means to store and process the data; any data. However, nobody really knows how to process it.