December 11, 2017
marketing? What works? What doesn’t? What approaches yield the biggest return for your investment? These are some questions that I have been pondering lately with startups that I am working with. I decided to do some research among my marketing and startup friends to explore what marketing approaches work for them.
You are an enterprise software startup. You are in the security space. Your company is still early, trying to sign its first 10, maybe 40 customers. What should you be doing for
The full posts you can find on my other blog at cryptojail.net. There you will find a discussion of:
- Positioning and value proposition
- Identifying your top 200 prospects
- Defining measurable goals
- Building a killer Web site
- Getting reference customers
- Keeping your marketing fun and unique
- Becoming good at PR
And don’t forget about proper advertising and marketing strategies. Vinyl car wraps Denver is a right solution!
Here are the individual posts:
- Startup Marketing
- Get Good at PR
- More PR Activities
December 6, 2017
Earlier today I was giving the keynote at ACSAC 2017. This year’s theme of the conference is big data for security. As part of my keynote, I talked about the history of big data in security. Following is the slide I put together:
This is by no means a complete picture, but I tried to pull together some of the most important milestones along the security data journey. To help interpret the graph, the top part shows some of the most important developments in security, while the bottom shows the history in the big data world at large. I am including the big data world as without the developments in big data, security would not be where it is today.
In addition to distinguishing between security and big data developments, I am using blue and green triangles to differentiate between ‘data collection or centralization’ (blue) and ‘data insight’ (green) milestones. I had a hard time coming up with too many ‘data insight’ milestones in both big data and in security. Seems we have a bit more work to do in our industry.
Let me make a few remarks about the graph. I’d be curious about your thoughts also; please leave a comment.
- Security has been dealing with big data (variety, velocity, and volume) since 1996 – we just didn’t call it that back then.
- We have been trying to apply anomaly detection to (network) security data for a long time. Interestingly enough, we are still dealing with a lot of the same issues as we had back then; one of them is having access to good training data.
- We have been talking about security visualization for a long time. The first VizSec conference was held back in 2004. And in fact, we released the first versions of AfterGlow in early 2004.
- In 2006 we released the first common data format, the common event format (CEF). CEF was a vendor driven approach (I co-wrote the standard while at ArcSight). Later in 2007, I approached mitre to help us take the effort public under the CEE banner. Unfortunately, that effort wasn’t super successful. Later I rewrote CEF at Splunk and we called it the common information model (CIM). Now we also have Apache Spot that released yet another standard data format. Which standard are we going to use? Well, for my own purposes, none of these standards really made the cut and I wrote yet another field dictionary at Sophos (to be released). The other problem with these standards is that none of them cover semantics, only syntax!
- While deep learning had a big break through in 2009, we really only started using those algorithms in security in 2016. That’s probably a good thing. It’s just another machine learning algorithm that is actually really amazing at helping classify malware. But for a lot of other security problems it’s just not suited.
- In 2014 I wrote the first version of the security data lake booklet. Most of the challenges I address in there are still applicable today. One of the developments that has helped making data lakes more realistic, are developments like Apache Drill or PrestoDB; or in general, the separation of data storage (e.g., parquet) from the query engines. These developments allow us to query our data lake with many different storage formats (CSV, JSON, parquet, ORC, etc.). It still requires a master data record to understand what we have and let us manage schemas across data sources.
- Looking at database technologies, it is amazing to see what, for example, AWS has been providing with regards to tooling around data storage and access. Athena, RDS, S3, and to a lesser degree QuickSight, are fantastic tools to manage and explore large amounts of data. They provide the user with a lot of enterprise features out of the box, such as backups, fault tolerance, queueing, access control, etc.
What major milestones am I missing?
November 24, 2017
Last week I organized the 4th iteration of the Security Chat – an informal gathering of security people in Zurich where we talk about all type of security from cyber security to home security using systems online, and you can learn more here about this. The format are 10-15 minute presentations that anyone can submit for. In good tradition, we had a great line up again:
- Steve Micallef – OSINT and The New Perimeter – his tool is available for anyone to use at www.spiderfoot.net.
- Ero Carrera – The Many Types of Machine Learning.
Let’s not get carried away by the hype around deep learning. It’s great for some things, but not suited for many others. How about using Bayesian graph networks?
- Stefan Frei – Supply Chain Security in The Digital Age
Stefan gave us a bit to debate and think about when it comes to the security of the supply chain. Especially when it comes to Government procurement.
- Maxim Salomon – Foobar on Embedded Systems
Just how secure are our IoT devices out there? The presentation showed how insecure they really are. Even government grade security products!
- Jeroen Massar – Don’t Forget The Network
In his ‘no-slide’ presentation he made us remember that there are many aspects of networking that we tend to forget or a lot of people simply don’t know anymore. Things like BCP38, DNS RPC, etc.
- Sandra Tobler – Next Generation Multi Factor Authentication
More about the system that Sandra introduced can be found at: Futurae.com
October 22, 2017
After my latest blog post on “Machine Learning and AI – What’s the Scoop for Security Monitoring?“, there was a quick discussion on twitter and Shomiron made a good point that in my post I solely focused on supervised machine learning.
In simple terms, as mentioned in the previous blog post, supervised machine learning is about learning with a training data set. Along with machines on casino games online, the games are program with a sets of data in order for the machine games to work. In contrast, unsupervised machine learning is about finding or describing hidden structures in data. If you have heard of clustering algorithms, you may not want to miss blog on card counting considering both are pertaining numbers, they are one of the main groups of algorithms in unsupervised machine learning (the other being association rule learning).
There are some quite technical problems with applying clustering to cyber security data. I won’t go into details for now. The problems are related to defining distance functions and the fact that most, if not all, clustering algorithms have been built around numerical and not categorical data. Turns out, in cyber security, we mostly deal with categorical data (urls, usernames, IPs, ports, etc.). But outside of these mathematical problems, there are other problems you face with clustering algorithms. Have a look at this visualization of clusters that I created from some network traffic:
Some of the network traffic clusters incredibly well. But now what? What does this all show us? You can basically do two things with this:
- You can try to identify what each of these clusters represent. But the explainability of clusters is not built into clustering algorithms! You don’t know why something shows up on the top right, do you? You have to somehow figure out what this traffic is. You could run some automatic feature extraction or figure out what the common features are, but that’s generall not very easy. It’s not like email traffic will nicely cluster on the top right and Web traffic on the bottom right.
- You may use this snapshot as a baseline. In fact, in the graph you see individual machines. They cluster based on their similarity of network traffic seen (with given distance functions!). If you re-run the same algorithm at a later point in time, you could try to see which machines still cluster together and which ones do not. Sort of using multiple cluster snapshots as anomaly detectors. But note also that these visualizations are generally not ‘stable’. What was on top right might end up on the bottom left when you run the algorithm again. Not making your analysis any easier.
If you are trying to implement case number one above, you can make it a bit little less generic. You can try to inject some a priori knowledge about what you are looking for. For example, BotMiner / BotHunter uses an approach to separate botnet traffic from regular activity.
These are some of my thoughts on unsupervised machine learning in security. What are your use-cases for unsupervised machine learning in security?
PS: If you are doing work on clustering, have a look at t-SNE. It’s a clustering algorithm that does a multi-dimensional projection of your data into the 2-dimensional space. I have gotten incredible results with it. I’d love to hear from you if you have used the algorithm.
October 13, 2017
The other day I presented a Webinar on Big Data and SIEM for IANS research. One of the topics I briefly touched upon was machine learning and artificial intelligence, which resulted in a couple of questions after the Webinar was over. I wanted to pass along my answers here:
Q: Hi, one of the biggest challenges we have is that we have all the data and logs as part of SIEM, but how to effectively and timely review it – distinguishing ‘information’ from ‘noise’. Is Artificial Intelligence (AI) is the answer for it?
A: AI is an overloaded term. When people talk about AI, they really mean machine learning. Let’s therefore have a look at machine learning (ML). For ML you need sample data; labeled data, which means that you need a data set where you already classified things into “information” and “noise”. Form that, machine learning will learn the characteristics of ‘noisy’ stuff. The problem is getting a good, labeled data set; which is almost impossible. Given that, what else could help? Well, we need a way to characterize or capture the knowledge of experts. That is quite hard and many companies have tried. There is a company, “Respond Software”, which developed a method to run domain experts through a set of scenarios that they have to ‘rate’. Based on that input, they then build a statistical model which distinguishes ‘information’ from ‘noise’. Coming back to the original question, there are methods and algorithms out there, but the thing to look for are systems that capture expert knowledge in a ‘scalable’ way; in a way that generalizes knowledge and doesn’t require constant re-learning.
Q: Can SIEMs create and maintain baselines using historical logs to help detect statistical anomalies against the baseline?
A: The hardest part about anomalies is that you have to first define what ‘normal’ is. A SIEM can help build up a statistical baseline. In ArcSight, for example, that’s called a moving average data monitor. However, a statistical outlier is not always a security problem. Just because I suddenly download a large file doesn’t mean I am compromised. The question then becomes, how do you separate my ‘download’ from a malicious download? You could show all statistical outliers to an analyst, but that’s a lot of false positives they’d have to deal with. If you can find a way to combine additional signals with those statistical indicators, that could be a good approach. Or combine multiple statistical signals. Be prepared for a decent amount of caring and feeding of such a system though! These indicators change over time.
Q: Have you seen any successful applications of Deep Learning in UEBA/Hunting?
A: I have not. Deep learning is just a modern machine learning algorithm that suffers from all most of the problems that machine learning suffers from as well. To start with, you need large amounts of training data. Deep learning, just like any other machine learning algorithm, also suffers from explainability. Meaning that the algorithm might classify something as bad, but you will never know why it did that. If you can’t explain a detection, how do you verify it? Or how do you make sure it’s a true positive?
Hunting requires people. Focus on enabling hunters. Focus on tools that automate as much as possible in the hunting process. Giving hunters as much context as possible, fast data access, fast analytics, etc. You are trying to make the hunters’ jobs easier. This is easier said than done. Such tools don’t really exist out of the box. To get a start though, don’t boil the ocean. You don’t even need a fully staffed hunting team. Have each analyst spend an afternoon a week on hunting. Let them explore your environment. Let them dig into the logs and events. Let them follow up on hunches they have. You will find a ton of misconfigurations in the beginning and the analysts will come up with many more questions than answers, but you will find that through all the exploratory work, you get smarter about your infrastructure. You get better at documenting processes and findings, the analysts will probably automate a bunch of things, and not to forget: this is fun. Your analysts will come to work re-energized and excited about what they do.
Q: What are some of the best tools used for tying the endpoint products into SIEMs?
A: On Windows I can recommend using sysmon as a data source. On Linux it’s a bit harder, but there are tools that can hook into the audit capability or in newer kernels, eBPF is a great facility to tap into.
If you have an existing endpoint product, you have to work with the vendor to make sure they have some kind of a central console that manages all the endpoints. You want to integrate with that central console to forward the event data from there to your SIEM. You do not want to get into the game of gathering endpoint data directly. The amount of work required can be quite significant. How, for example, do you make sure that you are getting data from all endpoints? What if an endpoint goes offline? How do you track that?
When you are integrating the data, it also matters how you correlate the data to your network data and what correlations you set up around your endpoint data. Work with your endpoint teams to brainstorm around use-cases and leverage a ‘hunting’ approach to explore the data to learn the baseline and then set up triggers from there.
Update: Check out my blog post on Unsupervised machine learning as a follow up to this post.
February 26, 2017
Visual Analytics Workshop at BlackHat Las Vegas 2017. Sign up today!
Once again, at BlackHat Las Vegas, I will be teaching the Visual Analytics for Security Workshop. This is the 5th year in a row that I’ll be teaching this class at BlackHat US. Overall, it’s the 29th! time that I’ll be teaching this workshop. Every year, I spend a significant amount of time to get the class updated with the latest trends and developments.
This year, you should be excited about the following new and updated topics:
– Machine learning in security – what is it really?
– What’s new in the world of big data?
– Hunting to improve your security insights
– The CISO dashboard – a way to visualize your security posture
– User and Entity Behavior Analytics (UEBA) – what it is, what it isn’t, and you use it most effectively
– 10 Challenges with SIEM and Big Data for security
Don’t miss the 5th anniversary of this workshop being taught at BlackHat US. Check out the details and sign up today: http://bit.ly/2kEXDEr
August 13, 2016
Threat intelligence (TI) feeds, the way we know them today, are going to go away. In the future it’s all about fast and sometimes anonymous sharing of IOCs with trusted (and maybe untrusted) peers. TI feeds will probably stay around, but will be used for contextul information around our data.
For the past couple of years threat intelligence feeds (IOCs) emerged as a theme (hype) to help with threat detection. It almost seems like people got too tired of writing SIEM correlation rules and were hoping that IOCs would help them getting their job done much easier. There are many reasons that this approach isn’t working and won’t work going forward:
- Threat intelligence feeds are not complete. A number of people have been analyzing threat feeds and found that there is generally not much overlap in the different feeds (maybe 3-5%). What that also means is that each of the feed providers looks at a very different universe. And if that is true, what are we not seeing? If we saw more, we’d have much more overlap in the feeds.
- TI feeds or indicator matching is not the same as authoring correlation rules. Correlation rules can have complex temporal and spatial logic. Use it!
- IOCs are generally only valid for seconds or minutes. A machine gets infected, acts maliciously, and after a few minutes, the attacker already abandons it, just to replace it with a new one. What’s the delay within your TI feed? From the vendor actually putting the indicator on the feed to you being able to match it against your data?
So, what’s the future of TI then? Here are my three predictions:
- We need to enable real-time sharing of indicators. If I see something bad, I need to be able to share it with my peers immediately. Yes, this is challenging. But start small with a trusted group of peers. In some circumstances you might be okay sharing entire PCAPs. In others, just share the source of the attacks. Use an exchange mechanism that can help keeping the data source anonymous. It’s about speed and freshness of the IOCs.
- We can already see signs that threat intel providers are adding ‘contextual‘ information into their feeds. Things like what IP addresses are known TOR exit nodes, for example. Or what machines are part of a CDN. Useful? Probably some. What it really comes back to is asset context for your analytics and correlations. Knowing what a machine is can be incredibly useful. If you know that machine X is a DNS server and that only, why does it suddenly offer up Web traffic? So TI feeds will move more and more into contextual land.
- The ISACs will keep operating and help enable the exchange of TI among industry peers. We need to expand that concept outside of the ISACs as well. You will see more and more of the security providers that have a large enough customer base look into cross-customer intelligence. *cough* *cough*.
Keeping all of this in mind and run a serious trial before you buy any TI feed.
For a different look at Threat Intelligence – from a company internal perspective, read my blog on Internal Threat Intelligence.
Final thought / question: Has anyone leveraged threat intelligence feeds to come up with a weather report of the Internet and then adjusted their internal defense mechanisms accordingly?
What’s your experience, your thoughts around all this?
February 24, 2016
Okay, this post is going to be a bit strange. It’s a quick brain dump from a technical perspective, what it would take to build the perfect security monitoring environment.
We’d need data and contextual information to start with:
- infrastructure / network logs (flows, dns, dhcp, proxy, routing, IPS, DLP, …)
- host logs (file access, process launch, socket activity, etc.)
- HIPS, anti virus, file integrity
- application logs (Web, SAP, HR, …)
- configuration changes (host, network equipment, physical access, applications)
- indicators of compromise (threat feeds)
- physical access logs
- cloud instrumentation data
- change tickets
- incident information
- asset information and classification
- identity context (roles, etc.)
- information classification and location (tracking movement?)
- HR / presonell information
- vulnerability scans
- configuration information for each machine, network device, and application
With all this information, what are the different jobs / tasks / themes that need to be covered from a data consumption perspective?
- situational awareness / dashboards
- alert triage
- forensic investigations
- metric generation from raw logs / mapping to some kind of risk
- incident management / sharing information / collaboration
- running models on the data (data science)
- anomaly detection
- behavioral analysis
- reports (PDF, HTML)
- real-time matching against high volume threat feeds
- remediating security problems
- continuous compliance
- controls verification / audit
What would the right data store look like? What would its capabilities be?
- storing any kind of data
- kind of schema less but with schema on demand
- storing event data (time-stamped data, logs)
- storing metrics
- fast random access of small amounts of data, aka search
- analytical questions
- looking for ‘patterns’ in the data – maybe something like a computer science grammar that defines patterns
- building dynamic context from the data (e.g., who was on what machine at what time)
Looks like there would probably be different data stores: We’d need an index, probably a graph store for complex queries, a columnar store for the analytical questions, pre-aggregated data to answer some of the common queries, and the raw logs as well. Oh boy 😉
I am sure I am missing a bunch of things here. Care to comment?
February 22, 2016
A week ago I was presenting at the Kaspersky Security Analyst Summit. My presentation was titled: “Creating Your Own Threat Intel Through Hunting & Visualization“.
Here are a couple of impressions from the conference:
Here I am showing some slides where I motivate why visualization is crucial for security analysts.
And a zoom in of the reason for why visualization is important. Note that emerging blue pattern towards the right of the scatter plot on the left. On the right you can see how context was used to augment the visualization to help identify outliers or interesting areas:
On the left here you see how visualization is used to find patterns and translate what you learn into algorithmic detections. On the right, I am showing a way to set thresholds on periodic data.
February 9, 2016
Hunting has been a fairly central topic on this blog. I have written about different aspects of hunting here and here.
I just gave a presentation at the Kaspersky Security Analytics Summit where I talked about the concept of internal threat intelligence and showed a number of visualizations to emphasize the concept of interactive discovery to find behavior that really matters in your network.