Join me for my talk about AI and ML in cyber security at BlackHat on Thursday the 9th of August in Las Vegas. I’ll be exploring the topics of artificial intelligence (AI) and machine learning (ML) to show some of the ‘dangerous’ mistakes that the industry (vendors and practitioners alike) is making in applying these concepts in security.
We don’t have artificial intelligence (yet). Machine learning is not the answer to your security problems. And downloading a ‘random’ analytics library to identify security anomalies is going to do you more harm than good.
We will explore these accusations and walk away with the following learnings from the talk:
I am exploring these items throughout three sections in my talk: 1) A very quick set of definitions for machine learning, artificial intelligence, and data mining with a few examples of where ML has worked really well in cyber security. Check cybersecuritycourses.com for an overview of the best cyber security courses available. 2) A closer and more technical view on why algorithms are dangerous, and why it is not a solution to download a library from the Internet to find security anomalies in your data. 3) An example scenario where we talk through supervised and unsupervised machine learning for network traffic analysis to show the difficulties with those approaches and finally explore a concept called belief networks, which bears a lot of promise to enhance our detection capabilities in security by leveraging expert knowledge more closely. And if you plan to test the vulnerability of your network, make use of the Wifi Pineapple testing tool.
I keep mentioning that algorithms are dangerous. Dangerous in the sense that they might give you a false sense of security or in the worst case even decrease your security quite significantly. Here are some questions you can use to self-assess whether you are ready and ‘qualified’ to use data science or ‘advanced’ algorithms like machine learning or clustering to find anomalies in your data:
- Do you know what the difference is between supervised and unsupervised machine learning?
- Can you describe what a distance function is?
- In data science we often look at two types of data: categorical and numerical. What are port numbers? What are user names? And what are IP sequence numbers?
- In your data set you see traffic from port 0. Can you explain that?
- You see traffic from port 80. What’s a likely explanation of that? Bonus points if you can come up with two answers.
- How do you go about selecting a clustering algorithm?
- What’s the explainability problem in deep learning?
- How do you acquire labeled network data sets (netflows or pcaps)?
- Name three data cleanliness problems that you need to account for before running any algorithms.
- When running k-means, do you have to normalize your numerical inputs?
- Does k-means support categorical features?
- What is the difference between a feature, data field, and a log record?
If you can’t answer the above questions, you might want to rethink your data science aspirations and come to my talk on Thursday to hopefully walk away with answers to the above questions.
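As a small illustration of why the normalization question matters, here is a minimal sketch in plain Python. The flow records and the helper names (`euclidean`, `min_max_scale`) are made up for this example and are not from the talk. It compares Euclidean distances, which is what k-means relies on, before and after min-max scaling.

```python
import math

# Hypothetical flow records: (bytes transferred, duration in seconds).
flows = [
    (1_000, 1.0),     # small transfer, short duration
    (1_200, 30.0),    # small transfer, long duration
    (900_000, 1.5),   # large transfer, short duration
]

def euclidean(a, b):
    """Straight-line distance between two numerical feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_max_scale(rows):
    """Rescale every column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

# On raw values the byte count dominates the distance almost entirely.
raw_short_vs_long = euclidean(flows[0], flows[1])
raw_small_vs_large = euclidean(flows[0], flows[2])

# After scaling, a big difference in duration counts about as much
# as a big difference in bytes.
scaled = min_max_scale(flows)
scaled_short_vs_long = euclidean(scaled[0], scaled[1])
scaled_small_vs_large = euclidean(scaled[0], scaled[2])

print("raw:   ", raw_short_vs_long, raw_small_vs_large)
print("scaled:", scaled_short_vs_long, scaled_small_vs_large)
```

Whether min-max scaling is the right choice depends on the data; the sketch only shows that the answer an algorithm gives can change with the units of the inputs.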
Update 8/13/18: Added presentation slides
Supervised tags data or wants tagged data. Unsup is typically clustering techniques. Ideally, you are going to want to work with packet data instead of log data first, using unsup methods, and if you are extremely well-versed in feature architecture and engineering (including AutoML), log data with sup. So try Apache Spot before Hortonworks Cybersecurity and maybe even try IVRE.rocks before both. This also answers the later questions on flows and caps.
Distance is like the Edit Distance (i.e., Levenshtein, an NLP technique) feature in viper.li. It’s how different one string (such as a filename) is from another in a set of filenames. There’s also observation distance like Euclidean and Manhattan.
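To make those three distances concrete, here is a minimal sketch in plain Python. The function names and the sample filenames are illustrative assumptions and have nothing to do with how viper.li implements its feature.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep)
            ))
        previous = current
    return previous[-1]

def manhattan(p, q):
    """Sum of absolute differences per coordinate."""
    return sum(abs(x - y) for x, y in zip(p, q))

def euclidean(p, q):
    """Straight-line distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5

# Hypothetical filenames: a one-character swap yields a small edit distance.
print(levenshtein("svchost.exe", "svch0st.exe"))
print(manhattan((0, 0), (3, 4)), euclidean((0, 0), (3, 4)))
```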
Port 0 is totally-cool. It’s legit, unlike NULL. I always nmap -p0-
Src Port 80 is typically HTTP or SPDY in Web Services form. I have seen developers (usually not operators) SRC and DST TLS via ports 80 and 8080.
ML algorithm selection is heuristic-based (joke there). Probably if you are looking at clustering you should just select Random Forest because if you can’t figure that out, then you’re not going anywhere in feature selection for cybersecurity purposes. Already mentioned automating this to some degree.
Deep learning isn’t really applicable to cybersecurity yet, except with visualizations and maybe deep reinforcement learning. Would love to pick your brain on these because I’m not an expert in these areas.
Data quality is serious business before any stats or ML project begins. The last few Q&As here are pedantic, but I do suggest people learn them — although putting them to memory might be second to putting them into good checklists and methodologies.
Comment by dre — August 8, 2018 @ 10:59 pm