Join me for my talk about AI and ML in cyber security at BlackHat on Thursday the 9th of August in Las Vegas. I’ll be exploring the topics of artificial intelligence (AI) and machine learning (ML) to show some of the ‘dangerous’ mistakes that the industry (vendors and practitioners alike) are making in applying these concepts in security.
We don’t have artificial intelligence (yet). Machine learning is not the answer to your security problems. And downloading the ‘random’ analytic library to identify security anomalies is going to do you more harm than it helps.
We will explore these accusations and walk away with the following learnings from the talk:
I am exploring these items throughout three sections in my talk: 1) A very quick set of definitions for machine learning, artificial intelligence, and data mining with a few examples of where ML has worked really well in cyber security. 2) A closer and more technical view on why algorithms are dangerous. Why it is not a solution to download a library from the Internet to find security anomalies in your data. 3) An example scenario where we talk through supervised and unsupervised machine learning for network traffic analysis to show the difficulties with those approaches and finally explore a concept called belief networks that bear a lot of promise to enhance our detection capabilities in security by leveraging export knowledge more closely.
I keep mentioning that algorithms are dangerous. Dangerous in the sense that they might give you a false sense of security or in the worst case even decrease your security quite significantly. Here are some questions you can use to self-assess whether you are ready and ‘qualified’ to use data science or ‘advanced’ algorithms like machine learning or clustering to find anomalies in your data:
- Do you know what the difference is between supervised and unsupervised machine learning?
- Can you describe what a distance function is?
- In data science we often look at two types of data: categorical and numerical. What are port numbers? What are user names? And what are IP sequence numbers?
- In your data set you see traffic from port 0. Can you explain that?
- You see traffic from port 80. What’s a likely explanation of that? Bonus points if you can come up with two answers.
- How do you go about selecting a clustering algorithm?
- What’s the explainability problem in deep learning?
- How do you acquire labeled network data sets (netflows or pcaps)?
- Name three data cleanliness problems that you need to account for before running any algorithms?
- When running k-means, do you have to normalize your numerical inputs?
- Does k-means support categorical features?
- What is the difference between a feature, data field, and a log record?
If you can’t answer the above questions, you might want to rethink your data science aspirations and come to my talk on Thursday to hopefully walk away with answers to the above questions.
Update 8/13/18: Added presentation slides