June 28, 2010
Last week I posted the introductory video for a talk that I gave at Source Boston in 2008. I just found the entire video of that talk. Enjoy:
Talk by Raffael Marty:
With the ever-growing amount of data collected in IT environments, we need new methods and tools to deal with them. Event and Log Analysis is becoming one of the main tools for analysts to investigate and comprehend the state of their networks, hosts, applications, and business processes. Recent developments, such as regulatory compliance and an increased focus on insider threat, have increased the demand for analytical tools to help in the process. Visualization is offering a new, more effective, and simpler approach to data analysis. To date, security visualization has mostly failed to deliver effective tools and methods. This presentation will show what the New York Times has to teach us about effective visualizations. Visualization for the masses and not visualization for the experts. Insider Threat, Governance, Risk, and Compliance (GRC), and Perimeter Threat all require effective visualization methods and they are right in front of us – in the newspaper.
June 14, 2010
A rehash of an old blog post from February 2008. I thought it would make sense to give a quick update on CEE and put the link to the public discussion archives here again:
Well well well… I get so many questions from people about CEE. Where is it at, when does it come out, what will it cover? To be honest, I don’t quite know. I have some answers. We have been working really hard on getting a syntax and a taxonomy working draft written up. I think it’s more than just a working draft. It will be a really well thought through starting point for the final standard around log syntax and taxonomy. For years (I wish this wasn’t literal, but it is), we have been working on this now. It took quite some time to get everyone on the CEE board to run in the same direction. I can’t promise any timeline for publication, but I hope it’s close.
In the meantime, if you are interested in the public discussions around CEE, the public discussion archives are available online.
I have also been working on an application logging paper that I just submitted to USENIX. If you are interested in how we implemented logging at Loggly and want to look at the paper, drop me a line, maybe I will share it.
June 7, 2010
The following blog post was originally posted in December 2008. I updated it slightly to fit current times:
This following blog post has turned into more than just a post. It’s more of a paper. In any case, in the post I am trying to capture a number of concepts that are defining the log management and analysis market (as well as the SIEM or SEM markets).
Any company or IT department/operation can be placed along the maturity scale (see Figure 1). The further to the right, the more mature the operations with regards to IT data management. A company generally moves along the scale. A movement to the right does not just involve the purchase of new solutions or tools, but also needs to come with a new set of processes. Products are often necessary but are not a must.
The further one moves to the right, the fewer companies or IT operations can be found operating at that scale. Also note that the products that companies use are called log management tools for the ones located on the left side of the scale. In the middle, it is the security information and event management (SIEM) products that are being used, and on the right side, companies have to look at either in-house tools, scripts, or in some cases commercial tools in markets other than the security market. Some SIEM tools are offering basic advanced analytics capabilities, but they are very rudimentary. The reason why there are no security specific tools and products on the right side becomes clear when we understand a bit better what the scale encodes.
Figure 1: IT Data Management Maturity Scale.
The Maturity Scale
Let us have a quick look at each of the stages on the scale. (Skip over this if you are interested in the conclusions and not the details of the scale.)
- Do nothing: I didn’t even explicitly place this stage on the scale. However, there are a great many companies out there that do exactly this. They don’t collect data at all.
- Collecting logs: At this stage of the scale, companies are collecting some data from a few data sources for retention purposes. Sometimes compliance is the driver for this. You will mostly find things like authentication logs or maybe message logs (such as email transaction logs or proxy logs). The number of different data sources is generally very small. In addition, you mostly find log files here. You will not yet find more specific IT data, such as multi-line application logs or configurations. A new trend that we are seeing here is the emergence of the cloud. A number of companies are looking to move IT services into the cloud and have them delivered by service providers. The same is happening in log management. It doesn’t make sense for small companies to operate and maintain their own logging solutions. A cloud-based offering is perfect for those situations.
- Forensics / Troubleshooting: While companies in the previous stage simply collect logs for retention purposes, companies in this stage actually make use of the data. In the security arena they are conducting forensic investigations after something suspicious was noticed or a breach was reported. In IT operations, the use-case is troubleshooting. Take email logs, for example. A user wants to know why he did not receive a specific email. Was it eaten by the SPAM filter or is something else wrong?
- Save searches: I don’t have a better name for this. In the simplest case, someone saves the search expression used with a grep command. In other cases, where a log management solution is used, users are saving their searches. At this stage, analysts can re-use their searches at a later point in time to find the same type of problems again, without having to reconstruct the searches every single time.
- Share searches: If a search is good for one analyst, it might be good for another one as well. Analysts at some point start sharing their ways of identifying a certain threat or analyzing a specific IT problem. This greatly improves productivity.
- Reporting: Analysts need reports. They need reports to communicate findings to management. Sometimes they need reports to communicate among each other or to communicate with other teams. Generally, the reporting capabilities of log management solutions are fairly limited. They are extended in the SEM products.
- Alerting: This capability lives in somewhat of a gray-zone. Some log management solutions provide basic alerting, but generally, you will find this capability in a SEM. Alerting is used to automate some of the manual trouble-shooting that is done among companies on the left side of the scale. Instead of waiting for a user to complain that there is something wrong with his machine and then looking through the log files, analysts are setting up alerts that will notify them as soon as there are known signs of failures showing up. Things like monitoring free disk space are use-cases that are automated at this point. This can save a lot of manual labor and help drive IT towards a more automated and pro-active discipline.
- Collecting more logs and IT data: More data means more insight, more visibility, broader coverage, and more uses. For some use-cases we now need new data sources. In some cases it’s the more exotic logs, such as multi-line application logs, instant messenger logs, or physical access logs. In addition more IT data is needed: configuration files, host status information, such as open ports or running processes, ticketing information, etc. These new data sources enable a new and broader set of use-cases, such as change validation.
- Correlation: The manual analysis of all of these new data sources can get very expensive and too resource-intensive. This is where SEM solutions can help automate a lot of the analysis. Typical uses are correlating trouble tickets with file changes, or correlating IDS data with operating system logs. (Note that I didn’t say IDS and firewall logs!) There is much, much more to correlation, but that’s for another blog post.
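The free disk space use-case from the alerting stage can be sketched in a few lines of Python. This is a minimal illustration, not how any particular SEM product does it: the function names `free_fraction` and `check_disk`, the 10% threshold, and the alert text are all assumptions made for the example. Only `shutil.disk_usage` comes from the standard library.

```python
import shutil


def free_fraction(usage):
    """Return the fraction of the volume that is still free (0.0 to 1.0)."""
    return usage.free / usage.total


def check_disk(path="/", min_free=0.10, usage=None):
    """Return an alert string if free space on `path` is below `min_free`.

    Returns None when everything is fine. `usage` can be injected for
    testing; by default it is read from the operating system.
    """
    if usage is None:
        usage = shutil.disk_usage(path)
    fraction = free_fraction(usage)
    if fraction < min_free:
        return f"ALERT: {path} has only {fraction:.1%} free"
    return None


if __name__ == "__main__":
    # Hypothetical usage: run this from cron and forward the message
    # to whatever notification channel you already have.
    print(check_disk("/") or "disk space OK")
```

The point is the shift in workflow: the check runs on a schedule and notifies the analyst, instead of the analyst digging through logs after a user complains.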
Note the big gap between the last step and this one. It takes a lot for an organization to cross this chasm. Also note that the individual mile-stones on the right side are drawn fairly close to each other. In reality, think of this as a log scale. These mile-stones can be very very far apart. The distance here is not telling anymore.
- Visual analysis: It is not very efficient to read through thousands of log messages and figure out trends or patterns, or even understand what the log entries are communicating. Visual analysis takes the textual information and packages it in an image that conveys the contents of the logs. For more information on the topic of security visualization see Applied Security Visualization.
- Pattern detection: One could view this as advanced correlation. One wants to know about patterns. Is it normal that when the DNS server is doing a zone transfer that you will also find a number of IDS alerts along with some firewall log entries? If a user browses the Web, what is the pattern of log files that are normally seen? Pattern detection is the first step towards understanding an IT environment. The next step is to then figure out when something is an outlier and not part of a normal pattern. Note that this is not as simple as it sounds. There are various levels of maturity needed before this can happen. Just because something is different does not mean that it’s a “bad” anomaly or an outlier. Pattern detection engines need a lot of care and training.
- Interactive visualization: Earlier we talked about simple, static visualization to better understand our IT data. The next step in the application of visualization is interactive visualization. This type of visualization follows the principle of: “overview first, zoom and filter, then details on demand.” This type of visualization along with dynamic queries (the next step) is incredibly important for advanced analysis of IT data.
- Dynamic queries: The next step beyond interactive, single-view visualizations are multiple views of the same data. All of the views are linked together. If you select a property in one graph, the selection propagates to the others. This is also called dynamic queries. This is the gist of fast and efficient analysis of your IT data.
- Anomaly detection: Various products are trying to implement anomaly detection algorithms in order to find outliers, or anomalous behavior in the IT environment. There are many approaches that people are trying to apply. So far, however, none of them has had broad success. Anomaly detection as it is known today is best understood for closed use-cases. For example, NBADs are using anomaly detection algorithms to flag interesting findings in network flows. As of today, nobody has successfully applied anomaly detection across heterogeneous data sources.
- Sharing views, patterns, and outliers: The last step on my maturity scale is the sharing of advanced analytic findings. If I know that certain versions of the Bind DNS server tend to trigger a specific set of Snort IDS alerts, it is something that others should know as well. Why not share it? Unfortunately, there are no products that allow us to share this knowledge.
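To make the baseline idea behind the pattern and anomaly detection stages a bit more concrete, here is a deliberately naive sketch for one closed use-case: a single counter on a single host. The function name `outliers`, the three-standard-deviation threshold, and the sample numbers are assumptions for illustration only. Real engines need far more care, and a flagged value is merely different, not necessarily bad.

```python
from statistics import mean, stdev


def outliers(baseline, observed, threshold=3.0):
    """Return (index, value) pairs from `observed` that lie more than
    `threshold` standard deviations away from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: anything that differs stands out.
        return [(i, v) for i, v in enumerate(observed) if v != mu]
    return [(i, v) for i, v in enumerate(observed)
            if abs(v - mu) / sigma > threshold]


# Hypothetical hourly counts of failed logins on one host.
baseline = [12, 15, 11, 14, 13, 12, 16, 14]
today = [13, 15, 90, 12]

print(outliers(baseline, today))  # [(2, 90)]
```

This works only because the data source is homogeneous and the baseline is trusted. Neither holds once you try the same thing across heterogeneous sources, which is exactly where the approaches on the market struggle.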
While reading the maturity scale, note the gaps between the different stages. They signify how quickly after the previous step a new step sets in. If you were to look at the scale from a time-perspective, you would start an IT data management project on the left side and slowly move towards the right. Again, the gaps are fairly indicative of the relative time such a project would consume.
The scale could be overlaid with lines showing some interesting, related properties. I decided to not do so in favor of legibility. Instead, have a look at Figure 2. It encodes a few properties: number of products on the market, number of customers / users, and number of data sources needed at that state of maturity.
Figure 2: The number of products, companies, and data sources that are used / available along the maturity scale.
Why are so few products on the right side of the scale? The most obvious reason is one of market size. There are not many companies on the right side. Hence there are not many products. It is sort of a chicken-and-egg problem. If there were more products, there might be more companies using them – maybe. However, there are more reasons. One of them being that in order to get to the right side, a company has to traverse the entire scale on the left. This means that the potential market for advanced analytics is the number of companies that linger just before the advanced analytics market itself. That market is a very small one. The next question would be why there are not more companies close to the advanced analytics stage? There are multiple reasons. Some of them are:
- Not many environments manage to collect enough data to implement advanced analytics across heterogeneous data. Too many environments are stuck with just a few data sources. There are organizational, architectural, political, and technical reasons why this is so.
- A lack of qualified people (engineers, architects, etc) is another reason. Not many companies have the staff that understands how to deal with all the data collected. Not many people understand how to interpret the vast amount of different data sources.
The effects of these phenomena play yet again into the availability of products for the advanced analytics side of the scale. Because there are not many environments that actually collect a diverse set of IT data, companies (or academia) cannot conduct research on the subject. And if they do, they mostly get it wrong or capture just a very narrow use-case.
What Else Does the Maturity Scale Tell Us?
Let us have a look at some of the other things that we can learn from/should know about the maturity scale:
- What does it mean for a company to be on the far right of the scale?
- In-depth understanding of the data
- Understanding of how to apply advanced analytics (visualization theory, anomaly detection, etc.)
- Baseline of the behavior in the organization’s environment (needed for example for anomaly detection)
- Understanding of the context of the data gathered, such as what’s the network topology, what are the properties of the assets, etc.
- Have to employ knowledgeable people. These experts are scarce and expensive.
- Collecting all log data, which is hard!
- What are some other preconditions to live on the right side?
- A mature change management process
- Asset management
- IT infrastructure documentation
- Processes to deal with the findings/intelligence from advanced analytics
- A security policy that tells what is allowed and intended and what is not. (Have you ever put a sniffer on the network to see what traffic there is? Did you understand all of it? This is pretty much the same thing, you put a huge sniffer on your IT environment and try to explain everything. Wow!)
- Understand the environment to the point where questions like: “What’s really normal?” are answered quickly. Don’t be fooled. This is nearly impossible. There are so many questions that need to be answered, such as: “Is a DNS server that generates ICMP messages every now and then an anomaly? Is it a security problem? What is the payload of the ICMP message? Maybe an information leak?”
- What’s the return on investment (ROI) for living on the right-side of the scale?
- It’s just not clear!
- Isn’t it cheaper to ignore than to discover?
- What do you intend to find and what will you find?
- So, what’s the ROI? It’s hard to measure, but you will be able to:
- Detect problems earlier
- Uncover attacks and policy violations quicker
- Prevent information leaks
- Reduce down-time of infrastructure and applications
- Reduce labor of service desk and system administration
- More stable applications
- etc. etc.
- What else?
November 30, 2008
I just came across this list of commands to capture the state of a Cisco router. I wanted to capture this and maybe inspire someone to build an application for Splunk. It would be interesting to build a set of expect scripts that go out and capture this information in Splunk. You can then use the information for forensics, but also for change management. By building alerts you could even alert on unauthorized or potentially malicious changes. If you are interested in building an application, let me know. I’d be happy to help.
show clock detail
show processes cpu
show processes memory
content of bootflash
show ip route
show ip ospf
show ip ospf summary
show ip ospf neighbors
show ip bgp summary
show cdp neighbors
show ip arp
show ip interfaces
show tcp brief all
show ip sockets
show ip nat translations verbose
show ip cache flow
show ip cef
show snmp user
show snmp group
show snmp sessions
show file descriptors
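The change management idea can be sketched independently of how the output is collected. The Python below assumes that some collector (an expect script, for instance) has already stored the text output of each command per capture run. The function names `diff_snapshots` and `changed_commands` and the placeholder output strings are made up for the example and are not real router output.

```python
import difflib


def diff_snapshots(old, new, command):
    """Return unified-diff lines between two captures of one command."""
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile=f"{command} (previous)",
        tofile=f"{command} (current)",
        lineterm=""))


def changed_commands(previous, current):
    """Map each command whose output changed to its diff.

    `previous` and `current` are dicts of command -> captured output text.
    """
    changes = {}
    for command, new_output in current.items():
        delta = diff_snapshots(previous.get(command, ""), new_output, command)
        if delta:
            changes[command] = delta
    return changes


# Placeholder captures; the output text is invented for illustration.
previous = {"show ip route": "line-a\nline-b", "show snmp user": "user-1"}
current = {"show ip route": "line-a\nline-b", "show snmp user": "user-1\nuser-2"}

for command, delta in changed_commands(previous, current).items():
    print(f"CHANGE in '{command}':")
    print("\n".join(delta))
```

Commands whose output changes on every run, such as show clock detail or show processes cpu, would have to be excluded or compared differently before alerting on the result.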
April 1, 2008
Thanks to the design department at Addison Wesley, I have a proposal for a cover page of my upcoming book:
This is really exciting. I have been working on the book for over a year now and finally it seems that the end is in sight. I have three chapters completely done and they should appear in a rough-cuts program, as an electronic pre-version, very soon (next three weeks). Another three chapters I got back from my awesome review committee and then there are three chapters I still have to finish writing.
Applied Security Visualization should be available by Black Hat at the beginning of August. I will do anything I can to get it out by then.
Technorati Tags: applied security visualization, security visualization, visualization, security, applied
March 7, 2008
I will be at Source Boston next week, which is going to be probably one of the coolest conferences this year. The speaker lineup is absolutely fantastic. And I am not saying that because I am going to be speaking there. You can keep up with the conference on the Source Boston Blog or on the Twitter @SourceBoston feed.
My presentation carries the title: All the data that’s fit to visualize. Recognize this? It’s the New York Times’ headline. I am going to talk about what security visualization can learn from the NYT. I am very excited about the talk. I am going to try out some new presentation methods. Come and see it!
Tags: security visualization, source boston, applied security visualization
February 8, 2008
According to IPO Home, ArcSight is going to go public next week – the week of 2/11/08. Here is some data:
- Market Cap: $309.4 million
- Revenue: $75 million
- Price range (expected): $9 – $11 / share [mind you, there was a 4:1 reverse split]
- Shares offered: 6.9 million
- Symbol: ARST
I am curious how it’ll go! Good luck!
September 14, 2007
When eIQnetworks announced their OpenLogFormat, I think they did it just for me. I love it. I really enjoy taking these things apart to show why they are really really bad attempts. I am sure these guys are not readers of my blog. Otherwise they would have known that I will question their standard, line by line. It just doesn’t add up for me. Why are companies/people not learning/listening?
So, there is yet another “standard” for event interoperability being suggested by yet another vendor. While some vendors (for example the one I used to work for), actually thought about the problem and made sure they are coming up with something useful, I am not sure this standard lives up to that promise. Let me go through the standard piece by piece, right after some general comments:
- Why another interoperability standard? There is not a single word of motivation printed in the standards document. Don’t we have existing standards already?
- You have to register to download the standard? Well, I know, ArcSight makes that same mistake. That wasn’t my doing! I promise.
- How does this standard compare to others? What’s the motivation for defining it? Is it better than everything else?
- When exactly would you apply this standard? All the time? OLF (the open log format) states:
OLF is designed for logging network events such as those often logged by firewalls, but it can also be used for events not related to the network.
What the heck does that mean? For everything? Do you want me to prove you wrong? There are tons of examples where this standard cannot be applied.
- You did not do your homework, my friends! In a lot of areas. Some friends of mine already commented on the fact that this is advertised as an “open” log format. The press release even calls it an open source log format. What does that mean? Was there a period for public comment? Believe me, there wasn’t. I would have known FOR SURE!
- With regards to the homework. Have you heard of CEE? Yes, that’s a group that actually knows quite a bit about logging. Why bother asking them, they would only critique the proposal and possibly shoot it down? You bet. That’s what I am doing right now anyways.
- Let’s see, did you guys learn from past mistakes? Don’t get me started. I claim NO. Read on and you will see a lot of cases that prove why.
- Have you read my old blog entries and at least tried to understand what logging is about? I can guarantee that you guys have not. Or maybe you didn’t understand what I was saying. Hmm…. Here again, for your reference.
- Have you looked at the other standards out there? For example CEF (common event format) from ArcSight. I am definitely biased towards that one, as I have written it, but even now that I don’t work there anymore, I still think that CEF is actually a really good logging standard. Again. Not done your homework!
- Last general question: Why would I be using this standard as opposed to anything else, for example CEF? Is eIQnetworks big enough so I would care? Last time I checked, the answer was: No. If this was something that was done by Microsoft, I might care, just because of their size. Maybe you have a lot of vendors already supporting this standard? Yes? How many? Who? I had never heard of OLF before and I deal with log management every day! So I doubt there is any significant adoption. Actually, I just checked the Web page and there are six companies supporting it. Okay. All that
Let’s go through the standard in more detail:
- I already made this point: What is the area where this standard applies? Networking and non-networking events (That’s what OLF claims)? Nice. And why would you require an IP address field (to be exact: internalIP and externalIP) for every record? In your world, are there only events that contain IPs? In mine, there are many others too!
- You are proposing a log-file approach. So you are defining a file-based standard, limiting it to one transport. Okay. But why? Again, read my blog about transport-independence. Who is logging to files only? A minority of products in the networking realm.
- Have you guys written parsers before? (Yes, I have!). Do you know how bad it is to read headers first? Makes a whole lot of use-cases impossible. And to be frank, it requires too much coding (I am lazy).
- Minor detail: You guys are already on version 1.1? Hmm… I wonder what version 1.0 looked like.
- I don’t think the author of this paper has written a standard before: “The #Version line gives the version of OLF, which should always be 1.1.” How do you do updates? You deprecate this document? Confusing, confusing.
- Why do you need a #Date line in the header? That does not make any sense AT ALL!
- Okay, so you are using a header line that defines the fields. All right. Let’s assume that’s a good idea in order to reduce the size of an event (exercise to the reader why this is true). Why do you say then:
NOTE: The fields may not vary; they must alwas be the ones specified in this document.
What? This does not make any sense at all! Whatsoever! Delete that line. Done. It’s irrelevant.
- Let’s go back to the header line. Why all these required fields? spam-info? This is very inefficient. Why have all these fields for every event? It unnecessarily bloats your events and circumvents the idea of a header line!
- Tab-separated fields. Okay. Your choice. Square brackets to deal with escaping? Are you guys coders? That’s not a standard way of doing things at all. Anyone who wrote code before, have you seen this approach anywhere? If you stuck to commas and quotes, you might be able to read your logs in Excel without any configuration.
- tab-separated subfields. Shiver.
- Guys, your example on page one is horrible. Priority in the preamble and in the suffix? Then the virtualdevice is root? Maybe I can’t count. You know what, I think the fields don’t even align. What are all the IPs in the message? Part of the message (the one with the seemingly interesting IPs) seems to be lumped together into one field (uses the square brackets). I don’t get it.
- Error lines? Come again? So there are really two different types of log entries? Or no, hang on, there aren’t. Those lines are only generated if the OLF consumer realizes that the format is not correct? What does that have to do with a logging standard? If I wasn’t confused yet, now I definitely am.
- Open source: “a device-type assigned by eIQnetworks”. No further comment.
- Wow. Is it right that every log entry carries the “original” log message also (called the Nativelog)? So, if a product supports OLF by default, that’s just empty? Come on guys. Are you really suggesting to double the size of messages?
- Talking about the field dictionary… What does it mean to have “unused” fields? Unused by what? The standard? Oh, maybe this is not a standard?
- I will spare you the analysis of all the fields in the dictionary. There are tons of problems. Just one: If you have a count bigger than one and you only have one timestamp. What does that mean? All the events happened at the same time?
- Note that the Nativelog field is defined as: Original syslog line. Okay, so this is a file-based standard, but it consumes syslog messages?
- event types: There is indeed, and I kid you not, a -1 value. Is that for real?
- priority codes: Nice. Read this (again, this is a standard, in case you forgot):
The descriptions [of the priorities] given are the official interpretation, but usage varies; some vendors report routine events with higher priority
- Note the copyright at the bottom of the pages [Okay, I admit, I might have made the same mistake with the first version of CEF, you are forgiven].
Have I convinced you yet why not to use this “standard”?
Random observation: Why does this log remind me of IIS logs gone wrong?
Technorati Tags: log standard, logging, event interoperability, cee, olf, open log format
September 11, 2007
Finally, ArcSight is going for it: http://news.google.com/news?ie=UTF-8&rlz=1B2GGGL_enUS205US205&tab=bn&ncl=1120626202&hl=en
It seems like there is a new wave of security companies going public. First Sourcefire, then TippingPoint, now ArcSight. I am really curious as to what the share price is going to be and what the reverse split is going to look like.
August 25, 2007
A lot has happened the last couple of weeks and I am really behind with a lot of things that I want to blog about. If you are familiar with the field that I am working in (SIEM, SIM, ESM, log management, etc.), you will fairly quickly realize where I am going with this blog entry. This is the first of a series of posts where I want to dig into the topic of event processing.
Let me start with one of the basic concepts of event processing: normalization. When dealing with time-series data, you will very likely come across this topic. What is time-series data? I used to blog and talk about log files all the time. Log files are a type of time-series data. It’s data which is collected over time. Entries are associated with a time stamp. This covers anything from your traditional log files to snapshots of configuration files or snapshots of tools that are run on a periodic basis (e.g., capturing your netstat output every 30 seconds).
Let’s talk about normalization. Assume you have some data which reports logins to one of our servers. We would like to generate a report which shows the top ten users accessing the server. How would you do that? We’d have to identify the user name in the log entry first. Then we’d extract it, for example by writing a regular expression. Then we’d collect all the user names and compile the top ten list.
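A minimal sketch of this first approach in Python, assuming login lines modelled on OpenSSH’s “Accepted … for USER from …” messages. The pattern, the sample lines, and the name `top_users` are illustrative assumptions; your log format will need its own regular expression.

```python
import re
from collections import Counter

# Assumed line format: "... Accepted <method> for <user> from <ip> ..."
LOGIN_RE = re.compile(r"Accepted \w+ for (?P<user>\S+) from")


def top_users(lines, n=10):
    """Extract the user name from each login line and rank by frequency."""
    counts = Counter()
    for line in lines:
        match = LOGIN_RE.search(line)
        if match:
            counts[match.group("user")] += 1
    return counts.most_common(n)


sample = [
    "Aug 25 10:01:02 host sshd[101]: Accepted password for alice from 10.0.0.5 port 4242 ssh2",
    "Aug 25 10:02:10 host sshd[102]: Accepted publickey for bob from 10.0.0.6 port 4243 ssh2",
    "Aug 25 10:03:44 host sshd[103]: Accepted password for alice from 10.0.0.5 port 4250 ssh2",
    "Aug 25 10:04:00 host sshd[104]: Failed password for mallory from 10.0.0.9 port 4251 ssh2",
]

print(top_users(sample))  # [('alice', 2), ('bob', 1)]
```

Note that only the user name is pulled out. Everything else in the entry stays unparsed, which is why this approach answers one question and nothing more.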
Another way would be to build a tool which picks the entire log entry apart and puts as much information from the event into a database. As opposed to just capturing the user name. We’d have to create a database with a specific schema. It would probably have these fields: timestamp, source, destination, username. Once we have all this information in a database, it is really easy to do all kinds of analysis on the data, which was not possible before we normalized it.
The process of taking raw input events and extracting individual fields is called normalization. Sometimes there are other processes which are classified as normalization. I am not going to discuss them right here, but for example normalizing numerical values to fall in a predefined range is generally referred to as normalization as well.
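Here is a small sketch of the database approach, using the four fields mentioned above and an in-memory SQLite database. The raw entry format (`login user=… src=… dst=…`), the table name `events`, and the function names are assumptions invented for the example.

```python
import re
import sqlite3

# Hypothetical raw format:
#   <timestamp> login user=<name> src=<address> dst=<address>
ENTRY_RE = re.compile(
    r"(?P<timestamp>\S+) login user=(?P<username>\S+) "
    r"src=(?P<source>\S+) dst=(?P<destination>\S+)")


def normalize(line):
    """Parse one raw entry into a dict of fields, or None if it does not match."""
    match = ENTRY_RE.match(line)
    return match.groupdict() if match else None


def load(lines):
    """Normalize all entries and store them in an in-memory database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE events "
               "(timestamp TEXT, source TEXT, destination TEXT, username TEXT)")
    rows = [row for row in map(normalize, lines) if row is not None]
    db.executemany(
        "INSERT INTO events VALUES "
        "(:timestamp, :source, :destination, :username)", rows)
    return db


sample = [
    "2007-08-25T10:00:00 login user=alice src=10.0.0.5 dst=10.0.0.1",
    "2007-08-25T10:00:30 login user=bob src=10.0.0.6 dst=10.0.0.1",
    "2007-08-25T10:01:00 login user=alice src=10.0.0.7 dst=10.0.0.1",
    "this entry has no parser and is silently dropped",
]

db = load(sample)
top = db.execute(
    "SELECT username, COUNT(*) FROM events "
    "GROUP BY username ORDER BY COUNT(*) DESC LIMIT 10").fetchall()
print(top)  # [('alice', 2), ('bob', 1)]
```

Once the fields are in the table, the top ten report is a single query, and so is any other question about sources or destinations. The last sample line also shows the flip side: an entry without a matching parser never makes it into the database.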
The advantages of normalization should be fairly obvious. You can operate on the structured and parsed data. You know which field represents the source address versus the destination address. If you don’t parse the entries, you don’t really know that. You can only guess. However, there are many disadvantages to the process of normalization that you should be aware of:
- If you are dealing with a disparate set of event sources, you have to find the union of all fields to make up your generic schema. Assume you have a telephone call log and a firewall log. You want to store both types of logs in the same database. What you have to do is take all the fields from both logs and build the database schema. This will result in a fairly large set of fields. If you keep adding new types of data sources, your database schema gets fairly big. I know of a SIM which uses more than 200 fields. And still that doesn’t cover nearly all the fields that are needed to cover a good set of data sources.
- Extending the schema is incredibly hard: When building a system with a fixed schema, you need to decide what your schema will look like. If, at a later point in time, you have a need to add another type of data source, you will have to go back and modify the schema. This can have all kinds of implications on the data already captured in the data store.
- Once you decided to use a specific schema, you have to build your parsers to normalize the inputs into this schema. If you don’t have a parser, you are out of luck and you cannot use that data source.
- Before you can do any type of analysis, you need to invest the time to parse (or normalize) the data. This can become a scalability issue. Parsing is fairly slow. It generally applies regular expressions to each of the data entries, which is a fairly expensive operation.
- Humans are not perfect and programmers are not either. The parsers will have bugs and they will screw up normalization. This means that the data that is stored in the database could be wrong in a number of ways:
- A specific field doesn’t get parsed. This part of the data entry is not available for any further processing.
- A field gets parsed but assigned to the wrong field. Part of your prior analysis could be wrong.
- Breaking up the data entry into tokens (fields) is not granular enough. The parser should have broken the original entry into more specific fields.
- The data entries can change. Oftentimes, when a new version of a product is released, it either adds new data types or it changes some of the log entries. This has to be reflected in the parsers. They need to be updated to support the new data entries, before the data source can be used again.
- The original data entry is not available anymore, unless you are spending the time and space to store the original data entry along with the parsed and extracted fields. This can have quite some scalability issues as well.
I have seen all of these cases happening. And they happen all the time. Sometimes, the issues are not that bad, but other times, when you are dealing with mission critical systems, it is absolutely crucial that the normalization happens correctly and on time.
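The schema-growth problem from the first point in the list is easy to see in a toy example. The field names for the firewall and telephone logs below are made up, as are the helper names `union_schema` and `to_row`.

```python
def union_schema(*sources):
    """Return the sorted union of field names across all event sources."""
    fields = set()
    for source_fields in sources:
        fields.update(source_fields)
    return sorted(fields)


def to_row(event, schema):
    """Spread one event over the shared schema; missing fields become None."""
    return [event.get(field) for field in schema]


# Hypothetical field lists for two very different data sources.
firewall = ["timestamp", "source", "destination", "port", "action"]
phone = ["timestamp", "caller", "callee", "duration"]

schema = union_schema(firewall, phone)
call = {"timestamp": "t1", "caller": "555-0100",
        "callee": "555-0101", "duration": 42}
row = to_row(call, schema)

print(len(schema), row.count(None))  # 8 4
```

With only two sources, half of every telephone row is already empty. Each additional source type widens the schema further and leaves more empty fields per row, which is how a schema drifts towards hundreds of columns.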
I will expand on the challenges of normalization in a future blog entry and put it into the context of security information management (SIM).
Technorati Tags: SIM, SIEM, ESM, log management, event normalization, event processing, log analysis