April 1, 2008
Thanks to the design department at Addison Wesley, I now have a proposal for the cover page of my upcoming book.
This is really exciting. I have been working on the book for over a year now, and finally it seems that the end is in sight. Three chapters are completely done and should appear very soon (within the next three weeks) in a rough-cuts program, as an electronic pre-version. Another three chapters I got back from my awesome review committee, and three more I still have to finish writing.
Applied Security Visualization should be available by Black Hat at the beginning of August. I will do anything I can to get it out by then.
[tags]applied security visualization, security visualization, visualization, security, applied[/tags]
March 7, 2008
I will be at Source Boston next week, which is probably going to be one of the coolest conferences this year. The speaker lineup is absolutely fantastic, and I am not just saying that because I am going to be speaking there. You can keep up with the conference on the Source Boston Blog or on the Twitter @SourceBoston feed.
My presentation carries the title: All the data that’s fit to visualize. Recognize it? It’s a play on the New York Times’ motto. I am going to talk about what security visualization can learn from the NYT. I am very excited about the talk, and I am going to try out some new presentation methods. Come and see it!
[tags]security visualization, source boston, applied security visualization[/tags]
February 8, 2008
According to IPO Home, ArcSight is going to go public next week, the week of 2/11/08. Here is some data:
- Market Cap: $309.4 million
- Revenue: $75 million
- Price range (expected): $9 – $11 / share [mind you, there was a 4:1 reverse split]
- Shares offered: 6.9 million
- Symbol: ARST
I am curious how it’ll go! Good luck!
September 14, 2007
When eIQnetworks announced their OpenLogFormat, I think they did it just for me. I love it. I really enjoy taking these things apart to show why they are really, really bad attempts. I am sure these guys are not readers of my blog. Otherwise they would have known that I would question their standard, line by line. It just doesn’t add up for me. Why are companies/people not learning/listening?
So, there is yet another “standard” for event interoperability being suggested by yet another vendor. While some vendors (for example the one I used to work for) actually thought about the problem and made sure they came up with something useful, I am not sure this standard lives up to that promise. Let me go through the standard piece by piece, right after some general comments:
- Why another interoperability standard? There is not a single word of motivation printed in the standards document. Don’t we have existing standards already?
- You have to register to download the standard? Well, I know, ArcSight makes that same mistake. That wasn’t my doing! I promise.
- How does this standard compare to others? What’s the motivation for defining it? Is it better than everything else?
- When exactly would you apply this standard? All the time? OLF (the open log format) states:
OLF is designed for logging network events such as those often logged by firewalls, but it can also be used for events not related to the network.
What the heck does that mean? For everything? Do you want me to prove you wrong? There are tons of cases where this standard simply cannot be applied.
- You did not do your homework, my friends! In a lot of areas. Some friends of mine already commented on the fact that this is advertised as an “open” log format. The press release even calls it an open source log format. What does that mean? Was there a period for public comment? Believe me, there wasn’t. I would have known FOR SURE!
- With regards to the homework: have you heard of CEE? Yes, that’s a group that actually knows quite a bit about logging. Why bother asking them? They would only critique the proposal and possibly shoot it down. You bet they would; that’s what I am doing right now anyways.
- Let’s see, did you guys learn from past mistakes? Don’t get me started. I claim NO. Read on and you will see a lot of cases that prove why.
- Have you read my old blog entries and at least tried to understand what logging is about? I can guarantee that you guys have not. Or maybe you didn’t understand what I was saying. Hmm…. Here again, for your reference.
- Have you looked at the other standards out there? For example CEF (the Common Event Format) from ArcSight. I am definitely biased towards that one, as I have written it, but even now that I don’t work there anymore, I still think that CEF is actually a really good logging standard. Again: you did not do your homework! (A sample CEF record follows right after this list.)
- Last general question: why would I use this standard as opposed to anything else, for example CEF? Is eIQnetworks big enough that I would care? Last time I checked, the answer was: no. If this was something done by Microsoft, I might care, just because of their size. Maybe a lot of vendors already support this standard? Yes? How many? Who? I have never heard of OLF before, and I deal with log management every day! So I doubt there is any significant adoption. Actually, I just checked the Web page and there are six companies supporting it. Okay then. ;)
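For reference, here is roughly what a CEF record looks like: a single line with a pipe-delimited prefix (CEF version, device vendor, device product, device version, signature ID, name, severity) followed by key-value extensions. This is an illustrative, made-up example, not output from any real product:

CEF:0|security|threatmanager|1.0|100|worm successfully stopped|10|src=10.0.0.1 dst=2.1.2.2 spt=1232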
Let’s go through the standard in more detail:
- I already made this point: What is the area where this standard applies? Networking and non-networking events (That’s what OLF claims)? Nice. And why would you require an IP address field (to be exact: internalIP and externalIP) for every record? In your world, are there only events that contain IPs? In mine, there are many others too!
- You are proposing a log-file approach. So you are defining a file-based standard, limiting it to one transport. Okay. But why? Again, read my blog about transport-independence. Who is logging to files only? A minority of products in the networking realm.
- Have you guys written parsers before? (Yes, I have!) Do you know how painful it is to have to read headers first? It makes a whole lot of use-cases impossible; a record can no longer be interpreted on its own. And to be frank, it requires too much coding (I am lazy).
- Minor detail: You guys are already on version 1.1? Hmm… I wonder how version 1.0 looked ;)
- I don’t think the author of this paper has written a standard before: “The #Version line gives the version of OLF, which should always be 1.1.” How do you do updates? You deprecate this document? Confusing, confusing.
- Why do you need a #Date line in the header? That does not make any sense AT ALL!
- Okay, so you are using a header line that defines the fields. All right. Let’s assume that’s a good idea in order to reduce the size of an event (exercise for the reader: the field names only need to appear once in the header instead of in every record). Why do you say then:
NOTE: The fields may not vary; they must always be the ones specified in this document.
What? This does not make any sense at all! Whatsoever! Delete that line. Done. It’s irrelevant.
- Let’s go back to the header line. Why all these required fields? spam-info? This is very inefficient. Why have all these fields for every event? It unnecessarily bloats your events and circumvents the idea of a header line!
- Tab-separated fields. Okay. Your choice. Square brackets to deal with escaping? Are you guys coders? That’s not a standard way of doing things at all. Anyone who has written code before: have you seen this approach anywhere? If you stuck to commas and quotes, you might be able to read your logs in Excel without any configuration ;) (See the small example right after this list.)
- tab-separated subfields. Shiver.
- Guys, your example on page one is horrible. Priority in the preamble and in the suffix? Then the virtualdevice is root? Maybe I can’t count. You know what, I think the fields don’t even align. What are all the IPs in the message? Part of the message (the one with the seemingly interesting IPs) seems to be lumped together into one field (uses the square brackets). I don’t get it.
- Error lines? Come again? So there are really two different types of log entries? Or no, hang on, there aren’t. Those lines are only generated if the OLF consumer realizes that the format is not correct? What does that have to do with a logging standard? If I wasn’t confused yet, now I definitely am.
- Open source: “a device-type assigned by eIQnetworks”. No further comment.
- Wow. Is it right that every log entry carries the “original” log message also (called the Nativelog)? So, if a product supports OLF by default, that’s just empty? Come on guys. Are you really suggesting to double the size of messages?
- Talking about the field dictionary… What does it mean to have “unused” fields? Unused by what? The standard? Oh, maybe this is not a standard?
- I will spare you the analysis of all the fields in the dictionary. There are tons of problems. Just one: if you have a count bigger than one but only a single timestamp, what does that mean? That all the events happened at the same time?
- Note that the Nativelog field is defined as: Original syslog line. Okay, so this is a file-based standard, but it consumes syslog messages?
- event types: There is indeed, and I kid you not, a -1 value. Is that for real?
- priority codes: Nice. Read this (again, this is a standard, in case you forgot):
The descriptions [of the priorities] given are the official interpretation, but usage varies; some vendors report routine events with higher priority
- Note the copyright at the bottom of the pages ;) [Okay, I admit, I might have made the same mistake with the first version of CEF, you are forgiven].
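To illustrate the point about commas and quotes made above: a quoted, comma-separated record parses cleanly with any off-the-shelf CSV reader, embedded delimiters and all, without inventing a new escaping scheme. A minimal Python sketch (the log line is made up):

import csv, io

# Made-up, CSV-style log record: quoted fields can safely contain
# commas (or tabs) without any special bracket escaping.
line = '2007-09-14 10:01:02,firewall01,accept,"GET /index.html, HTTP/1.1",10.0.0.1'

fields = next(csv.reader(io.StringIO(line)))
print(fields)
# ['2007-09-14 10:01:02', 'firewall01', 'accept', 'GET /index.html, HTTP/1.1', '10.0.0.1']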
Have I convinced you yet why not to use this “standard”?
Random observation: Why does this log remind me of IIS logs gone wrong?
[tags]log standard, logging, event interoperability, cee, olf, open log format[/tags]
September 11, 2007
Finally, ArcSight is going for it: http://news.google.com/news?ie=UTF-8&rlz=1B2GGGL_enUS205US205&tab=bn&ncl=1120626202&hl=en
It seems like there is a new wave of security companies going public. First Sourcefire, then TippingPoint, now ArcSight. I am really curious as to what the share price is going to be and what the reverse split is going to look like.
August 25, 2007
A lot has happened in the last couple of weeks, and I am really behind on a lot of things that I want to blog about. If you are familiar with the field that I am working in (SIEM, SIM, ESM, log management, etc.), you will fairly quickly realize where I am going with this blog entry. This is the first of a series of posts where I want to dig into the topic of event processing.
Let me start with one of the basic concepts of event processing: normalization. When dealing with time-series data, you will very likely come across this topic. What is time-series data? I used to blog and talk about log files all the time. Log files are a type of time-series data: data that is collected over time, where each entry is associated with a time stamp. This covers anything from your traditional log files to snapshots of configuration files or of the output of tools that are run on a periodic basis (e.g., capturing your netstat output every 30 seconds).
Let’s talk about normalization. Assume you have some data which reports logins to one of your servers, and you would like to generate a report showing the top ten users accessing that server. How would you do that? You’d have to identify the user name in each log entry first, then extract it, for example with a regular expression, and finally collect all the user names and compile the top ten list.
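As a minimal sketch of that approach (the sshd-style log lines and the counting are my own illustration; your log format will differ):

import re
from collections import Counter

# Made-up example entries; real log formats will differ.
entries = [
    "Aug 25 10:32:01 server1 sshd[4711]: Accepted password for alice from 10.0.0.5 port 51122 ssh2",
    "Aug 25 10:35:44 server1 sshd[4712]: Accepted password for bob from 10.0.0.9 port 50012 ssh2",
]

usernames = []
for entry in entries:
    # Extract the user name with a regular expression.
    match = re.search(r"Accepted \S+ for (\S+) from (\S+)", entry)
    if match:
        usernames.append(match.group(1))

# Compile the top ten list.
print(Counter(usernames).most_common(10))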
Another way would be to build a tool which picks the entire log entry apart and puts as much information from the event as possible into a database, as opposed to just capturing the user name. We’d have to create a database with a specific schema; it would probably have these fields: timestamp, source, destination, username. Once we have all this information in a database, it is really easy to do all kinds of analysis on the data that were not possible before we normalized it.
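Here is a minimal sketch of that idea, using SQLite and the made-up four-field schema from above:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (timestamp TEXT, source TEXT, destination TEXT, username TEXT)")

# Insert the normalized fields extracted from a (made-up) log entry.
conn.execute("INSERT INTO events VALUES (?, ?, ?, ?)",
             ("2007-08-25 10:32:01", "10.0.0.5", "server1", "alice"))

# Once the data is structured, analysis becomes a simple query,
# e.g., the top ten users accessing the server:
for user, count in conn.execute(
        "SELECT username, COUNT(*) FROM events "
        "GROUP BY username ORDER BY COUNT(*) DESC LIMIT 10"):
    print(user, count)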
The process of taking raw input events and extracting individual fields is called normalization. Sometimes there are other processes which are classified as normalization. I am not going to discuss them right here, but for example normalizing numerical values to fall in a predefined range is generally referred to as normalization as well.
The advantages of normalization should be fairly obvious. You can operate on the structured and parsed data. You know which field represents the source address versus the destination address. If you don’t parse the entries, you don’t really know that. You can only guess. However, there are many disadvantages to the process of normalization that you should be aware of:
- If you are dealing with a disparate set of event sources, you have to find the union of all fields to make up your generic schema. Assume you have a telephone call log and a firewall log and you want to store both types of logs in the same database. What you have to do is take all the fields from both logs and build the database schema from them. This results in a fairly large set of fields, and if you keep adding new types of data sources, your database schema keeps growing. I know of a SIM which uses more than 200 fields, and even that doesn’t cover nearly all the fields needed for a good set of data sources.
- Extending the schema is incredibly hard: when building a system with a fixed schema, you need to decide up front what your schema will look like. If, at a later point in time, you need to add another type of data source, you will have to go back and modify the schema. This can have all kinds of implications for the data already captured in the data store.
- Once you have decided on a specific schema, you have to build parsers to normalize the inputs into this schema. If you don’t have a parser, you are out of luck and you cannot use that data source.
- Before you can do any type of analysis, you need to invest the time to parse (or normalize) the data. This can become a scalability issue: parsing is fairly slow, as it generally applies regular expressions to each of the data entries, which is a fairly expensive operation.
- Humans are not perfect and programmers are not either. The parsers will have bugs and they will screw up normalization. This means that the data that is stored in the database could be wrong in a number of ways:
- A specific field doesn’t get parsed. This part of the data entry is not available for any further processing.
- A value gets parsed but assigned to the wrong field. Part of your subsequent analysis could be wrong.
- Breaking up the data entry into tokens (fields) is not granular enough. The parser should have broken the original entry into more specific fields.
- The data entries can change. Oftentimes, when a new version of a product is released, it either adds new data types or it changes some of the log entries. This has to be reflected in the parsers. They need to be updated to support the new data entries, before the data source can be used again.
- The original data entry is not available anymore, unless you spend the time and space to store it along with the parsed and extracted fields. This can cause quite some scalability issues as well.
I have seen all of these cases happen, and they happen all the time. Sometimes the issues are not that bad, but other times, when you are dealing with mission-critical systems, it is absolutely crucial that normalization happens correctly and on time.
I will expand on the challenges of normalization in a future blog entry and put it into the context of security information management (SIM).
[tags]SIM, SIEM, ESM, log management, event normalization, event processing, log analysis[/tags]
July 12, 2007
Today I was booking my airline ticket to Kuala Lumpur, Malaysia, for my trip to Hack in the Box in September. I called the sales lady for the airline and talked to her about my flight dates and all that. In the end she asked me for my credit card information: number, expiration date, and then the CVV number on the back of my card (the security code, as it is sometimes called). I hesitated for a second, trying to remember what I had just learned from the PCI auditors we had in house. I couldn’t really remember when a merchant needs that number, but after a second I realized that it would be okay to give it to her. It’s about the same as entering that information on a Web page; they can use the CVV to run an authorization with the credit card company. Well, I thought that would be it. Wrong!
A couple of hours later I got a pretty ugly Excel spreadsheet back. I was asked to print it out, sign it, and fax it back to them. I had a look at the form and wondered what was going on. Well, there was all my information in this spreadsheet, including the CVV number! They even “encrypted” my credit card number in the spreadsheet. I am just kidding; it was all in plain text. The only funny thing was that the credit card number field was formatted not as a string but as a number, so it looked like it was encrypted. *grins* But back to serious matters: I was quite upset. All my information was in this document. I have to assume that this Excel document sits on the sales person’s desktop, along with probably dozens of others. Hmmm… Maybe I should send an email with a link that points to a site that contains a … Let’s not even go there.
The next thing I did was dig up the PCI standard. And here it was, section 3.2.2:
3.2.2 Do not store the card-validation code (Three-digit or four-digit value printed on the front or back of a payment card (e.g., CVV2 and CVC2 data))
A clear violation! And you know, this is pretty much the first thing you should address: the way credit card transactions are authorized. Just plain wrong! Darn!
I wrote them an email asking for a contact in their security department. So far, no luck; just the sales person telling me that she needs all that information to complete the transaction. Whatever. Either she needs my signature, but then no CVV, or the CVV and no signature. But not both! I wonder how this is going to continue.
[tags]pci, compliance, violation, security[/tags]
May 26, 2007
Log analysis has shifted fairly significantly in the last couple of years. It is not about reporting on log records (e.g., Web statistics or user logins) anymore. It is all about pinpointing who is responsible for certain actions and activities. The problem is that log files oftentimes do not communicate that. There are some logs (mainly from network-centric devices) which contain IP addresses that are used to identify the subject. In other instances, there is no subject that can be identified in the log files at all (database transactions, for example).
What I really want to identify is a person. I want to know who is to blame for deleting a file. But the log files have not evolved to a point where they contain that user information, and it generally does not help much to know which machine the user came from when he deleted the file.
This is all old news, and you are probably living with these limitations. But here is what I was wondering about: why has nobody built a tool or started an open source project which looks at network traffic to extract user-to-machine mappings? It’s not _that_ hard. SMB traffic, for example, contains plain-text usernames, shares, originating machines, etc. You should be able to compile session tables from this. I need this information. Anyone? There is so much information you could extract from network traffic (even from Kerberos!). Most of the protocols would give you a fair understanding of who is using what machine at what time, and how.
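Just to sketch the session-table idea: assume you can export (timestamp, source IP, user name) tuples from the traffic, for example with something along the lines of tshark -r trace.pcap -Y ntlmssp.auth.username -T fields -e frame.time_epoch -e ip.src -e ntlmssp.auth.username (the exact filter and field names are an assumption and depend on your dissector versions). A small Python script could then build the mapping:

import sys
from collections import defaultdict

# Expects tab-separated lines on stdin: <epoch time> <source IP> <user name>,
# for example produced by the (hypothetical) tshark command above.
sessions = defaultdict(list)   # ip -> list of (time, user)

for line in sys.stdin:
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3:
        continue
    ts, ip, user = parts
    sessions[ip].append((float(ts), user))

# Print a simple user-to-machine mapping table.
for ip, entries in sorted(sessions.items()):
    users = sorted(set(user for _, user in entries))
    print(ip, ",".join(users))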
[tags]identity correlation, user, log analysis, user mapping[/tags]
May 15, 2007
I was just listening to this podcast about security information management (SIM) systems. Tom Bowers from Information Security magazine is talking about various topics in SIM. Unfortunately I have to disagree with Tom on a couple of points, if not more. But let me pick the couple I find most important:
- Visualization is a great tool to see attacks in real time; however, you can only see where the attacks are coming from and not how many. What? Why would I not be able to visualize that? You can map the count to edge size or node size, map it as a color onto your nodes, etc. I don’t know what system he looked at to make this statement, but it is wrong!
- Active response is something that SIMs cannot do. Well, wrong again. I could tell you how ArcSight does this with the Threat Response Manager (TRM), but that would be a vendor pitch. That’s why I am going to mention SEC, the Simple Event Correlator, instead. It can execute an arbitrary action, and it’s not a quantum leap from there to imagine issuing a command that adds an ACL to a router, for example (see the sketch below). To sum up: active response is something SIMs can do! If you want to know how exactly to do this with SEC, read my chapter on event analysis in the new Snort book.
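To make that a bit more concrete, here is roughly what such a rule could look like in SEC; this is a sketch from memory, and the log pattern and the ACL script are made up:

type=Single
ptype=RegExp
pattern=Denied connection from (\d+\.\d+\.\d+\.\d+)
desc=blocking source $1
action=shellcmd /usr/local/bin/add_router_acl.sh $1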
These were the main points where I disagree with Tom. He also could have done a somewhat better job describing the benefits of visualization, but that’s another story.
[tags]arcsight,visualization[/tags]
May 11, 2007
I was trying to get my Ubuntu desktop to use Beryl, just like my laptop does. Unfortunately, my NVidia drivers didn’t quite want to do what I wanted them to do. Long story short, at some point I remembered to check the log files to see whether I could determine what exactly the problem was. Where should I look first? /var/log/messages
And right there it was:
May 11 11:15:12 zurich kernel: [ 2503.193111] NVRM: API mismatch: the client has the version 1.0-9631, but
May 11 11:15:12 zurich kernel: [ 2503.193114] NVRM: this kernel module has the version 1.0-9755. Please
May 11 11:15:12 zurich kernel: [ 2503.193115] NVRM: make sure that this kernel module and all NVIDIA driver
May 11 11:15:12 zurich kernel: [ 2503.193117] NVRM: components have the same version.
Beautiful. That’s exactly what I needed to know. But hang on a second. Isn’t this a syslog entry? Wow. It just hit me. While I really liked the verbose output, I was trying to think about how I would parse this thing. How would I normalize this message so I could later apply machine logic to process it further? Awful!
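Just to illustrate the pain, here is a minimal sketch of what a parser would have to do with this one entry: strip the syslog and kernel preambles, reassemble the NVRM continuation lines, and only then extract the two version strings. The regular expressions are my own guesses, not any documented format, and the sketch assumes the file contains just this one NVRM message:

import re

lines = open("/var/log/messages").read().splitlines()

# Keep only the NVRM fragments, dropping the syslog/kernel preamble.
fragments = []
for line in lines:
    match = re.search(r"NVRM: (.*)$", line)
    if match:
        fragments.append(match.group(1))

# Reassemble the multi-line message and pull out the two versions.
message = " ".join(fragments)
match = re.search(r"client has the version (\S+), but.*kernel module has the version (\S+)\.", message)
if match:
    client_version, module_version = match.groups()
    print("client:", client_version, "kernel module:", module_version)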
I guess my conclusion would be that we need two types of syslog: one that logs machine-readable entries and one for humans. Is that really what we want? Maybe the even better solution would be to have only a machine-readable log and then provide an application that reads the log and expands the contents to make them readable for humans!
Where is CEE when you need it?