A lot has happened in the last couple of weeks, and I am really behind on the things I want to blog about. If you are familiar with the field I work in (SIEM, SIM, ESM, log management, etc.), you will quickly see where I am going with this blog entry. This is the first in a series of posts where I want to dig into the topic of event processing.
Let me start with one of the basic concepts of event processing: normalization. When dealing with time-series data, you will very likely come across this topic. What is time-series data? I used to blog and talk about log files all the time. Log files are a type of time-series data: data that is collected over time, where each entry is associated with a time stamp. This covers anything from your traditional log files to periodic snapshots of configuration files or of the output of tools that are run on a schedule (e.g., capturing your netstat output every 30 seconds).
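As a small aside, collecting such snapshots is simple to script. Here is a minimal sketch in Python; the 30-second interval, the netstat flags, and the output file name are just illustrative assumptions, not a recommendation of any particular tool:

```python
# Minimal sketch: capture a netstat snapshot on a fixed interval and
# prepend a timestamp, turning tool output into time-series data.
# The interval and file name below are illustrative assumptions.
import subprocess
import time
from datetime import datetime

INTERVAL_SECONDS = 30                     # assumed polling interval
OUTPUT_FILE = "netstat_snapshots.log"     # hypothetical output file

while True:
    snapshot = subprocess.run(["netstat", "-an"],
                              capture_output=True, text=True).stdout
    with open(OUTPUT_FILE, "a") as f:
        f.write(f"=== {datetime.now().isoformat()} ===\n")
        f.write(snapshot)
    time.sleep(INTERVAL_SECONDS)
```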
Let’s talk about normalization. Assume you have some data which reports logins to one of your servers, and you would like to generate a report showing the top ten users accessing that server. How would you do that? First, you’d have to identify the user name in each log entry. Then you’d extract it, for example with a regular expression. Finally, you’d collect all the user names and compile the top ten list.
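To make that concrete, here is a minimal sketch in Python. It assumes a hypothetical sshd-style log line such as "Accepted password for alice from 10.0.0.5" and a made-up input file name; the regular expression would have to be adapted to whatever your entries actually look like:

```python
# Extract the user name from each login entry with a regular expression
# and compile a top-ten list. The pattern is tied to an assumed
# sshd-style format, not to any specific product's log layout.
import re
from collections import Counter

LOGIN_PATTERN = re.compile(r"Accepted \w+ for (?P<user>\S+) from (?P<src>\S+)")

def top_users(log_lines, n=10):
    counts = Counter()
    for line in log_lines:
        match = LOGIN_PATTERN.search(line)
        if match:
            counts[match.group("user")] += 1
    return counts.most_common(n)

with open("auth.log") as f:               # hypothetical input file
    for user, logins in top_users(f):
        print(f"{user}: {logins}")
```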
Another way would be to build a tool which picks the entire log entry apart and puts as much information from the event as possible into a database, as opposed to capturing just the user name. We’d have to create a database with a specific schema; it would probably have these fields: timestamp, source, destination, username. Once all of this information is in a database, it is really easy to do all kinds of analysis on the data, analysis that was not possible before the data was normalized.
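Sticking with the same assumed sshd-style entries, this is roughly what full normalization into such a schema could look like. SQLite, the file names, and the regular expression are all illustrative choices, not a description of any particular product:

```python
# Normalize whole log entries into a fixed schema
# (timestamp, source, destination, username) stored in SQLite.
import re
import sqlite3

# Assumed entry format, e.g.:
# "Aug 27 10:14:03 server1 sshd[1234]: Accepted password for alice from 10.0.0.5"
ENTRY_PATTERN = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d+\s[\d:]+)\s(?P<destination>\S+)\s.*"
    r"Accepted \w+ for (?P<username>\S+) from (?P<source>\S+)"
)

conn = sqlite3.connect("events.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS events "
    "(timestamp TEXT, source TEXT, destination TEXT, username TEXT)"
)

with open("auth.log") as f:               # hypothetical input file
    for line in f:
        match = ENTRY_PATTERN.search(line)
        if match:
            conn.execute(
                "INSERT INTO events VALUES "
                "(:timestamp, :source, :destination, :username)",
                match.groupdict(),
            )
conn.commit()

# Once normalized, the top-ten report is a simple query:
for user, count in conn.execute(
    "SELECT username, COUNT(*) AS c FROM events "
    "GROUP BY username ORDER BY c DESC LIMIT 10"
):
    print(user, count)
```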
The process of taking raw input events and extracting individual fields from them is called normalization. Other processes are sometimes classified as normalization as well; I am not going to discuss them in detail here, but normalizing numerical values to fall within a predefined range, for example, is also generally referred to as normalization.
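For completeness, here is what that other sense of normalization might look like: a simple min-max rescaling of numeric values into the range 0 to 1 (purely illustrative):

```python
# Rescale a list of numeric values into a predefined range (min-max style).
def normalize_to_range(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    if hi == lo:                          # constant input: avoid division by zero
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

print(normalize_to_range([10, 20, 50, 100]))
# -> [0.0, 0.111..., 0.444..., 1.0]
```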
The advantages of normalization should be fairly obvious: you can operate on structured, parsed data. You know which field represents the source address and which the destination address; if you don’t parse the entries, you don’t really know that, you can only guess. However, there are many disadvantages to the process of normalization that you should be aware of:
- If you are dealing with a disparate set of event sources, you have to find the union of all their fields to make up your generic schema (see the sketch after this list). Assume you have a telephone call log and a firewall log, and you want to store both types of logs in the same database. What you have to do is take all the fields from both logs and build the database schema from their union. This results in a fairly large set of fields, and if you keep adding new types of data sources, your database schema keeps growing. I know of a SIM which uses more than 200 fields, and even that doesn’t come close to covering all the fields needed for a good set of data sources.
- Extending the schema is incredibly hard: when building a system with a fixed schema, you need to decide up front what your schema will look like. If, at a later point in time, you need to add another type of data source, you will have to go back and modify the schema, and this can have all kinds of implications for the data already captured in the data store.
- Once you have decided on a specific schema, you have to build parsers to normalize the inputs into it. If you don’t have a parser for a data source, you are out of luck and you cannot use that source.
- Before you can do any type of analysis, you need to invest the time to parse (or normalize) the data. This can become a scalability issue. Parsing is fairly slow; it generally applies regular expressions to each data entry, which is a fairly expensive operation.
- Humans are not perfect, and programmers are no exception. The parsers will have bugs and they will screw up the normalization. This means that the data stored in the database can be wrong in a number of ways:
- A specific field doesn’t get parsed at all, so that part of the data entry is not available for any further processing.
- A value gets parsed but is assigned to the wrong field, so any analysis relying on that field could be wrong.
- Breaking up the data entry into tokens (fields) is not granular enough. The parser should have broken the original entry into more specific fields.
- The data entries can change. Oftentimes, when a new version of a product is released, it adds new data types or changes some of the log entries. This has to be reflected in the parsers: they need to be updated to support the new entries before the data source can be used again.
- The original data entry is no longer available, unless you spend the time and space to store it along with the parsed and extracted fields, which raises scalability issues of its own.
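Here is the sketch promised in the first point above, showing how the union of all fields inflates a generic schema. The field names for the firewall and telephone call logs are made up for illustration:

```python
# The generic schema has to be the union of the fields of every source,
# so each stored event carries every column and most columns stay empty.
FIREWALL_FIELDS = ["timestamp", "src_ip", "dst_ip", "src_port", "dst_port",
                   "protocol", "action"]
PHONE_FIELDS = ["timestamp", "caller_number", "callee_number",
                "duration_seconds", "trunk"]

GENERIC_SCHEMA = FIREWALL_FIELDS + [f for f in PHONE_FIELDS
                                    if f not in FIREWALL_FIELDS]

def normalize(event, source_fields):
    row = {field: None for field in GENERIC_SCHEMA}
    row.update({f: event.get(f) for f in source_fields})
    return row

firewall_event = {"timestamp": "2007-08-27T10:14:03", "src_ip": "10.0.0.5",
                  "dst_ip": "192.168.1.1", "src_port": 51234, "dst_port": 443,
                  "protocol": "tcp", "action": "accept"}
print(normalize(firewall_event, FIREWALL_FIELDS))
# Every phone-log column comes back as None for this firewall event.
```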
I have seen all of these cases happen, and they happen all the time. Sometimes the issues are not that bad; other times, when you are dealing with mission-critical systems, it is absolutely crucial that normalization happens correctly and on time.
I will expand on the challenges of normalization in a future blog entry and put it into the context of security information management (SIM).
[tags]SIM, SIEM, ESM, log management, event normalization, event processing, log analysis[/tags]
Awesome topic, and thanks for bringing it up. The plan that gets you to your destination is not always the plan you need for your next move, for sure. Let me offer a set of URLs that readers might find interesting.
http://www.realsoftwaredevelopment.com/2007/08/to-normalize-or.html
http://blogs.msdn.com/pathelland/archive/2007/07/23/normalization-is-for-sissies.aspx
Heck, just google ‘normalization’ and go from there. 🙂
Unless you are willing to think beyond the RDBMS mindset, please don’t argue the point.
–tk
Comment by TK — August 27, 2007 @ 7:21 pm
Well, I guess you are taking the discussion into a slightly different area than the one I started in. What strikes me are the parallels you can draw between the two interpretations of my blog post… Let me address where we think differently first:
You are talking about database normalization. What I was talking about is event normalization, or parsing. Although related, the two topics are fairly different. In order to store the parsed data, you would normally use some sort of database, with indexes, normalized tables, and so on. I leave those discussions to the database developers. But when it comes to parsing data and normalizing events, I will chime in again 😉
So, the parallels. Normalization in either world adds complexity, a fair amount of it. Interesting. I hadn’t thought of this before…
Comment by Raffael Marty — August 28, 2007 @ 11:21 am
[…] topic in some future blog posts. On my personal blog I already started to outline the problem of normalization, which is probably the biggest and most important difference. I will roll the topic up again right […]
Pingback by Raffy » Blog Archive » Raffael Marty aka Raffy — September 12, 2007 @ 7:33 am
[…] post things that are relevant to my employment and Splunk on my Splunk blog. I will continue my rant on normalization and SIEM over […]
Pingback by Raffy’s Computer Security Blog » My Splunk Blog — December 3, 2007 @ 4:02 pm
Great article! By the way, NXLog does normalization for sources from many platforms, be it Windows, Linux, Android, or others. It is open source, so a free download is available at:
https://nxlog.co/products/nxlog-community-edition
Comment by Rob Lars — October 29, 2017 @ 8:40 am
@Rob Lars – To my knowledge, NXLog does some rudimentary parsing, but it does not really normalize all messages from its data sources, unlike an ArcSight connector, for example, which parses all the message fields.
Comment by Raffael Marty — November 5, 2017 @ 4:32 pm