I have wanted to post this review of ‘draft-cloud-log-00’ for a while now. Here it finally goes. In short, there is no need for a cloud-logging standard; what is needed is a way to deal with virtualization use-cases, ideally as part of another logging standard, such as CEE.
The cloud-log-00 draft is meant to define a standard around a logging format that can be used to correlate messages generated on different physical or virtual machines but belonging to the same ‘user request’. The main contribution of the current draft proposal is that it adds a structured element to syslog (RFC 5424) messages. It outlines a number of IDs that can be and should be used for this purpose.
This analysis of the proposed draft outlines a number of significant shortcomings of the current draft-cloud-log-00 and explains why it is a bad idea to pursue this or any other cloud logging standard any further. I urge the working group and the IETF to not move forward with this draft, but to join forces with other standards, such as CEE (cee.mitre.org), and make sure that any special requirements or use-cases can be handled with such.
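To make "a structured element in a syslog message" concrete, here is a minimal sketch of how an RFC 5424 message with one structured-data element can be assembled. The SD-ID `example@32473` and the parameter names `requestId` and `guestId` are placeholders I made up for illustration; they are not the field names from the draft.

```python
from datetime import datetime, timezone


def escape_sd_value(value: str) -> str:
    """Escape the three characters RFC 5424 requires inside a PARAM-VALUE."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("]", "\\]")


def format_sd_element(sd_id: str, params: dict) -> str:
    """Build one structured-data element: [SD-ID name="value" ...]."""
    pairs = "".join(
        f' {name}="{escape_sd_value(str(val))}"' for name, val in params.items()
    )
    return f"[{sd_id}{pairs}]"


def format_syslog(facility, severity, hostname, app, procid, msgid, sd, msg,
                  timestamp=None):
    """Assemble header, structured data, and free-text message."""
    pri = facility * 8 + severity          # PRI encodes facility and severity
    ts = (timestamp or datetime.now(timezone.utc)).isoformat()
    return f"<{pri}>1 {ts} {hostname} {app} {procid} {msgid} {sd} {msg}"


# Hypothetical IDs attached to a single log record.
sd = format_sd_element("example@32473", {"requestId": "r-1001", "guestId": "vm-17"})
print(format_syslog(16, 6, "web01", "shop", "4242", "REQ", sd, "order placed"))
```

The point is that the IDs travel as named key-value pairs next to the free-text message, so a consumer can correlate on them without parsing the message itself.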
Following is a more detailed analysis of the draft proposal. I am starting with a generic analysis of the necessity for such a standard and how this draft positions itself:
Section 3.2 outlines the motivation and objective for the proposed standard. The section outlines the problem of attributing ‘user requests’ to physical machine instances. This is not a problem that is unique to cloud installations. It’s a problem that was introduced through virtualization. The section fails to mention a real challenge and use-case for defining a cloud-based logging standard.
The motivation, if loosely interpreted, talks about operational and security challenges because of a lack of information in the logs, which leads to problems of attribution (see last paragraph). The section fails to identify supporting use-cases that link the draft and proposed solution to the security and operational challenges. More detail is definitely needed here. The draft suggests the introduction of user IDs to (presumably) solve this problem. What is the relationship between the two? [See below where I argue that something like a guest ID or a hypervisor ID is needed to identify the individual components]
One more detail about section 3.2. It talks about how operating system (“Linux or Windows VMs”) log files will very likely be irrelevant since one cannot tie those logs to the physical entities. This is absolutely not true. Why would one need to be able to tie these logs to physical machines? If the virtual CPU runs at 100%, that is a problem. No need to relate that back to the physical hardware. It’s irrelevant. A discussion of layers (see below) would help a lot here and it would show that the stated problems are in fact nonexistent. Also, why would I need to know how many users (including their roles) [quote from the draft] share the same hardware? What does that matter? I can rely completely on my virtual instances and plan load accordingly!
The proposal needs to differentiate different layers of information, which correlate with different layers where logs can be generated. There is the physical layer, the virtualization layer which is generally also called the hypervisor, then there is the guest operating system and then there are applications running inside of the guest operating system. The proposal does not mention any of these layers and does not outline how these layers interact. Especially with regards to sharing IDs across these layers, a discussion is needed. The layered model would also help to identify real problems and use-cases, which the draft fails to do.
The proposal fails to define the ‘cloud’ at all, although the term is used in the title of the draft. It is not clear whether SaaS, PaaS, or IaaS is the target of this draft. If all of the above, there should be a discussion of such, which includes how the information is shared in those environments (the IDs).
Following is a more detailed analysis and questions about the proposed approach by using various IDs to track requests:
If an AID was useful, which the draft still has to motivate, how is that ID passed between different layers in the application stack? Who generates it? How does it help solve the initially stated problem of operational and security related visibility and accountability? What is being used today in many applications is the UNIQUE_ID that can be generated by a Web server when receiving the request (see Apache UNIQUE_ID). That value can then be passed around. However, operating system resources and log entries cannot be tied uniquely to an application request. OS resources are generally shared across applications and it is not possible to attribute them to a specific application, or request. The proposed approach of using an AID is not a solution for the initially stated problem.
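Here is a minimal sketch of what "passing the value around" looks like inside an application. A random UUID stands in for the ID a Web server would hand over (this is not how Apache computes UNIQUE_ID), and the layer names and functions are hypothetical.

```python
import uuid


def new_request_id() -> str:
    """Generate an ID once, at the point where the request enters the system."""
    return uuid.uuid4().hex


def log_line(layer: str, request_id: str, message: str) -> str:
    """Every layer writes the same ID into its own log record."""
    return f"layer={layer} request_id={request_id} msg={message!r}"


def query_database(request_id: str, sql: str) -> list:
    # The ID arrives as an explicit argument; the database tier never invents one.
    return [log_line("db", request_id, sql)]


def handle_request(path: str) -> list:
    request_id = new_request_id()
    lines = [log_line("web", request_id, f"GET {path}")]
    lines += query_database(request_id, "SELECT * FROM orders")
    return lines


for line in handle_request("/orders"):
    print(line)
```

Note what the sketch cannot do: the ID only reaches code that the application calls explicitly. Nothing here gets it into the kernel, the hypervisor, or a router, which is exactly the gap discussed above.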
Section 3.1 outlines a generic problem statement for log management. Why is this important for this draft? There is no relationship to the rest of the draft. In addition, the section talks about routers, firewalls, network devices, applications, etc. How are you suggesting these devices share a common ID? There needs to be a protocol to exchange these IDs or you need a way to generate the IDs based on request attributes. I do not see any discussion of this in the draft. A router will definitely not include such an ID. The processing needed is way too expensive and would likely need application layer parsing to do so. Again, the problem statement needs rewriting and rethinking.
What is the transit field (Section 4.2)? It is not motivated, nor discussed anywhere.
In general, it seems like the proposed set of fields is a random collection. How do we know that there are not more important fields that are missing? And what guarantees that the existing fields are good candidates to solve the stated problem? (Again, the draft needs to outline a real problem it is trying to solve. What is stated in the current draft is not sufficient.)
The client entity (Section 4.2.1) is being defined as either an IP address or a FQDN. From a consumer’s perspective, this can be very troublesome. If in some cases a FQDN is logged and in others an IP, in order to correlate the two entities, a DNS lookup has to be performed. If this happens at the time of correlation and not at the time of log generation, the IP to FQDN mapping might have changed. This could result in a false correlation of two unrelated events!
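One way around this, sketched below, is to resolve at log generation time and record both values, so that nobody has to do a lookup later. The field names `client_ip` and `client_fqdn` are my own placeholders, not taken from the draft.

```python
import socket


def reverse_lookup(ip: str):
    """Resolve an IP to a name now, while the mapping is still current."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        # No PTR record or resolver failure: record that the name is unknown.
        return None


def client_entity(ip: str, resolver=reverse_lookup) -> dict:
    """Always log the IP, and add the FQDN as it resolved at this moment."""
    return {"client_ip": ip, "client_fqdn": resolver(ip)}


# Example with a stub resolver so the result does not depend on live DNS.
print(client_entity("192.0.2.10", resolver=lambda ip: "host10.example.com"))
```

The IP is always present and is the correlation key; the name is a snapshot of the mapping at the time of the event.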
I would like to point out that the ‘cloud’, be that SaaS, PaaS, or IaaS, does not require a new logging standard! We had multi-tier, as well as virtualized architectures for years and they are the real building blocks of the ‘cloud’. None of the cloud-specific attributes, like elasticity, utility-based payment, etc. require anything specific from a logging point of view. If anything, we need a logging standard that can help with virtualized and highly asynchronous, and distributed architectures. But these are not issues that a logging standard should have to deal with. It’s the infrastructure that has to make these trackers or IDs available. For a complete logging standard, have a look at CEE, where multiple different building blocks are being put in place to solve all kinds of well motivated problems associated with interchange of messages, which result in log records.
I urge the working group to not move ahead with anything like a cloud-logging standard. The cloud is nothing special. Rather, CEE (cee.mitre.org) should be leveraged and possibly extended to take into account virtualization use-cases. This draft has a lot of logical flaws, motivational shortcomings, and a lot of inconsistencies. What is needed are communication capabilities and standards that help extract and exchange information between the different layers in the application or cloud stack. The application should be able to get information on which guest it is running in (something like a guest ID) and the machine it runs on. That way, visibility is created. However, this has nothing to do with a logging standard!
The last couple of months have been pretty busy. I have been really bad about updating my personal blog here, but I have not been lazy. Among other things, I have been traveling a lot to attend a number of conferences. Here is a little summary of what’s been going on:
The Security visualization predictions post was motivated by a panel I was on at the SANS Incident Detection Summit in D.C. early December. Here are the slides for my panel discussion.
One of the topics I have been talking about lately is Cloud Security. The slides linked here are from a presentation I gave in Mexico.
The other podcast I recorded was together with Kord and Gary for The Cloud Computing Show. We talked about all kinds of things. Mainly about Loggly and logging in the cloud. Here is the mp3.
I also dug out the log maturity scale again. After mentioning it at the SANS logging summit, I got a lot of great responses on it.
The other day, one of my Google alerts surfaced this DefCon video of me talking about security visualization. It’s probably one of my first conference appearances. Is it?
And finally, 2011 started with a trip to Kauai where I presented a paper on insider threat visualization. Unfortunately, the paper is not publicly available. Email me if you want a copy.
As you are probably aware, you find my speaking schedule and slides on my personal page. That’s a good way of tracking me down. And in case you haven’t found it yet, I have a slideshare account where I try to share my presentations as well.
It’s time for a quick re-hash of recent publications and happenings in my little logging world.
First and foremost, Loggly is growing and we have around 70 users on our private beta. If you are interested in testing it out, signup online and email or tweet me.
I recorded two podcasts lately. The first one was around Logging As A Service. Check out my blog post over on Loggly’s blog to get the details.
I have been writing a little lately. I got three academic papers accepted at conferences. The one I am most excited about is the Cloud Application Logging for Forensics one. It is really applicable to any application logging effort. If you are developing an application, you should have a look at this. It talks about logging guidelines, a logging architecture and gives a bunch of very specific tips on how to go about logging. The other two papers are on insider threat and visualization: “Visualizing the Malicious Insider Threat”
I have discussed the topic of logging standards multiple times on this blog. Some recent developments in the logging space urged me to give an update and provide my opinion:
Yet another vendor just released a “standard” log format (note the quotes around standard). It’s called UCF, the Universal Collection Framework™ (UCF). This is how the vendor describes it:
UCF is the first WAN-aware, store-and-forward, encrypted, compressed IT data transport. It allows customers to gather IT data, increase resilience, reduce network chatter and encrypt from almost any device, anywhere, quickly and easily. UCF leverages a new transport and store protocol that LogLogic intends to open source in the near future.
Sounds a whole lot like syslog. (syslog-ng and rsyslog seem to support exactly this!) Okay, let’s just look at this description: WAN aware? What the heck is that supposed to mean? You mean it won’t work well on a LAN? Does that mean it knows the Internets? That’s just a strange description to start with. Oh, and it’s the first property mentioned! The rest of the description sounds like a transport protocol. Interesting. Why not stick with syslog, which is well known, has proven to work, and has integration libraries built already? I never understood why vendors implemented their own transport protocols. They are hard (very hard) to implement and even harder for producers and consumers to adopt. Oh well.
When people talk about UCF, they keep bringing up ArcSight’s CEF. Well, I am greatly responsible for that specification. But guess what? It’s not a transport protocol! It’s a syntax definition. It tells a log producer how to format their log file. Not how to transport it. Because, there is always syslog that a lot of machines have installed already and it’s easy to use. (And in newer versions you get encryption, caching, etc.).
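To make the difference between syntax and transport concrete, here is a minimal sketch of a CEF formatter, assuming the header layout from the published CEF specification (version, vendor, product, device version, signature ID, name, severity, then key=value extensions). The vendor, product, and event values are made up, and the sketch does not cover every escaping rule in the spec (newlines in extension values, for instance).

```python
def _escape_header(value) -> str:
    # In header fields, backslash and pipe have to be escaped.
    return str(value).replace("\\", "\\\\").replace("|", "\\|")


def _escape_extension(value) -> str:
    # In extension values, backslash and the equals sign have to be escaped.
    return str(value).replace("\\", "\\\\").replace("=", "\\=")


def format_cef(vendor, product, version, signature_id, name, severity, extension):
    """Return one CEF-formatted line; how it gets shipped is a separate concern."""
    header = "|".join(
        _escape_header(field)
        for field in (vendor, product, version, signature_id, name, severity)
    )
    ext = " ".join(f"{key}={_escape_extension(val)}" for key, val in extension.items())
    return f"CEF:0|{header}|{ext}"


print(format_cef("Acme", "Firewall", "1.0", "100", "Connection blocked", 5,
                 {"src": "10.0.0.1", "dst": "10.0.0.2", "spt": 1232}))
```

The function only returns a string. Whether that string is written to a file or handed to the local syslog daemon is left entirely to the producer.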
Now, my last point about standards. Why do vendors keep trying to come up with standards by themselves? It just doesn’t make any sense. Who is going to adopt it? At ArcSight, about 4 years ago, we came up with CEF because CEE didn’t move fast enough and we wanted something that our partners could easily use. An analyst wrote that ArcSight is planning to take CEF to the IETF. I hope they are not going to do that. I don’t have any control over that anymore, but that would be stupid. We should rather push CEE through the IETF. If you have a chance, compare the CEE syntax proposal with CEF. Notice something? Yes. It’s very similar. Again, I might have had something to do with that. Anyways. Vendors should not define logging standards!
On a good note: CEE is moving forward and just released the architecture overview for public commentary. Check them out!
Last week I posted the introductory video for a talk that I gave at Source Boston in 2008. I just found the entire video of that talk. Enjoy:
Talk by Raffael Marty:
With the ever-growing amount of data collected in IT environments, we need new methods and tools to deal with them. Event and Log Analysis is becoming one of the main tools for analysts to investigate and comprehend the state of their networks, hosts, applications, and business processes. Recent developments, such as regulatory compliance and an increased focus on insider threat have increased the demand for analytical tools to help in the process. Visualization is offering a new, more effective, and simpler approach to data analysis. To date, security visualization, has mostly failed to deliver effective tools and methods. This presentation will show what the New York Times has to teach us about effective visualizations. Visualization for the masses and not visualization for the experts. Insider Threat, Governance, Risk, and Compliance (GRC), and Perimeter Threat all require effective visualization methods and they are right in front of us – in the newspaper.
I was giving a talk at SOURCEBoston 2008. The topic this time was general visualization and what has gone wrong in security visualization in the past. I showed how we can learn and steal from other disciplines, in this case, the New York Times. The NYT has done some pretty fantastic work in the area of data visualization. Their interactive market map, for example, is a great way of exploring stock data. During the talk, I outlined some of the design principles that the NYT graphics department is using when they are designing their graphs: Show – Don’t Tell.
To start my presentation, I showed a little video about security visualization (see below).
A rehash of an old blog post from February 2008. I thought it would make sense to give a quick update on CEE and put the link to the public discussion archives here again:
Well well well… I get so many questions from people about CEE. Where is it at, when does it come out, what will it cover? To be honest, I don’t quite know. I have some answers. We have been working really hard on getting a syntax and a taxonomy working draft written up. I think it’s more than just a working draft. It will be a really well thought through starting point for the final standard around log syntax and taxonomy. For years (I wish this wasn’t literal, but it is), we have been working on this now. It took quite some time to get everyone on the CEE board to run in the same direction. I can’t promise any timeline for publication, but I hope it’s close.
In the meantime, if you are interested in the public discussions around CEE, the public discussion archives are available online.
I have also been working on an application logging paper that I just submitted to USENIX. If you are interested in how we implemented logging at Loggly and want to look at the paper, drop me a line, maybe I will share it.
The following blog post was originally posted in December 2008. I updated it slightly to fit current times:
The following blog post has turned into more than just a post. It’s more of a paper. In any case, in the post I am trying to capture a number of concepts that are defining the log management and analysis market (as well as the SIEM or SEM markets).
Any company or IT department/operation can be placed along the maturity scale (see Figure 1). The further on the right, the more mature the operations with regards to IT data management. A company generally moves along the scale. A movement to the right does not just involve the purchase of new solutions or tools, but also needs to come with a new set of processes. Products are often necessary but are not a must.
The further one moves to the right, the fewer companies or IT operations can be found operating at that scale. Also note that the products that companies use are called log management tools for the ones located on the left side of the scale. In the middle, it is the security information and event management (SIEM) products that are being used, and on the right side, companies have to look at either in-house tools, scripts, or in some cases commercial tools in markets other than the security market. Some SIEM tools are offering basic advanced analytics capabilities, but they are very rudimentary. The reason why there are no security specific tools and products on the right side becomes clear when we understand a bit better what the scale encodes.
Figure 1: IT Data Management Maturity Scale.
The Maturity Scale
Let us have a quick look at each of the stages on the scale. (Skip over this if you are interested in the conclusions and not the details of the scale.)
Do nothing: I didn’t even explicitly place this stage on the scale. However, there are a great many companies out there that do exactly this. They don’t collect data at all.
Collecting logs: At this stage of the scale, companies are collecting some data from a few data sources for retention purposes. Sometimes compliance is the driver for this. You will mostly find things like authentication logs or maybe message logs (such as email transaction logs or proxy logs). The number of different data sources is generally very small. In addition, you mostly find log files here. No more specific IT data, such as multi-line applications logs or configurations. A new trend that we are seeing here is the emergence of the cloud. A number of companies are looking to move IT services into the cloud and have them delivered by service providers. The same is happening in log management. It doesn’t make sense for small companies to operate and maintain their own logging solutions. A cloud-based offering is perfect for those situations.
Forensics / Troubleshooting: While companies in the previous stage simply collect logs for retention purposes, companies in this stage actually make use of the data. In the security arena they are conducting forensic investigations after something suspicious was noticed or a breach was reported. In IT operations, the use-case is troubleshooting. Take email logs, for example. A user wants to know why he did not receive a specific email. Was it eaten by the SPAM filter or is something else wrong?
Save searches: I don’t have a better name for this. In the simplest case, someone saves the search expression used with a grep command. In other cases, where a log management solution is used, users are saving their searches. At this stage, analysts can re-use their searches at a later point in time to find the same type of problems again, without having to reconstruct the searches every single time.
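In its simplest form, a saved search is just a named expression that can be re-run against new data. A minimal sketch, with a made-up search name and sample log lines:

```python
import re

# Named search expressions, saved once and re-used later.
SAVED_SEARCHES = {
    "failed_ssh_logins": r"sshd\[\d+\]: Failed password for",
}


def run_saved_search(name, lines, searches=SAVED_SEARCHES):
    """Return every log line that matches the saved expression."""
    pattern = re.compile(searches[name])
    return [line for line in lines if pattern.search(line)]


sample = [
    "Jan 10 10:00:01 web01 sshd[4242]: Failed password for root from 10.0.0.5 port 22 ssh2",
    "Jan 10 10:00:02 web01 sshd[4242]: Accepted password for alice from 10.0.0.6 port 22 ssh2",
]
for hit in run_saved_search("failed_ssh_logins", sample):
    print(hit)
```

Sharing searches, the next stage, then amounts to nothing more than putting that dictionary somewhere the whole team can read it.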
Share searches: If a search is good for one analyst, it might be good for another one as well. Analysts at some point start sharing their ways of identifying a certain threat or analyzing a specific IT problem. This greatly improves productivity.
Reporting: Analysts need reports. They need reports to communicate findings to management. Sometimes they need reports to communicate among each other or to communicate with other teams. Generally, the reporting capabilities of log management solutions are fairly limited. They are extended in the SEM products.
Alerting: This capability lives in somewhat of a gray-zone. Some log management solutions provide basic alerting, but generally, you will find this capability in a SEM. Alerting is used to automate some of the manual trouble-shooting that is done among companies on the left side of the scale. Instead of waiting for a user to complain that there is something wrong with his machine and then looking through the log files, analysts are setting up alerts that will notify them as soon as there are known signs of failures showing up. Things like monitoring free disk space are use-cases that are automated at this point. This can save a lot of manual labor and help drive IT towards a more automated and pro-active discipline.
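The free disk space case can be sketched in a few lines. The 10% threshold and the message format are arbitrary choices for illustration; a real setup would send the alert somewhere instead of returning a string.

```python
import shutil


def free_fraction(path: str = "/") -> float:
    """Fraction of the file system holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def check_disk(path: str = "/", min_free: float = 0.10, probe=free_fraction):
    """Return an alert message if free space drops below the threshold."""
    free = probe(path)
    if free < min_free:
        return f"ALERT: only {free:.0%} free on {path}"
    return None  # nothing to report


alert = check_disk("/")
if alert:
    print(alert)
```

Run on a schedule, this replaces the "user complains, analyst greps" loop with a notification that arrives before the disk is full.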
Collecting more logs and IT data: More data means more insight, more visibility, broader coverage, and more uses. For some use-cases we now need new data sources. In some cases it’s the more exotic logs, such as multi-line application logs, instant messenger logs, or physical access logs. In addition more IT data is needed: configuration files, host status information, such as open ports or running processes, ticketing information, etc. These new data sources enable a new and broader set of use-cases, such as change validation.
Correlation: The manual analysis of all of these new data sources can get very expensive and too resource intensive. This is where SEM solutions can help automate a lot of the analysis. Typical uses include correlating trouble tickets with file changes, or correlating IDS data with operating system logs. (Note that I didn’t say IDS and firewall logs!) There is much much more to correlation, but that’s for another blog post.
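As a small illustration of the ticket and file change case, the sketch below flags every file change that is not covered by a change ticket for the same host and time window. The record layout (`host`, `start`, `end`, `time`, `path`) is hypothetical; real ticketing and file integrity data will look different.

```python
from datetime import datetime


def unapproved_changes(changes, tickets):
    """Return file changes with no matching ticket for that host and time."""
    flagged = []
    for change in changes:
        covered = any(
            ticket["host"] == change["host"]
            and ticket["start"] <= change["time"] <= ticket["end"]
            for ticket in tickets
        )
        if not covered:
            flagged.append(change)
    return flagged


tickets = [
    {"host": "web01", "start": datetime(2011, 1, 10, 9), "end": datetime(2011, 1, 10, 11)},
]
changes = [
    {"host": "web01", "path": "/etc/httpd.conf", "time": datetime(2011, 1, 10, 10)},
    {"host": "web01", "path": "/etc/passwd", "time": datetime(2011, 1, 10, 23)},
]
for change in unapproved_changes(changes, tickets):
    print("no ticket for", change["host"], change["path"])
```

This is the change validation use-case from the previous stage: it only becomes possible once both data sources are collected in one place.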
Note the big gap between the last step and this one. It takes a lot for an organization to cross this chasm. Also note that the individual mile-stones on the right side are drawn fairly close to each other. In reality, think of this as a log scale. These mile-stones can be very very far apart. The distance here is not telling anymore.
Visual analysis: It is not very efficient to read through thousands of log messages and figure out trends or patterns, or even understand what the log entries are communicating. Visual analysis takes the textual information and packages it in an image that conveys the contents of the logs. For more information on the topic of security visualization see Applied Security Visualization.
Pattern detection: One could view this as advanced correlation. One wants to know about patterns. Is it normal that when the DNS server is doing a zone transfer that you will also find a number of IDS alerts along with some firewall log entries? If a user browses the Web, what is the pattern of log files that are normally seen? Pattern detection is the first step towards understanding an IT environment. The next step is to then figure out when something is an outlier and not part of a normal pattern. Note that this is not as simple as it sounds. There are various levels of maturity needed before this can happen. Just because something is different does not mean that it’s a “bad” anomaly or an outlier. Pattern detection engines need a lot of care and training.
Interactive visualization: Earlier we talked about simple, static visualization to better understand our IT data. The next step in the application of visualization is interactive visualization. This type of visualization follows the principle of: “overview first, zoom and filter, then details on demand.” This type of visualization along with dynamic queries (the next step) is incredibly important for advanced analysis of IT data.
Dynamic queries: The next step beyond interactive, single-view visualizations are multiple views of the same data. All of the views are linked together. If you select a property in one graph, the selection propagates to the others. This is also called dynamic queries. This is the gist of fast and efficient analysis of your IT data.
Anomaly detection: Various products are trying to implement anomaly detection algorithms in order to find outliers, or anomalous behavior in the IT environment. There are many approaches that people are trying to apply. So far, however, none of them has had broad success. Anomaly detection as it is known today is best understood for closed use-cases. For example, NBADs are using anomaly detection algorithms to flag interesting findings in network flows. As of today, nobody has successfully applied anomaly detection across heterogeneous data sources.
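To show what a closed use-case looks like, here is a deliberately naive sketch: one metric, one baseline, one threshold. The numbers are invented, and the three-standard-deviations rule is just a common textbook choice, not what any particular NBAD product does.

```python
from statistics import mean, pstdev


def is_outlier(baseline, value, threshold=3.0):
    """Flag a value that lies more than `threshold` std deviations from the mean."""
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any deviation at all is unusual.
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical flows per minute seen from one host during a normal week.
flows_per_minute = [100, 102, 98, 101, 99]
print(is_outlier(flows_per_minute, 150))
print(is_outlier(flows_per_minute, 101))
```

It works here because the data is homogeneous and the baseline is meaningful. Neither holds once firewall logs, application logs, and ticketing data are thrown together, which is why the heterogeneous case remains unsolved.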
Sharing views, patterns, and outliers: The last step on my maturity scale is the sharing of advanced analytic findings. If I know that certain versions of the Bind DNS server tend to trigger a specific set of Snort IDS alerts, it is something that others should know as well. Why not share it? Unfortunately, there are no products that allow us to share this knowledge.
While reading the maturity scale, note the gaps between the different stages. They signify how quickly after the previous step a new step sets in. If you were to look at the scale from a time-perspective, you would start an IT data management project on the left side and slowly move towards the right. Again, the gaps are fairly indicative of the relative time such a project would consume.
Related Quantities
The scale could be overlaid with lines showing some interesting, related properties. I decided to not do so in favor of legibility. Instead, have a look at Figure 2. It encodes a few properties: number of products on the market, number of customers / users, and number of data sources needed at that state of maturity.
Figure 2: The number of products, companies, and data sources that are used / available along the maturity scale.
Why are so few products on the right side of the scale? The most obvious reason is one of market size. There are not many companies on the right side. Hence there are not many products. It is sort of a chicken-and-egg problem. If there were more products, there might be more companies using them – maybe. However, there are more reasons. One of them being that in order to get to the right side, a company has to traverse the entire scale on the left. This means that the potential market for advanced analytics is the number of companies that linger just before the advanced analytics market itself. That market is a very small one. The next question would be why there are not more companies close to the advanced analytics stage? There are multiple reasons. Some of them are:
Not many environments manage to collect enough data to implement advanced analytics across heterogeneous data. Too many environments are stuck with just a few data sources. There are organizational, architectural, political, and technical reasons why this is so.
A lack of qualified people (engineers, architects, etc) is another reason. Not many companies have the staff that understands how to deal with all the data collected. Not many people understand how to interpret the vast amount of different data sources.
The effects of these phenomena play yet again into the availability of products for the advanced analytics side of the scale. Because there are not many environments that actually collect a diverse set of IT data, companies (or academia) cannot conduct research on the subject. And if they do, they mostly get it wrong or capture just a very narrow use-case.
What Else Does the Maturity Scale Tell Us?
Let us have a look at some of the other things that we can learn from/should know about the maturity scale:
What does it mean for a company to be on the far right of the scale?
In-depth understanding of the data
Understanding of how to apply advanced analytics (such as visualization theory, anomaly detection, etc.)
Baseline of the behavior in the organization’s environment (needed for example for anomaly detection)
Understanding of the context of the data gathered, such as what’s the network topology, what are the properties of the assets, etc.
Have to employ knowledgeable people. These experts are scarce and expensive.
Collecting all log data, which is hard!
What are some other preconditions to live on the right side?
A mature change management process
Asset management
IT infrastructure documentation
Processes to deal with the findings/intelligence from advanced analytics
A security policy that tells what is allowed and intended and what is not. (Have you ever put a sniffer on the network to see what traffic there is? Did you understand all of it? This is pretty much the same thing, you put a huge sniffer on your IT environment and try to explain everything. Wow!)
Understand the environment to the point where questions like: “What’s really normal?” are answered quickly. Don’t be fooled. This is nearly impossible. There are so many questions that need to be answered, such as: “Is a DNS server that generates ICMP messages every now and then an anomaly? Is it a security problem? What is the payload of the ICMP message? Maybe an information leak?”
What’s the return on investment (ROI) for living on the right-side of the scale?
It’s just not clear!
Isn’t it cheaper to ignore than to discover?
What do you intend to find and what will you find?
So, what’s the ROI? It’s hard to measure, but you will be able to:
Detect problems earlier
Uncover attacks and policy violations quicker
Prevent information leaks
Reduce down-time of infrastructure and applications
Reduce labor of service desk and system administration