September 17, 2014

AfterGlow 1.6.5 – Edge Labels

Filed under: Log Analysis,Programming,Visualization — Raffael Marty @ 5:32 am

A new version of AfterGlow is ready. Version 1.6.5 has a couple of improvements:

1. If you have an input file which only has two columns, AfterGlow now automatically switches to a two-node mode. You don’t have to use the (-t) switch explicitly anymore in this case! (I know, it’s about time I added this)

2. Very minor change, but something that kept annoying me over time is the default edge length. It was set to 3 initially and now it’s reduced to 1.5, which makes for a bit more compact graphs. You can still change this with the -e switch on the command line.

3. The major change is the addition of edge labels, though. Here is a quick example:


This assumes that the third column of your data contains the label for the data. In the example below, the port numbers:

,,53
,,80

When you run afterglow, use the -t switch to have it render only two nodes, but given the configuration above, we are using the third column as the edge label. The output will look like this:



As you can see, the same edge is defined twice in the data, with two different labels (port 53 and 80). If you want the graph to show both edges, add the following configuration to the configuration file:


Which then results in the following graph:



Note that the duplicating of edges only works with GDF files (-k). The edge labels work in DOT and GDF files, not in GraphSON output.
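
To make the edge label idea a bit more concrete, here is a minimal Python sketch of how three-column data can end up as labelled edges in DOT. This is not AfterGlow's code and not its exact output; the function name and the hostA/hostB values are made up for illustration, and the only assumption is the "source,target,label" column layout described above.

```python
import csv
import io

def csv_to_dot(text):
    """Turn 'source,target,label' rows into a DOT digraph with labelled edges."""
    lines = ["digraph structs {"]
    seen = set()
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        src, dst = row[0], row[1]
        label = row[2] if len(row) > 2 else ""
        edge = (src, dst, label)
        if edge in seen:
            continue  # identical edge with identical label was already emitted
        seen.add(edge)
        attr = f' [label="{label}"]' if label else ""
        lines.append(f'  "{src}" -> "{dst}"{attr};')
    lines.append("}")
    return "\n".join(lines)

# Hypothetical sample: the same edge twice, with two different labels
sample = "hostA,hostB,53\nhostA,hostB,80\n"
print(csv_to_dot(sample))
```

Because the label is part of what makes an edge unique in this sketch, the two rows produce two separate edges, which is the behavior the duplicate-edge configuration is meant to give you.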

January 19, 2014

A New and Updated Field Dictionary for Logging Standards

Filed under: Uncategorized — Raffael Marty @ 2:51 pm

If you have been following event interchange formats or logging standards, you know of CEF and CEE. The problem is that we lost funding for CEE, which doesn’t mean that CEE is dead! In fact, I updated the field dictionary to accommodate some more use-cases and data sources. The one currently published by CEE is horrible. Don’t use it. Use my new version!

Whether you are using CEE or any other logging standard for your message formatting, you will need a naming schema; a set of field names. In CEE we call that a field dictionary.

The problem with the currently published field dictionary of CEE is that it’s inconsistent, has duplicate field names, and is missing a bunch of field names that you commonly need. I updated and cleaned up the dictionary (see below or download it here.) Please email me with any feedback / updates / additions! This is by no means complete, but it’s a good next iteration to keep improving on! If you know and use CEF, you can use this new dictionary with it. The problem with CEF is that it has to use ArcSight’s very limited field schema. And you have to overload a bunch of fields. So, try using this schema instead!

I was emailing with my friend Jose Nazario the other day and realized that we never really published anything decent on the event taxonomy either. My next task is to gather whatever I can find in notes and such and put together an updated version of the taxonomy with my latest thinking, which has evolved quite a bit in the last 12 years that I have been building event taxonomies (starting with the ArcSight categorization schema, then Splunk’s Common Information Model, and then designing the CEE taxonomy). Stay tuned for that.

For reference purposes, here are some spin-offs from CEE which have field dictionaries as well:

Here is the new field dictionary:

Object Field Type Description
action STRING Action taken
bytes_received NUMBER Bytes received
bytes_sent NUMBER Bytes sent
category STRING Log source assigned category of message
cmd STRING Command
duration NUMBER Duration in seconds
host STRING Hostname of the event source
in_interface STRING Inbound interface
ip_proto NUMBER IP protocol field value (17=UDP, …)
msg STRING The event message
msgid STRING The event message identifier
out_interface STRING Outbound interface
packets_received NUMBER Number of packets received
packets_sent NUMBER Number of packets sent
reason STRING Reason for action taken or activity observed
rule_number STRING Number of rule – firewalls, for example
subsys STRING Application subsystem responsible for generating the event
tcp_flags STRING TCP flags
tid NUMBER Numeric thread ID associated with the process generating the event
time DATETIME Event Start Time
time_logged DATETIME Time log record was logged
time_received DATETIME Time log record was received
vend STRING Vendor of the event source application
app name STRING Name of the application that generated the event
app session_id STRING Session identifier from application
app vend STRING Application vendor
app ver STRING Application version
dst country STRING Country name of the destination
dst host STRING Network destination hostname
dst ipv4 IPv4 Network destination IPv4 address
dst ipv6 IPv6 Network destination IPv6 address
dst nat_ipv4 IPv4 NAT IPv4 address of destination
dst nat_ipv6 IPv6 NAT IPv6 destination address
dst nat_port NUMBER NAT port number for destination
dst port NUMBER Network destination port
dst zone STRING Zone name for destination – examples: Bldg1, Europe
file line NUMBER File line number
file md5 STRING File MD5 Hash
file mode STRING File mode flags
file name STRING File name
file path STRING File system path
file perm STRING File permissions
file size NUMBER File size in bytes
http content_type STRING MIME content type within HTTP
http method STRING HTTP method – GET | POST | HEAD | …
http query_string STRING HTTP query string
http request STRING HTTP request URL
http request_protocol STRING HTTP protocol used
http status NUMBER Return code in HTTP response
palo_alto actionflags STRING Palo Alto Networks Firewall Specific Field
palo_alto config_version STRING Palo Alto Networks Firewall Specific Field
palo_alto cpadding STRING Palo Alto Networks Firewall Specific Field
palo_alto domain STRING Palo Alto Networks Firewall Specific Field
palo_alto log_type STRING Palo Alto Networks Firewall Specific Field
palo_alto padding STRING Palo Alto Networks Firewall Specific Field
palo_alto seqno STRING Palo Alto Networks Firewall Specific Field
palo_alto serial_number STRING Palo Alto Networks Firewall Specific Field
palo_alto threat_content_type STRING Palo Alto Networks Firewall Specific Field
palo_alto virtual_system STRING Palo Alto Networks Firewall Specific Field
proc id STRING Process ID (pid)
proc name STRING Process name
proc tid NUMBER Thread identifier of the process
src country STRING Country name of the source
src host STRING Network source hostname
src ipv4 IPv4 Network source IPv4 address
src ipv6 IPv6 Network source IPv6 address
src nat_ipv4 IPv4 NAT IPv4 address of source
src nat_ipv6 IPv6 NAT IPv6 address of source
src nat_port NUMBER NAT port number for source
src port NUMBER Network source port
src zone STRING Zone name for source – examples: Bldg1, Europe
syslog fac NUMBER Syslog facility value
syslog pri NUMBER Syslog priority value
syslog pri STRING Event priority (ERROR|WARN|DEBUG|CRIT)
syslog sev NUMBER Event severity
syslog tag STRING Syslog Tag value
syslog ver NUMBER Syslog Protocol version (0=legacy/RFC3164; 1=RFC5424)
user auid STRING Source User login authentication ID (login id)
user domain STRING User account domain (NT Domain)
user eid STRING Source user effective ID (euid)
user gid STRING Group ID (gid)
user group STRING Group name
user id STRING User account ID (uid)
user name STRING User account name
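
To show how the dictionary might be used in practice, here is a small Python sketch that checks an event against a handful of the entries above. Representing the Object column as a nested key (for example `src` containing `port`) is an assumption made for this sketch, not something the dictionary prescribes, and the event values are made up.

```python
# A few entries copied from the dictionary above: (object, field) -> type
FIELD_TYPES = {
    (None, "action"): "STRING",
    (None, "bytes_sent"): "NUMBER",
    ("src", "ipv4"): "IPv4",
    ("src", "port"): "NUMBER",
    ("dst", "ipv4"): "IPv4",
    ("dst", "port"): "NUMBER",
    ("user", "name"): "STRING",
}

def check_event(event):
    """Return a list of problems: unknown fields or NUMBER fields that are not numeric."""
    problems = []
    for key, value in event.items():
        if isinstance(value, dict):   # object with sub-fields, e.g. src.port
            items = [((key, f), v) for f, v in value.items()]
        else:                         # top-level field without an object
            items = [((None, key), value)]
        for name, v in items:
            ftype = FIELD_TYPES.get(name)
            label = ".".join(p for p in name if p)
            if ftype is None:
                problems.append(f"unknown field: {label}")
            elif ftype == "NUMBER" and (isinstance(v, bool) or not isinstance(v, (int, float))):
                problems.append(f"not a number: {label}")
    return problems

# Hypothetical event using the naming schema
event = {
    "action": "blocked",
    "bytes_sent": 0,
    "src": {"ipv4": "192.0.2.10", "port": 51515},
    "dst": {"ipv4": "198.51.100.7", "port": 80},
    "user": {"name": "alice"},
}
print(check_event(event))
```

The point is only that a shared set of field names and types lets you validate and compare events from different sources with the same few lines of code.
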
October 25, 2013

Using Impala and Parquet to Analyze Network Traffic – VAST 2013 Challenge

Filed under: Log Analysis — Raffael Marty @ 4:51 pm

As I outlined in my previous blog post on How to clean up network traffic logs, I have been working with the VAST 2013 traffic logs. Today I am going to show how you can load the traffic logs into Impala (with a Parquet table) for very quick querying.

First off, Impala is a real-time SQL query engine for Hadoop (i.e., it queries data in Hive/HDFS). So, scalable, distributed, etc. In the following I am assuming that you have Impala installed already. If not, I recommend you use the Cloudera Manager to do so. It’s pretty straightforward.

First we have to load the data into Impala, which is a two-step process. We are using external tables, meaning that the data will live in files on HDFS. What we have to do is get the data into HDFS first and then load it into Impala:

$ sudo su - hdfs
$ hdfs dfs -put /tmp/nf-chunk*.csv /user/hdfs/data

We first become the hdfs user, then copy all of the netflow files from the MiniChallenge into HDFS at /user/hdfs/data. Next up we connect to impala and create the database schema:

$ impala-shell
create external table if not exists logs (
	TimeSeconds double,
	parsedDate timestamp,
	dateTimeStr string,
	ipLayerProtocol int,
	ipLayerProtocolCode string,
	firstSeenSrcIp string,
	firstSeenDestIp string,
	firstSeenSrcPort int,
	firstSeenDestPor int,
	moreFragment int,
	contFragment int,
	durationSecond int,
	firstSeenSrcPayloadByte bigint,
	firstSeenDestPayloadByte bigint,
	firstSeenSrcTotalByte bigint,
	firstSeenDestTotalByte bigint,
	firstSeenSrcPacketCoun int,
	firstSeenDestPacketCoun int,
	recordForceOut int)
row format delimited fields terminated by ',' lines terminated by '\n'
location '/user/hdfs/data/';

Now we have a table called ‘logs’ that contains all of our data. We told Impala that the data is comma separated and told it where the data files are. That’s already it. What I did on my installation is leverage the columnar Parquet format that Impala supports to speed queries up. A lot of analytic queries don’t really suit the row-oriented layout of traditional databases. A columnar layout is much better suited, because a query only has to read the columns it references. Therefore we are creating a Parquet-based table:

create table pq_logs like logs stored as parquetfile;
insert overwrite table pq_logs select * from logs;

The second command is going to take a bit as it loads all the data into the new Parquet table. You can now issue queries against the pq_logs table and you will get the benefits of a columnar data store:

select distinct firstseendestpor from pq_logs where morefragment=1;
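
If the row versus column distinction feels abstract, the following Python sketch illustrates it with in-memory lists. It is only an analogy for how a columnar store groups values, not a description of Parquet's actual file layout, and the three records are made up.

```python
# Row-oriented: one tuple per record; answering a query means walking whole rows.
rows = [
    (9130, 80, 0),
    (80, 14545, 0),
    (5353, 53, 1),
]

# Column-oriented: one list per column, so a query reads only the columns it names.
columns = {
    "firstseensrcport": [r[0] for r in rows],
    "firstseendestpor": [r[1] for r in rows],
    "morefragment": [r[2] for r in rows],
}

def distinct_dest_ports(cols):
    """Same question as the query above, touching only the two columns it needs."""
    pairs = zip(cols["firstseendestpor"], cols["morefragment"])
    return sorted({port for port, frag in pairs if frag == 1})

print(distinct_dest_ports(columns))
```

With 19 columns in the real table, a query like the one above only needs two of them, which is where the speedup comes from.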

Have a look at my previous blog entry for some more queries against this data.

October 22, 2013

Cleaning Up Network Traffic Logs – VAST 2013 Challenge

Filed under: Log Analysis — Raffael Marty @ 1:59 pm

I have spent some significant time with the VAST 2013 Challenge. I have been part of the program committee for a couple of years now and have seen many challenge submissions. Both good and bad. What I noticed with most submissions is that they a) didn’t really understand network data, and b) didn’t clean the data correctly. If you wanna follow along with my analysis, the data is here: Week 1 – Network Flows (~500MB)

Also check the follow-on blog post on how to load data into a columnar data store in order to work with it.

Let me help with one quick comment. There is a lot of traffic in the data that seems to be involving port 0:

$ cat nf-chunk1-rev.csv | awk -F, '{if ($8==0) print $0}'
1364803648.013658,2013-04-01 08:07:28,20130401080728.013658,1,OTHER,,,0,0,0,0,1,0,0,222,0,3,0,0

Just because it says port 0 in there doesn’t mean it’s port 0! Check out field 5, which says OTHER. That’s the transport protocol. It’s not TCP or UDP, so the port is meaningless. Field 4 holds the IP protocol number, and 1 means ICMP, so this is ICMP traffic!
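
Here is a minimal Python sketch of that check. It assumes the column order of the record shown above (IP protocol number in field 4, ports in fields 8 and 9), and the function names are made up for illustration.

```python
import csv
import io

def port_is_meaningful(row):
    """Ports only carry meaning for TCP (6) and UDP (17); field 4 is the IP protocol number."""
    return int(row[3]) in (6, 17)

def describe(row):
    proto = int(row[3])
    if proto == 1:
        return "ICMP, ignore ports"
    if port_is_meaningful(row):
        return f"ports {row[7]} -> {row[8]}"
    return f"IP protocol {proto}, ignore ports"

# The port 0 record from above
record = "1364803648.013658,2013-04-01 08:07:28,20130401080728.013658,1,OTHER,,,0,0,0,0,1,0,0,222,0,3,0,0"
row = next(csv.reader(io.StringIO(record)))
print(describe(row))
```

Anything that reports ports for a record like this one is reporting noise, so it pays to filter on the protocol before doing any port statistics.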

On to another problem with the data. Some of the sources and destinations are turned around in the traffic. This happens with network flow collectors. Look at these two records:

1364803504.948029,2013-04-01 08:05:04,20130401080504.948029,6,TCP,,,9130,80,0,0,0,176,409,454,633,5,4,0
1364807428.917824,2013-04-01 09:10:28,20130401091028.917824,6,TCP,,,80,14545,0,0,0,7425,0,7865,0,8,0,0

The first one is totally legitimate. The source port is 9130, the destination 80. The second record, however, has the source and destination turned around. Port 14545 is not a valid destination port and the collector just turned the information around.

The challenge is on you now to find which records are inverted and then you have to flip them back around. Here is what I did in order to find the ones that were turned around (Note, I am only using the first week of data for MiniChallenge1!):

select firstseendestport, count(*) c from logs group by firstseendestport order by c desc limit 20;
| 80               | 41229910 |
| 25               | 272563   |
| 0                | 119491   |
| 123              | 95669    |
| 1900             | 68970    |
| 3389             | 58153    |
| 138              | 6753     |
| 389              | 3672     |
| 137              | 2352     |
| 53               | 955      |
| 21               | 767      |
| 5355             | 311      |
| 49154            | 211      |
| 464              | 100      |
| 5722             | 98       |

What I am looking for here are the top destination ports. My theory being that most valid ports will show up quite a lot. This gave me a first candidate list of ports. I am looking for two things here. First, the frequency of the ports and second whether I recognize the ports as being valid. Based on the frequency I would put the ports down to port 3389 on my candidate list. But because all the following ones are well known ports, I will include them down to port 21. So the first list is:

80, 25, 0, 123, 1900, 3389, 138, 389, 137, 53, 21

I’ll drop 0 from this due to the comment earlier!
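
If you don't have a SQL engine at hand, the same frequency count can be sketched in a few lines of Python. The function name and the shortened sample rows are made up; the only assumption carried over from the data is that the destination port sits in the ninth field.

```python
import csv
from collections import Counter

def top_ports(lines, column, n=20):
    """Count values in one CSV column, like GROUP BY ... ORDER BY count DESC LIMIT n."""
    counts = Counter()
    for row in csv.reader(lines):
        if len(row) > column and row[column].isdigit():
            counts[int(row[column])] += 1  # non-numeric values such as a header row are skipped
    return counts.most_common(n)

# Made-up rows, truncated after the destination port for brevity
sample = [
    "t,d,s,6,TCP,,,9130,80",
    "t,d,s,6,TCP,,,9131,80",
    "t,d,s,6,TCP,,,4000,25",
]
print(top_ports(sample, column=8))
```

For the real files you would pass an open file handle instead of the list, and use `column=7` to get the source port counts used further down.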

Next up, let’s see what the top source ports are that are showing up.

| firstseensrcport | c       |
| 80               | 1175195 |
| 62559            | 579953  |
| 62560            | 453727  |
| 51358            | 366650  |
| 51357            | 342682  |
| 45032            | 288301  |
| 62561            | 256368  |
| 45031            | 227789  |
| 51359            | 180029  |
| 45033            | 157071  |
| 0                | 119491  |
| 45034            | 117760  |
| 123              | 95622   |
| 1984             | 81528   |
| 25               | 19646   |
| 138              | 6711    |
| 137              | 2288    |
| 2024             | 929     |
| 2100             | 927     |
| 1753             | 926     |

See that? Port 80 is the top source port showing up. Definitely a sign of a source/destination confusion. There are a bunch of others from our previous candidate list showing up here as well. All records where we have to turn source and destination around. But likely we are still missing some ports here.

Well, let’s see what other source ports remain:

select firstseensrcport, count(*) c from pq_logs2 group by firstseensrcport 
having firstseensrcport not in (0,123,138,137,80,25,53,21) 
order by c desc limit 10
| firstseensrcport | c      |
| 62559            | 579953 |
| 62560            | 453727 |
| 51358            | 366650 |
| 51357            | 342682 |
| 45032            | 288301 |
| 62561            | 256368 |
| 45031            | 227789 |
| 51359            | 180029 |
| 45033            | 157071 |
| 45034            | 117760 |

Looks pretty normal. Well, sort of, but let’s not digress. Let’s see if there are any ports below 1024 showing up. Indeed, port 20 shows up, which is a totally legitimate destination port. Pulling out the destination ports for those records shows what look like actual source ports:

| firstseensrcport | firstseendestport| c |
| 20               | 3100             | 1 |
| 20               | 8408             | 1 |
| 20               | 3098             | 1 |
| 20               | 10129            | 1 |
| 20               | 20677            | 1 |
| 20               | 27362            | 1 |
| 20               | 3548             | 1 |
| 20               | 21396            | 1 |
| 20               | 10118            | 1 |
| 20               | 8407             | 1 |

Adding port 20 to our candidate list. Now what? Let’s see what happens if we look at the top ‘connections’:

select firstseensrcport, 
firstseendestport, count(*) c from pq_logs2 group by firstseensrcport, 
firstseendestport having firstseensrcport not in (0,123,138,137,80,25,53,21,20,1900,3389,389) 
and firstseendestport not in (0,123,138,137,80,25,53,21,20,3389,1900,389) 
order by c desc limit 10
| firstseensrcport | firstseendestpor | c  |
| 1984             | 4244             | 11 |
| 1984             | 3198             | 11 |
| 1984             | 4232             | 11 |
| 1984             | 4276             | 11 |
| 1984             | 3212             | 11 |
| 1984             | 4247             | 11 |
| 1984             | 3391             | 11 |
| 1984             | 4233             | 11 |
| 1984             | 3357             | 11 |
| 1984             | 4252             | 11 |

Interesting. Looking through the data where the source port is actually 1984, we can see that a lot of the destination ports are showing increasing numbers. For example:

| 1984             | 2228             |     |    |
| 1984             | 2226             |     |    |
| 1984             | 2225             |     |    |
| 1984             | 2224             |     |    |
| 1984             | 2223             |     |    |
| 1984             | 2222             |     |    |
| 1984             | 2221             |     |    |
| 1984             | 2220             |     |    |
| 1984             | 2219             |     |    |
| 1984             | 2217             |     |    |
| 1984             | 2216             |     |    |
| 1984             | 2215             |     |    |
| 1984             | 2214             |     |    |
| 1984             | 2213             |     |    |
| 1984             | 2212             |     |    |
| 1984             | 2211             |     |    |
| 1984             | 2210             |     |    |
| 1984             | 2209             |     |    |
| 1984             | 2208             |     |    |

That would hint at this guy being actually a destination port. You can also query for all the records that have the destination port set to 1984, which will show that a lot of the source ports in those connections are definitely source ports, another hint that we should add 1984 to our list of actual ports. Continuing our journey, I found something interesting. I was looking for all connections that don’t have a source or destination port in our candidate list and sorted by the number of occurrences:

| firstseensrcport | firstseendestport| c |
| 62559            | 37321            | 9 |
| 62559            | 36242            | 9 |
| 62559            | 19825            | 9 |
| 62559            | 10468            | 9 |
| 62559            | 34395            | 9 |
| 62559            | 62556            | 9 |
| 62559            | 9005             | 9 |
| 62559            | 59399            | 9 |
| 62559            | 7067             | 9 |
| 62559            | 13503            | 9 |
| 62559            | 30151            | 9 |
| 62559            | 23267            | 9 |
| 62559            | 56184            | 9 |
| 62559            | 58318            | 9 |
| 62559            | 4178             | 9 |
| 62559            | 65429            | 9 |
| 62559            | 32270            | 9 |
| 62559            | 18104            | 9 |
| 62559            | 16246            | 9 |
| 62559            | 33454            | 9 |

This is strange insofar as this source port seems to connect to totally random ports, which doesn’t make any sense. Is this another legitimate destination port? I am not sure. It’s way too high and I don’t want to put it on our list. Open question. No idea at this point. Anyone?
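
The difference between the 1984 pattern and the 62559 pattern can be put into a number. The sketch below is one hypothetical way to do it, not what I ran during the analysis: it measures how many neighbouring ports are within 2 of each other, and that threshold is an arbitrary assumption. The port values are taken from the two listings above.

```python
def sequential_fraction(ports):
    """Fraction of neighbouring values (after sorting) that differ by at most 2."""
    ordered = sorted(set(ports))
    if len(ordered) < 2:
        return 0.0
    close = sum(1 for a, b in zip(ordered, ordered[1:]) if b - a <= 2)
    return close / (len(ordered) - 1)

# Ports seen on the other side of port 1984 (from the listing above)
seen_with_1984 = [2228, 2226, 2225, 2224, 2223, 2222, 2221, 2220, 2219, 2217]
# Ports seen on the other side of port 62559
seen_with_62559 = [37321, 36242, 19825, 10468, 34395, 62556, 9005, 59399, 7067, 13503]

print(sequential_fraction(seen_with_1984))
print(sequential_fraction(seen_with_62559))
```

A value near 1 looks like a client walking through its ephemeral port range, which is what you expect when the fixed port is really the service. A value near 0 looks like random targets.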

Moving on without 62559, we see the same behavior for 62560 and then 51357 and 51358, as well as 45031, 45032, 45033. And it keeps going like that. Let’s see which machines are involved in this traffic. Sorry, not the nicest SQL, but it works:


select firstseensrcip, firstseendestip, count(*) c 
from pq_logs2 group by firstseensrcip, firstseendestip,firstseensrcport 
having firstseensrcport in (62559, 62561, 62560, 51357, 51358)  
order by c desc limit 10
| firstseensrcip | firstseendestip | c     |
|      |     | 65534 |
|      |      | 65292 |
|      |      | 65272 |
|      |      | 65180 |
|      |      | 65140 |
|      |      | 65133 |
|      |      | 65127 |
|      |      | 65124 |
|      |      | 65117 |
|      |      | 65099 |

Here we have it. Probably an attacker :). This guy is doing not so nice things. We should exclude this IP for our analysis of ports. This guy is just all over.
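
A more direct way to spot this kind of machine is to count how many distinct targets each source talks to. Here is a rough Python sketch of that idea. The addresses are documentation addresses, the traffic is synthetic, and the threshold of 1000 is an arbitrary assumption you would have to tune.

```python
from collections import defaultdict

def fan_out(flows):
    """Map each source IP to its number of distinct (destination IP, destination port) pairs."""
    targets = defaultdict(set)
    for src_ip, dst_ip, dst_port in flows:
        targets[src_ip].add((dst_ip, dst_port))
    return {ip: len(t) for ip, t in targets.items()}

def likely_scanners(flows, threshold=1000):
    """Source IPs whose fan-out meets or exceeds the threshold, highest first."""
    counts = fan_out(flows)
    hits = [ip for ip, n in counts.items() if n >= threshold]
    return sorted(hits, key=lambda ip: -counts[ip])

# Synthetic traffic: one host touching 2000 ports, one host hitting port 80 fifty times
flows = [("192.0.2.66", "198.51.100.5", p) for p in range(1, 2001)]
flows += [("192.0.2.10", "198.51.100.5", 80)] * 50
print(likely_scanners(flows))
```

Raw record counts alone would not separate the two hosts as cleanly, because a busy web client can generate plenty of records against a single port.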

Now, we continue along similar lines and find what machines are using ports 45033, 45034, and 45035:

select firstseensrcip, firstseendestip, count(*) c 
from pq_logs2 group by firstseensrcip, firstseendestip,firstseensrcport 
having firstseensrcport in (45035, 45034, 45033)  order by c desc limit 10
| firstseensrcip | firstseendestip | c     |
|    |     | 61337 |
|    |      | 55772 |
|    |      | 53820 |
|    |      | 51382 |
|    |     | 51224 |
|     |     | 148   |
|     |     | 148   |
|     |     | 148   |
|       |      | 30    |
|       |      | 30    |

We see one dominant IP here. Probably another ‘attacker’. So we exclude that and see what we are left with. Now, this is getting tedious. Let’s just visualize some of the output to see what’s going on. Much quicker! And we only have 36970 records unaccounted for.

What you can see is the remainder of traffic. Very quickly we see that there is one dominant IP address. We are going to filter that one out. Then we are left with this:

I selected some interesting traffic here. Turns out, we just found another destination port for our list: 5355. I continued this analysis and ended up with something like 38 records, which are shown in the last image:

I’ll leave it at this for now. I think that’s a pretty good set of ports:

20, 21, 25, 53, 80, 123, 137, 138, 389, 1900, 1984, 3389, 5355

Oh well, if you want to fix your traffic now and turn around the wrong source/destination pairs, here is a hack in perl:

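$ cat nf*.csv | perl -F, -ane 'BEGIN {@ports=(20,21,25,53,80,123,137,138,389,1900,1984,3389,5355);
%hash = map { $_ => 1 } @ports; $c=0}
# source port is a known service port and destination port is high: swap
# IPs (5,6), ports (7,8), payload bytes (12,13), total bytes (14,15), packets (16,17)
if ($hash{$F[7]} && $F[8]>1024)
{$c++; print join(",", @F[0..4,6,5,8,7,9..11,13,12,15,14,17,16,18])} else {print $_}
# the count goes to STDERR so it does not end up in the CSV output
END {print STDERR "count of reversed records: $c\n";}'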
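
If Perl one-liners are not your thing, the same flip can be sketched in Python. This is an illustration under the same assumptions as the hack above (19 comma-separated fields, source/destination pairs at the positions listed in the comment), with made-up function and constant names.

```python
SERVICE_PORTS = {20, 21, 25, 53, 80, 123, 137, 138, 389, 1900, 1984, 3389, 5355}
# Zero-based field pairs to exchange: IPs, ports, payload bytes, total bytes, packet counts
SWAP_PAIRS = [(5, 6), (7, 8), (12, 13), (14, 15), (16, 17)]

def fix_direction(row):
    """Return (row, flipped): swap src/dst when the source port is a known service port."""
    if len(row) < 18:
        return row, False        # too short to be a flow record
    try:
        src_port, dst_port = int(row[7]), int(row[8])
    except ValueError:
        return row, False        # header line or garbage: leave untouched
    if src_port in SERVICE_PORTS and dst_port > 1024:
        row = list(row)
        for a, b in SWAP_PAIRS:
            row[a], row[b] = row[b], row[a]
        return row, True
    return row, False

# The reversed record from earlier in this post
record = "1364807428.917824,2013-04-01 09:10:28,20130401091028.917824,6,TCP,,,80,14545,0,0,0,7425,0,7865,0,8,0,0".split(",")
fixed, flipped = fix_direction(record)
print(flipped, fixed[7], fixed[8])
```

Like the Perl version, this only catches records where exactly one side is on the list. Records between two listed ports, or between two high ports, are left alone.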

We could have switched to visual analysis way earlier, which I did in my initial analysis, but for the blog I ended up going way further in SQL than I probably should have. The next blog post covers how to load all of the VAST data into a Hadoop / Impala setup.

August 12, 2013

Cyber Security Monitoring Still Full of Challenges

Filed under: Security Intelligence,Visualization — Raffael Marty @ 8:33 am

I was greatly honored when I got an invitation from the Conference on Knowledge Discovery and Data Mining (KDD) to give a talk about data mining and cyber security.

Knowing me, you might be able to guess the topic I chose to present: Visual Analytics. I am focusing not on the visualization layer or the data layer, but on the analytics layer. In the presentation I am showing what we have been doing with data analytics and data mining in cyber security. I am showing some examples for three topics:

  • Situational Awareness
  • Exploration and Discovery
  • Forensics

At the end, I am presenting a number of challenges to the community; hard problems that we need help with to advance insights into cyber security of infrastructures and applications. The following slide summarizes the challenges I see in data mining for security:

If you have any suggestions on any of the challenges, please contact me or comment on this post!

The complete presentation is here: Cyber Security – How Visual Analytics Unlock Insight.

July 11, 2013

Log Management and SIEM Vendors

Filed under: Log Analysis,Security Information Management,Security Market — Raffael Marty @ 4:12 pm

This is a slide I built for my Visual Analytics Workshop at BlackHat this year. I tried to summarize all the SIEM and log management vendors out there. I am pretty sure I missed some players. What did I miss? I’ll try to add them before the training.


Here is the list of vendors that are on the slide (in no particular order):

Log Management

  • Tibco
  • KeyW
  • Tripwire
  • Splunk
  • Balabit
  • Tier-3 Systems


SIEM

  • HP
  • Symantec
  • Tenable
  • Alienvault
  • Solarwinds
  • Attachmate
  • eIQ
  • EventTracker
  • BlackStratus
  • TrustWave
  • LogRhythm
  • ClickSecurity
  • IBM
  • McAfee
  • NetIQ
  • RSA
  • Event Sentry

Logging as a Service

  • SumoLogic
  • Loggly
  • PaperTrail
  • Torch
  • AlertLogic
  • SplunkStorm
  • logentries
  • eGestalt

Update: With input from a couple of folks, I updated the slide a couple of times.

March 24, 2012

Advanced Network Graph Visualization with AfterGlow

Filed under: Log Analysis,Programming,Visualization — Raffael Marty @ 12:49 pm

There are cases where you need fairly sophisticated logic to visualize data. Network graphs are a great way to help a viewer understand relationships in data. In my last blog post, I explained how to visualize network traffic. Today I am showing you how to extend your visualization with some more complicated configurations.

This blog post was inspired by an AfterGlow user who emailed me last week asking how he could keep a list of port numbers to drive the color in his graph. Here is the code snippet that I suggested he use:

variable=@ports=qw(22 80 53 110);
color="green" if (grep(/^\Q$fields[0]\E$/,@ports))

Put this in a configuration file and invoke AfterGlow with it:

perl -c file.config | ...

What this does is color all nodes green if they are part of the list of ports (22, 80, 53, 110). I am using $fields[0] to reference the first column of data. You could also use the function fields() to reference any column in the data.

Another way to define the variable is by looking it up in a file. Here is an example:

variable=open(TOR,"tor.csv"); @tor=<TOR>; close(TOR);
color="red" if (grep(/^\Q$fields[1]\E$/,@tor))

This time you put the list of items in a file and read it into an array. Remember, it’s just Perl code that you execute after the variable= statement. Anything goes!
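
In case the regular expression in those color lines looks cryptic: the anchors and the quoting are there so that only exact, literal matches count. The following Python sketch mirrors that logic outside of AfterGlow. It is an illustration of the matching semantics only, and the function name is made up.

```python
import re

def in_list(value, items):
    # Mirrors the Perl test grep(/^\Q$value\E$/, @items):
    # re.escape plays the role of \Q...\E, and ^...$ forces a whole-string match.
    pattern = re.compile("^" + re.escape(value) + "$")
    return any(pattern.search(item.rstrip("\n")) for item in items)

ports = ["22", "80", "53", "110"]
print(in_list("80", ports))     # exact entry
print(in_list("8080", ports))   # contains 80 but is not in the list
print(in_list("1.2.3.4", ["1x2y3z4\n"]))  # the dots are literal, not wildcards
```

Without the anchors, port 8080 would be colored as if it were port 80, and without the quoting an IP address would match anything that happens to have the same digits in the same places.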

I am curious what you will come up with. Post your experiments and questions on!

Read more about how to use AfterGlow in security visualization.

March 21, 2012

Visualizing Packet Captures For Fun and Profit

Filed under: Log Analysis,Visualization — Raffael Marty @ 1:26 pm

Have you ever collected a packet capture and you needed to know what the collected traffic is about? Here is a quick tutorial on how to use AfterGlow to generate link graphs from your packet captures (PCAP).

I am sitting at the 2012 Honeynet Project Security Workshop. One of the trainers of a workshop tomorrow just approached me and asked me to help him visualize some PCAP files. I thought it might be useful for other people as well. So here is a quick tutorial.


To start with, make sure you have AfterGlow installed. This means you also need to install GraphViz on your machine!

First Visualization Attempt

The first attempt of visualizing tcpdump traffic is the following:

tcpdump -vttttnnelr file.pcap | parsers/ "sip dip" | perl graph/ -t | neato -Tgif -o test.gif

I am using the tcpdump2csv parser to deal with the source/destination confusion. The problem with this approach is that if your output format is slightly different to the regular expression used in the script, the parsing will fail [In fact, this happened to us when we tried it here on someone else's computer].
It is more elegant to use something like Argus to do this. They do a much better job at protocol parsing:

argus -r file.pcap -w - | ra -r - -nn -s saddr daddr -c, | perl graph/ -t | neato -Tgif -o test.gif

When you do this, make sure that you are using Argus 3.0 or newer. If you do not, ragator does not have the -c option!

From here you can go in all kinds of directions.

Using other data fields

argus -r file.pcap -w - | ra -r - -nn -s saddr daddr dport -c, | perl graph/ | neato -Tgif -o test.gif

Here I added the dport to the parameters. Also note that I had to remove the -t parameter from the afterglow command. This tells AfterGlow that there are not two, but three columns in the CSV file.

Or use this:

argus -r file.pcap -w - | ra -r - -nn -s daddr dport ttl -c, | perl graph/ | neato -Tgif -o test.gif

This uses the destination address, the destination port and the TTL to plot your graph. Pretty neat …

AfterGlow Properties

You can define your own property file to define the colors for the nodes, configure clustering, change the size of the nodes, etc.

argus -r file.pcap -w - | ra -r - -nn -s daddr dport ttl -c, | perl graph/ -c graph/ | neato -Tgif -o test.gif

Here is an example config file that is not as straightforward as the default one that is included in the AfterGlow distribution:

color="white" if ($fields[2] =~ /foo/)

The config uses the number of times the target shows up as the size of the target node.

Comments / Examples / Questions?

Obviously comments and questions are more than welcome. Also make sure that you post your example graphs on!

March 16, 2012

Big Data Security Intelligence – nothing to see here – move along

Filed under: Log Analysis,Security Intelligence,Visualization — Raffael Marty @ 7:22 am

Big data doesn’t help us to create security intelligence! Big data is like your relational database. It’s a technology that helps us manage data. We still need the analytical intelligence on top of the storage and processing tier to make sense of everything. Visual analytics anyone?

A couple of weeks ago I hung out around the RSA conference and walked the show floor. Hundreds of companies exhibited their products. The big topics this year? Big data and security intelligence. Seems like this was MY conference. Well, not so fast. Marketing does unfortunately not equal actual solutions. Here is an example out of the press. Unfortunately, these kinds of things shine the light on very specific things; in this case, the use of hadoop for security intelligence. What does that even mean? How does it work? People seem to not really care, but only hear the big words.

Here is a quick side-note or anecdote. After the big data panel, a friend of mine comes up to me and tells me that the audience asked the panel a question about how analytics played into the big data environment. The panel huddled, discussed, and said: “Ask Raffy about that“.

Back to the problem. I have been reading a bunch lately about SIEM being replaced or superseded by big data infrastructure. That’s completely and utterly stupid. These are not competing technologies. They are complementary. If anything, SIEM will be replaced by some other analytical capabilities that are leveraging big data infrastructures. Big data is like RDBMS. New analytical capabilities are like the SIEMs (correlation rules, parsed data, etc.). For example, using big data, who is going to write your parsers for you? SIEMs have spent a lot of time and resources on things like parsers, and big data solutions will need to do the same! Yes, there are a couple of things that you can do with big data approaches and unparsed data. However, most discussions out there do not discuss those uses.

In the context of big data, people also talk about leveraging multiple data sources and new data sources. What’s the big deal? We have been talking about that for 6 years (or longer). Yes, we want video feeds, but how do you correlate a video with a firewall log? Well, you process the video and generate events from it. We have been doing that all along. Nothing new there.

What HAS changed is that we now have the means to store and process the data; any data. However, nobody really knows how to process it.

Let’s start focusing on analytics!

January 8, 2012

The Steps To a Mature Visual Analytics Practice

Filed under: Visualization — Raffael Marty @ 1:50 pm

The visualization maturity scale can be used to explain a number of issues in the visual analytics space. For example, why aren’t companies leveraging visualization to analyze their data? What are the requirements to implement visual analytics services? Or why don’t we have more visual analytics products?

About three years ago I posted the log management maturity scale. The maturity scale helped explain why companies and products are not as advanced as they should be in the log management, log analysis, and security information management space.

While preparing my presentation for the cyber security grand challenge meeting in early December, I developed the maturity scale for information visualization that you can see above.

Companies that are implementing visualization processes move through each of the steps from left to right. So do product companies that build visualization applications. In order to build products on the right-hand side, they need to support the pieces to the left. Let’s have a look at the different stages in more detail:

  • Data Collection: No data, no visuals (see also Where Data Analytics and Security Collide). This is the foundation. Data needs to be available and accessible. Generally it is centralized in a big data store (it used to be relational databases and that’s a viable solution as well). This step generally involves parsing data. Turning unstructured data or semi-structured data into structured data. Although a fairly old problem, this is still a huge issue. I wonder if anyone is going to come up with a novel solution in this space anytime soon! The traditional regular expression based approach just doesn’t scale.
  • Data Analysis: Once data is centralized or accessible via a federated data store, you have to do something with it. A lot of companies are using Excel to do the first iteration of data analysis. Some are using R, SAS, or other statistics and data analytics software. One of the core problems here is data cleansing. Another huge problem is understanding the data itself. Not every data set is as self-explanatory as sales data.
  • Context Integration: Often we collect data, analyze it, and then realize that the data doesn’t really contain enough information to understand it. Take network security, for example: What does the machine behind a specific IP address do? Is it a Web server? This is where we start adding more context: roles of machines, roles of users, etc. This can significantly increase the value of data analytics.
  • Visualization: Let’s be clear about what I refer to as visualization. I am using visualization to mean reporting and dashboards. Reports are static summaries of historical data. They help communicate information. Dashboards are used to communicate information in real-time (or near real-time) to create situational awareness.
  • Visual Analytics: This is where things are getting interesting. Interactive interfaces are used as a means to understand and reason about the data. Often linked views, brushing, and dynamic queries are key technologies used to give the user the most freedom to look at and analyze the data.
  • Collaboration: It is one thing to have one analyst look at data and apply his/her own knowledge to understand the data. It’s another thing to have people collaborate on data and use their joint ‘wisdom’.
  • Dissemination: Once an analysis is done, the job of the analyst is not. The newly found insights have to be shared and communicated to other groups or people in order for them to take action based on the findings.
  • Put in Action: This could be regarded as part of the dissemination step. This step is about operationalizing the information. In the case of security information management, this is where the knowledge is encoded in correlation rules to catch future instances of the same or similar incidents.
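To make the first, third, and last stages a bit more concrete, here is a minimal Python sketch of the traditional regular-expression approach to parsing, followed by context integration and a very simple rule of the kind a 'put in action' step might encode. The log format, the IP-to-role table, the function names, and the threshold are all assumptions invented for this illustration; they are not taken from any particular product.

```python
import re
from collections import Counter

# Hypothetical log format, invented for this sketch:
#   "<timestamp> <action> src=<ip> dst=<ip> dport=<port>"
LINE_RE = re.compile(
    r"(?P<ts>\S+) (?P<action>\w+) src=(?P<src>\S+) "
    r"dst=(?P<dst>\S+) dport=(?P<dport>\d+)"
)

# Hypothetical context table: IP address -> role of the machine.
ASSET_ROLES = {"10.0.0.5": "web server", "10.0.0.9": "database"}


def parse(line):
    """Data collection: turn a raw line into a structured event (or None)."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    event = m.groupdict()
    event["dport"] = int(event["dport"])
    return event


def add_context(event):
    """Context integration: attach the role of the destination machine."""
    enriched = dict(event)
    enriched["dst_role"] = ASSET_ROLES.get(event["dst"], "unknown")
    return enriched


def blocked_sources(events, threshold=2):
    """Put in action: flag sources with at least `threshold` blocked events."""
    counts = Counter(e["src"] for e in events if e["action"] == "block")
    return sorted(src for src, n in counts.items() if n >= threshold)


lines = [
    "2012-01-08T13:50:00 block src=192.168.1.7 dst=10.0.0.5 dport=22",
    "2012-01-08T13:50:02 block src=192.168.1.7 dst=10.0.0.9 dport=3306",
    "2012-01-08T13:50:03 allow src=192.168.1.8 dst=10.0.0.5 dport=80",
    "not a log line",  # unparseable input is skipped
]
events = [add_context(e) for e in map(parse, lines) if e is not None]
alerts = blocked_sources(events)
print(alerts)
```

Note how brittle the pattern is: any change in the log format needs a new expression, which is one reason the regex approach is hard to scale across many data sources.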

For an end user, the visualization maturity scale outlines the individual steps he/she has to go through in order to achieve analytical maturity. In order to implement the ‘put in action’ step, users need to implement all of the steps on the left of the scale.

For visualization product companies, the scale means that in order to have a product that lets a user put findings into action, they have to support all the left-hand stages. There needs to be a data collection piece and a data store. The data needs to be pre-analyzed: operations like data cleansing, aggregation, filtering, or even the calculation of certain statistical properties fall into this step. Context is not always necessary, but often adds to the usefulness of the data. And so on.
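As a rough illustration of what such pre-analysis can look like, the following Python sketch cleanses a handful of records, aggregates them per source, computes a simple statistic, and filters the result. The records, field names, and the 1000-byte cutoff are hypothetical and chosen only for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records (source, bytes transferred); invented for illustration.
raw = [
    {"src": "10.0.0.5 ", "bytes": "1200"},
    {"src": "10.0.0.5", "bytes": "800"},
    {"src": "10.0.0.9", "bytes": "n/a"},  # unusable value, dropped in cleansing
    {"src": "10.0.0.9", "bytes": "50"},
]


def cleanse(records):
    """Trim whitespace and drop records whose byte count is not a number."""
    for r in records:
        value = r["bytes"].strip()
        if value.isdigit():
            yield {"src": r["src"].strip(), "bytes": int(value)}


def aggregate(records):
    """Group byte counts per source and compute total and mean."""
    groups = defaultdict(list)
    for r in records:
        groups[r["src"]].append(r["bytes"])
    return {src: {"total": sum(v), "mean": mean(v)} for src, v in groups.items()}


summary = aggregate(cleanse(raw))
# Filtering: keep only sources that transferred at least 1000 bytes in total.
busy = {src: s for src, s in summary.items() if s["total"] >= 1000}
print(busy)
```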

There are a number of products, both open source and commercial, that solve a lot of the left-hand side problems. Document-oriented databases (e.g., MongoDB), map reduce (e.g., Hadoop), and search engines like ElasticSearch are great open source examples of such technologies. In the commercial space you will find companies like Karmasphere or Datameer tackling these problems.

Comments? Chime in!