{"id":660,"date":"2013-10-22T13:59:54","date_gmt":"2013-10-22T21:59:54","guid":{"rendered":"http:\/\/raffy.ch\/blog\/?p=660"},"modified":"2013-10-25T16:52:39","modified_gmt":"2013-10-26T00:52:39","slug":"cleaning-up-network-traffic-logs-vast-2013-challenge","status":"publish","type":"post","link":"https:\/\/raffy.ch\/blog\/2013\/10\/22\/cleaning-up-network-traffic-logs-vast-2013-challenge\/","title":{"rendered":"Cleaning Up Network Traffic Logs &#8211; VAST 2013 Challenge"},"content":{"rendered":"<p>I have spent some significant time with the <a href=\"http:\/\/vacommunity.org\/VAST+Challenge+2013\">VAST 2013 Challenge<\/a>. I have been part of the program committee for a couple of years now and have seen many challenge submissions. Both good and bad. What I noticed with most submissions is that they a) didn&#8217;t really understand network data, and b) they didn&#8217;t clean the data correctly. If you wanna follow along my analysis, the data is here: <a href=\"http:\/\/www.vacommunity.org\/dl388\">Week 1 &#8211; Network Flows (~500MB)<\/a><\/p>\n<p>Also check the follow-on blog post on how to <a href=\"http:\/\/raffy.ch\/blog\/2013\/10\/25\/using-impala-and-parquet-to-analyze-network-traffic-vast-2013-challenge\/\">load data into a columnar data store<\/a> in order to work with it.<\/p>\n<p>Let me help with one quick comment. There is a lot of traffic in the data that seems to be involving port 0:<\/p>\n<pre style=\"font-family:monospace; font-size:14px;\">$ cat nf-chunk1-rev.csv | awk -F, '{if ($8==0) print $0}'\r\n1364803648.013658,2013-04-01 08:07:28,20130401080728.013658,1,OTHER,172.10.0.6,\r\n172.10.2.6,0,0,0,0,1,0,0,222,0,3,0,0<\/pre>\n<p>Just because it says port 0 in there doesn&#8217;t mean it&#8217;s port 0! Check out field 5, which says <b>OTHER<\/b>. That&#8217;s the transport protocol. It&#8217;s not TCP or UDP, so the port is meaningless. Most likely this is ICMP traffic!<\/p>\n<p>On to another problem with the data. Some of the sources and destinations are turned around in the traffic. This happens with network flow collectors. Look at these two records:<\/p>\n<pre style=\"font-family:monospace; font-size:14px;\">1364803504.948029,2013-04-01 08:05:04,20130401080504.948029,6,TCP,172.30.1.11, \r\n10.0.0.12,9130,80,0,0,0,176,409,454,633,5,4,0\r\n1364807428.917824,2013-04-01 09:10:28,20130401091028.917824,6,TCP,172.10.0.4,\r\n172.10.2.64,80,14545,0,0,0,7425,0,7865,0,8,0,0\r\n<\/pre>\n<p>The first one is totally legitimate. The source port is 9130, the destination 80. The second record, however, has the source and destination turned around. Port 14545 is not a valid destination port and the collector just turned the information around.<\/p>\n<p>The challenge is on you now to find which records are inverted and then you have to flip them back around. Here is what I did in order to find the ones that were turned around (Note, I am only using the first week of data for MiniChallenge1!):<\/p>\n<pre style=\"font-family:monospace; font-size:14px;\">select firstseendestport, count(*) c from logs group by firstseendestport order\r\n by c desc limit 20;\r\n| 80               | 41229910 |\r\n| 25               | 272563   |\r\n| 0                | 119491   |\r\n| 123              | 95669    |\r\n| 1900             | 68970    |\r\n| 3389             | 58153    |\r\n| 138              | 6753     |\r\n| 389              | 3672     |\r\n| 137              | 2352     |\r\n| 53               | 955      |\r\n| 21               | 767      |\r\n| 5355             | 311      |\r\n| 49154            | 211      |\r\n| 464              | 100      |\r\n| 5722             | 98       |\r\n...\r\n<\/pre>\n<p>What I am looking for here are the top destination ports. My theory being that most valid ports will show up quite a lot. This gave me a first candidate list of ports. I am looking for two things here. First, the frequency of the ports and second whether I recognize the ports as being valid. Based on the frequency I would put the ports down to port 3389 on my candidate list. But because all the following ones are well known ports, I will include them down to port 21. So the first list is:<\/p>\n<pre style=\"font-family:monospace; font-size:14px;\">80,25,0,123,1900,3389,138,389,137,53,21<\/pre>\n<p>I&#8217;ll drop 0 from this due to the comment earlier!<\/p>\n<p>Next up, let&#8217;s see what the top source ports are that are showing up.<\/p>\n<pre style=\"font-family:monospace; font-size:14px;\">| firstseensrcport | c       |\r\n+------------------+---------+\r\n| 80               | 1175195 |\r\n| 62559            | 579953  |\r\n| 62560            | 453727  |\r\n| 51358            | 366650  |\r\n| 51357            | 342682  |\r\n| 45032            | 288301  |\r\n| 62561            | 256368  |\r\n| 45031            | 227789  |\r\n| 51359            | 180029  |\r\n| 45033            | 157071  |\r\n| 0                | 119491  |\r\n| 45034            | 117760  |\r\n| 123              | 95622   |\r\n| 1984             | 81528   |\r\n| 25               | 19646   |\r\n| 138              | 6711    |\r\n| 137              | 2288    |\r\n| 2024             | 929     |\r\n| 2100             | 927     |\r\n| 1753             | 926     |<\/pre>\n<p>See that? Port 80 is the top source port showing up. Definitely a sign of a source\/destination confusion. There are a bunch of others from our previous candidate list showing up here as well. All records where we have to turn source and destination around. But likely we are still missing some ports here.<\/p>\n<p>Well, let&#8217;s see what other source ports remain:<\/p>\n<pre style=\"font-family:monospace; font-size:14px;\">select firstseensrcport, count(*) c from pq_logs2 group by firstseensrcport \r\nhaving firstseensrcport&lt;1024 and firstseensrcport not in (0,123,138,137,80,25,53,21) \r\norder by c desc limit 10\r\n+------------------+--------+\r\n| firstseensrcport | c      |\r\n+------------------+--------+\r\n| 62559            | 579953 |\r\n| 62560            | 453727 |\r\n| 51358            | 366650 |\r\n| 51357            | 342682 |\r\n| 45032            | 288301 |\r\n| 62561            | 256368 |\r\n| 45031            | 227789 |\r\n| 51359            | 180029 |\r\n| 45033            | 157071 |\r\n| 45034            | 117760 |<\/pre>\n<p>Looks pretty normal. Well. Sort of, but let&#8217;s not digress. But lets try to see if there are any ports below 1024 showing up. Indeed, there is port 20 that shows, totally legitimate destination port. Let&#8217;s check out the. Pulling out the destination ports for those show nice actual source ports:<\/p>\n<pre style=\"font-family:monospace; font-size:14px;\">+------------------+------------------+---+\r\n| firstseensrcport | firstseendestport| c |\r\n+------------------+------------------+---+\r\n| 20               | 3100             | 1 |\r\n| 20               | 8408             | 1 |\r\n| 20               | 3098             | 1 |\r\n| 20               | 10129            | 1 |\r\n| 20               | 20677            | 1 |\r\n| 20               | 27362            | 1 |\r\n| 20               | 3548             | 1 |\r\n| 20               | 21396            | 1 |\r\n| 20               | 10118            | 1 |\r\n| 20               | 8407             | 1 |\r\n+------------------+------------------+---+<\/pre>\n<p>Adding port 20 to our candidate list. Now what? Let&#8217;s see what happens if we look at the top &#8216;connections&#8217;:<\/p>\n<pre style=\"font-family:monospace; font-size:14px;\">select firstseensrcport, \r\nfirstseendestport, count(*) c from pq_logs2 group by firstseensrcport, \r\nfirstseendestport having firstseensrcport not in (0,123,138,137,80,25,53,21,20,1900,3389,389) \r\nand firstseendestport not in (0,123,138,137,80,25,53,21,20,3389,1900,389) \r\norder by c desc limit 10\r\n+------------------+------------------+----+\r\n| firstseensrcport | firstseendestpor | c  |\r\n+------------------+------------------+----+\r\n| 1984             | 4244             | 11 |\r\n| 1984             | 3198             | 11 |\r\n| 1984             | 4232             | 11 |\r\n| 1984             | 4276             | 11 |\r\n| 1984             | 3212             | 11 |\r\n| 1984             | 4247             | 11 |\r\n| 1984             | 3391             | 11 |\r\n| 1984             | 4233             | 11 |\r\n| 1984             | 3357             | 11 |\r\n| 1984             | 4252             | 11 |\r\n+------------------+------------------+----+\r\n<\/pre>\n<p>Interesting. Looking through the data where the source port is actually 1984, we can see that a lot of the destination ports are showing increasing numbers. For example:<\/p>\n<pre style=\"font-family:monospace; font-size:14px;\">| 1984             | 2228             | 172.10.0.6     | 172.10.1.118    |\r\n| 1984             | 2226             | 172.10.0.6     | 172.10.1.147    |\r\n| 1984             | 2225             | 172.10.0.6     | 172.10.1.141    |\r\n| 1984             | 2224             | 172.10.0.6     | 172.10.1.115    |\r\n| 1984             | 2223             | 172.10.0.6     | 172.10.1.120    |\r\n| 1984             | 2222             | 172.10.0.6     | 172.10.1.121    |\r\n| 1984             | 2221             | 172.10.0.6     | 172.10.1.135    |\r\n| 1984             | 2220             | 172.10.0.6     | 172.10.1.126    |\r\n| 1984             | 2219             | 172.10.0.6     | 172.10.1.192    |\r\n| 1984             | 2217             | 172.10.0.6     | 172.10.1.141    |\r\n| 1984             | 2216             | 172.10.0.6     | 172.10.1.173    |\r\n| 1984             | 2215             | 172.10.0.6     | 172.10.1.116    |\r\n| 1984             | 2214             | 172.10.0.6     | 172.10.1.120    |\r\n| 1984             | 2213             | 172.10.0.6     | 172.10.1.115    |\r\n| 1984             | 2212             | 172.10.0.6     | 172.10.1.126    |\r\n| 1984             | 2211             | 172.10.0.6     | 172.10.1.121    |\r\n| 1984             | 2210             | 172.10.0.6     | 172.10.1.172    |\r\n| 1984             | 2209             | 172.10.0.6     | 172.10.1.119    |\r\n| 1984             | 2208             | 172.10.0.6     | 172.10.1.173    |<\/pre>\n<\/p>\n<p>That would hint at this guy being actually a destination port. You can also query for all the records that have the destination port set to 1984, which will show that a lot of the source ports in those connections are definitely source ports, another hint that we should add 1984 to our list of actual ports. Continuing our journey, I found something interesting. I was looking for all connections that don&#8217;t have a source or destination port in our candidate list and sorted by the number of occurrences:<\/p>\n<pre style=\"font-family:monospace; font-size:14px;\">+------------------+------------------+---+\r\n| firstseensrcport | firstseendestport| c |\r\n+------------------+------------------+---+\r\n| 62559            | 37321            | 9 |\r\n| 62559            | 36242            | 9 |\r\n| 62559            | 19825            | 9 |\r\n| 62559            | 10468            | 9 |\r\n| 62559            | 34395            | 9 |\r\n| 62559            | 62556            | 9 |\r\n| 62559            | 9005             | 9 |\r\n| 62559            | 59399            | 9 |\r\n| 62559            | 7067             | 9 |\r\n| 62559            | 13503            | 9 |\r\n| 62559            | 30151            | 9 |\r\n| 62559            | 23267            | 9 |\r\n| 62559            | 56184            | 9 |\r\n| 62559            | 58318            | 9 |\r\n| 62559            | 4178             | 9 |\r\n| 62559            | 65429            | 9 |\r\n| 62559            | 32270            | 9 |\r\n| 62559            | 18104            | 9 |\r\n| 62559            | 16246            | 9 |\r\n| 62559            | 33454            | 9 |<\/pre>\n<p>This is strange in so far as this source port seems to connect to totally random ports, but not making any sense. Is this another legitimate destination port? I am not sure. It&#8217;s way too high and I don&#8217;t want to put it on our list. Open question. No idea at this point. Anyone?<\/p>\n<p>Moving on without this 62559, we see the same behavior for 62560 and then 51357 and 51358, as well as 45031, 45032, 45033. And it keeps going like that. Let&#8217;s see what the machines are involved in this traffic. Sorry, not the nicest SQL, but it works:<\/p>\n<p>.<\/p>\n<pre style=\"font-family:monospace; font-size:14px;\">select firstseensrcip, firstseendestip, count(*) c \r\nfrom pq_logs2 group by firstseensrcip, firstseendestip,firstseensrcport \r\nhaving firstseensrcport in (62559, 62561, 62560, 51357, 51358)  \r\norder by c desc limit 10\r\n+----------------+-----------------+-------+\r\n| firstseensrcip | firstseendestip | c     |\r\n+----------------+-----------------+-------+\r\n| 10.9.81.5      | 172.10.0.40     | 65534 |\r\n| 10.9.81.5      | 172.10.0.4      | 65292 |\r\n| 10.9.81.5      | 172.10.0.4      | 65272 |\r\n| 10.9.81.5      | 172.10.0.4      | 65180 |\r\n| 10.9.81.5      | 172.10.0.5      | 65140 |\r\n| 10.9.81.5      | 172.10.0.9      | 65133 |\r\n| 10.9.81.5      | 172.20.0.6      | 65127 |\r\n| 10.9.81.5      | 172.10.0.5      | 65124 |\r\n| 10.9.81.5      | 172.10.0.9      | 65117 |\r\n| 10.9.81.5      | 172.20.0.6      | 65099 |\r\n+----------------+-----------------+-------+<\/pre>\n<p>Here we have it. Probably an attacker :). This guy is doing not so nice things. We should exclude this IP for our analysis of ports. This guy is just all over.<\/p>\n<p>Now, we continue along similar lines and find what machines are using port 45034, 45034, 45035:<\/p>\n<pre style=\"font-family:monospace; font-size:14px;\">select firstseensrcip, firstseendestip, count(*) c \r\nfrom pq_logs2 group by firstseensrcip, firstseendestip,firstseensrcport \r\nhaving firstseensrcport in (45035, 45034, 45033)  order by c desc limit 10\r\n+----------------+-----------------+-------+\r\n| firstseensrcip | firstseendestip | c     |\r\n+----------------+-----------------+-------+\r\n| 10.10.11.15    | 172.20.0.15     | 61337 |\r\n| 10.10.11.15    | 172.20.0.3      | 55772 |\r\n| 10.10.11.15    | 172.20.0.3      | 53820 |\r\n| 10.10.11.15    | 172.20.0.2      | 51382 |\r\n| 10.10.11.15    | 172.20.0.15     | 51224 |\r\n| 10.15.7.85     | 172.20.0.15     | 148   |\r\n| 10.15.7.85     | 172.20.0.15     | 148   |\r\n| 10.15.7.85     | 172.20.0.15     | 148   |\r\n| 10.7.6.3       | 172.30.0.4      | 30    |\r\n| 10.7.6.3       | 172.30.0.4      | 30    |<\/pre>\n<p>We see one dominant IP here. Probably another &#8216;attacker&#8217;. So we exclude that and see what we are left with. Now, this is getting tedious. Let&#8217;s just visualize some of the output to see what&#8217;s going on. Much quicker! And we only have 36970 records unaccounted for.<\/p>\n<p><a href=\"http:\/\/raffy.ch\/blog\/wp-content\/uploads\/2013\/10\/initial.png\"><img decoding=\"async\" src=\"http:\/\/raffy.ch\/blog\/wp-content\/uploads\/2013\/10\/initial.png\" style=\"width:700px\"\/><\/a><\/p>\n<p>What you can see is the remainder of traffic. Very quickly we see that there is one dominant IP address. We are going to filter that one out. Then we are left with this:<\/p>\n<p><a href=\"http:\/\/raffy.ch\/blog\/wp-content\/uploads\/2013\/10\/5535.png\"><img decoding=\"async\" src=\"http:\/\/raffy.ch\/blog\/wp-content\/uploads\/2013\/10\/5535.png\" style=\"width:700px\"\/><\/a><\/p>\n<p>I selected some interesting traffic here. Turns out, we just found another destination port: 5535 for our list. I continued this analysis and ended up with something like 38 records, which are shown in the last image:<\/p>\n<p><a href=\"http:\/\/raffy.ch\/blog\/wp-content\/uploads\/2013\/10\/remainder.png\"><img decoding=\"async\" src=\"http:\/\/raffy.ch\/blog\/wp-content\/uploads\/2013\/10\/remainder.png\" style=\"width:700px\"\/><\/a><\/p>\n<p>I&#8217;ll leave it at this for now. I think that&#8217;s a pretty good set of ports:<\/p>\n<pre style=\"font-family:monospace; font-size:14px;\">20,21,25,53,80,123,137,138,389,1900,1984,3389,5355<\/pre>\n<\/p>\n<p>Oh well, if you want to fix your traffic now and turn around the wrong source\/destination pairs, here is a hack in perl:<\/p>\n<pre style=\"font-family:monospace; font-size:14px;\">$ cat nf*.csv | perl -F\\,\\ -ane 'BEGIN {@ports=(20,21,25,53,80,123,137,138,389,1900,1984,3389,5355); \r\n%hash = map { $_ => 1 } @ports; $c=0} if ($hash{$F[7]} && $F[8}>1024) \r\n{$c++; printf\"%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s\",\r\n$F[0],$F[1],$F[2],$F[3],$F[4],$F[6],$F[5],$F[8],$F[7],$F[9],$F[10],$F[11],$F[13],$F[12],\r\n$F[15],$F[14],$F[17],$F[16],$F[18]} else {print $_} END {print \"count of revers $c\\n\";}'<\/pre>\n<p>We could have switched to visual analysis way earlier, which I did in my initial analysis, but for the blog I ended up going way further in SQL than I probably should have. The next blog post covers <a href=\"http:\/\/raffy.ch\/blog\/2013\/10\/25\/using-impala-and-parquet-to-analyze-network-traffic-vast-2013-challenge\/\">how to load all of the VAST data into a Hadoop \/ Impala setup<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I have spent some significant time with the VAST 2013 Challenge. I have been part of the program committee for a couple of years now and have seen many challenge submissions. Both good and bad. What I noticed with most submissions is that they a) didn&#8217;t really understand network data, and b) they didn&#8217;t clean [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6],"tags":[],"class_list":["post-660","post","type-post","status-publish","format-standard","hentry","category-log-analysis"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/raffy.ch\/blog\/wp-json\/wp\/v2\/posts\/660","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/raffy.ch\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/raffy.ch\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/raffy.ch\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/raffy.ch\/blog\/wp-json\/wp\/v2\/comments?post=660"}],"version-history":[{"count":39,"href":"https:\/\/raffy.ch\/blog\/wp-json\/wp\/v2\/posts\/660\/revisions"}],"predecessor-version":[{"id":705,"href":"https:\/\/raffy.ch\/blog\/wp-json\/wp\/v2\/posts\/660\/revisions\/705"}],"wp:attachment":[{"href":"https:\/\/raffy.ch\/blog\/wp-json\/wp\/v2\/media?parent=660"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/raffy.ch\/blog\/wp-json\/wp\/v2\/categories?post=660"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/raffy.ch\/blog\/wp-json\/wp\/v2\/tags?post=660"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}