May 7, 2015

Security Monitoring / SIEM Use-Cases

Filed under: Log Analysis,Security Information Management,Visualization — Raffael Marty @ 3:40 pm

As it happens, I do a lot of consulting for companies that have some kind of log management or SIEM solution deployed. Unfortunately, or maybe not really for me, most companies have a hard time figuring out what to do with their expensive toys. [It is a completely different topic what I think about the security monitoring / SIEM space in general – it’s quite broken.] But here are some tips that I share with companies that are trying to get more out of their SIEMs:

  • First and foremost, start with use-cases. Time and time again, I am on calls with companies and they are telling me that they have been onboarding data sources for the last 4 months. When I ask them what they are trying to do with them, it gets really quiet. Turns out that’s what they expected me to tell them. Well, that’s not how it works. You have to come up with the use-cases you want/need to implement yourself. I don’t know your specific environment, your security policy, or your threat profile. These are the factors that should drive your use-cases.
  • Second, focus on your assets / machines. Identify your most valuable assets – the high business impact (HBI) machines and network segments. Even just identifying them can be quite challenging. I can guarantee you though; the time is well spent. After all, you need to know what you are protecting.
  • Model a set of use-cases around your HBIs. Learn as much as you can about them: What software is running on them? What processes are running? What ports are open? And from a network point of view, what other machines are they communicating with? What internal machines have access to talk to them? Do they talk to the outside world? What machines? How may different ones? When? Use your imagination to come up with more use-cases. Monitor the machines for a week and start defining some policies / metrics that you can monitor. Keep adopting them over time.
  • Based on your use-cases, determine what data you need. You will be surprised what you learn. Your IDS logs might suddenly loose a lot of importance. But your authentication logs and network flows might come in pretty handy. Note how we turned things around; instead of having the data dictate our use-cases, we have the use-cases dictate what data we collect.
  • Next up, figure out how to actually implement your use-cases. Your SIEM is probably going to be the central point for most of the use-case implementations. However, it won’t be able to solve all of your use-cases. You might need some pretty specific tools to model user behavior, machine communications, etc. But also don’t give up too quickly. Your SIEM can do a lot; even initial machine profiling. Try to work with what you have.

Ideally you go through this process before you buy any products. To come up with a set of use-cases, involve your risk management people too. They can help you prioritize your efforts and probably have a number of use-cases they would like to see addressed as well. What I often do is organize a brainstorming session with many different stakeholders across different departments.

Here are some additional resources that might come in handy in your use-case development efforts:

  • Popular SIEM Starter Use Cases – This is a short list of use-cases you can work with. You will need to determine how exactly to collect the data that Anton is talking about in this blog post and how to actually implement the use-case, but the list is a great starting point.
  • AlienVault SIEM Use-Cases – Scroll down just a bit and you will see a list of SIEM use-cases. If you click on them, they will open up and show you some more details around how to implement them. Great list to get started.
  • SANS Critical Security Controls – While this is not specifically a list of SIEM use-cases, I like to use this list as a guide to explore SIEM use-cases. Go through the controls and identify which ones you care about and how you could map them to your SIEM.
  • NIST 800-53 – This is NIST’s control framework. Again, not directly a list of SIEM use-cases, but similar to the SANS list, a great place for inspiration, but also a nice framework to follow in order to make sure you cover the important use-cases. [When I was running the solution team at ArcSight, we implemented an entire solution (app) around the NIST framework.]

Interestingly enough, on most of my recent consulting calls and engagements around SIEM use-cases, I got asked about how to visualize the data in the SIEM to make it more tangible and actionable. Unfortunately, there is no tool out there that would let you do that out of the box. Not yet. Here are a couple of resources you can have a look at to get going though:

  • secviz.org is the community portal for security visualization
  • I sometimes use Gephi for network graph visualizations. The problem is that it is limited to only network graphs. Very quickly you will realize that it would be nice to have other, linked visualizations too. You are off in ‘do it yourself’ land.
  • There is also DAVIX which is a Linux distro for security visualization with a ton of visualization tools readily installed.
  • And – shameless plug – I teach log analysis and visualization workshops where we discuss more of these topics and tools.

Do you have SIEM use-cases that you find super useful? Add them in the comments!

April 14, 2015

Rockstars Use a Good Text Editor – I Use VIM

Filed under: Uncategorized — Raffael Marty @ 9:13 am

Those of you who know me most likely know that I am quite the VIM fan. At any time, there is at least one VIM window open on my computer. I just like the speed of editing and the flexibility it offers. I even use VI bindings in my UNIX shells (set -o vi). And yes, I did write my book in VIM.

Anyways, here is a command from my .vimrc file that I use a lot:

command F set guifont=Monaco:h13

Basically, if I type “:F”, it makes my font larger. I know, not earth shattering, but really useful.

Here are a couple esthetic things I like to make my VIM look nice:

set background=dark
colorscheme solarized
set guioptions=-m

This is my complete .vimrc file.

March 29, 2015

Security Dashboards – Where to Start

Filed under: Visualization — Raffael Marty @ 4:58 pm

I just got off a call with a client and they asked me what they should put on their security dashboards. It’s a nice continuation of the discussion of the SOC Overhead Dashboard.

Here are some thoughts. The list stems from a slide that I use during the Visual Analytics Workshop:

  • Audience, audience, audience!
  • Comprehensive Information (enough context) – use percentages, a single number of 100 unpatched machines doesn’t mean anything. Out of how many? How has it changed over time? etc.
  • Highlight important data – guide the user in absorbing data quickly
  • Use graphics when appropriate – tables or numbers are sometimes more effective
  • Good choice of graphics and design – treemaps might be useful, bullet graphs are great, apply Tufte’s data to ink ratio paradigm, etc.
  • Aesthetically pleasing – nobody likes to look at a boring dashboard
  • Enough information to decide if action is necessary
  • No scrolling
  • Real-time vs. batch? (Refresh-rates)
  • Clear organization

Should you be tempted to put a world map on your dashboard, I challenge you to think really hard about what is actionable about that display. What does the viewer gain from looking at the map? Is there a takeaway for them? If so, go ahead, but most likely, either a bar chart or a simple table about the top attacking or the most attacked, or most seen sources or destinations is going to be more useful! Does physical proximity really matter?

There is a fantastic book by Stephen Few also called: “Information Dashboard Design”. I cannot recommend it enough if you are going to build a dashboard.

Would love to hear your thoughts on the topic of security dashboards! And for an in depth and more elaborate treatment of the topic, attend the Visual Analytics workshop at BlackHat US.

February 20, 2015

The Security Big Data Lake – Paper Published

Filed under: Log Analysis,Security Information Management,Security Intelligence — Raffael Marty @ 8:12 pm

As announced in the previous blog post, I have been writing a paper about the security big data lake. A topic that starts coming up with more and more organizations lately. Unfortunately, there is a lot uncertainty around the term so I decided to put some structure to the discussion.

Download the paper here.

A little teaser from the paper: The following table from the paper summarizes the four main building blocks that can be used to put together a SIEM – data lake integration:

datalake_buildingblock1
datalake_buildingblock2

Thanks @antonchuvakin for brainstorming and coming up with the diagram.

February 16, 2015

Big Data Lake – Leveraging Big Data Technologies To Build a Common Data Repository For Security

Filed under: Uncategorized — Raffael Marty @ 11:50 am

Information security has been dealing with terabytes of data for over a decade; almost two. Companies of all sizes are realizing the benefit of having more data available to not only conduct forensic investigations, but also pro-actively find anomalies and stop adversaries before they cause any harm.

UPDATE: Download the paper here

I am finalizing a paper on the topic of the security big data lake. I should have the full paper available soon. As a teaser, here are the first two sections:

What Is a Data Lake?

The term data lake comes from the big data community and starts appearing in the security field more often. A data lake (or a data hub) is a central location where all security data is collected and stored. Sounds like log management or security information and event management (SIEM)? Sure. Very similar. In line with the Hadoop big data movement, one of the objectives is to run the data lake on commodity hardware and storage that is cheaper than special purpose storage arrays, SANs, etc. Furthermore, the lake should be accessible by third-party tools, processes, workflows, and teams across the organization that need the data. Log management tools do not make it easy to access the data through standard interfaces (APIs). They also do not provide a way to run arbitrary analytics code against the data.

Just because we mentioned SIEM and data lakes in the same sentence above does not mean that a data lake is a replacement for a SIEM. The concept of a data lake merely covers the storage and maybe some of the processing of data. SIEMs are so much more.

Why Implementing a Data Lake?

Security data is often found stored in multiple copies across a company. Every security product collects and stores its own copy of the data. For example, tools working with network traffic (e.g., IDS/IPS, DLP, forensic tools) monitor, process, and store their own copies of the traffic. Behavioral monitoring, network anomaly detection, user scoring, correlation engines, etc. all need a copy of the data to function. Every security solution is more or less collecting and storing the same data over and over again, resulting in multiple data copies.

The data lake tries to rid of this duplication by collecting the data once and making it available to all the tools and products that need it. This is much simpler said than done. The core of this document is to discuss the issues and approaches around the lake.

To summarize, the four goals of the data lake are:

  • One way (process) to collect all data
  • Process, clean, enrich the data in one location
  • Store data once
  • Have a standard interface to access the data

One of the main challenges with this approach is how to make all the security products leverage the data lake instead of collecting and processing their own data. Mostly this means that products have to be rebuilt by the vendors to do so.

Have you implemented something like this? Email me or put a comment on the blog. I’d love to hear your experience. And stay tuned for the full paper!

January 15, 2015

Dashboards in the Security Opartions Center (SOC)

Filed under: Security Information Management,Visualization — Raffael Marty @ 3:35 pm

SOCI am sure you have seen those huge screens in a security or network operations center (SOC or NOC). They are usually quite impressive and sometimes even quite beautiful. I have made a habit of looking a little closer at those screens and asking the analysts sitting in front of them whether and how they are using those dashboards. I would say about 80% of the time they don’t use them. I have even seen SOCs that have very expensive screens up on the wall and they are just dark. Nobody is using them. Some SOCs will turn them on when they have customers or executives walk through.

That’s just wrong! Let’s start using these screens!

I recently visited a very very large NOC. They had 6 large screens up where every single screen showed graphs of 25 different measurements: database latencies for each database cluster, number of transactions going through each specific API endpoint, number of users currently active, number of failed logins, etc.

There are two things I learned that day for security applications:

1. Use The Screens For Context

When architecting SOC dashboards, the goal is often to allow analysts to spot attacks or anomalies. That’s where things go wrong! Do you really want your analysts to focus their attention on the dashboards to detect anomalies? Why not put those dashboards on the analysts screens then? Using the SOC screens to detect anomalies or attacks is the wrong use!

Use the dashboards as context. Say an analyst is investigating a number of suspicious looking network connections to a cluster of application servers. The analyst only knows that the cluster runs some sort of business applications. She could just discard the traffic pattern, following perfectly good procedure. However, a quick look up on the overhead screens shows a list of the most recently exploited applications and among them is SAP NetWeaver Dispatcher (arbitrary example). Having that context, the analyst makes the connection between the application cluster and SAP software running on that cluster. Instead of discarding the pattern, she decides to investigate further as it seems there are some fresh exploits being used in the wild.

Or say the analyst is investigating an increase in database write failures along with an increase in inbound traffic. The analyst first suspects some kind of DoS attack. The SOC screens provide more context: Looking at the database metrics, there seems to be an increase in database write latency. It also shows that one of the database machines is down. Furthermore, the transaction volume for one of the APIs is way off the charts, but only compared to earlier in the day. Compared to a week ago (see next section), this is absolutely expected behavior. A quick look in the configuration management database shows that there is a ticket that mentions the maintenance of one of the database servers. (Ideally this information would have been on the SOC screen as well!) Given all this information, this is not a DoS attack, but an IT ops problem. On to the next event.

2. Show Comparisons

If individual graphs are shown on the screens, they can be made more useful if they show comparisons. Look at the following example:
graph

The blue line in the graph shows the metric’s value over the day. It’s 11am right now and we just observed quite a spike in this metric. Comparing the metric to itself, this is clearly an anomaly. However, having the green dotted line in the background, which shows the metric at the same time a week ago, we see that it is normal for this metric to spike at around noon. So no anomaly to be found here.
Why showing a comparison to the values a week ago? It helps absorbing seasonality. If you compared the metric to yesterday, on Monday you would compare to a Sunday, which often shows very different metrics. A month is too far away. A lot of things can change in a month. A week is a good time frame.

What should be on the screens?

The logical next question is what to put on those screens. Well, that depends a little, but here are some ideas:

  • Summary of some news feeds (FS ISAC feeds, maybe even threat feeds)
  • Monitoring twitter or IRC for certain activity
  • All kinds of volumes or metrics (e.g., #firewall blocks, #IDS alerts, #failed transactions)
  • Top 10 suspicious users
  • Top 10 servers connecting outbound (by traffic and by number of connections)

I know, I am being very vague. What is a ‘summary of a news feed’? You can extract the important words and maybe display a word cloud or a treemap. Or you might list certain objects that you find in the news feed, such as vulnerability IDs and vulnerability names. If you monitor IRC, do some natural language processing (NLP) to extract keywords. To find suspicious users you can use all kinds of behavioral models. Maybe you have a product lying around that does something like that.

Why would you want to see the top 10 servers connecting outbound? If you know which servers talk most to the outside world and the list suddenly changes, you might want to know. Maybe someone is exfiltrating information? Even if the list is not that static, your analysts will likely get really good at spotting trends over time. You might even want to filter the list so that the top entries don’t even show up, but maybe the ones at position 11-20. Or something like that. You get the idea.

Have you done anything like that? Write a comment and tell us what works for you. Have some pictures or screenshots? Even better. Send them over!

September 17, 2014

AfterGlow 1.6.5 – Edge Labels

Filed under: Log Analysis,Programming,Visualization — Raffael Marty @ 5:32 am

A new version of AfterGlow is ready. Version 1.6.5 has a couple of improvements:

1. If you have an input file which only has two columns, AfterGlow now automatically switches to a two-node mode. You don’t have to use the (-t) switch explicitly anymore in this case! (I know, it’s about time I added this)

2. Very minor change, but something that kept annoying me over time is the default edge length. It was set to 3 initially and now it’s reduced to 1.5, which makes fro a bit more compact graphs. You can still change this with the -e switch on the command line

3. The major change is about adding edge label though. Here is a quick example:

label.edge=$fields[2]

This assumes that the third column of your data contains the label for the data. In the example below, the port numbers:

10.0.0.5,10.0.0.1,53
10.0.0.5,10.0.0.1,80

When you run afterglow, use the -t switch to have it render only two nodes, but given the configuration above, we are using the third column as the edge label. The output will look like this:

edge_label

 

As you can see, we have twice the same edge defined in the data with two different labels (port 53 and 80). If you want to have the graph show both edges, you add the following configuration in the configuration file:

label.duplicate=1

Which then results in the following graph:

edge_label_duplicate

 

Note that the duplicating of edges only works with GDF files (-k). The edge labels work in DOT and GDF files, not in GraphSON output.

January 19, 2014

A New and Updated Field Dictionary for Logging Standards

Filed under: Uncategorized — Raffael Marty @ 2:51 pm

If you have been interested and been following event interchange formats or logging standards, you know of CEF and CEE. Problem is that we lost funding for CEE, which doesn’t mean that CEE is dead! In fact, I updated the field dictionary to accommodate some more use-cases and data sources. The one currently published by CEE is horrible. Don’t use it. Use my new version!

Whether you are using CEE or any other logging standard for your message formatting, you will need a naming schema; a set of field names. In CEE we call that a field dictionary.

The problem with the currently published field dictionary of CEE is that it’s inconsistent, has duplicate field names, and is missing a bunch of field names that you commonly need. I updated and cleaned up the dictionary (see below or download it here.) Please email me with any feedback / updates / additions! This is by no means complete, but it’s a good next iteration to keep improving on! If you know and use CEF, you can use this new dictionary with it. The problem with CEF is that it has to use ArcSight’s very limited field schema. And you have to overload a bunch of fields. So, try using this schema instead!

I was emailing with my friend Jose Nazario the other day and realized that we never really published anything decent on the event taxonomy either. That’s going to be my next task to gather whatever I can find in notes and such to put together an updated version of the taxonomy with my latest thinking; which has emerged quite a bit in the last 12 years that I have been building event taxonomies (starting with the ArcSight categorization schema, Splunk’s Common Information Model, and then designing the CEE taxonomy). Stay tuned for that.

For reference purposes. Here are some spin-offs from CEE which have field dictionaries as well:

Here is the new field dictionary:

Object Field Type Description
action STRING Action taken
bytes_received NUMBER Bytes received
bytes_sent NUMBER Bytes sent
category STRING Log source assigned category of message
cmd STRING Command
duration NUMBER Duration in seconds
host STRING Hostname of the event source
in_interface STRING Inbound interface
ip_proto NUMBER IP protocol field value (8=UDP, …)
msg STRING The event message
msgid STRING The event message identifier
out_interface STRING Outbound interface
packets_received NUMBER Number of packets received
packets_sent NUMBER Number of packets sent
reason STRING Reason for action taken or activity observed
rule_number STRING Number of rule – firewalls, for example
subsys STRING Application subsystem responsible for generating the event
tcp_flags STRING TCP flags
tid NUMBER Numeric thread ID associated with the process generating the event
time DATETIME Event Start Time
time_logged DATETIME Time log record was logged
time_received DATETIME Time log record was received
vend STRING Vendor of the event source application
app name STRING Name of the application that generated the event
app session_id STRING Session identifier from application
app vend STRING Application vendor
app ver STRING Application version
dst country STRING Country name of the destination
dst host STRING Network destination hostname
dst ipv4 IPv4 Network destination IPv4 address
dst ipv6 IPv6 Network destination IPv6 address
dst nat_ipv4 IPv4 NAT IPv4 address of destination
dst nat_ipv6 IPv6 NAT IPv6 destination address
dst nat_port NUMBER NAT port number for destination
dst port NUMBER Network destination port
dst zone STRING Zone name for destination – examples: Bldg1, Europe
file line NUMBER File line number
file md5 STRING File MD5 Hash
file mode STRING File mode flags
file name STRING File name
file path STRING File system path
file perm STRING File permissions
file size NUMBER File size in bytes
http content_type STRING MIME content type within HTTP
http method STRING HTTP method – GET | POST | HEAD | …
http query_string STRING HTTP query string
http request STRING HTTP request URL
http request_protocol STRING HTTP protocol used
http status NUMBER Return code in HTTP response
palo_alto actionflags STRING Palo Alto Networks Firewall Specific Field
palo_alto config_version STRING Palo Alto Networks Firewall Specific Field
palo_alto cpadding STRING Palo Alto Networks Firewall Specific Field
palo_alto domain STRING Palo Alto Networks Firewall Specific Field
palo_alto log_type STRING Palo Alto Networks Firewall Specific Field
palo_alto padding STRING Palo Alto Networks Firewall Specific Field
palo_alto seqno STRING Palo Alto Networks Firewall Specific Field
palo_alto serial_number STRING Palo Alto Networks Firewall Specific Field
palo_alto threat_content_type STRING Palo Alto Networks Firewall Specific Field
palo_alto virtual_system STRING Palo Alto Networks Firewall Specific Field
proc id STRING Process ID (pid)
proc name STRING Process name
proc tid NUMBER Thread identifier of the process
src country STRING Country name of the source
src host STRING Network source hostname
src ipv4 IPv4 Network source IPv4 address
src ipv6 IPv6 Network source IPv6 address
src nat_ipv4 IPv4 NAT IPv4 address of source
src nat_ipv6 IPv6 NAT IPVv6 address
src nat_port NUMBER NAT port number for source
src port NUMBER Network source port
src zone STRING Zone name for source – examples: Bldg1, Europe
syslog fac NUMBER Syslog facility value
syslog pri NUMBER Syslog priority value
syslog pri STRING Event priority (ERROR|WARN|DEBUG|CRIT)
syslog sev NUMBER Event severity
syslog tag STRING Syslog Tag value
syslog ver NUMBER Syslog Protocol version (0=legacy/RFC3164; 1=RFC5424)
user auid STRING Source User login authentication ID (login id)
user domain STRING User account domain (NT Domain)
user eid STRING Source user effective ID (euid)
user gid STRING Group ID (gid)
user group STRING Group name
user id STRING User account ID (uid)
user name STRING User account name
October 25, 2013

Using Impala and Parquet to Analyze Network Traffic – VAST 2013 Challenge

Filed under: Log Analysis — Raffael Marty @ 4:51 pm

As I outlined in my previous blog post on How to clean up network traffic logs, I have been working with the VAST 2013 traffic logs. Today I am going to show you can load the traffic logs into Impala (with a parquet table) for very quick querying.

First off, Impala is a real-time search engine for Hadoop (i.e., Hive/HDFS). So, scalable, distributed, etc. In the following I am assuming that you have Impala installed already. If not, I recommend you use the Cloudera Manager to do so. It’s pretty straight forward.

First we have to load the data into Impala, which is a two step process. We are using external tables, meaning that the data will live in files on HDFS. What we have to do is getting the data into HDFS first and then loading it into Impala:

$ sudo su - hdfs
$ hdfs dfs -put /tmp/nf-chunk*.csv /user/hdfs/data

We first become the hdfs user, then copy all of the netflow files from the MiniChallenge into HDFS at /user/hdfs/data. Next up we connect to impala and create the database schema:

$ impala-shell
create external table if not exists logs (
	TimeSeconds double,
	parsedDate timestamp,
	dateTimeStr string,
	ipLayerProtocol int,
	ipLayerProtocolCode string,
	firstSeenSrcIp string,
	firstSeenDestIp string,
	firstSeenSrcPort int,
	firstSeenDestPor int,
	moreFragment int,
	contFragment int,
	durationSecond int,
	firstSeenSrcPayloadByte bigint,
	firstSeenDestPayloadByte bigint,
	firstSeenSrcTotalByte bigint,
	firstSeenDestTotalByte bigint,
	firstSeenSrcPacketCoun int,
	firstSeenDestPacketCoun int,
	recordForceOut int)
row format delimited fields terminated by ',' lines terminated by '\n'
location '/user/hdfs/data/';

Now we have a table called ‘logs’ that contains all of our data. We told Impala that the data is comma separated and told it where the data files are. That’s already it. What I did on my installation is leveraging the columnar data format of Impala to speed queries up. A lot of analytic queries don’t really suit the row-oriented manner of databases. Columnar orientation is much more suited. Therefore we are creating a Parquet-based table:

create table pq_logs like logs stored as parquetfile;
insert overwrite table pq_logs select * from logs;

The second command is going to take a bit as it loads all the data into the new Parquet table. You can now issues queries against the pq_logs table and you will get the benefits of a columnar data store:

select distinct firstseendestpor from pq_logs where morefragment=1;

Have a look at my previous blog entry for some more queries against this data.

October 22, 2013

Cleaning Up Network Traffic Logs – VAST 2013 Challenge

Filed under: Log Analysis — Raffael Marty @ 1:59 pm

I have spent some significant time with the VAST 2013 Challenge. I have been part of the program committee for a couple of years now and have seen many challenge submissions. Both good and bad. What I noticed with most submissions is that they a) didn’t really understand network data, and b) they didn’t clean the data correctly. If you wanna follow along my analysis, the data is here: Week 1 – Network Flows (~500MB)

Also check the follow-on blog post on how to load data into a columnar data store in order to work with it.

Let me help with one quick comment. There is a lot of traffic in the data that seems to be involving port 0:

$ cat nf-chunk1-rev.csv | awk -F, '{if ($8==0) print $0}'
1364803648.013658,2013-04-01 08:07:28,20130401080728.013658,1,OTHER,172.10.0.6,
172.10.2.6,0,0,0,0,1,0,0,222,0,3,0,0

Just because it says port 0 in there doesn’t mean it’s port 0! Check out field 5, which says OTHER. That’s the transport protocol. It’s not TCP or UDP, so the port is meaningless. Most likely this is ICMP traffic!

On to another problem with the data. Some of the sources and destinations are turned around in the traffic. This happens with network flow collectors. Look at these two records:

1364803504.948029,2013-04-01 08:05:04,20130401080504.948029,6,TCP,172.30.1.11, 
10.0.0.12,9130,80,0,0,0,176,409,454,633,5,4,0
1364807428.917824,2013-04-01 09:10:28,20130401091028.917824,6,TCP,172.10.0.4,
172.10.2.64,80,14545,0,0,0,7425,0,7865,0,8,0,0

The first one is totally legitimate. The source port is 9130, the destination 80. The second record, however, has the source and destination turned around. Port 14545 is not a valid destination port and the collector just turned the information around.

The challenge is on you now to find which records are inverted and then you have to flip them back around. Here is what I did in order to find the ones that were turned around (Note, I am only using the first week of data for MiniChallenge1!):

select firstseendestport, count(*) c from logs group by firstseendestport order
 by c desc limit 20;
| 80               | 41229910 |
| 25               | 272563   |
| 0                | 119491   |
| 123              | 95669    |
| 1900             | 68970    |
| 3389             | 58153    |
| 138              | 6753     |
| 389              | 3672     |
| 137              | 2352     |
| 53               | 955      |
| 21               | 767      |
| 5355             | 311      |
| 49154            | 211      |
| 464              | 100      |
| 5722             | 98       |
...

What I am looking for here are the top destination ports. My theory being that most valid ports will show up quite a lot. This gave me a first candidate list of ports. I am looking for two things here. First, the frequency of the ports and second whether I recognize the ports as being valid. Based on the frequency I would put the ports down to port 3389 on my candidate list. But because all the following ones are well known ports, I will include them down to port 21. So the first list is:

80,25,0,123,1900,3389,138,389,137,53,21

I’ll drop 0 from this due to the comment earlier!

Next up, let’s see what the top source ports are that are showing up.

| firstseensrcport | c       |
+------------------+---------+
| 80               | 1175195 |
| 62559            | 579953  |
| 62560            | 453727  |
| 51358            | 366650  |
| 51357            | 342682  |
| 45032            | 288301  |
| 62561            | 256368  |
| 45031            | 227789  |
| 51359            | 180029  |
| 45033            | 157071  |
| 0                | 119491  |
| 45034            | 117760  |
| 123              | 95622   |
| 1984             | 81528   |
| 25               | 19646   |
| 138              | 6711    |
| 137              | 2288    |
| 2024             | 929     |
| 2100             | 927     |
| 1753             | 926     |

See that? Port 80 is the top source port showing up. Definitely a sign of a source/destination confusion. There are a bunch of others from our previous candidate list showing up here as well. All records where we have to turn source and destination around. But likely we are still missing some ports here.

Well, let’s see what other source ports remain:

select firstseensrcport, count(*) c from pq_logs2 group by firstseensrcport 
having firstseensrcport<1024 and firstseensrcport not in (0,123,138,137,80,25,53,21) 
order by c desc limit 10
+------------------+--------+
| firstseensrcport | c      |
+------------------+--------+
| 62559            | 579953 |
| 62560            | 453727 |
| 51358            | 366650 |
| 51357            | 342682 |
| 45032            | 288301 |
| 62561            | 256368 |
| 45031            | 227789 |
| 51359            | 180029 |
| 45033            | 157071 |
| 45034            | 117760 |

Looks pretty normal. Well. Sort of, but let’s not digress. But lets try to see if there are any ports below 1024 showing up. Indeed, there is port 20 that shows, totally legitimate destination port. Let’s check out the. Pulling out the destination ports for those show nice actual source ports:

+------------------+------------------+---+
| firstseensrcport | firstseendestport| c |
+------------------+------------------+---+
| 20               | 3100             | 1 |
| 20               | 8408             | 1 |
| 20               | 3098             | 1 |
| 20               | 10129            | 1 |
| 20               | 20677            | 1 |
| 20               | 27362            | 1 |
| 20               | 3548             | 1 |
| 20               | 21396            | 1 |
| 20               | 10118            | 1 |
| 20               | 8407             | 1 |
+------------------+------------------+---+

Adding port 20 to our candidate list. Now what? Let’s see what happens if we look at the top ‘connections':

select firstseensrcport, 
firstseendestport, count(*) c from pq_logs2 group by firstseensrcport, 
firstseendestport having firstseensrcport not in (0,123,138,137,80,25,53,21,20,1900,3389,389) 
and firstseendestport not in (0,123,138,137,80,25,53,21,20,3389,1900,389) 
order by c desc limit 10
+------------------+------------------+----+
| firstseensrcport | firstseendestpor | c  |
+------------------+------------------+----+
| 1984             | 4244             | 11 |
| 1984             | 3198             | 11 |
| 1984             | 4232             | 11 |
| 1984             | 4276             | 11 |
| 1984             | 3212             | 11 |
| 1984             | 4247             | 11 |
| 1984             | 3391             | 11 |
| 1984             | 4233             | 11 |
| 1984             | 3357             | 11 |
| 1984             | 4252             | 11 |
+------------------+------------------+----+

Interesting. Looking through the data where the source port is actually 1984, we can see that a lot of the destination ports are showing increasing numbers. For example:

| 1984             | 2228             | 172.10.0.6     | 172.10.1.118    |
| 1984             | 2226             | 172.10.0.6     | 172.10.1.147    |
| 1984             | 2225             | 172.10.0.6     | 172.10.1.141    |
| 1984             | 2224             | 172.10.0.6     | 172.10.1.115    |
| 1984             | 2223             | 172.10.0.6     | 172.10.1.120    |
| 1984             | 2222             | 172.10.0.6     | 172.10.1.121    |
| 1984             | 2221             | 172.10.0.6     | 172.10.1.135    |
| 1984             | 2220             | 172.10.0.6     | 172.10.1.126    |
| 1984             | 2219             | 172.10.0.6     | 172.10.1.192    |
| 1984             | 2217             | 172.10.0.6     | 172.10.1.141    |
| 1984             | 2216             | 172.10.0.6     | 172.10.1.173    |
| 1984             | 2215             | 172.10.0.6     | 172.10.1.116    |
| 1984             | 2214             | 172.10.0.6     | 172.10.1.120    |
| 1984             | 2213             | 172.10.0.6     | 172.10.1.115    |
| 1984             | 2212             | 172.10.0.6     | 172.10.1.126    |
| 1984             | 2211             | 172.10.0.6     | 172.10.1.121    |
| 1984             | 2210             | 172.10.0.6     | 172.10.1.172    |
| 1984             | 2209             | 172.10.0.6     | 172.10.1.119    |
| 1984             | 2208             | 172.10.0.6     | 172.10.1.173    |

That would hint at this guy being actually a destination port. You can also query for all the records that have the destination port set to 1984, which will show that a lot of the source ports in those connections are definitely source ports, another hint that we should add 1984 to our list of actual ports. Continuing our journey, I found something interesting. I was looking for all connections that don’t have a source or destination port in our candidate list and sorted by the number of occurrences:

+------------------+------------------+---+
| firstseensrcport | firstseendestport| c |
+------------------+------------------+---+
| 62559            | 37321            | 9 |
| 62559            | 36242            | 9 |
| 62559            | 19825            | 9 |
| 62559            | 10468            | 9 |
| 62559            | 34395            | 9 |
| 62559            | 62556            | 9 |
| 62559            | 9005             | 9 |
| 62559            | 59399            | 9 |
| 62559            | 7067             | 9 |
| 62559            | 13503            | 9 |
| 62559            | 30151            | 9 |
| 62559            | 23267            | 9 |
| 62559            | 56184            | 9 |
| 62559            | 58318            | 9 |
| 62559            | 4178             | 9 |
| 62559            | 65429            | 9 |
| 62559            | 32270            | 9 |
| 62559            | 18104            | 9 |
| 62559            | 16246            | 9 |
| 62559            | 33454            | 9 |

This is strange in so far as this source port seems to connect to totally random ports, but not making any sense. Is this another legitimate destination port? I am not sure. It’s way too high and I don’t want to put it on our list. Open question. No idea at this point. Anyone?

Moving on without this 62559, we see the same behavior for 62560 and then 51357 and 51358, as well as 45031, 45032, 45033. And it keeps going like that. Let’s see what the machines are involved in this traffic. Sorry, not the nicest SQL, but it works:

.

select firstseensrcip, firstseendestip, count(*) c 
from pq_logs2 group by firstseensrcip, firstseendestip,firstseensrcport 
having firstseensrcport in (62559, 62561, 62560, 51357, 51358)  
order by c desc limit 10
+----------------+-----------------+-------+
| firstseensrcip | firstseendestip | c     |
+----------------+-----------------+-------+
| 10.9.81.5      | 172.10.0.40     | 65534 |
| 10.9.81.5      | 172.10.0.4      | 65292 |
| 10.9.81.5      | 172.10.0.4      | 65272 |
| 10.9.81.5      | 172.10.0.4      | 65180 |
| 10.9.81.5      | 172.10.0.5      | 65140 |
| 10.9.81.5      | 172.10.0.9      | 65133 |
| 10.9.81.5      | 172.20.0.6      | 65127 |
| 10.9.81.5      | 172.10.0.5      | 65124 |
| 10.9.81.5      | 172.10.0.9      | 65117 |
| 10.9.81.5      | 172.20.0.6      | 65099 |
+----------------+-----------------+-------+

Here we have it. Probably an attacker :). This guy is doing not so nice things. We should exclude this IP for our analysis of ports. This guy is just all over.

Now, we continue along similar lines and find what machines are using port 45034, 45034, 45035:

select firstseensrcip, firstseendestip, count(*) c 
from pq_logs2 group by firstseensrcip, firstseendestip,firstseensrcport 
having firstseensrcport in (45035, 45034, 45033)  order by c desc limit 10
+----------------+-----------------+-------+
| firstseensrcip | firstseendestip | c     |
+----------------+-----------------+-------+
| 10.10.11.15    | 172.20.0.15     | 61337 |
| 10.10.11.15    | 172.20.0.3      | 55772 |
| 10.10.11.15    | 172.20.0.3      | 53820 |
| 10.10.11.15    | 172.20.0.2      | 51382 |
| 10.10.11.15    | 172.20.0.15     | 51224 |
| 10.15.7.85     | 172.20.0.15     | 148   |
| 10.15.7.85     | 172.20.0.15     | 148   |
| 10.15.7.85     | 172.20.0.15     | 148   |
| 10.7.6.3       | 172.30.0.4      | 30    |
| 10.7.6.3       | 172.30.0.4      | 30    |

We see one dominant IP here. Probably another ‘attacker’. So we exclude that and see what we are left with. Now, this is getting tedious. Let’s just visualize some of the output to see what’s going on. Much quicker! And we only have 36970 records unaccounted for.

What you can see is the remainder of traffic. Very quickly we see that there is one dominant IP address. We are going to filter that one out. Then we are left with this:

I selected some interesting traffic here. Turns out, we just found another destination port: 5535 for our list. I continued this analysis and ended up with something like 38 records, which are shown in the last image:

I’ll leave it at this for now. I think that’s a pretty good set of ports:

20,21,25,53,80,123,137,138,389,1900,1984,3389,5355

Oh well, if you want to fix your traffic now and turn around the wrong source/destination pairs, here is a hack in perl:

$ cat nf*.csv | perl -F\,\ -ane 'BEGIN {@ports=(20,21,25,53,80,123,137,138,389,1900,1984,3389,5355); 
%hash = map { $_ => 1 } @ports; $c=0} if ($hash{$F[7]} && $F[8}>1024) 
{$c++; printf"%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s",
$F[0],$F[1],$F[2],$F[3],$F[4],$F[6],$F[5],$F[8],$F[7],$F[9],$F[10],$F[11],$F[13],$F[12],
$F[15],$F[14],$F[17],$F[16],$F[18]} else {print $_} END {print "count of revers $c\n";}'

We could have switched to visual analysis way earlier, which I did in my initial analysis, but for the blog I ended up going way further in SQL than I probably should have. The next blog post covers how to load all of the VAST data into a Hadoop / Impala setup.