Who knows, I might just pick up my blogging again at some point. For now, I posted a short leadership related post on my Leadership | Technology | Spirit blog. Check it out.
*NIX Command Line Foo
Well, not one of my normal blog posts, but I hope some of you geeks out there will find this useful anyways. I will definitely use this post as a reference frequently.
I have been using various flavors of UNIX and their command lines, from ksh to bash and zsh, for over 25 years, and there is always something new to learn that makes me faster at the jobs I am doing. One tool that I keep using (despite my growing command of Excel) is VIM coupled with UNIX command line tools. It saves me hours and hours of work all the time.
Well, here are some new things I learned and want to remember from the art of command line GitHub repo:
- CTRL-W on the command line deletes the last word
- pgrep to search for processes rather than doing the longer version with awk
- lsof -iTCP -sTCP:LISTEN -P -n lists processes listening on TCP ports
- Diff two json files:
  diff <(jq --sort-keys . < file1.json) <(jq --sort-keys . < file2.json) | colordiff | less -R
- I totally forgot about csvkit (brew install csvkit):
  in2csv file1.xls > file1.csv
  csvstat data.csv
  csvsql --query "select name from data where age > 30" data.csv > old.csv
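The JSON diff trick above works because sorting the keys puts both files into the same canonical order before they are compared, so only real changes show up. Here is a minimal Python sketch of the same idea, using only the standard library. The function names and the sample documents are made up for illustration, and inline strings stand in for file1.json and file2.json.

```python
import difflib
import json


def normalized_lines(text: str) -> list:
    """Re-serialize JSON with sorted keys so that key order cannot cause diffs."""
    return json.dumps(json.loads(text), sort_keys=True, indent=2).splitlines()


def json_diff(a: str, b: str) -> list:
    """Return a unified diff of two JSON documents after key normalization."""
    return list(
        difflib.unified_diff(
            normalized_lines(a),
            normalized_lines(b),
            fromfile="file1.json",
            tofile="file2.json",
            lineterm="",
        )
    )


# Same keys in a different order, plus one real change (age).
left = '{"name": "alice", "age": 31}'
right = '{"age": 32, "name": "alice"}'

for line in json_diff(left, right):
    print(line)
```

Only the age line shows up as changed; the reordered name key does not.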
I just found some additional commands on OSX that I wish I had known earlier:
- ditto copies one or more source files or directories to a destination directory. If the destination directory does not exist, it will be created before the first source is copied. If the destination directory already exists, the source directories are merged with the previous contents of the destination.
- pbcopy copies data from the command line into the clipboard
- qlmanage gives you a quick look at a file from the command line
This is a great repo as well for great OSX commands.
Rockstars Use a Good Text Editor – I Use VIM
Those of you who know me most likely know that I am quite the VIM fan. At any time, there is at least one VIM window open on my computer. I just like the speed of editing and the flexibility it offers. I even use VI bindings in my UNIX shells (set -o vi). And yes, I did write my book in VIM.
Anyways, here is a command from my .vimrc file that I use a lot:
command F set guifont=Monaco:h13
Basically, if I type “:F”, it makes my font larger. I know, not earth shattering, but really useful.
Here are a couple esthetic things I like to make my VIM look nice:
set background=dark
colorscheme solarized
set guioptions=-m
This is my complete .vimrc file.
Big Data Lake – Leveraging Big Data Technologies To Build a Common Data Repository For Security
Information security has been dealing with terabytes of data for over a decade, almost two. Companies of all sizes are realizing the benefit of having more data available to not only conduct forensic investigations, but also proactively find anomalies and stop adversaries before they cause any harm.
UPDATE: Download the paper here
I am finalizing a paper on the topic of the security big data lake. I should have the full paper available soon. As a teaser, here are the first two sections:
What Is a Data Lake?
The term data lake comes from the big data community and is appearing more and more often in the security field. A data lake (or a data hub) is a central location where all security data is collected and stored. Sounds like log management or security information and event management (SIEM)? Sure. Very similar. In line with the Hadoop big data movement, one of the objectives is to run the data lake on commodity hardware and storage that is cheaper than special purpose storage arrays, SANs, etc. Furthermore, the lake should be accessible by third-party tools, processes, workflows, and teams across the organization that need the data. Log management tools do not make it easy to access the data through standard interfaces (APIs). They also do not provide a way to run arbitrary analytics code against the data.
Just because we mentioned SIEM and data lakes in the same sentence above does not mean that a data lake is a replacement for a SIEM. The concept of a data lake merely covers the storage and maybe some of the processing of data. SIEMs are so much more.
Why Implement a Data Lake?
Security data is often found stored in multiple copies across a company. Every security product collects and stores its own copy of the data. For example, tools working with network traffic (e.g., IDS/IPS, DLP, forensic tools) monitor, process, and store their own copies of the traffic. Behavioral monitoring, network anomaly detection, user scoring, correlation engines, etc. all need a copy of the data to function. Every security solution is more or less collecting and storing the same data over and over again, resulting in multiple data copies.
The data lake tries to get rid of this duplication by collecting the data once and making it available to all the tools and products that need it. This is much easier said than done. The core of this document is a discussion of the issues and approaches around the lake.
To summarize, the four goals of the data lake are:
- One way (process) to collect all data
- Process, clean, enrich the data in one location
- Store data once
- Have a standard interface to access the data
One of the main challenges with this approach is how to make all the security products leverage the data lake instead of collecting and processing their own data. Mostly this means that products have to be rebuilt by the vendors to do so.
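To make the four goals a bit more tangible, here is a toy in-memory sketch in Python. It is not how a real lake would be built (no Hadoop, no persistence), and everything in it is an assumption made for illustration: the key=value input format, the zone lookup, and the names DataLake, collect, and query.

```python
from typing import Callable, Dict, Iterator, List

Event = Dict[str, str]


def parse(raw: str) -> Event:
    """Parse a whitespace-separated key=value record (assumed input format)."""
    return dict(pair.split("=", 1) for pair in raw.split())


def enrich(event: Event) -> Event:
    """Toy enrichment: add a destination zone from a made-up lookup table."""
    zones = {"10.0.0.5": "Bldg1"}  # hypothetical asset inventory
    enriched = dict(event)
    if event.get("dst") in zones:
        enriched["dst_zone"] = zones[event["dst"]]
    return enriched


class DataLake:
    """Collect once, process in one place, store once, expose one interface."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def collect(self, raw: str) -> None:
        # Goals 1-3: one entry point, central processing, a single stored copy.
        self._events.append(enrich(parse(raw)))

    def query(self, predicate: Callable[[Event], bool]) -> Iterator[Event]:
        # Goal 4: every consumer reads through the same interface.
        return (e for e in self._events if predicate(e))


lake = DataLake()
lake.collect("src=192.168.1.7 dst=10.0.0.5 action=blocked")
lake.collect("src=192.168.1.9 dst=10.0.0.8 action=allowed")

# Any tool (IDS, anomaly detection, reporting) would ask the lake
# instead of keeping its own copy of the data.
blocked = list(lake.query(lambda e: e["action"] == "blocked"))
print(blocked)
```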
Have you implemented something like this? Email me or put a comment on the blog. I’d love to hear your experience. And stay tuned for the full paper!
A New and Updated Field Dictionary for Logging Standards
If you have been following event interchange formats or logging standards, you know of CEF and CEE. The problem is that we lost funding for CEE, which doesn’t mean that CEE is dead! In fact, I updated the field dictionary to accommodate some more use-cases and data sources. The one currently published by CEE is horrible. Don’t use it. Use my new version!
Whether you are using CEE or any other logging standard for your message formatting, you will need a naming schema; a set of field names. In CEE we call that a field dictionary.
The problem with the currently published field dictionary of CEE is that it’s inconsistent, has duplicate field names, and is missing a bunch of field names that you commonly need. I updated and cleaned up the dictionary (see below or download it here). Please email me with any feedback / updates / additions! This is by no means complete, but it’s a good next iteration to keep improving on! If you know and use CEF, you can use this new dictionary with it. The problem with CEF is that it has to use ArcSight’s very limited field schema, and you have to overload a bunch of fields. So, try using this schema instead!
I was emailing with my friend Jose Nazario the other day and realized that we never really published anything decent on the event taxonomy either. My next task is going to be to gather whatever I can find in notes and such and put together an updated version of the taxonomy with my latest thinking, which has evolved quite a bit in the last 12 years that I have been building event taxonomies (starting with the ArcSight categorization schema, Splunk’s Common Information Model, and then designing the CEE taxonomy). Stay tuned for that.
For reference purposes, here are some spin-offs from CEE that have field dictionaries as well:
- Project Lumberjack which has some field names.
- SyslogNG PatternDB has a bunch of patterns and they also have a Schema.
Here is the new field dictionary:
Object | Field | Type | Description
(none) | action | STRING | Action taken
(none) | bytes_received | NUMBER | Bytes received
(none) | bytes_sent | NUMBER | Bytes sent
(none) | category | STRING | Log source assigned category of message
(none) | cmd | STRING | Command
(none) | duration | NUMBER | Duration in seconds
(none) | host | STRING | Hostname of the event source
(none) | in_interface | STRING | Inbound interface
(none) | ip_proto | NUMBER | IP protocol field value (6=TCP, 17=UDP, …)
(none) | msg | STRING | The event message
(none) | msgid | STRING | The event message identifier
(none) | out_interface | STRING | Outbound interface
(none) | packets_received | NUMBER | Number of packets received
(none) | packets_sent | NUMBER | Number of packets sent
(none) | reason | STRING | Reason for action taken or activity observed
(none) | rule_number | STRING | Number of rule – firewalls, for example
(none) | subsys | STRING | Application subsystem responsible for generating the event
(none) | tcp_flags | STRING | TCP flags
(none) | tid | NUMBER | Numeric thread ID associated with the process generating the event
(none) | time | DATETIME | Event Start Time
(none) | time_logged | DATETIME | Time log record was logged
(none) | time_received | DATETIME | Time log record was received
(none) | vend | STRING | Vendor of the event source application
app | name | STRING | Name of the application that generated the event |
app | session_id | STRING | Session identifier from application |
app | vend | STRING | Application vendor |
app | ver | STRING | Application version |
dst | country | STRING | Country name of the destination |
dst | host | STRING | Network destination hostname |
dst | ipv4 | IPv4 | Network destination IPv4 address |
dst | ipv6 | IPv6 | Network destination IPv6 address |
dst | nat_ipv4 | IPv4 | NAT IPv4 address of destination |
dst | nat_ipv6 | IPv6 | NAT IPv6 destination address |
dst | nat_port | NUMBER | NAT port number for destination |
dst | port | NUMBER | Network destination port |
dst | zone | STRING | Zone name for destination – examples: Bldg1, Europe |
file | line | NUMBER | File line number |
file | md5 | STRING | File MD5 Hash |
file | mode | STRING | File mode flags |
file | name | STRING | File name |
file | path | STRING | File system path |
file | perm | STRING | File permissions |
file | size | NUMBER | File size in bytes |
http | content_type | STRING | MIME content type within HTTP |
http | method | STRING | HTTP method – GET | POST | HEAD | … |
http | query_string | STRING | HTTP query string |
http | request | STRING | HTTP request URL |
http | request_protocol | STRING | HTTP protocol used |
http | status | NUMBER | Return code in HTTP response |
palo_alto | actionflags | STRING | Palo Alto Networks Firewall Specific Field |
palo_alto | config_version | STRING | Palo Alto Networks Firewall Specific Field |
palo_alto | cpadding | STRING | Palo Alto Networks Firewall Specific Field |
palo_alto | domain | STRING | Palo Alto Networks Firewall Specific Field |
palo_alto | log_type | STRING | Palo Alto Networks Firewall Specific Field |
palo_alto | padding | STRING | Palo Alto Networks Firewall Specific Field |
palo_alto | seqno | STRING | Palo Alto Networks Firewall Specific Field |
palo_alto | serial_number | STRING | Palo Alto Networks Firewall Specific Field |
palo_alto | threat_content_type | STRING | Palo Alto Networks Firewall Specific Field |
palo_alto | virtual_system | STRING | Palo Alto Networks Firewall Specific Field |
proc | id | STRING | Process ID (pid) |
proc | name | STRING | Process name |
proc | tid | NUMBER | Thread identifier of the process |
src | country | STRING | Country name of the source |
src | host | STRING | Network source hostname |
src | ipv4 | IPv4 | Network source IPv4 address |
src | ipv6 | IPv6 | Network source IPv6 address |
src | nat_ipv4 | IPv4 | NAT IPv4 address of source |
src | nat_ipv6 | IPv6 | NAT IPv6 address of source
src | nat_port | NUMBER | NAT port number for source |
src | port | NUMBER | Network source port |
src | zone | STRING | Zone name for source – examples: Bldg1, Europe |
syslog | fac | NUMBER | Syslog facility value |
syslog | pri | NUMBER | Syslog priority value |
syslog | pri | STRING | Event priority (ERROR|WARN|DEBUG|CRIT) |
syslog | sev | NUMBER | Event severity |
syslog | tag | STRING | Syslog Tag value |
syslog | ver | NUMBER | Syslog Protocol version (0=legacy/RFC3164; 1=RFC5424) |
user | auid | STRING | Source User login authentication ID (login id) |
user | domain | STRING | User account domain (NT Domain) |
user | eid | STRING | Source user effective ID (euid) |
user | gid | STRING | Group ID (gid) |
user | group | STRING | Group name |
user | id | STRING | User account ID (uid) |
user | name | STRING | User account name |
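To show how a naming schema like this can be put to work, here is a small Python sketch that checks an event against a handful of entries from the dictionary above. The nested layout (an object such as src holding its fields) is an assumption of this example and not something the dictionary prescribes. The function names and the sample values are made up as well.

```python
import ipaddress

# A small subset of the dictionary: (object, field) -> type.
# None stands for fields that do not belong to an object.
FIELDS = {
    (None, "action"): "STRING",
    (None, "bytes_sent"): "NUMBER",
    ("src", "ipv4"): "IPv4",
    ("src", "port"): "NUMBER",
    ("dst", "ipv4"): "IPv4",
    ("dst", "port"): "NUMBER",
    ("user", "name"): "STRING",
}


def matches_type(kind: str, value) -> bool:
    """Check a value against one of the dictionary's type names."""
    if kind == "STRING":
        return isinstance(value, str)
    if kind == "NUMBER":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if kind == "IPv4":
        if not isinstance(value, str):
            return False
        try:
            ipaddress.IPv4Address(value)
            return True
        except ValueError:
            return False
    return False


def validate(event: dict) -> list:
    """Return a list of problems; an empty list means the event conforms."""
    problems = []
    for key, value in event.items():
        # A nested dict is treated as an object ("src", "dst", ...) with fields.
        if isinstance(value, dict):
            pairs = [(key, fld, val) for fld, val in value.items()]
        else:
            pairs = [(None, key, value)]
        for obj, fld, val in pairs:
            label = f"{obj}.{fld}" if obj else fld
            kind = FIELDS.get((obj, fld))
            if kind is None:
                problems.append(f"unknown field: {label}")
            elif not matches_type(kind, val):
                problems.append(f"bad {kind} value for {label}: {val!r}")
    return problems


good = {
    "action": "blocked",
    "src": {"ipv4": "192.168.1.7", "port": 51234},
    "dst": {"ipv4": "10.0.0.5", "port": 22},
    "user": {"name": "ram"},
}
bad = {"src": {"ipv4": "999.1.1.1"}, "severity": 3}

print(validate(good))
print(validate(bad))
```

The first event passes cleanly. The second one is flagged twice: once for the invalid IPv4 address and once for a field name that is not in the subset.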
Applied Security Visualization – Book Video
It’s been a while since I wrote “Applied Security Visualization”. Here is an older video that I just came across. A good overview of the book. Enjoy!
Logging Formats and Standards
I have discussed the topic of logging standards multiple times on this blog. Some recent developments in the logging space urged me to give an update and provide my opinion:
Yet another vendor just released a “standard” log format (note the quotes around standard). It’s called the Universal Collection Framework™ (UCF). This is how the vendor describes it:
UCF is the first WAN-aware, store-and-forward, encrypted, compressed IT data transport. It allows customers to gather IT data, increase resilience, reduce network chatter and encrypt from almost any device, anywhere, quickly and easily. UCF leverages a new transport and store protocol that LogLogic intends to open source in the near future.
Sounds a whole lot like syslog. (syslog-ng and rsyslog seem to support exactly this!) Okay, let’s just look at this description: WAN aware? What the heck is that supposed to mean? You mean it won’t work well on a LAN? Does that mean it knows the Internets? That’s just a strange description to start with. Oh, and it’s the first property mentioned! The rest of the description sounds like a transport protocol. Interesting. Why not stick with syslog, which is well known, has proven to work, and has integration libraries built already? I never understood why vendors implemented their own transport protocols. They are hard (very hard) to implement and even harder for producers and consumers to adapt to. Oh well.
When people talk about UCF, they keep bringing up ArcSight’s CEF. Well, I am largely responsible for that specification. But guess what? It’s not a transport protocol! It’s a syntax definition. It tells a log producer how to format their log file. Not how to transport it. Because there is always syslog, which a lot of machines have installed already and which is easy to use. (And in newer versions you get encryption, caching, etc.)
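To illustrate the difference between syntax and transport, here is a minimal Python sketch that only formats a record and says nothing about how it gets shipped. It assumes the layout from the publicly available CEF documentation as I recall it: a pipe-delimited header starting with CEF:0, followed by key=value extension pairs, with backslash escaping. The vendor and product names are made up, and the function names are hypothetical.

```python
def _escape_header(value: str) -> str:
    # Header fields are pipe-delimited, so backslash and pipe get escaped.
    return value.replace("\\", "\\\\").replace("|", "\\|")


def _escape_extension(value: str) -> str:
    # Extension values are key=value pairs, so backslash and '=' get escaped.
    return value.replace("\\", "\\\\").replace("=", "\\=").replace("\n", "\\n")


def format_cef(vendor, product, version, signature_id, name, severity, extension):
    """Build one CEF-style line: pure syntax, no transport involved."""
    header = "|".join(
        _escape_header(str(part))
        for part in (vendor, product, version, signature_id, name, severity)
    )
    ext = " ".join(f"{k}={_escape_extension(str(v))}" for k, v in extension.items())
    return f"CEF:0|{header}|{ext}"


line = format_cef(
    "ExampleVendor",  # made-up vendor and product
    "ExampleFW",
    "1.0",
    "100",
    "Connection blocked",
    5,
    {"src": "192.168.1.7", "dst": "10.0.0.5", "dpt": 22, "act": "blocked"},
)
print(line)
# CEF:0|ExampleVendor|ExampleFW|1.0|100|Connection blocked|5|src=192.168.1.7 dst=10.0.0.5 dpt=22 act=blocked
```

The resulting string can then be handed to whatever transport is already in place, for example syslog via Python's logging.handlers.SysLogHandler.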
Now, my last point about standards. Why do vendors keep trying to come up with standards by themselves? It just doesn’t make any sense. Who is going to adopt it? At ArcSight, about 4 years ago, we came up with CEF because CEE didn’t move fast enough and we wanted something that our partners could easily use. An analyst wrote that ArcSight is planning to take CEF to the IETF. I hope they are not going to do that. I don’t have any control over that anymore, but that would be stupid. We would rather push CEE through the IETF. If you have a chance, compare the CEE syntax proposal with CEF. Notice something? Yes. It’s very similar. Again, I might have had something to do with that. Anyways. Vendors should not define logging standards!
On a good note: CEE is moving forward and just released the architecture overview for public commentary. Check it out!
All the Data That’s Fit to Visualize
Last week I posted the introductory video for a talk that I gave at Source Boston in 2008. I just found the entire video of that talk. Enjoy:
Talk by Raffael Marty:
With the ever-growing amount of data collected in IT environments, we need new methods and tools to deal with it. Event and Log Analysis is becoming one of the main tools for analysts to investigate and comprehend the state of their networks, hosts, applications, and business processes. Recent developments, such as regulatory compliance and an increased focus on insider threat, have increased the demand for analytical tools to help in the process. Visualization is offering a new, more effective, and simpler approach to data analysis. To date, security visualization has mostly failed to deliver effective tools and methods. This presentation will show what the New York Times has to teach us about effective visualizations. Visualization for the masses and not visualization for the experts. Insider Threat, Governance, Risk, and Compliance (GRC), and Perimeter Threat all require effective visualization methods and they are right in front of us – in the newspaper.
Applied Security Visualization Book seen in Singapore
A friend just sent me a couple of pictures he took in a bookstore in Singapore.
Have you seen the book Applied Security Visualization on the shelf at your local book store? If so, send me a picture and I will post it…
Security Visualization and Log Analysis Workshop – Sign up now!
“Log Analysis and Security Visualization” is a two-day training class held on March 9th and 10th 2009 in Boston during the SOURCE Boston conference that addresses the data management and analysis challenges of today’s IT environments.
Students will leave this class with the knowledge to visualize and manage their own IT data. They will learn the basics of log analysis, learn about common data sources, get an overview of visualization techniques, and learn how to generate visual representations of IT data for a number of different use-cases from DoS and worm detection to compliance reporting. The training is filled with hands-on exercises utilizing DAVIX, the open-source data analysis and visualization platform.
Register today to secure your spot.