{"id":700,"date":"2013-10-25T16:51:38","date_gmt":"2013-10-26T00:51:38","guid":{"rendered":"http:\/\/raffy.ch\/blog\/?p=700"},"modified":"2013-10-25T16:51:38","modified_gmt":"2013-10-26T00:51:38","slug":"using-impala-and-parquet-to-analyze-network-traffic-vast-2013-challenge","status":"publish","type":"post","link":"https:\/\/raffy.ch\/blog\/2013\/10\/25\/using-impala-and-parquet-to-analyze-network-traffic-vast-2013-challenge\/","title":{"rendered":"Using Impala and Parquet to Analyze Network Traffic &#8211; VAST 2013 Challenge"},"content":{"rendered":"<p>As I outlined in my previous blog post on <a href=\"http:\/\/raffy.ch\/blog\/2013\/10\/22\/cleaning-up-network-traffic-logs-vast-2013-challenge\/\">How to clean up network traffic logs<\/a>, I have been working with the <a href=\"http:\/\/vacommunity.org\/VAST+Challenge+2013\">VAST 2013<\/a> traffic logs. Today I am going to show you can load the traffic logs into Impala (with a parquet table) for very quick querying.<\/p>\n<p>First off, <a href=\"http:\/\/www.cloudera.com\/content\/cloudera\/en\/products\/cdh\/impala.html\">Impala<\/a> is a real-time search engine for Hadoop (i.e., Hive\/HDFS). So, scalable, distributed, etc. In the following I am assuming that you have Impala installed already. If not, I recommend you use the <a href=\"http:\/\/www.cloudera.com\/content\/cloudera-content\/cloudera-docs\/Impala\/latest\/Installing-and-Using-Impala\/ciiu_cm_installation.html\">Cloudera Manager<\/a> to do so. It&#8217;s pretty straight forward.<\/p>\n<p>First we have to load the data into Impala, which is a two step process. We are using external tables, meaning that the data will live in files on HDFS. What we have to do is getting the data into HDFS first and then loading it into Impala:<\/p>\n<pre style=\"font-family:monospace; font-size:14px;\">$ sudo su - hdfs\r\n$ hdfs dfs -put \/tmp\/nf-chunk*.csv \/user\/hdfs\/data<\/pre>\n<p>We first become the hdfs user, then copy all of the netflow files from the MiniChallenge into HDFS at \/user\/hdfs\/data. Next up we connect to impala and create the database schema:<\/p>\n<pre style=\"font-family:monospace; font-size:14px;\">$ impala-shell\r\ncreate external table if not exists logs (\r\n\tTimeSeconds double,\r\n\tparsedDate timestamp,\r\n\tdateTimeStr string,\r\n\tipLayerProtocol int,\r\n\tipLayerProtocolCode string,\r\n\tfirstSeenSrcIp string,\r\n\tfirstSeenDestIp string,\r\n\tfirstSeenSrcPort int,\r\n\tfirstSeenDestPor int,\r\n\tmoreFragment int,\r\n\tcontFragment int,\r\n\tdurationSecond int,\r\n\tfirstSeenSrcPayloadByte bigint,\r\n\tfirstSeenDestPayloadByte bigint,\r\n\tfirstSeenSrcTotalByte bigint,\r\n\tfirstSeenDestTotalByte bigint,\r\n\tfirstSeenSrcPacketCoun int,\r\n\tfirstSeenDestPacketCoun int,\r\n\trecordForceOut int)\r\nrow format delimited fields terminated by ',' lines terminated by '\\n'\r\nlocation '\/user\/hdfs\/data\/';<\/pre>\n<p>Now we have a table called &#8216;logs&#8217; that contains all of our data. We told Impala that the data is comma separated and told it where the data files are. That&#8217;s already it. What I did on my installation is leveraging the columnar data format of Impala to speed queries up. A lot of analytic queries don&#8217;t really suit the row-oriented manner of databases. Columnar orientation is much more suited. Therefore we are creating a <a href=\"http:\/\/parquet.io\/\">Parquet<\/a>-based table:<\/p>\n<pre style=\"font-family:monospace; font-size:14px;\">create table pq_logs like logs stored as parquetfile;\r\ninsert overwrite table pq_logs select * from logs;<\/pre>\n<p>The second command is going to take a bit as it loads all the data into the new Parquet table. You can now issues queries against the pq_logs table and you will get the benefits of a columnar data store:<\/p>\n<pre style=\"font-family:monospace; font-size:14px;\">select distinct firstseendestpor from pq_logs where morefragment=1;<\/pre>\n<p>Have a look at my <a href=\"http:\/\/raffy.ch\/blog\/2013\/10\/22\/cleaning-up-network-traffic-logs-vast-2013-challenge\/\">previous blog<\/a> entry for some more queries against this data.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>As I outlined in my previous blog post on How to clean up network traffic logs, I have been working with the VAST 2013 traffic logs. Today I am going to show you can load the traffic logs into Impala (with a parquet table) for very quick querying. First off, Impala is a real-time search [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6],"tags":[],"class_list":["post-700","post","type-post","status-publish","format-standard","hentry","category-log-analysis"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/raffy.ch\/blog\/wp-json\/wp\/v2\/posts\/700","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/raffy.ch\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/raffy.ch\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/raffy.ch\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/raffy.ch\/blog\/wp-json\/wp\/v2\/comments?post=700"}],"version-history":[{"count":2,"href":"https:\/\/raffy.ch\/blog\/wp-json\/wp\/v2\/posts\/700\/revisions"}],"predecessor-version":[{"id":703,"href":"https:\/\/raffy.ch\/blog\/wp-json\/wp\/v2\/posts\/700\/revisions\/703"}],"wp:attachment":[{"href":"https:\/\/raffy.ch\/blog\/wp-json\/wp\/v2\/media?parent=700"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/raffy.ch\/blog\/wp-json\/wp\/v2\/categories?post=700"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/raffy.ch\/blog\/wp-json\/wp\/v2\/tags?post=700"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}