As I outlined in my previous blog post on How to clean up network traffic logs, I have been working with the VAST 2013 traffic logs. Today I am going to show you how you can load the traffic logs into Impala (using a Parquet table) for very quick querying.
First off, Impala is a real-time SQL query engine for data stored in Hadoop (i.e., Hive/HDFS). So, scalable, distributed, etc. In the following I am assuming that you have Impala installed already. If not, I recommend you use the Cloudera Manager to do so. It’s pretty straightforward.
First we have to load the data into Impala, which is a two-step process. We are using external tables, meaning that the data will live in files on HDFS. We first have to get the data into HDFS and then point Impala at it:
$ sudo su - hdfs
$ hdfs dfs -put /tmp/nf-chunk*.csv /user/hdfs/data
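To double-check that the files actually landed in HDFS, you can list the target directory (same path as above):

$ hdfs dfs -ls /user/hdfs/data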
We first become the hdfs user, then copy all of the netflow files from the MiniChallenge into HDFS at /user/hdfs/data. Next up we connect to Impala and create the database schema:
$ impala-shell

create external table if not exists logs (
  TimeSeconds double,
  parsedDate timestamp,
  dateTimeStr string,
  ipLayerProtocol int,
  ipLayerProtocolCode string,
  firstSeenSrcIp string,
  firstSeenDestIp string,
  firstSeenSrcPort int,
  firstSeenDestPor int,
  moreFragment int,
  contFragment int,
  durationSecond int,
  firstSeenSrcPayloadByte bigint,
  firstSeenDestPayloadByte bigint,
  firstSeenSrcTotalByte bigint,
  firstSeenDestTotalByte bigint,
  firstSeenSrcPacketCoun int,
  firstSeenDestPacketCoun int,
  recordForceOut int)
row format delimited fields terminated by ',' lines terminated by '\n'
location '/user/hdfs/data/';
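As a quick sanity check (still inside the impala-shell session; this check is my own addition, not part of the original walkthrough), you can inspect the table definition and count the rows Impala sees:

describe logs;
select count(*) from logs;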
Now we have a table called ‘logs’ that contains all of our data. We told Impala that the data is comma separated and where the data files live. That’s it. What I did on my installation is leverage Parquet, a columnar data format, to speed queries up. A lot of analytic queries don’t really suit the row-oriented layout of traditional databases; a columnar layout is much better suited for them. Therefore we are creating a Parquet-based table:
create table pq_logs like logs stored as parquetfile;

insert overwrite table pq_logs select * from logs;
The second command is going to take a bit, as it loads all the data into the new Parquet table. You can now issue queries against the pq_logs table and you will get the benefits of a columnar data store:
select distinct firstseendestpor from pq_logs where morefragment=1;
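As another illustration of the kinds of queries that benefit from the columnar layout (the query below is my own sketch, using the column names from the schema above), here is a simple top-talkers query that sums the payload bytes sent per source IP:

select firstseensrcip, sum(firstseensrcpayloadbyte) as payload_bytes
from pq_logs
group by firstseensrcip
order by payload_bytes desc
limit 10;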
Have a look at my previous blog entry for some more queries against this data.
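One more optional tweak that can help Impala’s query planner: newer Impala releases can gather table statistics with COMPUTE STATS. Treat this as a suggestion rather than part of the original setup:

compute stats pq_logs;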