I was playing with database audit logs for a bit to try and visualize some aspects of them. While doing so, I came across a pretty interesting problem. The audit logs contain entries that indicate what exact SQL query was executed. Now, I am not interested in the entire query, but I need to know which tables were touched. I was trying to build some regular expressions to extract that information from the query, but I gave up pretty quickly. It’s just too complicated for a regex. I was wondering whether there is a way to take a SQL query, for example:
select * from a.table1 a, b.tabl2 b join c.table3 on b.id1=c.id2 where a.foo='bar'
and extract all the table names: a.table1, b.table2, c.table3. Are there tools to do that? Remember, I don’t have the database with these tables. I only have a log from some database. The script should support all the SQL perks like joins, nested selects, etc. Anyone have a good way to do this?
You will need a SQL query parser to do this. There is a commercial library from http://www.sqlparser.com to do the parsing. I also looked at some free PERL modules to do the same but they were all not as flexible as the one mentioned above. It gets you the table names, column names, user names, database names, command types etc as long as they are in the query.
Comment by msvenc — October 16, 2007 @ 5:33 am
You can’t- it may be the case that the table names are not in the query. For example, the query may be built off of views or aliases.
There are certain clauses of the SQL language that take one of the table-like entities (tables, views, rowsets, or aliases) as parameters, for example FROM, JOIN, etc. You could look for these and interpret the right-side parameter as a table name.
Also the period “.” is an object name delimiter so you could only take the rightmost part of the name after the last period.
Comment by Eric — October 16, 2007 @ 8:35 am
Eric, that’s what I tried. However, that’s exactly what is very error prone and fairly hard to get entirely right. For now that’s the approach I am using 😉
Comment by Raffael Marty — October 16, 2007 @ 8:57 am
Although the dot format is typical, it can be omitted, if the table is in the FROM clause, or it can be aliased in an AS clause. Sometimes the table and column are in square brackets. IMHO, standard SQL and MS SQL are not always equivalent, as MS always extends in a proprietary-way.
Comment by L Nonamaker — October 17, 2007 @ 11:16 am
Well, there aint many combinations, hence the general reg-ex would suffice for most of the cases. If the accuracy drops , then probably you might want to use a custom built tagger and train the model based on this and extract the table info.
Comment by Venkat — October 21, 2007 @ 8:18 am