Equal Code: DuckDB for logs analysis
How we use DuckDB to analyse application logs quickly and effectively.
For all the wonderful log collection that AWS CloudWatch does, boy does it leave a lot to be desired when querying those logs. Here's a little trick that we use to investigate our logs quickly, easily and locally.
The processing pipeline
Our pipeline for generating logs looks something like this:
- our application(s) logs something
- the pino formatter turns this "something" into a consistently formatted, JSON representation
- the logs are collated by AWS CloudWatch into various groups
- we download a group over a specific timeframe
- we import the logs into a DuckDB file
- we use plain SQL to investigate the logs in lightning time
Steps 1 - 3 are fairly unremarkable and specific to everyone's different applications. In this article, I want to talk about steps 4 - 6 that allow us to perform queries like this with ease:
-- Find the 20 traces with the longest mean query time
select trace, mean(time) as mean_time
from logs
group by trace
order by mean_time desc
limit 20
Downloading logs with the AWS CLI
The AWS CLI comes with a convenient way to stream logs straight from log groups. The syntax looks like this:
aws logs tail $group_name ...
Not so self-explanatory and a bit verbose, but there's a lot of power in there when you start to use the options provided. These include:
- --since <timeframe> with fairly human-readable timeframes like 24h
- --log-stream-name-prefix to target specific streams
- --filter-pattern <pattern> to pre-filter lines and download the minimum amount of information
- --format short to remove a lot of cruft
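Put together, a single download for an investigation might look something like the following. The log group name, stream prefix and filter pattern here are placeholders rather than anything from our setup:

# hypothetical log group and stream prefix -- substitute your own
aws logs tail /aws/ecs/api-live \
  --since 24h \
  --log-stream-name-prefix api \
  --filter-pattern '"trace"' \
  --format short \
  > api.logs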
We wrap these up in a small utility that cleans up the lines, and run it like this for recent live logs:
cw live logs --since 3h > recent.logs
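The cw utility itself is internal, but a minimal sketch of that subcommand, assuming the environment name maps onto a log group and that the clean-up just strips the timestamp prefix that --format short leaves in front of each JSON line, might look like:

#!/usr/bin/env bash
# Hypothetical sketch of "cw <env> logs" -- the group naming and clean-up here are assumptions
set -euo pipefail

env_name="$1"       # e.g. "live"
shift 2             # drop "<env>" and "logs", keep the remaining flags for aws logs tail

group="/aws/ecs/${env_name}"   # assumed mapping from environment to log group

# Stream the group and strip the leading timestamp so each line is pure pino JSON
aws logs tail "$group" --format short "$@" | sed -E 's/^[^ ]+ +//'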
Importing into DuckDB
Here's where the real fun begins. DuckDB is a wonderful file-based columnar database. If you've not come across it before, it's easy (and perhaps a bit reductionist) to think of it as "SQLite, but fast for aggregations".
On top of the speed, there are wonderful quality-of-life improvements that make importing from various file formats a breeze. For our use case, the shining example is the JSON parser, which allows you to import JSONL (newline-delimited JSON) files and automatically create a table from the fields present in the JSON. This is as easy as:
duckdb log.db "install 'json'; load 'json'; create table logs as select * from read_json('$filename', auto_detect=true);"

Querying logs
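Before doing anything clever, it's worth a quick check of what the auto-detection produced. The exact columns will depend on what your pino configuration emits, but describing the table shows them all:

duckdb log.db "describe logs;"
duckdb log.db "select count(*) from logs;"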
Since we have a SQL database, we can query it just like anything else. DuckDB also provides some great statistical functions, including basics like mean, mode, min and max, as well as more sophisticated ones like histogram and stddev_pop.
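As a hedged illustration of what that looks like in practice, here's the sort of aggregate we might run; the route, status and time fields are assumptions about the log shape rather than anything pino guarantees:

duckdb log.db <<'SQL'
-- Response-time spread per route (field names are illustrative)
select route,
       count(*)                    as requests,
       mean(time)                  as mean_ms,
       approx_quantile(time, 0.95) as p95_ms,
       stddev_pop(time)            as stddev_ms
from logs
where status >= 200
group by route
order by p95_ms desc
limit 10;
SQL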
We wrap all of this up in our aforementioned cw utility, which does a bit of cleaning on log inputs, passes standard config to DuckDB and so on. At the end of the day, our queries look like:
cw live log-db recent.logs < log_queries/api/longest_queries.sql
Most importantly for us, this makes the feedback loop shorter when doing exploratory analysis of logs. If you want to tweak the query slightly, or adapt it for a slightly different use case, all you have to do is change the SQL. If you want to have a REPL to try things out, that's all built-in. To share queries, we store them in a repository like everything else.
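The REPL is just the DuckDB shell pointed at the same file, so anything in the stored queries can be pasted in and tweaked on the spot (the level filter below assumes pino's default numeric levels, where 50 and above means error):

duckdb log.db
# then, at the D prompt, e.g.
#   select count(*) from logs where level >= 50;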
Simple, easy, effective!