Equal Code: DuckDB for logs analysis

How we use DuckDB to analyse application logs quickly and effectively.


For all the wonderful log collection that AWS CloudWatch does, boy does it leave a lot to be desired when it comes to querying those logs. Here's a little trick that we use to investigate our logs quickly, easily and locally.

The processing pipeline

Our pipeline for generating logs looks something like this:

  1. our applications log something
  2. the pino formatter turns this "something" into a consistently formatted JSON representation
  3. the logs are collated by AWS CloudWatch into various groups
  4. we download a group over a specific timeframe
  5. we import the logs into a DuckDB file
  6. we use plain SQL to investigate the logs at lightning speed

Steps 1-3 are fairly unremarkable and specific to each application. In this article, I want to talk about steps 4-6, which allow us to perform queries like this with ease:

-- Find the top 20 query traces with the longest mean query time
select trace, mean(time) as mean_time
from logs
group by trace
order by mean_time desc
limit 20

Downloading logs with the AWS CLI

The AWS CLI comes with a convenient way to stream logs straight from log groups. The syntax looks like this:

aws logs tail $group_name ...

Not the most self-explanatory command, and a bit verbose, but there's a lot of power in there once you start using the options provided (a combined example follows the list). These include:

  • --since <timeframe> with fairly human-readable timeframes like 24h
  • --log-stream-name-prefix <prefix> to target specific streams
  • --filter-pattern <pattern> to pre-filter lines and download the minimum amount of information
  • --format short to remove a lot of cruft
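
Putting those together, a typical pull looks roughly like this. The log group name below is a placeholder, and the filter pattern assumes pino's default numeric levels (50 = error); adjust both to your own setup:

aws logs tail /ecs/api-production \
  --since 24h \
  --log-stream-name-prefix api \
  --filter-pattern '{ $.level = 50 }' \
  --format short > errors.logs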

We wrap these up in a small utility that cleans up the lines, and run it like this to fetch recent live logs:

cw live logs --since 3h > recent.logs
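
The cw tool itself is nothing fancy. The download half is essentially a thin wrapper along these lines (a sketch only, not our actual implementation; the log group name is a placeholder, and the sed is there because --format short still prefixes each line with a timestamp that DuckDB's JSON reader shouldn't see):

#!/usr/bin/env bash
# Sketch of a cw-style "fetch logs" wrapper (placeholder log group name).
set -euo pipefail

since="${1:-1h}"

# Pull the lines, then strip the leading timestamp so each line is bare JSON.
aws logs tail "/ecs/live" --since "$since" --format short \
  | sed -E 's/^[^[:space:]]+ //'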

Importing into DuckDB

Here's where the real fun begins. DuckDB is a wonderful file-based columnar database. If you've not come across it before, it's easy (and perhaps a bit reductionist) to think of it as "SQLite, but fast for aggregations".

On top of the speed, there are wonderful quality-of-life improvements that make importing from various file formats a breeze. For our use case, the shining example is the JSON parser, which allows you to import newline-delimited JSON (JSONL) files and automatically create a table from the fields present in the JSON. This is as easy as:

duckdb log.db "install 'json'; load 'json'; create table logs as select * from read_json('$filename', auto_detect=true);"
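
Once that has run, it's worth a quick sanity check that DuckDB inferred the schema you expect from those JSON fields:

duckdb log.db "describe logs; select count(*) from logs;"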

Querying logs

Since we have a SQL database, we can query it just like anything else. DuckDB also provides some great statistical aggregates, including basics like mean, mode, min and max, as well as more sophisticated ones like histogram and stddev_pop.
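
As a quick illustration, reusing the trace and time fields from the earlier query (those names are specific to our pino setup), you can look at the spread of query times rather than just the mean:

-- Find the traces whose query times vary the most
select trace,
       mean(time) as mean_time,
       stddev_pop(time) as stddev_time,
       max(time) as max_time
from logs
group by trace
order by stddev_time desc
limit 20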

We wrap all of this up in our aforementioned cw utility, which does a bit of cleaning on the log input, passes standard configuration to DuckDB, and so on. At the end of the day, our queries look like this:

cw live log-db recent.logs < log_queries/api/longest_queries.sql

Most importantly for us, this shortens the feedback loop when doing exploratory analysis of logs. If you want to tweak a query slightly, or adapt it for a slightly different use case, all you have to do is change the SQL. If you want a REPL to try things out, that's built in too. To share queries, we store them in a repository like everything else.
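
That REPL is just the DuckDB shell: open the database file produced by the import step and run ad-hoc queries at the prompt. A small sketch, assuming the standard numeric level field that pino emits:

duckdb log.db

-- then, at the interactive prompt:
select level, count(*) as entries
from logs
group by level
order by entries desc;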

Simple, easy, effective!