Equal Code: DuckDB for logs analysis
How we use DuckDB to analyse application logs quickly and effectively.
For all the wonderful log collection that AWS CloudWatch does, boy does it leave a lot to be desired when querying those logs. Here's a little trick that we use to investigate our logs quickly, easily and locally.
The processing pipeline
Our pipeline for generating logs looks something like this:
- our application(s) logs something
- the pino formatter turns this "something" into a consistently formatted JSON representation (a sample line is shown just after this list)
- the logs are collated by AWS CloudWatch into various groups
- we download a group over a specific timeframe
- we import the logs into a DuckDB file
- we use plain SQL to investigate the logs in lightning time
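For context, every line that reaches CloudWatch is a single JSON object. A purely illustrative database-timing entry, with field names chosen to match the query below rather than copied from our real schema, might look something like:
{"level":30,"msg":"db query complete","trace":"orders.findByUser","time":187}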
Steps 1 - 3 are fairly unremarkable and specific to everyone's different applications. In this article, I want to talk about steps 4 - 6 that allow us to perform queries like this with ease:
-- Find the top 20 database queries with the longest mean query time
select trace, mean(time) as mean_time
from logs
group by trace
order by mean_time desc
limit 20
Downloading logs with the AWS CLI
The AWS CLI comes with a convenient way to stream logs straight from log groups. The syntax looks like this:
aws logs tail $group_name ...
Not so self-explanatory and a bit verbose, but there's a lot of power in there when you start to use the options provided. These include (combined in the example below):
- --since <timeframe> with fairly human-readable timeframes like 24h
- --log-stream-name-prefix to target specific streams
- --filter-pattern <pattern> to pre-filter lines and download the minimum amount of information
- --format short to remove a lot of cruft
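Putting those options together, a raw invocation might look something like this (the log group name is purely illustrative):
aws logs tail /aws/ecs/api-live --since 24h --filter-pattern "ERROR" --format short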
We wrap these up into a small utility that cleans up the lines, and run it like this for, say, recent live logs:
cw live logs --since 3h > recent.logs
Importing into DuckDB
Here's where the real fun begins. DuckDB is a wonderful file-based columnar database. If you've not come across it before, it's easy (and perhaps a bit reductionist) to think of it as "SQLite, but fast for aggregations".
On top of the speed, there are wonderful quality-of-life improvements that make importing from various file formats a breeze. For our use case, the shining example is the JSON parser, which allows you to import JSONL (newline-delimited JSON) files and automatically create a table from the fields present in the JSON. This is as easy as:
duckdb log.db "install 'json'; load 'json'; create table logs as select * from read_json('$filename', auto_detect=true);"
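If you want to sanity-check what the auto-detection inferred, DuckDB's standard describe statement will print the resulting schema (pointing at the same log.db file created above):
duckdb log.db "describe logs;"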
Querying logs
Since we have a SQL database, we can query it just like anything else. DuckDB also provides some great statistical methods, including basics like mean, mode, min and max, as well as more sophisticated ones like histogram and stddev_pop.
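As a quick, hypothetical illustration (again using the trace and time fields from the earlier query), surfacing the most erratic traces might combine a few of those aggregates:
-- Traces whose query times vary the most (illustrative schema)
select trace,
       mean(time)       as mean_time,
       stddev_pop(time) as stddev_time,
       max(time)        as worst_case
from logs
group by trace
order by stddev_time desc
limit 20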
We wrap all of this up in our aforementioned cw utility to do a bit of cleaning on log inputs, pass standard config to DuckDB and so on, and at the end of the day our queries look like:
cw live log-db recent.logs < log_queries/api/longest_queries.sql
Most importantly for us, this makes the feedback loop shorter when doing exploratory analysis of logs. If you want to tweak the query slightly, or adapt it for a slightly different use case, all you have to do is change the SQL. If you want to have a REPL to try things out, that's all built-in. To share queries, we store them in a repository like everything else.
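Under the hood, that REPL is presumably nothing more exotic than DuckDB's own interactive shell pointed at the generated file, which you can also open directly:
duckdb log.db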
Simple, easy, effective!