TL;DR: SQLite has JSON functions, generated columns, and full-text search - which all seem like a perfect mix for ingesting exports from Mastodon and making them searchable!

A mastodon living in a tiny database, based on output generated by Craiyon

SQLite is more capable than I remembered

I've been watching Simon Willison's adventures in SQLite and Datasette for a few years now. I keep meaning to tinker with this stuff myself, but never quite get around to it. Recently, though, a few interesting pieces sort of shuffled into place for me: SQLite has JSON functions, it supports generated columns, and it ships with full-text search via FTS5.

Previously, I'd thought PostgreSQL was where I had to go for things like JSON functions and full-text search - I didn't really expect to find that SQLite was this capable.

While it's true you can do just about anything with a pile of Docker containers, a small project fares much better with an embedded database like SQLite - especially if it's deployed to an environment like Glitch. A SQLite database easily becomes a de facto file format for personal tools.

Mastodon exports are full of JSON

Okay, so that's some cool database stuff. What do I want to put in it?

Well, I've accumulated a few thousand posts between Mastodon instances over the years. Support for search is rather uneven, depending on the Mastodon instance. So, I grabbed some JSON exports from my accounts and thought it might be interesting to make them searchable.

I've tooted the most on three Mastodon instances: mastodon.social, toot.cafe, and hackers.town. I was active for a few months on my own self-hosted instance at toot.lmorchard.com, but I accidentally nuked the database at some point and lost all my content. (I guess I should have grabbed an export earlier!)

An account export from Mastodon takes the form of a compressed tarball with media uploads and several collections of data in JSON. Of particular interest is a file named outbox.json: This contains all the account's posts and boosts, expressed in ActivityStreams 2.0 JSON format with Mastodon extensions. (Notably, not in "ActivityPub format" as tantek recently pointed out.)
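To give a rough idea of the shape the table schema below leans on, here's a trimmed-down sketch of a single entry from the orderedItems array - just the properties I end up extracting, with made-up values:

{
  "id": "https://hackers.town/users/lmorchard/statuses/123456/activity",
  "type": "Create",
  "actor": {
    "name": "Les Orchard",
    "url": "https://hackers.town/@lmorchard",
    "icon": { "url": "https://hackers.town/avatars/example.png" },
    "summary": "<p>bio goes here</p>"
  },
  "object": {
    "url": "https://hackers.town/@lmorchard/123456",
    "published": "2023-02-03T18:53:30Z",
    "summary": null,
    "content": "<p>Hello, world!</p>"
  }
}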

Tinkering with tables

Once I had the data, I tinkered with table schemas and came up with this mess:

CREATE TABLE statuses (
  json TEXT,
  created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
  id TEXT GENERATED ALWAYS AS (json_extract(json, "$.id")) VIRTUAL UNIQUE,
  type TEXT GENERATED ALWAYS AS (json_extract(json, "$.type")) VIRTUAL,
  url TEXT GENERATED ALWAYS AS (json_extract(json, "$.object.url")) VIRTUAL,
  summary TEXT GENERATED ALWAYS AS (json_extract(json, "$.object.summary")) VIRTUAL,
  content TEXT GENERATED ALWAYS AS (json_extract(json, "$.object.content")) VIRTUAL,
  displayName TEXT GENERATED ALWAYS AS (json_extract(json, "$.actor.name")) VIRTUAL,
  publishedAt DATETIME GENERATED ALWAYS AS (json_extract(json, "$.object.published")) VIRTUAL,
  accountUrl TEXT GENERATED ALWAYS AS (json_extract(json, "$.actor.url")) VIRTUAL,
  accountAvatarUrl TEXT GENERATED ALWAYS AS (json_extract(json, "$.actor.icon.url")) VIRTUAL,
  accountName TEXT GENERATED ALWAYS AS (json_extract(json, "$.actor.name")) VIRTUAL
);
CREATE INDEX `statuses_id_publishedAt` on `statuses` (`id`, `publishedAt`);
CREATE INDEX `statuses_summary` on `statuses` (`summary`);
CREATE INDEX `statuses_content` on `statuses` (`content`);

The only columns actually stored on disk for this table are json and created_at - the rest are all virtual generated columns based on properties extracted from the JSON data. Since they're virtual, their values are computed on read rather than stored.

But, two neat things about this: you can create indexes on generated columns (the indexes themselves live on disk, even when the columns are virtual), and you can add new virtual generated columns whenever you want.

Dropping columns is a different story, but being able to add new ones is at least a help for iteration, without resorting to the create-a-new-table-and-copy-the-data dance that SQLite schema changes often require.
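For example, adding another virtual column later is a one-liner - the column name and JSON path here are hypothetical, but the pattern is what matters. (SQLite only allows adding VIRTUAL generated columns this way, not STORED ones.)

ALTER TABLE statuses
  ADD COLUMN inReplyTo TEXT
    GENERATED ALWAYS AS (json_extract(json, '$.object.inReplyTo')) VIRTUAL;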

For full text search, I cobbled this together:

CREATE VIRTUAL TABLE statusesSearch
  USING fts5(id UNINDEXED, summary, content);

CREATE TRIGGER statuses_insert AFTER INSERT ON statuses BEGIN
  INSERT INTO
    statusesSearch (id, summary, content)
  VALUES
    (new.id, new.summary, new.content);
END;

CREATE TRIGGER statuses_delete AFTER DELETE ON statuses BEGIN
  DELETE FROM statusesSearch WHERE id = old.id;
END;

CREATE TRIGGER statuses_update AFTER UPDATE ON statuses BEGIN
  UPDATE statusesSearch
  SET 
    summary = new.summary,
    content = new.content
  WHERE id = new.id;
END;

This sets up a full-text search table, along with some triggers to keep it in sync with the source of the content. Nothing too complicated going on here - although it did take me a while to work out that this seems like the least-bad way to combine virtual generated columns and full-text search. I'm still learning about this stuff, though, so I may yet discover a better approach.
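One alternative I've run across in the FTS5 docs - sketched here, untested against this schema - is an external-content table, which indexes the text without storing a second copy of it, at the cost of slightly fussier triggers:

-- Index summary and content, but read their values back from the statuses table
CREATE VIRTUAL TABLE statusesSearch
  USING fts5(id UNINDEXED, summary, content, content='statuses');

-- These replace the three triggers above in this variant
CREATE TRIGGER statuses_insert AFTER INSERT ON statuses BEGIN
  INSERT INTO statusesSearch (rowid, id, summary, content)
  VALUES (new.rowid, new.id, new.summary, new.content);
END;

CREATE TRIGGER statuses_delete AFTER DELETE ON statuses BEGIN
  -- the special 'delete' command removes the old values from the index
  INSERT INTO statusesSearch (statusesSearch, rowid, id, summary, content)
  VALUES ('delete', old.rowid, old.id, old.summary, old.content);
END;

CREATE TRIGGER statuses_update AFTER UPDATE ON statuses BEGIN
  INSERT INTO statusesSearch (statusesSearch, rowid, id, summary, content)
  VALUES ('delete', old.rowid, old.id, old.summary, old.content);
  INSERT INTO statusesSearch (rowid, id, summary, content)
  VALUES (new.rowid, new.id, new.summary, new.content);
END;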

Ingesting the JSON

With all that in place, I was able to parse outbox.json from an export and iterate through the orderedItems array to just plonk the JSON into the table:

await connection.transaction(async (trx) => {
    for (const item of outbox.orderedItems) {
        await trx("statuses")
          .insert({ json: JSON.stringify(item) })
          .onConflict("id").merge();
    }
});

Of course, this snippet skips over some preamble: parsing the JSON and using Knex.js to create a database connection. It can also help to chunk the items into smaller batches across several transactions.
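Here's a rough sketch of that preamble plus the chunking - the file paths, chunk size, and sqlite3 client config are all placeholder assumptions, not anything special about this project:

import fs from "fs/promises";
import Knex from "knex";

// Open (or create) the SQLite database file
const connection = Knex({
  client: "sqlite3",
  connection: { filename: "./toots.sqlite" },
  useNullAsDefault: true,
});

// Parse outbox.json from the unpacked export
const outbox = JSON.parse(await fs.readFile("./outbox.json", "utf8"));

// Upsert the items in chunks, one transaction per chunk
const CHUNK_SIZE = 100;
const items = outbox.orderedItems;
for (let start = 0; start < items.length; start += CHUNK_SIZE) {
  const chunk = items.slice(start, start + CHUNK_SIZE);
  await connection.transaction(async (trx) => {
    for (const item of chunk) {
      await trx("statuses")
        .insert({ json: JSON.stringify(item) })
        .onConflict("id")
        .merge();
    }
  });
}

await connection.destroy();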

That said, the gist of the operation is to just upsert the JSON into the json column, replacing the existing row whenever a record with the same ID is found. This means the import is idempotent and can be re-run without duplicating data.
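Under the hood, that Knex call boils down to something like SQLite's standard upsert syntax (the exact statement Knex emits may differ a bit):

INSERT INTO statuses (json) VALUES (:json)
  ON CONFLICT (id) DO UPDATE SET json = excluded.json;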

Wrapping many inserts in a transaction is important here: without it, every insert becomes its own transaction, and SQLite pays the full commit overhead - journaling and syncing to disk - for every single row, on top of the usual index maintenance and FTS5 housekeeping. Batching the rows into a handful of transactions amortizes that work, which makes a dramatic difference to the speed of a data import.
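On that note, after a big import you can also tell FTS5 to do its segment-merging housekeeping all at once with the 'optimize' command, rather than letting it happen incrementally during later writes:

INSERT INTO statusesSearch(statusesSearch) VALUES('optimize');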

Searching the toots

Finally, once all this data is shoehorned into the database, I can do full-text searches:

sqlite> .mode line
sqlite> select id from statusesSearch where content match "hello world" limit 10;
     id = https://mastodon.social/users/lmorchard/statuses/55864/activity

     id = https://hackers.town/users/lmorchard/statuses/107073175679835816/activity

     id = https://hackers.town/users/lmorchard/statuses/109802318069508799/activity

     id = https://hackers.town/users/lmorchard/statuses/109763470744160265/activity

It doesn't look like much, but then I can use this in a join or a subquery to relate back to the main table of imported toots:

sqlite> select
  id, publishedAt, url, displayName, accountAvatarUrl,
  json_extract(json, "$.actor.summary") as bio,
  content 
from `statuses`
where `statuses`.`id` in (
    select `id` from `statusesSearch` where `statusesSearch` match "hello world"
)
order by `statuses`.`publishedAt` desc
limit 1;

              id = https://hackers.town/users/lmorchard/statuses/109802318069508799/activity
     publishedAt = 2023-02-03T18:53:30Z
             url = https://hackers.town/@lmorchard/109802318069508799
     displayName = Les Orchard
accountAvatarUrl = https://hackers.town/system/accounts/avatars/000/136/533/original/1a8c651efe14fcd6.png
             bio = <p>he / him; semi-hermit in PDX, USA; tinkerer; old adhd cat dad; serial enthusiast; editor-at-large for <a href="http://lmorchard.com" target="_blank" rel="nofollow noopener noreferrer"><span class="invisible">http://</span><span class="">lmorchard.com</span><span class="invisible"></span></a>; astra mortemque superare gradatim; tootfinder</p>
         content = <p><span class="h-card"><a href="https://infosec.exchange/@Em0nM4stodon" class="u-url mention" rel="nofollow noopener noreferrer" target="_blank">@<span>Em0nM4stodon</span></a></span> Personally, Amazon S3 or GitHub Pages.</p><p>The former, because I've been using it for 17 years since Amazon released it and it only costs me like $10 per year.</p><p>And the latter, because it's also easy (for me) to use git to push out content.</p><p>There are more varied options these days, but I've been too lazy / busy to explore much past a hello-world stage.</p>

Since I'm able to use json_extract() in a select statement, I have access to everything in the imported JSON records - whether or not I thought ahead to define a column or index for any particular property. That feels pretty noSQL-ish to me!

Next steps

From here, I started tinkering with a more complex node.js web app to build a user interface for search queries and a bunch more. (Maybe too much more!) But, I'll save talking about that for future blog posts that I'll hopefully write.

For this post, I just wanted to try jotting down the core ideas of shoehorning a pile of JSON data into a SQLite database. I've got a bunch more to learn about what I can do with this and where I might run into limits. But, in the meantime, this feels like a nicely lightweight way to play with a bunch of my data from the fediverse.