ETL to QE, Update 72, Minimum Viable Workflow Engine for Nostr Scraping
Date: 2025-04-23
Lesson learned: Please don't abstract the thing you are trying to build behind something more complex than the thing itself, before you have actually built the thing
In the last couple of ETL to QE blog posts we figured out two things:
- A SQL Schema to store Nostr Events and which Relays they came from
- The best tool, nostr-fetch, to actually get those nostr events off the nostr relays
Now we need to decide on a pattern for scraping nostr events, and then we need a way of orchestrating the scraping so we can track what has been scraped, when, and where to pick up again after a pause or stop, so we don't scrape the same data over and over again.
The key decision we need to make today: are we going to log the results of every nostr filter somewhere, or are we just going to log which filter ran and how many results were returned? Or we could log which pubkeys were present. How much fidelity do we need when it comes to logs?
Well, since we are going to be inserting into the database every 1,000-10,000 or so events, we can just throw in a log there to say:
- Which filter was being scraped
- The total number of nostr events that have been scraped
- The raw events that have been scraped
Yea, whether or not we include the pubkeys or the raw filter results is not a really important consideration; it will be easy to add, remove, or make it optional to log those things if the code base is designed correctly.
So where does that leave us?
Reviewing the Over Engineered Workflow
TLDR; Developing a Generalized Workflow Engine is more complex than just figuring out how to scrape Nostr itself.
We have Jobs and Tasks, or in other words, Workflows and Activities.
Back in our ETL to QE, Update 68, Thinking Through how a Workflow Engine Works post we thought we had a grasp on things but oh boy were we wrong.
The issues with a Scraping Workflow are:
- Many Tasks/Activities can be recursive
- Modelling Task Dependencies in SQL is Hard and Not Intuitive cause SQL
- Fractal Task/Activity Creation is really hard to correctly scope out
So we are going to need two tables: one with the filters we are currently scraping, which gets updated with the latest log and metadata, and the other with the raw logs, which can include the additional pubkeys or raw nostr events from a filter if we really want to.
Alright, let's develop a little schema and think through how it will get used:
CREATE TABLE IF NOT EXISTS scraped_nostr_filters (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    scraping_status VARCHAR,  -- 'TODO', 'RUNNING', 'COMPLETED'
    num_results INTEGER,
    filter_json JSONB,
    metadata JSONB,
    relay_url VARCHAR,
    since INTEGER,
    until INTEGER,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    finished_at TIMESTAMP
);
CREATE TABLE IF NOT EXISTS nostr_filter_scraping_logs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    scraped_nostr_filter_id UUID,  -- points at scraped_nostr_filters.id
    filter_json JSONB,
    metadata JSONB,
    num_results INTEGER,
    relay_url VARCHAR,
    since INTEGER,
    until INTEGER,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    log_data TEXT
);
That looks simple enough: we've got a table which tracks the filters we are scraping, and another table incrementally ingesting all the data.
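And since the metadata column is JSONB, the optional stuff from earlier, like which pubkeys showed up in a batch, can ride along without any schema change. A rough illustrative sketch (the :filter_id placeholder and the npub values are made up):

INSERT INTO nostr_filter_scraping_logs (scraped_nostr_filter_id, num_results, metadata, log_data)
VALUES (
    :filter_id,                                    -- id of the row in scraped_nostr_filters
    1000,                                          -- size of this batch
    '{"pubkeys_seen": ["npub1aaa", "npub1bbb"]}',  -- optional extra fidelity, easy to drop later
    'batch 1 inserted'
);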
Plan to Ingest Data
First a Problem To Think Out
Logging to the nostr_event_on_relay table which filter received each event would be over engineering, so let's not do that.
The Plan
- We decide on a Filter and a Relay
- We load the Filter into scraped_nostr_filters with scraping_status set to 'TODO'
- We set the scraping_status column of the scraped_nostr_filters row we just inserted to 'RUNNING'
- We then start the scraping script; please note the since and until variables are for nostr-fetch specifically
- Every time the async filter returns over 1000 results we log it to the database
  - We also log each nostr event ID and relay_url to nostr_event_on_relay
- When the filter is complete we UPDATE the scraping_status column of the scraped_nostr_filters row to 'COMPLETED' (the whole lifecycle is sketched in SQL after this list)
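Here is a minimal sketch of that lifecycle in SQL, using the schema above; the filter JSON, relay URL, unix timestamps, and the :filter_id / :total_events placeholders are only illustrative:

-- 1. Queue the filter for scraping
INSERT INTO scraped_nostr_filters (scraping_status, filter_json, relay_url, since, until)
VALUES ('TODO', '{"kinds": [1]}', 'wss://relay.example.com', 1700000000, 1745366400)
RETURNING id;

-- 2. Mark it RUNNING right before the scraping script starts
UPDATE scraped_nostr_filters SET scraping_status = 'RUNNING' WHERE id = :filter_id;

-- 3. Log every batch of ~1000 events the async filter hands back
INSERT INTO nostr_filter_scraping_logs
    (scraped_nostr_filter_id, filter_json, relay_url, since, until, num_results, log_data)
VALUES
    (:filter_id, '{"kinds": [1]}', 'wss://relay.example.com', 1700000000, 1745366400, 1000, 'batch 1 inserted');

-- 4. Mark the filter COMPLETED once nostr-fetch has drained it
UPDATE scraped_nostr_filters
SET scraping_status = 'COMPLETED', num_results = :total_events, finished_at = CURRENT_TIMESTAMP
WHERE id = :filter_id;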
Reflection Compared to Workflow Engine
So what is different between what I just outlined and what I talked about in the Workflow Engine in the following ETL to QE update posts?
- ETL to QE, Update 68, Thinking Through how a Workflow Engine Works
- ETL to QE, Update 69, Workflow Dependency Management
The difference is that I outlined a Job Queue in this project update, not a Workflow Engine.
A Job Queue is simple: you just go to the database, say "Hi, I wanna do this job", and then you do it. A Workflow Engine is more complicated, because when you go to the database you need to ask for a Job that HAS ALL THE DEPENDENCIES ALREADY COMPLETED.
You need to build a Job Queue before you can build a Workflow Engine; or in other words, a Workflow Engine has a Job Queue built in, so you should be building one regardless.
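For what it's worth, the "Hi, I wanna do this job" step maps onto a common Postgres job-queue pattern, and since gen_random_uuid() and JSONB already imply Postgres here, a sketch of the worker claiming a filter could look like this:

-- Claim one 'TODO' filter and flip it to 'RUNNING' in a single statement;
-- FOR UPDATE SKIP LOCKED lets several workers poll without grabbing the same row.
UPDATE scraped_nostr_filters
SET scraping_status = 'RUNNING'
WHERE id = (
    SELECT id
    FROM scraped_nostr_filters
    WHERE scraping_status = 'TODO'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING *;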
What Code Goes Where?
So it looks like we are going to have a little script that can load in "Jobs" and another script that is the "worker".
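Just as a rough sketch of how the SQL above would split between those two scripts:

-- loader script: INSERT the filters we want scraped, with scraping_status = 'TODO'
-- worker script: claim a filter with the SKIP LOCKED query, run nostr-fetch against it,
--                INSERT a nostr_filter_scraping_logs row per ~1000-event batch,
--                then UPDATE scraping_status to 'COMPLETED' and set finished_at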