ETL to QE, Update 68, Thinking Through how a Workflow Engine Works

Scraping Nostr is a little nuanced. Nostr queries are done via JSON filters, similar to MongoDB queries, but with less functionality.

Let's look at an example. Say we want to scrape all the events from npub16kpw53ryqkpc8tufm0zhrae242hl6yp64vamerwaydpwvmt7ddtqkr92d8. We first convert it to a hex string using something like nak.nostr.com or nostr-tools, then throw the hex encoding of that public key into a filter like so:

{
    "authors": ["d582ea4464058383af89dbc571f72aaaaffd103aab3bbc8ddd2342e66d7e6b56"]
}

The problem here is that this filter may not return everything, because that "author" has tens of thousands of events. Therefore we need to sort them.

{
    "authors": ["d582ea4464058383af89dbc571f72aaaaffd103aab3bbc8ddd2342e66d7e6b56"],
    "limit": 1000
}

Adding a limit to a Nostr filter will return the most recent events matching that filter. Then we can use the timestamp to paginate through the Nostr filter. For example, if we run the above filter and the smallest Unix timestamp is 1713073541, we can change the filter to get events "until" (i.e. before) that timestamp. Note: we can also use "since" to get even more specific.

{
    "authors": ["d582ea4464058383af89dbc571f72aaaaffd103aab3bbc8ddd2342e66d7e6b56"],
    "limit": 1000,
    "until": 1713073541
}

This means that when we want to scrape the entirety of a Nostr filter we must apply it recursively, paginating via the timestamps. This leads to a flow like this:

Nostr Filter Scraping Workflow

[Figure: Simple Nostr Scraping Workflow 001.svg]
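
To make that flow concrete, here is a minimal sketch of the pagination loop in TypeScript. It assumes nostr-tools' nip19 helper for the npub-to-hex conversion; fetchEvents is a hypothetical stand-in for however you actually query relays (e.g. a SimplePool), and the page size of 1000 mirrors the filter above.

// Sketch only: walk backwards through one author's events using "until" pagination.
import { nip19 } from 'nostr-tools'

type NostrEvent = { id: string; created_at: number; kind: number; content: string }
type Filter = { authors: string[]; limit: number; until?: number }

// Stand-in for whatever relay query you use (SimplePool, a raw WebSocket, etc.).
declare function fetchEvents(filter: Filter): Promise<NostrEvent[]>

async function scrapeAuthor(npub: string): Promise<NostrEvent[]> {
  // npub -> hex pubkey, the same conversion nak.nostr.com or nostr-tools does
  const decoded = nip19.decode(npub)
  const pubkeyHex = decoded.data as string

  const all: NostrEvent[] = []
  let until: number | undefined

  while (true) {
    const filter: Filter = { authors: [pubkeyHex], limit: 1000 }
    if (until !== undefined) filter.until = until

    const page = await fetchEvents(filter)
    if (page.length === 0) break
    all.push(...page)

    // Next page: everything older than the oldest event we just received.
    // (Events sharing that exact second could be missed; fine for a sketch.)
    const oldest = Math.min(...page.map(e => e.created_at))
    until = oldest - 1
  }
  return all
}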

Oh, we are treating this system like a UTxO, which happens to have a fractal data flow pattern, just like my Workflow Engine.

  • The Data Transformation Step (Rectangles)
  • The Decision Step (Ovals)
  • The Task Assignment Step (Clouds)

One key point to understand is that each step (Rectangle, Oval, or Cloud) should be simple and atomic enough to not require logging statements for debugging. If something requires debugging for whatever reason, you should separate that Activity into multiple Activities.

Writing The Code

As per the beginning of ETL to QE, Update 67, Nostr Scraping via a Custom Workflow Engine in SQL, we are going to scrape all the Profiles off a series of Nostr relays, then get every Nostr user's events off said relays. This will look something like this:

[Figure: Simple Nostr Scraping Workflow Recursive 001.svg]

So things are getting a little more complicated here. The Data Saving step is not as simple as just saving the Nostr Events: we also need to queue up jobs to scrape the stuff we have not scraped yet, and we need to remember that this is a recursive job.

So when dealing with recursive jobs we need to make sure to assign a job to scrape the results, but also the job to index them; we need to assign the entirety of the workflow. Now this is where things can get interesting when we start thinking of using Multiformats.

Right now in my POC, I just assign every Activity a UUID and call it a day. We also have this problem where we don't want to scrape the same data over and over again, therefore we need to be tracking which Activity ran with which inputs. We can have the JSON input hashed via Multiformats and stored in a column; then when we try to add a new job, we can skip adding it because we already scraped the data. I like this idea, it is way easier than trying to match the Nostr Filter using JSON.
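
Here is a minimal sketch of that input-hashing idea, assuming the multiformats JS package; the canonicalization step and the activities/input_hash table and column are illustrative, not the POC's actual schema.

// Sketch only: hash an Activity's JSON input so duplicate jobs can be detected and skipped.
import { CID } from 'multiformats/cid'
import * as json from 'multiformats/codecs/json'
import { sha256 } from 'multiformats/hashes/sha2'

// Sort keys so the same filter hashes identically regardless of key order.
// (A shallow sort is enough for flat filter objects like the ones above.)
function canonicalize(obj: Record<string, unknown>): Record<string, unknown> {
  return Object.fromEntries(Object.entries(obj).sort(([a], [b]) => a.localeCompare(b)))
}

export async function inputHash(input: Record<string, unknown>): Promise<string> {
  const bytes = json.encode(canonicalize(input))
  const digest = await sha256.digest(bytes)
  return CID.create(1, json.code, digest).toString() // a CIDv1 string, fine for a text column
}

// Before queueing a new job, check something like:
//   SELECT 1 FROM activities WHERE input_hash = $1
// and skip the insert if a row already exists (table/column names are hypothetical).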

But this still does not solve the problem of assigning two tasks where one's input is dependent on a result we have not resolved yet. Well, we can treat it like a Nostr thread: everything assigned from the initial thread must reference the initial workflow. So the Workflow can have an ID, the ID is generated from the Input, Activity Name, and Timestamp, and then every Activity spawned from that initial Activity must reference it.
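
A rough sketch of that threading scheme, reusing the inputHash idea from above; the field names here are placeholders rather than the POC's actual columns.

// Sketch only: every spawned Activity keeps pointing at the root workflow ID,
// the way replies in a Nostr thread all reference the root event.
import { randomUUID } from 'node:crypto'

// Assumed to exist, e.g. the Multiformats inputHash() sketched earlier.
declare function inputHash(input: Record<string, unknown>): Promise<string>

interface Activity {
  id: string             // per-Activity ID (a UUID in the current POC)
  rootWorkflowId: string // derived from the initial Input, Activity Name, and Timestamp
  name: string
  input: Record<string, unknown>
}

// The Workflow ID is generated from the initial Input, Activity Name, and Timestamp.
async function startWorkflow(name: string, input: Record<string, unknown>): Promise<Activity> {
  const rootWorkflowId = await inputHash({ name, input, ts: Math.floor(Date.now() / 1000) })
  return { id: randomUUID(), rootWorkflowId, name, input }
}

// Any Activity spawned from that initial Activity must reference the same root.
function spawnChild(parent: Activity, name: string, input: Record<string, unknown>): Activity {
  return { id: randomUUID(), rootWorkflowId: parent.rootWorkflowId, name, input }
}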

Now that's totally rad man. Just imagine that system with human-in-the-loop decision making replacing government. Of course we will need to sign the inputs and outputs of all Activities, and run some of this stuff in a secure enclave, but still, it's totally rad man.

Schema Upgrade Requirements

  • We need to acknowledge that every Activity has multiple outputs (a record sketch follows this list):
    • The Multiformats hash of the input data to the initial Activity
    • The Raw Resulting Data
    • The Data Inserted individually into the Desired Schema(s)
    • The Activities spawned from our Activity
    • The root Activity that spawned our Activity, referenced as a Multiformats hash
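
Put together, a hedged sketch of what one Activity record could look like; the field names are illustrative only.

// Sketch only: one possible shape for an Activity row covering the outputs listed above.
interface ActivityRecord {
  id: string                   // UUID for this Activity
  rootActivityHash: string     // Multiformats hash of the root Activity that spawned this one
  inputHash: string            // Multiformats hash of the JSON input to this Activity
  rawResult: unknown           // the raw resulting data, stored as-is
  insertedRows: string[]       // references to rows inserted into the desired schema(s)
  spawnedActivityIds: string[] // Activities queued as a result of this one
}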

Hey Question

Why don't we also just put those Multiformats hashes inside Nostr Events of our own creation (Kind)? Then we can use a Nostr Event schema for our data rather than coming up with our own schema.

Well that would be very hard to scope out.

Then we will also need to come up with our TLS Signing Key Provenance System, which will also require its own Kind. That needs a separate blog post to scope out; the tech solution is easy, the use cases are hard, please remember that.

Baby steps; the reason we don't do everything at once is that otherwise we never complete the project.

Wow, back when I began ETL to QE I didn't fully realize the gist of it would be a Workflow Engine.

Realization

CGFS is just a Nostr Workflow Engine.

The way the Execution Engine part of the Workflow Engine ought to work is by matching up the dependencies of a package.json file, then having the Job/Activity itself packaged into a single JS file. Wait, don't we already have packages like Webpack?
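
As a rough sketch of that packaging step, here is what it might look like with esbuild's build API (used here as one stand-in for Webpack-style bundling; the paths are hypothetical):

// Sketch only: bundle one Activity, with its package.json dependencies resolved,
// into a single self-contained JS file the Execution Engine can ship around.
import { build } from 'esbuild'

await build({
  entryPoints: ['activities/scrape-profiles.ts'], // hypothetical Activity entry point
  bundle: true,                                   // inline the node_modules dependencies
  platform: 'node',
  format: 'esm',
  outfile: 'dist/scrape-profiles.activity.js',    // the single JS file per Job/Activity
})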