ETL to QE, Update 67, Nostr Scraping via a Custom Workflow Engine in SQL

Date: 2025-04-08

My goal is to scrape a million Nostr events and then use AI and human labeling to make sense of them.

Scraping Nostr

The simplest way to scrape Nostr would be to select a popular Nostr relay, scrape all the Nostr profiles (Nostr event kind 0), then scrape all the events from each profile.

Wow, that's far simpler than what I was initially planning. My initial plan was as follows:
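As a rough sketch of what those two steps look like on the wire, here are the NIP-01 REQ messages a client would send over its WebSocket connection to a relay. The helper names (build_req, events_req_for), the subscription ids, and the limit of 500 are my own placeholders, and the WebSocket connection itself is left out.

```python
import json

def build_req(sub_id, kinds=None, authors=None, limit=None):
    """Build a NIP-01 REQ message: ["REQ", <subscription id>, <filter>]."""
    filt = {}
    if kinds is not None:
        filt["kinds"] = list(kinds)
    if authors is not None:
        filt["authors"] = list(authors)
    if limit is not None:
        filt["limit"] = limit
    return json.dumps(["REQ", sub_id, filt])

# Step 1: ask the relay for profile (kind 0) events.
profiles_req = build_req("profiles", kinds=[0], limit=500)

# Step 2: for each pubkey found in step 1, ask for everything that author published.
def events_req_for(pubkey_hex):
    # Hypothetical subscription id derived from the first 8 hex characters.
    return build_req("events-" + pubkey_hex[:8], authors=[pubkey_hex])
```

Relays usually cap how many events they return per request, so a real scraper would need to page through results, for example with the until field of the filter.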

  • We start with a single NPUB of a popular Nostr user
  • We scrape the user's NIP-05 identity for other relays they use
  • We scrape all of that user's events from every relay they say they publish to
  • We then grab all the
    • events that mention the pubkey using a p tag
    • reactions (NIP-25) to the NPUB
    • replies (NIP-01) to the NPUB
    • followers (NIP-02) of the NPUB
    • badges (NIP-58) awarded to the NPUB
  • We then look at their follow list
  • We add every NPUB we find to a backlog of Nostr users to scrape

Scraping this data produces a fractal pattern. Whenever a user is scraped, it leads to new users and threads to scrape. Whenever a thread is scraped, it may lead to new users to scrape. When we look at who mentions the user we are scraping, that also leads to new users to scrape.
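The backlog idea from my initial plan is essentially a breadth-first crawl. Here is a minimal sketch, assuming a hypothetical fetch_related function that returns every NPUB discovered while scraping one user (follows, mentions, repliers, and so on). The FOLLOWS dictionary is toy data standing in for real relay queries.

```python
from collections import deque

def crawl(seed, fetch_related, max_users=1000):
    """Breadth-first crawl starting from one NPUB, never visiting a user twice."""
    backlog = deque([seed])   # users waiting to be scraped
    seen = {seed}             # everything ever added to the backlog
    scraped = []              # order in which users were scraped
    while backlog and len(scraped) < max_users:
        npub = backlog.popleft()
        scraped.append(npub)
        for other in fetch_related(npub):
            if other not in seen:
                seen.add(other)
                backlog.append(other)
    return scraped

# Toy data (hypothetical): who each user leads us to.
FOLLOWS = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": [],
    "dave": ["alice"],
}

order = crawl("alice", lambda npub: FOLLOWS.get(npub, []))
print(order)  # ['alice', 'bob', 'carol', 'dave']
```

The seen set is what keeps the fractal from looping forever, and max_users is the brake that stops it from trying to scrape all of Nostr.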

What is a Workflow Engine?

A Workflow Engine provides a substrate for orchestrating data pipelines.

A data pipeline is when you grab data from one or more sources and transform it, usually in multiple steps.

Examples of a data pipeline would be:

  • Calculating the expenses of a company at the end of the month
  • Generating and emailing the monthly invoices to charge each member of a gym
  • Sending out payroll to employees and emailing them their pay stubs
  • Using website analytics to track which day of each month last year had the most traffic

A Workflow Engine for Nostr Scraping

No matter what method I choose to scrape Nostr, I need to keep track of what I scraped, how I scraped it, and what else needs to be scraped.

  • I want to make sure I don't repeatedly scrape the same thing
  • I want to track which relay I scraped each Nostr event from
  • I want to make sure to scrape all the events of a Nostr user from all the relays they say they publish to, via Nostr kind 0 and NIP-05
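One way to cover the first two requirements is a table keyed on the pair (event id, relay URL). This is only a sketch using SQLite in memory. The table name event_source, its columns, and the record_event helper are assumptions of mine, not the schema I actually ended up with.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE event_source (
        event_id  TEXT NOT NULL,   -- Nostr event id (hex)
        relay_url TEXT NOT NULL,   -- relay the event was fetched from
        pubkey    TEXT NOT NULL,   -- author of the event
        PRIMARY KEY (event_id, relay_url)
    )
""")

def record_event(event_id, relay_url, pubkey):
    """Record that an event was seen on a relay. Returns True if the pair is new."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO event_source (event_id, relay_url, pubkey) VALUES (?, ?, ?)",
        (event_id, relay_url, pubkey),
    )
    return cur.rowcount == 1

def relays_for_event(event_id):
    """List every relay a given event has been seen on."""
    rows = conn.execute(
        "SELECT relay_url FROM event_source WHERE event_id = ? ORDER BY relay_url",
        (event_id,),
    )
    return [r[0] for r in rows]
```

The composite primary key means the same event can be recorded once per relay, while a repeated scrape of the same relay is silently ignored.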

So far I have figured out how to effectively add, start, and log the results of an Activity. The issues I am having are as follows:

  • Figuring out an elegant way to log events
  • Adding new Activities based on the output of completed Activities
  • Making sure not to mindlessly do the same Activity over and over again
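To make the last two issues concrete, here is one possible shape for an Activity table, sketched with SQLite in memory. Everything in it is hypothetical: the activity table, the kind and target columns, and fake_handler, which stands in for real scraping code. The idea is that a UNIQUE constraint rejects duplicate Activities, and each handler returns the follow-up Activities to enqueue.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE activity (
        id     INTEGER PRIMARY KEY,
        kind   TEXT NOT NULL,                    -- e.g. 'scrape_profile'
        target TEXT NOT NULL,                    -- e.g. an NPUB
        status TEXT NOT NULL DEFAULT 'pending',  -- pending, running, done
        result TEXT,                             -- JSON log of the handler output
        UNIQUE (kind, target)                    -- same Activity is never added twice
    )
""")

def add_activity(kind, target):
    db.execute("INSERT OR IGNORE INTO activity (kind, target) VALUES (?, ?)", (kind, target))

def run_next(handler):
    """Run the oldest pending Activity. Returns False when nothing is pending."""
    row = db.execute(
        "SELECT id, kind, target FROM activity WHERE status = 'pending' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return False
    act_id, kind, target = row
    db.execute("UPDATE activity SET status = 'running' WHERE id = ?", (act_id,))
    follow_ups = handler(kind, target)
    db.execute(
        "UPDATE activity SET status = 'done', result = ? WHERE id = ?",
        (json.dumps(follow_ups), act_id),
    )
    for next_kind, next_target in follow_ups:
        add_activity(next_kind, next_target)  # duplicates are ignored
    return True

def fake_handler(kind, target):
    # Stand-in for real scraping: a profile scrape leads to an event scrape.
    if kind == "scrape_profile":
        return [("scrape_events", target)]
    return []

add_activity("scrape_profile", "npub_example")
add_activity("scrape_profile", "npub_example")  # ignored by the UNIQUE constraint
while run_next(fake_handler):
    pass
```

After the loop, the table holds two rows, both marked done, and the result column doubles as a simple log of what each Activity produced.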