ETL to QE, Update 71, Nostr SQL Over Engineering Complete, Time for Websockets
Date: 2025-04-22
Let's recite what the grand master plan is.
I am interested in two problems.
- How does one put all the data they have personally generated into a single accessible API with RBAC (Role Based Access Control)?
- Once all one's data is easily queryable, what do you do with it?
- Ask yourself, what is Data For?
Let's take another look at the Roadmap:
1. Nostr Scraping Project
2. Nostr CMS
3. Nostr Bot Daemon Framework
4. CGFS Knowledge Graph Specification
5. RBAC LDAP Like Content Addressable Storage System
6. Nostr Token NIP
7. Provenance ETL DAG Framework for Agents
8. Nostr Onion Networking
9. Zero Knowledge DAOs
Axioms of Social Media
- Where did we write about this?
- RBAC CDN Object Storage
- Pub Sub
- Relational Data
Links
I realize the Two Problems and DDaemon Roadmap still lack enough context to actually be marketed to people, and that needs to change.
I've got my Nostr Scraping Project, which will provide an interface to compare how different users have their events spread across different relays, something the Nostr ecosystem currently lacks. That's it, that's the goal. I still have a bunch of shit to do in order to deliver on it.
I Finally have the SQL Schema I am Happy With
Let's start by taking a look at this pristine SQL in all its glory. It's not quite at the "dump raw strings of nostr events into a table and have SQL functions in Postgres do all the validation" stage, but I am happy enough with it to finally move on. The things I cared about are:
- Tracking whether or not the Nostr event was verified
- Having a tags table for tags that are supposed to be indexed
- Indexing all tags separately, for data science purposes
- Tracking which relay, and at which time, the nostr event was downloaded from
- The tag processing is all done in SQL, which I believe is faster but still needs testing; that would make a good blog post for Nostr once I get this blog up there with my own KMS NIP. For me, a KMS needs UUIDs or similar identifiers, not just titles like NIP-54 has.
- My subconscious is telling me that raw TEXT columns may be faster to search through than tags via a JOIN.... but we must move forward to websockets.
CREATE TABLE IF NOT EXISTS normalized_nostr_events (
id VARCHAR NOT NULL UNIQUE,
created_at INTEGER NOT NULL,
kind integer NOT NULL,
pubkey VARCHAR NOT NULL,
sig VARCHAR,
content VARCHAR,
tags VARCHAR,
raw_event VARCHAR,
is_verified BOOLEAN
);
CREATE TABLE IF NOT EXISTS nostr_event_on_relay (
id VARCHAR,
relay_url VARCHAR,
last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
CONSTRAINT fk_nostr_event_relay_metadata
FOREIGN KEY (id)
REFERENCES normalized_nostr_events (id)
);
CREATE TABLE IF NOT EXISTS nostr_event_tags (
id VARCHAR,
first_tag VARCHAR,
tags JSONB,
CONSTRAINT fk_nostr_event_tags
FOREIGN KEY (id)
REFERENCES normalized_nostr_events (id)
ON DELETE CASCADE,
CONSTRAINT unique_nostr_event_tags UNIQUE (id, first_tag, tags)
);
CREATE TABLE IF NOT EXISTS non_standard_nostr_event_tags (
id VARCHAR,
first_tag VARCHAR,
tags JSONB,
CONSTRAINT fk_non_standard_nostr_event_tags
FOREIGN KEY (id)
REFERENCES normalized_nostr_events (id)
ON DELETE CASCADE,
CONSTRAINT unique_non_standard_nostr_event_tags UNIQUE (id, first_tag, tags)
);
CREATE OR REPLACE FUNCTION insert_nostr_event_tags()
RETURNS TRIGGER AS $$
DECLARE
item JSONB;
first_tag_extracted VARCHAR;
BEGIN
-- Loop through the JSONB array 'tags' from the NEW record
FOR item IN
SELECT jsonb_array_elements(NEW.raw_event::jsonb->'tags') AS tag
LOOP
-- Check if the tag matches the pattern
first_tag_extracted := item::jsonb->>0;
IF first_tag_extracted ~ '^[A-Za-z]{1,2}$' THEN
-- Insert into nostr_event_tags
INSERT INTO nostr_event_tags (
id,
first_tag,
tags
) VALUES (
NEW.id,
first_tag_extracted, -- Insert the tag directly
item::jsonb
) ON CONFLICT (id, first_tag, tags) DO NOTHING;
ELSE
-- Insert into non_standard_nostr_event_tags
INSERT INTO non_standard_nostr_event_tags (
id,
first_tag,
tags
) VALUES (
NEW.id,
first_tag_extracted, -- Insert the tag directly
item
) ON CONFLICT (id, first_tag, tags) DO NOTHING;
END IF;
END LOOP;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER run_insert_nostr_event_tags_on_normalized_nostr_events
AFTER INSERT ON normalized_nostr_events
FOR EACH ROW
EXECUTE PROCEDURE insert_nostr_event_tags();
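To sanity check how the trigger routes things, here is a minimal sketch of exercising it from Node with the pg client; the connection string and the sample event are made up, not anything from the real scraper:
import pg from "pg";
const client = new pg.Client("postgres://localhost/nostr"); // hypothetical connection string
await client.connect();
// A made-up event: one standard single-letter tag ("p") and one non-standard tag ("client")
const rawEvent = {
  id: "a".repeat(64),
  pubkey: "b".repeat(64),
  created_at: 1713744000,
  kind: 1,
  tags: [["p", "c".repeat(64)], ["client", "my-scraper"]],
  content: "hello nostr",
  sig: "d".repeat(128),
};
// Inserting the raw event is enough; the AFTER INSERT trigger splits the tags out
await client.query(
  `INSERT INTO normalized_nostr_events (id, created_at, kind, pubkey, sig, content, tags, raw_event, is_verified)
   VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)`,
  [rawEvent.id, rawEvent.created_at, rawEvent.kind, rawEvent.pubkey, rawEvent.sig,
   rawEvent.content, JSON.stringify(rawEvent.tags), JSON.stringify(rawEvent), false]
);
// The "p" tag should land in nostr_event_tags, the "client" tag in non_standard_nostr_event_tags
const indexed = await client.query("SELECT first_tag, tags FROM nostr_event_tags WHERE id = $1", [rawEvent.id]);
const nonStandard = await client.query("SELECT first_tag, tags FROM non_standard_nostr_event_tags WHERE id = $1", [rawEvent.id]);
console.log(indexed.rows, nonStandard.rows);
await client.end();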
Next Step is mastering Websockets
I'll be honest, I feel a little dense. The Nostr spec is so simple: you just open a websocket to a Nostr relay, send some Nostr filter JSON down the pipe with a subscription ID, start getting back events for that filter under the same ID, and then close the subscription when you are done. Like FFS, we can read the spec, NIP-01, which specifically explains how to send and receive messages to and from nostr relays.
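For reference, the whole dance in raw form is something like this; a minimal sketch assuming the ws package, with the subscription id and the filter picked arbitrarily:
import WebSocket from "ws";
const relay = new WebSocket("wss://relay.mememaps.net"); // placeholder relay
const subId = "my-sub-1"; // arbitrary subscription id, we pick it
relay.on("open", () => {
  // ["REQ", <sub id>, <filter>] starts a subscription for events matching the filter
  relay.send(JSON.stringify(["REQ", subId, { kinds: [1], limit: 10 }]));
});
relay.on("message", (data) => {
  const [type, id, payload] = JSON.parse(data.toString());
  if (type === "EVENT" && id === subId) {
    console.log(payload.id, payload.content); // an event matching our filter
  } else if (type === "EOSE" && id === subId) {
    // relay has sent everything it has stored; close the subscription and the socket
    relay.send(JSON.stringify(["CLOSE", subId]));
    relay.close();
  }
});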
Then we can take a look in nostr-tools; here is the code that actually launches the subscription. NGL, it took me like 20 minutes to find that because there is so much damn abstraction in there.
Like my dude (I am talking to myself), it really shouldn't be that hard. I already have a working POC of nostr over raw websockets; here is the code where I figured out how to do that.
But then I went and developed this absolutely boneheaded scraper that opens and closes the websocket connection for every filter it fetches data with. Here's the code. It is genuinely slow, but at least I am proud that I wrote my own POC nostr filter iterator to make sure I don't miss any events.
After seeing how slow my POC was, I had the fantastic idea to use nosdump to scrape data to the file system and then load it into Postgres. nosdump would have its own nostr filter iterator, which would work better than my nostrGet function, which was opening and closing a websocket EVERY SINGLE TIME it queried a relay with a filter. I actually finished that and got it working in these files: nosdump-scrape.js and nosdump-ingest2.js.
Then I had the fantastic realization that, rather than download the nostr events to the file system, I could, you know, load the filter's data into Postgres directly. I could technically use nosdump and feed the data into Postgres somehow using a unix pipe, but I don't know how to use those well. So let's go take a look at how nosdump gets all those nostr events.
Oh WTF, turns out nosdump is just a wrapper around nostr-fetch. Oh boy do I feel stupid now. If I had just been smart enough to Stand on The Shoulders of Giants and take a look at how the tools I ALREADY USED did their thing, I would not have wasted so much time developing my own Nostr Scraper Filter Iterator. Here is the code where the nostr-fetch logic gets called.
How to Scrape
As a reminder, we are not mindlessly scraping everything with an empty filter, and there are theoretical problems with doing that anyway. Technically speaking, most relays cap the limit filter at 100 or 500 results, so if more than 100 or 500 events happened within a couple of seconds of your filter window, you have no way to paginate to those nostr events and will end up missing them when scraping.
The solution I have come up with is: scrape all the Nostr profiles (kind 0), then iterate over all of each user's events using timestamps, then iterate over all the threads they have interacted with (e tag), then all the mentions of that user (p tag); roughly, the filters for each stage are sketched below.
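Each stage is just a different NIP-01 filter shape (the pubkey and event id here are placeholders):
// Stage 1: all profiles
const profileFilter = { kinds: [0] };
// Stage 2: everything a given author has published, walked backwards in time
const authorFilter = { authors: ["<pubkey>"], until: 1713744000, limit: 500 };
// Stage 3: every event in a thread that author touched (e tag)
const threadFilter = { "#e": ["<event id>"] };
// Stage 4: every mention of that author (p tag)
const mentionFilter = { "#p": ["<pubkey>"] };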
Well scraping everything with that procedure is probably two or three orders of magnitude more complex than just scraping the raw filter.
If I were a betting man, I would bet every time on the system that is two or three orders of magnitude simpler rather than the complex one. Plus, we have a simple fact we need to stop ignoring: I need a filter scraping iterator before I can even start the Nostr Kind 0 -> scrape pubkey -> scrape event tags -> scrape pubkey tags procedure. One project literally fucking encapsulates the other. Fuck, I feel stupid sometimes. Now let's map out how to scrape nostr with provenance (tracking which relay had which notes and what filter I got them from).
Okay, maybe tracking which filter I got them from is a bit too much; that's literally an order of magnitude more complex. We want to track what filters we have scraped and how many events were returned.
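Something dumb along these lines would probably cover it; a minimal sketch assuming a hypothetical scrape_log table (not part of the schema above) and the pg client:
// Hypothetical provenance logger: records which filter was scraped against which relay,
// how many events came back, and when.
// Assumes a scrape_log(relay_url, filter, event_count, scraped_at) table exists.
async function logScrape(client, relayUrl, filter, eventCount) {
  await client.query(
    `INSERT INTO scrape_log (relay_url, filter, event_count, scraped_at)
     VALUES ($1, $2, $3, NOW())`,
    [relayUrl, JSON.stringify(filter), eventCount]
  );
}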
Nostr Scraping Plan
Well, nostr-fetch seems pretty easy to use; check out this dead simple script:
import { NostrFetcher } from "nostr-fetch";
const nHoursAgo = (hrs) =>
Math.floor((Date.now() - hrs * 60 * 60 * 1000) / 1000);
const fetcher = NostrFetcher.init();
const relayUrls = [ "wss://relay.mememaps.net" ];
// fetches all events since 24 hr ago in a streaming manner
const postIter = fetcher.allEventsIterator(
relayUrls,
/* filter (kinds, authors, ids, tags) */
{ },
/* time range filter (since, until) */
{ since: nHoursAgo(24) },
/* fetch options (optional) */
{ sort: true }
);
for await (const ev of postIter) {
console.log(ev.content);
}
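And since the whole point was to skip the file system and load straight into Postgres, here is a rough sketch of wiring that iterator into normalized_nostr_events with the pg client; the connection string, relay choice, and the do-nothing conflict handling are my assumptions, not settled design:
import { NostrFetcher } from "nostr-fetch";
import pg from "pg";
const nHoursAgo = (hrs) =>
Math.floor((Date.now() - hrs * 60 * 60 * 1000) / 1000);
const client = new pg.Client("postgres://localhost/nostr"); // hypothetical connection string
await client.connect();
const fetcher = NostrFetcher.init();
const relayUrl = "wss://relay.mememaps.net";
// Stream events from the last 24 hours straight into Postgres;
// the AFTER INSERT trigger takes care of splitting out the tags.
const events = fetcher.allEventsIterator([relayUrl], {}, { since: nHoursAgo(24) }, {});
for await (const ev of events) {
  await client.query(
    `INSERT INTO normalized_nostr_events (id, created_at, kind, pubkey, sig, content, tags, raw_event, is_verified)
     VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)
     ON CONFLICT (id) DO NOTHING`,
    [ev.id, ev.created_at, ev.kind, ev.pubkey, ev.sig, ev.content,
     JSON.stringify(ev.tags), JSON.stringify(ev), false]
  );
  // Provenance: remember which relay we saw this event on
  await client.query(
    `INSERT INTO nostr_event_on_relay (id, relay_url) VALUES ($1, $2)`,
    [ev.id, relayUrl]
  );
}
fetcher.shutdown();
await client.end();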
What's great about nostr-fetch is that it works both as a streaming async iterator and as a plain fetch-everything call. So yeah, nostr-fetch is the way to go, but now we need to figure out how to schedule and log which nostr filters get scraped and when; that's the next blog post.