Query Approaches for Personal Knowledge

This document surveys query architectures for personal data operations. Query capability directly impacts P3 (Semantic Richness) and P9 (Performance Pragmatism) - a core tension identified in our analysis.


The Query Challenge in Personal Data

Core Tension: Flexibility vs Performance (identified in principles analysis)

  • Rich semantic queries require complex data models (slow)
  • Fast queries require simple models (limited expressiveness)

Requirements Implicated:

  • R3: Semantic query support (not just keyword matching)
  • R14: Graph traversal and pattern detection
  • R17: Non-obvious connection discovery
  • R30: Surface knowledge gaps
  • R31: Progression queries ("show my development on X")
  • R50: Full-text search across heterogeneous content

Principles:

  • P3 (Semantic Richness): Meaning must be explicit and queryable
  • P9 (Performance Pragmatism): Must scale to decades of data


Full-Text Search

Description: Keyword-based search using an inverted index. The foundation of most search engines.

How It Works:

  1. Build inverted index: term → documents containing term
  2. Query breaks into terms
  3. Look up terms in index
  4. Rank results by relevance (TF-IDF, BM25)

Query Examples:

"distributed systems CRDT"
"quantum NOT mechanics"
title:"event sourcing"
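
The four steps above can be sketched in a few lines of Python. This is a minimal illustration using AND-matching over an inverted index, without the TF-IDF/BM25 ranking a real engine would add:

```python
from collections import defaultdict

def build_index(docs):
    """Step 1: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Steps 2-3: split the query into terms, intersect their posting sets."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Usage: `search(build_index({"a": "distributed systems and CRDT notes"}), "distributed CRDT")` returns `{"a"}`. Step 4 (relevance ranking) would score each surviving document instead of returning a bare set.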

Principle Alignment:

  • Supports P9 (Performance) - very fast with proper indexing
  • Weakens P3 (Semantic Richness) - keyword only, no relationships
  • Excellent for R50 (full-text across content types)

Requirements Addressed:

  • R50 (Full-text search) - primary solution
  • R4 (Time-travel) - can filter by date

Requirements Not Addressed:

  • R3 (Semantic queries) - no understanding of meaning
  • R14 (Graph traversal) - no relationship queries
  • R17 (Non-obvious connections) - can't find implicit links

Strengths:

  • Very fast (O(log n) with indexing)
  • Scales to millions of documents
  • User-familiar (everyone knows keyword search)
  • Works across heterogeneous content

Weaknesses:

  • No semantic understanding
  • Can't query relationships
  • Relevance ranking is heuristic
  • No inference or reasoning

Implementations:

  • Elasticsearch, Solr (distributed search)
  • Tantivy (Rust, used by some PKM tools)
  • SQLite FTS (lightweight, embedded)
  • Lunr.js (client-side JavaScript)

Use for Personal Data Ops: Essential baseline. Every system needs full-text search. But insufficient alone for semantic richness.


Graph Traversal Queries

Description: Queries that "walk" relationships between nodes. Native to graph databases.

Query Languages:

  • Cypher (Neo4j)
  • Gremlin (TinkerPop, property graphs)
  • SPARQL (RDF graphs)

Query Examples (Cypher):

// Find all mnemegrams within 2 hops of "Event Sourcing"
MATCH (start:Mnemegram {title: "Event Sourcing"})-[*1..2]-(related)
RETURN related

// Find paths between two concepts
MATCH path = shortestPath(
  (a:Mnemegram {title: "CRDT"})-[*]-(b:Mnemegram {title: "Git"})
)
RETURN path

// Find clustering (highly interconnected groups)
CALL gds.louvain.stream('knowledge-graph')
YIELD nodeId, communityId
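
The first Cypher query above (everything within two hops) reduces to a breadth-first walk. A stdlib-only sketch over a plain adjacency dict - no graph database assumed, and directed edges for simplicity where Cypher's `-[*1..2]-` is undirected:

```python
from collections import deque

def within_hops(graph, start, max_hops):
    """Return nodes reachable from `start` in at most `max_hops` edges,
    excluding `start` itself - a BFS analogue of Cypher's [*1..max_hops]."""
    seen = {start: 0}  # node -> hop distance
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # don't expand past the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    seen.pop(start)
    return set(seen)
```

For an undirected knowledge graph, store each link in both adjacency lists before traversing.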

Principle Alignment:

  • Strongly supports P3 (Semantic Richness) - relationships are first-class
  • Good performance for P9 - graph databases optimized for traversal
  • Excellent for R14 (graph traversal), R17 (non-obvious connections)

Requirements Addressed:

  • R14 (Graph traversal and pattern detection) - designed for this
  • R17 (Non-obvious connection discovery) - pathfinding algorithms
  • R3 (Semantic queries) - can query by relationship type

Requirements Challenged:

  • R50 (Full-text across content) - requires separate text index

Strengths:

  • Natural for knowledge graphs
  • Expressive relationship queries
  • Pathfinding algorithms built-in
  • Pattern matching
  • Community detection, centrality measures

Weaknesses:

  • Requires graph data model
  • Query complexity can explode
  • Performance degrades with very large graphs (millions of edges)
  • Different query language per graph database

Implementations:

  • Neo4j (property graph, Cypher) - most mature
  • RDF triple stores (Blazegraph, Virtuoso) - SPARQL
  • Graph libraries (NetworkX, igraph) - not databases, in-memory

Use for Personal Data Ops: Essential for semantic richness. Enables queries like:

  • "Show me all concepts related to X within 3 degrees"
  • "What connects my work on A to my interest in B?"
  • "Which concepts am I underexploring?" (low connection density)

Challenge: Need to maintain graph structure. Links must be explicit.


Semantic/SPARQL Queries (RDF)

Description: Query RDF triples using SPARQL. Maximum semantic expressiveness.

Query Example (SPARQL):

PREFIX memex: <http://example.org/memex#>
PREFIX dc: <http://purl.org/dc/terms/>

# Find mnemegrams about CRDT with provenance
SELECT ?mnemegram ?source ?date WHERE {
  ?mnemegram memex:about <#CRDT> .
  ?mnemegram dc:source ?source .
  ?mnemegram dc:created ?date .
  FILTER (?date > "2023-01-01"^^xsd:date)
}
ORDER BY ?date

Principle Alignment:

  • Maximum P3 (Semantic Richness) - RDF enables formal reasoning
  • Poor P9 (Performance) - SPARQL is slow at scale
  • Excellent for P6 (Interoperability) - RDF is standard

Requirements Addressed:

  • R3 (Semantic queries) - most expressive option
  • R2 (Provenance chains) - RDF excels at provenance
  • R11 (Relation preservation) - RDF triples are universal

Requirements Challenged:

  • R9 (Performance) - SPARQL poor at scale
  • R50 (Full-text) - need separate text index

Strengths:

  • Maximally expressive
  • Formal semantics (can reason, infer)
  • W3C standard (interoperable)
  • Can query across federated datasets
  • Rich existing vocabularies (schema.org, FOAF)

Weaknesses:

  • Very slow (SPARQL optimization is hard)
  • Steep learning curve
  • Verbose (RDF is wordy)
  • Requires semantic modeling upfront

System Performance: Solid (using SPARQL) scores 0/2 on P9 (Performance) despite 2/2 on P3 (Semantic Richness). This is the expressiveness/performance tradeoff.

Use for Personal Data Ops: Consider when:

  • Already using RDF (e.g., Solid pods)
  • Formal reasoning needed
  • Interoperability critical
  • Willing to accept performance cost
  • Data volume is modest (< 100k triples)

Not recommended when:

  • Performance is priority
  • Real-time queries needed
  • Users unfamiliar with SPARQL


Vector Similarity Search

Description: Semantic search using dense vector embeddings. Finds similar content without exact keyword matches.

How It Works:

  1. Generate embedding for each mnemegram (OpenAI, sentence transformers)
  2. Store embeddings in vector database
  3. Query by converting query to embedding
  4. Find nearest neighbors in vector space (cosine similarity)

Query Examples:

# Illustrative pseudocode: `vector_db` and `embed` stand in for a real
# vector-store client and embedding model
# Find notes similar to this one
similar = vector_db.search(
    embedding=embed("My note about event sourcing"),
    k=10
)

# Semantic query without exact keywords
results = vector_db.search(
    embedding=embed("How do distributed systems handle time?"),
    k=5
)
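
Under the hood, the nearest-neighbor step is cosine similarity over vectors. A brute-force sketch - production systems replace the linear scan with an approximate index such as HNSW, and the toy 2-dimensional vectors stand in for real embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def knn(store, query_vec, k):
    """Rank stored {id: vector} entries by similarity to the query, keep top k."""
    ranked = sorted(store.items(),
                    key=lambda item: cosine(item[1], query_vec),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```

The brute-force scan is O(n) per query, which is workable for a personal corpus of tens of thousands of notes; HNSW matters at larger scale.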

Principle Alignment:

  • Supports P3 (Semantic Richness) - semantic similarity, not keywords
  • Moderate P9 (Performance) - fast with proper indexing (HNSW)
  • Excellent for R3 (semantic queries), R17 (non-obvious connections)

Requirements Addressed:

  • R3 (Semantic query support) - finds conceptually related content
  • R17 (Non-obvious connection discovery) - "similar but not linked"
  • R30 (Surface knowledge gaps) - clustering can reveal underexplored areas

Requirements Challenged:

  • R2 (Provenance) - embeddings don't capture provenance
  • R14 (Graph traversal) - similarity is not same as explicit relationships

Strengths:

  • Semantic understanding without explicit modeling
  • Finds similar content even if different wording
  • Multilingual (embeddings work across languages)
  • Can combine with keyword search (hybrid)

Weaknesses:

  • Requires ML model (embedding generation)
  • Embeddings are opaque (hard to explain why similar)
  • Recalculate embeddings when content changes
  • Quality depends on embedding model

Implementations:

  • Pinecone, Weaviate (managed vector databases)
  • FAISS (Facebook AI, local library)
  • pgvector (PostgreSQL extension)
  • Qdrant (open source vector database)

Use for Personal Data Ops: Powerful complement to other approaches. Enables:

  • "Find notes similar to this, even if different terms"
  • Semantic clustering (group related concepts)
  • Anomaly detection (what doesn't fit anywhere?)

Challenge: Embeddings must be regenerated as knowledge evolves, and generating them carries a compute (or API) cost.


Temporal/Time-Travel Queries

Description: Queries that ask "what did I know when?" Essential for P2 (Temporal Integrity).

Query Examples:

-- What did I know about X on date Y?
SELECT * FROM mnemegrams
WHERE about = 'CRDT'
  AND created_at <= '2024-01-01'

-- How did my understanding of X evolve?
SELECT created_at, content FROM mnemegrams
WHERE about = 'event-sourcing'
ORDER BY created_at

-- What changed this week?
SELECT * FROM mnemegrams
WHERE updated_at >= NOW() - INTERVAL '7 days'

Principle Alignment:

  • Essential for P2 (Temporal Integrity)
  • Supports P12 (Provenance Traceability)
  • Enables R4 (time-travel views), R31 (progression queries)

Requirements Addressed:

  • R4 (Time-travel views) - primary solution
  • R1 (Temporal ordering) - queries respect time
  • R31 (Show progression on X) - evolution queries

Implementation Approaches:

A. Temporal Databases:

  • SQL:2011 temporal extensions
  • Datomic (point-in-time queries built-in)
  • Store valid-time and transaction-time

B. Event Sourcing:

  • Replay events up to timestamp
  • Natural time-travel
  • atproto uses this (commit history)

C. Versioning Layer:

  • Git-style snapshots
  • Query specific commit/version
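
Approach B amounts to replaying an ordered event log up to a cutoff. A minimal sketch, assuming a flat list of (timestamp, key, value) set-events rather than a full event-sourcing framework:

```python
def state_at(events, cutoff):
    """Rebuild key->value state as of `cutoff` by replaying events in time
    order. Each event is a (timestamp, key, value) tuple; later events
    overwrite earlier ones, and events after the cutoff are ignored."""
    state = {}
    for ts, key, value in sorted(events):
        if ts > cutoff:
            break  # events are sorted, so everything past here is too late
        state[key] = value
    return state
```

"What did I know about X on date Y?" then becomes `state_at(events, Y).get(X)`. The storage-overhead weakness below is visible here too: the full event list must be retained forever.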

Strengths:

  • Essential for reflection (T7)
  • Enables provenance queries
  • Audit trail intrinsic

Weaknesses:

  • Storage overhead (keep all history)
  • Query complexity (need to specify time)
  • Indexes must be temporal-aware

Use for Personal Data Ops: Non-negotiable for systems addressing GAP-1 (Temporal Integrity). Should be combined with other query approaches.


Spatial/Geospatial Queries

Description: Queries based on location. "Where was I when X?"

Query Examples (PostGIS):

-- Find mnemegrams created within 1km of location
SELECT * FROM mnemegrams
WHERE ST_DWithin(
    location::geography,
    ST_MakePoint(-122.4194, 37.7749)::geography,  -- San Francisco
    1000  -- meters; needs geography casts, else the unit is degrees
)

-- What was I thinking about in this city?
SELECT * FROM mnemegrams
WHERE ST_Contains(
    (SELECT boundary FROM cities WHERE name = 'Tokyo'),
    location
)
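
Without PostGIS, the ST_DWithin radius check is just a great-circle distance comparison. A haversine sketch in plain Python - coordinates in degrees, radius in meters, spherical-Earth approximation:

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, meters

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def within_radius(rows, lat, lon, meters):
    """Filter (id, lat, lon) rows to those within `meters` of the point."""
    return [rid for rid, rlat, rlon in rows
            if haversine_m(rlat, rlon, lat, lon) <= meters]
```

A linear scan like this is fine for personal-scale data; a spatial index (R-tree, S2, H3) is what makes the same check fast at volume.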

Principle Alignment:

  • Supports P13 (Heterogeneous Integration) - location as first-class data
  • Moderate P9 (Performance) - spatial indexes efficient

Requirements Addressed:

  • R71 (Geospatial indexing and query)
  • R72 (Temporal indexing) - combine with time for "where/when"
  • R73 (Entity tracking - places)

Implementations:

  • PostGIS (PostgreSQL extension)
  • S2 geometry (Google)
  • H3 (Uber's hexagonal grid)
  • GeoJSON + turf.js (JavaScript)

Use for Personal Data Ops: Relevant for:

  • Location-aware capture (UC-16)
  • Travel journals
  • Context: "what was I thinking about in that place?"

Challenge: Location privacy (R74). Geospatial data is sensitive.


Aggregation and Analytics Queries

Description: Queries that compute over collections. "What are my patterns?"

Query Examples:

-- Most referenced concepts
SELECT referent, COUNT(*) as count
FROM relations
GROUP BY referent
ORDER BY count DESC

-- My output over time
SELECT DATE_TRUNC('month', created_at) as month,
       COUNT(*) as mnemegrams_created
FROM mnemegrams
GROUP BY month

-- Concept co-occurrence
SELECT a.concept, b.concept, COUNT(*) as frequency
FROM mnemegrams m
  JOIN concepts a ON m.id = a.mnemegram_id
  JOIN concepts b ON m.id = b.mnemegram_id
WHERE a.concept < b.concept
GROUP BY a.concept, b.concept
ORDER BY frequency DESC
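
The co-occurrence query above maps directly to a pair count in application code. A sketch over a mnemegram-to-concepts mapping, with `sorted()` playing the role of the `a.concept < b.concept` canonical ordering:

```python
from collections import Counter
from itertools import combinations

def co_occurrence(concepts_by_mnemegram):
    """Count how often each unordered concept pair appears together
    in the same mnemegram."""
    pairs = Counter()
    for concepts in concepts_by_mnemegram.values():
        # sorted(set(...)) deduplicates and fixes pair order, so
        # ("crdt", "git") and ("git", "crdt") count as one pair
        for pair in combinations(sorted(set(concepts)), 2):
            pairs[pair] += 1
    return pairs
```

`pairs.most_common(10)` then answers "which concepts most often appear together?" - the same result as the SQL, without a database.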

Principle Alignment:

  • Supports P11 (Proactive Surfacing) - analytics reveal patterns
  • Supports P9 (Performance) - with proper indexes

Requirements Addressed:

  • R30 (Surface knowledge gaps) - find underexplored concepts
  • R77 (Pattern detection) - identify trends
  • R42 (Temporal decay) - measure activity over time

Implementation:

  • SQL aggregations (standard databases)
  • Graph analytics (Neo4j, NetworkX)
  • Time series databases (InfluxDB, TimescaleDB)

Use for Personal Data Ops: Enables:

  • "What am I thinking about most this year?"
  • "Which connections am I neglecting?" (relationship half-life)
  • "Am I more productive in mornings or evenings?"

Challenge: Privacy - aggregate patterns can be revealing.


Comparative Analysis

Query Expressiveness:

Full-text < Vector < Graph < SPARQL
(keywords)               (semantic reasoning)

Query Performance:

Full-text > Vector > Graph > SPARQL
(fast)                      (slow)

Ease of Use:

Full-text > SQL > Vector > Graph > SPARQL
(familiar)                      (specialist)

Semantic Richness:

Full-text < SQL < Vector < Graph < SPARQL
(surface)                        (deep)

Hybrid Query Strategies

Real systems need multiple query approaches:

Common Combinations:

1. Full-text + Graph (Most Common)

  • Keyword search to find candidates
  • Graph traversal to explore connections
  • Example: Roam, Obsidian graph view

2. Vector + Full-text (Emerging)

  • Keyword for precise matching
  • Vector for semantic similarity
  • Hybrid ranking combines both
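
One common way to fuse the keyword and vector rankings is reciprocal rank fusion (RRF). A minimal sketch; k=60 is the conventional smoothing constant from the RRF literature:

```python
def rrf(rankings, k=60):
    """Fuse several ranked id lists: each id scores sum(1 / (k + rank))
    across the rankings it appears in, so documents that rank well in
    multiple lists rise to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs only ranks, not scores, which sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.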

3. Graph + Temporal

  • Graph for relationships
  • Temporal for evolution
  • atproto does this well

4. Full-text + Spatial + Temporal

  • Where + When + What
  • UC-16 (Event logging) requires all three

Implementation Strategy:

  • Primary index: Full-text (essential baseline)
  • Secondary: Graph (if explicit links exist)
  • Optional: Vector (if semantic similarity valuable)
  • Always: Temporal (for P2 compliance)
  • Conditional: Spatial (if location-aware)

Query Interface Design

Natural Language Queries (with LLM):

"Show me what I was thinking about CRDTs last summer"
  ↓ LLM translates to:
SELECT * FROM mnemegrams
WHERE content LIKE '%CRDT%'
  AND created_at BETWEEN '2024-06-01' AND '2024-08-31'

Strengths: User-friendly, no query language needed
Weaknesses: LLM reliability, cost, privacy (if cloud)

Visual Query Builder: Graph-based query construction (like Roam's query builder)

DSL (Domain-Specific Language): Custom query syntax for common patterns
Example: tag:work created:last-week about:strategy
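
A filter DSL of this shape parses in a few lines. A sketch that treats `key:value` tokens as field filters and everything else as free text - value semantics (e.g. resolving "last-week" to a date range) are left to the caller:

```python
def parse_query(query):
    """Split a query like 'tag:work created:last-week strategy' into
    a dict of field filters plus leftover free-text terms."""
    filters, text = {}, []
    for token in query.split():
        if ":" in token:
            key, _, value = token.partition(":")  # keeps any later colons in value
            filters[key] = value
        else:
            text.append(token)
    return filters, text
```

The filters dict feeds structured indexes (tags, dates) while the free-text remainder goes to the full-text index - a cheap entry point into the hybrid strategies above.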


Performance Considerations

Indexing Strategies:

  • Full-text: Inverted index
  • Graph: Adjacency lists
  • Vector: HNSW (Hierarchical Navigable Small World)
  • Spatial: R-tree, quadtree
  • Temporal: B-tree on timestamp

Materialized Views: Pre-compute common queries for speed
Example: "Most referenced concepts" updated nightly

Caching: Cache frequent queries; invalidate on data change

Query Optimization:

  • Index selection crucial
  • Query planning (SPARQL needs this badly)
  • Limit result sets
  • Pagination for large results

Recommendations by Use Case

For Daily Note-Taking (Obsidian-like):

  • Full-text (essential)
  • Simple graph (backlinks, forward links)
  • Optional: Vector for "similar notes"

For Research/Academic:

  • Full-text + Graph (citations, concepts)
  • Temporal (track evolution of thinking)
  • Optional: SPARQL if formal ontology needed

For Quantified Self:

  • Temporal + Aggregation (patterns over time)
  • Spatial (location context)
  • Full-text (find by content)

For AI-Augmented:

  • Vector (semantic similarity)
  • Full-text (keyword fallback)
  • Graph (explicit relationships)

Open Questions

  1. Can vector similarity replace explicit linking?
  2. How do you query across multiple temporal versions efficiently?
  3. What's the right balance of indexes (storage cost vs query speed)?
  4. Can LLMs generate reliable queries from natural language?
  5. How do you explain why a query returned these results (explainability)?
  6. Can federated queries work across personal memexes?

Cross-References