Firmwatch Data-Completeness Audit — 2026-04-24

Cross-reference of:

A. What the database actually has

21 tables. Row counts:

| Has data | Empty |
| --- | --- |
| deals (456) | companies (0) |
| firms (17) | company_news (0) |
| thesis (1) | deal_email_provenance (0) |
| thesis_match_scores (93) | deal_score_explanations (0) |
| thesis_versions (4) | deal_source_provenance (0) |
| provider_usage (10) | digest_item_opened (0): NSM tracking |
| sources (66) | digest_items (0): the actual digest |
| schema_migrations (28) | email_inbound (0): newsletter ingest |
| | firm_drift_centroids (0) |
| | partner_thumbs (0) |
| | render_tool_invocations (0) |
| | saved_dashboards (0) |
| | source_snapshots (0) |
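
The split above can be re-derived with a short script. This is a minimal sketch, assuming the database is SQLite-compatible and reachable through Python's `sqlite3` module (the audit does not name the engine). The function names `table_row_counts` and `split_by_emptiness` are illustrative and not part of Firmwatch.

```python
import sqlite3
from typing import Dict, List, Tuple


def table_row_counts(conn: sqlite3.Connection) -> Dict[str, int]:
    """Return {table_name: row_count} for every user table in the database."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
        # Skip SQLite's internal bookkeeping tables.
        if not row[0].startswith("sqlite_")
    ]
    # Names come from sqlite_master rather than user input, so quoting
    # them as identifiers is sufficient for this sketch.
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }


def split_by_emptiness(counts: Dict[str, int]) -> Tuple[Dict[str, int], List[str]]:
    """Split counts into ({table: n} for populated tables, sorted empty tables)."""
    has_data = {table: n for table, n in counts.items() if n > 0}
    empty = sorted(table for table, n in counts.items() if n == 0)
    return has_data, empty
```

Running `split_by_emptiness(table_row_counts(conn))` against the audited database should yield 8 populated tables and 13 empty ones if the counts above still hold.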

B. Critical missing data — by category

B1. Deals: 86% to 100% null on the columns that actually matter

The 456 deal rows are headlines without payloads. Sample null rates from deals table:

| Column | Null % | Why it matters |
| --- | --- | --- |
| firm_slug | 88% | Without this, a deal isn't attached to a watched firm. 88% are just "there was an announcement somewhere." |
| amount_usd | 86% | Spec calls for a "raising $5-15M" filter as a thesis dimension. Can't filter on what's null. |
| round_type | 86% | Spec wants Series A/B/Growth Equity etc. for stage match. |
| sector | 86% | Spec's check-1 for thesis match. Can't bucket by sector if the column is null. |
| lead_investors | 90% | Whole point of competitive intel: who led? Can't run co-investor analysis on nulls. |
| company_id | 100% | The deals table doesn't link to companies (which is itself empty; see B2). |
| canonical_id | 100% | The dedup column has zero call sites. Cross-source duplicates show as separate rows. |
| enrichment_attempted | only false | Enrichment pipeline (Sonnet extraction) never ran on these rows post-ingest. |

What's actually populated: id, source_id, announced_at, created_at, source_url, via_scraper, raw_json. That's the headline + URL + scraper provenance — not the structured fields the thesis matcher needs.
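
The null rates in B1 can be recomputed per column with a query loop. This is a minimal sketch, again assuming a SQLite-compatible database; `null_rates` is a hypothetical helper, not an existing Firmwatch function.

```python
import sqlite3
from typing import Dict, List, Optional


def null_rates(
    conn: sqlite3.Connection, table: str, columns: List[str]
) -> Dict[str, Optional[float]]:
    """Return {column: percent of rows where the column IS NULL}.

    Every value is None when the table has no rows.
    """
    # Validate names first: SQLite may treat an unknown double-quoted
    # identifier as a string literal, which would silently report 0% null.
    known = {row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')}
    missing = [col for col in columns if col not in known]
    if missing:
        raise ValueError(f"unknown columns on {table}: {missing}")

    total = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    rates: Dict[str, Optional[float]] = {}
    for col in columns:
        if total == 0:
            rates[col] = None
            continue
        # COUNT(col) skips NULLs, so total - COUNT(col) is the NULL count.
        non_null = conn.execute(
            f'SELECT COUNT("{col}") FROM "{table}"'
        ).fetchone()[0]
        rates[col] = round(100.0 * (total - non_null) / total, 1)
    return rates
```

For example, `null_rates(conn, "deals", ["firm_slug", "amount_usd", "sector"])` would return one percentage per column, which makes it easy to re-run the audit after an enrichment backfill.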

B2. Companies — empty (0 rows)

The companies table has 18 columns including current_employee_count, twelve_months_growth_rate, current_employee_range, job_count, last_reported_revenue_usd, linkedin_url, sourcescrub_company_id. All these were specced. None are populated. Spec said:

"Sourcescrub: tomorrow → Phase 0.5 add-on"

Still tomorrow. Spec FIRM-N15 covers Sourcescrub credit budget; FIRM-313/314 cover the newsletter ingest. The companies-enrichment leg is not yet wired — even though the schema is ready.

B3. Newsletters — 0 rows in email_inbound

Spec said:

"Newsletters remain priority-1 in source fusion"

email_inbound table has 0 rows. That means ZERO newsletter-derived deals have been ingested. The Resend webhook decommissioning (FIRM-347) returned the system to a state where:

B4. Source provenance — 0 rows in source_snapshots + deal_source_provenance

Spec 023 (newsletter scraping-first) added source_snapshots for raw-content archive + deal_source_provenance to link deals to their source. Zero rows in either. Neither table is being written by current ingest paths. Migration 0027/0028 ran (verified in schema_migrations), so the schema is there — pollers just don't write to it.

B5. Engagement / NSM signals — all empty

B6. Drift signals — empty

firm_drift_centroids: 0 rows. v1 spec called for "thesis-drift detection" as one of 5 differentiators. The schema exists, the cron job (0 3 * in wrangler.toml) exists, but no firm has a centroid computed yet because deals are 86% null on sector (B1), and the centroid algorithm requires sector distribution.
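
To illustrate why null sectors starve this step, here is a minimal sketch of a per-firm sector distribution. It is not the Firmwatch centroid algorithm, which this audit does not describe. The `min_deals` threshold and the `(firm_slug, sector)` input shape are assumptions made for the example.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple

Deal = Tuple[Optional[str], Optional[str]]  # (firm_slug, sector); None marks NULL


def sector_distributions(
    deals: Iterable[Deal], min_deals: int = 3
) -> Dict[str, Dict[str, float]]:
    """Return {firm_slug: {sector: share of that firm's usable deals}}.

    Deals with a NULL firm_slug or sector are skipped, and firms with
    fewer than min_deals usable deals get no distribution at all.
    """
    by_firm: Dict[str, Counter] = defaultdict(Counter)
    for firm_slug, sector in deals:
        if firm_slug is None or sector is None:
            continue  # headline-only row: nothing to bucket
        by_firm[firm_slug][sector] += 1

    result: Dict[str, Dict[str, float]] = {}
    for firm, counts in by_firm.items():
        usable = sum(counts.values())
        if usable < min_deals:
            continue  # too little data for a meaningful distribution
        result[firm] = {sector: n / usable for sector, n in counts.items()}
    return result
```

With most rows null on `firm_slug` or `sector`, very few deals survive the first filter, so most firms would fall below any reasonable threshold and end up with no distribution.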

B7. Digest — 0 rows in digest_items

The actual digest table is empty. The spec's #1 deliverable was "every morning a digest arrives." There's a daily cron (brief-digest), but the cron's output isn't being persisted to digest_items. Either the cron silently fails OR the persistence step was never wired.
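
Either failure mode would be easier to spot if the persistence step verified its own writes. The following is a minimal sketch under stated assumptions: the `digest_items` columns (`run_date`, `deal_id`, `position`) are hypothetical because this audit does not list the real schema, and `persist_digest` is an illustrative name rather than the existing cron handler.

```python
import sqlite3
from typing import List


class DigestPersistenceError(RuntimeError):
    """Raised when a digest run produces or persists nothing."""


def persist_digest(conn: sqlite3.Connection, run_date: str, deal_ids: List[int]) -> int:
    """Write one digest_items row per deal and confirm the rows landed.

    Raises DigestPersistenceError instead of failing silently.
    Column names here are assumptions, not the real Firmwatch schema.
    """
    if not deal_ids:
        raise DigestPersistenceError(f"digest for {run_date} produced no items")

    rows = [
        (run_date, deal_id, position)
        for position, deal_id in enumerate(deal_ids, start=1)
    ]
    with conn:  # commits on success, rolls back on exception
        conn.executemany(
            "INSERT INTO digest_items (run_date, deal_id, position) VALUES (?, ?, ?)",
            rows,
        )

    # Read back the count so a dropped write surfaces as an error.
    written = conn.execute(
        "SELECT COUNT(*) FROM digest_items WHERE run_date = ?", (run_date,)
    ).fetchone()[0]
    if written < len(deal_ids):
        raise DigestPersistenceError(
            f"expected {len(deal_ids)} rows for {run_date}, found {written}"
        )
    return written
```

A cron handler that calls something like this and lets the exception propagate would turn "table is quietly empty" into a visible failed run.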

C. Cross-reference vs original v1 spec deliverables

From REVISION 2 (Paul-approved), the user-observable behaviors:

| v1 promise | Status | Evidence |
| --- | --- | --- |
| Single named Blueprint thesis | ✅ Working | thesis row 1, thesis_versions 4 versions |
| Daily digest delivers 5/5 business days | FAILING | digest_items empty; can't measure SLA |
| 3-5 thesis-filtered items with "why this matters" narrative | ⚠️ Partial | thesis_match_scores has 93 rows but why_narrative IS NULL on most |
| Watchlist Home dense table with sparklines | ⚠️ Partial | UI exists, but firm_drift_centroids empty so sparklines have no drift data |
| Drift indicators per firm | ❌ Empty | 0 centroid rows |
| Generative chart/table responses with citations | ⚠️ Server-side ready, client-side blocked | Spec 026 in flight (PR #446) |
| Thumbs feedback per item | ❌ Empty | partner_thumbs 0 rows |
| Newsletter as first-class source | ❌ Empty | email_inbound 0 rows |
| Co-investor network analytics | Schema not started | No co_investors or network_edges table |
| Per-firm thesis-match scoring | ⚠️ Partial | 93 scores but mostly score-only, no breakdown |
| 18 firms watched | ⚠️ Off: we have 17 | spec's seed list is 17 (per MEMORY.md), but spec text says "18 firms" |
| Sourcescrub Data Connect API | ❌ Not yet writing to companies | Phase 0.5 deferred |
| Axios Pro Rata, StrictlyVC, PitchBook newsletters | ❌ Not yet ingested | spec 023 partially shipped |
| Gmail OAuth | ❌ Not yet wired | superseded by Resend; Resend now decommissioned (FIRM-347), so circular |

Constitution cross-check:

D. Top-priority gaps (Paul-actionable)

In rough Build/Severity order:

🔴 P0 — Things that break the v1 promise

1. Deal enrichment is not running: 86% of deals are headline-only. The specced Sonnet extraction step and Apify retry path are not producing structured fields. Without this, every other feature (scoring, digest, charts) is starving.
2. digest_items is empty: the v1 NSM (digest open rate) literally cannot be measured. Either the cron fails silently or the persistence step is missing.
3. partner_thumbs is empty: the feedback-loop NSM also cannot be measured.

🟠 P1 — Feature exists but data missing

4. firm_drift_centroids empty: drift glyph in Watchlist UI shows nothing because there's no centroid history.
5. why_narrative null on most thesis_match_scores: Sonnet narrative-generation is either off or its writes are dropped.
6. source_snapshots + deal_source_provenance empty: provenance/audit promise from spec 023 is not being kept.

🟡 P2 — Phase 0.5 add-ons that never landed

7. Companies table: 0 rows despite the SourceScrub schema being fully ready.
8. email_inbound: 0 rows. The newsletter pipeline has no real ingestion (FIRM-347 decommissioned the Resend path).
9. Co-investor graph: schema not even started. Spec 023 mentioned it as a differentiator, but there is no co_investors / firm_relationships table.

🟢 P3 — Schema gaps that haven't been filed yet

10. 18 vs 17 firms: minor; spec says "18 firms" but we have 17. Either add the 18th or update spec.
11. firms.team 100% null: spec said partner-team data should be ingested per firm. Not happening.
12. sources.url 100% null: sources table has 66 rows but no URLs on any of them. Probably fine for RSS-known names but breaks if we re-derive from URL.

E. Recommendation

Frame these as two specs:

Spec 027, Deal-enrichment recovery (P0/P1): items 1, 2, 3, 5, 6 above. The schema is right; the writes aren't happening. This is mostly diagnose-and-wire-up work. ~6-10 tickets, fast track / Standard split.

Spec 028, Phase 0.5 backlog (P2, plus P1 item 4): items 4, 7, 8, 9 above. Bigger scope: actually building the SourceScrub→companies pipeline, reviving newsletter ingest, and adding the co-investor graph schema. Standard brief, 8-12 tickets.

P3 items are incremental; file as a someday-maybe.md follow-up.

Run them sequentially (027 first since the v1 promise is the bigger fire), or in parallel (different parts of the codebase) — Paul's call.

Generated 2026-04-24 by Morty · Source: ~/Documents/Mojo/Morty/briefs/firmwatch-data-completeness-audit-2026-04-24.md