📊 Data Coverage Audit

Cached 5 min · force refresh
56,923,254 entities total

🩺 Cohort health: completeness by data source

The gap radar. Global averages hide sub-population gaps; slicing every metric by data-source cohort is the lens that caught the 12.9M minted-entity classification hole. Red = low, amber = partial, green = healthy.

| Cohort | Entities | P31 | Classified | Named | Descr | Slug | Embed | PViews | Prog | Image |
|---|---|---|---|---|---|---|---|---|---|---|
| wikidata_bulk | 40,537,252 | 97.4% | 97.4% | 100.0% | 80.7% | 100.0% | 16.8% | 19.7% | 100.0% | 0.4% |
| musicbrainz | 9,600,081 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 0.0% | 0.0% | 100.0% | 0.0% |
| openalex | 3,194,408 | 100.0% | 0.0% | 100.0% | 100.0% | 100.0% | 0.0% | 0.0% | 100.0% | 0.0% |
| gleif | 2,791,824 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 0.0% | 0.0% | 100.0% | 0.0% |
| usda | 389,001 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 0.0% | 0.0% | 100.0% | 0.0% |
| openfda | 336,531 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 0.0% | 0.0% | 94.5% | 0.0% |
| nhtsa | 42,794 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 0.0% | 0.0% | 100.0% | 0.0% |
| github | 18,239 | 100.0% | 0.0% | 100.0% | 100.0% | 100.0% | 0.0% | 0.0% | 100.0% | 0.0% |
| huggingface | 13,124 | 100.0% | 0.0% | 100.0% | 100.0% | 100.0% | 0.0% | 0.0% | 100.0% | 0.0% |
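
To illustrate why the cohort slice matters, here is a minimal sketch that recomputes the entity-weighted global average of the Classified column from the cohort table above and then lists the cohorts a global figure would hide. The function names and the 50% "red" threshold are assumptions made for this example; the dashboard's actual thresholds are not stated.

```python
# (cohort, entities, classified_pct) copied from the cohort table above.
COHORTS = [
    ("wikidata_bulk", 40_537_252, 97.4),
    ("musicbrainz", 9_600_081, 100.0),
    ("openalex", 3_194_408, 0.0),
    ("gleif", 2_791_824, 100.0),
    ("usda", 389_001, 100.0),
    ("openfda", 336_531, 100.0),
    ("nhtsa", 42_794, 100.0),
    ("github", 18_239, 0.0),
    ("huggingface", 13_124, 0.0),
]


def global_rate(rows):
    """Entity-weighted average of a per-cohort percentage."""
    total = sum(n for _, n, _ in rows)
    covered = sum(n * pct / 100 for _, n, pct in rows)
    return 100 * covered / total


def flag_gaps(rows, threshold=50.0):
    """Cohorts whose rate falls below a (hypothetical) red threshold."""
    return [name for name, _, pct in rows if pct < threshold]


print(f"global classified: {global_rate(COHORTS):.1f}%")
print("red cohorts:", flag_gaps(COHORTS))
```

The global figure comes out in the low 90s, which looks healthy on its own, while three whole cohorts sit at 0%.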

🚨 Invariants: assertions that should hold

Non-zero = a gap to fix. Catches structural surprises before they're stumbled on.

| Check | Violations | Status |
|---|---|---|
| Minted CC0 entities not field-classified (need a resolvable P31 Q-id) | 3,225,775 | 🔴 fix |
| Has P31 but unclassified (flat value, not a resolvable Q-id) | 3,228,542 | 🔴 fix |
| Field-classified yet no P31 (consistency check) | 0 | ✅ ok |
| Named entities with no slug → unreachable pages | 1,679 | 🔴 fix |
| Has a Wikidata P18 image claim but no stored image_url | 5,457,718 | ℹ️ info |
| Entities with no name | 0 | ✅ ok |
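
The "non-zero = a gap to fix" rule can be sketched as a list of predicates whose violations are counted. This is only an illustration: the field names (`name`, `slug`, `has_p18`, `image_url`), the `Invariant` class, and the sample rows are hypothetical, since the real schema and queries are not shown in this report.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Invariant:
    label: str
    violates: Callable[[dict], bool]  # True when a row breaks the invariant
    severity: str = "fix"             # status reported when violations > 0


# Hypothetical field names; the real schema is not shown in this report.
INVARIANTS = [
    Invariant("Named entities with no slug",
              lambda e: bool(e.get("name")) and not e.get("slug")),
    Invariant("Entities with no name",
              lambda e: not e.get("name")),
    Invariant("P18 image claim but no stored image_url",
              lambda e: bool(e.get("has_p18")) and not e.get("image_url"),
              severity="info"),
]


def audit(entities, invariants=INVARIANTS):
    """Return (label, violation_count, status) for each invariant."""
    report = []
    for inv in invariants:
        count = sum(1 for e in entities if inv.violates(e))
        status = "ok" if count == 0 else inv.severity
        report.append((inv.label, count, status))
    return report


# Made-up sample rows, for illustration only.
sample = [
    {"name": "Ada Lovelace", "slug": "ada-lovelace", "has_p18": True, "image_url": None},
    {"name": "Unslugged Co", "slug": None},
]
for label, count, status in audit(sample):
    print(f"{label}: {count} [{status}]")
```

Keeping a severity per check lets an informational count (like the P18 row) stay visible without being flagged as a fix.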

📥 Data sources ingested

Two kinds: creators (added new entity rows) and augmenters (attached fields/tables to existing entities). All currently CC0 or public-domain unless tagged.

Entity-creator sources

| Source | License | Entities created |
|---|---|---|
| Wikidata dump + EventStreams [wikidata] | CC0 | 40,614,594 |
| musicbrainz [musicbrainz] | CC0 | 9,600,081 |
| openalex [openalex] | CC0 | 3,194,408 |
| GLEIF (LEI) legal entities [gleif] | CC0 | 2,791,824 |
| USDA Branded Foods [usda] | CC0 | 389,001 |
| openFDA NDC drug directory [openfda] | CC0 | 336,531 |
| NHTSA vPIC vehicle catalog [nhtsa] | CC0 | 42,794 |
| github [github] | CC0 | 18,239 |
| huggingface [huggingface] | CC0 | 13,124 |

Entity-augmenter sources

| Source | License | Entities enriched |
|---|---|---|
| SEC EDGAR (financial filings) | PD | 4,601 |
| GLEIF (LEI + legal form + jurisdiction), augments existing companies | CC0 | 2,853,230 |
| Wikidata claim citations (P854 + P248) | CC0 | 4,689,490 |
| Wikimedia sitelinks ≈ rows in entity_sitelinks | CC0 | 65,166,968 |
| Wikimedia pageviews (multi-lang monthly) ≈ rows in entity_pageviews_monthly | CC0 | 30,264,706 |
| Wikimedia clickstream edges ≈ rows in entity_clickflow | CC0 | 14,236,762 |
| Per-entity recent edits log ≈ rows in entity_edits | CC0 | 2,943,869 |
| Entity aliases (i18n) ≈ rows in entity_aliases | CC0 | 12,553,939 |

โฑ๏ธ Freshness

How stale each data stream is right now. The hourly pageview ingest lags Wikimedia's publish window (typically 3-5h behind real time); rollups settle into the daily and monthly tiers nightly.

Pageview ingest (hourly, global)

Latest hour ingested (any lang)
2026-06-09T22:00:00+00:00
0 rows

Pageview rollups

Daily rollup max(day)
2026-06-09
155,941,616 rows
Monthly rollup max(month)
2026-06-01
30,264,706 rows

Wikidata edit stream

Latest edit_ts ingested
2026-06-10T02:25:19+00:00
0 rows
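
As a rough sketch of how a staleness check over these timestamps could look, the snippet below measures how far the latest ingested pageview hour trails a reference time and compares it against a budget. The helper names are hypothetical, the 5h budget is taken from the "typically 3-5h" note above, and using the latest edit timestamp as a stand-in for "now" is an assumption made only so the example is reproducible.

```python
from datetime import datetime, timedelta


def lag(latest_iso: str, now: datetime) -> timedelta:
    """How far behind `now` a stream's newest timestamp is."""
    return now - datetime.fromisoformat(latest_iso)


def is_stale(latest_iso: str, now: datetime, budget: timedelta) -> bool:
    """True when the stream trails `now` by more than the allowed budget."""
    return lag(latest_iso, now) > budget


# Timestamps from the report; `now` is approximated by the latest edit_ts.
now = datetime.fromisoformat("2026-06-10T02:25:19+00:00")
latest_pageview_hour = "2026-06-09T22:00:00+00:00"

print("pageview lag:", lag(latest_pageview_hour, now))
print("stale at 5h budget:", is_stale(latest_pageview_hour, now, timedelta(hours=5)))
```

Both timestamps carry an explicit UTC offset, so the subtraction is between timezone-aware values and does not depend on the local clock.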

Per-entity coverage

Percentage of the realistic ceiling for each field, i.e. how close we are to what's actually achievable. Image coverage isn't “X of 28.8M” (most entities never had a Wikidata image), it's “X of entities that can have an image”.

| Field | Denominator | Count | Ceiling | Coverage |
|---|---|---|---|---|
| Wikidata ID linked | all entities | 56,923,254 | 56,923,254 | 100.0% |
| Description present | all entities | 49,079,949 | 56,923,254 | 86.2% |
| URL slug generated | all entities | 56,921,575 | 56,923,254 | 100.0% |
| Wikidata claims_rich (full data) | all entities | 56,921,612 | 56,923,254 | 100.0% |
| Wikipedia article (any value) | all entities | 39,442,132 | 56,923,254 | 69.3% |
| Image URL set | entities with P18 (Wikidata image) in claims_rich | 148,990 | 5,608,038 | 2.7% |
| Image self-hosted (vs hotlinked) | entities with an image URL set | 17,274 | 148,990 | 11.6% |
| Wikidata claim citations | entities with claims_rich | 4,689,490 | 56,921,612 | 8.2% |
| Wikipedia pageview data | entities with a real Wikipedia article | 7,970,342 | 8,870,358 | 89.8% |
| domains[] populated (post-classification) | entities with claims_rich | 52,649,694 | 56,921,612 | 92.5% |
| professions[] populated | entities with P106 (occupation), basically people | 6,529,005 | 6,530,361 | 100.0% |
| harvest_domain set | entities with claims_rich | 56,903,058 | 56,921,612 | 100.0% |
| Schema.org type (not Thing/class) | entities with claims_rich | 40,138,183 | 56,921,612 | 70.5% |
| SEC EDGAR (CIK + ticker) | all entities | 4,601 | 56,923,254 | 0.0% |
| LLM-enriched article | all entities | 189,443 | 56,923,254 | 0.3% |
| Vector embedding generated | all entities | 189,637 | 56,923,254 | 0.3% |
| Marked for image re-harvest | all entities | 9,293 | 56,923,254 | 0.0% |
| Infrastructure pages (lists/disambig/categories) | all entities | 669 | 56,923,254 | 0.0% |
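
The effect of choosing a realistic ceiling can be checked with the image numbers reported above. This is a minimal sketch; the `coverage` helper and constant names are made up for the example, and rounding to one decimal place is an assumption about how the dashboard formats its percentages.

```python
def coverage(have: int, ceiling: int) -> float:
    """Percentage of the achievable ceiling, rounded to one decimal place."""
    if ceiling == 0:
        return 0.0  # nothing achievable, so report zero rather than divide
    return round(100 * have / ceiling, 1)


# Counts taken from the coverage figures in this report.
TOTAL_ENTITIES = 56_923_254
WITH_P18 = 5_608_038        # entities that can have an image
WITH_IMAGE_URL = 148_990    # entities with an image URL set

naive = coverage(WITH_IMAGE_URL, TOTAL_ENTITIES)
realistic = coverage(WITH_IMAGE_URL, WITH_P18)
print(f"against all entities: {naive}%")
print(f"against entities with P18: {realistic}%")
```

Measured against every entity, image coverage is roughly a tenth of the figure measured against entities that actually carry a P18 claim, which is why the denominator is stated per field.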

🔗 Hotlink risk

External assets we should be self-hosting. Each one = performance hit, privacy/GDPR risk, and policy violation territory.

๐Ÿ–ผ๏ธ Images hotlinked from upload.wikimedia.org (17,274 self-hosted of 148,990)
131,716
88.4%
๐Ÿ”ค Templates loading Google Fonts (GDPR risk) ()
0 / 25
0%
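
One way to classify an image URL as hotlinked is to compare its hostname against the external host named above. The sketch below assumes that approach; the function names and the sample URLs are hypothetical and are not taken from the audited system.

```python
from urllib.parse import urlparse

HOTLINK_HOSTS = {"upload.wikimedia.org"}  # external host named in this report


def is_hotlinked(image_url: str) -> bool:
    """True when the URL points at a host we should be mirroring instead."""
    host = (urlparse(image_url).hostname or "").lower()
    return host in HOTLINK_HOSTS


def hotlink_share(urls):
    """Return (hotlinked_count, percentage of all URLs, one decimal place)."""
    urls = list(urls)
    if not urls:
        return 0, 0.0
    hot = sum(is_hotlinked(u) for u in urls)
    return hot, round(100 * hot / len(urls), 1)


# Hypothetical URLs, for illustration only.
sample = [
    "https://upload.wikimedia.org/wikipedia/commons/a/a1/Example.jpg",
    "https://cdn.example.org/img/example.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/b/b2/Other.png",
    "https://cdn.example.org/img/other.png",
]
print(hotlink_share(sample))
```

Matching on the parsed hostname rather than on a substring avoids counting a self-hosted URL that merely contains the Wikimedia host name in its path or query string.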

Auxiliary tables

Row counts for related tables. Bigger isn't always better; it depends on what each table represents.

wikidata_items 91,584,848
wikidata_class_hierarchy 5,172,642
wikidata_class_closure 44,316,192
wikidata_properties 13,427
entity_relationships 41,775,756
entity_filings 2,610,222
entity_pageviews_hourly 258,573,344
entity_pageviews_monthly 30,264,706
entity_clickflow 14,236,762
entity_sitelinks 65,166,968
entity_aliases 12,553,939
entity_descriptions 69,993,256
search_metrics 43,432
sitelink_languages 178