The gap-radar. Global averages hide sub-population gaps; slicing every metric by data-source cohort is the lens that caught the 12.9M minted-entity classification hole. Red = low, amber = partial, green = healthy.
| Cohort | Entities | P31 | Classified | Named | Descr | Slug | Embed | PViews | Prog | Image |
|---|---|---|---|---|---|---|---|---|---|---|
| wikidata_bulk | 40,537,252 | 97.4% | 97.4% | 100.0% | 80.7% | 100.0% | 16.8% | 19.7% | 100.0% | 0.4% |
| musicbrainz | 9,600,081 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 0.0% | 0.0% | 100.0% | 0.0% |
| openalex | 3,194,408 | 100.0% | 0.0% | 100.0% | 100.0% | 100.0% | 0.0% | 0.0% | 100.0% | 0.0% |
| gleif | 2,791,824 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 0.0% | 0.0% | 100.0% | 0.0% |
| usda | 389,001 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 0.0% | 0.0% | 100.0% | 0.0% |
| openfda | 336,531 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 0.0% | 0.0% | 94.5% | 0.0% |
| nhtsa | 42,794 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 0.0% | 0.0% | 100.0% | 0.0% |
| github | 18,239 | 100.0% | 0.0% | 100.0% | 100.0% | 100.0% | 0.0% | 0.0% | 100.0% | 0.0% |
| huggingface | 13,124 | 100.0% | 0.0% | 100.0% | 100.0% | 100.0% | 0.0% | 0.0% | 100.0% | 0.0% |
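The per-cohort slice behind this table can be sketched roughly as follows. This is a minimal illustration, not the dashboard's actual query: the record fields (`cohort`, `classified`) and the red/amber/green thresholds (below 50%, below 95%) are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical thresholds; the dashboard's real cut-offs are not stated.
RED_BELOW = 50.0
AMBER_BELOW = 95.0


def status(pct):
    """Map a coverage percentage to a traffic-light label."""
    if pct < RED_BELOW:
        return "red"
    if pct < AMBER_BELOW:
        return "amber"
    return "green"


def coverage_by_cohort(entities, field):
    """Return {cohort: (pct_with_field, status)} for one boolean field."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for e in entities:
        totals[e["cohort"]] += 1
        if e.get(field):
            hits[e["cohort"]] += 1
    result = {}
    for cohort, n in totals.items():
        pct = 100.0 * hits[cohort] / n
        result[cohort] = (round(pct, 1), status(pct))
    return result


# Toy data: the global average looks healthy while one cohort sits at 0%.
sample = (
    [{"cohort": "wikidata_bulk", "classified": True}] * 97
    + [{"cohort": "wikidata_bulk", "classified": False}] * 3
    + [{"cohort": "github", "classified": False}] * 5
)
report = coverage_by_cohort(sample, "classified")
global_pct = round(
    100.0 * sum(1 for e in sample if e["classified"]) / len(sample), 1
)
```

In the toy data the global figure stays above 90% while the `github` cohort is entirely unclassified, which is the kind of gap a single average hides.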
Non-zero = a gap to fix. Catches structural surprises before they're stumbled on.
| Check | Violations | Status |
|---|---|---|
| Minted CC0 entities not field-classified (need a resolvable P31 Q-id) | 3,225,775 | 🔴 fix |
| Has P31 but unclassified (flat value, not a resolvable Q-id) | 3,228,542 | 🔴 fix |
| Field-classified yet no P31 (consistency check) | 0 | ✅ ok |
| Named entities with no slug (unreachable pages) | 1,679 | 🔴 fix |
| Has a Wikidata P18 image claim but no stored image_url | 5,457,718 | ℹ️ info |
| Entities with no name | 0 | ✅ ok |
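Checks of this kind reduce to counting the rows that violate an invariant. The sketch below shows the pattern against an in-memory SQLite table; the schema, column names, and check labels are hypothetical and only loosely mirror the rows above. Unlike the table, this sketch has no "info" tier: every non-zero count is marked "fix".

```python
import sqlite3

# Hypothetical schema: one row per entity; column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entities ("
    "id INTEGER PRIMARY KEY, name TEXT, slug TEXT, p31 TEXT, classified INTEGER)"
)
conn.executemany(
    "INSERT INTO entities (name, slug, p31, classified) VALUES (?, ?, ?, ?)",
    [
        ("Alpha", "alpha", "Q5", 1),
        ("Beta", None, "Q5", 1),            # named but no slug
        ("Gamma", "gamma", "software", 0),  # P31 present but not a Q-id
    ],
)

# Each check is a COUNT(*) over the rows that violate one invariant.
CHECKS = {
    "has P31 but unclassified":
        "SELECT COUNT(*) FROM entities WHERE p31 IS NOT NULL AND classified = 0",
    "classified yet no P31":
        "SELECT COUNT(*) FROM entities WHERE classified = 1 AND p31 IS NULL",
    "named with no slug":
        "SELECT COUNT(*) FROM entities WHERE name IS NOT NULL AND slug IS NULL",
    "no name":
        "SELECT COUNT(*) FROM entities WHERE name IS NULL",
}


def run_checks(conn, checks):
    """Return {label: (violations, status)}; any non-zero count is a gap."""
    results = {}
    for label, sql in checks.items():
        (violations,) = conn.execute(sql).fetchone()
        results[label] = (violations, "ok" if violations == 0 else "fix")
    return results


results = run_checks(conn, CHECKS)
```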
Two kinds: creators (added new entity rows) and augmenters (attached fields/tables to existing entities). All currently CC0 or public-domain unless tagged.
| Source | License | Entities created |
|---|---|---|
| Wikidata dump + EventStreams [wikidata] | CC0 | 40,614,594 |
| musicbrainz [musicbrainz] | CC0 | 9,600,081 |
| openalex [openalex] | CC0 | 3,194,408 |
| GLEIF (LEI) legal entities [gleif] | CC0 | 2,791,824 |
| USDA Branded Foods [usda] | CC0 | 389,001 |
| openFDA NDC drug directory [openfda] | CC0 | 336,531 |
| NHTSA vPIC vehicle catalog [nhtsa] | CC0 | 42,794 |
| github [github] | CC0 | 18,239 |
| huggingface [huggingface] | CC0 | 13,124 |
| Source | License | Entities enriched |
|---|---|---|
| SEC EDGAR (financial filings) | PD | 4,601 |
| GLEIF (LEI + legal form + jurisdiction), augments existing companies | CC0 | 2,853,230 |
| Wikidata claim citations (P854 + P248) | CC0 | 4,689,490 |
| Wikimedia sitelinks (rows in entity_sitelinks) | CC0 | 65,166,968 |
| Wikimedia pageviews, multi-lang monthly (rows in entity_pageviews_monthly) | CC0 | 30,264,706 |
| Wikimedia clickstream edges (rows in entity_clickflow) | CC0 | 14,236,762 |
| Per-entity recent edits log (rows in entity_edits) | CC0 | 2,943,869 |
| Entity aliases, i18n (rows in entity_aliases) | CC0 | 12,553,939 |
How stale is each data stream right now. Pageview hourly lags Wikimedia's publish window (typically 3-5h behind real time); rollups settle into the daily/monthly tiers nightly.
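A staleness reading of this kind is the gap between now and a stream's last ingested timestamp, compared against the lag that stream is expected to carry. A minimal sketch follows, assuming a 5-hour budget for hourly pageviews (the upper end of the window mentioned above); the function names and the fixed example timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def lag(last_ingested, now=None):
    """Time elapsed since the stream's newest ingested record."""
    now = now or datetime.now(timezone.utc)
    return now - last_ingested


def is_stale(last_ingested, expected_lag, now=None):
    """True when the stream is further behind than its expected lag."""
    return lag(last_ingested, now) > expected_lag


# Assumed budget for hourly pageviews; not a value taken from the dashboard.
PAGEVIEW_HOURLY_BUDGET = timedelta(hours=5)

# Fixed timestamps so the example is reproducible.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
on_time = datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc)  # 4h behind
behind = datetime(2024, 1, 1, 5, 0, tzinfo=timezone.utc)   # 7h behind

on_time_stale = is_stale(on_time, PAGEVIEW_HOURLY_BUDGET, now)
behind_stale = is_stale(behind, PAGEVIEW_HOURLY_BUDGET, now)
```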
Percentage of the realistic ceiling for each field, i.e. how close we are to what's actually achievable. Image coverage isn't “X of 28.8M” (most entities never had a Wikidata image), it's “X of entities that can have an image”. Hover any row for the absolute number.
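The difference between the two denominators can be shown with a small worked example. The numbers below are made up for illustration and are not dashboard figures; `pct_of_ceiling` is a hypothetical helper name.

```python
def pct(have, denominator):
    """Percentage rounded to one decimal; None when the denominator is zero."""
    if denominator == 0:
        return None
    return round(100.0 * have / denominator, 1)


def pct_of_ceiling(have, eligible):
    """Coverage measured against entities that could have the field at all."""
    return pct(have, eligible)


# Illustrative numbers only: 1,000 entities, 200 of which could ever have
# an image, 150 of which actually have one stored.
total_entities = 1000
eligible_for_image = 200
with_image = 150

naive = pct(with_image, total_entities)                       # share of all entities
realistic = pct_of_ceiling(with_image, eligible_for_image)    # share of the ceiling
```

Against the full entity count the field looks 15% covered, while against the achievable ceiling it is 75% covered; the second figure is the one that indicates how much work remains.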
External assets we should be self-hosting. Each one = performance hit, privacy/GDPR risk, and policy violation territory.
Row counts for related tables. Bigger isn't always better; it depends on what each table represents.