Methodology · Open

How the data is sourced, cleaned, and made comparable

No black box. SkyMind unifies fragmented government statistics for 2,619 regions into one clean, comparable, fully traceable dataset with a transparent scoring methodology: descriptive, not a forecast. Every figure is an official number you can trace back to its source. If something doesn't add up, email us; we welcome the scrutiny.

Last updated 28 June 2026 · ~7 min read

Contents

1. What we measure, and what we don't
2. The data & sources
3. How we make the data comparable
4. Data integrity, no fabricated values
5. Honest limitations
6. FAQ

1. What we measure, and what we don't

SkyMind takes government statistics that are normally scattered across dozens of national portals, in several languages, with inconsistent schemas and sometimes broken APIs, and unifies them into one clean, comparable dataset for 2,619 regions across forty-seven countries. The product is the official data itself, made consistent, labelled in English, and traceable to source — with a transparent, reproducible methodology.

This is a descriptive product. It shows the measured state of a region from official data. It is not a forecast, not a probability of any event, and not a rating of the future. A figure that moves over time reflects a change in the underlying published statistics, nothing more, nothing less.

2. The data & sources

Every figure traces back to an official, public source. We ingest no personal data: this is a Zero Personal Data Architecture, operated on the GDPR public-interest basis (Article 6).

Country	Regions	Coverage	Sources
🇩🇪 Germany	401 Kreise	2014-2023	Eurostat NUTS-3
🇮🇱 Israel	255 localities	2001-2025	data.gov.il, CBS, audited financial reports
🇦🇪 UAE	340 units	2006-2024	Dubai Land Department, Dubai Statistics Center, World Bank
🇸🇦 Saudi Arabia	51 governorates	2015-2024	GASTAT, KAPSARC, RCRC, World Bank
🇶🇦 Qatar	8 municipalities (official)	2010-2025	Qatar PSA (data.gov.qa), World Bank
🇫🇷 France	96 départements	2015-2024	Eurostat NUTS-3, INSEE
🇪🇺 EU-27 + EFTA	24 countries · 706 regions	2015-2024	Eurostat NUTS-3
🇬🇧 United Kingdom	182 ITL3 areas	2023	ONS Open Geography, NOMIS (GDHI)
🌍 Western Balkans + Türkiye	4 countries · 126 regions	2018-2024	Eurostat candidate-country NUTS-3

Total: 47 countries, ~1,093 metrics, ~2.1 million observations. Coverage and depth are not uniform, and we say so explicitly. Israel, for example, has a deep, fully populated core for 2002-2018; later years are partial because the underlying government datasets thin out. Where the source data is thin, our data is thin; we do not paper over it.

ℹ️ A note on the UK and the wider European set

The 24 EU-27 + EFTA countries share one unified Eurostat NUTS-3 pipeline with the same metric set (GDP, employment, full demographics, tourism), normalised within each country.

The United Kingdom is built differently: after Brexit, Eurostat no longer publishes UK regional economics, so the UK runs entirely on the UK's own ONS ITL-2025 geography. Its economic axis is GDHI per head (gross disposable household income, NOMIS, 2023) rather than GDP, and it rests on a smaller metric set (income plus population, density and median age) than the EU countries. As everywhere, UK scores are normalised within the UK and are not comparable across countries.

For six countries Eurostat does not publish NUTS-3 GDP at our geography vintage, so their economic axis is taken directly from the national statistics office, matched region-by-region by name and never estimated: Switzerland (BFS cantonal GDP per capita 2022), Latvia (CSB regional GDP 2021, which unlike Eurostat still separates Rīga and Pierīga), the Netherlands (CBS, all 40 COROP regions, 2022), Portugal (INE sub-regional GDP 2024, with the Lisbon metro recombined from Grande Lisboa + Península de Setúbal), Finland (Statistics Finland, all 19 maakunta, 2022) and Norway (SSB county value-added per capita 2021). Demographics and tourism still come from Eurostat; for Portugal those axes reflect 2022 (the latest year INE published full sub-regional detail) while its economic axis is 2024.

Some regions remain unscored by design, not estimated: countries with only one or two NUTS-3 regions (Luxembourg, Cyprus, Liechtenstein, Iceland, Malta) cannot be compared within themselves, and Svalbard and Jan Mayen sit outside the standard regional accounts.

What the pipeline actually looks like

The hardest part is the aggregation itself, and we do not pretend otherwise. Government data across forty-seven countries is scattered across dozens of portals, in different languages, with inconsistent identifiers and frequently broken APIs. Cleaning, geocoding, translating metric names, reconciling schemas and keeping it all current is months of unglamorous work — and genuinely difficult to reproduce.

A few examples from our own pipeline:

Israel. The CBS publishes annual municipal compendiums in Hebrew Excel files with two-row headers. Column names change between editions. The official locality code (semel yishuv) used in one table does not match the identifier used in another. Tel Aviv is 5000 in some files and 76781 in others. Matching columns to our 837 metrics required correlating values across localities against prior years — because string matching simply fails. One mismatched identifier silently drops an entire city.
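
To illustrate the value-correlation idea described above, here is a minimal sketch, not SkyMind's actual code. It assumes each column is available as a mapping from locality code to value; the function names, the 0.95 threshold, the locality codes and all sample values are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no matching signal
    return cov / sqrt(var_x * var_y)

def match_columns(new_edition, prior_year, threshold=0.95):
    """Map each new column header to the prior-year metric it tracks best.

    Both arguments map a name to {locality_code: value}. A column with no
    match above the threshold maps to None and is left for manual review.
    """
    mapping = {}
    for header, new_values in new_edition.items():
        best_metric, best_r = None, threshold
        for metric, old_values in prior_year.items():
            shared = sorted(set(new_values) & set(old_values))
            if len(shared) < 3:
                continue  # too few localities in common to trust a correlation
            r = pearson([new_values[c] for c in shared],
                        [old_values[c] for c in shared])
            if r > best_r:
                best_metric, best_r = metric, r
        mapping[header] = best_metric
    return mapping

# Hypothetical data: last year's labelled metrics and this year's unlabelled columns.
prior_year = {
    "population": {101: 1000, 102: 5000, 103: 20000, 104: 300},
    "area_km2": {101: 12.0, 102: 3.5, 103: 9.0, 104: 50.0},
}
new_edition = {
    "col_a": {101: 1020, 102: 5100, 103: 20500, 104: 310},
    "col_b": {101: 12.0, 102: 3.5, 103: 9.0, 104: 50.0},
    "col_c": {101: 5, 102: 5, 103: 5, 104: 5},
}
print(match_columns(new_edition, prior_year))
```

Leaving weak matches as None rather than forcing the nearest candidate keeps a renamed or brand-new column from being attached to the wrong metric.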

Germany. Our original dataset for 401 Kreise turned out to be synthetic — calibrated on twelve Berlin sub-districts and extrapolated. Leipzig showed a population of 3.35 million (the real figure is 620,000). Wolfsburg ranked fourth in GDP instead of first. We discovered this in an internal audit, discarded the entire dataset, and rebuilt from scratch using verified Eurostat NUTS-3 data. The replacement was validated: Wolfsburg is now correctly #1 at €185k per capita, and the economic axis correlates r = 0.903 with official GDP.
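
An error like the Leipzig figure is the kind of thing a cross-check against an independent reference can surface. The sketch below is illustrative only and is not the audit procedure we ran: the function name and the 1.5 ratio threshold are assumptions, and "Example Kreis" is a made-up region. The two Leipzig figures are the ones quoted above.

```python
def flag_implausible(candidate, reference, max_ratio=1.5):
    """Return regions whose candidate value differs from the reference
    by more than max_ratio in either direction.

    A region that cannot be cross-checked is flagged with None so it is
    reviewed by hand rather than silently accepted.
    """
    flagged = {}
    for region, value in candidate.items():
        ref = reference.get(region)
        if ref is None or ref <= 0 or value <= 0:
            flagged[region] = None
            continue
        ratio = max(value / ref, ref / value)
        if ratio > max_ratio:
            flagged[region] = round(ratio, 2)
    return flagged

candidate = {"Leipzig": 3_350_000, "Example Kreis": 100_000}
reference = {"Leipzig": 620_000, "Example Kreis": 101_000}
print(flag_implausible(candidate, reference))  # {'Leipzig': 5.4}
```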

Eurostat. Thirty-three countries report at the NUTS-3 level, but not all metrics exist at that granularity. R&D expenditure, education, healthcare and income are only published at NUTS-2. Each NUTS-2 region contains between two and forty NUTS-3 sub-regions. Mapping requires knowing the hierarchy, which itself changes with each NUTS revision (2016, 2021, 2024). Some countries report 2024 data in February; others still show 2021.
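
Within a single NUTS revision the codes are hierarchical: a NUTS-3 code is its NUTS-2 parent's code plus one more character. The sketch below relies on that prefix rule to attach a NUTS-2 figure to its NUTS-3 children while labelling it as a NUTS-2 value. The function names and the sample value are hypothetical, and the prefix rule should not be applied across revisions, where codes can change.

```python
def nuts2_parent(nuts3_code):
    """Return the NUTS-2 code that contains the given NUTS-3 code."""
    code = nuts3_code.strip().upper()
    if len(code) != 5 or not code[:2].isalpha():
        raise ValueError(f"not a NUTS-3 code: {nuts3_code!r}")
    return code[:4]

def attach_nuts2_metric(nuts3_codes, nuts2_values):
    """Look up each region's NUTS-2 parent value and keep the level explicit.

    The value is recorded as a NUTS-2 figure, never as a NUTS-3 measurement.
    A parent with no published value yields None.
    """
    out = {}
    for code in nuts3_codes:
        parent = nuts2_parent(code)
        out[code] = {
            "value": nuts2_values.get(parent),
            "level": "NUTS-2",
            "parent": parent,
        }
    return out

# Hypothetical NUTS-2 figure (for example, an R&D expenditure share).
nuts2_values = {"DE21": 3.1}
rows = attach_nuts2_metric(["DE212", "DE213", "DE300"], nuts2_values)
print(rows["DE212"])
print(rows["DE300"])
```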

Gulf states. The UAE, Saudi Arabia and Qatar each publish data through their own portals, in Arabic and English, with no shared schema. District boundaries are defined differently. Some figures are official census results; others are estimates from secondary sources. Every value must be tagged with its provenance — official, estimated, or derived — because treating them equally would be dishonest.
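
One way to make the three provenance tags impossible to omit is to carry them on every record. This is a sketch under assumptions: the class names, field names and sample values are hypothetical and do not describe our internal schema.

```python
from dataclasses import dataclass
from enum import Enum

class Provenance(Enum):
    OFFICIAL = "official"
    ESTIMATED = "estimated"
    DERIVED = "derived"

@dataclass(frozen=True)
class Observation:
    region: str
    metric: str
    year: int
    value: float
    source: str
    provenance: Provenance  # required field: a record cannot exist untagged

def only(observations, *allowed):
    """Keep observations whose provenance is one of the allowed tags."""
    return [o for o in observations if o.provenance in allowed]

# Hypothetical records.
observations = [
    Observation("Region A", "population", 2020, 120_000, "census", Provenance.OFFICIAL),
    Observation("Region B", "population", 2020, 95_000, "secondary source", Provenance.ESTIMATED),
    Observation("Region A", "density", 2020, 410.5, "population / area", Provenance.DERIVED),
]
official = only(observations, Provenance.OFFICIAL)
print([o.region for o in official])  # ['Region A']
```

Because the tag travels with the value, a consumer can restrict an analysis to official figures without re-deriving where each number came from.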

This labour and breadth of coverage is the product. We do not sell a model built on top of it.

Watch it run. We built an interactive replay of a real ingestion — unit mapping, metric selection, the five-check validation battery, and the load gate refusing injected synthetic data: sky-mind.com/pipeline. Every number in it is verbatim from the 8 July 2026 Moldova load.

3. How we make the data comparable

"Comparable" is the whole job, and it is structural, not a score. Government statistics for these forty-seven countries arrive in different shapes, languages, identifiers and time vintages. We make them line up so a researcher can put two regions, or two countries, side by side and trust that they are looking at the same thing:

One schema. The same metric means the same thing in every country, with one consistent definition, unit and time axis — not a different spreadsheet layout per portal.
Region-by-region matching. Every official figure is tied to a specific region by its official code or name, never estimated or interpolated to fill a gap.
English labels + provenance + curation. Every metric is labelled in English (we hand-translated all 837 Israeli ones from Hebrew) and carries its source, dataset and original value. We also tag each metric with a tier — core (130 analyst-ready indicators), extended or raw — so you get the full dataset with our recommendation of where to start.
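
Region-by-region matching can be sketched as a crosswalk lookup that reports unmatched rows instead of dropping them. The two Tel Aviv identifiers are the ones mentioned in the pipeline examples; the canonical ID format, the function name and the row values are hypothetical.

```python
def match_regions(rows, crosswalk):
    """Resolve each row's source identifier to a canonical region ID.

    Returns (matched, unmatched) so that a row with an unknown identifier
    is surfaced for review rather than silently lost.
    """
    matched, unmatched = [], []
    for row in rows:
        canonical = crosswalk.get(str(row["source_id"]).strip())
        if canonical is None:
            unmatched.append(row)
        else:
            matched.append({**row, "region_id": canonical})
    return matched, unmatched

# Both identifiers used for Tel Aviv resolve to one hypothetical canonical ID.
crosswalk = {"5000": "IL-5000", "76781": "IL-5000"}
rows = [
    {"source_id": 5000, "metric": "metric_a", "value": 1.0},
    {"source_id": "76781", "metric": "metric_b", "value": 2.0},
    {"source_id": "99999", "metric": "metric_a", "value": 3.0},
]
matched, unmatched = match_regions(rows, crosswalk)
print(len(matched), "matched;", len(unmatched), "need review")  # 2 matched; 1 need review
```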

ℹ️ We don't add an index of our own

SkyMind does not compute a composite score, ranking or rating. Those are our opinions about how to weigh things, and our position is to keep our opinions out of your data. When a single recognisable comparison is useful — for example to colour a map — we use an official figure published by the relevant statistics office (Israel's CBS socio-economic cluster, Eurostat GDP per capita, and so on), clearly labelled as theirs, not ours. The judgement stays with you.

4. Data integrity, no fabricated values

Every figure is the official source value, unchanged. Two rules make the dataset honest:

No neutral defaults. If a region-year genuinely has no published figure for a metric, the cell is left empty — we never fill it with a placeholder.
No carry-forward. We never copy a prior year's value into a year that has no real data, and we never estimate or interpolate to make coverage look fuller.

This means our coverage looks smaller than a "fill every cell" approach would, by design. An empty cell is more useful than a fabricated one.
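
The two rules above amount to laying out a series by year and leaving unpublished years empty. A minimal sketch, with hypothetical function names and values:

```python
def year_series(observations, start, end):
    """Lay out published values by year; unpublished years stay None.

    No default is substituted and no earlier value is carried forward.
    """
    published = dict(observations)
    return {year: published.get(year) for year in range(start, end + 1)}

def coverage(series):
    """Share of years in the series that have a published value."""
    filled = sum(value is not None for value in series.values())
    return filled / len(series)

# Hypothetical metric with no published figure for 2021.
series = year_series([(2019, 10.2), (2020, 10.4), (2022, 10.9)], 2019, 2022)
print(series)            # {2019: 10.2, 2020: 10.4, 2021: None, 2022: 10.9}
print(coverage(series))  # 0.75
```

A carry-forward would have reported 10.4 for 2021 and a coverage of 1.0; here the gap stays visible and the coverage figure stays honest.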

The four quality dimensions, in standard vocabulary

Data quality has a canonical vocabulary — four dimensions that recur across every major framework, from ISO 8000 to the IMF's Data Quality Assessment Framework to the European Statistics Code of Practice. We use it, and here is what each dimension means operationally at SkyMind:

Accuracy. Every figure is traceable to a named source publication and stored at the source's own precision — no rounding at storage, because rounding manufactures false ties and breaks verbatim verification. Aggregates are reconciled against components; impossible values (negative counts, shares above 100%) are unconditional load failures. Trace any number yourself at provenance.
Completeness. Missing values are identified and classified, never filled. Gaps in the source stay gaps — see the rules above and the coverage notes in the quality report.
Consistency. One schema across periods, units and countries; documented units and metric polarity; duplicates resolved against the primary source, never averaged away; schema changes dated and reasoned in the revision ledger.
Timeliness. Each dataset's currency and expected next upstream release are published in the data currency table; reference years come from the source's own labeling, never inferred from publication dates.
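
The accuracy rules (source precision kept, impossible values rejected) can be sketched with Python's decimal module, which preserves the digits as published. The function name, the two metric kinds and the error class are assumptions for illustration, not our load gate's real interface.

```python
from decimal import Decimal, InvalidOperation

class LoadError(ValueError):
    """Raised when a value must stop the load."""

def check_value(kind, raw):
    """Parse a published value verbatim and reject impossible values.

    kind is "count" or "share_pct" (hypothetical labels). The result is a
    Decimal, so "12.50" is stored as 12.50 and not rounded or shortened.
    """
    try:
        value = Decimal(raw.strip())
    except InvalidOperation:
        raise LoadError(f"not a number: {raw!r}") from None
    if not value.is_finite():
        raise LoadError(f"non-finite value: {raw!r}")
    if kind == "count" and value < 0:
        raise LoadError(f"negative count: {raw!r}")
    if kind == "share_pct" and not (Decimal(0) <= value <= Decimal(100)):
        raise LoadError(f"share outside 0-100%: {raw!r}")
    return value

print(check_value("share_pct", "12.50"))  # 12.50
try:
    check_value("share_pct", "100.1")
except LoadError as err:
    print("load refused:", err)
```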

European official statistics phrase the same discipline as five criteria — relevance, accuracy, timeliness and punctuality, accessibility and clarity, comparability and coherence (Regulation (EC) No 223/2009, Art. 12) — and our point-by-point mapping to the European Statistics Code of Practice is on standards & conformance. If you believe any published value fails these standards, dispute it — resolutions are public.

5. Honest limitations

It does not predict. No event timing, no crash calls, no election outcomes. It describes what the published statistics currently show.
It does not explain causes. The data tells you a region's official figures, not why they are what they are.
Coverage is uneven. Compare within a country and period with confidence; be careful comparing absolute levels across countries with different source datasets.
The data is periodic. Most sources are annual. A short, sharp shock between releases will not show up until the next data point.
It does not replace domain expertise. A real-estate analyst or municipal economist reads these numbers in context; without that context a number is just a number.

6. FAQ

Isn't this just aggregating CSV files?

The aggregation is the hard part, and we don't pretend otherwise. Government data for these forty-seven countries is fragmented across dozens of portals, in multiple languages, with inconsistent identifiers and frequently broken APIs. Cleaning it, geocoding it, translating metric names, reconciling schemas and keeping it current is months of unglamorous work, and it is genuinely hard to reproduce. That labour, and the coverage breadth, is the product. We are not selling a model on top of it.

Can I verify any single number?

Yes — that is the point. The API and the provenance view expose every figure with its source, dataset and original published value. Pick any region from /map/data, follow it back to the national statistics office, and it will match. No login, no API key (rate-limited).

Do you rank regions or score them yourselves?

No. We deliberately don't add a composite, ranking or rating of our own — that would be our opinion, not official data. We deliver the official figures, made comparable and traceable. Any weighting, ranking or judgement on top is yours to make, with your own assumptions. Where a single comparison is shown (e.g. on the map) it is an official figure from the statistics office, labelled as theirs.

Do you predict prices, crises or migration?

No. We made a deliberate decision not to be in the prediction business. SkyMind shows what the official data currently says about a region. Any forecasting on top of that is the user's call, with their own assumptions.

If you find an error, however small, in the data, the sourcing or the dataset construction, email us at info@sky-mind.com. We post methodological corrections publicly with credit to whoever found them.