Pipeline overview
- Ingest. A scheduled worker (node-cron, in-container) pulls each configured source on its declared cadence — weekly for CDC NNDSS, every 15 minutes for Google News and GDELT, etc.
- Normalise. Each ingestor produces a uniform Signal shape: id, source, sourceCode, category, rank, title, summary, url, language, countryIso2, publishedAt, ingestedAt.
- Resolve. News URLs are de-redirected (e.g. resolving Google News redirector links to canonical publisher URLs); language is detected; country is inferred from the source-feed metadata or text.
- De-duplicate. Signals are keyed by canonical-URL hash; duplicates are merged with first-seen publishedAt retained.
- Classify. Signals are tagged by category (official / news / surveillance / advisory) and rank (1 / 2 / 3) based on source.
- Snapshot. A single JSON snapshot is written to a Docker volume (runtime-data/snapshot.json). The Astro server reads this file on each API request and caches with short TTL.
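As an illustration of the Normalise and De-duplicate steps, here is a minimal TypeScript sketch. The field names come from the Signal shape listed above. The field types, the choice of SHA-256, and the helper names urlKey and dedupe are assumptions made for this example, and the text does not say how fields other than publishedAt are merged, so the sketch simply keeps the first-seen record.

```typescript
import { createHash } from "node:crypto";

// Field names follow the Signal shape above; the types are assumptions.
interface Signal {
  id: string;
  source: string;
  sourceCode: string;
  category: "official" | "news" | "surveillance" | "advisory";
  rank: 1 | 2 | 3;
  title: string;
  summary: string;
  url: string; // assumed to hold the canonical URL after the Resolve step
  language: string;
  countryIso2: string | null;
  publishedAt: string; // assumed ISO 8601 timestamp
  ingestedAt: string; // assumed ISO 8601 timestamp
}

// Hypothetical helper: SHA-256 is an assumed choice of hash function.
function urlKey(canonicalUrl: string): string {
  return createHash("sha256").update(canonicalUrl).digest("hex");
}

// Keeps one signal per canonical-URL hash. The first-seen record wins,
// so its publishedAt is retained; later duplicates are dropped.
function dedupe(signals: Signal[]): Signal[] {
  const byKey = new Map<string, Signal>();
  for (const signal of signals) {
    const key = urlKey(signal.url);
    if (!byKey.has(key)) {
      byKey.set(key, signal);
    }
  }
  return Array.from(byKey.values());
}
```

Because a Map preserves insertion order, the output keeps signals in the order they were first seen.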
Country attribution
A signal is attributed to a country when:
- The source is country-scoped (e.g. GNEWS-ES-AR → Argentina), or
- The source explicitly tags the country in structured fields (CDC NNDSS state column → US; WHO DON country attribute), or
- The article title contains an unambiguous country name match against our country dictionary.
We deliberately do not attempt fine-grained named-entity recognition for country attribution from free text. The risk of misattribution outweighs the coverage gain.
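The three attribution rules can be sketched roughly as follows. This is an illustration only: the lookup tables SOURCE_COUNTRY and COUNTRY_NAMES, the AttributionInput type, and the function names are hypothetical, and checking the rules in the order listed is an assumption. A title match counts as unambiguous here only when exactly one country in the dictionary matches on word boundaries.

```typescript
// Hypothetical lookup tables; the real source list and country dictionary
// are not shown in the text.
const SOURCE_COUNTRY: Record<string, string> = { "GNEWS-ES-AR": "AR" };
const COUNTRY_NAMES: Record<string, string> = {
  Argentina: "AR",
  Guinea: "GN",
  "Papua New Guinea": "PG",
  Niger: "NE",
  Nigeria: "NG",
};

interface AttributionInput {
  sourceCode: string;
  structuredCountryIso2?: string; // e.g. taken from a structured country field
  title: string;
}

// Escapes regular-expression metacharacters in a literal string.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Returns an ISO 3166-1 alpha-2 code, or null when no rule applies.
function attributeCountry(input: AttributionInput): string | null {
  // Rule 1: the source itself is scoped to one country.
  const scoped = SOURCE_COUNTRY[input.sourceCode];
  if (scoped) return scoped;

  // Rule 2: the source tags the country in a structured field.
  if (input.structuredCountryIso2) return input.structuredCountryIso2;

  // Rule 3: the title names exactly one country from the dictionary.
  const matches = new Set<string>();
  for (const [name, iso2] of Object.entries(COUNTRY_NAMES)) {
    const pattern = new RegExp(`\\b${escapeRegExp(name)}\\b`, "i");
    if (pattern.test(input.title)) matches.add(iso2);
  }
  if (matches.size !== 1) return null; // none, or ambiguous: do not guess
  return Array.from(matches)[0] ?? null;
}
```

With this dictionary, a title mentioning "Papua New Guinea" also matches "Guinea", so the sketch returns null rather than risk a misattribution.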
Country-level classification
The pin colour on the map encodes a level, not a count:
- Local — case, death, or active outbreak signal in the country.
- Imported — case present but exposure occurred elsewhere (returnee, repatriation, treatment).
- Response — only travel advisory, screening or quarantine policy signals; no local case.
- Inactive — no signals in the rolling 30-day window.
Levels are assigned by deterministic rules from the categorised signals; we do not use ML for this classification.
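A rule-based assignment of this kind might look like the sketch below. It is not the actual rule set: the per-signal evidence tag, the precedence Local over Imported over Response, and the use of publishedAt for the 30-day window are all assumptions made for illustration.

```typescript
type Level = "local" | "imported" | "response" | "inactive";

// Hypothetical per-signal tag; the text does not say how a signal is
// marked as local, imported, or response-only.
type Evidence = "local" | "imported" | "response";

interface LevelSignal {
  evidence: Evidence;
  publishedAt: string; // assumed ISO 8601 timestamp
}

const WINDOW_MS = 30 * 24 * 60 * 60 * 1000; // rolling 30-day window

// Deterministic rules, no ML: the strongest evidence in the window wins.
function countryLevel(signals: LevelSignal[], now: Date): Level {
  const cutoff = now.getTime() - WINDOW_MS;
  const recent = signals.filter((s) => Date.parse(s.publishedAt) >= cutoff);
  const kinds = new Set(recent.map((s) => s.evidence));
  if (kinds.has("local")) return "local";
  if (kinds.has("imported")) return "imported";
  if (kinds.has("response")) return "response";
  return "inactive"; // no signals inside the window
}
```

Passing now as a parameter keeps the function pure, so the same signals and timestamp always yield the same level.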
Freshness
- Fresh — last successful ingestion ≤ 60 min ago.
- Stale — 60 min – 6 h since last ingestion.
- Unknown — no successful ingestion yet.
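The thresholds above translate into a small function, sketched here under stated assumptions: the function name and signature are hypothetical, and because the text does not define a state for ingestions older than 6 h, the sketch returns null in that case rather than inventing one.

```typescript
type Freshness = "fresh" | "stale" | "unknown";

const MINUTE_MS = 60 * 1000;
const HOUR_MS = 60 * MINUTE_MS;

// lastSuccess is null when no ingestion has succeeded yet.
function freshness(lastSuccess: Date | null, now: Date): Freshness | null {
  if (lastSuccess === null) return "unknown";
  const ageMs = now.getTime() - lastSuccess.getTime();
  if (ageMs <= 60 * MINUTE_MS) return "fresh";
  if (ageMs <= 6 * HOUR_MS) return "stale";
  return null; // older than 6 h: not defined in the text above
}
```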
What we don't do
- We do not estimate or extrapolate case counts. Numbers shown reflect signal mentions, not confirmed cases. Confirmed cases come from official surveillance feeds and are clearly labelled.
- We do not use generative-AI summaries for headlines. We display the publisher's title verbatim.
- We do not store article body text. We link to the source.
- We do not track users.
Open data
The complete signal feed is available at /api/signals.json under CC BY 4.0. Country-level summary at /api/countries.json. Source health at /api/health.json. RSS at /feed.xml.
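A consumer could read these endpoints along the following lines. The base URL is a placeholder, the helper names are hypothetical, and the response bodies are treated as untyped JSON because their exact schema is not documented here.

```typescript
// Placeholder origin; replace with the real site address.
const BASE_URL = "https://example.org";

// Paths as listed above.
const ENDPOINTS = {
  signals: "/api/signals.json",
  countries: "/api/countries.json",
  health: "/api/health.json",
} as const;

type EndpointName = keyof typeof ENDPOINTS;

// Builds the absolute URL for one of the JSON endpoints.
function endpointUrl(name: EndpointName, baseUrl: string = BASE_URL): string {
  return new URL(ENDPOINTS[name], baseUrl).href;
}

// Fetches an endpoint and returns the parsed JSON without assuming a schema.
async function fetchEndpoint(name: EndpointName, baseUrl: string = BASE_URL): Promise<unknown> {
  const response = await fetch(endpointUrl(name, baseUrl));
  if (!response.ok) {
    throw new Error(`Request for ${name} failed with HTTP ${response.status}`);
  }
  return response.json();
}
```

Callers should validate the returned value before use, and reuse of the signal feed needs attribution under CC BY 4.0.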