Hantaflow

Methodology

The pipeline from authoritative source to live map pin. Designed to be auditable: every signal is sourced, every source is listed, and the code is open.

What is a "signal"?

A signal is one credible mention of hantavirus activity from one of our vetted sources. It is the unit of data on the map, in the stats bar, in the per-country pages, and in the JSON API. Each signal has a source, a country, a language, a timestamp and a link to the original publication.

What a signal is depends on the source:

Outside the US, signals are not confirmed case counts. We deliberately do not estimate, extrapolate, or sum signals to imply cases. The unit is "vetted mention in the last 30 days," not "patient diagnosed." Confirmed counts appear only where an official surveillance feed provides them (currently the US only, via NNDSS), and they are clearly labelled with the CDC NNDSS source tag.
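To make the "vetted mention in the last 30 days" unit concrete, here is a minimal sketch of a trailing-window count. The `DatedSignal` type and `countRecentMentions` function are hypothetical names, and windowing on `publishedAt` (rather than `ingestedAt`) is an assumption; the text does not say which timestamp is used.

```typescript
// Hypothetical minimal shape: only the field this sketch needs.
interface DatedSignal {
  publishedAt: string; // assumed to be an ISO 8601 timestamp
}

const WINDOW_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Count mentions published in the 30 days up to `now`.
// The result is a number of mentions, never an estimate of cases.
function countRecentMentions(signals: DatedSignal[], now: Date): number {
  const end = now.getTime();
  const start = end - WINDOW_DAYS * MS_PER_DAY;
  return signals.filter((s) => {
    const t = Date.parse(s.publishedAt);
    // Unparseable timestamps are skipped rather than guessed at.
    return !isNaN(t) && t >= start && t <= end;
  }).length;
}
```

Passing `now` in explicitly keeps the function deterministic, which makes the window easy to audit against a stored snapshot.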

Country-level surveillance feeds for the rest of the world (RKI in Germany, Rospotrebnadzor in Russia, KDCA in South Korea, etc.) are generally not available as machine-readable APIs. RKI publishes a weekly Wochenbericht as a PDF; we link it as a citation on country pages but do not currently ingest it. Adding country-specific surveillance scrapers is on the roadmap.

Pipeline overview

  1. Ingest. A scheduled worker (node-cron, in-container) pulls each configured source on its declared cadence: weekly for CDC NNDSS, every 15 minutes for Google News and GDELT, etc.
  2. Normalise. Each ingestor produces a uniform Signal shape: id, source, sourceCode, category, rank, title, summary, url, language, countryIso2, publishedAt, ingestedAt.
  3. Resolve. News URLs are de-redirected (e.g. resolving Google News redirector links to canonical publisher URLs); language is detected; country is inferred from the source-feed metadata or text.
  4. De-duplicate. Signals are keyed by canonical-URL hash; duplicates are merged with first-seen publishedAt retained.
  5. Classify. Signals are tagged by category (official / news / surveillance / advisory) and rank (1 / 2 / 3) based on source.
  6. Snapshot. A single JSON snapshot is written to a Docker volume (runtime-data/snapshot.json). The Astro server reads this file on each API request and caches it with a short TTL.
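The Normalise and De-duplicate steps above can be sketched as follows. The field names come from the Signal shape listed in step 2, but every type is an assumption. The `urlKey` hash is a stand-in (djb2) because the actual hash is not specified, and the merge policy (fill an empty summary from the duplicate) is hypothetical. How per-country copies of one article interact with the URL key is also not specified, so the sketch treats signals as one per URL.

```typescript
// Field names follow the Signal shape in step 2; the types are assumptions.
interface Signal {
  id: string;
  source: string;
  sourceCode: string;
  category: "official" | "news" | "surveillance" | "advisory";
  rank: 1 | 2 | 3;
  title: string;
  summary: string;
  url: string; // canonical URL, after de-redirection
  language: string;
  countryIso2: string | null;
  publishedAt: string; // assumed ISO 8601
  ingestedAt: string; // assumed ISO 8601
}

// Stand-in 32-bit djb2 hash of the canonical URL (assumption: the real
// pipeline may use a different, collision-resistant hash).
function urlKey(canonicalUrl: string): string {
  let h = 5381;
  for (let i = 0; i < canonicalUrl.length; i++) {
    h = ((h << 5) + h + canonicalUrl.charCodeAt(i)) >>> 0;
  }
  return h.toString(16);
}

// Merge duplicates by URL key. The first-seen record wins, so its
// publishedAt is retained; only an empty summary is filled from a duplicate.
function dedupe(signals: Signal[]): Signal[] {
  const byKey = new Map<string, Signal>();
  for (const incoming of signals) {
    const key = urlKey(incoming.url);
    const seen = byKey.get(key);
    if (seen === undefined) {
      byKey.set(key, incoming);
      continue;
    }
    byKey.set(key, {
      ...seen,
      summary: seen.summary !== "" ? seen.summary : incoming.summary,
    });
  }
  return Array.from(byKey.values());
}
```

A `Map` preserves insertion order, so the output keeps signals in first-seen order.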

Country attribution

A signal is attributed to a country using a three-tier hierarchy. This is the same pattern used by serious news-based surveillance systems (HealthMap, ProMED, GPHIN, EIOS): read the article, not the publisher's home address.

  1. Tier 1, source-authoritative. CDC NNDSS, WHO Disease Outbreak News, ECDC and PAHO publish structured country fields with their data. These are trusted as-is and tagged attributionMethod: "source-authoritative".
  2. Tier 2, content match. For news articles, we scan the title (highest weight) and summary against a multilingual country-name gazetteer derived from Unicode CLDR (via i18n-iso-countries) plus stem overrides for inflected languages (Russian, Polish, Greek, Turkish). Matches use word boundaries and longest-name-first ordering so that "Georgia, United States" is pinned to the US, not to Georgia. Strain and virus names that contain place words (Sin Nombre, Andes, Seoul, Hantaan, Puumala, Dobrava, Choclo, Laguna Negra) are stripped before matching. Tagged attributionMethod: "title-match" or "summary-match".
  3. Tier 3, unattributed. If no country name appears in title or summary, the signal is kept in the global and per-language feeds but contributes to no country's pin. Tagged attributionMethod: "unattributed". The feed's geo-target is never used as a primary attribution. A Portuguese article from Portugal's Google News feed that talks about an Argentine outbreak attributes to Argentina, not Portugal.
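A minimal sketch of the Tier 2 and Tier 3 logic, under stated assumptions. The tiny `GAZETTEER` is illustrative only (the real one is derived from CLDR). "Title outranks summary" is modelled here as "use the summary only when the title yields nothing", which is one possible reading of "highest weight". The US-state disambiguation for Georgia is not modelled. All function names are hypothetical.

```typescript
type AttributionMethod = "title-match" | "summary-match" | "unattributed";

interface Attribution {
  countryIso2: string | null;
  method: AttributionMethod;
}

// Illustrative gazetteer entries: [name, ISO 3166-1 alpha-2 code].
const GAZETTEER: Array<[string, string]> = [
  ["Papua New Guinea", "PG"],
  ["Guinea", "GN"],
  ["Argentina", "AR"],
  ["Chile", "CL"],
  ["South Korea", "KR"],
  ["Nigeria", "NG"],
  ["Niger", "NE"],
];

// Strain and virus names listed in the text above.
const STRAIN_NAMES = [
  "Sin Nombre", "Andes", "Seoul", "Hantaan",
  "Puumala", "Dobrava", "Choclo", "Laguna Negra",
];

function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Whole-word, case-insensitive pattern with Unicode-aware boundaries.
function wordPattern(phrase: string): RegExp {
  return new RegExp(
    "(?<![\\p{L}\\p{N}])" + escapeRegExp(phrase) + "(?![\\p{L}\\p{N}])",
    "giu"
  );
}

function stripStrains(text: string): string {
  let out = text;
  for (const name of STRAIN_NAMES) out = out.replace(wordPattern(name), " ");
  return out;
}

// Longest names first; each matched span is blanked out so a shorter
// name contained in it (e.g. "Guinea") cannot match again.
function matchCountries(text: string): string[] {
  let remaining = stripStrains(text);
  const found: string[] = [];
  const entries = GAZETTEER.slice().sort((a, b) => b[0].length - a[0].length);
  for (const [name, iso2] of entries) {
    const blanked = remaining.replace(wordPattern(name), " ");
    if (blanked !== remaining) {
      if (found.indexOf(iso2) === -1) found.push(iso2);
      remaining = blanked;
    }
  }
  return found;
}

function attribute(title: string, summary: string): Attribution[] {
  const fromTitle = matchCountries(title);
  if (fromTitle.length > 0) {
    return fromTitle.map((c): Attribution => ({ countryIso2: c, method: "title-match" }));
  }
  const fromSummary = matchCountries(summary);
  if (fromSummary.length > 0) {
    return fromSummary.map((c): Attribution => ({ countryIso2: c, method: "summary-match" }));
  }
  // Tier 3: kept in global feeds, contributes to no country's pin.
  return [{ countryIso2: null, method: "unattributed" }];
}
```

The feed's geo-target never appears as an input to `attribute`, which mirrors the content-not-publisher rule.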

A single article can attribute to multiple countries when it explicitly mentions several (e.g. "Argentina and Chile outbreaks" emits two signals, one per country). Each per-country signal carries the same source, title and URL.
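The fan-out described here can be sketched as below. The `ArticleSignal` type and `fanOut` function are hypothetical and carry only the fields the paragraph mentions; how per-country ids are assigned is not stated, so the sketch omits ids.

```typescript
// Hypothetical reduced shape: the fields shared across per-country signals.
interface ArticleSignal {
  source: string;
  title: string;
  url: string;
  countryIso2: string | null;
}

// Emit one signal per attributed country; each copy carries the same
// source, title and URL. With no countries, a single unattributed
// signal is emitted instead.
function fanOut(
  article: Omit<ArticleSignal, "countryIso2">,
  countries: string[]
): ArticleSignal[] {
  // Drop repeated codes so a country never receives two copies.
  const unique = countries.filter((c, i) => countries.indexOf(c) === i);
  if (unique.length === 0) return [{ ...article, countryIso2: null }];
  return unique.map((c) => ({ ...article, countryIso2: c }));
}
```

For example, calling `fanOut` with the countries `["AR", "CL"]` for an article titled "Argentina and Chile outbreaks" returns two signals that differ only in `countryIso2`.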

Known limitations. News-based attribution caps out at roughly 80% accuracy in published evaluations (HealthMap, Freifeld et al. 2008). Two cases produce false negatives: very small countries that the gazetteer omits, and articles about cities or regions that never name the country. False positives are minimised by the strain-name stripping, US-state disambiguation for Georgia, and the content-not-publisher rule. The breakdown by attribution method for any country is exposed at /api/countries/<slug>.json under stats.attribution so consumers can judge data quality.
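One way a consumer might use that breakdown is to turn it into per-method shares. This is a sketch only: the shape of stats.attribution is not documented here, so the code assumes it is an object of counts keyed by attributionMethod. `AttributionBreakdown` and `methodShares` are hypothetical names.

```typescript
// Assumed shape of stats.attribution: counts keyed by attribution method,
// e.g. { "source-authoritative": 2, "title-match": 6, "summary-match": 2 }.
type AttributionBreakdown = { [method: string]: number };

// Convert raw counts into fractions of the country's total signals.
function methodShares(breakdown: AttributionBreakdown): { [method: string]: number } {
  const methods = Object.keys(breakdown);
  const total = methods.reduce((sum, m) => sum + breakdown[m], 0);
  const shares: { [method: string]: number } = {};
  if (total === 0) return shares; // no signals, nothing to judge
  for (const m of methods) shares[m] = breakdown[m] / total;
  return shares;
}
```

A country whose signals are mostly "summary-match" rests on weaker evidence than one dominated by "source-authoritative", so a consumer could choose to display or filter on these shares.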

Country-level classification

The pin colour on the map encodes a level, not a count:

Levels are assigned by deterministic rules from the categorised signals; we do not use ML for this classification.

Freshness

What we don't do

Open data

The complete signal feed is available at /api/signals.json under CC BY 4.0. Country-level summary at /api/countries.json. Source health at /api/health.json. RSS at /feed.xml.