Methodology

In short

From source to cautious insight

1. Collect sources

The system indexes public pages, reports, news, and project information from configured sources.

2. Read content

Text is extracted deterministically. JavaScript-heavy pages may be rendered first. No screenshots, OCR, or AI summaries are used.

3. Connect cautiously

Controlled terms, domain anchors, and source fragments determine which relationships become visible.

Manual start, algorithmic growth

What was manually entered and what was algorithmically discovered?

Manually entered as a starting point

A small set of organisations as corpus seed, domain-anchor definitions, a few dozen source feed URLs, and a small number of initiative and context seeds.

Algorithmically discovered

Most organisation profiles, semantic relationships, anchor matches, text citations, collaboration relations, geography detections, and archetype suggestions are the result of crawling and deterministic analysis, not manual entry.

No generative AI

No LLM, no embeddings, no semantic reasoning by language models. All relationships are reproducible from source fragments and controlled vocabulary.

Evolving knowledge graph

The corpus grows step by step. Relationships that are thin or missing now may become stronger as more sources are indexed. This observatory is a process, not a finished picture.

Methodology

Artifacts, anchors, evidence, and uncertainty

Artifacts

An artifact is an indexed source record: for example a page, article, report, or project description.

Domain anchors

Domain anchors are controlled concepts such as digital literacy, educational AI, or privacy and security. Dutch and English terms can point to the same anchor.

Evidence

Relationships only become stronger when they can be traced back to source fragments. Seed-only relationships are treated with caution.

Confidence and publication confidence

Internal patterns can look strong, but public interpretation is dampened when the evidence base is thin, young, or overlapping.
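As an illustration only, the dampening described above could be sketched as follows. The thresholds, the multiplicative factors, and the names `EvidenceBase` and `publication_confidence` are assumptions made for this sketch, not the observatory's actual values.

```python
from dataclasses import dataclass


@dataclass
class EvidenceBase:
    source_count: int   # number of indexed sources behind the pattern
    span_days: int      # time between the oldest and newest source
    overlapping: bool   # True when similar profile patterns cannot be told apart


def publication_confidence(internal: float, evidence: EvidenceBase,
                           min_sources: int = 5, min_span_days: int = 90) -> float:
    """Dampen an internal score when the evidence is thin, young, or overlapping."""
    factor = 1.0
    if evidence.source_count < min_sources:      # thin evidence base
        factor *= evidence.source_count / min_sources
    if evidence.span_days < min_span_days:       # young evidence base
        factor *= 0.5
    if evidence.overlapping:                     # overlapping profile
        factor *= 0.5
    return round(internal * factor, 3)
```

With these assumed settings, a strong internal score backed by two sources spanning a month and an overlapping profile is reduced to roughly a tenth of its value, while a well-supported score passes through unchanged.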

Uncertainty

Legend for cautious reading

Manually entered

Information manually entered as a starting point: seed organisations, domain-anchor definitions, and source feed URLs. It forms the skeleton, not the conclusions.

Publication confidence

A dampened public confidence score. Low scores remain visible internally but are not presented publicly as strong interpretations.

Evidence base

The amount and spread of indexed sources behind an observation.

Young evidence base

There is evidence, but little repetition or temporal spread yet.

Overlapping profile

Multiple profile patterns look similar and are not yet empirically distinguishable.

Insufficient evidence

There is too little usable evidence for public interpretation.

Seed-only relation

A starting link or editorial assumption without sufficient observed source evidence.

Accepted relation

A relation confirmed by explicit wording in source fragments. Higher trust than seed-only or co-occurrence relations.

Pending review

An algorithmically detected relation or observation waiting for human confirmation before it counts as accepted.

Geographically detected

Geographic location detected through deterministic text matching. May be a city, province, or region name. Less certain than a manual seed.

Observed relation

A relation supported by indexed source fragments but without explicit partnership wording.

Technical outline

No generative AI, but explainable rules

Deterministic extraction

HTML is cleaned with fixed rules and converted into usable text. Rendered acquisition stores rendered_html separately from fetched_html.
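A minimal sketch of rule-based extraction, using only Python's standard `html.parser`. The list of skipped tags, the `Acquisition` class, and the choice to prefer `rendered_html` when it exists are assumptions for illustration. Only the field names `fetched_html` and `rendered_html` come from the description above.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Optional

# Assumed rule set: elements whose content is never treated as page text.
SKIP_TAGS = {"script", "style", "noscript", "nav", "footer"}


class _TextExtractor(HTMLParser):
    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = " ".join(data.split())  # collapse runs of whitespace
            if text:
                self._chunks.append(text)

    def text(self) -> str:
        return "\n".join(self._chunks)


def extract_text(html: str) -> str:
    """Same input always yields the same output: fixed rules, no model."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


@dataclass
class Acquisition:
    fetched_html: str                     # HTML as returned by the server
    rendered_html: Optional[str] = None   # HTML after rendering, stored separately


def extract_from(acq: Acquisition) -> str:
    # Assumption: the rendered version is preferred when one was stored.
    source = acq.rendered_html if acq.rendered_html is not None else acq.fetched_html
    return extract_text(source)
```

Keeping both HTML versions means an extraction can be rerun later against either one and compared.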

Quality score

Low text quality, empty extraction, and boilerplate are detected so weak sources carry less weight.
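One way such a score could be computed is sketched below. The boilerplate phrases, the 50-word minimum, and the formula itself are hypothetical. The sketch only shows that empty, very short, or boilerplate-heavy text can be weighted down with plain rules.

```python
# Hypothetical phrases that typically indicate boilerplate rather than content.
BOILERPLATE_PHRASES = ("cookie", "accept all", "all rights reserved", "newsletter")


def quality_score(text: str, min_words: int = 50) -> float:
    """Return a score between 0.0 (unusable) and 1.0 (long, clean text)."""
    words = text.split()
    if not words:                      # empty extraction
        return 0.0
    length_factor = min(len(words) / min_words, 1.0)
    lines = [ln for ln in text.splitlines() if ln.strip()]
    boilerplate_lines = sum(
        1 for ln in lines if any(p in ln.lower() for p in BOILERPLATE_PHRASES)
    )
    boilerplate_ratio = boilerplate_lines / len(lines)
    return round(length_factor * (1.0 - boilerplate_ratio), 3)
```

A downstream step could then multiply a source's contribution by this score, so weak sources count for less without being discarded.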

Controlled vocabulary

Terms, synonyms, context words, and negative terms determine anchor recognition. No embeddings are used.
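The following sketch shows how terms, context words, and negative terms might interact. The `Anchor` structure, the example vocabulary, the rule that a negative term suppresses the match, and the confidence formula are all assumptions for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Anchor:
    key: str
    terms: Tuple[str, ...]                  # terms and synonyms, Dutch or English
    context_words: Tuple[str, ...] = ()     # words that strengthen a hit
    negative_terms: Tuple[str, ...] = ()    # words that rule a hit out


def _contains(lowered_text: str, phrase: str) -> bool:
    # Whole-word, case-insensitive match on already lowercased text.
    return re.search(r"\b" + re.escape(phrase.lower()) + r"\b", lowered_text) is not None


def match_anchor(text: str, anchor: Anchor) -> Optional[dict]:
    lowered = text.lower()
    if any(_contains(lowered, n) for n in anchor.negative_terms):
        return None
    hits = [t for t in anchor.terms if _contains(lowered, t)]
    if not hits:
        return None
    context = [c for c in anchor.context_words if _contains(lowered, c)]
    # Assumed formula: a base of 0.5, plus 0.25 per context word, capped at 1.0.
    confidence = min(1.0, 0.5 + 0.25 * len(context))
    return {"anchor": anchor.key, "terms": hits, "context": context,
            "confidence": confidence}


DIGITAL_LITERACY = Anchor(
    key="digital_literacy",
    terms=("digital literacy", "digitale geletterdheid"),
    context_words=("school", "onderwijs"),
    negative_terms=("vacancy",),
)
```

Because every match records which terms and context words fired, the result can be traced back to the text that produced it.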

Relation evidence

Relations get explanations, status, and evidence snippets. Seed-only is not the same as observed.

Technical outline

The processing pipeline step by step

Crawl sources

Configured source feeds are visited periodically. Discovered URLs are tracked as discovered_urls and go through an approval step.
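A small in-memory sketch of tracking discovered URLs with an approval step. Only the name `discovered_urls` comes from the text. The class, the status values, and the normalisation rules are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def normalise(url: str) -> str:
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


class DiscoveredUrls:
    def __init__(self) -> None:
        self._status = {}  # normalised URL -> "pending" | "approved" | "rejected"

    def discover(self, url: str) -> str:
        key = normalise(url)
        self._status.setdefault(key, "pending")  # never resets an earlier decision
        return key

    def approve(self, url: str) -> None:
        self._status[normalise(url)] = "approved"

    def reject(self, url: str) -> None:
        self._status[normalise(url)] = "rejected"

    def crawlable(self) -> list:
        """Only approved URLs are handed to the crawler."""
        return sorted(u for u, s in self._status.items() if s == "approved")
```

Normalising before storage keeps the same page from being queued twice under slightly different URLs.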

Extract content

HTML is deterministically cleaned into usable text. JavaScript-heavy pages are rendered first. A quality score determines whether the text is usable.

Score artifacts

Relevance and quality are computed from text content, source trust level, and structural features. No ML models.

Match domain anchors

Controlled vocabulary determines anchor matching. Confidence per match is based on hit context.

Build relation evidence

Collaboration relations are detected from explicit wording in the text. Evidence is stored as snippets and phrase types.
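A hedged sketch of phrase-based relation detection. The phrase list, the phrase type names, the status labels, and the placeholder organisation names are assumptions. The sketch only distinguishes explicit partnership wording from mere co-mention in a fragment.

```python
import re
from typing import Optional

# Hypothetical phrase types with Dutch and English wording.
PHRASE_TYPES = {
    "partnership": re.compile(
        r"\b(?:in partnership with|partners? with|in samenwerking met)\b", re.I),
    "collaboration": re.compile(
        r"\b(?:collaborates? with|works together with|werkt samen met)\b", re.I),
}


def find_relation_evidence(fragment: str, org_a: str, org_b: str) -> Optional[dict]:
    """Return evidence for a relation between two organisations, or None."""
    lowered = fragment.lower()
    if org_a.lower() not in lowered or org_b.lower() not in lowered:
        return None  # both organisations must appear in the same fragment
    snippet = fragment.strip()
    for phrase_type, pattern in PHRASE_TYPES.items():
        if pattern.search(fragment):
            return {"status": "explicit", "phrase_type": phrase_type, "snippet": snippet}
    # Co-mention without partnership wording: weaker, observed only.
    return {"status": "observed", "phrase_type": None, "snippet": snippet}
```

Storing the snippet next to the phrase type lets a reviewer see exactly which wording led to the relation.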

Organisation profiles

Anchor matches are aggregated into a profile per organisation. Archetype suggestions are computed from profile patterns. Publication confidence dampens weak patterns.
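Aggregation per organisation could look roughly like this. The input shape (organisation, anchor, confidence), the placeholder organisation names, and the normalisation into shares are assumptions for illustration.

```python
from collections import defaultdict


def build_profiles(matches):
    """Sum match confidence per anchor, then normalise to shares per organisation.

    `matches` is an iterable of (organisation, anchor, confidence) tuples.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for organisation, anchor, confidence in matches:
        totals[organisation][anchor] += confidence

    profiles = {}
    for organisation, anchors in totals.items():
        total = sum(anchors.values())
        if total == 0:
            continue  # nothing usable to build a profile from
        profiles[organisation] = {a: round(v / total, 3) for a, v in anchors.items()}
    return profiles
```

Working with shares rather than raw counts keeps organisations with many indexed pages from dominating purely by volume.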

Geographic detection

Cities and provinces are detected through deterministic text matching. Manual seeds are never overwritten by algorithmic detection.
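The rule that manual seeds win over detection can be sketched as below. The tiny gazetteer, the record shape, and the function names are hypothetical. A real gazetteer would be far larger.

```python
import re
from typing import Optional

# Hypothetical gazetteer: place name -> kind.
GAZETTEER = {
    "Rotterdam": "city",
    "Eindhoven": "city",
    "Friesland": "province",
    "Zeeland": "province",
}


def detect_location(text: str) -> Optional[dict]:
    """Return the first gazetteer entry found as a whole word in the text."""
    for name, kind in GAZETTEER.items():
        if re.search(r"\b" + re.escape(name) + r"\b", text, re.IGNORECASE):
            return {"name": name, "kind": kind, "source": "detected"}
    return None


def update_location(current: Optional[dict], text: str) -> Optional[dict]:
    """Apply detection, but never overwrite a manually seeded location."""
    if current is not None and current.get("source") == "manual":
        return current
    return detect_location(text) or current
```

Recording the `source` on each location is what makes it possible to tell a detected place from a seeded one later.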

Human review

Auto-review accepts or quarantines artifacts based on thresholds. Everything between the accept and quarantine thresholds goes to human review.
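One possible reading of this step uses two thresholds, with the band between them routed to a person. That reading, the threshold values, and the status names are all assumptions in this sketch.

```python
def auto_review(score: float, accept_at: float = 0.8,
                quarantine_below: float = 0.3) -> str:
    """Route an artifact by score: clear cases are automatic, the rest are not."""
    if score >= accept_at:
        return "accepted"
    if score < quarantine_below:
        return "quarantined"
    return "human_review"  # uncertain band between the two thresholds
```

Under these assumed values, a score of 0.5 would wait for a human decision, while 0.9 and 0.1 would be handled automatically.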

Limitations

What you should not infer from this site

This site does not say who is important, leading, or influential. Coverage is incomplete, source selection is still growing, and some pages are difficult to read automatically. Treat patterns as cautious indications within the current corpus.

How Dutch Observatory works