Methodology
Dutch Observatory explores public signals around AI, digitalisation, and education. The system collects sources, recognises themes, and connects evidence cautiously. The outputs are not rankings or judgements, but a tool for understanding developments.
In short
The system indexes public pages, reports, news, and project information from configured sources.
Text is extracted deterministically, without screenshots, OCR, or AI summaries. JavaScript-heavy pages may be rendered first.
Controlled terms, domain anchors, and source fragments determine which relationships become visible.
Manual start, algorithmic growth
Manually entered as a starting point
A small set of organisations as corpus seed, domain-anchor definitions, a few dozen source feed URLs, and a small number of initiative and context seeds.
Algorithmically discovered
Most organisation profiles, semantic relationships, anchor matches, text citations, collaboration relations, geography detections, and archetype suggestions are the result of crawling and deterministic analysis, not manual entry.
No generative AI
No LLM, no embeddings, no semantic reasoning by language models. All relationships are reproducible from source fragments and controlled vocabulary.
Evolving knowledge graph
The corpus grows step by step. Relationships that are thin or missing now may become stronger as more sources are indexed. This observatory is a process, not a finished picture.
Methodology
An artifact is an indexed source record: for example a page, article, report, or project description.
Domain anchors are controlled concepts such as digital literacy, educational AI, or privacy and security. Dutch and English terms can point to the same anchor.
Relationships only become stronger when they can be traced back to source fragments. Seed-only relationships remain cautious.
Internal patterns can look strong, but public interpretation is dampened when the evidence base is thin, young, or overlapping.
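The dampening described above can be pictured as scaling an internal score by factors for how thin, young, or overlapping the evidence is. The sketch below is only an illustration under that assumption: the function name, its parameters, and all constants are hypothetical and are not taken from the observatory's implementation.

```python
def dampened_confidence(internal_score: float, source_count: int,
                        days_spanned: int, overlap_ratio: float) -> float:
    """Scale an internal score (0..1) down when evidence is thin, young, or overlapping.

    All thresholds below are illustrative assumptions.
    """
    # Thin evidence: fewer than 10 sources reduces the score proportionally.
    volume = min(source_count / 10, 1.0)
    # Young evidence: less than roughly six months of temporal spread.
    maturity = min(days_spanned / 180, 1.0)
    # Overlapping evidence: the share of sources that repeat each other.
    independence = 1.0 - max(0.0, min(overlap_ratio, 1.0))
    return round(internal_score * volume * maturity * independence, 3)
```

With these assumed constants, a strong internal pattern backed by only five sources would be shown publicly at half its internal strength.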
Uncertainty
Information manually entered as a starting point: seed organisations, domain-anchor definitions, and source feed URLs. It forms the skeleton, not the conclusions.
A dampened public confidence score. Low scores remain visible internally but are not shown as strong public interpretation.
The amount and spread of indexed sources behind an observation.
There is evidence, but little repetition or temporal spread yet.
Multiple profile patterns look similar and are not yet empirically distinguishable.
There is too little usable evidence for public interpretation.
A starting link or editorial assumption without sufficient observed source evidence.
A relation confirmed by explicit wording in source fragments. Higher trust than seed-only or co-occurrence relations.
An algorithmically detected relation or observation waiting for human confirmation before it counts as accepted.
Geographic location detected through deterministic text matching. May be a city, province, or region name. Less certain than a manual seed.
A relation supported by indexed source fragments but without explicit partnership wording.
Technical outline
HTML is cleaned with fixed rules and converted into usable text. Rendered acquisition stores rendered_html separately from fetched_html.
Low text quality, empty extraction, and boilerplate are detected so weak sources carry less weight.
Terms, synonyms, context words, and negative terms determine anchor recognition. No embeddings are used.
Relations get explanations, status, and evidence snippets. Seed-only is not the same as observed.
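To make the distinction between seed-only and observed relations concrete, a relation record could derive its status from whether any evidence snippets are attached. This is a minimal sketch with hypothetical field and status names; the actual schema is not described in this text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Relation:
    source_org: str
    target_org: str
    explanation: str
    seeded: bool = False  # entered manually as a starting link (assumed flag)
    evidence_snippets: List[str] = field(default_factory=list)

    @property
    def status(self) -> str:
        """A relation counts as observed only when a source fragment backs it."""
        if self.evidence_snippets:
            return "observed"
        return "seed_only" if self.seeded else "unsupported"
```

Deriving the status from the evidence, rather than storing it by hand, keeps a seeded link from being reported as observed by accident.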
Crawl sources
Configured source feeds are visited periodically. Discovered URLs are tracked as discovered_urls and go through an approval step.
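As an illustration of the approval step, discovered URLs can carry a status and only approved entries reach the crawl queue. Only the name discovered_urls comes from the text above; the status values and helper function are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DiscoveredUrl:
    url: str
    status: str = "pending"  # assumed lifecycle: pending -> approved | rejected


def crawl_queue(discovered_urls: List[DiscoveredUrl]) -> List[str]:
    """Return only the URLs that passed the approval step, in discovery order."""
    return [entry.url for entry in discovered_urls if entry.status == "approved"]
```

A newly discovered URL starts as pending, so it is never crawled before someone or something approves it.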
Extract content
HTML is deterministically cleaned into usable text. JavaScript-heavy pages are rendered first. A quality score determines whether the result is usable.
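A rule-based cleaner of this kind can be sketched with Python's standard html.parser module. The set of skipped tags and the choice to prefer rendered HTML when it exists are assumptions for illustration, not the observatory's actual rules.

```python
from html.parser import HTMLParser
from typing import List, Optional

# Assumed fixed rule: text inside these elements is never content.
SKIP_TAGS = {"script", "style", "noscript", "nav", "header", "footer"}


class TextExtractor(HTMLParser):
    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep visible text only, with whitespace collapsed.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(" ".join(data.split()))

    def text(self) -> str:
        return "\n".join(self._chunks)


def extract_text(fetched_html: str, rendered_html: Optional[str] = None) -> str:
    """Use the rendered HTML when a page was rendered, else the fetched HTML."""
    parser = TextExtractor()
    parser.feed(rendered_html if rendered_html is not None else fetched_html)
    parser.close()
    return parser.text()
```

Because the rules are fixed, the same HTML always yields the same text, which is what makes later matches reproducible.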
Score artifacts
Relevance and quality are computed from text content, source trust level, and structural features. No ML models.
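The quality half of this step could look like the sketch below, which combines text length, the share of boilerplate lines, a structural feature, and the source trust level. The weights, marker phrases, and trust labels are invented for illustration.

```python
# Hypothetical trust levels and boilerplate markers.
TRUST_WEIGHT = {"high": 1.0, "medium": 0.7, "low": 0.4}
BOILERPLATE_MARKERS = ("cookie", "all rights reserved", "newsletter")


def quality_score(text: str, trust_level: str, heading_count: int) -> float:
    """Rule-based quality in 0..1; an empty extraction scores 0."""
    lines = [ln.strip().lower() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    boilerplate = sum(any(m in ln for m in BOILERPLATE_MARKERS) for ln in lines)
    clean_share = 1.0 - boilerplate / len(lines)
    length = min(len(text.split()) / 200, 1.0)   # saturates at 200 words
    structure = min(heading_count, 4) / 4        # saturates at 4 headings
    base = 0.5 * length + 0.3 * clean_share + 0.2 * structure
    return round(TRUST_WEIGHT[trust_level] * base, 3)
```

Every term in the score is a plain count or ratio, so a low score can always be explained by pointing at the text itself.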
Match domain anchors
Controlled vocabulary determines anchor matching. Confidence per match is based on hit context.
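One way to picture anchor matching is a rule with terms, context words, and negative terms, where the confidence rises with the number of context words found near the hit. The rule contents and the confidence formula below are hypothetical; only the ingredients (terms, context words, negative terms, Dutch and English variants) come from the text.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class AnchorRule:
    anchor: str
    terms: List[str]
    context_words: List[str] = field(default_factory=list)
    negative_terms: List[str] = field(default_factory=list)


def _count(text: str, phrases: List[str]) -> int:
    """Number of phrases that occur as whole words, ignoring case."""
    return sum(
        1 for p in phrases
        if re.search(r"\b" + re.escape(p) + r"\b", text, re.IGNORECASE)
    )


def match_confidence(rule: AnchorRule, text: str) -> float:
    """0.0 when blocked or absent; otherwise 0.4 plus 0.2 per context word, capped at 1.0."""
    if _count(text, rule.negative_terms) or not _count(text, rule.terms):
        return 0.0
    return min(0.4 + 0.2 * _count(text, rule.context_words), 1.0)


# Example rule with Dutch and English terms pointing to the same anchor.
EXAMPLE_RULE = AnchorRule(
    anchor="educational_ai",
    terms=["artificial intelligence", "kunstmatige intelligentie"],
    context_words=["school", "onderwijs", "teacher"],
    negative_terms=["stock market"],
)
```

A bare mention of a term therefore yields a low-confidence match, and a negative term suppresses the match entirely.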
Build relation evidence
Collaboration relations are detected from explicit wording in the text. Evidence is stored as snippets and phrase types.
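Detection from explicit wording can be sketched as a small set of phrase patterns, each tagged with a phrase type, that must be followed closely by the partner's name. The patterns, phrase type names, and window size are assumptions made for this example.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical phrase types with Dutch and English wording.
PHRASE_PATTERNS = {
    "partnership": re.compile(r"\b(?:in partnership with|in samenwerking met)\b", re.IGNORECASE),
    "joint_work": re.compile(r"\b(?:together with|samen met)\b", re.IGNORECASE),
}


@dataclass(frozen=True)
class Evidence:
    phrase_type: str
    snippet: str


def find_collaboration_evidence(text: str, partner: str, window: int = 60) -> List[Evidence]:
    """Return evidence where a collaboration phrase is followed by the partner name."""
    found = []
    for phrase_type, pattern in PHRASE_PATTERNS.items():
        for m in pattern.finditer(text):
            tail = text[m.end(): m.end() + window]
            if partner.lower() in tail.lower():
                start = max(0, m.start() - window)
                snippet = text[start: m.end() + window].strip()
                found.append(Evidence(phrase_type, snippet))
    return found
```

Storing the snippet alongside the phrase type lets a reviewer see exactly which wording produced the relation.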
Organisation profiles
Anchor matches are aggregated into a profile per organisation. Archetype suggestions are computed from profile patterns. Publication confidence dampens weak patterns.
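The aggregation part of this step might be sketched as summing match confidences per organisation and anchor, then normalising to shares. The input shape and function name are assumptions; archetype suggestion and dampening are left out of this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def anchor_profiles(matches: Iterable[Tuple[str, str, float]]) -> Dict[str, Dict[str, float]]:
    """Turn (organisation, anchor, confidence) matches into per-organisation anchor shares."""
    totals: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for organisation, anchor, confidence in matches:
        totals[organisation][anchor] += confidence

    profiles = {}
    for organisation, anchors in totals.items():
        total = sum(anchors.values())
        # Shares sum to 1 per organisation; an all-zero profile stays empty.
        profiles[organisation] = (
            {a: round(v / total, 3) for a, v in anchors.items()} if total else {}
        )
    return profiles
```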
Geographic detection
Cities and provinces are detected through text recognition. Manual seeds are never overwritten by algorithmic detection.
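The precedence rule can be sketched as follows: a manual seed, when present, is returned unchanged and text detection runs only as a fallback. The gazetteer entries and field names are illustrative assumptions.

```python
import re
from typing import Dict, Optional

# Tiny hypothetical gazetteer: place name -> kind.
PLACES = {"Rotterdam": "city", "Eindhoven": "city", "Zeeland": "province"}


def detect_place(text: str) -> Optional[Dict[str, str]]:
    """Return the first gazetteer entry found as a whole word, or None."""
    for name, kind in PLACES.items():
        if re.search(r"\b" + re.escape(name) + r"\b", text, re.IGNORECASE):
            return {"name": name, "kind": kind, "source": "detected"}
    return None


def resolve_location(manual_seed: Optional[Dict[str, str]], text: str) -> Optional[Dict[str, str]]:
    """Manual seeds always win; detection is only used when no seed exists."""
    if manual_seed is not None:
        return {**manual_seed, "source": "manual_seed"}
    return detect_place(text)
```

Keeping the source on the result makes it visible which locations are less certain because they were detected rather than entered.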
Human review
Auto-review accepts or quarantines artifacts based on thresholds. Artifacts that fall between the thresholds go to human review.
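One reading of this step, assumed here, is a pair of thresholds with a band between them: scores at or above the upper threshold are accepted, scores below the lower threshold are quarantined, and the rest are left for a person. The threshold values and labels are hypothetical.

```python
def auto_review(score: float, accept_at: float = 0.8, quarantine_below: float = 0.3) -> str:
    """Route an artifact by score; the uncertain middle band goes to human review."""
    if score >= accept_at:
        return "accepted"
    if score < quarantine_below:
        return "quarantined"
    return "human_review"
```

Widening the band between the two thresholds sends more artifacts to reviewers and fewer to automatic decisions.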
Limitations
This site does not say who is important, leading, or influential. Coverage is incomplete, source selection is still growing, and some pages are difficult to read automatically. Treat patterns as cautious indications within the current corpus.