Inside Moonlit's data pipeline
A look at how raw rulings become clean, linked, searchable data. Collected, cleaned and enriched in the nightly refresh.

Every night at 04:00 Amsterdam time, our database refresh starts. A ruling published the day before in The Hague, Brussels, or Paris is cleaned, linked, and searchable by the time lawyers sit down to work. This post follows that process from start to finish: how we collect the documents, how we clean them, and what we add before they reach the platform.
Where the documents come from
Every two weeks, our legal engineering team picks the next sources to add. The roadmap drives most of that choice, though we’ll often move a source up the list when a newly onboarded client needs a particular gazette, regulator, or court. One rule doesn’t bend: we never collect documents from a source without permission. If the source indicates the documents are available for re-use, we know we can safely use them. If not, we contact them first and get explicit permission. Today the catalog spans courts, parliaments, official gazettes, and regulators across Europe and beyond.
Bringing in the archive
Once a source is approved, we look at how it publishes its documents. In practice that means either downloading files from its public website or pulling documents through an API.
The first run is the heavy one. We call it backfilling: pulling in the source’s entire historical archive, document by document, with all the metadata attached, including titles, publication dates, PDFs, and any official summaries. For a small regulator that takes a few hours. For a national supreme court holding hundreds of thousands of decisions, it can run for days.
After the backfill, the source moves to a daily mode. Each night, automated workers check more than 50 sources across 40-plus countries for anything new or amended, so the day’s documents are in before anyone starts work.
Cleaning up the mess
Raw legal documents are rarely tidy. Some sources publish plain text, some clean PDFs, and a few still upload scanned images of paper. A dedicated cleaning step brings all of it to a common standard.
We work on two levels. The first is the content itself. We convert PDFs to text, run scanned images through Optical Character Recognition (OCR), and strip out the noise: navigation menus, repeating headers, and the broken characters that old encodings leave behind.
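To make the noise-stripping step concrete, here is a minimal sketch in Python. It assumes the text has already been extracted page by page. The function names and the 60% threshold are illustrative choices, not Moonlit's actual implementation.

```python
import re
import unicodedata
from collections import Counter


def strip_repeating_lines(pages: list[str], min_share: float = 0.6) -> list[str]:
    """Drop lines that recur on most pages, such as running headers and footers."""
    counts: Counter[str] = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    # Assumption: a line seen on at least 60% of pages (and at least twice) is noise.
    threshold = max(2, int(len(pages) * min_share))
    repeated = {line for line, n in counts.items() if n >= threshold}
    cleaned = []
    for page in pages:
        kept = [line for line in page.splitlines() if line.strip() not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned


def normalize_text(text: str) -> str:
    """Normalize Unicode and remove common encoding debris."""
    text = unicodedata.normalize("NFC", text)                  # compose accented characters
    text = text.replace("\ufffd", "")                          # drop replacement characters
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)   # drop stray control characters
    text = re.sub(r"[ \t]+", " ", text)                        # collapse runs of spaces and tabs
    return text.strip()
```

A frequency-based rule like this is deliberately crude. It can also remove a legitimate line that happens to repeat on most pages, so in practice the threshold would need tuning per source.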
The second is structure. Every source names its fields differently, so we map each one onto a single Moonlit format: date, court, case number, parties, document type, jurisdiction, and language. Because no two sources describe these concepts the same way, the mapping is done by hand, source by source. Once a document is cleaned, it gets a stable Moonlit identifier so it can be cited reliably.
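As a rough illustration of that mapping, the sketch below translates one source's field names into the common format and derives a stable identifier. Everything specific here is hypothetical: the source key, the Dutch field names, the defaults, and the hash-based identifier scheme are assumptions for the example, not Moonlit's real configuration.

```python
import hashlib

COMMON_FIELDS = ("date", "court", "case_number", "parties",
                 "document_type", "jurisdiction", "language")

# Hypothetical hand-written mapping for one imaginary source.
SOURCE_CONFIG = {
    "example_nl_court": {
        "fields": {
            "uitspraakdatum": "date",
            "instantie": "court",
            "zaaknummer": "case_number",
            "partijen": "parties",
            "soort": "document_type",
        },
        # Values that are constant for the whole source.
        "defaults": {"jurisdiction": "NL", "language": "nl"},
    },
}


def to_common_format(source: str, raw: dict) -> dict:
    """Translate a raw record into the common field names."""
    config = SOURCE_CONFIG[source]
    record = dict(config["defaults"])
    for raw_field, common_field in config["fields"].items():
        record[common_field] = raw.get(raw_field)
    missing = [f for f in COMMON_FIELDS if not record.get(f)]
    if missing:
        raise ValueError(f"{source}: missing fields {missing}")
    return record


def stable_id(source: str, source_doc_id: str) -> str:
    """Derive an identifier that stays the same on every re-run (assumed scheme)."""
    digest = hashlib.sha256(f"{source}:{source_doc_id}".encode("utf-8")).hexdigest()
    return f"doc-{digest[:16]}"
```

Deriving the identifier from the source key and the source's own document ID, rather than from the content, means an amended document keeps its identifier. That is one plausible way to keep citations stable.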
The enrichment layers
Cleaning gets the data consistent. Enrichment is what turns it into a connected, searchable database. We add four layers of enrichment.
Citations and cross-references. Every reference in the text becomes a clickable link. Mentions of laws, regulations, treaties, and cases are detected automatically, using rules tuned to each country’s citation style. We resolve them at the article level, so clicking “Article 6 ECHR” takes you to that article rather than the top of the convention. For vague or informal references like nicknames and abbreviations, we use generative AI to work out what was meant.
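A rule-based detector for one citation style might look like the sketch below. The pattern only covers references of the form "Article 6 ECHR" or "Art. 17 GDPR". The list of instruments and the link-target format are assumptions made up for this example, and a real rule set per country would be far larger.

```python
import re
from typing import NamedTuple


class Citation(NamedTuple):
    text: str        # the reference as it appears in the document
    instrument: str  # e.g. "ECHR"
    article: str     # e.g. "6"
    target: str      # hypothetical article-level link target


# Assumed lookup from abbreviation to an internal slug.
KNOWN_INSTRUMENTS = {"ECHR": "echr", "GDPR": "gdpr", "TFEU": "tfeu"}

ARTICLE_REF = re.compile(
    r"\b(?:Article|Art\.)\s+(?P<article>\d+[a-z]?)(?:\(\d+\))?\s+"
    r"(?P<instrument>" + "|".join(map(re.escape, KNOWN_INSTRUMENTS)) + r")\b"
)


def find_citations(text: str) -> list[Citation]:
    """Detect article-level references and resolve each to a link target."""
    found = []
    for match in ARTICLE_REF.finditer(text):
        instrument = match.group("instrument")
        article = match.group("article")
        target = f"{KNOWN_INSTRUMENTS[instrument]}#art-{article}"
        found.append(Citation(match.group(0), instrument, article, target))
    return found
```

Because the article number is captured separately from the instrument, the link can point at the article itself rather than the top of the text. That is the article-level resolution described above.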
Classification and taxonomy. We tag each document with its legal domain, the body that issued it, and its place in the court hierarchy. This is what drives the filters and makes cross-border comparison possible.
Content understanding. This lets the platform look past exact wording. We generate long and short summaries where no official one exists, and we create embeddings that capture what the text means, which is what powers semantic search. Translations are generated on demand when users select a different language for the document.
Integrity and traceability. The pipeline runs incrementally, processing only what changed since the night before. That keeps the daily refresh fast and reproducible, and keeps our copy matching the source.
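One common way to process only what changed is to compare content fingerprints between runs. The sketch below assumes a fingerprint per document was stored after the previous night's run. The function names and the SHA-256 choice are illustrative, not a description of Moonlit's internals.

```python
import hashlib


def fingerprint(content: str) -> str:
    """Hash the document text so changes can be detected cheaply."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def plan_refresh(previous: dict[str, str], current_docs: dict[str, str]) -> dict[str, list[str]]:
    """Compare last night's fingerprints with today's documents.

    previous:     document ID -> fingerprint stored after the last run
    current_docs: document ID -> text fetched from the source today
    """
    current = {doc_id: fingerprint(text) for doc_id, text in current_docs.items()}
    return {
        "new": sorted(d for d in current if d not in previous),
        "amended": sorted(d for d in current
                          if d in previous and current[d] != previous[d]),
        "removed": sorted(d for d in previous if d not in current),
        "unchanged": sorted(d for d in current
                            if d in previous and current[d] == previous[d]),
    }
```

Only the "new" and "amended" documents would then go through cleaning and enrichment again. The rest are left untouched, which is what keeps a nightly run short.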
The gate before publication
No new data reaches production without a manual check. Our data quality team compares the document count in our database against the public count on the source’s website, then reads through the content for layout glitches, missing fields, and duplicate records. Only datasets that pass this gate are released.
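The check itself is manual, but the three things the reviewer looks at (counts, missing fields, duplicates) are easy to express in code. The sketch below shows the kind of report that could support such a review. The field list follows the common format described earlier, while the function name and the duplicate key are assumptions for this example.

```python
from collections import Counter

REQUIRED_FIELDS = ("date", "court", "case_number", "parties",
                   "document_type", "jurisdiction", "language")


def quality_report(records: list[dict], public_count: int) -> dict:
    """Summarize count mismatches, missing fields, and duplicate records."""
    missing = {}
    for index, record in enumerate(records):
        absent = [f for f in REQUIRED_FIELDS if not record.get(f)]
        if absent:
            missing[index] = absent
    # Assumption: court + case number + date identifies a document.
    keys = Counter((r.get("court"), r.get("case_number"), r.get("date")) for r in records)
    duplicates = [key for key, n in keys.items() if n > 1]
    return {
        "count_matches": len(records) == public_count,
        "missing_fields": missing,
        "duplicates": duplicates,
    }
```

A report like this cannot catch layout glitches, which is one reason a human still reads through the content before release.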
What this looks like in practice
By morning, all of this shows up as concrete features. You can filter by jurisdiction, court, or date, and read a high-quality translation alongside a summary for any document.
Search runs in two modes. Keyword search does what you’d expect: it matches exact terms and phrases when you already know the wording you need. Semantic search uses the embeddings from the enrichment step to find documents that are conceptually close, even when the wording is different. For example, a search for “wrongful termination of an indefinite contract” will also surface relevant rulings phrased in other ways.
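The difference between the two modes can be shown with a toy example. The three-dimensional vectors below are hand-made stand-ins for real embeddings, which would come from a language model and have hundreds of dimensions. The documents, vectors, and function names are all invented for illustration.

```python
import math


def keyword_search(query: str, docs: dict[str, str]) -> list[str]:
    """Return IDs of documents containing the exact phrase (case-insensitive)."""
    return [doc_id for doc_id, text in docs.items() if query.lower() in text.lower()]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_search(query_vec: list[float], doc_vecs: dict[str, list[float]],
                    top_k: int = 2) -> list[str]:
    """Rank documents by similarity of their embedding to the query embedding."""
    ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:top_k]


# Toy data: A and B are about ending employment, C is about tax.
docs = {
    "A": "Wrongful termination of an indefinite contract by the employer.",
    "B": "Unfair dismissal of a permanent employee.",
    "C": "VAT treatment of cross-border services.",
}
doc_vecs = {"A": [0.9, 0.1, 0.0], "B": [0.8, 0.2, 0.1], "C": [0.0, 0.1, 0.9]}
query_vec = [0.85, 0.15, 0.0]  # stand-in embedding for "wrongful termination"

keyword_hits = keyword_search("wrongful termination", docs)   # ["A"]
semantic_hits = semantic_search(query_vec, doc_vecs)          # ["A", "B"]
```

Keyword search finds only document A, because B never uses the words "wrongful termination". Semantic search also returns B, since its vector points in nearly the same direction as the query.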
Put together, the nightly refresh, the enrichment layers, and the manual check mean every research session on Moonlit starts from data that’s current, structured, and traceable.

Article written by
Lara Olders
