9 Feb 2026

LCA databases are becoming decision infrastructure

As we outlined in our previous article, LCA data is increasingly used to steer sustainability strategies, investments and regulatory compliance. As a result, the demand for environmental data is growing, and so are the expectations placed on it. LCA databases should not only be “large”; they also need to be consistent, governable over time, transparent and reproducible. In other words, LCA databases need to be engineered as modern systems, with controlled workflows, traceable inputs, and systematic versioning.

That is exactly the role of the Data Generation Pipeline (DGP).

What the Data Generation Pipeline is designed to do

Over the past years, we have developed the DGP for the generation and maintenance of large LCA databases such as Agri-footprint, the GFLI database, and EFSA’s Environmental Footprint of Food Database.

The DGP supports the systematic generation of Life Cycle Inventories (LCIs) that are methodologically aligned across all processes, system boundaries, modelling approaches, and database variants. The system enables cross-cutting corrections and updates to be carried out consistently, within and across databases, a requirement for keeping large databases coherent as they grow and evolve. At its core, the DGP ensures that processes are generated according to a single, coherent methodological framework. This makes results comparable across datasets and improves transparency, traceability and reproducibility.

Data generation ecosystem

One of the core design philosophies of the DGP is that database creation is not a manual modelling effort but a controlled workflow.

In practical terms, these controlled workflows are built from the collection of software tools, models and databases that together make up the DGP. The DGP is designed to be used flexibly, enabling the generation of different types of databases depending on scope and context.

This flexibility is enabled through the following concepts:

  1. Version controlled source data

  2. Life Cycle Engines

  3. Customizable workflows

  4. Workflow runs (job orchestration and traceable artifacts)

  5. The Blonk Datahub


Below, we explore each of these concepts in more detail.

1. Version controlled source data

LCA databases depend on many kinds of inputs, from agricultural statistics and energy supply mixes to process measurements, emission factors, and trade logistics. In traditional (semi) manual approaches, answering the question of "what changed?" is difficult, and reconstructing which specific version of input data produced a specific result is often not possible.

The DGP solves this by integrating version-controlled data management directly into the database build process. This ensures that source data updates are transparent and that the data history, including every change, is inspectable.

By allowing changes to be rolled back, or exact database builds to be reproduced from specific historical versions, we introduce a level of robustness to LCA that is already standard in modern engineering domains such as software development. In today’s LCA applications, this level of traceability becomes more essential every year.
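
To make this concrete, below is a minimal sketch, in Python, of how source data versions could be pinned in a build manifest so that a database build can be verified, reproduced or rolled back later. The file layout and function names are illustrative assumptions, not the DGP’s actual implementation.

```python
# Minimal sketch of pinning source data versions for reproducible builds.
# File names and the manifest layout are illustrative, not the DGP's format.
import hashlib
import json
from pathlib import Path


def content_hash(path: Path) -> str:
    """Return a SHA-256 hash of a source data file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def write_manifest(source_files: list[Path], manifest_path: Path) -> None:
    """Record the exact version (hash) of every input used in a build."""
    manifest = {str(p): content_hash(p) for p in source_files}
    manifest_path.write_text(json.dumps(manifest, indent=2))


def verify_manifest(manifest_path: Path) -> bool:
    """Check that the current inputs still match a historical build exactly."""
    manifest = json.loads(manifest_path.read_text())
    return all(content_hash(Path(p)) == h for p, h in manifest.items())
```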

2. Life Cycle Engines

At scale, LCA databases are not compiled “dataset by dataset.” They are generated from modelling logic applied systematically to parameterized data. In the DGP, this modelling logic is encapsulated in Life Cycle Engines. A Life Cycle Engine is a software model that generates LCIs for a specific life cycle stage; examples are crop cultivation systems, animal production systems, and industrial processing stages.

These Life Cycle Engines implement a consistent methodological approach by applying modelling rules, system boundary definitions, and calculation logic in a repeatable way. We have Life Cycle Engines covering all major stages of the food supply chain, from crop cultivation and post-harvest handling up to food preparation at the consumer.
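
As an illustration only (the DGP’s internal interfaces are not described here), a Life Cycle Engine can be thought of as a component with one responsibility: turn parameterized input data into an LCI for a single life cycle stage. The class names, parameters and emission factor in the sketch below are hypothetical placeholders.

```python
# Illustrative-only sketch of a Life Cycle Engine interface; the class names,
# parameters and emission factor are hypothetical, not the DGP's internals.
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Exchange:
    flow: str      # e.g. "maize grain" or "dinitrogen monoxide, to air"
    amount: float
    unit: str


class LifeCycleEngine(ABC):
    """Generates an LCI for one life cycle stage from parameterized data."""

    @abstractmethod
    def generate_lci(self, parameters: dict) -> list[Exchange]:
        ...


class CropCultivationEngine(LifeCycleEngine):
    """Toy example with a single hard-coded modelling rule."""

    EMISSION_FACTOR = 0.01  # placeholder value, for illustration only

    def generate_lci(self, parameters: dict) -> list[Exchange]:
        n_applied = parameters["n_fertiliser_kg_per_ha"]
        return [
            Exchange("maize grain", parameters["yield_kg_per_ha"], "kg"),
            Exchange("dinitrogen monoxide, to air", n_applied * self.EMISSION_FACTOR, "kg"),
        ]
```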

Why Life Cycle Engines matter

By encoding methodology into executable engines, the DGP transforms LCA from manual work into an automated system. This ensures that datasets are generated with consistency across the entire database, moving away from fragmented interpretations. Because every assumption is structured and fully inspectable, the logic behind the data is no longer hidden but transparent. 

One of the strongest capabilities of Life Cycle Engines is how they support change. In practice, large LCA databases contain tens of thousands of datasets and hundreds of thousands (sometimes millions) of exchanges. At this scale, every change or update can ripple across a large part of the database. Examples of such updates are new emission factors, a change in allocation method, or a correction in a shared data source.

In (semi) manual systems, these updates often get applied inconsistently. Some datasets are updated, others remain unchanged, and assumptions drift.

The DGP ensures that cross-cutting corrections are implemented at the level where they belong: in the source data, in the modelling rules (engines), or in the workflow configuration. The database is then regenerated in a controlled run.

Furthermore, this approach guarantees that every calculation step is reproducible, allowing updates to be propagated across the system in a single, controlled regeneration. Engines can also be varied deliberately: an engine variant can, for example, comply with a specific national inventory methodology or a defined program rule, while keeping the overall database coherent. This provides a structured way to support policy-driven or regional methodological requirements without fragmenting the database.

3. Customizable workflows

One of the major challenges in LCA is that different contexts, from government agencies and NGOs to sector-specific organizations, require unique database variants aligned with their own methodologies and boundaries. To solve this, the DGP replaces separate manual modelling tracks with customizable workflows. Each workflow acts as a precise configuration that specifies which datasets and source versions to use, which engine variants apply, the required allocation types, and database-level linking strategies for background data.

This keeps the DGP flexible yet controlled: instead of being applied inconsistently, changes are managed through these clear, centralized settings.
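
As a rough illustration of what such a configuration might contain, the sketch below expresses a workflow configuration as a Python data structure. The field names and example values are assumptions made for this article, not the DGP’s actual schema.

```python
# Hypothetical sketch of a workflow configuration; field names and example
# values are assumptions for this article, not the DGP's actual schema.
from dataclasses import dataclass


@dataclass
class WorkflowConfig:
    name: str
    source_data_versions: dict[str, str]  # dataset id -> pinned version or hash
    engine_variants: dict[str, str]       # life cycle stage -> engine variant
    allocation: str                       # e.g. "economic" or "mass"
    background_database: str              # linking strategy for background data


example = WorkflowConfig(
    name="example-food-db-2026",
    source_data_versions={"crop_statistics": "v2025.2", "energy_mixes": "v2024.1"},
    engine_variants={"cultivation": "default", "processing": "program-rule-x"},
    allocation="economic",
    background_database="example-background-3.0",
)
```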

4. Workflow runs

Once a workflow configuration exists, generating a database is no longer an ad hoc modelling effort. It becomes a controlled computational event: a workflow run. In a workflow run, a specific workflow configuration is executed using fixed versions of source data and engines, producing a database as output.

Workflow runs produce a set of data artifacts, such as intermediate results, error logs, validation outputs, and database packages; these can be individually downloaded for inspection, exported to various data formats, and published to the Datahub.
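
The sketch below illustrates, under assumed names and file formats, how such a run could be orchestrated so that every artifact (configuration, logs, validation output, database package) ends up in a dedicated, traceable run folder. It is not the DGP’s actual implementation.

```python
# Hypothetical sketch of a workflow run that produces traceable artifacts;
# function names, file names and formats are illustrative assumptions.
import json
import logging
from datetime import datetime, timezone
from pathlib import Path


def run_workflow(config: dict, output_root: Path) -> Path:
    """Execute one workflow run and write its artifacts to a dedicated folder."""
    run_id = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    run_dir = output_root / f"run-{run_id}"
    run_dir.mkdir(parents=True)

    logging.basicConfig(filename=str(run_dir / "run.log"), level=logging.INFO)
    logging.info("Starting run %s for workflow %s", run_id, config["name"])

    # Record the exact configuration used, so the run can be reproduced later.
    (run_dir / "config.json").write_text(json.dumps(config, indent=2))

    # ... generate LCIs here with the configured engines and pinned source data ...

    # Placeholder artifacts: validation output and the packaged database.
    (run_dir / "validation.json").write_text(json.dumps({"errors": []}))
    (run_dir / "database_package.csv").write_text("process,exchange,amount\n")
    return run_dir
```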

5. Datahub 

The Blonk Datahub offers a user-friendly environment that is specifically designed to explore and inspect the LCIs generated by the DGP.

This enables content specialists to directly audit the results of their changes. Combined with a granular and isolated change strategy in the DGP, this allows a quick iteration cycle to be established, which keeps the entire data generation ecosystem methodologically aligned as it evolves.

In this way, continuous improvement and delivery over time are possible, even as the scale and complexity of the databases grow.

Why is this relevant?  

Large-scale LCA databases are no longer static reference collections; they have become critical infrastructure for sustainability decisions in policy, regulation, and industry. As a result, the value of an LCA database is no longer measured simply by the number of datasets it contains. Instead, the focus has shifted toward fundamental questions of long-term integrity and maintainability. Ensuring this requires more than continuously adding or updating datasets; it requires systematic governance of data, methodology, and change.

The Data Generation Pipeline provides this foundation by turning modelling rules into executable engines, managing inputs through version-controlled data, and generating databases through reproducible workflows and orchestrated runs.

Together, these elements enable consistent updates, methodological consistency across system boundaries, transparency of assumptions, auditability, and reproducibility at scale, ensuring that LCA databases remain coherent, trustworthy, and fit for long-term decision-making in a rapidly changing world.

In our next article, we’ll illustrate the application of the Data Generation Pipeline and take a closer look at how we developed EFSA's Open Access European Environmental Footprint of Food database.


More information

Get in touch

Bart Durlinger, Director Digital Solutions, Blonk

Do you have questions about LCA database generation at scale, or do you want to know more about our Data Generation Pipeline? Get in touch with Bart Durlinger.