25 Jan 2026

Databases are growing larger and expectations are rising

As LCA databases grow, it is not just about adding “more datasets”; every addition brings more complexity. Every additional process increases the number of connections, assumptions, and dependencies across the system. Every methodological choice affects comparability. And every update raises a critical question: if I update this dataset, what else should change to remain consistent?

In practice, large LCA databases contain tens of thousands of datasets and hundreds of thousands (sometimes millions) of data exchanges. At this scale, even “small” changes can ripple across a large part of the database. However, most LCA databases are still developed using workflows rooted in an older era: manual modelling in LCA software, scattered spreadsheets and fragmented scripts, isolated modelling teams, long release cycles, and hard-to-track updates.

This approach was sufficient in the past, but it can no longer handle the complexity of modern LCA requirements and applications. 

The bottleneck: manual data handling doesn’t scale

Database development is often imagined as a straightforward task: collecting data, building process inventories, linking datasets, and publishing. But in reality, building and maintaining an LCA database is a large-scale modelling challenge. It requires consistent, everyday decisions with large impacts: decisions about system boundaries and functional units, allocation methods, cut-off rules, background linking and proxy strategies, and methodological alignment across thousands of datasets.

When LCA database development is (partly) manual or fragmented, as naturally happens in most organisations where people change jobs and responsibilities shift, databases tend to suffer from the same structural issues:

Methodological drift over time

As multiple contributors add or update datasets, assumptions inevitably start to differ. Even with the best guidance documents, interpretations vary. Over time, a database that was once coherent becomes a patchwork: a dataset updated in 2025 may follow different footprint rules than a dataset updated in 2018. As assumptions shift gradually, system boundaries become inconsistent and comparability is lost.

Inconsistent updates

When a source dataset changes, for instance an emission factor or a commodity yield, the correction often lands in only a handful of datasets. But such updates should typically be repeated across many processes in a consistent way. (Semi-)manual maintenance struggles here because of the core dependency problem: life cycle inventories are interlinked networks, not lists.
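To make the dependency problem concrete, here is a minimal sketch in Python, with entirely hypothetical dataset names and links, of why one source change fans out: when inventories are treated as a graph of dependencies, the datasets affected by an update can be found by traversal rather than by memory or manual search.

from collections import defaultdict, deque

# Hypothetical inventory network: each dataset lists the upstream datasets it uses.
uses = {
    "electricity, grid NL": ["natural gas, at plant", "wind power, onshore"],
    "ammonia production":   ["natural gas, at plant", "electricity, grid NL"],
    "wheat cultivation":    ["ammonia production", "electricity, grid NL"],
    "bread production":     ["wheat cultivation", "electricity, grid NL"],
}

# Invert the links: which datasets consume a given dataset?
consumers = defaultdict(set)
for dataset, inputs in uses.items():
    for upstream in inputs:
        consumers[upstream].add(dataset)

def affected_by(changed: str) -> set[str]:
    """Return every dataset that directly or indirectly depends on `changed`."""
    affected, queue = set(), deque([changed])
    while queue:
        for downstream in consumers[queue.popleft()]:
            if downstream not in affected:
                affected.add(downstream)
                queue.append(downstream)
    return affected

# Updating one upstream dataset touches far more than the datasets that cite it directly.
print(sorted(affected_by("natural gas, at plant")))
# ['ammonia production', 'bread production', 'electricity, grid NL', 'wheat cultivation']

In a real database with hundreds of thousands of exchanges, this kind of systematic traversal is exactly what semi-manual maintenance struggles to do reliably.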

Limited reproducibility

A database release is rarely based on a single change; rather, it is the accumulation of many actions and changes implemented over time. In traditional environments, this process often involves a fragmented mix of custom scripts, manual edits, and decisions, all while navigating version mismatches. This lack of structure creates the risk of a "black box" effect, making it difficult to answer simple yet critical questions: Which specific version of the source data was used for a particular result? Which modelling rules were applied to this specific branch? Can this database build be reproduced exactly, and if not, why not? Without a transparent, automated trail, reproducibility becomes impossible to guarantee, undermining the trust required for environmental reporting.
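One common way to answer such questions, sketched below with hypothetical sources, versions, and identifiers, is to pin every input that shapes a release in a manifest stored and fingerprinted alongside the build:

import hashlib, json

# Hypothetical build manifest: every input that shapes the release is pinned.
build_manifest = {
    "release": "2026.1",
    "source_data": {
        "emission_factors": {"source": "national-inventory", "version": "2025-11"},
        "crop_yields": {"source": "crop-statistics", "version": "2025-09"},
    },
    "modelling_rules": {
        "allocation": "economic",
        "cut_off": "1% of mass and energy inputs",
        "ruleset_version": "3.2.0",
    },
    "pipeline_code": {"repository": "lci-build-scripts", "commit": "a1b2c3d"},
}

# A stable fingerprint can be published with the release, so anyone can later
# verify which combination of inputs and rules produced a given build.
fingerprint = hashlib.sha256(
    json.dumps(build_manifest, sort_keys=True).encode()
).hexdigest()
print(fingerprint[:12])

Two builds produced from the same manifest should be identical; if they are not, the manifest tells you exactly where to look.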

Limited auditability

Closely related is auditability. For environmental reporting, it is no longer enough to simply publish database updates; practitioners need to understand what has changed and why. An auditor might ask you to explain, trace, and justify how an emission factor result came to be, and what changed since the previous release. Without structured traceability, it becomes difficult to establish an auditable log that connects data updates and modelling rule changes to their specific impacts on datasets and results.
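A structured way to support such questions, shown here as a purely hypothetical record format with illustrative values, is to log every change together with the rule that was applied and the datasets it touched:

from datetime import date

# Hypothetical audit record: one entry per change, linking the source update
# and the modelling rule in force to the datasets it affected.
change_record = {
    "change_id": "CHG-0412",
    "date": date(2026, 1, 12).isoformat(),
    "source_update": {
        "dataset": "natural gas, at plant",
        "field": "CH4 emission factor",
        "old": 0.0021,
        "new": 0.0018,
        "unit": "kg CH4 / MJ",
    },
    "rule_applied": "background relinking, ruleset 3.2.0",
    "affected_datasets": [
        "electricity, grid NL", "ammonia production",
        "wheat cultivation", "bread production",
    ],
    "reason": "Updated national inventory figures (illustrative values only).",
}

With records like these, "what changed between releases, and why?" becomes a query rather than an archaeology exercise.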

Slow iteration cycles

When releases require large manual effort, the speed of learning slows down and feedback loops become inefficient: experts cannot easily inspect changes, and development teams cannot efficiently implement systematic corrections. In other words, the release cycle itself limits database improvement.

Why it matters: comparability, transparency, and trust

The above challenges aren’t just an internal technical problem. As LCA becomes embedded in environmental claims, regulatory compliance, procurement rules, and investor disclosures, LCA databases are no longer just scientific artifacts. They are becoming part of the trust infrastructure of sustainability decisions.

When databases are inconsistent or hard to reproduce, problems follow quickly: results diverge, comparability suffers, and credibility is questioned. And once trust is lost, even good data becomes difficult to use.

 

The shift ahead: from database building to database generation

To stay ahead of these challenges and make sure that LCA databases continue to support the transition towards sustainable supply chains, database development must evolve.

This means that LCA databases need to be built with modern, engineered systems that can generate, update, and manage them reliably, consistently, and transparently over time.

So, it is no longer “collect data and model it”. It means moving away from fragmented, semi-manual work towards automated, systematic database generation, built on explicit modelling rules, controlled workflows, versioned inputs, inspectable outputs, and mechanisms for systematic updates.
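As an illustration of what an explicit modelling rule can look like when it lives in code rather than in a guideline document, here is a minimal sketch, with hypothetical products and prices, of economic allocation applied identically to every multi-output process it encounters:

from dataclasses import dataclass

@dataclass
class Exchange:
    product: str
    amount: float   # amount produced per functional unit
    price: float    # hypothetical price data used for economic allocation

def economic_allocation(exchanges: list[Exchange]) -> dict[str, float]:
    """Allocate a multi-output process by revenue share: one explicit rule,
    applied identically to every dataset instead of interpreted per modeller."""
    revenue = {e.product: e.amount * e.price for e in exchanges}
    total = sum(revenue.values())
    return {product: value / total for product, value in revenue.items()}

# Hypothetical multi-output process: milk and live animals from the same system.
outputs = [Exchange("raw milk", 8000, 0.40), Exchange("live animal", 250, 2.00)]
print(economic_allocation(outputs))
# {'raw milk': 0.8648..., 'live animal': 0.1351...}

Because the rule is a function rather than a paragraph of guidance, it can be versioned, tested, and applied in exactly the same way to a dataset built in 2018 or in 2025.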

Next: introducing the Data Generation Pipeline

With 25 years of experience as LCA data pioneers, we at Blonk have always taken a forward-looking perspective, continuously looking for technical solutions that make LCA data generation future-proof. We believe that a changing world requires more than static datasets; it requires a new way of building them.

To address this, we developed the Data Generation Pipeline (DGP), a software ecosystem specifically designed for the creation and maintenance of large-scale databases, bringing many best practices used in modern software and data engineering to the world of LCA.

The DGP supports:

  • Systematic generation of life cycle inventories (LCIs),

  • Methodological alignment across processes and boundaries, through life cycle engines,

  • Consistent cross-cutting corrections and updates, through versioned source data,

  • Transparency and reproducibility across database builds, through versioned workflows.

In our next article, we’ll dive deeper into the Data Generation Pipeline and the technical foundations behind our approach. 

More information 

Get in touch

Bart Durlinger
Director Digital Solutions, Blonk

Do you have questions about LCA database generation at scale, or do you want to know more about our Data Generation Pipeline? Get in touch with Bart Durlinger.