When Data Arrives Out of Order: Late Arrivals and Corrections

There's a question that comes up in every workshop sooner or later: “Do we really need this complexity?” What people mean is bitemporal historization – two timelines instead of one, a few more columns, more logic when loading. It's a fair question.

I won't bury it under eight reasons here. Instead I'll walk through a single one that shows quickly what this is actually about: late-arriving data. A case every data team has run into – often without calling it by name.

The ideal: data in the right order
When the correction arrives before the original
The transposed exchange rate
Why one timeline doesn't solve the dilemma
Two timelines solve it
What changes in the implementation
Three questions to check your own project

If you haven't met Yerodin, Philomena, Diego and Amal yet: they are fictional characters from the FastChangeCo universe, which I use to make real-world challenges from projects and coaching sessions tangible.

This article is part 1 of a four-part series on bitemporal data in practice. It bundles two of the eight reasons from the hub article “8 reasons why bitemporal data is needed” into one focused topic: late arrivals and after-the-fact corrections.

The core argument: as soon as data arrives in a different order than the one in which it actually happened, data quality starts to slip without bitemporal historization – quietly and unnoticed.

At FastChangeCo, it starts with a complaint. Philomena Pavlovic, business analyst and application developer, gets an annoyed message from marketing: most of a campaign mailing has bounced. Mass bounces on a distribution list – even though, according to the system, the addresses had just been updated.

Philomena brings it into the data meeting. Yerodin van Dusseldorp, controller, looks at the relevant data flow and frowns. The new addresses are in there. But somehow the old ones are too – and for some customers the list shows an address nobody can trace anymore.

Philomena puts into words exactly the question you're probably asking yourself at this point: “Do we really have to introduce a second time dimension now? Isn't this a one-off we just fix by hand?” Diego Pasión, who supports the data team as a coach, shakes his head. “Take a look at when this data came in, and in what order. Then it looks a lot less like a one-off.”

The ideal: data in the right order

In analytical systems we quietly assume an ideal case: data arrives in the same order in which it actually happened. Picture a supermarket checkout. I pay, my transaction lands in the warehouse. The customer after me pays, their transaction lands after that. A clean, chronologically correct sequence of what happened in reality.

As long as that holds, one timeline is entirely enough. We simply move forward: new state, new row, done.

But this ideal case doesn't survive contact with practice. Between the operational system and the warehouse there are layers, interfaces, message buses. And a bus doesn't necessarily guarantee ordering from end to end. One message keeps partying somewhere with other messages for a while and then arrives three hours later than the one that should have come after it. You can be as sure as you like that this doesn't happen in your system – I've seen it in every system where an admin assured me of exactly that. It happens.

When the correction arrives before the original

What Philomena described in the marketing meeting has a name: late-arriving data. A classic example makes the mechanism tangible.

A customer enters her new email address in the FastChangeCo self-service portal and submits. Shortly afterwards she notices a typo. Ten minutes later she corrects it and submits again. From her point of view everything is fine – the corrected address is the valid one.

In the warehouse, something unpleasant happens. On its detour through the integration layer, the correction arrives first; the faulty initial entry trickles in late. The system sees two changes and can't judge which was correct in business terms. It falls back on the only rule it has without a second timeline: the data delivered most recently is the current data.

So the wrong address wins – it did arrive last, after all. That very address ends up in the mailing list, and here the circle closes back to Philomena's bounces. The report looks correct. It isn't.
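The fallback rule is easy to reproduce. Here is a minimal sketch of "last delivery wins", assuming each delivery carries nothing but its arrival time in the warehouse. The dates, times and addresses are invented for illustration, as is the function name `current_address`.

```python
from datetime import datetime

# Hypothetical deliveries as (arrival time in the warehouse, email address).
# Dates, times and addresses are invented for illustration.
deliveries = [
    (datetime(2024, 5, 12, 13, 58), "anna@example.com"),  # correction, arrives first
    (datetime(2024, 5, 12, 14, 2), "anna@exmaple.com"),   # faulty entry, arrives late
]

def current_address(rows):
    """Single-timeline rule: whatever was delivered last is the current value."""
    _, address = max(rows, key=lambda row: row[0])
    return address

print(current_address(deliveries))  # anna@exmaple.com, the address with the typo
```

Nothing in this code is wrong in a technical sense. It simply has no information about when each address was valid in reality, so the order of arrival decides.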

The transposed exchange rate

The second case is subtler, and it hurts in controlling. FastChangeCo sells in several countries and currencies. For the balance sheet everything has to be converted into euros, so you need daily exchange rates.

Assume one of these rates contains a transposed digit for a single day. Instead of 900 euros, 1,000 US dollars suddenly become 1,100 euros. For one day. That sounds small, but it has an effect on the balance sheet – and it only surfaces weeks later.

Now you correct the rate. And here's the point: this correction has to slot in at exactly the one day on which the error applied. The obvious reflex is to delete the faulty rate and enter the correct one. You can do that. It's just that later nobody can explain why a report from back then shows different figures than today's – a question internal governance or a regulator is fairly certain to ask at some point. How to turn that traceability into an advantage is the topic of part 4.

Why one timeline doesn't solve the dilemma

Both cases show the same dilemma, and it can't be resolved with a single timeline.

If you have no historization at all, the data that arrived most recently counts as the current data – even if it was the wrong data that should have been overwritten. If you have a simple, unitemporal historization, it looks as if the data that was actually correct had later been corrected by the wrongly delivered, late-arriving data. You don't want either of those in your systems. In both variants, the pure order of arrival distorts your result.
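The unitemporal variant can be sketched in a few lines. This assumes that validity periods are derived purely from the load time, which is one common way to build a simple history. The function name and the timestamps are hypothetical.

```python
from datetime import datetime

def unitemporal_history(deliveries):
    """Derive validity periods from arrival order only (a single timeline)."""
    ordered = sorted(deliveries, key=lambda d: d[0])
    history = []
    for i, (loaded_at, value) in enumerate(ordered):
        # Each version counts as valid until the next delivery shows up.
        valid_to = ordered[i + 1][0] if i + 1 < len(ordered) else None
        history.append({"valid_from": loaded_at, "valid_to": valid_to, "value": value})
    return history

# Hypothetical arrival times: the correction reaches the warehouse first.
deliveries = [
    (datetime(2024, 5, 12, 13, 58), "corrected address"),
    (datetime(2024, 5, 12, 14, 2), "faulty address"),
]

history = unitemporal_history(deliveries)
for version in history:
    print(version)
```

The history now claims that the corrected address was valid for four minutes and was then replaced by the faulty one, which stays open-ended. Every version is preserved, and the story it tells is still the wrong one.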

Two timelines solve it

This is where Diego gets concrete. “The problem isn't that the data arrives late. The problem is that you only have one timeline, and it's supposed to answer two questions at once.” Two different questions: when did we learn about a value? And when was that value valid in reality?

Bitemporal historization separates this into two timelines. The assertion timeline (technical) records when a record landed in our warehouse. The state timeline (business) records from when it was valid in reality. For the two email addresses, that means:

Inscription timestamp (assertion timeline, technical)	State timestamp (state timeline, business)	Email address
May 12, 2:02 PM	before May 12, 2:25 PM	faulty address
May 12, 1:58 PM	from May 12, 2:25 PM	corrected address

The correction arrived first in technical terms – but in business terms it sits correctly as the later, currently valid state. The mailing list now reliably pulls the right address. Philomena's bounces stop.
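As a rough sketch of that lookup, assume each row carries both timestamps and the mailing list asks for the address valid at a given business time. The year, the class `EmailVersion` and the function `address_valid_at` are my own illustration. The 2:15 PM start of the faulty entry is an assumption derived from the "ten minutes" in the example.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class EmailVersion:
    inscribed_at: datetime  # assertion timeline: when the row landed in the warehouse
    state_from: datetime    # state timeline: from when the value was valid in reality
    email: str

# Rows in order of arrival; the year is assumed for illustration.
versions = [
    EmailVersion(datetime(2024, 5, 12, 13, 58), datetime(2024, 5, 12, 14, 25), "corrected address"),
    EmailVersion(datetime(2024, 5, 12, 14, 2), datetime(2024, 5, 12, 14, 15), "faulty address"),
]

def address_valid_at(rows, business_time):
    """Pick the version whose state timestamp is the latest one not after business_time."""
    candidates = [r for r in rows if r.state_from <= business_time]
    if not candidates:
        return None
    # Business time decides; arrival time only breaks ties between equal state timestamps.
    return max(candidates, key=lambda r: (r.state_from, r.inscribed_at)).email

print(address_valid_at(versions, datetime(2024, 5, 12, 15, 0)))   # corrected address
print(address_valid_at(versions, datetime(2024, 5, 12, 14, 20)))  # faulty address
```

The order of arrival no longer influences the answer: shuffling `versions` gives the same result.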

There's a second reward Yerodin spots immediately. His historical quarterly report stays stable: pull the Q3 report again months later, and it shows the same figures as at quarter-end – because the assertion timeline is no longer overwritten by later corrections. “So my report from back then will still look the same a year from now,” he says. Exactly so.
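Yerodin's stable report can be sketched with the exchange-rate case. The rates 1.10 and 0.90 follow from the 1,100 and 900 euros in the example. The dates, the tuple layout and the function names are hypothetical.

```python
from datetime import date
from decimal import Decimal

# Insert-only rate rows as (rate day, inscribed on, EUR per USD).
# The faulty rate is never overwritten; the correction is a second row.
rate_rows = [
    (date(2024, 9, 10), date(2024, 9, 10), Decimal("1.10")),   # faulty rate
    (date(2024, 9, 10), date(2024, 10, 21), Decimal("0.90")),  # correction, weeks later
]

def rate_as_known(rows, rate_day, known_on):
    """Return the rate for rate_day as the warehouse knew it on known_on."""
    visible = [r for r in rows if r[0] == rate_day and r[1] <= known_on]
    if not visible:
        raise LookupError(f"no rate for {rate_day} known on {known_on}")
    return max(visible, key=lambda r: r[1])[2]

def report_eur(amount_usd, known_on):
    """Convert an amount booked on the faulty day, using the knowledge of known_on."""
    return amount_usd * rate_as_known(rate_rows, date(2024, 9, 10), known_on)

as_of_quarter_end = report_eur(Decimal("1000"), date(2024, 9, 30))
as_of_today = report_eur(Decimal("1000"), date(2025, 1, 15))
print(as_of_quarter_end)  # 1100.00, the figure the Q3 report showed back then
print(as_of_today)        # 900.00, the figure with the correction applied
```

Both answers are reproducible at any time. The report from back then and today's corrected view coexist, and the difference between them can be explained row by row.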

What this looks like in the implementation – including the SQL patterns for end-dating and insert-only – is one module in the Temporal Data Training.

What changes in the implementation

The basic table structure changes less than many fear – it's still a warehouse. But there are two things you should know as an engineer.

The first is end-dating instead of update. You don't overwrite an existing row. You close its business validity period and write the new state as its own row alongside it. Every version stays visible, instead of being hidden by the next one.

The second is the insert-only logic. Your temporal table only grows; it doesn't change existing rows. A correction like the address change doesn't lead to an update, but to several new rows, depending on how the business time periods overlap.
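Both ideas can be combined in a small in-memory sketch. It is a simplification under stated assumptions, not the pattern from the training: a version is identified by its `state_from`, the most recently inscribed row per `state_from` counts as current, and the names `Row`, `current_timeline` and `record_state` are hypothetical. All timestamps, including the prior address, are invented.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass(frozen=True)
class Row:
    state_from: datetime            # business validity start
    state_to: Optional[datetime]    # business validity end, None = open-ended
    inscribed_at: datetime          # when the row was written to the warehouse
    value: str

def current_timeline(rows):
    """Latest-inscribed row per state_from, ordered by business time."""
    latest = {}
    for row in sorted(rows, key=lambda r: r.inscribed_at):
        latest[row.state_from] = row
    return sorted(latest.values(), key=lambda r: r.state_from)

def record_state(rows, state_from, value, inscribed_at):
    """Return a longer list; rows that already exist are never modified."""
    timeline = current_timeline(rows)
    earlier = [r for r in timeline if r.state_from < state_from]
    later = [r for r in timeline if r.state_from > state_from]
    new_rows = []
    if earlier:
        prev = earlier[-1]
        if prev.state_to is None or prev.state_to > state_from:
            # End-dating without update: write a closed copy of the predecessor.
            new_rows.append(Row(prev.state_from, state_from, inscribed_at, prev.value))
    # The new state ends where the next known state begins, if there is one.
    state_to = later[0].state_from if later else None
    new_rows.append(Row(state_from, state_to, inscribed_at, value))
    return rows + new_rows

rows = [Row(datetime(2024, 1, 10, 9, 0), None, datetime(2024, 1, 10, 9, 0), "old address")]
# The correction arrives first, the faulty entry trickles in late.
rows = record_state(rows, datetime(2024, 5, 12, 14, 25), "corrected address",
                    datetime(2024, 5, 12, 13, 58))
rows = record_state(rows, datetime(2024, 5, 12, 14, 15), "faulty address",
                    datetime(2024, 5, 12, 14, 2))

for row in current_timeline(rows):
    print(row.state_from, row.state_to, row.value)
```

Two deliveries produce four new rows, and the original row is still there unchanged. In the current timeline the corrected address is the open-ended state, even though it arrived first.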

Which brings up the next keyword: overlap. How time periods can relate to one another is described by Allen's interval relations – the heart of part 2 of this series.

Three questions to check your own project

Before you decide whether late arrivals affect you, three sober questions about your architecture are worth asking.

First: are there asynchronous paths between your source systems and the warehouse – message buses, queues, interfaces that don't guarantee ordering? If so, the question isn't whether, but when.

Second: how do you handle corrections today? Is the old value overwritten, or does a new version come into being? Overwriting is the clearest symptom that history is being lost.

Third: do your historical reports stay stable when you pull them again months later? If last quarter's report shows different figures today – then you already have the problem, you just haven't named it yet.
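For the first question, a quick measurement often settles the debate. The following sketch assumes your records carry both a source timestamp (when it happened) and a load timestamp (when it arrived). The function name and the sample data are hypothetical.

```python
from datetime import datetime

def count_out_of_order(events):
    """Count records that arrive after a record that happened later in reality.

    events: iterable of (happened_at, arrived_at) pairs.
    """
    latest_seen = None
    late = 0
    for happened_at, _ in sorted(events, key=lambda e: e[1]):
        if latest_seen is not None and happened_at < latest_seen:
            late += 1  # something newer was already delivered before this one
        else:
            latest_seen = happened_at
    return late

# Invented sample: the first event is stuck for three hours.
sample = [
    (datetime(2024, 5, 12, 10, 0), datetime(2024, 5, 12, 13, 5)),
    (datetime(2024, 5, 12, 10, 10), datetime(2024, 5, 12, 10, 12)),
    (datetime(2024, 5, 12, 10, 20), datetime(2024, 5, 12, 10, 21)),
]

print(count_out_of_order(sample))  # 1
```

If this number is anything other than zero on real data, late arrivals are not a theoretical concern in your architecture.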

Amal sums it up for the team: “For us this means – when a correction like that comes in, we must not overwrite the old row. We slot it into the right place via the state timeline. Then the report matches reality again, no matter what order the data reaches us in.”

About this series: This is part 1 of 4 of our series “Bitemporal Data in Practice”, which leads from the concrete engineering pain to the strategic decision. The overview of all four focus areas is given in the hub article “8 reasons why bitemporal data is needed”. Part 2 takes on the technical core: time travel in the data warehouse and the Allen relationship.

Late arrivals in your own project?

In the Temporal Data Training I show, using real exchange-rate and email corrections, how bitemporal historization slots late arrivals into place cleanly – without distorting your reports. The training is currently being re-recorded.

→ Join the waitlist (with an early-bird benefit as a series reader)

An acute late-arrival problem in an ongoing project? Sometimes a short coaching session helps faster than a whole training: temporal.tedamoh.com/coaching

That answers the first question. Late arrivals aren't a one-off you fix by hand – they are the normal case in any architecture with asynchronous paths, and the reason one timeline isn't enough.

Today was about the what and the why. The how follows in part 2: for two timelines to interlock cleanly, their time periods have to be brought into a clear relationship – covering every case where they overlap, touch or contain one another. That mechanism is what we'll look at next.

So long,
Dirk
