
How late arrivals work

Late-arriving events are the hard case. A poll at 10:05 reporting “MAC on Eth7 at 10:03” arrives after the gNMI event “MAC on Eth2 at 10:05” already landed. A naive overwrite would lose the 10:03–10:05 window where the MAC was on Eth7. The reconciler does belief revision instead — preserving both the old and new beliefs as distinct rows in the bitemporal log.

This page walks through what the reconciler actually does.

For each incoming event, the reconciler asks four questions in order:

  1. Is this a duplicate? INSERT ... ON CONFLICT (event_id) DO NOTHING absorbs JetStream redeliveries silently. If the row already exists, we ack and move on.
  2. Is this clock-skewed or queue-dwelled? — if |device_observed_at − collector_wall_clock| > THRESHOLD we emit a device-skew quarantine. If ingested_at − collector_emitted_at > THRESHOLD we emit a queue-dwell quarantine. Either way the event is NOT written to the bitemporal log directly.
  3. Where is the LiveSet pointer for (source, device, mac, vlan)?
    • No pointer → first-sight: INSERT row, open valid_during at observed_at, write to LiveSet.
    • Pointer is the same port → continuation: no-op DB write (the existing row’s valid_during is still open). Just update liveness.last_observed_at.
    • Pointer is a different port → move: close the existing row’s valid_during.upper at observed_at, INSERT a new row at the new port, update LiveSet.
  4. Is the new event’s observed_at in the past? — if it lands before valid_during.lower of the existing row, we’re doing belief revision (the rest of this page).
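The four questions can be read as a small pure function. The sketch below is illustrative only: `Action`, `Pointer`, and `classify` are hypothetical names, not the contents of `transition()`, and the duplicate and quarantine checks are passed in as precomputed booleans. It also assumes that a late event (question 4) takes precedence over the same-port and different-port outcomes of question 3, which is one plausible way for the questions to compose.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    DUPLICATE = "duplicate"
    QUARANTINE = "quarantine"
    FIRST_SIGHT = "first-sight"
    CONTINUATION = "continuation"
    MOVE = "move"
    REVISION = "belief-revision"


@dataclass(frozen=True)
class Pointer:
    """Hypothetical LiveSet entry for one (source, device, mac, vlan) key."""
    port: str           # port of the currently open row
    valid_lower: float  # valid_during.lower of that row, in seconds


def classify(
    *,
    is_duplicate: bool,
    is_quarantined: bool,
    pointer: Optional[Pointer],
    port: str,
    observed_at: float,
) -> Action:
    # Question 1: redelivered event, already stored.
    if is_duplicate:
        return Action.DUPLICATE
    # Question 2: skew or dwell, never written to the log directly.
    if is_quarantined:
        return Action.QUARANTINE
    # Question 3: consult the LiveSet pointer.
    if pointer is None:
        return Action.FIRST_SIGHT
    # Question 4 (assumed to override the port comparison):
    # the event predates the row the pointer refers to.
    if observed_at < pointer.valid_lower:
        return Action.REVISION
    if pointer.port == port:
        return Action.CONTINUATION
    return Action.MOVE
```

Keeping the function free of I/O means each branch can be unit-tested without a database or a message broker.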

Say at 10:00 we received “MAC on Eth1” and wrote:

entry_id=1 port=Eth1 valid_during=[10:00, ∞) recorded_during=[10:00, ∞)

At 10:05 we received “MAC on Eth2”, which the reconciler classified as a move:

entry_id=1 port=Eth1 valid_during=[10:00, 10:05) recorded_during=[10:00, ∞)
entry_id=2 port=Eth2 valid_during=[10:05, ∞) recorded_during=[10:05, ∞)

So far, single-time-axis stores would do the same thing.

Now at 10:20, a late poll arrives: “MAC on Eth7 at 10:03.” This contradicts entry_id=1, which said the MAC stayed on Eth1 from 10:00 until 10:05. The reconciler:

  1. Closes recorded_during on entry_id=1:

    entry_id=1 port=Eth1 valid_during=[10:00, 10:05) recorded_during=[10:00, 10:20)
  2. Inserts the corrected belief as two new rows that split entry_id=1’s valid_during: Eth1 until 10:03, Eth7 from 10:03 to 10:05:

    entry_id=3 port=Eth1 valid_during=[10:00, 10:03) recorded_during=[10:20, ∞)
    entry_id=4 port=Eth7 valid_during=[10:03, 10:05) recorded_during=[10:20, ∞)
  3. The original entry_id=2 (Eth2 from 10:05 onwards) is untouched — the late event didn’t contradict it.

Now a query for valid_during @> '10:04' against the current belief returns entry_id=4 (Eth7). The same query against the belief held at 10:18 (add the filter recorded_during @> '10:18') returns entry_id=1 (Eth1).
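The worked example can be replayed in memory to check the two query results. This is a minimal sketch, not the reconciler's writer: `Row`, `revise`, and `lookup` are hypothetical names, times are minutes since midnight, ranges are half-open `(lower, upper)` tuples, and only the case shown here is handled (a late event that falls strictly inside one currently believed row on a different port).

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple

INF = float("inf")
Range = Tuple[float, float]  # half-open [lower, upper)


def hm(text: str) -> int:
    """Convert 'HH:MM' to minutes since midnight."""
    hours, minutes = text.split(":")
    return int(hours) * 60 + int(minutes)


@dataclass(frozen=True)
class Row:
    entry_id: int
    port: str
    valid: Range     # valid_during
    recorded: Range  # recorded_during


def contains(rng: Range, t: float) -> bool:
    return rng[0] <= t < rng[1]


def revise(log: List[Row], late_port: str, observed_at: float, now: float) -> List[Row]:
    """Close the contradicted belief and append the two rows that replace it."""
    target = next(
        r for r in log if r.recorded[1] == INF and contains(r.valid, observed_at)
    )
    next_id = max(r.entry_id for r in log) + 1
    # Step 1: end recorded_during on the contradicted row; keep the rest as-is.
    out = [replace(r, recorded=(r.recorded[0], now)) if r is target else r for r in log]
    # Step 2: split the old valid range at observed_at.
    out.append(Row(next_id, target.port, (target.valid[0], observed_at), (now, INF)))
    out.append(Row(next_id + 1, late_port, (observed_at, target.valid[1]), (now, INF)))
    return out


def lookup(log: List[Row], valid_t: float, recorded_t: Optional[float] = None) -> Row:
    """Find the row valid at valid_t, as believed now or at recorded_t."""
    for r in log:
        believed = r.recorded[1] == INF if recorded_t is None else contains(r.recorded, recorded_t)
        if believed and contains(r.valid, valid_t):
            return r
    raise LookupError("no matching belief")


log = [
    Row(1, "Eth1", (hm("10:00"), hm("10:05")), (hm("10:00"), INF)),
    Row(2, "Eth2", (hm("10:05"), INF), (hm("10:05"), INF)),
]
log = revise(log, "Eth7", observed_at=hm("10:03"), now=hm("10:20"))

print(lookup(log, hm("10:04")).port)               # Eth7 (entry_id=4)
print(lookup(log, hm("10:04"), hm("10:18")).port)  # Eth1 (entry_id=1)
```

Nothing is deleted or overwritten: entry_id=1 stays in the log with a closed recorded range, which is what makes the 10:18 query answerable.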

Two invariants are load-bearing:

Invariant 1: EXCLUDE constraint per source. The mac_obs_no_overlap_per_source EXCLUDE constraint prevents two currently-believed open rows from overlapping valid_during for the same (mac, device, vlan, source). That means a buggy reconciler can’t write a contradiction within a single source — it’d be rejected by Postgres.
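To make "overlapping" concrete: with half-open ranges, two rows that merely touch at a boundary do not overlap, which is why the split rows in the worked example can coexist. The sketch below is a pure-Python analogue of that test, not the constraint's definition (which this page does not show); `overlaps` and `would_violate` are hypothetical helpers, and each row is modelled as a `(key, valid_range)` pair for currently believed rows only.

```python
from typing import Iterable, Tuple

Range = Tuple[float, float]           # half-open [lower, upper)
Key = Tuple[str, str, int, str]       # (mac, device, vlan, source)


def overlaps(a: Range, b: Range) -> bool:
    """True if two half-open ranges share at least one instant."""
    if a[0] >= a[1] or b[0] >= b[1]:
        return False  # an empty range overlaps nothing
    return a[0] < b[1] and b[0] < a[1]


def would_violate(existing: Iterable[Tuple[Key, Range]], key: Key, valid: Range) -> bool:
    """True if a new row would overlap an existing row with the same key."""
    return any(k == key and overlaps(v, valid) for k, v in existing)


key = ("aa:bb:cc:dd:ee:ff", "sw1", 10, "gnmi")
rows = [(key, (600, 603))]                      # Eth1, 10:00 to 10:03

print(would_violate(rows, key, (603, 605)))     # False: ranges only touch
print(would_violate(rows, key, (602, 605)))     # True: 10:02 to 10:03 is shared
```

A row from a different source never conflicts in this model, matching the "per source" scope of the constraint.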

Invariant 2: LiveSet is patched AFTER commit. The reconciler holds an in-memory LiveSet that mirrors the liveness table. If the LiveSet were mutated before the surrounding transaction commits, a commit failure (deadlock, network error, serialization conflict) could leave it referencing rows that were never persisted, poisoning every subsequent classification for that key. So the writer returns the in-memory updates as a separate object, which the runner applies only after commit. The same invariant applies to the compactor; see the compactor invariant.
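The shape of that hand-off can be sketched as follows. All names here (`LiveSetPatch`, `write_move`, `run_once`) are hypothetical and the transaction is reduced to a `commit` callable that may raise; the real `apply_actions()` is not shown on this page and may differ.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Key = Tuple[str, str, str, int]  # (source, device, mac, vlan)


@dataclass
class LiveSetPatch:
    """In-memory updates the runner should apply once the commit succeeds."""
    set_port: Dict[Key, str] = field(default_factory=dict)


def write_move(statements: List[str], key: Key, new_port: str) -> LiveSetPatch:
    """Queue the DB work for a move and describe the LiveSet change.

    The LiveSet itself is deliberately not touched here.
    """
    statements.append(f"close open row and insert {new_port} for {key}")
    return LiveSetPatch(set_port={key: new_port})


def run_once(
    live_set: Dict[Key, str],
    key: Key,
    new_port: str,
    commit: Callable[[List[str]], None],
) -> None:
    statements: List[str] = []
    patch = write_move(statements, key, new_port)
    commit(statements)                # may raise: deadlock, network error, ...
    live_set.update(patch.set_port)   # reached only if the commit succeeded
```

If `commit` raises, the exception propagates before the `update` call, so the LiveSet still describes the last persisted state and the event can be retried safely.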

If step 2 above kicks in (skew or dwell), the event is not written to the bitemporal log. Instead it gets re-published on the .quarantine NATS subject with headers:

  • l2trace-quarantine-reason: device-skew | queue-dwell | integrity-error
  • l2trace-quarantine-detail: <human description>

The TUI’s OPS screen tails this subject. The original payload is preserved so it can be replayed by hand after the underlying issue is fixed (NTP corrected, queue drained, etc.).
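As a rough illustration of how the skew and dwell checks map onto those headers, the sketch below takes the four timestamps named in step 2 and returns either `None` (the event may proceed) or the header dictionary for the quarantine message. The `Envelope` class, the function name, both threshold values, and the choice to test skew before dwell are assumptions; the header names and reason strings are the ones listed above. Publishing to NATS is left out.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Assumed values; the page only refers to a THRESHOLD.
SKEW_THRESHOLD_S = 30.0
DWELL_THRESHOLD_S = 60.0


@dataclass(frozen=True)
class Envelope:
    """Hypothetical holder for the four timestamps, in epoch seconds."""
    device_observed_at: float
    collector_wall_clock: float
    collector_emitted_at: float
    ingested_at: float


def quarantine_headers(env: Envelope) -> Optional[Dict[str, str]]:
    """Return quarantine headers, or None if the event passes both checks."""
    skew = abs(env.device_observed_at - env.collector_wall_clock)
    if skew > SKEW_THRESHOLD_S:
        return {
            "l2trace-quarantine-reason": "device-skew",
            "l2trace-quarantine-detail": f"device clock differs from collector by {skew:.0f}s",
        }
    dwell = env.ingested_at - env.collector_emitted_at
    if dwell > DWELL_THRESHOLD_S:
        return {
            "l2trace-quarantine-reason": "queue-dwell",
            "l2trace-quarantine-detail": f"event waited {dwell:.0f}s between emit and ingest",
        }
    return None
```

The two checks compare different pairs of timestamps, which is why a device with a bad clock and a backed-up queue produce distinct reasons.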

  • The transition function: src/l2trace/reconciler/state.py::transition()
  • The writer: src/l2trace/reconciler/writer.py::apply_actions()
  • Why bitemporal? — the foundation this builds on
  • Event envelope reference — what the four timestamps mean for the skew vs dwell distinction