How late arrivals work
Late-arriving events are the hard case. A poll at 10:05 reporting “MAC on Eth7 at 10:03” arrives after the gNMI event “MAC on Eth2 at 10:05” already landed. A naive overwrite would lose the 10:03–10:05 window where the MAC was on Eth7. The reconciler does belief revision instead — preserving both the old and new beliefs as distinct rows in the bitemporal log.
This page walks through what the reconciler actually does.
The state machine
For each incoming event, the reconciler asks four questions in order:
- Is this a duplicate? INSERT ... ON CONFLICT (event_id) DO NOTHING absorbs JetStream redeliveries silently. If the row exists, we ack and move on.
- Is this clock-skewed or queue-dwelled? If |device_observed_at − collector_wall_clock| > THRESHOLD we emit a device-skew quarantine. If ingested_at − collector_emitted_at > THRESHOLD we emit a queue-dwell quarantine. Either way the event is NOT written to the bitemporal log directly.
- Where is the LiveSet pointer for (source, device, mac, vlan)?
  - No pointer → first-sight: INSERT row, open valid_during at observed_at, write to LiveSet.
  - Pointer is the same port → continuation: no-op DB write (the existing row’s valid_during is still open). Just update liveness.last_observed_at.
  - Pointer is a different port → move: close the existing row’s valid_during.upper at observed_at, INSERT a new row at the new port, update LiveSet.
- Is the new event’s observed_at in the past? If it lands before valid_during.lower of the existing row, we’re doing belief revision (the rest of this page).
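The last two questions above can be sketched as a small pure function. This is only an illustration, not the code in state.py: the names Kind, Pointer, and classify are hypothetical, and the sketch assumes the event has already passed the duplicate and quarantine checks.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class Kind(Enum):
    FIRST_SIGHT = "first-sight"
    CONTINUATION = "continuation"
    MOVE = "move"
    BELIEF_REVISION = "belief-revision"


@dataclass(frozen=True)
class Pointer:
    """Hypothetical LiveSet entry: the open row for one (source, device, mac, vlan)."""
    port: str
    valid_from: datetime  # valid_during.lower of the open row


def classify(pointer: Optional[Pointer], port: str, observed_at: datetime) -> Kind:
    """Classify an event that is neither a duplicate nor quarantined."""
    # Question 3: where is the LiveSet pointer?
    if pointer is None:
        return Kind.FIRST_SIGHT
    kind = Kind.CONTINUATION if pointer.port == port else Kind.MOVE
    # Question 4: an observation older than the open row overrides the
    # answer to question 3 (assumed ordering; the real transition() may differ).
    if observed_at < pointer.valid_from:
        return Kind.BELIEF_REVISION
    return kind
```

With a pointer at Eth2 opened at 10:05, an event for Eth7 observed at 10:03 comes back as BELIEF_REVISION rather than MOVE, which is the case the rest of this page deals with.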
Belief revision in concrete terms
Say at 10:00 we received “MAC on Eth1” and wrote:

    entry_id=1 port=Eth1 valid_during=[10:00, ∞) recorded_during=[10:00, ∞)

At 10:05 we received “MAC on Eth2”, which the reconciler classified as a move:

    entry_id=1 port=Eth1 valid_during=[10:00, 10:05) recorded_during=[10:00, ∞)
    entry_id=2 port=Eth2 valid_during=[10:05, ∞) recorded_during=[10:05, ∞)

So far, single-time-axis stores would do the same thing.
Now at 10:20, a late poll arrives: “MAC on Eth7 at 10:03.” This
contradicts entry_id=1 (which said Eth1 was open until 10:05). The
reconciler:
- Closes recorded_during on entry_id=1:

      entry_id=1 port=Eth1 valid_during=[10:00, 10:05) recorded_during=[10:00, 10:20)

- Inserts a corrected belief. entry_id=1’s valid_during is now split: Eth1 until 10:03, Eth7 from 10:03 to 10:05:

      entry_id=3 port=Eth1 valid_during=[10:00, 10:03) recorded_during=[10:20, ∞)
      entry_id=4 port=Eth7 valid_during=[10:03, 10:05) recorded_during=[10:20, ∞)

- Leaves the original entry_id=2 (Eth2 from 10:05 onwards) untouched, because the late event didn’t contradict it.
Now valid_during @> '10:04' against the current belief returns
entry_id=4 (Eth7); against the historical 10:18 belief (filter on
recorded_during @> '10:18') returns entry_id=1 (Eth1).
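The walkthrough above can be replayed in memory to check both queries. This is a minimal sketch, not the writer's actual logic: Row, revise, port_at, and hhmm are hypothetical names, times are minutes since midnight, and revise assumes the late observation really does contradict the row it lands in.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple

INF = float("inf")
Interval = Tuple[float, float]  # half-open [lower, upper), minutes since midnight


def hhmm(text: str) -> int:
    """Convert 'HH:MM' to minutes since midnight."""
    hours, minutes = text.split(":")
    return int(hours) * 60 + int(minutes)


@dataclass(frozen=True)
class Row:
    entry_id: int
    port: str
    valid: Interval     # stands in for valid_during
    recorded: Interval  # stands in for recorded_during


def contains(interval: Interval, instant: float) -> bool:
    return interval[0] <= instant < interval[1]


def revise(log: List[Row], port: str, observed_at: int, now: int) -> List[Row]:
    """Fold a late observation into the log without overwriting history."""
    out: List[Row] = []
    next_id = max((row.entry_id for row in log), default=0) + 1
    for row in log:
        currently_believed = row.recorded[1] == INF
        if not (currently_believed and contains(row.valid, observed_at)):
            out.append(row)  # not contradicted: keep as-is
            continue
        # Retire the contradicted belief by closing its recorded interval.
        out.append(replace(row, recorded=(row.recorded[0], now)))
        lower, upper = row.valid
        if lower < observed_at:
            # The old port still holds up to the late observation.
            out.append(Row(next_id, row.port, (lower, observed_at), (now, INF)))
            next_id += 1
        # The newly learned port fills the rest of the old valid interval.
        out.append(Row(next_id, port, (observed_at, upper), (now, INF)))
        next_id += 1
    return out


def port_at(log: List[Row], valid_at: str, believed_at: str) -> Optional[Row]:
    """Return the row valid at one time, as believed at another."""
    v, r = hhmm(valid_at), hhmm(believed_at)
    for row in log:
        if contains(row.valid, v) and contains(row.recorded, r):
            return row
    return None


# State after the 10:05 move, then the late poll processed at 10:20.
before = [
    Row(1, "Eth1", (hhmm("10:00"), hhmm("10:05")), (hhmm("10:00"), INF)),
    Row(2, "Eth2", (hhmm("10:05"), INF), (hhmm("10:05"), INF)),
]
after = revise(before, "Eth7", hhmm("10:03"), hhmm("10:20"))
```

Calling port_at(after, "10:04", "10:30") finds the Eth7 row, while port_at(after, "10:04", "10:18") still finds the original Eth1 row, matching the two queries described above.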
What guarantees this works
Two invariants are load-bearing:
Invariant 1: EXCLUDE constraint per source. The mac_obs_no_overlap_per_source
EXCLUDE constraint prevents two currently-believed open rows from
overlapping valid_during for the same (mac, device, vlan, source).
That means a buggy reconciler can’t write a contradiction within a
single source — it’d be rejected by Postgres.
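Postgres does the real enforcement, but the predicate behind Invariant 1 is easy to mirror in Python for intuition. The sketch below assumes the constraint only compares rows whose recorded_during is still open, as the wording above suggests. The helper names are hypothetical, and intervals are half-open pairs of minutes since midnight.

```python
from typing import Tuple

INF = float("inf")
Interval = Tuple[float, float]  # half-open [lower, upper)


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two half-open intervals share at least one instant."""
    return a[0] < b[1] and b[0] < a[1]


def would_conflict(a_valid: Interval, a_recorded: Interval,
                   b_valid: Interval, b_recorded: Interval) -> bool:
    """Stand-in for the constraint check on two rows with the same key."""
    both_current = a_recorded[1] == INF and b_recorded[1] == INF
    return both_current and overlaps(a_valid, b_valid)


# Rows from the worked example, in minutes since midnight.
eth1_open = ((600, 605), (600, INF))      # entry_id=1 before it is retired
eth1_retired = ((600, 605), (600, 620))   # entry_id=1 after 10:20
eth1_split = ((600, 603), (620, INF))     # entry_id=3
eth7 = ((603, 605), (620, INF))           # entry_id=4

conflict_if_not_retired = would_conflict(*eth1_open, *eth7)
conflict_after_retire = would_conflict(*eth1_retired, *eth7)
conflict_between_splits = would_conflict(*eth1_split, *eth7)
```

Under these assumptions, inserting the Eth7 row without first closing recorded_during on entry_id=1 is exactly the kind of write the constraint would reject.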
Invariant 2: LiveSet is patched AFTER commit. The reconciler holds
an in-memory LiveSet that mirrors the liveness table. Mutating it
before the surrounding transaction commits would mean a commit failure
(deadlock, network error, serialization conflict) could leave the
LiveSet referencing rows that were never persisted — poisoning every
subsequent classification for that key. So the writer returns the
in-memory updates as a separate object the runner applies after
commit. Same invariant applies to the compactor — see
the compactor invariant.
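The commit-then-patch ordering from Invariant 2 can be sketched as follows. LiveSetPatch, LiveSet, and run_batch are hypothetical names, and the write and commit callables stand in for the real writer and database transaction.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple

Key = Tuple[str, str, str, int]  # (source, device, mac, vlan)


@dataclass
class LiveSetPatch:
    """Pending in-memory updates; a port of None means 'drop the pointer'."""
    updates: Dict[Key, Optional[str]] = field(default_factory=dict)


class LiveSet:
    """In-memory mirror of the liveness table (only the port is modeled here)."""

    def __init__(self) -> None:
        self.ports: Dict[Key, str] = {}

    def apply(self, patch: LiveSetPatch) -> None:
        for key, port in patch.updates.items():
            if port is None:
                self.ports.pop(key, None)
            else:
                self.ports[key] = port


def run_batch(live: LiveSet,
              write: Callable[[], LiveSetPatch],
              commit: Callable[[], None]) -> bool:
    """Stage writes, commit, and only then touch the LiveSet."""
    patch = write()  # stages DB writes and returns the patch; LiveSet untouched
    try:
        commit()
    except Exception:
        # A real runner would catch the driver's specific errors and roll back.
        # The patch is simply dropped, so the LiveSet still matches the database.
        return False
    live.apply(patch)
    return True
```

Because the patch is a plain value, a failed commit needs no undo logic on the in-memory side: discarding the object is enough.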
What happens on quarantine
If step 2 above kicks in (skew or dwell), the event is not written
to the bitemporal log. Instead it gets re-published on the
.quarantine NATS subject with headers:
    l2trace-quarantine-reason: device-skew | queue-dwell | integrity-error
    l2trace-quarantine-detail: <human description>
The TUI’s OPS screen tails this subject. The original payload is preserved so it can be replayed by hand after the underlying issue is fixed (NTP corrected, queue drained, etc.).
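The skew and dwell checks, together with the headers they produce, might look roughly like this. The timestamp field names and header keys come from this page. Everything else is assumed for illustration: the Envelope class, the quarantine_headers function, the use of two separate limits instead of one THRESHOLD, and the wording of the detail strings. The integrity-error reason is not covered here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional


@dataclass(frozen=True)
class Envelope:
    """Hypothetical holder for the four timestamps on an event."""
    device_observed_at: datetime
    collector_wall_clock: datetime
    collector_emitted_at: datetime
    ingested_at: datetime


def quarantine_headers(ev: Envelope,
                       skew_limit: timedelta,
                       dwell_limit: timedelta) -> Optional[Dict[str, str]]:
    """Return quarantine headers, or None if the event may be reconciled."""
    # Device clock disagrees with the collector's clock in either direction.
    skew = abs(ev.device_observed_at - ev.collector_wall_clock)
    if skew > skew_limit:
        return {
            "l2trace-quarantine-reason": "device-skew",
            "l2trace-quarantine-detail":
                f"device clock differs from collector by {skew.total_seconds():.0f}s",
        }
    # Event sat in the queue too long between emit and ingest.
    dwell = ev.ingested_at - ev.collector_emitted_at
    if dwell > dwell_limit:
        return {
            "l2trace-quarantine-reason": "queue-dwell",
            "l2trace-quarantine-detail":
                f"event waited {dwell.total_seconds():.0f}s in the queue",
        }
    return None
```

A caller would attach the returned dictionary as message headers when re-publishing the untouched payload, which keeps the original event replayable.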
See also
- The transition function: src/l2trace/reconciler/state.py::transition()
- The writer: src/l2trace/reconciler/writer.py::apply_actions()
- Why bitemporal? (the foundation this builds on)
- Event envelope reference (what the four timestamps mean for the skew vs dwell distinction)