The compactor invariant

The compactor closes valid_during on MACs that have aged out of their source’s threshold. It runs as a co-process in the reconciler container, sharing one in-memory LiveSet object. This page explains why that shared-LiveSet design exists, and why a subtle ordering bug in the compactor could silently corrupt the bitemporal log.

  • Reconciler writes events from NATS → Postgres + LiveSet.
  • Compactor periodically asks Postgres: “any MACs that haven’t been observed in N seconds?” If yes, it closes valid_during on those rows and DELETEs the matching liveness row.
  • Both processes share one LiveSet object (passed via the liveset parameter; co-running in the same event loop).

The LiveSet is the reconciler’s hot-path classification cache. If it gets out of sync with the database, the reconciler silently misclassifies events.
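The LiveSet's implementation isn't shown on this page. As a rough mental model only — the key shape, the Entry fields, and the method bodies below are assumptions, not the real class — it behaves like a dict keyed per source:

```python
from dataclasses import dataclass

# Hypothetical sketch of LiveSet behavior. The real class lives in the
# l2trace codebase; only the method names (upsert/lookup/remove) appear
# elsewhere on this page — everything else here is an assumption.
@dataclass
class Entry:
    port: str
    last_seen: float

class LiveSet:
    """In-memory map of currently-open observations."""

    def __init__(self):
        self._entries = {}  # (device_id, source, mac, vlan) -> Entry

    def upsert(self, device_id, source, mac, vlan, entry):
        self._entries[(device_id, source, mac, vlan)] = entry

    def lookup(self, device_id, source, mac, vlan):
        # None means "not seen before": the reconciler then
        # classifies the event as first-sight
        return self._entries.get((device_id, source, mac, vlan))

    def remove(self, device_id, source, mac, vlan):
        # Called when the compactor closes the row's valid_during
        self._entries.pop((device_id, source, mac, vlan), None)
```

The important property is that `lookup` returning None is what drives first-sight classification — which is exactly why a spurious `remove` is dangerous.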

The first version of compact_once did this:

async def compact_once(session, settings, liveset=None):
    rs = await session.execute(_AGE_OUT_SQL, {...})
    closed_keys = list(rs.mappings())
    if liveset is not None:
        for k in closed_keys:
            liveset.remove(...)  # ← MUTATES IN-MEMORY STATE
    return stats

Then in the loop:

while True:
    async with session_scope() as session:
        await compact_once(session, settings, liveset=liveset)
    # session_scope commits here
    await asyncio.sleep(interval)

That looks correct. It isn’t.

session_scope() commits the transaction on __aexit__. Anything between session.execute(...) and __aexit__ is before commit. If the commit fails — deadlock, network blip, pool timeout, serialization conflict — the SQL was rolled back, but liveset.remove(...) already happened.
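The exact implementation of session_scope isn't shown here, but the typical shape of such a helper (a sketch assuming SQLAlchemy-style async sessions; names and structure are assumptions) makes the ordering problem concrete:

```python
from contextlib import asynccontextmanager

# Sketch of a session_scope-style context manager (assumed shape, not the
# project's actual code). The commit runs on __aexit__, so any in-memory
# mutation inside the block happens BEFORE the commit it depends on.
@asynccontextmanager
async def session_scope(session_factory):
    session = session_factory()
    try:
        yield session
        await session.commit()    # <- can still fail: deadlock, network blip...
    except BaseException:
        await session.rollback()  # SQL is undone; Python-side mutations are not
        raise
    finally:
        await session.close()
```

Everything yielded to the caller executes between `session_factory()` and `commit()` — including, in the buggy version, the LiveSet removal.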

After rollback:

  • The mac_observation row’s valid_during is still open in the database.
  • The liveness row is still there.
  • The in-memory LiveSet has no entry for that key.
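The divergence is easy to reproduce with a toy model (everything below is a sketch with assumed names: the "database" is a dict mapping MAC key to an is-open flag, and the commit failure is simulated with an exception):

```python
import asyncio

class CommitFailed(Exception):
    pass

async def buggy_compact(db, liveset, fail_commit):
    # "SQL phase": find aged-out keys (a toy query against a dict database)
    closed = [k for k, is_open in db.items() if is_open]
    for k in closed:
        liveset.discard(k)       # BUG: in-memory mutation before commit
    if fail_commit:
        raise CommitFailed()     # rollback: the dict rows stay open
    for k in closed:
        db[k] = False            # commit: close valid_during

db = {"aa:bb:cc:dd:ee:ff": True}  # True = valid_during still open
liveset = {"aa:bb:cc:dd:ee:ff"}
try:
    asyncio.run(buggy_compact(db, liveset, fail_commit=True))
except CommitFailed:
    pass

# Divergence: the DB says the row is still open, the LiveSet says "never seen"
assert db["aa:bb:cc:dd:ee:ff"] is True
assert "aa:bb:cc:dd:ee:ff" not in liveset
```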

The next event for the same MAC will be classified as first-sight by the reconciler (LiveSet says “not seen before”). It writes a new INSERT — but the database already has an open row, so the EXCLUDE constraint fires:

ERROR: conflicting key value violates exclusion constraint
"mac_obs_no_overlap_per_source"

Or worse, in a different interleaving: if the old row’s valid_during happens to be closed at the moment the new INSERT runs, the row passes the constraint check, and the log ends up with two rows asserting that the same MAC was on two different ports simultaneously, within the same source. The disagreement view will surface this, but the bitemporal log is still corrupted.

Split the compactor into two phases:

async def compact_once(session, settings) -> tuple[list[Stats], list[ClosedKey]]:
    # SQL phase only — no LiveSet mutation
    rs = await session.execute(_AGE_OUT_SQL, {...})
    closed_keys = [ClosedKey(...) for row in rs.mappings()]
    return stats, closed_keys

def apply_compaction_to_liveset(liveset, closed_keys):
    # In-memory phase — runs AFTER the transaction commits
    for k in closed_keys:
        liveset.remove(...)

The runner pattern is now:

async with session_scope() as session:
    _stats, closed_keys = await compact_once(session, settings)
# session_scope committed here — DB state is durable
if liveset is not None and closed_keys:
    apply_compaction_to_liveset(liveset, closed_keys)

LiveSet invalidation only happens once we know the SQL succeeded.
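The same toy model from earlier shows why the split is safe: when the commit fails, the apply phase never runs, so cache and database stay in agreement (again a sketch with assumed names, using a dict as the database):

```python
def compact_sql_phase(db):
    # SQL phase only: compute which keys aged out; mutate nothing in memory
    return [k for k, is_open in db.items() if is_open]

def apply_compaction_to_liveset(liveset, closed_keys):
    # In-memory phase: runs only once the commit is known durable
    for k in closed_keys:
        liveset.discard(k)

db = {"aa:bb:cc:dd:ee:ff": True}   # True = valid_during still open
liveset = {"aa:bb:cc:dd:ee:ff"}
commit_ok = False                   # simulate a failed commit

closed = compact_sql_phase(db)
if commit_ok:
    for k in closed:
        db[k] = False               # the "commit"
    apply_compaction_to_liveset(liveset, closed)

# Commit failed: neither side changed, so the next event classifies correctly
assert db["aa:bb:cc:dd:ee:ff"] is True
assert "aa:bb:cc:dd:ee:ff" in liveset
```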

The codebase includes a test that fails if anyone ever puts the mutation back inline. From tests/test_compactor_race.py:

async def test_compact_once_does_not_mutate_liveset_inline(session):
    # ... seed an observation + liveness row
    liveset = LiveSet()
    liveset.upsert(device_id, source, mac, vlan, entry)
    _stats, _closed = await compact_once(session, settings)
    # CRITICAL: LiveSet entry must still be present until the caller applies it.
    assert liveset.lookup(...) is not None, (
        "compact_once must NOT mutate LiveSet inline — invalidation belongs "
        "to the caller, run AFTER the surrounding session_scope() commits"
    )

A future contributor who “simplifies” compact_once by re-introducing inline LiveSet mutation will see this test fail with a clear message and a Hamilton reference.

The reconciler’s writer has the same shape: apply_actions does SQL only, returning the LiveSet updates as a separate value; apply_to_liveset applies them after commit. From writer.py::apply_actions docstring:

Mutating in-memory state before commit is unsafe: a commit failure (deadlock, network error, serialization failure) would leave the LiveSet referencing a row that was never persisted, poisoning every subsequent classification for that key.

This is one shape of bug, applied to two places. Margaret Hamilton’s review of the first cut found the writer version; the polish pass found the compactor version. The general rule:

In-memory state is patched only after the surrounding transaction durably commits.

If you see code that mutates a cache before commit, that’s a bug.
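Stated as a generic pattern (a sketch, not project code; the helper name and shapes are made up for illustration), the rule reduces to: compute the cache patch during the transaction, apply it only after the durable step succeeds.

```python
import asyncio

async def transact_then_patch(commit, patch):
    """Run the durable step; apply the in-memory patch only on success."""
    await commit()   # may raise: deadlock, network blip, pool timeout...
    patch()          # reached only when the commit is known durable

cache = {"key": "stale"}

async def good_commit():
    pass  # stands in for a successful DB commit

async def failing_commit():
    raise RuntimeError("serialization conflict")

# Success: the cache is patched after the durable step.
asyncio.run(transact_then_patch(good_commit, lambda: cache.pop("key")))
assert "key" not in cache

# Failure: the cache is untouched, still consistent with the rolled-back DB.
cache["key"] = "stale"
try:
    asyncio.run(transact_then_patch(failing_commit, lambda: cache.pop("key")))
except RuntimeError:
    pass
assert cache["key"] == "stale"
```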

  • The compactor source: src/l2trace/reconciler/compactor.py
  • How late arrivals work — the writer’s classification flow, where the same invariant applies