Event envelope

Every collector — gNMI streaming, SNMP polling, SSH show parsing — emits the same shape of event onto NATS. The reconciler is the only consumer. The shape is enforced by Pydantic models in src/l2trace/events/schema.py.

class EventEnvelope(BaseModel, frozen=True):
    event_id: UUID                  # UUIDv7, derived deterministically
    kind: EventKind                 # MAC_LEARNED, MAC_REMOVED, etc.
    source: Source                  # GNMI, SNMP, SSH, NETCONF, RECONCILER
    device_id: int
    payload: EventPayload           # union by kind

    # Four timestamps
    device_observed_at: datetime    # device clock when it saw the event
    observed_at: datetime           # corrected device time (skew-fixed)
    collector_emitted_at: datetime  # collector wall-clock at NATS publish
    ingested_at: datetime           # reconciler wall-clock at consume

A single timestamp can’t distinguish:

  • “The device’s clock is 5 minutes off.”
  • “The event sat in a queue for 5 minutes.”

These have the same symptom (now - timestamp = 5min) but completely different remediation. The four-timestamp model lets you tell them apart:

| Comparison | What it tells you |
| --- | --- |
| observed_at − device_observed_at | NTP skew at the device |
| collector_emitted_at − observed_at | Latency from device → collector |
| ingested_at − collector_emitted_at | Queue dwell time in NATS |
| ingested_at − observed_at | Total end-to-end latency |
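As a minimal sketch, the four comparisons can be computed directly from the envelope's timestamps. The `Timestamps` dataclass and the `latency_breakdown` helper are hypothetical stand-ins for illustration; they are not part of src/l2trace/events/schema.py.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Timestamps:
    """Hypothetical stand-in for the four timestamp fields of EventEnvelope."""
    device_observed_at: datetime
    observed_at: datetime
    collector_emitted_at: datetime
    ingested_at: datetime


def latency_breakdown(ts: Timestamps) -> dict[str, timedelta]:
    """Return the four differences listed in the comparison table."""
    return {
        "device_skew": ts.observed_at - ts.device_observed_at,
        "device_to_collector": ts.collector_emitted_at - ts.observed_at,
        "queue_dwell": ts.ingested_at - ts.collector_emitted_at,
        "end_to_end": ts.ingested_at - ts.observed_at,
    }


# Example: the device clock runs 5 minutes slow, but the pipeline is healthy.
t0 = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
sample = Timestamps(
    device_observed_at=t0 - timedelta(minutes=5),
    observed_at=t0,
    collector_emitted_at=t0 + timedelta(seconds=1),
    ingested_at=t0 + timedelta(seconds=2),
)
breakdown = latency_breakdown(sample)
```

In this example the 5-minute gap shows up only in `device_skew`, while `queue_dwell` stays at one second, which is the distinction a single timestamp cannot make.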

The reconciler uses these to classify quarantine events:

  • device-skew: |device_observed_at − collector_wall_clock| > THRESHOLD → the device’s clock is off; corrected observed_at is best-effort
  • queue-dwell: ingested_at − collector_emitted_at > THRESHOLD → the event was stale by the time the reconciler saw it; honoring it could rewrite a fact we already corrected
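The two checks above can be sketched as a small classifier. The threshold values and the `quarantine_reasons` function name are assumptions made for illustration; the document does not state the real thresholds or how the reconciler structures this logic.

```python
from datetime import datetime, timedelta, timezone

# Assumed values: the document does not state the real thresholds.
SKEW_THRESHOLD = timedelta(seconds=30)
DWELL_THRESHOLD = timedelta(seconds=60)


def quarantine_reasons(
    device_observed_at: datetime,
    collector_wall_clock: datetime,
    collector_emitted_at: datetime,
    ingested_at: datetime,
) -> list[str]:
    """Return every quarantine class that applies (possibly none)."""
    reasons = []
    # device-skew: the device clock disagrees with the collector's clock.
    if abs(device_observed_at - collector_wall_clock) > SKEW_THRESHOLD:
        reasons.append("device-skew")
    # queue-dwell: the event waited too long between publish and consume.
    if ingested_at - collector_emitted_at > DWELL_THRESHOLD:
        reasons.append("queue-dwell")
    return reasons


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
# Device clock is 5 minutes behind; delivery took one second.
skewed = quarantine_reasons(now - timedelta(minutes=5), now, now, now + timedelta(seconds=1))
# Device clock is fine; the event sat in the queue for 5 minutes.
stale = quarantine_reasons(now, now, now, now + timedelta(minutes=5))
```

Returning a list rather than a single label is a design choice of this sketch: an event can be both skewed and stale.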

UUIDv7 is timestamp-prefixed — sorting by event_id is approximately sorting by event time, which is useful for cheap iteration over recent events.

The event_id is deterministically derived from (source, device_id, mac, vlan, port_name, device_observed_at) via SHA-256. Re-emitting the same observation produces the same event_id, which the reconciler’s ON CONFLICT (event_id) DO NOTHING clause absorbs harmlessly. JetStream’s at-least-once delivery is therefore free to redeliver — we just ack the duplicate.

A consequence: if a collector “corrects” device_observed_at after the fact (NTP catches up, the device clock jumps), the corrected event gets a different event_id and writes a new row. That’s intentional — the original event with its broken timestamp is preserved as part of the bitemporal record.
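One way to get an ID that is both timestamp-prefixed and deterministic is to place the millisecond timestamp in the leading 48 bits of a UUIDv7 and fill the remaining bits from the SHA-256 digest instead of random data. The sketch below works under that assumption: the key serialization, the bit layout, and the `derive_event_id` name are illustrative guesses, not the project's actual derivation.

```python
import hashlib
import uuid
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def derive_event_id(
    source: str,
    device_id: int,
    mac: str,
    vlan: int,
    port_name: str,
    device_observed_at: datetime,  # must be timezone-aware
) -> uuid.UUID:
    """Build a UUIDv7-shaped ID whose non-timestamp bits come from SHA-256."""
    observed_utc = device_observed_at.astimezone(timezone.utc)
    # Assumed serialization of the identifying tuple.
    key = "|".join(
        (source, str(device_id), mac, str(vlan), port_name, observed_utc.isoformat())
    )
    digest = hashlib.sha256(key.encode("utf-8")).digest()

    # 48-bit Unix timestamp in milliseconds, computed with integer arithmetic.
    ts_ms = (observed_utc - EPOCH) // timedelta(milliseconds=1)
    value = ((ts_ms & 0xFFFFFFFFFFFF) << 80) | int.from_bytes(digest[:10], "big")

    # Overwrite the version nibble (7) and the variant bits (0b10).
    value = (value & ~(0xF << 76)) | (0x7 << 76)
    value = (value & ~(0x3 << 62)) | (0x2 << 62)
    return uuid.UUID(int=value)


t = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
first = derive_event_id("GNMI", 42, "00:1a:a1:11:22:33", 10, "Ethernet1/1", t)
replay = derive_event_id("GNMI", 42, "00:1a:a1:11:22:33", 10, "Ethernet1/1", t)
corrected = derive_event_id(
    "GNMI", 42, "00:1a:a1:11:22:33", 10, "Ethernet1/1", t + timedelta(minutes=5)
)
```

Here `first` and `replay` are equal, so a redelivered observation collides on the same key, while `corrected` differs and sorts after `first` because its timestamp prefix is larger.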

payload is a Pydantic discriminated union on kind. The common variants:

class MacLearned(BaseModel, frozen=True):
    mac: str                                # normalized
    vlan: int
    port_name: str                          # vendor name like "Ethernet1/1"
    entry_type: MacType = MacType.DYNAMIC   # or STATIC, SECURE

class MacRemoved(BaseModel, frozen=True):
    mac: str
    vlan: int
    port_name: str

class PortStateChanged(BaseModel, frozen=True):
    port_name: str
    state: str                              # "up", "down", "admin_down", ...

class LldpNeighborUpdate(BaseModel, frozen=True):
    local_port_name: str
    remote_chassis_id: str
    remote_port_descr: str | None
    protocol: AdjProto = AdjProto.LLDP

class DeviceIdentified(BaseModel, frozen=True):
    chassis_id: str                         # canonical MAC like '00:1a:a1:11:22:33'

(See src/l2trace/events/schema.py for the authoritative list, plus the snapshot variants CamSnapshot / LldpSnapshot / StpSnapshot that bundle many entries into one envelope for SNMP poll cycles.)

DeviceIdentified — the smallest payload, biggest unlock

DeviceIdentified carries a single field (chassis_id), but it unblocks every pending peer-resolution UPDATE in the adjacency table. Both the gNMI and SNMP collectors emit it:

  • gNMI reads /lldp/state/chassis-id from the OpenConfig subscription and emits on every refresh.
  • SNMP walks lldpLocChassisId once per poll cycle and emits before the LLDP neighbor walk events, so same-cycle eager peer resolution can use the just-known chassis_id.

The reconciler turns each event into two UPDATEs:

  1. device.chassis_id = :chassis_id WHERE id = :device_id — always.
  2. adjacency.remote_device_id = :device_id WHERE remote_chassis_id = :chassis_id — backfills every pending row.

Idempotent: re-emitting the same chassis_id is a no-op UPDATE. See How peer resolution works for why this two-phase resolve-and-backfill pattern beats waiting for both ends to be registered.
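The two UPDATEs and their idempotency can be sketched against an in-memory SQLite database. The table layout here is a guess that contains only the columns named above, and `apply_device_identified` is a hypothetical helper; the real schema and reconciler code may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schema: only the columns mentioned in the text.
conn.executescript(
    """
    CREATE TABLE device (id INTEGER PRIMARY KEY, chassis_id TEXT);
    CREATE TABLE adjacency (
        id INTEGER PRIMARY KEY,
        local_device_id INTEGER NOT NULL,
        remote_chassis_id TEXT NOT NULL,
        remote_device_id INTEGER
    );
    """
)


def apply_device_identified(db: sqlite3.Connection, device_id: int, chassis_id: str) -> None:
    """Apply both UPDATEs for one DeviceIdentified event in a single transaction."""
    params = {"device_id": device_id, "chassis_id": chassis_id}
    with db:
        # 1. Record the chassis ID on the device itself.
        db.execute("UPDATE device SET chassis_id = :chassis_id WHERE id = :device_id", params)
        # 2. Backfill every adjacency row that was waiting on this chassis ID.
        db.execute(
            "UPDATE adjacency SET remote_device_id = :device_id "
            "WHERE remote_chassis_id = :chassis_id",
            params,
        )


conn.execute("INSERT INTO device (id) VALUES (42), (99)")
# Device 99 already reported a neighbor whose chassis ID is not yet resolved.
conn.execute(
    "INSERT INTO adjacency (local_device_id, remote_chassis_id) "
    "VALUES (99, '00:1a:a1:11:22:33')"
)

apply_device_identified(conn, 42, "00:1a:a1:11:22:33")
apply_device_identified(conn, 42, "00:1a:a1:11:22:33")  # replay changes nothing

device_row = conn.execute("SELECT chassis_id FROM device WHERE id = 42").fetchone()
adjacency_rows = conn.execute(
    "SELECT local_device_id, remote_device_id FROM adjacency"
).fetchall()
```

After the first call the pending row on device 99 points at device 42; the second call rewrites the same values, so the state is unchanged.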

Events are published on subjects of the form {NATS_SUBJECT_PREFIX}.{source}.{device_id}.{kind}, e.g.:

l2trace.gnmi.42.mac_learned
l2trace.snmp.99.mac_removed
l2trace.reconciler.42.quarantine

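The reconciler consumes l2trace.> with a durable JetStream consumer. Quarantine events carry their own final token (.quarantine), so the OPS screen can tail them independently with a filter such as l2trace.*.*.quarantine; a bare *.quarantine would not match, because the NATS * wildcard covers exactly one token.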
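To illustrate the subject layout and how NATS wildcards apply to it, here is a small pure-Python sketch. `build_subject` and `subject_matches` are hypothetical helpers written for this example (lowercasing the enum names is an assumption based on the sample subjects); they mimic the documented NATS rules that `*` matches exactly one token and `>` matches one or more trailing tokens, and they do not call the NATS client library.

```python
def build_subject(prefix: str, source: str, device_id: int, kind: str) -> str:
    """Join the four tokens; enum names are assumed to be lowercased."""
    return f"{prefix}.{source.lower()}.{device_id}.{kind.lower()}"


def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a NATS-style pattern matches the subject."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, token in enumerate(p_tokens):
        if token == ">":
            # '>' must be the last pattern token and needs at least one token to match.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if token != "*" and token != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


learned = build_subject("l2trace", "GNMI", 42, "MAC_LEARNED")
quarantine = build_subject("l2trace", "RECONCILER", 42, "QUARANTINE")

reconciler_sees_learned = subject_matches("l2trace.>", learned)
ops_sees_quarantine = subject_matches("l2trace.*.*.quarantine", quarantine)
ops_sees_learned = subject_matches("l2trace.*.*.quarantine", learned)
```

Because the kind is the last token, a filter with `*` in the source and device positions selects one kind across all devices without touching the rest of the stream.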