Architecture at a glance
┌──────────────┐ gNMI ┌─────────────────┐ ┌──────────────────────────┐
│ network      │─────▶│ gNMI collector  │ │ device_collector table   │
│ device(s)    │ SNMP │ SNMP collector  │◀│ (per-device source cfg)  │
│              │─────▶│ SSH collector   │ └──────────────────────────┘
│              │ SSH  │ (napalm)        │              ▲
│              │─────▶└────────┬────────┘              │ reconfig
└──────────────┘               │                       │ every ~30s
                               ▼                       │
                      ┌────────────────┐      ┌────────┴────────┐
                      │ NATS JetStream │      │ orchestrator    │
                      │ (event bus)    │      │ (spawn/recycle  │
                      └────────┬───────┘      │  workers)       │
                               │              └─────────────────┘
               ┌───────────────┴───────────────┐
               ▼                               ▼
      ┌────────────────┐               ┌────────────────────┐
      │ reconciler     │── on skew ───▶│ quarantine subject │
      │ state machine  │               │ (audit + replay)   │
      │ + bitemporal   │               └────────────────────┘
      │ writer         │
      └───────┬────────┘
              │  • MAC_LEARNED / MAC_REMOVED  → mac_observation
              │  • LLDP_NEIGHBOR_UPDATE       → adjacency
              │  • DEVICE_IDENTIFIED          → device.chassis_id +
              │                                 adjacency backfill
              ▼
     ┌──────────────────┐      reads
     │ Postgres + AGE   │◀──────────────────┐
     │ bitemporal log   │                   │
     └──────────────────┘                   │
              ▲                             │
              │ co-runs                     │
              │ (shared LiveSet)            │
     ┌──────────────────┐                   │
     │ compactor        │                   │
     └──────────────────┘                   │
                                            │
                                ┌───────────┴──────────┐
                                │ Textual TUI          │
                                │ TRACE / HISTORY      │
                                │ OPS / AUDIT          │
                                └──────────────────────┘

The pieces
Collectors stream CAM/MAC table updates off network devices. Three ship today:
- gNMI subscribes to openconfig-network-instance (FDB) and /lldp/... (adjacencies + local chassis-id), and emits a canonical event envelope onto NATS as updates land. Sub-second latency on supported platforms.
- SNMP polls Q-BRIDGE-MIB / BRIDGE-MIB (FDB) and LLDP-MIB (adjacencies + local chassis-id) on a 60-second interval. Universal compatibility, higher latency. The reconciler treats it as a backstop when gNMI lags.
- SSH polls per-vendor show commands via napalm, which exposes a vendor-normalized API (get_mac_address_table(), get_lldp_neighbors_detail()) across Cisco IOS-XE / NX-OS / Arista EOS / Juniper JunOS, with the same shape regardless of vendor. A small per-vendor cli([show...]) sidecar regexes the chassis MAC out of show version (IOS-XE) or show lldp local-info (NX-OS) so DeviceIdentified works under SSH too. Last-resort backstop for switches with neither gNMI nor SNMP exposed.
A NETCONF collector is on the roadmap as a fourth source for platforms that prefer it over gNMI streaming.
The orchestrator runs inside l2trace reconcile. It reads the
device_collector config table and (re)spawns one async worker per
(device, source) row. On a ~30-second reconfig pass it picks up new
rows added via l2trace device add, stops disabled workers, and respawns
crashed ones with exponential backoff.
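The reconfig pass described above can be sketched as a set difference between the rows in the config table and the workers currently running. This is a minimal illustration, not l2trace's actual code: the names (`Orchestrator`, `plan_reconfig`, `backoff_delay`), the 1-second base and 60-second cap are all assumptions, and crashed workers are respawned here by a per-worker supervisor loop, which may differ from how the real orchestrator schedules respawns.

```python
import asyncio
from typing import Awaitable, Callable

WorkerKey = tuple[str, str]  # (device, source), one per device_collector row


def backoff_delay(failures: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before the next respawn: base * 2**(failures - 1), capped."""
    if failures <= 0:
        return 0.0
    return min(cap, base * 2 ** (failures - 1))


def plan_reconfig(
    desired: set[WorkerKey], running: set[WorkerKey]
) -> tuple[set[WorkerKey], set[WorkerKey]]:
    """Diff the config table against live workers: returns (to_start, to_stop)."""
    return desired - running, running - desired


class Orchestrator:
    def __init__(self, run_worker: Callable[[WorkerKey], Awaitable[None]]) -> None:
        self._run_worker = run_worker  # hypothetical collector entry point
        self._tasks: dict[WorkerKey, asyncio.Task] = {}

    def running(self) -> set[WorkerKey]:
        return set(self._tasks)

    async def _supervise(self, key: WorkerKey) -> None:
        failures = 0
        while True:
            try:
                await self._run_worker(key)
                return  # worker exited cleanly; nothing to respawn
            except asyncio.CancelledError:
                raise  # stopped on purpose by reconfig_pass
            except Exception:
                failures += 1
                await asyncio.sleep(backoff_delay(failures))

    async def reconfig_pass(self, desired: set[WorkerKey]) -> None:
        """One pass: stop workers whose row is gone, start workers for new rows."""
        to_start, to_stop = plan_reconfig(desired, self.running())
        for key in to_stop:
            self._tasks.pop(key).cancel()
        for key in to_start:
            self._tasks[key] = asyncio.create_task(self._supervise(key))
```

In this sketch a caller would invoke `reconfig_pass` roughly every 30 seconds with the enabled rows read from device_collector.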
The event envelope carries four timestamps: when the device saw the event, when the device’s clock said it saw it, when the collector emitted it, and when the reconciler ingested it. Operators can therefore tell “the switch’s clock is off” apart from “the event sat in a queue for two minutes”: the same symptom, with completely different remediation.
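To make the distinction concrete, here is a small sketch of how the four timestamps could separate the two failure modes. The field names, the `diagnose` helper, and the 5-second tolerance are hypothetical and chosen only for illustration; the real envelope schema may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class EventTimestamps:
    # Field names are assumptions, one per timestamp described above.
    device_observed_at: datetime      # when the device saw the event
    device_clock_at: datetime         # what the device's own clock reported
    collector_emitted_at: datetime    # when the collector emitted the envelope
    reconciler_ingested_at: datetime  # when the reconciler ingested it


def diagnose(ts: EventTimestamps, tolerance: timedelta = timedelta(seconds=5)) -> list[str]:
    """Name which hop explains a stale-looking event."""
    findings = []
    # The device's clock disagrees with when the event actually happened.
    if abs(ts.device_clock_at - ts.device_observed_at) > tolerance:
        findings.append("clock_skew")
    # The event waited between the collector and the reconciler.
    if ts.reconciler_ingested_at - ts.collector_emitted_at > tolerance:
        findings.append("queue_delay")
    return findings


t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
skewed = EventTimestamps(t0, t0 + timedelta(minutes=2),
                         t0 + timedelta(seconds=1), t0 + timedelta(seconds=2))
queued = EventTimestamps(t0, t0,
                         t0 + timedelta(seconds=1), t0 + timedelta(minutes=2))
```

Both sample events show a two-minute gap between the device-reported time and the ingest time, yet `diagnose(skewed)` points at the switch clock while `diagnose(queued)` points at the bus.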
NATS JetStream is the durable event bus. Subject-per-device routing keeps ordering per source within reasonable bounds. JetStream’s retention is the audit log: if a row in Postgres looks wrong, you can replay the original event for as long as it is within the retention window (7 days by default).
The reconciler is a state machine. It dedupes by event_id, looks up
the port, classifies the event (first-sight / move / continuation /
quarantine), and writes through to the bitemporal log in one transaction.
It dispatches by event kind:
- MAC_LEARNED / MAC_REMOVED → bitemporal write into mac_observation
- LLDP_NEIGHBOR_UPDATE → bitemporal write into adjacency, with eager peer resolution by remote_chassis_id
- DEVICE_IDENTIFIED → UPDATE device.chassis_id + backfill every pending adjacency.remote_device_id matching that chassis_id (How peer resolution works)
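The dedupe-then-dispatch flow can be sketched with in-memory lists standing in for the Postgres tables. Everything here is an assumption made for illustration: the `Reconciler` class, the dict-shaped event, and its keys (`device_id`, `mac`, `port`, `chassis_id`) are not taken from the l2trace source, and port lookup, classification, and the bitemporal columns are omitted.

```python
class Reconciler:
    def __init__(self) -> None:
        self.seen_event_ids: set[str] = set()
        self.mac_observation: list[dict] = []
        self.adjacency: list[dict] = []
        self.chassis_to_device: dict[str, str] = {}
        self._handlers = {
            "MAC_LEARNED": self._on_mac,
            "MAC_REMOVED": self._on_mac,
            "LLDP_NEIGHBOR_UPDATE": self._on_lldp,
            "DEVICE_IDENTIFIED": self._on_identified,
        }

    def handle(self, event: dict) -> bool:
        """Return False for a duplicate event_id, True once the event is applied."""
        if event["event_id"] in self.seen_event_ids:
            return False
        handler = self._handlers.get(event["kind"])
        if handler is None:
            raise ValueError(f"unknown event kind: {event['kind']}")
        handler(event)
        self.seen_event_ids.add(event["event_id"])
        return True

    def _on_mac(self, event: dict) -> None:
        self.mac_observation.append({
            "mac": event["mac"],
            "port": event["port"],
            "learned": event["kind"] == "MAC_LEARNED",
        })

    def _on_lldp(self, event: dict) -> None:
        chassis = event["remote_chassis_id"]
        self.adjacency.append({
            "local_device_id": event["device_id"],
            "remote_chassis_id": chassis,
            # Eager resolution: filled now if the peer is already identified.
            "remote_device_id": self.chassis_to_device.get(chassis),
        })

    def _on_identified(self, event: dict) -> None:
        chassis = event["chassis_id"]
        self.chassis_to_device[chassis] = event["device_id"]
        # Backfill adjacencies that were waiting for this chassis_id.
        for row in self.adjacency:
            if row["remote_device_id"] is None and row["remote_chassis_id"] == chassis:
                row["remote_device_id"] = event["device_id"]
```

An LLDP update that arrives before its peer is identified is stored with `remote_device_id` unset, and the later DEVICE_IDENTIFIED event fills it in.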
Late-arriving events trigger belief revision: the recorded_during of the
previously current row is closed, and a new row is inserted at the historical
valid_during. See How late arrivals work.
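As a rough in-memory model of belief revision, the sketch below keeps two half-open intervals per row, one for validity and one for recording. The `Row` type, the `revise` and `as_of` helpers, and the use of a far-future `FOREVER` sentinel for open-ended ranges are assumptions. Re-inserting the untouched parts of the old row's validity is also an assumption about how neighbouring history is preserved; the text above only states that the old row is closed and a new one inserted.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from typing import Optional

FOREVER = datetime.max.replace(tzinfo=timezone.utc)  # stands in for an open upper bound


@dataclass(frozen=True)
class Row:
    mac: str
    port: str
    valid_from: datetime      # valid_during lower bound (inclusive)
    valid_to: datetime        # valid_during upper bound (exclusive)
    recorded_from: datetime   # recorded_during lower bound (inclusive)
    recorded_to: datetime     # recorded_during upper bound (exclusive)


def revise(log: list[Row], mac: str, port: str,
           valid_from: datetime, valid_to: datetime, now: datetime) -> list[Row]:
    """Record at `now` that `mac` was on `port` during [valid_from, valid_to)."""
    out: list[Row] = []
    for row in log:
        is_current = row.recorded_to == FOREVER
        overlaps = row.valid_from < valid_to and valid_from < row.valid_to
        if not (row.mac == mac and is_current and overlaps):
            out.append(row)
            continue
        out.append(replace(row, recorded_to=now))  # close the old belief
        if row.valid_from < valid_from:            # keep the part before the late fact
            out.append(replace(row, valid_to=valid_from, recorded_from=now))
        if valid_to < row.valid_to:                # keep the part after it
            out.append(replace(row, valid_from=valid_to, recorded_from=now))
    out.append(Row(mac, port, valid_from, valid_to, now, FOREVER))
    return out


def as_of(log: list[Row], mac: str, valid_at: datetime, recorded_at: datetime) -> Optional[str]:
    """Which port did we believe `mac` was on at valid_at, as known at recorded_at?"""
    for row in log:
        if (row.mac == mac
                and row.valid_from <= valid_at < row.valid_to
                and row.recorded_from <= recorded_at < row.recorded_to):
            return row.port
    return None
```

Because closed rows are never deleted, a query with an earlier `recorded_at` still returns what was believed before the late event arrived.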
The compactor co-runs with the reconciler in the same event loop, sharing
one LiveSet. When a MAC has gone silent past its aging threshold, the
compactor closes valid_during and DELETEs the matching liveness row — then
invalidates the in-memory LiveSet after the transaction commits. (Doing
it inline would be the Hamilton silent-failure bug.)
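The ordering matters because an in-memory set cannot be rolled back with the transaction. The sketch below shows commit-then-invalidate using the standard-library sqlite3 module as a stand-in for Postgres; the table layout, the integer timestamps, the plain `set` used as the LiveSet, and the `compact` function are all simplifications assumed for illustration.

```python
import sqlite3


def compact(conn: sqlite3.Connection, live_set: set[str],
            now: int, aging_seconds: int) -> list[str]:
    """Age out silent MACs; touch live_set only after the commit succeeds."""
    cutoff = now - aging_seconds
    with conn:  # commits on success, rolls back if anything inside raises
        expired = [row[0] for row in conn.execute(
            "SELECT mac FROM liveness WHERE last_seen < ?", (cutoff,))]
        for mac in expired:
            # Close the open validity interval for this MAC.
            conn.execute(
                "UPDATE mac_observation SET valid_to = ? "
                "WHERE mac = ? AND valid_to IS NULL", (now, mac))
            conn.execute("DELETE FROM liveness WHERE mac = ?", (mac,))
    # Reached only when the transaction committed. If the block above raised,
    # the database rolled back and live_set still matches it.
    for mac in expired:
        live_set.discard(mac)
    return expired
```

Had the `discard` calls been placed inside the `with` block, a failed commit would leave the database saying a MAC is live while the in-memory set says it is not, with no error surfaced at the point of divergence.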
Postgres + Apache AGE is the store. Tables are bitemporal (TSTZRANGE columns + EXCLUDE constraints). AGE projects the same data as a graph for the traceroute recursive CTE.
The Textual TUI is the operator console. Four screens, ambient as-of picker, vim-style navigation:
- TRACE (Ctrl+T) — L2 traceroute form
- HISTORY (Ctrl+H) — MAC bitemporal timeline
- OPS (Ctrl+O) — live FDB tree + disagreements + quarantine tail
- AUDIT (Ctrl+U) — bidirectional LLDP audit
Press F1 from any screen for a per-screen binding cheat sheet.
The AUDIT screen surfaces one-way LLDP and cross-source telemetry
asymmetry as a single color-coded table, using the same query as the
l2trace audit-adjacencies CLI.
Screenshots of every screen are embedded throughout the tutorial.
What about the OUI registry?
The oui_vendor lookup table humanizes MAC addresses: aa:bb:cc:11:22:33
becomes aa:bb:cc:11:22:33 (Cisco Systems, Inc) in the HISTORY timeline and
the OPS FDB tree. The table is populated by make oui-refresh, which pulls
the three IEEE registries (~5 MB total) and UPSERTs them in place, so concurrent
TUI lookups never see a TRUNCATE-shaped gap. See
Refresh the OUI registry.
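The lookup itself can be sketched as a longest-prefix match. The IEEE registries assign prefixes of three lengths (MA-L at 24 bits, MA-M at 28 bits, MA-S at 36 bits), so the sketch tries the most specific prefix first. The `humanize` function and the dict keyed by lowercase hex prefix are assumptions for illustration; the real oui_vendor schema and query may look different.

```python
def humanize(mac: str, registry: dict[str, str]) -> str:
    """Append the vendor name to a MAC address when its prefix is registered."""
    hex_digits = mac.replace(":", "").replace("-", "").lower()
    # 9, 7 and 6 hex digits correspond to 36-, 28- and 24-bit assignments.
    for prefix_len in (9, 7, 6):
        vendor = registry.get(hex_digits[:prefix_len])
        if vendor:
            return f"{mac} ({vendor})"
    return mac  # unknown prefix: show the bare address


# Sample entry taken from the example above; a real table holds the full registries.
registry = {"aabbcc": "Cisco Systems, Inc"}
label = humanize("aa:bb:cc:11:22:33", registry)
```

Here `label` is `aa:bb:cc:11:22:33 (Cisco Systems, Inc)`, and an address with no registered prefix is returned unchanged.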