Architecture at a glance

┌──────────────┐ gNMI ┌─────────────────┐ ┌──────────────────────────┐
│ network      │─────▶│ gNMI collector  │ │ device_collector table   │
│ device(s)    │ SNMP │ SNMP collector  │◀│ (per-device source cfg)  │
│              │─────▶│ SSH collector   │ └──────────────────────────┘
│              │ SSH  │ (napalm)        │              ▲
│              │─────▶└────────┬────────┘              │ reconfig
└──────────────┘               │                       │ every ~30s
                               ▼                       │
                      ┌────────────────┐      ┌────────┴────────┐
                      │ NATS JetStream │      │ orchestrator    │
                      │ (event bus)    │      │ (spawn/recycle  │
                      └────────┬───────┘      │  workers)       │
                               │              └─────────────────┘
               ┌───────────────┴───────────────┐
               ▼                               ▼
       ┌────────────────┐               ┌────────────────────┐
       │ reconciler     │── on skew ───▶│ quarantine subject │
       │ state machine  │               │ (audit + replay)   │
       │ + bitemporal   │               └────────────────────┘
       │ writer         │
       └───────┬────────┘
               │ • MAC_LEARNED / MAC_REMOVED → mac_observation
               │ • LLDP_NEIGHBOR_UPDATE      → adjacency
               │ • DEVICE_IDENTIFIED         → device.chassis_id +
               │                               adjacency backfill
      ┌──────────────────┐       reads
      │ Postgres + AGE   │◀──────────────────┐
      │ bitemporal log   │                   │
      └──────────────────┘                   │
               ▲                             │
               │ co-runs                     │
               │ (shared LiveSet)            │
      ┌──────────────────┐                   │
      │ compactor        │                   │
      └──────────────────┘                   │
                                 ┌───────────┴──────────┐
                                 │ Textual TUI          │
                                 │ TRACE / HISTORY      │
                                 │ OPS / AUDIT          │
                                 └──────────────────────┘

Collectors stream CAM/MAC table updates off network devices. Three ship today:

  • gNMI subscribes to openconfig-network-instance (FDB), /lldp/... (adjacencies + local chassis-id), and emits a canonical event envelope onto NATS as updates land. Sub-second latency on supported platforms.
  • SNMP polls Q-BRIDGE-MIB / BRIDGE-MIB (FDB), LLDP-MIB (adjacencies + local chassis-id) on a 60-second interval. Universal compatibility, higher latency. The reconciler treats it as a backstop when gNMI lags.
  • SSH polls per-vendor show commands via napalm, which exposes a vendor-normalized API (get_mac_address_table(), get_lldp_neighbors_detail()) across Cisco IOS-XE / NX-OS / Arista EOS / Juniper JunOS — same shape regardless of vendor. A small per-vendor cli([show...]) sidecar regexes the chassis MAC out of show version (IOS-XE) or show lldp local-info (NX-OS) so DeviceIdentified works under SSH too. Last-resort backstop for switches with neither gNMI nor SNMP exposed.
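Polling collectors (SNMP, SSH) see whole table snapshots rather than change notifications, so successive snapshots have to be turned into events somehow. The sketch below shows one way that could look, assuming rows shaped like napalm's get_mac_address_table() output (mac, interface, and vlan keys). The names fdb_keys and diff_fdb and the returned tuple shape are hypothetical, not l2trace's actual code.

```python
from typing import Iterable

# Key identifying one FDB entry: (vlan, mac, interface).
FdbKey = tuple[int, str, str]


def fdb_keys(rows: Iterable[dict]) -> set[FdbKey]:
    """Reduce napalm-style MAC table rows to a set of comparable keys."""
    return {(int(r["vlan"]), r["mac"].lower(), r["interface"]) for r in rows}


def diff_fdb(previous: Iterable[dict], current: Iterable[dict]) -> list[tuple[str, FdbKey]]:
    """Compare two poll snapshots and return (event_kind, key) pairs.

    A MAC that moved ports shows up as MAC_REMOVED on the old port
    followed by MAC_LEARNED on the new one.
    """
    before, after = fdb_keys(previous), fdb_keys(current)
    removed = [("MAC_REMOVED", k) for k in sorted(before - after)]
    learned = [("MAC_LEARNED", k) for k in sorted(after - before)]
    return removed + learned
```

MAC addresses are lower-cased before comparison so that vendor differences in formatting do not register as spurious moves.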

A NETCONF collector is on the roadmap as a fourth source for platforms that prefer it over gNMI streaming.

The orchestrator runs inside l2trace reconcile. It reads the device_collector config table and (re)spawns one async worker per (device, source) row. On a ~30-second reconfig pass it picks up new rows added via l2trace device add, stops disabled workers, and respawns crashed ones with exponential backoff.
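The respawn-with-backoff behaviour can be illustrated with a small asyncio supervisor. This is a minimal sketch, not the orchestrator's real code: supervise, backoff_delay, and the base/cap/max_restarts parameters are assumptions chosen for illustration.

```python
import asyncio
from typing import Awaitable, Callable


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff: base, 2*base, 4*base, ... limited to cap seconds."""
    return min(cap, base * (2 ** attempt))


async def supervise(
    run_worker: Callable[[], Awaitable[None]],
    max_restarts: int = 5,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> int:
    """Run a worker, restarting it after crashes; return the restart count."""
    restarts = 0
    while True:
        try:
            await run_worker()
            return restarts          # worker finished cleanly
        except asyncio.CancelledError:
            raise                    # a deliberate stop is not a crash
        except Exception:
            if restarts >= max_restarts:
                raise                # give up and surface the failure
            await sleep(backoff_delay(restarts))
            restarts += 1
```

Re-raising CancelledError keeps a worker that was stopped on purpose (for example, because its row was disabled) from being respawned. The sleep function is injectable so the backoff schedule can be tested without waiting.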

The event envelope carries four timestamps: when the device saw the event, when the device’s clock said it saw it, when the collector emitted it, and when the reconciler ingested it. With all four, operators can tell “the switch’s clock is off” apart from “the event sat in a queue for two minutes”: the same symptom, but a completely different remediation.
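To make the distinction concrete, here is a hedged sketch of how the timestamps could separate the two failure modes. The Envelope field names, the diagnose helper, and the 5-second tolerance are all hypothetical. The sketch also assumes the collector's clock is trustworthy enough to compare the device clock against.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Envelope:
    # Hypothetical field names standing in for the envelope's four timestamps.
    device_event_at: datetime         # when the device saw the event
    device_clock_at: datetime         # what the device's clock reported
    collector_emitted_at: datetime    # when the collector published the event
    reconciler_ingested_at: datetime  # when the reconciler consumed it


def diagnose(env: Envelope, tolerance: timedelta = timedelta(seconds=5)) -> list[str]:
    """Return the likely causes of a stale-looking event."""
    findings = []
    # Device clock disagrees with the collector's clock: suspect the switch.
    if abs(env.collector_emitted_at - env.device_clock_at) > tolerance:
        findings.append("clock_skew")
    # Long gap between emit and ingest: the event waited on the bus.
    if env.reconciler_ingested_at - env.collector_emitted_at > tolerance:
        findings.append("queue_delay")
    return findings
```

A "clock_skew" finding points at time synchronization on the switch, while "queue_delay" points at the bus or the reconciler's throughput.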

NATS JetStream is the durable event bus. Subject-per-device routing keeps ordering per source within reasonable bounds. JetStream’s retention doubles as the audit log: if a row in Postgres looks wrong, you can replay the original event as long as it is still inside the retention window (7 days by default).
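As an illustration of subject-per-device routing, the helper below builds one subject per (device, source) pair. The l2trace.events prefix and the event_subject name are assumptions, not the project's real subject layout. What is certain is that NATS subject tokens cannot contain whitespace, ".", "*", or ">", so hostnames need sanitizing before they become a token.

```python
import re

# Characters that are not allowed inside a single NATS subject token.
_UNSAFE = re.compile(r"[\s.*>]")


def event_subject(device: str, source: str, prefix: str = "l2trace.events") -> str:
    """Build a per-device subject such as '<prefix>.<device>.<source>'."""
    token = _UNSAFE.sub("_", device.strip())
    if not token:
        raise ValueError("device name is empty")
    return f"{prefix}.{token}.{source}"
```

With a layout like this, a consumer interested in every source for one switch could subscribe with the single-token wildcard, for example l2trace.events.core-sw1_example_net.*.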

The reconciler is a state machine. It dedupes by event_id, looks up the port, classifies the event (first-sight / move / continuation / quarantine), and writes through to the bitemporal log in one transaction. It dispatches by event kind:

  • MAC_LEARNED / MAC_REMOVED → bitemporal write into mac_observation
  • LLDP_NEIGHBOR_UPDATE → bitemporal write into adjacency, with eager peer resolution by remote_chassis_id
  • DEVICE_IDENTIFIED → UPDATE device.chassis_id + backfill every pending adjacency.remote_device_id matching that chassis_id (see How peer resolution works)
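The dedupe-then-classify step for MAC events can be sketched as a small in-memory state machine. Everything here is illustrative: the Reconciler class, its fields, the "duplicate" label, and the rule that an event stamped too far in the future counts as skew are assumptions. The real reconciler persists its state and writes to the bitemporal log in the same transaction.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Reconciler:
    max_skew: timedelta = timedelta(minutes=5)        # hypothetical threshold
    seen_ids: set = field(default_factory=set)        # event_ids already handled
    current_port: dict = field(default_factory=dict)  # mac -> (device, port)

    def classify(self, event_id: str, mac: str, device: str, port: str,
                 device_ts: datetime, ingested_ts: datetime) -> str:
        # 1. Dedupe by event_id.
        if event_id in self.seen_ids:
            return "duplicate"
        self.seen_ids.add(event_id)
        # 2. A device timestamp far in the future cannot be a late arrival;
        #    treat it as clock skew and route it to quarantine.
        if device_ts - ingested_ts > self.max_skew:
            return "quarantine"
        # 3. Compare against the last known location of this MAC.
        previous = self.current_port.get(mac)
        self.current_port[mac] = (device, port)
        if previous is None:
            return "first-sight"
        if previous == (device, port):
            return "continuation"
        return "move"
```

Quarantined events deliberately leave current_port untouched, so a switch with a bad clock cannot overwrite a good observation.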

Late-arriving events get belief revision: recorded_during of the previously-current row gets closed, and a new row is inserted at the historical valid_during. See How late arrivals work.
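Belief revision is easier to follow with the two time axes spelled out. The sketch below models rows in memory with explicit from/to pairs instead of TSTZRANGE columns. Row, overlaps, and revise are hypothetical names, and the sketch does not trim or split the superseded row, which a full implementation may need to do.

```python
from dataclasses import dataclass, replace
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Row:
    mac: str
    port: str
    valid_from: datetime
    valid_to: Optional[datetime]      # None = still true in the network
    recorded_from: datetime
    recorded_to: Optional[datetime]   # None = what we currently believe


def overlaps(a_from, a_to, b_from, b_to) -> bool:
    """Half-open range overlap, where None means 'unbounded'."""
    return (a_to is None or b_from < a_to) and (b_to is None or a_from < b_to)


def revise(log: list[Row], mac: str, port: str, valid_from: datetime,
           valid_to: Optional[datetime], now: datetime) -> list[Row]:
    """Return a new log with the late event applied; nothing is deleted."""
    out = []
    for row in log:
        is_current = row.mac == mac and row.recorded_to is None
        if is_current and overlaps(row.valid_from, row.valid_to, valid_from, valid_to):
            # Close recorded time on the old belief instead of overwriting it.
            out.append(replace(row, recorded_to=now))
        else:
            out.append(row)
    # Record the new belief at the historical valid time.
    out.append(Row(mac, port, valid_from, valid_to, now, None))
    return out
```

Because the old row is closed rather than updated, a query "as recorded at" any earlier time still returns what the system believed then.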

The compactor co-runs with the reconciler in the same event loop, sharing one LiveSet. When a MAC has gone silent past its aging threshold, the compactor closes valid_during and DELETEs the matching liveness row — then invalidates the in-memory LiveSet after the transaction commits. (Doing it inline would be the Hamilton silent-failure bug.)
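The ordering constraint (database first, in-memory state only after a successful commit) can be sketched as follows. The compact function, the last_seen dict standing in for liveness rows, and the write_tx callback are hypothetical. The write_tx callback is assumed to run the close-and-DELETE in a single transaction and to raise if that transaction does not commit.

```python
from datetime import datetime, timedelta
from typing import Callable


def compact(
    last_seen: dict[str, datetime],
    live_set: set[str],
    now: datetime,
    aging: timedelta,
    write_tx: Callable[[list[str]], None],
) -> list[str]:
    """Expire silent MACs, touching the in-memory LiveSet only after commit."""
    expired = sorted(mac for mac, ts in last_seen.items() if now - ts > aging)
    if not expired:
        return []
    # Database work: close valid_during and DELETE liveness rows.
    # If this raises, nothing below runs and the LiveSet stays consistent.
    write_tx(expired)
    # Post-commit: now it is safe to invalidate the shared in-memory state.
    for mac in expired:
        live_set.discard(mac)
        last_seen.pop(mac, None)
    return expired
```

If the invalidation happened before or during the transaction, a rollback would leave the LiveSet claiming a MAC is gone while its rows are still open in Postgres.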

Postgres + Apache AGE is the store. Tables are bitemporal (TSTZRANGE columns + EXCLUDE constraints). AGE projects the same data as a graph for the traceroute recursive CTE.

The Textual TUI is the operator console. Four screens, ambient as-of picker, vim-style navigation:

  • TRACE (Ctrl+T) — L2 traceroute form
  • HISTORY (Ctrl+H) — MAC bitemporal timeline
  • OPS (Ctrl+O) — live FDB tree + disagreements + quarantine tail
  • AUDIT (Ctrl+U) — bidirectional LLDP audit

Press F1 from any screen for a per-screen binding cheat sheet:

[Screenshot: help overlay showing screen-specific bindings]

The AUDIT screen surfaces one-way LLDP and cross-source telemetry asymmetry as a single color-coded table — same query as the l2trace audit-adjacencies CLI:

[Screenshot: AUDIT screen showing the bidirectional LLDP audit]

Screenshots of every screen are embedded throughout the tutorial.

A oui_vendor lookup table humanizes MAC addresses: aa:bb:cc:11:22:33 becomes aa:bb:cc:11:22:33 (Cisco Systems, Inc) in the HISTORY timeline and the OPS FDB tree. The table is populated by make oui-refresh, which pulls the three IEEE registries (~5 MB total) and UPSERTs them in place, so concurrent TUI lookups never see a TRUNCATE-shaped gap. See Refresh the OUI registry.
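The lookup itself amounts to a longest-prefix match, because the IEEE registries assign blocks of different sizes (24-bit MA-L, 28-bit MA-M, 36-bit MA-S). The following is a minimal sketch using a plain dict keyed by lower-case hex prefix. The humanize_mac name and the registry shape are assumptions and say nothing about the real oui_vendor schema.

```python
def humanize_mac(mac: str, registry: dict[str, str]) -> str:
    """Append the vendor name to a MAC address when its prefix is known."""
    digits = mac.replace(":", "").replace("-", "").replace(".", "").lower()
    # Try the most specific assignment first: 36-bit, then 28-bit, then 24-bit
    # prefixes, which are 9, 7, and 6 hex digits respectively.
    for hex_len in (9, 7, 6):
        vendor = registry.get(digits[:hex_len])
        if vendor:
            return f"{mac} ({vendor})"
    return mac  # unknown prefix: show the address unchanged
```

For example, with registry = {"aabbcc": "Cisco Systems, Inc"}, calling humanize_mac("aa:bb:cc:11:22:33", registry) yields the annotated form shown above.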
