How peer resolution works

LLDP tells us who’s on the other side of this cable — but only by chassis ID, a 6-byte MAC. The recursive-CTE traceroute walks via adjacency.remote_device_id, an FK into our device table. The gap between “I see chassis ID 00:1a:a1:11:22:33 on Eth8” and “that chassis ID belongs to sw-core-1 (device.id=42)” is what peer resolution closes.

Without it, every trace terminates at the ingress switch’s egress trunk — the walker has no way to find the next hop.
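To make the dead end concrete, here is a minimal sketch of a recursive-CTE walk over a deliberately simplified schema. It is not the real traceroute query: it uses SQLite, the `local_device_id` column and the `walk` helper are assumptions for illustration, and the real tables join through `port` and carry validity ranges.

```python
import sqlite3

# Hypothetical, simplified schema: only the columns the walk needs.
SCHEMA = """
CREATE TABLE device (id INTEGER PRIMARY KEY, hostname TEXT);
CREATE TABLE adjacency (
    local_device_id  INTEGER NOT NULL,
    remote_device_id INTEGER          -- NULL until peer resolution runs
);
"""

# Follow remote_device_id hop by hop; a NULL ends the walk.
WALK = """
WITH RECURSIVE hop(device_id, depth) AS (
    SELECT ?, 0
    UNION
    SELECT a.remote_device_id, hop.depth + 1
    FROM hop
    JOIN adjacency a ON a.local_device_id = hop.device_id
    WHERE a.remote_device_id IS NOT NULL AND hop.depth < 16
)
SELECT d.hostname
FROM hop JOIN device d ON d.id = hop.device_id
ORDER BY hop.depth
"""

def walk(conn, start_id):
    """Return the hostnames reachable from start_id, in hop order."""
    return [row[0] for row in conn.execute(WALK, (start_id,))]

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO device VALUES (?, ?)", [(7, "sw-edge-7"), (42, "sw-core-1")]
)

# Adjacency observed, peer not yet resolved: the walk stops at the ingress switch.
conn.execute("INSERT INTO adjacency VALUES (7, NULL)")
unresolved_path = walk(conn, 7)

# After peer resolution fills in remote_device_id, the walk reaches the core.
conn.execute("UPDATE adjacency SET remote_device_id = 42 WHERE local_device_id = 7")
resolved_path = walk(conn, 7)
print(unresolved_path, resolved_path)
```

The only difference between the two runs is whether `remote_device_id` is populated, which is exactly the column peer resolution owns.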

┌────────────┐                                   ┌────────────┐
│ sw-edge-7  │                                   │ sw-core-1  │
│            │────────────── LLDP ──────────────▶│            │
│       Eth8 │                                   │ Eth1       │
└─────┬──────┘                                   └─────┬──────┘
      │                                                │
      │ emits LldpNeighborUpdate                       │ emits LldpNeighborUpdate
      │ {local: Eth8, remote_chassis: <core-mac>}      │ {local: Eth1, remote_chassis: <edge-mac>}
      │ AND                                            │ AND
      │ emits DeviceIdentified                         │ emits DeviceIdentified
      │ {chassis_id: <edge-mac>}                       │ {chassis_id: <core-mac>}
      │                                                │
      └──────────────────────┐  ┌──────────────────────┘
                             ▼  ▼
                     ┌──────────────────┐
                     │    reconciler    │
                     │                  │
                     │ on DeviceIdent:  │
                     │   UPDATE device  │
                     │   backfill adj   │
                     │                  │
                     │ on LldpNeighbor: │
                     │   INSERT adj +   │
                     │   eager resolve  │
                     └────────┬─────────┘
                              ▼
                 ┌──────────────────────────┐
                 │ adjacency table:         │
                 │ remote_device_id is now  │
                 │ populated for both ends  │
                 └──────────────────────────┘

The collector knows its OWN device_id (from CollectorConfig) and learns its OWN chassis_id (via lldpLocChassisId for SNMP, or /lldp/state/chassis-id for gNMI). It emits a DeviceIdentified event tying those two facts together. The reconciler does the actual UPDATE device SET chassis_id = $1 WHERE id = $2.
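As a rough illustration of that hand-off, the sketch below models the event as a small dataclass and applies it with the UPDATE quoted above. The field names, the `apply_device_identified` helper, and the use of SQLite are assumptions for the example; the real event type and reconciler code may look different.

```python
import sqlite3
from dataclasses import dataclass

@dataclass(frozen=True)
class DeviceIdentified:
    # Field names are assumed; the event only needs to tie the collector's
    # own device_id to the chassis_id the device reports for itself.
    device_id: int
    chassis_id: str

def apply_device_identified(conn, event):
    """Record the chassis_id a device reported for itself (hypothetical helper)."""
    conn.execute(
        "UPDATE device SET chassis_id = ? WHERE id = ?",
        (event.chassis_id, event.device_id),
    )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE device (id INTEGER PRIMARY KEY, hostname TEXT, chassis_id TEXT)"
)
# The device is registered first, with no chassis_id known yet.
conn.execute("INSERT INTO device (id, hostname) VALUES (42, 'sw-core-1')")

apply_device_identified(
    conn, DeviceIdentified(device_id=42, chassis_id="00:1a:a1:11:22:33")
)
identified = conn.execute("SELECT chassis_id FROM device WHERE id = 42").fetchone()[0]
print(identified)
```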

Two resolution paths, neither sufficient alone

Eager resolution at adjacency insert. When apply_adjacency_observation writes a new adjacency row, it first runs SELECT id FROM device WHERE chassis_id = $remote_chassis_id. If the peer has already been identified (its DeviceIdentified event processed), remote_device_id is populated immediately. New adjacency rows are self-resolving in the steady state.
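A minimal sketch of the eager path, assuming a cut-down adjacency table and SQLite in place of the real database. `insert_adjacency_eager` is a stand-in name for illustration, not the actual `apply_adjacency_observation` implementation.

```python
import sqlite3

def insert_adjacency_eager(conn, local_port_id, remote_chassis_id):
    """Insert an adjacency row, resolving the peer if it is already identified."""
    row = conn.execute(
        "SELECT id FROM device WHERE chassis_id = ?", (remote_chassis_id,)
    ).fetchone()
    remote_device_id = row[0] if row else None  # None is stored as NULL
    conn.execute(
        "INSERT INTO adjacency (local_port_id, remote_chassis_id, remote_device_id)"
        " VALUES (?, ?, ?)",
        (local_port_id, remote_chassis_id, remote_device_id),
    )
    return remote_device_id

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE device (id INTEGER PRIMARY KEY, chassis_id TEXT);
CREATE TABLE adjacency (
    local_port_id INTEGER, remote_chassis_id TEXT, remote_device_id INTEGER
);
INSERT INTO device VALUES (42, '00:1a:a1:11:22:33');
""")

# Peer already identified: resolved at INSERT time.
known = insert_adjacency_eager(conn, 8, "00:1a:a1:11:22:33")
# Peer not identified yet: the row is stored with remote_device_id NULL.
unknown = insert_adjacency_eager(conn, 9, "00:de:ad:be:ef:99")
print(known, unknown)
```

The second call is the case eager resolution cannot fix on its own: nothing revisits that row unless something else fills it in later.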

Backfill on DeviceIdentified. When a DeviceIdentified event lands, the reconciler runs:

UPDATE adjacency
SET remote_device_id = $device_id
WHERE remote_chassis_id = $chassis_id
AND remote_device_id IS NULL

This catches the timing case where an adjacency naming the peer landed BEFORE the peer itself was identified. That is common during initial fabric bring-up, when multiple collectors are racing. Without backfill, those rows would sit unresolved forever (or until they age out and are re-observed).
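The backfill statement can be exercised in isolation. The sketch below runs it against SQLite with a cut-down adjacency table; `backfill_adjacencies` is a hypothetical wrapper, and the sample rows are made up.

```python
import sqlite3

def backfill_adjacencies(conn, device_id, chassis_id):
    """Resolve every still-unresolved adjacency that names this chassis_id."""
    cur = conn.execute(
        "UPDATE adjacency SET remote_device_id = ?"
        " WHERE remote_chassis_id = ? AND remote_device_id IS NULL",
        (device_id, chassis_id),
    )
    return cur.rowcount  # number of rows that were backfilled

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE adjacency (
    local_port_id INTEGER, remote_chassis_id TEXT, remote_device_id INTEGER
);
INSERT INTO adjacency VALUES (1, '00:1a:a1:11:22:33', NULL);  -- waiting for the peer
INSERT INTO adjacency VALUES (2, '00:1a:a1:11:22:33', 42);    -- already resolved
INSERT INTO adjacency VALUES (3, '00:de:ad:be:ef:99', NULL);  -- a different peer
""")

first_pass = backfill_adjacencies(conn, 42, "00:1a:a1:11:22:33")
second_pass = backfill_adjacencies(conn, 42, "00:1a:a1:11:22:33")
print(first_pass, second_pass)
```

Because of the `remote_device_id IS NULL` guard, replaying the same DeviceIdentified event touches nothing the second time, and rows for other chassis IDs are left alone.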

Both paths together close the race in both directions:

  • If sw-edge-7 was identified before sw-core-1 saw it: eager resolution fills in the row at INSERT time.
  • If sw-core-1 saw sw-edge-7 before sw-edge-7’s DeviceIdentified arrived: backfill fills in the row when it does.

Either way, within one poll cycle of both ends, the adjacency table is fully resolved.
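One way to sanity-check the "neither sufficient alone" claim is to replay the four events for a two-switch link in every possible order. The sketch below is a toy in-memory model, not the reconciler: the event tuples, the `replay` function, and the placeholder chassis IDs are all invented for the example.

```python
from itertools import permutations

# Toy model: two devices, each with a placeholder chassis ID.
CHASSIS = {"sw-edge-7": "edge-mac", "sw-core-1": "core-mac"}

def replay(events, eager=True, backfill=True):
    """Apply events in order and return the resulting adjacency rows."""
    device_by_chassis = {}  # chassis_id -> hostname, filled by "identified"
    adjacency = []
    for kind, host in events:
        if kind == "identified":
            chassis = CHASSIS[host]
            device_by_chassis[chassis] = host
            if backfill:
                # Backfill: resolve rows that were waiting on this chassis_id.
                for row in adjacency:
                    if row["remote_chassis"] == chassis and row["remote_dev"] is None:
                        row["remote_dev"] = host
        else:
            # "neighbor": host observed its peer's chassis_id over LLDP.
            peer = next(h for h in CHASSIS if h != host)
            chassis = CHASSIS[peer]
            # Eager: resolve immediately if the peer is already identified.
            resolved = device_by_chassis.get(chassis) if eager else None
            adjacency.append(
                {"local": host, "remote_chassis": chassis, "remote_dev": resolved}
            )
    return adjacency

EVENTS = [(kind, host) for kind in ("identified", "neighbor") for host in CHASSIS]

def fully_resolved_orderings(**paths):
    """Count event orderings that leave every adjacency row resolved."""
    return sum(
        all(row["remote_dev"] is not None for row in replay(order, **paths))
        for order in permutations(EVENTS)
    )

both = fully_resolved_orderings()
eager_only = fully_resolved_orderings(backfill=False)
backfill_only = fully_resolved_orderings(eager=False)
print(both, eager_only, backfill_only)
```

In this model all 24 orderings end fully resolved when both paths are enabled, while either path alone succeeds in only 6 of them.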

Three things peer resolution does not handle today:

  • Cross-vendor chassis_id deduplication. Some vendors emit the base management MAC; others emit a chassis assignment that doesn’t match the management interface. We trust LLDP’s report verbatim and rely on the registered device having the same chassis_id it advertises. Operators who pre-populate device.chassis_id via device add --chassis <mac> short-circuit this entirely.

  • Backfill rewriting on chassis_id change. A hardware swap changes the chassis_id but the device_id stays. Historical adjacency rows pointing at the OLD chassis_id are left as-is (they’re real historical observations); only NEW adjacencies get resolved against the new chassis_id. If you want to retroactively rewrite history, a one-shot l2trace resolve-peers --hostname <h> --since <T> job would do it — not implemented today.

  • Bidirectional verification. A real LLDP fabric is symmetric: if sw-edge sees sw-core on Eth8, then sw-core should see sw-edge on whatever-port-faces-sw-edge. We don’t cross-check today. Adding it would let us detect “LLDP advertised but never received” failure modes (one-way cabling, asymmetric port shutdowns).
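If bidirectional verification were added, its core could be as small as the sketch below. This is purely hypothetical: it works on resolved (local device, remote device) pairs, ignores ports and parallel links, and `one_way_links` and `sw-edge-9` are made-up names.

```python
def one_way_links(adjacencies):
    """Return (local, remote) pairs that have no matching (remote, local) pair."""
    seen = set(adjacencies)
    return sorted((a, b) for a, b in seen if (b, a) not in seen)

links = [
    ("sw-edge-7", "sw-core-1"),
    ("sw-core-1", "sw-edge-7"),  # reverse direction observed: symmetric
    ("sw-edge-9", "sw-core-1"),  # hypothetical link seen from one side only
]
suspects = one_way_links(links)
print(suspects)
```

A real check would also need to tolerate the window in which one end has been polled and the other has not, so a pair should probably be flagged only after it stays one-way for more than a poll cycle.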

After registering a couple of devices, wait for one poll cycle:

make device-add HOSTNAME=sw-a IP=10.0.0.1 SOURCE=snmp COMMUNITY=ro
make device-add HOSTNAME=sw-b IP=10.0.0.2 SOURCE=snmp COMMUNITY=ro
# Wait ~90s (one orchestrator reconfig + one SNMP poll cycle)
make psql
> SELECT d.hostname, d.chassis_id::text FROM device d;
 hostname |    chassis_id
----------+-------------------
 sw-a     | 00:de:ad:be:ef:01
 sw-b     | 00:de:ad:be:ef:02
> SELECT
>   d_local.hostname AS local_dev,
>   p.name AS local_port,
>   a.remote_chassis_id::text,
>   d_remote.hostname AS remote_dev
> FROM adjacency a
> JOIN port p ON p.id = a.local_port_id
> JOIN device d_local ON d_local.id = p.device_id
> LEFT JOIN device d_remote ON d_remote.id = a.remote_device_id
> WHERE upper_inf(a.valid_during);
 local_dev | local_port | remote_chassis_id | remote_dev
-----------+------------+-------------------+------------
 sw-a      | Eth8       | 00:de:ad:be:ef:02 | sw-b
 sw-b      | Eth8       | 00:de:ad:be:ef:01 | sw-a

If remote_dev is populated in both rows, peer resolution works end to end, and the recursive-CTE traceroute can walk between sw-a and sw-b.

  • The reconciler dispatch: _process_device_identified in reconciler/runner.py
  • The eager-resolution path: apply_adjacency_observation in db/adjacency.py
  • The backfill path: record_device_chassis_id in db/device_identity.py
  • The L2 traceroute algorithm — what remote_device_id ultimately powers
  • Collect LLDP adjacencies — the collector side of this story