How peer resolution works
LLDP tells us who’s on the other side of this cable — but only by
chassis ID, a 6-byte MAC. The recursive-CTE traceroute walks via
adjacency.remote_device_id, an FK into our device table. The gap
between “I see chassis ID 00:1a:a1:11:22:33 on Eth8” and “that
chassis ID belongs to sw-core-1 (device.id=42)” is what peer
resolution closes.
Without it, every trace terminates at the ingress switch’s egress trunk — the walker has no way to find the next hop.
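To make the failure mode concrete, here is a minimal sketch of a recursive-CTE walk over a deliberately simplified schema. The table layout, the `walk` helper, the depth limit of 16 and the device id 7 are assumptions made for illustration (the real adjacency table hangs off ports and validity ranges), and SQLite stands in for the production database.

```python
import sqlite3

# Hypothetical, simplified schema: each adjacency row is "device -> peer".
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE device (id INTEGER PRIMARY KEY, hostname TEXT);
    CREATE TABLE adjacency (
        local_device_id   INTEGER REFERENCES device(id),
        remote_chassis_id TEXT,
        remote_device_id  INTEGER REFERENCES device(id)  -- NULL until resolved
    );
    INSERT INTO device VALUES (7, 'sw-edge-7'), (42, 'sw-core-1');
    INSERT INTO adjacency VALUES (7, '00:1a:a1:11:22:33', NULL);
""")

WALK = """
    WITH RECURSIVE hop(device_id, depth) AS (
        SELECT ?, 0
        UNION
        SELECT a.remote_device_id, hop.depth + 1
        FROM hop
        JOIN adjacency a ON a.local_device_id = hop.device_id
        WHERE a.remote_device_id IS NOT NULL  -- unresolved rows end the walk
          AND hop.depth < 16                  -- arbitrary safety bound
    )
    SELECT d.hostname
    FROM hop JOIN device d ON d.id = hop.device_id
    ORDER BY hop.depth
"""

def walk(start_id):
    """Return the hostnames reachable from start_id, nearest first."""
    return [row[0] for row in conn.execute(WALK, (start_id,))]

unresolved = walk(7)  # the peer is known only by chassis ID, so the walk stops

# Simulate peer resolution filling in the FK.
conn.execute(
    "UPDATE adjacency SET remote_device_id = 42 "
    "WHERE remote_chassis_id = '00:1a:a1:11:22:33'"
)
resolved = walk(7)    # the walk now continues to sw-core-1
```

Before resolution the walk returns only the starting switch; once remote_device_id is set, the same query reaches the peer.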
The data flow
    ┌────────────┐                                           ┌────────────┐
    │ sw-edge-7  │                                           │ sw-core-1  │
    │            │── LLDP ──────────────────────────────────▶│            │
    │ Eth8 ─────┐│                                           │┌───── Eth1 │
    │           ││                                           ││           │
    └───────────┴┘                                           └┴───────────┘
                │                                             │
                │ emits LldpNeighborUpdate                    │ emits LldpNeighborUpdate
                │ {local: Eth8, remote_chassis: <core-mac>}   │ {local: Eth1, remote_chassis: <edge-mac>}
                │ AND                                         │ AND
                │ emits DeviceIdentified                      │ emits DeviceIdentified
                │ {chassis_id: <edge-mac>}                    │ {chassis_id: <core-mac>}
                ▼                                             ▼
                              ┌──────────────────┐
                              │ reconciler       │
                              │                  │
                              │ on DeviceIdent:  │
                              │   UPDATE device  │
                              │   backfill adj   │
                              │                  │
                              │ on LldpNeighbor: │
                              │   INSERT adj +   │
                              │   eager resolve  │
                              └──────────────────┘
                                       │
                                       ▼
                          ┌──────────────────────────┐
                          │ adjacency table:         │
                          │ remote_device_id is now  │
                          │ populated for both ends  │
                          └──────────────────────────┘

The collector knows its OWN device_id (from CollectorConfig) and
learns its OWN chassis_id (via lldpLocChassisId for SNMP, or
/lldp/state/chassis-id for gNMI). It emits a DeviceIdentified
event tying those two facts together. The reconciler does the actual
UPDATE device SET chassis_id = $1 WHERE id = $2.
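As a rough illustration of that hand-off, the sketch below models the event as a dataclass and applies the UPDATE against an in-memory SQLite table. The `DeviceIdentified` field names and the `store_chassis_id` helper are assumptions inferred from the prose, not the project's actual classes, and SQLite's `?` placeholders stand in for `$1`/`$2`.

```python
import sqlite3
from dataclasses import dataclass

@dataclass(frozen=True)
class DeviceIdentified:
    """Hypothetical event shape: ties a known device_id to its own chassis_id."""
    device_id: int
    chassis_id: str

def store_chassis_id(conn, event):
    """Persist the chassis ID a collector learned for its own device."""
    conn.execute(
        "UPDATE device SET chassis_id = ? WHERE id = ?",
        (event.chassis_id, event.device_id),
    )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE device (id INTEGER PRIMARY KEY, hostname TEXT, chassis_id TEXT)"
)
# The device is registered first, with no chassis ID yet.
conn.execute("INSERT INTO device (id, hostname) VALUES (42, 'sw-core-1')")

store_chassis_id(conn, DeviceIdentified(device_id=42, chassis_id="00:1a:a1:11:22:33"))
stored = conn.execute("SELECT chassis_id FROM device WHERE id = 42").fetchone()[0]
```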
Two resolution paths, neither sufficient alone
Eager resolution at adjacency insert. When
apply_adjacency_observation writes a new adjacency row, it first
runs SELECT id FROM device WHERE chassis_id = $remote_chassis_id.
If the peer has already been identified (its DeviceIdentified event
processed), remote_device_id is populated immediately. New adjacency
rows are self-resolving in the steady state.
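The eager path might look roughly like the following. This is a sketch under assumptions: `insert_adjacency` is a hypothetical stand-in for the real apply_adjacency_observation, the adjacency table is reduced to three columns, and SQLite replaces the production database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE device (id INTEGER PRIMARY KEY, hostname TEXT, chassis_id TEXT);
    CREATE TABLE adjacency (
        local_port        TEXT,
        remote_chassis_id TEXT,
        remote_device_id  INTEGER REFERENCES device(id)
    );
    -- sw-core-1 has already been identified; no other peer has.
    INSERT INTO device VALUES (42, 'sw-core-1', '00:1a:a1:11:22:33');
""")

def insert_adjacency(conn, local_port, remote_chassis_id):
    """Insert an adjacency row, resolving the peer if it is already known."""
    row = conn.execute(
        "SELECT id FROM device WHERE chassis_id = ?", (remote_chassis_id,)
    ).fetchone()
    remote_device_id = row[0] if row else None  # stays NULL for unknown peers
    conn.execute(
        "INSERT INTO adjacency VALUES (?, ?, ?)",
        (local_port, remote_chassis_id, remote_device_id),
    )
    return remote_device_id

known = insert_adjacency(conn, "Eth8", "00:1a:a1:11:22:33")    # resolves at once
unknown = insert_adjacency(conn, "Eth9", "00:de:ad:be:ef:02")  # left for backfill
```

The second insert is the case eager resolution cannot handle on its own: the row is written with a NULL remote_device_id and has to be fixed up later.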
Backfill on DeviceIdentified. When a DeviceIdentified event
lands, the reconciler runs:

    UPDATE adjacency
    SET remote_device_id = $device_id
    WHERE remote_chassis_id = $chassis_id
      AND remote_device_id IS NULL

This catches the timing case where the peer’s adjacency landed BEFORE the peer was identified, which is common during initial fabric bring-up when multiple collectors are racing. Without backfill, those rows would sit unresolved forever (or until they age out and are re-observed).
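A small runnable version of that statement, again against a simplified in-memory SQLite table, shows that the `IS NULL` guard makes the backfill safe to repeat. The `backfill` helper and the sample rows are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE adjacency (
        local_port        TEXT,
        remote_chassis_id TEXT,
        remote_device_id  INTEGER
    );
    -- Two observations of a peer that is not identified yet,
    -- plus one row for a different peer that is already resolved.
    INSERT INTO adjacency VALUES
        ('Eth8', '00:1a:a1:11:22:33', NULL),
        ('Eth9', '00:1a:a1:11:22:33', NULL),
        ('Eth1', '00:de:ad:be:ef:02', 5);
""")

def backfill(conn, device_id, chassis_id):
    """Resolve every still-unresolved row that points at chassis_id."""
    cur = conn.execute(
        "UPDATE adjacency SET remote_device_id = ? "
        "WHERE remote_chassis_id = ? AND remote_device_id IS NULL",
        (device_id, chassis_id),
    )
    return cur.rowcount  # number of rows that were actually changed

first = backfill(conn, 42, "00:1a:a1:11:22:33")   # fills both pending rows
second = backfill(conn, 42, "00:1a:a1:11:22:33")  # nothing left to do
rows = conn.execute(
    "SELECT local_port, remote_device_id FROM adjacency ORDER BY local_port"
).fetchall()
```

Rows that were already resolved are never touched, so a duplicate or replayed DeviceIdentified event is harmless in this model.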
Both paths together close the race in both directions:
- If sw-edge-7 was identified before sw-core-1 saw it: eager resolution fills in the row at INSERT time.
- If sw-core-1 saw sw-edge-7 before sw-edge-7’s DeviceIdentified arrived: backfill fills in the row when it does.
Either way, once both ends have completed one poll cycle, the adjacency table is fully resolved.
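The order-independence claim can be checked with a toy in-memory model of the reconciler. Everything here is an assumption for illustration: events are plain tuples, tables are dictionaries, and the device id 7 and the edge MAC are made up. The model applies the same four events in every possible order and collects the distinct end states.

```python
from itertools import permutations

CORE_MAC = "00:1a:a1:11:22:33"
EDGE_MAC = "00:de:ad:be:ef:07"  # hypothetical chassis ID for sw-edge-7

def reconcile(events):
    """Apply events in order; return {(hostname, port): remote_device_id}."""
    chassis_to_device = {}  # models device.chassis_id -> device.id
    adjacency = {}          # models the adjacency table
    for kind, payload in events:
        if kind == "identified":
            device_id, chassis = payload
            chassis_to_device[chassis] = device_id
            # Backfill: resolve rows that arrived before this identification.
            for row in adjacency.values():
                if row["remote_chassis"] == chassis and row["remote_device_id"] is None:
                    row["remote_device_id"] = device_id
        elif kind == "neighbor":
            key, remote_chassis = payload
            # Eager resolution: look the peer up at insert time.
            adjacency[key] = {
                "remote_chassis": remote_chassis,
                "remote_device_id": chassis_to_device.get(remote_chassis),
            }
    return {key: row["remote_device_id"] for key, row in adjacency.items()}

events = [
    ("identified", (7, EDGE_MAC)),
    ("identified", (42, CORE_MAC)),
    ("neighbor", (("sw-edge-7", "Eth8"), CORE_MAC)),
    ("neighbor", (("sw-core-1", "Eth1"), EDGE_MAC)),
]

# Every arrival order should converge on the same fully resolved table.
outcomes = {
    tuple(sorted(reconcile(order).items())) for order in permutations(events)
}
```

Removing either the backfill loop or the eager lookup from this model leaves some orderings with unresolved rows, which is the sense in which neither path is sufficient alone.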
What we deliberately don’t do
- Cross-vendor chassis_id deduplication. Some vendors emit the base management MAC; others emit a chassis assignment that doesn’t match the management interface. We trust LLDP’s report verbatim and rely on the registered device having the same chassis_id it advertises. Operators who pre-populate device.chassis_id via device add --chassis <mac> short-circuit this entirely.
- Backfill rewriting on chassis_id change. A hardware swap changes the chassis_id but the device_id stays. Historical adjacency rows pointing at the OLD chassis_id are left as-is (they’re real historical observations); only NEW adjacencies get resolved against the new chassis_id. If you want to retroactively rewrite history, a one-shot l2trace resolve-peers --hostname <h> --since <T> job would do it (not implemented today).
- Bidirectional verification. A real LLDP fabric is symmetric: if sw-edge sees sw-core on Eth8, then sw-core should see sw-edge on whatever-port-faces-sw-edge. We don’t cross-check today. Adding it would let us detect “LLDP advertised but never received” failure modes (one-way cabling, asymmetric port shutdowns).
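For the bidirectional check in the last item, one possible shape is sketched below. It is not part of the codebase: the `one_way_links` helper and its input format (resolved pairs of local and remote device ids) are assumptions chosen to show the idea.

```python
def one_way_links(adjacencies):
    """Return (local, remote) pairs that have no matching reverse observation."""
    seen = set(adjacencies)
    return sorted(pair for pair in seen if (pair[1], pair[0]) not in seen)

# Hypothetical resolved adjacencies as (local_device_id, remote_device_id).
observed = [
    (7, 42),  # sw-edge-7 sees sw-core-1 ...
    (42, 7),  # ... and sw-core-1 sees it back: symmetric
    (7, 99),  # device 7 hears device 99, but 99 never reports 7
]

one_way = one_way_links(observed)
```

A pair that shows up here is a candidate for the one-way cabling or asymmetric shutdown cases, although a peer that simply has not been polled yet would look the same until its first cycle completes.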
Verifying it works
After registering a couple of devices, wait for one poll cycle:

    make device-add HOSTNAME=sw-a IP=10.0.0.1 SOURCE=snmp COMMUNITY=ro
    make device-add HOSTNAME=sw-b IP=10.0.0.2 SOURCE=snmp COMMUNITY=ro

    # Wait ~90s (one orchestrator reconfig + one SNMP poll cycle)

    make psql
    > SELECT d.hostname, d.chassis_id::text FROM device d;
     hostname | chassis_id
     sw-a     | 00:de:ad:be:ef:01
     sw-b     | 00:de:ad:be:ef:02

    > SELECT
    >   d_local.hostname AS local_dev,
    >   p.name AS local_port,
    >   a.remote_chassis_id::text,
    >   d_remote.hostname AS remote_dev
    > FROM adjacency a
    > JOIN port p ON p.id = a.local_port_id
    > JOIN device d_local ON d_local.id = p.device_id
    > LEFT JOIN device d_remote ON d_remote.id = a.remote_device_id
    > WHERE upper_inf(a.valid_during);
     local_dev | local_port | remote_chassis_id | remote_dev
     sw-a      | Eth8       | 00:de:ad:be:ef:02 | sw-b
     sw-b      | Eth8       | 00:de:ad:be:ef:01 | sw-a

Both remote_dev columns populated = peer resolution works end-to-end.
Now the recursive-CTE traceroute can walk between sw-a and sw-b.
See also
- The reconciler dispatch: _process_device_identified in reconciler/runner.py
- The eager-resolution path: apply_adjacency_observation in db/adjacency.py
- The backfill path: record_device_chassis_id in db/device_identity.py
- The L2 traceroute algorithm: what remote_device_id ultimately powers
- Collect LLDP adjacencies: the collector side of this story