How peer resolution works

LLDP tells us who’s on the other side of this cable — but only by chassis ID, a 6-byte MAC. The recursive-CTE traceroute walks via adjacency.remote_device_id, an FK into our device table. The gap between “I see chassis ID 00:1a:a1:11:22:33 on Eth8” and “that chassis ID belongs to sw-core-1 (device.id=42)” is what peer resolution closes.

Without it, every trace terminates at the ingress switch’s egress trunk — the walker has no way to find the next hop.

The data flow

┌────────────┐                                ┌────────────┐
│  sw-edge-7 │                                │  sw-core-1 │
│            │── LLDP ──────────────────────▶│            │
│ Eth8 ─────┐│                                │┌───── Eth1 │
│           ││                                ││           │
└───────────┴┘                                └┴───────────┘
       │                                              │
       │  emits LldpNeighborUpdate                    │  emits LldpNeighborUpdate
       │  {local: Eth8, remote_chassis: <core-mac>}   │  {local: Eth1, remote_chassis: <edge-mac>}
       │  AND                                         │  AND
       │  emits DeviceIdentified                      │  emits DeviceIdentified
       │  {chassis_id: <edge-mac>}                    │  {chassis_id: <core-mac>}
       ▼                                              ▼
                   ┌──────────────────┐
                   │   reconciler     │
                   │                  │
                   │  on DeviceIdent: │
                   │    UPDATE device │
                   │    backfill adj  │
                   │                  │
                   │  on LldpNeighbor:│
                   │    INSERT adj +  │
                   │    eager resolve │
                   └──────────────────┘
                            │
                            ▼
              ┌──────────────────────────┐
              │ adjacency table:         │
              │  remote_device_id is now │
              │  populated for both ends │
              └──────────────────────────┘

The collector knows its OWN device_id (from CollectorConfig) and learns its OWN chassis_id (via lldpLocChassisId for SNMP, or /lldp/state/chassis-id for gNMI). It emits a DeviceIdentified event tying those two facts together. The reconciler does the actual UPDATE device SET chassis_id = $1 WHERE id = $2.
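As a rough sketch of the collector side, assuming the chassis ID arrives as a raw 6-byte value and is stored as a colon-separated lowercase MAC: only the DeviceIdentified name and its chassis_id field come from the description above. The dataclass layout, format_chassis_id, and build_device_identified are hypothetical names used for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceIdentified:
    """Hypothetical event shape: ties a known device_id to a learned chassis_id."""
    device_id: int
    chassis_id: str


def format_chassis_id(raw: bytes) -> str:
    """Render a raw 6-byte chassis ID as a colon-separated lowercase MAC."""
    if len(raw) != 6:
        raise ValueError(f"expected a 6-byte MAC, got {len(raw)} bytes")
    return ":".join(f"{b:02x}" for b in raw)


def build_device_identified(device_id: int, raw_chassis: bytes) -> DeviceIdentified:
    """Pair the configured device_id with the chassis ID the device reports for itself."""
    return DeviceIdentified(device_id=device_id, chassis_id=format_chassis_id(raw_chassis))


event = build_device_identified(42, bytes.fromhex("001aa1112233"))
print(event.chassis_id)  # 00:1a:a1:11:22:33
```

Whatever format is chosen, the stored device.chassis_id and adjacency.remote_chassis_id have to agree on it, since both resolution paths match on equality.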

Two resolution paths, neither sufficient alone

Eager resolution at adjacency insert. When apply_adjacency_observation writes a new adjacency row, it first runs SELECT id FROM device WHERE chassis_id = $remote_chassis_id. If the peer has already been identified (its DeviceIdentified event processed), remote_device_id is populated immediately. New adjacency rows are self-resolving in the steady state.

Backfill on DeviceIdentified. When a DeviceIdentified event lands, the reconciler runs:

UPDATE adjacency
SET remote_device_id = $device_id
WHERE remote_chassis_id = $chassis_id
  AND remote_device_id IS NULL

This catches the timing case where an adjacency row naming the peer landed BEFORE the peer itself was identified, which is common during initial fabric bring-up when multiple collectors are racing. Without backfill, those rows would sit unresolved forever (or until they age out and are re-observed).

Both paths together close the race in both directions:

If sw-edge-7 was identified before sw-core-1 saw it: eager resolution fills in the row at INSERT time.
If sw-core-1 saw sw-edge-7 before sw-edge-7’s DeviceIdentified arrived: backfill fills in the row when it does.

Either way, once both ends have completed one poll cycle, the adjacency rows between them are fully resolved.
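The two orderings can be replayed against a toy schema. This is a minimal sketch using an in-memory SQLite database, not the reconciler's actual code. The tables are heavily simplified (no port table, no valid_during range), apply_device_identified and resolved_peer are hypothetical helper names, and device id 7 for sw-edge-7 is made up for the example.

```python
import sqlite3

# Simplified stand-in schema; the real tables have more columns.
SCHEMA = """
CREATE TABLE device (
    id         INTEGER PRIMARY KEY,
    hostname   TEXT NOT NULL,
    chassis_id TEXT
);
CREATE TABLE adjacency (
    id                INTEGER PRIMARY KEY,
    local_device_id   INTEGER NOT NULL REFERENCES device(id),
    local_port        TEXT NOT NULL,
    remote_chassis_id TEXT NOT NULL,
    remote_device_id  INTEGER REFERENCES device(id)
);
"""
CORE_MAC = "00:1a:a1:11:22:33"


def new_db():
    """Fresh database with both devices registered but not yet identified."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.executemany("INSERT INTO device (id, hostname) VALUES (?, ?)",
                   [(7, "sw-edge-7"), (42, "sw-core-1")])
    return db


def apply_device_identified(db, device_id, chassis_id):
    """Record the chassis ID, then backfill adjacency rows that were waiting on it."""
    db.execute("UPDATE device SET chassis_id = ? WHERE id = ?", (chassis_id, device_id))
    db.execute("UPDATE adjacency SET remote_device_id = ? "
               "WHERE remote_chassis_id = ? AND remote_device_id IS NULL",
               (device_id, chassis_id))


def apply_adjacency_observation(db, local_device_id, local_port, remote_chassis_id):
    """Insert an adjacency row, resolving the peer eagerly if it is already known."""
    row = db.execute("SELECT id FROM device WHERE chassis_id = ?",
                     (remote_chassis_id,)).fetchone()
    db.execute("INSERT INTO adjacency (local_device_id, local_port, "
               "remote_chassis_id, remote_device_id) VALUES (?, ?, ?, ?)",
               (local_device_id, local_port, remote_chassis_id, row[0] if row else None))


def resolved_peer(db, local_device_id, local_port):
    """Return remote_device_id for one adjacency row (None while unresolved)."""
    return db.execute("SELECT remote_device_id FROM adjacency "
                      "WHERE local_device_id = ? AND local_port = ?",
                      (local_device_id, local_port)).fetchone()[0]


# Ordering 1: the peer is identified first, so the INSERT resolves eagerly.
db = new_db()
apply_device_identified(db, 42, CORE_MAC)
apply_adjacency_observation(db, 7, "Eth8", CORE_MAC)
eager = resolved_peer(db, 7, "Eth8")

# Ordering 2: the adjacency lands first and stays NULL until backfill runs.
db = new_db()
apply_adjacency_observation(db, 7, "Eth8", CORE_MAC)
before = resolved_peer(db, 7, "Eth8")
apply_device_identified(db, 42, CORE_MAC)
after = resolved_peer(db, 7, "Eth8")

print(eager, before, after)  # 42 None 42
```

Dropping either the SELECT in apply_adjacency_observation or the second UPDATE in apply_device_identified leaves one of the two orderings stuck at NULL, which is why neither path is sufficient alone.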

What we deliberately don’t do

Cross-vendor chassis_id deduplication. Some vendors emit the base management MAC; others emit a chassis assignment that doesn’t match the management interface. We trust LLDP’s report verbatim and rely on the registered device having the same chassis_id it advertises. Operators who pre-populate device.chassis_id via device add --chassis <mac> short-circuit this entirely.
Backfill rewriting on chassis_id change. A hardware swap changes the chassis_id but the device_id stays. Historical adjacency rows pointing at the OLD chassis_id are left as-is (they’re real historical observations); only NEW adjacencies get resolved against the new chassis_id. If you want to retroactively rewrite history, a one-shot l2trace resolve-peers --hostname <h> --since <T> job would do it — not implemented today.
Bidirectional verification. A real LLDP fabric is symmetric: if sw-edge sees sw-core on Eth8, then sw-core should see sw-edge on whatever-port-faces-sw-edge. We don’t cross-check today. Adding it would let us detect “LLDP advertised but never received” failure modes (one-way cabling, asymmetric port shutdowns).
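None of this exists today. Purely as an illustration of the bidirectional check described in the last item, a cross-check over already-resolved adjacencies could be as small as the sketch below. The function name and the (local_device_id, remote_device_id) pair shape are assumptions made for the example.

```python
def one_way_links(resolved_pairs):
    """Return (local, remote) device-id pairs that have no matching reverse pair.

    resolved_pairs: iterable of (local_device_id, remote_device_id) tuples taken
    from adjacency rows whose remote_device_id is already populated.
    """
    seen = set(resolved_pairs)
    return sorted((a, b) for (a, b) in seen if (b, a) not in seen)


# Device 7 and 42 see each other; 7 also reports 9, but 9 never reports 7.
print(one_way_links([(7, 42), (42, 7), (7, 9)]))  # [(7, 9)]
```

A check like this only sees rows that peer resolution has already filled in, so it would have to run after both ends have been polled.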

Verifying it works

After registering a couple of devices, wait for one poll cycle:

make device-add HOSTNAME=sw-a IP=10.0.0.1 SOURCE=snmp COMMUNITY=ro
make device-add HOSTNAME=sw-b IP=10.0.0.2 SOURCE=snmp COMMUNITY=ro

# Wait ~90s (one orchestrator reconfig + one SNMP poll cycle)

make psql
> SELECT d.hostname, d.chassis_id::text FROM device d;
        hostname  | chassis_id
        ----------+-------------------
        sw-a      | 00:de:ad:be:ef:01
        sw-b      | 00:de:ad:be:ef:02

> SELECT
>     d_local.hostname AS local_dev,
>     p.name           AS local_port,
>     a.remote_chassis_id::text,
>     d_remote.hostname AS remote_dev
> FROM adjacency a
> JOIN port p ON p.id = a.local_port_id
> JOIN device d_local ON d_local.id = p.device_id
> LEFT JOIN device d_remote ON d_remote.id = a.remote_device_id
> WHERE upper_inf(a.valid_during);
        local_dev | local_port | remote_chassis_id | remote_dev
        ----------+------------+-------------------+-----------
        sw-a      | Eth8       | 00:de:ad:be:ef:02 | sw-b
        sw-b      | Eth8       | 00:de:ad:be:ef:01 | sw-a

If remote_dev is populated in both rows, peer resolution works end-to-end, and the recursive-CTE traceroute can walk between sw-a and sw-b.
