The L2 traceroute algorithm

L2 traceroute is fundamentally different from IP traceroute. There are no TTLs, no ICMP, no probe responses. The only data is what each device’s CAM/MAC table records, and what LLDP / CDP says about its neighbors. The “path” exists only in the sense that if a frame from MAC A to MAC B were forwarded right now, this is the sequence of switches and trunks it would traverse.

l2trace reconstructs that path from observed forwarding state.

Inputs

src_mac, dst_mac, vlan, as_of

as_of defaults to now(). It’s wire-time; the recursive CTE filters every row through valid_during @> as_of.

The walk

1. INGRESS: find a row where mac=src_mac, vlan=vlan, port.role='access',
   and valid_during @> as_of. The access-port + vlan combo is "where
   does the frame enter the fabric?"

2. At the current device:
   a. Look up dst_mac's CAM entry → that's the egress port for the frame.
   b. Filter: check stp_state(egress_port, vlan).state = 'forwarding'.
      If 'blocking', emit a "would be blocked" hop and terminate.

3. If egress port has role='access':
   → DONE. We've reached the egress access port.

   If egress port has role='trunk':
   → Look up adjacency(local_port_id=egress_port, valid_during @> as_of).
     → That gives us the remote device and remote port.
     → Loop-check: have we visited this port already? If yes, terminate LOOP.
     → Otherwise, recurse into that device (step 2).

4. If dst_mac has no CAM entry at the current device:
   → FLOODS. The frame would be flooded out every trunk on the VLAN.
     We emit a "floods at sw_X" hop and terminate.

5. If max_hops is reached:
   → MAX_HOPS. Pathological topology or hidden bug; surface it.

The actual SQL is a recursive CTE in db/queries.py::traceroute(). The walker maintains a path array of visited port_ids so the loop-guard check is just NOT next_port = ANY(path). The max-hops guard is an explicit step < :max_hops filter on the recursive arm.

Why each termination matters

Termination	What it means	Operator action
`reached`	Path found end-to-end	Trust the hop list
`floods`	Intermediate switch doesn’t know dst → would flood out trunks	Check why CAM hasn’t learned dst; possibly a stale entry that aged out
`dead-end`	Trunk has no adjacency for this VLAN	Check LLDP — broken cable, or vlan-allowed config off on the trunk
`loop`	Walker revisited a port	STP misconfiguration or fabric loop; investigate STP root
`max-hops`	Hit the loop guard with no other reason	Probably a bug — file an issue with the trace output

MLAG / VPC handling

When two switches form an MLAG pair, the same MAC can appear on both peers simultaneously — without bitemporal, this looks like the MAC flapping every few seconds. With MLAG, we expect this.

device.mlag_group_id collapses the peer pair into one logical device for path traversal. The current implementation treats mlag_group_id as the canonical node-id during recursion, so MLAG peers don’t double-count in the visited-set and the hop output shows the group, not the random peer that happened to win the CAM race.

There’s a known limitation here: MLAG-aware port resolution for peer-link-only learnt MACs isn’t fully solved yet (Hamilton MISSED-H in the CLAUDE.md tombstone). When it’s done, this page will get an update.

Why a recursive CTE and not a graph database

The bitemporal source-of-truth is relational. We use Apache AGE for graph projection (the MAC → LEARNED_ON → Port → NEIGHBOR → Port relationship view), but the traceroute query itself runs as a recursive CTE against the relational store. Reasons:

No replication lag. The graph view is the same Postgres transaction as the relational write — there’s no “wait for replication” step that AGE projection would introduce.
Easier to evolve. Adding STP filter, MLAG collapse, role checks — all just additional JOINs in the CTE. Doing the same in Cypher gets verbose fast.
One database to operate. No Neo4j daemon, no separate replication, no second backup story.

The trade-off: the recursive CTE costs more per hop than a native graph walker would. For typical campus fabrics (<10 hops) this is a non-issue; for very large data-center spines (>20 hops) the algorithm starts to need optimization. We haven’t hit that wall yet.