The L2 traceroute algorithm
L2 traceroute is fundamentally different from IP traceroute. There are no TTLs, no ICMP, no probe responses. The only data is what each device’s CAM/MAC table records, and what LLDP / CDP says about its neighbors. The “path” exists only in the sense that if a frame from MAC A to MAC B were forwarded right now, this is the sequence of switches and trunks it would traverse.
l2trace reconstructs that path from observed forwarding state.
Inputs
Section titled “Inputs”src_mac, dst_mac, vlan, as_ofas_of defaults to now(). It’s wire-time; the recursive CTE filters
every row through valid_during @> as_of.
The walk
Section titled “The walk”1. INGRESS: find a row where mac=src_mac, vlan=vlan, port.role='access', and valid_during @> as_of. The access-port + vlan combo is "where does the frame enter the fabric?"
2. At the current device: a. Look up dst_mac's CAM entry → that's the egress port for the frame. b. Filter: check stp_state(egress_port, vlan).state = 'forwarding'. If 'blocking', emit a "would be blocked" hop and terminate.
3. If egress port has role='access': → DONE. We've reached the egress access port.
If egress port has role='trunk': → Look up adjacency(local_port_id=egress_port, valid_during @> as_of). → That gives us the remote device and remote port. → Loop-check: have we visited this port already? If yes, terminate LOOP. → Otherwise, recurse into that device (step 2).
4. If dst_mac has no CAM entry at the current device: → FLOODS. The frame would be flooded out every trunk on the VLAN. We emit a "floods at sw_X" hop and terminate.
5. If max_hops is reached: → MAX_HOPS. Pathological topology or hidden bug; surface it.The actual SQL is a recursive CTE in db/queries.py::traceroute(). The
walker maintains a path array of visited port_ids so the loop-guard
check is just NOT next_port = ANY(path). The max-hops guard is an
explicit step < :max_hops filter on the recursive arm.
Why each termination matters
Section titled “Why each termination matters”| Termination | What it means | Operator action |
|---|---|---|
reached | Path found end-to-end | Trust the hop list |
floods | Intermediate switch doesn’t know dst → would flood out trunks | Check why CAM hasn’t learned dst; possibly a stale entry that aged out |
dead-end | Trunk has no adjacency for this VLAN | Check LLDP — broken cable, or vlan-allowed config off on the trunk |
loop | Walker revisited a port | STP misconfiguration or fabric loop; investigate STP root |
max-hops | Hit the loop guard with no other reason | Probably a bug — file an issue with the trace output |
MLAG / VPC handling
Section titled “MLAG / VPC handling”When two switches form an MLAG pair, the same MAC can appear on both peers simultaneously — without bitemporal, this looks like the MAC flapping every few seconds. With MLAG, we expect this.
device.mlag_group_id collapses the peer pair into one logical device
for path traversal. The current implementation treats mlag_group_id
as the canonical node-id during recursion, so MLAG peers don’t double-count
in the visited-set and the hop output shows the group, not the random
peer that happened to win the CAM race.
There’s a known limitation here: MLAG-aware port resolution for peer-link-only learnt MACs isn’t fully solved yet (Hamilton MISSED-H in the CLAUDE.md tombstone). When it’s done, this page will get an update.
Why a recursive CTE and not a graph database
Section titled “Why a recursive CTE and not a graph database”The bitemporal source-of-truth is relational. We use Apache AGE for graph
projection (the MAC → LEARNED_ON → Port → NEIGHBOR → Port
relationship view), but the traceroute query itself runs as a recursive
CTE against the relational store. Reasons:
- No replication lag. The graph view is the same Postgres transaction as the relational write — there’s no “wait for replication” step that AGE projection would introduce.
- Easier to evolve. Adding STP filter, MLAG collapse, role checks —
all just additional
JOINs in the CTE. Doing the same in Cypher gets verbose fast. - One database to operate. No Neo4j daemon, no separate replication, no second backup story.
The trade-off: the recursive CTE costs more per hop than a native graph walker would. For typical campus fabrics (<10 hops) this is a non-issue; for very large data-center spines (>20 hops) the algorithm starts to need optimization. We haven’t hit that wall yet.
See also
Section titled “See also”- The query source:
src/l2trace/db/queries.py::traceroute() - Data model reference — the tables the CTE walks
- Why bitemporal? — the
valid_during @> as_offilter is the whole reason “as of T” works