How MLAG-collapsed traceroute works
The problem MLAG creates
Most enterprise data centers run MLAG (multi-chassis link aggregation, called vPC on Cisco NX-OS, MLAG-domain on Arista, MC-LAG on Juniper). Two physical switches present themselves as one logical switch to downstream LACP partners:
┌──────────┐         ┌──────────┐
│ sw-core-1│ ──peer──│ sw-core-2│
│  (mlag-1)│   link  │  (mlag-1)│
└──────────┘         └──────────┘
     │     /\            │
     └────/  \───────────┘
      LACP bundle
           │
      ┌──────────┐
      │ sw-edge-a│
      │  (LACP)  │
      └──────────┘
For an L2 traceroute walker that doesn’t know about MLAG, this topology is a minefield:
- CAM tables appear identical on both peers. When sw-edge-a’s MAC is learned on the LACP bundle, both sw-core-1 and sw-core-2 see it on their member port via the MLAG state sync. To a naive walker, this looks like a “MAC moved” event between peers every reconciler cycle, even though the host hasn’t physically moved.
- The peer-link is an LLDP-visible adjacency. sw-core-1 and sw-core-2 see each other via LLDP across the peer-link. A walker following adjacencies blindly would treat that as a valid hop, possibly looping between the two peers or just adding spurious nodes to the path.
- Audit dashboards fill with noise. The peer-link is bidirectional by design, because peer-sync depends on it. So in the adjacency audit it shows up as ✓ healthy for every cycle, inflating the row count and burying real one-way-cable issues.
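The second point can be made concrete with a toy walker. The sketch below is not l2trace code: the adjacency list is a hand-written stand-in for the LLDP data in the diagram, and `simple_paths` is a hypothetical helper that enumerates every loop-free path a walker could follow when it treats all adjacencies as equal.

```python
# Hypothetical LLDP adjacencies for the topology in the diagram (undirected).
ADJACENCIES = [
    ("sw-core-1", "sw-core-2"),  # MLAG peer-link
    ("sw-core-1", "sw-edge-a"),  # LACP member link
    ("sw-core-2", "sw-edge-a"),  # LACP member link
]


def simple_paths(src, dst, adjacencies):
    """Enumerate loop-free paths from src to dst, shortest first."""
    graph = {}
    for a, b in adjacencies:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)

    paths = []
    stack = [[src]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == dst:
            paths.append(path)
            continue
        for nxt in graph.get(node, []):
            if nxt not in path:  # never revisit a device within one path
                stack.append(path + [nxt])
    return sorted(paths, key=len)


paths = simple_paths("sw-edge-a", "sw-core-1", ADJACENCIES)
for p in paths:
    print(" -> ".join(p))
```

Besides the direct hop, the walker also finds a path that detours through sw-core-2 and then crosses the peer-link, which is exactly the spurious node described above.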
The l2trace solution: per-group filtering
The data model has had device.mlag_group_id BIGINT NULL since
migration 0001, designed for exactly this. Operators declare an MLAG
pair via:
    l2trace mlag create --hosts sw-core-1,sw-core-2
After that, two SQL filters change behavior:
Filter 1: traceroute CTE skips peer-link adjacencies
db/queries.py::_TRACEROUTE_SQL adds a clause to the adj CTE:
    WHERE ...
      AND (
        d_local.mlag_group_id IS NULL
        OR d_remote.mlag_group_id IS NULL
        OR d_local.mlag_group_id <> d_remote.mlag_group_id
      )
This excludes adjacencies where both ends share an mlag_group_id.
The recursive walk can no longer step from sw-core-1 to sw-core-2
mid-trace; it follows only “real” hops (uplinks to other devices).
The NULL-on-either-side branches are important: an adjacency from an MLAG peer to a non-MLAG access switch is a real hop, not a peer-link. Without them, the <> comparison against NULL would evaluate to NULL and the row would be dropped. The filter only fires when both ends are in the same group.
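To see the predicate at work, here is a minimal sketch against an in-memory SQLite database. Only device.mlag_group_id and the d_local / d_remote aliases come from the text above; the adjacency table, its columns, the sw-spine-1 device, and the surrounding query are assumptions made for illustration, not the real l2trace schema or the actual _TRACEROUTE_SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Toy schema: only mlag_group_id mirrors the documented data model.
    CREATE TABLE device (
        id INTEGER PRIMARY KEY,
        hostname TEXT NOT NULL,
        mlag_group_id BIGINT NULL
    );
    CREATE TABLE adjacency (
        local_device_id INTEGER NOT NULL REFERENCES device(id),
        remote_device_id INTEGER NOT NULL REFERENCES device(id)
    );
    INSERT INTO device VALUES
        (1, 'sw-core-1', 1),
        (2, 'sw-core-2', 1),
        (3, 'sw-edge-a', NULL),
        (4, 'sw-spine-1', 2);
    INSERT INTO adjacency VALUES
        (1, 2), (2, 1),   -- peer-link, observed from both sides
        (1, 3), (2, 3),   -- member links to a non-MLAG switch
        (1, 4);           -- uplink to a device in a different MLAG group
""")

BASE = """
    SELECT d_local.hostname, d_remote.hostname
    FROM adjacency a
    JOIN device d_local ON d_local.id = a.local_device_id
    JOIN device d_remote ON d_remote.id = a.remote_device_id
    WHERE {predicate}
    ORDER BY 1, 2
"""

# The documented three-branch predicate.
FULL = """(d_local.mlag_group_id IS NULL
           OR d_remote.mlag_group_id IS NULL
           OR d_local.mlag_group_id <> d_remote.mlag_group_id)"""
# The same comparison without the NULL branches, for contrast.
BARE = "d_local.mlag_group_id <> d_remote.mlag_group_id"

with_null_branches = conn.execute(BASE.format(predicate=FULL)).fetchall()
without_null_branches = conn.execute(BASE.format(predicate=BARE)).fetchall()

print(with_null_branches)
print(without_null_branches)
```

With the full predicate, the two peer-link rows disappear and the three real hops remain. With the bare `<>` comparison, the hops to sw-edge-a vanish as well, because comparing a group id against NULL is never true in SQL.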
Filter 2: audit excludes peer-links by default
db/queries.py::audit_adjacencies adds the same predicate, gated by
the include_peer_links kwarg (default False). Peer-link rows stay
in the database, since they’re real LLDP observations, but the operator’s
“what’s broken?” view doesn’t show them unless explicitly asked:
    l2trace audit-adjacencies --include-peer-links   # show all
Why “filter” instead of “auto-detect MLAG”?
In principle, l2trace could try to detect MLAG pairs from telemetry
signals — bidirectional LLDP between two switches across multiple
ports, matching chassis_id advertised on both sides, etc. We
deliberately don’t:
- No signal is reliable enough to act on automatically. Two switches with a half-dozen cross-links could be MLAG peers, or they could be a deliberately-redundant lossy-protocol-aware design that is NOT MLAG.
- Operator intent matters. “Are these MLAG peers?” is a deployment decision; mislabeling it could mask real one-way cable problems by hiding the very adjacencies the operator needs to audit.
So MLAG grouping is an operator declaration, not telemetry. See How to configure MLAG for the workflow.
Why “filter peer-links” instead of “merge the two peers”?
We considered fully merging the two peer devices into one logical
node — same device_id from the traceroute walker’s perspective,
combined CAM tables, etc. We didn’t, because:
- The audit needs to see both peers individually. “sw-core-1 has 500 MACs but sw-core-2 has 502” is a real operational signal — one of them lost an LLDP relationship or a CAM sync. Merging would hide this.
- Cross-source disagreement detection works per-device. gNMI on sw-core-1 saying MAC X is on Eth5 while SNMP on sw-core-2 says Eth7 is a real disagreement worth flagging. Merging would lose the per-device source attribution.
- The trace renderer can still collapse. TraceHop.mlag_group_id is exposed in query output; the TUI’s TRACE screen annotates the hostname with (mlag-N) so the operator visually knows it’s a paired hop without the data layer needing to lie about per-device identity.
The principle: the bitemporal store stays honest about per-device identity; the renderer collapses for the operator-facing view.
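A rough sketch of that split between store and renderer follows. TraceHop and its mlag_group_id field are named in the text above; the other fields, the `render_hop` helper, and the exact output layout are assumptions for illustration and may not match the real TUI.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TraceHop:
    # hostname and port are hypothetical fields; mlag_group_id is the
    # attribute the text says is exposed in query output.
    hostname: str
    port: str
    mlag_group_id: Optional[int] = None


def render_hop(hop: TraceHop) -> str:
    """Annotate MLAG members at render time; the hop itself is unchanged."""
    label = hop.hostname
    if hop.mlag_group_id is not None:
        label = f"{label} (mlag-{hop.mlag_group_id})"
    return f"{label} [{hop.port}]"


trace = [
    TraceHop("sw-edge-a", "Po1"),
    TraceHop("sw-core-1", "Eth5", mlag_group_id=1),
]
rendered = [render_hop(h) for h in trace]
print("\n".join(rendered))
```

Each hop keeps its own hostname in the data, so per-device attribution survives; only the label shown to the operator carries the group annotation.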
What this doesn’t handle yet
- Per-flow LACP hash prediction. When sw-edge-a sends a frame destined for the MLAG pair, the LACP hash picks one peer deterministically based on the frame’s headers. We don’t model that hash; the traceroute shows “the peer that has the CAM entry for the dst MAC”, which is correct in the common case (it’s the same peer that handled the source flow).
- MLAG failover paths. Under normal operation, frames don’t cross the peer-link. When one peer fails, frames briefly transit the peer-link until LACP reconverges. The traceroute walker would see this as a topology change and adapt, but doesn’t model the failover transition itself.
- Triple-peer / MLAG-quad configurations. create_mlag_group accepts N≥2 hostnames, so triple/quad groups technically work, but they’re not exercised by the test suite.
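To illustrate what “deterministically based on the frame’s headers” means in the first limitation, here is a toy link-selection hash. It is purely illustrative: real switches use vendor-specific hash inputs and algorithms, and `pick_member` with its CRC32-over-MAC-pair scheme is a hypothetical stand-in, not something l2trace implements.

```python
import zlib


def pick_member(src_mac: str, dst_mac: str, members: list[str]) -> str:
    """Toy hash: CRC32 over the MAC pair, modulo the number of bundle members."""
    key = f"{src_mac.lower()}|{dst_mac.lower()}".encode()
    return members[zlib.crc32(key) % len(members)]


members = ["sw-core-1", "sw-core-2"]
flow = ("aa:bb:cc:00:00:01", "aa:bb:cc:00:00:02")

first = pick_member(*flow, members)
second = pick_member(*flow, members)
# Identical headers always map to the same member, so a flow sticks to one peer.
print(first, second)
```

Predicting the real peer would require knowing the exact hash inputs and algorithm configured on sw-edge-a, which is why the walker falls back to the CAM entry instead.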
See also
- Configure MLAG groups: operator workflow
- Audit LLDP adjacencies: how the peer-link filter interacts with the audit
- Source: src/l2trace/db/mlag.py + the adj CTE in src/l2trace/db/queries.py