How MLAG-collapsed traceroute works

The problem MLAG creates

Most enterprise data centers run MLAG (multi-chassis link aggregation: vPC on Cisco NX-OS, MLAG on Arista EOS, MC-LAG on Juniper Junos). Two physical switches present themselves as one logical switch to downstream LACP partners:

                ┌──────────┐         ┌──────────┐
                │ sw-core-1│ ──peer──│ sw-core-2│
                │  (mlag-1)│  link   │  (mlag-1)│
                └──────────┘         └──────────┘
                       │     /\            │
                       └────/  \───────────┘
                          LACP bundle
                              │
                       ┌──────────┐
                       │ sw-edge-a│
                       │   (LACP) │
                       └──────────┘

For an L2 traceroute walker that doesn’t know about MLAG, this topology is a minefield:

CAM tables appear identical on both peers. When sw-edge-a’s MAC is learned on the LACP bundle, both sw-core-1 and sw-core-2 record it on their own member port via MLAG state sync. To a naive walker, this looks like a “MAC moved” event between the peers on every reconciler cycle, even though the host hasn’t physically moved.
The peer-link is an LLDP-visible adjacency. sw-core-1 and sw-core-2 see each other as LLDP neighbors across the peer-link. A walker that follows adjacencies blindly would treat that as a valid hop, either looping between the two peers or adding spurious nodes to the path.
Audit dashboards fill with noise. The peer-link is bidirectional by design — peer-sync depends on it. So in the adjacency audit it shows up as ✓ healthy for every cycle, inflating the row count and burying real one-way-cable issues.
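To make the spurious-hop problem concrete, here is a minimal sketch of a walker that follows every LLDP adjacency blindly. The adjacency map and the simple_paths helper are hypothetical illustrations built from the diagram above, not l2trace's data model or code.

```python
# Hypothetical LLDP adjacency map for the three switches in the diagram.
ADJACENCIES = {
    "sw-edge-a": ["sw-core-1", "sw-core-2"],
    "sw-core-1": ["sw-core-2", "sw-edge-a"],  # sw-core-2 is reached over the peer-link
    "sw-core-2": ["sw-core-1", "sw-edge-a"],  # sw-core-1 is reached over the peer-link
}


def simple_paths(adj, src, dst, path=None):
    """Return every loop-free path from src to dst, following all adjacencies."""
    path = (path or []) + [src]
    if src == dst:
        return [path]
    found = []
    for nxt in adj[src]:
        if nxt not in path:  # never revisit a device on the current path
            found.extend(simple_paths(adj, nxt, dst, path))
    return found


paths = simple_paths(ADJACENCIES, "sw-edge-a", "sw-core-2")
for p in paths:
    print(" -> ".join(p))
```

Alongside the direct hop, the walker also reports a path through sw-core-1 and across the peer-link, which is the kind of spurious node the filtering described next is meant to remove.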

The l2trace solution: per-group filtering

The data model has had device.mlag_group_id BIGINT NULL since migration 0001, designed for exactly this. Operators declare an MLAG pair via:

l2trace mlag create --hosts sw-core-1,sw-core-2

After that, two SQL filters change behavior:

Filter 1: traceroute CTE skips peer-link adjacencies

db/queries.py::_TRACEROUTE_SQL adds a predicate to its adj CTE:

WHERE ...
  AND (
    d_local.mlag_group_id IS NULL
    OR d_remote.mlag_group_id IS NULL
    OR d_local.mlag_group_id <> d_remote.mlag_group_id
  )

This excludes adjacencies where both ends share an mlag_group_id. The recursive walk can no longer step from sw-core-1 to sw-core-2 mid-trace; it follows only “real” hops (uplinks to other devices).

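The NULL-on-either-side branches are important: an adjacency from an MLAG peer to a non-MLAG access switch is a real hop, not a peer-link. They are also required by SQL semantics: comparing NULL with <> yields NULL rather than true, so without them every adjacency touching an ungrouped device would be dropped. The filter only fires when both ends are in the same group.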
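The predicate's behavior can be checked against an in-memory SQLite database. This is a sketch under stated assumptions: the device and adjacency tables below are a simplified, hypothetical schema (only mlag_group_id is taken from the text), and sw-core-3 is an invented device in a second group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified schema; not l2trace's real tables.
    CREATE TABLE device (id INTEGER PRIMARY KEY, hostname TEXT, mlag_group_id BIGINT);
    CREATE TABLE adjacency (local_id INTEGER, remote_id INTEGER);
    INSERT INTO device VALUES
        (1, 'sw-core-1', 1), (2, 'sw-core-2', 1),
        (3, 'sw-edge-a', NULL), (4, 'sw-core-3', 2);
    INSERT INTO adjacency VALUES (1, 2), (2, 1), (1, 3), (3, 1), (1, 4);
""")

# The predicate quoted above, verbatim.
PREDICATE = """
    d_local.mlag_group_id IS NULL
    OR d_remote.mlag_group_id IS NULL
    OR d_local.mlag_group_id <> d_remote.mlag_group_id
"""


def hops(where):
    """Return (local, remote) hostname pairs for adjacencies matching `where`."""
    sql = f"""
        SELECT d_local.hostname, d_remote.hostname
        FROM adjacency a
        JOIN device d_local ON d_local.id = a.local_id
        JOIN device d_remote ON d_remote.id = a.remote_id
        WHERE {where}
        ORDER BY 1, 2
    """
    return conn.execute(sql).fetchall()


kept = hops(PREDICATE)
# Same comparison without the IS NULL branches: NULL <> 1 is NULL, so the
# rows touching the ungrouped sw-edge-a disappear as well.
bare = hops("d_local.mlag_group_id <> d_remote.mlag_group_id")
print(kept)
print(bare)
```

With the full predicate, both directions of the sw-core-1/sw-core-2 peer-link are dropped while the links to sw-edge-a and to the other group survive. With the bare comparison, only the cross-group link remains.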

Filter 2: audit excludes peer-links by default

db/queries.py::audit_adjacencies adds the same predicate, gated by the include_peer_links kwarg (default False). Peer-link rows stay in the database — they’re real LLDP observations — but the operator’s “what’s broken?” view doesn’t show them unless explicitly asked:

l2trace audit-adjacencies --include-peer-links  # show all
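One way such a gate could be wired is sketched below. Only the function name and the include_peer_links kwarg (default False) come from the text; the conn parameter, the table layout, and the returned columns are assumptions made for illustration.

```python
import sqlite3

# The same predicate as Filter 1, appended only when peer-links are hidden.
PEER_LINK_FILTER = """
    AND (
        d_local.mlag_group_id IS NULL
        OR d_remote.mlag_group_id IS NULL
        OR d_local.mlag_group_id <> d_remote.mlag_group_id
    )
"""


def audit_adjacencies(conn, include_peer_links=False):
    """Return (local, remote) hostname pairs; peer-links are hidden by default."""
    sql = """
        SELECT d_local.hostname, d_remote.hostname
        FROM adjacency a
        JOIN device d_local ON d_local.id = a.local_id
        JOIN device d_remote ON d_remote.id = a.remote_id
        WHERE 1 = 1
    """
    if not include_peer_links:
        sql += PEER_LINK_FILTER
    sql += " ORDER BY 1, 2"
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified schema; not l2trace's real tables.
    CREATE TABLE device (id INTEGER PRIMARY KEY, hostname TEXT, mlag_group_id BIGINT);
    CREATE TABLE adjacency (local_id INTEGER, remote_id INTEGER);
    INSERT INTO device VALUES (1, 'sw-core-1', 1), (2, 'sw-core-2', 1), (3, 'sw-edge-a', NULL);
    INSERT INTO adjacency VALUES (1, 2), (2, 1), (1, 3), (2, 3);
""")

default_rows = audit_adjacencies(conn)
all_rows = audit_adjacencies(conn, include_peer_links=True)
print(default_rows)
print(all_rows)
```

The rows themselves are never deleted in this sketch; the flag only changes which of them the query returns, which matches the "stay in the database" behavior described above.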

Why “filter” instead of “auto-detect MLAG”?

In principle, l2trace could try to detect MLAG pairs from telemetry signals — bidirectional LLDP between two switches across multiple ports, matching chassis_id advertised on both sides, etc. We deliberately don’t:

No signal is reliable enough to act on automatically. Two switches with a half-dozen cross-links could be MLAG peers, or they could be a deliberately-redundant lossy-protocol-aware design that is NOT MLAG.
Operator intent matters. “Are these MLAG peers?” is a deployment decision; mislabeling it could mask real one-way cable problems by hiding the very adjacencies the operator needs to audit.

So MLAG grouping is an operator declaration, not telemetry. See How to configure MLAG for the workflow.

Why “filter peer-links” instead of “merge the two peers”?

We considered fully merging the two peer devices into one logical node — same device_id from the traceroute walker’s perspective, combined CAM tables, etc. We didn’t, because:

The audit needs to see both peers individually. “sw-core-1 has 500 MACs but sw-core-2 has 502” is a real operational signal — one of them lost an LLDP relationship or a CAM sync. Merging would hide this.
Cross-source disagreement detection works per-device. gNMI on sw-core-1 saying MAC X is on Eth5 while SNMP on sw-core-2 says Eth7 is a real disagreement worth flagging. Merging would lose the per-device source attribution.
The trace renderer can still collapse. TraceHop.mlag_group_id is exposed in query output; the TUI’s TRACE screen annotates the hostname with (mlag-N) so the operator visually knows it’s a paired hop without the data layer needing to lie about per-device identity.

The principle: the bitemporal store stays honest about per-device identity; the renderer collapses for the operator-facing view.
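The renderer-side annotation might look roughly like the following sketch. TraceHop.mlag_group_id and the (mlag-N) suffix come from the text; the hostname field and the hop_label helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TraceHop:
    # Only mlag_group_id is named in the text; hostname is an assumed field.
    hostname: str
    mlag_group_id: Optional[int] = None


def hop_label(hop: TraceHop) -> str:
    """Annotate MLAG members for display without altering the hop itself."""
    if hop.mlag_group_id is None:
        return hop.hostname
    return f"{hop.hostname} (mlag-{hop.mlag_group_id})"


path = [TraceHop("sw-edge-a"), TraceHop("sw-core-1", mlag_group_id=1)]
labels = [hop_label(h) for h in path]
print(" -> ".join(labels))
```

Because the annotation is computed at render time, each stored hop keeps its own per-device identity.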

What this doesn’t handle yet

Per-flow LAG hash prediction. When sw-edge-a sends a frame toward the MLAG pair, its link-aggregation hash picks one member link, and therefore one peer, deterministically from the frame’s headers (LACP negotiates the bundle; it does not choose the link per frame). We don’t model that hash: the traceroute shows “the peer that has the CAM entry for the dst MAC”, which is correct in the common case (it’s the same peer that handled the source flow).
MLAG failover paths. Under normal operation, frames don’t cross the peer-link. When one peer fails, frames briefly transit the peer-link until LACP reconverges. The traceroute walker would see this as a topology change and adapt, but doesn’t model the failover transition itself.
Triple-peer / MLAG-quad configurations. create_mlag_group accepts N≥2 hostnames so triple/quad groups technically work, but they’re not exercised by the test suite.