Prior-art synthesis — what we built vs what we could
The prior-art reference section summarizes 13 papers across 4 tiers. This page is the cross-cutting synthesis: which of those ideas l2trace embraced, which we deliberately sidestepped, and which we sketched but haven’t built yet.
The gap claim, sharpened
l2trace’s headline architectural choice is bitemporal storage of L2
forwarding state — valid_during × recorded_during per row in
mac_observation and adjacency. The lit review’s tightest version
of why that’s a contribution:
No published work does bitemporal historical reconstruction of L2 forwarding state from passively collected CAM/LLDP data, despite multiple authors explicitly identifying the static-snapshot assumption as a fundamental limitation.
Tier-2 evidence for the second clause:
- Mai 2011, §6 (Anteater): measures Abilene FIB-change rate and proposes “consistent snapshot collection + SNMP-trap retry.” That IS time-aware — but unitemporal.
- Lopes 2015, §1 (NoD) opens the paper with: “existing network verification techniques assume the network is static and operate on a static snapshot of the forwarding state.”
- HSA, NetPlumber, VeriFlow all operate on a single snapshot of the current forwarding state and don’t retain prior states.
The closest neighbor is Anteater’s “consistent snapshot collection,” which is unitemporal — l2trace’s bitemporal model is a strictly stronger structural commitment.
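To make the two axes concrete, here is a minimal sketch of a bitemporal point query. The `MacObservation` class, its field names, and the integer timestamps are illustrative assumptions, not l2trace’s actual schema; the only thing taken from the text above is that each row carries a valid_during and a recorded_during interval.

```python
from dataclasses import dataclass

INF = float("inf")  # stands in for an open-ended interval bound


@dataclass(frozen=True)
class MacObservation:
    # Hypothetical row shape: both intervals are half-open, [from, to).
    mac: str
    switch: str
    port: str
    valid_from: int      # valid_during: when the MAC really was on the port
    valid_to: float
    recorded_from: int   # recorded_during: when the store held this belief
    recorded_to: float


def as_of(rows, mac, valid_at, recorded_at):
    """Rows the store believed at `recorded_at` about where `mac` was at `valid_at`."""
    return [
        r for r in rows
        if r.mac == mac
        and r.valid_from <= valid_at < r.valid_to
        and r.recorded_from <= recorded_at < r.recorded_to
    ]


rows = [
    # First belief, recorded at t=105 and retracted at t=300: MAC on Eth1 since t=100.
    MacObservation("aa:bb", "sw1", "Eth1", 100, INF, 105, 300),
    # Corrected belief recorded at t=300: the MAC left Eth1 at t=200 ...
    MacObservation("aa:bb", "sw1", "Eth1", 100, 200, 300, INF),
    # ... and has been on Eth5 since t=200.
    MacObservation("aa:bb", "sw1", "Eth5", 200, INF, 300, INF),
]

# Same valid-time question, asked against two different states of knowledge.
believed_then = [r.port for r in as_of(rows, "aa:bb", valid_at=250, recorded_at=150)]
believed_now = [r.port for r in as_of(rows, "aa:bb", valid_at=250, recorded_at=400)]
```

A unitemporal store can answer only the second question; the first one (what the system believed before the correction arrived) needs the recorded_during axis.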
What l2trace embraced from prior art
From Tier 4 (Snodgrass 2000): the whole storage vocabulary
l2trace did NOT re-coin temporal-database terminology. valid_during and recorded_during correspond exactly to valid time and transaction time in the 1998 consensus formulation, and the bitemporal-write-with-late-arrival algorithm is the canonical sequenced retraction Snodgrass works through in Chapter 10. The mortgage-bank Prop_Owner case study is structurally the same as mac_observation.
The honest framing: l2trace is not novel in its storage model. It’s novel in applying the 1998-consensus bitemporal model to L2 forwarding paths.
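The late-arrival write can be sketched in a few lines. This is a simplified illustration of the retract-and-reassert idea, not Snodgrass’s SQL and not l2trace’s reconciler; the dict keys and the `apply_late_arrival` function are hypothetical names, and it assumes a single correction of the form "the MAC moved to a new port at an earlier time than we had recorded".

```python
INF = float("inf")


def apply_late_arrival(rows, mac, new_port, moved_at, now):
    """Record at `now` that `mac` moved to `new_port` at the earlier time `moved_at`.

    Rows are dicts with keys mac, port, valid_from, valid_to, recorded_from,
    recorded_to. No belief is deleted: superseded rows only get recorded_to closed.
    """
    result = []
    for r in rows:
        affected = (
            r["mac"] == mac
            and r["recorded_to"] == INF      # still the current belief
            and r["valid_to"] > moved_at     # claims validity past the move
        )
        if not affected:
            result.append(dict(r))
            continue
        # Retract: the old belief stops being current at `now` but stays queryable.
        result.append({**r, "recorded_to": now})
        if r["valid_from"] < moved_at:
            # Re-assert the portion of the old belief that is still true.
            result.append({**r, "valid_to": moved_at, "recorded_from": now})
    # Assert the newly learned fact.
    result.append({
        "mac": mac, "port": new_port,
        "valid_from": moved_at, "valid_to": INF,
        "recorded_from": now, "recorded_to": INF,
    })
    return result


before = [{
    "mac": "aa:bb", "port": "Eth1",
    "valid_from": 100, "valid_to": INF,
    "recorded_from": 105, "recorded_to": INF,
}]
after = apply_late_arrival(before, "aa:bb", "Eth5", moved_at=200, now=300)
```

One input row becomes three: the retracted original, the trimmed re-assertion, and the new fact. The input list is left untouched.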
From Tier 1 (NetInventory): system architecture, not algorithms
Breitbart 2004’s Direct Connection Theorem is now obsolete (LLDP replaces AFT-intersection inference), but NetInventory’s system architecture (timestamped observations, separated Resource Discovery / Resource Monitoring, retained historical views) is what l2trace’s collector + reconciler layout actually looks like. NetInventory’s storage was unitemporal; l2trace adds the second axis.
From Tier 3 (Perlman 1985 + 802.1Q): STP-state-aware path filtering
The L2 traceroute recursive CTE filters stp_state = 'forwarding'
per (port, vlan). That’s the output of Perlman’s algorithm (blocking
edges should never carry data-plane traffic), recorded per-observation
in our bitemporal log so we can answer “was Eth5 forwarding at 14:42?”
The bitemporal angle is the part Perlman explicitly omitted: her
algorithm preserves no history of prior trees.
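A small, self-contained sketch of an STP-aware recursive CTE follows, using SQLite from Python’s standard library. The `adjacency` table layout, column names, integer timestamps, and the 16-hop cap are assumptions made for the example; it also models only the valid-time axis to stay short, so it is not l2trace’s production query.

```python
import sqlite3

TRACE_SQL = """
WITH RECURSIVE walk(switch, path, depth) AS (
    SELECT :start, '/' || :start || '/', 0
    UNION ALL
    SELECT a.dst_switch, walk.path || a.dst_switch || '/', walk.depth + 1
    FROM walk
    JOIN adjacency AS a ON a.src_switch = walk.switch
    WHERE a.vlan = :vlan
      AND a.stp_state = 'forwarding'                    -- skip blocking edges
      AND a.valid_from <= :at AND :at < a.valid_to      -- state as of time :at
      AND walk.depth < 16                               -- hop cap (assumed)
      AND instr(walk.path, '/' || a.dst_switch || '/') = 0  -- no revisits
)
SELECT path FROM walk WHERE switch = :dest ORDER BY depth LIMIT 1
"""


def build_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE adjacency (
               src_switch TEXT, src_port TEXT, dst_switch TEXT, vlan INTEGER,
               stp_state TEXT, valid_from INTEGER, valid_to INTEGER)"""
    )
    conn.executemany(
        "INSERT INTO adjacency VALUES (?, ?, ?, ?, ?, ?, ?)",
        [
            ("sw1", "Eth1", "sw2", 10, "forwarding", 0, 1000),
            # sw2 -> sw3 forwards until t=500, then the tree reconverges.
            ("sw2", "Eth2", "sw3", 10, "forwarding", 0, 500),
            ("sw2", "Eth2", "sw3", 10, "blocking", 500, 1000),
            # sw1 Eth5 -> sw3 is blocked until t=500, forwarding afterwards.
            ("sw1", "Eth5", "sw3", 10, "blocking", 0, 500),
            ("sw1", "Eth5", "sw3", 10, "forwarding", 500, 1000),
        ],
    )
    return conn


def trace(conn, start, dest, vlan, at):
    """Return the switch-level path as a list, or None if no forwarding path exists."""
    row = conn.execute(
        TRACE_SQL, {"start": start, "dest": dest, "vlan": vlan, "at": at}
    ).fetchone()
    return row[0].strip("/").split("/") if row else None


conn = build_demo_db()
path_early = trace(conn, "sw1", "sw3", vlan=10, at=100)
path_late = trace(conn, "sw1", "sw3", vlan=10, at=600)
```

Asking the same question at two timestamps returns two different paths, because Eth5 was blocking at t=100 and forwarding at t=600. That is the “was Eth5 forwarding at 14:42?” query in miniature.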
From Tier 3 (MLAG/vPC): per-group peer-link filter
The MLAG-collapse feature added in this codebase treats peer-link adjacencies between MLAG peers as non-data-plane hops under steady-state operation. The filter applied in the traceroute CTE and the audit query encodes that assumption. See How MLAG-collapsed traceroute works.
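As a rough illustration of the per-group filter, the sketch below drops any adjacency whose two endpoints belong to the same MLAG group. The data shapes and the `drop_peer_links` name are hypothetical; the real filter lives in SQL, and this only mirrors its intent.

```python
def drop_peer_links(adjacencies, mlag_groups):
    """Remove adjacencies that connect two members of the same MLAG group.

    adjacencies: iterable of (src_switch, dst_switch) pairs.
    mlag_groups: dict mapping a group id to the set of member switches.
    """
    group_of = {
        switch: group_id
        for group_id, members in mlag_groups.items()
        for switch in members
    }
    kept = []
    for src, dst in adjacencies:
        src_group = group_of.get(src)
        if src_group is not None and src_group == group_of.get(dst):
            continue  # peer link: not a steady-state data-plane hop
        kept.append((src, dst))
    return kept


links = [("leaf1a", "leaf1b"), ("leaf1a", "spine1"), ("leaf1b", "spine1"), ("spine1", "leaf2")]
groups = {"mlag-1": {"leaf1a", "leaf1b"}}
data_plane_links = drop_peer_links(links, groups)
```

Switches outside any MLAG group are never filtered, since their group lookup returns `None`.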
From Tier 2 (multiple papers): the “operator beliefs are incomplete” observation
Anteater’s Quagga-bug study found 13 stale config remnants dating back to 2008 — operators forget about past decisions. NoD’s central “belief refutation” framing validates the same pattern at Microsoft scale. l2trace’s adjacency audit and disagreement view are operator-belief-refutation tools in this lineage; they assume the operator thinks the fabric is healthy and surface counterexamples.
What l2trace deliberately sidestepped
AFT-intersection topology inference (Tier 1)
Breitbart, Lowekamp, and Bejerano all derived adjacencies from CAM
table set-arithmetic because LLDP wasn’t widely deployed in 2001–2009. By
2026 LLDP is essentially universal, so we read the adjacency directly
from /lldp/... (gNMI) or LLDP-MIB (SNMP). The peer-resolution mechanism
handles the cross-device join via chassis_id. No CAM-intersection
inference.
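The cross-device join can be pictured as a dictionary lookup keyed on a normalized chassis ID. The sketch below assumes MAC-style chassis IDs and uses made-up tuple layouts and function names; other LLDP chassis-ID subtypes would need their own normalization.

```python
def normalize_chassis_id(raw):
    """Reduce a MAC-style chassis ID to lowercase hex digits (assumed subtype)."""
    return "".join(ch for ch in raw.lower() if ch in "0123456789abcdef")


def resolve_peers(devices, lldp_neighbors):
    """Join LLDP neighbor rows to known devices via chassis_id.

    devices: dict of device name -> that device's own chassis_id.
    lldp_neighbors: (local_device, local_port, remote_chassis_id, remote_port) tuples.
    Returns (adjacencies, unresolved) so unknown peers are reported, not dropped.
    """
    by_chassis = {normalize_chassis_id(cid): name for name, cid in devices.items()}
    adjacencies, unresolved = [], []
    for local_dev, local_port, remote_cid, remote_port in lldp_neighbors:
        peer = by_chassis.get(normalize_chassis_id(remote_cid))
        if peer is None:
            unresolved.append((local_dev, local_port, remote_cid))
        else:
            adjacencies.append((local_dev, local_port, peer, remote_port))
    return adjacencies, unresolved


devices = {"sw1": "00:1C:73:AA:BB:01", "sw2": "001c.73aa.bb02"}
neighbors = [
    ("sw1", "Eth1", "00-1c-73-aa-bb-02", "Eth7"),   # same ID as sw2, different notation
    ("sw1", "Eth2", "de:ad:be:ef:00:01", "Eth1"),   # device we never collected from
]
adjacencies, unresolved = resolve_peers(devices, neighbors)
```

Normalizing before the join matters because vendors report the same chassis ID in colon, dash, and dotted notations.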
SAT translation for path queries (Anteater)
Anteater’s SAT translation handles arbitrary header transformations (NAT, MPLS swap, QoS marking). l2trace’s recursive CTE handles L2 forwarding without rewrites: it is VLAN-tag-aware but performs no MPLS-style label swap. The CTE is cheaper, simpler, and sufficient for the problem domain. SAT becomes attractive if we ever add hypothetical “what-if-this-header” queries.
Active probe injection (ATPG)
l2trace is passive-collection-only today. Active validation is on the roadmap but deliberately deferred: passive collection already produces useful observability, and the engineering cost of probe injection (vendor-specific test-packet generation, sFlow/ERSPAN tap coordination, fault-localization algorithm) is substantial.
Header-space algebra (HSA, NetPlumber)
The transfer-function model in HSA + NetPlumber maps cleanly onto L2 forwarding (each CAM lookup IS a transfer function), but the header-space arithmetic adds complexity beyond what L2 path queries need. We sketched HSA-style slice-leakage checks (“find frames in VLAN A that ever reach a port in VLAN B”) as a future feature, not something we built.
What we sketched but haven’t built
The lit review extracted 32 actionable features (F1–F32) across the codebase. The actionable ones not yet shipped, grouped by motivation:
Multi-fabric collection (Tier 3)
- SPB/IS-IS LSDB collector for SPBM cores. In SPBM, customer MACs do NOT appear in transit-switch CAM tables; the forwarding state lives in the IS-IS link-state database. Our gNMI collector would need to read OpenConfig’s IS-IS LSDB models.
- TRILL ESADI collector for RBridge campuses. ESADI announces attached customer MACs into the IS-IS LSDB rather than relying on passive learning. Schema would gain a trill_link entity.
- PortLand PMAC decoder for fabrics using location-encoded pseudo-MACs. Decode pod.position.port.vmid directly; no CAM lookup needed. Niche (mostly hyperscaler-internal).
- STP-Topology-Change-driven recorded_during retraction. When a TCN BPDU is observed on a switch, close recorded_during on affected CAM rows and force re-collection. For STP-based fabrics under topology change this is required for correctness, not optional. Currently the reconciler doesn’t model TCN events as bulk-retraction triggers.
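For the PMAC item in the list above, decoding is plain bit slicing. The sketch assumes the field widths given in the PortLand paper (16-bit pod, 8-bit position, 8-bit port, 16-bit vmid, packed big-endian into the 48-bit address); the `PMAC` type and `decode_pmac` function are hypothetical names.

```python
from typing import NamedTuple


class PMAC(NamedTuple):
    pod: int       # 16 bits: pod of the edge switch
    position: int  # 8 bits: switch position within the pod
    port: int      # 8 bits: switch-local port facing the host
    vmid: int      # 16 bits: distinguishes VMs behind the same port


def decode_pmac(pmac: str) -> PMAC:
    """Split a 48-bit pseudo-MAC into pod.position.port.vmid."""
    octets = bytes.fromhex(pmac.replace(":", "").replace("-", ""))
    if len(octets) != 6:
        raise ValueError(f"expected 6 octets, got {len(octets)}")
    return PMAC(
        pod=int.from_bytes(octets[0:2], "big"),
        position=octets[2],
        port=octets[3],
        vmid=int.from_bytes(octets[4:6], "big"),
    )


location = decode_pmac("00:02:01:03:00:01")
```

Because the location is encoded in the address itself, a collector for such a fabric could place a host without consulting any CAM table.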
Verification + belief checking (Tier 2)
- Belief-template policy DSL for L2 fabric. NoD’s five templates (Protection Sets, Reachable Sets, Reachability Consistency, Middlebox Processing, Locality) all have L2-specific analogs. Could be a YAML DSL parsed into recursive CTEs.
- Differential reachability between two timestamps. “Did MAC A reach MAC B via the same path at T1 as at T2?” naturally expressible as a self-join on the bitemporal table. Easy win from existing storage.
- Equivalence-class partitioning (VeriFlow). For very large fabrics, group mac_observation rows by (vlan, port) and iterate over ECs for belief checks. This is the difference between feasible and intractable for 10K+ MACs per VLAN.
- Incremental verification on CAM updates (NetPlumber). Only re-evaluate belief checks whose plumbing graph touches a newly-arrived mac_observation row.
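The equivalence-class item in the list above can be sketched as a group-by followed by one check per class. The tuple layout and function names are assumptions for illustration, and the belief shown is a toy predicate, not one of NoD’s templates.

```python
from collections import defaultdict


def partition_into_classes(observations):
    """Group (vlan, port, mac) observations into classes keyed by (vlan, port)."""
    classes = defaultdict(set)
    for vlan, port, mac in observations:
        classes[(vlan, port)].add(mac)
    return dict(classes)


def check_per_class(classes, belief):
    """Evaluate `belief(vlan, port)` once per class instead of once per MAC."""
    return {key: belief(*key) for key in classes}


observations = [
    (10, "Eth1", "aa:00"), (10, "Eth1", "aa:01"), (10, "Eth1", "aa:02"),
    (20, "Eth7", "bb:00"), (20, "Eth7", "bb:01"),
]
classes = partition_into_classes(observations)

calls = []


def vlan10_stays_off_eth7(vlan, port):
    # Toy belief: VLAN 10 traffic must never be learned on Eth7.
    calls.append((vlan, port))
    return not (vlan == 10 and port == "Eth7")


results = check_per_class(classes, vlan10_stays_off_eth7)
```

Five observations collapse into two classes, so the belief runs twice. The saving grows with the number of MACs that share a (vlan, port) pair.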
Active validation (Tier 2 + Tier 3)
- ATPG-style minimum-test-packet generation + fault localization. The canonical prior art for active mode. Tracked on the roadmap.
- ARP-Path passive trace collector. For All-Path-bridging deployments (rare), the ARP exchange itself encodes the chosen path. Passive sFlow/ERSPAN observer reconstructs path info without injection.
Collector ergonomics (Tier 1)
- Virtual-switch placeholders for SNMP-denied or LLDP-silent gear. When a hop in the L2 path corresponds to a device the collectors couldn’t access, emit an unknown_node vertex rather than failing the traceroute. Makes the tool useful in messy real networks instead of brittle.
- Per-(port, vlan) STP-state collection. Already in the data model; collection is the missing piece. High priority because the traceroute CTE already filters on stp_state, and until collection ships that filter operates on stale or empty data.
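The placeholder idea from the first item above might look like the following hop-by-hop walk. The `forwarding` mapping, the vertex tuples, and the `(path, complete)` return shape are all assumptions for the sketch; only the unknown_node vertex name comes from the text.

```python
def trace_with_placeholders(forwarding, start, dest_mac, max_hops=16):
    """Walk toward `dest_mac`, emitting an unknown_node vertex for unreadable hops.

    forwarding: dict of device -> {mac: next_device}, where next_device is None
    when the MAC sits on an edge port. Devices the collectors could not read
    are simply absent from the dict.
    Returns (path, complete), where path is a list of (kind, name) vertices.
    """
    path = []
    current = start
    for _ in range(max_hops):
        table = forwarding.get(current)
        if table is None:
            # No data for this device: record a placeholder instead of raising.
            path.append(("unknown_node", current))
            return path, False
        path.append(("device", current))
        if dest_mac not in table:
            return path, False  # MAC not learned here; the trail goes cold
        next_device = table[dest_mac]
        if next_device is None:
            return path, True   # reached the edge port
        current = next_device
    return path, False          # hop cap hit, likely a loop in the data


partial = {"sw1": {"aa:bb": "sw2"}, "sw3": {"aa:bb": None}}  # sw2 was SNMP-denied
full = {**partial, "sw2": {"aa:bb": "sw3"}}

partial_path, partial_ok = trace_with_placeholders(partial, "sw1", "aa:bb")
full_path, full_ok = trace_with_placeholders(full, "sw1", "aa:bb")
```

The partial trace still tells the operator where visibility ends, which is more useful than an error with no path at all.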
Why we organize by what we shipped, not by what we could
l2trace’s design philosophy is that the bitemporal storage layer is the foundational contribution, and every Tier-2 verification idea becomes more powerful when it consumes bitemporal data instead of a static snapshot. So instead of racing to implement every F-item, the shipping priority has been:
- Get the bitemporal store + collectors + reconciler solid.
- Build the operator-facing queries that exercise the temporal axis (TRACE / HISTORY / OPS / AUDIT).
- Address the headline operational gotchas (MLAG-collapse, cross-source disagreement, peer resolution).
- Then layer the higher-level verification / belief-checking / active-validation features on top of a working store.
The unimplemented features in this section aren’t “we don’t know how to build them” — they’re “the store has to be right first, and each additional feature should justify its complexity against a real operator pain point we’ve seen.”
What this means for a research paper
If l2trace ever becomes the subject of an academic paper, the lit review’s recommendation is NSDI as the publication target. Anteater (SIGCOMM ’11), VeriFlow (NSDI ’13), NetPlumber (NSDI ’13), and NoD (NSDI ’15) form a clear “data-plane verification” lineage, and NSDI’s reviewers will already have the priors for evaluating l2trace as the temporal-storage member of that family.
The evaluation requirements extrapolated from prior-art patterns:
1. Production deployment with at least one named operational incident l2trace caught (matches Anteater’s UIUC bugs, NoD’s Singapore data center).
2. Parameterized random-topology simulation with bug-detection rates across (network size, observation density, timestamp skew), following Bejerano-style methodology.
3. Random-sample-of-real-bugs survey à la Anteater §5.2 (78 Quagga bugs, 86% detectable).
4. Performance evaluation at hyperscaler scale: query latency on bitemporal tables of 100M–1B rows, vacuum/partition behavior.
5. Direct comparison with at least one prior tool, likely NoD or Anteater, showing the temporal angle is genuinely additive.
The headline number for that paper would come from #3 — pick a public bug archive (Quagga, FRR, recent OpenConfig issues), classify which are bitemporal-detectable, report the percentage. This is the single highest-value pre-submission engineering item from the synthesis.
See also
- Prior art tiers 1–4: per-paper summaries
- Why bitemporal?: the design rationale this synthesis draws on
- How MLAG-collapsed traceroute works: one place we shipped a fix the prior art doesn’t address
- Bibliography: full DOI lookups