Prior-art synthesis — what we built vs what we could

The prior-art reference section summarizes 13 papers across 4 tiers. This page is the cross-cutting synthesis: which of those ideas l2trace embraced, which we deliberately sidestepped, and which we sketched but haven’t built yet.

l2trace’s headline architectural choice is bitemporal storage of L2 forwarding state — valid_during × recorded_during per row in mac_observation and adjacency. The lit review’s tightest version of why that’s a contribution:

No published work does bitemporal historical reconstruction of L2 forwarding state from passively collected CAM/LLDP data, despite multiple authors explicitly identifying the static-snapshot assumption as a fundamental limitation.

Tier-2 evidence for the second clause:

  • Mai 2011, §6 (Anteater): measures Abilene FIB-change rate and proposes “consistent snapshot collection + SNMP-trap retry.” That IS time-aware — but unitemporal.
  • Lopes 2015, §1 (NoD) opens the paper with: “existing network verification techniques assume the network is static and operate on a static snapshot of the forwarding state.”
  • HSA, NetPlumber, VeriFlow all operate on a single snapshot of the current forwarding state and don’t retain prior states.

The closest neighbor is Anteater’s “consistent snapshot collection,” which is unitemporal — l2trace’s bitemporal model is a strictly stronger structural commitment.

From Tier 4 (Snodgrass 2000): the whole storage vocabulary

l2trace did NOT re-coin temporal-database terminology. valid_during and recorded_during are exactly the 1998 consensus formulation, and the bitemporal-write-with-late-arrival algorithm is the canonical sequenced retraction Snodgrass works through in Chapter 10. The mortgage-bank Prop_Owner case study is structurally the same as mac_observation.

The honest framing: l2trace is not novel in its storage model. It’s novel in applying the 1998-consensus bitemporal model to L2 forwarding paths.
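To make the late-arrival write concrete, here is a minimal Python sketch in the spirit of a sequenced bitemporal update. It is an illustration under stated assumptions, not l2trace's reconciler: the `MacObservation` class, the `record` and `port_as_of` helpers, and the use of plain numbers for timestamps are all hypothetical, and the two ranges valid_during and recorded_during are flattened into from/to pairs with an exclusive upper bound.

```python
from dataclasses import dataclass, replace
from math import inf


@dataclass(frozen=True)
class MacObservation:
    # Hypothetical flattened row; the real table stores two ranges.
    device: str
    vlan: int
    mac: str
    port: str
    valid_from: float
    valid_to: float           # exclusive; inf = open-ended
    recorded_from: float
    recorded_to: float = inf  # exclusive; inf = still believed


def record(rows, new, now):
    """Apply a (possibly late-arriving) observation at record time `now`.

    Nothing is deleted: contradicted beliefs are closed on the recorded
    axis, and the parts of them that still hold are re-asserted.
    """
    out = []
    for r in rows:
        same_key = (r.device, r.vlan, r.mac) == (new.device, new.vlan, new.mac)
        current = r.recorded_to == inf
        overlaps = r.valid_from < new.valid_to and new.valid_from < r.valid_to
        if not (same_key and current and overlaps):
            out.append(r)
            continue
        # Close the old belief on the recorded axis.
        out.append(replace(r, recorded_to=now))
        # Re-assert the valid-time remnants the new fact does not cover.
        if r.valid_from < new.valid_from:
            out.append(replace(r, valid_to=new.valid_from, recorded_from=now))
        if new.valid_to < r.valid_to:
            out.append(replace(r, valid_from=new.valid_to, recorded_from=now))
    out.append(replace(new, recorded_from=now, recorded_to=inf))
    return out


def port_as_of(rows, device, vlan, mac, valid_at, recorded_at):
    """Which port did we believe, as of `recorded_at`, held at `valid_at`?"""
    for r in rows:
        if ((r.device, r.vlan, r.mac) == (device, vlan, mac)
                and r.valid_from <= valid_at < r.valid_to
                and r.recorded_from <= recorded_at < r.recorded_to):
            return r.port
    return None
```

A short usage example: a MAC is first recorded on Eth5 from t=100 onward, then at t=300 a late report arrives saying it was actually on Eth7 during [150, 200).

```python
rows = record([], MacObservation("sw2", 10, "aa", "Eth5", 100, inf, 0), now=100)
rows = record(rows, MacObservation("sw2", 10, "aa", "Eth7", 150, 200, 0), now=300)
port_as_of(rows, "sw2", 10, "aa", valid_at=160, recorded_at=250)  # what we believed before the correction
port_as_of(rows, "sw2", 10, "aa", valid_at=160, recorded_at=300)  # what we believe after it
```

Keeping the closed row is what lets the first query still answer with the pre-correction belief.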

From Tier 1 (NetInventory): system architecture, not algorithms

Breitbart 2004’s Direct Connection Theorem is now obsolete (LLDP replaces AFT-intersection inference), but NetInventory’s system architecture — timestamped observations, separated Resource Discovery / Resource Monitoring, retained historical views — is what l2trace’s collector + reconciler layout actually looks like. NetInventory’s storage was unitemporal; l2trace adds the second axis.

From Tier 3 (Perlman 1985 + 802.1Q): STP-state-aware path filtering

The L2 traceroute recursive CTE filters stp_state = 'forwarding' per (port, vlan). That’s the output of Perlman’s algorithm (blocking edges should never carry data-plane traffic), recorded per-observation in our bitemporal log so we can answer “was Eth5 forwarding at 14:42?” The bitemporal angle is the part Perlman explicitly omitted: her algorithm preserves no history of prior trees.
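The shape of such a filter can be sketched with SQLite from Python. This is an illustration only: the three-table schema below is a simplified assumption (temporal columns are dropped so the STP predicate stays visible), and `stp_port_state`, `build_demo_db`, and `l2_trace` are hypothetical names, not l2trace's actual schema or API.

```python
import sqlite3

# Hypothetical, simplified schema; the real tables also carry
# valid_during / recorded_during.
SCHEMA = """
CREATE TABLE mac_observation (device TEXT, vlan INTEGER, mac TEXT, port TEXT);
CREATE TABLE adjacency (device TEXT, port TEXT, peer_device TEXT);
CREATE TABLE stp_port_state (device TEXT, port TEXT, vlan INTEGER, stp_state TEXT);
"""

# Follow the destination MAC hop by hop: CAM lookup -> egress port ->
# adjacency -> next device, keeping only ports that are STP-forwarding.
TRACE_SQL = """
WITH RECURSIVE hop(n, device, port) AS (
    SELECT 0, m.device, m.port
    FROM mac_observation AS m
    JOIN stp_port_state AS s
      ON s.device = m.device AND s.port = m.port AND s.vlan = m.vlan
    WHERE m.device = :start AND m.vlan = :vlan AND m.mac = :mac
      AND s.stp_state = 'forwarding'
    UNION ALL
    SELECT h.n + 1, m.device, m.port
    FROM hop AS h
    JOIN adjacency AS a ON a.device = h.device AND a.port = h.port
    JOIN mac_observation AS m
      ON m.device = a.peer_device AND m.vlan = :vlan AND m.mac = :mac
    JOIN stp_port_state AS s
      ON s.device = m.device AND s.port = m.port AND s.vlan = m.vlan
    WHERE s.stp_state = 'forwarding' AND h.n < 32  -- hop cap guards against loops
)
SELECT n, device, port FROM hop ORDER BY n
"""

DEMO_MAC = "aa:bb:cc:00:00:01"


def build_demo_db(blocked=()):
    """Three switches in a line (sw1 -> sw2 -> sw3), all in VLAN 10."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    learned = [("sw1", "Eth1"), ("sw2", "Eth5"), ("sw3", "Eth9")]
    conn.executemany(
        "INSERT INTO mac_observation VALUES (?, 10, ?, ?)",
        [(dev, DEMO_MAC, port) for dev, port in learned],
    )
    conn.executemany(
        "INSERT INTO adjacency VALUES (?, ?, ?)",
        [("sw1", "Eth1", "sw2"), ("sw2", "Eth5", "sw3")],
    )
    conn.executemany(
        "INSERT INTO stp_port_state VALUES (?, ?, 10, ?)",
        [(dev, port, "blocking" if (dev, port) in blocked else "forwarding")
         for dev, port in learned],
    )
    return conn


def l2_trace(conn, start, vlan, mac):
    """Return [(hop_index, device, egress_port), ...] toward `mac`."""
    params = {"start": start, "vlan": vlan, "mac": mac}
    return conn.execute(TRACE_SQL, params).fetchall()
```

With every port forwarding, `l2_trace(build_demo_db(), "sw1", 10, DEMO_MAC)` walks all three switches; marking `("sw2", "Eth5")` as blocked truncates the trace after sw1, which is the behavior the STP predicate is meant to enforce.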

From Tier 3 (MLAG/vPC): per-group peer-link filter

The MLAG-collapse feature rests on the conclusion that peer-link adjacencies between MLAG peers are not data-plane hops under steady-state operation. The filter applied in the traceroute CTE and the audit query encodes that conclusion. See How MLAG-collapsed traceroute works.
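As a rough illustration of a per-group peer-link filter, the sketch below drops any adjacency whose two endpoints belong to the same MLAG group. The `Adjacency` class, the `mlag_group` mapping (device name to group id), and `drop_peer_links` are hypothetical names chosen for this example; the real filter lives in SQL and may key on different columns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Adjacency:
    # Hypothetical minimal adjacency record.
    device: str
    port: str
    peer_device: str


def drop_peer_links(adjacencies, mlag_group):
    """Keep only adjacencies that can be data-plane hops in steady state.

    `mlag_group` maps a device name to its MLAG group id (assumed input).
    An adjacency whose two ends share a group is treated as a peer link.
    """
    kept = []
    for adj in adjacencies:
        local = mlag_group.get(adj.device)
        remote = mlag_group.get(adj.peer_device)
        if local is not None and local == remote:
            continue  # peer link between MLAG peers: not a traceroute hop
        kept.append(adj)
    return kept
```

Devices that are not in any MLAG group are never filtered, so the function is a no-op on fabrics without MLAG.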

From Tier 2 (multiple papers): the “operator beliefs are incomplete” observation

Anteater’s Quagga-bug study found 13 stale config remnants dating back to 2008 — operators forget about past decisions. NoD’s central “belief refutation” framing validates the same pattern at Microsoft scale. l2trace’s adjacency audit and disagreement view are operator-belief-refutation tools in this lineage; they assume the operator thinks the fabric is healthy and surface counterexamples.

AFT-intersection topology inference (Tier 1)

Breitbart, Lowekamp, and Bejerano all derived adjacencies from CAM table set-arithmetic because LLDP wasn’t deployed in 2001–2009. By 2026 LLDP is essentially universal, so we read the adjacency directly from /lldp/... (gNMI) or LLDP-MIB (SNMP). The peer-resolution mechanism handles the cross-device join via chassis_id. No CAM-intersection inference.
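The cross-device join via chassis_id can be pictured as a dictionary lookup. The sketch below is an assumption-laden illustration, not the actual peer-resolution code: it assumes chassis IDs are MAC-format strings that may differ only in case and separators, and the names `normalize_chassis_id`, `resolve_peers`, and the `inventory` mapping are hypothetical.

```python
def normalize_chassis_id(raw):
    """Reduce a MAC-style chassis ID to 12 lowercase hex digits.

    Assumes the MAC-address chassis-ID subtype; other LLDP subtypes
    would need their own handling.
    """
    return "".join(ch for ch in raw.lower() if ch in "0123456789abcdef")


def resolve_peers(lldp_neighbors, inventory):
    """Join LLDP neighbor rows to known devices by chassis ID.

    lldp_neighbors: iterable of (local_device, local_port, remote_chassis_id)
    inventory:      dict of device name -> that device's own chassis ID
    Returns (resolved, unresolved), where resolved rows carry the peer's name.
    """
    by_chassis = {normalize_chassis_id(cid): name for name, cid in inventory.items()}
    resolved, unresolved = [], []
    for local_device, local_port, remote_chassis in lldp_neighbors:
        peer = by_chassis.get(normalize_chassis_id(remote_chassis))
        if peer is None:
            unresolved.append((local_device, local_port, remote_chassis))
        else:
            resolved.append((local_device, local_port, peer))
    return resolved, unresolved
```

Returning the unresolved rows separately, rather than dropping them, keeps neighbors outside the inventory visible to whatever audits run downstream.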

SAT translation for path queries (Anteater)

Anteater’s SAT translation handles arbitrary header transformations (NAT, MPLS swap, QoS marking). l2trace’s recursive CTE handles L2 forwarding without rewrites — VLAN-tag-aware but no MPLS-style label swap. The CTE is cheaper, simpler, and sufficient for the problem domain. SAT becomes attractive if we ever add hypothetical “what-if-this-header” queries.

l2trace is passive-collection-only today. Active validation is on the roadmap but deliberately deferred — passive collection already produces useful observability, and the engineering cost of probe injection (vendor-specific test-packet generation, sFlow/ERSPAN tap coordination, fault-localization algorithm) is substantial.

The transfer-function model in HSA + NetPlumber maps cleanly onto L2 forwarding (each CAM lookup IS a transfer function), but the header-space arithmetic adds complexity beyond what L2 path queries need. We sketched HSA-style slice-leakage checks (“find frames in VLAN A that ever reach a port in VLAN B”) as a future feature, not something we built.

The lit review extracted 32 actionable features (F1–F32) across the codebase. The actionable ones not yet shipped, grouped by motivation:

  • SPB/IS-IS LSDB collector for SPBM cores. In SPBM, customer MACs do NOT appear in transit-switch CAM tables — the forwarding state lives in the IS-IS link-state database. Our gNMI collector would need to read OpenConfig’s IS-IS LSDB models.
  • TRILL ESADI collector for RBridge campuses. ESADI announces attached customer MACs into the IS-IS LSDB rather than relying on passive learning. Schema would gain a trill_link entity.
  • PortLand PMAC decoder for fabrics using location-encoded pseudo-MACs. Decode pod.position.port.vmid directly; no CAM lookup needed. Niche (mostly hyperscaler-internal).
  • STP-Topology-Change-driven recorded_during retraction. When a TCN BPDU is observed on a switch, close recorded_during on affected CAM rows and force re-collection. This is a correctness requirement, not an optional extra, for STP-based fabrics under topology change. Currently the reconciler doesn’t model TCN events as bulk-retraction triggers.
  • Belief-template policy DSL for L2 fabric. NoD’s five templates (Protection Sets, Reachable Sets, Reachability Consistency, Middlebox Processing, Locality) all have L2-specific analogs. Could be a YAML DSL parsed into recursive CTEs.
  • Differential reachability between two timestamps. “Did MAC A reach MAC B via the same path at T1 as at T2?” naturally expressible as a self-join on the bitemporal table. Easy win from existing storage.
  • Equivalence-class partitioning (VeriFlow). For very large fabrics, group mac_observation rows by (vlan, port) and iterate over ECs for belief checks. Difference between feasible and intractable for 10K+ MACs per VLAN.
  • Incremental verification on CAM updates (NetPlumber). Only re-evaluate belief checks whose plumbing graph touches a newly arrived mac_observation row.
  • ATPG-style minimum-test-packet generation + fault localization. The canonical prior art for active mode. Tracked on the roadmap.
  • ARP-Path passive trace collector. For All-Path-bridging deployments (rare), the ARP exchange itself encodes the chosen path. Passive sFlow/ERSPAN observer reconstructs path info without injection.
  • Virtual-switch placeholders for SNMP-denied or LLDP-silent gear. When a hop in the L2 path corresponds to a device the collectors couldn’t access, emit an unknown_node vertex rather than failing the traceroute. Makes the tool useful in messy real networks instead of brittle.
  • Per-(port, vlan) STP-state collection. Already in the data model; collection is the missing piece. High priority because the traceroute CTE already filters on stp_state — without active collection the filter operates on stale/empty data.
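Of the items above, differential reachability is the one that needs no new collection, so it is the easiest to sketch. The Python below is a simplified stand-in for the self-join: it compares, per device, the port on which a destination MAC was observed at two valid-time instants. It ignores the recorded axis (only current beliefs are assumed to be passed in) and uses per-device egress port as a proxy for "same path"; the `Obs` class and both helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obs:
    # Hypothetical valid-time-only view of a mac_observation row.
    device: str
    mac: str
    port: str
    valid_from: int
    valid_to: int  # exclusive


def ports_at(rows, mac, t):
    """Map device -> port on which `mac` was observed at valid time `t`."""
    return {
        r.device: r.port
        for r in rows
        if r.mac == mac and r.valid_from <= t < r.valid_to
    }


def diff_reachability(rows, mac, t1, t2):
    """Return {device: (port_at_t1, port_at_t2)} for devices that differ.

    None on either side means the MAC was not observed on that device
    at that instant.
    """
    before = ports_at(rows, mac, t1)
    after = ports_at(rows, mac, t2)
    return {
        dev: (before.get(dev), after.get(dev))
        for dev in sorted(before.keys() | after.keys())
        if before.get(dev) != after.get(dev)
    }
```

An empty result means every device that saw the MAC used the same port at both instants; any entry points at the hop where the path changed.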

Why we organize by what we shipped, not by what we could

l2trace’s design philosophy is that the bitemporal storage layer is the foundational contribution, and every Tier-2 verification idea becomes more powerful when it consumes bitemporal data instead of a static snapshot. So instead of racing to implement every F-item, the shipping priority has been:

  1. Get the bitemporal store + collectors + reconciler solid.
  2. Build the operator-facing queries that exercise the temporal axis (TRACE / HISTORY / OPS / AUDIT).
  3. Address the headline operational gotchas (MLAG-collapse, cross-source disagreement, peer resolution).
  4. Then layer the higher-level verification / belief-checking / active-validation features on top of a working store.

The unimplemented features in this section aren’t “we don’t know how to build them” — they’re “the store has to be right first, and each additional feature should justify its complexity against a real operator pain point we’ve seen.”

If l2trace ever becomes the subject of an academic paper, the lit review’s recommendation is NSDI as the publication target. Anteater (SIGCOMM ’11), VeriFlow (NSDI ’13), NetPlumber (NSDI ’13), and NoD (NSDI ’15) form a clear “data-plane verification” lineage, and NSDI’s reviewers will already have the priors for evaluating l2trace as the temporal-storage member of that family.

The evaluation requirements extrapolated from prior-art patterns:

  1. Production deployment with at least one named operational incident l2trace caught (matches Anteater’s UIUC bugs, NoD’s Singapore data center).
  2. Parameterized random-topology simulation with bug-detection rates across (network size, observation density, timestamp skew) — Bejerano-style methodology.
  3. Random-sample-of-real-bugs survey à la Anteater §5.2 (78 Quagga bugs, 86% detectable).
  4. Performance evaluation at hyperscaler scale: query latency on bitemporal tables of 100M–1B rows, vacuum/partition behavior.
  5. Direct comparison with at least one prior tool — likely NoD or Anteater — showing the temporal angle is genuinely additive.

The headline number for that paper would come from #3 — pick a public bug archive (Quagga, FRR, recent OpenConfig issues), classify which are bitemporal-detectable, report the percentage. This is the single highest-value pre-submission engineering item from the synthesis.