
Scale envelope

The numbers below come from make scale-check against the demo compose stack (Postgres 16, single host, no tuning). The seed builds a 30-device fabric (20 access + 10 core), 240 ports (an average of eight per device), and 20,000 mac_observation rows (10,000 host CAM entries + 10,000 trunk-learned on the core side).

| query | rows returned | latency_ms |
| --- | ---: | ---: |
| live_fdb() | 20,001 | 63.0 |
| disagreements() | 0 | 21.9 |
| mac_history (1 MAC) | 2 | 2.9 |
| traceroute (3 hops, reached) | 3 | 9.8 |
| bulk_lookup_ouis (100 MACs) | 100 | 1.7 |

All queries land under 70 ms. The biggest is live_fdb(), which has to group and aggregate every currently-open observation. The smallest are the targeted lookups (history for one MAC, bulk OUI lookup), which resolve with direct B-tree probes.

EXPLAIN ANALYZE on live_fdb():

```
GroupAggregate
  -> Incremental Sort (Presorted Key: device_id)
       -> Merge Join (mac_observation × device)
            -> Nested Loop (mac_observation × port via Memoize cache)
                 -> Index Scan using mac_obs_open_per_source on mac_observation
                    (rows=20001, loops=1)
                 -> Memoize (Hits: 19860, Misses: 141, Memory: 17 kB)
```

Two things worth knowing:

  1. The partial index mac_obs_open_per_source (created in alembic 0001) drives the scan. It's defined on (device_id, mac, vlan, source) WHERE upper_inf(valid_during) AND upper_inf(recorded_during), which is exactly the predicate live_fdb() filters by, so the query never touches the closed-row part of the table (see the DDL sketch after this list).

  2. The port Memoize cache absorbs 99.3% of port lookups (19,860 of 20,001 are cache hits). That means we physically index-scan port only 141 times even though we look up 20,001 port_ids.
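
For reference, here is a minimal sketch of that partial index as the prose describes it. The actual alembic 0001 revision may differ in detail; the revision identifiers here are placeholders:

```python
# Sketch only: reconstructs mac_obs_open_per_source from the description
# above. Revision IDs are placeholders, not the real 0001 migration file.
from alembic import op

revision = "0001"
down_revision = None

def upgrade() -> None:
    # Partial index over open rows only: upper_inf() is true while the
    # range's upper bound is unbounded, i.e. the observation is still open.
    op.execute(
        """
        CREATE INDEX mac_obs_open_per_source
            ON mac_observation (device_id, mac, vlan, source)
         WHERE upper_inf(valid_during) AND upper_inf(recorded_during)
        """
    )

def downgrade() -> None:
    op.execute("DROP INDEX mac_obs_open_per_source")
```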

PostgreSQL's row estimate for that index scan is 2,300 against 20,001 actual rows, roughly a 9x undercount. It doesn't matter for the chosen plan (which is good), but if a future workload depends on the join order changing under different cardinalities, running ANALYZE mac_observation is the right fix.
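
If you want to script that, a hypothetical helper (assuming psycopg 3; the project may use a different driver) would look like:

```python
# Hypothetical helper, assuming psycopg 3 -- the project may wire this up
# differently. ANALYZE refreshes planner statistics for mac_observation so
# the row estimate tracks the actual open-row count.
import psycopg

def refresh_mac_observation_stats(dsn: str) -> None:
    with psycopg.connect(dsn, autocommit=True) as conn:
        conn.execute("ANALYZE mac_observation")
```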

At 20k rows the bottleneck is not the SQL — it’s the TUI render. The OPS FDB tree builds an in-memory Tree widget with one leaf per (mac, device, port). At 20k entries the tree is ~30 megabytes of Python objects and the initial paint is visibly chunky.

Two reasonable mitigations when fabrics get this big:

  • Collapse port nodes by default above some threshold. ✅ Implemented in OpsScreen._render_fdb: FDB_PORT_AUTO_EXPAND_THRESHOLD = 50 and FDB_DEVICE_AUTO_EXPAND_THRESHOLD = 500. Above those, the node renders with a (▶ to expand) hint and stays collapsed on first paint; operators press Right-arrow / Space to drill in. The demo data is small enough that everything auto-expands, but a 50k-MAC fabric paints fast because most of the tree is collapsed. See the sketch after this list.
  • Surface the MAC list as a DataTable instead of a Tree for the flat-listing use case. Trees shine for hierarchy; tables are denser and easier to virtualize. Not yet implemented — would benefit an operator who wants to scan all 50k MACs at once. Flag for a future session.
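
A minimal sketch of the threshold-gated expansion, assuming Textual's Tree widget; the function name and data shape are illustrative, not the actual OpsScreen code:

```python
# Illustrative sketch, not the real OpsScreen._render_fdb. Assumes the FDB
# arrives as device -> port -> list of MACs; the real data shape may differ.
from textual.widgets import Tree

FDB_PORT_AUTO_EXPAND_THRESHOLD = 50
FDB_DEVICE_AUTO_EXPAND_THRESHOLD = 500

def build_fdb_tree(fdb: dict[str, dict[str, list[str]]]) -> Tree:
    tree: Tree[str] = Tree("FDB")
    for device, ports in fdb.items():
        device_macs = sum(len(macs) for macs in ports.values())
        expand_dev = device_macs <= FDB_DEVICE_AUTO_EXPAND_THRESHOLD
        dev_label = device if expand_dev else f"{device} (▶ to expand)"
        dev_node = tree.root.add(dev_label, expand=expand_dev)
        for port, macs in ports.items():
            expand_port = len(macs) <= FDB_PORT_AUTO_EXPAND_THRESHOLD
            port_label = port if expand_port else f"{port} (▶ to expand)"
            port_node = dev_node.add(port_label, expand=expand_port)
            for mac in macs:
                port_node.add_leaf(mac)  # one leaf per (mac, device, port)
    return tree
```

The key point is that collapsed nodes still get their children attached; only the first paint is cheap, because Textual skips rendering lines inside collapsed subtrees.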
```sh
make seed-realistic   # OR
make scale-check      # seeds 20k rows + prints this table + EXPLAIN
```

The seed routes both sw-a-01 and sw-a-11 to sw-c-01 (round-robin), which is why the scale check’s chosen trace pair walks sw-a-01 → sw-c-01 → sw-a-11 — a real 3-hop path within one core, no core-to-core trunk learning required.
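
To make the round-robin concrete, a hypothetical illustration (device names follow the prose; the seed's actual assignment code may differ):

```python
# Hypothetical illustration of round-robin uplink assignment; the seed's
# real code may differ. Access switch i uplinks to core i % len(cores),
# so sw-a-01 (index 0) and sw-a-11 (index 10) both land on sw-c-01.
cores = [f"sw-c-{i:02d}" for i in range(1, 11)]   # 10 core switches
access = [f"sw-a-{i:02d}" for i in range(1, 21)]  # 20 access switches
uplinks = {sw: cores[i % len(cores)] for i, sw in enumerate(access)}
assert uplinks["sw-a-01"] == uplinks["sw-a-11"] == "sw-c-01"
```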