# Scale envelope
Numbers from `make scale-check` against the demo compose stack
(Postgres 16, single host, no tuning). The seed builds a 30-device
fabric (20 access + 10 core), 240 ports, and 20,000
`mac_observation` rows (10,000 host CAM entries + 10,000 trunk-learned
entries on the core side).
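For orientation, a minimal sketch of that fabric's shape. The counts come from the description above, the `sw-a-NN` / `sw-c-NN` naming from the trace discussion later on this page; the uniform 8-ports-per-device split (240 / 30) is an assumption, and the real seed lives in `scripts/scale_check.py`:

```python
# Illustrative only -- the real seed is scripts/scale_check.py.
# 20 access + 10 core devices; 8 ports per device is an ASSUMPTION
# that merely makes the arithmetic land on the documented 240 ports.
ACCESS_COUNT, CORE_COUNT, PORTS_PER_DEVICE = 20, 10, 8

access = [f"sw-a-{i:02d}" for i in range(1, ACCESS_COUNT + 1)]
core = [f"sw-c-{i:02d}" for i in range(1, CORE_COUNT + 1)]
ports = [(dev, p) for dev in access + core
         for p in range(1, PORTS_PER_DEVICE + 1)]
assert len(ports) == 240
```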
## Query latencies at 20k rows

| query | rows returned | latency_ms |
|---|---|---|
| `live_fdb()` | 20,001 | 63.0 |
| `disagreements()` | 0 | 21.9 |
| `mac_history` (1 MAC) | 2 | 2.9 |
| `traceroute` (3 hops, reached) | 3 | 9.8 |
| `bulk_lookup_ouis` (100 MACs) | 100 | 1.7 |
All queries land under 70 ms. The slowest is `live_fdb()`, which has to
group and aggregate every currently-open observation. The fastest are
the targeted lookups (history for one MAC, OUI bulk lookup), which use
direct B-tree probes.
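To make the contrast concrete, here is a hypothetical sketch of the two query shapes. The real SQL behind `live_fdb()` and `mac_history` lives in the app and will differ; the column names are taken from the partial-index definition quoted in the next section:

```python
# Hypothetical query shapes -- not the app's actual SQL.

# live_fdb(): must visit every open observation, then group/aggregate,
# so latency scales with the number of open rows (20,001 here).
LIVE_FDB_SHAPE = """
SELECT device_id, mac, vlan, array_agg(source) AS sources
FROM   mac_observation
WHERE  upper_inf(valid_during) AND upper_inf(recorded_during)
GROUP  BY device_id, mac, vlan
"""

# mac_history(mac): a direct B-tree probe on one key, so latency stays
# near-constant regardless of table size (2 rows, 2.9 ms above).
MAC_HISTORY_SHAPE = """
SELECT device_id, vlan, source, valid_during
FROM   mac_observation
WHERE  mac = %(mac)s
ORDER  BY lower(valid_during)
"""
```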
## What the query plans tell us

`EXPLAIN ANALYZE` on `live_fdb()`:

```
GroupAggregate
  -> Incremental Sort (Presorted Key: device_id)
     -> Merge Join (mac_observation × device)
        -> Nested Loop (mac_observation × port via Memoize cache)
           -> Index Scan using mac_obs_open_per_source on mac_observation
                (rows=20001, loops=1)
           -> Memoize (Hits: 19860, Misses: 141, Memory: 17 kB)
```

Two things worth knowing:
- The partial index `mac_obs_open_per_source` (created in alembic 0001)
  drives the scan. It's defined on `(device_id, mac, vlan, source) WHERE
  upper_inf(valid_during) AND upper_inf(recorded_during)`, which is exactly
  the predicate `live_fdb()` filters by, so the query never touches the
  closed-row part of the table. (A DDL sketch follows this list.)
- The port `Memoize` cache absorbs 99.3% of port lookups (19,860 of
  20,001 are cache hits). That means we only physically index-scan `port`
  141 times even though we look up 20,001 port_ids.
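For reference, here is the index written out as alembic DDL. This is a sketch of how migration 0001 might express it: the column list and `WHERE` clause are verbatim from the bullet above, while `unique=True` is an assumption read off the "open_per_source" name.

```python
# Sketch: the partial index as an alembic operation. Column list and
# WHERE clause are quoted from the text; unique=True is an ASSUMPTION
# based on the index name (one open row per (device, mac, vlan, source)).
import sqlalchemy as sa
from alembic import op

def upgrade() -> None:
    op.create_index(
        "mac_obs_open_per_source",
        "mac_observation",
        ["device_id", "mac", "vlan", "source"],
        unique=True,
        postgresql_where=sa.text(
            "upper_inf(valid_during) AND upper_inf(recorded_during)"
        ),
    )
```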
PostgreSQL's row estimate is 2,300 against an actual 20,001, a roughly
9x undercount. That doesn't matter for the chosen plan (which is good),
but if a future workload depends on the join order changing under
different cardinalities, `ANALYZE mac_observation` is the right thing
to run.
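If that day comes, the fix is a one-liner. A sketch assuming a SQLAlchemy engine (which alembic implies the project has); the DSN is a placeholder:

```python
# Refresh planner statistics so the mac_observation row estimate tracks
# reality. ANALYZE (unlike VACUUM) is allowed inside a transaction block,
# so engine.begin() is fine. The DSN below is hypothetical.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg://user:pass@localhost/demo")
with engine.begin() as conn:
    conn.execute(text("ANALYZE mac_observation"))
```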
## Where the next bottleneck lives

At 20k rows the bottleneck is not the SQL; it's the TUI render.
The OPS FDB tree builds an in-memory `Tree` widget with one leaf per
(mac, device, port). At 20k entries the tree is ~30 megabytes of
Python objects and the initial paint is visibly chunky.
Two reasonable mitigations when fabrics get this big:

- Collapse port nodes by default above some threshold. ✅ Implemented in
  `OpsScreen._render_fdb`: `FDB_PORT_AUTO_EXPAND_THRESHOLD = 50` and
  `FDB_DEVICE_AUTO_EXPAND_THRESHOLD = 500`. Above those, the node renders
  with a `(▶ to expand)` hint and stays collapsed on first paint;
  operators press Right-arrow / Space to drill in. The demo data is small
  enough that everything auto-expands, but a 50k-MAC fabric paints fast
  because most of the tree is collapsed. (A sketch of the pattern follows
  this list.)
- Surface the MAC list as a `DataTable` instead of a `Tree` for the
  flat-listing use case. Trees shine for hierarchy; tables are denser and
  easier to virtualize. Not yet implemented; it would benefit an operator
  who wants to scan all 50k MACs at once. Flag for a future session.
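A minimal sketch of the collapse-by-threshold pattern using Textual's `Tree` API. The constants are the real ones named above, but the node-building shape, the per-node counting, and the `fdb` dict layout are illustrative; the actual `OpsScreen._render_fdb` may count and label differently.

```python
# One plausible reading of the thresholds: a node auto-expands only while
# the MAC count beneath it stays at or under its threshold.
from textual.widgets import Tree

FDB_PORT_AUTO_EXPAND_THRESHOLD = 50
FDB_DEVICE_AUTO_EXPAND_THRESHOLD = 500

def render_fdb(tree: Tree, fdb: dict[str, dict[str, list[str]]]) -> None:
    """fdb maps device -> port -> [mac, ...] (illustrative shape)."""
    for device, ports in fdb.items():
        device_macs = sum(len(macs) for macs in ports.values())
        expand_dev = device_macs <= FDB_DEVICE_AUTO_EXPAND_THRESHOLD
        label = device if expand_dev else f"{device} (▶ to expand)"
        dev_node = tree.root.add(label, expand=expand_dev)
        for port, macs in ports.items():
            expand_port = len(macs) <= FDB_PORT_AUTO_EXPAND_THRESHOLD
            plabel = port if expand_port else f"{port} (▶ to expand)"
            port_node = dev_node.add(plabel, expand=expand_port)
            for mac in macs:
                port_node.add_leaf(mac)
```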
## Reproducing

```
make seed-realistic   # OR
make scale-check      # seeds 20k rows + prints this table + EXPLAIN
```

The seed routes both sw-a-01 and sw-a-11 to sw-c-01 (round-robin),
which is why the scale check's chosen trace pair walks
sw-a-01 → sw-c-01 → sw-a-11: a real 3-hop path within one core, no
core-to-core trunk learning required.
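The round-robin assignment is why those two access switches share a core. A sketch of the arithmetic (list construction is illustrative; the device names are the seed's own):

```python
# Why sw-a-01 and sw-a-11 share a core: with 10 cores assigned round-robin,
# access index 0 and access index 10 both map to cores[10 % 10] == cores[0].
cores = [f"sw-c-{i:02d}" for i in range(1, 11)]
access = [f"sw-a-{i:02d}" for i in range(1, 21)]

uplink = {sw: cores[i % len(cores)] for i, sw in enumerate(access)}
assert uplink["sw-a-01"] == uplink["sw-a-11"] == "sw-c-01"
# Hence the 3-hop trace sw-a-01 -> sw-c-01 -> sw-a-11 with no
# core-to-core trunk learning involved.
```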
## See also

- The seed source: `scripts/scale_check.py`
- Data model reference: the partial-index definitions live in alembic 0001
- The compactor invariant: why liveness rows must be patched after the
  surrounding session commits (not directly perf-related, but the same
  partial-index trick is at play for the compactor's age-out SQL)