V11 Shards-Metadata Query Flow — Mermaid Diagrams

Visual companion to KustoEngineV11ShardsMetadataQueryFlow.md. Paste any block into the Mermaid Live Editor or render in a Mermaid-aware viewer.


1. V11 Data Model — Source of Truth = Shard Groups ⊕ Delta

In V11 a table's shards are no longer a single persisted set. The truth is persisted shard groups (Rust storage) PLUS an in-memory delta of pending attach/detach/drop ops. A periodic drain (~15 min) folds delta into new groups and bumps the version, so between drains the delta holds real, query-visible rows. Everything must be read through SchemaManager, which wraps each DB so reads see baseline+delta. The whole mode is off unless EnableShardsMetadataDelta is set.

[Diagram]
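
The data model in this section can be approximated with a small toy model. This is a sketch under stated assumptions, not the engine's implementation: `ToyTable`, `storage_only`, `combined`, and `drain` are hypothetical names, shards are reduced to string ids, and the drain collapses everything into a single new group for simplicity.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy model: persisted shard groups plus an in-memory delta of pending ops."""
    groups: list = field(default_factory=list)  # persisted baseline: list of sets of shard ids
    delta: list = field(default_factory=list)   # pending (op, shard_id) pairs, oldest first
    version: int = 0

    def storage_only(self) -> set:
        """Raw view: persisted groups only; the delta is never consulted."""
        return set().union(*self.groups)

    def combined(self) -> set:
        """Delta-aware view: replay pending ops on top of the baseline."""
        shards = self.storage_only()
        for op, shard_id in self.delta:
            if op == "attach":
                shards.add(shard_id)
            else:  # "detach" and "drop" both remove the shard from this toy view
                shards.discard(shard_id)
        return shards

    def drain(self) -> None:
        """Fold the delta into new persisted state and bump the version.

        Simplification: the real drain produces new groups; here we collapse
        everything into one group to keep the sketch short.
        """
        merged = self.combined()
        self.groups = [merged] if merged else []
        self.delta.clear()
        self.version += 1
```

Between drains, `combined()` and `storage_only()` disagree whenever the delta is non-empty. That gap is exactly what a storage-only reader misses.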

2. IShardsMap Surface — Delta-Aware vs Storage-Only

IShardsMap already splits its methods into two camps. The combined-view reads (GetTableExtentsMetadata, the ordered variant, and the stats-only GetTableShardsMapSummary) merge storage with delta and are the only sanctioned query reads. The raw methods (GetTableShardGroups, GetShardsMetadataRaw) return storage state with delta never consulted — fine for internal metadata work, wrong for query results. The take bug is simply a query consumer calling into the wrong camp.

[Diagram]

3. Current take Fast Path — the 8-Node Bypass Chain

This is how T | take K actually reaches storage today. The planner pre-materializes shard groups (SetShardGroups), then CreateTrivialLimiterStrategy consumes them. The chain flows planner → prefilter → CreateShardsMetadataFilters → GetTableShardGroups → ShardsMetadataStorage.GetShardGroups. Every hop is storage-only: the in-memory delta is never touched anywhere along this path, which is the structural source of the stale results.

[Diagram]

4. Inside the Fast Path — Decision Tree (and where it goes stale)

Zooming into CreateTrivialLimiterStrategy: it picks the "latest" group by ArgMin(Age), and if K fits that group's TotalRowCount it returns just its newest extent; otherwise it walks all groups by age until the budget is met. The trap: ArgMin(Age) ranks persistence age, so the "newest" group was committed at the last drain — by definition older than anything still in delta. Both return paths are persisted-only, so delta rows are silently dropped.

[Diagram]
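
The decision tree in this section can be modeled as follows. This is a minimal sketch, assuming each group carries an `age` (smaller means more recently persisted) and a newest-first list of extents. `Extent`, `Group`, and `fast_path_take` are hypothetical names and only mirror the logic as described here, not the actual CreateTrivialLimiterStrategy code.

```python
from dataclasses import dataclass


@dataclass
class Extent:
    rows: int


@dataclass
class Group:
    age: int        # persistence age; smaller = committed at a more recent drain
    extents: list   # assumed to be ordered newest first

    @property
    def total_row_count(self) -> int:
        return sum(e.rows for e in self.extents)


def fast_path_take(groups, k):
    """Persisted-only selection: note that no delta input exists at all."""
    if not groups:
        return []
    latest = min(groups, key=lambda g: g.age)      # ArgMin(Age)
    if k <= latest.total_row_count:
        return [latest.extents[0]]                 # only the newest persisted extent
    picked, budget = [], k
    for group in sorted(groups, key=lambda g: g.age):  # walk groups by age
        picked.extend(group.extents)
        budget -= group.total_row_count
        if budget <= 0:
            break
    return picked
```

The function signature makes the problem visible: both return paths are built purely from `groups`, so any extent that exists only in the delta cannot appear in the result, however recent it is.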

5. The Two Root Causes

The bug is not one line — it's two reinforcing flaws. (1) A false V10 invariant: the fast path assumes "having shard groups means having the table", so every signal it trusts (ArgMin(Age), TotalRowCount, the group walk) is storage-only. (2) A filter-contract violation: it asks for one group via ShardGroupIdsFilter, but delta shards have no group id, and the framework's SafeHasBoundedShardIdsFilter only inspects shard ids — so the filter falls through silently and delta is never surfaced.

[Diagram]

6. Redesigned Flow — Branch on IsDeltaEnabled

The fix routes V11+delta away from shard-group state entirely. Step 1: only materialize shard groups when delta is OFF. Step 2: CreateTrivialLimiterStrategy branches on IsDeltaEnabled — the existing fast path stays untouched where it's correct, V11+delta takes a new path. Step 3: the new path streams the delta-aware ordered iterator (delta first, then newest groups), accumulating rows and cold counts, breaking when the budget is met. Step 4: two Ensure guards throw if shard-group access or a ShardGroupIdsFilter ever reappears in delta mode.

[Diagram]
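
The four steps of the redesign can be sketched in one function. This is illustrative only: `ExtentMeta`, `ordered_extents`, `ensure`, and `create_trivial_limiter` are hypothetical stand-ins, "cold counts" is assumed to mean the number of cold extents selected, and the delta-off branch simply delegates to a caller-supplied legacy function.

```python
from typing import Iterator, NamedTuple


class ExtentMeta(NamedTuple):
    shard_id: str
    rows: int
    is_cold: bool


def ordered_extents(delta, groups_newest_first) -> Iterator[ExtentMeta]:
    """Delta-aware ordered iterator: delta extents first, then newest groups."""
    yield from delta
    for group in groups_newest_first:
        yield from group


def ensure(condition: bool, message: str) -> None:
    """Stand-in for an Ensure guard: throw when an invariant is violated."""
    if not condition:
        raise RuntimeError(message)


def create_trivial_limiter(k, is_delta_enabled, ordered,
                           materialized_groups=None,
                           shard_group_ids_filter=None,
                           legacy_fast_path=None):
    if not is_delta_enabled:
        # Delta OFF: the existing fast path stays untouched.
        return legacy_fast_path(materialized_groups, k)
    # Delta ON: shard-group state must never reappear on this path.
    ensure(materialized_groups is None, "shard groups accessed in delta mode")
    ensure(shard_group_ids_filter is None, "ShardGroupIdsFilter used in delta mode")
    picked, rows, cold = [], 0, 0
    for extent in ordered:          # lazy: pulls one extent at a time
        picked.append(extent)
        rows += extent.rows
        cold += 1 if extent.is_cold else 0
        if rows >= k:               # budget met: stop pulling
            break
    return picked, rows, cold
```

For example, with one 3-row delta extent and a newest group holding a 4-row cold extent, `take 5` selects the delta extent first and then the group extent, yielding 7 rows and a cold count of 1.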

7. Cost Model — Why "Groups Touched" Is the Real Unit

A natural worry: does the new loop scan more? No — because the native cache loads whole shard groups on a miss, ignoring the filter and maximumShardCount; those limits only trim the later C# iteration. So the dominant cost is "how many groups did we touch", not "how many rows". A lazy iterator with a caller-side break already stops loading once the budget is met, so the redesign is performance-equivalent without needing a MaximumRowCount push-down into native storage.

[Diagram]
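
The cost argument can be checked with a toy generator that logs every group it loads. This is a sketch, assuming a whole group is loaded the first time any of its extents is requested. `lazy_extents` and `take_rows` are hypothetical names, and extents are reduced to plain row counts.

```python
def lazy_extents(group_store, load_log):
    """Yield extent row counts group by group.

    A group is "loaded" in full the first time the iterator reaches it,
    which models the native cache loading whole shard groups on a miss.
    """
    for group_id, extent_rows in group_store:
        load_log.append(group_id)   # cost is paid per group, not per row
        yield from extent_rows


def take_rows(k, extents):
    """Caller-side break: stop pulling from the iterator once k rows are covered."""
    rows = 0
    for extent_rows in extents:
        rows += extent_rows
        if rows >= k:
            break
    return rows
```

Because the generator is suspended as soon as the caller breaks, later groups are never loaded. In this model the number of entries in `load_log` is the cost unit, and it depends only on how many groups the budget spans.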

8. Approaches Compared (mind-map)

Three designs were weighed. A (recommended) is delta-aware by construction with no new invariants and runtime guards against regressions. B (rejected) gates on the summary, but the summary is stats-only and a correct B just converges to A. C (deferred) synthesizes a fake delta "group", which forces inventing Age/Id/RowCount semantics and re-introduces the very category error A removes — only worth it if a non-take consumer later needs per-group reasoning.

[Diagram]