fix(datasets): avoid exponential blow-up of nested struct sample values #9506
Merged
NarwhalsTableManager.get_sample_values recursively re-stringified nested list/dict cells, causing each ancestor level to re-escape the children's repr. For deeply nested polars Struct/List columns this scaled ~8x per depth and produced multi-GB strings that hung the browser. Replace the recursion with one json.dumps pass, preserving Enum.name handling via a default callback.
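The growth pattern described above can be reproduced with a toy model. The sketch below is not the pre-fix marimo code (that code is not shown on this page); `naive_stringify` and `make_nested` are hypothetical names, and the model assumes only that children are turned into strings before their parent container is stringified. Each parent then has to escape the quotes and backslashes already present in its children, so escape characters multiply with depth, while a single `json.dumps` pass grows linearly.

```python
import json


def naive_stringify(value):
    # Hypothetical model of the old behaviour: children become strings first,
    # then the container holding those strings is stringified again, which
    # re-escapes every quote and backslash produced one level below.
    if isinstance(value, dict):
        return str({k: naive_stringify(v) for k, v in value.items()})
    if isinstance(value, list):
        return str([naive_stringify(v) for v in value])
    return str(value)


def make_nested(depth):
    # Build {"a": {"a": ... "x"}} with `depth` levels of nesting.
    value = "x"
    for _ in range(depth):
        value = {"a": value}
    return value


for depth in (4, 8, 12):
    nested = make_nested(depth)
    # Columns: depth, length of the re-stringified form, length of one json.dumps pass.
    print(depth, len(naive_stringify(nested)), len(json.dumps(nested)))
```

With a single child per level, this toy grows by roughly a factor of two per depth; wider structs and lists would presumably grow faster, which is consistent with the ~8x figure reported for the real columns.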
Contributor
No issues found across 2 files
Architecture diagram
sequenceDiagram
participant UI as Browser/UI
participant TM as get_sample_values()
participant json as json.dumps()
participant Polars as Polars DataFrame
Note over UI,Polars: NEW: Efficient nested struct sampling
UI->>TM: Request sample values for column
TM->>Polars: Access column data (first 3 rows)
Polars-->>TM: Return raw values (may contain deep nested structs)
loop For each sampled value
alt Value is list or dict (nested struct)
TM->>json: json.dumps(value, default=_json_default)
Note over TM,json: Single pass serialization<br/>avoids recursive str() calls
alt Enum encountered in nested structure
json->>json: _json_default(o) → o.name
json-->>TM: Encoded JSON string
else Non-serializable leaf (e.g., datetime)
json->>json: _json_default(o) → str(o)
json-->>TM: Encoded JSON string
else TypeError/ValueError
TM->>TM: Fallback to str(value)
end
else Value is int/float
TM->>TM: Return as numeric
else Value is Enum
TM->>TM: Return .name
else Other scalar (string, bytes, etc.)
TM->>TM: Return str(value)
end
end
TM-->>UI: Return list[str|int|float] (bounded size, JSON-shaped output)
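The branches in the diagram can be sketched in plain Python. This is a minimal illustration, not the code from `narwhals_table.py`: `sample_value` and `Color` are hypothetical names, and only `_json_default` and the `json.dumps(value, default=...)` call are taken from the diagram itself.

```python
import json
from datetime import datetime
from enum import Enum


def _json_default(obj):
    # json.dumps calls this only for objects it cannot encode natively.
    if isinstance(obj, Enum):
        return obj.name
    return str(obj)


def sample_value(value):
    # Hypothetical helper mirroring the diagram's branches for one sampled cell.
    if isinstance(value, (list, dict)):
        try:
            # One pass over the whole nested value; no per-level str() calls.
            return json.dumps(value, default=_json_default)
        except (TypeError, ValueError):
            # For example, unsupported dict keys or circular references.
            return str(value)
    if isinstance(value, Enum):
        return value.name
    if isinstance(value, (int, float)):
        return value
    return str(value)


class Color(Enum):  # hypothetical enum used only for this example
    RED = 1


cell = {"color": Color.RED, "seen": datetime(2026, 5, 11), "scores": [1, 2.5]}
print(sample_value(cell))
```

Note that the `default=` hook applies to values, not dict keys, so a key such as a tuple still raises `TypeError` and lands in the `str(value)` fallback.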
@kirangadhave I have started the AI code review. It will take a few minutes to complete.
mscolnick
approved these changes
May 11, 2026
Contributor
Pull request overview
Fixes pathological sample-value serialization for nested list/dict (e.g., Polars Struct/List) columns in NarwhalsTableManager.get_sample_values, preventing exponential string growth that can hang the browser during dataset registration.
Changes:
- Replace recursive nested list/dict stringification with a single `json.dumps(...)` pass (with a `default=` hook to preserve `Enum.name` at any depth and a `str(...)` fallback).
- Add regression tests covering bounded runtime/size for deep nesting, JSON-shaped output, and non-JSON leaf handling (e.g., `datetime`) embedded in a struct.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `marimo/_plugins/ui/_impl/tables/narwhals_table.py` | Updates nested sample serialization to avoid exponential re-escaping by using `json.dumps` with an Enum-aware default serializer. |
| `tests/_plugins/ui/_impl/tables/test_narwhals.py` | Adds regression tests for deep nested struct sampling behavior, including size/time bounds and JSON parsing expectations. |
Summary
`NarwhalsTableManager.get_sample_values` recursively re-stringified nested list/dict cells, causing each ancestor level to re-escape the children's repr. For deeply nested polars `Struct`/`List` columns this scaled ~8× per depth and produced multi-GB strings that hung the browser when the dataframe was registered as a dataset.
Replace the recursion with one `json.dumps` pass, preserving the `Enum.name`-at-any-depth contract via a `default=` callback. Scalar paths are unchanged.
Fixes #9378.
Test plan
- New tests in `tests/_plugins/ui/_impl/tables/test_narwhals.py` cover: bounded time/size at nesting depth 8, JSON-shaped output, non-JSON leaf (`datetime`) embedded in a struct.
- `uv run --group test pytest tests/_plugins/ui/_impl/tables/`: 468 passed.
- `make py-check` clean.
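The three properties in the test plan could be checked along the following lines. This is only a sketch under stated assumptions: the real tests exercise `NarwhalsTableManager` on a Polars dataframe, whereas `serialize_nested` and `build_nested` here are hypothetical stand-ins that use the standard library alone, and the size and time bounds are illustrative rather than the values used in the PR.

```python
import json
import time
from datetime import datetime
from enum import Enum


def serialize_nested(value):
    # Hypothetical stand-in for the fixed code path: one json.dumps pass,
    # Enum members rendered by name, other unknown leaves via str().
    return json.dumps(
        value, default=lambda o: o.name if isinstance(o, Enum) else str(o)
    )


def build_nested(depth):
    # A struct-of-list-of-struct shape, nested `depth` levels deep.
    value = {"leaf": "x"}
    for _ in range(depth):
        value = {"child": [value], "label": "node"}
    return value


def test_deep_nesting_is_bounded():
    start = time.perf_counter()
    out = serialize_nested(build_nested(8))
    elapsed = time.perf_counter() - start
    assert len(out) < 10_000  # illustrative bound, not the PR's value
    assert elapsed < 1.0  # illustrative bound, not the PR's value
    # JSON-shaped output: parsing recovers the original structure.
    assert json.loads(out) == build_nested(8)


def test_datetime_leaf_is_stringified():
    out = serialize_nested({"when": datetime(2026, 5, 11, 9, 30)})
    assert json.loads(out) == {"when": "2026-05-11 09:30:00"}
```

Asserting on `json.loads(out)` rather than on the raw string keeps the checks independent of whitespace and key-separator details in the encoder.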