feat: support HTTP DuckDB queries in WASM notebooks #9480
Pull request overview
Adds a Pyodide/WASM-only DuckDB compatibility layer that rewrites supported remote URL scans into replacement scans backed by fetched pandas DataFrames, enabling mo.sql, SQL cells, and common DuckDB APIs to query https://... sources in WASM notebooks.
Changes:
- Implement DuckDB WASM patching: SQL AST rewrite (sqlglot) + remote fetch + bytes→DataFrame decoding + replacement scan execution.
- Add shared WASM URL fetch helper and integrate it into existing Polars WASM fallbacks.
- Add extensive unit/integration coverage and update WASM workers to preload DuckDB-related deps when DuckDB is used.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/_runtime/test_patches.py | Adds unit test ensuring shared WASM fetch helper forwards Request/urlopen kwargs correctly. |
| tests/_runtime/test_duckdb_wasm.py | New test suite covering DuckDB SQL rewrite parity, direct-reader patching, and mo.sql/kernel integration in Pyodide. |
| marimo/_sql/utils.py | Hooks wrapped_sql / execute_duckdb_sql into the WASM DuckDB SQL rewrite path so mo.sql can transparently handle remote URLs. |
| marimo/_runtime/_wasm/_polars.py | Switches Polars fallback URL fetching to shared WASM fetch utility. |
| marimo/_runtime/_wasm/_patches.py | Extends patch framework with replace() for wrapper-only (no “call original first”) patching. |
| marimo/_runtime/_wasm/_fetch.py | New shared synchronous urllib-based fetch utility for Pyodide fallbacks. |
| marimo/_runtime/_wasm/_duckdb/__init__.py | Core DuckDB WASM patch implementation (direct readers + SQL APIs + eval-based replacement scan execution). |
| marimo/_runtime/_wasm/_duckdb/sources.py | sqlglot AST helpers to detect supported remote sources and extract literal args/options. |
| marimo/_runtime/_wasm/_duckdb/io.py | URL/option validation, reader selection, fetching, and multi-file concat semantics for remote sources. |
| marimo/_runtime/_wasm/_duckdb/dataframe.py | Bytes→DataFrame decoding via temp files + implementations for text/blob-like readers. |
| marimo/_output/formatters/formatters.py | Registers DuckDB formatter factory so importing DuckDB triggers WASM patch installation. |
| marimo/_output/formatters/df_formatters.py | Adds DuckDBFormatter that installs the DuckDB WASM patch on DuckDB import. |
| frontend/src/core/wasm/worker/worker.ts | Expands WASM dependency preloading heuristic to include DuckDB usage. |
| frontend/src/core/wasm/worker/bootstrap.ts | Expands notebook dependency preloading heuristic to include DuckDB usage. |
| frontend/src/core/islands/worker/worker.tsx | Expands islands worker dependency preloading heuristic to include DuckDB usage. |
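
The table describes `marimo/_runtime/_wasm/_fetch.py` as a shared synchronous urllib-based fetch utility, and the test row mentions that it forwards `Request`/`urlopen` kwargs. A minimal sketch of that shape might look like the following. The function name `fetch_url_bytes` appears in the architecture diagram; the `headers` and `timeout` parameters and the scheme check are assumptions made for illustration, not the PR's actual signature.

```python
import urllib.request
from typing import Mapping, Optional


def fetch_url_bytes(
    url: str,
    *,
    headers: Optional[Mapping[str, str]] = None,  # hypothetical parameter
    timeout: Optional[float] = None,  # hypothetical parameter
) -> bytes:
    """Fetch a URL synchronously and return the raw response body."""
    # Restricting to http(s) is an assumption of this sketch.
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"unsupported URL scheme: {url!r}")
    # Header kwargs go to Request; timeout goes to urlopen.
    request = urllib.request.Request(url, headers=dict(headers or {}))
    # urlopen uses a sentinel default for timeout, so pass it only when set.
    kwargs = {} if timeout is None else {"timeout": timeout}
    with urllib.request.urlopen(request, **kwargs) as response:
        return response.read()
```

In Pyodide, the diagram indicates that `urllib.request` is bridged to the browser's fetch via `pyodide_http`, so a helper like this can stay plain stdlib code.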
1 issue found across 15 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="marimo/_runtime/_wasm/_duckdb/io.py">
<violation number="1" location="marimo/_runtime/_wasm/_duckdb/io.py:198">
P2: `read_json_objects` is routed to `read_json_objects_auto`, which changes JSON-object reader semantics instead of preserving the requested function behavior.</violation>
</file>
Architecture diagram
sequenceDiagram
participant User as User Code
participant moSQL as mo.sql / SQL Cell
participant DuckDB as DuckDB Module
participant WasmPatch as WASM DuckDB Layer
participant SQLGlot as sqlglot Parser
participant Fetcher as Fetch Utility
participant DataFrame as DataFrame Builder
participant Pandas as Pandas DF
participant MemTable as DuckDB Temp Table
Note over User,MemTable: WASM DuckDB Remote File Query Flow
User->>moSQL: SQL with remote URL (e.g., read_csv('https://...'))
moSQL->>WasmPatch: try_run_duckdb_sql_with_wasm_patch()
alt Non-WASM environment
WasmPatch-->>moSQL: Return None (no-op)
moSQL->>DuckDB: Normal SQL execution
DuckDB-->>User: Query Result
else WASM environment (Pyodide)
WasmPatch->>SQLGlot: patch_duckdb_query_for_wasm()
SQLGlot->>SQLGlot: Parse SQL AST
alt SQL has remote URL references
SQLGlot-->>WasmPatch: Extract URLs and table functions
WasmPatch->>Fetcher: fetch_url_bytes(url)
Fetcher->>Fetcher: urllib.request (via pyodide_http)
Fetcher-->>WasmPatch: Raw bytes
WasmPatch->>DataFrame: Read bytes to DataFrame
DataFrame->>DataFrame: Determine format (CSV/Parquet/JSON)
DataFrame->>Pandas: Create DataFrame from bytes
Pandas-->>DataFrame: Pandas DataFrame
DataFrame-->>WasmPatch: DataFrame with remote data
WasmPatch->>WasmPatch: Generate replacement table name
Note over WasmPatch: e.g., __marimo_wasm_duckdb_remote_0
WasmPatch-->>moSQL: WasmDuckDBQueryPatch(query, tables)
moSQL->>DuckDB: Register temp table with DataFrame
moSQL->>DuckDB: Execute rewritten SQL (without URLs)
else No remote URLs
SQLGlot-->>WasmPatch: No remote sources found
WasmPatch-->>moSQL: Return None
moSQL->>DuckDB: Normal SQL execution
end
DuckDB->>MemTable: Replacement scan on temp DataFrame
MemTable-->>DuckDB: Query via pandas
DuckDB-->>User: Query Result
end
Note over User,MemTable: Direct DuckDB Reader API (patch_duckdb_for_wasm)
User->>DuckDB: duckdb.read_csv('https://...')
alt WASM + patched
DuckDB->>WasmPatch: Patched wrapper intercepts
WasmPatch->>Fetcher: fetch_url_bytes(url)
Fetcher-->>WasmPatch: Raw bytes
WasmPatch->>DataFrame: Build DataFrame from bytes
WasmPatch-->>DuckDB: Return DuckDB relation from DataFrame
DuckDB-->>User: DataFrame/Relation
else Not WASM or not patched
DuckDB->>DuckDB: Normal httpfs path (fails in WASM)
end
Note over User,MemTable: Key Boundaries
alt WASM fetch uses pyodide_http
Note over Fetcher: urllib → JS fetch bridge
end
alt sqlglot parsing fails or dynamic expressions
Note over SQLGlot: Return None → fallback to DuckDB native
end
opt Error during fetch or decode
WasmPatch-->>moSQL: Propagate exception
moSQL-->>User: Error with original DuckDB message
end
Pull request overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (3)
frontend/src/core/wasm/worker/worker.ts:149
- The dependency-loading heuristic now triggers on any occurrence of the substring `"duckdb"` in the notebook source. This can cause `loadPackagesFromImports` to pull in `pandas`/`duckdb`/`sqlglot` even when the user isn’t actually importing/using DuckDB (e.g., comments/strings/variable names), increasing startup time and bandwidth in WASM. Consider tightening detection (e.g., regex for `^\s*import\s+duckdb\b` / `^\s*from\s+duckdb\b` / `\bduckdb\.`) or relying on import discovery rather than raw substring matching.
if (code.includes("mo.sql") || code.includes("duckdb")) {
// Add pandas and duckdb to the code
code = `import pandas\n${code}`;
code = `import duckdb\n${code}`;
code = `import sqlglot\n${code}`;
frontend/src/core/wasm/worker/bootstrap.ts:171
- The `code.includes("duckdb")` heuristic is very broad and can cause WASM bootstrap to pre-load heavy deps (`pandas`/`duckdb`/`sqlglot`) on incidental mentions of “duckdb” (comments/strings), increasing load time. Consider switching to a more precise pattern (import statement detection / `duckdb.` usage) to avoid unnecessary package loads.
if (code.includes("mo.sql") || code.includes("duckdb")) {
// We need pandas and duckdb for mo.sql
code = `import pandas\n${code}`;
code = `import duckdb\n${code}`;
code = `import sqlglot\n${code}`;
frontend/src/core/islands/worker/worker.tsx:93
- Using `code.includes("duckdb")` to decide whether to pre-load `pandas`/`duckdb`/`sqlglot` is likely to over-trigger (e.g., “duckdb” in a comment/string), adding unnecessary package load time in WASM. Consider using a stricter detection strategy (import statement regex / `duckdb.` token) instead of a raw substring search.
if (code.includes("mo.sql") || code.includes("duckdb")) {
// Add pandas and duckdb to the code
code = `import pandas\n${code}`;
code = `import duckdb\n${code}`;
code = `import sqlglot\n${code}`;
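The three suppressed comments above suggest the same tightening. As a rough illustration, the suggested patterns can be exercised in Python (the workers themselves are TypeScript, but these particular patterns use syntax that JavaScript regexes share). The function name `needs_duckdb_preload` is hypothetical; the patterns are the ones quoted in the first comment.

```python
import re

# Patterns quoted in the review comment; MULTILINE makes ^ match at each line.
_DUCKDB_USAGE = re.compile(
    r"^\s*import\s+duckdb\b|^\s*from\s+duckdb\b|\bduckdb\.",
    re.MULTILINE,
)


def needs_duckdb_preload(code: str) -> bool:
    """Return True if the notebook source appears to use mo.sql or DuckDB."""
    # The mo.sql substring check is kept as in the original heuristic.
    return "mo.sql" in code or _DUCKDB_USAGE.search(code) is not None
```

This still over-triggers on a comment such as `# duckdb.sql is nice`, since `\bduckdb\.` does not know about comments or strings; relying on import discovery, as the comment also suggests, would avoid that.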
mscolnick
left a comment
this is REALLY amazing! great cleanup and isolation. super excited to use this
🚀 Development release published. You may be able to view the changes at https://marimo.app?v=0.23.7-dev14
<details>
<summary>Original POC PR description</summary>

Drafting this as a POC on stubbing threading and multiprocessing in WASM notebooks.

This first slice adds a minimal threading patch for Pyodide:

- bootstrap WASM runtime patches before marimo imports runtime modules that capture `threading.local`
- patch the small `threading` surface needed for the POC: `Thread`, `Event`, `local`, current thread, ident, enumerate/count
- give started WASM threads a synthetic identity
- make `threading.local` use that identity
- repair marimo runtime context storage if it was imported before the patch
- let `mo.Thread` hand async targets back to the patched runner

The main thing to review is the lifecycle: install early, install once, keep one interpreter-wide state object, and make marimo runtime context follow the synthetic thread identity.

## Review order

Probably easiest to read in this order:

1. `marimo/__init__.py` and `marimo/_pyodide/bootstrap.py`
2. `marimo/_runtime/_wasm/__init__.py`
3. `marimo/_runtime/_wasm/_concurrency/_install.py`
4. `marimo/_runtime/_wasm/_concurrency/_state.py`
5. `marimo/_runtime/_wasm/_concurrency/_threading.py`
6. `marimo/_runtime/_wasm/_concurrency/_wait.py`
7. `marimo/_runtime/threads.py`
8. `tests/_runtime/test_wasm_threading_poc.py`

## WASM Demo

https://github.com/user-attachments/assets/ea5e671a-8dc6-4830-adf0-79c9604e4e50

## Not in this PR / Follow-up Work

Leaving these out for now, but the intended approaches are:

- `ThreadPoolExecutor`: build on the same synthetic thread identity model, with submitted callables running through asyncio tasks and returning normal-looking `Future`s.
- `concurrent.futures.wait` / `as_completed`: adapt future completion to the Pyodide wait bridge so timeout and completion behavior stay close to stdlib call sites.
- `multiprocessing.Process`: keep the process API callable in Pyodide, but execute the target in the current interpreter and report lifecycle state through `start`, `join`, `is_alive`, and `exitcode`.
- `multiprocessing.Queue`: use an in-memory queue implementation that works with the process-shaped APIs above.
- `multiprocessing.Pool`: map pool work onto the same current-interpreter execution model, mostly for common notebook/library code that imports `Pool`.
- `ProcessPoolExecutor`: layer futures over the process-shaped runner, so code using the API can run without crashing in Pyodide.
- stream proxy repair: if stdout/stderr proxies captured a pre-patch `threading.local`, swap them to the WASM local and sync state on restore.
- browser matrix / Pyodide acceptance tests: run the examples in a real WASM notebook path, including `mo.Thread`, stdlib threading, futures, and multiprocessing-shaped code.
- docs and lint-rule wording: explain which APIs marimo patches in WASM and which APIs still have Pyodide-specific semantics.

</details>

Belongs to PR family of #9413 and #9480.

This PR makes the common Python concurrency APIs callable in WASM notebooks by installing browser-backed adapters for threading, futures, and process-shaped multiprocessing. The adapters run inside the current Pyodide interpreter on the browser event loop while preserving the API shape and lifecycle semantics that notebook and library code commonly expects.

## Summary

Adds WASM adapters for:

- `threading.Thread`, `threading.Event`, `threading.local`, current thread identity, and thread enumeration
- `concurrent.futures.ThreadPoolExecutor`, `Future.result`, callbacks, `wait`, and `as_completed`
- process-shaped `multiprocessing.Process`, `Queue`, `SimpleQueue`, and `Pool`
- `concurrent.futures.ProcessPoolExecutor`
- `mo.Thread` output routing from WASM workers back to the spawning cell

The branch also updates WASM lint rules and docs so supported process-shaped APIs are no longer flagged as unavailable, while we still call out unsupported native process primitives.

## Runtime model

As per the POC (e1c9570), all work stays in one Pyodide interpreter. Started threads, executor workers, and process-shaped targets are scheduled onto the browser-backed asyncio loop with synthetic thread and process identities.

Blocking-looking waits use the [Pyodide JSPI bridge](https://blog.pyodide.org/posts/jspi/) when needed, so calls like `thread.join()`, `Event.wait()`, `Future.result()`, `wait()`, `as_completed()`, `Queue.get()`, and pool result waits can yield to browser runtime work instead of failing immediately.

APIs that require OS processes, shared memory, pipes, managers, fork/forkserver contexts, or native synchronization remain unsupported.

### Support Levels

WASM notebooks run Python inside one browser-hosted Pyodide interpreter. Some concurrency APIs can preserve their Python API shape there, but their execution model differs from server-backed CPython. This PR introduces four support levels to make those differences explicit:

- `api-compatible`: the tested API shape and result behavior match the local Python contract for that operation.
- `serialized`: the API shape is available, but submitted work runs one task at a time in the current Pyodide interpreter.
- `cooperative-only`: waits, cancellation, and termination progress when Python yields back to Pyodide's event loop. Already-running Python code is not preempted.
- `blocked`: marimo rejects the API because the browser cannot provide the native process, synchronization, or shared-memory primitive it requires.

## Changes

- Installs WASM concurrency patches during Pyodide bootstrap before marimo runtime modules capture thread-local state.
- Keeps one interpreter-wide WASM runtime state for synthetic thread identity, current process ownership, live work tracking, and shutdown.
- Adds a serialized WASM executor for `ThreadPoolExecutor` and process-pool-shaped APIs.
- Adds process-shaped lifecycle support for `multiprocessing.Process`: `start`, `join`, `is_alive`, `exitcode`, `current_process`, `parent_process`, and `active_children`.
- Adds in-memory `Queue` and `SimpleQueue` adapters that preserve the standard put/get shape inside the browser interpreter.
- Adds `multiprocessing.Pool` support for `apply`, `map`, `starmap`, `imap`, async results, callbacks, close/join, and termination semantics that make sense in Pyodide.
- Adds `ProcessPoolExecutor` over the same process-shaped executor layer.
- Repairs pre-imported marimo runtime context storage and stdout/stderr stream proxies so `mo.Thread` output and progress updates route back to the correct cell.
- Extends WASM lint tests and docs to distinguish supported adapters from APIs that still cannot work in Pyodide.

## Testing

- Added runtime coverage for WASM threading, futures, multiprocessing process, queue, pool, and process pool adapters
- Added Pyodide browser matrix coverage for the public concurrency surfaces
- Added stream and message-type coverage for WASM thread output routing
- Added lint coverage for supported and unsupported WASM multiprocessing APIs

## Demo

Outputs from [this test notebook](https://github.com/user-attachments/files/28926946/notebook.py):

<img width="636" height="782" alt="Screenshot 2026-06-14 at 13 51 12" src="https://github.com/user-attachments/assets/384c208d-61d9-4104-9451-1ee041533dd7" />
<img width="636" height="1261" alt="Screenshot 2026-06-14 at 13 51 29" src="https://github.com/user-attachments/assets/1bff85be-7e2e-48e5-8421-e0fea55132fc" />
<img width="636" height="972" alt="Screen Recording 2026-06-14 at 13 52 32" src="https://github.com/user-attachments/assets/b6afa9ea-cddd-4b8b-9df0-e5a549461092" />
<img width="636" height="693" alt="Screenshot 2026-06-14 at 13 53 57" src="https://github.com/user-attachments/assets/615621ab-aa0e-4108-a1a1-7568d1fab8e2" />
<img width="636" height="818" alt="Screenshot 2026-06-14 at 13 54 09" src="https://github.com/user-attachments/assets/d97ec642-f42c-4b36-95c9-59ad921b4508" />
<img width="636" height="844" alt="Screenshot 2026-06-14 at 13 54 21" src="https://github.com/user-attachments/assets/c39c1eaf-5984-4aa1-acbf-6e83473f79e6" />
<img width="636" height="601" alt="Screenshot 2026-06-14 at 13 54 32" src="https://github.com/user-attachments/assets/294b49d8-2ed8-42ba-98ef-17c40804445d" />
<img width="636" height="1066" alt="Screenshot 2026-06-14 at 13 54 45" src="https://github.com/user-attachments/assets/a08f9255-cf0f-4645-b114-132fc26ba306" />
<img width="636" height="548" alt="Screenshot 2026-06-14 at 13 54 58" src="https://github.com/user-attachments/assets/3fefa752-5f07-43cc-be6a-23c5a065e816" />
<img width="636" height="816" alt="Screenshot 2026-06-14 at 13 55 12" src="https://github.com/user-attachments/assets/ca596079-b678-424e-a6d9-8428d26574a3" />

## Follow-ups

Having all the above primitives introduced in this PR, we could add support for:

- `threading.Timer`, `cooperative-only`: can be implemented as a delayed `AsyncioThread` scheduled on the Pyodide event loop. It does not need native threads, only delayed cooperative execution plus `cancel()` before the callback runs.
- `multiprocessing.Event`, `cooperative-only`: we can map it to the existing `AsyncEvent` shape already used for `threading.Event` and queue waits. This would cover common library code that just needs `set`, `clear`, `is_set`, and `wait`.
- `multiprocessing.Lock` / `RLock`, `cooperative-only`: we could add same-interpreter locks backed by asyncio state and JSPI waits.
- `multiprocessing.Semaphore` / `BoundedSemaphore`, `cooperative-only`: we can build on the same cooperative wait primitive as queues. Acquisition would only progress when Python yields to Pyodide.
- `multiprocessing.Condition`, `cooperative-only`: we could layer over the cooperative lock plus waiter list. This is mechanically straightforward once Lock/RLock exist and would have browser-event-loop scheduling semantics.
- `multiprocessing.JoinableQueue`, `serialized`: we could extend the in-memory queue with unfinished-task accounting, `task_done()`, and `join()`. Since `Queue` is already same-interpreter, this is probably the cleanest multiprocessing follow-up.
- `multiprocessing.Pipe`, `serialized`: can be implemented as paired in-memory endpoints with `send`, `recv`, `poll`, and `close`.
- `multiprocessing.pool.ThreadPool`, `serialized`: could be just aliased to the existing serialized executor/Pool machinery.
- `Process.sentinel`, `cooperative-only`: we can expose a synthetic wait token for adapted processes to support simple completion checks.
- Mixed pending `concurrent.futures.wait` / `as_completed`, `cooperative-only`: we can wrap observable foreign futures with callbacks and keep the current error for futures that cannot be observed without blocking the Pyodide event-loop lane.
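
To make the `serialized` support level concrete, here is a deliberately simplified sketch of an executor that keeps the `submit`/`Future` shape while running one task at a time in the current interpreter. The class name `SerializedExecutor` is hypothetical, and the sketch runs each task inline at `submit` time; it does not model the asyncio scheduling, synthetic identities, or JSPI waits that the description attributes to the actual adapters.

```python
from concurrent.futures import Future
from typing import Any, Callable


class SerializedExecutor:
    """Hypothetical executor: each task runs to completion inside submit()."""

    def __init__(self) -> None:
        self._shutdown = False

    def submit(self, fn: Callable[..., Any], /, *args: Any, **kwargs: Any) -> Future:
        if self._shutdown:
            raise RuntimeError("cannot schedule new futures after shutdown")
        future: Future = Future()
        # Move the future from PENDING to RUNNING before executing the task.
        future.set_running_or_notify_cancel()
        try:
            future.set_result(fn(*args, **kwargs))
        except Exception as exc:
            # Errors surface through Future.result() / Future.exception().
            future.set_exception(exc)
        return future

    def shutdown(self, wait: bool = True) -> None:
        # Nothing is ever in flight, so shutdown only blocks new submissions.
        self._shutdown = True

    def __enter__(self) -> "SerializedExecutor":
        return self

    def __exit__(self, *exc_info: Any) -> bool:
        self.shutdown()
        return False
```

Code written against `ThreadPoolExecutor` would see normal-looking futures from such an adapter, which is the point of the `serialized` level: the call sites keep working, but there is no parallelism.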
Follow up of #9480.

Fixes DuckDB WASM SQL rewrites for `read_csv("https://...")` and other `read_*` functions. DuckDB accepts double-quoted file sources, but `sqlglot` represents them as quoted identifiers not string literals, so we skipped the WASM remote-source rewrite and raised. This PR normalizes DuckDB string-like reader arguments before source validation.

### Before

<img width="979" height="1225" alt="Screenshot 2026-06-18 at 18 14 53" src="https://github.com/user-attachments/assets/d498291d-27d8-4021-b9a3-09ec76366e0a" />

### After

<img width="966" height="1219" alt="Screenshot 2026-06-18 at 18 15 04" src="https://github.com/user-attachments/assets/b3996119-5fb5-413f-91c0-6a24f2eeca13" />
Motivated by marimo-team/quarto-marimo#74, marimo-team/jupyter-book-marimo#1, and #9413.
DuckDB remote file queries fail in Pyodide because DuckDB-WASM can't use httpfs. Therefore, URL-based SQL like `FROM 'https://...'` and `read_csv`/`read_parquet`/`read_json('https://...')` are unusable in WASM notebooks today.

This PR adds a DuckDB WASM fallback layer for `mo.sql`, SQL cells, raw `duckdb.sql`/`query`/`execute`/`query_df`, connection SQL methods, and direct `duckdb.read_csv`/`read_parquet`/`read_json` calls.

It translates queries that scan a remote URL into queries that scan a replacement table such as `__marimo_wasm_duckdb_remote_0`, which is bound to a fetched pandas DataFrame that DuckDB can query through Python replacement scans.

Underneath, the fallback layer rewrites the SQL AST with `sqlglot`, fetches the remote bytes, decodes them into a DataFrame, and executes the rewritten query through a replacement scan.

Unsupported or dynamic cases are left to DuckDB's normal path. The patch is a no-op outside Pyodide, and in Pyodide the DuckDB SQL fallback requires `sqlglot` for AST analysis.

Tested with unit coverage for the rewrite/fetch/read paths and manually in the WASM playground against hosted CSV, parquet, JSON, and GeoJSON datasets.
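
The rewrite step can be illustrated with a string-level stand-in. The PR performs this on a `sqlglot` AST; the sketch below swaps in regular expressions purely to show the input/output shape, so it only handles single-quoted literal URLs and would mis-handle anything dynamic. The function name `rewrite_remote_sources`, the `"auto"` reader label, and the returned mapping are assumptions; only the `__marimo_wasm_duckdb_remote_N` naming comes from the description.

```python
import re
from typing import Dict, Tuple

# read_csv/read_parquet/read_json('https://...') with one literal URL argument.
_READER_CALL = re.compile(
    r"\b(read_csv|read_parquet|read_json)\(\s*'(https?://[^']+)'\s*\)",
    re.IGNORECASE,
)
# A bare 'https://...' string used directly as a table after FROM.
_FROM_URL = re.compile(r"(\bFROM\s+)'(https?://[^']+)'", re.IGNORECASE)


def rewrite_remote_sources(sql: str) -> Tuple[str, Dict[str, Tuple[str, str]]]:
    """Return the rewritten SQL and a {table_name: (reader, url)} mapping."""
    tables: Dict[str, Tuple[str, str]] = {}

    def next_name() -> str:
        return f"__marimo_wasm_duckdb_remote_{len(tables)}"

    def replace_call(match: "re.Match[str]") -> str:
        name = next_name()
        tables[name] = (match.group(1).lower(), match.group(2))
        return name

    def replace_from(match: "re.Match[str]") -> str:
        name = next_name()
        # Hypothetical label: the format would be inferred from the URL.
        tables[name] = ("auto", match.group(2))
        return match.group(1) + name

    sql = _READER_CALL.sub(replace_call, sql)
    sql = _FROM_URL.sub(replace_from, sql)
    return sql, tables
```

Each entry in the returned mapping would then be fetched, decoded into a pandas DataFrame, and bound under its table name so that DuckDB resolves it through a replacement scan. A query with no remote sources comes back unchanged with an empty mapping, which corresponds to the "left to DuckDB's normal path" case.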
WASM Playground Demo
Demonstrates that the Pyodide build of marimo supports querying remote files with DuckDB across cases like:
- `mo.sql`: direct URL scan
- `duckdb.sql`: `read_parquet(...)`
- `duckdb.read_csv` Python API with patched options (custom delimiter)
- `duckdb.connect`

wasm-demo.mp4