fix: handle sqlglot OptimizeError for DuckDB unnest with JSON access#8083
fix: handle sqlglot OptimizeError for DuckDB unnest with JSON access#8083dmadisetti merged 3 commits intomarimo-team:mainfrom
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
All contributors have signed the CLA ✍️ ✅ |
|
I have read the CLA Document and I hereby sign the CLA |
| if isinstance(source, exp.Table): | ||
| if ref := get_ref_from_table(source): | ||
| refs.add(ref) | ||
| except OptimizeError: |
There was a problem hiding this comment.
Nice. Limitation of the fall back is that it's limited to a single scope. So if there are multiple statements, this will only catch the first one.
I almost wonder if we should log? but seems fine
There was a problem hiding this comment.
Thanks for the review! A couple of clarifications:
Multiple statements are handled correctly: the try/except is inside the for expression in expression_list: loop, so each statement is processed independently. If one fails, others still use build_scope(). I added test_multiple_statements_with_optimize_error to verify this.
Added debug logging: good call. Added a LOGGER.debug() when the fallback triggers.
Re: single scope limitation: you're right that find_all(exp.Table) doesn't understand CTEs like build_scope() does, so we may occasionally over-report dependencies (e.g., including the CTE name itself). Trade-off seems acceptable to avoid the crash, and we won't under-report real table dependencies.
Let me know if you'd like any changes!
There was a problem hiding this comment.
I think we can remove the debugging if we're not dropping anything.
Multiple statements are handled correctly
I guess I was trying to determine why don't we just run this instead of trying build_scope ?
There was a problem hiding this comment.
Removed the debug logging.
Regarding why not use find_all everywhere - build_scope properly understands SQL scopes, which matters for CTEs:
WITH my_data AS (SELECT * FROM real_table)
SELECT * FROM my_databuild_scope->{real_table}(correct -my_datais a CTE, not an external table)find_all->{my_data, real_table}(over-reports - includes the CTE name)
In marimo's reactive system, this over-reporting could trigger unnecessary cell re-executions if a CTE name happens to match another defined table. So we prefer build_scope's precision when it works, and only fall back to find_all for these edge cases where it fails.
- Add LOGGER.debug when build_scope falls back due to OptimizeError - Add test_multiple_statements_with_optimize_error to verify each statement is processed independently Follow-up to marimo-team#8080
…rence build_scope correctly handles CTEs (won't report CTE names as table refs), which is why we prefer it over find_all when possible.
dmadisetti
left a comment
There was a problem hiding this comment.
Awesome. Thanks @bxff
|
🚀 Development release published. You may be able to view the changes at https://marimo.app?v=0.19.8-dev8 |
📝 Summary
Closes #8080
🔍 Description of Changes
Problem: Marimo's SQL static analysis fails with
sqlglot.errors.OptimizeError: Alias already usedwhen parsing valid DuckDB syntax usingunnestwith JSON access (e.g.,SELECT unnest([1,2,3])->>'$' FROM (SELECT 1)).Root Cause: The
build_scope()function from sqlglot's optimizer raises anOptimizeErrorwhen SQL contains duplicate aliases. This occurs with:unnestexpressions with JSON access that generate identical internal aliasesSELECT * FROM (SELECT 1 as x), (SELECT 2 as x))This is valid DuckDB syntax but not handled by sqlglot's scope optimizer.
Solution: Wrap the
build_scope()call infind_sql_refs()with a try-except block. WhenOptimizeErroris caught, fall back to extracting table references directly usingexpression.find_all(exp.Table)without scope analysis.Changes:
marimo/_ast/sql_visitor.py: Added error handling forOptimizeErrorwith fallback to direct table extractiontests/_ast/test_sql_visitor.py: Added 3 test cases covering the regression and fallback behavior📋 Checklist