-
Notifications
You must be signed in to change notification settings - Fork 3.7k
feat(logical planning): Planner playground #16403
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking good to me!
However, I think I would not place the code inside pkg/dataobj
but rather in pkg/engine
.
├── GroupExprs | ||
│ └── Column #region | ||
├── AggregateExprs | ||
│ └── AggregateExpr op=sum |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I would have expected the the AggregateExpr would also be printed with its name, like BinaryOp:
│ └── AggregateExpr op=sum | |
│ └── AggregateExpr op=sum name=total_sales |
pkg/dataobj/planner/logical/ifc.go
Outdated
2. Type Safety - Compile-time guarantees through interfaces and type assertions | ||
3. API Surface Reduction - Exposing minimal interfaces to external consumers | ||
|
||
The pattern works as follows: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about this level of indirection. WDYT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't have a strong opinion. But I think it's not strictly necessary to have a Type()
function, since you can use the type switch statement instead.
initial SSA mapping from logical plans
a276ec2
to
16cb270
Compare
This reverts commit 16cb270.
…ataobj internals)
// sort := newSort(inputPlan, []SortExpr{ | ||
// NewSortExpr("sort_by_age", Col("age"), true, false), | ||
// NewSortExpr("sort_by_name", Col("name"), false, true), | ||
// }) | ||
func newSort(input Plan, expr SortExpr) *Sort { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
newSort
only accepts a single sort expression as argument, not a slice as documented.
// Plan is the core plan type in the logical package. | ||
// It wraps a concrete plan type (MakeTable, Filter, etc.) | ||
// and provides methods to safely access the underlying value. | ||
// This approach replaces the previous interface-based design to reduce | ||
// indirection and improve code clarity. | ||
type Plan struct { | ||
ty PlanType // The type of plan | ||
val any // The concrete plan value | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not fully convinced with this approach, but also not against it.
While it removes the complexity of the interface indirections, it adds a decent amount of boilderplate code (see Schema() function etc). Maybe changing the type of val
from any
to something more specific, like a minimalistic interface
type interface PlanImplementation {
isPlanImplementation() // make sure it's only implemented within this package
Schema() Schema
}
However, I think the approach shines when you have multiple implementations of the same type.
func (p Plan) Type() PlanType { | ||
return p.ty | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes all the Type() PlanType
functions on the concrete plan "implementations" obsolete.
|
||
// writePlan writes a plan node to the tree formatter. | ||
// It dispatches to the appropriate write method based on the plan type. | ||
func (t *treeFormatter) writePlan(ast Plan) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eventually, the treeFormatter should be tree type agnostic, since we would use it for the physical plan as well.
A PlanVisitor could traverse the AST to create the treeFormatter nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was about to mentioned the benefits of a visitor.
type BinOpType int | ||
|
||
const ( | ||
BinOpTypeInvalid BinOpType = iota // Invalid or uninitialized binary operation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Since you use an int
as BinOpType
, you could start the iota
value at -1
(for "invalid type"). However, what do we need the BinOpTypeInvalid
type for?
You have this for other enums as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I understand now: The empty value of BinOpExpr{}
has the BinOpTypeInvalid
type.
case ExprTypeSort: | ||
return "Sort" | ||
default: | ||
return "Unknown" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit
return "Unknown" | |
return UNKNOWN |
// Type implements the Plan interface | ||
func (f *Filter) Type() PlanType { | ||
return PlanTypeFilter | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Type()
is now implemented on the Plan
struct
// TableSchema returns the schema of the table. | ||
// This is a convenience method that returns the same value as Schema(). | ||
func (t *MakeTable) TableSchema() schema.Schema { | ||
return t.schema | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is unused. Also, it's not a convenience method, but a duplicate ;-)
// AggregateOp represents the type of aggregation operation to perform. | ||
// It is a string-based enum that identifies different aggregation functions | ||
// that can be applied to expressions. | ||
type AggregateOp string | ||
|
||
const ( | ||
// AggregateOpSum represents a sum aggregation | ||
AggregateOpSum AggregateOp = "sum" | ||
// AggregateOpAvg represents an average aggregation | ||
AggregateOpAvg AggregateOp = "avg" | ||
// AggregateOpMin represents a minimum value aggregation | ||
AggregateOpMin AggregateOp = "min" | ||
// AggregateOpMax represents a maximum value aggregation | ||
AggregateOpMax AggregateOp = "max" | ||
// AggregateOpCount represents a count aggregation | ||
AggregateOpCount AggregateOp = "count" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should use the same enum pattern (int type + iota + String() function) as in the rest of the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good starting point 🎉 I'm approving and merging immediately so we can keep momentum on the other parts of the query engine while you're on PTO. I have a few high-level thoughts for future improvements:
-
I wonder if there's more we can do to have our types more directly represent an SSA logical plan. At the moment, the package API itself is forming a hierarchical tree, with the SSA code mainly focusing on representing that tree as a string.
If you compare to something like ssa.BasicBlock from Go, you can see how their package API more naturally fits the pattern of a sequence of instructions instead of a hierarchy.
-
I like that we're learning lessons from other query engines (like DataFusion) and trying to apply it to our own, but I wonder if we'll actually use the builder pattern defined here in practice.
Specifically, the builder pattern is used to construct a plan top-down, with each chained function call being the next operation to do after all previous chained function calls.
However, if our logical plans were built with depth-first traversal of the AST, it would be constructing nodes bottom-up, which might reduce how often we use the builder pattern in our planning code.
-
Hiding fields behind getters may be increasing the size of the mental model for the package API. Exposing struct fields publicly like we do for the AST may help cut down the amount of code and files needed. As a bonus, this may also give the caller more choices for how to construct plans without needing to use the builder pattern.
Merging as mentioned above, but @chaudum's comments are also worth addressing once you're back from PTO (and I agree that pkg/dataobj is probably not the best home for this storage-agnostic engine) |
The reason for the package to be in pkg/dataobj is the import of the internal package for datatypes. |
Thanks, I noticed that, but I don't think it needs to be that way. It would be reasonable to have duplicate definitions of data types specifically for the query engine. Even if datasetmd wasn't internal, I'd still recommend doing that to minimize the intersections where the query engine uses dataobj code ("a little copying is better than a little dependency"). This typically helps with code organization and compartmentalizing the mental model. |
Logical Query Planner Structure
Messing around with logical planning; opening a draft PR to showcase & spark conversation. Much inspired by https://howqueryengineswork.com & datafusion
edit: I've refactored this to abandon the interface-heavy first implementation to prefer signal types and concrete implementations, similar to what's used in the dataobj columnar format.
Core (plan.go, expr.go)
Plan
- Schema provider + child traversalExpr
- Column/computation abstractionPlan Types
MakeTable
(table.go) - Leaf node, reads from DataSourceFilter
(filter.go) - Row filtering via predicatesProject
(project.go) - Column selection/transformationAggregate
(aggregate_plan.go) - Grouping + aggregationsBinOp
(binop.go) - Binary operations between plansLimit
(limit.go) - limits (offset and number of rows to return)Sort
(sort_plan.go) - Sorting resultsExpression Types
ColumnExpr
(column.go) - Column referencesLiteralExpr
(literal.go) - Constants (int64, float64, string)AggregateExpr
(aggregate_expr.go) - sum, avg, min, max, countSortExpr
- sorting and related options (ordering, null handling) used by sort planning.Query Building
DataFrame
(dataframe.go) - Ergonomic API for plan constructionSSA (Static Single Assignment) mappings
Debug Formatting (format/)
TreeFormatter
- Tree-style output (├── child)All components use schema propagation
Example
The following construction yields the following plan.
edit: I've also added SSA mappings. The same plan generates the following: