feat(logical planning): Planner playground #16403


Merged
merged 21 commits into from
Mar 6, 2025

Conversation

owen-d
Member

@owen-d owen-d commented Feb 21, 2025

Logical Query Planner Structure

Messing around with logical planning; opening a draft PR to showcase & spark conversation. Much inspired by https://howqueryengineswork.com & datafusion

edit: I've refactored this to abandon the interface-heavy first implementation in favor of signal types and concrete implementations, similar to what's used in the dataobj columnar format.

Core (plan.go, expr.go)

  • Plan - Schema provider + child traversal
  • Expr - Column/computation abstraction

Plan Types

  • MakeTable (table.go) - Leaf node, reads from DataSource
  • Filter (filter.go) - Row filtering via predicates
  • Project (project.go) - Column selection/transformation
  • Aggregate (aggregate_plan.go) - Grouping + aggregations
  • BinOp (binop.go) - Binary operations between plans
  • Limit (limit.go) - limits (offset and number of rows to return)
  • Sort (sort_plan.go) - Sorting results

Expression Types

  • ColumnExpr (column.go) - Column references
  • LiteralExpr (literal.go) - Constants (int64, float64, string)
  • AggregateExpr (aggregate_expr.go) - sum, avg, min, max, count
  • Binary comparisons in binop.go (>, <, ==, etc)
  • SortExpr - sorting and related options (ordering, null handling) used by sort planning.

Query Building

  • DataFrame (dataframe.go) - Ergonomic API for plan construction

SSA (Static Single Assignment) mappings

  • Easily construct SSA forms from the initial logical plan trees.

Debug Formatting (format/)

  • TreeFormatter - Tree-style output (├── child)
  • SSA formatting

All components use schema propagation

Example

The following construction yields the plan shown below.

	// Calculate the sum of sales per region for the year 2020
	ds := &testDataSource{
		name: "orders",
		schema: schema.Schema{
			Columns: []schema.ColumnSchema{
				{Name: "region", Type: datasetmd.VALUE_TYPE_STRING},
				{Name: "sales", Type: datasetmd.VALUE_TYPE_UINT64},
				{Name: "year", Type: datasetmd.VALUE_TYPE_UINT64},
			},
		},
	}

	df := NewDataFrame(
		NewScan(ds, nil),
	).Filter(
		Eq("year_2020", Col("year"), LitI64(2020)),
	).Project(
		[]Expr{
			Col("region"),
			Col("sales"),
			Col("year"),
		},
	).Aggregate(
		[]Expr{Col("region")},
		[]AggregateExpr{
			Sum("total_sales", Col("sales")),
		},
	)
Aggregate groupings=(region) aggregates=(total_sales)
├── GroupExprs
│   └── Column #region
├── AggregateExprs
│   └── AggregateExpr op=sum
│       └── Column #sales
└── Projection region=VALUE_TYPE_STRING sales=VALUE_TYPE_UINT64 year=VALUE_TYPE_UINT64
    ├── Column #region
    ├── Column #sales
    ├── Column #year
    └── Filter expr=year_2020
        ├── BooleanCmpExpr op=(==) name=year_2020
        │   ├── Column #year
        │   └── LiteralI64 value=2020
        └── Scan data_source=orders

edit: I've also added SSA mappings. The same plan generates the following:

%1 = MakeTable [name=orders]
%2 = ColumnRef [name=year, type=VALUE_TYPE_UINT64]
%3 = Literal [val=2020, type=VALUE_TYPE_INT64]
%4 = BinaryOp [op=(==), name=year_2020, left=%2, right=%3]
%5 = Filter [name=year_2020, predicate=%4, plan=%1]
%6 = ColumnRef [name=region, type=VALUE_TYPE_STRING]
%7 = ColumnRef [name=sales, type=VALUE_TYPE_UINT64]
%8 = ColumnRef [name=year, type=VALUE_TYPE_UINT64]
%9 = Project [region=%6, sales=%7, year=%8]
%10 = ColumnRef [name=region, type=VALUE_TYPE_STRING]
%11 = ColumnRef [name=sales, type=VALUE_TYPE_UINT64]
%12 = AggregationExpr [name=total_sales, op=sum]
%13 = AggregatePlan [aggregations=[%12], groupings=[%10]]
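For readers following along, numbered SSA output like the above can be produced by a post-order walk over the plan tree, assigning ids bottom-up so every operand is defined before it is referenced. A minimal sketch (the `node`/`lowerer` types here are hypothetical stand-ins, not the PR's actual code):

```go
package main

import "fmt"

// node is a toy tree-shaped plan: an op plus child plans.
type node struct {
	op       string
	children []*node
}

// lowerer assigns SSA-style ids during a post-order traversal.
type lowerer struct {
	next  int
	lines []string
}

// lower emits children first, then this node, returning its SSA id.
func (l *lowerer) lower(n *node) int {
	refs := []int{}
	for _, c := range n.children {
		refs = append(refs, l.lower(c))
	}
	l.next++
	l.lines = append(l.lines, fmt.Sprintf("%%%d = %s %v", l.next, n.op, refs))
	return l.next
}

func main() {
	tree := &node{op: "Filter", children: []*node{{op: "MakeTable"}}}
	var l lowerer
	l.lower(tree)
	for _, line := range l.lines {
		fmt.Println(line)
	}
	// %1 = MakeTable []
	// %2 = Filter [1]
}
```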
@owen-d owen-d changed the title Planner playground Feb 21, 2025
Contributor

@chaudum chaudum left a comment


This is looking good to me!
However, I think I would not place the code inside pkg/dataobj but rather in pkg/engine.

├── GroupExprs
│   └── Column #region
├── AggregateExprs
│   └── AggregateExpr op=sum
Contributor


nit: I would have expected that the AggregateExpr would also be printed with its name, like BinaryOp:

Suggested change
│   └── AggregateExpr op=sum
│   └── AggregateExpr op=sum name=total_sales
2. Type Safety - Compile-time guarantees through interfaces and type assertions
3. API Surface Reduction - Exposing minimal interfaces to external consumers

The pattern works as follows:
Member Author


Not sure about this level of indirection. WDYT

Contributor


Don't have a strong opinion. But I think it's not strictly necessary to have a Type() function, since you can use the type switch statement instead.
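For illustration, the type-switch alternative might look like this minimal sketch (`Filter` and `MakeTable` are empty stand-ins here, not the PR's real types):

```go
package main

import "fmt"

// Empty stand-ins for the PR's concrete plan types (hypothetical).
type Filter struct{}
type MakeTable struct{}

// Plan wraps a concrete plan value behind an `any`, as in the PR.
type Plan struct {
	val any
}

// planName dispatches with a type switch; no Type() tag is needed.
func planName(p Plan) string {
	switch p.val.(type) {
	case *Filter:
		return "Filter"
	case *MakeTable:
		return "MakeTable"
	default:
		return "Unknown"
	}
}

func main() {
	fmt.Println(planName(Plan{val: &Filter{}}))    // Filter
	fmt.Println(planName(Plan{val: &MakeTable{}})) // MakeTable
}
```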

@owen-d owen-d force-pushed the planner-playground branch from a276ec2 to 16cb270 Compare March 4, 2025 16:40
@owen-d owen-d marked this pull request as ready for review March 4, 2025 19:21
@owen-d owen-d requested a review from a team as a code owner March 4, 2025 19:21
Comment on lines +23 to +27
// sort := newSort(inputPlan, []SortExpr{
//     NewSortExpr("sort_by_age", Col("age"), true, false),
//     NewSortExpr("sort_by_name", Col("name"), false, true),
// })
func newSort(input Plan, expr SortExpr) *Sort {
Contributor


newSort only accepts a single sort expression as argument, not a slice as documented.
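One possible fix, sketched here with reduced stand-in types (not the PR's actual shapes), is to make newSort variadic so the signature matches the documented multi-expression usage:

```go
package main

import "fmt"

// Reduced stand-ins for the PR's types (hypothetical shapes).
type Plan struct{ name string }
type SortExpr struct{ name string }

type Sort struct {
	input Plan
	exprs []SortExpr
}

// newSort is variadic, so both the documented multi-expression call
// and a single-expression call compile unchanged.
func newSort(input Plan, exprs ...SortExpr) *Sort {
	return &Sort{input: input, exprs: exprs}
}

func main() {
	s := newSort(Plan{name: "orders"},
		SortExpr{name: "sort_by_age"},
		SortExpr{name: "sort_by_name"},
	)
	fmt.Println(len(s.exprs)) // 2
}
```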

Comment on lines +50 to +58
// Plan is the core plan type in the logical package.
// It wraps a concrete plan type (MakeTable, Filter, etc.)
// and provides methods to safely access the underlying value.
// This approach replaces the previous interface-based design to reduce
// indirection and improve code clarity.
type Plan struct {
	ty  PlanType // The type of plan
	val any      // The concrete plan value
}
Contributor


I am not fully convinced by this approach, but also not against it.

While it removes the complexity of the interface indirections, it adds a decent amount of boilerplate code (see the Schema() function etc.). Maybe change the type of val from any to something more specific, like a minimalistic interface:

type PlanImplementation interface {
    isPlanImplementation() // make sure it's only implemented within this package
    Schema() Schema
}

However, I think the approach shines when you have multiple implementations of the same type.
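A minimal sketch of that suggestion, with hypothetical stand-in types: the unexported marker method seals the interface to the package, and Plan.Schema() becomes a one-line delegation instead of a per-type switch:

```go
package main

import "fmt"

// Hypothetical reduced Schema type.
type Schema struct{ Columns []string }

// planImpl is a sealed interface: the unexported marker method means
// only types in this package can satisfy it.
type planImpl interface {
	isPlanImpl()
	Schema() Schema
}

type MakeTable struct{ schema Schema }

func (*MakeTable) isPlanImpl()      {}
func (t *MakeTable) Schema() Schema { return t.schema }

// Plan wraps the sealed interface instead of `any`.
type Plan struct{ val planImpl }

// Schema delegates once, replacing a switch over every plan type.
func (p Plan) Schema() Schema { return p.val.Schema() }

func main() {
	p := Plan{val: &MakeTable{schema: Schema{Columns: []string{"region"}}}}
	fmt.Println(p.Schema().Columns) // [region]
}
```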

Comment on lines +63 to +65
func (p Plan) Type() PlanType {
	return p.ty
}
Contributor


This makes all the Type() PlanType functions on the concrete plan "implementations" obsolete.


// writePlan writes a plan node to the tree formatter.
// It dispatches to the appropriate write method based on the plan type.
func (t *treeFormatter) writePlan(ast Plan) {
Contributor


Eventually, the treeFormatter should be tree type agnostic, since we would use it for the physical plan as well.

A PlanVisitor could traverse the AST to create the treeFormatter nodes.

Contributor


I was about to mention the benefits of a visitor.

Comment on lines +31 to +34
type BinOpType int

const (
	BinOpTypeInvalid BinOpType = iota // Invalid or uninitialized binary operation
Contributor


nit: Since you use an int as BinOpType, you could start the iota value at -1 (for "invalid type"). However, what do we need the BinOpTypeInvalid type for?

You have this for other enums as well.

Contributor


Ah, I understand now: The empty value of BinOpExpr{} has the BinOpTypeInvalid type.
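That zero-value behavior can be demonstrated directly (BinOpExpr and the enum here are reduced stand-ins for the PR's types):

```go
package main

import "fmt"

// With iota starting at zero, the zero value of BinOpExpr
// automatically carries BinOpTypeInvalid; no initialization needed.
type BinOpType int

const (
	BinOpTypeInvalid BinOpType = iota
	BinOpTypeEq
)

type BinOpExpr struct{ op BinOpType }

func main() {
	var e BinOpExpr // zero value
	fmt.Println(e.op == BinOpTypeInvalid) // true
}
```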

case ExprTypeSort:
	return "Sort"
default:
	return "Unknown"
Contributor


nit

Suggested change
return "Unknown"
return UNKNOWN
Comment on lines +48 to +51
// Type implements the Plan interface
func (f *Filter) Type() PlanType {
	return PlanTypeFilter
}
Contributor


Type() is now implemented on the Plan struct

Comment on lines +33 to +37
// TableSchema returns the schema of the table.
// This is a convenience method that returns the same value as Schema().
func (t *MakeTable) TableSchema() schema.Schema {
	return t.schema
}
Contributor


This function is unused. Also, it's not a convenience method, but a duplicate ;-)

Comment on lines +8 to +24
// AggregateOp represents the type of aggregation operation to perform.
// It is a string-based enum that identifies different aggregation functions
// that can be applied to expressions.
type AggregateOp string

const (
	// AggregateOpSum represents a sum aggregation
	AggregateOpSum AggregateOp = "sum"
	// AggregateOpAvg represents an average aggregation
	AggregateOpAvg AggregateOp = "avg"
	// AggregateOpMin represents a minimum value aggregation
	AggregateOpMin AggregateOp = "min"
	// AggregateOpMax represents a maximum value aggregation
	AggregateOpMax AggregateOp = "max"
	// AggregateOpCount represents a count aggregation
	AggregateOpCount AggregateOp = "count"
)
Contributor


I think we should use the same enum pattern (int type + iota + String() function) as in the rest of the PR.
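Applied to AggregateOp, the int + iota + String() pattern might look like this sketch (the Invalid zero value and exact layout are assumptions, mirroring the PR's other enums):

```go
package main

import "fmt"

// AggregateOp as an int enum; the zero value is Invalid, matching
// the convention used by the PR's other enums.
type AggregateOp int

const (
	AggregateOpInvalid AggregateOp = iota
	AggregateOpSum
	AggregateOpAvg
	AggregateOpMin
	AggregateOpMax
	AggregateOpCount
)

// String makes the enum printable, e.g. in formatter output.
func (op AggregateOp) String() string {
	switch op {
	case AggregateOpSum:
		return "sum"
	case AggregateOpAvg:
		return "avg"
	case AggregateOpMin:
		return "min"
	case AggregateOpMax:
		return "max"
	case AggregateOpCount:
		return "count"
	default:
		return "invalid"
	}
}

func main() {
	fmt.Println(AggregateOpSum) // sum
}
```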

Member

@rfratto rfratto left a comment


I think this is a good starting point 🎉 I'm approving and merging immediately so we can keep momentum on the other parts of the query engine while you're on PTO. I have a few high-level thoughts for future improvements:

  • I wonder if there's more we can do to have our types more directly represent an SSA logical plan. At the moment, the package API itself is forming a hierarchical tree, with the SSA code mainly focusing on representing that tree as a string.

    If you compare to something like ssa.BasicBlock from Go, you can see how their package API more naturally fits the pattern of a sequence of instructions instead of a hierarchy.

  • I like that we're learning lessons from other query engines (like DataFusion) and trying to apply them to our own, but I wonder if we'll actually use the builder pattern defined here in practice.

    Specifically, the builder pattern is used to construct a plan top-down, with each chained function call being the next operation to do after all previous chained function calls.

    However, if our logical plans were built with depth-first traversal of the AST, it would be constructing nodes bottom-up, which might reduce how often we use the builder pattern in our planning code.

  • Hiding fields behind getters may be increasing the size of the mental model for the package API. Exposing struct fields publicly like we do for the AST may help cut down the amount of code and files needed. As a bonus, this may also give the caller more choices for how to construct plans without needing to use the builder pattern.
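To illustrate the bottom-up point: a depth-first AST walk constructs leaf plans first and wraps them in parents as it returns, with no builder involved. A toy sketch (not the PR's types):

```go
package main

import "fmt"

// Toy plan node: each plan wraps its input.
type Plan struct {
	kind  string
	child *Plan
}

// chain lists node kinds from root to leaf.
func chain(p *Plan) []string {
	var out []string
	for ; p != nil; p = p.child {
		out = append(out, p.kind)
	}
	return out
}

func main() {
	// Bottom-up: the leaf is built first, then wrapped, as a
	// depth-first AST traversal would naturally produce.
	scan := &Plan{kind: "MakeTable"}
	filter := &Plan{kind: "Filter", child: scan}
	agg := &Plan{kind: "Aggregate", child: filter}

	fmt.Println(chain(agg)) // [Aggregate Filter MakeTable]
}
```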

@rfratto
Member

rfratto commented Mar 6, 2025

Merging as mentioned above, but @chaudum's comments are also worth addressing once you're back from PTO (and I agree that pkg/dataobj is probably not the best home for this storage-agnostic engine)

@rfratto rfratto merged commit 89f0ed7 into grafana:main Mar 6, 2025
58 checks passed
@chaudum
Contributor

chaudum commented Mar 7, 2025

Merging as mentioned above, but @chaudum's comments are also worth addressing once you're back from PTO (and I agree that pkg/dataobj is probably not the best home for this storage-agnostic engine)

The reason for the package to be in pkg/dataobj is the import of the internal package for datatypes.

@rfratto
Member

rfratto commented Mar 7, 2025

The reason for the package to be in pkg/dataobj is the import of the internal package for datatypes.

Thanks, I noticed that, but I don't think it needs to be that way. It would be reasonable to have duplicate definitions of data types specifically for the query engine. Even if datasetmd wasn't internal, I'd still recommend doing that to minimize the intersections where the query engine uses dataobj code ("a little copying is better than a little dependency"). This typically helps with code organization and compartmentalizing the mental model.
