chore: Add new pointers section type for dataobj #18243


Merged
benclive merged 7 commits into main from benclive/add-new-pointers-section-to-dataobj on Jun 27, 2025

Conversation

benclive (Contributor)

What this PR does / why we need it:
Adds a new type of section ("Pointers") for dataobjs, storing indexing information at an object-section granularity.

  • I am open to changing the name from Pointers if anyone has any suggestions.
  • I opted for a wide-section approach instead of two separate section types: this section stores both stream indexing information (ID, start, end, size, etc.) and column indexing information (blooms for column values), and the two groups are mutually exclusive per row (see the sketch after this list).
  • The exact fields may change over time; I've tested indexing E2E and I'm happy with this as a first pass.
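
For concreteness, here is a minimal sketch of what a single row in the wide Pointers section might carry. The Go field names are hypothetical (the authoritative schema is the section's dataset metadata, not this struct); only UncompressedSize and Column appear verbatim in the diff below.

```go
package pointers

import "time"

// SectionPointer is an illustrative row in the wide Pointers section.
// Path and Section are always set; exactly one of the two metadata
// groups below is populated per row, never both.
type SectionPointer struct {
	// Mandatory: uniquely identifies the referenced section.
	Path    string
	Section int64

	// Stream indexing metadata (stream-pointer rows only).
	StreamID         int64
	StartTs          time.Time
	EndTs            time.Time
	UncompressedSize int64

	// Column indexing metadata (column-pointer rows only).
	Column     string
	BloomBytes []byte
}
```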

Which issue(s) this PR fixes:
Part of https://github.com/grafana/loki-private/issues/1725

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • Title matches the required conventional commits format, see here
    • Note that Promtail is considered to be feature complete, and future development for logs collection will be in Grafana Alloy. As such, feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR
UncompressedSize int64

// Column indexing metadata
Column string
benclive (Contributor, Author)

I chose to have ColumnName as a separate column, instead of having one column per referenced column & type. I might revisit this decision in the future if it hurts read performance by forcing us to scan through blooms for columns we don't care about.
e.g.

path | section |  column   | bloom bytes
-----|---------|-----------|------------
abcd |    1    | namespace | []byte{..}
abcd |    1    | cluster   | []byte{..}

instead of

path | section | namespace_bloom | cluster_bloom
-----|---------|-----------------|--------------
abcd |    1    |   []byte{..}    |
abcd |    1    |                 |  []byte{..}
PageSizeHint: b.pageSize,
Value: datasetmd.VALUE_TYPE_BYTE_ARRAY,
Encoding: datasetmd.ENCODING_TYPE_PLAIN,
Compression: datasetmd.COMPRESSION_TYPE_NONE, // TODO: is there a sensible compression algorithm for bloom filters?
benclive (Contributor, Author)

Any suggestions for this TODO?
I'm tempted to leave it as NONE for two reasons: bloom filter bytes are close to random, so they won't compress well, and skipping compression means we don't burn CPU decompressing the blooms at read time, so they should be very fast to scan.
At some point I would like to try roaring bitmaps here, though.
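
As a quick, self-contained illustration of that intuition (not from the PR): near-random bytes, like a well-filled bloom filter's bitset, don't shrink under a general-purpose compressor.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"crypto/rand"
	"fmt"
)

func main() {
	raw := make([]byte, 1<<16)
	rand.Read(raw) // stand-in for a dense bloom filter bitset

	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(raw)
	zw.Close()

	// Typically prints a compressed size slightly LARGER than the input:
	// on incompressible data the compressor only adds framing overhead.
	fmt.Printf("raw=%d compressed=%d\n", len(raw), buf.Len())
}
```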

@benclive benclive marked this pull request as ready for review June 26, 2025 10:23
@benclive benclive requested a review from a team as a code owner June 26, 2025 10:23
//
// The stream indexing metadata is used to look up which streams are present in the referenced section, and their IDs within that section.
// The column indexing metadata is used to look up which column values are present in the referenced section.
// Path & Section are mandatory fields and uniquely identify the section within the referenced object.
Contributor

Did you verify that sections are returned in deterministic order?

benclive (Contributor, Author)

Not yet. @rfratto If I pass a section number back to the query engine, will it then be able to read exactly the same section later based on object path + section number?
The index number is extracted from the iteration order of the Sections() call, which looks deterministic to me.

Member

I've written it to be deterministic, though I don't think the API contract promises this. It could be valuable to add that guarantee to the comments in the dataobj package to solidify it.
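
A minimal sketch of the lookup being discussed, which is only safe if Sections() iterates in a stable order across reads; the type and method names here are illustrative stand-ins, not the exact dataobj API.

```go
package example

import "fmt"

// Section is a stand-in for dataobj's section handle.
type Section struct{ /* ... */ }

// Object is a stand-in exposing its sections in a fixed order.
type Object struct{ sections []*Section }

func (o *Object) Sections() []*Section { return o.sections }

// sectionByIndex re-resolves a section from (path, index) purely by
// position in the iteration order of Sections().
func sectionByIndex(obj *Object, idx int) (*Section, error) {
	secs := obj.Sections()
	if idx < 0 || idx >= len(secs) {
		return nil, fmt.Errorf("section %d out of range (have %d)", idx, len(secs))
	}
	return secs[idx], nil
}
```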

@@ -0,0 +1,169 @@
package pointers
Contributor

We have a lot of duplicate decoder code for each section. The implementations of Columns(), Pages(), and ReadPages() are identical except for the imports of ColumnDesc and PageDesc.

benclive (Contributor, Author)

I think Robert addressed this below. There's quite a bit of duplicate code between the different section types and I think we can clean it up soon-ish. I might take a stab at refactoring at some point.
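
For what that refactor might look like, here is a hypothetical sketch using Go generics to fold the per-section decoders into one implementation; the interface and names are assumptions, not the current dataobj code.

```go
package dataset

import "context"

// metadataCodec captures the only parts that differ between the
// per-section decoders: how the ColumnDesc- and PageDesc-like types
// are decoded from raw section metadata.
type metadataCodec[C, P any] interface {
	DecodeColumns(meta []byte) ([]C, error)
	DecodePages(meta []byte, col C) ([]P, error)
}

// genericDecoder holds the logic that is currently copy-pasted into
// the streams, logs, and pointers decoders.
type genericDecoder[C, P any] struct {
	codec metadataCodec[C, P]
	meta  []byte // raw section metadata bytes
}

func (d *genericDecoder[C, P]) Columns(_ context.Context) ([]C, error) {
	return d.codec.DecodeColumns(d.meta)
}

func (d *genericDecoder[C, P]) Pages(_ context.Context, col C) ([]P, error) {
	return d.codec.DecodePages(d.meta, col)
}
```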

Member
@rfratto rfratto left a comment

Overall LGTM for us to start experimenting with. I'm a little concerned about the cognitive cost of having a pointer be used for indexing two distinct things, but I think we can make that easier to reason about by tweaking the columns or even the API a little (see my comment below).

Obviously I know where to look when reviewing a new dataset-based section, so I was able to easily ignore all the boilerplate. If we think we're going to be adding many more sections, it might be time to start thinking about how to deduplicate the logic between them. I don't think that needs to block this PR, though; the duplicated code is less of a maintenance burden and more of an implementation burden.

COLUMN_TYPE_UNCOMPRESSED_SIZE = 8;

// COLUMN_TYPE_COLUMN_NAME is the name of a column this object is storing indexes about
COLUMN_TYPE_COLUMN_NAME = 9;
Member

We might want to include column index here as well; not all columns have names, and column names aren't guaranteed to be unique (which will start coming into play in the future once we store parsed fields).

Comment on lines +22 to +23
// 1. Stream indexing metadata
// 2. Column indexing metadata
Member

Hm, I do find it a little confusing that a pointer is to either a stream or a column. I think there are a few options to make that clearer:

  • Store a column indicating the type of row (STREAM_POINTER vs COLUMN_POINTER), or
  • Expose a method on ObjPointer to compute the type of row (StreamPointer vs ColumnPointer), or
  • Make it clearer in the comment here that the type of index information is either/or per row and never both, or
  • Split out into two sections: one for stream pointers and another for column pointers.
benclive (Contributor, Author)

I opted for "Store a column indicating the type of row (STREAM_POINTER vs COLUMN_POINTER)".
I'm not sure how much clearer it made things: a follow-up PR will actually use this section in a builder, so I might revisit it there if I think of a better way.
The only real reason to have a single section is to reduce duplicate code and boilerplate. If extracting all the duplicated code achieves that anyway, I doubt there is much harm in having different sections for different kinds of index.
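
A minimal sketch of that discriminator approach, building on the hypothetical SectionPointer struct from the earlier sketch; PointerKind, the constants, and the method are illustrative names, not the merged code.

```go
// PointerKind discriminates the two mutually exclusive row types.
type PointerKind int32

const (
	PointerKindInvalid PointerKind = iota
	PointerKindStream              // row carries stream indexing metadata
	PointerKindColumn              // row carries column indexing metadata (blooms)
)

// Kind reports which group of fields a row populates; readers can
// filter on this before touching the rest of the row.
func (p *SectionPointer) Kind() PointerKind {
	if p.Column != "" {
		return PointerKindColumn
	}
	return PointerKindStream
}
```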

Member

Do we need this file? I'm hoping to phase out the iter.go files in the other sections in favour of using a reader.

benclive (Contributor, Author)

It's not really needed; I just copied it from the existing streams implementation, which may also have changed in the meantime. I want to use the column reader at some point, so I'll clean it up then.

)

// RowReader reads the set of pointers from an [Object].
type RowReader struct {
Member

nit: if you don't suspect we'd ever need columnar readers for the pointers section, we can just call this Reader.

I plan to eventually drop RowReader from the streams/logs sections (but that's not to say other sections can't have row-based readers).

benclive (Contributor, Author)

I do plan to use the column reader so I can avoid reading log lines and make building a bit faster. It's currently ~40 seconds to process a full object in my local testing.
That said, there will likely come a time in the near future when we'll want to parse log lines for things like UUIDs, so maybe I'll end up using the row reader then, since I'll be looking at all the columns. TBD!
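
A hedged sketch of why the columnar path helps here: only the columns feeding the index are materialized, and the (large) log line column is never read. The columnReader interface below is a hypothetical stand-in for whatever the dataobj column reader ends up exposing.

```go
package pointers

import "context"

// columnReader is an assumed interface: it returns every value of one
// named column, in row order, without touching any other column.
type columnReader interface {
	ReadColumn(ctx context.Context, name string) ([][]byte, error)
}

// bloomInputs gathers the values of a single column to feed its bloom
// filter; the log line column is never materialized.
func bloomInputs(ctx context.Context, r columnReader, column string) ([][]byte, error) {
	return r.ReadColumn(ctx, column)
}
```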

@benclive benclive requested review from chaudum and rfratto June 26, 2025 17:16
@@ -13,12 +13,21 @@ import (
"github.com/grafana/loki/v3/pkg/dataobj"
)

<<<<<<< Updated upstream
Member

@benclive bad merge conflict resolution here

Member
@rfratto rfratto left a comment

Still LGTM 😛

@benclive benclive enabled auto-merge (squash) June 27, 2025 12:58
@benclive benclive merged commit 9642af3 into main Jun 27, 2025
65 checks passed
@benclive benclive deleted the benclive/add-new-pointers-section-to-dataobj branch June 27, 2025 13:00