pq: activate stringref extension for more-compact PQ representation#17849
pq: activate stringref extension for more-compact PQ representation#17849yaauie merged 3 commits intoelastic:mainfrom
Conversation
When serializing a non-primitive value, CBOR encodes a two-element tuple containing the class name and the class-specific serialized value, which results in a significant amount of overhead in the form of frequently- repeated strings. Jackson CBOR supports the stringref extension, which allows it to avoid repeating the actual bytes of a string, and instead keeps track of the strings it has encountered and _referencing_ those strings by the index in which they occur. For example, the first `org.jruby.RubyString` looks like: ~~~ 74 # text(20) 6f72672e6a727562792e52756279537472696e67 # "org.jruby.RubyString" ~~~ While each subsequent string looks like: ~~~ d8 19 # tag(25) 05 # unsigned(5) ~~~ Enabling this extension allows us to save: - ~18 bytes from each `org.jruby.RubyString` - ~23 bytes from each `org.logstash.ConvertedMap` - ~24 bytes from each `org.logstash.ConvertedList` - ...etc. The CBOR implementation in Jackson _appears_ to support reading stringrefs regardless of whether this feature is enabled for serializing, which means that this change is not a rollback-barrier.
🤖 GitHub commentsExpand to view the GitHub comments
Just comment with:
|
|
This pull request does not have a backport label. Could you fix it @yaauie? 🙏
|
…nsion With the CBOR stringref extension enabled, we add a 3-byte overhead to each event to activate the extension, and eliminate 24 bytes of overhead for each event's secondary instances of `org.logstash.ConvertedMap`. Since the events under test have exactly two instances of `org.logstash.ConvertedMap`, this is a net reduction of 21 bytes of overhead. This changes the specifically-constructed events to have the intended lengths to test their specific edge-cases.
7647be4 to
44692e9
Compare
|
I have manually validated that a stringref-enabled event serialized with this patch can be deserialized in unpatched logstash. I plan to add tests with fixture data for the releasable branches that we are not backporting to, to ensure that this change is not a rollback barrier. [EDIT] Backport test-only PR's to ensure activating the extension doesn't create a rollback boundary:
[/EDIT] |
|
💛 Build succeeded, but was flaky
Failed CI StepsHistory
|
mashhurs
left a comment
There was a problem hiding this comment.
Really good find! Thank you.
It makes sense to me to set string ref to true by default rather than feature flag.
So, LGTM!
kaisecheng
left a comment
There was a problem hiding this comment.
This is amazing enhancement for PQ. LGTM
I tested both the upgrade and downgrade paths, and they work smoothly without requiring any user intervention.
I used the tool lsq-pagedump to print the page. Instead of getting CBOR(stringref) , I got CBOR(assumed D9019F716A617661)
$ lsq-pagedump "${LOGSTASH_PATH}/data/queue/main/page.0"
1 277 094395CD page.0 CBOR(assumed D9019F716A617661)
...
100 277 F7EE8590 page.0 CBOR(assumed D9019F716A617661)
|
[EDIT] |
|
@kaisecheng after fixing my dump tool, your page file looks like:
|
The tool works as expected now 👍 |





Release notes
What does this PR do?
When serializing a non-primitive value, CBOR encodes a two-element tuple containing the class name and the class-specific serialized value, which results in a significant amount of overhead in the form of frequently-repeated strings.
Jackson CBOR supports the stringref extension, which allows it to avoid repeating the actual bytes of a string, and instead keeps track of the strings it has encountered and referencing those strings by the index in which they occur.
For example, the first
org.jruby.RubyStringlooks like:While each subsequent string looks like:
Enabling this extension allows us to save:
org.jruby.RubyStringorg.logstash.ConvertedMaporg.logstash.ConvertedListPractical example: a 9183-byte complex JSON that contains an
event.original, consumed through the stdin input with json codec (adding fields like@timestamp,@versionper normal) resulted in a 22% reduction in serialized size:The CBOR implementation in Jackson appears to support reading stringrefs regardless of whether this feature is enabled for serializing, which means that this change is not a rollback-barrier.
Why is it important/What is the impact to the user?
Reduces size-on-disk for PQ, enabling more events to fit on each page and reducing disk IO
Checklist
[ ] I have commented my code, particularly in hard-to-understand areas[ ] I have made corresponding changes to the documentation[ ] I have made corresponding change to the default configuration files (and/or docker env variables)How to test this PR locally
CBOR(known)(indicating that they start with the9F716Asequence that events have historically started with)CBOR(stringref)(indicating that they start with the stringref tagD90100)Related issues
Use cases