Add ES93BloomFilterStoredFieldsFormat for efficient field existence checks#137331
Add ES93BloomFilterStoredFieldsFormat for efficient field existence checks#137331fcofdez merged 13 commits intoelastic:mainfrom
Conversation
…hecks Introduces a new stored fields format that builds a Bloom filter for a specific field to enable fast existence checks without storing the field itself. This delegates storage of all other fields to another StoredFieldsFormat while maintaining the ability to quickly determine if a document might contain the target field.
tlrx
left a comment
There was a problem hiding this comment.
Looks great! I left many comments but most of them are nits.
I think this is ready for review, we'll improve the format step by step.
server/src/main/java/org/elasticsearch/index/codec/bloomfilter/BloomFilterHashFunctions.java
Show resolved
Hide resolved
...c/main/java/org/elasticsearch/index/codec/bloomfilter/ES93BloomFilterStoredFieldsFormat.java
Outdated
Show resolved
Hide resolved
...c/main/java/org/elasticsearch/index/codec/bloomfilter/ES93BloomFilterStoredFieldsFormat.java
Outdated
Show resolved
Hide resolved
...c/main/java/org/elasticsearch/index/codec/bloomfilter/ES93BloomFilterStoredFieldsFormat.java
Outdated
Show resolved
Hide resolved
...c/main/java/org/elasticsearch/index/codec/bloomfilter/ES93BloomFilterStoredFieldsFormat.java
Outdated
Show resolved
Hide resolved
...c/main/java/org/elasticsearch/index/codec/bloomfilter/ES93BloomFilterStoredFieldsFormat.java
Outdated
Show resolved
Hide resolved
| "", | ||
| TestUtil.getDefaultCodec().storedFieldsFormat(), | ||
| ByteSizeValue.ofKb(2), | ||
| IdFieldMapper.NAME |
There was a problem hiding this comment.
I wonder if we could have a better coverage if we were randomly using some field names that are generated in BaseStoredFieldsFormatTestCase?
There was a problem hiding this comment.
I'm afraid that we'll broke some of the integration tests that rely on these fields actually being stored.
|
|
||
| private void finishBloomFilterStoredFormat() throws IOException { | ||
| BloomFilterMetadata bloomFilterMetadata = null; | ||
| if (bloomFilterFieldInfo != null) { |
There was a problem hiding this comment.
If this is null, then we haven't seen an _id stored field. Does it still make sense to write an empty file then?
There was a problem hiding this comment.
Yes, it's for merges. Once we get merges working, bloomFilterFieldInfo should be always not null
| */ | ||
| public class ES93BloomFilterStoredFieldsFormat extends StoredFieldsFormat { | ||
| public static final String STORED_FIELDS_BLOOM_FILTER_FORMAT_NAME = "ES93BloomFilterStoredFieldsFormat"; | ||
| public static final String STORED_FIELDS_BLOOM_FILTER_EXTENSION = "sfbf"; |
There was a problem hiding this comment.
sfbf will be Special Francisco's Bloom Filter in my mind forever once this is merged.
| var success = false; | ||
| try { | ||
| bloomFilterFieldReader = BloomFilterFieldReader.open(directory, si, fn, context, segmentSuffix); | ||
| success = true; |
There was a problem hiding this comment.
Do we need to return the delegateReader even if bloomFilterFieldReader is null?
There was a problem hiding this comment.
Not necessarily, but I would expect that the bloom filter field reader will be always != null
Skip skipping the bloom filter field for all data types Compute the bloom filter size in a simpler way
|
Pinging @elastic/es-storage-engine (Team:StorageEngine) |
|
Hi @fcofdez, I've created a changelog YAML for you. |
| + " or less (rounded to nearest power of two)" | ||
| ); | ||
| } | ||
| this.bloomFilterSizeInBits = (int) Math.multiplyExact(closestPowerOfTwoBloomFilterSizeInBytes, Byte.SIZE); |
There was a problem hiding this comment.
nit:
| this.bloomFilterSizeInBits = (int) Math.multiplyExact(closestPowerOfTwoBloomFilterSizeInBytes, Byte.SIZE); | |
| this.bloomFilterSizeInBits = Math.toIntExact(Math.multiplyExact(closestPowerOfTwoBloomFilterSizeInBytes, Byte.SIZE)); |
| if (info.getName().equals(bloomFilterFieldName)) { | ||
| addToBloomFilter(info, value); | ||
| } else { | ||
| delegateWriter.writeField(info, value); |
There was a problem hiding this comment.
I think it depends on how this bloom stored field format will be registered. If it's registered only for the _id field through (an adjusted version of) PerFieldFormatSupplier, what I think we'll do, then I prefer to have all field behave the same as we know other types of fields won't be called.
| } | ||
|
|
||
| private void addToBloomFilter(FieldInfo info, BytesRef value) { | ||
| bloomFilterFieldInfo = info; |
There was a problem hiding this comment.
nit: I think there was an assertion on info.name being equal to the bloomFilterFieldName which I liked
|
|
||
| private void finishBloomFilterStoredFormat() throws IOException { | ||
| BloomFilterMetadata bloomFilterMetadata = null; | ||
| if (bloomFilterFieldInfo != null) { |
…z/elasticsearch into bloom-filter-stored-fields-format
kkrik-es
left a comment
There was a problem hiding this comment.
@martijnvg knows this part better for sure.
server/src/main/java/org/elasticsearch/index/codec/bloomfilter/BloomFilterHashFunctions.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/codec/bloomfilter/BloomFilterHashFunctions.java
Show resolved
Hide resolved
| + " or less (rounded to nearest power of two)" | ||
| ); | ||
| } | ||
| this.bloomFilterSizeInBits = (int) Math.multiplyExact(closestPowerOfTwoBloomFilterSizeInBytes, Byte.SIZE); |
| public static final String STORED_FIELDS_BLOOM_FILTER_FORMAT_NAME = "ES93BloomFilterStoredFieldsFormat"; | ||
| public static final String STORED_FIELDS_BLOOM_FILTER_EXTENSION = "sfbf"; | ||
| public static final String STORED_FIELDS_METADATA_BLOOM_FILTER_EXTENSION = "sfbfm"; | ||
| private static final int VERSION_START = 1; |
There was a problem hiding this comment.
nit: I think typically codec versions start at zero?
|
|
||
| @Override | ||
| public int merge(MergeState mergeState) throws IOException { | ||
| // Skip merging the bloom filter for now |
There was a problem hiding this comment.
Out of curiosity what is the plan to support merging? Iirc ES87BloomFilterPostingsFormat supports this by rebuilding the bloom filter from the merged terms.
There was a problem hiding this comment.
Ideally we would just OR the bitsets, but that depends a bit on the sizing strategy. Otherwise we could just rebuild the bloom filter as you mentioned.
| this.bloomFilterData = bloomFilterData; | ||
| } | ||
|
|
||
| public boolean mayContainTerm(String field, BytesRef term) throws IOException { |
There was a problem hiding this comment.
Nit: instead of makeing the BloomFilterFieldReader public, maybe keep it package protected / private like other classes in here and add an public interface for this specific method?
|
Thanks all for the reviews! |
Introduces a new stored fields format that builds a Bloom filter for a specific field to enable fast existence checks without storing the field itself. This delegates storage of all other fields to another StoredFieldsFormat while maintaining the ability to quickly determine if a document might contain the target field.
Missing parts