KeySegmentsExtractor and prototype higher-dimensional filtering #12075
Conversation
Force-pushed b45b60b to f8e0f37
Force-pushed 018e2f6 to 948fdd6
Summary: This adds an EXPERIMENTAL new API for splitting keys into segments and filtering on those segments, especially a filter that records the value ranges of segments and uses them to filter range queries. I sometimes call this "higher-dimensional filtering" because it goes beyond treating the space of keys as a single-dimensional range to be approximated. Instead, we can treat key segments (beyond the first) as data points in another dimension and filter out irrelevant SST files on that dimension alone.

A common pattern is for a key segment to correspond to some monotonic identifier or timestamp. A range query might look for a range of those identifiers within a particular prefix leading up to that identifier. If the range of identifiers in an SST file doesn't overlap that range (regardless of the leading prefix), we can filter the file from being read by a bounded iterator. For example, if querying for a range of newer entries, older SST files will be filtered out.

The current implementation is a prototype in the experimental.h header and rocksdb::experimental namespace. It currently uses table properties and table_filter to store and apply these new filters. That is certainly temporary, but I have put some significant thought and engineering into the rest of the new APIs, which I expect to be close to refined enough for production.

For details, see the new public APIs in experimental.h. For a detailed example, see the new unit test in db_bloom_filter_test.

Test Plan: Unit test included
Force-pushed 948fdd6 to 81caa55
@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pdillinger has updated the pull request. You must reimport the pull request before landing.
@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pdillinger Thank you for this work, it looks awesome! I only have some minor comments.
// (That performance optimization is the only reason this function is here
// rather than in SstQueryFilterConfigsManager.)
virtual std::function<bool(const TableProperties&)>
GetTableFilterForRangeQuery(Slice lower_bound_incl,
nit: how about using const reference for the bounds input arguments?
The nature of this function is that we need to save a copy of the Slices. A plain Slice parameter allows, but doesn't require, move semantics.
It probably doesn't matter here for efficiency (nothing owned by the object to deep copy), but I like the certainty of knowing that the called function isn't accidentally saving the const X& when the function clearly needs to save a copy.
for (size_t i = 0; i < filter_count; ++i) {
  uint64_t filter_len;
  if (i + 1 == filter_count) {
Maybe add error handling here for when p goes beyond limit?
Assuming GetVarint64Ptr meets its contract, it's fully checked by static_cast<size_t>(limit - p) < filter_len (followed by p += filter_len) on the previous iteration. If it's violated in the future, ASAN should catch it.
db/experimental.cc
Outdated
  return 0;
} else {
  // For now in the code, wraps only 1 filter, but schema supports multiple
  return 1 + VarintLength(CategorySetToUint(categories_)) + 1 +
Should this be VarintLength(1) instead of 1?
Was relying on my knowledge that VarintLength(1) == 1. I'll improve the clarity.
db/experimental.cc
Outdated
// outer layer will have just 1 filter in its count (added here)
// and this filter wrapper will have filters_to_finish.size()
// (added above).
total_size += 1;
This 1 is for the space used to record the number of filters, right? Should this be VarintLength(1) instead of 1?
// segment or composite (range of segments). The empty value is tracked
// and filtered independently because it might be a special case that is
// not representative of the minimum in a spread of values.
kBytewiseMinMaxFilter = 0x10,
Just curious why kBytewiseMinMaxFilter is 0x10, not 0x3?
Adding this before:
// ... (reserve some values for more wrappers)
private:
  static const std::string kTablePropertyName;
  static constexpr char kSchemaVersion = 1;
The schema encoding is important for understanding the filter building and reading logic. Maybe we can document how schema version 1 encodes filters?
bool MayMatch_ExtrAndCatFilterWrapper(Slice wrapper) const {
  assert(!wrapper.empty() && wrapper[0] == kExtrAndCatFilterWrapper);
  if (wrapper.size() <= 4) {
IIUC, kExtrAndCatFilterWrapper is followed by extractor_id length and extractor_id; how is this fast-failing size 4 determined?
  return true;
}
Slice smallest = Slice(p, smallest_size);
p += smallest_size;
Maybe add error handling here for p going beyond limit after incrementing it by smallest_size.
Already checked static_cast<size_t>(limit - p) <= smallest_size
db/db_bloom_filter_test.cc
Outdated
EXPECT_EQ(RangeQueryKeys("abc_170", "abc_190"), Keys({}));
EXPECT_EQ(TestGetAndResetTickerCount(options, NON_LAST_LEVEL_SEEK_DATA), 2);

// Control 3: range is not filtered because prefixes not represented
What does this statement mean, that the prefixes in "baa_170" to "baa_190" are not represented?
// May match if both the upper bound and lower bound indicate there could
// be overlap
return upper_bound_input.compare(smallest) >= 0 &&
Since the upper bound is exclusive, if it's the same as smallest, can we also filter out the table?
I've expanded the TODO to record why that change isn't safe:
// TODO: potentially fix upper bound to actually be exclusive, but it's not
// as simple as changing >= to > below, because it's upper_bound_excl that's
// exclusive, and the upper_bound_input part extracted from it might not be.
@pdillinger has updated the pull request. You must reimport the pull request before landing.
@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pdillinger merged this pull request in 1201813.
Summary: This change contains a prototype new API for "higher dimensional" filtering of read queries. Existing filters treat keys as one-dimensional, either as distinct points (whole key) or as contiguous ranges in comparator order (prefix filters). The proposed KeySegmentsExtractor allows treating keys as multi-dimensional for filtering purposes even though they still have a single total order across dimensions. For example, consider these keys in different LSM levels:
L0:
abc_0123
abc_0150
def_0114
ghi_0134
L1:
abc_0045
bcd_0091
def_0077
xyz_0080
If we get a range query for [def_0100, def_0200), a prefix filter (up to the underscore) will tell us that both levels are potentially relevant. However, if each SST file stores a simple range of the values for the second segment of the key, we would see that L1 only has [0045, 0091] which (under certain required assumptions) we are sure does not overlap with the given range query. Thus, we can filter out processing or reading any index or data blocks from L1 for the query.
This kind of case shows up with time-ordered data but is more general than filtering based on user timestamp. See #11332 . Here the "time" segments of the keys are meaningfully ordered with respect to each other even when the previous segment is different, so summarizing data along an alternate dimension of the key like this can work well for filtering.
This prototype implementation simply leverages existing APIs for user table properties and table filtering, which is not very CPU efficient. Eventually, we expect to create a native implementation. However, I have put some significant thought and engineering into the new APIs overall, which I expect to be close to refined enough for production.
For details, see new public APIs in experimental.h. For a detailed example, see the new unit test in db_bloom_filter_test.
Test Plan: Unit test included