
KeySegmentsExtractor and prototype higher-dimensional filtering #12075

Closed
wants to merge 5 commits

Conversation

pdillinger (Contributor)

@pdillinger commented Nov 17, 2023

Summary: This change contains a prototype new API for "higher dimensional" filtering of read queries. Existing filters treat keys as one-dimensional, either as distinct points (whole key) or as contiguous ranges in comparator order (prefix filters). The proposed KeySegmentsExtractor allows treating keys as multi-dimensional for filtering purposes even though they still have a single total order across dimensions. For example, consider these keys in different LSM levels:

L0:
abc_0123
abc_0150
def_0114
ghi_0134

L1:
abc_0045
bcd_0091
def_0077
xyz_0080

If we get a range query for [def_0100, def_0200), a prefix filter (up to the underscore) will tell us that both levels are potentially relevant. However, if each SST file stores a simple range of the values for the second segment of the key, we would see that L1 only has [0045, 0091] which (under certain required assumptions) we are sure does not overlap with the given range query. Thus, we can filter out processing or reading any index or data blocks from L1 for the query.
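The per-level decision described above can be sketched with a small standalone example. This is a hypothetical illustration of the idea, not the RocksDB API; `SegmentMinMax`, `SecondSegment`, and `MayMatch` are invented names, and the comparison is plain bytewise ordering:

```cpp
#include <string>

// Hypothetical per-file summary: min/max of the second key segment
// (the part after the underscore), collected at flush/compaction time.
struct SegmentMinMax {
  std::string smallest;  // e.g. "0045" for the L1 file above
  std::string largest;   // e.g. "0091"
};

// Extract the segment after the first '_' (illustration only).
inline std::string SecondSegment(const std::string& key) {
  auto pos = key.find('_');
  return pos == std::string::npos ? std::string() : key.substr(pos + 1);
}

// The file may contain keys relevant to [lower_incl, upper_incl] on the
// second-segment dimension only if the two ranges overlap bytewise.
inline bool MayMatch(const SegmentMinMax& file, const std::string& lower_incl,
                     const std::string& upper_incl) {
  return upper_incl >= file.smallest && lower_incl <= file.largest;
}
```

For the query [def_0100, def_0200), the L1 summary [0045, 0091] fails the overlap test, so L1 can be skipped entirely; the L0 summary [0114, 0150] passes, so L0 must still be read.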

This kind of case shows up with time-ordered data but is more general than filtering based on user timestamp. See #11332. Here the "time" segments of the keys are meaningfully ordered with respect to each other even when the preceding segment differs, so summarizing data along an alternate dimension of the key like this can work well for filtering.

This prototype implementation simply leverages existing APIs for user table properties and table filtering, which is not very CPU efficient. Eventually, we expect to create a native implementation. However, I have put some significant
thought and engineering into the new APIs overall, which I expect to be close to refined enough for production.

For details, see new public APIs in experimental.h. For a detailed example, see the new unit test in db_bloom_filter_test.

Test Plan: Unit test included

@pdillinger force-pushed the key_segments_extractor branch 4 times, most recently from b45b60b to f8e0f37, November 23, 2023 00:58
@pdillinger changed the title from "[WIP] KeySegmentsExtractor and prototype higher-dimensional filtering" to "[RFC] KeySegmentsExtractor and prototype higher-dimensional filtering" Nov 29, 2023
@pdillinger changed the title to "Prototype KeySegmentsExtractor and higher-dimensional filtering" Dec 1, 2023
@pdillinger changed the title to "Experimental KeySegmentsExtractor and higher-dimensional filtering" Dec 1, 2023
Summary: This adds an EXPERIMENTAL new API for splitting keys into
segments and filtering on those segments, especially a filter that
records the value ranges of segments and uses that to filter range
queries. I sometimes call this "higher-dimensional filtering"
because it goes beyond treating the space of keys as a
single-dimensional range to be approximated. Instead, we can treat key
segments (beyond the first) as data points in another dimension and
filter out irrelevant SST files on that dimension alone.

A common pattern is for a key segment to correspond to some monotonic
identifier or timestamp. A range query might look for a range of those
identifiers within a particular prefix leading up to that identifier. If
the range of identifiers in an SST file doesn't overlap that range
(regardless of the leading prefix) we can filter it from being read by a
bounded iterator. For example, if querying for a range of newer entries,
older SST files will be filtered out.

The current implementation is a prototype in the experimental.h header
and rocksdb::experimental namespace. It currently uses table properties
and table_filter to store and apply these new filters. That is certainly
temporary, but I have put some significant thought and engineering into
the rest of the new APIs, which I expect to be close to refined enough
for production.

For details, see new public APIs in experimental.h. For a detailed
example, see the new unit test in db_bloom_filter_test.

Test Plan: Unit test included
@pdillinger changed the title to "KeySegmentsExtractor and prototype higher-dimensional filtering" Feb 9, 2024
@facebook-github-bot

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot

@pdillinger has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jowlyzhang self-requested a review February 9, 2024 21:09
@jowlyzhang (Contributor) left a comment:

@pdillinger Thank you for this work, it looks awesome! I only have some minor comments.

// (That performance optimization is the only reason this function is here
// rather than in SstQueryFilterConfigsManager.)
virtual std::function<bool(const TableProperties&)>
GetTableFilterForRangeQuery(Slice lower_bound_incl,
Contributor:

nit: how about using const reference for the bounds input arguments?

pdillinger (Author):

The nature of this function is that we need to save a copy of the Slices. Plain Slice parameter allows, but doesn't require, move semantics.

pdillinger (Author):

It probably doesn't matter here for efficiency (nothing owned by the object to deep copy), but I like the certainty of knowing that the called function isn't accidentally saving the const X& when the function clearly needs to save a copy.
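The sink-parameter pattern discussed in this thread can be sketched generically. Here `std::string` stands in for a type where the copy-vs-move difference is visible (a `Slice` itself is just a cheap pointer-plus-length view), and `BoundHolder` is an invented name for illustration:

```cpp
#include <string>
#include <utility>

// A parameter taken by value makes it explicit that the callee keeps its
// own copy. Callers may pass an lvalue (copied) or an rvalue (moved),
// and the callee cannot accidentally retain a reference to
// caller-owned storage, which a const& parameter would permit.
class BoundHolder {
 public:
  explicit BoundHolder(std::string lower_bound_incl)
      : lower_bound_incl_(std::move(lower_bound_incl)) {}

  const std::string& lower() const { return lower_bound_incl_; }

 private:
  std::string lower_bound_incl_;  // the saved copy
};
```

Passing an lvalue leaves the caller's object untouched, while passing a temporary avoids a second allocation via the move in the initializer list.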


for (size_t i = 0; i < filter_count; ++i) {
uint64_t filter_len;
if (i + 1 == filter_count) {
Contributor:

Maybe add error handling for when p goes beyond the limit here?

pdillinger (Author):

Assuming GetVarint64Ptr meets its contract, it's fully checked by static_cast<size_t>(limit - p) < filter_len (followed by p += filter_len) on the previous iteration. If it's violated in the future, ASAN should catch it.
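As a hedged illustration of the bounds discipline described in this reply (a stand-in sketch, not RocksDB's actual `GetVarint64Ptr`, whose implementation differs), a varint decoder can guarantee it never reads at or past `limit`, so the caller's per-iteration length check is the only remaining obligation:

```cpp
#include <cstdint>

// Minimal base-128 varint decode with explicit bounds checking: every
// byte read is guarded by `p < limit`, so a length validated against the
// remaining bytes on iteration i ensures iteration i+1 starts in bounds.
// Returns the position after the varint, or nullptr on truncation.
inline const char* GetVarint64Checked(const char* p, const char* limit,
                                      uint64_t* value) {
  uint64_t result = 0;
  for (int shift = 0; shift <= 63 && p < limit; shift += 7) {
    uint64_t byte = static_cast<unsigned char>(*p);
    ++p;
    if (byte & 128) {
      // Continuation bit set: accumulate 7 payload bits and keep going.
      result |= (byte & 127) << shift;
    } else {
      result |= byte << shift;
      *value = result;
      return p;
    }
  }
  return nullptr;  // ran off `limit` or varint too long
}
```

A truncated buffer yields nullptr instead of an out-of-bounds read, which is the property the surrounding loop relies on.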

return 0;
} else {
// For now in the code, wraps only 1 filter, but schema supports multiple
return 1 + VarintLength(CategorySetToUint(categories_)) + 1 +
Contributor:

Should this be VarintLength(1) instead of 1?

pdillinger (Author):

Was relying on my knowledge that VarintLength(1) == 1. I'll improve the clarity.
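For reference, the claim that VarintLength(1) == 1 follows directly from the 7-payload-bits-per-byte varint encoding. A minimal stand-in for the length computation (not RocksDB's exact implementation):

```cpp
#include <cstdint>

// Number of bytes needed to encode v as a base-128 varint: one byte per
// 7 bits of payload, so the length grows by one at each power of 128.
// In particular, any value in [0, 127] encodes in exactly one byte.
inline int VarintLength(uint64_t v) {
  int len = 1;
  while (v >= 128) {
    v >>= 7;
    ++len;
  }
  return len;
}
```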

// outer layer will have just 1 filter in its count (added here)
// and this filter wrapper will have filters_to_finish.size()
// (added above).
total_size += 1;
Contributor:

This 1 is for the space used to record number of filters, right? Should this be VarintLength(1) instead of 1?

// segment or composite (range of segments). The empty value is tracked
// and filtered independently because it might be a special case that is
// not representative of the minimum in a spread of values.
kBytewiseMinMaxFilter = 0x10,
Contributor:

Just curious why kBytewiseMinMaxFilter is 0x10, not 0x3?

pdillinger (Author):

Adding this before:

  // ... (reserve some values for more wrappers)


private:
static const std::string kTablePropertyName;
static constexpr char kSchemaVersion = 1;
Contributor:

The schema encoding is important for understanding the filter building and reading logic. Maybe we can document how schema version 1 encodes filters?


bool MayMatch_ExtrAndCatFilterWrapper(Slice wrapper) const {
assert(!wrapper.empty() && wrapper[0] == kExtrAndCatFilterWrapper);
if (wrapper.size() <= 4) {
Contributor:

IIUC, kExtrAndCatFilterWrapper is followed by the extractor_id length and the extractor_id; how is this fast-fail size of 4 determined?

return true;
}
Slice smallest = Slice(p, smallest_size);
p += smallest_size;
Contributor:

Maybe add error handling here for p going beyond limit after incrementing it by smallest_size.

pdillinger (Author):

Already checked static_cast<size_t>(limit - p) <= smallest_size

EXPECT_EQ(RangeQueryKeys("abc_170", "abc_190"), Keys({}));
EXPECT_EQ(TestGetAndResetTickerCount(options, NON_LAST_LEVEL_SEEK_DATA), 2);

// Control 3: range is not filtered because prefixes not represented
Contributor:

What does this statement mean, that the prefixes for "baa_170" to "baa_190" are not represented?


// May match if both the upper bound and lower bound indicate there could
// be overlap
return upper_bound_input.compare(smallest) >= 0 &&
Contributor:

Since the upper bound is exclusive, if it's the same as smallest, can we also filter out the table?

pdillinger (Author):

I've expanded the TODO to record why that change isn't safe:

    // TODO: potentially fix upper bound to actually be exclusive, but it's not
    // as simple as changing >= to > below, because it's upper_bound_excl that's
    // exclusive, and the upper_bound_input part extracted from it might not be.
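The subtlety in that TODO can be illustrated with a hypothetical extractor (here, one that takes the key's first 8 bytes as the comparable bound input; `Extract` and `DemoEqualSegmentStillInRange` are invented names): a key strictly below an exclusive full-key upper bound can still extract to a value equal to the extracted bound, so equality cannot be treated as non-overlapping:

```cpp
#include <string>

// Hypothetical extractor: the comparable "input" is the key's first 8 bytes.
inline std::string Extract(const std::string& key) { return key.substr(0, 8); }

inline bool DemoEqualSegmentStillInRange() {
  // The full-key upper bound is exclusive...
  const std::string upper_bound_excl = "def_0200suffix";
  // ...and this key is strictly below it, so it is inside the range...
  const std::string key = "def_0200a";
  // ...yet both extract to "def_0200". A file whose smallest extracted
  // value equals Extract(upper_bound_excl) may therefore still contain
  // in-range keys, which is why `>=` cannot simply become `>`.
  return key < upper_bound_excl &&
         Extract(key) == Extract(upper_bound_excl);
}
```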

@facebook-github-bot

@pdillinger has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot

@pdillinger merged this pull request in 1201813.

3 participants