
Fix potential incorrect result for duplicate key in MultiGet #12295

Closed
wants to merge 3 commits

Conversation

@anand1976 (Contributor)

RocksDB correctness testing recently discovered a possible, but very unlikely, correctness issue with MultiGet. The issue happens when all of the conditions below are met:

  1. Duplicate keys in a MultiGet batch
  2. The key matches the last key in a non-zero, non-bottommost level file
  3. The final value is not in the file (merge operand, not snapshot visible, etc.)
  4. Multiple entries exist for the key in the file, spanning more than one data block. This can happen due to snapshots, which force multiple versions of the key into the file, and they may spill over into another data block
  5. The lookup attempt in the SST for the first of the duplicates fails with an IO error on a data block (NOT the first data block, but the second or a subsequent uncached block), but there are no errors for the other duplicates
  6. The value or merge operand for the key is present in the very next level

The problem is that in FilePickerMultiGet, when looking up keys in a level, we use FileIndexer and the overlapping file in the current level to determine the search bounds for that key in the file list of the next level. If the next level is empty, the search bounds are reset and we do a full binary search in the next non-empty level's LevelFilesBrief. However, under conditions #1 and #2 listed above, only the first of the duplicates has its next-level search bounds updated; the remaining duplicates are skipped.
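For concreteness, here is a minimal caller-side sketch of the duplicate-key batch in condition #1. It is not part of this PR; `db`, `cf`, and the key names are hypothetical, and the exact MultiGet overload available may depend on the RocksDB version.

```cpp
#include <vector>

#include "rocksdb/db.h"

// Sketch only: the same user key appears more than once in a single
// MultiGet batch. `db` and `cf` are assumed to be an open DB and a column
// family handle set up elsewhere.
void LookupWithDuplicates(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf) {
  std::vector<rocksdb::Slice> keys = {"k_dup", "k_dup", "k_other"};
  std::vector<rocksdb::PinnableSlice> values(keys.size());
  std::vector<rocksdb::Status> statuses(keys.size());

  db->MultiGet(rocksdb::ReadOptions(), cf, keys.size(), keys.data(),
               values.data(), statuses.data());

  // Each occurrence of "k_dup" gets its own value/status slot. Under the
  // conditions above, the second occurrence could previously come back with
  // a wrong result because its next-level search bounds were never updated.
}
```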

Test plan:
Add unit tests that fail an assertion or return a wrong result without the fix.

@facebook-github-bot (Contributor)

@anand1976 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 (Contributor) commented Jan 30, 2024

A high-level question @anand1976: the API comment

// Note: keys will not be "de-duplicated". Duplicate keys will return
// duplicate values in order.
makes it feel like we should return the same values (and hence the same status) for duplicate keys.

So if that expectation is agreed upon, should we stop and return when the first duplicate encounters an SST read error? Right now, the code seems like a workaround that instead ensures the remaining duplicates have the correct key.

@@ -688,6 +689,12 @@ class FilePickerMultiGet {
user_comparator_->CompareWithoutTimestamp(
@hx235 (Contributor) commented Jan 31, 2024

Overall I found the purpose of "if (cmp_largest == 0)" hard to understand and the comment does not help much.

My main question is:
If we agree with "it keeps moving forward until the last key in the batch that falls in that file" as mentioned in https://github.com/facebook/rocksdb/pull/12295/files#diff-6270b3486fea620597e24151dd0d75c2c14b3c4c30d62e0567811723628733deR623-R625, I don't see why we can't merge the cmp_largest == 0 case with the cmp_largest < 0 case.

In particular, the comment about this case, "which means the next key will not be in this file, so stop looking further", seems to assume there aren't any duplicate keys. If we don't plan to change/clarify MultiGet() on that aspect, I don't see why we still keep this special path. Also, should we still allow MultiGet() to take duplicate keys for one CF? Duplicate keys make the logic harder to reason about.

For the above question, I was once wondering if it's because we need to, as the comment mentions, "..leave batch_iter as is since we may have to pick up from there for the next file, if this file has a merge value rather than". But again, I don't understand why we didn't do the same ("leave batch_iter_ as is") for the case where the duplicate keys lie in the middle of the file instead of at the end.
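For readers following along, here is a heavily simplified, standalone model of the three-way comparison being discussed. It is not the actual FilePickerMultiGet code; it only illustrates why cmp_largest == 0 sits between the "keep scanning this file" and "stop looking" outcomes.

```cpp
#include <string>

// Illustrative model only -- NOT the actual RocksDB code. In FilePickerMultiGet
// the comparison uses user_comparator_ against the file's largest key; plain
// std::string comparison stands in for it here.
enum class NextStep {
  kKeepScanningThisFile,    // cmp_largest < 0
  kHandleDuplicatesInFile,  // cmp_largest == 0, the special case in question
  kStopLookingInThisFile    // cmp_largest > 0
};

NextStep ClassifyAgainstFileLargest(const std::string& batch_key,
                                    const std::string& file_largest_key) {
  const int cmp_largest = batch_key.compare(file_largest_key);
  if (cmp_largest < 0) {
    // Key sorts before the file's largest key: later keys in the batch may
    // also fall in this file, so keep advancing within the same file.
    return NextStep::kKeepScanningThisFile;
  } else if (cmp_largest == 0) {
    // Key equals the file's largest key: duplicates of this key (and further
    // versions of it) may still have to be served by this file.
    return NextStep::kHandleDuplicatesInFile;
  }
  // Key sorts after the file's largest key: this file cannot contain it.
  return NextStep::kStopLookingInThisFile;
}
```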

@hx235 (Contributor) commented Jan 31, 2024

The other condition needed for the bug is that the first duplicate key is (a) marked as done while the SST lookup returns an error, which then (b) leads to the rest of the duplicate keys being skipped: https://github.com/facebook/rocksdb/blob/main/db/version_set.cc#L445-L448

For (a), why didn't we surface this error all the way to the user?

For (b), skipping the rest of the duplicate keys doesn't sound aligned with the API description here

// Note: keys will not be "de-duplicated". Duplicate keys will return
// duplicate values in order.

Should we do something about this condition too?

@anand1976 (Contributor, Author)

> A high-level question @anand1976: the API comment
>
> // Note: keys will not be "de-duplicated". Duplicate keys will return
> // duplicate values in order.
>
> makes it feel like we should return the same values (and hence the same status) for duplicate keys.
> So if that expectation is agreed upon, should we stop and return when the first duplicate encounters an SST read error? Right now, the code seems like a workaround that instead ensures the remaining duplicates have the correct key.

@hx235 There's no contract to treat duplicate keys in a special manner. The way I read it, we won't de-duplicate the keys, i.e. we will not attempt to return an identical result for each of the duplicates. I feel it's more trouble than it's worth to try to ensure we return an identical status and value. I'd rather just make it explicit in the comment that a different status may be returned for each duplicate, depending on what happens while processing it.
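In caller terms, a hedged sketch of what that relaxed contract implies (the helper name is hypothetical; nothing here is part of the PR):

```cpp
#include <vector>

#include "rocksdb/status.h"

// Sketch only: with duplicate keys in one MultiGet batch, the statuses of the
// occurrences are not guaranteed to match (e.g. one slot may hold an IOError
// while another is OK), so each slot should be inspected independently.
bool AllSlotsUsable(const std::vector<rocksdb::Status>& statuses) {
  for (const rocksdb::Status& s : statuses) {
    if (!s.ok() && !s.IsNotFound()) {
      // An error on one occurrence says nothing about other occurrences of
      // the same key; do not propagate its result to them.
      return false;
    }
  }
  return true;
}
```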

@anand1976 (Contributor, Author)

> The other condition needed for the bug is that the first duplicate key is (a) marked as done while the SST lookup returns an error, which then (b) leads to the rest of the duplicate keys being skipped: https://github.com/facebook/rocksdb/blob/main/db/version_set.cc#L445-L448
>
> For (a), why didn't we surface this error all the way to the user?

We do. It's marked as done in the batch, and the status is set to the IO error.

> For (b), skipping the rest of the duplicate keys doesn't sound aligned with the API description here

See previous comment on this.

@facebook-github-bot (Contributor)

@anand1976 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot (Contributor)

@anand1976 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 (Contributor) left a comment

LGTM, though I still wonder if there is anything we could do about https://github.com/facebook/rocksdb/pull/12295/files#r1472259193, maybe as a follow-up / tech debt.

@anand1976 (Contributor, Author)

> LGTM, though I still wonder if there is anything we could do about https://github.com/facebook/rocksdb/pull/12295/files#r1472259193, maybe as a follow-up / tech debt.

You mean try to simplify the cmp_largest == 0 case? Yeah, we can track that as tech debt and think about how it can be done.

@hx235 (Contributor) commented Feb 2, 2024

>> LGTM, though I still wonder if there is anything we could do about https://github.com/facebook/rocksdb/pull/12295/files#r1472259193, maybe as a follow-up / tech debt.
>
> You mean try to simplify the cmp_largest == 0 case? Yeah, we can track that as tech debt and think about how it can be done.

I meant whether we can avoid having cmp_largest == 0 as a special case.

@facebook-github-bot (Contributor)

@anand1976 merged this pull request in 95b41ee.
