
Fix potential incorrect result for duplicate key in MultiGet #12295

Closed
wants to merge 3 commits

Conversation

@anand1976 (Contributor)

RocksDB correctness testing recently discovered a possible, but very unlikely, correctness issue with MultiGet. The issue happens when all of the conditions below are met:

  1. Duplicate keys in a MultiGet batch
  2. The key matches the last key in a non-zero, non-bottommost level file
  3. The final value is not in the file (merge operand, not snapshot visible, etc.)
  4. Multiple entries exist for the key in the file, spanning more than one data block. This can happen due to snapshots, which force multiple versions of the key into the file, and they may spill over into another data block
  5. The lookup attempt in the SST for the first of the duplicates fails with an IO error on a data block (NOT the first data block, but the second or a subsequent uncached block), but there are no errors for the other duplicates
  6. The value or merge operand for the key is present in the very next level

The problem is that in FilePickerMultiGet, when looking up keys in a level, we use FileIndexer and the overlapping file in the current level to determine the search bounds for that key in the file list of the next level. If the next level is empty, the search bounds are reset and we do a full binary search in the next non-empty level's LevelFilesBrief. However, under conditions #1 and #2 listed above, only the first of the duplicates has its next-level search bounds updated; the remaining duplicates are skipped.
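For concreteness, here is a minimal caller-side sketch of the duplicate-key batch in condition #1. It is not part of this PR; `db`, `cf`, and the key names are hypothetical, and the exact MultiGet overload available may depend on the RocksDB version.

```cpp
#include <vector>

#include "rocksdb/db.h"

// Sketch only: the same user key appears more than once in a single
// MultiGet batch. `db` and `cf` are assumed to be an open DB and a column
// family handle set up elsewhere.
void LookupWithDuplicates(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf) {
  std::vector<rocksdb::Slice> keys = {"k_dup", "k_dup", "k_other"};
  std::vector<rocksdb::PinnableSlice> values(keys.size());
  std::vector<rocksdb::Status> statuses(keys.size());

  db->MultiGet(rocksdb::ReadOptions(), cf, keys.size(), keys.data(),
               values.data(), statuses.data());

  // Each occurrence of "k_dup" gets its own value/status slot. Under the
  // conditions above, the second occurrence could previously come back with
  // a wrong result because its next-level search bounds were never updated.
}
```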

Test plan:
Add unit tests that fail an assertion or return a wrong result without the fix.

@facebook-github-bot (Contributor)

@anand1976 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 (Contributor) commented Jan 30, 2024

A high-level question @anand1976: the API comment

// Note: keys will not be "de-duplicated". Duplicate keys will return
// duplicate values in order.
makes it feel like we should return the same values (and hence the same status) for duplicate keys.

So if that expectation is agreed upon, should we stop and return when the first duplicate encounters an SST read error? Right now, the code seems like a workaround that instead ensures the remaining duplicates have the correct key.

@@ -688,6 +689,12 @@ class FilePickerMultiGet {
user_comparator_->CompareWithoutTimestamp(
@hx235 (Contributor) commented Jan 31, 2024

Overall I found the purpose of "if (cmp_largest == 0)" hard to understand and the comment does not help much.

My main question is:
If we agree with "it keeps moving forward until the last key in the batch that falls in that file" as mentioned in https://github.com/facebook/rocksdb/pull/12295/files#diff-6270b3486fea620597e24151dd0d75c2c14b3c4c30d62e0567811723628733deR623-R625, I don't see why we can't merge the cmp_largest == 0 case with the cmp_largest < 0 case.

In particular, the comment about this case, "which means the next key will not be in this file, so stop looking further", seems to assume there aren't any duplicate keys. If we don't plan to change/clarify MultiGet() on that aspect, I don't see why we still keep this special path. Also, should we still allow MultiGet() to take duplicate keys for one CF? Duplicate keys make the logic harder to reason about.

For the above question, I was once wondering if it's because we need to, as the comment mentions, "..leave batch_iter as is since we may have to pick up from there for the next file, if this file has a merge value rather than". But again, I don't understand why we didn't do the same ("leave batch_iter_ as is") for the case where the duplicate keys lie in the middle of the file instead of at the end.
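For readers following along, here is a heavily simplified, standalone model of the three-way comparison being discussed. It is not the actual FilePickerMultiGet code; it only illustrates why cmp_largest == 0 sits between the "keep scanning this file" and "stop looking" outcomes.

```cpp
#include <string>

// Illustrative model only -- NOT the actual RocksDB code. In FilePickerMultiGet
// the comparison uses user_comparator_ against the file's largest key; plain
// std::string comparison stands in for it here.
enum class NextStep {
  kKeepScanningThisFile,    // cmp_largest < 0
  kHandleDuplicatesInFile,  // cmp_largest == 0, the special case in question
  kStopLookingInThisFile    // cmp_largest > 0
};

NextStep ClassifyAgainstFileLargest(const std::string& batch_key,
                                    const std::string& file_largest_key) {
  const int cmp_largest = batch_key.compare(file_largest_key);
  if (cmp_largest < 0) {
    // Key sorts before the file's largest key: later keys in the batch may
    // also fall in this file, so keep advancing within the same file.
    return NextStep::kKeepScanningThisFile;
  } else if (cmp_largest == 0) {
    // Key equals the file's largest key: duplicates of this key (and further
    // versions of it) may still have to be served by this file.
    return NextStep::kHandleDuplicatesInFile;
  }
  // Key sorts after the file's largest key: this file cannot contain it.
  return NextStep::kStopLookingInThisFile;
}
```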

@hx235 (Contributor) commented Jan 31, 2024

The other condition needed for the bug is that the first duplicate key is (a) marked as done while the SST lookup returns an error, which then (b) leads to the rest of the duplicate keys being skipped: https://github.com/facebook/rocksdb/blob/main/db/version_set.cc#L445-L448

For (a), why didn't we surface this error all the way to the user?

For (b), skipping the rest of the duplicate keys doesn't sound aligned with the API description here

// Note: keys will not be "de-duplicated". Duplicate keys will return
// duplicate values in order.

Should we do something about this condition too?

@anand1976 (Contributor, Author)

> A high-level question @anand1976: the API comment
>
> // Note: keys will not be "de-duplicated". Duplicate keys will return
> // duplicate values in order.
>
> makes it feel like we should return the same values (and hence the same status) for duplicate keys.
> So if that expectation is agreed upon, should we stop and return when the first duplicate encounters an SST read error? Right now, the code seems like a workaround that instead ensures the remaining duplicates have the correct key.

@hx235 There's no contract to treat duplicate keys in a special manner. The way I read it, we won't de-duplicate the keys, i.e. we will not attempt to return an identical result for each of the duplicates. I feel it's more trouble than it's worth to try to ensure we return an identical status and value. I'd rather just make it explicit in the comment that a different status may be returned for each duplicate, depending on what happens while processing it.
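In caller terms, a hedged sketch of what that relaxed contract implies (the helper name is hypothetical; nothing here is part of the PR):

```cpp
#include <vector>

#include "rocksdb/status.h"

// Sketch only: with duplicate keys in one MultiGet batch, the statuses of the
// occurrences are not guaranteed to match (e.g. one slot may hold an IOError
// while another is OK), so each slot should be inspected independently.
bool AllSlotsUsable(const std::vector<rocksdb::Status>& statuses) {
  for (const rocksdb::Status& s : statuses) {
    if (!s.ok() && !s.IsNotFound()) {
      // An error on one occurrence says nothing about other occurrences of
      // the same key; do not propagate its result to them.
      return false;
    }
  }
  return true;
}
```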

@anand1976 (Contributor, Author)

> The other condition needed for the bug is that the first duplicate key is (a) marked as done while the SST lookup returns an error, which then (b) leads to the rest of the duplicate keys being skipped: https://github.com/facebook/rocksdb/blob/main/db/version_set.cc#L445-L448
>
> For (a), why didn't we surface this error all the way to the user?

We do. It's marked as done in the batch, and the status is set to the IO error.

> For (b), skipping the rest of the duplicate keys doesn't sound aligned with the API description here

See previous comment on this.

@facebook-github-bot (Contributor)

@anand1976 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot (Contributor)

@anand1976 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 (Contributor) left a comment

LGTM, though I still wonder if there is anything we could do about https://github.com/facebook/rocksdb/pull/12295/files#r1472259193, maybe as a follow-up / tech debt.

@anand1976 (Contributor, Author)

> LGTM, though I still wonder if there is anything we could do about https://github.com/facebook/rocksdb/pull/12295/files#r1472259193, maybe as a follow-up / tech debt.

You mean try to simplify the cmp_largest == 0 case? Yeah, we can track that as tech debt and think about how it can be done.

@hx235 (Contributor) commented Feb 2, 2024

>> LGTM, though I still wonder if there is anything we could do about https://github.com/facebook/rocksdb/pull/12295/files#r1472259193, maybe as a follow-up / tech debt.
>
> You mean try to simplify the cmp_largest == 0 case? Yeah, we can track that as tech debt and think about how it can be done.

I meant whether we can avoid having cmp_largest == 0 as a special case.

@facebook-github-bot (Contributor)

@anand1976 merged this pull request in 95b41ee.
