Resume remote compaction aborted due to primary restart #12177

hx235 · 2023-12-23T03:45:09Z

Context:
If the primary DB restarts after requesting a remote compaction but before installing its result, the same compaction is scheduled and requested again as if it were new. Any progress made at the remote site is therefore wasted.

Summary:
This PR lets the restarted primary DB wait for the remote compaction to return from the remote site instead of rescheduling an identical new one. At a high level, we persist the essential compaction information in the manifest. Upon restart, we use it to reconstruct the in-memory state, so that the DB can wait for the corresponding remote compaction and keep new compactions scheduled after the restart from conflicting with it.
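The persist-then-reconstruct idea in the summary can be sketched with a toy model. This is not the code in this PR: `PendingRemoteCompaction`, `ToyManifest`, `ToyPrimary` and all field names are hypothetical stand-ins, and the real manifest record format is not shown here.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Hypothetical record of one in-flight remote compaction. Field names are
// illustrative only; they are not the format this PR writes to the manifest.
struct PendingRemoteCompaction {
  std::string job_id;                 // identifier used to wait on the remote site
  std::vector<uint64_t> input_files;  // file numbers of the compaction inputs
  int output_level;
};

// Toy stand-in for the manifest: its records survive a "restart".
struct ToyManifest {
  std::vector<PendingRemoteCompaction> pending;
};

// Toy stand-in for the primary's in-memory compaction state.
class ToyPrimary {
 public:
  // On open, rebuild the in-memory state from what was persisted.
  explicit ToyPrimary(const ToyManifest& manifest) {
    for (const auto& rec : manifest.pending) {
      waiting_on_.push_back(rec.job_id);
      being_compacted_.insert(rec.input_files.begin(), rec.input_files.end());
    }
  }

  // A new compaction may only use files that no pending remote compaction
  // has already claimed; this is what prevents the conflict after restart.
  bool CanSchedule(const std::vector<uint64_t>& inputs) const {
    for (uint64_t f : inputs) {
      if (being_compacted_.count(f) > 0) return false;
    }
    return true;
  }

  const std::vector<std::string>& WaitingOn() const { return waiting_on_; }

 private:
  std::vector<std::string> waiting_on_;
  std::set<uint64_t> being_compacted_;
};
```

In this sketch, a primary reopened with one pending record for files 10 and 11 waits on that job and rejects a new compaction that touches file 11, while a compaction over unrelated files is still allowed.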

Test:

- New unit test TEST_F(CompactionServiceResumableCompactionTest, ResumableCompaction).
- Add Options::resume_compaction to the crash test to ensure it has no impact on existing features when remote compaction is not used.
  - The crash test currently does not use remote compaction.
- Run the stress test patched with a dummy remote service to test upgrade/downgrade compatibility of the manifest.

Limitations:

- In addition to compactions left unfinished by a primary DB restart, failed remote compactions are also resumed on the next DB open (noted in the API).
- A resumed remote compaction that fails never falls back to local compaction, since we now persist only the minimum compaction information, which can be less than a local compaction needs.
- Compactions are resumed only once. If the resume fails, we do not resume it again, and the failure is treated like any other compaction failure.
- If a second reopen happens shortly after the first, before the first has finished resuming the compaction, the second reopen may not be able to resume it. That is because the compaction resumed in the first DB open is not persisted to the manifest, and the manifest holding that information can be deleted before the second DB reopen.
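The resume-once and no-fallback limitations can be illustrated with a small state sketch. All names (`ToyRecord`, `OnOpen`, `OnRemoteReturn`) are hypothetical and not part of this PR. As a simplification, the sketch treats the persisted record as consumed by the first open, whereas the limitation above only says the old manifest can be deleted before the second reopen.

```cpp
#include <cassert>

enum class Action { kResumeRemote, kNothing };
enum class Result { kInstalled, kReportFailure };

// Hypothetical bookkeeping for one remote compaction recorded by a previous
// DB session.
struct ToyRecord {
  bool in_manifest;
};

// Decide what a DB open does with the record. A resumed compaction is not
// persisted again, so a later open finds nothing to resume (simplified).
Action OnOpen(ToyRecord& rec) {
  if (!rec.in_manifest) return Action::kNothing;
  rec.in_manifest = false;
  return Action::kResumeRemote;
}

// Handle the remote site's answer for a resumed compaction. A failure is
// reported like any other compaction failure; there is no local fallback.
Result OnRemoteReturn(bool remote_ok) {
  return remote_ok ? Result::kInstalled : Result::kReportFailure;
}
```

Under these assumptions, the first open resumes the compaction and the second open does nothing with it, which matches both the "resumed only once" rule and the quick-second-reopen case.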

facebook-github-bot · 2023-12-23T06:11:29Z

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2023-12-23T18:21:13Z

@hx235 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2023-12-23T18:22:03Z

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2023-12-23T20:06:49Z

@hx235 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2023-12-23T20:07:12Z

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

hx235 · 2023-12-25T19:41:59Z

unreleased_history/new_features/resume_compaction.md

@@ -0,0 +1 @@
+Provide an experimental option `Options::resume_compaction` to resume unfinished compactions left from the last db session. Right now only unfinished remote compactions due to primary db restart or failed remote compaction are supported. This options is turned on by default and has no effect to users with no remote compaction (i.e, `Options::compaction_service == nullptr`) or disable auto compaction (i.e, `Options::disable_auto_compactions = true`)


minor TODO: "... this option"
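For reference, the conditions in the release note can be read as a single predicate. This is a minimal sketch, not RocksDB code: `ToyOptions` is a hypothetical stand-in that mirrors only the three option names mentioned in the note, and `ResumeHasEffect` is an illustrative helper.

```cpp
#include <cassert>

// Hypothetical stand-in for the three options named in the release note.
// The real compaction_service is a service object; a raw pointer is used
// here only to model "set" versus nullptr.
struct ToyOptions {
  bool resume_compaction = true;          // on by default, per the note
  const void* compaction_service = nullptr;
  bool disable_auto_compactions = false;
};

// True only when the option can change behavior: it is enabled, a remote
// compaction service is configured, and auto compactions are not disabled.
bool ResumeHasEffect(const ToyOptions& o) {
  return o.resume_compaction && o.compaction_service != nullptr &&
         !o.disable_auto_compactions;
}
```

With default values the predicate is false because no compaction service is set, which is the "no effect for users without remote compaction" case.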

hx235 · 2023-12-25T19:43:15Z

db/compaction/compaction_service_test.cc

+    metadata.clear();
+    db_->GetLiveFilesMetaData(&metadata);
+    if (compaction_unfinished_ && resume_compaction) {
+      ASSERT_LT(metadata.size(), prev_reopen_live_file_num);


minor TODO: assert that the sync point is called, even though manually tracing through the debugger shows it is called.

facebook-github-bot added the CLA Signed label Dec 23, 2023

hx235 changed the title ~~Resume remote compaction aborted due to primary restart or failure~~ Resume remote compaction aborted due to primary restart Dec 23, 2023

hx235 force-pushed the resume_remote_compaction branch 2 times, most recently from c82c2cc to 7b881cd Compare December 23, 2023 05:40

hx235 force-pushed the resume_remote_compaction branch from 7b881cd to e669318 Compare December 23, 2023 18:21

Resume compaction

fde50d9

hx235 force-pushed the resume_remote_compaction branch from e669318 to fde50d9 Compare December 23, 2023 20:06

hx235 requested a review from ajkr December 25, 2023 19:43
