Extend the unicast based recovery algorithm to do replication policy check #11996
base: main
Conversation
Result of foundationdb-pr-clang-ide on Linux CentOS 7
Result of foundationdb-pr-clang-arm on Linux CentOS 7
Result of foundationdb-pr-clang on Linux CentOS 7
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
Result of foundationdb-pr on Linux CentOS 7
This was because of failures in "AccessTenant"-related tests (with error "traced too many lines") and clog-related tests (I see a transaction retried many times without progress; not sure about the real error).
Result of foundationdb-pr-clang-ide on Linux CentOS 7
Result of foundationdb-pr-clang-arm on Linux CentOS 7
Result of foundationdb-pr-clang on Linux CentOS 7
Result of foundationdb-pr on Linux CentOS 7
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
@@ -2538,15 +2539,16 @@ ACTOR Future<Void> TagPartitionedLogSystem::epochEnd(Reference<AsyncVar<Referenc
 Version minDV = std::numeric_limits<Version>::max();
 Version maxEnd = 0;
 state std::vector<Future<Void>> changes;
-state std::vector<std::tuple<int, std::vector<TLogLockResult>>> logGroupResults;
+state std::vector<std::tuple<int, std::vector<TLogLockResult>, bool>> logGroupResults;
The tuple has become hard to read. Can it be made into a structure instead? I think it would improve the code.
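For illustration only, a minimal sketch of the kind of struct the reviewer is asking for in place of the tuple; the type and field names here are hypothetical and not taken from the PR:

// Hypothetical replacement for std::tuple<int, std::vector<TLogLockResult>, bool>.
// Names are illustrative only; the PR may choose different ones.
struct LogGroupResult {
	int logGroupIndex;                       // which log group these results came from
	std::vector<TLogLockResult> lockResults; // lock replies gathered from the group's tlogs
	bool policySatisfied;                    // outcome of the replication policy check
};

state std::vector<LogGroupResult> logGroupResults;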
// policy. We are doing it this way as checking it this way is more efficient
// than checking on an individual version basis (which would require us to
// build the nonavailable log server set for each version in the unavaialble
// version list).
this comment is a little hard to follow. Maybe something more like this?
// At least (N - replicationFactor + 1) log servers must be available.
// Otherwise, the unavailable log servers alone would not be sufficient
// to satisfy the replication policy.
//
// @note This check is intentionally more restrictive than necessary.
// Instead of verifying whether the unavailable log servers within
// the specific set that received the version satisfy the replication policy,
// we check whether the entire set of unavailable log servers meets the policy.
//
// This approach is chosen because it is computationally more efficient.
// Checking availability on a per-version basis would require constructing
// a unique set of unavailable log servers for each version in the unavailable
// version list, which would add significant overhead.
I'll update the comment. Thanks for the edit!
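For context, a rough sketch of the kind of check the suggested comment describes, modeled on the LocalityGroup-plus-policy-validate pattern used elsewhere in recovery; the variable names (logSet, lockResults, tLogPolicy) are assumed to be in scope and this is not necessarily the PR's actual code:

// Sketch only: collect the localities of the non-reporting tlogs and test
// whether they alone could satisfy the replication policy. If they cannot,
// no version can be durable exclusively on them, so recovery may proceed
// using the available tlogs.
LocalityGroup unResponsiveSet;
for (int t = 0; t < (int)logSet->logServers.size(); t++) {
	if (!lockResults[t].isReady() || lockResults[t].isError()) {
		unResponsiveSet.add(logSet->tLogLocalities[t]);
	}
}
bool canIgnoreNonReporting = !unResponsiveSet.validate(logSet->tLogPolicy);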
Extend the version vector/unicast based recovery algorithm to do the replication policy check while deciding whether a version can be recovered from the set of available log servers. This will make the algorithm compatible with the non-unicast/"main" algorithm while handling non-reporting log servers during recovery.
Test that exposed this issue:
build_output/bin/fdbserver -r simulation --crash -f /root/src/foundationdb/tests/slow/RyowCorrectness.toml -b off -s 29779152
A "getRange()" call was getting blocked because recovery was not completing, which was because "replication_factor" number of log servers were not reporting during recovery. But these set non-reporting log servers were not completing the replication policy, so extending the recovery algorithm to do the replication policy check allowed recovery to progress and the test to succeed.
Note that this extension can make recovery progress only in cases where the non-reporting log servers do not meet the replication policy, but it makes the algorithm compatible with "main" when handling such scenarios.
Testing:
Id (with version vector disabled): 20250305-205711-sre-b53cba5eecb4dadb (started).
Code-Reviewer Section
The general pull request guidelines can be found here.
Please check each of the following things and check all boxes before accepting a PR.
For Release-Branches
If this PR is made against a release-branch, please also check the following:
release-branch or main (if this is the youngest branch)