[#13022] DST: PITR - Allow ability to restore to different points of time in the past when tablet splitting was ongoing

Summary:
This diff adds support for restoring to a point in time in the past at which a tablet split
was in progress. Briefly, the following algorithm is used:
1. If either child tablet (or both) is not registered on the master as of the time to which we are restoring, then we
restore the parent tablet and hide the child tablets.
2. If both child tablets are registered on the master, then we restore the child tablets and hide the parent.
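
The two cases above reduce to a single check on how many children the master knows about as of the restore time. A minimal sketch of that rule follows; `RestoreAction` and `ChooseRestoreAction` are hypothetical names used only for illustration and do not appear in the diff, which performs the same check on `split_info.children.size()`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names for illustration only; not part of the actual change.
enum class RestoreAction {
  kRestoreParentHideChildren,  // case 1: fewer than two children registered
  kRestoreChildrenHideParent,  // case 2: both children registered
};

// registered_children: number of split children registered on the master
// as of the time being restored to (0, 1, or 2).
RestoreAction ChooseRestoreAction(std::size_t registered_children) {
  // A split produces exactly two children, so more than two is a bug.
  assert(registered_children <= 2);
  return registered_children < 2 ? RestoreAction::kRestoreParentHideChildren
                                 : RestoreAction::kRestoreChildrenHideParent;
}
```

With one child registered the parent is still restored, since writes were still going to the parent at that time.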

This works because, when the restore is initiated, we wait for splits to complete.
Thus at the current time the split children are ready, so it's safe to restore the children and
use the hybrid time filter added as part of PITR to ensure only restored rows are visible.
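
To illustrate why restoring the present-day children is safe, here is a heavily simplified sketch of a time-based visibility filter. It assumes each row carries a single write timestamp and that the filter hides anything written after the restore time; the real PITR filter operates inside the storage layer and is not shown in this diff, and `Row` and `VisibleAfterRestore` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified row: a key plus the time it was written.
struct Row {
  std::string key;
  std::uint64_t write_time;
};

// Keeps only rows written at or before restore_time. Rows that landed in a
// child tablet after the restore point are filtered out of the result.
std::vector<Row> VisibleAfterRestore(const std::vector<Row>& rows,
                                     std::uint64_t restore_time) {
  std::vector<Row> visible;
  for (const auto& row : rows) {
    if (row.write_time <= restore_time) {
      visible.push_back(row);
    }
  }
  assert(visible.size() <= rows.size());  // filtering never adds rows
  return visible;
}
```

Under this assumption, a child tablet that holds both pre-split and post-split writes still presents only the rows that existed at the restore time.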

Test Plan:
Tests cover restores issued at different phases of a split:
1. Restore before the middle key is fetched: ybd --cxx_test yb-admin-snapshot-schedule-test
--gtest-filter YbAdminRestoreDuringSplit.RestoreBeforeGetSplitKey
2. Restore after only one child is registered with the master: ybd --cxx_test yb-admin-snapshot-schedule-test
--gtest-filter YbAdminRestoreDuringSplit.RestoreAfterOneChildRegistered
3. Restore after both the children registered but SPLIT_OP not applied: ybd --cxx_test yb-admin-snapshot-schedule-test
--gtest-filter YbAdminRestoreDuringSplit.RestoreBeforeSplitOpIsApplied
4. Restore after children RUNNING but parent not HIDDEN: ybd --cxx_test yb-admin-snapshot-schedule-test
--gtest-filter YbAdminRestoreDuringSplit.RestoreBeforeParentHidden
5. Restore after children RUNNING and parent HIDDEN: ybd --cxx_test yb-admin-snapshot-schedule-test
--gtest-filter YbAdminSnapshotScheduleTest.VerifyRestoreWithDeletedTablets

Reviewers: slingam, timur, sergei, asrivastava, zdrudi

Reviewed By: asrivastava, zdrudi

Subscribers: bogdan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D18299
sanketkedia committed Jul 21, 2022
1 parent ad0c337 commit 675d486
Showing 5 changed files with 500 additions and 128 deletions.
197 changes: 163 additions & 34 deletions ent/src/yb/master/restore_sys_catalog_state.cc
@@ -268,6 +268,104 @@ class PgCatalogRestorePatch : public RestorePatch {
RestoreSysCatalogState::RestoreSysCatalogState(SnapshotScheduleRestoration* restoration)
: restoration_(*restoration) {}

Status RestoreSysCatalogState::PatchAndAddRestoringTablets() {
faststring buffer;
for (auto& split_tablet : restoration_.non_system_tablets_to_restore) {
auto& split_info = split_tablet.second;
// CASE: 1
// If master has fewer than 2 children registered then writes (if any) are still
// going to the parent and it is safe to restore the parent and hide the children (if any).
// Some examples:
//
// Example#1: Non-colocated split tablet that has finished split completely as of restore time
// t1
// / \
// t11 t12 <-- Restoring time
// / \ / \
// t111 t112 t121 t122 <-- Present time
// If we are restoring to a state when t1 was completely split into t11 and t12 then
// in the restoring state, the split map will contain two entries
// one each for t11 and t12. Both the entries will only have the parent
// but no children. It is safe to restore just the parent.
//
// Example#2: Colocated or not split tablet
// t1 (colocated or not split) <-- Present and Restoring time
// If we are restoring a colocated tablet then the split map will only contain one entry
// for t1 that will only have the parent but no children.
//
// Example#3: Non-colocated split tablet in the middle of a split as of restore time
// If both the children are not registered on the master as of the time to restore
// then all the writes are still going to the parent and it is safe to restore
// the parent. We also HIDE the children if any.
if (VLOG_IS_ON(3)) {
VLOG(3) << "Parent tablet id " << split_info.parent.first
<< ", pb " << split_info.parent.second->ShortDebugString();
for (const auto& child : split_info.children) {
VLOG(3) << "Child tablet id " << child.first
<< ", pb " << child.second->ShortDebugString();
}
}
if (split_info.children.size() < 2) {
// Clear the children info from the protobuf.
split_info.parent.second->clear_split_tablet_ids();
// If it is a colocated tablet, then set the schedules that prevent
// its colocated tables from getting deleted. Also, add to-be hidden table ids
// in its colocated list as they won't be present previously.
RETURN_NOT_OK(PatchColocatedTablet(split_info.parent.first, split_info.parent.second));
RETURN_NOT_OK(AddRestoringEntry(split_info.parent.first, split_info.parent.second,
&buffer, SysRowEntryType::TABLET));
// Hide the child tablets.
for (auto& child : split_info.children) {
FillHideInformation(child.second->table_id(), child.second);
RETURN_NOT_OK(AddRestoringEntry(child.first, child.second, &buffer,
SysRowEntryType::TABLET, DoTsRestore::kFalse));
}
} else {
// CASE: 2
// If master has both the children registered then we restore as if this split
// is complete i.e. we restore both the children and hide the parent.
// This works because at the time when restore was initiated, we waited
// for splits to complete, so at current time split children are ready and parent is hidden.
// Thus it's safe to restore the children and use hybrid time filter to
// ensure only restored rows are visible. This takes care of all the race conditions
// associated with selectively restoring either only the parent or children depending on
// the stage at which splitting is at.

// There should be exactly 2 children.
RSTATUS_DCHECK_EQ(split_info.children.size(), 2, IllegalState,
"More than two children tablets exist for the parent tablet");

// Restore the child tablets.
for (const auto& child : split_info.children) {
child.second->clear_split_tablet_ids();
child.second->set_split_parent_tablet_id(split_info.parent.first);
RETURN_NOT_OK(AddRestoringEntry(child.first, child.second,
&buffer, SysRowEntryType::TABLET));
}
// Hide the parent tablet.
FillHideInformation(split_info.parent.second->table_id(), split_info.parent.second);
RETURN_NOT_OK(AddRestoringEntry(split_info.parent.first, split_info.parent.second, &buffer,
SysRowEntryType::TABLET, DoTsRestore::kFalse));
}
}

return Status::OK();
}

void RestoreSysCatalogState::FillHideInformation(
TableId table_id, SysTabletsEntryPB* pb, bool set_hide_time) {
auto it = retained_existing_tables_.find(table_id);
if (it != retained_existing_tables_.end()) {
if (set_hide_time) {
pb->set_hide_hybrid_time(restoration_.write_time.ToUint64());
}
auto& out_schedules = *pb->mutable_retained_by_snapshot_schedules();
for (const auto& schedule_id : it->second) {
out_schedules.Add()->assign(schedule_id.AsSlice().cdata(), schedule_id.size());
}
}
}

Result<bool> RestoreSysCatalogState::PatchRestoringEntry(
const std::string& id, SysNamespaceEntryPB* pb) {
return true;
@@ -293,7 +391,7 @@ Result<bool> RestoreSysCatalogState::PatchRestoringEntry(
<< ", restoring version " << pb->version();
}

// Patch the partition version if changed.
// Patch the partition version.
if (pb->partition_list_version() != it->second.partition_list_version()) {
LOG(INFO) << "PITR: Patching the partition list version for table " << id
<< ". Existing version " << it->second.partition_list_version()
@@ -304,14 +402,11 @@ Result<bool> RestoreSysCatalogState::PatchRestoringEntry(
return true;
}

Result<bool> RestoreSysCatalogState::PatchRestoringEntry(
Status RestoreSysCatalogState::PatchColocatedTablet(
const std::string& id, SysTabletsEntryPB* pb) {
if (!pb->colocated()) {
return true;
return Status::OK();
}
// If it is a colocated tablet, then set the schedules that prevent
// its colocated tables from getting deleted. Also, add to-be hidden table ids
// in its colocated list as they won't be present previously.
auto it = existing_objects_.tablets.find(id);
// Since we are not allowed to drop the database on which schedule was set,
// it implies that the colocated tablet for the colocated database must always be present.
@@ -342,37 +437,73 @@ Result<bool> RestoreSysCatalogState::PatchRestoringEntry(
}
if (colocated_table_deleted) {
// Set schedules that retain.
auto it = retained_existing_tables_.find(found_table_id);
if (it != retained_existing_tables_.end()) {
auto& out_schedules = *pb->mutable_retained_by_snapshot_schedules();
for (const auto& schedule_id : it->second) {
LOG(INFO) << "PITR: " << schedule_id << " schedule retains colocated tablet " << id;
out_schedules.Add()->assign(schedule_id.AsSlice().cdata(), schedule_id.size());
}
FillHideInformation(found_table_id, pb, false /* set_hide_time */);
}
return Status::OK();
}

void RestoreSysCatalogState::AddTabletToSplitRelationshipsMap(
const std::string& id, SysTabletsEntryPB* pb) {
// If this tablet has a parent tablet then add it as a child of that parent.
// Otherwise add it as a parent.
VLOG_WITH_FUNC(1) << "Tablet id " << id << ", pb " << pb->ShortDebugString();
bool has_live_parent = false;
if (pb->has_split_parent_tablet_id()) {
auto it = restoring_objects_.tablets.find(pb->split_parent_tablet_id());
if (it != restoring_objects_.tablets.end()) {
has_live_parent = !TabletDeleted(it->second);
}
}
return true;
if (has_live_parent) {
restoration_.non_system_tablets_to_restore[pb->split_parent_tablet_id()]
.children.emplace(id, pb);
} else {
auto& split_info = restoration_.non_system_tablets_to_restore[id];
split_info.parent.first = id;
split_info.parent.second = pb;
}
}

Result<bool> RestoreSysCatalogState::PatchRestoringEntry(
const std::string& id, SysTabletsEntryPB* pb) {
AddTabletToSplitRelationshipsMap(id, pb);
// Don't add this entry to the write batch yet, we write
// them once split relationships are known for all tablets
// as a separate step.
return false;
}

template <class PB>
Status RestoreSysCatalogState::AddRestoringEntry(
const std::string& id, PB* pb, faststring* buffer) {
auto type = GetEntryType<PB>::value;
const std::string& id, PB* pb, faststring* buffer, SysRowEntryType type,
DoTsRestore send_restore_rpc) {
VLOG_WITH_FUNC(1) << SysRowEntryType_Name(type) << ": " << id << ", " << pb->ShortDebugString();

if (!VERIFY_RESULT(PatchRestoringEntry(id, pb))) {
return Status::OK();
}
auto& entry = *entries_.mutable_entries()->Add();
entry.set_type(type);
entry.set_id(id);
RETURN_NOT_OK(pb_util::SerializeToString(*pb, buffer));
entry.set_data(buffer->data(), buffer->size());
restoration_.non_system_objects_to_restore.emplace(id, type);
if (send_restore_rpc) {
restoration_.non_system_objects_to_restore.emplace(id, type);
}

return Status::OK();
}

template <class PB>
Status RestoreSysCatalogState::PatchAndAddRestoringEntry(
const std::string& id, PB* pb, faststring* buffer) {
auto type = GetEntryType<PB>::value;
VLOG_WITH_FUNC(1) << SysRowEntryType_Name(type) << ": " << id << ", " << pb->ShortDebugString();

if (!VERIFY_RESULT(PatchRestoringEntry(id, pb))) {
return Status::OK();
}

return AddRestoringEntry(id, pb, buffer, type);
}

bool RestoreSysCatalogState::AreAllSequencesDataObjectsEmpty(
Objects* existing_objects, Objects* restoring_objects) {
return existing_objects->sequences_namespace.empty() &&
@@ -395,14 +526,16 @@ Status RestoreSysCatalogState::AddSequencesDataEntries(
std::unordered_map<NamespaceId, SysNamespaceEntryPB>* seq_namespace,
std::unordered_map<TableId, SysTablesEntryPB>* seq_table,
std::unordered_map<TabletId, SysTabletsEntryPB>* seq_tablets) {
faststring namespace_buffer, table_buffer;
faststring buffer;
RETURN_NOT_OK(AddRestoringEntry(
seq_namespace->begin()->first, &seq_namespace->begin()->second, &namespace_buffer));
seq_namespace->begin()->first, &seq_namespace->begin()->second,
&buffer, SysRowEntryType::NAMESPACE));
RETURN_NOT_OK(AddRestoringEntry(
seq_table->begin()->first, &seq_table->begin()->second, &table_buffer));
seq_table->begin()->first, &seq_table->begin()->second,
&buffer, SysRowEntryType::TABLE));
for (auto& id_and_pb : *seq_tablets) {
faststring buffer;
RETURN_NOT_OK(AddRestoringEntry(id_and_pb.first, &id_and_pb.second, &buffer));
RETURN_NOT_OK(AddRestoringEntry(
id_and_pb.first, &id_and_pb.second, &buffer, SysRowEntryType::TABLET));
}
return Status::OK();
}
@@ -462,9 +595,11 @@ Status RestoreSysCatalogState::Process() {
RETURN_NOT_OK_PREPEND(DetermineEntries(
&restoring_objects_, nullptr,
[this, &buffer](const auto& id, auto* pb) {
return AddRestoringEntry(id, pb, &buffer);
return PatchAndAddRestoringEntry(id, pb, &buffer);
}), "Determine restoring entries failed");

RETURN_NOT_OK(PatchAndAddRestoringTablets());

return Status::OK();
}

@@ -692,14 +827,8 @@ Status RestoreSysCatalogState::PrepareTabletCleanup(

QLWriteRequestPB write_request;

auto it = retained_existing_tables_.find(pb.table_id());
if (it != retained_existing_tables_.end()) {
pb.set_hide_hybrid_time(restoration_.write_time.ToUint64());
auto& out_schedules = *pb.mutable_retained_by_snapshot_schedules();
for (const auto& schedule_id : it->second) {
out_schedules.Add()->assign(schedule_id.AsSlice().cdata(), schedule_id.size());
}
}
FillHideInformation(pb.table_id(), &pb);

RETURN_NOT_OK(FillSysCatalogWriteRequest(
SysRowEntryType::TABLET, id, pb.SerializeAsString(),
QLWriteRequestPB::QL_STMT_UPDATE, schema, &write_request));
19 changes: 18 additions & 1 deletion ent/src/yb/master/restore_sys_catalog_state.h
@@ -32,6 +32,8 @@
namespace yb {
namespace master {

YB_STRONGLY_TYPED_BOOL(DoTsRestore);

// Utility class to restore sys catalog.
// Initially we load tables and tablets into it, then match schedule filter.
class RestoreSysCatalogState {
@@ -101,7 +103,18 @@ class RestoreSysCatalogState {
std::unordered_map<TabletId, SysTabletsEntryPB>* seq_tablets);

template <class PB>
Status AddRestoringEntry(const std::string& id, PB* pb, faststring* buffer);
Status AddRestoringEntry(
const std::string& id, PB* pb, faststring* buffer, SysRowEntryType type,
DoTsRestore send_restore_rpc = DoTsRestore::kTrue);

template <class PB>
Status PatchAndAddRestoringEntry(
const std::string& id, PB* pb, faststring* buffer);

// Adds the tablet to 'non_system_tablets_to_restore' map.
void AddTabletToSplitRelationshipsMap(const std::string& id, SysTabletsEntryPB* pb);

Status PatchColocatedTablet(const std::string& id, SysTabletsEntryPB* pb);

Result<bool> PatchRestoringEntry(const std::string& id, SysNamespaceEntryPB* pb);
Result<bool> PatchRestoringEntry(const std::string& id, SysTablesEntryPB* pb);
@@ -136,6 +149,10 @@ class RestoreSysCatalogState {

Status PatchSequencesDataObjects(Objects* existing_objects, Objects* restoring_objects);

Status PatchAndAddRestoringTablets();

void FillHideInformation(TableId table_id, SysTabletsEntryPB* pb, bool set_hide_time = true);

struct Objects {
std::unordered_map<NamespaceId, SysNamespaceEntryPB> namespaces;
std::unordered_map<TableId, SysTablesEntryPB> tables;
5 changes: 5 additions & 0 deletions src/yb/master/catalog_manager.cc
@@ -496,6 +496,10 @@ DEFINE_bool(batch_ysql_system_tables_metadata, false,
"a create database is performed one by one or batched together");
TAG_FLAG(batch_ysql_system_tables_metadata, runtime);

DEFINE_test_flag(bool, pause_split_child_registration,
false, "Pause split after registering one child");
TAG_FLAG(TEST_pause_split_child_registration, runtime);

namespace yb {
namespace master {

@@ -6095,6 +6099,7 @@ Result<TabletInfoPtr> CatalogManager::RegisterNewTabletForSplit(
RETURN_NOT_OK(sys_catalog_->Upsert(leader_ready_term(), table, new_tablet, source_tablet_info));

MAYBE_FAULT(FLAGS_TEST_crash_after_creating_single_split_tablet);
TEST_PAUSE_IF_FLAG(TEST_pause_split_child_registration);

table->AddTablet(new_tablet);
// TODO: We use this pattern in other places, but what if concurrent thread accesses not yet
16 changes: 14 additions & 2 deletions src/yb/master/master_snapshot_coordinator.h
@@ -21,6 +21,7 @@
#include "yb/docdb/docdb.pb.h"
#include "yb/gutil/ref_counted.h"

#include "yb/master/catalog_entity_info.pb.h"
#include "yb/master/master_fwd.h"
#include "yb/master/master_heartbeat.fwd.h"
#include "yb/master/master_types.pb.h"
@@ -44,10 +45,21 @@ struct SnapshotScheduleRestoration {
std::vector<std::pair<TabletId, SysTabletsEntryPB>> non_system_obsolete_tablets;
std::vector<std::pair<TableId, SysTablesEntryPB>> non_system_obsolete_tables;
std::unordered_map<std::string, SysRowEntryType> non_system_objects_to_restore;
// YSQL pg_catalog_tables in the current state (as of restore request time).
// YSQL pg_catalog tables as of the current time.
std::unordered_map<TableId, TableName> existing_system_tables;
// YSQL pg_catalog_tables present in the snapshot to restore to.
// YSQL pg_catalog tables as of time in the past to which we are restoring.
std::unordered_set<TableId> restoring_system_tables;
// Captures split relationships between tablets.
struct SplitTabletInfo {
std::pair<TabletId, SysTabletsEntryPB*> parent;
std::unordered_map<TabletId, SysTabletsEntryPB*> children;
};
// Tablets as of the restoring time with their parent-child relationships.
// Map from parent tablet id -> information about parent and children.
// For colocated tablets or tablets that have not been split as of restoring time,
// only the 'parent' field of SplitTabletInfo above will be populated and 'children'
// map of SplitTabletInfo will be empty.
std::unordered_map<TabletId, SplitTabletInfo> non_system_tablets_to_restore;
};

// Class that coordinates transaction aware snapshots at master.
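
The `non_system_tablets_to_restore` map above is filled by grouping each tablet under its split parent when that parent is present and not deleted as of the restoring time; otherwise the tablet gets its own entry as a parent. A minimal standalone sketch of that grouping follows. `Tablet`, `SplitInfo`, and `GroupBySplitParent` are hypothetical simplifications; the actual code works on `SysTabletsEntryPB` pointers and uses `TabletDeleted`.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a tablet's catalog entry.
struct Tablet {
  std::string id;
  std::optional<std::string> split_parent_id;  // unset if not created by a split
  bool deleted = false;
};

struct SplitInfo {
  const Tablet* parent = nullptr;
  std::vector<const Tablet*> children;
};

// Groups tablets by split parent. The returned pointers refer to elements of
// `tablets`, so the vector must outlive the result.
std::map<std::string, SplitInfo> GroupBySplitParent(const std::vector<Tablet>& tablets) {
  std::map<std::string, const Tablet*> by_id;
  for (const auto& tablet : tablets) {
    by_id[tablet.id] = &tablet;
  }

  std::map<std::string, SplitInfo> groups;
  for (const auto& tablet : tablets) {
    bool has_live_parent = false;
    if (tablet.split_parent_id) {
      auto it = by_id.find(*tablet.split_parent_id);
      has_live_parent = it != by_id.end() && !it->second->deleted;
    }
    if (has_live_parent) {
      groups[*tablet.split_parent_id].children.push_back(&tablet);
    } else {
      groups[tablet.id].parent = &tablet;
    }
  }
  assert(groups.size() <= tablets.size());  // at most one group per tablet
  return groups;
}
```

A tablet whose recorded parent is missing or deleted is treated as a parent with no children, which matches the "completely split as of restore time" example in the code comments.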