
relay sync1.1 #961 (Open)
wants to merge 22 commits into main

Conversation

brianolson (Contributor):

relay radical rehacking for sync1.1 induction firehose:

  • drops all carstore code for keeping an archive of all repos
  • drops all indexer code for running getRepo on all those repos
  • drops a bunch of other stuff that fell out as deprecated and unneeded
  • adds sync1.1 protocol features for checking #commit.prevData and #sync
  • internally vendors a bunch of code that we should re-share back out to common code

formerly known as #951

@bnewbold (Collaborator) left a comment:

got through some review. I still want to look at:

  • the actual generated postgresql schemas. getting those right seems important. the gorm models are still spread through many files, seems like they could all be in cmd/relay/models/models.go?
  • account state and lifecycle
  • rate-limiting

it feels like there is still a lot of low-hanging fruit: code that could just be ripped out, eg all the deprecated event types. there also seem to be a lot of config arguments which are not used.

I was surprised the disk persister is still used, instead of the pebble persister used in rainbow.


In atproto, a Relay subscribes to multiple PDS hosts and outputs a combined "firehose" event stream. Downstream services can subscribe to this single firehose and get all relevant events for the entire network, or a specific sub-graph of the network. The Relay maintains a mirror of repo data from all accounts on the upstream PDS instances, and verifies repo data structure integrity and identity signatures. It is agnostic to applications, and does not validate data against atproto Lexicon schemas.

This Relay implementation is designed to subscribe to the entire global network. The current state of the codebase is informally expected to scale to around 20 million accounts in the network, and thousands of repo events per second (peak).
Collaborator:

chunks of this README are out of date. maybe simpler to just leave it at a couple of lines in this PR, and I can update it in a follow-up?

Contributor (Author):

I re-read it just now and I think most of it is still valid and I deleted the obviously obsolete stuff.
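For context on the firehose the README excerpt above describes, here is a minimal sketch of a downstream consumer; the relay host is a placeholder, and a real consumer would decode the DAG-CBOR frames (for example with the indigo events package) rather than just counting bytes.

package main

import (
	"log"

	"github.com/gorilla/websocket"
)

func main() {
	// one subscription covers events for every account the relay mirrors;
	// relay.example.com is a placeholder host
	con, _, err := websocket.DefaultDialer.Dial(
		"wss://relay.example.com/xrpc/com.atproto.sync.subscribeRepos", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer con.Close()

	for {
		// each frame carries a header plus a body such as #commit, #sync, or #identity
		_, frame, err := con.ReadMessage()
		if err != nil {
			log.Fatal(err)
		}
		log.Printf("firehose frame: %d bytes", len(frame))
	}
}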


for {
	select {
	case <-t.C:
		if err := con.WriteControl(websocket.PingMessage, []byte{}, time.Now().Add(time.Second*10)); err != nil {
			log.Warn("failed to ping", "err", err)
			failcount++
Collaborator:

same feedback as before: should this get cleared to 0 if the ping is successful?

(in this for loop)
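For reference, the suggested reset would look roughly like this (a sketch only; failcount, t, and con come from the snippet above), so that only consecutive failures count toward the limit:

case <-t.C:
	if err := con.WriteControl(websocket.PingMessage, []byte{}, time.Now().Add(time.Second*10)); err != nil {
		log.Warn("failed to ping", "err", err)
		failcount++
	} else {
		// clear the counter after a successful ping
		failcount = 0
	}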

@brianolson (Contributor, Author) commented on Mar 4, 2025:

this is not the active code, see cmd/relay/events/consumer.go

Collaborator:

if this doesn't impact the indigo:cmd/relay code, can you pull it out of this PR?

models/models.go (Outdated)
@@ -104,7 +104,7 @@ type FollowRecord struct {
 type PDS struct {
 	gorm.Model

-	Host string
+	Host string `gorm:"unique"`
Collaborator:

this only impacts the old bigsky, not the new relay, right? seems like we should minimize any changes to bigsky in this PR.

Contributor (Author):

yeah, I should probably roll back most of the changes outside of cmd/relay

@@ -0,0 +1,52 @@
name: container-relay-aws
Collaborator:

should do a container-relay-ghcr CI action as well; we will remove the bigsky GHCR soon so it won't be a net increase in container builds

&cli.StringFlag{
	Name:  "db-url",
	Usage: "database connection string for BGS database",
	Value: "sqlite://./data/bigsky/bgs.sqlite",
Collaborator:

is there still sqlite support? I thought that got dropped

Contributor (Author):

it's not working right now because some transactions changed, but it's maybe not far off from being fixed?

}

func (bgs *BGS) handleSync(ctx context.Context, host *models.PDS, evt *comatproto.SyncSubscribeRepos_Sync) error {
	// TODO: actually do something with #sync event
Collaborator:

we should at least be updating the metadata for the account (DID) here, right?

Collaborator:

I think the rough expected behavior is:

  • parse out the commit object
  • verify the message matches the fields in the commit object
  • fetch user metadata
    • if account isn't active, drop event
    • if rev is old/bad, drop event
  • update user metadata with sync info
  • emit the event
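A sketch of that flow, just to make the steps concrete; the helper names (parseCommitBlock, lookupUserByDid, revIsNewer, updatePrevState, emitSyncEvent) and the field accesses are placeholders for illustration, not code from this PR:

func (bgs *BGS) handleSync(ctx context.Context, host *models.PDS, evt *comatproto.SyncSubscribeRepos_Sync) error {
	// parse out the commit object carried in the event's CAR slice
	commit, commitCID, err := parseCommitBlock(evt.Blocks) // placeholder helper
	if err != nil {
		return fmt.Errorf("reading #sync commit block: %w", err)
	}

	// verify the message matches the fields in the commit object
	if commit.Did != evt.Did || commit.Rev != evt.Rev {
		return fmt.Errorf("#sync envelope does not match commit object")
	}

	// fetch user metadata
	u, err := bgs.lookupUserByDid(ctx, evt.Did) // placeholder helper
	if err != nil {
		return err
	}
	if !u.IsActive() {
		return nil // account isn't active: drop event
	}
	if !revIsNewer(commit.Rev, u.PrevRev) {
		return nil // rev is old/bad: drop event
	}

	// update user metadata (previous data CID and rev) with the sync info
	if err := bgs.updatePrevState(ctx, u.ID, commitCID, commit.Rev); err != nil {
		return err
	}

	// emit the event on the outbound firehose
	return bgs.emitSyncEvent(ctx, evt) // placeholder helper
}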

return nil
}

if ustatus == events.AccountStatusDeactivated {
Collaborator:

also account status throttled

}
}

if u.GetTombstoned() {
Collaborator:

I'm pretty sure we just don't / shouldn't have a concept of "tombstoned" accounts in the relay. If the DID is tombstoned, that is basically account status deleted

Contributor (Author):

yeah, old code; I basically haven't had time to update the account status flow yet. that part seemed to be in flux up until last week

@bnewbold (Collaborator) left a comment:

it looks like adding per-account rate-limits would currently mean a few slidingwindow.Limiter instances in-process for each of tens of millions of accounts (and growing). do you think that will scale, for us and for others (eg, with less RAM)? if they could expire out of memory, then the total number of counters could be more like "DAU" than "MAU".

the limits we have discussed are total bytes of data, and number of record ops, per account (DID). and then also the number of "expensive" events (like a #sync which is not a no-op; or an #identity event), or "broken" events (MST does not invert, for example).
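One way the "expire out of memory" idea could look, as a rough sketch (fixed windows instead of sliding, illustrative names, only the standard sync and time packages assumed; not code from this PR): counters are keyed by DID, track both event count and bytes, and get evicted once their window expires, so the resident set tracks recently-active accounts rather than every account ever seen.

type didCounter struct {
	windowStart time.Time
	events      int64
	bytes       int64
}

type perDIDLimiter struct {
	mu        sync.Mutex
	window    time.Duration
	maxEvents int64
	maxBytes  int64
	counters  map[string]*didCounter
}

// allow records one event of the given size for a DID and reports whether the
// account is still within its per-window limits.
func (l *perDIDLimiter) allow(did string, size int64) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	c := l.counters[did]
	if c == nil || now.Sub(c.windowStart) > l.window {
		c = &didCounter{windowStart: now}
		l.counters[did] = c
	}
	c.events++
	c.bytes += size
	return c.events <= l.maxEvents && c.bytes <= l.maxBytes
}

// evict drops counters whose window has already expired; their counts no
// longer apply, so dropping them is lossless. Called periodically, it keeps
// the map size closer to "DAU" than "MAU".
func (l *perDIDLimiter) evict() {
	l.mu.Lock()
	defer l.mu.Unlock()
	cutoff := time.Now().Add(-l.window)
	for did, c := range l.counters {
		if c.windowStart.Before(cutoff) {
			delete(l.counters, did)
		}
	}
}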


overall account lifecycle state tracking:

  • should store "local" account state as a string, not boolean flags
  • if local state is "active", then default to upstream state (which may or may not be "active"); otherwise, use local state
  • could store an "until" timestamp in account status, for temporary throttled situations
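That could end up looking roughly like this (field and status names are illustrative, not the PR's schema; assumes the standard time package):

type accountStatus struct {
	localStatus    string     // "active", "throttled", "takendown", ...
	upstreamStatus string     // last status reported by the upstream PDS
	localUntil     *time.Time // optional expiry for temporary states like "throttled"
}

// effective returns the status the relay should act on: an unexpired,
// non-"active" local override wins; otherwise defer to the upstream status.
func (a accountStatus) effective(now time.Time) string {
	if a.localStatus != "active" && (a.localUntil == nil || now.Before(*a.localUntil)) {
		return a.localStatus
	}
	return a.upstreamStatus
}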


type DomainBan struct {
	gorm.Model
	Domain string
Collaborator:

unique constraint on the domain?
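For reference, the suggested constraint could mirror the gorm:"unique" tag used on PDS.Host earlier in this PR (a sketch of the change, not the committed code):

type DomainBan struct {
	gorm.Model
	Domain string `gorm:"unique"`
}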

return u.UpstreamStatus
}

type UserPreviousState struct {
Collaborator:

can this just be merged onto the existing Users table? I guess this table has higher churn and it might be good to keep that separate

Contributor (Author):

yeah, that's my motivation. I think Postgres works better when the high-churn data lives in separate, smaller rows, apart from the relatively low-churn data.


type UserPreviousState struct {
	Uid models.Uid   `gorm:"column:uid;primaryKey"`
	Cid models.DbCID `gorm:"column:cid"`
Collaborator:

I think a better column name would be good; data_cid? commit_data or commit_data_cid?
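With one of the suggested names, the tag would become something like this (the column name is still a suggestion, not the final schema):

type UserPreviousState struct {
	Uid models.Uid   `gorm:"column:uid;primaryKey"`
	Cid models.DbCID `gorm:"column:commit_data_cid"`
}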


// createExternalUser is a mess and poorly defined
// did is the user
// host is the PDS we received this from, not necessarily the canonical PDS in the DID document
// TODO: rename? This also updates users, and 'external' is an old phrasing
Collaborator:

seems like a good time to rename this, can update comments around what it does.

return nil, fmt.Errorf("cannot create user on pds with banned domain")
}

if peering.ID == 0 {
Collaborator:

I assume this is old code, maybe from the days when BGS had indexing and would "spider" to discover accounts? or if there is a bulk catch-up happening?

would be good if you could clarify what the expected behavior is here ("maybe this never happens").

Contributor (Author):

I did a bunch of reorg and commenting on this function as a whole and I think it's better now

}

err = bgs.db.Transaction(func(tx *gorm.DB) error {
res := tx.Model(&models.PDS{}).Where("id = ? AND repo_count < repo_limit", peering.ID).Update("repo_count", gorm.Expr("repo_count + 1"))
Collaborator:

I don't really understand the repo_count < repo_limit clause here... is this how account limits are enforced? the increment fails if the clause filters out the relevant row?

my intuition is that the account should be created (inserted), but with local account state "throttled", if the overall PDS is over quota, instead of being silently dropped; the latter is hard for external folks to debug.
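For clarity, the conditional UPDATE enforces the cap by checking rows affected: the WHERE clause only matches while the PDS is under its limit, so zero rows affected means the cap was hit. A sketch (the sentinel error name is hypothetical) of surfacing that instead of dropping silently:

err = bgs.db.Transaction(func(tx *gorm.DB) error {
	res := tx.Model(&models.PDS{}).
		Where("id = ? AND repo_count < repo_limit", peering.ID).
		Update("repo_count", gorm.Expr("repo_count + 1"))
	if res.Error != nil {
		return res.Error
	}
	if res.RowsAffected == 0 {
		// PDS is at its cap: nothing was incremented. Surfacing this (or
		// creating the account as "throttled") is easier to debug than a
		// silent drop.
		return ErrNewUsersThrottled // hypothetical sentinel error
	}
	// ... create the new user row within the same transaction ...
	return nil
})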

Contributor (Author):

there's a potential waste-our-storage attack from an evil PDS sending one event each from a zillion different DIDs; but aside from that I like your approach. We could try for both, with a basic limit and a hard limit? DIDs beyond the basic limit get #throttled, DIDs beyond the hard limit get ignored?
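The two-threshold idea sketched in Go (field names like HardRepoLimit and the Status string are illustrative, not existing fields):

switch {
case pds.RepoCount >= pds.HardRepoLimit:
	// hard limit: refuse to even create the account, bounding storage growth
	return nil, fmt.Errorf("PDS %s is over its hard account limit", pds.Host)
case pds.RepoCount >= pds.RepoLimit:
	// basic limit: create the account, but locally mark it throttled so its
	// events are not emitted until the operator raises the limit
	newUser.Status = "throttled"
}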

cbor "github.com/ipfs/go-ipld-cbor"
mh "github.com/multiformats/go-multihash"
)

func CborStore(bs blockstore.Blockstore) *cbor.BasicIpldStore {
func CborStore(bs cbor.IpldBlockstore) *cbor.BasicIpldStore {
Collaborator:

I think you should split out these changes to function signatures (here, and indigo:mst and indigo:repo) to a separate PR, if they aren't touched by the indigo:cmd/relay code.
