Add compression for uploaded documentation #780
Conversation
Oh, I just realized this has the wrong content-types on S3: it uses the original content types instead of …
Actually, that seems like a lot of work for no benefit; I don't know that we'll ever actually use the …
Some benchmarks:
The implementation looks good to me! I agree with @Nemo157 that we should store … Also, based on #379 (comment), the two options I was considering were …
The other option that @namibj was investigating was zstd with a custom dictionary optimized for rustdoc output; I can probably adapt the linked benchmarking script to test that as well, with a dictionary created from the docs I have.
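A dictionary like this could be trained and evaluated with the `zstd` crate; the following is a minimal sketch, assuming a local directory of rustdoc output to use as samples (the directory name, dictionary size, and compression level are made up, and the benchmarking script mentioned above may look quite different):

```rust
use std::fs;
use std::io;

fn main() -> io::Result<()> {
    // Collect sample rustdoc pages to train the dictionary on.
    let samples: Vec<Vec<u8>> = fs::read_dir("sample-docs")?
        .filter_map(|entry| entry.ok())
        .map(|entry| fs::read(entry.path()))
        .collect::<Result<_, _>>()?;

    // Train a 16 KiB dictionary from the samples.
    let dict = zstd::dict::from_samples(&samples, 16 * 1024)?;

    // Compare compressed sizes with and without the dictionary.
    for sample in &samples {
        let plain = zstd::encode_all(&sample[..], 5)?;

        let mut with_dict = Vec::new();
        let mut enc = zstd::Encoder::with_dictionary(&mut with_dict, 5, &dict)?;
        io::copy(&mut &sample[..], &mut enc)?;
        enc.finish()?;

        println!(
            "{} bytes -> plain {} / dict {}",
            sample.len(),
            plain.len(),
            with_dict.len()
        );
    }
    Ok(())
}
```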
Oops, the benchmarks before got cut off. Here are some taken with a desktop instead of a laptop:
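For the level comparison itself, a small script along these lines can report size and timing per level (a sketch using the `zstd` crate; the sample path and the levels tried are hypothetical):

```rust
use std::fs;
use std::io;
use std::time::Instant;

fn main() -> io::Result<()> {
    let input = fs::read("sample-docs/index.html")?;
    for &level in &[1, 3, 5, 9, 19] {
        let start = Instant::now();
        let compressed = zstd::encode_all(&input[..], level)?;
        println!(
            "level {:>2}: {} -> {} bytes ({:.1}%) in {:?}",
            level,
            input.len(),
            compressed.len(),
            100.0 * compressed.len() as f64 / input.len() as f64,
            start.elapsed(),
        );
    }
    Ok(())
}
```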
Another idea might be to store a set of all the encodings used for a particular release into the …
This sounds like we'd only use it for metrics? I'm not opposed in principle, but storing the same data in two places risks it getting out of sync.
Metrics don't need to be a source of complexity; we can just as easily have an …
It's not really for metrics like what goes in Grafana, more for future visibility into the usage distribution. Especially if we migrate dictionaries in the future, knowing how many releases use which dictionaries could be useful to determine whether it seems feasible to do something like migrating all instances of a certain dictionary to a new one so it can be removed from the application. Given that releases are basically read-only currently, I don't think there's much chance of this data getting out of sync.
How would that interact with us deleting docs in the future (if we ever decided to do that)?
No effect, we'd just delete the rows out of the database as well.
@Nemo157 a possible issue with this is that it prevents using different compression algorithms for different files within a release. Does that sound worth it? Otherwise we'd have to store metadata for each file individually in the database, which sort of defeats the point of an S3 bucket.
That's why I mention a set of encodings: we don't need to know exactly which files use which encoding, just that this set of encodings is enough to access all files in the release (and if we want to know in detail, we would have to query the metadata on all the files).
I added the set of compression algorithms used as a many-to-many table in the database, with a unique constraint so we don't end up with duplicates by accident. I'm open to suggestions for more tests, but otherwise I think I addressed all the comments.
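For reference, the relation table could look roughly like the sketch below. The table, column, and constraint names match the Postgres error quoted further down in this thread; the foreign-key target, the integer types, and the use of the `postgres` crate here are assumptions rather than the PR's actual migration:

```rust
// Hypothetical sketch of the relation table; the UNIQUE constraint is what
// yields the default constraint name compression_rels_release_algorithm_key.
fn create_compression_rels(client: &mut postgres::Client) -> Result<(), postgres::Error> {
    client.batch_execute(
        "CREATE TABLE compression_rels (
             release INT NOT NULL REFERENCES releases(id),
             algorithm INT,
             UNIQUE (release, algorithm)
         );",
    )
}
```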
As to using a custom zstd dictionary: it sounds like the rust …
I think it definitely makes sense to start with … Doing it in two steps also gives us the chance to see any performance impacts of each step. One last thought I had was whether there should be a config var to disable compression, so that if there are any issues in production it can just be toggled off rather than having to do a rollback.
It doesn't take very long to revert, about 5 minutes. In general I like the idea of having feature gates instead of "it's compiled in or it's not", but I'm a little wary of toggling features with environment variables; I'd rather come up with something more principled.
- Don't require writing a database migration
- Make tests not compile if they aren't exhaustive
Let me know if there are any more changes that need to be made; I think I addressed all the review comments.
I just found that this will prevent us from updating builds if we rebuild a crate:
I think I need to add an …
Previously, postgres would give a duplicate key error when a crate was built more than once:

```
thread 'main' panicked at 'Building documentation failed: Error(Db(DbError { severity: "ERROR", parsed_severity: Some(Error), code: SqlState("23505"), message: "duplicate key value violates unique constraint \"compression_rels_release_algorithm_key\"", detail: Some("Key (release, algorithm)=(1, 0) already exists."), hint: None, position: None, where_: None, schema: Some("public"), table: Some("compression_rels"), column: None, datatype: None, constraint: Some("compression_rels_release_algorithm_key"), file: Some("nbtinsert.c"), line: Some(434), routine: Some("_bt_check_unique") }))', src/bin/cratesfyi.rs:321:21
```

Now, duplicate keys are discarded.
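One common way to make Postgres discard the duplicate row instead of raising this error is an `ON CONFLICT DO NOTHING` clause on the insert; a minimal sketch with the `postgres` crate (the query and function are illustrative and may differ from the PR's actual code):

```rust
use postgres::Client;

// Insert the (release, algorithm) pair, discarding duplicates instead of
// erroring when a crate is rebuilt.
fn add_compression_rel(
    client: &mut Client,
    release: i32,
    algorithm: i32,
) -> Result<(), postgres::Error> {
    client.execute(
        "INSERT INTO compression_rels (release, algorithm)
         VALUES ($1, $2)
         ON CONFLICT DO NOTHING",
        &[&release, &algorithm],
    )?;
    Ok(())
}
```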
This needs some tests, but otherwise I think it should be fine as is; it was super simple after #643 :)
This uses zstd 5, but it's very easy to change the compression now and reasonably easy to change it after it's merged (we just have to change `compression` from a boolean into an enum and continue supporting `zstd`).

This compresses transparently, so that calling `backend.get()` automatically decompresses files as needed. If the files were not compressed before uploading, they are not decompressed when downloading.

Closes #379
cc @namibj
r? @pietroalbini
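As a rough illustration of the transparent behaviour described in the PR summary above, the get path only has to branch on whether the stored object was compressed; all type and function names below are hypothetical, not the PR's actual API:

```rust
use std::io;

// Hypothetical shape of a stored object; the `compression` field mirrors the
// PR's idea of starting with a single algorithm and later growing into an enum.
enum CompressionAlgorithm {
    Zstd,
}

struct Blob {
    content: Vec<u8>,
    compression: Option<CompressionAlgorithm>,
}

// What a backend `get` could do after fetching the raw bytes: decompress only
// when the object was stored compressed, otherwise return it untouched.
fn decompress_if_needed(blob: Blob) -> io::Result<Vec<u8>> {
    match blob.compression {
        Some(CompressionAlgorithm::Zstd) => zstd::decode_all(&blob.content[..]),
        None => Ok(blob.content),
    }
}
```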