File deduplication (#6332)

When receiving messages, blobs will be deduplicated with the new
function `create_and_deduplicate_from_bytes()`. For sending files, this
adds a new function `set_file_and_deduplicate()` instead of
deduplicating by default.

This is for
https://github.com/deltachat/deltachat-core-rust/issues/6265; read the
issue description there for more details.

TODO:
- [x] Set files as read-only
- [x] Don't do a write when the file is already identical
- [x] The first 32 chars or so of the 64-character hash are enough. I
calculated that if 10b people (i.e. all of humanity) use DC, and each of
them has 200k distinct blob files (I have 4k in my day-to-day account),
and we used 20 chars, then the expected value for the number of name
collisions would be ~0.0002 (and the probability that there is a least
one name collision is lower than that) [^1]. I added 12 more characters
to be on the super safe side, but this wouldn't be necessary and I could
also make it 20 instead of 32.
- Not 100% sure whether that's necessary at all - it would mainly be
necessary if we might hit a length limit on some file systems (the
blobdir is usually sth like
`accounts/2ff9fc096d2f46b6832b24a1ed99c0d6/dc.db-blobs` (53 chars), plus
64 chars for the filename would be 117).
- [x] "touch" the files to prevent them from being deleted
- [x] TODOs in the code

For later PRs:
- Replace `BlobObject::create(…)` with
`BlobObject::create_and_deduplicate(…)` in order to deduplicate
everytime core creates a file
- Modify JsonRPC to deduplicate blob files
- Possibly rename BlobObject.name to BlobObject.file in order to prevent
confusion (because `name` usually means "user-visible-name", not "name
of the file on disk").

[^1]: Calculated with both https://printfn.github.io/fend/ and
https://www.geogebra.org/calculator, both of which came to the same
result
([1](https://github.com/user-attachments/assets/bbb62550-3781-48b5-88b1-ba0e29c28c0d),

[2](https://github.com/user-attachments/assets/82171212-b797-4117-a39f-0e132eac7252))

---------

Co-authored-by: l <link2xt@testrun.org>
This commit is contained in:
Hocuri
2025-01-21 19:42:19 +01:00
committed by GitHub
parent 22a7cfe9c3
commit 65a9c4b79b
23 changed files with 583 additions and 240 deletions

View File

@@ -896,13 +896,12 @@ impl ChatId {
.context("no file stored in params")?;
msg.param.set(Param::File, blob.as_name());
if msg.viewtype == Viewtype::File {
if let Some((better_type, _)) =
message::guess_msgtype_from_suffix(&blob.to_abs_path())
// We do not do an automatic conversion to other viewtypes here so that
// users can send images as "files" to preserve the original quality
// (usually we compress images). The remaining conversions are done by
// `prepare_msg_blob()` later.
.filter(|&(vt, _)| vt == Viewtype::Webxdc || vt == Viewtype::Vcard)
if let Some((better_type, _)) = message::guess_msgtype_from_suffix(msg)
// We do not do an automatic conversion to other viewtypes here so that
// users can send images as "files" to preserve the original quality
// (usually we compress images). The remaining conversions are done by
// `prepare_msg_blob()` later.
.filter(|&(vt, _)| vt == Viewtype::Webxdc || vt == Viewtype::Vcard)
{
msg.viewtype = better_type;
}
@@ -2722,8 +2721,7 @@ async fn prepare_msg_blob(context: &Context, msg: &mut Message) -> Result<()> {
// Typical conversions:
// - from FILE to AUDIO/VIDEO/IMAGE
// - from FILE/IMAGE to GIF */
if let Some((better_type, _)) = message::guess_msgtype_from_suffix(&blob.to_abs_path())
{
if let Some((better_type, _)) = message::guess_msgtype_from_suffix(msg) {
if msg.viewtype == Viewtype::Sticker {
if better_type != Viewtype::Image {
// UIs don't want conversions of `Sticker` to anything other than `Image`.
@@ -2753,8 +2751,14 @@ async fn prepare_msg_blob(context: &Context, msg: &mut Message) -> Result<()> {
&& (msg.viewtype == Viewtype::Image
|| maybe_sticker && !msg.param.exists(Param::ForceSticker))
{
blob.recode_to_image_size(context, &mut maybe_sticker)
let new_name = blob
.recode_to_image_size(
context,
msg.get_filename().unwrap_or_else(|| "file".to_string()),
&mut maybe_sticker,
)
.await?;
msg.param.set(Param::Filename, new_name);
if !maybe_sticker {
msg.viewtype = Viewtype::Image;
@@ -2771,7 +2775,7 @@ async fn prepare_msg_blob(context: &Context, msg: &mut Message) -> Result<()> {
}
if !msg.param.exists(Param::MimeType) {
if let Some((_, mime)) = message::guess_msgtype_from_suffix(&blob.to_abs_path()) {
if let Some((_, mime)) = message::guess_msgtype_from_suffix(msg) {
msg.param.set(Param::MimeType, mime);
}
}