Subject:

PoC Watermark Attack on Restic


Date: Message-Id: https://www.5snb.club/posts/2023/poc-watermark-attack/
Tags: #security(4)

First off, does this matter to you? No. No it doesn’t. Unless you’re backing up gigabytes of completely attacker controlled data, to an attacker controlled service, and need to ensure they don’t know you’re backing up said data, it Doesn’t Fucking Matter.

With that said, it’s a somewhat neat attack!

A watermarking attack is when an attacker who can get you to store an attacker-controlled piece of data can then detect the presence of that attacker controlled data. It’s not a huge deal, but is a concern if someone is able to inject a watermark into, say, copywritten or leaked content, and then automatically terminate the cloud storage/backup accounts of users that can be shown to have that data on their drive.

Restic uses Content-Defined-Chunking with Rabin Fingerprints with a secret, random polynomial in order to make watermarking attacks harder. (And, of course, the repo is encrypted, otherwise this attack doesn’t make sense to try to do, just read off the data directly).

This attack does not go after that. Instead, I simply repeat a given chunk of data many times in a row. This means that Restic will tend to create chunks that will have that size, or a multiple of it. Which means the attacker can use the size of the chunks they’re forcing the user to create to encode data.

Anyways, this does work… just not very well :P

A transcript of running it is below. The check.rs tool only accesses what metadata would be visible to an attacker without the password to an encrypted repository, that being object sizes.

running attacker file with tags 1000 1000 2000 2000
-rw-r--r-- 1 jess jess 801M May 19 23:10 attacker.bin
open repository
repository e794b424 opened (version 2, compression level off)
created new cache in /home/jess/.cache/restic
lock repository
no parent snapshot found, will read all files
load index files
start scan on [attacker.bin]
start backup on [attacker.bin]
scan finished in 0.224s: 1 files, 800.732 MiB

Files:           1 new,     0 changed,     0 unmodified
Dirs:            0 new,     0 changed,     0 unmodified
Data Blobs:    528 new
Tree Blobs:      1 new
Added to the repository: 795.915 MiB (795.935 MiB stored)

processed 1 files, 800.732 MiB in 0:04
snapshot b3f64a9e saved
[src/bin/check.rs:22] &sz = [
    (
        16777390,
        8,
    ),
    (
        18045195,
        7,
    ),
    (
        18034195,
        6,
    ),
]
./check.sh  15.01s user 3.68s system 137% cpu 13.543 total

And for reference, this was ran with restic 0.15.2 compiled with go1.20.3 on linux/amd64

At the end, the first number is the file size, and the second number is how many chunks of that size appeared.

Here, the attacker generated 800MB of data, and we can see 3 interesting sizes. The two bottom ones are the real tags, the top one is noise. The attacker generated chunks with the tags 1000 and 2000, which are bytes added to the base, so we would expect to see a multiple of that many bytes difference in size between them. And we do, they’re 18MB large, but the difference is 11×1000=11000 bytes.

Compression doesn’t really affect the attack since the data being inserted is random data, so Restic won’t try and compress it.

See watermark-attack.zip for the test files, under CC0. It requires Rust, but should work on any vaguely modern compiler. Or just rewrite the code in whatever language you prefer, it’s very trivial.

Future research might include trying to automatically recover the fingerprint polynomial so a watermark attack can be pulled off much more directly. That would require interactivity, where you backup attacker controlled data repeatedly.

What if you have webserver logs that you’re backing up, is that a viable attack to confirm that 5snb.club, or iykpqm7jiradoeezzkhj7c4b33g4hbgfwelht2evxxeicbpjy44c7ead.onion is some specific account on a online backup service? (Not that I keep logs anyways :P)

Mitigations might be to pad chunks as they’re created, though that comes with a size cost. See Optimally Hiding Object Sizes with Constrained Padding, perhaps. That wouldn’t be a hard solution to the problem, but would make this harder to pull off.