What is deduplication?
Deduplication is avoiding storage of multiple copies of the same data.
Cloud storage service providers such as Dropbox, Mozy, and others perform deduplication to save space by only storing one copy of each file uploaded.
Can deduplication be combined with encryption?
Deduplication does not work with conventional encryption: if clients locally encrypt their files with their individual keys, different ciphertexts emerge even for identical plaintexts, making deduplication impossible.
Convergent encryption is a widely used technique that combines the storage savings of deduplication with decent privacy guarantees for the data.
In convergent encryption, the message is encrypted under a key derived by hashing the message itself, along with a public parameter.
For example, to encrypt M, we derive K ← H(P, M) and compute
C ← Enc(K, M, IV_0).
Here, P is a 128-bit
string picked at random and fixed throughout the system, IV_0 is the all-zero initialization vector,
H could be implemented with SHA-256,
and Enc with AES-128 in CTR mode.
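The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the DupLESS implementation: to stay dependency-free, it stands in for AES-128-CTR with a toy stream cipher built from SHA-256 in counter mode, keeping the property that matters for deduplication (encryption is deterministic given the message).

```python
import hashlib

P = bytes(16)  # public parameter P; random and fixed system-wide (all-zero here for brevity)

def convergent_key(message: bytes) -> bytes:
    """K <- H(P, M): derive the encryption key from the message itself."""
    return hashlib.sha256(P + message).digest()[:16]  # truncate to a 128-bit key

def keystream(key: bytes, length: int) -> bytes:
    # Toy stand-in for AES-128 in CTR mode with an all-zero IV:
    # SHA-256 over (key, counter) generates the keystream blocks.
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(16, "big")).digest()
        counter += 1
    return out[:length]

def encrypt(message: bytes) -> bytes:
    key = convergent_key(message)
    return bytes(m ^ k for m, k in zip(message, keystream(key, len(message))))

def decrypt(key: bytes, ciphertext: bytes) -> bytes:
    return bytes(c ^ k for c, k in zip(ciphertext, keystream(key, len(ciphertext))))
```

Because the key and IV are fixed functions of the message, two clients encrypting the same file produce byte-identical ciphertexts, which is exactly what lets the server deduplicate them.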
Message-locked encryption is a cryptographic primitive that broadens convergent encryption and captures the properties needed for enabling deduplication
over ciphertexts (see the Eurocrypt 2013 paper for more on message-locked encryption).
However, convergent encryption (and message-locked encryption in general) is inherently subject to brute-force attacks that can recover files falling into a known set.
If a file M is picked from a set S, then given a convergent encryption ciphertext C of M,
an attacker can trial-decrypt with each candidate file M' in S by deriving K' ← H(P, M') and checking
whether M' = Dec(K', C).
How serious are brute-force attacks on convergent encryption?
The brute-force attack described above can reach high speeds (up to 12000 attempts per second on a single processor, see the DupLESS paper for details),
especially if S consists of short files. Moreover, the attack is completely parallelizable.
How does DupLESS protect against brute-force attacks?
DupLESS was built from the observation that the brute-force attack can be dealt with
by using a key server (KS) to derive keys, instead of setting keys to be hashes of messages.
Consider an enterprise
network, consisting of a group of affiliated clients (for
example, employees of a company) using an external deduplicated cloud storage service. Along with the clients, the network also includes
a key server (KS) that derives keys for the clients. Access to the KS is
preceded by authentication, which stops external attackers. The increased cost slows down brute-force attacks
from compromised clients, and now the KS can function as a (logically) single point of control for implementing rate-limiting measures.
Against attackers who cannot authenticate themselves to the KS, DupLESS provides strong security.
(Effectively, semantic security, except that ciphertexts
leak equality of the underlying plaintexts. The latter is
necessary for deduplication.)
Even if the attacker compromises a client and gains
access to the KS, the attacker has to launch an online attack, interacting with the KS
for each attempt. Moreover, the KS limits the number of queries an attacker can make in
a given time period. The limit is set after analysing general usage patterns in the system:
for example, if no client performs more than 2 million writes per week, and the pattern
keeps repeating on a weekly basis, the KS will not reply to more than 2 million queries in a week.
Effectively, this brings the brute-force attack down to about 3 queries per second (2 million queries spread over a week), assuming the size of S is much larger than 2 million.
Finally, even if the KS is compromised, DupLESS falls back to a usual convergent encryption style guarantee.
How much overhead will the system incur with DupLESS?
DupLESS incurs only slight overheads
compared to using no encryption at all.
We deployed a prototype
KS on Amazon EC2 and experimentally evaluated
its performance in a series of experiments. For example, in one experiment, we uploaded a
1 MB file to Dropbox, first without any encryption, and then with DupLESS. The bandwidth overhead
is less than 1% and the overhead in the time to store a
file is about 17%.