
How CouchDB Prevents Data Corruption: Checksums

Posted Wednesday, January 22, 2025 by The Neighbourhoodie Team

CouchDB is your data’s safe place. It does everything in its power to avoid losing any of your data. However, some circumstances are outside CouchDB’s control.

One of those circumstances is disk corruption.

Disk corruption can happen on any storage device, whether it’s an SSD or a hard drive. Luckily, CouchDB is designed with disk corruption in mind. All data in CouchDB is stored in so-called blocks: 4K chunks of data that CouchDB knows how to find again later, when you read from the database.

When storing a block, CouchDB not only stores the block data, but also a checksum calculated from the block data. Later, when CouchDB reads a block, it also reads the checksum, calculates a checksum from the block it just read and compares it to the checksum that was stored on disk.

If the checksums do not match, CouchDB has detected that something changed the bytes in the block that was just read, and logs an exception.
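The write-then-verify cycle described above can be sketched in a few lines of Python. This is only an illustration of the idea, not CouchDB's Erlang implementation: the record layout (a 16-byte checksum followed by the data) and the function names write_block and read_block are assumptions made for this example.

```python
import hashlib

BLOCK_SIZE = 4096  # 4K blocks, as described in the text
CHECKSUM_SIZE = 16  # an MD5 digest is 16 bytes


def write_block(data: bytes) -> bytes:
    """Return a record to store: checksum followed by the block data.

    The layout is hypothetical and chosen only for this sketch.
    """
    checksum = hashlib.md5(data).digest()
    return checksum + data


def read_block(record: bytes) -> bytes:
    """Recompute the checksum and compare it to the stored one."""
    stored, data = record[:CHECKSUM_SIZE], record[CHECKSUM_SIZE:]
    if hashlib.md5(data).digest() != stored:
        raise ValueError("checksum mismatch: block is corrupted")
    return data


block = bytes(range(256)) * 16  # 4096 bytes of sample data
record = write_block(block)
assert read_block(record) == block  # an intact block reads back fine

# Flip a single bit in the data portion to simulate disk corruption.
damaged = bytearray(record)
damaged[100] ^= 0x01
try:
    read_block(bytes(damaged))
    detected = False
except ValueError:
    detected = True

print("corruption detected:", detected)
```

Even a single flipped bit changes the recomputed checksum, which is why the comparison on read is enough to notice that the bytes on disk are no longer the bytes that were written.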

When you encounter this in view index files, you can delete the files and rebuild your indexes.

Before this happens in your database files, make sure your backups are up to date and work properly when restored.

Should this happen to your database files and you do not have a backup, do get in touch. We might be able to help with our specialised data recovery tooling.

Checksums in 3.4.2

CouchDB version 3.4.2, released in September 2024, offers changes to how checksums work. Most notably, checksums can now be computed with xxHash (128 bit) instead of MD5.

xxHash has the advantage of being faster, especially for larger documents, while being no slower for small and medium documents.

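To give you an idea of the performance benefits of xxHash, here are timings from an Erlang shell session, each hashing a 4K block (bound to B) one million times. The first run measures the overhead of the empty loop. Each result is the total time in seconds for one million calls, which is numerically the same as microseconds per call: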

(node1@127.0.0.1)20> f(T), {T, ok} = timer:tc(fun() -> lists:foreach(fun (_) -> do_nothing_overhead end, lists:seq(1, 1000000)) end), (T/1000000.0).
0.167425

(node1@127.0.0.1)21> f(T), {T, ok} = timer:tc(fun() -> lists:foreach(fun (_) -> exxhash:xxhash128(B) end, lists:seq(1, 1000000)) end), (T/1000000).
0.770687

(node1@127.0.0.1)22> f(T), {T, ok} = timer:tc(fun() -> lists:foreach(fun (_) -> crypto:hash(md5, B) end, lists:seq(1, 1000000)) end), (T/1000000).
6.205445

Roughly 10x, once the loop overhead from the first measurement is subtracted! Of course, this is just a microbenchmark that doesn’t take into account everything CouchDB does when reading from and writing to disk. It only shows that there are gains to be had.
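For readers who want to check the arithmetic, here is a small Python sketch that turns the three shell results above into per-call timings. The only inputs are the numbers printed in the session. Subtracting the empty-loop time is an assumption about how best to compare the two hashes, not something the benchmark itself does.

```python
ITERATIONS = 1_000_000

# Total seconds for one million calls, copied from the shell session.
overhead_s = 0.167425  # empty loop body
xxhash_s = 0.770687    # exxhash:xxhash128(B)
md5_s = 6.205445       # crypto:hash(md5, B)


def per_call_us(total_s: float) -> float:
    """Microseconds per hash call, with the loop overhead removed."""
    return (total_s - overhead_s) / ITERATIONS * 1_000_000


xxhash_us = per_call_us(xxhash_s)
md5_us = per_call_us(md5_s)
speedup = md5_us / xxhash_us

print(f"xxHash:  {xxhash_us:.3f} us per 4K block")
print(f"MD5:     {md5_us:.3f} us per 4K block")
print(f"speedup: {speedup:.1f}x")
```

With the overhead removed, xxHash comes out at about 0.6 microseconds per 4K block and MD5 at about 6 microseconds. Without the subtraction, the raw ratio is closer to 8x.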

If you’re upgrading to 3.4.2, there are some things to take note of:

  • Writing xxHash checksums is disabled by default. This is so that when you need to downgrade CouchDB, the old version is able to read the MD5 checksums. Otherwise, these would be read as corrupted data.
  • In a future version, the system will be able to recognise both MD5 and xxHash checksums, trivialising the impact of downgrading on checksums. xxHash will eventually be on by default.
  • If you are confident that you won’t need to downgrade your system, you can already switch to xxHash and benefit from the performance boost.
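To make the downgrade concern in the list above concrete, here is a hedged Python sketch. It is not CouchDB's code. The function names are hypothetical, and because xxHash is not part of Python's standard library, a 128-bit BLAKE2b digest stands in for it. Any second 128-bit checksum shows the same effect.

```python
import hashlib


def md5_sum(data: bytes) -> bytes:
    return hashlib.md5(data).digest()


def new_sum(data: bytes) -> bytes:
    # Stand-in for xxHash (128 bit): NOT the real algorithm, just another
    # 16-byte digest that an MD5-only reader cannot reproduce.
    return hashlib.blake2b(data, digest_size=16).digest()


def old_reader_ok(data: bytes, stored: bytes) -> bool:
    """Models an older release that only knows MD5 checksums."""
    return md5_sum(data) == stored


def dual_reader_ok(data: bytes, stored: bytes) -> bool:
    """Models a reader that accepts either checksum algorithm."""
    return stored in (md5_sum(data), new_sum(data))


block = b"x" * 4096
written_with_md5 = md5_sum(block)
written_with_new = new_sum(block)

# The block itself is intact in every case below.
print(old_reader_ok(block, written_with_md5))   # MD5 checksum: accepted
print(old_reader_ok(block, written_with_new))   # new checksum: looks corrupted
print(dual_reader_ok(block, written_with_new))  # dual reader: accepted
```

The second call is the downgrade scenario: the data is fine, but a reader that only knows MD5 cannot tell a different checksum algorithm apart from corruption.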

We’ve written about other considerations to take into account when upgrading to 3.4.2, and also have answers to some popular questions about the new version and the features it comes with. If you want to keep up with our recommendations for future CouchDB releases, join our newsletter.
