
How CouchDB Prevents Data Corruption: fsync

Posted Wednesday, February 26, 2025 by The Neighbourhoodie Team

Programming can be exciting when the fundamentals you’ve been operating under suddenly come into question. Especially when it comes to safely storing data. This is a story of how the CouchDB developers had a couple of hours of excitement making sure their fundamentals were solid (and your data was safe).

Modern software projects are large enough that a single person is unlikely to fit all of their constituent parts in working memory. As developers, we have to be okay with selectively forgetting how some parts of the program we are currently working on work in order to make progress on others.

Countless programming techniques as old as time itself (01.01.1970) help with this phenomenon and are commonly categorised as abstractions. As programmers we build ourselves abstractions in order to be able to safely forget how some parts of a program work.

An abstraction is a piece of code, a module, or a library with a public API that tells us what we can do with it and which guarantees we can rely on, without having to remember how it works internally. Say a module has a function makeBlue(thing): you don’t necessarily have to remember how the function makes thing blue; all you need to know is that it does.

CouchDB is not a particularly large piece of software, but it is a relatively long-running one, having been started in 2005. Certain parts of CouchDB are relatively old: they solve a specific problem, we worked hard at the time to make sure we solved that problem good and proper, and now all we, the CouchDB developers, remember is that we did solve it and that we can trust it. After that, we don’t have much need to reevaluate the code in the module on an ongoing basis, so we are prone to forget the specific details of how it works.

Old Assumptions Meet New Information

One consequence of this is that if new information appears that might affect the design of the old and trusted module, you have to scramble to re-understand all the details to see how the module fares in light of the new information.

This happened the other week when the CouchDB developers came across the second part of Justin Jaffray’s “NULL BITMAP Builds a Database” series: “#2: Enter the Memtable”. In it, Justin describes three scenarios for how data is written to disk under certain failure conditions and evaluates what that means for writing software that does not want to lose any data (you know, a database).

CouchDB has long prided itself on doing everything in its power to not lose any data, going above and beyond to keep your data safe even in rare edge-cases. Some other databases do not go as far as CouchDB goes.

For a moment, the CouchDB development team had so collectively expunged the details of how CouchDB keeps data safe on disk that we could not immediately tell whether CouchDB was susceptible to data loss in the specific scenario outlined by Justin.

To understand the scenario, we have to explain how Unix systems, and especially Linux, read and write data to disk. Before we go there though, rest assured this had us sweating for a hot minute. The CouchDB dev team literally stopped any other work and got together to sort out whether there was something we had to do. Data safety truly is a top priority.

The Art of Reading and Writing Data to Disk

For Unix programs to operate on files, they have to acquire a file descriptor (a file handle) with the open syscall. Once acquired, the program can use the file descriptor to read or write any data it likes by specifying an offset and a length, both in bytes, which describe where in the file and how much of it should be read or written.
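
To make this concrete, here is a minimal C sketch (the file name is made up and error handling is kept to a minimum) of reading and writing at explicit byte offsets:

  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void) {
      /* acquire a file descriptor for a (hypothetical) file */
      int fd = open("example.db", O_RDWR | O_CREAT, 0644);
      if (fd < 0) { perror("open"); return 1; }

      /* write 5 bytes starting at offset 0 */
      const char data[] = "hello";
      if (pwrite(fd, data, strlen(data), 0) < 0) { perror("pwrite"); return 1; }

      /* read the same 5 bytes back from offset 0 */
      char buf[6] = {0};
      if (pread(fd, buf, 5, 0) < 0) { perror("pread"); return 1; }
      printf("read back: %s\n", buf);

      close(fd);
      return 0;
  }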

The Unix kernel will respond to these syscalls by accessing the filesystem the file lives on. A filesystem’s job is to organise an operating system’s files onto a storage mechanism (NVMe, SSDs, hard drives, block storage etc.) and provide fast and safe access to those files.

Blocks & Pages

All file systems define a block size: a chunk of bytes that is always read or written in bulk. Common block sizes are 4096 bytes or multiples thereof, like 8192 or 16384, sometimes even 128k. These block sizes, or pages, exist so file systems can efficiently make use of all the available storage space.

A consequence of this is that if you just want to read a single byte from storage, the kernel and file system will read at least a page of data and then only return the one byte. Even with the lowest page size of 4096, that’s 4095 bytes read from disk in vain.

As a result, most programs try to avoid reading one byte at a time and instead aim to align their data in a way that maps directly to the page size or multiples thereof. For example, CouchDB uses a 4096 byte page, while PostgreSQL uses 8192.

Up to Eleven

The fundamental trade-off between the various page sizes is latency vs. throughput, at the cost of I/O amplification. In our example earlier, reading a single byte is fastest (i.e. happens with the lowest latency) with a 4096 byte page, at a ~4000x read amplification cost. On the opposite end, reading 1GB of data for a movie stream in 4096 byte chunks has no direct amplification (all bytes read are actually needed), but it requires roughly 250,000 read requests to the file system. A larger page size like 1M will greatly improve streaming throughput.

So there’s value in getting the page size right for the application at hand. For databases this usually means making it as small as possible, since individual records should be returned quickly, without sacrificing too much streaming performance for larger pieces of data.

The Page Cache

The final piece of the puzzle is the page cache. This is the Unix kernel keeping file system pages in memory so it can serve them faster the next time they are requested.

Say you read the page (0,4096) once: the kernel will instruct the filesystem to load the bytes from storage into a kernel memory buffer. When you then read that same page again, the kernel will respond with the in-memory bytes instead of talking to the file system and storage again. And since storage is ~800,000 times slower than main memory, your second read is going to be a lot faster.

The same happens for writing pages: if you write a new page (4096,8192) and then immediately read it again, that read will be very fast indeed, thanks to the page cache.
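
You can see the page cache at work by timing the same read twice. This is only a rough sketch (the file name is made up, and the first read will already be fast if the file happens to be cached), but on a cold file the second read is typically orders of magnitude faster than the first:

  #include <fcntl.h>
  #include <stdio.h>
  #include <time.h>
  #include <unistd.h>

  static double now_ms(void) {
      struct timespec ts;
      clock_gettime(CLOCK_MONOTONIC, &ts);
      return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
  }

  int main(void) {
      char page[4096];
      int fd = open("example.db", O_RDONLY);  /* hypothetical file */
      if (fd < 0) { perror("open"); return 1; }

      double t0 = now_ms();
      pread(fd, page, sizeof page, 0);  /* first read: goes to storage (if not cached) */
      double t1 = now_ms();
      pread(fd, page, sizeof page, 0);  /* second read: served from the page cache */
      double t2 = now_ms();

      printf("first: %.3f ms, second: %.3f ms\n", t1 - t0, t2 - t1);
      close(fd);
      return 0;
  }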

So far so good. How could this go wrong?

When writing a new page, the Unix kernel can choose to write it only into the page cache and then report the write syscall as a success. At that point, the data lives solely in kernel memory, and if the machine this runs on has a sudden power outage, kernel panic or other catastrophic failure, that data will be gone by the time the system has rebooted.

That’s a problem for databases. When a database like CouchDB writes new data to storage, it must make sure the data actually made it to storage in full, in a way that it can guarantee to read again later, even if the machine crashes. For that purpose, the Unix kernel provides another syscall: fsync, which tells the kernel to actually write the data onto storage and not just into the page cache.
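
In code, the difference is a single extra syscall. A minimal sketch (hypothetical file name, partial writes not handled): the record only counts as durable once fsync has returned successfully.

  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void) {
      int fd = open("example.db", O_WRONLY | O_CREAT | O_APPEND, 0644);
      if (fd < 0) { perror("open"); return 1; }

      const char record[] = "new document\n";
      if (write(fd, record, strlen(record)) < 0) { perror("write"); return 1; }

      /* Without this, the record may only live in the page cache and
         would be lost in a power outage or kernel panic. */
      if (fsync(fd) < 0) { perror("fsync"); return 1; }

      close(fd);
      return 0;
  }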

However, because the page cache provides a ludicrous speed improvement, databases aim to not fsync every single page. Instead they try to fsync as little as possible, while making sure data makes it safely to storage.

But what happens if nobody ever calls fsync? Will the data be lost for good? Not quite: the kernel will decide when to flush the cached blocks to disk, typically when the CPU and disk aren’t otherwise busy. If that never happens, the kernel eventually pauses processes that are writing to disk so it can safely flush the cached blocks to storage.

How CouchDB Writes Data to Disk

Heads up: we are going to gloss over a lot of details here to keep this under 50,000 words.

CouchDB database files consist of one or more B+-trees and a footer. On startup, a database file is opened and read backwards until a valid footer is found. That footer contains, among some metadata, a pointer to each of the B+-trees, which are then used to fulfil whatever request for reading or writing data needs to be handled.

When writing new data, CouchDB adds pages with B+-tree nodes to the end of the database file and then writes a new footer after that, which includes a pointer to the newly written B+-tree nodes.

To recap, the steps for reading are:

  1. Open the database.
  2. Read backwards until a valid footer is found.
  3. Traverse the relevant B+-tree to read the data you are looking for.

For writing:

  1. Open the database.
  2. Read backwards until a valid footer is found.
  3. Add new B+-tree nodes to the end of the file.
  4. Add a new footer.

 bt = B+-tree node, f = footer
┌──┬──┬──┬──┬──┬──┬──┬──┐
│  │ ◄┼─ │  │ ◄┼─ │  │  │
│ ◄┼─ │  │  │  │  │ ◄┼─ │               db file
│  │  │  │ ◄┼──┼─ │  │  │
└──┴──┴──┴──┴──┴──┴──┴──┘
 bt bt f  bt bt f  bt f

A database file with three footers, i.e. a file that has received
three writes. The footer includes pointers to B+-tree nodes.

 bt = B+-tree node, f = footer
┌──┬──┬──┬──┬──┬──┬──┬──┌──┌──┌──┐
│  │  │  │  │  │  │  │  │  │ ◄┼─ │
│  │  │  │  │  │  │  │  │ ◄┼──┼─ │      db file
│  │  │  │  │  │  │  │  │  │  │  │
└──┴──┴──┴──┴──┴──┴──┴──└──└──└──┘
 bt bt f  bt bt f  bt f  bt bt f

 The same database file, with two more B+-tree nodes and a new footer.
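
Before we move on, here is a rough C sketch of step 2, the backwards scan for a valid footer. This is not CouchDB’s actual code (CouchDB is written in Erlang) and the footer check is only a placeholder, but it shows the idea:

  #include <stdbool.h>
  #include <unistd.h>

  #define DB_PAGE_SIZE 4096

  /* Placeholder: the real check would verify magic bytes, checksums
     and the pointers stored in the candidate footer. */
  static bool is_valid_footer(const char *page, ssize_t len) {
      (void)page; (void)len;
      return false;
  }

  /* Scan page by page from the end of the file towards the beginning
     and return the offset of the most recent valid footer, or -1. */
  static off_t find_last_footer(int fd) {
      off_t size = lseek(fd, 0, SEEK_END);
      char page[DB_PAGE_SIZE];

      for (off_t off = (size / DB_PAGE_SIZE) * DB_PAGE_SIZE; off >= 0; off -= DB_PAGE_SIZE) {
          ssize_t n = pread(fd, page, DB_PAGE_SIZE, off);
          if (n > 0 && is_valid_footer(page, n))
              return off;
      }
      return -1;  /* no footer found: empty or unusable file */
  }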

The Sad Path

With all this information we can revisit The Sad Path in Justin’s post:

I do a write, and it goes into the log, and then the database crashes before we fsync. We come back up, and the reader, having not gotten an acknowledgment that their write succeeded, must do a read to see if it did or not. They do a read, and then the write, having made it to the OS's in-memory buffers, is returned. Now the reader would be justified in believing that the write is durable: they saw it, after all. But now we hard crash, and the whole server goes down, losing the contents of the file buffers. Now the write is lost, even though we served it!

Let’s translate this to our scenario:

  • “The log” is just “the database file” in CouchDB.
  • A “hard crash” is a catastrophic failure as outlined above.
  • The “file buffers” are the page cache.

In the sad path scenario, we go through the four steps of writing data to storage. Without any fsyncs in place, CouchDB would behave exactly as outlined above. But CouchDB does not behave that way, because it uses fsync strategically. Where exactly?

CouchDB calls fsync after step 3 and again after step 4. This is to make sure that the data referenced in the footer actually ends up in storage before the footer does. That’s because storage is sometimes naughty and reorders writes for performance or just chaos reasons.
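
Expressed as a C sketch (CouchDB itself is written in Erlang, and the buffers here are hypothetical stand-ins for the serialised B+-tree nodes and footer), the commit ordering looks roughly like this:

  #include <stddef.h>
  #include <unistd.h>

  /* Append new B+-tree nodes and a footer, making sure the nodes are
     on storage before the footer that points at them. Returns 0 on
     success, -1 on any failure. Partial writes are not handled here. */
  static int append_and_commit(int fd,
                               const void *nodes, size_t nodes_len,
                               const void *footer, size_t footer_len) {
      if (write(fd, nodes, nodes_len) < 0) return -1;    /* step 3 */
      if (fsync(fd) < 0) return -1;         /* nodes are now on storage */

      if (write(fd, footer, footer_len) < 0) return -1;  /* step 4 */
      if (fsync(fd) < 0) return -1;         /* footer is now on storage */

      return 0;  /* only now do we report success to the caller */
  }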

If CouchDB is terminated before the first fsync, no data is guaranteed to have reached storage. On restart, the previously existing footer will be found and any data it points to can be read. This will not include the write that was just interrupted: no footer pointing to it has been written yet, and the request has not returned with a success to the original caller.

If CouchDB is terminated after the first but before the second fsync, the data will have made it both to the page cache and to disk, but the footer might not have made it yet. If it did not, same as before: the previously existing footer will be found on restart, and the current writer will not have received a successful response. If it did make it, we know because of the first fsync that any data it points to is safely on disk, so we can load it as a valid footer.

But what if the footer makes it to the page cache but not to storage, and we restart CouchDB, read the footer, and retrieve its data from the page cache? The writer could issue a read to see if its data made it and, seeing that it did, not retry the write: boom, we are in the sad path, and if the machine now crashes, that footer is gone. For good. And with it, any pointer to the data that was just written.

However, CouchDB is not susceptible to the sad path, because it issues one more fsync: when opening the database. That fsync causes the footer page to be flushed to storage, and only if that succeeds does CouchDB allow access to the data in the database file (and page cache), because now it knows all data is safely on disk.
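
Sketched in the same hypothetical C style, that safeguard amounts to an fsync between opening the file and serving the first read:

  #include <fcntl.h>
  #include <unistd.h>

  /* Open a database file and make sure everything the page cache may
     still hold for it, including a footer written just before a crash
     of the database process, is flushed to storage before any reads
     are served from it. */
  static int open_database(const char *path) {
      int fd = open(path, O_RDWR);
      if (fd < 0) return -1;

      if (fsync(fd) < 0) {  /* flush any lingering dirty pages */
          close(fd);
          return -1;
      }
      return fd;  /* now it is safe to go looking for the last valid footer */
  }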

Phew!

After working out these details, the CouchDB team could return to their regularly scheduled work items as CouchDB has proven, once again, that it keeps your data safe. No matter what.
