How to Sync Anything
Posted Sunday, April 6, 2025 by James Coglan
In this article I’ll discuss a common naive solution to replication, why it doesn’t work, and what the building blocks of a good solution look like. Having established this theoretical framework, my next article will look at how CouchDB provides many of those building blocks such that replicating from it into any other system is relatively painless.
Something I have seen happen surprisingly often in my career as a developer is that teams end up having to implement replication, but they don’t acknowledge that this is what they’re doing, and so end up with ad-hoc solutions that don’t really work in production. When I’ve worked at product companies this usually takes the form of integrating internal systems, such as caching, search indexing, or making incremental updates in single-page apps. Sometimes it means having to integrate with external services, for example copying a company’s user profiles into a newly adopted CRM system and keeping them up to date going forward.
As a consultant in the public sector, I’ve seen many projects whose entire purpose was to integrate several existing systems and make sure they all agree on the exact state of their data. For example, the permissions for accessing casework documents are driven by roles stored in the HR department’s Active Directory. Or, some accounting information needs to be saved in two places at once because two organisations got merged and are in the process of splicing their systems together. Or, in the tech infrastructure world, you might have a complicated build pipeline involving generation of artifacts from many different input sources and you want it to react to upstream changes promptly.
All of these are, at heart, replication problems: the goal is to make some target data store (a cache, an index, an external HR system, a package repository) in some sense mirror the state of a source data store. I am using the term “data store” in the loosest possible sense here, to mean any stateful representation of some of the information in your system. Often these will be honest-to-god databases — an RDBMS, a document store, etc. — but you should think of any part of your system that accumulates state as a data store in this sense.
There are many different techniques for solving this problem, and one’s choice of solution is often constrained by the fact that many of the products people deploy to store and distribute their system’s data are not designed to make replication easy to implement or reliable to operate, especially when numerous heterogeneous systems need to be integrated. A lot of the more “obvious” solutions are riddled with pitfalls that stem from the fact it is fundamentally quite hard to keep two stores in sync when one or both of them keep changing, when there’s no globally agreed ordering of events, and when updates can fail, get re-ordered or lost. Many of the problems I’ve seen in production happen because of a naive approach to these problems, or because engineers ignore them entirely.
The Easy Solution: ETL
Before getting into the details, I’ll quickly mention one solution that works really well for certain situations. Given that updating the target system incrementally from the source is hard to do correctly, the simplest possible solution is to not bother doing that at all. Instead, you rebuild the target from scratch periodically. You locate all the data of interest in the source, pull it out, transform it as necessary, and write it into the target, which starts from a blank state on each run.
This process is generally known as extract, transform, load (ETL) and is widely used in analytics and data warehousing. It is very easy to make it correctly mirror the state of the source, because the target does not retain state for very long. If you find a bug in the data copying logic, you fix it, deploy the fix, and the state gets corrected the next time the process runs, because you’re building a fresh copy of your data every time.
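As a concrete illustration, here is a minimal sketch of such a rebuild in Python, using SQLite connections to stand in for the source and target. The table names, columns, and transform are hypothetical, and real ETL pipelines will usually use dedicated tooling rather than a hand-written script like this:

    import sqlite3

    def etl_run(source_conn, target_conn):
        # Extract: scan *all* the data of interest from the source.
        rows = source_conn.execute("SELECT id, email, name FROM users").fetchall()

        # Start the target from a blank state, so a bug in an earlier run
        # cannot persist into this one.
        target_conn.execute("DROP TABLE IF EXISTS users")
        target_conn.execute("CREATE TABLE users (id INTEGER, display TEXT)")

        # Transform each row into the target's representation, and load it.
        for id_, email, name in rows:
            target_conn.execute(
                "INSERT INTO users VALUES (?, ?)", (id_, f"{name} <{email}>")
            )
        target_conn.commit()

    # e.g. run nightly:
    # etl_run(sqlite3.connect("app.db"), sqlite3.connect("warehouse.db"))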
The Pros and Cons of ETL
The downside of this approach is that it’s very slow, because you have to scan all your source data every time. In fact it gets slower as your system grows, which can produce costs that don’t scale in the same way as your revenue sources. It also has bad consistency; if the target is rebuilt nightly or weekly, then it is never fully in sync with the source, and may even be internally inconsistent as your source data changes while it’s being copied.
For some applications, this is not a problem. Many analytics use cases do not benefit from real-time access to data and may even give worse results in this mode. They can also make use of cheaper archival storage technology and specialised processing tools so that their operational cost doesn’t grow too badly as your source data grows.
There is also an operational benefit that comes from decoupling. The ETL approach typically doesn’t need custom logic built into application code to make it work. Application developers can get on with building features, while a separate team maintains the system that extracts what they want from the application database, as long as the data extraction does not cause excessive load and interfere with end user traffic. This lets both groups operate quite independently without causing one another friction.
So, if you don’t mind the target being slightly stale, and taking a long time to update, ETL is a great solution. But what happens when that’s not an option?
Ad-hoc Replication
ETL may be conceptually very simple, but it has some major downsides. It cannot maintain real-time consistency with the source system, and it’s also extremely wasteful to scan through all your data on every rebuild, when most of it will not have changed since the last run. To address both these issues, developers will often want a more incremental approach to keeping the target up to date.
This is where things typically start to go wrong. To make incremental sync work correctly, it is usually necessary to think about the big picture of how you want the source and target to interact and design or adopt a protocol that fits all these needs. However, many of these replication projects are developed in an ad-hoc way, with little bits of code being added to handle particular scenarios, and the interactions and failure modes of these small changes are never looked at holistically.
The “Waiting for Changes” Approach
A very common approach to ad-hoc replication, especially when used to implement caching or search indexing, is to put hooks or event listeners into the application that react to changes in the data, and forward those changes to the target system. Those hooks may be put into the data model, so you can be sure that any code path that changes a record will invoke the hooks. However, this means it’s impossible to invoke the code in the data model without causing these side effects to the target system, which might be undesirable, especially for testing. Conversely, if the hooks are put into the application’s request handling tier, this keeps the data layer uncoupled from the target system, but it’s much harder to make sure you cover every code path that might result in a record being changed. In some frameworks, it’s not uncommon for developers to debug and fix production issues by invoking the data model through an interactive shell, and any changes made this way may bypass any application logic built atop the model.
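To make the shape of this concrete, here is a hypothetical sketch of the data-model variant in Python, where search_index stands in for whatever target is being kept up to date; any code path that writes to the database without going through this method is invisible to the target:

    class User:
        def __init__(self, user_id, profile):
            self.user_id = user_id
            self.profile = profile

        def save(self, db, search_index):
            db.write("users", self.user_id, self.profile)
            # The hook: forward the change to the target as a side effect.
            # Every call to save() now also writes to the target, whether
            # that is desirable in this context (e.g. in tests) or not.
            search_index.put(self.user_id, self.profile)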
This is the first source of inconsistency in these systems: the method of detecting changes to the source data is itself lossy. Because of this, there is no way of knowing whether any given record in the source system is accurately mirrored in the target without scanning the entire data set, i.e. without ending up with an ETL implementation.
What Happens in a Rollback?
The second major source of inconsistency comes from what those event hooks actually do with the change they are observing in the source data. When writing to the target system, they might attempt to relay the full state of the source record, or just the fields that changed in the source record, depending on the data models of the source and target and what facilities their framework provides. The latter assumes that the target record was already in sync with the previous state of the source record, which might not be the case, especially when other failure modes are accounted for, and this can result in lost updates. Conversely, the event hook might be implemented in such a way that it can be activated by changes in the model that have been sent to the database, but are part of a transaction that has not yet committed. If the transaction ends up being rolled back, then you end up with uncommitted data in the target system. This can be especially hard to clean up because the target state results from source state that should never have been recorded and will never be explicitly deleted.
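Here is a minimal sketch of that transaction hazard, again with hypothetical db and search_index objects: the hook fires while the transaction is still open, so a rollback leaves the target holding state the source never committed.

    def update_user(db, search_index, user_id, profile):
        with db.transaction() as txn:                  # hypothetical API
            txn.write("users", user_id, profile)
            search_index.put(user_id, profile)         # hook fires pre-commit
            txn.write("audit_log", user_id, "updated") # suppose this raises...
        # ...the transaction rolls back: the source never recorded the new
        # profile, but the target now holds it, and nothing in the source's
        # history will ever trigger a correction or deletion.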
More Ad-Hoc Replication Inconsistencies…
Because the source and target are often distinct systems, the target may be unavailable when the source is changed, so the event needs to be retried somehow. The target may reject the update because its data model has different constraints to the source. For example, it might not recognise certain email addresses as valid, or it might assume emails are used to identify users and must be unique, which may not be the case in the source. These sorts of validation constraints mean that sometimes, data from the source cannot be represented in the target’s data model.
If the target system is a remote third-party API, with high latency relative to the system’s own resources, it can also fail for network-related reasons, enforce rate limits, or just add too much latency to apply updates to it inside the scope of normal end-user requests. Therefore the copying of data to the target system is often pushed into a background job queue and is processed asynchronously from where the change originated in the source system.
This is the final contributor to this ad-hoc approach failing to produce consistency. If changes in the source data are processed asynchronously and may fail and be retried, they will end up being processed in a different order to how they happened. In general, arbitrary updates to data do not commute, that is, you cannot reorder them without changing their effects. If somebody updates a record and then later deletes it, the replicator will become confused if it tries to delete the record from the target and then update it.
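This is easy to demonstrate with a toy example in Python, where a dict stands in for the target; applying the same two events in different orders produces different end states:

    # Two events happen in the source, in this order: an update, then a delete.
    def apply_event(target, event):
        op, key, value = event
        if op == "update":
            target[key] = value        # upsert semantics, as many stores have
        elif op == "delete":
            target.pop(key, None)

    events = [("update", 7, {"location": "london"}), ("delete", 7, None)]

    in_order, reordered = {}, {}
    for e in events:
        apply_event(in_order, e)
    for e in reversed(events):
        apply_event(reordered, e)

    print(in_order)   # {} - record 7 is gone, matching the source
    print(reordered)  # {7: {'location': 'london'}} - a resurrected record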
…and Edge-Cases
One additional edge case that emerges in ad-hoc replication is that while the event hooks deal with changes made to the data since their introduction, they do not deal with all the legacy data from before that time. This necessitates a completely different approach to copy over all the existing data, and then keep the target up to date using event listeners. Deciding how to switch from the former to the latter is very difficult because the source data is changing all the time, so it’s hard to execute this correctly without downtime to stop anyone changing the source during the switch-over. Having a whole different code path for a one-off event makes it likely that code path will have defects that go undetected and require future ad-hoc fix-ups to the target data.
This approach is sometimes likened to change data capture (CDC) or event sourcing, but it is strictly weaker than both of them. CDC requires a reliable way of identifying which changes have not yet been replicated, and in event sourcing the primary way of recording updates is to write them to a global consistently ordered log, much like a database’s write-ahead log (WAL), which is then consumed to produce various representations of the system state.
Ad-hoc replication is characterised by reactions to changes that can be ignored, fabricated, misinterpreted, reordered, dropped, or executed multiple times, with no reliable way to check and ensure that the target is in sync with the source. It inevitably produces inconsistent states that are hard to debug, not least because there is no formal definition of what state the target should be in, given some state of the source. There is never a single root cause for the target being in a bad state; the reality is that this whole approach is almost guaranteed to produce inconsistency, and needs to be replaced with something else. So what does that something else look like?
Diff and patch
Broadly speaking, replication systems can be classified into two camps: state-based, and operation-based. State-based systems work by examining the current state of the source and target, and making changes to remove any differences between them. Operation-based systems rely on distributing a consistent sequence of events, or a log, to all the systems and having them build their local state off this log. This is how event sourcing works, and how consensus algorithms like Raft work.
Many real-world systems use some combination of these approaches, but here I’ll be emphasising a state-based approach, with some optimisations enabled by event logs. State-based replication tends to be more applicable to disparate systems that only expose their current state to clients and don’t have any way of sharing their event histories such that you could use them as a log.
The first ingredient of a replication system that actually works™ is something I hinted at in the previous section. There must be some well-defined way of determining whether a record in the target is consistent with its corresponding source record. This includes the possibility that the target record should not exist, because it is absent or deleted from the source. If it is not in a consistent state, then a description of how it differs from the intended state must be produced, i.e. which operations would be needed to bring the target record into line with the source.
It is possible this sounds kind of “obvious”; how can you say the target is in an incorrect state if you cannot define what the correct state looks like? However, I have personally seen too many systems where the team did not have a satisfactory answer to this question, and solving this was a prerequisite to making any further progress on the problem.
You can think of this operation as a sort of “diff and patch” function. The diff tells you how the source and target are out of sync, and is empty if they are in sync. The patch tells you how to change the target to make the diff become empty. For example, imagine we have two systems that contain data on accounts in a social network, and each of them has a certain account in the following state:
Source:

    {
      "username": "alice",
      "location": "london",
      "following": ["bob", "carol", "dave"]
    }

Target:

    {
      "username": "alice",
      "location": "berlin",
      "following": ["bob", "dave", "eve"]
    }
The diff between the target and the source would be:
- The location field changes from berlin to london
- carol is added to the following set
- eve is removed from the following set
The patch that removes these differences would be:
- Set the target’s location field to london
- Add carol to the target’s following set
- Remove eve from the target’s following set
The exact details of the diff and patch will depend on what data representation and interface each system actually provides; if the target is a document store, the patch might reduce to a single write containing the whole new state, but if it’s a SQL database it will be possible to update specific columns and add/remove specific rows as required. It is usually desirable to express the diff and patch in the smallest number of operations against each system to maximise performance.
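To make this concrete, here is one possible shape for a diff/patch pair in Python, using the account example above. The field list and the set handling are assumptions about the data model, and source_store/target_store are hypothetical interfaces; the important properties are that an empty diff means the records are in sync, and that applying the patch makes the diff empty.

    def diff(source, target):
        # Describe how target differs from source; an empty list means in sync.
        if source is None:
            return [] if target is None else [("delete",)]
        if target is None:
            return [("create", source)]
        ops = []
        for field in ("username", "location"):
            if source[field] != target[field]:
                ops.append(("set", field, source[field]))
        for name in set(source["following"]) - set(target["following"]):
            ops.append(("follow", name))
        for name in set(target["following"]) - set(source["following"]):
            ops.append(("unfollow", name))
        return ops

    def sync(record_id, source_store, target_store):
        # The patch: apply each operation to the target to empty the diff.
        source = source_store.get(record_id)
        target = target_store.get(record_id)
        for op in diff(source, target):
            target_store.apply(record_id, op)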
This sync function must be idempotent, i.e. it should be safe to run any number of times against a record in the source system to bring its corresponding target record up to date. This immediately fixes several problems with ad-hoc replication:
- Replicating a change always works by comparing the full state of the source and target records, rather than assuming the target matched the previous source state and copying some random changes over. This makes sure your event listeners always leave the target in the correct state.
- It’s not affected by event handlers being reordered by asynchronous processing, retries, and so on. The diff/patch function always does the same thing whenever it is run: it does whatever is needed to make the target match the source.
- If it fails due to a transient network error, a rate limit, etc., it can be retried indefinitely on an arbitrary schedule and will eventually succeed and produce the right state, as long as it inspects the state of the source record when it runs rather than retaining the state from when it was first scheduled.
- The same code path can be used for copying over all legacy data, and for reacting to future events. You either run the diff/patch function against all existing records, or against a record you know just changed. Eventually this will result in all source records being synced.
- There is a well-defined notion of what it means for a target record to be in an incorrect state, and furthermore, the system itself can identify such states and correct them automatically. If the detection is found to be incorrect, it can be patched and re-run over existing records to resynchronise them.
Methods for Comparing
Notice that we now have the same code path being used to handle legacy data, new changes, and retrying in case of failures. This is an example of crash-only design, where you have one code path that handles “normal” operation and failure states. The function’s starting assumption is that the target is in a bad state that needs to be identified and corrected, rather than assuming some good state and invoking an exceptional code path if an error happens. Using the same code path for all eventualities gives you a simpler system that is much more robust.
This is the core building block of several replication programs. Git’s push command works by comparing each local branch with its corresponding remote branch, figuring out which commits the remote branch is missing, and copying them over. Rsync works by comparing the file size and modification time of each source and target file (or their checksums if so instructed), and copying the source data to the target if those differ. CouchDB replication works by comparing which revisions the source and target have for each doc, and copying over any that the target is missing. In each system, the pattern is the same: compare the source and target records, and transfer whatever data is needed to make them equal.
A further desirable property of the diff function is that it is fast. This is not quite as important as making it fast to determine which records out of your whole data set are out of sync, which we will cover below, but it is still beneficial, especially in cases that require a full scan of the source data set.
For example, Rsync compares files on their size and modification time, which are small bits of metadata that are cheap to read, instead of reading the content of each file, which would be much slower. This assumes that if two files have the same size and modification time, then they very probably contain the same content. This is especially likely given that Rsync will set the mtime of the target to match the source, after it copies the data over and verifies both files have the same checksum. After doing this, it is relatively safe to use the mtime as a proxy for the fact that the data was checked last time Rsync itself updated the target.
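The same short-cut is easy to sketch with Python’s standard library: compare the cheap stat() metadata first, and only touch the file contents when it differs. This is a sketch of the idea, not Rsync’s actual implementation; note that shutil.copy2 also copies the mtime across, which is what keeps the proxy valid on subsequent runs.

    import os
    import shutil

    def maybe_copy(src_path, dst_path):
        src = os.stat(src_path)
        try:
            dst = os.stat(dst_path)
        except FileNotFoundError:
            dst = None
        # Cheap proxy: same size and mtime means "very probably identical",
        # so we can skip reading the (possibly huge) file contents entirely.
        if dst and src.st_size == dst.st_size and src.st_mtime == dst.st_mtime:
            return
        shutil.copy2(src_path, dst_path)  # copies content and metadata (mtime)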
Optimising Compare Behaviour
In Git, comparing two branches is cheap because the way the commit history is stored makes it inherently efficient to identify which commits exist in one branch and not in another. It doesn’t have to employ a “short cut”, it has a data model that directly makes its desired operations cheap. CouchDB has a similar internal structure that makes it cheap to compare the revs of a document between the source and target, so it can minimise which revs need to be copied over without comparing the content of all the revs and sending all that content over the network.
When designing your own diff functions, it is worth looking for places where you might be able to skip comparing something expensive by taking short-cuts or using better data structures. However, if multiple systems and/or people have the ability to write to the target system, this isn’t always possible and your diff function will need to be “paranoid” and assume the target could be in any random state unless you’ve proven otherwise.
Tell Me the News
Just having a reliable idempotent sync function has already solved a lot of our replication problems. In principle, you could run this function whenever you want against all source records, and this would give you eventual consistency in the target system. However, most source records don’t change most of the time, so scanning through all of them will typically be very wasteful and become slow to react to changes as they happen. To make a replication system practical, we need some way of identifying which records have changed recently. I’ll briefly discuss a couple of common ways of achieving this.
The Record Flagging Method
The first approach is to put a marker/flag on each record when it is changed, and to remove the marker after the diff/patch function has re-synchronised the record to the target. This is relatively simple to implement using a counter; when the record is changed, the counter is incremented, and when it is re-synchronised, the counter is reset to zero unless it was incremented during sync. This makes sure that changes made while the record is being synchronised are not ignored, and instead initiate another execution of the sync function. Conditionally resetting the counter can be implemented using compare-and-swap (CAS) to avoid having to pre-emptively lock the source record and prevent end users interacting with it while it’s being synced to another system.
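A minimal sketch of this counter scheme, using SQLite for illustration; the schema records(id, data, dirty) is an assumption, and sync_record stands in for the diff/patch function described earlier:

    def mark_changed(db, record_id):
        # Called (within the same transaction) whenever the record is written.
        db.execute("UPDATE records SET dirty = dirty + 1 WHERE id = ?", (record_id,))

    def sync_if_dirty(db, record_id, sync_record):
        row = db.execute(
            "SELECT data, dirty FROM records WHERE id = ?", (record_id,)
        ).fetchone()
        if row is None or row[1] == 0:
            return                    # nothing to do
        data, seen = row

        sync_record(record_id, data)  # idempotent diff/patch into the target

        # Compare-and-swap: only clear the flag if the counter still holds the
        # value we saw. If the record changed mid-sync, the counter moved on,
        # the flag stays set, and the sync function will run again.
        db.execute(
            "UPDATE records SET dirty = 0 WHERE id = ? AND dirty = ?",
            (record_id, seen),
        )
        db.commit()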
How the sync function is executed is completely up to you — after all, it is idempotent and therefore safe to run at any time. If a lot of changes are likely to happen to a record over a short span of time, you might choose to schedule the sync function to run once, a short time interval after the first change is detected, so it picks up all the changes and doesn’t need to be executed for each individual change. If the target benefits from sending changes to it in batches, you can use a cron job to find all the records marked as changed and sync them all at once. Or, if you need less replication latency, you can execute the sync function immediately on each change.
However, keep in mind that you should only allow one instance of the sync process to run on a given source record at a time; if multiple processes are trying to sync the same record at the same time they may end up getting confusing diffs and producing an inconsistent state in the target.
To deal with the transfer of legacy data, or to deal with changes and bug fixes in the sync function itself, you can mark all existing records so that sync is executed against every one of them. You will probably want a background process to handle syncing records in batches, just so you can control the throughput and amount of load this generates on the source and target systems, rather than trying to re-synchronise all your data immediately. The outstanding backlog can be easily monitored by counting the number of source records marked for sync.
The Event Log Method
The second common method for signalling which records need syncing is to have the application add events to a log whenever a record is changed, and have another process consume this log and use it to execute the sync function against the appropriate records. Whereas the previous approach is a form of change data capture, this technique corresponds to event sourcing. It is typically implemented using background job queues, a database write-ahead log, a message broker like RabbitMQ or an event streaming system like Kafka. However you implement it, the core idea is the same: write to the log when a record is changed, and consume from the log to invoke the sync function.
Note that for this to work, it is critically important that the log is complete, i.e. it faithfully reports all changes to the source data. The log should not discard an event until the sync function reports that it has been successfully handled, i.e. that the target is up to date as of that event. It is not necessary for the log to preserve event order; indeed, in many systems it is not possible to construct a total global order in which source data changes happened. The log is just there to provide a signal about which records need to be synced, so that you do not have to scan your entire data set to find this out.
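Here is a minimal sketch of such a consumer, against a hypothetical at-least-once queue API: an event stays in the log until ack is called, and ack only happens after the sync function reports success, so a crash mid-sync just means the event is redelivered later.

    def consume_forever(queue, sync):
        while True:
            event, handle = queue.get()   # blocks until an event is available
            try:
                sync(event.record_id)     # the idempotent diff/patch function
            except Exception:
                queue.retry(handle)       # leave the event in the log for later
            else:
                queue.ack(handle)         # only now is the event discarded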
…and How to Choose Which
Whether using record markers or an event log, you must ensure that a change to a record is only considered to have been processed after the sync function reports that it has been successfully handled. Since the sync function is idempotent, this means your change notification does not need exactly-once delivery, it is fine for it to have at-least-once delivery. In the worst case, the sync function runs multiple times for the same event, determines the target is already in sync, and has no further work to do. If the system has at-most-once delivery, or the sync function fails to retry on transient failures, it will end up not conveying all changes to the target, and the source and target will drift out of sync.
How can CouchDB help?
As a distributed database, CouchDB is built from the ground up to make replication reliable and efficient. Not only is replicating databases between CouchDB instances as simple as a single API call, it gives you the building blocks necessary to replicate from CouchDB into any other system you like with relatively little complexity. In a future article we’ll look in detail at how to use the CouchDB _changes feed to reliably replicate your CouchDB data to other places.