CouchDB Tips by Neighbourhoodie

CouchDB and Docker posted 03/11/2020 by The Neighbourhoodie CouchDB Team

Docker is an extremely popular way of deploying any kind of application in many different environments. Deploying CouchDB is no exception: the CouchDB project even maintains its own set of Dockerfiles, as well as a Helm chart to help orchestrate multiple containers as a cluster on Kubernetes.

All of these are rather advanced concepts that we are not going into in this article. But if you need any particular help with this, we’re always happy to help.

First off, Docker is a fine choice in many cases:

  • quickly try out something without having to worry about dependencies and proper setup
  • recreate a production environment on a developer workstation
  • make it easy to deploy a well-defined production environment repeatedly

We have many CouchDB customers and users that use CouchDB successfully inside Docker.

However.

The way Docker works can introduce problems when running CouchDB in high-performance situations.

One of Docker’s core features is isolation of containers. In order to achieve this, Docker introduces a virtual networking and a virtual file system layer. Both of these layers can introduce delays and consistency issues to the point where running CouchDB inside Docker is slower than outside of Docker.

As a database, CouchDB naturally has the highest demands on data access. Any additional delays in getting bytes to and from disk are going to make all CouchDB operations slower.

As a networked database, the same is true for Docker’s networking layer. Especially when running a CouchDB cluster in Docker containers, the networking layer leads to significant performance degradation.

We are constantly evaluating new versions of Docker and we’ll be happy to report when running CouchDB inside it is flawless. But until then, we’ll maintain this warning.

And sometimes, there is just the odd unexplainable issue.

In conclusion: there are many valid use-cases for running CouchDB inside Docker, and you might even make a conscious trade-off where the ease-of-use of Docker trumps the required performance for your database, or your load is just not that big relative to the computing resources you have available.

But if you need to have CouchDB be its fastest and most reliable self, you should run it outside of Docker.

default.d and local.d posted 27/10/2020 by The Neighbourhoodie CouchDB Team

When we explored how default.ini and local.ini work in unison to provide a coherent configuration and upgrade behaviour, we skipped over one more part of the configuration puzzle.

In addition to the two configuration files you are already familiar with, CouchDB can be configured with two further configuration directories, which in turn can hold multiple configuration files each.

First, we explain why this mechanism exists, and then we go into the details of how it works. One thing up front, however: this is all entirely optional, so if you have no need for this, you are free to ignore everything you read here.

What problem does this additional way to provide configuration files solve?

CouchDB is usually delivered in one piece, a so-called release, that includes everything that makes up CouchDB. Most likely, that’s what you’re running too.

However, there is a tiny number of extensions to CouchDB that are not part of the main release. In the past this has been an extension to provide fulltext search to CouchDB, but this has since been integrated. Very specialised users of CouchDB might have their own extensions that are not public.

Those extensions often have a need to be configured. And just like CouchDB, they might ship with a default configuration, and they have a place to make local changes to the default configuration. Each extension can have a file in both directories each to mirror the behaviour of default.ini and local.ini. For example:

  • /etc/couchdb/default.d/search.ini
  • /etc/couchdb/local.d/search.ini

You would receive the default configuration in the first file, and you can make local changes in the second file. When installing new versions, new default configuration options can be added to default.d/search.ini, and as such, this file would be overwritten during an upgrade. local.d/search.ini however will not be overwritten and retain all local changes across software updates.

In addition, the standard Linux packages for CouchDB often place files inside of the default.d directory, to make changes that are distribution-specific. As an example, the Debian installer adds a file that redirects log files to /var/log/couchdb/couchdb.log. It also places settings related to the installation prompts in default.d/10-bind-address.ini or default.d/5-single-node.ini, depending on the settings chosen at installation time.

Local.ini is Never Overwritten on CouchDB Updates posted 20/10/2020 by The Neighbourhoodie CouchDB Team

CouchDB is configured through configuration files on disk. The format of the files is INI.

When it starts, CouchDB reads a series of .ini files to make up the final configuration it is going to start with. This series of .ini files is called the config file chain.

By default, two config files are read, default.ini and local.ini, and they are loaded in this order: default.ini first, local.ini second. This makes local.ini the end of the config file chain. This will become important shortly.

When making changes to the configuration, CouchDB users are advised to make their configurations in local.ini. If you’ve ever changed your bind_address or admin password, you likely have made changes to your local.ini file.

Why are there two files? And why do they form a chain? — Let’s dig in.

First: default.ini includes all configuration options available for CouchDB. Different versions of CouchDB have different configuration options (new features require new options, for example), so a default.ini file is always specific to a particular version of CouchDB.

When installing CouchDB for the first time, default.ini gets installed on the target system. When upgrading CouchDB to a newer version, default.ini gets overwritten with the newer version, because it might include new configuration options.

Second: local.ini is also installed the first time around, so you can make your required changes. But when you update CouchDB, local.ini does not get overwritten, so your custom configuration will be applied to the updated version of CouchDB. Your networking and admin configuration, and whatever else you have changed continues to work just fine in the new version of CouchDB.

So here we have the answer to the first question: there are two files so that we can have both new configuration options when new CouchDB versions come out and your custom configuration retained across CouchDB upgrades.
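The override behaviour of the config file chain can be sketched in a few lines of Python. This is only an illustration using the standard library's INI parser, not how CouchDB itself loads its configuration; the two file contents and the helper name load_chain are made up for the example.

```python
import configparser

# Illustrative file contents only; a real default.ini is much longer.
DEFAULT_INI = """
[chttpd]
port = 5984
bind_address = 127.0.0.1
"""

LOCAL_INI = """
[chttpd]
bind_address = 0.0.0.0
"""


def load_chain(*ini_texts):
    """Read INI texts in order; a later text overrides earlier values."""
    config = configparser.ConfigParser()
    for text in ini_texts:
        config.read_string(text)
    return config


# default.ini first, local.ini second, as in the config file chain.
config = load_chain(DEFAULT_INI, LOCAL_INI)
print(config["chttpd"]["bind_address"])  # the value set in local.ini
print(config["chttpd"]["port"])          # the untouched default
```

Because local.ini is read last, replacing the first text with a newer default.ini leaves the local override in place.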

Everything You Need to Know About CouchDB Database Names posted 13/10/2020 by The Neighbourhoodie CouchDB Team

Naming a database does not sound like an exciting activity. But it can be, if you know all the considerations that go into naming a database in CouchDB. Let’s start with the restrictions.

CouchDB database names have restrictions in terms of which characters can be used. Based on these restrictions, a database name:

  1. Must begin with a lowercase letter from a to z, no diacritics etc.
  2. Each character in the name must be one of:
    1. a lowercase letter from a to z
    2. a number from 0-9
    3. an underscore, dollar sign, open or closed parenthesis
    4. the plus and minus signs
    5. a slash
  3. May be no longer than 238 characters.

Or expressed as a Regular Expression: ^[a-z][a-z0-9_$()+/-]{0,237}$
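As a quick way to try the rules out, here is a minimal Python sketch of a name check. It assumes the length limit means one leading letter plus at most 237 further characters, and the function name is_valid_db_name is our own.

```python
import re

# Assumption: "no longer than 238 characters" covers the whole name,
# so the first letter may be followed by at most 237 more characters.
DB_NAME = re.compile(r"[a-z][a-z0-9_$()+/-]{0,237}")


def is_valid_db_name(name: str) -> bool:
    """Return True if name satisfies the character and length rules."""
    return DB_NAME.fullmatch(name) is not None


print(is_valid_db_name("people"))
print(is_valid_db_name("user/32/14/55187"))
print(is_valid_db_name("People"))
```

Using fullmatch means the pattern has to cover the entire name, so no explicit ^ and $ anchors are needed.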

The collection of special characters might seem unfamiliar at first. We’ll explain how they come together further down.

First, we talk about one of them, the slash, or /. It is used in URLs and in UNIX-like file systems to denote hierarchy, like a subdirectory.

In CouchDB 1.x, a database was represented by a single file in the file system. If you had a database called people, CouchDB would store all associated data in a file called people.couch. And all .couch database files are stored in the same directory on your file system (it can be found in the CouchDB configuration under [couchdb] database_dir).

CouchDB does not put a practical limit on how many databases there can be on a server (other than the theoretical ~43^238), but file systems do. While those limits are getting higher and higher, in the times of CouchDB 1.x, you had to consider how many databases you would create to get good performance out of your file systems. Some file systems get really slow when there are more than 2^16 or 2^32 files in a single directory.

To make sure you can create more databases than a file system limits you to, CouchDB allows you to add slashes to database names. It will create actual subdirectories in the file system, so you can avoid having too many files in a single directory.

For example, a database called user/32/14/55187 will be stored in the database_dir as user/32/14/55187.couch.

CouchDB 2.0 introduced database sharding, the splitting up of single databases into multiple .couch files, which are stored each in their own directory per shard range. A shard range is expressed as a subdirectory which is named after the range, which goes from 00000000 to ffffffff. For example, a database with four shards (q=4) occupies the following shard ranges:

  • 00000000-3fffffff
  • 40000000-7fffffff
  • 80000000-bfffffff
  • c0000000-ffffffff

A database with just one shard occupies the full range:

  • 00000000-ffffffff
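The ranges can be derived by cutting the 32-bit space from 00000000 to ffffffff into q equal parts. The following sketch assumes q is a power of two and that the space is split evenly; the function name shard_ranges is our own.

```python
def shard_ranges(q: int) -> list[str]:
    """Split the 32-bit range 00000000-ffffffff into q equal shard ranges."""
    space = 2 ** 32
    ranges = []
    for i in range(q):
        start = i * space // q
        end = (i + 1) * space // q - 1
        # Eight lowercase hex digits on each side of the dash.
        ranges.append(f"{start:08x}-{end:08x}")
    return ranges


for shard_range in shard_ranges(4):
    print(shard_range)
```

Calling shard_ranges(1) yields the single full range a one-shard database occupies.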

For databases with a single shard, which are common in the database-per-user pattern, all database files are stored in the same directory on the file system, and the same rules as with CouchDB 1.x apply.

But the more shards you have per database, the fewer actual files there are in each shard subdirectory in the file system. So you will reach the point at which the file system becomes slow later, but it is still worth considering if you have a very large number of databases.

Incidentally, the shard ranges explain why the database name is limited to 238 characters. In the past, file system paths could be at most 255 (2^8 - 1) characters long. Since CouchDB 2.0 and later always include the shard range, we have to subtract 17 characters (2x8 for the beginning and end of the shard range plus 1 for the dash in the middle), which leaves 255 - 17 = 238.

And where do the other special characters come in? It is pretty simple actually. When deciding which characters should be allowed in database names, the CouchDB developers surveyed all common file systems and collected all their respective restrictions about what characters could be included in file names. The result is the list of characters allowed in a CouchDB database: all these characters are allowed as part of a file name on any modern file system.

That said, we usually recommend keeping it to [a-z][a-z0-9-_/].

Sharding — Reducing the Number of Shards posted 06/10/2020 by The Neighbourhoodie CouchDB Team

In contrast to increasing the number of shards for a database, reducing the number of shards is not a built-in operation. The approach described here does not rely on shard splitting, which is only available in CouchDB 3.x and later, so this advice is good for version 2.x as well.

Neighbourhoodie has built the couch-continuum tool that automates the bulk changing of database parameters, including the number of shards for a database.

This tool can both increase and decrease the number of shards for a database in both CouchDB 2.x and 3.x.

There is just one caveat: it cannot operate without taking the original database offline for the duration of its restore. So you can only do this during a maintenance window.

Sharding — Increasing the Number of Shards posted 29/09/2020 by The Neighbourhoodie CouchDB Team

This advice is only true for CouchDB 3.0.0 or later. Next week, we’ll cover increasing the number of shards in CouchDB 2.x.

Increasing the number of shards in CouchDB is implemented by a technique called shard splitting. It allows you to pick any one shard in a database and split it into two equal-sized shards.

CouchDB does all the hard work for you: a single request to the /_reshard endpoint will start the process. Shard splitting is a background process that can go on while your CouchDB cluster is fully operational. Once the split shards are available, CouchDB starts using them to serve requests instead of the original shard, which then gets deleted.

Note that shard copies are replicated across your cluster. We recommend you split one shard at a time. In addition, we recommend splitting all shards of a database around the same time in order to guarantee consistent performance.
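To give an idea of what such a request could look like, here is a hedged Python sketch that only builds the request and does not send it. The /_reshard/jobs path and the "type" and "shard" body fields reflect our reading of the CouchDB 3.x API documentation, so please check them against the docs for your version. The server URL and the shard name are hypothetical.

```python
import json
import urllib.request


def split_job_request(base_url: str, shard: str) -> urllib.request.Request:
    """Build (but do not send) a request asking CouchDB to split one shard."""
    # Assumed body shape: a job of type "split" for a single named shard.
    body = json.dumps({"type": "split", "shard": shard}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_reshard/jobs",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical server and shard name, for illustration only.
request = split_job_request(
    "http://localhost:5984",
    "shards/00000000-7fffffff/people.1554242778",
)
print(request.get_method(), request.full_url)
```

Sending the request with urllib.request.urlopen(request) would additionally require admin credentials.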

Sharding — Choosing the Right q Value posted 22/09/2020 by The Neighbourhoodie CouchDB Team

One of CouchDB’s core features is scalability. There are two axes of scalability in CouchDB:

  1. Scaling the amount of data stored
  2. Scaling the number of requests handled

One mechanism is responsible for achieving this: sharding. Sharding means that what looks like a single database to the CouchDB API is in reality multiple parts. Those parts are all independent from each other and can live on one or more nodes of a CouchDB cluster.

This allows you to store more data in a single database than fits onto a single CouchDB node. It also allows you to handle more requests to that database than a single node can handle.

In addition, since CouchDB 3.0.0, you can now increase the number of shards of a database, while the cluster is fully operational.

The number of shards is identified by the value q. In CouchDB 2, q defaulted to 8, in anticipation of storing a lot of data in CouchDB. In CouchDB 3, q got reduced to 2, since now you can dynamically increase the number as your data grows.

This leaves one more point to cover: what is the right value of q for you?

As usual, everything depends on your exact usage of CouchDB: document size and structure, request patterns, and so on. In general, our advice is to start with a minimum of q=2 and increase in powers of 2, aiming for one shard for every 10GB of data or 1M documents, whichever comes first.

So a database with 100GB of data and q=8 should start considering going to q=16.
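The rule of thumb can be written down as a small calculation. This is one possible reading of the advice above (one shard per 10GB or per 1M documents, rounded up to the next power of 2, never below 2), and the function name suggested_q is our own.

```python
import math


def suggested_q(size_gb: float, doc_count: int) -> int:
    """Rule-of-thumb q: one shard per 10GB or 1M docs, as a power of 2."""
    by_size = math.ceil(size_gb / 10)
    by_docs = math.ceil(doc_count / 1_000_000)
    needed = max(2, by_size, by_docs)
    # Round up to the next power of two (needed itself if it already is one).
    return 1 << (needed - 1).bit_length()


print(suggested_q(100, 0))        # 100GB of data
print(suggested_q(5, 3_000_000))  # small on disk, but 3M documents
```

For 100GB this asks for 10 shards, which rounds up to q=16, in line with the example above.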

Use JSON Patch to Resolve Conflicts posted 15/09/2020 by The Neighbourhoodie CouchDB Team

CouchDB is unique in the database world because it makes data conflicts first-class citizens of its data model. Normally, databases and applications built on them do a large amount of work to avoid having to deal with conflicts at all. In many scenarios, this leads to subtle errors and occasional data loss.

In CouchDB this means: no data is ever randomly lost and you can always make sure you have access to your user’s data. The only downside: you have to actively embrace conflicts and prepare for them. But don’t worry, it is not a lot of work.

A document conflict manifests as a _conflicts field in your doc that you get if you query your doc with the conflicts=true option and a conflict exists. We also use the revs=true option to get a little more information about what is going on.

Normally, you don’t see any conflicts:

GET /db/doc?revs=true
{
  "_id": "doc",
  "_rev": "3-WXYZ9876",
  "_revisions": [
    "2-BCDE2345",
    "1-ABCD1234"
  ],
  "x": 3,
  "y": 1
}

With the conflicts option you do:

GET /db/doc?revs=true&conflicts=true
{
  "_id": "doc",
  "_rev": "3-WXYZ9876",
  "_conflicts": ["3-CDEF3456"],
  "_revisions": [
    "2-BCDE2345",
    "1-ABCD1234"
  ],
  "x": 3,
  "y": 1
}

But now what? We can see that this doc has four revisions in total: 1 and 2 are uncontroversial, but then there are two revisions 3-… and they are in conflict. So we can deduce that two clients tried to update revision 2 at the same time and they disagree on their contents. Let’s look at the other 3-… conflict:

GET /db/doc?revs=true&rev=3-CDEF3456
{
  "_id": "doc",
  "_rev": "3-CDEF3456",
  "_revisions": [
    "2-BCDE2345",
    "1-ABCD1234"
  ],
  "x": 2,
  "y": 4
}

Now we can see that revision 3-WXYZ9876 set our fields x and y to the values 3 and 1, while revision 3-CDEF3456 set our fields x and y to the values 2 and 4. Clearly a conflict, but what is the “right” solution now?

Without knowing more about the application that produced this, we can’t really do much. But what if we knew what revision 2-BCDE2345 looked like? By default, we don’t know what it did look like, because CouchDB does not guarantee old document revisions to be around.

Introducing JSON Patch

With a little trick, we can keep just enough information around, so we can reconstruct previous revisions. To do this, we are going to use something very neat: JSON Patch. It is a way to describe the differences between two JSON objects.

Here is a small example. Say we have two JSON objects that look like this:

{
  "a": 1
}

{
  "a": 2
}

JSON Patch can describe the difference between the two objects. If we want to know what changed in between the first and the second object, this JSON Patch describes the difference:

[
   { "op": "replace", "path": "/a", "value": 2 }
]

If we want to know the difference between the second and the first, this is the corresponding JSON Patch:

[
   { "op": "replace", "path": "/a", "value": 1 }
]
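To make the idea concrete, here is a minimal Python sketch that produces such patches for flat objects. It only looks at top-level fields and assumes keys contain neither "/" nor "~" (which JSON Patch paths would need to escape); a full JSON Patch library handles nesting and escaping. The function name diff_flat is our own.

```python
def diff_flat(old: dict, new: dict) -> list[dict]:
    """Return JSON Patch operations that turn old into new (top level only)."""
    ops = []
    for key in old:
        if key not in new:
            ops.append({"op": "remove", "path": f"/{key}"})
        elif old[key] != new[key]:
            ops.append({"op": "replace", "path": f"/{key}", "value": new[key]})
    for key in new:
        if key not in old:
            ops.append({"op": "add", "path": f"/{key}", "value": new[key]})
    return ops


first = {"a": 1}
second = {"a": 2}
print(diff_flat(first, second))  # from the first object to the second
print(diff_flat(second, first))  # and the other way around
```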

Using JSON Patch

Now if we produce our documents in a way that we don’t only update our fields to the values we want, but also record our changes in JSON Patch format, that would allow us to reconstruct earlier revisions from the latest revision. The one trick here is that we don’t store the JSON Patch that gets us from the older to the newer revision, but the other way around, from the newer to the older.

Here is an example:

GET /db/doc?revs=true
{
  "_id": "doc",
  "_rev": "3-WXYZ9876",
  "_revisions": [
    "2-BCDE2345",
    "1-ABCD1234"
  ],
  "x": 3,
  "y": 1,
  "history": [
      [
        { "op": "replace", "path": "/x", "value": 2 }
      ],
      [
        { "op": "replace", "path": "/x", "value": 1 }
      ]
   ]
}

From this, we can deduce the document bodies for revisions 2-BCDE2345 (by applying x=2) and 1-ABCD1234 (by applying x=1).

2-BCDE2345:
{
  "_id": "doc",
  "_rev": "2-BCDE2345",
  "_revisions": [
    "1-ABCD1234"
  ],
  "x": 2,
  "y": 1,
  "history": [
      [
        { "op": "replace", "path": "/x", "value": 1 }
      ]
   ]
}

1-ABCD1234:
{
  "_id": "doc",
  "_rev": "1-ABCD1234",
  "_revisions": [ ],
  "x": 1,
  "y": 1,
  "history": [ ]
}
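The reconstruction can be sketched in Python as follows. This is a simplified illustration: it only supports top-level "replace" operations, leaves out the _rev and _revisions bookkeeping, and the helper names apply_patch and previous_revision are our own.

```python
import copy


def apply_patch(doc: dict, patch: list[dict]) -> dict:
    """Apply top-level "replace" operations to a copy of doc."""
    result = copy.deepcopy(doc)
    for op in patch:
        if op["op"] != "replace":
            raise ValueError(f"unsupported op: {op['op']}")
        result[op["path"].lstrip("/")] = op["value"]
    return result


def previous_revision(doc: dict) -> dict:
    """Undo the latest change using the first entry of doc["history"]."""
    history = doc["history"]
    if not history:
        raise ValueError("no earlier revision recorded")
    older = apply_patch(doc, history[0])
    # The older revision only knows about the changes before it.
    older["history"] = history[1:]
    return older


latest = {
    "_id": "doc",
    "x": 3,
    "y": 1,
    "history": [
        [{"op": "replace", "path": "/x", "value": 2}],
        [{"op": "replace", "path": "/x", "value": 1}],
    ],
}
rev2 = previous_revision(latest)
rev1 = previous_revision(rev2)
print(rev2["x"], rev1["x"])
```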

Why is this useful? If we now have the JSON Patch history for both revisions 3-WXYZ9876 and 3-CDEF3456, we can perform what is called a three-way merge. It allows us to resolve the conflict without any further information.

Here is how. First, let’s look at the history of revision 3-CDEF3456:

{
  "_id": "doc",
  "_rev": "3-CDEF3456",
  "_revisions": [
    "2-BCDE2345",
    "1-ABCD1234"
  ],
  "x": 2,
  "y": 4,
  "history": [
      [
        { "op": "replace", "path": "/y", "value": 1 }
      ],
      [
        { "op": "replace", "path": "/x", "value": 1 }
      ]
   ]
}

From this, we can now take the reconstruction of revision 2-BCDE2345:

{
  "_id": "doc",
  "_rev": "2-BCDE2345",
  "_revisions": [
    "1-ABCD1234"
  ],
  "x": 2,
  "y": 1,
  "history": [
      [
        { "op": "replace", "path": "/x", "value": 1 }
      ]
   ]
}

Using this as a base, we can produce the two JSON Patches that would then produce both 3- revisions:

[
      { "op": "replace", "path": "/x", "value": 3 }
]
[
      { "op": "replace", "path": "/y", "value": 4 }
]

And now we can compare the patches themselves. We can see that there are no entries where the path points to the same field. That means we can safely apply both patches to revision 2-BCDE2345 and get no conflict.

The resulting document looks like this and is stored as revision 4-XYXY8877:

GET /db/doc
{
  "_id": "doc",
  "_rev": "4-XYXY8877",
  "x": 3,
  "y": 4
}

Voilà.
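The merge walked through above can be sketched in Python like this. It is a simplified illustration that only handles top-level "replace" operations and refuses to merge when both patches touch the same path; the function name three_way_merge is our own, and writing the result back to CouchDB is left out.

```python
import copy


def three_way_merge(base: dict, patch_a: list[dict], patch_b: list[dict]) -> dict:
    """Apply both patches to base if they do not touch the same path."""
    overlap = {op["path"] for op in patch_a} & {op["path"] for op in patch_b}
    if overlap:
        raise ValueError(f"both sides changed: {sorted(overlap)}")
    merged = copy.deepcopy(base)
    for op in patch_a + patch_b:
        if op["op"] != "replace":
            raise ValueError(f"unsupported op: {op['op']}")
        merged[op["path"].lstrip("/")] = op["value"]
    return merged


# Field values of the reconstructed revision 2-BCDE2345.
base = {"x": 2, "y": 1}
merged = three_way_merge(
    base,
    [{"op": "replace", "path": "/x", "value": 3}],
    [{"op": "replace", "path": "/y", "value": 4}],
)
print(merged)
```

When both patches do change the same field, the sketch raises an error instead of guessing, and the application has to decide which value wins.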

Copy Design Docs to Avoid Waiting For Indexes to be Built posted 08/09/2020 by The Neighbourhoodie CouchDB Team

This advice is relevant for all query mechanisms in CouchDB: Views, Mango Queries, and even Search.

All query mechanisms in CouchDB use design docs to define which fields to use when querying your document. We call this the query definition. They look different for each of the mechanisms, but their function is the same in each case.

When changing the query definition of a design document, CouchDB will re-index all documents in your database before it can respond to any queries for them.

If you have a lot of documents in your database, going through all of them for re-indexing can take a while: minutes, hours, sometimes days.

During application development it is common to adjust how you query your database and adjusting your CouchDB query definitions is a common operation.

However, when deploying the latest version of your application, you don’t want to get into a situation where an end-user tries to use your application and then has a lengthy wait before CouchDB returns a result.

But you also don’t want to deploy the new design doc before the application is ready, because then the old version of your application no longer has the correct design docs to work correctly.

Luckily, the CouchDB developers have thought about this and there is a nice solution.

When deploying your new query definitions, you can deploy them under a differently named design document that is currently unused. For example, when your current design doc _id is _design/by-date, then you create a new design doc _design/by-date-deploy.

Then you wait for its index to be built and after that, when you deploy your app, you use an HTTP COPY request to copy the new design doc over the old design doc. This does not cause your index to be rebuilt, and your app can use the new index right away, while the old app could use its index up to the last moment of operation.

TL;DR:

  1. upload your new view in a new design doc
  2. query your view and wait for the index build to finish
  3. HTTP COPY your new design doc to your old one.
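The last step can be sketched in Python as follows. The sketch only builds the COPY request and does not send it. It assumes that overwriting an existing document requires the target's current revision in the Destination header; the server URL, database name and revision value are hypothetical, and the function name is our own.

```python
import urllib.request


def copy_design_doc_request(
    base_url: str, db: str, source_id: str, target_id: str, target_rev: str
) -> urllib.request.Request:
    """Build (but do not send) a COPY request from source_id onto target_id."""
    # The Destination header names the target doc and its current revision.
    destination = f"{target_id}?rev={target_rev}"
    return urllib.request.Request(
        url=f"{base_url}/{db}/{source_id}",
        method="COPY",
        headers={"Destination": destination},
    )


# Hypothetical values, for illustration only.
request = copy_design_doc_request(
    "http://localhost:5984",
    "mydb",
    "_design/by-date-deploy",
    "_design/by-date",
    "3-abc",
)
print(request.get_method(), request.full_url)
print(request.get_header("Destination"))
```

Passing the request to urllib.request.urlopen would send it, given suitable credentials.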

Use Type in Doc _id posted 01/09/2020 by The Neighbourhoodie CouchDB Team

When deciding on which data goes into which CouchDB documents, it is commonly helpful to keep track of the type of document. For example, you could have documents for users and documents for articles.

The most common way to store the type is by adding a type field to the document:

{
  "_id": "123456",
  "type": "user",
   …
}

{
  "_id": "abcdef",
  "type": "article",
   …
}

There is an alternative that has advantages in certain situations: storing the type inside the document _id:

{
  "_id": "user:123456",
   …
}

{
  "_id": "article:abcdef",
   …
}

This approach pays off when you have an application that often needs to query CouchDB for documents of a certain type, say, list all articles. In the first case, you’d have to use a Mango Query, or a JavaScript View to get the documents you want. With the second approach, you can use the _all_docs endpoint with the startkey and endkey parameters.
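Here is a small Python sketch of how such an _all_docs URL could be put together. It relies on the common convention of appending the high Unicode character \ufff0 to the prefix to form the endkey; the server URL, the database name and the function name all_docs_by_prefix_url are assumptions for the example.

```python
import json
import urllib.parse


def all_docs_by_prefix_url(base_url: str, db: str, prefix: str) -> str:
    """Build an _all_docs URL matching every _id that starts with prefix."""
    params = {
        # Keys are JSON-encoded strings, so they need their double quotes.
        "startkey": json.dumps(prefix),
        "endkey": json.dumps(prefix + "\ufff0", ensure_ascii=False),
        "include_docs": "true",
    }
    return f"{base_url}/{db}/_all_docs?{urllib.parse.urlencode(params)}"


url = all_docs_by_prefix_url("http://localhost:5984", "mydb", "article:")
print(url)
```

The same helper works for longer prefixes, for example "article:2020-05" if your _ids carry a date segment after the type.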

Using _all_docs for this allows you to skip using a secondary index for your most common types of queries. While having secondary indexes is perfectly fine, not having to use them will save you some computing resources.

In addition, secondary indexing in PouchDB up until version 7 comes with a performance penalty (that version 8 is going to remove, at least for find()), that is more severe than secondary indexes in CouchDB. So it is advantageous to re-use the _all_docs index as much as possible.

You can also use more segments. Say you need all articles by time; your _ids could then look like this:

article:2020-05-25:abcdef

If your primary use of accessing CouchDB is not getting documents by type, then this is a less useful tip, but in case you do, this can give you a good performance boost.