CouchDB Interviews: Joan Touzet

Posted Wednesday, August 28, 2019 by liv

This is an interview with Joan Touzet. Joan is Head of CouchDB Support at Neighbourhoodie, is a CouchDB committer, and also sits on the Apache Software Foundation Board of Directors.

Q: What first got you interested in CouchDB/NoSQL?

I was working on doctoral research at the Ontario Institute for Studies in Education at the University of Toronto. Part of our research was to take posts students made on a prototype web forum, personal blog or chat room and perform various semantic analyses of the text. But the websites themselves made extracting the content difficult.

A friend knew about CouchDB (version 0.8 at the time!) and suggested I try it. I managed to extract each post from the three other services and insert them into CouchDB. The REST-based API was fantastic and so much easier to use than a SQL-based backend. Writing analysis scripts in Python to run through each record was a snap, plus I could write the findings back into CouchDB. I could then replicate the database to a colleague, who could then run their own analyses.
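To make that workflow concrete, here is a minimal Python sketch of reading every document over HTTP, running an analysis on it, and writing the result back. It is not the original research code: the server URL, the `posts` database name, the `text` field and the word-count "analysis" are all assumptions made for illustration, and a real server will usually also require credentials.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5984"  # assumption: a local CouchDB without auth
DB = "posts"                        # hypothetical database name


def analyse(doc):
    """Return a copy of the document with a word count added.

    The word count is a stand-in for a real semantic analysis.
    """
    updated = dict(doc)
    updated["word_count"] = len(doc.get("text", "").split())
    return updated


def bulk_payload(docs):
    """Build the request body for POST /{db}/_bulk_docs."""
    return {"docs": [analyse(doc) for doc in docs]}


def fetch_all_docs():
    """Read every document, bodies included, via GET /{db}/_all_docs."""
    url = f"{BASE_URL}/{DB}/_all_docs?include_docs=true"
    with urllib.request.urlopen(url) as resp:
        rows = json.load(resp)["rows"]
    return [row["doc"] for row in rows]


def write_back(docs):
    """Send the analysed documents back in a single bulk request.

    Each document keeps its _id and _rev, so CouchDB treats it as an update.
    """
    request = urllib.request.Request(
        f"{BASE_URL}/{DB}/_bulk_docs",
        data=json.dumps(bulk_payload(docs)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

Against a running server, `write_back(fetch_all_docs())` would run the whole loop; the findings then travel with the documents when the database is replicated.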

I was hooked.

Q: How did you start contributing to CouchDB?

My early contributions to CouchDB weren't in code or documentation - they were in community building. I gained so much from being able to talk to its core developers on IRC daily that I decided to give back by sitting in IRC and answering questions too, as I learned the answers.

At the time, CouchDB and the open source community at large weren't as keen on recognizing non-code contributors with merit, so it took some time (and a few small code patches) before I became a CouchDB committer. Now, while I work on code all over CouchDB, my main focus is on release engineering, continuous integration and packaging.

Q: What’s the most convoluted bug fix you ever had to figure out?

Jan may say the same thing, but it'd have to be #574/#745. This was a super gnarly bug that we couldn't always reproduce, where replication of large attachments would simply crash out and never finish. The error code pointed to some of the ugliest code in CouchDB, namely the HTTP multipart parsing code.

At first, we thought it was a race condition in how an HTTP 413 (request body is too large) error result was handled. The socket was being closed before the 413 was returned to the client, so the client never knew to send the attachment in smaller chunks.

Later, once we fixed that bug, we still got internal Erlang process crashes with the same error, or situations where CouchDB would consume all of the RAM on a machine and explode. We discovered the recurrence was related to enforcing stricter HTTP request size limits, which was preventing successful replication. Raising the size limit or reducing the attachment size fixed the issue.

Q: What’s your favorite CouchDB feature that people don’t know about?

It's got to be Mango.

CouchDB made a name for itself being a "database of, by, and for the web," so people tend to associate it with JavaScript. For all of CouchDB 1.x, you had to use JS to write views (secondary indexes), to retrieve data from the database based on anything other than each document's ID.

At the second CouchDB summit in 2012, shortly after I joined Cloudant (I was employee #20), I brought up the idea of a Domain-Specific Language (DSL). So many of the customer queries we were seeing were simple JavaScript that just looked to see if a field existed, then emitted a (key, value) pair of (field value, 1) to get an index on a different "column" than the document's primary _id. I figured, if we could write a DSL for basic secondary indexes, we'd be able to eliminate approximately 60% of people's views. We could also reduce CPU usage on the cluster, and improve performance, because we wouldn't have to convert documents between Erlang and JavaScript just to index them. The idea was met with lukewarm interest, but it was added to the backlog.
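As a rough illustration of the pattern described above, the following Python sketch builds the two request bodies side by side: a design document whose JavaScript view checks for a field and emits (field value, 1), and the declarative index definition that expresses the same intent without any JavaScript. The field name `author` and the `by_author` names are hypothetical; the index body follows the shape accepted by CouchDB's `POST /{db}/_index` endpoint.

```python
FIELD = "author"  # hypothetical field name, used only for illustration


def view_design_doc(field):
    """Design document with a JavaScript view that indexes one field."""
    map_fn = (
        "function (doc) {\n"
        f"  if (doc.{field}) {{\n"
        f"    emit(doc.{field}, 1);\n"
        "  }\n"
        "}"
    )
    return {
        "_id": f"_design/by_{field}",
        "views": {f"by_{field}": {"map": map_fn}},
    }


def mango_index(field):
    """Declarative index definition for the same field (body for /_index)."""
    return {
        "index": {"fields": [field]},
        "name": f"by_{field}",
        "type": "json",
    }
```

The first body has to be evaluated by a JavaScript engine for every document, while the second is plain data that the database can interpret natively, which is the saving the DSL idea was after.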

A few years later, Mango was implemented, with its DSL kinda-sorta inspired by MongoDB as a peace offering to its fans. It was done fully in Erlang, and it is significantly faster than JavaScript for indexing. Not only that, it gives you abilities like partial indexes where you can effectively do a two-stage map-reduce (like the old Cloudant "chained" map-reduce) by building a small secondary index, then filtering it a second time at query time.
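The two-stage idea behind partial indexes can be sketched in Python as follows. The first two functions build request bodies in the shape used by CouchDB's `/_index` and `/_find` endpoints; the design document name, field names and selectors passed to them are hypothetical. The last two functions are only an in-memory imitation of the two stages, supporting plain equality and none of Mango's operators, to show which documents survive each step.

```python
def partial_index(ddoc, fields, partial_filter):
    """Body for POST /{db}/_index: index only docs matching partial_filter."""
    return {
        "index": {"partial_filter_selector": partial_filter, "fields": fields},
        "ddoc": ddoc,
        "type": "json",
    }


def find_query(selector, ddoc):
    """Body for POST /{db}/_find, naming the partial index explicitly."""
    return {"selector": selector, "use_index": ddoc}


def matches(doc, selector):
    """Equality-only matching; a simplification, not Mango's real semantics."""
    return all(doc.get(key) == value for key, value in selector.items())


def two_stage(docs, partial_filter, query_selector):
    """Stage 1 keeps docs that enter the index; stage 2 filters at query time."""
    indexed = [doc for doc in docs if matches(doc, partial_filter)]
    return [doc for doc in indexed if matches(doc, query_selector)]
```

For example, with a partial filter of `{"type": "post"}` and a query selector of `{"author": "alice"}`, only documents that are posts ever reach the index, and the query then narrows that smaller set down by author.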

If you want to learn more about Mango, there's plenty of information in the documentation as well as in my ApacheCon 2018 talk. Online you can find my slides from the talk and a recording of the session, which starts at 17:30.

Q: What would you recommend to people who are looking to adopt CouchDB for their project?

Experiment locally, but bring your ideas back to the CouchDB community through the user mailing list or Slack channel (both linked from https://couchdb.apache.org/) for another pair of eyes and to learn some best practices. If you're looking for more personalised attention, Neighbourhoodie Software is always here to help! ;)