Introducing Structured Query Server posted Tuesday, March 5, 2024 by The Neighbourhoodie Team
In the vast realm of databases, two types have consistently marked their presence – the traditional SQL databases and the more flexible NoSQL counterparts. SQL databases, with their structured nature, have been serving businesses for decades. On the other hand, NoSQL databases, notably Apache CouchDB, have catered to the needs of modern web applications and their semi-structured data requirements.
While both have their distinct strengths, there's often been a wish amongst users and developers for something that combines the two — a tool that brings together the scalability of NoSQL with the precise querying of SQL. In response to this need, Neighbourhoodie presents the Structured Query Server. Designed with feedback from the developer community, this addition aims to address common challenges faced by CouchDB users and offer enhanced capabilities for data analysts.
This blog is going show the features of our Structured Query Server for Apache CouchDB, its application, and how it harmonise the two different approaches.
Historical Context
SQL databases have been around for a long time, providing a traditional way to manage structured data which is known by many. The SQL querying approach they use is widely recognised and appreciated for its precision and familiarity. Meanwhile, NoSQL databases like Apache CouchDB have gained popularity in the database world, not by chance, but due to its robust set of features and design principles. Here are some key strengths that have made CouchDB stand out:
-
Flexibility with Document Storage: Unlike traditional relational databases, CouchDB uses a schema-free document model, making it perfect for semi-structured and evolving data.
-
Scalability: Designed with modern web applications in mind, CouchDB scales horizontally, accommodating large data sets and high request loads with ease.
-
Distributed Architecture: At the heart of CouchDB is its distributed nature, which supports multi-node setups, replication, and synchronisation, ensuring data availability and resilience.
-
Conflict Resolution: In distributed systems, data conflicts are a reality. CouchDB has mechanisms in place to handle them such as revision, ensuring data consistency across nodes.
-
Web-native Interface: With an HTTP-based API and JSON storage format, CouchDB seamlessly integrates with the web, making it easier for developers to build and deploy web applications.
-
Replication & Offline First: CouchDB's ability to replicate data between servers and devices makes it a strong choice for applications that require offline capabilities and later synchronisation when online.
Understanding these strengths provides context for why adding SQL querying capabilities via the Structured Query Server can be such a valuable enhancement for users, combining the best of CouchDB with the querying power of SQL.
Guiding Principles
-
For CouchDB Users & Developers: Recognising that many CouchDB users come from diverse backgrounds, some with strong SQL roots, the aim was to offer them an environment where they felt at home. By bringing SQL query capabilities to CouchDB, developers can now expand the way they query data in CouchDB.
-
Encouraging Adaptability: With the tech world constantly evolving, tools that can adapt have a distinct advantage. By enhancing CouchDB's functionalities with SQL querying, it becomes a more versatile choice for projects with varied data needs.
-
Enterprise Systems Integration: Businesses with legacy SQL-based systems can now easily integrate with newer applications built on CouchDB without costly migration or infrastructure change.
Structured Query Server is available as a standalone daemon that indexes your databases and provides an easy to use SQL-over-HTTP interface to run fully-featured and standards compliant SQL queries over your JSON documents. It runs next to your CouchDB installation, 100% under your control.
It exists to expand CouchDB’s native querying solutions. Those are single-dimensional, that means only one document field-dimension can be searched against at a time. e.g. search by name, or search by name and date, by name and date and time; but not: search by priority or time, order by name, etc. Most importantly, each row in the result set can only include data from a single document.
Structured Query Server let you use SQL SELECT
statements to extract information from a CouchDB database.
It does this by building a copy of the source database in SQLite, with each document in the source database being held in a table based on its type property. CouchDB design documents are used to specify the schema for the SQLite database, and the SQLite data is automatically kept up to date just like any other CouchDB view. Structured Query Server works equally well with CouchDB cluster as well as single-node instances.
Installation & Setup
When purchasing SQS you would get a zip file that looks something like:
structured-query-server-<version>-<platform>.zip
.
Extract this, check the extracted content's checksum, and switch into its directory:
$ unzip -d sqs structured-query-server-1.2.3-linux-x86_64.zip
$ shasum -a 256 --check sqs/structured-query-server.sha256
./sqs/structured-query-server: OK
$ cd sqs
The structured-query-server
executable can be run from this directory, or
copied to any directory on your $PATH
.
For command-line interaction with Structured Query Server and CouchDB, we recommend installing curl and jq. For application development, your platform's HTTP and JSON libraries are all you should need.
Usage
Starting the server
The server is distributed as a binary that launches an HTTP service. To start the server up:
$ ./structured-query-server --server <couch-url>
The <couch-url>
argument should be the URL of your CouchDB instance, and it
should not include any user credentials. For example:
$ ./structured-query-server --server 'http://couch.example.com:5984'
Get Example Data
To follow along with examples that are used in this blog post we first need to download our example data.
You can clone the repo containing our example data by copying this command:
git clone git@github.com:neighbourhoodie/couchdb-bootstrap-example-data.git
After cloning the repo you can use couchdb-bootstrap which is a community tool written to help bootstrap couchdb installations.
Now in the directory where you cloned the repo you can run:
npx couchdb-bootstrap 'http://admin:password@127.0.0.1:5984' couchdb-bootstrap-example-data
Where admin
and password
are your couchdb credentials.
This will create a database called ghibli on your couch that have all the data we need :)
NOTE: You can give it the root folder of the example data repo you cloned, there is no need to navigate to that folder since its already configured in a way that is compatible with couchdb-bootstrap
Now we can run this command to make sure we have the data we need in our ghibli
database :
$ curl -s 'http://127.0.0.1:5984/ghibli/_all_docs?include_docs=true' | jq
We should get something like this:
{
"total_rows": 31,
"offset": 0,111
"rows": [
{
"id": "arrietty",
"key": "arrietty",
"value": {
"rev": "1-3d67d1e90bdbda2308145f4eb713159d"
},
"doc": {
"_id": "arrietty",
"_rev": "1-3d67d1e90bdbda2308145f4eb713159d",
"year": 2010,
"title": "The Secret World of Arrietty",
"length": 94,
"votes": 76376,
"rating": 7.6,
"type": "movie",
"director": "yonebayashi"
}
},
{
"id": "castle",
"key": "castle",
"value": {
"rev": "1-fc101bfb407c0c161b6e4826dbedf571"
},
"doc": {
"_id": "castle",
"_rev": "1-fc101bfb407c0c161b6e4826dbedf571",
"year": 1986,
"title": "Castle in the Sky",
"length": 125,
"votes": 143557,
"rating": 8,
"type": "movie",
"director": "miyazaki"
}
},
{
"id": "earthsea",
"key": "earthsea",
"value": {
"rev": "1-3a46f56d6cfc58e6ca46637e770a5ad0"
},
"doc": {
"_id": "earthsea",
"_rev": "1-3a46f56d6cfc58e6ca46637e770a5ad0",
"year": 2006,
"title": "Tales from Earthsea",
"length": 115,
"votes": 33445,
"rating": 6.4,
"type": "movie",
"director": "goro"
}
},
...
]
}
Now that we have our data in place we can move on to the fun part which is running and configure our Structured Query Server.
Defining a schema
To begin using SQL queries on your CouchDB data, you need to define a schema for
the SQLite database that describes how you want your documents to be represented
in SQL tables. CouchDB document values are copied into tables based on their
type
field; a document with "type": "movie"
will be stored in the movie
table.
Each table needs a column for each document property you want to use in your SQL
queries. For example, if your documents with "type": "movie"
have a property
named year
whose value is an integer, then the movie
table needs a column
named year
whose type is INT
or some other compatible SQLite data type.
Documents that do not have a type
field that corresponds to a defined table
are ignored, that is they are not included in the SQLite data and will not be
returned in query results. Similarly, any document property that's not mapped to
a SQL column cannot be used in queries.
Documents that do not have a property matching one of the defined columns will
still be included in the SQLite data with that field set to NULL
. For example,
a document with "type": "movie"
that has no year
property will be stored
with year = NULL
.
The SQLite schema is defined by storing a design document in CouchDB with the
language
field set to sqlite
. This document should be written to
/{db}/_design/{name}
just like any other design document; {name}
can be any
valid design document ID. The general structure of these design documents is as
follows:
{
"_id": "_design/<name>",
"language": "sqlite",
"sql": {
"tables": {
"<table_name>": {
"fields": {
"<field_name>": <field_desc>,
...
}
},
...
},
"indexes": [
"CREATE INDEX IF NOT EXISTS ...",
...
]
}
}
The key parts of this structure are as follows:
-
The
language
field must have the value"sqlite"
. -
There must be a top-level field named
sql
which may contain the sub-fieldstables
andindexes
. -
The
sql.tables
field defines the tables that will exist in the SQLite database, including each table's name and its list of fields, where each field has at least a name and a type. -
The
sql.indexes
field is an optional array ofCREATE INDEX IF NOT EXISTS
statements to define additional indexes. Each table is already indexed on the document_id
, so you do not need to define an index for that.
A table definition must contain a fields
key whose value contains the names of
the table's columns and their types. For example, a basic table definition looks
like this:
"movie": {
"fields": {
"year": "INT",
"title": "VARCHAR",
"length": "INT",
"votes": "INT",
"rating": "REAL",
"director": "VARCHAR",
"description": "VARCHAR"
}
}
Sometimes you will want to extract nested properties from your CouchDB documents. In SQL each table has a flat list of columns, but you can project your nested document properties into column names using this syntax for a field:
"<field_name>": {
"field": "<path>",
"type": "<type>"
}
For example, the following table definition contains a date_of_birth
column
whose value comes from accessing doc.born.day
on each document whose type
is
"director"
. Similarly is has a place_of_birth
column taking its value from
doc.born.place
.
"director": {
"fields": {
"name": "VARCHAR",
"date_of_birth": {
"field": "born.day",
"type": "VARCHAR"
},
"place_of_birth": {
"field": "born.place",
"type": "VARCHAR"
}
}
}
Here is a complete example of a SQLite design document as it should be stored in CouchDB.
{
"_id": "_design/sqlite",
"language": "sqlite",
"sql": {
"tables": {
"movie": {
"fields": {
"year": "INT",
"title": "VARCHAR",
"length": "INT",
"votes": "INT",
"rating": "REAL",
"director": "VARCHAR",
"description": "VARCHAR"
}
},
"director": {
"fields": {
"name": "VARCHAR",
"date_of_birth": {
"field": "born.day",
"type": "VARCHAR"
},
"place_of_birth": {
"field": "born.place",
"type": "VARCHAR"
}
}
}
},
"indexes": ["CREATE INDEX IF NOT EXISTS year ON movie (year)"]
}
}
Running queries
Once a database schema has been defined using a design document, you can run SQL queries by sending a POST /{db}/\_design/{name}
request to Structured Query Server, with the body containing a SELECT
query.
Assume we've also stored the design doc above under /ghibli/_design/sqlite
. We start up Structured Query Server with the URL of our CouchDB instance:
$ ./structured-query-server -s 'http://127.0.0.1:5984' --data-dir ./data
It will now be listening on port 4984 and we can perform a SQL query. The server returns the results as a JSON array of objects representing the result's rows:
$ curl -s -X POST 'http://127.0.0.1:4984/ghibli/_design/sqlite' \ -d 'SELECT * FROM movie ORDER BY year LIMIT 3' | jq
The response should look like this:
[
{
"_id": "nausicaa",
"year": 1984,
"title": "Nausicaä of the Valley of the Wind",
"length": 117,
"votes": 142796,
"rating": 8.1,
"director": "miyazaki",
"description": null
},
{
"_id": "castle",
"year": 1986,
"title": "Castle in the Sky",
"length": 125,
"votes": 143557,
"rating": 8,
"director": "miyazaki",
"description": null
},
{
"_id": "grave",
"year": 1988,
"title": "Grave of the Fireflies",
"length": 89,
"votes": 221916,
"rating": 8.5,
"director": "takahata",
"description": null
}
]
Queries can also be provided in parameterised form by sending a JSON body containing the query and any arguments. This can be used to avoid interpolating arguments into the query yourself, in order to avoid SQL injection.
$ curl -s -X POST 'http://127.0.0.1:4984/ghibli/_design/sqlite' \ --json '{ "query": "SELECT * FROM movie WHERE year = ?", "args": [1986] }'
The args array must contain the same number of values as there are ? placeholders in the query. The items in the args array must be booleans, numbers, strings or null. The request must have its Content-Type
set to application/json
, otherwise it will be treated as plain text.
Structured Query Server uses CouchDB's authentication system and these query requests must include a username and password that's allowed to access the given CouchDB database, e.g.:
$ curl -su admin:admin 'http://127.0.0.1:4984/ghibli/_design/sqlite' -d '...'
or
$ curl -s 'http://admin:admin@127.0.0.1:4984/ghibli/_design/sqlite' -d '...'
If the given user credentials are not valid, a 401 response will be returned.
Note that this endpoint only allows the execution of SELECT
queries. Any query that would modify the data stored in SQLite would render it inconsistent with the data stored in CouchDB, and so is disallowed. This endpoint should be used for read-only queries just like any other CouchDB view endpoint.
Running complex queries
We saw in the example above how we can write simple SQL queries, however this is not where SQS shines, it shines when we need to run more complex queries, including combining data from multiple documents into a single result row. In the example below, each result row includes data from a director document and a movie document.
SELECT movie.title, director.name, movie.year FROM movie LEFT JOIN director ON movie.director = director._id ORDER BY movie.year LIMIT 3
[
{
"title": "Nausicaä of the Valley of the Wind",
"name": "Hayao Miyazaki",
"year": 1984
},
{
"title": "Castle in the Sky",
"name": "Hayao Miyazaki",
"year": 1986
},
{
"title": "Grave of the Fireflies",
"name": "Isao Takahata",
"year": 1988
}
]
As you can see we can write more complex quires on the same dataset using SQL.
Engineering Trade-Offs
At this point you might be asking, “if all this is so great, why is this not built into CouchDB? What’s the catch?”
The answer is as straightforward as it is simple and honest: software engineering is all about trade-offs and this is no different.
CouchDB is designed to scale seamlessly from single-node installations to very large multi-node clusters. To be able to do this, it has to leave some features on the table, some of which are required for building fully flexible SQL query engines.
With our 15+ year experience in helping people with CouchDB professionally at Neighbourhoodie, we have learned that while a lot of people value this maximalist approach to scaling, only very few of our customers actually make use of the extreme scaling options that CouchDB supports. Instead, most of our customers will run operations that operate and scale at a pace that is more moderate. And Structured Query Server is for them. It is designed for all by the most extreme scaling setups.
If you are unsure if you qualify for one or the other, our experience is that you fall into the “moderate” category.
In addition, we made Structured Query Server as fast (and in some cases faster) as the built-in JavaScript View Query Server, so if that is good enough for your, Structured Query Server is as well.
Get Your Licsense Now
* VAT is added if applicable
If you have any further questions, please visit the SQS product page or get in touch with us directly.
We’re also doing a public product demo for SQS on March 13 2024 at 5pm CET. Please join our Vito hub in order to participate in the Q&A after the demo.
Sign up for the SQS product demo
And if you’d like to keep up with our work at Neighbourhoodie, sign up for our newsletter.