DuckDB¶
This section describes how to connect to a DuckDB database from Datero.
Before delving into the implementation details, let's briefly describe what DuckDB is and why connecting to it from Datero is interesting. It's a relational DBMS that can work in two modes: as an in-memory or a file-based database. It has many features and capabilities, but one of the most interesting is its superior ability to work with files.
DuckDB can read and write a variety of file formats like CSV, JSON, Excel, Parquet, Iceberg and plain text files. It can also work with remote files from AWS S3 or Azure Blob Storage.
Paired with Datero's ability to connect to other datasources like Oracle, Mongo, Redis, etc., DuckDB opens doors to a whole new set of use cases. Such a combination brings the file-based world into the relational database world. And all with the power of plain old SQL!
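To give a flavour of this, here are a few illustrative DuckDB queries over files. The file names and the bucket are hypothetical, and the S3 example additionally requires the httpfs extension and credentials to be configured.

-- querying files in place, no loading step required
select * from 'local_data.csv';
select * from read_json_auto('events.json');
select * from 's3://my-bucket/sales/*.parquet';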
Environment¶
Environment setup for the DuckDB connector is similar to the SQLite one.
The main requirement is to have the DuckDB database file accessible on the file system.
So, all we need is the DuckDB database file itself.
Having it at hand, we just mount it into the datero container as a file in a local folder and connect to it from there.
We can't reuse the datero container created in the installation section, because we created it without any mounts defined.
Let's spin up a new datero_mount container, but this time specify a mount folder for the DuckDB database file.
We mount the current folder $(pwd) to the /data folder inside the container.
GitBash on Windows
If you are on Windows, you must specify the absolute path to the current folder.
For example, c:/Users/user/some/path.
Or expand the current directory with %cd% if you are running the command from cmd.
Do not run docker commands with folder mounts specified from GitBash.
Because GitBash is a Linux emulator for Windows, it will translate Windows paths to Linux paths, and docker will not be able to find the folder.
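For illustration, the same container start from cmd could look like this. It is a sketch mirroring the Linux command shown below, with %cd% expanding to the current directory.

REM run from cmd.exe, not from GitBash
docker run -d --name datero_mount ^
  -p 80:80 -p 5432:5432 ^
  -e POSTGRES_PASSWORD=postgres ^
  -v "%cd%:/data" ^
  chumaky/datero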
# stopping currently running container, if any, to free up ports
docker stop datero
# starting new container with current folder mounted to /data folder inside the container
docker run -d --name datero_mount \
-p 80:80 -p 5432:5432 \
-e POSTGRES_PASSWORD=postgres \
-v "$(pwd):/data" \
chumaky/datero
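To double-check that the mount works, you can list the folder from inside the container. The container name is the one used above; the output should match the contents of your current host folder.

docker exec datero_mount ls -l /data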
DuckDB database file¶
Now we need to create a DuckDB database file.
To do so, we have to install the duckdb command line utility.
It's available for all major operating systems.
You can find installation instructions in the official documentation.
Once installed, make sure it's added to your PATH. It must be callable from the command line.
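A quick way to check that the utility is on PATH is to print its version; the exact version is not important for this guide.

$ duckdb --version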
Now we can create a new database file calendar.duckdb and create a test table in it. Opening the file with the command line utility creates it if it doesn't exist yet.
$ duckdb calendar.duckdb
D create table seasons(id int, name text);
D insert into seasons values (1, 'Spring'), (2, 'Summer'), (3, 'Autumn'), (4, 'Winter');
D .headers on
D select * from seasons;
┌───────┬─────────┐
│ id │ name │
│ int32 │ varchar │
├───────┼─────────┤
│ 1 │ Spring │
│ 2 │ Summer │
│ 3 │ Autumn │
│ 4 │ Winter │
└───────┴─────────┘
D .quit
While this setup is identical to the one described in the corresponding SQLite documentation section, the main difference is the following: DuckDB allows you not only to create classic tables but also to work with files directly, and it does so in a very efficient way.
Source JSON file¶
Let's extend our example and introduce a JSON file months.json which contains the list of months with a reference to the corresponding season.
$ cat months.json
[
{"id": 1, "name": "January", "season_id": 4},
{"id": 2, "name": "February", "season_id": 4},
{"id": 3, "name": "March", "season_id": 1},
{"id": 4, "name": "April", "season_id": 1},
{"id": 5, "name": "May", "season_id": 1},
{"id": 6, "name": "June", "season_id": 2},
{"id": 7, "name": "July", "season_id": 2},
{"id": 8, "name": "August", "season_id": 2},
{"id": 9, "name": "September", "season_id": 3},
{"id": 10, "name": "October", "season_id": 3},
{"id": 11, "name": "November", "season_id": 3},
{"id": 12, "name": "December", "season_id": 4}
]
To feel the greatness of DuckDB, all you have to do to read the file as a table is to open the database and execute the following query.
D select * from 'months.json';
┌───────┬───────────┬───────────┐
│ id │ name │ season_id │
│ int64 │ varchar │ int64 │
├───────┼───────────┼───────────┤
│ 1 │ January │ 4 │
│ 2 │ February │ 4 │
│ 3 │ March │ 1 │
│ 4 │ April │ 1 │
│ 5 │ May │ 1 │
│ 6 │ June │ 2 │
│ 7 │ July │ 2 │
│ 8 │ August │ 2 │
│ 9 │ September │ 3 │
│ 10 │ October │ 3 │
│ 11 │ November │ 3 │
│ 12 │ December │ 4 │
├───────┴───────────┴───────────┤
│ 12 rows 3 columns │
└───────────────────────────────┘
Isn't it great? You can work with JSON files as with ordinary tables via SQL! And there is no data copying involved. It's direct access to the file by the database engine. Data virtualization at its best.
We can go further and create a view on top of a file.
D create view months as select * from 'months.json';
D select * from months;
┌───────┬───────────┬───────────┐
│ id │ name │ season_id │
│ int64 │ varchar │ int64 │
├───────┼───────────┼───────────┤
│ 1 │ January │ 4 │
│ 2 │ February │ 4 │
│ 3 │ March │ 1 │
│ 4 │ April │ 1 │
│ 5 │ May │ 1 │
│ 6 │ June │ 2 │
│ 7 │ July │ 2 │
│ 8 │ August │ 2 │
│ 9 │ September │ 3 │
│ 10 │ October │ 3 │
│ 11 │ November │ 3 │
│ 12 │ December │ 4 │
├───────┴───────────┴───────────┤
│ 12 rows 3 columns │
└───────────────────────────────┘
What this step does is create a data virtualization layer for file-based data as a metadata object in the database.
In turn, the DuckDB connector in Datero will be able to work with this view as with an ordinary table.
In the end, Datero will read directly from the file without any data copying.
And the chain will be: Datero -> DuckDB -> file.
But to make this approach work, we must mount into the datero container not only the database file but also the file we want to work with.
The path to the file that we specify in the CREATE VIEW statement must be either relative to the current working directory or an absolute path.
In both cases, once mounted to the container, the path will be resolved against the container's file system.
And here comes the tricky part.
We created a view referencing a file in the current working directory.
Because the file was present in it, we wrote just select * from 'months.json' and it worked.
But when we mount the current directory to the container, the file will be accessible under the /data folder.
So the path to the file in the view must be changed to /data/months.json.
To make this command work on the host, we must also have the months.json file in the /data folder on the host.
The takeaway is the following: in your file-based views you must specify the path under which the corresponding source files will be mounted into the container.
And you must put the files on your host into the same path to initially create a view on them.
So, if your current directory is /home/mydir and you mount it to the /data folder in the container, you must:

- copy the months.json file located in /home/mydir into the /data folder on the host;
    - this is to allow you to create a view with the /data/months.json file path specified.
- open the DuckDB database file and create a view that reads from the /data/months.json file;
- mount the current /home/mydir folder with the months.json file to the /data folder in the container;
- from within the container, query the view, which will correctly resolve to the /data/months.json path already within the container's filesystem.
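Putting these steps together, a sketch of the commands could look like this. The /home/mydir path is just the example above, and copying into /data on the host may require sudo on your system.

# on the host: place the source file under the same path it will have inside the container
$ sudo mkdir -p /data
$ sudo cp /home/mydir/months.json /data/months.json

# recreate the view so that it references the container-visible path
$ duckdb /home/mydir/calendar.duckdb
D create or replace view months as select * from '/data/months.json';
D .quit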
Files mounting privileges¶
Having done that, we should have the following files in the current directory. We intentionally omit some unneeded columns in the output for brevity.
$ ls -l
drwxrwxr-x some_user some_user ./
drwxrwxr-x some_user some_user ../
-rw-rw-r-- some_user some_user calendar.duckdb
-rw-rw-r-- some_user some_user months.json
Pay attention to the permissions of the files and the current ./ working directory.
The files are created by the some_user user on the host, and the directory is owned by the same some_user user.
By default, only the read permission is set for other, for both the files and the directory.
Thanks to that other read permission, the calendar.duckdb file will be readable by the postgres user from inside the container.
But it looks like the DuckDB connector always tries to open the file in write mode as well.
If you try to execute even a plain SELECT query from Datero, you will get a permission denied error.
So, to make it work you must make the calendar.duckdb file writable by the postgres user inside the container.
From the host perspective this means that you must make the file writable by other.
Yet another thing: under some circumstances, a temporary WAL file might be generated alongside the main database file.
It's named calendar.duckdb.wal and is deleted after the connection is closed.
This file is created by the postgres process from inside the container.
To make this work, you must also make the directory where the database file is located writable by the postgres user inside the container.
So, you have to execute the following command over the mapped directory.
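One possible way to do it is shown below; it grants the write bit to other on both the mapped directory and the database file, which matches the listing that follows.

# run from within the mounted folder on the host
$ chmod o+w . calendar.duckdb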
In the end, you will have the following permissions set.
$ ls -l
drwxrwxrwx some_user some_user ./
drwxrwxr-x some_user some_user ../
-rw-rw-rw- some_user some_user calendar.duckdb
-rw-rw-r-- some_user some_user months.json
Now we have granted the w permission to other over the current ./ directory and the calendar.duckdb file.
Interestingly, the source months.json file doesn't need to be writable by other.
It's not a database file, and it's not written to by the postgres process as part of DuckDB connector operations.
With this setup you will be able to connect to the DuckDB database from the Datero container and access the data from the file-based view without errors.
Write permissions on mounted files & folders
Mounted files are accessed by the user under which the Postgres database engine is running inside the datero container.
By default it's the postgres user.
It has the value 999 for both the UID and GID inside the container, as recorded in the /etc/passwd file.
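You can verify this from a running container; this is a quick check assuming the datero_mount container from this guide.

$ docker exec datero_mount id postgres   # expected to report uid=999 and gid=999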
In addition, an intermediate *.wal file might be created alongside the main database file.
Because it's created on the fly by the postgres process, you must make the mounted folder where your database file is located writable by the postgres user as well.
To add even more complexity, there is no such thing as users and groups by themselves in unix-based systems.
They are just text labels for the numeric values stored in the /etc/passwd file.
In reality, access is checked against the numeric values.
This means that if you have different numeric values for the same user name on the host and in the container, you will still get a permission denied error.
So, on your host you either have to explicitly grant read/write access to the numeric value 999 of the postgres user in the container, or make the file readable/writable by any user.
And do the same for the directory containing the database file.
The latter is what we implemented via the other class in file permissions.
In such a scenario, you don't have to worry about the postgres user inside the container, nor which numeric values it has for the UID and GID.
Because the file and the directory are accessible by any user on the host, they will also be accessible by any user inside the container.
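If you'd rather grant access explicitly to the numeric 999 UID instead of opening the files to everyone, one possible way on hosts with ACL support is via the setfacl utility.

# grant read/write on the database file and rwx on the directory to UID 999
$ setfacl -m u:999:rw calendar.duckdb
$ setfacl -m u:999:rwx .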
Datero connection¶
Open the Datero web UI at http://localhost and click on the DuckDB entry in the Connectors navigation section on the left.
Enter any descriptive name in the Description field. For example, DuckDB.
Enter /data/calendar.duckdb as the Database value.
The /data folder is the folder within the container into which we mounted our current directory.
And calendar.duckdb is the database file within it that we created earlier via the duckdb calendar.duckdb command.
Click Save to create the Server logical object.
(Screenshots: Connector and Connection Form)
Schema import¶
After the Server is created, we can import the database schema from it.
The connection wizard will switch the tab and open the Import Schema form.
DuckDB doesn't have the notions of separate databases or schemas.
The database file is a single database/schema, which is referred to as public by the DuckDB connector.
In the Remote Schema field, select the public schema.
For example, we want to import our DuckDB database into the duckdb local schema.
To do that, type duckdb into the Local Schema input field and click the Import Schema button.
If everything is correct, you will see a success notification message.
(Screenshots: Server Object and Import Schema)
Important
Schema import doesn't physically copy any data. For every source table and view it creates an object of a special type in a local schema. This object type is called a foreign table. It implements the data virtualization pattern.
Querying a foreign table will automatically fetch data from the source database. If supported by the connector, any filtering, sorting, grouping, etc. will be pushed down to the source database. This means that only the data that is needed will be fetched.
If you change the schema in the source database, you will need to re-import it in Datero to reflect the changes.
Thus, schema evolution is handled simply by re-importing the schema.
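Datero exposes a PostgreSQL engine (it listens on port 5432) and the import creates foreign tables, so conceptually the step above corresponds to the standard PostgreSQL command sketched below. The server name duckdb_server is purely illustrative; the UI handles this for you.

-- conceptual equivalent of the Import Schema step (names are illustrative)
import foreign schema "public" from server duckdb_server into duckdb;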
We are now ready to query our DuckDB database from Datero.
Data Querying¶
Click on the Query Editor icon in the left navigation panel.
You will see the duckdb schema in the Datero object tree.
If you expand it, you will see the seasons and months tables from the original DuckDB database.
Their definitions were automatically imported.
To query data, just write a query in the editor and press Ctrl+Enter or click the green Run button above.
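For example, a query joining the two imported objects could look like this. It is a sketch; the duckdb prefix is the local schema name chosen during the import above.

-- seasons is an ordinary table in calendar.duckdb, months is the view over months.json
select s.name as season, m.name as month
from duckdb.seasons s
join duckdb.months m on m.season_id = s.id
order by s.id, m.id;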
And that's it! You have successfully connected to the DuckDB database from Datero and queried the data.
But you not only queried the data from the database file itself, you also queried data directly from the JSON file and joined them!
The seasons table is a classic table, while the months table is a view on top of the JSON file.
We were able to join data from an ordinary table in the database and a JSON file from the file system. And all of this was done via a single SQL query in Datero, without any data copying or moving.
Summary¶
DuckDB stands apart from other single-database connectors. Apart from being a classical relational database like SQLite, it is capable of working with files directly, and in a very efficient way. Because of this, it can be used as a bridge between the file-based world and the relational database world. In conjunction with Datero and its set of connectors, it enables a whole new set of use cases.
For example, you can join data from an Oracle database and a JSON file from an AWS S3 bucket. Or join data from your Excel files and MongoDB collections. Or join data from a Redis cache and Parquet files from Azure Blob Storage.
Sounds crazy, doesn't it? But it's all possible with DuckDB and Datero!
Currently, setting up the corresponding views in DuckDB is a bit tricky. The Datero team is working on simplifying this process through the UI and a YAML config file. It will be possible to create views on files directly from the Datero UI, or to specify the views in a YAML config file and mount it into the container, where it will be read and executed by Datero on container start. But even now, with a bit of manual work, you can achieve great results with the DuckDB connector in Datero.