Introduction

Search Shortcut cmd + k | ctrl + k

Documentation / DuckDB Extension

Introduction

In DuckDB, DuckLake is supported through the ducklake extension.

Installation

Install the latest stable DuckDB. (The ducklake extension requires DuckDB v1.3.0 "Ossivalis" or later.)

INSTALL ducklake;

To use DuckLake, you need to make two decisions: which metadata catalog database you want to use and where you want to store those files. In the simplest case, you use a local DuckDB file for the metadata catalog and a local folder on your computer for file storage.

Creating a New Database

DuckLake databases are created by simply starting to use them with the ATTACH statement. In the simplest case, you can create a local, DuckDB-backed DuckLake like so:

ATTACH 'ducklake:my_ducklake.ducklake' AS my_ducklake;
USE my_ducklake;

This will create a file my_ducklake.ducklake, which is a DuckDB database with the DuckLake schema.

We also use USE so we don't have to prefix all table names with my_ducklake. Once data is inserted, this will also create a folder my_ducklake.ducklake.files in the same directory, where Parquet files are stored.

If you would like to use another directory, you can specify this in the DATA_PATH parameter for ATTACH:

ATTACH 'ducklake:my_other_ducklake.ducklake' AS my_other_ducklake (DATA_PATH 'some/other/path/');
USE ...;

The path is stored in the DuckLake metadata and does not have to be specified again to attach to an existing DuckLake catalog.

Attaching an Existing Database

Attaching to an existing database also uses the ATTACH syntax. For example, to re-connect to the example from the previous section in a new DuckDB session, we can just type:

ATTACH 'ducklake:my_ducklake.ducklake' AS my_ducklake;
USE my_ducklake;

Using DuckLake

DuckLake is used just like any other DuckDB database. You can create schemas and tables, insert data, update data, delete data, modify table schemas etc.

Note that – similarly to other data lake and lakehouse formats – the DuckLake format does not support indexes, primary keys, foreign keys, and UNIQUE or CHECK constraints.

Don't forget to either specify the database name of the DuckLake explicity or use USE. Otherwise you might inadvertently use the temporary, in-memory database.

Example

Let's observe what happens in DuckLake when we interact with a dataset. We will use the Netherlands train traffic dataset here.

We use the example DuckLake from above:

ATTACH 'ducklake:my_ducklake.ducklake' AS my_ducklake;
USE my_ducklake;

Let's now import the dataset into the a new table:

CREATE TABLE nl_train_stations AS
    FROM 'https://blobs.duckdb.org/nl_stations.csv';

Now Let's peek behind the courtains. The data was just read into a Parquet file, which we can also just query.

FROM glob('my_ducklake.ducklake.files/*');
FROM 'my_ducklake.ducklake.files/*.parquet' LIMIT 10;

But now lets change some things around. We're really unhappy with the name of the old name of the "Amsterdam Bijlmer ArenA" station now that the stadium has been renamed to "Johan Cruijff ArenA" and everyone here loves Johan. So let's change that.

UPDATE nl_train_stations SET name_long='Johan Cruijff ArenA' WHERE code = 'ASB';

Poof, its changed. We can confirm:

SELECT name_long FROM nl_train_stations WHERE code = 'ASB';

In the background, more files have appeared:

FROM glob('my_ducklake.ducklake.files/*');

We now see three files. The original data file, the rows that were deleted, and the rows that were inserted. Like most systems, DuckLake models updates as deletes followed by inserts. The deletes are just a Parquet file, we can query it:

FROM 'my_ducklake.ducklake.files/ducklake-*-delete.parquet';

The file should contain a single row that marks row 29 as deleted. A new file has appared that contains the new values for this row.

There are now three snapshots, the table creation, data insertion, and the update. We can query that using the snapshots() function:

FROM my_ducklake.snapshots();

And we can query this table at each point:

SELECT name_long FROM nl_train_stations AT (VERSION => 1) WHERE code = 'ASB';
SELECT name_long FROM nl_train_stations AT (VERSION => 2) WHERE code = 'ASB';

Time travel finally achieved!

Detaching from a DuckLake

To detach from a DuckLake, make sure that your DuckLake is not your default database, then use the DETACH statement:

USE memory;
DETACH my_ducklake;

Using DuckLake from a Client

DuckDB works with any DuckDB client that supports DuckDB version 1.3.0.