Reading Data
DuckLake specifies tables and update transactions to modify them. DuckLake is not a black box, all metadata is stored as SQL tables under the user's control. Of course, they can be queried in whichever way is best for a client. Below we describe a small working example to retrieve table data.
The information below is to provide transparency to users and to aid developers making their own implementation of DuckLake. The
ducklake
DuckDB extension is able to execute those operations in the background.
Get Current Snapshot
Before anything else we need to find a snapshot ID to be queries. There can be many snapshots in the ducklake_snapshot
table. A snapshot ID is a continously increasing number that identifies a snapshot. In most cases, you would query the most recent one like so:
SELECT snapshot_id FROM ducklake_snapshot
WHERE snapshot_id =
(SELECT max(snapshot_id) FROM ducklake_snapshot);
List Schemas
A DuckLake catalog can contain many SQL-style schemas, which each can contain many tables.
These are listed in the ducklake_schema
table.
Here's how we get the list of valid schemas for a given snapshot:
SELECT schema_id, schema_name
FROM ducklake_schema
WHERE
SNAPSHOT_ID >= begin_snapshot AND
(SNAPSHOT_ID < end_snapshot OR end_snapshot IS NULL);
where
SNAPSHOT_ID
is aBIGINT
referring to thesnapshot_id
column in theducklake_snapshot
table
List Tables
We can list the tables available in a schema for a specific snapshot using the ducklake_table
table:
SELECT table_id, table_name
FROM ducklake_table
WHERE
schema_id = SCHEMA_ID AND
SNAPSHOT_ID >= begin_snapshot AND
(SNAPSHOT_ID < end_snapshot OR end_snapshot IS NULL);
where
SCHEMA_ID
is aBIGINT
referring to theschema_id
column in theducklake_schema
tableSNAPSHOT_ID
is aBIGINT
referring to thesnapshot_id
column in theducklake_snapshot
table
Show the Structure of a Table
For each given table, we can list the available top-level columns using the ducklake_column
table:
SELECT column_id, column_name, column_type
FROM ducklake_column
WHERE
table_id = TABLE_ID AND
parent_column IS NULL AND
SNAPSHOT_ID >= begin_snapshot AND
(SNAPSHOT_ID < end_snapshot OR end_snapshot IS NULL)
ORDER BY column_order;
where
TABLE_ID
is aBIGINT
referring to thetable_id
column in theducklake_table
tableSNAPSHOT_ID
is aBIGINT
referring to thesnapshot_id
column in theducklake_snapshot
table
Note that DuckLake supports nested columns – the filter for
parent_column IS NULL
only shows the top-level columns.
For the list of supported data types, please refer to the “Data Types” page.
SELECT
Now that we know the table structure we can query actual data from the Parquet files that store table data. We need to join the list of data files with the list of delete files (if any). There can be at most one delete file per file in a single snapshot.
SELECT data.path AS data_file_path, del.path AS delete_file_path
FROM ducklake_data_file AS data
LEFT JOIN (
SELECT *
FROM ducklake_delete_file
WHERE
SNAPSHOT_ID >= begin_snapshot AND
(SNAPSHOT_ID < end_snapshot OR end_snapshot IS NULL)
) AS del
USING (data_file_id)
WHERE
data.table_id = TABLE_ID AND
SNAPSHOT_ID >= data.begin_snapshot AND
(SNAPSHOT_ID < data.end_snapshot OR data.end_snapshot IS NULL)
ORDER BY file_order;
where (again)
TABLE_ID
is aBIGINT
referring to thetable_id
column in theducklake_table
tableSNAPSHOT_ID
is aBIGINT
referring to thesnapshot_id
column in theducklake_snapshot
table
Now we have a list of files. In order to reconstruct actual table rows, we need to read all rows from the data_file_path
files and remove the rows labeled as deleted in the delete_file_path
.
Not all files have to contain all the columns currently defined in the table, some files may also have columns that existed previously but have been removed.
DuckLake also supports changing the schema, see schema evolution.
In DuckLake, paths can be relative to the initially specified data path. Whether path is relative or not is stored in the
ducklake_data_file
andducklake_delete_file
entries (path_is_relative
) to thedata_path
prefix fromducklake_metadata
.
SELECT
with File Pruning
One of the main strengths of Lakehouse formats is the ability to prune files that cannot contain data relevant to the query.
The ducklake_file_column_statistics
table contains the file-level statistics.
We can use the information there to prune the list of files to be read if a filter predicate is given.
We can get a list of all files that are part of a given table like described above. We can then reduce that list to only relevant files by querying the per-file column statistics. For example, for scalar equality we can find the relevant files using the query below:
SELECT data_file_id
FROM ducklake_file_column_statistics
WHERE
table_id = TABLE_ID AND
column_id = COLUMN_ID AND
(SCALAR >= min_value OR min_value IS NULL) AND
(SCALAR <= max_value OR max_value IS NULL);
where (again)
TABLE_ID
is aBIGINT
referring to thetable_id
column in theducklake_table
table.COLUMN_ID
is aBIGINT
referring to thecolumn_id
column in theducklake_column
table.SCALAR
is the scalar comparision value for the pruning.
Of course, other filter predicates like greater than etc. will require slighlty different filtering here.
The minimum and maximum values for each column are stored as strings and need to be cast for correct range filters on numeric columns.
Writing Data
Snapshot Creation
Any changes to data stored in DuckLake require the creation of a new snapshot. We need to:
- create a new snapshot in
ducklake_snapshot
and - log the changes a snapshot made in
ducklake_snapshot_changes
INSERT INTO ducklake_snapshot (
snapshot_id,
snapshot_timestamp,
schema_version,
next_catalog_id,
next_file_id
)
VALUES (
SNAPSHOT_ID,
now(),
SCHEMA_VERSION,
NEXT_CATALOG_ID,
NEXT_FILE_ID
);
INSERT INTO ducklake_snapshot_changes (
snapshot_id,
snapshot_changes
)
VALUES (
SNAPSHOT_ID,
CHANGES
);
where
SNAPSHOT_ID
is the new snapshot identifier. This should bemax(snapshot_id) + 1
.SCHEMA_VERSION
is the schema version for the new snapshot. If any schema changes are made, this needs to be incremented. Otherwise the previous snapshot'sschema_version
can be re-used.NEXT_CATALOG_ID
gives the next unused identifier for tables, schemas, or views. This only has to be incremented if new catalog entries are created.NEXT_FILE_ID
is the same but for data or delete files.CHANGES
contains a list of changes performed by the snapshot. See the list of possible values in theducklake_snapshot_changes
table's documentation.
CREATE SCHEMA
A schema is a collection of tables. In order to create a new schema, we can just insert into the ducklake_schema
table:
INSERT INTO ducklake_schema (
schema_id,
schema_uuid,
begin_snapshot,
end_snapshot,
schema_name
)
VALUES (
SCHEMA_ID,
uuid(),
SNAPSHOT_ID,
NULL,
SCHEMA_NAME
);
where
SCHEMA_ID
is the new schema identifier. This should be created by incrementingnext_catalog_id
from the previous snapshot.SNAPSHOT_ID
is the snapshot identifier of the new snapshot as described above.SCHEMA_NAME
is just the name of the new schema.
CREATE TABLE
Creating a table in a schema is very similar to creating a schema. We insert into the ducklake_table
table:
INSERT INTO ducklake_table (
table_id,
table_uuid,
begin_snapshot,
end_snapshot,
schema_id,
table_name
)
VALUES (
TABLE_ID,
uuid(),
SNAPSHOT_ID,
NULL,
SCHEMA_ID,
TABLE_NAME
);
where
TABLE_ID
is the new table identifier. This should be created by further incrementingnext_catalog_id
from the previous snapshot.SNAPSHOT_ID
is the snapshot identifier of the new snapshot as described above.SCHEMA_ID
is aBIGINT
referring to theschema_id
column in theducklake_schema
table table.TABLE_NAME
is just the name of the new table.
A table needs some columns, we can add columns to the new table by inserting into the ducklake_column
table table.
For each column to be added, we run the following query:
INSERT INTO ducklake_column (column_id,
begin_snapshot,
end_snapshot,
table_id,
column_order,
column_name,
column_type,
nulls_allowed
)
VALUES (
COLUMN_ID,
SNAPSHOT_ID,
NULL,
TABLE_ID,
COLUMN_ORDER,
COLUMN_NAME,
COLUMN_TYPE,
NULLS_ALLOWED
);
where
COLUMN_ID
is the new column identifier. This ID must be unique within the table over its entire life time.SNAPSHOT_ID
is the snapshot identifier of the new snapshot as described above.TABLE_ID
is aBIGINT
referring to thetable_id
column in theducklake_table
table.COLUMN_ORDER
is a number that defines where the column is placed in an ordered list of columns.COLUMN_NAME
is just the name of the column.COLUMN_TYPE
is the data type of the column. See the “Data Types” page for details.NULLS_ALLOWED
is a boolean that defines ifNULL
values can be stored in the column. Typically set totrue
.
We skipped some complexity in this example around default values and nested types and just left those fields as
NULL
. See the table schema definition for additional details.
INSERT
Inserting data into a DuckLake table consists of two main steps: first, we need to write a Parquet file containing the actual row data to storage, and second, we need to register that file in the metadata tables and update global statistics. Let's assume the file has already been written.
INSERT INTO ducklake_data_file (
data_file_id,
table_id,
begin_snapshot,
end_snapshot,
path,
path_is_relative,
file_format,
record_count,
file_size_bytes,
footer_size,
row_id_start
)
VALUES (
DATA_FILE_ID,
TABLE_ID,
SNAPSHOT_ID,
NULL,
PATH,
true,
'parquet',
RECORD_COUNT,
FILE_SIZE_BYTES,
FOOTER_SIZE,
ROW_ID_START
);
where
DATA_FILE_ID
is the new data file identifier. This ID must be unique within the table over its entire life time.TABLE_ID
is aBIGINT
referring to thetable_id
column in theducklake_table
table.SNAPSHOT_ID
is the snapshot identifier of the new snapshot as described above.PATH
is the file name relative to the DuckLake data path from the top-level metadata.RECORD_COUNT
is the number of rows in the file.FILE_SIZE_BYTES
is the file size.FOOTER_SIZE
is the position of the Parquet footer. This helps with efficiently reading the file.ROW_ID_START
is the first logical row ID from the file. This number can be read from theducklake_table_stats
table via columnnext_row_id
.
We have omitted some complexity around relative paths, encrypted files, partitioning and partial files in this example. Refer to the
ducklake_data_file
table documentation for details.
DuckLake also supports changing the schema, see schema evolution.
We will also have to update some statistics in the ducklake_table_stats
table and ducklake_table_column_stats
table` tables.
UPDATE ducklake_table_stats SET
record_count = record_count + RECORD_COUNT,
next_row_id = next_row_id + RECORD_COUNT,
file_size_bytes = file_size_bytes + FILE_SIZE_BYTES
WHERE table_id = TABLE_ID;
UPDATE ducklake_table_column_stats
SET
contains_null = contains_null OR NULL_COUNT > 0,
contains_nan = contains_nan OR NAN_COUNT > 0,
min_value = min(min_value, MIN_VALUE),
max_value = max(max_value, MAX_VALUE)
WHERE
table_id = TABLE_ID AND
column_id = COLUMN_ID;
INSERT INTO ducklake_file_column_statistics (
data_file_id,
table_id,
column_id,
value_count,
null_count,
nan_count,
min_value,
max_value,
contains_nan
)
VALUES (
DATA_FILE_ID,
TABLE_ID,
COLUMN_ID,
RECORD_COUNT,
NULL_COUNT,
NAN_COUNT,
MIN_VALUE,
MAX_VALUE,
NAN_COUNT > 0;
);
where
TABLE_ID
is aBIGINT
referring to thetable_id
column in theducklake_table
table.COLUMN_ID
is aBIGINT
referring to thecolumn_id
column in theducklake_column
table.DATA_FILE_ID
is aBIGINT
referring to thedata_file_id
column in theducklake_data_file
table.RECORD_COUNT
is the number of values (includingNULL
andNaN
values) in the file column.NULL_COUNT
is the number ofNULL
values in the file column.NAN_COUNT
is the number ofNaN
values in the file column (floating-point only).MIN_VALUE
is the minimum value in the file column as a string.MAX_VALUE
is the maximum value in the file column as a string.FILE_SIZE_BYTES
is the size of the new Parquet file.
This example assumes there are already rows in the table. If there are none, we need to use
INSERT
instead here. We also skipped thecolumn_size_bytes
column here, it can safely be set toNULL
.