Updates in the DuckLake 0.2 Standard
TL;DR: We are releasing the updated DuckLake 0.2 standard with several new fatures.
We released DuckLake a little over a month ago, and we were overwhelmed by the positive response from the community. Naturally, we also received a bunch of feature requests. With DuckLake version 0.2, we are happy to present the improvements made since!
We would like to start with a quick reminder: the term DuckLake refers to both the DuckLake open standard and the ducklake
DuckDB extension.
For the updates in the latter, see the blog post on duckdb.org, which also contains a description of the changes in the standard.
New Features
Relative Schema/Table Paths
In the old DuckLake standard, paths were always only relative to the global data path. In the DuckLake v0.2 standard, the location to which data and delete files are written now have three layers:
- Data paths are relative to the table path
- Table paths are relative to the schema path
- Schema paths are relative to the global data path
This allows data files to be written in a more structured manner. By default, the schema and table name are set as the path to which the files are written. For example:
main/
my_table/
ducklake-uuid1.parquet
ducklake-uuid2.parquet
The paths are stored in the ducklake_table
and ducklake_schema
tables.
By writing all files for a given table in a given subdirectory, it is now possible to use prefix-based access control at the object store level to grant users access to only specific schemas or tables in the database.
Name Mapping, and Adding Existing Parquet Files
DuckLake by default uses field ids to perform column mapping. When writing Parquet files through DuckLake, each column contains a field id that indicates to which top-level column it belongs. This allows DuckLake to perform metadata-only alter operations such as renaming and dropping fields.
When registering existing Parquet files written through other means or by other writers, these files generally do not have field identifiers written to them. In order to facilitate using these Parquet files in DuckLake, v0.2 adds a new way of mapping columns in the form of name mapping. This allows registering Parquet files as follows:
ATTACH 'ducklake:my_ducklake.db' AS my_ducklake;
CREATE TABLE my_ducklake.people (id BIGINT, name VARCHAR);
COPY (SELECT 42 AS id, 'Mark' AS name) TO 'people.parquet';
CALL ducklake_add_data_files('my_ducklake', 'people', 'people.parquet');
FROM my_ducklake.people;
┌───────┬─────────┐
│ id │ name │
│ int64 │ varchar │
├───────┼─────────┤
│ 42 │ Mark │
└───────┴─────────┘
Every file that is added to DuckLake has an optional mapping_id
, that tells the system the mapping of column name to field id. As this is done on a per-file basis, Parquet files can always be added to DuckLake without restriction. All DuckLake operations are supported on added files, including schema evolution and data change feeds.
Settings
DuckLake v0.2 adds support for scoped settings. The settings stored in the ducklake_metadata
table now have an optional scope field, together with a scope_id
. This allows settings to be scoped at either the schema or table level – instead of requiring all settings to be scoped globally.
Partition Transforms
This release adds support for the year/month/day/hour partition transforms, allowing these functions to be used to partition data directly instead of having to create separate columns containing these fields.
Migration Guide
You can migrate from a DuckLake that runs v0.1 to v0.2 using the following SQL queries.
The DuckDB ducklake
extension automatically runs the following SQL queries to perform the migration:
ALTER TABLE ducklake_schema ADD COLUMN path VARCHAR DEFAULT '';
ALTER TABLE ducklake_schema ADD COLUMN path_is_relative BOOLEAN DEFAULT true;
ALTER TABLE ducklake_table ADD COLUMN path VARCHAR DEFAULT '';
ALTER TABLE ducklake_table ADD COLUMN path_is_relative BOOLEAN DEFAULT true;
ALTER TABLE ducklake_metadata ADD COLUMN scope VARCHAR;
ALTER TABLE ducklake_metadata ADD COLUMN scope_id BIGINT;
ALTER TABLE ducklake_data_file ADD COLUMN mapping_id BIGINT;
CREATE TABLE ducklake_column_mapping (mapping_id BIGINT, table_id BIGINT, type VARCHAR);
CREATE TABLE ducklake_name_mapping (mapping_id BIGINT, column_id BIGINT, source_name VARCHAR, target_field_id BIGINT, parent_column BIGINT);
UPDATE ducklake_partition_column
SET column_id = (SELECT list(column_id ORDER BY column_order)
FROM ducklake_column
WHERE table_id = ducklake_partition_column.table_id AND parent_column IS NULL AND end_snapshot IS NULL)[ducklake_partition_column.column_id + 1];
UPDATE ducklake_metadata
SET value = '0.2'
WHERE key = 'version';