DuckLake v1.0
TL;DR: We are happy to release DuckLake v1.0, a production-ready lakehouse format specification. Its reference implementation, the ducklake DuckDB extension, is available as of today in DuckDB v1.5.2.
In May 2025, we published the DuckLake manifesto, where we explained what motivated us to work on DuckLake. Here's a quick recap: we basically outlined how, in our view, it makes much more sense to store all metadata of a lakehouse in a database rather than in scattered files in object storage. This is why we created DuckLake.
Tip The manifesto is much more compelling, we recommend you read it!
Today, we are happy to announce DuckLake v1.0, almost a year after we released our first sketch of the specification. This is a production-ready release with guaranteed backward-compatibility. DuckLake v1.0 ships a stable specification, a feature-rich and fast reference implementation (the DuckDB ducklake extension), as well as a roadmap for future development.
DuckLake at a Glance
If you're new here, you may be wondering: what is DuckLake? DuckLake is a lakehouse format, which allows you to store your data on object storage and access it as a database, similarly to Delta Lake with Unity Catalog and Iceberg with Lakekeeper.
A key difference between other formats and DuckLake is that DuckLake stores all metadata in a database, which is commonly referred to as the catalog. This database can be any system that speaks SQL, supports primary keys and is able to persist data in tables. The DuckLake specification defines the metadata tables needed for a specific version along with the supported data types and a reference on how to retrieve table metadata to perform operations on the lakehouse.
We also developed an implementation of the specification using DuckDB as the engine, the DuckDB ducklake extension. This extension implements all of the specification and supports three main catalogs: SQLite, PostgreSQL and DuckDB (yes, DuckDB can be a catalog too).
Over the last year, we kept working on the DuckLake specification internally and with the community. We enabled adding existing Parquet files without creating a deep copy, introduced Iceberg compatibility, and rolled out support for the geometry and the variant types. We also released several guides and migration scripts from DuckDB to DuckLake.
DuckLake unlocks several interesting use cases. Using the inlining feature (using the catalog database to stage small updates) allows you to stream data into the data lake. Using its lightweight setup, you only need some storage and a public HTTPS endpoint to serve a read-only lakehouse without authentication. And in the context of DuckDB, it unlocks a multiplayer setup where multiple DuckDB instances can access the same DuckLake while coordinating through a central PostgreSQL catalog database.
Adoption
We are delighted to see how quickly the community adopted DuckLake. The ducklake DuckDB extension is now ranked among DuckDB's top-10 core extensions based on the download statistics.
There are now DuckLake clients for Apache DataFusion (by Hotdata), Apache Spark (by MotherDuck) (by Andrew Witten), Trino and Pandas dataframe (mostly produced by agentic coding). MotherDuck also offers a hosted DuckLake service, where they manage both the catalog database and the data storage for you.
We are very proud that DuckLake is already used in production at dozens of companies – see the collection of logos on our new landing page. And there is even an O'Reilly DuckLake book in the making!

For more DuckLake use cases and libraries, see the awesome-ducklake repository.
What's New in DuckLake v1.0
First and foremost, DuckLake ships dozens of bugfixes to make the format robust for production deployments (see the appendix). We are particularly delighted that many contributions were made by community members and we would like to thank them for their work!
In the following, we show some key features of DuckLake v1.0 – available in the DuckDB v1.5.2 that was also released today.
Data Inlining
Data inlining is one of the flagship features of DuckLake. It basically enables performing small insert, delete and update operations in the catalog database, avoiding the proliferation of “the small file problem”. DuckLake v1.0 brings full inlining of updates and deletes. This feature is now on by default with a default threshold of 10 rows. If you want to find out more about this feature, head to this blog we published recently. In this example, we do an insert, delete and update and showcase how this does not create any new files in DuckLake. We then CHECKPOINT to flush the inlined data to object storage.
CREATE TABLE lake.t (id INT, status VARCHAR);
INSERT INTO lake.t VALUES (1, 'en route'), (2, 'shipped');
DELETE FROM lake.t WHERE id = 1;
UPDATE lake.t SET status = 'delivered' WHERE id = 2;
FROM ducklake_list_files('lake', 't'); -- returns empty
CHECKPOINT; -- flushes data
Sorted Tables
If you are often going to run queries against a high cardinality column, like an id or a timestamp, sorting is a great way to increase read query performance. Both row group and file pruning will be benefited for queries that push filters on the sorted keys. Sorted tables support column ordering but also arbitrary SQL expressions. The default approach will sort tables on compaction, flushing or insertion, but the latter one can be disabled to avoid hindering write performance.
CREATE TABLE lake.sorted_t (id INT, payload JSON);
ALTER TABLE lake.sorted_t SET SORTED BY (id ASC);
INSERT INTO lake.sorted_t
VALUES
(33, {'key': 'value'}),
(2, {'key': 'value'}),
(42, {'key': 'value'}),
(1, {'key': 'value'});
CHECKPOINT;
FROM lake.sorted_t;
Bucket Partitioning
Bucketing works by hashing the value of the target column and using the modulo, based on the number of buckets, to assign this value to a specific bucket. The bucket partitioning functionality uses the murmur3 hash implementation for full compatibility with Iceberg. If you want some of the benefits of partitioning but your column has a high cardinality, bucketing is a good alternative.
CALL lake.set_option('data_inlining_row_limit', 0);
CREATE TABLE lake.events (
user_name VARCHAR, event_type VARCHAR, ts TIMESTAMP);
ALTER TABLE lake.events SET PARTITIONED BY (bucket(8, user_name));
INSERT INTO lake.events VALUES
('alice', 'click', '2024-01-01'),
('bob', 'view', '2024-01-01'),
('charlie', 'click', '2024-01-02');
EXPLAIN ANALYZE FROM lake.events WHERE user_name = 'alice';
Type System: Geometry
Following up on the addition of the GEOMETRY data type to DuckDB core, better geo stats have been enabled that allow for filter pushdown. GEOMETRY types can also now be nested in structs, lists and maps. In the following example, we showcase how a simple filter pushdown would work by using the && operator, which checks overlapping bounding boxes of two geometries.
LOAD spatial;
CALL lake.set_option('data_inlining_row_limit', 0);
CREATE TABLE lake.places (name VARCHAR, location GEOMETRY);
INSERT INTO lake.places VALUES ('Amsterdam', ST_Point(4.9, 52.37));
INSERT INTO lake.places VALUES ('London', ST_Point(-0.12, 51.51));
SELECT name
FROM lake.places
WHERE location
&& ST_GeomFromText('POLYGON((4 52, 5 52, 5 53, 4 53, 4 52))');
Type System: Variant
Variant is a similar type to JSON but with some very distinctive qualities that make it a preferable alternative, namely: (i) it has support for many more types than JSON, for example DATE or TIMESTAMP; (ii) it is stored in a binary encoded format rather than a string; (iii) it can be shredded to primitive types, which allows for much better query performance (including filter and projection pushdown). We believe VARIANT will eventually replace JSON as the main type for semistructured data in database systems.
CREATE TABLE lake.events (id INT, payload VARIANT);
INSERT INTO lake.events VALUES
(1, {'user': 'alice', 'ts': TIMESTAMP '2024-01-01'}),
(2, {'user': 'bob', 'ts': TIMESTAMP '2024-01-02', 'rand': 'value'});
SELECT *
FROM lake.events
WHERE payload.user = 'bob';
Deletion Vectors
Deletion vectors were introduced in the Iceberg v3 specification. While developing DuckLake we are also striving to keep compatibility with Iceberg at the data level. We therefore have implemented deletion vectors stored as Puffin files. These deletion vectors work in a similar way to delete files (which were already a part of the Iceberg specification).
This feature is experimental and we are planning some improvements for the upcoming DuckLake releases.
CREATE TABLE lake.t (id INTEGER);
CALL lake.set_option('write_deletion_vectors', true, table_name => 't');
INSERT INTO lake.t FROM range(100);
DELETE FROM lake.t WHERE id < 5;
The Future of DuckLake
Even though today we release DuckLake v1.0, the job is not yet done. We will continue releasing new DuckLake spec versions, and here we outline the current plans for DuckLake v1.1 and possible directions for DuckLake v2.0.
DuckLake v1.1
There are two main features planned for DuckLake v1.1, which will require spec changes:
-
Variant Inlining. Currently, variant inlining only works for catalogs that natively support it (e.g., DuckDB). However, if the catalog does not support the Variant type, the table will never be inlined. We will design and implement the necessary specification changes that will allow this.
-
Multi-Deletion Vector Puffin Files. DuckLake v1.0 already supports single deletion vector puffin files, but we want to ensure that, similar to our current partial deletion files, DuckLake can store and read multiple deletion vectors in one file, thereby preserving time travel information while still minimizing the small file problem.
DuckLake v2.0
DuckLake v2.0 is most likely not coming anytime soon, as we will focus on maturing our current feature set and guaranteeing the stability of our spec. Although we have a rough sense of the problems we want to tackle, community needs will also strongly drive our plans, so features may be added or reprioritized over time. Below is a list of the features we currently believe are important for DuckLake v2.0:
- Git-like branching. Allowing users to create branches of their DuckLakes and merge changes between them. Similar to what Git is for code, but for data.
- Permission-based roles. Defining roles and permissions to control access to DuckLake objects and operations. This is already possible by configuring Postgres and your S3 bucket correctly, but we want to make this easier to manage directly through DuckLake.
- Incremental materialized views. Supporting materialized views that can be refreshed incrementally by applying tracked changes in DuckLake tables, rather than recomputing the full view.
Conclusion
Today we released DuckLake v1.0, a stable specification for DuckLake, alongside the DuckDB ducklake extension compatible with this specification version. We are very happy with the progress that DuckLake has made in just one year, with some great community adoption and a great foundation for future improvements. If you want to hear more from the developers of DuckLake, make sure to check out our podcast episode on DuckLake v1.0:
Appendix
This is all of the work that has gone into DuckLake since the v1.0 target was set; both from the community and the DuckDB Labs team. In this appendix we are listing 108 PRs that have been merged since the end of 2025. We have classified them so as to convey that the focus has always been making DuckLake stable and production ready, with 68 PRs merged focused on reliability and correctness, 12 focused on internal refactoring to lay the foundation for future developments and another 12 focused on performance improvements.
See the details of the release.
User-Facing Features
| PR | Title | Description |
|---|---|---|
| #642 (@Alex-Monahan) | Ordered compaction and inlining | Data can now be sorted during compaction and when flushing inlined data using SET SORTED BY with column names |
| #743 (@Alex-Monahan) | Sort expressions for compaction | SET SORTED BY now accepts arbitrary SQL expressions, not just column names; enables space-filling curve sorting for geo workloads |
| #780 (@Alex-Monahan) | Sorted inserts | Inserts are automatically sorted when a table has SET SORTED BY; can be disabled with sort_on_insert |
| #697 (@pdet) | Automatic migration opt-in (AUTOMATIC_MIGRATION) |
Attaching a newer DuckLake extension version no longer auto-migrates the catalog; migration is now explicit |
| #724 (@pdet) | Settings function (ducklake_settings()) |
New table function returning catalog DBMS type, extension version, and data path |
| #923 (@pdet) | DATA_INLINING_ROW_LIMIT settable after attach |
Inline row limit can be changed after attach and takes effect immediately |
| #775 (@pdet) | Data inlining on by default | Small inserts (≤10 rows) stored inline in the catalog without configuration; includes full ALTER TABLE support for inlined tables |
| #737 (@pdet) | Deletion inlining | Small deletes (≤ DATA_INLINING_ROW_LIMIT) stored in catalog metadata instead of creating Parquet delete files |
| #877 (@pdet) | Inline insertions in updates | UPDATE uses the inlining path when updated row count is within DATA_INLINING_ROW_LIMIT; small updates no longer produce Parquet files |
| #734 (@rgernhardt) | Maintenance functions return results | ducklake_flush_inlined_data, ducklake_merge_adjacent_files, and ducklake_rewrite_data_files now return result rows |
| #676 (@Costa-SM) | Bucket transform table partitioning | Iceberg-compatible bucket(N, column) transform for partition pruning on high-cardinality columns; combinable with other partition transforms |
| #750 (@Mytherin) | VARIANT type support | DuckLake tables support VARIANT; shredded variant sub-fields get file statistics enabling file-skipping |
| #770 (@Maxxen) | Geometry type improvements | GEOMETRY columns now store bounding-box stats per file; can be nested inside structs/lists/maps; supported in data inlining. Spatial file-level pruning is not yet implemented (TODO) |
| #878 (@rgernhardt) | Metadata query logging | New DuckLakeMetadata log type records every metadata query with catalog, SQL text, and elapsed time |
| #910 (@pdet) | Iceberg V3 deletion vectors (experimental) | Experimental read/write support for deletion vectors (roaring bitmaps in puffin files) as alternative to positional Parquet delete files |
Performance Improvements
| PR | Title | Impact |
|---|---|---|
| #732 (@rgernhardt) | COUNT(*) from metadata |
COUNT(*) served from file row count stats instead of scanning Parquet files; 8×–258× speedup on S3-backed tables |
| #668 (@redox) | Top-N dynamic filter file pruning | Uses min/max column statistics for more aggressive file skipping during query execution |
| #655 (@utay) | Bulk delete API for orphaned file cleanup | More efficient cleanup of orphaned files via RemoveFiles bulk delete |
| #708 (@pdet) | Remove partial_file_info, replace with partial_max |
Simpler schema for deletion tracking; reduces metadata size |
| #801 (@suresh-summation) | Avoid parse-AST-serialize roundtrip in DuckLakeViewEntry::ToSQL() |
~70× faster duckdb_views() queries |
| #806 (@thalassemia) | More efficient AppendFiles |
Avoids copying file vectors during bulk insert; reduces peak memory |
| #807 (@thalassemia) | Use Appender API for DuckDB metadata writes | Dramatically faster bulk ducklake_add_data_files; lower memory for large imports |
| #808 (@thalassemia) | Eager/streaming metadata processing in add_data_files |
Reduces peak memory for bulk imports with many files |
| #829 (@jprafael) | Guard lock on all table_data_changes access |
Prevents data races in concurrent write workloads |
| #868 (@rgernhardt) | Cache GetInlinedDeletionTableName existence |
Avoids extra metadata round-trip per query; critical for remote Postgres catalogs |
| #870 (@jcolot) | Replace IN subqueries with LEFT JOIN in table_changes |
Makes join strategy explicit; improves query plan predictability |
| #904 (@rgernhardt) | Don't scan duckdb_tables() during Initialize |
Faster attach; prevents failures caused by unrelated corrupted catalogs |
Reliability & Correctness
Concurrency & Transaction Safety
| PR | Title | Impact |
|---|---|---|
| #846 (@qsliu2017) | Fix inlined file deletes lost on concurrent commit retry | std::move emptied delete map on first attempt; retries silently lost deletions |
| #847 (@pdet) | Fix BeginSnapshot for tables with different schema versions |
Incorrect/errored reads across schema version boundaries |
| #897 (@pdet) | Copy delete data to avoid snapshot_id corruption | Shared snapshot_id data mutated in-place during delete merges |
| #901 (@pdet) | Make inline deletion flush retriable on conflict | Inlined data flush with concurrent writers could fail non-retriably |
| #906 (@hello-world-bfree) | Detect concurrent insert+delete on same table | Silent duplicate rows in concurrent upsert workloads (MERGE INTO, delete-then-insert) |
| #920 (@pdet) | Deterministic updates with duplicate row IDs | Non-deterministic update behavior when same row ID appeared multiple times in inlined data |
| #922 (@pdet) | Fix local transaction deletion visibility | Deleted rows still visible within the same uncommitted transaction |
Postgres & SQLite Catalog Backend
Fixes for non-DuckDB catalog backends. See also related refactoring (#748, #753, #754, #757) and performance (#868) work in their respective sections.
| PR | Title | Impact |
|---|---|---|
| #849 (@pdet) | Reinterpret BYTEA to Varchar for Postgres inlined data | Varchar stored as BYTEA in Postgres not correctly read back as VARCHAR |
| #903 (@wideltann) | Escape single quotes in variant stats TrySerialize |
SQL syntax error for variant stats when string values contained single quotes (Postgres catalogs) |
Partitioning
| PR | Title | Impact |
|---|---|---|
| #718 (@pdet) | One compaction per partition group for rewrite | ducklake_rewrite_data_files generated wrong compaction plans for partitioned tables |
| #723 (@pdet) | Add projection list with partitions to merge insert | INSERT INTO ... SELECT with MERGE missing columns on partitioned tables |
| #735 (@gijshendriksen) | Fix add_data_files with multiple hive-partitioned columns |
Failed with >1 hive partition column due to column ordering inconsistency |
| #764 (@pdet) | Include identity columns with partitions | Identity/sequence columns omitted from partitioned table results |
| #811 (@thalassemia) | Implicit cast + null for hive partition keys on Parquet import | add_data_files failed on non-VARCHAR partition columns; NULL partition keys unsupported |
| #812 (@Mytherin) | Correctly handle NULL values in partitions | Internal error thrown when inserting NULL into a partition column |
| #850 (@pdet) | Value comparison for boolean partition filters | Boolean partition filters used pointer comparison instead of value comparison |
| #856 (@pdet) | merge_adjacent max option is global, not per-partition |
max_file_count was evaluated per-partition instead of globally |
| #919 (@pdet) | Fix inlined data flush over partitioned data | Inline delete flush not partition-aware; incorrect deletions on partitioned tables |
| #798 (@pdet) | Skip hive partitioned columns from parquet files | Hive partition columns incorrectly read from Parquet data files |
| #992 (@dentiny) | Fix duplicate partition key | Duplicate partition keys caused crash in copy-to-file operator |
General
| PR | Title | Impact |
|---|---|---|
| #675 (@Costa-SM) | Fix integer underflow in value_count computation for nested columns | Incorrect statistics for nested column value counts |
| #682 (@pdet) | Recursively normalize LIST child names from legacy formats | Compatibility with legacy Avro/Parquet formats |
| #689 (@qsliu2017) | Implicit conversion unique_ptr in return | Build/type compatibility fix |
| #710 (@pdet) | Fix change feed scan on compacted insertion files | Incorrect ducklake_table_changes() results on compacted tables |
| #711 (@pdet) | Order deletions required for delete files | Positional delete files were incorrectly ordered |
| #714 (@pdet) | Avoid extra column when doing type + extra change in same TX | Extra columns added when combining column type change with other changes |
| #715 (@pdet) | Avoid double-adding inlined data stats | Stats counted twice for inlined data |
| #730 (@pdet) | Fix autoload | DuckLake autoloading from DuckDB core was broken |
| #758 (@Mytherin) | Fix deadlock in deletion inlining | Deadlock on nested DuckLake catalog metadata expansion |
| #768 (@pdet) | Fix local transaction macro deletion | Macros created in local transactions not cleaned up on rollback |
| #772 (@pdet) | Fix variant field order swap | Field ordering in variant types was incorrectly reversed |
| #822 (@KumoSiunaus) | rewrite_data_files skips large files with delete files |
Large files with delete files were silently skipped during rewrite |
| #839 (@c-herrewijn) | delete_orphaned_files only deletes Parquet files |
Could previously delete the catalog itself if it resided in the data path |
| #842 (@jprafael) | Compaction handles 0-file output | Crash when all rows in a compaction source file had been deleted |
| #845 (@pdet) | Fix table macro parsing | Table macros not parsed correctly in certain situations |
| #855 (@pdet) | Reset view binding if binding fails | Failed views got stuck in broken state; all subsequent accesses threw misleading errors |
| #858 (@pdet) | Update default_value_type in migration |
Not updated after adding a column, causing downstream errors |
| #863 (@pdet) | Cleanup separator in DATA_PATH |
Trailing separator caused incorrect path construction |
| #866 (@Flamefork) | Compaction ignores per_thread_output from lake config |
Crash during compaction when per_thread_output was enabled lake-wide |
| #872 (@pdet) | Fix schema + macro creation in same transaction | Creating schema and macro (or multiple macros) in same TX failed |
| #874 (@pdet) | Fix S3 + disabled_filesystems = 'LocalFileSystem' |
DuckLake failed on S3 when local FS was disabled |
| #896 (@pdet) | DuckLake query plan serialization | Prepared statements and plan caching broken |
| #900 (@pdet) | Generic catalog for view attach | Attaching DuckLake with views in non-default catalog failed |
| #914 (@pdet) | Disallow reserved table names | Users could accidentally create tables conflicting with DuckLake internal names |
| #925 (@pdet) | Fix missing mapping file when updating external file | Updates on external Parquet-backed tables failed due to missing row-ID mapping file |
| #926 (@pdet) | Preserve original error at loading | Original attach/load error was swallowed and replaced with generic error |
| #765 (@timbess) | Fix filter pushdown regression | Late materialization in optimizer bypassed DuckLake file-skipping filters |
| #799 (@pdet) | Fixup error messages for migration | Confusing errors when catalog version mismatched |
| #939 (@dentiny) | Fix rename failure exception message | Incorrect error message on rename failure |
| #941 (@dentiny) | Fix SET NOT NULL error message | Incorrect error message for SET NOT NULL |
| #943 (@dentiny) | Fix schema creation exception | Schema creation threw wrong exception type |
| #945 (@dentiny) | Disallow column name to be internal name | Users could create columns conflicting with DuckLake internal column names |
| #958 (@dentiny) | Fix drop table in single transaction | Dropping a table within the same transaction it was created failed |
| #960 (@dentiny) | Fix dropping schema with table macros | Dropping a schema containing table macros failed |
| #962 (@dentiny) | Fix repeated column update in one transaction | Multiple column updates in a single transaction produced incorrect results |
| #965 (@dentiny) | Check transaction-local entities before drop | Dropping schema succeeded even when it contained transaction-local tables |
| #968 (@dentiny) | Fix SQL string replacement for view | View catalog name placeholder replacement incorrect in certain cases |
| #972 (@pdet) | Fixes for deletions | Missing directories for deletion files; broken name mapping for files without field_id; incorrect delete position during sorted inlining |
| #974 (@dentiny) | Fix missing assignment for sort data | Sort data not propagated due to missing assignment in constructor |
| #976 (@dentiny) | Delete deletion files for local table change | Deletion files not cleaned up for transaction-local table changes |
| #978 (@pdet) | Attach or replace fix | ATTACH OR REPLACE not working correctly |
| #979 (@pdet) | Fix migration issue of range tables from v0.3 to v0.4 | Incorrect begin_snapshot for schema versions after migration |
| #981 (@dentiny) | Fix multiple comments set in single transaction | Setting multiple comments in one transaction overwrote earlier ones |
| #985 (@dentiny) | Fix macros cleanup on snapshot expiration | Macro-related tables not cleaned up during snapshot expiration |
| #987 (@dentiny) | Cleanup variant stats table on compaction and snapshot expiration | Variant stats table rows not cleaned up during compaction/expiration |
| #989 (@dentiny) | Fix set tag SQL | Setting a tag overwrote all key-value pairs for same table/column instead of just the matching key |
| #994 (@dentiny) | Fix drop schema after no dependents | Drop schema incorrectly checked dropped tables instead of dropped views |
Internal Refactoring & Compatibility
| PR | Title | Rationale |
|---|---|---|
| #657 (@pdet) | Implement new checker for expressions as column defaults | Expression validation for column defaults |
| #671 (@pdet) | Change schema tracking from global to table | Per-table schema isolation; cleaner catalog state management |
| #674 (@redox) | Custom MultiFileReader for delete files | Improved efficiency in reading deletion information |
| #696 (@pdet) | Partial deletion files and flushing | Supports partial deletion files; enables efficient flushing of accumulated deletion data |
| #742 (@qsliu2017) | Metadata manager register refactor | Cleaner plugin architecture for multiple catalog backends |
| #748 (@qsliu2017) | Refactor list_agg in queries |
Postgres/SQLite compatibility in metadata queries |
| #753 (@qsliu2017) | Inline data handling: decouple read vs. transform | Backend-specific type conversion; enables Postgres/SQLite inlining |
| #754 (@qsliu2017) | Replace MAX_BY with DISTINCT ON ... ORDER BY |
Postgres compatibility in catalog queries |
| #757 (@qsliu2017) | Column aliases in internal queries | Prevents ambiguity in complex metadata joins |
| #786 (@rgernhardt) | Default catalog_type to "duckdb" in settings |
Consistent reporting across catalog backends |
| #803 (@pdet) | Storage/catalog version field | Enables proper version-gated migrations |
| #827 (@Mytherin) | Remove deprecated lambda syntax in add_data_files |
Build/compatibility fix for updated DuckDB API |