Delta Lake 2.0.0
We are excited to announce the release of Delta Lake 2.0.0 on Apache Spark 3.2.
- Quick start guide for trying Delta Lake 2.0.0: https://docs.delta.io/2.0.0/quick-start.html
- Documentation: https://docs.delta.io/2.0.0/index.html
- Maven artifacts. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Python artifacts: https://pypi.org/project/delta-spark/2.0.0/
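As a rough sketch of how these artifacts might be wired into a PySpark session: the helper names `delta_spark_conf` and `build_session` below are hypothetical, while the Maven coordinate `io.delta:delta-core_<scala>:2.0.0` and the two Spark settings follow the quick start guide. The sketch assumes `pyspark` 3.2.x is installed, and it is only one of several ways to set things up.

```python
DELTA_VERSION = "2.0.0"


def delta_spark_conf(scala_version: str = "2.12") -> dict:
    """Return the Spark settings that enable Delta Lake (hypothetical helper)."""
    if scala_version not in ("2.12", "2.13"):
        raise ValueError("Delta Lake 2.0.0 artifacts exist for Scala 2.12 and 2.13 only")
    return {
        # Maven artifact pulled in at session start-up
        "spark.jars.packages": f"io.delta:delta-core_{scala_version}:{DELTA_VERSION}",
        # Enables Delta SQL commands
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        # Routes catalog operations through Delta
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }


def build_session(app_name: str = "delta-demo", scala_version: str = "2.12"):
    """Create a SparkSession with the settings above (needs pyspark installed)."""
    from pyspark.sql import SparkSession  # imported lazily so the helper above works without pyspark

    builder = SparkSession.builder.appName(app_name)
    for key, value in delta_spark_conf(scala_version).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


conf = delta_spark_conf("2.13")
```

Calling `build_session()` downloads the artifact on first use, so it needs network access to a Maven repository.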
The key features in this release are as follows.
- Support Change Data Feed on Delta tables. Change Data Feed represents the row-level changes between different versions of the table. When enabled, additional information is recorded regarding row-level changes for every write operation on the table. See the documentation for more details.
- Support Z-Order clustering of data to reduce the amount of data read. Z-Ordering is a technique to colocate related information in the same set of files. This data clustering allows column stats (released in Delta 1.2) to be more effective in skipping data based on filters in a query. See the documentation for more details.
- Support for idempotent writes to Delta tables to enable fault-tolerant retry of Delta table writing jobs without writing the data multiple times to the table. See the documentation for more details.
- Support for dropping columns in a Delta table as a metadata change operation. This command drops the column from the table metadata but does not remove the column data from the underlying files. See the documentation for more details.
- Support for dynamic partition overwrite. Overwrite only the partitions with data written into them at runtime. See the documentation for more details.
- Experimental support for multi-part checkpoints to split the Delta Lake checkpoint into multiple parts to speed up writing and reading checkpoints. See the documentation for more details.
- Python and Scala API support for OPTIMIZE file compaction and Z-order by.
- Other notable changes
  - Improve data skipping for generated columns by adding support for skipping on generated columns defined on nested columns.
  - Improve table schema validation by blocking unsupported data types in Delta Lake.
  - Support creating a Delta Lake table with an empty schema.
  - Change the behavior of DROP CONSTRAINT to throw an error when the constraint does not exist. Previously, the command returned silently.
  - Fix the symlink manifest generation when partition values contain spaces.
  - Fix an issue where incorrect commit stats are collected.
  - Support for `SimpleAWSCredentialsProvider` or `TemporaryAWSCredentialsProvider` in the S3 multi-cluster write supported `LogStore`.
  - Fix an issue in generated columns that would not allow null columns in the insert `DataFrame` to be written even if the column was nullable.
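To illustrate why Z-Order clustering helps column statistics skip data, here is a toy model in plain Python. It is not Delta Lake's implementation: the function names, the 4x4 grid of rows, and the fixed "file" size are all assumptions made for the example. It sorts two-column rows by an interleaved-bit (Morton) key, cuts them into files, and uses per-file min/max values to decide which files a filter must read.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order (Morton) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # bits of x go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)   # bits of y go to odd positions
    return key


def z_order_files(rows, rows_per_file):
    """Sort (x, y) rows by Z-order key and cut them into fixed-size 'files'."""
    ordered = sorted(rows, key=lambda r: interleave_bits(r[0], r[1]))
    return [ordered[i:i + rows_per_file] for i in range(0, len(ordered), rows_per_file)]


def file_stats(files):
    """Per-file (min, max) for each column, standing in for column statistics."""
    return [
        {
            "x": (min(r[0] for r in f), max(r[0] for r in f)),
            "y": (min(r[1] for r in f), max(r[1] for r in f)),
        }
        for f in files
    ]


def files_to_read(stats, column, value):
    """Indices of files whose min/max range could contain `value`."""
    return [i for i, s in enumerate(stats) if s[column][0] <= value <= s[column][1]]


rows = [(x, y) for x in range(4) for y in range(4)]
files = z_order_files(rows, rows_per_file=4)
stats = file_stats(files)

print(files_to_read(stats, "x", 0))  # files needed for a filter x = 0
print(files_to_read(stats, "y", 3))  # files needed for a filter y = 3
```

In this toy layout a filter on either column reads two of the four files. Sorting by `x` alone would read one file for a filter on `x` but all four for a filter on `y`, which is the trade-off Z-Ordering is meant to balance.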
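The idempotent-write feature listed above can be pictured with a small in-memory model. This is a sketch under assumptions, not Delta Lake code: `ToyTable` and its fields are hypothetical, and the rule it applies (skip a write whose version is not higher than the last one recorded for the same application id) is a simplified reading of the documented behaviour of the `txnAppId` and `txnVersion` writer options.

```python
class ToyTable:
    """In-memory stand-in for a table that remembers the last version per writer."""

    def __init__(self):
        self.rows = []
        self.last_txn_version = {}  # application id -> highest committed version

    def write(self, rows, txn_app_id, txn_version):
        """Append rows unless this (app id, version) pair was already applied.

        Returns True if the rows were written and False if the write was skipped.
        """
        last = self.last_txn_version.get(txn_app_id, -1)
        if txn_version <= last:
            return False  # a retry of work that already committed; ignore it
        self.rows.extend(rows)
        self.last_txn_version[txn_app_id] = txn_version
        return True


table = ToyTable()
first = table.write(["a", "b"], txn_app_id="nightly-job", txn_version=1)
retry = table.write(["a", "b"], txn_app_id="nightly-job", txn_version=1)  # job restarted
later = table.write(["c"], txn_app_id="nightly-job", txn_version=2)
```

With a real Delta table the same pair of values would be passed as DataFrameWriter options, for example `.option("txnAppId", ...)` and `.option("txnVersion", ...)`, so that a restarted job can replay its writes safely.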
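Dynamic partition overwrite can likewise be sketched without Spark. The model below is illustrative only: the dictionary-of-partitions table and the `overwrite` helper are hypothetical, and the "static" branch assumes a plain overwrite with no partition predicate, which replaces the whole table.

```python
def group_by_partition(rows, partition_col):
    """Group row dictionaries by the value of their partition column."""
    groups = {}
    for row in rows:
        groups.setdefault(row[partition_col], []).append(row)
    return groups


def overwrite(table, new_rows, partition_col, mode="static"):
    """Return the table contents after an overwrite in the given mode.

    `table` maps partition value -> list of rows and is not modified.
    """
    incoming = group_by_partition(new_rows, partition_col)
    if mode == "static":
        # Everything is replaced by the incoming data.
        return incoming
    if mode == "dynamic":
        # Only partitions that receive new rows are replaced; the rest are kept.
        result = {part: list(rows) for part, rows in table.items()}
        result.update(incoming)
        return result
    raise ValueError(f"unknown mode: {mode}")


table = group_by_partition(
    [{"date": "2022-01-01", "v": 1}, {"date": "2022-01-02", "v": 2}], "date"
)
new_rows = [{"date": "2022-01-02", "v": 20}]

static_result = overwrite(table, new_rows, "date", mode="static")
dynamic_result = overwrite(table, new_rows, "date", mode="dynamic")
```

In Delta Lake the dynamic behaviour is selected with the DataFrameWriter option `partitionOverwriteMode` set to `dynamic`, or with the session setting `spark.sql.sources.partitionOverwriteMode`.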
Benchmark Framework Update
Independent of this release, we have improved the framework for writing large-scale performance benchmarks (initial version added in version 1.2.0). We have added support for running benchmarks on Google Cloud Platform using Google Dataproc (in addition to the existing support for EMR on AWS).
Credits
Adam Binford, Alkis Evlogimenos, Allison Portis, Ankur Dave, Bingkun Pan, Burak Yilmaz, Chang Yong Lik, Chen Qingzhi, Denny Lee, Eric Chang, Felipe Pessoto, Fred Liu, Fu Chen, Gaurav Rupnar, Grzegorz Kołakowski, Hussein Nagree, Jacek Laskowski, Jackie Zhang, Jiaan Geng, Jintao Shen, Jintian Liang, John O'Dwyer, Junyong Lee, Kam Cheung Ting, Karen Feng, Koert Kuipers, Lars Kroll, Liwen Sun, Lukas Rupprecht, Max Gekk, Michael Mengarelli, Min Yang, Naga Raju Bhanoori, Nick Grigoriev, Nick Karpov, Ole Sasse, Patrick Grandjean, Peng Zhong, Prakhar Jain, Rahul Shivu Mahadev, Rajesh Parangi, Ruslan Dautkhanov, Sabir Akhadov, Scott Sandre, Serge Rielau, Shixiong Zhu, Shoumik Palkar, Tathagata Das, Terry Kim, Tyson Condie, Venki Korukanti, Vini Jaiswal, Wenchen Fan, Xinyi, Yijia Cui, Yousry Mohamed