Library created by Data In Action
##### ###### ##### ###### ### ### #### #####
## ## ## ## ## ## ## ## ## ## ## ## ##
## ## ## ## ## ## ## ## ## ## ##
## ## ## ##### ## ## ## ## ## ## #####
## ## ## ## ## ## ## ## ## ## ##
## ## ## ## ## ## ## ## ## ## ## ## ##
##### ## ##### ## ### ### ####### #####
- Introduction
- How to use dtstools
- How to use tableSize
- How to use tableMaintenance
- How to use LastMaintenance
- Future implementations
- Notes
- References
version | date | description |
---|---|---|
v0.0.4 |
2023-06-17 | Basic features |
v0.0.9 |
2023-07-06 | Introduce lastMaintenance function |
This package aims to provide functionality to work with Delta Lake.
Facilitating the visualization of the actual size (Storage) of your Delta tables and their maintenance (Vacuum and Optimize)
Below are the steps performed in order of execution:
- Executed a describe detail to get the current location and size
- A scan (dbutils.fs.ls) is performed on the Storage folders recursively to calculate the space used in the Storage, excluding the _delta_logs, with this we can calculate an average of how much can be released with Vacuum
- Returns a Dataframet
First install the package via PyPi
pip install --upgrade dtstools
Import the dtsTable module into your context
from dtstools import dtsTable
Use the function dtsTable.Help() to see examples of function usage and a summary of each function.
dtsTable.Help()
Find out the true size of your table
Call the tableSize function passing the database and table name, use display() to see the results.
This function returns a Dataframe.
dtsTable.tableSize(databaseName,tableName).display()
Save the result in a Delta Table for monitoring and baseline
dtsTable.tableSize(databaseName,tableName) \
.write.format('delta') \
.mode('append') \
.saveAsTable("db_demo.tableSize",path='abfss://[email protected]/bronze/tableSize')
Get the size of all tables in your database
for tb in spark.sql(f"show tables from db_demo").collect():
try:
print(">> Collecting data... Table:",tb.tableName)
dtsTable.tableSize(tb.database,tb.tableName) \
.write.format('delta') \
.mode('append') \
.saveAsTable("db_demo.tableSize",path='abfss://[email protected]/bronze/tableSize')
except Exception as e:
print (f"###### Error to load tableName {tb.tableName} - {e}######")
Apply maintenance to a table, Optimize and Vacuum.
Parameter | Description | Type |
---|---|---|
schemaName | Name of the database where the table is created | string |
tableName | Name of the table that will be applied in the maintenance | string |
vacuum | True: Vacuum will run, False: Ignore vacuum | bool |
optimize | True: OPTIMIZE will be executed, False: Skip OPTIMIZE | bool |
zorderColumns | If informed and optimize is equal to True, Zorder is applied to the list of columns separated by a comma (,) | string |
vacuumRetention | Number of hours to hold after vacuum runs | whole |
Debug | Just print the result on screen | bool |
Apply in a single table.
dtsTable.tableMaintenance(schemaName="Database", tableName="tableName", zorderColumns='none', vacuumRetention=168, vacuum=True, optimize=True, debug=False)
Apply maintenance to all tables for YOUR database
for tb in spark.sql(f"show tables from db_demo").collect():
dtsTable.tableMaintenance(schemaName=tb.database, tableName=tb.tableName, zorderColumns='none', vacuumRetention=168, vacuum=True, optimize=True, debug=False)
Use the function dtsTable.LastMaintenance() to see summary vacuum result and optimize operations.
dtsTable.lastMaintenance('database','tableName')
- Use Unity Catalog
- Run for all databases
- Minimize costs with dbutils.fs.ls by looking directly into the transaction log
- For partitioned tables with many partitions, the execution time can take longer, so monitor the first executions well, the use is at your responsibility, despite not having any risk mapped so far, it can only generate more transactions for your storage
- Cost of Azure Storage Transactions: Read Operations (per 10,000) - $0.0258 (two cents per 10,000 operations) (Estimated price as of 4/21/2023)
https://github.com/reginaldosilva27/dtstools
Author: Reginaldo Silva