DTSTOOLS: Help with your Delta Lake

Library created by Data In Action

#####    ######    #####   ######     ###      ###    ####      #####   
 ## ##     ##     ##   ##    ##      ## ##    ## ##    ##      ##   ##  
 ##  ##    ##     ##         ##     ##   ##  ##   ##   ##      ##       
 ##  ##    ##      #####     ##     ##   ##  ##   ##   ##       #####   
 ##  ##    ##          ##    ##     ##   ##  ##   ##   ##           ##  
 ## ##     ##     ##   ##    ##      ## ##    ## ##    ##      ##   ##  
#####      ##      #####     ##       ###      ###    #######   #####   

Introduction

| Version | Date | Description |
|---------|------------|-------------------------------------|
| v0.0.4  | 2023-06-17 | Basic features |
| v0.0.9  | 2023-07-06 | Introduce `lastMaintenance` function |

This package provides functionality for working with Delta Lake, making it easy to see the actual size (in storage) of your Delta tables and to run their maintenance operations (VACUUM and OPTIMIZE).

The steps below are performed in order of execution:

  1. A DESCRIBE DETAIL is executed to get the table's current location and size
  2. A recursive scan (dbutils.fs.ls) of the storage folders calculates the space actually used in storage, excluding _delta_log; from this we can estimate how much space a VACUUM could release
  3. A DataFrame with the results is returned
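The scan in step 2 can be sketched in plain Python. The real library uses dbutils.fs.ls against cloud storage; this local version using os.walk, and the function name local_table_size, are illustrative assumptions, not the library's actual code:

```python
import os

def local_table_size(table_path):
    """Sketch of the size calculation: walk the table directory
    recursively, summing file sizes, and keep the _delta_log folder
    out of the data-size total (it is tracked separately)."""
    data_bytes, log_bytes = 0, 0
    for root, _dirs, files in os.walk(table_path):
        in_log = "_delta_log" in root.split(os.sep)
        for name in files:
            size = os.path.getsize(os.path.join(root, name))
            if in_log:
                log_bytes += size
            else:
                data_bytes += size
    return {"dataSizeBytes": data_bytes, "deltaLogSizeBytes": log_bytes}
```

Comparing dataSizeBytes against the size reported by DESCRIBE DETAIL gives an estimate of how much a VACUUM could release.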

How to use dtstools

First, install the package from PyPI:

pip install --upgrade dtstools

Import the dtsTable module into your context

from dtstools import dtsTable

Use the function dtsTable.Help() to see examples of function usage and a summary of each function.

dtsTable.Help()

How to use tableSize

Find out the true size of your table

Call the tableSize function, passing the database and table name; use display() to see the results.

This function returns a DataFrame.

dtsTable.tableSize(databaseName,tableName).display()

Save the result to a Delta table for monitoring and baselining:

dtsTable.tableSize(databaseName,tableName) \
 .write.format('delta') \
 .mode('append') \
 .saveAsTable("db_demo.tableSize",path='abfss://[email protected]/bronze/tableSize')

Get the size of all tables in your database

for tb in spark.sql("show tables from db_demo").collect():
    try:
        print(">> Collecting data... Table:", tb.tableName)
        dtsTable.tableSize(tb.database, tb.tableName) \
            .write.format('delta') \
            .mode('append') \
            .saveAsTable("db_demo.tableSize", path='abfss://[email protected]/bronze/tableSize')
    except Exception as e:
        print(f"###### Error loading table {tb.tableName} - {e} ######")

How to use tableMaintenance

Apply maintenance (OPTIMIZE and VACUUM) to a table.

| Parameter | Description | Type |
|-----------|-------------|------|
| schemaName | Name of the database where the table is created | string |
| tableName | Name of the table that maintenance will be applied to | string |
| vacuum | True: run VACUUM; False: skip VACUUM | bool |
| optimize | True: run OPTIMIZE; False: skip OPTIMIZE | bool |
| zorderColumns | If set and optimize is True, ZORDER is applied to the comma-separated (,) list of columns | string |
| vacuumRetention | Number of hours of history to retain when VACUUM runs | int |
| debug | Just print the commands on screen, without executing them | bool |

Apply in a single table.

dtsTable.tableMaintenance(schemaName="Database", tableName="tableName", zorderColumns='none', vacuumRetention=168, vacuum=True, optimize=True, debug=False)
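A call like the one above plausibly boils down to standard Delta SQL statements. The sketch below builds those statement strings; the helper name build_maintenance_sql and the exact command shapes are assumptions for illustration, not the library's internals:

```python
def build_maintenance_sql(schema, table, zorder_columns="none",
                          vacuum_retention=168, vacuum=True, optimize=True):
    """Hypothetical helper: build the OPTIMIZE/VACUUM statements that a
    maintenance run over schema.table would issue."""
    full_name = f"{schema}.{table}"
    commands = []
    if optimize:
        sql = f"OPTIMIZE {full_name}"
        if zorder_columns and zorder_columns.lower() != "none":
            sql += f" ZORDER BY ({zorder_columns})"
        commands.append(sql)
    if vacuum:
        commands.append(f"VACUUM {full_name} RETAIN {vacuum_retention} HOURS")
    return commands
```

With debug=True, statements like these would only be printed; otherwise each one would be passed to spark.sql().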

Apply maintenance to all tables in your database:

for tb in spark.sql("show tables from db_demo").collect():
  dtsTable.tableMaintenance(schemaName=tb.database, tableName=tb.tableName, zorderColumns='none', vacuumRetention=168, vacuum=True, optimize=True, debug=False)

Last Maintenance

Use the function dtsTable.lastMaintenance() to see a summary of the most recent VACUUM and OPTIMIZE operations.

dtsTable.lastMaintenance('database','tableName')
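This kind of summary is the sort of thing that can be derived from the table's DESCRIBE HISTORY output. The sketch below (not the library's code) shows the idea on plain Python dicts with 'operation' and 'timestamp' keys, an assumed simplification of the history rows:

```python
def last_maintenance(history_rows):
    """Given history rows as dicts with 'operation' and 'timestamp',
    return the most recent timestamp for each maintenance operation
    (OPTIMIZE, and VACUUM END which marks a completed vacuum)."""
    latest = {}
    for row in sorted(history_rows, key=lambda r: r["timestamp"]):
        if row["operation"] in ("OPTIMIZE", "VACUUM END"):
            latest[row["operation"]] = row["timestamp"]
    return latest
```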

Future implementations

  1. Use Unity Catalog
  2. Run for all databases
  3. Minimize costs with dbutils.fs.ls by looking directly into the transaction log

Notes

  • For tables with many partitions, execution can take longer, so monitor the first runs closely. Use at your own responsibility: no risk has been mapped so far, but the scan does generate extra transactions against your storage
  • Cost of Azure Storage transactions: read operations (per 10,000) - $0.0258, roughly 2.6 cents per 10,000 operations (estimated price as of 2023-04-21)
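As a back-of-the-envelope check on that pricing, the estimate below multiplies an operation count by the quoted rate; the workload size of 250,000 operations is purely illustrative:

```python
RATE_PER_10K_READS = 0.0258  # USD per 10,000 read operations (rate quoted above)

def estimated_read_cost(num_operations):
    """Estimate the storage read cost in USD for a scan that issues
    num_operations read operations."""
    return num_operations / 10_000 * RATE_PER_10K_READS

# e.g. a recursive listing that issues 250,000 read operations
cost = estimated_read_cost(250_000)  # -> 0.645 USD
```

So even a fairly large recursive listing stays well under a dollar at this rate.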

References

https://github.com/reginaldosilva27/dtstools

Author: Reginaldo Silva
