How do you work with 1,000,000 images? Or videos? Or documents? Also, what about all the associated metadata like the date an image was taken? Or the height/width of each image? Or maybe annotations on the image?
Argh! Why is it so hard?
A Multimodal Data Warehouse needs to support (in increasing order of data warehousing exoticness):
1. Running analytics (just like any good ol' Data Warehouse): count how many images there are, grouped by the URL's domain name (items 1 and 2 are sketched right after this list)
2. Running multimodal operations: extract the height/width of an image, or compute the CLIP embedding of an image
3. Running (multimodal) LLMs... efficiently: run the newest fancy open-source LLM on all the available GPUs, or make API calls to OpenAI's GPT-4o to prompt it with my images
4. Feeding model training: stream my data to a model training system (e.g. PyTorch running on GPU machines)
5. Clustering: cluster my documents based on some semantic similarity metric, then deduplicate them
6. Retrieval: retrieve documents to augment a dataset (both semantic and metadata-based)
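
As a taste of items (1) and (2), here is a minimal sketch using Daft. The bucket path and the `url` column name are illustrative assumptions about the metadata layout, not this repo's actual schema.

```python
import daft
from urllib.parse import urlparse

# Assumed layout: a Parquet file of metadata with a `url` column pointing at images in S3.
df = daft.read_parquet("s3://my-bucket/images/metadata.parquet")

# (1) Plain analytics: how many images per URL domain?
df = df.with_column(
    "domain",
    daft.col("url").apply(lambda u: urlparse(u).netloc, return_dtype=daft.DataType.string()),
)
df.groupby("domain").agg(daft.col("url").count()).show()

# (2) A multimodal operation: fetch the bytes behind each URL and decode them into images,
# ready for downstream steps such as extracting width/height or computing CLIP embeddings.
df = df.with_column("image", daft.col("url").url.download().image.decode())
```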
Traditional data warehouses (or a data lakehouse, if you're feeling fancy) really only handle (1) well. This repo shows an example of building an effective multimodal data warehouse using nothing but Daft, good old Parquet files, and URLs pointing to files in AWS S3 (or any object store of your choice, really).
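
To give a flavor of how the GPU-heavy items fit this "URLs + Parquet + Daft" pattern, here is a hedged sketch of batch inference using a Daft class UDF that loads a model once per worker. The model choice (CLIP via sentence-transformers), column names, and paths are assumptions for illustration, not this repo's code.

```python
import daft

# Stateful (class) UDF: the model is loaded once per worker in __init__,
# then applied batch-by-batch in __call__. The specific model is an assumption.
@daft.udf(return_dtype=daft.DataType.list(daft.DataType.float32()))
class EmbedImages:
    def __init__(self):
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer("clip-ViT-B-32")

    def __call__(self, images):
        from PIL import Image
        # `images` arrives as a Daft Series of decoded images (numpy arrays).
        pil_images = [Image.fromarray(arr) for arr in images.to_pylist()]
        return self.model.encode(pil_images).tolist()

df = daft.read_parquet("s3://my-bucket/images/metadata.parquet")  # assumed path
df = df.with_column("image", daft.col("url").url.download().image.decode())
df = df.with_column("embedding", EmbedImages(daft.col("image")))
# Persist the results back to plain Parquet in object storage.
df.select("url", "embedding").write_parquet("s3://my-bucket/images/embeddings/")
```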
Along the way, we will showcase benchmarks and show how this simple, tried-and-true set of tools can go a long way while still maintaining interoperability with all the other really good open-source tooling out there, such as DuckDB, Trino, Spark, and more.
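
Because the "warehouse" is nothing but Parquet files in object storage, any engine that speaks Parquet can query it directly. A tiny DuckDB example against the (hypothetical) output path from the sketch above:

```python
import duckdb

# Reading s3:// paths from DuckDB requires the httpfs extension and S3 credentials.
duckdb.sql("""
    SELECT count(*) AS n_images, count(DISTINCT url) AS n_unique_urls
    FROM 's3://my-bucket/images/embeddings/*.parquet'
""").show()
```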