GitHub - henrysky/stars_foundation_diffusion: Code for Leung, Bovy & Speagle 2024

Abstract

Transformers are often the go-to architecture to build foundation models that ingest a large amount of training data. But these models do not estimate the probability density distribution when trained on regression problems, yet obtaining full probabilistic outputs is crucial to many fields of science, where the probability distribution of the answer can be non-Gaussian and multimodal. In this work, we demonstrate that training a probabilistic model using a denoising diffusion head on top of the Transformer provides reasonable probability density estimation even for high-dimensional inputs. The combined Transformer+Denoising Diffusion model allows conditioning the output probability density on arbitrary combinations of inputs and it is thus a highly flexible density function emulator of all possible input/output combinations. We illustrate our Transformer+Denoising Diffusion model by training it on a large dataset of astronomical observations and measured labels of stars within our Galaxy and we apply it to a variety of inference tasks to show that the model can infer labels accurately with reasonable distributions.

Table of Contents

Abstract
Getting Started
Authors
License

Getting Started

This repository makes all figures and results in the paper easily reproducible by anyone 🤗.

If GitHub has issues loading the Jupyter notebooks (or is too slow), you can view them at http://nbviewer.jupyter.org/github/henrysky/stars_foundation_diffusion/tree/main/

Dependencies

Python dependencies are listed in requirements.txt.

⚠️ You have to set magicnumber = nan in the astroNN configuration file for the data reduction code to work properly.
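To confirm the setting above took effect, a small check like the following may help. It is a minimal sketch, not part of this repository: the default path `~/.astroNN/config.ini` and the section name `Basics` are assumptions about a typical astroNN installation, and the function name is hypothetical. Pass your own path and section if yours differ.

```python
import configparser
import math
from pathlib import Path

# ASSUMPTION: default location and section name; adjust to your installation.
ASSUMED_CONFIG_PATH = Path.home() / ".astroNN" / "config.ini"
ASSUMED_SECTION = "Basics"


def magic_number_is_nan(config_path=ASSUMED_CONFIG_PATH, section=ASSUMED_SECTION):
    """Return True if `magicnumber` in the given INI file parses to NaN."""
    parser = configparser.ConfigParser()  # option names are case-insensitive
    if not parser.read(config_path):
        raise FileNotFoundError(f"Could not read config file: {config_path}")
    value = parser.get(section, "magicnumber")
    try:
        return math.isnan(float(value))
    except ValueError:
        # The value is not a number at all, so it is certainly not NaN.
        return False
```

The function only reads the file, so it is safe to run before starting the data reduction code.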

Datasets

Datasets are available on Zenodo and should be placed in the folder named data_files under the root directory of this repository.

If you are planning to use the Docker image, the data files are already downloaded and placed in the correct folder in the container.
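For a local (non-Docker) checkout, a quick sanity check that the data_files folder is where the code expects it could look like this. It is a minimal sketch with a hypothetical helper name. It only lists whatever is in the folder and makes no assumption about the actual dataset file names.

```python
from pathlib import Path


def list_data_files(repo_root="."):
    """Return sorted entry names in <repo_root>/data_files; raise if the folder is missing."""
    data_dir = Path(repo_root) / "data_files"
    if not data_dir.is_dir():
        raise FileNotFoundError(f"Expected dataset folder not found: {data_dir}")
    return sorted(p.name for p in data_dir.iterdir())
```

Running `list_data_files()` from the root directory of this repository should show the files downloaded from Zenodo.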

Docker Image

If you have Docker installed, you can use the Dockerfile to build a Docker image, based on the PyTorch container from the NVIDIA NGC Catalog, with all dependencies installed and data files downloaded.

To build the Docker image called stars_foundation_diffusion, run the following command in the root directory of this repository:

docker build -t stars_foundation_diffusion .

To run a Docker container named testing123 with all GPUs available to it, run the following command:

docker run --gpus all --name testing123 -it -e SHELL=/bin/bash --entrypoint bash stars_foundation_diffusion

Then you can attach to the container by running:

docker exec -it testing123 bash

Now you can run all notebooks or the training script inside the container.

Jupyter Notebooks

Dataset_Reduction.ipynb (External Repository)

The notebook contains code to generate the dataset used by this paper.

Terabytes of (mostly Gaia) data need to be downloaded in the process of constructing the datasets.

DDPM.ipynb

The notebook contains code to train a simple denoising diffusion model.

DDPM_Conditional.ipynb

The notebook contains code to train a simple conditional denoising diffusion model.

Inference.ipynb

The notebook contains code to do inference.

California_Housing.ipynb

The notebook contains code to train a model on the California housing dataset for demonstration purposes.
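For readers new to denoising diffusion, the forward (noising) step that such models are trained to invert can be sketched in a few lines. This is the standard DDPM formulation, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, written with the Python standard library for a single scalar label. It is not the code from the notebooks: the function names, the linear schedule, and its values (1e-4 to 0.02 over 1000 steps) are illustrative assumptions.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (illustrative values)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0); t is a 0-based step index. Returns (x_t, eps)."""
    eps = rng.gauss(0.0, 1.0)
    a = alpha_bars[t]
    x_t = math.sqrt(a) * x0 + math.sqrt(1.0 - a) * eps
    return x_t, eps


rng = random.Random(0)  # fixed seed so the sketch is reproducible
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
x_t, eps = add_noise(x0=1.5, t=499, alpha_bars=alpha_bars, rng=rng)
```

In the usual training setup, a network receives x_t and t (plus any conditioning, in the conditional case) and is trained to predict eps with a mean-squared-error loss.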

Python Script

If you use this training script to train your own model, please note that details of your system will be saved automatically in the model folder as training_system_info.txt so that developers can debug should anything go wrong. Delete the file before you share your model with others if you are concerned about privacy.

training.py

Python script to train the model.

To train the model with mixed precision and torch.compile(), run the following command in the root directory of this repository:

python training.py --mixed_precision --compile_model

To see all available arguments, run:

python training.py --help
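Before sharing a model folder produced by the training script, the training_system_info.txt file mentioned above can be removed by hand or with a small helper such as the one below. This is a minimal sketch: the function name is hypothetical, and the model folder path is whatever you trained into.

```python
from pathlib import Path


def remove_system_info(model_folder):
    """Delete training_system_info.txt from a model folder; return True if a file was removed."""
    info_file = Path(model_folder) / "training_system_info.txt"
    if info_file.is_file():
        info_file.unlink()
        return True
    return False
```

The function returns False rather than raising when the file is absent, so it can be called repeatedly on the same folder.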

Models

model_torch is a trained PyTorch model

The model used for the paper has ~3.7 million parameters.

trained_california_model is a trained PyTorch model

The model has 20640 parameters and was trained on the California housing dataset for demonstration purposes.

Graphics

All these graphics can be opened and edited with draw.io.

encoder_ddpm.drawio

Source for Figure 1 in the paper.

Authors

Henry Leung - henrysky

Department of Astronomy and Astrophysics, University of Toronto

Contact Henry: henrysky.leung [at] utoronto.ca

Jo Bovy - jobovy

Department of Astronomy and Astrophysics, University of Toronto

Josh Speagle - joshspeagle

Department of Astronomy and Astrophysics, University of Toronto

License

This project is licensed under the MIT License - see the LICENSE file for details.
