diff --git a/examples/LePhare_demo.ipynb b/examples/LePhare_demo.ipynb deleted file mode 100644 index 95b6b73..0000000 --- a/examples/LePhare_demo.ipynb +++ /dev/null @@ -1,993 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "c8fb0b8d", - "metadata": {}, - "source": [ - "# Goldenspike+lephare: an example of an end-to-end analysis using RAIL with the prototype rail_lephare wrapper\n", - "\n", - "**Authors:** Sam Schmidt, Eric Charles, Alex Malz, John Franklin Crenshaw, others...\n", - "\n", - "Modified from the [original](https://github.com/LSSTDESC/rail/blob/main/examples/goldenspike_examples/goldenspike.ipynb) by Raphael Shirley to include lephare.\n", - "\n", - "**Last run successfully:** March 27, 2024" - ] - }, - { - "cell_type": "markdown", - "id": "minute-lender", - "metadata": {}, - "source": [ - "This notebook is built on the main rail Goldenspike example in order to demonstrate adding in the rail_lephare wrapper.\n", - "\n", - "This notebook demonstrates how to use the various RAIL Modules to draw synthetic samples of fluxes by color, apply physical effects to them, train photo-Z estimators on the samples, test and validate the performance of those estimators, and use the RAIL summarization modules to obtain n(z) estimates based on the p(z) estimates.\n", - "\n", - "**Creation** \n", - "\n", - "Note that in the parlance of the Creation Module, a \"degrader\" is any post-processing step applied to the \"true\" sample generated by the Creation Engine. This can include adding photometric errors, applying quality cuts, introducing systematic biases, etc. \n", - "\n", - "In this notebook, we will draw both test and training samples from a RAIL Engine object. Then we will demonstrate how to use RAIL degraders to apply effects to those samples.\n", - "\n", - "**Training and Estimation** \n", - "\n", - "The RAIL Informer modules \"train\" or \"inform\" models used to estimate p(z) given band fluxes (and potentially other information).\n", - "\n", - "The RAIL Estimation modules then apply those models to extract the p(z) estimates.\n", - "\n", - "**p(z) Validation** \n", - "\n", - "The RAIL Validator module applies various metrics.\n", - "\n", - "**p(z) to n(z) Summarization** \n", - "\n", - "The RAIL Summarization modules convert per-galaxy p(z) posteriors to ensemble n(z) estimates.
" - ] - }, - { - "cell_type": "markdown", - "id": "banner-migration", - "metadata": {}, - "source": [ - "## Imports" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "supreme-dietary", - "metadata": {}, - "outputs": [], - "source": [ - "# Prerquisites: os, numpy, pathlib, pzflow, tables_io\n", - "import os\n", - "import numpy as np\n", - "from pathlib import Path\n", - "from pzflow.examples import get_galaxy_data\n", - "import tables_io\n", - "import datetime" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "material-funeral", - "metadata": {}, - "outputs": [], - "source": [ - "# Various rail modules\n", - "import rail\n", - "from rail.creation.degraders.lsst_error_model import LSSTErrorModel\n", - "from rail.creation.degraders.spectroscopic_degraders import (\n", - " InvRedshiftIncompleteness,\n", - " LineConfusion,\n", - ")\n", - "from rail.creation.degraders.quantityCut import QuantityCut\n", - "from rail.creation.engines.flowEngine import FlowModeler, FlowCreator, FlowPosterior\n", - "from rail.core.data import TableHandle\n", - "from rail.core.stage import RailStage\n", - "from rail.core.util_stages import ColumnMapper, TableConverter\n", - "\n", - "from rail.estimation.algos.bpz_lite import BPZliteInformer, BPZliteEstimator\n", - "from rail.estimation.algos.k_nearneigh import KNearNeighInformer, KNearNeighEstimator\n", - "from rail.estimation.algos.flexzboost import FlexZBoostInformer, FlexZBoostEstimator\n", - "\n", - "from rail.estimation.algos.naive_stack import NaiveStackSummarizer\n", - "from rail.estimation.algos.point_est_hist import PointEstHistSummarizer\n", - "\n", - "from rail.evaluation.evaluator import Evaluator\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f2fa696d-60d9-46dc-b4c5-e6db9552cfcf", - "metadata": {}, - "outputs": [], - "source": [ - "#This folder must contain the full filters and SED templates. These are not currently in the elphare_dev package\n", - "#They can be downloaded from the official repo https://gitlab.lam.fr/Galaxies/LEPHARE\n", - "LEPHAREDIR = \"/Users/rshirley/Documents/github/LEPHARE\" #lincc/lephare\"\n", - "os.environ['LEPHAREDIR']=LEPHAREDIR #os.path.abspath(\"..\")\n", - "os.environ['LEPHAREWORK']='WORK'\n", - "import lephare as lp\n", - "from rail.estimation.algos.lephare import LephareInformer, LephareEstimator\n", - "#This is required when defining the classes locally\n", - "__file__ = os.path.join(os.getcwd(), \"goldenspike_lephare.ipynb\")" - ] - }, - { - "cell_type": "markdown", - "id": "scheduled-chamber", - "metadata": {}, - "source": [ - "RAIL now uses ceci as a back-end, which takes care of a lot of file I/O decisions to be consistent with other choices in DESC.\n", - "\n", - "The data_store commands in the cell below effectively override a ceci default to prevent overwriting previous results, generally good but not necessary for this demo.\n", - "\n", - "The `DataStore` uses `DataHandle` objects to keep track of the connections between the various stages. When one stage returns a `DataHandle` and then you pass that `DataHandle` to another stage, the underlying code can establish the connections needed to build a reproducilble pipeline. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "transparent-worship", - "metadata": {}, - "outputs": [], - "source": [ - "DS = RailStage.data_store\n", - "DS.__class__.allow_overwrite = True" - ] - }, - { - "cell_type": "markdown", - "id": "brief-institution", - "metadata": {}, - "source": [ - "Here we need a few configuration parameters to deal with differences in data schema between existing PZ codes." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "finnish-southeast", - "metadata": {}, - "outputs": [], - "source": [ - "bands = [\"u\", \"g\", \"r\", \"i\", \"z\", \"y\"]\n", - "band_dict = {band: f\"mag_{band}_lsst\" for band in bands}\n", - "rename_dict = {f\"mag_{band}_lsst_err\": f\"mag_err_{band}_lsst\" for band in bands}\n" - ] - }, - { - "cell_type": "markdown", - "id": "66494399", - "metadata": {}, - "source": [ - "## Train the Flow Engine\n", - "\n", - "First we need to train the normalizing flow that will serve as the engine for the notebook.\n", - "\n", - "In the cell below, we load the example galaxy catalog from PZFlow and save it so that it can be used to train the flow. We also set the path where we will save the flow." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "aaa5f61a", - "metadata": {}, - "outputs": [], - "source": [ - "DATA_DIR = Path().resolve() / \"data\"\n", - "DATA_DIR.mkdir(exist_ok=True)\n", - "\n", - "catalog_file = DATA_DIR / \"base_catalog.pq\"\n", - "catalog = get_galaxy_data().rename(band_dict, axis=1)\n", - "tables_io.write(catalog, str(catalog_file.with_suffix(\"\")), catalog_file.suffix[1:])\n", - "\n", - "catalog_file = str(catalog_file)\n", - "flow_file = str(DATA_DIR / \"trained_flow.pkl\")\n" - ] - }, - { - "cell_type": "markdown", - "id": "0cd8b319", - "metadata": {}, - "source": [ - "Now we set the parameters for the FlowModeler, i.e. the pipeline stage that trains the flow:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "424f2893", - "metadata": {}, - "outputs": [], - "source": [ - "flow_modeler_params = {\n", - " \"name\": \"flow_modeler\",\n", - " \"input\": catalog_file,\n", - " \"model\": flow_file,\n", - " \"seed\": 0,\n", - " \"phys_cols\": {\"redshift\": [0, 3]},\n", - " \"phot_cols\": {\n", - " \"mag_u_lsst\": [17, 35],\n", - " \"mag_g_lsst\": [16, 32],\n", - " \"mag_r_lsst\": [15, 30],\n", - " \"mag_i_lsst\": [15, 30],\n", - " \"mag_z_lsst\": [14, 29],\n", - " \"mag_y_lsst\": [14, 28],\n", - " },\n", - " \"calc_colors\": {\"ref_column_name\": \"mag_i_lsst\"},\n", - "}\n" - ] - }, - { - "cell_type": "markdown", - "id": "2368b6b2", - "metadata": {}, - "source": [ - "Now we will create the flow and train it" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9a069fda", - "metadata": {}, - "outputs": [], - "source": [ - "flow_modeler = FlowModeler.make_stage(**flow_modeler_params)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "55d1b8d9", - "metadata": {}, - "outputs": [], - "source": [ - "flow_modeler.fit_model()\n" - ] - }, - { - "cell_type": "markdown", - "id": "nonprofit-interference", - "metadata": {}, - "source": [ - "## Make mock data\n", - "\n", - "Now we will use the trained flow to create training and test data for the photo-z estimators.\n", - "\n", - "For both the training and test data we will:\n", - "\n", - "1. Use the Flow to produce some synthetic data\n", - "2. Use the LSSTErrorModel to add photometric errors\n", - "3. 
Use the FlowPosterior to estimate the redshift posteriors for the degraded sample\n", - "4. Use the ColumnMapper to rename the error columns so that they match the names in DC2.\n", - "5. Use the TableConverter to convert the data to a numpy dictionary, which will be stored in an HDF5 file with the same schema as the DC2 data\n", - "\n", - "### Training sample\n", - "\n", - "For the training data we are going to apply a couple of extra degradation effects beyond what we do to create the test data, as the training data will have some spectroscopic incompleteness. This will allow us to see how the trained models perform with imperfect training data.\n", - "\n", - "More details about the degraders are available in the `rail/examples/creation_examples/degradation_demo.ipynb` notebook.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "political-member", - "metadata": {}, - "outputs": [], - "source": [ - "flow_creator_train = FlowCreator.make_stage(\n", - " name=\"flow_creator_train\",\n", - " model=flow_modeler.get_handle(\"model\"),\n", - " n_samples=50,\n", - " seed=1235,\n", - ")\n", - "\n", - "lsst_error_model_train = LSSTErrorModel.make_stage(\n", - " name=\"lsst_error_model_train\",\n", - " renameDict=band_dict,\n", - " ndFlag=np.nan,\n", - " seed=29,\n", - ")\n", - "\n", - "inv_redshift = InvRedshiftIncompleteness.make_stage(\n", - " name=\"inv_redshift\",\n", - " pivot_redshift=1.0,\n", - ")\n", - "\n", - "line_confusion = LineConfusion.make_stage(\n", - " name=\"line_confusion\",\n", - " true_wavelen=5007.0,\n", - " wrong_wavelen=3727.0,\n", - " frac_wrong=0.05,\n", - ")\n", - "\n", - "quantity_cut = QuantityCut.make_stage(\n", - " name=\"quantity_cut\",\n", - " cuts={\"mag_i_lsst\": 25.0},\n", - ")\n", - "\n", - "col_remapper_train = ColumnMapper.make_stage(\n", - " name=\"col_remapper_train\",\n", - " columns=rename_dict,\n", - ")\n", - "\n", - "table_conv_train = TableConverter.make_stage(\n", - " name=\"table_conv_train\",\n", - " output_format=\"numpyDict\",\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "simple-bundle", - "metadata": {}, - "outputs": [], - "source": [ - "train_data_orig = flow_creator_train.sample(150, 1235)\n", - "train_data_errs = lsst_error_model_train(train_data_orig, seed=66)\n", - "train_data_inc = inv_redshift(train_data_errs)\n", - "train_data_conf = line_confusion(train_data_inc)\n", - "train_data_cut = quantity_cut(train_data_conf)\n", - "train_data_pq = col_remapper_train(train_data_cut)\n", - "train_data = table_conv_train(train_data_pq)\n" - ] - }, - { - "cell_type": "markdown", - "id": "above-portable", - "metadata": {}, - "source": [ - "Let's examine the quantities that we've generated. We'll use the handy `tables_io` package to temporarily convert to a pandas dataframe for a quick printout of the columns:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "functional-dollar", - "metadata": {}, - "outputs": [], - "source": [ - "train_table = tables_io.convertObj(train_data.data, tables_io.types.PD_DATAFRAME)\n", - "train_table.head()\n" - ] - }, - { - "cell_type": "markdown", - "id": "clinical-pavilion", - "metadata": {}, - "source": [ - "You see that we've generated redshifts, ugrizy magnitudes, and magnitude errors with names that match those in the cosmoDC2_v1.1.4_image data." - ] - }, - { - "cell_type": "markdown", - "id": "square-breeding", - "metadata": {}, - "source": [ - "### Testing sample\n", - "\n", - "For the test sample we will:\n", - "\n", - "1. 
Use the Flow to produce some synthetic data\n", - "2. Use the LSSTErrorModel to smear the data\n", - "3. Use the FlowPosterior to estimate the redshift posteriors for the degraded sample\n", - "4. Use ColumnMapper to rename some of the columns to match DC2\n", - "5. Use the TableConverter to convert the data to a numpy dictionary, which will be stored in an HDF5 file with the same schema as the DC2 data" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "breathing-deficit", - "metadata": {}, - "outputs": [], - "source": [ - "flow_creator_test = FlowCreator.make_stage(\n", - " name=\"flow_creator_test\",\n", - " model=flow_modeler.get_handle(\"model\"),\n", - " n_samples=50,\n", - ")\n", - "\n", - "lsst_error_model_test = LSSTErrorModel.make_stage(\n", - " name=\"lsst_error_model_test\",\n", - " renameDict=band_dict,\n", - " ndFlag=np.nan,\n", - ")\n", - "\n", - "flow_post_test = FlowPosterior.make_stage(\n", - " name=\"flow_post_test\",\n", - " model=flow_modeler.get_handle(\"model\"),\n", - " column=\"redshift\",\n", - " grid=np.linspace(0.0, 5.0, 21),\n", - ")\n", - "\n", - "col_remapper_test = ColumnMapper.make_stage(\n", - " name=\"col_remapper_test\",\n", - " columns=rename_dict,\n", - " hdf5_groupname=\"\",\n", - ")\n", - "\n", - "table_conv_test = TableConverter.make_stage(\n", - " name=\"table_conv_test\",\n", - " output_format=\"numpyDict\",\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "educational-windows", - "metadata": {}, - "outputs": [], - "source": [ - "test_data_orig = flow_creator_test.sample(150, 1234)\n", - "test_data_errs = lsst_error_model_test(test_data_orig, seed=58)\n", - "test_data_post = flow_post_test.get_posterior(test_data_errs, err_samples=None)\n", - "test_data_pq = col_remapper_test(test_data_errs)\n", - "test_data = table_conv_test(test_data_pq)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "inclusive-effect", - "metadata": {}, - "outputs": [], - "source": [ - "test_table = tables_io.convertObj(test_data.data, tables_io.types.PD_DATAFRAME)\n", - "test_table.head()\n" - ] - }, - { - "cell_type": "markdown", - "id": "formal-camping", - "metadata": {}, - "source": [ - "## \"Inform\" some estimators\n", - "\n", - "More details about the process of \"informing\" or \"training\" the models used by the estimators are available in the `rail/examples/estimation_examples/RAIL_estimation_demo.ipynb` notebook.\n", - "\n", - "We use \"inform\" rather than \"train\" to generically refer to the preprocessing of any prior information.\n", - "For a machine learning estimator, that prior information is a training set, but it can also be an SED template library for a template-fitting or hybrid estimator."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "accredited-circle", - "metadata": {}, - "outputs": [], - "source": [ - "inform_bpz = BPZliteInformer.make_stage(\n", - " name=\"inform_bpz\",\n", - " nondetect_val=np.nan,\n", - " model=\"bpz.pkl\",\n", - " hdf5_groupname=\"\",\n", - ")\n", - "\n", - "inform_knn = KNearNeighInformer.make_stage(\n", - " name=\"inform_knn\",\n", - " nondetect_val=np.nan,\n", - " model=\"knnpz.pkl\",\n", - " hdf5_groupname=\"\",\n", - ")\n", - "\n", - "inform_fzboost = FlexZBoostInformer.make_stage(\n", - " name=\"inform_FZBoost\",\n", - " nondetect_val=np.nan,\n", - " model=\"fzboost.pkl\",\n", - " hdf5_groupname=\"\",\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6b86573b-37c4-453c-a400-0dc610bdf24f", - "metadata": {}, - "outputs": [], - "source": [ - "inform_lephare = LephareInformer.make_stage(\n", - " name=\"inform_Lephare\",\n", - " nondetect_val=np.nan,\n", - " model=\"lephare.pkl\",\n", - " hdf5_groupname=\"\",\n", - " #QSO_dict={'typ':'QSO',...},\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e9fcd395-b5ea-4f9b-97b8-acf25e55662e", - "metadata": {}, - "outputs": [], - "source": [ - "train_data_errs.data.keys()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "great-verification", - "metadata": {}, - "outputs": [], - "source": [ - "inform_bpz.inform(train_data)\n", - "inform_knn.inform(train_data)\n", - "inform_fzboost.inform(train_data)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a5be863a-3b07-47fa-a226-777dde26dc3d", - "metadata": {}, - "outputs": [], - "source": [ - "len(train_data.data['redshift'])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "42c13744-c2f8-4702-829e-850e13f2c7a6", - "metadata": {}, - "outputs": [], - "source": [ - "inform_lephare.inform(train_data)" - ] - }, - { - "cell_type": "markdown", - "id": "colonial-trailer", - "metadata": {}, - "source": [ - "## Estimate photo-z posteriors\n", - "\n", - "More detail on the specific estimators used here is available in the `rail/examples/estimation_examples/RAIL_estimation_demo.ipynb` notebook, but here is a very brief summary of three of the estimators used in this notebook:\n", - "\n", - "`BPZliteEstimator` is a template-based photo-z code that outputs the posterior estimated from likelihoods calculated using a template set combined with a Bayesian prior. See Benitez (2000) for more details.
\n", - "`KNearNeighEstimator` is a simple photo-z code that finds the K nearest neighbor training galaxies in color/magnitude space and creates a weighted (by distance) mixture model PDF based on the redshifts of those K neighbors.
\n", - "`FlexZBoostEstimator` is a mature photo-z algorithm that estimates a PDF for each galaxy via a conditional density estimate using the training data. See [Izbicki & Lee (2017)](https://doi.org/10.1214/17-EJS1302) for more details.
\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "electric-minute", - "metadata": {}, - "outputs": [], - "source": [ - "estimate_bpz = BPZliteEstimator.make_stage(\n", - " name=\"estimate_bpz\",\n", - " hdf5_groupname=\"\",\n", - " nondetect_val=np.nan,\n", - " model=inform_bpz.get_handle(\"model\"),\n", - ")\n", - "\n", - "estimate_knn = KNearNeighEstimator.make_stage(\n", - " name=\"estimate_knn\",\n", - " hdf5_groupname=\"\",\n", - " nondetect_val=np.nan,\n", - " model=inform_knn.get_handle(\"model\"),\n", - ")\n", - "\n", - "estimate_fzboost = FlexZBoostEstimator.make_stage(\n", - " name=\"test_FZBoost\",\n", - " nondetect_val=np.nan,\n", - " model=inform_fzboost.get_handle(\"model\"),\n", - " hdf5_groupname=\"\",\n", - " aliases=dict(input=\"test_data\", output=\"fzboost_estim\"),\n", - ")\n", - "\n", - "# estimate_lephare = LephareEstimator.make_stage(\n", - "# name=\"test_Lephare\",\n", - "# nondetect_val=np.nan,\n", - "# model=inform_lephare.get_handle(\"model\"),\n", - "# hdf5_groupname=\"\",\n", - "# aliases=dict(input=\"test_data\", output=\"lephare_estim\"),\n", - "# )" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "average-rental", - "metadata": {}, - "outputs": [], - "source": [ - "knn_estimated = estimate_knn.estimate(test_data)\n", - "fzboost_estimated = estimate_fzboost.estimate(test_data)\n", - "bpz_estimated = estimate_bpz.estimate(test_data)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c500f024-2d4b-4cbe-ab07-fac0d69cad54", - "metadata": {}, - "outputs": [], - "source": [ - "estimate_lephare = LephareEstimator.make_stage(\n", - " name=\"test_Lephare\",\n", - " nondetect_val=np.nan,\n", - " model=inform_lephare.get_handle(\"model\"),\n", - " hdf5_groupname=\"\",\n", - " aliases=dict(input=\"test_data\", output=\"lephare_estim\"),\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "dccd6f85-40bc-4385-9739-18b8e9db9b41", - "metadata": {}, - "outputs": [], - "source": [ - "lephare_estimated = estimate_lephare.estimate(test_data)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "77a3e4d1-9dba-47c2-9ad4-277ec06b924c", - "metadata": {}, - "outputs": [], - "source": [ - "os.path.dirname(os.path.abspath(__file__))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "920be48d-9c01-46fb-ba8a-593394c48691", - "metadata": {}, - "outputs": [], - "source": [ - "396\n", - "\n", - "os.path.abspath(os.path.join(os.path.dirname( __file__ ), '..', '..','..'))" - ] - }, - { - "cell_type": "markdown", - "id": "right-mystery", - "metadata": {}, - "source": [ - "## Evaluate the estimates\n", - "\n", - "Now we evaluate metrics on the estimates, separately for each estimator. \n", - "\n", - "Each call to the `Evaluator.evaluate` will create a table with the various performance metrics. \n", - "We will store all of these tables in a dictionary, keyed by the name of the estimator." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "portuguese-graduate", - "metadata": {}, - "outputs": [], - "source": [ - "eval_dict = dict(\n", - " bpz=bpz_estimated, fzboost=fzboost_estimated, \n", - " knn=knn_estimated, lephare=lephare_estimated)\n", - "truth = test_data_orig\n", - "\n", - "result_dict = {}\n", - "for key, val in eval_dict.items():\n", - " the_eval = Evaluator.make_stage(name=f\"{key}_eval\", truth=truth)\n", - " result_dict[key] = the_eval.evaluate(val, truth)\n" - ] - }, - { - "cell_type": "markdown", - "id": "danish-miller", - "metadata": {}, - "source": [ - "The Pandas DataFrame output format conveniently makes human-readable printouts of the metrics. \n", - "This next cell will convert everything to Pandas." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "constant-peripheral", - "metadata": {}, - "outputs": [], - "source": [ - "results_tables = {\n", - " key: tables_io.convertObj(val.data, tables_io.types.PD_DATAFRAME)\n", - " for key, val in result_dict.items()\n", - "}\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "expanded-fellowship", - "metadata": {}, - "outputs": [], - "source": [ - "results_tables[\"knn\"]\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "normal-alexandria", - "metadata": {}, - "outputs": [], - "source": [ - "results_tables[\"fzboost\"]\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "satisfied-intelligence", - "metadata": {}, - "outputs": [], - "source": [ - "results_tables[\"bpz\"]\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "baf9d2ee-8dba-455c-9daf-bc9c6c0cd976", - "metadata": {}, - "outputs": [], - "source": [ - "results_tables[\"lephare\"]" - ] - }, - { - "cell_type": "markdown", - "id": "grave-speaking", - "metadata": {}, - "source": [ - "## Summarize the per-galaxy redshift constraints to make population-level distributions\n", - "\n", - "We use two summarizers here: `NaiveStackSummarizer`, which stacks (averages) the per-galaxy p(z) posteriors, and `PointEstHistSummarizer`, which histograms a point estimate (the mode) of each galaxy's posterior; a conceptual sketch of both follows below.\n", - "\n", - "First we make the stages, then execute them, and then plot the output."
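For intuition before running the real stages, here is a hedged numpy-only sketch of what the two summarizers do conceptually; it is not the rail implementation, and all names and values are illustrative:

```python
import numpy as np

zgrid = np.linspace(0.0, 3.0, 301)
rng = np.random.default_rng(0)

# Fake per-galaxy p(z): 100 Gaussians with random centers, normalized on zgrid
centers = rng.uniform(0.2, 1.5, size=100)
pdfs = np.exp(-0.5 * ((zgrid[None, :] - centers[:, None]) / 0.1) ** 2)
pdfs /= pdfs.sum(axis=1, keepdims=True) * (zgrid[1] - zgrid[0])

# NaiveStack idea: n(z) is the average of the per-galaxy PDFs
nz_stack = pdfs.mean(axis=0)

# PointEstHist idea: histogram the per-galaxy modes
modes = zgrid[pdfs.argmax(axis=1)]
nz_hist, edges = np.histogram(modes, bins=30, range=(0.0, 3.0), density=True)
```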
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "white-replacement", - "metadata": {}, - "outputs": [], - "source": [ - "point_estimate_test = PointEstHistSummarizer.make_stage(name=\"point_estimate_test\")\n", - "naive_stack_test = NaiveStackSummarizer.make_stage(name=\"naive_stack_test\")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "korean-guess", - "metadata": {}, - "outputs": [], - "source": [ - "point_estimate_ens = point_estimate_test.summarize(eval_dict[\"lephare\"])\n", - "naive_stack_ens = naive_stack_test.summarize(eval_dict[\"lephare\"])\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "focused-puppy", - "metadata": {}, - "outputs": [], - "source": [ - "_ = naive_stack_ens.data.plot_native(xlim=(0, 3))\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "worthy-croatia", - "metadata": {}, - "outputs": [], - "source": [ - "_ = point_estimate_ens.data.plot_native(xlim=(0, 3))\n" - ] - }, - { - "cell_type": "markdown", - "id": "medical-preview", - "metadata": {}, - "source": [ - "## Convert this to a `ceci` Pipeline\n", - "\n", - "Now that we have defined and configured all these stages, and have established the connections between them by passing `DataHandle` objects, we can build a `ceci` Pipeline.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "neural-central", - "metadata": {}, - "outputs": [], - "source": [ - "import ceci\n", - "\n", - "pipe = ceci.Pipeline.interactive()\n", - "stages = [\n", - " # train the flow\n", - " flow_modeler,\n", - " # create the training catalog\n", - " flow_creator_train,\n", - " lsst_error_model_train,\n", - " inv_redshift,\n", - " line_confusion,\n", - " quantity_cut,\n", - " col_remapper_train,\n", - " table_conv_train,\n", - " # create the test catalog\n", - " flow_creator_test,\n", - " lsst_error_model_test,\n", - " col_remapper_test,\n", - " table_conv_test,\n", - " # inform the estimators\n", - " inform_bpz,\n", - " inform_knn,\n", - " inform_fzboost,\n", - " inform_lephare,\n", - " # estimate posteriors\n", - " estimate_bpz,\n", - " estimate_knn,\n", - " estimate_fzboost,\n", - " estimate_lephare,\n", - " # estimate n(z), aka \"summarize\"\n", - " point_estimate_test,\n", - " naive_stack_test,\n", - "]\n", - "for stage in stages:\n", - " pipe.add_stage(stage)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "packed-chosen", - "metadata": {}, - "outputs": [], - "source": [ - "pipe.initialize(\n", - " dict(input=catalog_file), dict(output_dir=\".\", log_dir=\".\", resume=False), None\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "academic-romance", - "metadata": {}, - "outputs": [], - "source": [ - "pipe.save(\"tmp_goldenspike.yml\")\n" - ] - }, - { - "cell_type": "markdown", - "id": "younger-testament", - "metadata": {}, - "source": [ - "### Read back the pipeline and run it" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "israeli-workshop", - "metadata": {}, - "outputs": [], - "source": [ - "pr = ceci.Pipeline.read(\"tmp_goldenspike.yml\")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "danish-homeless", - "metadata": {}, - "outputs": [], - "source": [ - "pr.run()\n" - ] - }, - { - "cell_type": "markdown", - "id": "informational-performer", - "metadata": {}, - "source": [ - "## Clean up:\n", - "\n", - "Finally, you'll notice that we've written a large number of temporary files in the
course of running this demo. To clean up the directory and delete the data files, just run the `cleanup.sh` script in this directory." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "racial-stocks", - "metadata": {}, - "outputs": [], - "source": [ - "# TODO fix and add clean up scripts\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ec229a5d-18c1-4216-b7c6-f6a26095ab6a", - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5d09f4d3-c937-4998-8a0b-22d806a6a32d", - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "668f8e6b-0409-4abf-8c7e-b1a4bb6db0c3", - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c85107c8-b669-438d-8fa7-a7bc817ff274", - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0f380616-37a9-473e-9dce-8a95c7672fcd", - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "lincc", - "language": "python", - "name": "lincc" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.13" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/examples/cosmos_demo.ipynb b/examples/cosmos_demo.ipynb new file mode 100644 index 0000000..c39c1c6 --- /dev/null +++ b/examples/cosmos_demo.ipynb @@ -0,0 +1,287 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# COSMOS example\n", + "\n", + "The default data for testing LePHARE is the COSMOS dataset.\n", + "\n", + "In this example we use RAIL to run the standard LePHARE COSMOS example.\n", + "\n", + "In this example we use fluxes, not magnitudes. To use magnitudes instead, you must update both the config and the input values." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from rail.estimation.algos.lephare import LephareInformer, LephareEstimator\n", + "import numpy as np\n", + "import lephare as lp\n", + "from rail.core.stage import RailStage\n", + "import matplotlib.pyplot as plt\n", + "from astropy.table import Table\n", + "import astropy.units as u\n", + "from collections import OrderedDict\n", + "import os\n", + "\n", + "DS = RailStage.data_store\n", + "DS.__class__.allow_overwrite = True" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here we load the default COSMOS configuration and example data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Retrieve all the required filter and template files" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "lephare_config = lp.default_cosmos_config.copy()\n", + "# For usable science results you must use a denser redshift grid; comment out the following line to revert to the config dz of 0.01.\n", + "lephare_config['Z_STEP']= \".1,0.,7.\"\n", + "nobj=100 # Increase to run on more objects. 
Set to -1 to run on all.\n", + "\n", + "lp.data_retrieval.get_auxiliary_data(keymap=lephare_config, additional_files=[\"examples/COSMOS.in\",\"examples/output.para\"])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "bands=lephare_config['FILTER_LIST'].split(',')\n", + "len(bands)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# For a test, let's just look at the first nobj objects\n", + "cosmos=Table.read(os.path.join(lp.LEPHAREDIR,\"examples/COSMOS.in\"),format='ascii')[:nobj]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(f\"We will run on {len(cosmos)} objects.\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# The example file is in the historical lephare format.\n", + "data=OrderedDict()\n", + "flux_cols=[]\n", + "flux_err_cols=[]\n", + "for n,b in enumerate(bands):\n", + " #print(1+2*n,2+2*n)\n", + " flux=cosmos[cosmos.colnames[1+2*n]]\n", + " flux_err=cosmos[cosmos.colnames[2+2*n]]\n", + " data[f\"flux_{b}\"]=flux\n", + " flux_cols.append(f\"flux_{b}\")\n", + " data[f\"flux_err_{b}\"]=flux_err\n", + " flux_err_cols.append(f\"flux_err_{b}\")\n", + "data[\"redshift\"]=np.array(cosmos[cosmos.colnames[-2]])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We use the inform stage to create the library of SEDs with various redshifts, extinction laws, and reddening values." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "inform_lephare = LephareInformer.make_stage(\n", + " name=\"inform_COSMOS\",\n", + " nondetect_val=np.nan,\n", + " model=\"lephare.pkl\",\n", + " hdf5_groupname=\"\",\n", + " lephare_config=lephare_config,\n", + " bands=flux_cols,\n", + " err_bands=flux_err_cols,\n", + " ref_band=flux_cols[0],\n", + ")\n", + "\n", + "inform_lephare.inform(data)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we take the test data and find the best fits from the library. This results in a PDF, zmode, and zmean for each input object.
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "estimate_lephare = LephareEstimator.make_stage(\n", + " name=\"test_Lephare_COSMOS\",\n", + " nondetect_val=np.nan,\n", + " model=inform_lephare.get_handle(\"model\"),\n", + " hdf5_groupname=\"\",\n", + " aliases=dict(input=\"test_data\", output=\"lephare_estim\"),\n", + " bands=flux_cols,\n", + " err_bands=flux_err_cols,\n", + " ref_band=flux_cols[0],\n", + ")\n", + "\n", + "lephare_estimated = estimate_lephare.estimate(data)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "lephare_config[\"AUTO_ADAPT\"] " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "An example lephare PDF and comparison to the true value" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "indx = 1\n", + "zgrid = np.linspace(0,7,1000)\n", + "plt.plot(zgrid, np.squeeze(lephare_estimated.data[indx].pdf(zgrid)), label='Estimated PDF')\n", + "plt.axvline(x=data['redshift'][indx], color='r', label='True redshift')\n", + "plt.legend()\n", + "plt.xlabel('z')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "More example fits" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "indxs = [8, 16, 32, 64, 65, 66, 68, 69] #, 128, 256, 512, 1024]\n", + "zgrid = np.linspace(0,7,1000)\n", + "fig, axs = plt.subplots(2,4, figsize=(20,6))\n", + "for i, indx in enumerate(indxs):\n", + " ax = axs[i//4, i%4]\n", + " ax.plot(zgrid, np.squeeze(lephare_estimated.data[indx].pdf(zgrid)), label='Estimated PDF')\n", + " ax.axvline(x=data['redshift'][indx], color='r', label='True redshift')\n", + " ax.set_xlabel('z')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Histogram of the absolute difference between lephare estimate and true redshift" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "estimate_diff_from_truth = np.abs(lephare_estimated.data.ancil['zmode'] - data['redshift'])\n", + "\n", + "plt.figure()\n", + "plt.hist(estimate_diff_from_truth, 100)\n", + "plt.xlabel('abs(z_estimated - z_true)')\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plt.scatter(data['redshift'],lephare_estimated.data.ancil['Z_BEST'])\n", + "plt.xlabel('$z_{spec}$')\n", + "plt.ylabel('$z_{LePHARE}$')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plt.scatter(data['redshift'],lephare_estimated.data.ancil['zmean'])\n", + "plt.xlabel('$z_{spec}$')\n", + "plt.ylabel('$z_{LePHARE}$')" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/src/rail/estimation/algos/lephare.py b/src/rail/estimation/algos/lephare.py index 422c8dd..ceb838a 100644 --- a/src/rail/estimation/algos/lephare.py +++ b/src/rail/estimation/algos/lephare.py @@ -9,7 +9,7 @@ import importlib # 
We start with the COSMOS default and override it with LSST-specific values. -lsst_default_config=lp.default_cosmos_config +lsst_default_config=lp.default_cosmos_config.copy() lsst_default_config.update({'CAT_IN': 'bidon', 'ERR_SCALE': '0.02,0.02,0.02,0.02,0.02,0.02', 'FILTER_CALIB': '0,0,0,0,0,0', @@ -172,15 +172,18 @@ class LephareEstimator(CatEstimator): redshift_col=SHARED_PARAMS, lephare_config=Param( dict, - lp.keymap_to_string_dict(lp.read_config( - "{}/{}".format(os.path.dirname(os.path.abspath(__file__)), "lsst.para") - )), - msg="The lephare config keymap.", + {}, + msg="The lephare config keymap. If unset we load it from the model.", ), output_keys=Param( list, - ["Z_BEST", "ZQ_BEST", "MOD_STAR"], - msg="The output keys to add to ancil. These must be in the output para file.", + ["Z_BEST", "CHI_BEST", "ZQ_BEST", "CHI_QSO", "MOD_STAR", "CHI_STAR"], + msg=( + "The output keys to add to ancil. These must be in the " + "output para file. By default we include the best galaxy " + "and QSO redshifts and the best star model, alongside " + "their respective chi squared values." + ), ), offsets=Param( list, @@ -203,8 +206,8 @@ class LephareEstimator(CatEstimator): def __init__(self, args, comm=None): CatEstimator.__init__(self, args, comm=comm) - self.lephare_config = self.config["lephare_config"] CatEstimator.open_model(self, **self.config) + self.lephare_config = self.model["lephare_config"] Z_STEP=self.model["lephare_config"]["Z_STEP"] self.lephare_config["Z_STEP"]=Z_STEP self.dz = float(Z_STEP.split(",")[0]) @@ -218,7 +221,7 @@ def __init__(self, args, comm=None): _update_lephare_env(None, self.run_dir) def _process_chunk(self, start, end, data, first): - """Process an individual chunk of sources using lephare. + """Process an individual chunk of sources using lephare. Run the equivalent of zphota and get the PDF for every source. """ @@ -227,8 +230,10 @@ def _process_chunk(self, start, end, data, first): # Set the desired offsets: the estimate config overrides the lephare config, which overrides the inform offsets if self.config["offsets"]: offsets = self.config["offsets"] - elif self.config["lephare_config"]["AUTO_ADAPT"] == "YES": - a0, a1 = lp.calculate_offsets(lp.string_dict_to_keymap(self.lephare_config), input) + elif self.lephare_config["AUTO_ADAPT"] == "YES": + a0, a1 = lp.calculate_offsets( + lp.string_dict_to_keymap(self.lephare_config), input + ) offsets = [a0, a1] elif not self.config["offsets"]: offsets = self.model["offsets"] @@ -262,8 +267,12 @@ def _rail_to_lephare_input(data, mag_cols, mag_err_cols): Parameters ========== - data : pandas - The RAIL input data chunk + data : OrderedDict + The RAIL input data chunk, an ordered dictionary of magnitude arrays. + mag_cols : list + The names of the magnitude or flux columns. + mag_err_cols : list + The names of the magnitude or flux error columns. Returns ======= @@ -280,11 +289,22 @@ def _rail_to_lephare_input(data, mag_cols, mag_err_cols): except KeyError: input["id"] = np.arange(ng) # Add all available magnitudes + + context = np.full(ng, 0) for n in np.arange(len(mag_cols)): input[mag_cols[n]] = data[mag_cols[n]].T input[mag_err_cols[n]] = data[mag_err_cols[n]].T - # Set context to use all bands - input["context"] = np.sum([2**n for n in np.arange(ng)]) + # TODO: decide whether negative fluxes should be allowed; for now, exclude them:
+ mask = input[mag_cols[n]] > 0 + mask &= ~np.isnan(input[mag_cols[n]]) + mask &= input[mag_err_cols[n]] > 0 + mask &= ~np.isnan(input[mag_err_cols[n]]) + context += mask * 2**n + # Use the context column from the input data if present; otherwise use the computed context, which excludes all negative and NaN values + try: + input["context"] = data["context"] + except KeyError: + input["context"] = context try: input["zspec"] = data["redshift"] except KeyError:
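The context logic added in the hunk above follows LePHARE's convention that the per-object context is an integer bitmask in which bit n is set when band n should be used in the fit. A standalone sketch of that convention with toy values (the arrays and the four-band setup are invented for illustration):

```python
import numpy as np

# Toy fluxes/errors for one object in four bands:
# band 1 is a non-detection (NaN) and band 3 has a negative flux.
flux = np.array([12.3, np.nan, 4.5, -1.0])
flux_err = np.array([0.4, np.nan, 0.2, 0.3])

context = 0
for n in range(len(flux)):
    good = (
        flux[n] > 0
        and flux_err[n] > 0
        and not np.isnan(flux[n])
        and not np.isnan(flux_err[n])
    )
    if good:
        context += 2 ** n  # set bit n only for usable bands

print(context)  # 1 + 4 = 5: only bands 0 and 2 enter the fit
```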