JuliaAstro/JuliaSpace use case #26

Open
helgee opened this issue Aug 3, 2021 · 1 comment

helgee commented Aug 3, 2021

Continuation of our discussion on Zulip: https://julialang.zulipchat.com/#narrow/stream/295423-juliaspace/topic/Lift-off!/near/248190318

CC: @ronisbr

Background

Within the JuliaAstro/JuliaSpace ecosystem there are several packages which need access to data sets on the internet, some of which are updated regularly.

This includes:

Workflows

We foresee several different workflows depending on the environment:

  1. The REPL workflow: A REPL (or Pluto/Jupyter) user should be able to start working without worrying about the required data. Data downloading and loading should happen automatically in the background (at package load time or better lazily upon function invocation) and be completely transparent.
  2. "Traditional" operational systems and expert users: It should be possible to override the default mechanism and provide custom data potentially from a central data storage in a traditional file-based space operations system.
  3. Reproducible scientific analyses: For the sake of reproducibility, users should be able to fix dynamic data to a specific point in time, see ScienceState from Astropy.
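
Workflow 3 could be sketched as a scoped override, loosely in the spirit of Astropy's ScienceState. All names below (`EOP_SNAPSHOT`, `current_snapshot`, `with_snapshot`) are hypothetical and not part of any existing package; the sketch only illustrates pinning dynamic data to a date for the duration of a block.

```julia
using Dates

# Hypothetical global state: `nothing` means "use the latest available data".
const EOP_SNAPSHOT = Ref{Union{Nothing,Date}}(nothing)

current_snapshot() = EOP_SNAPSHOT[]

# Pin the data to `date` while `f` runs, then restore the previous setting,
# even if `f` throws.
function with_snapshot(f, date::Date)
    previous = EOP_SNAPSHOT[]
    EOP_SNAPSHOT[] = date
    try
        return f()
    finally
        EOP_SNAPSHOT[] = previous
    end
end

# Usage: code inside the block sees the pinned date.
pinned = with_snapshot(Date(2021, 8, 3)) do
    current_snapshot()
end
```

A plain global `Ref` like this is not task-safe; a real implementation would probably need task-local or otherwise scoped state.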

Current Solution

We currently use a combination of OptionalData.jl and RemoteFiles.jl to handle workflows 1 & 2. As of now, we do not have a solution for workflow 3.

Here's an example from EarthOrientation.jl:

mutable struct EOParams
   # Fields omitted
end

# OptionalData.jl provides a type-safe wrapper for the data set
@OptionalData EOP_DATA EOParams "Call 'EarthOrientation.update()' to load it."

# RemoteFiles.jl is used to download and update the data
@RemoteFileSet data "IERS Data" begin
    iau1980 = @RemoteFile(
        "https://datacenter.iers.org/data/csv/finals.all.csv",
        file="finals.csv",
        updates=:thursdays,
    )
    iau2000 = @RemoteFile(
        "https://datacenter.iers.org/data/csv/finals2000A.all.csv",
        file="finals2000A.csv",
        updates=:thursdays,
    )
end

# Download data and `push!` it into the optional data set. Can be called from `__init__`
function update(; force=false)
    download(data; force=force)
    push!(EOP_DATA, paths(data, :iau1980, :iau2000)...)
    nothing
end

Issues

  1. Recursive update: Every package in a dependency chain needs to implement an update function for manual updates. For example, AstroBase.jl depends on AstroTime.jl, which depends on EarthOrientation.jl. In principle, AstroBase.jl needs an update function which calls AstroTime.jl's update function, which calls EarthOrientation.jl's update function, and so on.
  2. Data dependency injection: It is unclear in which package or function the data dependency should be introduced. (I am not really sure what I mean by this so I will try to give examples).
    • In the example above, the dependency on the EOP data is introduced in the lowest level package.
    • Another example is Astrodynamics.jl depending on AstroBase.jl. AstroBase is supposed to keep things abstract and thus does not add a dependency on an ephemeris; the function for planetary positions looks like position(eph::AbstractEphemeris, t, ...). Astrodynamics.jl is the top-layer opinionated metapackage and defines a global default ephemeris via the approach above, e.g. position(t, ...).
    • I have no idea if one actually needs both ways or if one pattern is strictly better than the other.
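
The two dependency patterns from issue 2 can be put side by side in a minimal sketch. `EphemerisSketch`, `DummyEphemeris` and `DEFAULT_EPHEMERIS` are hypothetical stand-ins, and the returned coordinates are placeholders rather than real ephemeris output.

```julia
module EphemerisSketch

abstract type AbstractEphemeris end

# Hypothetical stand-in for a real ephemeris; it only shifts `t` by an offset.
struct DummyEphemeris <: AbstractEphemeris
    offset::Float64
end

# Pattern A (AstroBase-style): the data source is an explicit argument.
position(eph::DummyEphemeris, t) = (eph.offset + t, 0.0, 0.0)

# Pattern B (Astrodynamics-style): the top-level package injects a global
# default and forwards to pattern A.
const DEFAULT_EPHEMERIS = Ref{AbstractEphemeris}(DummyEphemeris(1.0))
position(t) = position(DEFAULT_EPHEMERIS[], t)

end # module

explicit = EphemerisSketch.position(EphemerisSketch.DummyEphemeris(0.5), 2.0)
implicit = EphemerisSketch.position(2.0)
```

Written this way, pattern B is a thin convenience layer over pattern A, which suggests the two need not be mutually exclusive.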

c42f commented Aug 9, 2021

Thanks @helgee, I really appreciate these concrete use cases (despite the slow reply)! I think DataSets is exactly the right kind of package to help with some of these things, though certainly there's some thinking to do and code to write to make it all work smoothly.

Firstly, I think I noted on Zulip that DataSets doesn't have a web data download backend yet so there's some work to do there. Perhaps some of the code from RemoteData/DataDeps could help inform the details.

For your data workflows, I think we can support (2) and (3) already quite nicely. In particular, the current system is quite flexible about mapping dataset names to data storage via the data projects mechanism and JULIA_DATASETS_PATH.

Ironically I think it's workflow (1) which will require some work. For this, there's currently no nice way to define datasets which are distributed via Pkg. But perhaps we can have a builtin data project which is always loaded and which packages can use to register their datasets at load time. Maybe something simple like the following inside the package __init__?

DataSets.register_package_dataset("path/to/package/Data.toml")

Within Data.toml, we might have something like the following:

[[datasets]]
name="EarthOrientation/EOP"
description="IERS Earth orientation parameter data"

    [datasets.storage]
    driver="download"
    type="BlobTree"

    [datasets.storage.cache]
        updates="thursdays"

    [[datasets.storage.files]]
        url="https://datacenter.iers.org/data/csv/finals.all.csv"
        path="finals.csv"

    [[datasets.storage.files]]
        url="https://datacenter.iers.org/data/csv/finals2000A.all.csv"
        path="finals2000A.csv"

I've specified the currently-fictitious "download" storage driver here, but it could also be some other DataSets driver depending on the need. For example, small amounts of static data could just be distributed with the source code.
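
As a rough illustration of how such a driver might read its configuration, the sketch below parses the same TOML with Julia's `TOML` standard library and collects the path/URL pairs. The `"download"` driver and its `cache` and `files` keys remain the hypothetical schema from above; only `TOML.parse` is an existing API.

```julia
using TOML

# The hypothetical Data.toml from above, inlined as a string.
toml = """
[[datasets]]
name="EarthOrientation/EOP"
description="IERS Earth orientation parameter data"

    [datasets.storage]
    driver="download"
    type="BlobTree"

    [datasets.storage.cache]
        updates="thursdays"

    [[datasets.storage.files]]
        url="https://datacenter.iers.org/data/csv/finals.all.csv"
        path="finals.csv"

    [[datasets.storage.files]]
        url="https://datacenter.iers.org/data/csv/finals2000A.all.csv"
        path="finals2000A.csv"
"""

config = TOML.parse(toml)
storage = config["datasets"][1]["storage"]

# Local file name => remote URL, in the order given in the file.
files = [f["path"] => f["url"] for f in storage["files"]]
```

A real driver would then download each URL to its `path` inside a cache directory and honour the `updates` policy; neither step is shown here.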

After that, how do you access the datasets within your package code? If you want type safety and in-memory caching of the data, there will still be a need for OptionalData or equivalent. Though I'd suggest it gets downloaded and loaded on-demand so that there's no need to manually call update()? For a currently-fictitious DataRef, it might look like

# Lazily loaded in-memory cache of EOP_DATA for current Julia session.
# Uses the name "EarthOrientation/EOP" to connect to the data defined in Data.toml
# Can be overridden if the user has a dataset of the same name in JULIA_DATASETS_PATH
const EOP_DATA = DataRef{EOParams}("EarthOrientation/EOP")

function use_data()
    data = EOP_DATA[]  # Internally, uses `dataset("EarthOrientation/EOP")` to read the dataset if not already in memory?
    # do something with `data`
end
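
To make the lazy-loading idea concrete, here is a minimal, self-contained sketch of what such a `DataRef` could look like. It is an assumption for illustration only: `DataRef` does not exist in DataSets.jl, and the `loader` callback stands in for whatever would actually read `dataset(name)`.

```julia
# Hypothetical lazily loaded, typed reference to a named dataset.
mutable struct DataRef{T}
    name::String
    loader::Function            # stand-in for reading `dataset(name)`
    value::Union{Nothing,T}     # `nothing` until the first access
end

# Function-first argument order so the loader can be given as a `do` block.
DataRef{T}(loader, name::AbstractString) where {T} =
    DataRef{T}(String(name), loader, nothing)

# `ref[]` loads the data on first use and returns the cached value afterwards.
function Base.getindex(ref::DataRef{T}) where {T}
    if ref.value === nothing
        ref.value = convert(T, ref.loader(ref.name))
    end
    return ref.value::T
end

const LOAD_COUNT = Ref(0)
const EOP_SKETCH = DataRef{Vector{Float64}}("EarthOrientation/EOP") do name
    LOAD_COUNT[] += 1           # count how often the loader actually runs
    [0.1, 0.2, 0.3]             # placeholder values, not real EOP data
end

first_read = EOP_SKETCH[]
second_read = EOP_SKETCH[]      # served from memory, loader is not called again
```

The type parameter gives the same type safety that OptionalData.jl provides today, while the first `ref[]` call replaces the manual `update()` step.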

About your data dependencies with the EarthOrientation.jl vs Astrodynamics.jl approaches... these both seem like valid patterns for their own use cases. DataSets can enable various data to be swapped in depending on the environment, though, so it could be valid to have position(t, ...) defined in AstroBase.jl, but referencing a dataset via a name which isn't defined in AstroBase. Then have Astrodynamics provide that dataset name? I'm not sure this makes sense though!
