Convention for persistent per-package data location? #777

Closed
tkf opened this issue Sep 25, 2018 · 15 comments

Comments

@tkf
Member

tkf commented Sep 25, 2018

It looks like Conda.jl (JuliaPy/Conda.jl#123) and DataDeps.jl (oxinabox/DataDeps.jl#48) need a stable location that persists across package versions. (I'm less certain about DataDeps.jl; @oxinabox please correct me if I'm missing something.) Furthermore, it is sometimes useful to share such a location even across different Julia versions to save disk space. I think it would then make sense to document a convention for a directory that each package can use without worrying about name clashes or cluttering the ~/.julia directory.

For example, how about letting the package whose UUID is $UUID use $(DEPOT_PATH[i])/data/$UUID? (e.g., ~/.julia/data/8f4d0f93-b110-5947-807f-2305c1781a2d for Conda.jl) (Question: should it be writable only when i==1?)
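
A minimal sketch of the proposed layout, for illustration only. The helper name `package_data_dir` and its `depot` keyword are hypothetical and not part of Pkg; only `DEPOT_PATH` and the `UUIDs` standard library are real.

```julia
using UUIDs

# Hypothetical helper (not a Pkg API): data directory for the package with
# the given UUID inside one depot, following the proposed convention.
function package_data_dir(uuid::UUID; depot::AbstractString = first(DEPOT_PATH))
    return joinpath(depot, "data", string(uuid))
end

# UUID of Conda.jl, as quoted in the proposal.
conda_uuid = UUID("8f4d0f93-b110-5947-807f-2305c1781a2d")
conda_data = package_data_dir(conda_uuid; depot = joinpath(homedir(), ".julia"))
```

Passing `depot` explicitly leaves open the question of which entries of `DEPOT_PATH` should be writable.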

cc: @stevengj @oxinabox @Evizero

@oxinabox
Contributor

oxinabox commented Sep 25, 2018

Yes.
I think the current DataDeps strategy of keeping package-private data in the package's directory is suboptimal.

Though it is assumed that package-private storage is only used when there is a name clash.
DataDeps generally assumes name clashes almost never occur, except when packages are actually talking about the same data.

Solving this with a project directory is also fine, until you have a name clash within a given project.

@tkf
Member Author

tkf commented Sep 26, 2018

Just to be clear, by "per-package data location", I mean "for Conda.jl" and "for DataDeps.jl", not "for each package using DataDeps.jl". So I guess it would mean something like DATADEPS_LOAD_PATH defaulting to ~/.julia/data/124859b0-ceae-595e-8997-d05f6a7a8dfe/ (124859b0-ceae-595e-8997-d05f6a7a8dfe is the UUID for DataDeps.jl) instead of ~/.julia/datadeps or something like that.

@oxinabox
Contributor

Ah, ok then.
No, DataDeps does not want that.
It has some opinions about where data should be stored,
and defaults to match (http://white.ucc.asn.au/DataDeps.jl/latest/z10-for-end-users.html#Moving-Data-1)
e.g. on Linux, data shared between users is stored in /usr/share/datadeps by default.
And it makes that configurable.

In general, I am of the opinion that when throwing around multiple gigabyte folders,
users probably should find them somewhere expected.
And that to me intrinsically excludes anything that includes 32 digit hex string anywhere in its filename.

Conda is different.
Conda doesn't have data in the data-science sense.
It has an environment.
And it really should be one Conda environment per Julia environment.

You actively do not want the Conda environments bleeding over between julia environments.

@StefanKarpinski
Member

StefanKarpinski commented Sep 26, 2018

I agree with @oxinabox. I think that we need a mechanism for sharing read-only data in a content-addressable fashion. For writable data, it should be isolated and per-environment, ideally recorded in the project file somewhere or in a separate configuration file.
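
To illustrate what "content-addressable" could mean here, a minimal sketch under stated assumptions: the helper `store_blob` and the flat directory layout are hypothetical and do not describe any existing Pkg mechanism. The file name is derived from the SHA-256 hash of the content, so identical data is stored once and can be shared read-only.

```julia
using SHA

# Hypothetical helper: store `data` under a name derived from its SHA-256
# hash. Identical content always maps to the same path.
function store_blob(root::AbstractString, data::Vector{UInt8})
    path = joinpath(root, bytes2hex(sha256(data)))
    if !isfile(path)
        mkpath(root)
        write(path, data)
        chmod(path, 0o444)  # mark the stored file read-only
    end
    return path
end
```

Because the path depends only on the content, two packages that request the same data end up pointing at the same file.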

@tkf
Member Author

tkf commented Sep 26, 2018

> And that to me intrinsically excludes anything that includes 32 digit hex string anywhere in its filename.

That's not the point of my suggestion. Use of a UUID was just an example. It can be $(DEPOT_PATH[i])/data/$(PACKAGE_NAME) (e.g., ~/.julia/data/DataDeps) or even $(DEPOT_PATH[i])/$(lowercase(PACKAGE_NAME)) which is what DataDeps.jl is doing. But I thought it's better to dedicate the root level $(DEPOT_PATH[i]) to Pkg.jl and julia to make it forward-compatible.

> For writable data, it should be isolated and per-environment.

You need de-duplication sometimes, exactly like Pkg3 is doing. For Conda.jl, it is sub-optimal to install a full Miniconda for each environment (JuliaPy/Conda.jl#123 (comment)). A Miniconda installation (the so-called "base" environment) has a package cache to avoid downloading each package every time you create a new environment. IIRC it also uses hard links for package installations to save disk space. You can make use of those efforts of conda only if you use a shared Miniconda installation.

@stevengj
Member

stevengj commented Oct 8, 2018

It's not 100% clear to me how to get a reasonable path for the current environment in Pkg.build or similar (to use for e.g. a miniconda installation or a conda virtualenv for that environment).

There doesn't seem to be any documented API in Julia 1.0 to get the environment of a package. The best I can come up with (based on Base.pathof and Base.locate_package) is:

function projectof(m::Module)
    pkg = get(Base.module_keys, m, nothing)
    pkg === nothing && return nothing
    if pkg.uuid === nothing
        for env in Base.load_path()
            Base.project_deps_get(env, pkg.name) !== false && return env
        end
    else
        for env in Base.load_path()
            Base.manifest_uuid_path(env, pkg) !== nothing && return env
        end
    end
    return nothing
end

This returns the project file or path in Base.load_path() that was used to load a given module. This can't be done in Pkg.build because the module isn't imported yet, but it could be used in Conda at precompile-time or at run-time.

To get a unique name from the environment path, I suppose we could use bytes2hex(SHA.sha256(projectof(MyModule))), though that isn't particularly human-readable.
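
A minimal sketch of that hashing idea. The helper name `env_slug` is hypothetical; `sha256` and `bytes2hex` are the real functions from the SHA standard library and Base. The path is passed through `abspath` first, which is an added assumption so that relative and absolute spellings of the same path hash identically (symlinked paths would still differ).

```julia
using SHA

# Hypothetical helper: derive a directory-safe identifier from the path of
# an environment (for example the value returned by `projectof` above).
env_slug(project_path::AbstractString) = bytes2hex(sha256(abspath(project_path)))

slug = env_slug(joinpath(homedir(), ".julia", "environments", "v1.0", "Project.toml"))
```

The result is always 64 hexadecimal characters, which is unique enough but, as noted, not human-readable.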

@stevengj
Member

stevengj commented Oct 8, 2018

It's also not clear to me how to achieve the reproducibility goals of a Project.toml file with a Conda.jl installation, since that is a completely foreign (to Pkg) system of dependencies that the user could have mucked with arbitrarily (by using Conda and then doing Conda.add etcetera).

@tkf
Member Author

tkf commented Oct 9, 2018

Here is the Pkg API and usage in Conda.jl I have in mind. The logic below should be implementable if we use package name or UUID instead of m::Module in Hypothetical.Pkg. Both of them can be hard-coded in the build script.

module Hypothetical

module Pkg

""" Hypothetical package data API. """
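# `nameof(m)` yields the bare module name (e.g. `Conda`); `string(m)` would
# include parent modules (e.g. `Main.Hypothetical.Conda`).
datapath(m::Module) = joinpath(DEPOT_PATH[1], "data", string(nameof(m)))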

""" Hypothetical package option API. """
function options(m::Module)
    if endswith(string(m), "Conda")
        return Dict(:private_env => rand() < 0.5)
    end
    return Dict()
end

end  # module


module Conda
using ..Pkg

""" Return `~/.julia/data/Conda/envs/\$env`. """
prefix(env) =
    joinpath(Pkg.datapath(Conda), "envs", string(env))

"""
Get `conda` executable from the base miniconda installation.
Return `~/.julia/data/Conda/base/bin/conda`.
"""
conda_cmd() = joinpath(Pkg.datapath(Conda), "base", "bin", "conda")

# there is probably a better way to do this...
is_named_env() = startswith(Base.active_project(),
                            joinpath(DEPOT_PATH[1], "environments"))

""" Get `prefix` for the current conda environment. """
function current_prefix()
    if get(Pkg.options(Conda), :private_env, false)
        if is_named_env()
            # Use Conda environment ~/.julia/data/Conda/envs/$ENV for
            # Julia environment ~/.julia/environments/$ENV if package
            # option `private_env` is set to `true` for this
            # environment:
            name = basename(dirname(Base.active_project()))
            return prefix(name)
        else
            # For environment outside DEPOT_PATH[1], use
            # $(dirname(Base.active_project()))/.condajl or something
            # (or maybe make it configurable)?
            return joinpath(dirname(Base.active_project()), ".condajl")
        end
    else
        # Otherwise, default to environment per-Julia installation:
        return prefix("v$(VERSION.major).$(VERSION.minor)")
        # return prefix("v$(VERSION.major)")  # maybe?
    end
end

end  # module
end  # module

> To get a unique name from the environment path

I suggest something like current_prefix above. For non-named environment, I think the simplest solution would be to put a hidden directory in dirname(Base.active_project()).

> achieve the reproducibility goals of a Project.toml file with a Conda.jl installation

I guess you can use conda list --export and dump it to where Manifest.toml is (e.g., ~/.julia/environments/v1.0/conda-packages.txt)? Or maybe this should be in the data directory too, like ~/.julia/data/Conda/envs/v1.0/conda-packages.txt, if Pkg demands that ~/.julia/environments should be used only by Pkg?
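
A rough sketch of that dump step, assuming a `conda` executable and an environment prefix are already known. The helper names `export_cmd` and `dump_conda_packages` are hypothetical; `conda list --prefix ... --export` is the existing conda command line.

```julia
# Hypothetical helper: build the command that lists the packages of the
# conda environment at `prefix` in export format.
export_cmd(conda::AbstractString, prefix::AbstractString) =
    `$conda list --prefix $prefix --export`

# Hypothetical helper: write the export to `conda-packages.txt` inside
# `envdir`, e.g. the directory that holds Manifest.toml.
function dump_conda_packages(conda, prefix, envdir)
    target = joinpath(envdir, "conda-packages.txt")
    run(pipeline(export_cmd(conda, prefix), stdout = target))
    return target
end
```

Whether `envdir` should be the Julia environment directory or the package data directory is exactly the open question raised above.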

@KristofferC
Member

> It's also not clear to me how to achieve the reproducibility goals of a Project.toml file with a Conda.jl installation, since that is a completely foreign (to Pkg) system of dependencies that the user could have mucked with arbitrarily (by using Conda and then doing Conda.add etcetera).

Indeed, packages with mutable state (like a package wrapping another package manager) break the assumption that the content hash is enough to reproduce the content.

@stevengj
Member

stevengj commented Oct 9, 2018

> I suggest something like current_prefix above.

I don't understand why basename(dirname(Base.active_project())) is a good choice. Why would that be unique to the environment that a package (e.g. Conda) was loaded from?

> I think the simplest solution would be to put a hidden directory in dirname(Base.active_project()).

Is this directory even guaranteed to be writable?

@tkf
Member Author

tkf commented Oct 9, 2018

> Why would that be unique to the environment that a package (e.g. Conda) was loaded from?

It is unique because is_named_env() is true, i.e., we map only the environments of the form ~/.julia/environments/$ENV to ~/.julia/data/Conda/envs/$ENV.

(But I guess current_prefix() has to be extended to look for the conda environment in $(DEPOT_PATH[i])/data/Conda/envs/$ENV when the Julia environment is $(DEPOT_PATH[i])/environments/$ENV. is_named_env() was probably not the best function name.)

> Is this directory even guaranteed to be writable?

When you set the package option private_env, you presumably write it to Project.toml so its directory is likely to be writable. Of course, it does not cover the case that the user makes the directory non-writable after creating Project.toml and Manifest.toml. I'm not sure if this edge case is worth supporting.
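
If the edge case did need handling, one option is to probe the directory before using it. This is only a sketch; the helper name `can_write_to` is hypothetical, and it relies on `mktemp(f, parent)` from Base, which creates a temporary file in `parent` and removes it afterwards.

```julia
# Hypothetical helper: report whether files can be created in `dir` by
# actually creating (and automatically removing) a temporary file there.
function can_write_to(dir::AbstractString)
    isdir(dir) || return false
    try
        mktemp(dir) do path, io
            # Nothing to write; successfully creating the file is the test.
        end
        return true
    catch
        return false
    end
end
```

A caller could fall back to a location under `DEPOT_PATH[1]` when the probe returns `false`.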

@tkf
Member Author

tkf commented Oct 11, 2018

Depending on the API, Pkg.package_state_dir proposed in #796 could be used as the persistent data location.

@tkf
Member Author

tkf commented Oct 11, 2018

I'm closing it. It looks like ~/.julia/$(lowercase(nameof(Module))) is a reasonable choice. JuliaPy/Conda.jl#123 (comment)
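
For reference, a minimal sketch of that convention. The helper name `package_dir` and its `depot` keyword are hypothetical. Note that `nameof` returns a `Symbol`, so it has to be converted with `string` before `lowercase` can be applied.

```julia
# Hypothetical helper: per-package directory at the root of a depot, named
# after the lowercased module name (e.g. ~/.julia/conda for module Conda).
package_dir(m::Module; depot::AbstractString = first(DEPOT_PATH)) =
    joinpath(depot, lowercase(string(nameof(m))))

base_dir = package_dir(Base; depot = joinpath(homedir(), ".julia"))
```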

@tkf tkf closed this as completed Oct 11, 2018
@stevengj
Member

In the longer run, we still need something tied to the environment and package options, but getting package options working seems like a higher priority to me.

@tkf
Member Author

tkf commented Oct 11, 2018

> something tied to the environment and package options

@stevengj See #796 (comment)
