Pre-fetching jars in docker environment fails to populate classpath #1265

jpolchlo · 2023-09-21T17:17:14Z

I want to build a docker environment where I can pre-load the classpath with spark-sql and some other stuff to avoid boilerplate in my notebooks. So I built the following Dockerfile:

FROM almondsh/almond:0.14.0-RC12-scala-2.12.18

RUN coursier fetch org.apache.logging.log4j:log4j-core:2.17.0
RUN coursier fetch org.apache.logging.log4j:log4j-1.2-api:2.17.0
RUN coursier fetch org.apache.spark::spark-sql:3.1.2

However, upon running this container, running import org.apache.spark.sql._ yields an error:

cell1.sc:1: object apache is not a member of package org
import org.apache.spark.sql._
           ^
Compilation Failed

What step am I missing to get Almond to recognize the coursier-installed jars?


kiendang · 2023-09-22T01:11:11Z

Almond uses a separate cache directory. By default, coursier fetch downloads artifacts to .cache/coursier (on Linux). Try to find where Almond stores its cache; if I remember correctly it's .cache/almond/coursier. Then you can run coursier fetch --cache <almond-coursier-cache-dir> ....

jpolchlo · 2023-09-22T14:03:22Z

That doesn't appear to be the case. Both methods (import from notebook and coursier fetch) place the jar files in the ~/.cache/coursier tree. However, there is a file ~/.cache/almond/ammonite/history that appears to track the notebook imports. The contents after executing

import $ivy.`org.apache.logging.log4j:log4j-core:2.17.0`

are

[
    "import $ivy.`org.apache.logging.log4j:log4j-core:2.17.0`"
]

I'm thinking that the way to pre-load is to provide a notebook with the desired inputs and run it through jupyter during the docker build. In-notebook imports appear to create some state that coursier fetch does not replicate.

Edit:
I've been able to preload the container with jars using jupyter execute ... on a notebook containing import $ivy... directives. It appears that the import statements in the notebook are still required to register the imported modules in the current context. However, the jar files are now present, and it's not necessary to wait for the maven downloads.
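
For anyone wanting to reproduce the preload step described in the edit above, here is a minimal sketch of generating such a notebook from a shell script. The file name preload.ipynb and the kernelspec name "scala" are assumptions, not something confirmed in this thread; check jupyter kernelspec list in your image for the actual kernel name.

```shell
#!/bin/sh
# Sketch: write a one-cell notebook whose only job is to trigger `import $ivy`.
# NOTEBOOK and the kernelspec name "scala" are assumptions for illustration.
set -eu

NOTEBOOK="${NOTEBOOK:-preload.ipynb}"

# The quoted heredoc delimiter keeps $ivy and the backticks literal.
cat > "$NOTEBOOK" <<'EOF'
{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import $ivy.`org.apache.logging.log4j:log4j-core:2.17.0`"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Scala",
      "language": "scala",
      "name": "scala"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 4
}
EOF

echo "wrote $NOTEBOOK"
# During the image build, the notebook would then be run with something like:
#   jupyter execute "$NOTEBOOK"
```

As noted in the edit above, this only warms the download cache; notebooks still need their own import $ivy lines to put the modules on the classpath.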

coreyoconnor · 2024-05-04T23:08:39Z

Hmm, I did not observe this with the Docker image I'm using. However, I'm setting

ENV COURSIER_CACHE=/usr/share/coursier/cache

in the Dockerfile. Does that affect the coursier cache even for the notebook session?

https://github.com/coreyoconnor/nix_configs/blob/dev/modules/ufo-k8s/almond-2/Dockerfile

coreyoconnor · 2024-05-06T15:16:17Z

After further testing: yes, setting ENV COURSIER_CACHE pre-populates the cache as expected.
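
A rough sketch of the same idea as a shell step, assuming the coursier launcher is on PATH as in the original Dockerfile. The cache path is the one from my Dockerfile above and is only an example. Inside a Dockerfile the variable should be set with ENV rather than export, so that it is still set when the notebook kernel runs.

```shell
#!/bin/sh
# Sketch: point coursier at a shared cache directory before fetching.
# The default path below is an example taken from this thread, not a requirement.
set -eu

export COURSIER_CACHE="${COURSIER_CACHE:-/usr/share/coursier/cache}"
echo "coursier cache: $COURSIER_CACHE"

if command -v coursier >/dev/null 2>&1; then
  # Same coordinates as in the original report.
  coursier fetch org.apache.logging.log4j:log4j-core:2.17.0
  coursier fetch org.apache.logging.log4j:log4j-1.2-api:2.17.0
  coursier fetch org.apache.spark::spark-sql:3.1.2
else
  echo "coursier not found on PATH; skipping fetch" >&2
fi
```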
