Ganymede

Ganymede: Jupyter Notebook Java Kernel

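The Ganymede Kernel is a Jupyter Notebook Java kernel. Java code is compiled and interpreted with the Java Shell tool, JShell. Beyond the Java REPL, the kernel offers additional features, including cell magics, Maven-based dependency management, and template evaluation, all described under Features and Usage.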

Installation

The Ganymede Kernel is distributed as a single JAR (download here).

⚠️ Only Jupyter Notebook versions before 7 (<7) are fully supported at this time. See the Pipfile in ganymede-notebooks for a minimal Python configuration.

Java 11 or later is required. In addition to Java, the Jupyter Notebook must be installed first and the jupyter and python commands must be on the ${PATH}. The typical (and minimal) installation command line is:

$ java -jar ganymede-2.1.2.20230910.jar -i

The kernel will be configured to use the same java installation as invoked in the install command above. These additional command line options are supported.

| Option | Action | Default |
| --- | --- | --- |
| --id-prefix=<prefix> | Adds prefix to kernel ID | <none> |
| --id=<id> | Specifies kernel ID | ganymede-${version}-java-${java.specification.version} |
| --id-suffix=<suffix> | Adds suffix to kernel ID | <none> |
| --display-name-prefix=<prefix> | Adds prefix to kernel display name | <none> |
| --display-name=<name> | Specifies kernel display name | Ganymede ${version} (Java ${java.specification.version}) |
| --display-name-suffix=<suffix> | Adds suffix to kernel display name | <none> |
| --env | Specify NAME=VALUE pair(s) to add to kernel environment | |
| --copy-jar=<boolean> | Copies the Ganymede Kernel JAR to the kernelspec directory | true |
| --sys-prefix or --user | Install in the system prefix or user path (see the jupyter kernelspec install command). | --user |

The following Java system properties may be configured.

| System Property | Action | Default(s) |
| --- | --- | --- |
| maven.repo.local | Configures the local Maven repository | --sys-prefix: ${jupyter.data}/repository/; --user: ${user.home}/.m2/ |

The following OS environment variables may be configured:

| Environment Variable | Option | Action |
| --- | --- | --- |
| SPARK_HOME | --spark-home=<path> | If configured, the kernel will add the Apache Spark JARs to the kernel's classpath. |
| HIVE_HOME | --hive-home=<path> | If configured, the kernel will add the Apache Hive JARs to the kernel's classpath. |

For example, a sophisticated configuration to test a snapshot out of a user's local Maven repository:

$ export JAVA_HOME=$(/usr/libexec/java_home -v 11)
$ ${JAVA_HOME}/bin/java \
      -jar ${HOME}/.m2/repository/dev/hcf/ganymede/ganymede/2.2.0-SNAPSHOT/ganymede-2.2.0-SNAPSHOT.jar \
      -i --sys-prefix --copy-jar=false \
      --id-suffix=spark-3.3.4 --display-name-suffix="with Spark 3.3.4" \
      --spark-home=/path/to/spark-home --hive-home=/path/to/hive-home
$ jupyter kernelspec list
Available kernels:
...
  ganymede-2.2.0-java-11-spark-3.3.4             /.../share/jupyter/kernels/ganymede-2.2.0-java-11-spark-3.3.4
...

would result in the configured ${jupyter.data}/kernels/ganymede-2.2.0-java-11-spark-3.3.4/kernel.json kernelspec:

{
  "argv": [
    "/Library/Java/JavaVirtualMachines/graalvm-ce-java11-22.3.0/Contents/Home/bin/java",
    "--add-opens",
    "java.base/jdk.internal.misc=ALL-UNNAMED",
    "--illegal-access=permit",
    "-Djava.awt.headless=true",
    "-Djdk.disableLastUsageTracking=true",
    "-Dmaven.repo.local=/Users/ball/Notebooks/.venv/share/jupyter/repository",
    "-jar",
    "/Users/ball/.m2/repository/dev/hcf/ganymede/ganymede/2.2.0-SNAPSHOT/ganymede-2.2.0-SNAPSHOT.jar",
    "-f",
    "{connection_file}"
  ],
  "display_name": "Ganymede 2.2.0 (Java 11) with Spark 3.3.4",
  "env": {
    "JUPYTER_CONFIG_DIR": "/Users/ball/.jupyter",
    "JUPYTER_CONFIG_PATH": "/Users/ball/.jupyter:/Users/ball/Notebooks/.venv/etc/jupyter:/usr/local/etc/jupyter:/etc/jupyter",
    "JUPYTER_DATA_DIR": "/Users/ball/Library/Jupyter",
    "JUPYTER_RUNTIME_DIR": "/Users/ball/Library/Jupyter/runtime",
    "SPARK_HOME": "/path/to/spark-home"
  },
  "interrupt_mode": "message",
  "language": "java"
}
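
The example above suggests how the prefix and suffix options combine with the defaults: the kernel ID parts appear to be joined with hyphens and the display name parts with spaces. The following is a minimal sketch under that assumption; the KernelNames class and its methods are hypothetical and are not part of Ganymede.

```java
import java.util.Objects;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper (not Ganymede's code): composes names from an optional
// prefix, a base name, and an optional suffix, skipping parts that are unset.
class KernelNames {
    static String join(String separator, String prefix, String base, String suffix) {
        return Stream.of(prefix, base, suffix)
            .filter(Objects::nonNull)
            .filter(part -> !part.isBlank())
            .collect(Collectors.joining(separator));
    }

    // Default ID pattern from the options table:
    // ganymede-${version}-java-${java.specification.version}
    static String kernelId(String prefix, String version, String javaVersion, String suffix) {
        return join("-", prefix, "ganymede-" + version + "-java-" + javaVersion, suffix);
    }

    // Default display name pattern from the options table:
    // Ganymede ${version} (Java ${java.specification.version})
    static String displayName(String prefix, String version, String javaVersion, String suffix) {
        return join(" ", prefix, "Ganymede " + version + " (Java " + javaVersion + ")", suffix);
    }
}
```

With these assumptions, KernelNames.kernelId(null, "2.2.0", "11", "spark-3.3.4") reproduces the kernel ID listed by jupyter kernelspec list in the example.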

The kernel makes extensive use of templates and POM fragments. While not strictly required, the authors suggest that the Hide Input extension be enabled so notebook authors can hide the input templates and POMs in any finished product. It may be enabled from the command line with:

$ jupyter nbextension enable hide_input/main --sys-prefix

(or --user as appropriate).

Features and Usage

The following subsections outline many of the features of the kernel.

Java

The Java REPL is JShell and has all the Java features of the installed JVM. The minimum required Java version is 11 and subsequent versions are supported.

The JShell environment includes builtin functions implemented through methods that wrap the public methods defined in the NotebookContext class and annotated with @NotebookFunction. These functions include:

| Method | Description |
| --- | --- |
| print(Object) | Render the Object to a Notebook format |
| display(Object) | Render the Object to a Notebook format |
| asJson(Object) | Convert argument to JsonNode |
| asYaml(Object) | Convert argument to YAML (String) |

The builtin functions are mostly concerned with "printing" or displaying (rendering) Objects to multimedia formats. For example, print(byte[]) will render the byte array as an image. Renderers for chart and plot objects are also integrated; the trig.ipynb notebook demonstrates rendering of an XChart.
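
As a rough illustration of how annotated methods can be discovered and exposed as functions, the sketch below scans a class for public methods carrying an annotation. The NotebookFunction annotation declared here is a stand-in for the kernel's own, and DemoContext and FunctionScanner are hypothetical; the kernel's actual wrapper generation is not shown in this document.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Stand-in for the kernel's annotation (assumption: retained at runtime).
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface NotebookFunction { }

// Hypothetical context class: two exposed methods and one that is not exposed.
class DemoContext {
    @NotebookFunction public void print(Object object) { System.out.println(object); }
    @NotebookFunction public String asYaml(Object object) { return String.valueOf(object); }
    public void internal() { }
}

class FunctionScanner {
    // Returns the sorted names of the public methods annotated with @NotebookFunction.
    // Class.getMethods() only returns public methods, including inherited ones.
    static List<String> functionNames(Class<?> type) {
        return Arrays.stream(type.getMethods())
            .filter(method -> method.isAnnotationPresent(NotebookFunction.class))
            .map(Method::getName)
            .sorted()
            .collect(Collectors.toList());
    }
}
```

Each discovered method could then be wrapped by a same-named function that delegates to the context instance.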

As discussed in the next section, the magic identifier for java is %%java. A cell identified with %%java with no code will provide a table of variable bindings in the context with types and values. The types are links to the corresponding javadoc (if known).

| Name | Type | Value |
| --- | --- | --- |
| $$ | ganymede.notebook.NotebookContext | NotebookContext(super=ganymede.notebook.NotebookContext@af7e376) |
| by_state | org.apache.spark.sql.Dataset<Row> | [Country/Region: string, Province/State: string ... 1 more field] |
| chart | org.knowm.xchart.PieChart | org.knowm.xchart.PieChart@767f4a69 |
| countries_aggregated | org.apache.spark.sql.Dataset<Row> | [Date: date, Country: string ... 3 more fields] |
| dates | org.apache.spark.sql.Dataset<Row> | [Date: date] |
| interval | org.apache.spark.sql.Row | [2020-01-22,2022-04-16] |
| key_countries_pivoted | org.apache.spark.sql.Dataset<Row> | [Date: date, China: int ... 7 more fields] |
| reader | org.apache.spark.sql.DataFrameReader | org.apache.spark.sql.DataFrameReader@5a88849 |
| reference | org.apache.spark.sql.Dataset<Row> | [UID: int, iso2: string ... 10 more fields] |
| session | org.apache.spark.sql.SparkSession | org.apache.spark.sql.SparkSession@1b6683c4 |
| snapshot | org.apache.spark.sql.Dataset<Row> | [Country/Region: string, Deaths: int] |
| time_series_19_covid_combined | org.apache.spark.sql.Dataset<Row> | [Date: date, Country/Region: string ... 4 more fields] |
| us_confirmed | org.apache.spark.sql.Dataset<Row> | [Admin2: string, Date: date ... 3 more fields] |
| us_deaths | org.apache.spark.sql.Dataset<Row> | [Admin2: string, Date: date ... 3 more fields] |
| us_simplified | org.apache.spark.sql.Dataset<Row> | [Date: date, Admin2: string ... 4 more fields] |
| worldwide_aggregate | org.apache.spark.sql.Dataset<Row> | [Date: date, Confirmed: int ... 3 more fields] |

Magics

Cell magic commands are identified by %% starting the first line of a code cell. The list of available magic commands is shown below. The default cell magic is java.

| Name(s) | Description |
| --- | --- |
| !, script | Execute script with the argument command |
| bash | Execute script with 'bash' command |
| classpath | Add to or print JShell classpath |
| env | Add/Update or print the environment |
| freemarker | FreeMarker template evaluator |
| groovy | Execute code in groovy REPL |
| html | HTML template evaluator |
| java | Execute code in Java REPL |
| javascript, js | Execute code in javascript REPL |
| kotlin | Execute code in kotlin REPL |
| magics | Lists available cell magics |
| markdown | Markdown template evaluator |
| mustache, handlebars | Mustache template evaluator |
| perl | Execute script with 'perl' command |
| pom | Define the Notebook's Project Object Model |
| ruby | Execute script with 'ruby' command |
| scala | Execute code in scala REPL |
| sh | Execute script with 'sh' command |
| spark-session | Configure and start a Spark session |
| sql | Execute code in SQL REPL |
| thymeleaf | Thymeleaf template evaluator |
| velocity | Velocity template evaluator |

script, bash, perl, etc. are executed by creating a Process instance. groovy, javascript, kotlin, etc. are provided through their respective JSR 223 interfaces.[3] Dependency and classpath management are provided by the classpath and pom magics and are described in detail in a subsequent subsection. thymeleaf and html provide Thymeleaf template evaluation.

The kernel does not implement any "line" magics.
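
To make the cell magic convention concrete, here is a minimal sketch of splitting a cell into a magic name, its arguments, and the remaining code, falling back to java when the first line does not start with %%. MagicLine is a hypothetical class, and the whitespace-based argument splitting and the handling of the ! alias are assumptions rather than the kernel's actual parser.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical parser (not Ganymede's implementation) for a cell's magic line.
class MagicLine {
    final String name;
    final List<String> arguments;
    final String code;

    private MagicLine(String name, List<String> arguments, String code) {
        this.name = name;
        this.arguments = arguments;
        this.code = code;
    }

    static MagicLine parse(String cell) {
        // No "%%" on the first line: the whole cell is code for the default magic.
        if (!cell.startsWith("%%")) {
            return new MagicLine("java", List.of(), cell);
        }
        // Split the first line from the rest of the cell on any line terminator.
        String[] lines = cell.split("\\R", 2);
        String header = lines[0].substring(2).trim();
        // Assumption: "%%!bash" means the "!" (script) magic with argument "bash".
        if (header.startsWith("!")) {
            header = "! " + header.substring(1);
        }
        String[] tokens = header.trim().split("\\s+");
        // Assumption: a bare "%%" selects the default magic.
        String name = tokens[0].isEmpty() ? "java" : tokens[0];
        List<String> arguments = Arrays.asList(tokens).subList(1, tokens.length);
        String code = (lines.length > 1) ? lines[1] : "";
        return new MagicLine(name, arguments, code);
    }
}
```

Under these assumptions, a cell beginning with %%sql --no-print yields the name sql with the single argument --no-print, and a cell with no magic line is passed to java unchanged.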

Dependency and Classpath Management

The classpath magic adds JAR and directory paths to the JShell classpath. The pom magic resolves and downloads Maven artifacts and then adds those artifacts to the classpath.

The trig.ipynb notebook demonstrates the use of the pom magic to resolve the org.knowm.xchart:xchart:LATEST artifact and its transitive dependencies.

%%pom
dependencies:
- org.knowm.xchart:xchart:LATEST

The POM is expressed in YAML and may declare repositories and dependencies. The Notebook's POM may be split across multiple cells: each repository and dependency is added or merged, and dependency resolution is attempted whenever a pom cell is executed. The default/initial Notebook POM is:

repositories:
  - id: central
    layout: default
    url: https://repo1.maven.org/maven2
    snapshots:
      enabled: false

Dependencies may either be expressed in "expanded" YAML or in groupId:artifactId[:extension]:version format:

dependencies:
  - groupId: groupA
    artifactId: groupAartifact1
    version: 1.0
  - groupB:groupB-artifact2:2.0
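
The compact form can be illustrated with a small parser. This is only a sketch: Coordinates is a hypothetical value class (the kernel relies on the Apache Maven Artifact Resolver types), and the jar default for a missing extension is an assumption that follows the usual Maven convention.

```java
// Hypothetical value class for illustration only.
class Coordinates {
    final String groupId;
    final String artifactId;
    final String extension;
    final String version;

    private Coordinates(String groupId, String artifactId, String extension, String version) {
        this.groupId = groupId;
        this.artifactId = artifactId;
        this.extension = extension;
        this.version = version;
    }

    // Parses groupId:artifactId[:extension]:version.
    // Assumption: the extension defaults to "jar" when it is omitted.
    static Coordinates parse(String spec) {
        String[] parts = spec.trim().split(":");
        switch (parts.length) {
        case 3:
            return new Coordinates(parts[0], parts[1], "jar", parts[2]);
        case 4:
            return new Coordinates(parts[0], parts[1], parts[2], parts[3]);
        default:
            throw new IllegalArgumentException("Unrecognized coordinates: " + spec);
        }
    }
}
```

For instance, Coordinates.parse("groupB:groupB-artifact2:2.0") yields the same fields as the expanded YAML form with groupId groupB, artifactId groupB-artifact2, and version 2.0.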

The specific attributes for repositories and dependencies are defined by the Apache Maven Artifact Resolver classes RemoteRepository (with RepositoryPolicy) and Dependency. (Note that these classes are slightly different than their Maven settings counterparts.)

Whenever a JAR is added to the classpath, the kernel attempts to determine its Maven coordinates and, if successful, adds the JAR as an artifact to the resolver. The following checks are made before adding the JAR to the JShell classpath:

  1. It is a new, unique path

  2. No previously resolved artifact with the same groupId:artifactId on the classpath

  3. Special heuristics for logging configuration:

    a. Ignore commons-logging:commons-logging:jar

    b. Allow only one of org.slf4j:jcl-over-slf4j:jar or org.springframework:spring-jcl:jar to be configured

    c. Allow only one of org.slf4j:slf4j-log4j12:jar and ch.qos.logback:logback-classic:jar to be configured

Artifacts that fail any of the above checks will be (mostly silently) ignored. Because only the first version of a resolved artifact is ever added to the classpath, the kernel must be restarted if a different version of the same artifact is specified for the change to take effect.
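
The checks above can be sketched as a small filter. ClasspathFilter is hypothetical and simplified: it identifies artifacts by a groupId:artifactId key only (ignoring the extension) and takes a null key to mean the coordinates could not be determined; the kernel's real bookkeeping is not shown in this document.

```java
import java.nio.file.Path;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical filter illustrating the checks; not Ganymede's implementation.
class ClasspathFilter {
    private static final Set<String> IGNORED = Set.of("commons-logging:commons-logging");
    // Within each group, only one member may be on the classpath.
    private static final List<Set<String>> EXCLUSIVE =
        List.of(Set.of("org.slf4j:jcl-over-slf4j", "org.springframework:spring-jcl"),
                Set.of("org.slf4j:slf4j-log4j12", "ch.qos.logback:logback-classic"));

    private final Set<Path> paths = new HashSet<>();
    private final Set<String> resolved = new HashSet<>();

    // Returns true if the JAR should be added to the classpath.
    // key is "groupId:artifactId", or null if the coordinates are unknown.
    boolean accept(Path path, String key) {
        Path normalized = path.toAbsolutePath().normalize();
        if (paths.contains(normalized)) {
            return false;                       // check 1: not a new, unique path
        }
        if (key != null) {
            if (resolved.contains(key)) {
                return false;                   // check 2: artifact already resolved
            }
            if (IGNORED.contains(key)) {
                return false;                   // check 3a: ignore commons-logging
            }
            for (Set<String> group : EXCLUSIVE) {
                if (group.contains(key) && !Collections.disjoint(group, resolved)) {
                    return false;               // checks 3b and 3c: alternative already present
                }
            }
            resolved.add(key);
        }
        paths.add(normalized);
        return true;
    }
}
```

Because the first accepted version wins in this sketch, a second JAR with the same key is rejected, which mirrors the need to restart the kernel to switch versions.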

Finally, the kernel provides special processing to add artifacts from Apache Spark binary distributions. The dependencies for Spark SQL and the corresponding Scala compiler artifacts for currently available Spark binary distributions are bundled as resources. The kernel searches ${SPARK_HOME} for JARs for which it has the corresponding dependencies and then resolves the dependencies from the ${SPARK_HOME} hierarchy with the heuristics described above.

SQL

The SQL Magic provides the client interface to database servers through JDBC and jOOQ. Its usage is as follows:

    Usage: sql [--[no-]print] [<url>] [<username>] [<password>]
          [<url>]        JDBC Connection URL
          [<username>]   JDBC Connection Username
          [<password>]   JDBC Connection Password
          --[no-]print   Print query results.  true by default

For example:

%%sql jdbc:mysql://127.0.0.1:33061/epg?serverTimezone=UTC
SELECT * FROM schedules LIMIT 3;

| airDateTime | stationID | json | duration | md5 | programID |
| --- | --- | --- | --- | --- | --- |
| 1533945600 | 10139 | { "programID" : "EP009370080215", "airDateTime" : "2018-08-11T00:00:00Z", "duration" : 3600, "md5" : "S1UDH1R60Eagc1E3V5Qslw", "audioProperties" : [ "cc" ], "ratings" : [ { "body" : "USA Parental Rating", "code" : "TVPG" } ] } | 3600 | S1UDH1R60Eagc1E3V5Qslw | EP009370080215 |
| 1533945600 | 10142 | { "programID" : "EP006062993248", "airDateTime" : "2018-08-11T00:00:00Z", "duration" : 3600, "md5" : "2FQ8y5PsXl1vtxcmUBeppg", "new" : true, "audioProperties" : [ "cc" ], "ratings" : [ { "body" : "USA Parental Rating", "code" : "TVPG" } ] } | 3600 | 2FQ8y5PsXl1vtxcmUBeppg | EP006062993248 |
| 1533945600 | 10145 | { "programID" : "EP022439260394", "airDateTime" : "2018-08-11T00:00:00Z", "duration" : 1800, "md5" : "mUewfiqM8+dh24WQg2WfpQ", "audioProperties" : [ "cc" ] } | 1800 | mUewfiqM8+dh24WQg2WfpQ | EP022439260394 |

The SQL Magic accepts the --print/--no-print options to print or suppress query results. If no JDBC URL is specified, the most recently used connection will be used. The list of the most recent jOOQ Queries is stored in $$.sql.queries, with $$.sql.results containing the corresponding Results. For example:

%%sql --no-print
SELECT COUNT(*) FROM programs;
%%java
print($$.sql.results.get(0));
count(*)
1024495

MySQL and PostgreSQL JDBC drivers are provided in the Ganymede runtime.

Spark

The spark-session magic is provided to initialize Apache Spark sessions.

    Usage: spark-session [--[no-]enable-hive-if-available] [<master>] [<appName>]
          [<master>]    Spark master
          [<appName>]   Spark appName
          --[no-]enable-hive-if-available
                        Enable Hive if available.  true by default

Its typical usage:

%%spark-session local[*] covid-19
# Optional name/value pairs parsed as Properties

is roughly equivalent to:

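import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

var config = new SparkConf();
/*
 * Properties from the cell body are copied to the SparkConf instance.
 */
var session =
    SparkSession.builder()
    .config(config)
    .master("local[*]").appName("covid-19")
    .getOrCreate();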

The SparkSession can then be accessed in Java and other JVM code with the SparkSession.active() static method.
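
The usage above notes that optional name/value pairs in the cell body are parsed as Properties. A minimal sketch of that step using java.util.Properties follows; CellProperties is a hypothetical helper and the kernel's own parsing may differ in detail.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical helper: parses a cell body in java.util.Properties syntax.
class CellProperties {
    static Map<String,String> parse(String body) {
        Properties properties = new Properties();
        try (StringReader reader = new StringReader(body)) {
            // Lines starting with '#' are treated as comments by Properties.load().
            properties.load(reader);
        } catch (IOException exception) {
            throw new UncheckedIOException(exception);
        }
        // Copy to a sorted map so the entries are easy to inspect or iterate.
        Map<String,String> map = new TreeMap<>();
        for (String name : properties.stringPropertyNames()) {
            map.put(name, properties.getProperty(name));
        }
        return map;
    }
}
```

Each resulting entry could then be applied to the configuration with SparkConf.set(name, value) before the session is built.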

Other Languages (JSR 223)

The kernel leverages the java.scripting API to provide groovy, javascript, kotlin, and scala.[4]
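
For readers unfamiliar with JSR 223, the sketch below shows the standard javax.script lookup that such an integration builds on. Engines is a hypothetical helper; which engines are actually found depends on the JVM and the JARs on the classpath, so no particular engine is assumed to be present.

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

// Hypothetical helper around the standard javax.script API.
class Engines {
    // Looks up an engine by short name (e.g. "groovy"); empty if none is available.
    static Optional<ScriptEngine> find(String name) {
        return Optional.ofNullable(new ScriptEngineManager().getEngineByName(name));
    }

    // Lists the short names of every engine discoverable on the classpath.
    static List<String> availableNames() {
        return new ScriptEngineManager().getEngineFactories().stream()
            .flatMap(factory -> factory.getNames().stream())
            .sorted()
            .collect(Collectors.toList());
    }
}
```

An engine returned by find can evaluate a cell's code with ScriptEngine.eval(String), which throws ScriptException on errors.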

Shells

The script magic (with the alias !) may be used to run an operating system command with the remaining code in the cell fed to the Process's standard input. bash, perl, ruby, and sh are provided as aliases for %%!bash, %%!perl, etc., respectively.
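
The mechanism described above, starting a command and feeding the cell's code to its standard input, can be sketched with ProcessBuilder. ScriptRunner is hypothetical and intentionally simple: it writes all input before reading any output, which is only safe for short scripts with modest output, and it is not the kernel's implementation.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical runner: feeds code to a command's standard input and collects its output.
class ScriptRunner {
    static String run(String command, String code) {
        try {
            Process process =
                new ProcessBuilder(command)
                .redirectErrorStream(true)      // merge stderr into stdout
                .start();
            // Write the script to the process's stdin; closing the stream signals end of input.
            try (OutputStream stdin = process.getOutputStream()) {
                stdin.write(code.getBytes(StandardCharsets.UTF_8));
            }
            String output =
                new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
            process.waitFor();
            return output;
        } catch (IOException exception) {
            throw new UncheckedIOException(exception);
        } catch (InterruptedException exception) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(exception);
        }
    }
}
```

On a system where sh is on the PATH, ScriptRunner.run("sh", "echo hello\n") returns the script's output, much as a %%sh cell would display it.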

Templates

A number of templating languages are supported as magics: freemarker, markdown, mustache (alias handlebars), thymeleaf (and html), and velocity.

The following subsections provide examples of the markdown and thymeleaf magics but the other template magics are similar. Please refer to the installation instructions for discussion of enabling the Hide Input extension so only the template output is displayed in the notebook.

Markdown and JMustache

The template magic markdown provides Markdown processing with JMustache preprocessing:

%%java
import java.util.stream.Stream;

import static java.util.stream.Collectors.toList;

var fib =
    Stream.iterate(new int[] { 0, 1 }, t -> new int[] { t[1], t[0] + t[1] })
    .mapToInt(t -> t[0])
    .limit(10)
    .boxed()
    .collect(toList());
%%markdown
| Index | Value |
| --- | --- |
{{#fib}}| {{-index}} | {{this}} |
{{/fib}}

| Index | Value |
| --- | --- |
| 0 | 0 |
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
| 4 | 3 |
| 5 | 5 |
| 6 | 8 |
| 7 | 13 |
| 8 | 21 |
| 9 | 34 |

Thymeleaf

The template magics thymeleaf and html offer templating with Thymeleaf. All defined Java variables are bound into the Thymeleaf context before evaluation. For example (Java implementation detail removed):

%%java
...
var map = new TreeMap<Ranking,List<Card>>(Ranking.COMPARATOR.reversed());
...
var rankings = Arrays.asList(Ranking.values());
...
%%html
<table>
  <tr th:each="ranking : ${rankings}">
    <th:block th:if="${map.containsKey(ranking)}">
      <th th:text="${ranking}"/><td th:each="card : ${map.get(ranking)}" th:text="${card}"/>
    </th:block>
  </tr>
  <tr><th>Remaining</th><td th:each="card : ${deck}" th:text="${card}"/></tr>
</table>

would generate:

RoyalFlush     A-♤  K-♤  Q-♤  J-♤  10-♤
StraightFlush  K-♡  Q-♡  J-♡  10-♡  9-♡
FourOfAKind    8-♤  8-♡  8-♢  8-♧  2-♧
FullHouse      A-♡  A-♢  A-♧  K-♢  K-♧
Flush          Q-♢  J-♢  10-♢  9-♢  7-♢
Straight       7-♤  6-♤  5-♤  4-♤  3-♤
ThreeOfAKind   6-♡  6-♢  6-♧  3-♧  4-♧
TwoPair        9-♤  9-♧  7-♡  7-♧  5-♧
Pair           5-♡  5-♢  10-♧  J-♧  2-♢
HighCard       Q-♧  3-♢  4-♢  2-♡  3-♡
Remaining      4-♡  2-♤

Documentation

Javadoc is published at https://allen-ball.github.io/ganymede.

License

Ganymede Kernel is released under the Apache License, Version 2.0, January 2004.

Endnotes

[1] Implemented with Apache Maven Artifact Resolver.

[2] With the built-in Oracle Nashorn engine.

[3] scala is special cased: It requires additional dependencies be specified at runtime and is optimized to be used with Apache Spark.

[4] Ibid.