
Investigate options for running on a spark cluster #27

Open
dolegi opened this issue Oct 30, 2024 · 2 comments


dolegi commented Oct 30, 2024

There are some problems using the DirectRunner on JASMIN (it crashes on HTTP errors). We need a more robust way to run the scripts; a Spark cluster would probably be most stable, using Beam's Spark runner.

Investigate how we can do this: does JASMIN have an existing Spark cluster, or is there one available to CEH somewhere? (Iain might know, or be able to point us in a direction.)

Another option is spinning up our own Spark cluster on JASMIN; investigate whether that is possible/reasonable.

Acceptance criteria

  • We know how to move forward with running the scripts on a spark cluster
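For context, switching from the DirectRunner to Beam's Spark runner is, at the submission level, a change of pipeline options. A minimal sketch (the script name and master host are hypothetical placeholders):

```shell
# Hypothetical pipeline script; the change from the DirectRunner is just
# the runner option plus the address of the Spark master.
python pipeline.py \
  --runner=SparkRunner \
  --spark_master_url=spark://<master-host>:7077
```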
@mattjbr123 (Collaborator)

UKCEH's DataLabs has the capability to use Spark clusters. I will try there with a minimal version of the pipeline first, to see whether the approach is viable at all.

@mattjbr123 (Collaborator)

[EDITING IN PROGRESS]

This is proving difficult.

TL;DR:
Firstly, Beam's integration with Spark is not as user-friendly and simple as it could be; secondly, I suspect there may be issues with the Spark implementation on DataLabs that are out of my control to solve.

Full:
In theory, the Beam docs for the Spark runner should tell you all you need to know, but they are a little too brief and miss out useful bits of information. It appears that, to integrate with Spark, a Beam-specific 'JobService' instance needs to be started. There appear to be at least three ways of doing this.
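Once a JobService is up, a Python pipeline is submitted to it via the PortableRunner rather than directly to Spark. A hedged sketch of what that submission looks like (script name is a hypothetical placeholder; 8099 is the JobService's default port):

```shell
# Submit the pipeline to a separately running Beam JobService.
# LOOPBACK runs the Beam workers inside the submitting process, which is
# the simplest environment when Docker is unavailable.
python pipeline.py \
  --runner=PortableRunner \
  --job_endpoint=localhost:8099 \
  --environment_type=LOOPBACK
```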

The first is as described in the Beam docs: start the service using Docker (an image is already available for this), or run the service from the source code. What the docs don't say about the latter is that the necessary 'gradlew' file is only available directly from the GitHub repository; the release zip files on GitHub and their equivalents on conda/PyPI do not include it.
Given that Docker is not available on DataLabs, the source-code option is what we'd need to use. This does run fine; however, when the Beam pipeline is submitted to it, a Java error is raised. This can apparently be worked around, but that needs some extra arguments passed to a particular part of the program, which I don't know enough about to do, or it may be down to incompatibilities between Beam, Java and Spark. Beam claims to support the 2.3.x version of Spark, but the one on DataLabs is 2.5.x...
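For reference, the two ways of starting the JobService described above look roughly like this (the master host is a placeholder; the gradlew invocation assumes a git checkout of the Beam source, since the wrapper is absent from release archives and pip/conda packages):

```shell
# Option 1: Docker image (not possible on DataLabs).
docker run --net=host apache/beam_spark_job_server:latest \
  --spark-master-url=spark://<master-host>:7077

# Option 2: from a git checkout of apache/beam.
./gradlew :runners:spark:3:job-server:runShadow \
  -PsparkMasterUrl=spark://<master-host>:7077
```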

The second way of running it is to have Beam compile the pipeline as an 'UberJar', which is something vaguely similar to a container but specific to Spark, and which contains the Beam JobServer within it (I think...). However, this fails too: the Spark Master (in other words, the scheduler) needs to have a REST API available for the pipeline 'job' to be submitted to it this way, and from what I could tell, this isn't available on the DataLabs Spark implementation.
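A sketch of the UberJar route, assuming the Spark Master's REST submission endpoint were reachable (6066 is Spark's default REST port; the script name and host are placeholders):

```shell
# Beam builds an uber jar containing the pipeline plus the job server,
# and POSTs it to the Spark Master's REST API.
python pipeline.py \
  --runner=SparkRunner \
  --spark_submit_uber_jar \
  --spark_rest_url=http://<master-host>:6066 \
  --environment_type=LOOPBACK
```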

The third way of running it is similar. The pipeline is compiled to a Jar as before, but not an UberJar, and is then submitted to the Spark Master using the spark-submit executable. This again compiles fine, but runs into a problem where, for some reason, a connection to the Spark Master cannot be established. It is the same error I get if I try to run a simple Spark example via a Jupyter notebook, leading me to believe this is something to do with the DataLabs implementation of Spark rather than my code.
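The third route can be sketched as two steps: build the jar without running the pipeline, then hand it to the cluster's own spark-submit (script name, jar path and master host are placeholders):

```shell
# Step 1: compile the pipeline to a jar instead of executing it.
python pipeline.py \
  --runner=SparkRunner \
  --output_executable_path=pipeline.jar

# Step 2: submit the jar to the Spark Master directly.
spark-submit --master spark://<master-host>:7077 pipeline.jar
```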
