
Investigate options for running on a spark cluster #27

Open
dolegi opened this issue Oct 30, 2024 · 2 comments


dolegi commented Oct 30, 2024

There are some problems using the DirectRunner on JASMIN (it crashes on HTTP errors). We need a more robust way to run the scripts; a Spark cluster would probably be most stable, using Beam's Spark runner.

Investigate how we can do this: does JASMIN have an existing Spark cluster, or is there one available to CEH somewhere? (Iain might know, or be able to point us in a direction.)

Another option is spinning up our own Spark cluster on JASMIN; investigate whether that is possible/reasonable.

Acceptance criteria

  • We know how to move forward with running the scripts on a spark cluster
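For context, switching from the DirectRunner to Beam's Spark runner is, at the submission level, a change of pipeline options. A minimal sketch (the script name and master host are hypothetical placeholders):

```shell
# Hypothetical pipeline script; the change from the DirectRunner is just
# the runner option plus the address of the Spark master.
python pipeline.py \
  --runner=SparkRunner \
  --spark_master_url=spark://<master-host>:7077
```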
@mattjbr123 (Collaborator)

UKCEH's DataLabs has the capability to use Spark clusters. I will try there with a minimal version of the pipeline first, to see whether the approach is viable at all.

@mattjbr123 (Collaborator)

[EDITING IN PROGRESS]

This is proving difficult.

TL;DR:
Firstly, Beam's integration with Spark is not as user-friendly and simple as it could be; secondly, I suspect there may be issues with the Spark implementation on DataLabs that are out of my control to solve.

Full:
In theory, the Beam docs for the Spark runner should tell you all you need to know, but they are a little too brief and miss out useful bits of information. It appears that, to integrate with Spark, a Beam-specific 'JobService' instance needs to be started. There appear to be at least three ways of doing this.
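Once a JobService is up, a Python pipeline is submitted to it via the PortableRunner rather than directly to Spark. A hedged sketch of what that submission looks like (script name is a hypothetical placeholder; 8099 is the JobService's default port):

```shell
# Submit the pipeline to a separately running Beam JobService.
# LOOPBACK runs the Beam workers inside the submitting process, which is
# the simplest environment when Docker is unavailable.
python pipeline.py \
  --runner=PortableRunner \
  --job_endpoint=localhost:8099 \
  --environment_type=LOOPBACK
```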

The first is as described in the Beam docs: start the service using Docker (an image is already available for this), or run the service from the source code. What the docs don't say about the latter is that the necessary 'gradlew' file is only available directly from the GitHub repository; the release zip files on GitHub and their equivalents on conda/PyPI do not include it.
Given that Docker is not available on DataLabs, the source-code option is what we'd need to use. This does run fine; however, when the Beam pipeline is submitted to it, a Java error is raised. This can apparently be worked around, but that needs some extra arguments passed to a particular part of the program, which I don't know enough about to do, or it may be down to incompatibilities between Beam, Java and Spark. Beam claims to support the 2.3.x version of Spark, but the one on DataLabs is 2.5.x...
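For reference, the two ways of starting the JobService described above look roughly like this (the master host is a placeholder; the gradlew invocation assumes a git checkout of the Beam source, since the wrapper is absent from release archives and pip/conda packages):

```shell
# Option 1: Docker image (not possible on DataLabs).
docker run --net=host apache/beam_spark_job_server:latest \
  --spark-master-url=spark://<master-host>:7077

# Option 2: from a git checkout of apache/beam.
./gradlew :runners:spark:3:job-server:runShadow \
  -PsparkMasterUrl=spark://<master-host>:7077
```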

The second way of running it is to have Beam compile the pipeline as an 'UberJar', which is something vaguely similar to a container but specific to Spark, and which contains the Beam JobServer within it (I think...). However, this fails too: the Spark Master (in other words, the scheduler) needs to have a REST API available for the pipeline 'job' to be submitted to it this way, and from what I could tell, this isn't available on the DataLabs Spark implementation.
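A sketch of the UberJar route, assuming the Spark Master's REST submission endpoint were reachable (6066 is Spark's default REST port; the script name and host are placeholders):

```shell
# Beam builds an uber jar containing the pipeline plus the job server,
# and POSTs it to the Spark Master's REST API.
python pipeline.py \
  --runner=SparkRunner \
  --spark_submit_uber_jar \
  --spark_rest_url=http://<master-host>:6066 \
  --environment_type=LOOPBACK
```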

The third way of running it is similar. The pipeline is compiled to a Jar as before, but not an UberJar, and is then submitted to the Spark Master using the spark-submit executable. This again compiles fine, but runs into a problem where, for some reason, a connection to the Spark Master cannot be established. It is the same error I get if I try to run a simple Spark example via a Jupyter notebook, leading me to believe this is something to do with the DataLabs implementation of Spark rather than my code.
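The third route can be sketched as two steps: build the jar without running the pipeline, then hand it to the cluster's own spark-submit (script name, jar path and master host are placeholders):

```shell
# Step 1: compile the pipeline to a jar instead of executing it.
python pipeline.py \
  --runner=SparkRunner \
  --output_executable_path=pipeline.jar

# Step 2: submit the jar to the Spark Master directly.
spark-submit --master spark://<master-host>:7077 pipeline.jar
```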
