If you already have a functioning Apache Spark configuration, you can use your own. For your convenience, the provided docker-compose.yml is based on the jupyter/pyspark-notebook image. You will need Docker and Docker Compose configured on your computer; see the Docker Desktop documentation for details.
You can run docker-compose up and follow the prompt to open the Jupyter Notebook UI (the URL will look like http://127.0.0.1:8888/?token=<SOME_TOKEN>).
The provided data/ directory is mounted as a Docker volume at ~/data/ inside the container for easy access:
import os

from pyspark.sql import SparkSession

# Start a local Spark session
spark = SparkSession.builder.master('local').getOrCreate()

# Read the sample data, treating the first row as the header and
# letting Spark infer the column types
df = spark.read.options(
    header=True,
    inferSchema=True,
    delimiter=',',
).csv(os.path.expanduser('~/data/DataSample.csv'))
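Once the DataFrame is loaded, a quick sanity check along these lines can confirm the read worked; this is only a minimal sketch and makes no assumptions about the column names in DataSample.csv:

# Inspect the inferred schema and a few rows
df.printSchema()
df.show(5, truncate=False)

# Confirm the row count
print(f'{df.count()} rows loaded')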
Please host your solution as one or more Notebooks (.ipynb) in a public remote git repository, and reply with its link to the email thread through which you originally received this work sample.