- This program's aim is to measure Spark performance on a number of operations and typical use cases.
- Parsing of a JSONL Dataframe file. The JSONL Dataframe file is a format designed to be human-readable text while being somewhat
optimized for large data transfers over the network. It includes exactly ONE metadata line that defines
the data types and column names for the subsequent lines, which contain compact arrays of data. The current JSONL Dataframe format assumes
support for the following types:
  - String, LocalDate, Timestamp, BigDecimal, BigInteger, Array&lt;BigDecimal&gt;, Array&lt;BigInteger&gt;
- An example of the JSONL format is below
{ "metaData": {"columnNames": ["biz_date","a_fejcyllnrtyexcn","b_cvdxurcjmxfmmaj","c_yjxpaqarqvjbbrm","d_qddvzpgyipoqneg","e_tochpkfecrzqjcm","f_yrukoixpwewbhhq","g_jrplbebkluvzuat","h_zzbozoomdwxcqmu","i_bkylvzbbcskksyw","j_nfxjpiqbndrusth"] , "types": ["Date","String","Timestamp","Array<BigDecimal>","Array<BigDecimal>","Timestamp","Array<BigInteger>","BigInteger","Timestamp","Timestamp","BigInteger"]}} {"row":["20190615","AZJCXSXJBXLUWZDAXV","20190910-15:48:58.651173",null,[0.6366280649451527],"20190910-15:48:58.651333",[997],-82091895,"20190910-15:48:58.651426","20190910-15:48:58.651501",85624007]} {"row":["20190929","YCPPEHMZWIQMPTARXVW","20190910-15:48:58.651688",null,[0.8005208960967507, 0.4629934652532054],"20190910-15:48:58.651745",[326],79521309,"20190910-15:48:58.651834","20190910-15:48:58.651863",null]} {"row":["20190724",null,"20190910-15:48:58.651977",[0.8346759933473351, 0.01521255424173329],null,"20190910-15:48:58.652026",[155],77425,"20190910-15:48:58.652097","20190910-15:48:58.652124",-18037480]} {"row":["20190707",null,"20190910-15:48:58.652236",[0.08305561036338849],null,"20190910-15:48:58.652282",[670, 425],-12684310,"20190910-15:48:58.652352","20190910-15:48:58.652378",null]} {"row":["20190913",null,"20190910-15:48:58.652486",null,[0.9555277095824917, 0.7486451091305474],"20190910-15:48:58.652532",[141],null,"20190910-15:48:58.652602","20190910-15:48:58.652629",66233448]} {"row":["20190620",null,"20190910-15:48:58.652736",[0.18516659006570801],[0.41137015158232304, 0.6667256822041231],"20190910-15:48:58.652784",[260, 121],null,"20190910-15:48:58.652858","20190910-15:48:58.652886",27934970]} {"row":["20190615","GQGKJIVU","20190910-15:48:58.652989",[0.9892582343257756, 0.20150289076244832],[0.20469205499263377, 0.37458327057810803],"20190910-15:48:58.653041",null,-76530845,"20190910-15:48:58.653118","20190910-15:48:58.653148",51713610]} {"row":["20190915",null,"20190910-15:48:58.653254",[0.34918000670118776],[0.3699235272374676, 0.05527038650988347],"20190910-15:48:58.653301",[825, 232],-16111738,"20190910-15:48:58.653375","20190910-15:48:58.653403",-55494614]} {"row":["20190923",null,"20190910-15:48:58.653520",[0.48612117194554993, 0.1703540048180341],[0.5016211169401508],"20190910-15:48:58.653569",null,-79147816,"20190910-15:48:58.653640","20190910-15:48:58.653667",6043744]} {"row":["20191213","PUYIYDHANH","20190910-15:48:58.653770",[0.19424961886032033],[0.36394658213372477],"20190910-15:48:58.653821",[48, 474],-849174,"20190910-15:48:58.653897","20190910-15:48:58.653925",null]}
- Supporting scripts create a number of JSONL Dataframe files that vary in both row and column counts, with enough scenarios to give you a sense of Spark's performance when parsing this data, converting it to a Spark DataFrame, and performing operations such as count and save to HDFS (see the parsing sketch below).
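To make the measured work concrete, here is a minimal sketch, not the harness's actual implementation, of how a JSONL Dataframe file might be parsed into a Spark DataFrame and then counted and saved to HDFS. It keeps every value as a string for simplicity, uses the json4s parser bundled with Spark, and the input and output paths are hypothetical.
// Minimal sketch, assuming Spark 2.4 with its bundled json4s; not the harness's actual code
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.json4s._
import org.json4s.jackson.JsonMethods.{compact, parse, render}

object JsonlDataframeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("jsonl-dataframe-sketch").getOrCreate()
    implicit val formats: Formats = DefaultFormats

    // Hypothetical input path; the generator writes files under data/l0/jsonl
    val lines = spark.sparkContext.textFile("hdfs:///user/test-harness/data/l0/jsonl/sample.jsonl")

    // The single metaData line defines column names (and types, ignored here for brevity)
    val metaLine = lines.first()
    val columnNames = (parse(metaLine) \ "metaData" \ "columnNames").extract[List[String]]
    val schema = StructType(columnNames.map(name => StructField(name, StringType, nullable = true)))

    // Every other line holds one compact "row" array; keep values as strings / JSON text
    val rows = lines.filter(_ != metaLine).map { line =>
      val values = (parse(line) \ "row") match {
        case JArray(items) => items.map {
          case JNull      => null
          case JString(s) => s
          case other      => compact(render(other)) // numbers and nested arrays kept as JSON text
        }
        case _ => List.empty[String]
      }
      Row.fromSeq(values)
    }

    val df = spark.createDataFrame(rows, schema)

    // The operations the harness measures: count and save to HDFS (hypothetical output path)
    println(s"row count: ${df.count()}")
    df.write.mode("overwrite").parquet("hdfs:///user/test-harness/data/out/sample.parquet")
    spark.stop()
  }
}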
The instructions below use Homebrew; if you don't already have it, install it first.
brew install hadoop
- This will install Hadoop, which includes the HDFS and YARN services, under /usr/local/Cellar/hadoop/3.1.2, which will be referred to as ${HADOOP_HOME} further in these instructions.
- Change the property hadoop.tmp.dir to the value ~/hadoop-storage in the file ${HADOOP_HOME}/libexec/etc/hadoop/core-default.xml. The default value /tmp/hadoop-${user.name} points to a location that is erased on reboot, so HDFS becomes corrupted after your computer restarts.
- Enable Hadoop Pseudo-Distributed Operation mode. A summary is included below; for more details and options visit hadoop.apache.org
- Locate the file ${HADOOP_HOME}/libexec/etc/hadoop/core-site.xml and add the following property
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
- Locate the file ${HADOOP_HOME}/libexec/etc/hadoop/hdfs-site.xml and add the following property
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
- Ensure that
$ ssh localhost
uses key-based authentication; if not, follow the steps below to enable it.
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
- Format the HDFS file system
${HADOOP_HOME}/libexec/bin/hdfs namenode -format
- Start HDFS & YARN and create the test-harness user directory in HDFS
${HADOOP_HOME}/libexec/sbin/start-dfs.sh
${HADOOP_HOME}/libexec/bin/hdfs dfs -mkdir -p /user/test-harness
${HADOOP_HOME}/libexec/sbin/start-yarn.sh
- IMPORTANT: you must see at least one data node running in the UI below. If not, it likely means you used
an existing Hadoop install, and you need to either reformat HDFS with the cluster id flag (the id can be found in
the failing data node's log) and/or remove all files in the HDFS storage folder, which in this example was defined as
~/hadoop-storage
- The above should start several services whose UIs will be visible on the following default ports
- HDFS - http://localhost:9870/
- YARN - http://localhost:8088/
- To stop HDFS & YARN use the following scripts
${HADOOP_HOME}/libexec/sbin/stop-dfs.sh
${HADOOP_HOME}/libexec/sbin/stop-yarn.sh
brew install maven
- The scripts that generate data samples and write them to HDFS are written in Python and can be executed using a Maven goal. Hence we recommend you install Maven as well as Anaconda, and create an environment using the provided recipe.
brew cask install anaconda
- followed by conda init bash
- Restart your terminal after installing Anaconda and initializing the bash shell so it recognizes the location of the conda binaries
conda env create -f ./utils/conda.recipe/test-harness.yml
- You will need to run this command in the directory where you checked out this test-harness project from GitHub. This will create a Python environment with the necessary HDFS libraries to run the data generator
conda activate test-harness
- Activate the conda environment you created. The data generator script needs to run in this specific conda environment
mvn exec:exec@generate-data
- Run this Maven goal to populate HDFS with the data scenarios defined in the application.conf#scenarios/run section
- http://localhost:9870/explorer.html#/user/test-harness/data/l0/jsonl
- You should be able to see the data being generated using the File Browser at the link above.
brew install apache-spark
- This will install the Apache Spark service under /usr/local/Cellar/apache-spark/2.4.3/, which will be referred to as ${SPARK_HOME} further in these instructions.
- Configure Spark to inherit HDFS configuration
- Locate the file ${SPARK_HOME}/libexec/conf/spark-env.sh and set the following variables. If the file is missing, create one or copy it from the template.
HADOOP_HOME=/usr/local/Cellar/hadoop/3.1.2
HADOOP_CONF_DIR=${HADOOP_HOME}/libexec/etc/hadoop
- Start script
${SPARK_HOME}/libexec/sbin/start-all.sh
- The above should start a service whose UI will be visible on the following default port
- SPARK UI - http://localhost:8080/
- The default Spark master URL, to which the spark-submit script will submit jobs, should be located at spark://localhost:7077
- Stop script
${SPARK_HOME}/libexec/sbin/stop-all.sh
mvn -DskipTests package exec:exec@run-test-spark-master
- This command will create an uber jar and submit it to the standalone cluster, using the Spark master URL defined in the respective Maven goal in pom.xml
- You can change the spark.cores.max option in pom.xml to see how well the scenario scales with fewer or more cores. When it is missing, all cores on your machine will be used. On the Mac used for the results, 4 cores were used by default (see the illustration below).
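For reference, the snippet below is only an illustration of what spark.cores.max does (it caps the total cores a standalone-mode application may claim), not how the harness sets it; the harness passes the setting through spark-submit in pom.xml. It can be pasted into spark-shell against the standalone master from the setup above.
// Illustration only: spark.cores.max caps the total cores a standalone-mode application may claim
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-test-harness")
  .master("spark://localhost:7077")   // standalone master started earlier
  .config("spark.cores.max", "2")     // try 1, 2, 4, ... to observe scaling
  .getOrCreate()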
mvn -DskipTests package exec:exec@run-test-yarn-master
- Make the following class your main class
com.alex.shagiev.spark.Main
- Enable the Include dependencies with "Provided" Scope option in the Run/Debug configuration.
- Add the environment variable master=local; this will force the job to run in-process (a sketch of how such an override can be honored is shown below).
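Below is a hedged sketch of how such a master override might be honored in code; the actual com.alex.shagiev.spark.Main may wire this differently.
// Sketch only; the real com.alex.shagiev.spark.Main may differ
import org.apache.spark.sql.SparkSession

object LocalRunSketch {
  def main(args: Array[String]): Unit = {
    val builder = SparkSession.builder().appName("spark-test-harness")
    // If a `master` environment variable is present (e.g. master=local from the IDE Run/Debug
    // configuration), run in-process; otherwise rely on the master supplied by spark-submit
    val spark = sys.env.get("master").fold(builder)(builder.master).getOrCreate()
    println(s"Running with master = ${spark.sparkContext.master}")
    spark.stop()
  }
}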
- Create a key pair using the EC2 console
- IMPORTANT - the key name must be aws-emr-key as it is used in most util scripts in the project
- Save aws-emr-key.pem into your Mac's home directory and copy it to the ${T2_MICRO}:~/aws-emr-key.pem location
- Create a Free Tier t2.micro Amazon Linux 2 AMI (HVM), SSD Volume Type instance using the wizard.
  - This host will be referred to as ${T2_MICRO} in this document. It will be your gateway into AWS. You will deploy and run the test harness from this node.
  - A t2.micro instance includes 1 CPU, 1 GB RAM, and transient Elastic Block Storage (EBS)
- Log in to your Free Tier gateway
ssh -i ~/aws-emr-key.pem ${T2_MICRO}
- Create an emr-create-id identity and an emr-create-group group using Identity and Access Management (IAM)
  - Select Programmatic Access for Access Type
  - While creating the emr-create-group group, ensure that you attach the AmazonElasticMapReduceFullAccess policy to it so that it is able to manipulate EMR
  - IMPORTANT - Save the Access Key & Secret Access Key during the process; you WON'T be able to retrieve them later
- Execute aws configure on the ${T2_MICRO} node and enter the Access Key and Secret Access Key from the emr-create-id identity
  - Assign us-east-2 and json for the region and output format respectively.
  - For more on CLI configuration review the AWS Command Line Interface (CLI) documentation
  - IMPORTANT - You must choose json as the output format for the utility scripts in the Test Harness to work
- Network Security Groups
  - Using the Security Groups UI, ensure that under Inbound Rules of the ElasticMapReduce-slave & ElasticMapReduce-master security groups you add a rule with type All TCP, port range 0-65535, and Source set to the security group of ${T2_MICRO}
- Run the steps below to install the tools used by the Test Harness
# install git
sudo yum -y install git
# install anaconda
rm -rf Anaconda2-2019.07-Linux-x86_64.sh
rm -rf ./anaconda2
wget https://repo.continuum.io/archive/Anaconda2-2019.07-Linux-x86_64.sh
bash Anaconda2-2019.07-Linux-x86_64.sh -b -p ./anaconda2
./anaconda2/condabin/conda init bash
source .bashrc
# install apache maven
sudo wget http://repos.fedorapeople.org/repos/dchen/apache-maven/epel-apache-maven.repo -O /etc/yum.repos.d/epel-apache-maven.repo
sudo sed -i s/\$releasever/6/g /etc/yum.repos.d/epel-apache-maven.repo
sudo yum install -y apache-maven
# maven installs with jdk 1.7 so fix env to point to jdk 1.8
sudo yum install -y java-1.8.0-devel
sudo /usr/sbin/alternatives --set java /usr/lib/jvm/java-1.8.0-openjdk.x86_64/bin/java
sudo /usr/sbin/alternatives --set javac /usr/lib/jvm/java-1.8.0-openjdk.x86_64/bin/javac
# download test harness
rm -rf ./spark-test-harness
git clone https://github.com/alexshagiev/spark-test-harness.git
cd spark-test-harness
# conda env create -f ./utils/conda.recipe/test-harness.yml -- fails with out of memory on free tier.
# hence the work around
conda create --yes --name test-harness python=3.6
conda activate test-harness
conda install --yes -c conda-forge pyhocon
conda install --yes -c conda-forge pyarrow
conda install --yes -c conda-forge hdfs3
conda install --yes -c conda-forge tqdm
- You MUST activate the correct conda env via
conda activate test-harness
mvn -DskipTests package exec:exec@run-test-aws-emr
- This will create the cluster, populate it with data, and run the Test Harness
  - Modify the value of <argument>--core-nodes</argument> of the exec:exec@run-test-aws-emr plugin to change the number of CORE nodes added to the fleet during instantiation. The default is 4 nodes, each with 4 vCores, hence a total of 16 cores.
- The following command can be used to populate data on the existing EMR cluster.
python ./utils/aws_emr_spot_cluster.py --cluster-id j-SxxxxxxxxxX --populate-hdfs=true
- The following command can be used to run the Spark job on the existing EMR cluster.
python ./utils/aws_emr_spot_cluster.py --cluster-id j-SxxxxxxxxxX --spark-submit=true --spark-jar ./target/spark-test-harness-*.jar
- IMPORTANT - remember to terminate your cluster once the job is completed to avoid unnecessary charges.
- Install FoxyProxy extension in your Chrome Browser
- Configure FoxyProxy as follows. The latest proxy settings & plugin instructions are available here
<?xml version="1.0" encoding="UTF-8"?>
<foxyproxy>
  <proxies>
    <proxy name="emr-socks-proxy" id="2322596116" notes="" fromSubscription="false" enabled="true" mode="manual" selectedTabIndex="2" lastresort="false" animatedIcons="true" includeInCycle="true" color="#0055E5" proxyDNS="true" noInternalIPs="false" autoconfMode="pac" clearCacheBeforeUse="false" disableCache="false" clearCookiesBeforeUse="false" rejectCookies="false">
      <matches>
        <match enabled="true" name="*ec2*.amazonaws.com*" pattern="*ec2*.amazonaws.com*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false" />
        <match enabled="true" name="*ec2*.compute*" pattern="*ec2*.compute*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false" />
        <match enabled="true" name="10.*" pattern="http://10.*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false" />
        <match enabled="true" name="*10*.amazonaws.com*" pattern="*10*.amazonaws.com*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false" />
        <match enabled="true" name="*10*.compute*" pattern="*10*.compute*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false" />
        <match enabled="true" name="*.compute.internal*" pattern="*.compute.internal*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false"/>
        <match enabled="true" name="*.ec2.internal*" pattern="*.ec2.internal*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false"/>
      </matches>
      <manualconf host="localhost" port="8157" socksversion="5" isSocks="true" username="" password="" domain="" />
    </proxy>
  </proxies>
</foxyproxy>
- Once your cluster is running, navigate to your desired cluster and click the SSH link next to the Master Public DNS Name. This will give you a prepopulated script of the following form; run it on your local Mac
ssh -i ~/aws-emr-key.pem -ND 8157 hadoop@${MASTER_PUBLIC_DNS_NAME}
- The EMR Management console should now have the web links under the Connections: section enabled, which will take you directly to the HDFS, Spark, and YARN UIs
- Observations:
- Based on the local Mac test, it appears that YARN introduces very minimal scheduling overhead. Things do slow down on local YARN compared to the Spark internal scheduler, but the slowdown is explained by, and proportional to, the vCore lost to processing, since YARN allocates a single vCore to the application master, which runs in client mode.
- Also, interestingly enough, it is possible to achieve comparable performance between premium Mac hardware with a solid-state drive and AWS general-purpose m5.large hardware with magnetic disk
- It also seems that read and write each take roughly 15% of the overall processing time, compared to 70% for parsing JSON. JSON parsing is therefore naturally the area to optimize (a timing sketch is shown below).
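To illustrate how such a breakdown can be obtained, the sketch below times the read, parse, and write phases separately with simple wall-clock measurements. The percentages above come from the harness's own runs; the paths and the timing helper here are hypothetical stand-ins, and Spark's built-in JSON reader is used for brevity rather than the JSONL Dataframe parser.
// Hypothetical timing sketch; the harness's own instrumentation may differ
import org.apache.spark.sql.SparkSession

object PhaseTimingSketch {
  private def timed[T](label: String)(block: => T): T = {
    val start = System.nanoTime()
    val result = block
    println(f"$label took ${(System.nanoTime() - start) / 1e9}%.2f s")
    result
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("phase-timing-sketch").getOrCreate()
    import spark.implicits._

    // read: pull the raw JSONL text from HDFS and materialize it (hypothetical path)
    val raw = timed("read") {
      val rdd = spark.sparkContext.textFile("hdfs:///user/test-harness/data/l0/jsonl")
      rdd.cache()
      rdd.count()
      rdd
    }

    // parse: convert the JSON text into a DataFrame
    val df = timed("parse") {
      val parsed = spark.read.json(raw.toDS())
      parsed.cache()
      parsed.count()
      parsed
    }

    // write: save the parsed result back to HDFS (hypothetical path)
    timed("write") {
      df.write.mode("overwrite").parquet("hdfs:///user/test-harness/data/out/timing")
    }
    spark.stop()
  }
}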
- Full results can be found in the Google Sheet with results here
- For the definition of scenarios refer to the application.conf#scenarios section