- This program's aim is to measure Spark performance on a number of operations and typical use cases.
- Parsing of a JSONL Dataframe file. The JSONL Dataframe file is a format designed to be human-readable text while being somewhat
optimized for large data transfers over the network. It includes exactly ONE metadata line that defines
the data types and column names for the subsequent lines, which contain compact arrays of data. The current JSONL Dataframe format assumes
support for the following types:
  - String, LocalDate, Timestamp, BigDecimal, BigInteger, Array&lt;BigDecimal&gt;, Array&lt;BigInteger&gt;
- An example of the JSONL format is below
{ "metaData": {"columnNames": ["biz_date","a_fejcyllnrtyexcn","b_cvdxurcjmxfmmaj","c_yjxpaqarqvjbbrm","d_qddvzpgyipoqneg","e_tochpkfecrzqjcm","f_yrukoixpwewbhhq","g_jrplbebkluvzuat","h_zzbozoomdwxcqmu","i_bkylvzbbcskksyw","j_nfxjpiqbndrusth"] , "types": ["Date","String","Timestamp","Array<BigDecimal>","Array<BigDecimal>","Timestamp","Array<BigInteger>","BigInteger","Timestamp","Timestamp","BigInteger"]}} {"row":["20190615","AZJCXSXJBXLUWZDAXV","20190910-15:48:58.651173",null,[0.6366280649451527],"20190910-15:48:58.651333",[997],-82091895,"20190910-15:48:58.651426","20190910-15:48:58.651501",85624007]} {"row":["20190929","YCPPEHMZWIQMPTARXVW","20190910-15:48:58.651688",null,[0.8005208960967507, 0.4629934652532054],"20190910-15:48:58.651745",[326],79521309,"20190910-15:48:58.651834","20190910-15:48:58.651863",null]} {"row":["20190724",null,"20190910-15:48:58.651977",[0.8346759933473351, 0.01521255424173329],null,"20190910-15:48:58.652026",[155],77425,"20190910-15:48:58.652097","20190910-15:48:58.652124",-18037480]} {"row":["20190707",null,"20190910-15:48:58.652236",[0.08305561036338849],null,"20190910-15:48:58.652282",[670, 425],-12684310,"20190910-15:48:58.652352","20190910-15:48:58.652378",null]} {"row":["20190913",null,"20190910-15:48:58.652486",null,[0.9555277095824917, 0.7486451091305474],"20190910-15:48:58.652532",[141],null,"20190910-15:48:58.652602","20190910-15:48:58.652629",66233448]} {"row":["20190620",null,"20190910-15:48:58.652736",[0.18516659006570801],[0.41137015158232304, 0.6667256822041231],"20190910-15:48:58.652784",[260, 121],null,"20190910-15:48:58.652858","20190910-15:48:58.652886",27934970]} {"row":["20190615","GQGKJIVU","20190910-15:48:58.652989",[0.9892582343257756, 0.20150289076244832],[0.20469205499263377, 0.37458327057810803],"20190910-15:48:58.653041",null,-76530845,"20190910-15:48:58.653118","20190910-15:48:58.653148",51713610]} {"row":["20190915",null,"20190910-15:48:58.653254",[0.34918000670118776],[0.3699235272374676, 0.05527038650988347],"20190910-15:48:58.653301",[825, 232],-16111738,"20190910-15:48:58.653375","20190910-15:48:58.653403",-55494614]} {"row":["20190923",null,"20190910-15:48:58.653520",[0.48612117194554993, 0.1703540048180341],[0.5016211169401508],"20190910-15:48:58.653569",null,-79147816,"20190910-15:48:58.653640","20190910-15:48:58.653667",6043744]} {"row":["20191213","PUYIYDHANH","20190910-15:48:58.653770",[0.19424961886032033],[0.36394658213372477],"20190910-15:48:58.653821",[48, 474],-849174,"20190910-15:48:58.653897","20190910-15:48:58.653925",null]}
- Supporting scripts create a number of JSONL Dataframe files that vary in both row and column counts, with enough scenarios to give you a sense of Spark's performance when parsing this data, converting it to a Spark DataFrame, and performing operations such as count and save to HDFS (see the parsing sketch below).
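To make the measured work concrete, here is a minimal sketch, not the harness's actual implementation, of how a JSONL Dataframe file might be parsed into a Spark DataFrame and then counted and saved to HDFS. It keeps every value as a string for simplicity, uses the json4s parser bundled with Spark, and the input and output paths are hypothetical.
// Minimal sketch, assuming Spark 2.4 with its bundled json4s; not the harness's actual code
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.json4s._
import org.json4s.jackson.JsonMethods.{compact, parse, render}

object JsonlDataframeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("jsonl-dataframe-sketch").getOrCreate()
    implicit val formats: Formats = DefaultFormats

    // Hypothetical input path; the generator writes files under data/l0/jsonl
    val lines = spark.sparkContext.textFile("hdfs:///user/test-harness/data/l0/jsonl/sample.jsonl")

    // The single metaData line defines column names (and types, ignored here for brevity)
    val metaLine = lines.first()
    val columnNames = (parse(metaLine) \ "metaData" \ "columnNames").extract[List[String]]
    val schema = StructType(columnNames.map(name => StructField(name, StringType, nullable = true)))

    // Every other line holds one compact "row" array; keep values as strings / JSON text
    val rows = lines.filter(_ != metaLine).map { line =>
      val values = (parse(line) \ "row") match {
        case JArray(items) => items.map {
          case JNull      => null
          case JString(s) => s
          case other      => compact(render(other)) // numbers and nested arrays kept as JSON text
        }
        case _ => List.empty[String]
      }
      Row.fromSeq(values)
    }

    val df = spark.createDataFrame(rows, schema)

    // The operations the harness measures: count and save to HDFS (hypothetical output path)
    println(s"row count: ${df.count()}")
    df.write.mode("overwrite").parquet("hdfs:///user/test-harness/data/out/sample.parquet")
    spark.stop()
  }
}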
The instructions below use Homebrew; if you don't already have it, install it first.
brew install hadoop
- This will install Hadoop, which includes the HDFS and YARN services, under /usr/local/Cellar/hadoop/3.1.2, which will be referred to as ${HADOOP_HOME} further in these instructions.
- Change the property hadoop.tmp.dir to the value ~/hadoop-storage in the file ${HADOOP_HOME}/libexec/etc/hadoop/core-default.xml. The default value /tmp/hadoop-${user.name} points to a location that is erased on reboot, so HDFS becomes corrupted after your computer restarts.
- Enable Hadoop Pseudo-Distributed Operation mode. A summary is included below; for more details and options visit hadoop.apache.org
- Locate the file ${HADOOP_HOME}/libexec/etc/hadoop/core-site.xml and add the following property
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
- Locate the file ${HADOOP_HOME}/libexec/etc/hadoop/hdfs-site.xml and add the following property
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
- Ensure that
$ ssh localhost
uses key-based authentication; if not, follow the steps below to enable it.
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
- Format the HDFS file system
${HADOOP_HOME}/libexec/bin/hdfs namenode -format
- Start HDFS & YARN and create the test-harness user directory in HDFS
${HADOOP_HOME}/libexec/sbin/start-dfs.sh
${HADOOP_HOME}/libexec/bin/hdfs dfs -mkdir -p /user/test-harness
${HADOOP_HOME}/libexec/sbin/start-yarn.sh
- IMPORTANT: you must see at least one data node running in the UI below. If not, it likely means you used
an existing Hadoop install, and you need to either reformat HDFS with the cluster id flag (the id can be found in
the failing data node's log) and/or remove all files in the HDFS storage folder, which in this example was defined as
~/hadoop-storage
- The above should start several services whose UIs will be visible on the following default ports
- HDFS - http://localhost:9870/
- YARN - http://localhost:8088/
- To stop HDFS & YARN use the following scripts
${HADOOP_HOME}/libexec/sbin/stop-dfs.sh
${HADOOP_HOME}/libexec/sbin/stop-yarn.sh
brew install maven
- The scripts that generate data samples and write them to HDFS are written in Python and can be executed using a Maven goal. Hence we recommend you install Maven as well as Anaconda, and create an environment using the provided recipe.
brew cask install anaconda
- followed by conda init bash
- Restart your terminal after installing Anaconda and initializing the bash shell so it recognizes the location of the conda binaries
conda env create -f ./utils/conda.recipe/test-harness.yml
- You will need to run this command in the directory where you checked out this test-harness project from GitHub. This will create a Python environment with the necessary HDFS libraries to run the data generator
conda activate test-harness
- Activate the conda environment you created. The data generator script needs to run in this specific conda environment
mvn exec:exec@generate-data
- Run this Maven goal to populate HDFS with the data scenarios defined in the application.conf#scenarios/run section
- http://localhost:9870/explorer.html#/user/test-harness/data/l0/jsonl
- You should be able to see the data being generated using the File Browser at the link above.
brew install apache-spark
- This will install the Apache Spark service under /usr/local/Cellar/apache-spark/2.4.3/, which will be referred to as ${SPARK_HOME} further in these instructions.
- Configure Spark to inherit HDFS configuration
- Locate the file ${SPARK_HOME}/libexec/conf/spark-env.sh and set the following variables. If the file is missing, create one or copy it from the template.
HADOOP_HOME=/usr/local/Cellar/hadoop/3.1.2
HADOOP_CONF_DIR=${HADOOP_HOME}/libexec/etc/hadoop
- Start script
${SPARK_HOME}/libexec/sbin/start-all.sh
- The above should start a service whose UI will be visible on the following default port
- SPARK UI - http://localhost:8080/
- The default Spark master URL, to which the spark-submit script will submit jobs, should be located at spark://localhost:7077
- Stop script
${SPARK_HOME}/libexec/sbin/stop-all.sh
mvn -DskipTests package exec:exec@run-test-spark-master
- This command will create an uber jar and submit it to the standalone cluster, using the Spark master URL defined in the respective Maven goal in pom.xml
- You can change the spark.cores.max option in pom.xml to see how well the scenario scales with fewer or more cores. When it is missing, all cores on your machine will be used. On the Mac used for the results, 4 cores were used by default (see the illustration below).
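For reference, the snippet below is only an illustration of what spark.cores.max does (it caps the total cores a standalone-mode application may claim), not how the harness sets it; the harness passes the setting through spark-submit in pom.xml. It can be pasted into spark-shell against the standalone master from the setup above.
// Illustration only: spark.cores.max caps the total cores a standalone-mode application may claim
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-test-harness")
  .master("spark://localhost:7077")   // standalone master started earlier
  .config("spark.cores.max", "2")     // try 1, 2, 4, ... to observe scaling
  .getOrCreate()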
mvn -DskipTests package exec:exec@run-test-yarn-master
- Make the following class your main class
com.alex.shagiev.spark.Main
- Enable the Include dependencies with "Provided" Scope option in the Run/Debug configuration.
- Add the environment variable master=local; this will force the job to run in-process (a sketch of how such an override can be honored is shown below).
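Below is a hedged sketch of how such a master override might be honored in code; the actual com.alex.shagiev.spark.Main may wire this differently.
// Sketch only; the real com.alex.shagiev.spark.Main may differ
import org.apache.spark.sql.SparkSession

object LocalRunSketch {
  def main(args: Array[String]): Unit = {
    val builder = SparkSession.builder().appName("spark-test-harness")
    // If a `master` environment variable is present (e.g. master=local from the IDE Run/Debug
    // configuration), run in-process; otherwise rely on the master supplied by spark-submit
    val spark = sys.env.get("master").fold(builder)(builder.master).getOrCreate()
    println(s"Running with master = ${spark.sparkContext.master}")
    spark.stop()
  }
}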
- Create a key pair using the EC2 console
- IMPORTANT - the key name must be aws-emr-key as it is used in most util scripts in the project
- Save aws-emr-key.pem into your Mac's home directory and copy it to the ${T2_MICRO}:~/aws-emr-key.pem location
- Create a Free Tier t2.micro Amazon Linux 2 AMI (HVM), SSD Volume Type instance using the wizard.
  - This host will be referred to as ${T2_MICRO} in this document. It will be your gateway into AWS. You will deploy and run the test harness from this node.
  - A t2.micro instance includes 1 CPU, 1 GB RAM, and transient Elastic Block Storage (EBS)
- Log in to your Free Tier gateway
ssh -i ~/aws-emr-key.pem ${T2_MICRO}
- Create an emr-create-id identity and an emr-create-group group using Identity and Access Management (IAM)
  - Select Programmatic Access for Access Type
  - While creating the emr-create-group group, ensure that you attach the AmazonElasticMapReduceFullAccess policy to it so that it is able to manipulate EMR
  - IMPORTANT - Save the Access Key & Secret Access Key during the process; you WON'T be able to retrieve them later
- Execute aws configure on the ${T2_MICRO} node and enter the Access Key and Secret Access Key from the emr-create-id identity
  - Assign us-east-2 and json for the region and output format respectively.
  - For more on CLI configuration review the AWS Command Line Interface (CLI) documentation
  - IMPORTANT - You must choose json as the output format for the utility scripts in the Test Harness to work
- Network Security Groups
  - Using the Security Groups UI, ensure that under Inbound Rules of the ElasticMapReduce-slave & ElasticMapReduce-master security groups you add a rule with type All TCP, port range 0-65535, and Source set to the security group of ${T2_MICRO}
- Run the steps below to install the tools used by the Test Harness
# install git
sudo yum -y install git
# install anaconda
rm -rf Anaconda2-2019.07-Linux-x86_64.sh
rm -rf ./anaconda2
wget https://repo.continuum.io/archive/Anaconda2-2019.07-Linux-x86_64.sh
bash Anaconda2-2019.07-Linux-x86_64.sh -b -p ./anaconda2
./anaconda2/condabin/conda init bash
source .bashrc
# install apache maven
sudo wget http://repos.fedorapeople.org/repos/dchen/apache-maven/epel-apache-maven.repo -O /etc/yum.repos.d/epel-apache-maven.repo
sudo sed -i s/\$releasever/6/g /etc/yum.repos.d/epel-apache-maven.repo
sudo yum install -y apache-maven
# maven installs with jdk 1.7 so fix env to point to jdk 1.8
sudo yum install -y java-1.8.0-devel
sudo /usr/sbin/alternatives --set java /usr/lib/jvm/java-1.8.0-openjdk.x86_64/bin/java
sudo /usr/sbin/alternatives --set javac /usr/lib/jvm/java-1.8.0-openjdk.x86_64/bin/javac
# download test harness
rm -rf ./spark-test-harness
git clone https://github.com/alexshagiev/spark-test-harness.git
cd spark-test-harness
# conda env create -f ./utils/conda.recipe/test-harness.yml -- fails with out of memory on free tier.
# hence the work around
conda create --yes --name test-harness python=3.6
conda activate test-harness
conda install --yes -c conda-forge pyhocon
conda install --yes -c conda-forge pyarrow
conda install --yes -c conda-forge hdfs3
conda install --yes -c conda-forge tqdm
- You MUST activate the correct conda env via
conda activate test-harness
mvn -DskipTests package exec:exec@run-test-aws-emr
- This will create the cluster, populate it with data, and run the Test Harness
  - Modify the value of <argument>--core-nodes</argument> of the exec:exec@run-test-aws-emr plugin to change the number of CORE nodes added to the fleet during instantiation. The default is 4 nodes, each with 4 vCores, hence a total of 16 cores.
- The following command can be used to populate data on the existing EMR cluster.
python ./utils/aws_emr_spot_cluster.py --cluster-id j-SxxxxxxxxxX --populate-hdfs=true
- The following command can be used to run the Spark job on the existing EMR cluster.
python ./utils/aws_emr_spot_cluster.py --cluster-id j-SxxxxxxxxxX --spark-submit=true --spark-jar ./target/spark-test-harness-*.jar
- IMPORTANT - remember to terminate your cluster once the job is completed to avoid unnecessary charges.
- Install FoxyProxy extension in your Chrome Browser
- Configure FoxyProxy as follows. The latest proxy settings & plugin instructions are available here
<?xml version="1.0" encoding="UTF-8"?>
<foxyproxy>
  <proxies>
    <proxy name="emr-socks-proxy" id="2322596116" notes="" fromSubscription="false" enabled="true" mode="manual" selectedTabIndex="2" lastresort="false" animatedIcons="true" includeInCycle="true" color="#0055E5" proxyDNS="true" noInternalIPs="false" autoconfMode="pac" clearCacheBeforeUse="false" disableCache="false" clearCookiesBeforeUse="false" rejectCookies="false">
      <matches>
        <match enabled="true" name="*ec2*.amazonaws.com*" pattern="*ec2*.amazonaws.com*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false" />
        <match enabled="true" name="*ec2*.compute*" pattern="*ec2*.compute*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false" />
        <match enabled="true" name="10.*" pattern="http://10.*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false" />
        <match enabled="true" name="*10*.amazonaws.com*" pattern="*10*.amazonaws.com*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false" />
        <match enabled="true" name="*10*.compute*" pattern="*10*.compute*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false" />
        <match enabled="true" name="*.compute.internal*" pattern="*.compute.internal*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false"/>
        <match enabled="true" name="*.ec2.internal*" pattern="*.ec2.internal*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false"/>
      </matches>
      <manualconf host="localhost" port="8157" socksversion="5" isSocks="true" username="" password="" domain="" />
    </proxy>
  </proxies>
</foxyproxy>
- Once your cluster is running, navigate to your desired cluster and click the SSH link next to the Master Public DNS Name. This will give you a prepopulated script of the following form; run it on your local Mac
ssh -i ~/aws-emr-key.pem -ND 8157 hadoop@${MASTER_PUBLIC_DNS_NAME}
- The EMR Management console should now have the web links under the Connections: section enabled, which will take you directly to the HDFS, Spark, and YARN UIs
- Observations:
- Based on the local Mac test, it appears that YARN introduces very minimal scheduling overhead. Things do slow down on local YARN compared to the Spark internal scheduler, but the slowdown is explained by, and proportional to, the vCore lost to processing, since YARN allocates a single vCore to the application master, which runs in client mode.
- Also, interestingly enough, it is possible to achieve comparable performance between premium Mac hardware with a solid-state drive and AWS general-purpose m5.large hardware with magnetic disk
- It also seems that read and write each take roughly 15% of the overall processing time, compared to 70% for parsing JSON. JSON parsing is therefore naturally the area to optimize (a timing sketch is shown below).
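To illustrate how such a breakdown can be obtained, the sketch below times the read, parse, and write phases separately with simple wall-clock measurements. The percentages above come from the harness's own runs; the paths and the timing helper here are hypothetical stand-ins, and Spark's built-in JSON reader is used for brevity rather than the JSONL Dataframe parser.
// Hypothetical timing sketch; the harness's own instrumentation may differ
import org.apache.spark.sql.SparkSession

object PhaseTimingSketch {
  private def timed[T](label: String)(block: => T): T = {
    val start = System.nanoTime()
    val result = block
    println(f"$label took ${(System.nanoTime() - start) / 1e9}%.2f s")
    result
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("phase-timing-sketch").getOrCreate()
    import spark.implicits._

    // read: pull the raw JSONL text from HDFS and materialize it (hypothetical path)
    val raw = timed("read") {
      val rdd = spark.sparkContext.textFile("hdfs:///user/test-harness/data/l0/jsonl")
      rdd.cache()
      rdd.count()
      rdd
    }

    // parse: convert the JSON text into a DataFrame
    val df = timed("parse") {
      val parsed = spark.read.json(raw.toDS())
      parsed.cache()
      parsed.count()
      parsed
    }

    // write: save the parsed result back to HDFS (hypothetical path)
    timed("write") {
      df.write.mode("overwrite").parquet("hdfs:///user/test-harness/data/out/timing")
    }
    spark.stop()
  }
}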
- Full results can be found in the Google Sheet with results here
- For the definition of scenarios refer to the application.conf#scenarios section