Spark ETL for E-commerce
Domain:
An e-commerce site with products divided into categories such as toys, electronics, etc. We receive events such as a product was seen (impression), a product page was opened, a product was purchased, etc.
Task #1:
Enrich incoming data with sessions.
Definition of a session: it contains consecutive events that belong to a single category and are no more than 5 minutes apart. The output should look like this (session columns are in bold): eventTime, eventType, category, userId, ..., **sessionId**, **sessionStartTime**, **sessionEndTime**
Implement it using:
- SQL window functions, and
- a Spark Aggregator.
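For illustration, here is a minimal sketch of the window-function approach. It is not the repo's exact code: it assumes `events` is a DataFrame with the columns listed above, that `eventTime` is a timestamp, and that sessions are tracked per user within a category.

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// A session starts on a user's first event in a category, or after a gap > 5 min (300 s).
val byUserCategory = Window.partitionBy("userId", "category").orderBy("eventTime")

val flagged = events
  .withColumn("prevTime", lag("eventTime", 1).over(byUserCategory))
  .withColumn("newSession",
    when(col("prevTime").isNull ||
         unix_timestamp(col("eventTime")) - unix_timestamp(col("prevTime")) > 300, 1)
      .otherwise(0))

// A running sum of the start flags numbers the sessions within each partition.
val numbered = flagged
  .withColumn("sessionSeq", sum("newSession").over(byUserCategory))
  .withColumn("sessionId",
    concat_ws("-", col("userId"), col("category"), col("sessionSeq").cast("string")))

// Session boundaries are the min/max event times over the whole session.
val bySession = Window.partitionBy("sessionId")
val sessionized = numbered
  .withColumn("sessionStartTime", min("eventTime").over(bySession))
  .withColumn("sessionEndTime", max("eventTime").over(bySession))
  .drop("prevTime", "newSession", "sessionSeq")
```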
Task #2:
Compute the following statistics:
- For each category, find the median session duration.
- For each category, find the number of unique users spending less than 1 min, 1 to 5 mins, and more than 5 mins.
- For each category, find the top 10 products ranked by the time users spend on their product pages.
- This may require a different type of session: for this task, a session lasts as long as the user is looking at a particular product, and a new session starts when the user switches to another product.
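A sketch of the first two statistics in pure SQL, assuming the Task #1 output is registered as a temp view named `sessions` and the session time columns are timestamps (`approx_percentile(..., 0.5)` approximates the median; all names here are illustrative):

```scala
// One row per session, for duration-based statistics.
val sessionDurations = spark.sql("""
  SELECT DISTINCT category, userId, sessionId,
         unix_timestamp(sessionEndTime) - unix_timestamp(sessionStartTime) AS durationSec
  FROM sessions
""")
sessionDurations.createOrReplaceTempView("session_durations")

// Median session duration per category.
val medianByCategory = spark.sql("""
  SELECT category, approx_percentile(durationSec, 0.5) AS medianDurationSec
  FROM session_durations
  GROUP BY category
""")

// Unique users per duration bucket.
val usersByBucket = spark.sql("""
  SELECT category, bucket, count(DISTINCT userId) AS uniqueUsers
  FROM (
    SELECT category, userId,
           CASE WHEN durationSec < 60   THEN 'less than 1 min'
                WHEN durationSec <= 300 THEN '1 to 5 mins'
                ELSE 'more than 5 mins' END AS bucket
    FROM session_durations
  ) t
  GROUP BY category, bucket
""")
```

For the top-10 statistic the sessions are product-level, so a separate window pass is needed. The `productId` column below is hypothetical (presumably among the elided "..." columns of the schema):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// A new product session starts whenever the user switches to another product.
// (Optionally filter to product-page events first.)
val byUser = Window.partitionBy("userId").orderBy("eventTime")
val productSessions = events
  .withColumn("prevProduct", lag("productId", 1).over(byUser))
  .withColumn("newSession",
    when(col("prevProduct").isNull || col("prevProduct") =!= col("productId"), 1).otherwise(0))
  .withColumn("productSessionSeq", sum("newSession").over(byUser))

// Time spent in a product session = last event time - first event time.
val timeSpent = productSessions
  .groupBy("category", "productId", "userId", "productSessionSeq")
  .agg((max(unix_timestamp(col("eventTime"))) -
        min(unix_timestamp(col("eventTime")))).as("timeSec"))

// Top 10 products per category by total time spent across all users.
val rankByCategory = Window.partitionBy("category").orderBy(col("totalTimeSec").desc)
val top10 = timeSpent
  .groupBy("category", "productId")
  .agg(sum("timeSec").as("totalTimeSec"))
  .withColumn("rank", row_number().over(rankByCategory))
  .filter(col("rank") <= 10)
```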
General notes:
- Ideally, the tasks should be implemented in pure SQL on top of the Spark DataFrame API.
- Spark version 2.2 or higher
Implementation:
The implementation uses Spark's window functionality, imported via:
import org.apache.spark.sql.functions._
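Task #1 also asks for a Spark Aggregator variant. A minimal typed sketch is shown below; the case classes are hypothetical stand-ins for the real schema (the "..." columns are omitted), and Kryo encoders are used purely for brevity:

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

case class Event(userId: String, category: String, eventType: String, eventTime: Long)
case class Sessionized(event: Event, sessionStartTime: Long, sessionEndTime: Long)

// Buffers all events of one (userId, category) group, sorts them by time,
// and cuts a new session whenever the gap to the previous event exceeds 300 s.
object SessionizeAgg extends Aggregator[Event, List[Event], Seq[Sessionized]] {
  def zero: List[Event] = Nil
  def reduce(buf: List[Event], e: Event): List[Event] = e :: buf
  def merge(a: List[Event], b: List[Event]): List[Event] = a ::: b

  def finish(buf: List[Event]): Seq[Sessionized] = {
    val sorted = buf.sortBy(_.eventTime)
    if (sorted.isEmpty) Nil
    else {
      // Each session is accumulated in reverse order while folding.
      val sessions = sorted.tail.foldLeft(List(List(sorted.head))) { (acc, e) =>
        if (e.eventTime - acc.head.head.eventTime > 300) List(e) :: acc
        else (e :: acc.head) :: acc.tail
      }
      sessions.reverse.flatMap { s =>
        val times = s.map(_.eventTime)
        s.reverse.map(e => Sessionized(e, times.min, times.max))
      }
    }
  }

  def bufferEncoder: Encoder[List[Event]] = Encoders.kryo[List[Event]]
  def outputEncoder: Encoder[Seq[Sessionized]] = Encoders.kryo[Seq[Sessionized]]
}

// Usage with the typed Dataset API (requires import spark.implicits._):
// events.as[Event].groupByKey(e => (e.userId, e.category)).agg(SessionizeAgg.toColumn)
```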
To build the project, run:
sbt compile
sbt assembly
To test, run the assembled jar with the task selector as its argument.
For Task #1:
java -jar target/scala-2.11/Example-Spark-ETL-for-ecommerce-2.11.12-0.0.1-SNAPSHOT.jar "_1"
For Task #2, the median session duration and unique users grouped by duration:
java -jar target/scala-2.11/Example-Spark-ETL-for-ecommerce-2.11.12-0.0.1-SNAPSHOT.jar "_2"
For Task #2, the top 10 products by category:
java -jar target/scala-2.11/Example-Spark-ETL-for-ecommerce-2.11.12-0.0.1-SNAPSHOT.jar "_3"
The parameter "_1" is the default.