Testing Spark RDD and DataFrame/Dataset APIs

Exercise description

Create simple Scala applications. The applications only need to work in Spark local mode and can be launched directly from an IDE (except, perhaps, saving to Parquet, which may require additional fixes on Windows).

Implement the following applications (each in a separate main function):

1. Create an RDD from user.txt, filter out users with valid=0, and select only the id and name fields (use the RDD API only). Create a DataFrame or Dataset from car.txt, filter out cars with valid=0, and select only the id, model and user_id fields (use the DataFrame/Dataset API only). Join those users and cars by user_id. Save the result to Parquet and CSV files.
2. Load car.txt into an RDD. Group by the type field and compute the average of the number field for each type (use the RDD API only). Save the result to CSV or print it to the console.
3. Load car.txt into a DataFrame/Dataset. Group by the type field and compute the avg, min and max of the number field for each type (use the DataFrame/Dataset API only). Save the result to CSV or print it to the console.
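
The three tasks above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual solution. The object name ExerciseSketch, the case classes User and Car, and their field order are assumptions, because this README does not describe the layout of user.txt and car.txt. For the same reason the sketch uses a few in-memory sample records instead of reading the files, and the output paths in the commented-out write calls are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, col, max, min}

object ExerciseSketch {
  // Assumed record layouts; the real columns of user.txt and car.txt may differ.
  final case class User(id: Long, name: String, valid: Int)
  final case class Car(id: Long, model: String, carType: String,
                       number: Double, userId: Long, valid: Int)

  // Task 1, RDD half: drop users with valid = 0 and keep only (id, name).
  def validUsers(users: RDD[User]): RDD[(Long, String)] =
    users.filter(_.valid != 0).map(u => (u.id, u.name))

  // Task 1, DataFrame half: drop cars with valid = 0 and keep id, model, user_id.
  def validCars(cars: DataFrame): DataFrame =
    cars.filter(col("valid") =!= 0).select("id", "model", "user_id")

  // Task 2: average of `number` per type, using the RDD API only.
  // Carry (sum, count) per key so the average is computed in a single shuffle.
  def avgNumberByType(cars: RDD[Car]): RDD[(String, Double)] =
    cars
      .map(c => (c.carType, (c.number, 1L)))
      .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
      .mapValues { case (sum, count) => sum / count }

  // Task 3: avg, min and max of `number` per type, using the DataFrame API only.
  def numberStatsByType(cars: DataFrame): DataFrame =
    cars.groupBy(col("type")).agg(
      avg("number").as("avg_number"),
      min("number").as("min_number"),
      max("number").as("max_number"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("exercise-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample data standing in for user.txt and car.txt.
    val users = spark.sparkContext.parallelize(
      Seq(User(1L, "Ann", 1), User(2L, "Bob", 0)))
    val cars = Seq(
      Car(10L, "Golf", "hatchback", 4.0, 1L, 1),
      Car(11L, "Polo", "hatchback", 6.0, 1L, 1),
      Car(12L, "Passat", "sedan", 9.0, 2L, 1),
      Car(13L, "Jetta", "sedan", 3.0, 1L, 0))
    val carsRdd = spark.sparkContext.parallelize(cars)
    val carsDf = cars.toDF()
      .withColumnRenamed("carType", "type")
      .withColumnRenamed("userId", "user_id")

    // Task 1: turn the filtered user RDD into a DataFrame and join on user_id.
    val usersDf = validUsers(users).toDF("user_id", "name")
    val joined = validCars(carsDf).join(usersDf, "user_id")
    joined.show()
    // Hypothetical output paths; uncomment to write files.
    // joined.write.mode("overwrite").parquet("out/joined.parquet")
    // joined.write.mode("overwrite").option("header", "true").csv("out/joined.csv")

    avgNumberByType(carsRdd).collect().foreach(println) // task 2
    numberStatsByType(carsDf).show()                    // task 3
    spark.stop()
  }
}
```

The exercise asks for a separate main function per application; the single main here only keeps the sketch short. In task 2 each record is mapped to a (sum, count) pair and combined with reduceByKey, which avoids collecting all values of a type in memory the way groupByKey would.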

Run

sbt run

Check results

All results are stored in:

/resourses/exercises/answers/
