SFaker is one data generator. It implemented with Spark DataSourceV2. SFaker can generate rows according to specified schemas.
Feature
Support Batch
✅
Support Stream
TBD
Support DataFrameReader API
✅
Support Spark SQL Create Statement
✅
Support Unsafe Row
✅
Support Codegen
✅
Support Limit Push Down
✅
Support Columns Pruning
✅
Support spark sql types, more details about types click here .
Spark Type
Byte
✅
Short
✅
Integer
✅
Long
✅
Float
✅
Double
✅
Decimal
TBD
String
✅
Varchar
TBD
Char
TBD
Binary
TBD
Boolean
✅
Date
TBD
Timestamp
TBD
TimestampNTZ
TBD
YearMonthInterval
TBD
DayTimeInterval
TBD
Array
✅
Map
✅
Struct
✅
Conf
Type
Default
Description
spark.sql.fake.source.unsafe.row.enable
Boolean
false
If true
, all row generated will been stored in UnsafeRow
.
spark.sql.fake.source.unsafe.codegen.enable
Boolean
false
If true
, the row-generated process, which produce rows according to schema, will been executed in JIT mode.
spark.sql.fake.source.partitions
Integer
1
Number of source partitions.
spark.sql.fake.source.rowsTotalSize
Integer
8
Number of rows generated according to schema.
val schema = new StructType ()
.add(" id" , DataTypes .IntegerType )
.add(" sex" , DataTypes .BooleanType )
.add(" roles" , DataTypes .createArrayType(DataTypes .StringType ));
val df = spark.read
.format(" FakeSource" )
.schema(schema)
.option(FakeSourceProps .CONF_ROWS_TOTAL_SIZE , 100 )
.option(FakeSourceProps .CONF_PARTITIONS , 1 )
.option(FakeSourceProps .CONF_UNSAFE_ROW_ENABLE , true )
.option(FakeSourceProps .CONF_UNSAFE_CODEGEN_ENABLE , true )
.load();
Spark SQL Create Statement
val spark = SparkSession
.builder()
.master(" local[*]" )
.appName(" Case0" )
.config(
" spark.sql.catalog.spark_catalog" ,
classOf [FakeSourceCatalog ].getName
)
.getOrCreate();
val df = spark.sql("""
|create table fake (
| id int,
| sex boolean
|)
|using FakeSource
|tblproperties (
|spark.sql.fake.source.rowsTotalSize = 10000000,
|spark.sql.fake.source.partitions = 1,
|spark.sql.fake.source.unsafe.row.enable = true,
|spark.sql.fake.source.unsafe.codegen.enable = true
|)
|""" .stripMargin)
spark.sql(" select id from fake limit 10" ).explain(true );
Apache 2.0 License.