Configuration Reference

Cassandra Authentication Parameters

All parameters should be prefixed with spark.cassandra.

| Property Name | Default | Description |
| --- | --- | --- |
| `auth.conf.factory` | `DefaultAuthConfFactory` | Name of a Scala module or class implementing `AuthConfFactory` that provides custom authentication configuration |
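
The prefix rule applies to every table in this reference. As a minimal sketch (the helper names `qualify` and `to_submit_args` are hypothetical, not part of the connector), short names from the tables can be expanded to full property names and turned into `spark-submit --conf` arguments like this:

```python
PREFIX = "spark.cassandra."


def qualify(short_name: str) -> str:
    """Return the full property name for a short name taken from the tables."""
    # Names that already carry the prefix are returned unchanged.
    if short_name.startswith(PREFIX):
        return short_name
    return PREFIX + short_name


def to_submit_args(settings: dict) -> list:
    """Build a list of `--conf key=value` arguments for spark-submit."""
    args = []
    for name, value in settings.items():
        args += ["--conf", f"{qualify(name)}={value}"]
    return args
```

For example, `to_submit_args({"connection.host": "127.0.0.1"})` yields the two arguments `--conf` and `spark.cassandra.connection.host=127.0.0.1`.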

Cassandra Connection Parameters

All parameters should be prefixed with spark.cassandra.

| Property Name | Default | Description |
| --- | --- | --- |
| `connection.compression` | | Compression to use (`LZ4`, `SNAPPY` or `NONE`) |
| `connection.factory` | `DefaultConnectionFactory` | Name of a Scala module or class implementing `CassandraConnectionFactory` that provides connections to the Cassandra cluster |
| `connection.host` | `localhost` | Contact point to connect to the Cassandra cluster. A comma-separated list may also be used (`"127.0.0.1,192.168.0.1"`) |
| `connection.keep_alive_ms` | `5000` | Period of time to keep unused connections open |
| `connection.local_dc` | None | The local DC to connect to (other nodes will be ignored) |
| `connection.port` | `9042` | Cassandra native connection port |
| `connection.reconnection_delay_ms.max` | `60000` | Maximum period of time to wait before reconnecting to a dead node |
| `connection.reconnection_delay_ms.min` | `1000` | Minimum period of time to wait before reconnecting to a dead node |
| `connection.timeout_ms` | `5000` | Maximum period of time to attempt connecting to a node |
| `query.retry.count` | `10` | Number of times to retry a timed-out query |
| `query.retry.delay` | `4 * 1.5` | Delay between subsequent retries (can be constant, like `1000`; linearly increasing, like `1000+100`; or exponential, like `1000*2`) |
| `read.timeout_ms` | `120000` | Maximum period of time to wait for a read to return |
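
The three forms accepted by `query.retry.delay` can be illustrated with a small calculation. This is a sketch under assumptions, not the connector's parser: it assumes `a+b` means "start at `a`, add `b` after each retry" and `a*b` means "start at `a`, multiply by `b` after each retry". The table does not state the unit, so the numbers are left unitless. The function name `retry_delays` is hypothetical.

```python
import re

# One number, optionally followed by "+" or "*" and a second number.
_SPEC = re.compile(r"(\d+(?:\.\d+)?)(?:([+*])(\d+(?:\.\d+)?))?")


def retry_delays(spec: str, retries: int) -> list:
    """Return the assumed delay before each of `retries` retries."""
    match = _SPEC.fullmatch(spec.replace(" ", ""))
    if match is None:
        raise ValueError(f"unrecognised delay specification: {spec!r}")
    start = float(match.group(1))
    op = match.group(2)
    step = float(match.group(3)) if op else 0.0

    delays = []
    for i in range(retries):
        if op == "*":
            delays.append(start * step ** i)   # exponential growth
        elif op == "+":
            delays.append(start + step * i)    # linear growth
        else:
            delays.append(start)               # constant delay
    return delays
```

Under these assumptions, the default `4 * 1.5` gives a sequence that starts at 4 and grows by half on each retry.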

Cassandra DataFrame Source Parameters

All parameters should be prefixed with spark.cassandra.

| Property Name | Default | Description |
| --- | --- | --- |
| `sql.pushdown.additionalClasses` | | A comma-separated list of classes to be used (in order) to apply additional pushdown rules for C* DataFrames. Classes must implement `CassandraPredicateRules` |
| `table.size.in.bytes` | None | Used by DataFrames internally. Will be updated in a future release to retrieve the size from C*; can be set manually for now |

Cassandra SQL Context Options

All parameters should be prefixed with spark.cassandra.

| Property Name | Default | Description |
| --- | --- | --- |
| `sql.cluster` | `default` | Sets the default cluster to inherit configuration from |

Cassandra SSL Connection Options

All parameters should be prefixed with spark.cassandra.

| Property Name | Default | Description |
| --- | --- | --- |
| `connection.ssl.clientAuth.enabled` | `false` | Enable 2-way secure connection to Cassandra cluster |
| `connection.ssl.enabled` | `false` | Enable secure connection to Cassandra cluster |
| `connection.ssl.enabledAlgorithms` | `Set(TLS_RSA_WITH_AES_128_CBC_SHA, TLS_RSA_WITH_AES_256_CBC_SHA)` | SSL cipher suites |
| `connection.ssl.keyStore.password` | None | Key store password |
| `connection.ssl.keyStore.path` | None | Path for the key store being used |
| `connection.ssl.keyStore.type` | `JKS` | Key store type |
| `connection.ssl.protocol` | `TLS` | SSL protocol |
| `connection.ssl.trustStore.password` | None | Trust store password |
| `connection.ssl.trustStore.path` | None | Path for the trust store being used |
| `connection.ssl.trustStore.type` | `JKS` | Trust store type |
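
As an illustration of how the SSL options combine, the sketch below assembles the properties for a one-way or two-way secure connection. The helper `ssl_settings` is hypothetical, and the choice to switch on `clientAuth.enabled` whenever a key store is supplied is an assumption of this example rather than documented connector behaviour.

```python
SSL_PREFIX = "spark.cassandra.connection.ssl."


def ssl_settings(trust_store_path, trust_store_password,
                 key_store_path=None, key_store_password=None):
    """Return a dict of fully prefixed SSL properties."""
    settings = {
        SSL_PREFIX + "enabled": "true",
        SSL_PREFIX + "trustStore.path": trust_store_path,
        SSL_PREFIX + "trustStore.password": trust_store_password,
    }
    # Assumption: a key store is only needed for 2-way (client) authentication.
    if key_store_path is not None:
        settings[SSL_PREFIX + "clientAuth.enabled"] = "true"
        settings[SSL_PREFIX + "keyStore.path"] = key_store_path
        if key_store_password is not None:
            settings[SSL_PREFIX + "keyStore.password"] = key_store_password
    return settings
```

Properties left out of the dict, such as `connection.ssl.protocol` and the store types, keep the defaults listed in the table.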

Custom Cassandra Type Parameters (Expert Use Only)

All parameters should be prefixed with spark.cassandra.

| Property Name | Default | Description |
| --- | --- | --- |
| `dev.customFromDriver` | None | Provides an additional class implementing `CustomDriverConverter` for clients that need to read non-standard primitive Cassandra types. You may need this if your C* implementation uses a Java Driver that can read `DataType.custom()`. If you are using OSS Cassandra, this should never be used. |

Read Tuning Parameters

All parameters should be prefixed with spark.cassandra.

| Property Name | Default | Description |
| --- | --- | --- |
| `input.consistency.level` | `LOCAL_ONE` | Consistency level to use when reading |
| `input.fetch.size_in_rows` | `1000` | Number of CQL rows fetched per driver request |
| `input.join.throughput_query_per_sec` | `9223372036854775807` | Maximum read throughput allowed per single core, in queries per second, while joining an RDD with a C* table |
| `input.metrics` | `true` | Sets whether to record connector-specific metrics on read |
| `input.split.size_in_mb` | `64` | Approximate amount of data to be fetched into a Spark partition. The minimum number of resulting Spark partitions is `1 + 2 * SparkContext.defaultParallelism` |
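
The partition floor quoted for `input.split.size_in_mb` is simple arithmetic. The sketch below computes it and, as an assumption of this example only, combines it with a size-based estimate (table size divided by split size, rounded up). The function names are hypothetical and the estimate is not a statement of how the connector actually plans splits.

```python
import math


def min_partitions(default_parallelism: int) -> int:
    """Floor stated in the table: 1 + 2 * SparkContext.defaultParallelism."""
    return 1 + 2 * default_parallelism


def estimated_partitions(table_size_mb: float, default_parallelism: int,
                         split_size_in_mb: int = 64) -> int:
    """Rough estimate: size-based split count, never below the floor."""
    by_size = math.ceil(table_size_mb / split_size_in_mb)
    return max(by_size, min_partitions(default_parallelism))
```

With a default parallelism of 8, the floor is 17 partitions, so a 640 MB table (10 splits of 64 MB) would still be read as at least 17 partitions under this estimate.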

Write Tuning Parameters

All parameters should be prefixed with spark.cassandra.

| Property Name | Default | Description |
| --- | --- | --- |
| `output.batch.grouping.buffer.size` | `1000` | How many batches per single Spark task can be stored in memory before sending to Cassandra |
| `output.batch.grouping.key` | `Partition` | Determines how insert statements are grouped into batches. Available values are `none` (a batch may contain any statements), `replica_set` (a batch may contain only statements to be written to the same replica set) and `partition` (a batch may contain only statements for rows sharing the same partition key value) |
| `output.batch.size.bytes` | `1024` | Maximum total size of the batch in bytes. Overridden by `spark.cassandra.output.batch.size.rows` |
| `output.batch.size.rows` | None | Number of rows per single batch. The default is 'auto', which means the connector adjusts the number of rows based on the amount of data in each row |
| `output.concurrent.writes` | `5` | Maximum number of batches executed in parallel by a single Spark task |
| `output.consistency.level` | `LOCAL_QUORUM` | Consistency level for writing |
| `output.ifNotExists` | `false` | If true, the INSERT operation is not performed when a row with the same primary key already exists. Using this feature incurs a performance hit. |
| `output.ignoreNulls` | `false` | In Cassandra >= 2.2 null values can be left as unset in bound statements. Setting this to true causes all null values to be left as unset rather than bound. For finer control see the `CassandraOption` class |
| `output.metrics` | `true` | Sets whether to record connector-specific metrics on write |
| `output.throughput_mb_per_sec` | `2.147483647E9` | *(Floating points allowed)* Maximum write throughput allowed per single core in MB/s. On long (8+ hour) runs, limit this to 70% of the max throughput seen on a smaller job, for stability |
| `output.timestamp` | `0` | Timestamp (microseconds since epoch) of the write. If not specified, the time that the write occurred is used. A value of 0 means time of write. |
| `output.ttl` | `0` | Time To Live (TTL) assigned to writes to Cassandra. A value of 0 means no TTL |
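
Two of the write options take values that are easy to get wrong by hand: `output.timestamp` expects microseconds since the epoch, and `output.throughput_mb_per_sec` is recommended at 70% of a measured maximum for long runs. The following sketch computes both. The helper names and the measured figure used in the usage note are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_micros(moment: datetime) -> int:
    """Convert a timezone-aware datetime to microseconds since the epoch."""
    # Integer division of timedeltas avoids floating-point rounding.
    return (moment - EPOCH) // timedelta(microseconds=1)


def throttled_throughput(measured_max_mb_per_sec: float) -> float:
    """Return 70% of the max throughput measured on a smaller job."""
    return measured_max_mb_per_sec * 70 / 100


def write_settings(moment: datetime, measured_max_mb_per_sec: float) -> dict:
    """Assemble the two fully prefixed write properties."""
    return {
        "spark.cassandra.output.timestamp": to_micros(moment),
        "spark.cassandra.output.throughput_mb_per_sec":
            throttled_throughput(measured_max_mb_per_sec),
    }
```

If a smaller job peaked at 50 MB/s per core, `throttled_throughput(50)` suggests a limit of 35 MB/s for the long run.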