feat: SQL Server data source #2018
Conversation
```rust
let mut columns = Vec::with_capacity(schema.fields.len());
for (col_idx, field) in schema.fields.iter().enumerate() {
    let col: Arc<dyn Array> = match field.data_type() {
        DataType::Boolean => make_column!(BooleanBuilder, rows, col_idx),
        DataType::Int16 => make_column!(Int16Builder, rows, col_idx),
        DataType::Int32 => make_column!(Int32Builder, rows, col_idx),
        DataType::Int64 => {
            let mut arr = Int64Builder::with_capacity(rows.len());
            for row in rows.iter() {
                let val: Option<Intn> = row.try_get(col_idx)?;
                arr.append_option(val.map(|v| v.0));
            }
            Arc::new(arr.finish())
        }
        DataType::Float32 => make_column!(Float32Builder, rows, col_idx),
        DataType::Float64 => make_column!(Float64Builder, rows, col_idx),
        DataType::Utf8 => {
            // Assumes an average of 16 bytes per item.
            let mut arr = StringBuilder::with_capacity(rows.len(), rows.len() * 16);
            for row in rows.iter() {
                let val: Option<&str> = row.try_get(col_idx)?;
                arr.append_option(val);
            }
            Arc::new(arr.finish())
        }
        DataType::Binary => {
            // Assumes an average of 16 bytes per item.
            let mut arr = BinaryBuilder::with_capacity(rows.len(), rows.len() * 16);
            for row in rows.iter() {
                let val: Option<&[u8]> = row.try_get(col_idx)?;
                arr.append_option(val);
            }
            Arc::new(arr.finish())
        }
        DataType::Timestamp(TimeUnit::Nanosecond, None) => {
            let mut arr = TimestampNanosecondBuilder::with_capacity(rows.len());
            for row in rows.iter() {
                let val: Option<NaiveDateTime> = row.try_get(col_idx)?;
                let val = val.map(|v| v.timestamp_nanos_opt().unwrap());
                arr.append_option(val);
            }
            Arc::new(arr.finish())
        }
        // TODO: All the others...
        // Tiberius mapping: <https://docs.rs/tiberius/latest/tiberius/trait.FromSql.html>
        other => {
            return Err(SqlServerError::String(format!(
                "unsupported data type for sql server: {other}"
            )))
        }
    };
    columns.push(col);
}
```
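The `make_column!` macro used in the fixed-width arms isn't shown in this hunk. Its shape can be inferred from the explicit `Int64` arm (build with capacity, `try_get` per row, append, finish). Below is a hedged, self-contained sketch of that pattern; `MockRow` and the mock `Int32Builder` are hypothetical stand-ins for `tiberius::Row` and the arrow builders, not code from the PR:

```rust
// Hypothetical stand-in for tiberius::Row, just enough to exercise the macro.
struct MockRow(Vec<Option<i32>>);

impl MockRow {
    fn try_get(&self, idx: usize) -> Result<Option<i32>, String> {
        self.0.get(idx).cloned().ok_or_else(|| "no such column".to_string())
    }
}

// Hypothetical stand-in for arrow's Int32Builder.
struct Int32Builder(Vec<Option<i32>>);

impl Int32Builder {
    fn with_capacity(cap: usize) -> Self {
        Int32Builder(Vec::with_capacity(cap))
    }
    fn append_option(&mut self, v: Option<i32>) {
        self.0.push(v);
    }
    fn finish(self) -> Vec<Option<i32>> {
        self.0
    }
}

// Sketch of the make_column! pattern: one builder, one try_get per row.
// The PR presumably wraps the result in Arc::new(..) to get an Arc<dyn Array>.
macro_rules! make_column {
    ($builder:ty, $rows:expr, $col_idx:expr) => {{
        let mut arr = <$builder>::with_capacity($rows.len());
        for row in $rows.iter() {
            let val = row.try_get($col_idx)?;
            arr.append_option(val);
        }
        arr.finish()
    }};
}

fn build(rows: &[MockRow], col_idx: usize) -> Result<Vec<Option<i32>>, String> {
    Ok(make_column!(Int32Builder, rows, col_idx))
}

fn main() {
    let rows = vec![MockRow(vec![Some(1)]), MockRow(vec![None])];
    let col = build(&rows, 0).unwrap();
    assert_eq!(col, vec![Some(1), None]);
    println!("{:?}", col);
}
```

The macro keeps the per-type arms to one line each; types whose Rust representation differs from the builder's input (e.g. `Intn`, `NaiveDateTime`) fall back to the hand-written loops above.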
It looks like we could easily parallelize this using rayon's `par_iter`. (idk if tokio has an equivalent.)
Possibly, though it would require us to manage a thread pool. Also, we'd probably want to do higher-level parallelization by working with multiple partitions.
I would think we'd want both. Parallelizing across partitions helps with larger datasets that span several partitions, and lends itself well to distributed execution. Without parallelizing the lower-level work, we'll likely hit slower compute once we distribute the partitions, since each partition would still process each column sequentially.

This is definitely not blocking PR approval, just something to think about. (It looks like our other SQL readers follow the same pattern anyway.)
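Since columns are decoded independently, per-column parallelism is straightforward. A minimal sketch using only `std::thread::scope` (rayon's `par_iter` would collapse this to a one-liner); `decode_column` is a hypothetical stand-in for the `match` over `DataType` above:

```rust
use std::thread;

// Hypothetical stand-in for decoding one column out of row-oriented data.
fn decode_column(rows: &[Vec<i64>], col_idx: usize) -> Vec<i64> {
    rows.iter().map(|r| r[col_idx]).collect()
}

// Build every column in parallel; scoped threads let us borrow `rows`
// without an Arc. Join order preserves column order.
fn build_columns(rows: &[Vec<i64>], num_cols: usize) -> Vec<Vec<i64>> {
    thread::scope(|s| {
        let handles: Vec<_> = (0..num_cols)
            .map(|col_idx| s.spawn(move || decode_column(rows, col_idx)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let rows = vec![vec![1, 2], vec![3, 4]];
    let cols = build_columns(&rows, 2);
    assert_eq!(cols, vec![vec![1, 3], vec![2, 4]]);
    println!("{:?}", cols);
}
```

One thread per column is fine as a sketch; a real implementation would cap fan-out with a pool (rayon) or keep parallelism at the partition level as discussed above.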
```rust
    Int32Builder, Int64Builder, StringBuilder, TimestampNanosecondBuilder,
};
```

```rust
let rows = rows.into_iter().collect::<Result<Vec<_>>>()?;
```
Doesn't this cause us to iterate over the rows twice? Once here, and once again when we construct the columns.
Yes. I figured this would make it easier to auto-vectorize, but I haven't actually measured it.
```rust
    _overwrite: bool,
) -> DatafusionResult<Arc<dyn ExecutionPlan>> {
    Err(DataFusionError::Execution(
        "inserts not supported for SQL Server".to_string(),
```
"inserts not supported for SQL Server".to_string(), | |
"inserts not yet supported for SQL Server".to_string(), |
```toml
tokio-util = { version = "*" }
tokio = { version = "1.34.0", features = ["full"] }
```
feels like this wants to be workspace synced?
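For reference, Cargo's workspace dependency inheritance is the usual way to keep these synced: declare the version once in the root manifest and inherit it in members. A sketch under the assumption the crate lives in a workspace (the `tokio-util` version spec is carried over from the hunk above, not verified):

```toml
# Root Cargo.toml
[workspace.dependencies]
tokio = { version = "1.34.0", features = ["full"] }
tokio-util = { version = "*" }

# Member crate's Cargo.toml
[dependencies]
tokio = { workspace = true }
tokio-util = { workspace = true }
```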
```rust
pub struct QueryStream {
    receiver: mpsc::Receiver<Result<QueryItem>>,
    metadata: Option<ResultMetadata>,
    buffered_rows: VecDeque<Row>,
```
Are we unbounded in the mpsc stream and in this buffer?
We're unbounded in the mpsc from `Client` to `Connection`, but bounded in the mpsc from `Connection` to `QueryStream` (the results). `buffered_rows` is unbounded, but is effectively 0 (no allocations) or 1 (unlikely).

For the unbounded mpsc, a more proper implementation would send messages to the server as soon as they're received, instead of reading the entirety of the response to one message before moving to the next. The current implementation is still correct, but would unexpectedly block once 2 client messages are in flight. We only send 1 message at a time anyway, so that's not really an issue for us right now.
Followup items tracked in #2094
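The bounded results channel described above can be sketched with std's `sync_channel` standing in for tokio's bounded mpsc (whose `send` awaits, rather than blocks, when the buffer is full). Everything here is illustrative, not code from the PR:

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    // Capacity 1, like a small bounded Connection -> QueryStream channel:
    // the producer cannot run arbitrarily far ahead of the consumer.
    let (tx, rx) = mpsc::sync_channel::<i32>(1);

    let producer = thread::spawn(move || {
        for i in 0..3 {
            tx.send(i).unwrap(); // blocks while the buffer is full (backpressure)
        }
        // tx dropped here, which ends the receiver's iterator.
    });

    let received: Vec<i32> = rx.iter().collect();
    producer.join().unwrap();
    assert_eq!(received, vec![0, 1, 2]);
    println!("{:?}", received);
}
```

An unbounded channel (plain `mpsc::channel`, or tokio's `unbounded_channel`) would accept all sends immediately, which is the memory-growth concern the question raises.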
Adds support for SQL Server as a data source.

Closes #1101

SQL

Create external table:

Create external database:

Next
- `read_sqlserver` function
- `datatypes` table for SLT