[WIP][SPARK-50298][CONNECT] Implement verifySchema parameter of createDataFrame in Spark Connect #48841

Draft: wants to merge 2 commits into base: master

xinrong-meng (Member) commented Nov 14, 2024

What changes were proposed in this pull request?

This PR targets Spark Connect only. Spark Classic has been handled in #48677.

On Spark Classic, the verifySchema parameter of createDataFrame decides whether the data types of every row are verified against the schema.

The parameter is currently not supported on Spark Connect.

This PR proposes to support verifySchema on Spark Connect.
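
To make "verify data types of every row against schema" concrete, here is a minimal pure-Python sketch of that kind of check. It is not the code in this PR or in PySpark: `verify_rows` and the simplified `(column name, Python type)` schema are hypothetical, and treating `None` as a valid null in every column is an assumption.

```python
from typing import Any, Sequence, Tuple

# Hypothetical simplified schema: (column name, expected Python type) pairs.
Schema = Sequence[Tuple[str, type]]


def verify_rows(rows: Sequence[Sequence[Any]], schema: Schema) -> None:
    """Raise if any value in any row does not match the declared column type."""
    for index, row in enumerate(rows):
        if len(row) != len(schema):
            raise ValueError(
                f"row {index}: expected {len(schema)} fields, got {len(row)}"
            )
        for value, (name, expected) in zip(row, schema):
            # Assumption: every column is nullable, so None always passes.
            if value is not None and not isinstance(value, expected):
                raise TypeError(
                    f"row {index}, field {name!r}: expected "
                    f"{expected.__name__}, got {type(value).__name__}"
                )


schema = [("id", int), ("name", str)]
verify_rows([(1, "a"), (2, None)], schema)  # passes silently
```

Because a check like this touches every value of every row, skipping it for inputs that already carry types is a plausible reason for the per-input defaults.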

By default, the verifySchema parameter is pyspark._NoValue. If it is not provided, createDataFrame uses the following effective value, depending on the input:

  • pyarrow.Table, verifySchema = False
  • pandas.DataFrame with Arrow optimization, verifySchema = spark.sql.execution.pandas.convertToArrowArraySafely
  • regular Python instances, verifySchema = True

numpy ndarray input will be supported in a separate PR.

Why are the changes needed?

Parity with Spark Classic.

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?
