how to deal with varchar(max) columns in mssql #56

Open
TheDataScientistNL opened this issue Oct 5, 2023 · 4 comments

TheDataScientistNL commented Oct 5, 2023

Hi, I am using polars==0.19.7, which now includes ODBC support through arrow-odbc-py (arrow-odbc==1.2.8).

When running the code (see the example below), arrow-odbc raises an error.

import polars as pl

USERNM = ''
PWD = ''
DBNAME = ''
HOST = ''
PORT = ''

CONN = f"Driver={{ODBC Driver 17 for SQL Server}};Server={HOST};Port={PORT};Database={DBNAME};Uid={USERNM};Pwd={PWD}"

df = pl.read_database(
    connection=CONN,
    query="SELECT varchar_max_col FROM [dbo].[tablname]",
)

with the error being:

arrow_odbc.error.Error: There is a problem with the SQL type of the column with name: varchar_max_col and index 0:
ODBC reported a size of '0' for the column. This might indicate that the driver cannot specify a sensible upper bound for the column. E.g. for cases like VARCHAR(max). Try casting the column into a type with a sensible upper bound. The type of the column causing this error is Varchar { length: 0 }.

I can easily resolve this by editing the query to

df = pl.read_database(
    connection=CONN,
    query="SELECT CAST(varchar_max_col AS VARCHAR(100)) AS varchar_max_col FROM [dbo].[tablname]",
)

which resolves the issue (or by changing the column type in the database, though that is not something you always want to, or can, do).

However, since varchar(max) columns still occur frequently in databases, I was wondering if arrow-odbc could support them natively? In other words, could it detect varchar(max) columns and fetch them without throwing an error?

I hope this is the right place to ask, because I am not sure whether this is arrow-odbc related or ODBC driver related...

pacman82 (Owner) commented Oct 5, 2023

Hello @TheDataScientistNL,

The best way to deal with VARCHAR(max) is to set the max_text_size parameter. See the documentation here: https://arrow-odbc.readthedocs.io/en/latest/arrow_odbc.html#arrow_odbc.read_arrow_batches_from_odbc

You are not using read_arrow_batches_from_odbc directly but via polars, where I think this integration was added yesterday. Please ask the maintainers of polars how to forward this parameter, or use arrow-odbc directly.
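Roughly along these lines, reusing the CONN string and query from your example (the 4096-byte cap is just a placeholder; size it for your data):

from arrow_odbc import read_arrow_batches_from_odbc

# max_text_size puts an upper bound on the transit buffer bound to each
# text column, so the driver no longer needs a sensible length report
# for VARCHAR(max). Values longer than the cap may be truncated.
reader = read_arrow_batches_from_odbc(
    query="SELECT varchar_max_col FROM [dbo].[tablname]",
    connection_string=CONN,
    max_text_size=4096,
)

for batch in reader:
    ...  # each batch is a pyarrow.RecordBatch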

Best, Markus

pacman82 (Owner) commented Oct 5, 2023

I hope this is the right place to ask, because I am not sure whether this is arrow-odbc related or ODBC driver related...

Neither; it is inherent to the ODBC standard itself, a limitation of the API. Avoid VARCHAR(max), TEXT, or similar unbounded types in schema declarations if you want fast bulk fetches. I take back what I said earlier: the best way to deal with this is to fix the schema, if possible.
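If you can change the schema, something along these lines would do it (illustrative T-SQL only; pick a length that actually fits your data, and you may need to restate NULL/NOT NULL and rebuild dependent indexes):

-- Replace the unbounded type with a sensible upper bound.
ALTER TABLE [dbo].[tablname] ALTER COLUMN varchar_max_col VARCHAR(4000);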


alexander-beedie commented Oct 6, 2023

And I was so hoping to avoid a mystery-meat **kwargs pass-through for all the different connection flavours we now support 🤣 I'll think about the cleanest thing we can expose.

pacman82 (Owner) commented Oct 6, 2023

Just typing on my phone right now, so I will keep it short. I can sympathise with that. I wouldn't recommend a passthrough at all.
