
Cannot write to Minio with deltalake.write_deltalake or Polars #2894

Closed
rwhaling opened this issue Sep 22, 2024 · 8 comments · Fixed by #2895
@rwhaling
Contributor

Environment

Delta-rs version: 0.20.0

Binding: Python

Environment:

  • Cloud provider: Local (Minio, via Docker)
  • OS: Mac OS 12.5, M2 CPU
  • Other:

Bug

What happened:
Running Minio locally via docker-compose (.yml spec below), I attempted to write a 20-row PyArrow table via the write_deltalake function and got this opaque error message:

Generic S3 error: Error after 0 retries in 71.583µs, max_retries:10, retry_timeout:180s, source:builder error for url (http://localhost:9000/test-bucket/test_delta_table/_delta_log/_last_checkpoint)

I attempted to write a 20-row pandas dataframe via the Polars write_delta function as well, and got the exact same error:

Generic S3 error: Error after 0 retries in 71.583µs, max_retries:10, retry_timeout:180s, source:builder error for url (http://localhost:9000/test-bucket/test_delta_table/_delta_log/_last_checkpoint)

What you expected to happen:
I expected to be able to write tables out to Minio via S3. I have tested that I can write to Minio just fine with boto3.
I'm happy to do more footwork chasing this down, turning up logging, or reproducing it deeper in the stack if someone can point me in the right direction!

How to reproduce it:

import boto3
import random
import string
import pyarrow as pa
from deltalake import write_deltalake, DeltaTable

# Configuration
endpoint_url = 'http://localhost:9000'
access_key = 'minioadmin'
secret_key = 'minioadmin'
bucket_name = 'test-bucket'
table_name = 'test_delta_table'
num_rows = 10

# Generate random string function
def generate_random_string(length=5):
    return ''.join(random.choices(string.ascii_lowercase, k=length))

# Generate data
keys = [generate_random_string() for _ in range(num_rows)]
values = [generate_random_string() for _ in range(num_rows)]

# Create PyArrow table
table = pa.table([keys, values], names=['key', 'value'])

table_path = f"s3://{bucket_name}/{table_name}"

print(f"Writing Delta table to: {table_path}")

storage_options = {
    "AWS_ACCESS_KEY_ID": access_key,
    "AWS_SECRET_ACCESS_KEY": secret_key,
    "AWS_ENDPOINT_URL": endpoint_url,
    "AWS_REGION": "us-east-1",
    "AWS_S3_ALLOW_UNSAFE_RENAME": "true"
}

try:
    # Check if MinIO is accessible
    s3 = boto3.client('s3', endpoint_url=endpoint_url)
    s3.list_buckets()
    print("Successfully connected to MinIO")

    # Check if the bucket exists
    buckets = s3.list_buckets()['Buckets']
    if not any(bucket['Name'] == bucket_name for bucket in buckets):
        print(f"Bucket {bucket_name} does not exist. Creating it...")
        s3.create_bucket(Bucket=bucket_name)

    # Write to S3
    write_deltalake(
        table_path,
        table,
        mode="overwrite",
        storage_options=storage_options
    )
    print(f"Successfully wrote Delta table to {table_path}")

    # Read and print the table metadata
    dt = DeltaTable(table_path, storage_options=storage_options)
    print(f"Table metadata:\n{dt.metadata()}")
    print(f"Table schema:\n{dt.schema().json()}")
    print(f"Table version: {dt.version()}")

except Exception as e:
    print(f"Error writing Delta table: {e}")

More details:
docker-compose.yml:

version: '3.8'

services:
  minio:
    image: minio/minio
    ports:
      - "9000:9000"
      - "9001:9001"
    volumes:
      - minio_data:/data
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    command: server /data --console-address ":9001"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 30s
      timeout: 20s
      retries: 3

volumes:
  minio_data:
@rwhaling rwhaling added the bug Something isn't working label Sep 22, 2024
@rtyler rtyler added the binding/python Issues for the Python package label Sep 22, 2024
@rtyler rtyler self-assigned this Sep 22, 2024
@rtyler rtyler added the storage/aws AWS S3 storage related label Sep 22, 2024
@ion-elgreco
Collaborator

@rwhaling and this worked in 0.19.x?

@rwhaling
Contributor Author

@ion-elgreco No idea, doing this for the first time. I can try with 0.19.

@rtyler
Member

rtyler commented Sep 22, 2024

Thank you for the reproduction case! With a fresh environment I am consistently getting Unable to locate credentials. The problem is coming from boto3.

My guess is that you may have environment variables set that boto3 is picking up, which differ from what is being passed as storage_options into deltalake.

@rwhaling
Contributor Author

rwhaling commented Sep 22, 2024

Thank you! I seem to get the same thing on 0.19.2 as well. Let me check out those environment vars.
(Yes, I did have the AWS env vars set as well, apologies)

And so I understand - is write_deltalake using boto3 internally? Is there a way for me to turn up the logging?

@rtyler
Member

rtyler commented Sep 22, 2024

   s3 = boto3.client('s3', aws_access_key_id=access_key, aws_secret_access_key=secret_key, endpoint_url=endpoint_url)  

This gets the repro case to the error message you describe.

@rtyler
Member

rtyler commented Sep 22, 2024

@rwhaling don't worry about trying to reproduce this on older versions, I found the error 😄 It exists going back many versions!

This was a good Sunday morning brain exercise!

The problem here is that the stack is expecting TLS communication. Add AWS_ALLOW_HTTP as "true" to the storage_options and you'll be sorted!

If you're feeling extra thankful, I would love a pull request to update any relevant documentation in the docs/ directory which would have helped you here 🙏
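For reference, a minimal sketch of the repro script's storage_options with this fix applied, assuming the same local MinIO credentials as above:

# Sketch: the repro script's storage_options with AWS_ALLOW_HTTP added,
# so the underlying object store accepts the plain-http MinIO endpoint
# instead of requiring TLS.
storage_options = {
    "AWS_ACCESS_KEY_ID": "minioadmin",
    "AWS_SECRET_ACCESS_KEY": "minioadmin",
    "AWS_ENDPOINT_URL": "http://localhost:9000",
    "AWS_REGION": "us-east-1",
    "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
    "AWS_ALLOW_HTTP": "true",
}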

@rwhaling
Contributor Author

Bingo, it works!
I love writing doc PRs and would be happy to.
Thank y'all for this great project!

@rwhaling
Contributor Author

So while cleaning this up for docs, I found that I couldn't get it to work with 'conditional_put': 'etag', as described in the docs, only with "AWS_S3_ALLOW_UNSAFE_RENAME": "true".

Still using the repro script above -

storage_options = {
    "AWS_ACCESS_KEY_ID": ...,
    "AWS_SECRET_ACCESS_KEY": ...,
    "AWS_ENDPOINT_URL": "http://localhost:9000",
    "AWS_ALLOW_HTTP": "true",
    "conditional_put": "etag"
}

Gives the error: Error writing Delta table: Operation not supported: S3 does not support copy-if-not-exists.

For a lark, I tried setting AWS_S3_ALLOW_UNSAFE_RENAME alongside conditional_put - and got the same error.

I'm struggling a bit to follow the various code paths, but I feel like if conditional_put is set I shouldn't get this message?

return Ok(default_logstore(store, location, options));

Will dig some more.
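For reference, the follow-up PR's notes (below) indicate the docs were updated to use the key aws_conditional_put rather than conditional_put. A minimal sketch of storage options that should work, assuming the same local MinIO setup as above:

# Sketch based on the fix PR's notes: the key is "aws_conditional_put",
# not "conditional_put". Assumes the same local MinIO credentials.
storage_options = {
    "AWS_ACCESS_KEY_ID": "minioadmin",
    "AWS_SECRET_ACCESS_KEY": "minioadmin",
    "AWS_ENDPOINT_URL": "http://localhost:9000",
    "AWS_ALLOW_HTTP": "true",
    "aws_conditional_put": "etag",
}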

rtyler added a commit that referenced this issue Sep 28, 2024
…t support (#2895)

# Description
Fixes a few typos in cloudflare/minio docs page, adds working docker
example and notes on special storage_option flags for http vs https.

# Related Issue(s)
- closes #2894

# Notes
Updated docs to use `"aws_conditional_put":"etag"` due to the issue
identified below.

---------

Co-authored-by: Richard Whaling <[email protected]>
Co-authored-by: R. Tyler Croy <[email protected]>