Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Error in reading Nested Parquet #1014

Closed
ahmedriza opened this issue May 26, 2022 · 3 comments · Fixed by #1015
Closed

Error in reading Nested Parquet #1014

ahmedriza opened this issue May 26, 2022 · 3 comments · Fixed by #1015
Labels
bug Something isn't working no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog

Comments

@ahmedriza
Copy link

ahmedriza commented May 26, 2022

Issue found whilst testing #1007. Testing the Parquet generated by the following Python script on that branch fails on reading the Parquet file.

Python code to generate test Parquet:

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

d1 = {
    "id": pd.Series(['t1', 't2']),
    "prices": pd.Series([
        [
            {
                "currency": "GBP",
                "value": 3.14,
                'meta': [
                    {'loc': 'LON'}
                ]
            }
        ],
        [
            {
                "currency": "USD",
                "value": 4.14,
                'meta': [
                    {'loc': 'NYC'}
                ]                
            }
        ]
    ]),
    "bids": pd.Series([
        [
            {
                "currency": "GBP",
                "value": 3.04,
                'meta': [
                    {'loc': 'LON'}
                ]
            }
        ]        
    ], dtype='object')
}

df = pd.DataFrame(d1)

list_type = pa.list_(
    pa.struct([
        ('currency', pa.string()),
        ('value', pa.float64()),
        ('meta', pa.list_(
            pa.struct([
                ('loc', pa.string())
            ])
        ))
    ]))

schema = pa.schema([
    ('id', pa.string()),
    ('prices', list_type),
    ('bids', list_type)
])

table = pa.Table.from_pandas(df, schema=schema)
filename = '/tmp/two_level_nested.parquet'
pq.write_table(table, filename)

# Read the Parquet to check
expected_table = pq.read_table(filename).to_pandas()
print(expected_table.to_string())

Rust code (against #1007)

use arrow2::{
    array::*,
    chunk::Chunk as AChunk,
    io::parquet::write::{FileWriter, RowGroupIterator, WriteOptions},
};
use arrow2::{datatypes::Schema, io::parquet::read::FileReader};
use parquet2::encoding::Encoding;
use std::{fs::File, sync::Arc};

type Chunk = AChunk<Arc<dyn Array>>;
type Result<T> = arrow2::error::Result<T>;

pub fn main() -> Result<()> {
    let filename_to_read = "/tmp/two_level_nested.parquet";
    let filename_to_verify = "/tmp/two_level_nested_verify.parquet";

    // Read the Parquet created by PyArrow
    let (schema, chunks) = read_parquet(filename_to_read);

    // Write a new Parquet file from what what we just read
    write_parquet(filename_to_verify, schema, chunks)?;

    // Read what we just wrote
    let (_schema, _chunks) = read_parquet(filename_to_verify);

    // Compare the original Parquet file and the one we wrote 
    compare(filename_to_read, filename_to_verify)?;
    
    Ok(())
}

fn compare(filename_to_read: &str, filename_to_verify: &str) -> Result<()> {
    let (_, expected_chunks) = read_parquet(filename_to_read);    
    let expected_chunks = expected_chunks
        .into_iter()
        .map(|res| res.unwrap())
        .collect::<Vec<_>>();

    println!("Expected chunks: {:#?}", expected_chunks);

    let (_, written_chunks) = read_parquet(filename_to_verify);        
    let written_chunks = written_chunks
        .into_iter()
        .map(|res| res.unwrap())
        .collect::<Vec<_>>();

    println!("Chunks written: {:#?}", written_chunks);
    
    assert_eq!(expected_chunks, written_chunks);
    
    Ok(())
}

fn read_parquet(filename: &str) -> (Schema, Vec<Result<Chunk>>) {
    println!("Reading {}", filename);
    let reader = File::open(filename).unwrap();
    let reader = FileReader::try_new(reader, None, None, None, None).unwrap();
    let schema: Schema = reader.schema().clone();
    // println!("schema: {:#?}", schema);
    let mut chunks = vec![];
    for chunk_result in reader {
        chunks.push(chunk_result);
    }
    (schema, chunks)
}

fn write_parquet(filename: &str, schema: Schema, chunks: Vec<Result<Chunk>>) -> Result<()> {
    println!("Writing {}", filename);
    let options = WriteOptions {
        write_statistics: true,
        version: parquet2::write::Version::V2,
        compression: parquet2::compression::CompressionOptions::Snappy,
    };

    let iter = chunks.into_iter();
    let encodings = schema.fields.iter().map(|_| Encoding::Plain).collect();
    let row_groups = RowGroupIterator::try_new(iter, &schema, options, encodings)?;

    let file = std::fs::File::create(filename)?;
    let mut writer = FileWriter::try_new(file, schema, options)?;
    writer.start()?;
    for group in row_groups {
        writer.write(group?)?;
    }
    writer.end(None)?;
    Ok(())
}

Results in

Reading /tmp/two_level_nested.parquet
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', /home/ahmed/.cargo/git/checkouts/arrow2-8a2ad61d97265680/b6a516c/src/io/parquet/read/deserialize/mod.rs:272:44
@ahmedriza
Copy link
Author

ahmedriza commented May 27, 2022

@jorgecarleitao what's required at this point in the example in order to work with new fix?

    let encodings = schema.fields.iter().map(|_| Encoding::Plain).collect();
    let row_groups = RowGroupIterator::try_new(iter, &schema, options, vec![encodings])?;

Using the above, and master after the merge, I still get the assertion failure during the write:

thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `3`,
 right: `1`', /home/ahmed/.cargo/git/checkouts/arrow2-8a2ad61d97265680/7cc874f/src/io/parquet/write/pages.rs:210:5

@jorgecarleitao
Copy link
Owner

jorgecarleitao commented May 28, 2022

I fielded #1018, which is what we need here - we need to transverse the fields, get the leaf columns, and create encodings with it.

I ran the example, but the writing is still not there for such nesting - I am looking into it I looked into it #1019

@ahmedriza
Copy link
Author

@jorgecarleitao thanks a lot for the explanation.

@jorgecarleitao jorgecarleitao added the no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog label Jun 5, 2022
@jorgecarleitao jorgecarleitao changed the title Nested Parquet read failure Error in reading Nested Parquet Jun 5, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants