Error in reading Nested Parquet #1014

ahmedriza · 2022-05-26T22:09:08Z

Issue found whilst testing #1007. Testing the Parquet generated by the following Python script on that branch fails on reading the Parquet file.

Python code to generate test Parquet:

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

d1 = {
    "id": pd.Series(['t1', 't2']),
    "prices": pd.Series([
        [
            {
                "currency": "GBP",
                "value": 3.14,
                'meta': [
                    {'loc': 'LON'}
                ]
            }
        ],
        [
            {
                "currency": "USD",
                "value": 4.14,
                'meta': [
                    {'loc': 'NYC'}
                ]                
            }
        ]
    ]),
    "bids": pd.Series([
        [
            {
                "currency": "GBP",
                "value": 3.04,
                'meta': [
                    {'loc': 'LON'}
                ]
            }
        ]        
    ], dtype='object')
}

df = pd.DataFrame(d1)

list_type = pa.list_(
    pa.struct([
        ('currency', pa.string()),
        ('value', pa.float64()),
        ('meta', pa.list_(
            pa.struct([
                ('loc', pa.string())
            ])
        ))
    ]))

schema = pa.schema([
    ('id', pa.string()),
    ('prices', list_type),
    ('bids', list_type)
])

table = pa.Table.from_pandas(df, schema=schema)
filename = '/tmp/two_level_nested.parquet'
pq.write_table(table, filename)

# Read the Parquet to check
expected_table = pq.read_table(filename).to_pandas()
print(expected_table.to_string())

Rust code (against #1007)

use arrow2::{
    array::*,
    chunk::Chunk as AChunk,
    io::parquet::write::{FileWriter, RowGroupIterator, WriteOptions},
};
use arrow2::{datatypes::Schema, io::parquet::read::FileReader};
use parquet2::encoding::Encoding;
use std::{fs::File, sync::Arc};

type Chunk = AChunk<Arc<dyn Array>>;
type Result<T> = arrow2::error::Result<T>;

pub fn main() -> Result<()> {
    let filename_to_read = "/tmp/two_level_nested.parquet";
    let filename_to_verify = "/tmp/two_level_nested_verify.parquet";

    // Read the Parquet created by PyArrow
    let (schema, chunks) = read_parquet(filename_to_read);

    // Write a new Parquet file from what what we just read
    write_parquet(filename_to_verify, schema, chunks)?;

    // Read what we just wrote
    let (_schema, _chunks) = read_parquet(filename_to_verify);

    // Compare the original Parquet file and the one we wrote 
    compare(filename_to_read, filename_to_verify)?;
    
    Ok(())
}

fn compare(filename_to_read: &str, filename_to_verify: &str) -> Result<()> {
    let (_, expected_chunks) = read_parquet(filename_to_read);    
    let expected_chunks = expected_chunks
        .into_iter()
        .map(|res| res.unwrap())
        .collect::<Vec<_>>();

    println!("Expected chunks: {:#?}", expected_chunks);

    let (_, written_chunks) = read_parquet(filename_to_verify);        
    let written_chunks = written_chunks
        .into_iter()
        .map(|res| res.unwrap())
        .collect::<Vec<_>>();

    println!("Chunks written: {:#?}", written_chunks);
    
    assert_eq!(expected_chunks, written_chunks);
    
    Ok(())
}

fn read_parquet(filename: &str) -> (Schema, Vec<Result<Chunk>>) {
    println!("Reading {}", filename);
    let reader = File::open(filename).unwrap();
    let reader = FileReader::try_new(reader, None, None, None, None).unwrap();
    let schema: Schema = reader.schema().clone();
    // println!("schema: {:#?}", schema);
    let mut chunks = vec![];
    for chunk_result in reader {
        chunks.push(chunk_result);
    }
    (schema, chunks)
}

fn write_parquet(filename: &str, schema: Schema, chunks: Vec<Result<Chunk>>) -> Result<()> {
    println!("Writing {}", filename);
    let options = WriteOptions {
        write_statistics: true,
        version: parquet2::write::Version::V2,
        compression: parquet2::compression::CompressionOptions::Snappy,
    };

    let iter = chunks.into_iter();
    let encodings = schema.fields.iter().map(|_| Encoding::Plain).collect();
    let row_groups = RowGroupIterator::try_new(iter, &schema, options, encodings)?;

    let file = std::fs::File::create(filename)?;
    let mut writer = FileWriter::try_new(file, schema, options)?;
    writer.start()?;
    for group in row_groups {
        writer.write(group?)?;
    }
    writer.end(None)?;
    Ok(())
}

Results in

Reading /tmp/two_level_nested.parquet
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', /home/ahmed/.cargo/git/checkouts/arrow2-8a2ad61d97265680/b6a516c/src/io/parquet/read/deserialize/mod.rs:272:44

The text was updated successfully, but these errors were encountered:

ahmedriza · 2022-05-27T18:34:17Z

@jorgecarleitao what's required at this point in the example in order to work with new fix?

    let encodings = schema.fields.iter().map(|_| Encoding::Plain).collect();
    let row_groups = RowGroupIterator::try_new(iter, &schema, options, vec![encodings])?;

Using the above, and master after the merge, I still get the assertion failure during the write:

thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `3`,
 right: `1`', /home/ahmed/.cargo/git/checkouts/arrow2-8a2ad61d97265680/7cc874f/src/io/parquet/write/pages.rs:210:5

jorgecarleitao · 2022-05-28T03:53:11Z

I fielded #1018, which is what we need here - we need to transverse the fields, get the leaf columns, and create encodings with it.

I ran the example, but the writing is still not there for such nesting - ~~I am looking into it~~ I looked into it #1019

ahmedriza · 2022-05-28T10:36:35Z

@jorgecarleitao thanks a lot for the explanation.

ahmedriza mentioned this issue May 26, 2022

Added support to write nested parquet #1007

Merged

jorgecarleitao added the bug Something isn't working label May 27, 2022

jorgecarleitao mentioned this issue May 27, 2022

Fixed error in reading nested parquet structs #1015

Merged

jorgecarleitao closed this as completed in #1015 May 27, 2022

jorgecarleitao mentioned this issue May 28, 2022

Fixed error in writing list<struct<..>> to parquet #1019

Merged

jorgecarleitao added the no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog label Jun 5, 2022

jorgecarleitao changed the title ~~Nested Parquet read failure~~ Error in reading Nested Parquet Jun 5, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Error in reading Nested Parquet #1014

Error in reading Nested Parquet #1014

ahmedriza commented May 26, 2022 •

edited

Loading

ahmedriza commented May 27, 2022 •

edited

Loading

jorgecarleitao commented May 28, 2022 •

edited

Loading

ahmedriza commented May 28, 2022

Error in reading Nested Parquet #1014

Error in reading Nested Parquet #1014

Comments

ahmedriza commented May 26, 2022 • edited Loading

ahmedriza commented May 27, 2022 • edited Loading

jorgecarleitao commented May 28, 2022 • edited Loading

ahmedriza commented May 28, 2022

ahmedriza commented May 26, 2022 •

edited

Loading

ahmedriza commented May 27, 2022 •

edited

Loading

jorgecarleitao commented May 28, 2022 •

edited

Loading