This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Write parquet listarray produces incomplete output #967

Closed
mpetri opened this issue Apr 28, 2022 · 0 comments · Fixed by #968
Labels
bug Something isn't working no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog

mpetri commented Apr 28, 2022

I'm trying to use the arrow2 crate to write a parquet file with the following schema:

  let first_field = Field::new("first", DataType::Utf8, false);
  let second_field = Field::new(
      "second",
      DataType::List(Box::new(Field::new("item", DataType::Float32, false))),
      false,
  );
  let schema = Schema::from(vec![first_field, second_field]);

I modified the sample code accordingly, but only half of the list values get written. In the code below, each of the 500 rows holds a list of 256 f32 values, yet when I read the file back with polars or other tools, only 128 values show up per list. Any idea what is wrong, or how I can debug this?

use std::fs::File;
use std::sync::Arc;

use arrow2::{
    array::{Array, ListArray, MutableListArray, MutablePrimitiveArray, TryPush, Utf8Array},
    chunk::Chunk,
    datatypes::{DataType, Field, Schema},
    error::Result,
    io::parquet::write::{
        CompressionOptions, Encoding, FileWriter, RowGroupIterator, Version, WriteOptions,
    },
};

fn write_batch(path: &str, schema: Schema, columns: Chunk<Arc<dyn Array>>) -> Result<()> {
    let options = WriteOptions {
        write_statistics: true,
        compression: CompressionOptions::Uncompressed,
        version: Version::V2,
    };

    let iter = vec![Ok(columns)];

    let row_groups = RowGroupIterator::try_new(
        iter.into_iter(),
        &schema,
        options,
        vec![Encoding::Plain, Encoding::Plain],
    )?;

    // Create a new empty file
    let file = File::create(path)?;

    let mut writer = FileWriter::try_new(file, schema, options)?;

    writer.start()?;
    for group in row_groups {
        writer.write(group?)?;
    }
    let _size = writer.end(None)?;
    Ok(())
}

fn main() -> Result<()> {
    let first_field = Field::new("first", DataType::Utf8, false);
    let second_field = Field::new(
        "second",
        DataType::List(Box::new(Field::new("item", DataType::Float32, false))),
        false,
    );
    let schema = Schema::from(vec![first_field, second_field]);

    let mut firsts: Vec<Option<String>> = Vec::new();
    let mut seconds = MutableListArray::<i32, MutablePrimitiveArray<f32>>::new();
    for id in 0..500 {
        let raw: Vec<Option<f32>> = (0..256).map(|e| Some(e as f32)).collect();
        seconds.try_push(Some(raw))?;
        firsts.push(Some(format!("{}", id)));
    }

    let firsts = Arc::new(Utf8Array::<i32>::from(&firsts));
    let seconds = ListArray::from(seconds);
    let seconds = Arc::new(seconds);
    let columns = Chunk::new(vec![firsts as Arc<dyn Array>, seconds as Arc<dyn Array>]);

    write_batch("test.parquet", schema, columns)
}

Reading the file back in Python with polars:

import polars

df = polars.read_parquet('test.parquet')
print(df.shape)
print(df["second"][0])
print(df["second"][0].shape)

Output:

(500, 2)
shape: (128,)
Series: 'second' [f32]
[
        0.0
        1.0
        2.0
        3.0
        4.0
        5.0
        6.0
        7.0
        8.0
        9.0
        10.0
        11.0
        ...
        116.0
        117.0
        118.0
        119.0
        120.0
        121.0
        122.0
        123.0
        124.0
        125.0
        126.0
        127.0
]
(128,)
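
For reference, here is a minimal sketch of what the list column in the reproduction should contain, written with only the Rust standard library and not arrow2. It models an Arrow-style list column as an offsets buffer plus one flat values buffer, and derives each row's length from consecutive offsets. The helper names `build_offsets` and `list_lengths` are hypothetical and exist only for illustration; this does not show how arrow2 writes parquet, only the counts a correct round trip should preserve.

```rust
/// Build the offsets buffer for `rows` lists that each hold `len` values.
/// An Arrow-style list column stores `rows + 1` offsets into one flat values buffer.
/// (Hypothetical helper, not part of arrow2.)
fn build_offsets(rows: usize, len: usize) -> Vec<i32> {
    (0..=rows).map(|i| (i * len) as i32).collect()
}

/// Recover each row's list length from consecutive offsets.
/// (Hypothetical helper, not part of arrow2.)
fn list_lengths(offsets: &[i32]) -> Vec<usize> {
    offsets.windows(2).map(|w| (w[1] - w[0]) as usize).collect()
}

fn main() {
    // Same shape as the reproduction: 500 rows, 256 f32 values per row.
    let (rows, len) = (500, 256);
    let offsets = build_offsets(rows, len);

    // Flat values buffer: 0.0 through 255.0, repeated once per row.
    let values: Vec<f32> = (0..rows)
        .flat_map(|_| (0..len).map(|e| e as f32))
        .collect();

    let lengths = list_lengths(&offsets);
    assert_eq!(lengths.len(), rows);
    assert!(lengths.iter().all(|&l| l == len));
    assert_eq!(values.len(), rows * len);

    println!("rows = {}, values = {}", lengths.len(), values.len());
}
```

With these inputs the column holds 500 lists and 128,000 values in total. The polars output above shows the correct row count but only 128 values in the first list, so the row structure survives the write while part of each list's values does not.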
@mpetri mpetri changed the title Write parquet write listarray produces incomplete output Write parquet listarray produces incomplete output Apr 28, 2022
@jorgecarleitao jorgecarleitao added the bug Something isn't working label Apr 29, 2022
@jorgecarleitao jorgecarleitao self-assigned this Apr 29, 2022
@jorgecarleitao jorgecarleitao added the no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog label Apr 29, 2022