sheetreader
is a DuckDB extension that allows reading XLSX files into DuckDB tables with SheetReader, our blazingly fast XLSX parser (https://github.com/polydbms/sheetreader-core).
This repository is based on https://github.com/duckdb/extension-template.
Before using SheetReader, you need to install it from the community extensions and load it into your DuckDB-environment:
INSTALL sheetreader FROM community;
LOAD sheetreader;
Now, you can run your first query:
D SELECT *
FROM sheetreader('test.xlsx');
The sheetreader()
function offers further parameters to load the XLSX-file as required:
D CREATE TABLE test AS FROM sheetreader(
'test.xlsx',
sheet_index=1,
threads=16,
skip_rows=0,
has_header=TRUE,
types=[BOOLEAN,VARCHAR],
coerce_to_string=TRUE,
force_types=TRUE
);
Name | Description | Type | Default |
---|---|---|---|
sheet_index |
Index of the sheet to read. Starts at 1. | INTEGER |
1 |
sheet_name |
Name of the sheet to read. Only either sheet_index or sheet_name can be set. |
VARCHAR |
"" |
threads |
Number of threads to use, while parsing | INTEGER |
Half of available cores; minimum 1 |
skip_rows |
Number of rows to skip | INTEGER |
0 |
types |
List of types for all columns
|
LIST(VARCHAR) |
Uses types determined by first & second row (after skipped rows) |
coerce_to_string |
Coerce all cells in column of type VARCHAR to string (i.e. VARCHAR ). |
BOOLEAN |
false |
force_types |
Use types even if they are not compatible with types determined by first/second row. Cells, that are not of the column type, are set to NULL or coerced to string, if option is set. |
BOOLEAN |
false |
has_header |
If set to true :
If set to false :
|
BOOLEAN |
false |
SheetReader was published in the Information Systems Journal.
@article{DBLP:journals/is/GavriilidisHZM23,
author = {Haralampos Gavriilidis and
Felix Henze and
Eleni Tzirita Zacharatou and
Volker Markl},
title = {SheetReader: Efficient Specialized Spreadsheet Parsing},
journal = {Inf. Syst.},
volume = {115},
pages = {102183},
year = {2023},
url = {https://doi.org/10.1016/j.is.2023.102183},
doi = {10.1016/J.IS.2023.102183},
timestamp = {Mon, 26 Jun 2023 20:54:32 +0200},
biburl = {https://dblp.org/rec/journals/is/GavriilidisHZM23.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
You can find benchmarks in the above-mentioned paper, comparing SheetReader to other XLSX parsers.
Here is a plot of preliminary benchmarks comparing the sheetreader
DuckDB extension to the spatial
extension's st_read
function:
(System info: 2x Intel(R) Xeon(R) E5530 @ 2.40GHz, 47GiB RAM)
First, clone this repository with the --recurse-submodules
flag --- so you get all the needed source files.
To build the extension, run:
GEN=ninja make
The main binaries that will be built are:
./build/release/duckdb
./build/release/extension/sheetreader/sheetreader.duckdb_extension
duckdb
is the binary for the DuckDB shell with the extension code automatically loaded.sheetreader.duckdb_extension
is the loadable binary as it would be distributed.
To run the self-built extension code, simply start the shell with ./build/release/duckdb
.