Add composite structure family #668

Open · wants to merge 45 commits into base: main
56a0561
REL: v0.1.0a113
danielballan Jan 3, 2024
0e4ff0c
REL: v0.1.0a114
danielballan Feb 5, 2024
8054a33
First pass at 'union' structure family
danielballan Feb 23, 2024
97b9f90
Fix mismatch (from rebase, likely).
danielballan Dec 12, 2024
a5c9e51
Make grouping clear
danielballan Dec 12, 2024
7b90155
TMP Fix usage, but this needs re-examined.
danielballan Dec 12, 2024
91a15cb
FIX: pass all query params as kwargs
genematx Dec 12, 2024
2422544
Container may have structure (inlined contents).
danielballan Dec 12, 2024
e452daa
Set all_keys correctly.
danielballan Dec 12, 2024
334cba0
MNT: rename UnionStructure to ConsolidatedStructure
genematx Dec 12, 2024
e466b4f
MNT: rename UnionStructure to ConsolidatedStructure
genematx Dec 12, 2024
4596fb4
MNT: rename CatalogUnionAdapter and UnionLinks
genematx Dec 12, 2024
f54a52b
MNT: typing and lint
genematx Dec 12, 2024
325a797
MNT: typing and lint
genematx Dec 12, 2024
1978b79
ENH: refactor creation of ConsolidatedStructure as a classmethod
genematx Dec 13, 2024
0178b83
ENH: allow iterating over ConsolidatedClient and its parts
genematx Dec 13, 2024
619cac8
DOC: add Consolidated Structure to the docs
genematx Dec 13, 2024
d9db95a
MNT: remove dims from the Container client signature
genematx Dec 13, 2024
f45c296
TST: add tests for writing/reading consolidated structures
genematx Dec 13, 2024
b46de48
MNT: lint
genematx Dec 13, 2024
8cfdc28
MNT: typing
genematx Dec 13, 2024
a896e7e
FIX: reading string-dtype columns from dataframes individually
genematx Dec 14, 2024
ed86b8f
MNT: fix pydantic deprecations
genematx Dec 17, 2024
a6bcd2c
TST: consolidated with awkward and sparse arrays
genematx Dec 17, 2024
a5fcfd8
ENH: check if uris are passed as list
genematx Dec 17, 2024
ca4e979
TST: external assets in consolidated
genematx Dec 17, 2024
e8fcd64
MNT: refactor normalize_specs
genematx Dec 18, 2024
a2bfa1b
MNT: remove unused class definition
genematx Dec 30, 2024
4dbc631
MNT: lint
genematx Dec 30, 2024
429cad0
ENH: Add Composite structure
genematx Dec 30, 2024
d778e03
ENH: Add Composite structure
genematx Dec 30, 2024
6a74493
ENH: handle part query param
genematx Dec 31, 2024
8eeee15
ENH: add links for Composite
genematx Dec 31, 2024
e52afc9
ENH: creating Composite nodes
genematx Dec 31, 2024
652d4fa
ENH: add support for Composite structure
genematx Dec 31, 2024
297470e
TST: tests for Composite structure
genematx Dec 31, 2024
6981d3f
MNT: cleup, lint, and test
genematx Jan 7, 2025
d5eef7d
MNT: update Composite docs
genematx Jan 7, 2025
de4013a
TST: Add test for parts
genematx Jan 7, 2025
250fa92
TST: contents accessible only via parts
genematx Jan 10, 2025
ad8e4be
NMT: use sentinel values for error codes
genematx Jan 10, 2025
c3f986c
ENH: separate composite namespaces
genematx Jan 12, 2025
96a8476
MNT: subclass CompositeClient from Container
genematx Jan 12, 2025
97caed7
ENH: check starusture families in SecureEntry
genematx Jan 12, 2025
05eaeb4
ENH: remove links for parts
genematx Jan 13, 2025
3 changes: 2 additions & 1 deletion docs/source/explanations/catalog.md
Original file line number Diff line number Diff line change
Expand Up @@ -54,7 +54,8 @@ and `assets`, describes the format, structure, and location of the data.
to the Adapter
- `management` --- enum indicating whether the data is registered `"external"` data
or `"writable"` data managed by Tiled
- `structure_family` --- enum of structure types (`"container"`, `"array"`, `"table"`, ...)
- `structure_family` --- enum of structure types (`"container"`, `"array"`, `"table"`,
etc. --- except for `composite`, which cannot be assigned to a Data Source)
- `structure_id` --- a foreign key to the `structures` table
- `node_id` --- foreign key to `nodes`
- `id` --- integer primary key
Expand Down
81 changes: 78 additions & 3 deletions docs/source/explanations/structures.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,11 +10,12 @@ potentially any language.
The structure families are:

* array --- a strided array, like a [numpy](https://numpy.org) array
* awkward --- nested, variable-sized data (as implemented by [AwkwardArray](https://awkward-array.org/))
* container --- a collection of other structures, akin to a dictionary or a directory
* sparse --- a sparse array (i.e. an array which is mostly zeros)
* table --- tabular data, as in [Apache Arrow](https://arrow.apache.org) or
[pandas](https://pandas.pydata.org/)
* container --- a collection of other structures, akin to a dictionary or a directory
* composite --- a container-like structure to combine table columns and arrays in a common namespace
* sparse --- a sparse array (i.e. an array which is mostly zeros)
* awkward --- nested, variable-sized data (as implemented by [AwkwardArray](https://awkward-array.org/))

## How structure is encoded

Expand Down Expand Up @@ -575,3 +576,77 @@ response.
"count": 5
}
```

### Composite

This is a specialized container-like structure designed to link together multiple tables and arrays that store
related scientific data. It does not support nesting, but it provides a common namespace across all columns of the
contained tables along with the arrays (name collisions are therefore forbidden). This further abstracts away the
disparate internal storage mechanisms (e.g. Parquet for tables and Zarr for arrays) and presents the user with a
homogeneous interface for data access. Composite structures do not support pagination and are not
recommended for "wide" datasets with more than ~1000 items (columns and arrays) in the namespace.

Below is an example of a Composite structure that describes two tables and two arrays of various sizes. It is very
similar to an ordinary Container structure, where `contents` lists the structures of its constituents; additionally,
`flat_keys` defines the internal namespace of directly addressable columns and arrays.

```json
{
"contents": [
{
"structure_family": "table",
"structure": {
"arrow_schema": "data:application/vnd.apache.arrow.file;base64,/////...FFFF",
"npartitions": 1,
"columns": ["A", "B"],
"resizable": false
},
"name": "table1"
},
{
"structure_family": "table",
"structure": {
"arrow_schema": "data:application/vnd.apache.arrow.file;base64,/////...FFFF",
"npartitions": 1,
"columns": ["C", "D", "E"],
"resizable": false
},
"name": "table2"
},
{
"structure_family": "array",
"structure": {
"data_type": {
"endianness": "little",
"kind": "f",
"itemsize": 8,
"dt_units": null
},
"chunks": [[3], [5]],
"shape": [3, 5],
"dims": null,
"resizable": false
},
"name": "F"
},
{
"structure_family": "array",
"structure": {
"data_type": {
"endianness": "not_applicable",
"kind": "u",
"itemsize": 1,
"dt_units": null
},
"chunks": [[5], [7], [3]],
"shape": [5, 7, 3],
"dims": null,
"resizable": false
},
"name": "G"
}
],
"flat_keys": ["A", "B", "C", "D", "E", "F", "G"],
"count": 7
}
```
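The `flat_keys` namespace above can be derived mechanically from `contents`: every column of every table and the name of every non-table part are merged into one flat list, and any duplicate name is rejected. Below is a minimal sketch of that rule; the helper `flat_namespace` is illustrative only, not part of Tiled's API.

```python
# Sketch: derive the flat namespace of a Composite structure from its
# `contents`, rejecting name collisions. `flat_namespace` is an
# illustrative helper, not part of Tiled's API.


def flat_namespace(contents):
    keys = []
    for item in contents:
        if item["structure_family"] == "table":
            # Every column of a table is directly addressable.
            new_keys = item["structure"]["columns"]
        else:
            # Arrays (and awkward/sparse parts) contribute their own name.
            new_keys = [item["name"]]
        for key in new_keys:
            if key in keys:
                raise ValueError(f"Name collision: {key!r}")
            keys.append(key)
    return keys


contents = [
    {"structure_family": "table",
     "structure": {"columns": ["A", "B"]}, "name": "table1"},
    {"structure_family": "table",
     "structure": {"columns": ["C", "D", "E"]}, "name": "table2"},
    {"structure_family": "array", "structure": {}, "name": "F"},
    {"structure_family": "array", "structure": {}, "name": "G"},
]

keys = flat_namespace(contents)
assert keys == ["A", "B", "C", "D", "E", "F", "G"]
assert len(keys) == 7  # matches "count" in the JSON above
```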
27 changes: 26 additions & 1 deletion docs/source/how-to/register.md
Original file line number Diff line number Diff line change
Expand Up @@ -72,7 +72,10 @@ Sometimes it is necessary to take more manual control of this registration
process, such as if you want to take advantage of particular knowledge
about the files to specify particular `metadata` or `specs`.

Use the Python client, as in this example.
#### Registering external data

To register data from external files in Tiled, one can use the Python client and
construct a DataSource object explicitly, passing the list of assets, as in the following example.

```py
import numpy
Expand Down Expand Up @@ -112,3 +115,25 @@ client.new(
specs=[],
)
```

#### Writing a composite structure

A Composite structure allows the user to access the columns of contained tables in
a flat namespace along with other arrays. Writing new data to a Composite container
is analogous to writing to ordinary containers; however, an exception will be raised
if there are any name collisions.

```python
import numpy
import pandas

rng = numpy.random.default_rng(12345)
arr = rng.random(size=(3, 5), dtype="float64")
df = pandas.DataFrame({"A": ["one", "two", "three"], "B": [1, 2, 3]})

# Create a Composite node
node = client.create_composite(key="x")

# Write the data
node.write_array(arr, key="C")
node.write_dataframe(df, key="table1")
```
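The collision rule can be illustrated without a running server: a Composite node behaves like a single flat mapping, so writing an array whose key, or a table with a column, that already exists in the namespace fails. A toy model of that behavior (the class `FlatNode` is hypothetical and exists only for illustration):

```python
# Toy model of the Composite namespace rule. `FlatNode` is hypothetical;
# it only illustrates why writes that collide with an existing key raise.


class FlatNode:
    def __init__(self):
        self._keys = set()

    def _claim(self, keys):
        clash = self._keys.intersection(keys)
        if clash:
            raise KeyError(f"Key(s) already in namespace: {sorted(clash)}")
        self._keys.update(keys)

    def write_array(self, key):
        # An array claims a single key.
        self._claim({key})

    def write_dataframe(self, columns):
        # A table claims one key per column.
        self._claim(set(columns))


node = FlatNode()
node.write_array("C")
node.write_dataframe(["A", "B"])
# node.write_array("A") would now raise KeyError: "A" is already claimed.
```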
2 changes: 2 additions & 0 deletions docs/source/reference/service.md
Original file line number Diff line number Diff line change
Expand Up @@ -104,6 +104,8 @@ See {doc}`../explanations/structures` for more context.
tiled.structures.array.BuiltinDtype
tiled.structures.array.Endianness
tiled.structures.array.Kind
tiled.structures.composite.CompositeStructure
tiled.structures.composite.CompositeStructurePart
tiled.structures.core.Spec
tiled.structures.core.StructureFamily
tiled.structures.table.TableStructure
Expand Down
221 changes: 221 additions & 0 deletions tiled/_tests/test_composite.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,221 @@
from pathlib import Path

import awkward
import numpy
import pandas
import pytest
import sparse
import tifffile as tf

from ..catalog import in_memory
from ..client import Context, from_context
from ..server.app import build_app
from ..structures.array import ArrayStructure, BuiltinDtype
from ..structures.core import StructureFamily
from ..structures.data_source import Asset, DataSource, Management
from ..structures.table import TableStructure

rng = numpy.random.default_rng(12345)

df1 = pandas.DataFrame({"A": ["one", "two", "three"], "B": [1, 2, 3]})
df2 = pandas.DataFrame(
{
"C": ["red", "green", "blue", "white"],
"D": [10.0, 20.0, 30.0, 40.0],
"E": [0, 0, 0, 0],
}
)
df3 = pandas.DataFrame(
{
"col1": ["one", "two", "three", "four", "five"],
"col2": [1.0, 2.0, 3.0, 4.0, 5.0],
}
)
arr1 = rng.random(size=(3, 5), dtype="float64")
arr2 = rng.integers(0, 255, size=(5, 7, 3), dtype="uint8")
img_data = rng.integers(0, 255, size=(5, 13, 17, 3), dtype="uint8")

# An awkward array
awk_arr = awkward.Array(
[
[{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}],
[],
[{"x": 3.3, "y": [1, 2, 3]}],
]
)
awk_packed = awkward.to_packed(awk_arr)
awk_form, awk_length, awk_container = awkward.to_buffers(awk_packed)

# A sparse array
arr = rng.random(size=(10, 20, 30), dtype="float64")
arr[arr < 0.95] = 0 # Fill half of the array with zeros.
sps_arr = sparse.COO(arr)

md = {"md_key1": "md_val1", "md_key2": 2}


@pytest.fixture(scope="module")
def tree(tmp_path_factory):
return in_memory(writable_storage=tmp_path_factory.getbasetemp())


@pytest.fixture(scope="module")
def context(tree):
with Context.from_app(build_app(tree)) as context:
client = from_context(context)
x = client.create_composite(key="x", metadata=md)
x.write_array(arr1, key="arr1", metadata={"md_key": "md_for_arr1"})
x.write_array(arr2, key="arr2", metadata={"md_key": "md_for_arr2"})
x.write_dataframe(df1, key="df1", metadata={"md_key": "md_for_df1"})
x.write_dataframe(df2, key="df2", metadata={"md_key": "md_for_df2"})
x.write_awkward(awk_arr, key="awk", metadata={"md_key": "md_for_awk"})
x.write_sparse(
coords=sps_arr.coords,
data=sps_arr.data,
shape=sps_arr.shape,
key="sps",
metadata={"md_key": "md_for_sps"},
)

yield context


@pytest.fixture
def tiff_sequence(tmpdir):
sequence_directory = Path(tmpdir, "sequence")
sequence_directory.mkdir()
filepaths = []
for i in range(img_data.shape[0]):
fpath = sequence_directory / f"temp{i:05}.tif"
tf.imwrite(fpath, img_data[i, ...])
filepaths.append(fpath)

yield filepaths


@pytest.fixture
def csv_file(tmpdir):
fpath = Path(tmpdir, "test.csv")
df3.to_csv(fpath, index=False)

yield fpath


@pytest.mark.parametrize(
"name, expected",
[
("A", df1["A"]),
("B", df1["B"]),
("C", df2["C"]),
("D", df2["D"]),
("E", df2["E"]),
("arr1", arr1),
("arr2", arr2),
("awk", awk_arr),
("sps", sps_arr.todense()),
],
)
def test_reading(context, name, expected):
client = from_context(context)
actual = client["x"][name].read()
if name == "sps":
actual = actual.todense()
assert numpy.array_equal(actual, expected)


def test_iterate_parts(context):
client = from_context(context)
for part in client["x"].parts:
client["x"].parts[part].read()


def test_iterate_columns(context):
client = from_context(context)
for col, _client in client["x"]:
read_from_client = _client.read()
read_from_column = client["x"][col].read()
read_from_full_path = client[f"x/{col}"].read()
if col == "sps":
read_from_client = read_from_client.todense()
read_from_column = read_from_column.todense()
read_from_full_path = read_from_full_path.todense()
assert numpy.array_equal(read_from_client, read_from_column)
assert numpy.array_equal(read_from_client, read_from_full_path)
assert numpy.array_equal(read_from_full_path, read_from_column)


def test_metadata(context):
client = from_context(context)
assert client["x"].metadata == md
for part in client["x"].parts:
assert client["x"].parts[part].metadata["md_key"] == f"md_for_{part}"


def test_parts_not_directly_accessible(context):
client = from_context(context)
client["x"].parts["df1"].read()
client["x"].parts["df1"]["A"].read()
client["x"]["A"].read()
with pytest.raises(KeyError):
client["x"]["df1"].read()


def test_external_assets(context, tiff_sequence, csv_file):
client = from_context(context)
tiff_assets = [
Asset(
data_uri=f"file://localhost{fpath}",
is_directory=False,
parameter="data_uris",
num=i + 1,
)
for i, fpath in enumerate(tiff_sequence)
]
tiff_structure_0 = ArrayStructure(
data_type=BuiltinDtype.from_numpy_dtype(numpy.dtype("uint8")),
shape=(5, 13, 17, 3),
chunks=((1, 1, 1, 1, 1), (13,), (17,), (3,)),
)
tiff_data_source = DataSource(
mimetype="multipart/related;type=image/tiff",
assets=tiff_assets,
structure_family=StructureFamily.array,
structure=tiff_structure_0,
management=Management.external,
name="image",
)

csv_assets = [
Asset(
data_uri=f"file://localhost{csv_file}",
is_directory=False,
parameter="data_uris",
)
]
csv_data_source = DataSource(
mimetype="text/csv",
assets=csv_assets,
structure_family=StructureFamily.table,
structure=TableStructure.from_pandas(df3),
management=Management.external,
name="table",
)

y = client.create_composite(key="y")
y.new(
structure_family=StructureFamily.array,
data_sources=[tiff_data_source],
key="image",
)
y.new(
structure_family=StructureFamily.table,
data_sources=[csv_data_source],
key="table",
)

arr = y.parts["image"].read()
assert numpy.array_equal(arr, img_data)

df = y.parts["table"].read()
for col in df.columns:
assert numpy.array_equal(df[col], df3[col])