-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Question: Sparse heterogenous data #118
Comments
Hi Chris, this repo just tackles the issue of storing a persistent tree data structure on IPFS, it does not implement the columnar storage that @rklaehn blogged about. Rüdiger should be able to give you better pointers. Regards, Roland |
Hi @chrisrossi. Just saw this. I try to limit my screen time on Sundays, but will give you a detailed answer some time next week. |
OK, so what banyan does for you is the following:
What it does not do is to prescribe a format for your data. It has to be something that can be serialized to cbor, that's all. It is not a database where you can easily define custom indexes, but more like a database construction kit where you have to put some thoughts into what you want to index. So the question is, how will this data be queried? Are you going go e.g. query just data for The easiest thing you could do is to just store the data in a row based format and tag each value with a set of values it contains and the time. The summary would then be also a set of values, and a time range. This will work just fine and give you some decent compression, but it is not optimal. For optimal compression you would have to transpose the data to a columnar format, as I described in https://rklaehn.github.io/2018/06/10/efficient-telemetry-storage-on-ipfs/ . If you do that you can expect some extremely good compression, but it is a bit more work. It is worth it though, if you expect to store a lot of data. You can expect a large improvement of the compression rate. Here are some measurements from my blog post:
The columnar + compressed is > a factor of 5 better than the naive "compressed rows" approach. Banyan makes it easy to plug in this row->column transformation. However, I would suggest starting with the row based approach to get your feet wet and switch to the columnar approach later. HTH, Rüdiger |
Got it. Thank you for clarifying. Will be getting into it this week. Thanks! |
@rklaehn Is there an open source example of something using the columnar approach with Banyan? Thanks! |
This https://github.com/Actyx/banyan/blob/master/banyan-utils/src/tag_index.rs is a little bit like it, and a long time ago I started on an implementation of this approach in rust here https://github.com/rklaehn/iptm-rs/tree/master/src But the production quality tag index is in the actyx code base, which is proprietary. Depending on how your data looks (nested or not) this does not have to be hard though. |
Thanks, I'll take a look at those! The data, AFAIK, is all expressible in CSV, so simple tables, no nesting. I'll look at the columnar approach, but I'm also thinking about just adding a field to each row that is a bitmap indicating presence/absence of a variable. E.G. this example data, would be transformed:
to this (2023-05-20 can be omitted):
Obviously, anticipating data with many more than 3 columns. |
Just f64 values, no need for a variant? How many columns approximately? So a value would be something like struct Row {
time: u64,
values: BTreeMap<String, f64>,
} ? A key would then be the set of strings for each column name present in the row, plus a timestamp. |
More or less, yes. To get around storing the column names per row, I might do something more like:
With column order/names defined elsewhere. |
Yes, I was asking what is in a Row in principle. Storing this exactly as is would be inefficient obviously. So what you would do if you wanted a columnar approach would be to define a struct RowSeq {
names: Vec<String>, // names, unique (or unique and sorted)
times: Vec<u64>, // time for each row, should be delta encoded when serializing
colums: Vec<Column>, // columns, same number of items as names
}
struct Column {
values: Vec<f64>, // values
times: Vec<usize>, // indexes into the times table, guaranteed to be strictly sorted, should be delta encoded when serializing
// this is the equivalent of the bitmap in your approach, except that it is transposed
} You can then write things like getting a single proptest is the better crate. It has composable minimization of failed tests. |
Ok, so if I'm following along, RowSeq is the value type stored in Banyan. So instead of storing a single row per index in the Banyan tree, we'll store a certain number of rows (transformed into columnar format) at each index in the tree. Is there a heuristic for deciding how many rows to store in a RowSeq? Or have I misunderstood? |
Almost correct. One of the things banyan does for you is to give you the option to store the rows in a columnar format, which is almost always more efficient because you have more similar data closer to each other. This helps the compression algorithm find the redundancy. You are storing What you will typically do with an active data source is just to store data as quickly as possible as it comes in, creating a long chain, and then "pack" the data into bigger leafs. The heuristics about how many rows to pack into one leaf is a configuration parameter of banyan and is based on the compressed size, not the uncompressed size. This is very important. Different data has vastly different compression rate, and for efficient distribution you want the final compressed and encrypted blocks to be of roughly the same size. https://docs.rs/banyan/0.17.1/banyan/struct.Config.html pub struct Config {
All the other parameters are to prevent bad things from happening, and to give you some control over the shape of the tree. E.g. |
@rkuhn any chance you can open source some of the relevant code? It would be really helpful here I think. |
Ah, turns out that the tag index part is open source already. It is not exactly what you need, but it might help as inspiration... |
Hi, Rudiger, et al,
David Phelan has asked me to look at using this library for dClimate, and I was hoping you could help answer a question for me.
We have weather station telemetry data that has the general form:
In plain English, for a specific index, we have several variables, any of which may or may not have a value.
David did point me to this older work where it's clear you were thinking about heterogenous data and missing values, but looking at the Banyan repository, it's not obvious to me how these are handled. Is this use case accounted for with Banyan, or should we be looking elsewhere?
If this is an accounted for use case, what's the best way to go about it? Use a separate tree for each variable, using Option<T> for the value, or use a single tree with a tuple value?
I'm not sure if this is related or not, but I don't have a good handle on what the "Forest" abstraction is or does. Is there any information you could provide about that?
Thank you!
Chris
The text was updated successfully, but these errors were encountered: