A Review of Compile-Time Checking in Serde Format Implementations #2

onkoe opened this issue Jun 26, 2024 · 3 comments

onkoe commented Jun 26, 2024

One of my biggest problems with the csv crate is the vast number of limitations it imposes. For example, you can't serialize scalars (plain primitives) outside of structs, and lots of otherwise-helpful types can't be represented at all.

When it comes to building tables - and that's CSV's only 'real' use case - you sometimes find that your data won't fit into the csv crate's model.

This crate is intended to address these concerns. Unfortunately, this means making yet another subset of CSV that includes representation of Rust types within headers. For example, I currently want an enum to be represented by something like Nested::Enum::Variant in the header, with Variant's data dutifully v.serialize(self)'d during construction.

It also has some design woes...

Design

I've considered various API designs for this library. It seems that csv can represent most types so long as they fit the Record model, meaning that you're stuck serializing one struct with a bazillion fields.

That's pretty ugly in your codebase, but more importantly, it's difficult to maintain. In addition, the overall structure of csv disallows decent usage of #[serde(flatten)] on fields. This means that it can't really handle 'recursive' fields full of data, which is what Rust is usually good at!

Even worse, csv has countless runtime errors for all of these unsupported cases. You don't know what's going to work until you try it, making your code feel like some flimsy Python. That's fine for many folks using csv, but some use cases can't handle fragile constructs.

It's not csv's fault, though - serde requires you to fully implement its serialization API, even though some formats don't support certain types. Even so, I aim to somehow reach around the trait system and at least alert users of these problems at compile time when serializing a header (i.e., actually making a whole file).
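For context, here's roughly what format crates do today when serde demands a method the format can't support (a sketch only; HeaderSerializer and Error::Unsupported are made-up names):

use serde::ser;

impl<'a> ser::Serializer for &'a mut HeaderSerializer {
    // ...snip

    // serde forces this method to exist even though a CSV header has
    // no sensible representation for a lone bool, so the failure can
    // only surface at runtime.
    fn serialize_bool(self, _v: bool) -> Result<Self::Ok, Self::Error> {
        Err(Error::Unsupported("bool"))
    }

    // snip...
}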

To that end, some automatic determination of the ordering of the given types would be fantastic! However, there are some difficulties here.

The Spirit of Reflection

I think, then, the design will be oriented around checking the types users give us. Here's some context:

An Example

Let's say we have three structs: A, B, and C. We always want them to be in that order, as CSV is a table, and its records should never deviate from the original structure.

However, if our serializer allows users to input multiple types, we're just asking for inconsistencies.

My first thought was to just use marker traits and a PhantomData to represent the current state of the serializer. I still think that could be helpful for handling header build state, but it vastly increases the number of checks we need in each implementation. That sounds supremely annoying to implement, so I considered other options.
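For reference, here's a minimal sketch of that typestate idea (all names hypothetical):

use core::marker::PhantomData;

struct A { a: u8 }

// Hypothetical markers for "which struct comes next".
struct ExpectA;
struct ExpectB;

struct RecordSer<State> {
    buf: String,
    _state: PhantomData<State>,
}

impl RecordSer<ExpectA> {
    fn new() -> Self {
        RecordSer { buf: String::new(), _state: PhantomData }
    }

    // Consuming `self` and returning the next state means calling
    // this twice (or out of order) is a type error.
    fn push_a(mut self, a: &A) -> RecordSer<ExpectB> {
        self.buf.push_str(&a.a.to_string());
        RecordSer { buf: self.buf, _state: PhantomData }
    }
}

// ...and so on for ExpectB, ExpectC, etc. Every state needs its own
// impl block, which is exactly the pile of extra checks I mean.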

Grabbing Type Information

Well, how about std::any::TypeId? Using it would allow us to keep a list of types that we want in each Record without enforcing the struct of shame. It's also generated at compile time, and... oh! We can only compare it at runtime.
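Here's a tiny sketch of why TypeId only gets us halfway: the ids come from the compiler, but any comparison against them happens when the program runs.

use std::any::TypeId;

struct A;
struct B;
struct C;

fn main() {
    // The ids themselves are compiler-generated...
    let expected = [TypeId::of::<A>(), TypeId::of::<B>(), TypeId::of::<C>()];

    // ...but checking a caller's order against them is still a runtime
    // comparison. This panics when run, not when compiled.
    let got = [TypeId::of::<C>(), TypeId::of::<A>(), TypeId::of::<B>()];
    assert_eq!(expected, got, "record fields are out of order");
}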

I briefly remembered const and how it could inspire my design. It helped by reminding me that I already know the order of serialization at compile time. With that in mind, here's what I'm looking for:

use nqcsv::{Serializer, RecordBuilder};

struct A {
    a: u8,
}

struct B;
struct C;

fn main() {
    // define the fields
    let mut record_maker = RecordBuilder::new()
        .add::<A>()
        .add::<B>()
        .add::<C>()
        .build();

    // make some data
    let a = A { a: 1 };
    let b = B;
    let c = C;

    // alright, let's try to serialize them in the wrong order!
    let record = record_maker.push(&c).push(&a).push(&b).make();
}

This should result in some kind of compile-time error:

$ cargo build
    Compiling nqcsv v0.1.0 (/home/user/Documents/projects/nqcsv)
    Compiling ...
    
error: This record contains fields in the wrong order. 

Expected order: [A, B, C]
Found order:    [C, A, B]

   --> src/main.rs:24:38
    |
24  |     let record = record_maker.push(&c).push(&a).push(&b).make();
    |                                   ^^^^
    |
    = note: this error originates in the macro `nqcsv::record`

To be clear, I don't love this interface design, either.

I do like the idea of making a builder, but the fact that you need to specify it twice is relatively annoying. It feels... shameful. I'd rather have some way to create a Record type from the given information. After all, wasn't that the purpose of TypeId?

One consideration for this feature would be a macro design. For example, a procedural macro could create a marker type and implement a StateMarker trait for it.
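Roughly, the generated code might look like this (both the marker types and StateMarker are hypothetical):

// What a hypothetical derive might expand to for `A`: a zero-sized
// marker plus an impl wiring it into the builder's state machine.
trait StateMarker {
    type Next;
}

struct AMarker;
struct BMarker;

impl StateMarker for AMarker {
    // After `A`, the builder may only accept `B`.
    type Next = BMarker;
}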

That's not fully up to snuff, though. Because the entire Serializer trait needs to be implemented for each stage, there'd still be runtime errors! Gah!

To get rid of those, we'd have to move the macro stuff into the implementation side of things. Here's the ideal way I'd do this if you could keep state across procedural macro invocations:

impl<'a> ser::Serializer for &'a mut CsvSerializer {
    // ...snip
    
    // just a random one of these MANY methods
    fn serialize_i8(self, v: i8) -> Result<Self::Ok, Self::Error> {
        // panics (during compilation) if order's wrong.
        // otherwise, resolves to unit type, (). ;)
        crate::util::type_check!(&mut self.order, v); 
        
        self.push_str_unchecked(&itoa::Buffer::new().format(v));
        Ok(())
    }
    
    // snip...
}

However, I am not convinced that it's possible without reflection (or macro state, which is generally viewed as bad).

Falling Back

With these problems combined, I think we need a formal Record type that can represent the data a user's putting in. That increases flexibility!

However, it still has that "doublespeak" pattern where you first specify what you'll give to the library, then you hand it over. And... we're back to the single-struct syntax that we wanted to avoid.

I suppose we're hitting the edge of what serde (and Rust) can do, making us lean back into macro territory again. Essentially, we'd need to know about EVERY call that ever comes in to avoid those runtime errors.

Additionally, I'm not even sure how to get the entire context in my proc macro. It'd have to be something incredibly hacky that practically does this:

#[nqcsv::look_for_all_calls_and_typecheck_maybe_with_some_reflection_idk]
mod all_external_source_code_ever;

It may not be possible to use serde as-is; we may instead need our own #[derive(Csv)] macro to make anything flexible.

And even that still gives "single struct" vibes.

I'm not sure where to go from here, but I'll keep researching and provide additional details when I have them.

Until then, I'm definitely starting to see why csv is designed this way. 😄


onkoe commented Jun 27, 2024

A Mid Solution

  1. split src/ser.rs into src/ser/{header, field}.rs
  2. make all the header methods except "Record" (canonical struct) syntax panic. hide them from docs. put scary doc comments for IDEs. if possible, make them deprecated (not sure if that works on trait impls). see the sketch below.
  3. create a state model for serialization similar to that in csv
  4. wrap everything up in the CsvWriter/CsvReader
  5. consider improvements from there. look for ways to split serde's interface, perhaps with a weird fork. see if other format impls have done anything more successful.
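here's a sketch of what step 2 could look like for one header method (HeaderSerializer is a placeholder name, and i haven't verified how #[doc(hidden)] or #[deprecated] behave on trait impl items):

use serde::ser;

impl<'a> ser::Serializer for &'a mut HeaderSerializer {
    // ...snip

    /// ⚠️ Not part of the supported API! Headers can only be built
    /// from one canonical `Record` struct; this exists only because
    /// serde's trait demands it.
    #[doc(hidden)]
    fn serialize_i8(self, _v: i8) -> Result<Self::Ok, Self::Error> {
        panic!("nqcsv: headers must come from a canonical `Record` struct");
    }

    // snip...
}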


onkoe commented Jun 27, 2024

oh my, i just had a nice-feeling idea for taking a struct as a Record in the Writer

  • allow the user to pass in any struct as a generic, T
  • store that T in a PhantomData
  • now, the user can only operate on that T. any violation will be against the constructed type in the Writer/etc. definition

this REALLY cuts down on the boilerplate checking in De/Serializer
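a rough sketch of that shape (Writer and push are placeholder names, and the Serialize bound assumes serde):

use core::marker::PhantomData;
use serde::Serialize;

// the writer is locked to a single record type at construction
struct Writer<T: Serialize> {
    out: String,
    _record: PhantomData<T>,
}

impl<T: Serialize> Writer<T> {
    fn new() -> Self {
        Writer { out: String::new(), _record: PhantomData }
    }

    // only `T` is accepted, so pushing a `B` into a `Writer<A>` is a
    // plain old type error instead of a runtime check
    fn push(&mut self, record: &T) -> &mut Self {
        // ...serialize `record` into `self.out` here
        let _ = record;
        self
    }
}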


onkoe commented Jun 27, 2024

note: do we need the phantomdata for anything? or is the type system enough? try both :3
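for the record: if T only shows up in method arguments, the struct won't even compile without some use of T, so the PhantomData (or a real field) is required:

// error[E0392]: parameter `T` is never used
struct Writer<T> {
    out: String,
}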
