Implement serde #2

Open
joshua-laughner opened this issue Jan 17, 2023 · 5 comments
@joshua-laughner (Owner)

It would be extremely convenient to be able to use serde to (de)serialize appropriate structs to/from Fortran-style records.

@joshua-laughner joshua-laughner added this to the Alpha release milestone Jan 17, 2023
@joshua-laughner joshua-laughner self-assigned this Jan 17, 2023
@joshua-laughner (Owner, Author)

Point 1: mimicking Fortran behavior

We don't need to match Fortran's behavior exactly when there are too many or too few format specs for the available input data, but we should match it as closely as possible so that anyone translating Fortran code into Rust can use this without too much effort. Here's some F90 code I wrote for a quick test:

program read_tests

    implicit none

    integer*4 x, y, z
    real*4 a, b, c
    character(16) name, short_input
    character(72) input

! First try having a longer format spec than string
    input = '12 34 56'
    read(input, '(10i3)') x, y, z
    write(*,*) 'Case 1: ', 'x=',x,'y=',y,'z=',z

! Then a shorter format spec
    input = '100 200 300 400'
    read(input, '(2i4)') x, y
    write(*,*) 'Case 2: ', 'x=',x,'y=',y


! Now try having the string be too short
    write(short_input, '(2f8.2)') 10.0, 20.0
    write(*,*) 'short_input=', short_input
    read(short_input, '(3f8.2)') a, b, c
    write(*,*) 'Case 3: ', 'a=',a,'b=',b,'c=',c

end program

The output is:

gfortran -o read_tests read_tests.f90
./read_tests
 Case 1: x=          12 y=          34 z=          56
 Case 2: x=         100 y=         200
 short_input=   10.00   20.00
 Case 3:a=   10.0000000     b=   20.0000000     c=   0.00000000

So what does this mean?

  • If there are more format specifiers than variables and data in the string, Fortran just ignores the extra specifiers.
  • If there is more data in the string than format specifiers and variables, Fortran just stops once it has read the data for the variables it needs.
  • If the string ends before all the expected variables are filled, Fortran just stops. Any extra variables are left at their previous values.

The first two behaviors make sense to reproduce. The third one, to me, does not. Even if the third behavior is, in fact, relied on by Fortran programs, I think it runs counter to what we expect in Rust. The only use case I can see is when the "extra" variables are optional, in which case they should be represented in Rust as an Option.
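
As a rough illustration of what I mean (the struct and field names here are just placeholders, not part of the crate), a trailing value that may be absent from the record would map to an Option instead of silently keeping a stale value:

use serde::Deserialize;

#[derive(Deserialize)]
struct Case3 {
    a: f32,
    b: f32,
    // With a format of (3f8.2) but only two values present in the string, the
    // idea is that this field deserializes to None rather than being left at
    // whatever value it previously held.
    c: Option<f32>,
}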

@joshua-laughner (Owner, Author)

Point 2: how to handle Vecs and maps?

These data structures are particularly tricky because Fortran usually uses fixed-length arrays and, more relevantly, a Fortran read statement requires you to say exactly how many elements to read into an array. While this might sound clunky, it can be quite flexible and even allows that number to be defined by values previously read in from the file.

Most of the common formats serde targets have some way to communicate the start and end of an array, e.g. JSON's [...] syntax, so serde has no mechanism for dynamically defining how many elements should be deserialized into a Vec or map. I've considered a few possible workarounds:

  1. A custom generic type that is basically a Vec with a maximum length encoded in its type signature via const generics. It would have a custom Deserialize implementation that respects the min/max length and stops deserialization once the maximum length is reached.
  2. An fread! macro that basically mimics the behavior of the Fortran read statement.

I'm not sure how useful the first one would be, given that the max length has to be a compile-time const. The second one I could still see being useful, though complicated to implement.
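
For concreteness, here is a minimal sketch of what the first option might look like (the BoundedVec name and the details are assumptions, not a final design; only the max-length part is shown):

use serde::de::{Deserialize, Deserializer, SeqAccess, Visitor};
use std::fmt;
use std::marker::PhantomData;

// Hypothetical bounded vector: a Vec that never holds more than N elements.
struct BoundedVec<T, const N: usize>(Vec<T>);

struct BoundedVecVisitor<T, const N: usize>(PhantomData<T>);

impl<'de, T: Deserialize<'de>, const N: usize> Visitor<'de> for BoundedVecVisitor<T, N> {
    type Value = BoundedVec<T, N>;

    fn expecting(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "a sequence of at most {} elements", N)
    }

    fn visit_seq<A: SeqAccess<'de>>(self, mut seq: A) -> Result<Self::Value, A::Error> {
        let mut values = Vec::with_capacity(N);
        // Stop asking the deserializer for elements once the max length is reached.
        while values.len() < N {
            match seq.next_element::<T>()? {
                Some(v) => values.push(v),
                None => break,
            }
        }
        Ok(BoundedVec(values))
    }
}

impl<'de, T: Deserialize<'de>, const N: usize> Deserialize<'de> for BoundedVec<T, N> {
    fn deserialize<D: Deserializer<'de>>(deserializer: D) -> Result<Self, D::Error> {
        deserializer.deserialize_seq(BoundedVecVisitor(PhantomData))
    }
}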

For now, I think the most important goal is to make deserialization do the obvious expected thing. If there are cases where deserialization can't access enough run-time information to correctly partition a series of values into different sequences, those are best handled by an intermediate type with serde's from and try_from derive attributes. The rules I'm planning to implement now are:

  1. An arbitrary length sequence (like a Vec) or mapping (like a HashMap) should consume as many values from deserialization as it can.
  2. These sequences will stop if they try to deserialize the wrong type. That is, when deserializing something with the format (10f8.2,a32) into a (Vec<f32>, String), the Vec will take the 10 floats and stop when it reaches the string. Likewise, deserializing a format like (10(a32,i8)f13.4) into a HashMap<String,i32> should take the 10 string/int pairs and stop at the float. Note that we can't peek at each format spec and stop when we encounter a different one, because we might be deserializing into an enum that supports different types.
  3. These sequences should probably stop at the end of a line when reading from a multi-line source. I need to test what Fortran does in this situation.
  4. Ideally, these sequences should be able to "withhold" entries for fields following them in the deserialization. That is, when deserializing a (6i5) formatted string into a (Vec<i32>, i32, i32), it should be smart enough to only deserialize the first four integers into the Vec, leaving the last two for the standalone integers. However, I don't know if this is possible with serde, and there are all sorts of possible pain points, e.g. what happens if you try to deserialize a (Vec<T>, T, Vec<T>). This may be a case that is best handled by either using a known-length array (that is, deserializing into ([i32; 4], i32, i32)) or using from/try_from (see the sketch below).
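
Here's a rough sketch of the try_from route for that last case (the Record name and field names are placeholders): deserialize the whole record as a fixed-length array, then split it up in the conversion.

use serde::Deserialize;
use std::convert::TryFrom;

#[derive(Deserialize)]
#[serde(try_from = "[i32; 6]")]
struct Record {
    values: Vec<i32>,
    first: i32,
    second: i32,
}

impl TryFrom<[i32; 6]> for Record {
    type Error = String;

    fn try_from(raw: [i32; 6]) -> Result<Self, Self::Error> {
        // The first four integers go into the Vec, the last two into the
        // standalone fields.
        Ok(Record {
            values: raw[..4].to_vec(),
            first: raw[4],
            second: raw[5],
        })
    }
}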

@joshua-laughner (Owner, Author)

Point 3: handling columnar files

These are files like the ones we have in GGG, with a table of data using a known format string but with column headers. We can probably assume that the file as a whole would deserialize into one of:

  • Vec<T>, where T is some sequence (including a struct) or nested sequence (e.g. a tuple of structs)
  • A custom struct, where the fields are each vectors matching the types of the data in the columns
  • Map<K,Vec<V>> where K is the type of the column headers and V is the type of the data in the file (can only be one type)
  • A DataFrame

These are probably going to require a special deserializer that stores the column names and iterates through them in sync with the format specs. The tricky part is whether I'll be able to handle the line-directed input correctly, or if the stateless nature of deserializers will pose a problem.

For the Vec<T> case, I could see this working via a function like

fn vec_from_table<R: BufRead, D: DeserializeOwned>(reader: R, header: &[&str]) -> Vec<D>

where this would read one line at a time from the file and call from_str(&line) on it, giving the string-based deserializer just a line to work with so that we enforce a mapping between file lines and the type D.
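
A minimal sketch of that helper, assuming this crate exposes a from_str entry point as described above (its exact signature and error type are assumptions here, and I've wrapped the result in a Result for error handling):

use serde::de::DeserializeOwned;
use std::io::BufRead;

fn vec_from_table<R: BufRead, D: DeserializeOwned>(
    reader: R,
    _header: &[&str],
) -> Result<Vec<D>, Box<dyn std::error::Error>> {
    // In a full implementation, the header would be passed to the per-line
    // deserializer so it can pair columns with format specs.
    let mut rows = Vec::new();
    for line in reader.lines() {
        let line = line?;
        // Hand the string-based deserializer exactly one line, so each file
        // line maps to exactly one value of type D.
        let row: D = from_str(&line)?;
        rows.push(row);
    }
    Ok(rows)
}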

The other types are more difficult; it's not clear at the moment how we would distinguish between the top-level struct, map, or dataframe (which should use the provided header) and the inner types (which should not).

@joshua-laughner (Owner, Author)

Point 4: handling alternate struct deserialization

I have this written so that the default way of deserializing structs is to treat them like a tuple and just deserialize their fields in the order they appear in the struct. However, we may want to support alternate ways of deserializing that don't rely on order. Specifically:

  • An "internally tagged" representation, where the field names alternate with the values, e.g. a format spec like (a8,i4,a8,f8.2)
  • The tabular representation discussed above.

This isn't hard to switch for a full deserialization, but what if you had something like:

struct Outer {
    site_id: String,
    met: Inner
}

struct Inner {
    pres: f32,
    temp: f32,
    rhum: f32
}

and this needed to be deserialized from the format (a2,1x,3(a4,1x,f8.3)) and the data looked like:

aa pres 1013.250 temp  298.000 rhum   25.100
bb temp  297.500 pres 1010.012 rhum   18.700
cc rhum   50.123 temp  290.320 pres  999.876

such that the Outer struct should be deserialized based on order and the Inner one based on the field names before each value? I don't know if there's an easy way to communicate that these two structures should be deserialized differently. It may be that this is another case where we would have to use the try_from attribute. That might look like:

impl TryFrom<(String, HashMap<String, f32>)> for Outer { ... }

#[derive(Deserialize)]
#[serde(try_from = "(String, HashMap<String, f32>)")]
struct Outer {
    site_id: String,
    met: Inner
}
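
One possible shape for that TryFrom impl (the error type and the lookup logic are assumptions): pull each Inner field out of the map by its tag.

use std::collections::HashMap;
use std::convert::TryFrom;

impl TryFrom<(String, HashMap<String, f32>)> for Outer {
    type Error = String;

    fn try_from((site_id, tags): (String, HashMap<String, f32>)) -> Result<Self, Self::Error> {
        // Look up a tagged value, turning a missing tag into an error.
        let get = |key: &str| {
            tags.get(key)
                .copied()
                .ok_or_else(|| format!("missing field '{}'", key))
        };
        Ok(Outer {
            site_id,
            met: Inner {
                pres: get("pres")?,
                temp: get("temp")?,
                rhum: get("rhum")?,
            },
        })
    }
}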

@joshua-laughner (Owner, Author)

Serialization is done for the basic formatting field types. Deserialization has a few types remaining (none, newtypes, enums), but since I've worked out how I want to serialize them, the way deserialization should work is clearer. Once those types are implemented, I will close this.

I may still implement an alternative method of handling structures/maps, where the field names are written as fields in the output. I've not decided whether this should be an option in the settings or a separate serializer; it will depend on how much more complicated it makes the logic.
