-
Notifications
You must be signed in to change notification settings - Fork 1
The first incarnation of the metadata serialiser/deserialiser. #1
Conversation
src/decode.rs
Outdated
|
||
impl<'a> Decoder<'a> { | ||
/// Returns a new decoder which will deserialise from `read_from` and also an `Index` informing | ||
/// the consumer of the expected contents. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this docstring should mention that the ABI version number is checked.
(Sorry, only thought of this after raising the PR).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Presumably that's not the only check that, conceptually, it might one day do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not impossible that we might check something else at some point, but I can't think of anything off the top of my head.
What are you suggesting? That it's not worth mentioning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I doubt it's worth mentioning.
"Meta-data" is, to me, a meaningless term. I'm happy for |
src/decode.rs
Outdated
impl<'a> Decoder<'a> { | ||
/// Returns a new decoder which will deserialise from `read_from` and also an `Index` informing | ||
/// the consumer of the expected contents. | ||
pub fn new(read_from: &'a mut dyn Read) -> Result<(Index, Self), decode::Error> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like it's more of a from_
method than a new
method, because I can vaguely imagine other things than a file that we want to deserialise (e.g. a string).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
src/decode.rs
Outdated
|
||
let index = Index::deserialize(&mut deser)?; | ||
Ok(( | ||
index.clone(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we cloning the index as part of a tuple when we include it in the struct too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The one in the struct is used to keep track of what the deserialiser has already given out, and is mutated each time the user is given (in this case) a MIR.
This will become more important later where the deserialiser will likely give out other kinds of data than just MIR. In that case, the data would have to be read out by the consumer in the correct order.
I considered using a state machine to instead of a mutable copy of the index, but decided I was trying to be too clever.
I also experimented with having a deserialiser which acts only as an interface to hand out iterators, so that the order the user requests the data is irrelevant. The problem there is that we can only have one mutable reference to the underlying T: Read
and once we give it to serde, it's gone forever. This means we can't seek the underlying stream for subsequent iterator requests.
I hope that made sense, it's hard to explain.
src/lib.rs
Outdated
/// YkMir -- Metadata serialiser and deserialiser for Yorick. | ||
/// | ||
/// This crate allows ykrustc to serialise the compiler's MIR structures for later deserialisation | ||
/// by the Yorick runtime. The encoded data is specialised to a particular Rust input program (this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does "specialised to a particular Rust input" mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I only have a vague understand of this at the current time, but as I understand it:
The metadata encoded into rlibs and dynlibs by Rustc is "generic" so that things like monomorphisations, type resolution and inlining can happen for arbitrary inputs (Rust programs) in later compilation sessions.
When we read our metadata at runtime, we want the data for the current binary. We don't care about the general case. We have two options:
- Spin up the compiler to resolve rusts metadata to our binary (not even sure how possible this is).
- Resolve everything at compile time and serialise that.
I'm proposing the latter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could just drop the paragraph about specialisation until we understand it better.
I admit, there's a liberal amount of hand waving happening here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might be better to drop it if we don't understand it :) That way we leave ourselves a bit of flexibility in the future if we haven't quite got things right / fully understood them.
src/lib.rs
Outdated
/// | ||
/// The MIR data is serialised to msgpack data in the following form: | ||
/// | ||
/// version -- The ABI version number of the serialised data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is ABI
the right term here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah hah! I wasn't sure myself.
The version is just a number we bump when shape of the data structures serialised within. For example, when I add statement support I would bump the number, as struct Mir
will get bigger and we don't want an earlier deserialiser to try and decode it.
What would be the correct term? "serialisation version"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Serialisation version" works for me.
I'm not sure it's meaningless. Usually it means data which is secondary to the main data. In this scenario, the MIR is secondary to the compiled code. I suggested "meta-data" to be in line with what upstream call their serialisations. If you have another suggestion I don't mind trying it, as long as we describe in the crate docstring the relation to upstream "meta-data". |
Ah, if they call this stuff meta-data, fair enough. |
I think this is ready for re-review. The API is much simpler now. |
src/lib.rs
Outdated
@@ -40,7 +41,7 @@ pub use decode::Decoder; | |||
pub use encode::Encoder; | |||
|
|||
/// The version number of the data structures. | |||
const FORMAT_VERSION: usize = 0; | |||
const SER_VERSION: usize = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs to be a fixed size integer, otherwise we might read in a file from another architecture and do the wrong thing. Which suggests we probably need to stick the architecture in here somewhere too...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you are right on both counts. How about a u32 for the version, and the platform string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The platform string is variable size, which might be a bit annoying?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the serialiser cares, but let's try it.
Failing that we can use a manually specified enum.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also (ho ho ho!) does the ser_version change alongside the VM version? In other words, we might keep the same serialisation format, but some other detail changes and then boom! So we probably need some sort of YK version number in there too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just understood you comment about it being built in to the executable.
The problem is, I think, that there are two compile-times: the compiler, and the VM. We could get into a scenario where the compiler was built with one version of ykpack, and the VM was built using another. Right?
So, is it the semantic version of ykpack we actually care for here? Should that be what is encoded?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least at first, we're probably going to compile to a big binary blob, so I don't think we need to worry about this (in other words, that blob will carry with it the run-time compiler which, by definition, must be in sync with everything else in the blob). I think we can just drop the version checks entirely, at least as long as we compile to a single blob.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrm. I'm not sure I follow. I thought the VM would depend upon ykrt
(and thus ykpack
) via regular Cargo dependencies.
So the compiler will have (somewhere):
[dependencies]
ykpack = "x.y"
And then the VM would (perhaps transitively) have a similar dependency line in a Cargo manifest.
I've not considered the "blob" approach, nor know how hard it is to achieve.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly. cargo does the hard work for us, and we use the same tricks grmtools does to include small binary blobs into our big binary blob. If I'm wrong, we can put this stuff back later, but let's just drop the version checks for now, and make our life simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I'll have to trust you on this one.
It seems to me that the VM author has to be careful to use the same version of ykpack
as ykrustc
used or boom.
src/lib.rs
Outdated
for md in &inputs { | ||
enc.serialise(md.clone()).unwrap(); | ||
} | ||
// User forgot: enc.done() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit cryptic!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this be better?
// The user forgot to finalise the serialisation with a call to `enc.done()`.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to say something like "This test fails because the user forgot to add enc.done", otherwise the comment floats adrift in a universe of gnomisms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I would have thought this was evident from #[should_panic(expected = "not marked done")]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know your default style of comments, so I got it, but I doubt many other people would get it TBH.
|
||
# We are using a git version to work around a bug: | ||
# https://github.com/3Hren/msgpack-rust/issues/183 | ||
[dependencies.rmp-serde] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want you can do something like rmp-serde = { git = "...", rev = "..." }
. I have no preferences about which is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I started with that, but it makes a very long time. I prefer this way.
src/decode.rs
Outdated
} | ||
|
||
// Iterate over meta-data. | ||
pub fn iter(self) -> MetaDataIterator<'a> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it intentional that iter
consumes self
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, but if you think it's important to be able to allow the user to iterate multiple times using the same decoder instance, i can try &mut self
. It's not immediately clear if it's possible though... I think you'd need >1 mutable ref to allow that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be better off as-is then. It might be worth documenting this, something like "Note that iterating over this structure consumes it"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me try to confirm my suspicion regarding &muts
and get back to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tried a few different approaches. I don't think it's possible to allow multiple iterations over the same decoder.
In light of that, I do wonder if instead of having a decoder struct from which we can get an iterator, we could have a decoder which is an iterator. What would you prefer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. If it can only be a one-off iterator, then making the thole thing a one-off iterator does seem like a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's try it in a separate commit, and we can revert if we hate it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK.
src/encode.rs
Outdated
|
||
/// Serialises a meta-data item. | ||
pub fn serialise(&mut self, md: MetaData) -> Result<(), encode::Error> { | ||
Some(md).serialize(&mut self.ser)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If serialize
return ()
on success, you can probably simplify this method to just Some(md).serialize(...)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah.
/// Finalises the serialisation and writes a sentinal. | ||
pub fn done(mut self) -> Result<(), encode::Error> { | ||
None::<Option<MetaData>>.serialize(&mut self.ser)?; | ||
self.done = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.done
is never read from AFAICS, so I'm not sure what it's for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be used in the destructor.
src/lib.rs
Outdated
/// version -- The serialisation format version number. | ||
/// This should be bumped every time the meta-data structure changes. | ||
/// md_0: \ | ||
/// ...⋮ - Meta-data items. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need the "⋮" character here as well as "...".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hah! I tried to paste a ⋮
into vim but thought it didn't work! I guess liberation mono doesn't have that glyph.
src/lib.rs
Outdated
/// sentinal -- End of meta-data marker. | ||
/// ----------- | ||
/// | ||
/// Where each md_i is an instance of `Some(MetaData)` and the sentinel is a `None`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Double space at the beginning of the line should be single space?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Sentinel" and "sentinal" -- which is it? ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops. I think it's "sentinel".
} | ||
|
||
/// The top-level meta-data type. | ||
#[derive(Serialize, Deserialize, PartialEq, Eq, Debug, Clone)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need all of these to be derived?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Serialize and Deserialize, certainly.
- Eq and PartialEq simplify tests, and may be useful later on too.
- Debug is useful.
- Clone is also used in tests, since the serialiser takes ownership of what's being serialised, and we need to compare the original input to what comes back from the round trip.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's delete the others that we don't need then: it just leads to bloat (and also makes me wonder what it mean to compare these things for equality!).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you are suggesting we should remove Eq
and PartialEq
. This is going to make testing quite awkward, as I'll have to do a big old restructuring with multiple comparisons.
E.g. instead of comparing two metadata's using an assert_eq!(expect_md, got_md)
I'll have to have a match statement for each variant, then in the case of a MIR, destructure each basic block, and the statements within, manually comparing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, sorry, I see.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You've got me thinking though. What does a derived Eq
for a reference do? Compare addresses? We need to be careful, as that might not be what we want.
As it happens, we have no references in our md at the moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All comparisons involve references: fn eq(&self, other: &Self)
in essence. So that isn't an issue. [If you mean "pointers", though, then yes, I think the pointer is compared and not the thing its pointed to.]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see.
I'd like to think we are sane enough to not serialise a pointer :P
I'm wondering if instead of just "metadata" we want to say something like "crate metadata" or similar? |
|
||
impl<'a> Drop for Encoder<'a> { | ||
fn drop(&mut self) { | ||
if !self.done { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^ Use of self.done
is here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK.
Why limit ourselves to crates? Who knows what we might need later :P |
I'm still not super happy with "metadata" as a name, just because it's void of content. It must be metadata of something, so I'm wondering what that something is. |
src/decode.rs
Outdated
} | ||
|
||
/// A (fallible) meta-data iterator. | ||
/// Reusing the iterator once is has yielded `None` leads to undefined behaviour. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What did you think of this? We could actually force the iterator to continually return None
instead of shooting off the end of the cursor, but it would require one more flag in the struct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're not exposing this to the end-user, then I think this is OK.
[Just spotted a typo though. Should be:"Reusing the iterator once it has yielded None
...".]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
BTW, are we still good to rename this to |
For the record, we've agreed on |
This should address all comments. |
src/decode.rs
Outdated
use std::io::Read; | ||
|
||
/// The pack decoder. | ||
/// Offfers a simple iterator interface to serialised packs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doh. Offfers.
Try this. |
README.md
Outdated
Pack encoder / decoder for Yorick. | ||
|
||
This library allows ykrustc to serialise various compile-time information for | ||
later reading at runtime. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For consistency, maybe "run-time"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed
Please squash. |
Oops, I'm ahead of myself! We need to do the "run-time" thing first. Sorry! |
Squash now? |
Yes please. |
This change adds a streamer "pack" encoder and decoder. The idea is that compile-time info can be serialised by ykrustc at compile-time and deserialised at run-time. For example the MIR will need to be consulted at runtime for code generation. We will have ykrustc serialise the MIR into packs which are stored in the resulting ELF file ready to be used later at run-time.
squashed, and commit message fleshed out. |
Oops. |
Here's a stab at the yorick metadata serialiser and deserialiser.
It currently does only what our custom serialiser did and no more. As such, there are some
FIXME
s.I spent quite some time trying to be too clever with the type system and with iterators and made quite a mess with the borrow checker. After about four iterations, I settled on this simpler and (I hope) idiot proof API.
I have a branch which changes to make the compiler to use this, but it depends upon this being merged:
rust-lang/rust#58013
Even once that is merged, I think we will have to rename the in-compiler serde crate in our compiler fork.
I'm starting to think
ykmir
wasn't a good name. How aboutykmd
, since we will probably be encoding more than just mir. We will be encoding metadata in general. Thoughts?