Skip to content
This repository has been archived by the owner on Oct 8, 2019. It is now read-only.

The first incarnation of the metadata serialiser/deserialiser. #1

Closed
wants to merge 2 commits into from

Conversation

vext01
Copy link
Member

@vext01 vext01 commented Feb 11, 2019

Here's a stab at the yorick metadata serialiser and deserialiser.

It currently does only what our custom serialiser did and no more. As such, there are some FIXMEs.

I spent quite some time trying to be too clever with the type system and with iterators and made quite a mess with the borrow checker. After about four iterations, I settled on this simpler and (I hope) idiot proof API.

I have a branch which changes to make the compiler to use this, but it depends upon this being merged:
rust-lang/rust#58013

Even once that is merged, I think we will have to rename the in-compiler serde crate in our compiler fork.

I'm starting to think ykmir wasn't a good name. How about ykmd, since we will probably be encoding more than just mir. We will be encoding metadata in general. Thoughts?

src/decode.rs Outdated

impl<'a> Decoder<'a> {
/// Returns a new decoder which will deserialise from `read_from` and also an `Index` informing
/// the consumer of the expected contents.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this docstring should mention that the ABI version number is checked.

(Sorry, only thought of this after raising the PR).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably that's not the only check that, conceptually, it might one day do?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not impossible that we might check something else at some point, but I can't think of anything off the top of my head.

What are you suggesting? That it's not worth mentioning?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I doubt it's worth mentioning.

@ltratt
Copy link
Member

ltratt commented Feb 11, 2019

"Meta-data" is, to me, a meaningless term. I'm happy for ykmir to be renamed, but preferably without the dreaded term "meta-data" rearing its head ;)

src/decode.rs Outdated
impl<'a> Decoder<'a> {
/// Returns a new decoder which will deserialise from `read_from` and also an `Index` informing
/// the consumer of the expected contents.
pub fn new(read_from: &'a mut dyn Read) -> Result<(Index, Self), decode::Error> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like it's more of a from_ method than a new method, because I can vaguely imagine other things than a file that we want to deserialise (e.g. a string).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

src/decode.rs Outdated

let index = Index::deserialize(&mut deser)?;
Ok((
index.clone(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we cloning the index as part of a tuple when we include it in the struct too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The one in the struct is used to keep track of what the deserialiser has already given out, and is mutated each time the user is given (in this case) a MIR.

This will become more important later where the deserialiser will likely give out other kinds of data than just MIR. In that case, the data would have to be read out by the consumer in the correct order.

I considered using a state machine to instead of a mutable copy of the index, but decided I was trying to be too clever.

I also experimented with having a deserialiser which acts only as an interface to hand out iterators, so that the order the user requests the data is irrelevant. The problem there is that we can only have one mutable reference to the underlying T: Read and once we give it to serde, it's gone forever. This means we can't seek the underlying stream for subsequent iterator requests.

I hope that made sense, it's hard to explain.

src/lib.rs Outdated
/// YkMir -- Metadata serialiser and deserialiser for Yorick.
///
/// This crate allows ykrustc to serialise the compiler's MIR structures for later deserialisation
/// by the Yorick runtime. The encoded data is specialised to a particular Rust input program (this
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "specialised to a particular Rust input" mean?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I only have a vague understand of this at the current time, but as I understand it:

The metadata encoded into rlibs and dynlibs by Rustc is "generic" so that things like monomorphisations, type resolution and inlining can happen for arbitrary inputs (Rust programs) in later compilation sessions.

When we read our metadata at runtime, we want the data for the current binary. We don't care about the general case. We have two options:

  • Spin up the compiler to resolve rusts metadata to our binary (not even sure how possible this is).
  • Resolve everything at compile time and serialise that.

I'm proposing the latter.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could just drop the paragraph about specialisation until we understand it better.

I admit, there's a liberal amount of hand waving happening here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be better to drop it if we don't understand it :) That way we leave ourselves a bit of flexibility in the future if we haven't quite got things right / fully understood them.

src/lib.rs Outdated
///
/// The MIR data is serialised to msgpack data in the following form:
///
/// version -- The ABI version number of the serialised data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is ABI the right term here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah hah! I wasn't sure myself.

The version is just a number we bump when shape of the data structures serialised within. For example, when I add statement support I would bump the number, as struct Mir will get bigger and we don't want an earlier deserialiser to try and decode it.

What would be the correct term? "serialisation version"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Serialisation version" works for me.

@vext01
Copy link
Member Author

vext01 commented Feb 12, 2019

"Meta-data" is, to me, a meaningless term. I'm happy for ykmir to be renamed, but preferably without the dreaded term "meta-data" rearing its head ;)

I'm not sure it's meaningless. Usually it means data which is secondary to the main data. In this scenario, the MIR is secondary to the compiled code.

I suggested "meta-data" to be in line with what upstream call their serialisations. If you have another suggestion I don't mind trying it, as long as we describe in the crate docstring the relation to upstream "meta-data".

@ltratt
Copy link
Member

ltratt commented Feb 12, 2019

Ah, if they call this stuff meta-data, fair enough.

@vext01
Copy link
Member Author

vext01 commented Feb 13, 2019

I think this is ready for re-review.

The API is much simpler now.

src/lib.rs Outdated
@@ -40,7 +41,7 @@ pub use decode::Decoder;
pub use encode::Encoder;

/// The version number of the data structures.
const FORMAT_VERSION: usize = 0;
const SER_VERSION: usize = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs to be a fixed size integer, otherwise we might read in a file from another architecture and do the wrong thing. Which suggests we probably need to stick the architecture in here somewhere too...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are right on both counts. How about a u32 for the version, and the platform string?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The platform string is variable size, which might be a bit annoying?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the serialiser cares, but let's try it.

Failing that we can use a manually specified enum.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also (ho ho ho!) does the ser_version change alongside the VM version? In other words, we might keep the same serialisation format, but some other detail changes and then boom! So we probably need some sort of YK version number in there too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just understood you comment about it being built in to the executable.

The problem is, I think, that there are two compile-times: the compiler, and the VM. We could get into a scenario where the compiler was built with one version of ykpack, and the VM was built using another. Right?

So, is it the semantic version of ykpack we actually care for here? Should that be what is encoded?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least at first, we're probably going to compile to a big binary blob, so I don't think we need to worry about this (in other words, that blob will carry with it the run-time compiler which, by definition, must be in sync with everything else in the blob). I think we can just drop the version checks entirely, at least as long as we compile to a single blob.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm. I'm not sure I follow. I thought the VM would depend upon ykrt (and thus ykpack) via regular Cargo dependencies.

So the compiler will have (somewhere):

[dependencies]
ykpack = "x.y"

And then the VM would (perhaps transitively) have a similar dependency line in a Cargo manifest.

I've not considered the "blob" approach, nor know how hard it is to achieve.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly. cargo does the hard work for us, and we use the same tricks grmtools does to include small binary blobs into our big binary blob. If I'm wrong, we can put this stuff back later, but let's just drop the version checks for now, and make our life simpler.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'll have to trust you on this one.

It seems to me that the VM author has to be careful to use the same version of ykpack as ykrustc used or boom.

src/lib.rs Outdated
for md in &inputs {
enc.serialise(md.clone()).unwrap();
}
// User forgot: enc.done()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit cryptic!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this be better?

// The user forgot to finalise the serialisation with a call to `enc.done()`.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to say something like "This test fails because the user forgot to add enc.done", otherwise the comment floats adrift in a universe of gnomisms.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I would have thought this was evident from #[should_panic(expected = "not marked done")].

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know your default style of comments, so I got it, but I doubt many other people would get it TBH.


# We are using a git version to work around a bug:
# https://github.com/3Hren/msgpack-rust/issues/183
[dependencies.rmp-serde]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want you can do something like rmp-serde = { git = "...", rev = "..." }. I have no preferences about which is better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I started with that, but it makes a very long time. I prefer this way.

src/decode.rs Outdated
}

// Iterate over meta-data.
pub fn iter(self) -> MetaDataIterator<'a> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it intentional that iter consumes self?

Copy link
Member Author

@vext01 vext01 Feb 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but if you think it's important to be able to allow the user to iterate multiple times using the same decoder instance, i can try &mut self. It's not immediately clear if it's possible though... I think you'd need >1 mutable ref to allow that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be better off as-is then. It might be worth documenting this, something like "Note that iterating over this structure consumes it"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me try to confirm my suspicion regarding &muts and get back to you.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried a few different approaches. I don't think it's possible to allow multiple iterations over the same decoder.

In light of that, I do wonder if instead of having a decoder struct from which we can get an iterator, we could have a decoder which is an iterator. What would you prefer?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. If it can only be a one-off iterator, then making the thole thing a one-off iterator does seem like a good idea.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try it in a separate commit, and we can revert if we hate it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.

src/encode.rs Outdated

/// Serialises a meta-data item.
pub fn serialise(&mut self, md: MetaData) -> Result<(), encode::Error> {
Some(md).serialize(&mut self.ser)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If serialize return () on success, you can probably simplify this method to just Some(md).serialize(...).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah.

/// Finalises the serialisation and writes a sentinal.
pub fn done(mut self) -> Result<(), encode::Error> {
None::<Option<MetaData>>.serialize(&mut self.ser)?;
self.done = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.done is never read from AFAICS, so I'm not sure what it's for?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be used in the destructor.

src/lib.rs Outdated
/// version -- The serialisation format version number.
/// This should be bumped every time the meta-data structure changes.
/// md_0: \
/// ...⋮ - Meta-data items.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need the "⋮" character here as well as "...".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hah! I tried to paste a into vim but thought it didn't work! I guess liberation mono doesn't have that glyph.

src/lib.rs Outdated
/// sentinal -- End of meta-data marker.
/// -----------
///
/// Where each md_i is an instance of `Some(MetaData)` and the sentinel is a `None`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double space at the beginning of the line should be single space?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Sentinel" and "sentinal" -- which is it? ;)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. I think it's "sentinel".

}

/// The top-level meta-data type.
#[derive(Serialize, Deserialize, PartialEq, Eq, Debug, Clone)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need all of these to be derived?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Serialize and Deserialize, certainly.
  • Eq and PartialEq simplify tests, and may be useful later on too.
  • Debug is useful.
  • Clone is also used in tests, since the serialiser takes ownership of what's being serialised, and we need to compare the original input to what comes back from the round trip.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's delete the others that we don't need then: it just leads to bloat (and also makes me wonder what it mean to compare these things for equality!).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are suggesting we should remove Eq and PartialEq. This is going to make testing quite awkward, as I'll have to do a big old restructuring with multiple comparisons.

E.g. instead of comparing two metadata's using an assert_eq!(expect_md, got_md) I'll have to have a match statement for each variant, then in the case of a MIR, destructure each basic block, and the statements within, manually comparing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry, I see.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've got me thinking though. What does a derived Eq for a reference do? Compare addresses? We need to be careful, as that might not be what we want.

As it happens, we have no references in our md at the moment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All comparisons involve references: fn eq(&self, other: &Self) in essence. So that isn't an issue. [If you mean "pointers", though, then yes, I think the pointer is compared and not the thing its pointed to.]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see.

I'd like to think we are sane enough to not serialise a pointer :P

@ltratt
Copy link
Member

ltratt commented Feb 13, 2019

I'm wondering if instead of just "metadata" we want to say something like "crate metadata" or similar?


impl<'a> Drop for Encoder<'a> {
fn drop(&mut self) {
if !self.done {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ Use of self.done is here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.

@vext01
Copy link
Member Author

vext01 commented Feb 13, 2019

I'm wondering if instead of just "metadata" we want to say something like "crate metadata" or similar?

Why limit ourselves to crates? Who knows what we might need later :P

@ltratt
Copy link
Member

ltratt commented Feb 13, 2019

I'm still not super happy with "metadata" as a name, just because it's void of content. It must be metadata of something, so I'm wondering what that something is.

src/decode.rs Outdated
}

/// A (fallible) meta-data iterator.
/// Reusing the iterator once is has yielded `None` leads to undefined behaviour.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What did you think of this? We could actually force the iterator to continually return None instead of shooting off the end of the cursor, but it would require one more flag in the struct.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're not exposing this to the end-user, then I think this is OK.

[Just spotted a typo though. Should be:"Reusing the iterator once it has yielded None...".]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

@vext01
Copy link
Member Author

vext01 commented Feb 14, 2019

BTW, are we still good to rename this to ykmd?

@vext01
Copy link
Member Author

vext01 commented Feb 14, 2019

For the record, we've agreed on ykpack.

@vext01
Copy link
Member Author

vext01 commented Feb 14, 2019

This should address all comments.

src/decode.rs Outdated
use std::io::Read;

/// The pack decoder.
/// Offfers a simple iterator interface to serialised packs.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doh. Offfers.

@vext01
Copy link
Member Author

vext01 commented Feb 14, 2019

Try this.

README.md Outdated
Pack encoder / decoder for Yorick.

This library allows ykrustc to serialise various compile-time information for
later reading at runtime.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency, maybe "run-time"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed

@ltratt
Copy link
Member

ltratt commented Feb 14, 2019

Please squash.

@ltratt
Copy link
Member

ltratt commented Feb 14, 2019

Oops, I'm ahead of myself! We need to do the "run-time" thing first. Sorry!

@vext01
Copy link
Member Author

vext01 commented Feb 14, 2019

Squash now?

@ltratt
Copy link
Member

ltratt commented Feb 14, 2019

Yes please.

This change adds a streamer "pack" encoder and decoder. The idea is that
compile-time info can be serialised by ykrustc at compile-time and
deserialised at run-time.

For example the MIR will need to be consulted at runtime for code
generation. We will have ykrustc serialise the MIR into packs which are
stored in the resulting ELF file ready to be used later at run-time.
@vext01
Copy link
Member Author

vext01 commented Feb 14, 2019

squashed, and commit message fleshed out.

@vext01
Copy link
Member Author

vext01 commented Feb 14, 2019

Oops.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants