Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crossbeam in elfmalloc #12

Open
ghost opened this issue Sep 8, 2017 · 19 comments
Open

Crossbeam in elfmalloc #12

ghost opened this issue Sep 8, 2017 · 19 comments

Comments

@ghost
Copy link

ghost commented Sep 8, 2017

Hi, @joshlf and @ezrosent! Congrats on the first release of elfmalloc! :) This is fantastic - I'm always excited to see new work being created in the domain of concurrency/parallelism using Rust.

I see you're using Crossbeam 0.2. Not sure if you're familiar with the current state of affairs, but we're rewriting the epoch GC and redesigning the API, and hoping to release it soon. I'd suggest looking into this RFCs repository if you're curious to see where we're heading at.

I saw a comment on Reddit saying that, to remove the dependency on std from elfmalloc, crossbeam's epoch-based memory management would need to support no-std. Well, Crossbeam's new epoch-based GC currently allocates in two places:

  1. Thread entries are kept in a linked list (consists of heap-allocated entries).
  2. Garbage bags are kept in an unbounded queue (typical Michael Scott queue, where nodes are allocated on the heap and are connected into a linked list).

We'll have to think how could allocations in those situations be avoided.

Moreover, in this RFC @jeehoonkang proposes introducing realms, which is a way to create multiple garbage collectors and manually register threads to them. Currently, there is only one implicit global epoch GC and threads get lazily and automatically registered behind the scenes. This is similar to Rayon allowing you to create your own thread pool instead of using the global one. Anyways, perhaps this is something that might interest you as well, since you're writing such a low-level thing (an allocator) that needs very precise control?

I wonder if you have any ideas on how we can help you here.
More specifically, my questions would be:

  1. What do you want from Crossbeam that it currently doesn't offer?
  2. Did you run into any obstacles with Crossbeam that need to be fixed?
  3. What might a no_std-compatible API that you'd use look like?
@jeehoonkang
Copy link
Contributor

Actually, I was writing on the same topic :) Thank you for writing up!

@joshlf
Copy link

joshlf commented Sep 8, 2017

@stjepang Thanks so much for this comment! It's great to know that this stuff is being actively developed.

@ezrosent will probably have is own comments in addition to mine, but here are mine:

The way I see it, crossbeam is part of a larger class of systems that use a common pattern: there's a global singleton, and then there are thread-local caches for performance. These caches are not strictly necessary for correctness, but only improve performance. These thread-local caches are also singletons. Allocators also fall into this class.

There are a couple of issues that this pattern introduces:

  • If one such system relies on another, then the thread-local storage (TLS) of the dependency may get destructed before the TLS of the dependent system, in which case the latter's destructor may invoke methods on the former, causing issues (we ran into this in practice when crossbeam's TLS was destructed before our own).
  • There's no obvious way for the system to provide the ability to configure what allocator is used for the system without affecting all other users.
  • The system is inherently tied to an operating system context in which TLS exists, so it can't easily be no-std.

To solve this issue, I've been envisioning the following pattern:

  • The core implementation provides a handle-based API in which a user creates a handle on the system (on the epoch-based collector, the allocator, etc). These handles are not Sync, but they are Send, and then implement Clone.
  • There is some core global structure which is allocated when the first handle is created. Each handle holds an Arc on this global structure.
  • This system can be parametric on an allocator because it is self-contained: upon constructing the first handle, an allocator can be passed, and then it can either be stored in the global state or a reference to it can be copied into each handle that is cloned.

Other patterns can easily be implemented in terms of this one:

  • For no-std environment, consumers can simply create and store these handles themselves. This allows a very nice composition in which a system that implements this pattern can use another such system as a dependency by having each of their handles have as a field one of the dependency system's handles.
  • For std environments, a default global singleton can be used. There is a global handle (probably using the lazy_static crate). Each thread's TLS holds a handle which is initialized by cloning the global handle. Using this approach requires that the global structure be able to be modified directly in case crate functions are called after a thread's TLS has been destructed. This is best accomplished by having xxx_global methods on the handles that take an immutable receiver. To give a concrete example, for crossbeam's epoch GC, this might mean being able to append garbage to a global (rather than thread-local) free list if a function is called after a thread's TLS has been destructed.
  • These two approaches also compose nicely: If, for example, an allocator implementation uses epoch GC, and uses this handle-based approach in which each handle embeds an epoch GC handle as a struct field, then that allocator can also provide this global singleton via TLS and thus both provide the global singleton pattern, but also avoid the overhead of doing yet another indirection through TLS when using epoch GC.

The way I'm envisioning it, the pattern should be so cookie-cutter that you might in theory be able to make a macro that takes one of these handle-based implementations and generates the corresponding global singleton implementation.

To be concrete, for epoch GC, this might look something like:

// global state for a GC
struct GCGlobal<A: Alloc> { ... }

pub struct GC<A: Alloc>{ ... }

impl GC<Heap> {
    fn new() -> GC<Heap> { ... }
}

impl<A: Alloc> GC<A> {
    fn with_alloc(a: A) -> GC<A> { ... }

    fn new_owned<T>(&mut self, t: T) -> Owned<T> { ... }

    fn new_owned_global<T>(&self, t: T) -> Owned<T> { ... }

    // etc
}

impl<A: Alloc> Clone for GC<A> {
    fn clone(&self) -> GC<A> { ... }
}

and then, for the global singleton:

lazy_static!{ static ref GLOBAL_HANDLE: GC<Heap> = GC::new(); }
thread_local!{ static TLS_HANDLE: GC<Heap> = GLOBAL_HANDLE.clone(); }

impl<T> Owned<T> {
    fn new(t: T) -> Owned<T> {
        match TLS_HANDLE.try_with(|handle| handle.new_owned()) {
            Ok(o) => o,
            Err(_) => GLOBAL_HANDLE.new_owned_global()
        }
    }
}

@joshlf
Copy link

joshlf commented Sep 8, 2017

Also, a quick follow-up to that: With respect to your question about avoiding allocations for no-std, you don't actually need to avoid allocations, you just need to ensure that the user can provide an allocator. Even in elfmalloc, we have a simple "bootstrapping" allocator (bsalloc) that we use for internal allocations, and we'd be able to use it with crossbeam if needed.

@joshlf
Copy link

joshlf commented Sep 8, 2017

Also also: I read the crossbeam RFC, and with the exception of dedicated GC threads, I think that everything I've described is orthogonal to the changes that you guys are already considering. So hopefully I'm not coming along and trying to tell you to do your jobs differently :)

@ezrosent
Copy link

ezrosent commented Sep 8, 2017

Thanks for reaching out, y'all! I haven't checked in on new crossbeam features in a while and I think the work you're doing looks awesome.

I think joshlf@ gave a pretty accurate summary of our current problems. I do think that having a handle-based API for the epoch GC would help us have a cleaner story around thread destruction. It looks like a lot of the work is already done (though I don't see any other implementors of Realm other than the default global instance in this branch?).

Other than that, it would also be very helpful to have crossbeam structures parametric on an allocator. It seems like being able to create a local instance of the GC system is a precondition for this (there's no good way to override a default type parameter when you're talking to a global singleton). My initial thought on this was that we would have to wait for standard collections like Vec to gain support for a custom allocator. However, if it is just Owned and two custom data-structures that we have to worry about, we could potentially migrate this without waiting for the bigger standard crates.

P.S. I'm very happy to see that you can defer functions now! A while ago I made a hash table that required this feature and I just hacked it into crossbeam to do it. Now I can factor it out into it's own crate and depend on a stable branch.

@joshlf
Copy link

joshlf commented Sep 8, 2017

I realized that I'd made a mistake in my original comment: the global static instance should itself be a handle. I've updated the original comment, and I also wrote up a concrete example (that compiles!) of a simple logging library that buffers log lines. Check it out.

@ghost
Copy link
Author

ghost commented Sep 20, 2017

Thanks so much for your writing detailed responses. I took some time to digest all this and think about it a little bit.

The code @joshlf has outlined seems pretty reasonable to me. I think we could design realms in a very similar manner. What do you think, @jeehoonkang?

We do want to support the Alloc API, although gating it under the nightly feature might be painful. I hope it gets stabilized soon. :)

@joshlf
Copy link

joshlf commented Sep 20, 2017

No problem!

We do want to support the Alloc API, although gating it under the nightly feature might be painful. I hope it gets stabilized soon. :)

No worries - the handle-based design alone would be a huge improvement for us; even without the Alloc API we'd get a lot of benefit from it.

That said, I will say that I don't see it getting stabilized soon - there are still a lot of details to work out and a lot of other work to be done (e.g., incorporating it into libcollections).

@jeehoonkang
Copy link
Contributor

jeehoonkang commented Sep 21, 2017

Thank you for the valuable comments! I felt we need use cases of Crossbeam we can target on for improving Crossbeam, and elfmalloc is an excellent one.

I implemented a part of the "handle" proposal in: https://github.com/jeehoonkang/crossbeam-epoch/tree/handle . It doesn't strictly follow the pattern described above, though. Here are my two cents from this experience.

It is possible that handle will degrade the performance by adding an extra indirection to the global data. In the current implementation of Crossbeam, the per-thread handle ("mutator" in the current master) does not have a reference to the global data. It is directly accessed from lazy_static! globals. On the other hand, in the handle proposal, a per-thread handle contains a reference to the global data. it will incur the cost of an indirection. That said, I agree that it may be also possible that the indirection actually improve the performance, as accessing a lazy_static! global is more involved then just loading the data. Also, the code looks much cleaner now. handle.rs no longer directly references the global data in global.rs.

I almost abandoned the "realm" proposal, but I think the need for supporting both std and no-std environment can be a strong motivation for it. Note that the realm proposal is almost implemented in the above branch: for early adopters, just defining a global::Global and making a handle out of it is sufficient. But I think there might be a safety bug in the current implementation. Furthermore, I don't think the current interface is the most easiest one to use for no-std environment. I would like to ask your opinion on the interface of Crossbeam for no-std scenario.

@joshlf
Copy link

joshlf commented Sep 21, 2017

That's fantastic! Good to know that proposal is actually reasonable enough to be implemented :)

It looks like your global pin implementation unconditionally calls with. Do you have an idea for how you'll handle the TLS key already having been destroyed? That's been the tricky part for us, and it's what motivated the xxx_global methods in my proposal. If I'm not mistaken, it looks like this is what your use of a reference in the handles (rather than an Arc, as my proposal uses) is meant to solve?

@jeehoonkang
Copy link
Contributor

jeehoonkang commented Sep 21, 2017

@joshlf yes, I noticed the use of try_with in your example instead of with, but I couldn't think of a good implementation of Crossbeam with try_with. We may create a new handle in pin() when TLS is already destructed, but we need to store it into a thread-local storage for later calls to pin() and is_pinned(). This is because pin() interacts with not only the global data, but also thread-local data that will affect later calls to Crossbeam's interface. For e.g.:

... // TLS destructed
pin(|scope| {
    let is_pinned = is_pinned(); // expected to be `true`, but how..?
});

A possible solution that changes the interface a little bit is, instead of providing the global pin() and is_pinned() functions, provide get_handle(): &Handle which returns the handle stored in TLS if not destructed, or creates new handle and returns it. With handle: &Handle users can call handle.pin() and handle.is_pinned(). Provided that the handle is not deallocated while the thread still has the handle, it will be safe. Also I "guess" it could be an acceptable interface..

@joshlf
Copy link

joshlf commented Sep 21, 2017

I'm not familiar with how pinning works, but I was under the impression that one of the main functions of epoch's TLS is to maintain thread-local garbage lists. I was thinking that if a TLS value doesn't exist, garbage could just be appended directly onto the global list. I'm not sure how that'd interact with pinning, though.

@jeehoonkang
Copy link
Contributor

jeehoonkang commented Sep 21, 2017

@joshlf Yes, I think that part (locally storing garbages and flushing them to the global queue) is already implementable with try_with, but Crossbeam's interface provides more functions than garbage buffering, and I currently have a good idea how to implement them properly. I'd like to hear from other Crossbeam developers 😄

@ghost
Copy link
Author

ghost commented Sep 21, 2017

@joshlf

I'm not familiar with how pinning works, but I was under the impression that one of the main functions of epoch's TLS is to maintain thread-local garbage lists.

Crossbeam's TLS is just the current thread handle. Pinning accesses the handle, marks it as pinned, executes a closure, and finally marks it as upinned.

You can think of a handle as (I'm simplifying a bit):

struct Handle(Arc<Inner>);
struct Inner { list: GarbageList, pinned: Option<Epoch> }

Do you have an idea for how you'll handle the TLS key already having been destroyed? That's been the tricky part for us, and it's what motivated the xxx_global methods in my proposal.

I thought in elfmalloc you would never call epoch::pin(|| ...), but instead create handles yourself and then call my_handle.pin(|| ...). Is this correct?

Note that this way Crossbeam wouldn't touch TLS at all, so you wouldn't have to worry about Crossbeam's TLS getting destroyed first.

@joshlf
Copy link

joshlf commented Sep 21, 2017

I thought in elfmalloc you would never call epoch::pin(|| ...), but instead create handles yourself and then call my_handle.pin(|| ...). Is this correct?

Hmmm I suppose this might work. To clarify, you're saying that in the case that our TLS data has been destroyed (and thus we don't have access to our thread-local epoch handle), we would just clone a new handle from the global instance temporarily in order to use it for whatever work we needed to do?

@ghost
Copy link
Author

ghost commented Sep 21, 2017

@joshlf Yeah, that's what we'd suggest doing. If you don't have the thread-local handle anymore, just create a one-off temporary handle so that you can move forward.

Alternatively, I believe we could also implement some kind of anonymous pinning, where you can pin yourself without a handle for cases like these. In fact, I did something like this in C++ long time ago. It might a bit tricky to implement, but it's definitely possible.

@joshlf
Copy link

joshlf commented Sep 21, 2017

I'd be curious to see the performance of the two approaches, but intuitively, I feel like the temporary handle approach is probably fine because that approach would only be used in a slow path that's executed at most a couple of times per thread (often never executed).

@jeehoonkang
Copy link
Contributor

@joshlf In order to make sure I understood your proposal, I'd like ask a question: is the following what you want to build in the end?

  • (no-std) bsalloc is implemented with no dependencies.
  • (no-std) elfalloc is implemented. It depends on bsalloc, and uses a custom instance of Crossbeam that uses bsalloc.
  • elfalloc is used in implementing std.
  • The global singleton instance of Crossbeam relies on std, which in turn relies on elfalloc for memory allocation.

I think the way elfmalloc is bootstrapped is really neat 😄 Also, the two instances of Crossbeam interact with each other in a very interesting way: a deallocation in the global instance may invoke the methods of the custom instance of Crossbeam via elfalloc. I hope it's okay performance wise, but I'm not really confident with that.

@joshlf
Copy link

joshlf commented Sep 22, 2017

Yep, that looks right!

(no-std) elfalloc is implemented. It depends on bsalloc, and uses a custom instance of Crossbeam that uses bsalloc.

To clarify, you mean something like GC<BSAlloc> (assuming GC is the name for an instance of the epoch collector)? If so, yes.

I think the way elfmalloc is bootstrapped is really neat 😄 Also, the two instances of Crossbeam interact with each other in a very interesting way: a deallocation in the global instance may invoke the methods of the custom instance of Crossbeam via elfalloc. I hope it's okay performance wise, but I'm not really confident with that.

Thanks :) I don't think the interaction should be that bad honestly. The snarky answer is "abstraction," but I think that's actually reasonable here - most code (including the std version of Crossbeam) treats the allocator as a black box. So long as that allocator is performant (in speed and memory usage), it shouldn't matter how it's implemented, even if in practice it's implemented with another copy of the logic implemented in Crossbeam. I don't see how it'd be any different than if we had implemented our own custom concurrent collection internally, or used some other performant allocation strategy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

3 participants