Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GC Extensions #28368

Merged
merged 15 commits into from
Oct 5, 2018
Merged

GC Extensions #28368

merged 15 commits into from
Oct 5, 2018

Conversation

rbehrends
Copy link
Contributor

This pull request implements the GC extension proposal described in this Julia Enhancement proposal.

See the Julep above for a full description of the API and semantics.

src/gc.c Outdated
@@ -2286,7 +2328,7 @@ mark: {
obj16 = (gc_mark_obj16_t*)sp.data;
goto obj16_loaded;
}
else {
else if (layout->fielddesc_type == 2) {
// This is very uncommon
// Do not do store to load forwarding to save some code size
assert(layout->fielddesc_type == 2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't notice this before, but I guess this assert could now be removed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rbehrends: please address this, too?

src/gc-pages.c Outdated
}
support_conservative_scans = 1;
JL_UNLOCK_NOGC(&conservative_scan_lock);
}
Copy link
Member

@fingolfin fingolfin Jul 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that all this function does in the end (at least right now) is to enable the support_conservative_scans and jl_gc_cleanup_pages flags. So perhaps the Julia team would prefer a function name that better reflects that, e.g. jl_gc_enable_clean_pages. I think the main reason we went with the current name is that hypothetically, it could do additional work in the future related to conservative scanning. Anyway, in the end, as log as get this functionality in some way, we will of course adapt to whichever names you might prefer.

@ararslan ararslan added the GC Garbage collector label Jul 31, 2018
@ararslan ararslan requested review from yuyichao and Keno July 31, 2018 16:37
Copy link
Contributor

@yuyichao yuyichao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK in general except for the second last commit. I don't think there's a way around hooking into implementation details (at least in the sense for every other pieces of code) but the coupling here is a little too much and too unstable.

I also don't think you need any of these since you can get all these info from the page metadata.

During marking, when you get a pointer pointing to the pool pages,

  1. Find the cell boundary
  2. If the cell is marked, then you're done
  3. If nfree is 0, there's either no free cell in the page, in which case the cell you've just found is a valid object, or the page is being allocated. If the page is being allocated, you can tell if the cell is valid or not by comparing its address to the current allocation point on the thread the page belongs to.
  4. Now the only two possibilities are free cell or a young object on a page that has not been touched since last sweep. In this case, you can use the age to tell the difference. The only issue is the reset age done in the finalizer marking so the only change you'll need to make is to disable the age reset there.

src/gc.c Outdated
static jl_gc_callback_list_t *gc_cblist_pre_gc;
static jl_gc_callback_list_t *gc_cblist_post_gc;

#define gc_invoke_callbacks(kind, args) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The macro symbol concatenation is still here.....

src/gc.c Outdated
desc->sweepfunc(v);
}
else {
objs->items[p++] = v;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indent.

src/gc-pages.c Outdated
@@ -21,6 +21,19 @@ extern "C" {

static int block_pg_cnt = DEFAULT_BLOCK_PG_ALLOC;
static size_t current_pg_count = 0;
static int support_conservative_scans = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this commit is the kind of change that will be very hard to maintain... It relies on a lot of the internal implementation details that can otherwise be changed without any issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I think I can now get rid of this part entirely by implementing your other suggestion for finding references to pool allocations, as that would (I think) eliminate the need to zero pages entirely. I'll put the code in and get back to you ASAP with an update (including also the other requested changes).

@rbehrends
Copy link
Contributor Author

The changes turned out to be a bit more intricate than I had expected. They are all contained in a single new commit (for now) instead of integrating them with the existing commits to make review of the additional changes easier.

The "no free objects on page" case must actually be split up; if a page is being allocated from, then nfree contains the max number of objects that fit inside the page, and nfree is only zeroed once the page is used up. Both cases are covered by fl_begin_offset == (uint16_t)-1 and we can then use nfree == 0 to figure out if the whole page is in use or if it's still being allocated from.

Even with all the options suggested to identify pointers to the interior of a pool object, we still need to distinguish between a freshly allocated object and a freelist entry. We can identify freelist entries fairly easily, however, because the header points to another header rather than to another object (i.e. (p - pagestart - GC_PAGE_OFFSET) % osize == 0).

A remaining problem was that the linked list in newpages could contain multiple pages that were being allocated from. Checking all of them would have required iterating over potentially dozens of pages. In order to avoid that overhead, I needed to change reset_page() so that the page was inserted after the first one, not before it. This way, only the first page in the newpages list was ever being used for allocation. An additional benefit is that this should reduce fragmentation, as we can't end up with multiple partially used pages.

The API call jl_gc_enable_conservative_gc_support() could be removed entirely as a result of these changes.

I also removed jl_gc_is_internal_obj_alloc(). It had confusing semantics and only existed for performance reasons, which aren't a concern here anymore.

@yuyichao
Copy link
Contributor

yuyichao commented Aug 8, 2018

We can identify freelist entries fairly easily, however, because the header points to another header rather than to another object (i.e. (p - pagestart - GC_PAGE_OFFSET) % osize == 0).

I don't think that'll work reliably with small cell size and that's why I said,

In this case, you can use the age to tell the difference. The only issue is the reset age done in the finalizer marking so the only change you'll need to make is to disable the age reset there.

This way, only the first page in the newpages list was ever being used for allocation.

This should be reflected in the comment.

@rbehrends
Copy link
Contributor Author

rbehrends commented Aug 8, 2018

I don't think that'll work reliably with small cell size

The only concern here seems to be osize == sizeof(jl_taggedvalue_t), i.e. a pool of empty objects. But this should be safe also, as the type of the object cannot have size zero. For all other values of osize, a pointer to a type can always be distinguished from a pointer to a freelist by the offset.

In this case, you can use the age to tell the difference.

This is where I'm not sure how that would work. In jl_gc_pool_alloc(), popping an object off the freelist does not change the age. I.e. we cannot use the corresponding age bit to distinguish between the cell just before the allocation (when it is on the freelist) and after the allocation (when it becomes an object).

We can use the age bit to tell if a cell survived sweep_page(), but we cannot tell if it was allocated again after that or remains on the freelist.

A way to solve this (and which seems to work, based on a quick test) is to compare the pointer to pool->freelist before testing the age (if its address is less than the freelist address, it was either allocated before or picked from the freelist). However, this relies on items in the freelist being in ascending order per page, which seems to be a somewhat fragile invariant (though jl_gc_pool_alloc() also relies on it).

This should be reflected in the comment.

It already is, though I think I'll clarify the wording.

@vtjnash
Copy link
Member

vtjnash commented Aug 8, 2018

I don't know if this information is helpful, but some of this sounds like what gc_verify_tags_page also needed to do.

@rbehrends
Copy link
Contributor Author

I've pushed the relevant changes to use age instead as the arbiter of whether we're dealing with an allocated or free cell.

A couple of potential concerns remain, though.

1. I'm not sure if and when a page in pool->newpages is ever swept. In testing, sweep_page() seems to never be called with a page in pool->newpages as an argument (and I don't think it would work, anyway), but I'm uncertain why it does not seem to happen.

2. I had to change the value of jl_buff_tag; while it seems to be somewhat arbitrary, I'd like to make sure that its exact value doesn't have a deeper purpose. The issue here is that objects with such a type should never be passed to marking functions (as that would crash the mark loop). I changed the value slightly so that jl_buff_tag % GC_PAGE_SZ == 0 and jl_buff_tag can never accidentally be the same value as an actual type. (This may also be relevant for code in gc-debug.c).

@fingolfin
Copy link
Member

fingolfin commented Aug 10, 2018

@yuyichao is this more like what you had in mind? Any remaining concerns?

I see that CircleCI tests failed, with an error message saying "ERROR: LoadError: did not find test files for the following tests: OldPkg/pkg". Not sure what to do about that, any hints? (Perhaps @rbehrends already knows

@rbehrends
Copy link
Contributor Author

rbehrends commented Aug 10, 2018

@fingolfin I suspect that I need to rebase this onto a more recent version, it may have to do with Pkg changes? The tests themselves see to work (unsurprisingly, as I ran them myself before pushing).

@KristofferC
Copy link
Member

Yes, circle CI is not very smart and using an old config file. A rebase on master will fix it.

@rbehrends
Copy link
Contributor Author

Having rebased onto the current master, everything seems to pass now.

src/gc.c Outdated
static jl_gc_callback_list_t *gc_cblist_notify_external_alloc;
static jl_gc_callback_list_t *gc_cblist_notify_external_free;

#define gc_invoke_callbacks(kind, args) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still not fixed. (Again, function like macro is fine, but not name concatination).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, just to clarify: you prefer that we replicate this code 6 times? The justification being that this way one can grep for e.g. gc_cblist_root_scanner to find where it is used?

A simple alternative to achieve that would be to add a comment to each of the 6 invocations of gc_invoke_callbacks, e.g.

    // iterate over gc_cblist_root_scanner
    gc_invoke_callbacks(notify_external_alloc, (v, allocsz));

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, just to clarify: you prefer that we replicate this code 6 times?

That'll be better, yes. But you don't have to, The only bad thing about this is the symbol concatenation for the variable and type names, just spell them out explicitly will be fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I had seen your earlier comment, but thought that was about the public API only, which I changed accordingly (whereas this use is internal). This is easy enough to change, too, but to explain why I did it: I wanted to keep the code typesafe so that list name and type always match. This is not very critical, though, as this is only used in a handful of places and only in one file, so if avoiding symbol concatenation is the bigger priority that's also not a problem.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I know type safety is the issue. There are enough of these in the GC code that I hope we could use c++ here but fortunately most of these could be solved without too much more typing and without relying on a lot of macro tricks so far.............

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, I notice that there are quite a few macros in Julia which use preprocessor concatenation, and arguably for symbols were this is far more inconvenient than for this fairly localized and special one. E.g. I just spent some time tracking down where jl_box_uint64 is defined. Other examples include JL_TYPECHK and code for intrinsics... So it's not quite clear to me why those are OK, but ours is not?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Congratulations. That exact macro is the very reason I hate this kind of macro so much. Unless I actually want to change something about these macros I'm not a fan of just changing those but I also don't want to introduce new ones.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I understand now: you really want all gone, and the existing ones are "legacy"

src/gc.c Outdated
return NULL;
}
// Not a freelist entry, therefore a valid object.
valid_object:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Less indent for the label, see other places in the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, that's easy enough.

// a lower address will be an allocated object, and for cells
// with the same or a higher address, the corresponding age
// bit will reflect whether it's on the freelist.
// Age bits are set in sweep_page() and are 0 for freelist
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said, for this to work reliably you still need to disable the age reset in the finalizer marking. There should be only a single place where mark_reset_age is set to 1 so just disable that. Also, in general, it's still good to have an API for enabling support for conservative scanning. That function should also return whether the conservative scanning support was enabled and you should document that if the setup doesn't guarantee that the support was enabled early enough, the caller should do a full GC afterwards.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I need to unpack that, as I think I simply misunderstood you before (thinking you were referring to sweeping rather than finalizers). I take it now that mark_reset_age = 1 should be disabled for conservative scanning only?

I had removed the function for enabling conservative scanning because it had become a no-op; but it does make sense to have one regardless in case future changes necessitate having one. And if I understand you correctly, the mark_reset_age change is for the conservative case alone and should be left in place for the normal case, so we'd have to put code in there again anyway.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I take it now that mark_reset_age = 1 should be disabled for conservative scanning only?

Yes. Sorry if it wasn't clear.

but it does make sense to have one regardless in case future changes necessitate having one.

That's exactly what I'm saying.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but it does make sense to have one regardless in case future changes necessitate having one.

And in particular, the behavior of "if the function returns 0 then the caller should do a full GC if it wasn't called right after initialization" should be a behavior that is reasonably easy to maintain.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, does that mean mark_reset_age is "just" an optimization, which is thus safe to disable for us? I wonder how much impact disabling it has, esp. compared to our previous version of this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. Will do.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it's just an optimization. It could delay the freeing of old finalized objects but should not have any other effects.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yuyichao Thanks for the explanation, and of course all the feedback

@rbehrends Perhaps we could extend the comment at the declaration of mark_reset_age to state that as part of this PR, so that future people looking at this will have an easier time figuring it out :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposed semantics for jl_gc_enable_conservative_gc_support() actually run into problems in the multit-threaded case, as other threads may do work (such as allocations) in between the call and the garbage collection, leading to a race condition. I would therefore propose to do the full collection as part of the jl_gc_enable_conservative_gc_support() semantics.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine too. There shouldn't be a issue though since as long as you make sure you don't start doing things that requires conservative scanning before running the GC, the race is benign. Not requiring a external GC does make the API simpler so I'm fine with that. I brought it up only in case you want to reduce the startup delay a little bit.

@@ -289,7 +289,7 @@ JL_DLLEXPORT jl_value_t *jl_gc_alloc(jl_ptls_t ptls, size_t sz, void *ty);
# define jl_gc_alloc(ptls, sz, ty) jl_gc_alloc_(ptls, sz, ty)
#endif

#define jl_buff_tag ((uintptr_t)0x4eade800)
#define jl_buff_tag ((uintptr_t)0x4eade000)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this is not needed anymore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something that I introduced for the reasons mentioned in an earlier comment. Basically, it is possible for a spurious reference to an object with the type jl_buff_tag to occur with conservative scanning; such an object must not be passed to jl_gc_mark_queue_obj() and similar functions that try to treat the header word as a jl_datatype_t * that they can derefence. So we want to make sure that jl_buff_tag cannot accidentally be mistaken for a value that can legitimately occur in the object header (which is extremely unlikely, but at least theoretically possible). By making it a multiple of GC_PAGE_SZ, we avoid that problem.

Copy link
Contributor

@yuyichao yuyichao Aug 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but you should (edit: NOT) be relying on that to distinguish allocated and unallocated cell anymore?
Also, just realized that I forgot to mention, even with the conservative scan, any code that is julia-aware has to use the GC frame explicitly. More specifically, any code that creates and access julia array in C code has to do that since otherwise the array object itself might not be anywhere on the stack while the buffer is still being used which will cause 1) (from the current implementation) the buffer won't be scanned and 2) the buffer itself doesn't even contain enough information about what's in there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but you should (edit: NOT) be relying on that to distinguish allocated and unallocated cell anymore?

Hmm. I was under the impression that objects with jl_buff_tag can still happen to occur as a live object. While they get special treatment in the mark loop, I believed that they can still occur in other contexts (e.g. when allocated off pool->newpages) in ways that make them indistinguishable from normal allocations.

Also, just realized that I forgot to mention, even with the conservative scan, any code that is julia-aware has to use the GC frame explicitly. More specifically, any code that creates and access julia array in C code has to do that since otherwise the array object itself might not be anywhere on the stack while the buffer is still being used which will cause 1) (from the current implementation) the buffer won't be scanned and 2) the buffer itself doesn't even contain enough information about what's in there.

I think this is something that simply needs to be documented. Namely, that arrays need to be kept visible to the Julia GC even if they are not visible on the stack anymore if you are accessing its elements. E.g. using JL_GC_PUSH1() and friends, storing them in a Julia object or a root while accessing elements, or some such. But yes, I think this is something that requires explicit documentation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I was under the impression that objects with jl_buff_tag can still happen to occur as a live object.

Ah, so what you want to distinguish jl_buff_tag from a real tag reliably. I though you needed to use this to distinguish between a unallocated and allocated cell. This use is fine then, but it should be documented near the definition of the jl_buff_tag, and possibly a "see also" here as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and possibly a "see also" here as well.

Oh, and by "here" I mean where you check that in the code. I thought that's where this comment thread is....

// Insert free page after first page.
// This prevents unnecessary fragmentation from multiple pages
// being allocated from at the same time. Instead, objects will
// only ever be allocated from the first object in the list.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and this comment is still missing a mention of the relation to conservative scanning that I requested last time. That won't be clear for future reader out of the context of this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted. Will do that. I hadn't realized that you wanted me to make the connection to conservative scanning and thought you simply wanted the logic explained. But putting it also in the context of conservative scanning makes perfect sense.

Copy link
Member

@fingolfin fingolfin Aug 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rbehrends hmm, this comment has not yet been addressed, has it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has, the comment now explicitly notes thatjl_gc_internal_obj_base_ptr() (which is part of the conservative marking machinery) relies on it.

@rbehrends
Copy link
Contributor Author

I believe I have addressed everything now (barring a fresh rebase of the commits). Tests pass locally, too.

src/gc.c Outdated
{
size_t p = 0;
for (size_t i = 0; i < objs->len; i++) {
jl_value_t *v = (jl_value_t *)(objs->items[i++]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is i incremented twice on purpose ? If so, IMHO a comment should explain this. If not..,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that's a bug (and would result in a resource leak). I've addressed this along with the other comments and amended the existing commits.

src/gc.c Outdated

static void gc_sweep_foreign_objs(void)
{
for (int i = 0;i < jl_n_threads;i++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Insert spaces after semicolon?

@rbehrends rbehrends force-pushed the rb/gc-extensions branch 2 times, most recently from 66a98c4 to ed02b83 Compare August 22, 2018 18:02

// Types for mark and sweep functions.
// We make the cache and sp parameters opaque so that the internals
// do not get exposed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, cache and sp? Outdated comment?


// Kinds of callbacks. For each kind of callback, there must be
// a corresponding type with the same name, but in all lowercase,
// and with a "_t" suffix.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is outdated and now not very helpful. Just remove it?

// marking. Conservative marking will treat any machine word that points
// to an object (potentially including its interior) as a valid
// reference, regardless of whether it actually is one or just an
// integer that happens to match an actual address.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, this comment sounds as if conservative marking was enabled by this switch; the fine distinction that only "support" for it is enabled, and that the caller needs to perform additional work (and which) is easy to miss. So perhaps nake that clear?

Here, being aware of the Julep would be useful... Would it be appropriate to insert a reference to it in a comment?

if m === nothing
return false
else
num = m[1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite, an unwanted tab slipped in. It should be fixed now.

@fingolfin
Copy link
Member

@yuyichao @Keno From our perspective, this PR should be ready now; at least we are not aware of any remaining unresolved issues. The failed Travis test seems to be a fluke (it failed running apt-get install), restarting it should hopefully make the tests fail.

Of course we can clean up the commits in this PR if you like us to, but I figured that only makes sense if you are happy with it now; if you want further changes, please let us know.

src/gc.c Outdated
@@ -2125,6 +2292,9 @@ mark: {
int stkbuf = (ta->stkbuf != (void*)(intptr_t)-1 && ta->stkbuf != NULL);
int16_t tid = ta->tid;
jl_ptls_t ptls2 = jl_all_tls_states[tid];
ptls->last_gc_mark_sp = &sp;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I mentioned this in an early comment. This leaks the address of sp, should be replaced by copy instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what you mean by "replaced by copy" here?

And I don't think you mentioned this before (nor could I find such a comment). To the contrary, our first version of the Julep and the code in our branch (from before the PR was submitted) did not have last_gc_mark_sp but rather used a more complicated approach; it was then suggested to us by you to move this into the TLS (for which we were quite grateful, as it simplified some things considerably).

Anyway, perhaps you could elaborate on what you imagine should replace this (while ensuring that jl_gc_mark_queue_obj and jl_gc_mark_queue_objarray keep working)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so it's better to just explicitly synchronize them at those point instead of leaking &sp from the beginning of the function.

That's actually what i mean by explicitly synchronize instead of leaking address. So instead of

ptls->last_gc_mark_sp = &sp;
<external call>

Do

ptls->last_gc_mark_sp = sp;
<external call>
sp = ptls->last_gc_mark_sp;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, sure, we can do that.

However (with the caveat that I have not yet had a chance to try the change), I think that means either including gc.h from julia_threads.h to get gc_mark_sp_t, or else using a hard coded buffer size there matching the size of gc_mark_sp_t (and then having a static_assert somewhere to ensure that value stays correct). Which do you prefer? Resp. if you have an alternate idea, please let us know?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def of gc_mark_sp_t could be moved to julia_thread.h and renamed to jl_gc_mark_sp_t. That's the approach used by jl_gc_mark_cache_t.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About copying: this could introduce measurable overhead, as the struct consists of eight words. More importantly, as the contents of the struct change during marking, this would be problematic during root scanning, as there would be no way to retrieve any changes to the struct. And the internals of the struct would be exposed, anyway.

An alternative idea would be to explicitly document when the field is valid.

Renaming the type wouldn't be a problem, of course. Part of the reason why I made the field a void * was for purposes of information hiding, i.e. to make it opaque and to not expose its internals. But that could also be done with an opaque struct pointer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rbehrends well, can you measure it? If you implement this in a commit atop the PR, we can easily drop it later on if it turns out to be too bad.

I don't understand what problem with "contents of the struct change during marking" is; and what changes you think need to be "retrieved". Could you elaborate?

As to you "An alternative idea", I also don't quite see to what it is meant to be an alternative?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About copying: this could introduce measurable overhead, as the struct consists of eight words.

The issue is that the control flow in this function is very unfriendly to the compiler, to put it mildly, so the compiler won't be able to do any escape analysis following the control flow. The sp is the most accessed variable in the function so I do not want to compromise it's performance. (Note that all the fast path in this function is fully inlined and sp doesn't leak to any other functions). Also note that since there aren't that many callee saved registers on x86, all of the function calls are already surrounded by many register spills which are basically copying things around.

as there would be no way to retrieve any changes to the struct

That's why they are copied back?

Copy link
Contributor Author

@rbehrends rbehrends Sep 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I see the problem.

That's why they are copied back?

Hmm, that would be two struct copies per mark operation. I really have to benchmark that to see what happens. It may be necessary to add to the API to permit bulk operations. I'll go and do a provisional implementation of this approach and see how that turns out, performance-wise.

Edit: Nevermind, I think I know how to cut down on the copying quite a bit. I still have to do the implementation & benchmarking to see how much of an impact it actually has.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial benchmarking seems to indicate that the overhead remains fairly small, even if you have lots of objects with custom mark functions. I still need to figure out why I now have a rare and difficult to reproduce crash as the result of these changes, though.


int old = jl_astaggedvalue(new_obj)->bits.gc & 2;
ptls->last_gc_mark_sp = &sp;
uintptr_t young = desc->markfunc(ptls, new_obj);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

src/gc.c Outdated
@@ -2468,6 +2648,9 @@ static int _jl_gc_collect(jl_ptls_t ptls, int full)

// 3. walk roots
mark_roots(gc_cache, &sp);
ptls->last_gc_mark_sp = &sp;
gc_invoke_callbacks(jl_gc_cb_root_scanner_t,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here.

src/gc.c Outdated
if (jl_is_initialized()) {
JL_LOCK_NOGC(&conservative_gc_lock);
int result = support_conservative_marking;
support_conservative_marking = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jl_atomic_exchange and get rid of the lock.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, that's easy enough.

src/gc.c Outdated

JL_DLLEXPORT int jl_gc_conservative_gc_support_enabled()
{
return support_conservative_marking;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the other one has lock/atomic, this one should be as well mainly for consistency. jl_atomic_load_acquire.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do. It's not strictly necessary, as this is a benign data race, but I agree that it would be less confusing.

@fingolfin
Copy link
Member

I've updated this PR now. @rbehrends will hopefully work on adding a task scanner test (but IMHO that could also be added in a later PR, i.e. it should IMHO not delay merging of this PR; but of course you may disagree ;-)

@rbehrends
Copy link
Contributor Author

I've added a test for the task scanner functionality now.

@fingolfin
Copy link
Member

@vtjnash in 082d7d2 you added this in gc-debug.c:

    void *stkbuf = ta->stkbuf;
    if (ta == thread_task && ptls->copy_stack)
        gc_scrub_range(ptls2->stackbase, ptls2->stacksize);
    else if (stkbuf)
        gc_scrub_range((char*)stkbuf, (char*)stkbuf + ta->bufsz);

I am confused by this -- isn't the second argument to gc_scrub_range a pointer? So why is ptls2->stacksize passed to it? Should it perhaps be ptls2->stackbase + ptls2->stacksize instead? On the other hand, in signals-unix.c, I see this:

static int is_addr_on_stack(jl_ptls_t ptls, void *addr)
{
    jl_task_t *t = ptls->current_task;
    if (t->copy_stack)
        return ((char*)addr > (char*)ptls->stackbase - ptls->stacksize &&
                (char*)addr < (char*)ptls->stackbase);
    else
        return ((char*)addr > (char*)t->stkbuf &&
                (char*)addr < (char*)t->stkbuf + t->bufsz);
}

So perhaps in the first snippet, what is really meant is this: (?)

        gc_scrub_range(ptls2->stackbase - ptls2->stacksize, ptls2->stackbase);

This would also fit with code I see in the various jl_init_basefiber implementations:

    ptls->stackbase = stkbuf + ssize;
    ptls->stacksize = ssize;

However, there, usually this is contained in #ifdef COPY_STACKS ... #endif... so stackbase may not be initialized in general?!

Is there any documentation on all this? The code is almost completely undocumented, and I have a hard time reverse engineering it.

@rbehrends rbehrends force-pushed the rb/gc-extensions branch 3 times, most recently from 95b5ddd to 8f17ca2 Compare October 5, 2018 02:49
@rbehrends
Copy link
Contributor Author

The underlying problem for the crash was that in jl_init_root_task(), the start of the stack was adjusted downwards by 3,000,000 bytes and the stack size was adjusted by the same amount. This led to the stack base for the root task pointing to the middle of nowhere and resulting in a segmentation fault if you tried to access that memory.

Right now, I believe that this needs an abstraction to determine a task's stack area regardless. I've added a function jl_task_stack_buffer() that will (1) give you the actual stack area (and is therefore also future-proof if jl_task_t changes) and (2) will tell you if the task is the current task of a Julia thread (and thus may require no/additional attention, such as skipping guard pages at the bottom of the stack buffer).

@rbehrends rbehrends force-pushed the rb/gc-extensions branch 2 times, most recently from 530838d to 7942395 Compare October 5, 2018 12:09
@vchuravy
Copy link
Member

vchuravy commented Oct 5, 2018

am confused by this -- isn't the second argument to gc_scrub_range a pointer? So why is ptls2->stacksize passed to it? Should it perhaps be ptls2->stackbase + ptls2->stacksize instead?

See f6879e1 where we fixed that.

@StefanKarpinski
Copy link
Member

See f6879e1 where we fixed that.

So maybe just rebasing this PR on top of #29489 would fix the issue?

@rbehrends
Copy link
Contributor Author

@StefanKarpinski I don't think this is necessary. As I wrote above, the core issue appeared to be that the stack buffer of the root task of the main thread was manipulated in a way that made the start of the stack buffer point to the middle of nowhere. I addressed that in 83bc53b.

@vtjnash
Copy link
Member

vtjnash commented Oct 5, 2018

Note that because these stacks "grow down", start and end are reversed from usual. So here, we're intentionally manipulating the end of the stack buffer to point at the middle of nowhere (well, actually into the guard pages at the bottom). But you should never scan that far down the stack, just start at your best guess for the start and end at the current frame pointer for that thread. We can't actually determine where the stack starts and ends on most operating systems. Although we could bound it more precisely than we currently do (because we know the precise size of it and the address of one point that's somewhere in the middle).

@fingolfin
Copy link
Member

Yeah, so from our end, everything is working fine right now, so if this PR was merged as is, we'd be happy. But of course we'll also try to address any further questions or change requests you may have!

Copy link
Member

@Keno Keno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have only done a cursory review of the latest, but I did look over the original proposal fairly closely and that looked reasonable to be, so if @yuyichao is ok with the implementation as is, I'm happy to have this merged. If anything else comes up we can do that in follow-on PRs.

@Keno Keno merged commit d461d56 into JuliaLang:master Oct 5, 2018
@Keno
Copy link
Member

Keno commented Oct 5, 2018

Separately, you should be commended on an excellent PR-submitting process. Getting features like this through the review process is non trivial and a lot of work, but I think you've been most excellent about it.

@fingolfin
Copy link
Member

Thank you for the kind words, and thanks to all involved for the support, esp. @yuyichao for many helpful remarks and careful reviewing!

@kim366
Copy link
Contributor

kim366 commented Oct 28, 2018

I have read through the proposal and it seems like exactly what I need. I have a question, though: if I use jl_gc_mark_queue_obj to root a value, how do I release it when I am done? Also, is this a replacement for JL_GC_PUSH/JL_GC_POP?

@fingolfin
Copy link
Member

fingolfin commented Oct 28, 2018

You must only call jl_gc_mark_queue_obj during a garbage collection (in particular, during each you want the object to survive). As such, the reference will be "released" when you stop calling jl_gc_mark_queue_obj on it during each GC.

@kim366
Copy link
Contributor

kim366 commented Oct 28, 2018

I see. Thanks for clarifying. Can't wait until the release :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
GC Garbage collector
Projects
None yet
Development

Successfully merging this pull request may close these issues.