In the paper, the edge-enqueuing tracing algorithm still enqueues ObjectReference values, not the addresses of the slots (Address). Because the ObjectReference is in the queue, the address of the object is known, so we can prefetch the object header and the object fields (i.e. the header, the oopmap and the fields are accessed "contemporaneously").
However, in our current mmtk-core, each ProcessEdgesWork work packet contains a list of slot addresses (Address). The ObjectReference of the object is still one load away (unsafe { slot.load::<ObjectReference>() }). So in the following snippet we can see several places where cache misses can occur:
If we prefetch the elements of ProcessEdgesBase::edges, we can eliminate CACHE MISS 1. But CACHE MISS 2 and CACHE MISS 3 still remain, because we prefetched the slot, not the object itself or the side metadata. And we can't prefetch the object unless we know its address in the first place. In other words, the information held in ProcessEdgesBase::edges alone is insufficient for prefetching the objects.
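To make the dependency chain concrete, here is a toy model of the loads involved in tracing one edge. Everything in it (ToyHeap, the flat word-indexed memory, the mark_bits side table) is hypothetical and is not mmtk-core's API, and it does not try to reproduce the numbering of the cache misses above. It only shows that the address of each later load comes from the result of the first one, so prefetching the slot alone cannot cover them.

```rust
/// Toy heap (hypothetical, not mmtk-core): `memory` holds slots and object
/// headers as words, and `mark_bits` is a side table indexed by object address.
struct ToyHeap {
    memory: Vec<usize>,
    mark_bits: Vec<bool>,
}

impl ToyHeap {
    /// Trace one edge. Returns the referent and whether it was newly marked.
    fn trace_edge(&mut self, slot: usize) -> (usize, bool) {
        // Load 1: read the slot. Prefetching `slot` in advance helps only here.
        let object = self.memory[slot];

        // Load 2: read the object header. Its address is the result of load 1,
        // so it could not have been prefetched from the slot address alone.
        let _header = self.memory[object];

        // Load 3: read (and update) the side-metadata mark bit, whose
        // location is also derived from the object address.
        let newly_marked = !self.mark_bits[object];
        self.mark_bits[object] = true;

        (object, newly_marked)
    }
}
```

In this model, a queue of slot indices gives a prefetcher nothing to work with for loads 2 and 3 until load 1 has completed.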
Solution
One solution is to load from the edges immediately when we scan an object. When scanning an object, the slots in the object are local to that object, so the load operations should be fast.
    pub struct ProcessEdgesBase<VM: VMBinding> {
        pub edges: Vec<Address>,
        pub edges_preloaded: Vec<ObjectReference>, // Add this field
        pub nodes: Vec<ObjectReference>,
        // ...
    }

    impl<'a, E: ProcessEdgesWork> EdgeVisitor for ObjectsClosure<'a, E> {
        #[inline(always)]
        fn visit_edge(&mut self, slot: Address) {
            if self.buffer.is_empty() {
                self.buffer.reserve(E::CAPACITY);
            }
            self.buffer.push(slot);
            // Pre-load here. This should hit the cache.
            self.buffer_preloaded.push(unsafe { slot.load::<ObjectReference>() });
            if self.buffer.len() >= E::CAPACITY {
                let mut new_edges = Vec::new();
                mem::swap(&mut new_edges, &mut self.buffer);
                let mut new_edges_preloaded = Vec::new(); // Added
                mem::swap(&mut new_edges_preloaded, &mut self.buffer_preloaded); // Added
                self.worker.add_work(
                    WorkBucketStage::Closure,
                    // Added new_edges_preloaded
                    E::new(new_edges, new_edges_preloaded, false, self.worker.mmtk),
                );
            }
        }
    }
If ProcessEdgesBase also holds the last-known values of the edges, we can prefetch both the slot and the last-known object reference, so that both the load from the slot and the load from the object can be fast. And in the case where objects never move, we don't need to load from the slot any more.
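    #[inline]
    fn process_edges(&mut self) {
        // Simplified: the nightly intrinsic takes a raw pointer and a locality hint.
        // Warm up with the first PREFETCH_DISTANCE entries (fewer if the packet is short).
        for i in 0..PREFETCH_DISTANCE.min(self.edges.len()) {
            std::intrinsics::prefetch_read_data(self.edges[i]); // Prefetch slot
            std::intrinsics::prefetch_read_data(self.edges_preloaded[i]); // Prefetch object
        }
        for i in 0..self.edges.len() {
            self.process_edge(self.edges[i]);
            // Stay PREFETCH_DISTANCE entries ahead without indexing past the end.
            if i + PREFETCH_DISTANCE < self.edges.len() {
                std::intrinsics::prefetch_read_data(self.edges[i + PREFETCH_DISTANCE]);
                std::intrinsics::prefetch_read_data(self.edges_preloaded[i + PREFETCH_DISTANCE]);
            }
        }
    }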
Of course, the benefit should be measured.
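For experimenting outside nightly Rust, the same look-ahead pattern can be sketched on stable Rust. The version below is only an illustration under a few assumptions: addresses are plain usize values, the prefetch helper uses core::arch::x86_64::_mm_prefetch on x86_64 and does nothing on other targets, and the names prefetch and process_edges_with_prefetch as well as the value of PREFETCH_DISTANCE are made up for this sketch rather than taken from mmtk-core.

```rust
/// How many entries ahead to prefetch. An arbitrary value for this sketch;
/// a good value would have to be found by measurement.
const PREFETCH_DISTANCE: usize = 4;

/// Hint the CPU to bring the cache line containing `addr` into cache.
#[cfg(target_arch = "x86_64")]
#[inline(always)]
fn prefetch(addr: usize) {
    use core::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
    // A prefetch is only a hint: it does not dereference `addr` and does not
    // fault, so a stale or null address is harmless.
    unsafe { _mm_prefetch::<_MM_HINT_T0>(addr as *const i8) };
}

/// Fallback for other targets: no prefetch is issued.
#[cfg(not(target_arch = "x86_64"))]
#[inline(always)]
fn prefetch(_addr: usize) {}

/// Process every edge while prefetching both the slot and the last-known
/// referent `PREFETCH_DISTANCE` entries ahead. Returns the number of edges
/// processed.
fn process_edges_with_prefetch(
    edges: &[usize],
    edges_preloaded: &[usize],
    mut process_edge: impl FnMut(usize, usize),
) -> usize {
    assert_eq!(edges.len(), edges_preloaded.len());
    let n = edges.len();

    // Warm up: prefetch the first few entries (fewer if the packet is short).
    for i in 0..PREFETCH_DISTANCE.min(n) {
        prefetch(edges[i]);
        prefetch(edges_preloaded[i]);
    }

    for i in 0..n {
        // Keep the prefetches ahead of the processing, staying in bounds.
        if i + PREFETCH_DISTANCE < n {
            prefetch(edges[i + PREFETCH_DISTANCE]);
            prefetch(edges_preloaded[i + PREFETCH_DISTANCE]);
        }
        process_edge(edges[i], edges_preloaded[i]);
    }
    n
}
```

Passing the per-edge work in as a closure keeps the sketch independent of any particular plan, which should make it easier to compare runs with and without the prefetch calls.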
Alternative process_edge for MarkSweep
Because mark-sweep never moves objects, we can eliminate the load in process_edge if we pre-load the edge when scanning objects. This can be a special case of #574.
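As a rough sketch of what that could look like, the toy model below processes a packet from the pre-loaded referents alone and never touches the slots. ObjRef, ToyMarkSweep and process_preloaded_edges are hypothetical names for illustration, not mmtk-core types. It also assumes that the slot values cannot change between scanning and processing (for example because mutators are stopped), which is my assumption and not something established above.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for an object reference; 0 plays the role of null.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct ObjRef(usize);

/// Toy non-moving space: marking records the object and never relocates it.
struct ToyMarkSweep {
    marked: HashSet<ObjRef>,
}

impl ToyMarkSweep {
    fn new() -> Self {
        ToyMarkSweep { marked: HashSet::new() }
    }

    /// Mark `object`. Returns true if it was not marked before.
    fn trace_object(&mut self, object: ObjRef) -> bool {
        self.marked.insert(object)
    }

    /// Process a packet using only the pre-loaded referents. Because objects
    /// never move, the slots need neither be re-read nor updated. Returns the
    /// objects that were newly marked and therefore still need scanning.
    fn process_preloaded_edges(&mut self, edges_preloaded: &[ObjRef]) -> Vec<ObjRef> {
        let mut newly_marked = Vec::new();
        for &object in edges_preloaded {
            if object.0 == 0 {
                continue; // Skip null references.
            }
            if self.trace_object(object) {
                newly_marked.push(object);
            }
        }
        newly_marked
    }
}
```

With this shape, a mark-sweep work packet would only need to carry the pre-loaded references, not the slot addresses.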
NOTE: I am not proposing to fix it immediately. This is a research topic, and needs to be validated by experiments.
TL;DR: After reading https://users.cecs.anu.edu.au/~steveb/pubs/papers/pf-ismm-2007.pdf again, I feel that the current ProcessEdgesWork does not have the same opportunity for prefetching as described in that paper, unless we do some minor adjustment.