
Delayed message delivery implementation #4062

Merged · 24 commits · May 29, 2019

Conversation

merlimat
Contributor

@merlimat merlimat commented Apr 17, 2019

Motivation

Fixes #2375

Allow the option to mark messages for delayed delivery.

Notes:

  • If delayed delivery is disabled, messages are always delivered immediately and there's no tracking overhead.
  • Messages are only delayed on shared subscriptions. Other subscription types will deliver them immediately.
  • The tracking of delayed messages is lazily initialized; if a message has no delay, it adds no overhead.

Implementation

  • The tracking is ephemeral and implemented in the Pulsar broker. The main reason is to avoid a client re-fetching messages multiple times when there are multiple consumer reconnections.
  • The broker keeps a priority queue as a buffer in direct memory.
  • A single Netty HashedWheelTimer drives the checking of topics that have messages ready to be scheduled (see the sketch below).
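
For context, a minimal sketch (illustrative class and bodies, not the PR's actual code) of how a single shared HashedWheelTimer can drive readiness checks:

import io.netty.util.HashedWheelTimer;
import io.netty.util.Timeout;
import java.util.concurrent.TimeUnit;

public class DeliveryTimerSketch {
    // One shared timer per broker; the tick duration trades precision for overhead.
    private static final HashedWheelTimer TIMER = new HashedWheelTimer(1, TimeUnit.SECONDS);

    public static void main(String[] args) {
        long delayMillis = 5_000;
        // Arm a timeout for when the earliest delayed message becomes deliverable.
        TIMER.newTimeout((Timeout t) ->
                System.out.println("Earliest delayed message is ready; trigger a dispatcher read"),
                delayMillis, TimeUnit.MILLISECONDS);
    }
}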

Possible improvements

The goal of this PR is to have a simple working solution that can efficiently apply delays to tens of millions of messages at any given time.

There are several improvements that could be considered, based on real-world usage feedback.
For example:

  • Compress the priority queue, either by collapsing id ranges or by just passing the buffer through gzip.
  • Allow batching of messages with very close target time.

@merlimat merlimat added the type/feature label Apr 17, 2019
@merlimat merlimat added this to the 2.4.0 milestone Apr 17, 2019
@merlimat merlimat self-assigned this Apr 17, 2019
@sijie
Member

sijie commented Apr 17, 2019

@merlimat : how is this different from #3155?

Also, I think there was a long thread discussion about the delayed message implementation. There was pushback on implementing delayed messages on brokers, and a lot of effort was postponed because you and a bunch of other people had concerns about the solutions in PIP-26 and #3155. But the approach here seems to take the broker-side approach again. I am wondering what the thought behind this is. How is this different from the other proposals?

Besides the implementation: since there was already a long discussion about delayed messages, and I have spent time pushing that discussion and other people's efforts forward, wouldn't it be better to first get an agreement (or at least update the discussion thread) before starting a new PR?

@sijie
Member

sijie commented Apr 17, 2019

nvm, I saw the email thread now.

Member

@sijie sijie left a comment

I have looked into the pull request. This is actually a simpler implementation of PIP-26.

The DelayedDeliveryTracker in this pull request is what is called delayed message index in PIP-26.

In this pull request, the tracker is a priority queue, all in memory, and rebuilt by replaying the messages after a broker crash.

In PIP-26, the tracker is a hash-wheel, time-partitioned index. It can be all in memory and rebuilt by replaying the messages after a broker crash; or the time-partitioned index can be stored in ledgers to avoid replaying the messages to rebuild it.

In theory, I don't see any technical differences between PIP-26 and #4062. In fact, I think #4062 is a simpler implementation of PIP-26, with the delayed message index implemented as a priority queue. If so, how does this PR address the concerns raised when PIP-26 was started (i.e. making changes to the dispatcher)? FYI, PIP-26 was postponed because there were concerns about adding changes to the dispatcher.

@merlimat
Contributor Author

I have looked into the pull request. This is actually a simpler implementation of PIP-26.

Is that a bad thing? Is there any limitation in this approach?

If so, how does this PR address the concerns raised when PIP-26 was started (i.e. making changes to the dispatcher)?

The changes to the dispatcher itself have been isolated to a very few specific points. It should be easy to review and verify that, with the feature turned off, there's zero impact on current behavior.

The biggest difference with this PR is that the tracking happens entirely off-heap, in direct memory. There are no objects created and retained for extended amounts of time, which is the pattern that kills GC performance.

A topic will have a ByteBuf in direct memory where the priority queue is stored. On the data path, no other allocations are required.
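
To make that concrete, here is a purely illustrative sketch (hypothetical class, not the PR's actual code) of a min-heap of (deliverAtTime, ledgerId, entryId) triples packed into one direct ByteBuffer, so that no per-message objects live on the JVM heap:

import java.nio.ByteBuffer;

public class DirectMemoryTripleHeap {
    private static final int TRIPLE_BYTES = 3 * Long.BYTES;
    private ByteBuffer buffer = ByteBuffer.allocateDirect(1024 * TRIPLE_BYTES);
    private int size = 0;

    public void add(long deliverAt, long ledgerId, long entryId) {
        if ((size + 1) * TRIPLE_BYTES > buffer.capacity()) {
            grow();
        }
        put(size, deliverAt, ledgerId, entryId);
        siftUp(size++);
    }

    // Timestamp of the earliest scheduled message (the heap root); assumes size > 0.
    public long peekDeliverAt() {
        return buffer.getLong(0);
    }

    private void siftUp(int i) {
        while (i > 0) {
            int parent = (i - 1) / 2;
            if (buffer.getLong(parent * TRIPLE_BYTES) <= buffer.getLong(i * TRIPLE_BYTES)) {
                break;
            }
            swap(i, parent);
            i = parent;
        }
    }

    private void put(int slot, long a, long b, long c) {
        int base = slot * TRIPLE_BYTES;
        buffer.putLong(base, a).putLong(base + 8, b).putLong(base + 16, c);
    }

    private void swap(int i, int j) {
        for (int k = 0; k < 3; k++) {
            long tmp = buffer.getLong(i * TRIPLE_BYTES + k * 8);
            buffer.putLong(i * TRIPLE_BYTES + k * 8, buffer.getLong(j * TRIPLE_BYTES + k * 8));
            buffer.putLong(j * TRIPLE_BYTES + k * 8, tmp);
        }
    }

    private void grow() {
        ByteBuffer bigger = ByteBuffer.allocateDirect(buffer.capacity() * 2);
        buffer.rewind();
        bigger.put(buffer);
        buffer = bigger;
    }
}

Removal (sift-down) is omitted for brevity; the point is that add/peek/grow all operate on a single off-heap buffer.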

@joefk
Contributor

joefk commented Apr 17, 2019

The changes to the dispatcher itself have been isolated to a very few specific points. It should be easy to review and verify that, with the feature turned off, there's zero impact on current behavior.

Nice to see that change. Is there a config option to turn it ON per namespace?

A topic will have a ByteBuf in direct memory where the priority queue is stored. On the data path, no other allocations are required.

How does this work with load balancing? Does the load balancer know which topics are going to need this allocation?

Is there a limit on the delay, the number of delayed messages pending, etc.? What are the limits?

How is inversion handled? What happens if a message to be delayed by, e.g., a few days is at the head? Will that halt the advance of the delete cursor? What if a few messages of that kind are randomly spread around? Is this taking for granted that on a broker restart, everything spanning the period of the largest delay will potentially be read through again? Is there a checkpoint for a polite shutdown/unload?

I prefer configurable limits and deterministic performance, so that system behavior can be predicted during rolling upgrades and failures. Pulsar handles rolling upgrades and failures way better than other systems, and it would be preferable to maintain that.

@lovelle
Contributor

lovelle commented Apr 17, 2019

Is that a bad thing? Is there any limitation in this approach?

To me this is one of the best things about this pull request, and absolutely not a bad thing.

I still haven't taken a deep look, but my only concern would be: how will it behave when very different ranges of delay arrive? Users sometimes make abusive use of this type of feature.

The improvement I really like is that both features (this and #4062) use a priority queue, but this pull request keeps the buffer in direct memory 👍

Compress the priority queue, either by collapsing id ranges or by just passing the buffer through gzip.

Since each adjacent message could have an arbitrary delay, I can't see how collapsing by id range could be done.

@sijie
Member

sijie commented Apr 17, 2019

Is that a bad thing? Is there any limitation in this approach?

It is not a bad thing. I am actually super happy to see this happen, because I have been a supporter of broker-side approaches from the beginning (if you have followed the email discussion).

The changes to the dispatcher itself have been isolated to a very few specific points.

If you take a look at my comment, PIP-26 also isolates the changes in a structure called DelayedMessageIndex (which is the structure you call DelayedDeliveryTracker here). So technically there are no fundamental differences between this PR and PIP-26 regarding the concerns around changes touching the dispatcher. I am just trying to figure out why, and to make sure the authors of PIP-26 also understand your thoughts behind this. IMO that is an important thing for building a healthy community.

The biggest difference with this PR is that the tracking happens entirely off-heap, in direct memory. There are no objects created and retained for extended amounts of time, which is the pattern that kills GC performance.

I don't think the biggest difference between this PR and PIP-26 is the direct-memory approach you mentioned for implementing DelayedDeliveryTracker. The delayed message index in PIP-26 can also be implemented using direct memory without allocation.

IMO the difference between this PR and PIP-26 is: the DelayedDeliveryTracker in this PR is a pure in-memory structure which cannot hold a "delayed index" beyond memory, while the DelayedMessageIndex in PIP-26 is a time-partitioned structure which can spool the index back to ledgers. The DelayedDeliveryTracker is limited in the delay ranges it can support. The DelayedMessageIndex is a more generic approach, supporting arbitrary delays or scheduled messages.

DelayedDeliveryTracker and DelayedMessageIndex are just two different implementations of the same thing. If the current implementation of DelayedDeliveryTracker is acceptable, why is the proposal of a time-partitioned DelayedMessageIndex not acceptable? People could choose which implementation to use via a broker configuration setting.

Lastly, PIP-26 already presents changes regarding the API, the protocol, namespace policies and many other things around this area. Shall we just pick up the proposed changes there instead of starting a new effort?

@merlimat
Contributor Author

DelayedDeliveryTracker and DelayedMessageIndex are just two different implementations of the same thing. If the current implementation of DelayedDeliveryTracker is acceptable, why is the proposal of a time-partitioned DelayedMessageIndex not acceptable? People could choose which implementation to use via a broker configuration setting.

That's a very good point. It would be good to have DelayedDeliveryTracker as an interface so that we can have different implementations.

That will help:

  1. Accommodate different scenarios
  2. Easily experiment with different implementation approaches

I'll update this PR to make the interface configurable.
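
For reference, a rough sketch of what such a pluggable contract could look like; the method set is approximated from the code snippets quoted later in this thread, not a verbatim copy of the final API:

import java.util.Set;
import org.apache.bookkeeper.mledger.impl.PositionImpl;

public interface DelayedDeliveryTracker extends AutoCloseable {
    // Record a message to be held back until deliverAt (epoch millis).
    boolean addMessage(long ledgerId, long entryId, long deliverAt);

    // True if at least one tracked message has reached its delivery time.
    boolean hasMessageAvailable();

    // Drain up to maxMessages positions whose delivery time has passed.
    Set<PositionImpl> getScheduledMessages(int maxMessages);

    long getNumberOfDelayedMessages();

    @Override
    void close();
}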

Lastly, PIP-26 already presents changes regarding the API, the protocol, namespace policies and many other things around this area. Shall we just pick up the proposed changes there instead of starting a new effort?

In PIP-26 the proposed API methods were:

// message to be delivered after the configured delay interval
producer.newMessage().delayAt(3L, TimeUnit.MINUTES).value("Hello Pulsar!").send();

// message to be delivered at the configured time
producer.newMessage().scheduleAt(new Date(2018, 10, 31, 23, 00, 00))

In this PR I'm proposing:

producer.newMessage().deliverAfter(3, TimeUnit.MINUTES).value("hello").send();

producer.newMessage().deliverAt(timestamp).value("hello").send();

My reasons are:

  • delayAt() seems confusing, because in timing APIs "at" is used for absolute positioning.
  • I'd rather keep the same prefix, deliverAt() / deliverAfter(), to make it visually clear these are 2 alternative ways to configure the same feature.
  • Date vs timestamp: I have no strong opinion, since the two are basically interchangeable (e.g. new Date(timestamp) and date.getTime()). I was using a timestamp since that is what we're already using for publishTime and eventTime.

For the protobuf metadata change, PIP-26 had:

// the message will be delayed at delivery by `delayed_ms` milliseconds.
optional int64 delayed_ms = 18;

Though that won't support specifying an absolute scheduling time.

Instead, I propose to start with:

// Mark the message to be delivered at or after the specified timestamp
optional uint64 deliver_at_time = 18;

Initially, with relative delays, the client will just apply the delay based on its current time. Once we have a broker-assigned timestamp (stored within the message metadata), we could add a second field.
Alternatively, we could start with 2 fields (absolute and relative) and have the broker do the math based on the producer-assigned publish time.
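
A minimal sketch of that client-side reduction (illustrative helper, not the actual client code):

import java.util.concurrent.TimeUnit;

public class DeliverAtConversion {
    // deliverAfter(delay, unit) collapses to an absolute deliver_at_time
    // computed against the client's clock.
    static long toDeliverAtTime(long delay, TimeUnit unit) {
        return System.currentTimeMillis() + unit.toMillis(delay);
    }

    public static void main(String[] args) {
        // deliverAfter(3, TimeUnit.MINUTES) becomes roughly:
        System.out.println("deliver_at_time = " + toDeliverAtTime(3, TimeUnit.MINUTES));
    }
}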

@sijie
Member

sijie commented Apr 18, 2019

@merlimat: great! These comments around API and protocol changes would have been great to have when PIP-26 was sent out. A DelayedDeliveryTracker interface would definitely help as well.

@sijie
Member

sijie commented Apr 18, 2019

Also, can you provide a namespace policy to enable and disable this feature per namespace, as PIP-26 proposed? It doesn't have to be in this PR, but filing an issue to track it would be good.
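
For what it's worth, the per-namespace policy that later landed in Pulsar is along these lines (pulsar-admin flag names from memory; treat the exact syntax as approximate for your version):

# Enable delayed delivery for a namespace, with a 1s tick time
bin/pulsar-admin namespaces set-delayed-delivery my-tenant/my-ns --enable --time 1s

# Inspect the current policy
bin/pulsar-admin namespaces get-delayed-delivery my-tenant/my-ns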

@merlimat
Contributor Author

merlimat commented May 7, 2019

run java8 tests

@merlimat
Contributor Author

run java8 tests

1 similar comment
@merlimat
Contributor Author

run java8 tests

@Override
public Set<PositionImpl> getScheduledMessages(int maxMessages) {
    int n = maxMessages;
    Set<PositionImpl> positions = new TreeSet<>();
Contributor

If we already know the life cycle of PositionImpl, can we use PositionImplRecyclable instead?

Contributor Author

I wanted to keep it simple for now. We can iteratively improve and optimize.

if (log.isDebugEnabled()) {
    log.debug("[{}] Get scheduled messages - found {}", dispatcher.getName(), positions.size());
}
updateTimer();
Contributor

Why are we updating the timer here?

Contributor Author

We took items out of the queue, so we need to adjust the timer for the next scheduled message.
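
In other words, the re-arm amounts to something like the following (hypothetical names, not the PR's exact code):

import io.netty.util.HashedWheelTimer;
import io.netty.util.Timeout;
import io.netty.util.Timer;
import io.netty.util.TimerTask;
import java.time.Clock;
import java.util.PriorityQueue;
import java.util.concurrent.TimeUnit;

public class RearmSketch implements TimerTask {
    private final Timer timer = new HashedWheelTimer();
    private final Clock clock = Clock.systemUTC();
    private final PriorityQueue<Long> deliverAtQueue = new PriorityQueue<>();
    private Timeout currentTimeout;

    synchronized void updateTimer() {
        if (currentTimeout != null) {
            currentTimeout.cancel(); // the old deadline may be stale after a drain
            currentTimeout = null;
        }
        Long nextDeliverAt = deliverAtQueue.peek();
        if (nextDeliverAt != null) {
            long delay = Math.max(0, nextDeliverAt - clock.millis());
            currentTimeout = timer.newTimeout(this, delay, TimeUnit.MILLISECONDS);
        }
    }

    @Override
    public void run(Timeout timeout) {
        // A real tracker would trigger a dispatcher read here, then re-arm.
        updateTimer();
    }
}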

@merlimat
Contributor Author

@rdhabalia @ivankelly Please take another look.

@merlimat
Contributor Author

run java8 tests
run integration tests

@sijie
Member

sijie commented May 27, 2019

@rdhabalia @ivankelly Please take another look, so that we can wrap up the features for 2.4.0.

@Slf4j
public class InMemoryDelayedDeliveryTracker implements DelayedDeliveryTracker, TimerTask {

    private final TripleLongPriorityQueue priorityQueue = new TripleLongPriorityQueue();
Contributor

This queue is unbounded. It could potentially allow someone to DoS the broker by sending a bunch of messages with a delivery date far in the future. We should degrade gracefully from this, though I'm not sure what the nicest behaviour would be for the user. Maybe if the queue is full, force delivery from the head of the queue, or something.

Contributor Author

Yes, the idea was to start with a simple implementation and iterate from that, based on observed issues/weaknesses.

Also, there are 2 ways to address that:

  1. The feature can be disabled on the server side (see the config sketch below)
  2. The tracker implementation is pluggable, so one could either extend the current one or provide an alternative implementation
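
As a sketch of point 1, the broker.conf knobs added around this change look like the following (names believed correct for this PR's version; verify against your broker):

# Disable the feature entirely; delayed messages are then delivered immediately
delayedDeliveryEnabled=false

# How often the shared timer ticks when scheduling delayed deliveries
delayedDeliveryTickTimeMillis=1000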

Contributor

OK, I'll +1 this one, but this DoS should be dealt with ASAP.

Contributor Author

The cap on the memory size will need to be applied per broker, though, rather than per topic.
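
One illustrative way such a per-broker cap could be shared across all topic trackers (hypothetical class, not part of this PR):

import java.util.concurrent.atomic.AtomicLong;

public class DelayedIndexMemoryBudget {
    // One budget shared by every tracker in the broker process.
    private final AtomicLong remainingBytes;

    public DelayedIndexMemoryBudget(long capBytes) {
        this.remainingBytes = new AtomicLong(capBytes);
    }

    // Try to reserve space, e.g. 24 bytes per (timestamp, ledgerId, entryId) triple.
    public boolean tryReserve(long bytes) {
        long prev = remainingBytes.getAndAdd(-bytes);
        if (prev < bytes) {
            remainingBytes.addAndGet(bytes); // roll back; caller must degrade gracefully
            return false;
        }
        return true;
    }

    public void release(long bytes) {
        remainingBytes.addAndGet(bytes);
    }
}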

@merlimat
Contributor Author

run java8 tests

1 similar comment
@ivankelly
Contributor

run java8 tests

@Geal
Contributor

Geal commented Jul 20, 2021

@merlimat why was the uint64 timestamp replaced with an int64 in 52832fe? I can't imagine a use case for negative deliver_at_time timestamps.

Successfully merging this pull request may close these issues.

Support for delayed message delivery
7 participants