Static quorum ring distribution strategy #38
Spawned test cluster nodes require the `MyApp.WorkerSup` module to be started. Only registered processes may join groups (and be assigned metadata).
When the ring maps a name to an `:undefined` node, the tracker will record the pending request. On subsequent topology changes it will retry the pending registrations.
The `:timeout` value can be used to limit the duration of blocking name registration calls.
Provide alternate strategy options: availability, consistency.
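For example, a blocking registration bounded by the `:timeout` value might look like this (a sketch; the `{:worker, 1}` name, the `:register` function on `MyApp.WorkerSup`, and the five-second timeout are illustrative):

{:ok, pid} = Swarm.register_name({:worker, 1}, MyApp.WorkerSup, :register, [], 5_000)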
👏
@@ -698,11 +725,9 @@ defmodule Swarm.Tracker do
  debug "#{inspect name} has requested to be restarted"
  {:ok, new_state} = remove_registration(obj, %{state | clock: lclock})
  send(pid, {:swarm, :die})
  case handle_call({:track, name, m, f, a}, nil, %{state | clock: lclock}) do
@bitwalker I was unsure whether the `state` here should be the `new_state` returned from the `remove_registration/2` call two lines above. Does it make a difference if the clock is incremented only once for the two operations (remove, then add) rather than twice?
We should use the new state returned from the last call which manipulates the state (i.e. the clock should be incremented twice)
This ensures that all processes in the registry are correctly redistributed (or stopped), not only the process that has triggered the monitor.
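The change would look roughly like this (a sketch of the clause from the diff above, threading the state through both operations):

{:ok, new_state} = remove_registration(obj, %{state | clock: lclock})
send(pid, {:swarm, :die})
# use new_state here so the second operation builds on the already
# incremented clock, rather than reusing the stale state
case handle_call({:track, name, m, f, a}, nil, new_state) do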
This looks great! I have some comments and thoughts, but nothing major; on my initial reading this looks good to me. Very impressive that you pretty much nailed it in your first pass! Nice work :)
lib/swarm.ex
Outdated
"""
@spec register_name(term, atom(), atom(), [term]) :: {:ok, pid} | {:error, term}
defdelegate register_name(name, m, f, a), to: Swarm.Registry, as: :register

@spec register_name(term, atom(), atom(), [term], non_neg_integer() | :infinity) :: {:ok, pid} | {:error, term}
For optional parameters, you should provide a typespec for both arities; in this case, simply omitting the timeout parameter is enough. This is because otherwise someone can't do `iex> s Swarm.register_name/4` and get a typespec; they have to know to do `iex> s Swarm.register_name/5`.
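That is, keep one spec per arity (a sketch; the 5-arity delegate target mirrors the 4-arity one shown above and is an assumption here):

@spec register_name(term, atom(), atom(), [term]) :: {:ok, pid} | {:error, term}
defdelegate register_name(name, m, f, a), to: Swarm.Registry, as: :register

@spec register_name(term, atom(), atom(), [term], non_neg_integer() | :infinity) :: {:ok, pid} | {:error, term}
defdelegate register_name(name, m, f, a, timeout), to: Swarm.Registry, as: :register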
You must configure the quorum size using the `:static_quorum_size` setting:

config :swarm,
May be worth including the `:distribution_strategy` option here for clarity, and in order to re-emphasize that both options are required to use this strategy correctly.
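Something like the following (the quorum size of 3 is an illustrative value):

config :swarm,
  distribution_strategy: Swarm.Distribution.StaticQuorumRing,
  static_quorum_size: 3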
It defines the minimum number of nodes that must be connected in the cluster to allow process
registration and distribution.

If there are fewer nodes currently available than the quorum size, any calls to
It may be worth discussing here the use of the `kernel` config options `:sync_nodes_mandatory`, `:sync_nodes_optional`, and `:sync_nodes_timeout`. These ensure the required and optional members of the cluster are connected when the runtime boots and before any applications start; it's particularly useful for the use cases this strategy is designed around (i.e. the cluster members are known in advance). The `mandatory` and `optional` settings take a list of nodes, and the `timeout` setting takes an integer or `:infinity`. You can configure it like any other app, e.g.:

config :kernel,
  sync_nodes_mandatory: [:"[email protected]", :"[email protected]"],
  sync_nodes_timeout: 60_000
That's a useful feature I was unaware of.
It is for sure :). The only caveat to the above is that the configuration needs to be present when the VM boots, so when running under Mix you need to pass `--erl "-config path/to/sys.config"` and convert the configuration I mentioned to Erlang terms, e.g.:

[{kernel, [{sync_nodes_mandatory, ['[email protected]', ...]},
           {sync_nodes_timeout, 60000}]}].

Using the Mix config files works for releases though.
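For instance, to boot an IEx session with that file applied (the path is illustrative, mirroring the flag above):

iex --erl "-config path/to/sys.config" -S mix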
mix.exs
Outdated
@@ -14,7 +14,7 @@ defmodule Swarm.Mixfile do
   def project do
     [app: :swarm,
      version: "3.0.5",
-     elixir: "~> 1.3",
+     elixir: "~> 1.5",
Is there a reason we're relying on 1.5 here? I don't like breaking backwards compatibility unless we have a good reason to do so.
@@ -17,22 +17,39 @@ defmodule Swarm.Tracker do
  alias Swarm.Registry
  alias Swarm.Distribution.Strategy

  defmodule Tracking do
Can you add `@moduledoc false` to this as well?
lib/swarm/tracker/tracker.ex
Outdated
    a: list(),
    from: {pid, tag :: term},
  }
defstruct name: nil,
Since all of the default values are `nil`, let's make this `defstruct [:name, :m, :f, :a, :from]`.
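Taken together with the `@moduledoc false` note above, the struct module would end up roughly like this (a sketch reconstructed from the diff fragments shown here):

defmodule Tracking do
  @moduledoc false

  @type t :: %__MODULE__{
    name: term,
    m: atom,
    f: atom,
    a: list,
    from: {pid, tag :: term}
  }

  defstruct [:name, :m, :f, :a, :from]
end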
lib/swarm/tracker/tracker.ex
Outdated
@@ -719,14 +744,19 @@ defmodule Swarm.Tracker do
  :else ->
    # pid is dead, we're going to restart it
    case Strategy.key_to_node(state.strategy, name) do
      :undefined ->
        # No node available to restart process on, so remove registration
        debug "no node available to restart #{inspect name}"
My instinct is that this should be logged as a warning - it seems like a condition that we would definitely want logged when it occurs.
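A one-line sketch of that change, assuming a `warn` helper is available alongside the `debug` helper used in this module:

warn "no node available to restart #{inspect name}"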
lib/swarm/tracker/tracker.ex
Outdated
end

defp handle_cast({:retry_pending_trackings}, %{pending_trackings: pending_trackings} = state) do
  debug "retry pending trackings: #{inspect state.pending_trackings}"
  state = Enum.reduce(pending_trackings, %TrackerState{state | pending_trackings: []}, fn (tracking, state) ->
Minor style change here: my preference in new/modified code is to break assignments which exceed 80 characters into multiple lines, using pipes if applicable to shorten them up; I also prefer to avoid parens around anonymous function arguments, e.g.:

state =
  pending_trackings
  |> Enum.reduce(%TrackerState{state | pending_trackings: []}, fn tracking, state ->
    ...
  end)
@bitwalker Thanks for the positive feedback. I'll go ahead and make the changes you've outlined and update the pull request.

@bitwalker I've pushed additional commits to this pull request containing the changes you outlined. One issue I've spotted today is that joined groups are not rejoined when a process gets restarted. I can include a fix for that too, if you want?

Let's fix that issue in a separate PR to keep things easier to review :). I've merged this for now, thanks for all the hard work!
Adds new strategy module `Swarm.Distribution.StaticQuorumRing`, which is used to provide consistency during a network partition. The strategy is configured with a quorum size defining the minimum number of nodes that must be connected in the cluster to allow process registration and distribution. If there are fewer nodes available than the quorum size, any calls to `Swarm.register_name/5` will block until enough nodes have started.

The `Swarm.Distribution.Strategy.key_to_node/2` function may return `:undefined` to indicate that there is no available node to start the process on. In this case the tracker will record the registration as pending, and will attempt to start the process whenever the network topology changes. Should a node go down, or during a net split, any processes that are currently running but are determined to have no node available will be stopped; they will be restarted whenever a node becomes available, as determined by the same consistent name-based hash ring distribution using libring. Additional ring distribution strategies can now be written that provide consistency instead of availability by returning an `:undefined` node.

A full suite of tests for the new functionality is included in `test/quorum_test.exs`. Please let me know of any issues or changes you recommend.
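As an illustration of the extension point described above, a custom consistency-oriented strategy can opt into the pending-registration machinery by returning `:undefined` from its `key_to_node/2` callback. A hedged sketch (the module name, struct fields, and `quorum_met?/1` helper are hypothetical; `HashRing` is libring's ring API):

defmodule MyApp.ConsistentStrategy do
  alias MyApp.ConsistentStrategy

  defstruct [:ring, :quorum_size]

  # Return :undefined when the cluster cannot safely place the process;
  # the tracker then records the registration as pending and retries it
  # on the next topology change.
  def key_to_node(%ConsistentStrategy{ring: ring} = strategy, key) do
    if quorum_met?(strategy) do
      HashRing.key_to_node(ring, key)
    else
      :undefined
    end
  end

  defp quorum_met?(%ConsistentStrategy{ring: ring, quorum_size: quorum}) do
    length(HashRing.nodes(ring)) >= quorum
  end
end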