Releases: outbrain-inc/orchestrator

GA Release v1.4.440

14 Oct 07:32
Better error message in ReadTopologyInstance, hinting at the exact origin
of the problem

GA Release v1.4.438

13 Oct 12:27
analysis & recovery:

- added DeadCoMasterAndSomeSlaves analysis
- better DeadCoMaster handling; now regrouping based on
GTID/Pseudo-GTID, attempting to promote co-master on top.

GA Release v1.4.437

12 Oct 14:42
Better handling of DeadCoMaster. Previously this used the same logic as
DeadIntermediateMaster, but the two do not really behave the same way.
The correct way to recover a dead co-master is to relocate all its
slaves to its still-working co-master, and that's it.

Furthermore, post-failover scripts should be master-failover processes,
not intermediate-master failover processes.
Recovery should only be automated if the cluster qualifies for
automated-master-recovery, not automated-intermediate-master-recovery.

GA Release v1.4.428

08 Oct 12:41
config: added PostponeSlaveRecoveryOnLagMinutes

postponed functions:
- introducing PostponedFunctionsContainer type (add to, invoke postponed
functions)
- MultiMatchBelow() accepts PostponedFunctionsContainer; on single-slave
buckets where the slave has SQLDelay, match of said slave is postponed
- RegroupSlaves() accepts PostponedFunctionsContainer to support the
above
- RegroupSlavesIncludingSubSlavesOfBinlogServers() accepts
PostponedFunctionsContainer to support the above
- TopologyRecovery includes PostponedFunctionsContainer
- RecoverDeadMaster uses TopologyRecovery as postponed functions
container upon RegroupSlavesIncludingSubSlavesOfBinlogServers()
  - effectively: SQL-delayed replicas are not blocking the entire
recovery operation but are postponed (see
PostponeSlaveRecoveryOnLagMinutes config).
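
A rough sketch of the idea behind the container follows. Only the type name `PostponedFunctionsContainer` comes from these notes; the fields, the method names, the `replica` type and the `matchBucket` helper are assumptions made for illustration, and the `PostponeSlaveRecoveryOnLagMinutes` threshold is not modeled.

```go
package main

import "fmt"

// PostponedFunctionsContainer collects functions to be run later.
// Fields and method names are assumptions, not orchestrator's actual API.
type PostponedFunctionsContainer struct {
	functions []func() error
}

// AddPostponedFunction queues f for later invocation.
func (c *PostponedFunctionsContainer) AddPostponedFunction(f func() error) {
	c.functions = append(c.functions, f)
}

// InvokePostponed runs the queued functions in order, clears the queue,
// and returns any errors they produced.
func (c *PostponedFunctionsContainer) InvokePostponed() []error {
	var errs []error
	for _, f := range c.functions {
		if err := f(); err != nil {
			errs = append(errs, err)
		}
	}
	c.functions = nil
	return errs
}

// replica is a hypothetical stand-in for a replication instance.
type replica struct {
	Name            string
	SQLDelaySeconds uint
}

// matchBucket matches the replicas of one bucket right away, except that a
// single-replica bucket whose replica has SQL delay is postponed.
func matchBucket(bucket []replica, c *PostponedFunctionsContainer, match func(replica) error) []error {
	if c != nil && len(bucket) == 1 && bucket[0].SQLDelaySeconds > 0 {
		r := bucket[0]
		c.AddPostponedFunction(func() error { return match(r) })
		return nil
	}
	var errs []error
	for _, r := range bucket {
		if err := match(r); err != nil {
			errs = append(errs, err)
		}
	}
	return errs
}

func main() {
	var order []string
	match := func(r replica) error {
		order = append(order, r.Name)
		return nil
	}
	c := &PostponedFunctionsContainer{}
	matchBucket([]replica{{Name: "delayed", SQLDelaySeconds: 3600}}, c, match)
	matchBucket([]replica{{Name: "a"}, {Name: "b"}}, c, match)
	c.InvokePostponed() // the delayed replica is handled last
	fmt.Println(order)  // [a b delayed]
}
```

Passing a nil container makes `matchBucket` fall back to matching everything immediately, which keeps callers that do not care about postponement unchanged.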

GA Release v1.4.426

07 Oct 19:08
web:

- fixed bug in cluster.js for virtual node (threw JS error)
- worked around empty arrays (nil slices) being rendered by
golang/json as "null"

GA Release v1.4.425

07 Oct 15:41
metrics:

- writing resolve_dao metrics

GA Release v1.4.424

07 Oct 14:21
recovery:

- introducing PostponedFunctions: may be added as recovery progresses;
issued after all PostRecovery* processes execute
- RecoverDeadMaster uses postponed functions for executing detach-slave
on lost slaves
- RecoverDeadMaster uses postponed functions for executing
begin-downtime on lost slaves, dead master
- RecoverDeadMaster uses postponed functions for executing repoint on
binlog servers in a binlog server topology

async_request:
- added, but currently UNUSED, since postponed functions work well
enough
- db: added async_request table
- added async_request.go, async_request_dao.go
- idea is to be able to request operations such as "relocate" etc. that
will be executed by the elected service asynchronously (implementing a
job queue)
- unsure when this will be used

metrics:
- fixed bug where metrics callbacks were not being called
- instance_dao writes some instance read/write metrics

GA Release v1.4.421

07 Oct 06:59
db:

- added these columns to topology_recovery, making for a better audit:
  - participating_instances
  - lost_slaves
  - all_errors

recovery:
- more data in TopologyRecovery (see above); the type is now used for
writes as well
- dao reads & writes added columns in topology_recovery
- ResolveRecovery() uses recovery_id rather than host/port combination
- TopologyRecovery reference used throughout recovery process and holds
the state of things (BTW this is the most stateful data in orchestrator
at this time)
- TopologyRecovery object collecting errors as recovery makes progress,
for future visibility
- TopologyRecovery object collecting lost slaves as recovery makes
progress, for future visibility
- TopologyRecovery object collecting participating instances as recovery
makes progress, for future visibility

web:
- fixed bug in cluster.js leading to infinite loop on master-master
replication and loss of drag-n-drop functionality

GA Release v1.4.415

24 Sep 13:51
recovery:

- dao reads is_successful
- audit-recovery presents FAIL status for failed recoveries

GA Release v1.4.412

18 Sep 14:15
Fixed resolve issue leading to unjustified fatals