Releases: cadence-workflow/cadence
v1.2.4
What's Changed
- Remove database check for config store tests by @Shaddoll in #5401
- Fix persistence tests setup by @Shaddoll in #5402
- Implement config store for MySQL by @Shaddoll in #5403
- Retract v1.2.3 by @sankari165 in #5406
- Implement config store for PostgresSQL by @Shaddoll in #5405
- Release v1.2.4 by @Shaddoll in #5407
Full Changelog: v1.2.3...v1.2.4
v1.2.3 (Retracted, please use v1.2.4)
Added
Expose workflow history size and count to client by @timl3136 (#5392)
Fixed
[cadence-cli] fix typo in input flag for parallelism by @sankari165 (#5397)
Changed
Update config store client to support SQL database by @Shaddoll (#5395)
Scaffold config store for sql plugins by @Shaddoll (#5396)
Improve poller detection for isolation by @Shaddoll (#5399)
v1.2.2
What's Changed
- add a update workflow execution count metric for RI by @allenchen2244 in #5386
- Pass partition config and isolation group to history/matching even if isolation is disabled by @Shaddoll in #5385
- [CLI] fix nil pointer issue in domain migration command rendering by @shijiesheng in #5378
- Release v1.2.2 by @shijiesheng in #5388
Full Changelog: v1.2.1...v1.2.2
v1.2.1
Project release: Zonal isolation
This version introduces a few resiliency concepts into customers' worker task processing such that they can detect deployment or configuration failures earlier. These features are opt-in.
The high-level concept is to provide a means to subdivide work for workers (into 'isolation-groups') along whatever partitioning mechanism your service requires.
By default, the provided partitioning mechanism attempts to keep workflows running in the location where they were started, so that customers can identify broken changes earlier rather than waiting for the deployment of an entire region. However, if no pollers are available in that subdivision, it routes the work elsewhere.
Nomenclature
Partitioning: A means to subdivide the tasks given to workflows, of which there are many possible schemes and one default one provided. When a workflow is started, a group of partition keys are provided by request headers. The partition keys are used to determine which isolation group of workers should process these workflows.
Workflow pinning: A partitioning scheme which emphasizes keeping workflows running in the location they were started
Isolation-groups: A division of work within a customer region in which customers can subdivide their workers and pin their workflows. This was originally intended as a synonym for 'zone' in the site-reliability sense, as a subdivision of a region. However, the important point is that this is a failure domain for customer workflows, so it may be an arbitrary subdivision of your cluster's traffic.
Isolation-group drain: A means of excluding work from an isolation-group. If an isolation-group is drained, workers in that isolation-group will not receive any tasks, and customers cannot start workflows from that isolation-group.
Default concepts and approaches
The partitioning and isolation concepts are intended to be flexible, general-purpose orchestration concepts, with some basic defaults provided. By default, the following behaviour is given:
- Partition data is persisted with workflow execution records by the provided middleware if the provided header is passed when workflows are created.
- The cadence client and worker Go libraries will pass these as headers if provided in client options
Pinning behaviour
The workflow's original zone is captured when the workflow starts and is used during workflow processing.
The default partitioner provides the following behaviour: it attempts to dispatch work in the zone where the workflow was started. However, workers may not be available in that zone, or may no longer be available for some reason. So the partitioner takes information from a lookback of poller information and uses this lookback data to ensure that the workflow can be processed. If the start isolation-group is not available, it will pick another healthy isolation-group at random.
'Health', here, is determined as the presence of pollers and the absence of drains.
The 'unpinning' is important for two main reasons. Firstly, it's quite possible to start a workflow from an isolation-group unrelated to the one in which the pollers are created, and suddenly blackholing that work would likely not be the desired behaviour. Secondly, and probably more importantly, this prevents a head-of-line blocking problem internally for Cadence. At the database level (in this release, anyway) tasks need to be dispatched in order, so if an isolation-group were not processed it would block task processing.
Drains
This release also introduces a simplistic notion of drains, which allow for isolation-groups to be excluded from traffic processing, should that be required. Drains are issuable via the Admin API or via cli:
eg:
cadence admin isolation-groups update-global --set-drains zone-1
cadence admin isolation-groups get-global
This information is stored in the config-store and is not part of dynamic configuration.
Configuration
To use this feature, the following configuration is required:
system.allIsolationGroups: A list of all the possible isolation-groups
system.enableTasklistIsolation: The bool flag that enables the feature for a domain
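As an illustration only, and assuming a file-based dynamic config that uses the same `- value:` layout shown for history.replicatorCacheCapacity in the v0.25.0 notes, the two keys might be set as below. The zone names and the domain name are placeholders, and the `constraints` / `domainName` filter is an assumption about how the per-domain flag is scoped; check it against your deployment's dynamic config client.

```yaml
# Hypothetical values: replace the zone names and domain with your own.
system.allIsolationGroups:
- value: ["zone-1", "zone-2", "zone-3"]
system.enableTasklistIsolation:
- value: true
  constraints:
    domainName: "my-domain"
```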
Implementation
The changes for this feature are largely in Matching and can be (reductively) described as follows: sync and async match in Cadence are made aware of a new dimension, their associated isolation-group. Tasks piped through the Matching service are matched on the appropriate isolation-group channel.
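The idea of an isolation-group channel can be pictured with a toy Go model. This is a sketch under assumptions, not the Matching service's code: `task`, `taskList`, `offer`, and `poll` are hypothetical names, and the real service handles forwarding, blocking polls, and persistence that are left out here. It shows only that each task carries an isolation-group and is handed to a poller from the same group.

```go
package main

import "fmt"

// task is a hypothetical stand-in for a matching task tagged with the
// isolation-group chosen for it.
type task struct {
	ID             int
	IsolationGroup string
}

// taskList is a toy model holding one buffered channel per isolation-group.
type taskList struct {
	channels map[string]chan task
}

func newTaskList(groups []string, buffer int) *taskList {
	tl := &taskList{channels: make(map[string]chan task, len(groups))}
	for _, g := range groups {
		tl.channels[g] = make(chan task, buffer)
	}
	return tl
}

// offer routes a task to the channel of its isolation-group. It reports
// false if the group is unknown or its channel is full.
func (tl *taskList) offer(t task) bool {
	ch, ok := tl.channels[t.IsolationGroup]
	if !ok {
		return false
	}
	select {
	case ch <- t:
		return true
	default:
		return false
	}
}

// poll returns the next task for a poller in the given isolation-group,
// without blocking.
func (tl *taskList) poll(group string) (task, bool) {
	ch, ok := tl.channels[group]
	if !ok {
		return task{}, false
	}
	select {
	case t := <-ch:
		return t, true
	default:
		return task{}, false
	}
}

func main() {
	tl := newTaskList([]string{"zone-1", "zone-2"}, 8)
	tl.offer(task{ID: 1, IsolationGroup: "zone-1"})
	tl.offer(task{ID: 2, IsolationGroup: "zone-2"})

	t, ok := tl.poll("zone-2")
	fmt.Println(t.ID, ok) // prints "2 true": a zone-2 poller sees only zone-2 tasks
}
```

One channel per group keeps the model simple; it also makes visible why an unprocessed group would hold its tasks until they are rerouted, which is the concern the unpinning behaviour addresses.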
What's Changed
- Set config for shardscanner fixer by @mantas-sidlauskas in #3844
- Fix get raw history for transient decision by @yycptt in #3847
- Fix error handling when processing parent close policy by @yycptt in #3845
- Add logging/metrics for decision attempts by @yycptt in #3849
- Switch to gocql interface by @yycptt in #3837
- Fix NPE in DescribeMutableState by @yycptt in #3850
- Switch the remaining history component to internal types by @vytautas-karpavicius in #3843
- Switch Health status endpoints to internal types by @vytautas-karpavicius in #3842
- reset workflow with no decision task complete by @yux0 in #3687
- error check before return the ActivityLocalDispatchInfo by @mkolodezny in #3853
- Delete unused dynamic configs that have no referrence anymore by @longquanzheng in #3859
- Merge sql updates: Blob size increase by @yux0 in #3858
- Handle matching task list conditional error by @yux0 in #3867
- Fix go-generate by @yycptt in #3864
- Support visibility query with close status represented in string by @yycptt in #3865
- Add timers shardscanner by @mantas-sidlauskas in #3846
- replace string based logging with tagged logs by @mantas-sidlauskas in #3871
- Downgrade golang tools version by @yycptt in #3876
- Add instructions to setup local MySQL and Postgres by @yux0 in #3868
- Make max activity schedule to start timeout for retry configurable by domain by @yycptt in #3878
- Task processing debug logs by @yycptt in #3877
- Transfer queue validator by @yycptt in #3875
- Pick sql index changes by @yux0 in #3866
- Remove strict sanity check to allow reset by @yux0 in #3879
- Improve shard context timeout handling by @yycptt in #3881
- Add domain name tag in failover metrics by @yux0 in #3882
- break out when response is nil by @mantas-sidlauskas in #3886
- Allow using Kafka TLS without cert ca and key by @longquanzheng in #3862
- Fix dynamic config collection logValue function by @yycptt in #3880
- Update read DLQ messages API to return raw task info by @yux0 in #3869
- break if adminClient returns error by @mantas-sidlauskas in #3887
- Latest idl by @yux0 in #3888
- Fix activity lost metrics by @yycptt in #3889
- Add replication error logging and metrics by @yux0 in #3891
- Simplify templateGetLastMessageIDQuery sql query by @andrewjdawson2016 in #3890
- Add task processing workflow busy metric by @yycptt in #3892
- CLI 0.18.0 release by @yycptt in #3896
- Handle data corruption error in replication by @yux0 in #3895
- Add a "help" target to the makefile by @Groxx in #3898
- Initial protobuf types and API by @vytautas-karpavicius in #3863
- Fix workflow reset command by @yycptt in #3904
- CLI 0.18.1 patch release by @yycptt in #3908
- Use GetDomainName instead of GetDomainByID for retrieving domain names by @yycptt in #3899
- Start enabled shardscanner fixers by @mantas-sidlauskas in #3906
- Switch to protoc-gen-go by @vytautas-karpavicius in #3905
- Fix scan unsupported workflow in SQl DB by @yux0 in #3909
- Makefile cleanup / thrift revamp / gobin removed by @Groxx in #3903
- Version goveralls, remove unused go bins from docker setup by @Groxx in #3913
- Remove duplicate doc...
v1.0.0
We are v1.0! (with a schema upgrade)
What does this mean?!
Not much. Primarily that we are declaring "it's stable and in use" more visibly, because we continually get questions about this :) A larger public announcement / state-of-the-project is in the works.
Importantly, v1.0 does not imply any change to backwards compatibility (the minimum supported client version has not changed), RPC compatibility (ditto, all changes are backwards compatible), or Go API compatibility (this is not truly a library, Go compatibility is not a goal).
Going by previous version patterns, this would have been labeled v0.26.0 as it is a relatively incremental change (plus schema changes) from v0.25.0. As such, some strings still reference "0.26", because this older SHA is the one we have been using the most internally.
These strings will be updated and validated soon, and will likely be released as v1.0.1. This should have no behavioral impact at all, but will be visible in metrics, logs, and display strings.
What do I need to do to upgrade?
Schema upgrades needed
There have been schema changes to both normal and visibility datastores, primarily to provide better data for cleanup and hot-shard detection:
- Update-time additions by @neil-xie in #4962 and #4971
- Add FirstExecutionRunID to mutable state by @Shaddoll in #5031
- Shard ID visibility additions by @allenchen2244 in #5099 and #5123
These were intentionally kept out of v0.25.0 to keep that upgrade simple, as they were not fully utilized yet.
Replication cache recommendation
We have internally disabled the replication cache (history.replicatorCacheCapacity dynamic config set to 0), due to unexpectedly large memory use under abnormal load, and you may wish to do so as well.
We did not encounter any misbehavior, and it did reduce database load as intended, but we intend to make some changes to it to estimate and constrain memory use before re-enabling.
What has changed?
At a very high level, we've been focused on:
- Internal scaling challenges, both improving bottlenecks and improving our ability to accurately identify bottlenecks
- Many metrics, logs, and refactors are at least somewhat related to this
- Our multi-cluster support is improved in particular, as we have been connecting clusters and moving many domains to spread load more evenly
- Database corruptions, as our Cassandra clusters have had some problems that caused issues for months
- Many logs, scanner, and stale-task changes are related to this, e.g. to detect and remove invalid data
- Scaling up the team
- More changes to come!
Some loosely categorized PRs that were included follow:
Critical bugfixes (resolving issues in v0.25.0)
- Fix ndc flush buffered events by @Shaddoll in #5009
- Hotfix a replication panic causing crashes by @davidporter-id-au in #5074
- Resolve an infinite loop around impossible cron schedules by @Groxx in #5097
Parent-close-policies apply to child workflows even after they reset/continue-as-new/etc
- Update parent close policy to terminate/cancel child workflows even after continue as new by @Shaddoll in #5032
- This requires new stored data, so it does not apply to child workflows started before this version.
Better config introspection
- Config store CLI: make value required when updating by @mantas-sidlauskas in #5089
- CLI: print all available dynamic config keys by @mantas-sidlauskas in #5090
Schemas are now available via the go module, as go:embed files
- Embed schema files by @Shaddoll in #5040
- Embed elasticsearch index templates by @Shaddoll in #5043
- Fix ES embedding by @Shaddoll in #5056
Enhancing existing metrics and logging (and more included in other PRs)
- Reduce metrics cardinality replication.TaskStore by @vytautas-karpavicius in #4981
- Add Metric Emitter, which right now emits a metric once a minute for true replication lag in nanoseconds. by @ZackLK in #4979
- Added logs for domainName empty situation by @abhishekj720 in #4987
- Improve logs for task executor by @Shaddoll in #4989
- Add domain_type and cluster_groups tags by @vytautas-karpavicius in #4990
- Introduce per domain metrics by @Shaddoll in #5012
- Improve logs for transfer task validator by @Shaddoll in #5044
- Make replication log error message better by @davidporter-id-au in #5052
- Wf version metrics by @allenchen2244 in #5041
- Add domain tag to unregistered field error by @neil-xie in #5070
- UpdateWorkflow ShardId based metrics by @allenchen2244 in #5080
- Emit workflow counts per workflow type metrics by @neil-xie in #5082
- Use zap logger when initialising dynamic config by @mantas-sidlauskas in #5081
- add 3 tags to support adding logs for every manual access by @bowenxia in #5112
- Add sample log and dynamic config for updateworkflowexecution hot shard detection by @allenchen2244 in #5120
- Add attempt-count to task processing logs, and update unit test so that it will cover deadlock by @bowenxia in #5122
Misc
- Allow docker compose to work with docker-compose-mysql.yml on M1 by @ZackLK in #4983
- Return early when there are no replication tasks by @vytautas-karpavicius in #4982
- Update Cassandra deletes to use ALL consistency level by @Shaddoll in #4984
- Make test should pass locally by @ZackLK in #4915
- Immediate replication task hydration after successful transaction by @vytautas-karpavicius in #4980
- Convert client peer resolving errors to service transient errors by @Shaddoll in #4993
- Update idls by @Shaddoll in #4997
- Fix history corruption check for workflow signaling by @Shaddoll in #4998
- Introduce a dynamic config for cassandra all consistency level delete by @Shaddoll in #5000
- Adds fix for domain ack level issue by @davidporter-id-au in #5001
- Drop dynamic config for gRPC message size by @vytautas-karpavicius in #5002
- Fix Cadence CLI by @Shaddoll in #5005
- Re-enable workflow test by @Shaddoll in #5007
- Add new unit test by @Shaddoll in #5008
- Reformatting most things for go 1.19, rebuilding go.mod tools after clean, warning about different go versions by @Groxx in #5019
- Enhance workflowDeletionTaskJitterRange to handle deletes piling up when many workflows have finished at the same time. by @ZackLK in #5020
- Feature/min initial failover version by @davidporter-id-au in #5015
- Fix Makefile OpenSearch rule name in CONTRIBUTING.md install guide, Fix OpenSearch version in dev Docker config by @charlese-instaclustr in #5004
- Decouple StateBuilder from TaskGenerator by @vytautas-karpavicius in #4991
- Removing unused code by @vytautas-karpavicius in #5024
- Use internal IndexedValueType by @Shaddoll in #5016
- Fix workflow cancellation by @Shaddoll in #5025
- Add UpdateTime to uninitialized workflow execution record and update logic to set the update time by @neil-xie in #5014
- Update DSL query to allow filtering by missing start time by @neil-xie in #5017
- test: use T.TempDir to create temporary test directory by @Juneezee in #5013
- Enable workflow corruption check for Describe and Query API by @Shaddoll in #5028
- Remove unused watchdog signal by @demirkayaender in #5029
- Add TLS ServerName as CLI option for Cadence Cassandra Tool by @sonpham96 in #5011
- Add cli tls support by @charlese-instaclustr in #5027
- Improve Cassandra errors for schema check by @mantas-sidlauskas in #5038
- Fix SignalWithStartWorkflow by @Shaddoll in #5036
- Fix error message by @ZackLK in #5045
- Making a schema tooling concrete -> interface by @davidporter-id-au in #5046
- Exposing the ability to pull CQL changesets by @davidporter-id-au in https://github.com/uber/ca...
v0.25.0
Important Notice: If you're experiencing OOM after deploying this version, please update this dynamic property to disable replication cache.
history.replicatorCacheCapacity:
- value: 0
Per-domain metrics
- 483a149 Introduce per domain metrics (#5012)
- e87bd74 Added logs for domainName empty situation (#4987)
- c8783f0 Addition of domainName tag to Replication task (#4975)
- 88991f2 Addition of domain tag for Replication task metric (#4974)
- e69dbd6 Added changes to readHistoryBranchRequest (#4972)
- 76a025a Added domainName change to remaining functions of appendHistoryNodeRequest and RecordWorkflowExecutionUninitializedRequest (#4968)
- 0f59042 Added changes to archival client (#4958)
- d1965b1 Added domain Tag to UpdateTaskList,DeleteTaskList,LeaseTaskList,CompleteTask and CompleteTaskLessThan (#4950)
- 4c8013d Added changes to GetTask and CreateTask (#4947)
- e88a9c7 Added changes to PutReplicationTaskToDLQ and IsWorkflowExecutionExists (#4946)
- b9b8b42 Added changes to DeleteCurrentWorkflowExecution and GetCurrentExecution (#4944)
- 8c5f2ff Added changes to ConflictWorkflowExecution and DeleteWorkflowExecution (#4943)
- 13a130b Added changes to GetWorkflowExecution and UpdateWorkflowExecution (#4938)
- 2bb13a1 Added DomainTag changes to ReadHistory branch for readHistoryRequest, CreateWorkflowRequest + added DomainCacheNoOp file (#4930)
- c091a49 Changed DeleteHistoryBranch and GetHistoryTree by adding Domain Tag with mocks (#4928)
- b34f4e4 Adding DomainTag to the ForkHistoryBranch, ReadRawHistoryBranch and ReadHistoryBranchByBatch (#4926)
- 6cf4252 Adding DomainTag to the Persistence metrics client (#4922)
- c3f7bd3 Addition of DomainTag to required functions for the creation of metrics required for Domain Cost Attribution (#4908)
Replication improvement
- 6242854 Immediate replication task hydration after successful transaction (#4980)
- beaf670 Return early when there are not replication tasks (#4982)
- d38b08e Add Metric Emitter, which emits a metric once a minute for true replication lag in nanoseconds. (#4979)
- 1a2804d Reduce metrics cardinality for replication.TaskStore (#4981)
- 93a6f23 Return persisted history events blob (#4953)
- 1be9b6d Replication cache for sharing hydrated messages (#4952)
- 457c35e Partial response of GetReplicationMessages on history service (#4935)
- d739bf5 Helpers for getting enabled and remote cluster info (#4951)
- 385c1c3 Adds more pertinent information about replication (#4931)
- fe3bf0b Refactor task ack manager (#4894)
- 83aa193 Removed TaskID from types.HistoryTaskV2Attributes (#4876)
Observability improvement
- 1e788db Add domain_type and cluster_groups tags (#4990)
- ff11392 Improve logs for task executor (#4989)
- e597b87 Add logs to debug transfer task (#4970)
- 177f087 Improve log for transfer task validator (#4961)
- b0d1f06 Capture CassandraLWT error and log/bump metrics for it. (#4888)
- 50d331a add activity info logging (#4867)
- 93bda8f Cadence-history does not emit continue-as-new metrics (#4866)
- 7854f81 Add empty response metrics for read operations (#4855)
- 471e6d1 Log replication messages that did not fit (#4844)
- b03d03e add metric tags for activity task disaptch (#4821)
- d21162d Add logs for domain failover (#4810)
- 400bbe4 Improve failover coordinator error logging (#4811)
- a51b613 Log error fields as tags (#4801)
- c598654 Improve task re-dispatch error logging (#4809)
- 22f97c8 Log error when fetchHistoryFromRemote fails (#4807)
- 33edece Add source_cluster tag when emitting DLQ size (#4782)
Activity dispatch optimization
- 52203ab count local and server optimized activity dispatches as started (#4901)
- bafdf15 do not wait for activity task channel if sync match from history (#4860)
- 361edb6 add activity dispatch configs to matching (#4818)
- e77b43d add activity dispatch configs (#4816)
- 2b0b03f updated idl for activity task dispatch (#4815)
- 2890600 add data contract for activity task dispatch (#4813)
- cda6c53 set EnableActivityLocalDispatchByDomain default value to true (#4788)
Restart workflow
Cross Cluster operations
- e5ed7f7 Feature/adding canary for cross cluster -> readme patch (#4870)
- 68fb2e6 Adds cross-cluster canary (#4868)
Corrupted workflows
- 79437b3 Introduce a dynamic config for cassandra all consistency level delete (#5000)
- 052d77c Update Cassandra deletes to use ALL consistency level (#4984)
Cancel workflow
- add4b39 Standardizing cancellation behavior: a canceled workflow never starts a new run (#4898)
- f1c5578 adding reason to cancel workflow. (#4934)
Failover lockdown
Bug fixes
- c2ffb71 Adds fix for domain ack level issue (#5001)
- 3985fec Fix history corruption check for workflow signaling (#4998)
- 1375e49 Revert "Fix error conversion for WorkflowExecutionAlreadyStartedError (#4838)" (#4999)
- 494f202 Fix status check for visibility and archival (#4864)
- a727049 Bugfix/correct failover issue target domain not active ii (#4840)
Misc improvements & updates
- 78a755c Add new unit test (#5008)
- 278a3b8 Re-enable workflow test (#5007)
- 43c9ebc Fix Cadence CLI (#5005)
- 146bc31 Update idls (#4997)
- 6da9676 Convert client peer resolving errors to service transient errors (#4993)
- a91a250 Adding first scheduled time metadata field for cron workflows. (#4969)
- 5eb67d1 Make test now passes locally (#4915)
- 3aaa1e8 Allow docker compose to work with docker-compose-mysql.yml on M1 (#4983)
- 854fc59 Run docker build on commits, to prevent docker build from breaking in the future (#4978)
- 172abd6 Fix docker build. (#4977)
- 701fb70 Adding limit for amount of pending activties in mutable state. (#4959)
- 6ecd1e4 Fixing test. (#4941)
- d8cb61e Upgrade Golang base images to remediate CVEs (#4957)
- f2b2108 Simplify shard write operations (#4955)
- 9949a22 Simplify history engine task read ID logic (#4949)
- 7566018 fix funcorder linter (#4942)
- b21f34f add funcorder linter (#4939)
- e3496a3 Add List*Execution (ElasticSearch) API ratelimiters (#4925)
- 85e0fee Fix flacky QueryWorkflow tests (#4932)
- 341d9f0 Improve decode_thrift output (#4929)
- a4d77f5 Fix query workflow high latency after a long inactive time (#4871)
- 43a17d2 downgrade testify to fix monorepo (#4918)
- ef8d11e Update revive to catch more defer/recover badness (#4917)
- 82544de Replace unsafe usage of recover() in helper functions (#4913)
- c06649e Fix remaining server lint warnings and make lint error by default. (#4911)
- 8b42a6d Start fixing server lint warnings (#4909)
- d2f72d8 Fix flaky retrypolicy tests. (#4905)
- 25e221b Add new CI step for lint validation (#4903)
- 64cb46f Add new es record for uninitialized workflow execution (#4899)
- 8c449b3 Add JitterDelay option when creating workflows. (#4886)
- 1f8c93a reduce MatchingActivityTaskSyncMatchWaitTime default value (#4897)
- 7da6bc0 [codegen] introduce gowrap for generating retryableClient (#4879)
- ed2beb2 Separating tools dependencies from main dependencies (#4895)
- de09926 Minor makefile cleanup, verbose CI, fmt with a recent Go version (#4896)
- cfd637e add mockery to go generate (#4887)
- 6f9e2d9 upgrade go version to 1.17 in go mod and Buildkite dockerfile (#4889)
- 663a041 Added support for network topology strategy (#4875)
- ac10760 Move visibility operation from search attributes to indexer message (#4881)
- 691bf3f Magically speed up integration tests by nearly 10x (#4892)
- e9915ae Rename dockers default cluster name to match the other config files. (#4885)
- aff5ecf Simplified FindFirstVersionHistoryByItem (#4882)
- 4cfb741 fix flaky TestDelayStartWorkflow (#4884)
- 9f21900 update generated code (#4880)
- 6009044 Support allowed authenticators in tool (#4873)
- f133d3c Add support for changing the gocql connect timeout (#4874)
- dc5230f Update idl for StickyWorkerUnavailableError (#4869)
- 9e6d122 Used exposed admin proto IDLs (#4865)
- 0930305 Add visibility operation types to Kafka message (#4828)
- ae14412 Move some proto definitions to admin package (#4861)
- af932bd Fix CLI rendering long workflow types (#4853)
- b457b55 Make cluster.Metadata a struct and stop using mocks for it (#4851)
- 12d8c54 Add UpdateFromConfig function to schema tool library (#4848)
- d6ae278 Decouple domain cache entry from cluster metadata (#4847)
- 15267b9 Separate buildkite pipeline for PRs (#4850)
- 0582a58 Update SQL implementation of UpdateExecution to support async transaction (#4792)
- 535cda8 Remove unused loggers from history (#4822)
- 915a777 Simplify history builder (#4837)
- beab75c Removing target-domain-not-active special-case handling (#4835)
- a575908 Extract Engine from matching handler (#4833)
- 20329a2 Forward activity responses and heartbeats on failover as well (#4823)
- fbfafb9 Update PROPOSALS.md (#4831)
- 94fd0a6 Update roadmap.md (#4829)
- 0a37a8b remove redundant type conversions for activity task dispatch (#4820)
- ee5461b Check for resurrected activities during RecordActivityTaskStarted (#4806)
- 4194b29 Remove unused PayloadSerializer param (#4827)
- 45770c2 Add CustomDomain and Operator as default indexed keys (#4825)
- eede466 Fill domainID for backwards compatibility (#4819)
- 8b10063 Fix error type returned from GetWorkflowExecution and DeleteWorkflowExecution (#4817)
- fc9d5fa Change access dienied error type (#4808)
- e91a5a7 Allow decoding thrift from base64 string via CLI (#4805)
- 5be511b Update base image to Alpine 3.15 ...
v0.24.0 Release
Schema upgrades (required)
Cassandra: upgrade schema to 0.33
288e935 Persist domainID instead of domainName for childExecutionInfo (#4601)
d9e5003 Handle applyParentClose target domain failover (#4533)
gRPC Support
- Internal traffic is now on gRPC by default
- Cadence canary is now on gRPC
- Cadence worker is now on gRPC
- Cadence CLI supports --transport to use gRPC (default is still tchannel)
- Added support for TLS
028c444 Update cadence go client to 0.19.0 (#4696)
fd510e1 Export ResponseInfoMiddleware & InboundMetricsMiddleware (#4680)
3865361 Switch system worker to gRPC (#4679)
ff71ae3 Shuffle responses for replication messages (#4652)
ad49ea6 Fix ResponseInfo to work on all transports (#4649)
5ccad58 Use generated proto types from cadence-idl repo (#4630)
8698157 Add inbound header forwarding middleware (#4637)
9237acb Use direct outbound for matching client. (#4622)
04cd354 Use direct outbound for history client (#4619)
8510816 Add TLS support on gRPC (#4606)
2af3246 Handle error case in response info middleware (#4609)
53833a2 Fix and improve canary thrift config and docs (#4580)
a0ccc85 Switch canary to gRPC (#4570)
b21e5e0 Remove dispatcher provider (#4559)
b2037bf Removed frontend client randomisation (#4558)
aa9e7a5 Fix public client default value after xdc switching to gRPC
9ff3eb3 Added cross DC outbound builder (#4552)
7a6b851 Remove unused NewFrontendClient functions (#4553)
f2f859b Move out dispatcher from client factory (#4506)
37a8fd7 Add inbound metrics middleware (#4545)
9db1a61 Added combineOutbounds to combine multiple outbound builders (#4538)
b1e3001 Use common dispatcher for public client outbound #2 (#4537)
d19cae1 Revert "Use common dispatcher for public client outbound (#4523)" (#4534)
a094a33 Use common dispatcher for public client outbound (#4523)
844181f Add size checker when replication messages return (#4521)
9ba3b99 Added response info middleware (#4522)
f3e3897 Move out auth middleware and add test coverage (#4519)
580c448 Introduce rpc.Params (#4517)
e45753a Refactor PeerChooserFactory out of DispatcherProvider (#4508)
a53f4c9 Move dispatcher provider to rpc package (#4507)
0b2107f Moved RPC related types to a dedicated package (#4505)
5846821 Use gRPC outbound by default for internal traffic (#4492)
Membership changes
Pluggable membership information provider with extended host metadata.
d3e03c2 Ringpop: set tchannel port even if label is missing (#4765)
f65fecb Ringpop: filter out unhealthy nodes (#4764)
4dab59a Use named port to select transport for outbound calls (#4749)
9b50717 Provide portmap to ringpop (#4745)
8477b11 Return Hostinfo identity if set (#4739)
acff10c Add correct Address tag (#4736)
29874b6 Lock membership keys after peer provider call (#4733)
7e3d48c Protect membership member keys concurrent access (#4731)
45bc726 Hashring: return Hostinfo struct instead of string (#4708)
7a17a30 Extend Hostinfo with identity and port map (#4706)
770e9ec Replace Ringop with PeerProvider interface (#4653)
3557eb5 Merge membership Monitor and ServiceResolver to membership.Resolver (#4646)
97f1690 Reduce API scope for membership.Monitor (#4644)
7e14102 Move ringpop setup to common/membership (#4638)
c145ab8 Remove Membership Factory (#4627)
e15f181 Support DNS SRV Records within Ringpop (#4614)
9a072ca Provide Channel for Ringpop (#4597)
Cross Cluster operations
Cross cluster domain dependency support for signals, child workflows, cancels and parent close policy (pre feature release).
b7d2c77 Generate parentClosePolicy task for x-cluster child (#4682)
c894177 Improve cross cluster components shutdown logic (#4662)
624a1fc Bug fixes for cross domain operations (#4623)
d3d0682 Add domain to pendingChildExecutionInfo (#4611)
39bebb4 Fix target domain not active error handling for transfer task (#4599)
e2b8e94 Split transfer close execution task (#4583)
0bfd2f7 Schedule first decision for abandoned child if parent closed (#4579)
a9ed73a Add admin respond cross cluster task completed API (#4565)
5879fa3 Misc. fix for cross cluster implementation (#4554)
041061c Wire up cross-cluster operation implementation (#4524)
70cf8be Add metrics for cross cluster implementation (#4527)
898aa91 Improve close execution task for cross cluster situation (#4528)
f74b915 Execution logic for RecordChildCompletion and ApplyParentClosePolicy (#4474)
f2ff1c3 Refactor cross cluster queue implementation (#4493)
52c8acc Limit batch size for fetching cross cluster tasks (#4487)
5ac1940 Fix parent close policy for cross-domain childs (#4486)
bd7072c Implement xcluster source task executor (#4445)
6101ab2 Implement cross-cluster source task (#4398)
c8f3c1c Support ApplyParentClosePolicy Cross Cluster Tasks (#4392)
c8d0838 Set completed workflow current version to lastWriteVersion (#4431)
fb8e782 Add feature flag for scheduling cross-cluster operations (#4424)
Auto-Forwarding
9540236 Update auto-forwarding to work for global domains with 1 cluster (#4681)
3ee1178 Update batcher to support replicating workflows (#4672)
06891aa Add Redirect policy to forward all domain APIs
ParentClosePolicy for child workflows only
c7727c0 Parent close policy should apply to child workflow only (#4612)
ES Analyzer
80700d8 Add long running workflow metrics (#4643)
2fa2787 ElasticSearch Analyzer (#4598)
MongoDB Support
46b84be Implement MongoDB plugin Part1: skeleton and ConfigStore (#4590)
IPv6 Support
0643788 feat: Fixing RPC to allow bindOnIP for IPv6 (#4620)
SQL Support
085a799 Perform schema checks for multiple SQL database and add context to AdminDB DDL interface (#4561)
f182b87 Unify mysql user password for testing (#4589)
75b10a5 Fix mysql insecure hostname verify didn't work (#4569)
f5ce7cb Implement sharded SQL driver to support using multiple SQL databases (#4504)
90e2290 Refactor to add a SQL driver layer for multiple SQL databases support as sharded SQL (#4498)
Auth
334d51f add workflow type to signal with start auth (#4495)
f98bd06 add enable service auth logging key (#4480)
b22df41 extend permission attributes for service auth (#4468)
7aca829 Load OAuth credential on startup instead of request processing (#4442)
4c2bcc7 Fix OAuth sample config and add docker-compose for OAuth testing server
5191468 Adding middleware to inject Auth token for internal requests to frontend (#4364)
Graceful failover
0c3db56 Integrate failover into describe domain response (#4440)
920077c Adding debug metrics in domain callback (#4484)
6ee5f93 Add getFailoverInfo API (#4408)
Refresh Tasks API
417f150 fixed refresh workflow tasks (#4750)
6123731 add refresh tasks API to client (#4747)
a5c527f Allow generating workflow tasks if workflow is non-current (#4688)
Corrupted workflows
6980508 Add Watchdog Workflow with Corrupt Workflow Fix (#4713)
e13da58 Add fixer workflow triggered by remote (#4482)
1cc94d5 Add a step to scan workflow to be in DLQ (#4471)
Activity dispatch optimization
de0653f add metric tags for activity task dispatch (#4821)
3581be5 remove redundant type conversions for activity task dispatch (#4820)
ac8cbbd add activity dispatch configs to matching (#4818)
532da71 merged activity dispatch config
f5cfeaf add activity dispatch configs (#4816)
c4713d2 updated idl for activity task dispatch (#4815)
b4f38d0 add data contract for activity task dispatch (#4813)
Cadence CLI Changes
b445012 Improved CLI DLQ read command (#4780)
950f5ac Added --format flag to render table, json or custom template (#4777)
c833c98 Use RenderTable for the remaining CLI commands (#4774)
99fcca8 Allow loading service config for all DB operations (#4768)
0557c2b Added presentation layer for rendering workflow list tables (#4773)
9d65899 Allow reading shard list from stdin for CLI DLQ operations (#4771)
5511bd6 Drop unused flags for cli rereplicate command (#4728)
ceacad0 Fix NPE when observing history in CLI (#4714)
9530143 Update CLI client factory to use grpc clients (#4605)
38d1e2a Add exclude query for list and reset-batch command (#4699)
cf21c86 Add skipCurrentCompleted option to reset-batch command (#4698)
41c8923 Update domain describe command to support JSON output (#4674)
c220950 Fix admin db thrift decode tool (#4665)
75a992a Create ElasticSearch client via factory (#4660)
9d40c45 Add admin tool to decode any thrift binary into JSON (#4634)
a370de0 Cli: notify on SIGINT (#4615)
8c9db18 Expose GetTaskListByDomain in CLI (#4462)
Bug fixes
0c8a0fd Fill domainID for backwards compatibility (#4819)
6981b1d Only update maxReadLevel after successful re-acquire of shard (#4799)
7328473 Fix ScanWorkflowExecutions function in frontend client (#4781)
13f9cf8 Added missing mapper fields for DecisionTaskTimedOutEventAttributes (#4762)
1923121 Fix auto-forwarding for QueryWorkflow API (#4763)
f1a0983 Fix data conversion from serialization.WorkflowExecutionInfo to persistence.InternalWorkflowExecutionInfo (#4758)
0596698 Use setupBackoffTimer with locking (#4748)
b0da1be Fix SQL implementation of DeleteWorkflowExecution (#4746)
19a8526 Update cadence batch command to receive more input (#4725)
edf4cb4 Fix parsing domain_id in child_info_maps for backward compatibility (#4722)
dea6429 Fallback to zero value for initiatedID in exteralWorkflowExecutionFields struct (#4720)
27a0df2 Add decision offset to LastDecisionCompleted reset type (#4700)
e8fdcd9 Fix cassandra plugin nil pointer dereference issue (#4697)
027bbd6 Fix queue diff metric for disabled clusters (#4686)
35ae7e7 Fix canary/bench dev con...
v0.23.2 Patch Release
Release commits
Bug fix
ff5ef71 Fix ResponseInfo to work on all transports (#4649)
97127f0 Fix remote sync match for standby domains and task creation time (#4654)
Improvement
b4b94c6 Create ElasticSearch client via factory (#4660)
831dc7f Shuffle responses for replication messages (#4652)
d367a88 Add Redirect policy to forward all domain APIs (#4657)
00bbe50 Add logs for ID length violation checkers (#4655)
Misc.
8dd7a08 Update docker files for 0.23.2 release
v0.23.1 Release
Upgrade instructions (from 0.22.x releases)
Schema upgrades (required)
- Cassandra: upgrade schema to 0.32
Configuration changes (optional but recommended)
- Change clusterMetadata to clusterGroupMetadata
- Change clusterMetadata.masterClusterName to clusterGroupMetadata.primaryClusterName
- Change clusterMetadata.clusterInformation to clusterGroupMetadata.clusterGroup
- Change dynamicConfigClient to dynamicconfig with client: filebased, and move all fields under the old dynamicConfigClient to a new field filebased under the new dynamicconfig.
- publicClient is no longer required. If not specified, it will default to the current cluster's RPCAddress in clusterGroup.
- Sample config
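As a rough illustration of the renames listed above, the fragment below shows one way the new layout might look. It is a sketch, not the official sample config: the cluster name cluster0, the address, and the filepath and pollInterval values are placeholder assumptions, and the old-style keys appear only in comments for comparison.

```yaml
# Before (0.22.x style), shown for comparison only:
#   clusterMetadata:
#     masterClusterName: "cluster0"
#     clusterInformation:
#       cluster0: { ... }
#   dynamicConfigClient:
#     filepath: "config/dynamicconfig/development.yaml"
#     pollInterval: "10s"

# After (0.23.1 style). All values are placeholders.
clusterGroupMetadata:              # was clusterMetadata
  primaryClusterName: "cluster0"   # was masterClusterName
  currentClusterName: "cluster0"
  clusterGroup:                    # was clusterInformation
    cluster0:
      enabled: true
      rpcAddress: "localhost:7933" # used as the default when publicClient is omitted

dynamicconfig:                     # was dynamicConfigClient
  client: filebased
  filebased:                       # fields from the old dynamicConfigClient move here
    filepath: "config/dynamicconfig/development.yaml"
    pollInterval: "10s"
```

Because these changes are listed as optional, the old keys should presumably keep working in 0.23.1, so the migration can be done one block at a time.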
Release Commits
New features
Config Store
0fd2b50 Added config store functionality (initial implementation) (#4357)
Cross Domain Operations
38881a8 Add X-Cluster Child Workflow Completion Tasks (#4336)
40c5f18 Implement history handler for fetching and responding cross-cluster tasks (#4329)
adbffa4 Notify queue processor about cross cluster tasks (#4328)
23eb8be Improve is cross cluster task check (#4326)
af30753 Update admin CLI commands for cross-cluster queue (#4321)
58e8e1c Add cross cluster target task executor (#4317)
8d319e2 Refactor task executor interface (#4300)
de9a086 Add cross cluster queue processor (#4269)
4384e4c Target cluster cross cluster task processor (#4292)
6964885 Improve definition for cross-cluster related endpoints (#4294)
eead0e5 Add client and handler implementation for cross-cluster task APIs (#4286)
622b13b Add internal types and mappers for cross cluster related APIs (#4285)
GRPC
4b95ec8 Fallback to zero value for non-present parent execution fields (#4617)
dbe538e Switch canary to gRPC (#4570)
59c8f0e gRPC for cross DC traffic (#4390)
5328cba Expose frontend gRPC port on docker containers (#4312)
Auth
8b8d8d8 add workflow type to signal with start auth
eacf42f add enable service auth logging key
d1a3c11 extend permission attributes for service auth
35f588f Add authorizer protection for AdminAPI
9a46d9d Feature cont.: authorize CLI as admin with private (#4338)
37706b2 Update OAuth implementation to use domainCache to authorize (#4333)
0085b7a CLI sending authorized request (#4327)
989e35c Add Permissions to Attributes and reading Public/Private key from disk instead of reading it from yaml file (#4320)
9f5d412 Implement OAuth Authorizer (#4306)
deed482 add enable service auth key (#4299)
51be820 fill tasklist in auth attributes for poll APIs (#4296)
70f3f58 add tasklist to auth attributes (#4288)
Bug Fixes
4808e65 Fix NPE in GCP archival (#4626)
49df671 Handle error case in response info middleware (#4609)
efb7b08 Fix get replication task read level update issue (#4607)
144d694 Fix NPE when replicating child started event (#4591)
0a1337c Fix cherry-pick for docker config template from #4585
1affb65 Fix mysql insecure hostname verify didn't work (#4569)
3fd8001 Fix docker template and canary batcher workflow (#4585)
364b2a1 Fix and improve canary thrift config and docs (#4580)
f744a6f Fix record child completion error handling (#4515)
3cb214a Fix access control admin handler initialization (#4500)
0398bf6 Fix timer resurrection check (#4499)
39f45eb Fix startTime in workflow task refresher (#4488)
dfca8e1 Fix nil pointer dereference issue in matching (#4481)
5dd7eb7 Fix workflow refresh for closed workflows (#4472)
76573a2 Fix domain updating via grpc (#4418)
f52498a Fix admin workflow re-replicate command (#4325)
abe2284 Fix ndc reset workflow replication bug (#4376)
a58b8b9 Fix oauth yaml config (#4360)
8ea6a6a Fix CLI jwtKey npe issue (#4358)
cd9a33a Fix deadlock in transfer queue (#4337)
fbc79f9 Fix CLI admin domain bug that didn't load Cassandra plugin
bec009a Fix missing activity failure details in standby cluster (#4323)
a1b9679 Fix typo in docker config template which causes docker image corruption (#4310)
6a00f35 Fix a racy read in test (#4291)
Improvements
d53b1fb Support DNS SRV Records within Ringpop (#4614)
2c3a8f3 Change frontend drain time
76653c1 Limited retry for normal decision scheduleToStart timeout (#4567)
dbae130 Improve archival history mutated error logs and add option to allow archiving incomplete history
4048370 URL encode postgres credentials (#4550)
41e9b53 Add console as logging encoding type (#4549)
0332c59 Log WorkflowID, RunID, domainName when a workflow times out or gets terminated (#4548)
df0c4bf Change canary back to start both worker and starter by default (#4587)
39b1970 Update cadence go SDK for building canary in 0.23.x (#4586)
dcfe3f6 skip error on creating domains for canary (#4584)
f53bec1 Add documentation to canary and improvements (#4447)
e052190 Add size checker when replication messages return (#4521)
16aed76 Added response info middleware (#4522)
a6a1793 Adding debug metrics in domain callback (#4484)
e602b8c Set limit on range queries to prevent bad queries causing degradation (#4458)
6b9184c Refactor config methods for internal use (#4448)
d58d346 Add feature flag on emitting signal name metric tag (#4434)
db77377 Refactor test for internal integration tests (#4437)
d67fb41 Revert ratelimiting behavior for frontend worker related APIs (#4435)
0b98055 Rewrite/improve basic load test (#4399)
cde0f41 Dynamic replication batch size (#4301)
efb9f90 Long poll completion buffer to prevent timeouts (#4425)
45c7b4c Improve/simplify archival config validation (#4366)
fb10abe Automatically adjust task priority and redispatch interval based on attempts (#4378)
6dec5aa Disable basic(db) visibility sampling by default (#4407)
76ec20a Emit logs with workflow execution tag for timedout frontend requests (#4379)
170deed Try detecting timer and activity resurrection (#4375)
7110f05 add decision result count check (#4402)
8ad444b Add context metric tags for admin handler (#4404)
c6ef3c9 Refactor ClusterMetadata defaults and validation (#4385)
185988a Add docker file/config/compose for bench and canary
3255b7c Failover metrics scope improve (#4391)
4781a8d Expose invalid timer value in the error message (#4380)
f621c7c Fill currentCluster RpcAddress with publicClient as default
58ae905 Optimization for start child workflow task (#4315)
71e730f Allow removing replica from domain replication group (#4346)
b32af80 Enable batch job feature by default and update dynamic config docs (#4343)
70bc150 Parallelize GetWorkflowExecution SQL calls (#4339)
28e0489 Server and CLI use version from release and versionChecking constant and commit revision (#4308)
ffbfdb7 Remove unused replicationConsumer related config (#4324)
66f2f26 Refactor Cassandra test utility for NoSQL support
7db7654 Fix MaximumSignalsPerExecution documentation default value
c63aa78 Add persistence error logs to queue manager (#4318)
deb0caf Update Mutable State to reduce unnecessary update to DB (#4304)
572582c Refactoring Cassandra workflow persistence manager for NoSQL support-Part 3
c185ad8 Deep merge config files (#4165)
d91e86f Enforce context timeout for retry policies in execution context and cache (#4303)
a24af63 Refactoring Cassandra workflow persistence manager for NoSQL support-Part 2
17663af Add domain tag for skip task metrics and logs (#4293)
28bb116 Remove tasklist kind from tasklist id (#4295)
94b2405 Implement new matching and frontend API to get all tasklists for a domain (#4175)
ff0046f Rename cassandra files to nosqlStores
3cc8c31 Allow skipping optional tests for optional methods in plugins (#4287)
Misc.
2618998 Update docker files for 0.23.1 release
01f0939 Update CHANGELOG.md (#4405)
44392dc Update community links for Discussion
7420786 Improve contributing and dev process (#4347)
6f989a3 Cleanup lint warning (#4309)
24cd8fa Clean up linting warnings (#4290)
7e88e6e Ignore bench and canary test coverage (#4297)
v0.22.4 Patch Release
Schema/configuration change
None.
Release commits
Bug fix
Improvement
Credits
Thank you @lindleywhite for the contribution!