This repository has been archived by the owner on Nov 18, 2024. It is now read-only.

rfc: add leader-based journal #280

Closed
wants to merge 9 commits into from

Contributor

@w41ter w41ter commented Jan 10, 2022

The rendered version: leader based journal

@w41ter
Contributor Author

w41ter commented Jan 10, 2022

cc @huachaohuang

@w41ter w41ter marked this pull request as draft January 10, 2022 13:11
@zojw
Contributor

zojw commented Jan 10, 2022

One more question to confirm (storage seems to have the same issue):

Say we have 2 engine nodes (L1 and F2) and we add 2 more engine nodes (F3, F4) to the cluster. How do F3 and F4 know whether they need to replicate L1's data?

Or where would be a suitable place to maintain metadata like default-tablet-group -> Leader L1, Follower F2, F4, Nothing F3? (P.S. F3 may become an active follower in another tablet group in the future.)

The Orchestrator could give a full list of L1, F2, F3, F4, but it doesn't know that F3 has no need to replicate...

@w41ter
Contributor Author

w41ter commented Jan 11, 2022

@zojw For now we only need to implement a single-writer journal, which means that only one engine is the writer and the others are readers. But once partitions are added to the engine, we should add a new node to manage the distribution of partitions.

@huachaohuang huachaohuang mentioned this pull request Jan 12, 2022
This was referenced Jan 12, 2022
@tisonkun
Contributor

@w41ter-l @huachaohuang I noticed that we're implementing the API designed here. If the RFC is accepted, please update the status and pull request links, start a tracking issue for the work (added in the RFC) and merge this PR.

Otherwise, the RFC process is fake.

Comment on lines 3 to 4
- Status: draft
- Pull Request:
Contributor

Suggested change
- Status: draft
- Pull Request:
- Status: accepted
- Pull Request: https://github.com/engula/engula/pull/280
- Tracking Issue: <please create one of this>


@w41ter w41ter marked this pull request as ready for review January 13, 2022 06:14

#### Follower read

A leader will broadcast the committed sequence of events to all journal servers, and those events become visible for reading. But there is a gap between an event becoming committed on the leader and that event becoming readable on a journal server. So if a follower wants to read events consistently, it should ask the leader for the latest committed sequence and wait until it receives those events.
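The wait described in this paragraph can be sketched as follows. This is only an illustration, not the RFC's implementation: `FollowerView`, `on_broadcast`, and `wait_readable` are hypothetical names, and a blocking `Condvar` stands in for whatever async mechanism a real journal client would use.

```rust
use std::sync::{Condvar, Mutex};

/// Hypothetical follower-side view of one stream: it tracks the highest
/// sequence number whose events have already been received locally.
pub struct FollowerView {
    received: Mutex<u64>,
    advanced: Condvar,
}

impl FollowerView {
    pub fn new() -> Self {
        FollowerView {
            received: Mutex::new(0),
            advanced: Condvar::new(),
        }
    }

    /// Called when a broadcast delivers events up to `sequence`.
    /// Older or duplicate broadcasts are ignored.
    pub fn on_broadcast(&self, sequence: u64) {
        let mut received = self.received.lock().unwrap();
        if sequence > *received {
            *received = sequence;
            self.advanced.notify_all();
        }
    }

    /// Blocks until every event up to `leader_committed` (the value the
    /// follower obtained by asking the leader) is readable locally.
    /// Returns the local sequence, which may be higher than requested.
    pub fn wait_readable(&self, leader_committed: u64) -> u64 {
        let mut received = self.received.lock().unwrap();
        while *received < leader_committed {
            received = self.advanced.wait(received).unwrap();
        }
        *received
    }
}
```

A follower would first ask the leader for its latest committed sequence, then call `wait_readable` with that value before serving the read.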
Contributor

Who is expected to perform follower read?

If it happens only during recovery, duplicate followers seem like overkill.

Contributor Author

An engine follower must read the stream for the duration of its life to keep track of the leader's state. And no follower read is performed during recovery, because recovery may execute on the leader.

Contributor

Thanks for your explanation :)

@huachaohuang
Owner

I think this RFC proposes two things: a new SingleWriteJournal API and an implementation of the API. I suggest that we split the RFC into two: one RFC describes the semantics of SingleWriteJournal (I prefer LeaderBasedJournal, though), and another describes an implementation of the new API.

Follower,
}

pub trait SingleWriteJournal : Journal {
Contributor

Here we have two focuses:

  1. For persisting data into the journal server cluster, the proposal uses a quorum-based algorithm as its replication policy.
  2. For supporting high availability via a standby engine (journal client), the proposal elects a leader among all engines.

If so, I suggest that we separate these two focuses and resolve them one after the other.

  1. QuorumBasedJournal: a client-side implementation that deals with persisting data into the journal server cluster.
  2. ...
    • LunaEngine (or its internal JournalWriter, concretely) has a leader election mechanism for determining which instance is the exclusive writer of the current shard, with a monotonically increasing token. When a new leader is elected, it writes its token to the journal server cluster (quorum).
    • FencedJournal: a server-side journal implementation that accepts a monotonically increasing token. It only serves requests configured with the current token and only accepts monotonically increasing token updates.
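To make the fencing idea concrete, here is a minimal single-node sketch. Only the name `FencedJournal` comes from the comment above; `FenceError`, `update_token`, `append`, and the in-memory `Vec` storage are assumptions made for illustration, and quorum replication is left out entirely.

```rust
/// Error returned by the hypothetical fenced journal below.
#[derive(Debug, PartialEq)]
pub enum FenceError {
    /// The request's token was rejected; `current` is the token the
    /// server has accepted so far, if any.
    TokenRejected { current: Option<u64> },
}

/// Server-side journal that only serves the writer holding the newest token.
pub struct FencedJournal {
    current_token: Option<u64>,
    events: Vec<Vec<u8>>,
}

impl FencedJournal {
    pub fn new() -> Self {
        FencedJournal {
            current_token: None,
            events: Vec::new(),
        }
    }

    /// A newly elected leader announces its token. Only a token strictly
    /// larger than the current one is accepted (an assumption of this sketch).
    pub fn update_token(&mut self, token: u64) -> Result<(), FenceError> {
        if let Some(current) = self.current_token {
            if token <= current {
                return Err(FenceError::TokenRejected {
                    current: self.current_token,
                });
            }
        }
        self.current_token = Some(token);
        Ok(())
    }

    /// Appends an event on behalf of the writer holding `token` and
    /// returns the index of the stored event.
    pub fn append(&mut self, token: u64, event: Vec<u8>) -> Result<usize, FenceError> {
        if self.current_token != Some(token) {
            return Err(FenceError::TokenRejected {
                current: self.current_token,
            });
        }
        self.events.push(event);
        Ok(self.events.len() - 1)
    }
}
```

Once a new leader has announced token 2, a delayed `append` from the old leader carrying token 1 is rejected, which is what keeps the old leader from writing after it has been replaced.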

Owner

This seems a bit over-designed, at least for use cases similar to the luna engine. The semantics of LeaderBasedJournal are easier to use, and the previous design in this RFC naturally maps to the API. It is OK to decouple the internal implementation into reusable components like QuorumBasedJournal + FencedJournal, but that depends on the implementation.

@w41ter
Contributor Author

w41ter commented Jan 14, 2022

I have divided this RFC into two parts. This part mainly focuses on the design of the leader-based journal; the design of the shared journal will be submitted in a new RFC later.

cc @huachaohuang @tisonkun @zojw

@@ -0,0 +1 @@
[SVG diagram, markup omitted. Boxes: Master (dashed border), Orchestrator, three Journal Server boxes, one additional Journal Server box with a dashed border, Journal Client (L), Journal Client (F). Edge labels: provision, de-provision, Pull, Write (from Journal Client (L) to each of the three Journal Servers), Read (from Journal Servers to Journal Client (F)).]
Owner

This file is unused in this document.

pub trait LeaderBasedJournal : Journal {
type Role;
type Peer;
type StateStream: Stream<Item = RoleState>;
Owner

I think we need some API like wait_next here so that the caller can wait for the next event?

Contributor

As long as it's a stream of state/event, an extension like next() can achieve this out-of-the-box.

@w41ter-l Is "state" just "event" or something different?

Contributor Author

@tisonkun I've added some details, which may answer your question.

Contributor Author

@huachaohuang It seems observe_state is enough to wait for the next event. Are there details that I haven't considered?


fn state(&self, name: &str) -> (Self::Role, Option<Self::Peer>);

async fn observe_state(&self, name: &str) -> Self::StateStream;
Owner

Maybe watch_state is better?

Contributor Author

There doesn't seem to be a difference between the two. Is there something I haven't noticed?

docs/rfcs/20220110-leader-based-journal.md (outdated, resolved)
@huachaohuang huachaohuang changed the title from "rfc: add signle write journal" to "rfc: add leader-based journal" Jan 15, 2022
tisonkun previously approved these changes Jan 16, 2022
Contributor

@tisonkun tisonkun left a comment

LGTM. Nits inline.

Comment on lines 50 to 58
The `LeaderBasedJournal` doesn't affect the semantics of `Journal`, so `Journal::open_stream_writer` can be called even when the caller isn't the leader of the stream. Of course, the implementation should guarantee that `StreamWriter::append` and other modifying operations return `Error::NotLeader` if the caller isn't the stream leader.

The `LeaderBasedJournal` advances the election automatically, so the engine doesn't have to be aware of it. However, the engine must start that automatic process explicitly, because a journal might contain many streams, and monitoring the election of every stream could exceed hardware limits. As a result, only the streams that the engine is interested in are watched.

When the engine calls `LeaderBasedJournal::observe_state`, the `LeaderBasedJournal` starts monitoring and subscribing to election state transitions. It returns a `Stream` that yields an item whenever the election state changes. We use epochs to track state changes: time is divided into epochs of arbitrary length, and the `LeaderBasedJournal` must ensure that each epoch has only one leader.

When a leader engine crashes, another machine's `LeaderBasedJournal` instance is elected as the new leader and begins to recover, eventually providing service.

We can't ensure that the state returned by the `observe_state` or `state` methods is always fresh in a distributed system, but any write operation will detect a stale state. As a result, every decision based on the observed state should be confirmed by a write operation, which checks for freshness, before it is relied on.
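The interplay between state observation and `NotLeader` can be illustrated with a small synchronous sketch. Everything here is hypothetical: `StreamHandle`, `EpochState`, `transition`, and `JournalError` (standing in for the RFC's `Error`) are illustrative names, a `std::sync::mpsc` channel replaces the async `Stream`, and the local handle plays the role that the journal servers would play in detecting a stale leader.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Role {
    Leader,
    Follower,
}

/// Hypothetical snapshot of the election state for one epoch.
#[derive(Clone, Debug, PartialEq)]
pub struct EpochState {
    pub epoch: u64,
    pub role: Role,
    pub leader: Option<String>,
}

/// Stand-in for the RFC's `Error` type.
#[derive(Debug, PartialEq)]
pub enum JournalError {
    NotLeader,
}

/// Hypothetical local handle for a single stream.
pub struct StreamHandle {
    state: EpochState,
    watchers: Vec<Sender<EpochState>>,
    events: Vec<Vec<u8>>,
}

impl StreamHandle {
    pub fn new() -> Self {
        StreamHandle {
            state: EpochState { epoch: 0, role: Role::Follower, leader: None },
            watchers: Vec::new(),
            events: Vec::new(),
        }
    }

    /// Synchronous analogue of `observe_state`: the receiver gets every
    /// state change that happens after this call.
    pub fn observe_state(&mut self) -> Receiver<EpochState> {
        let (tx, rx) = channel();
        self.watchers.push(tx);
        rx
    }

    /// Applies an election result. Epochs must strictly increase, so each
    /// epoch is associated with exactly one state.
    pub fn transition(&mut self, next: EpochState) {
        assert!(next.epoch > self.state.epoch, "epochs must increase");
        self.state = next.clone();
        // Notify watchers and drop those whose receiver has gone away.
        self.watchers.retain(|tx| tx.send(next.clone()).is_ok());
    }

    /// Modifying operations fail with `NotLeader` unless this side is the
    /// leader in the current epoch. Returns the number of stored events.
    pub fn append(&mut self, event: Vec<u8>) -> Result<u64, JournalError> {
        if self.state.role != Role::Leader {
            return Err(JournalError::NotLeader);
        }
        self.events.push(event);
        Ok(self.events.len() as u64)
    }
}
```

In this sketch an engine that acted on an old `Leader` observation still gets `NotLeader` from `append` once a newer epoch has demoted it, so the write itself serves as the freshness check.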
Contributor

You can discuss the semantics of the APIs in sections instead of paragraphs, which separates the focuses better.

Also, for significant procedures, such as reading/appending events, leader state changes, and recovery, you'd better include pseudocode of the procedure or a diagram. For example, in #287 I can see a new concept, Phase, that participates in recovery and leader state changes. I guess it's a significant part of this RFC?

Contributor Author

Thanks for your advice. I have separated the paragraphs into sections.

I think the new concept is an implementation detail, so I will introduce it in a follow-up RFC.


## Summary

In this RFC, we present a trait `LeaderBasedJournal`, which divides the users of `Journal` into two roles: a leader, who can write, and followers, who only have read permission. At the same time, this trait provides a means of observing role transitions.
Contributor

According to the following API, a role is bound to a stream of the journal. Do you intend to elect a leader per stream (i.e., at stream granularity)?

Here you write that the leader and followers are bound to a journal, though.

Contributor Author

Thanks, I will fix it.


fn state(&self, name: &str) -> Result<Box<dyn EpochState>>;

async fn observe_state(&self, name: &str) -> Result<Self::StateStream>;
Contributor

How do we handle reconnection after a stream is returned?

Is it the caller's duty to retry, or is the retry logic handled inside the returned stream?

Contributor Author

The retry logic should be handled inside the returned stream.


The `LeaderBasedJournal` advances the election automatically, so the engine doesn't have to be aware of it. However, the engine must start that automatic process explicitly, because a journal might contain many streams, and monitoring the election of every stream could exceed hardware limits. As a result, only the streams that the engine is interested in are watched.

When the engine calls `LeaderBasedJournal::observe_state`, the `LeaderBasedJournal` starts monitoring and subscribing to election state transitions. It returns a `Stream` that yields an item whenever the election state changes. We use epochs to track state changes: time is divided into epochs of arbitrary length, and the `LeaderBasedJournal` must ensure that each epoch has only one leader.
Contributor

So, does an engine that has called observe_state("stream1") have a chance to become a follower or leader of stream1? And can another engine that is not interested in stream1 avoid being elected as a follower?

Contributor Author

Yes, only an engine that has called observe_state("stream") can become a follower or leader of that stream.

fn role(&self) -> Role;

// The leader of the associated stream.
fn leader(&self) -> Option<String>;
Contributor

What's the content of the String? Does it need to be a logical server ID or something else?

Contributor Author

I have no idea which is better to fill in, so I just left it as a String. It can be changed once we find and define a proper struct.

@w41ter w41ter marked this pull request as draft January 18, 2022 10:46
@huachaohuang
Owner

Closed in favor of #361. Please help to improve the stream store design in #361.

@w41ter w41ter deleted the rfc_single_write_journal branch July 19, 2022 13:18