
Experiment/dataset update flow #292

Merged: 18 commits into `master` on Dec 23, 2023
Conversation

sergiimk (Member)

No description provided.

@zaychenko-sergei force-pushed the experiment/dataset-update-flow branch 2 times, most recently from bbe8ed0 to 051713a on November 26, 2023 14:11
Resolved review threads (outdated):
- src/utils/event-bus/src/event_bus.rs (two threads)
- src/utils/event-sourcing/src/event_store.rs
- src/domain/core/src/entities/events.rs
@zaychenko-sergei force-pushed the experiment/dataset-update-flow branch from 9d447cb to 3338514 on November 27, 2023 20:35
Cargo.toml (resolved review thread, outdated)

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FlowStartConditionBatching {
    pub available_records: usize,
sergiimk (Member Author):

Keeping this field up to date would require a lot of writes to the event store. Imagine a device pushing data to Kafka every second: on every push we would have to update this state in the flow. Since this type is a "start condition", not the state of batching, I suggest we remove this field. If we really want to show the state of a batch, we can add a separate API to query it from Kafka (or elsewhere) when needed.
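A minimal sketch of what the struct might look like with the counter removed, as suggested above; the field name `min_records_to_await` is hypothetical, not from the PR:

```rust
// Hypothetical sketch: the start condition keeps only the configured
// threshold; live batch progress would be queried from Kafka (or another
// source) via a separate API when needed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FlowStartConditionBatching {
    /// Minimum number of new records to await before the flow starts
    pub min_records_to_await: u64,
}

fn main() {
    let cond = FlowStartConditionBatching {
        min_records_to_await: 25_000,
    };
    assert_eq!(cond.min_records_to_await, 25_000);
}
```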

zaychenko-sergei (Contributor):
This isn't used yet, so I removed it, but we will need some source of truth in this regard.
Following the IoT device example, we would still have to create a secondary trigger event every second and show each one in the History view.

Currently this trigger is an empty struct:

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FlowTriggerPush {
    // TODO: source (HTTP, MQTT, CMD, ...)
}

Similarly, we don't yet store the number of records added in the flow outcome, so we can't use it for derived dataset throttling:


#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FlowTriggerInputDatasetFlow {
    pub input_dataset_id: DatasetID,
    pub input_flow_type: DatasetFlowType,
    pub input_flow_id: FlowID,
}

Perhaps both of these structures should carry information about the offset change. The sum of those numbers might be the criterion we are looking for.
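A sketch of that idea, under the assumption that each trigger carries the offset interval of the data it brought (all names here are hypothetical); the batching criterion then sums the records added across accumulated triggers:

```rust
// Hypothetical: each trigger records the offset interval it contributed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct OffsetInterval {
    pub start: u64,
    pub end: u64, // inclusive
}

impl OffsetInterval {
    pub fn num_records(&self) -> u64 {
        self.end - self.start + 1
    }
}

/// Sum the records added across all accumulated trigger events.
fn records_accumulated(intervals: &[OffsetInterval]) -> u64 {
    intervals.iter().map(|i| i.num_records()).sum()
}

fn main() {
    let triggers = [
        OffsetInterval { start: 0, end: 999 },     // e.g. a push trigger
        OffsetInterval { start: 1000, end: 1499 }, // e.g. an input dataset flow trigger
    ];
    assert_eq!(records_accumulated(&triggers), 1500);
}
```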

sergiimk (Member Author):

Hm, I was imagining it differently:

  • The Kafka controller will know about the batching configuration
  • When the Kafka queue becomes non-empty, it will trigger the flow (to show users that a batch is pending)
  • While more records are flowing, it will NOT touch the flow system until the configured batch size is reached, at which point it finally triggers the flow

This way secondary trigger events will appear only for other types of triggers, e.g. "manual".

While the flow is already waiting for a batch, I don't think showing more data arriving as secondary triggers is useful, and it can result in too many updates.
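The bullets above can be sketched as a small state machine; the types and names are hypothetical (the real Kafka controller is not part of this PR):

```rust
// Hypothetical sketch of the controller behaviour described above.
#[derive(Debug, PartialEq)]
enum ControllerAction {
    TriggerFlow, // queue became non-empty: show users a batch is pending
    RunFlow,     // configured batch size reached: finally run the flow
    Silent,      // data still accumulating: don't touch the flow system
}

struct BatchingController {
    batch_size: u64,
    queued: u64,
}

impl BatchingController {
    fn on_records(&mut self, n: u64) -> ControllerAction {
        let was_empty = self.queued == 0;
        self.queued += n;
        if self.queued >= self.batch_size {
            self.queued = 0;
            ControllerAction::RunFlow
        } else if was_empty {
            ControllerAction::TriggerFlow
        } else {
            ControllerAction::Silent
        }
    }
}

fn main() {
    let mut c = BatchingController { batch_size: 3, queued: 0 };
    assert_eq!(c.on_records(1), ControllerAction::TriggerFlow);
    assert_eq!(c.on_records(1), ControllerAction::Silent);
    assert_eq!(c.on_records(1), ControllerAction::RunFlow);
}
```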

zaychenko-sergei (Contributor):

With this suggestion, without visibility into Kafka's queue state (say, showing something like "accumulated 17,245/25,000 records"), the user will not understand what the flow is waiting for.


#[derive(Debug, Clone, PartialEq, Eq)]
pub struct StartConditionConfiguration {
    pub throttling_period: Option<Duration>,
sergiimk (Member Author):

Is this throttling config related to:

  • "don't update free datasets more often than every 30 minutes"
  • or "after the first push creates a flow, keep batching data for 10 more minutes before running the ingest task"?

I feel like the former should be a system-wide setting, not something we store per flow.

If it's the latter, we should give it a different name, e.g. "batching period".
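The two interpretations could live in different places, as sketched below; both type and field names are hypothetical, illustrating the distinction rather than the PR's actual config:

```rust
use std::time::Duration;

/// Hypothetical system-wide setting:
/// "don't update free datasets more often than every 30 minutes".
pub struct SystemThrottlingConfig {
    pub min_dataset_update_interval: Duration,
}

/// Hypothetical per-flow setting:
/// "after the first push creates a flow, keep batching data for
/// this long before running the ingest task".
pub struct StartConditionConfiguration {
    pub batching_period: Option<Duration>,
}

fn main() {
    let system = SystemThrottlingConfig {
        min_dataset_update_interval: Duration::from_secs(30 * 60),
    };
    let per_flow = StartConditionConfiguration {
        batching_period: Some(Duration::from_secs(10 * 60)),
    };
    assert_eq!(system.min_dataset_update_interval.as_secs(), 1800);
    assert_eq!(per_flow.batching_period.unwrap().as_secs(), 600);
}
```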

async fn schedule_flow_task(&self, flow: &mut Flow) -> Result<(), InternalError> {
    let logical_plan = match &flow.flow_key {
        FlowKey::Dataset(flow_key) => match flow_key.flow_type {
            DatasetFlowType::Update => LogicalPlan::UpdateDataset(UpdateDataset {
sergiimk (Member Author):

Should we have something like Probe for the flow system too, so we could test it manually without spawning expensive tasks?

Perhaps a Probe flow could run Probe tasks on a certain dataset and spawn Probe flows for all downstream datasets, like a cascading update, but where the actual work is a no-op?
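A sketch of how such a cascading Probe might walk the dependency graph; the function and the plain adjacency map are hypothetical stand-ins (the real flow system would go through its dependency graph service):

```rust
use std::collections::{HashMap, VecDeque};

// Hypothetical: breadth-first cascade over downstream datasets, where the
// "work" at each node is a no-op Probe task instead of a real update.
fn probe_cascade(downstream: &HashMap<&str, Vec<&str>>, root: &str) -> Vec<String> {
    let mut visited: Vec<String> = Vec::new();
    let mut queue: VecDeque<&str> = VecDeque::from([root]);
    while let Some(ds) = queue.pop_front() {
        if visited.iter().any(|v| v == ds) {
            continue;
        }
        visited.push(ds.to_string()); // a real impl would spawn a no-op Probe task here
        if let Some(next) = downstream.get(ds) {
            queue.extend(next.iter().copied());
        }
    }
    visited
}

fn main() {
    let deps: HashMap<&str, Vec<&str>> = HashMap::from([
        ("root", vec!["a", "b"]),
        ("a", vec!["c"]),
    ]);
    assert_eq!(probe_cascade(&deps, "root"), vec!["root", "a", "b", "c"]);
}
```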

zaychenko-sergei (Contributor), Dec 11, 2023:

I would likely introduce something like this when testing.

zaychenko-sergei (Contributor):

There is no mechanism to pass any parameters to a flow yet. That would be necessary for the Probe idea to make sense.

@zaychenko-sergei force-pushed the experiment/dataset-update-flow branch from 56cfbd6 to 91fda61 on December 14, 2023 16:47
sergiimk (Member Author) left a comment:

Some minor suggestions, but overall I'm really happy with the state of this PR 👍

@zaychenko-sergei force-pushed the experiment/dataset-update-flow branch 2 times, most recently from 089747d to 5328f84 on December 20, 2023 09:58
DatasetUpdateFlow => UpdateSchedule.
Added Update aggregate, representing a single instance of update process for a given dataset.

Formatter fix

Review: renamings

Review: externalized time source for event-sourcing aggregates

Review: compacted task events in update flow

Review: added update cancellation (before tasks scheduled)

Review: accepted suggestions to name delay reasons as start conditions, and secondary triggers as simply adding triggers

Merge corrections

In-memory implementation of repositories

Drafted update scheduler service.
ES: support optional aggregate loads.

Sketched `UpdateService` without tasks scheduling yet

Scheduler steps:
 - schedule update task on manual trigger
 - read all active auto-schedules at the beginning of the run process
 - react on schedule change events: update table of active schedules

Separate set in each in-memory event repository: quick return of query objects

Drafted Time Wheel concept

Connected time wheel and update service: initial scheduling and run loop

Drafted in-memory dependency graph service (based on petgraph library).
Scheduling downstream datasets when dataset update completes, respecting throttling period logic

Enqueue next auto-polling root dataset update when current update succeeds

Prototyped EventBus + added 1st demo link for schedule modification event

Minor event structure fixes

Simplified UpdateSchedule events

Connected task finish and dataset removal events.
Large DI changes in existing tests to support EventBus dependency

Concurrent execution of event handlers in dispatcher

Converted event bus handlers to traits. Registering handlers in the catalog.

Merge corrections

Review: renamed DatasetDeleted event

Review: avoid excessive events cloning

Shifted down 'get_queries' to update schedule's event store only

Async event handler combiner now collects all handlers results, before reducing the error for reporting

Resolved basic code review notes

Added `get_last_update` by dataset operation

Formatter fixed

dill 0.8 - replaced `builder_for` on `Component::builder()`

Integrated `dill::interface` feature and removed many explicit binds

Review: renamed task event classes, enum for dataset events

Review: reworked relevance of update schedule, using statuses instead (active, paused, stopped)

Review: allow update schedules to be re-added after dataset reincarnation with the same ID

Review: a few TODOs on performance improvements

Review: reimplemented TimeWheel using binary heap

Review: removed `pause` and `resume` methods in `UpdateSchedule` aggregate, use `set_schedule` only

Renamed update schedules => update configurations

Separated schedules and start conditions in update configurations

Generalized dataset flow configurations

System vs Dataset flow configurations

Not very smart, but a model of System and Dataset flows.
Scheduling service largely not implemented yet.

Generic-based flow events, state

Refactored flow configurations aggregate to generic events/state similarly

Code reuse approach for flow/flow-config aggregates based on trait extensions

Attempts to generalize flow configuration services (at least traits).
Folder reorganization in interface and in-mem crate.

Implemented generic in-memory event stores and integrated them into all current aggregates

Implemented SystemFlow in-memory repository

Implemented in-memory Flow service for all kinds of flows

Decomposing Flow service: extracted ActiveConfigsState

Decomposing Flow service: extracted PendingFlowsState

Compacted DatasetFlow & SystemFlow into Flow

Similarly compacted FlowConfiguration aggregate

Simplifications in FlowService

Review: 'flow-system' and 'flow-system-inmem' are final names

Review: 'flow-system' and 'flow-system-inmem' are final names

Review: improved enum all-value iteration methods in flow types

Review: specific => of_type

Review: removed duplicate OwnedDatasetFlowKey

Review: killed redundant feature flags

Review: tracing without formatting

Moved `DependencyGraphService` to core domain

Review: removed redundant field

Experiment/dependencies graph (#364)

* Simplest startup job to initialize dependencies graph.
* Removed dependency query from DatasetRepository.
* Integrated dependencies graph into GraphQL queries for upstream/downstream links.
* Integrated dependencies graph into dataset deletion.
* Reacting on `DatasetCreated` events.
* Implemented reaction of dependencies graph on changes in dataset inputs
* Implemented lazy vs eager dependencies initialization

Merge corrections

Test fix

Review: minor renamings
@zaychenko-sergei force-pushed the experiment/dataset-update-flow branch from 41a924e to 08ea032 on December 22, 2023 10:11
Flow itself needs an Aborted outcome.
Explicit model of flow abortion.
More deterministic time propagation of flow config events.
… update on success.

Removed retry on failure logic in flow aggregate.
@zaychenko-sergei marked this pull request as ready for review on December 23, 2023 00:23
@zaychenko-sergei merged commit b5fa5e8 into master on Dec 23, 2023
@zaychenko-sergei deleted the experiment/dataset-update-flow branch on December 23, 2023 00:24