Motivation
The timeline UI view is marginally useful for debugging performance, but has a lot of room for improvement. Integrating the runtime metrics breakdown proposed in the performance observability RFC is a step in the right direction, partitioning node executions into a collection of categorized time-series. This representation helps explain the "what" but misses a lot of the "why". For example, if a particular execution has a large amount of frontend plugin overhead, this means that Flyte started the Task but the backend service has not yet reported that it has started. K8s gurus will be quick to identify that there may be scheduling contention, large image pull times, or a few other likely scenarios. However, this information is not easily available to the user even though FlytePropeller has it. We currently store a singular "reason" for the current execution status, but may be better off tracking a time-series of reasons to better explain the execution.
Proposal
This proposal outlines a solution for overlaying a collection of human-readable messages on the timeline view. The exact representation is VERY open for debate, but I envision something similar to Jaeger (time-series telemetry data with events), which uses a single tick mark that displays a message on hover. This supplies the "why", an explanation of the reported execution status, to complement the "what" in the runtime breakdown of the execution time-series. The goal is to balance utility with simplicity, displaying a "useful" number of messages to improve context.
Implementation
Currently, FlyteAdmin maintains a singular "reason" within the task execution metadata. It is updated in place on each event from FlytePropeller, so older "reasons" are not persisted. At the risk of oversimplifying, we will need to transition to maintaining a collection of "reasons" with associated timestamps. This will require updates in the following repositories:
- FlyteIDL: update TaskExecutionClosure to have repeated reasons with associated timestamps.
- FlyteAdmin: append to the "reason" list rather than overwriting the existing singular "reason".
- FlyteConsole: correctly parse the "reason" list to annotate the timeline UI view.
Open Questions
- How should this be visualized? I will leave this discussion for more UI / UX oriented personnel.
- Should we add this information to node executions / workflow executions? Currently the "reason" is only tracked for the task-level execution.
- Do we need to be able to send multiple reasons in a single task event?
  - It is currently possible to skip phases if execution progresses before FlytePropeller detects and processes the intermediate stage.
  - We could use event buffers to send multiple events instead, which is probably the better solution.
Discussed in #3429
Originally posted by hamersaw March 8, 2023