[flytepropeller][flyteadmin] Streaming Decks V2 #6053

Open
wants to merge 10 commits into
base: master

Conversation

Future-Outlier
Member

@Future-Outlier Future-Outlier commented Nov 27, 2024

Tracking issue

#5574

Why are the changes needed?

To enhance user visibility into Flyte Decks at different stages of workflow execution (running, failing, and succeeding), enabling better debugging and analysis.

What changes were proposed in this pull request?

Concept:

  1. Propeller converts the node info into a NodeExecutionEvent and sends it to admin.

nev, err := ToNodeExecutionEvent(
    nCtx.NodeExecutionMetadata().GetNodeExecutionID(),
    p,
    nCtx.InputReader().GetInputPath().String(),
    nCtx.NodeStatus(),
    nCtx.ExecutionContext().GetEventVersion(),
    nCtx.ExecutionContext().GetParentInfo(),
    nCtx.Node(),
    c.clusterID,
    nCtx.NodeStateReader().GetDynamicNodeState().Phase,
    c.eventConfig,
    targetEntity)

Life Cycle:

use new flytekit > 1.14.0

summary:

  1. No HEAD request is issued, which saves resources.
  2. The config in the task template determines whether the deck is enabled.

details:

  1. While the task is running, propeller keeps adding the DeckURI if FLYTE_ENABLE_DECK=true is set in the task template.
  2. Propeller puts the DeckURI into the node info and sends it to flyte admin as a NodeExecutionEvent.
  3. Flyte admin adds the DeckURI to the Closure.
  4. Flyte console gets the DeckURI by sending a request to admin.
            nativeURL = node.GetClosure().GetDeckUri()
        }
    } else {
        return nil, errors.NewFlyteAdminErrorf(codes.InvalidArgument, "unsupported source [%v]", reflect.TypeOf(req.GetSource()))
    }
    if len(nativeURL) == 0 {
        return nil, errors.NewFlyteAdminErrorf(codes.Internal, "no deckUrl found for request [%+v]", req)
    }
    ref := storage.DataReference(nativeURL)
    meta, err := s.dataStore.Head(ctx, ref)
    if err != nil {
        return nil, errors.NewFlyteAdminErrorf(codes.Internal, "failed to head object before signing url. Error: %v", err)
    }
  5. If flyte console cannot get the DeckURI from the node Closure, it does not show the Flyte Deck button.
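
The steps above can be sketched as a small Python model. This only illustrates the data flow described in this PR and is not the actual propeller, admin, or console code: `NodeEvent`, `Closure`, `propeller_build_event`, `admin_handle_event`, and `console_shows_deck_button` are hypothetical names, the phase strings are placeholders, and the URI is a made-up example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeEvent:
    """Hypothetical stand-in for the NodeExecutionEvent propeller sends."""
    phase: str
    deck_uri: Optional[str] = None


@dataclass
class Closure:
    """Hypothetical stand-in for the node execution Closure kept by admin."""
    phase: str = "UNDEFINED"
    deck_uri: Optional[str] = None


def propeller_build_event(phase: str, deck_enabled: bool, deck_path: str) -> NodeEvent:
    # New-flytekit path: the task template says whether decks are enabled,
    # so the URI is attached without checking that the object exists.
    return NodeEvent(phase=phase, deck_uri=deck_path if deck_enabled else None)


def admin_handle_event(closure: Closure, event: NodeEvent) -> Closure:
    # Admin copies the URI carried by the event into the Closure.
    return Closure(phase=event.phase, deck_uri=event.deck_uri)


def console_shows_deck_button(closure: Closure) -> bool:
    # Console only renders the button when the Closure carries a URI.
    return bool(closure.deck_uri)


closure = Closure()
for phase in ("RUNNING", "FAILED"):
    event = propeller_build_event(phase, deck_enabled=True, deck_path="s3://bucket/node/deck.html")
    closure = admin_handle_event(closure, event)
    print(phase, console_shows_deck_button(closure))
```

In this model the button is available while the task is still running and stays available after a failure, which is the visibility this PR aims for.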

old flytekit <= 1.14.0

summary:

  1. We keep backward compatibility (the deck is shown when the task succeeds).

details:

  1. In the terminal state, propeller uses a HEAD request to check whether the Deck URI exists.
    If it exists, propeller puts it into the node info.
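
The difference between the two paths can be sketched as a single decision function. This is a minimal sketch under stated assumptions, not propeller's implementation: `FakeStore` and `deck_uri_for_event` are hypothetical, the phase names and the set of terminal phases are placeholders, and treating a missing `FLYTE_ENABLE_DECK` key as "old flytekit" is an assumption made for illustration.

```python
from typing import Dict, Iterable, Optional

# Assumption: which phases count as terminal is illustrative only.
TERMINAL_PHASES = {"SUCCEEDED", "FAILED"}


class FakeStore:
    """In-memory stand-in for the blob store that counts HEAD calls."""

    def __init__(self, objects: Iterable[str]):
        self.objects = set(objects)
        self.head_calls = 0

    def head(self, path: str) -> bool:
        self.head_calls += 1
        return path in self.objects


def deck_uri_for_event(
    template_config: Dict[str, str], phase: str, deck_path: str, store: FakeStore
) -> Optional[str]:
    flag = template_config.get("FLYTE_ENABLE_DECK")
    if flag is not None:
        # New flytekit: trust the task template and never issue a HEAD request.
        return deck_path if flag.lower() == "true" else None
    # Old flytekit (assumed: flag absent): check existence only in a terminal phase.
    if phase in TERMINAL_PHASES and store.head(deck_path):
        return deck_path
    return None


path = "s3://bucket/node/deck.html"
store = FakeStore({path})
print(deck_uri_for_event({"FLYTE_ENABLE_DECK": "true"}, "RUNNING", path, store), store.head_calls)
print(deck_uri_for_event({}, "RUNNING", path, store), store.head_calls)
print(deck_uri_for_event({}, "SUCCEEDED", path, store), store.head_calls)
```

Only the last call touches the store, so the HEAD cost is paid once per task on the old path and never on the new one.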

How was this patch tested?

  1. Unit tests and remote execution.

python code:

from flytekit import ImageSpec, task, workflow
from flytekit.deck import Deck

flytekit_hash = "6b55930d0a77efc3594ebaac056f2c75024e61b5"
flytekit = f"git+https://github.com/flyteorg/flytekit.git@{flytekit_hash}"

# Define custom image for the task
custom_image = ImageSpec(
    packages=[flytekit],
    apt_packages=["git"],
    registry="localhost:30000",
    env={"FLYTE_SDK_LOGGING_LEVEL": 10},
)

@task(enable_deck=False, container_image=custom_image)
def t_no_deck():
    # Deck.publish()
    print("No Deck")

@task(enable_deck=True, container_image=custom_image)
def t_deck():
    """
    The 1st deck only shows the timeline deck
    2nd will show
    """
    import time

    for i in range(3):
        Deck.publish()
        time.sleep(1)

@task(enable_deck=True, container_image=custom_image)
def t_fail_deck():
    import time

    for i in range(3):
        Deck.publish()
        time.sleep(3)
    time.sleep(10)
    raise ValueError("Failed Deck")

@workflow
def wf():
    t_no_deck()
    t_deck()
    t_fail_deck()

if __name__ == "__main__":
    from flytekit.clis.sdk_in_container import pyflyte
    from click.testing import CliRunner
    import os

    runner = CliRunner()
    path = os.path.realpath(__file__)

    result = runner.invoke(pyflyte.main,
                           ["run", path, "t_no_deck"])
    print("Local Execution: ", result.output)

    result = runner.invoke(pyflyte.main,
                           ["run", "--remote", path,"wf"])
    print("Remote Execution: ", result.output)

Setup process

single binary.

flyte: this branch
flytekit: flyteorg/flytekit#2779
flyteconsole: flyteorg/flyteconsole#890

Screenshots

flytekit branch:
flyteorg/flytekit#2779

New flytekit: no deck, running with deck, succeeded, and failed

OSS-STREAMING-DECK-small.mov

Old flytekit: no deck, running with deck, succeeded, and failed

OSS-STREAMING-DECK-OLD-FLYTEKIT-small.mov

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Follow-up questions

  1. Should we support the Abort phase for the streaming deck?

Should we support EPhaseAbort in this file?

https://github.com/flyteorg/flyte/blob/b3330ba4430538f91ae9fc7d868a29a2e96db8bd/flytepropeller/pkg/controller/nodes/handler/transition_info.go

  2. How can we support the auto-refresh UX?

Future-Outlier and others added 2 commits November 27, 2024 23:36
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Co-authored-by: Yi Cheng <[email protected]>
Co-authored-by: pingsutw  <[email protected]>

codecov bot commented Nov 27, 2024

Codecov Report

Attention: Patch coverage is 31.81818% with 60 lines in your changes missing coverage. Please review.

Project coverage is 36.97%. Comparing base (ab04192) to head (4068043).
Report is 30 commits behind head on master.

Files with missing lines Patch % Lines
...lytepropeller/pkg/controller/nodes/task/handler.go 31.03% 51 Missing and 9 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6053      +/-   ##
==========================================
- Coverage   37.08%   36.97%   -0.11%     
==========================================
  Files        1318     1318              
  Lines      132284   132511     +227     
==========================================
- Hits        49062    49001      -61     
- Misses      78950    79250     +300     
+ Partials     4272     4260      -12     
Flag Coverage Δ
unittests-datacatalog 51.58% <ø> (ø)
unittests-flyteadmin 54.06% <100.00%> (-0.04%) ⬇️
unittests-flytecopilot 30.99% <ø> (ø)
unittests-flytectl 62.29% <ø> (-0.05%) ⬇️
unittests-flyteidl 7.23% <ø> (-0.01%) ⬇️
unittests-flyteplugins 53.85% <ø> (+0.11%) ⬆️
unittests-flytepropeller 42.55% <31.03%> (-0.09%) ⬇️
unittests-flytestdlib 55.18% <ø> (-2.36%) ⬇️

Signed-off-by: Future-Outlier <[email protected]>
switch pluginTrns.pInfo.Phase() {
case pluginCore.PhaseSuccess:
    // This is to prevent the console from potentially checking the deck URI that does not exist if in final phase(PhaseSuccess).
    err = pluginTrns.RemoveNonexistentDeckURI(ctx, tCtx)
Contributor

Does this do a head call on the deck URI for every task that succeeds? Two thoughts here:
(1) does the flyteadmin merge algorithm then remove the deckURI from the execution metadata?
(2) this incurs a 20-30ms performance degradation on every task execution

Member Author

Will take a look tomorrow, thank you!

Member Author

@Future-Outlier Future-Outlier Nov 28, 2024

Does this do a head call on the deck URI for every task that succeeds?

Yes, it does a HEAD call through RemoteFileOutputReader:

func (r RemoteFileOutputReader) DeckExists(ctx context.Context) (bool, error) {
    md, err := r.store.Head(ctx, r.outPath.GetDeckPath())
    if err != nil {
        return false, err
    }
    return md.Exists(), nil
}

Member Author

How did you measure the performance degradation?
Did you use Grafana or another performance tool?
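
Short of a dashboard, one rough way to put a number on the HEAD overhead is to time the call directly. A minimal sketch, assuming a hypothetical `fake_head` callable that stands in for the store's HEAD request; the 1 ms sleep is a placeholder and says nothing about the latency of any real blob store.

```python
import statistics
import time
from typing import Callable, List


def time_calls(fn: Callable[[], object], repeats: int = 20) -> List[float]:
    """Return wall-clock durations in milliseconds for `repeats` calls to fn."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations


def fake_head() -> bool:
    # Placeholder for a real HEAD request against the blob store.
    time.sleep(0.001)
    return True


samples = time_calls(fake_head)
print(f"median={statistics.median(samples):.2f}ms max={max(samples):.2f}ms")
```

Swapping `fake_head` for a call against the real store would give a per-call figure to compare with the 20-30ms estimate.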

Member Author

does the flyteadmin merge algorithm then remove the deckURI from the execution metadata?

flyteadmin sets the deckURI in the execution metadata to nil if propeller removes it.
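
That behavior can be pictured as a last-write-wins update of the stored value. A minimal sketch under that assumption; `merge_deck_uri` and the dict-based closure are hypothetical and do not mirror flyteadmin's actual merge code.

```python
from typing import Dict, Optional


def merge_deck_uri(closure: Dict[str, Optional[str]], event_deck_uri: Optional[str]) -> Dict[str, Optional[str]]:
    """Return a copy of the closure whose deck_uri mirrors the latest event."""
    updated = dict(closure)
    # An empty or missing URI in the event clears the stored value.
    updated["deck_uri"] = event_deck_uri or None
    return updated


closure = {"deck_uri": None}
closure = merge_deck_uri(closure, "s3://bucket/node/deck.html")  # running event carries the URI
closure = merge_deck_uri(closure, "")  # final event after propeller removed the URI
print(closure["deck_uri"])  # None
```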

@Future-Outlier
Member Author

Future-Outlier commented Nov 27, 2024

How to test it?

  1. Start a new sandbox:
flytectl demo start --image futureoutlier/sandbox:deck-1205-1138 --force
  2. Check out the streaming deck flytekit branch:
cd flytekit
gh pr checkout 2779
  3. Run a failure task (show the deck after it fails):
from flytekit import ImageSpec, task, workflow
from flytekit.deck import Deck

flytekit_hash = "473ae1119af6f86c26c0790dee0affa3eb29be64"
flytekit = f"git+https://github.com/flyteorg/flytekit.git@{flytekit_hash}"

# Define custom image for the task
custom_image = ImageSpec(
    packages=[flytekit],
    apt_packages=["git"],
    registry="localhost:30000",
    env={"FLYTE_SDK_LOGGING_LEVEL": 10},
)

@task(enable_deck=True, container_image=custom_image)
def t_deck():
    """
    The 1st deck only shows the timeline deck
    2nd will show
    """
    import time

    for i in range(5):
        Deck.publish()
        # # raise Exception("This is an exception")
        time.sleep(3)

@workflow
def wf():
    t_deck()

if __name__ == "__main__":
    from flytekit.clis.sdk_in_container import pyflyte
    from click.testing import CliRunner
    import os

    runner = CliRunner()
    path = os.path.realpath(__file__)

    # result = runner.invoke(pyflyte.main,
    #                        ["run", path, "wf"])
    # print("Local Execution: ", result.output)

    result = runner.invoke(pyflyte.main,
                           ["run", "--remote", path,"wf"])
    # "--remote"
    print("Remote Execution: ", result.output)

@EngHabu
Contributor

EngHabu commented Nov 28, 2024

Mind adding screenshots for the rendered deck and refresh to the PR description?

@Future-Outlier
Member Author

Mind adding screenshots for the rendered deck and refresh to the PR description?

Yes, no problem.

@Future-Outlier
Member Author

Mind adding screenshots for the rendered deck and refresh to the PR description?

It's provided!
#6053 (comment)
