Action-Response Cycle bottlenecks in interactive music apps #97

Open
anssiko opened this issue Sep 3, 2020 · 1 comment
Labels
Developer's Perspective (Machine Learning Experiences on the Web: A Developer's Perspective), Discussion topic (Topic discussed at the workshop), User's Perspective (Machine Learning Experiences on the Web: A User's Perspective)

Comments

anssiko commented Sep 3, 2020

The Interactive ML - Powered Music Applications on the Web talk by @teropa explains that a key design consideration in musical instrument apps is the latency between user input (e.g. a key press on an instrument, a video input) and musical output, as illustrated by the Action-Response Cycle:

User Input > Create Input Tensor > Upload to GPU > Run Inference > Download from GPU > Process Output Tensors > Musical Output

This cycle must execute within ~0-20 ms for the experience to feel natural.
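
To see which stage of the cycle consumes the budget, each step can be timed individually. The following is a minimal sketch, not code from the talk: `timeCycle`, `Stage` and `CycleReport` are hypothetical names, the 20 ms default is the upper bound quoted above, and the clock is injectable so the sketch runs outside a browser (in a page, `performance.now()` would give finer resolution than `Date.now()`).

```typescript
// Hypothetical helper: times each stage of one action-response cycle
// and checks the total against a latency budget (assumed 20 ms).
type Stage = { name: string; run: () => void };

interface StageTiming {
  name: string;
  ms: number;
}

interface CycleReport {
  timings: StageTiming[];
  totalMs: number;
  withinBudget: boolean;
}

function timeCycle(
  stages: Stage[],
  budgetMs: number = 20,
  now: () => number = () => Date.now()
): CycleReport {
  const timings: StageTiming[] = [];
  const start = now();
  for (const stage of stages) {
    const t0 = now();
    stage.run(); // e.g. "Create Input Tensor", "Run Inference", ...
    timings.push({ name: stage.name, ms: now() - t0 });
  }
  const totalMs = now() - start;
  return { timings, totalMs, withinBudget: totalMs <= budgetMs };
}
```

Passing one `Stage` per step of the cycle (with the real work in `run`) yields a per-stage breakdown that can be compared against the budget.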

The talk describes real-time audio as a very constrained capability on the web platform today:

[...] you have this task of generating 48,000 audio samples per second per channel consistently without fault. Because if you fail to do that you have an audible glitch in your outputs. So it's a very hard constraint, and it has to be deterministic because of this reason.
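
To put the quoted constraint into numbers, here is a small worked sketch. It assumes the 128-frame render quantum that the Web Audio API uses for audio processing callbacks; that figure is not stated in the talk excerpt, and `quantumDeadlineMs` and `quantaPerSecond` are hypothetical helper names.

```typescript
// Assumption: audio is rendered in blocks ("render quanta") of 128 frames,
// as in the Web Audio API. The sample rate of 48,000 Hz is from the quote.
const RENDER_QUANTUM_FRAMES = 128;

// Time available to produce one block before the output glitches.
function quantumDeadlineMs(
  sampleRate: number,
  frames: number = RENDER_QUANTUM_FRAMES
): number {
  return (frames / sampleRate) * 1000;
}

// Number of blocks that must be delivered every second, without fail.
function quantaPerSecond(
  sampleRate: number,
  frames: number = RENDER_QUANTUM_FRAMES
): number {
  return sampleRate / frames;
}
```

Under these assumptions, at 48,000 Hz each block has a deadline of roughly 2.67 ms and 375 blocks are due per second, so any inference running on the audio thread has to fit well inside that window.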

A particularly demanding task is generating actual audio data in the browser with ML (as opposed to generating symbolic music data with ML). The following proposals were mentioned for consideration, as they may help lower the latency in this scenario:

  • Inference running in WebAssembly on the CPU on the audio thread
  • WebNN in Worklets
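
As a rough illustration of the first proposal, the sketch below shows the shape of a per-block processing callback that calls a model synchronously. It is an assumption-laden sketch, not an existing API: `BlockModel` and `InferenceBlock` are hypothetical names, and the model is a plain function standing in for, say, a WebAssembly export. In a browser, the class would extend `AudioWorkletProcessor` and be registered with `registerProcessor`; it is written as a plain class here so that it runs anywhere.

```typescript
// Hypothetical model interface: a synchronous call that reads one block
// and writes one block, ideally without allocating (to avoid GC pauses).
type BlockModel = (input: Float32Array, output: Float32Array) => void;

class InferenceBlock {
  private readonly model: BlockModel;

  constructor(model: BlockModel) {
    this.model = model;
  }

  // Mirrors the shape of AudioWorkletProcessor.process(inputs, outputs):
  // inputs[n][channel] holds one block of samples for that channel.
  process(inputs: Float32Array[][], outputs: Float32Array[][]): boolean {
    const input = inputs[0]?.[0];
    const output = outputs[0]?.[0];
    if (input && output) {
      // Inference runs on the audio thread, once per block.
      this.model(input, output);
    }
    return true; // keep the processor alive
  }
}
```

Only the first channel of the first input and output is handled, which keeps the sketch short; a real processor would loop over all channels.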

Another use case that involves video input (from webcam) and musical output has the following per-frame path:

Webcam MediaStream > Draw to Canvas > Build Pixel Tensor > Upload to GPU > Run Inference > Download from GPU > Process Output Tensors > Musical Output

Notably, the steps to get data into the model (Webcam MediaStream > Draw to Canvas > Build Pixel Tensor) take half of the time.
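
To make the cost of the "Build Pixel Tensor" step concrete, here is a minimal sketch of the per-frame work it involves. The layout and scaling are assumptions, not details from the talk: the function name `rgbaToFloatTensor` is hypothetical, the alpha channel is dropped, values are normalised to [0, 1], and the output is height x width x 3. In a browser, the `rgba` argument would typically be the `data` array of the `ImageData` returned by the canvas 2D context's `getImageData`.

```typescript
// Hypothetical helper: converts RGBA bytes (as read back from a canvas)
// into a Float32Array of RGB values in [0, 1], laid out as H x W x 3.
function rgbaToFloatTensor(
  rgba: Uint8ClampedArray,
  width: number,
  height: number
): Float32Array {
  if (rgba.length !== width * height * 4) {
    throw new Error("rgba length does not match width * height * 4");
  }
  const out = new Float32Array(width * height * 3);
  // Every pixel of every frame is touched in script; this loop is the
  // kind of per-frame work the talk identifies as slow.
  for (let p = 0, o = 0; p < rgba.length; p += 4) {
    out[o++] = rgba[p] / 255; // R
    out[o++] = rgba[p + 1] / 255; // G
    out[o++] = rgba[p + 2] / 255; // B (alpha at p + 3 is skipped)
  }
  return out;
}
```

For a 640 x 480 frame this loop visits 307,200 pixels per frame, on top of the canvas draw and readback that precede it.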

The canvas bottleneck (copying rendered video frames to a canvas element, processing pixels extracted from the canvas, and rendering the result to a canvas) was also identified as an inefficient path in the Media processing hooks for the Web talk by @tidoust.

This calls for APIs to provide better abstractions that allow feeding input data into ML models, @teropa concludes:

Could there be some APIs that give me abstractions to do this in a more direct way to get immediate input into my machine learning model, without having to do quite so much work and run quite so much slow code on each frame.

As a summary, the talk outlines the following areas as important:

  • Low and predictable latency
  • Not compromising CPU/GPU needed by the UI or Audio
  • Inference in AudioWorklet context - Wasm or native [WebNN]?
  • Media integration (e.g. fast streaming inputs from MediaStream)

This issue is to discuss the proposals that involve Web API surface improvements and other problematic aspects of real-time use cases that involve audio.

Looping in @padenot for AudioWorklet expertise as well as to reflect on the recent work on WebCodecs that might also help with these real-time audio use cases. Feel free to tag other folks who might be interested.

@anssiko anssiko added Developer's Perspective Machine Learning Experiences on the Web: A Developer's Perspective User's Perspective Machine Learning Experiences on the Web: A User's Perspective labels Sep 3, 2020

anssiko commented Sep 7, 2020

The Empowering Musicians and Artists using Machine Learning to Build Their Own Tools in the Browser talk by @Louismac also notes AudioWorklets as a partial solution for use cases that have strict action-response latency requirements, such as:

[...] connecting inputs from a variety of sources, running potentially computationally expensive feature extractors alongside lightweight machine learning models and generating audio and visual output, in real time, without interference.

@Louismac and @teropa, in your experience, are there known feature gaps in the core AudioWorklet API that make it suboptimal for your use cases? I understand that some of the known implementation issues in Chrome around AudioWorklet-related garbage collection have been addressed recently. In his talk, @teropa suggested looking into exposing inference capabilities in an AudioWorklet context, which has been noted as a possible future exploration.
