Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[YouTube] Refactor player clients, add support for poTokens, extract visitor data from the service and more #1272

Merged

Conversation

AudricV
Copy link
Member

@AudricV AudricV commented Feb 1, 2025

This PR, which supersedes #1247, adds support for poTokens, refactor clients used and extract visitor data from the service itself, as generated visitor data do not work anymore to get valid player responses (rolling out in progress as of writing the first version of this PR description).

It also replaces age-restricted videos streams' fetching with the web embed player, which allows to play a very few age-restricted videos anonymously. This is the best we can do with what we can use (unless some method is still present and we haven't discovered it, but I have a lot of doubts).

TODO:

  • describe in a more detailed way changes in this PR?
  • add Javadocs
  • update mocks
  • add a poToken implementation for tests (in this PR?)

Fixes TeamNewPipe/NewPipe#11980.
Fixes TeamNewPipe/NewPipe#11803.
Fixes TeamNewPipe/NewPipe#11486.
Fixes partially TeamNewPipe/NewPipe#11382.

Comment on lines +105 to +106
private static PoTokenProvider poTokenProvider;
private static boolean fetchIosClient;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry for a sort of a drive-by comment, but is it possible to not make these static? For example, have them in YoutubeService and pass them along when getting the stream extractor? It's trivial to make them static downstream if needed, and in the typical usage, where the YoutubeService is already static, they will also be effectively static with something like ServiceList.YouTube.setPoTokenProvider(...).

My use case is that running the stream extractor is an expensive operation taking multiple seconds, so if I need to run multiple at the same time, I do it from multiple threads using different service and extractor instances.
It would be nice to not have to worry about synchronization here, reduce global state, and have the ability to have extractors with different providers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have static fields in various places of the extractor, so this issue is not limited to this PR. Please open a pull request to fix this across the whole extractor ;-).

My use case is that running the stream extractor is an expensive operation taking multiple seconds, so if I need to run multiple at the same time, I do it from multiple threads using different service and extractor instances.
It would be nice to not have to worry about synchronization here, reduce global state, and have the ability to have extractors with different providers.

I am not sure how your setup works, but can't you first set the static fields and then start multiple threads? Then if poTokenProvider is thread-safe (NewPipe's one is) everything that uses it should also be thread safe. Or am I missing something?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have static fields in various places of the extractor, so this issue is not limited to this PR. Please open a pull request to fix this across the whole extractor ;-)

Sure, I'll do that :), I thought it was fine already for some reason. Gotta at least fix it for my own sanity.
However, this is part of a public facing api, and I'd like not to break people unnecessarily that will start to depend on this.

I am not sure how your setup works, but can't you first set the static fields and then start multiple threads? Then if poTokenProvider is thread-safe (NewPipe's one is) everything that uses it should also be thread safe. Or am I missing something?

True, it's not impossible to adapt, what I'm saying is that with this change I'm forced to make it thread-safe and I'm forced to have only one provider for the whole program. It seems unnecessary, I would like to have other options. Like having extractors with different providers or different settings could conceivably be useful in case some of them don't work.
A library probably shouldn't force these decisions onto the users. Besides, the users can always make a static provider if they wish.
If I'm reading it correctly, NewPipe does exactly that: makes its own static provider.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you help me understand your use-case more? Why would you need more than one provider for all of your extractors?

In case your PoTokenProvider isn't thread-safe, you could just have a field, and use a synchronized block for the unsafe activity. Would it be more helpful to have documentation somewhere for this?

Copy link

@kitterion kitterion Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it reasonable to want to switch fields like fetchIosClient or return values of PoTokenProvider methods at runtime? I feel like it could be because youtube requirements are constantly changing and it'd be nice to have multiple configurations available and switch between them at runtime if one doesn't work, but maybe I'm wrong.

If it's not reasonable, then one time initialization of read-only statics is perfectly fine. If it is, then for example there's currently no thread safe way to change the value of fetchIosClient without synchronizing calls to the extractor.

There's another issue that I spotted but I'm not sure if it's relevant: I don't see how to switch what the PoTokenProvider methods return in a consistent manner and without a race condition. Specifically, PoTokenProvider currently requires implementations to essentially return 4 values (some of them potentially null). There's no way to consistently switch them to another set of 4 values in a multi-threaded environment. The reason is that onFetchPage simply calls methods in order, and the values can change between calls even if I synchronize everything in a PoTokenProvider on my end.
If this is not an issue, still having to synchronize everything is not great. It's error prone, requires you to think about what you're doing.

I avoid having to think about this because in my use case threads don't really know about other threads, they use thread-local instances. That way there's really no synchronization, data races, inconsistencies, etc., it's very convenient. Though I'm told this is not currently safe and will need some fixing.
(My actual use case is a bit more cursed involved, java is not the main language, I call it through jni, even the threads are created outside of the jvm and then attached to it. So having to juggle multiple languages with threads in the mix, I really try to minimize the possibility of things that can go wrong.)

So if I were to summarize my approach here, it's: why worry about any of this if you can just avoid it altogether and write code as if it's single threaded.
It eliminates a whole class of errors, I'm a big fan.

(I really wrote a whole wall of text, huh)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kitterion Thanks for your feedback and your use cases.

To make it short, I completely agree that such objects should not be static, but as the extractor doesn't have a way to pass options to extraction requests, that's really hard to do properly. I'd like to add this ability in an extractor rewrite from scratch in Kotlin on which I was working a long time ago for an new extractor API concept. I hope I will be able to advance my work on this and post my ideas as class diagrams.

I don't think we will change anything right now.

Don't worry about breaking public APIs, all of these methods are not stable, this PR introduces breaking changes 😆

Comment on lines +906 to +907
@Nonnull final String videoId,
@Nullable final PoTokenProvider poTokenProviderInstance,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we simply do a null-check rather than having 2 fields for this? (in all the fetch methods)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was to avoid recomputing the field nullability multiple times (also in a previous version when I was working on this, the fetchHtml5Client method content was directly in the onFetchPage method if I remember properly), but maybe this should be avoided.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure passing one more parameter around is slower than redoing the null check. Actually, I would guess that if (o != null) is as fast as if (b), since both perform a memory load and then use the zero flag to determine whether to jump or not (assuming null is represented as 0).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s 100% irrelevant anyway, because the extractor is network IO bound

* @return a {@link PoTokenResult} specific to the WEB InnerTube client
*/
@Nullable
PoTokenResult getWebClientPoToken(String videoId);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we want to create a potoken with some other data other than the videoId?

https://github.com/LuanRT/BgUtils/blob/main/examples/node/innertube-challenge-fetcher-example.ts
Over here, for example, the pot url parameter is based on a 'session' PoToken based on the visitorData, while the the one sent to the player request is based on the videoId?

In that case, why does PoTokenResult need to have a visitorData?

Is having 2 different PoTokens the behavior on YouTube, or is what this PR does correct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is having 2 different PoTokens the behavior on YouTube, or is what this PR does correct?

YouTube desktop website uses indeed two poTokens: one minted with a visitorData, one with a videoId.

The visitorData must be the same for player requests and for the visitorData poToken, used in streaming URLs.

On embeds, that's currently not the case: it uses the old behavior of sending a visitorData poToken for player and streaming URLs requests.

As the extractor doesn't support login with a Google account, you only need to create poTokens from two properties: a visitorData and a videoId (for logged-in users, as written in BgUtils repo, it uses account's dataSyncId).

So maybe we should generate the visitorData at NPE and then pass it to the interface, which will be used by implementations. This also prevents clients to reuse the same visitorData poToken, which is good for privacy but not really for speed and resources consumption.

Also update client version and device info and add TVHTML5 client and
WEB_EMBEDDED_PLAYER constants, these will be used in the future.

TVHTML5_SIMPLY_EMBED_CLIENT_VERSION has been kept in its place as
the corresponding client does not work anonymously anymore, so it will
be removed soon.
This allows their internal usage in an upcoming new class to be used on
other InnerTube clients than the WEB one.
This internal class, YoutubeStreamHelper, has to be public in order to
be used by subpackages of the service's root one.

It supports poTokens, HTML5 signatureTimestamp property, embed context
and multiple InnerTube clients. It is meant to replace the
corresponding methods in YoutubeParsingHelper, in order to reduce the
class' size, code duplicates and improve its readability.

It adds a new way to get age-restricted videos' streams, the only ones
which are playable in YouTube embeds, which is not very common.
Also improve detection of age-restricted statuses in playability error
messages returned by the service and provide version 7 of DASH
manifests.
- Use POST requests with the same body as official clients do;
- Update methods checking the client streaming URLs come from:
  - Replace TVHTML5_SIMPLY_EMBEDDED_PLAYER by TVHTML5;
  - Add WEB_EMBEDDED_PLAYER.
This new class, InnertubeClientRequestInfo, composed of two mutable
subclasses, ClientInfo and DeviceInfo, allows to store client and
device info in a better way, without requiring to pass more than 10
method parameters like in YoutubeStreamHelper currently.

Mutability has been added in order to allow changing some fields
easily, especially visitorData.
As YouTube is disabling ability to use a random visitor ID in a
visitorData on player requests and BotGuard challenges, it
shouldn't matter if we use a random one or not for other
request types.
Some clients like TVHTML5 are not allowed on the visitor_id endpoint
(with this client, a 400 HTTP response is returned with a precondition
check failed error).

Also disable pretty printing for these requests, like we do for others.
…some fixes

visitorData are get using InnertubeClientRequestInfo and
YoutubeParsingHelper.prepareJsonBuilder, which is replacing the
corresponding method in YoutubeStreamHelper, removed in this commit.

Also fix some bugs like JsonBuilder usages in some places and remove
HLS manifest filtering for the iOS client, as we still use for now
the full player response, in the case having streams requiring poTokens
after some time as a last resort is useful.
This argument, which has been forgot, is required to get valid
streaming URLs with this client.
This commits provides methods to get InnertubeClientRequestInfo
instances, which can be used by extractor clients to get visitor data
to pass to PoTokenProvider implementations using YoutubeParsingHelper.

Ability to create custom instances has been removed, but returned
objects can be modified. This is what YoutubeStreamHelper now uses to
set the visitorData property.
Also make YoutubeParsingHelper.getOriginReferrerHeaders public, in
order to be used by other extractor classes and improve the name of a
parameter of YoutubeParsingHelper.getVisitorDataFromInnertube.
…creators

Also use TVHTML5 user agent for requests from this client in these
DASH manifests creators.
@Stypox Stypox force-pushed the yt_clients_changes_and_potokens_support branch from 15e35a2 to 96911ae Compare February 5, 2025 08:57
@Stypox
Copy link
Member

Stypox commented Feb 5, 2025

I am merging this as it does work well (many users tested it in NewPipe) and we need to make a NewPipe release asap. I will publish the extractor release as a pre-release. Please create PRs to fix/improve/finish this PR and then we can do a proper release.

@Stypox Stypox merged commit fe168ab into TeamNewPipe:dev Feb 5, 2025
3 of 4 checks passed
@AudricV AudricV deleted the yt_clients_changes_and_potokens_support branch February 5, 2025 09:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ASAP Issue needs to be fixed as soon as possible bug Issue is related to a bug enhancement New feature or request youtube service, https://www.youtube.com/
Projects
None yet
5 participants