Add Super-xBR and NNEDI3 support #2427

Closed
wants to merge 4 commits into from

Conversation


bjin commented Oct 28, 2015

An initial version ready for code review

@haasn

This commit marks the image size variables as temporary, and renames them
in order to prevent any potential confusion in the future.
bjin changed the title from “Add Super-xBR and NNEDI3 shaders” to “Add Super-xBR and NNEDI3 support” on Oct 28, 2015

Akemi commented Oct 28, 2015

I have two remarks.

  1. The options "superxbr-sharpness" and "superxbr-edge-strength" are named but not described in the manual.
  2. Would it be feasible to implement an "auto" parameter for the option "prescale-passes"? To clarify: with "auto", it should run as many passes as needed to reach the first resolution greater than the target resolution.


bjin commented Oct 28, 2015

  1. the options "superxbr-sharpness" and "superxbr-edge-strength" were named but not described in the manual.

They are mentioned in the manual, but I don't think they deserve a detailed explanation.

  1. would it be feasible to implement a parameter "auto" for the option "prescale-passes". to clarify the auto parameter, it should run as many passes as needed to reach the first resolution greater than the target resolution.

There is a hard limit on the number of passes, but with the help of prescale-downscaling-threshold you can get this behavior by setting prescale-passes to the maximum value (currently 5). Once the intermediate resolution is greater than the display size, it will stop performing further passes automatically.


ghost commented Oct 28, 2015

I'm still very suspicious of the prescaling stuff. Are multiple passes really needed? Why are alpha/chroma apparently handled separately in some places? How do we know that awful-to-handle features like chroma positioning, non-mod-2 cropping, and rotation still work correctly in all cases?


ghost commented Oct 28, 2015

Oh, and some larger chunks could probably be moved into separate functions, and the prescale stuff could probably be in its own commit.


bjin commented Oct 28, 2015

@wm4 Before addressing individual review comments (probably later tonight), I would like to describe how the prescaling works in general.

The current pass_read_video() works as follows:

  1. merge texture1 and texture2, and resize the merged texture1 to WxH (the main texture size)
  2. merge texture1 into texture0.gb, and texture3 into texture0.a

With prescaling, it works like this:

  1. apply the upscaler on texture0 and store the result in texture4; it's 2Wx2H with an offset
  2. merge texture1 and texture2, and resize the merged texture1 to 2Wx2H, with the offset applied
  3. at this point, texture4 (the upscaled main texture) and texture1 (the merged chroma texture) are 2Wx2H with an offset, while texture0 (the main texture) and texture3 (the alpha texture) are still WxH without an offset
  4. for the main texture merging, we need to apply the offset to texture0 (only the non-essential planes, those not overwritten by texture4) and texture3 (the alpha plane); this is where bilinear scaling comes in

The offset will be corrected by the main scaler later.

Are multiple passes really needed?

4x upscaling (two passes) is actually very helpful for upscaling DVD content.

Why is alpha/chroma apparently handled separately in some places?

Chroma processing happens before the main texture merging, and a resizer is involved as well, so we need to calculate the offset carefully.

Alpha (and also the non-essential planes from texture0) is processed during the main texture merging, and bilinear scaling is used.

How do we know that awful-to-handle features like chroma positioning, non-mod-2 cropping, and rotation still work correctly in all cases?

Chroma positioning and non-mod-2 cropping are already handled by chromafix, I believe. Rotation is handled by compute_src_transform(); the transform there is aware of prescaling.

Oh, and some larger chunks could probably be moved into separate functions, and the prescale stuff could probably be in its own commit.

Acknowledged, will fix that by the next push.


bjin commented Oct 28, 2015

@wm4 pushed new commits and addressed most of the comments, except that I didn't make prescaling into a single function or commit. There is already a separate function for prescaling named pass_prescale(), and the code in pass_read_video() is just glue code to process the planes: backing up pass_tex, optionally performing debanding, and some context-switching code afterwards.

I should add that I have only tested the code with an nvidia card and driver (355.11), on Linux.

Also, I have a small concern about my use of the GL_UNIFORM_BUFFER macro: it's defined in OpenGL 3.1. Is it guaranteed to compile on all platforms we support?


ghost commented Oct 28, 2015

There is already a separate function for prescaling named pass_prescale(), and the code in pass_read_video() is just glue code to process the planes

You should try to add new functions, instead of letting existing functions grow to 100s of lines. (If they're already big, there's little justification to make them even bigger.)

Also, I have a small concern about my use of the GL_UNIFORM_BUFFER macro: it's defined in OpenGL 3.1. Is it guaranteed to compile on all platforms we support?

Generally yes.


bjin commented Oct 28, 2015

@wm4 pushed new commits; I moved the luma prescaling part into its own function, since it is fairly self-contained. But I had to leave the plane-merging part in pass_read_video(); I couldn't figure out a nice way to split it.


ghost commented Oct 28, 2015

OK, I do think this is a bit better. Since I have no idea what's going on anyway (the concept is simple; the interaction with all the code and possible configurations is not), I have nothing more to say, I guess.

@bjin: do you still have any further plans for this code, or do you consider this final?

@haasn: what is your opinion? Review would be appreciated too.


bjin commented Oct 28, 2015

@wm4: I have no plan to make major changes to these commits. They should already be big enough just for introducing new features.


ghost commented Oct 28, 2015

(Oh I see, a for loop is actually slightly awkward, because you want to compute and return the number of passes in the first place. Well, it doesn't matter.)


bjin commented Oct 29, 2015

Made some minor changes to the nnedi3 commit.


bjin commented Oct 30, 2015

Updated the commit message for nnedi3 with details that are too technical to be in the manual.


ghost commented Oct 30, 2015

ping @haasn


haasn commented Oct 31, 2015

Do we have any comparison images of this version of NNEDI3 against a reference implementation (eg. vapoursynth)?


bjin commented Oct 31, 2015

@haasn I would like to describe how the current code prescales non-YUV video on the main thread.

The background is that both Super-xBR and NNEDI3 are designed to handle luma only, though they will more or less work on chroma or RGB planes (and they actually do; for example, the original NNEDI3 filter also uses NNEDI3 to handle chroma, and madVR also does chroma upscaling with them).

The original Super-xBR implementation just hardcodes a color matrix and uses the luma value (calculated from RGBA) to find the weights, which it then applies to all channels. This is an ugly hack I didn't want to follow, so I removed all of that to make superxbr work on luma only.

So the problem is: how do we handle RGB (and XYZ) video? It's a difficult choice, and a lot of tradeoffs are involved. We could of course convert RGB back to YUV, upscale Y with the prescaler, upscale UV with dscale, and convert back to RGB later, but that's just too complicated with no actual benefit. We could also disallow prescaling on non-YUV video, but that's just not user friendly.

So the decision I made is to:

  1. prescale RGB (and other non-YUV content) by pretending all of its planes are luma planes.
  2. add a warning in the manual, suggesting that it might not be ideal to use prescalers on non-YUV video, and that the user can manually override this behavior by using a custom format video filter.

Superxbr becomes significantly slower for non-YUV content this way. But it's much faster on YUV, and more importantly, correct without ugly code. (Well, I don't suspect there will be any complaints about this either; superxbr is already very fast compared to NNEDI3, and I don't care much about its performance anyway.)


haasn commented Oct 31, 2015

So the problem is: how do we handle RGB (and XYZ) video? It's a difficult choice, and a lot of tradeoffs are involved. We could of course convert RGB back to YUV, upscale Y with the prescaler, upscale UV with dscale, and convert back to RGB later, but that's just too complicated with no actual benefit. We could also disallow prescaling on non-YUV video, but that's just not user friendly.

I would personally prefer to disable scaling non-YUV for super-xbr. A version that works directly (and correctly) on RGB exists, and in fact it would just require some abstraction on the underlying type (e.g. #define T (channels == 1 ? float : vec3)) plus a few more lines to handle the luma coefficients. We can add that sort of abstraction for multiple channels later (based on the code I already have), but for now I would just disable it.

For NNEDI3 I'm undecided. You can keep the current loop around for NNEDI3 simply for lack of a better alternative. It might make sense to add an option for it, since scaling RGB is a huge performance hit and some users might prefer not to take that hit on this type of content. (Although such content is rare enough to justify having those users create a non-NNEDI3 profile instead.)

  1. add a warning in the manual, suggesting that it might not be ideal to use prescalers on non-YUV video, and that the user can manually override this behavior by using a custom format video filter.

The problem is that the swscale format filter probably does a pretty bad job of converting to YUV. It would absolutely be only a hack, and never something intended for “real” viewing. (And certainly not one I would recommend users try in a man page.)


haasn commented Oct 31, 2015

Note that for scaling multiple channels there might be a way to reconcile correctness with speed: perform the heavy “edge detection”-style calculations on the luminance and store the result in a separate texture, then use only this texture plus more lightweight blending code on each channel.

That might be the best approach overall, as it's both correct and reasonably fast. You can even skip the separate texture when processing luma only (for YUV).

But I would be fine with taking the algorithm as-is right now without this change, which would be a significant redesign, and then adding it in a later commit.


ghost commented Oct 31, 2015

and in fact it would just require some abstraction on the underlying type (e.g. #define T (channels == 1 ? float : vec3)).

I didn't look at what exactly the shader code does here, but generally processing multiple channels shouldn't add much to the performance cost, if at all.


bjin commented Oct 31, 2015

I would personally prefer to disable scaling non-YUV for super-xbr.

Yes, that's an option. What I considered is that it's just very cheap to enable it for RGB video: just introduce an extra variable "prescaled_planes" and we are done.

It would absolutely only be a hack, and never something intended for “real” viewing. (And certainly not one I would recommend users to try, in a man page)

Will remove that part.

UPDATE: quoted wrong sentence

bjin added 3 commits October 31, 2015 21:57
Add the Super-xBR filter for image doubling, and the prescaling framework
to support it.

The shader code was ported from the MPDN extensions project, with
modifications to process luma only.

This commit is largely inspired by code from #2266, with
`gl_transform_trans()` authored by @haasn taken directly.
Escaping all question marks as well; they can be used to form
trigraph characters, which are effective even within string literals.
Implement NNEDI3, a neural network based deinterlacer.

The shader is reimplemented in GLSL and now supports both the 8x4 and
8x6 sampling windows. This allows the shader to be licensed
under LGPL2.1 so that it can be used in mpv.

The current implementation supports uploading the NN weights (up to
51kb with the placebo setting) in two different ways: via uniform
buffer object, or hard-coded into the shader source. UBO requires
OpenGL 3.1, which only guarantees 16kb per block. But I find that
64kb seems to be a default setting for recent cards/drivers (which
nnedi3 is targeting), so I think we're fine here (with the default
nnedi3 setting the size of the weights is 9kb). Hard-coding into the
shader requires OpenGL 3.3, for the "intBitsToFloat()" built-in
function. This is necessary to precisely represent these weights in
GLSL. I tried several human-readable floating point number formats
(with really high precision for single-precision float), but for some
reason they were not working nicely; bad pixels (with NaN values)
could be produced with some weight sets.

We could also add support for uploading these weights with a texture,
just for compatibility reasons (e.g. upscaling a still image with a
low-end graphics card). But as I tested, it's rather slow even with a
1D texture (we would probably have to use a 2D texture due to
dimension size limitations). Since there is always a better choice
for NNEDI3 upscaling of still images (the vapoursynth plugin), it's
not implemented in this commit. If this turns out to be in popular
demand from users, it should be easy to add later.

For those who want to optimize the performance a bit further, the
bottlenecks seem to be:
1. the overhead of uploading and accessing these weights (in
   particular, the shader code is regenerated for each frame, though
   that happens on the CPU);
2. "dot()" performance in the main loop;
3. "exp()" performance in the main loop; there are various fast
   implementations using bit tricks (probably with the help of the
   intBitsToFloat function).

The code was tested with an nvidia card and driver (355.11), on Linux.

Closes #2230

bjin commented Oct 31, 2015

Pushed new commits, with all comments addressed (plus several other changes to DOCS/).


bjin commented Oct 31, 2015

Do we have any comparison images of this version of NNEDI3 against a reference implementation (eg. vapoursynth)?

Super-xBR (GLSL)

[image: superxbr]

NNEDI3 (GLSL)

[image: nnedi3-glsl]

NNEDI3 (vapoursynth, taken from `Upscaling` Wiki)

[image: nnedi3-vs]


haasn commented Nov 1, 2015

That super-xBR image looks sort of funny. Can you compare it to the version in #2266 please?

Also, the NNEDI3 version looks different too for some reason, but this might be due to settings differences on your end (?). I'll do some testing myself as well.


haasn commented Nov 1, 2015

For comparison with the NNEDI3 vapoursynth version: that one was generated with the default settings, which are 8x6 window and 128 neurons.


haasn commented Nov 1, 2015

Here is a version of the super-xbr test with my commit. It looks significantly different to me, but that might be because I'm processing based on luma instead of the RGB planes individually. (More reason to do it the proper way.)


haasn commented Nov 1, 2015

The updated commit is basically okay; I just want some more confirmation that the result matches what it actually should be, since there still seem to be some deviations here and there.

Best to test on a black-and-white image, to make sure it's not due to the RGB thing.


bjin commented Nov 1, 2015

The source was produced with `convert rose: -grayscale Rec709Luma rose.png`

vapoursynth (neurons=32, window=8x6)

[image: nnedi3-vs]

vapoursynth (neurons=32, window=8x6, no prescreener)

[image: nnedi3-vs-nopscrn]

GLSL (neurons=32, window=8x6)

[image: nnedi3-glsl]

Looks like the difference is due to vapoursynth's use of a prescreener in its default settings (it's an optimization that uses another NN to predict which pixels can be interpolated with bicubic instead of the main NN).


haasn commented Nov 2, 2015

Okay, looks good. (for NNEDI3 at least)


ghost commented Nov 5, 2015

Merged, thanks.

@ghost ghost closed this Nov 5, 2015