
How to deal with multiple valid test results #149

Open
TimoRoth opened this issue May 18, 2022 · 12 comments

TimoRoth commented May 18, 2022

Depending on the version of matplotlib, the way it was installed, the way other dependencies were installed and a bunch of other random factors, the results of our image comparison tests can vary slightly.
Enough to fall well outside the tolerance window, while still being perfectly valid results.

So we have a set of multiple valid baseline images for most of our tests, but don't really have a good way to actually test if any of them matches.
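
For context, our tests use the standard pytest-mpl marker, which compares one figure against one baseline image within an RMS tolerance. A minimal sketch (the tolerance value and directory name are just illustrative):

```python
import matplotlib.pyplot as plt
import pytest


# Stock pytest-mpl model: one baseline image per test, compared against the
# returned figure within a single RMS tolerance.
@pytest.mark.mpl_image_compare(baseline_dir="baseline_images", tolerance=10)
def test_raster():
    fig, ax = plt.subplots()
    ax.imshow([[0, 1], [2, 3]])
    return fig
```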

Right now, we are using our own fork of pytest-mpl, which does support testing against multiple baselines: https://github.com/OGGM/pytest-mpl
But maintaining that has been a consistent pain, and it keeps getting harder as upstream pytest-mpl evolves.

This must be a general issue, not something only we have. What is the proper way to deal with it?

Cadair (Contributor) commented May 18, 2022

Hi @TimoRoth

The way I use pytest-mpl, and the way I have seen other projects use baseline images, is to work hard to maintain a set of baseline images per constrained environment (versions of mpl etc.) and then dynamically select the path to the baseline images based on the versions installed at test time.

See this code in sunpy which selects the hash library: https://github.com/sunpy/sunpy/blob/main/sunpy/tests/helpers.py#L72, which is then used in a wrapper decorator: https://github.com/sunpy/sunpy/blob/main/sunpy/tests/helpers.py#L103-L105.

Or see Astropy, which uses baseline images rather than a hash library, where the path to the baseline images is similarly generated dynamically based on the matplotlib version: https://github.com/astropy/astropy/blob/main/astropy/tests/image_tests.py
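
A minimal sketch of that pattern (the version cut-off, directory names and decorator name are illustrative, not the actual sunpy/astropy code; packaging is used for the version comparison):

```python
import matplotlib
import pytest
from packaging.version import Version

# Pick a baseline directory matching the installed matplotlib version;
# the cut-off and directory names here are purely illustrative.
if Version(matplotlib.__version__) >= Version("3.5"):
    BASELINE_DIR = "baseline_images/mpl35"
else:
    BASELINE_DIR = "baseline_images/mpl34"


def figure_test(func):
    """Decorate a test so it is checked against the version-specific baselines."""
    return pytest.mark.mpl_image_compare(baseline_dir=BASELINE_DIR)(func)
```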

I have thought about ways we could make these kinds of workflows easier for users, but making them general always feels like you need a way to specify the baseline images or hash library for an arbitrarily complex set of dependencies (sometimes C libs like freetype). If you have any suggestions based on how your fork works then I would love to hear them.

TimoRoth (Author) commented:

The basic idea of our fork is that, instead of pointing at a single baseline image, we point at a baseline directory; as long as one image in that directory matches within the desired tolerance, the test passes.
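
Conceptually the check boils down to something like this (just a sketch, not the actual fork code; compare_images is matplotlib's own comparison helper):

```python
from pathlib import Path

from matplotlib.testing.compare import compare_images


def matches_any_baseline(result_png, baseline_dir, tol=2.0):
    """Pass if the result matches at least one image in the baseline directory."""
    for baseline in sorted(Path(baseline_dir).glob("*.png")):
        # compare_images returns None when the RMS difference is within `tol`.
        if compare_images(str(baseline), str(result_png), tol=tol) is None:
            return True
    return False
```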

You can see one of those baseline image collections here: https://github.com/OGGM/oggm-sample-data/tree/master/baseline_images/freetype_28/test_raster
If you go up one level, you can see all the other tests with their baseline directories.

We used to use the same method you described (hence the freetype version still being part of the path), but there are so many moving parts that can affect the output (matplotlib version, the libfreetype version matplotlib uses, proj version, gdal version, and probably more dependencies that can cause slight variations in results) that it has become absolutely unmaintainable. We would have to maintain several hundred sets of different baseline images, and would probably still be missing some.

Cadair (Contributor) commented May 20, 2022

I can see how that approach would be useful, especially when you get to a very large dependency tree.

I would be happy to accept a contribution with this feature; my only reservation is further complicating an already exceptionally flexible API. See the beginnings of a discussion about that in #154.

TimoRoth (Author) commented:

I'll need to rebase, or even rewrite, the patch; the introduction of the summary mechanism has complicated it quite a bit.
I'll send it over as a PR then (excluding some of our more "specific" additions, like the automatic imgur upload of failed test images).

bjlittle (Contributor) commented May 21, 2022

I'm curious to know how much #150 might alleviate this issue.

If it gets banked, then it certainly opens the door to hashing being a better alternative to the rms approach. That gives us a new alternative for how we might approach this problem, i.e., the problem can perhaps be pushed up into the hash library.

How might that work?

Consider only having one baseline image per test, generated by a single "golden" environment of your choice. Extend the hash library so that it supports a dict per test entry, containing the accepted phash for the test executed in each uniquely named environment (where the dict key is the env name). Note that these hashes may differ across all your environments for a test (most likely, given what you've said above), or all be the same, or be a mixture of the two.

When running pytest-mpl the user provides the unique name of the environment under test, say --mpl-env-name=py37-mpl22-macos. This allows the appropriate hash for a test being executed in the specific environment to be checked within the hash library.
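
For illustration, the hash library could then be structured roughly like this (the layout, environment names and truncated hash values are hypothetical, not an existing pytest-mpl format):

```python
# Hypothetical env-keyed hash library: one entry per test, one accepted
# perceptual hash per uniquely named environment.
HASH_LIBRARY = {
    "test_raster": {
        "py37-mpl22-macos": "aa1f0d...",    # placeholder hash values
        "py310-mpl35-linux": "ab1f0c...",
    }
}


def expected_hash(test_name, env_name):
    """Look up the accepted hash for a test in the environment given via --mpl-env-name."""
    return HASH_LIBRARY[test_name][env_name]
```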

This approach certainly reduces the bloat and problem of managing multiple baseline image collections per environment.

With respect to the baseline images, for a failed test the image difference contains the difference that tripped the failure, but for "non-golden" environments it also includes any acceptable difference (that you originally signed off on when creating the hash) against the "golden" environment image. This might be enough for you to ascertain the source of the image change and whether it is acceptable given the failure.

Just a thought... dunno if it has legs or not 🤔

TimoRoth (Author) commented:

The issue I see with that is still that it's A LOT of permutations.
And it's not always obvious which dependency is causing the change. Sometimes the same set of dependencies on a different version of Ubuntu causes a change in output.
So the env name would end up containing virtually all of our 20+ dependencies, some of their dependencies, the OS version and/or the versions of various OS libraries, and we'd probably still end up missing some.

In the vast majority of cases, the change isn't meaningful enough to violate the tolerance.
But sometimes, on certain tests, it does vary a bit too much.

Maintaining such a massive set of baselines would also be quite bothersome.

bjlittle (Contributor) commented:

@TimoRoth Understood.

Is it possible for you to point me to a repo containing your baseline images?

I'd love to take a peek at them, if you don't mind? No worries if that's not possible 👍

TimoRoth (Author) commented:

Already linked to them above, in my first reply: https://github.com/OGGM/oggm-sample-data/tree/master/baseline_images/freetype_28/test_raster

That's probably the clearest demonstration of the random differences that can occur. The other tests show similar differences; some also have numerical differences, though those are negligible.

bjlittle (Contributor) commented May 22, 2022

@TimoRoth I did some quick analysis on your 5 test images using 256-bit perceptual image hashes.

If you take test_raster_1.png as your baseline image and compare it in the perceptual image hash space with your other samples, then you get the following hamming distances, i.e., the number of differing bits:

test_raster_1.png - test_raster_2.png = 0
test_raster_1.png - test_raster_3.png = 2
test_raster_1.png - test_raster_4.png = 2
test_raster_1.png - test_raster_5.png = 22

This is telling me that images 1, 2, 3 and 4 are sufficiently similar 👍

However, with a hamming distance of 22, image 5 is quite different. On inspection this is indeed true, as the plot is larger overall, and the scale on the colorbar is different.
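
For reference, distances like these can be reproduced with the imagehash package; a quick sketch (assuming the five sample images are in the working directory):

```python
from PIL import Image
import imagehash


def phash256(path):
    # hash_size=16 gives a 16x16 = 256-bit perceptual hash.
    return imagehash.phash(Image.open(path), hash_size=16)


baseline = phash256("test_raster_1.png")
for n in range(2, 6):
    other = phash256(f"test_raster_{n}.png")
    # Subtracting two ImageHash objects yields the hamming distance in bits.
    print(f"test_raster_1.png - test_raster_{n}.png = {baseline - other}")
```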

Does this help confirm how perceptual image hashing might help?

TimoRoth (Author) commented:

It's primarily telling me that we should investigate using (perceptual?) hashes instead of a growing library of baseline images :D
But I'd probably like to have the same option there as with the baseline images: the ability to specify multiple valid baseline hashes.

bjlittle (Contributor) commented May 22, 2022

Okay, how about this as a compromise, given what I suggested in the above comment and what you'd like as part of your workflow (which I'm assuming is a pretty common pain point)...

Entirely drop the environment name concept. It's not necessary.

Also, instead of a dict per hash library entry, have a list. This flat list contains all the perceptual hash values that have been generated and registered as acceptable for that test across the multitude of environments/permutations that you care about. In fact, it's more like an ordered set, with no duplicate hash entries. I think that maintaining a historical timeline order might be helpful here, such that the first hash is the oldest and the last hash is the newest. We do something similar on SciTools/iris.

There are two important further enhancements necessary, and then I think we might have struck gold.

Firstly, when testing a result image hash against the entries in the list, because perceptual image hashes are comparable (yielding a hamming bit distance), we compare the result hash against the "most similar" accepted image hash. If the hamming distance is ≤ the hamming bit tolerance, then "bingo", the test passes; otherwise it fails. What's great about this approach is that we're minimising the hash list while also automatically testing a result against the most appropriate, and most likely accepted, "similar" baseline hash/image. It also avoids the need to support a matrix of hamming tolerances per test/environment; we simply support the global hamming tolerance or a customised tolerance via the mpl_image_compare marker kwarg option, as proposed in #150.
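
A sketch of that comparison step (the helper name and tolerance value are illustrative; it assumes imagehash-style hashes stored as hex strings):

```python
import imagehash


def hash_matches(result_hash, accepted_hashes, hamming_tolerance=2):
    """Pass if the closest accepted hash is within the hamming tolerance."""
    # accepted_hashes is the flat, ordered list registered for the test;
    # hex_to_hash rebuilds ImageHash objects from the stored hex strings.
    distances = [result_hash - imagehash.hex_to_hash(h) for h in accepted_hashes]
    return min(distances) <= hamming_tolerance
```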

Secondly, for the baseline images, rather than have a flat directory of ".png" files, we use the name of the test to create a directory associated with the test, and within that directory we have "<hash-value>.png" files, one for each actual image associated with a hash. Having separate sub-directories per test within the baseline directory avoids any hash value conflict issues, plus it makes the management of which images belong to which test intuitive.

So... given all of the above, I genuinely think that this might be what we need here.

It will certainly work for your test-raster example above, where one image (test_raster_5.png) for whatever reason is quite different, but acceptable given its runtime environment. So for your example it's only necessary to have two hashes and two images. Win 👍

Also, the other big win here is that image comparison summaries with result/diff/baseline images still works.

If there's enough agreement, and I've not missed anything obvious, then I'd be happy to implement this in a feature branch if/once #150 lands... And if you'd be willing to collaborate and test it with your workflow @TimoRoth (and anyone else that cares) and give feedback, then we can make an informed decision as to whether pytest-mpl should adopt this strategy.

Thoughts?

TimoRoth (Author) commented:

That sounds much more elaborate than anything I had thought of so far, but it'll definitely solve the issue in an elegant way that's hopefully useful for a lot more users.
Will gladly help test that with our set of tests.
