Rethink the testing mechanism for images #963

seisman · 2021-02-24T07:42:49Z

If you're unclear about how PyGMT tests images, please read the "Testing plots" section in the contributing guides first.

In short, for image-based tests, we need to specify a baseline/reference image. When we make any changes to the code, we can generate a new "test" image and compare it with the "baseline" image. If the two images differ, then we know the changes break the tests. The most important thing is to ensure that the "baseline" images are correct.
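The comparison step described above can be sketched in a few lines. This is only a toy illustration, assuming images are flat lists of pixel values and using a root-mean-square difference with a tolerance; the function names `rms_difference` and `images_match` are hypothetical and are not part of PyGMT or pytest-mpl.

```python
import math


def rms_difference(baseline, test):
    """Root-mean-square difference between two equally sized pixel sequences."""
    if len(baseline) != len(test):
        raise ValueError("images must have the same number of pixels")
    if not baseline:
        return 0.0  # two empty images are trivially identical
    squared = sum((b - t) ** 2 for b, t in zip(baseline, test))
    return math.sqrt(squared / len(baseline))


def images_match(baseline, test, tolerance=0.0):
    """Return True when the test image is within `tolerance` of the baseline."""
    return rms_difference(baseline, test) <= tolerance


# Hypothetical 4-pixel grayscale "images".
baseline = [0, 128, 255, 64]
unchanged = [0, 128, 255, 64]
changed = [0, 128, 250, 64]  # one pixel differs after a code change

print(images_match(baseline, unchanged))  # the test passes
print(images_match(baseline, changed))    # the test fails
```

The sketch only says whether two images differ. It says nothing about whether the baseline itself is correct, which is the concern raised above.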

Currently, we have two different methods to generate the "baseline" image and compare them:

using the decorator @pytest.mark.mpl_image_compare
using the decorator @check_figures_equal()

The @pytest.mark.mpl_image_compare method is the most straightforward way to do image testing. Using the decorator, we need to generate baseline images, check their correctness, and store them in the repository (https://github.com/GenericMappingTools/pygmt/tree/master/pygmt/tests/baseline).

Pros:

We can visually check the baseline images to make sure they are correct

Cons:

Have to store the static PNG images in the repository. The repository size grows quickly.

To avoid storing many large static images in the repository, we (mainly @weiji14 and @seisman) had some discussions (in #451, #522) and developed the @check_figures_equal decorator (#555, #590, #600).

Below is an example test using the @check_figures_equal() decorator:

pygmt/pygmt/tests/test_basemap.py

Lines 67 to 77 in e057927

    
    @check_figures_equal()
    def test_basemap_polar():
        """
        Create a polar basemap plot.
        """
        fig_ref, fig_test = Figure(), Figure()
        # Use single-character arguments for the reference image
        fig_ref.basemap(R="0/360/0/1000", J="P6i", B="afg")
        fig_test.basemap(region=[0, 360, 0, 1000], projection="P6i", frame="afg")
        return fig_ref, fig_test

In this example, the baseline/reference image fig_ref is generated using basemap(R="0/360/0/1000", J="P6i", B="afg"), while the test image fig_test is generated using basemap(region=[0, 360, 0, 1000], projection="P6i", frame="afg"). We can't see what the baseline image looks like, but we're reasonably confident that it is correct, because the basemap wrapper is very simple.

Pros:

Don't need to store static images in the repository, thus keep the repository size small

Cons:

For each test, we have to generate two images (baseline and test images), which doubles the execution time
We can't visually check the correctness of the baseline images
If we decide to disallow single-character parameters (i.e., J="X10c/10c" is no longer accepted) as proposed in Disallow single character arguments #262 (also related to Fail for invalid input arguments #256), then most of the code for generating reference images will become invalid.

For some complicated wrappers, we can't even easily tell whether the reference image is correct. For example,

pygmt/pygmt/tests/test_subplot.py

Lines 30 to 42 in e057927

    
    @check_figures_equal()
    def test_subplot_direct():
        """
        Plot map elements to subplot directly using the panel parameter.
        """
        fig_ref, fig_test = Figure(), Figure()
        with fig_ref.subplot(nrows=2, ncols=1, Fs="3c/3c"):
            fig_ref.basemap(region=[0, 3, 0, 3], frame="af", panel=0)
            fig_ref.basemap(region=[0, 3, 0, 3], frame="af", panel=1)
        with fig_test.subplot(nrows=2, ncols=1, subsize=("3c", "3c")):
            fig_test.basemap(region=[0, 3, 0, 3], frame="af", panel=[0, 0])
            fig_test.basemap(region=[0, 3, 0, 3], frame="af", panel=[1, 0])
        return fig_ref, fig_test

In this test, we expect that the baseline image has a 2-row-by-1-column subplot layout. However, if we make a silly mistake in Figure.subplot, resulting in a 1-row-by-2-column layout, the test still passes, because both the baseline and test images have the same "wrong" layout. Then the test is useless to us.
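The failure mode described here can be reproduced with a toy model that does not involve PyGMT at all. In this sketch, `buggy_subplot` is a hypothetical stand-in for a wrapper that accidentally swaps rows and columns, and a "figure" is reduced to its (rows, columns) layout tuple.

```python
def buggy_subplot(nrows, ncols):
    """Hypothetical wrapper with a bug: rows and columns are swapped."""
    return (ncols, nrows)


def render_reference(layout_func):
    # The reference figure goes through the same wrapper as the test figure.
    return layout_func(nrows=2, ncols=1)


def render_test(layout_func):
    return layout_func(nrows=2, ncols=1)


ref = render_reference(buggy_subplot)
test = render_test(buggy_subplot)

# Comparing the two figures against each other: both share the same bug,
# so the comparison succeeds and the bug goes unnoticed.
figures_equal = ref == test

# Comparing against an independently verified baseline (2 rows, 1 column)
# exposes the bug.
independent_baseline = (2, 1)
matches_baseline = test == independent_baseline

print(figures_equal, matches_baseline)
```

The comparison only has value when the reference is produced by a path that does not share the code under test.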

Almost every plotting tool has to decide whether to store static images in the repository. There are similar discussions in the upstream GMT project (GenericMappingTools/gmt#3470) and the matplotlib project (matplotlib/matplotlib#16447).

As we're having more active developers now, I think we should rethink how we want to test PyGMT.


willschlitzer · 2021-02-24T17:58:34Z

I'm a fan of switching to @check_figures_equal(), although I understand the difficulties associated with having our testing functions make so many plots. I mostly like the idea of not having a large repository of reference plots, especially for when we have changes in the plotting function, like in GMT 6.2.

That being said, I have begun to realize that it doesn't seem like we're always effectively testing all that much when we're just passing the same parameter to single-letter and PyGMT arguments and testing that the plots turn out the same. As we've discussed in #771, the thing we should be most concerned about is the "Python" parts of the function, not the aliases that pass arguments to the GMT API. I think we can reduce the testing workload by consolidating some of the tests that test nothing more than aliases (which includes many of the tests I've written) and focus on the Python parts (my recent example is testing the Python parts of solar to make sure the terminator and datetime inputs produce the same result as a GMT string passed to the T parameter).

seisman · 2021-02-25T07:51:29Z

the thing we should be most concerned about is the "Python" parts of the function, not the aliases that pass arguments to the GMT API. I think we can reduce the testing workload by consolidating some of the tests that test nothing more than aliases and focus on the Python parts.

Yes, I agree. This is what we must do.

Still, the biggest challenge is how to make sure that the baseline images are correct.

We had some discussions in #451.

One solution is storing static images in a separate repository (e.g. pygmt_baseline_image), but it is too complicated when we want to add a new test.

Should we start using submodules? I.e. split the pygmt/tests/baseline folder into a separate repository (see https://docs.github.com/en/github/using-git/splitting-a-subfolder-out-into-a-new-repository).

I've been reading up on git submodules/subtrees/git-lfs and there doesn't seem to be an easy way to do this; there will be a learning curve in any case. Matplotlib currently has a big PR at matplotlib/matplotlib#17557 to move their baseline images into a separate place, and I really do not want myself or anyone to handle that in X years.

It's too complicated. When we add a test, we have to open two separate PRs in two repositories, one for the baseline images and one for the tests. How can the tests PR know it should get the new baseline images in the corresponding branch?

Yeah, and I don't think it will be friendly for new contributors either. Surely there must be a better way to store the images, or test them

Another solution is generating baseline images by directly calling Session.call_module(). It's almost testing the equivalence of a Python script (using PyGMT syntax) and a bash script (using GMT CLI). This is the method I prefer.

For other non-grd tests, perhaps we could generate the reference images by directly passing arguments to GMT modules. For example,
fig_test.basemap(region=[0, 10, 0, 10], projection='X10c/10c', frame=['xaf', 'yaf', 'WSen'])
should be identical to the reference image generated by:
lib.call_module("basemap", "-R0/10/0/10 -JX10c/10c -Bxaf -Byaf -BWSen")
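To make the claimed equivalence concrete, the sketch below shows how the keyword arguments in the example above correspond to the GMT argument string passed to call_module. The helper `to_gmt_args` is hypothetical and written only for this illustration; it is not PyGMT's actual argument-building code and handles only the three parameters used here.

```python
def to_gmt_args(region, projection, frame):
    """Hypothetical helper: map three PyGMT-style parameters to GMT flags."""
    parts = [
        "-R" + "/".join(str(value) for value in region),  # region -> -R
        "-J" + projection,                                # projection -> -J
    ]
    # Each frame setting becomes its own -B option.
    parts.extend("-B" + item for item in frame)
    return " ".join(parts)


args = to_gmt_args(
    region=[0, 10, 0, 10],
    projection="X10c/10c",
    frame=["xaf", "yaf", "WSen"],
)
print(args)  # -R0/10/0/10 -JX10c/10c -Bxaf -Byaf -BWSen
```

Writing the reference call as an explicit argument string keeps it independent of the alias handling that the test is meant to exercise.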

willschlitzer · 2021-02-25T09:37:13Z

Another solution is generating baseline images by directly calling Session.call_module(). It's almost testing the equivalence of a Python script (using PyGMT syntax) and a bash script (using GMT CLI). This is the method I prefer.
For other non-grd tests, perhaps we could generate the reference images by directly passing arguments to GMT modules. For example,
fig_test.basemap(region=[0, 10, 0, 10], projection='X10c/10c', frame=['xaf', 'yaf', 'WSen'])
should be identical to the reference image generated by:
lib.call_module("basemap", "-R0/10/0/10 -JX10c/10c -Bxaf -Byaf -BWSen")

I like this idea. I think it more effectively tests that the aliases and Python functions line up with the expected outcome from GMT, as opposed to checking whether passing the same arguments to PyGMT twice produces different results. We assume that if the "correct" inputs are sent to GMT, the figure will turn out as expected, much like a reference image. The downside is that it requires contributors to know GMT commands, but I don't think that is too advanced for someone wrapping a new module.

Why wouldn't this also be applicable for grd tests? Since we use standard GMT-hosted grids, wouldn't we be able to add @earth_relief in the string passed to lib.call_module()?

How would this work with @check_figures_equal()? The GMT 6.2 release seems like an ideal time to make this switch, since we will have to either update the tests or update the reference images.

seisman · 2021-02-25T15:20:01Z

We assume that if the "correct" inputs are sent to GMT, the figure will turn out as expected, much like a reference image.

Yes, that sounds like a reasonable and valid assumption.

Why wouldn't this also be applicable for grd tests? Since we use standard GMT-hosted grids, wouldn't we be able to add @earth_relief in the string passed to lib.call_module()?

It can also be applied to grid tests.

How would this work with @check_figures_equal()?

pygmt/pygmt/tests/test_grd2cpt.py

Lines 23 to 37 in e057927

    
    @check_figures_equal()
    def test_grd2cpt(grid):
        """
        Test creating a CPT with grd2cpt to create a CPT based off a grid input and
        plot it with a color bar.
        """
        fig_ref, fig_test = Figure(), Figure()
        # Use single-character arguments for the reference image
        fig_ref.basemap(B="a", J="W0/15c", R="d")
        grd2cpt(grid="@earth_relief_01d")
        fig_ref.colorbar(B="a2000")
        fig_test.basemap(frame="a", projection="W0/15c", region="d")
        grd2cpt(grid=grid)
        fig_test.colorbar(frame="a2000")
        return fig_ref, fig_test

Take this test (written by you) as an example. I think I mentioned before that the test may still pass even if grd2cpt() doesn't work as expected. The test can be rewritten as:

from pygmt import Figure, grd2cpt
from pygmt.clib import Session
from pygmt.helpers.testing import check_figures_equal


@check_figures_equal()
def test_grd2cpt(grid):
    """
    Test creating a CPT with grd2cpt to create a CPT based off a grid input and
    plot it with a color bar.
    """
    # reference image
    fig_ref = Figure()
    with Session() as lib:
        lib.call_module("basemap", "-Ba -JW0/15c -Rd")
        lib.call_module("grd2cpt", "@earth_relief_01d")
        lib.call_module("colorbar", "-Ba2000")

    # test image
    fig_test = Figure()
    fig_test.basemap(frame="a", projection="W0/15c", region="d")
    grd2cpt(grid=grid)
    fig_test.colorbar(frame="a2000")

    return fig_ref, fig_test

willschlitzer · 2021-02-28T08:48:08Z

@GenericMappingTools/python-contributors Does anyone have opinions on this? I'm in support of @seisman's example using with Session() as lib.

core-man · 2021-02-28T09:19:18Z

As a new PyGMT user and a long-time GMT user, I think the testing method using with Session() as lib is better.

If the right figure cannot be generated by PyGMT, I guess there are two possible reasons: 1) some bugs exist in PyGMT; 2) GMT has some bugs. We should fix the first one in PyGMT, while we should report GMT bugs to upstream.

But if the PyGMT project plans to develop more functions that are not in GMT, this testing mechanism will not work.

seisman · 2021-02-28T19:40:56Z

But if the PyGMT project plans to develop more functions that are not in GMT, this testing mechanism will not work.

Any new functions in PyGMT would rely on GMT, so we can always find equivalent GMT command lines. For example, in the new Figure.hlines() (#923) function, we can call low-level gmt plot commands to generate the reference images.

michaelgrund · 2021-03-02T19:07:51Z

But if the PyGMT project plans to develop more functions that are not in GMT, this testing mechanism will not work.

Any new functions in PyGMT would rely on GMT, so we can always find equivalent GMT command lines. For example, in the new Figure.hlines() (#923) function, we can call low-level gmt plot commands to generate the reference images.

You're right. So far I haven't used image comparison in the tests the way you suggested, @seisman. Hopefully I'll have time to work on this over the upcoming weekend.

willschlitzer · 2021-03-02T22:27:30Z

@seisman Should we begin working on rewriting the tests, or should we wait until the GMT 6.2 release? I'm assuming we want to prioritize rewriting the tests that use @pytest.mark.mpl_image_compare, but will ideally update the @check_figures_equal tests to use with Session() as lib?

weiji14 · 2021-03-02T22:43:32Z

Not to throw a spanner into things, but do we want to reconsider using @pytest.mark.mpl_image_compare? I've mentioned before at #451 (comment) the idea of storing the PNG images offsite, while committing the 'hash' of the image here on the pygmt repo using things like git-lfs.

There are also solutions like dvc (basically 'git' but for data) that are maturing quite nicely and might be worth considering, especially if we can automate things using GitHub Actions (e.g. a bot checks if images differ from baseline, and we can do /yes-bot or /no-bot to update the image). We might be able to host it on GitHub Artifacts or the like.

Alternatively, I wonder if storing SVG instead of PNG would make things lighter?

And yes, all this should be done closer to the GMT 6.2.0 release. We have the GMT dev tests set up on that CI for that matter and should be able to fix most tests before the actual GMT 6.2.0 package is out on conda-forge.

seisman · 2021-03-03T02:23:51Z

Not to throw a spanner into things, but do we want to reconsider using @pytest.mark.mpl_image_compare? I've mentioned before at #451 (comment) the idea of storing the PNG images offsite, while committing the 'hash' of the image here on the pygmt repo using things like git-lfs.

I agree that @pytest.mark.mpl_image_compare is the most accurate way to compare reference and test images. As you said, if we can store images offsite, @pytest.mark.mpl_image_compare is the best solution.

There are also solutions like dvc (basically 'git' but for data) that are maturing quite nicely and might be worth considering, especially if we can automate things using GitHub Actions (e.g. a bot checks if images differ from baseline, and we can do /yes-bot or /no-bot to update the image). We might be able to host it on GitHub Artifacts or the like.

Not sure if it really works for us. How can we download and update baseline images if we want to run tests locally?

Alternatively, I wonder if storing SVG instead of PNG would make things lighter?

Unfortunately, GMT no longer supports SVG (because recent Ghostscript versions dropped SVG support). Even if GMT could save figures in SVG format, I doubt it would help. The GMT project stores PS files (ASCII) in the repository, and the repository size still grows quickly, because images (especially figures generated by grdimage) are saved as binary data inside the ASCII PS files. (I'm not sure I've explained this clearly, but you can plot an image using grdimage and open the resulting file in a text editor to see it.)

weiji14 · 2021-03-03T02:51:19Z

There are also solutions like dvc (basically 'git' but for data) that are maturing quite nicely and might be worth considering, especially if we can automate things using GitHub Actions (e.g. a bot checks if images differ from baseline, and we can do /yes-bot or /no-bot to update the image). We might be able to host it on GitHub Artifacts or the like.

Not sure if it really works for us. How can we download and update baseline images if we want to run tests locally?

The hash of the images will be stored in a .dvc file. To download/update the PNG images locally, use dvc pull (similar to git pull). Adding files would be through dvc add (similar to git add). In fact, most of the dvc commands are based on git (see also https://realpython.com/python-data-version-control/), so the learning curve shouldn't be too steep (hopefully). They also have a Python API at https://dvc.org/doc/api-reference we could plug into.
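The idea of tracking an image by its hash can be sketched with the standard library alone. This is a conceptual illustration only, assuming an MD5 content hash; it does not use or reproduce dvc itself, and the file name and helper `md5_of_file` are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    image = Path(tmp) / "test_basemap.png"  # hypothetical baseline image
    image.write_bytes(b"fake PNG bytes")

    # The small text value that would be committed to git instead of the PNG.
    recorded_hash = md5_of_file(image)

    # Same content: the hash matches, so the tracked image is up to date.
    unchanged = md5_of_file(image) == recorded_hash

    # Regenerated image with different content: the hash no longer matches.
    image.write_bytes(b"different PNG bytes")
    modified = md5_of_file(image) != recorded_hash

print(unchanged, modified)
```

Only the short hash needs to live in the git history; the image bytes themselves can be stored elsewhere and fetched on demand.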

I'll probably need to open up a demo PR to illustrate how things would work, but things we'll need to do are:

Think about where to store the PNG images in the cloud.
Change our testing workflow (for plotting images) back to @pytest.mark.mpl_image_compare

seisman · 2021-03-03T03:09:08Z

I'll probably need to open up a demo PR to illustrate how things would work

Yes, that would be better.

Think about where to store the PNG images in the cloud.

If it works, can we just store the PNG images in another github repository?

weiji14 · 2021-03-05T02:13:49Z

Think about where to store the PNG images in the cloud.

If it works, can we just store the PNG images in another github repository?

Will need to have a think about where to store things as I create that PR. Probably won't have time to do this until v0.4.0 though.

Edit: Just mirrored the PyGMT repo at https://dagshub.com/GenericMappingTools/pygmt. DAGsHub is a web platform for data version control (see FAQ). Give me a few days or weeks and I'll try to get a pipeline of some sort set up for us to start uploading images!

weiji14 · 2021-03-18T04:07:41Z

Ok, #1036 has been merged which sets up data version control (dvc) for the PyGMT repo. The new dvc based workflow addresses the main con of using @pytest.mark.mpl_image_compare (storing large images in the repo) by storing them on https://dagshub.com/GenericMappingTools/pygmt instead (please create an account there everyone using your GitHub login).

We will slowly migrate the tests from @check_figures_equal to @pytest.mark.mpl_image_compare. Instructions are documented at https://github.com/GenericMappingTools/pygmt/blob/master/CONTRIBUTING.md#using-data-version-control-dvc-to-manage-test-images. The test migration will proceed as follows over the next few weeks/months:

After this PR Initialize data version control for managing test images #1036, change recommended way of testing from @check_figures_equal() to @pytest.mark.mpl_image_compare

Bump minimum GMT version to 6.2.0

Fix all the test images that have changed, storing the new test images with dvc on DAGsHub

Optional - Migrate @check_figures_equal tests to @pytest.mark.mpl_image_compare (can prioritize the slow tests as reported in Show test execution times in pytest #835/Improve some tests to speed up the CI #840)

Optional - Fully deprecate @check_figures_equal(), removing it from codebase and documentation in CONTRIBUTING.md, also close Directly check if two figures returned by a function are equal matplotlib/pytest-mpl#95?

Write new tests for new functionality using @pytest.mark.mpl_image_compare only

Originally posted by @weiji14 in #1036 (comment)

Update the install instructions, because pygmt.test() won't work for users.

Maybe add the baseline images as a release asset when making releases.

Originally posted by @seisman in #1036 (comment)

I'd encourage everyone to use dvc for their open PRs when creating test images, and feel free to ask any questions if things are unclear!

maxrjones · 2021-03-18T17:46:22Z

Wow, @weiji14 and @seisman this looks fantastic! Great job! I'm excited to try it out and find out whether it could be a solution for core-gmt's testing woes as well 😄

seisman · 2021-03-18T23:16:35Z

@weiji14 Perhaps you could open a separate issue or several issues with a list of TODOs so that people who want to help have a better idea of what to do.

After this PR Initialize data version control for managing test images #1036, change recommended way of testing from @check_figures_equal() to @pytest.mark.mpl_image_compare

Bump minimum GMT version to 6.2.0

Fix all the test images that have changed, storing the new test images with dvc on DAGsHub

Optional - Migrate @check_figures_equal tests to @pytest.mark.mpl_image_compare (can prioritize the slow tests as reported in Show test execution times in pytest #835/Improve some tests to speed up the CI #840)

Question: Do we want to do the migration before GMT 6.2.0 or after? I prefer to do the migration before v6.2.0, although it means more work for us. After bumping GMT to 6.2.0, most tests will fail due to the changes in GMT 6.2.0, but I feel it's also a good opportunity for us to learn the GMT changes and find potential bugs by comparing the images generated by GMT 6.1.1 and 6.2.0.

FYI, one week ago, I built the PyGMT documentation using GMT 6.2.0, and found several issues with the GMT dev version (GenericMappingTools/gmt#4955), and they were all fixed in less than one week!

Optional - Fully deprecate @check_figures_equal(), removing it from codebase and documentation in CONTRIBUTING.md, also close Directly check if two figures returned by a function are equal matplotlib/pytest-mpl#95?

I think we may still need @check_figures_equal() in some cases, especially for grid plotting functions. You may remember that we found some upstream bugs by checking whether the images generated from a file and from an xarray.DataArray are the same. Sometimes upstream bugs cause tiny differences between these two images that are difficult to identify by visually checking baseline images.

maxrjones · 2021-03-20T15:31:04Z

@weiji14, could I please get added on DAGshub?

weiji14 · 2021-03-20T18:04:09Z

@weiji14, could I please get added on DAGshub?

Ok done

seisman · 2021-04-11T02:26:39Z

After this PR #1036, change recommended way of testing from @check_figures_equal() to @pytest.mark.mpl_image_compare

Done in #1108.

Bump minimum GMT version to 6.2.0

Waiting for the GMT v6.2.0 release.

Fix all the test images that have changed, storing the new test images with dvc on DAGsHub

Tracked by issue #1131.

Optional - Migrate @check_figures_equal tests to @pytest.mark.mpl_image_compare (can prioritize the slow tests as reported in #835/#840)

Tracked by issue #1131.

Optional - Fully deprecate @check_figures_equal(), removing it from codebase and documentation in CONTRIBUTING.md, also close matplotlib/pytest-mpl#95?

I think we still need it when testing grids.

Write new tests for new functionality using @pytest.mark.mpl_image_compare only

Yes, it's already documented in the contributing guides.

Update the install instructions, because pygmt.test() won't work for users.

I just opened issue #1200 for discussions.

Maybe add the baseline images as a release asset when making releases.

I just opened issue #1201 for discussions.

seisman · 2021-04-11T02:27:04Z

I think we can close the issue.

seisman added the question Further information is requested label Feb 24, 2021

seisman mentioned this issue Feb 24, 2021

Wrap solar #804

Merged

5 tasks

seisman mentioned this issue Feb 26, 2021

Three tests fail with GMT dev versions #968

Closed

weiji14 added this to the 0.4.0 milestone Mar 5, 2021

weiji14 mentioned this issue Mar 11, 2021

Initialize data version control for managing test images #1036

Merged

5 tasks

weiji14 mentioned this issue Mar 18, 2021

Wrap rose #794

Merged

5 tasks

seisman mentioned this issue Mar 19, 2021

Allow passing an array as intensity for plot #1065

Merged

5 tasks

weiji14 pinned this issue Mar 20, 2021

maxrjones mentioned this issue Mar 20, 2021

Add dvc to CONTRIBUTING.md under Testing your code #1083

Merged

5 tasks

seisman mentioned this issue Mar 22, 2021

Migrate Figure.basemap tests to dvc #1096

Merged

5 tasks

weiji14 mentioned this issue Mar 22, 2021

Create Github Action workflow for reporting DVC image diffs #1104

Merged

5 tasks

seisman mentioned this issue Mar 26, 2021

Migrate tests to use dvc-tracked baseline images #1131

Closed

28 tasks

seisman unpinned this issue Mar 26, 2021

seisman added maintenance Boring but important stuff for the core devs and removed question Further information is requested labels Apr 11, 2021

seisman closed this as completed Apr 11, 2021
