Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix multiple problems with views-of-views #10505

Merged
merged 2 commits into from
Mar 22, 2015
Merged

Fix multiple problems with views-of-views #10505

merged 2 commits into from
Mar 22, 2015

Conversation

timholy
Copy link
Member

@timholy timholy commented Mar 13, 2015

This is a public service announcement from your local "oh, crap" department.

Due to what was probably a copy/paste error, the SubArray tests were grabbing a global variable A rather than an argument SB. A happened to be a plain array, with the consequence that the SubArray tests were not thoroughly testing views-of-views:

A = rand(20, 20, 20)
B = slice(A, 1:5, 6, 6:9)  # a view (2-dimensional)
C = sub(B, 6:14)           # a view of a view (1-dimensional)

I note that, had I passed that file through Lint, it would have caught the problem.

Unsurprisingly, fixing this test problem uncovered bugs in my implementation of SubArrays. Many "simple" (hah!) things worked because I had tested quite a lot via the command line, and also copied in some old hand-written tests. But constructs like the one above---specifically, those that use linear indexing in creation of a view-of-view, when one of the uber-parent dimensions had been sliced out---certainly had problems. I think people would have quickly gotten BoundsErrors if this were being commonly used, so the fact that this hasn't been noticed until now is a good sign. Once we start returning views from indexing operations, views-of-views are going to be really common, so it surely would have turned up then.

One major downside to this fix is that, now that the tests are running properly, the subarray test takes (on my machine) 14 minutes to complete. I think that's probably too long. However, I also don't think the test is excessive in its thoroughness:

  • it only goes up to 3 dimensions, which I've found to be the absolute minimum for uncovering errors in logic. Originally I used 4 dimensions.
  • the slow part (views-of-views) uses just 5 different instances of indexes: one Colon, one Int, one UnitRange{Int}, one StepRange{Int}, and one Vector{Int}. It uses all possible combinations of these, leading to 125 views. It then creates views-of-views using indexes of the same types, for 1, 2, and 3-dimensional indexing. That's a lot of combinations (and a lot of code to compile), but the different choices have really important consequences for both cartesian indexing and inference of LD (the maximum dimension with a uniform stride), which is key to our current SubArrays' ability to perform fast linear indexing in many situations. The logic behind this inference is quite complicated, and seems to demand thorough testing.

I suspect there's no alternative but to hand-pick a subset of these combinations for routine testing, and perhaps save the exhaustive tests for a rarely-run test suite. But I didn't want to start decreasing the stringency of our tests without discussing it first, especially when an accidental lack of stringency led to bugs going unnoticed.

[Note that the large size of this commit partly reflects cleanup as well as bugfixes.]

Sorry, all!

Due to what was probably a copy/paste error, the tests
were grabbing a global variable A (rather than an argument)
that happened to be a plain array. So we weren't really testing
views-of-views.

Unsurprisingly, fixing this problem uncovered bugs.
Many "simple" things worked, but sufficiently complicated
constructs like
   A = rand(20, 20, 20)
   B = slice(A, 1:5, 6, 6:9)
   C = sub(B, :)
(specifically, those that use linear indexing in creation
of a view-of-view) certainly had problems.

Drat. Well, hopefully this has been rare up until now;
it seems rather likely that folks would have gotten BoundsErrors if
anyone had actually been using this. But we start returning views from
getindex it will become common, so better that it was caught now.

In debugging and fixing the problems, I took the opportunity to
do some cleanup to make this more maintainable, specifically:
adopting a consistent naming convention, improving many variable
names, and adding comments.
@timholy
Copy link
Member Author

timholy commented Mar 13, 2015

The appveyor failure looks like a timeout (unsurprisingly).

@tkelman
Copy link
Contributor

tkelman commented Mar 14, 2015

If we really need to run a substantially slower set of tests on every commit and PR, an easy option would be to split up the existing tests into more entries in the build matrix. This will impact our queue time (esp. on appveyor), but not much else.

While our tests have been gradually growing in coverage lately, it also seems there have been some regressions in codegen speed. Profiling and taking a few performance passes on codegen might be called for soon, particularly if we're aiming to upgrade LLVM before 0.4, which AIUI will only make this problem worse.

An outstanding todo has been to move more testing onto the buildbots, the perf suite and otherwise. Other than Elliot and myself I don't think many people are paying very close attention to those at the moment though.

@timholy
Copy link
Member Author

timholy commented Mar 14, 2015

Where does one go to look at buildbot results? Other than the coveralls page (which I do check now and then). Indeed, a puzzling gap in coverage helped me discover the problem here.

Interesting failure in bitarray tests here, I haven't seen that one before.

@tkelman
Copy link
Contributor

tkelman commented Mar 14, 2015

http://buildbot.e.ip.saba.us:8010/builders - I have a hard time remembering that myself, would be great if build.julialang.org could be made to redirect there

@ViralBShah
Copy link
Member

The redirect can be set up by @StefanKarpinski

It is probably time to have a comprehensive test suite and a quick test suite.

@mbauman
Copy link
Member

mbauman commented Mar 14, 2015

Interesting failure in bitarray tests here, I haven't seen that one before.

I saw a very similar failure on Win32 during the parallel tests on #10512. It was new to me, too. Glad that it isn't an issue with either of these PRs, but sad that it looks like we've got another non-deterministic type error on our hands.

@timholy
Copy link
Member Author

timholy commented Mar 15, 2015

OK, I think I've come up with a good solution: run a random small fraction of the subarray tests. There's an environmental variable you can set to specify the fraction (including all of them). Strategy or name bikeshedding welcome.

@tkelman
Copy link
Contributor

tkelman commented Mar 15, 2015

I'm not a fan of the idea of running a random subset of tests. Nondeterministic test failures is probably the single most frustrating thing about working on/with Julia. Unless you're confident that the percentage you've chosen would be enough to always immediately catch any problem that might get introduced that would cause any of these tests to fail...

If we're going to choose based on an environment variable, wouldn't it be more customary to give it an all-caps name?

@timholy
Copy link
Member Author

timholy commented Mar 15, 2015

I appreciate the feedback as always, @tkelman. It's great to have such a conscientious steward of our CI infrastructure.

I'm not confident that 2.5% of the tests would be enough to catch any problem, so the issue you are worried about is real. But I think this is in a different category from the nondeterministic failures that plague other julia development (BTW, on Linux rr is indeed awesome). In this case it's not like you're going to get a segfault or something; there will be a clear backtrace, and the solution will be obvious: git bisect with a run of the full test. The theory is that even if folks aren't watching the buildbots, it will be caught before too much time elapses.

The alternative is to hand-pick a subset of tests to run consistently. That's not bad either, but I'm not confident of my ability to pick a subset that would catch any and all possible errors. Therefore, a potential outcome is that a bug might be introduced that affects a subset that isn't tested. Indeed, developers will change their code until the test suite passes, so a consistent, biased subset of the full set of tests might encourage this outcome. In that scenario, if no one is watching the buildbots, then it could go a long time without being caught.

Anyway, willing to go either way here, but to me the random subset seems the safer solution.

Indeed I can change to all-caps. I'll do that if this strategy ends up being approved, or delete it if not.

@ViralBShah
Copy link
Member

The reason I do not like picking a random subset of tests is because you can easily have a failure on one subset, and not have it occur again. You never know then if it is some CI environment issue, or a real failure.

I would much rather pick a small set (even imperfect) for regular testing where failures are deterministic, and put in efforts towards building a comprehensive test suite that runs nightly.

@StefanKarpinski
Copy link
Member

One option would be to pick a fixed random subset of these tests. That is deterministic but arbitrary.

@tknopp
Copy link
Contributor

tknopp commented Mar 15, 2015

yep. Random tests are dangerous. Better pick a fixed random set. May it be possible to achieve this with a fixed seed or is the random sequence generated by a certain seed hardware specific?

@ivarne
Copy link
Member

ivarne commented Mar 15, 2015

I dislike the "Write fewer tests" option, or never run all of them.

If we make the choice of test permutations to run random, we should at least make it simple to rerun the same set of tests (in the same order) trivial. The test failure message must be very clear that a randomly chosen test failed, and detailed steps to rerun the exact same test run. (eg print the seed used for the RNG).

Another way to pick tests would be to run every Nth permutation skipping the first K, and configure N to a value that gives a reasonable runtime, and pick a random offset K. (Maybe based on VERSION.build[1] % N, so that we cycle through the full test space, and get full reproducibility when rerunning the same commit.)

A convenient interface might be:

@test_subset 5 for I_type in subtypes(Integer), F_type in subtypes(FloatingPoint)
    testfun(zero(I_type), one(F_type))
end

Where the 5 specifies that every 5th permutation should be tested for this batch, because if a test set has 1000 combinations, and another has 50, we should run different portions of the test space for each of them. One might have a global setting to increase this fraction.

PS: What is the recursive option for subtypes(Integer), that only returns leaftypes?

@timholy
Copy link
Member Author

timholy commented Mar 16, 2015

What if I were to run all the tests inside a try/catch and have the catch handle the error by (1) informing the user that a randomly-selected test failed, and (2) printing the exact combination of indexes used in the failing case? That way people could play with that combination manually until their patch succeeds.

@ivarne, your interface is a good idea. One challenge, though: currently to run 2.5% of the total test suite, I construct a view from array using 15.8% of the possible choices, and then a view from these views with another 15.8% of the choices. (0.158^2 == 0.025 to within the accuracy of these statements.) But you probably want the two "levels" to be independent of each other.

timholy referenced this pull request Mar 16, 2015
This allows the tests to complete in a reasonable time. One can force
all tests to run by setting the environment variable testfraction=1
@nalimilan
Copy link
Member

It feels simpler and more robust to me to have an extensive test suite that would be run nightly and on demand, and keep an (hopefully complete enough) subset to run on every PR. With a random subset, one would never be sure anyway that the next build wouldn't trigger a failure. At last, with a deterministic testsuite, we can decide to have most typical use cases in the default subset, so that 99% of mistakes are caught before merging. And we'll still be able to run the full testsuite when a PR looks like it could break something subtle.

Finally, if some subset was known to be run only nightly, we could make it run very complete tests, like going over all possible combination of types from Base. If tests were chosen at random, we would still be inclined to avoid too time-consuming checks like that, by fear of diluting essential ones in an ocean of "just in case" tests.

@timholy
Copy link
Member Author

timholy commented Mar 16, 2015

The consensus seems clearly against random subsets, so I will change this as soon as I can.

@StefanKarpinski
Copy link
Member

There's another option, which is to craft a subset of tests that is hard to "game" and designed so that it covers most of the corner cases. Of course, that's easier said than done, but it would be ideal.

@ivarne
Copy link
Member

ivarne commented Mar 16, 2015

Hard to game

We don't anticipate people writing intentionally broken code that passes the first level of tests. The goal is rather that we should have a high probability to catch inadvertently introduced problems.

timholy added a commit that referenced this pull request Mar 22, 2015
Fix multiple problems with views-of-views
@timholy timholy merged commit e1f0310 into master Mar 22, 2015
@timholy timholy deleted the teh/fixsubarrays branch March 22, 2015 12:24
@timholy
Copy link
Member Author

timholy commented Mar 22, 2015

OK, I went with JULIA_TESTFULL as a name for the environment variable.

@amitmurthy
Copy link
Contributor

Thought I'll mention here that parallel tests also have a conditional execution of tests dependent on env variable PTEST_FULL.

https://github.com/JuliaLang/julia/blob/master/test/parallel.jl#L185)

@tkelman
Copy link
Contributor

tkelman commented Mar 22, 2015

@amitmurthy I think most of us had missed that comment and were unaware of that, otherwise we would probably have more people running with PTEST_FULL set to check that... and it would've come up in this thread as precedent, since we may as well use the same environment variable to control both sets of comprehensive tests.

(Also, when linking to lines of code, please hit y so the link gets canonicalized with a commit sha, so the line number remains relevant even after code might get moved around in that file.)

@amitmurthy
Copy link
Contributor

Its been introduced quite a ways back - probably the reason it has not been noticed. We can standardize on JULIA_TESTFULL. As written, the additional parallel tests launch and tear down upto 90 workers via ssh - which takes around 7-8 GB of additional RAM.

@amitmurthy
Copy link
Contributor

I think that number should be reduced a bit if we want more people running the full parallel test.

@timholy
Copy link
Member Author

timholy commented Mar 22, 2015

Thanks for bringing this to our attention, @amitmurthy. I agree with all of your proposals.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants