Parallelizing the optimization step (redux) #83
Are you planning on doing any more testing of this? For example, with different numbers of threads and different dataset sizes? Is the speed-up you are getting with 10 threads based on the default number of iterations? Say I am using the MNIST digits dataset with 4 threads: is there a noticeable change in speed or visualization quality with this approach? Right now umappp offers the Hogwild!-style asynchronous SGD (the current implementation in uwot and the "real" UMAP Python package) as well as this new batch optimization. Is it planned to keep both modes? It sounds like that's the case; I just want to check.
Yes, I'll try to put together something over the week. Need to constantly remind myself to set …
That's right, though I don't parallelize the "standard" implementation; that's strictly serial. Which I think is fine as a default: the updates are probably more coherent when applied immediately rather than in batch.
Great job Aaron. Thanks for the community.
@LTLA I am trying to formulate a plan for how to refactor towards umappp. I think the first step is to get a version of your batch approach into uwot. For now, I will freely adapt the code rather than import umappp directly. I have a couple of questions about the umappp implementation.

In terms of the main optimization loop, it looks like the main difference is that standard UMAP updates positive pair coordinates directly (…). Also, for the batch update code there is some clever stuff between iterations that switches the buffer passed to the optimization loop to avoid a full copy. Finally, the code uses …

If there are other gotchas I should be aware of, please let me know.
Great. I also made a start with a small sandbox to get some timings and results, as suggested above. I actually tried doing this directly in my uwot fork but the C++ code isn't quite a plug-and-play replacement for the existing uwot code, so I just took an easier route.
Yep, that's right. In batched mode, I believe the update should be symmetric (i.e., A->B's force is the same as B->A's force if the coordinates are not changed in the meantime). So I just double the update to avoid touching the memory of the tail observation when computing for the head observation.
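To make that concrete, here is a minimal sketch of the doubled, one-sided attractive update being described (illustrative names only, not the actual umappp code). It assumes the symmetrized edge list contains each undirected edge in both directions, so applying twice the force to the head alone gives the same total displacement per epoch while never writing to the tail's memory:

```cpp
// Sketch: one-sided attractive update, doubled so the tail never needs touching.
#include <cstddef>

void attract_head_only(const double* head_coords,  // start-of-epoch coordinates of head
                       const double* tail_coords,  // start-of-epoch coordinates of tail
                       double* head_update,        // accumulated displacement for head only
                       std::size_t ndim,
                       double grad_coeff,          // attractive coefficient (negative)
                       double alpha) {             // learning rate
    for (std::size_t d = 0; d < ndim; ++d) {
        double grad = grad_coeff * (head_coords[d] - tail_coords[d]);
        // 2x the usual step; the matching tail->head edge handles the tail,
        // so threads partitioned by head observation never race on memory.
        head_update[d] += 2.0 * alpha * grad;
    }
}
```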
Mostly my use of …
I finally got around to making some timings on the MNIST dataset.
Though it must be said that the batched one looks a little wonky.

For uwot:

[results omitted]

For umappp without batching:

[results omitted]

For umappp with batching:

[results omitted]

The source code is here, and I get similar conclusions from the Fashion MNIST.
Hmm, I wonder if batching needs a different learning rate (…).
Another possible source of wonk could be the asymmetric update. @LTLA, if you go back to updating both sides of the positive edges (and run it with one thread), does that have an effect? That would at least decouple the effect of the asymmetric attractive update from the effect of not updating the coordinates within the epoch.
Here are two images from my own independent implementation, each with 200 epochs and single-threaded. The "in-place" one is the way uwot currently works (i.e. updating the coordinates immediately). The "batch" result calculates the gradients using the previous epoch's coordinates, so it should be equivalent to how the umappp batch mode would behave with one thread, had the update rule for positive pairs not been altered to apply to only one point.

[images omitted: in-place vs. batch embeddings]

The wonk is clearly present in my batch case as well, so at least it's reassuring to know that we get the same sort of effect with two different implementations, and that the effect is due to the coordinate update strategy (nothing to do with only moving one side of the positive pair).

Increasing the number of epochs by a factor of ten (…):

[image omitted]

This is much more normal-looking and in line with what I would expect a better-converged-than-typical-defaults UMAP of the MNIST digits to look like. So almost certainly what we are dealing with here is a mechanism similar to the observation that stochastic gradient descent tends to converge faster than standard (full-batch) gradient descent. Some potential strategies: …
Anyway, I will continue to chip away at implementing batch mode in uwot at my usual snail's pace.
Huh, I wasn't able to get that with … Bit of a puzzle. I'll have to stare at it for a little bit.

More generally - hmm - I was hoping that the batch update would be more of a plug-and-play substitute. It worked okay in my tests, but I guess it missed the more subtle structure. Maybe it's just not possible to parallelize the optimization effectively. FWIW, the minibatch approach (i.e., your last suggestion) sounds the most promising to me.
I changed the attractive update to be asymmetric as in umappp and the results seem to be closer to umappp's:

[image omitted]

The effect seems most pronounced on the elongated cluster (of the digit '1', IIRC).

A worst-case scenario for symmetric updates would be for each thread to accumulate its updates in a local list, and then have a subsequent single-threaded step iterate over the lists and apply them. If the gradient calculation is a lot slower than applying the update, it might still be a win (at the potentially substantial cost of the extra storage).
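A rough sketch of what that worst-case bookkeeping could look like (made-up names, not code from either package): each thread appends its symmetric updates to a thread-local list, and a single-threaded pass applies them afterwards.

```cpp
// Sketch: deferred symmetric updates, applied in a serial pass after the
// parallel gradient computation. The extra storage is the cost noted above.
#include <cstddef>
#include <vector>

struct PendingUpdate {
    std::size_t point;          // which observation to move
    std::vector<double> delta;  // ndim-length displacement
};

void apply_pending(std::vector<double>& embedding, std::size_t ndim,
                   const std::vector<std::vector<PendingUpdate>>& per_thread_lists) {
    // Single-threaded, so it is safe to touch any coordinate here.
    for (const auto& local : per_thread_lists) {
        for (const auto& u : local) {
            double* target = embedding.data() + u.point * ndim;
            for (std::size_t d = 0; d < ndim; ++d) {
                target[d] += u.delta[d];
            }
        }
    }
}
```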
Just spent an evening trying to parallelize the tight inner loops (i.e., when we iterate over the dimensions) with SIMD autovectorization. Not very promising, I'm afraid: it just sliced 1 second off a total runtime of 15 seconds in my test case. Probably not worth it, given that we would have to compile with … (Rather annoyingly, …)

Well, I'm at a loss. It really does seem that the only way to get closer to the canonical UMAP result is to compromise on performance in some manner that reduces the benefit of parallelizing in the first place. I was hoping that the difference would be negligible, and to be fair, the results are qualitatively very similar... but it does look different enough that I couldn't recommend it as an unqualified replacement for the standard approach. IMO it's still better than the irreproducible parallelization scheme that uwot currently has, but at this point, maybe it's better to just not offer any parallelization of the optimization step at all.
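For context, this is the kind of per-dimension inner loop being discussed, with an OpenMP SIMD hint of the sort that was tried (a sketch, not the actual code); with only two embedding dimensions there is very little for the vectorizer to chew on, which is consistent with the small gain observed.

```cpp
// Sketch: the tight loop over embedding dimensions, annotated for SIMD.
#include <cstddef>

inline double squared_distance(const double* a, const double* b,
                               double* diff, std::size_t ndim) {
    double dist2 = 0.0;
#pragma omp simd reduction(+ : dist2)
    for (std::size_t d = 0; d < ndim; ++d) {
        double delta = a[d] - b[d];
        diff[d] = delta;        // reused later when applying the gradient
        dist2 += delta * delta;
    }
    return dist2;
}
```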
Minor update: I added the asymmetric update to the Python UMAP (just commenting out the …):

[image omitted]

More epochs does sort it out:

[image omitted]

Maybe the more vigorous application of a presumably poor estimate of the gradient (whatever the true gradient actually is, anyway) in one helping in the asymmetric update can lead to initial distortions that take longer to ameliorate? In that case, mini-batching would help, or scaling down the learning rate and increasing the number of epochs to account for it. Then maybe batching would only be a win over single-threaded SGD above a threshold number of threads, but if that threshold is small enough (say, at least 4 threads), it would still be useful.
Very interesting. Yes, your hypothesis makes sense; I was also thinking that the initial iterations would be more unstable, so the algorithm probably ended up in a strange state that needed more epochs to sort out. Maybe ... we could do single-threaded optimization for the first x iterations, and then switch to batch mode for the remaining iterations - sort of like how t-SNE has an early exaggeration phase to get the overall structure right. The later iterations don't do much anyway - check this out:

[animated GIF of the embedding across epochs omitted]
@LTLA that's a very neat gif. Is there currently (or planned to be) that kind of functionality in umappp? Adding callback functionality to the C++ part of uwot, invoked after each epoch to plot this kind of thing, would be interesting (to me at least).

Anyway, I have been prodding at my batch implementation a bit more, and I have a question: how does umappp organize its data structures and parallelization for the batch optimization, specifically the edges? uwot stores the "head" nodes of each directed edge in the … This means that for my batch implementation I am currently forced to store each gradient update in …
Sure, that's the main reason for the existence of the … The GIF generation code is in the …

I must say, the UMAP animations are pretty sedate. The t-SNE ones are much wilder; lots of action going on there.
The physical storage is pretty much inherited from uwot, i.e., a vector of …

The PRNG seeding is handled by generating a vector of seeds at the start of each epoch. One seed is generated for each …

(*): I just noticed that I think my …
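Since the inline details above were lost, here is one plausible shape for that kind of storage, purely as an illustration: edges grouped contiguously by head observation, CSR-style, so a parallel task owns everything needed to update one head's coordinates. The field names are hypothetical and this is not a verbatim excerpt of umappp or uwot.

```cpp
// Hypothetical CSR-style edge layout, grouped by head observation.
#include <cstddef>
#include <vector>

struct EdgeStore {
    std::vector<std::size_t> offsets;       // length n_obs + 1; edges of head i live
                                            // in [offsets[i], offsets[i + 1])
    std::vector<std::size_t> tails;         // tail observation of each directed edge
    std::vector<double> epochs_per_sample;  // per-edge sampling schedule
};

// Visit all edges belonging to one head observation.
inline void visit_head(const EdgeStore& edges, std::size_t head) {
    for (std::size_t e = edges.offsets[head]; e < edges.offsets[head + 1]; ++e) {
        std::size_t tail = edges.tails[e];
        (void)tail;  // gradient work for the (head, tail) pair would go here
    }
}
```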
Thank you for the detailed answer, which I will mull over at length. I may stick with my currently memory-hogging alternative while I investigate whether a more sophisticated optimization method or mini-batches will bring the number of epochs down to a level which is comparable to the in-place update, but your way seems a lot better.
Yes, I've seen that too: presumably you observe it early on? It's because of the early exaggeration phase. The spectral initialization of UMAP should approximate that part of the t-SNE optimization.
Edit 3: I managed to oops my oops. I just double-checked and uwot is the same way round as Python UMAP. Where it's the "wrong" way round is when you transform new data. In that case, Python UMAP has the head be the ordered indices of the new data (whose coordinates will move) and the tail be the unordered indices of the original data (which will not move). uwot correctly has the head data be the new nodes, but they are ordered. But I may have done that on purpose back in the mists of time to get one optimization function moving the right coordinates. And for parallelizing over the head nodes only to work, it would need to be this way around. I think.

Edit: also, does your threading scheme rely on the structure of head/tail in any way (e.g. assuming indices are ordered in one of the arrays), or does every thread iterate over the entire node list checking whether the head or tail index is inside its window, on the assumption that this is always fast compared to doing the actual work of calculating the gradient and updating the embedding?

Edit edit: to answer my own question, as the …
Forgot to mention that the … It does also currently require more epochs to optimize, so it will stay on a branch until I work out a suitable optimization scheme (and then undo my typically over-enthusiastic application of templates, which has finally brought the Rtools compiler to its knees and hugely increased compilation times). Getting there, though.
Sounds like progress. For a completely left-field idea: I managed to "parallelize" my broader workflow by running the UMAP iterations concurrently with other stuff that I needed to do. Specifically, I would run the neighbor detection in parallel, and then I would give that information to a separate thread that would perform the iterations. On another thread, I would perform other steps of my workflow, e.g., clustering and t-SNE. In this manner, I managed to keep all threads occupied, even if the UMAP itself was not parallelized.

Yeah, kind of cheating. But it does avoid all the headaches of trying to modify UMAP's logic, and this should be implementable with uwot right now, given that … In fact, uwot could even provide a function for parallelized parameter sweeps. It seems that quite a few people will run …
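The shape of that workflow-level trick, as a toy sketch (generic std::thread code; the loops are placeholders for the real neighbor search, optimization, clustering and t-SNE steps, and nothing here is named after a real function in these packages):

```cpp
// Toy sketch: a serial UMAP-style optimization runs on its own thread while
// the calling thread gets on with other downstream work.
#include <cmath>
#include <thread>
#include <vector>

int main() {
    std::vector<double> embedding(70000 * 2, 0.0);   // stands in for the UMAP coordinates
    std::vector<double> other_results(70000, 2.0);   // stands in for clustering/t-SNE output

    std::thread umap_worker([&embedding]() {
        for (double& x : embedding) {                // placeholder for the serial epochs
            x = std::tanh(x + 0.1);
        }
    });

    for (double& x : other_results) {                // meanwhile, the rest of the workflow
        x = std::sqrt(x);                            // runs on the calling thread
    }

    umap_worker.join();                              // wait for the embedding before using it
    return 0;
}
```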
Like a zombie, slasher-movie antagonist, or undead monster of your choice, this issue thread has risen from the grave and is plodding inexorably forward.

The quite-similar-to-UMAP method PaCMAP uses Adam as its optimizer and seems to give good results, so it's likely I will use that. However, the MNIST digits are a pretty easy case, so I have been spending a lot of my free time benchmarking various parameter settings and datasets. For cases where batch + Adam with defaults introduces distortions or splitting of clusters that aren't present in the current UMAP optimization, I would like to know what (if anything) can be done to ameliorate the situation. I also got sucked into reading the dizzyingly large number of papers advocating various tweaks to Adam.

@LTLA I suspect parameter sweeps will go on my long list of things to get around to eventually, but I don't know exactly what would be helpful for uwot to provide versus someone just writing a function to do it themselves. Probably helpful as a separate issue, but I also don't want to waste your time documenting something I don't know when I will get around to.
Very cool. I'll admit I don't quite understand the subtleties of the parametrization of the gradient descent, but if we can get a speed-up without our ad hoc batch mode, I'm all for that. I'll probably just piggy-back off your C++ implementation again when the time comes, but no rush; I think the current code in uwot/umappp is pretty fast as it is. I'll do the honors and drive a stake through the heart of this thread.
The issue refuses to die: the optimizer will be used with the batch branch. The question at this point is whether there is a good set of parameters which will: …
The above PR will be merged shortly. As discussed above, it also introduces batch mode and uses Adam for optimization, similarly to PaCMAP. I recommend using at least 500 epochs with batch mode, so larger datasets and those which need longer optimizations will see the benefit of using this in combination with multiple threads. Additionally, an …

There will not be a new CRAN submission for this functionality until I have experimented with it a bit more and extended it to work with …

Apologies to @LTLA for saying I would try to get uwot integrated with umappp and then completely failing. But at least, functionality-wise, they are better aligned for a future effort.
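For readers unfamiliar with it, the Adam update applied to each coordinate looks roughly like the textbook version below (standard default hyperparameters; this is a generic sketch, not the exact code or settings that uwot ships with):

```cpp
// Generic Adam step for one parameter (e.g. one embedding coordinate).
#include <cmath>

struct AdamState {
    double m = 0.0;  // first moment (running mean of gradients)
    double v = 0.0;  // second moment (running mean of squared gradients)
};

// Returns the delta to add to the parameter; t is the 1-based step count.
inline double adam_step(AdamState& s, double grad, long t,
                        double alpha = 0.001, double beta1 = 0.9,
                        double beta2 = 0.999, double eps = 1e-8) {
    s.m = beta1 * s.m + (1.0 - beta1) * grad;
    s.v = beta2 * s.v + (1.0 - beta2) * grad * grad;
    double m_hat = s.m / (1.0 - std::pow(beta1, static_cast<double>(t)));  // bias correction
    double v_hat = s.v / (1.0 - std::pow(beta2, static_cast<double>(t)));
    return -alpha * m_hat / (std::sqrt(v_hat) + eps);
}
```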
Forgot to mention: I did also look at mini-batches, but they introduced some stability issues when used with momentum. There are any number of potential causes and solutions, but mini-batches already introduce a trade-off against thread utilization. Rather than go wandering down yet another rabbit-hole for another couple of months, I have declared batch bankruptcy: there is a lot of room for improvement, but batch mode may have benefits for some users as it is. Hence the merge.
Sweet - sorry for the late reply - this looks great. Looking forward to playing around with it. What's the ballpark % speedup?
For the MNIST digits and … For the … So a speed-up of ~3x with 4 threads.

Arguably the non-batch mode is better optimized for any given number of epochs, but there isn't a good way for me to measure the equivalent optimization state between the two modes. I would expect the speed-up to be more impressive with umappp.

Caveat: I wouldn't be surprised if I have introduced a memory leak or other misfortune, judging by some sporadic strange behavior I have witnessed in RStudio on Windows. Sadly it wasn't the usual immediate R-session-explosion variety that makes tracking things down easier. I will be attempting to …
Did a bit of …
More specifically, the outcome from batch updates is too different from that of the iterative updates, such that it's difficult to recommend the former as a drop-in replacement for the latter. Even though it's easier to parallelize per epoch, the convergence is such that we need to spend more epochs (briefly discussed in jlmelville/uwot#83), which defeats the purpose. By removing batch processing, we also simplify the code a bit, which is nice.
To recap for other readers: it is currently possible to parallelize the optimization step by setting the (non-default) n_sgd_threads > 1. However, this is not ideal because the code is subject to race conditions, meaning that the results are not reproducible even when the seed is set. This makes it difficult to recommend - maybe not so bad for visualization, but still pretty bad, especially when you're on a manuscript deadline and trying to recreate a figure.

Unfortunately, parallelization of the optimization is not straightforward, as later iterations depend on the results of earlier iterations. The current code just ignores these dependencies, resulting in race conditions. Ideally, parallelization would occur transparently, giving the same results as the serial version in less time. I spent a few days this week exploring some approaches based on locking or other types of synchronization. This was largely disappointing: the overhead of thread management offset (and then some) any benefit from sharing work across threads.
I settled on approximating the optimization step by computing all of the updates first and then applying them in a batch. That is, the gradients and deltas are computed for every point before any point's coordinates are changed. In contrast, the current approach computes the deltas for each point and applies them immediately, before proceeding to the next point. The batch approach is much more amenable to parallelization because each point's calculations are now independent.
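To spell out the difference, here is a minimal sketch of the two epoch styles (the gradient callback and names are placeholders, not the umappp source):

```cpp
// Sketch: in-place vs. batch application of per-point deltas.
#include <cstddef>
#include <functional>
#include <vector>

// Computes the ndim-length delta for point i from a frozen set of coordinates.
using GradFn = std::function<std::vector<double>(const std::vector<double>&, std::size_t)>;

void epoch_in_place(std::vector<double>& coords, std::size_t n, std::size_t ndim,
                    const GradFn& grad) {
    for (std::size_t i = 0; i < n; ++i) {
        auto delta = grad(coords, i);  // sees the moves already made this epoch
        for (std::size_t d = 0; d < ndim; ++d) {
            coords[i * ndim + d] += delta[d];
        }
    }
}

void epoch_batch(std::vector<double>& coords, std::vector<double>& buffer,
                 std::size_t n, std::size_t ndim, const GradFn& grad) {
    for (std::size_t i = 0; i < n; ++i) {  // independent per point, so parallelizable
        auto delta = grad(coords, i);      // always reads the frozen start-of-epoch coords
        for (std::size_t d = 0; d < ndim; ++d) {
            buffer[i * ndim + d] = coords[i * ndim + d] + delta[d];
        }
    }
    coords.swap(buffer);  // swap buffers between epochs instead of copying back
}
```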
The actual code is here; I use OpenMP but it should be straightforward to adapt to a different framework. Each parallelizable iteration contains all edges corresponding to a single head point, which simplifies book-keeping. The neatest part is the definition of the seeds in the serial section before the creation of a per-iteration RNG; this enables the reproducible creation of iteration-specific random number streams. In the PCG32 case, you'd just need to combine 2 32-bit ints to create a 64-bit seed.

See the (previously mysterious) "umappp batch" figure in #80 for an example of the effect of batching. This approach is similar to how t-SNE performs its updates, and is analogous to the difference between the Hartigan-Wong (immediate) and Lloyd (batch) algorithms for k-means. Based on this analogy, I would speculate that, while batching enables convenient parallelization, it comes at the cost of stability/accuracy of the optimization in the early steps. Nonetheless, it could be a useful mode for getting a quicker reproducible answer for big datasets; my testing indicates that it gives a 3-4-fold speed-up with 10 threads.
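Going back to the seeding pattern described above, here is a hedged sketch of it (OpenMP with a standard-library generator standing in for PCG32; the shift-and-or line is the "two 32-bit ints into one 64-bit seed" idea):

```cpp
// Sketch: per-head seeds are drawn serially, then each parallel iteration
// builds its own RNG, so the random streams (and hence the results) do not
// depend on the thread schedule.
#include <cstdint>
#include <random>
#include <vector>

void parallel_epoch(std::size_t n_obs, std::mt19937& global_rng) {
    // Serial section: one seed per head point, reproducible given the global seed.
    std::vector<std::uint64_t> seeds(n_obs);
    for (auto& s : seeds) {
        std::uint64_t hi = global_rng(), lo = global_rng();
        s = (hi << 32) | (lo & 0xffffffffULL);
    }

#pragma omp parallel for
    for (long long i = 0; i < static_cast<long long>(n_obs); ++i) {
        std::mt19937_64 rng(seeds[i]);                          // iteration-specific stream
        std::uniform_int_distribution<std::size_t> pick(0, n_obs - 1);
        std::size_t negative = pick(rng);                       // e.g. a negative sample
        (void)negative;                                         // gradient work for all edges
                                                                // with head i would go here
    }
}
```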