[question] Getting recommendations for 14M+ users quickly #29
Comments
The problem here is that for each user, this has to rank all 1 million items. For getting similar items, I had a similar problem that I solved using annoy: on the last.fm dataset it sped things up from about 1 hour to 30 seconds or so on my laptop (www.benfrederickson.com/matrix-factorization/) by doing an approximate nearest neighbour search instead of an exact search, which is basically the same problem you have here. However, annoy uses cosine distance while the predictor here is the dot product. The dot product is a little trickier to generate fast approximate nearest neighbours for: it isn't a true distance since it doesn't satisfy the triangle inequality, which makes approximate methods harder to apply. There are approaches that could work here (like https://arxiv.org/pdf/1202.6101v1.pdf) but I haven't tried them out yet. Parallelizing the computation should be pretty easy though, and would be a small win.

If you want to try it out, it might be worth looking at the nearest neighbour code in this package. This code returns the K nearest neighbours by dot product of a sparse matrix: https://github.com/benfred/implicit/blob/master/implicit/_nearest_neighbours.pyx#L48 . Doing something similar on the dense factors is even easier, since you don't have to worry about doing a sparse matrix multiplication.
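A minimal sketch of that dense-factor version (the helper name is mine, not code from the package): score every item for one user with a dense dot product, then pick the top K with numpy's argpartition.

```python
import numpy as np

def top_k_items(user_vector, item_factors, k=10):
    # Score every item for one user with a single dense dot product.
    scores = item_factors @ user_vector            # shape: (n_items,)
    # argpartition (introselect) finds the k largest in O(n_items) on
    # average; only those k scores then get fully sorted.
    ids = np.argpartition(scores, -k)[-k:]
    return ids[np.argsort(-scores[ids])]
```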
Interesting paper about ANN for dot products.
There are a few recent papers about the LSH MIPS problem. See e.g. https://arxiv.org/abs/1410.5518
@dselivanov I'm using numpy's argpartition function to get the K largest items here, which uses the introselect algorithm, so we should already be at O(n) for that selection step.

@chapleau thanks for the link! I'll check it out.
I tried out the idea from the paper: transforming a cosine nearest neighbours search into an inner product nearest neighbours search by adding an extra dimension. The code that does this, and gets annoy to run a nearest neighbour search for the recommended items on the transformed vectors, is here: https://github.com/benfred/implicit/blob/master/implicit/annoy_als.py

I tested it out quickly, and it seems to run around 20 times faster on the last.fm dataset on my laptop, with the downside that the precision is only around 80%. Adding more trees to annoy improves precision, but also slows down the retrieval. Anyway, I thought I'd put this code out there; it might be worth experimenting with.
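For reference, the transformation itself is easy to sketch. A hypothetical version (function names and defaults are mine, not the annoy_als.py code): each item vector gets an extra coordinate sqrt(max_norm^2 - ||v||^2), so all augmented items share the same norm, and queries get a 0 there; ranking by cosine on the augmented vectors then matches ranking by inner product on the originals.

```python
import numpy as np
from annoy import AnnoyIndex

def build_inner_product_index(item_factors, n_trees=50):
    # Pad each item vector so every augmented vector has norm == max_norm;
    # cosine similarity against a 0-padded query then ranks items by
    # their original inner product.
    norms = np.linalg.norm(item_factors, axis=1)
    max_norm = norms.max()
    extra = np.sqrt(max_norm ** 2 - norms ** 2)
    augmented = np.hstack([item_factors, extra[:, None]])

    index = AnnoyIndex(augmented.shape[1], "angular")
    for i, vector in enumerate(augmented):
        index.add_item(i, vector.tolist())
    index.build(n_trees)
    return index

def recommend(index, user_vector, k=10):
    # A 0 in the extra dimension leaves the dot product unchanged, and all
    # item norms are equal, so angular ranking == inner-product ranking.
    query = np.append(user_vector, 0).tolist()
    return index.get_nns_by_vector(query, k)
```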
I tested out Annoy/NMSLib/Faiss for quickly generating recommendations (http://benfrederickson.com/approximate-nearest-neighbours-for-recommender-systems/) and added support in the last commit. It should be possible to do this relatively efficiently now; let me know how it goes.
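A rough usage sketch, assuming the ANN-backed model classes live in implicit.approximate_als as in the linked commit; the class name, fit/recommend signatures, and matrix orientation follow implicit's API at the time of this thread, so check your version's docs:

```python
from implicit.approximate_als import AnnoyAlternatingLeastSquares

# item_users: scipy.sparse matrix of item -> user interactions,
# which is what implicit's fit() expected at the time
model = AnnoyAlternatingLeastSquares(factors=64)
model.fit(item_users)

# user_items is the transposed interaction matrix; recommend() ranks
# unseen items for one user via the approximate index
recs = model.recommend(userid=0, user_items=item_users.T.tocsr(), N=10)
```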
Hi @benfred, thank you for this excellent tool! I wonder if there is an easier way to get recommendations for a few million users than slowly looping als.AlternatingLeastSquares.recommend() over them (#72), or modifying https://github.com/benfred/implicit/blob/master/implicit/evaluation.pyx#L96 to return an array to speed up the computation? Thanks in advance for any help or hints.
Hi Ben, thanks for the package. I have a dataset of 14M+ users and 1M+ items. The model fitting takes around 70 minutes, but getting recommendations for all users would take north of 10 hours. Is there a way to expedite this, maybe by parallelizing it? Any suggestions?
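One way to attack both questions above is to batch the scoring instead of calling recommend() per user. A rough sketch under my own assumptions (the helper is hypothetical and ignores filtering out already-liked items): score users in chunks with one dense matrix multiply each, letting a multithreaded BLAS parallelize the multiply, then take the top K per row.

```python
import numpy as np

def recommend_all(user_factors, item_factors, k=10, chunk_size=4096):
    """Top-k item ids for every user, chunked so the score matrix fits in memory."""
    n_users = user_factors.shape[0]
    out = np.empty((n_users, k), dtype=np.int64)
    for start in range(0, n_users, chunk_size):
        stop = min(start + chunk_size, n_users)
        # (chunk, factors) @ (factors, n_items) -> (chunk, n_items)
        scores = user_factors[start:stop] @ item_factors.T
        ids = np.argpartition(scores, -k, axis=1)[:, -k:]
        top = np.take_along_axis(scores, ids, axis=1)
        order = np.argsort(-top, axis=1)
        out[start:stop] = np.take_along_axis(ids, order, axis=1)
    return out

# e.g. recommend_all(model.user_factors, model.item_factors, k=10)
```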