Skip to content

Commit

Permalink
expand docs
Browse files Browse the repository at this point in the history
  • Loading branch information
david-cortes committed Jul 22, 2023
1 parent 5750d0c commit 20de19e
Show file tree
Hide file tree
Showing 3 changed files with 276 additions and 0 deletions.
103 changes: 103 additions & 0 deletions R/cmfrec.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,103 @@
#' @name cmfrec
#' @title cmfrec package
#' @description This is a library for approximate low-rank matrix factorizations, mainly oriented
#' towards recommender systems but being also usable for other domains such as dimensionality reduction
#' or imputation of missing data.
#'
#' In short, the main goal behind the models in this library is to produce an approximate factorization
#' of a matrix \eqn{\mathbf{X}} (which might potentially be sparse or very large) as the product of two
#' lower-dimensional matrices:
#' \cr
#' \eqn{ \mathbf{X} \approx \mathbf{A} \mathbf{B}^T }
#' \cr
#' For recommender systems, it is assumed that \eqn{\mathbf{X}} is a matrix encoding user-item
#' interactions, with rows representing users, columns representing items, and values representing
#' some interaction or rating (e.g. numer of times each user listened to different songs), where higher
#' numbers mean better affinity for a given item. Under this setting, the items for which a user might have
#' higher affinity should have larger values in the approximate factorization, and thus items with larger
#' values for a given user can be considered as better candidates to recommend - the idea being to recommend
#' new items to users that have not yet been seen/consumed/bought/etc.
#'
#' This matrix factorization might optionally be enhanced by incorporating side information about users/items
#' in the 'X' matrix being factorized, which is assumed to come in the form of additional matrices
#' \eqn{\mathbf{U}} (for users) and/or \eqn{\mathbf{I}} (for items), which might also get factorized along
#' the way sharing some of the same low-dimensional components.
#'
#' The main function in the library is \link{CMF}, which can fit many variations of the model described
#' above under different settings (e.g. whether entries missing in a sparse matrix are to be taken as zeros
#' or ignored in the objective function, whether to apply centering, to including row/column intercepts,
#' which kind of regularization to apply, etc.).
#'
#' A specialized function for implicit-feedback models is also available under \link{CMF_implicit}, which
#' provides more appropriate defaults for such data.
#' @section Nomenclature used throughout the library:
#' The documentation and function namings use the following naming conventions:
#' \itemize{
#' \item About data: \itemize{
#' \item 'X' -> data about interactions between users/rows and items/columns (e.g. ratings given by users to items).
#' \item 'U' -> data about user/row attributes (e.g. user's age).
#' \item 'I' -> data about item/column attributes (e.g. a movie's genre).
#' }
#' \item About functionalities: \itemize{
#' \item 'warm' -> predictions based on new, unseen 'X' data, and potentially including new 'U' data along.
#' \item 'cold' -> predictions based on new user attributes data 'U', without 'X'.
#' \item 'new' -> predictions about new items based on attributes data 'I'.
#' }
#' \item About function descriptions: \itemize{
#' \item 'existing' -> the user/item was present in the training data to which the model was fit.
#' \item 'new' -> the user/items was not present in the training data that was passed to 'fit'.
#' }
#' }
#'
#' Be aware that the package's functions are user-centric (e.g. it will recommend items for users, but not users for items).
#' If predictions about new items are desired, it's recommended to use the method \link{swap.users.and.items}, as the item-based
#' functions which are provided for convenience might run a lot slower than their user equivalents.
#' @section Implicit and explicit feedback:
#' In recommender systems, data might come in the form of explicit user judgements about items (e.g. movie ratings) or
#' in the form of logged user activity (e.g. number of times that a user listened to each song in a catalogue). The former
#' is typically referred to as "explicit feedback", while the latter is referred to as "implicit feedback".
#'
#' Historically, driven by the Netflix competition, formulations of this problem have geared towards predicting the rating
#' that users would give to items under explicit-feedback datasets, determining the components in the low-rank factorization
#' in a way that minimizes the deviation between predicted and observed numbers on the \bold{observed} data only (i.e. predictions
#' about items that a user did not rate do not play any role in the optimization problem for determining the low-rank factorization
#' as they are simply ignored), but this approach has turned out to oftentimes result in very low-quality recommendations,
#' particularly for users with few data, and is usually not suitable for implicit feedback as the data in such case does not contain
#' any examples of dislikes and might even come in just binary (yes/no) form.
#'
#' As such, research has mostly shifted towards the implicit-feedback setting, in which items that are not consumed by users
#' do play a role in the optimization objective for determining the low-rank components - that is, the goal is more to predict
#' which items would users have consumed than to predict the exact rating that they'd give to them - and the evaluation of recommendation
#' quality has shifted towards looking at how items that were consumed by users would be ranked compared to unconsumed items
#' (evaluation metrics for implicit-feedback for this library can be calculated through the package
#' \href{https://cran.r-project.org/package=recometrics}{recometrics}).
#' @section Other problem domains:
#' The documentation and naming conventions in this library are all oriented towards recommender systems, with the assumption
#' that users are rows in a matrix, items are columns, and values denote interactions between them, with the idea that values
#' under different columns are comparable (e.g. the rating scale is the same for all items).
#'
#' The concept of approximate low-rank matrix factorizations is however still useful for other problem domains, such as general
#' dimensionality reduction for large sparse data (e.g. TF-IDF matrices) or imputation of high-dimensional tabular data, in which
#' assumptions like values being comparable between different columns would not hold.
#'
#' Be aware that functions like \link{CMF} come with some defaults that might not be reasonable in other applications, but which
#' can be changed by passing non-default arguments to functions - for example:
#' \itemize{
#' \item Global centering - the "explicit-feedback" models here will by default calculate a global mean for all entries in 'X' and
#' center the matrix by substracting this value from all entries. This is a reasonable thing to do when dealing with movie ratings
#' as all ratings follow the same scale, but if columns of the 'X' matrix represent different things that might have different ranges
#' or different distributions, global mean centering is probably not going to be desirable or useful.
#' \item User/row biases: models might also have one bias/intercept parameter per row, which in the approximation, would get added
#' to every column for that user/row. This is again a reasonable thing to do for movie ratings, but if the columns of 'X' contain
#' different types of information, it might not be a sensible thing to add.
#' \item Regularization for item/column biases: since the models perform global mean centering beforehand, the item/column-specific
#' bias/intercept parameters will get a regularization penalty ("shrinkage") applied to them, which might not be desirable if
#' global mean centering is removed.
#' }
#' @section Improving performance:
#' This library will run faster when compiled from source with non-default compiler arguments,
#' particularly `-march=native` (replace with `-mcpu=native` for ARM/PPC); and when using an
#' optimized BLAS library for R. See this guide for details:
#' \href{https://github.com/david-cortes/installing-optimized-libraries}{installing optimized libraries}.
#' @seealso \link{CMF} \link{CMF_implicit}
NULL
52 changes: 52 additions & 0 deletions docs/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,14 @@ Package is available on PyPI, can be installed with
pip install cmfrec

This library will run faster when compiled from source with non-default compiler arguments,
particularly ``-march=native`` (replace with ``-mcpu=native`` for ARM/PPC), which gets added automatically
when installing from ``pip``, and when using an optimized BLAS library for SciPy.

See this guide for details:

`<https://github.com/david-cortes/installing-optimized-libraries>`_

Sample Usage
============

Expand Down Expand Up @@ -59,6 +67,50 @@ About function descriptions:

Be aware that the package's functions are user-centric (e.g. it will recommend items for users, but not users for items). If predictions about new items are desired, it's recommended to use the method 'swap_users_and_items', as the item-based functions which are provided for convenience might run a lot slower than their user equivalents.

Implicit and explicit feedback
==============================

In recommender systems, data might come in the form of explicit user judgements about items (e.g. movie ratings) or
in the form of logged user activity (e.g. number of times that a user listened to each song in a catalogue). The former
is typically referred to as "explicit feedback", while the latter is referred to as "implicit feedback".

Historically, driven by the Netflix competition, formulations of the recommendation problem have geared towards predicting the rating
that users would give to items under explicit-feedback datasets, determining the components in the low-rank factorization
in a way that minimizes the deviation between predicted and observed numbers on the **observed** data only (i.e. predictions
about items that a user did not rate do not play any role in the optimization problem for determining the low-rank factorization
as they are simply ignored), but this approach has turned out to oftentimes result in very low-quality recommendations,
particularly for users with few data, and is usually not suitable for implicit feedback as the data in such case does not contain
any examples of dislikes and might even come in just binary (yes/no) form.

As such, research has mostly shifted towards the implicit-feedback setting, in which items that are not consumed by users
do play a role in the optimization objective for determining the low-rank components - that is, the goal is more to predict
which items would users have consumed than to predict the exact rating that they'd give to them - and the evaluation of recommendation
quality has shifted towards looking at how items that were consumed by users would be ranked compared to unconsumed items.

Other problem domains
=====================

The documentation and naming conventions in this library are all oriented towards recommender systems, with the assumption
that users are rows in a matrix, items are columns, and values denote interactions between them, with the idea that values
under different columns are comparable (e.g. the rating scale is the same for all items).

The concept of approximate low-rank matrix factorizations is however still useful for other problem domains, such as general
dimensionality reduction for large sparse data (e.g. TF-IDF matrices) or imputation of high-dimensional tabular data, in which
assumptions like values being comparable between different columns would not hold.

Be aware that classes like ``CMF`` come with some defaults that might not be reasonable in other applications, but which
can be changed by passing non-default arguments - for example:

- Global centering - the "explicit-feedback" models here will by default calculate a global mean for all entries in 'X' and
center the matrix by substracting this value from all entries. This is a reasonable thing to do when dealing with movie ratings
as all ratings follow the same scale, but if columns of the 'X' matrix represent different things that might have different ranges
or different distributions, global mean centering is probably not going to be desirable or useful.
- User/row biases: models might also have one bias/intercept parameter per row, which in the approximation, would get added
to every column for that user/row. This is again a reasonable thing to do for movie ratings, but if the columns of 'X' contain
different types of information, it might not be a sensible thing to add.
- Regularization for item/column biases: since the models perform global mean centering beforehand, the item/column-specific
bias/intercept parameters will get a regularization penalty ("shrinkage") applied to them, which might not be desirable if
global mean centering is removed.

.. toctree::
:maxdepth: 3
Expand Down
Loading

0 comments on commit 20de19e

Please sign in to comment.