A sophisticated game recommendation system that combines K-Means clustering and Nearest Neighbors to provide personalized game recommendations. The system addresses the challenge of decision paralysis faced by players when navigating vast game libraries by offering tailored suggestions based on game attributes and user preferences.
- Motivation
- Project Structure
- Literature Review
- Dataset Description
- Methodology
- Iterative Model Development
- Results and Analysis
- Getting Started
- Team Members
- Future Work
- References
The gaming industry's exponential growth has led to an overwhelming number of choices for players. Our project addresses this challenge by:
- Helping players navigate extensive game libraries effectively
- Reducing decision paralysis through personalized recommendations
- Supporting discovery of both popular and niche titles
- Leveraging machine learning to understand and match player preferences
game-recommendation-system/
├── data/
│   ├── final_data.csv
│   ├── games_with_reviews.csv
│   ├── games.csv
│   └── processed_game_data.csv
├── indexdir/
│   ├── _MAIN_1.toc
│   ├── MAIN_ez3ijabrp49e5se8.seg
│   └── MAIN_WRITELOCK
├── pickle/
│   └── kmeans_model.pkl
├── src/
│   ├── collaborative_filtering.ipynb
│   ├── data_scraping.py
│   ├── eda.ipynb
│   ├── main.ipynb
│   └── main2.ipynb
├── txt/
│   ├── notes.md
│   ├── output.txt
│   ├── testing.txt
│   └── user_agents.txt
└── documentation/
    ├── ML_Project_proposal.pdf
    ├── Project_Presentation.pdf
    └── Project_Report.pdf
Our approach is informed by several key research papers:
- Machine-Learning Item Recommendation System for Video Games
  - Explores ERT and DNN models for personalized recommendations
  - Focuses on real-time user behavior adaptation
  - ERT model showed superior accuracy and scalability
- Content-Based and Context-Based Recommendation Systems
  - Reviews various recommendation techniques
  - Addresses challenges like information overload
  - Emphasizes importance of contextual information
- STEAM Game Recommendations
  - Investigates recommender systems for the STEAM platform
  - Tests various models including FM, DeepNN, and DeepFM
  - Found DeepNN performs best for accuracy and novelty
- Source: Video Games Recommendation System (Kaggle)
- Features:
name | release_date | price | dlc_count | detailed_description | about_the_game | windows | mac | linux | achievements | supported_languages | developers | publishers | categories | genres | estimated_owners | average_playtime_forever
- Source: Steam Store
- Contents:
- 41 million user recommendations
- Game metadata
- User profiles
- Review data
- Numerical Features: Price, Release Year
- Categories & Genres: A list of categories and genres that a game belongs to.
- Platform: 0/1 binary features for the availability of Windows, Mac, and Linux.
- Publishers & Studios: A list of publishers and studios that a game belongs to.
- PlayTime, Description, Supported Languages: Other features that were dropped because they were either missing for many entries or not relevant.
To ensure equal contribution of all features during clustering, numerical features like Price and Release Year were normalized. Binary features like Categories, Genres, and platform support were normalized using StandardScaler. This prevented any single feature or group of features from disproportionately influencing the clustering process.
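The scaling step described above can be sketched as follows. This is a minimal illustration and not the project's actual preprocessing code: the toy DataFrame and its values are made up, and only the column names follow the feature table in this README.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the processed game table (all values are made up).
games = pd.DataFrame({
    "price": [0.0, 9.99, 59.99, 19.99],
    "release_year": [2012, 2018, 2023, 2020],
    "windows": [1, 1, 1, 0],
    "mac": [0, 1, 0, 1],
})

# StandardScaler rescales every column to zero mean and unit variance,
# so a large-valued column like release_year cannot dominate distances.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(games), columns=games.columns)

print(scaled.round(2))
```

After scaling, each column contributes on a comparable scale to the Euclidean distances that K-Means minimizes.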
| # | Column Name | Dtype | Description |
|---|---|---|---|
| 1 | windows | int64 | Binary feature for Windows |
| 2 | mac | int64 | Binary feature for Mac |
| 3 | release_year | int64 | Year of game release |
| 4 | linux | int64 | Binary feature for Linux |
| 5 | price | float64 | Price of the game |
| 6 | categories | object | Categories list |
| 7 | genres | object | Genres list |
| 8 | game_studios | object | Associated game studios |
| 9 | categories_includes_level_editor | int64 | Level editor feature |
| 10 | categories_<category_name> | int64 | One-hot encoded categories |
| ... | ... | ... | ... |
| 52 | genres_nudity | int64 | Binary for genre: nudity |
| 53 | genres_casual | int64 | Binary for genre: casual |
| 54 | genres_short | int64 | Binary for genre: short |
| 55 | genres_video_production | int64 | Binary for genre: video production |
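The one-hot columns in the table (for example genres_casual) can be derived from the list-valued genres column. The sketch below is one possible way to do this, assuming scikit-learn's MultiLabelBinarizer; the project may have used a different encoder, and the toy rows are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical rows; each game carries a list of genre labels.
games = pd.DataFrame({
    "name": ["Game A", "Game B", "Game C"],
    "genres": [["Action", "Casual"], ["Casual"], ["Indie", "Action"]],
})

# One binary column per distinct genre, named like the table above.
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(games["genres"]),
    columns=[f"genres_{g.lower().replace(' ', '_')}" for g in mlb.classes_],
    index=games.index,
)

# Replace the list column with its binary expansion.
games_encoded = pd.concat([games.drop(columns="genres"), encoded], axis=1)
print(games_encoded)
```

The same pattern would apply to the categories column with a `categories_` prefix.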
- Feature Engineering
- Numerical features: Price, Release Year
- Binary features: Platform support, categories, genres
- Studio clustering using K-Means++
- Dimensionality Reduction
- Applied Truncated SVD
- Reduced studio data to 60 components
- Achieved 6.8% explained variance
Add SVD variance explanation chart
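The dimensionality reduction step can be sketched as below. The 60 components come from the list above; everything else is an assumption for illustration: the studio matrix here is random synthetic data, so its explained variance will not match the 6.8% reported for the real dataset.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Synthetic one-hot studio matrix: each game is assigned one random studio.
rng = np.random.default_rng(0)
n_games, n_studios = 500, 200
rows = np.arange(n_games)
cols = rng.integers(0, n_studios, size=n_games)
studio_matrix = csr_matrix(
    (np.ones(n_games), (rows, cols)), shape=(n_games, n_studios)
)

# TruncatedSVD works directly on sparse input, unlike PCA, which centers the data.
svd = TruncatedSVD(n_components=60, random_state=0)
studio_reduced = svd.fit_transform(studio_matrix)

explained = svd.explained_variance_ratio_.sum()
print(studio_reduced.shape, f"explained variance: {explained:.1%}")
```

A low explained variance is expected for sparse one-hot studio data, because most studios appear in only a few games and share little structure.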
Random centroid initialization has some shortcomings when clustering:
- Reaching the global minimum through random initialization is not guaranteed and, in most cases, is highly unlikely.
- Silhouette analysis revealed poorly separated clusters.
- Non-Convex Optimization Problem
  - Multiple local minima exist
  - Final clustering is highly dependent on initial centroid positions
  - May lead to:
    - Splitting of a single cluster
    - Merging of two clusters
- Solution: The K-Means++ algorithm improves clustering by initializing centroids in a smarter, probabilistic way that spreads them out, reducing the chances of poor convergence and suboptimal clusters. It typically results in faster convergence and better clustering quality than random initialization in standard K-Means.
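The difference between the two initialization strategies can be explored with a small experiment. This is a sketch on synthetic blobs, not the project's data: the dataset, cluster count, and the helper name `fit_inertia` are all assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with five well-defined groups.
X, _ = make_blobs(n_samples=600, centers=5, cluster_std=1.0, random_state=42)

def fit_inertia(init):
    """Fit K-Means once with the given init and return (inertia, iterations)."""
    model = KMeans(n_clusters=5, init=init, n_init=1, random_state=0)
    model.fit(X)
    return model.inertia_, model.n_iter_

# n_init=1 exposes the effect of a single initialization of each kind.
results = {init: fit_inertia(init) for init in ("random", "k-means++")}
for init, (inertia, n_iter) in results.items():
    print(f"{init:>10}: inertia={inertia:.1f}, iterations={n_iter}")
```

Lower inertia and fewer iterations indicate a better start. A single run can go either way, so repeating the comparison over several `random_state` values gives a fairer picture.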
- Evaluated multiple distance metrics:
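# Distance metric comparison
from sklearn.metrics.pairwise import (
    cosine_distances,
    euclidean_distances,
    manhattan_distances,
)

metrics = {
    'euclidean': euclidean_distances,
    'manhattan': manhattan_distances,
    'cosine': cosine_distances,
}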
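One way to compare these metrics in practice is to run the same nearest-neighbor query under each of them and inspect how the neighbor lists differ. The sketch below assumes scikit-learn's NearestNeighbors and a random feature matrix standing in for the real game features.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Random stand-in for the scaled game feature matrix (50 games, 8 features).
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 8))
query = features[:1]  # use game 0 as the query

neighbors_by_metric = {}
for metric in ("euclidean", "manhattan", "cosine"):
    nn = NearestNeighbors(n_neighbors=4, metric=metric, algorithm="brute")
    nn.fit(features)
    _, indices = nn.kneighbors(query)
    # The closest hit is the query game itself, so skip it.
    neighbors_by_metric[metric] = indices[0, 1:].tolist()

for metric, ids in neighbors_by_metric.items():
    print(f"{metric:>9}: {ids}")
```

Cosine distance ignores vector magnitude, so it can rank neighbors differently from Euclidean or Manhattan distance on the same features.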
Cluster-wise Silhouette Analysis:
- Cluster 1.0: 0.263 (46,626 games) - Good structure
- Cluster 3.0: 0.108 (16,156 games) - Normal structure
- Cluster 4.0: 0.065 (1,612 games) - Normal structure
- Cluster 2.0: 0.043 (19,331 games) - Weak structure
- Cluster 0.0: -0.010 (13,679 games) - Potential misclassification
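Per-cluster silhouette scores like the ones listed above can be computed by averaging per-sample silhouette values within each cluster. The following is a sketch on synthetic data, assuming scikit-learn's `silhouette_samples`; the numbers it prints are unrelated to the results reported in this README.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Synthetic data standing in for the game feature matrix.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(
    n_clusters=3, init="k-means++", n_init=10, random_state=0
).fit_predict(X)

# One silhouette value per sample, in the range [-1, 1].
sample_scores = silhouette_samples(X, labels)

# Map each cluster to (mean silhouette, cluster size).
per_cluster = {
    int(c): (float(sample_scores[labels == c].mean()), int((labels == c).sum()))
    for c in np.unique(labels)
}
for cluster, (score, size) in per_cluster.items():
    print(f"Cluster {cluster}: {score:.3f} ({size} samples)")
```

A negative cluster mean, as reported for Cluster 0.0 above, suggests that many of its points lie closer to a neighboring cluster than to their own.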
python>=3.8
numpy
pandas
scikit-learn
scipy
whoosh
- Clone the repository
git clone [your-repository-link]
cd game-recommendation-system
- Install required packages
pip install -r requirements.txt
from src.recommender import GameRecommender
# Initialize the recommender
recommender = GameRecommender()
# Get recommendations for a game
recommendations = recommender.get_recommendations("FIFA")
# Example output:
# 1. FIFA 23 (Match Score: 0.89)
# 2. Pro Evolution Soccer 2023 (Match Score: 0.82)
# 3. FIFA 22 (Match Score: 0.81)
Aditya Sharma @adsh16
- Data preprocessing
- Feature engineering
- Clustering analysis
- Model development
- EDA, SVD, fuzzy search
Kanishk Kumar Meena @KanishkKumarMeena
- Data cleaning
- Collaborative Filtering
- Model evaluation
Vansh Aggarwal @VanshAg283
- Dataset management
- Visualization
- EDA
- Performance Testing
- More robust, real-world model performance testing, using available databases of similar or co-bought games to compare against the model's recommendations.
- Integration of user interaction data for hybrid recommendations
- Enhancement of clustering algorithms for better game categorization
- Implementation of real-time recommendation updates
- Addition of more sophisticated feature engineering techniques
- Development of a user interface for easier interaction
- The model runs for the query "fifa".
- The model first finds the nearest matching strings available in the database for the query name.
- It uses the top match to find the top recommendations, which are then sorted by critic score.
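The three steps above can be sketched end to end as follows. This is a simplified illustration under several assumptions: `difflib` stands in for the project's fuzzy search, the cluster lookup is omitted, the function name `recommend` is hypothetical, and the titles, feature vectors, and critic scores are toy values.

```python
import difflib
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy catalogue; scores and feature vectors are made up.
games = pd.DataFrame({
    "name": ["FIFA 23", "Pro Evolution Soccer 2023", "Goal Rush",
             "Space Miner", "Farm Days"],
    "critic_score": [76, 70, 85, 90, 88],
})
features = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0], [0.1, 0.9]])

def recommend(query, top_n=2):
    """Fuzzy-match the query, find nearest games, sort by critic score."""
    by_lower = {name.lower(): idx for idx, name in enumerate(games["name"])}
    matches = difflib.get_close_matches(query.lower(), list(by_lower), n=1, cutoff=0.3)
    if not matches:
        return games.iloc[0:0]  # empty result when nothing resembles the query
    seed = by_lower[matches[0]]
    nn = NearestNeighbors(n_neighbors=top_n + 1).fit(features)
    _, idx = nn.kneighbors(features[[seed]])
    # Drop the matched game itself from its own neighbor list.
    neighbor_ids = [i for i in idx[0] if i != seed][:top_n]
    return games.iloc[neighbor_ids].sort_values("critic_score", ascending=False)

print(recommend("fifa"))
```

In the full system, the neighbor search would presumably be restricted to the K-Means cluster of the matched game, which keeps the search fast on a large catalogue.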
- Video Games Recommendation System (Kaggle)
- Game Recommendations on Steam (Kaggle)
- Paul Bertens, et al. "A Machine-Learning Item Recommendation System for Video Games"
- Umair Javed and Kamran Shaukat, "A Review of Content-Based and Context-Based Recommendation Systems"
- Germán Cheuque, et al. "Recommender Systems for Online Video Game Platforms: the Case of STEAM"
This project is licensed under the MIT License - see the LICENSE file for details.
- IIIT Delhi for project support and guidance
- Kaggle and Steam for providing comprehensive datasets
- The gaming community for inspiration and feedback