This repository contains implementations of a machine learning pipeline that predicts the job titles of GitHub users. Both classical ML models and graph deep learning models are available. The GitHub Social Network dataset is used as the reference for the user subset, but a new set of features related to users' repositories, companies and descriptive statistics is extracted.
You need to download the GitHub Social Network dataset. To make requests to the GitHub REST API, you also need to install the GitHub CLI. The following Python packages are required:
- scikit-learn
- pandas
- PyTorch
- DGL
Authenticate with GitHub using:
gh auth login
The following functions from 'utils.py' should be called to get the necessary data from GitHub:
get_absent_users_from_api
get_user_relations_from_api
get_user_repos_from_api
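The signatures of these helpers are not shown here, so the following is only a minimal sketch of how such a helper might query the GitHub REST API through the authenticated GitHub CLI. The function name `fetch_user_repos` and its injectable `run` parameter are hypothetical and not taken from 'utils.py'; the `gh api` command and the `users/{login}/repos` endpoint are real.

```python
import json
import subprocess


def fetch_user_repos(login, run=subprocess.run):
    """Return the first page of public repositories for `login`.

    Hypothetical helper for illustration, not the repository's own code.
    `run` is injectable so the function can be exercised without network access.
    """
    # `gh api` reuses the credentials stored by `gh auth login`.
    # Only the first page (up to 100 repositories) is requested here.
    cmd = ["gh", "api", f"users/{login}/repos?per_page=100"]
    result = run(cmd, capture_output=True, text=True, check=True)
    # The endpoint returns a JSON array with one object per repository.
    return json.loads(result.stdout)
```

Calling `fetch_user_repos("octocat")` after `gh auth login` would return a list of dictionaries, one per repository. Users with more than 100 repositories would need additional pagination handling.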
Execute the 'edge analysis.ipynb', 'label_analysis.ipynb' and 'feature_extraction.ipynb' notebooks.
You can use the "run.py" script to train and evaluate models.
Example usage:
python run.py --model GraphSAGE --feature-selection select_from_model --select-from extra_trees --undirected --h-feats 400
All options:
options:
-h, --help show this help message and exit
--model {NaiveBayes,LogisticRegression,GCN,GraphSAGE}
Model name.
--lr-max-iter LR_MAX_ITER
Logistic Regression iteration.
--lr LR Learning rate for GCNs.
--h-feats H_FEATS Hidden units.
--epochs EPOCHS Number of epochs.
--patience PATIENCE Number of iterations to wait for improvement before early stopping.
--undirected Make the graph undirected.
--feature-selection {None,variance,select_from_model}
Feature selection method.
--variance-threshold VARIANCE_THRESHOLD
Threshold value for variance feature selection.
--select-from {svc,extra_trees}
Select features according to given model.
--n-splits {1,5} Number of splits for k-fold cross-validation.
--neighborhood-features {mean,max}
Neighborhood aggregation function for non-graph models.
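For the non-graph models, `--neighborhood-features` presumably extends each user's feature vector with an aggregate of its neighbors' features. The sketch below illustrates that idea in plain Python, together with the effect of `--undirected`. The function name, the data layout (a feature dictionary plus an edge list) and the zero fallback for isolated nodes are assumptions, not the behavior of "run.py".

```python
def neighborhood_features(features, edges, agg="mean", undirected=False):
    """Append an aggregate of neighbor features to each node's own features.

    features: dict mapping node -> list of floats (all the same length)
    edges: iterable of (src, dst) pairs; dst counts as a neighbor of src
    Illustrative only; names and layout are assumptions.
    """
    if agg not in ("mean", "max"):
        raise ValueError("agg must be 'mean' or 'max'")

    # Build neighbor sets; an undirected graph also stores the reverse edge.
    neighbors = {node: set() for node in features}
    for src, dst in edges:
        neighbors[src].add(dst)
        if undirected:
            neighbors[dst].add(src)

    augmented = {}
    for node, own in features.items():
        rows = [features[n] for n in neighbors[node]]
        if not rows:
            # Isolated node: fall back to zeros (an assumption of this sketch).
            aggregated = [0.0] * len(own)
        elif agg == "mean":
            aggregated = [sum(col) / len(col) for col in zip(*rows)]
        else:
            aggregated = [max(col) for col in zip(*rows)]
        augmented[node] = list(own) + aggregated
    return augmented
```

With a directed edge list, a node that is only followed by others gets the zero fallback; passing `undirected=True` lets it aggregate over those users as well.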
Model | Weighted F-1 |
---|---|
#1 Logistic Regression | 0.752 ± 0.009 |
#2 Naive Bayes | 0.736 ± 0.007 |
#3 GraphSAGE | 0.762 ± 0.008 |
#4 GCN | 0.758 ± 0.006 |
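
Weighted F-1 averages the per-class F-1 scores, weighting each class by its number of true instances (its support). The sketch below computes it in plain Python for illustration; it follows the same definition as scikit-learn's `f1_score` with `average="weighted"`, although whether "run.py" uses that function is an assumption. The label values in the usage note are made up.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F-1 scores."""
    support = Counter(y_true)  # number of true instances per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives and predicted positives for this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        # Each class contributes in proportion to its support.
        score += f1 * n_true / total
    return score
```

For example, `weighted_f1(["dev", "dev", "dev", "ml"], ["dev", "dev", "ml", "ml"])` gives per-class scores of 0.8 and about 0.667, weighted 3:1, for a result of about 0.767.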
Run experiment.sh to reproduce the results of the experiments in this study.