The objective of this project is to identify age-related conditions using a dataset provided for the "ICR - Identifying Age-Related Conditions" competition on Kaggle. The analysis involves data preprocessing, exploratory data analysis, feature engineering, and model building to predict the target variable related to age-related conditions.
The raw data was cluttered, which gave us many opportunities to experiment in this project. We worked through various preprocessing and analysis methods and identified the best features to rely on. We also learned to use a variety of evaluation metrics, which helped us improve our model's accuracy.
The dataset contains various features, including medical and demographic information. Key columns include:
- `Id`: Unique identifier for each entry.
- `target`: The target variable indicating the presence of an age-related condition.
- Multiple numerical and categorical features representing patient data.
- Loading Data: The data is loaded from CSV files.
- Handling Missing Values: Missing values are addressed by imputation or removal.
- Encoding Categorical Variables: Categorical variables are encoded using techniques like one-hot encoding.
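
The three preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the inline CSV and the column names `feat_a`, `feat_b`, and `group` are hypothetical stand-ins for the competition files, and median imputation is just one of the possible choices.

```python
import io

import pandas as pd

# Hypothetical stand-in for the competition CSV; the real files have many more columns.
csv_text = """Id,feat_a,feat_b,group,target
p1,1.0,10.0,A,0
p2,,12.0,B,1
p3,3.0,,A,0
p4,4.0,16.0,B,1
"""

# Loading: with a real file this would be pd.read_csv("<path to csv>").
df = pd.read_csv(io.StringIO(csv_text))

# Handling missing values: impute each numeric column with its median.
# (Dropping rows with df.dropna() is the removal alternative.)
numeric_cols = ["feat_a", "feat_b"]
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Encoding: one-hot encode the categorical column into 0/1 indicator columns.
df = pd.get_dummies(df, columns=["group"], dtype=int)

print(df)
```

In a real train/test setting, the imputation values would normally be computed on the training split only and then reused on the test split.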
- Visualization: Histograms, scatter plots, and correlation matrices are used to understand data distribution and relationships.
- Summary Statistics: Descriptive statistics provide insights into the central tendency and variability of features.
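
As a rough sketch of the exploratory step, summary statistics and a correlation matrix can be computed with pandas alone. The data here is synthetic and the column names are hypothetical; plots are left out so the example runs without a plotting backend.

```python
import numpy as np
import pandas as pd

# Synthetic data: feat_b is built to correlate strongly with feat_a,
# while feat_c is independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame(
    {
        "feat_a": x,
        "feat_b": 2.0 * x + rng.normal(scale=0.1, size=200),
        "feat_c": rng.normal(size=200),
    }
)

# Summary statistics: count, mean, std, min, quartiles, and max per column.
summary = df.describe()

# Correlation matrix: pairwise Pearson correlation between numeric columns.
corr = df.corr()

print(summary)
print(corr.round(2))
```

With matplotlib installed, `df.hist()` and `df.plot.scatter(x="feat_a", y="feat_b")` would draw the corresponding histograms and scatter plots.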
- Scaling: Numerical features are standardized to ensure uniformity.
- Interaction Features: New features are created based on interactions between existing features to capture complex relationships.
- Model Selection: Various models, including Logistic Regression, Decision Trees, and Random Forest, are evaluated.
- Hyperparameter Tuning: Grid search and cross-validation are used to optimize model parameters.
- Evaluation Metrics: Models are evaluated using metrics such as accuracy, precision, recall, and F1-score.
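
The model-building loop described above might look like the sketch below. It runs on synthetic data from `make_classification`, and the parameter grids and the `f1` scoring choice are illustrative assumptions, not the values used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the real dataset.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Candidate models, each with a small (hypothetical) hyperparameter grid.
candidates = {
    "logistic_regression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "decision_tree": (DecisionTreeClassifier(random_state=0), {"max_depth": [3, 5, None]}),
    "random_forest": (RandomForestClassifier(random_state=0), {"n_estimators": [50, 100]}),
}

results = {}
for name, (model, grid) in candidates.items():
    # Grid search with 5-fold cross-validation on the training split only.
    search = GridSearchCV(model, grid, cv=5, scoring="f1")
    search.fit(X_train, y_train)

    # Evaluate the refit best estimator on the held-out test split.
    y_pred = search.predict(X_test)
    results[name] = {
        "best_params": search.best_params_,
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }

for name, scores in results.items():
    print(name, scores)
```

Keeping the test split out of the grid search means the reported metrics are not inflated by the tuning itself.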
- Model Performance: The performance of different models is compared, and the best-performing model is selected.
- Feature Importance: Key features contributing to the model's predictions are identified.
- ROC Curves: Receiver Operating Characteristic curves are used to evaluate the model's ability to distinguish between classes.
- Confusion Matrix: The confusion matrix is visualized to understand the classification performance.
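
The quantities behind these results can be computed as in the following sketch. It assumes a Random Forest as the selected model, which the README does not state, and uses synthetic data; the plotting itself is omitted.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the real dataset.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Assumed best model for illustration only.
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Feature importance: impurity-based importances, ranked from most to least important.
importances = model.feature_importances_
ranking = importances.argsort()[::-1]

# ROC curve: computed from predicted probabilities of the positive class.
proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, proba)
auc = roc_auc_score(y_test, proba)

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, model.predict(X_test))

print("feature ranking:", ranking)
print("ROC AUC:", round(auc, 3))
print(cm)
```

Plotting `fpr` against `tpr` gives the ROC curve, and `cm` can be passed to a heatmap or to scikit-learn's `ConfusionMatrixDisplay` for the visualization.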
The project successfully identifies the most significant features and builds a predictive model for age-related conditions.
Pranjal Vanjale
Siddhant Gupta
Arihant Aggarwal
Kuldeep