Loan Default

🏦 Loan Default Prediction Project

📂 Dataset Information

The dataset contains loan-related features, including demographic, financial, and behavioral attributes.

  • Total Rows: 150,000
  • Total Columns: 22
  • Target Variable: status (0 = Non-Default, 1 = Default)

🔧 Data Preprocessing

🛠 Null Value Imputation

Several columns had missing values and were handled as follows:

  • Debt-to-Income Ratio (dtir1): Imputed using the median grouped by income.
  • Property Value (property_value), Income (income), Loan-to-Value Ratio (LTV): Imputed using the mean.
  • Term (term), Age (age): Imputed using the most frequent value (mode).
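
The imputation rules above could be sketched in pandas roughly as follows. Only the column names come from the post. The tiny DataFrame and its values are made up for illustration, and reading "grouped by income" as grouping on the income value itself (with a global-median fallback) is an assumption, not the author's confirmed code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names are taken from the post.
df = pd.DataFrame({
    "income": [3000.0, 3000.0, 3000.0, 8000.0, 8000.0, np.nan],
    "dtir1": [30.0, np.nan, 40.0, 20.0, np.nan, 25.0],
    "property_value": [100000.0, np.nan, 140000.0, 300000.0, 320000.0, 280000.0],
    "LTV": [80.0, 75.0, np.nan, 60.0, 65.0, 70.0],
    "term": [360.0, 360.0, np.nan, 180.0, 360.0, 360.0],
    "age": ["35-44", None, "35-44", "45-54", "35-44", "<25"],
})

# Mean imputation for the continuous columns.
for col in ["property_value", "income", "LTV"]:
    df[col] = df[col].fillna(df[col].mean())

# Most-frequent-value (mode) imputation.
for col in ["term", "age"]:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

# Median of dtir1 within each income group (assumed interpretation).
# Income is filled first so that every row has a group key; any value
# still missing afterwards falls back to the overall median.
group_median = df.groupby("income")["dtir1"].transform("median")
df["dtir1"] = df["dtir1"].fillna(group_median).fillna(df["dtir1"].median())

print(df)
```

With real, continuous incomes, exact-value groups would often hold a single row, so binning income first (for example with `pd.qcut`) may be a more practical grouping key.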

🔢 Encoding Categorical Variables

  • Ordinal columns were encoded using Label Encoding.
  • Age ranges (e.g., <25, 45-54) were converted to their midpoints so that age could be used as a numeric feature.
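
As an illustration of the midpoint conversion, here is a minimal sketch. The function name `age_band_to_midpoint` is hypothetical, and the post does not say how open-ended bands such as <25 were handled. This sketch assumes they span the same 10-year width as the closed bands, so <25 is treated as 15-24. The `>74` label is likewise assumed for the example.

```python
import re

def age_band_to_midpoint(band: str, open_width: int = 10) -> float:
    """Convert an age-band label to a numeric midpoint.

    Closed bands ("45-54") use the plain average of their bounds.
    Open-ended bands ("<25", ">74") are assumed to be `open_width`
    years wide, which is an illustrative choice, not the post's rule.
    """
    band = band.strip()

    closed = re.fullmatch(r"(\d+)\s*-\s*(\d+)", band)
    if closed:
        low, high = int(closed.group(1)), int(closed.group(2))
        return (low + high) / 2

    below = re.fullmatch(r"<\s*(\d+)", band)
    if below:
        upper = int(below.group(1))          # exclusive upper bound
        return ((upper - open_width) + (upper - 1)) / 2

    above = re.fullmatch(r">\s*(\d+)", band)
    if above:
        lower = int(above.group(1))          # exclusive lower bound
        return ((lower + 1) + (lower + open_width)) / 2

    raise ValueError(f"Unrecognized age band: {band!r}")

for label in ["<25", "45-54", ">74"]:
    print(label, age_band_to_midpoint(label))
```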

⚖ Handling Imbalanced Data (SMOTE)

The target variable was imbalanced, with defaults making up roughly a quarter of the labeled records:

  • Non-Default (0): 112,031 instances
  • Default (1): 36,591 instances

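To reduce the imbalance, SMOTE (Synthetic Minority Over-sampling Technique) was applied with sampling_strategy=0.7. Assuming the imbalanced-learn implementation, a float value is the target ratio of minority to majority samples after resampling, so the minority class is oversampled to about 70% of the majority class size rather than to full parity.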
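
To make the effect of sampling_strategy=0.7 concrete, the arithmetic can be worked through on the class counts listed above. This sketch assumes the imbalanced-learn convention (a float is the desired minority-to-majority ratio after resampling) and assumes SMOTE was applied to the full counts shown. If it was applied only to a training split, the numbers would scale accordingly.

```python
# Class counts as reported in the post.
n_majority = 112_031   # Non-Default (0)
n_minority = 36_591    # Default (1)
sampling_strategy = 0.7

# Number of synthetic minority samples needed to reach the target ratio
# (truncated to an integer, as imbalanced-learn does for float strategies).
n_synthetic = int(n_majority * sampling_strategy - n_minority)
n_minority_after = n_minority + n_synthetic

ratio_before = n_minority / n_majority
ratio_after = n_minority_after / n_majority

print(f"synthetic samples added: {n_synthetic}")
print(f"minority count after:    {n_minority_after}")
print(f"minority/majority ratio: {ratio_before:.3f} -> {ratio_after:.3f}")

# With imbalanced-learn, the resampling call itself would typically be
# (X and y being the feature matrix and the status labels):
#   from imblearn.over_sampling import SMOTE
#   X_res, y_res = SMOTE(sampling_strategy=0.7, random_state=42).fit_resample(X, y)
```

Under these assumptions the majority class is left untouched and only synthetic default cases are added.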

📊 Model Selection

Several machine learning models were evaluated, including:

  • XGBoost Classifier
  • Random Forest Classifier
  • Gradient Boosting Classifier
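
A comparison of candidate models like the ones above might be set up as in the sketch below. It runs on synthetic stand-in data, not the loan dataset, and uses only the two scikit-learn models to stay dependency-light. The cross-validation setup and hyperparameters are assumptions for illustration, not the author's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in data with roughly 25% positives (hypothetical).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.75, 0.25], random_state=0
)

candidates = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    # xgboost.XGBClassifier follows the same fit/predict interface and
    # could be added to this dict in the same way.
}
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]

# Score every candidate on the same stratified 3-fold split.
results = {}
for name, model in candidates.items():
    cv_scores = cross_validate(model, X, y, cv=3, scoring=scoring)
    results[name] = {m: cv_scores[f"test_{m}"].mean() for m in scoring}

for name, scores in results.items():
    print(name, {m: round(float(s), 3) for m, s in scores.items()})
```

Scoring every candidate with the same folds and the same metrics keeps the comparison like for like.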

🚀 Why XGBoost?

XGBoost outperformed the other models on accuracy, precision, recall, and F1-score.

✅ Model Performance (Before Hyperparameter Tuning)

Metric       Score
Accuracy     91.32%
Precision    81.5%
Recall       96.7%
F1-Score     88.4%
AUC-ROC      92.1%
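
For reference, precision, recall, and F1 relate to each other as in the sketch below. The confusion-matrix counts are hypothetical, since the post does not report them. The last lines check that the F1 implied by the reported precision and recall is close to the reported F1-score.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)   # share of predicted defaults that were real
    recall = tp / (tp + fn)      # share of real defaults that were caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts, for illustration only.
p, r, f1 = precision_recall_f1(tp=90, fp=20, fn=10)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")

# Consistency check on the reported pre-tuning figures:
# F1 is the harmonic mean of precision (81.5%) and recall (96.7%).
reported_precision, reported_recall = 0.815, 0.967
implied_f1 = (
    2 * reported_precision * reported_recall
    / (reported_precision + reported_recall)
)
print(f"implied F1 = {implied_f1:.4f}")
```

High recall combined with lower precision means the model catches most defaults at the cost of flagging some non-defaulting loans.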

✅ Final Model Performance (After Tuning)

Metric       Score
Accuracy     91.2%
Precision    80.6%
Recall       96.6%
F1-Score     88.05%
AUC-ROC      92.3%

📂 GitHub Repository

🔗 Link to GitHub Repo

This post is licensed under CC BY 4.0 by the author.