Concept to Code: Building Blocks of ML Script


In the last post we created the setup for a machine learning project. In this part we move on to writing code: importing models from libraries and configuring their hyperparameters.

Configuring a Scikit-learn Model for Training

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import lightgbm as lgb
import mlflow


class ModelTrainer:
    def __init__(self):
        self.models = {
            'random_forest': RandomForestClassifier(
                n_estimators=100,# A Random Forest builds multiple decision trees, 
                                 # and n_estimators dictates how many individual trees will be 
                                 # constructed and averaged (or voted upon) for the final prediction. 
                                 # More trees generally lead to more robust models but also increase computation time.

                n_jobs=-1 # controls the number of CPU cores used for parallel processing. 
                          # A value of -1 means that the model will use all available CPU cores 
                          # to train the trees concurrently, which can significantly speed up the 
                          # training process on multi-core machines.
            ),
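            # XGBoost: a gradient-boosted tree ensemble; only the tree count
            # and CPU parallelism are set explicitly here.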
            'xgboost': xgb.XGBClassifier(
                n_estimators=100,
                n_jobs=-1
            ),
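            # LightGBM: gradient-boosted trees grown leaf-wise; each of the
            # hyperparameters below is explained later in this post.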
            'lightgbm': lgb.LGBMClassifier(
                n_estimators=100,
                n_jobs=-1,
                min_child_samples=20,
                min_split_gain=0.0,
                max_depth=15,
                num_leaves=31,
                learning_rate=0.1,
                colsample_bytree=0.8,
                subsample=0.8,
                reg_alpha=0.1,
                reg_lambda=0.1,
                verbose=-1
            )
        }
        self.results = {}

In this script we create multiple models: random_forest, xgboost and lightgbm, each of which can address a classification problem. By training and evaluating multiple models on the same dataset, we can compare their results and gain insight into how suitable each model is for the target dataset and application.
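
As a rough sketch of how such a comparison loop might look, the snippet below iterates over the models dictionary, trains each model and records a couple of metrics. The synthetic dataset, the train/test split and the choice of metrics are assumptions for illustration only and are not part of the class above.

# A minimal sketch of the comparison loop, assuming the ModelTrainer class
# defined above; the synthetic dataset and split are purely illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

trainer = ModelTrainer()
for name, model in trainer.models.items():
    model.fit(X_train, y_train)          # model parameters are learned here
    preds = model.predict(X_test)
    trainer.results[name] = {
        'accuracy': accuracy_score(y_test, preds),
        'f1': f1_score(y_test, preds),
    }

print(trainer.results)  # compare the three models on the same held-out data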

When creating a model from a library, we configure specific settings called ‘hyperparameters’, and these differ fundamentally from model parameters. Model parameters are learned and optimized through exposure to training data, whereas hyperparameters are predefined configurations that we set manually to control the algorithm’s learning behavior and process.
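
To make the distinction concrete, here is a tiny sketch (the toy data is purely illustrative): n_estimators is a hyperparameter we choose before training, while the fitted trees and feature_importances_ are parameters the model learns from the data.

# Hyperparameter vs. learned parameter; the four-sample dataset is a
# toy example for demonstration only.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100)   # hyperparameter: set by us
rf.fit([[0, 1], [1, 0], [1, 1], [0, 0]], [0, 1, 1, 0])

print(len(rf.estimators_))       # 100 fitted trees: learned structures
print(rf.feature_importances_)   # learned from the training data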

Now, taking another look at a portion of the script:

            'lightgbm': lgb.LGBMClassifier(
                n_estimators=100,
                n_jobs=-1,
                min_child_samples=20,
                min_split_gain=0.0,
                max_depth=15,
                num_leaves=31,
                learning_rate=0.1,
                colsample_bytree=0.8,
                subsample=0.8,
                reg_alpha=0.1,
                reg_lambda=0.1,
                verbose=-1
            )

We can see a number of hyperparameter values set when the lightgbm model is initialized. We will cover each of these parameters and their values shortly; the point of focus here is how significant these hyperparameters are to the quality of the trained model at the end of training, and how essential they are to its performance on unseen or new data.

What Is the Correct “Fit” for an ML Model?

Imagine we’ve trained and deployed a model. It achieves 99% accuracy on the training data, but performs poorly on new, unseen data. This likely indicates overfitting—the model has learned not just the underlying patterns, but also the noise and outliers specific to the training set. As a result, it struggles to generalize to new data where those same quirks don’t exist.

On the other hand, if the model is too simple or not trained enough to capture the patterns in the data, it underfits, leading to poor performance on both the training and test sets.

An overfitted model is similar to a student who has memorized answers to specific questions. If the exam repeats those exact questions, the student excels. But if the questions are even slightly different, their performance drops significantly.

An underfitted model is like a student who hasn’t studied enough or hasn’t understood the subject well. They try to answer exam questions using only the most basic concepts, regardless of the question’s complexity. As a result, they perform poorly, not just on the exam, but even on practice questions they’ve already seen.
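
In code, this shows up as a gap between training and test scores. Below is a minimal sketch of such a check, assuming the data has already been split into X_train, X_test, y_train and y_test (placeholder names, not from the script above).

# Diagnosing fit by comparing training and held-out accuracy.
from sklearn.metrics import accuracy_score

def fit_report(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    # A large gap (e.g. 0.99 train vs 0.75 test) suggests overfitting;
    # low scores on both suggest underfitting.
    print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")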

To prevent the model from drifting into either the underfitting or the overfitting zone, we have to

1. restrain the model from becoming too complex, &

2. improve the quality of the training dataset.

Model complexity can be controlled through the model’s initialization parameters. Below is an explanation of the parameters used for the LightGBM model.

lgb.LGBMClassifier (LightGBM)

LightGBM is configured with more specific hyperparameters for fine-tuning its gradient boosting process:

  • min_child_samples=20:
    • This specifies the minimum number of data points (or samples) required in a leaf node. If a split results in a leaf with fewer samples than this, it won’t be made. This helps to prevent overfitting by ensuring that leaf nodes are not too specific to individual data points.
    • If a split in a tree results in a leaf node having fewer than 20 samples, that split won’t be made. This prevents the model from learning extremely specific patterns from very small subsets of data. 20 is a common sensible starting point.
  • min_split_gain=0.0:
    • A split will only be considered if the gain (reduction in loss) it provides is greater than or equal to this value. Setting it to 0.0 means even very small gains are considered, which might make the tree more complex. Increasing this value can act as a regularization technique.
  • max_depth=15:
    • The maximum depth of each individual tree. This limits how deep the trees can grow, which is another way to control model complexity and prevent overfitting.
    • 15 is a moderately deep value that allows for significant complexity without necessarily leading to extreme overfitting, especially with n_estimators set to 100 and other regularization. Default is often None (no limit) or a smaller number like 6 in other boosting implementations.
  • num_leaves=31:
    • This defines the maximum number of leaves (terminal nodes) in any tree. LightGBM grows trees leaf-wise, and num_leaves is a key parameter for controlling model complexity, often more important than max_depth in LightGBM.
    • A common rule of thumb is that num_leaves should be less than or equal to 2^max_depth. Since 2^5 = 32, a max_depth of around 5 or 6 would correspond to 31 leaves if the tree were balanced, but leaf-wise growth allows for more flexibility.
  • learning_rate=0.1:
    • Also known as shrinkage, this determines the contribution of each tree to the final prediction. A smaller learning rate means each tree has less impact, requiring more n_estimators but potentially leading to a more robust model.
    • A smaller learning rate (0.1 is common) makes the boosting process more conservative, which generally improves robustness and prevents overfitting, but it often requires a larger n_estimators.
  • colsample_bytree=0.8:
    • This specifies the fraction of features (columns) to be randomly selected for building each tree. 0.8 means 80% of features are sampled. This technique, also known as feature bagging or column subsampling, helps to reduce overfitting.
  • subsample=0.8:
    • This specifies the fraction of data samples (rows) to be randomly selected for building each tree. 0.8 means 80% of data is sampled. This technique, also known as row subsampling or data bagging, also helps to reduce overfitting and speed up training.
  • reg_alpha=0.1:
    • This is the L1 regularization term on weights. L1 regularization adds a penalty equal to the absolute value of the magnitude of coefficients, which can lead to sparse models with fewer features (some coefficients become zero).
  • reg_lambda=0.1:
    • This is the L2 regularization term on weights. L2 regularization adds a penalty equal to the square of the magnitude of coefficients, which discourages large coefficients and helps prevent overfitting.
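
To make the effect of these settings more tangible, here is a sketch contrasting the configuration above with a more conservative one; the specific numbers in the second configuration are illustrative assumptions, not tuned recommendations.

# A looser and a tighter LightGBM configuration side by side; the exact
# values in `conservative` are illustrative, not tuned.
import lightgbm as lgb

flexible = lgb.LGBMClassifier(
    n_estimators=100, num_leaves=31, max_depth=15,
    min_child_samples=20, min_split_gain=0.0,
    subsample=0.8, colsample_bytree=0.8,
    reg_alpha=0.1, reg_lambda=0.1, learning_rate=0.1, verbose=-1,
)

conservative = lgb.LGBMClassifier(
    n_estimators=300, learning_rate=0.03,       # more trees, smaller steps
    num_leaves=15, max_depth=6,                 # shallower trees, fewer leaves
    min_child_samples=50, min_split_gain=0.1,   # splits must earn their keep
    subsample=0.7, colsample_bytree=0.7,        # stronger row/column subsampling
    reg_alpha=1.0, reg_lambda=1.0,              # heavier L1/L2 penalties
    verbose=-1,
)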

Setting up the model with suitable hyperparameters at this initial stage helps make it robust, ensuring that after training it performs well across most of the dataset and can identify patterns in both seen and new data.

In the next post we will look at the available dataset and continue coding the script.
