Concepts to Code: Setup for ML Projects on Ubuntu


Believe it or not, the latest NVIDIA chipset is not essential for getting started with machine learning projects. The scripts that train an ML model can run on a laptop with 16GB of RAM running Ubuntu LTS, without any additional hardware. (No harm if you do have ML accelerators: they make training quicker and support more library features :))

This part covers setting up an Ubuntu system (or any Debian-based distro) for machine learning models, whatever the target idea or application.

  • Install python3 (if it is not already present) and check the version
$ python3 --version
Python 3.12.3
  • Create a virtual environment so that the new packages do not interfere with existing packages
$ python3 -m venv my_scikit_learn_env
  • Activate the virtual environment
$ source my_scikit_learn_env/bin/activate
(my_scikit_learn_env)$
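
To double-check from Python itself that the activation worked, a small sketch like the one below can help. It relies only on the standard library; the helper name `in_virtualenv` is made up for this illustration and is not part of any package.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment directory, while
    # sys.base_prefix still points at the system Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    print("Inside a virtual environment:", in_virtualenv())
```

Run with the environment activated, the interpreter path should point inside my_scikit_learn_env.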

The next step is to install all the packages required by the Python script used for preprocessing the data and training the model.

Python Packages
# Since I've already completed the script setup and package installations,
# I can't provide the individual installation commands.
# However, the following is the list of all the packages currently installed in the virtual environment.
# Please note that, due to some trial and error during development,
# there may be a few redundant or unused packages.
# They don't cause any issues, apart from occupying a bit of extra disk space. 🙂


$ my_scikit_learn_env/bin/pip list
Package                            Version
---------------------------------- -----------
absl-py                            2.1.0
alembic                            1.14.1
annotated-types                    0.7.0
anyio                              4.8.0
asttokens                          3.0.0
astunparse                         1.6.3
bcc                                0.1.10
bleach                             6.2.0
blinker                            1.9.0
boto3                              1.36.6
botocore                           1.36.6
cachetools                         5.5.1
certifi                            2024.12.14
charset-normalizer                 3.4.1
click                              8.1.8
cloudpickle                        3.1.1
contourpy                          1.3.1
cycler                             0.12.1
databricks-sdk                     0.41.0
decorator                          5.1.1
Deprecated                         1.2.17
docker                             7.1.0
executing                          2.1.0
fastapi                            0.115.7
Flask                              3.1.0
flatbuffers                        24.12.23
fonttools                          4.55.3
gast                               0.6.0
gitdb                              4.0.12
GitPython                          3.1.44
google-auth                        2.38.0
google-pasta                       0.2.0
graphene                           3.4.3
graphql-core                       3.2.5
graphql-relay                      3.2.0
graphviz                           0.20.3
greenlet                           3.1.1
grpcio                             1.69.0
gunicorn                           23.0.0
h11                                0.14.0
h5py                               3.12.1
idna                               3.10
imbalanced-learn                   0.13.0
importlib_metadata                 8.5.0
inference-tools                    0.13.4
iniconfig                          2.0.0
ipython                            8.31.0
itsdangerous                       2.2.0
jedi                               0.19.2
Jinja2                             3.1.5
jmespath                           1.0.1
joblib                             1.4.2
kaggle                             1.6.17
kagglehub                          0.3.6
keras                              3.8.0
kiwisolver                         1.4.8
libclang                           18.1.1
lightgbm                           4.5.0
Mako                               1.3.8
Markdown                           3.7
markdown-it-py                     3.0.0
MarkupSafe                         3.0.2
matplotlib                         3.10.0
matplotlib-inline                  0.1.7
mdurl                              0.1.2
ml-dtypes                          0.4.1
mlflow                             2.20.0
mlflow-skinny                      2.20.0
namex                              0.0.8
numpy                              2.0.2
nvidia-nccl-cu12                   2.24.3
opentelemetry-api                  1.29.0
opentelemetry-sdk                  1.29.0
opentelemetry-semantic-conventions 0.50b0
opt_einsum                         3.4.0
optree                             0.13.1
packaging                          24.2
pandas                             2.2.3
parso                              0.8.4
pexpect                            4.9.0
pillow                             11.1.0
pip                                25.0
pluggy                             1.5.0
prompt_toolkit                     3.0.48
protobuf                           5.29.3
ptyprocess                         0.7.0
pure_eval                          0.2.3
pyarrow                            18.1.0
pyasn1                             0.6.1
pyasn1_modules                     0.4.1
pydantic                           2.10.6
pydantic_core                      2.27.2
Pygments                           2.19.1
pyparsing                          3.2.1
pytest                             8.3.4
python-dateutil                    2.9.0.post0
python-slugify                     8.0.4
pytz                               2024.2
PyYAML                             6.0.2
requests                           2.32.3
rich                               13.9.4
rsa                                4.9
s3transfer                         0.11.2
scikit-learn                       1.6.1
scipy                              1.15.1
seaborn                            0.13.2
setuptools                         75.8.0
six                                1.17.0
sklearn-compat                     0.1.3
smmap                              5.0.2
sniffio                            1.3.1
SQLAlchemy                         2.0.37
sqlparse                           0.5.3
stack-data                         0.6.3
starlette                          0.45.3
tabulate                           0.9.0
tensorboard                        2.18.0
tensorboard-data-server            0.7.2
tensorflow                         2.18.0
termcolor                          2.5.0
text-unidecode                     1.3
threadpoolctl                      3.5.0
tqdm                               4.67.1
traitlets                          5.14.3
traittypes                         0.2.1
typing_extensions                  4.12.2
tzdata                             2024.2
urllib3                            2.3.0
uvicorn                            0.34.0
wcwidth                            0.2.13
webencodings                       0.5.1
Werkzeug                           3.1.3
wheel                              0.45.1
wrapt                              1.17.1
xgboost                            2.1.3
zipp                               3.21.0
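
Since the list above is long, it can be handy to check only the handful of packages a script really needs. The sketch below uses the standard-library `importlib.metadata` module; the helper name `installed_versions` and the shortlist of packages are assumptions made for illustration, so adjust them to whatever your own script imports.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the current environment.
            found[name] = None
    return found


if __name__ == "__main__":
    # Hypothetical shortlist, not taken from the original script.
    core = ["numpy", "pandas", "scikit-learn", "matplotlib"]
    for name, ver in installed_versions(core).items():
        print(f"{name:<15} {ver or 'NOT INSTALLED'}")
```

Any package reported as missing can then be installed with pip inside the activated environment.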

ML Model: A little rewind

At the lowest level, a machine learning model is a software program that implements a learning algorithm. For a user, choosing an algorithm (i.e. the ML model before training) depends on the target, or the prediction required at the end. The initial post covered a few machine learning categories, such as supervised and unsupervised learning, and noted that each category has multiple algorithms to choose from based on the requirement.

Machine Learning Models: A Categorical Overview
| Main Paradigm | Sub-Category | Goal / What it Does | Common Algorithms / Approaches | Example Applications |
|---|---|---|---|---|
| 1. Supervised Models | 1.1 Classification | Uses labeled data to predict a categorical output (assigns new data points to predefined classes). | Logistic Regression, Support Vector Machine (SVM), Decision Tree, Random Forest, K-Nearest Neighbors (KNN) | Email spam/inbox filtering, Image classification (cat/dog), Predicting loan applicant credibility. |
| | 1.2 Regression | Uses labeled data to predict a continuous output value based on input features. | Linear Regression, Polynomial Regression, Decision Tree Regression, Random Forest Regression, Support Vector Regression (SVR) | Predicting real estate prices, Stock market trend forecasting, Anticipating customer churn, Sales forecasting. |
| 2. Unsupervised Models | 2.1 Clustering | Groups unlabeled data points based on in-built similarities (discovers inherent groupings). | K-means Clustering, Hierarchical Clustering, Density-Based Spatial Clustering of Applications with Noise (DBSCAN) | Grouping similar fruits, Customer segmentation, Document organization by topic. |
| | 2.2 Dimensionality Reduction | Reduces the number of features/dimensions in data while maintaining key features for visualization and analysis. | Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA) (Note: LDA is supervised/semi-supervised if used for classification but its core concept of finding discriminative dimensions fits here). | Visualizing high-dimensional data, Speeding up model training, Noise reduction in data. |
| | 2.3 Anomaly Detection | Identifies data points that differ significantly from the majority (outliers or unusual occurrences). | Local Outlier Factor (LOF), Isolation Forest | Identifying errors, Detecting fraud, Highlighting unusual events in data streams. |
| 3. Semi-Supervised Models | 3.1 Generative Semi-Supervised Learning | Combines limited labeled data with abundant unlabeled data by using a generative model to learn data distribution and generate pseudo-labeled synthetic data for training. | Generative Adversarial Networks (GANs) combined with semi-supervised techniques, Variational Autoencoders (VAEs) used for semi-supervised tasks. | Training models when labeled data is scarce (e.g., medical image diagnosis with few labeled examples, but many unlabeled ones). |
| | 3.2 Graph-based Semi-Supervised Learning | Leverages relationships (links/connections) between data points to propagate labels from labeled to unlabeled nodes within a network structure. | Graph Convolutional Networks (GCNs) when adapted for semi-supervised tasks, Label Propagation Algorithms. | Inferring user interests in social networks, Classifying web pages based on links, Protein function prediction in biological networks. |
| 4. Reinforcement Learning Models | 4.1 Value-based learning | An agent learns by interacting with an environment, receiving rewards for desired actions, and updating a value function (expected future reward) for each state-action pair to find optimal paths. | Q-learning, SARSA (State-Action-Reward-State-Action) | Training robots to navigate mazes, Game AI (e.g., learning to play chess or Go), Resource allocation in complex systems. |
| | 4.2 Policy-based learning | An agent directly learns a policy (a mapping from states to actions) that dictates its behavior to maximize cumulative rewards, rather than learning state-action values. | Actor-Critic, Proximal Policy Optimization (PPO) | Robotics control (e.g., teaching a robot to walk), Autonomous driving, Training agents for complex decision-making in simulations. |

The Common Algorithms / Approaches column in the table above lists the algorithms, or ML models, that ship with standard ML libraries. They can be imported from those libraries and trained on an available dataset to reach the desired form.
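
As a rough illustration of "import an algorithm and train it on a dataset", here is a minimal sketch using scikit-learn. It picks Logistic Regression from the classification row and the small Iris dataset bundled with the library. The function name `train_iris_classifier`, the 25% test split and the `max_iter` value are choices made for this example, not settings from the actual project script.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def train_iris_classifier(random_state=42):
    """Train a classifier on Iris and return it with its test accuracy."""
    # Features X (flower measurements) and labels y (species).
    X, y = load_iris(return_X_y=True)

    # Hold back a quarter of the rows to evaluate on data the model has not seen.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=random_state, stratify=y
    )

    # The untrained model is just the algorithm; fit() turns it into a trained model.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)

    return model, model.score(X_test, y_test)


if __name__ == "__main__":
    model, accuracy = train_iris_classifier()
    print(f"Test accuracy: {accuracy:.2f}")
```

Swapping in another algorithm from the same row of the table, such as a decision tree or a random forest, mostly means changing the import and the line that constructs the model.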

Scikit-learn (sklearn) is one such ML library: a powerful open-source Python library for machine learning. Designed to simplify ML implementation, it provides a consistent interface for supervised and unsupervised algorithms. Built on NumPy and SciPy, it accepts numeric data as NumPy arrays, SciPy sparse matrices, and convertible formats like Pandas DataFrames. Other alternatives are PyTorch and TensorFlow, but I found Scikit-learn to be more friendly 🙂 The sklearn library also comes equipped with extensive methods to process and filter data, in addition to standard machine learning datasets.
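
The "consistent interface" is easiest to see side by side. The sketch below, assuming scikit-learn is installed, runs one supervised model, one clustering model and one dimensionality-reduction step on the same bundled Iris data. The function name `demo_consistent_interface` and the parameter values are illustrative choices only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier


def demo_consistent_interface():
    """Show the shared fit / predict / transform pattern on Iris."""
    X, y = load_iris(return_X_y=True)  # NumPy arrays: X is (150, 4), y is (150,)

    # Supervised: fit(X, y) with labels, then predict(X).
    knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
    knn_labels = knn.predict(X)

    # Unsupervised clustering: fit(X) without labels, then predict(X).
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    cluster_ids = kmeans.predict(X)

    # Dimensionality reduction: fit(X), then transform(X) to 2 columns.
    pca = PCA(n_components=2).fit(X)
    X_2d = pca.transform(X)

    return knn_labels, cluster_ids, X_2d


if __name__ == "__main__":
    knn_labels, cluster_ids, X_2d = demo_consistent_interface()
    print(knn_labels.shape, cluster_ids.shape, X_2d.shape)
```

Every estimator is created, fitted and applied in the same way, which is a large part of why the library feels friendly.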

  • Install the ML library (inside the activated virtual environment)
$ pip3 install -U scikit-learn

Now that the setup is ready and the libraries required for the models are in place, the next step is to train them using a dataset in a Python script, which is covered in the next post.
