Concepts to Code: Using ML as Developers - Prerequisites


Even though I think I understand the general concepts of machine learning (I also believe in dreamcatchers and monarchy), I often struggle to put theory into practice. These posts document my journey of learning ML through hands-on implementation. As a complete newbie to machine learning practice, I will start with an idea and work toward a software implementation, breaking the process down into small, manageable steps. Along the way, I’ll focus on understanding the practical concepts and rationale behind each stage.

This part includes fundamental terminology and key concepts necessary for applying and executing a machine learning project idea. 
The focus is on getting equipped with the essential knowledge required to conceptualize and develop machine learning-based projects, without getting overwhelmed by the underlying algorithms or the internal mechanics of machine learning models.

What is Machine Learning?

For the umpteenth time 🙂: it is a mechanism by which software learns to perform tasks by identifying patterns in data, rather than relying on explicitly programmed instructions. In traditional software systems, the behavior is determined by fixed algorithms. For example, conventional antivirus software scans files and compares them against a database of known malware signatures. If a file matches a signature in the database, it is flagged as malicious. The logic is hardcoded, and the system cannot detect new or unknown threats unless the database is updated with the relevant signature.
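To make the hardcoded approach concrete, here is a minimal sketch of signature matching. It assumes a signature is simply the SHA-256 digest of a known-bad file; the payloads and the names `SIGNATURES` and `is_malicious` are made up for illustration, and real antivirus engines use far richer signatures.

```python
import hashlib

# Hypothetical signature database: SHA-256 digests of known-malicious payloads.
KNOWN_MALWARE_PAYLOADS = [b"evil-payload-v1", b"evil-payload-v2"]
SIGNATURES = {hashlib.sha256(p).hexdigest() for p in KNOWN_MALWARE_PAYLOADS}

def is_malicious(file_bytes: bytes) -> bool:
    """Flag a file only if its digest matches a known signature."""
    return hashlib.sha256(file_bytes).hexdigest() in SIGNATURES

print(is_malicious(b"evil-payload-v1"))  # True: known sample, exact match
print(is_malicious(b"evil-payload-v3"))  # False: new variant, no signature yet
```

The second call shows the limitation described above: a slightly changed payload slips through until someone adds its signature to the database.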

In contrast, ML-based antivirus and internet security software follows a different path: it uses large datasets of both malicious and benign files to train a model. During training, the software learns to identify patterns and behaviors associated with malware, such as unusual file structures, suspicious network activity, or malicious code execution.

What is an ML Model?

Machine learning models are software programs that can recognize patterns in data or make predictions.

What is Training the Model?

Training is the process of providing the model with input data (for example, files) together with the expected outcome (for example, a label marking each file as malicious or benign). Using these input-output pairs, the model tunes its internal parameters to pick up the patterns and characteristics unique to malicious files. By the end of training, the model should reliably recognize similar patterns in new, unseen files and classify them as malicious when appropriate.

Training of this kind uses a labeled dataset, in which the model is told which files belong to the benign category and which belong to the malicious one. Once trained, the model is expected to assign a new input file to one of the two categories, benign or malicious, with high accuracy.
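The idea of tuning internal parameters from input-output pairs can be sketched with a perceptron, one of the simplest trainable classifiers. This is only an illustration: the two features (`entropy_score`, `suspicious_call_ratio`) and all the numbers are invented, and real malware models use many more features and more capable algorithms.

```python
# Toy labelled dataset. Each input is (entropy_score, suspicious_call_ratio),
# two made-up features scaled to 0..1; each label is 1 (malicious) or 0 (benign).
TRAINING_DATA = [
    ((0.1, 0.2), 0), ((0.2, 0.1), 0), ((0.3, 0.3), 0), ((0.2, 0.4), 0),
    ((0.8, 0.9), 1), ((0.9, 0.7), 1), ((0.7, 0.8), 1), ((0.9, 0.9), 1),
]

def predict(weights, bias, features):
    """Return 1 (malicious) if the weighted sum is positive, else 0 (benign)."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0

def train(data, epochs=100, learning_rate=1.0):
    """Perceptron rule: nudge the parameters whenever a prediction is wrong."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        mistakes = 0
        for features, label in data:
            error = label - predict(weights, bias, features)  # -1, 0, or +1
            if error != 0:
                mistakes += 1
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * error
        if mistakes == 0:  # every training example is classified correctly
            break
    return weights, bias

weights, bias = train(TRAINING_DATA)
print(predict(weights, bias, (0.85, 0.8)))   # unseen file with malicious-looking features
print(predict(weights, bias, (0.15, 0.25)))  # unseen file with benign-looking features
```

The `weights` and `bias` are the "internal parameters": nobody writes them by hand, they come out of the loop over labelled examples.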

Over time, the system “learns” to detect new and evolving threats based on these patterns, even if they don’t match any known signatures. This allows ML-based systems to adapt and improve their performance as they are exposed to more data.

In conventional systems, humans define the rules. In ML systems, the machine learns the rules from data.


Target Problems for ML

It’s important to identify the type of problem an idea addresses. This helps in selecting the right algorithms, evaluation metrics, and approach for solving it.

  • A. Classification (Is it an Apple or a Banana?)

This learning technique is used to categorise data into predefined classes or labels based on input features. For example, classification could predict whether a viewer will like a specific movie based on their demographic information and viewing history. The algorithm learns from labelled examples to make predictions about which category new observations belong to.

  • B. Regression (How much or how many?)

This learning technique is used to predict continuous numerical values based on relationships between variables. For example, regression could predict how many minutes a user will spend watching a particular show based on factors like their age, previous viewing duration, and content genre. The algorithm establishes mathematical relationships between input features and target values to make quantitative predictions.
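As a rough sketch of what "establishing a mathematical relationship" means, the snippet below fits a straight line with ordinary least squares. The single input feature and every data point are invented for illustration; a real model would use several features and a library implementation.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: average minutes per previous session -> minutes spent on a new show.
previous_minutes = [10, 20, 30, 40, 50]
watched_minutes = [12, 19, 33, 38, 52]

slope, intercept = fit_line(previous_minutes, watched_minutes)
predicted = slope * 60 + intercept  # estimate for a viewer who averages 60 minutes
print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction={predicted:.1f}")
```

Unlike classification, the output here is a number on a continuous scale rather than a category.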

  • C. Clustering (How is it organised?)

This learning technique is used to discover natural groupings or patterns in data based on similarity without using predefined labels. For example, clustering could group viewers based on their viewing habits or preferences. After clustering, you might observe that certain clusters predominantly contain teenagers who prefer horror, adults who prefer thrillers, etc.
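Grouping without labels can be sketched with a bare-bones k-means. The viewers below are invented, each described by (age, share of viewing time spent on horror), and the choice of the first k points as starting centroids is a simplification made so the example is deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, iterations=10):
    """Plain k-means; the first k points serve as starting centroids."""
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # leave the centroid in place if its cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Made-up viewers: (age, share of viewing time spent on horror).
viewers = [(15, 0.9), (16, 0.8), (17, 0.85), (35, 0.2), (40, 0.1), (38, 0.15)]
labels, centroids = k_means(viewers, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The algorithm never sees the words "teenager" or "adult"; it only returns group numbers, and interpreting the groups is left to us. In this toy data the age column dominates the distance, so features would normally be scaled first.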


Machine learning algorithms

Supervised Learning

The goal is to learn a mapping function that can accurately predict the output for new, unseen inputs.

Imagine you’re teaching a child to identify different fruits. You show them an apple and say, “This is an apple.” Then you show them a banana and say, “This is a banana.” You repeat this process with various fruits, providing both the image (input) and the name (label). Eventually, the child learns to identify fruits on their own.

That’s essentially how supervised learning works.  The algorithm learns from a labelled dataset, meaning the data includes both the input features (e.g., colour, shape, size of the fruit) and the corresponding correct outputs or labels (e.g., “apple,” “banana”).

Unsupervised Learning

The goal is to discover hidden patterns, structures, or groupings within the data without explicit guidance.

Imagine giving the child a basket of fruits without telling them the names. You observe them naturally grouping the fruits based on similarities – perhaps they put all the round red fruits together and the long yellow ones in another group. They might not know the names, but they’ve discovered inherent patterns.
This is analogous to unsupervised learning. The algorithm is trained on an unlabeled dataset that provides only input features.

How do I decide which algorithm/model to use for my idea?

For starters, to decide which model/algorithm to use, two questions need to be answered:

  1. Is it a classification problem or a prediction problem?
  2. What kind of dataset is available: labelled or unlabelled?

| Problem Type | Dataset Type | Example ML Techniques |
| --- | --- | --- |
| Classification | Labelled | Logistic Regression, Random Forest, SVM |
| Prediction (Regression) | Labelled | Linear Regression, Decision Trees, XGBoost |
| Clustering / Pattern Discovery | Unlabelled | K-Means, DBSCAN, Hierarchical Clustering |
| Anomaly Detection | Unlabelled or Labelled | Isolation Forest, Autoencoders, One-Class SVM |

Each of these models has a specific purpose, and most of them are available in standard ML libraries. For our experiment, we can try several models and assess the accuracy of each after training before settling on one.
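The "try several, compare accuracy, then pick one" loop can be sketched as follows. The two candidate models here are deliberately tiny stand-ins written from scratch (a majority-class baseline and a nearest-centroid classifier), not the library models from the table, and the train/test data is invented.

```python
from collections import Counter

def train_majority(samples):
    """Baseline: always predict the most common training label."""
    most_common_label = Counter(label for _, label in samples).most_common(1)[0][0]
    return lambda features: most_common_label

def train_nearest_centroid(samples):
    """Predict the label whose average feature vector is closest."""
    grouped = {}
    for features, label in samples:
        grouped.setdefault(label, []).append(features)
    centroids = {label: [sum(dim) / len(rows) for dim in zip(*rows)]
                 for label, rows in grouped.items()}

    def predict(features):
        return min(centroids, key=lambda label: sum(
            (x - c) ** 2 for x, c in zip(features, centroids[label])))
    return predict

def accuracy(model, samples):
    """Fraction of samples the model labels correctly."""
    return sum(model(f) == label for f, label in samples) / len(samples)

# Made-up data, split into a training set and a held-out test set.
TRAIN = [((0.1, 0.2), "benign"), ((0.2, 0.1), "benign"), ((0.3, 0.2), "benign"),
         ((0.2, 0.2), "benign"), ((0.8, 0.9), "malicious"),
         ((0.9, 0.8), "malicious"), ((0.7, 0.9), "malicious")]
TEST = [((0.2, 0.3), "benign"), ((0.1, 0.1), "benign"),
        ((0.9, 0.9), "malicious"), ((0.8, 0.7), "malicious")]

candidates = {"majority_baseline": train_majority,
              "nearest_centroid": train_nearest_centroid}
scores = {name: accuracy(trainer(TRAIN), TEST) for name, trainer in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "->", best)
```

The important detail is that accuracy is measured on `TEST`, data the models never saw during training; scoring on the training set would flatter every candidate.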

Maybe it is time to see how all this fits into the idea.

What Idea?

The idea is network anomaly detection: spotting network traffic that looks different from what’s considered normal. The goal is simple: classify each activity on a network as either normal or anomalous.
This is where machine learning (ML) comes in. Unlike traditional systems that rely on fixed rules, ML can learn from examples. Fed past network traffic data labeled as normal or abnormal, the system trains itself to recognize the patterns that distinguish regular traffic from traffic seen during suspicious activity. It can then flag suspicious activity even when new types of threats appear.
Since the idea is to identify any irregular traffic passing through the target network or device, this can also be a starting point for using an ML model to identify threats or cyberattacks in real time and take protective action.

In the case of anomaly detection, where the goal is to classify network traffic as either normal or anomalous, and where labelled data is available, this becomes a supervised classification problem. Based on this, suitable algorithms include Random Forest, Logistic Regression, Support Vector Machines, Gradient Boosting (like XGBoost), and Neural Networks.
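Before any of these algorithms can be trained, labelled traffic has to be turned into numeric inputs and target labels. Here is a minimal sketch of that step; the `FlowRecord` class, its field names, and the sample values are all hypothetical and do not come from any particular dataset.

```python
from dataclasses import dataclass

@dataclass
class FlowRecord:
    """One observed network connection (all field names are hypothetical)."""
    duration_seconds: float
    bytes_sent: int
    bytes_received: int
    failed_logins: int
    label: str  # "normal" or "anomalous"

def to_features(record):
    """Numeric input vector the model will learn from."""
    return [record.duration_seconds, record.bytes_sent,
            record.bytes_received, record.failed_logins]

def to_target(record):
    """Expected output: 1 for anomalous traffic, 0 for normal traffic."""
    return 1 if record.label == "anomalous" else 0

records = [
    FlowRecord(1.2, 300, 4500, 0, "normal"),
    FlowRecord(0.4, 120, 900, 0, "normal"),
    FlowRecord(30.0, 250000, 100, 7, "anomalous"),
]
X = [to_features(r) for r in records]  # inputs, one row per connection
y = [to_target(r) for r in records]    # labels, in the same order
print(X[2], y)
```

A pair of lists like `X` and `y` is the shape that supervised classifiers in common ML libraries generally expect as training input.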

Most of these models are available as part of standard ML libraries, while some may need additional installation.

With all of the above in hand, we can move on to the setup and the coding of the training part in the next post. 🙂
