Introduction to Statistical and Machine Learning. Explains the basics and fundamental concepts of ML, Statistical Learning, and Deep Learning, and recommends learning sources and techniques for Machine Learning. This was one lecture of a full course I taught at the University of Moratuwa, Sri Lanka, in the second half of 2023.
3. Machine Learning Overview
• Intelligence: Understanding nature to generate useful information
• Artificial Intelligence (AI): Mimicking the intelligence of animals/humans with man-made machines
• Machine Learning (ML): Machines consuming data to achieve Artificial Intelligence
• Deep Learning (DL): Machine Learning using multiple layers of nature-inspired neurons (in Deep Neural Networks)
4. AI vs ML
• AI may consist of theory- and rule-based intelligence:
• Expert Systems
• Control Systems
• Algorithms
• and Machine Learning systems
• ML is developed mainly from available data, whereas other forms of AI can be built without data by using a fixed set of rules
• ML systems are almost free from fixed rules added by experts; the data itself shapes the system
• Less domain knowledge is required
• ML is not a hand-written set of if-else statements (a common misconception)
5. What is Statistical Learning (SL)?
• Using statistics to understand nature through data
• Built on well-established, mathematically proven methods, whereas ML can sometimes be a form of alchemy with data, where the focus is more on results
• Is the basis of ML; the statistics behind some ML models may not be well studied yet
• Offers higher interpretability, as its methods are proven with mathematics
• The line between SL and ML is blurry
6. SL vs ML
Focus:
• Statistical Learning: Primarily focuses on understanding and modeling the relationships between variables in data using statistical methods. It aims to make inferences and predictions based on these relationships.
• Machine Learning: A broader field that encompasses various techniques for building predictive models and making decisions without being overly concerned with the underlying statistical assumptions. It is often used for tasks such as classification, regression, clustering, and more.
Foundation:
• Statistical Learning: Rooted in statistical theory; often uses classical statistical techniques like linear regression, logistic regression, and analysis of variance.
• Machine Learning: Draws from a wider range of techniques, including traditional statistics, but also incorporates methods like decision trees, support vector machines, neural networks, and more. It is less reliant on statistical theory and more focused on empirical performance.
Assumptions:
• Statistical Learning: Methods often make explicit assumptions about the underlying data distribution, such as normality or linearity. These assumptions help in making inferences about population parameters.
• Machine Learning: Models are often designed to be more flexible and adaptive, which can make them less reliant on strict data distribution assumptions.
Interpretability:
• Statistical Learning: Models tend to be more interpretable, meaning it is easier to understand how the model arrives at its predictions. This interpretability is important in fields where understanding the underlying relationships is crucial.
• Machine Learning: While interpretability can be a concern in some models (e.g., deep neural networks), many machine learning models are designed with a primary focus on predictive accuracy rather than interpretability.
7. Course Structure
• Machine Learning will be the main focus
• You should be able to do ML stuff yourself from the available data
• You should be familiar with every phase of the ML lifecycle
• Statistical background will be explained depending on your progress with the above requirement
• ML will first be taught with simpler mathematics and intuition, and then explained with statistical fundamentals
• You will first be able to work on ML projects, and then learn the theory behind them with statistics
8. For Your Reference
• Machine Learning can be self-learned with the free course
https://www.coursera.org/specializations/machine-learning-introduction
• You can learn more about Statistical Learning from the free book on Python-based SL at https://www.statlearning.com
• Learn Python, NumPy, Pandas, and scikit-learn from online tutorials and YouTube videos
• You can also clarify tricky ML/SL problems with ChatGPT
• However, note that some online tutorials, videos, and ChatGPT may provide incorrect information, so be careful when learning from these resources
• Never use ChatGPT for answering Quizzes or Exams! (at least until the AI
takes over the world)
9. What Do We Want from Machine Learning?
• Say we have some collected data
• We want a computer/machine to learn from that data and capture its insights in a model
• Our expectation is to use that model to predict/make inferences on newly provided data
• This is like teaching a kid a certain pattern from example pictures and later asking them to draw/classify similar pictures
• After the model is made (known as “trained”), you want to make sure the model has learned the insights with sufficient accuracy
• For that, you train the model with only a part of the given data and use the remaining data to check (known as “test”) the accuracy of the model
• The model will be used for our needs (to predict/make inferences) only if the tests pass. Otherwise, we have to revisit the problem and may have to start again from data collection
10. What Do We Do in Machine Learning?
• We find a dataset
• In Supervised ML we have labeled data (i.e., the data has both X values and Y values)
• In Unsupervised ML we have unlabeled data (i.e., the data has only X values but no Y values)
• We select a suitable ML algorithm for modeling (e.g., Linear Regression)
• We train a model with most of the data (say 80% of the total) using that algorithm
• We test (check the accuracy of) the trained model with the remaining data (say 20% of the total)
• If the tests pass (i.e., the trained model is accurate enough), we can use the model to label more unlabeled data (in supervised ML) or to make inferences on more data (in unsupervised ML)
• Otherwise, we have to iterate the above process until the tests pass
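The steps above can be sketched with scikit-learn on a small synthetic dataset (all values below are made up for illustration; in practice you would load your own data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Synthetic labeled data: 100 records with 3 features (X) and one target (Y)
X = rng.normal(size=(100, 3))
Y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=100)

# Hold out ~20% of the data for testing, train on the remaining ~80%
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, Y_train)  # "train" the model
score = model.score(X_test, Y_test)               # "test" it (R^2 score here)
print(f"Test R^2: {score:.3f}")

# If the score is acceptable, use the model on new, unlabeled data
new_X = rng.normal(size=(1, 3))
prediction = model.predict(new_X)
```

If the test score were too low, we would go back and change the algorithm, the features, or the data, and repeat the loop.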
11. Supervised Machine Learning
• Now, let’s look in more detail at Supervised Machine Learning
• There are two types of fields/variables/parameters in a Supervised ML dataset:
1. Independent variables / features / predictors / X values
2. Dependent variable / target variable / response / Y value
• Datasets contain a set of records, where each record holds values for a fixed set of X variables and one Y value
• E.g. (the first three records are given for training/testing; the last Y value needs to be predicted):

X1 - GPA | X2 - income | X3 - IQ | Y - life_expectancy
3.41     | 3000        | 105     | 72
2.32     | 1800        | 86      | 65
3.82     | 6000        | 130     | 86
3.56     | 4800        | 112     | ?
12. Supervised Machine Learning
[Diagram: (1) Training — an ML model is trained on the labeled records (GPA 3.41, income 3000, IQ 105 → 72) and (2.32, 1800, 86 → 65). (2) Testing — the trained model is tested on the held-out record (3.82, 6000, 130 → 86), giving an accuracy of 80%. (3) Predicting — the trained model predicts the Y value for the new record (3.56, 4800, 112), producing life_expectancy = 76.]
13. Supervised Machine Learning
• You are given the task of training a model to identify how X1, X2, X3 relate to Y, i.e., to find the function f
• where Y = f(X1, X2, X3), or simply, Y = f(X)
• Once trained, the model provides an estimator of f, named f̂, which is not the exact f, as the model is only an approximation of the true f
• When predicting Y values for new X data, the model generates Ŷ, an estimator of Y, because it uses f̂
• Because of this (i.e., Ŷ ≠ Y in general), there is an error ε
• So the trained model is f̂(X), where:
Ŷ = f̂(X)      (predicted values Ŷ from the approximated model function f̂)
Y = f(X) + ε   (true function f to be approximated, plus the model’s error ε)
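These relationships can be made concrete with a toy example: we invent a known f (normally unknown), sample noisy Y values from it, fit an estimator f̂, and check that Ŷ = f̂(X) differs from Y by a small error. The function f and all data below are fabricated for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# The "true" function f (in a real problem, this is unknown)
def f(X):
    return 3.0 * X[:, 0] + 2.0

X = rng.uniform(0, 1, size=(50, 1))
epsilon = rng.normal(scale=0.05, size=50)
Y = f(X) + epsilon                 # Y = f(X) + epsilon

# Training produces f_hat, an approximation of the true f
f_hat = LinearRegression().fit(X, Y)

Y_hat = f_hat.predict(X)           # Y_hat = f_hat(X)
errors = Y_hat - Y                 # generally non-zero: Y_hat != Y
print("mean absolute error:", np.abs(errors).mean())
```

The fitted coefficients of f̂ come close to the true slope and intercept of f, but the predictions never match Y exactly because of the error term.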
14. Supervised Machine Learning
• There are mainly two types of Supervised Machine Learning problems:
• Regression problems
• Classification problems
• The difference comes from the data type we are going to predict (Y)
• If Y is a continuous number, such as temperature or length, it is a regression problem
• If Y is a discrete value from a finite set of categories, such as gender or country, it is a classification problem
15. Supervised Machine Learning – Example 1
• Problem: A real estate company wants to estimate the sale price of a house, given details of the last 100 houses sold as data, with parameters including the sale price and:
• Area of the house
• Area of the land
• Number of rooms
• Number of floors
• Distance to the main road
• Solution: This is a supervised learning regression problem where the sale price is the Y parameter and the other parameters of the given dataset are X parameters
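A minimal sketch of this regression setup, assuming the houses have already been encoded as rows of numbers (the four example rows and all figures below are invented; the real task would use the company's 100-house dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# X columns: house_area, land_area, rooms, floors, distance_to_road
X = np.array([
    [120.0, 300.0, 3, 1, 0.5],
    [200.0, 450.0, 5, 2, 1.2],
    [ 90.0, 200.0, 2, 1, 0.2],
    [150.0, 350.0, 4, 2, 0.8],
])
# Y: sale price of each house (in some currency unit, in millions)
Y = np.array([25.0, 42.0, 18.0, 31.0])

model = LinearRegression().fit(X, Y)

# Estimate the sale price of a new, unsold house
estimated_price = model.predict([[130.0, 320.0, 3, 1, 0.6]])
print(estimated_price)
```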
16. Supervised Machine Learning – Example 2
• Problem: A doctor wants to diagnose a tumor as malignant or benign using labeled data on 500 tumors, with parameters:
• Length of the tumor
• Age of the patient
• Whether there is a cancer patient in the family
• Solution: This is a supervised learning classification problem where the malignant/benign label is the Boolean Y parameter and the other parameters of the given dataset are X parameters. Here, the length of the tumor and the age of the patient are float-typed X variables, while having a cancer patient in the family is a Boolean X variable.
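A sketch of this classification setup with logistic regression (one common choice; the slide does not prescribe an algorithm). The six rows below are invented stand-ins for the doctor's 500 labeled tumors:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# X columns: tumor_length (float), patient_age (float), family_history (0/1)
X = np.array([
    [1.2, 45, 0],
    [3.8, 60, 1],
    [0.9, 30, 0],
    [4.5, 70, 1],
    [2.0, 50, 0],
    [3.1, 65, 1],
])
# Y: 1 = malignant, 0 = benign (Boolean target)
Y = np.array([0, 1, 0, 1, 0, 1])

clf = LogisticRegression().fit(X, Y)

# Predict the class of a new tumor
print(clf.predict([[4.0, 68, 1]]))
```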
17. Unsupervised Machine Learning
• Now, let’s look in more detail at Unsupervised Machine Learning
• There is only one type of field/variable/parameter in an Unsupervised ML dataset:
• Independent variables / features / X values
• No dependent variables
• There are several types of Unsupervised Machine Learning problems:
• Clustering
• Dimensionality reduction
• Anomaly detection
• …
18. Unsupervised Machine Learning – Example 1
• Problem: A website owner wants to categorize its past 1000 visitors into 10 types based on the following data:
• Hour of the day of the visit
• Visit time
• Most preferred product
• Web browser used
• Country of the IP address
• Solution: As there is no labeled data (no Y parameter), this is an unsupervised learning clustering problem where the given parameters of the dataset are X parameters. We can use K-means clustering to cluster the records into 10 classes
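A sketch of the K-means solution. Note that categorical fields such as the browser or country would first need to be encoded as numbers (e.g., one-hot encoding); here the 1000 visitors are simulated as random numeric feature vectors purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))  # 1000 visitors, 5 encoded features

# Group the visitors into 10 clusters ("types")
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)

labels = kmeans.labels_          # cluster index (0-9) assigned to each visitor
print(np.bincount(labels))       # number of visitors in each cluster
```

The same fitted model can assign a cluster to a new visitor with `kmeans.predict`.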