2. Important Links referred
1) https://www.analyticsvidhya.com/blog/2020/09/precision-recall-machine-learning/
2) https://www.javatpoint.com/confusion-matrix-in-machine-learning
3) https://medium.com/analytics-vidhya/confusion-matrix-accuracy-precision-recall-f1-score-ade299cf63cd
4) https://www.freecodecamp.org/news/evaluation-metrics-for-regression-problems-machine-learning/
3. Why do we use different evaluation metrics?
There are many ways to measure the quality of an algorithm, and each company
decides for itself which one is most appropriate for its particular problem.
Example:
Let’s say an online shop is trying to maximize the effectiveness of its website.
→ We need to formalize what effectiveness means.
→ We need to define a metric for how effectiveness is measured.
→ It can be the number of times the website was visited, or the number of times
something was ordered through the website.
→ So the company usually decides for itself which quantity is most important.
4. When assessing how well a model fits a dataset, we report the RMSE more often than the MSE because
it is measured in the same units as the response variable.
6. Regression & Classification Metrics
1) Regression
a) MSE
b) RMSE
c) R-squared
d) MAE
e) RMSPE, MAPE
2) Classification
a) Confusion Matrix
b) Accuracy
c) Precision
d) Recall
e) F1 Score
f) AUC
7. Regression Metrics - Mean Squared Error (MSE)
MSE is the mean (average) of the squared differences between the actual and estimated values.
A high MSE means that the model is not performing well,
whereas an MSE of 0 means that you have a perfect model that predicts the
target without any error.
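The definition above can be sketched in plain Python. The function name `mean_squared_error` and the sample values are illustrative assumptions, not taken from the slides.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical sample data, chosen only for illustration.
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

mse = mean_squared_error(actual, predicted)   # (1 + 0 + 4) / 3
perfect = mean_squared_error(actual, actual)  # a perfect model gives 0.0
print(mse, perfect)
```

Because each error is squared before averaging, the single error of 2 contributes four times as much as the error of 1.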
12. Example: Model Comparison
When we compare Model A with Model B, Model B has extreme errors.
13. Advantages & Disadvantages
Advantages of using MSE
Easy to calculate in Python
Simple to understand calculation for end users
Designed to punish large errors
Disadvantages of using MSE
Error value is in squared units, not in the units of the target
Difficult to interpret
Not comparable across use cases
14. RMSE
RMSE is the square root of the mean of the squared errors.
→ RMSE penalizes large errors more heavily than MAE, so it can be more
appropriate when large errors are especially undesirable.
→ Another advantage of RMSE over MAE is that it avoids taking the
absolute value, which is not differentiable at zero and is inconvenient in many mathematical calculations.
16. Let’s understand the above statement with two examples:
Case 1: Actual Values = [2,4,6,8], Predicted Values = [4,6,8,10]
Case 2: Actual Values = [2,4,6,8], Predicted Values = [4,6,8,12]
MAE for case 1 = 2.0, RMSE for case 1 = 2.0
MAE for case 2 = 2.5, RMSE for case 2 = 2.65
From the above example:
→ We can see that RMSE penalizes the last prediction in case 2 (an error of 4) more heavily than
MAE. In general, RMSE is greater than or equal to MAE.
→ RMSE equals MAE only when all the absolute differences are the same, including the case where they are all zero
(true for case 1, where the difference between actual and predicted is 2 for all
observations).
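The two cases above can be checked with a short script. This is a sketch in plain Python; the helper names `mae` and `rmse` are assumptions made for illustration.

```python
import math


def mae(actual, predicted):
    """Mean of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Square root of the mean of the squared differences."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


actual = [2, 4, 6, 8]
case1 = [4, 6, 8, 10]  # every prediction is off by 2
case2 = [4, 6, 8, 12]  # the last prediction is off by 4

mae1, rmse1 = mae(actual, case1), rmse(actual, case1)
mae2, rmse2 = mae(actual, case2), rmse(actual, case2)

print(round(mae1, 2), round(rmse1, 2))  # 2.0 2.0
print(round(mae2, 2), round(rmse2, 2))  # 2.5 2.65
```

In case 2 the RMSE is the square root of 7, because the one error of 4 contributes 16 to the sum of squares while the three errors of 2 contribute 4 each.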
17. Mean Absolute Error (MAE)
MAE is the average of the absolute differences between the predicted values and
the observed values.
→ All the individual differences are weighted equally in the average.
18. What are the disadvantages of using mean absolute error?
It doesn't tell you whether your model tends to overestimate or underestimate,
→ since any direction information is destroyed by taking the absolute value.
22. MAE is the mean of the absolute differences between actual and predicted values. It doesn’t
consider the direction of the error, that is, positive or negative.
→ When we keep the direction as well, the metric is called Mean Bias Error (MBE),
which is the mean of the signed errors (differences).
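A small sketch of the difference between MAE and MBE. The function names, the sample data, and the sign convention (predicted minus actual) are assumptions for illustration; some references define the bias with the opposite sign.

```python
def mean_absolute_error(actual, predicted):
    """Mean of |predicted - actual|; direction is discarded."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def mean_bias_error(actual, predicted):
    """Mean of (predicted - actual); direction is kept.

    Assumed convention: a positive value means the model
    overestimates on average, a negative value means it underestimates.
    """
    return sum(p - a for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical data: errors are +2, -2, +3, -3.
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]

mae_value = mean_absolute_error(actual, predicted)  # 2.5
mbe_value = mean_bias_error(actual, predicted)      # 0.0, the errors cancel out
print(mae_value, mbe_value)
```

An MBE of zero here does not mean the model is accurate. It only means the overestimates and underestimates balance, which is why MBE is usually read alongside MAE or RMSE.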
23. So which one should you choose and why?
MAE is easy to understand and interpret because it directly takes the average of the
absolute errors (offsets),
whereas RMSE penalizes large differences more heavily than MAE.
25. Residual
→ A residual is the difference between the actual and the predicted value; you can
think of a residual as a distance.
→ The closer the residuals are to zero, the better the model performs in making its
predictions.
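26. R2 Score
The R2 score is a statistical measure that tells us how well our model is making
predictions, usually on a scale of 0 to 1 (it can be negative when the model fits worse than always predicting the mean).
→ R2 summarizes the residuals: it compares the sum of squared residuals with the total squared deviation of the actual values from their mean.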
27. R-Squared
R-squared is a goodness-of-fit measure for linear regression models. This statistic
indicates the percentage of the variance in the dependent variable that the
independent variables explain collectively.
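One common way to compute R-squared is 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the total squared deviation of the actual values from their mean. The sketch below follows that formula; the function name `r_squared` and the sample data are assumptions for illustration.

```python
def r_squared(actual, predicted):
    """R-squared as 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical data for illustration.
actual = [2, 4, 6, 8]
close_fit = [2.5, 3.5, 6.5, 7.5]  # each prediction is off by 0.5
mean_only = [5, 5, 5, 5]          # always predicts the mean of the actual values

r2_close = r_squared(actual, close_fit)  # 1 - 1/20
r2_mean = r_squared(actual, mean_only)   # 1 - 20/20
print(r2_close, r2_mean)
```

A model that always predicts the mean explains none of the variance and scores 0, which is the baseline R-squared is measured against.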
28. When to use R2 score
You can use the R2 score to express how well a regression model fits on a percentage
scale, that is 0 - 100, much as accuracy does for a classification model.
31. Adjusted R2
Adjusted R2 is the better measure when you compare models that have a different
number of variables.
→ The logic behind it is that R2 never decreases when the number of variables
increases. Even if you add a useless variable to your model, your R2
will not go down and will usually go up slightly. To balance that out, you should always compare models with
different numbers of independent variables using adjusted R2.
→ Adjusted R2 only increases if the new variable improves the model more than
would be expected by chance.
→ When you compare models, use adjusted R2. When you look at only one model,
report R2, as it is the unadjusted measure of how much variance is explained by
your model.
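The usual formula is Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1), with n observations and p independent variables. The sketch below applies it; the function name and the R2 values are hypothetical numbers chosen only to show the effect.

```python
def adjusted_r_squared(r2, n_samples, n_features):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need more samples than features plus one")
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Hypothetical scenario: a useless 4th variable nudges R2 from 0.800 to 0.801
# on a dataset of 50 observations.
adj_before = adjusted_r_squared(0.800, n_samples=50, n_features=3)
adj_after = adjusted_r_squared(0.801, n_samples=50, n_features=4)

print(round(adj_before, 4), round(adj_after, 4))
print(adj_after < adj_before)  # True: the tiny gain in R2 does not pay for the extra variable
```

Plain R2 went up, but adjusted R2 went down, so the comparison favors the smaller model.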
33. TP, TN, FP, FN
We label each prediction as Positive (P) or Negative (N), and as True (T) or
False (F) depending on whether it matches the actual value.
→ Combining the two, we get True Positive (TP), True
Negative (TN), False Positive (FP), False Negative (FN).
38. Confusion Matrix
The confusion matrix is used to determine the performance of a classification model.
→ It can only be determined if the true values for the test data are known.
→ It shows the errors in the model's performance in the form of a matrix.
39. Need for confusion matrix
→ It evaluates the performance of a classification model when it makes
predictions on test data, and tells how good the model is.
→ With the help of the confusion matrix we can calculate different metrics of the
model, such as Accuracy, Precision, and Recall.
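A minimal sketch of how the four cells of a binary confusion matrix can be counted and then turned into accuracy, precision, and recall. The function name `confusion_counts` and the label lists are assumptions for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, FN, TN for a binary problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Hypothetical test-set labels and predictions.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

cm = confusion_counts(actual, predicted)
accuracy = (cm["TP"] + cm["TN"]) / len(actual)
precision = cm["TP"] / (cm["TP"] + cm["FP"])
recall = cm["TP"] / (cm["TP"] + cm["FN"])

print(cm)                           # {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 3}
print(accuracy, precision, recall)  # 0.75 0.75 0.75
```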
41. Accuracy
Accuracy is the quintessential classification metric. It is easy to understand and
is suited to binary as well as multiclass classification problems.
Accuracy = (TP+TN)/(TP+FP+FN+TN)
Accuracy is the proportion of correct results (true positives plus true negatives) among the total number of cases examined.
42. When to use?
Accuracy is a valid choice of evaluation metric for classification problems that are well
balanced, that is, with no skew or class imbalance.
43. Accuracy
"What percentage of my predictions are correct?"
True Positives (TP): should be TRUE, you predicted TRUE. We predicted yes
(they have the disease), and they do have the disease.
True Negatives (TN): should be FALSE, you predicted FALSE. We predicted no,
and they don't have the disease.
False Positives (FP): should be FALSE, you predicted TRUE. We predicted yes,
but they don't actually have the disease. (Also known as a "Type I error.")
False Negatives (FN): should be TRUE, you predicted FALSE. We predicted no,
but they actually do have the disease. (Also known as a "Type II error.")
45. Caveats
Suppose our target class is very sparse. Do we want accuracy as the metric of our
model's performance? What if we are predicting whether an asteroid will hit the earth? Just say
"No" all the time and you will be 99% accurate. Such a model can be reasonably accurate, but
not at all valuable.
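The caveat can be reproduced with made-up numbers. The sketch below assumes a hypothetical dataset with 1 positive case in 100 and a model that always answers "No" (0).

```python
# Hypothetical sparse target: 1 positive in 100 cases.
actual = [1] + [0] * 99
predicted = [0] * 100  # the "always say No" model

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

true_positives = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
actual_positives = sum(actual)
recall = true_positives / actual_positives

print(accuracy)  # 0.99
print(recall)    # 0.0, the one positive case is never found
```

The accuracy looks excellent, yet the model never detects the event it was built to detect, which is why accuracy alone is misleading on imbalanced data.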
46. Example:
→ When a search engine returns 30 pages, only 20 of which are relevant, while
failing to return 40 additional relevant pages, its precision is 20/30 = 2/3,
which tells us how valid the results are,
→ while its recall is 20/60 = 1/3, which tells
us how complete the results are.
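The arithmetic of the search engine example, written out as a sketch. The variable names are assumptions; the counts come from the example above.

```python
from fractions import Fraction

returned = 30           # pages the search engine returned
relevant_returned = 20  # true positives: returned and relevant
relevant_missed = 40    # false negatives: relevant but not returned

false_positives = returned - relevant_returned  # 10 returned pages are not relevant

precision = Fraction(relevant_returned, relevant_returned + false_positives)
recall = Fraction(relevant_returned, relevant_returned + relevant_missed)

print(precision, recall)  # 2/3 1/3
```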
47. Precision
Let’s start with precision, which answers the following question: what proportion of
predicted Positives is truly Positive?
Precision = (TP)/(TP+FP)
What is the precision of your model?
→ It is 0.843: when the model predicts that a patient has heart disease, it is
correct around 84% of the time.
48. When to use?
Precision is a valid choice of evaluation metric when we want to be very sure of our
prediction.
For example:
If we are building a system to predict if we should decrease the credit limit on
a particular account, we want to be very sure about our prediction or it may result in
customer dissatisfaction.
Caveats
Being very precise means our model will leave a lot of credit defaulters untouched and
hence lose money.
49. Recall
Another very useful measure is recall, which answers a different question: what
proportion of actual Positives is correctly classified?
Recall = (TP)/(TP+FN)
For your model, Recall = 0.86. Recall measures how much of the relevant data your model is
able to identify.
50. Precision
"Of the points that I predicted TRUE, how many are actually TRUE?"
Good for multi-label / multi-class classification and information retrieval
Good for unbalanced datasets
Recall
"Of all the points that are actually TRUE, how many did I correctly
predict?"
Good for multi-label / multi-class classification and information retrieval
Good for unbalanced datasets
51. Precision / Recall
Let’s say we are evaluating a classifier on the test set.
→ In a binary classification problem, the actual class of each example in the test set is either “1” or “0”.
→ High precision is a good thing.
→ High recall is also a good thing.
52. True Positive
Your algorithm predicted positive (1) and in reality the example is
positive (1).
True Negative
Your algorithm predicted the negative class (0) and the
actual class is negative (0).
False Positive
Your algorithm predicted positive (1) but the actual
class is negative (0).
False Negative
Your algorithm predicted negative (0) but the actual class is positive (1).
54. Suppose we want to predict that a patient has cancer only if we’re very confident that
they really do.
→ That is, we want to tell someone that we think they have cancer only if we are
very confident.
One way to do this is to modify the algorithm so that the
threshold is raised from 0.5 to 0.7.
→ Then you’re predicting that someone has cancer only when you’re more
confident. This typically gives higher precision at the cost of lower recall.
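A sketch of how moving the threshold from 0.5 to 0.7 changes precision and recall. The predicted probabilities and labels below are made-up illustration data, and `precision_recall` is a hypothetical helper, not part of any library.

```python
def precision_recall(actual, scores, threshold):
    """Predict 1 when score >= threshold, then compute (precision, recall)."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels (1 = has the disease) and model probabilities.
actual = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.65, 0.55, 0.60, 0.40, 0.30, 0.10]

at_05 = precision_recall(actual, scores, threshold=0.5)
at_07 = precision_recall(actual, scores, threshold=0.7)

print(at_05)  # (0.8, 1.0)
print(at_07)  # (1.0, 0.5)
```

With the higher threshold every positive prediction is correct in this toy data, but half of the actual positives are now missed.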
56. How to compare precision/recall numbers?
When we are trying to compare Algorithm 1, Algorithm 2, and Algorithm 3 on precision and recall, we don’t
have a single real-number evaluation metric.
→ A single real-number evaluation metric would just tell us
whether Algorithm 1 or Algorithm 2 is better.
→ That helps us decide much more quickly which algorithm to go with.
58. F1 Score
The F1 score is a single metric that balances precision and recall: it is their harmonic mean, F1 = 2 * (Precision * Recall) / (Precision + Recall).
→ Gives equal weight to precision and recall
→ Good for unbalanced datasets
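A short sketch of the F1 score as the harmonic mean of precision and recall, applied to the precision (0.843) and recall (0.86) quoted earlier in these slides. The helper name `f1_from_precision_recall` is an assumption for illustration.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f1_from_precision_recall(0.843, 0.86)
print(round(f1, 3))  # 0.851

# A lopsided (hypothetical) classifier: perfect precision, very low recall.
lopsided = f1_from_precision_recall(1.0, 0.1)
print(round(lopsided, 3))  # 0.182
```

The harmonic mean stays close to the smaller of the two values, so a model cannot get a high F1 by being strong on only one of precision or recall.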
59. What is the AUC - ROC Curve?
The AUC - ROC curve is a performance measurement for classification problems at various
threshold settings.
→ It tells how capable the model is of distinguishing between classes.
→ The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.
60. ROC Curve
The Receiver Operating Characteristic (ROC) curve is a graph that shows the
performance of a classification model at different threshold levels. It plots two quantities against each other:
1) True positive rate (TPR)
2) False positive rate (FPR)
61. An excellent model has an AUC near 1, which means it has a good measure of
separability.
A poor model has an AUC near 0, which means it has the worst measure of separability.
In fact, it means the model is inverting the result:
→ It is predicting 0s as 1s and 1s as 0s.
→ And when the AUC is 0.5, the model has no class separation capacity
whatsoever.
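The three situations above (AUC near 1, near 0, and 0.5) can be illustrated with a small sketch. It uses the pairwise interpretation of AUC: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The function name `auc` and the scores are made up for illustration.

```python
def auc(actual, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for a, s in zip(actual, scores) if a == 1]
    negatives = [s for a, s in zip(actual, scores) if a == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


actual = [1, 1, 1, 0, 0, 0]
good_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]      # positives always score higher
inverted_scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]  # positives always score lower
constant_scores = [0.5] * 6                       # no separation at all

auc_good = auc(actual, good_scores)
auc_inverted = auc(actual, inverted_scores)
auc_constant = auc(actual, constant_scores)

print(auc_good, auc_inverted, auc_constant)  # 1.0 0.0 0.5
```

This quadratic loop is only meant for small examples; for real datasets a library routine would normally be used instead.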
67. When to Use ROC vs. Precision-Recall Curves?
Generally, the use of ROC curves and precision-recall curves is as follows:
● ROC curves should be used when there are roughly equal numbers of observations for each class.
● Precision-Recall curves should be used when there is a moderate to large class imbalance.
The reason for this recommendation is that ROC curves present an optimistic picture of the model on datasets with a class imbalance.