5. Use Case
• Cloud email traffic
• A typical customer has a few million
emails per day
• The phishing rate is assumed to be 1 per 10,000
emails
• The system classifies each email into one of four
categories:
Ø Clean
Ø Phishing
Ø Spam
Ø Marketing
6. Development Environment
• Security regulations disallow
downloading the emails
• AWS instances are run from a
shell, which obviously leads
to many obstacles during
development
7. The Model
A two-step XGBoost:
• Combines tabular features
with some text analysis
• To achieve good
performance (namely, high
precision), the second step is
performed only if the first model
detects phishing
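The two-step flow can be sketched as plain control logic. This is a minimal sketch: `stage1_predict` and `stage2_predict` are hypothetical stand-ins for the production XGBoost models, and the feature names are invented for illustration.

```python
def stage1_predict(email_features):
    # Hypothetical first model: a cheap pass over all traffic.
    return "Phishing" if email_features.get("suspicious_link") else "Clean"

def stage2_predict(email_features):
    # Hypothetical second model: heavier text analysis, run only on
    # emails the first model flagged as phishing.
    return "Phishing" if email_features.get("spoofed_sender") else "Spam"

def classify(email_features):
    first = stage1_predict(email_features)
    if first != "Phishing":
        return first  # the expensive second step is skipped
    return stage2_predict(email_features)
```

The point of the gating is cost and precision: the second, heavier model only ever sees the small slice of traffic the first model already suspects.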
8. Labeling Protocol
An enormous number of emails
• We label emails that the model detected as
phishing or spam
• Precision is therefore well measured
• However, new types of phishing are hardly
detected; as a result, the recall measurements
carry high risk
10. Our Objectives
• Construct a DL model
ØModel’s inputs are both text and tabular
data
ØModel’s outputs are the four categories
ØPerformance requirements:
optimize recall at 98% precision
13. Current Text Embedding
• We have a DistilBERT model whose outputs
are used as inputs for the XGBoost:
How does it work?
We follow Hugging Face's regular procedure:
Ø Take the text and tokenize it
Ø Use a pre-trained model for the embedding
Ø Use Hugging Face services for the upper layers
and perform downstream training.
14. Development Challenges
• Replacing the Jupyter notebook with a tensors
folder
• Modifying the network – achieving
flexibility for the combined stage
• Achieving flexibility for different types
of embeddings
16. "The BERTs"
Base BERT
• bert-base-multilingual-cased
• 12 layers
DistilBERT
• distilbert-base-multilingual-cased
• 6 layers
"40% fewer parameters, 60% faster, preserves 95% of the performance"
18. Transformers
Transformers are an extension of CNNs:
• We process the entire input in parallel
• We compare a given "pixel" with its neighbours
• We perform a "pooling" over this comparison
But: for a sentence of length K we compare all possible "pairs": all the inputs versus
all the inputs (O(K²))
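The quadratic all-pairs comparison can be made concrete with a bare-bones dot-product attention sketch (pure Python, toy 2-dimensional token vectors; not the actual multi-head implementation):

```python
import math

def attention_scores(tokens):
    # Compare every token vector with every other token vector:
    # K tokens -> K*K dot products, hence the O(K^2) cost.
    K = len(tokens)
    return [[sum(a * b for a, b in zip(tokens[i], tokens[j]))
             for j in range(K)] for i in range(K)]

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = attention_scores(tokens)            # 3 x 3 score matrix
weights = [softmax(row) for row in scores]   # each row sums to 1
```

For K = 3 tokens this already produces a 3×3 score matrix; the matrix grows quadratically with sentence length, which is exactly the cost FNet attacks below.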
20. FNet
• A parameter-free layer
• Complexity is reduced from quadratic to O(n log n)
• We apply a 2D Fourier transform over the sequence and hidden dimensions, keeping the real part
• Performance is slightly worse, but the layer is faster and has no trainable params
• Intuitively, rather than measuring each pairwise mixing between tokens, we measure a
general information flow
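The FNet-style mixing can be sketched in pure Python. The naive DFT below is for illustration only; the real layer uses an FFT, which is what brings the cost down to O(n log n):

```python
import cmath

def dft(xs):
    # Naive O(n^2) DFT for illustration; FNet uses an FFT instead.
    n = len(xs)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(xs)) for k in range(n)]

def fnet_mixing(tokens):
    # tokens: seq_len x hidden_dim list of token vectors.
    # Fourier transform along the hidden dimension, then along the
    # sequence dimension, keeping only the real part -- no trainable
    # parameters anywhere in this layer.
    hidden_mixed = [dft(vec) for vec in tokens]
    cols = list(zip(*hidden_mixed))               # transpose to channel-major
    seq_mixed = [dft(list(col)) for col in cols]
    out_cols = [[z.real for z in col] for col in seq_mixed]
    return [list(row) for row in zip(*out_cols)]  # back to token-major
```

Every output position depends on every input token through the transform, so information still flows globally, just without learned pairwise attention weights.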
21. The Architecture
• Perform the embedding process
• Terminate the embedding phase using
Hugging Face's pooling layer
• Concatenate the pooled embedding with a
naïve representation of the tabular data
• Run a regular 1D NN (with the usual
extras such as dropout and ReLU activations)
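A stripped-down forward pass illustrating the concatenation step (pure Python, inference only, so dropout is omitted; the layer sizes and weights are made up for the example):

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(v, weights, bias):
    # weights: out_dim x in_dim, bias: out_dim
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]

def forward(pooled_embedding, tabular, weights, bias):
    # Concatenate the pooled text embedding with the raw tabular
    # features, then run a plain fully connected layer with ReLU.
    x = pooled_embedding + tabular
    return relu(linear(x, weights, bias))
```

In the real network the pooled embedding comes out of the Hugging Face pooling layer and several such blocks are stacked, but the joint text-plus-tabular input shape is set right at this concatenation.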
25. Imbalanced Traffic
• While training we have enough data.
• Real-world traffic is extremely imbalanced. Recall that we have
about one phishing email in a traffic of 10k emails.
• Classification algorithms commonly focus on accuracy.
Is that good enough?
26. Example
We trained a model with 99% accuracy.
ØThe number of clean emails is 1,000,000
ØThe number of phishing emails is 100
Each day we will have 10,100 alerts, of which only 100 are valid.
Pretty bad.
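The arithmetic behind the 10,100 alerts, spelled out (assuming, as in the slide, that ~1% of clean emails are misclassified and all 100 phishing emails are caught):

```python
clean = 1_000_000
phishing = 100
accuracy = 0.99

false_alerts = int(clean * (1 - accuracy))  # 1% of clean traffic -> 10,000
true_alerts = phishing                      # assume all 100 are caught
total_alerts = false_alerts + true_alerts   # 10,100 alerts per day
precision = true_alerts / total_alerts      # under 1% of alerts are real
```

Despite 99% accuracy, the alert precision is roughly 100/10,100 ≈ 0.99% - which is why accuracy alone is the wrong objective for this traffic.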
27. A Solution (A, not The)
• We use cross-entropy as the loss (L)
• We aim to reduce our FPs – the use case's objective is the precision level
• A regularization term is needed
Which criteria must such a term satisfy?
28. The precision function
We construct a function F s.t.
• For every real phishing training example it outputs 0
• For every non-phishing example it outputs:
Ø A positive value that increases with the phishing probability
Ø A gradient that does not vanish for high scores
29. Regularization Function – Actual Example
ay1 = -2.5
ay2 = 0.4
ax1 = 0.01
ax2 = 1
mu = (ay2 - ay1) / (ax2 - ax1)
beq = ay1 - mu * ax1
R(prob_phish) = 100. * (1 + tanh(mu * prob_phish + beq))
Total loss = cross_entropy + (target != phishing) * R(prob_phish)
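The slide's constants define a line through (ax1, ay1) and (ax2, ay2) that fixes the tanh argument, so R is nearly 0 at low phishing probability and large near probability 1. A runnable version:

```python
import math

ay1, ay2 = -2.5, 0.4
ax1, ax2 = 0.01, 1.0
mu = (ay2 - ay1) / (ax2 - ax1)   # slope of the line through the two anchor points
beq = ay1 - mu * ax1             # intercept

def R(prob_phish):
    # ~0 when prob_phish is small, grows towards ~140 as prob_phish -> 1.
    return 100.0 * (1.0 + math.tanh(mu * prob_phish + beq))

def total_loss(cross_entropy, prob_phish, target_is_phishing):
    # The penalty applies only to non-phishing examples: it punishes
    # the model for assigning them a high phishing probability (FPs).
    penalty = 0.0 if target_is_phishing else R(prob_phish)
    return cross_entropy + penalty
```

This satisfies the criteria from the previous slide: the term vanishes on true phishing examples, increases monotonically with the phishing probability on the others, and tanh keeps the gradient alive at high scores.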
30. Another Example – Wasserstein
If the target is phishing:
y = -1
else:
y = 1
R(prob_phish) = 100. * (1 + tanh(mu * prob_phish + beq))
Total loss = cross_entropy + (target != phishing) * R(prob_phish)
33. Background
• DL commonly provides poor explainability
• Explainability in phishing is essential
• We can search for a probabilistic graph representation
34. Bayesian Networks
• We assume that the features (as well as the target) are nodes in a
graph
• Each arc's weight represents the conditional probability between
the nodes
• We use bnlearn to optimize these distributions
• The outcome is the optimal DAG of the data.