5. Use Case
• Cloud email traffic
• A typical customer has a few million
emails per day
• The phishing rate is assumed to be 1 per 10,000
emails
• The system classifies each email into one of four
categories:
Ø Clean
Ø Phishing
Ø Spam
Ø Marketing
6. Development Environment
• Security regulations disallow
downloading the emails
• AWS instances are run from a
shell, which obviously leads
to many obstacles during
development
7. The Model
A two-step XGBoost:
• Combines tabular features
with some text analysis
• To achieve good
performance (namely, high
precision), the second step is
performed only if the first model
detects phishing
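The two-step flow can be sketched as plain control logic. This is a minimal sketch: `stage1_predict` and `stage2_predict` are hypothetical stand-ins for the production XGBoost models, and the feature names are invented for illustration.

```python
def stage1_predict(email_features):
    # Hypothetical first model: a cheap pass over all traffic.
    return "Phishing" if email_features.get("suspicious_link") else "Clean"

def stage2_predict(email_features):
    # Hypothetical second model: heavier text analysis, run only on
    # emails the first model flagged as phishing.
    return "Phishing" if email_features.get("spoofed_sender") else "Spam"

def classify(email_features):
    first = stage1_predict(email_features)
    if first != "Phishing":
        return first  # the expensive second step is skipped
    return stage2_predict(email_features)
```

The point of the gating is cost and precision: the second, heavier model only ever sees the small slice of traffic the first model already suspects.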
8. Labeling Protocol
An enormous number of emails
• We label emails that the model detected as
phishing or spam
• Precision is therefore well measured
• However, new types of phishing are hardly
detected; as a result, the recall measurements
carry high risk
10. Our Objectives
• Construct a DL model
ØModel’s inputs are both text and tabular
data
ØModel’s outputs are the four categories
ØPerformance requirements:
optimize recall at 98% precision
13. Current Text Embedding
• We have a DistilBERT model whose outputs
are used as inputs for the XGBoost:
How does it work?
We follow Hugging Face's regular procedure:
Ø Take the text and tokenize it
Ø Use a pre-trained model for the embedding
Ø Use Hugging Face services for the upper layers
and perform downstream training.
14. Development Challenges
• Replacing the Jupyter notebook with a tensors
folder
• Modifying the network – achieving
flexibility for the combined stage
• Achieving flexibility for different types
of embeddings
16. "The BERTs"
Base BERT
• bert-base-multilingual-cased
• 12 layers
DistilBERT
• distilbert-base-multilingual-cased
• 6 layers
"40% fewer parameters, 60% faster, preserves 95% of the performance"
18. Transformers
Transformers are an extension of CNNs:
• We process the entire input in parallel
• We compare a given "pixel" with its neighbours
• We perform a "pooling" over this comparison
But: for a sentence of length K we compare all possible "pairs": all the inputs versus
all the inputs (O(K²))
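The quadratic all-pairs comparison can be made concrete with a bare-bones dot-product attention sketch (pure Python, toy 2-dimensional token vectors; not the actual multi-head implementation):

```python
import math

def attention_scores(tokens):
    # Compare every token vector with every other token vector:
    # K tokens -> K*K dot products, hence the O(K^2) cost.
    K = len(tokens)
    return [[sum(a * b for a, b in zip(tokens[i], tokens[j]))
             for j in range(K)] for i in range(K)]

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = attention_scores(tokens)            # 3 x 3 score matrix
weights = [softmax(row) for row in scores]   # each row sums to 1
```

For K = 3 tokens this already produces a 3×3 score matrix; the matrix grows quadratically with sentence length, which is exactly the cost FNet attacks below.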
20. FNet
• A parameter-free layer
• Complexity is reduced from quadratic to O(n log n)
• We apply a 2D Fourier transform over the sequence and hidden dimensions, keeping the real part
• Performance is slightly worse, but the layer is faster and has no trainable params
• Intuitively, rather than measuring each pairwise mixing between tokens, we measure a
general information flow
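The FNet-style mixing can be sketched in pure Python. The naive DFT below is for illustration only; the real layer uses an FFT, which is what brings the cost down to O(n log n):

```python
import cmath

def dft(xs):
    # Naive O(n^2) DFT for illustration; FNet uses an FFT instead.
    n = len(xs)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(xs)) for k in range(n)]

def fnet_mixing(tokens):
    # tokens: seq_len x hidden_dim list of token vectors.
    # Fourier transform along the hidden dimension, then along the
    # sequence dimension, keeping only the real part -- no trainable
    # parameters anywhere in this layer.
    hidden_mixed = [dft(vec) for vec in tokens]
    cols = list(zip(*hidden_mixed))               # transpose to channel-major
    seq_mixed = [dft(list(col)) for col in cols]
    out_cols = [[z.real for z in col] for col in seq_mixed]
    return [list(row) for row in zip(*out_cols)]  # back to token-major
```

Every output position depends on every input token through the transform, so information still flows globally, just without learned pairwise attention weights.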
21. The Architecture
• Perform the embedding process
• Terminate the embedding phase using
Hugging Face's pooling layer
• Concatenate the pooled embedding with a
naïve representation of the tabular data
• Run a regular 1D NN (with the usual
extras such as dropout and ReLU activations)
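A stripped-down forward pass illustrating the concatenation step (pure Python, inference only, so dropout is omitted; the layer sizes and weights are made up for the example):

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(v, weights, bias):
    # weights: out_dim x in_dim, bias: out_dim
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]

def forward(pooled_embedding, tabular, weights, bias):
    # Concatenate the pooled text embedding with the raw tabular
    # features, then run a plain fully connected layer with ReLU.
    x = pooled_embedding + tabular
    return relu(linear(x, weights, bias))
```

In the real network the pooled embedding comes out of the Hugging Face pooling layer and several such blocks are stacked, but the joint text-plus-tabular input shape is set right at this concatenation.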
25. Imbalanced Traffic
• While training we have enough data.
• Real-world traffic is extremely imbalanced. Recall that we have
about one phishing email in a traffic of 10k emails.
• Classification algorithms commonly focus on accuracy.
Is that good enough?
26. Example
We trained a model with 99% accuracy.
ØThe number of clean emails is 1,000,000
ØThe number of phishing emails is 100
Each day we will have 10,100 alerts, of which only 100 are valid.
Pretty bad.
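The arithmetic behind the 10,100 alerts, spelled out (assuming, as in the slide, that ~1% of clean emails are misclassified and all 100 phishing emails are caught):

```python
clean = 1_000_000
phishing = 100
accuracy = 0.99

false_alerts = int(clean * (1 - accuracy))  # 1% of clean traffic -> 10,000
true_alerts = phishing                      # assume all 100 are caught
total_alerts = false_alerts + true_alerts   # 10,100 alerts per day
precision = true_alerts / total_alerts      # under 1% of alerts are real
```

Despite 99% accuracy, the alert precision is roughly 100/10,100 ≈ 0.99% - which is why accuracy alone is the wrong objective for this traffic.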
27. A Solution (A, not The)
• We use cross-entropy as the loss (L)
• We aim to reduce our FPs – the use case's objective is the precision level
• A regularization term is needed
Which criteria must such a term satisfy?
28. The precision function
We construct a function F s.t.
• For every real phishing training example it outputs 0
• For every non-phishing example it outputs:
Ø A positive value that increases with the phishing probability
Ø A gradient that does not vanish for high scores
29. Regularization Function – Actual Example
ay1 = -2.5
ay2 = 0.4
ax1 = 0.01
ax2 = 1
mu = (ay2 - ay1) / (ax2 - ax1)
beq = ay1 - mu * ax1
R(prob_phish) = 100. * (1 + tanh(mu * prob_phish + beq))
Total loss = cross_entropy + (target != phishing) * R(prob_phish)
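The slide's constants define a line through (ax1, ay1) and (ax2, ay2) that fixes the tanh argument, so R is nearly 0 at low phishing probability and large near probability 1. A runnable version:

```python
import math

ay1, ay2 = -2.5, 0.4
ax1, ax2 = 0.01, 1.0
mu = (ay2 - ay1) / (ax2 - ax1)   # slope of the line through the two anchor points
beq = ay1 - mu * ax1             # intercept

def R(prob_phish):
    # ~0 when prob_phish is small, grows towards ~140 as prob_phish -> 1.
    return 100.0 * (1.0 + math.tanh(mu * prob_phish + beq))

def total_loss(cross_entropy, prob_phish, target_is_phishing):
    # The penalty applies only to non-phishing examples: it punishes
    # the model for assigning them a high phishing probability (FPs).
    penalty = 0.0 if target_is_phishing else R(prob_phish)
    return cross_entropy + penalty
```

This satisfies the criteria from the previous slide: the term vanishes on true phishing examples, increases monotonically with the phishing probability on the others, and tanh keeps the gradient alive at high scores.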
30. Another Example – Wasserstein
If the target is phishing:
y = -1
else:
y = 1
R(prob_phish) = 100. * (1 + tanh(mu * prob_phish + beq))
Total loss = cross_entropy + (target != phishing) * R(prob_phish)
33. Background
• DL commonly provides poor explainability
• Explainability in phishing is essential
• We can search for a probabilistic graph representation
34. Bayesian Networks
• We assume that the features (as well as the target) are nodes in a
graph
• Each arc's weight represents the conditional probability between
the nodes
• We use bnlearn to optimize these distributions
• The outcome is the optimal DAG of the data.