Slide presentation of the paper "EVIL: Exploiting Software via Natural Language", accepted at the 32nd International Symposium on Software Reliability Engineering (ISSRE 2021)
EVIL: Exploiting Software via Natural Language
1. EVIL: Exploiting Software Via Natural Language
Pietro Liguori⋆, Erfan Al-Hossami † , Vittorio Orbinato⋆,
Roberto Natella⋆, Samira Shaikh †, Domenico Cotroneo⋆, Bojan Cukic †
⋆ Università degli Studi di Napoli Federico II, Naples, Italy
† University of North Carolina at Charlotte, NC, USA
⋆{pietro.liguori, vittorio.orbinato, roberto.natella, cotroneo}@unina.it
†{ealhossa, samirashaikh, bcukic}@uncc.edu
The 32nd International Symposium on Software Reliability Engineering
4. Proposed Solution
An exploit writer needs to master programming and obfuscation techniques to develop a successful exploit.
We propose EVIL, an approach to automatically generate exploits in assembly/Python from their descriptions in natural language, using Neural Machine Translation (NMT) techniques.
Example intent: "Move /bin//sh into the ebx register"

push 0x68732f2f ; "//sh" in little-endian byte order
push 0x6e69622f ; "/bin" in little-endian byte order
mov ebx, esp    ; ebx now points to the string "/bin//sh"
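As a quick sanity check (an illustration, not from the slide), the two pushed constants really do spell "/bin//sh" once laid out in little-endian byte order on the stack. In Python:

import struct

# The stack grows downward, so 0x6e69622f (pushed second) sits at the
# lower address; little-endian packing reproduces the in-memory string.
data = struct.pack("<II", 0x6E69622F, 0x68732F2F)
assert data == b"/bin//sh"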
5. Contributions
• A substantive dataset containing exploits collected from shellcode databases, along with their descriptions in the English language
• A new approach that applies NMT techniques to automatically generate exploits from their descriptions in the English language
• An extensive experimental study to evaluate the feasibility of the proposed approach in generating real exploits
6. EVIL Approach
EVIL pipeline (from the figure):
1. Pre-processing: the user's natural-language intent goes through the Intent Parser, the Standardizer, and the Embedding step.
2. NMT models: Seq2Seq and CodeBERT, trained on the processed training data, translate the standardized intent into a standardized code snippet.
3. Post-processing: the De-standardizer and the Code Snippet Cleaning step restore the concrete values and clean up the generated code.

Example flow:
NL intent: "Perform xor between dl and 0xbb and jump to next_cycle if the result is zero"
Standardized intent: "Perform xor var0 and var1 and jump var2 if result zero"
Model output: xor var0, var1 \n jz var2
De-standardized output: xor dl, 0xbb \n jz next_cycle
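To make the standardization/de-standardization steps concrete, here is a minimal Python sketch of the idea (our illustration, not the authors' implementation; the placeholder scheme and the simplified regex are assumptions):

import re

# Simplified patterns for registers, hex immediates, and labels.
TOKEN_PATTERN = r"\b(?:e?[abcd][xlh]|esi|edi|esp|ebp|0x[0-9a-fA-F]+|\w+_cycle)\b"

def standardize(intent):
    # Replace concrete registers/values with var0, var1, ... and
    # remember the mapping for the de-standardization step.
    mapping = {}
    def repl(match):
        placeholder = f"var{len(mapping)}"
        mapping[placeholder] = match.group(0)
        return placeholder
    return re.sub(TOKEN_PATTERN, repl, intent), mapping

def destandardize(snippet, mapping):
    # Restore the concrete tokens in the generated code.
    for placeholder, token in mapping.items():
        snippet = snippet.replace(placeholder, token)
    return snippet

intent, mapping = standardize(
    "Perform xor between dl and 0xbb and jump to next_cycle if the result is zero")
print(intent)  # Perform xor between var0 and var1 and jump to var2 if the result is zero
print(destandardize("xor var0, var1\njz var2", mapping))  # xor dl, 0xbb / jz next_cycle (two lines)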
7. Data
Python (Encoder) Dataset: 15,540 intent-snippet pairs. It extends the Django dataset ("Learning to Generate Pseudo-Code from Source Code Using Statistical Machine Translation", Oda et al., ASE, 2015).

                       NL Intents   Python Code Snippets
Unique Lines           15,421       14,034
Unique Tokens          10,605       9,511
Avg. Tokens per Line   14.90        9.53

Assembly (Decoder) Dataset: 3,715 intent-snippet pairs. It extends the Shellcode_IA32 dataset ("Shellcode_IA32: A Dataset for Automatic Shellcode Generation", Liguori et al., NLP4Prog, 2021).

                       NL Intents   Assembly Code Snippets
Unique Lines           3,689        2,542
Unique Tokens          1,924        1,657
Avg. Tokens per Line   9.53         4.75
8. Test Set
The test sets cover 20 different exploits (i.e., they are not randomly split):
• 20 Python encoders
• 20 assembly decoders
The intent-snippet pairs in the test set are unique and are not included in the training/dev sets.
Detailed information on the test sets can be found in the appendix of the paper:
https://github.com/dessertlab/EVIL/tree/main/Appendix
Statistics   Python Encoders   Assembly Decoders
Average      23.4              26.4
Median       19                24
Min          11                17
Max          64                46
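Since the split is by exploit rather than random, leakage between the splits has to be checked explicitly. A minimal sketch of such a check, under the same (intent, snippet) pair assumption as above:

def assert_no_leakage(train_pairs, dev_pairs, test_pairs):
    # The test pairs must be unique and absent from the training/dev data.
    test_set = set(test_pairs)
    assert len(test_set) == len(test_pairs), "duplicate pairs in the test set"
    assert test_set.isdisjoint(train_pairs), "test pairs found in the training set"
    assert test_set.isdisjoint(dev_pairs), "test pairs found in the dev set"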
9. Automated Evaluation
Python Dataset

Model              BLEU-1 (%)   BLEU-2 (%)   BLEU-3 (%)   BLEU-4 (%)   ACC (%)
Seq2Seq no IP      72.34        62.55        56.88        52.2         31.2
Seq2Seq with IP    86.92        83.68        81.42        79.69        45.33
CodeBERT no IP     79.23        74.12        69.84        65.76        48.00
CodeBERT with IP   89.22        86.78        84.94        83.50        56.00

Assembly Dataset

Model              BLEU-1 (%)   BLEU-2 (%)   BLEU-3 (%)   BLEU-4 (%)   ACC (%)
Seq2Seq no IP      35.83        26.1         21.38        17.69        25.25
Seq2Seq with IP    88.38        86.37        85.51        85.05        40.98
CodeBERT no IP     33.42        28.39        25.38        22.72        45.9
CodeBERT with IP   88.21        85.53        84.05        82.99        45.9

(IP = Intent Parser)
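BLEU-1 through BLEU-4 and exact-match accuracy (ACC) can be computed as in this sketch using NLTK (an illustration; the paper's own evaluation code is in the GitHub repo):

from nltk.translate.bleu_score import corpus_bleu

def evaluate(references, hypotheses):
    # references, hypotheses: lists of token lists, one entry per test snippet
    refs = [[ref] for ref in references]  # corpus_bleu expects a list of reference lists
    scores = {f"BLEU-{n}": corpus_bleu(refs, hypotheses, weights=(1.0 / n,) * n)
              for n in range(1, 5)}
    # ACC: fraction of predictions that match the ground truth exactly
    scores["ACC"] = sum(h == r for h, r in zip(hypotheses, references)) / len(references)
    return scores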
10. Human Evaluation
Example (syntactically correct, but semantically incorrect):
NL Intent: res2 is the result of the bitwise and operation between res2 and val1
Predicted Snippet: res2 = val2 | val1

Python Dataset

Model              Syntactic Correctness (%)   Semantic Correctness (%)
Seq2Seq with IP    88.53                       50.13
CodeBERT with IP   93.60*                      67.73*

Assembly Dataset

Model              Syntactic Correctness (%)   Semantic Correctness (%)
Seq2Seq with IP    90.16                       56.36
CodeBERT with IP   87.05                       61.90*
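This example can be checked mechanically: the predicted snippet parses fine (syntactically correct) but computes something other than the described bitwise and (semantically incorrect). A small Python illustration with made-up values:

import ast

ast.parse("res2 = val2 | val1")  # no SyntaxError: syntactically correct

# Semantically, the intent asks for res2 & val1, not val2 | val1:
res2, val1, val2 = 0b1100, 0b1010, 0b0110
assert (res2 & val1) != (val2 | val1)  # 0b1000 vs 0b1110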
11. Whole Exploit Evaluation
$\text{Syntactic correctness}(\text{program } i) = n_{syn}^i / n_t^i$
$\text{Semantic correctness}(\text{program } i) = n_{sem}^i / n_t^i$

where:
• $n_{syn}^i$ is the number of automatically generated snippets of the i-th program that are syntactically correct
• $n_{sem}^i$ is the number of automatically generated snippets of the i-th program that are semantically correct
• $n_t^i$ is the total number of lines of the i-th program
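In code, the per-program metrics reduce to simple ratios. A minimal sketch, assuming per-line correctness labels from the human evaluation:

def program_correctness(lines):
    # lines: one (syntactically_ok, semantically_ok) boolean pair per
    # automatically generated line of the i-th program
    n_t = len(lines)
    n_syn = sum(syn for syn, _ in lines)
    n_sem = sum(sem for _, sem in lines)
    return n_syn / n_t, n_sem / n_t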
12. Qualitative Analysis
Python Dataset

NL Intent: Append 0xAA to the string encode
Ground Truth Snippet: encode += "0xAA"
Model Output: encode += "0x02x" %170

NL Intent: Convert the value of leader to hexadecimal, then slice it at index 3, convert it to an int16 and set its value to the variable sb
Ground Truth Snippet: sb = int(hex(leader)[3:], 16)
Model Output: sb = int(hex(leader)[3:], 32)

Assembly Dataset

NL Intent: Perform the xor between the current byte of the shellcode and the dl register
Ground Truth Snippet: xor byte [esi], dl
Model Output: xor byte [esi], dl

NL Intent: Zero out the ecx register and move 25 in the lower 8 bits of the register
Ground Truth Snippet: xor ecx, ecx \n mov cl, 25
Model Output: xor ecx, ecx \n mov al, 25
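In the second Python example, a single wrong token (base 32 instead of base 16) silently changes the computed value. A quick worked example with an illustrative leader value:

leader = 0x1234
print(hex(leader)[3:])           # '234' (slicing drops '0x' and the first digit)
print(int(hex(leader)[3:], 16))  # 564  -- ground-truth behavior
print(int(hex(leader)[3:], 32))  # 2148 -- model-output behavior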
13. Conclusion (1/2)
We presented EVIL, an approach that applies NMT techniques to automatically generate exploits for security assessment purposes.
We developed and released two datasets of real exploits, in Python and in assembly language, to enable neural network training and experimental evaluation.
We evaluated the feasibility of the approach, showing its ability to generate software exploits with high syntactic and semantic correctness.
14. Conclusion (2/2)
Future work includes:
• improving the post-processing to increase the accuracy of the program generation task
• developing a single engine that generates the encoding and decoding schemes at the same time
We invite readers to take a look at our GitHub repo, which contains the datasets and the code to reproduce the experiments in the paper:
https://github.com/dessertlab/EVIL