Examining Malware with Python
Phil Roth
Data Scientist at Endgame
@mrphilroth
Conclusions
Python tools for text classification can easily be
adopted for malware classification.
When using instruction ngrams, your disassembler
and analysis passes are very important.
references: http://bit.ly/scipy-malware
The Data
Yes, it’s malware, but what kind?
10868 labeled samples
10873 unlabeled samples
~500 GB uncompressed
9 classes
Classes
Hex Dump
00401000 00 00 80 40 40 28 00 1C 02 42 00 C4 00 20 04 20
00401010 00 00 20 09 2A 02 00 00 00 00 8E 10 41 0A 21 01
00401020 40 00 02 01 00 90 21 00 32 40 00 1C 01 40 C8 18
00401030 40 82 02 63 20 00 00 09 10 01 02 21 00 82 00 04
00401040 82 20 08 83 00 08 00 00 00 00 02 00 60 80 10 80
00401050 18 00 00 20 A9 00 00 00 00 04 04 78 01 02 70 90
00401060 00 02 00 08 20 12 00 00 00 40 10 00 80 00 40 19
00401070 00 00 00 00 11 20 80 04 80 10 00 20 00 00 25 00
00401080 00 00 01 00 00 04 00 10 02 C1 80 80 00 20 20 00
00401090 08 A0 01 01 44 28 00 00 08 10 20 00 02 08 00 00
004010A0 00 40 00 00 00 34 40 40 00 04 00 08 80 08 00 08
004010B0 10 00 40 00 68 02 40 04 E1 00 28 14 00 08 20 0A
004010C0 06 01 02 00 40 00 00 00 00 00 00 20 00 02 00 04
004010D0 80 18 90 00 00 10 A0 00 45 09 00 10 04 40 44 82
004010E0 90 00 26 10 00 00 04 00 82 00 00 00 20 40 00 00
004010F0 B4 00 00 40 00 02 20 25 08 00 00 00 00 00 00 00
00401100 08 00 00 50 00 08 40 50 00 02 06 22 08 85 30 00
00401110 00 80 00 80 60 00 09 00 04 20 00 00 00 00 00 00
00401120 00 82 40 02 00 11 46 01 4A 01 8C 01 E6 00 86 10
00401130 4C 01 22 00 64 00 AE 01 EA 01 2A 11 E8 10 26 11
00401140 4E 11 8E 11 C2 00 6C 00 0C 11 60 01 CA 00 62 10
00401150 6C 01 A0 11 CE 10 2C 11 4E 10 8C 00 CE 01 AE 01
00401160 6C 10 6C 11 A2 01 AE 00 46 11 EE 10 22 00 A8 00
00401170 EC 01 08 11 A2 01 AE 10 6C 00 6E 00 AC 11 8C 00
00401180 EC 01 2A 10 2A 01 AE 00 40 00 C8 10 48 01 4E 11
00401190 0E 00 EC 11 24 10 4A 10 04 01 C8 11 E6 01 C2 00
raw data in hex
Highlighted on the slide: address 00401180, followed by its bytes EC 01 2A 10 2A 01 AE
Disassembly
HEADER:00400000 ;
HEADER:00400000 ; +-------------------------------------------------------------------------+
HEADER:00400000 ; | This file has been generated by The Interactive Disassembler (IDA) |
HEADER:00400000 ; | Copyright (c) 2013 Hex-Rays, <support@hex-rays.com> |
HEADER:00400000 ; | License info: |
HEADER:00400000 ; | Microsoft |
HEADER:00400000 ; +-------------------------------------------------------------------------+
HEADER:00400000 ;
HEADER:00400000
HEADER:00400000
HEADER:00400000 .686p
HEADER:00400000 .mmx
HEADER:00400000 .model flat
HEADER:00400000
HEADER:00400000 ; ===========================================================================
HEADER:00400000
HEADER:00400000 ; [00001000 BYTES: COLLAPSED SEGMENT HEADER. PRESS KEYPAD CTRL-"+" TO EXPAND]
.text:00401000 ;
.text:00401000 ; Format : Portable executable for 80386 (PE)
.text:00401000 ; Imagebase : 400000
.text:00401000 ; Section 1. (virtual address 00001000)
.text:00401000 ; Virtual size : 00071050 ( 462928.)
.text:00401000 ; Section size in file : 00071200 ( 463360.)
.text:00401000 ; Offset to raw data for section: 00000400
.text:00401000 ; Flags 60000020: Text Executable Readable
.text:00401000 ; Alignment : default
.text:00401000 ; ===========================================================================
.text:00470050 ; =============== S U B R O U T I N E ====================================
.text:00470050
.text:00470050 ; Attributes: bp-based frame
.text:00470050
.text:00470050 sub_470050 proc near ; CODE XREF: start+D8D^Yp
.text:00470050
.text:00470050 var_68 = dword ptr -68h
.text:00470050 var_64 = dword ptr -64h
.text:00470050 var_60 = dword ptr -60h
.text:00470050
.text:00470050 55 push ebp
.text:00470051 8B EC mov ebp, esp
.text:00470053 83 C4 98 add esp, 0FFFFFF98h
.text:00470056 33 C0 xor eax, eax
.text:00470058 8B 15 7C 10 4B 00 mov edx, dword_4B107C
.text:0047005E 89 55 EC mov [ebp+var_14], edx
.text:00470061 89 45 EC mov [ebp+var_14], eax
.text:00470064 53 push ebx
.text:00470065 8B 1D 7C 10 4B 00 mov ebx, dword_4B107C
.text:0047006B 83 FB 2D cmp ebx, 2Dh
.text:0047006E 75 03 jnz short loc_470073
.text:00470070 89 5D EC mov [ebp+var_14], ebx
.text:00470073
.text:00470073 loc_470073: ; CODE XREF: sub_470050+1E^Xj
.text:00470073 56 push esi
.text:00470074 33 C0 xor eax, eax
.text:00470076 8B 5D EC mov ebx, [ebp+var_14]
.idata:0046F4DC ;
.idata:0046F4DC ; Imports from KERNEL32.DLL
.idata:0046F4DC ;
.idata:0046F4DC ; ===========================================================================
.idata:0046F4DC
.idata:0046F4DC ; Segment type: Externs
.idata:0046F4DC ; _idata
.idata:0046F4DC ; DWORD __stdcall GetCurrentThreadId()
.idata:0046F4DC ?? ?? ?? ?? extrn __imp_GetCurrentThreadId:dword
.idata:0046F4DC ; DATA XREF: .text:0046F66C^Yo
.idata:0046F4DC ; GetCurrentThreadId^Yr
.idata:0046F4E0 ; BOOL __stdcall WriteFile(HANDLE hFile, LPCVOID lpBuffer, DWORD ...
.idata:0046F4E0 ?? ?? ?? ?? extrn WriteFile:dword ; DATA XREF: .text:00471E4C^Yr
.idata:0046F4E4 ; BOOL __stdcall FindNextVolumeA(HANDLE hFindVolume, LPSTR lpszVolumeName, DW ...
.idata:0046F4E4 ?? ?? ?? ?? extrn FindNextVolumeA:dword
.idata:0046F4E4 ; DATA XREF: .text:00471E46^Yr
.idata:0046F4E8 ; LPVOID __stdcall VirtualAlloc(LPVOID lpAddress, SIZE_T dwSize, DWORD ...
.idata:0046F4E8 ?? ?? ?? ?? extrn __imp_VirtualAlloc:dword
.idata:0046F4E8 ; DATA XREF: VirtualAlloc^Yr
.idata:0046F4EC ; BOOL __stdcall EnumResourceLanguagesA(HMODULE hModule, LPCSTR lpType, LPCSTR ...
.idata:0046F4EC ?? ?? ?? ?? extrn EnumResourceLanguagesA:dword
.idata:0046F4EC ; DATA XREF: .text:00471E70^Yr
Highlighted on the slide: the import header "Imports from KERNEL32.DLL" and the function prototype "__stdcall VirtualAlloc(".
My Solution
Features: Byte ngrams, Instruction ngrams, Named features, Manual features
Feature Selection: SelectKBest (used twice)
Model: Gradient Boosting Classifier
Byte ngrams
00401000 00 00 80 40 40 28 00 1C 02 42 00 C4 00 20 04 20
00401010 00 00 20 09 2A 02 00 00 00 00 8E 10 41 0A 21 01
00401020 40 00 02 01 00 90 21 00 32 40 00 1C 01 40 C8 18
00401030 40 82 02 63 20 00 00 09 10 01 02 21 00 82 00 04
Possibilities
1gram: 256
2gram: 65536
3gram: 16777216
4gram: 4294967296
Solution: Hashing
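The counts above are 256**n, so an explicit vocabulary quickly becomes impractical. A minimal sketch of the hashing idea follows, using only the standard library: every n-gram is hashed into one of a fixed number of columns. The helper names and the use of CRC32 are assumptions made for illustration; scikit-learn's HashingVectorizer uses a different hash function (MurmurHash3).

```python
import zlib
from collections import Counter


def possible_ngrams(n, alphabet_size=256):
    """Number of distinct byte n-grams: alphabet_size ** n."""
    return alphabet_size ** n


def hashed_ngram_counts(tokens, n_max=3, n_features=2**16):
    """Count all 1..n_max grams of `tokens`, bucketed into n_features columns.

    Hypothetical helper: CRC32 stands in for the hash function here.
    """
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            column = zlib.crc32(ngram.encode("ascii")) % n_features
            counts[column] += 1
    return counts


# First eight bytes of the hex dump shown earlier.
tokens = "00 00 80 40 40 28 00 1C".split()
counts = hashed_ngram_counts(tokens)
total = sum(counts.values())
```

The number of columns stays at n_features no matter how many distinct n-grams occur; the price is that unrelated n-grams can collide in the same column.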
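Code for extracting the byte ngrams and reducing dimensionality:
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.pipeline import Pipeline

# Hash 1- to 3-grams of space-separated hex bytes into 2**16 columns.
# alternate_sign=False replaces the original non_negative=True,
# which has since been removed from scikit-learn.
vectorizer = HashingVectorizer(
    input="content", lowercase=True, stop_words=None, ngram_range=(1, 3),
    analyzer="word", n_features=2**16, binary=False, norm=None,
    alternate_sign=False
)
pipe = Pipeline([
    ("extraction", CustomExtractor(vectorizer=vectorizer)),
    ("sel", VarianceThreshold(threshold=0)),
    ("tfidf", TfidfTransformer(norm="l2", use_idf=True, smooth_idf=True,
                               sublinear_tf=True)),
    ("kbest", SelectKBest(score_func=f_classif, k=500))
])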
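import multiprocessing
import scipy.sparse
import toolz
from toolz.curried import map, filter  # curried versions, usable inside toolz.pipe

class CustomExtractor():
    def __init__(self, vectorizer=HashingVectorizer()):
        self.vectorizer = vectorizer

    def fit(self, X, y=None):
        return self  # stateless

    def transform(self, X, y=None):
        # Extract features from the files in parallel, 32 files per task.
        with multiprocessing.Pool() as pool:
            rows = pool.map(self.feature_extract, X, 32)
        return scipy.sparse.vstack(list(rows))

    fit_transform = transform

    def feature_extract(self, file_name):
        # Drop the leading address on each line and skip "??" placeholder bytes.
        with open(file_name, "r") as lines:
            clean_bytes = " ".join(toolz.pipe(
                lines,
                map(lambda line: line.rstrip().split()[1:]),
                toolz.concat,
                filter(lambda b: b != "??" and b != "?")
            ))
        return self.vectorizer.transform([clean_bytes])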
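As a rough illustration of what the feature_extract step does to a hex dump before vectorizing, here is a standard-library sketch without toolz or multiprocessing. The function name clean_byte_tokens and the second sample line are hypothetical; the first sample line is taken from the hex dump shown earlier.

```python
def clean_byte_tokens(lines):
    """Yield hex byte tokens, skipping the address column and '??' placeholders."""
    for line in lines:
        fields = line.rstrip().split()
        for token in fields[1:]:  # fields[0] is the address
            if token not in ("??", "?"):
                yield token


sample_lines = [
    "00401000 00 00 80 40 40 28 00 1C 02 42 00 C4 00 20 04 20\n",
    "0046F4DC ?? ?? ?? ?? 8B EC\n",  # hypothetical line with unknown bytes
]

# One space-separated "document" per file, ready for a word-level vectorizer.
document = " ".join(clean_byte_tokens(sample_lines))
```

Treating each byte as a "word" is what lets the text-classification tooling be reused unchanged.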
Why they might be useful: https://github.com/wapiflapi/binglide
sample 0A32eTdBKayjCWhZqDOQ
Instruction ngrams
Extracted instructions:
push lea push mov call mov mov pop retn
mov jmp
push mov mov call test jz push call add mov pop retn
mov mov mov mov retn
mov lea mov inc test jnz sub retn
mov mov mov push mov push push push push call add mov pop retn
mov mov mov push mov push push push push call add mov pop retn
xor retn
mov retn
mov retn
mov retn
mov mov mov retn
mov test jz mov mov push push call mov mov retn
push push push push call push call mov push push push mov call mov retn
mov mov mov retn
mov test jz mov mov push push call mov mov retn
push push push push call mov push push push mov call push call mov retn
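The slides do not show how the mnemonics were pulled out of the listing, so the following is only a sketch under an assumed line shape (".text:ADDRESS", then uppercase hex bytes, then a lowercase mnemonic). The regular expression and both helper names are hypothetical. It then counts the word 1- and 2-grams that the vectorizer would hash.

```python
import re
from collections import Counter

# Assumed line shape: ".text:ADDRESS HEX BYTES mnemonic operands"
INSTRUCTION_RE = re.compile(r"^\.text:[0-9A-F]{8}(?: [0-9A-F]{2})+ +([a-z]\w*)")


def mnemonics(lines):
    """Yield the mnemonic of every line that looks like an instruction."""
    for line in lines:
        match = INSTRUCTION_RE.match(line)
        if match:
            yield match.group(1)


def word_ngrams(tokens, n_min=1, n_max=2):
    """Count all n-grams of the token sequence for n_min <= n <= n_max."""
    tokens = list(tokens)
    return Counter(
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    )


listing = [
    ".text:00470050 sub_470050 proc near",
    ".text:00470050 55 push ebp",
    ".text:00470051 8B EC mov ebp, esp",
    ".text:00470053 83 C4 98 add esp, 0FFFFFF98h",
    ".text:00470056 33 C0 xor eax, eax",
    ".text:00470058 8B 15 7C 10 4B 00 mov edx, dword_4B107C",
]
ops = list(mnemonics(listing))
grams = word_ngrams(ops)
```

Lines without instruction bytes, such as the "proc near" label, are skipped because nothing after the address matches a hex byte.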
Code for extracting the instruction ngrams and reducing dimensionality:
# Same steps as for byte ngrams, but with 1- and 2-grams hashed into 2**25 columns.
# alternate_sign=False replaces the original non_negative=True,
# which has since been removed from scikit-learn.
vectorizer = HashingVectorizer(
    input="content", lowercase=True, stop_words=None, ngram_range=(1, 2),
    analyzer="word", n_features=2**25, binary=False, norm=None,
    alternate_sign=False
)
pipe = Pipeline([
    ("extraction", CustomExtractor(vectorizer=vectorizer)),
    ("sel", VarianceThreshold(threshold=0)),
    ("tfidf", TfidfTransformer(norm="l2", use_idf=True, smooth_idf=True,
                               sublinear_tf=True)),
    ("kbest", SelectKBest(score_func=f_classif, k=500))
])
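To make the "tfidf" step concrete, here is a small standard-library sketch of the weighting that the settings sublinear_tf=True, smooth_idf=True and norm="l2" describe in the scikit-learn documentation: tf becomes 1 + ln(tf), idf is ln((1 + n) / (1 + df)) + 1, and each row is scaled to unit length. The function name tfidf_rows and the toy counts are assumptions for illustration, not the library implementation.

```python
import math


def tfidf_rows(count_rows):
    """count_rows: one dict per sample, mapping a term to its raw count."""
    n_samples = len(count_rows)

    # Document frequency: in how many samples does each term occur?
    df = {}
    for row in count_rows:
        for term in row:
            df[term] = df.get(term, 0) + 1

    # Smoothed idf, as if one extra sample contained every term.
    idf = {t: math.log((1 + n_samples) / (1 + d)) + 1 for t, d in df.items()}

    weighted_rows = []
    for row in count_rows:
        # Sublinear tf: 1 + ln(tf) instead of the raw count.
        weighted = {t: (1 + math.log(tf)) * idf[t] for t, tf in row.items() if tf > 0}
        norm = math.sqrt(sum(v * v for v in weighted.values()))
        weighted_rows.append({t: v / norm for t, v in weighted.items()} if norm else {})
    return weighted_rows


# Toy counts for two samples (hypothetical numbers).
rows = tfidf_rows([{"mov": 4, "push": 1}, {"mov": 2, "retn": 1}])
```

The sublinear term frequency keeps very common instructions such as mov from dominating a row purely through repetition.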
Named Features
Section names, imports, imported functions.
These features were extracted with regular expressions.
Features were (awkwardly) selected in the same
step as instruction ngrams.
import re

# Raw strings keep the regex escapes (\w, \S, \() intact.
re_features = {
    "imports" : {
        "re" : re.compile(r"Imports from \w.+"),
        "extract" : lambda m : m.group().split()[-1],
        "filter" : lambda m : True
    },
    "imported_functions" : {
        "re" : re.compile(r"__stdcall \w.+\("),
        "extract" : lambda m : m.group().split()[-1][:-1],
        "filter" : lambda m : not m.startswith("sub_")
    },
    "section_names" : {
        "re" : re.compile(r"^\S+?:"),
        "extract" : lambda m : m.group()[:-1],
        "filter" : lambda m : True
    }
}
from toolz import pipe, unique
from toolz.curried import map, filter

def process_re_feature(lines, re_dict):
    # Search each line, keep the matches, extract and filter the names,
    # then drop duplicates.
    return pipe(
        lines,
        map(re_dict["re"].search),
        filter(lambda m : m is not None),
        map(re_dict["extract"]),
        filter(re_dict["filter"]),
        unique
    )
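A quick way to see what the three patterns capture is to run them over a few lines of the disassembly shown earlier. The sketch below uses only the standard library instead of toolz; the helper unique_matches is a hypothetical stand-in for process_re_feature, and the sample lines are copied from the listing slides.

```python
import re

IMPORTS_RE = re.compile(r"Imports from \w.+")
STDCALL_RE = re.compile(r"__stdcall \w.+\(")
SECTION_RE = re.compile(r"^\S+?:")


def unique_matches(lines, pattern, extract, keep=lambda name: True):
    """Search every line, extract a name from each match, keep first occurrences."""
    seen = []
    for line in lines:
        match = pattern.search(line)
        if match is None:
            continue
        name = extract(match)
        if keep(name) and name not in seen:
            seen.append(name)
    return seen


lines = [
    ".idata:0046F4DC ; Imports from KERNEL32.DLL",
    ".idata:0046F4E0 ; BOOL __stdcall WriteFile(HANDLE hFile, LPCVOID lpBuffer, DWORD ...",
    ".idata:0046F4E8 ; LPVOID __stdcall VirtualAlloc(LPVOID lpAddress, SIZE_T dwSize, DWORD ...",
    ".text:00470050 55 push ebp",
]

imports = unique_matches(lines, IMPORTS_RE, lambda m: m.group().split()[-1])
functions = unique_matches(
    lines, STDCALL_RE,
    lambda m: m.group().split()[-1][:-1],        # drop the trailing "("
    keep=lambda name: not name.startswith("sub_"),
)
sections = unique_matches(lines, SECTION_RE, lambda m: m.group()[:-1])
```

Each resulting list can be joined into a space-separated string and fed to the same kind of vectorizer as the ngrams.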
Manual Features
Sample 0A32eTdBKayjCWhZqDOQ:
{
    "number_of_collapsed_functions": 451,
    "number_of_imported_functions": 101,
    "sample_length": 1201668,
    "number_of_imports": 4,
    "number_of_sections": 4,
    "section_length_0": 979764,
    ...
    "section_length_6": 0,
    "length_of_functions_0": 2706,
    ...
    "length_of_functions_15": 107
}
Final Model
Gradient Boosting Classifier on 1026 features
Grid search optimized parameters
Also tried: LogisticRegression, MultinomialNB,
KNeighborsClassifier, RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

# loss='log_loss' is the current name of the original loss='deviance'
# (renamed in scikit-learn 1.1).
clf = GradientBoostingClassifier(
    loss='log_loss', learning_rate=0.1, n_estimators=300, subsample=0.9,
    min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0,
    max_depth=3, init=None, random_state=None, max_features=200,
    max_leaf_nodes=None, warm_start=False, verbose=2
)
Final Model tSNE Plot
pipe = Pipeline([
    ("tsvd", TruncatedSVD(n_components=50)),
    ("tsne", TSNE(n_components=2, perplexity=40.0,
                  early_exaggeration=4.0, learning_rate=1000.0,
                  n_iter=1000, metric='euclidean', init='random'))
])
Results:
I did OK…
More focused on productization
Winning Strategies
xgboost
malware as an image
compression ratio as a feature
other expanded feature sets
probability calibration
semi supervised learning
Slide labels group these as either usable in a product or specific to competitions.
Using Capstone
from capstone import Cs, CS_ARCH_X86, CS_MODE_32

# Rebuild the raw bytes from the hex dump: drop each line's address
# and any "??" placeholders.
with open("data/sample/0A32eTdBKayjCWhZqDOQ.bytes", "r") as f:
    code = bytes.fromhex("".join(
        "".join(l.split()[1:]).replace("?", "") for l in f
    ))

md = Cs(CS_ARCH_X86, CS_MODE_32)
# disasm_lite yields (address, size, mnemonic, op_str); keep the mnemonics, skip int3.
instructions = " ".join(
    [t[2] for t in md.disasm_lite(code, 0x1000) if t[2] != "int3"]
)
ida ******************************
CV Scores: [ 0.03800 0.02551 0.05283 0.03953 0.0350 ]
mean: 0.03817940685733493 std: 0.008799619405211161
capstone ******************************
CV Scores: [ 0.05065 0.0451 0.06953 0.05583 0.05089]
mean: 0.05441113231562615 std: 0.008283830117670508
Disassemblers
IDA: not (easily) batch distributable
capstone: single pass produces suboptimal results
radare2: Python scriptable reversing framework
vivisect: pure Python, largely undocumented disassembler and analysis project
Other Projects
pefile: extracts header information from executables
binglide: visualizations of entropy and byte ngrams
cuckoo: automated dynamic analysis
barf: binary analysis framework with code analysis
Conclusions
Python tools for text classification can easily be
adopted for malware classification.
When using instruction ngrams, your disassembler
and analysis passes are very important.
references: http://bit.ly/scipy-malware
Thank You

Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Rithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfRithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdf
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 

Examining Malware with Python

  • 2. Examining Malware with Python. Phil Roth, Data Scientist at Endgame, @mrphilroth
  • 3. Conclusions: Python tools for text classification can easily be adopted for malware classification. When using instruction ngrams, your disassembler and analysis passes are very important. References: http://bit.ly/scipy-malware
  • 4. Yes it’s malware, but what kind?
  • 5. The Data: 10868 labeled samples; 10873 unlabeled samples; ~500 GB uncompressed; 9 classes
  • 7. Hex Dump (raw data in hex)
      00401000 00 00 80 40 40 28 00 1C 02 42 00 C4 00 20 04 20
      00401010 00 00 20 09 2A 02 00 00 00 00 8E 10 41 0A 21 01
      00401020 40 00 02 01 00 90 21 00 32 40 00 1C 01 40 C8 18
      00401030 40 82 02 63 20 00 00 09 10 01 02 21 00 82 00 04
      00401040 82 20 08 83 00 08 00 00 00 00 02 00 60 80 10 80
      00401050 18 00 00 20 A9 00 00 00 00 04 04 78 01 02 70 90
      00401060 00 02 00 08 20 12 00 00 00 40 10 00 80 00 40 19
      00401070 00 00 00 00 11 20 80 04 80 10 00 20 00 00 25 00
      00401080 00 00 01 00 00 04 00 10 02 C1 80 80 00 20 20 00
      00401090 08 A0 01 01 44 28 00 00 08 10 20 00 02 08 00 00
      004010A0 00 40 00 00 00 34 40 40 00 04 00 08 80 08 00 08
      004010B0 10 00 40 00 68 02 40 04 E1 00 28 14 00 08 20 0A
      004010C0 06 01 02 00 40 00 00 00 00 00 00 20 00 02 00 04
      004010D0 80 18 90 00 00 10 A0 00 45 09 00 10 04 40 44 82
      004010E0 90 00 26 10 00 00 04 00 82 00 00 00 20 40 00 00
      004010F0 B4 00 00 40 00 02 20 25 08 00 00 00 00 00 00 00
      00401100 08 00 00 50 00 08 40 50 00 02 06 22 08 85 30 00
      00401110 00 80 00 80 60 00 09 00 04 20 00 00 00 00 00 00
      00401120 00 82 40 02 00 11 46 01 4A 01 8C 01 E6 00 86 10
      00401130 4C 01 22 00 64 00 AE 01 EA 01 2A 11 E8 10 26 11
      00401140 4E 11 8E 11 C2 00 6C 00 0C 11 60 01 CA 00 62 10
      00401150 6C 01 A0 11 CE 10 2C 11 4E 10 8C 00 CE 01 AE 01
      00401160 6C 10 6C 11 A2 01 AE 00 46 11 EE 10 22 00 A8 00
      00401170 EC 01 08 11 A2 01 AE 10 6C 00 6E 00 AC 11 8C 00
      00401180 EC 01 2A 10 2A 01 AE 00 40 00 C8 10 48 01 4E 11
      00401190 0E 00 EC 11 24 10 4A 10 04 01 C8 11 E6 01 C2 00
  • 9. Disassembly
      HEADER:00400000 ;
      HEADER:00400000 ; +-------------------------------------------------------------------------+
      HEADER:00400000 ; | This file has been generated by The Interactive Disassembler (IDA) |
      HEADER:00400000 ; | Copyright (c) 2013 Hex-Rays, <support@hex-rays.com> |
      HEADER:00400000 ; | License info: |
      HEADER:00400000 ; | Microsoft |
      HEADER:00400000 ; +-------------------------------------------------------------------------+
      HEADER:00400000 ;
      HEADER:00400000
      HEADER:00400000
      HEADER:00400000 .686p
      HEADER:00400000 .mmx
      HEADER:00400000 .model flat
      HEADER:00400000
      HEADER:00400000 ; ===========================================================================
      HEADER:00400000
      HEADER:00400000 ; [00001000 BYTES: COLLAPSED SEGMENT HEADER. PRESS KEYPAD CTRL-"+" TO EXPAND]
      .text:00401000 ;
      .text:00401000 ; Format : Portable executable for 80386 (PE)
      .text:00401000 ; Imagebase : 400000
      .text:00401000 ; Section 1. (virtual address 00001000)
      .text:00401000 ; Virtual size : 00071050 ( 462928.)
      .text:00401000 ; Section size in file : 00071200 ( 463360.)
      .text:00401000 ; Offset to raw data for section: 00000400
      .text:00401000 ; Flags 60000020: Text Executable Readable
      .text:00401000 ; Alignment : default
      .text:00401000 ; ===========================================================================
  • 12. Disassembly
      .text:00470050 ; =============== S U B R O U T I N E ====================================
      .text:00470050
      .text:00470050 ; Attributes: bp-based frame
      .text:00470050
      .text:00470050 sub_470050 proc near ; CODE XREF: start+D8D^Yp
      .text:00470050
      .text:00470050 var_68 = dword ptr -68h
      .text:00470050 var_64 = dword ptr -64h
      .text:00470050 var_60 = dword ptr -60h
      .text:00470050
      .text:00470050 55 push ebp
      .text:00470051 8B EC mov ebp, esp
      .text:00470053 83 C4 98 add esp, 0FFFFFF98h
      .text:00470056 33 C0 xor eax, eax
      .text:00470058 8B 15 7C 10 4B 00 mov edx, dword_4B107C
      .text:0047005E 89 55 EC mov [ebp+var_14], edx
      .text:00470061 89 45 EC mov [ebp+var_14], eax
      .text:00470064 53 push ebx
      .text:00470065 8B 1D 7C 10 4B 00 mov ebx, dword_4B107C
      .text:0047006B 83 FB 2D cmp ebx, 2Dh
      .text:0047006E 75 03 jnz short loc_470073
      .text:00470070 89 5D EC mov [ebp+var_14], ebx
      .text:00470073
      .text:00470073 loc_470073: ; CODE XREF: sub_470050+1E^Xj
      .text:00470073 56 push esi
      .text:00470074 33 C0 xor eax, eax
      .text:00470076 8B 5D EC mov ebx, [ebp+var_14]
  • 15. Disassembly
      .idata:0046F4DC ;
      .idata:0046F4DC ; Imports from KERNEL32.DLL
      .idata:0046F4DC ;
      .idata:0046F4DC ; ===========================================================================
      .idata:0046F4DC
      .idata:0046F4DC ; Segment type: Externs
      .idata:0046F4DC ; _idata
      .idata:0046F4DC ; DWORD __stdcall GetCurrentThreadId()
      .idata:0046F4DC ?? ?? ?? ?? extrn __imp_GetCurrentThreadId:dword
      .idata:0046F4DC ; DATA XREF: .text:0046F66C^Yo
      .idata:0046F4DC ; GetCurrentThreadId^Yr
      .idata:0046F4E0 ; BOOL __stdcall WriteFile(HANDLE hFile, LPCVOID lpBuffer, DWORD ...
      .idata:0046F4E0 ?? ?? ?? ?? extrn WriteFile:dword ; DATA XREF: .text:00471E4C^Yr
      .idata:0046F4E4 ; BOOL __stdcall FindNextVolumeA(HANDLE hFindVolume, LPSTR lpszVolumeName, DW ...
      .idata:0046F4E4 ?? ?? ?? ?? extrn FindNextVolumeA:dword
      .idata:0046F4E4 ; DATA XREF: .text:00471E46^Yr
      .idata:0046F4E8 ; LPVOID __stdcall VirtualAlloc(LPVOID lpAddress, SIZE_T dwSize, DWORD ...
      .idata:0046F4E8 ?? ?? ?? ?? extrn __imp_VirtualAlloc:dword
      .idata:0046F4E8 ; DATA XREF: VirtualAlloc^Yr
      .idata:0046F4EC ; BOOL __stdcall EnumResourceLanguagesA(HMODULE hModule, LPCSTR lpType, LPCSTR ...
      .idata:0046F4EC ?? ?? ?? ?? extrn EnumResourceLanguagesA:dword
      .idata:0046F4EC ; DATA XREF: .text:00471E70^Yr
  • 18. Byte ngrams
      00401000 00 00 80 40 40 28 00 1C 02 42 00 C4 00 20 04 20
      00401010 00 00 20 09 2A 02 00 00 00 00 8E 10 41 0A 21 01
      00401020 40 00 02 01 00 90 21 00 32 40 00 1C 01 40 C8 18
      00401030 40 82 02 63 20 00 00 09 10 01 02 21 00 82 00 04
      Possibilities: 1gram: 256; 2gram: 65536; 3gram: 16777216; 4gram: 4294967296
      Solution: Hashing
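The counts on slide 18 are simply 256**n, and hashing maps that vocabulary into a fixed number of columns. The following is a minimal sketch of the idea using only the standard library. The function name `hashed_ngram_counts` and the use of `zlib.crc32` are assumptions made for illustration; scikit-learn's `HashingVectorizer` uses a different hash function internally.

```python
import zlib
from collections import Counter

# The number of distinct byte n-grams grows as 256**n.
for n in range(1, 5):
    print(f"{n}gram: {256 ** n}")

N_FEATURES = 2 ** 16  # fixed output width, matching n_features in the talk's byte vectorizer


def hashed_ngram_counts(tokens, max_n=3, n_features=N_FEATURES):
    """Count all 1..max_n-grams of `tokens`, bucketed by a hash of each n-gram."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            # crc32 is a stand-in hash chosen for this sketch only.
            bucket = zlib.crc32(ngram.encode("ascii")) % n_features
            counts[bucket] += 1
    return counts


# First eight bytes of the hex dump shown on the slide.
tokens = "00 00 80 40 40 28 00 1C".split()
features = hashed_ngram_counts(tokens)
```

Distinct n-grams can collide in the same bucket. That is the price of a bounded feature width, and it is usually acceptable when the table is large relative to the number of n-grams that actually occur.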
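  • 20. Byte ngrams
      Code for extracting the byte ngrams and reducing dimensionality:

      import multiprocessing
      import scipy.sparse
      import toolz
      from toolz.curried import map, filter  # curried forms, needed by toolz.pipe below
      from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
      from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif
      from sklearn.pipeline import Pipeline

      class CustomExtractor() :
          def __init__(self, vectorizer=HashingVectorizer()) :
              self.vectorizer = vectorizer
          def fit(self, X, y) :
              return self # stateless
          def transform(self, X, y=None) :
              # Extract features from each file in parallel, then stack the sparse rows
              pool = multiprocessing.Pool()
              rows = pool.map(self.feature_extract, X, 32)
              return scipy.sparse.vstack(list(rows))
          fit_transform = transform
          def feature_extract(self, file_name) :
              # Drop the address column and the "??" placeholders from each line
              clean_bytes = " ".join(toolz.pipe(
                  open(file_name, "r"),
                  map(lambda line : line.rstrip().split()[1:]),
                  toolz.concat,
                  filter(lambda b : b != "??" and b != "?")
              ))
              return self.vectorizer.transform([clean_bytes])

      vectorizer = HashingVectorizer(
          input="content", lowercase=True, stop_words=None,
          ngram_range=(1,3), analyzer="word", n_features=2**16,
          binary=False, norm=None,
          non_negative=True  # as on the slide; later scikit-learn releases replaced this with alternate_sign=False
      )
      pipe = Pipeline([
          ("extraction", CustomExtractor(vectorizer=vectorizer)),
          ("sel", VarianceThreshold(threshold=0)),
          ("tfidf", TfidfTransformer(norm="l2", use_idf=True, smooth_idf=True, sublinear_tf=True)),
          ("kbest", SelectKBest(score_func=f_classif, k=500))
      ])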
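The cleaning step inside `feature_extract` can be tried in isolation. The sketch below restates it with plain generators and no third-party packages: it drops the leading address column and the `??` / `?` placeholder tokens, then joins the remaining bytes into one space-separated "document". The helper name `clean_byte_tokens` and the second sample line are hypothetical.

```python
def clean_byte_tokens(lines):
    """Yield byte tokens from hex-dump lines, skipping the address column."""
    for line in lines:
        # The first whitespace-separated field is the address.
        for token in line.rstrip().split()[1:]:
            # "??" and "?" are placeholder tokens that the talk's code also filters out.
            if token not in ("??", "?"):
                yield token


sample = [
    "00401000 00 00 80 40 40 28 00 1C 02 42 00 C4 00 20 04 20",
    "0046F4DC ?? ?? ?? ?? 55 8B EC 83",  # hypothetical line containing placeholders
]
document = " ".join(clean_byte_tokens(sample))
print(document)
```

Each byte then plays the role of a "word", which is why a text vectorizer with `analyzer="word"` can consume the result directly.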
  • 21. Byte ngrams. Why they might be useful: https://github.com/wapiflapi/binglide
  • 23. Instruction ngrams. Extracted instructions: push lea push mov call mov mov pop retn mov jmp push mov mov call test jz push call add mov pop retn mov mov mov mov retn mov lea mov inc test jnz sub retn mov mov mov push mov push push push push call add mov pop retn mov mov mov push mov push push push push call add mov pop retn xor retn mov retn mov retn mov retn mov mov mov retn mov test jz mov mov push push call mov mov retn push push push push call push call mov push push push mov call mov retn mov mov mov retn mov test jz mov mov push push call mov mov retn push push push push call mov push push push mov call push call mov retn
  • 24. Instruction ngrams
      Code for extracting the instruction ngrams and reducing dimensionality:

      vectorizer = HashingVectorizer(
          input="content", lowercase=True, stop_words=None,
          ngram_range=(1, 2), analyzer="word", n_features=2**25,
          binary=False, norm=None, non_negative=True
      )
      pipe = Pipeline([
          ("extraction", CustomExtractor(vectorizer=vectorizer)),
          ("sel", VarianceThreshold(threshold=0)),
          ("tfidf", TfidfTransformer(norm="l2", use_idf=True, smooth_idf=True, sublinear_tf=True)),
          ("kbest", SelectKBest(score_func=f_classif, k=500))
      ])
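With `ngram_range=(1, 2)` and `analyzer="word"`, each mnemonic and each adjacent pair of mnemonics becomes a feature. A small standard-library sketch of that enumeration follows, using the first nine mnemonics from slide 23. The function name `word_ngrams` is hypothetical, and the hashing step is left out here so the n-grams stay readable.

```python
from collections import Counter


def word_ngrams(tokens, min_n=1, max_n=2):
    """Yield every contiguous n-gram of `tokens` for n in [min_n, max_n]."""
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


# First nine mnemonics of the extracted instruction stream.
mnemonics = "push lea push mov call mov mov pop retn".split()
counts = Counter(word_ngrams(mnemonics))
print(counts.most_common(3))
```

Because bigrams depend on which instructions the disassembler recovers and in what order, differences between disassemblers change these features directly.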
  • 25. Named Features: Section Names, Imports, Imported Functions. Extracted these features with regular expressions. Features were (awkwardly) selected in the same step as instruction ngrams.
  • 26. Named Features
      import re

      # Raw strings restore the \w, \S and \( escapes that the transcript dropped
      re_features = {
          "imports" : {
              "re" : re.compile(r"Imports from \w.+"),
              "extract" : lambda m : m.group().split()[-1],
              "filter" : lambda m : True
          },
          "imported_functions" : {
              "re" : re.compile(r"__stdcall \w.+\("),
              "extract" : lambda m : m.group().split()[-1][:-1],
              "filter" : lambda m : not m.startswith("sub_")
          },
          "section_names" : {
              "re" : re.compile(r"^\S+?:"),
              "extract" : lambda m : m.group()[:-1],
              "filter" : lambda m : True
          }
      }
  • 27. Named Features
      from toolz import pipe, unique
      from toolz.curried import map, filter

      def process_re_feature(lines, re_dict) :
          # Search each line, keep the matches, extract and filter the names, deduplicate
          return pipe(
              lines,
              map(re_dict["re"].search),
              filter(lambda m : m is not None),
              map(re_dict["extract"]),
              filter(re_dict["filter"]),
              unique
          )
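To see what these extractors return, here is a standard-library sketch run on three lines taken from the disassembly slides. It assumes the patterns use the `\w` and `\S` character classes and an escaped `\(` (the backslashes do not survive in the slide transcript). The `PATTERNS` table and the `extract` helper are illustrative names, and the `sub_` filter from the talk is omitted for brevity.

```python
import re

# (pattern, function turning a match into a feature name)
PATTERNS = {
    "imports": (re.compile(r"Imports from \w.+"),
                lambda m: m.group().split()[-1]),
    "imported_functions": (re.compile(r"__stdcall \w.+\("),
                           lambda m: m.group().split()[-1][:-1]),
    "section_names": (re.compile(r"^\S+?:"),
                      lambda m: m.group()[:-1]),
}

lines = [
    ".idata:0046F4DC ; Imports from KERNEL32.DLL",
    ".idata:0046F4DC ; DWORD __stdcall GetCurrentThreadId()",
    ".text:00470050 55 push ebp",
]


def extract(lines, pattern, getter):
    """Return unique extracted values in first-seen order."""
    found = []
    for line in lines:
        match = pattern.search(line)
        if match is not None:
            value = getter(match)
            if value not in found:
                found.append(value)
    return found


features = {name: extract(lines, pattern, getter)
            for name, (pattern, getter) in PATTERNS.items()}
print(features)
```

The section name is whatever precedes the first colon on a line, which is why every disassembly line contributes to that feature while only comment lines contribute imports.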
  • 29. Manual Features (sample 0A32eTdBKayjCWhZqDOQ)
      {
        "number_of_collapsed_functions": 451,
        "number_of_imported_functions": 101,
        "sample_length": 1201668,
        "number_of_imports": 4,
        "number_of_sections": 4,
        "section_length_0": 979764,
        ...
        "section_length_6": 0,
        "length_of_functions_0": 2706,
        ...
        "length_of_functions_15": 107
      }
  • 30. Final Model
      Gradient Boosting Classifier on 1026 features
      Grid search optimized parameters
      Also tried: LogisticRegression, MultinomialNB, KNeighborsClassifier, RandomForestClassifier

      clf = GradientBoostingClassifier(
          loss='deviance', learning_rate=0.1, n_estimators=300,
          subsample=0.9, min_samples_split=2, min_samples_leaf=1,
          min_weight_fraction_leaf=0.0, max_depth=3, init=None,
          random_state=None, max_features=200, max_leaf_nodes=None,
          warm_start=False, verbose=2
      )
  • 31. Final Model tSNE Plot
  • 32. Final Model tSNE Plot
      pipe = Pipeline([
          ("tsvd", TruncatedSVD(n_components=50)),
          ("tsne", TSNE(n_components=2, perplexity=40.0, early_exaggeration=4.0,
                        learning_rate=1000.0, n_iter=1000, metric='euclidean',
                        init='random'))
      ])
  • 33. Results: I did OK… More focused on productization
  • 34. Winning Strategies: xgboost; malware as an image; compression ratio as a feature; other expanded feature sets; probability calibration; semi supervised learning. Slide legend: usable in a product / specific to competitions
  • 35. Using Capstone
      ida ******************************
      CV Scores: [ 0.03800 0.02551 0.05283 0.03953 0.0350 ]
      mean: 0.03817940685733493 std: 0.008799619405211161
      capstone ******************************
      CV Scores: [ 0.05065 0.0451 0.06953 0.05583 0.05089]
      mean: 0.05441113231562615 std: 0.008283830117670508

      code = bytes(bytearray.fromhex("".join(map(
          lambda l : "".join(l.split()[1:]).replace("?", ""),
          open("data/sample/0A32eTdBKayjCWhZqDOQ.bytes", "r")
      ))))
      from capstone import Cs, CS_ARCH_X86, CS_MODE_32
      md = Cs(CS_ARCH_X86, CS_MODE_32)
      instructions = " ".join(
          [t[2] for t in md.disasm_lite(code, 0x1000) if t[2] != "int3"]
      )
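The two cross-validation summaries on slide 35 can be recomputed from the per-fold scores as a quick sanity check. This sketch assumes the scores are a loss where lower is better (the slides do not name the metric) and uses the population standard deviation, which reproduces the slide's figures to within the rounding of the displayed fold scores.

```python
from statistics import mean, pstdev

# Per-fold scores as displayed on the slide (already rounded there).
cv_scores = {
    "ida": [0.03800, 0.02551, 0.05283, 0.03953, 0.0350],
    "capstone": [0.05065, 0.0451, 0.06953, 0.05583, 0.05089],
}

summary = {name: (mean(scores), pstdev(scores))
           for name, scores in cv_scores.items()}

for name, (avg, spread) in summary.items():
    print(f"{name}: mean={avg:.4f} std={spread:.4f}")
```

The gap between the two means is larger than either standard deviation, which is consistent with the talk's point that the choice of disassembler matters for instruction ngrams.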
  • 36. Disassemblers
      IDA: not (easily) batch distributable
      capstone: single pass produces suboptimal results
      radare2: Python scriptable reversing framework
      vivisect: pure Python, largely undocumented disassembler and analysis project
  • 37. Other Projects
      pefile: extracts header information from executables
      binglide: visualizations of entropy and byte ngrams
      cuckoo: automated dynamic analysis
      barf: binary analysis framework with code analysis
  • 38. Conclusions: Python tools for text classification can easily be adopted for malware classification. When using instruction ngrams, your disassembler and analysis passes are very important. References: http://bit.ly/scipy-malware