
Machine Learning for Security

by Josiah Hagen on 18 August 2015


Transcript of Machine Learning for Security

Courtesy of Shutterstock
Question
Data
Features / Distance
Algorithm
Non-Existent
Subdomains
Machine Learning for Security
What is Machine Learning?
How can we use it in security?
Data Analysis
Framing a Problem
Data
Asking Questions
Question
Data
Features / Distance
Algorithm
Logs
Traffic
Files
Metadata
Malicious?
Vulnerable?
Linear Regression
Unsupervised
Principal Component Analysis
K-Means Clustering
How It Works
Vulnerability
Extrapolation
How it Works
Tools
Free Code
Expensive Code
Logistic Regression
Malicious PE
Decision Tree
Anomaly Detection
How it works
Basic Example

Links
https://www.csie.ntu.edu.tw/~cjlin/liblinear/
https://www.csie.ntu.edu.tw/~cjlin/libsvm/
http://scikit-learn.org/stable/
http://www.inside-r.org/category/packagetags/machinelearning
https://www.gnu.org/software/octave/
https://sourceforge.net/projects/weka/
http://elki.dbs.ifi.lmu.de/
https://www.knime.org/
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." - Tom M Mitchell
Josiah Hagen
HP TippingPoint DVLabs Research Group
"Whoever has the most data wins."
josiah.hagen@hp.com
Supervised
How much will my botnet earn?
Gather some data
Number of bots
Linear Regression
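The botnet-earnings question above can be sketched as an ordinary least-squares fit of earnings against number of bots. This is only a minimal illustration, not code from the talk: the data points are invented for the example and `fit_line` is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up observations: number of bots vs. earnings (arbitrary units).
bots = [1000, 2000, 3000, 4000]
earnings = [50.0, 100.0, 150.0, 200.0]

slope, intercept = fit_line(bots, earnings)
predicted = slope * 5000 + intercept  # estimate for a 5000-bot botnet
```

With real data the fit would not be exact, and the residuals would show how far a straight line can be trusted.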
What files are malicious?
Known malware / benign .exe's
PE Header Imports
Logistic Regression
Microsoft Malware Classification Challenge (BIG 2015)
http://www.kaggle.com/c/malware-classification
What botnet DGA is this?
DGA code, samples, valid data
Syntax: characters and their position
Logistic Regression / SVM
Feature Vector
Vowels
lowercase
UPPER
Consonants
Other
D1G1T5
Bigrams
Characters
Trigrams
Characters By Position
TLD
Length
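The feature vector listed above could be built roughly as follows. This is a sketch under stated assumptions, not the talk's implementation: `domain_features` and the feature key names are hypothetical, the input is assumed to contain at least one dot, and the TLD is taken naively as the last label.

```python
from collections import Counter

VOWELS = set("aeiou")

def domain_features(domain):
    """Turn a domain name into a dict of simple syntactic features."""
    name, _, tld = domain.rpartition(".")
    chars = name.replace(".", "")
    letters = [c for c in chars if c.isalpha()]
    features = {
        "length": len(chars),
        "tld": tld.lower(),
        "vowels": sum(c.lower() in VOWELS for c in letters),
        "consonants": sum(c.lower() not in VOWELS for c in letters),
        "lowercase": sum(c.islower() for c in chars),
        "uppercase": sum(c.isupper() for c in chars),
        "digits": sum(c.isdigit() for c in chars),
        "other": sum(not c.isalnum() for c in chars),
    }
    lowered = chars.lower()
    # Character bigram and trigram counts.
    for n, label in ((2, "bigram"), (3, "trigram")):
        grams = Counter(lowered[i:i + n] for i in range(len(lowered) - n + 1))
        for gram, count in grams.items():
            features[f"{label}:{gram}"] = count
    # Characters by position.
    for i, c in enumerate(lowered):
        features[f"pos{i}:{c}"] = 1
    return features

example = domain_features("Ab3-xy.com")
```

A sparse dict like this can be fed to a vectoriser before training a logistic regression or SVM classifier.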
How did it do?
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
F1 = 2PR/(P+R) = 2TP/(2TP + FP + FN)
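The three scores above (precision, recall, and their harmonic mean, F1) can be checked numerically. The counts below are hypothetical, chosen only to show that the two forms of the last formula agree.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)

# The closed form from the slide: 2TP / (2TP + FP + FN).
f1_closed_form = 2 * 8 / (2 * 8 + 2 + 4)
```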
DGA Family
Hidden Markov Model
LIBLINEAR
LIBSVM
scikit-learn
inside-R.org
Mathematica
Octave
Weka
ELKI
KNIME
Matlab
Windows Azure
https://github.com/bniemczyk/pacumen
http://www.covert.io/research-papers/security/Vulnerability%20Extrapolation%20-%20Assisted%20Discovery%20of%20Vulnerabilities%20using%20Machine%20Learning.pdf
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=6983873

http://en.wikipedia.org/wiki/Machine_learning
http://www8.hp.com/us/en/software-solutions/dvlabs-security-threat-intelligence/
http://h30499.www3.hp.com/t5/HP-Security-Research-Blog/bg-p/off-by-on-software-security-blog
Eigenvalues of
Singular Value Decomposition


No need to create
WOOT '11: Vulnerability Extrapolation: Assisted Discovery of Vulnerabilities using Machine Learning - Yamaguchi, Lindner, & Rieck
What vulnerabilities are in the code?
FFmpeg code and CVE-2010-3429
API Calls
PCA, SVD
Uncovering New DGAs
ISSRE 2014: Finding Domain-Generation Algorithms by Looking at Length Distribution - Miranda Mowbray HPLabs and Josiah Hagen
What hosts are calling DGAs?
DNS logs
Lengths of 2LDs that hosts requested
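The "lengths of 2LDs that hosts requested" feature could be collected per host roughly like this. It is a sketch, not the paper's method: the log format (host, queried domain) is an assumption, both helper names are hypothetical, and the second-level label is taken naively as the label before the TLD, which ignores multi-part public suffixes such as `co.uk`.

```python
from collections import Counter, defaultdict

def second_level_label(domain):
    """Naively return the label just left of the TLD."""
    labels = domain.rstrip(".").lower().split(".")
    return labels[-2] if len(labels) >= 2 else labels[0]

def length_distribution(dns_log):
    """Map each host to a histogram of 2LD label lengths it requested."""
    per_host = defaultdict(Counter)
    for host, domain in dns_log:
        per_host[host][len(second_level_label(domain))] += 1
    return per_host

# Invented log entries: (requesting host, queried domain).
log = [
    ("10.0.0.1", "www.example.com"),
    ("10.0.0.1", "qwertyuiopas.net"),
    ("10.0.0.1", "zxcvbnmasdfg.org"),
    ("10.0.0.2", "mail.example.com."),
]
dist = length_distribution(log)
```

A host whose histogram piles up on a single unusual length is the kind of signal the slide's title refers to.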

Uncovered 19 DGAs in 5 days of data, 9 of which were previously unknown
https://www.mlsecproject.org/
Minimize within group mean distances
How do you pick K?
Find the elbow
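One way to "find the elbow" is to run K-means for several values of K and watch where the within-group cost stops dropping sharply. The sketch below is a plain-Python illustration with invented points. It uses the sum of squared distances as the cost and a deterministic farthest-point initialisation; both are assumptions made here for reproducibility, not details from the talk.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; returns (centers, within-group sum of squares)."""
    # Farthest-point initialisation: start at the first point, then keep
    # adding the point farthest from all centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # Move each center to the mean of its group (keep it if group is empty).
        centers = [
            tuple(sum(dim) / len(g) for dim in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    cost = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, cost

# Invented 2-D points forming three tight groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
costs = {k: kmeans(points, k)[1] for k in range(1, 6)}
```

On this toy data the cost falls steeply up to K = 3 and barely changes afterwards, so the elbow sits at 3.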
Construction (C4.5)
All same class: leaf
No gain: move up and use EV
New class: move up and use EV
Base case?
Pick attribute that gives highest information gain on splitting
Create a decision node for that attribute
Recur on sublists
Information Gain
Base
Loop
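The "pick attribute that gives highest information gain" step above can be sketched as below. The rows and labels are invented, and the helper names are hypothetical. Note that C4.5 as published normalises this quantity into a gain ratio; the sketch follows the slide's wording and uses plain information gain.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction from splitting rows (dicts) on one attribute."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

def best_attribute(rows, labels, attributes):
    """Attribute to place in the next decision node."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))

# Invented toy data: attribute "a" separates the classes, "b" does not.
rows = [{"a": "x", "b": "p"}, {"a": "x", "b": "q"},
        {"a": "y", "b": "p"}, {"a": "y", "b": "q"}]
labels = ["ssh", "ssh", "http", "http"]
chosen = best_attribute(rows, labels, ["a", "b"])
```

The recursive construction would then create a node for `chosen`, partition the rows by its values, and recur on each sublist until a base case is hit.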
SSH Tunneling
What application(s) run over the tunnel?
Encrypted packets from apps running
Packet sizes in 10 second windows
Decision Tree vs others
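The "packet sizes in 10 second windows" feature could be derived from a capture along these lines. The input format (timestamp in seconds, size in bytes) and the per-window summary statistics are assumptions for illustration; the talk does not say which statistics were used, and `window_features` is a hypothetical helper.

```python
from collections import defaultdict

def window_features(packets, window=10.0):
    """Group (timestamp_seconds, size_bytes) pairs into fixed-length windows."""
    buckets = defaultdict(list)
    for ts, size in packets:
        buckets[int(ts // window)].append(size)
    # Summarise each window; these rows would be the classifier's inputs.
    return {
        idx: {"count": len(sizes), "total": sum(sizes), "max": max(sizes)}
        for idx, sizes in sorted(buckets.items())
    }

# Invented packets observed on an encrypted tunnel.
packets = [(0.5, 100), (9.9, 60), (10.0, 1500), (25.0, 40)]
features = window_features(packets)
```

Only sizes and timing are used, which is the point: payloads stay encrypted, yet the traffic shape still leaks which application is running.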
Black Hat 2014: Identification over Encrypted Channels - Prasad Rao HPLabs, Brandon Niemczyk DVLabs
https://github.com/bniemczyk/pacumen
Narrowed search from 6778 to 20
Found another known vuln and 0-day
Exploitable
How many subdomain groups?
DNS logs
Syntactic ED / ED
K-Means
Infosec SW: Network Threat Detection via Machine Learning - BN & JH
Non-Existent
Subdomains
Syntactic ED / ED
Subdomains with few natural clusters are blocklists, CDNs, telcos
Subdomains with 5 or more natural clusters are malicious sites
DNS Traffic
Virus Bulletin 2013: Using statistical analysis of DNS traffic to identify infections of unknown malware - BN & JA
What hosts are infected?
DNS logs
Periodicity and Resource Records
Hidden Markov Model
At the first event we see for a host, we assume the host is infected with prior probability P(I). Each subsequent event then updates P(I).
Updates to P(I) are made using Bayes' Rule
Highly periodic requests for domains with 3+ resource records step us towards UNKNOWN - update servers?
The entire row of low-periodicity requests gives us almost no information
Highly periodic requests with 0 or 1 resource records step us quickly towards PROBABLY_INFECTED - C2 servers?
Rows (request periodicity): High, Med, Low, Unknown
Columns (resource records in response): 0 RR, 1 RR, 2 RR, 3+ RR
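The Bayes' Rule update of P(I) described above can be written out for a single observed event. This sketch shows only that one step, not the full Hidden Markov Model, and every number in it is hypothetical: the prior and the per-event likelihoods would have to be estimated from labelled DNS traffic.

```python
def bayes_update(prior, p_event_given_infected, p_event_given_clean):
    """Posterior P(I | event) from prior P(I) and the two likelihoods."""
    numerator = p_event_given_infected * prior
    evidence = numerator + p_event_given_clean * (1.0 - prior)
    return numerator / evidence

# Hypothetical likelihoods for two kinds of (request, response) events.
HIGH_PERIODIC_0_RR = (0.30, 0.01)  # far more likely on infected hosts
LOW_PERIODIC = (0.10, 0.10)        # equally likely either way

p_infected = 0.01  # assumed prior at the first event for a host
history = [p_infected]
for event in (HIGH_PERIODIC_0_RR, LOW_PERIODIC, HIGH_PERIODIC_0_RR):
    p_infected = bayes_update(p_infected, *event)
    history.append(p_infected)
```

An event that is equally likely for infected and clean hosts leaves P(I) unchanged, which matches the remark that low-periodicity requests carry almost no information.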