CanSecWest 2014 (Public)

by

Brandon Niemczyk

on 21 March 2014

Transcript of CanSecWest 2014 (Public)

[Heat map axes: Req - High Periodicity, Req - Mid Periodicity, Req - Low Periodicity, Req - Unknown; Response - 0 RR, Response - 1 RR, Response - 2 RR, Response - 3+ RR]
We can visualize how each pair of DNS events will affect a host that is currently believed to be equally likely to be PROBABLY_INFECTED or UNKNOWN by producing a heat map.
Weight votes of each host based on how "average" the host is in terms of volume
Hosts far from the median get less of a vote
Dampen our probabilities so that steps are always in the same direction but not as large, allowing a smoother transition between classes.
This means the rows of our Markov chains do not sum to 1
It still works!
Don't allow hosts with a lot of data to influence our decision too much.
Instead generate our class models with a voting algorithm - equal vote for each host.
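A minimal sketch of the voting idea above, not the authors' code. Each host contributes its own row-normalized transition matrix, its vote is weighted by how close its event volume is to the median, and the result is dampened. The weighting formula, the dampening exponent `alpha`, and all function names are assumptions made for illustration; the talk does not give the exact formulas.

```python
from statistics import median

N_STATES = 8  # the eight request/response categories used in this talk


def host_transition_probs(events):
    """Row-normalized transition frequencies for one host's event sequence."""
    counts = [[0.0] * N_STATES for _ in range(N_STATES)]
    for prev, cur in zip(events, events[1:]):
        counts[prev][cur] += 1.0
    for row in counts:
        total = sum(row)
        if total:
            for j in range(N_STATES):
                row[j] /= total
    return counts


def vote_weight(volume, med, spread):
    """Hypothetical weight: 1.0 at the median volume, shrinking with distance."""
    return 1.0 / (1.0 + abs(volume - med) / spread)


def class_model(hosts, alpha=0.5):
    """Weighted average of per-host matrices, then dampened by exponent alpha.

    hosts: dict of host -> list of category ids (0..7).
    alpha: assumed dampening exponent in (0, 1]; 1.0 means no dampening.
    """
    volumes = [len(events) for events in hosts.values()]
    med = median(volumes)
    spread = median(abs(v - med) for v in volumes) or 1.0
    model = [[0.0] * N_STATES for _ in range(N_STATES)]
    total_weight = 0.0
    for events in hosts.values():
        w = vote_weight(len(events), med, spread)
        probs = host_transition_probs(events)
        for i in range(N_STATES):
            for j in range(N_STATES):
                model[i][j] += w * probs[i][j]
        total_weight += w
    # p ** alpha keeps the order of any two probabilities, so the direction of
    # a step is unchanged, but the rows no longer sum to 1.
    return [[(p / total_weight) ** alpha for p in row] for row in model]


# Made-up hosts: "c" has far more data but gets a much smaller vote.
example_hosts = {"a": [0, 4] * 3, "b": [0, 5] * 3, "c": [0, 4] * 50}
example_model = class_model(example_hosts)
```

Raising each probability to a power below 1 is only one way to shrink steps without changing their direction; any monotonic squashing of the probabilities would have the same qualitative effect.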
Case Study: Zero Knowledge Infection Detection
Label DNS data with known infected hosts
Training will require semi-labeled data
Build Markov Models
We want to build a Markov chain for each class (PROBABLY_INFECTED and UNKNOWN) where DNS requests and replies are the states.
Detecting Hosts that appear to utilize a domain generation algorithm
Steal from basic cryptanalysis techniques
Find the probability of characters and character pairs
We utilize the English dictionary
Generating probabilities can easily be done for any language
Generate expected probabilities
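As a rough illustration of generating the expected probabilities, the sketch below counts single characters and adjacent character pairs over a word list. The function name and the tiny word list are placeholders; a real run would read a full dictionary for the chosen language.

```python
from collections import Counter


def char_and_pair_probs(words):
    """Return (char_probs, pair_probs) estimated from an iterable of words."""
    chars = Counter()
    pairs = Counter()
    for word in words:
        word = word.lower()
        chars.update(word)
        pairs.update(a + b for a, b in zip(word, word[1:]))
    n_chars = sum(chars.values())
    n_pairs = sum(pairs.values())
    char_probs = {c: n / n_chars for c, n in chars.items()}
    pair_probs = {p: n / n_pairs for p, n in pairs.items()}
    return char_probs, pair_probs


# Tiny stand-in word list, for illustration only.
WORDS = ["the", "then", "there", "net", "ten"]
CHAR_PROBS, PAIR_PROBS = char_and_pair_probs(WORDS)
```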
Detecting Hosts that appear to utilize a domain generation algorithm
Map domain names to a 2-tuple of real values
Replace each character in the domain with its expected probability
Take the mean
Replace each character pair in the domain with its expected probability
Take the mean
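The mapping above can be sketched as follows. The probability tables here are made-up numbers, and treating characters or pairs missing from the tables as probability 0.0 is an assumption of this sketch.

```python
def domain_features(label, char_probs, pair_probs):
    """Map a domain label to (mean char probability, mean pair probability).

    Characters or pairs absent from the tables count as probability 0.0,
    which is an assumption of this sketch.
    """
    label = label.lower()
    chars = [char_probs.get(c, 0.0) for c in label]
    pairs = [pair_probs.get(a + b, 0.0) for a, b in zip(label, label[1:])]
    mean_char = sum(chars) / len(chars) if chars else 0.0
    mean_pair = sum(pairs) / len(pairs) if pairs else 0.0
    return mean_char, mean_pair


# Toy probability tables, invented for illustration only.
TOY_CHAR_PROBS = {"e": 0.12, "t": 0.09, "n": 0.07, "x": 0.002, "q": 0.001}
TOY_PAIR_PROBS = {"te": 0.010, "en": 0.012, "nt": 0.008}

word_like = domain_features("tent", TOY_CHAR_PROBS, TOY_PAIR_PROBS)
random_like = domain_features("xqxq", TOY_CHAR_PROBS, TOY_PAIR_PROBS)
```

With these toy tables, the word-like label scores higher than the random-looking one on both components of the tuple.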
Detecting Hosts that appear to utilize a domain generation algorithm
Find a non-linear mapping of domain 2-tuples -> reals that maximizes distance between randomly generated domains and language-based domains
Utilized an in-house developed fuzzy logic module to generate a parametrized equation
Labeling some hosts as PROBABLY_INFECTED
Label any hosts that meet the qualifications as probably infected.
Requests a 2nd-level domain that is randomly generated
Requests a known C&C server
Check for domain in DVLabs RepDV blacklist
All other hosts get labeled UNKNOWN
Label as many hosts as we can as PROBABLY_INFECTED
Host makes a request for a known C2 server
Host makes a request for a 2nd-level domain that appears to be pseudo-random
Other hosts get labeled as UNKNOWN
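A small sketch of this labeling rule, under stated assumptions: `blacklist` stands in for a reputation feed of known C2 domains, `looks_random` is a hypothetical callable wrapping whatever DGA detector is available, and the 2nd-level label is extracted with a naive split that ignores multi-part public suffixes.

```python
PROBABLY_INFECTED = "PROBABLY_INFECTED"
UNKNOWN = "UNKNOWN"


def second_level_label(domain):
    """Return the 2nd-level label, e.g. 'example' for 'www.example.com'.

    Naive split: it ignores multi-part public suffixes such as 'co.uk'.
    """
    parts = domain.rstrip(".").lower().split(".")
    return parts[-2] if len(parts) >= 2 else parts[0]


def label_host(requested_domains, blacklist, looks_random):
    """Label one host from the domains it requested.

    blacklist: set of known C2 domains (stand-in for a reputation feed).
    looks_random: callable(label) -> bool, a hypothetical DGA detector.
    """
    for domain in requested_domains:
        name = domain.rstrip(".").lower()
        if name in blacklist or looks_random(second_level_label(name)):
            return PROBABLY_INFECTED
    return UNKNOWN
```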
There is no way to know that a host is not infected.

Finding infected hosts becomes finding hosts that fit the PROBABLY_INFECTED model better than the UNKNOWN model.
Categorizing DNS requests and responses
8 categories
Requests (Categories 0 - 3)
domain has high periodicity score
domain has average periodicity score
domain has low periodicity score
not enough data to determine domain periodicity score
Responses (Categories 4 - 7)
0 resource records (NXDOMAIN)
1 resource record
2 resource records
3+ resource records
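The eight categories can be encoded as small integers, as in the sketch below. The numbering follows the order listed above; the periodicity cut-offs and the assumption that the score lies in [0, 1] are hypothetical, since the talk does not state the thresholds.

```python
REQ_HIGH, REQ_MID, REQ_LOW, REQ_UNKNOWN = 0, 1, 2, 3
RESP_0_RR, RESP_1_RR, RESP_2_RR, RESP_3PLUS_RR = 4, 5, 6, 7

# Hypothetical cut-offs on a periodicity score assumed to lie in [0, 1].
LOW_CUTOFF, HIGH_CUTOFF = 0.33, 0.66


def categorize_request(score):
    """Map a domain's periodicity score to categories 0-3.

    score is None when there is not enough data for the domain.
    """
    if score is None:
        return REQ_UNKNOWN
    if score >= HIGH_CUTOFF:
        return REQ_HIGH
    if score >= LOW_CUTOFF:
        return REQ_MID
    return REQ_LOW


def categorize_response(rr_count):
    """Map the number of resource records in a reply to categories 4-7."""
    return RESP_0_RR + min(rr_count, 3)
```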
Build Markov chains for PROBABLY_INFECTED and UNKNOWN
Each class gets an 8x8 matrix
Populate each class's matrix with the probability of one category following another
Looking at the data as a whole
This is our training. It is done only once (or can be done on a scheduled basis).
Massage the matrices to allow for smooth transitions between classes. Will go over this later.
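A minimal sketch of populating one class's 8x8 matrix from category sequences, before any massaging. It pools all transitions for the class; the function name and the example sequences are illustrative only.

```python
N_STATES = 8  # categories 0-3 are requests, 4-7 are responses


def transition_matrix(sequences):
    """Estimate P(next category | current category) from event sequences."""
    counts = [[0] * N_STATES for _ in range(N_STATES)]
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows with no observations stay all zero.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Made-up category sequences for two hosts of the same class.
example = transition_matrix([[0, 4, 0, 5], [0, 4]])
```

One matrix would be built this way from the PROBABLY_INFECTED hosts and another from the UNKNOWN hosts.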
Visualizing and adjusting the steps taken with event pairs.
We learned something!
Interesting observations of our step-visualizations
Highly periodic requests of domains with 3+ resource records step us towards UNKNOWN - update servers?
The entire row of low-periodic requests gives us almost no information
Highly periodic requests with 0 or 1 resource records step us quickly towards PROBABLY_INFECTED - C2 servers?
Results
After running on one day of the DNS data we collected:
Increased belief of infection on approximately 20% of hosts.
Belief of infection was increased for over 90% of our originally labeled PROBABLY_INFECTED hosts.
Used random domains and known domains with respective target values of 1.0 and 0.0
Performed a least squares regression to find optimal parameter values
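To illustrate the fitting step only: the sketch below swaps the in-house fuzzy logic equation, which is not shown in the talk, for a simple logistic function of the two domain features, and minimizes squared error against targets of 1.0 (random) and 0.0 (known) by plain gradient descent. The model form, parameter names, learning rate, and sample values are all assumptions.

```python
import math


def score(params, x, y):
    """Assumed model: logistic function of a linear mix of the two features."""
    a, b, c = params
    return 1.0 / (1.0 + math.exp(-(a * x + b * y + c)))


def fit_least_squares(samples, steps=5000, lr=1.0):
    """Minimize sum((score - target) ** 2) by plain gradient descent.

    samples: list of ((x, y), target), target 1.0 for random domains and
    0.0 for known, language-based domains.
    """
    params = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for (x, y), target in samples:
            s = score(params, x, y)
            common = 2.0 * (s - target) * s * (1.0 - s)
            grad[0] += common * x
            grad[1] += common * y
            grad[2] += common
        params = [p - lr * g for p, g in zip(params, grad)]
    return params


# Made-up (mean char probability, mean pair probability) feature tuples.
SAMPLES = [
    ((0.030, 0.0010), 1.0), ((0.040, 0.0020), 1.0), ((0.035, 0.0005), 1.0),
    ((0.090, 0.0100), 0.0), ((0.080, 0.0120), 0.0), ((0.085, 0.0090), 0.0),
]
FITTED = fit_least_squares(SAMPLES)
```

After fitting, random-looking tuples should score higher than language-like ones; a production fit would likely rescale the features and use a proper optimizer.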
Network Data Capture
Lossless capture at line speed
Packet filtering (possibly complex)
Inline CPU processing capability
Large storage capacity for extended deployments
Network Capture Adapter
Napatech NT4E-STD
4 1Gb/s network interfaces
4 PCIe Gen1 lanes @ 2.5 GT/s (6.3 Gb/s)
HW timestamp (10ns)
<2% CPU load @ full line rate capture
Multi-CPU packet distribution
6s of onboard burst data buffering
Data Storage Array
HP P-410/1G
SAS Controller
6Gb/s SAS and 3Gb/s SATA support
SATA NCQ for all controllers
256MB DDR2/800 RAM
1GB Flash backed write cache
8 PCIe Gen2 lanes

4 4TB SATA drives
Capture
Label
Model
Detect
Detection
For the first event we see from a host, we assume the host is infected with probability P(I). We then update P(I) with each following event.
Increments to P(I) are done using Bayes' Rule
Taken from PROBABLY_INFECTED model
Taken from UNKNOWN model
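The update can be sketched as a standard application of Bayes' rule, where the likelihood of each event pair comes from the PROBABLY_INFECTED matrix on one side and the UNKNOWN matrix on the other. This is an illustration under assumptions, not the authors' code: the function names, the default prior of 0.5, and the handling of pairs unseen in both models are choices made for this sketch.

```python
def update_belief(p_infected, prev, cur, infected_model, unknown_model):
    """One Bayes' rule step on P(I) after seeing the event pair prev -> cur."""
    like_i = infected_model[prev][cur]  # taken from PROBABLY_INFECTED model
    like_u = unknown_model[prev][cur]   # taken from UNKNOWN model
    evidence = p_infected * like_i + (1.0 - p_infected) * like_u
    if evidence == 0.0:
        return p_infected  # pair unseen in both models: no information
    return p_infected * like_i / evidence


def score_host(events, infected_model, unknown_model, prior=0.5):
    """Fold the update over a host's whole sequence of event categories."""
    p = prior
    for prev, cur in zip(events, events[1:]):
        p = update_belief(p, prev, cur, infected_model, unknown_model)
    return p


# Two-state toy models, invented to keep the example small.
TOY_INFECTED = [[0.5, 0.5], [0.9, 0.1]]
TOY_UNKNOWN = [[0.5, 0.5], [0.3, 0.7]]
belief = score_host([1, 0, 1], TOY_INFECTED, TOY_UNKNOWN)
```

A pair that is equally likely under both models leaves P(I) unchanged, which matches the observation that some rows carry almost no information.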
[Figure: timeline of requests to one domain, with intervals of 10, 10, 10, 10, and 20 minutes, and a value labeled L]
This is our periodicity score.
Calculating Periodicity of a Domain
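The transcript does not preserve the formula behind the periodicity score, so the sketch below is only one plausible stand-in and should not be read as the authors' calculation. It assumes a score based on how regular the gaps between requests are: 1.0 for perfectly even gaps, lower as the gaps vary more. The function name and the `min_requests` threshold are hypothetical.

```python
from statistics import mean, pstdev


def periodicity_score(timestamps, min_requests=4):
    """Assumed score in (0, 1]: 1 / (1 + coefficient of variation of gaps).

    Returns None when there are too few requests to judge periodicity.
    """
    if len(timestamps) < min_requests:
        return None
    ordered = sorted(timestamps)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    avg = mean(gaps)
    if avg == 0:
        return None
    return 1.0 / (1.0 + pstdev(gaps) / avg)


# Request times in minutes, giving gaps of 10, 10, 10, 10, and 20 minutes.
example_score = periodicity_score([0, 10, 20, 30, 40, 60])
```

For the example timestamps the gaps have mean 12 and population standard deviation 4, so this stand-in score comes out at about 0.75.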
Contact Us
Brandon Niemczyk
- insecurity@hp.com
Jonathan Andersson
- jonathan.andersson@hp.com
Configuration
PCIe
Ensure the required number of PCIe lanes are available from the selected MB slot for each adapter and not shared or routed through the South Bridge
Ensure the selected slot on the MB supports the latest PCIe Gen the adapter supports: PCIe Gen2 -> PCIe Gen2
Mechanical PCIe x8 != Electrical PCIe x8
Software
Custom C code:
packets distributed to multiple cores (up to 32)
capture raw data in PCAP nano format
parse & batch insert into MSSQL 2012 DB using native C ODBC API and parameterized stored procs with Table Valued Parameters
>50K insertions / sec (with moderate indexing)
Data
Captured data in a large real-world network for 87 consecutive days.
42B records representing 9B DNS requests
106M requests / day
7M unique hosts
Network Threat Detection Via Machine Learning
When is it useful?
When no known perfect solution exists
To generalize current knowledge to future unknowns
Network stream concerns
No fixed feature vector size
Analyze stream in small parts
Brandon Niemczyk & Jonathan Andersson
March 14, 2014

TippingPoint Research Group: HP DVLabs

What about security?
Listen on network
?
?
?
Identify malicious behavior

Why Machine Learning?
No reversing or signatures necessary
Resistant to obfuscation
Identify malicious behavior, not binaries
Can detect previously unknown malware
Machine Learning
Application
Problems
Confidence
Classification of a part ignores history
Maintain confidence and update over time
Confidence
How is confidence calculated?
Easy with generative models
Discriminative
Generative