Ph.D. Proposal

My research on Parallel All Pairs Similarity Search Optimizations.
by

Maha Alabduljalil

on 10 March 2014


Transcript of Ph.D. Proposal

Too long to fit in cache!
Repeat until other potentially similar vectors are compared.
Hölder's inequality:

Dot(d_i, d_j) ≤ ||d_i||_p × ||d_j||_q, where 1/p + 1/q = 1
Optimizing Parallel Algorithms for All Pairs Similarity Search
Partition-based Similarity Search (PSS)
Key techniques:

Data partitioning with dissimilarity detection.
Partition-based load balancing.
Hybrid indexing.

Cache-conscious APSS Optimizations
Key techniques:

Analyze memory hierarchy behavior in PSS tasks.
Index splitting.
Controlled vector coalescing.

Research Plan
Complete load balancing algorithms and implementation for PSS2.
Develop and open-source a software package for parallel similarity search running on MapReduce.
Prepare dissertation.
Plan the defense to be held Feb’14.

Goal
Develop APSS techniques that are:


Scalable.
Efficient.
General.

Introduction
Definition of All Pairs Similarity Search (APSS).
Related Work and Motivation.
Problem Statement & Baseline.

Ph.D. Proposal: Maha Alabduljalil
Department of Computer Science
University of California, Santa Barbara

Committee: Omer Egecioglu, Xifeng Yan, Tao Yang (Chair)

PSS1
PSS2
Conclusions
Evaluation


UCSB knot cluster: 84 nodes, 12 cores, 2.66 GHz Intel, 24 GB RAM.
Local cluster: 3 nodes, 6 cores, 3.1 GHz AMD, 16 GB RAM.
Implementation: Java-based Hadoop.
S. Zhu, A. Potapova, M. Alabduljalil, X. Liu, T. Yang. "Clustering and Load Balancing Optimization for Redundant Content Removal." WWW 2012.
M. Alabduljalil, X. Tang, T. Yang. "Optimizing Parallel Algorithms for All Pairs Similarity Search." WSDM 2013.
M. Alabduljalil, X. Tang, T. Yang. "Cache-Conscious Performance Optimization for Similarity Search." SIGIR 2013.
M. Alabduljalil, X. Tang, X. Jin, T. Yang. "Load Balancing for Partitioned Similarity Search." (Under submission.)
X. Tang, M. Alabduljalil, T. Yang. "Cache-Conscious Data Layout and Traversal for Similarity Search." (To be submitted.)
Current publications
Speedup

PSS1 focused on splitting S to fit into cache.
PSS1 does not consider cache reuse to improve temporal locality in memory areas B and C.
Two-stage Partition-based Similarity Search (PSS)

APSS is still time consuming for big datasets.


Scalable, efficient and general for parallel architectures.
Reuse the accumulator fetched to cache as much as possible by coalescing multiple vectors from B.
Use a model to find the optimum group size.
Vector coalescing (PSS2)
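The coalescing idea can be sketched by processing B in fixed-size groups, so that a split's data (and its accumulator) fetched into cache is reused across the whole group. This is an illustrative Python sketch under assumed dict-based sparse vectors; the real PSS2 is Java/Hadoop, traverses inverted indexes, and picks the group size g from a cost model.

```python
def coalesced_compare(split, B, g):
    """Score a split of S against B, g vectors of B at a time.

    Processing a group of g vectors from B against the same split means the
    split's weights (and its accumulator entries) are touched once per group
    rather than once per B vector, improving temporal locality.
    """
    B_items = list(B.items())
    scores = {}
    for start in range(0, len(B_items), g):
        group = B_items[start:start + g]   # coalesced vectors from B
        for i, di in split.items():        # split data is reused per group
            for j, dj in group:
                s = sum(w * dj[f] for f, w in di.items() if f in dj)
                if s > 0:
                    scores[(i, j)] = s
    return scores
```

The result is identical to comparing one vector at a time; only the traversal order (and thus the cache behavior) changes.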

Vector-level skipping, i.e., use vector IDs to skip.
Eliminates computation, but not I/O.
Symmetric comparisons
Also referred to as SSJoin, RSJoin, record linkage, entity resolution.
Related problem: when the threshold is high, it is categorized as Duplicate Detection (DD).


All Pairs Similarity Search (APSS) is the process of
finding all pairs of objects whose similarity is above a certain threshold.
Problem Definition
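The definition above can be sketched as a brute-force quadratic scan (a minimal Python illustration for reference only; the actual implementation in this work is Java-based Hadoop, and the sparse-dict vector representation here is an assumption):

```python
from itertools import combinations
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {feature: weight} dicts."""
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def apss(vectors, tau):
    """Return every pair (i, j, score) whose cosine similarity exceeds tau."""
    pairs = []
    for i, j in combinations(range(len(vectors)), 2):
        score = cosine(vectors[i], vectors[j])
        if score > tau:
            pairs.append((i, j, score))
    return pairs
```

The quadratic cost of this scan is exactly what the partitioning and filtering techniques in the proposal aim to avoid.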

Effect of s and b on PSS2 performance (Twitter)

Average number of shared features

Data Cleaning Example

Impact of split size s in PSS1 (Emails, Twitter, Clueweb)
Ratio: 1 – Time(optimized)/Time(baseline)

How to detect dissimilarity without pair-wise comparison?
CosSim(d_i, d_j) = Σ_t (w_i,t × w_j,t)
d_i is dissimilar to d_j if: ||d_i||_1 < τ / maxw(d_j)
How To Detect Dissimilar Vectors?

Baseline: SIGIR’09 using inverted index without heuristics.

Open Source and Baseline

Ratio of Data Access to Computation in PSS1 (data access vs. computation, by split size s)
Further Generalization

APSS application examples:
Research Motivation

Final Output of the Partitioning: groups G1, G2, G3, …, Gn connected by dissimilarity edges.

Should Pi compare with Pj, or Pj with Pi?
Partition-level symmetric detection
Eliminate entire partition I/O and communication
Partition-level Assignment
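One way to resolve this symmetric-comparison question is a circular assignment (a hypothetical sketch of the "Circular" scheme named in the evaluation, assuming an odd number of partitions; not necessarily the exact algorithm from the proposal):

```python
def circular_assignment(n):
    """Assign each unordered partition pair (i, j) to exactly one owner task.

    Partition i is responsible for comparing with the next (n - 1) // 2
    partitions in circular order; for odd n this covers every pair exactly
    once, so each comparison (and its I/O) happens in one direction only.
    """
    owners = {}
    for i in range(n):
        for k in range(1, (n - 1) // 2 + 1):
            j = (i + k) % n
            owners[frozenset((i, j))] = i
    return owners
```

Each partition owns the same number of pairs, so the comparison load starts out balanced before any size-based adjustment.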


E.g. Feature space = {a, b, c, d}
d1 = "a c d" → [1, 0, 1, 1] → normalized [0.57, 0, 0.57, 0.57]
d2 = "a b" → [1, 1, 0, 0] → normalized [0.70, 0.70, 0, 0]

Sim(d1, d2) = 0.57 × 0.70 ≈ 0.4

Cosine-based Similarity
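The slide's numbers can be reproduced directly (a quick Python check; 0.57 and 0.70 are the rounded values of 1/√3 and 1/√2):

```python
from math import sqrt

def normalize(v):
    """Scale a dense weight vector to unit length."""
    n = sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Feature space {a, b, c, d}: d1 = "a c d", d2 = "a b"
d1 = normalize([1, 0, 1, 1])              # [0.577, 0, 0.577, 0.577]
d2 = normalize([1, 1, 0, 0])              # [0.707, 0.707, 0, 0]
sim = sum(x * y for x, y in zip(d1, d2))  # only feature "a" overlaps
```

Rounded to one decimal, sim gives the 0.4 in the example.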

PSS with forward indexing is 10x faster, and with hybrid indexing 20x faster, than inverted indexing (Twitter).

PSS: Hadoop-based package
Pipeline: bag of words \n → Preprocessing (Cleaning, Hashing, Sequence, Sort) → Partition → PSS / PSS1 / PSS2 → di, dj, similarity score \n
Load Balancing: balance I/O traffic among tasks.
Avg. task time ≈ computation + memAccess

PSS1 Task:
  Read S and divide into many splits
  Read other vectors into B
  For each split Sx:
    Compare(Sx, B)
  Output similarity scores

Compare(Sx, B):
  for d_i in Sx:
    for d_j in B:
      if sim(d_i, d_j) + maxw(d_i) × sum(d_j) < τ then skip d_j
      else Sim(d_i, d_j) += w_i,t × w_j,t

PSS1 Task
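The Compare step with its threshold test can be made concrete (a runnable Python sketch; the real package is Java/Hadoop, and here maxw(di) is di's largest weight while "remaining" is the sum of dj's weights not yet accumulated):

```python
def compare(split, B, tau):
    """Compare vectors in a split of S against area B with early termination.

    Vectors are {feature: weight} dicts. Accumulation for a pair is abandoned
    as soon as the partial score plus an upper bound on the remainder
    (maxw(di) * sum of dj's unseen weights) cannot reach tau.
    """
    results = []
    for i, di in split.items():
        maxw_i = max(di.values())
        for j, dj in B.items():
            sim = 0.0
            remaining = sum(dj.values())
            for f, w in dj.items():
                remaining -= w
                if f in di:
                    sim += di[f] * w
                if sim + maxw_i * remaining < tau:
                    break  # even a best-case remainder cannot reach tau
            else:
                if sim >= tau:  # loop finished without pruning
                    results.append((i, j, sim))
    return results
```

The bound is sound because each of di's weights is at most maxw(di), so the unaccumulated contribution is at most maxw(di) times dj's remaining weight mass.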
PSS Task
Task steps:
  Read assigned partition into area S.
  Read some vectors vi from other partitions.
  Compare vi with S.
  Output similar vector pairs.
Memory areas: S, B, C.
Memory access analysis:
Temporal locality: reuse of an item.
Spatial locality: accessing items near previous accesses.
Cache-Conscious Optimizations
Illustration of O(n log n) partitioning with dissimilarity detection

Parallel APSS with Inverted Indexing
Map: emit partial results per feature; Reduce: sum the partial results, e.g., = sim(d2, d4).
PSS2 Example

Split Size?
Cache-conscious data splitting (PSS1): after splitting, S becomes splits S1, S2, …; each split Si gets its own accumulator C and is compared against B.

APSS Pipeline & Related Work
Data collecting → Preprocessing → Offline filtering → Comparisons
Preprocessing: remove stop words, stemming (60's), Tf/Idf ('72), n-grams, prefix vector info.
Offline filtering: Tf-cut (cut top frequent words, HLT'08), Df-cut (cut terms with small weights, SIGIR'09), Locality Sensitive Hashing (Gionis VLDB'99).
Comparisons: inverted index (Arasu VLDB'06), computation filtering (Bayardo WWW'07), parallelization with MapReduce (Lin SIGIR'09, Vernica SIGMOD'11), memory optimizations (Manegold VLDB'02, Gilbert ICCP'08).





PSS memory areas: S holds the inverted index of the assigned vectors, B holds other vectors, C holds the accumulator for S.
PSS area S is too big to fit in cache.


Exact vs. approximate.
Metric vs. non-metric.
Cosine, Jaccard, edit distance, …
All pairs vs. top-k.
Text, media, sequence.
Image search.
Collaborative filtering.
Spam and near-duplicate detection.
Clustering.
Query suggestions.
Web crawling logs.
Sim(d_i, d_j) = Σ_t w_i,t × w_j,t ≤ Σ_t w_i,t × maxw(d_j) = ||d_i||_1 × maxw(d_j)
d_i is dissimilar to d_j if: ||d_i||_1 < τ / maxw(d_j)
Generalize the dissimilarity condition to: ||d_i||_p < τ / ||d_j||_q, with 1/p + 1/q = 1
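The generalized condition is just Hölder's inequality rearranged; a small numeric check (Python, with made-up weight vectors):

```python
def p_norm(v, p):
    """The l_p norm of a weight vector."""
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

def dissimilar(di, dj, tau, p=2.0):
    """True if di and dj provably cannot reach similarity tau.

    By Hölder's inequality, Dot(di, dj) <= ||di||_p * ||dj||_q with
    1/p + 1/q = 1, so the pair can be skipped when ||di||_p < tau / ||dj||_q.
    """
    q = p / (p - 1.0)
    return p_norm(di, p) < tau / p_norm(dj, q)
```

With p = q = 2 this is the Cauchy-Schwarz special case; other p, q trade off the tightness of the bound against the cost of maintaining the norms.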
Memory hierarchy: CPU → L1 → … → L3 → Memory.
S = vectors owned,
B = other vectors,
C = temporary.
PSS with partitioning and hybrid indexing is 10x–20x faster than open-source codes.
Partitioning
Comparison
Load assignment
Communication overhead
Improvement Ratio of PSS1, PSS2 over PSS (up to 2.7x)

All techniques are scalable and can be generalized to other similarity measures.
Splitting the hosted partitions in PSS1 to fit into cache reduces slow memory data access by up to 60%.
Coalescing vectors with size-controlled inverted indexing improves the spatial locality of prefetched data and is 2.74x as fast as cache-oblivious code.
Modeling of task time is a good guide to optimize parameter setting.
Effectiveness of Partitioning & Circular Balancing
WSDM'13, WWW'12
Hybrid Indexing
Memory structure: a task computes, e.g., Sim(d1, d7) and Sim(d3, d7).
Forward index with no network bottleneck.

Inverted index filtering.

Hybrid indexing.

Hybrid Indexing

Ratio of PSS with hybrid indexing over PSS with forward indexing time
Use modeling to predict job time.
s : number of vectors in the assigned partitions.
p : average posting length in the assigned partition.


For Twitter, the predicted ratio was 20% vs. an actual 10%.
SIGIR'13
Impact of memory hierarchy on PSS execution time is not considered!
Same setting as before.
Evaluation
Objectives:
Performance improvement PSS2 over PSS1 over PSS.
Identify optimum values of s , b in PSS1, PSS2.
Predicted ratio between PSS hybrid and forward indexing:
Environment:
Objectives:
Hybrid indexing over forward and inverted index.
Effectiveness of Partitioning and Circular load balance.
Stopword removal + df-cut.
Static partitioning for dissimilarity detection (~3%).
Preprocessing:
Twitter (20M), Clueweb (50M), Enron Emails (500K), Yahoo! Music (1M), Google News (100K).
Datasets:
Memory access is 10x to 100x slower than L1 cache.
Software Package
Clueweb 70K: network bottleneck!
Weakness:
Objective of thesis work:
Size-based partitioning. WWW'12.
Circular load assignment: partition i compares with partition j for i < j.
Similarity Search Categories: