Equipping Solr with Semantic Search and Recommendation

Make sure to check out my related blog posts: http://www.opensourceconnections.com/2013/08/25/semantic-search-with-solr-and-python-numpy/ and also http://www.opensourceconnections.com/2013/10/05/search-aware-product-recommendation-in-solr/
by

John Berryman

on 3 February 2015


Transcript of Equipping Solr with Semantic Search and Recommendation



Python + Solr => Semantic Search and Recommendation
Next Steps
The Problem
How does the Solr search engine work?
The Problem
What's this search engine missing?
The Problem
What about recommenders?
The Problem
Goal: Equip Solr with Semantic Search
Mathy Bits
What does it mean to "blur" meaning?
Mathy Bits
Geometric Interpretation of Rank Reduction
Mathy Bits
Various approaches for Rank Reduction
Singular Value Decomposition
Alternating Least Squares
Stochastic SVD
Mathy Bits
Goal: Equip Solr with Semantic Search
Ideal Approach
Demonstration
Query-Aware Recommendations
Consider an index that holds movies.
You are logging customer movie ratings.
Create a customer-movie matrix.
Blur it! Stick it in Solr as the field UserRecs.
Recommendation:
Next Steps
Where do we go from here?
Incorporate "strength" of the blurred terms into Solr via payloads.
Scale for large data sets.
Luigi - Data Processing Framework
Stream and Batch Processing
Metaphor of faucets, pipes, filters, tanks and drains
Same code for Big Data or Little Data
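As a rough illustration of the faucet/pipe/tank metaphor, here is a minimal Luigi sketch; the task and file names are made up for this example and are not from the talk:

import luigi

class ExtractTermVectors(luigi.Task):
    # "faucet": pull raw term vectors out of the index into a local file
    def output(self):
        return luigi.LocalTarget("term_vectors.txt")

    def run(self):
        with self.output().open("w") as out:
            out.write("...")  # placeholder for the real extraction step

class BlurTerms(luigi.Task):
    # "pipe/filter": depends on the extract step and writes the blurred output
    def requires(self):
        return ExtractTermVectors()

    def output(self):
        return luigi.LocalTarget("blurred_terms.txt")

    def run(self):
        with self.input().open() as raw, self.output().open("w") as out:
            out.write(raw.read())  # placeholder for the real "blur" step

if __name__ == "__main__":
    luigi.build([BlurTerms()], local_scheduler=True)

Because each task declares its inputs and outputs, the same pipeline definition runs unchanged on little local files or on big batch data.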
Productize it!
The Problem
Approach
Demonstration
Mathy Bits
Implementation
@JnBrymn
?
Questions?
"Brown foxes jump."
"The quick, quick fox runs."
"Brown dogs are running."
The documents above are Doc1, Doc2, and Doc3.

Inverted Index (term vectors over Doc1 Doc2 Doc3):
brown: [1 0 1]
dog:   [0 0 1]
fox:   [1 1 0]
jump:  [1 0 0]
run:   [0 1 1]
quick: [0 2 0]

Boolean Query: q=brown AND fox
[1 0 1] AND [1 1 0] = [1 0 0]  =>  Doc1
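A toy Python sketch of the same idea, using the slide's stemmed terms and posting vectors (this is just an illustration of boolean matching, not how Lucene actually stores its index):

# toy inverted index: term -> [term frequency in Doc1, Doc2, Doc3]
doc_ids = ["Doc1", "Doc2", "Doc3"]
index = {
    "brown": [1, 0, 1],
    "dog":   [0, 0, 1],
    "fox":   [1, 1, 0],
    "jump":  [1, 0, 0],
    "run":   [0, 1, 1],
    "quick": [0, 2, 0],
}

def boolean_and(*terms):
    # a doc matches only if every queried term occurs in it
    return [doc_ids[j] for j in range(len(doc_ids))
            if all(index[t][j] > 0 for t in terms)]

print(boolean_and("brown", "fox"))  # ['Doc1']  ([1 0 1] AND [1 1 0] = [1 0 0])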
"Yellow banana peels."
"A banana is a long yellow fruit."
"This mystery fruit is long and yellow and has a peel."
Consider these Documents (Doc1, Doc2, Doc3 above).
What if you search for "banana"?
You can only match on tokens.
You can't match on meaning.
Term Document Matrix (columns: Doc1 Doc2 Doc3)
brown: 1 0 1
dog:   0 0 1
fox:   1 1 0
jump:  1 0 0
run:   0 1 1
quick: 0 2 0
Video User Matrix (columns: User1 User2 User3)
Texas Chainsaw Massacre:
Night of the Living Dead:
Toy Story:
Finding Nemo:
American Pie:
Mean Girls:
Semantic Search = Recommendations
And we have the same problem!
Pull the terms out of Solr
"Blur" the meaning of the terms
Shove it all back into Solr
Make it searchable
Singular Value Decomposition:
[movie x user] = [movie x genre] x [genre strength] x [genre x user]

movie x user matrix:
5 0 2
0 0 4
0 3 0
4 1 0
0 1 2
Doc-term matrix (rows: Doc1, Doc2, Doc3; columns: brown dog fox jump run quick):
1 0 1 1 0 0
0 0 1 0 1 2
1 1 0 0 1 0

Term-document matrix (columns: Doc1 Doc2 Doc3):
brown: 1 0 1
dog:   0 0 1
fox:   1 1 0
jump:  1 0 0
run:   0 1 1
quick: 0 2 0

After "blurring" (rank reduction):
brown: 4 0 4
dog:   0 0 5
fox:   1 4 0
jump:  0 4 0
run:   2 5 4
quick: 3 0 5

value for (movie, user) = sum over i = 0 .. num_genres of
  (value of movie in genre_i) x (overall strength of genre_i) x (value of genre_i for user)

[Slide: table of per-genre values for the example movie "Scary Doll Movie" across the genres Action, Romance, Horror, ...]
Numerical example (movie x genre, genre strength, genre x user, and a movie x user matrix):

movie x genre:
-0.471  0.681  0.168
-0.168  0.471 -0.681
-0.378 -0.378 -0.378
-0.681 -0.168  0.471
-0.378 -0.378 -0.378

genre strength:
5.4 0.0 0.0
0.0 4.3 0.0
0.0 0.0 0.2

genre x user:
-0.591 -0.737 -0.328
 0.328 -0.591  0.737
 0.737 -0.328 -0.591

movie x user:
4.5 0.3 1.6
3.8 0.7 4.2
0.3 2.7 0.2
4.2 0.9 0.1
0.4 1.3 1.7
movie x user matrix:
5 0 2
0 0 4
0 3 0
4 1 0
0 1 2

Rank Reduction:
1.3 5.3
0.1 2.5
2.3 2.5
4.2 1.1
0.6 1.3
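For concreteness, a small numpy sketch of the same rank reduction on the slide's 5x3 movie-user matrix; this is a toy illustration, not the Mahout pipeline used later:

import numpy as np

# movie-user ratings matrix from the slide (5 movies x 3 users)
A = np.array([
    [5, 0, 2],
    [0, 0, 4],
    [0, 3, 0],
    [4, 1, 0],
    [0, 1, 2],
], dtype=float)

# A = U * diag(s) * Vt  (movie-genre, genre strengths, genre-user)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# keep only the k strongest "genres" and rebuild the matrix
k = 2
A_blurred = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(A_blurred, 1))  # zeros get "blurred" into small nonzero scores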
Easy(ish) to understand
winner of Netflix challenge
the one we went with
Extract term-document matrix
Rank-reduce this matrix
Extract the larger values and map them back to terms
Shove it all back into Solr
Make it searchable.
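Continuing the toy numpy sketch, one plausible way to do the middle steps above: threshold the rank-reduced term-document matrix and map the surviving values back to terms per document (the 0.3 cutoff is an arbitrary choice for this example, not a value from the talk):

import numpy as np

terms = ["brown", "dog", "fox", "jump", "run", "quick"]
doc_ids = ["Doc1", "Doc2", "Doc3"]

# term-document matrix from the earlier slide (rows = terms, columns = docs)
TD = np.array([
    [1, 0, 1],
    [0, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 2, 0],
], dtype=float)

U, s, Vt = np.linalg.svd(TD, full_matrices=False)
k = 2
blurred = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# keep the larger values and map them back to terms, one "blurred" bag per doc
threshold = 0.3
for j, doc_id in enumerate(doc_ids):
    bag = [terms[i] for i in range(len(terms)) if blurred[i, j] > threshold]
    print(doc_id, bag)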
[Pipeline diagram: docs -> Index (Solr 4.0) -> Extract -> Rank Reduce (svd, and then matrixmult) -> Augment -> Use]
Cruel Reality
Overview
[Pipeline diagram, cruel reality: docs -> Index -> Extract (Solr 3.6) -> Rank Reduce -> Augment -> Index (Solr 4.0) -> Use? ... indexing twice (x2!)]
q=UserRecs:user13
Better yet, query-aware recommendation:
q=scary dolls & bq=UserRecs:user13
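A sketch of issuing those two queries from Python with pysolr; the Solr URL, core name, and the edismax handler are assumptions for this example:

import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/collection1")

# plain recommendation: everything user13 is predicted to like
recs = solr.search("UserRecs:user13")

# query-aware recommendation: text query boosted by user13's predicted tastes
query_aware = solr.search("scary dolls",
                          **{"defType": "edismax", "bq": "UserRecs:user13"})

for doc in query_aware:
    print(doc)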
Introductions
@JnBrymn
Demonstration
Testing the Results
Example Solr: SciFi Stack Exchange
Body field contains text of questions and answers
BodyBlurred field contains rank reduced version of Body field.
q=+Body:"vader" -BodyBlurred:"vader"
q=+BodyBlurred:"vader" -Body:"vader"
Cruel Reality
Overview
[Pipeline diagram repeated: docs -> Extract (Solr 3.6) -> Rank Reduce -> Augment -> Index (Solr 4.0) -> Use?]
Mahout's lucene.vector only supports Solr 3.6
Atomic Updates only in Solr 4.0
Mahout uses integers instead of docIds and terms.
Mahout provides a dictionary from integer to term.
We had to create dictionary from integer back to docId.
* we're Mahout noobs
Matrix Decomposition and Rank Reduction
svd: only returns left matrix
parallelALS: demo works great - algorithm takes input formatted differently from ours
ssvd: did what we expected - very fast
$MAHOUT lucene.vector --field Body --dir $SOLR_INDEX_PATH \
  --dictOut term_dictionary.txt --output sequence.file --idField Id
$MAHOUT ssvd --input sequence.file --output ssvd \
--vHalfSigma true --uHalfSigma true --rank 200 \
--reduceTasks 1 --tempDir whatever
Matrix Multiplication
$MAHOUT transpose --input ssvd/V --tempDir /tmp/foo \
--numRows 35812 --numCols 5

$HADOOP dfs -mv ssvd/transpose* ssvd/V_trans
Transpose V

Mahout writes transpose to wherever. Put it somewhere reasonable.
"ssvd" contains left and right matrices U and V
GOAL: multiply(U,transpose(V))
#THRASHING VEHEMENTLY
#ANGRY DRUNKEN CODING
#DESPAIR
U x transpose(V) doesn't work as expected.
Lots of confusion!


$HADOOP CustomLong2IntKeyConverter ssvd/U ssvd/U_int
#Discovered that the U matrix is keyed by LongWritable but matrixmult expects IntWritable.
Convert key of U to int from long (V already is?!)
#MORE THRASHING



#STARTED OVER
#USED ONE REDUCER SO THAT ALL OUTPUTS WOULD HAVE A SINGLE PARTITION
Multiplication still doesn't work right!


Turns out multiplication needs the same partitioning for both matrices... ok
#ARG ! $#@#!%! (John and Doug exclaimed)



#CREATED EXAMPLE 2x3 MATRICES
#ACTUALLY WORKED!
#(TOOK NOTE OF THE IRONY)
Still freakin' confused



Started experimenting with tiny matrices
$MAHOUT transpose --input ssvd/U_int \
--numRows 18278 --numCols 5 \
--tempDir /tmp/foo


$HADOOP dfs -mv ssvd/transpose* ssvd/U_int_trans
Turns out we needed to transpose the U matrix too!

Moved transposed matrix from random place
$MAHOUT matrixmult \
--numRowsA 18278 --numColsA 5 \
--numRowsB 35812 --numColsB 5 \
--inputPathA ssvd/U_int_trans \
--inputPathB ssvd/V_trans


$HADOOP dfs -mv \
/user/hadoop/productWith* ssvd/UVtrans
Finally multiplied
the matrices!




...and then moved the matrix from the random place it had been written.
Used custom code to convert term vectors to Solr docs.
Updated Solr via Solr 4's Atomic Updates
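For reference, Solr 4 atomic updates are just "set" commands in the update JSON; a minimal sketch of what that re-indexing step amounts to (URL and field values are placeholders, and the real code converted whole term vectors rather than one hand-written doc):

import json
import requests

solr_update_url = "http://localhost:8983/solr/collection1/update?commit=true"

# atomic update: only the BodyBlurred field is touched, the rest of the doc is left alone
docs = [
    {"Id": "1234", "BodyBlurred": {"set": "vader luke emperor darth ..."}},
]

resp = requests.post(solr_update_url,
                     data=json.dumps(docs),
                     headers={"Content-Type": "application/json"})
resp.raise_for_status()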
Ideas:
In the math section, demonstrate getting bag-of-terms docs out of the matrix. Do this after the geometric interpretation.
Pythonic Approach
Overview
[Pipeline diagram: docs -> Index (Solr 4.0) -> Extract -> Rank Reduce -> Augment -> Use]
Benefit: Quick to prototype and test.
Drawback: Will not scale ... yet.
John's Downward Spiral (over TIME):
Aerospace Engineer -> General Software Contractor -> Search Technology Consultant -> Data Scientist -> Data Sadist at VividCortex
Eventually Nashville?
Ideal Approach
Overview
[Pipeline diagram: docs -> Index (Solr 4.0) -> Extract -> Rank Reduce -> Augment -> Use]
Use Mahout's lucene.vector
Get rank-reduced SVD matrices via Mahout's svd
Multiply together via Mahout's matrixmult
Convert numerical matrices back into Solr documents (custom code)
Re-index documents via Solr 4 atomic updates
Required custom code to extract doc IDs
Non-existent documentation
Misleading naming
Inconsistent code
Finally built our own MapReduce job
Didn't really get here.
Demo 1: Semantic Sci-fi Search
Stack Exchange Sci-fi Q&A
18.5K Posts, 35.5K Unique Terms
Goal: Extract terms from the Body field, "blur" them, and insert them into a "BodyBlurred" field
Blog: http://bit.ly/14TqJOM
Body:
You're correct, Enterprise is the only Star Trek that fits into both the original and the new 2009 movie timelines. From the perspective of the Enterprise characters, both are possible futures, given the over-arcing conceit of the show was a Temporal Cold War, so its future is in flux and could line up with either of the timelines we're familiar with, or with an entirely different future.
BodyBlurred:
answer charact place klingon star trek design travel crew watch work movi happen enterpris featur futur exist origin 2009 chang altern timelin war to version event captain gener pictur tng creat iii galaxi theori return alter voyag entir fry turn kirk paradox biff doc marti feder 1955 starship 2015 class hero centuri tempor uss phoenix mirror river 800 ncc 1701 simon conner skynet alisha

Terms related to "vader":
vader luke emperor darth palpatin anakin sith skywalk sidiou apprentic empir luca side star son forc turn kill death rule suit father question jedi command obi tarkin dark wan plan
Terms related to "potter":
harri potter voldemort wizard snape death magic jame love spell time rowl lili eater travel seri hous hand hogwart three find wormtail kill slytherin hallow secret deathli muggl order lord
Terms related to "dark":
dark side jedi sith eater lord death mark snape magic curs evil forc luke mercuri cave yoda jame palpatin dagobah anakin black call wizard slytherin live light siriu matter voldemort
Demo 2: Grocery Search and Recommendations
25.9K Items; 14.5K Customers
Record of 100K Purchases
Goal:
Build a matrix of customer to purchased item.
"Blur" the matrix into a matrix of customer to items they'll likely purchase.
For every product in the index, add a new field of customers likely to purchase it.
Blog: http://bit.ly/19qhdcL
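A toy sketch of that goal with made-up data (the real demo used 25.9K items and 100K purchases): build the customer-item purchase matrix, blur it, and attach the likely purchasers to each product:

import numpy as np

customers = ["dave", "wendy", "sam"]
items = ["banana", "potato", "steak", "tofu"]

# purchases[c, i] = how many times customer c bought item i (toy data)
purchases = np.array([
    [4, 0, 1, 0],   # dave
    [0, 3, 0, 2],   # wendy
    [1, 0, 4, 0],   # sam
], dtype=float)

U, s, Vt = np.linalg.svd(purchases, full_matrices=False)
k = 2
blurred = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# for every product, the new field: customers likely to purchase it
for i, item in enumerate(items):
    likely = [customers[c] for c in range(len(customers)) if blurred[c, i] > 0.5]
    print(item, "->", likely)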
Demo queries (banana, potato, steak) compared across text-based search, dave-aware search, and wendy-aware search.
Rate this talk! http://spkr8.com/t/28981