Solr/Lucene Intro

by

John Berryman

on 24 October 2013


Transcript of Solr/Lucene Intro

Let me introduce myself.
Aerospace engineer turned software consultant
Problem Statement:
You have documents and lots of them!
Attempt #1: SQL `LIKE`
Very slow (basically greps entire DB)
Poor recall (character-based matches)
Sorted by something - but not relevancy
SELECT * FROM books
WHERE title LIKE '%your_search%';
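The recall problem with `LIKE` is easy to reproduce. The sketch below is an illustration only: it uses Python's built-in sqlite3 module and a made-up `books` table with hypothetical titles, none of which come from the original slides.

```python
import sqlite3

# In-memory database with a hypothetical books table (titles are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT)")
conn.executemany(
    "INSERT INTO books VALUES (?)",
    [
        ("Repairing Your Television",),
        ("The TV Repair Handbook",),
        ("Cats and Dogs",),
        ("How to Concatenate Strings",),
    ],
)


def like_search(term):
    """Run the slide's query: a substring match against every title."""
    rows = conn.execute(
        "SELECT title FROM books WHERE title LIKE ? ORDER BY rowid",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]


print(like_search("tv"))       # misses the "Television" book (no synonyms)
print(like_search("repairs"))  # finds nothing (no stemming)
print(like_search("cat"))      # also matches "Concatenate" (character-based)
```

Because the match is purely character-based, there is also no notion of which hit is more relevant; the rows simply come back in table order.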
The Solution: Lucene/Solr
Understands text
Synonyms:
TV = television
Stop words:
the, a, is, an
Stemming:
quicker = quickly = quick
Relevancy Sorting:
TF*IDF
Foreign languages:

The Java "toolbox" of full-text search
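To make the TF*IDF relevancy sorting listed above a little more concrete, here is a minimal sketch of the textbook idea: a term counts for more the more often it appears in a document (TF) and the rarer it is across the collection (IDF). This is not Lucene's exact scoring formula, and the toy corpus and function names are assumptions made for illustration.

```python
import math

# Hypothetical, already-analyzed toy corpus: one token list per document.
CORPUS = [
    ["dog", "dog", "cat"],
    ["dog", "bird"],
    ["dog", "bird", "bird"],
]


def idf(term, corpus):
    """Inverse document frequency: rarer terms get a larger weight."""
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        return 0.0
    return math.log(len(corpus) / df) + 1.0


def score(query_terms, doc, corpus):
    """Sum of TF*IDF over the query terms for a single document."""
    return sum(doc.count(term) * idf(term, corpus) for term in query_terms)


def rank(query_terms, corpus):
    """Indices of the matching documents, highest score first."""
    scored = [(score(query_terms, doc, corpus), i) for i, doc in enumerate(corpus)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for s, i in scored if s > 0]


print(rank(["bird"], CORPUS))  # the document with two "bird"s ranks first
```

In this corpus "dog" appears in every document, so it carries less weight than "cat", which appears in only one.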
"Great! So let's go build a Search Engine in Lucene"
Well, you can... but...
Access to all the features of Lucene
Distributed, scalable, fault tolerant, available, near real-time search
Works with ZooKeeper
Easy(-ish) to set up
Questions?
Built for the purpose of full-text search
First things first:
Lucene
Example Code
How do you solve this problem?
Super fast (thanks to inverted index)
Easy to customize and extend
... and plenty of other goodies for free
snippeting | highlighting | facets and filters | more-like-this | did-you-mean | suggest-as-you-type | grouping | field collapsing | boosting | statistics | geo-search
Encodes the concepts:
Second things second:
Solr
Search Engines get complex fast
Steep learning curve
Hard (a.k.a. expensive) to find Lucene experts.
It's a wheel that's been reinvented enough already.
Configuration capability
HTTP access
Fault tolerance & availability
"Wouldn't it be great if there was just a standardized way of doing this?"
"Best-practices implementation of a Lucene search engine wrapped inside a web server."
UI Example
Solr Behavior Configuration
Highly customizable
Configuration files
Rich and extensible API
It's open source!
Added features
HTTP access via XML/JSON/etc.
Result snippeting and highlighting
Faceting and filtering
Query caching (makes Solr fast)
Replication and sharding
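Since Solr is reached over plain HTTP, a search is just a URL. The sketch below only builds such a URL and sends nothing over the network. `q`, `fq`, `rows`, `wt`, `facet` and `facet.field` are standard Solr query parameters; the host and port, the `collection1` core and the `price` and `brand` fields are assumptions made for illustration.

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, filters=(), facet_fields=(), rows=10, fmt="json"):
    """Build a URL for Solr's /select request handler."""
    params = [("q", query), ("rows", rows), ("wt", fmt)]
    # Each filter query narrows the result set without affecting scoring.
    params += [("fq", f) for f in filters]
    if facet_fields:
        params.append(("facet", "true"))
        params += [("facet.field", field) for field in facet_fields]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical core and field names.
url = build_select_url(
    "http://localhost:8983/solr/collection1",
    "dress shoes",
    filters=["price:[50 TO 100]"],
    facet_fields=["brand"],
)
print(url)
```

Changing `fmt` to `xml` or `csv` asks Solr for the same results in a different response format.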
search box
Result listing
facets
filters
"soft sorting" via boosts
more-like-this
geo search
xml
json
javabin
csv
ruby
xslt
etc.
SolrCloud
Clients ranging from Zappos to the US Patent Office
Search|Discovery|Analytics across large data sets
I work with OpenSource Connections
You must find the documents by keyword search (and maybe filter by the metadata)
The documents contain unstructured text (and maybe some metadata)
Let's see you do that with SQL `LIKE`
This just in...
Other awesome Solr capabilities
Range queries.
Highly position aware queries:
"Find 'red' next to 'green' in a sentence that contains 'blue'"
Suggest-as-you-Type
Did-you-mean
More like this queries
Recommendation engines
Rich hierarchical tagging:
(ask about my taxonomy extraction and tagging algorithm)
When to use Lucene rather than Solr
When you're not doing a typical search app.
The abstractions made by Solr get in the way.
You don't need an HTTP interface - e.g. desktop app
Common Solr anti-patterns?
No text... in a text search engine.
Joining every table in sight.
Building a single core to answer every question imaginable.
Holy configuration relics.
The US Patent and Trademark Office
Zappos
Instagram
Etsy
CareerBuilder
eHarmony
Netflix
eBay
Smithsonian Museums
Solr Adoption
Relevancy:
"Dress Shoes"
Scaling
Indexing
Sorting vs. Boosting
Consider a search for "dress shoes". Sort by relevancy and the results look fine. Sort by newness and ?
Common Issues
Companies who use Solr
range
facets
Attempt #2: Full-text DB
Oracle, SQL Server, MySQL, PostgreSQL
Often expensive
Often lack control
Lack many features that you may want
If scaling was a problem, it's just going to get worse
@JnBrymn
John Berryman
... and normal sorting
DOC1: "Do dogs have dots?"
DOC2: "Do you have a dog!"
DOC3: "Dotted lines"
dog:   [110]
dot:   [101]
lines: [001]
dog AND dot
[110]&[101]=[100]
return DOC1
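The three-document example above can be reproduced in a few lines. This is a rough sketch of the idea, not how Lucene stores its index: each term maps to one bit per document, and an AND query is a bitwise AND of those rows. The tiny `STEMS` table is a hand-written stand-in for a real stemmer, and all function names are made up for illustration.

```python
import re

DOCS = ["Do dogs have dots?", "Do you have a dog!", "Dotted lines"]

# Hand-written stand-in for a real stemmer (assumption for this sketch).
STEMS = {"dogs": "dog", "dots": "dot", "dotted": "dot"}


def analyze(text):
    """Lowercase, split on non-letters, and apply the toy stems."""
    return [STEMS.get(tok, tok) for tok in re.findall(r"[a-z]+", text.lower())]


def build_index(docs):
    """Map each term to a bit list with one slot per document."""
    index = {}
    for doc_id, doc in enumerate(docs):
        for term in analyze(doc):
            index.setdefault(term, [0] * len(docs))[doc_id] = 1
    return index


def search_and(index, num_docs, *terms):
    """AND the bit lists of all query terms together."""
    bits = [1] * num_docs
    for term in terms:
        posting = index.get(term, [0] * num_docs)
        bits = [a & b for a, b in zip(bits, posting)]
    return bits


INDEX = build_index(DOCS)
hits = search_and(INDEX, len(DOCS), "dog", "dot")
print(INDEX["dog"], INDEX["dot"], INDEX["lines"])
print([f"DOC{i + 1}" for i, bit in enumerate(hits) if bit])
```

The lookup cost depends on the number of query terms rather than on scanning every document, which is where the speed of an inverted index comes from.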
index
query
search

document
field
analysis

Inverted Index
Defined
Schema Configuration
Sickel's observation: Deleted code is debugged code.
Sickel's | observation | Deleted | code | is | debugged | code
Sickel's | observation | Deleted | code || debugged | code
sickel's | observation | deleted | code || debugged | code
sickel's | saying | deleted | code || debugged | code
sickel | saying | deleted | code || debugged | code
sickel | say | delet | cod || debug | cod
Sickel's observation: Deleted code is debugged code.
Sickel's | observation: | Deleted | code | is | debugged | code.
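The token streams above are what an analysis chain produces step by step: tokenize, drop stop words, lowercase, map synonyms, strip possessives, stem. Here is a minimal sketch of such a chain. In Solr these steps are declared in the schema rather than hand-coded, and `toy_stem` is a crude stand-in for a real stemmer, written only so that it reproduces this one example. Unlike the slides, the sketch drops stop words outright instead of leaving an empty position.

```python
import re

STOP_WORDS = {"the", "a", "is", "an"}
SYNONYMS = {"observation": "saying"}  # the one mapping the example implies


def toy_stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            stem = token[: -len(suffix)]
            if stem[-1] == stem[-2]:  # debugg -> debug
                stem = stem[:-1]
            return stem
    if token.endswith("e") and len(token) > 3:
        return token[:-1]
    return token


def analyze(text):
    tokens = re.findall(r"[A-Za-z']+", text)                      # tokenize
    tokens = [t for t in tokens if t.lower() not in STOP_WORDS]   # stop words
    tokens = [t.lower() for t in tokens]                          # lowercase
    tokens = [SYNONYMS.get(t, t) for t in tokens]                 # synonyms
    tokens = [t[:-2] if t.endswith("'s") else t for t in tokens]  # possessives
    return [toy_stem(t) for t in tokens]                          # stemming


print(analyze("Sickel's observation: Deleted code is debugged code."))
```

The same chain has to run on the query text as well, otherwise a search for "debugging" would never meet the indexed term "debug".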
create your own Java class
Life is fun.
java -Dbootstrap_confdir=./solr/collection1/conf \
     -Dcollection.configName=myconf \
     -DzkRun \
     -DnumShards=2 \
     -jar start.jar
# on 3 other machines
java -DzkHost=zookeepbox:9983 \
     -jar start.jar