Building a Search Infrastructure with Apache Solr and SolrCloud

Apache Solr is the underlying technology that powers Mylife.com's people search capabilities. Apache Solr and SolrCloud will be introduced and their application to our current search infrastructure will be discussed.
by

SPENCER YUEN

on 2 October 2013

Transcript of Building a Search Infrastructure with Apache Solr and SolrCloud

Building a Search Infrastructure
with Apache Solr and SolrCloud

Agenda
Solr Replicas
Centralized management
Introduction to Solr
SolrCloud
When replica goes down
Leader stops sending updates
During recovery
Simple synch if differences are small
Get missing update commands from leader
Otherwise replicate whole index
Get full index from leader
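
The recovery choice above can be sketched as a small decision function. This is only an illustration: the function name and the threshold of 100 missed updates are assumptions made for the example, not values taken from the slides.

```python
def choose_recovery(missed_updates: int, max_sync_updates: int = 100) -> str:
    """Pick a recovery strategy for a replica that has come back up.

    max_sync_updates is a hypothetical threshold for "differences are small".
    """
    if missed_updates <= max_sync_updates:
        # Small gap: fetch only the missing update commands from the leader.
        return "sync"
    # Large gap: copy the full index from the leader.
    return "replicate"


print(choose_recovery(12))
print(choose_recovery(50000))
```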
Solr and Lucene
Built on Apache Lucene, a Java library for information retrieval
Manages an inverted index
Regular index
"what terms are in a document?"
Inverted index
"which documents contain a specific term?"
Challenges
If the master goes down, the shard cannot be updated
Document assignment performed internally by own hashing scheme
No centralized management of schema or configuration
Splitting shards requires reindexing
Searches against the cluster must use a lengthy 'shards' parameter
Mylife Architecture
What is Solr?
System built to search text
A platform to build search applications on
Customizable, open source software
SolrCloud
SolrCloud Indexing
No more master/slave. Just leaders/replicas
Leader automatically elected
If the leader goes down, a replica is automatically elected as the new leader
Fixes the problem of a shard that cannot be updated while its master is down
Shard selection for indexing
document ID hashed
no need to do own hashing
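
As a rough illustration of hash-based shard selection, the sketch below maps a document ID to a shard number. SolrCloud does this internally; the CRC32 hash and the modulo step here are stand-ins chosen for the example and are not Solr's actual routing algorithm. The function name and the sample ID are hypothetical.

```python
import zlib


def pick_shard(doc_id: str, num_shards: int) -> int:
    """Return the shard index for a document ID.

    Assumption: CRC32 plus modulo is a stand-in for Solr's own router.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


shard = pick_shard("person-12345", 4)
print(shard)
```

The same ID always hashes to the same shard, which is why a client no longer needs its own hashing scheme to decide where a document belongs.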
Solr Searching
Requests can be sent to any machine
No distinction between master and slave
Searching is near real-time
distributed indexing
soft commits to memory
No need for shard parameter in query or solrconfig.xml
ZooKeeper
Centralized configuration
Maintains schema and configuration
Provides distributed synchronization
Embedded
Run ZooKeeper as part of Solr application
If Solr app goes down, ZooKeeper also goes down
Ensemble
Run ZooKeeper as stand-alone instance on separate boxes
Multiple ZooKeeper instances run, so the service survives if any one instance crashes
Embedded versus Ensemble
SolrCloud Leader
When leader goes down
Only some replicas may have received updates
New leader chosen and synch processes run against other replicas
If replicas are too far out of synch, they ask for full replication/replay-based recovery
Solves challenge of failed updates when master is down
Why Solr?
Solr performs better for text search than relational DBs
Solr features specific to text search
highlighting, faceting, etc
Sample Inverted Index
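
A minimal sketch of an inverted index, assuming two made-up documents and a naive lowercase, whitespace-split analysis. It only illustrates the "which documents contain a specific term?" lookup; Lucene's real index structure is far more elaborate.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive analysis: lowercase, then split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


# Hypothetical sample documents.
docs = {
    1: "Solr is built on Lucene",
    2: "Lucene manages an inverted index",
}
index = build_inverted_index(docs)
print(sorted(index["lucene"]))  # [1, 2]
print(sorted(index["solr"]))    # [1]
```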
Solr Architecture
PRIZE
GIVEAWAY
!!!
Distributed capabilities of Solr
Set up fault-tolerant, highly available cluster of Solr servers
Capacity
Expanding
install Solr
Start Solr up with -DzkHost parameter
Register them with load balancer
Magic!!
Reducing
Shut down machine
Currently, each Solr server has its own schema.xml, solrconfig.xml
Updates require copying to each server
Can mean redeployment of entire cluster
SolrCloud allows uploading configuration to ZooKeeper
ZooKeeper sends updated configuration to entire cluster
Spencer Yuen, Mylife.com
SolrCloud
Simple Clients
Manage own load balancing
If server fails, try another replica in shard
Intelligent Clients
SolrJ
Connects to ZooKeeper to know which shards are up
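
The "simple client" behaviour can be sketched as a loop that tries each replica in turn. Everything here is hypothetical: the function names, the replica URLs, and the fake sender that stands in for a real HTTP call. An intelligent client such as SolrJ would instead consult ZooKeeper to learn which nodes are up.

```python
def query_with_failover(replicas, send):
    """Try each replica URL in order; return the first successful response."""
    last_error = None
    for url in replicas:
        try:
            return send(url)
        except ConnectionError as err:
            # This replica is unreachable; move on to the next one.
            last_error = err
    raise ConnectionError("all replicas failed") from last_error


def fake_send(url):
    """Stand-in for an HTTP request: the first replica is 'down'."""
    if url.endswith(":10001/solr"):
        raise ConnectionError("replica down")
    return "response from " + url


replicas = ["http://localhost:10001/solr", "http://localhost:10002/solr"]
print(query_with_failover(replicas, fake_send))
```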
Commit Strategies
When doc is sent to index
For replica, forward request
to own shard leader if shard is correct
to leader of another shard that is correct
For leader
forward doc to correct shard leader
index doc for itself and shard replicas
Transaction Log
Records updates
Allows replay of uncommitted updates if indexing is interrupted
Allows replicas to synch
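
A toy model of the transaction log idea, assuming updates are plain dictionaries held in memory. The class and method names are invented for this sketch and do not mirror Solr's implementation.

```python
class TransactionLog:
    """Records updates and reports which ones are not yet committed."""

    def __init__(self):
        self.entries = []    # every update, in arrival order
        self.committed = 0   # number of entries covered by the last commit

    def record(self, update):
        self.entries.append(update)

    def commit(self):
        self.committed = len(self.entries)

    def uncommitted(self):
        # These updates would be replayed if indexing were interrupted.
        return self.entries[self.committed:]


log = TransactionLog()
log.record({"id": "1", "name": "Mikayla"})
log.commit()
log.record({"id": "2", "name": "Swanson"})
print(log.uncommitted())  # [{'id': '2', 'name': 'Swanson'}]
```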
Mylife Pipeline
Searching in Solr
http://localhost:10018/solr/select/?
fq=(source:cadillac+OR+source:reunion)
&
defType=dismax
&rows=10&indent=true&qid=0e5e4a19ee&
shards=localhost:10001/solr,localhost:10002/solr,localhost:10003/solr,localhost:10004/solr,localhost:10005/solr,localhost:10006/solr,localhost:10007/solr,localhost:10008/solr,localhost:10009/solr,localhost:10010/solr,localhost:10011/solr,localhost:10012/solr,localhost:10013/solr,localhost:10014/solr,localhost:10015/solr,localhost:10016/solr,localhost:10017/solr,localhost:10018/solr,localhost:10019/solr,localhost:10020/solr&start=0
&
wt=json
&
bq=has_profile_image:true^0.2
&
q.alt=(((family_name:(Swanson)+OR+maiden_name:(Swanson)^0.5)+AND++(given_name:(Mikayla)+OR+given_name_exact:(Mikayla)))+OR+(name:%22Mikayla+Swanson%22^3.0+OR+name:%22Mikayla+Swanson%22~2)+OR+((+name:Mikayla)+AND+(+name:Swanson)))
Mylife Solr Query
fq
Filter queries limit responses to main query
defType
query parser that processes user input and can handle errors, e.g. Lucene, dismax, edismax
shards
request distributed across all shards in the list. We'll revisit for SolrCloud.
wt
Response writers format output, including XML, JSON, etc
bq
Boost particular field when determining which results go to top
q or q.alt
actual main query
A Few Parameters
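
The parameters above can be assembled into a request URL with Python's standard library. This is a sketch only: the q.alt value is shortened from the full Mylife query, and the shards parameter is left out for brevity.

```python
from urllib.parse import urlencode

params = {
    "q.alt": 'name:"Mikayla Swanson"^3.0',         # main query (shortened)
    "defType": "dismax",                           # query parser
    "fq": "(source:cadillac OR source:reunion)",   # filter query
    "bq": "has_profile_image:true^0.2",            # boost query
    "rows": 10,
    "start": 0,
    "wt": "json",                                  # response writer
}

# urlencode percent-encodes quotes, colons, and spaces for us.
url = "http://localhost:10018/solr/select/?" + urlencode(params)
print(url)
```

Building the query string from a dictionary avoids hand-escaping characters such as the quotes and spaces in the name phrase.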
Request Handling
http://<host>:8983/solr/<core>/<request-handler>
core
index with its configurations. What is actually being indexed and searched on
"/facet"
request handler
plugin to Solr that processes incoming request in a particular way
"/select"
Solr Schema
schema.xml
fields - what you are searching/indexing in your document
"name", "dob", "location"
a field can be "indexed" (searched on) and/or "stored" (displayed as result)
fields are denormalized (flat structure)
field types - data type of field
"string", "int", "date", "boolean"
SolrConfig
Contains Solr configurations
Custom request handlers
browse, admin, data import
Supporting library paths
Data directory
Frequency of commit
cache management configurations
...and more
Sample Solr Schema
<fields>
<field name="name" type="string" indexed="true" stored="true"/>
</fields>
<types>
<fieldType name="string" class="solr.StrField"/>
</types>
Indexing
Update Handler
Handles update request
Processes commit to disk
refreshes searcher or snapshot view of index
Solr uses unique ID
marks old version as deleted
adds new version of document
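
The update-by-unique-ID behaviour can be pictured with a toy append-only index. The class below is an assumption-laden sketch (in-memory list, an "id" key as the unique ID), not how Solr or Lucene store documents.

```python
class TinyIndex:
    """Append-only store: an update marks the old version deleted, then adds the new one."""

    def __init__(self):
        self.entries = []   # each entry: {"doc": ..., "deleted": bool}
        self.live = {}      # unique ID -> position of the live version

    def add(self, doc):
        old_pos = self.live.get(doc["id"])
        if old_pos is not None:
            # Mark the old version as deleted rather than rewriting it.
            self.entries[old_pos]["deleted"] = True
        self.entries.append({"doc": doc, "deleted": False})
        self.live[doc["id"]] = len(self.entries) - 1

    def live_docs(self):
        return [e["doc"] for e in self.entries if not e["deleted"]]


idx = TinyIndex()
idx.add({"id": "1", "location": "Los Angeles"})
idx.add({"id": "1", "location": "San Diego"})
print(idx.live_docs())  # [{'id': '1', 'location': 'San Diego'}]
```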
Analyzer
processes the text of each field, applying transformations that make it easily searchable
character filter, e.g. ISO Latin
I LOVE This café -> I LOVE This cafe
tokenizer, e.g. whitespace
I LOVE This cafe
token filters, e.g. lowercase
i love this cafe
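
The three analysis stages can be imitated in a few lines of Python. This is a sketch of the idea only: the function names are invented, and Unicode decomposition is used as a stand-in for Solr's character filter.

```python
import unicodedata


def char_filter(text):
    """Strip accents: decompose characters, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Whitespace tokenizer."""
    return text.split()


def lowercase_filter(tokens):
    """Token filter that lowercases every token."""
    return [token.lower() for token in tokens]


def analyze(text):
    return lowercase_filter(tokenize(char_filter(text)))


print(analyze("I LOVE This café"))  # ['i', 'love', 'this', 'cafe']
```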
Summary
Solr is a powerful, simple, easily configurable platform for search applications
SolrCloud supports fault tolerance through ZooKeeper
Prize Question
What is the name of an intelligent Solr client?
Solr Feature: Faceted Search
technique to access information according to a classification system by multiple dimensions
Sample faceted search
Faceting in Solr
Faceting supported out-of-the-box in Solr
Use facet query parameters
Can facet on
field values
ranges
dates
Facet Parameters
Specify faceted fields
facet.field
Specify maximum number of values returned per field
facet.limit
Specify sort order, by count or alphabetically
facet.sort
Others
facet.offset for paging
facet.mincount
facet.prefix
Sample Mylife query
http://localhost:8983/solr/facet/mylife?indent=true&wt=json&q.alt=*:*&facet=true&facet.field=job_title&facet.field=date_of_birth&facet.field=source&facet.field=location&facet.field=gender&facet.field=company_name&facet.limit=10
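
A sketch of how the sample facet query could be assembled in Python. Because facet.field repeats, the parameters are kept as a list of pairs rather than a dictionary. Note that urlencode percent-encodes the *:* value, so the printed URL is equivalent to, but not character-for-character identical with, the one on the slide.

```python
from urllib.parse import urlencode

facet_fields = [
    "job_title", "date_of_birth", "source",
    "location", "gender", "company_name",
]

params = [
    ("indent", "true"),
    ("wt", "json"),
    ("q.alt", "*:*"),
    ("facet", "true"),
]
# One facet.field parameter per field to facet on.
params += [("facet.field", field) for field in facet_fields]
params.append(("facet.limit", 10))

url = "http://localhost:8983/solr/facet/mylife?" + urlencode(params)
print(url)
```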
Shard Splitting
Initial collection
Need to select number of shards
May have chosen wrong number
Re-indexing data required to reallocate shards
Shard splitting
Solr 4.3 feature
Pre-existing shards can be split without reindexing
http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1
Solr Cloud Admin