Sphinx Search Basic Overview

Touring Sphinx Documentation.
by

Steve Barker

on 25 November 2013


Transcript of Sphinx Search Basic Overview

Searchd
Indexer
real time
disk
indexes
Data Sources
Non-SQL storage indexing.
Data can also be streamed to
batch indexer in TSV,
in a simple XML format called
XMLpipe, or inserted directly into an
incremental RT index.
SQL database indexing.
Sphinx can directly access
and index data stored in
MySQL
PostgreSQL
Oracle
Microsoft SQL Server
SQLite
Drizzle
and anything else that supports
ODBC.
In addition to full-text, an arbitrary number of attributes (product IDs, company names, prices, etc) can be stored in the index and used either just for retrieval (to avoid hitting the DB), or for efficient Sphinx-side search result set post-processing.
SphinxAPI
SphinxQL
SphinxSE
SphinxAPI is a native library available for Java, PHP, Python, Perl, C, and other languages.
SphinxQL lets the application query Sphinx using standard MySQL client library and query syntax.
SphinxSE, a pluggable storage engine for MySQL, enables huge result sets to be shipped directly to the MySQL server for post-processing.
tab separated values, or xml pipe
easy application integration
Sphinx is a performant, scalable, and easy to use search server. Give it a try. Go to sphinxsearch.com
you might imagine Sphinx like this..
super flexible queries
your application
SphinxQL is the
recommended method
It gets most of our developers'
attention and will soon be more
feature-rich than SphinxAPI.
Sphinx 'searchd' supports MySQL binary network protocol and can be accessed with the regular MySQL API.
Introducing SphinxQL
Here's an example of querying Sphinx using the MySQL client:

$ mysql -P 9306
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 0.9.9-dev (r1734)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> SELECT * FROM test1 WHERE MATCH('test')
-> ORDER BY group_id ASC OPTION ranker=bm25;
+------+--------+----------+------------+
| id   | weight | group_id | date_added |
+------+--------+----------+------------+
|    4 |   1442 |        2 | 1231721236 |
|    2 |   2421 |      123 | 1231721236 |
|    1 |   2421 |      456 | 1231721236 |
+------+--------+----------+------------+
3 rows in set (0.00 sec)

Note that mysqld was not even running on the test machine. Everything was handled by searchd itself.

This access method is supported in addition to the native APIs, which all still work perfectly well. In fact, both access methods can be used at the same time. Also, the native API is still the default access method. MySQL protocol support needs to be additionally configured. This is a matter of a one-line config change, adding a new listener with mysql41 specified as a protocol:

listen = localhost:9306:mysql41

Just supporting the protocol and not the SQL syntax would be useless so Sphinx also supports a subset of SQL that we dubbed SphinxQL. It supports standard querying of all the index types with SELECT, modifying RT indexes with INSERT, REPLACE, and DELETE, and much more. Full SphinxQL reference is available in our documentation.
In short, SphinxQL makes your job easier.
There's 'matching' and 'ranking'.

Matching comes first, then
matches are 'ranked'.

'Matching modes' are pretty much obsolete.
Instead, use 'extended matching'
or 'SPH_MATCH_EXTENDED'
Extended Query Syntax
operator OR:

hello | world
operator NOT:

hello -world
hello !world
field search operator:

@title hello @body world
field position
limit modifier:

@body[50] hello
multiple-field
search operator:

@(title,body) hello world
ignore field search
operator (will ignore
any matches of 'hello
world' from field 'title'):

@!title hello world
ignore multiple-field search
operator (if we have fields title,
subject and body then @!(title)
is equivalent to @(subject,body)):

@!(title,body) hello world
all-field search operator:

@* hello
phrase search operator:

"hello world"
proximity search operator:

"hello world"~10
quorum matching
operator:

"the world is a
wonderful place"/3
strict order operator
(aka operator "before"):

aaa << bbb << ccc
exact form modifier:

raining =cats and =dogs
field-start and
field-end modifier:

^hello world$
NEAR, generalized proximity operator:

hello NEAR/3 world NEAR/4 "my test"
SENTENCE operator:

all SENTENCE words
SENTENCE "in one sentence"
PARAGRAPH operator :

"Bill Gates" PARAGRAPH "Steve Jobs"
ZONE limit operator:

ZONE:(h3,h4)
only in these titles
ZONESPAN limit operator:

ZONESPAN:(h2)
only in a (single) title
smart queries
Extended matching mode: query example

"hello world" @title "example program"~5 @body python -(php|perl) @* code

The full meaning of this search is:

Find the words 'hello' and 'world' adjacently in any field in a document;
Additionally, the same document must also contain the words 'example' and 'program' in the title field, with up to, but not including, 5 words between the words in question; (E.g. "example PHP program" would be matched however "example script to introduce outside data into the correct context for your program" would not because two terms have 5 or more words between them)
Additionally, the same document must contain the word 'python' in the body field, but not contain either 'php' or 'perl';
Additionally, the same document must contain the word 'code' in any field.
...and after we have our matches.. On to ranking!
Quorum matching operator introduces a kind of fuzzy matching. It will only match those documents that pass a given threshold of given words. The example above ("the world is a wonderful place"/3) will match all documents that have at least 3 of the 6 specified words. Operator is limited to 255 keywords. Instead of an absolute number, you can also specify a number between 0.0 and 1.0 (standing for 0% and 100%), and Sphinx will match only documents with at least the specified percentage of given words. The same example above could also have been written "the world is a wonderful place"/0.5 and it would match documents with at least 50% of the 6 words.
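
As a rough illustration of the quorum rule described above, the following Python sketch counts how many distinct query words appear in a document. The function name, the whitespace tokenizer, and rounding a fractional threshold up are assumptions made for illustration; this is not how Sphinx itself implements the operator.

```python
import math

def quorum_match(document, query_words, threshold):
    """Return True when the document contains enough of the query words.

    threshold is either an absolute count (int) or a fraction between
    0.0 and 1.0 (float). Rounding the fraction up is an assumption.
    """
    doc_terms = set(document.lower().split())
    wanted = {w.lower() for w in query_words}
    if isinstance(threshold, float):
        required = math.ceil(threshold * len(wanted))
    else:
        required = threshold
    return len(wanted & doc_terms) >= required

QUERY = "the world is a wonderful place".split()
print(quorum_match("what a wonderful world", QUERY, 3))   # True: a, wonderful, world
print(quorum_match("hello world", QUERY, 3))              # False: only 'world'
print(quorum_match("the place is closed", QUERY, 0.5))    # True: the, place, is
```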
Strict order operator (aka operator "before"), will match the document only if its argument keywords occur in the document exactly in the query order. For instance, "black << cat" query (without quotes) will match the document "black and white cat" but not the "that cat was black" document. Order operator has the lowest priority. It can be applied both to just keywords and more complex expressions, ie. this is a valid query:

(bag of words) << "exact phrase" << red|green|blue
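
The "black << cat" example can be checked with a small Python sketch. It only handles plain keywords (not subexpressions such as phrases or OR groups), and the function name and whitespace tokenizer are hypothetical choices for illustration.

```python
def matches_in_order(document, keywords):
    """True when every keyword occurs in the document in the given order."""
    tokens = iter(document.lower().split())
    # 'kw in tokens' consumes the iterator up to the first hit, so each
    # later keyword is only searched for after the previous match.
    return all(kw.lower() in tokens for kw in keywords)

print(matches_in_order("black and white cat", ["black", "cat"]))  # True
print(matches_in_order("that cat was black", ["black", "cat"]))   # False
```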
Field position limit additionally restricts the search to the first N positions within the given field (or fields). For example, "@body[50] hello" will not match documents where the keyword 'hello' occurs only at position 51 or later in the body.
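
A minimal Python sketch of the position limit, assuming 1-based word positions and a plain whitespace tokenizer (both are assumptions, as is the function name):

```python
def match_in_first_positions(field_text, keyword, limit):
    """True when keyword occurs within the first `limit` word positions."""
    tokens = field_text.lower().split()
    return keyword.lower() in tokens[:limit]

body = " ".join(["filler"] * 50 + ["hello"])   # 'hello' sits at position 51
print(match_in_first_positions(body, "hello", 50))             # False
print(match_in_first_positions("hello " + body, "hello", 50))  # True
```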
Proximity distance is specified in words, adjusted for word count, and applies to all words within quotes. For instance, "cat dog mouse"~5 query means that there must be less than 8-word span which contains all 3 words, ie. "CAT aaa bbb ccc DOG eee fff MOUSE" document will not match this query, because this span is exactly 8 words long.
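
The span arithmetic above can be sketched in Python: with N as the proximity distance and K distinct keywords, all keywords must fit in a span shorter than N + K words. The brute-force window search, the tokenizer, and the function name are illustrative assumptions, not the engine's algorithm.

```python
def proximity_match(document, keywords, distance):
    """True when all keywords fit in a span shorter than distance + len(keywords) words."""
    tokens = document.lower().split()
    wanted = {k.lower() for k in keywords}
    max_span = distance + len(wanted)   # the span must be strictly shorter than this
    for start, token in enumerate(tokens):
        if token not in wanted:
            continue
        seen = set()
        # Only look at windows of at most max_span - 1 words from this start.
        for end in range(start, min(start + max_span - 1, len(tokens))):
            if tokens[end] in wanted:
                seen.add(tokens[end])
            if seen == wanted:
                return True
    return False

words = ["cat", "dog", "mouse"]
print(proximity_match("CAT aaa bbb ccc DOG eee fff MOUSE", words, 5))  # False: span is 8 words
print(proximity_match("CAT aaa bbb DOG eee fff MOUSE", words, 5))      # True: span is 7 words
```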
Field-start and field-end keyword modifiers, will make the keyword match only if it occurred at the very start or the very end of a fulltext field, respectively. For instance, the query "^hello world$" (with quotes and thus combining phrase operator and start/end modifiers) will only match documents that contain at least one field that has exactly these two keywords.
NEAR operator, is a generalized version of a proximity operator. The syntax is NEAR/N, it is case-sensitive, and no spaces are allowed between the NEAR keyword, the slash sign, and the distance value.

The original proximity operator only worked on sets of keywords. NEAR is more generic and can accept arbitrary subexpressions as its two arguments, matching the document when both subexpressions are found within N words of each other, no matter in which order. NEAR is left associative and has the same (lowest) precedence as BEFORE.

You should also note how a (one NEAR/7 two NEAR/7 three) query using NEAR is not really equivalent to a ("one two three"~7) one using keyword proximity operator. The difference here is that the proximity operator allows for up to 6 non-matching words between all the 3 matching words, but the version with NEAR is less restrictive: it would allow for up to 6 words between 'one' and 'two' and then for up to 6 more between that two-word matching and a 'three' keyword.
The SENTENCE and PARAGRAPH operators match the document when both arguments are within the same sentence or the same paragraph of text, respectively. The arguments can be keywords, phrases, or instances of the same operator. Here are a few examples:

one SENTENCE two
one SENTENCE "two three"
one SENTENCE "two three" SENTENCE four

The order of the arguments within the sentence or paragraph does not matter. These operators only work on indexes built with index_sp (sentence and paragraph indexing feature) enabled, and revert to a mere AND otherwise. Refer to the index_sp directive documentation for the notes on what's considered a sentence and a paragraph.
ZONE limit operator is quite similar to field limit operator, but restricts matching to a given in-field zone or a list of zones. Note that the subsequent subexpressions are not required to match in a single contiguous span of a given zone, and may match in multiple spans. For instance, (ZONE:th hello world) query will match this example document:

<th>Table 1. Local awareness of Hello Kitty brand.</th>
.. some table data goes here ..
<th>Table 2. World-wide brand awareness.</th>

ZONE operator affects the query until the next field or ZONE limit operator, or the closing parenthesis. It only works on the indexes built with zones support (see Section 11.2.9, “index_zones”) and will be ignored otherwise.
ZONESPAN limit operator is similar to the ZONE operator, but requires the match to occur in a single contiguous span. In the example above, (ZONESPAN:th hello world) would not match the document, since "hello" and "world" do not occur within the same span.
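
The difference between ZONE and ZONESPAN can be illustrated with the table-header document from the text. This Python sketch extracts zone spans with a regular expression and is only a model of the matching rule; the function names and the regex-based span extraction are assumptions.

```python
import re

DOC = """<th>Table 1. Local awareness of Hello Kitty brand.</th>
.. some table data goes here ..
<th>Table 2. World-wide brand awareness.</th>"""

def zone_spans(document, zone):
    """Return the token set of every <zone>...</zone> span in the document."""
    pattern = r"<{0}>(.*?)</{0}>".format(re.escape(zone))
    return [set(re.findall(r"\w+", span.lower()))
            for span in re.findall(pattern, document, flags=re.S)]

def zone_match(document, zone, keywords):
    """ZONE: each keyword may come from any span of the zone."""
    spans = zone_spans(document, zone)
    return all(any(k in span for span in spans) for k in keywords)

def zonespan_match(document, zone, keywords):
    """ZONESPAN: all keywords must occur inside one single span."""
    return any(all(k in span for k in keywords)
               for span in zone_spans(document, zone))

print(zone_match(DOC, "th", ["hello", "world"]))      # True
print(zonespan_match(DOC, "th", ["hello", "world"]))  # False
```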
RT indexes enable you to implement dynamic updates and incremental additions to the full text index. RT stands for Real Time and they are indeed "soft realtime" in terms of writes, meaning that most index changes become available for searching as quick as 1 millisecond or less, but could occasionally stall for seconds. (Searches will still work even during that occasional writing stall.)
Sphinx also supports so-called distributed indexes. Compared to disk and RT indexes, those are not a real physical backend, but rather just lists of either local or remote indexes that can be searched transparently to the application, with Sphinx doing all the chores of sending search requests to remote machines in the cluster, aggregating the result sets, retrying the failed requests, and even doing some load balancing.
Disk indexes are designed to provide maximum indexing and searching speed, while keeping the RAM footprint as low as possible. That comes at a cost of text index updates. You can not update an existing document or incrementally add a new document to a disk index. You only can batch rebuild the entire disk index from scratch. (Note that you still can update document's attributes on the fly, even with the disk indexes.)

This "rebuild only" limitation might look like a big constraint at first glance. But in reality, it can very frequently be worked around rather easily by setting up multiple disk indexes, searching through them all, and only rebuilding the one with a fraction of the most recently changed data.

ALL DOCUMENT IDS MUST BE UNIQUE UNSIGNED NON-ZERO INTEGER NUMBERS (32-BIT OR 64-BIT, DEPENDING ON BUILD TIME SETTINGS).

If this requirement is not met, different bad things can happen. For instance, Sphinx can crash with an internal assertion while indexing; or produce strange results when searching due to conflicting IDs. Also, a 1000-pound gorilla might eventually come out of your display and start throwing barrels at you. You've been warned.
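
Before feeding documents to the indexer, it may be worth checking the ID requirement on the application side. The following Python sketch is one way to do that; the function name and the error messages are hypothetical, and the `bits` parameter stands in for the 32-bit or 64-bit build setting.

```python
def validate_doc_ids(doc_ids, bits=32):
    """Return a list of problems found; an empty list means the IDs look valid."""
    problems = []
    seen = set()
    max_id = (1 << bits) - 1
    for doc_id in doc_ids:
        if isinstance(doc_id, bool) or not isinstance(doc_id, int):
            problems.append("%r is not an integer" % (doc_id,))
        elif doc_id <= 0:
            problems.append("%r is zero or negative" % (doc_id,))
        elif doc_id > max_id:
            problems.append("%r does not fit in %d bits" % (doc_id, bits))
        elif doc_id in seen:
            problems.append("%r is a duplicate" % (doc_id,))
        else:
            seen.add(doc_id)
    return problems

print(validate_doc_ids([1, 2, 3]))      # []
print(validate_doc_ids([1, 0, 1, -5]))  # three problems reported
```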
When building an index, Sphinx fetches documents from the specified sources, splits the text into words, and does case folding so that "Abc", "ABC" and "abc" would be treated as the same word (or, to be pedantic, term).

To do that properly, Sphinx needs to know

what encoding the source text is in;
which characters are letters and which are not;
what letters should be folded to other letters.
You can also specify text pattern replacement rules. For example, given the rules:

regexp_filter = \b(\d+)\" => \1 inch
regexp_filter = (BLUE|RED) => COLOR

the text 'RED TUBE 5" LONG' would be indexed as 'COLOR TUBE 5 INCH LONG', and 'PLANK 2" x 4"' as 'PLANK 2 INCH x 4 INCH'. Rules are applied in the given order. Text in queries is also replaced; a search for "BLUE TUBE" would actually become a search for "COLOR TUBE". Note that Sphinx must be built with the --with-re2 option to use this feature.
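
To see the effect of applying rules in order, here is a Python sketch that uses the standard `re` module as a stand-in for RE2 (an assumption; the two regex dialects overlap for these patterns but are not identical). It does not perform case folding, which is why the output shows "inch" rather than "INCH".

```python
import re

# The two rules from the text, as (pattern, replacement) pairs.
RULES = [
    (r'\b(\d+)\"', r"\1 inch"),
    (r"(BLUE|RED)", "COLOR"),
]

def apply_filters(text, rules=RULES):
    """Apply each replacement rule in the given order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text

print(apply_filters('RED TUBE 5" LONG'))  # COLOR TUBE 5 inch LONG
print(apply_filters('PLANK 2" x 4"'))     # PLANK 2 inch x 4 inch
print(apply_filters("BLUE TUBE"))         # COLOR TUBE
```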
Ranking
With all the SQL drivers, indexing generally works as follows.

connection to the database is established;
pre-query is executed to perform any necessary initial setup, such as setting per-connection encoding with MySQL;
main query is executed and the rows it returns are indexed;
post-query is executed to perform any necessary cleanup;
connection to the database is closed;
indexer does the sorting phase (to be pedantic, index-type specific post-processing);
connection to the database is established again;
post-index query is executed to perform any necessary final cleanup;
connection to the database is closed again.
xmlpipe2 lets you pass arbitrary full-text and attribute data to Sphinx in a custom XML format. It also lets you specify the schema (ie. the set of fields and attributes) either in the XML stream itself, or in the source settings.

When indexing xmlpipe2 source, indexer runs the given command, opens a pipe to its stdout, and expects well-formed XML stream.
Here's sample stream data:

<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>

<sphinx:schema>
<sphinx:field name="subject"/>
<sphinx:field name="content"/>
<sphinx:attr name="published" type="timestamp"/>
<sphinx:attr name="author_id" type="int" bits="16" default="1"/>
</sphinx:schema>

<sphinx:document id="1234">
<content>this is the main content <![CDATA[[and this <cdata> entry
must be handled properly by xml parser lib]]></content>
<published>1012325463</published>
<subject>note how field/attr tags can be
in <b class="red">randomized</b> order</subject>
<misc>some undeclared element</misc>
</sphinx:document>

<sphinx:document id="1235">
<subject>another subject</subject>
<content>here comes another document, and i am given to understand,
that in-document field order must not matter, sir</content>
<published>1012325467</published>
</sphinx:document>

<!-- ... even more sphinx:document entries here ... -->

<sphinx:killlist>
<id>1234</id>
<id>4567</id>
</sphinx:killlist>

</sphinx:docset>

Arbitrary fields and attributes are allowed. They also can occur in the stream in arbitrary order within each document; the order is ignored. There is a restriction on maximum field length; fields longer than 2 MB will be truncated to 2 MB (this limit can be changed in the source).

The schema, ie. complete fields and attributes list, must be declared before any document could be parsed. This can be done either in the configuration file using xmlpipe_field and xmlpipe_attr_XXX settings, or right in the stream using <sphinx:schema> element. <sphinx:schema> is optional. It is only allowed to occur as the very first sub-element in <sphinx:docset>. If there is no in-stream schema definition, settings from the configuration file will be used. Otherwise, stream settings take precedence.
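
Since the indexer simply reads the XML stream from a command's stdout, a short script can produce it. The Python sketch below emits a stream shaped like the sample above, with the schema as the first sub-element. The function name and input layout are hypothetical, and escaping all field values with `xml.sax.saxutils.escape` is a design choice made here, not something the format description requires.

```python
from xml.sax.saxutils import escape, quoteattr

def build_docset(fields, attrs, documents):
    """Render an xmlpipe2-style stream.

    fields: list of field names; attrs: dict of attribute name -> type;
    documents: dict of document id -> dict of element name -> value.
    """
    out = ['<?xml version="1.0" encoding="utf-8"?>', "<sphinx:docset>", "<sphinx:schema>"]
    for name in fields:
        out.append("<sphinx:field name=%s/>" % quoteattr(name))
    for name, type_ in attrs.items():
        out.append("<sphinx:attr name=%s type=%s/>" % (quoteattr(name), quoteattr(type_)))
    out.append("</sphinx:schema>")
    for doc_id, values in documents.items():
        out.append("<sphinx:document id=%s>" % quoteattr(str(doc_id)))
        for name, value in values.items():
            out.append("<%s>%s</%s>" % (name, escape(str(value)), name))
        out.append("</sphinx:document>")
    out.append("</sphinx:docset>")
    return "\n".join(out)

stream = build_docset(
    fields=["subject", "content"],
    attrs={"published": "timestamp"},
    documents={1235: {"subject": "another subject",
                      "content": "fish & chips",
                      "published": 1012325467}},
)
print(stream)
```

In a real setup the script would print the stream to stdout so that the indexer can read it through the pipe.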
Ranking (aka weighting) of the search results can be defined as a process of computing a so-called relevance (aka weight) for every given matched document with regards to a given query that matched it. So relevance is in the end just a number attached to every document that estimates how relevant the document is to the query. Search results can then be sorted based on this number and/or some additional parameters, so that the most sought after results would come up higher on the results page.
There is no single standard one-size-fits-all way to rank any document in any scenario. Moreover, there can not ever be such a way, because relevance is subjective. As in, what seems relevant to you might not seem relevant to me. Hence, in the general case, relevance is not just hard to compute, it's theoretically impossible.

So ranking in Sphinx is configurable. It has a notion of a so-called ranker. A ranker can formally be defined as a function that takes a document and a query as its input and produces a relevance value as output. In layman's terms, a ranker controls exactly how (using which specific algorithm) Sphinx will assign weights to the document.

Previously, this ranking function was rigidly bound to the matching mode. So in the legacy matching modes (that is, SPH_MATCH_ALL, SPH_MATCH_ANY, SPH_MATCH_PHRASE, and SPH_MATCH_BOOLEAN) you can not choose the ranker. You can only do that in the SPH_MATCH_EXTENDED mode. (Which is the only mode in SphinxQL and the suggested mode in SphinxAPI anyway.) To choose a non-default ranker you can either use SetRankingMode() with SphinxAPI, or OPTION ranker clause in SELECT statement when using SphinxQL.
SPH_RANK_PROXIMITY_BM25, the default ranking mode that uses and combines both phrase proximity and BM25 ranking.
SPH_RANK_BM25, statistical ranking mode which uses BM25 ranking only (similar to most other full-text engines). This mode is faster but may result in worse quality on queries which contain more than 1 keyword.
SPH_RANK_NONE, no ranking mode. This mode is obviously the fastest. A weight of 1 is assigned to all matches. This is sometimes called boolean searching that just matches the documents but does not rank them.
SPH_RANK_WORDCOUNT, ranking by the keyword occurrences count. This ranker computes the per-field keyword occurrence counts, then multiplies them by field weights, and sums the resulting values.
SPH_RANK_PROXIMITY, added in version 0.9.9-rc1, returns raw phrase proximity value as a result. This mode is internally used to emulate SPH_MATCH_ALL queries.
SPH_RANK_MATCHANY, added in version 0.9.9-rc1, returns rank as it was computed in SPH_MATCH_ANY mode earlier, and is internally used to emulate SPH_MATCH_ANY queries.
SPH_RANK_FIELDMASK, added in version 0.9.9-rc2, returns a 32-bit mask with N-th bit corresponding to N-th fulltext field, numbering from 0. The bit will only be set when the respective field has any keyword occurrences satisfying the query.
SPH_RANK_SPH04, added in version 1.10-beta, is generally based on the default SPH_RANK_PROXIMITY_BM25 ranker, but additionally boosts the matches when they occur in the very beginning or the very end of a text field. Thus, if a field equals the exact query, SPH04 should rank it higher than a field that contains the exact query but is not equal to it. (For instance, when the query is "Hyde Park", a document entitled "Hyde Park" should be ranked higher than one entitled "Hyde Park, London" or "The Hyde Park Cafe".)
SPH_RANK_EXPR, added in version 2.0.2-beta, lets you specify the ranking formula in run time. It exposes a number of internal text factors and lets you define how the final weight should be computed from those factors.
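
Two of the simpler rankers in the list can be modelled in a few lines of Python. The sketch below follows the descriptions of SPH_RANK_WORDCOUNT and SPH_RANK_FIELDMASK given above, treating "keyword occurrences satisfying the query" as plain keyword hits, which is a simplification. Function names and the tokenizer are assumptions.

```python
def wordcount_rank(fields, keywords, field_weights):
    """WORDCOUNT-style sum of per-field keyword counts times field weights."""
    wanted = {k.lower() for k in keywords}
    total = 0
    for text, weight in zip(fields, field_weights):
        hits = sum(1 for token in text.lower().split() if token in wanted)
        total += hits * weight
    return total

def fieldmask_rank(fields, keywords):
    """FIELDMASK-style bit mask: bit N is set when field N has a keyword hit."""
    wanted = {k.lower() for k in keywords}
    mask = 0
    for n, text in enumerate(fields):
        if wanted & set(text.lower().split()):
            mask |= 1 << n
    return mask

doc = ["hello world", "say hello to the big wide world hello", "nothing here"]
print(wordcount_rank(doc, ["hello", "world"], [10, 1, 1]))  # 2*10 + 3*1 + 0 = 23
print(fieldmask_rank(doc, ["hello", "world"]))              # 3, i.e. bits 0 and 1 set
```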
Expression ranker, added in version 2.0.2-beta, lets you change the ranking formula on the fly, on a per-query basis. For a quick kickoff, this is how you emulate PROXIMITY_BM25 ranker using the expression based one:

SELECT *, WEIGHT() FROM myindex WHERE MATCH('hello world')
OPTION ranker=expr('sum(lcs*user_weight)*1000+bm25')
The output of this query must not change if you omit the OPTION clause, because the default ranker (PROXIMITY_BM25) behaves exactly like specified in the ranker formula above. But the expression ranker is somewhat more flexible than just that and provides access to many more factors.

The ranking formula is an arbitrary arithmetic expression that can use constants, document attributes, built-in functions and operators (described in Section 5.5, “Expressions, functions, and operators”), and also a few ranking-specific things that are only accessible in a ranking formula. Namely, those are field aggregation functions, field-level, and document-level ranking factors.
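
To make the formula sum(lcs*user_weight)*1000+bm25 concrete, here is a Python sketch that evaluates it for one matched document. The per-field lcs values, user weights, and the bm25 number in the example are made up for illustration; in Sphinx these factors are computed by the engine.

```python
def proximity_bm25_weight(field_factors, bm25):
    """Evaluate sum(lcs*user_weight)*1000+bm25 for one matched document.

    field_factors: list of (lcs, user_weight) pairs, one per field.
    """
    return sum(lcs * user_weight for lcs, user_weight in field_factors) * 1000 + bm25

# Two fields: lcs=2 with weight 1, and lcs=1 with weight 3; bm25 of 421.
print(proximity_bm25_weight([(2, 1), (1, 3)], 421))  # (2*1 + 1*3)*1000 + 421 = 5421
```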
Currently implemented rankers
Quick summary of the ranking factors
So.. when we say "flexible",
we mean "really, really flexible".
Sphinx unlocks insight from your HUGE
data sets. And, because of our diligent
focus on performance, you won't go
broke buying extra hardware.. Give it a try.
It's free (and open source).
This is the end of this little presentation.
Thanks for your interest in Sphinx!
If there is a Sphinx-related topic you
would like explained, just let us know..
Tell us on twitter: @sphinxsearch
If you need help with the design
and/or implementation of your search
(or analytics) system, give us a call at
+1 (888) 333-1345 or email us at sales@sphinxsearch.com

Bye for now!