Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled


John Berryman

on 26 November 2013


Transcript of Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled

Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
John Berryman
Prototyping Patent Search for the USPTO
Search as used by a Patent Examiner
Unleashing the Full Potential of Lucene
Putting it All Together in Solr
Parboiled and PEG Parsing
Search Architect
prototype future of patent search for examiners
emphasize that they are modernizing and wanted to move to an open source industry standard: Solr/Lucene
incorporate more modern search techniques
a year and a half ago
Marti Hearst
full-stack implementation - (zoom through pieces)
data conditioning - 10 million patents - various formats
image ingestion pipeline
100M of hierarchical classifications
api layer - keep track of examiner's search histories for later
javascript web app
search (zoom in on search box)
The way we normally think of search (images of Amazon and Zappos): you dig, slice, and dice until you find the thing you're looking for, then you leave - needle in the haystack
Examiners are given the needle to start with and asked to prove that there IS no needle in the haystack
You can't prove that no needle exists - but the examiners must be able to show that they've done due diligence to dig up evidence and have come up empty
Name of the game - understand the pieces of the haystack and be able to show that they've looked every place where prior art is likely to be found
Practically this turns into
Massively compound searches
Saved searches (for later)
Palettes of queries - often for synonyms
Most important deviation: Proximity searches
BRS examples:
Fielded Searches asdf.something.
classifications D34.235.648.
Out of the box, Solr's only position-aware querying is for phrases.
However Lucene does have rich capacity for position-based queries.
Span Query
lack of SpanAnd
can implement with large SpanNear but what if you just mean the document must contain both queries?
Lack of interoperability of Queries and SpanQueries - but there's talk of getting rid of SpanQueries soon.
Implementing SWAN with SpanNear
PositionGap way - if you have paragraphs and sentences then you've got to have lots of room for tokens
Special token way - slower, because it relies upon compound queries, but handles variously sized paragraphs
People are familiar with context-free grammar parser generators such as ANTLR and JavaCC - but people try to avoid them!
external grammar definitions (BNF)
an extra compilation step
"untouchable" code
Parboiled (a parsing expression grammar parser) is an interesting alternative
PEG - similar to a context-free grammar, except that it can't handle ambiguous grammars, so it doesn't work for natural language but is great for parsing code (I think it can do anything that an LL or LR parser can do)
No extra grammar definition and no compilation step - the code you write is the grammar and the compiler! You just give it strings and it converts them to code.
Examples (calculator) - precedence, arity, affinity
Example (can do Java)
Example (my code) - show where the Lucene goes:
start at leaves - here's where the terms go
here's where they get added to lucene queries
here's the whole query
Testing patterns? (Maybe I can include this in a parser-focused talk. This one's too short.)
Patent Examiners Are Different
Given a patent application, examiners must "prove" the non-existence of prior art.
... and it's our job to re-implement it in Solr/Lucene.
Why not just use dismax?
Simple (non-existent) search syntax!
Don't have to worry about fields!
Results return in relevancy sorting!
But they don't want it!
...and also, they already know BRS
Fielded Query
((widget ADJ gizmo) SAME doodad).dscr.
AND thingamabob
Out of the box Solr's position support is limited to phrase slop:
"some query"~4
Pain point: SpanQuery is a Query, but Query is not a SpanQuery
Implementing SWAN operators in Lucene
Prior to analysis:
Find the maximum number of tokens a sentence will have - say 50
Find the maximum number of sentences a paragraph will have - say 50
Position Gap Strategy
Then use SpanNearQueries
(let's see!)
Want to parse your own syntax?
What you need is a Parser Generator
you have to keep track of an external syntax definition file
you have to compile the grammar definitions to runnable code (extra step)
you run the risk of creating "untouchable code"
...in other words, it's complicated
PEG similar to CFG - but no ambiguity allowed
Capable of parsing Java itself (so we're not talking Regular Expressions here)
No external syntax definition, no compilation step. The syntax definition IS the code that parses the language.
a Parsing Expression Grammar Parser
Parboiled Calculator Example
Ability to package and configure this parser.
Something that can parse the syntax.
Lucene SWAN Queries
An understanding of BRS syntax.
Lucene Queries that implement this behavior.
We're ready!
Let's get parsing!
SWAN Parser
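The parser code from this slide is not part of the transcript. As a stand-in, here is a minimal sketch of the PEG idea in plain Java: each grammar rule is a method, and choices are tried in order with backtracking. It is hand-written and does not use Parboiled. The class name SwanParserSketch, the output format, and the decision to give every operator the same left-to-right precedence are all assumptions made for illustration, not the grammar used in the talk.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical hand-rolled PEG-style parser for a BRS-like syntax (not Parboiled).
final class SwanParserSketch {
    private static final Set<String> OPERATORS =
            new HashSet<>(Arrays.asList("ADJ", "NEAR", "WITH", "SAME", "AND", "OR"));

    private final String input;
    private int pos = 0;

    private SwanParserSketch(String input) { this.input = input; }

    /** Parses a query and returns a prefix-notation tree such as ADJ(widget, gizmo). */
    static String parse(String query) {
        SwanParserSketch p = new SwanParserSketch(query);
        String tree = p.expression();
        p.skipSpaces();
        if (p.pos != p.input.length()) {
            throw new IllegalArgumentException("Unexpected input at offset " + p.pos);
        }
        return tree;
    }

    // Expression <- Primary (Operator Primary)*   (assumption: equal precedence, left to right)
    private String expression() {
        String left = primary();
        while (true) {
            String op = operator();
            if (op == null) return left;
            left = op + "(" + left + ", " + primary() + ")";
        }
    }

    // Primary <- ('(' Expression ')' / Word) Field?
    private String primary() {
        skipSpaces();
        String node;
        if (peek('(')) {
            pos++;
            node = expression();
            skipSpaces();
            if (!peek(')')) throw new IllegalArgumentException("Expected ')' at offset " + pos);
            pos++;
        } else {
            node = word();
            if (node.isEmpty()) throw new IllegalArgumentException("Expected a term at offset " + pos);
        }
        String field = field();
        return field == null ? node : "FIELD[" + field + "](" + node + ")";
    }

    // Field <- '.' Word '.'   (e.g. the ".dscr." suffix); backtracks on failure
    private String field() {
        int mark = pos;
        if (peek('.')) {
            pos++;
            String name = word();
            if (!name.isEmpty() && peek('.')) { pos++; return name; }
        }
        pos = mark;
        return null;
    }

    // Operator <- 'ADJ' / 'NEAR' / 'WITH' / 'SAME' / 'AND' / 'OR'; backtracks on failure
    private String operator() {
        int mark = pos;
        skipSpaces();
        String w = word();
        if (OPERATORS.contains(w)) return w;
        pos = mark;
        return null;
    }

    private String word() {
        int start = pos;
        while (pos < input.length() && Character.isLetterOrDigit(input.charAt(pos))) pos++;
        return input.substring(start, pos);
    }

    private boolean peek(char c) { return pos < input.length() && input.charAt(pos) == c; }

    private void skipSpaces() { while (peek(' ')) pos++; }

    public static void main(String[] args) {
        System.out.println(parse("((widget ADJ gizmo) SAME doodad).dscr. AND thingamabob"));
    }
}
```

Running main prints AND(FIELD[dscr](SAME(ADJ(widget, gizmo), doodad)), thingamabob). In a real plugin, the leaves would become span term queries and each operator node would wrap its children in the matching Lucene query, rather than building a string.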
SWAN Solr Plugin
$ mvn package
$ cp target/swan_parser.jar /my/solr/contrib
Configure Plugin
(not pictured here:
custom analyzer)
Use it!
Patents End-to-End Search
Started in late 2011
Goal: Prototype the next generation of search for patent examiners.
Specifically: Upgrade to open source industry standard - Solr
Search for the average user
Therefore, examiners rely upon extremely rich search syntax.
It's called BRS
Historical constraints:
No facets!
No synonyms!
No stemming!
No relevancy sorting!
Practically, examiners must provide strong evidence that prior art does not exist.
Slicing and Dicing Patents and NPL
Understanding groupings
Exhaustively demonstrating no prior art in appropriate sections
(this is the big point)
Now, you're probably thinking ...
Parenthetical Groupings
Proximity Search
SAME: in same paragraph
WITH: in same sentence
ADJ: adjacent (order matters)
NEAR: adjacent (any order)
Boolean Search
Position Aware Queries
Lucene does have position support: SpanQueries
SpanTermQuery(Term term)
SpanOrQuery(SpanQuery... clauses)
SpanNotQuery(SpanQuery include, SpanQuery exclude)
SpanNearQuery(SpanQuery[] clauses, int slop, boolean inOrder)
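To make the slop and inOrder arguments of SpanNearQuery concrete, here is a rough plain-Java stand-in that works on a token array instead of a Lucene index. It is a simplification for two single-term clauses only; the class SpanNearSketch and its methods are hypothetical names. Mapping ADJ to slop 0 in order and NEAR to slop 0 in any order is an assumption based on the operator definitions in this talk.

```java
// Hypothetical two-term simulation of SpanNearQuery(clauses, slop, inOrder); not Lucene code.
final class SpanNearSketch {

    /**
     * Returns true if `first` and `second` occur with at most `slop`
     * unmatched positions between them. With inOrder, `first` must come first.
     */
    static boolean spanNear(String[] tokens, String first, String second,
                            int slop, boolean inOrder) {
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(first)) continue;
            for (int j = 0; j < tokens.length; j++) {
                if (i == j || !tokens[j].equals(second)) continue;
                if (inOrder && j < i) continue;
                int gap = Math.abs(j - i) - 1; // positions strictly between the two terms
                if (gap <= slop) return true;
            }
        }
        return false;
    }

    /** ADJ: adjacent, order matters (assumed mapping: slop 0, in order). */
    static boolean adj(String[] tokens, String a, String b) {
        return spanNear(tokens, a, b, 0, true);
    }

    /** NEAR: adjacent, any order (assumed mapping: slop 0, unordered). */
    static boolean near(String[] tokens, String a, String b) {
        return spanNear(tokens, a, b, 0, false);
    }

    public static void main(String[] args) {
        String[] tokens = {"widget", "gizmo", "doodad"};
        System.out.println(adj(tokens, "widget", "gizmo"));  // in order and adjacent
        System.out.println(adj(tokens, "gizmo", "widget"));  // wrong order for ADJ
        System.out.println(near(tokens, "gizmo", "widget")); // NEAR ignores order
    }
}
```

With Lucene itself, the equivalent would be a SpanNearQuery built from two SpanTermQuery clauses with the same slop and inOrder values.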
Are We Ready?
During Analysis
Add a position gap of 50 tokens between sentences.
Add a position gap of 50*2*50=5000 tokens between paragraphs.
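The gap arithmetic can be checked with a small simulation. This is a minimal sketch in plain Java, not Lucene code: the class PositionGapSketch and its methods are hypothetical names, and it assumes the limits from the slide (at most 50 tokens per sentence and 50 sentences per paragraph). It adds a gap of 50 positions after each sentence and a further 5000 after each paragraph, then treats WITH as "within 49 positions" and SAME as "within 4999 positions". In Lucene those thresholds would be SpanNearQuery slop values.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of the position gap strategy; not Lucene analysis code.
final class PositionGapSketch {
    static final int SENTENCE_GAP = 50;            // assumed max tokens per sentence
    static final int PARAGRAPH_GAP = 50 * 2 * 50;  // 50 sentences * (50 tokens + 50 gap) = 5000

    /** Assigns a position to every token; doc is paragraphs -> sentences -> tokens. */
    static Map<String, List<Integer>> index(String[][][] doc) {
        Map<String, List<Integer>> positions = new HashMap<>();
        int pos = 0;
        for (String[][] paragraph : doc) {
            for (String[] sentence : paragraph) {
                for (String token : sentence) {
                    positions.computeIfAbsent(token, k -> new ArrayList<>()).add(pos++);
                }
                pos += SENTENCE_GAP;   // push the next sentence out of WITH range
            }
            pos += PARAGRAPH_GAP;      // push the next paragraph out of SAME range
        }
        return positions;
    }

    /** True if some occurrence of a and some occurrence of b are within maxDistance positions. */
    static boolean within(Map<String, List<Integer>> idx, String a, String b, int maxDistance) {
        for (int pa : idx.getOrDefault(a, Collections.emptyList())) {
            for (int pb : idx.getOrDefault(b, Collections.emptyList())) {
                if (Math.abs(pa - pb) <= maxDistance) return true;
            }
        }
        return false;
    }

    /** WITH: same sentence, so at most 49 positions apart. */
    static boolean sameSentence(Map<String, List<Integer>> idx, String a, String b) {
        return within(idx, a, b, SENTENCE_GAP - 1);
    }

    /** SAME: same paragraph, so at most 4999 positions apart. */
    static boolean sameParagraph(Map<String, List<Integer>> idx, String a, String b) {
        return within(idx, a, b, PARAGRAPH_GAP - 1);
    }

    public static void main(String[] args) {
        String[][][] doc = {
            { {"widget", "gizmo"}, {"doodad"} },  // paragraph 1: two sentences
            { {"thingamabob"} }                   // paragraph 2: one sentence
        };
        Map<String, List<Integer>> idx = index(doc);
        System.out.println(sameSentence(idx, "widget", "gizmo"));
        System.out.println(sameParagraph(idx, "widget", "doodad"));
        System.out.println(sameParagraph(idx, "widget", "thingamabob"));
    }
}
```

The sketch also shows the cost noted earlier for this strategy: the position space has to reserve room for the largest sentence and paragraph, whether or not a given document uses it.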