Introduction to Full-Text Search


About me
Full-time (mostly) Java developer
Part-time general technical/sysadmin/geeky guy
Interested in: hard problems, search, performance, parallelism, scalability

Because every application needs search

We live in an era of big, complex and connected applications.

But it's no use if you can't find anything!

But it's no use if you can't quickly find something relevant!

Deathy's Tip: You can't win by being generic, but you can be the best for your specific type of content.

So back to our full-text search...

Some core ideas: "index" (or "inverted index") and "document"

Deathy’s Tip: Don't be too quick in deciding what a "document" is. Put some thought into it or you'll regret it (speaking from a lot of experience).

First we need some documents, more specifically some text samples

Documents
Doc1: "The cow says moo"
Doc2: "The dog says woof"
Doc3: "The cow-dog says moof"
"Stolen" from http://www.slideshare.net/tomdyson/being-google

Important: individual words are the basis for the index

Individual words
index = [ "cow", "dog", "moo", "moof", "The", "says", "woof" ]

For each word we have a list of documents to which it belongs

Words, with appearances
index = {
    "cow": ["Doc1", "Doc3"],
    "dog": ["Doc2", "Doc3"],
    "moo": ["Doc1"],
    "moof": ["Doc3"],
    "The": ["Doc1", "Doc2", "Doc3"],
    "says": ["Doc1", "Doc2", "Doc3"],
    "woof": ["Doc2"]
}
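For the curious, here is a minimal sketch of how such an index could be built from the three sample documents. The `build_index` helper is made up for illustration, and it assumes the hyphen in "cow-dog" is treated as a word separator, which is what the index on the slide implies.

```python
from collections import defaultdict

# The three sample documents from the slides.
docs = {
    "Doc1": "The cow says moo",
    "Doc2": "The dog says woof",
    "Doc3": "The cow-dog says moof",
}

def build_index(documents):
    """Map each word to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        # Assumption: a hyphen separates words, so "cow-dog" gives "cow" and "dog".
        for word in text.replace("-", " ").split():
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}

index = build_index(docs)
print(index["cow"])  # ['Doc1', 'Doc3']
print(index["The"])  # ['Doc1', 'Doc2', 'Doc3']
```

A set is used while building so a word repeated inside one document is only recorded once for that document.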

Q1: Find documents which contain "moo"
A1: index["moo"]

Q2: Find documents which contain "The" and "dog"
A2: set(index["The"]) & set(index["dog"])

Try to think of search as unions/intersections or other filters on sets.
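A small sketch of that idea, using the word index from the earlier slide. The `docs_with` helper is a made-up name; the point is only that AND, OR and NOT map onto set intersection, union and difference.

```python
# The word index from the slides.
index = {
    "cow": ["Doc1", "Doc3"],
    "dog": ["Doc2", "Doc3"],
    "moo": ["Doc1"],
    "moof": ["Doc3"],
    "The": ["Doc1", "Doc2", "Doc3"],
    "says": ["Doc1", "Doc2", "Doc3"],
    "woof": ["Doc2"],
}

def docs_with(term):
    """Return the set of documents containing the term (empty if unknown)."""
    return set(index.get(term, []))

both = docs_with("The") & docs_with("dog")      # AND -> intersection
either = docs_with("moo") | docs_with("woof")   # OR  -> union
without = docs_with("The") - docs_with("cow")   # NOT -> difference

print(sorted(both))     # ['Doc2', 'Doc3']
print(sorted(either))   # ['Doc1', 'Doc2']
print(sorted(without))  # ['Doc2']
```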

Most searches use simple terms and "boolean" operators.

"Boolean" operators
"word": the word MAY/SHOULD appear in the document
"+word": the word MUST appear in the document
"-word": the word MUST NOT appear in the document

Example
Query: "+type:book content:java content:python -content:ruby"
Find books with "java" or "python" in their content that do not contain "ruby" in their content.
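Here is a rough sketch of how such a query could be evaluated with plain sets. Everything in it is hypothetical: the toy index, the document ids and the `parse`/`search` helpers. It follows the reading given on the slide, so when optional terms are present at least one of them has to match. Real engines may instead treat optional terms purely as a ranking hint once a required term is present.

```python
# Hypothetical toy index; the document ids are made up for illustration.
index = {
    "type:book": {"B1", "B2", "B3", "B4"},
    "type:article": {"A1"},
    "content:java": {"B1", "B3", "A1"},
    "content:python": {"B2", "B3"},
    "content:ruby": {"B3"},
}

def postings(term):
    """Documents containing the term (empty set if the term is unknown)."""
    return index.get(term, set())

def parse(query):
    """Split a query into MUST (+), SHOULD (no prefix) and MUST NOT (-) terms."""
    must, should, must_not = [], [], []
    for token in query.split():
        if token.startswith("+"):
            must.append(token[1:])
        elif token.startswith("-"):
            must_not.append(token[1:])
        else:
            should.append(token)
    return must, should, must_not

def search(query):
    must, should, must_not = parse(query)
    optional = set().union(*(postings(t) for t in should))
    if must:
        result = set.intersection(*(set(postings(t)) for t in must))
        if should:
            # Assumption taken from the slide: one of the optional terms must match.
            result &= optional
    else:
        result = optional
    for term in must_not:
        result -= postings(term)
    return result

hits = search("+type:book content:java content:python -content:ruby")
print(sorted(hits))  # ['B1', 'B2']
```

B3 matches both optional terms but is dropped because it contains "ruby"; B4 is a book but matches neither "java" nor "python"; A1 is not a book.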

Err...wait...what the hell does "content:java" mean?

Reviewing the "document" concept

An index consists of one or more documents

Each document consists of one or more "field"s. Each field has a name and content.

Field examples: content, title, author, publication date, etc.

So how are fields handled internally? In most cases it is very simple: a word belongs to a specific field, so the field name can be stored directly as part of the term.

New index example
index = {
    "content:cow": ["Doc1", "Doc3"],
    "content:dog": ["Doc2", "Doc3"],
    "content:moo": ["Doc1"],
    "content:moof": ["Doc3"],
    "content:The": ["Doc1", "Doc2", "Doc3"],
    "content:says": ["Doc1", "Doc2", "Doc3"],
    "content:woof": ["Doc2"],
    "type:example_documents": ["Doc1", "Doc2", "Doc3"]
}
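A possible way to produce that fielded index, sketched under a few assumptions: each document is represented as a dict of field name to text, the `build_fielded_index` helper is a made-up name, and hyphens are treated as word separators as before.

```python
from collections import defaultdict

# Assumed representation: each document is a dict of field name -> field content.
documents = {
    "Doc1": {"type": "example_documents", "content": "The cow says moo"},
    "Doc2": {"type": "example_documents", "content": "The dog says woof"},
    "Doc3": {"type": "example_documents", "content": "The cow-dog says moof"},
}

def build_fielded_index(docs):
    """Index every word under a "field:word" term."""
    postings = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            # Assumption: a hyphen separates words, so "cow-dog" gives "cow" and "dog".
            for word in text.replace("-", " ").split():
                postings[f"{field}:{word}"].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

index = build_fielded_index(documents)
print(index["content:cow"])             # ['Doc1', 'Doc3']
print(index["type:example_documents"])  # ['Doc1', 'Doc2', 'Doc3']
```

Because the field name is part of the term, "content:java" and "title:java" would simply be two different keys, and the lookup code does not need to know about fields at all.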

We missed the most important thing!

We saved the most important thing for last!

Analysis, or for mortals: how you get from a long text to small tokens/words/terms

…borrowing from Lucene naming/API...

Some more interesting documents
Doc1: "The quick brown fox jumps over the lazy dog"
Doc2: "All Daleks: Exterminate! Exterminate! EXTERMINATE!! EXTERMINATE!!!"
Doc3: "And the final score is: no TARDIS, no screwdriver, two minutes to spare. Who da man?!"

Tokenizer: Breaks up a single string into smaller tokens.

You define what splitting rules are best for you.

Whitespace Tokenizer
Just break into tokens wherever there is some space. So we get something like:

Doc1: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
Doc2: ["All", "Daleks:", "Exterminate!", "Exterminate!", "EXTERMINATE!!", "EXTERMINATE!!!"]
Doc3: ["And", "the", "final", "score", "is:", "no", "TARDIS,", "no", "screwdriver,", "two", "minutes", "to", "spare.", "Who", "da", "man?!"]
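A whitespace tokenizer is about as small as code gets. In this sketch the function name `whitespace_tokenize` is made up; it leans on Python's `str.split()`, which splits on any run of whitespace when called without arguments.

```python
def whitespace_tokenize(text):
    """Break text into tokens wherever there is whitespace."""
    return text.split()

doc2 = "All Daleks: Exterminate! Exterminate! EXTERMINATE!! EXTERMINATE!!!"
tokens = whitespace_tokenize(doc2)
print(tokens)
# ['All', 'Daleks:', 'Exterminate!', 'Exterminate!', 'EXTERMINATE!!', 'EXTERMINATE!!!']
```

Note that punctuation and case are left untouched, so "Exterminate!" and "EXTERMINATE!!" end up as different tokens.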

But wait, that doesn't look right...

Filter
Transforms one single token into another single token, multiple tokens, or no token at all.
You can apply several of them in a specific order.

Filter 1: lower-case (since we don't want the search to be case-sensitive)

Result
Doc1: ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
Doc2: ["all", "daleks:", "exterminate!", "exterminate!", "exterminate!!", "exterminate!!!"]
Doc3: ["and", "the", "final", "score", "is:", "no", "tardis,", "no", "screwdriver,", "two", "minutes", "to", "spare.", "who", "da", "man?!"]

Result, after a second filter that strips punctuation
Doc1: ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
Doc2: ["all", "daleks", "exterminate", "exterminate", "exterminate", "exterminate"]
Doc3: ["and", "the", "final", "score", "is", "no", "tardis", "no", "screwdriver", "two", "minutes", "to", "spare", "who", "da", "man"]
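The two results above look like lower-casing followed by stripping punctuation. A minimal sketch of such a chain could be written as below. The names `lowercase`, `strip_punctuation` and `analyze` are made up for illustration, and the punctuation filter only trims the ends of each token, which is an assumption that happens to be enough for these samples.

```python
import string

def lowercase(tokens):
    """Filter 1: one token in, one lower-cased token out."""
    for token in tokens:
        yield token.lower()

def strip_punctuation(tokens):
    """Filter 2: trim ASCII punctuation from both ends; drop tokens left empty."""
    for token in tokens:
        cleaned = token.strip(string.punctuation)
        if cleaned:
            yield cleaned

def analyze(text, filters):
    """Whitespace-tokenize, then apply the filters in the given order."""
    tokens = text.split()
    for token_filter in filters:
        tokens = token_filter(tokens)
    return list(tokens)

doc3 = ("And the final score is: no TARDIS, no screwdriver, "
        "two minutes to spare. Who da man?!")
result = analyze(doc3, [lowercase, strip_punctuation])
print(result)
# ['and', 'the', 'final', 'score', 'is', 'no', 'tardis', 'no', 'screwdriver',
#  'two', 'minutes', 'to', 'spare', 'who', 'da', 'man']
```

Writing each filter as a generator keeps them composable: adding another step is just one more entry in the list.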

Add more filter seasoning until it tastes just right.

Lots of things you can do with filters:
case normalization
removing unwanted/unneeded characters
transliteration/normalization of special characters
stopwords
synonyms
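To make two of these concrete: a stopword filter turns some tokens into no token at all, and a synonym filter turns one token into several. The sketch below uses a tiny made-up stopword list and synonym table; real lists depend entirely on your content.

```python
# Both tables are assumptions made up for this example.
STOPWORDS = {"the", "over"}
SYNONYMS = {"quick": ["fast"]}

def remove_stopwords(tokens):
    """One token in, zero or one token out."""
    return [t for t in tokens if t not in STOPWORDS]

def expand_synonyms(tokens):
    """One token in, one or more tokens out (the original plus its synonyms)."""
    expanded = []
    for token in tokens:
        expanded.append(token)
        expanded.extend(SYNONYMS.get(token, []))
    return expanded

tokens = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
filtered = expand_synonyms(remove_stopwords(tokens))
print(filtered)
# ['quick', 'fast', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```

Order matters here: both tables are keyed on lower-case words, so these filters would need to run after case normalization.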

Possibilities are endless, enjoy experimenting with them!

Always use the same analysis rules when indexing and when parsing search text entered by the user!
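A small sketch of why this matters. The `analyze` and `search` helpers are made up; the index is built with lower-casing and punctuation stripping, and the same user input is then searched twice, once with the matching analysis and once with a plain whitespace split.

```python
import string
from collections import defaultdict

def analyze(text):
    """Whitespace-tokenize, lower-case, and trim punctuation from token ends."""
    tokens = (t.lower().strip(string.punctuation) for t in text.split())
    return [t for t in tokens if t]

docs = {
    "Doc2": "All Daleks: Exterminate! Exterminate! EXTERMINATE!! EXTERMINATE!!!",
    "Doc3": "And the final score is: no TARDIS, no screwdriver, "
            "two minutes to spare. Who da man?!",
}

# Index time: every document goes through analyze().
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in analyze(text):
        index[term].add(doc_id)

def search(user_text, analyzer):
    """Return documents containing every term the analyzer produces."""
    terms = analyzer(user_text)
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))

print(search("TARDIS!", analyze))    # {'Doc3'}
print(search("TARDIS!", str.split))  # set()
```

With the mismatched analysis the query term stays "TARDIS!", which was never stored in the index, so nothing is found even though the document is there.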

I bet you want to start working with this

Implementations
Lucene (Java main, .NET, Python, C)
SOLR if using from other languages
Xapian
Sphinx
OpenFTS
MySQL Full-Text Search (kind of…)

The theory
Introduction to Information Retrieval: http://nlp.stanford.edu/IR-book/information-retrieval-book.html
Warning: contains a lot of math.

The practice (for Lucene at least)
Lucene in Action, second edition: http://www.manning.com/hatcher3/
Warning: contains a lot of Java.

Contact me (with interesting problems involving lots of data)
@deathy
cristian.vat@gmail.com
http://blog.deathy.info/ (yeah…I know…)

So where’s the Halloween Party? Happy Halloween!
