Finite State Automata
   in
       Dawid WEISS

Dawid Weiss
20+ years of coding
10 years assembly only

Academia & Research
PhD in Information Retrieval, PUT

Open source
Carrot2, HPPC, Lucene, …

Industry & Business
Carrot Search s.c.

Talk outline

State machines (automata)
FSAs, DFAs, FSTs and other XXXs.

Use cases in Lucene and Solr
Suggester. FuzzySearch. Index.

No API details
Still @experimental.
(Non)? Deterministic Finite
State (Automata|Machines)
HashSet
hash         → slot   → value
0x29384d34            → lucene
0xde3e3354            → lucid
0x00000666            → lucifer

FSA (deterministic)
          l      u      c       e       n       e
                                i       d
                                        f       e       r

exists(sequence)
floor(prefix)
ceil(prefix)
k   i   l   l
                 deterministic, non-minimal
b   i   l   l

k
    i   l   l    deterministic, minimal
b

k       i   l
                 l   non-deterministic, non-minimal
b       i   l
(Sorted)Map

lucene    → 1
lucid     → 2
lucifer   → 666

FST (transducer)
          l      u      c       e       n       e|1
                                i       d|2
                                        f       e       r|666

FST (transducer)
          l|1    u      c       e       n       e
                                i|1     d
                                        f|664   e       r
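
The second transducer above can be read as "sum the outputs along the accepted path". A minimal sketch of that lookup, assuming a plain dictionary of arcs; the state numbers and the `ARCS`, `FINAL` and `lookup` names are made up for this illustration and are not Lucene's FST API:

```python
# Arcs of the second transducer: state -> {label: (next_state, output)}.
# State numbering is an assumption made for this sketch.
ARCS = {
    0: {"l": (1, 1)},
    1: {"u": (2, 0)},
    2: {"c": (3, 0)},
    3: {"e": (4, 0), "i": (7, 1)},
    4: {"n": (5, 0)},
    5: {"e": (6, 0)},
    7: {"d": (6, 0), "f": (8, 664)},
    8: {"e": (9, 0)},
    9: {"r": (6, 0)},
}
FINAL = {6}


def lookup(term):
    """Return the value mapped to term, or None if the term is not accepted."""
    state, total = 0, 0
    for ch in term:
        arc = ARCS.get(state, {}).get(ch)
        if arc is None:
            return None          # no arc for this character: not in the map
        state, output = arc
        total += output          # outputs accumulate along the path
    return total if state in FINAL else None


print(lookup("lucene"), lookup("lucid"), lookup("lucifer"))
```

Moving outputs toward the root (l|1, i|1, f|664) lets the three terms share one final state, which is what makes the suffixes mergeable.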
NFSAs and
Regular expressions

Determinization
states explosion, not always possible

Backtracking
recursion explosion

[Diagrams: automaton fragments for the expressions a, e1e2, e+, e* and e?.]
a?ⁿaⁿ
n=3 → a?a?a?aaa

Source: Russ Cox, Regular Expression Matching Can Be Simple And Fast (re2).

[Plot: matching time in ms (y-axis, 0 to 35000) against n (x-axis, 0 to 30).]
Time of matching aⁿ against the pattern a?ⁿaⁿ, depending on n. Java 1.6, modern hardware.
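
The blow-up can be reproduced without a real regex engine. The sketch below is an illustration under simplifying assumptions: the pattern is modelled as a list of (character, optional) items rather than parsed from a regex, and the helper names `make_case`, `backtrack` and `simulate` are hypothetical. It counts work for a naive backtracking matcher and for a simulation that tracks the set of reachable pattern positions.

```python
def make_case(n):
    """Pattern a?^n a^n as (char, optional) items, plus the text a^n."""
    return [("a", True)] * n + [("a", False)] * n, "a" * n


def backtrack(pattern, text):
    """Naive backtracking full match; returns (matched, number of calls)."""
    calls = 0

    def go(i, j):
        nonlocal calls
        calls += 1
        if i == len(pattern):
            return j == len(text)
        ch, optional = pattern[i]
        # Greedy: try to consume a character first, then try skipping.
        if j < len(text) and text[j] == ch and go(i + 1, j + 1):
            return True
        return optional and go(i + 1, j)

    return go(0, 0), calls


def simulate(pattern, text):
    """Track all reachable pattern positions at once; returns (matched, steps)."""

    def closure(positions):
        # A position in front of an optional item may also skip that item.
        out = set()
        for p in positions:
            while True:
                out.add(p)
                if p < len(pattern) and pattern[p][1]:
                    p += 1
                else:
                    break
        return out

    steps = 0
    current = closure({0})
    for c in text:
        following = set()
        for p in current:
            steps += 1
            if p < len(pattern) and pattern[p][0] == c:
                following.add(p + 1)
        current = closure(following)
    return len(pattern) in current, steps


for n in (4, 8, 12):
    pattern, text = make_case(n)
    print(n, backtrack(pattern, text)[1], simulate(pattern, text)[1])
```

Both matchers accept the input; the call count of the backtracking version grows exponentially with n, while the state-set version does n(n+1) steps on this family of inputs.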
Linear-time, minimal, deterministic
FSA construction

Linear algorithm from sorted input
by Daciuk, Mihov, et al.

Active path
states that still can change

States dictionary
nodes that will never change
1) common AP prefix
2) freeze the rest of AP
3) add suffix → new AP

lucene

        l    u     c       e   n   e

lucid

        l    u     c       e   n   e
                           i   d

lucifer

        l    u     c       e   n   e
                           i   d
                               f   e   r
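
The three steps can be sketched in a few lines. This is an illustrative reading of the sorted-input construction, not the code used in Lucene: the `State` class, the `register` dictionary (standing in for the states dictionary) and the helper names are assumptions made for the example.

```python
class State:
    def __init__(self):
        self.final = False
        self.arcs = {}                      # label -> State

    def signature(self):
        # Frozen states are equivalent when finality and outgoing arcs match.
        return (self.final,
                tuple(sorted((c, id(t)) for c, t in self.arcs.items())))


def build(sorted_words):
    """Build a minimal deterministic automaton from sorted, unique words."""
    root = State()
    register = {}                           # signature -> canonical frozen state

    def freeze(state):
        """Freeze the most recently added chain below state, bottom-up."""
        if not state.arcs:
            return
        label = max(state.arcs)             # sorted input: last added is largest
        child = state.arcs[label]
        freeze(child)
        state.arcs[label] = register.setdefault(child.signature(), child)

    previous = None
    for word in sorted_words:
        if previous is not None and word <= previous:
            raise ValueError("input must be sorted and unique")
        # 1) follow the common prefix along the active path
        state, i = root, 0
        while i < len(word) and word[i] in state.arcs:
            state = state.arcs[word[i]]
            i += 1
        # 2) freeze the rest of the active path
        freeze(state)
        # 3) add the suffix; these states form the new active path
        for ch in word[i:]:
            state.arcs[ch] = State()
            state = state.arcs[ch]
        state.final = True
        previous = word
    freeze(root)
    return root


def accepts(root, word):
    state = root
    for ch in word:
        state = state.arcs.get(ch)
        if state is None:
            return False
    return state.final


def count_states(root):
    seen, stack = set(), [root]
    while stack:
        s = stack.pop()
        if id(s) not in seen:
            seen.add(id(s))
            stack.extend(s.arcs.values())
    return len(seen)


fsa = build(["lucene", "lucid", "lucifer"])
print(count_states(fsa), accepts(fsa, "lucid"), accepts(fsa, "luci"))
```

Because a frozen state is looked up by its signature before it is kept, equivalent suffixes are shared as soon as they can no longer change, so the full trie is never materialized.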
FS(A|T)s in (Lucene|Solr)
Automata in
Lucene|Solr

org.apache.lucene.util.automaton.*
partial port of brics, FuzzyQuery, AutomatonTermsEnum

org.apache.lucene.util.automaton.fst.FST
FSA and FSTs from sorted data, suggester, indexes
org.apache.lucene.util.automaton.fst.*
FSA representation

Arc-based, not state-based
Moore vs. Mealy. Compact vs. intuitive

Next-state chaining
requires unusual tricks during construction

Everything in a byte[]
traversals-ready, memory-efficient

Dual transition storage format
lookup: bsearch or linear scan

Input: abc, bd, bde.

[Diagram: state-based and arc-based automata for the input. Arc-based states: s3 -a→ s2 -b→ s1 -c→ and s3 -b→ s4 -d→ s5 -e→.]

Arc layout:
    s1    s2    s5    s4    s3
    cFL   bL    eFL   dL    a bL

Arc layout with next-state chaining:
    s1    s2    s5    s4    s3
    cFL   bL    eFL   dL    bLN a
Input size and compressed size (MB)

Input               Size (MB)   Terms        Lucene   morf.   gzip
Wikipedia t.index   481         38 092 045   258      164     149
Polish infl.        162         3 672 200    3.1      1.7     15.4

Use Cases:
Solr's Autocomplete
Solr's
Suggesters

Design choices
sort order (alpha, score), prefix vs. spelling, boost exact matches?

Weights
term→weight, lookup(term, onlyMorePopular)

org.apache.solr.spelling.suggest.Lookup
JaspellLookup, TSTLookup, FSTLookup
Take 1

flour|3
four|4
fourier|3
furious|2

→ fou*

[Diagram: automaton over the four terms, each ending in "|" followed by its weight.]

Find prefix.
Depth-in traversal for completions.
PQ on score|alpha
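
A rough sketch of Take 1, using a nested-dictionary trie in place of the automaton. The `build_trie` and `suggest_take1` names are hypothetical, and "|" is assumed never to occur inside a term:

```python
import heapq

TERMS = {"flour": 3, "four": 4, "fourier": 3, "furious": 2}


def build_trie(terms):
    root = {}
    for term, weight in terms.items():
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node["|"] = weight              # "|" marks the end of a term
    return root


def suggest_take1(root, prefix, n):
    node = root
    for ch in prefix:                   # 1) find the prefix
        node = node.get(ch)
        if node is None:
            return []
    found = []
    stack = [(prefix, node)]
    while stack:                        # 2) traverse every completion
        text, cur = stack.pop()
        for label, child in cur.items():
            if label == "|":
                found.append((child, text))
            else:
                stack.append((text + label, child))
    # 3) priority queue: highest weight first, ties broken alphabetically
    return heapq.nsmallest(n, found, key=lambda wt: (-wt[0], wt[1]))


TRIE = build_trie(TERMS)
print(suggest_take1(TRIE, "fou", 2))
```

The drawback is visible in step 2: every completion under the prefix is visited before the best n are known, so short prefixes are the expensive case.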
Take 2

2furious
3flour
3fourier
4four

→ fou*

[Diagram: one root arc per weight (2, 3, 4), each followed by the terms of that weight.]

From score roots, until N collected.
Find prefix.
Depth-in traversal for completions, stop if N collected.
Find/boost exact match.
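
Take 2 can be sketched the same way, again with a nested-dictionary trie standing in for the automaton. The weight becomes the first arc, so the traversal can start at the highest score and stop early. The names `build_weighted` and `suggest_take2` are hypothetical, and the exact-match boost is left out of this sketch:

```python
TERMS = {"flour": 3, "four": 4, "fourier": 3, "furious": 2}


def build_weighted(terms):
    """Trie whose first level is the weight, e.g. 4 -> f -> o -> u -> r."""
    root = {}
    for term, weight in terms.items():
        node = root.setdefault(weight, {})
        for ch in term:
            node = node.setdefault(ch, {})
        node[""] = True                 # end-of-term marker
    return root


def suggest_take2(root, prefix, n):
    out = []
    if n <= 0:
        return out
    for weight in sorted(root, reverse=True):   # from the highest score root
        node = root[weight]
        for ch in prefix:                        # find the prefix in this bucket
            node = node.get(ch)
            if node is None:
                break
        if node is None:
            continue
        stack = [(prefix, node)]
        while stack:
            text, cur = stack.pop()
            if "" in cur:
                out.append((weight, text))
                if len(out) == n:                # stop as soon as N are collected
                    return out
            # Push children in reverse so they are popped alphabetically.
            for label in sorted((k for k in cur if k), reverse=True):
                stack.append((text + label, cur[label]))
    return out


WEIGHTED = build_weighted(TERMS)
print(suggest_take2(WEIGHTED, "fou", 1))
```

Compared with Take 1, no priority queue is needed: results come out already ordered by weight, and the prefix is looked up once per weight bucket instead of once overall.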
Take 3 (infixes)

2furious
5urious|furious
5rious|furious
5ious|furious
5ous|furious
5us|furious
5s|furious
3flour
…

[Diagram: automaton over the weight-prefixed terms and their infix entries.]
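
The entries listed for Take 3 can be generated mechanically. A small sketch, assuming the bucket used for infix entries (5 in the listing) is simply passed in as a parameter, since the slides do not say how it is chosen; `infix_entries` is a hypothetical helper name:

```python
def infix_entries(term, weight, infix_bucket):
    """Weight-prefixed term plus one entry per proper suffix of the term.

    Each suffix entry carries the full term after "|" so that a match
    inside the term can be mapped back to the suggestion itself.
    """
    entries = [f"{weight}{term}"]
    for i in range(1, len(term)):
        entries.append(f"{infix_bucket}{term[i:]}|{term}")
    return entries


for entry in infix_entries("furious", 2, 5):
    print(entry)
```

The entries would still need to be sorted together with those of the other terms before being fed to a sorted-input automaton builder.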
Constant time lookups!
  Regardless of the terms dictionary size.
  Regardless of prefix length.

Exact matches only.
Static snapshot (not incremental).
Discretized weights.
Top50KWiki.utf8, 676 KB, 50 000 terms

                    Jaspell      TST          FST
RAM [B]             7 869 415    7 914 524    300 175

Queries per second:
PREFIX [100-200]    458          966          742
PREFIX [6-9]        330          228          659
PREFIX [2-4]        126          29           501
Summary
Summary and Conclusions
Automata
compact, powerful, efficient data structure

Lucene/Solr benefits
behind the scenes, but spreading: index, queries, suggesters

API in Lucene
…is being shaped right now, still @experimental
Acknowledgement

Michael McCandless

Robert Muir

committer: .+
dawid.weiss@carrotsearch.com

Minneapolis Solr Meetup - May 28, 2014: Target.com Search
 
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
 
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
 
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
 
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCBig Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
 
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCIntro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
 
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DCTest Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
 
Building a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLKBuilding a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLK
 
Introducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinarIntroducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinar
 
Solr4 nosql search_server_2013
Solr4 nosql search_server_2013Solr4 nosql search_server_2013
Solr4 nosql search_server_2013
 
Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks
Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks
Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks
 

Dawid Weiss- Finite state automata in lucene

  • 1. Finite State Automata in Lucene. Dawid Weiss
  • 2. Dawid Weiss. 20+ years of coding, 10 years assembly only. Academia & Research: PhD in Information Retrieval, PUT. Open source: Carrot2, HPPC, Lucene, … Industry & Business: Carrot Search s.c.
  • 3. Talk outline. State machines (automata): FSAs, DFAs, FSTs and other XXXs. Use cases in Lucene and Solr: Suggester, FuzzySearch, Index. No API details: still @experimental.
  • 4. (Non)? Deterministic Finite State (Automata|Machines)
  • 5. HashSet hash → slot → value 0x29384d34 → lucene 0xde3e3354 → lucid 0x00000666 → lucifer
  • 6. HashSet hash → slot → value 0x29384d34 → lucene 0xde3e3354 → lucid 0x00000666 → lucifer FSA (deterministic) l u c e n e i d r f e
  • 7. HashSet hash → slot → value 0x29384d34 → lucene 0xde3e3354 → lucid 0x00000666 → lucifer FSA (deterministic) l u c e n e i d r f e Operations: exists(sequence), floor(prefix), ceil(prefix)
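The operations named on slide 7 can be sketched over a plain trie, which is a deterministic but non-minimal FSA. This is an illustrative sketch only: the nested-dict layout and the function names `build_trie`, `exists`, `smallest` and `ceil` are assumptions made here, not Lucene's API, and `ceil` is read as "smallest accepted sequence greater than or equal to the key". `floor` would be the mirror image.

```python
def build_trie(words):
    """Build a trie (deterministic, non-minimal FSA) as nested dicts."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = True  # the empty-string key marks an accepting state
    return root


def exists(root, seq):
    """True if the automaton accepts seq exactly."""
    node = root
    for ch in seq:
        node = node.get(ch)
        if node is None:
            return False
    return "" in node


def smallest(node, prefix):
    """Smallest accepted sequence below node: always follow the lowest label."""
    while "" not in node:
        ch = min(node)
        prefix += ch
        node = node[ch]
    return prefix


def ceil(root, key):
    """Smallest accepted sequence >= key, or None if there is none."""
    if not root:
        return None
    node, path = root, ""
    fallback = None  # deepest branch seen so far that is strictly greater than key
    for ch in key:
        bigger = [c for c in node if c != "" and c > ch]
        if bigger:
            c = min(bigger)
            fallback = (node[c], path + c)
        if ch not in node:
            break
        node = node[ch]
        path += ch
    else:
        # key fully consumed: everything below this node is >= key
        return smallest(node, path)
    return smallest(*fallback) if fallback else None


trie = build_trie(["lucene", "lucid", "lucifer"])
print(exists(trie, "lucid"), exists(trie, "luc"))
print(ceil(trie, "luci"), ceil(trie, "lucie"), ceil(trie, "m"))
```

A hash set answers only the first question; the ordered queries fall out of the automaton because arcs can be visited in label order.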
  • 8. k i l l b l deterministic, non-minimal i l
  • 9. k i l l b l deterministic, non-minimal i l k i l l deterministic, minimal b
  • 10. k i l l b l deterministic, non-minimal i l k i l l deterministic, minimal b k i l l non-deterministic, i l non-minimal b
  • 11. (Sorted)Map lucene → 1 lucid → 2 lucifer → 666
  • 12. (Sorted)Map lucene → 1 lucid → 2 lucifer → 666 FST (transducer) l u c e n e|1 i d|2 r|666 f e
  • 13. (Sorted)Map lucene → 1 lucid → 2 lucifer → 666 FST (transducer) l|1 u c e n e i|1 d r f|664 e
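The step from slide 12 to slide 13 moves outputs from the final arcs toward the root, so that the value of a key is the sum of the outputs along its path. A minimal sketch of that idea follows, assuming integer values and a trie-shaped (non-minimized) transducer; the helper names `build`, `subtree_min`, `to_transducer`, `lookup` and `nonzero_arcs` are made up for this illustration and are not Lucene's FST API.

```python
def build(mapping):
    """Trie with the mapped value stored at the node that ends each key."""
    root = {"arcs": {}, "value": None}
    for word, value in mapping.items():
        node = root
        for ch in word:
            node = node["arcs"].setdefault(ch, {"arcs": {}, "value": None})
        node["value"] = value
    return root


def subtree_min(node):
    """Annotate every node with the smallest value reachable through it."""
    values = [subtree_min(child) for child in node["arcs"].values()]
    if node["value"] is not None:
        values.append(node["value"])
    node["min"] = min(values)
    return node["min"]


def to_transducer(node, consumed=0):
    """Push outputs toward the root; consumed = outputs already emitted on the path."""
    fst = {"arcs": {}, "final_output": None}
    if node["value"] is not None:
        fst["final_output"] = node["value"] - consumed
    for ch, child in node["arcs"].items():
        out = child["min"] - consumed
        fst["arcs"][ch] = (out, to_transducer(child, consumed + out))
    return fst


def lookup(fst, word):
    """Sum arc outputs along the path of word; None if word is not a key."""
    total, node = 0, fst
    for ch in word:
        if ch not in node["arcs"]:
            return None
        out, node = node["arcs"][ch]
        total += out
    if node["final_output"] is None:
        return None
    return total + node["final_output"]


def nonzero_arcs(fst, prefix=""):
    """Map 'path ending in this arc' -> output, for arcs with a non-zero output."""
    found = {}
    for ch, (out, child) in fst["arcs"].items():
        if out:
            found[prefix + ch] = out
        found.update(nonzero_arcs(child, prefix + ch))
    return found


trie = build({"lucene": 1, "lucid": 2, "lucifer": 666})
subtree_min(trie)
fst = to_transducer(trie)
print(nonzero_arcs(fst))
print(lookup(fst, "lucifer"))
```

With the three keys from the slides, the non-zero outputs land on the arcs l, i and f (1, 1 and 664), which agrees with the labels shown on slide 13.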
  • 14. NFSAs and regular expressions. [Diagrams: NFA fragments for a, e1e2, e+, e*, e?.] Determinization: states explosion, not always possible. Backtracking: recursion explosion.
  • 15. a?^n a^n
  • 17. a?^n a^n; n=3 → a?a?a?aaa. Source: Russ Cox, Regular Expression Matching Can Be Simple And Fast (re2).
  • 18. [Plot: time in ms (0 to 35000) against n (0 to 30).] Time of matching a^n for pattern a?^n a^n, depending on n. Java 1.6, modern hardware.
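The alternative to backtracking is to simulate the NFA directly by tracking the set of states it can be in, in the spirit of the Russ Cox article cited on slide 17. The sketch below is specialized to the pattern a?^n a^n and is an assumption-laden illustration: the function name `matches_nfa` and the numbering of pattern positions are invented here. Python's own `re` module is a backtracking engine, so it serves as the slow reference for small n.

```python
import re


def matches_nfa(n, text):
    """Full match of text against a?^n a^n by simulating the NFA state set.

    Pattern positions 0..n-1 are the optional 'a's, n..2n-1 the mandatory
    ones, and 2n is the accepting position.
    """
    def closure(states):
        # An optional 'a' can be skipped without reading input.
        result = set(states)
        for p in states:
            while p < n:
                p += 1
                result.add(p)
        return result

    current = closure({0})
    for ch in text:
        # Every non-accepting position consumes exactly one 'a'.
        step = {p + 1 for p in current if p < 2 * n} if ch == "a" else set()
        current = closure(step)
        if not current:
            return False
    return 2 * n in current


n = 25
print(matches_nfa(n, "a" * n))  # the state set never holds more than 2n + 1 positions

small = 6
pattern = re.compile("a?" * small + "a" * small)
print(bool(pattern.fullmatch("a" * small)) == matches_nfa(small, "a" * small))
```

The work per input character is bounded by the number of pattern positions, so the running time grows polynomially in n rather than exponentially.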
  • 19. Linear-time, minimal, deterministic FSA construction. Linear algorithm from sorted input by Daciuk, Mihov, et al. Active path: states that still can change. States dictionary: nodes that will never change.
  • 20. 1) common AP prefix 2) freeze the rest of AP 3) add suffix → new AP lucene
  • 21. 1) common AP prefix 2) freeze the rest of AP 3) add suffix → new AP l u c e n e lucid
  • 22. 1) common AP prefix 2) freeze the rest of AP 3) add suffix → new AP l u c e n e i d
  • 23. 1) common AP prefix 2) freeze the rest of AP 3) add suffix → new AP l u c e n e i d lucifer
  • 24. 1) common AP prefix 2) freeze the rest of AP 3) add suffix → new AP l u c e n e i d f e r
  • 25. 1) common AP prefix 2) freeze the rest of AP 3) add suffix → new AP l u c e n e i d r f e
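The three steps repeated on slides 20 to 25 can be sketched as follows. This is a compact reading of incremental construction from sorted input, not the code used in Lucene: the `State` class, the `register` dictionary (standing in for the "states dictionary") and the use of `id()` to identify frozen targets are choices made for this illustration.

```python
class State:
    def __init__(self):
        self.arcs = {}      # label -> State
        self.final = False

    def signature(self):
        # Targets are already frozen when this is called, so identity is enough.
        return (self.final, tuple(sorted((c, id(t)) for c, t in self.arcs.items())))


def freeze(active, previous, keep, register):
    """Step 2: replace active-path states deeper than `keep` by registered equivalents."""
    for depth in range(len(previous), keep, -1):
        state = active[depth]
        canonical = register.setdefault(state.signature(), state)
        active[depth - 1].arcs[previous[depth - 1]] = canonical


def build_minimal_fsa(sorted_words):
    root = State()
    register = {}       # signature -> state that will never change again
    active = [root]     # active[i] is the state reached after previous[:i]
    previous = ""
    for word in sorted_words:
        if word < previous:
            raise ValueError("input must be sorted")
        # 1) length of the common prefix with the active path
        common = 0
        while common < min(len(word), len(previous)) and word[common] == previous[common]:
            common += 1
        # 2) freeze the rest of the active path
        freeze(active, previous, common, register)
        # 3) add the suffix; it becomes the new tail of the active path
        del active[common + 1:]
        for ch in word[common:]:
            nxt = State()
            active[-1].arcs[ch] = nxt
            active.append(nxt)
        active[-1].final = True
        previous = word
    freeze(active, previous, 0, register)
    return root


def accepts(root, word):
    state = root
    for ch in word:
        state = state.arcs.get(ch)
        if state is None:
            return False
    return state.final


def count_states(root):
    seen, stack = set(), [root]
    while stack:
        state = stack.pop()
        if id(state) not in seen:
            seen.add(id(state))
            stack.extend(state.arcs.values())
    return len(seen)


fsa = build_minimal_fsa(["bill", "kill"])
print(count_states(fsa), accepts(fsa, "kill"), accepts(fsa, "kil"))
```

For the two words used on slides 8 to 10 the result has five states, compared with nine in the corresponding trie, because the shared suffix "ill" is stored once.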
  • 27. Automata in Lucene|Solr. org.apache.lucene.util.automaton.*: partial port of brics, FuzzyQuery, AutomatonTermsEnum. org.apache.lucene.util.automaton.fst.FST: FSA and FSTs from sorted data, suggester, indexes.
  • 28. org.apache.lucene.util.automaton.fst.* FSA representation Arc-based, not state-based Moore vs. Mealy. Compact vs. intuitive Input: abc, bd, bde. a b c a b c b d b e d e d
  • 29. org.apache.lucene.util.automaton.fst.* FSA representation a b c Arc-based, not state-based s3 s2 s1 Moore vs. Mealy. Compact vs. intuitive b d e s4 s5 Next-state chaining requires unusual tricks during construction s1 s2 s5 s4 s3 cFL bL eFL dL a bL s1 s2 s5 s4 s3 cFL bL eFL dL bLN a
  • 30. org.apache.lucene.util.automaton.fst.* FSA representation a b c Arc-based, not state-based s3 s2 s1 Moore vs. Mealy. Compact vs. intuitive b d e s4 s5 Next-state chaining requires unusual tricks during construction s1 s2 s5 s4 s3 Everything in a byte[] cFL bL eFL dL a bL traversals-ready, memory-efficient s1 s2 s5 s4 s3 cFL bL eFL dL bLN a
  • 31. org.apache.lucene.util.automaton.fst.* FSA representation a b c Arc-based, not state-based s3 s2 s1 Moore vs. Mealy. Compact vs. intuitive b d e s4 s5 Next-state chaining requires unusual tricks during construction s1 s2 s5 s4 s3 Everything in a byte[] cFL bL eFL dL a bL traversals-ready, memory-efficient Dual transition storage format lookup: bsearch or linear scan s1 s2 s5 s4 s3 cFL bL eFL dL bLN a
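To make the arc-based, everything-in-a-byte[] idea concrete, here is a toy encoding for the inputs abc, bd, bde from slide 28. It is only loosely inspired by the F, L and N markers on the slides; the flag values, the extra `STOP` flag, the one-byte addresses and the exact layout are all assumptions of this sketch and do not reproduce Lucene's actual format.

```python
FINAL = 1   # a word may end after taking this arc
LAST = 2    # this is the last arc of its state
NEXT = 4    # the target state starts right after this arc (no address stored)
STOP = 8    # the target has no outgoing arcs (no address stored)

# Each arc is: label byte, flags byte, and a one-byte target address
# unless NEXT or STOP is set. A state is simply a run of arcs ending with LAST.
FSA = bytes([
    ord("a"), 0, 9,                     # 0: root, arc a -> state at 9
    ord("b"), LAST | NEXT,              # 3: root, arc b -> state at 5
    ord("d"), FINAL | LAST | NEXT,      # 5: arc d ("bd" accepted) -> state at 7
    ord("e"), FINAL | LAST | STOP,      # 7: arc e ("bde" accepted)
    ord("b"), LAST | NEXT,              # 9: arc b -> state at 11
    ord("c"), FINAL | LAST | STOP,      # 11: arc c ("abc" accepted)
])


def accepts(fsa, word):
    """Walk the arc array with a linear scan over each state's arcs."""
    state = 0           # address of the first arc of the current state
    final = False       # the empty word is not accepted in this sketch
    for ch in word.encode("ascii"):
        if state is None:
            return False
        pos = state
        while True:
            label, flags = fsa[pos], fsa[pos + 1]
            size = 2 if flags & (NEXT | STOP) else 3
            if label == ch:
                break
            if flags & LAST:
                return False
            pos += size
        final = bool(flags & FINAL)
        if flags & STOP:
            state = None
        elif flags & NEXT:
            state = pos + size
        else:
            state = fsa[pos + 2]
    return final


print([w for w in ["abc", "bd", "bde", "ab", "b", "abcd"] if accepts(FSA, w)])
```

Acceptance is a property of arcs here rather than of states, and chaining a target directly after its arc saves the address byte, which is where the compactness comes from.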
  • 32. Input size and compressed size (MB). Wikipedia t.index: input 481 MB, 38 092 045 terms; compressed: Lucene 258, morf. 164, gzip 149. Polish infl.: input 162 MB, 3 672 200 terms; compressed: Lucene 3.1, morf. 1.7, gzip 15.4.
  • 34. Solr's Suggesters. Design choices: sort order (alpha, score), prefix vs. spelling, boost exact matches? Weights: term→weight, lookup(term, onlyMorePopular). org.apache.solr.spelling.suggest.Lookup: JaspellLookup, TSTLookup, FSTLookup.
  • 35. Take 1. flour|3 four|4 fourier|3 furious|2
  • 36. Take 1. flour|3 four|4 fourier|3 furious|2 →fou* o u l i e r | f o u r | 3 u 4 r 2 i o u s | Find prefix. Depth-in traversal for completions. PQ on score|alpha.
  • 37. Take 2. 2furious 3flour 3fourier 4four
  • 38. Take 2. 2furious 3flour 3fourier 4four →fou* u f r i o u 2 s l 3 f u r i e r o 4 u f o From score roots, until N collected. Find prefix. Depth-in traversal for completions, stop if N collected. Find/boost exact match.
  • 39. Take 3 (infixes). 2furious 5urious|furious 5rious|furious 5ious|furious 5ous|furious 5us|furious 5s|furious 3flour …
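Take 1 can be sketched in a few lines: walk the automaton along the prefix, traverse everything below that point, and keep the best N by score, breaking ties alphabetically. The trie here is a stand-in for the FST, and the names `build`, `completions` and `suggest` are hypothetical; this is not the `FSTLookup` implementation.

```python
import heapq


def build(terms):
    """Trie of nested dicts; the None key holds the weight of a complete term."""
    root = {}
    for term, weight in terms.items():
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node[None] = weight
    return root


def completions(node, prefix):
    """Traverse everything below node, yielding (term, weight) pairs."""
    for label, child in node.items():
        if label is None:
            yield prefix, child
        else:
            yield from completions(child, prefix + label)


def suggest(root, prefix, n):
    """Top-n completions of prefix, ordered by score (descending), then alphabetically."""
    node = root
    for ch in prefix:
        node = node.get(ch)
        if node is None:
            return []
    return heapq.nsmallest(n, completions(node, prefix), key=lambda tw: (-tw[1], tw[0]))


index = build({"flour": 3, "four": 4, "fourier": 3, "furious": 2})
print(suggest(index, "fou", 2))
```

The cost of this variant depends on how many terms share the prefix, since the whole subtree is visited before the priority queue can pick the winners; that is the motivation for the later takes.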
  • 40. [Figure: automaton diagram; arc labels not recoverable from the extraction.]
  • 41. Constant time lookups! Regardless of the terms dictionary size. Regardless of prefix length.
  • 42. Constant time lookups! Regardless of the terms dictionary size. Regardless of prefix length. Exact matches only. Static snapshot (not incremental). Discretized weights.
  • 43. Top50KWiki.utf8, 676 KB, 50 000 terms. RAM [B]: Jaspell 7 869 415, TST 7 914 524, FST 300 175. Queries per second, tpq: PREFIX [100-200]: Jaspell 458, TST 966, FST 742. PREFIX [6-9]: Jaspell 330, TST 228, FST 659. PREFIX [2-4]: Jaspell 126, TST 29, FST 501.
  • 45. Summary and Conclusions. Automata: compact, powerful, efficient data structure. Lucene/Solr: benefits behind the scenes, but spreading: index, queries, suggesters. API in Lucene: …is shaped right now, still @experimental.