Tweaking the Base Score: Lucene/Solr Similarities Explained

This talk was given at the Activate Conference 2019. Lucene has many options for configuring similarity, and Solr inherits them. Similarity is the base of your relevancy score: how similar is this document to the query? The default similarity (BM25) is a good start, but you may need to tweak it for your use case. In this session, you will learn how BM25 works and how you may want to change its parameters. Then we'll move to other similarity classes: DFR, DFI, IB and LM. You will learn the thinking behind them, how that thinking translates to the similarity score, and which parameters let you tweak how the score evolves based on things like term frequency or document length. By the end, you'll have a good understanding of which similarity options are likely to work well for your use case. You'll know which tunables are available and whether you need to implement a custom similarity class. As an example, we'll focus on e-commerce, where you often end up ignoring term frequency altogether.

Key takeaways:
1) What the built-in Lucene/Solr similarities are and what they do
2) Which similarity to use for which use case
3) How to use a custom similarity class in Solr

Learn more about search relevance and similarity: sematext.com/blog/search-relevance-solr-elasticsearch-similarity

Tweaking the Base Score: Lucene/Solr Similarities Explained
Demo: github.com/sematext/activate/tree/master/2019
More info: sematext.com/blog/search-relevance-solr-elasticsearch-similarity
Radu Gheorghe, Rafał Kuć
www.sematext.com

Agenda
BM25 - Best Match: the default
DFR - Divergence From Randomness framework
DFI - Divergence From Independence
IB - Information-Based models
LM - Language Models
Custom similarity
Putting it all together

BM25 - the TF part
freq / (freq + k1 * (1 - b + b * dl / avgdl))
Best for Most 😁

BM25 tunables
freq / (freq + k1 * (1 - b + b * dl / avgdl))
k1 - raise or lower ceiling

BM25 tunables
freq / (freq + k1 * (1 - b + b * dl / avgdl))
doc length normalization
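
To make the two tunables concrete, here is a minimal sketch of the TF part exactly as written on the slides. The class name, the sample frequencies and the document lengths are made up for illustration; k1 = 1.2 and b = 0.75 are Lucene's BM25 defaults. In this form the value approaches 1 as freq grows, k1 sets how quickly it gets there, and b sets how much the document length matters (b = 0 ignores it).

```java
// Illustrative sketch, not Lucene code: the TF part of BM25 from the slides.
final class Bm25TfSketch {

    // freq / (freq + k1 * (1 - b + b * dl / avgdl))
    static double tf(double freq, double k1, double b, double dl, double avgdl) {
        return freq / (freq + k1 * (1 - b + b * dl / avgdl));
    }

    public static void main(String[] args) {
        double k1 = 1.2;     // Lucene default
        double b = 0.75;     // Lucene default
        double avgdl = 100;  // hypothetical average field length
        for (int freq : new int[] {1, 2, 5, 10, 100}) {
            // Same term frequency in a short, an average and a long document.
            System.out.printf("freq=%d short=%.3f avg=%.3f long=%.3f%n",
                    freq,
                    tf(freq, k1, b, 50, avgdl),
                    tf(freq, k1, b, 100, avgdl),
                    tf(freq, k1, b, 200, avgdl));
        }
    }
}
```

Raising k1 in this sketch makes the curve saturate more slowly, so repeated occurrences keep adding to the score for longer.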

BM25 demo
yes, that’s how we look
when we give demos

BM25
Good default. You can
tune the weight of freq
and docLength.

Divergence From Randomness
Basic Model: G, I(n), I(ne), I(F)
After Effect: L, B
Normalization: H1, H2, H3, Z, none

Divergence From Randomness - H1
tf * c * avgFieldLength / docFieldLength

Divergence From Randomness - H1
No normalization, and H1 with c == 1, 3, 5, 7

Divergence From Randomness - H2
tf * log2(1 + c * (avgFieldLength / docFieldLength))

Divergence From Randomness - H2
No normalization, and H2 with c == 1, 3, 5, 7

Divergence From Randomness - Z
tf * (avgFieldLength / docFieldLength)^z

Divergence From Randomness - Z
No normalization, and Z with z == 0.1, 0.2, 0.3, 0.4

Divergence From Randomness - H3
(tf + mu * ((totalTermFreq + 1) / (#fieldTokens + 1))) / (docFieldLength + mu) * mu

Divergence From Randomness - H3
No normalization, and H3 with mu == 1, 3, 5, 7
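
The four normalizations can be compared side by side with a small sketch. Everything here is illustrative: the class and method names and the sample statistics are assumptions, not Lucene's API. H3 is written in the Dirichlet-prior form (tf + mu * p) / (docFieldLength + mu) * mu, which is an assumption about the intended formula.

```java
// Illustrative sketch of the DFR term-frequency normalizations (not Lucene code).
final class DfrNormalizationSketch {

    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    // H1: tf * c * avgFieldLength / docFieldLength
    static double h1(double tf, double c, double avgLen, double docLen) {
        return tf * c * avgLen / docLen;
    }

    // H2: tf * log2(1 + c * avgFieldLength / docFieldLength)
    static double h2(double tf, double c, double avgLen, double docLen) {
        return tf * log2(1 + c * avgLen / docLen);
    }

    // Z: tf * (avgFieldLength / docFieldLength)^z
    static double z(double tf, double z, double avgLen, double docLen) {
        return tf * Math.pow(avgLen / docLen, z);
    }

    // H3 (assumed Dirichlet-prior form): p is the smoothed collection-level
    // probability of the term, mu is the smoothing parameter.
    static double h3(double tf, double mu, double totalTermFreq,
                     double fieldTokens, double docLen) {
        double p = (totalTermFreq + 1) / (fieldTokens + 1);
        return (tf + mu * p) / (docLen + mu) * mu;
    }

    public static void main(String[] args) {
        double tf = 2, avgLen = 100; // hypothetical values
        for (double docLen : new double[] {50, 100, 200}) {
            System.out.printf("docLen=%.0f H1=%.3f H2=%.3f Z=%.3f H3=%.3f%n",
                    docLen,
                    h1(tf, 1, avgLen, docLen),
                    h2(tf, 1, avgLen, docLen),
                    z(tf, 0.3, avgLen, docLen),
                    h3(tf, 5, 9, 99, docLen));
        }
    }
}
```

With c = 1 and a document of average length, H1 and H2 both leave tf unchanged; they differ in how strongly shorter or longer documents are rewarded or penalized.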

DFR
Framework. Tunable:
choose algorithm and
tune parameters for
both IDF* and
docLength.
* generic name for importance
of this term

Divergence From Independence
expected frequency = docLength * totalTermFrequency / numberOfFieldTokens

DFI: Standardized
(actual - expected)/sqrt(expected)
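
As a worked example of the two formulas above, the sketch below computes the expected frequency of a term in a document and the standardized divergence of the actual frequency from it. The class name and the numbers are hypothetical, and Lucene's own implementation may differ in details such as smoothing and log scaling.

```java
// Illustrative sketch of DFI's standardized measure (not Lucene code).
final class DfiSketch {

    // How often we would expect the term in this document if it were spread
    // evenly over the field: docLength * totalTermFrequency / numberOfFieldTokens
    static double expected(double docLength, double totalTermFrequency,
                           double numberOfFieldTokens) {
        return docLength * totalTermFrequency / numberOfFieldTokens;
    }

    // (actual - expected) / sqrt(expected)
    static double standardized(double actual, double expected) {
        return (actual - expected) / Math.sqrt(expected);
    }

    public static void main(String[] args) {
        // Hypothetical index: 10,000 tokens in the field, a 100-token document.
        double rare = expected(100, 50, 10_000);      // rare term: 0.5 expected
        double common = expected(100, 5_000, 10_000); // stopword-like term: 50 expected
        System.out.printf("rare term, 3 occurrences: %.3f%n", standardized(3, rare));
        System.out.printf("common term, 50 occurrences: %.3f%n", standardized(50, common));
    }
}
```

A term that occurs exactly as often as expected contributes nothing, which is how frequent words end up with little weight without being removed at index time.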

DFI demo
Oh, but don’t remove stopwords*!
1) arbitrarily chops field length
2) stopwords aren’t always stopwords ;)

DFI
Simple. Parameterless.
Flexible: works well
with various datasets.

Information Based
how much information do we get from this term?

Information Based
Distribution: Log-Logistic, Smoothed Power-Law
Lambda: DF, TTF
Normalization: H1, H2, H3, Z, none

Information Based - Log-Logistic
log( tfn / lambda + 1 )

Information Based - Log-Logistic
lambda: 0.1 (red), 0.3 (black), 0.8 (blue)
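
The curves for the three lambda values can be reproduced numerically with a short sketch. It uses the form log(tfn / lambda + 1), which is the same as -log(lambda / (tfn + lambda)); treat that form, the class name and the sample tfn values as assumptions made for illustration.

```java
// Illustrative sketch of the log-logistic distribution score (not Lucene code).
final class LogLogisticSketch {

    // Assumed form: log(tfn / lambda + 1). A smaller lambda (a rarer term)
    // yields more information for the same normalized term frequency.
    static double score(double tfn, double lambda) {
        return Math.log(tfn / lambda + 1);
    }

    public static void main(String[] args) {
        for (double tfn : new double[] {0.5, 1, 2, 5, 10}) {
            System.out.printf("tfn=%.1f lambda=0.1: %.3f lambda=0.3: %.3f lambda=0.8: %.3f%n",
                    tfn, score(tfn, 0.1), score(tfn, 0.3), score(tfn, 0.8));
        }
    }
}
```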

Information Based - Retrieval Function
the average of the document information brought
by each query term

Information Based - Retrieval Function - DF
number of matching documents
(docFrequency + 1) / (numberOfDocuments + 1)

Information Based - Retrieval Function - TTF
total number of term occurrences
(totalTermFrequency + 1) / (numberOfDocuments + 1)
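
The two lambda options differ only in the numerator, as this small sketch shows. The class name and the sample statistics are hypothetical. Since a term occurs at least once in every document that contains it, the TTF variant is never smaller than the DF variant.

```java
// Illustrative sketch of the two lambda options for IB models (not Lucene code).
final class IbLambdaSketch {

    // DF: (docFrequency + 1) / (numberOfDocuments + 1)
    static double lambdaDf(long docFrequency, long numberOfDocuments) {
        return (docFrequency + 1.0) / (numberOfDocuments + 1.0);
    }

    // TTF: (totalTermFrequency + 1) / (numberOfDocuments + 1)
    static double lambdaTtf(long totalTermFrequency, long numberOfDocuments) {
        return (totalTermFrequency + 1.0) / (numberOfDocuments + 1.0);
    }

    public static void main(String[] args) {
        // Hypothetical term: appears in 9 of 99 documents, 19 times in total.
        System.out.printf("lambda (DF)  = %.2f%n", lambdaDf(9, 99));
        System.out.printf("lambda (TTF) = %.2f%n", lambdaTtf(19, 99));
    }
}
```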

IB
Framework, like DFR.
Even has the same
normalization options.
But newer and, in the
paper, better.

Language Models
probability of a term being our term = totalTermFreq / totalFieldTokens

Language Models: Jelinek-Mercer
log( ((1 - λ) * tf / docLength) / (λ * probability) )
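
A rough sketch of Jelinek-Mercer smoothing follows, using the collection probability totalTermFreq / totalFieldTokens from the previous slide. The class name and the sample numbers are hypothetical. The `1 +` inside the logarithm is an assumption added here so that a term with tf = 0 scores 0 instead of negative infinity.

```java
// Illustrative sketch of a Jelinek-Mercer language-model score (not Lucene code).
final class JelinekMercerSketch {

    // Probability of a term being our term, over the whole field.
    static double collectionProbability(double totalTermFreq, double totalFieldTokens) {
        return totalTermFreq / totalFieldTokens;
    }

    // Ratio of the document model, weighted by (1 - lambda), to the collection
    // model, weighted by lambda. The "1 +" is an assumption (see the text above).
    static double score(double tf, double docLength, double lambda, double probability) {
        return Math.log(1 + ((1 - lambda) * tf / docLength) / (lambda * probability));
    }

    public static void main(String[] args) {
        double p = collectionProbability(100, 10_000); // hypothetical: 0.01
        for (double lambda : new double[] {0.1, 0.5, 0.7}) {
            System.out.printf("lambda=%.1f score=%.3f%n", lambda, score(2, 100, lambda, p));
        }
    }
}
```

A smaller λ trusts the document's own term frequency more; a larger λ pulls the score toward the collection-wide statistics.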

LM
Two probabilistic
models. Similar
approach to DFI, but
tunable.

Custom Similarity
compute a similarity score using custom code

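Custom Similarity - Activate Similarity Factory
import org.apache.lucene.search.similarities.Similarity;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.schema.SimilarityFactory;

// Solr factory that hands out the custom similarity.
public class ActivateSimilarityFactory extends SimilarityFactory {
    private volatile Similarity similarity;

    @Override
    public void init(SolrParams params) {
        super.init(params);
    }

    // Lazily creates the similarity on first use and reuses it afterwards.
    @Override
    public Similarity getSimilarity() {
        if (similarity == null) {
            similarity = new ActivateSimilarity();
        }
        return similarity;
    }
}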

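Custom Similarity - Similarity
import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.CollectionStatistics;
import org.apache.lucene.search.TermStatistics;
import org.apache.lucene.search.similarities.Similarity;

public class ActivateSimilarity extends Similarity {
    public ActivateSimilarity() {}
    // Store a constant norm, so document length is not encoded at index time.
    @Override public long computeNorm(FieldInvertState state) { return 1; }
    // Collection and term statistics are ignored; every term gets the same scorer.
    @Override
    public Similarity.SimScorer scorer(float boost,
            CollectionStatistics collectionStats, TermStatistics... termStats) {
        return new ActivateSimScorer();
    }
}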

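Custom Similarity - SimScorer
import org.apache.lucene.search.similarities.Similarity;

public class ActivateSimScorer extends Similarity.SimScorer {
    // The score is the raw term frequency; the norm (document length) is ignored.
    @Override
    public float score(float freq, long norm) {
        return freq;
    }
}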

Custom
When you need
something special, like
disregarding term
frequency.

4. TF*IDF You know, for historical reasons

19. DFR demo Only one, I promise

33. IB demo

38. LM demo feat. Jelinek-Mercer

49. Custom Similarity demo

51. Multiple similarities demo

53. THANK YOU
