SlideShare a Scribd company logo
1 of 31
Community Over Code
07/10/2023
Alessandro Benedetti, Director @ Sease
Introducing Multi-valued Vector
Fields in Apache Lucene
1
‣ Born in Tarquinia (ancient Etruscan city in Italy)
‣ R&D Software Engineer
‣ Director
‣ Master degree in Computer Science
‣ PC member for ECIR, SIGIR and Desires
‣ Apache Lucene/Solr PMC member/committer
‣ Elasticsearch/OpenSearch expert
‣ Semantic search, NLP, Machine Learning
technologies passionate
‣ Beach Volleyball player and Snowboarder
ALESSANDRO BENEDETTI
WHO AM I ?
2
‣ Headquarter in London/distributed
‣ Open-source Enthusiasts
‣ Apache Lucene/Solr experts
‣ Elasticsearch/OpenSearch experts
‣ Community Contributors
‣ Active Researchers
‣ Hot Trends : Neural Search,
Natural Language Processing
Learning To Rank,
Document Similarity,
Search Quality Evaluation,
Relevance Tuning
SEArch SErvices
www.sease.io
3
AGENDA
Why Multi-valued?
HNSW and modifications
Index time internals
Challenges of a contribution
Query time internals
4
3
2
1
WHAT
What Can you do now?
the text content of a field exceeds the maximum amount of characters accepted by
your inference model (to encode vectors)
Split the content in
paragraphs across
multiple
documents.
Your unit of
information
becomes the
paragraph
When returning the
results you need to
aggregate back to
documents
5
3
2
1Split the content in
paragraphs across
multiple
documents.
Your unit of
information
becomes the
paragraph
When returning the
results you need to
aggregate back to
documents
● Indexing Time: nested
documents(slow/expe
nsive)
● Indexing Time:
flattened
documents(redundant
data)
● Query Time: parent-child
join queries?
(slow/expensive)
● Query Time:
collapsing/grouping
● Aggregations: faceting
becomes more
complicated
● Stats: aggregating data
and calculating stats is
impacted
HOW
6
● This applies for all fields and field types actually
● you may be ok applying those strategies …
● … but for some users may be quite annoying and
expensive
WHY MULTI-VALUED
7
● K Nearest Neighbour Algorithm?
● Indexing data structures and approach?
● Query time data structures and approach?
What does it mean to bring multi-valued to vectors?
8
ANN - Approximate Nearest Neighbor
● Exact Nearest Neighbor is expensive! (1vs1 vector
distance)
● it’s fine to lose accuracy to get a massive performance gain
● pre-process the dataset to build index data structures
● Generally vectors are modelled in:
○ Trees
○ Hashes
○ Graphs - HNSW
9
HNSW - Hierarchical Navigable Small World graphs
Hierarchical Navigable Small World (HNSW)
graphs are among the top-performing index-
time data structures for approximate nearest
neighbor search (ANN).
References
https://doi.org/10.1016/j.is.2013.10.006
https://arxiv.org/abs/1603.09320
10
HNSW - How it works in a nutshell
● Proximity graph
● Vertices are vectors, closer vectors are linked
● Hierarchical Layers based on skip lists
○ longer edges in higher layers(fast retrieval)
○ shorter edges in lower layers(accuracy)
● Each layer is a Navigable Small World Graph
○ greedy search for the closest friend(local minimum)
○ higher the degree of vertices(number of connections)
lower the probability of hitting local min (but more
expensive
○ move down layer for refining the minimum(closest
friend)
11
HNSW - Skip Lists
● the higher the layer, the more sparse
● descending in layers while searching
● fast to search and insert
12
HNSW - Small World
rd Graphs
● start from entry point
● greedy search (each time distance is calculated across friends)
● starting from zoom out (low degree) to zoom in(high degree)
● when building the graph, higher average degree improve quality at a cost
image from https://www.pinecone.io/learn/hnsw/
13
HNSW - Index time
● add a vector at the time
● probability to enter layer N
● when added, it goes to all other layers
-> identify the layer(s) of insertion
● topk=1 closest neighbour is identified
● we descend and repeat until
the layer of insertion
● topk=ef_construction to identify neighbours
candidates
● M neighbours are linked (easiest is calculate
the exact distance)
image from https://www.pinecone.io/learn/hnsw/
Multi-Valued
- each node is not a document
- multiple vectors per document Id
14
HNSW - Search time
● Start from layer N (top)
○ longer edges in higher layers(fast retrieval)
○ shorter edges in lower layers(accuracy)
● Each layer is a Navigable Small World Graph
○ greedy search for the closest friend(local minimum)
○ higher the degree of vertices(number of connections)
lower the probability of hitting local min (but more
expensive
○ move down layer for refining the minimum(closest
friend)
Multi-Valued
you may add in the top-K results the same
document Id multiple times
15
HNSW - MAX/SUM approach
MAX
when adding a vector from the
same document, you update the
score with the max
SUM
when adding a vector from the
same document, you update the
score summing
16
Nov 2020 - Apache Lucene 9.0
Dedicated File Format for Navigable Small World Graphs
https://issues.apache.org/jira/browse/LUCENE-9004
Jan 2022 - Apache Lucene 9.0
Handle Document Deletions
https://issues.apache.org/jira/browse/LUCENE-10040
Feb 2022 - Apache Lucene 9.1
Introduced Hierarchy in HNSW
https://issues.apache.org/jira/browse/LUCENE-10054
Mar 2022 - Apache Lucene 9.1
Re-use data structures across HNSW Graph
https://issues.apache.org/jira/browse/LUCENE-10391
Mar 2022 - Apache Lucene 9.1
Pre filters with KNN queries
https://issues.apache.org/jira/browse/LUCENE-10382
Aug 2022 - Apache Lucene 9.4
8 bits vector quantization
https://github.com/apache/lucene/issues/11613
JIRA ISSUES:
https://issues.apache.org/jira/issues/?jql=project%20%3D%2
0LUCENE%20AND%20labels%20%3D%20vector-based-
search
GITHUB ISSUES:
https://github.com/apache/lucene/labels/vector-based-search
Apache Lucene
17
9.8 - Parent Join for Vectors
18
Released 28/09/23
Merged in 9.8
● PROS:
- a child vector is a document, so it can
have metadata
● CONS:
- performance?
● NodeIdCachingHeap
● Knn Collector (various implementations)
○ ToParentJoinKnnCollector
● ToParentBlockJoinKnnVectorQuery.java
● For more Info: Benjamin Trent
Apache Lucene
19
GitHub
Pull Request
INDEXING - Auxiliary Data Structures
MAP VECTOR IDS TO DOCUMENT IDS
in a multi-valued scenario multiple vectors may belong to the
same document
Search + LLMs KPI’s:
Operational
Search Session
Improve Search-driven
Business Metrics.
What KPI’s specific to
LLMs?
What Data metrics?
Combine Metric for
Business?
Focus on limited KPI’s
that impact business.
Track customers onsite
Behaviors for positive or
negative trends.
● leverage sparse support
● ordinal (vector Id) to document (document Id) map
● DocsWithVectorsSet to keep track of vectors per documents
● DirectMonotonicWriter to write the map
Lucene95HnswVectorsWriter
Write auxiliary data structures
20
INDEXING - DocsWithVectorsSet
DocsWithVectorsSet
accumulator of documents that have vectors, used by the HnswVectorsWriter when writing data
Search + LLMs KPI’s:
Operational
Search Session
Improve Search-driven
Business Metrics.
What KPI’s specific to
LLMs?
What Data metrics?
Combine Metric for
Business?
Focus on limited KPI’s
that impact business.
Track customers onsite
Behaviors for positive or
negative trends.
● compatible with single valued dense/sparse scenarios
● keep a stack of vectors per document
● able to return a count of vectors for each document
DocsWithVectorsSet
21
INDEXING - DirectMonotonicWriter
DirectMonotonicWriter
write a sequence of integers monotonically increasing (never decreasing),
in blocks
Search + LLMs KPI’s:
Operational
Search Session
Improve Search-driven
Business Metrics.
What KPI’s specific to
LLMs?
What Data metrics?
Combine Metric for
Business?
Focus on limited KPI’s
that impact business.
Track customers onsite
Behaviors for positive or
negative trends.
● each integer is a document Id
● the same document Id repeated for each vector in the document
● DirectMonotonicReader ordToDoc used then at reading time in the
SparseOffHeapVectorValues
● public int ordToDoc(int ord) -> the ordinal (vector Id) is the index
to use to access the block and the position within the block to
finally get the document Id
22
INDEXING - building HNSW Graph
NODE ID is the VECTOR ID
same as the sparse scenario, each node in the graph has an incremental ID
aligned with the vector ID
Search + LLMs KPI’s:
Operational
Search Session
Improve Search-driven
Business Metrics.
What KPI’s specific to
LLMs?
What Data metrics?
Combine Metric for
Business?
Focus on limited KPI’s
that impact business.
Track customers onsite
Behaviors for positive or
negative trends.
● the nodes count in the graph = vector count
● no code changes
23
QUERY TIME - Exact Search
Vector Scorer
(naive solution) all vectors are iterated, only the ones corresponding
to an accepted doc are scored
Search + LLMs KPI’s:
Operational
Search Session
Improve Search-driven
Business Metrics.
What KPI’s specific to
LLMs?
What Data metrics?
Combine Metric for
Business?
Focus on limited KPI’s
that impact business.
Track customers onsite
Behaviors for positive or
negative trends.
● VectorScorer scores only BitSet acceptedDocs
● all vectors from ByteVectorValues/FloatVectorValues are iterated
● scores are updated MAX/SUM
AbstractKnnVectorQuery
24
QUERY TIME - Approximate Search
HNSW SEARCH
searching on vectors(graph nodes) and returning
documents(max/sum score)
Search + LLMs KPI’s:
Operational
Search Session
Improve Search-driven
Business Metrics.
What KPI’s specific to
LLMs?
What Data metrics?
Combine Metric for
Business?
Focus on limited KPI’s
that impact business.
Track customers onsite
Behaviors for positive or
negative trends.
● searching on level != 0 -> vectors are added as candidates/results
● searching on level = 0 -> document ID is added to the results
● int docId = vectors.ordToDoc(vectorId);
● results are added to NeighborQueue
HnswGraphSearcher
25
QUERY TIME - NeighborQueue
TOP-K DOCUMENTS (NeighborQueue)
data structure used to collect the top-k results
as a long heap
Search + LLMs KPI’s:
Operational
Search Session
Improve Search-driven
Business Metrics.
What KPI’s specific to
LLMs?
What Data metrics?
Combine Metric for
Business?
Focus on limited KPI’s
that impact business.
Track customers onsite
Behaviors for positive or
negative trends.
● MIN HEAP
● each element is a long
[32 bits][32 bits] ->
[score][~document Id]
NeighborQueue
26
QUERY TIME - NeighborQueue
TOP-K DOCUMENTS (NeighborQueue)
data structure used to collect the top-k results
as a long heap
Search + LLMs KPI’s:
Operational
Search Session
Improve Search-driven
Business Metrics.
What KPI’s specific to
LLMs?
What Data metrics?
Combine Metric for
Business?
● nodeIdToHeapIndex cache is used to
keep track of nodes position
● score is updated for the node
(MAX/SUM)
● DOWNHEAP (as the ranking may
have improved)
NeighborQueue
27
● to build the first prototype -> 1 year
● super active area -> merging
● Lucene codecs change names and old codec
is moved back to backwards codecs
● 85 classes DIFF!
○ simplified temporarily removing MAX/SUM
○ simplified temporarily removing separate
code branches for single/multivalued
○ down to 25 classes!
○ thanks to Benjamin Trent, Mayya Sharipova, Jim Ferenczi,
Josh Devins for the first reviews
Challenges - side project and merging
28
WRAPPING UP
Why Multi-valued?
HNSW and modifications
Index time internals
Challenges of a contribution
Query time internals
29
DO YOU WANT TO MAKE IT HAPPEN?
HELP WITH CODE
Pull Request
HELP WITH FUNDINGS
info@sease.io
30
After Lucene 9.8
significant changes happened
THANK YOU!
@seaseltd @sease-
ltd
@seaseltd @sease_ltd 31

More Related Content

Similar to Multi Valued Vectors Lucene

Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...Neo4j
 
Data Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZoneData Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZoneDoug Needham
 
Neo4j: What's Under the Hood & How Knowing This Can Help You
Neo4j: What's Under the Hood & How Knowing This Can Help You Neo4j: What's Under the Hood & How Knowing This Can Help You
Neo4j: What's Under the Hood & How Knowing This Can Help You Neo4j
 
ast nearest neighbor search with keywords
ast nearest neighbor search with keywordsast nearest neighbor search with keywords
ast nearest neighbor search with keywordsswathi78
 
LDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status updateLDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status updateLDBC council
 
Vector databases and neural search
Vector databases and neural searchVector databases and neural search
Vector databases and neural searchDmitry Kan
 
MongoDB - General Purpose Database
MongoDB - General Purpose DatabaseMongoDB - General Purpose Database
MongoDB - General Purpose DatabaseAshnikbiz
 
Why Distributed Tracing is Essential for Performance and Reliability
Why Distributed Tracing is Essential for Performance and ReliabilityWhy Distributed Tracing is Essential for Performance and Reliability
Why Distributed Tracing is Essential for Performance and ReliabilityDevOps.com
 
High Performance Computing on NYC Yellow Taxi Data Set
High Performance Computing on NYC Yellow Taxi Data SetHigh Performance Computing on NYC Yellow Taxi Data Set
High Performance Computing on NYC Yellow Taxi Data SetParag Ahire
 
Map Reduce amrp presentation
Map Reduce amrp presentationMap Reduce amrp presentation
Map Reduce amrp presentationrenjan131
 
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDBMongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDBMongoDB
 
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...Dataconomy Media
 
Unleashing the Power of Vector Search in .NET - DotNETConf2024.pdf
Unleashing the Power of Vector Search in .NET - DotNETConf2024.pdfUnleashing the Power of Vector Search in .NET - DotNETConf2024.pdf
Unleashing the Power of Vector Search in .NET - DotNETConf2024.pdfLuigi Fugaro
 
MongoDB NoSQL database a deep dive -MyWhitePaper
MongoDB  NoSQL database a deep dive -MyWhitePaperMongoDB  NoSQL database a deep dive -MyWhitePaper
MongoDB NoSQL database a deep dive -MyWhitePaperRajesh Kumar
 
Exploiting the Data / Code Duality with Dali
Exploiting the Data / Code Duality with DaliExploiting the Data / Code Duality with Dali
Exploiting the Data / Code Duality with DaliCarl Steinbach
 
The Apache Solr Semantic Knowledge Graph
The Apache Solr Semantic Knowledge GraphThe Apache Solr Semantic Knowledge Graph
The Apache Solr Semantic Knowledge GraphTrey Grainger
 

Similar to Multi Valued Vectors Lucene (20)

Mr bi
Mr biMr bi
Mr bi
 
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
 
Data Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZoneData Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZone
 
Neo4j: What's Under the Hood & How Knowing This Can Help You
Neo4j: What's Under the Hood & How Knowing This Can Help You Neo4j: What's Under the Hood & How Knowing This Can Help You
Neo4j: What's Under the Hood & How Knowing This Can Help You
 
ast nearest neighbor search with keywords
ast nearest neighbor search with keywordsast nearest neighbor search with keywords
ast nearest neighbor search with keywords
 
LDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status updateLDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status update
 
Vector databases and neural search
Vector databases and neural searchVector databases and neural search
Vector databases and neural search
 
MongoDB - General Purpose Database
MongoDB - General Purpose DatabaseMongoDB - General Purpose Database
MongoDB - General Purpose Database
 
big_data_case_studies.pdf
big_data_case_studies.pdfbig_data_case_studies.pdf
big_data_case_studies.pdf
 
Why Distributed Tracing is Essential for Performance and Reliability
Why Distributed Tracing is Essential for Performance and ReliabilityWhy Distributed Tracing is Essential for Performance and Reliability
Why Distributed Tracing is Essential for Performance and Reliability
 
BigData Hadoop
BigData Hadoop BigData Hadoop
BigData Hadoop
 
Data analysis
Data analysisData analysis
Data analysis
 
High Performance Computing on NYC Yellow Taxi Data Set
High Performance Computing on NYC Yellow Taxi Data SetHigh Performance Computing on NYC Yellow Taxi Data Set
High Performance Computing on NYC Yellow Taxi Data Set
 
Map Reduce amrp presentation
Map Reduce amrp presentationMap Reduce amrp presentation
Map Reduce amrp presentation
 
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDBMongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
 
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...
 
Unleashing the Power of Vector Search in .NET - DotNETConf2024.pdf
Unleashing the Power of Vector Search in .NET - DotNETConf2024.pdfUnleashing the Power of Vector Search in .NET - DotNETConf2024.pdf
Unleashing the Power of Vector Search in .NET - DotNETConf2024.pdf
 
MongoDB NoSQL database a deep dive -MyWhitePaper
MongoDB  NoSQL database a deep dive -MyWhitePaperMongoDB  NoSQL database a deep dive -MyWhitePaper
MongoDB NoSQL database a deep dive -MyWhitePaper
 
Exploiting the Data / Code Duality with Dali
Exploiting the Data / Code Duality with DaliExploiting the Data / Code Duality with Dali
Exploiting the Data / Code Duality with Dali
 
The Apache Solr Semantic Knowledge Graph
The Apache Solr Semantic Knowledge GraphThe Apache Solr Semantic Knowledge Graph
The Apache Solr Semantic Knowledge Graph
 

More from Sease

When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...Sease
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaSease
 
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...Sease
 
How does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspectiveHow does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspectiveSease
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaSease
 
Neural Search Comes to Apache Solr
Neural Search Comes to Apache SolrNeural Search Comes to Apache Solr
Neural Search Comes to Apache SolrSease
 
Large Scale Indexing
Large Scale IndexingLarge Scale Indexing
Large Scale IndexingSease
 
Dense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfDense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfSease
 
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...Sease
 
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdfWord2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdfSease
 
How to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptxHow to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptxSease
 
Online Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr InterleavingOnline Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr InterleavingSease
 
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...Sease
 
Apache Lucene/Solr Document Classification
Apache Lucene/Solr Document ClassificationApache Lucene/Solr Document Classification
Apache Lucene/Solr Document ClassificationSease
 
Advanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache LuceneAdvanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache LuceneSease
 
Search Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSearch Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSease
 
Introduction to Music Information Retrieval
Introduction to Music Information RetrievalIntroduction to Music Information Retrieval
Introduction to Music Information RetrievalSease
 
Rated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: an Open Source Approach for Search Quality EvaluationRated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: an Open Source Approach for Search Quality EvaluationSease
 
Explainability for Learning to Rank
Explainability for Learning to RankExplainability for Learning to Rank
Explainability for Learning to RankSease
 
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @ChorusRated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @ChorusSease
 

More from Sease (20)

When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With Kibana
 
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
 
How does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspectiveHow does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspective
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With Kibana
 
Neural Search Comes to Apache Solr
Neural Search Comes to Apache SolrNeural Search Comes to Apache Solr
Neural Search Comes to Apache Solr
 
Large Scale Indexing
Large Scale IndexingLarge Scale Indexing
Large Scale Indexing
 
Dense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfDense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdf
 
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
 
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdfWord2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdf
 
How to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptxHow to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptx
 
Online Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr InterleavingOnline Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr Interleaving
 
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
 
Apache Lucene/Solr Document Classification
Apache Lucene/Solr Document ClassificationApache Lucene/Solr Document Classification
Apache Lucene/Solr Document Classification
 
Advanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache LuceneAdvanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache Lucene
 
Search Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSearch Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer Perspective
 
Introduction to Music Information Retrieval
Introduction to Music Information RetrievalIntroduction to Music Information Retrieval
Introduction to Music Information Retrieval
 
Rated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: an Open Source Approach for Search Quality EvaluationRated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
 
Explainability for Learning to Rank
Explainability for Learning to RankExplainability for Learning to Rank
Explainability for Learning to Rank
 
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @ChorusRated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
 

Recently uploaded

[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 

Recently uploaded (20)

[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 

Multi Valued Vectors Lucene

  • 1. Community Over Code 07/10/2023 Alessandro Benedetti, Director @ Sease Introducing Multi-valued Vector Fields in Apache Lucene 1
  • 2. ‣ Born in Tarquinia (ancient Etruscan city in Italy) ‣ R&D Software Engineer ‣ Director ‣ Master degree in Computer Science ‣ PC member for ECIR, SIGIR and Desires ‣ Apache Lucene/Solr PMC member/committer ‣ Elasticsearch/OpenSearch expert ‣ Semantic search, NLP, Machine Learning technologies passionate ‣ Beach Volleyball player and Snowboarder ALESSANDRO BENEDETTI WHO AM I ? 2
  • 3. ‣ Headquarter in London/distributed ‣ Open-source Enthusiasts ‣ Apache Lucene/Solr experts ‣ Elasticsearch/OpenSearch experts ‣ Community Contributors ‣ Active Researchers ‣ Hot Trends : Neural Search, Natural Language Processing Learning To Rank, Document Similarity, Search Quality Evaluation, Relevance Tuning SEArch SErvices www.sease.io 3
  • 4. AGENDA Why Multi-valued? HNSW and modifications Index time internals Challenges of a contribution Query time internals 4
  • 5. 3 2 1 WHAT What Can you do now? the text content of a field exceeds the maximum amount of characters accepted by your inference model (to encode vectors) Split the content in paragraphs across multiple documents. Your unit of information becomes the paragraph When returning the results you need to aggregate back to documents 5
  • 6. 3 2 1Split the content in paragraphs across multiple documents. Your unit of information becomes the paragraph When returning the results you need to aggregate back to documents ● Indexing Time: nested documents(slow/expe nsive) ● Indexing Time: flattened documents(redundant data) ● Query Time: parent-child join queries? (slow/expensive) ● Query Time: collapsing/grouping ● Aggregations: faceting becomes more complicated ● Stats: aggregating data and calculating stats is impacted HOW 6
  • 7. ● This applies for all fields and field types actually ● you may be ok applying those strategies … ● … but for some users may be quite annoying and expensive WHY MULTI-VALUED 7
  • 8. ● K Nearest Neighbour Algorithm? ● Indexing data structures and approach? ● Query time data structures and approach? What does it mean to bring multi-valued to vectors? 8
  • 9. ANN - Approximate Nearest Neighbor ● Exact Nearest Neighbor is expensive! (1vs1 vector distance) ● it’s fine to lose accuracy to get a massive performance gain ● pre-process the dataset to build index data structures ● Generally vectors are modelled in: ○ Trees ○ Hashes ○ Graphs - HNSW 9
  • 10. HNSW - Hierarchical Navigable Small World graphs Hierarchical Navigable Small World (HNSW) graphs are among the top-performing index- time data structures for approximate nearest neighbor search (ANN). References https://doi.org/10.1016/j.is.2013.10.006 https://arxiv.org/abs/1603.09320 10
  • 11. HNSW - How it works in a nutshell ● Proximity graph ● Vertices are vectors, closer vectors are linked ● Hierarchical Layers based on skip lists ○ longer edges in higher layers(fast retrieval) ○ shorter edges in lower layers(accuracy) ● Each layer is a Navigable Small World Graph ○ greedy search for the closest friend(local minimum) ○ higher the degree of vertices(number of connections) lower the probability of hitting local min (but more expensive ○ move down layer for refining the minimum(closest friend) 11
  • 12. HNSW - Skip Lists ● the higher the layer, the more sparse ● descending in layers while searching ● fast to search and insert 12
  • 13. HNSW - Small World rd Graphs ● start from entry point ● greedy search (each time distance is calculated across friends) ● starting from zoom out (low degree) to zoom in(high degree) ● when building the graph, higher average degree improve quality at a cost image from https://www.pinecone.io/learn/hnsw/ 13
  • 14. HNSW - Index time ● add a vector at the time ● probability to enter layer N ● when added, it goes to all other layers -> identify the layer(s) of insertion ● topk=1 closest neighbour is identified ● we descend and repeat until the layer of insertion ● topk=ef_construction to identify neighbours candidates ● M neighbours are linked (easiest is calculate the exact distance) image from https://www.pinecone.io/learn/hnsw/ Multi-Valued - each node is not a document - multiple vectors per document Id 14
  • 15. HNSW - Search time ● Start from layer N (top) ○ longer edges in higher layers(fast retrieval) ○ shorter edges in lower layers(accuracy) ● Each layer is a Navigable Small World Graph ○ greedy search for the closest friend(local minimum) ○ higher the degree of vertices(number of connections) lower the probability of hitting local min (but more expensive ○ move down layer for refining the minimum(closest friend) Multi-Valued you may add in the top-K results the same document Id multiple times 15
  • 16. HNSW - MAX/SUM approach MAX when adding a vector from the same document, you update the score with the max SUM when adding a vector from the same document, you update the score summing 16
  • 17. Nov 2020 - Apache Lucene 9.0 Dedicated File Format for Navigable Small World Graphs https://issues.apache.org/jira/browse/LUCENE-9004 Jan 2022 - Apache Lucene 9.0 Handle Document Deletions https://issues.apache.org/jira/browse/LUCENE-10040 Feb 2022 - Apache Lucene 9.1 Introduced Hierarchy in HNSW https://issues.apache.org/jira/browse/LUCENE-10054 Mar 2022 - Apache Lucene 9.1 Re-use data structures across HNSW Graph https://issues.apache.org/jira/browse/LUCENE-10391 Mar 2022 - Apache Lucene 9.1 Pre filters with KNN queries https://issues.apache.org/jira/browse/LUCENE-10382 Aug 2022 - Apache Lucene 9.4 8 bits vector quantization https://github.com/apache/lucene/issues/11613 JIRA ISSUES: https://issues.apache.org/jira/issues/?jql=project%20%3D%2 0LUCENE%20AND%20labels%20%3D%20vector-based- search GITHUB ISSUES: https://github.com/apache/lucene/labels/vector-based-search Apache Lucene 17
  • 18. 9.8 - Parent Join for Vectors 18 Released 28/09/23 Merged in 9.8 ● PROS: - a child vector is a document, so it can have metadata ● CONS: - performance? ● NodeIdCachingHeap ● Knn Collector (various implementations) ○ ToParentJoinKnnCollector ● ToParentBlockJoinKnnVectorQuery.java ● For more Info: Benjamin Trent
  • 20. INDEXING - Auxiliary Data Structures MAP VECTOR IDS TO DOCUMENT IDS in a multi-valued scenario multiple vectors may belong to the same document Search + LLMs KPI’s: Operational Search Session Improve Search-driven Business Metrics. What KPI’s specific to LLMs? What Data metrics? Combine Metric for Business? Focus on limited KPI’s that impact business. Track customers onsite Behaviors for positive or negative trends. ● leverage sparse support ● ordinal (vector Id) to document (document Id) map ● DocsWithVectorsSet to keep track of vectors per documents ● DirectMonotonicWriter to write the map Lucene95HnswVectorsWriter Write auxiliary data structures 20
  • 21. INDEXING - DocsWithVectorsSet DocsWithVectorsSet accumulator of documents that have vectors, used by the HnswVectorsWriter when writing data Search + LLMs KPI’s: Operational Search Session Improve Search-driven Business Metrics. What KPI’s specific to LLMs? What Data metrics? Combine Metric for Business? Focus on limited KPI’s that impact business. Track customers onsite Behaviors for positive or negative trends. ● compatible with single valued dense/sparse scenarios ● keep a stack of vectors per document ● able to return a count of vectors for each document DocsWithVectorsSet 21
  • 22. INDEXING - DirectMonotonicWriter DirectMonotonicWriter write a sequence of integers monotonically increasing (never decreasing), in blocks Search + LLMs KPI’s: Operational Search Session Improve Search-driven Business Metrics. What KPI’s specific to LLMs? What Data metrics? Combine Metric for Business? Focus on limited KPI’s that impact business. Track customers onsite Behaviors for positive or negative trends. ● each integer is a document Id ● the same document Id repeated for each vector in the document ● DirectMonotonicReader ordToDoc used then at reading time in the SparseOffHeapVectorValues ● public int ordToDoc(int ord) -> the ordinal (vector Id) is the index to use to access the block and the position within the block to finally get the document Id 22
  • 23. INDEXING - building HNSW Graph NODE ID is the VECTOR ID same as the sparse scenario, each node in the graph has an incremental ID aligned with the vector ID Search + LLMs KPI’s: Operational Search Session Improve Search-driven Business Metrics. What KPI’s specific to LLMs? What Data metrics? Combine Metric for Business? Focus on limited KPI’s that impact business. Track customers onsite Behaviors for positive or negative trends. ● the nodes count in the graph = vector count ● no code changes 23
  • 24. QUERY TIME - Exact Search Vector Scorer (naive solution) all vectors are iterated, only the ones corresponding to an accepted doc are scored Search + LLMs KPI’s: Operational Search Session Improve Search-driven Business Metrics. What KPI’s specific to LLMs? What Data metrics? Combine Metric for Business? Focus on limited KPI’s that impact business. Track customers onsite Behaviors for positive or negative trends. ● VectorScorer scores only BitSet acceptedDocs ● all vectors from ByteVectorValues/FloatVectorValues are iterated ● scores are updated MAX/SUM AbstractKnnVectorQuery 24
  • 25. QUERY TIME - Approximate Search HNSW SEARCH searching on vectors(graph nodes) and returning documents(max/sum score) Search + LLMs KPI’s: Operational Search Session Improve Search-driven Business Metrics. What KPI’s specific to LLMs? What Data metrics? Combine Metric for Business? Focus on limited KPI’s that impact business. Track customers onsite Behaviors for positive or negative trends. ● searching on level != 0 -> vectors are added as candidates/results ● searching on level = 0 -> document ID is added to the results ● int docId = vectors.ordToDoc(vectorId); ● results are added to NeighborQueue HnswGraphSearcher 25
  • 26. QUERY TIME - NeighborQueue TOP-K DOCUMENTS (NeighborQueue) data structure used to collect the top-k results as a long heap Search + LLMs KPI’s: Operational Search Session Improve Search-driven Business Metrics. What KPI’s specific to LLMs? What Data metrics? Combine Metric for Business? Focus on limited KPI’s that impact business. Track customers onsite Behaviors for positive or negative trends. ● MIN HEAP ● each element is a long [32 bits][32 bits] -> [score][~document Id] NeighborQueue 26
  • 27. QUERY TIME - NeighborQueue TOP-K DOCUMENTS (NeighborQueue) data structure used to collect the top-k results as a long heap Search + LLMs KPI’s: Operational Search Session Improve Search-driven Business Metrics. What KPI’s specific to LLMs? What Data metrics? Combine Metric for Business? ● nodeIdToHeapIndex cache is used to keep track of nodes position ● score is updated for the node (MAX/SUM) ● DOWNHEAP (as the ranking may have improved) NeighborQueue 27
  • 28. ● to build the first prototype -> 1 year ● super active area -> merging ● Lucene codecs change names and old codec is moved back to backwards codecs ● 85 classes DIFF! ○ simplified temporarily removing MAX/SUM ○ simplified temporarily removing separate code branches for single/multivalued ○ down to 25 classes! ○ thanks to Benjamin Trent, Mayya Sharipova, Jim Ferenczi, Josh Devins for the first reviews Challenges - side project and merging 28
  • 29. WRAPPING UP Why Multi-valued? HNSW and modifications Index time internals Challenges of a contribution Query time internals 29
  • 30. DO YOU WANT TO MAKE IT HAPPEN? HELP WITH CODE Pull Request HELP WITH FUNDINGS info@sease.io 30 After Lucene 9.8 significant changes happened