Big Data LDN 2017: Cognitive Search & Analytics – Bringing the Power of AI to Enterprise Search
1. 1
COGNITIVE SEARCH & ANALYTICS
BRINGING THE POWER OF AI
TO ENTERPRISE SEARCH
Gengis BIRSEN, Senior Solution Consultant
The Cognitive Search and Analytics Platform
6. 6
And in the business world…
• Identifying business experts
• Detecting money laundering schemes
• Recommendation systems
7. 7
How do we use AI at business level?
DATA
LOTS OF IT
SUITABLE
8. 8
Using enterprise data can be daunting
To leverage it we need to:
• Connect to data,
• Understand it,
• Index it,
• Secure it,
• Clean it,
• Enrich it,
• Search it,
• Analyse it.
9. 9
Cognitive search can tackle the data challenge
Find, extract
Connect to various sources of data
Refine
Structure unstructured data using NLP and ML
Distribute
Immediate and secure
Distribute
On all devices
Connect to all Data
Using the 150+ connectors for structured and unstructured data sources
Analyze the data
132 languages supported, 21 with advanced NLP developed over 20 years, augmented by Machine Learning
Get a unique perspective
Sinequa UI or Search APIs
Quick Time to value
Quickly deployed & highly scalable
10. 10
The Combination makes the difference!
Technologies are applied in combination – not simply in parallel
Each technology enriches the others, so the end result is more than the sum of its parts
11. 11
Platform
150+ SMART CONNECTORS
• Directories
• Social Networks
• Archives
• Cloud Sources
• Websites/Intranet
• Databases
• BI/Data Lake
• E-mails
• CMS/ERP/CRM Applications
COGNITIVE ANALYTICS
• Natural Language Processing
• Statistical Analysis
• Semantic Extractors
• Machine Learning
• Sinequa Algorithms
LOGICAL DATA WAREHOUSE
INSIGHT GENERATION
• SBA Studio
• Global Business API
• Global Analytics API
Data Scientists
End Users
Enterprise Applications
External SPARK Cluster
HADOOP Data Lake
12. 12
Natural Language Processing
• Language Recognition
• Part-of-speech Tagging / Lemmatization
• Concept Extraction
• Named Entities (people, places, e-mails, etc.)
• Text Mining Agents (date, plate #, amount, phone, etc.)
• struck (strike, verb)
• insured (insure, adj.)
• client (noun)
• Adam (first name)
• Johnson (unknown)
Insurance reference: A 45 65 45
Insuree: Adam Johnson
Dear Adjuster,
On October 15, 2005, my 2001 Honda Civic, license plate VML085,
was struck by your insured client Adam Johnson’s 2002 Volkswagen
Jetta, license plate ED386K, at the corner of (…) in New York City.
My medical bills totaled $3,450 as follows (copies of bills attached).
(...)
I have lost wages in the amount of $1000. I have had considerable
pain and suffering as a result of this accident and continue to suffer
from neck and back pain. I demand settlement of my claim in the
amount of $25,000.
Please respond to this demand with an offer to settle within 15 days.
Thank you.
Sincerely,
Joe Smith
GSM: 1(404) 456 123
joe.smith@mail.mail
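The text-mining agents listed on this slide (date, plate number, amount, e-mail) can be roughly approximated with regular expressions. The following is a minimal sketch under that assumption and is not how Sinequa's agents are implemented; the `PATTERNS` table and the `extract_entities` helper are hypothetical names, and the patterns only cover the formats that appear in the sample letter.

```python
import re

# Hypothetical patterns for illustration only; they cover just the formats
# seen in the sample letter (US-style dates, "$" amounts, simple e-mails).
PATTERNS = {
    "date": re.compile(
        r"\b(?:January|February|March|April|May|June|July|August|"
        r"September|October|November|December) \d{1,2}, \d{4}\b"
    ),
    # The capture group keeps only the plate itself, not the cue words.
    "plate": re.compile(r"license plate ([A-Z0-9]{5,8})"),
    "amount": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def extract_entities(text):
    """Return a dict mapping each entity type to the matches found in text."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}


# Abridged version of the letter shown on the slide.
LETTER = (
    "On October 15, 2005, my 2001 Honda Civic, license plate VML085, "
    "was struck by your insured client Adam Johnson's 2002 Volkswagen "
    "Jetta, license plate ED386K. My medical bills totaled $3,450. "
    "I have lost wages in the amount of $1000. "
    "I demand settlement of my claim in the amount of $25,000. "
    "joe.smith@mail.mail"
)

entities = extract_entities(LETTER)
for name, values in entities.items():
    print(name, values)
```

Hand-written patterns like these break quickly on other date or currency formats, which is one reason the slide pairs text-mining agents with language recognition and named-entity extraction.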
13. 13
Machine learning: Pros and Cons, When to Use
You have a defined task to perform
• ML is a good option when you have a defined use case, such as automating a process or a human task (image recognition, categorization, etc.)
• ML is good for accomplishing something specific. You cannot simply “want to use ML”
You have large, clean data sets, and time to experiment
• ML requires large and clean data sets to efficiently train a model
• ML involves training a model to perform a task. This requires experimentation and testing
• Supervised algorithms require a training set which is representative of the entire corpus
You already have a set of rules which can perform the desired task
• A good set of logical rules will always do what I want
• If there is no set of rules or if the set of rules is too complex, then ML is recommended
You do not have time to learn a model or need explanation of results
• Training an ML model takes many iterations, and each training iteration requires a large amount of time and computing resources
• ML models are not easily understandable by humans. They rely on a huge number of dimensions and weights. There is no logical rule which can be explained
14. 14
Sinequa is a Natural Fit for Machine Learning
While ML naturally enhances Sinequa native functionalities, Sinequa indexing and NLP capabilities also naturally fit with ML projects
Sinequa Enriched Index
Sinequa ML Platform
Sinequa Cognitive Functionalities
Algorithms feed on index content and on Sinequa NLP & indexing functions:
• Multi-language tokenizers & stop-word removers
• Lemmas & part-of-speech tagging
ML enhances Sinequa native functionalities:
• Learning to Rank
• Collaborative filtering
• Query Expansion
• Auto-completion
ML algorithms’ output further enhances Sinequa indexes:
• New Hierarchies (Auto Classification, Clustering)
• New Concepts/Entities (Topic Detection, NER)
Usage feedback further enhances ML capabilities:
• Search queries, filtering activities
• User content ratings, labeling
• User reinforcement feedback
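As an illustration of how usage feedback such as logged search queries could feed a feature like auto-completion, here is a minimal sketch. It assumes a plain list of past queries and ranks suggestions by frequency; `build_suggester` and the sample log are hypothetical and do not reflect Sinequa's actual implementation.

```python
from collections import Counter


def build_suggester(query_log):
    """Count normalized past queries so frequent ones are suggested first."""
    counts = Counter(q.strip().lower() for q in query_log if q.strip())

    def suggest(prefix, limit=3):
        """Return up to `limit` logged queries that start with `prefix`."""
        prefix = prefix.strip().lower()
        matches = [(q, n) for q, n in counts.items() if q.startswith(prefix)]
        # Most frequent first; ties are broken alphabetically for stable output.
        matches.sort(key=lambda item: (-item[1], item[0]))
        return [q for q, _ in matches[:limit]]

    return suggest


# Hypothetical query log, for illustration only.
LOG = [
    "money laundering",
    "Money laundering",
    "money market",
    "expert finder",
    "money laundering schemes",
    "money market",
]

suggest = build_suggester(LOG)
print(suggest("mon"))
```

A frequency count is the simplest possible signal; a production system would likely also weigh recency, user profile, and whether a suggestion led to a click.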
15. 15
Sinequa ML Packaged Algorithms
Sinequa embeds algorithms out-of-the-box, ready for use on your data
Requires a curated training data set
Classification
Model learns to auto-classify documents from a labelled training set
Clustering
Model identifies several document clusters and places each document in one cluster
Topic detection
Model identifies a fixed number of topics from text, each described by keywords. Model then distributes documents across topics
Regression
Model identifies correlations and patterns in a learning data set and later applies these to predict variables
Learning to Rank
Re-organizes the engine’s results by learning a user-specific relevancy
Custom Algorithms
Data scientists can create their own custom model in the language of their choice (Python, Scala, Java, etc.), using Sinequa native features (Tokenizers, Lemmas, Stop Word detection, etc.)
Key-Phrase Extraction
Uses corpus coherency to extract key-words and key-sentences from any document
Query expansion and auto-completion
Analyzes the logs and user profiles to suggest query expansions and/or auto-completion
Recommendation Engine
Recommends related documents by content (content-based) or by examining users’ interactions (collaborative filtering)
Named Entity Recognition
Model learns textual patterns associated with entities from a tagged training set. Model can then detect new entity candidates
Relevance Feedback Model
Model learns from the user’s activity and uses ML-computed document similarity to fine-tune document relevance
Similarity
Algorithm identifies documents with similar features
Extends Existing Sinequa Functionality
Provides New Capabilities
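To make the "Similarity" entry more concrete, the sketch below compares documents with TF-IDF weights and cosine similarity, one common way of identifying documents with similar features. This is an assumption made for illustration: the slides do not say which technique Sinequa uses, and the function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace, keeping alphabetic tokens only."""
    return [t for t in text.lower().split() if t.isalpha()]


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF so that no weight is zero or undefined.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def most_similar(docs, index):
    """Return the index of the document most similar to docs[index]."""
    vectors = tfidf_vectors(docs)
    scores = [
        (cosine(vectors[index], v), i) for i, v in enumerate(vectors) if i != index
    ]
    return max(scores)[1]


# Hypothetical mini-corpus, for illustration only.
DOCS = [
    "car accident insurance claim settlement",
    "insurance claim for a car accident",
    "bank transaction fraud detection",
]

print(most_similar(DOCS, 0))
```

The same vectors could serve as input to the clustering or content-based recommendation entries on this slide, since both rely on a notion of distance between documents.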
16. 16
Evolution of a cognitive insight platform
Connect data and extract entities (ontologies)
Tune relevancy/retrieval on business purpose (configuration)
Further enhance relevancy on user feedback
Leverage Insight via Analytics (i.e. Data Science)
Access information from different channels/UIs (public web, Q&A - chatbot, …)
Uncover additional value (e.g. Find the Expert)
18. 18
Bank Fraud
Question: Can we use banking transaction history
and data to identify abnormal activity (outliers)?
Data:
• Structured: transactions, amounts, deposits, …
• Unstructured: labels, account details
Approach:
Clustering of account behavior over a time window
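The slide does not detail the clustering itself, so the sketch below uses a simpler stand-in: it aggregates each account's transactions over a time window, then flags accounts whose total lies far from the median using a robust (median/MAD) z-score. This is an outlier score, not a clustering algorithm, and the function names, the sample transactions, and the conventional 3.5 threshold are assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import date
from statistics import median


def window_totals(transactions, start, end):
    """Sum transaction amounts per account for dates within [start, end]."""
    totals = defaultdict(float)
    for account, day, amount in transactions:
        if start <= day <= end:
            totals[account] += amount
    return dict(totals)


def flag_outliers(totals, threshold=3.5):
    """Return accounts whose total deviates strongly from the median.

    Uses the modified z-score 0.6745 * |x - median| / MAD, which is less
    sensitive to the outliers themselves than a mean/stddev z-score.
    """
    values = list(totals.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # All accounts behave (almost) identically.
    return sorted(
        account
        for account, value in totals.items()
        if 0.6745 * abs(value - med) / mad > threshold
    )


# Hypothetical transactions: (account, date, amount).
TRANSACTIONS = [
    ("A", date(2017, 1, 3), 100.0),
    ("A", date(2017, 1, 10), 120.0),
    ("B", date(2017, 1, 5), 200.0),
    ("C", date(2017, 1, 7), 250.0),
    ("D", date(2017, 1, 8), 180.0),
    ("E", date(2017, 1, 9), 9000.0),
    ("E", date(2017, 2, 9), 50.0),  # Falls outside the January window.
]

totals = window_totals(TRANSACTIONS, date(2017, 1, 1), date(2017, 1, 31))
print(flag_outliers(totals))
```

A real clustering approach would typically describe each account with several features per window (transaction count, average amount, counterparties) and treat accounts that fall in very small clusters, or far from every cluster centre, as candidates for review.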