Concept Searching ConceptClassifier For SharePoint
Concept Searching Portal Solutions Search Engine Face Off
1. Search Engine Face-Off
Keyword Search versus Metadata Search
Don Miller, VP of Business Development • 1 (408) 828-3400 • donm@conceptsearching.com
Val Orekhov, VP of Business Development • 1 (240) 450-2166 x 103 • val@portalsolutions.net
2. Concept Searching
Don Miller
Don Miller is a senior executive at Concept Searching with over 20 years of experience in knowledge
management. He is a frequent speaker on Records Management and Information Architecture
problems and solutions. Don has been a guest speaker at Taxonomy Boot Camp, Managing
Electronic Records and numerous SharePoint events about information organization and records
management.
Don Miller, VP of Business Development * 1 (408) 828-3400 * donm@conceptsearching.com
Portal Solutions
Val Orekhov
Val Orekhov, Chief Architect for Portal Solutions, is deeply skilled in enterprise application development,
web development, portals, relational databases, data access and modeling, and is versed in a number
of programming languages and technologies. He has been with Portal Solutions for almost five years
and drives the technical team to excel year over year. He holds a Master of Science in Computer
Science from Kyrgyz Technical University in Bishkek, Kyrgyzstan.
Val Orekhov, Chief Technical Architect * (1) (240) 450-2166 x 103 * val@portalsolutions.net
3. Agenda
ConceptSearching:
Keyword vs Metadata
Keyword vs Metadata Costs
Google vs. SharePoint vs. FAST
What’s wrong with a manual metadata approach
Automated approaches
USAF Case Study
Portal Solutions:
Enterprise Search – Google vs FAST in SharePoint
Indexing Options
Approach to Security Trimming
Ranking Algorithms & Sorting Options
Metadata & Search Refinements
Questions and Answers
Demo of product if time permits
Concept Searching • Don Miller • (408) 828-3400 • donm@conceptsearching.com
4. Concept Searching, Inc.
Company founded in 2002
Product launched in 2003
Focus on management of structured and unstructured information
Technology
All technologies based on our ‘open conceptualTagging
framework’
Automatic concept identification, content tagging, auto-
classification, taxonomy management
Only statistical vendor that can extract conceptual metadata
2009 and 2010 ‘100 Companies that Matter in KM’ (KM World
Magazine)
KMWorld ‘Trend Setting Product’ of 2009
and 2010
Locations: US, UK, & South Africa
Client base: Fortune 500/1000 organizations
Microsoft Enterprise Search ISV, FAST Partner
Product Suite: conceptSearch, conceptTaxonomyManager,
conceptClassifier, conceptClassifier for SharePoint,
contentTypeUpdater for SharePoint
5. What Type of Search or Information Architecture Do You Need?
Keyword Search = ~66%+ of results (Recall)
• Simple
• No administration
• Good enough
Metadata Search = 100% of results (Recall)
• Guided Navigation
• Records Management
• Sensitive Information Removal
• Collaboration
• Improved Precision and Recall
• Evolution of Keyword Search
Recall (information retrieval): a statistical measure (contrasted with precision), the fraction of all relevant
material that is returned by a search query
Precision (information retrieval): the percentage of documents returned that are relevant
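The two definitions above can be sketched in a few lines of Python. This is only an illustration: the document IDs and the two result sets are made up, and the function names are our own rather than anything from the products discussed.

```python
from __future__ import annotations


def recall(relevant: set[str], returned: set[str]) -> float:
    """Fraction of all relevant documents that the query returned."""
    if not relevant:
        return 0.0
    return len(relevant & returned) / len(relevant)


def precision(relevant: set[str], returned: set[str]) -> float:
    """Fraction of the returned documents that are relevant."""
    if not returned:
        return 0.0
    return len(relevant & returned) / len(returned)


# Hypothetical corpus: three documents are truly relevant to the query.
relevant = {"proposal-a", "proposal-b", "proposal-c"}

# Hypothetical result sets for the same query.
keyword_hits = {"proposal-a", "proposal-b", "newsletter"}
metadata_hits = {"proposal-a", "proposal-b", "proposal-c"}

for label, hits in (("keyword", keyword_hits), ("metadata", metadata_hits)):
    print(f"{label}: recall={recall(relevant, hits):.2f}, "
          f"precision={precision(relevant, hits):.2f}")
```

In this toy data the keyword result set misses one relevant document and includes one irrelevant one, so both measures come out at two thirds, while the metadata result set scores 1.0 on both.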
6. What Is Keyword vs. Metadata Costing You?
Pre Migration
Problem
• 60% of stored documents are obsolete
• 50% of documents are duplicates
• Requires resources to identify what should/not be migrated
Solution
• Eliminate duplicate documents
• Identify privacy data exposures
• Identify and declare records that were not previously identified
• Identify high value content
• Migrate required content to a structure based on the taxonomies
Benefit
• Reduces migration costs
• Ensures compliance and protection of content assets
Search
Problem
• “It’s not about better search”
• Less than 50% of content is correctly indexed, meta tagged or efficiently searchable
• 85% of relevant documents are never retrieved in search
Solution
• Eliminate manual tagging & replace with automatic identification of multi-word concepts
• Provide guided navigation via the taxonomy structure (i.e. concepts)
• Go beyond dynamic clustering with conceptual clustering
Benefit
• Taxonomy navigation is 36% - 48% faster
• Savings of 2.5 hours per user per day
Records Management
Problem
• 67% of data loss in Records Management is due to end user error
• It costs an organization $180 per document to recreate it when it is not tagged correctly and cannot be found
Solution
• Eliminate inconsistent end user tagging
• Automatically declare documents of record based on vocabulary and retention codes
• Automatically change the Content Type and route to the Records Management repository
Benefit
• Savings of $4.00 - $7.04 per record by eliminating manual tagging
• Ensures compliance and reduces potential litigation exposures
Data Privacy Protection
Problem
• Average cost per exposed record is $197 and ranges from $90-$305 per record
• 70% of breaches are due to a mistake or malicious intent by an organization’s own staff
Solution
• Identify any type of organizationally defined privacy data
• Combines pattern matching with associated vocabulary
• Automatic Content Type updating enabling workflows and rights management
Benefit
• Average cost runs from $225K to $35M
7. Metadata Search vs. Keyword and Guided Navigation “Proposal”
[Figure: results returned (also known as “Recall”) plotted against cost (time, money and complexity)]
• 100% of results: Metadata Search (“Documents of Record”, “Software License”, “SLA”, “Licensee”, “Addendum”, “License Agreement”, “License”)
• 66% of results: Key + Synonym Search (“Proposals”, “Contract”)
• 33% of results: Keyword Search (“Proposal”)
• 20-33% of results: Entity Extraction
Entity extraction without complex rules is ineffective. It is just keyword matching, which is what keyword
search is, and keyword search is 33% effective.
8. Similar Features Against Total Number of Documents Returned
Index
• Google: 500 M+
• SharePoint: 100 M
• FAST: 500 M+
Key Word (33% of results)
• Google: Yes
• SharePoint: Yes, as good as Google or FAST
• FAST: Yes
Synonyms (up to 50-66%+ of results)
• Google: Yes
• SharePoint: Yes
• FAST: Yes
Apply metadata automatically for 100% of results
• Google: No
• SharePoint: No
• FAST: Key Word only, which equals 33% of results
Ranking Algorithm + Best Bets (does not improve number of results, only how they are presented)
• Google: Non tunable
• SharePoint: Tunable
• FAST: Very tunable
9. What Is Missing To Get to 100% of Relevant Results in Every Search?
Auto Classification
• Google: No. Missing 33-50% of results on any particular topic
• SharePoint: No. Missing 33-50% of results on any particular topic
• FAST: Entity extraction, which is the same as keyword search (33% of results). No results (RECALL) improvement with this approach
Taxonomy Management
• Google: No
• SharePoint: Yes, but can’t do anything with it in this release. Security issues for managing Term Store.
• FAST: No
10. Miscellaneous Items to Review
SharePoint Refiners and Navigators with counts
• Google: Hard
• SharePoint: Yes. Easy to use for standard search. No counts on results.
• FAST: Medium. Initial release, does not leverage Term Store yet. XML and PowerShell based.
Customization
• Google: Difficult
• SharePoint: Difficult
• FAST: Extendable
11. Summary
• Google – Best for no administration, install and walk away. Usually missing 33%-
50% of results on any given topic because of missing metadata. Not easy to
integrate refiners or navigators into SharePoint UI.
• SharePoint Search – Cost effective, comes free with SharePoint. Search Algorithm
is as good as FAST or Google. Also very easy to install and walk away. Limited
extensibility. Easy integration for refiners and navigators (no counts). Also missing
50% of results on any topic.
• FAST – Extremely customizable, but requires training or professional services to
customize. Most likely Microsoft’s long-term platform for search. Very scalable and
can provide refiner counts. Still missing 33-50% of results from any given search
because of metadata inconsistency.
• However, they are all missing a true metadata strategy which is the only way to
ensure 100% of results.
12. A Manual Metadata Approach Will Fail 95%+ Of The Time
Issue: Organizational Impact
• Inconsistent: Less than 50% of content is correctly indexed, meta-tagged or efficiently
searchable, rendering it unusable to the organization (IDC)
• Subjective: Highly trained Information Specialists will agree on meta tags between
33% - 50% of the time. (C. Cleverdon)
• Cumbersome - Expensive: Average cost of manually tagging one item runs from $4 - $7 per
document and does not factor in the accuracy of the meta tags nor the
repercussions from mis-tagged content (Hoovers)
• Malicious Compliance: End users select first value in list (Perspectives on Metadata, Sarah Courier)
• No perceived value for end user: What’s in it for me? End user creates document, does not see value for
organization nor risks associated with litigation and non conformance to
policies.
• What have you seen: Metadata will continue to be a problem due to inconsistent human
behavior
The answer to consistent metadata is an automated approach that can extract the meaning from
content eliminating manual metadata generation yet still providing the ability to manage
knowledge assets in alignment with the unique corporate knowledge infrastructure.
13. conceptClassifier’s TaxonomyManager Automated Metadata Approach
Drives Business Value
Create enterprise automated metadata framework/model
• Average return on investment minimum of 38% and runs as high as 600% (IDC)
Apply consistent meaningful metadata to enterprise content
• Incorrect meta tags cost an organization $2,500 per user per year, in addition to potential
costs for non-compliance (IDC)
Guide users to relevant content with taxonomy navigation
• Savings of $8,965 per year per user based on an $80K salary (Chen & Dumais)
• 100% “Recall” of content, 35% faster access to content (“Precision”)
Use automatic conceptual metadata generation to improve Records Management
• Eliminate inconsistent end user tagging at $4-$7 per record (Hoovers)
• Improve compliance processes, eliminate potential privacy exposures
[Figure: cycle diagram with six stages]
1. Model and Validate
2. Automate Tagging
3. Findability
4. Business Processes
5. Records Management and PII
6. Life Cycle Management
14. USAF Human Performance Clearinghouse
GOAL : Leverage Existing USAF, AFDW, and AFMS License Agreements to
Enable IM, RM, & Privacy & Security Compliance
Requirements
• DoDD 8320 (Data Sharing in a Net-Centric DoD)
• DoDD 5015 (Records Management)
• USAF Privacy Act Program & HIPAA
• Freedom of Information Act (FOIA)
[Figure labels: Migration, Records Management, Search, eDiscovery & FOIA, Data Privacy]
Tel: 703.246.9360 | Fax: 240.465.1182
Distribution Statement A: Approved for public release; distribution is unlimited.
311 ABG/PA No. 09-488, 16 Oct 2009
15. Taxonomy Improves “Precision” with Guided Refiners for “Proposals”
• After 100% of Results are
returned, leverage metadata
for guided navigation and
refiners
• Use taxonomy/metadata
structures before query and
after query to guide users to
the right document
• Accelerate document finding
[PRECISION] by a minimum
of 35%
I want all proposals in two
specific regions. I could then
have a guided refiner for
vertical, amount, etc.
16. Dynamic Clustering Is Not Guided Navigation for “Proposals”
• Brings back clusters
• They are best guesses
• They might help, they
might make it worse
• Better than nothing,
but not a long term
strategy or evolution of
key word search
Dynamic navigation (CLUSTERING) is
ineffective. How does an information
worker know when it is a good topic or not?
This is NOT PRECISION!
17. Enterprise Search Comparison for SharePoint Google vs FAST
Why Enterprise Search needs Metadata and Taxonomy Management
Recall – Ensures you bring back 100% of Results
Enhance Precision – Fastest way to filter to the right results so that you are
looking at the documents that matter the most
MUST HAVES:
Heterogeneous content sources:
HTML, Documents and LOBs records
Located on Portals, File Systems and in Databases
Required Security Trimming:
Integrate with Identity Providers (AD, LDAP, SQL)
Implement authorization decision logic
Able to take advantage of metadata stored with documents and LOBs
18. Google Search Appliance 6.8
vs.
FAST Search Server for SharePoint 2010
For metadata-driven search scenarios in a SharePoint environment
19. Portal Solutions Corporate Overview - Vitals
• Founded in 2002
• SharePoint 2010 Microsoft Gold
Certified partner
• Over 100 SharePoint deployments
• 30+ certified engineers/developers
• Member of Microsoft SharePoint
Early Adopter Program
• A recognized best place to work by
Washingtonian magazine
• A growing IT consulting organization
comprised of talented and certified
staff
20. Corporate Overview - Solutions
• Employee Portals and Intranets
• Public facing web sites
• Knowledge Management solutions
• Document and Records Management
• Performance And Risk Management/BI
• Customer Extranets
• Enterprise Search solutions
• Business Process Automation
21. Introducing the Contenders
Google Search Appliance (GSA)
• Search Appliance, Google.com in a box
• Hardware & Software Solution
• Pre-packaged functionality ready to work
• “Black box” approach to search results
FAST Search Server for SharePoint 2010
• Spin off of the earlier FAST ESP
• Software-only solution
• Allows customization of many aspects of the engine functionality,
down to relevancy tuning algorithms
• Platform rather than a product
25. Security Trimming
• Answers the “Who Am I” and “What Results Can I See”
questions
• Required with most Enterprise Search scenarios
• Approaches include Late & Early Authorization/Binding
Late Authorization
• Access rights (ACLs): checked at run time against the system of record
• Pros: up-to-date presentation
• Cons: slow on large result sets
Early Authorization
• Access rights (ACLs): information stored in the index at item level
• Pros: fast; facilitates metadata clustering
• Cons: duplicates info; potential for outdated results
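The trade-off between the two approaches can be sketched in Python. This is a minimal illustration under stated assumptions, not how GSA, FAST or SharePoint implement trimming: `IndexedDoc`, `early_trim`, `late_trim` and the sample ACLs are all hypothetical names and data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedDoc:
    doc_id: str
    text: str
    acl_snapshot: frozenset  # principals allowed to read, copied at crawl time


def keyword_hits(index, term):
    """Naive keyword match over the indexed text."""
    return [d for d in index if term.lower() in d.text.lower()]


def early_trim(hits, principals):
    """Early binding: filter on the ACL snapshot stored in the index."""
    return [d.doc_id for d in hits if d.acl_snapshot & principals]


def late_trim(hits, principals, live_acls):
    """Late binding: check each hit against the system of record at query time."""
    return [d.doc_id for d in hits
            if live_acls.get(d.doc_id, frozenset()) & principals]


# Hypothetical index built at crawl time.
index = [
    IndexedDoc("a", "Proposal for region East", frozenset({"sales"})),
    IndexedDoc("b", "Proposal for region West", frozenset({"sales"})),
    IndexedDoc("c", "HR proposal", frozenset({"hr"})),
]

# Hypothetical system of record: access to "b" changed after the crawl.
live_acls = {
    "a": frozenset({"sales"}),
    "b": frozenset({"legal"}),
    "c": frozenset({"hr"}),
}

user = frozenset({"sales"})
hits = keyword_hits(index, "proposal")
early = early_trim(hits, user)            # uses the stale snapshot
late = late_trim(hits, user, live_acls)   # one live lookup per hit
print(early, late)
```

Early trimming needs no lookups at query time but still returns document "b", whose permissions changed after the crawl. Late trimming returns only "a", at the cost of one check per hit, which is where the slowdown on large result sets comes from.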
26. Security Trimming Options Support
Late Authorization
• GSA: “Default” option in many scenarios; via Kerberos, SAML Bridge or Connector
• FAST: -
• SharePoint 2010: Custom
Early Authorization
• GSA: Rel. 6.0 – High level Policy ACLs configured by admins or through a remote API *; Rel. 6.8 – Item-level ACLs **
• FAST: Item-level ACLs for Windows and SharePoint security principals supported natively; allows setup of multiple user property stores and mapping of user principals
• SharePoint 2010: Native support for Item-level ACLs with Windows and SharePoint security principals
* Best applied to enterprises with a manageable number of high level policies, or able to invest in custom ACL sync tools
** SharePoint Connector Rel. 2.6.4 sends SharePoint Site Groups with the feed but the Groups are not expanded properly by GSA
29. Result Set Ranking
• Fidelity of keyword matches (All Engines)
• Proximity
• Frequency
• Completeness
• Hyper Text Matching (GSA only)
• Analyzes keyword location on a rendered page and related pages
• Hub and Spoke Algorithm (All engines)
• Driven by linkages between web pages
• Pages receiving or providing most links have higher rankings
• GSA – PageRank; FAST – Document authority;
• Static rank biasing, document importance
• Document, Site, Metadata -based promotion / demotion (All engines)
• User-tagged documents receive higher importance (FAST, SharePoint search)
• Adaptive ranking
• User clicks in search results (FAST, SharePoint search)
• Custom Ranking
• Build custom ranking models w/ FAST
30. Result Set Sorting
• GSA
• Date/Time only (Document Modification Date, or a date extracted
from Title, Metadata or Body of a document)
• FAST
• Any property marked as Sortable
• Supported data types: String, Number, Date/Time
32. Index Schema Management
• GSA (All-inclusive)
• All discovered metadata (Crawled Properties) are stored in the index by default
• Metadata from MS Office documents stored in the index results. (GSA Feature
Request ID# 1371024)
• All string-type metadata is associated with FTI by default, matches on metadata
controlled through query time (allintext:, allintitle: keyword filters)
• Metadata in results limited to 1,500 chars per field (Rel. 6.8; prev. releases – 320
chars)
• FAST (Opt-in)
• Crawled properties have to be associated with Managed Properties (MPs) to be
stored in the index
• MPs represent a level of abstraction from Content Sources
• MPs can be configured to be used as:
• Stored in the index (Queryable)
• Associated with FTI (Searchable)
• Sortable
• Refiner-enabled
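The difference between the all-inclusive and opt-in schema models can be sketched as follows. This is a toy model only: `ManagedProperty`, the flag names and the crawled property names are illustrative assumptions, not the actual GSA or FAST configuration API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManagedProperty:
    name: str
    crawled_sources: tuple    # crawled property names mapped to this MP
    queryable: bool = True    # stored in the index
    searchable: bool = False  # associated with the full-text index (FTI)
    sortable: bool = False
    refinable: bool = False


def index_all_inclusive(crawled: dict) -> dict:
    """All-inclusive model: every discovered crawled property is kept."""
    return dict(crawled)


def index_opt_in(crawled: dict, schema: list) -> dict:
    """Opt-in model: keep only crawled properties mapped to a managed property."""
    stored = {}
    for mp in schema:
        for source in mp.crawled_sources:
            if source in crawled:
                stored[mp.name] = crawled[source]
                break  # first matching source wins
    return stored


# Hypothetical crawled properties discovered on one document.
crawled = {"ows_Region": "EMEA", "dc:creator": "dmiller", "X-Internal-Flag": "1"}

# Hypothetical schema: two managed properties abstract over several sources.
schema = [
    ManagedProperty("Region", ("ows_Region", "region"), sortable=True, refinable=True),
    ManagedProperty("Author", ("dc:creator", "ows_Author"), searchable=True),
]

print(index_all_inclusive(crawled))
print(index_opt_in(crawled, schema))
```

Mapping several crawled sources to one managed property is what gives the level of abstraction from content sources: a query or refiner targets `Region` regardless of which repository supplied the value, and the unmapped `X-Internal-Flag` never reaches the index.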
33. Search Refinement with Metadata
Run-time clustering
• Completeness: smaller sample of a much larger set; top 50-100 query results
• Pros: smaller index size
• Cons: degraded performance w/ larger samples; no cluster counts
Index-based clustering
• Completeness: entire result set stored in the index
• Pros: fast; allows for precise cluster counts
• Cons: increases index size
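The completeness difference between the two approaches can be illustrated with a short Python sketch. The function names and the sample data are hypothetical, and counting over the full result list merely stands in for what a real engine would read from precomputed index structures.

```python
from collections import Counter


def runtime_refiners(ranked_hits, field, sample_size=50):
    """Run-time clustering: derive refiner values from the top-N results only."""
    sample = ranked_hits[:sample_size]
    return sorted({doc[field] for doc in sample if field in doc})


def index_refiners(ranked_hits, field):
    """Index-based clustering: exact counts over the entire result set."""
    return Counter(doc[field] for doc in ranked_hits if field in doc)


# Hypothetical ranked result set of 105 proposals tagged with a region.
hits = (
    [{"region": "East"}] * 60
    + [{"region": "West"}] * 40
    + [{"region": "North"}] * 5
)

print(runtime_refiners(hits, "region", sample_size=50))
print(index_refiners(hits, "region"))
```

With a sample of 50, the run-time approach offers only "East" as a refiner, because every sampled hit carries that value, and it cannot say how many documents sit behind it. The index-based approach reports all three regions with exact counts.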
34. Search Refinement with Metadata
Run-time clustering
• GSA: the only option prior to Rel. 6.8 (Custom)
• FAST: OTB
• SharePoint 2010: OTB
Index-based clustering
• GSA: “Preview” status in Rel. 6.8 (OTB)
• FAST: OTB for MPs marked as Refinable; Inverted Index and Metadata Property Store combined into a high performance OLAP cube
• SharePoint 2010: Not available
35. Conclusions*
FAST
• SharePoint intranet as a hub + document libraries, LOBs
• Search results served from the SharePoint portal
• Active Directory-tied systems w/ content security policies applied broadly
• Fine level of control over index schema and document processing
• Custom search results ranking / relevancy models
• High complexity metadata-based search scenarios
• Full & Mini Search-driven applications
GSA
• Heterogeneous content sources dominated by web pages
• Search UI served by GSA
• Predominantly keyword-driven search experience
• Custom run-time search refiners for protected content; OTB “Dynamic Navigation” for LOB / public data
• Result biasing via URL patterns, metadata values
• Medium complexity metadata-based search scenarios
* Usage scenarios best aligned with OTB functionality, minimum possible customizations.
36. Special Offer
First ten attendees to sign up will receive a two-hour evaluation of
your current or planned enterprise search strategy.
For more information contact:
Val Orekhov - val@portalsolutions.net