Miles Kehoe, enterprise search guru, presents ways to improve search in your organization by using the document processing pipeline in FAST Search for SharePoint.
Using the FAST Search for SharePoint Pipeline to Improve Search
1. Improving search using the pipeline in FAST Search for SharePoint
Miles Kehoe
Author of: Professional Microsoft Search
Miles.kehoe@ideaeng.com
www.enterprisesearchblog.com
@miles_kehoe
mileskehoe
ideaeng.com SurfRay.com
2. Agenda
• Introductions
• When FS4SP makes sense
• What is the FS4SP indexing pipeline?
• Why is it important to you?
• How do you use it?
• Wrap Up
3. About Me
• Founder of New Idea Engineering Inc.
• Working with enterprise search since 1989
• Co-author of Professional Microsoft Search (Wrox)
• Author of several blogs:
- Enterprisesearchblog.com
- SearchComponentsOnline.com
• Search nerd
4. When to use FS4SP
Large datasets
• SP Search indexes up to 100M documents
• FS4SP is virtually unlimited (650M in tests)
• Rows and Columns concept
Need to fine-tune index & search
• Pipeline
• Need custom relevance profiles
• Need to fine-tune queries for relevance
5. What is the FS4SP indexing pipeline?
Standard sequence of ‘stages’ from crawl to index
• Format conversion & language detection
• Lemmatization / Stemming
• Entity extraction
• Map crawled properties to managed properties
Unique to FAST: the ability to insert custom processing
• ‘Must’ be just before mapper
• C# supported; but any code using STDIN/STDOUT ok
• Time critical!
A great way to fix up messy data!
6. Pipeline Architecture
[Diagram: Index Flow. Components shown: Data Sources, Crawler, Content Processor, Indexer, Query Processor, User Queries.]
FS4SP Pipeline stages: Format Conversion, Language Detection, Lemmatization, Entity Extraction, …, Custom Extensibility, Mapper
7. Why is the pipeline important to you?
Sometimes content IS messy:
• URLs with abbreviations
• Additional metadata is in external sources
• Geo-tag documents
Diagnose problems in the indexing process:
• Identify bad or missing metadata
8. Examples where the pipeline can save you
Cryptic URLs
• With URLs like www.myco.com/mkt/prodmgmt/products.aspx
• I can add specific metadata to the document:
‘marketing’ (because of ‘mkt’) & ‘product management’ (because of ‘prodmgmt’)
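The URL example above can be sketched as a small lookup. This is only an illustrative Python sketch, not code from the talk: the abbreviation table and the function name `tags_from_url` are hypothetical, and a real table would come from your own site structure.

```python
from urllib.parse import urlparse

# Hypothetical abbreviation table (assumption for illustration only).
ABBREVIATIONS = {
    "mkt": "marketing",
    "prodmgmt": "product management",
}

def tags_from_url(url):
    """Return a metadata tag for each known abbreviation in the URL path."""
    # urlparse only separates host and path when a scheme is present.
    if "://" not in url:
        url = "http://" + url
    segments = urlparse(url).path.lower().split("/")
    return [ABBREVIATIONS[s] for s in segments if s in ABBREVIATIONS]

print(tags_from_url("www.myco.com/mkt/prodmgmt/products.aspx"))
# → ['marketing', 'product management']
```

Unknown path segments are simply ignored, so a URL with no recognised abbreviations yields an empty list rather than an error.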
Adding valuable metadata:
• When I find a user name in a document I can look it up and return a phone number and email address
• When I find a city name I can geo-tag with latitude and longitude
Debugging the indexing process
• When things are not as they seem I can diagnose problems in the indexing process
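One way to diagnose bad or missing metadata is to check each item for the properties you expect. The sketch below is an assumption-laden illustration: it assumes the item arrives as a `<Document>` element containing `<CrawledProperty propertyName="...">` children, and the function name `missing_properties` is hypothetical.

```python
import xml.etree.ElementTree as ET

def missing_properties(document_xml, required):
    """Return the required property names that are absent or empty."""
    root = ET.fromstring(document_xml)
    present = {
        prop.get("propertyName")
        for prop in root.iter("CrawledProperty")
        if (prop.text or "").strip()  # treat whitespace-only values as missing
    }
    return [name for name in required if name not in present]

# Assumed input shape, for illustration only.
sample = (
    "<Document>"
    '<CrawledProperty propertyName="author">Miles Kehoe</CrawledProperty>'
    '<CrawledProperty propertyName="title">  </CrawledProperty>'
    "</Document>"
)
print(missing_properties(sample, ["author", "title", "phone"]))
# → ['title', 'phone']
```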
9. How do you use the pipeline?
Pipeline configuration files in FASTSearch\etc
• PipelineConfig.xml
• PipelineExtensibility.xml
For each Document Processor node:
• Create an entry for a new ‘processor’
• Add your new processor name to the <pipelines> node
• Restart the ‘FAST processor server’ from CMD: psctrl reset
• Submit a single known test document
• Check your results
13. How do you create a custom stage?
Edit file %FASTSEARCH%\etc\pipelineconfig as above
Edit file %FASTSEARCH%\etc\PipelineExtensibility.xml
<PipelineExtensibility>
  <Run command="YourCode.EXE %(input)s %(output)s">
    <Input>
      <CrawledProperty propertyName="author" propertySet="GUID" varType="31"/>
    </Input>
    <Output>
      <CrawledProperty propertyName="mytags" propertySet="GUID" varType="31"/>
      <CrawledProperty propertyName="phone" propertySet="GUID" varType="31"/>
    </Output>
  </Run>
</PipelineExtensibility>
Restart content servers from the command-line prompt:
psctrl reset
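A rough sketch of what the program behind the Run command might look like. Several things here are assumptions rather than anything stated on the slides: that the input and output files hold a `<Document>` element with `<CrawledProperty>` children, the `PHONE_BOOK` table and its made-up number, and the helper names. Python is used only to keep the sketch short; C# is the supported route, and whether a script like this can run in your sandbox is untested.

```python
import sys
import xml.etree.ElementTree as ET

PROPERTY_SET = "GUID"  # placeholder, as on the slide; use your real property set GUID

# Hypothetical directory data; a real stage might query a directory service.
PHONE_BOOK = {"Miles Kehoe": "555-0100"}

def process(root):
    """Build an output <Document> with a phone property for the input author."""
    author = ""
    for prop in root.iter("CrawledProperty"):
        if prop.get("propertyName") == "author":
            author = (prop.text or "").strip()
    out = ET.Element("Document")
    phone = PHONE_BOOK.get(author)
    if phone:
        el = ET.SubElement(out, "CrawledProperty", propertySet=PROPERTY_SET,
                           propertyName="phone", varType="31")
        el.text = phone
    return out

def main(argv):
    # %(input)s and %(output)s arrive as the two command-line arguments.
    input_path, output_path = argv[1], argv[2]
    result = process(ET.parse(input_path).getroot())
    ET.ElementTree(result).write(output_path, encoding="utf-8",
                                 xml_declaration=True)

if __name__ == "__main__":
    if len(sys.argv) == 3:
        main(sys.argv)
    else:
        print("usage: stage.py <input file> <output file>")
```

Keeping the lookup logic in `process` and the file handling in `main` means the logic can be exercised on a single known test document without touching the pipeline.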
14. Pipeline is performance-critical
Pipeline runs in ‘sandbox’ environment
• NOT the same type of ‘sandbox’ as in O365
• File I/O only allowed in C:\Users\<fast service user>\AppData\LocalLow
• Maximum of 10 seconds to live
• Permissions restricted regardless of FAST Service user permissions
• Each Document Processor (DP) is an individual instance
• Only one item passes thru a DP at a time
• If each document takes 1 second then 10 DPs can process at best 10 docs/sec
• Consider 1 sec for each of 100K docs: even with 10 DPs that is ~3 hours!
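The arithmetic behind these numbers, as a small sketch. The function name `indexing_hours` is made up for illustration, and the model is best case: it ignores crawl time, queuing, and everything else in the pipeline.

```python
def indexing_hours(num_docs, seconds_per_doc, num_processors):
    """Best-case wall-clock hours when each DP handles one item at a time."""
    docs_per_second = num_processors / seconds_per_doc
    return num_docs / docs_per_second / 3600

print(round(indexing_hours(100_000, 1.0, 10), 1))  # → 2.8 (10 DPs)
print(round(indexing_hours(100_000, 1.0, 1), 1))   # → 27.8 (a single DP)
```

Halving the time spent per document in the custom stage halves the total, which is why the stage should do as little work as possible.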
15. Pipeline Hints
MS only supports:
• Single custom stage (in PipelineConfig.xml)
• .NET languages (C#, etc)
But:
• A custom stage can appear in multiple places in PipelineConfig.xml, even w/ different parameters
• Theoretically any executable that handles STDIN/STDOUT will do
• VC#/VC++/VBScript/CMD files seem to work
• Web services calls are supported
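Because the stage is time-critical, a web service call is safer with a short timeout and a fallback. A minimal sketch, assuming a hypothetical JSON service; the function name `lookup`, the 3-second default, and the injectable `opener` parameter are illustrative choices, not part of FS4SP. Note that the `urllib` timeout applies to each blocking socket operation, so it is not a hard overall deadline.

```python
import json
import urllib.request

def lookup(url, timeout_seconds=3.0, opener=urllib.request.urlopen):
    """Fetch JSON from a web service; return None instead of stalling the stage."""
    try:
        with opener(url, timeout=timeout_seconds) as response:
            return json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers URLError and socket timeouts;
        # ValueError covers malformed JSON and bad encodings.
        return None
```

Returning `None` lets the stage emit the document without the extra metadata rather than failing it; the `opener` parameter exists only so the function can be tested without a network.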
16. Using web services in Sandbox
[Diagram: a sequence of pipeline stages exchanging XML, an XML config, and a call out to a web service.]
17. Ontolica FAST Management
Ontolica FAST Management provides clear and easy-to-use configuration directly from within the SharePoint admin GUI. Replace XML configuration files, manual file deployments, and tricky PowerShell configuration with easy management consoles.
Key Features:
• Backup, Manage, & Deploy Configurations
• Manage FAST Relevance Profiles
• Upload & Manage Pipeline Extensions
• Create & Manage JDBC Connections
• FAST Webcrawler Configuration
• Manage FAST Server Processes from Central Admin
18. Additional Resources
• This slide deck live at http://slidesha.re/sCGAaP
• SP2010 ES/FS4SP Blog (Eric Belisle) - http://fs4sp.blogspot.com/
• Enterprise Search Blog (NIE) - http://www.enterprisesearchblog.com/
• Search Unleashed (Len Ocsouza) - http://searchunleashed.wordpress.com/
• ESW Blog - http://www.enterprisesearchwiki.com/wp/
• TechNet/MSDN/Microsoft
• And of course: SurfRay.com (Robert Piddocke & Josh Noble)
19. Q/A & Contact Details
Miles Kehoe
Author of: Professional Microsoft Search
Miles.kehoe@ideaeng.com
www.enterprisesearchblog.com
@miles_kehoe
mileskehoe
Robert Piddocke
Author: Pro SharePoint 2010 Search
rcp@surfray.com
@rpiddocke
R Piddocke
Editor's Notes
By default, two pipelines are defined: Attachments and Office14