Improving search using the pipeline
  in FAST Search for SharePoint

            Miles Kehoe
            Author of: Professional Microsoft Search
            Miles.kehoe@ideaeng.com
            www.enterprisesearchblog.com
            @miles_kehoe
            mileskehoe




                                  ideaeng.com          SurfRay.com
Agenda
• Introductions

• When FS4SP makes sense

• What is the FS4SP indexing pipeline?

• Why is it important to you?

• How do you use it?

• Wrap Up
About Me
• Founder of New Idea Engineering Inc.

• Working with enterprise search since 1989

• Co-author of Professional Microsoft Search (Wrox)

• Author of several blogs:

   -   Enterprisesearchblog.com

   -   SearchComponentsOnline.com

• Search nerd
When to use FS4SP

Large datasets

 •   SP Search indexes up to about 100M documents

 •   FS4SP virtually unlimited (650M in tests)

 •   Rows and Columns concept

Need to fine-tune index & search

 •   Pipeline

 •   Need custom relevance profiles

 •   Need to fine-tune queries for relevance
What is the FS4SP indexing pipeline?
Standard sequence of ‘stages’ from crawl to index
  •   Format conversion & language detection
  •   Lemmatization / Stemming
  •   Entity extraction
  •   Map crawled properties to managed properties
Unique to FAST: the ability to insert custom processing
  •   ‘Must’ be just before mapper
  •   C# supported; but any code using STDIN/STDOUT ok
  •   Time critical!
A great way to fix up messy data!
Pipeline Architecture
[Diagram: Index Flow. Data Sources → Crawler → Content Processor → Indexer → Query Processor ← User Queries. FS4SP Pipeline stages: Format Conversion → Language Detection → Lemmatization → Entity Extraction → … → Custom Extensibility → Mapper.]
Why is the pipeline important to you?
Sometimes content IS messy:
 • URLs with abbreviations
 • Additional metadata is in external sources
 • Geo-tag documents

Diagnose problems in the indexing process:
 • Identify bad or missing metadata
Examples where the pipeline can save you
Cryptic URLs
     •   With URLs like www.myco.com/mkt/prodmgmt/products.aspx
     •   I can add specific metadata to the document
           ‘marketing’ (because of ‘mkt’) & ‘product management’ (because of ‘prodmgmt’)
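The URL example can be sketched in a few lines of Python. This is only an illustration of the idea, not code from the talk: the `ABBREVIATIONS` table and the `tags_from_url` function are hypothetical names, and the only mappings included are the two given on the slide.

```python
from urllib.parse import urlparse

# Hypothetical abbreviation table; the two entries come from the slide's example.
ABBREVIATIONS = {"mkt": "marketing", "prodmgmt": "product management"}


def tags_from_url(url: str) -> list[str]:
    """Return metadata tags for every known abbreviation in the URL path."""
    # urlparse only separates host from path when a scheme is present.
    if "://" not in url:
        url = "http://" + url
    segments = urlparse(url).path.lower().split("/")
    return [ABBREVIATIONS[s] for s in segments if s in ABBREVIATIONS]


print(tags_from_url("www.myco.com/mkt/prodmgmt/products.aspx"))
```

Matching whole path segments, rather than substrings, avoids tagging a page just because ‘mkt’ happens to appear inside a longer word.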

Adding valuable metadata:
•   When I find a user name in a document I can look up and return phone number and email
•   When I find a city name I can geo-tag with latitude and longitude
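As a rough sketch of the geo-tagging idea, assuming a small in-memory gazetteer: `CITY_COORDS` and `geo_tags` are hypothetical names, the coordinates are approximate, and only single-word city names are handled. A real stage would more likely load the table from a file or an external service.

```python
import re

# Hypothetical gazetteer with approximate (latitude, longitude) values.
CITY_COORDS = {"oslo": (59.91, 10.75), "seattle": (47.61, -122.33)}


def geo_tags(text: str) -> dict[str, tuple[float, float]]:
    """Map each known city mentioned in the text to its coordinates."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {city: CITY_COORDS[city] for city in sorted(CITY_COORDS.keys() & words)}


print(geo_tags("Meeting in Oslo, then Seattle."))
```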

Debugging the indexing process
•   When things are not as they seem I can diagnose problems in the indexing process
How do you use the pipeline?
Pipeline configuration files in FASTSearch\etc
    • PipelineConfig.xml
    • PipelineExtensibility.xml
For each Document Processor node:
    • Create an entry for a new ‘processor’
   • Add your new processor name to the <pipelines> node
   • Restart the ‘FAST processor server’ from CMD: psctrl reset
   • Submit a single known test document
   • Check your results
Config Files
Adding a Processor Stage
On each FAST document processor node:
• Edit %FASTSEARCH%\etc\pipelineconfig.xml
     <processor name="Spy1" type="general" hidden="0">
       <load module="processors.Spy" class="Spy"/>
       <config>
         <param name="SpyDumpFile" value="var/log/spy.txt" type="str"/>
         <param name="FileStringCutOffLen" value="32768" type="int"/>
       </config>
       <inputs>
       </inputs>
     </processor>
• In the ‘Document Conversion’ section, add the new pipeline stage to run (in the Office 14
   pipeline)
     <processor name="Spy1" />
• Reset (each) document processor node:
     psctrl reset
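Before resetting the document processors, a quick sanity check can catch a stage that is referenced in a pipeline but never defined. The Python sketch below is not part of FS4SP. It assumes that a definition is a `<processor>` element with a `<load>` child and that a reference is a `<processor>` element nested inside a `<pipeline>` element; the `<pipeline>` element name, the `<pipelineconfig>` root and the overall layout of `SAMPLE` are assumptions, so adjust them to your actual file.

```python
import xml.etree.ElementTree as ET


def undefined_stages(config_xml: str) -> list[str]:
    """Return stage names used in a pipeline but never defined (assumed layout)."""
    root = ET.fromstring(config_xml)  # raises ParseError on malformed XML
    # Assumption: a definition is a <processor> that has a <load> child.
    defined = {p.get("name") for p in root.iter("processor")
               if p.find("load") is not None}
    # Assumption: a reference is a <processor> nested inside a <pipeline>.
    referenced = [p.get("name") for pipe in root.iter("pipeline")
                  for p in pipe.iter("processor")]
    return [name for name in referenced if name not in defined]


# Hypothetical, heavily simplified configuration used only for illustration.
SAMPLE = """
<pipelineconfig>
  <processors>
    <processor name="Spy1" type="general" hidden="0">
      <load module="processors.Spy" class="Spy"/>
    </processor>
  </processors>
  <pipelines>
    <pipeline name="Office14">
      <processor name="Spy1"/>
      <processor name="Spy2"/>
    </pipeline>
  </pipelines>
</pipelineconfig>
"""

print(undefined_stages(SAMPLE))  # Spy2 is referenced but not defined
```

Because the file is parsed as XML, the same check also fails loudly on malformed markup, such as curly quotes pasted in from a word processor.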
FS4SP Pipeline Extensibility
How do you create a custom stage?
Edit file %FASTSEARCH%\etc\pipelineconfig as above
Edit file %FASTSEARCH%\etc\PipelineExtensibility.xml

<PipelineExtensibility>
  <Run command="YourCode.EXE %(input)s %(output)s">
    <Input>
      <CrawledProperty propertyName="author" propertySet="GUID" varType="31" />
    </Input>
    <Output>
      <CrawledProperty propertyName="mytags" propertySet="GUID" varType="31"/>
      <CrawledProperty propertyName="phone" propertySet="GUID" varType="31"/>
    </Output>
  </Run>
</PipelineExtensibility>
Restart content servers from the command-line prompt
    psctrl reset
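The program named in the Run command receives an input file path and an output file path. A minimal sketch of such a program follows, written in Python for brevity. It assumes both files hold a `<Document>` element containing `<CrawledProperty>` children, mirroring the attributes in the configuration above; check that format against the Microsoft documentation before relying on it. `PHONE_BOOK`, `process` and `main` are hypothetical names, and "GUID" is the slide's placeholder, not a real property set.

```python
import sys
import xml.etree.ElementTree as ET

PROPERTY_SET = "GUID"  # placeholder from the slide; use your real property set GUID
VAR_TYPE = "31"        # same varType as in the slide's configuration
# Hypothetical directory; a real stage might query a directory or web service.
PHONE_BOOK = {"jane doe": "555-0100"}


def process(input_xml: str) -> str:
    """Build the output document from the input document (assumed XML format)."""
    doc = ET.fromstring(input_xml)
    author = ""
    for prop in doc.iter("CrawledProperty"):
        if prop.get("propertyName") == "author":
            author = (prop.text or "").strip().lower()

    out = ET.Element("Document")
    values = {"mytags": f"author:{author}" if author else "",
              "phone": PHONE_BOOK.get(author, "")}
    for name, value in values.items():
        new_prop = ET.SubElement(out, "CrawledProperty", propertyName=name,
                                 propertySet=PROPERTY_SET, varType=VAR_TYPE)
        new_prop.text = value
    return ET.tostring(out, encoding="unicode")


def main(input_path: str, output_path: str) -> None:
    """Read the input file, transform it, and write the output file."""
    with open(input_path, encoding="utf-8") as src:
        result = process(src.read())
    with open(output_path, "w", encoding="utf-8") as dst:
        dst.write(result)


# Only run as a stage when both file paths were supplied on the command line.
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

A Python script needs an interpreter on every document processor node and falls outside what the slides describe as Microsoft-supported; the same read, transform, write shape carries over directly to a C# executable.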
Pipeline is performance-critical
Pipeline runs in ‘sandbox’ environment
 •   NOT the same type of ‘sandbox’ in O365
 •   File I/O only allowed in C:\Users\<fast service user>\AppData\LocalLow
 •   Maximum of 10 seconds to live
 •   Permissions restricted regardless of FAST Service user permissions
 •   Each Document Processor (DP) is an individual instance
 •   Only one item passes thru a DP at a time
 •   If each document takes 1 second, then 10 DPs can process at best 10 docs/sec
 •   Consider: 1 sec for each of 100K docs, spread across 10 DPs, is ~3 hours!
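The arithmetic behind these figures can be written out as a short sketch. It assumes the document processors run perfectly in parallel and that nothing else is the bottleneck, so treat the result as a best case; `indexing_hours` is a hypothetical helper name.

```python
def indexing_hours(doc_count: int, seconds_per_doc: float, processors: int) -> float:
    """Best-case wall-clock hours when each DP handles one item at a time."""
    docs_per_second = processors / seconds_per_doc
    return doc_count / docs_per_second / 3600


# 100K documents at 1 second each:
print(round(indexing_hours(100_000, 1.0, 10), 1))  # ten DPs
print(round(indexing_hours(100_000, 1.0, 1), 1))   # a single DP
```

With ten document processors the run takes just under three hours; with a single processor the same content takes more than a day, which is why every extra fraction of a second in a custom stage matters.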
Pipeline Hints
MS only supports:
 • Single custom stage (in PipelineConfig.xml)
 • .NET languages (C#, etc)
But:
 • A custom stage can appear in multiple places in PipelineConfig.xml even
   w/ different parameters
 • Theoretically any executable that handles STDIN/STDOUT will do
 • VC#/VC++/VBScript/CMD files seem to work
 • Web services calls are supported
Using web services in Sandbox
[Diagram: a Web Service, a series of pipeline Stages passing XML, and an XML Config.]
Ontolica FAST Management
Ontolica FAST Management provides clear, easy-to-use configuration directly from
within the SharePoint admin GUI. Its management consoles replace XML configuration
files, manual file deployments, and tricky PowerShell configuration.

Key Features:

•    Backup, Manage, & Deploy Configurations
•    Manage FAST Relevance Profiles
•    Upload & Manage Pipeline Extensions
•    Create & Manage JDBC Connections
•    FAST Webcrawler Configuration
•    Manage FAST Server Processes from Central Admin
Additional Resources
• This slide deck live at http://slidesha.re/sCGAaP

• SP2010 ES/FS4SP Blog (Eric Belisle) - http://fs4sp.blogspot.com/

• Enterprise Search Blog (NIE) - http://www.enterprisesearchblog.com/

• Search Unleashed (Len Ocsouza) - http://searchunleashed.wordpress.com/

• ESW Blog - http://www.enterprisesearchwiki.com/wp/

• TechNet/MSDN/Microsoft

• And of course: SurfRay.com (Robert Piddocke & Josh Noble)
Q/A & Contact Details
 Miles Kehoe
 Author of: Professional Microsoft Search
 Miles.kehoe@ideaeng.com
 www.enterprisesearchblog.com
 @miles_kehoe
 mileskehoe


 Robert Piddocke
 Author: Pro SharePoint 2010 Search
 rcp@surfray.com
 @rpiddocke
 R Piddocke
                                ideaeng.com   SurfRay.com

Editor's Notes

  1. By default, two pipelines are defined: Attachments and Office14
  2. http://fs4sp.blogspot.com/2011/05/manipulating-crawled-properties-in-fast.html