architecture overview content ingestion content enrichment advanced enrichment


Upload: robert-roman

Post on 31-Mar-2015


TRANSCRIPT

Page 1: Architecture overview Content ingestion Content enrichment Advanced enrichment
Page 2:

Search Content Enrichment and Extensibility in SharePoint 2013

SPC414

Brent Groom, Senior PFE, Microsoft

Sreedhar Mallangi, Senior Consultant, Microsoft

Page 3:

Session Objectives

Identify content extensibility points
Learn about custom connectors
Learn the basics of content enrichment
Advanced content enrichment
Learn about two community toolkits

Almost all “on-prem”

Page 4:

Agenda

Architecture overview
Content ingestion
Content enrichment
Advanced enrichment

Page 5:

SharePoint 2013 Search Architecture

[Diagram labels: Content; UX; WFE; API; Crawl; Content Processing; Index (FAST Search Index); Query Processing; Analytics Processing; Search Admin. Databases: Crawl, Search Admin, Link, Analytics Reporting. Legend: Public API, Unit of scale/role boundary, Extensibility Points, Query Features.]

Page 6:

SharePoint 2013 Search Architecture

[Diagram labels: Content; UX; WFE; API; Crawl; Content Processing; Index (FAST Search Index); Query Processing; Analytics Processing; Search Admin. Databases: Crawl, Search Admin, Link, Analytics Reporting. Added on this slide: Content Enrichment Web Service, Custom Connectors. Legend: Public API, Unit of scale/role boundary, Extensibility Points.]

Page 7:

Crawl Component

OOB connectors
Extensible through BCS
Local disk cache
Crawled items tracked in Crawl database
Configurations stored in Admin database
Crawl modes: Full Crawl, Incremental Crawl, Continuous Crawl

[Diagram labels: content sources (HTTP, File Shares, SharePoint, User Profiles, Exchange, Lotus Notes, Documentum, Custom, ...); mssearch.exe; Crawl; Content Processing; Index (FAST Search Index); Search Admin. Databases: Crawl, Admin.]

Page 8:

Content Processing Component

Extending content processing

You can customize the search experience through the extensibility points in the content processing flow.

[Diagram: content processing flow between Crawler and Index, with a Web Service Callout to an external Web Service. Operations: Insert, Update, Delete. Stage labels: Parse Documents, IFilter sandbox, Metadata Extraction, Register crawled properties, Detect language, Document summary, Security Descriptors, Map to managed properties, Word breaking, Custom Entity Extraction, Phonetic name variations, Web Service Callout, Links, Analytics.]

Page 9:

Agenda

Architecture overview

Content ingestion
Content enrichment
Advanced enrichment

Page 10:

Why?

Enterprises have many different data sources
We are building Enterprise Search Platforms
Allow users to find the content they are looking for - all sources in one place
Increase productivity

No Search Content API anymore
FAST ESP had a push-based content API

Page 11:

OK! What do we have?

[Diagram labels: Connector; Protocol Handlers; BCS Connector Framework; Default Solutions (Lotus Notes, Exchange public folder, Documentum, File share, SharePoint, Website, People Profile, BCS); Custom solutions.]

Page 12:

What is Business Connectivity Services?

Connects external data sources to SharePoint
Can be used as a search source
Has several flavors

No-Code: OData, SQL
Code: WCF, .NET Assembly

B311@TechEd 2013

Page 13:

Search Indexing Toolkit - SIT

A generic implementation of a Custom SharePoint Indexing Connector
Generic Data Model File
Implements all the complexities of
  Batching – for scalability
  Crawling – Full and Incremental
  Security Trimming – Both Active Directory security and Custom Claims security

Hides all of that behind one single interface

Page 14:

What’s in the package?

Search Indexing Toolkit
  SIT Core Library
  SITModel.xml
  XML Files Indexing Connector
  AdventureWorks Product DB Indexing Connector

Implementing the ISearchConnector interface

With a detailed How-To Guide

Page 15:

SIT XML file connector

Index Any XML File
  The connector can split items on a configurable XML element

Flexible
  All sub-elements are submitted as crawled properties, no need to configure

High Performance
  Testing has shown 100 DPS (documents per second) even on a laptop

Scalable
  Crawl millions of XML files
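The splitting behaviour described on this slide can be illustrated with a small Python sketch. This is not the toolkit's code (SIT is a .NET library): the function name `split_items`, the sample feed, and its element names are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET


def split_items(xml_text, item_element):
    """Yield one dict per <item_element>, mapping each sub-element tag to its text.

    Hypothetical helper that mimics "split on a configurable element and
    submit every sub-element as a property". It is not the SIT implementation.
    """
    root = ET.fromstring(xml_text)
    for item in root.iter(item_element):
        props = {}
        for child in item:
            text = (child.text or "").strip()
            if child.tag in props:
                # Repeated tags are collected into a list so no value is lost.
                existing = props[child.tag]
                if isinstance(existing, list):
                    existing.append(text)
                else:
                    props[child.tag] = [existing, text]
            else:
                props[child.tag] = text
        yield props


# Made-up sample feed; "doc" plays the role of the configurable split element.
SAMPLE = """
<feed>
  <doc><title>Oslo</title><abstract>Capital of Norway</abstract></doc>
  <doc><title>Bergen</title><abstract>City in Norway</abstract></doc>
</feed>
""".strip()

items = list(split_items(SAMPLE, "doc"))
```

Each yielded dict stands in for one crawled item, and no per-property configuration is needed because every sub-element is picked up automatically.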

Page 16:

Demo: Indexing Wikipedia Abstracts

Search Indexing Toolkit

Page 17:

SIT ISearchConnector interface

[Diagram: calls between SIT Core, Your Connector, and the Content Source.]

Initialize
GetAllItems returns [id1, id2, id3..]
GetSpecificItem(id1) returns [id1’s properties]
GetSpecificItemData(id1) returns id1’s data
GetSecurityDescriptorForSpecificItem returns id1’s security descriptor

Other labels: offset, crawlType, changeToken, changeTokenUpdate; itemId, aclmeta, usesPluggableAuth
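The call sequence on this slide can be sketched in Python as follows. The real ISearchConnector is a .NET interface whose exact signatures are not shown in the deck, so the method names, parameters, and return types below are assumptions that only mirror the labels in the diagram.

```python
from abc import ABC, abstractmethod


class SearchConnector(ABC):
    """Python stand-in for the calls named on the slide; signatures are assumed."""

    @abstractmethod
    def initialize(self):
        """Prepare access to the content source."""

    @abstractmethod
    def get_all_items(self):
        """Return the ids of every item to crawl."""

    @abstractmethod
    def get_specific_item(self, item_id):
        """Return the item's properties as a dict."""

    @abstractmethod
    def get_specific_item_data(self, item_id):
        """Return the item's raw content."""

    @abstractmethod
    def get_security_descriptor(self, item_id):
        """Return who may see the item."""


class InMemoryConnector(SearchConnector):
    """Toy connector over a dict, used only to exercise the call sequence."""

    def __init__(self, records):
        self._records = records
        self.ready = False

    def initialize(self):
        self.ready = True

    def get_all_items(self):
        return sorted(self._records)

    def get_specific_item(self, item_id):
        return {"title": self._records[item_id]["title"]}

    def get_specific_item_data(self, item_id):
        return self._records[item_id]["body"].encode("utf-8")

    def get_security_descriptor(self, item_id):
        return self._records[item_id]["allowed"]


def crawl(connector):
    """Drive a connector in the order shown on the slide."""
    connector.initialize()
    crawled = []
    for item_id in connector.get_all_items():
        crawled.append({
            "id": item_id,
            "properties": connector.get_specific_item(item_id),
            "data": connector.get_specific_item_data(item_id),
            "acl": connector.get_security_descriptor(item_id),
        })
    return crawled


# Made-up records for illustration.
records = {
    "id1": {"title": "First", "body": "hello", "allowed": ["DOMAIN\\alice"]},
    "id2": {"title": "Second", "body": "world", "allowed": ["DOMAIN\\bob"]},
}
result = crawl(InMemoryConnector(records))
```

The point of the design is that a connector author only fills in the five methods, while the driver (here `crawl`, in the toolkit SIT Core) owns enumeration order and everything else.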

Page 18:

Content source supports NTLM?
  Pass-through the security descriptor
  Item level security: tag each document with an NTLM security descriptor

Otherwise…
  Need to map to NTLM and create security descriptors
  If no NTLM available, use Custom claims
  Implement Custom claims provider or security trimmer
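The decision on this slide can be written down as a tiny Python function. The function name, its two boolean inputs, and the returned strings are assumptions for illustration; they only restate the slide's branches.

```python
def choose_security_strategy(source_supports_ntlm, identities_map_to_ntlm):
    """Pick an item-level security approach following the slide's branches."""
    if source_supports_ntlm:
        # The source already speaks NTLM, so its descriptor can be reused.
        return "pass through the NTLM security descriptor"
    if identities_map_to_ntlm:
        # Source identities can be translated to Windows accounts.
        return "map to NTLM and create security descriptors"
    # No NTLM identities at all: fall back to claims-based trimming.
    return "custom claims provider or security trimmer"


strategy = choose_security_strategy(False, False)
```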

Page 19:

Live Use cases

Crawling XML files generated from 3rd party sources
SQL Server with security trimming
SQL Server with related BLOB on file share

Page 20:

SIT Takeaways

SIT reduces the complexity to create SharePoint Search connectors
Enhance the Search experience
SIT back and relax!

Page 21:

Agenda

Architecture overview
Content ingestion

Content enrichment
Advanced enrichment

Page 22:

Business Use Cases

Add DB or ERP meta-data into search results
Clean-up or reformat existing properties to facilitate search
Label documents that contain known patterns
Tag documents that violate corporate policy
Copy data from one managed property to another (including a type change)

What are your customers trying to do?
What would your customers like to do?

Page 23:

Content Enrichment Web Service (CEWS)

Web service hosted outside of SharePoint
Replaces SharePoint 2010 Pipeline Extensibility executable
Optimized for performance (no need to read/write XML files, start a new process, etc.)
Input/output managed properties

[Diagram: Crawler, Content Processing, Index. Content Processing calls the Web Service with ProcessItem(Item) and receives a ProcessedItem.]
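The request/response shape of ProcessItem(Item) returning a ProcessedItem can be modelled in a few lines of Python. A real CEWS is a WCF service, so this is only a sketch of the contract: the dataclass fields and the `AuthorNormalized` output property are assumptions invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """Input: managed property name -> value, as sent by content processing."""
    item_properties: dict


@dataclass
class ProcessedItem:
    """Output: only the properties the service wants to add or overwrite."""
    item_properties: dict = field(default_factory=dict)


def process_item(item):
    """Clean up the Author property and return it under a hypothetical new name."""
    processed = ProcessedItem()
    author = item.item_properties.get("Author")
    if author:
        # Collapse stray whitespace and fix casing.
        cleaned = " ".join(author.split()).title()
        processed.item_properties["AuthorNormalized"] = cleaned
    return processed


reply = process_item(Item({"Author": "  brent   GROOM ", "Title": "Deck"}))
```

The service returns only what it changed, which matches the "input/output managed properties" split on the slide.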

Page 24:

CEWS Configuration

Endpoint: URL of web service
Input properties: Managed properties passed in
Output properties: Managed properties that can be returned
Include raw data?: Optionally include raw data (read only)
Debug mode: Sends all input properties, ignores all output properties
Error mode: Warning or Error. In Error mode, failing items are dropped
Trigger: Test to determine if enrichment should be called (per document)

Register with Search Service Application via PowerShell
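How these settings interact per document can be sketched as a small Python simulation. This is a reading of the slide and not SharePoint's implementation: the class, the field names, the endpoint URL, and the exact debug-mode behaviour (send every property, discard whatever comes back) are assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EnrichmentConfig:
    endpoint: str                    # hypothetical URL, unused by the simulation
    input_properties: list
    output_properties: list
    debug_mode: bool = False
    drop_on_failure: bool = False    # False models Warning mode, True models Error mode
    trigger: Callable[[dict], bool] = lambda doc: True


def enrich(doc, config, call_service):
    """Return the document to index, or None if the item is dropped."""
    if not config.trigger(doc):
        return doc  # trigger said no: the service is never called
    if config.debug_mode:
        payload = dict(doc)  # assumed: debug mode sends everything
    else:
        payload = {k: doc[k] for k in config.input_properties if k in doc}
    try:
        returned = call_service(payload)
    except Exception:
        return None if config.drop_on_failure else doc
    if config.debug_mode:
        return doc  # debug mode ignores all output properties
    accepted = {k: v for k, v in returned.items() if k in config.output_properties}
    return {**doc, **accepted}


sent = []


def fake_service(payload):
    """Stand-in for the web service; records what it was sent."""
    sent.append(payload)
    return {"Department": "HR", "Unlisted": "ignored"}


config = EnrichmentConfig(
    endpoint="http://localhost:8080/enrich",
    input_properties=["Author"],
    output_properties=["Department"],
    trigger=lambda doc: doc.get("FileExtension") == "pdf",
)
pdf = {"FileExtension": "pdf", "Author": "a", "Title": "t"}
enriched = enrich(pdf, config, fake_service)
```

Only the declared input properties reach the service, and only the declared output properties are merged back into the document.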

Page 25:

Average number of milliseconds spent on content enrichment

Page 26:

6 more things you need to know about CEWS

1. Properties must exist when you register
2. Property names are case sensitive
3. Cannot use property aliases
4. Some standard properties can be confusing (DisplayAuthors vs Author)
5. Some properties are read-only (body!)
6. Single web service per Search Application

Page 27:

Agenda

Architecture overview
Content ingestion
Content enrichment

Advanced enrichment
  Challenges and techniques

Page 28:

Doing it in production: the challenges

Scale-out
  Increase capacity to match farm
  Large topology ≈ 144 flow instances

Fault tolerance
  Survive hardware failures without loss of functionality

Service aggregation
  Multiple enrichment tasks to support disparate content sources

Page 29:

Doing it in production: techniques

WCF Routing
  Introduced in .NET 4.0
  100% declarative, configured in Web.config XML
  Applies XPath filters against request to determine destination endpoint
  Supports backup destination endpoints to achieve Fault Tolerance

Load Balancing
  Hide multiple endpoints behind a load balancer to provide Scale and Fault Tolerance

“Localhost”
  Register web service on localhost and run instance on each content processing node
  Scales with content processing
  Provides Fault Tolerance with that content processing node

http://aka.ms/Pqkjjj
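The idea behind filter-based routing with backup endpoints can be shown with a short Python sketch. WCF Routing itself is configured declaratively in Web.config, so this only illustrates the concept; the function names, the predicate-based filters (standing in for XPath filters), and the node URLs are all assumptions.

```python
def route(request, routes, send):
    """Send `request` to the first healthy endpoint of the first matching route.

    routes: list of (predicate, [primary, backup, ...]) pairs.
    send:   callable(endpoint, request) that raises ConnectionError on failure.
    """
    for matches, endpoints in routes:
        if not matches(request):
            continue
        last_error = None
        for endpoint in endpoints:
            try:
                return send(endpoint, request)
            except ConnectionError as err:
                last_error = err  # fall through to the next backup endpoint
        raise RuntimeError("all endpoints failed for matched route") from last_error
    raise LookupError("no route matched the request")


def fake_send(endpoint, request):
    """Stand-in transport: the hypothetical node-a is always down."""
    if endpoint == "http://node-a/enrich":
        raise ConnectionError("node-a is down")
    return f"{endpoint} handled {request['source']}"


# Hypothetical filter table: wiki content goes to node-a with node-b as backup.
ROUTES = [
    (lambda r: r["source"] == "wiki", ["http://node-a/enrich", "http://node-b/enrich"]),
    (lambda r: True, ["http://node-c/enrich"]),
]
answer = route({"source": "wiki"}, ROUTES, fake_send)
```

The filter picks the destination (service aggregation), and the ordered endpoint list gives fault tolerance when the primary is unreachable.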

Page 30:

Agenda

Architecture overview
Content ingestion
Content enrichment

Advanced enrichment
  CEWS Pipeline Toolkit

Page 31:

CEWS Pipeline Toolkit

Enhance Search Index
  Document markup
  Entity extraction

Architecture
  WCF
  XML config

Hides the complexities of
  Scalability
  Service aggregation
  Conditional processing

Powerful framework for content enrichment

Page 32:

CEWS Pipeline Toolkit – What does it do?

Extract entities
  String matching
  Regular Expressions
  Dictionary-based

Normalize
Manipulate strings

Access external repositories
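Regular-expression and dictionary-based extraction, two of the techniques listed here, can be sketched in Python. This is not toolkit code: the e-mail pattern, the `PRODUCTS` dictionary, and the output keys are assumptions chosen for illustration.

```python
import re

# Simple e-mail pattern; adequate for a demo, not a full address validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical dictionary: lowercase surface form -> normalized entity name.
PRODUCTS = {"sharepoint": "SharePoint", "fast esp": "FAST ESP"}


def extract_entities(text):
    """Return e-mail addresses (regex) and product names (dictionary lookup)."""
    emails = sorted(set(EMAIL_RE.findall(text)))
    lowered = text.lower()
    products = sorted(name for key, name in PRODUCTS.items() if key in lowered)
    return {"emails": emails, "products": products}


entities = extract_entities("Contact admin@contoso.com about SharePoint and fast ESP")
```

The dictionary lookup also normalizes: whatever casing appears in the text, the canonical name is what ends up in the extracted property.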

Page 33:

CEWS Pipeline Toolkit – What’s in the package?

Framework for document analysis
  Solves majority of customer business use cases
  Packaged with over 55 pipeline stages
  Configurable document routing

Platform support
  SharePoint 2013 Enterprise Search
  FAST Search for SharePoint 2010
  Stand-alone

Easy to install, easy to customize
  Visual Studio 2012 & .NET 4.5 Framework
  Inherit from AbstractDocumentProcessor class

Detailed documentation on TechNet Wiki – Help the community

Page 34:

CEWS Pipeline Toolkit architecture

[Diagram: Crawler, Content Processing, Index. Content Processing calls the Web Service with ProcessItem(Item) and receives a ProcessedItem. Additional labels: CEWS Pipeline Toolkit, Pipeline config XML, Initialize.]
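An XML-configured pipeline of stages, as the diagram suggests, might look like the following Python sketch. The toolkit's real configuration schema and stage classes are not shown in the deck, so the `<pipeline>`/`<stage>` format, the stage names, and the property names here are all assumptions.

```python
import xml.etree.ElementTree as ET


def normalize_title(doc):
    """Stage: collapse repeated whitespace in the Title property."""
    return {**doc, "Title": " ".join(doc.get("Title", "").split())}


def tag_policy_violation(doc):
    """Stage: flag documents whose body mentions 'confidential'."""
    flagged = "confidential" in doc.get("body", "").lower()
    return {**doc, "PolicyViolation": flagged}


# Registry mapping names used in the config file to stage functions.
STAGES = {
    "normalize_title": normalize_title,
    "tag_policy_violation": tag_policy_violation,
}


def load_pipeline(config_xml):
    """Read the ordered list of stages from a (hypothetical) XML config."""
    root = ET.fromstring(config_xml)
    return [STAGES[stage.get("name")] for stage in root.findall("stage")]


def run_pipeline(pipeline, doc):
    """Pass the document through every stage in order."""
    for stage in pipeline:
        doc = stage(doc)
    return doc


CONFIG = '<pipeline><stage name="normalize_title"/><stage name="tag_policy_violation"/></pipeline>'
pipeline = load_pipeline(CONFIG)
result = run_pipeline(pipeline, {"Title": "  Q3   report ", "body": "Company Confidential"})
```

Keeping the stage order in configuration rather than code is what lets a single web service aggregate several enrichment tasks.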

Page 35:

Demo
  Wikipedia category
  Total population

CEWS Pipeline Toolkit

Page 36:

Future – Community Effort

Data: Wikipedia, Fileshare, DB – Adventure Works, Web Services

Display: Custom Search Center, Search App

Deploy: Demo, POC, Dev, QA, Production

Page 37:

CEWS and SIT – Join the community effort

Canned prototypes for search POCs
Several sample scenarios to leverage in your project
Simple to deploy and use
Production ready

Page 38:

How to get these tools

MCS Contact
Premier Contact
Public Available Date
  Going through the legal process. Will be made available publicly once approved.

Page 39:

In Review: Session Objectives

Identified extensibility points in content acquisition
Saw how to customize the content processing pipeline via code callout
Learned how to use SIT
Dove into advanced content enrichment topics (CEWS Pipeline Toolkit)

Page 40:

Search Related Sessions

See you at the Search booths & Search tables at Ask the Experts WED @6:15!

Session | Code | Room | Time
Develop Advanced Search-Driven SharePoint 2013 Apps | SPC402 | Palazzo I, J | Tue 1:45pm
Best practices for Hybrid Search deployments | SPC306 | Veronese 2401 | Tue 5:00pm
SharePoint 2013 Search Analytics | SPC340 | Palazzo M, N | Wed 9:00am
How to manage and troubleshoot Search: A practical guide | SPC375 | Veronese 2401 | Wed 10:45am
6 Proven Steps to Get the Best Out of Search in SharePoint 2013 | SPC265 | Delphino 4001 | Wed 1:45pm
Best practices for Information Architecture and Enterprise Search | SPC207 | Veronese 2401 | Wed 1:45pm
Search content enrichment and extensibility in SharePoint 2013 | SPC414 | Palazzo K, L | Wed 1:45pm
Customizing Search experiences with Azure Hosted Data and Bing Maps | SPC321 | Veronese 2401 | Wed 3:15pm
Futuristic Search applications using Kinect and Yammer! | SPC405 | Palazzo M, N | Wed 3:15pm
Search architecture and sizing in SharePoint 2013 | SPC336 | Titian 2201 | Wed 5:00pm
Effective Search deployment and operations in SharePoint 2013 | SPC360 | Veronese 2401 | Thu 9:00am
SharePoint 2013 Search display templates and query rules | SPC322 | Palazzo M, N | Thu 9:00am
Managing Search Relevance in SharePoint 2013 and O365 | SPC382 | Veronese 2401 | Thu 12:00pm

Page 41:

MySPC

Sponsored by

connect. reimagine. transform.

Evaluate sessions on MySPC using your laptop or mobile device: myspc.sharepointconference.com

Page 42:

© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.