alm search presentation for the vss arch council

ALM Search Sunita Shrivastava 6/20/2014 Microsoft Confidential 1

Upload: sunita-shrivastava

Post on 09-Feb-2017


Page 1: ALM Search Presentation for the VSS Arch Council

Microsoft Confidential 1

ALM Search

Sunita Shrivastava

6/20/2014

Page 2: ALM Search Presentation for the VSS Arch Council


ALM Search
• Start with Code Search, but eventually support search for other artefacts

Agenda
• Discuss the current architecture and concerns
• Share the investigations
• Share the learnings
• Get feedback on open design issues

Page 3: ALM Search Presentation for the VSS Arch Council


Indexing Engine Choices
• BING and Elastic Search
• Our requirements:
  • Search: Code Element Search, Phrase Search, AND/OR/NOT Search, Wildcard Search, Faceting, Highlighting, Pagination
  • Indexing: support for continuous indexing, performance, scale-out feasibility, real-timeness, different schemas
• Our evaluation is shared at: https://microsoft.sharepoint.com/teams/EngSys/Documents/Modernizing%20Our%20Engineering%20System%20AOI/Search/TechEval/ElasticSearch/Tech%20Eval%20Summary.pptx?web=1

ES observations so far in the context of Code Search
• Schema-less
  • Multiple artefacts can be stored in the same index
  • Can deal with changes in the data schema of the artefact
• Main value add of ES over Lucene is aggregation!
  • If you need aggregation of search across different indexes, aim to use the aggregator of ES!
  • This means that sharing a large ES cluster is likely the right thing to do for search aggregation across VSO artefacts
• Highly extensible
  • Code Element Search: move from Nested Documents to Custom Analyzer
  • Highlighting: ES allows the REST APIs to be extended/added; we chose a custom query extension mechanism
  • Azure Search Service, though layered on ES, hid these mechanisms, making it intractable for code element search
• Feeding ES
  • For large-scale indexing, being able to feed ES fast enough is important, so the scalability of the pipeline is important
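To make the search requirements above concrete, here is a minimal sketch of an Elasticsearch request body that covers phrase, AND/OR/NOT and wildcard search, faceting, highlighting, and pagination. It is built as a plain Python dict so it does not depend on any client library. The field names (`content`, `repository`, `extension`) and the function name are assumptions made for illustration; they are not taken from the deck, and code element search (which the deck handles with a custom analyzer) is not covered here.

```python
from typing import Any, Dict, List, Optional


def build_code_search_body(
    query_text: str,
    page: int = 0,
    page_size: int = 20,
    facet_fields: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build a search request body as a plain dict (hypothetical field names)."""
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")

    body: Dict[str, Any] = {
        # The query_string query understands AND/OR/NOT, "quoted phrases"
        # and wildcards such as crawl*.
        "query": {
            "query_string": {
                "query": query_text,
                "default_field": "content",  # assumed field holding file text
            }
        },
        # Pagination: skip `from` hits, then return `size` hits.
        "from": page * page_size,
        "size": page_size,
        # Highlighting: ask for matching fragments of the content field.
        "highlight": {"fields": {"content": {}}},
    }
    if facet_fields:
        # Faceting: one terms aggregation per requested field.
        body["aggs"] = {
            f"by_{name}": {"terms": {"field": name}} for name in facet_fields
        }
    return body


if __name__ == "__main__":
    example = build_code_search_body(
        '"index alias" AND (crawl* OR feed) NOT test',
        page=2,
        page_size=10,
        facet_fields=["repository", "extension"],
    )
    print(example)
```

The resulting dict could be serialized to JSON and posted to a search endpoint; how the deck's own query extension mechanism shapes the request is not described in the slides.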

Page 4: ALM Search Presentation for the VSS Arch Council


High level Architecture

[Diagram: two datacenters, A and B. Labels in each: VSO Service; Search Service; Search Service Front End; Search Service Backend; REST API; Web UX (Datacenter A); Query Pipeline; Crawl/Parse/Feed Pipeline; Index Servers; Mapper Data]

Page 5: ALM Search Presentation for the VSS Arch Council


Planned Service Architecture
• XSS scripting and circular dependency problems during build force the Search Client
• This is going to be more and more common as more standalone services come into existence
• Thanks to Patrick/Phecda for this picture

Page 6: ALM Search Presentation for the VSS Arch Council


Deployment for MsEng
Thanks to Sharad Agarwal for this slide!
• Elastic Search cluster (Indexer)
  • 3 (Master + Query) nodes (A2)
  • 3 Data nodes (A5)
  • Probably 1 Marvel node (A2); need data from AppInsight’s team
• Search Service (CPF + Query)
  • 3 Job Agent nodes (A2)
  • 3 App Tier nodes (A2)
  • Config DB (SQL Azure)
  • 1 Azure Storage account
• Portal UX in TFS
• The Search Service, ES, and Marvel cluster are all within a VNET
  • The Search Service talks to the ES query/ingestion nodes through an ILB
  • This helped take care of DNS issues

Page 7: ALM Search Presentation for the VSS Arch Council


Logging, Diagnostics and Monitoring
• Logging
  • All our code will be instrumented, including the code inside ES. Developers can get these logs.
• Diagnostics
  • Each team provides diagnostics data: higher-level data that gives insight into the usage/activities happening in the context of the component.
  • Query Pipeline Telemetry
    • Total number of queries
    • Successful queries
    • Failed queries
    • Slow queries
  • Portal Telemetry
    • Total number of queries
    • Queries that don’t result in a click on the facet or result page in the top 20 results
    • Queries that result in a click beyond 20 results
  • Search usage per account
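The query pipeline counters listed above (total, successful, failed, slow) could be tracked with something as small as the following sketch. The class name and the 60 ms slow-query threshold are assumptions for illustration; the slides do not say what counts as a slow query.

```python
from dataclasses import dataclass


@dataclass
class QueryTelemetry:
    """In-memory counters for the query pipeline (illustrative only)."""

    slow_threshold_ms: float = 60.0  # assumed cut-off, not from the deck
    total: int = 0
    successful: int = 0
    failed: int = 0
    slow: int = 0

    def record(self, duration_ms: float, ok: bool) -> None:
        """Record one finished query and update every counter it affects."""
        self.total += 1
        if ok:
            self.successful += 1
        else:
            self.failed += 1
        # A query can be both successful and slow, so this is counted separately.
        if duration_ms > self.slow_threshold_ms:
            self.slow += 1


if __name__ == "__main__":
    telemetry = QueryTelemetry()
    telemetry.record(12.0, ok=True)
    telemetry.record(95.0, ok=True)
    telemetry.record(5.0, ok=False)
    print(telemetry)
```

In a real service these values would be emitted to whatever telemetry store the team uses rather than kept in process memory.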

Page 8: ALM Search Presentation for the VSS Arch Council


Diagnostics, Monitoring (cont)
• Indexing Telemetry
  • Storage used for temporary data (Blobs)
  • Storage used for entity state data (Tables?)
  • Storage used for meta data
  • Storage used for provisioning data
  • Amount of data (MBytes) indexed in the last one hour
  • Number of commits handled in the last one hour
  • Number of pending tasks
  • Number of pending pipelines
  • Number of pending commits
  • Cold Start Summary
• Monitoring
  • Use Marvel

Page 9: ALM Search Presentation for the VSS Arch Council


Query Pipeline
• Quick, low overhead
• Query Builder uses the Arriba Parser; NEST is used to talk to Elastic Search
• Mapper (not required): the scope of search defines a unique ES index alias, which refers to the appropriate indices
• Security Trimming (cause of concern): for TFVC at file level, for Work Items at area level

[Diagram: Query Pipeline. Labels: REST Endpoint; Repo Access/Auth Mgmt; Query Builder (Format Checking, Query Parsing); Security Trimmer (only for TFVC); Aggregator; Mapper; FileHash Mapper (for dedup); Highlighter AddIn; Indexor ElasticSearch]
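Two ideas from this slide lend themselves to a short sketch: deriving an index alias from the search scope, and trimming results the caller is not allowed to read. The alias naming scheme, the `codesearch_` prefix, the hit shape, and the `can_read` callback are all hypothetical; the deck does not describe how aliases are named or how permissions are checked.

```python
import re
from typing import Callable, Dict, Iterable, List, Optional


def alias_for_scope(
    account: str, project: Optional[str] = None, repo: Optional[str] = None
) -> str:
    """Derive an alias name from the search scope (assumed naming scheme)."""
    parts = [p for p in (account, project, repo) if p]
    # Lower-case each part and collapse anything outside [a-z0-9] to '-'.
    cleaned = [re.sub(r"[^a-z0-9]+", "-", p.lower()).strip("-") for p in parts]
    return "codesearch_" + "_".join(cleaned)


def trim_hits(
    hits: Iterable[Dict[str, str]], can_read: Callable[[str], bool]
) -> List[Dict[str, str]]:
    """Drop hits whose file path the caller may not read (file-level trimming)."""
    return [hit for hit in hits if can_read(hit["path"])]


if __name__ == "__main__":
    print(alias_for_scope("MsEng", "VSOnline", "Search Service"))
    hits = [{"path": "$/Proj/a.cs"}, {"path": "$/Secret/b.cs"}]
    print(trim_hits(hits, lambda path: not path.startswith("$/Secret/")))
```

Sanitizing names this way can map two different scopes to the same alias, so a real scheme would more likely key the alias on stable identifiers. Trimming after the query also means a page can come back with fewer results than requested, which may be part of why the slide flags security trimming as a concern.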

Page 10: ALM Search Presentation for the VSS Arch Council


Query Pipeline Component Diagram
Thanks to Bittu and Neeraj for this diagram!

[Diagram labels: Search UI; REST API for Query Interaction; Search String & Filters; Query Builder; Query String Parser; Search Query Backend; Query Executor; Query Monitor; OI; ES Client; ES Cluster; Custom Query; Custom Analyzer; Custom Highlighter; ES Search Results; Search Response; Repo Access Management/Authentication; TFS GIT Repo]

Page 11: ALM Search Presentation for the VSS Arch Council


Security
• Three options
  • Use Remote Security Namespaces for caching artefact permissions
  • GIT + WIT: index-level permissions
  • Mostly open model

Page 12: ALM Search Presentation for the VSS Arch Council


Indexing Pipeline
• Currently built on the VSSF Framework
• Backend REST APIs: seminal objects are Tasks, Pipelines, Entities
  • E.g. of a task creation: a request to perform an indexing-pipeline-related operation X (e.g. reindex, start, stop) on some Entity results in the creation of a task
  • Tasks create pipelines; a task is completed when all the pipelines it spawned are finished

[Diagram: Indexing Pipeline. Labels: BE REST Endpoint; Crawl; Meta Data Analysis; Cold Start Index Prep; Index Provisioning; Parse; Feed; Indexor ElasticSearch; Ready Index for Query; Mapper Update; Crawl, Parse, Feed; Cold Start Cleanup; Dedup Detection (opt)]
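The relationship between Tasks, Pipelines, and Entities described on this slide can be sketched as a small data model. The class and field names below are assumptions for illustration; only the rule that a task is complete when all of its spawned pipelines have finished comes from the slide.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class PipelineState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    FINISHED = "finished"


@dataclass
class Pipeline:
    name: str
    state: PipelineState = PipelineState.PENDING


@dataclass
class Task:
    operation: str  # e.g. "reindex", "start", "stop"
    entity: str     # hypothetical identifier of the entity being operated on
    pipelines: List[Pipeline] = field(default_factory=list)

    def spawn(self, name: str) -> Pipeline:
        """Create a pipeline owned by this task."""
        pipeline = Pipeline(name)
        self.pipelines.append(pipeline)
        return pipeline

    @property
    def completed(self) -> bool:
        # Assumption: a task that has not spawned anything yet is not complete.
        return bool(self.pipelines) and all(
            p.state is PipelineState.FINISHED for p in self.pipelines
        )


if __name__ == "__main__":
    task = Task(operation="reindex", entity="repo:example")
    crawl = task.spawn("crawl-parse-feed")
    print(task.completed)
    crawl.state = PipelineState.FINISHED
    print(task.completed)
```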

Page 13: ALM Search Presentation for the VSS Arch Council


Indexing Pipeline Component Design

[Diagram labels: TFS Commit Sync; TFS Account Sync; Re-indexer; Job Scheduler; CPF Arbitrator; Crawler Abstraction Layer; Crawler Extensions; Parser; Parser Extensions; Feeder; ES Wrapper; ES Map and Topology Configurator; ES Extensions (Custom Analyzer/Plugins); De-dup; Multi-tenancy; Index Monitor; OI; Logger/Telemetry; Repo Content DB Abstraction Layer; Parser DB Abstraction Layer; Data; ES Cluster]

Thanks to Tapas for this diagram!

Page 14: ALM Search Presentation for the VSS Arch Council


Indexing Pipeline (cont)
• Cold Start Crawl Spec:
  • For GIT, the ‘default branch’ is enabled for indexing by default
    • Others will need to be whitelisted explicitly
    • The TFS repo has many topic and feature branches
    • Need closure on the UX experience for this
  • For TFVC: TBD
  • For Work Items: TBD
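The Git rule above (index the default branch, plus any branch that has been explicitly whitelisted) is simple enough to express directly. The function name and the branch names in the example are hypothetical.

```python
from typing import Iterable, List


def branches_to_index(
    all_branches: Iterable[str],
    default_branch: str,
    whitelist: Iterable[str] = (),
) -> List[str]:
    """Return the branches to crawl on cold start, preserving input order."""
    allowed = {default_branch} | set(whitelist)
    return [branch for branch in all_branches if branch in allowed]


if __name__ == "__main__":
    branches = ["master", "features/search", "users/someone/topic"]
    print(branches_to_index(branches, "master"))
    print(branches_to_index(branches, "master", whitelist=["features/search"]))
```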

Page 15: ALM Search Presentation for the VSS Arch Council


Performance Summary
• For up to 5 million files, 90% of queries remained under 60 msec
• The Feeder ran into issues quickly on A2 configurations because of low memory
• By storing only term vectors and not the file content, the performance came down from a range of ~1.5 msec to 20 msec on A5 configurations
• Following in progress:
  • Multiple smaller indexes on the same node
  • Queries during continuous indexing
  • Indexing performance with multiple replicas
  • Multi-index search
• Detailed analysis available at https://microsoft.sharepoint.com/teams/EngSys/_layouts/15/WopiFrame.aspx?sourcedoc={83CFFEAA-1C78-46FA-BFE7-9D3E36DCA3CA}&file=Sprint66_PerftAnalysis-1.pptx&action=default

Page 16: ALM Search Presentation for the VSS Arch Council


UX
• Requirements:
  • Search UI needs to be uncluttered and simple
  • The user should not lose the context of what they were doing
  • The experience should be largely similar for searching different artefacts
• Sharepoint has a precedent for multi-artefact search
  • Search launches a different page
  • Seems like a reasonable model to follow

Page 17: ALM Search Presentation for the VSS Arch Council


Indexing Pipeline (cont)
• Crawler strategy: the current plan is to use LibGit2Sharp
  • The following methods were compared:
    • Crawl file by file with GitHttpClient (current implementation)
    • Download zipped trees using GitHttpClient
    • Clone a repo using the Git command line
    • LibGit2Sharp
  • https://microsoft.sharepoint.com/teams/EngSys/_layouts/15/WopiFrame.aspx?sourcedoc={43E4F3C7-D54E-432E-BDF7-33F96A912E58}&file=Git%20Repos%20Crawl%20Option%20Comparison.docx&action=default
  • Implications: the entire Git repo is brought down to Azure storage (Blob Store)
• To dedup or not to dedup
  • The TFS repo on mseng has ~35 feature branches and ~300 scope branches
  • Results (10 million files, 0.4 million unique files, duplication ratio 1:25, which is what is seen in Windows SD depots/branches):
    • No deduplication: 60 GB index size, 19 hours indexing time, 9 ms avg query time
    • Single document: ~3 GB, 50 minutes, 3.7 ms
    • Parent-child mappings: 11 GB, ~1 hour 50 min, 122 ms
  • https://microsoft.sharepoint.com/teams/EngSys/_layouts/15/WopiFrame.aspx?sourcedoc={A19453EF-9EB8-447C-A2E0-85C245D4CA79}&file=Summary%20of%20Deduplication%20Effort.docx&action=default
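One plausible reading of the "single document" option is that each unique file content becomes one indexed document that lists every path (across branches) where that content appears. The sketch below illustrates that idea with a content hash as the document key. The function name, the document shape, and the choice of SHA-1 are assumptions for illustration; the deck does not specify how the FileHash Mapper computes or stores hashes.

```python
import hashlib
from typing import Any, Dict, Iterable, Tuple


def dedup_by_content(files: Iterable[Tuple[str, bytes]]) -> Dict[str, Dict[str, Any]]:
    """Group files by content hash: one document per unique content."""
    docs: Dict[str, Dict[str, Any]] = {}
    for path, content in files:
        digest = hashlib.sha1(content).hexdigest()
        # First sighting creates the document; later sightings only add a path.
        doc = docs.setdefault(digest, {"content": content, "paths": []})
        doc["paths"].append(path)
    return docs


if __name__ == "__main__":
    files = [
        ("main/a.cs", b"class A {}"),
        ("feature1/a.cs", b"class A {}"),
        ("main/b.cs", b"class B {}"),
    ]
    for digest, doc in dedup_by_content(files).items():
        print(digest[:8], doc["paths"])
```

For scale, the numbers on the slide work out as follows: 10 million files over 0.4 million unique files is the stated 1:25 ratio, and going from no deduplication to a single document shrinks the index from 60 GB to about 3 GB (roughly 20x) and the indexing time from 19 hours to 50 minutes (roughly 23x).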

• Backend APIs: for diagnostics, dealing with corruption, etc.
  • Seminal objects are Tasks, Pipelines, Entities
  • E.g. of a task creation: a request to perform an indexing-pipeline-related operation X (e.g. reindex, start, stop) on some ‘Entity’ results in the creation of a task
  • Tasks create pipelines; a task is completed when all the pipelines it spawned are finished

Page 18: ALM Search Presentation for the VSS Arch Council


Indexing Pipeline Scaleout
• We want to host indexes for different artefacts on the same ES cluster
  • This will enable search aggregation through ES
  • This opens up several interesting scenarios in future
• Scale-out and isolation for different pipelines based on Job Infra is not possible
• To leverage efforts across teams
  • Implies that the Crawl/Parse/Feed pipeline should be generalized
  • Potentially we might want to think about extensibility in the query pipeline as well
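One way to picture a generalized Crawl/Parse/Feed pipeline is a driver that knows nothing about the artefact type and takes the three stages as plug-in callables. The type aliases, the `run_pipeline` name, and the batching behaviour below are assumptions for illustration, not the team's design.

```python
from typing import Any, Callable, Dict, Iterable, List

# Each artefact type (code, work items, ...) would supply its own three stages.
Crawler = Callable[[str], Iterable[Any]]   # entity id -> raw items
Parser = Callable[[Any], Dict[str, Any]]   # raw item -> indexable document
Feeder = Callable[[List[Dict[str, Any]]], None]  # batch of documents -> index


def run_pipeline(
    entity: str,
    crawl: Crawler,
    parse: Parser,
    feed: Feeder,
    batch_size: int = 100,
) -> int:
    """Crawl an entity, parse each item, and feed documents in batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[Dict[str, Any]] = []
    fed = 0
    for item in crawl(entity):
        batch.append(parse(item))
        if len(batch) >= batch_size:
            feed(batch)
            fed += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        feed(batch)
        fed += len(batch)
    return fed


if __name__ == "__main__":
    received: List[List[Dict[str, Any]]] = []
    count = run_pipeline(
        "repo:example",
        crawl=lambda entity: (f"{entity}/file{i}.cs" for i in range(5)),
        parse=lambda path: {"path": path},
        feed=received.append,
        batch_size=2,
    )
    print(count, [len(b) for b in received])
```

Keeping the driver artefact-agnostic is what would let other teams reuse it by supplying only their own crawler and parser.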

Page 19: ALM Search Presentation for the VSS Arch Council


ALM Search Deployment Topology

[Diagram labels: Load Balancer; ALM Search Service (AT, Job); AT; Job; Private Network; Internal Load Balancer; ES; TR Data Nodes; Search Data Nodes; TR Query/Indexing Nodes; Search Query/Indexing Nodes; Shared Master Nodes]

Page 20: ALM Search Presentation for the VSS Arch Council


Cross Account Search and Public Repositories
• There is a desire to include all public repositories in Search, either by default or as an option to the user
• How will VSO support the notion of a public repository?
• Will there be public accounts?

Page 21: ALM Search Presentation for the VSS Arch Council


On Premise and Cloud Search Federation
• Sharepoint and Office 365 have a precedent, supporting 3 models based on OAuth
  • http://msdn.microsoft.com/en-us/library/dn155905.aspx
• Three models for federation
  • Outbound: searching on the portal for the on-premise service endpoint will return results from the cloud as well
  • Inbound: searching on the portal for the cloud service returns results from the on-premise TFS indexes as well
  • Both ways: search is symmetric
• Look out for more details in this space next time!

[Diagram: search federation. Labels: VS IDE; VSO Web UX (twice); Code Search Service (twice); Aggregator (twice); Cloud; Repository (four times); Indexor a; Indexor b; Indexor c]
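The Aggregator boxes in the diagram suggest merging ranked results from several indexers, for example one on premise and one in the cloud. The sketch below shows one naive way to do that, assuming each source returns hits already sorted by descending score and that scores from different sources are comparable. That comparability is a strong assumption that the deck does not address, and the hit shape is hypothetical.

```python
import heapq
from itertools import islice
from typing import Any, Dict, Iterable, List


def merge_results(
    result_lists: Iterable[List[Dict[str, Any]]], limit: int = 20
) -> List[Dict[str, Any]]:
    """Merge per-source hit lists (each sorted by descending score)."""
    merged = heapq.merge(
        *result_lists, key=lambda hit: hit["score"], reverse=True
    )
    # Only the first `limit` hits are materialized.
    return list(islice(merged, limit))


if __name__ == "__main__":
    cloud = [{"id": "c1", "score": 3.0}, {"id": "c2", "score": 1.0}]
    on_prem = [{"id": "p1", "score": 2.5}, {"id": "p2", "score": 0.5}]
    print([hit["id"] for hit in merge_results([cloud, on_prem], limit=3)])
```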

Page 22: ALM Search Presentation for the VSS Arch Council


Futures
• Semantic Search
• OSS Search Requirements
• Extensions to Code Search for Test Cases

Page 23: ALM Search Presentation for the VSS Arch Council


Appendix

Page 24: ALM Search Presentation for the VSS Arch Council


Perf Testing on 1 Node for up to 4 M Files

Page 25: ALM Search Presentation for the VSS Arch Council


Indexing Rate Analysis (Thanks to Perf Crew)

[Chart: Indexing Rate. X axis: Files Indexed (10K, 100K, 500K, 1M, 2M, 3M, 4M). Y axis: Docs Indexed/sec (0 to 350). Series: A5-1N1S, A5-1N3S, A5-1N5S, A6-1N1S, A6-1N3S, A6-1N5S, A7-1N1S, A7-1N3S, A7-1N5S]

Setup
• A5: 6 GB allocated to JVM heap.
• A6: 12 GB allocated to JVM heap.
• A7: 20 GB allocated to JVM heap.
• Feeder: A4 machine feeding asynchronously.

Observation
• On A5, the indexing rate remains the same across shards.
• On A6 and A7, using more than 1 shard improved the indexing rate. The behavior with 3 and 5 shards remained the same.
• The indexing rate remained linear past 500K files for the whole indexing period.
• On A5, the maximum indexing rate is 160 docs/sec while the minimum is 107 docs/sec.
• On A6, the maximum indexing rate is 200 docs/sec while the minimum is 125 docs/sec.
• On A7, the maximum indexing rate is 302 docs/sec while the minimum is 120 docs/sec.

Conclusion
• The indexing rate remained linear once 500K docs were indexed.
• For onboarding a new repo, we can clearly predict/estimate the maximum time needed to index the repo.
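The conclusion above amounts to a simple division: if the indexing rate stays roughly constant, the worst-case indexing time for a repo is its file count divided by the minimum observed rate. A minimal sketch, assuming one indexed document per file and using a hypothetical function name:

```python
def estimate_index_hours(file_count: int, docs_per_sec: float) -> float:
    """Estimate indexing time in hours at a constant indexing rate."""
    if file_count < 0:
        raise ValueError("file_count must be non-negative")
    if docs_per_sec <= 0:
        raise ValueError("docs_per_sec must be positive")
    return file_count / docs_per_sec / 3600.0


if __name__ == "__main__":
    # Minimum and maximum rates observed on A5 in the measurements above.
    worst = estimate_index_hours(4_000_000, 107)
    best = estimate_index_hours(4_000_000, 160)
    print(f"4M files on A5: between {best:.1f} and {worst:.1f} hours")
```

With the A5 figures from the slide, a 4 million file repo would take roughly 6.9 hours at the maximum rate (160 docs/sec) and roughly 10.4 hours at the minimum rate (107 docs/sec).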