ALM Search presentation for the VSS Arch Council

Microsoft Confidential 1

ALM Search
Sunita Shrivastava

6/20/2014


ALM Search
Start with Code Search, but eventually support search for other artefacts

Agenda
• Discuss the current architecture and concerns
• Share the investigations
• Share the learning
• Get feedback on open design issues


Indexing Engine Choices
• BING and Elastic Search
• Our requirements: Code Element Search, Phrase Search, AND/OR/NOT Search, Wildcard Search, Faceting, Highlighting, Pagination
• Indexing: support for continuous indexing, performance, scale-out feasibility, real-timeness, different schemas
• Our evaluation is shared at: https://microsoft.sharepoint.com/teams/EngSys/Documents/Modernizing%20Our%20Engineering%20System%20AOI/Search/TechEval/ElasticSearch/Tech%20Eval%20Summary.pptx?web=1
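As a rough illustration of how the query requirements above could map onto a single Elasticsearch request, the sketch below builds a Query DSL body as a plain Python dict. The field names (`content`, `file_name`, `extension`) and the function name are assumptions made for this example, not the schema used by the team.

```python
def build_code_search_body(phrase, excluded_term, name_pattern, page=0, page_size=20):
    """Build an Elasticsearch search body covering phrase, boolean,
    wildcard, faceting, highlighting and pagination requirements.

    All field names are hypothetical placeholders.
    """
    return {
        # Pagination: skip the first page * page_size hits.
        "from": page * page_size,
        "size": page_size,
        "query": {
            "bool": {
                # AND: the phrase must appear in the file content.
                "must": [{"match_phrase": {"content": phrase}}],
                # OR: a matching file name is optional but boosts the hit.
                "should": [{"wildcard": {"file_name": name_pattern}}],
                # NOT: exclude documents containing this term.
                "must_not": [{"term": {"content": excluded_term}}],
            }
        },
        # Faceting: count hits per file extension.
        "aggs": {"by_extension": {"terms": {"field": "extension"}}},
        # Highlighting: ask for highlighted fragments of the content field.
        "highlight": {"fields": {"content": {}}},
    }


body = build_code_search_body("open file", "deprecated", "*.cs", page=2)
```

The resulting dict can be serialized to JSON and posted to the `_search` endpoint of an index by whatever client is in use.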

ES Observations so far in context of Code Search
• Schema-less
  • Multiple artefacts can be stored in the same index
  • Can deal with change in data schema of the artefact
• Main value add of ES over Lucene is aggregation!
  • If you need aggregation of search across different indexes, aim to use the aggregator of ES!
  • This means that sharing a large ES cluster is likely the right thing to do for search aggregation across VSO artefacts
• Highly Extensible
  • Code Element Search: move from Nested Documents to Custom Analyzer
  • Highlighting: ES allows the REST APIs to be extended/added; we chose a custom query extension mechanism
  • Azure Search Service, though layered on ES, hid these mechanisms, making it intractable for code element search
• Feeding ES
  • For large-scale indexing, being able to feed ES fast enough is important, so the scalability of the pipeline is important.
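One common way to feed Elasticsearch quickly is to batch documents through its bulk API, which accepts newline-delimited JSON: an action line followed by the document source. The sketch below only builds such payloads. The batch size, index name and document shape are assumptions for illustration, and older Elasticsearch versions also expect a `_type` in the action line.

```python
import json


def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def to_bulk_payload(index_name, docs):
    """Serialize (doc_id, source) pairs into a bulk request body.

    Each document becomes two lines: an "index" action, then the source.
    The body must end with a newline.
    """
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"


# Hypothetical documents: (id, source) pairs.
docs = [(str(i), {"path": "src/file%d.cs" % i}) for i in range(5)]
payloads = [to_bulk_payload("code-index", batch) for batch in batches(docs, 2)]
```

Several feeder workers could post such payloads in parallel, which is where the scalability of the pipeline comes into play.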





High level Architecture
[Diagram: two datacenters, Datacenter A and Datacenter B, each hosting a Search Service next to the VSO Service. Labels in each datacenter: Search Service, Search Service Front End, Search Service Backend, REST API, Index Servers, VSO Service, Query Pipeline, Crawl/Parse/Feed Pipeline, Mapper Data. Web UX is labelled in Datacenter A.]


Planned Service Architecture
• XSS Scripting and circular dependency problems during build force the Search Client
• This is going to be more and more common as more standalone services come into existence
Thanks to Patrick/Phecda for this picture


Deployment for MsEng
Thanks to Sharad Agarwal for this slide!
• Elastic Search cluster (Indexer)
  • 3 (Master + Query) Nodes (A2)
  • 3 Data Nodes (A5)
  • Probably 1 Marvel node (A2); need data from AppInsight’s team
• Search Service (CPF + Query)
  • 3 Job Agent Nodes (A2)
  • 3 App Tier Nodes (A2)
  • Config DB (SQL Azure)
  • 1 Azure Storage account
• Portal UX in TFS
• The Search Service, ES and Marvel clusters are all within a VNET
  • Search Service talks to the ES query/ingestion nodes through an ILB
  • This helped take care of DNS issues


Logging, Diagnostics and Monitoring
• Logging
  • All our code will be instrumented, including the code inside ES. Developers can get these logs.
• Diagnostics
  • Each team provides diagnostics data, which is higher-level data that provides insights into the usage/activities happening in the context of the component.
  • Query Pipeline Telemetry
    • Total number of queries
    • Successful queries
    • Failed queries
    • Slow queries
  • Portal Telemetry
    • Total number of queries
    • Queries that don’t result in a click on the facet or result page in the top 20 results
    • Queries that result in a click beyond 20 results
    • Search usage per account
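The query pipeline counters listed above could be tracked with something as small as the sketch below. The class name and the 60 msec slow-query threshold are assumptions for illustration; the slides do not define what counts as a slow query.

```python
from dataclasses import dataclass


@dataclass
class QueryTelemetry:
    """Running counters for query pipeline telemetry (illustrative only)."""
    slow_threshold_ms: float = 60.0  # assumed threshold, not from the slides
    total: int = 0
    successful: int = 0
    failed: int = 0
    slow: int = 0

    def record(self, succeeded: bool, elapsed_ms: float) -> None:
        """Update the counters for one finished query."""
        self.total += 1
        if succeeded:
            self.successful += 1
        else:
            self.failed += 1
        if elapsed_ms > self.slow_threshold_ms:
            self.slow += 1


telemetry = QueryTelemetry()
telemetry.record(succeeded=True, elapsed_ms=12.0)
telemetry.record(succeeded=True, elapsed_ms=95.0)
telemetry.record(succeeded=False, elapsed_ms=30.0)
```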


Diagnostics, Monitoring (cont)
• Indexing Telemetry
  • Storage used for temporary data (Blobs)
  • Storage used for entity state data (Tables?)
  • Storage used for meta data
  • Storage used for provisioning data
  • Amount of data (MBytes) indexed in the last one hour
  • Number of commits handled in the last one hour
  • Number of pending tasks
  • Number of pending pipelines
  • Number of pending commits
  • Cold start summary
• Monitoring
  • Use Marvel


Query Pipeline
• Quick, low overhead
• Query Builder uses the Arriba Parser; NEST is used to talk to Elastic Search
• Mapper (not required): the scope of search defines a unique ES index alias which refers to the appropriate indices
• Security Trimming (cause of concern): for TFVC at file level, for WorkItems at area level
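The idea that a search scope resolves to one ES index alias can be sketched as follows. The alias naming scheme (`scope-<account>-<project>-<repo>`) is purely hypothetical; only the shape of the alias actions body follows the Elasticsearch aliases API.

```python
def alias_for_scope(account, project=None, repo=None):
    """Derive a deterministic alias name from a search scope.

    The naming scheme is an assumption made for this example.
    """
    parts = [p.lower() for p in (account, project, repo) if p]
    return "scope-" + "-".join(parts)


def alias_actions(alias, indices):
    """Build the request body that points `alias` at every index in `indices`."""
    return {"actions": [{"add": {"index": name, "alias": alias}} for name in indices]}


alias = alias_for_scope("mseng", "VSOnline", "Search")
body = alias_actions(alias, ["code-000001", "code-000002"])
```

With such an alias in place, the query pipeline would search the alias and would not need a separate mapper lookup, which matches the "Mapper (not required)" note.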

[Diagram: Query Pipeline. Labels: REST Endpoint, Repo Access/Auth Mgmt, Query Builder (Format Checking, Query Parsing), Security Trimmer (only for tfvc), Aggregator, Mapper, FileHash Mapper (for Dedup), Highlighter AddIn, Indexer/ElasticSearch.]


Query Pipeline Component Diagram
Thanks to Bittu and Neeraj for this diagram!
[Diagram: query pipeline components. Labels: Search UI, Rest API for Query Interaction, Query Builder, Search String & Filters, Search Query Backend, Custom Highlighter, TFS GIT Repo, Query String Parser, ES Client, Query Monitor, OI, Query Executor, ES Cluster, Repo Access Management/Authentication, Search Response, ES Search Results, Custom Query, Custom Analyzer.]


Security
Three options:
• Use Remote Security Name Spaces for caching artefact permissions
• GIT + WIT: index level permissions
• Mostly open model


Indexing Pipeline
• Currently built on VSSF Framework
• Backend REST APIs: seminal objects are Tasks, Pipelines, Entities
  • E.g. of a task creation: a request to perform an indexing pipeline related operation X (e.g. reindex, start, stop) on some Entity results in the creation of a task
  • Tasks create pipelines; a task is completed when all pipelines spawned are finished
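The relationship between Tasks, Pipelines and Entities described above can be modelled in a few lines. This is a minimal sketch under assumed names and states; it is not the VSSF-based implementation.

```python
from dataclasses import dataclass, field
from enum import Enum


class PipelineState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    FINISHED = "finished"


@dataclass
class Pipeline:
    name: str
    state: PipelineState = PipelineState.PENDING


@dataclass
class Task:
    """A request to run an operation (e.g. reindex) on an entity."""
    operation: str
    entity: str
    pipelines: list = field(default_factory=list)

    def spawn(self, name):
        """Create a pipeline owned by this task and return it."""
        pipeline = Pipeline(name)
        self.pipelines.append(pipeline)
        return pipeline

    @property
    def completed(self):
        """A task is complete once every spawned pipeline has finished."""
        return bool(self.pipelines) and all(
            p.state is PipelineState.FINISHED for p in self.pipelines
        )


task = Task(operation="reindex", entity="repo:Search")
crawl = task.spawn("crawl-parse-feed")
crawl.state = PipelineState.FINISHED
```

Treating a task with no pipelines as not yet completed is a design choice of this sketch; the slides do not say how that case is handled.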

[Diagram: Indexing Pipeline. Labels: BE REST Endpoint, Crawl, Meta Data Analysis, Cold Start Index Prep, Index Provisioning, Parse, Feed, Indexer/ElasticSearch, Ready Index for Query, Mapper Update, Crawl, Parse, Feed, Cold Start Cleanup, Dedup Detection (opt).]


Indexing Pipeline Component Design
Thanks to Tapas for this diagram!
[Diagram: indexing pipeline components. Labels: TFS Commit Sync, TFS Account Sync, Re-indexer, Crawler Abstraction Layer, Crawler Extensions, Parser, Parser Extensions, Feeder, ES Wrapper, CPF Arbitrator, ES Map and Topology, Configurator, Index Monitor, OI, Job Scheduler, ES Extensions (Custom Analyzer/Plugins), De-dup, Multi-tenancy, ..., Logger/Telemetry, Repo Content DB Abstraction Layer, Parser DB Abstraction Layer, ES Cluster, Data.]


Indexing Pipeline (cont)
Cold Start Crawl Spec:
• For GIT, the ‘default branch’ is enabled for indexing by default
  • Others will need to get whitelisted explicitly
  • TFS repo has many topic and feature branches
  • Need closure on UX experience on this
• For TFVC: TBD
• For Work Items: TBD
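The Git branch rule above (default branch always indexed, other branches only when whitelisted) amounts to a simple filter. The function and branch names below are hypothetical.

```python
def branches_to_index(all_branches, default_branch, whitelist=()):
    """Return the branches eligible for indexing, in their original order.

    The default branch is always eligible; any other branch must be
    explicitly whitelisted.
    """
    allowed = {default_branch, *whitelist}
    return [branch for branch in all_branches if branch in allowed]


selected = branches_to_index(
    ["master", "feature/search", "topic/perf", "release/m66"],
    default_branch="master",
    whitelist=["release/m66"],
)
```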


Performance Summary
• For up to 5 million files, performance of 90% of queries remained under 60 msec
• Feeder ran into issues quickly on A2 configurations because of low memory
• By not storing the file content, but only term vectors, the performance came down from a range of ~1.5 msec to 20 msec on A5 configurations.
• Following in progress
  • Multiple smaller indexes on the same node
  • Queries during continuous indexing
  • Indexing performance with multiple replicas
  • Multi-index search
• Detailed analysis available at https://microsoft.sharepoint.com/teams/EngSys/_layouts/15/WopiFrame.aspx?sourcedoc={83CFFEAA-1C78-46FA-BFE7-9D3E36DCA3CA}&file=Sprint66_PerftAnalysis-1.pptx&action=default




UX
• Requirements:
  • Search UI needs to be uncluttered and simple
  • User should not lose context of what they were doing
  • Experience should be largely similar for searching different artefacts
• Sharepoint has a precedent for multi-artefact search
  • Search launches a different page
  • Seems like a reasonable model to follow


Indexing Pipeline (Cont)
• Crawler Strategy: current plan is to use LibGit2Sharp
  • The following methods were compared:
    • Crawl file by file with GitHttpClient (current implementation)
    • Download zipped trees using GitHttpClient
    • Clone a repo using the Git command line
    • LibGit2Sharp
  • https://microsoft.sharepoint.com/teams/EngSys/_layouts/15/WopiFrame.aspx?sourcedoc={43E4F3C7-D54E-432E-BDF7-33F96A912E58}&file=Git%20Repos%20Crawl%20Option%20Comparison.docx&action=default
  • Implications: the entire Git repo is brought down to Azure storage (Blob Store)

• To Dedup or not to Dedup
  • TFS repo on mseng has ~35 feature branches, ~300 scope branches
  • Results (10 million files, 0.4 M unique files, duplication ratio 1:25, which is what is seen in Windows SD depots/branches):
    • No deduplication: 60 GB index size, 19 hours indexing time, 9 ms avg query time
    • Single document: ~3 GB, 50 minutes, 3.7 ms
    • Parent-child mappings: 11 GB, ~1 hour 50 min, 122 ms
  • https://microsoft.sharepoint.com/teams/EngSys/_layouts/15/WopiFrame.aspx?sourcedoc={A19453EF-9EB8-447C-A2E0-85C245D4CA79}&file=Summary%20of%20Deduplication%20Effort.docx&action=default
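One plausible reading of the "Single Document" option is that each unique file content is indexed once, keyed by a content hash, with every path that shares the content attached to that one document. The sketch below illustrates that reading only; the hash choice (SHA-1) and document shape are assumptions, and the linked summary remains the authority on what was actually measured.

```python
import hashlib


def dedup_by_content(files):
    """Group (path, content_bytes) pairs into one document per unique content.

    Returns a dict mapping content hash -> {"content": ..., "paths": [...]}.
    """
    docs = {}
    for path, content in files:
        digest = hashlib.sha1(content).hexdigest()
        doc = docs.setdefault(digest, {"content": content, "paths": []})
        doc["paths"].append(path)
    return docs


files = [
    ("main/src/a.cs", b"class A {}"),
    ("feature1/src/a.cs", b"class A {}"),  # same content on another branch
    ("main/src/b.cs", b"class B {}"),
]
docs = dedup_by_content(files)
ratio = len(files) / len(docs)  # files per unique document
```

Applied to the figures above, 10 million files over 0.4 million unique files gives the quoted 1:25 duplication ratio.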

• Backend APIs: for diagnostics, dealing with corruption, etc.
  • Seminal objects are Tasks, Pipelines, Entities
  • E.g. of a task creation: a request to perform an indexing pipeline related operation X (e.g. reindex, start, stop) on some ‘Entity’ results in the creation of a task
  • Tasks create pipelines; a task is completed when all pipelines spawned are finished



Indexing Pipeline Scaleout
• We want to host indexes for different artefacts on the same ES cluster
  • This will enable search aggregation through ES
  • This opens up several interesting scenarios in future
  • Scale-out and isolation for different pipelines based on Job Infra is not possible
• To leverage efforts across teams
  • Implies that the Crawl/Parse/Feed pipeline should be generalized
  • Potentially we might want to think of extensibilities at the query pipeline as well


ALM Search Deployment Topology
[Diagram: deployment topology. Labels: Load Balancer, ALM Search Service, AT, Job, Private Network, Internal Load Balancer, ES, TR Data Nodes, Search Data Nodes, TR Query/Indexing Nodes, Search Query/Indexing Nodes, Shared Master Nodes.]


Cross Account Search and Public Repositories
• There is a desire to include all public repositories in Search, either by default or as an option to the user
• How will VSO support the notion of a public repository? Will there be public accounts?


On Premise and Cloud Search Federation
• Sharepoint and Office 365 have a precedent that supports 3 models based on OAuth
  • http://msdn.microsoft.com/en-us/library/dn155905.aspx
• Three models for federation
  • Outbound: searching on the portal for the on-premise service endpoint will return results from the cloud as well
  • Inbound: searching on the portal for the cloud service returns results from the on-premise TFS indexes as well
  • Both ways: search is symmetric
• Look out for more details in this space next time!
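Whichever federation model is chosen, an aggregator has to combine hits from more than one index. The sketch below merges already-ranked result lists by score. It is only an illustration: the hit shape and `score` field are assumptions, and scores from independent indexes are not necessarily comparable, so a real aggregator would likely need normalization.

```python
import heapq
from itertools import islice


def merge_hits(*result_lists, limit=20):
    """Merge result lists that are each sorted by descending score.

    Returns the top `limit` hits across all sources.
    """
    merged = heapq.merge(*result_lists, key=lambda hit: hit["score"], reverse=True)
    return list(islice(merged, limit))


on_premise = [
    {"source": "on-premise", "path": "a.cs", "score": 9.1},
    {"source": "on-premise", "path": "b.cs", "score": 4.0},
]
cloud = [
    {"source": "cloud", "path": "c.cs", "score": 7.5},
    {"source": "cloud", "path": "d.cs", "score": 1.2},
]
top = merge_hits(on_premise, cloud, limit=3)
```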

[Diagram: search federation. Labels: Code Search Service, Aggregator, Indexer a, Indexer b, Indexer c, Repository, Cloud, VS IDE, VSO Web UX.]



Futures
• Semantic Search
• OSS Search Requirements
• Extensions to Code Search for Test Cases


Appendix


Perf Testing on 1 Node for up to 4M Files


Indexing Rate Analysis (Thanks to Perf Crew)
[Chart: Indexing Rate. X-axis: Files Indexed (10K, 100K, 500K, 1M, 2M, 3M, 4M). Y-axis: Docs Indexed/sec (0 to 350). Series: A7-1N1S, A7-1N3S, A7-1N5S, A6-1N1S, A6-1N3S, A6-1N5S, A5-1N1S, A5-1N3S, A5-1N5S.]

Setup
• A5: 6GB allocated to JVM heap.
• A6: 12GB allocated to JVM heap.
• A7: 20GB allocated to JVM heap.
• Feeder: A4 machine feeding asynchronously.
Observation
• On A5, indexing rate remains the same across shards.
• On A6 & A7, using more than 1 shard improved the indexing rate. Behavior with 3 and 5 shards remained the same.
• Indexing rate remained linear past 500K files during the whole indexing period.
• On A5, maximum indexing rate is 160 Docs/sec while minimum is 107 Docs/sec.
• On A6, maximum indexing rate is 200 Docs/sec while minimum is 125 Docs/sec.
• On A7, maximum indexing rate is 302 Docs/sec while minimum is 120 Docs/sec.
Conclusion
• Indexing rate remained linear once 500K docs were indexed.
• For onboarding a new repo, we can clearly predict/estimate the maximum time needed to index the repo.
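The estimate mentioned in the conclusion follows from dividing the file count by the lowest observed indexing rate. A minimal sketch, assuming the minimum rates quoted above hold for the whole run (the function name is hypothetical):

```python
def max_index_hours(file_count, min_docs_per_sec):
    """Upper-bound estimate of indexing time in hours.

    Uses the minimum observed indexing rate as a worst case.
    """
    if min_docs_per_sec <= 0:
        raise ValueError("min_docs_per_sec must be positive")
    return file_count / min_docs_per_sec / 3600.0


# Worst case for 4 million files on an A5 node (minimum 107 Docs/sec).
a5_hours = max_index_hours(4_000_000, 107)
```

For the A5 figures this works out to roughly 10.4 hours for 4 million files.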