Effective searching: Integrating External Search Engines with Adobe AEM, by Dominik Kornaś

Upload: aem-hub-2014

Post on 10-May-2015


Page 1: Effective Searching by Dominik Kornas

Effective searching

Integrating External Search Engines with Adobe AEM

Dominik Kornaś

Page 2: Effective Searching by Dominik Kornas

Who am I?

3 years at Cognifide, exactly today

Senior software engineer & technical lead

Focused on systems integration tasks

The "search guy" at Cognifide

Page 3: Effective Searching by Dominik Kornas

What we won’t talk about

• Sorting
• Document structure
• Indexing
• Managed relevancy model
• Input data processing
• Highlighter
• Faceted search
• Wildcard search
• Statistics
• Autocomplete
• Spellchecking
• Lemmatization
• Sentence search
• Pagination
• Content normalization
• Metadata
• Data collections & views

Page 4: Effective Searching by Dominik Kornas

The goal of searching

Page 5: Effective Searching by Dominik Kornas

The goal of searching

"What is the best British football team?"

If we ask such a question, will the search engine find the answer?

Page 6: Effective Searching by Dominik Kornas

The goal of searching

"What is the best British football team?"

The search engine will find the question, not the answer.

Page 7: Effective Searching by Dominik Kornas

The goal of searching

"What is the best British football team?"

vs.

"best team football UK"

Are we asking questions or issuing queries?

Page 8: Effective Searching by Dominik Kornas

The goal of searching

Effective searching is about finding keywords:

• in the shortest possible time

• close to each other in a block of text

• that are in a desired context

and being sure the engine knows about the data we are looking for!

Page 9: Effective Searching by Dominik Kornas

Effective searching

Indexing

Page 10: Effective Searching by Dominik Kornas

The Past

Page 11: Effective Searching by Dominik Kornas

Microsoft FAST

The first major external search integration with AEM (then: CQ 5.4) at Cognifide.

Push-like indexing using the CQ-FAST connector from Adobe.

Page 12: Effective Searching by Dominik Kornas

Microsoft FAST

Implemented as a dedicated replication agent, triggered by the content replication.

http://wem.help.adobe.com/enterprise/en_US/10-0/wem/administering/cq2fast.html

Page 13: Effective Searching by Dominik Kornas

Microsoft FAST

Replication agent processing workflow: HTTP request for the content

[Diagram: Content builder → Transport handler → MS FAST, with Markup and Metadata as the data being processed]

Page 14: Effective Searching by Dominik Kornas

Microsoft FAST

We can decide which instance the content should be read from.

Page 15: Effective Searching by Dominik Kornas

Microsoft FAST

Replication agent processing workflow: metadata.ecma evaluation

Page 16: Effective Searching by Dominik Kornas

Microsoft FAST

Replication agent processing workflow: data upload

Page 17: Effective Searching by Dominik Kornas

Microsoft FAST

Sends content to MS FAST.

The "cq5" suffix in the URI is a document collection: a named subset of documents in the entire FAST index.

http://wem.help.adobe.com/enterprise/en_US/10-0/wem/administering/cq2fast.html

Page 18: Effective Searching by Dominik Kornas

Microsoft FAST

Replication agent processing workflow: indexing

Page 19: Effective Searching by Dominik Kornas

Microsoft FAST

The replication agent works well for a single site stored in a single FAST document collection.

It becomes complicated in a multi-site environment, where each site must be located in a separate index area and the search results must not contain data coming from other sites.

Page 20: Effective Searching by Dominik Kornas

Microsoft FAST

Page 21: Effective Searching by Dominik Kornas

Microsoft FAST

A complex ACL configuration was used to ensure that only the one proper agent delivers each document to FAST.

It was hard to set up and maintain without proper tools automating the whole process.

Page 22: Effective Searching by Dominik Kornas

The Present Day

Page 23: Effective Searching by Dominik Kornas

Google Search Appliance

For the AEM & GSA integration, we considered reusing the CQ-FAST connector approach.

But, aware of its issues, we decided to develop our own micro-framework that takes care of the indexing process.

Installed as a single OSGi bundle.

Provides a set of services and utilities to help with the indexing.

Page 24: Effective Searching by Dominik Kornas

Google Search Appliance

Indexing pipeline (diagram):
• Author: Content replication → Filtering → Push to Publish
• Publish: Indexing queue(-s) → Content gathering → Metadata processing → Push to external engine
• Process status tracking & persistence across all stages

The indexing process spans the author and the publish AEM instances.

All stages are tracked, and it is possible to recover from a failure and retry the indexing.

Page 25: Effective Searching by Dominik Kornas

Google Search Appliance

The process starts either with content replication or programmatically from the backend, e.g. triggered by a scheduler service.

Page 26: Effective Searching by Dominik Kornas

Google Search Appliance


Each replicated content path is filtered against a whitelist & a blacklist.

There’s an option to use a custom OSGi service able to decide if the content should be indexed, removed or ignored.
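The filtering step could be sketched roughly as below. This is only an illustration, not the framework's actual code: the class name, the prefix-based matching, the rule that the blacklist wins over the whitelist, and the mapping of activation to "index" and deactivation to "remove" are all assumptions.

```java
import java.util.List;

/** Hypothetical sketch of path filtering; not the framework's actual API. */
final class IndexingPathFilter {

    /** What should happen to a replicated path. */
    enum Decision { INDEX, REMOVE, IGNORE }

    private final List<String> whitelist;
    private final List<String> blacklist;

    IndexingPathFilter(List<String> whitelist, List<String> blacklist) {
        this.whitelist = whitelist;
        this.blacklist = blacklist;
    }

    /**
     * @param path      replicated content path
     * @param activated true for an activation, false for a deactivation
     */
    Decision decide(String path, boolean activated) {
        // Assumption: a path must be whitelisted and must not be blacklisted.
        if (!matchesAny(whitelist, path) || matchesAny(blacklist, path)) {
            return Decision.IGNORE;
        }
        return activated ? Decision.INDEX : Decision.REMOVE;
    }

    /** True if the path equals one of the prefixes or lies below it. */
    private static boolean matchesAny(List<String> prefixes, String path) {
        for (String prefix : prefixes) {
            if (path.equals(prefix) || path.startsWith(prefix + "/")) {
                return true;
            }
        }
        return false;
    }
}
```

A custom decision service, as mentioned above, would replace the body of `decide` with project-specific logic while keeping the same three outcomes.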


Page 27: Effective Searching by Dominik Kornas

Google Search Appliance

The indexing information is persisted in a special kind of repository node and replicated to the publish instance.

We can choose which publish instance(-s) will receive the data.

Page 28: Effective Searching by Dominik Kornas

Google Search Appliance


The information is received and instantly dispatched to the indexing queue(-s).

We can handle indexing in a single search engine or in several different ones.
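One way to picture the dispatch step is a queue per target engine, with every received path copied into each of them. The sketch below is in-memory only and purely illustrative; the class, method and engine names are hypothetical, and the real framework persists its state rather than keeping it in a map.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

/** Hypothetical dispatcher: one FIFO indexing queue per search engine. */
final class IndexingDispatcher {

    // Queues keyed by an illustrative engine name such as "gsa" or "solr".
    private final Map<String, Queue<String>> queues = new LinkedHashMap<>();

    /** Registers an engine; registering the same name twice keeps the old queue. */
    void registerEngine(String engineName) {
        queues.putIfAbsent(engineName, new ArrayDeque<String>());
    }

    /** Copies the received content path into every registered queue. */
    void dispatch(String path) {
        for (Queue<String> queue : queues.values()) {
            queue.add(path);
        }
    }

    /** Next path waiting for the given engine, or null if none or unknown. */
    String poll(String engineName) {
        Queue<String> queue = queues.get(engineName);
        return queue == null ? null : queue.poll();
    }
}
```

Because each engine has its own queue, a slow or failing engine does not block indexing in the others.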


Page 29: Effective Searching by Dominik Kornas

Google Search Appliance

The content is gathered using the SlingRequestProcessor OSGi service.

It is like a request for an HTML page that the Java code sends and then consumes itself.

Page 30: Effective Searching by Dominik Kornas

Google Search Appliance


Metadata is collected according to multiple different rules:
• the content resource type
• the content path
• values of the component properties
• custom rules
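The rule kinds listed above could be modelled as small, composable objects. The following is a minimal sketch under stated assumptions: `Page` is a simplified stand-in (not the AEM Page API), and every class, method and metadata key here is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical rule-based metadata collector; all names are illustrative. */
final class MetadataCollector {

    /** Minimal stand-in for a content page (not the AEM Page API). */
    static final class Page {
        final String path;
        final String resourceType;
        final Map<String, String> properties;

        Page(String path, String resourceType, Map<String, String> properties) {
            this.path = path;
            this.resourceType = resourceType;
            this.properties = properties;
        }
    }

    /** A rule inspects the page and may add metadata entries. */
    interface Rule {
        void apply(Page page, Map<String, String> metadata);
    }

    private final List<Rule> rules = new ArrayList<>();

    MetadataCollector addRule(Rule rule) {
        rules.add(rule);
        return this;
    }

    /** Applies every rule in registration order. */
    Map<String, String> collect(Page page) {
        Map<String, String> metadata = new LinkedHashMap<>();
        for (Rule rule : rules) {
            rule.apply(page, metadata);
        }
        return metadata;
    }

    /** Adds key=value when the page has the given resource type. */
    static Rule byResourceType(String resourceType, String key, String value) {
        return (page, metadata) -> {
            if (resourceType.equals(page.resourceType)) {
                metadata.put(key, value);
            }
        };
    }

    /** Adds key=value when the page path starts with the given prefix. */
    static Rule byPathPrefix(String prefix, String key, String value) {
        return (page, metadata) -> {
            if (page.path.startsWith(prefix)) {
                metadata.put(key, value);
            }
        };
    }

    /** Copies a component property into the metadata, if present. */
    static Rule copyProperty(String property, String key) {
        return (page, metadata) -> {
            String value = page.properties.get(property);
            if (value != null) {
                metadata.put(key, value);
            }
        };
    }
}
```

A custom rule is then just another `Rule` implementation, for example a lambda passed to `addRule`.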


Page 31: Effective Searching by Dominik Kornas

Google Search Appliance

Content replication

Filtering

Push to Publish

Indexing queue (-s)

Content gathering

Metadata processing

The content and metadata are combined and sent to the search engine.

Depending on the implementation, this can be done for each single document or in batches.
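The difference between per-document and batched pushes can be illustrated with a small buffer. This is a sketch only: `EngineClient` is a hypothetical interface standing in for whatever client talks to the search engine, and documents are reduced to plain strings.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical batching wrapper around an engine client. */
final class BatchingPusher {

    /** Illustrative stand-in for the code that talks to the search engine. */
    interface EngineClient {
        void push(List<String> documents);
    }

    private final EngineClient client;
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();

    BatchingPusher(EngineClient client, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.client = client;
        this.batchSize = batchSize;
    }

    /** Buffers a document and pushes as soon as the batch is full. */
    void add(String document) {
        buffer.add(document);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Pushes whatever is buffered; does nothing if the buffer is empty. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        client.push(new ArrayList<>(buffer));
        buffer.clear();
    }
}
```

With a batch size of 1 this degenerates to one push per document; larger sizes trade indexing latency for fewer round trips.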


Page 32: Effective Searching by Dominik Kornas

Google Search Appliance

Content replication

Filtering

Push to Publish

Indexing queue (-s)

Content gathering

Metadata processing

In case of any failure, indexing is rescheduled and launched again, up to the configured number of times.

If the server goes down, indexing will restart when the machine is up again.
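The retry behaviour can be reduced to a bounded loop, sketched below. It is a simplification under stated assumptions: the names are hypothetical, retries happen immediately instead of being rescheduled, and nothing is persisted, so this sketch does not cover restarting after a server outage.

```java
import java.util.concurrent.Callable;

/** Hypothetical bounded retry around a single indexing task. */
final class RetryingIndexer {

    /**
     * Runs the task until it succeeds or maxAttempts is reached.
     *
     * @return the number of attempts used, or -1 if every attempt failed
     */
    static int runWithRetries(Callable<Void> task, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                task.call();
                return attempt;
            } catch (Exception e) {
                // Assumption: a real implementation would record the failure
                // in the tracked process status and reschedule with a delay.
            }
        }
        return -1;
    }
}
```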


Page 33: Effective Searching by Dominik Kornas

Google Search Appliance

The flexible nature of our solution saved us when some fancy requirements came.

Page 34: Effective Searching by Dominik Kornas

The Future

Page 35: Effective Searching by Dominik Kornas

Apache Solr

The search engine, which is:
• free & open source
• powerful
• customizable
• scalable

Most importantly, Solr support is part of Jackrabbit Oak (the successor to Jackrabbit 2), the repository engine used in AEM 6.

AEM with integrated Solr is right there.

Page 36: Effective Searching by Dominik Kornas

Apache Solr

The solution developed for GSA has been ported to work with Solr.

Changes:
• Replaced the "glue code" that does the final data push with one that uses the SolrJ Java library.
• Renamed the document metadata fields to follow the Solr naming convention for dynamic fields.

Everything else remained untouched.
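The field renaming for Solr could look roughly like this. It is a sketch, not the ported code: the helper is hypothetical, and the suffixes (`_s`, `_i`, `_l`, `_b`, `_d`, `_dt`, `_ss`) assume the target schema declares dynamic fields the way Solr's sample schema does.

```java
import java.util.Collection;
import java.util.Date;

/** Hypothetical mapping from metadata names to Solr dynamic field names. */
final class SolrFieldNames {

    /**
     * Appends a type suffix so that the name matches a dynamic field pattern
     * such as *_s. Assumes the patterns of Solr's sample schema are declared.
     */
    static String dynamicFieldName(String name, Object value) {
        if (value instanceof Collection) return name + "_ss"; // multi-valued strings
        if (value instanceof Integer) return name + "_i";
        if (value instanceof Long) return name + "_l";
        if (value instanceof Boolean) return name + "_b";
        if (value instanceof Double) return name + "_d";
        if (value instanceof Date) return name + "_dt";
        return name + "_s"; // default: single string
    }
}
```

With such a convention, new metadata fields can be indexed without touching the Solr schema.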

Page 37: Effective Searching by Dominik Kornas

Search driven components

Page 38: Effective Searching by Dominik Kornas

Search driven components

No server-side processing.

Search engine used as a mini database of metadata.

Configuration via query parameters.

Pure front-end implementation.

Page 39: Effective Searching by Dominik Kornas

Search driven components

The whole page can be read from the dispatcher cache.

An AJAX request gets the content directly from the search engine.

The response is structured as JSON, which is easy to parse and display using JavaScript.

{ "id": "223344", "firstName": "Michael", "lastName": "Johnson", "phone": "(123)-777-8888", "office": "Office UK", "department": "504", "title": "Lead Architect" }

Page 40: Effective Searching by Dominik Kornas

Search driven components

Search results component configured to return employee data.

Page 41: Effective Searching by Dominik Kornas

Search driven components

User profile.

The name, mobile, email, image path etc. are all metadata values of the document.

Page 42: Effective Searching by Dominik Kornas

Search driven components

Carousel with news.

By changing the maximum number of search results, we can control the number of slides in the carousel.
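As an illustration of configuration via query parameters, the request behind such a component could be built as follows. This assumes Solr as the engine, where `q`, `rows` and `wt` are standard request parameters; the helper class, the base URL and the `type_s` field in the usage note are hypothetical. The components themselves issue this request from JavaScript; Java is used here only to show how the URL is composed.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper that builds the search request of a component. */
final class SearchQueryUrl {

    /**
     * @param baseUrl    select handler URL (illustrative, depends on the setup)
     * @param query      query string, URL-encoded by this method
     * @param maxResults maximum number of results, e.g. carousel slides
     */
    static String build(String baseUrl, String query, int maxResults) {
        try {
            return baseUrl
                    + "?q=" + URLEncoder.encode(query, "UTF-8")
                    + "&rows=" + maxResults
                    + "&wt=json";
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```

For example, `SearchQueryUrl.build("/solr/collection1/select", "type_s:news", 5)` asks for at most five news documents, which would give a carousel with at most five slides.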

Page 43: Effective Searching by Dominik Kornas

Thank you!