Digital Commerce
Lab: Endeca Pipeline Configuration
Document Version 1.0
4 Jun 2010



Table of Contents

1 Overview
  1.1 Scope and Objectives
  1.2 References
2 IAP Overview and Architecture
  2.1 ITL Flow Diagram and Components
  2.2 MDEX Engine
  2.3 EAC
  2.4 Workbench
3 Deployment Template
  3.1 Introduction
  3.2 Installation
  3.3 Deploy EAC Application on Windows
  3.4 Application Configuration
    3.4.1 Core Configuration
    3.4.2 Run load_baseline_test_data [bat|sh]
    3.4.3 Run baseline_update [bat|sh]
  3.5 JSP Reference Application
4 Pipeline Development
  4.1 Workflow Diagram
  4.2 Source Data Loading
    4.2.1 Pre Configuration
    4.2.2 Record Adapter
  4.3 Joining Data
    4.3.1 Join Type
    4.3.2 Record Cache
    4.3.3 Record Assembler
  4.4 Source Data Manipulation
  4.5 Data Mapping
    4.5.1 Property Mapping
      4.5.1.1 Define the Endeca Property
      4.5.1.2 Map the Endeca Property
    4.5.2 Dimension Mapping
      4.5.2.1 Define the Endeca Dimension
      4.5.2.2 Define the Endeca Dimension Values
      4.5.2.3 Map the Endeca Dimension
  4.6 Index Adapter


1 Overview

1.1 Scope and Objectives
The objective of this document is to introduce the key components and configuration of the Endeca Information Access Platform (IAP), along with the Deployment Template used to perform related operational tasks. The document then focuses on pipeline configuration, using a sample reference application that integrates WebSphere Commerce source data.

1.2 References
Endeca Developer Network (EDeN): http://eden.endeca.com


2 IAP Overview and Architecture

2.1 ITL Flow Diagram and Components

The Endeca Information Transformation Layer (ITL) reads in the source data and manipulates it into a set of indices for the Endeca MDEX Engine to query. In particular:
- Forge is the data processing program that transforms your source data into standardized, tagged Endeca records.
- Dgidx is the indexing program that reads the tagged Endeca records prepared by Forge and creates the proprietary indices for the Endeca MDEX Engine.
Forge and Dgidx are the two key components of the Data Foundry, and they use an instance configuration to accomplish their tasks. An instance configuration includes a pipeline, a dimension hierarchy, and an index configuration:
- Pipeline: the data processing workflow, connecting individual components through links. The components specify the format and location of the source data, manipulations of that data, and the mapping between source data and Endeca properties and dimensions.
- Dimension hierarchy: contains a unique name and ID for each dimension, as well as names and IDs for any dimension values created in Developer Studio. These names and IDs can be created in Developer Studio or imported from an external system.
- Index configuration: defines how your Endeca records, properties, dimensions, and dimension values are indexed by the Data Foundry.


2.2 MDEX Engine

[Diagram: the application tier (API) sends query requests to the MDEX Engine, which responds using the indices produced by the ITL.]

The Endeca MDEX Engine is the indexing and query engine that provides the search results for application query requests. It uses proprietary data structures and algorithms that allow it to provide real-time responses to client requests. The MDEX Engine stores the indices that were created by the Endeca Information Transformation Layer (ITL). Once the indices are stored, the MDEX Engine receives client requests via the application tier, queries the indices, and returns the results. The Dgraph is the name of the MDEX Engine process; it is the query engine that provides the backbone for all Dgraph-based Endeca solutions.

2.3 EAC

[Diagram: a host machine running the Endeca HTTP Service (port 8888), which hosts the EAC Central Server (public WSDL) and an EAC Agent (internal WSDL); the Agent controls the Forge, Dgidx, and Dgraph components, and the Central Server uses DB storage.]

The EAC (Endeca Application Controller) is the central system for managing one or more Endeca applications and all of the components installed on all of the hosts. It consists of the EAC Central Server (which coordinates the command, control, and monitoring of all Agents in an Endeca implementation), the EAC Agent (which controls the work of an Endeca implementation on a single host machine), and the EAC command-line utility, eaccmd. An EAC Agent is installed on each host machine where one or more Endeca components have been installed; it receives commands from the EAC Central Server and executes them for the components provisioned on that host.


2.4 Workbench

Endeca Workbench is a suite of tools that brings together site management capabilities including merchandising, content spotlighting, search configuration, and usage reporting. In addition to these tools for business users, it provides features for system administrators to configure the resources used by an Endeca implementation, monitor its status, start and stop system processes, and download an implementation's instance configuration for debugging and troubleshooting.


3 Deployment Template

3.1 Introduction
The Deployment Template is a self-contained deployment package that provides a collection of operational components serving as a starting point for development and application deployment. The template includes the complete directory structure required for deployment, including Endeca Application Controller (EAC) scripts, configuration files, and batch files or shell scripts that wrap common script functionality. The Deployment Template is the recommended method for building your application deployment environment; assembling the same structure by hand is tedious and error-prone.

3.2 Installation
The Deployment Template is distributed as a zip file, deploymentTemplate-[VERSION].zip. The zip file can be unpacked with WinZip or any other decompression utility, into any location. The package unpacks into a self-contained directory tree:

  Endeca\Solutions\deploymentTemplate-[VERSION]\

It is recommended that the package be unzipped into the same directory as the Endeca software. For example, if you have installed Endeca Platform Services on Windows in:

  C:\Endeca\PlatformServices\6.1.0\

then unzip this package into C:\ so that the template installs into:

  C:\Endeca\Solutions\deploymentTemplate-[VERSION]\

3.3 Deploy EAC Application on Windows
Before beginning deployment, the template must be installed on the primary control server in the deployment environment. In every deployment environment, one server serves as the primary control machine and hosts the EAC Central Server, while all other servers act as agents to the primary server and host EAC Agent processes that receive instructions from the Central Server. Both the EAC Central Server and the EAC Agent run as applications inside the Endeca HTTP Service. The Deployment Template only needs to be installed on the selected primary server, which is typically the machine that hosts the Central Server.

To deploy the application:
- Run deploy.bat under [deploymentTemplate-installation-path]\bin.
- Specify whether your application is going to be a Dgraph deployment or an Agraph deployment. Choose Dgraph.
- Specify a short name for your application. The name should consist of lower- or uppercase letters, or the digits zero through nine.
- Specify the full path into which your application should be deployed. This directory must already exist. The installer creates a folder inside the deployment directory with the name of your application, and the application directory structure and files are deployed there.
- Specify the port number of the EAC Central Server.
- Specify whether your application will use Endeca Workbench for configuration management. Choose Yes.
- Specify the port number of Endeca Workbench. By default, the installer assumes that the Workbench host is the machine on which it is being run, but this can be reconfigured once the application is deployed.

The installer can also be configured to prompt the user for custom information specific to the deployment. By default, Dgraph deployments use this functionality to prompt for Dgraph and Log Server port numbers.

3.4 Application Configuration

3.4.1 Core Configuration
Follow the process below to configure your deployment before running the application.

a) Start the EAC on each server in the deployment environment. On Windows, make sure the Endeca HTTP Service is running; on Unix, run $ENDECA_ROOT/tools/server/bin/startup.sh to start the service.

b) Edit the AppConfig.xml file in [appdir]/config/script to reflect your environment-specific details:
- Ensure that the eacHost and eacPort attributes of the app element specify the correct host and port of the EAC Central Server.
- Ensure that the host elements specify the correct host names and EAC ports of all EAC Agents in your environment.
- Ensure that the ConfigManager component specifies the correct host and port for Endeca Workbench, or that Workbench integration is disabled.
In addition to checking the host and port settings, you should configure components (for example, add or remove Dgraphs to specify an appropriate Dgraph cluster for your application), adjust process flags if necessary, and select appropriate ports for each Dgraph and Log Server.

c) Run the initialize_services script to initialize each server in the deployment environment with the directories and configuration required to host your application:

  %ENDECA_PROJECT_DIR%\control\initialize_services.[bat|sh]

This script removes any existing provisioning associated with this application in the EAC, then adds the hosts and components in your provisioning document to the EAC, creating the directory structure used by these components on all servers. In addition, if Workbench integration is enabled, this script initializes Endeca Workbench by uploading the application's configuration files.

d) Upload configuration XML files (optional). Replace the sample wine configuration files in %ENDECA_PROJECT_DIR%\config\pipeline with the configuration files created in Developer Studio for your application. If you need to build a new pipeline from scratch, skip this step.

e) Upload extracted source data (optional). Replace the sample wine data (wine_data.txt.gz) in %ENDECA_PROJECT_DIR%\test_data\baseline with the data used by your application. Similarly, replace the partial update data in %ENDECA_PROJECT_DIR%\test_data\partial with your application's partial data. If your application retrieves data directly from a database via ODBC or JDBC, or from a CAS crawl, skip this step and remove the wine_data.txt.gz file from %ENDECA_PROJECT_DIR%\test_data\baseline.

3.4.2 Run load_baseline_test_data [bat|sh]
After the related configuration is done, run the %ENDECA_PROJECT_DIR%\control\load_baseline_test_data [bat|sh] script to simulate the data extraction process and set the baseline update flag. It copies any data extracts from the \test_data directory to the \data\incoming directory, then sets the baseline_data_ready flag to indicate that data has been extracted and is ready for baseline update processing.

3.4.3 Run baseline_update [bat|sh]
The baseline update script runs the data processing components (Forge and Dgidx) to index the records and then updates the MDEX Engine (the Dgraph) with the indexed data. Run %ENDECA_PROJECT_DIR%\control\baseline_update [bat|sh], then check %ENDECA_PROJECT_DIR%\logs\[app-name].0.0.log for any errors. Note: do not run the baseline update or other scripts while you have log files open; doing so may cause errors.
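For orientation, the host and port settings edited in step (b) of the core configuration live in entries of roughly the following shape in AppConfig.xml. This is a sketch only: the application name, host names, and ports are placeholders, and a real Deployment Template AppConfig.xml contains many more elements and components than shown here.

```xml
<!-- Sketch only: appName, host names, and ports are placeholders. -->
<app appName="MyApp" eacHost="itl-server.example.com" eacPort="8888"
     dataPrefix="MyApp" sslEnabled="false" lockManager="LockManager">
  <working-dir>${ENDECA_PROJECT_DIR}</working-dir>
  <log-dir>./logs</log-dir>
</app>

<!-- One host element per machine running an EAC Agent. -->
<host id="ITLHost" hostName="itl-server.example.com" port="8888" />
<host id="MDEXHost" hostName="dgraph-server.example.com" port="8888" />
```

The host ids declared here are what the component definitions (Forge, Dgidx, Dgraphs, Log Server) reference, so changing a host name in one place updates every component provisioned on it.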

3.5 JSP Reference Application
After a baseline update runs successfully, you can access the Endeca JSP reference implementation to navigate and search your data at:

  http://[WorkbenchHost]:[WorkbenchPort]/endeca_jspref

Replace [WorkbenchHost] and [WorkbenchPort] with your configuration, for example http://localhost:8006/endeca_jspref. Click the ENDECA-JSP Reference Implementation link, enter the host name and port of the machine the MDEX Engine is running on, and click Go. The sample wine data, or your application data, will be displayed.


4 Pipeline Development

4.1 Workflow Diagram

[Diagram: the basic pipeline template. A record adapter (RecordAdapter) reads the source data and a dimension adapter reads Dimensions.xml; the DimensionServer consolidates dimension data (Dims); the PropMapper, driven by its config XML, maps source properties; and the IndexAdapter writes the Endeca records. The whole flow is described by Pipeline.epx, executed by Forge, and then indexed by Dgidx.]

The pipeline functions as the script for the entire data transformation process when the Forge program runs. It specifies the source data location, format, and changes, as well as the mapping between source data properties and Endeca properties/dimensions. Pipeline development should be done in Endeca Developer Studio, starting from the basic pipeline template shown in the diagram above:
- Record adapter (LoadData): loads source data into the pipeline.
- Property mapper (PropMapper): maps source properties to Endeca properties and dimensions.
- Indexer adapter (IndexerAdapter): outputs data that is ready to be indexed by Dgidx.
- Dimension adapter (Dimensions): loads dimension data.
- Dimension server (DimensionServer): functions as a single repository for dimension data that has been input via one or more dimension adapters.

4.2 Source Data Loading
Endeca can load source data into the pipeline in several ways, including text file extracts (delimited, vertical, fixed-width, or XML), database connectivity (ODBC/JDBC), and the Content Acquisition System (crawling). The Endeca reference application's wine data uses a delimited text file as the source format. In this document, as an example, we read directly from WebSphere Commerce sample store data (DB2) through the JDBC adapter.

4.2.1 Pre Configuration

Copy db2jcc.jar (the DB2 JDBC driver) and AdvJDBCColumnHandler.jar (which provides support for obtaining data from database column types that are not supported by the standard Endeca JDBC record adapter, such as CLOBs and BLOBs) into the %ENDECA_PROJECT_DIR%\config\lib\java folder.

Open the %ENDECA_PROJECT_DIR%\config\script\environment.properties file and add the following line:

  forge.javaClasspath = ${ENDECA_PROJECT_DIR}/config/lib/java/db2jcc.jar;${ENDECA_PROJECT_DIR}/config/lib/java/AdvJDBCColumnHandler.jar

Edit %ENDECA_PROJECT_DIR%\config\script\AppConfig.xml, find the Forge component entry, and add the --javaClasspath argument, referencing the variable defined in the previous step:

  -vw --javaClasspath ${forge.javaClasspath}

4.2.2 Record Adapter
The record adapter, which is used to load your source data, can be added either in the Developer Studio pipeline diagram or by editing pipeline.epx directly.

a) Open the %ENDECA_PROJECT_DIR%\config\pipeline\pipeline.epx file in a text editor and add your record adapter definition. For a DB2 JDBC adapter, the key settings are the driver class (com.ibm.db2.jcc.DB2Driver), the connection URL (jdbc:db2://[DBHostname]:[Port]/[DBName]), the column handler class (com.endeca.soleng.itl.jdbc.AdvancedJDBCColumnHandler), the SQL query (select * from xxx where xxx), and the connection properties (user=dbuser, password=dbpwd, defaultFetchSize=-2147483648).
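Assembled from the values above, the record adapter entry in pipeline.epx takes roughly the following shape. This is a sketch reconstructed from memory: the element and attribute names should be verified against a pipeline.epx generated by Developer Studio, and the adapter name, host, port, and query are placeholders.

```xml
<!-- Sketch only: verify element names against a Developer Studio-generated
     pipeline.epx. NAME, host, port, and query values are placeholders. -->
<RECORD_ADAPTER NAME="LoadWCData" FORMAT="JDBC" DIRECTION="INPUT">
  <PASS_THROUGH NAME="DB_DRIVER_CLASS">com.ibm.db2.jcc.DB2Driver</PASS_THROUGH>
  <PASS_THROUGH NAME="DB_URL">jdbc:db2://[DBHostname]:[Port]/[DBName]</PASS_THROUGH>
  <PASS_THROUGH NAME="COLUMN_HANDLER_CLASS">com.endeca.soleng.itl.jdbc.AdvancedJDBCColumnHandler</PASS_THROUGH>
  <PASS_THROUGH NAME="SQL">select * from xxx where xxx</PASS_THROUGH>
  <PASS_THROUGH NAME="DB_CONNECT_PROP">user=dbuser</PASS_THROUGH>
  <PASS_THROUGH NAME="DB_CONNECT_PROP">password=dbpwd</PASS_THROUGH>
  <PASS_THROUGH NAME="DB_CONNECT_PROP">defaultFetchSize=-2147483648</PASS_THROUGH>
</RECORD_ADAPTER>
```

The same name/value pairs appear in the Pass Throughs tab when the adapter is created through the Developer Studio GUI instead, as described next.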

b) Alternatively, you can use Developer Studio to add it in the GUI. In the pipeline diagram, click New -> Record -> Adapter, enter a name, and under the General tab choose JDBC Adapter as the format. Then switch to the Pass Throughs tab and add the following name/value pairs:

  DB_DRIVER_CLASS       com.ibm.db2.jcc.DB2Driver
  DB_URL                jdbc:db2://[DBHostname]:[Port]/[DBName]
  COLUMN_HANDLER_CLASS  com.endeca.soleng.itl.jdbc.AdvancedJDBCColumnHandler
  SQL                   select * from xxx where xxx
  DB_CONNECT_PROP       user=dbuser
  DB_CONNECT_PROP       password=dbpwd

Similarly, you can add more record adapters to retrieve data from different sources.


4.3 Joining Data
When Endeca records come from multiple data sources, they need to be joined together to create a single record containing the combined information. The record assembler is the component used to join data in the pipeline. Note that every source feeding a join must be a record cache, with two exceptions:
a) Switch joins do not do record comparisons, so no cache is needed.
b) The left source of a left join does not need to be cached, since it is scanned rather than looked up.

4.3.1 Join Type
There are five common join types supported by Forge:
- Left join: equivalent to a SQL left outer join. If a record from the left source compares equally to records from the other sources, those records are combined. Records from the non-left sources that do not compare equally to a record in the left source are discarded.
- Outer join: all records from all sources are merged into a single record; no records are discarded.
- Inner join: records shared by all sources are combined and merged; records that do not exist in all sources are discarded.
- Switch join: similar to a SQL union. Loads all source data into one record list and does not merge records.
- Sort switch join: a switch join in which records are sorted by record index value.

4.3.2 Record Cache
All sources that feed a join must be record caches, with the two exceptions mentioned above. A record cache component is sorted by record index and stores the records in memory for lookup during a join. In the pipeline diagram, click New -> Record -> Cache, enter a unique name, then choose the record source. Switch to the Record Index tab and add either a property or a dimension value as the record index. Check the Combine Records checkbox if you want to merge records with equivalent record index key values into a single record; for one-to-one or many-to-many joins, leave Combine Records unchecked.


4.3.3 Record Assembler
To add a new record assembler, in the pipeline diagram click New -> Record -> Assembler, enter a unique name, and in the Sources tab add the record sources or dimension source. In the Record Join tab, configure the join as follows:
a) Choose the join type in the dropdown list.
b) If you are performing a left join, check the Multi Sub-records option if the left record can be joined to more than one right record.
c) In the Join Entries list, define the order of your join entries by selecting an entry and clicking Up or Down. For all joins, properties are processed from the join sources in the order in which they appear in the list. The first entry is the left entry for a left join.
d) To define the join key for a join entry, select the entry from the Join Entries list and click Edit. The Join Entry editor appears; click Add to create a join key, which must be identical to the record index key for that join entry.


4.4 Source Data Manipulation
Sometimes the source data is not in a format suitable for search result display or guided navigation, or needs to be renamed before property/dimension mapping occurs, so the source data must be manipulated. This manipulation can be done in the source database itself, in a pre-processing script before loading into the pipeline, or within the pipeline. In the pipeline there are three methods: record manipulators, Perl manipulators, and Java manipulators. A record manipulator, which uses XML expressions to configure the manipulation, is the most common method. A typical example is reformatting a price property to two decimal places for display.
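As an illustration of the expression style only, a price-reformatting manipulation takes roughly the shape below. This is a hypothetical sketch: the node and operation names here are invented for illustration and the property name P_Price is a placeholder; consult the Data Foundry Expression Reference Guide for the actual expression syntax.

```xml
<!-- Hypothetical sketch: node/operation names are illustrative only,
     and P_Price is a placeholder property name. -->
<EXPRESSION TYPE="PROPERTY" NAME="MATH">
  <EXPRNODE NAME="OPERATION" VALUE="FORMAT"/>
  <EXPRNODE NAME="SOURCE_PROPERTY" VALUE="P_Price"/>
  <EXPRNODE NAME="FORMAT_STRING" VALUE="%.2f"/>
</EXPRESSION>
```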


To add a record manipulator, in the pipeline diagram click New -> Record -> Manipulator, enter a unique name, and choose the record source or dimension source. Then double-click the component to add the XML expression script. For expression language details, refer to the Data Foundry Expression Reference Guide on the EDeN site.

4.5 Data Mapping
After the source data is loaded, joined, and manipulated, we need to map it to Endeca properties and dimensions; the output data will then be ready for indexing.

4.5.1 Property Mapping

4.5.1.1 Define the Endeca Property

In the Developer Studio Properties window, click New and specify a unique property name. The name must be in NCName format (no spaces). Check or uncheck each checkbox to specify the property's behaviour. The meaning of each checkbox is outlined below:
- Prepare sort offline: the property will be used for record sorting.
- Rollup: enables aggregated records (rolling up records that share the same value).
- Enable for record filters: records can be filtered on this property's value (records excluded by the filter will not be accessed).
- Use for record spec: the property serves as the unique identifier for record retrieval, analysis, and reporting.
- Show with record list: display the property on the search results page.
- Show with record: display the property on the record detail page.
- Enable record search: the property is searchable using keyword search.
- Enable wildcard search: the property supports wildcard search.


4.5.1.2 Map the Endeca Property

Open the property mapper component in the pipeline diagram. Click the Mappings button, then click New and choose Property Mapping in the dropdown. Specify the EXACT (case-sensitive) source name and choose the Endeca property it will be mapped to. Click OK to finish.

4.5.2 Dimension Mapping

4.5.2.1 Define the Endeca Dimension

In the Developer Studio Dimensions window, click New and specify a unique dimension name. Check or uncheck each checkbox to specify the dimension's behaviour. The dimension-specific checkboxes are outlined below (refer to the property section for the common ones):
- Hidden: prevents the dimension and its dimension values from being displayed in the guided navigation menu.
- Compute refinement statistics: calculates and returns the number of records associated with each refinement in the guided navigation menu.
- Multiselect: allows the user to refine on more than one dimension value in a given dimension.
- Enable dynamic ranking: dynamic refinement ranking is enabled for this dimension.
- Generate "More" dimension value: retrieves all of the refinements for the parent dimension via a special inert dimension value.

4.5.2.2 Define the Endeca Dimension Values

For each dimension defined, corresponding dimension values must be defined as well. Dimension values can be defined manually in Developer Studio, auto-generated by the Forge process, or imported from external taxonomies. Within Developer Studio there are three ways to create them: Load & Promote, the New Dimension Value editor, and external import.

Load & Promote: the Load button loads dimension values that were auto-generated during the Forge process; this happens when the dimension mapping's match mode is set to Auto Generate and the value has not been manually defined. Loaded values are read-only; after clicking Promote, the auto-generated dimension values become editable and can be dragged and dropped in the window.

New Dimension Value editor: choose the specific dimension and click Values. To add new dimension values, select New -> Child or New -> Sibling and choose one of the following types:
a) Exact: matches only property values with exactly the same value.
b) Range: matches ranges of property values, specified by the upper and lower bounds of the range dimension value. For example: price range dimension values.
c) Sift: a kind of auto-generated dimension value that sifts property values through the hierarchy according to the range they match. For example: alphabetical group dimension values like A-F, G-Q, and R-Z.

4.5.2.3 Map the Endeca Dimension

Open the property mapper component in the pipeline diagram. Click the Mappings button, then click New and choose Dimension Mapping in the dropdown. Specify the EXACT (case-sensitive) source name and choose the Endeca dimension it will be mapped to. Finally, choose one of the match modes in the dropdown:
- Normal: maps source data to manually defined dimension values.
- Must match: maps source data to manually defined dimension values and warns of source data without a match.
- Auto generate: maps source data to manually defined dimension values and auto-generates dimension values for source data without a match.


4.6 Index Adapter
The index adapter is the last component in the pipeline workflow. It functions as the writer that outputs the processed Endeca records, which are then indexed by Dgidx.