Digital Commerce
Lab: Endeca Pipeline Configuration
Document Version 1.0
4 Jun 2010



Table of Contents

1 Overview
  1.1 Scope and Objectives
  1.2 References
2 IAP Overview and Architecture
  2.1 ITL Flow Diagram and Components
  2.2 MDEX Engine
  2.3 EAC
  2.4 Workbench
3 Deployment Template
  3.1 Introduction
  3.2 Installation
  3.3 Deploy EAC Application on Windows
  3.4 Application Configuration
    3.4.1 Core Configuration
    3.4.2 Run load_baseline_test_data [bat|sh]
    3.4.3 Run baseline_update [bat|sh]
  3.5 JSP Reference Application
4 Pipeline Development
  4.1 Workflow Diagram
  4.2 Source Data Loading
    4.2.1 Pre Configuration
    4.2.2 Record Adapter
  4.3 Joining Data
    4.3.1 Join Type
    4.3.2 Record Cache
    4.3.3 Record Assembler
  4.4 Source Data Manipulation
  4.5 Data Mapping
    4.5.1 Property Mapping
      4.5.1.1 Define the Endeca Property
      4.5.1.2 Map the Endeca Property
    4.5.2 Dimension Mapping
      4.5.2.1 Define the Endeca Dimension
      4.5.2.2 Define the Endeca Dimension Values
      4.5.2.3 Map the Endeca Dimension
  4.6 Index Adapter


1 Overview

1.1 Scope and Objectives
The objective of this document is to introduce the key components and configuration of the Endeca Information Access Platform (IAP), along with the Deployment Template used to perform related operational tasks. The document then focuses on pipeline configuration, using a sample reference application that integrates WebSphere Commerce source data.

1.2 References
Endeca Developer Network (EDeN): http://eden.endeca.com


2 IAP Overview and Architecture

2.1 ITL Flow Diagram and Components

The Endeca Information Transformation Layer (ITL) reads in the source data and manipulates it into a set of indices for the Endeca MDEX Engine to query. In particular:
- Forge is the data processing program that transforms your source data into standardized, tagged Endeca records.
- Dgidx is the indexing program that reads the tagged Endeca records prepared by Forge and creates the proprietary indices for the Endeca MDEX Engine.
Forge and Dgidx are the two key components of the Data Foundry, and they use an instance configuration to accomplish their tasks. An instance configuration includes a pipeline, a dimension hierarchy, and an index configuration:
- Pipeline: the data processing workflow, connecting individual components through links. The components specify the format and location of the source data, manipulations of that data, and the mapping between source data and Endeca properties and dimensions.
- Dimension hierarchy: contains a unique name and ID for each dimension, as well as names and IDs for any dimension values created in Developer Studio. These names and IDs can be created in Developer Studio or imported from an external system.
- Index configuration: defines how your Endeca records, properties, dimensions, and dimension values are indexed by the Data Foundry.


2.2 MDEX Engine

[Diagram: the application tier (API) sends query requests to the MDEX Engine, which responds using the indices produced by the ITL.]

The Endeca MDEX Engine is the indexing and query engine that provides the search results for application query requests. It uses proprietary data structures and algorithms that allow it to provide real-time responses to client requests. The MDEX Engine stores the indices that were created by the Endeca Information Transformation Layer (ITL). Once the indices are stored, the MDEX Engine receives client requests via the application tier, queries the indices, and returns the results. The Dgraph is the name of the MDEX Engine process; it is the query engine that provides the backbone for all Dgraph-based Endeca solutions.

2.3 EAC

[Diagram: a host machine running the Endeca HTTP Service (port 8888), which hosts the EAC Central Server (public WSDL) and an EAC Agent (internal WSDL); the Agent controls the Forge, Dgidx, and Dgraph components, and the Central Server uses DB storage.]

The EAC (Endeca Application Controller) is the central system for managing one or more Endeca applications and all of the components installed on all of the hosts. It consists of the EAC Central Server (which coordinates the command, control, and monitoring of all Agents in an Endeca implementation), the EAC Agent (which controls the work of an Endeca implementation on a single host machine), and the EAC command-line utility, eaccmd. An EAC Agent is installed on each host machine where one or more Endeca components have been installed; it receives commands from the EAC Central Server and executes them for the components provisioned on that host.


2.4 Workbench

Endeca Workbench is a suite of tools that brings together site management capabilities including merchandising, content spotlighting, search configuration, and usage reporting. In addition to these tools for business users, it provides features for system administrators to configure the resources used by an Endeca implementation, monitor its status, start and stop system processes, and download an implementation's instance configuration for debugging and troubleshooting.


3 Deployment Template

3.1 Introduction
The Deployment Template is a self-contained deployment package that provides a collection of operational components serving as a starting point for development and application deployment. The template includes the complete directory structure required for deployment, including Endeca Application Controller (EAC) scripts, configuration files, and batch files or shell scripts that wrap common script functionality. The Deployment Template is the recommended method for building your application deployment environment; assembling the same structure by hand is tedious and error-prone.

3.2 Installation
The Deployment Template is distributed as a zip file, deploymentTemplate-[VERSION].zip. The zip file can be unpacked with WinZip or any other decompression utility, into any location. The package unpacks into a self-contained directory tree:

  Endeca\Solutions\deploymentTemplate-[VERSION]\

It is recommended that the package be unzipped into the same directory as the Endeca software. For example, if you have installed Endeca Platform Services on Windows in:

  C:\Endeca\PlatformServices\6.1.0\

then unzip this package into C:\ so that the template installs into:

  C:\Endeca\Solutions\deploymentTemplate-[VERSION]\

3.3 Deploy EAC Application on Windows
Before beginning deployment, the template must be installed on the primary control server in the deployment environment. In every deployment environment, one server serves as the primary control machine and hosts the EAC Central Server, while all other servers act as agents to the primary server and host EAC Agent processes that receive instructions from the Central Server. Both the EAC Central Server and the EAC Agent run as applications inside the Endeca HTTP Service. The Deployment Template only needs to be installed on the selected primary server, which is typically the machine that hosts the Central Server.

To deploy the application:
- Run deploy.bat under [deploymentTemplate-installation-path]\bin.
- Specify whether your application is going to be a Dgraph deployment or an Agraph deployment. Choose Dgraph.
- Specify a short name for your application. The name should consist of lower- or uppercase letters, or the digits zero through nine.
- Specify the full path into which your application should be deployed. This directory must already exist. The installer creates a folder inside the deployment directory with the name of your application, and the application directory structure and files are deployed there.
- Specify the port number of the EAC Central Server.
- Specify whether your application will use Endeca Workbench for configuration management. Choose Yes.
- Specify the port number of Endeca Workbench. By default, the installer assumes that the Workbench host is the machine on which it is being run, but this can be reconfigured once the application is deployed.

The installer can also be configured to prompt the user for custom information specific to the deployment. By default, Dgraph deployments use this functionality to prompt for Dgraph and Log Server port numbers.

3.4 Application Configuration

3.4.1 Core Configuration
Follow the process below to configure your deployment before running the application.

a) Start the EAC on each server in the deployment environment. On Windows, make sure the Endeca HTTP Service is running; on Unix, run $ENDECA_ROOT/tools/server/bin/startup.sh to start the service.

b) Edit the AppConfig.xml file in [appdir]/config/script to reflect your environment-specific details:
- Ensure that the eacHost and eacPort attributes of the app element specify the correct host and port of the EAC Central Server.
- Ensure that the host elements specify the correct host names and EAC ports of all EAC Agents in your environment.
- Ensure that the ConfigManager component specifies the correct host and port for Endeca Workbench, or that Workbench integration is disabled.
In addition to checking the host and port settings, you should configure components (for example, add or remove Dgraphs to specify an appropriate Dgraph cluster for your application), adjust process flags if necessary, and select appropriate ports for each Dgraph and Log Server.

c) Run the initialize_services script to initialize each server in the deployment environment with the directories and configuration required to host your application:

  %ENDECA_PROJECT_DIR%\control\initialize_services.[bat|sh]

This script removes any existing provisioning associated with this application in the EAC, then adds the hosts and components in your provisioning document to the EAC, creating the directory structure used by these components on all servers. In addition, if Workbench integration is enabled, this script initializes Endeca Workbench by uploading the application's configuration files.

d) Upload configuration XML files (optional). Replace the sample wine configuration files in %ENDECA_PROJECT_DIR%\config\pipeline with the configuration files created in Developer Studio for your application. If you need to build a new pipeline from scratch, skip this step.

e) Upload extracted source data (optional). Replace the sample wine data (wine_data.txt.gz) in %ENDECA_PROJECT_DIR%\test_data\baseline with the data used by your application. Similarly, replace the partial update data in %ENDECA_PROJECT_DIR%\test_data\partial with your application's partial data. If your application retrieves data directly from a database via ODBC or JDBC, or from a CAS crawl, skip this step and remove the wine_data.txt.gz file from %ENDECA_PROJECT_DIR%\test_data\baseline.

3.4.2 Run load_baseline_test_data [bat|sh]
After the related configuration is done, run the %ENDECA_PROJECT_DIR%\control\load_baseline_test_data [bat|sh] script to simulate the data extraction process and set the baseline update flag. It copies any data extracts from the \test_data directory to the \data\incoming directory, then sets the baseline_data_ready flag to indicate that data has been extracted and is ready for baseline update processing.

3.4.3 Run baseline_update [bat|sh]
The baseline update script runs the data processing components (Forge and Dgidx) to index the records and then updates the MDEX Engine (the Dgraph) with the indexed data. Run %ENDECA_PROJECT_DIR%\control\baseline_update [bat|sh], then check %ENDECA_PROJECT_DIR%\logs\[app-name].0.0.log for any errors. Note: do not run the baseline update or other scripts while you have log files open; doing so may cause errors.
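For orientation, the host and port settings edited in step (b) of the core configuration live in entries of roughly the following shape in AppConfig.xml. This is a sketch only: the application name, host names, and ports are placeholders, and a real Deployment Template AppConfig.xml contains many more elements and components than shown here.

```xml
<!-- Sketch only: appName, host names, and ports are placeholders. -->
<app appName="MyApp" eacHost="itl-server.example.com" eacPort="8888"
     dataPrefix="MyApp" sslEnabled="false" lockManager="LockManager">
  <working-dir>${ENDECA_PROJECT_DIR}</working-dir>
  <log-dir>./logs</log-dir>
</app>

<!-- One host element per machine running an EAC Agent. -->
<host id="ITLHost" hostName="itl-server.example.com" port="8888" />
<host id="MDEXHost" hostName="dgraph-server.example.com" port="8888" />
```

The host ids declared here are what the component definitions (Forge, Dgidx, Dgraphs, Log Server) reference, so changing a host name in one place updates every component provisioned on it.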

3.5 JSP Reference Application
After a baseline update runs successfully, you can access the Endeca JSP reference implementation to navigate and search your data at:

  http://[WorkbenchHost]:[WorkbenchPort]/endeca_jspref

Replace [WorkbenchHost] and [WorkbenchPort] with your configuration, for example http://localhost:8006/endeca_jspref. Click the ENDECA-JSP Reference Implementation link, enter the host name and port of the machine the MDEX Engine is running on, and click Go. The sample wine data, or your application data, will be displayed.


4 Pipeline Development

4.1 Workflow Diagram

[Diagram: the basic pipeline template. A record adapter (RecordAdapter) reads the source data and a dimension adapter reads Dimensions.xml; the DimensionServer consolidates dimension data (Dims); the PropMapper, driven by its config XML, maps source properties; and the IndexAdapter writes the Endeca records. The whole flow is described by Pipeline.epx, executed by Forge, and then indexed by Dgidx.]

The pipeline functions as the script for the entire data transformation process when the Forge program runs. It specifies the source data location, format, and changes, as well as the mapping between source data properties and Endeca properties/dimensions. Pipeline development should be done in Endeca Developer Studio, starting from the basic pipeline template shown in the diagram above:
- Record adapter (LoadData): loads source data into the pipeline.
- Property mapper (PropMapper): maps source properties to Endeca properties and dimensions.
- Indexer adapter (IndexerAdapter): outputs data that is ready to be indexed by Dgidx.
- Dimension adapter (Dimensions): loads dimension data.
- Dimension server (DimensionServer): functions as a single repository for dimension data that has been input via one or more dimension adapters.

4.2 Source Data Loading
Endeca can load source data into the pipeline in several ways, including text file extracts (delimited, vertical, fixed-width, or XML), database connectivity (ODBC/JDBC), and the Content Acquisition System (crawling). The Endeca reference application's wine data uses a delimited text file as the source format. In this document, as an example, we read directly from WebSphere Commerce sample store data (DB2) through the JDBC adapter.

4.2.1 Pre Configuration

Copy db2jcc.jar (the DB2 JDBC driver) and AdvJDBCColumnHandler.jar (which provides support for obtaining data from database column types that are not supported by the standard Endeca JDBC record adapter, such as CLOBs and BLOBs) into the %ENDECA_PROJECT_DIR%\config\lib\java folder.

Open the %ENDECA_PROJECT_DIR%\config\script\environment.properties file and add the following line:

  forge.javaClasspath = ${ENDECA_PROJECT_DIR}/config/lib/java/db2jcc.jar;${ENDECA_PROJECT_DIR}/config/lib/java/AdvJDBCColumnHandler.jar

Edit %ENDECA_PROJECT_DIR%\config\script\AppConfig.xml, find the Forge component entry, and add the --javaClasspath argument, referencing the variable defined in the previous step:

  -vw --javaClasspath ${forge.javaClasspath}

4.2.2 Record Adapter
The record adapter, which is used to load your source data, can be added either in the Developer Studio pipeline diagram or by editing pipeline.epx directly.

a) Open the %ENDECA_PROJECT_DIR%\config\pipeline\pipeline.epx file in a text editor and add your record adapter definition. For a DB2 JDBC adapter, the key settings are the driver class (com.ibm.db2.jcc.DB2Driver), the connection URL (jdbc:db2://[DBHostname]:[Port]/[DBName]), the column handler class (com.endeca.soleng.itl.jdbc.AdvancedJDBCColumnHandler), the SQL query (select * from xxx where xxx), and the connection properties (user=dbuser, password=dbpwd, defaultFetchSize=-2147483648).
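Assembled from the values above, the record adapter entry in pipeline.epx takes roughly the following shape. This is a sketch reconstructed from memory: the element and attribute names should be verified against a pipeline.epx generated by Developer Studio, and the adapter name, host, port, and query are placeholders.

```xml
<!-- Sketch only: verify element names against a Developer Studio-generated
     pipeline.epx. NAME, host, port, and query values are placeholders. -->
<RECORD_ADAPTER NAME="LoadWCData" FORMAT="JDBC" DIRECTION="INPUT">
  <PASS_THROUGH NAME="DB_DRIVER_CLASS">com.ibm.db2.jcc.DB2Driver</PASS_THROUGH>
  <PASS_THROUGH NAME="DB_URL">jdbc:db2://[DBHostname]:[Port]/[DBName]</PASS_THROUGH>
  <PASS_THROUGH NAME="COLUMN_HANDLER_CLASS">com.endeca.soleng.itl.jdbc.AdvancedJDBCColumnHandler</PASS_THROUGH>
  <PASS_THROUGH NAME="SQL">select * from xxx where xxx</PASS_THROUGH>
  <PASS_THROUGH NAME="DB_CONNECT_PROP">user=dbuser</PASS_THROUGH>
  <PASS_THROUGH NAME="DB_CONNECT_PROP">password=dbpwd</PASS_THROUGH>
  <PASS_THROUGH NAME="DB_CONNECT_PROP">defaultFetchSize=-2147483648</PASS_THROUGH>
</RECORD_ADAPTER>
```

The same name/value pairs appear in the Pass Throughs tab when the adapter is created through the Developer Studio GUI instead, as described next.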

b) Alternatively, you can use Developer Studio to add it in the GUI. In the pipeline diagram, click New -> Record -> Adapter, enter a name, and under the General tab choose JDBC Adapter as the format. Then switch to the Pass Throughs tab and add the following name/value pairs:

  DB_DRIVER_CLASS       com.ibm.db2.jcc.DB2Driver
  DB_URL                jdbc:db2://[DBHostname]:[Port]/[DBName]
  COLUMN_HANDLER_CLASS  com.endeca.soleng.itl.jdbc.AdvancedJDBCColumnHandler
  SQL                   select * from xxx where xxx
  DB_CONNECT_PROP       user=dbuser
  DB_CONNECT_PROP       password=dbpwd

Similarly, you can add more record adapters to retrieve data from different sources.


4.3 Joining Data
When Endeca records come from multiple data sources, they need to be joined together to create a single record containing the combined information. The record assembler is the component used to join data in the pipeline. Note that every source feeding a join must be a record cache, with two exceptions:
a) Switch joins do not do record comparisons, so no cache is needed.
b) The left source of a left join does not need to be cached, since it is scanned rather than looked up.

4.3.1 Join Type
There are five common join types supported by Forge:
- Left join: equivalent to a SQL left outer join. If a record from the left source compares equally to records from the other sources, those records are combined. Records from the non-left sources that do not compare equally to a record in the left source are discarded.
- Outer join: all records from all sources are merged into a single record; no records are discarded.
- Inner join: records shared by all sources are combined and merged; records that do not exist in all sources are discarded.
- Switch join: similar to a SQL union. Loads all source data into one record list and does not merge records.
- Sort switch join: a switch join in which records are sorted by record index value.

4.3.2 Record Cache
All sources that feed a join must be record caches, with the two exceptions mentioned above. A record cache component is sorted by record index and stores the records in memory for lookup during a join. In the pipeline diagram, click New -> Record -> Cache, enter a unique name, then choose the record source. Switch to the Record Index tab and add either a property or a dimension value as the record index. Check the Combine Records checkbox if you want to merge records with equivalent record index key values into a single record; for one-to-one or many-to-many joins, leave Combine Records unchecked.


4.3.3 Record Assembler
To add a new record assembler, in the pipeline diagram click New -> Record -> Assembler, enter a unique name, and in the Sources tab add the record sources or dimension source. In the Record Join tab, configure the join as follows:
a) Choose the join type in the dropdown list.
b) If you are performing a left join, check the Multi Sub-records option if the left record can be joined to more than one right record.
c) In the Join Entries list, define the order of your join entries by selecting an entry and clicking Up or Down. For all joins, properties are processed from the join sources in the order in which they appear in the list. The first entry is the left entry for a left join.
d) To define the join key for a join entry, select the entry from the Join Entries list and click Edit. The Join Entry editor appears; click Add to create a join key, which must be identical to the record index key for that join entry.


4.4 Source Data Manipulation
Sometimes the source data is not in a format suitable for search result display or guided navigation, or needs to be renamed before property/dimension mapping occurs, so the source data must be manipulated. This manipulation can be done in the source database itself, in a pre-processing script before loading into the pipeline, or within the pipeline. In the pipeline there are three methods: record manipulators, Perl manipulators, and Java manipulators. A record manipulator, which uses XML expressions to configure the manipulation, is the most common method. A typical example is reformatting a price property to two decimal places for display.
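As an illustration of the expression style only, a price-reformatting manipulation takes roughly the shape below. This is a hypothetical sketch: the node and operation names here are invented for illustration and the property name P_Price is a placeholder; consult the Data Foundry Expression Reference Guide for the actual expression syntax.

```xml
<!-- Hypothetical sketch: node/operation names are illustrative only,
     and P_Price is a placeholder property name. -->
<EXPRESSION TYPE="PROPERTY" NAME="MATH">
  <EXPRNODE NAME="OPERATION" VALUE="FORMAT"/>
  <EXPRNODE NAME="SOURCE_PROPERTY" VALUE="P_Price"/>
  <EXPRNODE NAME="FORMAT_STRING" VALUE="%.2f"/>
</EXPRESSION>
```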


To add a record manipulator, in the pipeline diagram click New -> Record -> Manipulator, enter a unique name, and choose the record source or dimension source. Then double-click the component to add the XML expression script. For expression language details, refer to the Data Foundry Expression Reference Guide on the EDeN site.

4.5 Data Mapping
After the source data is loaded, joined, and manipulated, we need to map it to Endeca properties and dimensions; the output data will then be ready for indexing.

4.5.1 Property Mapping

4.5.1.1 Define the Endeca Property

In the Developer Studio Properties window, click New and specify a unique property name. The name must be in NCName format (no spaces). Check or uncheck each checkbox to specify the property's behaviour. The meaning of each checkbox is outlined below:
- Prepare sort offline: the property will be used for record sorting.
- Rollup: enables aggregated records (rolling up records that share the same value).
- Enable for record filters: records can be filtered on this property's value (records excluded by the filter will not be accessed).
- Use for record spec: the property serves as the unique identifier for record retrieval, analysis, and reporting.
- Show with record list: display the property on the search results page.
- Show with record: display the property on the record detail page.
- Enable record search: the property is searchable using keyword search.
- Enable wildcard search: the property supports wildcard search.


4.5.1.2 Map the Endeca Property

Open the property mapper component in the pipeline diagram. Click the Mappings button, then click New and choose Property Mapping in the dropdown. Specify the EXACT (case-sensitive) source name and choose the Endeca property it will be mapped to. Click OK to finish.

4.5.2 Dimension Mapping

4.5.2.1 Define the Endeca Dimension

In the Developer Studio Dimensions window, click New and specify a unique dimension name. Check or uncheck each checkbox to specify the dimension's behaviour. The dimension-specific checkboxes are outlined below (refer to the property section for the common ones):
- Hidden: prevents the dimension and its dimension values from being displayed in the guided navigation menu.
- Compute refinement statistics: calculates and returns the number of records associated with each refinement in the guided navigation menu.
- Multiselect: allows the user to refine on more than one dimension value in a given dimension.
- Enable dynamic ranking: dynamic refinement ranking is enabled for this dimension.
- Generate "More" dimension value: retrieves all of the refinements for the parent dimension via a special inert dimension value.

4.5.2.2 Define the Endeca Dimension Values

For each dimension defined, corresponding dimension values must be defined as well. Dimension values can be defined manually in Developer Studio, auto-generated by the Forge process, or imported from external taxonomies. Within Developer Studio there are three ways to create them: Load & Promote, the New Dimension Value editor, and external import.

Load & Promote: the Load button loads dimension values that were auto-generated during the Forge process; this happens when the dimension mapping's match mode is set to Auto Generate and the value has not been manually defined. Loaded values are read-only; after clicking Promote, the auto-generated dimension values become editable and can be dragged and dropped in the window.

New Dimension Value editor: choose the specific dimension and click Values. To add new dimension values, select New -> Child or New -> Sibling and choose one of the following types:
a) Exact: matches only property values with exactly the same value.
b) Range: matches ranges of property values, specified by the upper and lower bounds of the range dimension value. For example: price range dimension values.
c) Sift: a kind of auto-generated dimension value that sifts property values through the hierarchy according to the range they match. For example: alphabetical group dimension values like A-F, G-Q, and R-Z.

4.5.2.3 Map the Endeca Dimension

Open the property mapper component in the pipeline diagram. Click the Mappings button, then click New and choose Dimension Mapping in the dropdown. Specify the EXACT (case-sensitive) source name and choose the Endeca dimension it will be mapped to. Finally, choose one of the match modes in the dropdown:
- Normal: maps source data to manually defined dimension values.
- Must match: maps source data to manually defined dimension values and warns of source data without a match.
- Auto generate: maps source data to manually defined dimension values and auto-generates dimension values for source data without a match.


4.6 Index Adapter
The index adapter is the last component in the pipeline workflow. It functions as the writer that outputs the processed Endeca records, which are then indexed by Dgidx.