
DENODO SCHEDULER 4.5 ADMINISTRATOR GUIDE

Update Nov 13th 2009


NOTE: This document is confidential and is the property of Denodo Technologies (hereinafter, Denodo). No part of this document may be copied, photographed, transmitted electronically, stored in a document management system, or reproduced by any other means without prior written permission from Denodo.

Copyright © 2009. This document may not be reproduced in whole or in part without written permission from Denodo Technologies.


CONTENTS

PREFACE
  SCOPE
  WHO SHOULD USE THIS MANUAL
  SUMMARY OF CONTENTS

1 INTRODUCTION
2 GENERAL ARCHITECTURE
3 INSTALLATION AND EXECUTION
4 ADMINISTRATION
  4.1 AUTHENTICATION
  4.2 SERVER CONFIGURATION
    4.2.1 Authentication
    4.2.2 Ports
    4.2.3 Outgoing Mail Server
    4.2.4 Execution Threads
    4.2.5 Plugins and JDBC Adapters
  4.3 LOG CONFIGURATION
5 CREATING AND SCHEDULING JOBS
  5.1 ACTIVE JOBS
  5.2 ADDING DATA SOURCES
    5.2.1 ARN Data Sources
    5.2.2 ARN-Index Data Sources
    5.2.3 CSV Data Sources
    5.2.4 ITP Data Sources
    5.2.5 JDBC Data Sources
    5.2.6 VDP Data Sources
  5.3 FILTER SEQUENCES
    5.3.1 Boolean Content Filter
    5.3.2 Content Extraction Filter (HTML, PDF, Word, Excel, PowerPoint, XML, EML, and Text)
    5.3.3 Field Aggregation Filter
    5.3.4 Summary Generation Filter
    5.3.5 Title Generation Filter
    5.3.6 URL Unicity and Standardization Filters
    5.3.7 Useful Web Content Extraction Filter
  5.4 CONFIGURING NEW JOBS
    5.4.1 General structure of a job
    5.4.2 Aracne-type Job Extraction Section
    5.4.3 VDP Extraction Section
    5.4.4 ITP Extraction Section
    5.4.5 JDBC Extraction Section
    5.4.6 Data Schema Generated by the Different Types of Extraction Jobs
    5.4.7 Jobs for Maintaining Aracne Indexes
    5.4.8 Postprocessing Section (Filters/Exporters)
    5.4.9 Handler Section
    5.4.10 Time-based Job Scheduling Section
6 DEVELOPER API
  6.1 CLIENT API
    6.1.1 Scheduler
    6.1.2 Compatible API for 4.0 and 4.1 versions
  6.2 EXTENSIONS (PLUGINS)
    6.2.1 Filters
    6.2.2 Exporters
    6.2.3 Handlers
    6.2.4 Aracne Custom Crawlers
7 APPENDIX
  7.1 DATEFORMAT FUNCTION SYNTAX
  7.2 REGULAR EXPRESSIONS FOR FILTERS
  7.3 JDBC DRIVERS
  7.4 USE OF THE IMPORT/EXPORT SCRIPTS FOR BACKUP
  7.5 USE OF THE MIGRATION TOOL

BIBLIOGRAPHY


LIST OF FIGURES

Figure 1  Denodo Scheduler Architecture


LIST OF TABLES

Table 1  Metacharacters
Table 2  JDBC Drivers
Table 3  IBM, MySQL, Microsoft, and Sybase Drivers


PREFACE

SCOPE

This document presents the job scheduling system for the Denodo Platform.

WHO SHOULD USE THIS MANUAL

This document is aimed at administrators who seek to install, configure, and/or use Denodo Scheduler for the time-based scheduling of data extraction jobs from the Web, databases, file systems, e-mail servers, etc.

SUMMARY OF CONTENTS

More specifically, this document describes:

• The installation procedures for the Denodo Scheduler software.

• How the system is configured for subsequent use.

• How to operate the system using its Web administration tool.

• How to extend the system's functionality using the Denodo Scheduler API, either to automate access to the system or to include new components.


1 INTRODUCTION

The Denodo Technologies product suite provides advanced functionalities for the time-based scheduling of jobs that integrate data from dispersed and heterogeneous sources, which may be poorly structured. Denodo Scheduler allows the scheduling and execution of the data extraction and integration jobs defined in the different modules of the Denodo Platform. In combination with Denodo Scheduler, the modules of the Denodo Platform provide features such as the following:

• Denodo Virtual DataPort. Scheduling any job that involves the collection of data from various dispersed and heterogeneous sources, combining such information and exporting it to different types of repositories. It can also be used to preload data regularly in the Virtual DataPort cache. See [VDP] for further information.

• Denodo ITPilot. Periodic automation of Web data extraction and storage or time-based scheduling of Web automation jobs. See [ITP] for further information.

• Denodo Aracne. Scheduling crawling, filtering, and indexing jobs of unstructured Web information, document repositories, e-mail servers, RSS sites, etc. See [ARN] for further information.

The main features of Denodo Scheduler include:

• Flexible scheduling of batch jobs on the different components of the Denodo Platform: DataPort, ITPilot, and/or Aracne.

• Generation of detailed reports on the outcome of job execution, including detailed information on errors. The reports can be sent by e-mail to the addresses that have been configured.

• The results obtained by a job can be exported to a CSV file, a SQL file, a database, or to an index. It also allows for the inclusion of new exporters developed for specific purposes.

• Support for data extraction from sources with limited query capabilities. For example, consider a Web service or a Web site that allows information to be obtained on a particular company based on its Tax ID. It is possible to define a job that obtains the different Tax IDs from a database server or CSV file and queries the Web service or the Web site using each of them.

• Persistent jobs. If you restart the system while a job is being executed, the job can continue its execution from the last query that was successfully executed. In the previous example, if the complete job obtains information from 1000 companies and the system restarts after the first 200 queries, once the system has started you can resume the execution of the job from the state in which it stopped, continuing from query 201 instead of starting again from the first one.

• Transparent retries in the event of failure.

• Possibility of configuring a parallel execution of the different queries involved in the same job.


2 GENERAL ARCHITECTURE

Denodo Scheduler is a tool for the time-based scheduling of automatic data extraction jobs from different data sources. In particular, it allows different extraction jobs to be defined through its Web administration tool, stores this configuration persistently, and plans the execution of these jobs against the corresponding data servers as desired. Denodo Scheduler allows extraction jobs to be defined against the various modules of the Denodo Platform, and it also allows data to be extracted from relational databases via JDBC. Denodo Scheduler can apply different filtering algorithms to the extracted data and export the data obtained to different formats and repositories. At the core of the system are the extraction jobs that can be defined for the different components of the Denodo Platform:

• Denodo Aracne. Two types of jobs can be defined for this module: crawling and index maintenance, which are performed on the crawling and indexing servers of Denodo Aracne [ARN].

The crawling jobs (ARN) allow data to be collected from unstructured sources. The following subtypes of jobs are available:

o WebBot and IECrawler crawl the Web hypertext structure, starting from a group of initial URLs, and recursively retrieve all the pages accessible from the original URL group. They can also connect to an FTP server and obtain the information contained in all the files and subdirectories of a directory specified as the initial URL. Crawled documents in multiple languages are supported.

o WebBot is also capable of exploring a file system (even one located in a shared folder), taking a directory as the initial URL and extracting the data contained in all its files and subdirectories.

o POP3/POP3S/IMAP/IMAPS Crawler. Retrieves data from e-mails stored in servers accessible using the POP3, POP3S, IMAP, or IMAPS protocols. This includes support for attached files.

o MS Exchange Crawler. Retrieves data from e-mails stored in MS Exchange servers [MSEX]. This includes support for attached files.

o Salesforce.com Crawler. Retrieves data contained in data entities accessible via an account with the on-line service Salesforce.com [SLF].

o CustomCrawler. Extracts data from a data source through a Java implementation provided by the Denodo Aracne administrator. This type of robot allows a crawler to be built ad hoc for a specific source.

Index maintenance jobs (ARN-INDEX) allow the automatic maintenance of the indexes created, deleting documents that are old, obsolete, no longer accessible, etc.

• Denodo ITPilot (ITP). Executes queries on wrappers from Denodo ITPilot [ITP] to obtain structured data from Web sources.

• Denodo Virtual DataPort (VDP). Executes queries on wrappers and views defined in Denodo Virtual DataPort [VDP] to obtain data resulting from the integration of data that can come from dispersed and heterogeneous sources.

• It is also possible to define a JDBC type of job that explores the tables specified in a database and retrieves the data contained in them.

On a general level and for all jobs, it is possible to configure their time-based scheduling (when and how often they should be executed), various types of filters for post-processing the data retrieved by the system, and the way in which the results obtained by the job will be exported. The available exporters are:

• Dumping the final results into a database.
• Indexing the final results in the Aracne indexing server ([ARN]).
• Dumping the final results into a CSV-type file (which can also be used to generate MS Excel-compliant files).
• Dumping the final results into a SQL-type file.

It also allows the programmer to create new exporters for ad-hoc needs. Figure 1 shows the server's basic architecture.

In addition to the jobs and filters, Scheduler lets users define the data sources to be used by the extraction jobs and by the exporters. Denodo Scheduler allows data sources to be defined for the different components of the Denodo Platform (ARN, VDP, and ITP), for relational databases, and for delimited files. In the case of ITP-, VDP-, and JDBC-type jobs, it is possible to specify a query parameterized by a series of variables, along with the possible values for these variables, so that several queries are executed against the corresponding server.

Figure 1 Denodo Scheduler Architecture

The following briefly describes two typical examples of the use of Denodo Scheduler.

Example 1: extracting structured data from the Web with ITPilot

Suppose you want to periodically extract information about customers accessible via a corporate Web site. The Web site offers users a query form in which a customer's Tax ID is specified and returns information of interest about that customer. The list of all the Tax IDs to be queried is available in an internal database accessible via JDBC, and the set of data extracted must be dumped into another internal database, also accessible via JDBC. The steps to follow to carry out this job with the Denodo Platform are as follows:

1. Create a new ITPilot wrapper (see [ITP] for more details) that automates the operation of obtaining data about a customer from the corporate Web site. The wrapper will receive a customer's Tax ID as a mandatory parameter, automatically execute the query on the Web site, and extract the desired results.

2. Add a new JDBC-type data source to Scheduler to access the database that contains the Tax IDs of the required customers (see section 5.2.5 to find out how to add JDBC data sources).

3. Add another new JDBC data source to Scheduler to access the database into which the extracted data will be dumped.

4. Create an ITP-type job in Scheduler (see section 5.4). The ITP job will query a wrapper, specifying the different values for the Tax ID attribute. To obtain the different values of the Tax ID attribute, a query on the JDBC data source defined in step 2 will be used. Then, when the job is executed, the ITPilot wrapper will be invoked for each of the Tax IDs sought.

5. Create a JDBC-type exporter for the ITP job (see section 5.4.8). This exporter will use the JDBC data source defined in step 3.

6. Finally, configure the frequency with which you want to execute the job in Scheduler (see section 5.4.10).

Example 2: crawling, filtering, and indexing of unstructured data with Denodo Aracne

Suppose you want to periodically explore a particular Web site to download all the documents relevant to a specific topic. The new documents found must be dumped into an index that will then be used by a search engine to perform complex Boolean, keyword-based searches. The steps required to carry out this job with the Denodo Platform are the following:

1. Create a WebBot- or IECrawler-type ARN job (see section 5.4). This job will crawl the desired Web site, downloading all the documents found.

2. Create a sequence of filters for post-processing the documents obtained by crawling. For example, you can use the Boolean content filter (see section 5.3.1) to retain only those documents containing certain keywords relevant to the desired topic, the unicity filter (see section 5.3.6) to discard duplicate documents, and the content extraction filter (see section 5.3.2) to index only the textual content of the documents (discarding the HTML markup and the JavaScript code of each page).

3. Create an Aracne index-type exporter for the job (see section 5.4.8). Thus, the documents will be indexed so their content can be searched.

4. Finally, configure the frequency with which you want to execute the job in Scheduler (see section 5.4.10).


3 INSTALLATION AND EXECUTION

The Denodo Platform Installation Guide [DENINST] provides all the information required to install Denodo Scheduler, including the minimum hardware and software requirements, and instructions for using the installation tool and for the initial system configuration.

Denodo Scheduler includes a server for scheduling jobs and a Web server that hosts the administration tool. The servers can be started and stopped using the Denodo Platform Control Center tool (see the Denodo Platform Installation Guide [DENINST]). To connect to the administration tool, use the user “admin” with the initial password “admin”. The default URL for accessing the Web administration tool from the local machine is http://localhost:9090/webadmin/denodo-scheduler-admin.

Alternatively, scripts are provided in the path DENODO_HOME/bin. The scheduling server is started with scheduler_startup.sh (scheduler_startup.bat or scheduler_startup.exe on Windows) and stopped with scheduler_shutdown.sh (scheduler_shutdown.bat or scheduler_shutdown.exe on Windows). To start and stop the Web administration tool, the scripts scheduler_webadmin_startup.sh and scheduler_webadmin_shutdown.sh are available (scheduler_webadmin_startup.bat and scheduler_webadmin_shutdown.bat on Windows). On Windows machines, a script named schedulerservice.bat is included to install the scheduling server as a service.


4 ADMINISTRATION

The Denodo Platform Installation Guide [DENINST] provides detailed information on the configuration tasks that need to be carried out before executing Scheduler. The following sections describe the configuration options for the server and for the system logs.

4.1 AUTHENTICATION

To access the Denodo Scheduler administration tool, an initial authentication screen is shown in which the user must enter the password for the “admin” user. The option of remembering the password for future authentications is offered. On the same screen, by clicking on the link Edit server advanced config, you can change the Denodo Scheduler server to which the tool connects (server name and administration process port).

4.2 SERVER CONFIGURATION

Once the Denodo Scheduler server has started, it is possible to change some parameters from the “Configuration” perspective of the administration tool. The main screen of the “Configuration” perspective shows the name and port of the server to which the administration tool is connected. Links to configure the following features are shown on the left side of the screen:

• Change the user password.
• Change the ports used by the server.
• Change the server's configuration for outgoing e-mail.
• Change the configuration of execution threads for the server.
• Add or delete libraries that encapsulate extensions of the system (plugins), such as exporters, handlers, crawlers, or custom filters, and the JDBC adapters used by these types of sources.
• Export projects (data sources, filter sequences, and jobs), plugins, and JDBC adapters (Export option). This is useful for migration and backup purposes. It generates a ZIP file including all the information required to restore the current server metadata. It is possible to choose which elements to export:
  o All the projects, plugins, adapters, and server configuration. This is the default option.
  o All the projects, choosing the set of resources to export among plugins, JDBC adapters, and server configuration.
  o A project or set of projects, also choosing the set of resources to export among the server configuration and the plugins and JDBC adapters used by the elements of the exported projects.
  The platform also provides scripts to run automatic backup copies (see appendix 7.4).
• Import server configuration, projects, plugins, and JDBC adapters from a file containing the metadata of a server (Import option). It is possible to specify whether existing elements with the same name should be overwritten by the ones included in the imported file. This is useful for migration and backup purposes. Denodo Scheduler also provides scripts for importing the metadata (see appendix 7.4).

The following subsections deal with each of these features, respectively.

4.2.1 Authentication

The password of the “admin” user can be changed by clicking on the link Change password. The form for updating the password asks the user to enter the old password (Old password) and the new one twice (New password and Retype new password). The changes become effective after pressing the “Accept” button; the user may cancel the operation by pressing the “Cancel” button.

4.2.2 Ports

The Scheduler server uses three port numbers for client communications: the server execution port, the server shutdown port, and the auxiliary port. These ports can be configured by choosing the link Change remote ports. NOTE: where the connection between clients and the Scheduler server has to go through a firewall, the firewall must be configured to allow access to the execution port and the auxiliary port. Port changes will take effect the next time the Scheduler server is restarted.

4.2.3 Outgoing Mail Server

The link Mail configuration allows you to modify the name of the outgoing mail server used to send reports on the execution of jobs (see section 5.4.9). It also lets you specify the e-mail address used by Scheduler to send the mail (From) and the subject of the mail (Subject). Additionally, if the outgoing mail server requires authentication, a user name (Username) and its password (Password) must be specified.

4.2.4 Execution Threads

The Scheduler server allows various extraction jobs to be executed simultaneously. Additionally, VDP, ITP, and JDBC jobs allow the same job to run different queries concurrently against the same data source, varying the parameters. The link Threads configuration lets you change the concurrency configuration of the Scheduler server. You can specify the maximum number of jobs that the server will execute concurrently with the parameter Maximum number of concurrent jobs (20 by default). A change to the number of concurrent jobs will take effect the next time the Scheduler server is restarted. For VDP-, ITP-, and JDBC-type jobs, the Scheduler server uses a pool of reusable threads to manage the execution of the multiple queries that the same job can generate. The parameters that can be configured are as follows (see the illustrative sketch after this list):

• Normal number of threads. This represents the number of threads in the pool from which inactive threads are reused (20 by default). While there are fewer threads than this in the pool, new threads will continue to be created. When a thread is requested and the number of threads in the pool is equal to or greater than this value, inactive threads are reused if they exist; otherwise, new threads continue to be created up to the value established by the following parameter. Intuitively, this parameter indicates the number of threads that the system should keep active simultaneously under normal load conditions.

• Maximum number of threads. Represents the maximum number of threads in the pool (60 by default).

• Keep alive time (ms). Specifies the maximum time in milliseconds that an inactive thread stays in the pool if the number of threads exceeds the total indicated in Normal number of threads (0 by default). If the value is 0, the threads created above this value end once the execution of their task has been completed; otherwise, the ones that exceed the time specified by this parameter will end.
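The semantics described above closely mirror those of a standard Java thread pool. The following sketch only illustrates how the three parameters relate to each other, using java.util.concurrent.ThreadPoolExecutor with the default values quoted above; it is not the actual Scheduler implementation.

import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ThreadPoolIllustration {
    public static void main(String[] args) {
        int normalNumberOfThreads = 20;   // "Normal number of threads" (default 20)
        int maximumNumberOfThreads = 60;  // "Maximum number of threads" (default 60)
        long keepAliveMs = 0L;            // "Keep alive time (ms)" (default 0)

        // Threads are created while fewer than the "normal" count exist; beyond that,
        // idle threads are reused and new ones are created only up to the maximum.
        // With a keep-alive of 0, threads above the "normal" count end as soon as
        // they finish their task.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                normalNumberOfThreads, maximumNumberOfThreads,
                keepAliveMs, TimeUnit.MILLISECONDS,
                new SynchronousQueue<Runnable>());

        pool.execute(() ->
                System.out.println("query executed by " + Thread.currentThread().getName()));
        pool.shutdown();
    }
}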

4.2.5 Plugins and JDBC Adapters

Denodo Scheduler lets you manage the extensions added to Scheduler via the link Plugins and Drivers. In the following sections, the functionalities are described in detail. NOTE: deleting extensions can cause parts of Scheduler that depend on them to cease functioning (for example, a JDBC data source that uses an adapter that has just been deleted).



NOTE: the maximum allowed size for a file is 100 MB.

4.2.5.1 Plugins

Denodo Scheduler allows users to create their own filters, exporters, handlers, or Aracne crawlers for functionalities that are not supported by the server or that are specific to a particular project. The administration tool shows a table with the extensions registered in Scheduler. For each extension it shows its name, the name of the implementation class, its type (filter, exporter, handler, or crawler), the name of the JAR file that contains it, and a link to delete the extension from the system.

To create a new extension, certain Java interfaces need to be implemented (according to the extension type), a configuration file created, and everything packed together in a JAR file (see section 6.2). To register a new extension in Scheduler, select the JAR that contains it to upload it to the server. Scheduler analyzes the JAR and, based on the metadata contained in its MANIFEST.MF file, detects the type of extension and the implementation class.

4.2.5.2 JDBC Adapters

JDBC data sources use adapters that must have been previously registered in Scheduler. In particular, Denodo Scheduler includes preinstalled adapters for some database managers (see section 7.3). It is possible to add adapters for new relational database managers by specifying the following mandatory information:

• Database adapter. The adapter name, together with the version, identifies the adapter in Scheduler.
• Version. Version of the database that the adapter applies to.
• Class name. The JDBC adapter's Java class.
• Connection URI template. A sample connection URI for the database manager the adapter is used with.
• Select JAR file to upload. JAR file containing the JDBC adapter classes.

Once a new adapter has been added, it can be deleted. However, it is not possible to delete adapters included in the product distribution.

4.3 LOG CONFIGURATION

The log configuration files for the server are located in the path DENODO_HOME/conf/scheduler (where DENODO_HOME is the base installation path). These files are based on Log4j [LOG4J]. Among other possibilities, they let you change the path where the log files are stored and the log level for the categories defined in the application. For more information, please refer to the Log4j documentation.

The Web administration tool also has a configuration file, log4j.xml, to establish the log level of the events generated by this application. This file is found in the directory DENODO_HOME/resources/apache-tomcat/webapps/webadmin/denodo-scheduler-admin/WEB-INF/classes.

The Scheduler server generates a file called scheduler.log in the path DENODO_HOME/logs/scheduler. The administration tool generates two log files:

• DENODO_HOME/logs/scheduler/scheduler-admin.log. It contains information on running the administration tool.

• DENODO_HOME/logs/apache-tomcat/denodo-tomcat.log. It contains information related to starting up/installing/stopping the administration tool in the Web server.


5 CREATING AND SCHEDULING JOBS

In addition to the “Configuration” perspective, the Denodo Scheduler administration tool presents two other perspectives: “Workspace” and “Scheduler”.

The aim of the “Workspace” perspective is to facilitate the definition of data extraction jobs. In particular, it allows projects, data sources, filter sequences, and jobs to be created, modified, and deleted. The various elements of the Scheduler workspace are organized into projects; you can have elements with the same name in different projects. On the left side, a selector with the existing projects is shown, together with a button to add new projects. The selector allows the active project to be selected. To add a new project, its name and an optional description need to be specified.

A project groups together a list of data sources, filter sequences, and jobs. For the active project, the left side of this perspective shows a tree with its different elements. Using the links expand, collapse, and refresh it is possible to expand or collapse the tree, or to synchronize it with the server. After installation, a project called “default” is automatically created. If other components of the Denodo Platform were selected during installation, a properly configured data source for each of them is also included and, if Denodo Aracne was installed, a default filter sequence called “default_arn”. To add new elements to the tree, click on the nodes Data Sources, Filter Sequences, or Jobs. The following sections describe the creation and editing screens for the different types of elements in detail. Clicking on an element of the tree displays its information, making it possible to modify it.

The “Scheduler” perspective allows you to monitor in real time the execution status of the various jobs that have been scheduled. It also allows you to check the status of the last execution (and view prior execution reports), force the execution of a job at a given time, or delete its execution, among other things. Both perspectives are described in detail in the following sections.

5.1 ACTIVE JOBS

The “Scheduler” perspective of the administration tool displays the list of Scheduler jobs. It is possible to filter by project name or job type. The table showing the job information can be sorted by any of its fields; to do so, click on the header of the field you want to sort by. The following information is provided for each job:

• Name. The job's name.
• Project. The name of the project the job belongs to.
• Type. The type of job (ARN, ARN-Index, ITP, JDBC, VDP).
• State. The current state of the job. A job can be running (RUNNING), not running (NOT_RUNNING), or disabled (DISABLED), in which case it will not move to the RUNNING state until it is enabled. Jobs also appear as DISABLED when the server has been paused using the API (the whole server can only be disabled if there is no job in the RUNNING state). See section 6.1 for more information on how to pause the server via the API.
• Previous Execution. Shows the last time the job was executed. It will be blank if the job has never been executed.



• Next Execution. Shows the next time the job will be executed. It will appear blank if the job has been disabled or if, according to its time-based scheduling, it will not run again.
• Last Execution State. Shows the completion state of the last execution of the job. A job can have completed correctly (COMPLETED), ended with an error condition detected (ERROR), or been stopped by the user (STOPPED).
• Extracted Tuples/Errors. Number of tuples/documents extracted in the last execution of the job and number of errors that occurred during this process. These numbers are updated dynamically while the job is running.
• Exported Tuples/Errors. For each extracted tuple/document, the configured filter sequences are applied, and the tuples that pass their filters are sent to the exporters. This column displays the name of each exporter (in the format <name of processor of results>#<name of exporter>), the number of tuples that have been sent to it, and the number of errors that occurred during the exporting process.
• Actions. Shows the different actions that can be performed on a job.
  o Start. Forces a full execution of the job at that moment. Only applicable to jobs that are in the NOT_RUNNING state.
  o Start With State. This functionality is equivalent to the Start action for ARN and ARN-Index jobs. In the case of ITP, JDBC, or VDP jobs that can launch multiple queries against the same data source, if in the last execution an error occurred when executing any query, or the job was stopped before completing its execution, Start With State will execute the job by completing only those queries that failed or had not yet been completed. Otherwise, it works in the same way as Start, repeating the execution of the whole job.
  o Stop. Stops a job. Only applicable to ARN (for the WebBot, IECrawler, and Custom crawlers), ARN-Index, ITP, JDBC, and VDP jobs. In the case of ARN jobs using custom crawlers, this depends on the implementation of the crawler itself.
  o Enable. Enables a job that is in the DISABLED state, so that it can be executed.
  o Disable. Disables a job, so that it cannot be executed. The job must previously be in the NOT_RUNNING state.
  o Reports. Allows access to the reports of the last executions of each job. Each report shows information on the execution date, the number of extracted tuples/documents, and the number of exported tuples/documents for each job exporter, indicating whether there were any errors when configuring the job, accessing sources, or exporting results. Depending on the type of job executed, the report shows more detailed information specific to that type of job. ARN job reports also show the URLs that have been rejected by the robots exclusion protocol or by the URL filters defined by the user (see section 5.4.2.1.3), and the URLs that caused an I/O or HTTP error. ITP, JDBC, or VDP jobs that execute multiple queries show a detailed report of the result of the execution of each query.

5.2 ADDING DATA SOURCES

To configure the extraction jobs, or the sources from which parameters are obtained for jobs that specify a parameterizable query (see section 5.4.3), the user needs to create data sources. Data sources are managed from the tree of the current project in the left part of the workspace. Clicking on the tree node Data Sources shows the list of the different types that can be created: ARN, ARN-Index, CSV, ITP, JDBC, and VDP. For each of them it is necessary to specify a name and a set of parameters depending on the type. The following sections describe the configuration needed to create or edit each of them. By default, Denodo Scheduler provides a data source for each of the Denodo Platform servers that have been installed (“arn”, “arn-index”, “itp”, and “vdp”).


5.2.1 ARN Data Sources

They configure access to the Denodo Aracne crawlers and are used in ARN jobs. To create an ARN Data Source you need to specify the following parameters:

• Host. Machine name in which the ARN server runs.
• Port. Port number for the ARN server.
• Username. Username for connecting to the ARN server.
• Password. Password associated with the specified user.
• Query timeout (optional). Maximum time to wait for the results from crawling. 0 means no time limit (0 by default).
• Chunk size (optional). Specifies the block size of results for transfers between the ARN server and Scheduler (100 by default).

5.2.2 ARN-Index Data Sources

In order to create ARN-Index jobs as well as index exporters, a data source that gives access to the Aracne indexing server for the Denodo Platform needs to be defined by specifying the following parameters:

• Host. Machine name in which the ARN-Index server runs.
• Port. Port number for the ARN-Index server.
• Username. Username for connecting to the ARN-Index server.
• Password. Password associated with the specified user.

5.2.3 CSV Data Sources

To use a CSV file as a data source for assigning values to the variables of an ITP, VDP, or JDBC job created using a parameterized query (see section 5.4.3), a CSV data source that references that file needs to be defined. When creating a CSV data source, the following parameters must be specified:

• File. The path to the file. Depending on whether the checkbox Upload to server is checked (it is checked by default), the file will be uploaded to the server or not. In the latter case, the user must specify a path that is local to the server (absolute or relative to $DENODO_HOME) where the file is stored. Note that CSV files referenced by CSV data sources created with the Upload to server checkbox unchecked will not be included in the ZIP file generated when exporting their parent project.
• Separator. The column separator used to obtain the tuples of the file. The tuple separator is assumed to be the carriage return.
• Header (optional). If this checkbox is checked, the first row of the file is used to name the fields of each tuple obtained from it.

NOTE: The maximum size for a CSV file uploaded to the server is 100 MB.
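As an illustration, a hypothetical CSV file (with Header checked and “,” as the separator) providing values for a tax_id variable might look as follows; the field names and values are invented for the example:

tax_id,company
B12345678,Acme Ltd.
A87654321,Example Corp.

Each row after the header becomes a tuple, and its tax_id field can then be used to assign values to the corresponding variable of a parameterized ITP, VDP, or JDBC job.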

5.2.4 ITP Data Sources

To be able to create an ITP job (see section 5.4.4) it is necessary to create an ITP data source in advance. To create an ITP data source the following parameters need to be specified:

• Host. Machine name in which the ITPilot server runs.
• Port. Port number for the ITPilot server.
• Database name (optional). Name of the database for executing the wrappers (“itpilot” by default).
• Username (optional). Username with which the connection to the ITPilot server will be made (“admin” by default).
• Password (optional). Password associated with the specified user (“admin” by default).
• Query timeout (optional). Maximum time (in milliseconds) that Scheduler will wait for the wrapper to execute. If not indicated (or the value 0 is specified), it waits until execution is complete (0 by default).


• Chunk timeout (optional). Maximum time (in milliseconds) that Scheduler will wait for a new chunk of results. Where this time is exceeded, ITPilot returns an empty partial result. If not specified (or the value 0 is specified), ITPilot returns all the results together at the end of the statement run (0 by default).
• Chunk size (optional). Number of results that make up a chunk of results. When ITPilot obtains this number of results, it returns them to Scheduler, even if the Chunk timeout has not been reached (100 by default).

5.2.5 JDBC Data Sources

JDBC data sources can be used for the following purposes:

• Creating a JDBC job (see section 5.4).
• Using data from a relational database to obtain values for a variable in a parameterizable query of ITP, VDP, or JDBC jobs.
• Creating a relational database exporter (see section 5.4.8).

To create a JDBC data source you need to specify the following parameters:

• Database adapter. Name of the JDBC adapter to be used to access the relational database. Section 4.2.5.2 discusses the adapters distributed with Denodo Scheduler and how new ones can be added. When selecting an adapter, the connection URI, driver class name, and classpath fields are filled in automatically. In the case of the connection URI, a connection template for the database appears, which should be modified according to the remote server that you want to access.
• Connection URI. Database access URI.
• Driver class name. Name of the Java class of the JDBC adapter to be used.
• Classpath. Path of the JAR file that contains the implementation classes of the JDBC adapter.
• Username (optional). Username for accessing the external database.
• Password (optional). User password for accessing the external database.
• Enable pool (optional). Checking this checkbox enables the use of a pool of connections against the database server. In this case, the following parameters can be specified for the pool:
  o Validation query (optional). SQL query used by the pool to verify the status of the cached connections. The query should be simple, and the table in question must exist.
  o Initial size of the pool (optional). Number of connections with which the pool is initialized. The specified number of connections are created and established in an “idle” state, ready to be used.
  o Maximum active connections in the pool (optional). Maximum number of active connections that the pool can manage at the same time (zero means no limit).
  o Maximum idle connections in the pool (optional). Maximum number of connections that can remain idle in the pool without additional connections being released (zero means no limit).
  o Test connections (optional). If this option is checked, the pool tries to validate each connection before returning it. If a connection is not valid (database restart, closed connection, etc.), it is removed from the pool and a new one is created.

A plain JDBC sketch illustrating what the connection parameters above correspond to follows this list.
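The sketch below opens the same kind of connection with plain JDBC, to clarify what Connection URI, Driver class name, Classpath, Username, and Password correspond to. The driver class, URI, credentials, and query are hypothetical values chosen to echo the Tax ID example used earlier; substitute those of the database you actually want Scheduler to reach.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JdbcParametersIllustration {
    public static void main(String[] args) throws Exception {
        // Hypothetical values: adapt the driver class, URI, and credentials to your database.
        String driverClassName = "org.postgresql.Driver";            // "Driver class name"
        String connectionUri = "jdbc:postgresql://dbhost:5432/crm";  // "Connection URI"
        String username = "scheduler";                               // "Username"
        String password = "secret";                                  // "Password"

        // The JAR referenced by "Classpath" must be available so this class can be loaded.
        Class.forName(driverClassName);

        try (Connection conn = DriverManager.getConnection(connectionUri, username, password);
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT tax_id FROM customers")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}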

5.2.6 VDP Data Sources

Allows a data source to be configured to access a Denodo Virtual DataPort server. It is necessary to create this type of data source to create a VDP job. The parameters to be specified are as follows:

• Connection URI: Connection URI to the server.
• Username: Username for connecting to the DataPort server.
• Password: Password associated with the specified user.



• Query timeout (optional). Maximum time (in milliseconds) that Scheduler will wait for the statement to complete. If not indicated (or the value 0 is specified), it waits until execution is complete (0 by default).
• Chunk timeout (optional). Maximum time (in milliseconds) that Scheduler will wait for a new chunk of results. Where this time is exceeded, Virtual DataPort returns an empty partial result. If not specified (or the value 0 is specified), DataPort returns all the results together at the end of the statement run (0 by default).
• Chunk size (optional). Number of results that make up a chunk of results. When Virtual DataPort obtains this number of results, it returns them to Scheduler, even if the Chunk timeout has not been reached (100 by default).
• Enable pool (optional). Checking this checkbox enables the use of a pool of connections against the Virtual DataPort server. In this case, the following parameters can be specified for the pool:
  o Initial pool size (optional). Number of connections with which the pool is initialized. The specified number of connections are created and established in an “idle” state, ready for use.
  o Maximum active connections in the pool (optional). Maximum number of active connections that the pool can manage at the same time (zero means no limit).
  o Maximum idle connections in the pool (optional). Maximum number of connections that can remain idle in the pool (zero means no limit).

5.3 FILTER SEQUENCES

Once data has been extracted from the sources, the tuples obtained can be filtered and/or modified by applying a filter sequence to them. A filter sequence is composed of individual filters in which the output of one filter becomes the input of the next filter in the sequence. The input of a filter sequence is the set of tuples/documents obtained by the extractors, and the output is the set of tuples/documents that pass all the filters, possibly modified or extended with additional data generated by the filters in the chain.

To create a filter sequence, click on the node Filter Sequences in the tree of elements of the current project, on the left side of the “Workspace” perspective. Once a filter sequence has been created, it can be changed or deleted by clicking on the element that represents it in the tree. In the filter sequence editing screen, to add a new filter, select the type of filter and click on the Add new filter button. It is possible to reorder the filters of a sequence by dragging and dropping a filter to the desired position in the sequence (drag & drop).

The platform provides a series of predefined filters and also offers the possibility of adding new filters to the system (see section 6.2.1). To create a filter chain, the user should specify the filters it comprises, their execution order, and the parameters of each filter. The filters included are:

• Boolean. Boolean Content Filter. Allows tuples to be filtered according to whether the content of some of their fields verifies or not a specific boolean expression composed of various keywords.

• Content-extractor. HTML, PDF, Word, Excel, PowerPoint, XML, EML, and Text Content Extraction Filter. Extracts useful texts contained in documents in the respective formats by rejecting formatting marks.

• New-field. Filter for aggregating a new field to the tuples. Adds a new field to the tuple, allowing its name and value to be specified.

• Summary-generator. Summary Generation Filter. Automatically generates a summary of the content of a document.


• Title-generator. Title Generation Filter. Automatically generates a title for the contents of a document.

• Unicity. Unicity Filter. Deletes tuples that have the same value in a specified field.

• Uri-normalizer. URI Normalization Filter. Transforms URIs into a normalized format for comparison.

• Useful-content-extractor. Useful Content Extraction Filter. This filter uses several heuristics to automatically extract the useful content of a document, eliminating browser menus, images, and other adornments common in many Web documents. This filter uses the Content-extractor filter (Content Extraction Filter) internally; therefore, the Content Extraction Filter does not need to be included if the Useful Content Extraction Filter is used.

For Aracne-type jobs, Scheduler distributes a pre-created filter sequence (default_arn). This sequence of filters features the following filters:

• Unicity Filter
• URI Normalization Filter
• Useful Content Extraction Filter
• Title Generation Filter
• Summary Generation Filter

For a more detailed explanation of the characteristics of each filter, see the following subsections.

5.3.1 Boolean Content Filter

This filter acts on the fields specified in the parameter Input field. The Expression parameter allows a set of regular expressions to be specified. The content of at least one of the specified fields must match at least one of the regular expressions; otherwise, the filter rejects the tuple. The following subsection details the syntax used to specify these expressions. Although this filter is applicable to the tuples returned by any type of job, it is especially targeted at ARN jobs; that is why the fields “title” and “content” (which always appear in the documents obtained by Aracne [ARN]) appear by default as Input fields.

5.3.1.1 Syntax for the expressions in the content filters

The expressions can be:

• Simple, formed from just one keyword.
• Compound, formed from more than one keyword combined using operators.

5.3.1.1.1 Keywords

Keywords constitute the terms to be searched in the value of a field. These should be enclosed in double quotation marks. The search for keywords in the value of a field is carried out without distinguishing between lower and upper case. Thus, for example, the keywords “Management” and “management” have the same behavior. In this document, all the keywords used as an example are written in lower case and without accents. Keywords, like expressions, can be:

• Simple, formed from one term. The search is positive if the term appears in the value of the field.

“internet”

“telecommunications”


• Compound, formed from more than one term. The search is positive only if the terms appear in the value of a field in the correct order.

“electronic commerce”

“risk prevention in the workplace”

When compound keywords like “electronic commerce” are used, only one space should be put between the terms. Each space is interpreted as one or more spaces in the document to be filtered. Wildcards can be used to represent variable or optional parts of the keywords:

• Asterisk (*) represents a group of zero or more characters without spaces, punctuation marks, hyphens, etc.

• Question mark (?) represents a single character that may or may not appear. In this case, any character is valid, including spaces, punctuation marks, hyphens, etc.

Wildcards help construct broader keywords that cover different variants of a term. For example, variations in the ending of a term can be handled by placing the asterisk wildcard at the end of it.

“grant*” would give a positive result, if the terms grant, grants, etc. appear in the value of the field.

Various terms can also be covered in one keyword, where these share the same root or the same ending.

“*silicon” would give a positive result, if the terms silicon, ferrosilicon, etc. appear in the value of the field.

“*silic*” would give a positive result, if the terms silicon, ferrosilicon, silicate, etc. appear in the value of the field.

The asterisk wildcard can also go in the middle of a term:

“elect*fy” would give a positive result, if the terms electrify, electronify, etc. appear in the value of the fields, but terms such as electricity or electrification would be left out.

Although the asterisk wildcard allows for optional content (it represents a group of characters that may or may not appear), it does not match special characters such as punctuation marks. The question mark wildcard can be used to deal with such cases.

“co?generation” would give a positive result, if the terms cogeneration, co-generation, co generation, etc. appear in the value of the field.

Wildcards can also be applied to compound keywords.

“co?generation plant*”

“industrial waste*”

5.3.1.1.2 Operators

Operators can be classified into:

• Unary operators, placed before a simple or compound expression to modify its meaning. In the case of compound expressions, these should be enclosed in brackets.

• Binary operators, which combine two expressions to form a compound expression. The expressions to be combined can be simple or compound expressions, which should then be enclosed in brackets.


The available operators are:

• Negation Operator (!) – unary operator that inverts the meaning of the expression it goes before.

!“grant” would give a positive result, if the word grant does NOT appear in the value of the field.

!“*silicon” would give a positive result, if the word silicon does NOT appear in the value of the field nor any word that ends with silicon.

• Operator AND (&&) – binary operator that requires both combined expressions to be satisfied for the global result to be positive.

“commerce” && “internet” would give a positive result, if the words commerce and internet appear anywhere in the value of the field, even where they are not contiguous.

• Operator OR (||) – binary operator that requires satisfaction of at least one of the two expressions that combine for the global result to be positive.

“commerce” || “internet” would give a positive result, if the word commerce, the word internet, or both appear in the value of the field.

More complex expressions can be formed by putting compound expressions between brackets and combining them in turn with the above operators:

“commerce” && (“electronic” || “internet”) would give a positive result, if the word commerce and either the word electronic or the word internet are contained in the value of the field.

“commerce” && (“electronic” && (!“B2C”)) would give a positive result, if the words commerce and electronic appear in the value of the field, but the word B2C does not.
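As a further illustration (the keywords used here are hypothetical), wildcards and operators can be combined in a single expression:

(“electronic commerce” || “e?commerce”) && !“auction*” would give a positive result, if the value of the field contains the compound keyword electronic commerce, or a variant such as e-commerce or ecommerce, and does not contain any word starting with auction.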

5.3.2 Content Extraction Filter (HTML, PDF, Word, Excel, PowerPoint, XML, EML, and Text)

This filter analyzes the content of the field specified in the parameter Input field to remove the markup associated with the document format in which it is encoded, and stores the resulting text in the field indicated in the parameter Output field. For example, in the case of a Web document, this filter can be used to remove HTML tags, JavaScript code, etc. The input field for the filter can be of either binary or textual type. In the case of a binary field, the content extractor auto-detects the encoding of the document contained in the input field. There is an option to specify, in the parameter MIME type field, the name of the field that contains the MIME type of the document. If this parameter is not specified, the filter will try to auto-detect the appropriate MIME type. In the case of EML documents (typically obtained using Denodo Aracne), the filter also includes the text of any files attached to the e-mail. Depending on the type of content extractor, this filter can add additional fields to the processed document:

• The EML documents extractor also adds the following fields:
o subject (text). Subject of the e-mail message.
o from (collection of texts). List of source e-mail addresses for the message.
o recipient (collection of texts). List of destination e-mail addresses for the message.
o replyto (collection of texts). List of e-mail addresses to which the message replies.
o receiveddate (date). Date the message was received.
o sentdate (date). Date the message was sent.

• The PDF documents extractor also adds the following fields:
o title (text). Document title.


o subject (text). Subject of the document.
o author (text). Name of the document's author.
o creator (text). Name of the person who created the document.
o creationDate (date). Document's creation date.
o keywords (text). Text contained in the document's keywords.
o modificationDate (date). Document's modification date.
o producer (text). Application that generated the document.

• The MS Office documents extractor also adds the following fields:
o title (text). Document title.
o subject (text). Subject of the document.
o author (text). Name of the document's author.
o creationDate (date). Document's creation date.
o keywords (text). Text contained in the document's keywords.
o modificationDate (date). Document's modification date.

Although this filter is applicable to the tuples returned by any type of job, it is especially oriented to ARN jobs. For this reason, the default Input field is “binarydata” (which contains the document obtained by the crawler in binary format), the default MIME type field is “mimetype”, and the default Output field is “content”. All these fields appear in the documents returned by Aracne [ARN].

5.3.3 Field Aggregation Filter

This filter adds one or several fields to the processed tuple. For each new field it allows a name and its value to be specified. The name of the field is established using the configuration parameter Field name. The value of the field can be established in two ways:

• By specifying a constant value in the parameter Field value.

• By specifying a field of the tuple containing a URL as value in the Input field parameter and by specifying the name of an HTTP request parameter used in the URLs of such field in Parameter name. This is useful in Aracne-type jobs, when it is known that the URL of the documents obtained includes a parameter that you want to extract separately to use it for other purposes. For example, it may happen that a parameter included in the document URL identifies the section of the website from which the document was extracted (see the example at the end of this section).

Although this filter is applicable to the tuples returned for any type of job, it is especially oriented to ARN jobs. For this reason, the default Input field is “url”, which is the field that stores the URL of the document obtained by ARN.
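For illustration (the URL and parameter names below are hypothetical), suppose the documents obtained by a job have URLs such as http://www.example.com/view?section=sports&id=1041. Configuring Input field with the value url, Parameter name with the value section, and Field name with the value section would add a new field called section with the value sports to each processed tuple.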

5.3.4 Summary Generation Filter

This filter acts on the value of the field specified in the parameter Input field (which has to be textual) and stores the result in the field specified in the parameter Output field. It uses various heuristics to automatically generate a summary of the content of the field specified. Its behavior varies depending on the type of value of the field processed:

• In the case of contents expressed in RSS format, the summary corresponds to the value of the field “description”.

• In the rest of the documents, the summary is generated automatically by applying various heuristics.

Although this filter is applicable to tuples returned for any type of job, it is particularly oriented to ARN jobs. For this reason, the default Input field is “content” (which contains the content of the document obtained by the crawler) and the default Output field is “summary”.


5.3.5 Title Generation Filter

This filter acts on the value of the field specified in the parameter Input field (which has to be textual) and stores the result in the field specified in the parameter Output field. It uses various heuristics to automatically generate a title for the content of the field specified. Its behavior varies depending on the type of value of the field processed:

• In HTML documents, the value of the title corresponds to that of the HTML tag title, if this exists on the page. If it does not exist, an alternative title is automatically generated.

• In the case of RSS items, the title corresponds to the value of the field “title” of the RSS item.

• In the case of EML documents, the value of the title corresponds to the “subject” of the e-mail message.

• In the rest of the documents, the title is generated automatically by applying various heuristics.

Although this filter is applicable to tuples returned for any type of job, it is particularly oriented to ARN jobs. For this reason, the default Input field is “content” (which contains the content of the document obtained by the crawler) and the default Output field is “title”.

5.3.6 URL Unicity and Standardization Filters

The unicity and URL standardization filters act on fields that contain a URL that can be considered the primary key for a tuple. The URL standardization filter transforms the field value specified in the parameter Input field to a standardized format to facilitate its comparison and stores it in the field specified in the parameter Output field. The URL standardization executed is as follows:

- Both the protocol and the URL host are converted to lowercase.
- The port number specification :80 is deleted, if it exists.
- The reference (“anchor”) to a section of an HTML page is deleted, if it exists.
- The characters "/../" are deleted.
- The session identifiers PHPSESSID and jsessionid, commonly used in Web sites built with PHP and Java Server Pages technologies, are deleted.

A worked example of this standardization appears at the end of this section. Although the URL standardization filter is applicable to tuples returned for any type of job, it is particularly oriented to ARN jobs. For this reason, the default Input field is “url” (the URL of the document obtained by crawling) and the default Output field is the same field “url”.

The unicity filter is used to reject tuples with repeated URLs. The field name containing the URL is specified in the parameter Input field, and the name of the field that stores the filter output is specified in Output field. Optionally, the unicity filter can be configured using the following parameters:

• Parameter to be removed: Allows irrelevant parameters of the URL to be deleted. Two identical URLs, except for the value of these parameters, shall be regarded as the same for purposes of unicity.

• Key parameter: Allows key parameters to be specified of the URL acting as an identifier. Two URLs that take the same value for these parameters will be considered the same for purposes of unicity, regardless of the value that other parameters take.

• Scope: The value indicated for this parameter will be added to all the URLs of the processed documents and will be taken into account for unicity checks (two identical identifiers but with a different Scope will be considered different). The most common use for this parameter is to prevent documents with the same URL but extracted by different jobs from overwriting each other (the identifier field value, which is often the url in this type of job, is used as the primary key in the ARN-Index schema by default). To avoid this problem, assign the job name as the value (or any other value that does not appear in the Scope of any other job).



Although the unicity filter is applicable to tuples returned for any type of job, it is particularly oriented to ARN jobs. For this reason, the default Input field is “url” (the URL of the document obtained by crawling) and the default Output field is “identifier”. Example: Suppose you want to get news from news.acme.com and it is known that the pages for each individual item have URLs of the form

http://news.acme.com/servlet/ContentServer?inifile=futuretense.ini&cid=1145997360218&arglink=nolink

http://news.acme.com/servlet/ContentServer?inifile=futuretense.ini&cid=1145997361017&arglink=nolink

where the parameter cid acts as an identifier for each news item and the rest of the URL parameters do not affect the documents of interest for the Acme job. The unicity filter would be created with the following values:

• Key parameter: cid
• Parameter to be removed: inifile
• Parameter to be removed: arglink
• Scope: acme

The Scope parameter is configured with the name of the job to restrict the unicity checks to documents downloaded by this job. These filters should form part of all filter sequences created in Denodo-Aracne-type jobs.
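As an illustration of the standardization rules described in this section (the URL is hypothetical and the exact output may differ in minor details), a crawled URL such as

HTTP://News.Acme.COM:80/sports/today.jsp;jsessionid=0A1B2C3#summary

would be standardized, after converting the protocol and host to lowercase, removing the :80 port specification, removing the session identifier, and removing the anchor, to

http://news.acme.com/sports/today.jsp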

5.3.7 Useful Web Content Extraction Filter

This filter receives the same parameters as the Content Extraction Filter (5.3.2). It uses several heuristics to automatically extract the useful content of the field value, eliminating browser menus, images, and other adornments common in many Web documents. This filter uses the content extraction filter internally; therefore, the Content Extraction Filter does not need to be included if the Useful Content Extraction Filter is used.

5.4 CONFIGURING NEW JOBS

5.4.1 General structure of a job

Denodo Scheduler has two basic types of jobs: extraction jobs and Denodo Aracne index maintenance jobs [ARN]. The following are the different types of extraction jobs supported by Denodo Scheduler:

• ARN: allow data to be extracted from unstructured data sources, mainly Web sites, file systems, or e-mail servers.

• ITP: allow data to be extracted by querying Web automation flows from Denodo ITPilot [ITP].

• VDP: allow data to be extracted by querying views or processes stored in Denodo Virtual DataPort [VDP].

• JDBC: allow data to be extracted from tables or processes contained in relational databases using JDBC. Each job of this type is configured with the data needed to access a certain database and the table or SQL query used to retrieve the required data.

All jobs have a name and a description, and share the result handlers section (Handlers Section) as well as the time-based scheduling section (Triggers Section). The extraction jobs also share the filter/exporter section (Filter Exporter Section). The extraction section (Extraction Section) is specific to each type of extraction job and is discussed in detail for each type in the following sections.


In the extraction section, the data source from which the data is obtained is specified using a previously created data source (see section 5.2). Different configuration data needs to be supplied depending on the type of data source.

In the filters and exporters section, a list of tuple processors can be specified. For each tuple processor, a filter sequence for post-processing the results obtained can be indicated, along with a list of exporters to dump the results into one or more external repositories. Denodo Scheduler supplies exporters for CSV files (this exporter can also be used to generate files compatible with MS Excel), relational databases that have a JDBC adapter, and Denodo Aracne indexes. It also allows new exporters developed for special ad-hoc needs to be used.

In the handlers section, actions to be performed once the extraction and export of all the tuples of a job have finished are specified. It allows, among other actions, sending an e-mail with the execution summary of a job to a series of e-mail addresses. It also allows the use of new handlers developed for custom needs.

Lastly, each job defines scheduling data that specifies when it will be executed. The current configuration offers features similar to those of the classic cron application for UNIX systems. In section 5.4.6 the fields returned in the documents obtained for each type of job are discussed, as well as their types.

NOTE: An incomplete job can be created by using the “Save Draft” button instead of the “Accept” button. In this case, the only mandatory field is the name of the job. A draft job is a potentially incomplete job; thus, it is not shown in the Scheduler perspective and is marked with a cross in the jobs tree. A draft job can be edited like the rest of the jobs. After completely filling in all the mandatory fields of a job, pressing the “Accept” button will create an executable job.

NOTE: You can create a new job with the same configuration as an existing one. To do this, edit the job you want to clone and press the “Save as” button. Then provide a name for the new job, and it will be created with this name and the same configuration as the original one.

5.4.2 Aracne-type Job Extraction Section

Aracne-type jobs require a previously created Aracne-type data source to be specified that identifies the server that will process the corresponding crawling job. It is also necessary to specify the type of crawler to use. Denodo Aracne allows the following crawlers to be used:

• WebBot. Crawls documents from Web sites, FTP servers, or file systems using the WebBot crawling module. In its configuration you need to indicate the initial URLs, the link and rewriting filters that must be applied, the level of exploration for each Web site, FTP server, or directory of a file system, and if the standard of robot exclusion must be respected (see section 5.4.2.1.1 for more information).

• IECrawler. Crawls documents from Web sites or FTP servers using the IECrawler module. In its configuration you need to indicate the initial URLs, the link and rewriting filters that must be applied, the level of exploration for the Web site or FTP server, and if the standard of robot exclusion must be respected (see section 5.4.2.1.2 for more information).

• Mail. Gets e-mail messages (including attachments) from servers accessed using POP3 and/or IMAP protocols. The mail server to be connected to and the e-mail accounts to be indexed need to be specified (see section 5.4.2.2 for more information).

• ExchangeMail. Retrieves e-mail messages (including attachments) from MS Exchange 2003 or 2007 servers (with compatibility mode enabled). Access is achieved using the Exchange native API. The mail server to be connected to and the e-mail accounts to be indexed need to be specified (see section 5.4.2.3 for more information).

• Salesforce. Performs queries against entities from the CRM on-line Salesforce service [SLF]. Access is achieved using the Web Service API for the service (see section 5.4.2.4 for more information).


Users can also create their own crawlers to get data from a specific source type (see section 6.2.4). Custom crawlers are added by means of the Scheduler plugin-based extension system (see section 4.2.5). Once added, the new crawler will automatically appear in the crawler selector for ARN jobs, so that the user can select it.

5.4.2.1 Web Crawling and File Systems Configuration

WebBot is a crawling module capable of getting data from Web sites, FTP servers, and file systems. It visits the URLs provided as a starting point (in the case of file systems, these URLs use the file protocol), stores the retrieved documents, and extracts the links (files or subdirectories, in the case of FTP servers and file systems) that these contain, adding them to the list of links that the crawler will visit. This process is repeated until all the URLs have been accessed or until the depth level defined to stop the crawling process has been reached. WebBot allows regular expression filters to be defined (see section 5.4.2.1.3) that make the system process only those links that match some of the filters, rejecting all the others. WebBot also allows link rewriting filters to be defined (see section 5.4.2.1.4). These filters, described in detail in subsequent sections, are used to rewrite the URLs that match a given regular expression before adding them to the list of URLs that remain to be browsed.

IECrawler is a crawling module that uses a set of Internet browsers as "robots" similar to those used by humans to surf the Web, but changed and extended to allow the execution of automatic crawling processes. The main added value of this approach is that it is capable of following links and downloading documents from any type of Web site, even if it includes JavaScript, complex redirections, session identifiers, dynamic HTML, etc. This is due to its automatic navigation module, which emulates the navigation events that a human user would produce when browsing a Web site. The current implementation of IECrawler is based on Microsoft Internet Explorer technology [MSIE].

5.4.2.1.1 WebBot

The first parameter that can be configured in this type of job is Robots exclusion. If this checkbox is checked, the job will respect the limitations set by the robot exclusion standard. This standard allows Web site administrators to indicate to crawlers which parts of the Web site should not be accessed. It is an advisory protocol: it recommends, but does not require, and relies on the cooperation of all the Web robots. For this reason, it is advised that this property remain activated, configuring it with the value “Yes” (activated by default). In this type of job, a list of Web sites or file systems to crawl can be configured. The link Add Site allows adding a new Web site or file system to be crawled. For each new site the following parameters can be specified:

• Exploration Level: Indicates the maximum depth level for stopping the crawling process of a Web site, FTP server, or file system directory. It is also possible to specify whether the default configuration should be used.

• Minimum and maximum number of workers: This indicates the initial number and the maximum number of crawlers to be run in parallel on the site while the job is being run.

• URL: Indicates the initial URL for the crawling (see the example after this list). If the URL refers to a file system directory, a URL with the file protocol should be used (e.g. file:///C:/tmp). If the URL indicated uses the FTP protocol, it must follow the format ftp://user:password\@server/directory (the symbol ‘@’ must be preceded by the escape character ‘\’). If the authentication data is not indicated, the connection to the server will be made using anonymous FTP.

• Download initial URLs: Indicates if the crawler should store the pages provided as initial URLs in the repository.

• Link and/or rewriting filters can be added (see sections 5.4.2.1.3 and 5.4.2.1.4).
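For example (the credentials and server name are hypothetical), a site pointing to an FTP directory could be configured with the following initial URL, following the format described above:

ftp://scheduler:secret\@ftp.example.com/reports

If the credentials are omitted (ftp://ftp.example.com/reports), the connection will be made using anonymous FTP.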


5.4.2.1.2 IECrawler

The configuration and use of IECrawler-type jobs are described below. IECrawler differs from WebBot in that its Web exploration processes work at a higher abstraction level. More specifically, IECrawler uses a group of Internet browsers for executing crawling processes. These are similar to those used by humans when Web browsing, but modified and extended to allow the execution of automatic crawling processes. The main configuration difference lies in the details of the Web site to be explored. In this case, IECrawler only allows one Web site or FTP server to be configured. The following parameters can be configured:

• Exploration level: Indicates the maximum depth level for stopping the crawling process of a Web site or FTP server.

• Maximum number of browsers. This indicates the maximum number of crawlers (browsers) to be run in parallel on the site.

• URL: Initial URL for crawling. Indicates a URL to navigate to and its type. Multiple URLs can be added with the “Add URL” button. The types of URLs allowed are GET, POST, and sequence (NSEQL navigation sequence. See [NSEQL].). If the URL indicated uses the FTP protocol, it must follow the format ftp://user:password\@server/directory (the symbol ‘@’ must be preceded by the escape character ‘\’). If the authentication data is not indicated, the connection to the server will be made using anonymous FTP.

• Link and/or rewriting filters can be added (see sections 5.4.2.1.3 and 5.4.2.1.4).

5.4.2.1.3 Using URL filters

In WebBot- and IECrawler-type jobs there is a link filter configuration section. This type of filter allows configuring which links should be traversed by the crawling process depending on whether or not they satisfy certain regular expressions. Inclusion filters allow specifying regular expressions that should match the URLs or the texts of the new links discovered by the crawling process. If a link discovered during the crawling process does not match any of the specified expressions, the link will not be traversed and its associated document will not be downloaded (and, therefore, the possible links from the document will not be traversed either). Exclusion filters can also be specified. In this case, the links that match the regular expression associated with the filter will be rejected. It is important to highlight that the filters are applied in the order in which they are defined; the evaluation stops for a link the first time it matches the regular expression defined in one of the filters. To add filters to the system, the “Add Link Filter” button is used. For each filter the following parameters should be specified:

• Pattern expression: The regular expression that defines the filter for the link. The supported syntax is described in section 7.2.

• Included: Indicates whether the links matching the regular expression should be included or rejected in the list of pages that the crawler should visit.

• Apply pattern expression on anchor: If checked, the regular expression is applied to the link text instead of the URL.

Example: Suppose you only want to get the news pages from the Web site http://news.acme.com. We know that these documents contain the word “news” in their path and that the domain should not be considered. Thus, after pressing the “Add Link Filter” button the following data should be entered:

• Regular Expression: (.)*news(.)*
• Included: Yes

Certain functions can be specified in the regular expression of the link filters. Aracne includes the function DateFormat for handling dates (see section 7.1 for a description of its syntax); new functions can also be added (see [ARN]).


The order of the link filters is very important: they are processed in order. If only exclusion filters are specified, all URLs are discarded. If you want to specify only exclusion filters, you have to add, at the end of the link filter list, an inclusion link filter such as .* ; this way, by default, all URLs will be included except those matching the previously defined exclusion filters.
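For example (the expressions are hypothetical and assume the syntax described in section 7.2), to crawl a site while skipping its PDF documents, the following link filters could be defined, in this order:

• Pattern expression: (.)*\.pdf, Included: No
• Pattern expression: .*, Included: Yes

The first filter rejects links whose URLs contain .pdf; the final inclusion filter accepts every other URL.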

5.4.2.1.4 URL rewritings

In WebBot- and IECrawler-type jobs there is a rewriting filter configuration section which allows rewriting rules to be defined for the links. To add rules of this type to the system press the “Add Link rewrite” button, which displays the filter edit screen. This screen contains the elements indicated below:

• Pattern expression: Defines the regular expression of the URLs of the links to be rewritten. See section 7.2 for a description of the required syntax.

• Substitution: Defines the regular expression with which the link URL will be replaced. It can refer to fragments of the above-mentioned regular expression that correspond to groups of the matched regular expression (see section 7.2 for more information on groups). For example, to retrieve the i-th group the tag $i would have to be included in this regular expression.

Example: A link rewriting filter can be defined to obtain news from the Web site news.acme.com if we know that all the useful content of each news page is reachable through the link Print news, i.e. with the advertisements and navigation menus removed. If, for example, the news pages have URLs such as:

http://news.acme.com/news/12/121465.html

and the link Print is as follows:

http://news.acme.com/news/print.php3?id=121465

then the filter will be defined as follows:

• Pattern: http://news.acme.com/news/(.)+/(.+).html
• Substitution: http://news.acme.com/news/print.php3?id=$2

It is possible to specify certain functions in the pattern expression and in the substitution expression for the rewriting filters. Aracne includes the function DateFormat (see section 7.1 for a description of its syntax); it is also possible to add new functions (see [ARN]).
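The following minimal sketch (illustrative only, not part of the product) shows how a pattern/substitution pair of this kind behaves, using the syntax of java.util.regex; the expression syntax actually supported by the rewriting filters is the one described in section 7.2, so the escaping shown here is an assumption of this example:

import java.util.regex.Pattern;

public class RewriteFilterDemo {
    public static void main(String[] args) {
        // Hypothetical illustration of the rewriting filter defined above.
        // Dots are escaped because java.util.regex requires it.
        String pattern = "http://news\\.acme\\.com/news/(.+)/(.+)\\.html";
        String substitution = "http://news.acme.com/news/print.php3?id=$2";
        String link = "http://news.acme.com/news/12/121465.html";
        String rewritten = Pattern.compile(pattern).matcher(link).replaceAll(substitution);
        // Prints: http://news.acme.com/news/print.php3?id=121465
        System.out.println(rewritten);
    }
}

The key point is that the fragment captured by the second group of the pattern (the news identifier) is reused in the substitution through the $2 reference.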

5.4.2.2 POP3/IMAP E-mail Server Crawling Configuration

This crawler allows retrieving the content (including the attached files) of the messages from one or several e-mail accounts of a server accessed by using either the POP3 or IMAP protocols. The parameters specified for this crawler are:

• Host: Name of the incoming e-mail server. The protocols allowed are POP3 and IMAP.

• Accounts: E-mail accounts whose messages will be retrieved and indexed by Aracne. The username (User) and password (Password) need to be specified for each account.

5.4.2.3 MS Exchange Server Crawling Configuration

This crawler allows retrieving the content (including the attached files) of the messages from one or several e-mail accounts of an MS Exchange server. It also allows retrieving the data contained in all the server's mail accounts, provided that a user/password with administrator rights on the MS Exchange server is given. The ExchangeMail crawler has been tested successfully with MS Exchange 2003 and MS Exchange 2007 (provided that, during the installation process for Microsoft Exchange, in the client configuration, you select the option specifying that some client machine in the organization runs Outlook 2003 or earlier versions, or Entourage). The parameters specified for the extraction section of this crawler are:

• Exchange Crawler Server Host. Name of the machine in which the Aracne ExchangeMailCrawler server is installed.

• Exchange Crawler Server Port. The port number in which the Aracne ExchangeMailCrawler server is launched.

• Exchange Server Name. Name of the machine on which the Microsoft Exchange Server is installed.

• Administrator Account Login. Administrator's user ID in Microsoft Exchange Server.

• Exchange Server Login. User ID used for authentication on the machine on which Microsoft Exchange Server is installed. The user will need to have administrator rights in Exchange. This parameter is only needed if Microsoft Exchange Server and the Denodo Aracne ExchangeMailCrawler server are installed on different machines.

• Exchange Server Password. User password used for authentication on the machine in which Microsoft Exchange Server is installed. The user will need to have administrator rights from Exchange. This parameter is only needed if Microsoft Exchange Server and the Denodo Aracne ExchangeMail server are installed on different machines.

• General Minimum Date. If this parameter is specified, the crawler will only get those messages received on or after the specified date. If it is not specified, all the messages contained in the server will be obtained. The date must be specified in the format YYYY-MM-dd (e.g. May 2, 2007 will be written as “2007-05-02”).

• Users. E-mail accounts whose messages will be retrieved and indexed by Aracne. For each account the username and, optionally, a minimum date need to be specified. If this last parameter is not specified, the value of the parameter General Minimum Date is used. If no user is indicated, the data from all the mail server accounts will be obtained.

5.4.2.4 Crawling Configuration of Entities in SalesForce.com Accounts

This crawler allows data contained in an on-line Salesforce.com CRM service [SLF] account to be accessed using its Web service. The parameters specified for this crawler are:

• Login. User ID used for authentication on Salesforce.com.

• Password. User password used for authentication on Salesforce.com.

• Element. Name of the Salesforce data entity to be queried (e.g. “Lead”).

• Field name. Multivalued parameter that lets you specify the names of the fields that you wish to obtain in the query to the element.

5.4.3 VDP Extraction Section

To configure the extraction section for VDP-type jobs, a VDP-type data source needs to be selected. Once selected, the query to be performed against the VDP server needs to be specified using a parameterized query statement (Parameterized query field). A parameterized query is a query expressed in the server's query language (in this case, the Virtual DataPort query language, called VQL [VDP]), which can include variables prefixed with the character @ (the detailed syntax for including variables is explained in section 5.4.3.1). A parameterized query that includes variables represents a group of queries that you want to execute against the server. The different queries are generated by replacing each variable with a value included in a list of input values. The lists of input values are obtained from a data source (see section 5.2).


Example: Suppose that the following parameterized query has been configured to obtain data from a view called CLIENTS in the DataPort server:

SELECT * FROM CLIENTS WHERE TaxId=’@TaxId’

Suppose also that a data source for accessing a CSV file that contains a list of Tax Ids as follows has been configured:

B78596011
B78596012
B78596013

and that these values have been associated with the variable @TaxId (in section 5.4.3.2 there is a detailed explanation of how to do it). Then Denodo Scheduler would generate the following queries on DataPort to get all the job data:

SELECT * FROM CLIENTS WHERE TaxId=’B78596011’
SELECT * FROM CLIENTS WHERE TaxId=’B78596012’
SELECT * FROM CLIENTS WHERE TaxId=’B78596013’

In section 5.4.3.2 the various sources that can be used to assign possible values to variables are described. Each source can provide values to one or more variables simultaneously. Where there are different sources, Scheduler generates as many queries from the parameterized statement as there are possible combinations of the data returned by each of the sources that provide values to variables. Additionally, it is possible to configure the number of query combinations to run from the same parameterized query and the level of concurrency of the execution (a brief example follows the list below):

• Maximum number of iterations. Specifies the maximum number of queries to be generated from the parameterized query statement specified. If the value specified is greater than the number of query combinations generated, it is ignored. If it is not specified, then all the combinations are executed.

• Maximum number of concurrent iterations. Specifies the maximum number of queries that will be executed in parallel from the queries generated by the parameterized query. The concurrent execution is performed by blocks, i.e. while the execution of the queries of the first block has not ended, execution of the next block of queries is not initiated. If this value is not specified, all queries will be executed sequentially.
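For example (the figures are hypothetical), if the parameterized query generates 10 query combinations, setting Maximum number of iterations to 6 and Maximum number of concurrent iterations to 3 would cause Scheduler to execute only 6 of the combinations, in two consecutive blocks of 3 queries each; the second block starts only after all the queries of the first block have finished.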

It is important to note that Scheduler logs query combinations that have been successfully executed to distinguish them from those that have not been executed yet or have returned some type of error. Therefore, it is possible to associate a handler for retries with the job (see section 5.4.9) that repeats the execution of those queries that have returned an error in their last execution and those that have not been executed yet (useful in the event that, for unknown reasons, the server has performed its execution in an irregular way). It will also be possible to force the execution of failed queries from the “Scheduler” perspective using the action Start with state (see section 5.1).

5.4.3.1 Syntax of Parameterized Queries

A parameterized query is an expression depending on variables which generates, as a result, a character string representing a query. Variables are specified by prefixing them with the symbol ‘@’, followed by the name of the variable, provided that this name is an alphanumeric character string (letters and the characters ‘#’ and ‘_’). Variables with a name that includes any other character can be specified by enclosing the name between the symbols ‘@{‘ and ‘}’. NOTE: When any of the symbols ‘@’, ‘\’, ‘^’, ‘{‘, ‘}’ appear in the constant parts of the parameterized statement, they must be escaped with the character ’\’ (i.e. \@, \\, \^, \{, \}).
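For illustration (the view and field names are hypothetical), the following parameterized statement uses a simple variable, a variable whose name contains a space, and an escaped ‘@’ character in a constant part:

SELECT * FROM MESSAGES WHERE address=’support\@acme.com’ AND sender=’@{user name}’ AND priority=’@priority’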


5.4.3.2 Configuring the Values to be used in a Parameterized Query

Parameterized queries can obtain their values from different data sources. Scheduler allows data to be obtained from a CSV file, from a query against a database, or from a manually entered list of values. The configuration needed for each type of source is as follows:

• CSV. A CSV data source that has been created previously.

• DATABASE. A JDBC data source (or VDP data source) needs to be selected, and a non-parameterized query to be executed against the database (Query (non-parameterized) field) needs to be specified.

• LIST. A list of values needs to be specified (Values field), separated by the character specified in the Separator field.

In the case of the LIST type, each tuple consists of a single field and, therefore, can only assign values to one variable. In the case of DATABASE and CSV sources, a tuple can include various fields and, therefore, can assign values to more than one variable.

Example: Suppose that the following parameterized query has been configured to obtain data from a view called COMPANY in the DataPort server:

SELECT * FROM COMPANY WHERE NAME=’@COMPANYNAME’ AND INDUSTRY=’@COMPANYINDUSTRY’

Suppose that a data source for accessing a CSV file that contains a list of tuples with two fields each has also been configured. The data in the CSV file is as follows:

COMPANYNAME;COMPANYINDUSTRY
Denodo;Information Technologies
Acme Drinks;Beverages

Then the CSV source fields could be assigned to the variables so that the following queries are generated to obtain all the job data:

SELECT * FROM COMPANY WHERE NAME=’Denodo’ AND INDUSTRY=’Information Technologies’
SELECT * FROM COMPANY WHERE NAME=’Acme Drinks’ AND INDUSTRY=’Beverages’

Once the data sources have been added and configured, it is necessary to define for which query variables each source returns values. This can be done in two different ways:

• Implicit association. This type of association is only applicable for those sources that return tuples with field names (DATABASE and CSV files that specify header). In these cases, it is assumed that the variables used in the parameterized query have the same name as some of the fields returned by the data sources.

• Explicit association. CSV and DATABASE sources allow defining associations between the variables in the query (Query Parameter) and the name of the field in the source (Source Parameter).

o In the case of CSV files that include a header that specifies the field names, the association is done with the name of the field; for CSV files without a header, the association is carried out by the position of the field, starting at 0 (see the example after this list).

o In the case of DATABASE sources, the association is made between the name of the variable in the query and the name of the field in the source.
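For example (the data and variable names are hypothetical), for a headerless CSV file whose lines are of the form B78596011;Madrid, the variable @TaxId would be explicitly associated with Source Parameter 0 and the variable @City with Source Parameter 1.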

It is important to take into account that several sources cannot return values for the same variable.

It is also possible to define a VDP job for querying an ITP wrapper without creating its corresponding view in VDP. The syntax of the query is as follows:

QUERY WRAPPER ITP <name:identifier> [ ( <name:identifier> = <value:literal> [, <name:identifier> = <value:literal>]* ) ]
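For example (the wrapper and parameter names are hypothetical, and the literal quoting style is assumed to follow the VQL examples above), a query against an ITP wrapper could be written as:

QUERY WRAPPER ITP product_search ( category = ’laptops’, max_price = ’1000’ )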


Using this syntax, ITP wrappers can be queried by using JDBC, to obtain parameter values to be used in parameterized queries.

5.4.4 ITP Extraction Section

The extraction section for ITP-type jobs allows specifying an ITPilot [ITP] wrapper. The ITP extraction section is similar to the one previously described for VDP-type jobs (see section 5.4.3). The idea is the same, i.e. it allows a set of queries to be defined against the same wrapper of an ITP server, allowing concurrency features and a maximum limit of queries to be configured. The only differences with regard to VDP-type jobs are as follows:

• A previously created ITP-type Data source needs to be selected.

• Instead of specifying a parameterized query, the name of the wrapper to be queried is indicated (Wrapper name), along with the list of wrapper fields that you want to retrieve (Output fields). The list of wrapper fields plays the role of the variables in the parameterized query of the VDP extraction section, and data sources can be specified for those fields (in the case of the sources for parameters of ITP wrappers, implicit associations are not applicable; associations must always be specified explicitly).

5.4.5 JDBC Extraction Section

The databases containing the required data will be accessed using the JDBC standard (Java Database Connectivity) [JDBC]. Appendix 7.3 shows the JDBC drivers included and/or normally used by Denodo Aracne. It is also possible to easily incorporate other JDBC drivers into the system (see section 4.2.5.2). The JDBC job extraction section is configured in the same way as the VDP jobs (see section 5.4.3). The only difference is that, instead of selecting a VDP data source, you have to select a JDBC data source, and instead of specifying a parameterized statement in VQL language, it has to be specified in SQL.

5.4.6 Data Schema Generated by the Different Types of Extraction Jobs

All the extraction jobs return the following fields in all their tuples, in addition to their own fields:

• _$job_project (text). Name of the project the job belongs to.
• _$job_name (text). Name of the job.
• _$job (numerical). Identifier of the job.
• _$job_start_time (numerical). Time (in milliseconds) when the job was first executed.
• _$job_retry_start_time (date). Time at which the current job execution started.
• _$job_retry_count (numerical). Number of the current retry execution.

With regard to Aracne jobs, all the crawlers include the following fields. In the case of WebBot and IECrawler, all these fields will have a non-null value, while in the rest of the crawlers some of them may have null values:

• url (text). Represents the URL for the document obtained.
• path (text). Represents the path in the local file system associated with the document. The path is relative to DENODO_HOME/work/arn/data/repository.
• title (text). The title of the document in the case of HTML. Value of the ‘Title’ tag for RSS documents.
• charset (text). Document encoding obtained from the server's response or from metadata on the document.
• mimetype (text). The document's MIME type. This information is obtained from the server's response. In the absence of this data, the document is analyzed to try to detect it automatically.
• anchortext (text). In the case of web crawling, text of the link that pointed to the document.
• binarydata (binary). Binary representation of the document.

The fields returned by the rest of the crawlers are as follows:


• WebBot. When it is used to crawl a file system or FTP server, the tuples also include one additional field called filename that specifies the file name of the document.

• Mail. Generates the fields url, path, binarydata, and mimetype. The url field encapsulates data from the server and user account queried. The path field points to a file that stores the content of the mail (with attachments) in .eml format and the binarydata field contains the file content. The value of the mimetype field is always “message/rfc822”. The folder field specifying the name of the folder from which the message was extracted may also be returned.

• ExchangeMail. Generates the fields url, path, binarydata, and mimetype. The url field encapsulates data from the server and user account queried. The path field refers to the file that stores the content of the mail (with attachments) in .eml format and the binarydata field contains the file content. The value of the mimetype field is always “message/rfc822”. It can also generate the login and user fields, if the mails from all users are obtained.

• Salesforce. Generates the url field, and the mimetype field value is always “Structured Textual Content”. It will also generate a field for each field specified in the Field Name parameter from the query on Salesforce. All these fields will be considered as being of text type.

Additionally, for RSS-type documents the following fields are added: pubdate(text), categories(collection of texts), and description(text). They contain the corresponding value for those fields of the RSS item. With regard to the rest of the extraction jobs (ITP, JDBC, and VDP), all the fields retrieved from their respective data sources are added.

5.4.7 Jobs for Maintaining Aracne Indexes

The ARN-Index maintenance section allows configuring maintenance operations to be performed on one or several Denodo Aracne indexes. The operations that can be added are as follows (it is possible to add one or several):

• CHECKURI. This action is used to detect and remove documents that are no longer available on the Web. It allows specifying a query to be executed (Query parameter), the list of indexes on which the query will be executed (Indexes), the name of the field in the index that contains the URI of the document (URI field), the primary key field name for the document in ARN-Index (Identifier field, with the value “identifier” by default) and the number of times that access to a URL may fail before it is deleted from the index (URI errors threshold). For each document obtained as a result of performing the query, the job checks whether its URL remains accessible and does not return any errors. If the number of consecutive times that a document returns an error exceeds the threshold configured by the administrator, the document is deleted from the index. In turn, if a document that previously returned an error is accessible in the following execution of the maintenance job, its error counter is reset to 0.

• DELETEDOCUMENTS. Allows a query to be specified (Query parameter) and one or several indexes on which to perform the query (Indexes parameter). The documents obtained as a result of performing the query will be deleted from the indexes.

NOTE: The query syntax for both actions is documented in Appendix 6.1 of the Denodo Aracne Administrator Guide [ARN].

5.4.8 Postprocessing Section (Filters/Exporters)

In Denodo Scheduler, different post-processing actions can be defined on the tuples obtained. Each post-processing action (PROCESSOR) is composed of one filter sequence and one or several exporters of results. The filter sequence has to be created in advance (see section 5.3). It allows using different criteria to filter/transform the tuples that will be sent to the different exporters. The exporters provided by Denodo Scheduler are listed below, along with a brief description of the use of each one (users can also create their own custom exporters; see section 6.2.2).


• ARN-Index. Stores the extracted tuples in an index of the Aracne indexing server (see Denodo Aracne Administrator Guide [ARN] for more information). To configure this exporter the following parameters need to be specified:

o Data source. For selecting the ARN-index data source associated to the index to which the tuples/documents will be exported.

o Index name. Name of the ARN-Index server index in which the tuples/documents will be stored.

o Clear index. If checked, before proceeding to store new documents in the specified index, all documents associated with the identifier of the current job will be deleted.

• CSV. Stores the tuples obtained in a plain text file delimited by the character specified in the Separator parameter. The user can specify the path and the name of the generated file. By default, it is generated in the directory DENODO_HOME/work/scheduler/data/csv with the name <PROJECT_NAME>_<JOB_NAME>_<JOB_ID>_Processor#<PID>_CSVExporter#<EID>_<JOB_START_TIME>.csv, JOB_START_TIME being the moment of execution of the job expressed in milliseconds since January 1, 1970. Checking the Include header checkbox includes a first line in the file with the names of the fields of each exported tuple. If the Overwrite checkbox is ticked, the file name will be generated without the information of the moment of execution (without the JOB_START_TIME), so that multiple executions of the job will overwrite the same file instead of generating new files. Use the Encoding drop-down to select the encoding of the text file. Finally, if you want to include the project name, job name, job identifier and job start time among the fields of the exported tuples, it is necessary to check the last checkbox (Export job identifier, job name, project name and execution time fields). NOTE: You can use the CSV exporter to generate files that can be directly opened with MS Excel. In order to do that, you just need to specify ‘;’ as the separator.

• File-Repository. Although it can be used with any type of job, this exporter is specially oriented to ARN jobs. This exporter allows a local repository to be created with the documents obtained by the crawler. The exporter stores the content of the tuple field specified in Content field on disk. The files will be stored in the directory DENODO_HOME/work/scheduler/data/repository/<PROJECT_NAME>_<JOB_NAME>_<JOB_ID>_<JOB_START_TIME>. The relative path of each file inside that directory is specified by the value of the tuple field indicated in the Path field parameter. In ARN jobs, the “binarydata” field of the tuple contains the document obtained in the crawling, and the “path” field contains the path extracted for it from the website. Therefore, if we respectively specify “binarydata” and “path” as values for the Content field and Path field parameters, this exporter will create a local repository replicating the structure of the crawled website. If the Clean checkbox is ticked, the repository contents will be deleted before each execution of the job.

• JDBC. Stores the tuples in a relational database table. To configure it, a JDBC-type Data source needs to be specified that accesses the relational database containing the table. The Table name parameter (case sensitive) specifies an existing table where the tuples will be inserted. By default, the destination table schema must have a field with a compatible data type for each job field. It is also possible to specify associations between the field names in the tuples obtained by the job (Document field parameter) and the actual names of the fields in the database table (Table column; this parameter is case sensitive). If associations have been specified, the Export only mapped attributes checkbox allows exporting only those fields. You can also include the project name, job name, job identifier and job execution time among the fields of the exported tuples by checking the Export job identifier, job name, project name and execution time fields checkbox.

• SQL. The SQL exporter functionality is similar to that of the JDBC exporter, but instead of inserting tuples in a relational database it generates a text file with the corresponding SQL INSERT statements. This way, the generated file can be used to load the data into a database at a later time. Unlike the JDBC exporter, it does not require a Data source to be specified, but it is possible to specify a database adapter (Database adapter) to take into account the differences in the SQL syntax used by different databases (currently a special adapter for Oracle and another for SQL Server are supported). The user can specify the path and the name of the generated file. By default, it will be generated in the DENODO_HOME/work/scheduler/data/sql directory with a name following the convention <PROJECT_NAME>_<JOB_NAME>_<JOB_ID>_Processor#<PID>_SQLExporter#<EID>_<JOB_START_TIME>.sql, JOB_START_TIME being the moment the job is executed expressed in milliseconds since January 1, 1970. If the Overwrite checkbox is ticked, the file name will be generated without the information of the moment of execution (without the JOB_START_TIME), so that multiple executions of the job will overwrite the same file instead of generating new files. Use the Encoding drop-down to select the encoding of the text file.

NOTE: The ARN-Index exporter does not support indexing of binary fields. In the case of the CSV and SQL exporters, binary fields are exported as text encoded in base64.

NOTE: Compound data types are exported as text, formatted in XML.

NOTE: If a JDBC exporter is used and the export of a tuple fails, the process will continue with the remaining tuples of the block, unless the exception is applicable to the whole set of tuples, in which case they will be discarded. The report of the job will show the document identifier (tuple number) that failed and the reason.

NOTE: In the CSV and SQL exporters, the resulting files can be encrypted (using the MD5-DES cipher algorithm) or compressed (ZIP file).

NOTE: When exporting data to ODBC targets, the JDBC exporter used must have a data source configured with the “JDBC-ODBC Bridge” driver. If the target is an Excel sheet, the data source must be configured with the “Excel” adapter and the Table name field must be filled in with “[Sheet1$]”. Both drivers are distributed with Denodo Scheduler (see JDBC Drivers). When working with Excel sheets it is important to take into account some limitations (see http://office.microsoft.com/en-us/excel/HP051992911033.aspx for Excel 2003 and http://office.microsoft.com/en-us/excel/HP100738491033.aspx for Excel 2007).

5.4.9 Handler Section

The last step before completing the execution of a job is to execute the handlers that have been configured. Denodo Scheduler allows one or several of the following types of handlers to be added for a job (users can also create their own custom handlers. See section 6.2.3):

• MOVE-FILE-REPOSITORY. This handler is only applicable for jobs that use the FILE REPOSITORY exporter. It moves the data repository created in the local file system to a different path specified in the Destination directory parameter.

• MAIL. Allows the results report on the job execution to be sent by e-mail (see section 5.1, Reports action on a job). Requires a list of destination e-mail addresses to be specified (Mail addresses) and allows the following sending conditions to be set:

o Always. An e-mail is sent to the recipients each time a job execution is completed.

o Only when errors. An e-mail will only be sent when there have been errors in the execution of the job (any errors that have occurred before the handlers’ stage, i.e. extraction/filtering/export errors).

o Only when number of exported documents is less than XX. An e-mail will be sent when the number of exported tuples is below the threshold that is specified as a parameter.

• RETRY. A retry handler allows the execution of a job to be repeated in certain exceptional circumstances. Allows the retry of a job execution to be configured in the following situations:

o Only when errors. The whole job is repeated when errors occurred during its execution. This option is only applicable to extraction jobs. In the case of ITP, JDBC, or VDP jobs, the job will only be retried if errors occurred in the extraction stage, and in the case of ARN jobs, only if there are URLs with HTTP access errors or input/output errors.

o Only queries which return an error. When there have been errors in the execution of the job (during the extraction stage), it only repeats those queries that have returned an error or have not started to be executed. This option has the same effect as executing the job from the “Scheduler” perspective using the “Start with state” option. For ARN jobs that do not execute multiple queries, this option has the same effect as the previous one (Only when errors). As in the previous option, it is not applicable to ARN-Index-type jobs.

o Only when number of exported documents is less than XX. The whole job is repeated when the number of exported documents is less than the threshold specified.

This handler also lets you configure how long retries keep being performed, in the event that after one or several retries the errors persist. The options are as follows:

o While errors. While errors still occur.

o Until this number of times is reached XX. When it has retried up to the configured number of times.

A job can only have one retry handler, and it is always executed first (so that the rest of the configured handlers know whether the job has finished or will be repeated). There are no restrictions for the other handlers. When a retry handler is used together with an exporter to a file (CSV or SQL), all the job retries are exported to the same file.

5.4.10 Time-based Job Scheduling Section

In the Triggers section tab the moments at which the job will be executed are configured. It is possible to add several time-based scheduling configurations for each job, each one with an associated expression in the cron format (see section 5.4.10.1).

Optionally, you can assign a start and end time (Start time and End time respectively) to the job. In that case, the job will only be executed when specified by the cron expression and when, in addition, the current time is in the specified interval. The start and end times can be entered manually, respecting the syntax yyyy-MM-dd and HH:mm for date and time respectively, or by using the calendars associated with each field.

Also optionally, it is possible to define dependencies of the trigger on other jobs, so that the execution of this trigger will wait until the jobs it depends on have successfully finished (see section 5.4.10.2 for more information).

By default, the generated cron expression represents the execution of the job every day at midnight. The cron expression can be entered manually (see the syntax in section 5.4.10.1) or visually, using the visual editor for cron expressions. The visual editor allows specifying the minutes/hours/days/months and days of the week in which the job will run (the value for seconds is not visually configurable and is assumed to be 0). It is also possible to specify typical periodical schedules such as “every day” or “every five minutes, every day”.

NOTE: By writing the cron expression manually, it is possible to specify configurations that cannot be configured visually (see section 5.4.10.1).

5.4.10.1 Syntax for the cron Expression

A cron expression is represented by a set of values for seconds, minutes, hours, days, months, and days of the week separated by spaces. Each element allows a simple value or a set of values. The job will be executed when the expression has values in all the fields that coincide with the current time. There are several ways of specifying different values for a field:

• The comma (‘,’) operator specifies a list of values, such as “1,3,4,7,8”.
• The dash (‘-’) operator specifies a range of values, such as “1-6”, which is equivalent to “1,2,3,4,5,6”.
• The asterisk (‘*’) operator specifies all the possible values for a field. For example, an asterisk in the hours field means that the job will be executed every hour.

Page 39: Denodo Scheduler Administrator Guide - Denodo Platform Help

Scheduler 4.5 Administrator Guide

Creating and Scheduling Jobs 24

• A slash operator (‘/’) can also be used to specify increments. For example, “*/3” in the hours field is equivalent to “0,3,6,9,12,15,18,21”. A number before the slash specifies the initial value.

The range of values allowed by each field is as follows: seconds (0-59), minutes (0-59), hours (0-23), days of the month (1-31, plus the special characters ? L W), months (1-12 or JAN-DEC), and days of the week (1-7 or SUN-SAT, plus the special characters ? L #). There are some special cases:

• ‘?’ is permitted for the “day of the month” and “day of the week” fields. It is used when no specific value is to be set. It is useful when you need to specify a value in one of these two fields, but not in the other.

• The ‘L’ character is permitted for the “day of the month” and “day of the week” fields. This character is an abbreviation of “last”, but has a different meaning in each field. For example, the value L in the day of the month field means “the last day of the month”. In the day of the week field it means ‘7’ or “SAT”. If it is used after another value, it means “the last day xxx of the month”. For example, “6L” means “the last Friday of the month”. When the L option is used, it is important not to specify lists or ranges of values so as not to get confusing results.

• The ’W’ character is allowed for the day of the month field. It is used to specify the weekday (Monday through Friday) closest to the day specified. It cannot be used with ranges of days.

• The ’L’ and ‘W’ characters can be combined in the expression of the day of the month (LW). In this case, it would mean the last weekday of the month.

• The ‘#’ character is allowed for the day of the week field. It is used to specify the n-th day XX of the month. For example, “2#1” means the first Monday of the month.
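The expressions below are illustrative examples of the syntax described above (they are not values required by Scheduler):

0 0 0 * * ?           Every day at midnight.
0 */5 * * * ?         Every five minutes, every day.
0 30 14 ? * MON-FRI   At 14:30, Monday through Friday.
0 0 2 1 * ?           At 02:00 on the first day of every month.
0 0 12 ? * 6L         At 12:00 on the last Friday of every month.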

5.4.10.2 Dependencies among Jobs

When editing a job, it is possible to specify the job(s) it depends on. The cron time trigger has a section for this purpose. Each trigger may have its own dependencies. A trigger will not execute the job unless the jobs it depends on have finished their execution. Optionally, the maximum time (in milliseconds) to wait for the dependencies to be completed can be specified.

It is important to note that jobs cannot depend on jobs that are in draft state. Nevertheless, draft jobs may have dependencies (and they will be kept if the project is exported). If a job is waiting for a trigger's dependencies to be satisfied and the job is then executed as a consequence of another trigger, the wait for the first trigger's dependencies is reset.

Example: suppose a job "A" has a trigger that depends on a job "B". While "B" has not begun or while "B" is executing, "A" will be in the WAITING state (waiting for "B" to finish). When "B" finishes, job "A" will be executed. If, while "B" is being executed, another trigger causes the execution of "A", then the wait of the first trigger is reset. Therefore, when "B" finishes, "A" will not be executed. Another execution of "B" starting and ending while "A" is not running would be required for the dependencies of the first trigger to be satisfied.

If "B" were deleted, the configuration of job "A" would be updated (removing this dependency from the trigger(s) that have it) and "A" would change to the DISABLED state (although if "A" were in the RUNNING state, it would continue executing until finishing and would then change to the DISABLED state). The user must then enable or edit the job to re-schedule it.


6 DEVELOPER API

6.1 CLIENT API

6.1.1 Scheduler

The Denodo Scheduler API allows jobs, filter sequences, data sources, and handlers to be programmatically configured and managed. To do this, the platform provides a facade, com.denodo.scheduler.client.SchedulerManager, which offers the following features:

• Get the job list with information on the state of each job (JobData): getJobsInformation.
• See a job's execution reports (JobReport): getJobsReport.
• Create, obtain, update, and delete the configuration of a job, a filter sequence, or a data source (Configuration): addElement, getElement, updateElement and removeElement.
• Enable or disable a job: resumeJob and pauseJob.
• Execute a job only once, without taking into account the start and end times of its configuration: startJob.
• Execute a job with state. If the execution of a job gives any errors, executing it again with state will only execute those queries that had errors (see section 5.1, Start with state): retryOrContinueJob.
• Stop a job from running: stopJob.
• Obtain the types of jobs, filters, and data sources allowed by the system (ITP, VDP, etc.): getElementTypes and getElementSubTypes.
• Pause the job scheduler. When the scheduler is paused, the jobs will wait until it is resumed before being executed again. This order will only run successfully if no job is being executed: pauseScheduler.
• Resume executing the scheduler when it is paused: resumeScheduler.

All configuration elements (jobs, data sources, and filter sequences) are based on the object com.denodo.configuration.Configuration, which represents a collection of parameters (simple or compound) that constitute the configuration of the element. The list of parameters and their specification (data type, mandatoriness, etc.) are given by the object com.denodo.configuration.metadata.MetaConfiguration.

The filter sequences comprise an identifier of the sequence and a list of the filters that make it up, each one also represented by a Configuration object.

The jobs have an identifier, a description, a type (ARN, ARN-Index, ITP, JDBC, or VDP), some associated handlers, and a time-based scheduler. VDP- and JDBC-type jobs have an extraction section which specifies where the data is going to be extracted from (they differ as to the types of data sources that can be used). ITP-type jobs also have an extraction section, where the wrapper to be executed is indicated (with its parameters, if applicable) for obtaining data. In the extraction section of ARN jobs, the type of crawler (WebBot, IECrawler, MailCrawler, SalesforceCrawler, ExchangeCrawler, or CustomCrawler) to be used to collect data from a source is indicated.

All these jobs have an export section, where a filter sequence (or none, since it is optional) is associated with a list of exporters (ARN-Index, JDBC, SQL, CSV); several "filter sequence/exporter" associations can be defined, each one constituting a so-called "processor". The jobs also have a handlers section, where a list of handlers is specified (mail, move-to-repository, retry) that will be executed after the export stage of the job. For their part, ARN-Index-type jobs define, in their extraction section, a list of maintenance actions to be executed on the Aracne indexes; these jobs do not have an export section.


An example is shown below for creating a VDP-type job, assuming that the data source has been previously created and has its identifier stored in the vdpDatasourceIdentifier variable. This example shows how the Configuration object for a job is created, using ParameterStructure objects to specify the job parameters.

Configuration jobConfig = new Configuration();
jobConfig.setDraft(false);
jobConfig.setType("job");
jobConfig.setSubType("vdp");

ParameterStructure parameters = new ParameterStructure();
parameters.put(Helper.buildSimpleMonoValuedParam("name", "jobName"));
parameters.put(Helper.buildSimpleMonoValuedParam("description", "jobDescription"));
parameters.put(Helper.buildSimpleMonoValuedParam("dataSource", vdpDatasourceIdentifier));
parameters.put(Helper.buildSimpleMonoValuedParam("parameterizedQuery", "select * from emp"));

Parameter trigger = new CompoundParameter();
trigger.setName("trigger");
Collection<ParameterStructure> params = new ArrayList<ParameterStructure>();
ParameterStructure param = new ParameterStructure();
param.put(Helper.buildSimpleMonoValuedParam("type", "cron"));
param.put(Helper.buildSimpleMonoValuedParam("cronExpression", "0 10,44 14 ? 3 WED"));
params.add(param);
trigger.setValues(params);
parameters.put(trigger);

jobConfig.setParameters(parameters);
schedulerManager.addElement(projectId, jobConfig);

The name of the parameters to use for configuring each type of element can be obtained from the XML files in DENODO_HOME/metadata/scheduler/elements. They are organized in subfolders by element type (datasource, job, exporter …). For more information please refer to the Denodo Scheduler Javadoc documentation and the examples in DENODO_HOME/samples/scheduler/scheduler-api.
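The same API can be used to create other configuration elements, such as data sources. The following sketch is purely illustrative and is not taken from the product documentation: the subtype identifier ("jdbc") and the parameter names ("driverClassName", "databaseURI", "userName", "password") are assumptions that must be checked against the element metadata files mentioned above.

// Illustrative sketch only: the subtype and parameter names below are assumptions;
// the real names are defined by the XML files in DENODO_HOME/metadata/scheduler/elements.
Configuration dsConfig = new Configuration();
dsConfig.setDraft(false);
dsConfig.setType("datasource");
dsConfig.setSubType("jdbc");

ParameterStructure dsParameters = new ParameterStructure();
dsParameters.put(Helper.buildSimpleMonoValuedParam("name", "myJdbcSource"));
dsParameters.put(Helper.buildSimpleMonoValuedParam("driverClassName", "org.postgresql.Driver"));
dsParameters.put(Helper.buildSimpleMonoValuedParam("databaseURI", "jdbc:postgresql://localhost:5432/test"));
dsParameters.put(Helper.buildSimpleMonoValuedParam("userName", "scott"));
dsParameters.put(Helper.buildSimpleMonoValuedParam("password", "tiger"));
dsConfig.setParameters(dsParameters);

schedulerManager.addElement(projectId, dsConfig);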

6.1.2 Compatible API for 4.0 and 4.1 versions

In order to make the 4.0 and 4.1 versions compatible with this release, some Aracne facades have been adapted. They are the following:

com.denodo.aracne.client.document.DocumentManager:

• Find all documents from an index: findAll(String handler, int startIndex, int count).

• Find all documents from an index with main terms: findAll(String handler, MainTermsFilterConfiguration mainTermsConfiguration, int startIndex, int count).

• Find documents by keywords: findByKeyword(String handler, String query, int startIndex, int count).

• Find documents by keywords and with main terms: findByKeyword(String handler, String query, MainTermsFilterConfiguration mainTermsConfiguration, int startIndex, int count).


• Find documents by keywords and without highlighting: findByKeywordWithoutHighlight(String handler, String query, int startIndex, int count).

• Find documents by keywords, without highlight and with main terms: findByKeywordWithoutHighlight(String handler, String query, MainTermsFilterConfiguration mainTermsConfiguration, int startIndex, int count).

• Get all fields from an index: getFields(String handler).
• List searchable indexes: listSearchableHandlers().

com.denodo.aracne.client.handler.HandlerConfiguration:

• Create a new handler (index): createHandler.
• Find all the extensions of the schemas: findAllSchemaExtensions.

com.denodo.aracne.client.task.TaskManager:

• Create a filter sequence: createFilterSequence.
• Create a task with a name and a description: createTask.
• Get all the types of filters: findAllFilterTypes.
• Get all the names of the tasks: findAllTaskNames.
• Get the configuration of a concrete task: findTask.
• Get the scheduling configuration of all the tasks: getTasksInformation.
• Remove a task: removeTask.
• Clone a task: replicateTask.
• Update the configuration of the different aspects of a task: updateCrawlingConfiguration, updateMaintenanceActions, updatePostProcessingActions and updateSchedulerConfiguration.

This API will be available for use after running the migration tool (see section 7.5 for more information about how to migrate from the 4.0 or 4.1 versions to this one), so it is highly recommended to run it before trying to use this compatible API. Once migrated from a 4.0 or 4.1 version, it is possible to use this compatible API. In order to use it, the library arn-compat-api.jar must be placed in $DENODO_HOME/lib/contrib (this is done by the migration tool), and the files filter-metaconfig-compat.xml and resulthandler-metaconfig-compat.xml must be in the application classpath. These files are generated by the migration tool at $DENODO_HOME/tools/scheduler/migrate/conf and are a copy of the 4.0/4.1-version ones.

6.2 EXTENSIONS (PLUGINS)

Denodo Scheduler allows users to create their own Aracne filters, exporters, handlers or crawlers. The following sections describe how to implement a new Aracne filter (6.2.1), exporter (6.2.2), handler (6.2.3) or crawler (6.2.4).

Once the extension is implemented, it needs to be packed in a JAR file along with an XML file that specifies its configuration metadata: the extension type, its name, and its input parameters. This metadata is used by the administration tool to configure the extension correctly (the metadata specifies the value of the parameters the extension receives in its init method). The name and location of the metadata file must be the same as those of the class that implements the extension (i.e. in the same package as the implementation class). In this file, the element tag allows the type (type) and subtype (subType) of the extension to be specified. Each configuration parameter is specified using the param element, whose attributes are the parameter name (name), its mandatoriness (mandatory), whether it can have multiple values (multivalued), and its Java type (javaType). If it is a compound parameter, the components of the parameter are listed within the components element.

The following metaconfiguration XML file for the POP3/IMAP mail server crawler included in Scheduler is shown as an example. As can be seen, it is a crawler element called “Mail”, which has a machine name and a list of user accounts with logins and passwords as input parameters, from which to retrieve e-mails.

<metaconfig>
  <element type="crawler" subType="Mail" />
  <param javaType="java.lang.String" mandatory="true" multivalued="false" name="host"/>
  <param mandatory="true" multivalued="true" name="accounts">
    <components>
      <param javaType="java.lang.String" mandatory="true" multivalued="false" name="user"/>
      <param javaType="java.lang.String" mandatory="true" multivalued="false" name="password"/>
    </components>
  </param>
</metaconfig>

Finally, in the META-INF/MANIFEST.MF file of the JAR the following metadata needs to be specified in a section Name: CustomElement:

• PluginType: filter | exporter | handler | crawler
• PluginName: Name of the extension (it will be used as the name for the extension in the administration tool).
• PluginClass: Name of the extension implementation class.
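For example, a manifest section for a hypothetical custom handler could look as follows (the extension and class names below are purely illustrative):

Name: CustomElement
PluginType: handler
PluginName: AuditHandler
PluginClass: com.acme.scheduler.handler.AuditHandler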

Section 4.2.5.1 describes how to install a new extension in Scheduler.

6.2.1 Filters

Denodo Scheduler allows new data-processing filters to be created to form part of the filter sequences that are applied to the job extraction results. To do this, it is necessary to create a class that implements com.denodo.scheduler.api.filter.Filter, in which the following method must be implemented:

boolean execute(Document document) throws FilterException

The execute method performs the data processing specific to the filter; it receives as a parameter a Document object, which is a map that gives access to the values of the document fields. The filter can add, delete, or modify fields as desired. In addition, the method returns a Boolean value that indicates whether or not the document passed the filter.

package com.denodo.commons;

public class Document {
    public Object get(Object key)
    public Map getFields()
    public Object put(Object key, Object value)
}

For more information please refer to the Denodo Scheduler Javadoc documentation or the example for creating a filter in DENODO_HOME/samples/scheduler/filter-api.
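As an illustration, a minimal filter might look like the sketch below. It is not taken from the product documentation: the package, class and field names are hypothetical, and it assumes that FilterException belongs to the com.denodo.scheduler.api.filter package and that Filter declares only the execute method shown above.

package com.acme.scheduler.filters;   // hypothetical package

import com.denodo.commons.Document;
import com.denodo.scheduler.api.filter.Filter;
import com.denodo.scheduler.api.filter.FilterException;   // assumed location of the exception

/**
 * Illustrative filter: discards documents whose "title" field is empty
 * and adds a derived field. The field names used here are hypothetical.
 */
public class NonEmptyTitleFilter implements Filter {

    public boolean execute(Document document) throws FilterException {
        Object title = document.get("title");
        if (title == null || title.toString().trim().length() == 0) {
            return false;   // the document does not pass the filter
        }
        // Add a derived field with the length of the title
        document.put("titleLength", Integer.valueOf(title.toString().length()));
        return true;        // the document passes the filter
    }
}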

6.2.2 Exporters

Denodo Scheduler allows new custom exporters to be created. To create a new exporter the interface com.denodo.scheduler.api.exporter.Exporter needs to be implemented. This interface has the following methods:

• init. Initializes the exporter.
• export. Method invoked by Scheduler to perform the document export.
• getName. Method invoked by Scheduler to get the name of the exporter.

Another interface called com.denodo.scheduler.api.exporter.SchedulerExporter can be implemented in order to provide information about the job execution time (the first time the job was executed, before retrying it, if it was necessary), the job retry number and the retry execution time. This interface has the following methods:

• open. Initializes resources needed by the exporter and gets runtime information about the executed job.
• close. Closes any resources opened by the exporter and returns a collection of resources if necessary.

The CSV and SQL exporters implement this interface in order to open the exported files just once, write everything, and then close them. New user-defined exporters may also implement this interface. For more information please refer to the Denodo Scheduler Javadoc documentation or the example for creating an exporter in DENODO_HOME/samples/scheduler/exporter-api.

6.2.3 Handlers

Denodo Scheduler allows new custom handlers to be created. To create a new handler the interface com.denodo.scheduler.api.handler.Handler needs to be implemented. This interface has the following methods:

• init. Initializes the handler.
• execute. Method invoked by Scheduler once the extraction and exporting of all the tuples of a job have finished.

For more information please refer to the Denodo Scheduler Javadoc documentation or the example for creating a handler in DENODO_HOME/samples/scheduler/handler-api.

6.2.4 Aracne Custom Crawlers

To create a new custom crawler the interface com.denodo.crawler.Crawler needs to be implemented. This interface has the following methods:

• execute. Method invoked by ARN to execute the crawler.
• stop. Method invoked by Scheduler to stop the execution of the crawler.


The execution of the crawler must provide the results to Aracne in the form of com.denodo.crawler.data.CrawlDocument objects, using the add methods from com.denodo.crawler.data.DataManager.

package com.denodo.crawler.data;

public interface DataManager {
    public void add(Collection documents)
    public void addEvent(CrawlEvent event)
    public void addEvents(Collection events)
    public void close()
    public void setMappingWriter(MappingRepository writer)
    public void setRepositoryWriter(FileRepository writer)
}

If, during the execution of the custom crawler, any event or error occurs that Aracne needs to be informed about, the addEvent or addEvents method from com.denodo.crawler.data.DataManager must be invoked.

The Aracne API for the creation of custom crawlers also allows a repository to be built that stores copies of the data obtained by the crawler. To do this, if the “binarydata” field of CrawlDocument is not empty, the contents of the document are stored in the repository. The path for this repository will be the one indicated by the “path” field, if applicable; otherwise, the one indicated by the encoded “url” field.

For more information please refer to the Denodo Aracne Javadoc documentation and the example of SalesforceCrawler in DENODO_HOME/samples/arn/crawler-api.


7 APPENDIX

7.1 DATEFORMAT FUNCTION SYNTAX

The syntax of the ^DateFormat function is as follows: ^DateFormat("expression", "pattern"), where

• expression defines the expression which should be evaluated to obtain the specific date. The possible values of this expression are:

o today: represents today’s date on the system.
o yesterday: represents yesterday’s date.
o today - n: represents the date corresponding to n days before the current date. Therefore, yesterday is equivalent to today - 1.

• pattern determines the format of the text string this function returns. This format is constructed by combining the following letters:

o y: represents a year digit.
o M: represents a month digit.
o d: represents a digit of the day of the month.

For example, to obtain news from news.acme.com on today’s date, knowing that on this Web site the news has links of the form (supposing that today’s date is 2004/12/29):

http://news.acme.com/news/2004/12/29/weblog/1104290407.html

in which the news date and, thus, the news number vary, the following has to be entered in the regular expression field of the link filter:

http://news.acme.com/news/^DateFormat("today","yyyy/MM/dd")/weblog/(.)+html
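As a further illustration (the dates and pattern below are chosen arbitrarily), an expression for the date one week before the current date could be written as:

^DateFormat("today - 7", "yyyy-MM-dd")

which, if today were 2004/12/29 as in the example above, would return 2004-12-22.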

7.2 REGULAR EXPRESSIONS FOR FILTERS

A regular expression is a text pattern formed of ordinary characters (for example, letters from a to z) and special characters known as metacharacters. The pattern describes one or several strings that match when searching a body of text. The regular expression is used as a template to match a character pattern against the string being searched. The table below includes a complete list of metacharacters and their behavior in the context of regular expressions:

Expression   Matches
x            The character x
\\           The character \
[abc]        a, b, or c
[^abc]       Any character except a, b, or c (negation)
[A-Z]        From A to Z, inclusive
.            Any character
^            Line start
$            Line end
X?           X, once or never
X*           X, zero or more times
X+           X, once or more times
XY           X followed by Y
X|Y          X or Y
(X)          X, as a group

Table 1 Metacharacters

Groups are enumerated by counting the opening brackets from left to right. The zero group refers to the complete expression. The ‘\’ character can be used to escape the metacharacters used in the expressions. For instance, the \\ expression represents a single \, and \( represents a parenthesis.

7.3 JDBC DRIVERS

The following table shows the JDBC adapters included with Denodo Scheduler. For each driver, the databases for which it has been tested, the name of the class that must be specified when creating a JDBC data source that uses the adapter, and the URI format used are shown.

Database                                 Class                                URI
Derby 10                                 org.apache.derby.jdbc.ClientDriver   jdbc:derby://<hostName>:<port>/<databaseName>
Excel                                    sun.jdbc.odbc.JdbcOdbcDriver         jdbc:odbc:<databaseName>
Oracle 8i / 9i / 10g                     oracle.jdbc.OracleDriver             jdbc:oracle:<protocol>:@<hostName>:<port>:<databaseName>
                                                                              (protocols: thin (recommended), oci, oci8, kprb)
PostgreSQL 7.2.3 / 7.4.6 / 8             org.postgresql.Driver                jdbc:postgresql://<hostName>:<port>/<databaseName>
SQL Server 8.00.194                      net.sourceforge.jtds.jdbc.Driver     jdbc:jtds:sqlserver://<hostName>:<port>/<databaseName>
Sybase Adaptive Server Enterprise 12.5B  net.sourceforge.jtds.jdbc.Driver     jdbc:jtds:sybase://<hostName>:<port>/<databaseName>
JDBC-ODBC Bridge                         sun.jdbc.odbc.JdbcOdbcDriver         jdbc:odbc:<databaseName>

Table 2 JDBC Drivers

Adapters for IBM DB2 and MySQL, as well as those created by their manufacturers for Microsoft SQL Server and Sybase, are not included in the distribution of Scheduler, but can be downloaded from the Web sites of these companies. They have also been tested successfully, and their data are shown in the following table.


Database                                 Class                                          URI
DB2 8.2                                  com.ibm.db2.jcc.DB2Driver                      jdbc:db2://<hostName>:<port>/<databaseName>
MySQL 4.0.15 / 4.1.1 / 5.x               com.mysql.jdbc.Driver                          jdbc:mysql://<hostName>:<port>/<databaseName>
SQL Server 8.00.194                      com.microsoft.jdbc.sqlserver.SQLServerDriver   jdbc:microsoft:sqlserver://<hostName>;DatabaseName=<databaseName>
Sybase Adaptive Server Enterprise 12.5B  com.sybase.jdbc3.jdbc.SybDriver                jdbc:sybase:Tds:<hostName>:<port>/<databaseName>

Table 3 IBM, MySQL, Microsoft, and Sybase Drivers

The JDBC drivers in the above lists have been successfully tested, although any other JDBC driver should also work with Denodo Scheduler.

7.4 USE OF THE IMPORT/EXPORT SCRIPTS FOR BACKUP

The import and export scripts are available in the tools/scheduler directory of the platform. They are provided in two versions: import.sh and export.sh (for Linux systems) and import.bat and export.bat (for Windows systems).

The export script allows all the metadata and configuration of a Scheduler server to be exported to a zip file. The exported data is the same as that obtained with the equivalent option of the administration tool (see section 4.2). The format in which the script is invoked is as follows:

export -h host -p port -l login -P password [-L <project1, project2…>] [-config] [-drivers] [-plugins] -f outputFilename

where:

-h host indicates the name or IP address of the machine where the server is launched.
-p port indicates the port number at which the server is launched.
-l login indicates the login name used to connect to the server.
-P password indicates the password used to connect to the server.
-L p1, p2… is an optional argument. Using it causes the named projects to be exported; otherwise all projects are exported.
-config is an optional argument. Using it causes the server configuration to be exported.
-drivers is an optional argument. Using it causes the JDBC adapters to be exported. When the export all projects option is selected, all JDBC adapters are exported; otherwise only the adapters used by the selected projects are exported.
-plugins is an optional argument. Using it causes the plugins to be exported. When the export all projects option is selected, all plugins are exported; otherwise only the plugins used by the selected projects are exported.
-f outputFilename indicates the name of the zip file that will contain the exported metadata.

The line below is an example of running the export command:

export -h localhost -p 8000 -l admin -P admin -L default -f backup.zip

This command exports the metadata of the default project of the Scheduler server running in the local machine on port 8000. Access to the server is made using the login admin with the password admin. The result of the export is saved to a file called backup.zip.

The import script allows the metadata and configuration contained in a zip file obtained with the export utility to be imported.


The format used to invoke the script is as follows:

import -h host -p port -l login -P password -f inputFilename [-replace]

where:

-h host indicates the name or IP address of the machine where the server is launched.
-p port indicates the port number at which the server is launched.
-l login indicates the login name used to connect to the server.
-P password indicates the password used to connect to the server.
-f inputFilename is the file containing the metadata to be imported.
-replace is an optional argument that specifies whether existing elements with the same name should be overwritten by the ones included in the imported file.

For example:

import -h localhost -p 8000 -l admin -P admin -f backup.zip -replace

This command imports the metadata contained in backup.zip to the server running in the local machine on port 8000. Access to the server uses the login admin with the password admin. Information and warning messages returned by the server as a result of the import are written to the console.

7.5 USE OF THE MIGRATION TOOL

The migration tool is used to migrate from a 4.0 or 4.1 Denodo Aracne installation. The migration script is available in the tools/scheduler/migrate/bin directory of the platform (after unzipping the migrate.zip file). It is provided in two versions: migrate.sh (for Linux systems) and migrate.bat (for Windows systems).

This script allows all the metadata and configuration of a 4.0 or 4.1 Aracne installation to be migrated to a new Scheduler+Aracne one. It also provides an API so that the new installation is backwards-compatible (see section 6.1.2). However, some configuration files (such as webbot’s, iecrawler’s and com.denodo.stream.general.xm) must be manually migrated. Also, the user has to manually copy the extension classes (JARs and CUSTOM classes) from the old installation to the new one.

The format in which the script is invoked is as follows:

migrate -homeOld old_denodo_home [-homeNew new_denodo_home] [-ah aracne_host] [-ap aracne_port] [-ih aracne_index_host] [-ip aracne_index_port] [-Ah arn_index_host] [-Ap arn_index_port] [-ch arn_crawler_host] [-cp arn_crawler_port] [-cl arn_crawler_login] [-cpwd arn_crawler_password] [-sh scheduler_host] [-sp scheduler_port] [-sl scheduler_login] [-spwd scheduler_password]

where:

-homeOld indicates the path to the old installation of the Denodo Platform (the one to be migrated from).


-homeNew indicates the path to the new installation of the Denodo Platform (the one to be migrated to). This parameter is optional and by default it resolves to the DENODO_HOME directory this script is launched from.
-ah indicates the name or IP address of the machine where the old Aracne server is launched.
-ap indicates the port number at which the old Aracne server is launched.
-ih indicates the name or IP address of the machine where the old Aracne indexer server is launched.
-ip indicates the port number at which the old Aracne indexer server is launched.
-Ah indicates the name or IP address of the machine where the new ARN indexer server is launched.
-Ap indicates the port number at which the new ARN indexer server is launched.
-ch indicates the name or IP address of the machine where the new ARN crawler server is launched.
-cp indicates the port number at which the new ARN crawler server is launched.
-cl indicates the login name for the new ARN crawler server.
-cpwd indicates the password for the new ARN crawler server.
-sh indicates the name or IP address of the machine where the Scheduler server is launched.
-sp indicates the port number at which the Scheduler server is launched.
-sl indicates the login name for the Scheduler server.
-spwd indicates the password for the Scheduler server.

Hosts, ports, logins and passwords are all optional arguments, so, if not specified, they are evaluated to the default values (if you changed them while installing the products, you must specify them to the script). The line below is an example of running the migrate command with default hosts, ports, logins and passwords:

migrate.bat -homeOld "C:\Program Files\Denodo Platform 4.1" -homeNew "C:\Program Files\Denodo Platform 4.5"

This command migrates a 4.1 installation from C:\Program Files\Denodo Platform 4.1 to a 4.5 installation placed at C:\Program Files\Denodo Platform 4.5. The migration tool creates a project in Scheduler called "Aracne 4.1", where all the migrated elements (data sources, filter sequences and jobs) are included.

Regarding the web administration tools, the following aspects should be considered:

- In the Scheduler Administration tool, a new filter called "compat-adapter" will appear. This new filter is used to configure user-defined filters. The implementation class and its parameters must be filled in to configure those adapted filters.

- In the ARN Administration tool, a new schema called "schema-adapter" will be created. If your installation of Aracne 4.0 or 4.1 used “Index Schema Extensions”, the implementation classes of these extensions will appear assigned to this schema.

Steps for migrating:

1. Start up the following servers on the same machine:
   • Aracne.
   • Aracne-indexer.
   • ARN-Crawler.
   • ARN-Indexer.
   • Scheduler.
2. Manually migrate the previously mentioned configuration files (if modified from the defaults) and copy the extensions.
3. Execute the migrate script.

