October 4th, 2016
Automatic data publication in CKAN using Kettle
(a success case in Generalitat Valenciana)
Index
• Intro – a short story
• Pentaho Data Integration (Kettle)
  • What is Kettle?
  • Some features
  • Some screenshots
• Description of the solution
  • Architecture
  • Execution phases
  • Use of API
• Creating a new dataset: step by step
• Conclusions
Intro – a short story
Problems:
• Lack of culture around data reusability
• Lack of resources to provide reusable data
What is Kettle?
• A Community Edition is available with a free license.
• Pentaho Business Analytics suite components: Reporting, Analysis Services, Dashboards, Data Integration (PDI), Data Mining, BI Server.
• PDI tools: Spoon (graphical designer), Pan (runs transformations from the command line), Kitchen (runs jobs), Carte (lightweight server for remote execution).
PDI (also called Kettle) is the component of Pentaho responsible for the Extraction, Transformation and Loading (ETL) processes. ETL tools are most frequently used in data warehouse environments; however, Pentaho Data Integration (PDI) can also be used for other purposes:
• Migrating data between applications or databases
• Exporting data from databases to flat files
• Loading data massively into databases
• Data cleaning
• Integrating applications
Kettle - features
• Data flow control (bandwidth consumption)
• Parallel or sequential process execution
• Process scheduling
• Development of custom Java classes / use of Java libraries
• Scripting (SQL, JavaScript, Shell)
Supported datasources
• Databases: Oracle, MySQL, PostgreSQL, SQLite, Sybase, IBM DB2, Hypersonic, Informix, MS SQL Server, dBase, MS Access, SAP, Vertica, Palo, Hadoop…
• Connectivity: JDBC, ODBC, OCI, JNDI
• Files: text files (XML, JSON, CSV, RSS), Excel
• Locations and services: network folders, FTP servers, REST and SOAP services
Kettle - screenshots
• Designing a transformation: just a bunch of commands… drag & drop… and configure!
• Community and documentation links
• Execution log view: what happened?
• Verifying a transformation: warnings and error details
• Debugging a transformation: breakpoint configuration
Architecture
[Diagram: datasources (databases and files) provide input files; an orchestrator process drives the Business Intelligence system and publishes output files to CKAN through its API, in four numbered phases:]
1. File recovering
2. Import to BI
3. Resources generation
4. Resources uploading
Phases
• The main process runs phases in sequential order. A phase never starts until the previous one has finished.
• Within each phase, the process is run for each dataset.
1. File recovering
2. Import to BI
3. Resources generation
4. Resources uploading
[Animated diagram: datasets D1, D2 and D3 advance together through the four phases, one phase at a time.]
Phases
How does Kettle know which dataset is in which phase?
A database is used to store the information regarding the whole process and the state of execution of each dataset.
What if something goes wrong?
If something goes wrong with a dataset, it remains “stopped” in that phase until the next iteration of the main process.
[Animated diagram, 1st and 2nd iterations: D1, D2 and D3 advance through the four phases (1. File recovering, 2. Import to BI, 3. Resources generation, 4. Resources uploading). When a dataset fails in a phase it stays there while the others continue, and it is retried from that same phase in the next iteration.]
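The slides describe this loop only at the diagram level. Purely as an illustration, here is a minimal Python sketch of the control flow, assuming a hypothetical dataset_state table; the real orchestrator is a Kettle job, and run_phase() stands in for the per-phase transformations.

```python
# A minimal sketch of the orchestrator's control loop, in Python for brevity
# (the real solution is built as Kettle jobs). The dataset_state table and the
# run_phase() worker are hypothetical; the talk only says that a database
# stores the execution state of each dataset.
import sqlite3

PHASES = ["file_recovering", "import_to_bi",
          "resources_generation", "resources_uploading"]

def run_phase(dataset_id, phase):
    """Placeholder for the real per-phase Kettle transformation."""

def next_phase(phase):
    i = PHASES.index(phase)
    return PHASES[i + 1] if i + 1 < len(PHASES) else "done"

def run_iteration(conn):
    cur = conn.cursor()
    for phase in PHASES:  # phases run strictly in order
        cur.execute("SELECT id FROM dataset_state WHERE phase = ?", (phase,))
        for (dataset_id,) in cur.fetchall():  # within a phase, once per dataset
            try:
                run_phase(dataset_id, phase)
                cur.execute("UPDATE dataset_state SET phase = ? WHERE id = ?",
                            (next_phase(phase), dataset_id))
            except Exception:
                # A failed dataset keeps its current phase ("stopped") and is
                # simply retried on the next iteration of the main process.
                pass
        conn.commit()
```

Because each phase re-queries the state table, a dataset that succeeds moves on to the next phase within the same iteration, which matches the behaviour shown in the diagrams.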
4th phase – Use of API
Example of API functions being used:
ckan.logic.action.create
• package_create
• resource_create
• tag_create
ckan.logic.action.patch
• package_patch
• resource_patch
ckan.logic.action.update
• term_translation_update
• term_translation_update_many
ckan.logic.action.delete
• package_delete
• dataset_purge
• resource_delete
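As a concrete illustration of phase 4, the sketch below calls two of the actions above through CKAN's HTTP action API. The portal URL, API key, dataset name and file path are placeholders, and the real solution performs these calls from Kettle rather than Python.

```python
# A sketch of phase 4 calling CKAN's action API over HTTP from Python.
# CKAN_URL, API_KEY, the dataset name and the file path are placeholders.
import requests

CKAN_URL = "https://your-ckan-portal.example.org"
API_KEY = "your-api-key"

def action(name, payload, files=None):
    """Call a CKAN action API endpoint and return its 'result' field."""
    response = requests.post(
        f"{CKAN_URL}/api/3/action/{name}",
        headers={"Authorization": API_KEY},
        data=payload if files else None,   # multipart form when uploading
        json=None if files else payload,   # plain JSON body otherwise
        files=files,
    )
    response.raise_for_status()
    return response.json()["result"]

# Create the dataset (package) with its metadata...
dataset = action("package_create", {
    "name": "primary-care-appointments",   # illustrative dataset name
    "title": "Primary care appointments",
    "notes": "Published automatically by the orchestrator process.",
})

# ...then attach a generated file as a resource.
with open("output_files/appointments.csv", "rb") as f:
    action("resource_create",
           {"package_id": dataset["id"], "format": "CSV"},
           files={"upload": f})
```

On subsequent runs the patch actions listed above work the same way, changing only the fields supplied in the payload.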
Creating a dataset: step-by-step
1st Step: Create a new .properties file for your new dataset
• This file contains several properties related to the new dataset: for instance, metadata in two languages, the type of datasource (file or database), and how often the dataset should be updated (daily, weekly, monthly, yearly).
• This file is read by the orchestrator process and is used to set the metadata when creating the dataset.
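For illustration, such a file might look like the sketch below; the actual property names used by the orchestrator are not shown in the talk, so every key here is hypothetical.

```properties
# Illustrative only: the real property names are not shown in the talk.
title.es=Citas de atención primaria
title.en=Primary care appointments
description.es=Citas registradas por departamento, sexo y edad
description.en=Appointments registered by department, sex and age
datasource.type=database
update.frequency=monthly
```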
Creating a dataset: step-by-step
2nd Step: Register the dataset in the database
• There is a script for launching these queries
• This step is required so that the orchestrator process picks up the new dataset
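Continuing the hypothetical dataset_state table from the earlier orchestrator sketch, the registration might reduce to something like this (the real script and schema are not shown in the talk):

```python
# Hypothetical: register the new dataset in the state table used by the
# orchestrator. Table and column names follow the earlier sketch, not the
# real schema.
import sqlite3

conn = sqlite3.connect("orchestrator.db")   # placeholder database
conn.execute("INSERT INTO dataset_state (id, phase) VALUES (?, ?)",
             ("primary-care-appointments", "file_recovering"))
conn.commit()
```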
Creating a dataset: step-by-step
3rd Step: create the folder structure for the new dataset
• config: contains the properties file
• input_files: contains source files
• input_error_files: contains processed files if any error occurred
• output_files: contains files pending upload to CKAN
• output_processed_files: contains a copy of the files uploaded to CKAN
Files are moved from one folder to another as each phase finishes.
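A minimal sketch of this step in Python, using the folder names from the slide; the root path and dataset identifier are placeholders.

```python
# Create the per-dataset folder structure listed above.
from pathlib import Path

FOLDERS = ["config", "input_files", "input_error_files",
           "output_files", "output_processed_files"]

def create_dataset_folders(root, dataset):
    for folder in FOLDERS:
        Path(root, dataset, folder).mkdir(parents=True, exist_ok=True)

create_dataset_folders("/data/ckan-etl", "primary-care-appointments")
```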
Creating a dataset: step-by-step
4th Step: load process development
• This process reads data from the source and loads it into the database (2nd phase)
• Depending on the data, this process can be copied (as a template) among several datasets
• In any event, many steps of the process are reused (error handling, file/database access…)
Creating a dataset: step-by-step
5th Step: “select” query to generate resources
• Each dataset has a different SELECT.
• This SELECT is called by a common transformation process, which generates the CSV, XML and JSON files.
SELECT UPPER(OWCKAN.CSV_UTIL_PKG.ARRAY_TO_CSV(
         OWCKAN.T_STR_ARRAY(CR_ANYO, CR_MES, CR_DESC_MES, CR_DEPTO_ATENCION,
                            CR_SEXO, CR_COD_SEXO, CR_EDAD, CR_CITAS_REGISTRADAS),
         ';')) AS CSV,
       '1'
FROM OWCKAN.OD_SAN_IND_AT_PRIMARIA_CITAS
LEFT JOIN OWCKAN.OD_SAN_AP_MD_EDAD ON CR_EDAD = AP_COD_EDAD
LEFT JOIN OWCKAN.OD_SAN_MD_DEPTO ON CR_DEPTO_ATENCION = AH_DC_COD_DEPTO
WHERE CR_ANYO = 'ANYO_CHANGE'
  AND CR_MES = 'MES_CHANGE'
ORDER BY CR_DEPTO_ATENCION, CR_SEXO
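The common transformation itself is a Kettle process; purely as an illustration of what it does, here is a Python sketch that runs a dataset's SELECT through a DB-API cursor and writes the three output formats. Names and connection handling are placeholders.

```python
# Illustrative equivalent of the common transformation: run the dataset's
# SELECT and write CSV, XML and JSON resources. The real process is a
# Kettle transformation.
import csv
import json
import xml.etree.ElementTree as ET

def generate_resources(cursor, select_sql, basename):
    cursor.execute(select_sql)
    columns = [d[0] for d in cursor.description]
    rows = cursor.fetchall()

    # CSV, semicolon-separated as in the query above
    with open(f"{basename}.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter=";")
        writer.writerow(columns)
        writer.writerows(rows)

    # JSON: one object per row
    with open(f"{basename}.json", "w", encoding="utf-8") as f:
        json.dump([dict(zip(columns, r)) for r in rows], f, ensure_ascii=False)

    # XML: one <row> element per row
    root = ET.Element("rows")
    for r in rows:
        row_el = ET.SubElement(root, "row")
        for col, val in zip(columns, r):
            ET.SubElement(row_el, col.lower()).text = str(val)
    ET.ElementTree(root).write(f"{basename}.xml", encoding="utf-8")
```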
Creating a dataset: step-by-step
• Most of the steps can be accomplished almost immediately
• We get datasets fully automated in less than one hour when using Kettle templates.
Step                       Time
.properties file           5'
Register dataset           10'
Create folder structure    5'
Process development        2h – 16h
“select” query             10'
Conclusions
• Lack of culture in some organizations about how to make data reuse easier
• We have the data and we have the platform for publishing; we have to simplify the transition from the source to the portal.
• The easier we make this transition, the more contributors will be willing to participate.
• Kettle is a good alternative: it's powerful, free, open source, and already known by many IT departments.
• This platform is just one approach. You can design your own solution.
• Whenever you can, help people produce the best reusable formats.