
October 4th, 2016

Automatic data publication in CKAN using Kettle

(a success case in Generalitat Valenciana)

index

• Intro – a short story
• Pentaho Data Integration (Kettle)
  • What is Kettle?
  • Some features
  • Some screenshots
• Description of the solution
  • Architecture
  • Execution phases
  • Use of API
• Creating a new dataset: step by step
• Conclusions

Intro – a short story

Problems:

• Lack of culture around data reusability
• Lack of resources to provide reusable data

index

• Intro – a short story
• Pentaho Data Integration (Kettle)
  • What is Kettle?
  • Some features
  • Some screenshots
• Description of the solution
  • Architecture
  • Execution phases
  • Use of API
• Creating a new dataset: step by step
• Conclusions

What is Kettle?

Pentaho Business Analytics suite: Reporting, Analysis Services, Dashboard, Data Integration (PDI), Data Mining, BI Server.

PDI tools: Spoon (graphical designer), Pan (command-line transformation runner), Kitchen (command-line job runner), Carte (remote execution server).

A Community Edition is available under a free license.

PDI (also called Kettle) is the Pentaho component responsible for Extraction, Transformation and Loading (ETL) processes. ETL tools are most frequently used in data warehouse environments; however, Pentaho Data Integration (PDI) can also be used for other purposes:

• Migrating data between applications or databases

• Exporting data from databases to flat files

• Loading data massively into databases

• Data cleaning

• Integrating applications

Kettle - features

• Data flow control (bandwidth consumption)

• Parallel or sequential process execution

• Process scheduling

• Developing custom Java classes / Java libraries

• Scripting (SQL, JavaScript, Shell)

Supported datasources

• Databases: Oracle, MySQL, PostgreSQL, SQLite, Sybase, IBM DB2, Hypersonic, Informix, MS SQL Server, SAP, Vertica, Palo, Hadoop…
• Connectivity: JDBC, ODBC, OCI, JNDI
• Files: text files (XML, JSON, CSV, RSS), Excel, MS Access, dBase
• Other sources: network folders, FTP servers, REST and SOAP services

Kettle – screenshots

• Designing a transformation: just a bunch of commands… drag & drop… and configure!!
• Community and documentation links
• Execution log view: what happened?
• Verifying a transformation: warnings and errors details
• Debugging a transformation: breakpoint configuration

index

• Intro – a short story
• Pentaho Data Integration (Kettle)
  • What is Kettle?
  • Some features
  • Some screenshots
• Description of the solution
  • Architecture
  • Execution phases
  • Use of API
• Creating a new dataset: step by step
• Conclusions

architecture

[Diagram: datasources (databases and files) → input files → Business Intelligence system → output files → CKAN API]

The orchestrator process runs four phases:

1. File recovering
2. Import to BI
3. Resources generation
4. Resources uploading

phases

• The main process runs phases in sequential order. A phase never starts until the previous one has finished.

• Within each phase, the process is run for each dataset.

1. File recovering
2. Import to BI
3. Resources generation
4. Resources uploading

[Slide animation: datasets D1, D2 and D3 advance together through the four phases.]

phases

How does Kettle know which datasets are in which phase?

A database is used to store information about the whole process and the state of execution of each dataset.

What if something goes wrong?

If something goes wrong with a dataset, it remains “stopped” in that phase until the next iteration of the main process (see the sketch after the animation summary below).

[Slide animation, 1st and 2nd iterations: a dataset that fails stays “stopped” in its phase while the other datasets advance; on the next iteration of the main process the stopped dataset is picked up again.]
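To make the state handling concrete, here is a minimal sketch of such a state-driven orchestrator, assuming a hypothetical dataset_state table with one row per dataset; all names and the schema are illustrative, not the actual Generalitat Valenciana implementation:

```python
import sqlite3

# The four phases, in the order the orchestrator runs them.
PHASES = ["file_recovering", "import_to_bi",
          "resources_generation", "resources_uploading"]

def run_phase(conn, phase, process):
    """Run one phase for every dataset currently waiting in it."""
    rows = conn.execute(
        "SELECT dataset_id FROM dataset_state WHERE phase = ?", (phase,)
    ).fetchall()
    for (dataset_id,) in rows:
        try:
            process(dataset_id)  # e.g. launch the Kettle job for this phase
        except Exception:
            # Failure: leave the row untouched, so the dataset stays
            # "stopped" in this phase until the next iteration.
            continue
        # Success: advance the dataset to the next phase (or mark it done).
        idx = PHASES.index(phase)
        nxt = PHASES[idx + 1] if idx + 1 < len(PHASES) else "done"
        conn.execute("UPDATE dataset_state SET phase = ? WHERE dataset_id = ?",
                     (nxt, dataset_id))
    conn.commit()

def main_iteration(conn, processes):
    """One iteration of the main process: phases run strictly in sequence;
    a phase never starts until the previous one has finished."""
    for phase in PHASES:
        run_phase(conn, phase, processes[phase])

# Usage: main_iteration(sqlite3.connect("orchestrator.db"), processes)
```

Because a failed dataset’s row is left untouched, it is naturally retried on the next iteration, which matches the behaviour described above.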

4th phase – Use of API

Example of API functions being used in the 4th phase:

ckan.logic.action.create
  package_create, resource_create, tag_create

ckan.logic.action.patch
  package_patch, resource_patch

ckan.logic.action.update
  term_translation_update, term_translation_update_many

ckan.logic.action.delete
  package_delete, dataset_purge, resource_delete
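As a hedged illustration (not the presenters’ actual code), these functions are reachable over HTTP through CKAN’s Action API; the portal URL, API key and dataset fields below are placeholders:

```python
import requests

CKAN_URL = "https://my-ckan-portal.example.org"  # placeholder portal URL
API_KEY = "REPLACE-WITH-API-KEY"                 # placeholder API key

def action(name, payload):
    """Call a CKAN Action API function by name, e.g. package_create."""
    r = requests.post(f"{CKAN_URL}/api/3/action/{name}",
                      json=payload,
                      headers={"Authorization": API_KEY})
    r.raise_for_status()
    return r.json()["result"]

# Create a dataset, then register a generated CSV file as a resource.
dataset = action("package_create", {
    "name": "example-dataset",   # illustrative dataset name
    "title": "Example dataset",
})
action("resource_create", {
    "package_id": dataset["id"],
    "name": "example.csv",
    "url": "https://example.org/files/example.csv",  # illustrative file URL
    "format": "CSV",
})
```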

index

• Intro – a short story
• Pentaho Data Integration (Kettle)
  • What is Kettle?
  • Some features
  • Some screenshots
• Description of the solution
  • Architecture
  • Execution phases
  • Use of API
• Creating a new dataset: step by step
• Conclusions

Creating a dataset: step-by-step

1st Step: Create a new .properties file for your new dataset

• This file contains several properties related to the new dataset, for instance: metadata in two languages, the type of datasource (file or database), and how often the dataset is updated (daily, weekly, monthly, yearly).

• This file is read by the orchestrator process and is used to set the metadata when creating the dataset.
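A sketch of what such a file might look like; the keys and values are hypothetical, since the talk does not show the actual property names:

```properties
# Metadata in two languages (keys are hypothetical)
title.va=Cites registrades en atenció primària
title.es=Citas registradas en atención primaria

# Type of datasource: file or database
datasource.type=database

# Update frequency: daily, weekly, monthly or yearly
update.frequency=monthly
```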

Creating a dataset: step-by-step

2nd Step: Register the dataset in the database

• There is a script for launching these queries.
• This step is required so that the orchestrator process picks up the new dataset.
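For illustration only, a registration script might look like this; the table and column names are assumptions consistent with the orchestrator sketch above, not the real schema:

```python
import sqlite3  # the real system likely uses another RDBMS; sqlite3 keeps the sketch self-contained

conn = sqlite3.connect("orchestrator.db")  # placeholder database
conn.execute(
    "INSERT INTO dataset_state (dataset_id, name, phase) VALUES (?, ?, ?)",
    ("od_san_citas", "Citas registradas en atención primaria", "file_recovering"),
)
conn.commit()
conn.close()
```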

Creating a dataset: step-by-step

3rd Step: create the folder structure for the new dataset

• config: contains the properties file
• input_files: contains source files
• input_error_files: contains processed files if any error occurred
• output_files: contains files pending upload to CKAN
• output_processed_files: contains a copy of the files uploaded to CKAN

Files are moved from one folder to the next as each phase finishes (see the sketch below).
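A minimal sketch of creating that layout, assuming a per-dataset base directory (paths are illustrative):

```python
from pathlib import Path

FOLDERS = ["config", "input_files", "input_error_files",
           "output_files", "output_processed_files"]

def create_dataset_folders(base_dir: str, dataset_id: str) -> None:
    """Create the five per-dataset folders used by the pipeline."""
    for name in FOLDERS:
        Path(base_dir, dataset_id, name).mkdir(parents=True, exist_ok=True)

create_dataset_folders("/data/opendata", "od_san_citas")  # illustrative paths
```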

Creating a dataset: step-by-step

4th Step: load process development

• This process reads data from the origin and loads it into the database (2nd phase)
• Depending on the data, this process can be copied (as a template) among several datasets
• In any event, many steps of the process are reused (error handling, file/database access…)

Creating a dataset: step-by-step

5th Step: “select” query to generate resources.

• Each dataset has a different “SELECT”.
• This select is called by a common transformation process which generates the CSV, XML and JSON files.

SELECT UPPER(OWCKAN.CSV_UTIL_PKG.ARRAY_TO_CSV(
         OWCKAN.T_STR_ARRAY(CR_ANYO, CR_MES, CR_DESC_MES, CR_DEPTO_ATENCION,
                            CR_SEXO, CR_COD_SEXO, CR_EDAD, CR_CITAS_REGISTRADAS),
         ';')) AS CSV,
       '1'
FROM OWCKAN.OD_SAN_IND_AT_PRIMARIA_CITAS
LEFT JOIN OWCKAN.OD_SAN_AP_MD_EDAD ON CR_EDAD = AP_COD_EDAD
LEFT JOIN OWCKAN.OD_SAN_MD_DEPTO ON CR_DEPTO_ATENCION = AH_DC_COD_DEPTO
WHERE CR_ANYO = 'ANYO_CHANGE'
  AND CR_MES = 'MES_CHANGE'
ORDER BY CR_DEPTO_ATENCION, CR_SEXO
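The ANYO_CHANGE and MES_CHANGE markers appear to be placeholders for the year (año) and month that the process substitutes before each run; a minimal sketch of that substitution, with assumed names:

```python
def prepare_query(template: str, year: int, month: int) -> str:
    """Replace the year/month placeholders in the SELECT template."""
    return (template
            .replace("ANYO_CHANGE", str(year))
            .replace("MES_CHANGE", str(month)))

# Usage: load the per-dataset SELECT and fill in the period to publish.
sql = prepare_query(open("select_citas.sql").read(), 2016, 9)  # file name illustrative
```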

Creating a dataset: step-by-step

• Most of the steps can be accomplished almost immediately.

• Datasets become fully automated in less than one hour when using Kettle templates.

• .properties file: 5’
• Register dataset: 10’
• Create folder structure: 5’
• Process development: 2h – 16h
• “select” query: 10’

index

• Intro – a short story
• Pentaho Data Integration (Kettle)
  • What is Kettle?
  • Some features
  • Some screenshots
• Description of the solution
  • Architecture
  • Execution phases
  • Use of API
• Creating a new dataset: step by step
• Conclusions

Conclusions

• There is still a lack of culture in some organizations about how to make reusability easier.

• We have the data and we have the platform for publishing; we have to simplify the transition from the source to the portal.

• The easier we make this transition, the more contributors will be willing to participate.

• Kettle is a good alternative: it is powerful, free, open source, and already known by many IT departments.

• This platform is just one approach; you can design your own solution.

• Whenever you can, help people to produce the best reusable formats.
