October 4th, 2016 Automatic data publication in CKAN using Kettle (a success case in Generalitat Valenciana)


Page 1: › files.ckan.org › ckancon... · Automatic data publication in CKAN using Kettle (a success ...Automatic data publication in CKAN using Kettle (a success case in Generalitat Valenciana)

October 4th, 2016

Automatic data publication in CKAN using Kettle

(a success case in Generalitat Valenciana)


index

• Intro – a short story
• Pentaho Data Integration (Kettle)
  • What is Kettle?
  • Some features
  • Some screenshots
• Description of the solution
  • Architecture
  • Execution phases
  • Use of API
• Creating a new dataset: step by step
• Conclusions


Intro – a short story

Problems

• Lack of culture around data reusability

• Lack of resources to provide reusable data


What is Kettle?

Community Edition available under a free license.

Pentaho Business Analytics components: Reporting, Analysis Services, Dashboard, Data Integration (PDI), Data Mining, BI Server.

PDI tools: Spoon, Pan, Kitchen, Carte.

PDI (also called Kettle) is the component of Pentaho responsible for the Extraction, Transformation and Loading (ETL) processes. ETL tools are most frequently used in data warehouse environments; however, Pentaho Data Integration (PDI) can also be used for other purposes:

• Migrating data between applications or databases

• Exporting data from databases to flat files

• Bulk loading data into databases

• Data cleaning

• Integrating applications


Kettle - features

Some features

• Data flow control (bandwidth consumption)

• Parallel or sequential process execution

• Process scheduling

• Develop custom Java classes / Java libraries

• Scripting (SQL, JavaScript, Shell)

Supported datasources (Catalog Data Sources)

• Databases: Oracle, MySQL, PostgreSQL, SQLite, Sybase, IBM DB2, Hypersonic, Informix, MS SQL Server, dBase, MS Access

• SAP, Vertica, Palo, Hadoop…

• JDBC, ODBC, OCI, JNDI

• Text files (XML, JSON, CSV, RSS)

• Excel

• Network folders

• FTP servers

• REST and SOAP services

Kettle - screenshots

• Designing a transformation

• Just a bunch of commands… drag & drop… and configure!!

• Community and documentation links

• Execution log view: what happened?

• Verifying a transformation: warnings and errors details

• Debugging a transformation: breakpoint configuration


architecture

Diagram components: Datasources (Databases, Files), Input files, Business Intelligence System, Output files, API.

Orchestrator process:

1. File recovering

2. Import to BI

3. Resources generation

4. Resources uploading


phases

• The main process runs phases in sequential order. A phase never starts until the previous one has finished.

• Within each phase, the process is run for each dataset.

1. File recovering

2. Import to BI

3. Resources generation

4. Resources uploading

Datasets shown in the animation: D1, D2, D3.


phases

How does Kettle know which datasets are in which phase?

A database is used to store the information regarding the whole process and the state of execution of each dataset.

What if something goes wrong?

If something goes wrong with a dataset, it remains “stopped” in that phase until the next iteration of the main process.
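The phase tracking described above can be sketched in a few lines. This is a minimal illustration in Python rather than Kettle, assuming a hypothetical `dataset_state` table in SQLite and hypothetical phase handler functions; the slides do not show the real schema or job definitions.

```python
import sqlite3

# Phase names follow the slides; the identifiers themselves are illustrative.
PHASES = ["file_recovering", "import_to_bi", "resources_generation", "resources_uploading"]


def init_state(conn, datasets):
    """Create the (hypothetical) state table and register datasets at phase 0."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dataset_state ("
        "name TEXT PRIMARY KEY, phase INTEGER NOT NULL, "
        "stopped INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO dataset_state (name, phase) VALUES (?, 0)",
        [(name,) for name in datasets],
    )
    conn.commit()


def run_iteration(conn, handlers):
    """Run every phase in order; inside a phase, process each dataset waiting there."""
    # Datasets stopped in an earlier iteration get another try from the same phase.
    conn.execute("UPDATE dataset_state SET stopped = 0")
    for index, phase in enumerate(PHASES):
        waiting = conn.execute(
            "SELECT name FROM dataset_state "
            "WHERE phase = ? AND stopped = 0 ORDER BY name",
            (index,),
        ).fetchall()
        for (name,) in waiting:
            try:
                handlers[phase](name)
            except Exception:
                # The dataset stays in this phase until the next iteration.
                conn.execute(
                    "UPDATE dataset_state SET stopped = 1 WHERE name = ?", (name,)
                )
            else:
                conn.execute(
                    "UPDATE dataset_state SET phase = ? WHERE name = ?",
                    (index + 1, name),
                )
        conn.commit()
```

In this sketch a dataset whose phase equals `len(PHASES)` has completed all four phases. What the real process does with finished datasets on the next run is not described in the slides.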

[Animation, "What if something goes wrong?": datasets D1, D2 and D3 move through the four phases (1. File recovering, 2. Import to BI, 3. Resources generation, 4. Resources uploading) across a 1st and a 2nd iteration of the main process.]

4th phase – Use of API

• 4th phase - example of API functions being used:

ckan.logic.action.create
  ckan.logic.action.create.package_create
  ckan.logic.action.create.resource_create
  ckan.logic.action.create.tag_create

ckan.logic.action.patch
  ckan.logic.action.patch.package_patch
  ckan.logic.action.patch.resource_patch

ckan.logic.action.update
  ckan.logic.action.update.term_translation_update
  ckan.logic.action.update.term_translation_update_many

ckan.logic.action.delete
  ckan.logic.action.delete.package_delete
  ckan.logic.action.delete.dataset_purge
  ckan.logic.action.delete.resource_delete
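These functions are exposed over HTTP by the CKAN Action API as `/api/3/action/<function name>`. The following Python sketch shows how such a call could be built outside Kettle; the portal URL, API key and dataset name in the usage note are placeholders, and the slides do not show how the Kettle job itself issues the requests.

```python
import json
import urllib.request


def build_action_request(base_url, action, payload, api_key):
    """Build a POST request for a CKAN action such as package_create."""
    url = f"{base_url.rstrip('/')}/api/3/action/{action}"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": api_key},
        method="POST",
    )


def call_action(request):
    """Send the request and return the 'result' member of CKAN's JSON reply."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["result"]
```

For example, `call_action(build_action_request("https://ckan.example.org", "package_create", {"name": "example-dataset"}, "<api-key>"))` would attempt to create a dataset on a hypothetical portal. Depending on the portal configuration, further fields such as the owner organization may be required.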


Creating a dataset: step-by-step

1st Step: Create a new .properties file for your new dataset

• This file contains several properties related to the new dataset. For instance: metadata in two languages, the type of datasource (file or database), and how often the dataset is updated (daily, weekly, monthly, yearly).

• This file is read by the orchestrator process and is used to set the metadata when creating the dataset.
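As an illustration of what such a file might hold, here is a small Python sketch that parses a simplified `.properties` file. The property names and values are invented for the example; the slides do not list the real ones, and the real file is read by Kettle, not by this code.

```python
def parse_properties(text: str) -> dict[str, str]:
    """Parse simple key=value lines; lines starting with '#' or '!' are comments.

    Simplified on purpose: no line continuations or ':' separators.
    """
    properties = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line[0] in "#!":
            continue
        key, _, value = line.partition("=")
        properties[key.strip()] = value.strip()
    return properties


# Hypothetical property names, invented for illustration only.
EXAMPLE = """\
# metadata in two languages
title.es=Conjunto de ejemplo
title.en=Example dataset
# type of datasource: file or database
source.type=file
# how often the dataset is updated
update.frequency=monthly
"""
```

Calling `parse_properties(EXAMPLE)` returns a dictionary that an orchestrator could use to fill in the dataset metadata.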


2nd Step: Register the dataset in the database

• There is a script for launching the registration queries

• This step is required so that the orchestrator process can call the new dataset
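A registration script of this kind could look roughly like the following Python sketch. The `datasets` table and its columns are assumptions made for the example, using SQLite; the actual queries and database are not shown in the slides.

```python
import sqlite3


def register_dataset(conn, name, properties_path):
    """Insert a dataset row if it is not registered yet; return True when inserted."""
    # Hypothetical schema: one row per dataset, starting at phase 0.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS datasets ("
        "name TEXT PRIMARY KEY, properties_path TEXT NOT NULL, "
        "phase INTEGER NOT NULL DEFAULT 0)"
    )
    cursor = conn.execute(
        "INSERT OR IGNORE INTO datasets (name, properties_path) VALUES (?, ?)",
        (name, properties_path),
    )
    conn.commit()
    return cursor.rowcount == 1
```

Using `INSERT OR IGNORE` keeps the script safe to run twice for the same dataset.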


3rd Step: create the folder structure for the new dataset

• config: contains the properties file

• input_files: contains source files

• input_error_files: contains processed files if any error occurred

• output_files: contains files pending upload to CKAN

• output_processed_files: contains a copy of the files uploaded to CKAN

Files are moved from one folder to the next when each phase finishes.
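Creating this folder structure is easy to script. A minimal Python sketch follows, assuming one directory per dataset under a base directory of your choice; the base path and dataset name are placeholders.

```python
from pathlib import Path

# Folder names taken from the slide.
DATASET_FOLDERS = [
    "config",
    "input_files",
    "input_error_files",
    "output_files",
    "output_processed_files",
]


def create_dataset_folders(base_dir, dataset_name):
    """Create the per-dataset folders and return the dataset directory."""
    dataset_dir = Path(base_dir) / dataset_name
    for folder in DATASET_FOLDERS:
        # exist_ok makes the call safe to repeat for an existing dataset.
        (dataset_dir / folder).mkdir(parents=True, exist_ok=True)
    return dataset_dir
```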


4th Step: load process development

• This process reads data from the source and loads it into the database (2nd phase)

• Depending on the data, this process can be copied (templates) among several datasets

• In any event, many steps of the process are reused (error handling, file/database access…)


5th Step: “select” query to generate resources.

• Each dataset has a different “SELECT”.

• This select is called by a common transformation process which generates CSV, XML and JSON files.

SELECT UPPER (OWCKAN.CSV_UTIL_PKG.ARRAY_TO_CSV(
         OWCKAN.T_STR_ARRAY(CR_ANYO, CR_MES, CR_DESC_MES, CR_DEPTO_ATENCION,
                            CR_SEXO, CR_COD_SEXO, CR_EDAD, CR_CITAS_REGISTRADAS),
         '';'')) AS CSV,
       ''1''
FROM OWCKAN.OD_SAN_IND_AT_PRIMARIA_CITAS
LEFT JOIN OWCKAN.OD_SAN_AP_MD_EDAD ON CR_EDAD = AP_COD_EDAD
LEFT JOIN OWCKAN.OD_SAN_MD_DEPTO ON CR_DEPTO_ATENCION = AH_DC_COD_DEPTO
WHERE CR_ANYO = ''ANYO_CHANGE''
  AND CR_MES = ''MES_CHANGE''
ORDER BY CR_DEPTO_ATENCION, CR_SEXO
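The query contains the placeholders ANYO_CHANGE and MES_CHANGE, and its result is turned into CSV, XML and JSON. The Python sketch below illustrates both ideas under stated assumptions: the placeholders are filled by plain text replacement, and the XML element names `rows` and `row` are invented. The real transformation is built in Kettle and may work differently.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def fill_template(template, year, month):
    """Replace the ANYO_CHANGE / MES_CHANGE placeholders seen in the query."""
    # Assumption: simple text substitution of year and month values.
    return template.replace("ANYO_CHANGE", str(year)).replace("MES_CHANGE", str(month))


def to_csv(columns, rows, delimiter=";"):
    """Serialize rows as CSV text with a header line (';' as in the query)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(columns, rows):
    """Serialize rows as a JSON array of objects keyed by column name."""
    return json.dumps([dict(zip(columns, row)) for row in rows], ensure_ascii=False)


def to_xml(columns, rows, root_tag="rows", row_tag="row"):
    """Serialize rows as XML with one child element per column."""
    root = ET.Element(root_tag)
    for row in rows:
        item = ET.SubElement(root, row_tag)
        for name, value in zip(columns, row):
            ET.SubElement(item, name).text = str(value)
    return ET.tostring(root, encoding="unicode")
```

Keeping the three serializers independent of any particular query mirrors the slide's point: only the SELECT changes per dataset, while the generation step is shared.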


• Most of the steps can be accomplished immediately

• We get fully automated datasets in less than one hour when using Kettle templates.

Estimated time per step:

• .properties: 5’

• Register dataset: 10’

• Create folder structure: 5’

• Process development: 2h – 16h

• “select” query: 10’


Conclusions

• Lack of culture in some organizations about how to make reusability easier

• We have the data and we have the platform for publishing; what remains is to simplify the transition from the source to the portal.

• The easier we make this transition, the more contributors will be willing to participate.

• Kettle is a good alternative: it's powerful, free, open source and already known by many IT departments.

• This platform is just an approach. You can design your own solution.

• Whenever you can, help people to produce the best reusable formats
