October 4th, 2016
Automatic data publication in CKAN using Kettle
(a success case in Generalitat Valenciana)
Index
• Intro – a short story
• Pentaho Data Integration (Kettle)
  • What is Kettle?
  • Some features
  • Some screenshots
• Description of the solution
  • Architecture
  • Execution phases
  • Use of API
• Creating a new dataset: step by step
• Conclusions
Intro – a short story
Problems:
• Lack of culture around data reusability
• Lack of resources to provide reusable data
What is Kettle?
• A Community Edition is available with a free license.
• Pentaho Business Analytics suite components: Reporting, Analysis Services, Dashboards, Data Integration (PDI), Data Mining, BI Server.
• PDI tools: Spoon (graphical designer), Pan (runs transformations from the command line), Kitchen (runs jobs), Carte (lightweight server for remote execution).
PDI (also called Kettle) is the component of Pentaho responsible for the Extraction, Transformation and Loading (ETL) processes. ETL tools are most frequently used in data warehouse environments; however, Pentaho Data Integration (PDI) can also be used for other purposes:
• Migrating data between applications or databases
• Exporting data from databases to flat files
• Loading data massively into databases
• Data cleaning
• Integrating applications
Kettle - features
• Data flow control (bandwidth consumption)
• Parallel or sequential process execution
• Process scheduling
• Development of custom Java classes / use of Java libraries
• Scripting (SQL, JavaScript, Shell)
Supported datasources
• Databases: Oracle, MySQL, PostgreSQL, SQLite, Sybase, IBM DB2, Hypersonic, Informix, MS SQL Server, dBase, MS Access, SAP, Vertica, Palo, Hadoop…
• Connectivity: JDBC, ODBC, OCI, JNDI
• Files: text files (XML, JSON, CSV, RSS), Excel
• Locations and services: network folders, FTP servers, REST and SOAP services
Kettle - screenshots
• Designing a transformation: just a bunch of commands… drag & drop… and configure!
• Community and documentation links
• Execution log view: what happened?
• Verifying a transformation: warnings and error details
• Debugging a transformation: breakpoint configuration
Architecture
[Diagram: datasources (databases and files) provide input files; an orchestrator process drives the Business Intelligence system and publishes output files to CKAN through its API, in four numbered phases:]
1. File recovering
2. Import to BI
3. Resources generation
4. Resources uploading
Phases
• The main process runs phases in sequential order. A phase never starts until the previous one has finished.
• Within each phase, the process is run for each dataset.
1. File recovering
2. Import to BI
3. Resources generation
4. Resources uploading
[Animated diagram: datasets D1, D2 and D3 advance together through the four phases, one phase at a time.]
Phases
How does Kettle know which dataset is in which phase?
A database is used to store the information regarding the whole process and the state of execution of each dataset.
What if something goes wrong?
If something goes wrong with a dataset, it remains “stopped” in that phase until the next iteration of the main process.
[Animated diagram, 1st and 2nd iterations: D1, D2 and D3 advance through the four phases (1. File recovering, 2. Import to BI, 3. Resources generation, 4. Resources uploading). When a dataset fails in a phase it stays there while the others continue, and it is retried from that same phase in the next iteration.]
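The slides describe this loop only at the diagram level. Purely as an illustration, here is a minimal Python sketch of the control flow, assuming a hypothetical dataset_state table; the real orchestrator is a Kettle job, and run_phase() stands in for the per-phase transformations.

```python
# A minimal sketch of the orchestrator's control loop, in Python for brevity
# (the real solution is built as Kettle jobs). The dataset_state table and the
# run_phase() worker are hypothetical; the talk only says that a database
# stores the execution state of each dataset.
import sqlite3

PHASES = ["file_recovering", "import_to_bi",
          "resources_generation", "resources_uploading"]

def run_phase(dataset_id, phase):
    """Placeholder for the real per-phase Kettle transformation."""

def next_phase(phase):
    i = PHASES.index(phase)
    return PHASES[i + 1] if i + 1 < len(PHASES) else "done"

def run_iteration(conn):
    cur = conn.cursor()
    for phase in PHASES:  # phases run strictly in order
        cur.execute("SELECT id FROM dataset_state WHERE phase = ?", (phase,))
        for (dataset_id,) in cur.fetchall():  # within a phase, once per dataset
            try:
                run_phase(dataset_id, phase)
                cur.execute("UPDATE dataset_state SET phase = ? WHERE id = ?",
                            (next_phase(phase), dataset_id))
            except Exception:
                # A failed dataset keeps its current phase ("stopped") and is
                # simply retried on the next iteration of the main process.
                pass
        conn.commit()
```

Because each phase re-queries the state table, a dataset that succeeds moves on to the next phase within the same iteration, which matches the behaviour shown in the diagrams.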
4th phase – Use of API
Example of API functions being used:
ckan.logic.action.create
• package_create
• resource_create
• tag_create
ckan.logic.action.patch
• package_patch
• resource_patch
ckan.logic.action.update
• term_translation_update
• term_translation_update_many
ckan.logic.action.delete
• package_delete
• dataset_purge
• resource_delete
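As a concrete illustration of phase 4, the sketch below calls two of the actions above through CKAN's HTTP action API. The portal URL, API key, dataset name and file path are placeholders, and the real solution performs these calls from Kettle rather than Python.

```python
# A sketch of phase 4 calling CKAN's action API over HTTP from Python.
# CKAN_URL, API_KEY, the dataset name and the file path are placeholders.
import requests

CKAN_URL = "https://your-ckan-portal.example.org"
API_KEY = "your-api-key"

def action(name, payload, files=None):
    """Call a CKAN action API endpoint and return its 'result' field."""
    response = requests.post(
        f"{CKAN_URL}/api/3/action/{name}",
        headers={"Authorization": API_KEY},
        data=payload if files else None,   # multipart form when uploading
        json=None if files else payload,   # plain JSON body otherwise
        files=files,
    )
    response.raise_for_status()
    return response.json()["result"]

# Create the dataset (package) with its metadata...
dataset = action("package_create", {
    "name": "primary-care-appointments",   # illustrative dataset name
    "title": "Primary care appointments",
    "notes": "Published automatically by the orchestrator process.",
})

# ...then attach a generated file as a resource.
with open("output_files/appointments.csv", "rb") as f:
    action("resource_create",
           {"package_id": dataset["id"], "format": "CSV"},
           files={"upload": f})
```

On subsequent runs the patch actions listed above work the same way, changing only the fields supplied in the payload.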
Creating a dataset: step-by-step
1st Step: Create a new .properties file for your new dataset
• This file contains several properties related to the new dataset: for instance, metadata in two languages, the type of datasource (file or database), and how often the dataset should be updated (daily, weekly, monthly, yearly).
• This file is read by the orchestrator process and is used to set the metadata when creating the dataset.
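For illustration, such a file might look like the sketch below; the actual property names used by the orchestrator are not shown in the talk, so every key here is hypothetical.

```properties
# Illustrative only: the real property names are not shown in the talk.
title.es=Citas de atención primaria
title.en=Primary care appointments
description.es=Citas registradas por departamento, sexo y edad
description.en=Appointments registered by department, sex and age
datasource.type=database
update.frequency=monthly
```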
Creating a dataset: step-by-step
2nd Step: Register the dataset in the database
• There is a script for launching these queries
• This step is required so that the orchestrator process picks up the new dataset
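Continuing the hypothetical dataset_state table from the earlier orchestrator sketch, the registration might reduce to something like this (the real script and schema are not shown in the talk):

```python
# Hypothetical: register the new dataset in the state table used by the
# orchestrator. Table and column names follow the earlier sketch, not the
# real schema.
import sqlite3

conn = sqlite3.connect("orchestrator.db")   # placeholder database
conn.execute("INSERT INTO dataset_state (id, phase) VALUES (?, ?)",
             ("primary-care-appointments", "file_recovering"))
conn.commit()
```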
Creating a dataset: step-by-step
3rd Step: create the folder structure for the new dataset
• config: contains the properties file
• input_files: contains source files
• input_error_files: contains processed files if any error occurred
• output_files: contains files pending upload to CKAN
• output_processed_files: contains a copy of the files uploaded to CKAN
Files are moved from one folder to another as each phase finishes.
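A minimal sketch of this step in Python, using the folder names from the slide; the root path and dataset identifier are placeholders.

```python
# Create the per-dataset folder structure listed above.
from pathlib import Path

FOLDERS = ["config", "input_files", "input_error_files",
           "output_files", "output_processed_files"]

def create_dataset_folders(root, dataset):
    for folder in FOLDERS:
        Path(root, dataset, folder).mkdir(parents=True, exist_ok=True)

create_dataset_folders("/data/ckan-etl", "primary-care-appointments")
```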
Creating a dataset: step-by-step
4th Step: load process development
• This process reads data from the source and loads it into the database (2nd phase)
• Depending on the data, this process can be copied (as a template) among several datasets
• In any event, many steps of the process are reused (error handling, file/database access…)
Creating a dataset: step-by-step
5th Step: “select” query to generate resources
• Each dataset has a different SELECT.
• This SELECT is called by a common transformation process, which generates the CSV, XML and JSON files.
SELECT UPPER(OWCKAN.CSV_UTIL_PKG.ARRAY_TO_CSV(
         OWCKAN.T_STR_ARRAY(CR_ANYO, CR_MES, CR_DESC_MES, CR_DEPTO_ATENCION,
                            CR_SEXO, CR_COD_SEXO, CR_EDAD, CR_CITAS_REGISTRADAS),
         ';')) AS CSV,
       '1'
FROM OWCKAN.OD_SAN_IND_AT_PRIMARIA_CITAS
LEFT JOIN OWCKAN.OD_SAN_AP_MD_EDAD ON CR_EDAD = AP_COD_EDAD
LEFT JOIN OWCKAN.OD_SAN_MD_DEPTO ON CR_DEPTO_ATENCION = AH_DC_COD_DEPTO
WHERE CR_ANYO = 'ANYO_CHANGE'
  AND CR_MES = 'MES_CHANGE'
ORDER BY CR_DEPTO_ATENCION, CR_SEXO
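The common transformation itself is a Kettle process; purely as an illustration of what it does, here is a Python sketch that runs a dataset's SELECT through a DB-API cursor and writes the three output formats. Names and connection handling are placeholders.

```python
# Illustrative equivalent of the common transformation: run the dataset's
# SELECT and write CSV, XML and JSON resources. The real process is a
# Kettle transformation.
import csv
import json
import xml.etree.ElementTree as ET

def generate_resources(cursor, select_sql, basename):
    cursor.execute(select_sql)
    columns = [d[0] for d in cursor.description]
    rows = cursor.fetchall()

    # CSV, semicolon-separated as in the query above
    with open(f"{basename}.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter=";")
        writer.writerow(columns)
        writer.writerows(rows)

    # JSON: one object per row
    with open(f"{basename}.json", "w", encoding="utf-8") as f:
        json.dump([dict(zip(columns, r)) for r in rows], f, ensure_ascii=False)

    # XML: one <row> element per row
    root = ET.Element("rows")
    for r in rows:
        row_el = ET.SubElement(root, "row")
        for col, val in zip(columns, r):
            ET.SubElement(row_el, col.lower()).text = str(val)
    ET.ElementTree(root).write(f"{basename}.xml", encoding="utf-8")
```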
Creating a dataset: step-by-step
• Most of the steps can be accomplished almost immediately
• We get datasets fully automated in less than one hour when using Kettle templates.
Step                       Time
.properties file           5'
Register dataset           10'
Create folder structure    5'
Process development        2h – 16h
“select” query             10'
Conclusions
• Lack of culture in some organizations about how to make data reuse easier
• We have the data and we have the platform for publishing; we have to simplify the transition from the source to the portal.
• The easier we make this transition, the more contributors will be willing to participate.
• Kettle is a good alternative: it's powerful, free, open source, and already known by many IT departments.
• This platform is just one approach. You can design your own solution.
• Whenever you can, help people produce the best reusable formats.