Auto-loading of Drupal CCK Nodes

David Naughton | December 3, 2008 | Automatic Scheduled Loading of CCK Nodes: ETL with drupal_execute, OO, drush, & cron

Upload: nihiliad

Post on 31-May-2015


Page 1: Auto-loading of Drupal CCK Nodes

David Naughton | December 3, 2008

Automatic Scheduled Loading of CCK Nodes

ETL with drupal_execute, OO, drush, & cron

Page 2: Auto-loading of Drupal CCK Nodes

Who am I?

David Naughton

● Web Applications Developer

● University of Minnesota Libraries


● 11+ years development experience

● New to Drupal & PHP

Page 3: Auto-loading of Drupal CCK Nodes

What's EthicShare?

ethicshare.org

• Who: UMN Center for Bioethics, UMN Libraries, & UMN Csci & EE

• What: A sustainable aggregation of bioethics research and a forum for scholarship

• When: Pilot Phase January 2008 – June 2009

• How: Funded by Andrew W. Mellon Foundation

Page 4: Auto-loading of Drupal CCK Nodes

Sustainable Aggregation of Bioethics Research

• My part of the project

• Extract citations from multiple sources

• Transform into Drupal-compatible format

• Load into Drupal

• On a regular, ongoing basis

Page 5: Auto-loading of Drupal CCK Nodes

ETL...

• Extract, Transform, and Load = ETL

• Very common IT problem

• ETL is the most common term for it

• Librarians like to say...

• “Harvesting” instead of Extracting

• “Crosswalking” instead of Transforming

• ...but they're peculiar

Page 6: Auto-loading of Drupal CCK Nodes

...ETL

• Complex problem

• Lots of packaged solutions

• Mostly Java, for data warehouses

• Not a good fit for EthicShare

• Using Drupal 5 and CCK

• No Batch API

• When we move to Drupal 6...

• Batch API http://bit.ly/BatchAPI?

• content.crud.inc http://bit.ly/content-crud-inc?

Page 7: Auto-loading of Drupal CCK Nodes

Without Automation

• First PubMed load alone was > 100,000 citations

• Without automation, I could have been doing lots of this:

Page 8: Auto-loading of Drupal CCK Nodes

One Solution

If money were no object, we could have hired lots of these:

Page 9: Auto-loading of Drupal CCK Nodes

Really want...

Page 10: Auto-loading of Drupal CCK Nodes

...but don't want:

Page 11: Auto-loading of Drupal CCK Nodes

Architecture

[Diagram: drush runs CiteETL. Sources (PubMed, WorldCat, New York Times, BBC) → Extractors → XML → Transformers (PubMed, WorldCat, New York Times, BBC) → PHP Array → Loader → EthicShare (MySQL)]

Page 12: Auto-loading of Drupal CCK Nodes

drush

A portmanteau of “Drupal shell”.

“…a command line shell and Unix scripting interface for Drupal, a veritable Swiss Army knife designed to make life easier for those of us who spend most of our working hours hacking away at the command prompt.”

-- http://drupal.org/project/drush

Page 13: Auto-loading of Drupal CCK Nodes

Why drush?

• Very flexible scheduling via cron

● Uses php-cli, so no web timeouts

● Experimental support for running drush without a running Drupal web instance

● Run tests from the cli with Drush simpletest runner

Page 14: Auto-loading of Drupal CCK Nodes

Why not hook_cron?

• If you're comfortable with cron, flexible scheduling via hook_cron requires unnecessary extra work

● Subject to web timeouts

● Runs within a Drupal web instance, so large loads may affect user experience

Page 15: Auto-loading of Drupal CCK Nodes

drush help

$ cd $drush_dir
$ ./drush.php help
Usage: drush.php [options] <command> <command> ...

Options:
  -r <path>, --root=<path>   Drupal root directory to use (default: current directory)
  -l <uri>, --uri=<uri>      URI of the drupal site to use (only needed in multisite environments)
  ...

Commands:
  cite load    Load data to create new citations.
  help         View help. Run "drush help [command]" to view command-specific help.
  pm install   Install one or more modules

Page 16: Auto-loading of Drupal CCK Nodes

drush command help

$ ./drush.php help cite load
Usage: drush.php cite load [options]

Options:
  --E=<extractor class>
      Base name of an extractor class, excluding the CiteETL/E/ parent path & '.php'. Required.
  --T=<transformer class>
      Base name of a transformer class, excluding the CiteETL/T/ parent path & '.php'. Required.
  --L=<loader class>
      Base name of a loader class, excluding the CiteETL/L/ parent path & '.php'. Optional: default is 'Loader'.
  --dbuser=<db username>
      Optional: 'cite load' will authenticate the user only if both dbuser & dbpass are present.
  --dbpass=<db password>
      Optional: 'cite load' will authenticate the user only if both dbuser & dbpass are present.
  --memory_limit=<memory limit>
      Optional: default is 512M.

Page 17: Auto-loading of Drupal CCK Nodes

drush cite load

Example specifying the New York Times – Health extractor & transformer classes on the cli:

$ ./drush.php cite load --E=NYTHealth \
    --T=NYTHealth --dbuser=$dbuser \
    --dbpass=$dbpass

Allows for flexible, per-data-source scheduling via cron, a requirement for EthicShare.

Page 18: Auto-loading of Drupal CCK Nodes

php-cli Problems

• PHP versions < 5.3 do not free circular references. This is a problem when parsing loads of XML: “Memory Leaks With Objects in PHP 5”, http://bit.ly/php5-memory-leak

• Still may have to allocate huge amounts of memory to PHP to avoid “out of memory” errors.
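One common workaround for the circular-reference problem is to break the cycles by hand before dropping an object graph, so that plain reference counting can reclaim it. The sketch below is only an illustration of that idea, not code from the EthicShare project: the `SketchNode` class and its `destroy()` method are hypothetical names.

```php
<?php
// Hypothetical tree-node class, for illustration only.
class SketchNode {
    public $parent = null;
    public $children = array();

    function add_child(SketchNode $child) {
        $child->parent = $this;       // child -> parent
        $this->children[] = $child;   // parent -> child: a reference cycle
    }

    // Break every cycle in this subtree so reference counting can free it.
    function destroy() {
        foreach ($this->children as $child) {
            $child->destroy();
        }
        $this->children = array();
        $this->parent = null;
    }
}

$root = new SketchNode();
$leaf = new SketchNode();
$root->add_child($leaf);

// Break the cycle explicitly before letting go of the objects.
$root->destroy();
$cycle_broken = ($leaf->parent === null && count($root->children) === 0);
unset($root, $leaf);

// On PHP 5.3 and later the cycle collector can also be run explicitly.
if (function_exists('gc_collect_cycles')) {
    gc_collect_cycles();
}
```

In a long-running load, a cleanup call like this would go at the end of each iteration, once the parsed record is no longer needed.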

Page 19: Auto-loading of Drupal CCK Nodes

drush API

Undocumented, but simple, and http://drupal.org/project/drush links to some modules that use it. To create a drush command…

● Implement hook_drush_command, mapping cli text to a callback function name

● Implement the callback function

…and optionally…

● Implement a hook_help case for your command

Page 20: Auto-loading of Drupal CCK Nodes

drush getopt emulation…

Supports:

● --opt=value

● -opt or --opt (boolean based on presence or absence)

Contrary to README.txt, does not support:

● -opt value

● -opt=value

Page 21: Auto-loading of Drupal CCK Nodes

…drush getopt emulation

• Puts options in an associative array, where keys are the option names: $GLOBALS['args']['options']

● Puts commands (“words” not starting with a dash) in an array: $GLOBALS['args']['commands']

Quirks:

● in cases of repetition (e.g. -opt --opt=value ), last one wins

● commands & options can be interspersed, as long as order of commands is maintained
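The parsing behavior described on these two slides can be approximated in a few lines. This is a rough sketch of the described behavior, not drush's actual implementation: `sketch_parse_args()` is a hypothetical function, and it does nothing special for the unsupported forms.

```php
<?php
// Hypothetical re-creation of the described parsing rules; not drush code.
function sketch_parse_args(array $words) {
    $args = array('options' => array(), 'commands' => array());
    foreach ($words as $word) {
        if (preg_match('/^--([^=]+)=(.*)$/', $word, $m)) {
            // --opt=value: store the value; a later repeat overwrites it.
            $args['options'][$m[1]] = $m[2];
        } elseif (preg_match('/^--?(.+)$/', $word, $m)) {
            // -opt or --opt: boolean, true because it is present.
            $args['options'][$m[1]] = TRUE;
        } else {
            // Anything not starting with a dash is a command word.
            $args['commands'][] = $word;
        }
    }
    return $args;
}

// Commands and options interspersed; the repeated option shows "last one wins".
$args = sketch_parse_args(array(
    'cite', '-opt', 'load', '--opt=value', '--E=NYTHealth',
));
```

Here `$args['commands']` keeps `cite` and `load` in order, and `$args['options']['opt']` ends up as the string `value` because the later `--opt=value` replaces the earlier boolean `-opt`.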

Page 22: Auto-loading of Drupal CCK Nodes

cite.module example…

function cite_drush_command() {
  $items['cite load'] = array(
    'callback' => 'cite_load_cmd',
    'description' => t('Load data to create new citations.')
  );
  return $items;
}

Page 23: Auto-loading of Drupal CCK Nodes

…cite.module example…

function cite_load_cmd($url) {

  global $args;
  $options = $args['options'];

  // Batch loading will often require more
  // than the default memory.
  $memory_limit = (
    array_key_exists('memory_limit', $options)
    ? $options['memory_limit']
    : '512M'
  );
  ini_set('memory_limit', $memory_limit);

  // continued on next slide…

Page 24: Auto-loading of Drupal CCK Nodes

…cite.module example

  // …continued from previous slide
  if (array_key_exists('dbuser', $options)
      && array_key_exists('dbpass', $options)) {
    user_authenticate($options['dbuser'], $options['dbpass']);
  }
  set_include_path(
    './' . drupal_get_path('module', 'cite') . PATH_SEPARATOR
    . './' . drupal_get_path('module', 'cite') . '/contrib' . PATH_SEPARATOR
    . get_include_path()
  );

  require_once 'CiteETL.php';
  $etl = new CiteETL( $options );
  $etl->run();

} // end function cite_load_cmd

Page 25: Auto-loading of Drupal CCK Nodes

CiteETL.php…

class CiteETL {

  private $option_property_map = array(
    'E' => 'extractor',
    'T' => 'transformer',
    'L' => 'loader'
  );

  // Not shown: identically-named accessors for these properties
  private $extractor;
  private $transformer;
  private $loader;

Page 26: Auto-loading of Drupal CCK Nodes

…CiteETL.php…

function __construct($params) {
  // The loading process is almost always the same...
  if (!array_key_exists('L', $params)) {
    $params['L'] = 'Loader';
  }

  foreach ($params as $option => $class) {
    if (!preg_match('/^(E|T|L)$/', $option)) {
      continue;
    }
    // Naming-convention-based, factory-ish, dynamic
    // loading of classes, e.g. CiteETL/E/NYTHealth.php:
    require_once 'CiteETL/' . $option . '/' . $class . '.php';
    $instantiable_class = 'CiteETL_' . $option . '_' . $class;
    $property = $this->option_property_map[$option];
    $this->$property = new $instantiable_class;
  }
}
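The naming-convention lookup in the constructor can be tried in isolation. The sketch below makes several assumptions: the three empty classes are placeholders defined inline so the example runs without any CiteETL/ files, `sketch_build_components()` is a hypothetical helper, and the `class_exists()` check is an addition that the slide's code does not have.

```php
<?php
// Placeholder classes standing in for files such as CiteETL/E/Demo.php.
class CiteETL_E_Demo {}
class CiteETL_T_Demo {}
class CiteETL_L_Loader {}

// Hypothetical helper: maps E/T/L options to instances by class-name convention.
function sketch_build_components(array $params) {
    $map = array('E' => 'extractor', 'T' => 'transformer', 'L' => 'loader');

    // Same default as on the slide: the generic loader.
    if (!array_key_exists('L', $params)) {
        $params['L'] = 'Loader';
    }

    $components = array();
    foreach ($params as $option => $class) {
        if (!isset($map[$option])) {
            continue; // ignore unrelated options such as memory_limit
        }
        $class_name = 'CiteETL_' . $option . '_' . $class;
        if (!class_exists($class_name)) {
            throw new InvalidArgumentException("Unknown class: $class_name");
        }
        $components[$map[$option]] = new $class_name();
    }
    return $components;
}

$components = sketch_build_components(array(
    'E' => 'Demo',
    'T' => 'Demo',
    'memory_limit' => '512M',
));
```

Failing early on an unknown class name gives a clearer error on the command line than a failed `require_once`, which may be worth the extra check in a cron-driven job.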

Page 27: Auto-loading of Drupal CCK Nodes

…CiteETL.php

function run() {
  // Extractors must all implement the Iterator interface.
  $extractor = $this->extractor();
  $extractor->rewind();
  while ($extractor->valid()) {
    $original_citation = $extractor->current();
    try {
      $transformed_citation = $this->transformer->transform( $original_citation );
    } catch (Exception $e) {
      fwrite(STDERR, $e->getMessage() . "\n");
      $extractor->next();
      // Skip loading: there is no transformed citation for this item.
      continue;
    }
    try {
      $this->loader->load( $transformed_citation );
    } catch (Exception $e) {
      fwrite(STDERR, $e->getMessage() . "\n");
    }
    $extractor->next();
  }
}
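The control flow of run() is easier to see with toy components. In this sketch everything is hypothetical: `SketchTransformer`, `SketchLoader`, and `sketch_run()` are stand-ins, PHP's built-in `ArrayIterator` plays the extractor, and error messages are collected in an array rather than written to STDERR. A rejected item is logged and skipped, and the loop carries on with the next one.

```php
<?php
// Toy transformer: rejects anything that does not mention "health".
class SketchTransformer {
    function transform($item) {
        if (stripos($item, 'health') === FALSE) {
            throw new Exception("rejected: $item");
        }
        return array('title' => $item);
    }
}

// Toy loader: remembers titles instead of creating nodes.
class SketchLoader {
    public $loaded = array();
    function load(array $citation) {
        $this->loaded[] = $citation['title'];
    }
}

// Extract, transform, load; one failing item never stops the run.
function sketch_run(Iterator $extractor, $transformer, $loader) {
    $errors = array();
    foreach ($extractor as $original) {
        try {
            $loader->load($transformer->transform($original));
        } catch (Exception $e) {
            $errors[] = $e->getMessage();
        }
    }
    return $errors;
}

$loader = new SketchLoader();
$errors = sketch_run(
    new ArrayIterator(array(
        'World health update',
        'Sports scores',
        'Public health law',
    )),
    new SketchTransformer(),
    $loader
);
```

Because extractors implement Iterator, `foreach` can replace the explicit rewind()/valid()/next() calls, which removes the risk of advancing the extractor twice for one item.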

Page 28: Auto-loading of Drupal CCK Nodes

Example E. Base Class…

require_once 'simplepie.inc';

class CiteETL_E_SimplePie implements Iterator {

  private $items = array();
  private $valid = FALSE;

  function __construct($params) {
    $feed = new SimplePie();
    $feed->set_feed_url( $params['feed_url'] );
    $feed->init();
    if ($feed->error()) {
      throw new Exception( $feed->error() );
    }
    $feed->strip_htmltags( $params['strip_html_tags'] );
    $this->items = $feed->get_items();
  }

  // continued on next slide…

Page 29: Auto-loading of Drupal CCK Nodes

…Example E. Base Class

  // …continued from previous slide
  function rewind() {
    $this->valid = (FALSE !== reset($this->items));
  }

  function current() {
    return current($this->items);
  }

  function key() {
    return key($this->items);
  }

  function next() {
    $this->valid = (FALSE !== next($this->items));
  }

  function valid() {
    return $this->valid;
  }

} # end class CiteETL_E_SimplePie
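The same Iterator pattern can be shown without SimplePie. The sketch below is a hypothetical array-backed extractor, `SketchArrayExtractor`, written with return types and therefore assuming PHP 8.0 or later. It differs from the slide in one deliberate way: validity is tested with `key() !== NULL` instead of `FALSE !== next(...)`, so an item that happens to be FALSE does not end the iteration early.

```php
<?php
// Hypothetical array-backed extractor; assumes PHP 8.0+ for the return types.
class SketchArrayExtractor implements Iterator {
    private $items;

    function __construct(array $items) {
        $this->items = $items;
    }

    function rewind(): void {
        reset($this->items);
    }

    function current(): mixed {
        return current($this->items);
    }

    function key(): mixed {
        return key($this->items);
    }

    function next(): void {
        next($this->items);
    }

    function valid(): bool {
        // key() returns NULL once the internal pointer is past the end.
        return key($this->items) !== NULL;
    }
}

// Any Iterator can be consumed with foreach.
$seen = array();
foreach (new SketchArrayExtractor(array('first', FALSE, 'third')) as $item) {
    $seen[] = $item;
}
```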

Page 30: Auto-loading of Drupal CCK Nodes

Example Extractor

require_once 'CiteETL/E/SimplePie.php';

class CiteETL_E_NYTHealth extends CiteETL_E_SimplePie {

  function __construct() {
    parent::__construct(array(
      'feed_url' => 'http://www.nytimes.com/services/xml/rss/nyt/Health.xml',
      'strip_html_tags' => array('br','span','a','img')
    ));
  }

} // end class CiteETL_E_NYTHealth

Page 31: Auto-loading of Drupal CCK Nodes

Example Transformer…

class CiteETL_T_NYTHealth {

  private $filter_pattern;

  function __construct() {

    $simple_keywords = array(
      'abortion',
      'advance directives',
      // whole bunch of keywords omitted…
      'world health',
    );
    $this->filter_pattern = '/(' . join('|', $simple_keywords) . ')/i';
  }

  // continued on next slide…

Page 32: Auto-loading of Drupal CCK Nodes

…Example Transformer…

  // …continued from previous slide

  function transform( $simplepie_item ) {
    // create an array matching the cite CCK content type structure:
    $citation = array();

    $citation['title'] = $simplepie_item->get_title();
    $citation['field_abstract'][0]['value'] = $simplepie_item->get_content();

    // lots of transformation ops omitted…

    $categories = $simplepie_item->get_categories();
    $category_labels = array();
    foreach ($categories as $category) {
      array_push($category_labels, $category->get_label());
    }
    $citation['field_subject'][0]['value'] = join('; ', $category_labels);

    // Filter once, after every field that filter() reads has been set.
    $this->filter( $citation );
    return $citation;
  }

Page 33: Auto-loading of Drupal CCK Nodes

…Example Transformer

  // …continued from previous slide

  function filter( $citation ) {

    $combined_content =
      $citation['title']
      . $citation['field_abstract'][0]['value']
      . $citation['field_subject'][0]['value'];

    if (!preg_match($this->filter_pattern, $combined_content)) {
      throw new Exception(
        "The article '" . $citation['title']
        . "', id: " . $citation['source_id']
        . " was rejected by the relevancy filter"
      );
    }
  }
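The keyword filter comes down to one case-insensitive alternation pattern. Below is a standalone sketch of building and applying such a pattern, using only the three keywords visible on the slides. The `sketch_filter_pattern()` helper is hypothetical, and the `preg_quote()` call is an addition to guard against keywords that contain regex metacharacters.

```php
<?php
// Hypothetical helper: build a case-insensitive alternation from keywords.
function sketch_filter_pattern(array $keywords) {
    $quoted = array();
    foreach ($keywords as $keyword) {
        // Escape regex metacharacters and the '/' delimiter.
        $quoted[] = preg_quote($keyword, '/');
    }
    return '/(' . join('|', $quoted) . ')/i';
}

$pattern = sketch_filter_pattern(array(
    'abortion',
    'advance directives',
    'world health',
));

// preg_match returns 1 on a match and 0 when nothing matches.
$relevant   = preg_match($pattern, 'New World Health Organization guidance');
$irrelevant = preg_match($pattern, 'Local sports scores');
```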

Page 34: Auto-loading of Drupal CCK Nodes

Why not FeedAPI?

• Supports only simple one-feed-field to one-CCK-field mappings

• Avoid the Rube Goldberg Effect by using the same ETL system for feeds that we use for everything else

Page 35: Auto-loading of Drupal CCK Nodes

Loader

class CiteETL_L_Loader {

  function load( $citation ) {
    // de-duplication code omitted…
    $node = array('type' => 'cite');
    $citation['status'] = 1;
    $node_path = drupal_execute( 'cite_node_form', $citation, $node );
    $errors = form_get_errors();
    if (count($errors)) {
      $message = join('; ', $errors);
      throw new Exception( $message );
    }
    // de-duplication code omitted…
  }

} // end class CiteETL_L_Loader

Page 36: Auto-loading of Drupal CCK Nodes

CCK Auto-loading Resources

• Quick-and-dirty CCK imports http://bit.ly/quick-dirty-cck-imports

• Programmatically Create, Insert, and Update CCK Nodes http://bit.ly/cck-import-update

• What is the Content Construction Kit? A View from the Database. http://bit.ly/what-is-cck

Page 37: Auto-loading of Drupal CCK Nodes

CCK Auto-loading Problems

• Column names may change from one database instance to another if other CCK content types with identical field names already exist.

• drupal_execute bug in Drupal 5 Form API:

• cannot call drupal_validate_form on the same form more than once: http://bit.ly/drupal5-formapi-bug

• Fixed in Drupal versions > 5

Page 38: Auto-loading of Drupal CCK Nodes

Questions?