contents type 1: overwrite scd type 2: add row scd type 3: add column scd type 4: mini-dimensions...
TRANSCRIPT
Contents
Introduction xxxiii
Parti Getting Started with Pentaho 1
Chapter 1 Quick Start: Pentaho Examples 3Getting Started with Pentaho 3
Downloading, and Installing the Software 4Running the Software 5
Starting the Pentaho BI Server 5L°gging in ' ,- 6
Mantle, the Pentaho User Console "- 7Working with the Examples 8
Using the Repository Browser .9Understanding the Examples 9
Running the Examples \\Reporting Examples , \\BI Developer Examples: Regional Sales - HTML 11Steel Wheels: Income Statement 12Steel Wheels: Top 10 Customers 13BI Developer Examples:
, button-single-parameter.prpt 13Charting Examples 14
Steel Wheels: Chart Pick List 15
xvi Contents
Steel Wheels: Flash Chart ListBI Developer Examples: Regional Sales -Line/Bar Chart
Analysis ExamplesBI Developer Examples: Slice and DiceSteel Wheels Analysis Examples
Dashboarding ExamplesOther Examples -
Summary
Chapter 2 PrerequisitesBasic System Setup
Installing UbuntuUsing Ubuntu in Native ModeUsing a Virtual Machine
Working with the TerminalDirectory NavigationCommand History
Using Symbolic LinksCreating Symbolic Links in UbuntuCreating Syrnlinks in Windows Vista
Java Installation and ConfigurationInstalling Java on Ubuntu LinuxInstalling Java on Windows
MySQL InstallationInstalling MySQL Server and Client on UbuntuInstalling MySQL Server and Client on WindowsMySQL GUI ToolsUbuntu InstallWindows Install
Database ToolsPower*Architect and Other Design ToolsSquirrel SQL ClientUbuntu InstallWindows Install
SQLeonardoSummary
15
16161718192020
212222232324242525262627272829293031313131313232333334
Chapter 3 Server Installation and ConfigurationServer Configuration
InstallationInstallation DirectoryUser AccountConfiguring TomcatAutomatic Startup
Managing Database DriversDriver Location for the Server
Contents xvn
373738383839404444
Driver LocationHfor the Administration Console . 44Managing JDBC Drivers on UNIX-BasedSystems
System DatabasesSetting Up the MySQL Schemas
• Configuring Quartz and HibernateConfiguring JDBC SecuritySample DataModify the Pentaho Startup Scripts
E-mailBasic SMTP Configuration ^Secure SMTP ConfigurationTesting E-mail Configuration
Publisher PasswordAdministrative Tasks
The Pentaho Administration Console. Basic PAC Configuration -^
Starting and Stopping PACThe PAC Front EndConfiguring PAC Security and Credentials
User ManagementData SourcesOther Administrative Tasks
Summary
Chapter 4 The Pentaho BI StackPentaho BI Stack Perspectives
FunctionalityServer .Web Client, and Desktop Programs
4445464650515152
, 5 2
545454
' 55555556565758606161
63656565
XVDDD Contents
Front-Ends and Back-EndsUnderlying Technology
The Pentaho Business Intelligence ServerThe PlatformThe Solution Repository and the Solution EngineDatabase Connection Pool ManagementUser Authentication and AuthorizationTask SchedulingE-mail Services
BI ComponentsThe Metadata LayerAd hoc Reporting ServiceThe ETL EngineReporting EnginesThe OLAP EngineThe Data Mining Engine
The Presentation LayerUnderlying Java Servlet Technology
Desktop ProgramsPentaho Enterprise Edition and Community EditionCreating Action Sequences with Pentaho DesignStudioPentaho Design Studio (Eclipse) PrimerThe Action Sequence EditorAnatomy of an Action SequenceInputsOutputsActions
Summary
6666676768696969707070727272727273747476
7778808383858589
Part OD Dimensional Modeling aimd Data WarehouseDesign 91
Chapter 5 Example Business Case: World Class A/iovies 93World Class Movies: The Basics 94The WCM Data 95
Obtaining and Generating Data 97WCM Database: The Big Picture 97
Contents xix
DVD CatalogCustomersEmployeesPurchase OrdersCustomer Orders and PromotionsInventory Management
Managing the Business: The Purpose of BusinessIntelligenceTypical Business Intelligence Questions for WCM
, Data Is KeySummary
Chapter 6 Data Warehouse PrimerWhy Do You Need a Data Warehouse?The Big Debate: Inmon Versus KimballData Warehouse Architecture
The Staging AreaThe Central Data WarehouseData MartsOLAP CubesStorage Formats and MDX
Data Warehouse ChallengesData QualityData Vault and Data QualityUsing Reference and Master Data
Data Volume and PerformanceOpen Source Database Window SupportChanged Data CaptureSource Data-Based CDCTrigger-Based CDCSnapshot-Based CDCLog-Based CDCWhich CDC Alternative Should You Choose?
Changing User RequirementsData Warehouse Trends
Virtual Data WarehousingReal-Time Data WarehousingAnalytical Databases
99101101101102104
105108109110
1111121141161.18119121121122123124125127128132133133134135136137137139139140142
xx Contents
:er7
Data Warehouse AppliancesOn Demand Data Warehousing
Summary
Modeling the Business Using Star SchemasWhat Is a Star Schema?
Dimension Tables and Fact TablesFact Table Types
Querying Star SchemasJoin Types
. Applying Restrictions in a QueryCombining Multiple RestrictionsRestricting Aggregate ResultsOrdering Data
The Bus ArchitectureDesign Principles
Using Surrogate KeysNaming and Type ConventionsGranularity and AggregationAudit ColumnsModeling Date and TimeTime Dimension GranularityLocal Versus UTC TimeSmart Date KeysHandling Relative Time
Unknown Dimension Keys ^Handling Dimension Changes
SCD Type 1: OverwriteSCD Type 2: Add RowSCD Type 3: Add ColumnSCD Type 4: Mini-DimensionsSCD Type 5: Separate History TableSCD Type 6: Hybrid Strategies
Advanced Dimensional Model ConceptsMonster" DimensionsJunk, Heterogeneous, and DegenerateDimensions
Role-Playing Dimensions
143144144
147147148149150153156157157158158160160162163164165165165166166169169171171174174176178179179
180181
Multi-Valued Dimensions and Bridge TablesBuilding HierarchiesSnowflakes and Clustering DimensionsOutriggersConsolidating Multi-Grain Tables
Summary
Chapter/8 The Data Mart Design ProcessRequirements Analysis
Getting the Right Users InvolvedCollecting Requirements
Data AnalysisData ProfilingUsing eobjects.org DataCleanerAdding Profile Tasks
'Adding Database ConnectionsDoing an Initial ProfileWorking with Regular ExpressionsProfiling and Exploring ResultsValidating and Comparing DataUsing a Dictionary for Column DependencyChecks
Alternative SolutionsDeveloping the ModelData Modeling with Power*ArchitectBuilding the WCM Data Marts
Generating the DatabaseGenerating Static DimensionsSpecial Date Fields and Calculations
Source to Target MappingSummary
Contents xxi
182184186188188189
191191192193195197198200201202202204205
205205206208210212213216218220
Part III ETL and Data Integration 221
Chapter 9 Pentaho Data Integration Primer 223Data Integration Overview 223
Data Integration Activities 224Extraction 226
xxii Contents
Change Data, CaptureData StagingData ValidationData CleansingDecoding and RenamingKey ManagementAggregationDimension and Bridge Table MaintenanceLoading Fact Tables
Pentaho Data Integration Concepts andComponentsTools and UtilitiesThe Data Integration EngineRepository
r Jobs and TransformationsPlug-in Architecture
Getting Started with SpoonLaunching the Spoon ApplicationA Simple "Hello, World!" Example
Building the TransformationRunning the TransformationThe Execution Results PaneThe Output
Checking Consistency and DependenciesLogical ConsistencyResource DependenciesVerifying ther Transformation \
Working with Database ConnectionsJDBC and ODBC ConnectivityCreating a Database ConnectionTesting Database ConnectionsHow Database Connections Are UsedA Database-Enabled "Hello, World!" ExampleDatabase Connection ConfigurationManagement
Generic Database ConnectionsSummary
226226227228228229229229230
230230232232232235236236237237
* 244245246247247247247248248249252252253
256257258
Contents xxiii
Chapter 10 Designing Pentaho Data Integration Solutions 261Generating Dimension Table Data 262
Using Stored Procedures 262Loading a Simple Date Dimension 263CREATE TABLE dim_date: Using the ExecuteSQL Script Step 265
Missing Date and Generate Rows with InitialDate: The Generate Rows Step • 267
Days Sequence: The Add Sequence Step 268Calculate and Format Dates: The Calculator Step 269The Value Mapper Step ' 273Load dim_date: The Table Output Step 275
More Advanced Date Dimension Features 276ISO Week and Year ' 276Current and La'st Year Indicators 276Internationalization and Locale Support 277
Loading a Simple Time Dimension A ' 277Combine: The Join Rows (Cartesian product)Step 279
Calculate Time: Again, the Calculator Step • 281Loading the Demography Dimension 281Understanding the stage_demography anddim_demography Tables 283
Generating Age and Income Groups 284Multiple Incoming and Outgoing Streams 285
Loading Data from Source Systems ^ 286Staging Lookup Values 286The stage_lookup_data Job 287The START Job Entry 288Transformation Job Entries 288Mail Success and Mail Failure 289The extract_lookup_type andextract_lookup_value Transformations 292
The stage_lookup_data^Transformation 293Check If Staging Table Exists: The Table ExistsStep 294
The Filter rows Step 294Create Staging Table: Executing Dynamic SQL 295The Dummy Step 296
xxiv Contents
The Stream Lookup Step ' " 297Sort on Lookup Type: The Sort Rows Step 299Store to Staging Table: Using a Table OutputStep to Load Multiple Tables 300
The Promotion Dimension 300Promotion Mappings 301
- Promotion Data Changes 301Synchronization Frequency 302The load_dim_promotion Job 302The extract_promotion Transformation 303Determining Promotion Data Changes 304Saving the Extract and Passing on the File Name "306Picking Up the File and Loading the Extract 306
Summary r 308
Chapter 11 Deploying Pentaho Data integration Solutions 309Configuration Management 310
Using Variables 310Variables in Configuration Properties 311User-Defined Variables 312Built-in Variables 314Variables Example: Dynamic Database
Connections . 314More About the Set Variables Step 318Set Variables Step Gotchas 319
Using' JNDI Connections 319What Is JNDI? - - 319Creating a JNDI Connection 320JNDI Connections and Deployment 321
Working with the PDI Repository 322Creating a PDI Repository 322Connecting to the Repository 323
. Automatically Connecting to a DefaultRepository <. 324
The Repository Explorer 325Managing Repository User Accounts 327How PDI Keeps Track of Repositories 328Upgrading an Existing Repository 329
Running in the Deployment Environment 330
Contents xxv
Running from the Command Line 330Command-Line Parameters 330Running Jobs with Kitchen . 332Running Transformations with Pan . 332Using Custom Command-line Parameters 333Using Obfuscated Database Passwords 334
Running Inside the Pentaho BI Server 334Transformations in Action Sequences 334Jobs in Action Sequences 335The Pentaho BI Server and the PDI Repository 336
Remote Execution with Carte , 337Why Remote Execution? 338Running Carte 339Creating Slave Servers - 340Remotely Executing a Transformation or Job 341Clustering 341
Summary 343
Part IV Business Intelligence Applications 345
Chapter 12 The Metadata Layer 347Metadata Overview 347
What Is Metadata? 347The Advantages of the Metadata Layer 348Using Metadata to Make a More User-Friendly
Interface "" 348Adding Flexibility and Schema Independence 348Refining Access Privileges 349Handling Localization 349Enforcing Consistent Formatting and Behavior 350
Scope and Usage of the Metadata Layer 350Pentaho Metadata Features 352
Database and Query Abstraction 352Report Definition: A Business User's Point ofView 352
Report Implementation: A SQL Developer'sPoint of View , 353
Mechanics of Abstraction: The Metadata Layer 355
XXVD Contents
Properties, Concepts, and Inheritance in theMetadata LayerPropertiesConceptsInheritanceLocalization of Properties
Creation and*Maintenance of MetadataThe Pentaho Metadata EditorThe Metadata RepositoryMetadata DomainsThe Sublayers of the Metadata LayerThe Physical LayerThe Logical LayerThe Delivery Layer
Deploying and Using MetadataExporting and Importing XMI filesPublishing the Metadata to the ServerRefreshing the Metadata
Summary
Chapter 13 Using The Pentaho Reporting ToolsReporting ArchitectureWeb-Based ReportingPractical Uses of WAQRPentaho Report Designer
The PRD ScreenReport Structure """Report ElementsCreating Data SetsCreating SQL Queries Using JDBCCreating Metadata QueriesExample Data Set
Adding and Using Parameters .Layout and FormattingAlternate Row Colors: Row Bandingf
Grouping and Summarizing Datai , Adding and Modifying Groups
Using Functions •Using Formulas
355355356356357
- 357357358359359359362365366366367367368
371371373375376377378380381382385386386389390391391393395
Contents xxvii
Adding Charts and Graphs' 397Adding a Bar Chart . 400Pie Charts ' 400Working with Images - 401
Working with Subreports 404Passing Parameter Values to Subreports 405
Publishing arid Exporting Reports 406Refreshing the Metadata - 407Exporting Reports 408
Summary 408
Chapter 14 Scheduling, Subscription, and Bursting 411
Scheduling 411Scheduler Concepts 412Public and Private Schedules . 412Content Repository 412
Creating and Maintaining Schedules with thePentaho Administration Console 413Creating a New Schedule 414Running Schedules 416Suspending and Resuming Schedules 416Deleting Schedules 417
Programming the Scheduler with ActionSequences . 417Add Job . 418Suspend Job, Resume Job, and Delete Job 420Other Scheduler Process Actions 420
Scheduler Alternatives 420UNIX-Based Systems: Cron 421Windows: The at Utility and the Task Scheduler 421
Background Execution and Subscription 422How Background Execution Works 422How Subscription Works 423Allowing Users to Subscribe 423Granting Execute and Schedule Privileges 424The Actual Subscription 425
The User's Workspace 426Viewing the Contents of the Workspace 426
XXVIDD Contents
The Waiting, Complete, and My SchedulesPanes 427
The Public Schedules Pane 427The Server Administrator's Workspace 428Cleaning Out the Workspace 429
Bursting . ' '430Implementation of Bursting in Pentaho 430Bursting Example: Rental Reminder E-mails 430Step 1: Finding Customers with DVDs That Are
Due This Week 431Step 2: Looping Through the Customers 432Step 3: Getting DVDs That Are Due to Be
Returned . '434Step 4: Running the Reminder Report 434Step 5: Sending the Report via E-mail 436
Other Bursting Implementations 438Summary 439
Chapter 15 OLAP Solutions Using Pentaho Analysis Services 441Overview of Pentaho Analysis Services 442
Architecture • 442Schema , 444Schema Design Tools ' 444Aggregate Tables • 445
MDX Primer 445Cubes, Dimensions, and Measures 446
The Cube Concept "" 446Star Schema Analogy 447Cube Visualization " 447
Hierarchies, Levels, and Members 448Hierarchies 448Levels and Members 449The All Level, All Member, and Default Member 450Member Sets 451Multiple Hierarchies 451
Cube Family Relationshipsr . 451Relative Time Relationships 452
MDX Query Syntax 453Basic MDX Query 453
Contents xxix
Axes: ON ROWS and ON COLUMNS 453Looking at a Part of the Data 454Dimension on Only One Axis 455More MDX Examples: a Simple Cube . 455The FILTER Function 455The ORDER Function 456Using TOPCOUNT and BOTTOMCOUNT 457Combining Dimensions: The CROSSJOIN 'Function 457
Using NON EMPTY 457Working with Sets and the WITH Clause 458Using Calculated Members 459
Creating Mondrian Schemas 460Getting Started with Pentaho Schema Workbench 460Downloading Mondrian 460Installing Pentaho Schema Workbench 461Starting Pentaho Schema Workbench 461Establishing a Connection 462JDBC Explorer - 463
Using the Schema Editor 463Creating a New Schema 463Saving the Schema on Disk 464Editing Object Attributes 465Changing Edit Mode 465
Creating and Editing a Basic Schema 466Basic Schema Editing Tasks 466Creating a Cube 466Choosing a Fact Table 468Adding Measures 469Adding Dimensions 470Adding and Editing Hierarchies and ChoosingDimension Tables 471
Adding Hierarchy Levels " 474Associating Cubes with Shared Dimensions 476Adding the DVD and Customer Dimensions 478XML Listing 480
Testing and Deployment 481Using the MDX Query Tool 481Publishing the Cube 482
xxx Contents
Schema Design Topics We Didn't Cover 483Visualizing Mondrian Cubes with JPivot 484
Getting Started with the Analysis View 484Using the JPivot Toolbar . 485
Drilling . 486Drilling Flavors - 486Drill Member and Drill Position . 487Drill Replace 488Drill Through 488
The OLAP Navigator 488Controlling Placement of Dimensions on Axes 489Slicing with the OLAP Navigator 490Specifying Member Sets with the OLAPNavigator 492
Displaying Multiple Measures - ' „ 493Miscellaneous Features . 493MDX Query Pane 493PDF and Excel Export 494Chart 494
Enhancing Performance Using the PentahoAggregate Designer . 496Aggregation Benefits 496Extending Mondrian with Aggregate Tables 497Pentaho Aggregate Designer 500Alternative Solutions 502
Summary 502
Chapter 16 Data Mining with Weka 503Data Mining Primer 504
Data Mining Process 504Data Mining Toolset 506Classification 506Clustering 507Association 507Numeric Prediction (Regression) 508Data Mining Algorithms 508
Training and Testing . 509Stratified Cross-Validation 509
The Weka Workbench 510
Contents xxxi
Weka Input Formats 511Setting up Weka Database Connections 512Starting Weka- 514The Weka Explorer 516The Weka Experimenter . 517Weka KnowledgeFlow - 5 1 8
Using Weka with Pentaho 519Adding PDI Weka Plugins 520Getting Started with Weka and PDI 520
Data Acquisition and Preparation , 521Creating and Saving the Model -523Using the Weka Scoring Plugin 525
Further Reading 527Summary - ~ • 527
Chapter 17 Building Dashboards 529The Community Dashboard Framework 529
CDF, the Community, and the PentahoCorporation 529CDF Project History and Who's Who 530Issue Management, Documentation, and
Support 531Skills and Technologies for CDF Dashboards 531
CDF Concepts and Architecture - . 532The CDF Plugin 534
The CDF Home Directory ^ 534The plugin.xml File " 535CDF JavaScript and CSS Resources 536
The .xcdf File ' 537Templates 538
Document Template (a.k.a. Outer Template) 538Content Template . 541
Example: Customers and Websites Dashboard 542Setup 544Creating the .xcdf File 544Creating the Dashboard HTML File 545
, Boilerplate Code: Getting the Solution and Path 545Boilerplate Code: Dashboard Parameters 546Boilerplate Code: Dashboard Components 546
Xxxii Contents
Testing ' ' '547Customers per Website Pie Chart 548Customers per Website: Pie Chart Action
Sequence . 548Customers per Website: XactionComponent 551
Dynamically Changing the Dashboard Title 553Adding the website_name DashboardParameter 553
Reacting to Mouse Clicks on the Pie Chart 554Adding a TextComponent , 555
Showing Customer Locations 557CDF MapComponent Data Format 557Adding a Geography Dimension 558Location Data Action Sequence ' • 559
•- Putting It on the Map ' 561Using Different Markers Depending on Data 562
Styling and Customization 565Styling the Dashboard 566Creating a Custom Document Template 568
Summary ""-- 569
Index 571