Weka: A Useful Tool for Air Quality Forecasting William F. Ryan Department of Meteorology The Pennsylvania State University [email protected] 2007 National Air Quality Conference, Orlando

Page 1: WEKA: A Useful Tool for Air Quality Forecasting

Weka: A Useful Tool for Air Quality Forecasting

William F. Ryan

Department of Meteorology

The Pennsylvania State University

[email protected]

2007 National Air Quality Conference, Orlando

Page 2: WEKA: A Useful Tool for Air Quality Forecasting

Weka

The weka, or woodhen, is a bird native to New Zealand. Weka is also the name of a suite of machine learning software tools, written in Java and developed at the University of Waikato in New Zealand.

http://www.cs.waikato.ac.nz/ml/weka

Page 3: WEKA: A Useful Tool for Air Quality Forecasting

Machine Learning

• Machine learning is a subfield of artificial intelligence (AI) concerned with the development of algorithms and techniques that allow computers to "learn".

• The machine learning algorithms in Weka include, among others, linear regression, classification trees, clustering and artificial neural networks (ANN).

Page 4: WEKA: A Useful Tool for Air Quality Forecasting

Weka Can Be A Useful Tool

• Weka has the potential to be a useful tool to support local air quality forecasting efforts, particularly those operating on a limited budget.

– Weka is open source (free) software, although the purchase of the associated textbook is strongly recommended.

– Weka is easily installed on standard PCs but can also run on Linux and other platforms.

– Only minimal modifications are necessary to prepare data files for use in Weka.

– The user interface is simple and intuitive.

Page 5: WEKA: A Useful Tool for Air Quality Forecasting

Weka and PM2.5 Forecasting

• Of particular interest to air quality forecasters is the wide range of algorithms included in Weka.

• These algorithms may be useful to address shortcomings in statistical forecast guidance for fine particulate matter (PM2.5).

• Simple linear regression methods provide reasonable skill for O3 forecasting, due to the very strong and nearly linear ozone-temperature relationship, but linear regression methods have shown limited skill in forecasting PM2.5.

Page 6: WEKA: A Useful Tool for Air Quality Forecasting

PM2.5 Forecasting

O3 (left panel) is well-behaved statistically. Distribution is near normal with a strong association with maximum temperature. As a result, linear techniques are useful.

PM2.5 (right panel) is not well-behaved. Distribution is skewed, with no strong association with any particular weather variable.

Tools included in Weka, including ANN and classification and regression trees (CART), are capable of addressing the non-linear problems posed by PM2.5.

Page 7: WEKA: A Useful Tool for Air Quality Forecasting

Weka: Information

http://www.cs.waikato.ac.nz/ml/weka/

Page 8: WEKA: A Useful Tool for Air Quality Forecasting

Input File Format

Weka uses its own file format, called ARFF (*.arff).

All you need to do, though, is provide a *.csv file with variable names in the first line, and Weka will convert it.

Page 9: WEKA: A Useful Tool for Air Quality Forecasting

ARFF Format

The ARFF format is simple anyway:

ASCII file

List of variables and types

Then data follows, comma separated

Missing data marked as “?”
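The file layout described above can be sketched in a few lines of Python. This is only an illustration of the structure (a header listing each variable and its type, then comma-separated data with "?" for missing values), not Weka's own converter. The function name `csv_to_arff`, the relation name, and the sample variables are hypothetical, and the sketch assumes every column is numeric.

```python
import csv
import io

def csv_to_arff(csv_text, relation="airquality"):
    """Convert CSV text (variable names in the first row) to ARFF text.

    Assumption for this sketch: every column is numeric and an empty
    cell means the value is missing.
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, data = rows[0], rows[1:]

    lines = [f"@relation {relation}", ""]
    # Header section: one line per variable with its type.
    for name in header:
        lines.append(f"@attribute {name.strip()} numeric")

    # Data section: comma-separated values, "?" marks missing data.
    lines += ["", "@data"]
    for row in data:
        cells = [cell.strip() if cell.strip() else "?" for cell in row]
        lines.append(",".join(cells))
    return "\n".join(lines) + "\n"

# Made-up sample: max temperature, wind speed, PM2.5 (one wind value missing).
sample = "tmax,wind,pm25\n31.2,2.5,28.4\n29.0,,19.1\n"
arff_text = csv_to_arff(sample)
print(arff_text)
```

In practice Weka performs this conversion itself when a *.csv file is loaded, so a script like this is only needed when the types must be controlled by hand.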

Page 10: WEKA: A Useful Tool for Air Quality Forecasting

Data Editing

Data can be easily edited within Weka itself

Page 11: WEKA: A Useful Tool for Air Quality Forecasting

Analyzing Data

Variables can be easily scanned with basic statistics and histograms provided by Weka

Page 12: WEKA: A Useful Tool for Air Quality Forecasting

Quick Analysis Tools

Page 13: WEKA: A Useful Tool for Air Quality Forecasting

Sampling and Test Data Set Options

Page 14: WEKA: A Useful Tool for Air Quality Forecasting

Functions Available

WEKA includes a number of different techniques that can be useful for forecast development.

These include:

Linear and logistic regression

Perceptron models (neural networks)

Page 15: WEKA: A Useful Tool for Air Quality Forecasting

Linear Regression

Unfortunately, the “workhorse” linear regression module in Weka is limited in usefulness:

- No automatic stepwise function

- Poor diagnostics

Compare: SYSTAT, Minitab

Page 16: WEKA: A Useful Tool for Air Quality Forecasting

Classification and Regression Trees (CART)

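A variety of classification algorithms are available.

The standard algorithm is J48, which is a souped-up implementation of C4.5, the last free version of that decision tree learner. (C4.5 is a tree algorithm in the same family as CART, not a version of CART itself.)

The commercial version is currently C5.0.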

Page 17: WEKA: A Useful Tool for Air Quality Forecasting

CART Options

A number of options are available to fine-tune the CART analysis:

- Minimum # of cases per node

- Types of pruning, e.g., sub-tree raising

- Confidence values for splitting nodes
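To make the "minimum # of cases per node" option concrete, here is a minimal Python sketch of how a tree might choose one split on a single numeric predictor while refusing any split that leaves too few cases on either side. It uses plain information gain for simplicity, which is an assumption of this sketch and not necessarily the criterion Weka applies. The names `best_split` and `min_cases` and the wind speed data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of category labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels, min_cases=2):
    """Return (threshold, information_gain) for the best binary split,
    or (None, 0.0) if no split leaves at least min_cases on each side."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy(labels)
    best = (None, 0.0)
    # i is the number of cases sent to the left branch.
    for i in range(min_cases, n - min_cases + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between two equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if gain > best[1]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (threshold, gain)
    return best

# Made-up data: wind speed (m/s) and a two-category PM2.5 label.
wind = [1.0, 1.5, 2.0, 2.5, 4.0, 4.5, 5.0, 6.0]
pm25 = ["high", "high", "high", "high", "low", "low", "low", "low"]

threshold, gain = best_split(wind, pm25, min_cases=2)
print(threshold, gain)

# With a large minimum, no split is allowed and the node stays a leaf.
print(best_split(wind, pm25, min_cases=5))
```

Raising the minimum yields smaller, more conservative trees, which is usually the first setting worth trying when a tree overfits a short PM2.5 record.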

Page 18: WEKA: A Useful Tool for Air Quality Forecasting

CART Diagnostics

CART is notorious for using CPU resources, but the WEKA version runs efficiently on my standard PC.

Diagnostics are better for CART than for linear regression.

Example on left is of a 4-category PM2.5 CART forecast.

Page 19: WEKA: A Useful Tool for Air Quality Forecasting

CART Visualization

Page 20: WEKA: A Useful Tool for Air Quality Forecasting

Artificial Neural Networks (ANN)

“Linear Regression by a mob”

Produces a forecast by taking the weighted sum of predictors and then layering the process.

Page 21: WEKA: A Useful Tool for Air Quality Forecasting

Artificial Neural Networks - Summary

Known samples (historical data) are used to “train” the network.

Input data (xi) are assigned weights (wi) and combined in the “hidden” layer, like a set of linear regressions. These sets are then combined in additional layers, like regressions of regressions.

The weighted sum of the inputs is transformed (“squashed”) to the range of the training data and the error is measured.

A supervised training algorithm uses output error to adjust network weights to minimize errors.
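The forward pass summarized above can be sketched in Python for a network with one hidden layer. This is an illustration under stated assumptions, not Weka's implementation: the sigmoid is assumed as the squashing function, the predictors and target are assumed to be pre-scaled to the 0 to 1 range, and all weights and values are made up.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """One forward pass through a network with a single hidden layer."""
    # Each hidden node is a weighted sum of the inputs, then squashed:
    # "like a set of linear regressions".
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    # The output node combines the hidden nodes the same way:
    # "like regressions of regressions".
    output = sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)
    return hidden, output

# Hypothetical scaled predictors and weights for two hidden nodes.
x = [0.8, 0.3]
hidden_weights = [[0.5, -0.4], [0.3, 0.9]]
hidden_biases = [0.0, -0.1]
output_weights = [1.2, -0.7]
output_bias = 0.05

hidden, output = forward(x, hidden_weights, hidden_biases, output_weights, output_bias)

# Error against a known (historical) scaled target; training would
# adjust the weights to shrink this value.
target = 0.6
error = 0.5 * (target - output) ** 2
print(output, error)
```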

Page 22: WEKA: A Useful Tool for Air Quality Forecasting

Artificial Neural Networks – Pros/Cons

• Pro: ANNs are a powerful technique utilized across scientific disciplines.

• Pro: Theoretically well suited to non-linear processes like air quality.

• Con: Not transparent to users. Hard to integrate into forecast thinking.

• Con: Technically difficult to understand, which raises the risk of misuse.

Page 23: WEKA: A Useful Tool for Air Quality Forecasting

Example: Neural Network Structure

www.doc.ic.ac.uk/~sgc/teaching/v231/

Page 24: WEKA: A Useful Tool for Air Quality Forecasting

WEKA Neural Networks

WEKA provides user control of training parameters:

# of iterations or epochs (“training time”)

Increment of weight adjustments in backpropagation (“learning rate”)

Controls on varying changes to increments (“momentum”)
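How the three training parameters interact can be sketched with a single linear node fitted by gradient descent. This is a simplified stand-in for backpropagation in a full network, not Weka's code. The function name `train`, the parameter values, and the sample data are chosen for illustration only.

```python
def train(samples, epochs=500, learning_rate=0.3, momentum=0.2):
    """Fit y ~ w*x + b by gradient descent with momentum.

    epochs        -- number of passes over the training data
    learning_rate -- size of each weight adjustment
    momentum      -- fraction of the previous adjustment carried forward
    """
    w, b = 0.0, 0.0
    dw_prev, db_prev = 0.0, 0.0
    for _ in range(epochs):
        for x, y in samples:
            error = (w * x + b) - y
            # New adjustment = step against the error gradient
            # plus a share of the previous adjustment.
            dw = -learning_rate * error * x + momentum * dw_prev
            db = -learning_rate * error + momentum * db_prev
            w += dw
            b += db
            dw_prev, db_prev = dw, db
    return w, b

# Made-up, noise-free training data generated from y = 2x + 1.
samples = [(0.0, 1.0), (0.5, 2.0), (1.0, 3.0)]
w, b = train(samples)
print(w, b)
```

A larger learning rate speeds up training but can overshoot, while momentum smooths successive adjustments. Both are worth varying when a network fails to settle on a PM2.5 data set.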

Page 25: WEKA: A Useful Tool for Air Quality Forecasting

Conclusions

• Weka is a low-cost forecasting tool that has the potential to be useful for air quality forecasting, particularly in situations where non-linear effects dominate.

• Some Weka modules are not fully developed for forecast algorithm development.

• Patience, use of textbook and Weka listserv are required to get the most out of the program.

Page 26: WEKA: A Useful Tool for Air Quality Forecasting

URLs of Interest

• Weka: http://www.cs.waikato.ac.nz/ml/weka

• Mailing List: https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist

• Mailing List Archives: https://list.scms.waikato.ac.nz/mailman/htdig/wekalist/

• Informal FAQ: http://www.public.asu.edu/~sksinghi/weka-faq.html

Page 27: WEKA: A Useful Tool for Air Quality Forecasting

Acknowledgements

• The Delaware Valley Regional Planning Commission (DVRPC) – Mike Boyer and Sean Greene – and the member states (PA, DE and NJ) for supporting air quality forecast development.

• Dr. George Young of Penn State for his advice, patience and teaching skill.