Machine Learning for PI System
and SQL Server Analysis
Services
OSIsoft vCampus White Paper
How to Contact Us
Email: [email protected]
Web: http://vCampus.osisoft.com > Contact Us
OSIsoft, Inc.
777 Davis St., Suite 250
San Leandro, CA 94577 USA
Houston, TX
Johnson City, TN
Mayfield Heights, OH
Phoenix, AZ
Savannah, GA
Seattle, WA
Yardley, PA
Worldwide Offices
OSIsoft Australia
Perth, Australia
Auckland, New Zealand
OSIsoft Europe
Altenstadt, Germany
OSI Software Asia Pte Ltd.
Singapore
OSIsoft Canada ULC
Montreal, Quebec
Calgary, Alberta
OSIsoft, Inc. Representative Office
Shanghai, People's Republic of China
OSIsoft Japan KK
Tokyo, Japan
OSIsoft Mexico S. De R.L. De C.V.
Mexico City, Mexico
Sales Outlets and Distributors
Brazil
Middle East/North Africa
Republic of South Africa
Russia/Central Asia
South America/Caribbean
Southeast Asia
South Korea
Taiwan
WWW.OSISOFT.COM
OSIsoft, Inc. is the owner of the following trademarks and registered trademarks: PI System, PI ProcessBook, Sequencia, Sigmafine, gRecipe, sRecipe, and RLINK. All terms mentioned in this book that are known to be trademarks or service marks have been appropriately capitalized. Any trademark that appears in this book that is not owned by OSIsoft, Inc. is the property of its owner, and use herein in no way indicates an endorsement, recommendation, or warranty of such party's products or any affiliation with such party of any kind.
RESTRICTED RIGHTS LEGEND Use, duplication, or disclosure by the Government is subject to restrictions as set forth in subparagraph (c)(1)(ii) of the Rights in Technical Data and Computer Software clause at DFARS 252.227-7013
Unpublished rights reserved under the copyright laws of the United States.
© 1998-2011 OSIsoft, LLC
TABLE OF CONTENTS
Overview
About this Document
What You Need to Start
What is Machine Learning?
Introduction
Applications
Connection to the PI System
Machine Learning
Available Tools
Machine Learning with PI Server and SQL Server Analysis Services
Architecture
What You Need
Creating an Estimation Model
Generating Predictions
Discussion
Conclusion
Revision History
OVERVIEW
ABOUT THIS DOCUMENT
This document is exclusive to the OSIsoft Virtual Campus (vCampus) and is available on its
online Library, located at http://vCampus.osisoft.com/Library/library.aspx.
Any question or comment related to this document should be posted in the appropriate
vCampus discussion forum (http://vCampus.osisoft.com/forums) or sent to the vCampus
Team at [email protected].
ABOUT THIS WHITE PAPER
This white paper introduces the concept of machine learning to the PI users community
and provides a worked example of its use. The purpose is to derive more value from the
data collected and archived by the PI infrastructure; in other words, to extract more
information from the recorded data.
WHAT YOU NEED TO START
You must have the following software installed to be able to follow the examples in this
white paper:
PI Server
PI DataLink or the piconfig utility
Microsoft SQL Server Analysis Services
Microsoft Excel Data Mining add-in
WHAT IS MACHINE LEARNING?
"We are drowning in information and starving for knowledge."
Rutherford D. Rogers
INTRODUCTION
Machine learning is a broad term referring to a slew of techniques used for learning
information and knowledge from past observations. Usually these methods are based on
statistical procedures. With classical methods, such as linear regression, one fits a
function of a specific form to the observed variables and optimizes some measure of
accuracy to find the corresponding unknown parameters. Other statistical methods were
good at handling modest volumes of data. However, in today's world the number of
variables and/or observations can be so big that no classical method of function fitting or
statistical analysis would yield an acceptable outcome.
APPLICATIONS
The applications of machine learning abound. It is applicable whenever we are interested in
deriving a model that represents large amounts of observations, for future use and prediction
or for filling gaps in the measurements.
In the field of power generation being able to predict the going price of the market for
some time in the future, say one or a few days ahead, plays a crucial role. The predicted
price would depend on the time of year, day of the week, hour of the day, temperature
forecast, other generators' availability, and potentially other factors. This price would
determine the profit-maximizing generation point for a generator. On the other hand, there
are sizeable volumes of observations, archived over many years, available for mining. These
observations can be used to build a relationship between the predicting parameters and the
output (price). Given the size of observations and number of variables, machine learning is a
very good fit to approach the problem.
As another example, assume that we have a device/element and a number of PI
tags/attributes attached to it archiving multiple quantities/attributes. Historically, we know
when the device failed or needed maintenance. Now the goal is to learn from previous
incidents and predict in advance when the device is about to fail or needs maintenance.
Even though this will definitely need intuition and knowledge of the operation of the device,
being able to decipher the connection between many variables and the outcome can be
formidable. This is where a machine learning algorithm can tackle the previous cases of
failure, build a model and learn the relationship between variables, and use that for
prediction purposes. It means we can use machine learning to perform preemptive
maintenance.
Another application is filling the gaps between data measurements as accurately and
consistently as possible. Imagine that a variable is supposed to be measured and archived
regularly. If for some reason some measurements are missed, we can use other relevant
variables' measurements and fill the gaps based on historical learning while conforming to
the general behavior of the underlying process.
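As a trivial baseline for the learned gap filling described above, the sketch below fills interior gaps by linear interpolation between the nearest archived neighbors. The fill_gaps helper is hypothetical and not part of any PI System API; a trained model would also exploit other correlated variables, as the text explains.

```python
# Hypothetical helper, for illustration only: fill interior None gaps by
# linear interpolation between the nearest measured neighbors.
def fill_gaps(values):
    out = list(values)
    for i, v in enumerate(values):
        if v is None:
            lo = i - 1
            while values[lo] is None:   # nearest measurement before the gap
                lo -= 1
            hi = i + 1
            while values[hi] is None:   # nearest measurement after the gap
                hi += 1
            frac = (i - lo) / (hi - lo)
            out[i] = values[lo] + frac * (values[hi] - values[lo])
    return out

filled = fill_gaps([1.0, None, None, 4.0, 5.0])
```

A learned model improves on this by conforming to the historical behavior of the process rather than assuming a straight line between neighbors.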
Web analytics is another application where there are typically millions or billions of
observations. The goal would be to predict the chance of a specific action performed by the
website visitor given previous measurements and observations. Meteorologists can use
previous measurements to make more accurate forecasts by using machine learning.
Another popular application is in the field of genomics, where a single observation includes
millions or billions of variables and an outcome such as a defect or feature. The goal is to
calculate the probability of certain characteristics based on the observed genetic properties.
In all the cases above, if the number of variables or observations is limited, we can use more
traditional methods, such as linear or nonlinear regression based on a quadratic cost
function and least-squares methodology, to fit a model to our observations. However, when
the number of observations and/or predictors exceeds a certain point, such classic
methods do not converge in reasonable time or deliver acceptable results.
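For contrast with the machine learning methods discussed later, the classical least-squares fit mentioned above takes only a few lines for a single predictor. The sketch below fits a line y = a*x + b via the normal equations; the function name is ours, for illustration.

```python
# Classical least squares for one predictor: fit y = a*x + b by minimizing
# the quadratic cost sum((a*x + b - y)^2). The closed-form solution below
# comes from the normal equations.
def fit_line(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# Points that lie exactly on y = 2x + 1 are recovered exactly.
a, b = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

This works well with a handful of predictors and a fixed functional form; it is precisely this approach that stops scaling when the number of variables or observations explodes.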
CONNECTION TO THE PI SYSTEM
The PI System is the infrastructure to collect, archive, analyze, and serve time series data
across multiple industries. Thousands of organizations around the world utilize their
investment in the infrastructure and gain strategic actionable insight by collecting data using
the PI System.
As organizations expand over the years, the volume of collected data grows larger than
ever. In almost all cases this valuable collection of information hides much more insight
than appears on the surface. In other words, those who know how to mine the data can
improve their return on investment significantly.
Our goal in this white paper is to take a step in that direction. We would like to showcase
some data mining methods using third party analytical tools to perform such operations on
PI Data. In particular, we focus on machine learning.
Figure 1: A noisy version of the Sine wave (red) used for learning and to make predictions (green).
A Decision Tree was used to create the learning model.
MACHINE LEARNING
Machine learning refers to a set of analytical and computational tools used to make
inferences based on previous observations. In supervised learning, there is a measured
output variable that we seek to predict based on previous data, while in unsupervised
learning the aims of the analysis are patterns, segmentations, or other associations.
In this paper our attention is devoted to supervised learning. In fact, we would like to make
predictions for a single variable based on our current measurement of the predicting
variables or predictors. This prediction is based upon previous observations of the predictors
and the output variable. The bulk of a machine learning algorithm deals with the question
of how to teach the model the association between the predictors and the output variable
so that some (usually statistical) accuracy metric is optimized.
Our focus here will be on regression algorithms; i.e. the algorithms in which the output
variable is a real number. However, very similar methodology applies to classification
algorithms where the output variable is an unordered variable belonging to a set such as
{Yes, No} or {Faulty, Medium condition, Excellent condition}.
AVAILABLE TOOLS
Among the most popular tools providing machine learning solutions are open source
packages written in R, MATLAB Data Mining Toolbox, Microsoft SQL Server Analysis Services,
SAS, and Google Prediction API.
Each of the tools mentioned above implements one or more underlying machine learning
algorithms. Among the most popular algorithms are Decision Trees and Neural Networks.
Decision Trees come in several different flavors to improve the precision or other criteria
of the learning and prediction. Here are a number of important factors to be aware of when
choosing the right algorithm:
Time to learn: The time each algorithm takes to learn is a function of the number of
observations and predictors. This is one of the important factors the designers should
consider.
Time to predict: Once the learning procedure is over, it is time to use the model for
prediction purposes. Sometimes, this has to be done as quickly as possible for real-time
applications while in some other applications time sensitivity is not too high.
Interpretability: Intuition is a very important feature in prediction algorithms. Imagine an
algorithm where every prediction is backed by an easy-to-understand, intuitive, or even
visually presentable explanation. Compare that with a model that only produces a number
as the output. Even if the latter predicts better, decision makers usually prefer the former
over a black box that just spits out a number.
Accuracy and variance: These are two very important factors one has to consider. While
accuracy is always desired, sometimes you do not want to be as accurate as possible with
the given set of observations. The reason is that usually there is a trade-off between
accuracy and variance. Too much variance in the output variable with respect to the changes
in the predictors is usually an undesired phenomenon. Different machine learning methods
vary in this sense.
Ability to handle missing data: Often in practice some observations are incomplete. For
example, if there are 4 predictors and 1 million observations, chances are that not all 4
values are available in every single observation.
Sensitivity to irrelevant data: This is a very important phenomenon which can be
counterintuitive as well. Sometimes we are not sure about any causality or correlation
between a variable and the output variable. We just collect, say, five variables as predictors
and let the model learn the relationship between them and the output variable. If some of
the predictors are not correlated with the output, the model is prone to some difficulties.
One would imagine that, in the worst case, the model would simply gain nothing from an
irrelevant predictor; however, in some models irrelevant predictors actually degrade the
prediction.
MACHINE LEARNING WITH PI SERVER AND
SQL SERVER ANALYSIS SERVICES
In this section we will see how we can leverage the machine learning capabilities provided by
Microsoft SQL Server Analysis Services for data stored in PI Server. In particular we will focus
on the Excel client that exposes some of those features in the Office environment. Being an
easy-to-use and insightful data mining tool, it can be extremely useful for decision makers and
operators who use the PI infrastructure.
ARCHITECTURE
The PI System is the infrastructure for acquiring, archiving, analyzing, and delivering enterprise
time-series data. In this section we export the data stored in PI archives to Microsoft Excel using
PI DataLink. The machine learning algorithms run inside SQL Server Analysis Services. The
Data Mining add-in for Microsoft Excel serves as a client, building the underlying model inside SQL
Server, invoking the learning algorithm, and bringing the results back to the Excel sheet.
Figure 2 The data flow of the machine learning procedure
WHAT YOU NEED
In order to use the data mining features of MS SQL Server in MS Excel, we need SQL
Server Analysis Services (SSAS) to be accessible. On top of this we need the Data Mining
add-in for MS Excel, which serves as a client for SSAS. This add-in adds new features to the
MS Excel ribbon that we will use later. At the time of writing, the add-in for MS Office 2010
can be downloaded from Microsoft free of charge.
Note that once the data mining model is built and trained, you can send as many queries
to it as you need without further training the model. The Data Mining add-in to MS
Excel is an easy user interface to expose such features. Also note that SQL Server
Analysis Services also provides a graphical design interface for creating queries, and
also a query language called Data Mining Extensions (DMX) that is useful for creating
custom predictions and complex queries. To build DMX prediction queries, you can start
with the query builders that are available in both SQL Server Management Studio and
Business Intelligence Development Studio. A set of DMX query templates is also provided
in SQL Server Management Studio. For more on building queries, see the SQL Server
Analysis Services documentation on DMX.
On the PI System side we need a PI Server and PI DataLink available. The idea is to use PI
DataLink to bring in the observations and predictor values into MS Excel.
CREATING AN ESTIMATION MODEL
In this section we explain how you can use machine learning features of SSAS through an
example.
Note that throughout this example we make use of some data that we create ourselves
using an explicit formula. This is only because we need to verify the result of the
machine learning procedure by comparing the predictions against the values we know.
Otherwise, in most cases this procedure is used to model unstructured and unknown data.
In any case, no knowledge of the formula is used in any way during the data mining
procedure, and the generated values are treated as mere numerical observations.
To set up this example, assume there are two tags called tag1 and tag2. There is another tag,
the output of the model, which is being measured and is believed to be a function of tag1
and tag2. We call it tag3. However, the relationship between the predictors and the output
tag is unknown to the designer. The idea is to build and train a machine learning model with
existing measurements and use it for future predictions.
We generated the data for this example as follows: we created 10,000 values for each of the
predictors (tag1 and tag2), drawn from a uniform random distribution over the range [0, 10].
We constructed our output to be:
tag3 = SIN(tag1 + 2*tag2)
Obviously we don't use this knowledge in our machine learning example. We have therefore
gathered 10,000 observations of two predictors and one output variable. We export the
10,000 triplets to an Excel sheet, either using PI DataLink or through a csv file (exported
with the piconfig utility):
Figure 3 The two predictors and the output variable exported to Excel.
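The data generation described above can be sketched in a short script run outside the PI System. The seed and the file name observations.csv are arbitrary choices of ours; in practice the triplets would come from PI DataLink or a piconfig export instead.

```python
# Generate the 10,000 synthetic observations used in the example:
# tag1, tag2 ~ Uniform[0, 10] and tag3 = sin(tag1 + 2*tag2), then dump
# them to a CSV file ready to be opened in Excel.
import csv
import math
import random

random.seed(0)  # arbitrary seed, for reproducibility
rows = []
for _ in range(10000):
    tag1 = random.uniform(0.0, 10.0)
    tag2 = random.uniform(0.0, 10.0)
    tag3 = math.sin(tag1 + 2.0 * tag2)  # known formula, used only to verify later
    rows.append((tag1, tag2, tag3))

with open("observations.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["tag1", "tag2", "tag3"])
    writer.writerows(rows)
```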
The next step is to turn the range containing our data into a table. This is not required for
the operation we intend to perform, but it will make our workspace more manageable
down the road, and some other features of the Data Mining add-in only work on tables.
So we select the 10,000 rows and 3 columns, click Format as Table on the Home ribbon,
and choose the desired format. Name the columns to match our naming.
Figure 4 The three columns of our data formatted as a table.
The next step is the heart of building the machine learning model to predict new values of
the output variable tag3. The purpose of this step is to create a machine learning model
inside SQL Server using the data stored in our table. To start, we select the Estimate
button on the Data Mining ribbon.
Figure 5 Choosing the Estimate to create the mining structure.
This will open a wizard for us to define model parameters. As for the source data we can use
our table, an Excel range, or an external source. Choose the table we just created.
Figure 6 Choose the table as the source of data.
In the next step we check all three columns as the inputs to our model as we will be using all
the data in the three columns. In general we might be interested in investigating the
relationship between only a subset of the predictors and the output. This can be because
some observed variables do not have any relevant information or we are running some
relevance tests. Also, we are interested in predicting tag3; therefore, we select tag3 as the
Column to analyze. Click Next.
Figure 7 Including all three columns in our model and selecting the output.
After selecting the columns in the model, the wizard asks what percentage of the data
will be used for training and what percentage for validation of the model. A typical value
is 30% for validation and the rest for training. SSAS uses validation to improve and trim the
learning model. The Decision Tree algorithm in SSAS builds a tree based on the raw
data designated for training; in this case 70% of the data is randomly picked for training
purposes. It then applies the resulting tree to the remaining portion of the data to validate
the predictions against observations. If the prediction error exceeds a threshold, the
tree is improved or trimmed. For more information, read up on cross-validation of
learning models. Click Next to accept the 30%.
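The random 70/30 split the wizard configures can be illustrated in plain Python. The helper name and seed below are ours; SSAS performs the actual split internally.

```python
# Hold out a random fraction of the observations for validation, as the
# wizard's 30% setting does inside SSAS.
import random

def split_train_validate(rows, validate_fraction=0.3, seed=42):
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)      # random pick, as in SSAS
    cut = int(len(shuffled) * (1.0 - validate_fraction))
    return shuffled[:cut], shuffled[cut:]      # (training, validation)

train, validate = split_train_validate(list(range(10)))
```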
The last step in creating the model is to name the structure. Name the structure Table1
Estimate tag3. Click Finish.
You will see that the model starts being built inside SQL Server. SSAS takes all the
predictors (in this case two of them) as well as the output variable and starts building a
Decision Tree that minimizes the error criterion. This Decision Tree is saved in SSAS for
future prediction purposes. As you will see later, we will send queries to this structure to
predict the value of the output variable based on predictor values.
Figure 8 Naming the structure is the last step in creating the model.
One point of interest is the amount of time the whole analysis takes. This is a very
efficient algorithm which is capable of handling large amounts of data. In this case, with
10,000 observations and two predictors, on a laptop running Windows 7 with 8 GB of RAM,
building the tree takes no more than 10 seconds.
As a result of the analysis, the Data Mining add-in also shows a graphical representation of
the Decision Tree. Each node of the tree comprises a predictor variable and a collection of
numbers or breaking points. While traversing down the tree, the prediction can easily be
obtained by following the corresponding branch. This is one of the strong points of the
Decision Tree algorithm in machine learning: not only does it make estimation lookups fast
and efficient, it also yields a very intuitive interpretation of the prediction. This is very
important because when a black-box prediction algorithm, such as a Neural Network,
offers just the prediction value, decision makers would love to see some background on
why the prediction has been made.
Figure 9 The resulting Decision Tree
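To see concretely why tree lookups are fast and the results easy to explain, here is a small hand-made regression tree. It is illustrative only, not the tree SSAS produced for this data; each prediction is reached by, and reported with, a short chain of threshold comparisons.

```python
# A toy regression tree: internal nodes test one predictor against a
# breaking point; leaves hold a constant prediction.
class Node:
    def __init__(self, feature=None, threshold=None,
                 left=None, right=None, value=None):
        self.feature = feature      # index of the predictor to test
        self.threshold = threshold  # go left if x[feature] < threshold
        self.left = left
        self.right = right
        self.value = value          # leaf prediction

def predict(node, x):
    """Walk the tree; return the prediction and the comparisons made."""
    path = []
    while node.value is None:
        if x[node.feature] < node.threshold:
            path.append("x[%d] < %s" % (node.feature, node.threshold))
            node = node.left
        else:
            path.append("x[%d] >= %s" % (node.feature, node.threshold))
            node = node.right
    return node.value, path

# Hand-made tree over two predictors (thresholds are made up).
tree = Node(feature=0, threshold=5.0,
            left=Node(value=-0.8),
            right=Node(feature=1, threshold=2.0,
                       left=Node(value=0.1),
                       right=Node(value=0.9)))

value, explanation = predict(tree, [7.0, 3.0])
```

The returned path is exactly the kind of intuitive, presentable explanation the text contrasts with a black-box prediction.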
GENERATING PREDICTIONS
Having built our model based on the observations, we will move on to use the model to predict tag3
values. So we create a table with a few pairs of values for tag1 and tag2, i.e. observations on the
predictors. These are potential future values of the predictors that we will use to make predictions.
Figure 10 Adding new pairs of predictors
Now it's time to use the structure we built. In order to do that, click the Query button on the Data
Mining tab. This opens a wizard to build the query against the machine learning model we built
in the last section. First we select the structure and model to send the query to, so we choose
Table1 Estimate tag3. There is only one model under this structure, which in this case is
Estimate tag3_1. Select the model and click Next.
Figure 11 Selecting the model we built in previous section.
We are now asked to provide the source of data. We will have the option to refer to a table, an Excel
range, or an external data source. We use the table containing our new values as the source of data.
Figure 12 Pointing to our table for the predictor values.
Now the wizard wants to know how each column of the table maps to each predictor in the
previously stored model. In this case we have given each column the same name as in the model,
namely tag1 and tag2. We don't have any values for tag3 in this case because it is the output
variable to be predicted. Click Next.
Figure 13 Map the columns to the variables in the model.
The next step is to define an output by clicking Add Output. In this case tag3 is the output.
Note that in general this is not straightforward: in many applications, observations lack one or
more variable values, so it is not always obvious to SSAS which variable should be the output
of the model.
Figure 14 Defining the output
Note that we choose tag3 as the output and we are interested in predicting the values of tag3.
With other options we can look at the variance of the output variable or its support instead. In
other words, instead of the actual value, these other options allow us to examine some basic
statistical characteristics of the predicted value; more variance, for example, means the predicted
value is prone to bigger changes if the observations change. Click OK and then Next. On the
next dialog we choose to append to the input data to see the results on the spreadsheet along
with the predictors. Click Finish. Now you should be able to see the predictions as a third column
added to the table. We have calculated the actual values of the function SIN(tag1 + 2*tag2) as the
4th column of the table for comparison purposes. Comparing tag3 (prediction) and the actual
value shows that the model has done a very good job of learning the behavior of the function
being predicted.
Figure 15 Prediction values for tag3 along with the actual values. The prediction has done a good
job.
DISCUSSION
In short, we have been able to learn the behavior of the function SIN(tag1 + 2*tag2) from 10,000
observations and apply it for prediction purposes. You can also apply the model to the original
table that was used to train it, to see the actual and predicted values side by side.
The precision of the model depends on many parameters of the algorithm as well as the nature of
the problem. As a general rule, the more observations the better. Also, noisy observations
contaminate the data at a level determined by the noise power. Another important factor is
having irrelevant data in training. Sometimes we don't know for sure whether the output depends
on a certain variable (tag) or not, and irrelevant data can in fact hurt. Decision Trees are
among the more robust algorithms when it comes to irrelevant data. However, knowing the physics
of the problem at hand plays a key role in making the learning procedure more efficient. In
more complex cases we may need to run several different models with different predictor variables
to see which one models the problem better.
It is also common that not all observations contain all the predictor values. Decision Trees are
among the best algorithms at handling missing samples.
When the relationship between the predictors and the output variable is nonlinear, Decision
Trees typically do a better job; when predicting a linear relationship they tend to struggle. This is
a direct result of the underlying algorithm, which fits a piecewise-constant function to the
observations. Therefore, if we believe the relationship between the predictors and the output is
linear, we may get better results with other machine learning algorithms such as neural networks.
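The piecewise-constant behavior noted above can be seen with a simplified stand-in for a tree: equal-width bins that each predict the mean target of their bin. A real decision tree chooses its breaking points adaptively, but its leaf predictions are likewise constants, so a perfectly linear target comes back as a staircase.

```python
# Fit a piecewise-constant predictor: split [min(xs), max(xs)] into
# equal-width bins and predict the mean y of the bin x falls into.
def piecewise_constant_fit(xs, ys, n_bins):
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        i = min(int((x - lo) / width), n_bins - 1)
        sums[i] += y
        counts[i] += 1
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]

    def predict(x):
        i = min(max(int((x - lo) / width), 0), n_bins - 1)
        return means[i]
    return predict

xs = [i / 10.0 for i in range(101)]   # 0.0, 0.1, ..., 10.0
ys = list(xs)                         # a perfectly linear target y = x
model = piecewise_constant_fit(xs, ys, n_bins=5)
# Every x in the same bin gets the same prediction, so the straight line
# is approximated by a 5-step staircase.
```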
A very important point is that each problem in machine learning has its own characteristics. Even
though today's algorithms and their implementations are very powerful, some nuances and fine-
tuning are left to the specific problem at hand. Therefore, good, close knowledge of the
underlying problem and of the relationship between the predictors and the output variable is
extremely important to a successful machine learning procedure.
In addition, in almost all machine learning cases the data needs to be prepared before it is ready
for the algorithms. This includes separating useful, informative portions of the data from spam or
hollow samples, enriching the portions where we need more precision in the model, and other
problem-specific operations.
CONCLUSION
In this white paper we discussed an important aspect of data mining for the data stored in a PI
System. In particular, we focused on how we can learn the relationship between several PI tag
values and an output tag. The machine learning algorithms offered in SQL Server Analysis
Services, along with its Excel client, provide a very convenient way to perform machine learning
on PI data. We used PI DataLink to import the data onto an Excel sheet and ran the machine
learning algorithm on it.
We analyzed several aspects and factors involved in the choice of the right tool. Among those are
the time we can allocate to learning and prediction, robustness to missing data, and robustness to
irrelevant data. In this paper we focused on the Decision Tree algorithm offered by SSAS.
We showed how we can leverage SSAS to do machine learning by walking through an example. We
generated two random tags and created a third tag, as the output tag, which was defined as a sine
function of a linear combination of the two predictors. We used 10,000 observations and fed that
into SSAS in order to learn the behavior of the output. We then used the resulting structure to
predict the value of the output at arbitrary pairs of values of the predictors.
REVISION HISTORY
16-Aug-2011 Initial draft by Ahmad Fattahi