TRANSCRIPT
Introduction
This is a series of demonstrations intended to spur interest in Machine Learning (“ML”) applications on Portfolio123. The demos use Google Colaboratory, and all files are stored in and retrieved from Google Drive. You will need a Google account to use the demos as-is. Everything runs in the cloud, so you do not have to worry about installing anything on your desktop computer. If you are determined to run the code on your desktop, it is possible to do so by making the appropriate modifications to the code with regard to setting up the file paths and directories. I don’t support desktop applications, so you will have to do that on your own.
Python is used as the primary scripting language. I recommend w3schools as the place to start if you are not familiar with Python. The one area it does not cover is pandas, which is needed for working with DataFrames. I haven’t found a good pandas tutorial site, but a Google search usually does the trick when you encounter a problem.
The ML software for these demos is XGBoost. It is fast and robust. I have written some software that makes XGBoost extremely easy to use and will shave weeks off the development time for any ML application. My software is called tulip, a name that expresses my faith/optimism that the world will blossom this spring with Covid-19 a thing of the past, and Portfolio123 ML will also blossom of course.
Google Drive Setup
Before we begin, you will need to set up Google Drive, creating a project directory and a library directory. You can copy the 7 demo files now, or later when you start to use Google Colaboratory.
Lib Directory
Within the Lib directory you must deposit the Portfolio123 API code, my tulip software, and your Portfolio123 authorization key and ID, as shown below.
Tulip.py can be retrieved here:
https://drive.google.com/file/d/1P47sD2bvPFzD0BGj_1i0STWSog4w0sRn/view?usp=sharing
Below is a sample P123 API key file. You can modify this file with your Portfolio123 credentials. Then change the name to P123_auth.py and put it in your Lib directory. Make sure it is kept as a plain-text file, even though the file extension is .py.
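As a rough illustration, P123_auth.py might look like the sketch below. The variable names here are assumptions for illustration only; match whatever names the linked sample file actually uses, and substitute your own Portfolio123 credentials.

```python
# Hypothetical P123_auth.py -- variable names are illustrative, not the
# sample file's actual names. Keep this as plain text.
P123_API_ID = 'your_api_id_here'
P123_API_KEY = 'your_api_key_here'
```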
https://docs.google.com/document/d/1YLiXIYZg4cdoU9v-lmGk0Kntzfyd4-3m3VAG7onRBzc/edit?usp=sharing
Project Directory
The demos run with a project called ‘Test’. You have to make a ‘Projects’ directory where all projects are to be located (I’m ambitious), and a subdirectory with the same name as the project, i.e. ‘Test’.
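The layout above can be created programmatically if you prefer. The sketch below uses a local folder named 'MyDrive' as a stand-in for your Drive root (in Colab the mounted root is typically '/content/drive/MyDrive'); the directory names match those described in this document.

```python
from pathlib import Path

# Create the Lib directory and the Projects/Test project subdirectory
# under a stand-in Drive root. 'MyDrive' here is illustrative only.
root = Path('MyDrive')
(root / 'Lib').mkdir(parents=True, exist_ok=True)
(root / 'Projects' / 'Test').mkdir(parents=True, exist_ok=True)

print((root / 'Projects' / 'Test').is_dir())  # True
```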
For now, the Test directory is empty.
Demo Files
There are 7 demo files.
The links provided below will take you to Google Colaboratory. You should “Save a Copy” to your Google Drive.
Demo1:
https://colab.research.google.com/drive/1trazc-cNYk_BhrdiJ4e3AQMKQfRLtG8m?usp=sharing
Demo2:
https://colab.research.google.com/drive/1e4HR9v5INcUFxDSm5Wyl0WqHID5HnaAu?usp=sharing
Demo3:
https://colab.research.google.com/drive/1HVbfxrUdv_EUajfosrDeAd2CSkYRLUCO?usp=sharing
Demo4:
https://colab.research.google.com/drive/1GXMjA1wyN-Wxqz5j8rh2uF4Ra5Hn13Rj?usp=sharing
Demo5:
https://colab.research.google.com/drive/1cy2p-oh5Fr2pyTJ7BnEt233lhN22xWIF?usp=sharing
Demo6:
https://colab.research.google.com/drive/1TfiaFwwvy-WHmdnhXOs8Oo1V-8Ntaf4J?usp=sharing
Demo7:
https://colab.research.google.com/drive/1POm7BNHXB8lzhrCV0_DdDX_DN_RVKa7Y?usp=sharing
Portfolio123 Ranking System and Custom Universe
Demo1 retrieves training data (and run data) from the Ranking System and Custom Universe at Portfolio123. They are located here for reference:
https://www.portfolio123.com/app/ranking-system/374649
Custom Universe – approximately 200 digital transformation stocks
https://www.portfolio123.com/app/universe/summary/249294?st=1&mt=7
Demo1
Fetch data from a Portfolio123 Ranking System and generate the file used to train the model (Test_TRAIN.csv) and the file used to make future predictions (Test_RUN.csv).
https://colab.research.google.com/drive/1trazc-cNYk_BhrdiJ4e3AQMKQfRLtG8m?usp=sharing
This demo code will deposit training data and run data in the Projects/Test directory as shown below:
The software determines which file to put the data in, based on the Target column. The files will look something like what is shown below:
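The file-selection rule can be pictured with the sketch below. It is a behavioral assumption, not tulip's actual code: rows whose Target value is known go to the training file, while rows with no Target yet (future periods to be predicted) go to the run file.

```python
# Sketch of the TRAIN/RUN split rule described above (an assumption --
# tulip's internal logic may differ). Column names are illustrative.
rows = [
    {'Ticker': 'AAA', 'Feature1': 1.2, 'Target': 0.05},
    {'Ticker': 'BBB', 'Feature1': 0.8, 'Target': None},   # no target yet
    {'Ticker': 'CCC', 'Feature1': 1.5, 'Target': -0.02},
]

train_rows = [r for r in rows if r['Target'] is not None]  # -> Test_TRAIN.csv
run_rows = [r for r in rows if r['Target'] is None]        # -> Test_RUN.csv

print(len(train_rows), len(run_rows))  # 2 1
```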
Demo2
Trains an XGBoost model based on the file Test_TRAIN.csv and then the trained model is used to generate predictions based on the file Test_RUN.csv. Both Test_TRAIN.csv and Test_RUN.csv were generated in Demo1. The predictions are saved in Test_PREDICT.csv. The XGBoost model is saved as Test_MODEL.dat.
https://colab.research.google.com/drive/1e4HR9v5INcUFxDSm5Wyl0WqHID5HnaAu?usp=sharing
When the code is executed, you should see a report printout similar to the one below:
The arrangement of data is illustrated below. The ideal data allocation was 50% training data, 35% validation data, and 15% test data. The software attempts to come as close as possible to the ideal, but achieving it exactly is impractical. The actual assignment is 50.6% / 35.5% / 13.9%.
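The reason the achieved split drifts from the ideal is that only whole rows (or whole blocks of rows) can be assigned to each partition. A minimal sketch of the arithmetic, with an illustrative allocation rule that need not match tulip's:

```python
# Sketch: allocating whole rows to train/val/test means the achieved
# percentages only approximate the 50/35/15 ideal. The allocation rule
# here (round, then give the remainder to test) is illustrative only.
def achieved_split(n_rows, ideals=(0.50, 0.35, 0.15)):
    n_train = round(n_rows * ideals[0])
    n_val = round(n_rows * ideals[1])
    n_test = n_rows - n_train - n_val
    return tuple(round(100 * n / n_rows, 1) for n in (n_train, n_val, n_test))

print(achieved_split(83))  # (50.6, 34.9, 14.5) -- close to, not exactly, the ideal
```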
The important things to keep an eye on are the R^2 (R-squared) results, in particular the R^2 value for the test data. This is the most meaningful number, and the higher the better. A score of 1 is ideal but not achievable in the real world.
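For reference, R^2 follows the standard definition (this is textbook, not tulip-specific): 1 minus the sum of squared residuals over the total variance of the actuals around their mean. A perfect model scores 1; a model no better than predicting the mean scores 0, and worse-than-mean models go negative.

```python
# Standard R^2 (coefficient of determination): 1 - SS_res / SS_tot.
def r_squared(actual, predicted):
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

print(r_squared([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0 -- perfect predictions
```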
Once the model is trained then the demo generates a prediction file based on Test_RUN.csv. The prediction data should look something like what is shown below.
You will now have four files that have been deposited into your Google Drive project directory.
Demo3
This demo shows how to evaluate the inputs and decide which should be used for subsequent training. Test_TRAIN.csv, created in Demo1, is used for this evaluation effort.
https://colab.research.google.com/drive/1HVbfxrUdv_EUajfosrDeAd2CSkYRLUCO?usp=sharing
I suggest starting with no training inputs selected. Run the demo and depending on the results, start to build up the list of inputs. Each input configuration is tested against a ‘grid’ of XGBoost parameters. 25 iterations are performed with randomly chosen XGBoost parameters. The R^2 scores are an average of the 25 iterations.
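The evaluation scheme described above can be sketched as follows. The parameter names and value ranges below are illustrative assumptions, not tulip's actual grid; the point is the structure: for each input configuration, draw random parameter combinations 25 times, score each, and report the average.

```python
import random

# Sketch of random-grid evaluation (structure only; tulip's grid and
# parameter names may differ): average the score over n_iter random
# draws from the allowed values of each parameter.
def average_score(score_fn, param_grid, n_iter=25, seed=0):
    rng = random.Random(seed)
    scores = []
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in param_grid.items()}
        scores.append(score_fn(params))
    return sum(scores) / n_iter

# Illustrative grid and a stand-in scoring function.
grid = {'max_depth': [3, 5, 7], 'learning_rate': [0.05, 0.1, 0.2]}
avg = average_score(lambda p: p['max_depth'] / 10, grid)
print(0.3 <= avg <= 0.7)  # True -- the average lies within the score range
```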
Demo4
This demo shows how the training inputs / target perform for train/validation/test over a broad range of data split configurations. As with earlier demos, Test_TRAIN.csv, created in Demo1, is used for this evaluation effort.
https://colab.research.google.com/drive/1GXMjA1wyN-Wxqz5j8rh2uF4Ra5Hn13Rj?usp=sharing
Demo5
This demo shows how the training inputs / target perform for different XGBoost parameter configurations. As with earlier demos, Test_TRAIN.csv, created in Demo1, is used for this evaluation effort.
https://colab.research.google.com/drive/1cy2p-oh5Fr2pyTJ7BnEt233lhN22xWIF?usp=sharing
The demo code calls out, on separate lines of code, each XGBoost parameter that tulip allows to be configured. In your own work, you can evaluate one or as many XGBoost parameters as you want.
I have provided a unique feature for diagnosing problems: a function called Modelx.DumpXGBoostParam(). You set a threshold value as a calling parameter. When the software detects a test R^2 below the threshold, the XGBoost parameters are dumped to the screen. This function has helped me a lot with setting minimum and maximum parameter values.
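The idea behind the diagnostic can be sketched in a few lines. This is a standalone illustration of the threshold-dump pattern, not Modelx.DumpXGBoostParam() itself, whose output format and internals may differ:

```python
# Sketch of a threshold-dump diagnostic: when a test R^2 comes in below
# the threshold, print the parameter combination that produced it so you
# can tighten the grid's min/max values. Illustrative helper, not tulip.
def check_and_dump(test_r2, params, threshold):
    if test_r2 < threshold:
        print(f'Low test R^2 {test_r2:.3f} with params: {params}')
        return True
    return False

dumped = check_and_dump(-0.15, {'max_depth': 9, 'learning_rate': 0.3}, threshold=0.0)
print(dumped)  # True -- the offending parameters were printed
```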
Demo6
This demo shows how to override the default XGBoost grid settings. As with previous demos, this demo depends on the training data from Test_TRAIN.csv that was generated in Demo1.
https://colab.research.google.com/drive/1TfiaFwwvy-WHmdnhXOs8Oo1V-8Ntaf4J?usp=sharing
Low R^2 values cause the XGBoost parameters to be dumped, as in the previous demo.
Demo7
This demo shows how to override the default training parameters for XGBoost and also for the DataSplit configuration. As with previous demos, this demo depends on the training data from Test_TRAIN.csv that was generated in Demo1.
https://colab.research.google.com/drive/1POm7BNHXB8lzhrCV0_DdDX_DN_RVKa7Y?usp=sharing
Undocumented Functions
There are some functions not covered in the demos. They are listed below.
Modelx.TrainNoiseEval(n) This function causes random noise with maximum amplitude of n% to be applied to the training data only. The validation and test data are not affected.
Modelx.TrainShuffleEval(True) This function causes the training data to be randomly shuffled. The validation and test data are not affected.
Modelx.XGBoostEarlyStop(n) This function tells XGBoost to stop training under certain conditions after n epochs.
Modelx.DumpDataSplitEval(th) This function is similar to Modelx.DumpXGBoostParam() but instead dumps the data configuration at the time of the low R^2 score.
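The noise and shuffle evaluations can be pictured with the sketch below. It is a behavioral assumption based on the descriptions above, not tulip's internals: both transforms touch the training data only, and neither modifies validation or test data.

```python
import random

# Sketch of the TrainNoiseEval / TrainShuffleEval behavior described
# above (an assumption; tulip's internals may differ).

def add_noise(train_values, n_percent, seed=0):
    """Apply random noise of at most n% to each training value."""
    rng = random.Random(seed)
    scale = n_percent / 100.0
    return [v * (1 + rng.uniform(-scale, scale)) for v in train_values]

def shuffle_rows(train_rows, seed=0):
    """Randomly reorder the training rows, leaving the input untouched."""
    rows = list(train_rows)
    random.Random(seed).shuffle(rows)
    return rows

noisy = add_noise([100.0, 200.0], n_percent=5)
# Every noisy value stays within 5% of the original:
print(all(abs(n - v) <= 0.05 * v for n, v in zip(noisy, [100.0, 200.0])))  # True
```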
Modelx.DataSplitOptions([
{'Train':50,'Val':35,'Test':15}, # Evaluation selections
{'Train':40,'Val':40,'Test':20},
{'Val':35,'Train':50,'Test':15},
{'Val':40,'Train':40,'Test':20},
{'Val':35,'Test':15,'Train':50},
{'Val':40,'Test':20,'Train':40},
{'Test':15,'Train':50,'Val':35},
{'Test':20,'Train':40,'Val':40},
{'Test':15,'Val':35,'Train':50},
{'Test':20,'Val':40,'Train':40}
])
This function allows an override of the default array of DataSplit configurations.
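When supplying your own configurations, each Train/Val/Test triple should account for all of the data. A quick sanity check (an illustrative helper, not part of tulip) is to verify that every configuration sums to 100%:

```python
# Sketch: verify each DataSplit configuration's percentages sum to 100.
# Illustrative helper only -- not a tulip function.
def valid_splits(configs):
    return all(cfg['Train'] + cfg['Val'] + cfg['Test'] == 100 for cfg in configs)

configs = [
    {'Train': 50, 'Val': 35, 'Test': 15},
    {'Train': 40, 'Val': 40, 'Test': 20},
]
print(valid_splits(configs))  # True
```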