TRANSCRIPT
Introduction
This is a series of demonstrations intended to spur interest in Machine Learning (“ML”) applications on Portfolio123. The demos use Google Colaboratory, and all files are stored in and retrieved from Google Drive. You will need a Google account to use the demos as-is. Everything runs in the cloud, so you do not have to worry about installing anything on your desktop computer. If you are determined to run the code on your desktop, it is possible to do so by making the appropriate modifications to the code with regard to setting up the file paths and directories. I don’t support desktop applications, so you will have to do that on your own.
Python is used as the primary scripting language. I recommend w3schools as the place to start if you are not familiar with Python. The one area it does not cover is pandas, which is needed for working with DataFrames. I haven’t found a good pandas tutorial site, but a Google search usually does the trick when you encounter a problem.
The ML software for these demos is XGBoost. It is fast and robust. I have written some software that makes XGBoost extremely easy to use and will shave weeks off the development time for any ML application. My software is called tulip, a name that expresses my faith/optimism that the world will blossom this spring with Covid-19 a thing of the past, and Portfolio123 ML will also blossom of course.
Google Drive Setup
Before we begin, you will need to set up Google Drive, creating a project directory and a library directory. You can copy the 7 demo files now, or later when you start to use Google Colaboratory.
Lib Directory
Within the Lib directory you must deposit the Portfolio123 API code, my tulip software, and your Portfolio123 authorization key and ID, as shown below.
Tulip.py can be retrieved here:
https://drive.google.com/file/d/1P47sD2bvPFzD0BGj_1i0STWSog4w0sRn/view?usp=sharing
Below is a sample P123 API key file. You can modify this file with your Portfolio123 credentials. Then change the name to P123_auth.py and put it in your Lib directory. Make sure it is kept as a plain-text file, even though the file extension is .py.
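As a rough illustration, P123_auth.py might look like the sketch below. The variable names here are assumptions for illustration only; match whatever names the linked sample file actually uses, and substitute your own Portfolio123 credentials.

```python
# Hypothetical P123_auth.py -- variable names are illustrative, not the
# sample file's actual names. Keep this as plain text.
P123_API_ID = 'your_api_id_here'
P123_API_KEY = 'your_api_key_here'
```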
https://docs.google.com/document/d/1YLiXIYZg4cdoU9v-lmGk0Kntzfyd4-3m3VAG7onRBzc/edit?usp=sharing
Project Directory
The demos run with a project called ‘Test’. You have to make a ‘Projects’ directory where all projects are to be located (I’m ambitious), and a subdirectory with the same name as the project, i.e. ‘Test’.
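The layout above can be created programmatically if you prefer. The sketch below uses a local folder named 'MyDrive' as a stand-in for your Drive root (in Colab the mounted root is typically '/content/drive/MyDrive'); the directory names match those described in this document.

```python
from pathlib import Path

# Create the Lib directory and the Projects/Test project subdirectory
# under a stand-in Drive root. 'MyDrive' here is illustrative only.
root = Path('MyDrive')
(root / 'Lib').mkdir(parents=True, exist_ok=True)
(root / 'Projects' / 'Test').mkdir(parents=True, exist_ok=True)

print((root / 'Projects' / 'Test').is_dir())  # True
```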
For now, the Test directory is empty.
Demo Files
There are 7 demo files.
The links provided below will take you to Google Colaboratory. You should “Save a Copy” to your Google Drive.
Demo1:
https://colab.research.google.com/drive/1trazc-cNYk_BhrdiJ4e3AQMKQfRLtG8m?usp=sharing
Demo2:
https://colab.research.google.com/drive/1e4HR9v5INcUFxDSm5Wyl0WqHID5HnaAu?usp=sharing
Demo3:
https://colab.research.google.com/drive/1HVbfxrUdv_EUajfosrDeAd2CSkYRLUCO?usp=sharing
Demo4:
https://colab.research.google.com/drive/1GXMjA1wyN-Wxqz5j8rh2uF4Ra5Hn13Rj?usp=sharing
Demo5:
https://colab.research.google.com/drive/1cy2p-oh5Fr2pyTJ7BnEt233lhN22xWIF?usp=sharing
Demo6:
https://colab.research.google.com/drive/1TfiaFwwvy-WHmdnhXOs8Oo1V-8Ntaf4J?usp=sharing
Demo7:
https://colab.research.google.com/drive/1POm7BNHXB8lzhrCV0_DdDX_DN_RVKa7Y?usp=sharing
Portfolio123 Ranking System and Custom Universe
Demo1 retrieves training data (and run data) from the Ranking System and Custom Universe at Portfolio123. They are located here for reference:
https://www.portfolio123.com/app/ranking-system/374649
Custom Universe – approximately 200 digital transformation stocks
https://www.portfolio123.com/app/universe/summary/249294?st=1&mt=7
Demo1
Fetch data from a Portfolio123 Ranking System and generate the file used to train the model (Test_TRAIN.csv) and the file used to make future predictions (Test_RUN.csv).
https://colab.research.google.com/drive/1trazc-cNYk_BhrdiJ4e3AQMKQfRLtG8m?usp=sharing
This demo code will deposit training data and run data in the Projects/Test directory as shown below:
The software determines which file to put the data in, based on the Target column. The files will look something like what is shown below:
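The file-selection rule can be pictured with the sketch below. It is a behavioral assumption, not tulip's actual code: rows whose Target value is known go to the training file, while rows with no Target yet (future periods to be predicted) go to the run file.

```python
# Sketch of the TRAIN/RUN split rule described above (an assumption --
# tulip's internal logic may differ). Column names are illustrative.
rows = [
    {'Ticker': 'AAA', 'Feature1': 1.2, 'Target': 0.05},
    {'Ticker': 'BBB', 'Feature1': 0.8, 'Target': None},   # no target yet
    {'Ticker': 'CCC', 'Feature1': 1.5, 'Target': -0.02},
]

train_rows = [r for r in rows if r['Target'] is not None]  # -> Test_TRAIN.csv
run_rows = [r for r in rows if r['Target'] is None]        # -> Test_RUN.csv

print(len(train_rows), len(run_rows))  # 2 1
```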
Demo2
Trains an XGBoost model based on the file Test_TRAIN.csv and then the trained model is used to generate predictions based on the file Test_RUN.csv. Both Test_TRAIN.csv and Test_RUN.csv were generated in Demo1. The predictions are saved in Test_PREDICT.csv. The XGBoost model is saved as Test_MODEL.dat.
https://colab.research.google.com/drive/1e4HR9v5INcUFxDSm5Wyl0WqHID5HnaAu?usp=sharing
When the code is executed, you should see a report printout similar to the one below:
The arrangement of data is illustrated below. The ideal data allocation was 50% training data, 35% validation data, and 15% test data. The software attempts to come as close as possible to the ideal, but achieving it exactly is impractical. The actual assignment is 50.6% / 35.5% / 13.9%.
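The reason the achieved split drifts from the ideal is that only whole rows (or whole blocks of rows) can be assigned to each partition. A minimal sketch of the arithmetic, with an illustrative allocation rule that need not match tulip's:

```python
# Sketch: allocating whole rows to train/val/test means the achieved
# percentages only approximate the 50/35/15 ideal. The allocation rule
# here (round, then give the remainder to test) is illustrative only.
def achieved_split(n_rows, ideals=(0.50, 0.35, 0.15)):
    n_train = round(n_rows * ideals[0])
    n_val = round(n_rows * ideals[1])
    n_test = n_rows - n_train - n_val
    return tuple(round(100 * n / n_rows, 1) for n in (n_train, n_val, n_test))

print(achieved_split(83))  # (50.6, 34.9, 14.5) -- close to, not exactly, the ideal
```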
The important things to keep an eye on are the R^2 (R-squared) results, in particular the R^2 value for the test data. This is the most meaningful number, and the higher the better. A score of 1 is ideal but not achievable in the real world.
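For reference, R^2 follows the standard definition (this is textbook, not tulip-specific): 1 minus the sum of squared residuals over the total variance of the actuals around their mean. A perfect model scores 1; a model no better than predicting the mean scores 0, and worse-than-mean models go negative.

```python
# Standard R^2 (coefficient of determination): 1 - SS_res / SS_tot.
def r_squared(actual, predicted):
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

print(r_squared([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0 -- perfect predictions
```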
Once the model is trained then the demo generates a prediction file based on Test_RUN.csv. The prediction data should look something like what is shown below.
You will now have four files that have been deposited into your Google Drive project directory.
Demo3
This demo shows how to evaluate the inputs and decide which should be used for subsequent training. Test_TRAIN.csv, created in Demo1, is used for this evaluation effort.
https://colab.research.google.com/drive/1HVbfxrUdv_EUajfosrDeAd2CSkYRLUCO?usp=sharing
I suggest starting with no training inputs selected. Run the demo and depending on the results, start to build up the list of inputs. Each input configuration is tested against a ‘grid’ of XGBoost parameters. 25 iterations are performed with randomly chosen XGBoost parameters. The R^2 scores are an average of the 25 iterations.
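The evaluation scheme described above can be sketched as follows. The parameter names and value ranges below are illustrative assumptions, not tulip's actual grid; the point is the structure: for each input configuration, draw random parameter combinations 25 times, score each, and report the average.

```python
import random

# Sketch of random-grid evaluation (structure only; tulip's grid and
# parameter names may differ): average the score over n_iter random
# draws from the allowed values of each parameter.
def average_score(score_fn, param_grid, n_iter=25, seed=0):
    rng = random.Random(seed)
    scores = []
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in param_grid.items()}
        scores.append(score_fn(params))
    return sum(scores) / n_iter

# Illustrative grid and a stand-in scoring function.
grid = {'max_depth': [3, 5, 7], 'learning_rate': [0.05, 0.1, 0.2]}
avg = average_score(lambda p: p['max_depth'] / 10, grid)
print(0.3 <= avg <= 0.7)  # True -- the average lies within the score range
```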
Demo4
This demo shows how the training inputs / target perform for train/validation/test over a broad range of data split configurations. As with earlier demos, Test_TRAIN.csv, created in Demo1, is used for this evaluation effort.
https://colab.research.google.com/drive/1GXMjA1wyN-Wxqz5j8rh2uF4Ra5Hn13Rj?usp=sharing
Demo5
This demo shows how the training inputs / target perform for different XGBoost parameter configurations. As with earlier demos, Test_TRAIN.csv, created in Demo1, is used for this evaluation effort.
https://colab.research.google.com/drive/1cy2p-oh5Fr2pyTJ7BnEt233lhN22xWIF?usp=sharing
The demo code calls out, on separate lines of code, each XGBoost parameter that tulip allows to be configured. In your own work, you can evaluate one or as many XGBoost parameters as you want.
I have provided a unique feature for diagnosing problems: a function called Modelx.DumpXGBoostParam(). You set a threshold value as a calling parameter. When the software detects a test R^2 below the threshold, the XGBoost parameters are dumped to the screen. This function has helped me a lot with setting minimum and maximum parameter values.
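The idea behind the diagnostic can be sketched in a few lines. This is a standalone illustration of the threshold-dump pattern, not Modelx.DumpXGBoostParam() itself, whose output format and internals may differ:

```python
# Sketch of a threshold-dump diagnostic: when a test R^2 comes in below
# the threshold, print the parameter combination that produced it so you
# can tighten the grid's min/max values. Illustrative helper, not tulip.
def check_and_dump(test_r2, params, threshold):
    if test_r2 < threshold:
        print(f'Low test R^2 {test_r2:.3f} with params: {params}')
        return True
    return False

dumped = check_and_dump(-0.15, {'max_depth': 9, 'learning_rate': 0.3}, threshold=0.0)
print(dumped)  # True -- the offending parameters were printed
```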
Demo6
This demo shows how to override the default XGBoost grid settings. As with previous demos, this demo depends on the training data from Test_TRAIN.csv that was generated in Demo1.
https://colab.research.google.com/drive/1TfiaFwwvy-WHmdnhXOs8Oo1V-8Ntaf4J?usp=sharing
Low R^2 values cause the XGBoost parameters to be dumped, as in the previous demo.
Demo7
This demo shows how to override the default training parameters for XGBoost and also for the DataSplit configuration. As with previous demos, this demo depends on the training data from Test_TRAIN.csv that was generated in Demo1.
https://colab.research.google.com/drive/1POm7BNHXB8lzhrCV0_DdDX_DN_RVKa7Y?usp=sharing
Undocumented Functions
There are some functions not covered in the demos. They are listed below.
Modelx.TrainNoiseEval(n) This function causes random noise with maximum amplitude of n% to be applied to the training data only. The validation and test data are not affected.
Modelx.TrainShuffleEval(True) This function causes the training data to be randomly shuffled. The validation and test data are not affected.
Modelx.XGBoostEarlyStop(n) This function tells XGBoost to stop training under certain conditions after n epochs.
Modelx.DumpDataSplitEval(th) This function is similar to Modelx.DumpXGBoostParam() but instead dumps the data configuration at the time of the low R^2 score.
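The noise and shuffle evaluations can be pictured with the sketch below. It is a behavioral assumption based on the descriptions above, not tulip's internals: both transforms touch the training data only, and neither modifies validation or test data.

```python
import random

# Sketch of the TrainNoiseEval / TrainShuffleEval behavior described
# above (an assumption; tulip's internals may differ).

def add_noise(train_values, n_percent, seed=0):
    """Apply random noise of at most n% to each training value."""
    rng = random.Random(seed)
    scale = n_percent / 100.0
    return [v * (1 + rng.uniform(-scale, scale)) for v in train_values]

def shuffle_rows(train_rows, seed=0):
    """Randomly reorder the training rows, leaving the input untouched."""
    rows = list(train_rows)
    random.Random(seed).shuffle(rows)
    return rows

noisy = add_noise([100.0, 200.0], n_percent=5)
# Every noisy value stays within 5% of the original:
print(all(abs(n - v) <= 0.05 * v for n, v in zip(noisy, [100.0, 200.0])))  # True
```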
Modelx.DataSplitOptions([
{'Train':50,'Val':35,'Test':15}, # Evaluation selections
{'Train':40,'Val':40,'Test':20},
{'Val':35,'Train':50,'Test':15},
{'Val':40,'Train':40,'Test':20},
{'Val':35,'Test':15,'Train':50},
{'Val':40,'Test':20,'Train':40},
{'Test':15,'Train':50,'Val':35},
{'Test':20,'Train':40,'Val':40},
{'Test':15,'Val':35,'Train':50},
{'Test':20,'Val':40,'Train':40}
])
This function allows an override of the default array of DataSplit configurations.
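When supplying your own configurations, each Train/Val/Test triple should account for all of the data. A quick sanity check (an illustrative helper, not part of tulip) is to verify that every configuration sums to 100%:

```python
# Sketch: verify each DataSplit configuration's percentages sum to 100.
# Illustrative helper only -- not a tulip function.
def valid_splits(configs):
    return all(cfg['Train'] + cfg['Val'] + cfg['Test'] == 100 for cfg in configs)

configs = [
    {'Train': 50, 'Val': 35, 'Test': 15},
    {'Train': 40, 'Val': 40, 'Test': 20},
]
print(valid_splits(configs))  # True
```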