Asset Price Prediction with Machine Learning
TRANSCRIPT
Which variables matter for predicting S1?
When assessing this task, it is important to remember that this is time-series data. As such, and particularly with stock-related data, multicollinearity will most likely be an issue. This presents major problems for regression analysis, since multicollinearity inflates the variance of the estimated coefficients, making them unstable. In addition, we must recognize that including all of the variables in a model would lead to over-fitting in sample and subsequently poor predictive performance on out-of-sample data. With these problems in mind, we will remedy them with Principal Component Analysis.
Principal Component Analysis is a statistical method used to reduce the dimensionality of data sets. Simply stated, we transform the data into new variables called principal components and eliminate the principal components that explain negligible amounts of the variance exhibited within the data set. The benefit of this technique is that we preserve the variance of the data set while being able to perform visual and exploratory analysis much more easily than prior to the transformation. When forming the matrix of data we will perform PCA on, we remove S1, since this is the response variable, and retain columns S2 through S10. After running principal component analysis on the first 50 rows of S2 through S10, we see the following:
Each row index number represents the principal component number, and each value represents the percentage of the variability that principal component explains. In this experiment, our threshold for retaining a principal component is 1%. We notice that only the first 5 principal components meet the threshold we have set; as such, we remove the remaining components. When translating this elimination of principal components back to the original data, we choose to keep columns S2 through S6 and eliminate the rest from our training data.
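The screening step described above can be sketched with scikit-learn's `PCA`. This is a minimal illustration using stand-in random data in place of the real S2–S10 columns (which are not reproduced here); the 1% variance threshold matches the text.

```python
# Sketch of the PCA variance-threshold screening described above.
# X is hypothetical stand-in data for the first 50 rows of S2..S10.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 9))  # 9 columns, mirroring S2 through S10

pca = PCA()
pca.fit(X)

# Fraction of the total variance explained by each principal component.
explained = pca.explained_variance_ratio_

# Retain only components explaining at least 1% of the variance.
keep = explained >= 0.01
print(f"components kept: {keep.sum()} of {len(explained)}")
```

With real, correlated financial data the trailing components would explain far less variance than in this synthetic example, which is what makes the 1% cutoff bite.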
Does S1 go up or down cumulatively (on an open-to-close basis) over this period?
S1 represents the daily open-to-close changes of a stock. We find that S1 increases cumulatively over this 50-day period by 5.92 points. When observing the cumulative changes in the stock over the first 50 days, we see the following:
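The cumulative open-to-close calculation is a running sum of the daily changes. A minimal sketch, using a made-up toy series in place of the real S1 data:

```python
# Cumulative open-to-close change: a running sum of daily changes.
# The series below is hypothetical, not the actual S1 data.
import numpy as np

s1 = np.array([0.5, -0.2, 1.1, -0.3])  # toy daily open-to-close changes
cumulative = np.cumsum(s1)

# The final element is the cumulative change over the whole period.
print(cumulative[-1])  # 1.1 for this toy series
```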
What techniques did you use? Why?
We began our experiment by using principal component analysis and, from this technique, determined our explanatory variables to be S2 through S6. As stated prior, the benefit of this technique is that we preserve the variance of the data set while transforming it in a manner that allows us to understand the contribution of each principal component to the total variance within the data. After the training data for the explanatory variables has been determined, we cross-validate the response and explanatory variables by randomly sampling rows within the range of the training set. By doing this, we are not only preventing over-fitting, but we are also able to test our model on "new" data. This allows us to gain a more realistic perspective on how it would perform on out-of-sample data.
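One way to realize the random row sampling described above is a randomized hold-out split. The sketch below uses scikit-learn's `train_test_split` on stand-in data; the exact sampling scheme and split fraction in the original experiment are not specified, so these are assumptions.

```python
# Randomized hold-out split standing in for the row sampling described
# above. X and y are hypothetical stand-ins for S2..S6 and S1.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))  # stand-in for columns S2..S6
y = rng.normal(size=50)       # stand-in for the response S1

# Each call samples fresh random rows, giving the model "new" test data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(X_train.shape, X_test.shape)
```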
Models Used to Predict S1
When performing this experiment, the following five models were chosen for evaluation. The scikit-learn module was used for several of the implementations, while one model was constructed in stepwise fashion. The models used are as follows:
a. Ridge Regression – a method used to analyze multiple-regression data that suffers from multicollinearity (linear or near-linear relationships between explanatory variables). This regression accepts a small amount of bias in exchange for reduced standard errors, and is therefore more reliable here than traditional regression methods. [scikit-learn]
b. Support Vector Regression – regression that utilizes kernels (functions that operate in feature space without having to compute coordinates of the data, computing inner products between data pairs instead) to optimize the bounds for the regression. [scikit-learn]
c. Kernel Ridge Regression – ridge regression, except the linear function is learned in the space induced by the respective kernel. [scikit-learn]
d. Neural Network using Ridge Regression – a system of weighted "neurons" into which the data is input; the weights are updated on each iteration of the algorithm, and ridge regression is used as the function within the neurons. [implemented manually]
e. Stochastic Gradient Descent – finds a local minimum of a function by stepping in the negative direction of the gradient (the derivative of the function). [scikit-learn]
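The four scikit-learn candidates from this list can be instantiated as below; the manually implemented ridge-based neural network is omitted, and the hyperparameters shown are illustrative defaults rather than the values used in the original experiment.

```python
# Sketch of the scikit-learn candidates from the list above, fit on
# hypothetical stand-in data (the real training set is not reproduced).
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge, SGDRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))  # stand-in for the S2..S6 training rows
y = rng.normal(size=40)       # stand-in for S1

models = {
    "ridge": Ridge(alpha=1.0),
    "svr": SVR(kernel="rbf", epsilon=0.1),
    "kernel_ridge": KernelRidge(kernel="rbf", alpha=1.0),
    "sgd": SGDRegressor(max_iter=1000),
}

# Fit each candidate and collect its in-sample predictions.
predictions = {name: m.fit(X, y).predict(X) for name, m in models.items()}
```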
For this experiment, we choose to iterate the implementation of these algorithms for 100 trials. The reasoning behind this is to gain a more reasonable approximation of the following summary statistics with respect to the sum of squares:
• Maximum
• Minimum
• Mean
• Standard deviation
• Range
We shall also be finding the r-squared value, which informs us how much of the variability in y can be explained by x. However, this does not change from iteration to iteration, since only the orientation of the data is changing. Our objective is to choose the model with the lowest sum of squares while also maximizing our r-squared value. Upon completion of the iterations, we observe the following:
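The 100-trial loop that produces these summary statistics can be sketched as follows. This minimal version uses toy data and a single model (ridge) for brevity; the original experiment ran all five candidates.

```python
# Sketch of the 100-trial evaluation: refit on a fresh random split each
# trial and summarize the sum of squared residuals. Toy data, Ridge only.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=50)

sses = []
for trial in range(100):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=trial
    )
    model = Ridge().fit(X_tr, y_tr)
    resid = y_te - model.predict(X_te)
    sses.append(np.sum(resid**2))

sses = np.array(sses)
# The five summary statistics listed above.
print(sses.max(), sses.min(), sses.mean(), sses.std(), sses.max() - sses.min())
```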
Determining the Model to Choose and Why
We find that, generally, the Support Vector Regression performs the best in consideration of our objectives. Of all the models utilized, it has the highest r-squared value, the lowest standard deviation of the sum of squares, and the lowest maximum sum of squares. While it has neither the lowest range nor the lowest minimum sum of squares observed, the difference in these statistics from the best-performing models is very minimal.
The strong performance of the support vector regression is due in part to the epsilon-insensitive loss function. This function essentially ignores errors within a certain distance of the true value of the data point. Using this function, we achieve a global minimum while still retaining generalization within the bounds of the hyperplane or set of hyperplanes (the bounds within which we observe the given data, defined by the kernel). This model is robust and can handle both linear and nonlinear regression, also making it a suitable choice for the task at hand. Be this as it may, our model is not perfect, and we must understand its limits, particularly within the context of financial data.
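The epsilon-insensitive loss described above can be written down directly: residuals inside the epsilon tube contribute zero, and larger residuals are penalized by how far they exceed epsilon. A minimal sketch (the epsilon value and residuals below are illustrative):

```python
# Epsilon-insensitive loss: errors within epsilon of the true value are
# ignored, which is the property of SVR discussed above.
import numpy as np

def epsilon_insensitive(residuals, epsilon):
    # max(|r| - epsilon, 0): zero inside the tube, linear outside it.
    return np.maximum(np.abs(residuals) - epsilon, 0.0)

r = np.array([0.05, -0.08, 0.3, -0.5])  # illustrative residuals
print(epsilon_insensitive(r, epsilon=0.1))  # [0.  0.  0.2 0.4]
```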
How much confidence do you have in your model? Why and when would it fail?
As stated prior, financial data presents many problems that must be accounted for. When examining the volatility of S1 in our training data set, we observe the following, where:
Y-axis: F–M, Tu–Th, and Total represent Fridays and Mondays, Tuesdays through Thursdays, and total days, respectively.
X-axis: Vol, #Days, %Days, and SSRs represent volatility, number of days, percentage of days, and sum of squared residuals for a particular iteration.
It is worth noting that F–M consists of 10 two-day pairs and Tu–Th of 10 three-day pairs. We can see that there is more variability on Fridays and Mondays than on Tuesdays through Thursdays. Cumulatively, however, the most inaccurate predictions in this observation come from the Tuesday-through-Thursday period. Below, we observe the actual S1 in red and our predicted S1 in green for the training period (not cross-validated data):
The algorithm does perform well with respect to its predictive abilities; however, there are still shortcomings to this technique. The main shortcoming is that the model generally overestimates returns slightly, with moderate variability in the residuals. This is most likely due to the kernel we have selected, which is non-linear. Different kernels produce differing hyperplanes, and therefore different predictions. In general, we would like to keep our models more generalized for out-of-sample prediction, but support vector regression is noted for often requiring specific kernel selection for better predictive results. Figuring out which kernel to choose would require a significant amount of time, and whatever benefit we gain in more accurate in-sample predictions, we trade in the general accuracy of the model, particularly with out-of-sample data.
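One way the kernel-selection burden just described is commonly automated is a cross-validated grid search over candidate kernels. This was not done in the original experiment; the sketch below, on stand-in data with an illustrative kernel list, only shows what that search would look like.

```python
# Hedged sketch of automated kernel selection via cross-validated grid
# search. The data and the kernel list are illustrative assumptions.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 5))
y = rng.normal(size=50)

# Compare candidate kernels by 5-fold cross-validated score.
grid = GridSearchCV(SVR(), {"kernel": ["linear", "rbf", "poly"]}, cv=5)
grid.fit(X, y)
print(grid.best_params_["kernel"])
```

Note that cross-validated selection still inherits the in-sample-versus-generalization trade-off discussed above; it only makes the search systematic.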
As for when this model would operate best, that would likely be when the systemic factors within the market stay the same, so that the kernel chosen remains appropriate across the entirety of the data set. Periods such as 2008 would likely render this model less useful than periods of relative stability, when the market is trading "sideways." In conclusion, support vector regression on our reduced data set (via principal component analysis) is the best model for our regression, but there is still fine-tuning that must be done respective to the situation, such as which kernel to use. So long as this model is used in periods in which systemic factors are constant, its predictive power is significantly enhanced, and it is therefore recommendable as a component of a decision-making process.