Asset Price Prediction with Machine Learning
TRANSCRIPT
Which variables matter for predicting S1?
When assessing this task, it is important to remember that this is time-series data. As such, and particularly with stock-related data, multicollinearity will most likely be an issue. This presents major problems for regression analysis, since multicollinearity inflates the variance of the estimated coefficients, making them unstable. In addition, we must recognize that including all of the variables in a model would lead to over-fitting in sample and subsequently poor predictive performance on out-of-sample data. With these problems in mind, we will remedy them with Principal Component Analysis.
Principal Component Analysis is a statistical method used to reduce the dimensionality of data sets. Simply stated, we transform the data into new variables called principal components and eliminate the principal components that explain negligible amounts of the variance exhibited within the data set. The benefit of this technique is that we preserve the variance of the data set while being able to perform visual and exploratory analysis much more easily than prior to the transformation. When forming the matrix of data we will perform PCA on, we remove S1, since this is the response variable, and retain columns S2 through S10. After running principal component analysis on the first 50 rows of S2 through S10, we see the following:
Each row index number represents the principal component number, and each value represents the percentage of the variability that principal component explains. In this experiment, our threshold for retaining a principal component is 1%. We notice that only the first 5 principal components meet the threshold we have set; as such, we remove the remaining components. When translating this elimination of principal components back to the original data, we choose to keep columns S2 through S6 and eliminate the rest from our training data.
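The screening step described above can be sketched with scikit-learn's `PCA`. This is a minimal illustration using stand-in random data in place of the real S2–S10 columns (which are not reproduced here); the 1% variance threshold matches the text.

```python
# Sketch of the PCA variance-threshold screening described above.
# X is hypothetical stand-in data for the first 50 rows of S2..S10.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 9))  # 9 columns, mirroring S2 through S10

pca = PCA()
pca.fit(X)

# Fraction of the total variance explained by each principal component.
explained = pca.explained_variance_ratio_

# Retain only components explaining at least 1% of the variance.
keep = explained >= 0.01
print(f"components kept: {keep.sum()} of {len(explained)}")
```

With real, correlated financial data the trailing components would explain far less variance than in this synthetic example, which is what makes the 1% cutoff bite.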
Does S1 go up or down cumulatively (on an open-to-close basis) over this period?
S1 represents the daily open-to-close changes of a stock. We find that S1 increases cumulatively over this 50-day period by 5.92 points. When observing the cumulative changes in the stock over the first 50 days, we see the following:
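The cumulative open-to-close calculation is a running sum of the daily changes. A minimal sketch, using a made-up toy series in place of the real S1 data:

```python
# Cumulative open-to-close change: a running sum of daily changes.
# The series below is hypothetical, not the actual S1 data.
import numpy as np

s1 = np.array([0.5, -0.2, 1.1, -0.3])  # toy daily open-to-close changes
cumulative = np.cumsum(s1)

# The final element is the cumulative change over the whole period.
print(cumulative[-1])  # 1.1 for this toy series
```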
What techniques did you use? Why?
We began our experiment by using principal component analysis and, from this technique, determined our explanatory variables to be S2 through S6. As stated prior, the benefit of this technique is that we preserve the variance of the data set while transforming it in a manner that allows us to understand the contribution of each principal component to the total variance within the data. After the training data for the explanatory variables has been determined, we cross-validate the response and explanatory variables by randomly sampling rows within the range of the training set. By doing this, we are not only preventing over-fitting, but we are also able to test our model on "new" data. This allows us to gain a more realistic perspective on how it would perform on out-of-sample data.
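One way to realize the random row sampling described above is a randomized hold-out split. The sketch below uses scikit-learn's `train_test_split` on stand-in data; the exact sampling scheme and split fraction in the original experiment are not specified, so these are assumptions.

```python
# Randomized hold-out split standing in for the row sampling described
# above. X and y are hypothetical stand-ins for S2..S6 and S1.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))  # stand-in for columns S2..S6
y = rng.normal(size=50)       # stand-in for the response S1

# Each call samples fresh random rows, giving the model "new" test data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(X_train.shape, X_test.shape)
```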
Models Used to Predict S1
When performing this experiment, the following five models were chosen for evaluation. The scikit-learn module was used for several of the implementations, while one model was constructed in stepwise fashion. The models used are as follows:
a. Ridge Regression – a method used to analyze multiple-regression data that suffers from multicollinearity (linear or near-linear relationships between explanatory variables). This regression accepts a small amount of bias in exchange for reduced standard errors, and is therefore more reliable here than traditional regression methods. [scikit-learn]
b. Support Vector Regression – regression that utilizes kernels (functions that operate in feature space without having to compute coordinates of the data, computing inner products between data pairs instead) to optimize the bounds for the regression. [scikit-learn]
c. Kernel Ridge Regression – ridge regression, except the linear function is learned in the space induced by the respective kernel. [scikit-learn]
d. Neural Network using Ridge Regression – a system of weighted "neurons" into which the data is input; the weights are updated on each iteration of the algorithm, and ridge regression is used as the function within the neurons. [implemented manually]
e. Stochastic Gradient Descent – finds a local minimum of a function by stepping in the negative direction of the gradient (the derivative of the function). [scikit-learn]
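The four scikit-learn candidates from this list can be instantiated as below; the manually implemented ridge-based neural network is omitted, and the hyperparameters shown are illustrative defaults rather than the values used in the original experiment.

```python
# Sketch of the scikit-learn candidates from the list above, fit on
# hypothetical stand-in data (the real training set is not reproduced).
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge, SGDRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))  # stand-in for the S2..S6 training rows
y = rng.normal(size=40)       # stand-in for S1

models = {
    "ridge": Ridge(alpha=1.0),
    "svr": SVR(kernel="rbf", epsilon=0.1),
    "kernel_ridge": KernelRidge(kernel="rbf", alpha=1.0),
    "sgd": SGDRegressor(max_iter=1000),
}

# Fit each candidate and collect its in-sample predictions.
predictions = {name: m.fit(X, y).predict(X) for name, m in models.items()}
```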
For this experiment, we choose to iterate the implementation of these algorithms for 100 trials. The reasoning behind this is to gain a more reasonable approximation of the following summary statistics with respect to the sum of squares:
• Maximum
• Minimum
• Mean
• Standard deviation
• Range
We shall also be finding the r-squared value, which informs us how much of the variability in y can be explained by x. However, this does not change from iteration to iteration, since only the orientation of the data is changing. Our objective is to choose the model with the lowest sum of squares while also maximizing our r-squared value. Upon completion of the iterations, we observe the following:
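The 100-trial loop that produces these summary statistics can be sketched as follows. This minimal version uses toy data and a single model (ridge) for brevity; the original experiment ran all five candidates.

```python
# Sketch of the 100-trial evaluation: refit on a fresh random split each
# trial and summarize the sum of squared residuals. Toy data, Ridge only.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=50)

sses = []
for trial in range(100):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=trial
    )
    model = Ridge().fit(X_tr, y_tr)
    resid = y_te - model.predict(X_te)
    sses.append(np.sum(resid**2))

sses = np.array(sses)
# The five summary statistics listed above.
print(sses.max(), sses.min(), sses.mean(), sses.std(), sses.max() - sses.min())
```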
Determining the Model to Choose and Why
We find that, generally, the Support Vector Regression performs the best in consideration of our objectives. Of all the models utilized, it has the highest r-squared value, the lowest standard deviation of the sum of squares, and the lowest maximum sum of squares. While it has neither the lowest range nor the lowest minimum sum of squares observed, the difference in these statistics from the best-performing models is very minimal.
The strong performance of the support vector regression is due in part to the epsilon-insensitive loss function. This function essentially ignores errors within a certain distance of the true value of the data point. Using this function, we achieve a global minimum while still retaining generalization within the bounds of the hyperplane or set of hyperplanes (the bounds within which we observe the given data, defined by the kernel). This model is robust and can handle both linear and nonlinear regression, also making it a suitable choice for the task at hand. Be this as it may, our model is not perfect, and we must understand its limits, particularly within the context of financial data.
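The epsilon-insensitive loss described above can be written down directly: residuals inside the epsilon tube contribute zero, and larger residuals are penalized by how far they exceed epsilon. A minimal sketch (the epsilon value and residuals below are illustrative):

```python
# Epsilon-insensitive loss: errors within epsilon of the true value are
# ignored, which is the property of SVR discussed above.
import numpy as np

def epsilon_insensitive(residuals, epsilon):
    # max(|r| - epsilon, 0): zero inside the tube, linear outside it.
    return np.maximum(np.abs(residuals) - epsilon, 0.0)

r = np.array([0.05, -0.08, 0.3, -0.5])  # illustrative residuals
print(epsilon_insensitive(r, epsilon=0.1))  # [0.  0.  0.2 0.4]
```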
How much confidence do you have in your model? Why and when would it fail?
As stated prior, financial data presents many problems that must be accounted for. When examining the volatility of S1 in our training data set, we observe the following, where:
Y-axis: F–M, Tu–Th, and Total represent Fridays and Mondays, Tuesdays through Thursdays, and total days, respectively.
X-axis: Vol, #Days, %Days, and SSRs represent volatility, number of days, percentage of days, and sum of squared residuals for a particular iteration.
It is worth noting that F–M consists of 10 two-day pairs and Tu–Th of 10 three-day pairs. We can see that there is more variability on Fridays and Mondays than on Tuesdays through Thursdays. Cumulatively, however, the most inaccurate predictions in this observation come from the Tuesday-through-Thursday period. Below, we observe the actual S1 in red and our predicted S1 in green for the training period (not cross-validated data):
The algorithm does perform well with respect to its predictive abilities; however, there are still shortcomings to this technique. The main shortcoming is that the model generally overestimates returns slightly, with moderate variability in the residuals. This is most likely due to the kernel we have selected, which is non-linear. Different kernels produce differing hyperplanes, and therefore different predictions. In general, we would like to keep our models more generalized for out-of-sample prediction, but support vector regression is noted for often requiring specific kernel selection for better predictive results. Figuring out which kernel to choose would require a significant amount of time, and whatever benefit we gain in more accurate in-sample predictions, we trade in the general accuracy of the model, particularly with out-of-sample data.
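One way the kernel-selection burden just described is commonly automated is a cross-validated grid search over candidate kernels. This was not done in the original experiment; the sketch below, on stand-in data with an illustrative kernel list, only shows what that search would look like.

```python
# Hedged sketch of automated kernel selection via cross-validated grid
# search. The data and the kernel list are illustrative assumptions.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 5))
y = rng.normal(size=50)

# Compare candidate kernels by 5-fold cross-validated score.
grid = GridSearchCV(SVR(), {"kernel": ["linear", "rbf", "poly"]}, cv=5)
grid.fit(X, y)
print(grid.best_params_["kernel"])
```

Note that cross-validated selection still inherits the in-sample-versus-generalization trade-off discussed above; it only makes the search systematic.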
As for when this model would operate best, that would likely be when the systemic factors within the market stay the same, so that the kernel chosen remains appropriate across the entirety of the data set. Periods such as 2008 would likely render this model less useful than periods of relative stability, when the market is trading "sideways." In conclusion, support vector regression on our reduced data set (via principal component analysis) is the best model for our regression, but there is still fine-tuning that must be done respective to the situation, such as which kernel to use. So long as this model is used in periods in which systemic factors are constant, its predictive power is significantly enhanced, and it is therefore recommendable as a component of a decision-making process.