


Recurrent Marked Temporal Point Processes: Embedding Event History to Vector

Nan Du (Georgia Tech), Hanjun Dai (Georgia Tech), Rakshit Trivedi (Georgia Tech), Utkarsh Upadhyay (MPI-SWS), Manuel Gomez-Rodriguez (MPI-SWS), Le Song (Georgia Tech)

ABSTRACT

Large volumes of event data are becoming increasingly available in a wide variety of applications, such as healthcare analytics, smart cities and social network analysis. The precise time interval or the exact distance between two events carries a great deal of information about the dynamics of the underlying systems. These characteristics make such data fundamentally different from independently and identically distributed data and from time-series data where time and space are treated as indexes rather than random variables. Marked temporal point processes are the mathematical framework for modeling event data with covariates. However, typical point process models often make strong assumptions about the generative processes of the event data, which may or may not reflect reality, and the fixed parametric assumptions also restrict the expressive power of the respective processes. Can we obtain a more expressive model of marked temporal point processes? How can we learn such a model from massive data?

In this paper, we propose the Recurrent Marked Temporal Point Process (RMTPP) to simultaneously model the event timings and the markers. The key idea of our approach is to view the intensity function of a temporal point process as a nonlinear function of the history, and use a recurrent neural network to automatically learn a representation of influences from the event history. We develop an efficient stochastic gradient algorithm for learning the model parameters which can readily scale up to millions of events. Using both synthetic and real world datasets, we show that, in the case where the true models have parametric specifications, RMTPP can learn the dynamics of such models without the need to know the actual parametric forms; and in the case where the true models are unknown, RMTPP can also learn the dynamics and achieve better predictive performance than other parametric alternatives based on particular prior assumptions.


KDD '16, August 13-17, 2016, San Francisco, CA, USA. © 2016 ACM. ISBN 978-1-4503-4232-2/16/08. DOI: http://dx.doi.org/10.1145/2939672.2939875

Figure 1: Given the trace of past locations and time, can we predict the location and time of the next stop?

Keywords

Marked temporal point process; Stochastic process; Recurrent neural network

1. INTRODUCTION

Event data with marker information can be produced by social activities, financial transactions, and electronic health records, and contains rich information about what type of event is happening between which entities, when, and where. For instance, people might visit various places at different moments of a day. Algorithmic trading systems buy and sell large volumes of stocks within short time frames. Patients regularly go to the clinic, producing longitudinal records of diagnoses of their diseases.

Although the aforementioned situations come from a broad range of domains, we are interested in a commonly encountered question: based on the observed sequence of events, can we predict what kind of event will take place at what time in the future? Accurately predicting the type and the timing of the next event will have many interesting applications. For mainstream personal assistants, as shown in Figure 1, since people tend to visit different places specific to the temporal/spatial context, successfully predicting their next destination at the most likely time will make such services more relevant and usable. In the stock market, accurately forecasting when to sell or buy a particular stock is critical to business success. For modern healthcare, patients may have several diseases with complicated dependencies on each other, and accurately estimating when a clinical event might occur can facilitate patient-specific care and prevention to reduce potential future risks.

Existing studies in the literature attempt to approach this problem mainly in two ways. First, classic varying-order Markov models [4] formulate the problem as a discrete-time sequence prediction task. Based on the observed sequence of states, they can predict the most likely state the process will evolve into at the next step. As a result, one limit of the family of classic Markov models is that it assumes the process proceeds by unit time-steps, so it cannot capture the heterogeneity of the time needed to predict the timing of the next event. Furthermore, when the number of states is large, Markov models usually cannot capture long dependency on the history, since the overall state-space grows exponentially in the number of time steps considered. The semi-Markov model [26] can model the continuous time-interval between two successive states to some extent by assuming the intervals have very simple distributions, but it still has the same state-space explosion issue when the order grows.

Second, marked temporal point processes and intensityfunctions are a more general mathematical framework formodeling such event data. For example, in seismology, markedtemporal point processes have originally been widely usedfor modeling earthquakes and aftershocks [20, 21, 31]. Eachearthquake can be represented as a point in the temporal-spatial space, and seismologists have proposed different for-mulations to capture the randomness of these events. In thefinancial area, temporal point processes are active researchtopics of econometrics, which often leads to many simple in-terpretations of the complex dynamics of modern electronicmarkets [2, 3].

However, typical point process models, such as the Hawkes process [20] and the autoregressive conditional duration process [12], make specific assumptions about the functional forms of the generative processes, which may or may not reflect reality, and thus the respective fixed, simple parametric representations may restrict the expressive power of these models. How can we obtain a more expressive model of marked temporal point processes, and learn such a model from a large volume of data? In this paper, we propose a novel marked temporal point process, referred to as the Recurrent Marked Temporal Point Process, to simultaneously model the event timings and markers. The key idea of our approach is to view the intensity function of a temporal point process as a nonlinear function of the history of the process, and parameterize the function using a recurrent neural network. More specifically, our work makes the following contributions:

• We propose a novel marked point process to jointly model the time and the marker information by learning a general representation of the nonlinear dependency over the history based on recurrent neural networks. Using our model, event history is embedded into a compact vector representation which can be used for predicting the next event time and marker type.

• We point out that the proposed Recurrent Marked Temporal Point Process establishes a previously unexplored connection between recurrent neural networks and point processes, which has implications beyond temporal-spatial settings by incorporating richer contextual information and features.

• We conduct large-scale experiments on both synthetic and real-world datasets across a wide range of domains to show that our model has consistently better performance for predicting both the event type and timing compared to alternative competitors.

2. PROBLEM DEFINITION

The input data is a set of sequences C = {S^1, S^2, ...}. Each S^i = ((t^i_1, y^i_1), (t^i_2, y^i_2), ...) is a sequence of pairs (t^i_j, y^i_j), where t^i_j is the time when an event of type (or marker) y^i_j has occurred to entity i, and t^i_j < t^i_{j+1}. Depending on the specific application, the entity and the event type can have different meanings. For example, in transportation, S^i can be a trace of time and location pairs for a taxi i, where t^i_j is the time when the taxi picks up or drops off customers in the neighborhood y^i_j. In financial transactions, S^i can be a sequence of time and action pairs for a particular stock i, where t^i_j is the time when a transaction of selling (y^i_j = 0) or buying (y^i_j = 1) has occurred. In electronic health records, S^i is a series of clinical events for patient i, where t^i_j is the time when the patient is diagnosed with the disease y^i_j. Although these applications emerge from a diverse range of domains, we want to build models which are able to:

• Predict the next event pair (t^i_{n+1}, y^i_{n+1}) given a sequence of past events for entity i;
• Evaluate the likelihood of a given sequence of events;
• Simulate a new sequence of events based on the learned parameters of the model.
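To make the input format concrete, here is a minimal plain-Python sketch of such a collection of marked event sequences; the example values and helper name are illustrative only, not part of the paper.

```python
# A collection C of event sequences. Each sequence S^i is a time-ordered list of
# (t_j, y_j) pairs: t_j is a timestamp (float) and y_j is a discrete marker,
# e.g. a neighborhood id for a taxi or 0 = sell / 1 = buy for a stock.
C = [
    [(9.00, 17), (10.50, 42), (11.08, 42)],    # S^1: taxi pick-ups (hours, neighborhood id)
    [(0.012, 1), (0.019, 1), (0.031, 0)],      # S^2: stock transactions (seconds, buy/sell)
]

def inter_event_durations(seq):
    """Return the inter-event durations d_{j+1} = t_{j+1} - t_j of one sequence."""
    times = [t for t, _ in seq]
    return [t2 - t1 for t1, t2 in zip(times, times[1:])]

print(inter_event_durations(C[0]))  # approximately [1.5, 0.58]
```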

3. RELATED WORK

Temporal point processes [6] are mathematical abstractions for many different phenomena across a wide range of domains. In seismology, marked temporal point processes have originally been widely used for modeling earthquakes and aftershocks [20, 19, 21, 31]. In computational finance, temporal point processes are very active research topics in econometrics [2, 3]. In sociology, temporal-spatial point processes have been used to model networks of criminals [35]. In human activity modeling, the Poisson process and its variants have been used to model the inter-event durations of human activities [29, 15]. More recently, the self-exciting point process [20] has become an ongoing hot topic for modeling the latent dynamics of information diffusion [16, 9, 10, 8, 40, 14, 22], online user engagement [13], news-feed streams [7], and context-aware recommendations [11].

A major limitation of these existing studies is that they often make various parametric assumptions about the latent dynamics governing the generation of the observed point patterns. In contrast, in this work, we seek to propose a model that can learn a general and efficient representation of the underlying dynamics from the event history without assuming a fixed parametric form in advance. The advantage is that the proposed model can adapt to the data more flexibly and automatically. We compare the proposed RMTPP with many other processes of specific parametric forms in Section 6 to demonstrate the robustness of RMTPP to model misspecification.

4. MARKED TEMPORAL POINT PROCESS

A marked temporal point process is a powerful mathematical tool to model the latent mechanisms governing the observed random event patterns along time. Since the occurrence of an event may be triggered by what happened in the past, we can essentially specify models for the timing of the next event given what we have already known so far. More formally, a marked temporal point process is a random process whose realization consists of a list of discrete events localized in time, {(t_j, y_j)}, with the timing t_j ∈ R+, the marker y_j ∈ Y, and j ∈ Z+. Let the history H_t be the list of event time and marker pairs up to time t. The length d_{j+1} = t_{j+1} − t_j of the time interval between neighboring successive events t_j and t_{j+1} is referred to as the inter-event duration.

Given the history of past events, we can explicitly specify the conditional density function that the next event will happen at time t with type y as f*(t, y) = f(t, y | H_t), where f*(t, y) emphasizes that this density is conditional on the history. By applying the chain rule, we can derive the joint likelihood of observing a sequence as:

f({(t_j, y_j)}_{j=1}^{n}) = ∏_j f(t_j, y_j | H_t) = ∏_j f*(t_j, y_j).   (1)

One can design many forms for f*(t_j, y_j). However, in practice, people typically choose very simple factorized formulations like f(t_j, y_j | H_t) = f(y_j) f(t_j | ..., t_{j−2}, t_{j−1}) due to the excessive complications caused by jointly and explicitly modeling the timing and the marker information. One can think of f(y_j) as a multinomial distribution when y_j can only take a finite number of values and is totally independent of the history, while f(t_j | ..., t_{j−2}, t_{j−1}) is the conditional density of the event occurring at time t_j given the timing sequence of past events. Note, however, that such an f*(t_j) cannot capture the influence of past markers.

4.1 Parametrizations

The temporal information in a marked point process can be well captured by a typical temporal point process. An important way to characterize temporal point processes is via the conditional intensity function: the stochastic model for the next event time given all previous events. Within a small window [t, t + dt), λ*(t)dt is the probability for the occurrence of a new event given the history H_t:

λ*(t)dt = P{event in [t, t + dt) | H_t}.   (2)

The * notation reminds us that the function depends on the history. Given the conditional density function f*(t), the conditional intensity function can be specified as

λ*(t)dt = f*(t)dt / S*(t) = f*(t)dt / (1 − F*(t)),   (3)

where F*(t) is the cumulative probability that a new event will happen before time t since the last event time t_n, and S*(t) = exp(−∫_{t_n}^{t} λ*(τ)dτ) is the respective probability that no new event has happened up to time t since t_n. As a consequence, the conditional density function can alternatively be specified by

f*(t) = λ*(t) exp(−∫_{t_n}^{t} λ*(τ)dτ).   (4)

Particular functional forms of the conditional intensity function λ*(t) are often designed to capture the phenomena of interest [1]. In the following, we review a few representative examples of typical point processes where the conditional intensity has a particularly specified parametric form.

Poisson process [28]. The homogeneous Poisson process is the simplest point process. The inter-event times are independent and identically distributed random variables conforming to the exponential distribution. The conditional intensity function is assumed to be independent of the history H_t and constant over time, i.e., λ*(t) = λ_0 > 0. For the more general inhomogeneous Poisson process, the intensity is also assumed to be independent of the history H_t, but it can be a function varying over time, i.e., λ*(t) = g(t) > 0.

Hawkes process [20]. A Hawkes process captures the mutual excitation phenomenon among events, with the conditional intensity defined as

λ*(t) = γ_0 + α ∑_{t_j < t} γ(t, t_j),   (5)

where γ(t, t_j) > 0 is the triggering kernel capturing temporal dependencies, γ_0 > 0 is a baseline intensity independent of the history, and the summation of kernel terms is history dependent and a stochastic process by itself. The kernel function can be chosen in advance, e.g., γ(t, t_j) = exp(−β(t − t_j)) or γ(t, t_j) = I[t > t_j], or directly learned from data. A distinctive feature of the Hawkes process is that the occurrence of each historical event increases the intensity by a certain amount. Since the intensity function depends on the history up to time t, the Hawkes process is essentially a conditional Poisson process (or doubly stochastic Poisson process [27]) in the sense that, conditioned on the history H_t, the Hawkes process is a Poisson process formed by the superposition of a background homogeneous Poisson process with intensity γ_0 and a set of inhomogeneous Poisson processes with intensities γ(t, t_j). However, because the events in a past interval can affect the occurrence of events in later intervals, the Hawkes process is in general more expressive than a Poisson process.
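As a concrete illustration, a small NumPy sketch (ours, not the authors' code) that evaluates the Hawkes conditional intensity (5) with an exponential kernel directly from the event history:

```python
import numpy as np

def hawkes_intensity(t, history, gamma0=0.2, alpha=0.8, beta=1.0):
    """lambda*(t) = gamma0 + alpha * sum_{t_j < t} exp(-beta * (t - t_j))."""
    past = np.asarray([t_j for t_j in history if t_j < t], dtype=float)
    return gamma0 + alpha * np.sum(np.exp(-beta * (t - past)))

# Every past event bumps the intensity, which then decays back toward gamma0.
events = [1.0, 1.3, 2.5]
print(hawkes_intensity(3.0, events))
```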

Self-correcting process [25]. In contrast to the Hawkes process, the self-correcting process seeks to produce regular point patterns with the conditional intensity function

λ*(t) = exp(µt − ∑_{t_i < t} α),   (6)

where µ > 0 and α > 0. The intuition is that while the intensity increases steadily, every time a new event appears it is decreased by multiplication with a constant e^{−α} < 1, so the chance of new points decreases after an event has occurred recently.

Autoregressive Conditional Duration process [12]. An alternative way of parametrizing the conditional intensity is to capture the dependency between inter-event durations. Let d_i = t_i − t_{i−1}. The expectation of d_i is given by ψ_i = E(d_i | ..., d_{i−2}, d_{i−1}). The simplest form assumes that d_i = ψ_i ε_i, where the ε_i are independently and identically distributed exponential variables with expectation one. As a consequence, the conditional intensity has the form

λ*(t) = ψ_{N(t)}^{−1},   (7)

where ψ_i = γ_0 + ∑_{j=0}^{m} α_j d_{i−j} captures the influences from the most recent m durations, and N(t) is the total number of events up to t.

4.2 Major Limitations

Curse of Model Misspecification. All these different parameterizations of the conditional intensity function seek to capture and represent certain forms of dependency on the history in different ways: the Poisson process assumes that the duration is stationary; the Hawkes process assumes that the influences from past events are linearly additive towards the current event; the self-correcting process specifies a nonlinear dependency over these past events; and the autoregressive conditional duration model imposes a linear structure between successive inter-event durations. These different parameterizations encode our prior knowledge about the latent dynamics we try to model. In practice, however, the true model is never known. Thus, we have to try different specifications of λ*(t) to tune the predictive performance, and most often we can expect to suffer from errors caused by model misspecification.

Marker Generation. Furthermore, we quite often have additional information (or covariates) associated with each event, such as markers. For instance, the marker of a NYC taxi can be the neighborhood name of the place where it picks up (or drops off) passengers; the marker of each financial transaction can be the action of buying (or selling); and the marker of a clinical event can be the diagnosis of the major disease. Classic temporal point processes can be extended to capture the marker information mainly in two ways: first, the marker is directly incorporated into the intensity function; second, each marker can be regarded as an independent dimension to obtain a multi-dimensional temporal point process. With the former approach, we still need to specify a proper form for the conditional intensity function; moreover, due to the extra complexity of the function induced by the markers, people normally make the strong assumption that the marker is independent of the history [33], which greatly reduces the flexibility of the model. With the latter method, it is very common to have a large number of markers, which results in a sparsity problem where only very few events can happen in each dimension.

5. RECURRENT MARKED TEMPORAL POINT PROCESS

Each parametric form of the conditional intensity function determines the temporal characteristics of a family of point processes. However, without sufficient prior knowledge it is hard to decide which form to use in order to take into account both the marker and the timing information. To tackle this challenge, in this section we propose a unified model capable of modeling a general nonlinear dependency over the history of both the event timing and the marker information.

5.1 Model Formulation

By carefully investigating the various forms of the conditional intensity function (5), (6), and (7), we can observe that they are inherently different representations and realizations of various kinds of dependency structures over the past events. Inspired by this critical insight, we seek to learn a general representation to approximate the unknown dependency structure over the history.

A Recurrent Neural Network (RNN) is a feedforward neural network structure where additional edges, referred to as recurrent edges, are added such that the outputs from the hidden units at the current time step are fed back in as inputs at the next time step. In consequence, the same feedforward neural network structure is replicated at each time step, and the recurrent edges connect the hidden units of the network replicates at adjacent time steps along time; that is, the hidden units with recurrent edges receive input not only from the current data sample but also from the hidden units at the previous time step. This feedback mechanism creates an internal state of the network to memorize the influence of each past data sample. In theory, finite-sized recurrent neural networks with sigmoidal activation units can simulate a universal Turing machine [36], which is able to perform an extremely rich family of computations. In practice, RNNs have been shown to be a powerful tool for general-purpose sequence modeling. For instance, in natural language processing, recurrent neural networks have state-of-the-art predictive performance for sequence-to-sequence translation [24], image captioning [38], and handwriting recognition [18]. They have also long been used for discrete-time series data prediction [30, 39, 34], treating time as discrete indices.

Figure 2: Illustration of the Recurrent Marked Temporal Point Process. For each event with timing t_j and marker y_j, we treat the pair (t_j, y_j) as the input to a recurrent neural network, where the embedding h_j up to time t_j learns a general representation of a nonlinear dependency over both the timing and the marker information from past events. The solid diamond and the circle on the timeline indicate two events of different types y_j ≠ y_{j+1}.

Our key idea is to let the RNN (or a modern variant such as LSTM [23] or GRU [5]) model the nonlinear dependency over both the markers and the timings of past events. As shown in Figure 2, for the event occurring at time t_j with type y_j, the pair (t_j, y_j) is fed as the input into a recurrent neural network unfolded up to the (j+1)-th event. The embedding h_{j−1} represents the memory of the influence from the timings and the markers of past events. The neural network updates h_{j−1} to h_j by taking into account the effect of the current event (t_j, y_j). Since h_j now represents the influence of the history up to the j-th event, the conditional density for the next event timing can naturally be represented as

f*(t_{j+1}) = f(t_{j+1} | H_t) = f(t_{j+1} | h_j) = f(d_{j+1} | h_j),   (8)

where d_{j+1} = t_{j+1} − t_j. As a consequence, we can rely on h_j to make predictions of the timing t_{j+1} and the type y_{j+1} of the next event.

The advantage of this formulation is that we explicitly embed the event history into a latent vector space, and by the elegant relation (4), we are now able to capture a general form of the conditional intensity function λ*(t) without the need to specify a fixed parametric form for the dependency structure over the history. Figure 3 presents the overall architecture of the proposed RMTPP. Given a sequence of events S = ((t_j, y_j))_{j=1}^{n}, we design an RNN which computes a sequence of hidden units h_j by iterating the following components.

Input Layer. At the j-th event, the input layer first projects the sparse one-hot vector representation of the marker y_j into a latent space. We add an embedding layer with weight matrix W_em to achieve a more compact and efficient representation y_j = W_em^T y_j + b_em, where b_em is the bias. We learn W_em and b_em while training the network. In addition, for the timing input t_j, we can extract the associated temporal features t_j (e.g., the inter-event duration d_j = t_j − t_{j−1}).

Figure 3: Architecture of RMTPP. For a given sequence S = ((t_j, y_j))_{j=1}^{n}, at the j-th event, the marker y_j is first embedded into a latent space. Then, the embedded vector and the temporal features are fed into the recurrent layer. The recurrent layer learns a representation that summarizes the nonlinear dependency over the previous events. Based on the learned representation h_j, it outputs the prediction for the next marker y_{j+1} and timing t_{j+1} to calculate the respective loss functions.

Hidden Layer. We update the hidden vector after receiving the current input and the memory h_{j−1} from the past. In an RNN, we have

h_j = max{W^y y_j + W^t t_j + W^h h_{j−1} + b_h, 0}.   (9)

Marker Generation. Given the learned representation h_j, we model the marker generation with a multinomial distribution:

P(y_{j+1} = k | h_j) = exp(V^y_{k,:} h_j + b^y_k) / ∑_{k=1}^{K} exp(V^y_{k,:} h_j + b^y_k),   (10)

where K is the number of markers and V^y_{k,:} is the k-th row of matrix V^y.

Conditional Intensity. Based on h_j, we can now formulate the conditional intensity function as

λ*(t) = exp( v^t · h_j + w^t (t − t_j) + b^t ),   (11)

where v^t is a column vector and w^t, b^t are scalars. More specifically:

• The first term v^t · h_j represents the accumulative influence from the marker and the timing information of the past events. Compared to the fixed parametric formulations of (5), (6), and (7) for the past influence, we now have a highly nonlinear general specification of the dependency over the history.
• The second term emphasizes the influence of the current event j.
• The last term gives a base intensity level for the occurrence of the next event.
• The exponential function outside acts as a nonlinear transformation and guarantees that the intensity is positive.
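To make the per-event computation of (9)-(11) concrete, the following is a minimal NumPy sketch of one forward step; the parameter shapes, random initialization, and function name are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D_em, D_h = 5, 8, 16                       # markers, embedding size, hidden size

# Illustrative parameters (these are learned in practice).
W_em = rng.normal(scale=0.1, size=(K, D_em)); b_em = np.zeros(D_em)
W_y  = rng.normal(scale=0.1, size=(D_h, D_em))
W_t  = rng.normal(scale=0.1, size=(D_h, 1))
W_h  = rng.normal(scale=0.1, size=(D_h, D_h)); b_h = np.zeros(D_h)
V_y  = rng.normal(scale=0.1, size=(K, D_h));  b_y = np.zeros(K)
v_t  = rng.normal(scale=0.1, size=D_h);       w_t, b_t = 0.1, 0.0

def rmtpp_step(y_j, d_j, h_prev):
    """One event step: embed the marker, update the hidden state (9), and
    return the next-marker distribution (10) and the intensity function (11)."""
    y_emb = W_em[y_j] + b_em                       # embedding of the one-hot marker
    t_feat = np.array([d_j])                       # temporal feature, e.g. d_j = t_j - t_{j-1}
    h_j = np.maximum(W_y @ y_emb + W_t @ t_feat + W_h @ h_prev + b_h, 0.0)   # (9)
    logits = V_y @ h_j + b_y
    p_marker = np.exp(logits - logits.max()); p_marker /= p_marker.sum()     # (10), softmax
    intensity = lambda t, t_j: np.exp(v_t @ h_j + w_t * (t - t_j) + b_t)     # (11)
    return h_j, p_marker, intensity

h, p, lam = rmtpp_step(y_j=2, d_j=0.7, h_prev=np.zeros(D_h))
print(p, lam(1.5, t_j=1.0))
```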

By invoking the elegant relation between the conditional intensity function and the conditional density function in (4), we can derive the likelihood that the next event will occur at time t given the history:

f*(t) = λ*(t) exp(−∫_{t_j}^{t} λ*(τ)dτ)
      = exp{ v^t · h_j + w^t (t − t_j) + b^t + (1/w^t) exp(v^t · h_j + b^t) − (1/w^t) exp(v^t · h_j + w^t (t − t_j) + b^t) }.   (12)

Then, we can estimate the timing of the next event using the expectation

t_{j+1} = ∫_{t_j}^{∞} t · f*(t) dt.   (13)

In general, the integration in (13) does not have an analytic solution, so we can instead apply commonly used numerical integration techniques [32] for one-dimensional functions to compute (13).
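For concreteness, here is a small NumPy sketch (our illustration, continuing the toy parameterization above) that evaluates the closed-form density (12) and approximates the expectation (13) by trapezoidal integration on a truncated grid:

```python
import numpy as np

def rmtpp_log_density(d, past_influence, w_t=0.1, b_t=0.0):
    """log f*(t_j + d) from (12), written in terms of the elapsed time d >= 0;
    `past_influence` is the scalar v^t . h_j."""
    c = past_influence + b_t
    return c + w_t * d + (np.exp(c) - np.exp(c + w_t * d)) / w_t

def expected_next_time(t_j, past_influence, w_t=0.1, b_t=0.0, horizon=50.0, n=20001):
    """Approximate (13), truncating the integral at t_j + horizon."""
    d = np.linspace(0.0, horizon, n)
    f = np.exp(rmtpp_log_density(d, past_influence, w_t, b_t))
    y = (t_j + d) * f
    return np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(d))   # trapezoidal rule

print(expected_next_time(t_j=10.0, past_influence=0.3))
```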

Remark. Based on the hidden unit of the RNN, we are able to learn a unified representation of the dependency over the history. In consequence, the direct formulation (11) of the conditional intensity function captures information from both past event timings and past event markers. On the other hand, since the prediction of the marker also depends nonlinearly on the past timing information, this may also improve the performance of the classification task when the two sources of information are correlated with each other. In fact, experiments on synthetic and real world datasets in the following experimental section do verify such a mutual boosting phenomenon.

5.2 Parameter Learning

Given a collection of sequences C = {S^i}, where S^i = ((t^i_j, y^i_j))_{j=1}^{n_i}, we can learn the model by maximizing the joint log-likelihood of observing C:

ℓ({S^i}) = ∑_i ∑_j ( log P(y^i_{j+1} | h_j) + log f(d^i_{j+1} | h_j) ).   (14)

We exploit Back Propagation Through Time (BPTT) for training RMTPP. Given a BPTT window of size b, we unroll the model in Figure 3 by b steps. In each training iteration, we take b consecutive samples {(t^i_k, y^i_k)}_{k=j}^{j+b} from a single sequence, apply the feed-forward operation through the network, and update the parameters with respect to the loss function. After we unroll the model for b steps through time, all parameters are shared across these copies and are updated sequentially in the back propagation stage. In our algorithm framework^1, we need both sparse (the marker y_j) and dense features at time t_j. Meanwhile, the output is also a mix of discrete markers and real-valued times, which is fed into different loss functions, namely the cross-entropy of the next predicted marker and the negative log-likelihood of the next predicted event timing. Therefore, we build an efficient and flexible platform^2 particularly optimized for training general directed acyclic structured computational graphs (DAGs). The backend is supported via CUDA and MKL for the GPU and CPU platforms, respectively. In the end, we apply stochastic gradient descent (SGD) with mini-batches and several other techniques for training neural networks [37].

1 https://github.com/dunan/NeuralPointProcess
2 https://github.com/Hanjun-Dai/graphnn
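The implementation released by the authors is linked in the footnotes above. Purely as an illustration of the unrolled loss in (14) and one BPTT update, a hedged PyTorch sketch (all module names, sizes, and the toy data are our assumptions, not the paper's code) might look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMTPPCell(nn.Module):
    """A minimal sketch of the RMTPP step (9)-(11) and loss (14); illustrative only."""
    def __init__(self, n_markers, d_emb=32, d_hidden=64):
        super().__init__()
        self.embed = nn.Embedding(n_markers, d_emb)                       # W_em, b_em
        self.rnn = nn.RNNCell(d_emb + 1, d_hidden, nonlinearity="relu")   # (9)
        self.marker_head = nn.Linear(d_hidden, n_markers)                 # V^y, b^y -> (10)
        self.v_t = nn.Linear(d_hidden, 1)                                 # v^t, b^t -> (11)
        self.w_t = nn.Parameter(torch.tensor(0.1))

    def forward(self, y_j, d_j, h_prev):
        x = torch.cat([self.embed(y_j), d_j.unsqueeze(-1)], dim=-1)
        return self.rnn(x, h_prev)

    def log_density(self, h_j, d_next):
        """log f*(d_{j+1} | h_j) from the closed form (12)."""
        c = self.v_t(h_j).squeeze(-1)
        return c + self.w_t * d_next + (torch.exp(c) - torch.exp(c + self.w_t * d_next)) / self.w_t

    def loss(self, h_j, y_next, d_next):
        """Cross-entropy of the next marker plus negative log-likelihood of the next timing."""
        return F.cross_entropy(self.marker_head(h_j), y_next) - self.log_density(h_j, d_next).mean()

# One BPTT update over a window of b = 10 consecutive events (random toy data).
model = RMTPPCell(n_markers=5)
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-3)
y = torch.randint(0, 5, (8, 11)); d = torch.rand(8, 11)       # batch of 8 sequences
h, loss = torch.zeros(8, 64), 0.0
for j in range(10):
    h = model(y[:, j], d[:, j], h)
    loss = loss + model.loss(h, y[:, j + 1], d[:, j + 1])
opt.zero_grad(); loss.backward(); opt.step()
```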


6. EXPERIMENT

We evaluate RMTPP on large-scale synthetic and real world data. We compare it to several discrete-time and continuous-time sequential models, showing that RMTPP is more robust to model misspecification than these alternatives.

6.1 Baselines

To evaluate the predictive performance of forecasting markers, we compare with the following discrete-time models:

• Majority Prediction. This is also known as the 0-order Markov Chain (MC-0), where at each time step we always predict the most popular marker regardless of the history. Most often, predicting the most popular type is a strong heuristic.
• Markov Chain. We compare with Markov models of varying orders from one to three, denoted as MC-1, MC-2, and MC-3, respectively.

To show the effectiveness of predicting time, we compare with the following continuous-time models:

• ACD. We fit a second-order autoregressive conditional duration process with the intensity function given in (7).
• Homogeneous Poisson Process. The intensity function is a constant, which produces an estimate of the average inter-event gap.
• Hawkes Process. We fit a self-exciting Hawkes process with the intensity function in (5).
• Self-correcting Process. We fit a self-correcting process with the intensity function in (6).

Finally, we compare with the Continuous-Time Markov Chain (CTMC) model, which learns continuous transition rates between two states (or markers). This model predicts the next state with the earliest transition time, so it can predict both the marker and the timing of the next event jointly.

6.2 Synthetic Data

To show the robustness of RMTPP, we propose the following generative processes^3:

Autoregressive Conditional Duration. The conditional density function for the next duration d_n conforms to an exponential distribution with expectation determined by the past m durations, in the following form (denoted ACD):

f(d_n | H_{n−1}) = α_n exp(−α_n d_n),   α_n = (µ_0 + γ ∑_{i=1}^{m} d_{n−i})^{−1},   (15)

where µ_0 is the base duration used to generate the first event starting from zero, d_1 ∼ (µ_0)^{−1} exp(−d_1/µ_0). We set m = 2, µ_0 = 0.5 and γ = 0.25.
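A minimal generator matching this ACD specification (a sketch of (15); not the simulation code linked in the footnote) can be written as:

```python
import numpy as np

def simulate_acd(n_events, m=2, mu0=0.5, gamma=0.25, seed=0):
    """Draw durations d_n with mean mu0 + gamma * (sum of the previous m durations),
    i.e. d_n ~ Exponential(rate = alpha_n) with alpha_n as in (15)."""
    rng = np.random.default_rng(seed)
    durations = [rng.exponential(mu0)]               # d_1 has mean mu0
    for _ in range(n_events - 1):
        mean = mu0 + gamma * sum(durations[-m:])     # 1 / alpha_n
        durations.append(rng.exponential(mean))      # numpy parameterizes by the mean
    return np.cumsum(durations)                      # event times t_n

times = simulate_acd(1000)
print(times[:5])
```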

Hawkes Process. The conditional intensity function is given by λ(t) = λ_0 + α ∑_{t_i < t} exp(−(t − t_i)/σ), where λ_0 = 0.2, α = 0.8 and σ = 1.0.

Self-Correcting Process. The conditional intensity function is given by λ(t) = exp(µt − ∑_{t_i < t} α), where µ = 1 and α = 0.2.

State-Space Continuous-Time Model. To model the influence from both markers and time, we further propose the State-Time Mixture model, constructed with the following steps:

3 https://github.com/dunan/MultiVariatePointProcess

1. For each time t_{n−1}, we take t_{n−1} modulo a period of P = 24. If the residual is greater than 12, the process is defined to be in the time state r_{n−1} = 0; otherwise, it is in the time state r_{n−1} = 1.

2. Based on the combination of both the time states {r_{n−j}}_{j=1}^{m} and the marker states {y_{n−j}}_{j=1}^{m} of the previous m events, the process takes marker k at the next step with probability P(y_n = k | {y_{n−j}}_{j=1}^{m}, {r_{n−j}}_{j=1}^{m}).

3. Similarly, based on the combination of {y_{n−j}}_{j=1}^{m} and {r_{n−j}}_{j=1}^{m} from the previous m events, the duration d_n = t_n − t_{n−1} follows a Poisson distribution with expectation determined jointly by {r_{n−j}}_{j=1}^{m} and {y_{n−j}}_{j=1}^{m}. Here, we use the Poisson distribution to mimic elapsed time units (e.g., hours, minutes).

In our experiments, without loss of generality, we set the total number of markers to two, m = 3, and randomly initialize the transition probabilities between states.
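The following sketch (ours; the transition tables are randomly initialized, as in the setup above, but the context encoding and the remaining details are illustrative assumptions) generates data from such a state-time mixture:

```python
import numpy as np

def simulate_state_time(n_events, m=3, n_markers=2, period=24.0, seed=0):
    """Generate events whose next marker and (Poisson-distributed) duration both
    depend on the joint time-state / marker-state of the previous m events."""
    rng = np.random.default_rng(seed)
    n_ctx = (2 * n_markers) ** m                                    # joint (r, y) contexts over m events
    marker_probs = rng.dirichlet(np.ones(n_markers), size=n_ctx)    # P(y_n = k | context)
    duration_means = rng.uniform(1.0, 10.0, size=n_ctx)             # E[d_n | context]

    times, markers = [0.0], [int(rng.integers(n_markers))]
    while len(times) < n_events:
        ctx = 0                                        # encode the last (up to) m events
        for t, y in zip(times[-m:], markers[-m:]):
            r = 0 if (t % period) > period / 2 else 1  # time state r
            ctx = ctx * (2 * n_markers) + r * n_markers + y
        y_next = rng.choice(n_markers, p=marker_probs[ctx % n_ctx])
        d_next = rng.poisson(duration_means[ctx % n_ctx])
        times.append(times[-1] + float(d_next))
        markers.append(int(y_next))
    return np.array(times), np.array(markers)

t, y = simulate_state_time(1000)
```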

Experimental Results. Figure 4 presents the predictive performance of RMTPP fitted to different types of time-series data, where for each case we simulate 1,000,000 events and use 90% for training and the remaining 10% for testing. We first compare the predictive performance of RMTPP with the optimal estimator in the left column, where the optimal estimator knows the true conditional intensity function. We treat the expectation of the time interval between the current and the next event as our estimate. Grey curves are the observed inter-event durations of 100 successive events in the testing data. Blue curves are the respective expectations given by the optimal estimator. Red curves are the predictions from RMTPP. We can observe that even though RMTPP has no prior knowledge about the true functional form of each process, its predictive performance is almost consistent with the respective optimal estimator.

The middle column of Figure 4 compares the learned conditional intensity functions (red curves) with the true ones (blue curves). It clearly demonstrates that RMTPP is able to adaptively and accurately capture the unknown heterogeneous temporal dynamics of different time-series data. In particular, because the order of dependency over the history is fixed for ACD, RMTPP almost exactly learns the conditional intensity function with a comparable number of BPTT steps. The Hawkes and the self-correcting processes are more challenging in that the conditional intensity function depends on the entire history. Because the events are far from being uniformly distributed, the influence from an individual past event on the occurrence of new future events can vary widely. From this perspective, these processes essentially have a randomly varying order of dependency on the history compared to ACD. However, with properly chosen BPTT steps, RMTPP can accurately capture the general shape and every single change point of the true intensity function. In particular, for the Hawkes case, the abruptly increased intensity from time index 60 to 100 results in 40 events in a very tiny time interval, but the predictions of RMTPP still capture the trend of the true data.

The right column of Figure 4 reports the overall RMSE between the predictions and the true testing data for the different processes. We can observe that RMTPP has very strong competitive performance and better robustness against model misspecification, capturing the heterogeneity of the latent temporal dynamics of different time-series data compared to other parametric alternatives.

Figure 4: Time predictions on the testing data produced from different processes. The left column is the predicted inter-event time: the blue curve is the optimal estimator, and the red curve is RMTPP without any prior knowledge about the true dynamics of each case. The middle column shows the learned intensity functions vs. the respective true ones. The right column gives the overall testing RMSE of predicting the timings for different processes.

Figure 5: Performance evaluation of predicting timings and markers on the state-space continuous-time model: (a) time prediction curve, (b) time prediction error, (c) marker prediction error.

Figure 6: Predictive performance comparison with an RNN which is trained for predicting the next timing only in (a), and for predicting the next marker only in (b).

Figure 7: Performance evaluation for predicting both the marker and the timing of the next event on the NYC Taxi, Financial, Stack Overflow, and MIMIC II datasets. The top row presents the classification error of predicting the event markers, and the bottom row gives the RMSE of predicting the event timings.

In addition to time, the state-space continuous-time model also includes marker information. Figure 5 compares the error rates of different processes in predicting both event timings and markers. Compared to the other baselines, RMTPP is again consistent with the optimal estimator without any prior knowledge about the true underlying generative process.

Finally, since the occurrences of future events depend on both the past marker and timing information, we would like to investigate whether learning a unified representation of the joint information can further improve future predictions. Therefore, we train an RNN using only the temporal information and one using only the marker information, respectively. Figure 6 gives the comparisons between RMTPP and these RNNs: in panel (a), the RNN has an RMSE of 4.3522 while RMTPP achieves an RMSE of 2.7395, and in panel (b), the RNN reports a 39.59% classification error while RMTPP reaches 27.16%. Clearly, this verifies that jointly modeling both sources of information can boost the performance of predicting future events.

6.3 Real Data

We evaluate the predictive performance of RMTPP on real world datasets from a diverse range of domains.

New York City Taxi Dataset. The NYC taxi dataset^4 contains ∼173 million trip records of individual taxis over 12 consecutive months in 2013. The location information is available in the form of latitude/longitude coordinates. Each record also contains the temporal information of picking up (dropping off) passengers for every trip. We used the NYC Neighborhood Names GIS dataset^5 to map the coordinates to neighborhood names. For those coordinates whose location name is not directly available in the GIS dataset, we use geodesic distance to map them to the nearest neighborhood name. With this process we obtained 299 unique locations as our markers. An event is a pickup record for a taxi. Further, we divided each single sequence of a taxi into multiple fine-grained subsequences in which two consecutive events are within 12 hours. We obtained 670,753 sequences in total, of which 536,603 were used for training and 134,150 for testing. We predict the location and the time of the next pickup event.

Financial Transaction Dataset. We collected raw limit order book data from the NYSE of the high-frequency transactions for one stock in one day. It contains 0.7 million transaction records, each of which records the time (in milliseconds) and the possible action (B = buy, S = sell). We treat the type of action as the marker. The input data is a single long sequence with 624,149 events for training and 69,350 events for testing. The task is to predict which action will be taken next, and at what time.

Electronic Medical Records. The MIMIC II medical dataset is a collection of de-identified clinical visit records of Intensive Care Unit patients over seven years. We filtered out 650 patients and 204 diseases. Each event records the time when a patient had a visit to the hospital. We used the sequences of 585 patients for training and the rest for testing. The goal is to predict which major disease will happen to a given patient, and at what time in the future.

Stack Overflow Dataset. Stack Overflow^6 is a question-answering website which uses badges to encourage user engagement and guide behavior [17]. There are 81 types of non-topical (i.e., non-tag-affiliated) badges which can be awarded either only once (e.g., Altruist, Inquisitive) or multiple times (e.g., Stellar Question, Guru, Great Answer) to a user. Ignoring the badges which can be awarded only once, we first select users who have earned at least 40 badges between 2012-01-01 and 2014-01-01, and then those badges which have been awarded at least 100 times to the users selected in the first step. We removed users who have been instantaneously awarded multiple badges due to technical issues of the servers. In the end, we have ∼6 thousand users with a total of ∼480 thousand events, where each badge is treated as a marker.

4 http://www.andresmh.com/nyctaxitrips/
5 https://data.cityofnewyork.us/City-Government/Neighborhood-Names-GIS/99bc-9p23
6 https://archive.org/details/stackexchange

Figure 8: Predictive performance comparison with RNNs which are trained only using the markers in the left column and only using the temporal information in the right column, on the NYC Taxi, Financial, Stack Overflow, and MIMIC II datasets.

Experimental Results. We compare and report the predictive performance of the different models on the testing data of each dataset in Figure 7. The hyper-parameters of RMTPP across all these datasets are tuned as follows: learning rate in {0.1, 0.01, 0.001}; hidden layer size in {64, 128, 256, 512, 1024}; momentum = 0.9 and L2 penalty = 0.001; and batch size in {16, 32, 64}. Figure 7 compares the predictive performance of forecasting markers and timings for the next event across the four real datasets. RMTPP outperforms the other alternatives with lower errors for predicting both timings and markers. Because the MIMIC-II dataset has many short sequences and is the smallest of the four datasets, increasing the order of the Markov chain decreases its classification performance.

We also compare RMTPP with RNNs trained only with the marker information and only with the timing information, separately, in Figure 8. We can observe that RMTPP, trained by incorporating both the past marker and the past timing information, performs consistently better than an RNN trained with either source of information alone. Finally, Figure 9 shows the empirical distribution of the inter-event times on the Stack Overflow and the financial transaction data. Even though the real datasets have quite different characteristics, RMTPP is more flexible in capturing such heterogeneity than the other temporal processes of fixed parametric forms.

Figure 9: Empirical distribution of the inter-event times for (a) Stack Overflow and (b) Financial Transaction data. The x-axis is in log-scale.

7. DISCUSSIONS

We present the Recurrent Marked Temporal Point Process, which builds a connection between recurrent neural networks and point processes. The recurrent neural network supports different architectures, including the classic RNN and modern variants such as LSTM and GRU. Besides the inter-event temporal features, our model can readily be generalized to incorporate other contextual information. For instance, in addition to training a global model, we can also take potential user-profile features into account for personalization. Furthermore, based on the structural information of social networks, our model can be generalized so that the prediction for one user's sequence depends not only on her own history but also on the other users' histories, capturing their interactions in a networked setting.

To conclude, RMTPP inherits the advantages of both recurrent neural networks and temporal point processes to predict both the marker and the timing of future events without any prior knowledge about the hidden functional forms of the latent temporal dynamics. Experiments on both synthetic and real world datasets demonstrate that RMTPP is robust to model misspecification and has consistently better performance compared to the other alternatives.


Acknowledgements

This project was supported in part by NSF/NIH BIGDATA 1R01GM108341, ONR N00014-15-1-2340, NSF IIS-1218749, and NSF CAREER IIS-1350983.

8. REFERENCES

[1] O. Aalen, O. Borgan, and H. Gjessing. Survival and event history analysis: a process point of view. Springer, 2008.
[2] E. Bacry, A. Iuga, M. Lasnier, and C.-A. Lehalle. Market impacts and the life cycle of investors orders. 2014.
[3] E. Bacry, T. Jaisson, and J.-F. Muzy. Estimation of slowly decreasing Hawkes kernels: Application to high frequency order book modelling. 2015.
[4] R. Begleiter, R. El-Yaniv, and G. Yona. On prediction using variable order Markov models. J. Artif. Intell. Res. (JAIR), 22:385–421, 2004.
[5] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259, 2014.
[6] D. Daley and D. Vere-Jones. An introduction to the theory of point processes: volume II: general theory and structure. Springer, 2007.
[7] N. Du, M. Farajtabar, A. Ahmed, A. J. Smola, and L. Song. Dirichlet-Hawkes processes with applications to clustering continuous-time document streams. In KDD. ACM, 2015.
[8] N. Du, L. Song, M. Gomez-Rodriguez, and H. Zha. Scalable influence estimation in continuous-time diffusion networks. In NIPS, 2013.
[9] N. Du, L. Song, A. J. Smola, and M. Yuan. Learning networks of heterogeneous influence. In NIPS, 2012.
[10] N. Du, L. Song, H. Woo, and H. Zha. Uncover topic-sensitive information diffusion networks. In Artificial Intelligence and Statistics (AISTATS), 2013.
[11] N. Du, Y. Wang, N. He, and L. Song. Time-sensitive recommendation from recurrent user activities. In NIPS, 2015.
[12] R. F. Engle and J. R. Russell. Autoregressive conditional duration: A new model for irregularly spaced transaction data. Econometrica, 66(5):1127–1162, 1998.
[13] M. Farajtabar, N. Du, M. Gomez-Rodriguez, I. Valera, H. Zha, and L. Song. Shaping social activity by incentivizing users. In NIPS, 2014.
[14] M. Farajtabar, M. Gomez-Rodriguez, N. Du, M. Zamani, H. Zha, and L. Song. Back to the past: Source identification in diffusion networks from partially observed cascades. In AISTATS, 2015.
[15] A. Ferraz Costa, Y. Yamaguchi, A. Juci Machado Traina, C. Traina, Jr., and C. Faloutsos. RSC: Mining and modeling temporal activity in social media. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15, pages 269–278, 2015.
[16] M. Gomez-Rodriguez, D. Balduzzi, and B. Scholkopf. Uncovering the temporal dynamics of diffusion networks. In Proceedings of the International Conference on Machine Learning, 2011.
[17] S. Grant and B. Betts. Encouraging user behaviour with achievements: an empirical study. In Mining Software Repositories (MSR), 2013 10th IEEE Working Conference on, pages 65–68. IEEE, 2013.
[18] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 855–868, 2009.
[19] A. G. Hawkes. Point spectra of some mutually exciting point processes. Journal of the Royal Statistical Society Series B, 33:438–443, 1971.
[20] A. G. Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83–90, 1971.
[21] A. G. Hawkes and D. Oakes. A cluster process representation of a self-exciting process. Journal of Applied Probability, pages 493–503, 1974.
[22] X. He, T. Rekatsinas, J. Foulds, L. Getoor, and Y. Liu. HawkesTopic: A joint model for network inference and topic modeling from text-based cascades. In ICML, pages 871–880, 2015.
[23] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[24] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. 2014.
[25] V. Isham and M. Westcott. A self-correcting point process. Advances in Applied Probability, 37:629–646, 1979.
[26] J. Janssen and N. Limnios. Semi-Markov Models and Applications. Kluwer Academic, 1999.
[27] J. Kingman. On doubly stochastic Poisson processes. Mathematical Proceedings of the Cambridge Philosophical Society, pages 923–930, 1964.
[28] J. F. C. Kingman. Poisson processes, volume 3. Oxford University Press, 1992.
[29] R. D. Malmgren, D. B. Stouffer, A. E. Motter, and L. A. N. Amaral. A Poissonian explanation for heavy tails in e-mail communication. Proceedings of the National Academy of Sciences, 105(47):18153–18158, 2008.
[30] H. Min, X. Jiahui, X. Shiguo, and Y. Fuliang. Prediction of chaotic time series based on the recurrent predictor neural network. IEEE Transactions on Signal Processing, 52:3409–3416, 2004.
[31] Y. Ogata. Space-time point-process models for earthquake occurrences. Annals of the Institute of Statistical Mathematics, 50(2):379–402, 1998.
[32] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computation. Cambridge University Press, Cambridge, UK, 1994.
[33] J. G. Rasmussen. Temporal point processes: the conditional intensity function. http://people.math.aau.dk/~jgr/teaching/punktproc11/tpp.pdf, 2009.
[34] C. Rohitash and Z. Mengjie. Cooperative coevolution of Elman recurrent neural networks for chaotic time series prediction. Neurocomputing, 86:116–123, 2012.
[35] M. Short, G. Mohler, P. Brantingham, and G. Tita. Gang rivalry dynamics via coupled point process networks. Discrete and Continuous Dynamical Systems Series B, 19:1459–1477, 2014.
[36] H. T. Siegelmann and E. D. Sontag. Turing computability with neural nets. Applied Mathematics Letters, 4:77–80, 1991.
[37] I. Sutskever, J. Martens, G. E. Dahl, and G. E. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 1139–1147. JMLR Workshop and Conference Proceedings, 2013.
[38] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.
[39] C. Xindi, Z. Nian, V. Ganesh K., and W. I. Donald C. Time series prediction with recurrent neural networks trained by a hybrid PSO-EA algorithm. Neurocomputing, 70:2342–2353, 2007.
[40] Q. Zhao, M. A. Erdogdu, H. Y. He, A. Rajaraman, and J. Leskovec. SEISMIC: A self-exciting point process model for predicting tweet popularity. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15, pages 1513–1522, 2015.