Real-Time Task Recognition in Cataract Surgery Videos Using Adaptive Spatiotemporal Polynomials

Gwénolé Quellec, Mathieu Lamard, Béatrice Cochener, Guy Cazuguel

IEEE Transactions on Medical Imaging, DOI 10.1109/TMI.2014.2366726

Abstract—This paper introduces a new algorithm for recognizing surgical tasks in real time in a video stream. The goal is to communicate information to the surgeon in due time during a video-monitored surgery. The proposed algorithm is applied to cataract surgery, which is the most common eye surgery. To compensate for eye motion and zoom level variations, cataract surgery videos are first normalized. Then, the motion content of short video subsequences is characterized with spatiotemporal polynomials: a multiscale motion characterization based on adaptive spatiotemporal polynomials is presented. The proposed solution is particularly suited to characterizing deformable moving objects with fuzzy borders, which are typically found in surgical videos. Given a target surgical task, the system is trained to identify which spatiotemporal polynomials are usually extracted from videos when, and only when, this task is being performed. These key spatiotemporal polynomials are then searched for in new videos to recognize the target surgical task. For improved performance, the system jointly adapts the spatiotemporal polynomial basis and identifies the key spatiotemporal polynomials using the multiple-instance learning paradigm. The proposed system runs in real time and outperforms the previous solution from our group, both for surgical task recognition (Az = 0.851 on average, as opposed to Az = 0.794 previously) and for the joint segmentation and recognition of surgical tasks (Az = 0.856 on average, as opposed to Az = 0.832 previously).

Index Terms—cataract surgery, real-time task recognition, spatiotemporal polynomials, multiple-instance learning

I. INTRODUCTION

In anterior eye segment surgeries, the surgeon wears a binocular microscope and the output of the microscope is video-recorded. Real-time analysis of these videos may be useful to automatically communicate information to the surgeon in due time. Typically, relevant information about the patient, the surgical tools or the implant may be communicated to the surgeon whenever he or she begins a new surgical task. Such a system would be particularly useful to less experienced surgeons: recommendations on how to best perform the current or the next task, given the patient's specificities, may be communicated to them. To achieve this goal, we must be able to recognize surgical tasks in real time during the surgery.

All authors are with Inserm, UMR 1101, SFR ScInBioS, Brest, F-29200 France (e-mail: [email protected]). M. Lamard and B. Cochener are with Univ Bretagne Occidentale, Brest, F-29200 France. K. Charrière and G. Cazuguel are with Institut Mines-Telecom; Telecom Bretagne; UEB; Dpt ITI, Brest, F-29200 France. B. Cochener is with CHRU Brest, Service d'Ophtalmologie, Brest, F-29200 France.

In this paper, we focus on cataract surgery, which is the most common eye surgery [1]. An algorithm has been proposed for the automatic segmentation of cataract surgery videos into surgical tasks [2]. Temporal segmentation is based either on Dynamic Time Warping or on a Hidden Markov Model [3]: the visual content of images is described by visual features extracted within the pupil area. However, that algorithm does not allow real-time recognition of the surgical tasks: temporal segmentation can only be performed when the surgical video is available in full, i.e. after the end of the surgery. In previous works, we have already presented a solution for the automated recognition of surgical tasks in manually-delimited cataract surgery videos [4]. The feature vectors used in that previous solution were unchanged by variations in duration and temporal structure among the target surgical tasks. Therefore, it was possible to retrieve similar video segments and categorize surgical tasks, in real time, using simple and fast distance measures. Recently, we have presented a solution for the joint segmentation and recognition of surgical tasks in full cataract surgery videos [5], also in real time. Segmentation relies on the detection of 'idle phases', during which little motion is detected in the surgical scene, which may indicate that the surgeon is changing tools. 'Action phases', which are delimited by two idle phases, are categorized using a Conditional Random Field (CRF) [6] whenever their end is detected. Both segmentation and categorization rely on the video analysis framework proposed in [4]. Although quite fast, that video analysis framework relied on a very simple motion description, which may limit the task recognition ability, hence the need for a new framework. Regarding other surgeries, an overview of existing methods for the automatic recognition and temporal segmentation of tasks or gestures can be found in our previous paper [4]; research in laparoscopic surgery is particularly active [7], [8], [9]. Note that none of these methods was designed to run in real time.

In automatic video analysis systems, the visual content of a video is usually characterized by feature vectors that represent the shape, the texture, the color and, more importantly, the motion content of the video at different time instants [10], [11]. Motion feature extraction usually involves motion segmentation [12], [13] or salient point characterization [14], [15]. If moving objects have fuzzy borders or are deformable, as in a surgical video, then segmenting the motion content is challenging. Detecting salient points, on the other hand, is possible, but useful information does not necessarily lie where the salient points have been detected. A different solution is proposed in this paper: the motion content of short video subsequences is characterized globally, using a deformable motion model; a polynomial model was adopted. A few authors proposed the use of spatial polynomials [16], [17] or spatiotemporal polynomials [18], [19] for motion analysis. However, the order of the spatiotemporal polynomials was limited to 2 [18] or 3 [19]; a generalization to arbitrary spatiotemporal polynomial orders is proposed in this paper. More importantly, we propose to adapt the polynomial basis to each detection problem.

Recognizing surgical tasks is challenging: a task may be composed of multiple gestures, some of which are specific to this task, some of which are not. To achieve high task recognition performance, feature extraction and classification need to be trained. Given a target surgical task, we need to provide the classifier with examples of videos where this task is visible. Ideally, we would ask an expert to temporally segment key surgical gestures in these videos, in order to help the classifier identify what it should look for in videos. However, this solution would require too much annotation work. Therefore, the classifier should be able to recognize these key gestures itself: for supervision, we simply indicate which videos contain the target surgical task (and therefore the key surgical gestures). Once this is done, detecting those key gestures in new videos is straightforward. Duchenne et al. [20] proposed a solution based on support-vector machines and the bag-of-visual-words (BoW) model [21], [22], [23]. A solution based on Multiple-Instance Learning (MIL) [24], a variation on supervised learning, is presented in this paper. In MIL, the supervision labels are not assigned to instances (surgical gestures in our case) but to bags containing multiple instances (surgical tasks as a whole in our case): a bag is labeled positive if and only if at least one of its instances is positive [25], [26], [27], [28]. The proposed system is trained to recognize which spatiotemporal polynomials are usually extracted in training videos containing the target surgical task, but not in training videos that do not. These key spatiotemporal polynomials can be used to detect key surgical gestures in new videos and therefore to categorize surgical tasks. Our previous solution for the automated recognition of surgical tasks was also based on the MIL principle [4]. However, motion was analyzed separately along the spatial dimensions and along the temporal dimension. Besides, the proposed characterizations could not be adapted to each surgical task in order to improve performance. In this paper, the system jointly adapts the spatiotemporal polynomial basis and identifies the key spatiotemporal polynomials, for improved performance.

II. OVERVIEW OF THE METHOD

Let $V^{(i)}$ denote a video containing one surgical task. To categorize the surgical task in real time in $V^{(i)}$, short video subsequences are analyzed within $V^{(i)}$. For each video subsequence, a feature vector characterizing motion throughout this spatiotemporal volume is defined. To allow meaningful feature vector comparisons, each frame in $V^{(i)}$ needs to be registered to a coordinate system attached to the patient. The main goal is to remove irrelevant motion information, such as camera motion and patient motion: only the relative surgical tool-patient motions will be extracted. A secondary goal is to remove size variations, due to zoom level variations or camera-patient distance variations. This step is surgery specific: the particular case of anterior eye surgeries, including cataract surgery, is presented in section III.

Once video subsequences are normalized, their motion content is extracted as described in section IV. Then, the extracted motion content is characterized using a spatiotemporal polynomial model. Two solutions are presented. The first solution relies on the canonical basis of spatiotemporal polynomials (§V). The second solution relies on an adaptive basis of spatiotemporal polynomials (§VI).

To train the system, a collection $D$ of surgical task videos is needed. Given a target surgical task (e.g. incision), each video $V^{(i)} \in D$ in which the target task appears is referred to as a relevant video and is noted $V^{(i,+)}$. All other videos are referred to as irrelevant videos and are noted $V^{(i,-)}$. This dataset is divided into a training subset $D_{train}$ and a test subset $D_{test}$.

Once video subsequences are characterized, relevant subsequences are detected in $V^{(i)}$. In order to detect relevant video subsequences, the idea is to identify spatiotemporal polynomials that tend to appear in relevant training videos ($V^{(i,+)} \in D_{train}$) but not in irrelevant ones ($V^{(i,-)} \in D_{train}$). These spatiotemporal polynomials are called key spatiotemporal polynomials. They are learnt using the Diverse Density (DD) criterion [25], a well-known MIL algorithm (§VII). If an adaptive basis of spatiotemporal polynomials is used, a variation on DD is used jointly for basis adaptation and key spatiotemporal polynomial learning.

Finally, the detected relevant subsequences are used to estimate the binary label (+ or -) for video $V^{(i)}$ as a whole, using one DD model trained per target surgical task (§VIII). There are essentially two novelties in this paper: the design of an adaptive basis of spatiotemporal polynomials for motion characterization (§VI) and the generalization of DD for basis adaptation (§VII-D). Cataract surgery video normalization was recently presented at a conference [29] and is summarized hereafter.

III. NORMALIZING CATARACT SURGERY VIDEOS

In order to normalize motion information extracted from videos, each frame in these videos is registered to a coordinate system attached to the anterior segment of the eye (see Fig. 1). The main anatomical landmarks in these videos are the pupil boundaries and the limbus (the outer iris boundary). However, segmenting them is challenging due to 1) occlusion, 2) the variety of colors and textures in the pupil, the iris, the sclera and the lids, and 3) the variety of zoom factors. In the proposed solution, the pupil center and a scale factor are tracked in videos without explicitly segmenting the pupil or the iris.

A. Pupil Center Tracking

The proposed pupil center detector uses the fact that the pupil boundaries, the limbus and the sclera / lid interface are concentric. Using a Hough transform to detect circle centers in images, a prominent response is observed at the pupil center, which is also the center of the limbus and of the sclera / lid interface. To improve performance, the Hough transform is modified in such a way that edge information is only accumulated on the darkest side of the edge, on which the pupil is likely to be. Performance is improved further by smoothing the accumulator both spatially and temporally. The distance between the manually-indicated and the estimated pupil centers was equal to 8.0 ± 6.9% of the manually-measured limbus radius [29].
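To make the accumulation step concrete, here is a minimal C++/OpenCV sketch of a gradient-based circle-center accumulator in the spirit of this section. It is not the authors' implementation: the radius range, edge threshold and smoothing factors are illustrative assumptions; the vote is only cast towards the darker side of each edge, and the accumulator is smoothed spatially and blended with the previous frame's accumulator for temporal smoothing.

```cpp
// Hypothetical sketch of the modified circle-center Hough accumulator (Section III-A).
#include <opencv2/opencv.hpp>

cv::Point trackPupilCenter(const cv::Mat& grayFrame, cv::Mat& accumulatorState,
                           int minRadius = 40, int maxRadius = 200,
                           float temporalSmoothing = 0.7f)
{
    cv::Mat gx, gy, mag;
    cv::Sobel(grayFrame, gx, CV_32F, 1, 0, 3);
    cv::Sobel(grayFrame, gy, CV_32F, 0, 1, 3);
    cv::magnitude(gx, gy, mag);

    cv::Mat acc = cv::Mat::zeros(grayFrame.size(), CV_32F);
    const double magThreshold = 40.0;   // assumed edge-strength threshold

    for (int y = 1; y < grayFrame.rows - 1; y++) {
        for (int x = 1; x < grayFrame.cols - 1; x++) {
            float m = mag.at<float>(y, x);
            if (m < magThreshold) continue;
            // The intensity gradient points from dark to bright, so the darker
            // side of the edge (where the pupil lies) is in the opposite direction.
            float dx = -gx.at<float>(y, x) / m;
            float dy = -gy.at<float>(y, x) / m;
            for (int r = minRadius; r <= maxRadius; r += 4) {
                int cx = cvRound(x + r * dx), cy = cvRound(y + r * dy);
                if (cx >= 0 && cx < acc.cols && cy >= 0 && cy < acc.rows)
                    acc.at<float>(cy, cx) += 1.0f;   // vote only on the dark side
            }
        }
    }
    cv::GaussianBlur(acc, acc, cv::Size(0, 0), 9.0);            // spatial smoothing
    if (accumulatorState.empty()) accumulatorState = acc.clone();
    cv::addWeighted(accumulatorState, temporalSmoothing, acc,    // temporal smoothing
                    1.0f - temporalSmoothing, 0.0, accumulatorState);

    cv::Point center;
    cv::minMaxLoc(accumulatorState, nullptr, nullptr, nullptr, &center);
    return center;   // estimated common center of pupil, limbus and sclera/lid interface
}
```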

B. Scale Factor Tracking

The image scale factor is estimated from the illumination pattern reflected on the cornea: the three bright spots inside the pupil (see Fig. 1). It is defined as the height of the triangular pattern. Note that the three spots are not always visible, in particular when the cornea is distorted due to a tool insertion (see Fig. 3 (b), (g), (i)). The solution is to keep track of the estimated scale factor and only update it when the three spots are clearly visible. It should be noted that surgeons do not change the zoom level while they are performing a surgical gesture: the zoom level only changes when no tools are inserted. The limbus diameter is the best indicator of the zoom level in images: 1) it does not change over time, unlike the pupil diameter, and 2) it varies little across the population. The correlation between the estimated zoom level and the manually-measured limbus size in images was R = 0.834 [29].

Fig. 1: Video normalization. The annotated structures are the sclera, the pupil, the iris, the limbus, the corneal reflections and the region of interest.

C. Region of Interest

A square region of interest (ROI) is defined around the estimated pupil center (see Fig. 1). Its size equals 1.5 times the estimated limbus diameter. The content of this ROI is resized to a fixed-size 513 × 513 image, which is then processed by the motion extraction module (§IV). If the ROI is not fully contained inside the original image (when the pupil is close to the image border), then part of the new fixed-size image is undefined. This undefined area is ignored by the motion extraction module.
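A minimal sketch of this ROI normalization step, assuming pupil center and limbus diameter estimates from the tracking modules above; the helper name, the zero-filled handling of the undefined area and the explicit validity mask are our own illustrative choices.

```cpp
// Hypothetical sketch of the ROI extraction and normalization (Section III-C).
#include <opencv2/opencv.hpp>

void extractNormalizedROI(const cv::Mat& frame, cv::Point2f pupilCenter,
                          float limbusDiameter, cv::Mat& roi513, cv::Mat& validMask)
{
    const int outSize = 513;
    float side = 1.5f * limbusDiameter;   // square ROI, 1.5 x limbus diameter
    cv::Rect roi(cvRound(pupilCenter.x - side / 2), cvRound(pupilCenter.y - side / 2),
                 cvRound(side), cvRound(side));

    // Part of the ROI may fall outside the frame: copy only the valid overlap,
    // and keep a mask so that the motion extraction module can ignore the rest.
    cv::Rect valid = roi & cv::Rect(0, 0, frame.cols, frame.rows);
    cv::Mat canvas = cv::Mat::zeros(roi.size(), frame.type());
    cv::Mat mask   = cv::Mat::zeros(roi.size(), CV_8U);
    if (valid.area() > 0) {
        cv::Rect dst(valid.x - roi.x, valid.y - roi.y, valid.width, valid.height);
        frame(valid).copyTo(canvas(dst));
        mask(dst).setTo(255);
    }
    cv::resize(canvas, roi513, cv::Size(outSize, outSize));
    cv::resize(mask, validMask, cv::Size(outSize, outSize), 0, 0, cv::INTER_NEAREST);
}
```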

IV. MOTION EXTRACTION

Once videos are normalized, motion is extracted from the optical flow between consecutive frames.

A. Optical Flow

Let $V_k^{(i)}$ and $V_{k+1}^{(i)}$ denote two consecutive frames in video $V^{(i)}$. Dense optical flows seem more suitable than sparse optical flows in surgical applications: while sparse optical flow algorithms rely on the detection of strong corners, dense optical flows can capture soft tissue deformations and track line-shaped tools, with few corners. Farneback's algorithm is used to compute the dense optical flow between $V_k^{(i)}$ and $V_{k+1}^{(i)}$ [30]. This algorithm relies on local approximations of image neighborhoods by polynomial expansions; a multiscale implementation is used [31]. For faster computations, the dense optical flow is not estimated at all pixel positions: displacements are only estimated at $\Delta$ uniformly sampled measurement points.
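The flow computation itself reduces to one OpenCV call; the sketch below samples the dense flow on a regular grid of roughly Δ measurement points. The Farneback parameter values are placeholders, not the ones used by the authors.

```cpp
// Hypothetical sketch of the dense optical flow step (Section IV-A).
#include <opencv2/opencv.hpp>
#include <algorithm>
#include <cmath>
#include <vector>

struct MotionVector { float x, y, u, v; };   // measurement point and its displacement

std::vector<MotionVector> sampleFlow(const cv::Mat& prevGray, const cv::Mat& currGray,
                                     int delta = 400)
{
    cv::Mat flow;  // CV_32FC2: per-pixel displacement (u, v)
    cv::calcOpticalFlowFarneback(prevGray, currGray, flow,
                                 0.5,   // pyramid scale (multiscale implementation)
                                 3,     // pyramid levels
                                 15,    // averaging window size
                                 3,     // iterations per level
                                 5, 1.1, 0);

    // Uniformly sample ~delta measurement points on a regular grid.
    int grid = std::max(1, (int)std::lround(std::sqrt((double)delta)));
    std::vector<MotionVector> samples;
    for (int gy = 0; gy < grid; gy++) {
        for (int gx = 0; gx < grid; gx++) {
            int x = (gx + 1) * flow.cols / (grid + 1);
            int y = (gy + 1) * flow.rows / (grid + 1);
            cv::Point2f d = flow.at<cv::Point2f>(y, x);
            samples.push_back({(float)x, (float)y, d.x, d.y});
        }
    }
    return samples;
}
```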

B. Video Subsequences

In order to detect surgical subtasks or gestures, motion information extracted from two consecutive frames may not be enough: the time interval is likely too short. More generally, motion is analyzed inside video subsequences of $n$ frames. These subsequences may overlap: one video subsequence is analyzed every $m$ frames. Parameter $m$ is chosen to trade off retrieval precision and computation times (§VIII-B). Let $V_{[jm+1;jm+n]}^{(i)}$ denote the $j$th video subsequence in $V^{(i)}$:

$$V_{[jm+1;jm+n]}^{(i)} = \left\{ V_{jm+1}^{(i)}, V_{jm+2}^{(i)}, \ldots, V_{jm+n}^{(i)} \right\} \quad (1)$$

All motion vectors extracted from consecutive frames of $V_{[jm+1;jm+n]}^{(i)}$ are put into a single motion field $F^{(i,j)}$, or $F$ for short. Each element $f_d \in F$ maps a spatiotemporal coordinate $(x_d = x_{k,p},\, y_d = y_{k,p},\, t_d = k - jm)$, i.e. a time-indexed measurement point, to a displacement $(u_d, v_d)$ provided by Farneback's algorithm, $d = 1..D$, where $D = (n-1)\Delta$.
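Continuing the sketch above, the motion field F of one subsequence can be assembled by pooling the sampled flow vectors of its n−1 consecutive frame pairs; buildMotionField() below reuses the hypothetical sampleFlow() helper sketched earlier.

```cpp
// Hypothetical sketch of the motion field F of one subsequence (Section IV-B).
#include <opencv2/opencv.hpp>
#include <vector>

struct FieldElement { float x, y, t, u, v; };   // (x_d, y_d, t_d) -> (u_d, v_d)

std::vector<FieldElement> buildMotionField(const std::vector<cv::Mat>& subsequence, // n gray frames
                                           int delta = 400)
{
    std::vector<FieldElement> F;   // will hold D = (n-1) * delta elements
    for (size_t k = 0; k + 1 < subsequence.size(); k++) {
        // flow between frames k and k+1, sampled at delta measurement points
        std::vector<MotionVector> samples = sampleFlow(subsequence[k], subsequence[k + 1], delta);
        for (const MotionVector& s : samples)
            F.push_back({s.x, s.y, (float)k, s.u, s.v});   // t_d is the frame index within the subsequence
    }
    return F;
}
```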

V. MOTION CHARACTERIZATION USING CANONICAL SPATIOTEMPORAL POLYNOMIALS

In order to characterize the motion field within subsequence $V_{[jm+1;jm+n]}^{(i)}$, the motion vectors in $F$ are approximated by two spatiotemporal polynomials. The first polynomial maps the spatiotemporal coordinate $(x, y, t)$ to the horizontal displacement $u$. The second maps the spatiotemporal coordinate to the vertical displacement $v$.


A. Spatiotemporal Polynomials of Maximal Order p

Let $p$ denote the maximal polynomial order. Given a basis of canonical polynomials, we search for a matrix of polynomial coefficients, noted $P^{(p,i,j)}$ or $P$ for short, that minimizes the sum of the squared errors between the true motion field $F$ and the motion field approximated by a polynomial model of maximal order $p$. Matrix $P$ is referred to as the canonical motion characterization of subsequence $V_{[jm+1;jm+n]}^{(i)}$.

B. Canonical Polynomial Bases

Canonical polynomial bases of maximal order $p$ are noted $C^{(p)}$. They have the following structure: $C^{(1)} = \{1, x, y, t\}$, $C^{(2)} = \{1, x, y, t, xy, xt, yt, x^2, y^2, t^2\}$, etc. Let $L_p = |C^{(p)}|$ denote the number of canonical polynomials. The number of canonical polynomials of order $k$ is the number of combinations of $k$ elements in $\{x, y, t\}$ with repetition, therefore:

$$L_p = \sum_{k=0}^{p} \binom{2+k}{k} \quad (2)$$
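For illustration, the canonical basis can be enumerated explicitly; the following sketch lists all monomial exponents up to order p and evaluates the corresponding vector of monomials at one coordinate. The enumeration order within each degree is an arbitrary choice of ours; the count matches L_p in equation 2.

```cpp
// Hypothetical sketch of the canonical polynomial basis (Section V-B).
#include <array>
#include <cmath>
#include <vector>

// Exponent triplets (a, b, c) of all canonical monomials x^a * y^b * t^c with a+b+c <= p.
std::vector<std::array<int, 3>> canonicalBasis(int p)
{
    std::vector<std::array<int, 3>> basis;
    for (int order = 0; order <= p; order++)
        for (int a = order; a >= 0; a--)
            for (int b = order - a; b >= 0; b--)
                basis.push_back({a, b, order - a - b});
    return basis;   // basis.size() == L_p
}

// Row vector C_d: every canonical monomial evaluated at (x_d, y_d, t_d).
std::vector<double> evaluateBasis(const std::vector<std::array<int, 3>>& basis,
                                  double x, double y, double t)
{
    std::vector<double> Cd;
    for (const auto& m : basis)
        Cd.push_back(std::pow(x, m[0]) * std::pow(y, m[1]) * std::pow(t, m[2]));
    return Cd;
}
```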

C. Canonical Motion Characterization

Let $(x_d, y_d, t_d)$, $d = 1..D$, be the spatiotemporal coordinates of the salient points detected in subsequence $V_{[jm+1;jm+n]}^{(i)}$. Let $C_d^{(p)}$, or $C_d$ for short, denote the vector formed by the canonical polynomials in $C^{(p)}$, evaluated at coordinate $(x_d, y_d, t_d)$: for instance, $C_d^{(1)} = (1, x_d, y_d, t_d)$. The approximated motion vector at coordinate $(x_d, y_d, t_d)$ can be expressed as a matrix product $C_d P$ (see Fig. 2). Matrix $P$ is defined as follows:

$$P = \arg\min_{X} \sum_{d=1}^{D} \left\| (u_d, v_d) - C_d X \right\|^2 \quad (3)$$

The optimal solution is found when the derivative of the sum, with respect to matrix $X$, equals 0. This solution can be rewritten as follows:

$$A P = E, \quad A = \sum_{d=1}^{D} C_d^T C_d, \quad E = \sum_{d=1}^{D} C_d^T (u_d, v_d) \quad (4)$$

with $A \in \mathcal{M}_{L_p,L_p}$ and $E \in \mathcal{M}_{L_p,2}$. Matrix $P$ is obtained by solving two systems of linear equations of order $L_p$: one for the horizontal displacements and one for the vertical displacements. In the first system, the right-hand side of the equation is the first column of $E$. In the second system, the right-hand side is the second column of $E$. These systems are solved using the LU decomposition of $A$. The solution of the first (respectively the second) system is stored in the first (respectively the second) column of $P$.
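A minimal sketch of this least-squares fit, assuming the hypothetical FieldElement, canonicalBasis() and evaluateBasis() helpers from the previous sketches; it accumulates A and E over the motion field and solves both columns of equation 4 with OpenCV's LU-based solver.

```cpp
// Hypothetical sketch of the canonical motion characterization (equations 3-4).
#include <opencv2/opencv.hpp>
#include <array>
#include <vector>

cv::Mat fitCanonicalCharacterization(const std::vector<FieldElement>& F, int p)
{
    std::vector<std::array<int, 3>> basis = canonicalBasis(p);
    int Lp = (int)basis.size();
    cv::Mat A = cv::Mat::zeros(Lp, Lp, CV_64F);
    cv::Mat E = cv::Mat::zeros(Lp, 2, CV_64F);

    for (const FieldElement& f : F) {
        std::vector<double> c = evaluateBasis(basis, f.x, f.y, f.t);
        cv::Mat Cd(1, Lp, CV_64F, c.data());            // row vector C_d
        A += Cd.t() * Cd;                               // A = sum_d C_d^T C_d
        cv::Mat uv = (cv::Mat_<double>(1, 2) << f.u, f.v);
        E += Cd.t() * uv;                               // E = sum_d C_d^T (u_d, v_d)
    }
    cv::Mat P;                                          // L_p x 2 coefficient matrix
    cv::solve(A, E, P, cv::DECOMP_LU);                  // solves both columns at once
    return P;
}
```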

D. Complexity Analysis

The most complex step in the computation of the canonical motion characterization $P$, given the motion field $F$, is the computation of matrix $A$. Since $A$ is symmetric, its computation requires $O(\frac{1}{2} D L_p^3)$ operations. In comparison, the complexity of its LU decomposition is in $O(\frac{1}{2} L_p^3)$. So, the complexity of the entire process increases linearly with $D$, the number of time-indexed measurement points (§IV).

VI. MOTION CHARACTERIZATION USING ADAPTIVE SPATIOTEMPORAL POLYNOMIALS

In this section, the basis polynomials are no longer canonical polynomials, but rather linear combinations of canonical polynomials.

A. Adaptive Polynomial Basis

Let $l$ denote the number of adaptive basis polynomials. Let $\Pi^{(p,l)} \in \mathcal{M}_{L_p,l}$, or $\Pi$ for short, denote the projection matrix between the canonical polynomial basis of maximal order $p$ and the adaptive polynomial basis of dimension $l$. In this new basis, the polynomial coefficients evaluated at coordinate $(x_d, y_d, t_d)$ are stored in an $l$-dimensional vector $\bar{C}_d^{(p,l)}$, or $\bar{C}_d$ for short:

$$\bar{C}_d = C_d \Pi \quad (5)$$

Let $\bar{P}^{(p,l,i,j)}$, or $\bar{P}$ for short, denote the motion characterization of a video subsequence $V_{[jm+1;jm+n]}^{(i)}$ in the new basis. $\bar{P}$ can be expressed as follows:

$$\bar{A}\bar{P} = \bar{E}, \quad \bar{A} = \sum_{d=1}^{D} (C_d \Pi)^T (C_d \Pi) = \Pi^T A \Pi, \quad \bar{E} = \sum_{d=1}^{D} (C_d \Pi)^T (u_d, v_d) = \Pi^T E \quad (6)$$

where $\bar{A} \in \mathcal{M}_{l,l}$ is an adapted version of matrix $A$ (see equation 4) and $\bar{E} \in \mathcal{M}_{l,2}$ is an adapted version of matrix $E$.
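In code, the adapted characterization only requires two projections and one small linear solve; a minimal sketch assuming A, E and Π are stored as CV_64F matrices:

```cpp
// Hypothetical sketch of the adaptive characterization (equations 5-7).
#include <opencv2/opencv.hpp>

cv::Mat adaptedCharacterization(const cv::Mat& A, const cv::Mat& E, const cv::Mat& Pi)
{
    cv::Mat Abar = Pi.t() * A * Pi;               // equation 6, adapted A
    cv::Mat Ebar = Pi.t() * E;                    // equation 6, adapted E
    cv::Mat Pbar;
    cv::solve(Abar, Ebar, Pbar, cv::DECOMP_LU);   // equation 7: Pbar = Abar^-1 Ebar
    return Pbar;                                  // l x 2 adapted motion characterization
}
```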

B. Basis Adaptation

For basis adaptation purposes, we need to know how each coefficient of the projection matrix ($\Pi$) impacts the motion characterization ($\bar{P}$) of the video subsequence. In other words, we need to compute the partial derivatives of $\bar{P}$ with respect to each coefficient of $\Pi$. In order to reduce the complexity of basis adaptation, the dimension $l$ of the adaptive basis should be as low as possible.

C. Deriving the Motion Characterization with respect to Projection Coefficients

In order to use standard differentiation rules, equation 6 is inverted as follows:

$$\bar{P} = \bar{A}^{-1} \bar{E} \quad (7)$$


Fig. 2: Motion field measurement and approximation (panels (a)-(d)). The green motion field is the one measured by Farneback's algorithm between the previous and the current frame. The blue motion field is the one obtained through the spatiotemporal polynomial approximation (polynomial order p = 6, subsequence length: n = 5 frames). The resized region of interest is shown below each frame. Figure (a) shows that blurry tool motions can be detected. Figure (b) shows that images with unusual zoom levels can be processed correctly. Figure (c) indicates that soft tissue deformation can be detected. Figure (d) shows that fluid displacements can also be detected.

According to the rule of matrix product derivation, the partial derivative of $\bar{P}$ with respect to coefficient $\Pi_{u,v}$ of $\Pi$ is given below:

$$\frac{\partial \bar{P}}{\partial \Pi_{u,v}} = \frac{\partial \bar{A}^{-1}}{\partial \Pi_{u,v}} \bar{E} + \bar{A}^{-1} \frac{\partial \bar{E}}{\partial \Pi_{u,v}} \quad (8)$$

According to the rule of inverse matrix derivation, the partial derivative of $\bar{A}^{-1}$ with respect to $\Pi_{u,v}$ is given below:

$$\frac{\partial \bar{A}^{-1}}{\partial \Pi_{u,v}} = -\bar{A}^{-1} \frac{\partial \bar{A}}{\partial \Pi_{u,v}} \bar{A}^{-1} \quad (9)$$

Therefore, equation 8 becomes:

$$\frac{\partial \bar{P}}{\partial \Pi_{u,v}} = \bar{A}^{-1} \left[ \frac{\partial \bar{E}}{\partial \Pi_{u,v}} - \frac{\partial \bar{A}}{\partial \Pi_{u,v}} \bar{A}^{-1} \bar{E} \right] \quad (10)$$

Matrix $\frac{\partial \bar{E}}{\partial \Pi_{u,v}}$ has the following format:

$$\frac{\partial \bar{E}}{\partial \Pi_{u,v}} =
\begin{pmatrix}
0 & 0 \\
\vdots & \vdots \\
0 & 0 \\
E_{u,1} & E_{u,2} \\
0 & 0 \\
\vdots & \vdots \\
0 & 0
\end{pmatrix} \quad (11)$$

where the only non-zero row is the $v$th row. This is because $\bar{E}_{q,r} = \sum_{s=1}^{L_p} \Pi_{s,q} E_{s,r}$, whose derivative with respect to $\Pi_{u,v}$ is zero whenever $q \neq v$. As for matrix $\frac{\partial \bar{A}}{\partial \Pi_{u,v}}$, it has the following format:

$$\frac{\partial \bar{A}}{\partial \Pi_{u,v}} =
\begin{pmatrix}
0 & \cdots & 0 & a_{u,1} & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & a_{u,v-1} & 0 & \cdots & 0 \\
a_{u,1} & \cdots & a_{u,v-1} & 2a_{u,v} & a_{u,v+1} & \cdots & a_{u,l} \\
0 & \cdots & 0 & a_{u,v+1} & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & a_{u,l} & 0 & \cdots & 0
\end{pmatrix} \quad (12)$$

where the only non-zero row is the $v$th row, the only non-zero column is the $v$th column and $a_{u,q} = \sum_{r=1}^{L_p} \Pi_{r,q} A_{u,r}$. These formulas take advantage of the symmetry of matrix $A$ (see equation 4).
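A minimal sketch of equations 10-12, assuming zero-based indices and CV_64F matrices; it returns the l × 2 matrix of partial derivatives of the adapted characterization with respect to one projection coefficient. The factor 2 on the (v, v) entry reflects our reading of equation 12, where both the v-th row and the v-th column contribute.

```cpp
// Hypothetical sketch of the basis-adaptation derivative (equations 10-12).
#include <opencv2/opencv.hpp>

cv::Mat dPbar_dPi(const cv::Mat& A, const cv::Mat& E, const cv::Mat& Pi, int u, int v)
{
    int l = Pi.cols;
    cv::Mat Abar = Pi.t() * A * Pi;
    cv::Mat Ebar = Pi.t() * E;
    cv::Mat AbarInv = Abar.inv(cv::DECOMP_LU);

    // Equation 11: dEbar/dPi(u,v) is zero except for its v-th row, equal to E(u, :).
    cv::Mat dEbar = cv::Mat::zeros(l, 2, CV_64F);
    E.row(u).copyTo(dEbar.row(v));

    // Equation 12: dAbar/dPi(u,v) is zero except for its v-th row and column,
    // both equal to a(u, :) with a(u, q) = sum_r Pi(r, q) A(u, r); the (v, v)
    // entry gets a factor 2 since the row and the column contributions coincide.
    cv::Mat a = A.row(u) * Pi;                      // 1 x l
    cv::Mat dAbar = cv::Mat::zeros(l, l, CV_64F);
    a.copyTo(dAbar.row(v));
    cv::Mat(a.t()).copyTo(dAbar.col(v));
    dAbar.at<double>(v, v) = 2.0 * a.at<double>(0, v);

    // Equation 10: dPbar/dPi(u,v) = Abar^-1 [ dEbar - dAbar Abar^-1 Ebar ].
    return AbarInv * (dEbar - dAbar * (AbarInv * Ebar));
}
```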

VII. FINDING THE KEY SPATIOTEMPORAL POLYNOMIALS

Once each video subsequence has been characterized, key spatiotemporal polynomials are identified through Multiple-Instance Learning (MIL) in the training subset $D_{train}$.

A. Multiple-Instance Learning

Multiple-instance learners are supervised learners that receive a set of bags of instances. A binary label (relevant or irrelevant) is assigned to each bag [24]. A bag is labeled irrelevant if all the instances in it are irrelevant. On the other hand, a bag is labeled relevant if it contains at least one relevant instance (or one key instance). From a collection of labeled bags, multiple-instance learners are trained to detect relevant instances.


In this paper, the characterization of each video subsequence $V_{[jm+1;jm+n]}^{(i)}$ is regarded as an instance. Each video $V^{(i)}$ is regarded as a bag of instances. The relevant instances we are looking for are the key spatiotemporal polynomials.

The most popular MIL frameworks are Andrews' SVM [26], Diverse Density (DD) [25], and their derivatives. In Andrews' SVM, a support-vector machine processes the instance labels as unobserved integer variables, subject to constraints defined by the bag labels. The goal is to maximize the soft-margin over hidden label variables and a discriminant function. DD measures the intersection of the relevant bags minus the union of the irrelevant bags. The location of relevant instances in feature space, and also the best weighting of the features, is found by maximizing DD. DD was chosen for its simplicity and its generality: in particular, it is suitable for spatiotemporal polynomial basis adaptation (§VII-D).

B. Notations

Irrelevant and relevant videos are noted $V^{(i,-)}$ and $V^{(i,+)}$, respectively (§II). In the canonical model (§V), the characterizations of their subsequences are noted $P^{(p,i,j,-)}$ and $P^{(p,i,j,+)}$, respectively, or $P^{(i,j,-)}$ and $P^{(i,j,+)}$ for short. In the adaptive model (§VI), these characterizations are noted $\bar{P}^{(p,l,i,j,-)}$ and $\bar{P}^{(p,l,i,j,+)}$, respectively, or $\bar{P}^{(i,j,-)}$ and $\bar{P}^{(i,j,+)}$ for short. Let $\hat{P}$ denote the key spatiotemporal polynomial.

C. Maron’s Diverse Density

DD, in its simplest form, is defined as follows:

$$\begin{aligned}
\hat{P} &= \arg\max_{P} \left\{ \prod_i \Pr(+|V^{(i,+)}, P) \times \prod_i \Pr(-|V^{(i,-)}, P) \right\} \\
\Pr(+|V^{(i,+)}, P) &= 1 - U_i^+ \\
\Pr(-|V^{(i,-)}, P) &= U_i^- \\
U_i^\delta &= \prod_j \left[ 1 - \exp\left( -\left\| P^{(i,j,\delta)} - P \right\|^2 \right) \right]
\end{aligned} \quad (13)$$

where $\delta$ stands for the class label (+ or -). For convenience, instead of maximizing the DD criterion, the opposite of its logarithm, noted $f(P)$, is minimized:

$$\begin{aligned}
f(P) &= f^+(P) + f^-(P) \\
f^+(P) &= -\sum_i \log\left(1 - e^{V_i^+}\right) \\
f^-(P) &= -\sum_i V_i^- \\
V_i^\delta &= \sum_j \log\left(1 - e^{-\|P^{(i,j,\delta)} - P\|^2}\right)
\end{aligned} \quad (14)$$

The key spatiotemporal polynomial, $\hat{P}$, is found by gradient descents controlled by the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm [32]. In this paper, $n_d$ gradient descents are performed to find a key spatiotemporal polynomial. Each descent is initialized by a randomly selected instance within the relevant bags [27]. The solution maximizing the DD criterion (i.e. minimizing the opposite of its logarithm) is retained.

Considering the canonical model (§V), the gradient of the objective function $f(P)$ consists of the $2L_p$ partial derivatives with respect to the polynomial coefficients. The partial derivative of $f(P)$ with respect to one polynomial coefficient $P_{u,v}$ is given by:

$$\begin{aligned}
\frac{\partial f(P)}{\partial P_{u,v}} &= \frac{\partial f^+(P)}{\partial P_{u,v}} + \frac{\partial f^-(P)}{\partial P_{u,v}} \\
\frac{\partial f^+(P)}{\partial P_{u,v}} &= \sum_i \frac{e^{V_i^+}}{1 - e^{V_i^+}} \frac{\partial V_i^+}{\partial P_{u,v}} \\
\frac{\partial f^-(P)}{\partial P_{u,v}} &= -\sum_i \frac{\partial V_i^-}{\partial P_{u,v}} \\
\frac{\partial V_i^\delta}{\partial P_{u,v}} &= 2 \sum_j \frac{e^{-\|P^{(i,j,\delta)} - P\|^2}}{1 - e^{-\|P^{(i,j,\delta)} - P\|^2}} \left( P^{(i,j,\delta)}_{u,v} - P_{u,v} \right)
\end{aligned} \quad (15)$$

Note that features can be weighted simultaneously: the Euclidean distance simply needs to be replaced by a weighted Euclidean distance [25]:

$$\begin{aligned}
(\hat{P}, \hat{s}) &= \arg\max_{P,s} \left\{ \prod_i \Pr(+|V^{(i,+)}, P, s) \times \prod_i \Pr(-|V^{(i,-)}, P, s) \right\} \\
\left\| P^{(i,j,\delta)} - P \right\|^2 &= \sum_{k=1}^{L_p} \sum_{q=1}^{2} s_{k,q}^2 \left( P^{(i,j,\delta)}_{k,q} - P_{k,q} \right)^2
\end{aligned} \quad (16)$$

In this case, the gradient of the objective function $f(P)$ consists of the $2L_p$ partial derivatives with respect to the polynomial coefficients and the $2L_p$ partial derivatives with respect to the weights. The weights are initialized to 1 in each gradient descent.
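For illustration, the objective actually minimized (equation 14, with the weighted distance of equation 16) can be evaluated as follows; the analytic gradient and the BFGS descents are omitted, and the small epsilon guarding the logarithms is our own numerical safeguard. Setting all weights to 1 recovers the plain Euclidean distance of equation 13.

```cpp
// Hypothetical sketch of the negative log Diverse Density objective (equations 14 and 16).
#include <opencv2/opencv.hpp>
#include <cmath>
#include <vector>

typedef std::vector<cv::Mat> Bag;   // one characterization matrix per subsequence of a video

static double logBagTerm(const Bag& bag, const cv::Mat& P, const cv::Mat& S)
{
    double V = 0.0;                                    // V_i^delta in equation 14
    for (const cv::Mat& inst : bag) {
        cv::Mat diff = inst - P;
        cv::Mat weighted = S.mul(S).mul(diff.mul(diff));          // equation 16
        double dist2 = cv::sum(weighted)[0];
        V += std::log(1.0 - std::exp(-dist2) + 1e-12);
    }
    return V;
}

double negLogDiverseDensity(const std::vector<Bag>& positiveBags,
                            const std::vector<Bag>& negativeBags,
                            const cv::Mat& P, const cv::Mat& S)
{
    double fPlus = 0.0, fMinus = 0.0;
    for (const Bag& bag : positiveBags)                // f+ = -sum_i log(1 - e^{V_i^+})
        fPlus -= std::log(1.0 - std::exp(logBagTerm(bag, P, S)) + 1e-12);
    for (const Bag& bag : negativeBags)                // f- = -sum_i V_i^-
        fMinus -= logBagTerm(bag, P, S);
    return fPlus + fMinus;                             // f(P) = f+(P) + f-(P)
}
```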

D. Joint Basis Adaptation and Key Spatiotemporal Polynomial Learning

In this section, the adaptive model (§VI) is considered. The goal is to find the key spatiotemporal polynomial and the best polynomial basis simultaneously:

$$(\hat{P}, \hat{\Pi}) = \arg\max_{P,\Pi} \left\{ \prod_i \Pr(+|V^{(i,+)}, P, \Pi) \times \prod_i \Pr(-|V^{(i,-)}, P, \Pi) \right\} \quad (17)$$

Searching for the best polynomial basis alters the distance between instances, so there is no need to search for optimal weights: modifying the polynomial basis is more general. In this case, the gradient of the objective function $f(P)$ (see equation 14) consists of the $2l$ partial derivatives with respect to the polynomial coefficients (obtained similarly to equation 15) and the $L_p \times l$ partial derivatives with respect to the projection coefficients. The partial derivative of the objective function, with respect to one projection coefficient $\Pi_{u,v}$, is given by:

$$\begin{aligned}
\frac{\partial f(P)}{\partial \Pi_{u,v}} &= \frac{\partial f^+(P)}{\partial \Pi_{u,v}} + \frac{\partial f^-(P)}{\partial \Pi_{u,v}} \\
\frac{\partial f^+(P)}{\partial \Pi_{u,v}} &= \sum_i \frac{e^{V_i^+}}{1 - e^{V_i^+}} \frac{\partial V_i^+}{\partial \Pi_{u,v}} \\
\frac{\partial f^-(P)}{\partial \Pi_{u,v}} &= -\sum_i \frac{\partial V_i^-}{\partial \Pi_{u,v}} \\
V_i^\delta &= \sum_j \log\left(1 - e^{-\|\bar{P}^{(i,j,\delta)} - P\|^2}\right) \\
\frac{\partial V_i^\delta}{\partial \Pi_{u,v}} &= 2 \sum_j \frac{e^{-\|\bar{P}^{(i,j,\delta)} - P\|^2}}{1 - e^{-\|\bar{P}^{(i,j,\delta)} - P\|^2}} \, c\!\left(\frac{\partial \bar{P}^{(i,j,\delta)}}{\partial \Pi_{u,v}}\right)^T c\!\left(\bar{P}^{(i,j,\delta)} - P\right)
\end{aligned} \quad (18)$$

where $\delta$ stands for the class label (+ or -) and $c(.)$ is an operator that concatenates the columns of a matrix. The partial derivatives of the adapted motion characterizations $\bar{P}^{(i,j,\delta)}$ are given in equation 10. Note that the main difference with system 15 lies in the last equation.

The optimal basis and the key spatiotemporal polynomial are found by $n_d$ gradient descents controlled by the BFGS algorithm (§VII-C). For each descent, the polynomial basis is initialized through a principal component analysis of the spatiotemporal polynomials in input space ($\mathbb{R}^{L_p}$): the spatiotemporal polynomials predicting horizontal displacements and those predicting vertical displacements are pooled together. The first $l$ components are used as the $l$ columns of the initial projection matrix.
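A minimal sketch of this initialization, assuming the training characterizations are stored as L_p × 2 CV_64F matrices; it pools the horizontal and vertical coefficient vectors and keeps the first l principal components as the columns of the initial projection matrix.

```cpp
// Hypothetical sketch of the PCA initialization of the projection matrix (Section VII-D).
#include <opencv2/opencv.hpp>
#include <vector>

cv::Mat initialProjectionMatrix(const std::vector<cv::Mat>& characterizations, // each L_p x 2
                                int l)
{
    int Lp = characterizations.front().rows;
    cv::Mat samples((int)characterizations.size() * 2, Lp, CV_64F);
    int row = 0;
    for (const cv::Mat& P : characterizations) {
        cv::Mat(P.col(0).t()).copyTo(samples.row(row++));   // horizontal-displacement coefficients
        cv::Mat(P.col(1).t()).copyTo(samples.row(row++));   // vertical-displacement coefficients
    }
    cv::PCA pca(samples, cv::noArray(), cv::PCA::DATA_AS_ROW, l);
    return cv::Mat(pca.eigenvectors.t());   // L_p x l: the first l components as columns of Pi
}
```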

VIII. SURGICAL TASK RECOGNITION

Once key spatiotemporal polynomials are learnt, the surgical task performed in video $V^{(i)}$ is recognized.

A. Multiclass Multiple-Instance Learning

For surgery monitoring, we are interested in detecting multiple surgical tasks in videos: $T_t$, $t = 1, \ldots, T$. For each key surgical task $T_t$, one key spatiotemporal polynomial $\hat{P}_t$ and either one weight vector $\hat{s}_t$ (see equation 16) or one polynomial basis $\hat{\Pi}_t$ (see equation 17) have been learnt. The probability that task $T_t$ occurs in video $V^{(i)}$ is given either by $\mathcal{P}_t = \Pr(+|V^{(i)}, \hat{P}_t, \hat{s}_t)$ or by $\mathcal{P}_t = \Pr(+|V^{(i)}, \hat{P}_t, \hat{\Pi}_t)$: $\mathcal{P}_t$ may be used as the criterion to decide whether or not a task of type $t$ occurred in video $V^{(i)}$. But, to push recognition performance further, we should take advantage of all $\mathcal{P}_u$ probabilities, $u = 1, \ldots, T$, to recognize task $t$. For that purpose, a classifier was trained in the $T$-dimensional space generated by all $\mathcal{P}_u$ probabilities, $u = 1, \ldots, T$, using the $D_{train}$ dataset. The two leading classification frameworks nowadays are Random Forests (RFs) [33] and Support-Vector Machines (SVMs) [34]. The main advantage of SVMs over RFs is that they are good at avoiding over-fitting: they can perform well with less training data. One drawback of SVMs is that their performance depends on how the data is normalized, and normalization is particularly challenging in the case of heterogeneous data. However, in this application, input data are all probabilities, so they are homogeneous and do not need to be normalized. An SVM classifier was built using the well-known Gaussian kernel. The training procedure for the full system is described in the following section.
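A minimal sketch of this final classification stage; the paper uses LIBSVM, whereas the sketch below uses OpenCV's cv::ml::SVM as a stand-in, with placeholder C and gamma values (the paper tunes the SVM parameters by nested cross-validation, §VIII-B).

```cpp
// Hypothetical sketch of the multiclass MIL combination (Section VIII-A): one
// Gaussian-kernel SVM per target task, trained on T-dimensional vectors of
// key-polynomial detector probabilities, one vector per training video.
#include <opencv2/ml.hpp>

cv::Ptr<cv::ml::SVM> trainTaskClassifier(const cv::Mat& probabilities, // N x T, CV_32F
                                         const cv::Mat& labels)        // N x 1, CV_32S (+1 / -1)
{
    cv::Ptr<cv::ml::SVM> svm = cv::ml::SVM::create();
    svm->setType(cv::ml::SVM::C_SVC);
    svm->setKernel(cv::ml::SVM::RBF);     // Gaussian kernel
    svm->setC(1.0);                       // placeholder hyper-parameters
    svm->setGamma(0.5);
    svm->train(cv::ml::TrainData::create(probabilities, cv::ml::ROW_SAMPLE, labels));
    return svm;
}

// Signed decision value for one video; the task whose classifier gives the
// largest response can be taken as the most likely surgical task.
float taskResponse(const cv::Ptr<cv::ml::SVM>& svm, const cv::Mat& probabilityVector) // 1 x T
{
    return svm->predict(probabilityVector, cv::noArray(), cv::ml::StatModel::RAW_OUTPUT);
}
```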

B. Training Surgical Task Recognition

In the weight adaptation solution, key spatiotemporal polynomial finding has five parameters (§II, §V, §VII-C): the number of frames per subsequence ($n$), the delay between the beginning of two consecutive subsequences ($m$), the number of selected measurement points per frame ($\Delta$), the maximal polynomial order ($p$) and the number of gradient descents ($n_d$). In the basis adaptation solution, the task recognition system has a sixth parameter (§VI): the number of adaptive basis polynomials ($l$). Two parameters were chosen empirically to allow real-time fitting of the polynomial models: $m = 5$ frames and $\Delta = 400$ measurement points per frame. One parameter was chosen empirically to have fast training times: $n_d = 5$. The other two ($n$ and $p$) or three ($n$, $p$ and $l$) were chosen by two-fold cross-validation in the training set. Each tested $(n, p)$ or $(n, p, l)$ tuple was graded by the average, over all surgical tasks and both folds, of the area under the Receiver Operating Characteristic (ROC) curve of the SVM classifier. In each fold, the parameters of the SVM (§VIII-A) were trained by a nested two-fold cross-validation. Once the optimal parameters were found, the full system was retrained on the entire training set using the optimal parameters.

C. Joint Segmentation and Recognition of Surgical Tasks

For the task of jointly segmenting and recognizing surgical tasks in real time, the framework proposed in [5] is used, after replacing our previous video analysis framework [4] with the proposed framework. Because the framework in [5] relies on analogy reasoning, the SVM classifier was replaced by a k-nearest neighbors classifier in the recognition step.

The segmentation step also relies on analogy reasoning. But in that case, instances are assessed individually [5], so a distance between spatiotemporal polynomials needs to be defined (as opposed to a distance between bags of instances in the recognition step). For that purpose, the presence of surgical tools is manually detected in a few training videos: a set of video portions is obtained. Those containing surgical tools are labeled positive, the others are labeled negative. These video portions are used to train an 'action detector', defined as a key spatiotemporal polynomial detector (§VII). In the case of $(\hat{P}, \hat{\Pi})$ adaptation (§VII-D), the distance between instances is defined as the Euclidean distance in the adapted polynomial space. In the case of $(\hat{P}, \hat{s})$ adaptation, it is defined as the weighted Euclidean distance (see equation 16).

IX. APPLICATION TO A CATARACT SURGERY DATASET

The proposed system was applied to cataract surgery. Two experiments were conducted. The first one was about surgical task recognition in manually-delimited cataract surgery videos. The second experiment was about the joint segmentation and recognition of surgical tasks in full cataract surgery videos. The dataset used in these experiments is presented hereafter.


A. The Cataract Surgery Video Dataset

A dataset of 186 videos from 186 consecutive cataract surgeries was collected at Brest University Hospital (Brest, France) between February and July 2011. Surgeries were performed by 10 different surgeons of various experience levels. Some videos were recorded with a CCD-IRIS device (Sony, Tokyo, Japan), the others were recorded with a MediCap USB200 video recorder (MediCapture, Philadelphia, USA). They were stored in MPEG2 format, with the highest quality settings, or in DV format. Image definition is 720 × 576 pixels. The frame rate is 25 frames per second.

In each video, a temporal segmentation was provided by cataract experts for each surgical task. The following surgical tasks were temporally segmented in videos: incision, rhexis, hydrodissection, phacoemulsification, epinucleus removal, viscous agent injection, implant setting-up, viscous agent removal and stitching up (see Fig. 3).

Fig. 3: High-level surgical tasks: (a) incision, (b) rhexis, (c) hydrodissection, (d) phacoemulsification, (e) epinucleus removal, (f) viscous agent injection, (g) implant setting-up, (h) viscous agent removal, (i) stitching up.

1) Surgical Task Recognition Experiment: in order to compare the proposed framework with our previous surgical task recognition solution [4], a subset of 100 videos was used in the first experiment. Nine manually-delimited clips were obtained per surgery. Overall, 900 clips were obtained. Clips have an average duration of 94 seconds. The dataset was randomly divided into two subsets of 50 surgeries: one was used as training set, the other was used as test set. For comparison purposes, the dataset split defined in [4] was used.

2) Joint Segmentation and Recognition Experiment: the full dataset was randomly divided into two subsets of 93 surgeries: one was used as training set, the other was used as test set. For comparison purposes, the dataset split defined in [5] was used. Note that the training set (respectively the test set) includes the smaller training set (respectively test set) defined above. For this experiment, a miscellaneous task category was created to account for all the optional surgical phases: iris retractor setting-up, iris retractor removal, angle measurement, landmark tracing, etc. In order to train the idle phase detector (§VIII-C), the presence of surgical tools was manually delimited in a subset of 10 surgery videos from the training set [5].

Fig. 4: Optimal projection matrix for the 'incision' class (n = 10, p = 3, l = 5): weight of each canonical basis polynomial in each of the five adaptive components (first to fifth component).

TABLE I: Cross-validation results for $(\hat{P}, \hat{\Pi})$ adaptation in the training set

              n     p    l    Az
  varying n   5     3    5    0.806
              10    3    5    0.827
              20    3    5    0.853
              50    3    5    0.815
              100   3    5    0.797
  varying p   10    2    5    0.810
              10    3    5    0.826
              10    4    5    0.854
              10    5    5    0.843
              10    10   5    0.824
  varying l   10    3    2    0.809
              10    3    3    0.814
              10    3    4    0.826
              10    3    5    0.841
              10    3    10   0.826
              10    3    20   0.752

B. Results of the Surgical Task Recognition Experiment

Each parameter was trained separately, using a default value (n = 10, p = 3, l = 5) for the other two parameters. As an illustration, the optimal projection parameters obtained for the 'incision' class with these parameters are reported in Fig. 4. Cross-validation results for $(\hat{P}, \hat{\Pi})$ adaptation in the training set are reported in Table I. The optimal tuples of parameters were (n = 20, p = 4) and (n = 20, p = 4, l = 5). For each surgical task, the ROC curve of the SVM classifier in the test set is reported in Fig. 5. The areas under these curves (Az) for the proposed method, using either $(\hat{P}, \hat{s})$ adaptation or $(\hat{P}, \hat{\Pi})$ adaptation, are reported in Table II. To show the advantage of the proposed multiclass MIL extension (§VIII-A), the performance of the standard single-class MIL (i.e. the area under the ROC curve of $\Pr(+|V, \hat{P}, \hat{\Pi})$) is also reported in Table II. The performance of the proposed method was compared to our previous method for cataract surgical task categorization [4] and to our implementation of Duchenne's method [20], in terms of Az (see Table II) and in terms of computation times.

TABLE II: Performance of surgical task recognition in the 50-video test set (Az)

  Task                      (P, s) adapt.   (P, Π) adapt.   (P, Π) adapt., single-class MIL   Previous [4]   Duchenne et al. [20]
  incision                  0.771           0.816           0.780                             0.741          0.801
  rhexis                    0.796           0.814           0.805                             0.878          0.837
  hydrodissection           0.702           0.753           0.742                             0.762          0.719
  phacoemulsification       0.911           0.944           0.911                             0.923          0.912
  epinucleus removal        0.882           0.924           0.900                             0.969          0.946
  viscous agent injection   0.860           0.920           0.641                             0.561          0.614
  implant setting-up        0.779           0.821           0.798                             0.703          0.792
  viscous agent removal     0.765           0.737           0.764                             0.729          0.695
  stitching up              0.883           0.932           0.955                             0.883          0.982
  average                   0.817           0.851           0.811                             0.794          0.811
  standard error            0.023           0.027           0.032                             0.043          0.041

Fig. 5: Performance of surgical task recognition with $(\hat{P}, \hat{\Pi})$ adaptation in the 50-video test set: ROC curves (true positive rate versus false positive rate) for each of the nine surgical tasks.

Using one core of an Intel Xeon(R) processor running at 2.53 GHz, the proposed approach processed 24.4 frames per second (FPS) while our previous method processed 24.3 FPS and Duchenne's method processed 0.72 FPS. Our methods were implemented in C++ using OpenCV (http://opencv.org) and LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). The most computationally-intensive part of Duchenne's method, namely Space-Time Interest Point extraction [14], is OpenCV code provided by the authors (http://www.di.ens.fr/~laptev/download.html). The rest of the method was implemented in C++, also using OpenCV and LIBSVM. SVM classification takes less than 0.01 milliseconds on average, due to the limited number of input dimensions (one dimension per task) and the limited number of samples (there are 132 support vectors per classifier on average).

To assess the overall accuracy of the system, we performed an additional experiment where the nine task recognition SVMs are run simultaneously. The largest SVM prediction response defines the most likely surgical task. An overall accuracy of 75.9% was achieved by the system, which is significantly larger than the overall accuracy achieved by Duchenne's method (69.3%, p < 0.0001) or by the previous system (72.9%, p = 0.0468) [4]. P-values were computed using two-sided exact binomial tests.

C. Results of the Joint Segmentation and Recognition Experiment

To segment idle phases, an 'action detector' was trained using $(\hat{P}, \hat{\Pi})$ adaptation (§VIII-C). Using the same segmentation parameters as in [5], a false positive rate (FPR) of 3.2 was measured while achieving the sensitivity reported in [5] (0.846). This is comparable to the FPR reported in [5] (3.5).

Regarding the recognition step, the optimal parameter values obtained in the first experiment, including the polynomial bases and the adapted spatiotemporal polynomials, were used in this second experiment. Other parameters were trained as described in [5]. The performance of the proposed method was compared to our previous method for surgical task segmentation and categorization [5] in terms of Az (see Table III). ROC curves are reported in Fig. 6. An overall accuracy of 81.2% was achieved by the system, which is significantly higher than the overall accuracy achieved by the previous system (79.3%, p = 0.0342) [5].

Fig. 6: Performance of surgical task recognition in the 93 automatically segmented test videos: ROC curves (true positive rate versus false positive rate) for each of the nine surgical tasks and the miscellaneous category.

TABLE III: Performance of surgical task recognition (Az) in the 93 automatically segmented test videos

  Task                      Proposed   Previous [5]
  incision                  0.953      0.943
  rhexis                    0.784      0.850
  hydrodissection           0.820      0.883
  phacoemulsification       0.920      0.891
  epinucleus removal        0.866      0.840
  viscous agent injection   0.848      0.722
  implant setting-up        0.875      0.810
  viscous agent removal     0.855      0.768
  stitching up              0.911      0.863
  miscellaneous             0.728      0.748
  average                   0.856      0.832
  standard error            0.021      0.022

Note that surgical task segmentation is faster than categorization. It can be run concurrently in a different thread, so jointly segmenting and recognizing surgical tasks does not impact the real-time performance of the system.

X. DISCUSSION

A novel framework for task segmentation and recognition in cataract surgery videos was presented in this paper. In order to normalize motion information in videos, each frame was registered to a coordinate system attached to the anterior segment of the eye. Then, the motion content of short video subsequences was modeled using spatiotemporal polynomials. Next, for each surgical task, key spatiotemporal polynomials were identified through multiple-instance learning. These key spatiotemporal polynomials were then searched for in new videos to detect key surgical gestures and therefore recognize the target surgical tasks. To improve recognition performance, the basis of spatiotemporal polynomials was adapted and a support-vector machine was used to combine multiple key spatiotemporal polynomial detectors. In a dataset of 900 surgical task videos from 100 cataract surgeries, the proposed method compared favorably to our previous method for cataract surgical task recognition [4]. It also compared favorably to Duchenne's method [20] for human action recognition. In a dataset of 186 videos from full cataract surgeries, it compares favorably to our previous method for the joint segmentation and recognition of cataract surgery tasks [5].

Our previous method for cataract surgical task recognition also relied on the extraction of short video subsequences. However, motion was analyzed separately along the spatial dimensions and along the temporal dimension. The proposed approach does not have this limitation. The spatiotemporal analysis performed in this new solution does not imply larger computation times (24.4 FPS, as opposed to 24.3 FPS).

The comparison to Duchenne et al.'s method is particularly interesting since it is a typical example of the bag-of-visual-words (BoW) model frequently used in computer vision. In the BoW model, we are mainly interested in local image/motion patterns and the global image/motion pattern is ignored. We believe the proposed approach is more relevant in a surgery context for two reasons. First, motion information tends to be more global in surgery videos than in general purpose videos: typically, when the surgeon moves a tool, the surrounding tissues are affected. One advantage of the proposed method is that it captures both local and global motion, through high-order and low-order polynomials respectively. Second, Duchenne's method only characterizes motion in the surroundings of salient points. Therefore, it fails to characterize the deformation of smooth tissues. To summarize, we believe the proposed method is good at capturing deformations at multiple scales. In terms of computation times, the proposed method is advantageous. As opposed to Duchenne's method, it allows real-time analysis of videos (24.4 FPS, as opposed to 0.72 FPS). The main reason for that difference is that the proposed method relies on the extraction of 2-D interest points (§IV) while Duchenne's method relies on the extraction of 3-D (spatiotemporal) interest points. In Duchenne's method, once 3-D interest points are detected, local spatiotemporal characterizations are extracted at their location, so that local motions can be finely analyzed. In the proposed method, the underlying 3-D motion information is captured afterwards using polynomial approximations, which is very fast.

Compared to previous works on multiple-instance learning and spatiotemporal polynomials, the main novelty lies in the design of an adaptive basis of spatiotemporal polynomials and the generalization of diverse density for basis adaptation. Adapting the polynomial basis, rather than training a set of weights, improves performance significantly (see table II). This solution increases the dimensionality of the optimization problem. However, the principal component analysis seems to provide a good initial solution for the adaptive polynomial basis, which makes the optimization process easier. We believe the use of a classifier to combine multiple key instance detectors, each detector being trained for a different target class, is also novel. This solution is particularly helpful when no good key instance detector is found for a given class. In our experiment, this was the case for the 'viscous agent injection' class: an area Az = 0.641 was found under the ROC curve of Pr(+|V, P, Π), as opposed to Az = 0.920 using the classifier (see table II). Overall, this multiclass MIL solution improves performance significantly.
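A minimal sketch of this combination step is given below, assuming scikit-learn (not mentioned in the paper) and using random placeholders for the detector responses and labels; the number of classes is also an assumption. It trains a multiclass SVM on the vector of per-class key-instance detector responses, so that a class whose own detector is weak can still be recognized from the joint response pattern.

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative placeholders: one key-instance detector score per target class.
rng = np.random.default_rng(0)
n_videos, n_classes = 900, 9                        # 9 classes is an assumption
scores = rng.random((n_videos, n_classes))          # per-class detector responses
labels = rng.integers(0, n_classes, size=n_videos)  # ground-truth task labels

# The SVM can exploit correlations between detectors, which helps when no
# single detector is reliable for a given class.
svm = SVC(kernel="rbf", probability=True)
svm.fit(scores, labels)
print(svm.predict_proba(scores[:1]))                # combined posterior for one video
```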

One limitation of the proposed method is that motion is the only visual feature used for surgical task recognition, although color is used to detect the pupil center and therefore normalize motion information. However, including other visual features may significantly improve performance. In particular, linking the location of surgical tools with the detected motion seems promising. This may be done through the proposed spatiotemporal polynomial framework. So far, our spatiotemporal polynomials are two-dimensional: the first dimension is associated with horizontal motion, the second is associated with vertical motion. In future works, additional features such as tool detections will be included in the framework as additional dimensions of the spatiotemporal polynomials.

In conclusion, a novel action recognition framework has been presented. This framework seems particularly suited to surgery videos, as confirmed by an experiment on a cataract surgery dataset. However, it could also benefit other problems where relevant visual information does not necessarily lie in the neighborhood of salient points and where motion, and possibly other visual features, cannot be easily segmented.

REFERENCES

[1] X. Castells, M. Comas, M. Castilla, F. Cots, and S. Alarcon, "Clinical outcomes and costs of cataract surgery performed by planned ECCE and phacoemulsification," Int Ophthalmol, vol. 22, no. 6, pp. 363–7, 1998.

[2] F. Lalys, L. Riffaud, D. Bouget, and P. Jannin, "A framework for the recognition of high-level surgical tasks from video images for cataract surgeries," IEEE Trans Biomed Eng, vol. 59, no. 4, pp. 966–76, 2012.

[3] ——, "An application-dependent framework for the recognition of high-level surgical tasks in the OR," in Proc MICCAI'11, vol. 14, 2011, pp. 331–8.

[4] G. Quellec, K. Charriere, M. Lamard, Z. Droueche, C. Roux, B. Cochener, and G. Cazuguel, "Real-time recognition of surgical tasks in eye surgery videos," Med Image Anal, vol. 18, no. 3, pp. 579–90, 2014.

[5] G. Quellec, M. Lamard, B. Cochener, and G. Cazuguel, "Real-time segmentation and recognition of surgical tasks in cataract surgery videos," IEEE Trans Med Imaging, in press.

[6] J. Lafferty, A. McCallum, and F. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in Proc ICML'01, 2001, pp. 282–9.

[7] N. Padoy, T. Blum, S. Ahmadi, H. Feussner, M. Berger, and N. Navab, "Statistical modeling and recognition of surgical workflow," Med Image Anal, vol. 16, no. 3, pp. 632–41, 2012.

[8] L. Tao, L. Zappella, G. D. Hager, and R. Vidal, "Surgical gesture segmentation and recognition," in Lecture Notes in Computer Science, vol. 8151, 2013, pp. 339–46.

[9] L. Zappella, B. Bejar, G. Hager, and R. Vidal, "Surgical gesture classification from video and kinematic data," Med Image Anal, vol. 17, no. 7, pp. 732–45, 2013.

[10] B. V. Patel and B. B. Meshram, "Content based video retrieval systems," Int J UbiComp, vol. 3, no. 2, pp. 13–30, 2012.

[11] W. Hu, N. Xie, L. Li, X. Zeng, and S. Maybank, "A survey on visual content-based video indexing and retrieval," IEEE Trans Syst Man Cybern C Appl Rev, vol. 41, no. 6, pp. 797–819, 2011.

[12] F. Rothganger, S. Lazebnik, C. Schmid, and J. Ponce, "Segmenting, modeling, and matching video clips containing multiple moving objects," IEEE Trans Pattern Anal Mach Intell, vol. 29, no. 3, pp. 477–491, 2007.

[13] T. Yamasaki and K. Aizawa, "Motion segmentation and retrieval for 3D video based on modified shape distribution," EURASIP J Appl Signal Process, vol. 2007, no. 1, p. 059535, 2007.

[14] I. Laptev, "On space-time interest points," Int J Comput Vis, vol. 64, no. 2-3, pp. 107–123, 2005.

[15] Y.-G. Jiang, C.-W. Ngo, and J. Yang, "Towards optimal bag-of-features for object categorization and semantic video retrieval," in Proc ACM CIVR, 2007, pp. 494–501.

[16] S. Jeannin, "On the combination of a polynomial motion estimation with a hierarchical segmentation based video coding scheme," in Proc ICIP, 1996, pp. 489–492.

[17] O. Kihl, B. Tremblais, B. Augereau, and M. Khoudeir, "Human activities discrimination with motion approximation in polynomial bases," in Proc ICIP, 2010, pp. 2469–2472.

[18] X. Hu and N. Ahuja, "Long image sequence motion analysis using polynomial motion models," in Proc MVA, 1992, pp. 109–114.

[19] J. Jakubiak, S. Nomm, J. Vain, and F. Miyawaki, "Polynomial based approach in analysis and detection of surgeon's motions," in Proc ICARCV, 2008, pp. 611–616.

[20] O. Duchenne, I. Laptev, J. Sivic, F. Bach, and J. Ponce, "Automatic annotation of human actions in video," in Proc IEEE ICCV, 2009, pp. 1491–1498.

[21] R. Pires, H. F. Jelinek, J. Wainer, S. Goldenstein, E. Valle, and A. Rocha, "Assessing the need for referral in automatic diabetic retinopathy detection," IEEE Trans Biomed Eng, vol. 60, no. 12, pp. 3391–8, 2013.

[22] Z. Liu, H. Li, W. Zhou, R. Zhao, and Q. Tian, "Contextual hashing for large-scale image search," IEEE Trans Image Process, vol. 23, no. 4, pp. 1606–14, 2014.

[23] R. Ji, L.-Y. Duan, J. Chen, L. Xie, H. Yao, and W. Gao, "Learning to distribute vocabulary indexing for scalable visual search," IEEE Trans Multimedia, vol. 15, no. 1, pp. 153–66, 2013.

[24] J. Amores, "Multiple instance classification: review, taxonomy and comparative study," Artif Intell, vol. 201, pp. 81–105, 2013.

[25] O. Maron and T. Lozano-Perez, "A framework for multiple-instance learning," in Proc NIPS, 1998, pp. 570–576.

[26] S. Andrews, I. Tsochantaridis, and T. Hofmann, "Support vector machines for multiple-instance learning," in Proc NIPS, vol. 15, 2003, pp. 561–568.

[27] J. R. Foulds and E. Frank, "Speeding up and boosting diverse density learning," in Proc DS, 2010, pp. 102–116.

[28] G. Quellec, M. Lamard, M. D. Abramoff, E. Decenciere, B. Lay, A. Erginay, B. Cochener, and G. Cazuguel, "A multiple-instance learning framework for diabetic retinopathy screening," Med Image Anal, vol. 16, no. 6, pp. 1228–40, 2012.

[29] G. Quellec, K. Charriere, M. Lamard, B. Cochener, and G. Cazuguel, "Normalizing videos of anterior eye segment surgeries," in Proc IEEE EMBC, 2014, in press.

[30] G. Farneback, "Two-frame motion estimation based on polynomial expansion," in LNCS, 2003, pp. 363–70.

[31] B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Proc IUW, 1981, pp. 121–130.

[32] C. G. Broyden, "The convergence of a class of double-rank minimization algorithms," J Inst Math Appl, vol. 6, pp. 76–90, 1970.

[33] L. Breiman, "Random forests," Mach Learn, vol. 45, pp. 5–32, 2001.

[34] C. Cortes and V. Vapnik, "Support-vector networks," Mach Learn, vol. 20, no. 3, pp. 273–297, 1995.