

This document must be attached to the file submitted to the selection committee

APPLICATION FOR RECRUITMENT TO A POSITION OF UNIVERSITY PROFESSOR (PROFESSEUR DES UNIVERSITÉS) (2013 campaign)

(Decree no. 84-431 of 6 June 1984, as amended)
Authentication: c3a8baf720dc2dad4cb4b2fd7da2843e (1362134012034)

Addressed to the head of: INSTITUT POLYTECHNIQUE DE BORDEAUX
Position(s): no. 4029 (position likely to become vacant)
Published: 26 February 2013, 00:02
C.N.U. section(s): 27 (Computer Science)
Profile: computer science
Location: Talence
Article 46-1
Chair: No

I, the undersigned, Mr.
Family name: BLIN
Name in use: BLIN
First name: GUILLAUME
Date and place of birth: 21/12/1979 - ST GERMAIN EN LAYE
Nationality: French
NUMEN: 24S0600376DNK
Qualification no.: 13127164252

Current position and institution: Faculty member (Maître de conférences), UNIVERSITE DE MARNE LA VALLEE
Created: 01/03/2013, 11:03
Last modified: 01/03/2013, 11:03
French university degrees: Habilitation à diriger des recherches
Most recent degree: HDR

Postal and e-mail address to which all correspondence will be sent:
87 RUE BELLIARD
Postal code: 75018 City: PARIS Country: FRANCE
Telephone: 0142542811 Fax: 0160957557
E-mail: [email protected]


Title: Combinatorial Objects in Bio-Algorithmics: Related problems and complexities
Date of defense: 18/06/2012
Place of defense: UNIVERSITE DE MARNE LA VALLEE
Distinction:
Advisor: NO ADVISOR
Jury: THIERRY LECROQ, examiner; BERNARD MORET, reviewer; ERIC RIVALS, examiner; MARIE-FRANCE SAGOT, reviewer; MAXIME CROCHEMORE, president; LAURENT VUILLON, reviewer

Teaching activities:
About 220 hours of teaching per year between 2006 and 2013 at the University of Marne-la-Vallée (except for one year of research leave (délégation) and a half-CRCT sabbatical), from first-year undergraduate (Licence 1) to first-year master's (Master 1) level, and in the first two years of two engineering schools: algorithms, DBMS, network programming, Java, C, computer architecture, web programming, concurrent programming.

Research theme and keywords:
Code 22 (Graphs, combinatorics, complexity) and Code 50 (Bioinformatics).
Bio-algorithmics: algorithmic study of the comparison of biological objects.
Keywords: RNA, genome rearrangement, biological networks, radiotherapy, classical and parameterized complexity, approximation, heuristics, linear programming.

Administrative activities and other collective responsibilities:
* Organization of the laboratory's weekly seminar (2006-2010)
* In charge of the University's bibliography, including the specification and deployment of a HAL portal and the training of faculty members (2010-2013)
* Webmaster of the LIGM (2009-2012)
* Member of the LIGM council (2010-2013) and of its standing committee (2010-2013)
* Head of the 3rd year of the ESIPE network engineering program (2012-2013)

Other titles and degrees:
Holder of the PES (scientific excellence bonus) awarded in the 2012 national campaign for the evaluation of faculty members' applications.

Works, books, articles, achievements:
Over the 10 years of my research career, my scientific output breaks down as follows.
* 12 publications in peer-reviewed international journals.
* 25 publications in peer-reviewed international conferences (11 of them with an acceptance rate below 40%).
* 1 book chapter.


List of documents to be provided by the candidate:
Mandatory documents are those listed, as applicable, in the orders of 7 October 2009 laying down the permanent provisions governing all recruitments of university professors and maîtres de conférences, or in the order of 20 February 2012 on the recruitment procedures for professors and maîtres de conférences of the Muséum national d'Histoire naturelle. These orders are available from the GALAXIE portal (section 'A consulter' in the left-hand column).

I hereby apply for the position designated above.
Done at:        on:

Signature

Application file for recruitment to University Professor position no. 4029

Guillaume Blin
Maître de conférences, Université de Marne-la-Vallée

Habilitation à diriger des recherches (HDR) held since June 2012

Table of contents
Summary curriculum vitae
Teaching activities
Research activities
Administrative responsibilities
Publications
Letters of recommendation
Administrative documents
Sample works

Guillaume Blin
Maître de conférences with habilitation to supervise research (HDR)

LIGM, Université Paris-Est Marne-la-Vallée

Personal details
Family name: Blin.
First name: Guillaume.
Born: 21 December 1979, Saint Germain en Laye (78), France, 33 years old.
Nationality: French.
Status: Maître de conférences with habilitation to supervise research, LIGM, UPEMLV.

Degrees
June 2012 Habilitation à Diriger des Recherches, Université Paris-Est, Laboratoire d’Informatique Gaspard Monge (LIGM).
“Combinatorial Objects in Bio-Algorithmics: Related problems and complexities”.
• Bernard Moret, Professor at EPFL, Switzerland (reviewer),
• Marie-France Sagot, INRIA Research Director at Université Claude Bernard, Lyon (reviewer),
• Laurent Vuillon, Professor at Université de Savoie (reviewer),
• Maxime Crochemore, Professor Emeritus at Université de Marne-la-Vallée (president),
• Thierry Lecroq, Professor at Université de Rouen (examiner),
• Eric Rivals, CNRS Research Director at Université de Montpellier (examiner)

November 2005 PhD in Computer Science, Université de Nantes, Laboratoire d’Informatique Nantes Atlantique (LINA).
“Combinatoire and Bio-informatique : Comparaison de structures d’ARN et calcul de distances intergénomiques”.
• Marie-France Sagot, INRIA Research Director at Université Claude Bernard, Lyon (reviewer),
• Hélène Touzet, CNRS Research Director at Université de Lille (reviewer),
• Guillaume Fertin, Professor at Université de Nantes (co-advisor),
• Romeo Rizzi, Professor at Università degli Studi di Trento, Italy (examiner),
• Irena Rusu, Professor at Université de Nantes (advisor),
• Stéphane Vialette, CNRS Research Director at Université de Marne-la-Vallée (examiner)

Professional experience
2010–2011 CNRS research leave (délégation), Laboratoire d’Informatique Gaspard Monge (LIGM), Université Paris-Est Marne-la-Vallée.

2006–present Maître de Conférences, Laboratoire d’Informatique Gaspard Monge (LIGM), Université Paris-Est Marne-la-Vallée.

2005–2006 Temporary teaching and research fellow (ATER), Laboratoire d’Informatique Gaspard Monge (LIGM), Université Paris-Est Marne-la-Vallée.

LIGM, Université Paris-Est Marne-la-Vallée – 77454 Marne-la-Vallée Cedex 02
Mobile 06 52 05 52 35 • Tel. 01 60 95 77 49 • Fax 01 60 95 75 57
E-mail [email protected] – http://igm.univ-mlv.fr/~gblin


2002–2005 Doctoral research fellow (Allocataire de Recherche MENRT) and teaching assistant (Moniteur), Laboratoire d’Informatique Nantes Atlantique (LINA), Université de Nantes.

Teaching
• About 220 hours of teaching per year between 2006 and 2013 at the University of Marne-la-Vallée (except for one year of research leave and a half-CRCT sabbatical), from first-year undergraduate to first-year master's level, and in the first two years of two engineering schools: algorithms, databases, network programming, Java, C, computer architecture, web programming, concurrent programming.
• Head of the 3rd year of the ESIPE network engineering program since 2012.
• In charge of the Last Project of the 3rd year of the network engineering program since 2012.
• Local Socrates-Erasmus coordinator for 2 agreements since 2006.
• Academic advisor (Enseignant Référent) for 9 first-year undergraduate students under the 2012 Plan Licence.
• Faculty representative of ESIPE at the 2012 Salon européen de l’éducation, at the UPEMLV Masters Forum, at the UPEMLV and ESIPE open days, and at the further-studies forum of the IUT de Nantes.

Administration
• Head and organizer of the LIGM weekly seminar from 2006 to 2010.
• In charge of the bibliography of the whole University of Marne-la-Vallée since 2010.
• In charge of the LIGM website since 2009.
• Elected member of the Laboratory Council since 2010.
• Elected member of the Standing Committee since 2010.
• Member of 8 selection committees (section 27) since 2009.
• Co-organizer of 2 international workshops and 4 joint days of the GDR BIM and GDR IM.

Publications
• 12 publications in peer-reviewed international journals.
• 25 publications in peer-reviewed international conferences (11 of them with an acceptance rate below 40%).
• 1 book chapter.

Research
• Holder of the PES (scientific excellence bonus) awarded in the 2012 national campaign for the evaluation of faculty members' applications.
• Principal investigator of the ANR Jeune Chercheur project BIRDS (2011-2015).
• Research themes (keywords): algorithmic aspects of the comparison of biological objects, classical and parameterized complexity, approximation.
• Co-supervision (50%) of the PhD theses of Florian Sikora (2008-2011, now MdC at LAMSADE) and Paul Morel (2011-2014).
• Reviewer of the PhD thesis of Stefano Beretta (defense planned for 2013), Università degli Studi di Milano-Bicocca (Italy).
• Participation in 2 CNRS PEPS projects, 1 ANR Jeunes Chercheurs project (also as principal investigator), 2 ANR Programmes blancs, 1 French-Italian project, 1 French-Quebec project and 2 ACI.
• Program committee member for 5 international conferences.
• Reviewer of articles for 13 international journals and 9 international conferences.




Teaching activities

Teaching
Context My teaching as a maître de conférences has taken place at the Ecole Supérieure d’Ingénieurs Paris-Est (ESIPE – http://esipe.univ-mlv.fr), at the Ecole d’Ingénieur Image Multimédia Audiovisuel Communication (IMAC – http://www.ingenieur-imac.fr/), and at the Institut d’électronique et d’informatique Gaspard Monge (IGM – http://igm.univ-mlv.fr). Below, CM denotes lectures, TD tutorials and TP lab sessions.

2000–2002 127h TD-equivalent, as an adjunct lecturer (vacataire).
• UFR Droit Nantes – M1: Office software (100h TP)
  ◦ Learning Word, Excel and Access
• IUT Informatique Nantes – 2nd year: Networks, Protocols and Services (38h TD)
  ◦ Basics of communication channels (throughput calculation, frames)
  ◦ Channel interconnection (delivery delay calculation, switching)
  ◦ Physical media (modulation)
  ◦ Encodings (Manchester, differential Manchester), error detection and correction
• IUT Informatique Nantes – Licence Pro: Client/Server, Distributed Processing (10h TP)
  ◦ Programming client/server applications (UDP and TCP communication, blocking and non-blocking)
• IUT Informatique Nantes – Licence Pro: Human-Computer Interaction (23h TP)
  ◦ GUI programming with AWT/Swing

2002–2005 219h TD-equivalent, as a teaching assistant (moniteur).
• Fac. Sciences de Nantes – L2: CAML (30h TP)
  ◦ Introduction to functional programming
• Fac. Sciences de Nantes – L3: Computer architecture (63h TD + 72h TP)
  ◦ General issues in computer architecture
  ◦ Low-level programming
  ◦ Mechanisms used for managing and processing data
• Fac. Sciences de Nantes – L3: Operating systems (27h TD + 8h TP)
  ◦ Characteristics of an operating system
  ◦ Processes
  ◦ Multitasking
  ◦ Process-related system primitives
  ◦ Inter-process communication
• Fac. Sciences de Nantes – M1: Advanced algorithms (25h TP)
  ◦ Algorithms whose application or inspiration comes from everyday life (searching for patterns, connected components, paths in graphs, etc.)
• Fac. Sciences de Nantes – L MIAGE: DBMS (14h TP)
• Fac. Sciences de Nantes – L MIAGE: Computer architecture (20h TD + 15h TP)

2005–2006 104h TD-equivalent, as an ATER.
• ESIPE - IR2: Operating systems (24h TD)
• IGM - L1: HTML (20h TD) and DBMS (36h TD)
  ◦ Basics of HTML
  ◦ Using CSS
  ◦ SQL
  ◦ Entity-relationship and relational schemas

• IGM - L3: Computer architecture (24h TD)

2006–2007 180h TD-equivalent, as a Maître de conférences.
• ESIPE - IR1: Algorithms 1 and 2 (24h TD)
• ESIPE - IR1: Computer architecture (24h TD)
• ESIPE - IR2: Network applications (24h TD)
  ◦ Understanding and mastering the design and implementation of network applications
  ◦ Protocols (writing RFCs)
  ◦ Software architecture of applications (clients, servers, peers, concurrency, non-blocking I/O)
• IGM - L1: DBMS (36h TD)
• IGM - L3: Computer architecture (24h TD)
• IGM - M1: Advanced Java (24h TD)
  ◦ Mastering the Java language
  ◦ Knowing the various idioms and classic techniques of the Java language
  ◦ Recognizing the classic traps and pitfalls and knowing how to resolve them
• IGM - M1: Computational genomics (24h TD)
  ◦ Pairwise sequence alignment
  ◦ Multiple sequence alignment
  ◦ RNA structure prediction
  ◦ Phylogeny
  ◦ Genome rearrangements

2007–2008 228h TD-equivalent, as a Maître de conférences.
• ESIPE - IR1: Algorithms 1 and 2 (20h TD)
• ESIPE - IR1: Computer architecture (24h TD)
• ESIPE - IR1: C programming 1 and 2 (20h TD)
  ◦ Linux and introduction
  ◦ Types and variables
  ◦ Arrays and control structures
  ◦ Functions
  ◦ Structured types
  ◦ Input/output and files
  ◦ Bit manipulation
  ◦ Preprocessor and advanced functions
  ◦ Dynamic allocation
  ◦ Libraries
  ◦ Advanced programming
• ESIPE - IR2: Advanced Java (2h TD)
• ESIPE - IR2: Concurrency and I/O (14h TD)
  ◦ Understanding the specifics of concurrent programming
  ◦ Implementing a concurrent application in Java
• ESIPE - IR2: Network applications (24h TD)
• ESIPE - IR3: XML and XSLT (16h TD)
  ◦ XML/CSS validation
  ◦ XPath and introduction to XSLT
  ◦ Recursion
  ◦ <xsl:key> and grouping
• IGM - L3: Computer architecture (24h TD)
• IGM - M1: Advanced Java (24h TD)
• IGM - M1: Network applications (24h TD)
• IGM - M1: Computational genomics (24h CM)
• IGM - M1: Internship supervision (4h)

2008–2009 245h30 TD-equivalent, as a Maître de conférences.
• ESIPE - IR1: Computer architecture (20h CM + 12h TD)
• ESIPE - IR2: Advanced Java (18h TD)
• ESIPE - IR2: Concurrency and I/O (14h TD)
• ESIPE - IR2: Network applications (24h TD)
• IGM - L3: Computer architecture (24h CM)
• IGM - L3: Networks (18h TD)
• IGM - M1: Advanced Java (24h TD)
• IGM - M1: Network applications (24h TD)
• IGM - M1: Computational genomics (24h CM)
• IGM - M1: Internship supervision (9h30)

2009–2010 227h TD-equivalent, as a Maître de conférences.
• ESIPE - IR1: Computer architecture (20h CM)
• ESIPE - IR2: Advanced Java (20h TD)
• ESIPE - IR2: Concurrency and I/O (14h TD)
• ESIPE - IR2: Network applications (50h TD)
• ESIPE - OC1: Computer architecture (10h CM)
• IGM - L3: Computer architecture (24h CM + 24h TD)
• IGM - M1: Advanced Java (24h TD)
• IGM - M1: Internship supervision (14h)

2010–2011 0h TD-equivalent, as a Maître de conférences on CNRS research leave.

2011–2012 60h TD-equivalent, as a Maître de conférences on a half-CRCT sabbatical.
• IMAC - IMAC1: Algorithms (24h CM + 24h TD)
  ◦ The notion of a program and a refresher on C
  ◦ Data representation
  ◦ Notions of complexity
  ◦ Simple abstract data types
  ◦ Sorting
  ◦ Trees
  ◦ Graphs

2012–2013 222h TD-equivalent, as a Maître de conférences.
• ESIPE - IR1/IG1: C programming (44h CM + 30h TD)
• ESIPE - IR2: Concurrency and I/O (24h TD)
• ESIPE - OC1: C programming (20h CM)
• IMAC - IMAC1: Algorithms (24h CM)
• IGM - L3: C programming (24h CM)




Supervision
2006–present Three-year tutoring of apprentice engineers (7 apprentices), ESIPE, Université Paris-Est, Marne-la-Vallée.
Each apprentice is followed throughout the program by two tutors: an engineer tutor in the company, who is the "Maître d’apprentissage" required by law, and a faculty tutor, who is the counterpart and point of contact at the school. These two tutors are the apprentice's main contacts and meet regularly: a) at the induction seminar for new engineer tutors, at the start of the first year of the program; b) during the faculty tutor's visits to the company; c) at annual meetings to which all tutors are invited; and d) at the defenses of the apprentice's annual work-study assignments.

2008 Research M2 internship of Florian Sikora, LIGM, Université Paris-Est, Marne-la-Vallée.
This internship dealt with the comparison of protein interaction networks, more precisely with motif search, and with the design of a new algorithm offering an extension of QPath and an alternative to QNet.

2002–2005 M1 internships of 8 computer science students (maîtrise) as part of their TER (supervised research project), LINA, Université de Nantes.

Teaching responsibilities and coordination
2013 Faculty representative of ESIPE at the UPEMLV and ESIPE open days and at the further-studies forum of the IUT de Nantes, ESIPE, Université Paris-Est, Marne-la-Vallée.
The aim of the open days and of the further-studies forum is to attract future applicants to our work-study program. The stands of each ESIPE engineering track are run by the track heads, administrative staff and faculty, together with many volunteer apprentices.

2013 Faculty representative at the UPEMLV Masters Forum, UPEMLV, Université Paris-Est, Marne-la-Vallée.
During this forum, faculty and students of the university, together with the staff of the SIO, welcome students, pupils and employees returning to education, in order to guide them and answer their questions about further studies. The event is also an opportunity to introduce them to the structures, support and tools available to students for entering the job market.

2012–present Head of the 3rd year of the network engineering program, ESIPE, Université Paris-Est, Marne-la-Vallée.
The role mainly consists of organizational work (timetables, exam sessions, juries, teaching meetings) and student follow-up.

2012–present In charge of the Last Project of the 3rd year of the network engineering program, ESIPE, Université Paris-Est, Marne-la-Vallée.
The Last Project is a software engineering project intended to put into practice all the knowledge and skills acquired during the network engineering program. It is carried out in teams of at most 6 apprentices. Each team works on a distinct subject, and each subject is sponsored by a faculty member or an industrial partner who plays the role of the client.

November 2012 Faculty representative of ESIPE at the Salon européen de l’éducation, ESIPE, Université Paris-Est, Marne-la-Vallée.
The Salon Européen de l’Education took place from 22 to 25 November at the Parc des expositions, Paris, Porte de Versailles. Students can find information there, talk with education professionals, and meet heads and teachers of higher-education institutions. I took part as faculty representative for ESIPE and UPEMLV.

2012–present Academic advisor (Enseignant Référent) for 9 first-year undergraduate students under the 2012 Plan Licence, Université Paris-Est, Marne-la-Vallée.
The role of the academic advisors is to keep dropouts during the first semester to a minimum, to guide incoming students through their first university courses, to help them adapt their working methods to their new environment and, where necessary, to direct them to the services or people able to resolve difficulties that go beyond day-to-day teaching.




2006–present Head/developer of the shared teaching area, IGM, Université Paris-Est, Marne-la-Vallée.
Together with two colleagues, I set up a collaborative system for publishing teaching material online, based on SVN and XML. The platform manages public and private areas with exercises and solutions for students and teachers. http://igm.univ-mlv.fr/ens

2006–present Local Socrates-Erasmus coordinator, with the University of Bielefeld (Germany).
This agreement enabled an M1 student of UPEMLV (Enrico Siragusa) to obtain a Master's degree in Bioinformatics after one year spent at the University of Bielefeld.

2009–present Local Socrates-Erasmus coordinator, with the University of Brno (Czech Republic).
This agreement enabled 2 doctoral students of the University of Brno (Jiri Koutny and Martin Cermak) to make four working visits to the LIGM.

Teaching project
In recent years I have taught at several levels, in different tracks and on a variety of subjects. I have had the opportunity to teach in very diverse types of institutions (IUT, Faculty, IUP, engineering school) and programs (DUT, Licence, Licence professionnelle, Maîtrise).

My training and experience enable me to teach in various areas of theoretical computer science, in particular its algorithmic aspects and its links with discrete mathematics, but also in areas far removed from my research expertise such as network programming, database management systems, concurrent programming and computer architecture (which make up most of my teaching experience).

If I join the computer science department of ENSEIRB-MATMECA, I will immediately be able to teach courses close to those I have already been responsible for. I could support the "common core" of the 6 tracks in the Working Environment course (IF104), imperative programming in C (PG101 and PG106), algorithms (from introductory to advanced) and data structures (IF101, IF102, IF106, PG116), as well as computer architecture (IT102). I could readily become involved in teaching object-oriented programming in Java (PG202, PG203) and TCP/IP Applications (RE218). On the other hand, I would need a little more time to take on C++ programming, which is taught very little at UPEMLV.

Also in the short term, I can support the courses on Operating Systems (IT201), classical database management systems (IT203) and Concurrent and Distributed Applications (IT310). Although I have never been given the opportunity, I would like to get involved in a course on web programming (Ajax, PHP, XML), potentially within module IF205. I could, without much difficulty, take charge of "low-level" networking courses, namely the introduction to networks (RE100). However, I am not certain that I could teach a course on Information Systems, as I do not have a clear idea of the content of such a course.

On a personal level, I am strongly drawn to electronics and the associated new technologies. I already supervise a few innovative projects at ESIEE (http://www.esiee.fr/), a paraglider autopilot, and at ESIPE, an extension of easystroke (software for touch tablets) using gesture recognition with Kinect. I would like to pursue this opening towards the world of electronics, first through other supervised projects and, eventually, through cross-disciplinary electronics/computer science courses (using the Arduino platform, for example). Joining ENSEIRB-MATMECA would allow me to achieve this through the many projects offered to students throughout their studies (e.g. IT213, IT214).




Finally, I believe that a professorship goes hand in hand with administrative responsibilities. I am currently head of the third year of the ESIPE network engineering program, ESIPE being an apprenticeship-based engineering school. Although the work-study rhythm of the ENSEIRB-MATMECA programs is different, my experience at ESIPE would allow me to readily take on administrative duties related to apprenticeship within the school. Naturally, should I join ENSEIRB-MATMECA, I would accept the administrative duties entrusted to me, whether or not they relate to the apprenticeship program.

Research

Guiding thread and main results
Since 2003, my research has followed one guiding thread: the study of the algorithmic aspects of the comparison of biological objects. I have devoted myself to the algorithmic study of various biological problems that call for the comparison of biological entities. The combinatorial objects involved include permutations, arc-annotated sequences, 2-intervals, binary matrices, DAGs, linear graphs, sequences of signed integers and tree structures. Comparison is a central process in the study of biological processes: things that look alike commonly share a function or a role. The aspects of comparison I have addressed are alignment, the computation of parsimonious scenarios, the computation of distances/scores, reconstruction, prediction and motif search. Alongside this diversity in the meaning of the term comparison, my work has covered a wide variety of entities, namely RNA structures, protein structures and their interactions, gene order and inheritance, and SNPs. The initial phase of each of my works has been to find or provide an adequate algorithmic representation of the entity under study, one that is both suited to the desired level of expressiveness and endowed with useful algorithmic properties.

The contributions I made during my PhD essentially addressed the following binary question: is the problem under study polynomial or not? The answer generally proposed with my collaborators was to design an exact and efficient algorithm when possible or, otherwise, to prove the NP-completeness of the problem. I have always been fascinated by the power of NP-completeness. Indeed, as illustrated by the cartoon published in [Garey:Johnson:1979], rather than admitting one's failure to find an efficient solution to a given problem, one can prove that nobody can find such a solution (unless P = NP).

As early as my PhD, I wanted to go beyond these "negative" results since, as is easily observed, the difficulty of a problem often seems to be proportional to the interest it arouses (which is somewhat frustrating, as illustrated by the cartoon by Danny Hermelin).




I therefore immersed myself in approximation and parameterized complexity. I must admit that I have always had a preference for parameterized complexity since, unlike approximation, it makes it possible to solve hard (and therefore interesting) problems exactly, either by focusing on special cases or by confining the combinatorial explosion.

I would describe my more recent contributions as addressing a more "practical" question: what makes the problem hard? I have tried to answer it with my colleagues as often as possible. I appreciate the systematic procedure that Stéphane Vialette and I have set up in our research group, which consists in studying the problems we consider in depth using a complexity toolbox: hardness proofs, dynamic programming, parameterized complexity, color-coding, kernelization, linear programming, etc.

In this section, I describe the research I have conducted over the last ten years. The themes addressed and the methodologies used demonstrate how well my interests and skills match the current research themes of the MaBioVis team at LaBRI, more precisely within the groups "Algorithmique pour l’analyse de structures biologiques" and "Génomique comparative, modélisation, analyse de données biologiques".

Comparison of RNA structures.

During my PhD, I began by studying the inherent complexity of comparing two RNA structures. From an algorithmic point of view, I studied various comparison formalisms based on the central notion of arc-annotated sequence introduced by P. Evans in 1999. More precisely, I provided classical complexity and approximation results for the Longest Arc-Preserving Common Subsequence (LAPCS) and Arc-Preserving Subsequence (APS) paradigms, as well as for the EDIT distance. We also defined two new paradigms: the Maximum Arc-Preserving Common Subsequence (MAPCS) and the ALIGN hierarchy, for which we provided a nearly exhaustive study. Together these results led to 15 publications: 9 in international conference proceedings and 5 in international journals, summarized in 1 book chapter. Despite a few recent publications in this area, the comparison of RNA structures is no longer a constant focus of my research, but it could be a natural starting point for interactions with Julien Allali and Pascal Ferraro (bearing in mind that they are currently only partly present in the group owing to their involvement in the SIMBALS project).
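To make the notion of arc-annotated sequence concrete, here is a minimal Python sketch of the Arc-Preserving Subsequence (APS) test by exhaustive search. It assumes the usual formalization, which is not spelled out in this dossier: a pattern occurs in a text if some bases of the text can be deleted, together with their incident arcs, so that exactly the pattern and its arcs remain. The function name and the input encoding (a string plus a set of index pairs) are illustrative choices, and the exponential search is only meant for tiny examples, not as one of the algorithms from the publications.

```python
from itertools import combinations


def is_arc_preserving_subsequence(text, text_arcs, pattern, pattern_arcs):
    """Brute-force APS test (illustrative sketch, exponential time).

    `text` and `pattern` are strings; arcs are pairs of 0-based positions.
    Assumed definition: keep len(pattern) positions of `text`; the kept
    letters must spell `pattern`, and the arcs of `text` with both ends
    kept must be exactly the arcs of `pattern`.
    """
    text_arcs = {tuple(sorted(arc)) for arc in text_arcs}
    pattern_arcs = {tuple(sorted(arc)) for arc in pattern_arcs}
    m = len(pattern)
    for kept in combinations(range(len(text)), m):
        # The kept positions must spell the pattern, in order.
        if any(text[pos] != pattern[i] for i, pos in enumerate(kept)):
            continue
        # Arcs induced on the kept positions, renumbered 0..m-1.
        induced = {
            (i, j)
            for i in range(m)
            for j in range(i + 1, m)
            if (kept[i], kept[j]) in text_arcs
        }
        if induced == pattern_arcs:
            return True
    return False
```

For example, with the text "ACGU" carrying a single arc between its first and last bases, the pattern "AU" occurs only if it carries the corresponding arc: deleting C and G cannot remove an arc whose two endpoints are kept.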

Computation of inter-genomic distances.

The second research area that has led to a large number of my publications (12 publications: 9 in international conference proceedings and 3 in international journals) concerns taking biological realities into account in the computation of inter-genomic distances and the detection of gene clusters. Again from an algorithmic point of view, I studied the impact, in terms of complexity, of allowing duplicated genes in the computation of four distances based on the number of Common Intervals, Conserved Intervals, Approximate Common Intervals and Breakpoints. This study also considered the two classical strategies, namely matching and exemplarization. Recently, we proposed a unified notion of approximate common intervals and investigated the complexity of locating them. I also proposed to consider the above notions in the setting of linearizations of partial orders, which allows a more general representation of genomes. The skills I have acquired in this area could naturally lead to interactions with the group "Génomique comparative, modélisation, analyse de données biologiques", but this will depend on the future research direction taken by the entity that may replace the INRIA project-team Magnome.
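As an illustration of two of the measures named above, the following Python sketch counts breakpoints and lists common intervals for the simplest setting only: a single unsigned permutation of 1..n compared with the identity, without duplicated genes. This is precisely the easy case that the work described here goes beyond; the function names and conventions (framing the permutation with 0 and n+1, reporting intervals as 0-based index pairs) are assumptions made for the example.

```python
def count_breakpoints(perm):
    """Breakpoints of an unsigned permutation of 1..n against the identity.

    The permutation is framed by 0 and n+1; a breakpoint is a pair of
    neighbours whose values are not consecutive integers.
    """
    n = len(perm)
    framed = [0] + list(perm) + [n + 1]
    return sum(1 for a, b in zip(framed, framed[1:]) if abs(a - b) != 1)


def common_intervals(perm):
    """Common intervals of `perm` and the identity, as (i, j) with i < j.

    perm[i..j] is a common interval when its values form a set of
    consecutive integers, i.e. max - min equals j - i. Quadratic scan.
    """
    found = []
    for i in range(len(perm)):
        low = high = perm[i]
        for j in range(i + 1, len(perm)):
            low = min(low, perm[j])
            high = max(high, perm[j])
            if high - low == j - i:
                found.append((i, j))
    return found
```

On the permutation [3, 1, 2, 4], the sketch reports 3 breakpoints and the common intervals (0, 2), (0, 3) and (1, 2). With duplicated genes, a matching or an exemplar must first be chosen between the two genomes, and it is that choice that makes the problems hard.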




Comparison of biological networks.

During the PhD of Florian Sikora, which I co-supervised from 2008 to 2011, we studied motif search in biological networks. Given a biological network and a motif, the problem is to find the subnetworks that match the motif. Since the problem is inherently hard (it is related to graph homeomorphism), we proposed a parameterized-complexity solution, implemented in the software PADA1. We also studied the search for topology-free motifs (i.e. motifs given as a set of vertices), where connectivity constraints are imposed on the subgraph corresponding to an occurrence. These results, again in parameterized complexity, led to a software tool called GraMoFoNe, distributed as a Cytoscape plugin. Altogether, this work led to 4 publications (3 in the proceedings of international conferences and 1 in an international journal) and to 2 software tools, one of which is maintained and available at http://igm.univ-mlv.fr/AlgoB/gramofone.
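As an illustration only, the topology-free setting can be sketched with a brute-force search. The sketch assumes the common formulation in which every vertex carries a label (for instance a protein family), the motif is a multiset of labels, and an occurrence is a connected vertex set whose labels match the motif exactly. The function names, the labelling and the toy network are hypothetical; this is not the parameterized algorithm used in GraMoFoNe, and it runs in exponential time.

```python
from collections import Counter
from itertools import combinations


def is_connected(vertices, adjacency):
    """Check that `vertices` induces a connected subgraph (depth-first search)."""
    vertices = set(vertices)
    if not vertices:
        return True
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adjacency[u]:
            if v in vertices and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == vertices


def motif_occurrences(adjacency, label, motif):
    """Enumerate connected vertex sets whose label multiset equals the motif.

    Brute force over all vertex subsets of the motif's size: only meant
    to make the problem statement concrete on tiny inputs.
    """
    wanted = Counter(motif)
    size = sum(wanted.values())
    return [
        set(candidate)
        for candidate in combinations(sorted(adjacency), size)
        if Counter(label[v] for v in candidate) == wanted
        and is_connected(candidate, adjacency)
    ]


# Hypothetical toy network: a path a - b - c - d.
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
label = {"a": "red", "b": "blue", "c": "red", "d": "green"}
occurrences = motif_occurrences(adjacency, label, ["red", "blue"])
```

On this toy input the motif {red, blue} has two occurrences, {a, b} and {b, c}. The pair {a, c} is never considered a match because its labels are red and red.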

Other topics.

Finally, I have carried out some side work, arising from occasional collaborations following meetings, seminars or open questions. I have thus worked on the problem of computing a median of permutations (1 international conference paper, 1 international journal paper and 1 international journal paper under submission), on specific properties of binary matrices (3 international conference papers and 2 international journal papers), and on the problem of storing massive data (1 international conference paper and 1 national conference paper). It is precisely the study of binary matrices that led me to the problems related to radiotherapy, which currently take up most of my research time within the ANR Jeune Chercheur project that I lead.

Long-term research project

As shown above, my work on a) RNA structures, b) protein structures and their interactions and c) gene order and gene inheritance fits well with the existing topics of the MaBioVis team and should allow me, in the very short term, to work with the current members of the team. I also have a longer-term research project, which I present below.

Within the ANR Jeune Chercheur project BIRDS (2011-2015), which I coordinate, a large part of my research time (80%) is already devoted to specific topics. In short, this ambitious project consists of three independent tasks: i) the algorithmics of d-intervals, ii) the search for topology-free motifs in biological networks and iii) radiotherapy. The first task is mainly carried out by S. Vialette, with my occasional help, in collaboration with R. Rizzi of the University of Verona. The second was completed through the PhD of Florian Sikora and our collaborations with the Combi team at LINA. Since January 2011, I have focused on radiotherapy, a topic on which I co-supervise the PhD of Paul Morel with S. Vialette (LIGM) and X. Wu (University of Iowa, Iowa City). It is this last research direction that I would like to pursue and develop in the long term within the MaBioVis team at LaBRI.

This research topic is not among the team's current topics. However, the project fits fully within the integrated cancer research site (SIRIC) "Bordeaux Recherche Intégrée Oncologie", led by the Institut Bergonié and accredited last July by the Institut national du cancer (INCa). Through its activity in the CBiB (Centre de Bioinformatique de Bordeaux), the MaBioVis team has already been asked to contribute to the SIRIC in one of its areas of expertise, namely sequence algorithmics. The Radiotherapy department of the Institut Bergonié has a technical platform, for both external radiotherapy and brachytherapy, that can keep pace with technological change. One of the department's major goals is, a priori, to continue developing innovative techniques. It is precisely in this setting that I would like to develop collaborations between the MaBioVis team and the Radiotherapy department of the Institut Bergonié.

LIGM, Université Paris-Est Marne-la-Vallée – 77454 Marne-la-Vallée Cedex 02

Mobile 06 52 05 52 35 • Tel. 01 60 95 77 49 • Fax 01 60 95 75 57 • [email protected] – http://igm.univ-mlv.fr/~gblin


Radiotherapy treats cancer by irradiating tumor cells and is used for more than half of cancer patients. The first phase consists in locating, through a three-dimensional view obtained from X-rays, the tumor as well as the so-called organs at risk (i.e. those in the radiation field). The second phase is the prescription, by the physician, of the irradiation doses for the tumor and for each organ at risk. These doses often conflict, since one wants both a sufficient radiation level for the tumor and a level as low as possible for the organs at risk. The last phase of the treatment consists in irradiating the tumor with a linear particle accelerator. Among the main radiotherapy techniques, we focus on external radiotherapy and brachytherapy (internal radiotherapy).

External radiotherapy.

External radiotherapy is based on concentrating a particle beam on the tumor area. It mainly uses photon beams, but other techniques use protons (proton therapy). The radiation source rotates at variable speed around the patient and delivers precise radiation doses at each angle (a configuration also called a beam). Intensity modulation is achieved with the metal leaves of a multileaf collimator (MLC), which partially block the radiation.

Once the arc is completed, the irradiation is concentrated on the tumor, which minimizes the radiation received by the organs at risk. The computational model is based on a discretization of the arc. The algorithmic problem then amounts to decomposing each beam into a linear combination of binary matrices, each binary matrix representing one configuration of the leaves. The binary matrices obtained have the so-called consecutive ones property. I have already studied this property in earlier work presented at CiE 2010 and CSR 2011, and it is currently the subject of a PEPS project in which I take part.

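In the following example, for a given angle, the decomposition yields three successive leaf configurations that together deliver the desired radiation dose. The coefficients (1, 2 and 4) represent irradiation times.

\[
\begin{bmatrix} 3 & 6 & 4 \\ 2 & 1 & 5 \end{bmatrix}
= 1 \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 1 \end{bmatrix}
+ 2 \begin{bmatrix} 1 & 1 & 0 \\ 1 & 0 & 0 \end{bmatrix}
+ 4 \begin{bmatrix} 0 & 1 & 1 \\ 0 & 0 & 1 \end{bmatrix}
\tag{1}
\]

(In the original layout, each binary matrix is also drawn as the corresponding leaf configuration, with the same coefficients 1, 2 and 4.)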
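The decomposition (1) can be checked mechanically. The following minimal sketch assumes that the six entries of each matrix are laid out as 2 rows of 3 columns, read row by row, which is a guess about the original layout. The helper names are hypothetical. It verifies that the weighted sum of the three binary matrices gives the intensity matrix and that every row of every binary matrix has its 1s in a single block (the consecutive ones property imposed by the leaves).

```python
def has_consecutive_ones(row):
    """True if the 1s of a binary row form one contiguous block (or there are none)."""
    ones = [i for i, value in enumerate(row) if value == 1]
    return not ones or ones[-1] - ones[0] + 1 == len(ones)


def weighted_sum(weights, matrices):
    """Entry-wise sum of weights[k] * matrices[k]."""
    n_rows, n_cols = len(matrices[0]), len(matrices[0][0])
    return [
        [sum(w * m[r][c] for w, m in zip(weights, matrices)) for c in range(n_cols)]
        for r in range(n_rows)
    ]


# Intensity matrix of the beam (assumed 2 x 3 layout).
beam = [[3, 6, 4], [2, 1, 5]]

# The three leaf configurations and their irradiation times.
configurations = [
    [[1, 0, 0], [0, 1, 1]],
    [[1, 1, 0], [1, 0, 0]],
    [[0, 1, 1], [0, 0, 1]],
]
times = [1, 2, 4]

decomposition_ok = weighted_sum(times, configurations) == beam
rows_ok = all(has_consecutive_ones(row) for m in configurations for row in m)
total_time = sum(times)  # total irradiation time of this decomposition
```

Both checks succeed on this example. Two natural objectives for such a decomposition are the number of configurations (3 here) and the total irradiation time (1 + 2 + 4 = 7 here).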




In this project, we study innovative devices for external radiotherapy. More precisely, we aim to establish the theoretical possibilities and limits of two devices that exploit the properties of MLCs. The first uses the rotation of the MLC during irradiation to reduce the patient's irradiation time. It is easy to see that, in some cases, using two orientations of the MLC reduces the number of configurations needed for the treatment. For example, the following decomposition, which uses 4 horizontal and 2 vertical configurations, needs 6 configurations instead of the 8 required with a single orientation, a reduction of one quarter.

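\[
\begin{bmatrix} 1 & 4 & 2 & 5 \\ 1 & 3 & 3 & 2 \\ 1 & 3 & 5 & 5 \\ 6 & 4 & 6 & 0 \end{bmatrix}
=
\begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ 1 & 1 & 1 & 0 \end{bmatrix}
+
\begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 1 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ 1 & 1 & 1 & 0 \end{bmatrix}
+
\begin{bmatrix} 0 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 \\ 0 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 \end{bmatrix}
+
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 \end{bmatrix}
+
\begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{bmatrix}
+
\begin{bmatrix} 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \end{bmatrix}
\]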
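The double-orientation example above can be verified with a short script. This is a sketch under one assumption: each 16-digit string from the example is read as a 4 x 4 matrix, row by row. Under that reading, the first four binary matrices have the consecutive ones property on their rows (horizontal leaves) and the last two have it on their columns (vertical leaves). The helper names are hypothetical.

```python
def parse(bits, width):
    """Split a string of digits into rows of `width` integers (row-major order)."""
    return [[int(b) for b in bits[i:i + width]] for i in range(0, len(bits), width)]


def rows_have_consecutive_ones(matrix):
    """True if, in every row, the 1s form one contiguous block (or there are none)."""
    for row in matrix:
        ones = [i for i, value in enumerate(row) if value]
        if ones and ones[-1] - ones[0] + 1 != len(ones):
            return False
    return True


def columns_have_consecutive_ones(matrix):
    """Same property, checked on the columns (i.e. on the transposed matrix)."""
    return rows_have_consecutive_ones([list(col) for col in zip(*matrix)])


target = parse("1425133213556460", 4)
horizontal = [parse(b, 4) for b in (
    "0001001100111110", "0001011100111110",
    "0111111001111110", "1111000011111110",
)]
vertical = [parse(b, 4) for b in ("0100010001011010", "0101000000101010")]

# Entry-wise sum of the six configurations (all coefficients equal to 1).
total = [[sum(m[r][c] for m in horizontal + vertical) for c in range(4)]
         for r in range(4)]

sum_matches = total == target
orientations_ok = (all(rows_have_consecutive_ones(m) for m in horizontal)
                   and all(columns_have_consecutive_ones(m) for m in vertical))
```

Both checks pass. The two vertical configurations do not satisfy the row version of the property, so they could not be realized without rotating the MLC.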

The second device "superposes" two MLCs orthogonally, i.e. places them so that the leaves of one are perpendicular to those of the other. For example,

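\[
M =
\begin{bmatrix}
0 & 1 & 1 & 0 & 1 & 0 \\
1 & 0 & 1 & 1 & 1 & 1 \\
1 & 0 & 1 & 1 & 0 & 0 \\
1 & 0 & 1 & 1 & 1 & 1 \\
1 & 0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 1 & 1
\end{bmatrix}
\]

is a possible configuration, obtained with the following leaf positions:

\[
M_{\text{horizontale}} =
\begin{bmatrix}
0 & 1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 & 1 & 1
\end{bmatrix}
\quad \text{and} \quad
M_{\text{verticale}} =
\begin{bmatrix}
0 & 1 & 1 & 0 & 1 & 0 \\
1 & 0 & 1 & 1 & 1 & 1 \\
1 & 0 & 1 & 1 & 1 & 1 \\
1 & 0 & 1 & 1 & 1 & 1 \\
1 & 0 & 1 & 0 & 1 & 1 \\
1 & 0 & 0 & 0 & 1 & 1
\end{bmatrix}.
\]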
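The superposition can be read as an entry-wise logical AND: a cell is irradiated only if it is left open by both MLCs. The sketch below assumes this reading and a row-by-row 6 x 6 layout of the three digit strings from the example. The helper names are hypothetical. It checks that M is the AND of the two matrices, that the horizontal one has consecutive 1s in every row, and that the vertical one has consecutive 1s in every column.

```python
def parse(bits, width):
    """Split a string of digits into rows of `width` integers (row-major order)."""
    return [[int(b) for b in bits[i:i + width]] for i in range(0, len(bits), width)]


def rows_have_consecutive_ones(matrix):
    """True if, in every row, the 1s form one contiguous block (or there are none)."""
    for row in matrix:
        ones = [i for i, value in enumerate(row) if value]
        if ones and ones[-1] - ones[0] + 1 != len(ones):
            return False
    return True


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


M = parse("011010101111101100101111101000100011", 6)
M_horizontal = parse("011111111111111100111111111000111111", 6)
M_vertical = parse("011010101111101111101111101011100011", 6)

# A cell stays open only if both collimators leave it open.
superposed = [[h & v for h, v in zip(h_row, v_row)]
              for h_row, v_row in zip(M_horizontal, M_vertical)]

realizes_M = superposed == M
leaves_ok = (rows_have_consecutive_ones(M_horizontal)
             and rows_have_consecutive_ones(transpose(M_vertical)))
```

Both checks pass, while M itself has consecutive 1s neither in all its rows nor in all its columns. A single MLC in either orientation could therefore not produce M in one configuration.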

In both cases, the algorithmic possibilities have barely been explored. The corresponding algorithmic problems are "naturally" hard (a paper is being written). I would like to investigate the parameterized complexity and the approximability of these problems. A practical evaluation of the resulting algorithms on anonymized patient data would validate the approaches eventually retained. Here again, interaction with the Institut Bergonié and the oncology center in Iowa City will be essential.




Proton therapy.

Proton therapy is a variant of external radiotherapy based on proton beams. Its major advantage is that all protons of a given energy share a computable penetration depth (no proton travels beyond it). This property avoids irradiating the patient right through, as happens with photons. Moreover, unlike photon therapy, the dose delivered to the tissue is maximal over the last few millimeters of this depth (the Bragg peak). This depth depends on the energy to which the particles have been accelerated by the proton accelerator. Proton therapy, however, requires heavy equipment. For example, the Centre de Protonthérapie d'Orsay (one of only two French centers) uses a 240-tonne cyclotron.

During the PhD of Paul Morel, we study how to account for patient motion when planning the treatment. The main goal is to provide software that, first, shows the oncologist the effects of patient motion on the initially planned treatment. Second, it should propose an online modification of the treatment to "compensate" for the errors induced by patient motion. This part of the project would be carried out in collaboration with the University of Iowa (Iowa City), which is about to set up a proton therapy center.

Brachytherapy.

Finally, I wish to address the algorithmic aspects of brachytherapy treatments. Whereas external radiotherapy is so called because the particle source lies outside the patient's body, brachytherapy is an internal radiotherapy in which the radiation sources are introduced into the patient.

Through an applicator inserted via the patient's natural body passages, a connecting tube is put in place to link the radioactive source projector to the tumor. The projector then literally sends a radioactive source into the heart of the tumor. Once activated (this is called afterloading), the source uniformly irradiates the nearby organs.

The main problem with this treatment is the uniformity of the radiation, which generally does not match the shape of a tumor. The solution considered in this research project is to adapt the modulation used in external techniques to brachytherapy. A form of modulated brachytherapy already exists: catheters are placed in parallel, with radiation sources that are uniform but of different dosages. My project is to modulate a single radiation source with equipment inspired by external radiotherapy.




In this project, we wish to study the algorithmic aspects of various potential devices that could provide intensity modulation for brachytherapy. The goal is to propose a configurable metal "shield" that partially blocks the radiation source. For now, I focus on one particular type of device, which is the subject of a patent application but has not yet been built. It is a metal cylinder surrounding the radiation source and cut into sectors, each of which can be retracted independently (thus allowing irradiation of the corresponding area).

Given the physical constraints inherent to the materials, it is a priori not possible to build sectors as small (angle beta), and therefore as precise (angle alpha), as desired. Consequently, using the ability of the device to rotate, the goal is to find a sequence of leaf configurations that approximates the desired dose as closely as possible. To my knowledge, these aspects are so far completely unexplored, and the goal is to carry out an algorithmic study that guides the design of the final device. This work could be done in interaction with the Radiotherapy group of the Institut Bergonié, and more precisely with Dr Laurence Thomas.

Supervision

2008–2011 Co-supervisor (50%), with Stéphane Vialette, of the PhD of Florian Sikora on biological networks (defended on 30 September 2011), LIGM, Université Paris-Est, MENRT grant. He is now Maître de conférences at LAMSADE, Université Paris-Dauphine.

2011–present Co-supervisor (50%), with Stéphane Vialette, of the PhD of Paul Morel on proton therapy, LIGM, Université Paris-Est, ANR grant.

2012 or 2013–present Planned co-supervision (50%), with Gregory Kucherov, of a PhD on next-generation sequencing, LIGM, Université Paris-Est, Investissements d'Avenir grant.

Research responsibilities and leadership

2012–present Holder of the PES, ranked MCF - "A".
As stated in the information note on the 2012 PES evaluation for section 27, written by Olivier Roux (chair of the committee), the selection rate for the MCF - "A" class is 11.2% in my discipline (section 27), which is one of the largest, with 477 applications this year.

2006–2010 Organizer in charge of the LIGM weekly seminar.




2006–present Head of the LIGM bibliography.
To make it easier to keep the laboratory's bibliography up to date, I developed and deployed a software solution that lets every member of the laboratory manage their own bibliography. Among other things, this tool automatically updates the web pages listing the publications of LIGM researchers.

2008–present Hosting of visiting researchers.
Sylvie Hamel, Montréal, Canada (1 month in 2008); Jens Stoye, Bielefeld, Germany (1 month in 2008); Romeo Rizzi, Udine, Italy (1 month in 2009); Minghui Jiang, Utah, USA (1 month in 2011); Danny Hermelin, Saarbrücken, Germany (2 weeks in 2010); Xiao Yang, Iowa, USA (1 month in 2010); Riccardo Dondi, Milano, Italy (1 week in 2011); Sylvie Hamel, Montréal, Canada (1 week in 2011); Xiaodong Wu, Iowa, USA (1 month in 2012); and Romeo Rizzi, Verona, Italy (1 month in 2013).

2010–present Head of the bibliography of the Université de Marne-la-vallée.
Building on the experience gained in my laboratory, I was asked to propose a solution at the scale of the University. I studied the feasibility of setting up a HAL portal and the specific needs of our university. This project, which I have led from the start, resulted in the portal http://hal-univ-mlv.archives-ouvertes.fr, for which I am currently responsible. Recently, I initiated the merging of our portal with that of the Université Paris-Est Créteil (which belongs to the same PRES), http://hal-upec-upem.archives-ouvertes.fr, as well as training sessions for the designated researchers of each research unit affiliated with the Université de Marne-la-vallée or the Université Paris-Est Créteil.

2009–present Head of the LIGM website.
I initiated the redesign of the LIGM website, in which the parts related to the bibliography and to the laboratory members are generated automatically from the HAL and CNRS Labintel databases.

2010–present Elected member of the Laboratory Council.

2010–present Elected member of the Standing Committee (Comité Permanent).
The standing committee is in charge of recruiting and ranking ATER applications, reviewing requests for tenure and promotion, and composing the selection committees for faculty positions related to the LIGM.

Participation in projects

2013–2016 ANR Programme blanc, "COLIB'READS Calling biological information from raw reads".

2012–2013 PEPS CNRS, "Propriété des 1 consécutifs, nouvelles extensions, approximations et applications" (consecutive ones property: new extensions, approximations and applications).

2011–2015 ANR Jeunes Chercheurs, BIRDS (BIological networks, RaDiotherapy and Structures), as coordinator.

2010–2011 PEPS CNRS, "Traduction Automatique et Génomique Comparative" (machine translation and comparative genomics).

2006–2010 ANR Programme blanc, "BRASERO Biologically Relevant Algorithms and Softwares for Efficient RNA Structure Comparison".

2005 PAI, French-Italian Galileo project n° 08484VH.

2005–2006 CPCFQ, bilateral French-Québec project of the Commission Permanente de Coopération Franco-Québécoise on "Structures conservées et duplications pour les réarrangements génomiques".

2005–2008 Action Concertée Incitative, "Nouvelles Interfaces des Mathématiques pi-vert".

2004–2007 Working group "ARENA", founded by the ACI IMPBio.

2003–2004 Action Spécifique CNRS - Département STIC, "Nouveaux modèles et algorithmes de graphes pour la biologie".

2003–2006 Action Concertée Incitative, "Masse de Données NavGraphe".

Community service

2008–2013 Program committees
International Workshop On Combinatorial Algorithms (IWOCA), 2013, Rouen, France.
Workshop on Algorithms in Bioinformatics (WABI), 2012, Ljubljana, Slovenia.
Research on Computational Molecular Biology - Comparative Genomics (RECOMB-CG), 2010, Ottawa, Ontario, Canada.
RECOMB-CG, 2009, Budapest, Hungary.
RECOMB-CG, 2008, Paris, France.

2007–2012 Organizing committees
SPIRE 2012 Workshop on the Algorithmic Analysis of Biological Data (WAABD), 2012, Cartagena de Indias, Colombia.
Research on Computational Molecular Biology - Comparative Genomics (RECOMB-CG), 2008, Paris, France.
Joint days of the "Analyse de séquences" group of the GDR Bio-Informatique Moléculaire and the "Combinatoire des mots, algorithmique du texte et du génome" group of the GDR Informatique Mathématique (SEQBIO), 2012, Marne-la-vallée, France.
SEQBIO, 2011, Lille, France.
SEQBIO, 2011, Rennes, France.
SEQBIO, 2007, Marne-la-vallée, France.

2004–2013 Reviewing
For the following journals: Algorithmica, Discrete Mathematics, Theoretical Computer Science, Journal of Computational Biology, Information Processing Letters, International Journal of Data Mining and Bioinformatics, Bioinformatics, Advances and Applications in Bioinformatics and Chemistry, Journal of Discrete Algorithms, BMC Bioinformatics, IEEE/ACM Transactions on Computational Biology and Bioinformatics, Journal of Combinatorial Optimization, Engineering Applications of Artificial Intelligence; and for the following conferences: CPM, WABI, SPIRE, STACS, PSB, RECOMB-CG, IWOCA, RECOMB, DLT.

2009–2013 Selection committees
Ecole d'ingénieurs Polytech Paris Sud, 2013.
LABRI - Université de Bordeaux, 2012.
LIGM - IUT de l'Université de Marne-la-Vallée, 2011.
LRI - Faculté d'Orsay, 2011.
LRI - IUT d'Orsay, 2010.
LIGM - Université de Marne-la-Vallée, 2009.
LRI - IUT d'Orsay, 2009.
LIFL - Faculté des sciences de Lille 1, 2009.




Publications

Chapters in edited books.

2011 * Guillaume Blin, Maxime Crochemore, Stéphane Vialette. Algorithmic Aspects of Arc-Annotated Sequences. Elloumi Mourad, Zomaya Albert Y. Algorithms in Computational Molecular Biology: Techniques, Approaches and Applications, Wiley, pp. 113-126, Feb. 2011

Articles published in peer-reviewed international journals.

2012 * Guillaume Blin, Romeo Rizzi, Stéphane Vialette. A faster algorithm for finding minimum Tucker submatrices. Theory of Computing Systems, 2012, 10 pp.

* Guillaume Blin, Paola Bonizzoni, Riccardo Dondi, Florian Sikora. On the Parameterized Complexity of the Repetition Free Longest Common Subsequence Problem. Information Processing Letters, 2012, 9 pp.

2011 * Guillaume Blin, Maxime Crochemore, Sylvie Hamel, Stéphane Vialette. Median of an odd number of permutations. Pure Mathematics and Applications, 2011, 21 (2), pp. 161-175

* Guillaume Blin, Romeo Rizzi, Florian Sikora, Stéphane Vialette. Minimum Mosaic Inference of a Set of Recombinants. International Journal of Foundations of Computer Science, 2011, pp. 17

2010 * Guillaume Blin, Florian Sikora, Stéphane Vialette. Querying Graphs in Protein-Protein Interactions Networks using Feedback Vertex Set. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2010, 7 (4), pp. 628-635

* Guillaume Blin, David Célestin Faye, Jens Stoye. Finding Nested Common Intervals Efficiently. Journal of Computational Biology, 2010, 17 (9), pp. 1183-1194

* Guillaume Blin, Alain Denise, Serge Dulucq, Claire Herrbach, Hélène Touzet. Alignments of RNA structures. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2010, 7 (2), pp. 309-322. <http://dx.doi.org/10.1109/TCBB.2008.28>

2008 * Guillaume Blin, Guillaume Fertin, Danny Hermelin, Stéphane Vialette. Fixed-Parameter Algorithms For Protein Similarity Search Under mRNA Structure Constraints. Journal of Discrete Algorithms, 2008, 6 (4), pp. 618-626

2007 * Guillaume Blin, Guillaume Fertin, Stéphane Vialette. Extracting Constrained 2-Interval Subsets in 2-Interval Sets. Theoretical Computer Science, 2007, 385 (1-3), pp. 241-263

* Guillaume Blin, Cedric Chauve, Guillaume Fertin, Romeo Rizzi, Stéphane Vialette. Comparing Genomes with Duplications: a Computational Complexity Point of View. ACM/IEEE Trans. Computational Biology and Bioinformatics, 2007, 4 (4), pp. 523-534

* Guillaume Blin, Eric Blais, Danny Hermelin, Pierre Guillon, Mathieu Blanchette, Nadia El-Mabrouk. Gene Maps Linearization using Genomic Rearrangement Distances. Journal of Computational Biology, 2007, 14 (4), pp. 394-407

2005 * Guillaume Blin, Guillaume Fertin, Romeo Rizzi, Stéphane Vialette. What makes the Arc-Preserving Subsequence problem hard? LNCS Transactions on Computational Systems Biology, 2005, 2, pp. 1-36




Papers at international conferences with a selection committee, acceptance rate < 40%.

2012 * Guillaume Blin, Minghui Jiang, Stéphane Vialette. The longest common subsequence problem with crossing-free arc-annotated sequences. L. Calderon-Benavides et al. 19th edition of the International Symposium on String Processing and Information Retrieval (SPIRE 2012), Oct 2012, Cartagena de Indias, Colombia. Springer, Heidelberg, 7608, pp. 130-142, LNCS (acceptance rate 32%)

* Guillaume Blin, Paola Bonizzoni, Riccardo Dondi, Romeo Rizzi, Florian Sikora. Complexity Insights of the Minimum Duplication Problem. 38th International Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM 2012), Jan 2012, Špindleruv Mlýn, Czech Republic. 19 pp. (acceptance rate 35.5%)

2011 * Guillaume Blin, Guillaume Fertin, Hafedh Mohamed-Babou, Irena Rusu, Florian Sikora, Stéphane Vialette. Algorithmic Aspects of Heterogeneous Biological Networks Comparison. 5th International Conference on Combinatorial Optimization and Applications (COCOA 2011), 2011, Zhangjiajie, China. Springer, pp. 272-286, Lecture Notes in Computer Science (acceptance rate 39.7%)

* Guillaume Blin, Romeo Rizzi, Stéphane Vialette. A Polynomial-Time Algorithm for Finding Minimal Conflicting Sets. 6th International Computer Science Symposium in Russia (CSR'11), 2011, St Petersburg, Russian Federation. 6651, pp. 373-384, Lecture Notes in Computer Science (acceptance rate 38.2%)

2010 * Guillaume Blin, Sylvie Hamel, Stéphane Vialette. Comparing RNA structures with biologically relevant operations cannot be done without strong combinatorial restrictions. Rahman Md. S. and Fujita S. 4th Workshop on Algorithms and Computation (WALCOM'10), Feb 2010, Dhaka, Bangladesh. Springer-Verlag, 5942, pp. 149-160, Lecture Notes in Computer Science (acceptance rate 38.3%)

* Guillaume Blin, Romeo Rizzi, Stéphane Vialette. A faster algorithm for finding minimum Tucker submatrices. 6th Computability in Europe (CiE'10), 2010, Portugal. Springer, 6158, pp. 69-77, Lecture Notes in Computer Science (acceptance rate 31%)

2009 * Guillaume Blin, Guillaume Fertin, Florian Sikora, Stéphane Vialette. The Exemplar Breakpoint Distance for non-trivial genomes cannot be approximated. Das S. and Uehara R. Proc. 3rd Workshop on Algorithms and Computation (WALCOM 2009), 2009, Kolkata, India. Springer-Verlag, pp. 357-368, Lecture Notes in Computer Science (LNCS) (acceptance rate 29.4%)

2007 * Guillaume Blin, Guillaume Fertin, Gaël Herry, Stéphane Vialette. Comparing RNA Structures: Towards an Intermediate Model Between the EDIT and the LAPCS Problems. Sagot Marie-France and Telles Walter Maria Emilia. Brazilian Symposium on Bioinformatics (BSB 2007), Aug 2007, Angra dos Reis, Brazil. Springer-Verlag, pp. 101-112, Lecture Notes in BioInformatics (LNBI) (acceptance rate 31.7%)

2006 * Guillaume Blin, Helene Touzet. How to compare arc-annotated sequences: The alignment hierarchy. Crestani Fabio and Ferragina Paolo and Sanderson Mark. 13th String Processing and Information Retrieval, Oct 2006, Glasgow, United Kingdom. Springer Verlag, 4209, pp. 291-303, Lecture Notes in Computer Sciences (acceptance rate 30.4%)

2005 * Guillaume Blin, Roméo Rizzi. Conserved Interval Distance Computation Between Non-trivial Genomes. Wang Lusheng. 11th Annual International Conference Computing and Combinatorics (COCOON'05), Aug 2005, Kunming, China. Springer-Verlag, 3595, pp. 22-31, LNCS (acceptance rate 27%)

* Guillaume Blin, Guillaume Fertin, Danny Hermelin, Stéphane Vialette. Fixed-Parameter Algorithms for Protein Similarity Search Under mRNA Structure Constraints. Kratsch Dieter. 31st International Workshop on Graph-Theoretic Concepts in Computer Science (WG'05), Jun 2005, Metz, France. Springer-Verlag, 3787, pp. 271-282, LNCS (acceptance rate 33%)




Papers at international conferences with a selection committee, acceptance rate > 40%.

2012 * Guillaume Blin, Laurent Bulteau, Minghui Jiang, Pedro J. Tejada, Stéphane Vialette. Hardness of longest common subsequence for sequences with bounded run-lengths. Juha Kärkkäinen and Jens Stoye. 23rd Annual Symposium on Combinatorial Pattern Matching (CPM'12), Jul 2012, Helsinki, Finland. Springer-Verlag, 11 pp, Lecture Notes in Computer Science (acceptance rate 55%)

* Xiao Yang, Florian Sikora, Guillaume Blin, Sylvie Hamel, Roméo Rizzi, Srinivas Aluru. An Algorithmic View on Multi-related-segments: a new unifying model for approximate common interval. Manindra Agrawal, S. Barry Cooper, and Angsheng Li. 9th annual conference on Theory and Applications of Models of Computation (TAMC), May 2012, Beijing, China. Springer-Verlag, 7287, 10 pp., LNCS (acceptance rate 46.5%)

2011 * Guillaume Blin, Romeo Rizzi, Florian Sikora, Stéphane Vialette. Minimum Mosaic Inference of a Set of Recombinants. Potanin Alex and Viglas Taso. 17th Computing: the Australasian Theory Symposium (CATS'11), Jan 2011, Perth, Australia. ACS, 119, pp. 23-30, CRPIT (acceptance rate 52.6%)

2010 * Guillaume Blin, Florian Sikora, Stéphane Vialette. GraMoFoNe: a Cytoscape plugin for querying motifs without topology in Protein-Protein Interactions networks. Hisham Al-Mubaid. Bioinformatics and Computational Biology (BICoB'10), Mar 2010, Honolulu, United States. pp. 38–43, International Society for Computers and their Applications (ISCA) (acceptance rate unknown)

* Olivier Curé, David Célestin Faye, Guillaume Blin. Towards a better insight of RDF triples Ontology-guided Storage system abilities. 6th International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS'10), 2010, Shanghai, China. 10 pp. (acceptance rate unknown)

2009 * Guillaume Blin, Jens Stoye. Finding Nested Common Intervals Efficiently. Ciccarelli Francesca D. and Miklós István. 7th RECOMB Satellite Workshop on Comparative Genomics (RECOMB-CG'09), Sep 2009, Budapest, Hungary. Springer-Verlag, 5817, pp. 59-69, Lecture Notes in Bioinformatics (acceptance rate 61.3%)

* Guillaume Blin, Florian Sikora, Stéphane Vialette. Querying Protein-Protein Interaction Networks. Mandoiu Ion and Narasimhan Giri and Zhang Yanqing. 5th International Symposium on Bioinformatics Research and Applications (ISBRA'09), May 2009, Fort Lauderdale, United States. Springer-Verlag, 5542, pp. 52-62, LNBI (acceptance rate 47.3%)

2007 * Guillaume Blin, Guillaume Fertin, Irena Rusu, Christine Sinoquet. Extending the Hardness of RNA Secondary Structure Comparison. Chen Bo and Paterson Mike and Zhang Guochuan. 1st International Symposium on Combinatorics, Algorithms, Probabilistic and Experimental Methodologies (ESCAPE 2007), Apr 2007, Hangzhou, China. Springer-Verlag, 4614, pp. 140-151, Lecture Notes in Computer Science (LNCS) (acceptance rate unknown)

2006 * Guillaume Blin, Eric Blais, Pierre Guillon, Mathieu Blanchette, Nadia El-Mabrouk. Inferring Gene Orders from Gene Maps using the Breakpoint Distance. Bourque Guillaume and El-Mabrouk Nadia. 4th Annual RECOMB Satellite Workshop on Comparative Genomics (RECOMB-CG'06), Sep 2006, Montréal, Canada. Springer-Verlag, 4205, pp. 99-112, LNBI (acceptance rate 50%)

* Guillaume Blin, Annie Chateau, Cedric Chauve, Yannick Gingras. Inferring Positional Homologs with Common Intervals of Sequences. Bourque Guillaume and El-Mabrouk Nadia. 4th Annual RECOMB Satellite Workshop on Comparative Genomics (RECOMB-CG'06), Sep 2006, Montréal, Canada. Springer-Verlag, 4205, pp. 24-38, LNBI (acceptance rate 50%)

2005 * Guillaume Blin, Guillaume Fertin, Romeo Rizzi, Stéphane Vialette. What Makes the Arc-Preserving Subsequence Problem Hard? S. Sunderam Vaidy and van Albada G. Dick and M. A. Sloot Peter and Dongarra Jack. 5th Int. Workshop on Bioinformatics Research and Applications (IWBRA'05), May 2005, Atlanta, GA, United States. Springer-Verlag, 3515, pp. 860-868, LNCS (acceptance rate unknown)

* Guillaume Blin, Cedric Chauve, Guillaume Fertin. Genes Order and Phylogenetic Reconstruction: Application to γ-Proteobacteria. McLysaght Aoife and H. Huson Daniel. 3rd Annual RECOMB Satellite Workshop on Comparative Genomics (RECOMB-CG'05), Sep 2005, Dublin, Ireland. Springer-Verlag, 3678, pp. 11-20, LNCS (acceptance rate 66.7%)

2004 * Guillaume Blin, Guillaume Fertin, Cedric Chauve. The breakpoint distance for signed sequences. 1st Conference on Algorithms and Computational Methods for biochemical and Evolutionary Networks (CompBioNets'04), Dec 2004, Recife, Brazil. King's College London publications, 3, pp. 3-16, Texts in Algorithms (acceptance rate unknown)

* Guillaume Blin, Guillaume Fertin, Stéphane Vialette. New Results for the 2-Interval Pattern Problem. Sahinalp Suleyman Cenk and Muthukrishnan S. and Dogrusoz Ugur. 15th Symposium on Combinatorial Pattern Matching (CPM'04), Jul 2004, Istanbul, Turkey. Springer-Verlag, 3109, pp. 311-322, LNCS (acceptance rate 45.6%)

Theses.

2012 * Guillaume Blin. Combinatorial Objects in Bio-Algorithmics: Related problems and complexities. Université Paris-Est, Jun. 2012. English

2005 * Guillaume Blin. Combinatoire and Bio-informatique : Comparaison de structures d'ARN et calcul de distances intergénomiques. Université de Nantes, Nov. 2005. French

Divers.

2012 ? David Célestin Faye, Olivier Curé, Guillaume Blin. A survey of RDF storage approaches.Revue Africaine de la Recherche en Informatique et Mathématiques Appliquées, 2012, 15 (1),pp. 25

2009 ? Guillaume Blin, Maxime Crochemore, Sylvie Hamel, Stéphane Vialette. Finding the medianof three permutations under the Kendall-tau distance. 7th annual international conference onPermutation Patterns, Jul 2009, Firenze, Italy.

En cours de soumission dans des revues d’audience internationale avec comité delecture.

2013 ? Guillaume Blin, Paola Bonizzoni, Riccardo Dondi, Romeo Rizzi, Florian Sikora. ComplexityInsights of the Minimum Duplication Problem.

2012 ? Xiao Yang, Florian Sikora, Guillaume Blin, Sylvie Hamel, Romeo Rizzi, Srinivas Aluru. Theintractability of gene losses in common interval models.

Letters of recommendation
– Letter from Etienne Duris, as Head of the Informatique et Réseaux (Computer Science and Networks) programme
– Letter from Gilles Roussel, as President of the Université de Marne-la-Vallée
– Letter from Christian Soize, as Vice-President for Research of the Université de Marne-la-Vallée
– Letter from Thierry Lecroq, as examiner on my HDR jury
– Letter from Alain Denise, as a University Professor with recognized expertise in the field of bioinformatics




Etienne Duris

Maître de Conférences
Laboratoire d'Informatique Gaspard-Monge
Head of the Informatique et Réseaux programme
Deputy Director, ESIPE-MLV
Université Paris-Est Marne-la-Vallée

Bâtiment Copernic, office 1B075
5, boulevard Descartes - Champs sur Marne
77454 Marne-la-Vallée Cedex 2
[email protected]
Tel: 01.60.95.74.20

Champs sur Marne, 4 December 2012

To whom it may concern,

Letter of recommendation for Guillaume BLIN

I have known Guillaume BLIN since he was appointed ATER (temporary teaching and research fellow) at Université Paris-Est Marne-la-Vallée, before he was recruited as Maître de Conférences. At that time I had just taken charge of the apprenticeship-based engineering programme in "Informatique et Réseaux" at ESIPE, the department (UFR) that today groups the university's 7 engineering programmes.

Through the teaching he has delivered, both at Master's level and in these engineering programmes, I have worked alongside him in his roles as tutorial instructor and course leader. He has always taken pedagogy very seriously, which is rather particular in the context of apprenticeship programmes, and has always been very available, whether for preparing course materials, supervising or assessing students, or revising course content.

Beyond these teaching aspects, he has always shown interest in and dedication to the "community", whether by running the research seminars of the Laboratoire d'Informatique Gaspard-Monge or through his investment in developing and deploying tools, such as the centralization of researchers' online bibliographies, or the centralization and automated online distribution of tutorial materials for the computer science courses of the university's various programmes.

Within the apprenticeship programme I am responsible for, he has for several years been a "tuteur enseignant" (academic tutor), that is, the school-side referent for several apprentice engineers who spend half of their 3-year training in a company. These duties range from visiting students in their companies, where one must be attentive to industrial concerns while also guaranteeing the training value of these periods for the apprentices, to taking part in and organizing various defences at the school, by way of individual follow-up meetings. His personal investment and the seriousness with which he has carried out these duties have never been found wanting.

Finally, since the start of the 2012 academic year, he has accepted responsibility for the third year of the engineering programme in "Informatique et Réseaux": in this capacity, he is in charge of changes to the teaching content, recruiting outside lecturers, building timetables and running assessment committees and, more generally, he is the main point of contact for everything concerning this year of the programme. In particular, this final year includes a cross-disciplinary project, called "The Last Project", which is sponsored by "clients" who may be teachers, researchers or industrial partners, and which involves a group of 5 to 6 students during their 6 months at the university: Guillaume manages the organization of these projects and their various stakeholders.

In all these activities, what stands out in Guillaume BLIN's personality and actions is his commitment, his skills and his team spirit, together with a concern for the quality of the service provided. Since he is moreover a very likeable person with whom it is a pleasure to work, I am convinced that he has the qualities required to be a Professor who will take his full place in research as in teaching, with a concern for the collective life of a team. I therefore recommend him without reservation for this position.

I remain at your disposal for any further information. Yours faithfully,

Etienne Duris

Laboratoire MSME
Université Paris-Est Marne-la-Vallée
5, boulevard Descartes – CHAMPS-SUR-MARNE – 77454 MARNE-LA-VALLEE CEDEX 2
Tel: 01 60 95 77 88 – Fax: 01 60 95 77 99
Modélisation et Simulation Multi Echelle MSME UMR 8208

______________________________________________________________________________________________

Letter of recommendation

As Vice-President for Research of Université Paris-Est Marne-la-Vallée from 2002 to early 2012, I witnessed the excellent commitment to the University of Guillaume BLIN, Maître de Conférences HDR and member of the Laboratoire d’Informatique Gaspard Monge (LIGM, UMR 8049). In particular, over the past two years I was able to appreciate his remarkable qualities during the deployment of the HAL (Open Archives) portal for the whole University. He was put in charge of the University's bibliography, of setting up the HAL portal and of taking responsibility for it. This deployment required, among other things, extending the capabilities of the HAL platform to integrate the university's specific needs, so as to cover all bibliometric aspects of the fields of "Science and Technology" and "Humanities and Social Sciences". For this difficult mission, Guillaume BLIN invested himself heavily and demonstrated excellence, competence, efficiency and rigour. He proposed a suitable, ergonomic and reliable solution, on schedule, while carrying through a project that involved many parties: the University's management, the developers and administrators of the national HAL platform, and the teaching and research staff of research laboratories in various disciplines.

Beyond these aspects, Guillaume BLIN is a teacher-researcher of great merit, with very great potential. I very strongly recommend him to the various committees that will examine his application, so that he may obtain a position as Professeur des Universités.

Marne-la-Vallée, 5 December 2012,

Christian Soize
Professeur des Universités
Director of the MSME laboratory, UMR 8208 CNRS
Vice-President for Research of UPEMLV from 2002 to early 2012

Prof. Thierry Lecroq
LITIS EA 4108
UFR des Sciences et des Techniques
Université de Rouen
76821 MONT-SAINT-AIGNAN Cedex
Tel: 02 35 14 65 81
Email: [email protected]

Mont-Saint-Aignan, 29 November 2012

Letter of recommendation

To whom it may concern

I first met Guillaume Blin on many occasions at algorithmics and bioinformatics conferences. He then invited me to sit on the jury of his Habilitation à Diriger des Recherches in June 2012. I was thus able to see that, without neglecting teaching activities (and not only in algorithmics) or administrative duties (for example by organizing the seminar and managing his laboratory's bibliographic database), he has created favourable conditions for continuing and developing bioinformatics activity at LIGM in Marne-la-Vallée. He has shown that he is able to diversify his research activities while continuing to obtain excellent complexity results and starting work related to radiotherapy as well as to new high-throughput sequencing technologies. He has just successfully organized the national days of the COMATEGE working group (Combinatoire des mots, algorithmique du texte et du génome) of the GDR Informatique Mathématique of CNRS, on 26 and 27 November 2012 in Marne-la-Vallée. I have moreover invited him to join the programme committee of the International Workshop on Combinatorial Algorithms (IWOCA) conference, which will be held in Rouen in July 2013. I am certain that Guillaume has all the qualities required to hold a position of Professeur des Universités.

Thierry Lecroq
Professeur des Universités
Université de Rouen

LRI, UMR CNRS 8623
Bât. 650
Université Paris-Sud
91405 Orsay cedex

IGM, UMR CNRS 8621
Bât. 400
Université Paris-Sud
91405 Orsay cedex

Alain Denise
Tel: 01.69.15.63.69
Fax: 01.69.15.65.86
[email protected]

Orsay, 12 March 2013

Letter of recommendation for the application of Guillaume Blin to a professorship at Université Bordeaux I.

I got to know Guillaume Blin while he was a PhD student in Nantes. He was then working on the complexity of the RNA structure comparison problem, a problem I was studying closely with Serge Dulucq and our student Claire Herrbach. Guillaume had proved that the edit problem for two RNA secondary structures is NP-complete. This problem had been left open by Jiang and his co-authors in 2002 in their article A General Edit Distance between RNA Structures. Guillaume's result impressed me by its technical depth. It was also important because it showed that the problem of editing RNA secondary structures (with biologically relevant operations) is not of the same nature as tree editing, which is polynomial, even though the objects themselves are of the same nature. Strategies other than editing therefore had to be explored if one wanted an exact polynomial algorithm for comparing RNA structures. This work led to very interesting discussions during our meetings, in particular within the AReNa working group dedicated to RNA bioinformatics. Since then I have followed Guillaume's career with interest and we have had other opportunities to talk. In particular, we have published together on RNA comparison, and we are currently collaborating on computing a median of partially ordered sets. This latter work is related to retrieving relevant information from biomedical databases.

Guillaume belongs to the small, and precious, community of researchers who work "upstream" of most bioinformaticians. He is interested in particular in the intrinsic complexity of the problems raised by biology, striving to understand in depth what makes these problems hard. This kind of work is important because it gives a solid theoretical foundation to bioinformatics problems and helps those who tackle them "pragmatically" to know how to attack them. However, Guillaume does not stop at these theoretical studies. He also goes as far as developing software based on his algorithms. This is the case, for example, of the Cytoscape plugin GraMoFoNe, for comparing biological networks, which was developed as part of the PhD thesis of Florian Sikora, which Guillaume co-supervised.

A reading of Guillaume's CV clearly shows not only the quality and productivity of his scientific work, but also the fact that he carries out all the duties of a teacher-researcher at professorial level. He has taken part in organizing national and international events. He has taken on substantial responsibilities within his university, in teaching (notably responsibility for the 3rd year of the ESIPE engineering programme and the setting up of a Socrates agreement) and in tasks of general interest (notably taking charge of the HAL portal and sitting on numerous councils and committees). I would add that he carries, in parallel, a teaching load of 220 hours per year at all levels and across a broad range of areas of computer science.

It is clear to me that Guillaume Blin is an excellent candidate for a professorship at Université Bordeaux I. That is why I support his application very strongly and very warmly.

Alain Denise
Professor of Computer Science
Université Paris-Sud 11.

Administrative documents
– Provisional certificate of the Habilitation à Diriger des Recherches diploma
– Signed attendance list of the jury members
– Defence report
– Report on the manuscript by Pr. Bernard Moret, University Professor at EPFL, Switzerland
– Report on the manuscript by Dr. Marie-France Sagot, INRIA Research Director at Université Claude Bernard, Lyon
– Report on the manuscript by Pr. Laurent Vuillon, University Professor at Université de Savoie
– Proof of identity




Sample works
– Book chapter on arc-annotated sequences: Guillaume Blin, Maxime Crochemore, Stéphane Vialette. Algorithmic Aspects of Arc-Annotated Sequences. Elloumi Mourad, Zomaya Albert Y.. Algorithms in Computational Molecular Biology: Techniques, Approaches and Applications, Wiley, pp. 113-126, Feb. 2011

– Article published in an international peer-reviewed journal on the consecutive ones property in binary matrices: Guillaume Blin, Romeo Rizzi, Stéphane Vialette. A faster algorithm for finding minimum Tucker submatrices. Theory of Computing Systems, 2012, 10 pp (to appear).

– Article published in an international peer-reviewed journal on motif search in PPI networks: Guillaume Blin, Florian Sikora, Stéphane Vialette. Querying Graphs in Protein-Protein Interactions Networks using Feedback Vertex Set. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2010, 7 (4), pp. 628-635

– Paper presented at an international peer-reviewed conference on the Cytoscape plugin that we produced: Guillaume Blin, Florian Sikora, Stéphane Vialette. GraMoFoNe: a Cytoscape plugin for querying motifs without topology in Protein-Protein Interactions networks. Hisham Al-Mubaid. Bioinformatics and Computational Biology (BICoB’10), Mar 2010, Honolulu, United States. pp. 38–43, International Society for Computers and their Applications (ISCA)




CHAPTER 1

ALGORITHMIC ASPECTS OF ARC-ANNOTATED SEQUENCES

1.1 INTRODUCTION

Structure comparison for RNA has become a central computational problem raising many challenging computer science questions. Indeed, RNA secondary structure comparison is essential for (i) identification of highly conserved structures during evolution (which cannot always be detected in the primary sequence, since it is often unpreserved), which suggest a significant common function for the studied RNA molecules, (ii) RNA classification of various species (phylogeny), (iii) RNA folding prediction by considering a set of already known secondary structures and (iv) identification of a consensus structure and consequently of a common role for molecules.

From an algorithmic point of view, RNA structure comparison was first considered in the framework of ordered trees [21]. More recently, it has also been considered in the framework of arc-annotated sequences [10]. An arc-annotated sequence is a pair (S, P) where S is a sequence of RNA bases and P represents hydrogen bonds between pairs of elements of S. From a purely combinatorial point of view, arc-annotated sequences are a natural extension of simple sequences. However, using arcs for modeling non-sequential information together with restrictions on the relative positioning of arcs allows for varying restrictions on the structure of arc-annotated sequences.

Different pattern matching and motif search problems have been considered in the context of arc-annotated sequences, among which we can mention the Longest Arc-Preserving Common Subsequence (LAPCS) problem, the Arc-Preserving Subsequence (APS) problem, the Maximum Arc-Preserving Common Subsequence (MAPCS) problem, and the Edit-distance for arc-annotated sequences (Edit) problem. This chapter is devoted to presenting algorithmic results for these arc-annotated problems.

This chapter is organized as follows. We present basic definitions in Section 1.2. Section 1.3 is devoted to the problem of finding a longest arc-preserving common subsequence (LAPCS) between two arc-annotated sequences, whereas we consider in Section 1.4 the restriction of deciding whether an arc-annotated sequence occurs in another arc-annotated sequence, the so-called arc-preserving subsequence (APS) problem. Section 1.5 is concerned with some variants of the longest arc-preserving common subsequence problem. Section 1.6 is devoted to computing the edit distance between two arc-annotated sequences.

1.2 PRELIMINARIES

Arc-annotated sequences

Given a finite alphabet Σ, an arc-annotated sequence is defined by a pair (S, P), where S is a string of Σ∗ and P is a set of arcs connecting pairs of characters of S. The set P is usually represented by a set of pairs of positions in S. Characters that are not incident to any arc are called free.

In the context of RNA structures, we have Σ = {A, C, G, U}, and S and P represent the nucleotide sequence and the hydrogen bonds of the RNA structure, respectively. Characters in S are thus often referred to as bases.

Relative positioning of arcs is of particular importance for arc-annotated sequences and is completely described by three binary relations. Let p1 = (i, j) and p2 = (k, l) be two arcs in P that do not share a vertex. Define

the precedence relation (<) – p1 < p2 if i < j < k < l

the embedding relation (⊏) – p1 ⊏ p2 if i < k < l < j

the crossing relation (≬) – p1 ≬ p2 if i < k < j < l

Using arcs for modeling non-sequential information together with these relations allows us to vary the restrictions on the complexity of arc-annotated sequences.
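As an illustrative sketch (not part of the original chapter), the three relations can be decided for a pair of arcs as follows. The function name `relation` and the string labels are hypothetical; arcs are assumed to be given as position pairs, and the label is reported regardless of which of the two arcs comes first.

```python
from typing import Optional, Tuple

Arc = Tuple[int, int]  # pair of positions in S


def relation(p1: Arc, p2: Arc) -> Optional[str]:
    """Name the relation holding between two arcs, in either direction.

    Returns None when the arcs share an endpoint, since the three
    relations are only defined for arcs that do not share a vertex.
    """
    a, b = tuple(sorted(p1)), tuple(sorted(p2))
    if len({*a, *b}) < 4:
        return None
    # Order the arcs by left endpoint so that i < k holds below.
    (i, j), (k, l) = sorted([a, b])
    if j < k:
        return "precedence"  # i < j < k < l
    if l < j:
        return "embedding"   # i < k < l < j
    return "crossing"        # i < k < j < l
```

For instance, `relation((1, 6), (2, 3))` falls in the embedding case because positions 2 and 3 both lie strictly between 1 and 6.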


Hierarchy

Five levels of arc structure were initially considered in the foundational work of Evans [9]:

Unlimited (Unlim) – no restriction at all,

Crossing (Cros) – there is no character incident to more than one arc,

Nested (Nest) – there is no character incident to more than one arc and no arcs are crossing,

Chain (Chain) – there is no character incident to more than one arc, no arcs are crossing and no arc is embedded into another, and

Plain – there is no arc.

The induced hierarchy is described by the following chain of inclusion:

Plain ⊂ Chain ⊂ Nested ⊂ Crossing ⊂ Unlimited.
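The five levels can be told apart mechanically. The sketch below is an illustration added here, not code from the chapter: the function name `level` is hypothetical, arcs are assumed to be pairs of positions, and the function returns the most restrictive level of the chain of inclusions that the arc set satisfies.

```python
from itertools import combinations
from typing import Iterable, Tuple

Arc = Tuple[int, int]


def level(arcs: Iterable[Arc]) -> str:
    """Return the most restrictive of Evans' five levels that `arcs` satisfies."""
    arcs = sorted(tuple(sorted(a)) for a in arcs)
    if not arcs:
        return "Plain"
    endpoints = [x for a in arcs for x in a]
    if len(endpoints) != len(set(endpoints)):
        # Some character is incident to more than one arc.
        return "Unlimited"
    crossing = embedding = False
    # Arcs are sorted by left endpoint, so i < k for every pair below.
    for (i, j), (k, l) in combinations(arcs, 2):
        if k < j < l:
            crossing = True
        elif l < j:
            embedding = True
    if crossing:
        return "Crossing"
    if embedding:
        return "Nested"
    return "Chain"
```

Because the levels are nested, an arc set reported as Chain is also a valid Nested, Crossing and Unlimited structure.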

Refined Hierarchy

In [13], Guignon et al. extended the above-mentioned hierarchy by introducing a new refinement of the Nested level called Stem: no character is incident to more than one arc, and given any two arcs, one is embedded into the other.

To provide a unified framework, and as a next step towards a better understanding of the inner complexity of the problems related to arc-annotated sequences, Blin et al. [4] proposed to further refine the hierarchy, following the example of Vialette [22, 23] in the context of 2-intervals (a simple abstract structure for modeling RNA secondary structures). The refinement consists in splitting those models of arc-annotated sequences into more precise relations between arcs, taking advantage of the combinatorics induced by the relations <, ⊏, and ≬.

Two arcs p1 and p2 are R-comparable for some R ∈ {<, ⊏, ≬} if p1 R p2 or p2 R p1. Let P be a set of arcs and ℛ be a non-empty subset of {<, ⊏, ≬}. The set P is said to be ℛ-comparable if any two distinct arcs of P are R-comparable for some R ∈ ℛ. An arc-annotated sequence (S, P) is said to be an ℛ-arc-annotated sequence for some non-empty subset ℛ ⊆ {<, ⊏, ≬} if P is ℛ-comparable. By abuse of notation, we will write ℛ = ∅ in case P = ∅.

As a straightforward illustration of the above definitions, most levels in the classical hierarchy can be expressed in terms of a combination of the three relations: Plain is fully described by ℛ = ∅, Chain is fully described by ℛ = {<}, Stem is fully described by ℛ = {⊏}, Nested is fully described by ℛ = {<, ⊏} and Crossing is fully described by ℛ = {<, ⊏, ≬}. The key point is to observe that this refinement allows us to consider new levels for arc-annotated sequences, namely ℛ = {≬}, ℛ = {<, ≬} and ℛ = {⊏, ≬}.
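To make the refined hierarchy concrete, the following sketch (an added illustration; the names `relations_used` and `is_comparable` and the string labels are hypothetical) collects the relations that actually occur between pairs of arcs. Under the definitions above, an arc set with no shared endpoint is comparable for a given non-empty set of relations exactly when every relation it uses belongs to that set.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Optional, Tuple

Arc = Tuple[int, int]


def relations_used(arcs: Iterable[Arc]) -> Optional[FrozenSet[str]]:
    """Relations occurring between pairs of arcs; None if two arcs share an endpoint."""
    arcs = sorted(tuple(sorted(a)) for a in arcs)
    endpoints = [x for a in arcs for x in a]
    if len(endpoints) != len(set(endpoints)):
        return None
    used = set()
    # Arcs are sorted by left endpoint, so i < k for every pair below.
    for (i, j), (k, l) in combinations(arcs, 2):
        if j < k:
            used.add("precedence")
        elif l < j:
            used.add("embedding")
        else:
            used.add("crossing")
    return frozenset(used)


def is_comparable(arcs: Iterable[Arc], allowed: Iterable[str]) -> bool:
    """True if every pair of arcs is related by one of the `allowed` relations."""
    used = relations_used(arcs)
    return used is not None and used <= frozenset(allowed)
```

With this reading, Stem corresponds to `is_comparable(arcs, {"embedding"})`, and the new level built on the crossing relation alone to `is_comparable(arcs, {"crossing"})`.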


Alignment

Given two sequences S and T on a common alphabet Σ, we define an alignment of S and T as a pair of sequences (S′, T′) built from S and T on Σ ∪ {−} (− is usually referred to as a gap) such that (i) |S′| = |T′|, (ii) for any 1 ≤ i ≤ |S′|, either S′[i] = T′[i] ≠ − or exactly one of S′[i] and T′[i] is a gap, and (iii) removing the gaps from S′ (resp. T′) yields S (resp. T).

Let (S′, T′) be an alignment of S and T. For any 1 ≤ i ≤ |S′| such that S′[i] ≠ −, character S′[i] is said to be aligned with character T′[i] if T′[i] ≠ −, and deleted otherwise. Similarly, for any 1 ≤ i ≤ |T′| such that T′[i] ≠ −, character T′[i] is said to be aligned with character S′[i] if S′[i] ≠ −, and inserted otherwise. An illustration is given in Figure 1.1.

Figure 1.1 Illustration of a) a sequence alignment leading to a common subsequence, which is “lgrtihm”, b) an arc-preserving alignment of two arc-annotated sequences and c) the resulting common arc-annotated subsequence

An alignment (S′, T′) of two arc-annotated sequences (S, P) and (T, Q) is arc-preserving if the arcs induced by (S′, T′) are preserved, i.e., the arcs induced by the aligned bases are preserved. In this context, the notion of common subsequence is extended by including the common arcs, that is, the arcs that have been preserved by the alignment.
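The conditions (i) to (iii) and the arc-preservation requirement can be checked directly. The sketch below is an added illustration with hypothetical names (`aligned_pairs`, `is_arc_preserving`); positions are 0-based, the gap is written "-", and arc preservation is read in the usual way for LAPCS, as an assumption: two aligned positions are joined by an arc in P if and only if their images are joined by an arc in Q.

```python
from typing import Iterable, List, Tuple

GAP = "-"
Arc = Tuple[int, int]


def aligned_pairs(s: str, t: str, s_al: str, t_al: str) -> List[Tuple[int, int]]:
    """Validate the alignment (s_al, t_al) of s and t; return aligned position pairs."""
    if len(s_al) != len(t_al):
        raise ValueError("condition (i): both rows must have the same length")
    if s_al.replace(GAP, "") != s or t_al.replace(GAP, "") != t:
        raise ValueError("condition (iii): removing gaps must give back S and T")
    pairs = []
    i = j = 0  # next unread positions in s and t
    for a, b in zip(s_al, t_al):
        if a != GAP and b != GAP:
            if a != b:
                raise ValueError("condition (ii): aligned characters must be equal")
            pairs.append((i, j))
        elif a == GAP and b == GAP:
            raise ValueError("condition (ii): a column cannot hold two gaps")
        if a != GAP:
            i += 1
        if b != GAP:
            j += 1
    return pairs


def is_arc_preserving(pairs: List[Tuple[int, int]],
                      p: Iterable[Arc], q: Iterable[Arc]) -> bool:
    """Assumed reading: (i1, i2) is in P exactly when (j1, j2) is in Q."""
    p_set = {tuple(sorted(a)) for a in p}
    q_set = {tuple(sorted(a)) for a in q}
    for x, (i1, j1) in enumerate(pairs):
        for i2, j2 in pairs[x + 1:]:
            if ((i1, i2) in p_set) != ((j1, j2) in q_set):
                return False
    return True
```

For example, aligning "ACGU" with "A-GU" deletes the C and aligns the three remaining bases; the arc between A and U is preserved when P = {(0, 3)} and Q = {(0, 2)}.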

Edit Operations

Following the example of stringology, when comparing two arc-annotated sequences (S, P) and (T, Q), instead of computing an alignment, one might consider a set of edit operations (together with their associated costs) that alter arc-annotated sequences, and seek a minimum-cost sequence of these operations that leads from (S, P) to (T, Q).

Formally, given a set of edit operations E and two arc-annotated sequences (S, P) and (T, Q), an edit-script from (S, P) to (T, Q) refers to a series of non-oriented operations of E transforming (S, P) into (T, Q). The cost of an edit-script from (S, P) to (T, Q), denoted cost((S, P), (T, Q), E), is the sum of the costs of all operations involved in the edit-script. The edit-distance between (S, P) and (T, Q) is the minimum cost of an edit-script from (S, P) to (T, Q).

The classical approach is to consider a subset of the operations introduced in [15], which can be divided into two groups:

Substitution operations, inducing renaming of characters in the arc-annotated sequence:

match (wm : Σ → ℝ)

mismatch (wm : Σ → ℝ)

arc-match (wam : Σ⁴ → ℝ)

arc-mismatch (wam : Σ⁴ → ℝ)

Deletion operations, inducing deletion of characters and/or of arcs in the arc-annotated sequence:

deletion (wd : Σ → ℝ)

arc-breaking (wb : Σ⁴ → ℝ)

arc-removing (wr : Σ² → ℝ)

arc-altering (wa : Σ³ → ℝ)

1.3 LONGEST ARC-PRESERVING COMMON SUBSEQUENCE

Definition

The LAPCS problem was introduced by Evans [9] and is defined as follows: given two arc-annotated sequences (S, P) and (T, Q), find an arc-preserving common subsequence of maximal length. The computational complexity of the LAPCS problem has been studied in [9, 10, 17, 18, 14, 7], and the main results are summarized in Tables 1.1, 1.2 and 1.3.

In the sequel, we use the notation LAPCS(A,B) to represent the LAPCSproblem where the arc structure of S (resp. T ) – namely P (resp. Q) – is oflevel A (resp. B).
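For very small instances, the definition can be turned into an exhaustive search. The sketch below is an added illustration only: it is exponential in the worst case and is not one of the algorithms cited in this section. The name `lapcs_bruteforce` is hypothetical, positions are 0-based, and arc preservation is assumed to mean that two matched positions are joined by an arc in P if and only if their images are joined by an arc in Q.

```python
from typing import Iterable, List, Tuple

Arc = Tuple[int, int]


def lapcs_bruteforce(s: str, p: Iterable[Arc], t: str, q: Iterable[Arc]) -> int:
    """Length of a longest arc-preserving common subsequence (exhaustive search)."""
    p_set = {tuple(sorted(a)) for a in p}
    q_set = {tuple(sorted(a)) for a in q}
    best = 0

    def extend(i: int, j: int, chosen: List[Tuple[int, int]]) -> None:
        nonlocal best
        best = max(best, len(chosen))
        for a in range(i, len(s)):
            for b in range(j, len(t)):
                if s[a] != t[b]:
                    continue
                # The new pair must agree on arcs with every pair chosen so far.
                if all(((x, a) in p_set) == ((y, b) in q_set) for x, y in chosen):
                    chosen.append((a, b))
                    extend(a + 1, b + 1, chosen)
                    chosen.pop()

    extend(0, 0, [])
    return best
```

When both arc sets are empty, the search reduces to the classical longest common subsequence problem.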

Classical complexity

In [9], Evans proved that LAPCS(Chain,Chain) is polynomial-time solvable, whereas both LAPCS(Unlimited,Plain) and LAPCS(Crossing,Plain) are NP-complete (reductions from Independent Set). In [18], Lin et al. proved that LAPCS(Nested,Nested) is NP-complete (reduction from Independent Set). Complementing these results, Jiang et al. [17] designed an O(nm³) time algorithm for LAPCS(Nested,Chain) and LAPCS(Chain,Chain). Recently, Blin et al. [7] proved that LAPCS(Stem,Stem) is NP-complete (reduction from 3-SAT).


A × B                            LAPCS
Stem × Stem                      NP-complete – Blin et al. [7]
Chain × Chain, Nest × Chain      O(nm³) – Jiang et al. [17]
Nest × Nest                      NP-complete even for unary, c-fragment (with c > 2) and c-diagonal (with c > 1) – Jiang [18]
Cros × Chain, Cros × Nest        NP-complete – Evans [9]
Cros × Cros                      NP-complete – Evans [9], but polynomial-time solvable for 1-fragment LAPCS(Crossing,Crossing) and 0-diagonal LAPCS(Crossing,Crossing) [18]
Unlim × Chain, Unlim × Nest,     NP-complete – Evans [9]
Unlim × Cros, Unlim × Unlim

Table 1.1 LAPCS classical complexity with n = |S| and m = |T|

Lin et al. further investigated LAPCS(Nested,Nested) by studying restricted cases, namely c-fragment, c-diagonal and unary LAPCS(Nested,Nested). Given two arc-annotated sequences which are divided into fragments of length exactly c (the last fragment can have a length less than c), the c-fragment LAPCS problem with c ≥ 1 is defined as the classical LAPCS problem with the extra constraint that the allowed matches are those between fragments at the same location [14]. The c-diagonal LAPCS problem with c ≥ 0 is an extension of c-fragment LAPCS, where character S[i] is only allowed to match a character in the range T[i − c, i + c]. Lin et al. [18] showed the NP-hardness of the c-fragment (with c > 2) and c-diagonal (with c > 1) LAPCS(Nested,Nested) problem. They also proved that the 1-fragment LAPCS(Crossing,Crossing) and 0-diagonal LAPCS(Crossing,Crossing) problems are solvable in O(n) time.
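The two restrictions only change which pairs of positions may be matched. As a small added illustration (the function names are hypothetical and positions are taken to be 0-based, which is an assumption about indexing), the allowed matches can be expressed as:

```python
def fragment_allowed(i: int, j: int, c: int) -> bool:
    """c-fragment (c >= 1): S[i] and T[j] must lie in fragments at the same location."""
    return i // c == j // c


def diagonal_allowed(i: int, j: int, c: int) -> bool:
    """c-diagonal (c >= 0): S[i] may only match a character in T[i - c .. i + c]."""
    return abs(i - j) <= c
```

With these definitions, the 1-fragment and 0-diagonal cases both only allow S[i] to match T[i].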

Parameterized complexity

Considering the parameter l as being the desired length of the common subsequence, Evans [9], by using one of the above-mentioned reductions for LAPCS(Unlimited,Plain) and by providing a reduction from Clique to LAPCS(Crossing,Crossing), proved that both LAPCS(Unlimited,Plain) and LAPCS(Crossing,Crossing) are W[1]-complete when parameterized by l. Moreover, Evans proved in [10] that whereas LAPCS(Crossing,Crossing) is W[1]-complete, the problem becomes fixed-parameter tractable when parameterized by the arc cutwidth. The arc cutwidth [10] of an arc-annotated sequence is defined as the maximal number of arcs that cross or end at any position of the sequence. If both sequences have their cutwidth bounded by some k, the problem, as shown by Evans, can be solved in O(9^k nm) time, where |S| = n and |T| = m. Evans also investigated the parameterized complexity of the problem considering two other parameters: the bandwidth and the nesting depth. The bandwidth d of an arc-annotated sequence (S, P) is defined by max over (i, j) ∈ P of |j − i|, and its nesting depth s is equal to max{|P′|}, where P′ ⊆ P is such that, for all (i, j) ∈ P′, there does not exist (k, l) ∈ P′ with i < k < j < l or i < j < k < l. Evans showed that, if both sequences have their nesting depth bounded by some s, LAPCS(Nested,Nested) can be solved in O(s² 4^s nm) time, where |S| = n and |T| = m. In case the arcs do not share endpoints, both cutwidth and nesting depth are never more than the bandwidth. Thus, Evans was able to extend the previously mentioned results to the parameter d. Finally, one has to observe that if the complexity of the arc structure is bounded by a logarithm of the maximal sequence length n, LAPCS can be solved in O(n²m) time even for Crossing type arc structures. Moreover, since the cutwidth is equal to 1 in the case of LAPCS(Chain,Chain), one can use the algorithm for LAPCS(Crossing,Crossing) to solve this problem in O(nm) time.
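The three structural parameters can be computed directly from the arc set. The sketch below is an added illustration with hypothetical function names; arcs are assumed to be pairs (i, j) with i < j over 0-based positions. Two readings are assumptions: an arc is counted in the cutwidth at a position when it starts at, passes over, or ends at that position, and the nesting depth is computed as the size of a largest set of arcs that are pairwise embedded in one another.

```python
from typing import Dict, Iterable, Tuple

Arc = Tuple[int, int]


def cutwidth(n: int, arcs: Iterable[Arc]) -> int:
    """Maximum, over positions 0..n-1, of the number of arcs touching or spanning it."""
    arcs = list(arcs)
    return max((sum(1 for i, j in arcs if i <= x <= j) for x in range(n)), default=0)


def bandwidth(arcs: Iterable[Arc]) -> int:
    """Largest distance |j - i| between the two endpoints of an arc."""
    return max((j - i for i, j in arcs), default=0)


def nesting_depth(arcs: Iterable[Arc]) -> int:
    """Size of a largest set of pairwise embedded arcs (assumed reading)."""
    depth: Dict[Arc, int] = {}
    # Process shorter arcs first: any arc strictly inside another is shorter.
    for i, j in sorted(set(arcs), key=lambda a: a[1] - a[0]):
        inner = [d for (k, l), d in depth.items() if i < k and l < j]
        depth[(i, j)] = 1 + max(inner, default=0)
    return max(depth.values(), default=0)
```

On the stem {(0, 5), (1, 4), (2, 3)} over a sequence of length 6, this gives a cutwidth of 3, a bandwidth of 5 and a nesting depth of 3, while a chain such as {(0, 1), (2, 3)} has all three parameters equal to 1.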

Considering LAPCS(Nested,Nested), Alber et al. [1] designed an algorithm which determines in O(3.31^(k1+k2) n) time whether an arc-preserving common subsequence can be obtained by deleting (together with incident arcs) k1 characters from S and k2 from T, thereby proving that LAPCS(Nested,Nested) is fixed-parameter tractable when parameterized by the number of deletions. Finally, Alber et al. [1] showed that c-fragment LAPCS(Crossing,Crossing) and c-diagonal LAPCS(Crossing,Crossing) parameterized by the length l of the desired common subsequence are solvable in O((B + 1)^l B + c³n) time, with B = c² + 2c − 1 and B = 2c² + 7c + 2, respectively.

Approximability

Jiang et al. in [16] proved that LAPCS(Crossing,Crossing) admits a simple 2-approximation algorithm running in O(nm) time, whereas LAPCS(Unlimited,Plain) cannot be approximated within ratio n^ε for any ε ∈ (0, 1/4), where n denotes the length of the longest input sequence. In the same paper, they proved that LAPCS(Crossing,Plain) is MaxSNP-hard, thereby excluding a polynomial-time approximation scheme (PTAS). Jiang et al. [18] proved that both c-fragmented and c-diagonal LAPCS(Nested,Nested) have a PTAS. They also give a 4/3-approximation algorithm for the unary LAPCS(Nested,Nested) problem.


A × B                            LAPCS
Stem × Stem                      FPT when parameterized by the number of deletions – Alber et al. [1]
Nest × Chain, Nest × Nest        FPT when parameterized by the bandwidth or the nesting depth – Evans [9], FPT when parameterized by the number of deletions – Alber et al. [1]
Cros × Chain, Cros × Nest        FPT when parameterized by the bandwidth or the cutwidth – Evans [9], Jiang et al. [16]
Cros × Cros                      W[1]-complete and FPT when parameterized by the bandwidth or the cutwidth – Evans [9], FPT when parameterized by the desired common subsequence length – Alber et al. [1]
Unlim × Chain, Unlim × Nest,     W[1]-complete – Evans [9]
Unlim × Cros, Unlim × Unlim

Table 1.2 LAPCS parameterized complexity with n = |S| and m = |T|.

A × B                            LAPCS
Nest × Chain, Nest × Nest        2-approximable – Jiang et al. [16], PTAS for c-fragmented and c-diagonal cases [18]
Cros × Chain, Cros × Nest,       MaxSNP-hard, 2-approximable – Jiang et al. [16]
Cros × Cros
Unlim × Chain, Unlim × Nest,     Cannot be approximated within ratio n^ε for any ε ∈ (0, 1/4) – [16]
Unlim × Cros, Unlim × Unlim

Table 1.3 LAPCS approximability.


1.4 ARC-PRESERVING SUBSEQUENCE

Definition

The APS problem is a decision problem derived from LAPCS. Given two arc-annotated sequences (S, P) and (T, Q), the APS problem asks whether (T, Q) is the LAPCS of (S, P) and (T, Q), i.e., whether (T, Q) is an arc-preserving subsequence of (S, P). The computational complexity of the APS problem has been studied in [9, 11, 12, 14, 5, 4], and the main results are summarized in Tables 1.4 and 1.5.

In the following, we use the notation APS(A,B) to represent the APSproblem where the arc structure of S (resp. T ) – namely P (resp. Q) – is oflevel A (resp. B).
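The decision problem can be illustrated by a simple backtracking search that tries to embed T into S. This is an added sketch for tiny inputs only, not one of the cited algorithms: the name `is_aps` is hypothetical, positions are 0-based, and it assumes that deleting a character also deletes its incident arcs, so that two kept positions of S must be joined by an arc in P exactly when the corresponding positions of T are joined by an arc in Q.

```python
from typing import Iterable, List, Tuple

Arc = Tuple[int, int]


def is_aps(s: str, p: Iterable[Arc], t: str, q: Iterable[Arc]) -> bool:
    """Decide by backtracking whether (t, q) is an arc-preserving subsequence of (s, p)."""
    p_set = {tuple(sorted(a)) for a in p}
    q_set = {tuple(sorted(a)) for a in q}

    def place(b: int, start: int, chosen: List[int]) -> bool:
        # chosen[y] is the position of s matched with t[y], for every y < b.
        if b == len(t):
            return True
        remaining = len(t) - b
        for a in range(start, len(s) - remaining + 1):
            if s[a] != t[b]:
                continue
            if all(((chosen[y], a) in p_set) == ((y, b) in q_set) for y in range(b)):
                chosen.append(a)
                if place(b + 1, a + 1, chosen):
                    return True
                chosen.pop()
        return False

    return place(0, 0, [])
```

For example, "AU" with an arc between its two bases occurs in "ACGU" with P = {(0, 3)}, whereas "AU" without any arc does not, since the arc joining A and U in S would not be preserved.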


Guo proved in [14] that the APS(Crossing,Chain) problem is NP-hard. Guo et al. observed in [11, 12] that the NP-completeness of APS(Crossing,Crossing) and APS(Unlimited,Plain) easily follows from Evans’ work on LAPCS [9]. Furthermore, they gave an O(nm) time algorithm for the APS(Nested,Nested) problem. This algorithm can be applied to easier problems such as APS(Nested,Chain), APS(Nested,Plain), APS(Chain,Chain) and APS(Chain,Plain). Guo et al. also mentioned in [11, 12] that APS(Chain,Plain) can be solved in O(n + m) time. Finally, Blin et al. [5, 4] proved that APS(Crossing,Plain) is NP-complete.

Classical complexity for the refined hierarchy

In analyzing the computational complexity of a problem, we are often trying to define the precise boundary between the polynomial and the NP-complete cases. Therefore, as another step towards establishing the precise complexity landscape of the APS problem, it is of particular interest to refine the classical complexity levels of the APS problem to precisely define what makes the problem hard. To this aim, Blin et al. [5, 4] have used the framework introduced by Vialette [23] in the context of 2-intervals. As a consequence, the number of complexity levels rises from 4 (not taking into account the Unlimited case) to 8.

On the positive side, Gramm et al. have shown that APS(Nested, Nested) is solvable in O(nm) time [11, 12]. Another way of stating this result is to say that APS({<, ⊏}, {<, ⊏}) is solvable in O(nm) time. According to the properties of the refined hierarchy, that result may be summarized by saying that APS(R1, R2) is polynomial-time solvable for any compatible R1 and R2 such that ≬ ∉ R1 and ≬ ∉ R2.

Conversely, the NP-completeness of APS(Crossing, Crossing) has been proved by Evans [9]. A simple reading shows that her proof is actually concerned with {<, ⊏, ≬}-arc-annotated sequences, and hence actually proves that APS({<, ⊏, ≬}, {<, ⊏, ≬}) is NP-complete. Similarly, in proving that APS(Crossing, Chain) is NP-complete [14], Guo actually proved that APS({<, ⊏, ≬}, {<}) is NP-complete. Therefore, both APS({<, ⊏, ≬}, {<, ⊏}) and APS({<, ⊏, ≬}, {<, ≬}) are NP-complete.

A × B           APS
Chain × Chain   O(nm) – Gramm et al. [11, 12]
Nest × Chain    O(nm) – Gramm et al. [11, 12]
Nest × Nest     O(nm) – Gramm et al. [11, 12]
Cros × Plain    NP-complete – Blin et al. [5, 4]
Cros × Chain    NP-complete – Gramm et al. [11, 12]
Cros × Nest     NP-complete – Gramm et al. [11, 12]
Cros × Cros     NP-complete – Evans [9]
Unlim × Chain   NP-complete – Evans [9]
Unlim × Nest    NP-complete – Evans [9]
Unlim × Cros    NP-complete – Evans [9]
Unlim × Unlim   NP-complete – Evans [9]

Table 1.4 APS classical complexity with n = |S| and m = |T|.

In [5, 4], Blin et al. proved that both APS({⊏, ≬}, ∅) and APS({<, ≬}, ∅) are NP-complete. They also gave a polynomial-time algorithm showing that both the APS({≬}, {≬}) and APS({≬}, ∅) problems can be solved in O(nm²) time. In other words, they proved that the relation ≬ alone does not imply hardness.

Open problems

The refinement suggested by Blin et al. in [5, 4] shows that the APS problem becomes hard when one considers {≬, R}-comparable arc-annotated sequences for some R ⊆ {<, ⊏, ≬}. Therefore, crossing arcs alone do not imply APS hardness. It is of course a challenging problem to further explore the complexity of the APS problem, and especially its parameterized complexity, by considering additional parameters such as the cutwidth or the depth of the arc structures.


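A × B             APS
{<} × ∅           O(n + m) – Gramm et al. [11]
{<} × {<}         O(nm) – Gramm et al. [11, 12]
{⊏} × ∗           O(nm) – Gramm et al. [11, 12]
{<, ⊏} × ∗        O(nm) – Gramm et al. [11, 12]
{≬} × ∅           O(nm²) – Blin et al. [5, 4]
{≬} × {≬}         O(nm²) – Blin et al. [5, 4]
{<, ≬} × ∗        NP-complete – Blin et al. [5, 4], Guo [14], Evans [9]
{⊏, ≬} × ∗        NP-complete – Blin et al. [5, 4], Guo [14], Evans [9]
{<, ⊏, ≬} × ∗     NP-complete – Blin et al. [5, 4], Guo [14], Evans [9]

Table 1.5 APS classical refined complexity where n = |S| and m = |T|.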

1.5 MAXIMUM ARC-PRESERVING COMMON SUBSEQUENCE

Definition

The MAPCS problem was introduced by Blin et al. [3] as an intermediate model for comparing arc-annotated sequences, lying between LAPCS and Edit (see Section 1.6). The MAPCS problem is defined as follows: given two arc-annotated sequences (S, P) and (T, Q), and two functions f_b : Σ → N∗ and f_a : Σ² → N∗, find a common arc-annotated subsequence (U, R) that maximizes the following score function:

\[ \sum_{c \in U} f_b(c) + \sum_{(c_1, c_2) \in R} f_a(c_1, c_2). \]

In other words, the MAPCS problem seeks a common subsequence whose score takes into account both the bases and the arcs it contains. The computational complexity of the MAPCS problem was fully determined in [3], and the main results are summarized in Table 1.6.
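As an illustration of the score function only (not of the optimization itself), the following Python sketch evaluates a given candidate (U, R). Representing R as pairs of 0-based positions in U, and the particular weight functions used in the example, are assumptions made for this sketch.

```python
def mapcs_score(U, R, f_b, f_a):
    """Score of a candidate common arc-annotated subsequence (U, R).

    U: string of bases.
    R: iterable of arcs, each a pair (i, j) of 0-based positions in U.
    f_b: weight of a single base; f_a: weight of an arc, given its two bases.
    """
    base_score = sum(f_b(c) for c in U)
    arc_score = sum(f_a(U[i], U[j]) for (i, j) in R)
    return base_score + arc_score


# Hypothetical weights: every base counts 1, a G-C arc counts 2, other arcs 1.
def unit_base(c):
    return 1


def gc_bonus(c1, c2):
    return 2 if {c1, c2} == {"G", "C"} else 1
```

With these hypothetical weights, the candidate U = "GAC" with one arc between its first and last base scores 3 for the bases plus 2 for the arc.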

In the following, we use the notation MAPCS(A, B) to represent the MAPCS problem where the arc structure of S (resp. T), namely P (resp. Q), is of level A (resp. B).


In [3], Blin et al. first investigated two special cases of MAPCS, namely those in which the function f_a or f_b is allowed to return zero. They noticed that setting f_a(x, y) = 0 for all (x, y) ∈ Σ² reduces MAPCS to the LAPCS problem. They then investigated the case f_b(x) = 0 for all x ∈ Σ, a problem called MAPCS∗, and proved that MAPCS∗(Chain, Chain) can be solved in O(nm) time, MAPCS∗(Nested, Nested) in O(n²m²) time, MAPCS∗(Nested, Chain) in O(nm²) time and MAPCS∗(Unlimited, Nested) in O(n⁴ log³ n) time, where n = |S| and m = |T|. They also proved that MAPCS∗(Crossing, Crossing) is NP-complete by providing a reduction from Clique.

A × B           MAPCS∗          MAPCS
Chain × Chain   O(nm)           O(nm)
Nest × Chain    O(n²m)          O(nm³)
Nest × Nest     O(n²m²)         NP-complete
Cros × Chain    O(n⁴ log³ n)    NP-complete
Cros × Nest     O(n⁴ log³ n)    NP-complete
Cros × Cros     NP-complete     NP-complete
Unlim × Chain   O(n⁴ log³ n)    NP-complete
Unlim × Nest    O(n⁴ log³ n)    NP-complete
Unlim × Cros    NP-complete     NP-complete
Unlim × Unlim   NP-complete     NP-complete

Table 1.6 MAPCS∗ and MAPCS classical complexity for n = |S| and m = |T|.

They also completely characterized the complexity of MAPCS by proposing an O(nm) (resp. O(nm³)) time algorithm for MAPCS(Chain, Chain) (resp. MAPCS(Nested, Chain)), and by proving that both MAPCS(Nested, Nested) and MAPCS(Crossing, Plain) are NP-complete.

Open Problems

As far as we know, neither the parameterized complexity nor the approximability of MAPCS has been studied (except for the case where f_a always returns zero, since it corresponds to the LAPCS problem and inherits all its complexity results).

1.6 EDIT DISTANCE

Definition

Given two arc-annotated sequences (S, P) and (T, Q), the Edit problem is to find the edit distance between (S, P) and (T, Q). It has been extensively studied [15, 19, 13, 8, 6, 3, 2, 7].



Lin et al. proved in [19] that the problem Edit(Crossing, Plain) is NP-complete, and gave a (polynomial-time) dynamic programming algorithm for the Edit(Nested, Plain) problem. Sankoff [20] had previously solved Edit(Plain, Plain).

Blin and Touzet [8] proved that the LAPCS problem can actually be seen as a very specific case of the Edit problem. More precisely, any edit script of minimum cost goes through a common subsequence of optimal score, which means that finding one allows one to find the other. Thus, LAPCS can be seen as a particular case of Edit where the cost system for edit operations is the following: wr = 2wd = 2wa, and all substitution operations and arc-breakings are prohibited by an arbitrarily high cost. The main idea is to penalize deletion operations proportionally to the number of bases that are deleted. This result shows that the complexity of Edit(Nested, Nested) simply follows from the complexity of LAPCS(Nested, Nested). It was extended in [6], where the authors observed that only a very restricted number of instances of Edit(Nested, Nested) had been shown to be NP-complete, and that the corresponding cost system needed to satisfy restrictions whose biological relevance is debatable. Therefore, as another step towards establishing the precise complexity landscape of the Edit problem, they considered a more accurate class of instances, not overlapping with the one used in the proof from LAPCS, in order to determine more precisely what makes the problem hard.

Another interesting result from Blin and Touzet [8] is a unifying framework, called Align, to express the comparison of arc-annotated sequences. Indeed, they showed that this hierarchy brings together most of the comparison models for arc-annotated sequences, and leads to the introduction of new comparison models that are biologically relevant. In particular, they proposed two polynomial-time algorithms for the problem of comparing two Nested arc-annotated sequences, whereas the corresponding problems considering the same set of edit operations in other formalisms are not polynomial-time solvable. Since this framework does not rely solely on arc-annotated sequences, we decided not to include it in this contribution.

In [13], Guignon et al. introduced the notion of conservative edit distance and mapping between two RNA stem-loops in order to design a polynomial-time algorithm for comparing general secondary RNA structures using the full set of biological edit operations introduced in [15]. This algorithm is based on a decomposition into stem-loop-like substructures that are compared pairwise and then used to compare complete RNA secondary structures. As mentioned in [13], whereas the edit distance is polynomial-time computable in the very restrictive case of conservative distance and mapping, it was not known whether the general, i.e., not conservative, edit distance between two stem-loops can also be computed in polynomial time. In [7], Blin et al. proved that computing the general edit distance is indeed NP-complete.


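A × B           Edit
Stem × Stem     NP-complete – Blin et al. [7]
Chain × Chain   O(nm³) – Lin et al. [19]
Nest × Chain    O(nm³) – Lin et al. [19]
Nest × Nest     NP-complete – Jiang et al. [15] and Blin et al. [6]
Cros × Chain    NP-complete – Lin et al. [19]
Cros × Nest     NP-complete – Lin et al. [19]
Cros × Cros     NP-complete – Lin et al. [19]
Unlim × Chain   NP-complete – Lin et al. [19]
Unlim × Nest    NP-complete – Lin et al. [19]
Unlim × Cros    NP-complete – Lin et al. [19]
Unlim × Unlim   NP-complete – Lin et al. [19]

Table 1.7 Edit classical complexity for n = |S| and m = |T|.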

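A × B           Edit
Nest × Nest     max{2wa/(wb + wr), (wb + wr)/(2wa)}-approximable – Lin et al. [19]
Cros × Chain    MaxSNP-hard – Lin et al. [19]
Cros × Nest     MaxSNP-hard – Lin et al. [19]
Cros × Cros     MaxSNP-hard – Lin et al. [19]
Unlim × Chain   MaxSNP-hard – Lin et al. [19]
Unlim × Nest    MaxSNP-hard – Lin et al. [19]
Unlim × Cros    MaxSNP-hard – Lin et al. [19]
Unlim × Unlim   MaxSNP-hard – Lin et al. [19]

Table 1.8 Edit approximability for n = |S| and m = |T|.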

Approximability

Lin et al. proved in [19] that the problem Edit(Crossing, Plain) is MaxSNP-hard. They also showed that Edit(Nested, Nested) has a polynomial-time approximation algorithm with ratio β = max{2wa/(wb + wr), (wb + wr)/(2wa)}.


Open problems

The approximation ratio of Edit(Nested, Nested) depends on the respective values of the parameters wa, wb and wr. An interesting question is whether there exists a polynomial-time algorithm with constant approximation ratio.

References

1. J. Alber, J. Gramm, J. Guo, and R. Niedermeier. Towards optimally solving the longest common subsequence problem for sequences with nested arc annotations in linear time. In A. Apostolico and M. Takeda, editors, Proc. 13th Annual Symposium on Combinatorial Pattern Matching (CPM), Fukuoka, Japan, volume 2373 of Lecture Notes in Computer Science, pages 99–114. Springer, 2002.

2. G. Blin, A. Denise, S. Dulucq, C. Herrbach, and H. Touzet. Alignment of RNA structures. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2008. To appear.

3. G. Blin, G. Fertin, G. Herry, and S. Vialette. Comparing RNA structures: towards an intermediate model between the edit and the lapcs problems. In Marie-France Sagot and Maria Emilia Telles Walter, editors, 1st Brazilian Symposium on Bioinformatics (BSB'07), volume 4643 of Lecture Notes in Bioinformatics, pages 101–112, Angra dos Reis, Brazil, August 2007. Springer-Verlag.

4. G. Blin, G. Fertin, R. Rizzi, and S. Vialette. What makes the arc-preserving subsequence problem hard? T. Comp. Sys. Biology, 2:1–36, 2005.

5. G. Blin, G. Fertin, R. Rizzi, and S. Vialette. What makes the arc-preserving subsequence problem hard? In Proc. Int. Workshop on Bioinformatics Research and Applications (IWBRA), volume 3515 of Lecture Notes in Computer Science, pages 860–868, 2005.

6. G. Blin, G. Fertin, I. Rusu, and C. Sinoquet. Extending the hardness of RNA secondary structures. In Bo Chen, Mike Paterson, and Guochuan Zhang, editors, 1st International Symposium on Combinatorics, Algorithms, Probabilistic and Experimental Methodologies (ESCAPE'07), volume 4614 of LNCS, pages 140–151, Hangzhou, China, April 2007. Springer-Verlag.

7. G. Blin, S. Hamel, and S. Vialette. Comparing RNA structures using a full set of biologically relevant edit operations is intractable. Technical report, Université Paris-Est, I.G.M., December 2008. Electronic version (7 pp.), preprint arXiv:0812.3946.

8. G. Blin and H. Touzet. How to Compare Arc-Annotated Sequences: The Alignment Hierarchy. In Fabio Crestani, Paolo Ferragina, and Mark Sanderson, editors, 13th String Processing and Information Retrieval (SPIRE'06), volume 4209 of LNCS, pages 291–303, Glasgow, UK, October 2006. Springer-Verlag.

9. P. Evans. Algorithms and Complexity for Annotated Sequences Analysis. PhD thesis, University of Victoria, 1999.

10. P. Evans. Finding common subsequences with arcs and pseudoknots. In M. Crochemore and M. Paterson, editors, Proc. 10th Annual Symposium Combinatorial Pattern Matching (CPM), Warwick University, UK, volume 1645 of Lecture Notes in Computer Science, pages 270–280. Springer, 1999.

11. J. Gramm, J. Guo, and R. Niedermeier. Pattern matching for arc-annotated sequences. In M. Agrawal and A. Seth, editors, Proc. 22nd Foundations of Software Technology and Theoretical Computer Science (FSTTCS), Kanpur, India, Lecture Notes in Computer Science, pages 182–193, 2002.

12. J. Gramm, J. Guo, and R. Niedermeier. Pattern matching for arc-annotated sequences. ACM Transactions on Algorithms, 2(1):44–65, 2006.

13. V. Guignon, C. Chauve, and S. Hamel. An edit distance between RNA stem-loops. In Mariano P. Consens and Gonzalo Navarro, editors, 12th International Conference SPIRE, volume 3772 of LNCS, pages 335–347, 2005.

14. J. Guo. Exact algorithms for the longest common subsequence problem for arc-annotated sequences. Master's thesis, University of Tübingen, 2002.

15. T. Jiang, G. Lin, B. Ma, and K. Zhang. A general edit distance between RNA structures. Journal of Computational Biology, 9(2):371–388, 2002.

16. T. Jiang, G. Lin, B. Ma, and K. Zhang. The longest common subsequence problem for arc-annotated sequences. Journal of Discrete Algorithms, pages 257–270, 2004.

17. T. Jiang, G.-H. Lin, B. Ma, and K. Zhang. The longest common subsequence problem for arc-annotated sequences. In R. Giancarlo and D. Sankoff, editors, Proc. 11th Annual Symposium on Combinatorial Pattern Matching (CPM), Montreal, Canada, volume 1848 of Lecture Notes in Computer Science, pages 154–165. Springer, 2000.

18. G. Lin, Z-Z. Chen, T. Jiang, and J. Wen. The longest common subsequence problem for sequences with nested arc annotations. Journal of Computer and System Sciences, 65(3):465–480, 2002. Special issue on computational biology.

19. G. Lin, B. Ma, and K. Zhang. Edit distance between two RNA structures. In RECOMB, pages 211–220, 2001.

20. D. Sankoff and B. Kruskal. Time Warps, String Edits and Macromolecules: the Theory and Practice of Sequence Comparison. Addison-Wesley, 1983.

21. D. Shasha and K. Zhang. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245–1262, 1989.

22. S. Vialette. Pattern matching over 2-intervals sets. In A. Apostolico and M. Takeda, editors, Proc. 13th Annual Symposium Combinatorial Pattern Matching (CPM), Fukuoka, Japan, volume 2373 of Lecture Notes in Computer Science, pages 53–63. Springer, 2002.

23. S. Vialette. On the computational complexity of 2-interval pattern matching problems. Theoretical Computer Science, 312(2-3):223–249, 2004.

A faster algorithm for finding minimum Tucker submatrices

Guillaume Blin¹, Romeo Rizzi², and Stéphane Vialette¹

¹ Université Paris-Est, LIGM - UMR CNRS 8049, France. {gblin,vialette}@univ-mlv.fr

² Dipartimento di Matematica ed Informatica (DIMI), Università degli Studi di Udine, Italy. [email protected]

Abstract. A binary matrix has the Consecutive Ones Property (C1P) if its columns can be ordered in such a way that all 1s on each row are consecutive. Algorithmic issues of the C1P are central in computational molecular biology, in particular for physical mapping and ancestral genome reconstruction. In 1972, Tucker gave a characterization of matrices that have the C1P by a set of forbidden submatrices, and a substantial amount of research has been devoted to the problem of efficiently finding such a minimum size forbidden submatrix. This paper presents a new O(∆³m²(m∆ + n³)) time algorithm for this particular task for an m × n binary matrix with at most ∆ 1-entries per row, thereby improving the O(∆³m²(mn + n³)) time algorithm of [M. Dom, J. Guo and R. Niedermeier, Approximation and fixed-parameter algorithms for consecutive ones submatrix problems, Journal of Computer and System Sciences, 76(3-4): 204-221, 2010]. Moreover, this approach can be used, with much heavier machinery, to address harder problems related to Minimal Conflicting Set [G. Blin, R. Rizzi, and S. Vialette. A Polynomial-Time Algorithm for Finding Minimal Conflicting Sets, Proc. 6th International Computer Science Symposium in Russia (CSR), 2011].

Author manuscript, published in "Theory of Computing Systems ?, ? (2012) 10 pp." (hal-00657340, version 1 - 16 Mar 2012)

1 Introduction

A binary matrix has the Consecutive Ones Property (C1P) if its columns can be ordered in such a way that all 1s on each row are consecutive. Both deciding if a given binary matrix has the C1P and finding the corresponding columns permutation can be done in linear time [9, 17, 18, 22–24, 27, 30]. The C1P of matrices has a long history and it plays an important role in combinatorial optimization, including application fields such as scheduling [6, 20, 21, 35], information retrieval [25], and railway optimization [28, 29, 32] (see [15] for a recent survey). Furthermore, algorithmic aspects of the C1P turn out to be of particular importance for physical mapping [2, 12, 26] and ancestral genome reconstruction [1, 11] (see also [10, 3–5, 13, 31] for other applications in computational molecular biology). Actually, our main motivation for studying algorithmic aspects of the C1P comes from minimal conflicting sets in binary matrices in the context of ancestral genome reconstruction [?]. A minimal conflicting set of rows in a binary matrix is a set of rows R that does not have the C1P but such that any proper subset of R has the C1P (a similar definition applies for columns). Tucker [34] has characterized the binary matrices that have the C1P by a set of forbidden submatrices, and the aim of this paper is to lay the foundations for efficiently computing minimal conflicting sets by presenting a new efficient algorithm for finding such minimum size forbidden Tucker submatrices [8].

Recently, Dom et al. [16] investigated some natural problems arising when a matrix M does not have the C1P (the C1P is indeed a desirable property that often leads to efficient algorithms): find a minimum-cardinality set of columns to delete such that the resulting matrix has the C1P, find a minimum-cardinality set of rows to delete such that the resulting matrix has the C1P, and find a minimum-cardinality set of 1-entries in the matrix that shall be flipped (that is, replaced by 0-entries) such that the resulting matrix has the C1P. All these problems are NP-hard even for simple instances [19, 33], and hence Dom et al. have focused on approximation and parameterized complexity issues. To this end, they have provided a technical solution based on efficiently detecting forbidden Tucker submatrices [34].

In this paper, we present a new O(∆³m²(m∆ + n³)) time algorithm for finding a minimum size forbidden Tucker submatrix in m × n binary matrices with at most ∆ 1-entries per row, thereby improving the O(∆³m²(mn + n³)) time algorithm of Dom et al. [16, 15].

This paper is organized as follows: we first recall basic definitions in Section 2, and we then formally introduce the considered problem. In Section 3, we briefly recall the algorithm of Dom et al. Section 4 is devoted to presenting our technical improvement, and we consider in Section 5 matrices with unbounded ∆.

2 Preliminaries

All graphs are considered as undirected. Given a graph G = (V, E), let N(v) = {u | (u, v) ∈ E} denote the neighborhood of vertex v in G. The neighborhood described above does not include v itself, and is more specifically the open neighborhood of v; it is also possible to define a neighborhood in which v itself is included, called the closed neighborhood. For any subset V′ ⊆ V of vertices, let G[V′] denote the subgraph of G induced by the vertices of V′ (i.e., G′ = (V′, {(u, v) ∈ E | u, v ∈ V′})). An induced cycle is a cycle that is an induced subgraph of G; induced cycles are also called chordless cycles. An asteroidal triple is an independent set of three vertices such that each pair is joined by a path that avoids the neighborhood of the third.

A binary matrix has the Consecutive Ones Property (C1P) if its columns can be ordered in such a way that all 1s on each row are consecutive. For a matrix M, we let ri and cj stand for the ith row and the jth column of M, respectively. Let M be an m × n binary matrix. Its corresponding bipartite graph G(M) = (V_M, E_M) is defined as follows: V_M = R ∪ C, where R = {ri : 1 ≤ i ≤ m} is the set of rows of M and C = {cj : 1 ≤ j ≤ n} is the set of columns of M, and two vertices ri ∈ R and cj ∈ C are connected by an edge if and only if M[i, j] = 1. Equivalently, M is the reduced adjacency matrix of G(M) (i.e., the non-redundant portion of the full adjacency matrix for the bipartite graph). Actually, it will be convenient to define G(M) as a vertex-colored bipartite graph: each ri ∈ R (row vertex) is colored black and each cj ∈ C (column vertex) is colored white. See Figure 1 for an illustration (we use capital letters for black vertices and uncapitalized letters for white vertices).

M =
      c1  c2  c3  c4  c5  c6
r1     0   1   0   1   1   0
r2     0   0   0   1   1   0
r3     0   1   1   0   1   0
r4     1   0   1   0   1   1

[Figure: the graph G(M), with black vertices R1–R4 and white vertices c1–c6.]

Fig. 1. A binary matrix and its corresponding vertex-colored bipartite graph.
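The two definitions above can be sketched in a few lines of Python. This is only an illustration: the vertex encoding ("r", i) and ("c", j) with 0-based indices and the function names are assumptions of this sketch, and the C1P test below simply tries every column order, so it is exponential and unrelated to the linear-time recognition algorithms cited in the introduction.

```python
from itertools import permutations


def bipartite_graph(M):
    """Adjacency sets of G(M): ("r", i) is a black row vertex, ("c", j) a white column vertex."""
    graph = {("r", i): set() for i in range(len(M))}
    graph.update({("c", j): set() for j in range(len(M[0]))})
    for i, row in enumerate(M):
        for j, entry in enumerate(row):
            if entry == 1:
                graph[("r", i)].add(("c", j))
                graph[("c", j)].add(("r", i))
    return graph


def ones_consecutive(row):
    """True if the 1-entries of a row form a single block (or the row has no 1)."""
    ones = [k for k, entry in enumerate(row) if entry == 1]
    return not ones or ones[-1] - ones[0] + 1 == len(ones)


def has_c1p_bruteforce(M):
    """Try every column order; only usable for a handful of columns."""
    n = len(M[0])
    return any(all(ones_consecutive([row[j] for j in order]) for row in M)
               for order in permutations(range(n)))


# The matrix of Figure 1.
FIG1 = [
    [0, 1, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 1, 1],
]
```

On the matrix of Figure 1 the brute-force check finds no valid column order: rows r1, r2 and r3 force the order c4, c5, c2, c3 (up to reversal), which separates c5 from c3 in row r4.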

The following result of Tucker links the C1P for binary matrices to asteroidal triples.

Theorem 1 ([34], Theorem 6). A binary matrix has the C1P if and only if its corresponding vertex-colored bipartite graph does not contain a white asteroidal triple, i.e. an asteroidal triple on column vertices.
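Theorem 1 suggests a direct, if naive, test, sketched below in Python: enumerate all triples of column vertices and check by breadth-first search that each pair stays connected once the closed neighborhood of the third vertex is removed. The graph encoding and all function names are assumptions of this sketch; it illustrates the definition and makes no attempt at efficiency.

```python
from collections import deque
from itertools import combinations


def bipartite_graph(M):
    """Adjacency sets of G(M) with ("r", i) row vertices and ("c", j) column vertices."""
    graph = {("r", i): set() for i in range(len(M))}
    graph.update({("c", j): set() for j in range(len(M[0]))})
    for i, row in enumerate(M):
        for j, entry in enumerate(row):
            if entry == 1:
                graph[("r", i)].add(("c", j))
                graph[("c", j)].add(("r", i))
    return graph


def connected_avoiding(graph, source, target, forbidden):
    """Breadth-first search for a source-target path that avoids `forbidden`."""
    seen = {source}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if v == target:
            return True
        for u in graph[v]:
            if u not in seen and u not in forbidden:
                seen.add(u)
                queue.append(u)
    return False


def white_asteroidal_triple(M):
    """Return some asteroidal triple of column vertices of G(M), or None."""
    graph = bipartite_graph(M)
    columns = [v for v in graph if v[0] == "c"]
    for triple in combinations(columns, 3):
        found = True
        for k in range(3):
            third = triple[k]
            a, b = (triple[i] for i in range(3) if i != k)
            # Column vertices are pairwise non-adjacent, so a and b never
            # belong to the closed neighborhood of the third vertex.
            if not connected_avoiding(graph, a, b, graph[third] | {third}):
                found = False
                break
        if found:
            return triple
    return None
```

On the matrix of Figure 1, for example, the columns c1, c2 and c4 form such a triple, which is consistent with the fact that this matrix does not have the C1P.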

Moreover, Tucker has characterized the binary matrices that have the C1P by a set of forbidden submatrices (the number of vertices in the forbidden graph defines the size of the Tucker configuration).

Theorem 2 ([34], Theorem 9). A binary matrix has the C1P if and only if it contains none of the matrices M_{I_k}, M_{II_k}, M_{III_k} (k ≥ 1), M_{IV} and M_V depicted in Figure 2.

Fig. 2. Forbidden Tucker submatrices represented as vertex-colored bipartite graphs [34]. Black and white vertices correspond to rows and columns, respectively.

3 The algorithm of Dom et al.

In [16], Dom et al. provided an algorithm for finding a forbidden Tucker submatrix (i.e., one of T = {M_{I_k}, M_{II_k}, M_{III_k}, M_{IV}, M_V}) in a given binary matrix. The general algorithm is as follows. For each white asteroidal triple u, v, w of G(M), compute the sum of the lengths of three shortest paths connecting u, v and w pairwise (each path has to avoid the closed neighborhood of the third vertex). Select an asteroidal triple u, v, w of G(M) with minimum total length of the paths connecting each pair of vertices and return the rows and columns of M that correspond to the vertices that occur along the three shortest paths. The authors proved that the returned submatrix does contain a forbidden Tucker submatrix of T, but one that is not necessarily of minimum size (for M_{III_k}, M_{IV} and M_V). Indeed, since the three shortest paths may share some vertices, the sum of the lengths of the three paths is not necessarily the number of vertices in the union of the three paths. However, Dom et al. showed that the returned submatrix contains at most three extra columns (resp. five extra rows) compared with a forbidden Tucker submatrix with minimum number of columns (resp. rows). To overcome this problem, they provided another algorithm devoted to M_{III_k}, M_{IV} and M_V submatrices. More precisely, they used the similarity between M_{III_k} and M_{I_k} to reduce the problem to a minimum-size chordless cycle search. For M_{IV} and M_V, they provided an exhaustive search.

On the whole, Dom et al. provided an algorithm for finding a forbidden Tucker submatrix in a given matrix M (assuming M does not have the C1P) in O(∆³m²n(m + n²)) time, where m is the number of rows of M, n is the number of columns of M, and ∆ is the maximum number of 1-entries in a row. More precisely, the authors provided an O(∆mn² + n³) time algorithm for finding an M_{I_k} or M_{II_k} submatrix, an O(∆³m³n + ∆²m²n²) time algorithm for finding an M_{III_k} submatrix, an O(∆³m²n³) time algorithm for finding an M_{IV} submatrix, and an O(∆⁴m²n) time algorithm for finding an M_V submatrix. See Table 1.

                         Time complexity
Tucker configuration     Dom et al.               Our contribution
M_{I_k} and M_{II_k}     O(∆mn² + n³)             O(∆³m²(n + ∆m))
M_{III_k}                O(∆³m³n + ∆²m²n²)        O(∆mn²(n + ∆m))
M_{IV}                   O(∆³m²n³)                Algorithm of Dom et al. used verbatim
M_V                      O(∆⁴m²n)                 Algorithm of Dom et al. used verbatim
Overall                  O(∆³m²(mn + n³))         O(∆³m²(∆m + n³))

Table 1. Comparing our results with Dom et al. [16, 15].

The main contribution of this paper is a simple O(∆³m²(m∆ + n³)) time algorithm for finding a minimum size forbidden Tucker submatrix. Our technical improvement on Dom et al. [16, 15] is based on shortest paths and two graph pruning techniques: clean and complement clean (to be defined in the next section). Graph pruning techniques were introduced by Conforti and Rao [14]. Note that graph pruning techniques do not always succeed in the detection of induced configurations. Indeed, in [7] Bienstock gave negative results, among which one can find an NP-completeness proof for the problem of deciding whether a graph contains an odd chordless cycle containing a given vertex. This negative result, which was useful in the attack on the perfect graph conjecture because it marked the limits of what could have been a reasonable approach, also demonstrates that not everything can be done with the detection of induced configurations.

4 Fast detection of minimum size forbidden Tucker submatrices

In this section, we design a general algorithm that reports in polynomial time the smallest Tucker configuration of a given matrix M if it exists. To do so, we search for any subgraph of G(M) that corresponds to a submatrix of T.

Let us introduce the clean and cpl_clean cleaning operations. Let M be a binary matrix and G(M) = (V_M, E_M) be the corresponding vertex-colored bipartite graph. Let v be a vertex of G(M). Then clean_{G(M)}(v) is the graph obtained by removing from G(M) all neighbors of v, i.e., G(M)[V_M \ N(v)]. Similarly, cpl_clean_{G(M)}(v) (read complement clean) is the graph obtained by removing from G(M) all vertices that belong neither to the same partition as v nor to the neighborhood of v, i.e., G(M)[N(v) ∪ {u : color(u) = color(v)}]. For the sake of simplicity, we shall write clean_{G(M)}(u1, u2, . . . , uk) for the sequence

G1 ← clean_{G(M)}(u1),
G2 ← clean_{G1}(u2),
. . . ,
Gk ← clean_{Gk−1}(uk),
Return Gk,

and cpl_clean_{G(M)}(u1, u2, . . . , uk) for the sequence

G1 ← cpl_clean_{G(M)}(u1),
G2 ← cpl_clean_{G1}(u2),
. . . ,
Gk ← cpl_clean_{Gk−1}(uk),
Return Gk.
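The two pruning operations can be sketched in Python as follows. The sketch assumes a graph stored as a dictionary of adjacency sets whose vertices are tuples ("r", i) or ("c", j), so that the first component plays the role of the color. The `protected` parameter is an addition of this sketch: it lets a caller exempt selected vertices from deletion, and all names are hypothetical.

```python
def induced(graph, keep):
    """Subgraph of `graph` induced by the vertex set `keep`."""
    return {v: graph[v] & keep for v in graph if v in keep}


def clean(graph, v, protected=frozenset()):
    """Remove every neighbor of v, except the vertices listed in `protected`."""
    removed = graph[v] - set(protected)
    return induced(graph, set(graph) - removed)


def cpl_clean(graph, v, protected=frozenset()):
    """Keep N(v), the vertices with the same color as v, and `protected`."""
    same_color = {u for u in graph if u[0] == v[0]}
    keep = graph[v] | same_color | (set(protected) & set(graph))
    return induced(graph, keep)


def clean_sequence(graph, vertices, protected=frozenset()):
    """clean(u1, ..., uk): apply clean successively, each time on the pruned graph."""
    for v in vertices:
        if v in graph:  # v may already be gone if it was not protected
            graph = clean(graph, v, protected)
    return graph
```

Both functions return a new dictionary and leave the input graph untouched, which makes it easy to try several prunings from the same starting graph.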


We now focus on the bipartite graphs that represent Tucker configurations (see Figure 2) and define our guessing functions. Define the function guessI(G(M)) (resp. guessII(G(M))) that returns all possible sets of vertices {x, y, z, A, B} ⊆ V_M as defined for the Tucker configuration M_{I_k} (resp. M_{II_k}) in Figure 2. Furthermore, define the function guessIII(G(M)) that returns all possible sets of vertices {x, y, z, A} ⊆ V_M as defined for the Tucker configuration M_{III_k} in Figure 2. Importantly, in the presented algorithms, guessed vertices will never be affected (i.e., deleted) by the clean and cpl_clean functions.

Fig. 3. M_{I_k} Tucker configuration.

Lemma 1. Let M be an m × n binary matrix with at most ∆ 1-entries per row. Algorithm 1 computes the smallest submatrix G(M_{I_k}) in G(M) in O(m²∆³(n + ∆m)) time (if such a submatrix exists).

Algorithm 1 Finding G(M_{I_k}) in G(M)
1: for all {x, y, z, A, B} ← guessI(G(M)) do
2:   G1 ← clean_{G(M)}(x, A, B)
3:   Remove the vertices A and B from G1. In the resulting graph, find a shortest path p from y to z
4:   if p exists then
5:     S ← {x, y, z, A, B} ∪ {u : u ∈ p}
6:     return the induced subgraph G(M)[S]
7:   end if
8: end for
9: return None

Proof. We apply Algorithm 1 to G(M). Let us first prove that if G(M_{I_k}) occurs in G(M), then Algorithm 1 finds it. Suppose G′ = G(M_{I_k}) occurs in G(M). Then among all the guessed 5-tuples x, y, z, A, B (Line 1), there must be at least one guess such that x, y, z, A, B are part of the vertices of G′. By definition, G′ is a chordless cycle. Therefore, clean(x, A, B) preserves G′ since in G′, (1) x is only connected to vertices A and B, (2) A is only connected to vertices x and y, and (3) B is only connected to x and z. Therefore, looking for a shortest path p in the pruned graph between y and z after having removed A and B ensures the minimality of the returned solution.

The guessing can be done in O(m²∆³) time. Indeed, once A has been identified, one can (i) select x and y among the at most ∆ neighbors of A, and (ii) identify B and one of its at most ∆ neighbors as z such that x ∈ N(B) and z ∉ {x, y}. For each such guess, the cleaning of x, A, B can be done in O(∆ + m) time. Finally, a shortest path between y and z can be computed in O(n + ∆m) time (the pruned graph has at most m + n vertices and ∆m edges). On the whole, Algorithm 1 runs in O(m²∆³(n + ∆m)) time. □
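The scheme of Algorithm 1 can be sketched in Python as follows. Several points are assumptions of this sketch rather than statements of the paper: the guesses are enumerated directly from the adjacency conditions given in the proof (x, y ∈ N(A), x, z ∈ N(B)), with the extra guards y ∉ N(B) and z ∉ N(A) so that the resulting cycle has no chord; the sketch keeps the best solution over all guesses instead of returning at the first success; and all names and the 0-based vertex encoding are hypothetical.

```python
from collections import deque


def bipartite_graph(M):
    """Adjacency sets of G(M) with ("r", i) row vertices and ("c", j) column vertices."""
    graph = {("r", i): set() for i in range(len(M))}
    graph.update({("c", j): set() for j in range(len(M[0]))})
    for i, row in enumerate(M):
        for j, entry in enumerate(row):
            if entry == 1:
                graph[("r", i)].add(("c", j))
                graph[("c", j)].add(("r", i))
    return graph


def shortest_path(graph, allowed, source, target):
    """Breadth-first shortest path restricted to the vertices in `allowed`."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if v == target:
            path = []
            while v is not None:
                path.append(v)
                v = parent[v]
            return path[::-1]
        for u in graph[v]:
            if u in allowed and u not in parent:
                parent[u] = v
                queue.append(u)
    return None


def smallest_chordless_cycle(M):
    """Vertex set of a smallest chordless cycle on at least 6 vertices, or None."""
    graph = bipartite_graph(M)
    rows = [v for v in graph if v[0] == "r"]
    best = None
    for A in rows:
        for x in graph[A]:
            for y in graph[A] - {x}:
                for B in graph[x] - {A}:
                    if y in graph[B]:
                        continue  # the edge B-y would be a chord
                    for z in graph[B] - graph[A]:  # z is not adjacent to A
                        guessed = {x, y, z, A, B}
                        # clean(x, A, B): drop their neighbors, keep the guesses.
                        removed = (graph[x] | graph[A] | graph[B]) - guessed
                        # Then remove A and B themselves before the path search.
                        allowed = set(graph) - removed - {A, B}
                        path = shortest_path(graph, allowed, y, z)
                        if path is not None:
                            S = guessed | set(path)
                            if best is None or len(S) < len(best):
                                best = S
    return best
```

On the 3 × 3 matrix with rows 110, 011 and 101, whose graph is a cycle on six vertices, the sketch returns all six vertices; on a matrix whose graph contains no chordless cycle on six or more vertices it returns None.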

Fig. 4. M_{II_k} Tucker configuration.

Lemma 2. Let M be an m × n binary matrix with at most ∆ 1-entries per row. Algorithm 2 computes the smallest submatrix G(M_{II_k}) in G(M) in O(m²∆³(n + ∆m)) time (if such a submatrix exists).

Algorithm 2 Finding G(M_{II_k}) in G(M)
1: for all {x, y, z, A, B} ← guessII(G(M)) do
2:   G1 ← cpl_clean_{G(M)}(A, B)
3:   G2 ← clean_{G1}(x)
4:   Remove the vertices A and B from G2. In the resulting graph, find a shortest path p from y to z
5:   if p exists then
6:     S ← {x, y, z, A, B} ∪ {u : u ∈ p}
7:     return the induced subgraph G(M)[S]
8:   end if
9: end for
10: return None

Proof. We apply Algorithm 2 to G(M). Let us first prove that if G(M_{II_k}) occurs in G(M), then Algorithm 2 finds it. Suppose G′ = G(M_{II_k}) occurs in G(M). Then among all the guessed 5-tuples x, y, z, A, B in Line 1, there must be at least one guess such that x, y, z, A, B are part of the vertices of G′. By definition, in G′ any unguessed white node is in the neighborhood of both A and B. Thus, cpl_clean(A, B) preserves G′ since (1) y (the only white node not in the neighborhood of B) has been guessed, and (2) z (the only white node not in the neighborhood of A) has been guessed. Moreover, in G′ x should be only connected to A and B. Thus, clean(x) preserves G′. Finally, looking for a shortest path p in the pruned graph between y and z after having removed A and B ensures the minimality of the returned solution.

The guessing can be done in O(m²∆³) time. For each guess, the cleaning/complement-cleaning of x, A, and B can be done in O(n + m) time. Finally, a shortest path between y and z can be computed in O(n + ∆m) time (the pruned graph has at most ∆ + n vertices and ∆m edges). On the whole, Algorithm 2 runs in O(m²∆³(n + ∆m)) time. □

Fig. 5. MIIIk Tucker configuration.

Lemma 3. Let M be a m × n binary matrix with at most ∆ 1-entriesin each row. Algorithm 3 computes the smallest G(MIIIk

) in G(M) inO(m∆n2(n + ∆m)) time (if such a submatrix exists).

hal-0

0657

340,

ver

sion

1 -

16 M

ar 2

012

Algorithm 3 Finding G(MIIIk) in G(M)

Proof. 1: for all {x, y, z, A}← guessIII(G(M)) do2: G1 ← cpl cleanG(M)(A)3: G2 ← cleanG1(x)4: Remove the vertex A from G2. In the resulting graph, find a shortest path p

from y to z.5: if p exists then6: S ← {x, y, z, A} ∪ {u : u ∈ p}7: return the induced subgraph G(M)[S]8: end if9: end for

10: return None

We apply Algorithm 3 to G(M). Let us first prove that if G(MIIIk)

occurs in G(M), then Algorithm 3 finds it. Suppose G� = G(MIIIk) occurs

in G(M). Then among all the guessed 4-plets x, y, z, A in Line 1, theremust be at least one guess such that x, y, z, A are part of the vertices ofG�. By definition, in G� any unguessed white node is in the neighborhoodof A. Thus, cpl clean(A) preserves G� since in G� y and z (the onlywhite nodes not in the neighborhood of A) have been guessed. Moreover,in G� x is only connected to A. Therefore, clean(x) preserves G�. Finally,looking for a shortest path p in the pruned graph between y and z afterhaving removed A ensures the minimality of the returned solution.

The guessing can be done in $O(m\Delta n^2)$ time. Indeed, once A has been identified, one can (i) select x among the at most ∆ neighbors of A and (ii) then identify y and z. For each such guess, the cleaning/complement-cleaning of x and A can be done in $O(n + m)$ time. Finally, a shortest path between y and z can be computed in $O(n + \Delta m)$ time (the pruned graph has at most $\Delta + n$ vertices and $\Delta m$ edges). On the whole, Algorithm 3 runs in $O(m\Delta n^2(n + \Delta m))$ time. □
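The shortest-path step shared by Algorithms 2 and 3 can be illustrated with a plain breadth-first search. The following is only a minimal sketch under stated assumptions: the adjacency-dictionary representation, the function name `shortest_path_avoiding` and the toy graph are hypothetical, and the cleaning steps are assumed to have been applied beforehand.

```python
from collections import deque

def shortest_path_avoiding(adj, source, target, removed=frozenset()):
    """Return a shortest source-target path (list of vertices) that avoids
    every vertex in `removed`, or None if no such path exists."""
    if source in removed or target in removed:
        return None
    parent = {source: None}          # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj.get(node, ()):
            if nxt not in parent and nxt not in removed:
                parent[nxt] = node
                queue.append(nxt)
    return None

# Hypothetical toy bipartite graph: lowercase = white nodes, uppercase = black nodes.
toy_adj = {
    "y": ["A", "B"],
    "A": ["y", "z"],
    "B": ["y", "w"],
    "w": ["B", "C"],
    "C": ["w", "z"],
    "z": ["A", "C"],
}
path_without_A = shortest_path_avoiding(toy_adj, "y", "z", removed={"A"})
```

In the spirit of Line 6 of the algorithm, the returned vertex set would then be the guessed vertices together with the vertices of the path.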

As for $G(M_{IV})$ and $G(M_V)$, a simple brute-force search yields the time complexity detailed in Lemma 4.

Lemma 4 ([16], Proposition 5.3). Let M be an m × n binary matrix with at most ∆ 1-entries per row. The smallest $G(M_{IV})$ (resp. $G(M_V)$) in $G(M)$ can be computed in $O(\Delta^3 m^2 n^3)$ (resp. $O(\Delta^4 m^2 n)$) time if it exists.

We are now ready to state the main result of this paper (Table 1 compares our results with Dom et al. [16]).


Fig. 6. $M_{IV}$ and $M_V$ Tucker configurations.

Theorem 3. Let M be an m × n binary matrix that does not have the C1P, with at most ∆ 1-entries per row. A minimum size forbidden Tucker submatrix that occurs in M can be found in $O(\Delta^3 m^2(\Delta m + n^3))$ time.

Notice that our results do not improve on the ones of Dom et al. [16] for large ∆. Furthermore, they do not improve on the ones of Dom et al. [16] for every Tucker configuration. For example, if m = n, our algorithm has time complexity $O(\Delta^4 m^3)$ for the $M_{I_k}$ and $M_{II_k}$ Tucker configurations, whereas Dom et al.'s algorithm has time complexity $O(\Delta m^3)$.
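As a quick check of the $O(\Delta^4 m^3)$ figure quoted for $m = n$, one can substitute $n = m$ into the bound established for Algorithm 2 (the $M_{II_k}$ case):

$$
m^2\Delta^3(n + \Delta m)\,\Big|_{n = m} \;=\; \Delta^3 m^3 + \Delta^4 m^3 \;=\; O(\Delta^4 m^3).
$$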

5 Matrices with unbounded ∆

As mentioned in [16], a natural question is concerned with the complexity of the problem when the number of 1s per row is unbounded. One can distinguish two subcases: the maximum number of 1s per column is bounded (say by C) or not. In the following, we prove that using a similar approach to the one used in the preceding section, one can achieve an $O(C^2 n^3(m + C^2 n))$ (resp. $O(n^4 m^4)$) time complexity for the bounded (resp. unbounded) case.

Theorem 4. Let M be an m × n binary matrix with at most C 1-entries per column. A minimum size forbidden Tucker submatrix that occurs in M can be found in $O(C^2 n^3(m + C^2 n))$ time.

Proof. We apply Algorithms 1, 2, and 3 to find any forbidden submatrix of type $M_{I_k}$, $M_{II_k}$ or $M_{III_k}$. Let us now analyze the time complexity.

For Algorithm 1, the guessing can be done in $O(n^3 C^2)$ time. Indeed, once x has been identified, one can select A and B among the at most C


neighbors of x and then identify y and z such that y ∈ N(A), y ∉ N(B), z ∈ N(B), z ∉ N(A) and x ≠ y ≠ z. For each such guess, the cleaning of x, A, and B can be done in $O(C + n)$ time. Finally, a shortest path between y and z can be computed in $O(m + Cn)$ time (the pruned graph has at most $m + n$ vertices and $Cn$ edges). We conclude that Algorithm 1 runs in $O(C^2 n^3(m + Cn))$ time.

For Algorithm 2, the guessing can also be done in $O(n^3 C^2)$ time. For each such guess, the cleaning/complement-cleaning of x, A and B can be done in $O(C + n)$ time. Finally, a shortest path between y and z can be computed in $O(m + Cn)$ time (the pruned graph has at most $m + n$ vertices and $Cn$ edges). We conclude that Algorithm 2 runs in $O(C^2 n^3(m + Cn))$ time.

As for Algorithm 3, the guessing can be done in $O(n^3 C)$ time. Indeed, once x has been identified, one can select A among the at most C neighbors of x and then identify y and z such that y ∉ N(A), z ∉ N(A) and x ≠ y ≠ z. For each such guess, the cleaning/complement-cleaning of x and A can be done in $O(C + n)$ time. Finally, a shortest path between y and z can be computed in $O(m + Cn)$ time (the pruned graph has at most $m + n$ vertices and $Cn$ edges). We conclude that Algorithm 3 runs in $O(C n^3(m + Cn))$ time.

$M_{IV}$ and $M_V$ forbidden submatrices deserve careful consideration. Indeed, a direct application of Proposition 5.3 [16] for finding $M_{IV}$ and $M_V$ forbidden submatrices results in an $O(n^6)$ time algorithm. We use the following strategy: (i) select the central three white nodes u, v, and w (distinct from x, y and z) among the $O(n^3)$ such triples, (ii) select four black nodes that are neighbors of at least one of u, v, or w among the $O(C^4)$ such 4-plets, and (iii) for each such combination check in $O(n)$ time whether every column of the matrix $M_{IV}$ appears at least once in the submatrix induced by the selected rows.

A submatrix of the type $M_V$ can be found analogously: (i) select the two central white nodes u and v (distinct from x, y, and z) among the $O(n^2)$ such pairs, (ii) select four black nodes that are neighbors of at least one of u and v among the $O(C^4)$ such 4-plets, and (iii) for each such combination check in $O(n)$ time whether every column of the matrix $M_V$ appears at least once in the submatrix induced by the selected rows. □

Replacing C by m in the above yields the following result.

Theorem 5. Let M be an m × n binary matrix. A minimum size forbidden Tucker submatrix that occurs in M can be found in $O(n^4 m^4)$ time.
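The substitution behind Theorem 5 can be spelled out. Since a column has at most $m$ entries, $C \le m$ always holds, and plugging $C = m$ into the bound of Theorem 4 gives:

$$
C^2 n^3 (m + C^2 n)\,\Big|_{C = m} \;=\; m^2 n^3 (m + m^2 n) \;=\; m^3 n^3 + m^4 n^4 \;=\; O(n^4 m^4).
$$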


References

1. Z. Adam, M. Turmel, C. Lemieux, and D. Sankoff. Common intervals and symmetric difference in a model-free phylogenomics, with an application to streptophyte evolution. J. Comput. Biol., 14:436–445, 2007.

2. F. Alizadeh, R. Karp, D. Weisser, and G. Zweig. Physical mapping of chromosomes using unique probes. J. Comput. Biol., 2:159–184, 1995.

3. E. Althaus, S. Canzar, M.R. Emmett, A. Karrenbauer, A.G. Marshall, A. Meyer-Baese, and H. Zhang. Computing h/d-exchange speeds of single residues from data of peptic fragments. In ACM Press, editor, 23rd ACM Symposium on Applied Computing SAC ’08, pages 1273–1277, 2008.

4. J.E. Atkins, E.G. Boman, and B. Hendrickson. A spectral algorithm for seriation and the consecutive ones problem. SIAM J. Comput., 28(1):297–310, 1998.

5. J.E. Atkins and M. Middendorf. On physical mapping and the consecutive ones property for sparse matrices. Discrete Appl. Math., 71(1–3):23–40, 1996.

6. J. J. Bartholdi, J. B. Orlin, and H. D. Ratliff. Cyclic scheduling via integer programs with circular ones. Oper Res, 28(5):1074–1085, 1980.

7. Dan Bienstock. On the complexity of testing for odd holes and induced odd paths. Discrete Math., 90(1):85–92, 1991.

8. G. Blin, R. Rizzi, and S. Vialette. General framework for minimal conflicting set. Technical report, Université Paris Est, I.G.M., Jan 2010.

9. K.S. Booth and G.S. Lueker. Testing for the consecutive ones property, interval graphs, and graph planarity using pq-tree algorithms. J. Comput. System Sci., 13:335–379, 1976.

10. C. Chauve, J. Manuch, and M. Patterson. On the gapped consecutive ones property. In Proc. 5th European conference on Combinatorics, Graph Theory and Applications (EuroComb), Bordeaux, France, volume 34 of Electronic Notes on Discrete Mathematics, pages 121–125, 2009.

11. C. Chauve and E. Tannier. A methodological framework for the reconstruction of contiguous regions of ancestral genomes and its application to mammalian genome. PLoS Comput. Biol., 4:paper e1000234, 2008.

12. T. Christof, M. Junger, J. Kececioglu, P. Mutzel, and G. Reinelt. A branch-and-cut approach to physical mapping of chromosome by unique end-probes. J. Comput. Biol., 4:433–447, 1997.

13. T. Christof, M. Oswald, and G. Reinelt. Consecutive ones and a betweenness problem in computational biology. In Springer, editor, 6th International Conference on Integer Programming and Combinatorial Optimization IPCO ’98, volume 1412 of Lecture Notes in Comput. Sci., pages 213–228, 1998.

14. Michele Conforti and M. R. Rao. Structural properties and decomposition of linear balanced matrices. Mathematical Programming, 55:129–168, 1992.

15. M. Dom. Algorithmic aspects of the consecutive-ones property. Bull. Eur. Assoc. Theor. Comput. Sci. EATCS, 98:27–59, 2009.

16. M. Dom, J. Guo, and R. Niedermeier. Approximation and fixed-parameter algorithms for consecutive ones submatrix problems. Journal of Computer and System Sciences, 76(3-4), 2010.

17. D.R. Fulkerson and O.A. Gross. Incidence matrices and interval graphs. Pacific J. Math., 15(3):835–855, 1965.

18. M. Habib, R.M. McConnell, C. Paul, and L. Viennot. Lex-bfs and partition refinement, with applications to transitive orientation, interval graph recognition and consecutive ones testing. Theoret. Comput. Sci., 234(1–2):59–84, 2000.


19. M. Hajiaghayi and Y. Ganjali. A note on the consecutive ones submatrix problem. Information Processing Letters, 83(3):163–166, 2002.

20. Refael Hassin and Nimrod Megiddo. Approximation algorithms for hitting objects with straight lines. Discrete Applied Mathematics, 30(1):29–42, 1991.

21. Dorit S. Hochbaum and Asaf Levin. Cyclical scheduling and multi-shift scheduling: Complexity and approximation algorithms. Discrete Optimization, 3(4):327–340, 2006.

22. W.-L. Hsu. A simple test for the consecutive ones property. J. Algorithms, 43(1):1–16, 2002.

23. W.-L. Hsu and R.M. McConnell. Pc trees and circular-ones arrangements. Theoret. Comput. Sci., 296(1):99–116, 2003.

24. N. Korte and R.H. Möhring. An incremental linear-time algorithm for recognizing interval graphs. SIAM J. Comput., 18(1):68–81, 1989.

25. Lawrence T. Kou. Polynomial complete consecutive information retrieval problems. SIAM J. Comput., 6(1):67–75, 1977.

26. W.-F. Lu and W.-L. Hsu. A test for the consecutive ones property on noisy data – application to physical mapping and sequence assembly. J. Comput. Biol., 10:709–735, 2003.

27. R.M. McConnell. A certifying algorithm for the consecutive-ones property. In ACM Press, editor, 15th Annual ACM-SIAM Symposium on Discrete Algorithms SODA ’04, pages 768–777, 2004.

28. S. Mecke, A. Schöbel, and D. Wagner. Station location complexity and approximation. In 5th Workshop on Algorithmic Methods and Models for Optimization of Railways ATMOS ’05, Dagstuhl, Germany, 2005.

29. S. Mecke and D. Wagner. Solving geometric covering problems by data reduction. In Springer, editor, 12th Annual European Symposium on Algorithms ESA ’04, volume 3221 of Lecture Notes in Comput. Sci., pages 760–771, 2004.

30. J. Meidanis, O. Porto, and G.P. Telles. On the consecutive ones property. Discrete Appl. Math., 88:325–354, 1998.

31. M. Oswald and G. Reinelt. The simultaneous consecutive ones problem. Theoret. Comput. Sci., 410(21–23):1986–1992, 2009.

32. Nikolaus Ruf and Anita Schöbel. Set covering with almost consecutive ones property. Discrete Optimization, 1(2):215–228, 2004.

33. J. Tan and L. Zhang. The consecutive ones submatrix problem for sparse matrices. Algorithmica, 48(3):287–299, 2007.

34. A.C. Tucker. A structure theorem for the consecutive 1s property. Journal of Combinatorial Theory. Series B, 12:153–162, 1972.

35. A.F. Veinott and H.M. Wagner. Optimal capacity scheduling. Oper Res, 10:518–547, 1962.


Querying Graphs in Protein-Protein Interactions Networks using Feedback Vertex Set

Guillaume Blin, Florian Sikora, Stephane Vialette

Abstract

Recent techniques rapidly increase the amount of knowledge on interactions between proteins. The interpretation of this new information depends on our ability to retrieve known sub-structures in the data, the Protein-Protein Interactions (PPI) networks. From an algorithmic point of view, this is a hard task since it often leads to NP-hard problems. To overcome this difficulty, many authors have provided tools for querying patterns with a restricted topology, i.e. paths or trees in PPI networks. Such restrictions lead to the development of fixed parameter tractable (FPT) algorithms, which can be practicable for restricted sizes of queries. Unfortunately, GRAPH HOMOMORPHISM is a W[1]-hard problem, and hence, no FPT algorithm can be found when patterns are in the shape of general graphs. However, Dost et al. [2] gave an algorithm (which is not implemented) to query graphs with a bounded treewidth in PPI networks (the treewidth of the query being involved in the time complexity). In this paper, we propose another algorithm for querying patterns in the shape of graphs, also based on dynamic programming and the color-coding technique. To transform graph queries into trees without loss of information, we use feedback vertex set coupled to a node duplication mechanism. Hence, our algorithm is FPT for querying graphs with a bounded size of their feedback vertex set. It gives an alternative to the treewidth parameter, which can be better or worse for a given query. We provide a python implementation which allows us to validate our implementation on real data. In particular, we retrieve some human queries in the shape of graphs into the fly PPI network.

Index Terms

Graph Query, Pattern-Matching, Dynamic Programming, Protein-Protein Interactions networks.

I. INTRODUCTION

Contrary to what was predicted years ago, the human genome project has highlighted that human complexity may not only rely on its genes (only 25 000 for human compared to the 30 000 and 45 000 for the mouse and the poplar respectively). This observation increased the interest in protein properties (e.g. their numbers, functions, complexity and interactions). Among other protein properties, the set of all their interactions for an organism, called Protein-Protein Interactions (PPI) networks, has recently attracted a lot of interest. The number of reported interactions increases rapidly due to the use of various genome-scale screening techniques [3], [4], [5]. Unfortunately, acquiring such valuable resources is prone to a high noise rate [3], [6].

Comparative analysis of PPI tries to determine the extent to which protein networks are conserved among species. Indeed, it was observed that proteins functioning together in a pathway (i.e., a path in the interactions graph) or a structural complex (i.e., an assembling of strongly connected proteins) are likely to evolve in a correlated fashion and during evolution, all such functionally linked proteins tend to be either preserved or eliminated in a new species [7].

In this article, we focus on the following related problem called GRAPH QUERY (formally defined later). Given a PPI network and a pattern with a graph topology, find a subnetwork of the PPI network that is as similar as possible to the pattern, with respect to the initial topology. Similarity is measured both in terms of sequence similarity and graph topology conservation.

Université Paris-Est, LIGM - UMR CNRS 8049, France. E-mail: {gblin, sikora, vialette}@univ-mlv.fr

An extended abstract of this work appeared in Proceedings of the 5th International Symposium on Bioinformatics Research and Applications (ISBRA’09) [1].

hal-00619763, version 1 - 6 Oct 2011

Author manuscript, published in "IEEE/ACM Transactions on Computational Biology and Bioinformatics 7, 4 (2010) 628-635"


Unfortunately, this problem is clearly equivalent to the NP-complete subgraph homeomorphism problem [8]. Recently, several techniques have been proposed to overcome the difficulty of this problem. By restricting the query to a path of length less than five, Kelley et al. [9] developed PathBlast, a software with a factorial time complexity which allows one consecutive mismatch. Later on, Shlomi et al. [10] proposed an alternative, called QPath, for querying paths in a PPI network which is based on the color-coding technique introduced by Alon, Yuster and Zwick [11]. The use of this technique makes it possible to define a fixed-parameter tractable (FPT) algorithm parameterized by the size of the query. Recall that a parameterized problem is FPT if it can be solved in $f(k)\,n^{O(1)}$ time, where f is a function only depending on the parameter k, and n is the size of the input [12]. In addition to being faster, QPath deals with longer paths (up to size ten) and allows more flexibility by considering a bounded number of non-exact matches.

By restricting the query to a tree, Pinter et al. [13] proposed an algorithm that is restricted to forest PPI networks (i.e., collections of trees). Finally, Dost et al. [2] developed QNet, an algorithm to handle tree queries in the general context of PPI networks. The authors also gave some theoretical results for querying graphs using the tree decomposition of the query.

Since QNet is the main reference in this field and is quite related to the work presented in this paper, let us present it briefly. QNet is an FPT algorithm for querying trees in a PPI network. The time complexity is $2^{O(k)}\, m \ln(1/\epsilon)$, where k is the number of proteins in the query, m the number of edges of the PPI network and $1 - \epsilon$ the success probability (for any $\epsilon > 0$). Like QPath, QNet uses dynamic programming together with the color-coding technique. For querying graphs in a network, QNet uses, as a subroutine, an algorithm to query trees. To do so, the authors perform a tree decomposition (a formal definition of a tree decomposition can be found in [14]). Roughly speaking, it is a transformation of a graph into a tree, where a tree node (or a bag) can contain several graph nodes. There exist several algorithms to perform such a transformation. The treewidth of a graph is the minimum (among all decompositions) of the cardinality of the largest bag minus one. Computing the treewidth is, however, NP-hard [15]. Using this tree decomposition, QNet runs in $2^{O(k)}\, n^{t+1} \ln(1/\epsilon)$ time, where k is the size of the query, n is the size of the PPI network, t is the treewidth of the query, and $1 - \epsilon$ is the success probability (for any $\epsilon > 0$).

QNet is an algorithm for querying trees in a PPI network. A logical extension would be to query graphs. The authors of [2] provide a theoretical solution, without implementation, which depends on the treewidth of the query. We propose here an alternative solution that uses the color-coding technique (Section II). We provide in Section III some experimental results.

II. PADA1 AS AN ALTERNATIVE TO QNET

In this section, we propose an alternative to QNet called PADA1 (Protein Alignment Dealing with grAphs). At a more general level, QNet and PADA1 use the very same approach: transform the query into a tree and find an occurrence of that tree in the PPI network by dynamic programming. However, whereas QNet uses tree decompositions, PADA1 combines feedback vertex sets together with node duplications (Algorithm GRAPH2TREE). Note that, independently, Cheng et al. use a similar technique to transform a graph into a tree in [16] in order to query a graph in a network. However, unlike our approach, in [16] the authors do not use node duplication and hence, it is not clear that they can ensure that all the edges of the query are kept in the results (especially the ones linking the nodes belonging to the feedback vertex set). It is worth mentioning that, following the example of QPath and QNet, we will consider non-exact matches (i.e., allowing indels). Since we allow queries to be graphs, PADA1 is clearly an extension of QPath and a real alternative to QNet.

A. Transforming the query into a tree

Let us first present Algorithm GRAPH2TREE which transforms a graph G = (V, E) into a tree. Our transformation is lossless, and hence, one can reconstruct the graph starting from the tree. The main idea of Algorithm GRAPH2TREE is to transform the graph into a tree by iteratively finding a cycle C, duplicating a node of C, and finally breaking cycle C by deleting one of its edges (see Figure 1 for an


illustration of how to break a cycle at vertex $v_1$). Central in our approach is thus the node duplication procedure (Algorithm DUPLICATE). For each u ∈ V, write d(u) for the set of all copies of vertex u including itself and N(u) for the set of all its neighbors.

Function GRAPH2TREE(G)
begin
    for all u of V, d(u) ← {u};
    FVS ← FEEDBACKVERTEXSET(G);
    while ∃ a cycle C in G do
        Let u be one vertex randomly chosen in C ∩ FVS;
        Let v be one vertex randomly chosen in C ∩ N(u);
        DUPLICATE(G, v, u, d);
    end
end

Algorithm 1: GRAPH2TREE algorithm

Let F denote the set of all nodes of G that have been duplicated at the end of Algorithm GRAPH2TREE, i.e., F = {v ∈ V : |d(v)| > 1}. The cardinality of F turns out to be an important parameter since, as we will prove soon, the overall time complexity of PADA1 mostly depends on |F| and not on the total number of duplications. Minimizing the cardinality of F is the well-known NP-complete FEEDBACK VERTEX SET problem [17]: Given a graph G, find a minimum cardinality subset of vertices with the property that removing these vertices from the graph results in an acyclic graph.

We have implemented a “brute-force” algorithm for the FEEDBACK VERTEX SET problem. Once this set is computed, we duplicate a node of a cycle as long as there is a cycle in the graph. By definition of the feedback vertex set, every cycle contains at least one node belonging to the feedback vertex set. The time complexity of Algorithm GRAPH2TREE is dominated by the process of computing the feedback vertex set, which is in $O(2^{|V|} \times |E|)$ since there are $2^{|V|}$ potential subgraphs. Nevertheless this solution is practical since it still runs in seconds if |V| is smaller than twenty. Indeed, the overall complexity of PADA1 considerably limits the size of our graph query. However, one may also consider an efficient FPT algorithm such as the one of Guo et al. [18], using iterative compression, in addition to the quadratic kernelization of Thomassé [19] in order to compute the feedback vertex set efficiently.

B. Tree matching

We now assume that the query has been transformed into a tree (with duplicated nodes) by Algorithm GRAPH2TREE, and hence we only consider tree queries from this point. We show that an occurrence of such a tree can be found in a PPI network by dynamic programming.

Function FEEDBACKVERTEXSET(G = (V, E))
begin
    for (i = 0; i < |V|; i++) do
        foreach subgraph G' = (V', E') of G such that |V'| = |V| − i do
            if G' is acyclic then
                return V \ V';
            end
        end
    end
end

Algorithm 2: Compute the Feedback Vertex Set of a graph.
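For illustration, the brute-force search of Algorithm 2 could be written in Python roughly as follows. This is a minimal sketch, not the authors' implementation: the function names, the edge-list representation (each undirected edge listed exactly once) and the union-find acyclicity test are assumptions made for the example.

```python
from itertools import combinations

def is_acyclic(vertices, edges):
    """Union-find test: True if the subgraph induced by `vertices` is a forest.
    Assumes each undirected edge appears exactly once in `edges`."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in edges:
        if a in parent and b in parent:    # keep only induced edges
            ra, rb = find(a), find(b)
            if ra == rb:                   # edge closes a cycle
                return False
            parent[ra] = rb
    return True

def feedback_vertex_set(vertices, edges):
    """Try removal sets of increasing size i = 0, 1, ..., so the first
    acyclic remainder yields a minimum feedback vertex set."""
    vertices = list(vertices)
    for size in range(len(vertices) + 1):
        for removed in combinations(vertices, size):
            kept = set(vertices) - set(removed)
            if is_acyclic(kept, edges):
                return set(removed)

# Two triangles sharing vertex 3: removing 3 alone breaks both cycles.
example_edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (3, 5)]
example_fvs = feedback_vertex_set([1, 2, 3, 4, 5], example_edges)
```

As in the pseudocode, the search is exponential in |V| and is only meant for small queries.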


Function DUPLICATE(G = (V, E), $v_a$, $v_b$, d)
begin
    Let i ← |d($v_b$)|;
    Let $v_{b,i}$ be a new node;
    V ← V ∪ {$v_{b,i}$};
    d($v_b$) ← d($v_b$) ∪ {$v_{b,i}$};
    E ← E − {($v_a$, $v_b$)};
    E ← E ∪ {($v_a$, $v_{b,i}$)};
end

Algorithm 3: Algorithm to duplicate a node when a cycle is detected.

Figure 1. Steps when DUPLICATE(G, $v_3$, $v_1$, d) is called on graph a). b) A node $v_{1,1}$ from $v_1$ is created. c) The edge ($v_3$, $v_1$) is deleted: the cycle is then broken. d) The edge ($v_3$, $v_{1,1}$) is added. Finally, the resulting graph is acyclic, and d($v_1$) = {$v_1$, $v_{1,1}$}.
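The interplay between GRAPH2TREE and DUPLICATE can be sketched in Python as below. This is a hedged illustration rather than the published implementation: the adjacency-set representation, the helper `find_cycle`, the naming of a copy of `vb` as the tuple `(vb, i)`, and the deterministic (instead of random) choice of u and v are all assumptions of the example. A feedback vertex set is taken as input.

```python
def find_cycle(adj):
    """Return the vertices of some cycle in order, or None if `adj` is a forest.
    Recursive DFS; adequate for small query graphs."""
    parent = {}

    def dfs(node, prev):
        parent[node] = prev
        for nxt in adj[node]:
            if nxt == prev:
                continue
            if nxt in parent:              # back edge to an ancestor
                cycle = [node]
                while cycle[-1] != nxt:
                    cycle.append(parent[cycle[-1]])
                return cycle
            found = dfs(nxt, node)
            if found:
                return found
        return None

    for start in adj:
        if start not in parent:
            found = dfs(start, None)
            if found:
                return found
    return None

def duplicate(adj, d, va, vb):
    """Replace edge (va, vb) by an edge between va and a fresh copy of vb."""
    copy = (vb, len(d[vb]))                # plays the role of v_{b,i}, i = |d(vb)|
    adj[copy] = {va}
    d[vb].append(copy)
    adj[va].discard(vb)
    adj[vb].discard(va)
    adj[va].add(copy)

def graph2tree(adj, fvs):
    """Break every cycle at a feedback vertex; returns (tree, copies)."""
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    d = {v: [v] for v in adj}
    while True:
        cycle = find_cycle(adj)
        if cycle is None:
            return adj, d
        i = next(k for k, v in enumerate(cycle) if v in fvs)
        u, v = cycle[i], cycle[(i + 1) % len(cycle)]  # v is a cycle neighbor of u
        duplicate(adj, d, v, u)

# Triangle 1-2-3 with a pendant vertex 4, broken at feedback vertex 1.
query = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
tree, copies = graph2tree(query, fvs={1})
```

Each duplication removes one edge and adds one, so the number of edges is preserved, which is what makes the transformation lossless.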

Let us fix notations. PPI networks are represented by undirected edge-weighted graphs $G_N = (V_N, E_N, w)$; each node of $V_N$ represents a protein and each weighted edge $(v_i, v_j) \in E_N$ represents an interaction between two proteins. A query is given by a tree $T_Q = (V_Q, E_Q)$ (output of Algorithm GRAPH2TREE on the graph query). The set $V_Q$ represents proteins while $E_Q$ represents interactions between these proteins. As in QNet, we do not give weights to the query. Indeed, in PPI networks, the weights represent probabilities of the interactions. There is no clue to use those probabilities in the query. In the following, we will consider that $T_Q$ is an ordered tree where $q_1, \dots, q_{n_q}$ denote the ordered $n_q$ children of any node q. As we will show afterwards, the result does not depend on this ordering.

Let $h(p_1, p_2)$ be a function that returns a similarity score between two proteins $p_1$ and $p_2$. The similarity considered here will be computed according to amino-acid sequence similarity (using BLASTp [20]). In the following, given two nodes $v_1$ and $v_2$ of $V_Q$ (or $V_N$), we write $h(v_1, v_2)$ for the similarity between the two proteins corresponding to $v_1$ and $v_2$. A node $v_1$ is considered to be homologous to a node $v_2$ if the corresponding similarity score $h(v_1, v_2)$ is above a given threshold. Biologically, one can assume that two homologous proteins probably have similar functions. Clearly, for every node v of F, all nodes in d(v) are homologous with the same protein.

An alignment of the query $T_Q$ and $G_N$ is defined as: (i) a subgraph $G_A = (V_A, E_A, w) \subseteq G_N = (V_N, E_N, w)$, such that $V_A \subseteq V_N$ and $E_A \subseteq E_N$, and (ii) a mapping $\sigma : V_Q \to V_A \cup \{del\}$. More precisely, the function σ is defined such that if σ(q) = v then q and v are homologous.

For a given alignment of $T_Q$ and $G_N$, a node q of $V_Q$ is said to be deleted if σ(q) = del and matched otherwise. Moreover, any node $v_a$ of $V_A$ such that $\sigma^{-1}(v_a)$ is undefined is said to be inserted. Note that, similarly to QNet, only nodes of degree two can be deleted (we can only contract paths). For practical applications, the number of insertions (resp. deletions) is limited to be at most $N_{ins}$ (resp. $N_{del}$), each involving a penalty score $\delta_i$ (resp. $\delta_d$).

The GRAPH QUERY problem can thus be defined as follows: Given a query $T_Q$ with duplicated nodes, a PPI network $G_N$, a similarity function h, penalty scores $\delta_i$ and $\delta_d$ for insertions and deletions, find an


Figure 2. a) The graph query with a cycle, before calling the GRAPH2TREE algorithm. c) The query after calling GRAPH2TREE, where $q_1$ has been duplicated. Thus, $q_1$ and $q_{1,1}$ have to be aligned with the same node of the network. b) and d) denote the resulting graph alignment $G_A$, subgraph of the network $G_N$. The horizontal dashed lines denote a match between two proteins.

alignment $(G_A, \sigma)$ between $T_Q$ and $G_N$ of maximum score. The score of an alignment is defined as the sum of (i) similarity scores of aligned nodes (i.e., $\sum_{v \in V_A,\ \sigma^{-1}(v)\ \text{defined}} h(v, \sigma^{-1}(v))$), (ii) the sum of weights of edges in $E_A$ (i.e., $\sum_{e \in E_A} w(e)$), (iii) a penalty score $\delta_d$ for each node deletion (i.e., $\sum_{q \in V_Q,\ \sigma(q) = del} \delta_d$), and (iv) a penalty score $\delta_i$ for each node insertion (i.e., $\sum_{v \in V_A,\ \sigma^{-1}(v)\ \text{undefined}} \delta_i$).
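The four terms of the alignment score can be made concrete with a small Python sketch. Everything below is hypothetical scaffolding for illustration: the dictionary-based `sigma`, `h` and `w`, the `"del"` marker, and the convention that penalties are passed as signed (negative) values and simply added, as in the formula.

```python
DEL = "del"

def alignment_score(sigma, nodes_a, edges_a, h, w, delta_i, delta_d):
    """Score of an alignment (G_A, sigma), following terms (i)-(iv).

    sigma   : dict, query node -> network node or DEL
    nodes_a : iterable of network nodes in V_A
    edges_a : iterable of network edges in E_A
    h, w    : dicts holding similarity scores and edge weights
    """
    # Inverse mapping: matched network node -> one of its query nodes.
    inverse = {v: q for q, v in sigma.items() if v != DEL}
    similarity = sum(h[(q, v)] for v, q in inverse.items())             # (i)
    interaction = sum(w[e] for e in edges_a)                            # (ii)
    deletions = delta_d * sum(1 for v in sigma.values() if v == DEL)    # (iii)
    insertions = delta_i * sum(1 for v in nodes_a if v not in inverse)  # (iv)
    return similarity + interaction + deletions + insertions

# Toy alignment: q1 -> a, q2 deleted, q3 -> c, network node b inserted.
toy_score = alignment_score(
    sigma={"q1": "a", "q2": DEL, "q3": "c"},
    nodes_a=["a", "b", "c"],
    edges_a=[("a", "b"), ("b", "c")],
    h={("q1", "a"): 2.0, ("q3", "c"): 1.5},
    w={("a", "b"): 0.5, ("b", "c"): 0.25},
    delta_i=-1.0,
    delta_d=-2.0,
)
```

Summing term (i) over matched network nodes, rather than over query nodes, counts the similarity of a duplicated query node only once.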

The general problem is NP-complete. However, it is fixed-parameter tractable when the query is a tree, by a combination of the color-coding technique [11] and dynamic programming. This randomized technique makes it possible to find simple paths of length k in a network in $O(2^k)$ time (to be compared to the $O(n^k)$ time brute-force algorithm), where n is the number of proteins in the network [21]. In [2], the authors of QNet adapted this technique for their query algorithm. Since one is looking for an alignment, each node of the query has to be considered once (and only once) in an incremental build of the alignment by dynamic programming. Thus, one has to maintain a list of the nodes already considered in the query. Therefore, on the whole, one has to consider all $O(n^k)$ potential alignments, with $n = |V_N|$ and $k = |V_Q|$.

Using color-coding, one may decrease this complexity to $O(2^k)$. First, nodes of the network are colored randomly using k colors, where $k = |V_Q|$. Then, looking for a colorful alignment (i.e., an alignment that contains each color once) leads to a solution, which is not necessarily optimal. Therefore, one only needs to maintain a list of the colors already used in the alignment, storable in a table of size in $O(2^k)$. In order to get an optimal solution, this process is repeated. More precisely, according to QNet [2], since a colorful alignment happens with probability $k!/k^k \simeq e^{-k}$, the coloration step has to be done $\ln(1/\epsilon)\,e^k$ times to obtain an optimal alignment with high probability ($1 - \epsilon$, for any $\epsilon > 0$).

The QNet dynamic programming algorithm can be summarized as follows. By an incremental construction, for each $(q_i, q_j) \in E_Q$, when one is considering $q_i \in V_Q$ aligned with a node $v_i \in V_N$, check whether the score of the alignment is improved through: (i) a match of $q_j$ and any $v_j$ of $V_N$ such that $q_j$ and $v_j$ are homologous and $(v_i, v_j) \in E_N$, (ii) an insertion of a node $v_j$ of $V_N$ in the alignment graph $G_A$, and (iii) a deletion of $q_j$. This is done for a given coloration of the network, and repeated for each coloration.
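The number of colorations required by the color-coding argument can be evaluated numerically. The short Python sketch below is an illustration only; the function names are hypothetical, and it relies on the standard bound $k!/k^k \ge e^{-k}$ for the probability that a fixed set of k nodes receives k distinct colors.

```python
import math

def colorful_probability(k):
    """Probability that k fixed nodes get pairwise distinct colors
    when each is colored uniformly at random with one of k colors."""
    return math.factorial(k) / k**k

def repetitions(k, epsilon):
    """Number of random colorations ln(1/epsilon) * e^k, rounded up."""
    return math.ceil(math.log(1.0 / epsilon) * math.exp(k))

def failure_probability(k, epsilon):
    """Chance that none of the colorations makes the optimum colorful."""
    return (1.0 - colorful_probability(k)) ** repetitions(k, epsilon)

# For a query of size 5 and a 1% failure budget.
trials = repetitions(5, 0.01)
```

The exponential growth of `repetitions` in k is one reason why the approach is restricted to small queries.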

Hereafter, we define an algorithm, inspired by QNet, which considers a query tree $T_Q$ and a PPI network $G_N$ and seeks an alignment $(G_A, \sigma)$. It is worth noticing that for a given coloration, our algorithm, like QNet, is exact. To deal with duplicated nodes (cf. GRAPH2TREE algorithm), we pre-compute all possible assignments, called A, of the duplicated nodes of $T_Q$. More precisely, $\forall q \in F, \forall v \in V_N$, define


Function PADA1($T_Q$, $G_N$, h, threshold)
begin
    BestGA ← ∅; BestScore ← −∞;
    for (i = 0; i < $\ln(1/\epsilon)\,e^k$; i++) do
        randomly colorize $G_N$ with $k + N_{ins}$ colors;
        foreach valid assignment A do
            ScoreGA ← BESTCONSTRAINTALIGNMENT($G_N$, $T_Q$, A, h, threshold) + score(A);
            if ScoreGA > BestScore then
                Save coloration, A;
                BestScore ← ScoreGA;
            end
        end
    end
    Load coloration, A;
    return Backtracking();
end

Algorithm 4: Sketch of the PADA1 algorithm to align a query graph to a network.

$\sigma(q') = v\ \ \forall q' \in d(q)$. We then compute for each assignment A the score of an alignment with respect to A. We denote this step by BESTCONSTRAINTALIGNMENT. The difficulty lies in the construction of the best alignment by dynamic programming, with respect to A.

As done in QNet, we use a set $S_C$ of $k + N_{ins}$ colors (as needed by color-coding) which will be used when a node is matched or inserted. Moreover, in order to deal with potential duplicated nodes in $T_Q$, we have to use another multiset S of colors (i.e., the colors in this set can appear more than once), rather than a classical set as in QNet. Indeed, every node in d(q) such that q ∈ F must use the same color.

As a preprocessing step for PADA1, $G_N$ can be pruned, as shown in [22]. Let $u \in G_N$ be a protein which is not homologous with any protein of the query and $v \in G_N$ be a protein which is homologous with a protein of the query. Then, u can be too far from any v in terms of shortest path length to be inserted in the solution, given the maximum number of insertions (i.e., $N_{ins}$). According to this remark, u is kept in $G_N$ only if there are two proteins $v_1$ and $v_2$, both homologous with a protein of the query, such that $dist(u, v_1) + dist(u, v_2) \le N_{ins} + 1$, where dist(u, v) is the length of the shortest path between u and v. Otherwise, u can never be in a solution, and hence can be safely deleted from $G_N$.
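The pruning rule can be sketched with breadth-first searches started from the homologous proteins. The code below is a minimal illustration under assumptions that are not in the source: the adjacency-set representation, the function names, and the reading that $v_1$ and $v_2$ are two distinct homologous proteins.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def prune_network(adj, homologous, n_ins):
    """Keep homologous nodes, plus any node u whose two closest (distinct)
    homologous nodes v1, v2 satisfy dist(u, v1) + dist(u, v2) <= n_ins + 1."""
    two_best = {u: [] for u in adj}        # two smallest distances per node
    for v in homologous:
        for u, dval in bfs_distances(adj, v).items():
            two_best[u] = sorted(two_best[u] + [dval])[:2]
    keep = set(homologous)
    for u in adj:
        if u not in keep and len(two_best[u]) == 2 and sum(two_best[u]) <= n_ins + 1:
            keep.add(u)
    # Induced subnetwork on the kept nodes.
    return {u: {n for n in adj[u] if n in keep} for u in adj if u in keep}

# Path h1 - a - b - h2 - c - d, where h1 and h2 are homologous to query proteins.
network = {
    "h1": {"a"}, "a": {"h1", "b"}, "b": {"a", "h2"},
    "h2": {"b", "c"}, "c": {"h2", "d"}, "d": {"c"},
}
pruned = prune_network(network, {"h1", "h2"}, n_ins=2)
```

With `n_ins=2`, nodes a and b survive because they lie on a short path between two homologous proteins, while c and d are discarded.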

Once $G_N$ has been pruned, PADA1 can be launched for each valid connected component of $G_N$. A component is said to be valid if it contains at least $k - N_{del}$ proteins which are homologous with a protein of the query, where k is the size of the query. Otherwise, a solution can never be found in this component, and hence there is no need to consider it. As stated in [22], in practice only 5% of the network proteins are on average homologous with a query protein.

Algorithm 4 may be summarized as follows. Perform $\ln(1/\epsilon)\,e^k$ random colorations of the PPI network $G_N$ to ensure optimality with a probability of at least $1 - \epsilon$. The coloration consists in defining a function $c : V_N \to S_C$, where the color in $S_C$ is randomly chosen. Then, for each coloration, we build all possible valid assignments A of the duplicated nodes. An assignment A is valid if no two non-homologous nodes are matched in A. For each such assignment A, we compute the best score of an alignment according to A with Algorithm BESTCONSTRAINTALIGNMENT. We keep the best score of these trials and obtain the corresponding alignment by a classic backtracking technique. The score of the assignment of the duplicated nodes is computed separately. Indeed, in order not to take into account the homology score of q more than once (i.e. |d(q)| times) in the overall score (namely ScoreGA), we precompute the part of the score induced by the duplicated nodes (namely score(A)):


$$
score(A) = \sum_{\substack{q \in F \\ A(q) \neq del}} h(q, A(q)) \;+\; \sum_{\substack{q \in F \\ A(q) = del}} \delta_d
$$
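As a small illustration of this formula, score(A) can be computed by a single pass over the duplicated nodes. The sketch below assumes a hypothetical dictionary representation of the assignment A (keys are the nodes of F) and of the similarity function h, with the deletion penalty given as a signed value.

```python
def assignment_score(assignment, h, delta_d):
    """score(A): similarity of each matched duplicated node, counted once,
    plus the deletion penalty for each duplicated node assigned to 'del'."""
    total = 0.0
    for q, target in assignment.items():
        if target == "del":
            total += delta_d           # second sum of the formula
        else:
            total += h[(q, target)]    # first sum of the formula
    return total

# q1 is matched with network node a, q2 is deleted.
example_score = assignment_score(
    {"q1": "a", "q2": "del"},
    h={("q1", "a"): 3.0},
    delta_d=-2.0,
)
```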

Let us now describe in detail the BESTCONSTRAINTALIGNMENT step that returns the score of $G_A$ according to the precomputed assignment A.

$$
ScoreGA \leftarrow \max_{v \in V_N} W^M(root, v, S, 1, A) + score(A)
$$

The best alignment score is obtained by finding among all possibilities the best way to align the root of the query to any protein v of the network. Similarly to QNet, the root is selected arbitrarily, but it is always a node of degree one. Moreover, the score is computed only if the root and v are homologous. In this initial step, S represents the multiset of colors defined previously. As in QNet, for each query node q, we denote by $q_1, q_2, \dots, q_{n_q}$ its $n_q$ children. To obtain the best alignment, we use three tables, namely $W^M$, $W^I$ and $W^D$, which are filled as follows.

If $|S| \le 1$,

$$
W^M(q, v, S, j, A) \leftarrow -\infty
$$

Else,

$$
W^M(q, v, S, j, A) \leftarrow \max_{\substack{u : (u,v) \in E_N \\ S' \subset S \\ c(v) \in S' \\ c(u) \in S - S'}} W^M(q, v, S', j-1, A) +
\begin{cases}
W^M(q_j, u, S - S', n_{q_j}, A) + w(u, v) & \text{(matching, child } j\text{)} \\
W^I(q_j, u, S - S', A) + w(u, v) & \text{(insertion, node } u\text{)} \\
W^D(q_j, v, S - S', A) & \text{(deletion, child } j\text{)}
\end{cases}
$$

In the computation of $W^M(q, v, S, j, A)$, we consider that $q$ is already aligned with $v$. Thus, the value stored in $W^M(q, v, S, j, A)$ is the maximum score of the subtree rooted at $q$, considering only its first $j$ children. This score is the sum of the score of the subtree rooted at $q$ considering only its first $j - 1$ children (i.e., $W^M(q, v, S', j - 1, A)$) and the score of the best alignment of the $j$th child of $q$, denoted $q_j$. Indeed, when considering $q_j$, one can either (1) match it with a neighbor $u$ of $v$, (2) delete it, or (3) insert a neighbor $u$ of $v$ in the alignment. In the dynamic programming equation, $n_{q_j}$ denotes the number of children of $q_j$.

To obtain the optimal solution, each subset of the multiset $S$ of colors has to be considered. In other words, we consider each subset of colors used for the first $j - 1$ subtrees of $q$ (the subset $S'$), and therefore each subset of colors used for the $j$th subtree (the subset $S - S'$). If $S$ is a singleton, the score is $-\infty$, since at least $j + 1$ nodes (i.e., $q$ and the nodes of the $j$ subtrees rooted at $q_1, q_2, \ldots, q_j$) must be added to the solution with only one color, while the deletion of nodes of degree one is forbidden.
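The enumeration of color splits $(S', S - S')$ used by the recurrence can be sketched as below. This is a minimal illustration that assumes $S$ is represented as a plain set of distinct colors (the text speaks of a multiset) and that the function name `color_splits` is hypothetical.

```python
from itertools import combinations

def color_splits(S, c_v):
    """Yield every pair (S1, S2) where S1 is a proper subset of S that
    contains c_v (colors for q and its first j-1 subtrees) and
    S2 = S - S1 is non-empty (colors left for the j-th subtree)."""
    S = frozenset(S)
    if c_v not in S:
        raise ValueError("the color of v must belong to S")
    rest = sorted(S - {c_v})
    # Taking fewer than len(rest) extra colors keeps S2 non-empty.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            S1 = frozenset((c_v,) + extra)
            yield S1, S - S1

for S1, S2 in color_splits({1, 2, 3}, c_v=1):
    print(sorted(S1), sorted(S2))
```

With $|S|$ colors there are $2^{|S|-1} - 1$ such splits, and none when $S$ is a singleton, which matches the $-\infty$ case of the recurrence.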

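$$
W^M(q, v, S, 0, A) \leftarrow
\begin{cases}
h(q, v) & \text{if } |d(q)| = 1, \\
-\infty & \text{if } A(q) \neq v, \\
0 & \text{otherwise (i.e., } |d(q)| > 1 \text{ and } A(q) = v\text{).}
\end{cases}
$$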

Entries $W^M(q, v, S, 0, A)$ are the base cases of the recursion. When $q$ is not a duplicated node (i.e., $|d(q)| = 1$), the value is simply the similarity score with $v$ given by $h$. Otherwise, the assignment of $q$ has already been defined in $A$ and taken into account in $score(A)$, and it has to be preserved: the value is 0 if $A(q) = v$, and $-\infty$ otherwise, to forbid such an alignment.
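The base case translates almost literally into code. The sketch below assumes hypothetical dictionary representations for $d$, $h$ and $A$; the cases are tested in the same order as in the definition.

```python
import math

def wm_base(q, v, h, d, A):
    """Base case W^M(q, v, S, 0, A); the color set S plays no role here.
    d[q] is the family of duplicates of q, A maps duplicated nodes to
    their preassigned network node."""
    if len(d[q]) == 1:        # q is not duplicated: plain homology score
        return h[(q, v)]
    if A[q] != v:             # duplicated, but v contradicts the assignment
        return -math.inf
    return 0.0                # duplicated and consistent: counted in score(A)

d = {"a": ["a"], "b": ["b1", "b2"]}
h = {("a", "v"): 3.0}
A = {"b": "v"}
print(wm_base("a", "v", h, d, A), wm_base("b", "v", h, d, A), wm_base("b", "u", h, d, A))
```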


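$$
W^I(q, v, S, A) \leftarrow
\begin{cases}
\displaystyle \max_{\substack{u : (u,v) \in E_N \\ c(u) \in S - \{c(v)\}}}
\max \left\{
\begin{array}{l}
W^M(q, u, S - \{c(v)\}, n_q, A) + w(u, v) + \delta_i, \\
W^I(q, u, S - \{c(v)\}, A) + w(u, v) + \delta_i
\end{array}
\right\} & \text{if } A^{-1}(v) = \emptyset \text{ and } |S| > 1, \\
-\infty & \text{otherwise.}
\end{cases}
$$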

When computing $W^I(q, v, S, A)$, the node $v$ is considered to be inserted. Thus, the computation continues with a neighbor $u$ of $v$, performing either a match or another insertion.

In order to be coherent with $A$, if $v$ has to be aligned with a duplicated node (i.e., $A^{-1}(v) \neq \emptyset$), inserting $v$ is forbidden.

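$$
W^D(q, v, S, A) \leftarrow
\begin{cases}
-\infty & \text{if } degree(q) \neq 2, \\
\displaystyle \max_{u : (u,v) \in E_N}
\max \left\{
\begin{array}{l}
W^M(q_1, u, S, n_{q_1}, A) + w(u, v) + \delta_d, \\
W^I(q_1, u, S, A) + w(u, v) + \delta_d, \\
W^D(q_1, v, S, A) + \delta_d
\end{array}
\right\} & \text{else if } |d(q)| = 1, \\
\displaystyle \max_{u : (u,v) \in E_N}
\max \left\{
\begin{array}{l}
W^M(q_1, u, S, n_{q_1}, A) + w(u, v), \\
W^I(q_1, u, S, A) + w(u, v), \\
W^D(q_1, v, S, A)
\end{array}
\right\} & \text{else if } A(q) = del, \\
-\infty & \text{otherwise.}
\end{cases}
$$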

Finally, when computing $W^D(q, v, S, A)$, the node $v$ and the father of $q$ are considered as aligned, and the node $q$ as deleted. The alignment then continues from $q_1$, the only son of $q$, since deletion is only allowed for nodes of degree two. If $q$ is a duplicated node (i.e., $|d(q)| > 1$), the deletion is only allowed when $A(q) = del$, to respect the reference assignment $A$. Again, in this case, the cost $\delta_d$ of a deletion is already counted in $score(A)$.

Let us note that, since the dynamic programming considers all sets $S' \subset S$ for each $j$ such that $1 \le j \le n_q$, the result does not depend on the ordering of $T_Q$.

Let us now analyze the complexity of PADA1. The overall complexity depends essentially on lines 5 to 12. Let us consider the complexity of one iteration (there are $\ln(1/\epsilon)\,e^k$ iterations). The random coloration can be done in $O(n)$, where $n = |V_N|$. There are $n^{|F|}$ possible assignments in the worst case (i.e., if all the proteins in $F$ are homologous to the $n$ proteins of the network). The complexity of BESTCONSTRAINTALIGNMENT is $2^{O(k+N_{ins})} m N_{del}$, as in QNet, where $k$ is the size of the query graph and $m = |E_N|$, since our modifications are essentially additional tests that can be done in constant time.
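To give an order of magnitude for the number of iterations, the following sketch evaluates $\ln(1/\epsilon)\,e^k$ rounded up. The function name is hypothetical; the check in the last lines relies on the assumption, standard in color coding, that a single random coloration succeeds with probability at least $e^{-k}$.

```python
import math

def coloring_trials(k, eps):
    """Number of random colorations, ln(1/eps) * e^k rounded up,
    for a failure probability of at most eps."""
    return math.ceil(math.log(1.0 / eps) * math.exp(k))

k, eps = 5, 0.01
trials = coloring_trials(k, eps)
# If one trial succeeds with probability at least e^-k, the probability
# that all trials fail is at most (1 - e^-k)^trials <= eps.
failure_bound = (1.0 - math.exp(-k)) ** trials
print(trials, failure_bound <= eps)
```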

Let us note that the complexity of GRAPH2TREE is negligible compared to the overall complexity of Algorithm PADA1. Indeed, the complexity of Algorithm GRAPH2TREE only depends on the query size $k$, with $k \ll n$. Therefore, on the whole, PADA1 runs in $O(n^{|F|} 2^{O(k+N_{ins})} m N_{del} \ln(1/\epsilon))$ time for any desired success probability $1 - \epsilon$ (with $\epsilon > 0$). Observe that the time complexity does not depend on the total number of duplicated nodes (i.e., $\sum_{q \in F} |d(q)|$), but on the size of $F$.

III. EXPERIMENTAL RESULTS

[Figure 3: plot titled "Algorithms Comparison"; x-axis: Query Graph size (4 to 16); y-axis: Value (1.5 to 4.5); series: QNet (Treewidth+1) and PADA1 (|F|).]

Figure 3. Comparison between the QNet parameter (i.e., the treewidth+1 value) and the PADA1 parameter (i.e., the size of the feedback vertex set). In QNet (resp. in PADA1), the treewidth (resp. the feedback vertex set) is computed over the query. Here the size of the query graph corresponds to the number of nodes (there are usually between five and fifteen proteins in a classical query).

According to the authors of QNet, one may query a PPI network by running a $2^{O(k)} n^{t+1}$ time algorithm $\ln(1/\epsilon)\,e^k$ times, where $t$ is the treewidth of the query graph. Thus, the difference between the two algorithms is mainly related to the "$t + 1$ versus $|F|$" question (where $|F|$ is the size of the set of families of duplicated nodes computed by Algorithm GRAPH2TREE). Let us recall that the value $|F|$ given by our algorithm is equal to the feedback vertex set size of the query graph. According to Bodlaender and Koster [23], the treewidth of a graph is at most the size of a feedback vertex set plus one. However, this only gives an upper bound. We conducted experiments to compare these two parameters in practice on random graphs. The method is as follows: for each graph size (the number of nodes varies from 4 to 16, while the number of edges varies from $|V|$ to $2|V|$), we compute the average treewidth and feedback vertex set values over 30000 connected graphs, randomly constructed with the NetworkX library (http://networkx.lanl.gov/). Treewidth is computed with the exact algorithm provided by http://www.treewidth.com/, while the size of the feedback vertex set is computed with our GRAPH2TREE algorithm. Results in Figure 3 suggest that parameter $|F|$ is usually smaller for moderate-size graphs (i.e., those query graphs for which PADA1 and QNet are still practicable). In summary, PADA1 is an alternative to the QNet algorithm, with a different parameter involved in the time complexity. One can determine which parameter is more suitable for a given query and then use the appropriate algorithm.
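For queries of the sizes considered here, the parameter $|F|$ can be checked by exhaustive search. The sketch below is not the GRAPH2TREE algorithm: it is a brute-force illustration with hypothetical function names, and it assumes a simple graph (no self-loops or parallel edges) given as a node list and an edge list.

```python
from itertools import combinations

def is_forest(nodes, edges):
    """Union-find acyclicity test on the subgraph induced by `nodes`."""
    parent = {v: v for v in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u, v in edges:
        if u in parent and v in parent:     # ignore edges touching removed nodes
            ru, rv = find(u), find(v)
            if ru == rv:                    # u and v already connected: cycle
                return False
            parent[ru] = rv
    return True

def min_feedback_vertex_set(nodes, edges):
    """Smallest set of nodes whose removal leaves a forest (exponential time)."""
    nodes = list(nodes)
    for size in range(len(nodes) + 1):
        for removed in combinations(nodes, size):
            if is_forest(set(nodes) - set(removed), edges):
                return set(removed)

# Two triangles sharing node 2: removing node 2 breaks both cycles.
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 2)]
print(min_feedback_vertex_set(range(5), edges))
```

Because the search is exponential in the number of query nodes, it is only reasonable for small queries such as the five to fifteen proteins mentioned above.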

In practice, our upper bound is largely overestimated. Indeed, each element of $F$ must be assigned to a different node of the network, and hence there are at most $n$ possibilities for each element of $F$. The worst-case number of runs of BESTCONSTRAINTALIGNMENT is $\frac{n!}{(n-|F|)!}$, the number of ordered selections of $|F|$ distinct nodes among $n$.

Moreover, we only consider valid assignments, and there are only a few such assignments. Indeed, a protein is, on average, homologous to dozens of proteins, which is far fewer than the number of proteins in a classical PPI network (e.g., $n \approx 5000$ for yeast). For example, if $|F| = 3$ and each protein represented by an element of $F$ is homologous to ten proteins in the PPI network, then the number of assignments will not be $n^3$ but only $10^3$. Here, the running time is greatly reduced. Therefore, and not surprisingly, the BLAST threshold used to determine whether a protein is homologous to another has a huge impact on the running time of the algorithm.
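The three counts discussed above can be compared numerically. This is a small arithmetic sketch with a hypothetical function name; `math.perm` requires Python 3.8 or later.

```python
import math

def assignment_counts(n, f, homologs_per_family):
    """Return (n^f, n!/(n-f)!, homologs^f): the crude upper bound, the
    number of assignments to distinct nodes, and an upper bound on the
    number of valid assignments when each family has a fixed number of
    homologs in the network."""
    return n ** f, math.perm(n, f), homologs_per_family ** f

crude, distinct, valid = assignment_counts(n=5000, f=3, homologs_per_family=10)
print(crude, distinct, valid)
```

With the values used in the text, the number of candidate assignments drops from $1.25 \times 10^{11}$ to at most $10^3$.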

Finally, observe that in QNet, for a given treewidth, query graphs can be very different. For example, in the resulting tree decomposition of the graph, there is no limit on the number of bags of size $t$. Furthermore, in a given bag, the topology is arbitrary (e.g., a clique), potentially requiring an exhaustive enumeration upper-bounded by $n^{t+1}$. Therefore, the treewidth value does not indicate how many times an exhaustive enumeration has to be done.

We would have liked to compare our algorithm to QNet in practice but, unfortunately, the version of QNet that queries graphs is not yet implemented. Comparing our algorithm with QNet on simple tree queries would not make sense, since PADA1 is not optimized for this special case.


Figure 4. A sample result of our algorithm. a) A human MAPK query, obtained from [26], with three cycles. b) The alignment graph given by our algorithm in the fly PPI network. Dashed lines denote the BLAST homology scores between the two proteins. Our algorithm retrieves a query graph in another network. As in QNet [2], there seems to be some conservation between these two species.

In order to validate our algorithm, we performed the experimental tests on real data proposed for QNet [2]. In our experiments, the data for the PPI networks of the fly and the yeast have been obtained from the DIP database (footnote 1) [24]. The yeast network contains 4 738 proteins and 15 147 interactions, whereas the fly network contains 7 481 proteins and 26 201 interactions.

The first experiment consists in retrieving trees. To do so, the authors of QNet randomly extract tree queries of size 5 to 9 from the yeast network and try to retrieve them in this network. Each query is modified with at most two insertions or deletions. We also successfully retrieved these queries.

The second experiment was performed across species. The Mitogen-Activated Protein Kinase (MAPK) pathways are a collection of signal transduction pathways. According to [25], they have a critical function in the cellular response to extracellular stimuli. They are known to be conserved across different species. We obtained the human MAPK pathways from the KEGG database [26] and tried to retrieve them in the fly network, as done in QNet. While QNet uses only trees, we were able to query graphs. The results were satisfying, since we retrieved them with few or no modifications. Figure 4 shows a sample of our results on real data. This suggests a potential conservation of patterns across species.

The BLAST threshold has a deep impact on the running time. Moreover, we could certainly reduce the running time by stopping the dynamic programming earlier. Indeed, one can stop as soon as there are more nodes to color than available colors in a step of the dynamic programming. This check requires keeping track of the number of deletions and insertions still available. Thus, for the moment, we only stop when a single color is available. Another improvement would be to swap the coloration step with the choice of assignments $A$. A final possible speed-up would be to use the technique of Huffner et al. [27], which basically consists in increasing the number of colors used during the coloration step.

IV. CONCLUSION

In this paper, we have tried to improve our understanding of PPI networks by developing a tool called PADA1 (available upon request) to query graphs in PPI networks. This algorithm, which is FPT for query graphs with a constant size of their feedback vertex set, has a time complexity of $O(n^{|F|} 2^{O(k+N_{ins})} m N_{del} \ln(1/\epsilon))$, where $n$ (resp. $m$) is the number of nodes (resp. edges) in the PPI network, $k$ is the number of nodes in the query, $N_{ins}$ (resp. $N_{del}$) is the maximum number of insertions (resp. deletions) allowed, $\epsilon$ is any value $> 0$ such that $1 - \epsilon$ is the desired success probability, and $|F|$ is the minimum number of nodes which have to be duplicated to transform the query graph into a tree (solving the FEEDBACK VERTEX SET problem). This last parameter is the main difference with QNet of Dost et al. [2], which uses the treewidth of the query (unimplemented algorithm). Consequently, PADA1 is an alternative to QNet, and one can determine which parameter is better for a given query. These algorithms both use the color coding technique and are both exact for a given coloration. We have performed some tests on real data and have retrieved known paths in the yeast PPI network. Moreover, we have retrieved known human paths in the fly PPI network.

Footnote 1: http://dip.doe-mbi.ucla.edu/

The time complexity of our algorithm depends on the number of nodes which have to be duplicated in the query graph. This number is directly connected to the initial topology of the query graph. Obtaining more information about the topology of the queries and the average number of homologs of the proteins in the query is of particular interest in this context. Future work includes using this information to predict the average time complexity.

Knowing whether GRAPH TOPOLOGICAL CONTAINMENT (determining if a graph $G$ has a subgraph that is a subdivision of a parameter graph $H$) is FPT or W[1]-hard is still an open problem (conjectured to be FPT by Fellows [28]). In the context of protein queries, one can ask whether the query $H$ appears in the network $G$ with an unbounded number of insertions.

Finally, a problem close to this one, called GRAPH MOTIF, has recently been introduced by Lacroix, Fernandes and Sagot [29]. Roughly speaking, one is only concerned with finding a connected occurrence of the query, which is defined without a given topology, that is, as a set or a multiset of proteins. This new definition leads to new issues, and has already been investigated [30], [31], [32], [33], [34], [35], [22].

ACKNOWLEDGEMENT

The authors would like to thank Banu Dost for providing us with the QNet source code and their test data. We also thank the anonymous reviewers for their helpful comments and suggestions for improving the manuscript.

REFERENCES

[1] G. Blin, F. Sikora, and S. Vialette, “Querying Protein-Protein Interaction Networks,” in 5th International Symposium on Bioinformatics Research and Applications (ISBRA’09), ser. LNBI, S. Istrail, P. Pevzner, and M. Waterman, Eds., vol. 5542. Fort Lauderdale, FL, USA: Springer-Verlag, May 2009, pp. 52–62.

[2] B. Dost, T. Shlomi, N. Gupta, E. Ruppin, V. Bafna, and R. Sharan, “QNet: A Tool for Querying Protein Interaction Networks,” RECOMB, pp. 1–15, 2007.

[3] A. Gavin, M. Boshe, et al., “Functional organization of the yeast proteome by systematic analysis of protein complexes,” Nature, vol. 414, no. 6868, pp. 141–147, 2002.

[4] Y. Ho, A. Gruhler, et al., “Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry,” Nature, vol. 415, no. 6868, pp. 180–183, 2002.

[5] P. Uetz, L. Giot, et al., “A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae,” Nature, vol. 403, no. 6770, pp. 623–627, 2000.

[6] T. Reguly, A. Breitkreutz, L. Boucher, B. Breitkreutz, G. Hon, C. Myers, A. Parsons, H. Friesen, R. Oughtred, A. Tong, et al., “Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae,” Journal of Biology, 2006.

[7] M. Pellegrini, E. Marcotte, M. Thompson, D. Eisenberg, and T. Yeates, “Assigning protein functions by comparative genome analysis: protein phylogenetic profiles,” PNAS, vol. 96, no. 8, pp. 4285–4288, 1999.

[8] M. Garey and D. Johnson, Computers and Intractability: a guide to the theory of NP-completeness. San Francisco: W.H. Freeman, 1979.

[9] B. Kelley, R. Sharan, R. Karp, T. Sittler, D. E. Root, B. Stockwell, and T. Ideker, “Conserved pathways within bacteria and yeast as revealed by global protein network alignment,” Proceedings of the National Academy of Sciences, vol. 100, no. 20, pp. 11394–11399, 2003.

[10] T. Shlomi, D. Segal, E. Ruppin, and R. Sharan, “QPath: a method for querying pathways in a protein-protein interaction network,” BMC Bioinformatics, vol. 7, p. 199, 2006.

[11] N. Alon, R. Yuster, and U. Zwick, “Color coding,” Journal of the ACM, vol. 42, no. 4, pp. 844–856, 1995.

[12] R. Downey and M. Fellows, Parameterized Complexity. Springer-Verlag, 1999.

[13] R. Pinter, O. Rokhlenko, E. Yeger-Lotem, and M. Ziv-Ukelson, “Alignment of metabolic pathways,” Bioinformatics, vol. 21, no. 16, pp. 3401–3408, 2005.

[14] H. Bodlaender, “A tourist guide through treewidth,” Acta Cybernetica, vol. 11, pp. 1–23, 1993.

[15] S. Arnborg, D. Corneil, and A. Proskurowski, “Complexity of finding embeddings in a k-tree,” Journal on Algebraic and Discrete Methods, vol. 8, no. 2, pp. 277–284, 1987.

[16] Q. Cheng, P. Berman, R. Harrison, and A. Zelikovsky, “Fast Alignments of Metabolic Networks,” in Proceedings of the 2008 IEEE International Conference on Bioinformatics and Biomedicine. IEEE Computer Society, 2008, pp. 147–152.

[17] R. Karp, “Reducibility among combinatorial problems,” in Complexity of computer computations, J. Thatcher and R. Miller, Eds. New York: Plenum Press, 1972, pp. 85–103.

[18] J. Guo, J. Gramm, F. Huffner, R. Niedermeier, and S. Wernicke, “Compression-based fixed-parameter algorithms for feedback vertex set and edge bipartization,” Journal of Computer and System Sciences, vol. 72, no. 8, pp. 1386–1396, 2006.


[19] S. Thomasse, “A quadratic kernel for feedback vertex set,” in Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2009.

[20] S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman, “Basic local alignment search tool,” Journal of Molecular Biology, vol. 215, no. 3, pp. 403–410, 1990.

[21] J. Scott, T. Ideker, R. Karp, and R. Sharan, “Efficient algorithms for detecting signaling pathways in protein interaction networks,” Journal of Computational Biology, vol. 13, pp. 133–144, 2006.

[22] S. Bruckner, F. Huffner, R. Karp, R. Shamir, and R. Sharan, “Topology-free querying of protein interaction networks,” in Proc. 13th Annual International Conference on Computational Molecular Biology (RECOMB), Tucson, USA. Springer, 2009, p. 74.

[23] H. Bodlaender and A. Koster, “Combinatorial optimization on graphs of bounded treewidth,” The Computer Journal, 2007.

[24] I. Xenarios, L. Salwinski, X. Duan, P. Higney, S. Kim, and D. Eisenberg, “DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions,” Nucleic Acids Research, vol. 30, no. 1, p. 303, 2002.

[25] P. Dent, A. Yacoub, P. Fisher, M. Hagan, and S. Grant, “MAPK pathways in radiation responses,” Oncogene, vol. 22, pp. 5885–5896, 2003.

[26] M. Kanehisa, S. Goto, S. Kawashima, Y. Okuno, and M. Hattori, “The KEGG resource for deciphering the genome,” Nucleic Acids Research, vol. 32, pp. 277–280, 2004.

[27] F. Huffner, S. Wernicke, and T. Zichner, “Algorithm Engineering For Color-Coding To Facilitate Signaling Pathway Detection,” in Proceedings of the 5th Asia-Pacific Bioinformatics Conference. Imperial College Press, 2007.

[28] M. Fellows, “Parameterized complexity: new developments and research frontiers,” in Aspects of Complexity: Minicourses in Algorithmics, Complexity and Computational Algebra: Mathematics Workshop, Kaikoura, January 7-15, 2000. Walter de Gruyter, 2001, p. 51.

[29] V. Lacroix, C. Fernandes, and M.-F. Sagot, “Motif search in graphs: application to metabolic networks,” IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), vol. 3, no. 4, pp. 360–368, 2006.

[30] M. Fellows, G. Fertin, D. Hermelin, and S. Vialette, “Sharp tractability borderlines for finding connected motifs in vertex-colored graphs,” in Proc. 34th International Colloquium on Automata, Languages and Programming (ICALP), Wroclaw, Poland, ser. Lecture Notes in Computer Science, vol. 4596. Springer, 2007, pp. 340–351.

[31] N. Betzler, M. Fellows, C. Komusiewicz, and R. Niedermeier, “Parameterized algorithms and hardness results for some graph motif problems,” in Proc. 19th Annual Symposium on Combinatorial Pattern Matching (CPM), Pisa, Italy, ser. Lecture Notes in Computer Science, vol. 5029. Springer, 2008, pp. 31–43.

[32] R. Dondi, G. Fertin, and S. Vialette, “Weak pattern matching in colored graphs: Minimizing the number of connected components,” in Proc. 10th Italian Conference on Theoretical Computer Science (ICTCS), Roma, Italy. World-Scientific, 2007, pp. 27–38.

[33] ——, “Maximum motif problem in vertex-colored graphs,” in Proc. 20th Annual Symposium on Combinatorial Pattern Matching (CPM’09), Lille, France, ser. Lecture Notes in Computer Science, G. Kucherov and E. Ukkonen, Eds., vol. 5577, 2009, pp. 221–235.

[34] G. Blin, F. Sikora, and S. Vialette, “GraMoFoNe: a cytoscape plugin for querying motifs without topology in protein-protein interactions networks,” in 2nd International Conference on Bioinformatics and Computational Biology (BICoB-2010). International Society for Computers and their Applications (ISCA), 2010.

[35] S. Schbath, V. Lacroix, and M. Sagot, “Assessing the exceptionality of coloured motifs in networks,” EURASIP Journal on Bioinformatics and Systems Biology, vol. 2009, 2009.

Guillaume Blin is an associate professor in the LIGM at Universite Paris-Est, France. He received his Ph.D. in November 2005 from Universite de Nantes, France. His current research interests include computational complexity and approximation, algorithms, and bioinformatics.

Florian Sikora is a Ph.D. student in the LIGM at Universite Paris-Est, France. He received his Master's degree from Universite Paris-Est in 2008. His research interests include algorithmics in computational biology, and more especially fixed-parameter tractable algorithms.


Stephane Vialette is a researcher in the LIGM at Universite Paris-Est, France. He received his Ph.D. degree in 2001 from the University Paris 7, France. His research interests are in computational biology, algorithmics and combinatorics.


GraMoFoNe: a Cytoscape plugin for querying motifs without topology in Protein-Protein Interactions networks

Guillaume Blin, Florian Sikora, Stephane Vialette

Universite Paris-Est, LIGM - UMR CNRS 8049, France
{gblin, sikora, vialette}@univ-mlv.fr

Abstract

During the last decade, the amount of data on Protein-Protein Interactions (PPI) has grown enormously. Searching for motifs in PPI networks has thus become a crucial problem for interpreting this data. A large part of the literature is devoted to the query of motifs with a given topology. However, the biological data are, for now, so noisy (missing and erroneous information) that the topology of a motif can be irrelevant. Consequently, Lacroix et al. [19] defined a new problem, called GRAPH MOTIF, which consists in searching for a multiset of colors in a vertex-colored graph. In this article, we present GraMoFoNe, a plugin for Cytoscape based on a Linear Pseudo-Boolean optimization solver, which handles GRAPH MOTIF and some of its extensions.

1. Introduction

Recent techniques have increased the data and knowledge available about proteins ([14, 15, 29]). Among other protein properties, the set of all protein interactions for a given organism, called the Protein-Protein Interactions (PPI) network, has gained huge interest in the last few years. A major goal of comparative analysis of PPI networks is to determine to what extent proteins are conserved among species. Indeed, recent research suggests that proteins function together in pathways and tend to evolve in a correlated fashion, being preserved or eliminated in new species [21]. Therefore, it has become of foremost importance to identify PPI subnetworks that are similar to a given motif (i.e., a pathway of proteins), where similarity is measured both in terms of protein-sequence and subnetwork topology conservation.

In this context, most tools consider topology-based motifs (either a path [17, 27], a tree [22, 12], or a graph [12, 8]). However, interaction data are so noisy and incomplete that topology information in the motif may be dispensed with [9]. Following this remark, Lacroix et al. [19] introduced the following problem, named GRAPH MOTIF.

Definition 1.1 (GRAPH MOTIF [19]). Given a vertex-colored graph $G = (V, E)$ and a multiset of colors $M$ (the motif), find a connected subset of vertices $R \subseteq V$ whose multiset of colors equals $M$ (i.e., there is a bijection $\sigma : R \to M$ such that $\sigma(v) \in col(v)$ for all $v \in R$, where $col(v)$ is the color of $v$).
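Definition 1.1 can be made concrete with an exhaustive search on a toy graph. This sketch is not the method used by GraMoFoNe; it assumes one color per node, a non-empty motif, and hypothetical names (`adj` for adjacency sets, `col` for the coloring), and it runs in exponential time.

```python
from collections import Counter
from itertools import combinations

def is_connected(R, adj):
    """Depth-first search restricted to the node set R."""
    R = set(R)
    start = next(iter(R))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w in R and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == R

def graph_motif(adj, col, motif):
    """Return a connected node set whose colors equal the multiset `motif`,
    or None if no such set exists. Uncolored nodes are absent from `col`."""
    target = Counter(motif)
    for R in combinations(sorted(adj), len(motif)):
        if Counter(col.get(v) for v in R) == target and is_connected(R, adj):
            return set(R)
    return None

# Path a - b - c - d with colors red, blue, red, green.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
col = {"a": "red", "b": "blue", "c": "red", "d": "green"}
print(graph_motif(adj, col, ["red", "red", "blue"]))
```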

In our context, the graph $G$ represents the PPI network, where vertices are the proteins and edges the interactions. The motif is completely defined by adding a color in $M$ for each different requested protein. Once the motif is defined, a node $v \in G$ is colored by a color $c \in M$ if the protein represented by $v$ is homologous to the protein represented by $c$ (e.g., according to a BLASTp [5] analysis). If the protein represented by a node $v$ is not homologous to any protein of the motif, then $v$ is not colored.

Despite the NP-completeness of the problem [19], some theoretical results exist [7, 13, 11]. Nevertheless, to the best of our knowledge, there is only one implemented tool, called Torque [9]. Torque uses either integer linear programming or dynamic programming combined with the color coding technique [4]. The limitations of Torque are that it is a web service (therefore it is hard to connect with other services, the performance depends only on the server, and it is not possible to perform batch tests), that it only gives one solution (not all possible solutions) and, last but not least, that it only deals with colorful motifs (i.e., at most one occurrence of each color).

When dealing with multiset motifs, which may be of interest, two approaches can be highlighted. (i) A functional approach: using a Gene Ontology-like classification [10], two proteins have the same color if they belong to the same class. (ii) An evolutionary approach: two proteins have the same color in the motif if they are homologous. In our plugin, the second approach is used.

By searching for exact matches of a motif, we provide a new tool to solve GRAPH MOTIF [19]. It is worth noticing that our plugin also deals with some extensions of this problem. Indeed, due to the huge rate of noise in PPI networks [14, 23], exact matches are often too restrictive, and hence one may allow deletions (i.e., proteins which are in the motif but not in the solution). The resulting problem is MAX MOTIF, defined by Dondi et al. [11]. Similarly, the resulting subnetwork may contain protein insertions (i.e., proteins which are in the solution but not in the motif) that help to ensure the connectivity of the result. These proteins can be colored or not, as claimed in [9]. Moreover, our plugin allows motifs to be restricted to colorful ones. Finally, since a protein can be homologous to more than one protein, a node $v \in G$ can have more than one color. Hence, a set of colors (instead of only one color) can be assigned to any node in order to deal with the LIST-COLORED GRAPH MOTIF problem introduced by Betzler et al. [7]. In this latter problem, the bijection $\sigma$ is still valid, since $col(v)$ then returns the list of colors assigned to $v$.

hal-00425661, version 1 - 22 Oct 2009

2. Methods and implementation

Our tool, named GraMoFoNe (which stands for Graph Motif For Networks), has been implemented as a Cytoscape plugin. Cytoscape [26] is a popular open-source software platform for network visualization and analysis, which supports the development of external plugin tools extending its functionalities. Our plugin searches for occurrences of a user-defined motif in a network previously loaded into the Cytoscape workspace (many file formats are supported). It uses an exact algorithm to perform this task.

To this end, we chose to express our problem as a linear pseudo-boolean optimization problem (LPB), i.e., as a linear program [25] whose variables are boolean. In an LPB problem, the objective is to find an assignment of boolean variables such that all constraints are satisfied and such that the value of the linear objective function is optimized. Our LPB formulation is composed of 23 constraints defined upon 9 domains of variables (details are provided in the sequel). A large number of LPB solvers, which generalize SAT solvers, exist. We decided to use the Java library SAT4JPseudo [20] for (i) its efficient Java integration, (ii) its good results in the PB Evaluation 07 [2], and (iii) its free availability (efficient pure linear programming solvers are indeed often expensive).

Our LPB program searches for a connected occurrence of a multiset of colors $M$, called the motif (with $k = |M|$), in the vertex-colored, edge-weighted, undirected graph $G = (V, E, w)$, where $w$ is a function assigning a weight to any edge of $E$. Let $R \subseteq V$ be a solution. Let $N(v)$ denote the set of neighbors of $v$ (i.e., $N(v) = \{u : \{u, v\} \in E\}$) and $G[R]$ the subgraph of $G$ induced by the set $R$. In the motif $M$, let $occ_M(c)$, $c \in C$, denote the number of occurrences of color $c$ in $M$. Let $col : V \to 2^C$ be a function which returns the list of colors of $C$ associated to any node of $V$.

As said previously, looking for exact matches can be too restrictive. We therefore allow insertions and deletions of proteins, so that $|R|$ may differ from $k$. Indeed, when $|R| < k$ (resp. $|R| > k$), we say that there are at least $k - |R|$ deletions (resp. $|R| - k$ insertions). The maximal number of deletions (resp. insertions) is denoted by $N_{del}$ (resp. $N_{ins}$). However, comparing $k$ and $|R|$ is not sufficient to determine the number of indels (i.e., insertions and deletions) in the solution. Indeed, if there is one deletion for a color and one insertion for another color, we have $|R| = k$ whereas $R$ is not in bijection with $M$. To deal with this fact, we have to consider, for each color $c$, the difference between the number of occurrences of $c$ in the motif and the number of nodes colored by $c$ in $R$. Moreover, if a node $v$ in the solution $R$ is colored with more than one color, $v$ must match only one color of $M$, since $\sigma$ is a bijection: the other colors of $v$ cannot match other colors of $M$. Our LPB program deals with these two constraints.
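The need for a per-color count can be seen on a two-color example. The sketch below uses hypothetical names and assumes that each node of $R$ has already been given the single color it matches.

```python
from collections import Counter

def indels_per_color(motif, matched_colors):
    """Return (insertions, deletions) per color, comparing the occurrences
    of each color in the motif with those among the matched nodes of R."""
    occ_m = Counter(motif)
    occ_r = Counter(matched_colors)
    colors = set(occ_m) | set(occ_r)
    insertions = {c: max(0, occ_r[c] - occ_m[c]) for c in colors}
    deletions = {c: max(0, occ_m[c] - occ_r[c]) for c in colors}
    return insertions, deletions

# |R| = k = 2, yet R is not in bijection with M:
# one red node is inserted and the blue protein is deleted.
ins, dels = indels_per_color(["red", "blue"], ["red", "red"])
print(ins, dels)
```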

Hereafter, we present the variables, the objective function and the constraints of our LPB program.

Variables. For any node $v \in V$, we have a variable $x_v \in \{0, 1\}$ to denote the presence of $v$ in the solution $R$: $x_v = 1$ iff $v \in R$. For any edge $\{u, v\} \in E$, we have a variable $e_{uv} \in \{0, 1\}$ to denote the presence of $\{u, v\}$ in $G[R]$: $e_{uv} = 1$ iff $\{u, v\} \in G[R]$.

As we will explain shortly, $k + N_{ins}$ different integer labels are associated to the nodes in $R$ to ensure the connectivity of $G[R]$. Note that a node is labeled only if it is part of the solution. Thus, for any node $v \in V$, we have $k + N_{ins}$ variables $Label[v][i] \in \{0, 1\}$, with $1 \le i \le k + N_{ins}$, to represent the "label" of $v$: $Label[v][i] = 1$ iff $v$ has the label $i$.

For any node $v \in V$ and color $c \in col(v)$, we have a variable $ColV[v][c]$ to represent the color of $v$ used in the coloring function: $ColV[v][c] = 1$ iff $v$ is considered to have the color $c$ in $R$. These variables are used to choose which of the $|col(v)|$ colors of $v$ is used in the bijection with $M$. In fact, by allowing a list of colors for $v$, if $v$ is in the solution, $v$ may match up to $|col(v)|$ colors of the motif. Since we want a bijection between the colors of $R$ and $M$, we have to choose which unique color will be considered for a given node.

For any color $c \in C$, we have $N_{ins} + 1$ variables $nins_c[i]$, with $0 \le i \le N_{ins}$, to represent the number of insertions for the color $c$: $nins_c[i] = 1$ iff there are $i$ insertions of nodes with the color $c$. Similarly, for any color $c \in C$, we have $N_{del} + 1$ variables $ndel_c[i]$, with $0 \le i \le N_{del}$, to represent the number of deletions for the color $c$: $ndel_c[i] = 1$ iff there are $i$ deletions of nodes with the color $c$.

For any color $c \in C$, we have three variables $IsExact_c$, $IsIns_c$ and $IsDel_c$ to indicate whether some nodes colored with $c$ in $R$ are inserted or deleted: $IsExact_c = 1$ (resp. $IsIns_c = 1$, $IsDel_c = 1$) iff the number of nodes in $R$ with the color $c$ is equal to (resp. greater than, lower than) $occ_M(c)$. These variables are used for ease of exposition (i.e., there is an equivalence between these variables and $nins_c[0]$ and $ndel_c[0]$).

Objective. The objective of the LPB program is to maximize the score of the solution. Our program maximizes the sum of all variables $e_{uv}$ times their corresponding edge weight. In other words, it corresponds to maximizing the sum of the edge weights of the solution. Formally, the objective is:

$$
\max \sum_{\{u,v\} \in E} e_{uv}\, w(\{u, v\})
$$

Constraints. The following two constraints ensure that the solution $G[R]$ is a graph of correct size (according to $k$, $N_{ins}$ and $N_{del}$).

\[ \forall u, v \in V,\quad e_{uv} \Leftrightarrow x_u \wedge x_v \tag{1} \]

\[ k - N_{del} \le \sum_{v \in V} x_v \le k + N_{ins} \tag{2} \]

Constraint (1) ensures that $\{u, v\} \in G[R]$ iff both $u$ and $v$ are in $R$. Constraint (2) controls the number of nodes in the solution. When no indels are allowed, the size of the solution must be equal to $k$, the number of elements in the motif. When insertions (resp. deletions) are allowed, the size of the solution can be larger (resp. smaller) than $k$.
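As a minimal sketch of what constraints (1) and (2) and the objective express, the following Python fragment evaluates a candidate node set $R$ directly rather than through the LPB solver. The function names, the edge-weight dictionary and the toy graph are assumptions made for illustration; they are not part of GraMoFoNe.

```python
def size_is_valid(R, k, n_ins, n_del):
    """Constraint (2): k - N_del <= |R| <= k + N_ins."""
    return k - n_del <= len(R) <= k + n_ins


def induced_score(weights, R):
    """Objective: sum of w({u, v}) over edges whose two endpoints are in R.

    By constraint (1), e_uv = 1 exactly for those edges.
    """
    return sum(w for (u, v), w in weights.items() if u in R and v in R)


# Hypothetical toy network: edge -> weight.
weights = {("a", "b"): 2.0, ("b", "c"): 1.5, ("c", "d"): 4.0}
R = {"a", "b", "c"}

print(size_is_valid(R, k=3, n_ins=0, n_del=0))  # True
print(induced_score(weights, R))                # 3.5
```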

The following four constraints ensure the connectivity of $G[R]$.

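\[ \forall v \in V,\quad x_v \Rightarrow \Big( \sum_{i=1}^{k+N_{ins}} \mathrm{Label}[v][i] = 1 \Big) \tag{3} \]

\[ \forall v \in V,\quad \neg x_v \Rightarrow \Big( \sum_{i=1}^{k+N_{ins}} \mathrm{Label}[v][i] = 0 \Big) \tag{4} \]

\[ \forall\, 1 \le i \le k + N_{ins},\quad \sum_{v \in V} \mathrm{Label}[v][i] \le 1 \tag{5} \]

\[ \forall v \in V,\ \forall\, 1 \le i < k + N_{ins},\quad \mathrm{Label}[v][i] \Rightarrow \sum_{u \in N(v)} \sum_{j > i} \mathrm{Label}[u][j] \ge 1 \tag{6} \]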

Constraint (3) ensures that if $v \in R$, then $v$ has exactly one label, an integer between 1 and $k + N_{ins}$. Constraint (4) ensures that if $v \notin R$, then $v$ is unlabeled. Constraint (5) ensures that any label is assigned to at most one node. Because of deletions, some labels may remain unassigned. Constraint (6) ensures the connectivity of $G[R]$: any node of $R$, except the one with the maximal label, must have a neighbor in $G[R]$ with a label greater than its own.
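The labeling conditions can be checked on a concrete labeling with a few lines of Python. This is only an illustrative sketch: the adjacency-dictionary representation and the function name are assumptions, and the check reads constraint (6) literally, so only a node carrying the label $k + N_{ins}$ is exempt from having a higher-labeled neighbor.

```python
def labels_satisfy_connectivity(adj, labels, max_label):
    """Check constraints (3), (5) and (6) for a labeling of R.

    adj:       dict node -> list of neighbors in G
    labels:    dict node -> integer label, defined only for nodes of R
               (unlabeled nodes are outside R, as in constraint (4))
    max_label: k + N_ins
    """
    values = list(labels.values())
    if any(not 1 <= l <= max_label for l in values):   # (3): label range
        return False
    if len(set(values)) != len(values):                # (5): labels distinct
        return False
    for v, l in labels.items():
        if l == max_label:                             # (6) only covers i < k + N_ins
            continue
        if not any(labels.get(u, 0) > l for u in adj[v]):
            return False
    return True


# Hypothetical example with k = 3 and N_ins = 1, hence max_label = 4.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(labels_satisfy_connectivity(path, {"a": 1, "b": 2, "c": 4}, 4))   # True

split = {"a": ["b"], "b": ["a"], "c": []}
print(labels_satisfy_connectivity(split, {"a": 1, "b": 4, "c": 2}, 4))  # False
```

In the second call the isolated node has no neighbor at all, so constraint (6) rejects the disconnected candidate.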

The following two constraints ensure that $G[R]$ has enough colored vertices, according to $\mathrm{occ}_M(c)$ for every $c \in C$, $N_{ins}$ and $N_{del}$.

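\[ \forall c \in C,\quad \mathrm{occ}_M(c) - N_{del} \le \sum_{\substack{v \in V \\ c \in \mathrm{col}(v)}} x_v \le \mathrm{occ}_M(c) + N_{ins} \tag{7} \]

\[ \forall v \in V,\quad \sum_{c \in \mathrm{col}(v)} \mathrm{ColV}[v][c] = x_v \tag{8} \]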

Constraint (7) ensures that for any color $c$ in $M$, there are enough vertices colored with $c$ in $G[R]$. When no indels are allowed, a solution must contain $\mathrm{occ}_M(c)$ occurrences of $c$, for each color $c$. Since insertions (resp. deletions) of colored nodes are allowed, the number of occurrences of a color can be larger (resp. smaller). Constraint (8) ensures that a unique color is selected for any node $v$ in $R$ among its $|\mathrm{col}(v)|$ associated colors.
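To make the bound of constraint (7) concrete, the sketch below counts, for each motif color, the nodes of a candidate set $R$ that carry this color and compares the count with the allowed window. The data layout (a list of motif colors and a dictionary from node to color set) and the function name are assumptions chosen for illustration.

```python
from collections import Counter


def color_counts_ok(R, col, motif, n_ins, n_del):
    """Constraint (7): for each motif color c,
    occ_M(c) - N_del <= #{v in R : c in col(v)} <= occ_M(c) + N_ins."""
    occ = Counter(motif)  # occ_M(c) for every color c of the motif
    for c, occ_c in occ.items():
        count = sum(1 for v in R if c in col.get(v, set()))
        if not occ_c - n_del <= count <= occ_c + n_ins:
            return False
    return True


# Hypothetical motif with two occurrences of "red" and one of "blue".
motif = ["red", "red", "blue"]
col = {"a": {"red"}, "b": {"red", "blue"}, "c": {"blue"}, "d": set()}

print(color_counts_ok({"a", "b", "c"}, col, motif, n_ins=0, n_del=0))  # False
print(color_counts_ok({"a", "b", "c"}, col, motif, n_ins=1, n_del=0))  # True
```

With no indels the first call fails because two nodes of $R$ carry "blue" while the motif has a single "blue"; allowing one insertion widens the window enough.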

The following three constraints ensure that either all occurrences of a color $c \in C$ in $M$ are matched, or at least one of them is inserted or deleted.

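\[ \forall c \in C,\quad \mathrm{IsExact}_c + \mathrm{IsIns}_c + \mathrm{IsDel}_c = 1 \tag{9} \]

\[ \forall c \in C,\quad \sum_{v \in V} \mathrm{ColV}[v][c] - \mathrm{occ}_M(c) \le \mathrm{IsIns}_c \cdot N_{ins} - \mathrm{IsDel}_c \tag{10} \]

\[ \forall c \in C,\quad \sum_{v \in V} \mathrm{ColV}[v][c] - \mathrm{occ}_M(c) \ge \neg \mathrm{IsExact}_c - \mathrm{IsDel}_c - \mathrm{IsDel}_c \cdot N_{del} \tag{11} \]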

Constraint (9) ensures the above assertion, whereas constraints (10) and (11) ensure the consistency between ColV, IsExact, IsIns and IsDel: $\forall c \in C$, $\mathrm{IsExact}_c$ (resp. $\mathrm{IsIns}_c$, $\mathrm{IsDel}_c$) $= 1$ iff $\sum_{v \in V} \mathrm{ColV}[v][c] - \mathrm{occ}_M(c) = 0$ (resp. $> 0$, $< 0$).

The following six constraints ensure that the number of insertions is at most $N_{ins}$.

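\[ \forall c \in C,\quad \sum_{i=0}^{N_{ins}} \mathrm{nins}_c[i] = 1 \tag{12} \]

\[ \forall c \in C,\quad \mathrm{IsIns}_c \Rightarrow \mathrm{nins}_c[0] = 0 \tag{13} \]

\[ \forall c \in C,\quad \neg \mathrm{IsDel}_c + \mathrm{nins}_c[0] \ge 1 \tag{14} \]

\[ \forall c \in C,\ \forall\, 0 \le i \le N_{ins},\quad \sum_{v \in V} \mathrm{ColV}[v][c] - \mathrm{occ}_M(c) \le i \cdot \mathrm{nins}_c[i] + \neg \mathrm{nins}_c[i] \cdot N_{ins} \tag{15} \]

\[ \forall c \in C,\ \forall\, 0 \le i \le N_{ins},\quad \neg \mathrm{nins}_c[i] + \sum_{v \in V} \mathrm{ColV}[v][c] - \mathrm{occ}_M(c) + N_{del} \cdot \mathrm{IsDel}_c \ge i \cdot \mathrm{nins}_c[i] \tag{16} \]

\[ \sum_{c \in C} \sum_{i=1}^{N_{ins}} i \cdot \mathrm{nins}_c[i] + \sum_{\substack{v \in V \\ \mathrm{col}(v) = \emptyset}} x_v \le N_{ins} \tag{17} \]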

Constraint (12) ensures that, for a given color $c \in C$, exactly one variable $\mathrm{nins}_c[i]$ is set, namely the one corresponding to the number of insertions of nodes with color $c$. Constraint (13) ensures that the variables $\mathrm{nins}_c$ and $\mathrm{IsIns}_c$ are consistent. Constraint (14) ensures that for a color $c \in C$ there are either insertions or deletions, but not both. Constraints (15) and (16) ensure that $\mathrm{nins}_c[i] = 1$ iff there are $i$ insertions of nodes with the color $c \in C$ (i.e. iff the difference between $\sum_{v \in V} \mathrm{ColV}[v][c]$ and $\mathrm{occ}_M(c)$ is equal to $i$). Constraint (17) ensures that the number of insertions is bounded by $N_{ins}$: the sum of the insertions over all colors, plus the insertions of uncolored nodes, must be at most $N_{ins}$.
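The interplay of constraints (9) to (16) for a single color can be explored by brute force. The sketch below enumerates every one-hot choice of ($\mathrm{IsExact}_c$, $\mathrm{IsIns}_c$, $\mathrm{IsDel}_c$) and of the index $j$ with $\mathrm{nins}_c[j] = 1$, and keeps the choices satisfying the constraints for a given difference $\sum_{v} \mathrm{ColV}[v][c] - \mathrm{occ}_M(c)$. It assumes that $\neg y$ is encoded as $1 - y$; the function names and the values $N_{ins} = N_{del} = 2$ are illustrative only.

```python
from itertools import product


def satisfies_insertion_constraints(diff, exact, ins, dele, j, n_ins, n_del):
    """Check constraints (9)-(11) and (13)-(16) for one color c.

    diff is sum_v ColV[v][c] - occ_M(c); nins_c is one-hot at index j,
    so constraint (12) holds by construction.
    """
    nins = [1 if i == j else 0 for i in range(n_ins + 1)]
    if exact + ins + dele != 1:                                # (9)
        return False
    if diff > ins * n_ins - dele:                              # (10)
        return False
    if diff < (1 - exact) - dele - dele * n_del:               # (11)
        return False
    if ins == 1 and nins[0] != 0:                              # (13)
        return False
    if (1 - dele) + nins[0] < 1:                               # (14)
        return False
    for i in range(n_ins + 1):
        if diff > i * nins[i] + (1 - nins[i]) * n_ins:         # (15)
            return False
        if (1 - nins[i]) + diff + n_del * dele < i * nins[i]:  # (16)
            return False
    return True


def feasible_assignments(diff, n_ins, n_del):
    """All ((IsExact, IsIns, IsDel), j) choices compatible with diff."""
    one_hot = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
    return [
        (flags, j)
        for flags, j in product(one_hot, range(n_ins + 1))
        if satisfies_insertion_constraints(diff, *flags, j, n_ins, n_del)
    ]


for diff in range(-2, 3):
    print(diff, feasible_assignments(diff, n_ins=2, n_del=2))
```

For every difference in the allowed range exactly one assignment survives: the indicator matches the sign of the difference and $j$ equals the number of insertions (0 when the color is exact or deleted).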

We also give six constraints, built similarly to constraints (12)-(17), to bound the number of deletions (see Section 8.1 of the Appendix).

Lemma 2.1 Our LPB program correctly solves GRAPH MOTIF.

Proof omitted due to space constraints (see Section 8.2 of the Appendix).

Let us now define two preprocessing steps to speed up our LPB program.

First, let us remark that a protein in the motif without any homologous protein in the network will be considered as a deletion in any feasible result. Let $D$ be the set of all colors corresponding to such proteins in the motif $M$. If the size of $D$ exceeds $N_{del}$, then no solution is possible for this motif. Otherwise, we already know that all proteins corresponding to colors in $D$ will be deleted in any solution. Thus, we launch the LPB program over the motif $M \setminus D$, with $N_{del} - |D|$ allowed deletions.
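This first preprocessing step can be sketched as follows. The representation of the motif as a list of colors, the set of colors present in the network and the function name are assumptions made for the example; each motif element without a homolog is counted as one forced deletion.

```python
def remove_unmatchable_proteins(motif_colors, network_colors, n_del):
    """Drop motif colors that occur nowhere in the network.

    Returns (reduced_motif, remaining_deletions), or None when the forced
    deletions already exceed N_del and no solution can exist.
    """
    forced = [c for c in motif_colors if c not in network_colors]  # the set D
    if len(forced) > n_del:
        return None
    kept = [c for c in motif_colors if c in network_colors]        # M \ D
    return kept, n_del - len(forced)


# Hypothetical motif of four proteins, one of which has no homolog.
motif = ["c1", "c2", "c3", "c4"]
present = {"c1", "c2", "c4"}

print(remove_unmatchable_proteins(motif, present, n_del=2))  # (['c1', 'c2', 'c4'], 1)
print(remove_unmatchable_proteins(motif, present, n_del=0))  # None
```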

Then, we prune the network and run the LPB solver on each connected component, as shown in [9]. Indeed, an uncolored node of $G$ can be too “far” from any colored node, in terms of shortest path length, to be inserted in the solution given the maximum number of allowed insertions (i.e., $N_{ins}$). Accordingly, we only keep an uncolored node $u$ in $G$ if there exist two colored nodes $v_1$ and $v_2$ such that $\mathrm{dist}(u, v_1) + \mathrm{dist}(u, v_2) \le N_{ins} + 1$, where $\mathrm{dist}(u, v)$ is the length of the shortest path between $u$ and $v$. Otherwise, $u$ would never be part of a solution, and hence can be safely deleted from $G$.
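A possible implementation of this pruning rule is sketched below with a depth-bounded breadth-first search from each uncolored node. It assumes that $v_1$ and $v_2$ are two distinct colored nodes and that the network is given as an adjacency dictionary; neither the names nor the data layout come from GraMoFoNe or Torque.

```python
from collections import deque


def bfs_distances(adj, source, max_depth):
    """Shortest-path lengths from source, explored up to max_depth."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == max_depth:
            continue
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def prune_uncolored(adj, colored, n_ins):
    """Keep colored nodes, plus uncolored nodes u with two distinct colored
    nodes v1, v2 such that dist(u, v1) + dist(u, v2) <= N_ins + 1."""
    keep = set(colored)
    for u in adj:
        if u in colored:
            continue
        dist = bfs_distances(adj, u, n_ins + 1)
        nearest = sorted(d for v, d in dist.items() if v in colored)
        if len(nearest) >= 2 and nearest[0] + nearest[1] <= n_ins + 1:
            keep.add(u)
    return {u: [w for w in adj[u] if w in keep] for u in adj if u in keep}


# Hypothetical path a - x - b - y - z where only a and b are colored.
adj = {"a": ["x"], "x": ["a", "b"], "b": ["x", "y"], "y": ["b", "z"], "z": ["y"]}
pruned = prune_uncolored(adj, colored={"a", "b"}, n_ins=1)
print(sorted(pruned))  # ['a', 'b', 'x']
```

Node x lies between two colored nodes and is kept, while y and z can only reach one colored node within the allowed distance and are removed.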

Once $G$ is pruned, the LPB program is run on each valid connected component of $G$. A component is said to be valid if it contains at least $k - N_{del}$ colored nodes. Otherwise, a connected solution could never be found in this component, and hence there is no need to consider it. As stated in [9], in practice only 5% of the nodes of $G$ are colored.
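The validity test on components can be illustrated with a short traversal. This is a sketch under the same assumed adjacency-dictionary representation as before; the function name is hypothetical.

```python
def valid_components(adj, colored, k, n_del):
    """Connected components of G containing at least k - N_del colored nodes."""
    seen = set()
    result = []
    for start in adj:
        if start in seen:
            continue
        component = set()
        stack = [start]
        seen.add(start)
        while stack:                      # iterative depth-first search
            u = stack.pop()
            component.add(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        if len(component & colored) >= k - n_del:
            result.append(component)
    return result


# Hypothetical network with two components; a, b and d are colored.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": ["e"], "e": ["d"]}
print(valid_components(adj, colored={"a", "b", "d"}, k=3, n_del=1))
```

With $k = 3$ and $N_{del} = 1$, only the component containing a, b and c has the two colored nodes required.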

3. GraMoFoNe Functionalities

Screenshots of our plugin can be seen on the GraMoFoNe website (http://igm.univ-mlv.fr/AlgoB/gramofone/). The user provides input data and parameters in the left sidebar, networks are drawn in the center and results are presented in the right panel. We now describe the inputs and outputs of GraMoFoNe.

Inputs

The network and the motif. The network has to be loaded into the Cytoscape environment. The motif is either (1) a predefined motif, (2) given manually in a textbox, or (3) loaded as a FASTA file.

BLASTp. Since we consider two proteins as homologous according to their sequence similarity in a BLASTp analysis, we need FASTA files for the motif and the network. These can be provided by the user; otherwise, our plugin tries to retrieve them from the UniProt Archive (UniParc) [6] using EBI Web Services [18]. The user also has to provide the BLASTp threshold value: two proteins are homologous if their $-\log(\text{E-value})$ is above this threshold.

Indels. The user can provide a maximum number of deletions and insertions allowed in a solution, and the corresponding penalty costs used to compute the score of a result.

Outputs

Once the GraMoFoNe routine is launched, the plugin provides the list of potential subnetworks, ordered by their scores, whereas Torque only provides the best solution. The user can see each of these subnetworks highlighted in the full network. The plugin also provides the correspondence between proteins in the result and in the motif. Finally, the plugin can export any such subnetwork as a new network.

4. Results and comparison

To validate our plugin on real data, we ran it in a batch mode (not available through Cytoscape) that tries to retrieve motifs (protein complexes) of six different species in three large PPI networks.

Data acquisition and parameters

The PPI networks of Saccharomyces cerevisiae (Yeast, about 5,500 proteins and 40,000 interactions), Drosophila melanogaster (Fly, about 6,500 proteins and 21,000 interactions) and Homo sapiens (about 8,000 proteins and 29,000 interactions) were downloaded from the Torque website. The Torque authors obtained these data from recent papers and public databases.

The motif data for Yeast, Fly, Human, Mouse, Bovine and Rat were kindly supplied by the Torque authors, who obtained them from the databases SGD [3], AmiGO [1] and CORUM [24].

FASTA files for Yeast, Fly and Human were downloaded from the Torque website, while data for Mouse, Bovine and Rat were downloaded from BioMart [28]. Missing information was added manually from the UniProt [6] and Ensembl [16] databases.

The parameters were set to be as similar as possible to those of Torque. Therefore, the threshold value for BLASTp was set to $-\log(10^{-7}) \approx 16.1$. Two insertions ($N_{ins}$) and deletions ($N_{del}$) were allowed for small motifs (size < 7), three for medium motifs (size 8-14), and four for larger ones. The timeout for the LPB program was set to 500 seconds.
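The threshold value can be reproduced with a short computation. The sketch assumes that the logarithm is the natural logarithm, which is the reading consistent with the value 16.1, and that an E-value reported as 0 counts as a match; the function name is hypothetical.

```python
import math


def is_homologous(e_value, threshold):
    """Two proteins are homologous if -log(E-value) is above the threshold."""
    if e_value <= 0.0:
        # Assumption: an E-value of 0 is treated as a perfect hit.
        return True
    return -math.log(e_value) > threshold


threshold = -math.log(1e-7)       # natural logarithm (assumed)
print(round(threshold, 1))        # 16.1

print(is_homologous(1e-10, threshold))  # True
print(is_homologous(1e-3, threshold))   # False
```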

Figure 1: Comparison of the number of matches between our software (GM) and Torque [9]. Each histogram labeled X/Y corresponds to retrieving a list of motifs of species X in the network of species Y. White (resp. grey) bars correspond to feasible motifs found (resp. not found) in the network. Black bars correspond to motifs for which the timeout was reached before any result. Hence, the whole bar corresponds to feasible motifs.

Experiments

Our tests were run on a 3 GHz personal computer with 2 GB of RAM. We did not compute the Torque values ourselves, since Torque is only available as a web service; we took the values (number of matches) from the Torque paper. Values for GraMoFoNe were computed as follows.

From the list of motifs of a given species, we kept only the feasible ones. We performed the preprocessing on the motif and the network as described previously. Then, we considered a motif as feasible when (i) its size was between 4 and 25, (ii) there were at most $N_{del}$ proteins in the motif without homologous proteins in the network, and (iii) there was at least one connected component in the network with enough colors to match the motif.

Afterwards, for a feasible motif, the LPB program either found a solution (True in Figure 1), found that the motif cannot be matched in the network (False in Figure 1), or did not finish within the time limit (Timeout in Figure 1).

Results

Comparisons between our plugin GraMoFoNe and Torque are given in Figure 1. For most experiments, our plugin finds more feasible motifs (i.e., the sum of “true”, “timeout” and “false” in the figure, or the height of each whole bar) and also more matches (i.e., the height of the white bars) than Torque. These results may be due to differences in our preprocessing methods and to our manual addition of missing information to the FASTA files.

As with Torque, we can query motifs for which no information about the motif topology is available (Bovine, Mouse and Rat). Also as in Torque, we had more unmatched feasible motifs when they were queried in the fly network. According to the Torque authors, this is because the fly data is noisier, with a high rate of false negatives. A motif cannot be found if a false negative disconnects a potential solution. Conversely, false positives do not disturb the connectivity, but can create “bad” solutions.

With the set of parameters defined previously, there are no significant differences in the number of matches whether or not we use the motif as a multiset (i.e. whether or not two homologous proteins in the motif have the same color).

Deciding whether there is a match takes seconds (5-20 for small motifs, 40-60 for larger ones), but finding the best solution can take longer. Moreover, since the LPB solver is used as a “black box”, running times are very hard to predict.

5. Conclusion

In this paper, we presented GraMoFoNe, a new tool to query motifs (multisets of proteins without topology) in protein-protein interaction networks by solving GRAPH MOTIF and some of its extensions, in order to increase knowledge about biological networks. This tool is provided as a plugin for Cytoscape, a popular software package for managing such networks. GraMoFoNe uses the free Linear Pseudo-Boolean solver Sat4JPseudo to give all possible solutions, including the best one.

Since giving all solutions can take time, our tool can also return the first solution found by the LPB solver in a short time. However, in this case, we do not know the quality of this solution compared to the best one (i.e. whether there is another solution with fewer indels). Future work could be to design a fast heuristic that finds a “good” solution in most cases, and to compare it with GraMoFoNe.

Our coloring method is based only on sequence similarity. It would therefore be interesting to extend it to other measures. Similarly, our threshold for homology is fixed; it would also be interesting to have a variable threshold.

The GraMoFoNe plugin and batch program are released under the GPL license and are available at http://igm.univ-mlv.fr/AlgoB/gramofone/

6. Acknowledgement

The authors would like to thank Anne Parrain for her help and her quick response to our requests about SAT4JPseudo. We also thank Sharon Bruckner for providing motif data, FASTA files and Torque technical details. We thank Vincent Lacroix for his ideas about using multiset motifs.

7. References

[1] GO Consortium. AmiGO. http://amigo.geneontology.org, Sept. 2008.

[2] PB Evaluation 07 – special event of the SAT 2007 conference. http://www.cril.univ-artois.fr/PB07/.

[3] SGD project. "Saccharomyces Genome Database". http://www.yeastgenome.org, Sept. 2008.

[4] N. Alon, R. Yuster, and U. Zwick. Color coding. JACM, 42(4):844–856, 1995.

[5] S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman. Basic local alignment search tool. JMB, 215(3):403–410, 1990.

[6] A. Bairoch, R. Apweiler, et al. The universal protein resource (UniProt). NAR, 33:D154, 2005.

[7] N. Betzler, M. Fellows, C. Komusiewicz, and R. Niedermeier. Parameterized algorithms and hardness results for some graph motif problems. In CPM, volume 5029 of LNCS, pages 31–43, 2008.

[8] G. Blin, F. Sikora, and S. Vialette. Querying Protein-Protein Interaction Networks. In ISBRA, volume 5542 of LNBI, pages 52–62, 2009.

[9] S. Bruckner, F. Huffner, R. M. Karp, R. Shamir, and R. Sharan. Topology-free querying of protein interaction networks. In RECOMB. Springer, 2009.

[10] T. G. O. Consortium. Gene Ontology: tool for the unification of biology. Nature Genet, 25:25–29, 2000.

[11] R. Dondi, G. Fertin, and S. Vialette. Maximum Motif Problem in Vertex-Colored Graphs. In CPM, 2009.

[12] B. Dost, T. Shlomi, N. Gupta, E. Ruppin, V. Bafna, and R. Sharan. QNet: A Tool for Querying Protein Interaction Networks. RECOMB, pages 1–15, 2007.

[13] M. Fellows, G. Fertin, D. Hermelin, and S. Vialette. Sharp tractability borderlines for finding connected motifs in vertex-colored graphs. In ICALP, volume 4596 of LNCS, pages 340–351, 2007.

[14] A. Gavin, M. Boshe, et al. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 414(6868):141–147, 2002.

[15] Y. Ho, A. Gruhler, et al. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415(6868):180–183, 2002.

[16] T. Hubbard, B. Aken, et al. Ensembl 2009. NAR, 37:D690, 2009.

[17] B. Kelley, R. Sharan, R. Karp, T. Sittler, D. E. Root, B. Stockwell, and T. Ideker. Conserved pathways within bacteria and yeast as revealed by global protein network alignment. PNAS, 100(20):11394–11399, 2003.

[18] A. Labarga, F. Valentin, M. Anderson, and R. Lopez. Web services at the European bioinformatics institute. NAR, 35:W6, 2007.

[19] V. Lacroix, C. Fernandes, and M.-F. Sagot. Motif search in graphs: application to metabolic networks. TCBB, 3(4):360–368, 2006.

[20] D. Le Berre and A. Parrain. On extending SAT solvers for PB problems. In RCRA, 2007.

[21] M. Pellegrini, E. Marcotte, M. Thompson, D. Eisenberg, and T. Yeates. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. PNAS, 96(8):4285–4288, 1999.

[22] R. Pinter, O. Rokhlenko, E. Yeger-Lotem, and M. Ziv-Ukelson. Alignment of metabolic pathways. Bioinformatics, 21(16):3401–3408, 2005.

[23] T. Reguly, A. Breitkreutz, et al. Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae. Journal of Biology, 2006.

[24] A. Ruepp, B. Brauner, I. Dunger-Kaltenbach, et al. CORUM: the comprehensive resource of mammalian protein complexes. NAR, 2007.

[25] A. Schrijver. Theory of Linear and Integer Programming. John Wiley and Sons, 1998.

[26] P. Shannon, A. Markiel, O. Ozier, et al. Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks. Genome Research, 13:2498–2504, 2003.

[27] T. Shlomi, D. Segal, E. Ruppin, and R. Sharan. QPath: a method for querying pathways in a protein-protein interaction network. BMC Bioinformatics, 7:199, 2006.

[28] D. Smedley, S. Haider, et al. BioMart – biological queries made easy. volume 10, page 22. BioMed Central Ltd, 2009.

[29] P. Uetz, L. Giot, et al. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature, 403(6770):623–627, 2000.

8. Appendix

8.1. Constraints to bound the number of deletions

The following six constraints ensure that the number of deletions is at most $N_{del}$.

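\[ \forall c \in C,\quad \sum_{i=0}^{N_{del}} \mathrm{ndel}_c[i] = 1 \tag{18} \]

\[ \forall c \in C,\quad \mathrm{IsDel}_c \Rightarrow \mathrm{ndel}_c[0] = 0 \tag{19} \]

\[ \forall c \in C,\quad \neg \mathrm{IsIns}_c + \mathrm{ndel}_c[0] \ge 1 \tag{20} \]

\[ \forall c \in C,\ \forall\, 0 \le i \le N_{del},\quad -\sum_{v \in V} \mathrm{ColV}[v][c] + \mathrm{occ}_M(c) \le i \cdot \mathrm{ndel}_c[i] + \neg \mathrm{ndel}_c[i] \cdot N_{del} \tag{21} \]

\[ \forall c \in C,\ \forall\, 0 \le i \le N_{del},\quad \neg \mathrm{ndel}_c[i] - \sum_{v \in V} \mathrm{ColV}[v][c] + \mathrm{occ}_M(c) + N_{ins} \cdot \mathrm{IsIns}_c \ge i \cdot \mathrm{ndel}_c[i] \tag{22} \]

\[ \sum_{c \in C} \sum_{i=1}^{N_{del}} i \cdot \mathrm{ndel}_c[i] \le N_{del} \tag{23} \]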

Constraint (18) ensures that, for a given color $c \in C$, exactly one variable $\mathrm{ndel}_c[i]$ is set, namely the one corresponding to the number of deletions for $c$. Constraint (19) ensures that the variables $\mathrm{ndel}_c$ and $\mathrm{IsDel}_c$ are consistent. Constraint (20) ensures that for a color $c \in C$ there are either deletions or insertions, but not both. Constraints (21) and (22) ensure that $\mathrm{ndel}_c[i] = 1$ iff there are $i$ deletions for the color $c \in C$ (i.e. iff the difference between $\mathrm{occ}_M(c)$ and $\sum_{v \in V} \mathrm{ColV}[v][c]$ is equal to $i$). Constraint (23) ensures that the number of deletions is bounded by $N_{del}$: the sum of the deletions over all colors must be at most $N_{del}$.

8.2. Proof of Lemma 2.1

Proof. We first prove the lemma assuming that no indels are allowed. The extension to the case with indels is straightforward and is given afterwards.

Let us first prove that any solution to GRAPH MOTIF can be found by our LPB program, i.e. that it has a corresponding variable assignment satisfying all the LPB constraints defined previously.

Given a solution $G[R]$ to GRAPH MOTIF, set $x_v = 1$ if $v \in R$ and $x_v = 0$ otherwise, and set $e_{uv} = 1$ if both $u$ and $v$ are in $R$ and $e_{uv} = 0$ otherwise. Find a spanning tree $T$ of $G[R]$ (this tree exists since $G[R]$ is connected) and label the nodes of $G[R]$ according to a postorder traversal of $T$.

Since no indels are allowed, $|R| = k$. By definition, exactly $k$ variables $x_v$ are equal to 1 and thus constraints (1) and (2) hold. The labeling induced by the postorder traversal of $T$ ensures that all nodes of $R$ have a unique and distinct label. Therefore, constraints (3) and (5) hold. Since no label is given to nodes not belonging to $R$, Constraint (4) holds. Moreover, in $T$, according to the postorder traversal, the parent of any node $v$ other than the root has a label greater than that of $v$. Therefore, in $G[R]$, any node other than the root has at least one neighbor with a greater label. Thus, Constraint (6) holds. Since $\sigma : R \to M$ is a bijection, there are exactly $\mathrm{occ}_M(c)$ occurrences of each color $c \in C$ in $R$, and hence Constraint (7) holds. Moreover, there is only one image $\sigma(v)$ associated with any $v \in R$, thus the sum in (8) is equal to 1 and the constraint holds when $x_v = 1$. By the bijection $\sigma : R \to M$, any element in $M$ is associated with an element in $R$. Thus, no node $v \notin R$ has an image in $M$, the sum in (8) is equal to 0 and the constraint also holds when $x_v = 0$. Since no indels are allowed, every color $c$ is matched. Thus, Constraint (9) holds. Moreover, $\sum_{v \in V} \mathrm{ColV}[v][c] - \mathrm{occ}_M(c)$ is equal to 0. Constraints (10) and (11) hold iff $\mathrm{IsExact}_c = 1$. Indeed, if $\mathrm{IsIns}_c = 1$, Constraint (11) does not hold ($0 \ge 1$), and if $\mathrm{IsDel}_c = 1$, Constraint (10) does not hold ($0 \le -1$). For each color $c$, constraints (12) to (23) hold if $\mathrm{nins}_c[0] = 1$ and $\mathrm{ndel}_c[0] = 1$, which is the case when no indels are allowed.
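The labeling used in this direction of the proof can be sketched in Python: a depth-first search of $G[R]$ implicitly builds a spanning tree, and each node receives its label when the search leaves it, so every parent is labeled after its children. The adjacency-dictionary representation and the function name are assumptions; the recursive form is meant for small motif-sized node sets.

```python
def postorder_labels(adj, R, root):
    """Label the nodes of G[R] by a postorder traversal of a DFS spanning tree.

    Labels run from 1 to |R|; the root receives the largest label.
    Assumes G[R] is connected and root belongs to R.
    """
    labels = {}
    visited = {root}

    def visit(u):
        for w in adj[u]:
            if w in R and w not in visited:
                visited.add(w)
                visit(w)             # w becomes a child of u in the tree
        labels[u] = len(labels) + 1  # assigned after all children of u

    visit(root)
    return labels


# Hypothetical connected solution: the path a - b - c - d.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
labels = postorder_labels(adj, R={"a", "b", "c", "d"}, root="a")
print(labels)  # {'d': 1, 'c': 2, 'b': 3, 'a': 4}
```

Every node except the root has a neighbor, its parent, with a greater label, which is the property required by Constraint (6).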

Let us now prove that a solution to our LPB program corresponds to a solution to GRAPH MOTIF.

Given an LPB solution, for any $v \in V$, add $v$ to $R$ if $x_v = 1$. Constraint (2) ensures that $|R| = k$. According to constraints (2) and (7), $R$ and $M$ are finite sets with exactly the same number of elements (i.e., $|R| = |M|$). By Constraint (8), if $\sigma(v)$ is defined, then $v \in R$ (otherwise, $x_v = 0$ and all variables $\mathrm{ColV}[v][c]$ are equal to 0 for this $v$ and for all $c \in \mathrm{col}(v)$). By constraints (7) and (8), for any $c \in M$, there is only one $v \in R$ such that $\sigma(v) = c$ (a node $v$ can match at most one color and there are exactly the same number of elements in $R$ and $M$). Thus, on the whole, $\sigma : R \to M$ is a bijection. It remains to show that $G[R]$ is connected.

By Constraint (3), every node in $R$ has a label. Let $r$ be the node with the greatest label. By Constraint (5), this label is unique. Let us show that there exists a path in $R$ connecting any node $v \in R$ to $r$. To do so, let us prove by induction that there is a path from $v$ to $r$ with increasing labels. The case $v = r$ is trivial. Suppose there exists a path $p$ in $R$ of length $l$ starting from $v$ with increasing labels. Let $s_p$ be the sink of $p$ (i.e. the last node in $p$). If $s_p = r$, then we are done. Otherwise, by Constraint (6), $s_p$ has at least one neighbor $u$ with a label greater than its own. Then, there is a path $p \cup \{u\}$ of length $l + 1$ with increasing labels. Since labels are bounded by $k + N_{ins}$, this extension stops after finitely many steps, and it can only stop at $r$.
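The inductive argument can be mirrored by a small routine that repeatedly moves to a neighbor with a greater label. This is an illustrative sketch with hypothetical names; it assumes the labels satisfy constraints (3) to (6), so the walk can only stop at the node with the greatest label.

```python
def increasing_path(adj, labels, start):
    """Follow neighbors with strictly greater labels until none exists.

    Terminates because labels strictly increase along the path.
    """
    path = [start]
    while True:
        current = path[-1]
        higher = [u for u in adj[current] if labels.get(u, 0) > labels[current]]
        if not higher:
            return path
        path.append(min(higher, key=labels.get))  # any higher neighbor would do


# Hypothetical labeled solution: a star centered on c, with r = d.
adj = {"a": ["c"], "b": ["c"], "c": ["a", "b", "d"], "d": ["c"]}
labels = {"a": 1, "b": 2, "c": 3, "d": 4}

print(increasing_path(adj, labels, "a"))  # ['a', 'c', 'd']
print(increasing_path(adj, labels, "b"))  # ['b', 'c', 'd']
```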

Let us now prove Lemma 2.1 when indels are allowed.

Let us first show that constraints (9) to (11) are consistent when indels are allowed. We have already shown that these constraints hold if there are no indels.

• If there are $i$ insertions for a color $c$, then $\sum_{v \in V} \mathrm{ColV}[v][c] - \mathrm{occ}_M(c) = i$. Constraint (9) ensures that exactly one variable among $\mathrm{IsExact}_c$, $\mathrm{IsIns}_c$ and $\mathrm{IsDel}_c$ is equal to 1.

– If $\mathrm{IsExact}_c = 1$, then constraints (10) ($i \le 0$) and (11) ($i \ge 0$) are both true iff $i = 0$.

– If $\mathrm{IsIns}_c = 1$, then constraints (10) ($i \le N_{ins}$) and (11) ($i \ge 1$) are both true iff $1 \le i \le N_{ins}$, which is the case here.

– If $\mathrm{IsDel}_c = 1$, then Constraint (10) ($i \le -1$) does not hold since the number of insertions is positive ($i > 0$).

• If there are $d$ deletions for a color $c$, then $\sum_{v \in V} \mathrm{ColV}[v][c] - \mathrm{occ}_M(c) = -d$.

– If $\mathrm{IsExact}_c = 1$, then constraints (10) ($-d \le 0$) and (11) ($-d \ge 0$) are both true iff $d = 0$.

– If $\mathrm{IsIns}_c = 1$, then Constraint (11) ($-d \ge 1$) does not hold since the number of deletions is positive ($d > 0$).

– If $\mathrm{IsDel}_c = 1$, then constraints (10) ($-d \le -1$) and (11) ($-d \ge 1 - 1 - N_{del}$) are both true iff $-N_{del} \le -d \le -1$, which is the case here.

Let us now show that constraints (12) to (16) and constraints (18) to (22) are consistent with the number of indels for a given color in a solution of GRAPH MOTIF.

• If there are $i$ insertions for a color $c$, then $\sum_{v \in V} \mathrm{ColV}[v][c] - \mathrm{occ}_M(c) = i$ and $\mathrm{IsIns}_c = 1$, $\mathrm{IsDel}_c = \mathrm{IsExact}_c = 0$. Constraints (12) and (13) ensure that there is exactly one index $j \neq 0$ such that $\mathrm{nins}_c[j] = 1$. Constraint (14) holds since $\mathrm{IsDel}_c = 0$. Constraints (15) and (16) both hold if $j = i$ ($i \le i$ and $i \ge i$). Otherwise, if $j \neq i$, constraints (15) and (16) hold iff $j \le i \le j$ ($i \le j$ and $i \ge j$), which is impossible since $j \neq i$.

Since $\mathrm{IsIns}_c = 1$, Constraint (20) holds iff $\mathrm{ndel}_c[0] = 1$. Then, Constraint (18) holds. Since $\mathrm{IsDel}_c = 0$, Constraint (19) holds. Finally, constraints (21) ($-i \le 0$) and (22) ($-i + N_{ins} \ge 0$) hold when $\mathrm{ndel}_c[0] = 1$.

• If there are $d$ deletions for a color $c$, then $\sum_{v \in V} \mathrm{ColV}[v][c] - \mathrm{occ}_M(c) = -d$ and $\mathrm{IsDel}_c = 1$, $\mathrm{IsIns}_c = \mathrm{IsExact}_c = 0$. Thus, Constraint (13) holds. Since $\mathrm{IsDel}_c = 1$, Constraint (14) holds iff $\mathrm{nins}_c[0] = 1$. Thus, Constraint (12) holds. Finally, constraints (15) ($-d \le N_{ins}$) and (16) ($-d + N_{del} \ge 0$) hold when $\mathrm{nins}_c[0] = 1$.

Constraint (20) holds since $\mathrm{IsIns}_c = 0$. Constraints (18) and (19) ensure that there is exactly one index $j \neq 0$ such that $\mathrm{ndel}_c[j] = 1$. Constraints (21) and (22) both hold if $j = d$ ($d \le d$ and $d \ge d$). Otherwise, if $j \neq d$, constraints (21) and (22) hold iff $j \le d \le j$ ($d \le j$ and $d \ge j$), which is impossible since $j \neq d$.

In both cases, constraints (17) and (23) hold iff the overall numbers of insertions and deletions are at most $N_{ins}$ and $N_{del}$, respectively.
