
Automatic analysis of scientific articles

Cyril Labbé, Université Grenoble Alpes - LIG - Sigma team

June 25, 2019

C.Labbe (UGA-LIG) Ike Antkare & Co June 25, 2019 1 / 44


Why write?

Table of Contents

1 Why write?

2 Publications and Scientometrics
    Scientometrics: what for?
    SCIgen, a Probabilistic Context Free Grammar

3 Of the use of fake publications
    h-index hacking
    Resume Padding
    Journal Hijacking

4 Detection of SCIgen papers
    Google Search
    SciDetect: Automatic detection

5 Automatic detection of questionable research papers
    Fact checking science
    Seek & Blastn tool


To build scientific knowledge

The ancestors (1665)

London: Philosophical Transactions of the Royal Society,

Paris: Journal des sçavans.

Specific features of scientific publications:

a specialist audience,

contributions to the "scientific debate" through original work.


Publishing an article

[Figure: the publication workflow. Nodes: Scientist; three Peers; Editor (Scientist); Publisher (print and distribute); Readers, Libraries. Edge labels: Writes; Evaluation; Supervision; © transfer; Access Fee. The example article is "Computing Machinery and Intelligence".]


New scientific information systems

A large number of information sources:

The catalogues of scientific publishers

Open archives and social networks

Information has varied characteristics:

Paid or free access: public, restricted or private

Peer-reviewed or not

For varied purposes:

State of the art / Bibliometrics / Scientometrics

The scientific article is at the heart of the system:

How can the validity of the information presented be guaranteed?

How can its quality be guaranteed?

Are some systems more virtuous than others?


Publications and Scientometrics


Publications and Scientometrics Scientometrics: what for?

Ranking scientists and journals

Definition (Impact Factor)

Average number of citations to papers published by the journal over the last two years. Computed since 1975.

[Figure: citations as a function of time after publication, with a 2-year window marked.]

Definition (h-index [Hirsch, 2005])

A scientist has index $h$ if $h$ of his or her $N_p$ papers have at least $h$ citations each and the other $(N_p - h)$ papers have at most $h$ citations each.

[Figure: papers ranked by decreasing number of citations (x-axis: Papers, from 0 to $N_p$; y-axis: Number of citations), with $h$ marked on both axes.]
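The h-index definition above can be turned into a few lines of code. This is a minimal sketch, not taken from the slides: the function name `h_index` and the sample citation counts are illustrative assumptions.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)  # most cited paper first
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank      # the top `rank` papers all have >= rank citations
        else:
            break         # counts only decrease from here, so h cannot grow
    return h


# Hypothetical citation counts for five papers (made-up data).
sample = [10, 8, 5, 4, 3]
print(h_index(sample))
```

Sorting first mirrors the ranked-papers plot on the slide: h is the last rank at which the citation curve is still above the diagonal.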


Ranking universities, journals and scientists

Librarian

What are the must-buys for my readers?

Scientist

Where shall I submit my research?

Research Administration

Who shall I hire? Who deserves a promotion?

Students

Where to study? With whom? In which country?

Government

Who deserves investment? What for? Which scientific field?

Impact Factor

Average number of citations (....) over the last two years. Computed since 1975.

h-index and variations (http://sci2s.ugr.es/hindex)

h5-index, g-index, hm-index, a-index, hg-index, ar-index...

ARWU

Academic Ranking of World Universities (Shanghai ranking), since 2003.

Collaborative distance


Quantitative rules.

In France...

Publishing ("publiant"): at least 1 publication per year, or 2 rank-A publications over the period.

Producing ("produisant"): the arguments that allow a non-publishing person to be considered as producing.

... and elsewhere

"at least one international publication per year"

Rules for defense (MS Thesis, PhD thesis)


Chronos

[Timeline figure, axis from 2004 to 2018. Labelled events: Scopus; Web of Science; h-index; PoP V1.0; "Abiteboul by the administrator of the Collège de France".]

Automatic text generation


Publications and Scientometrics SCIgen a Probabilistic Context Free Grammar

PCFG: Probabilistic Context Free Grammar

Sets of symbols

Set of non-terminal symbols $N = \{SP, S, V, P\}$,

Set of terminal symbols $\Sigma = \{\text{"."}, sing, dance, flight, seas, oceans, air, streets, hills, fields\}$.

Set of rules $R_i$

$R_1 : SP \to S.$  with $p(R_1) = 1$

$R_2 : S \to$ We shall $V$ in the $P$  with $p(R_2) = 1/4$

$R_3 : S \to S, S$  with $p(R_3) = 1/2$

$R_4 : S \to$ We shall $V$ in the $P$ and in the $P$, $S$  with $p(R_4) = 1/4$

(Annotation on $R_2$, $R_3$, $R_4$: non-zero probabilities, summing to 1.)

$R_{5..7} : V \to sing \mid dance \mid flight$  with $p(R_i) = 1/3$, $i = 5..7$

$R_{8..13} : P \to seas \mid oceans \mid air \mid streets \mid hills \mid fields$  with $p(R_i) = 1/6$, $i = 8..13$

Terminal string example:

$s$ : We shall sing in the air and in the hills, We shall dance in the fields.

$p(s) = \prod_j p(R_j)$
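As a worked instance of $p(s) = \prod_j p(R_j)$, the sketch below multiplies the probabilities of the rules used to derive the example string. The rule numbering inside R5..R13 (R5 = sing, R6 = dance, R10 = air, R12 = hills, R13 = fields) is an assumption based on the order in which the alternatives are listed on the slide; the names `PROB`, `LHS` and `DERIVATION` are illustrative.

```python
from fractions import Fraction
from math import prod

# Rule probabilities as given on the slide.
PROB = {"R1": Fraction(1), "R2": Fraction(1, 4),
        "R3": Fraction(1, 2), "R4": Fraction(1, 4)}
PROB.update({f"R{i}": Fraction(1, 3) for i in range(5, 8)})    # V -> sing|dance|flight
PROB.update({f"R{i}": Fraction(1, 6) for i in range(8, 14)})   # P -> seas|...|fields

# Left-hand side of each rule, used to check that probabilities sum to 1.
LHS = {"R1": "SP", "R2": "S", "R3": "S", "R4": "S"}
LHS.update({f"R{i}": "V" for i in range(5, 8)})
LHS.update({f"R{i}": "P" for i in range(8, 14)})


def totals_by_lhs():
    """Sum the rule probabilities for each non-terminal."""
    totals = {}
    for rule, p in PROB.items():
        totals[LHS[rule]] = totals.get(LHS[rule], Fraction(0)) + p
    return totals


def derivation_probability(rules):
    """Probability of a derivation: product of the probabilities of its rules."""
    return prod((PROB[r] for r in rules), start=Fraction(1))


# "We shall sing in the air and in the hills, We shall dance in the fields."
# SP -> S.            (R1)
# S  -> ... and ..., S (R4) with V=sing (R5), P=air (R10), P=hills (R12)
# S  -> We shall ...   (R2) with V=dance (R6), P=fields (R13)
DERIVATION = ["R1", "R4", "R5", "R10", "R12", "R2", "R6", "R13"]

print(totals_by_lhs())
print(derivation_probability(DERIVATION))
```

The sketch only scores a fixed derivation. It deliberately does not sample from the grammar: with $p(R_3) = 1/2$ each S produces more than one S on average, so naive recursive sampling may not terminate.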


SCIgen 2005 by J. Stribling, M. Krohn & D. Aguayo

... maximize amusement, rather than coherence ...

[Figure: structure of a generated paper: Title, Abstract, Introduction, Model, Impl, Eval, RelatedWork, Concl, References. Introduction expands to Intro_A, Intro_A2, Intro_A3, Intro_closing.]

Intro_A $\to$ Many SCI_PEOPLE would agree that, had it not been for SCI_GENERIC_NOUN, ...

Intro_A $\to$ In recent years, much research has been devoted to the SCI_ACT; , ...

Intro_A $\to$ SCI_THING_MOD and SCI_THING_MOD, while SCI_ADJ in theory, have not until...

Intro_A $\to$ The SCI_ACT is a SCI_ADJ SCI_PROBLEM.

Intro_A $\to$ The SCI_ACT has SCI_VERBED SCI_THING_MOD, and current trends...

Intro_A $\to$ The implications of SCI_BUZZWORD_ADJ SCI_BUZZWORD_NOUN have...

... $\to$ ...

SCI_PEOPLE $\to$ steganographers, cyberinformaticians, futurists, cyberneticists, ...

SCI_BUZZWORD_ADJ $\to$ omniscient, introspective, peer-to-peer, ambimorphic, ...


Chronos

[Timeline figure, axis from 2004 to 2014. Labelled events: Scopus (Elsevier); Web of Science (Thomson Reuters); SCIgen; h-index; PoP V1.0; "Abiteboul by the administrator of the Collège de France".]


Of the use of fake publications


Of the use of fake publications h-index hacking

Building a citation farm [Labbe, 2010]

[Figure: a modified SCIgen produces Ike Antkare's 101 documents (a group of 100, numbered from 0, plus 1 more), linked by citations to each other and to real documents.]
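To see why a set of mutually citing generated documents inflates an h-index, here is a toy simulation. It is only a sketch under a loud assumption: every generated document cites every other one (a full clique). The actual citation layout used in [Labbe, 2010] may differ, and all names below are hypothetical.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, count in enumerate(ranked, start=1) if count >= rank)


def citation_farm(n_docs):
    """ASSUMPTION: each generated document cites all the other generated documents."""
    return {doc: [other for other in range(n_docs) if other != doc]
            for doc in range(n_docs)}


def citation_counts(references):
    """Count how many times each document is cited."""
    counts = {doc: 0 for doc in references}
    for cited_docs in references.values():
        for cited in cited_docs:
            counts[cited] += 1
    return counts


farm = citation_farm(100)
counts = citation_counts(farm)
print(h_index(counts.values()))
```

Under this assumption each of the 100 documents receives 99 citations, so the fictitious author's h-index is bounded only by the number of documents generated.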


Ike Antkare h-index [Labbe, 2010]


Chronos

[Timeline figure, axis from 2004 to 2018. Labelled events: Scopus (Elsevier); Web of Science (Thomson Reuters); SCIgen; h-index; PoP V1.0; "Abiteboul by the administrator of the Collège de France"; Ike Antkare.]


Of the use of fake publications Resume Padding

IEEE Xplore: 12 Nov. 2014


IEEE Xplore: 2 Feb. 2016


Of the use of fake publications Journal Hijacking

Beware: Hijacking. Jeffrey Beall, http://scholarlyoa.com


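Publication: Gold Open Access

[Figure: the publication workflow under Gold Open Access. Nodes: Scientist; three Peers; Editor (Scientist); Publisher (print and distribute); Readers, Libraries. Edge labels: Writes; Evaluation; Supervision; © transfer; Publication Fee; For free. The example article is "Computing Machinery and Intelligence".]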


Beware: Predatory Publishers

"Get me off Your Fucking Mailing List"

David Mazieres and Eddie Kohler, New York University and University of California, Los Angeles, http://www.mailavenger.org/

[Figure: first page of the paper. The abstract and the introduction consist solely of the sentence "Get me off your fucking mailing list." repeated.]


Detection of SCIgen papers


Detection of SCIgen papers Google Search

Phrase search

Many SCI_PEOPLE would agree that, had it not been for SCI_GENERIC_NOUN, ...

In recent years, much research has been devoted to the SCI_ACT; ...

SCI_THING_MOD and SCI_THING_MOD, while SCI_ADJ in theory, have not until ...

The SCI_ACT has SCI_VERBED SCI_THING_MOD, and current trends ...

The implications of SCI_BUZZWORD_ADJ SCI_BUZZWORD_NOUN have ...
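The fixed words of a SCIgen template survive in every generated sentence, which is what makes phrase search work. The sketch below illustrates the idea locally with a regular expression instead of a search engine: each SCI_* placeholder becomes a wildcard. The function name `template_to_regex` is hypothetical, and the test sentence is made up for illustration ("steganographers" comes from the SCI_PEOPLE list, the rest is invented).

```python
import re

# Placeholders are assumed to look like SCI_PEOPLE, SCI_GENERIC_NOUN, ...
PLACEHOLDER = re.compile(r"SCI_[A-Z_]+")


def template_to_regex(template):
    """Keep the literal words of a template, turn each placeholder into a wildcard."""
    parts = []
    position = 0
    for match in PLACEHOLDER.finditer(template):
        parts.append(re.escape(template[position:match.start()]))  # literal text
        parts.append(r".+?")                                        # any filler
        position = match.end()
    parts.append(re.escape(template[position:]))
    return re.compile("".join(parts), re.IGNORECASE)


TEMPLATE = "Many SCI_PEOPLE would agree that, had it not been for SCI_GENERIC_NOUN,"
PATTERN = template_to_regex(TEMPLATE)

# Invented sentence in the style of a generated introduction.
suspect = "Many steganographers would agree that, had it not been for IPv6, ..."
print(bool(PATTERN.search(suspect)))
```

A search engine phrase query with wildcards plays the same role at web scale; the regex version is only meant for checking a local text.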


Detection of SCIgen papers SciDetect: Automatic detection

Inter-textual distance: [Labbe and Labbe, 2006]

Relative-frequency vectors over the vocabulary (chat, le, un):

A: {le le chat} $\to (\tfrac{1}{3}, \tfrac{2}{3}, \tfrac{0}{3})$

B: {un chat chat} $\to (\tfrac{2}{3}, \tfrac{0}{3}, \tfrac{1}{3})$

[Figure: the two texts "le le chat" and "un chat chat" plotted on the axes chat, le, un, with coordinates 1/3 and 2/3 marked.]

Inter-textual distance: $D(A,B) = \frac{1}{2} \sum_{i \in (A \cup B)} |f_{i,A} - f_{i,B}| = \frac{2}{3}$

Interpretation:

$D(A,B)$ is the proportion of words (word tokens) that differ between the two texts.
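The formula can be checked on the two toy texts with a few lines of code. This is a minimal sketch of the relative-frequency form shown on the slide; tokenisation by whitespace and the function names are assumptions, and any length normalisation used in [Labbe and Labbe, 2006] for texts of different sizes is not reproduced here.

```python
from collections import Counter
from fractions import Fraction


def relative_frequencies(tokens):
    """Map each word to its relative frequency in the token list."""
    counts = Counter(tokens)
    return {word: Fraction(count, len(tokens)) for word, count in counts.items()}


def intertextual_distance(tokens_a, tokens_b):
    """Half the sum of absolute differences of relative frequencies."""
    freq_a = relative_frequencies(tokens_a)
    freq_b = relative_frequencies(tokens_b)
    vocabulary = set(freq_a) | set(freq_b)
    return sum(abs(freq_a.get(w, 0) - freq_b.get(w, 0)) for w in vocabulary) / 2


text_a = "le le chat".split()
text_b = "un chat chat".split()
print(intertextual_distance(text_a, text_b))
```

Exact fractions are used so that the result can be compared directly with the value on the slide; the distance is 0 for identical frequency profiles and 1 for texts sharing no word.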


Hierarchical clustering [Labbe and Labbe, 2013]

Initial distances:

      A     B     C     D
A     0     0.2   0.3   0.5
B     0.2   0     0.4   0.6
C     0.3   0.4   0     0.3
D     0.5   0.6   0.3   0

A and B form the group I

$D(I,x) = \frac{1}{2} (D(A,x) + D(B,x))$

      I     C     D
I     0     0.35  0.55
C     0.35  0     0.3
D     0.55  0.3   0

C and D form the group J

$D(I,J) = \frac{1}{|I||J|} \sum_{i \in I} \sum_{j \in J} D(i,j)$

      I     J
I     0     0.45
J     0.45  0
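The three tables can be reproduced with a short agglomerative loop that always merges the two closest groups and measures group distance as the mean of the pairwise distances. This is an illustrative sketch in plain Python, not the authors' implementation; `average_linkage` and `DIST` are hypothetical names.

```python
from itertools import combinations

# Pairwise distances from the initial table.
DIST = {frozenset(pair): value for pair, value in {
    ("A", "B"): 0.2, ("A", "C"): 0.3, ("A", "D"): 0.5,
    ("B", "C"): 0.4, ("B", "D"): 0.6, ("C", "D"): 0.3,
}.items()}


def average_linkage(dist, labels):
    """Merge the two closest groups until one remains; return the merge history."""
    clusters = [frozenset([label]) for label in labels]

    def group_distance(i, j):
        # Mean of D(a, b) over all a in i and b in j.
        total = sum(dist[frozenset((a, b))] for a in i for b in j)
        return total / (len(i) * len(j))

    merges = []
    while len(clusters) > 1:
        i, j = min(combinations(clusters, 2), key=lambda p: group_distance(*p))
        merges.append((sorted(i), sorted(j), round(group_distance(i, j), 10)))
        clusters = [c for c in clusters if c not in (i, j)] + [i | j]
    return merges


MERGES = average_linkage(DIST, ["A", "B", "C", "D"])
for left, right, height in MERGES:
    print(left, right, height)
```

The heights at which groups merge are the values read off the vertical axis of a dendrogram.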


Automatic detection [Labbe and Labbe, 2013]

Inter-textual distance:

$\delta(a,b)$ is the proportion of words (tokens) that differ between the two texts.

Hierarchical Clustering

[Figure: dendrogram, vertical axis (distance) from 0.0 to 0.7; leaf groups labelled Corpus Z, MLT and SCIgen.]

Let $t$ be a text to test.

$\delta^{Fake}_t = \min_{f \in SCIgen} \delta(t,f)$

If ($\delta^{Fake}_t < \delta^{Threshold}$) Then

A SCIgen generation must be considered (risk $< 10^{-5}$).

Else

A non-SCIgen origin must be considered.
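The decision rule is a nearest-neighbour test against a corpus of known generated texts. The sketch below shows its shape only: the reference texts, the whitespace tokenisation and the threshold value 0.3 are placeholders chosen for illustration, not the corpus or the calibrated threshold of [Labbe and Labbe, 2013].

```python
from collections import Counter


def distance(tokens_a, tokens_b):
    """Inter-textual distance on relative frequencies (value between 0 and 1)."""
    count_a, count_b = Counter(tokens_a), Counter(tokens_b)
    vocabulary = set(count_a) | set(count_b)
    return sum(abs(count_a[w] / len(tokens_a) - count_b[w] / len(tokens_b))
               for w in vocabulary) / 2


def distance_to_nearest_fake(tokens, fake_corpus):
    """Minimum distance between the tested text and any known generated text."""
    return min(distance(tokens, fake) for fake in fake_corpus)


def looks_generated(tokens, fake_corpus, threshold):
    """True when the text is closer to a known fake than the threshold."""
    return distance_to_nearest_fake(tokens, fake_corpus) < threshold


# Toy stand-ins for a corpus of generated texts (made-up data).
FAKES = ["we shall sing in the air".split(),
         "we shall dance in the fields".split()]
HYPOTHETICAL_THRESHOLD = 0.3  # placeholder value, not a calibrated one

print(looks_generated("we shall sing in the fields".split(), FAKES, HYPOTHETICAL_THRESHOLD))
print(looks_generated("peer review takes time".split(), FAKES, HYPOTHETICAL_THRESHOLD))
```

In practice the threshold would be calibrated on texts of known origin so that the risk of flagging a genuine paper stays below the stated bound.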


SCIgen papers and their clones

SSME: Int. Conf. on Services Science, Management and Engineering. 2009.

IEEE Xplore, indexed in Scopus and WoK

150 papers, 4 SCIgen and 1 duplicate.

Official acceptance rate: 28%

SCIgen inside (publishers)

120 IEEE (retracted or deleted),

16 Springer (retracted),

1 Elsevier (accepted-unpublished)

SCIgen inside (social networks)

http://www.researchgate.net

http://scholar.harvard.edu

http://www.academia.edu

Other generators

Mathgen (http://thatsmathematics.com/mathgen/)

The Postmodernism Generator (http://www.elsewhere.org/pomo/)

scigen-physics (https://bitbucket.org/birkenfeld/scigen-physics)

Auto. SBIR Grant Proposal Generator (http://www.nadovich.com/chris/randprop/)


In the international scientific and general press (2014)

Publishers withdraw more than 120 gibberish papers

Fraudulent scientific papers published, then withdrawn

Wissenschaftsverlag löscht 16 sinnfreie Artikel

How computer-generated fake papers are flooding academia

Fake Research Papers: How Did More Than 120 'Gibberish' Computer-Generated Studies Get Published?

Science publisher fooled by gibberish papers

Ike Antkare, le grand scientifique qui n'existait pas

Science Publishers Remove Papers Generated as a Hoax

Fraudulent scientific papers published, then withdrawn

Publier ou périr: faux articles pour faux congrès

How Gobbledygook Ended Up in Respected Scientific Journals

More Computer-Generated Nonsense Papers Pulled From Science Journals

Science publisher fooled by gibberish

Wieder ließen Fachverlage Nonsens ungeprüft durchgehen


Chronos

[Timeline figure, axis from 2004 to 2018. Labelled events: Scopus (Elsevier); Web of Science (Thomson Reuters); SCIgen; h-index; PoP V1.0; "Abiteboul by the administrator of the Collège de France"; Ike Antkare; Nature.]


No SCIgen paper in arXiv (Computer Science)

Automated screening: ArXiv screens spot fake papers

Image borrowed from [Ginsparg, 2014]

Only stop-words

PCA

Supposed non Zipfian


Publication: Self Archiving (Green Open Access)

[Figure: the publication workflow with self-archiving. Nodes: Scientist; three Peers; Editor (Scientist); Publisher (print and distribute); Readers, Libraries; Open Archive. Edge labels: Write; Evaluation; Supervision; © transfer; upload; For free. The article "Computing Machinery and Intelligence" appears both at the publisher and in the open archive.]


Where to find pirated papers

Pirated papers

LibGen

Sci-Hub (Alexandra Elbakyan)

Bohannon J, Elbakyan A (2016) Data from: Who's downloading pirated papers? Everyone. Dryad Digital Repository. https://doi.org/10.5061/dryad.q447c


Overlay Journal: epi-journals

[Figure: the publication workflow with an overlay journal. Nodes: Scientist; three Peers; Editor (Scientist); Overlay Network (provides links to selected papers); Readers, Libraries; Open Archive. Edge labels: Write; upload; Evaluation; Supervision; For free. The article "Computing Machinery and Intelligence" is stored in the open archive.]


Springer-Nature funded SciDetect: http://scidetect.forge.imag.fr

Press release, March 2015

"The open source software discovers text that has been generated with the SCIgen computer program and other fake-paper generators like Mathgen and Physgen."

"SciDetect is highly flexible and can be quickly customized to cope with new methods of automatically generating fake or random text"

Does not cope with other problems

Peer review rings

Paper mills

Black market and authorship selling


Automatic detection of questionable research papers


Automatic detection of questionable research papers [Byrne and Labbe, 2017b, Byrne and Labbe, 2017a]

Scientific ethics

Plagiarism, auto-plagiarism, content reuse...

N-grams signature (hashing functions).

Non-sense detection

Paper generator (SCIgen, physic-gen, MathGen...)

Authorship detection (inter-textual distance).

Need to detect questionable scientific results

Fabrications (making up data or results)

Falsification (manipulating data or results)

False or unsupported affirmations

Genuine errors

$\Rightarrow$

Error spreading

Wrong belief

Research irreproducibility
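The "N-grams signature (hashing functions)" item can be illustrated as follows: each text is reduced to the set of hashes of its word n-grams, and two texts are compared by the overlap of their sets. This is a generic sketch of the idea, not the tool behind the slides; the choice of word trigrams, SHA-1 truncated to 16 hex digits, Jaccard overlap, and the two sample sentences are all assumptions made for illustration.

```python
import hashlib


def ngram_signature(text, n=3):
    """Set of short hashes of all word n-grams of the text."""
    tokens = text.lower().split()
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {hashlib.sha1(gram.encode("utf-8")).hexdigest()[:16] for gram in grams}


def overlap(signature_a, signature_b):
    """Jaccard overlap of two signatures: shared hashes / all hashes."""
    if not signature_a or not signature_b:
        return 0.0
    return len(signature_a & signature_b) / len(signature_a | signature_b)


# Two invented sentences that share most of their wording.
first = "the gene could be a novel therapeutic target for human liver cancer"
second = "the gene could be a novel therapeutic target for glioma treatment"

print(overlap(ngram_signature(first), ngram_signature(second)))
```

Storing hashes rather than the n-grams themselves keeps signatures compact and lets large collections be compared with set operations.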


Automatic detection of questionable research papers Fact checking science

Starting point: striking similarities, obvious errors

Jennifer Byrne:

First reported TPD52L2

(20 years ago)

5 Publications with obvious errors!

5 Publications from China:

Single gene knockdown experiments.

Human cancer cell lines.

Conclusions highlight potential therapy

...TPD52L2... novel therapeutic target for glioma treatment.

...TPD52L2... novel clues for oral squamous cell carcinoma therapy.

...TPD52L2... therapeutic approach for the treatment of breast cancer.

...TPD52L2 is indispensable in gastric cancer proliferation.

...TPD52L2 could be a novel therapeutic target for human liver cancer.


Obvious errors: example

PMID: 25262828

Materials and methods

The shRNA sequence (5'-GCGGAGGGTTTGAAAGAATATCTCGAGATATTCTTTCAAACCCTCCGCTTTTTT-3') targeting TPD52L2 (NM_199360) was inserted into the pFH-L plasmid (Shanghai Hollybio, China). A scrambled shRNA that shared no homology with the mammalian genome (5'-CTAGCCCGGCCAAGGAAGTGCAATTGCATACTCGAGTATGCAATTGCACTTCCTTGGTTTTTTGTTAAT-3') was used as control.

Fact-Check using blastn (NCBI)

    Query= SeqA (evalue = 10)
    Length=54
    Sequences producing significant alignments:
    ...
    > .... Homo sapiens tumor protein D52 like 2 (TPD52L2), ...
    Length=2230
    ...
    Query 1     GCGGAGGGTTTGAAAGAATAT  21
                |||||||||||||||||||||
    Sbjct 894   GCGGAGGGTTTGAAAGAATAT  914
    ....
    Query 28    ATATTCTTTCAAACCCTCCGC  48
                |||||||||||||||||||||
    Sbjct 914   ATATTCTTTCAAACCCTCCGC  894

Fact-Check using blastn (NCBI)

    Query= SeqD (evalue = 10)
    Length=68
    Sequences producing significant alignments:
    ...
    > .... Homo sapiens NIN1/PSMD8 binding protein 1 homolog (NOB1)...
    Length=1775
    ...
    Query 9     GCCAAGGAAGTGCAATTGCATA  30
                ||||||||||||||||||||||
    Sbjct 1505  GCCAAGGAAGTGCAATTGCATA  1526
    ....
    Query 37    TATGCAATTGCACTTCCTTGG  57
                |||||||||||||||||||||
    Sbjct 1526  TATGCAATTGCACTTCCTTGG  1506

[Figure: SeqA (5'-GCGG...) targets gene TPD52L2 (21/21); SeqD (5'-GTAG...) targets gene Nob1 (22/22).]

The control described as sharing no homology with the mammalian genome therefore matches the human NOB1 gene.
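The two alignments for SeqA (Query 1 to 21 on the forward strand, Query 28 to 48 on the reverse strand) reflect the hairpin structure of an shRNA: a sense stretch, a loop, then the reverse complement of the sense stretch. The sketch below checks that structure locally. It does not query NCBI or run blastn; the loop `CTCGAG` is taken as the six bases between the two aligned regions, and the function names are hypothetical.

```python
# Translation table for DNA base complements.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Complement each base, then reverse the strand."""
    return sequence.translate(COMPLEMENT)[::-1]


def split_hairpin(shrna, loop):
    """Split at the first occurrence of the loop: (sense, everything after the loop)."""
    sense, _, rest = shrna.partition(loop)
    return sense, rest


# SeqA as printed in the Materials and methods excerpt.
SEQ_A = "GCGGAGGGTTTGAAAGAATATCTCGAGATATTCTTTCAAACCCTCCGCTTTTTT"
LOOP = "CTCGAG"  # assumption: bases 22 to 27, between the two aligned regions

sense, rest = split_hairpin(SEQ_A, LOOP)
antisense = rest[:len(sense)]  # same length as the sense stretch

print(sense)
print(antisense)
print(reverse_complement(sense) == antisense)
```

Once the sense stretch is isolated this way, it is the part worth submitting to a sequence search to see which gene it actually targets.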


Automatic detection of questionable research papers Seek & Blastn tool

Seek & Blastn at a glance

Input text (Materials and methods):

The shRNA sequence (5’-GCGGAGGGTTTGAAAGAATATCTCGAGATATTCTTTCAAACCCTCCGCTTTTTT-3’) targeting TPD52L2 (NM_199360) was inserted into the pFH-L plasmid (Shanghai Hollybio, China). A scrambled shRNA that shared no homology with the mammalian genome (5’-CTAGCCCGGCCAAGGAAGTGCAATTGCATACTCGAGTATGCAATTGCACTTCCTTGGTTTTTTGTTAAT-3’) was used as control.

(1) Facts extraction: named entity recognition extracts each nucleotide sequence and its claimed status.

Facts to check

Status           DNA Seq
Targeting        GCG...TTT
Non-targeting    CTA...AAT
...              ...

(2) Blastn call: the software gives the hit list.

Hit lists (Blastn results)

Hit list         DNA Seq
TPD52L2, ...     GCG...TTT
NOB1, ...        CTA...AAT
...              ...

(3) Comparison

Checked facts

Status           DNA Seq
Targeting        GCG...TTT
Non-targeting    CTA...AAT
...              ...
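The three steps can be sketched in Python as follows. This is an illustrative toy under stated assumptions, not the Seek & Blastn implementation: a regular expression and keyword rules stand in for named entity recognition, and `toy_blast` replaces the real blastn call with an exact substring lookup in a two-entry table that holds only the aligned segments reported on the previous slide. All function names are hypothetical.

```python
import re

# Excerpt from the slide (PMID 25262828), with straight quotes.
TEXT = (
    "The shRNA sequence (5'-GCGGAGGGTTTGAAAGAATATCTCGAGATATTCTTTCAAACCCTCCGC"
    "TTTTTT-3') targeting TPD52L2 (NM_199360) was inserted into the pFH-L "
    "plasmid (Shanghai Hollybio, China). A scrambled shRNA that shared no "
    "homology with the mammalian genome (5'-CTAGCCCGGCCAAGGAAGTGCAATTGCATA"
    "CTCGAGTATGCAATTGCACTTCCTTGGTTTTTTGTTAAT-3') was used as control."
)

# Assumption: a stand-in for the blastn database, holding only the aligned
# subject segments shown on the fact-check slide.
TOY_DB = {
    "TPD52L2": "GCGGAGGGTTTGAAAGAATAT",
    "NOB1": "GCCAAGGAAGTGCAATTGCATA",
}

SEQ_RE = re.compile(r"5'-([ACGT]+)-3'")


def extract_facts(text):
    """Step 1: one fact per sentence that contains a 5'-...-3' sequence."""
    facts = []
    for sentence in re.split(r"(?<=\.)\s+", text):
        match = SEQ_RE.search(sentence)
        if not match:
            continue
        if re.search(r"scrambled|no homology|non-targeting", sentence, re.I):
            status, gene = "non-targeting", None
        else:
            claimed = re.search(r"targeting\s+([A-Z0-9]+)", sentence)
            status, gene = "targeting", claimed.group(1) if claimed else None
        facts.append({"seq": match.group(1), "status": status, "gene": gene})
    return facts


def toy_blast(seq, db=TOY_DB):
    """Step 2 (stand-in): genes whose stored segment occurs in the query.
    Real blastn also searches the reverse strand and tolerates mismatches."""
    return [gene for gene, segment in db.items() if segment in seq]


def is_supported(fact, hits):
    """Step 3: a targeting claim needs its gene among the hits;
    a non-targeting claim needs an empty hit list."""
    if fact["status"] == "targeting":
        return fact["gene"] in hits
    return not hits


def run_pipeline(text):
    """Return (status, claimed gene, hit list, supported?) for each fact."""
    results = []
    for fact in extract_facts(text):
        hits = toy_blast(fact["seq"])
        results.append(
            (fact["status"], fact["gene"], hits, is_supported(fact, hits))
        )
    return results


if __name__ == "__main__":
    for row in run_pipeline(TEXT):
        print(row)
```

On this excerpt the targeting claim is supported (the hit list contains TPD52L2), while the non-targeting claim is contradicted because the control sequence hits NOB1.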


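Ambiguities: polysemy, homonymy, structural, ... (French examples, with English glosses)

"Le président a le pouvoir de faire taire l’avocat." (The president has the power to silence the lawyer; "avocat" means both lawyer and avocado.)

"Je ne vais pas pouvoir manger l’avocat." (I will not be able to eat the avocado.)

"L’été à l’est a été très beau et l’est toujours." (The summer in the east has been very beautiful and still is; "été" is both summer and been, "l’est" both the east and is so.)

"Je suis le secrétaire." (I am the secretary, or I am following the secretary.)

"Je vais à la grange et la ferme." (I go to the barn and the farm, or I go to the barn and close it.)

"Il poursuit la jeune fille à vélo." (He chases the girl by bike, or he chases the girl who is on a bike.)

"Il a vu un homme avec un télescope." (He saw a man with a telescope; either he or the man has the telescope.)

"Tous les participants prendront un bus." (All participants will take a bus; the same bus or one each.)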


Seek & Blastn

Related works

Detection of statistically flawed papers

Fake news detection

Seek & Blastn perspectives

Online tool : http://scigendetection.imag.fr/TPD52

Avoid false positives; more in-depth analysis of sentences.

Retractions, error corrections

Retractions (≈ 18), expressions of concern (≈ 11), ≈ 45 still to be processed

Citation analysis (to be done)


Chronos

[Timeline figure, 2004 to 2018. Labels: Ike Antkare; Nature; Scopus (Elsevier); Web of Science (Thomson Reuters); SCIgen; h-index; PoP V1.0; Abiteboul, by the administrator of the Collège de France.]


Conclusion and Future/Ongoing works

Publication procedures, models and habits

Why fake papers were accepted, published and ... sold.

Traditional publisher vs open access.

Knowledge diffusion: better and less... or as much as possible.

Blind management rules...

... are an incentive for malpractice: salami slicing, plagiarism, faked data, ...

Automatic detection of new generators

Hand-written PCFG: find dense clusters inside a population.

Study other kinds of generators (language models).
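One possible reading of "find dense clusters inside a population" is sketched below in Python: texts produced by the same hand-written grammar reuse the same vocabulary, so they sit unusually close to each other under a word-frequency distance. The distance used here (absolute differences of word counts after scaling the longer text down to the shorter one, normalised to the range 0 to 1) and the `radius` and `min_neighbors` values are assumptions chosen for illustration, not the detection method or settings of an existing tool.

```python
import re
from collections import Counter

# Toy population: two near-identical "templated" texts and one unrelated text.
TEXTS = [
    "we propose a novel method for the analysis of robots",
    "we propose a novel method for the analysis of networks",
    "the cat sat on the mat",
]


def word_counts(text):
    """Lower-cased word counts of a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def frequency_distance(text_a, text_b):
    """Word-frequency distance in [0, 1]: 0 for identical word counts,
    1 for texts that share no word."""
    small, large = sorted(
        (word_counts(text_a), word_counts(text_b)),
        key=lambda counts: sum(counts.values()),
    )
    n_small, n_large = sum(small.values()), sum(large.values())
    if n_small == 0:
        raise ValueError("both texts must contain at least one word")
    scale = n_small / n_large  # shrink the longer text to the shorter length
    gap = sum(
        abs(small[word] - large[word] * scale)
        for word in small.keys() | large.keys()
    )
    return gap / (2 * n_small)


def dense_members(texts, radius=0.3, min_neighbors=1):
    """Indices of texts that have at least `min_neighbors` other texts
    within `radius` (both thresholds are hypothetical)."""
    flagged = []
    for i, text in enumerate(texts):
        close = sum(
            1
            for j, other in enumerate(texts)
            if j != i and frequency_distance(text, other) <= radius
        )
        if close >= min_neighbors:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    print(dense_members(TEXTS))
```

On the toy population the two templated sentences flag each other, while the unrelated sentence stays isolated. A real corpus would need full-length texts and thresholds calibrated on known generated papers.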

On the web today

Automatic knowledge extraction/detection/generation.

How to separate the wheat from the chaff... and scale up!


Thanks
