![Page 1: BIOTECA: Restructuring Wikipedia Jenny Yuen Serdar Balci Erdong Chen Alvin Raj](https://reader036.vdocuments.mx/reader036/viewer/2022062309/56649ea75503460f94baac0d/html5/thumbnails/1.jpg)
BIOTECA: Restructuring Wikipedia
Jenny Yuen
Serdar Balci
Erdong Chen
Alvin Raj
![Page 2: BIOTECA: Restructuring Wikipedia Jenny Yuen Serdar Balci Erdong Chen Alvin Raj](https://reader036.vdocuments.mx/reader036/viewer/2022062309/56649ea75503460f94baac0d/html5/thumbnails/2.jpg)
Problem definition
Wikipedia
Collaborative editing: 3.8 million edits per month, 38 edits per article
Various titles conveying the same meaning
BIOTECA: Restructuring Wikipedia, a better way for Wikipedia users to access, analyze, and use biography data on Wikipedia.
June 28, 2007 2
![Page 3: BIOTECA: Restructuring Wikipedia Jenny Yuen Serdar Balci Erdong Chen Alvin Raj](https://reader036.vdocuments.mx/reader036/viewer/2022062309/56649ea75503460f94baac0d/html5/thumbnails/3.jpg)
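Problem formulation
Document set: $D = \{d_1, d_2, \dots, d_m\}$
Sentences in document $d$: $S_d = \{s_1, s_2, \dots, s_{n(d)}\}$
Segment set: $\hat{S} = \{\hat{s}_1, \hat{s}_2, \dots, \hat{s}_p\}$, where $p$ is unknown
Segmentation: $\mathrm{Seg}_d : \{s_1, s_2, \dots, s_{n(d)}\} \to \{\hat{s}'_1, \hat{s}'_2, \dots, \hat{s}'_{m(d)}\}$, subject to the adjacent-sentence constraint
Alignment: $\mathrm{Ali}_d : \{\hat{s}'_1, \hat{s}'_2, \dots, \hat{s}'_{m(d)}\} \to \{\hat{s}_1, \hat{s}_2, \dots, \hat{s}_p\}$; some segments may be empty
Goal: a better alignment with reasonable segmentations (neither too fine nor too coarse)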
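The two mappings in the formulation can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `segment` and `align`, the boundary-index encoding, and the integer segment labels are all assumptions made for illustration. Segmentation is encoded as cut points so that every local segment is a run of adjacent sentences, and alignment places each local segment into one of `num_global` shared slots, leaving unused slots empty.

```python
from typing import List


def segment(sentences: List[str], boundaries: List[int]) -> List[List[str]]:
    """Split a document into contiguous local segments.

    `boundaries` (hypothetical encoding) lists the start index of every
    segment after the first, so the adjacent-sentence constraint holds
    by construction.
    """
    inner = sorted({b for b in boundaries if 0 < b < len(sentences)})
    cuts = [0] + inner + [len(sentences)]
    return [sentences[a:b] for a, b in zip(cuts, cuts[1:]) if a < b]


def align(segments: List[List[str]], labels: List[int],
          num_global: int) -> List[List[str]]:
    """Place each local segment into a global segment slot.

    `labels[i]` is the index of the global segment for local segment i.
    Slots that receive nothing stay empty.
    """
    if len(segments) != len(labels):
        raise ValueError("one label is required per local segment")
    slots: List[List[str]] = [[] for _ in range(num_global)]
    for seg, k in zip(segments, labels):
        slots[k].extend(seg)
    return slots
```

For example, `segment(["s1", "s2", "s3", "s4", "s5"], [2])` yields two local segments, and aligning them with labels `[0, 2]` over three global segments leaves the middle slot empty.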
![Page 4: BIOTECA: Restructuring Wikipedia Jenny Yuen Serdar Balci Erdong Chen Alvin Raj](https://reader036.vdocuments.mx/reader036/viewer/2022062309/56649ea75503460f94baac0d/html5/thumbnails/4.jpg)
Barack Obama (Wikipedia article)
Barack Obama is a Democratic politician from Illinois. He is currently running for the United States Senate, which would be the highest elected office he has held thus far.
Biography
Obama's father is Kenyan; his mother is from Kansas. He himself was born in Hawaii, where his mother and father met at the University of Hawaii. Obama's father left his family early on, and Obama was raised in Hawaii by his mother.
Created in 2004 (5 sentences)
![Page 5: BIOTECA: Restructuring Wikipedia Jenny Yuen Serdar Balci Erdong Chen Alvin Raj](https://reader036.vdocuments.mx/reader036/viewer/2022062309/56649ea75503460f94baac0d/html5/thumbnails/5.jpg)
Barack Obama (Wikipedia article)
5,907 revisions up to 2007 (>400 sentences)
![Page 6: BIOTECA: Restructuring Wikipedia Jenny Yuen Serdar Balci Erdong Chen Alvin Raj](https://reader036.vdocuments.mx/reader036/viewer/2022062309/56649ea75503460f94baac0d/html5/thumbnails/6.jpg)
Early Life (Section Title)
"Early life, education, and family", "Early years, education, military", "Personal life and education", "Early Life and Education", "Early years", "Personal life and family", "Personal life and career", "Childhood and Education", "Early life and childhood", "Childhood", "Early life, education, and early career", "Early years and education", "Early life", "Early biography", "Childhood and education", "Earlier life", "Youth", "Early Life & Family", "Early years and family", "Family and education", "Family and early life", "Family Life", "Career after football", "Curriculum vitae", "Family and Personal Life", "Upbringing", "Early life and family", "Early Years", "Early and private life", "Early career", "The Early Years", "Birth and education", "Early and personal life", "Background and early life", "Education and Family", "Early life and education", "Family and Education", "Early Life", "Early Life and Family", "Background and family", "Personal and family life", "Family and childhood"
![Page 7: BIOTECA: Restructuring Wikipedia Jenny Yuen Serdar Balci Erdong Chen Alvin Raj](https://reader036.vdocuments.mx/reader036/viewer/2022062309/56649ea75503460f94baac0d/html5/thumbnails/7.jpg)
Title distribution
118,626 articles / 257,341 sections
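A title distribution of this kind can be computed by counting section titles after light normalization. The sketch below is only an illustration: `normalize_title` and `title_distribution` are hypothetical helpers, and the normalization rules (lowercasing, replacing "&" with "and", collapsing whitespace, stripping quotes) are assumptions, not the procedure used for the slide.

```python
from collections import Counter
from typing import Iterable


def normalize_title(title: str) -> str:
    """Fold trivial surface variants of a section title together."""
    # Strip whitespace and straight or curly quotes, then lowercase.
    cleaned = title.strip().strip('"\u201c\u201d').lower()
    cleaned = cleaned.replace("&", "and")
    # Collapse runs of whitespace into single spaces.
    return " ".join(cleaned.split())


def title_distribution(titles: Iterable[str]) -> Counter:
    """Count how often each normalized section title occurs."""
    return Counter(normalize_title(t) for t in titles)
```

Calling `title_distribution(...).most_common(10)` on the full title list would give the head of the distribution. Titles such as "Early years" and "Childhood" still count separately here, which is the gap the data integration step is meant to close.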
![Page 8: BIOTECA: Restructuring Wikipedia Jenny Yuen Serdar Balci Erdong Chen Alvin Raj](https://reader036.vdocuments.mx/reader036/viewer/2022062309/56649ea75503460f94baac0d/html5/thumbnails/8.jpg)
Architecture
![Page 9: BIOTECA: Restructuring Wikipedia Jenny Yuen Serdar Balci Erdong Chen Alvin Raj](https://reader036.vdocuments.mx/reader036/viewer/2022062309/56649ea75503460f94baac0d/html5/thumbnails/9.jpg)
Data Collection & Cleaning
![Page 10: BIOTECA: Restructuring Wikipedia Jenny Yuen Serdar Balci Erdong Chen Alvin Raj](https://reader036.vdocuments.mx/reader036/viewer/2022062309/56649ea75503460f94baac0d/html5/thumbnails/10.jpg)
Data Collection & Cleaning
Corpus statistics: 118,626 articles, 257,341 sections
Data cleaning:
Diagrams, tables, and links are removed
Documents are parsed into sentences
Sub-section titles are kept
Paragraph structure is kept
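The cleaning steps listed above could look roughly like the following Python sketch. It assumes the input is raw MediaWiki markup, which the slides do not state, and every name in it (`clean_article`, the regular expressions, the output dictionaries) is hypothetical. Link markup is removed but the visible label is kept, which is one possible reading of "links are removed". The sentence splitter is a naive punctuation rule, and nested links inside image captions are not handled.

```python
import re

# Assumed MediaWiki constructs; real articles contain many more cases.
TABLE = re.compile(r"\{\|.*?\|\}", re.DOTALL)           # {| ... |} tables
FILE_LINK = re.compile(r"\[\[(?:File|Image):[^\]]*\]\]")  # embedded images
LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")     # [[target|label]]
HEADING = re.compile(r"^(=+)\s*(.*?)\s*\1$")              # == Title ==
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")               # naive splitter


def clean_article(wikitext):
    """Return a list of title and paragraph blocks in document order."""
    text = TABLE.sub("", wikitext)
    text = FILE_LINK.sub("", text)
    text = LINK.sub(r"\1", text)  # keep only the visible link label
    blocks, current = [], []

    def flush():
        # Close the paragraph collected so far and split it into sentences.
        if current:
            joined = " ".join(" ".join(current).split())
            sentences = [s for s in SENTENCE_END.split(joined) if s]
            blocks.append({"type": "paragraph", "sentences": sentences})
            current.clear()

    for line in text.splitlines():
        line = line.strip()
        heading = HEADING.match(line)
        if heading:
            flush()
            blocks.append({"type": "title", "text": heading.group(2)})
        elif not line:
            flush()  # a blank line ends the paragraph
        else:
            current.append(line)
    flush()
    return blocks
```

Keeping titles and paragraphs as separate blocks preserves the two pieces of structure the slide says are retained, while the sentence lists give the unit that the segmentation step operates on.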
![Page 11: BIOTECA: Restructuring Wikipedia Jenny Yuen Serdar Balci Erdong Chen Alvin Raj](https://reader036.vdocuments.mx/reader036/viewer/2022062309/56649ea75503460f94baac0d/html5/thumbnails/11.jpg)
Data Integration
![Page 12: BIOTECA: Restructuring Wikipedia Jenny Yuen Serdar Balci Erdong Chen Alvin Raj](https://reader036.vdocuments.mx/reader036/viewer/2022062309/56649ea75503460f94baac0d/html5/thumbnails/12.jpg)
Data Integration
Hidden Markov Topical Model
HMM
Distributional similarity among titles
Gibbs sampling
Category: politician
Number of articles: 1,928
Number of paragraphs: 26,367
Number of sections: 9,692
Number of distinct titles: 3,330
![Page 13: BIOTECA: Restructuring Wikipedia Jenny Yuen Serdar Balci Erdong Chen Alvin Raj](https://reader036.vdocuments.mx/reader036/viewer/2022062309/56649ea75503460f94baac0d/html5/thumbnails/13.jpg)
Graphical model
z: topic
y: section titles
w: section texts
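The figure for this slide is not preserved in the transcript, so the exact dependency structure is unknown. One plausible reading, stated here purely as an assumption, is a first-order hidden Markov model over the sections of an article in which each section's topic $z_t$ emits both its title $y_t$ and its words $w_{t,i}$. Under that assumption, with $T$ sections, $N_t$ words in section $t$, and $z_0$ a fixed start state (all three symbols introduced for illustration), the joint distribution would factor as:

$$
p(z_{1:T}, y_{1:T}, w_{1:T}) = \prod_{t=1}^{T} p(z_t \mid z_{t-1}) \, p(y_t \mid z_t) \prod_{i=1}^{N_t} p(w_{t,i} \mid z_t)
$$

In such a model, titles that tend to be emitted by the same topic end up grouped together, which matches the stated goal of merging titles that convey the same meaning.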
![Page 14: BIOTECA: Restructuring Wikipedia Jenny Yuen Serdar Balci Erdong Chen Alvin Raj](https://reader036.vdocuments.mx/reader036/viewer/2022062309/56649ea75503460f94baac0d/html5/thumbnails/14.jpg)
Full Topic Graph
![Page 15: BIOTECA: Restructuring Wikipedia Jenny Yuen Serdar Balci Erdong Chen Alvin Raj](https://reader036.vdocuments.mx/reader036/viewer/2022062309/56649ea75503460f94baac0d/html5/thumbnails/15.jpg)
Experiments
Statistics
245 section titles appear at least 3 times, out of 3331 section titles in total
4 clusters: manually labeled accuracy 91.5%
5 clusters: manually labeled accuracy 86.5%
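The slides do not say how the accuracy figures were computed. A minimal sketch of one common approach follows, assuming each title is assigned to a cluster, each cluster has already been mapped to a manual label name, and accuracy is the fraction of titles whose assigned label matches the manual one. The helpers `frequent_titles` and `labeled_accuracy` are hypothetical, as is the use of the frequency threshold of 3 as a filter before evaluation.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def frequent_titles(titles: Iterable[str], min_count: int = 3) -> Set[str]:
    """Keep the titles that occur at least `min_count` times."""
    counts = Counter(titles)
    return {t for t, c in counts.items() if c >= min_count}


def labeled_accuracy(predicted: Dict[str, str], manual: Dict[str, str]) -> float:
    """Fraction of manually labeled titles whose predicted label agrees."""
    shared = predicted.keys() & manual.keys()
    if not shared:
        raise ValueError("no titles in common between prediction and labels")
    correct = sum(predicted[t] == manual[t] for t in shared)
    return correct / len(shared)
```

With this definition, comparing the 4-cluster and 5-cluster runs only requires swapping in the corresponding `predicted` mapping while keeping the manual labels fixed.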
![Page 16: BIOTECA: Restructuring Wikipedia Jenny Yuen Serdar Balci Erdong Chen Alvin Raj](https://reader036.vdocuments.mx/reader036/viewer/2022062309/56649ea75503460f94baac0d/html5/thumbnails/16.jpg)
4 & 5 Clusters
![Page 17: BIOTECA: Restructuring Wikipedia Jenny Yuen Serdar Balci Erdong Chen Alvin Raj](https://reader036.vdocuments.mx/reader036/viewer/2022062309/56649ea75503460f94baac0d/html5/thumbnails/17.jpg)
User Interface
![Page 18: BIOTECA: Restructuring Wikipedia Jenny Yuen Serdar Balci Erdong Chen Alvin Raj](https://reader036.vdocuments.mx/reader036/viewer/2022062309/56649ea75503460f94baac0d/html5/thumbnails/18.jpg)
User Interface
![Page 19: BIOTECA: Restructuring Wikipedia Jenny Yuen Serdar Balci Erdong Chen Alvin Raj](https://reader036.vdocuments.mx/reader036/viewer/2022062309/56649ea75503460f94baac0d/html5/thumbnails/19.jpg)
Wikipedia Adventure
![Page 20: BIOTECA: Restructuring Wikipedia Jenny Yuen Serdar Balci Erdong Chen Alvin Raj](https://reader036.vdocuments.mx/reader036/viewer/2022062309/56649ea75503460f94baac0d/html5/thumbnails/20.jpg)
Wikipedia Adventure