A comparison of two web-based
document management systems
ShaoxinYu
Columbia University
March 31, 2009
Index
I. Description of the problem
II. Google Scholar
III. CiteSeer
IV. Comparison of Google Scholar and CiteSeer
Description of the problem
With the rapid growth of online text, automatic text summarization plays an increasingly important role in the information industry.
Online resources often contain similar content yet exist separately, so it is worthwhile to find efficient ways to manage this information.
Description of the problem
Background of Multi-document Summarization Techniques
Four types:
1. Free-style summarization
2. Sentence-extraction summarization
3. Axis (type of main topic)
4. Table-style summary
Description of the problem
How would one summarize documents about the same topic manually?
1. Use a marker to mark the important phrases or sentences
2. Identify the main topics in the marked sentences, or make a list to get an overview of the documents
3. Connect these main topics
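The three manual steps above can be mimicked by a very small extractive summarizer. The sketch below is only an illustration under stated assumptions: word frequency stands in for "main topics", and the stop-word list, the sentence splitter, and the names `summarize`, `split_sentences`, and `content_words` are all hypothetical rather than taken from any system discussed in these slides.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "for", "on", "with"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def content_words(sentence):
    """Lowercased alphabetic words with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP_WORDS]

def summarize(documents, max_sentences=2):
    sentences = [s for doc in documents for s in split_sentences(doc)]
    # Step 2: words that recur across the documents approximate the main topics.
    topic_counts = Counter(w for s in sentences for w in content_words(s))

    # Step 1: "mark" the sentences whose words carry the most topic weight on average.
    def score(sentence):
        words = content_words(sentence)
        return sum(topic_counts[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    marked = sorted(ranked[:max_sentences])  # restore original reading order
    # Step 3: connect the marked sentences into one summary.
    return " ".join(sentences[i] for i in marked)
```

Calling `summarize(["CiteSeer indexes papers. The weather is nice.", "CiteSeer indexes citations in papers."])` keeps the two CiteSeer sentences and drops the off-topic one, because their words recur across both documents.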
Google Scholar
1. Released in November 2004
2. Search engine for scholarly literature
3. Wide range of subject areas
Google Scholar
Unlike Google, it does not search all publicly available Web pages. Google Scholar gets its records from three sources:
1. Uses a proprietary algorithm to identify Web documents that “look scholarly”: full-text documents and citations with abstracts.
2. Adds content provided by its partners: journal publishers, scholarly societies, database vendors, and academic institutions.
3. Extracts citations from the reference lists of documents found through the first two methods.
Google Scholar
Google File System Architecture
Google Scholar
1. Chunk
   Fragment of information, a term also used in multimedia formats
   64 MB: size optimized by statistical methods
2. Metadata (stored in master)
   a. file and chunk namespaces
   b. mapping from files to chunks
   c. locations of each chunk’s replicas
3. Master
   Single process running on a machine that stores all metadata
4. Communication between Master and Chunk Servers
   If a chunk is corrupted, the master also instructs the chunk servers to delete existing chunks and create new chunks.
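The metadata listed above can be pictured as a few in-memory maps held by the master. The following is a minimal sketch, not the actual Google File System implementation: the class `Master`, its method names, the round-robin placement, and the default of three replicas are assumptions made for illustration. Only the 64 MB chunk size and the three kinds of metadata come from the slide.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunk size, as stated on the slide

@dataclass
class Master:
    """Toy in-memory stand-in for the master's metadata (all names hypothetical)."""
    file_to_chunks: dict = field(default_factory=dict)   # file path -> ordered chunk handles
    chunk_locations: dict = field(default_factory=dict)  # chunk handle -> set of chunk servers
    next_handle: int = 0

    def create_file(self, path, size_bytes, servers, replicas=3):
        """Register a file, split it into chunks, and place replicas (assumed round-robin)."""
        n_chunks = max(1, -(-size_bytes // CHUNK_SIZE))  # ceiling division
        handles = []
        for i in range(n_chunks):
            handle = self.next_handle
            self.next_handle += 1
            count = min(replicas, len(servers))
            self.chunk_locations[handle] = {servers[(i + r) % len(servers)] for r in range(count)}
            handles.append(handle)
        self.file_to_chunks[path] = handles
        return handles

    def lookup(self, path, byte_offset):
        """Translate (file, byte offset) into (chunk handle, replica locations)."""
        handle = self.file_to_chunks[path][byte_offset // CHUNK_SIZE]
        return handle, self.chunk_locations[handle]

    def report_corrupt(self, handle, bad_server, servers):
        """Drop a corrupted replica and create a new one on a spare server, if any."""
        locations = self.chunk_locations[handle]
        locations.discard(bad_server)
        spare = [s for s in servers if s not in locations and s != bad_server]
        if spare:
            locations.add(spare[0])
        return locations
```

In this sketch a 150 MB file occupies three chunks, and a read at byte offset 70 MB resolves to the second chunk, since 70 MB divided by 64 MB rounds down to index 1.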
CiteSeer
1. Public search engine for academic papers
2. Created by Steve Lawrence, Kurt Bollacker and Lee Giles
3. Developed at the NEC Research Institute, Princeton, New Jersey, USA
4. Hosted by Pennsylvania State University
5. Over 700,000 documents, primarily in computer science and engineering
CiteSeer
CiteSeer features
1. Autonomous citation indexing system
2. Indexes academic literature in PostScript or PDF files
3. Literature retrieval by following citation links
4. Evaluation and ranking of papers, authors and journals
5. Creates up-to-date databases not limited to preselected journals or restricted by journal publication delays
6. Autonomous operation with a corresponding reduction in cost
7. Powerful interactive browsing of the literature using the context of citations
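Retrieval by following citation links and ranking of papers both rest on an inverted citation index. The sketch below shows one possible shape for such an index. It is an illustration only: the function names are hypothetical, and ranking by raw incoming-citation count is an assumed stand-in, not CiteSeer's actual ranking method.

```python
from collections import defaultdict

def build_citation_index(references):
    """Invert {paper: [papers it cites]} into {paper: set of papers citing it}."""
    cited_by = defaultdict(set)
    for paper, cited in references.items():
        for target in cited:
            cited_by[target].add(paper)
    return cited_by

def rank_by_citations(references):
    """Order papers by number of incoming citations (ties broken alphabetically)."""
    cited_by = build_citation_index(references)
    papers = set(references) | set(cited_by)
    return sorted(papers, key=lambda p: (-len(cited_by.get(p, ())), p))
```

Given `{"A": ["B", "C"], "B": ["C"], "D": ["C", "B"]}`, looking up `"C"` in the index returns the papers that cite it, which is the link a reader would follow to find later related work.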
CiteSeer
Methods CiteSeer uses for computing similarity
1. Word Vectors
   Use the top 20 components, since the truncation may not have a large effect on the distance measures
2. String Distance
   Use “LikeIt” string distance to measure the edit distance
3. Citations
   Use common citations to find the research papers most closely related to the document
4. Combination of Methods
   CiteSeer combines the document similarity methods above
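Two of the methods above, truncated word vectors and common citations, can be sketched in a few lines. This is a rough illustration under several assumptions: plain term frequency is used instead of whatever weighting CiteSeer applies, citation overlap is measured with a Jaccard ratio, the combination is an equally weighted sum, and the "LikeIt" string distance is omitted. All function names and the `w_text` and `w_cite` weights are hypothetical.

```python
import math
from collections import Counter

def top_components(text, k=20):
    """Term-frequency vector truncated to its k largest components."""
    counts = Counter(text.lower().split())
    return dict(counts.most_common(k))

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0) for term, weight in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def common_citation_score(cites_a, cites_b):
    """Share of cited works that both documents have in common (Jaccard overlap)."""
    a, b = set(cites_a), set(cites_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def combined_similarity(doc_a, doc_b, w_text=0.5, w_cite=0.5):
    """Assumed combination: weighted sum of text and citation similarity."""
    text_sim = cosine(top_components(doc_a["text"]), top_components(doc_b["text"]))
    cite_sim = common_citation_score(doc_a["cites"], doc_b["cites"])
    return w_text * text_sim + w_cite * cite_sim
```

Each document here is a dict with a `"text"` string and a `"cites"` list. Truncating to the top 20 components keeps the vectors small, which matches the slide's remark that the truncation has little effect on the distance measures.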
Comparison of Google Scholar & CiteSeer
Different positioning
The core purpose of CiteSeer is to provide search over complete academic papers with complete citations, free of hefty fees.
Google Scholar is Google’s product for academic search and related scholarly needs; its strategy focuses on completeness, so that it can serve as a final solution.
Comparison of Google Scholar & CiteSeer
Coverage and performance
Google Scholar uses only the first 100-120 KB of a document's text for searching, and the linked full texts typically require payment.
With CiteSeer, we can trace an informative paper within CiteSeer itself, and the contributions of all the citing papers are a great help in academic work.
Comparison of Google Scholar & CiteSeer
Clicking any of the informative links connects to a single link
Comparison of Google Scholar & CiteSeer
Results are provided only through topic extraction
Comparison of Google Scholar & CiteSeer
On staleness, Google Scholar seems to fare worse than CiteSeer.
This effect was more obvious in Google Scholar's early days.
Nowadays, for the majority of uses, staleness is no longer a big problem for either of them.