1
Current research information as part of digital libraries and the heterogeneity problem.
Integrated searches in the context of databases with different content analyses.
CRIS2002, Kassel
Jürgen Krause
University of Koblenz-Landau and Social Science Information Centre (IZ-Bonn)
Lennéstr. 30, 53113 Bonn, Germany, mailto:Krause@bonn.iz-soz.de
2
jk
Heterogeneity
[Figure: decentralized/polycentric document space, with transfer and coordination between the document sets and the users (application area n).
Providers: information service centers, publishers, scientists, libraries (library catalogues), WWW (electronic publishing).
Document sets:
M 1: high relevance, high quality content analysis
M 2: high relevance, probabilistic automatic indexing of full text
M 3: only titles, simple automatic indexing
M 4: less relevance, no abstracts, high quality indexing
M 5: low relevance, WWW documents, search access by search engines
M 6: high relevance, high quality content analysis
WWW documents, c.a. by scientists]
3
“Scientists are increasingly using search engines to locate research of interest; some rarely use libraries, locating research articles primarily online ... About 85% of users use search engines to locate information.”
Lawrence/Giles (1999:107)
“It doesn’t matter what you want to know, there are people in the Internet who already have this knowledge and want to help you”
(Hahn, 1999: 107)
4
Weizenbaum 2000/2001 Germany
“The internet is a large dunghill ...” (German original: „Das Internet ist ein großer Misthaufen ...“)
Nov. 2000: congress „Gutenbergs Folgen“, Mainz
May 2001: specialist seminar, Hamburg
www.heise.de/newsticker/data/wst-03.05.01-001/
5
WWW today:
a) “Worst case” of the ambiguity problem
Out of the estimated 800 million pages on around three million servers, only 6% relate to the fields of science and education (by comparison: 1.5% relate to pornography).
Source: NEC 1999. NEC 2000: one thousand million pages.
No reasonable results can be obtained without additional conceptual components.
6
Summary and consequences
When used for specialist information retrieval (IR), general WWW search engines run counter to nearly every criterion that permits a successful search based on IR knowledge. This affects all the main components of an IR system: the database and its selection, the use of search logic, and user expectations. Based on his/her knowledge of these aspects, the user should develop the best possible search strategy, something which is impossible with WWW search engines.
7
Nevertheless WWW search engines have one advantage compared with current specialist databases: embedded in an enormous volume of irrelevant data is data which is not found in specialist databases and which may be of value to experts. This means that it is simply not possible to return to the recommendation to narrow down the search to the original specialist databases. New ways have to be found to make research, including WWW sources, more satisfactory than is the case at present using general WWW search engines.
8
Heterogeneity
[Figure repeated from slide 2: decentralized/polycentric document space with the document sets M 1 to M 6, their providers, and transfer and coordination towards the users]
9
Conceptual gaps
In addition to technological integration:
• Different stages of content analysis of textual data:
• an intelligently indexed term in a library classification
• ……
• automatic full text indexing in fully unstructured data pools
Descriptor A in one such system: wide range of meanings
10
Research Projects IZ Bonn
ViBSoz “Social Science Virtual Library”, Virtual Library Project of the German Research Association (DFG)
CARMEN “Content Analysis, Retrieval and Metadata: Effective Networking”, special support program of the German Ministry of Education and Research (BMBF)
ELVIRA “Electronic Retrieval and Analysis System for Industrial Associations”, funded by the German Federal Ministry of Economics and Technology
ETB “The European Schools Treasury Browser”, funded by the European Commission
11
Metadata
U.S. Bureau of the Census: Integrated Information solutions – The future of census bureau data access and dissemination, Sept. 1999. Working paper
“Recent surveys of Census Bureau customers show that two out of three use multiple data sets. ... If we continue to saddle data users with the burden of putting data from disparate sources into digestible forms, we do it at the risk of our own peril.” (p. 2)
“Solutions of these issues ... will revolve around the further development of standards, metadata ...” (p. 3)
“IIS will help minimize data user burden, data uncertainty and maximize data quality and usefulness through the use of metadata” (p. 2)
13
DIN SICT paper: German position
„Strategie für die Standardisierung der Informations- und Kommunikationstechnik (ICT)“ (“Strategy for the standardization of information and communication technology (ICT)”; DIN Berlin 2002, draft)
... It is ... necessary to find a new concept relating to the still existing demand for consistency retention and interoperability. This concept can be described by means of the following premise: standardization must be considered in terms of the remaining heterogeneity. Only joint interaction between intellectual and automatic processes for the treatment of heterogeneity and standardization will produce a solution strategy which also ensures, under present-day marginal conditions, usable consistency and interoperability conditions.
(translation from German)
14
CARMEN and ViBSoz: Coping with heterogeneity by transfer components
Retrieval
Metadata
Coping with heterogeneity
• Cross-concordances
• Statistical transformation and neural networks
• Deductive methods
Documents
extract metadata from various document formats algorithmically
15
Mathematics – Physics: MSC and PACS
statistical:
PACS 62.30.+d Mechanical and elastic waves; vibrations (Mechanische und elastische Wellen, Schwingungslehre)
MSC 74S15 Boundary element methods (Randelementmethode)
intellectual:
PACS 62. Not connected
16
Example: semantic-pragmatic relation
Simple search
Search term: Dominanz (dominance). Number of relevant hits: 16
G. Binder
17
Extended search
Transfer terms: Dominanz (dominance), Messen (measuring/trade fairs), Mongolei (Mongolia), Nichtregierungsorganisation (non-governmental organization), Flugzeug (airplane), Datenaustausch (data exchange), Kommunikationsraum (communication space), Kommunikationstechnologie (communication technology), Medienpädagogik (media education),
Number of additional relevant hits: 7
Share of additional relevant hits among the additional hits: 50%
G. Binder
• Members of the association wom@n travelled to the UN Women's Conference in Beijing. On the journey through Mongolia and the desert ...
18
Text fact integration: simple directed transfer in ELVIRA
[Figure: Information need → Formalization → Text query and Fact query → Texts and Facts. Further labels: Transformations, Direct links, Iterative search (on both the text and the fact side), "Texts?", "Facts?"]
19
Standard method: one-step transformation
Non-differentiated handling of vagueness
[Figure: a question and the document term sets A, B and C]
20
Two-step transformation
V1: Handling of vagueness between questions and terms
V2/V3: Bilateral handling of vagueness
[Figure: a question, the document term sets A, B and C, and the transfer modules V1, V2 and V3]
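A minimal sketch of the two-step idea on this slide, written in Python. It assumes that V1 maps the words of a question onto the terms of one vocabulary (A) and that V2 and V3 are separate bilateral modules from A to B and from A to C. This wiring, the function names and all example terms are hypothetical illustrations; the slide does not specify them.

```python
from typing import Dict, Iterable, Set

Module = Dict[str, Set[str]]

# Hypothetical V1 module: question words -> terms of vocabulary A.
V1_QUESTION_TO_A: Module = {
    "jobless": {"unemployment"},
    "teenagers": {"adolescent"},
}

# Hypothetical bilateral modules: V2 (A -> B) and V3 (A -> C).
V2_A_TO_B: Module = {"unemployment": {"joblessness"}, "adolescent": {"youth"}}
V3_A_TO_C: Module = {"unemployment": {"labour market"}, "adolescent": {"young people"}}


def apply_module(terms: Iterable[str], module: Module) -> Set[str]:
    """Translate terms with one module; unmapped terms pass through unchanged."""
    translated: Set[str] = set()
    for term in terms:
        translated |= module.get(term, {term})
    return translated


def two_step_transfer(question_words: Iterable[str]) -> Dict[str, Set[str]]:
    """Step 1: question -> terms of A (V1). Step 2: A -> B (V2) and A -> C (V3)."""
    terms_a = apply_module(question_words, V1_QUESTION_TO_A)
    return {
        "A": terms_a,
        "B": apply_module(terms_a, V2_A_TO_B),
        "C": apply_module(terms_a, V3_A_TO_C),
    }
```

Calling `two_step_transfer(["jobless", "teenagers"])` yields one query per document set. Each module can be built and evaluated separately, which is the point of handling vagueness bilaterally rather than in one undifferentiated step.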
21
[Figure: a broker mediates between Thesaurus A, Thesaurus B and Thesaurus C, relating X from A to Y from B and Z from C.
Thesauri named in the example: IZ Thesaurus (IZ Soz.), SWD, USB Köln.
Example terms: "Jugendlicher + Arbeitslosigkeit" (adolescent + unemployment), "Jugendarbeitslosigkeit" (youth unemployment)]
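The broker in this diagram can be sketched as a lookup over an intellectually built cross-concordance. The Python sketch below is an assumption about how such a table could be stored; the thesaurus labels "A", "B", "C", the `broker` function and the assignment of the example terms to particular thesauri are hypothetical and not taken from the slide.

```python
from typing import Dict, List, Tuple

# Key: (source thesaurus, target thesaurus, source term) -> equivalent target terms.
Concordance = Dict[Tuple[str, str, str], List[str]]

# Hypothetical entries; a real cross-concordance is maintained by indexing experts.
CONCORDANCE: Concordance = {
    ("A", "B", "Jugendarbeitslosigkeit"): ["Jugendlicher", "Arbeitslosigkeit"],
    ("A", "C", "Jugendarbeitslosigkeit"): ["Jugendarbeitslosigkeit"],
}


def broker(term: str, source: str, targets: List[str],
           table: Concordance = CONCORDANCE) -> Dict[str, List[str]]:
    """Return, per target thesaurus, the terms recorded as equivalent to `term`.

    A target without an entry gets an empty list, so the caller can fall back
    to another transfer method (for example a statistical one).
    """
    return {target: table.get((source, target, term), []) for target in targets}
```

With these entries, `broker("Jugendarbeitslosigkeit", "A", ["B", "C"])` returns a term combination for B and a single term for C, which mirrors "X from A, Y from B, Z from C".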
22
Statistical and neural network transformation
• Co-occurrence-based similarity
• In ViBSoz: statistical crosswalk between two different thesauri (SWD as a universal thesaurus and the IZ thesaurus for the social sciences as a special thesaurus)
• In ELVIRA: between a thesaurus for data and free text terms
• Transformation networks
• USB Thesaurus to the IZ Thesaurus
• the USB Thesaurus or IZ Thesaurus to the IZ
[Figure: precision-recall plot comparing LSI and transformation network with statistical methods]
Fig. 3: Transformation network USB Thesaurus to IZ Thesaurus (Fig. 7-12 from Mandl 2000:206)
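A co-occurrence-based crosswalk of the kind listed above can be sketched as follows. The sketch assumes a corpus of documents that were indexed with both vocabularies and scores each term pair by the conditional relative frequency P(target term | source term). The slide does not say which association measure the projects use, so this measure, the function name and the threshold are assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple

DoublyIndexedDoc = Tuple[Set[str], Set[str]]  # (source-thesaurus terms, target-thesaurus terms)


def cooccurrence_crosswalk(
    docs: Iterable[DoublyIndexedDoc], min_score: float = 0.5
) -> Dict[str, List[Tuple[str, float]]]:
    """Map each source term to target terms that co-occur with it often enough."""
    source_count: Counter = Counter()
    pair_count: Dict[str, Counter] = defaultdict(Counter)

    for source_terms, target_terms in docs:
        for s in source_terms:
            source_count[s] += 1          # documents containing source term s
            for t in target_terms:
                pair_count[s][t] += 1     # documents containing both s and t

    crosswalk: Dict[str, List[Tuple[str, float]]] = {}
    for s, targets in pair_count.items():
        scored = [(t, n / source_count[s]) for t, n in targets.items()]
        kept = [(t, p) for t, p in scored if p >= min_score]
        # Highest score first; ties broken alphabetically for a stable result.
        crosswalk[s] = sorted(kept, key=lambda item: (-item[1], item[0]))
    return crosswalk


# Hypothetical toy corpus: three documents indexed with both vocabularies.
EXAMPLE_DOCS: List[DoublyIndexedDoc] = [
    ({"youth unemployment"}, {"adolescent", "unemployment"}),
    ({"youth unemployment"}, {"unemployment"}),
    ({"migration"}, {"migrant"}),
]
```

On the toy corpus, "youth unemployment" maps to "unemployment" with score 1.0 and to "adolescent" with score 0.5. Unlike an intellectually built cross-concordance, such a table is only as good as the doubly indexed corpus it is learned from.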
25
Conclusion
Today's search engines do not adequately solve the problem of a worldwide search for relevant documents and data in a special scientific community. They represent only an incomplete, albeit valuable, first step. Users want to interlink literature and research project databases with the catalogues of virtual libraries, the WWW homepages of science institutions and fact sources, e.g. data archives with their survey data. Here integration should not be performed only on a technical level or solely through intellectually created links, as is the case at present. A key role is played by automatic transfer between the different content analysis methods and standardizations of the document sets to be integrated. Based on the initial empirical results of different IZ projects, the proposed strategy appears to be highly promising: vagueness is not handled in an undifferentiated way, as a single transfer between the query and all documents, but in a cognitively plausible way by individual bilateral modules.