The Interplay of Big Data, WorldCat, and Dewey
Big Data, Linked Data: Classification Research at the Junction. 24th ASIS&T SIG/CR Classification Research Workshop, 2 November 2013. Rebecca Green, OCLC [email protected]; Michael Panzer, OCLC [email protected].
The world’s libraries. Connected.
The Interplay of Big Data, WorldCat, and Dewey
Big Data, Linked Data: Classification Research at the Junction 24th ASIS&T SIG/CR Classification Research Workshop, 2 November 2013
Rebecca Green, OCLC [email protected]
Michael Panzer, OCLC [email protected]

Roadmap
• Setting the stage
  • Big data
  • WorldCat as big data
  • Literary warrant and the DDC
• “Classification analytics”
  • Classified works
  • Access points
  • Trending topics
  • Structure of discipline
Setting the stage

3 V’s of big data
• Volume
  • Terabytes (1000^4), petabytes (1000^5), exabytes (1000^6), . . .
  • Number of transactions vs. number of bytes
  • My big data is not your big data

3 V’s of big data (cont.)
• Variety
  • Sources, perspectives, standards
  • Structured vs. unstructured data
  • Semantically related datasets
• Velocity
  • Data creation
  • Data analysis

WorldCat as big data
• Variety
  • Records in MARC Bibliographic Format
  • Records in MARC Holdings Format
  • Records in MARC Authority Format (e.g., LCSH, FAST, BISAC, MeSH, VIAF)
  • Vendor records
  • WorldCat knowledge base
  • Institutional registry data
  • Institution-specific acquisitions, circulation, ILL data

WorldCat as big data (cont.)
• Volume
  • Bibliographic data: over 300 million records
  • Holdings data: over 2 billion records
  • Authority data
    • LCSH: 26.4 million headings
    • VIAF: 24.2 million clusters; 21 million links between records

Literary warrant and the DDC
• DDC editorial rules call for literary warrant to be taken into account for:
  • Expansions (i.e., development of new classes)
  • Reductions (i.e., discontinuing entire classes)
  • Form of name used in class descriptions
  • Order in which topics are listed in multitopic caption
  • Creation of and choice of examples in add instructions
  • Indexability of topics (print; WebDewey)
  • Form of name for index entries
“Classification analytics”

Classified works
• Periodic profiles of distribution of classified works across the classification to identify:
  • Expansions: Disciplines/subjects with sufficient literary warrant
  • Reductions: Classes with insufficient literary warrant

Classified works: Expansion warranted (1)
306.44 Language
  Including pragmatics
  Class here anthropological linguistics, ethnolinguistics, sociolinguistics
306.446 Bilingualism and multilingualism
306.449 Language planning and policy
306.449 4–.449 9 Specific continents, countries, localities in modern world
  Add to base number 306.449 notation 4–9 from Table 2, e.g., language policy of India 306.44954

Classified works: Expansion warranted (2)
• Records retrieved in WorldCat searches on dd:306.44* not dd:(306.440* or 306.446* or 306.449*)

Time period   Records retrieved   Language-specific: English, French, German, Spanish
1981-1985       120                 14
1986-1990       412                 59
1991-1995       912                134
1996-2000      1230                163
2001-2005      1603                199
2006-2010      2369                446
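The counts in the table can be turned into simple growth figures. The following is a minimal sketch in Python that uses only the numbers from the slide; the variable names and the choice of a period-over-period ratio as the measure are assumptions made for illustration, not part of any documented DDC editorial procedure.

```python
# Records classed directly in 306.44 (excluding 306.440, 306.446, 306.449),
# copied from the table: period -> (records retrieved, language-specific).
counts = {
    "1981-1985": (120, 14),
    "1986-1990": (412, 59),
    "1991-1995": (912, 134),
    "1996-2000": (1230, 163),
    "2001-2005": (1603, 199),
    "2006-2010": (2369, 446),
}


def growth_ratios(series):
    """Return the ratio of each value to the one before it."""
    return [later / earlier for earlier, later in zip(series, series[1:])]


# Total records per period, in table order.
totals = [total for total, _ in counts.values()]

# Share of each period's records that fall in the language-specific column.
shares = {period: specific / total for period, (total, specific) in counts.items()}

for period, (total, specific) in counts.items():
    print(f"{period}: {total:5d} records, {shares[period]:.1%} language-specific")
print("Period-over-period growth:", [round(r, 2) for r in growth_ratios(totals)])
```

Every ratio is above 1, so the number of works left in the undivided class grows in each five-year period, from 120 to 2369 overall.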

Classified works: Reduction warranted (1)
006.33 *Knowledge-based systems
. . .
006.336 *Programming for knowledge-based systems
006.336 3 *Programming languages for knowledge-based systems
006.337 Programming for knowledge-based systems for specific types of computers, for specific operating systems, for specific user interfaces
006.338 *Programs for knowledge-based systems

Classified works: Reduction warranted (2)

DDC class   1986-1990   1991-1995   1996-2000   2001-2005   2006-2010   2011-2015
006.33           1241         978         612         660         915         246
006.336             0           1           1           6          14           3
006.3363            1           1           0           0           1           0
006.337             0           1           5           5          10           0
006.338             0           0           3           1           3           1

• Records retrieved in WorldCat searches for disjunction of DDC class number and standard subdivisions of number
• Duplicates not filtered out of search results for 006.33
• Duplicates filtered out of all other search results
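One way to read the table is to ask which classes never attract more than a handful of works in any period. The sketch below does this in Python with the counts from the slide. The cutoff `MIN_WORKS_PER_PERIOD` is a hypothetical value chosen only for illustration; the slides do not state a numeric threshold. Note also that the 006.33 counts include duplicates, as stated above, so they overstate that class somewhat.

```python
PERIODS = ["1986-1990", "1991-1995", "1996-2000",
           "2001-2005", "2006-2010", "2011-2015"]

# Records retrieved per DDC class and period, copied from the table.
RECORDS = {
    "006.33":   [1241, 978, 612, 660, 915, 246],
    "006.336":  [0, 1, 1, 6, 14, 3],
    "006.3363": [1, 1, 0, 0, 1, 0],
    "006.337":  [0, 1, 5, 5, 10, 0],
    "006.338":  [0, 0, 3, 1, 3, 1],
}

# Hypothetical cutoff (an assumption, not a figure from the slides).
MIN_WORKS_PER_PERIOD = 20


def reduction_candidates(records, minimum):
    """Return classes whose count stays below `minimum` in every period."""
    return [ddc for ddc, row in records.items() if max(row) < minimum]


for ddc, row in RECORDS.items():
    print(f"{ddc:<9} peak {max(row):5d}  total {sum(row):5d}")
print("Candidates:", reduction_candidates(RECORDS, MIN_WORKS_PER_PERIOD))
```

With this cutoff, all four subdivisions are flagged and 006.33 itself is not, which matches the direction of the slide title.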

Access points
• Analysis of subject heading data in DDC-categorized content to identify:
  • Areas where expansions of new classes should be considered
  • Additional access points / mappings for DDC classes
  • Additional topics to be added to class description

Access points: Standing room topics and literary warrant
• DDC class
  004.678 *Internet
    Including extranets, virtual private networks
    Class here World Wide Web
• LCSH:
  010 ## $a sh 97006102
  150 ## $a Extranets (Computer networks)
  450 ## $a Virtual private networks (Computer networks)
• dd: 004.678* and (hl: extranets w computer w networks) retrieves 69 records

Access points: Topics added to class description
004.6 *Interfacing and communications
. . .
  Including sensor networks
. . .
006.22 *Embedded computer systems [formerly 004.1]
  Class here microcontrollers
  For a specific aspect of embedded computer systems, see the aspect, e.g., systems analysis and design of embedded computer systems 004.21, wireless sensor networks 004.6, software for embedded systems 005.3

Trending topics
• My trending topics are not your trending topics
  • Twitter: sudden high-magnitude spike in activity
  • DDC: “quick” achievement of literary warrant threshold, then a plateau at a steady rate
• Trending topic detection vs. new topic detection
  • Newly minted LCSHs
  • Chapter/paper titles
  • Conferences

Trending topics: Newly minted LCSHs (1)

Date entered   LCSH
2012-08-13     Big data
2012-08-22     Contrast data mining
2013-07-18     Linked data

Trending topics: Newly minted LCSHs (2)

Time period   Records retrieved, su:"big data"   Records retrieved, su:"big data" or ti:"big data"
2001-2005       1                                  17
2006            0                                   0
2007            0                                   0
2008            0                                   2
2009            0                                   0
2010            0                                   7
2011            6                                  74
2012           51                                 227
2013          131                                 413
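The annual rows of this table can be differenced to show how sharply the counts rise. This is a minimal Python sketch using only the figures from the slide; the function name and the use of simple year-to-year differences are assumptions for illustration. The 2001-2005 row is left out because it aggregates five years and is not comparable with the single-year rows.

```python
YEARS = list(range(2006, 2014))  # 2006 through 2013

# Annual counts copied from the table.
SUBJECT_HITS = [0, 0, 0, 0, 0, 6, 51, 131]            # su:"big data"
SUBJECT_OR_TITLE_HITS = [0, 0, 2, 0, 7, 74, 227, 413]  # su: or ti:"big data"


def yearly_increase(counts):
    """Return the change from each year to the next."""
    return [later - earlier for earlier, later in zip(counts, counts[1:])]


su_change = yearly_increase(SUBJECT_HITS)
any_change = yearly_increase(SUBJECT_OR_TITLE_HITS)

for year, d_su, d_any in zip(YEARS[1:], su_change, any_change):
    print(f"{year}: subject {d_su:+4d}, subject or title {d_any:+4d}")
```

The subject-or-title count rises by 67, 153, and 186 in 2011, 2012, and 2013, while the subject-only count starts to move a year later. That is consistent with the heading having entered LCSH only in August 2012: title words pick up the topic before the subject heading exists.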

Trending topics: Conferences
• Big data: 29th British National Conference on Databases
• 1st Workshop on Architectures and Systems for Big Data
• Workshop on big data
• Big Data Analytics: First International Conference
• The Semantic Web: Semantics and Big Data: 10th International Conference
• 2012 workshop on Management of big data systems
• 2nd Workshop on Research in the Large: Using App Stores, Wide Distribution Channels and Big Data in UbiComp Research
• IEEE International Congress on Big Data
• Big Data 2 Knowledge (Workshop)

Trending topics: Chapter/paper titles
• Welcome to the big data age
• Big Brother and big data around the world
• How to make sense of big data?
• Business and social implications of big data
• Big data and health care
• How should big data abuses be addressed?
• What is big data?
• Does big-data equal big value?
• Big-data technologies

Trending topics: Newly minted LCSHs (3)

Time period   Records retrieved, su:"linked data"   Records retrieved, su:"linked data" or ti:"linked data"
2001-2005       7                                     38
2006            1                                      2
2007            2                                      8
2008            2                                     14
2009           14                                     34
2010           17                                     72
2011           29                                     84
2012           54                                    152
2013           57                                    114

(Non-)Trending topics: Newly minted LCSHs (4)

Time period   Records retrieved, su:"contrast data mining"   Records retrieved, su:"contrast data mining" or ti:"contrast data mining"
2001-2005       0                                              0
2006            0                                              0
2007            0                                              0
2008            0                                              0
2009            0                                              0
2010            0                                              0
2011            1                                              1
2012            1                                              3
2013            5                                              9
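The contrast between trending and non-trending headings can be made concrete by asking when each topic first reaches a given number of records in a year. The Python sketch below uses the subject-heading (su:) counts from the three tables for big data, linked data, and contrast data mining. The value of `THRESHOLD` is hypothetical and chosen only for illustration; the slides do not give a numeric literary warrant threshold.

```python
YEARS = list(range(2006, 2014))  # 2006 through 2013

# Annual su: counts copied from the three tables (2001-2005 bucket omitted).
SUBJECT_HITS = {
    "big data":             [0, 0, 0, 0, 0, 6, 51, 131],
    "linked data":          [1, 2, 2, 14, 17, 29, 54, 57],
    "contrast data mining": [0, 0, 0, 0, 0, 1, 1, 5],
}

# Hypothetical cutoff (an assumption, not a figure from the slides).
THRESHOLD = 20


def first_year_at_or_above(counts, threshold):
    """Return the first year whose count reaches `threshold`, or None."""
    for year, count in zip(YEARS, counts):
        if count >= threshold:
            return year
    return None


crossings = {topic: first_year_at_or_above(counts, THRESHOLD)
             for topic, counts in SUBJECT_HITS.items()}

for topic, year in crossings.items():
    print(f"{topic}: {year if year is not None else 'threshold not reached'}")
```

Under this cutoff, linked data crosses in 2011 and big data in 2012, while contrast data mining never does within the period shown, which is the sense in which the slide title calls it non-trending.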

Structure of discipline
• Analysis of title data in DDC-categorized content to identify facet structure of discipline
  • Retrieve bibliographic records from WorldCat for monographic literature
  • Isolate title data
  • Identify noun phrases in the titles
  • Use conceptual density measure of Agirre & Rigau
  • Disambiguate noun phrases
  • Identify appropriate generalizations
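The first steps of this pipeline (isolate titles, find candidate phrases) can be sketched in a few lines of Python. This is a deliberately naive stand-in: it splits titles on a small, hand-picked list of function words instead of using part-of-speech tagging, and it does not attempt the disambiguation step or the Agirre & Rigau conceptual density measure. The function-word list and all names are assumptions, and the input titles are the chapter/paper titles from the earlier slide rather than records retrieved from WorldCat.

```python
import re
from collections import Counter

# Hand-picked function words used as phrase boundaries (an assumption).
FUNCTION_WORDS = {
    "a", "an", "the", "and", "or", "of", "to", "in", "around",
    "is", "be", "does", "should", "how", "what",
}

# Titles taken from the "Chapter/paper titles" slide.
TITLES = [
    "Welcome to the big data age",
    "Big Brother and big data around the world",
    "How to make sense of big data?",
    "Business and social implications of big data",
    "Big data and health care",
    "How should big data abuses be addressed?",
    "What is big data?",
    "Does big-data equal big value?",
    "Big-data technologies",
]


def candidate_phrases(title):
    """Split a title into maximal runs of words between function words."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    phrases, current = [], []
    for word in words:
        if word in FUNCTION_WORDS:
            if current:
                phrases.append(" ".join(current))
            current = []
        else:
            current.append(word)
    if current:
        phrases.append(" ".join(current))
    return phrases


# Count how often each candidate phrase occurs across all titles.
phrase_counts = Counter(p for title in TITLES for p in candidate_phrases(title))
print(phrase_counts.most_common(3))
```

Even this crude split surfaces "big data" as the most frequent candidate. It also produces noise such as "make sense" and "big data equal big value", which is why a real pipeline would need proper noun-phrase identification and the later disambiguation and generalization steps.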

That’s all, folks! -- Thank you
La fin -- Merci beaucoup
The Interplay of Big Data, WorldCat, and Dewey

(Non-)Trending topics: Newly minted LCSHs (5)

Time period   Records retrieved, su:"Attribute focusing Data mining"   Records retrieved, su:"Attribute focusing Data mining" or ti:"Attribute focusing"
2001-2005       0                                                        0
2006            0                                                        0
2007            0                                                        0
2008            0                                                        0
2009            0                                                        0
2010            0                                                        0
2011            0                                                        0
2012            1                                                        1
2013            0                                                        0