U of R eXtensible Catalog Team MetaCat. Problem Domain
Post on 21-Dec-2015
![Page 1: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/1.jpg)
U of R eXtensible Catalog
Team MetaCat
![Page 2: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/2.jpg)
Problem Domain
![Page 3: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/3.jpg)
A Modern Library
• Card catalogs are stored on a computer
• Card catalogs store metadata about books
  • Subject
  • Author(s)
• Searching for a book is done via an OPAC (Online Public Access Catalog)
  • Example: http://albert.rit.edu/
![Page 4: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/4.jpg)
Card Catalog Metadata
• Two types of records
  • A bibliographic record represents a book, and is linked to multiple authority records.
  • An authority record represents a single author or subject.
• Metadata has been hand-typed by librarians across the country
  • MARC: MAchine Readable Cataloging (XML), specifies both bib. and auth. record formats
  • Dublin Core: also an XML format, but only for bib. records
![Page 5: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/5.jpg)
Metadata Issues
• Since metadata has been hand-typed, it may be inconsistent
• An author could be:
  • “Mark Twain”
  • “Twain, Mark”
  • “M. Twain”
  • “Samuel Clemens”
• If a user searches for “Mark Twain”, the search may not return all related books
![Page 6: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/6.jpg)
Goals
• Bibliographic Record
  • Author field
    • Name
    • Date of Birth, Death
• Authority Record
  • Authorized Form
  • Alternate Forms: Alternate form 1, Alternate form 2, …
  • See Also: References to other authority records
![Page 7: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/7.jpg)
Sponsor’s Solution
![Page 8: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/8.jpg)
Iterative Process Flow
Requirement Elicitation
Requirement Analysis
Define Architecture
Update Release Plan
Produce SRS & acceptance tests
Subsystem Design
Identify Integration Tests
Implementation
Integration
Acceptance Testing
Delivery
For each release:
Update Documentation
![Page 9: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/9.jpg)
Metrics
• Effort by type of activity
• Test metrics (JUnit)
• Defects by type
![Page 10: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/10.jpg)
Effort by Type
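| Period | Meeting (hrs) | Development (hrs) | Documentation (hrs) |
|---|---|---|---|
| Before | ~40 | 0 | 0 |
| 1/12-1/18 | 45 | 29 | 2 |
| 1/18-1/25 | 20 | 43 | 5 |
| 1/26-2/1 (R1) | 24 | 41 | 4 |
| 2/2-2/8 | 20 | 31 | 7 |
| 2/9-2/15 (R2) | 24 | 2 | 0 |
| Total | 133 | 146 | 18 |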
![Page 11: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/11.jpg)
Hours spent on activities
[Chart: Effort by Type. Y-axis: hours (0 to 50). X-axis: time (Before, 1/12-1/18, 1/18-1/25, 1/26-2/1 (R1), 2/2-2/8, 2/9-2/15 (R2)). Series: Meeting, Development, Documentation.]
![Page 12: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/12.jpg)
Issuetracker
Initially, issues were not recorded properly.
The issue tracker is now used to track:
1. Issues (design, documentation, process)
2. Bugs
3. Discussions (new features, nice-to-haves)
![Page 13: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/13.jpg)
Issuetracker
![Page 14: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/14.jpg)
Defects by Type
![Page 15: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/15.jpg)
Status
• 3.1 Import a record into database
  (R1) FR-1.1: The system shall parse the XML record.
  (R2) FR-1.2: The system shall store the information obtained from parsing the XML record in the MySQL database.
  (R1) FR-1.3: The system shall be able to import multiple records at once (batch processing).
  (R1) FR-1.4: The system shall normalize strings.
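FR-1.4 does not say which normalization rules apply. The following is a minimal sketch, assuming that normalization means stripping accents, lower-casing, and collapsing punctuation and whitespace; the class name `NameNormalizer` and its rules are hypothetical, not the team's implementation.

```java
import java.text.Normalizer;
import java.util.Locale;

/** Hypothetical helper; the actual normalization rules are not given in the slides. */
final class NameNormalizer {
    private NameNormalizer() {}

    static String normalize(String raw) {
        // Decompose accented characters (NFD), then drop the combining marks.
        String s = Normalizer.normalize(raw, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
        // Lower-case, replace each run of non-letter, non-digit characters
        // with one space, and trim the ends.
        return s.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{Nd}]+", " ")
                .trim();
    }
}
```

Under these assumed rules, `NameNormalizer.normalize("Twain,  Mark")` yields `twain mark`, so spacing and punctuation differences between hand-typed records no longer prevent a comparison. Word order is left alone, so "Mark Twain" and "Twain, Mark" still differ after this step.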
![Page 16: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/16.jpg)
Status cont.
• 3.2 Matching records
  (R1) FR-2.1: The system shall create a new authority record.
  (R2) FR-2.2: The system shall match two strings and give a confidence level for the match.
  (R2) FR-2.3: The system shall store the results of the matching, including the degree of certainty and the link(s) to the matched authorized record(s).
  (R1) FR-2.4: The system shall identify all unprocessed records in the records database. Unprocessed records are records that have not yet been matched.
  (R1) FR-2.5: The system shall create a new authority record, and store it in the database.
![Page 17: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/17.jpg)
Status cont.
(R1) FR-2.6: The system shall replace the data in authority-controlled fields with its authorized form and store the link to its authorized form if the degree of certainty is above the auto-accept threshold.
(R2) FR-2.7: The system shall mark the record to be reviewed by a person if the degree of certainty is between the auto-accept and auto-reject thresholds.
FR-2.8: The system shall create a new authority record using the information from the current record, and create a link between those two records, if the degree of certainty is below the auto-reject threshold.
(R1) FR-2.9: The system shall analyze unprocessed records on demand.
(R1) FR-2.10: The system shall attempt to match records first by comparing authority names.
(R2) FR-2.11: The system shall attempt to match records by comparing alternative names if the first attempt (FR-2.10) fails.
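The three-way decision in FR-2.6 through FR-2.8 can be sketched as follows. This is an illustration only: the names `MatchDecision` and `ThresholdPolicy` are hypothetical, the confidence scale (0.0 to 1.0) is an assumption, and treating a confidence exactly equal to a threshold as "needs review" is also an assumption, since the requirements do not define the boundary cases.

```java
/** Outcomes corresponding to FR-2.6, FR-2.7 and FR-2.8 (names are hypothetical). */
enum MatchDecision { AUTO_ACCEPT, NEEDS_REVIEW, CREATE_NEW_AUTHORITY }

final class ThresholdPolicy {
    private final double autoAccept;
    private final double autoReject;

    ThresholdPolicy(double autoAccept, double autoReject) {
        if (autoReject > autoAccept) {
            throw new IllegalArgumentException("auto-reject must not exceed auto-accept");
        }
        this.autoAccept = autoAccept;
        this.autoReject = autoReject;
    }

    MatchDecision decide(double confidence) {
        if (confidence > autoAccept) {
            return MatchDecision.AUTO_ACCEPT;          // FR-2.6: above auto-accept
        }
        if (confidence < autoReject) {
            return MatchDecision.CREATE_NEW_AUTHORITY; // FR-2.8: below auto-reject
        }
        return MatchDecision.NEEDS_REVIEW;             // FR-2.7: in between (boundaries included by assumption)
    }
}
```

With assumed thresholds of 0.9 and 0.4, `new ThresholdPolicy(0.9, 0.4).decide(0.6)` returns `NEEDS_REVIEW`, which is the case that feeds the review workflow.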
![Page 18: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/18.jpg)
Status cont.
• 3.5 Review possible matches
  (R2) FR-5.1: The system shall gather a collection of records that are marked for review from the database. Questionable matches have a degree of certainty between the auto-accept and auto-reject thresholds.
  (R2) FR-5.2: The system shall replace the data in authority-controlled fields with its authorized form and store the link to its authorized form if the user approves the matching.
  (R2) FR-5.3: The system shall replace the data in authority-controlled fields with its authorized form and store the link to its authorized form if the user approves the matching.
![Page 19: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/19.jpg)
Our Solution
![Page 20: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/20.jpg)
Architecture
[Architecture diagram. Components: API, MySQL DB, Exporter, DAO (Data Access Object), GUI, «subsystem» Matcher, «subsystem» Import. One association is labeled "match".]
![Page 21: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/21.jpg)
Matcher
• In NACM, we need to be able to match Bibliographic records (books) to Authorized records (authors).
• The information in the records may not always match exactly, or may match multiple records!
![Page 22: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/22.jpg)
Matching Problems
• Different forms of the same name
  • Nate versus Nathan, typos
• Different authors with the same name
  • George Bush (41) versus George Bush (43)
• Aliases or pen names
  • Samuel Clemens versus Mark Twain
![Page 23: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/23.jpg)
Matching Problems
• To assist in matching different forms of an author’s name, Authority records have a list of alternate names in addition to the authorized form.
• Alternate names may not be distinct.
![Page 24: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/24.jpg)
Matcher Design
• We need a matching strategy that is easy to extend to add new matching rules, while still being fast.
![Page 25: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/25.jpg)
Matching Subsystem
![Page 26: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/26.jpg)
MatchStrategy
• Abstract class that defines the basics of a matching rule
  • Matching method
  • Match confidence
• All matching strategies extend this class
![Page 27: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/27.jpg)
StringTransformer
• Abstract class for string manipulation rules
  • String transform method
  • Transformation confidence
• All string manipulation rules extend this class
![Page 28: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/28.jpg)
MatchDriver
• Handles performing a match
• Creates pairs of strategies & transformations
• Sorts pairs based on overall confidence
• Iterates through the pairs looking for matches
![Page 29: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/29.jpg)
Matcher Extensibility
• Adding new rules
  • Extend MatchStrategy or StringTransformer
  • Implement new matching or transforming rules
  • Assign a confidence
  • Add to MatchDriver
• MatchDriver takes care of the rest
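The matcher design described on the preceding slides might look roughly like the sketch below. Only the class names `MatchStrategy`, `StringTransformer` and `MatchDriver` come from the slides; the method names, the example rules (`ExactMatch`, `Identity`, `LowerCase`), their confidence values, and the choice to compute a pair's overall confidence as the product of the two confidences are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

/** A matching rule with a fixed confidence (method names are assumptions). */
abstract class MatchStrategy {
    abstract boolean matches(String candidate, String authorized);
    abstract double confidence();
}

/** A string manipulation rule with a fixed confidence (method names are assumptions). */
abstract class StringTransformer {
    abstract String transform(String input);
    abstract double confidence();
}

// Hypothetical example rules.
final class ExactMatch extends MatchStrategy {
    @Override boolean matches(String candidate, String authorized) { return candidate.equals(authorized); }
    @Override double confidence() { return 1.0; }
}

final class Identity extends StringTransformer {
    @Override String transform(String input) { return input; }
    @Override double confidence() { return 1.0; }
}

final class LowerCase extends StringTransformer {
    @Override String transform(String input) { return input.toLowerCase(Locale.ROOT); }
    @Override double confidence() { return 0.9; }
}

final class MatchDriver {
    private static final class Pair {
        final MatchStrategy strategy;
        final StringTransformer transformer;

        Pair(MatchStrategy strategy, StringTransformer transformer) {
            this.strategy = strategy;
            this.transformer = transformer;
        }

        // Assumption: overall confidence is the product of the two confidences.
        double confidence() { return strategy.confidence() * transformer.confidence(); }
    }

    private final List<Pair> pairs = new ArrayList<>();

    MatchDriver(List<MatchStrategy> strategies, List<StringTransformer> transformers) {
        // Create every strategy/transformer pair.
        for (MatchStrategy s : strategies) {
            for (StringTransformer t : transformers) {
                pairs.add(new Pair(s, t));
            }
        }
        // Sort so the most trusted pairs are tried first.
        pairs.sort((x, y) -> Double.compare(y.confidence(), x.confidence()));
    }

    /** Returns the confidence of the first pair that matches, or empty if none does. */
    Optional<Double> match(String candidate, String authorized) {
        for (Pair p : pairs) {
            String c = p.transformer.transform(candidate);
            String a = p.transformer.transform(authorized);
            if (p.strategy.matches(c, a)) {
                return Optional.of(p.confidence());
            }
        }
        return Optional.empty();
    }
}
```

In this sketch, adding a rule means writing one more subclass and passing an instance to the `MatchDriver` constructor; pairing and ordering need no other changes. Because pairs are tried in descending confidence order, the first hit is also the most trusted one.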
![Page 30: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/30.jpg)
Importer
• Takes in input streams and parses them to extract authority and bibliographic data
• Uses a SAX parser to read the XML into a Document Object Model (DOM) object
• Data is extracted from document, normalized, and inserted into the database
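The parse-and-extract step can be illustrated with the JDK's built-in XML APIs. This is a sketch under stated assumptions: it uses `DocumentBuilder` from `javax.xml.parsers`, which may not be the parser the team used, and the element name `author` and the class `ImporterSketch` are hypothetical stand-ins rather than the MARC or Dublin Core field names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** Hypothetical importer step: input stream -> DOM -> extracted author strings. */
final class ImporterSketch {
    private ImporterSketch() {}

    static List<String> extractAuthors(InputStream in) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(in);
            // "author" is an assumed element name, used only for illustration.
            NodeList nodes = doc.getElementsByTagName("author");
            List<String> authors = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                authors.add(nodes.item(i).getTextContent().trim());
            }
            return authors;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse record", e);
        }
    }

    /** Convenience overload for in-memory XML. */
    static List<String> extractAuthors(String xml) {
        return extractAuthors(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }
}
```

A real importer would then normalize each extracted string and hand it to the data access layer for insertion; those steps are omitted here.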
![Page 31: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/31.jpg)
Importer
![Page 32: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/32.jpg)
MySQL data model
[Entity-relationship diagram. Tables and columns:]
record_types: id (PK), name
names: id (PK), orig_string, nor_string
authority_records: id (PK), processed, generated, xml_hashcode, orig_xml, record_type_id (FK1), authority_name_id (FK2)
authority_records_alter_forms: name (PK, FK1), record_id (PK, FK2)
authority_records_see_also: name (PK, FK1), record_id (PK, FK2)
bib_records: id (PK), processed, xml_hashcode, orig_xml, record_type_id (FK1)
bib_records_titles: name (PK, FK1), record_id (PK, FK2)
bib_records_authors: name (PK, FK1), record_id (PK, FK2)
bib_records_subjects: name (PK, FK1), record_id (PK, FK2)
authority_records_links: id (PK), approved, flagged, rejected, auth_record_id (FK1), bib_record_id (FK2), evidence, time_found, time_verifed, approvedby, percent_confidence, string_id (FK3)
bib_records_author_links: bib_record_id (PK, FK1), auth_link_id (PK, FK2)
bib_records_subjects_links: bib_record_id (PK, FK2), auth_link_id (PK, FK1)
![Page 33: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/33.jpg)
Using Hibernate
• Transparent Data Persistence
• Manages relationships between entities
• Benefits
  • Query caching
  • Lazy-loading of associated entities
  • Automatic flagging of changes
  • Programmatic API for complex queries
![Page 34: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/34.jpg)
How it Works
• Define Schema
• Define Domain Model
• Use XML to map fields in classes to columns in tables
  • Define cascading behavior
![Page 35: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/35.jpg)
Hibernate Caveats
• Designed with transactions in mind
  • But, we use batch processing!
• Query language lacks some of the power of SQL
• Not 100% transparent
  • Design and use of domain model is affected
![Page 36: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/36.jpg)
Results Viewing GUI
[Class diagram, package gui.resultsGUI:]
ResultsTable: +refresh(), +sortBy(in column : int), +updateLinkCountLabel()
ResultsTableModel: +getValueAt(in row, in column)
FilterControls
PagingControls
SelectedLinkControls
Filter
AuthorityLinkDAO: +findAllWithFilter()
AuthorityLink
Diagram note: Creates and lays out a JTable and other GUI components
Associations: "creates", "database"
• A table displaying all created links
• Can be filtered, sorted, and paged
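A paged table model behind such a results view could be sketched as below. The class names `ResultsTableModel` and `AuthorityLink` appear in the diagram, but the fields of `AuthorityLink`, the column set, and the paging constructor are assumptions for illustration; filtering and sorting are left out.

```java
import java.util.ArrayList;
import java.util.List;
import javax.swing.table.AbstractTableModel;

/** Hypothetical fields; the slides do not list the attributes of AuthorityLink. */
final class AuthorityLink {
    final String bibAuthor;
    final String authorizedForm;
    final int percentConfidence;

    AuthorityLink(String bibAuthor, String authorizedForm, int percentConfidence) {
        this.bibAuthor = bibAuthor;
        this.authorizedForm = authorizedForm;
        this.percentConfidence = percentConfidence;
    }
}

/** Exposes one page of links to a JTable. */
final class ResultsTableModel extends AbstractTableModel {
    private static final String[] COLUMNS = {"Bib author", "Authorized form", "Confidence (%)"};
    private final List<AuthorityLink> page;

    ResultsTableModel(List<AuthorityLink> all, int pageIndex, int pageSize) {
        // Clamp the window so an out-of-range page is simply empty.
        int from = Math.min(pageIndex * pageSize, all.size());
        int to = Math.min(from + pageSize, all.size());
        this.page = new ArrayList<>(all.subList(from, to));
    }

    @Override public int getRowCount() { return page.size(); }
    @Override public int getColumnCount() { return COLUMNS.length; }
    @Override public String getColumnName(int column) { return COLUMNS[column]; }

    @Override public Object getValueAt(int row, int column) {
        AuthorityLink link = page.get(row);
        switch (column) {
            case 0: return link.bibAuthor;
            case 1: return link.authorizedForm;
            case 2: return link.percentConfidence;
            default: throw new IndexOutOfBoundsException("column " + column);
        }
    }
}
```

A `JTable` constructed with `new JTable(model)` would render the current page; moving to another page means building a model for a different `pageIndex`.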
![Page 37: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/37.jpg)
Future Plans
• Verify that the matching algorithm is doing the right things
• Implement string transformers
• Create new XC records
• Merge and update records with new data upon import
• Configuration files for the system
![Page 38: U of R eXtensible Catalog Team MetaCat. Problem Domain](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d5e5503460f94a3dfd0/html5/thumbnails/38.jpg)
Demo!