Pears, Gwen and JZKit Training
Designing and Building Databases: Topics
A Pears Database Building - Introduction
B Database Description File
C Building Databases
D Configuring and Testing
E Database Utilities and Maintenance
F Advanced Database Description Concepts
Pears Database Building: Introduction
Pears provides tools that allow you to:
• Build databases from structured data such as:
– MARC - that has a defined standard structure.
– XML – that has loose structure but clearly identified fields.
• Determine each index for the database.
• Load the records into a database following your indexing definitions.
Pears Database Building: Exercise Preview
• View the structure of a small set of MARC records.
• Build a small database from those records.
• Look at the setup database description file.
• Build the Database.
• Test database for correctness using testgwen.
• Add the database to the JZKit configuration files, making it searchable by a Z39.50 Client.
Pears Database Building: The Gwen Search Engine
• The Gwen search engine is a generalized text retrieval engine.
• Functionality is contained in Java classes that can be embedded in Java applications, including the JZKit Z39.50 server.
• The JZKit server lets multiple simultaneous users browse, search and display records from Pears databases, using any client program that supports the Z39.50 protocol.
Pears Database Building: Logical and Physical Databases
• A Gwen Database is a logical database
– It provides features for searching and retrieving records
• A Pears Database is a physical database
– It provides the information that a Gwen database needs
Pears Database Building: Gwen Database Features
• A Gwen Database has:
– Indexes with numeric IDs
– Index Terms with Postings Lists
– Postings Lists have Record Numbers and Restrictor Data
Pears Database Building: What is a Pears Database?
• A Pears database is a single physical file with three main kinds of data
– Record data
– Index data
– Postings data
Pears Database Building: Record Data
• Contains the actual records of your database.
• Records are stored as BER-encoded records.
• Each record is identified by a unique logical record number.
Pears Database Building: Index Data
• Contains a sorted list of all the Index Terms extracted from your data records.
• Index Terms Contain:
– term/index-id.
– number of records that term appears in (postings count).
– a list of records that contain that term or a pointer to such a list.
Understanding the Database Structure
INDEX
abercrombie: au: postings=2, postings list=r17, r15
anderson: au: postings=102, postings list ID=l21
Pears Database Building: Postings Data
• Contains a list of record IDs for each of the terms in the index.
• Each record ID may have restrictor and proximity information associated with it.
Understanding the Database Structure
INDEX
abercrombie: au: postings=2, postings list=r17, r15
anderson: au: postings=102, postings list ID=l21
POSTINGS
l21: r1024, r1021, r1007, r995, …
Understanding the Database Structure
INDEX
abercrombie: au: postings=2, postings list=r995, r175
anderson: au: postings=102, postings list ID=l21
POSTINGS
l21: r1024, r1021, r1007, r995, …
RECORDS
r995:
au: Abercrombie & Anderson
ti: Tennis Made Easy
yr: 1905
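The relationship between index, postings and record data can be sketched in a few lines of Python. This is only a toy model of the slide above: the dictionaries and the lookup() and fetch() helpers are made up for illustration and are not Pears or Gwen classes.

```python
# Toy model of the three kinds of data in a Pears database file.
# All names here are illustrative; none of them are Pears classes.

records = {
    "r995": {"au": "Abercrombie & Anderson", "ti": "Tennis Made Easy", "yr": "1905"},
}

# Long postings lists live in the postings data, keyed by a list ID.
postings = {
    "l21": ["r1024", "r1021", "r1007", "r995"],  # the slide elides the rest
}

# Each index entry holds a postings count and either the record list
# itself (short lists) or the ID of a list in the postings data.
index = {
    ("abercrombie", "au"): {"postings": 2, "records": ["r995", "r175"]},
    ("anderson", "au"): {"postings": 102, "list_id": "l21"},
}


def lookup(term, index_id):
    """Return the record numbers posted under a term in one index."""
    entry = index.get((term, index_id))
    if entry is None:
        return []
    if "records" in entry:
        return entry["records"]
    return postings[entry["list_id"]]


def fetch(record_number):
    """Return the stored record, or None if it is not in this toy data."""
    return records.get(record_number)
```

A search for anderson in the au index follows the list ID into the postings data, and each record number found there can then be fetched from the record data.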
Pears Database Building: Data Conversion
• The Bartlett class is responsible for updating a Pears database.
• Bartlett automatically converts input records to the Pears internal BER format.
• The objects that do the conversion are called RecordHandlers.
• RecordHandler is a Java Interface class
– You can write your own RecordHandlers!
Pears Database Building: Data Conversion Options
• There are two primary Pears RecordHandlers that convert your data to BER format.
– HandleUSMARC
– HandleSGML
• There are several others:
– HandleBER, HandleDB, HandlePDB, HandleUnimarc, HandleChinaMarc
Pears Database Building: Data Conversion
• The RecordHandler class has a main() method that you can use to test RecordHandlers and/or your data.
– Usage:
java ORG.oclc.RecordHandler.RecordHandler -c<class> -i<inputFile> -o<outputFile> …
– Example:
java ORG.oclc.RecordHandler.RecordHandler -cUSMARC -iscifi.usmarc -oscifi.ber -n10
Pears Database Building: Data Conversion
• BER (Basic Encoding Rules) is defined by ISO-8825
• It was created to encode ASN.1 records
• Encodes tree-structured data (equivalent to DOM records)
• Can contain binary data (e.g. .jpeg files) (unlike DOM records!)
BER Record Structure
[Figure: tree with root tag=1 and children tag=2, tag=3 (Ohio) and tag=4 (OCLC); tag=2 has children tag=1 (Ralph) and tag=2 (LeVan).]
tag=1, Class=1, form=1, count=3
  tag=2, Class=2, form=1, count=2
    tag=1, Class=2, form=0, count=5
      data=Ralph
    tag=1, Class=2, form=0, count=5
      data=LeVan
  tag=3, Class=2, form=0, count=4
    data=Ohio
  tag=4, Class=2, form=0, count=4
    data=OCLC
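A small Python sketch can reproduce this kind of dump from a tree. It assumes, based only on the example above, that form=1 marks a node with children (count = number of children) and form=0 marks a leaf (count = length of the data). The Node class and dump() function are hypothetical and do not perform real BER encoding.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node; not a Pears or BER library class."""
    tag: int
    cls: int
    data: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def dump(node, depth=0):
    """Return dump lines for a node, indenting two spaces per level."""
    pad = "  " * depth
    if node.children:
        # Assumed: a constructed node reports how many children it has.
        lines = [f"{pad}tag={node.tag}, Class={node.cls}, form=1, count={len(node.children)}"]
        for child in node.children:
            lines.extend(dump(child, depth + 1))
        return lines
    data = node.data or ""
    # Assumed: a leaf reports the length of its data in bytes.
    return [
        f"{pad}tag={node.tag}, Class={node.cls}, form=0, count={len(data.encode())}",
        f"{pad}  data={data}",
    ]


# Tag numbers follow the tree diagram on the slide.
root = Node(1, 1, children=[
    Node(2, 2, children=[Node(1, 2, "Ralph"), Node(2, 2, "LeVan")]),
    Node(3, 2, "Ohio"),
    Node(4, 2, "OCLC"),
])

if __name__ == "__main__":
    print("\n".join(dump(root)))
```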
Pears Database Building: MARC Data Example
For USMARC data (InputRecordType=USMARC):
000 nmm Ia
001 ocm35003642
003 OCoLC
005 19000000000108.0
008 960628s1995 cau d eng d
040 $aFQM$cFQM
096 $aNTERNET
245 00 $aOphthalmic Anesthesia Society $h[computer file].
256 $aComputer data.
260 $a San Diego, CA : $b Ophthalmic Anesthesia Society, $c1995.
516 $aHtml text and images in GIF and JPeg.
538 $aSystem requirements: Html browser, JPeg compatible browser or image viewer.
538 $aMode of access: Internet. Host: www.iea.com/3dans/OAS/oasDhomepage.html
500 $aTitle from title screen.
521 $aMedical.
520 $aHome page of the Ophthalmic Anesthesia Society with articles, references, e-mail addresses of members, pictures and ophthalmic anesthesia resources.
650 02$aSocieties, Medical.
650 02$aOphthalmology.
650 02$aAnesthesia.
710 02$aOphthalmic Anesthesia Society.
856 07$u http://www.iea.com/3dans/OAS/oasDhomepage.html$2http$zOphthalmic Anesthesia Society home page
HandleUSMARC converts this...
01981cam2200349450000800410000001700240004102200140006503000110007906900200009010000200011010000130013011000480014324501080019126000290029950000570032850000460038550003160043152005340074754600120128165000180129365000400131165000410135165000210139269000440141369000400145790002301497690001801520690003001538690002501568690001601593773002201609^^000000s1993eng^_a0370-2693/93/$06.00^^ ^_a0370-2693^^ ^_aPYLBAJ^^ ^_aA9308-1385K-002^^ ^_aBrandenburg, A.^^ ^_aMa, J.P.^^ ^_aInst. fur Theor. Phys., Heidelberg, Germany^^ ^_aCP odd observables for the top-antitop system produced at proton-antiproton and proton-proton colliders^^ ^_aNetherlands^_c7 Jan. 1993^^ ^_aSOURCE:Physics Letters B, vol.298, no.1-2, p. 211-17^^ ^_aTREATMENT: T; Theoretical or Mathematical^^ ^_aCLASS CODES: A1385K (Inclusive reactions, including total cross sections, (energy > 10 GeV))^_aA1110E (Lagrangian and Hamiltonian approach)^_aA1130E (Charge conjugation, parity, time reversal and other discret symmetries)^_aA1340F (Electromagnetic form factors; electric and magnetic moments; structure functions)^^ ^_aThe authors propose some CP odd observables to test CP invariance in the tt system produced at pp and pp colliders. Using these observables the effects of CP violation from the production and from the decay of the top quarks can be separated well. The application of their observables to pp collisions, where one has no CP invariant initial state, is discussed. To parametrize CP violating interactions their use an effective lagrangian for the tt production and a general form factor approach for the decay of t and t (19 Refs.)^^ ^_aEnglish^^ ^_aCP invariance^^ ^_aform factors (elementary particles)^^ ^_aproton-proton inclusive interactions^^ ^_aquark production^^ ^_aantiproton+proton producing antitop+top^^ ^_aproton+proton producing antitop+top^^ ^_aCP odd observables^^ ^_aCP invariance^^ ^_aCP violating interactions^^ ^_aeffective lagrangian^^ ^_aform factor
...to this:
tag=0, Class=1, form=1, count=22
  tag=0, Class=2, form=0, count=8
    data=nmm Ia
  tag=245, Class=2, form=1, count=3
    tag=0, Class=2, form=0, count=2
      data=00
    tag=1, Class=2, form=0, count=29
      data=Ophthalmic Anesthesia Society
    tag=8, Class=2, form=0, count=16
      data=[computer file].
  tag=260, Class=2, form=1, count=4
    tag=0, Class=2, form=0, count=2
      data=
    tag=1, Class=2, form=0, count=15
      data=San Diego, CA :
    tag=2, Class=2, form=0, count=30
      data=Ophthalmic Anesthesia Society,
    tag=3, Class=2, form=0, count=5
      data=1995.
  tag=650, Class=2, form=1, count=2
    tag=0, Class=2, form=0, count=2
      data= 2
    tag=1, Class=2, form=0, count=19
      data=Societies, Medical.
  tag=650, Class=2, form=1, count=2
    tag=0, Class=2, form=0, count=2
      data= 2
    tag=1, Class=2, form=0, count=14
      data=Ophthalmology.
  tag=650, Class=2, form=1, count=2
    tag=0, Class=2, form=0, count=2
      data= 2
    tag=1, Class=2, form=0, count=11
      data=Anesthesia.
  tag=710, Class=2, form=1, count=2
    tag=0, Class=2, form=0, count=2
      data= 2
    tag=1, Class=2, form=0, count=30
      data=Ophthalmic Anesthesia Society.
Pears Database Building: SGML Data Example
For SGML data (InputRecordType=SGML):
.tags file:
Title 1
Local-Subject-Index 2
Abstract 3
Spatial-Domain 4
  Geographic-Coverage 1
  Coverage-Description 2
  Bounding-Coordinates 3
    West-Bounding-Coordinate 1
    East-Bounding-Coordinate 2
    North-Bounding-Coordinate 3
    South-Bounding-Coordinate 4
Time-Period 5
  Time-Period-Textual 1
Name 6
Organization 7
SGML record:
<Rec>
  <Title>BEG - PANHANDLE COLOR INFRARED AERIAL PHOTOGRAPHY</Title>
  <Abstract>TNRIS file no. 01010422. File consists of original and duplicate positive transparencies, color-infrared, stereoscopic, 1:80,000, quad centered, aerial photography of the Texas Panhandle, flown in September, 1977 by Mark Hurd. </Abstract>
  <Spatial-Domain>
    <Geographic-Coverage>US STATE</Geographic-Coverage>
    <Coverage-Description>TEXAS PANHANDLE</Coverage-Description>
    <Bounding-Coordinates>
      <West-Bounding-Coordinate>-102</West-Bounding-Coordinate>
      <East-Bounding-Coordinate>-98</East-Bounding-Coordinate>
      <North-Bounding-Coordinate>30</North-Bounding-Coordinate>
      <South-Bounding-Coordinate>26</South-Bounding-Coordinate>
    </Bounding-Coordinates>
  </Spatial-Domain>
  <Time-Period>
    <Time-Period-Textual>1977-1977</Time-Period-Textual>
  </Time-Period>
  <Name>BUREAU OF ECONOMIC GEOLOGY</Name>
  <Organization>BUREAU OF ECONOMIC GEOLOGY</Organization>
</Rec>
Converted SGML
tag=0, Class=1, form=1, count=8
  tag=1, Class=2, form=1, count=1
    tag=1, Class=2, form=0, count=49
      data=BEG - PANHANDLE COLOR INFRARED AERIAL PHOTOGRAPHY
  tag=2, Class=2, form=1, count=1
    tag=1, Class=2, form=0, count=35
      data=AERIAL PHOTOGRAPHY; INFRARED; TEXAS
  tag=3, Class=2, form=1, count=1
    tag=1, Class=2, form=0, count=229
      data=TNRIS file no. 01010422. File consists of original and duplicate positive transparencies, color-infrared, stereoscopic, 1:80,000, quad centered, aerial photography of the Texas Panhandle, flown in September, 1977 by Mark Hurd.
  tag=4, Class=2, form=1, count=3
    tag=1, Class=2, form=1, count=1
      tag=1, Class=2, form=0, count=8
        data=US STATE
    tag=2, Class=2, form=1, count=1
      tag=1, Class=2, form=0, count=15
        data=TEXAS PANHANDLE
    tag=3, Class=2, form=1, count=4
      tag=1, Class=2, form=1, count=1
        tag=1, Class=2, form=0, count=4
          data=-102
      tag=2, Class=2, form=1, count=1
        tag=1, Class=2, form=0, count=3
          data=-98
      tag=3, Class=2, form=1, count=1
        tag=1, Class=2, form=0, count=2
          data=30
      tag=4, Class=2, form=1, count=1
        tag=1, Class=2, form=0, count=2
          data=26
  tag=5, Class=2, form=1, count=1
    tag=1, Class=2, form=1, count=1
      tag=1, Class=2, form=0, count=9
        data=1977-1977
  tag=6, Class=2, form=1, count=1
    tag=1, Class=2, form=0, count=26
      data=BUREAU OF ECONOMIC GEOLOGY
  tag=7, Class=2, form=1, count=1
    tag=1, Class=2, form=0, count=26
      data=BUREAU OF ECONOMIC GEOLOGY
Pears Database Building: Viewing a BER record - BufferedBerStream
• BER records are not readable in their encoded form.
• BufferedBerStream is a class whose main() method dumps BER records in a human-readable format.
usage: BufferedBerStream -i<input file> [-n<numrecs>] [-s<skiprecs>]
To see a page at a time:
BufferedBerStream -i<input file> | more
To dump to a file:
BufferedBerStream -i<input file> > filename
Exercise Configuration Information
• The database is in ~/dbs/scifi
• The jar files are in ~/jars
• Aliases are:
alias Bartlett 'java -Xmx800m ORG.oclc.pears.Bartlett.Bartlett'
alias BufferedBerStream 'java ORG.oclc.ber.BufferedBerStream'
alias IndexLoop 'java ORG.oclc.pears.util.IndexLoop'
alias RecordHandler 'java ORG.oclc.RecordHandler.RecordHandler'
alias testgwen 'java ORG.oclc.os.gwen.testgwen'
alias validate 'java ORG.oclc.pears.util.validate'
alias ZClient 'java com.k_int.z3950.client.ZClient'
alias ZServer 'java com.k_int.z3950.server.ZServer'
• The CLASSPATH is:
setenv CLASSPATH .:/home/levan/java:/home/levan/lib/pears.jar:/home/levan/lib/Dbutils.jar:/home/levan/lib/ki-jzkit-z3950.jar:/home/levan/lib/ki-util.jar:/home/levan/lib/log4j.jar:/home/levan/lib/a2jruntime.jar:/home/levan/lib/ki-jzkit-iface.jar:/home/levan/lib/gwen.jar:/home/levan/lib/xerces.jar
• All of this is in ~/.tcshrc. Just say “tcsh” at the command line to get it.
Pears Database Building Exercise 1: Identifying Data in a BER Record
• Using the BER records generated from the MARC data file dbs/scifi/scifi.usmarc, identify the tags used for the data.
(Hint: run RecordHandler to make the BER records and then BufferedBerStream to look at them.)
Designing and Building Databases: Topics
A Pears Database Building - Introduction
B Database Description File
C Building Databases
D Configuring and Testing
E Database Utilities and Maintenance
F Advanced Database Description Concepts
Database Description File: Function
• The database description is a text file that you set up to determine:
– Database Indexing
– Which indexes support proximity searching
– Which index contains the unique record ID
• Known as the <filename>desc.ini file
Database Description File: File Example
[DB]
Name=scifi
RecordIDIndex=17
InputRecordType=USMARC
[Title]
index=1
routine=ORG.oclc.pears.IndexRoutines.Words
tagpath*=245/1
tagpath*=245/2
[Author]
index=3
routine=ORG.oclc.pears.IndexRoutines.Words
tagpath*=100/1
tagpath*=100/2
tagpath*=700/1
[Control Number]
index=5
routine=ORG.oclc.pears.IndexRoutines.Words
tagpath=1
Slide callouts: Name = database name; RecordIDIndex = accession index; InputRecordType = raw data type; [Title] and the sections after it = index definitions; index = index ID; routine = indexing routine; tagpath = field to be indexed.
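To get a feel for how such a file is organized, here is a minimal Python sketch that reads description-file text into sections. It is not the parser Bartlett uses. It assumes that a key ending in * may be repeated (so its values are collected into a list) and that key names are case-insensitive, since the slides write both index= and Index=.

```python
def parse_description(text):
    """Parse description-file text into {section: {key: value or list}}.

    Assumptions (not confirmed by the slides): keys ending in '*' are
    repeatable, and key names are treated as case-insensitive.
    """
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = sections.setdefault(line[1:-1], {})
            continue
        if current is None or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key, value = key.strip().lower(), value.strip()
        if key.endswith("*"):
            current.setdefault(key[:-1], []).append(value)
        else:
            current[key] = value
    return sections


def index_sections(sections):
    """Sections that carry index, routine and tagpath keys."""
    needed = {"index", "routine", "tagpath"}
    return {name: s for name, s in sections.items() if needed <= set(s)}


EXAMPLE = """\
[DB]
Name=scifi
RecordIDIndex=17
InputRecordType=USMARC
[Title]
index=1
routine=ORG.oclc.pears.IndexRoutines.Words
tagpath*=245/1
tagpath*=245/2
[Control Number]
index=5
routine=ORG.oclc.pears.IndexRoutines.Words
tagpath=1
"""

if __name__ == "__main__":
    for name, section in index_sections(parse_description(EXAMPLE)).items():
        print(name, section["index"], section["tagpath"])
```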
Database Description File: General Database Information
• The [DB] section provides the database name, accession index and input record type
• Syntax:
– [DB]
– Name=<database name>
– RecordIDIndex=<index number>
– InputRecordType=<RecordHandler type>
Database Description File: General Database Information
Example:
[DB]
Name=Test
RecordIDIndex=1
InputRecordType=SGML
Database Description File: Setting up Index Definitions
• Any number of independent indexes can be defined.
• An index can be made from multiple fields.
– Example: index 1 may include title, author, notes, etc.
• Indexes can share fields.
– Example: index 2 may also include title
Database Description File: Setting up Index Definitions
• An index section is any section with Index, Routine and Tagpath
• Syntax:
– [<Index Name>]
– Index=<index number>
– Routine=<index routine>
– Tagpath*=<path to field>
– OccurrenceRoutine=<proximity routine>
Database Description File: Setting up Index Definitions
• index number is any number
• Index routine defines how the term is extracted
- use ORG.oclc.pears.IndexRoutines.Words for basic keywords
- use ORG.oclc.pears.IndexRoutines.Phrase for basic bound phrases
• path to field contains a list of BER tags separated by slashes
• occurrence routine (optional) specifies the routine to add proximity information to the index
Database Description File: Index Definition
Example:
[Title Words]
Index=2
Routine=ORG.oclc.pears.IndexRoutines.Words
Tagpath*=245/1
Tagpath*=245/2
Database Description File: Term Adjacency (Optional)
• Defines positional information stored with each indexed term.
• Adjacency information is stored at build time on a per-record basis, so adjacency is recognized within fields, NOT across field boundaries.
• Set by the OccurrenceRoutine.
• ORG.oclc.pears.Bartlett.wordfield is most commonly used.
Database Description File: Index Definition with Adjacency
Example:
[Title Words]
Index=2
Routine=ORG.oclc.pears.IndexRoutines.Words
OccurrenceRoutine=ORG.oclc.pears.Bartlett.wordfield
Tagpath*=245/1
Tagpath*=245/2
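Why adjacency stops at field boundaries can be shown with a small Python sketch. It records a (field number, word position) pair for every word and treats two words as adjacent only when they sit next to each other in the same field. This illustrates the idea only; the data layout actually written by wordfield is not described in these slides, and the two-field sample record is made up.

```python
import re


def word_positions(record):
    """record: list of (tagpath, text) pairs.

    Returns {word: [(field_number, position), ...]}.
    """
    positions = {}
    for field_no, (_tagpath, text) in enumerate(record):
        for pos, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            positions.setdefault(word, []).append((field_no, pos))
    return positions


def adjacent(positions, first, second):
    """True if `second` directly follows `first` inside the same field."""
    following = set(positions.get(second, []))
    return any((f, p + 1) in following for f, p in positions.get(first, []))


# Made-up record: a title split over two indexed fields.
sample = [("245/1", "Tennis Made"), ("245/2", "Easy")]
pos = word_positions(sample)

if __name__ == "__main__":
    print(adjacent(pos, "tennis", "made"))  # same field, consecutive words
    print(adjacent(pos, "made", "easy"))    # words in different fields
```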
Database Description File: Global Stopwords
• List of terms NOT indexed
• Syntax:
[Stopwords]
index=0
routine=ORG.oclc.pears.IndexRoutines.StopwordEnforcer
tagpath=none
stopword*=<word>
Database Description File: Global Stopwords
• Example:
[Stopwords]
index=0
routine=ORG.oclc.pears.IndexRoutines.StopwordEnforcer
tagpath=none
stopword*=and
stopword*=the
Database Description File: Index Specific Stopwords
• Syntax:
[<index name>]
Index=<index number>
Routine=<index routine>
Tagpath*=<path to field>
Stopword*=<word>
Database Description File: Index Definition with Stopwords
Example:
[Title Words]
Index=2
Routine=ORG.oclc.pears.IndexRoutines.Words
OccurrenceRoutine=ORG.oclc.pears.Bartlett.wordfield
Tagpath*=245/1
Tagpath*=245/2
Stopword*=and
Stopword*=the
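The effect of stopwords on keyword extraction can be sketched in Python as follows. The function is a stand-in for a Words-style routine, not the real ORG.oclc.pears.IndexRoutines.Words. It assumes that index-specific stopwords apply in addition to the global ones, and the sample titles are only test input.

```python
import re

# Global stopwords, as in the [Stopwords] example.
GLOBAL_STOPWORDS = {"and", "the"}


def extract_words(text, index_stopwords=frozenset()):
    """Split text into lowercase keywords and drop stopwords.

    Assumption: index-specific stopwords are applied on top of the
    global list rather than replacing it.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    skip = GLOBAL_STOPWORDS | set(index_stopwords)
    return [w for w in words if w not in skip]


if __name__ == "__main__":
    print(extract_words("The Old Man and the Sea"))
    print(extract_words("Tennis Made Easy", index_stopwords={"made"}))
```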
Database Description File Exercise 2: Identifying Database Description Indexes
• View the database description file (dbs/scifi/scifidesc.ini) that has been created for your student account. Identify which indexes will be created from this file.
Designing and Building Databases: Topics
A Pears Database Building - Introduction
B Database Description File
C Building Databases
D Configuring and Testing
E Database Utilities and Maintenance
F Advanced Database Description Concepts
Building A Database: Program Steps
1.) Convert Input Data
2.) Store Records and Extract Index Terms
3.) Sort Extracted Terms
4.) Update Index and Postings
Building a Pears Database: Program Steps - Illustrated
[Figure: Bartlett takes the database description (desc.ini) and the input data and produces the database (.pdb file).]
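The four program steps can be walked through with a toy Python build. It models only the order of operations (convert, store and extract, sort, update index and postings) on in-memory dictionaries. The build() function, the field names and the two sample records are made up and have nothing to do with Bartlett's real implementation or file format.

```python
import re


def build(raw_records, index_rules):
    """raw_records: list of dicts; index_rules: {index_id: [field names]}.

    Returns (records, index), where index maps (term, index_id) to a
    sorted list of record numbers.
    """
    records, extracted = {}, []
    for number, raw in enumerate(raw_records, start=1):
        # 1. Convert input data (here: just trim whitespace).
        record = {name: value.strip() for name, value in raw.items()}
        # 2. Store the record and extract index terms.
        records[number] = record
        for index_id, fields in index_rules.items():
            for name in fields:
                for word in re.findall(r"[a-z0-9]+", record.get(name, "").lower()):
                    extracted.append((word, index_id, number))
    # 3. Sort extracted terms.
    extracted.sort()
    # 4. Update index and postings.
    index = {}
    for word, index_id, number in extracted:
        plist = index.setdefault((word, index_id), [])
        if not plist or plist[-1] != number:
            plist.append(number)
    return records, index


SAMPLE = [
    {"au": "Abercrombie & Anderson", "ti": "Tennis Made Easy"},
    {"au": "Anderson", "ti": "Easy Tennis"},
]
RULES = {1: ["ti"], 3: ["au"]}

if __name__ == "__main__":
    _, idx = build(SAMPLE, RULES)
    for key, plist in idx.items():
        print(key, plist)
```

Sorting before the update step means every posting for a term arrives together, so each postings list can be written in one pass.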
Building A Database: Bartlett
usage: Bartlett <dbname> -i<InputFileName> -d<dbdesc.ini>
[-n<numrecs>] [-s<skipnum>] [-t<numThreads>]
[-w<sorted nip filename>] [-fX]
where the -f flags (which turn things on) are:
-fg: guaranteed that all records are adds
-fn: printing to a file / use newlines
-fu: update the stored database description with a new one
All of the arguments are optional, but somehow you must specify an input file and a database file. If you specify <dbname> then the others default to -i<dbname>.recordType and -d<dbname>desc.ini
Building A Database: Validate a Database
• Use validate to verify the internal correctness of a database
• usage: java validate <dbname> [-count] [-records] [-index] [-data] [-postings] [-regions] [-all]
-count means validate the record count
-records means validate the records and implies -count
-index means validate the index structure
-data means validate the data for each index term and implies -index
-postings means validate the postings list for each term and implies -data
-all means validate everything
Building a Database: Exercise 3
Build and validate the scifi database
– cd dbs/scifi
– type: Bartlett scifi
– type: validate scifi -all
Designing and Building Databases: Topics
A Pears Database Building - Introduction
B Database Description File
C Building Databases
D Configuring and Testing
E Database Utilities and Maintenance
F Advanced Database Description Concepts
Configuring and Testing: Test using testgwen
testgwen is a command-line search engine that demonstrates how to embed searching in your Java applications.
usage: testgwen -p<database.properties>
Configuring and Testing: Test using testgwen
scifi.properties:
database.name=scifi
implementation.class=ORG.oclc.os.pearsgwen.pDatabase
pearsgwen.inifileName=scifi.ini
#CQL Stuff
qualifier.srw.serverChoice= 1=1016
qualifier.dc.title= 1=4
structure.*= 4=6
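The qualifier lines pair a CQL index name with a value such as 1=4. The Python sketch below reads them as attribute type=value pairs, the same notation that @attr uses in Z39.50 query strings. That reading is an interpretation rather than something the slides state, and the two helper functions are hypothetical, not part of Gwen.

```python
def parse_properties(text):
    """Parse simple key=value properties text, skipping # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split at the first '=' only, so a value like '1=4' stays intact.
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def qualifier_attributes(props):
    """Return {qualifier: (attribute_type, attribute_value)}.

    Assumes each qualifier value has the form <type>=<value>.
    """
    result = {}
    prefix = "qualifier."
    for key, value in props.items():
        if key.startswith(prefix):
            attr_type, _, attr_value = value.partition("=")
            result[key[len(prefix):]] = (int(attr_type), int(attr_value))
    return result


SCIFI_PROPERTIES = """\
database.name=scifi
implementation.class=ORG.oclc.os.pearsgwen.pDatabase
pearsgwen.inifileName=scifi.ini
#CQL Stuff
qualifier.srw.serverChoice= 1=1016
qualifier.dc.title= 1=4
structure.*= 4=6
"""

if __name__ == "__main__":
    print(qualifier_attributes(parse_properties(SCIFI_PROPERTIES)))
```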
Configuring and Testing: Test using testgwen
scifi.ini:
[Database]
ZBaseDbType=ORG.oclc.db.DbNewton
class=ORG.oclc.pears.pears
dbName= scifi
LongName = SiteSearch example USMARC database
pdbFile=scifi.pdb
# this allows for more than 1 attribute type BIB1, EXP1, ZDSR
[attributes]
type1=BIB1attributes
Configuring and Testing: Test using testgwen (scifi.ini continued)
[BIB1attributes]
OID=BIB1
default=words
parse_mode = 0
browse_default=0
stopwords= default
operator= 0
index* = titleWords
index* = subjectCategoryCodes
index* = authorWords
index* = titlePhrase
…
Configuring and Testing: Test using testgwen (scifi.ini continued)
[titleWords]
use=4
structure=2
alternateID=1
filter=ORG.oclc.pears.IndexRoutines.Words
[subjectCategoryCodes]
use=20
structure=2
alternateID=2
filter=ORG.oclc.pears.IndexRoutines.Words
Configuring and Testing: Test using testgwen
testgwen commands:
BROWSE
b[rowse] [numberOfTerms] [positionOfSeed] <browseTerm>
numberOfTerms defaults to 10
positionOfSeed defaults to numberOfTerms/2
example: b dc.author=smith
SEARCH
s[earch] <query>
example: s dog
DISPLAY DOCUMENT
d[ocument] [startpoint][-endpoint]
startpoint defaults to 1
endpoint defaults to 1
example: d 1
Configuring and Testing: testgwen testing suggestions
• Test the indexes with the browse command
• Browse the top and bottom of the index; garbage in the records tends to go there
• Browse all of your indexes to verify the indexing rules
• Test the postings lists with searches
• Test the records with ‘d’isplay commands
e.g. d 1 to view the first record from the latest search
Configuring and Testing: Exercise 4
Test your scifi database using testgwen
• testgwen -pscifi.properties
• b dog
• b dc.author=smith
• s dc.title="ninja turtles"
• d
• q
Configuring and Testing: Expose your database using JZKit’s ZServer
JZKit is an open source Z39.50 server and client package: http://www.k-int.com/products/jzkit/index.php
We have embedded Gwen inside the JZKit server through database interfaces provided in JZKit. This allows the JZKit server to search Pears databases.
Configuring and Testing: Expose your database using JZKit’s ZServer
Usage: ZServer <ZServer.PropertiesFile>
ZServer.props:
port=2105
evaluator=ORG.oclc.os.jzkit.GwenSearchable
Gwen.configuration=gwen.properties
#
# Record conversion configuration
#
XSLConverterConfiguratorClassName= com.k_int.IR.Syntaxes.Conversion.XMLConfigurator
ConvertorConfigFile=./SchemaMappings.xml
Configuring and Testing: Expose your database using JZKit’s ZServer
gwen.properties:
gwen.db1=scifi.properties
scifi.properties:
The same as for testgwen!
Configuring and Testing: Expose your database using JZKit’s ZServer
Converting your database records to Z39.50 records:
SchemaMappings.xml:
<SchemaMappings>
<templatesource directory="./mappings"/>
<mapping from="OCLCRecord" to="sutrs" sheet="naiveMarcBerToSutrs.xsl"/>
<mapping from="OCLCRecord" to="meta" sheet="naiveMarcBerToMeta.xsl"/>
<mapping from="meta" to="usmarc" sheet="meta_to_usmarc.xsl"/>
</SchemaMappings>
Configuring and Testing: Search your database using JZKit’s ZClient
usage: ZClient
Commands:
open hostname[:portnum] - Connect to z server on host[:port]
show n[+i] - show i records starting at n
find [rpn-string] - Process the supplied rpn query
base db1 [db2.....] - Search the specified databases
format [ xml|sutrs|grs..] - Ask the server for the specified kind of records
scan [rpn-string]
Configuring and Testing: Search your database using JZKit’s ZClient
usage: ZClient
rpn strings are composed as follows:
rpn-string = @attrset default-attrset expr
expr = [ attr-plus-term | boolean ]
attr-plus-term = attrdef [ attrdef...] { single-term | "quoted string" }
attrdef = @attr [attrset] attrtype=attrval
boolean = { @and | @or | @not } expr expr
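The grammar above is easy to apply with a few helper functions. The Python sketch below builds rpn strings for the find command. The function names term(), boolean() and rpn() are invented for this example and are not part of ZClient. Unlike the grammar, term() also accepts an empty attribute list.

```python
def term(value, attrs=()):
    """attr-plus-term: @attr definitions followed by the search term.

    attrs is a sequence of (attrtype, attrval) pairs. Terms containing
    a space are emitted as a "quoted string".
    """
    parts = [f"@attr {attr_type}={attr_val}" for attr_type, attr_val in attrs]
    parts.append(f'"{value}"' if " " in value else value)
    return " ".join(parts)


def boolean(op, left, right):
    """boolean: { @and | @or | @not } expr expr (operator comes first)."""
    if op not in ("and", "or", "not"):
        raise ValueError(f"unknown operator: {op}")
    return f"@{op} {left} {right}"


def rpn(expr, attrset="bib-1"):
    """rpn-string: @attrset default-attrset expr."""
    return f"@attrset {attrset} {expr}"


if __name__ == "__main__":
    print(rpn(term("dog", [(1, 1016), (4, 2)])))
    print(rpn(boolean("and",
                      term("dog", [(1, 4)]),
                      term("ninja turtles", [(1, 4)]))))
```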
Configuring and Testing: Exercise 5
• Start ZServer
– ZServer ZServer.props &
• Test the database files with ZClient
– ZClient
– open localhost:2105
– base scifi
– find @attrset bib-1 @attr 1=1016 @attr 4=2 dog
– quit
Designing and Building Databases: Topics
A Pears Database Building - Introduction
B Database Description File
C Building Databases
D Configuring and Testing
E Database Utilities and Maintenance
F Advanced Database Description Concepts
Database Utilities and Maintenance: General Database Information Report
IndexLoop:
• usage: java IndexLoop <dbname> [-b<num>] [-d<num>] [-i<index>] [-n<num>] [-t<num>] [-f]
-b the number of terms from the bottom of the index to be returned (default is 0)
-d the number of terms distributed through the index to be returned (default is 0)
-n the number of the most highly posted terms to be returned (default is 100)
-t the number of terms from the top of the index to be returned (default is 0)
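What the -t, -b and -n options select can be illustrated on a toy index in Python. The sketch mirrors only the option descriptions above (the -d option is left out). The index_report() function and the "zebra" entry are made up, and the real IndexLoop output format is not shown in these slides.

```python
def index_report(index, top=0, bottom=0, most_posted=100):
    """index: {term: postings count}.

    Returns the terms from the top and bottom of the sorted index and
    the most highly posted terms. Defaults follow the option
    descriptions for -t, -b and -n.
    """
    terms = sorted(index)
    by_postings = sorted(index, key=lambda t: (-index[t], t))
    return {
        "top": terms[:top],
        "bottom": terms[-bottom:] if bottom else [],
        "most_posted": by_postings[:most_posted],
    }


# Toy index: two terms from the earlier slides plus one invented term.
TOY_INDEX = {"abercrombie": 2, "anderson": 102, "zebra": 1}

if __name__ == "__main__":
    print(index_report(TOY_INDEX, top=1, bottom=1, most_posted=2))
```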
Database Utilities and Maintenance Exercise 6: Using the Database Utilities
• Run IndexLoop against the scifi database
– IndexLoop scifi
Designing and Building Databases: Topics
A Pears Database Building - Introduction
B Database Description File
C Building Databases
D Configuring and Testing
E Database Utilities and Maintenance
F Advanced Database Description Concepts
Advanced Database Concepts: Topics
• Restrictors
• Replacing and Deleting Records
Advanced Database Concepts: Record Restrictions
• Used to additionally qualify indexes.
• Speeds up Boolean searching.
• Can only be used in combination with another search term.
• One database can have multiple restrictors defined.
• Can be linked with a searchable index.
– by shared ID
Advanced Database Concepts: Record Restrictions
• Practical with data that has a defined range.
– categories like publication type
– range like publication date
– language
• Binary value
– set on a per-record basis.
– stored in the postings entry for each extracted term.
Advanced Database Concepts: Defining Record Restrictions
• Syntax:
[docrule<n>]
index=<index number>
routine=ORG.oclc.pears.Bartlett.termrest
parameters=<terms to use as restrictors>
• Example:
[docrule1]
index=24
routine=ORG.oclc.pears.Bartlett.termrest
parameters=english german french
Advanced Database Concepts: Defining Record Restrictions
• Link to an index by using the same ID.
• routine - rule used for setting the restriction.
• parameters - specific to restriction routine.
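One way to picture a restrictor is as a small bit field carried in every postings entry. The Python sketch below assigns one bit per restrictor term from the parameters line and filters a postings list with it. The bit layout is an assumption made for illustration. The slides say only that the restrictor is a binary value set per record and stored in the postings entry, and termrest may encode it differently.

```python
# One bit per restrictor term, in the order given on the parameters line.
RESTRICTOR_TERMS = "english german french".split()
BIT = {name: 1 << i for i, name in enumerate(RESTRICTOR_TERMS)}


def restrictor_bits(record_terms):
    """Combine the bits for every restrictor term found in a record."""
    bits = 0
    for name in record_terms:
        bits |= BIT.get(name, 0)
    return bits


def restrict(postings, required):
    """postings: list of (record_number, bits) pairs.

    Keeps only the records whose restrictor bits include `required`.
    """
    mask = BIT[required]
    return [rec for rec, bits in postings if bits & mask]


# Hypothetical postings list for one search term.
DOG_POSTINGS = [
    (1, restrictor_bits(["english"])),
    (2, restrictor_bits(["german"])),
    (3, restrictor_bits(["english", "french"])),
]

if __name__ == "__main__":
    print(restrict(DOG_POSTINGS, "english"))
```

Because the bits travel with each posting, the restriction is applied while the postings list for the search term is read, which fits the rule that a restrictor is always combined with another search term.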
Advanced Database Concepts: Replace and Delete Records
• Unique record key is in index <RecordIDIndex>.
• If a record is added that has the same unique record key as a previous record, then the new record replaces the existing record.
• HandleUSMARC uses record status values from the MARC fixed fields to delete records.
Advanced Class Topics
• A class on Advanced Database Building will cover:
– Building databases with SGML data.
– Advanced restrictor concepts.
– Debugging of data errors.
– and more exciting topics too numerous to mention.
Pears: Designing and Building Databases
...and that’s how you test your new database.
What questions do you have?