Pears, Gwen and JZKit Training

Upload: owen-fitzgerald

Post on 27-Mar-2015

Designing and Building Databases - Topics

A Pears Database Building - Introduction

B Database Description File

C Building Databases

D Configuring and Testing

E Database Utilities and Maintenance

F Advanced Database Description Concepts

Pears Database Building - Introduction

Pears provides tools that allow you to:

• Build databases from structured data such as:

– MARC - that has a defined standard structure.

– XML – that has loose structure but clearly identified fields.

• Determine each index for the database.

• Load the records into a database following your indexing definitions.


Pears Database Building Exercise Preview

• View the structure of a small set of MARC records.

• Build a small database from those records.

• Look at the setup database description file.

• Build the Database.

• Test database for correctness using testgwen.

• Add the database to the JZKit configuration files, making it searchable by a Z39.50 Client.

Pears Database Building - The Gwen Search Engine

• The Gwen search engine is a generalized text retrieval engine.

• Functionality is contained in the Java classes that can be embedded in Java applications including the JZKit Z39.50 Server.

• The JZKit server allows multiple, simultaneous users utilizing a client program supporting the Z39.50 protocol, to browse, search and display records from Pears databases.

Pears Database Building - Logical and Physical Databases

• A Gwen Database is a logical database

– It provides features for searching and retrieving records

• A Pears Database is a physical database

– It provides the information that a Gwen database needs

Pears Database Building - Gwen Database Features

• A Gwen Database has:

– Indexes with numeric IDs

– Index Terms with Postings Lists

– Postings Lists have Record Numbers and Restrictor Data

Pears Database Building - What is a Pears Database?

• A Pears database is a single physical file with three main kinds of data

– Record data

– Index data

– Postings data

Pears Database Building - Record Data

• Contains the actual records of your database.

• Records are stored as BER-encoded records.

• Each record is identified by a unique logical record number.

Pears Database Building - Index Data

• Contains a sorted list of all the Index Terms extracted from your data records.

• Index Terms Contain:

– term/index-id.

– number of records that term appears in (postings count).

– a list of records that contain that term or a pointer to such a list.

Understanding the Database Structure

INDEX

abercrombie: au: postings=2, postings list=r17, r15
anderson: au: postings=102, postings list ID=l21

Pears Database Building - Postings Data

• Contains a list of record IDs for each of the terms in the index.

• Each record ID may have restrictor and proximity information associated with it.

Understanding the Database Structure

INDEX

abercrombie: au: postings=2, postings list=r17, r15
anderson: au: postings=102, postings list ID=l21

POSTINGS

l21: r1024, r1021, r1007, r995, …

Understanding the Database Structure

INDEX

abercrombie: au: postings=2, postings list=r995, r175
anderson: au: postings=102, postings list ID=l21

POSTINGS

l21: r1024, r1021, r1007, r995, …

RECORDS

r995:
au: Abercrombie & Anderson
ti: Tennis Made Easy
yr: 1905
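The index, postings and record layers shown above can be modelled in a few lines of Java. This is only a conceptual sketch: the class name, the "term:index" key format and the in-memory maps are assumptions made for illustration, and none of it reflects how Pears lays the data out in its file.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DatabaseStructureSketch {
    // Index data: sorted terms, keyed here as "term:index" (hypothetical key format).
    private final TreeMap<String, List<Integer>> index = new TreeMap<>();
    // Record data: record number -> record text (real Pears records are BER-encoded).
    private final Map<Integer, String> records = new HashMap<>();

    public void addRecord(int recNo, String text) {
        records.put(recNo, text);
    }

    // Appends a record number to the postings list of a term in one index.
    public void addPosting(String term, String indexName, int recNo) {
        index.computeIfAbsent(term + ":" + indexName, k -> new ArrayList<>()).add(recNo);
    }

    // Postings count: the number of records the term appears in.
    public int postingsCount(String term, String indexName) {
        List<Integer> postings = index.getOrDefault(term + ":" + indexName, Collections.emptyList());
        return postings.size();
    }

    // Follows term -> postings list -> records, as in the three-layer diagram.
    public List<String> search(String term, String indexName) {
        List<Integer> postings = index.getOrDefault(term + ":" + indexName, Collections.emptyList());
        List<String> hits = new ArrayList<>();
        for (int recNo : postings) {
            hits.add(records.get(recNo));
        }
        return hits;
    }

    public static void main(String[] args) {
        DatabaseStructureSketch db = new DatabaseStructureSketch();
        db.addRecord(995, "au: Abercrombie & Anderson / ti: Tennis Made Easy / yr: 1905");
        db.addPosting("abercrombie", "au", 995);
        db.addPosting("anderson", "au", 995);
        System.out.println(db.postingsCount("anderson", "au"));
        System.out.println(db.search("abercrombie", "au"));
    }
}
```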


Pears Database Building Data Conversion

• The Bartlett class is responsible for updating a Pears database.

• Bartlett automatically converts input records to the Pears internal BER format.

• The class of objects that do the conversion are called RecordHandlers.

• RecordHandler is a Java Interface class

– You can write your own RecordHandlers!


Pears Database Building Data Conversion Options

• There are two primary Pears RecordHandlers that convert your data to BER format.

– HandleUSMARC

– HandleSGML

• There are several others:

– HandleBER, HandleDB, HandlePDB, HandleUnimarc, HandleChinaMarc


Pears Database Building Data Conversion

• The RecordHandler class has a main() method that you can use to test RecordHandlers and/or your data.

– Usage:

java ORG.oclc.RecordHandler.RecordHandler -c<class> -i<inputFile> -o<outputFile> …

– Example:

java ORG.oclc.RecordHandler.RecordHandler -cUSMARC -iscifi.usmarc -oscifi.ber -n10

Pears Database Building - Data Conversion

• BER (Basic Encoding Rules) is defined by ISO-8825

• It was created to encode ASN.1 records

• Encodes tree-structured data (equivalent to DOM records)

• Can contain binary data (e.g. .jpeg files) (unlike DOM records!)

BER Record Structure

tag=1
  tag=2
    tag=1: Ralph
    tag=2: LeVan
  tag=3: Ohio
  tag=4: OCLC

tag=1, Class=1, form=1, count=3
  tag=2, Class=2, form=1, count=2
    tag=1, Class=2, form=0, count=5 data=Ralph
    tag=1, Class=2, form=0, count=5 data=LeVan
  tag=3, Class=2, form=0, count=4 data=Ohio
  tag=4, Class=2, form=0, count=4 data=OCLC
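To make the dump format easier to read, here is a small Java sketch that builds the same tree (following the tree diagram) and prints it one element per line. It assumes, judging only from the dump, that count is the number of children for a constructed element (form=1) and the data length for a primitive element (form=0). The class and method names are hypothetical, and the sketch models the tree only; it does not produce BER bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class BerTreeSketch {
    public static class Node {
        final int tag;
        final int berClass;
        final byte[] data;                               // non-null only for primitive nodes (form=0)
        final List<Node> children = new ArrayList<>();   // used by constructed nodes (form=1)

        // Primitive node carrying data.
        public Node(int tag, int berClass, String data) {
            this.tag = tag;
            this.berClass = berClass;
            this.data = data.getBytes(StandardCharsets.UTF_8);
        }

        // Constructed node carrying child nodes.
        public Node(int tag, int berClass, Node... kids) {
            this.tag = tag;
            this.berClass = berClass;
            this.data = null;
            for (Node kid : kids) {
                children.add(kid);
            }
        }
    }

    // Writes one line per node, indented by depth, in the style of the dump above.
    public static void dump(Node n, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        boolean constructed = (n.data == null);
        int count = constructed ? n.children.size() : n.data.length;   // assumed meaning of count
        out.append("tag=").append(n.tag)
           .append(", Class=").append(n.berClass)
           .append(", form=").append(constructed ? 1 : 0)
           .append(", count=").append(count);
        if (!constructed) {
            out.append(" data=").append(new String(n.data, StandardCharsets.UTF_8));
        }
        out.append('\n');
        for (Node child : n.children) {
            dump(child, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        Node name = new Node(2, 2, new Node(1, 2, "Ralph"), new Node(2, 2, "LeVan"));
        Node root = new Node(1, 1, name, new Node(3, 2, "Ohio"), new Node(4, 2, "OCLC"));
        StringBuilder out = new StringBuilder();
        dump(root, 0, out);
        System.out.print(out);
    }
}
```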

Pears Database Building - MARC Data Example

For USMARC data (InputRecordType=USMARC):

000 nmm Ia
001 ocm35003642
003 OCoLC
005 19000000000108.0
008 960628s1995 cau d eng d
040 $aFQM$cFQM
096 $aNTERNET
245 00 $aOphthalmic Anesthesia Society $h[computer file].
256 $aComputer data.
260 $a San Diego, CA : $b Ophthalmic Anesthesia Society, $c1995.
516 $aHtml text and images in GIF and JPeg.
538 $aSystem requirements: Html browser, JPeg compatible browser or image viewer.
538 $aMode of access: Internet. Host: www.iea.com/3dans/OAS/oasDhomepage.html
500 $aTitle from title screen.
521 $aMedical.
520 $aHome page of the Ophthalmic Anesthesia Society with articles, references, e-mail addresses of members, pictures and ophthalmic anesthesia resources.
650 02$aSocieties, Medical.
650 02$aOphthalmology.
650 02$aAnesthesia.
710 02$aOphthalmic Anesthesia Society.
856 07$u http://www.iea.com/3dans/OAS/oasDhomepage.html$2http$zOphthalmic Anesthesia Society home page


HandleUSMARC converts this...

01981cam2200349450000800410000001700240004102200140006503000110007906900200009010000200011010000130013011000480014324501080019126000290029950000570032850000460038550003160043152005340074754600120128165000180129365000400131165000410135165000210139269000440141369000400145790002301497690001801520690003001538690002501568690001601593773002201609^^000000s1993eng^_a0370-2693/93/$06.00^^ ^_a0370-2693^^ ^_aPYLBAJ^^ ^_aA9308-1385K-002^^ ^_aBrandenburg, A.^^ ^_aMa, J.P.^^ ^_aInst. fur Theor. Phys., Heidelberg, Germany^^ ^_aCP odd observables for the top-antitop system produced at proton-antiproton and proton-proton colliders^^ ^_aNetherlands^_c7 Jan. 1993^^ ^_aSOURCE:Physics Letters B, vol.298, no.1-2, p. 211-17^^ ^_aTREATMENT: T; Theoretical or Mathematical^^ ^_aCLASS CODES: A1385K (Inclusive reactions, including total cross sections, (energy > 10 GeV))^_aA1110E (Lagrangian and Hamiltonian approach)^_aA1130E (Charge conjugation, parity, time reversal and other discret symmetries)^_aA1340F (Electromagnetic form factors; electric and magnetic moments; structure functions)^^ ^_aThe authors propose some CP odd observables to test CP invariance in the tt system produced at pp and pp colliders. Using these observables the effects of CP violation from the production and from the decay of the top quarks can be separated well. The application of their observables to pp collisions, where one has no CP invariant initial state, is discussed. To parametrize CP violating interactions their use an effective lagrangian for the tt production and a general form factor approach for the decay of t and t (19 Refs.)^^ ^_aEnglish^^ ^_aCP invariance^^ ^_aform factors (elementary particles)^^ ^_aproton-proton inclusive interactions^^ ^_aquark production^^ ^_aantiproton+proton producing antitop+top^^ ^_aproton+proton producing antitop+top^^ ^_aCP odd observables^^ ^_aCP invariance^^ ^_aCP violating interactions^^ ^_aeffective lagrangian^^ ^_aform factor

...to this

tag=0, Class=1, form=1, count=22
  tag=0, Class=2, form=0, count=8 data=nmm Ia
  tag=245, Class=2, form=1, count=3
    tag=0, Class=2, form=0, count=2 data=00
    tag=1, Class=2, form=0, count=29 data=Ophthalmic Anesthesia Society
    tag=8, Class=2, form=0, count=16 data=[computer file].
  tag=260, Class=2, form=1, count=4
    tag=0, Class=2, form=0, count=2 data=
    tag=1, Class=2, form=0, count=15 data=San Diego, CA :
    tag=2, Class=2, form=0, count=30 data=Ophthalmic Anesthesia Society,
    tag=3, Class=2, form=0, count=5 data=1995.
  tag=650, Class=2, form=1, count=2
    tag=0, Class=2, form=0, count=2 data= 2
    tag=1, Class=2, form=0, count=19 data=Societies, Medical.
  tag=650, Class=2, form=1, count=2
    tag=0, Class=2, form=0, count=2 data= 2
    tag=1, Class=2, form=0, count=14 data=Ophthalmology.
  tag=650, Class=2, form=1, count=2
    tag=0, Class=2, form=0, count=2 data= 2
    tag=1, Class=2, form=0, count=11 data=Anesthesia.
  tag=710, Class=2, form=1, count=2
    tag=0, Class=2, form=0, count=2 data=2
    tag=1, Class=2, form=0, count=30 data=Ophthalmic Anesthesia Society.

Pears Database Building - SGML Data Example

For SGML data (InputRecordType=SGML):

.tags file

Title 1
Local-Subject-Index 2
Abstract 3
Spatial-Domain 4
  Geographic-Coverage 1
  Coverage-Description 2
  Bounding-Coordinates 3
    West-Bounding-Coordinate 1
    East-Bounding-Coordinate 2
    North-Bounding-Coordinate 3
    South-Bounding-Coordinate 4
Time-Period 5
  Time-Period-Textual 1
Name 6
Organization 7

Input record

<Rec>
<Title>BEG - PANHANDLE COLOR INFRARED AERIAL PHOTOGRAPHY</Title>
<Abstract>TNRIS file no. 01010422. File consists of original and duplicate positive transparencies, color-infrared, stereoscopic, 1:80,000, quad centered, aerial photography of the Texas Panhandle, flown in September, 1977 by Mark Hurd. </Abstract>
<Spatial-Domain>
 <Geographic-Coverage>US STATE</Geographic-Coverage>
 <Coverage-Description>TEXAS PANHANDLE</Coverage-Description>
 <Bounding-Coordinates>
  <West-Bounding-Coordinate>-102</West-Bounding-Coordinate>
  <East-Bounding-Coordinate>-98</East-Bounding-Coordinate>
  <North-Bounding-Coordinate>30</North-Bounding-Coordinate>
  <South-Bounding-Coordinate>26</South-Bounding-Coordinate>
 </Bounding-Coordinates>
</Spatial-Domain>
<Time-Period>
 <Time-Period-Textual>1977-1977</Time-Period-Textual>
</Time-Period>
<Name>BUREAU OF ECONOMIC GEOLOGY</Name>
<Organization>BUREAU OF ECONOMIC GEOLOGY</Organization>
</Rec>

Converted SGML

tag=0, Class=1, form=1, count=8
  tag=1, Class=2, form=1, count=1
    tag=1, Class=2, form=0, count=49 data=BEG - PANHANDLE COLOR INFRARED AERIAL PHOTOGRAPHY
  tag=2, Class=2, form=1, count=1
    tag=1, Class=2, form=0, count=35 data=AERIAL PHOTOGRAPHY; INFRARED; TEXAS
  tag=3, Class=2, form=1, count=1
    tag=1, Class=2, form=0, count=229 data=TNRIS file no. 01010422. File consists of original and duplicate positive transparencies, color-infrared, stereoscopic, 1:80,000, quad centered, aerial photography of the Texas Panhandle, flown in September, 1977 by Mark Hurd.
  tag=4, Class=2, form=1, count=3
    tag=1, Class=2, form=1, count=1
      tag=1, Class=2, form=0, count=8 data=US STATE
    tag=2, Class=2, form=1, count=1
      tag=1, Class=2, form=0, count=15 data=TEXAS PANHANDLE
    tag=3, Class=2, form=1, count=4
      tag=1, Class=2, form=1, count=1
        tag=1, Class=2, form=0, count=4 data=-102
      tag=2, Class=2, form=1, count=1
        tag=1, Class=2, form=0, count=3 data=-98
      tag=3, Class=2, form=1, count=1
        tag=1, Class=2, form=0, count=2 data=30
      tag=4, Class=2, form=1, count=1
        tag=1, Class=2, form=0, count=2 data=26
  tag=5, Class=2, form=1, count=1
    tag=1, Class=2, form=1, count=1
      tag=1, Class=2, form=0, count=9 data=1977-1977
  tag=6, Class=2, form=1, count=1
    tag=1, Class=2, form=0, count=26 data=BUREAU OF ECONOMIC GEOLOGY
  tag=7, Class=2, form=1, count=1
    tag=1, Class=2, form=0, count=26 data=BUREAU OF ECONOMIC GEOLOGY


Pears Database Building Viewing a BER record - BufferedBerStream

• BER records are not readable in their encoded form.

• BufferedBerStream is a class that includes main() that dumps BER records in a human readable format.

usage: BufferedBerStream -i<input file> [-n<numrecs>] [-s<skiprecs>]

To see a page at a time:

BufferedBerStream -i<input file> | more

To dump to a file:

BufferedBerStream -i<input file> > filename

Exercise Configuration Information

• The database is in ~/dbs/scifi

• The jar files are in ~/jars

• Aliases are:

alias Bartlett 'java -Xmx800m ORG.oclc.pears.Bartlett.Bartlett'

alias BufferedBerStream 'java ORG.oclc.ber.BufferedBerStream'

alias IndexLoop 'java ORG.oclc.pears.util.IndexLoop'

alias RecordHandler 'java ORG.oclc.RecordHandler.RecordHandler'

alias testgwen 'java ORG.oclc.os.gwen.testgwen'

alias validate 'java ORG.oclc.pears.util.validate'

alias ZClient 'java com.k_int.z3950.client.ZClient'

alias ZServer 'java com.k_int.z3950.server.ZServer'

• The CLASSPATH is:

setenv CLASSPATH .:/home/levan/java:/home/levan/lib/pears.jar:/home/levan/lib/Dbutils.jar:/home/levan/lib/ki-jzkit-z3950.jar:/home/levan/lib/ki-util.jar:/home/levan/lib/log4j.jar:/home/levan/lib/a2jruntime.jar:/home/levan/lib/ki-jzkit-iface.jar:/home/levan/lib/gwen.jar:/home/levan/lib/xerces.jar

• All of this is in ~/.tcshrc. Just say “tcsh” at the command line to get it.

Pears Database Building - Exercise 1: Identifying Data in a BER Record

• Using the BER records generated from the MARC data file dbs/scifi/scifi.usmarc, identify the tags used for the data.

(Hint: run RecordHandler to make the BER records and then BufferedBerStream to look at them)

B Database Description File

Database Description File - Function

• The database description is a text file that you set up to determine:

– Database Indexing

– What Indexes support proximity searching

– What Index contains the unique recordID

• Known as the <filename>desc.ini file

Database Description File - File Example

[DB]
Name=scifi                                      <- Database Name
RecordIDIndex=17                                <- Accession index
InputRecordType=USMARC                          <- Raw Data Type

[Title]                                         <- Index definitions
index=1                                         <- Index ID
routine=ORG.oclc.pears.IndexRoutines.Words      <- Indexing Routine
tagpath*=245/1                                  <- Field to be indexed
tagpath*=245/2

[Author]
index=3
routine=ORG.oclc.pears.IndexRoutines.Words
tagpath*=100/1
tagpath*=100/2
tagpath*=700/1

[Control Number]
index=5
routine=ORG.oclc.pears.IndexRoutines.Words
tagpath=1

Database Description File - General Database Information

• The [DB] section provides the database name, accession index and input record type

• Syntax:

– [DB]

– Name=<database name>

– RecordIDIndex=<index number>

– InputRecordType=<RecordHandler type>

Example:

[DB]
Name=Test
RecordIDIndex=1
InputRecordType=SGML


Database Description File Setting up Index Definitions

• Any number of independent indexes can be defined.

• An index can be made from multiple fields.

– Example: index 1 may include title, author, notes, etc.

• Indexes can share fields.

– Example: index 2 may also include title


• An index section is any section with Index, Routine and Tagpath

• Syntax:

– [<Index Name>]

– Index=<index number>

– Routine=<index routine>

– Tagpath*=<path to field>

– OccurrenceRoutine=<proximity routine>


• index number is any number

• Index routine defines how the term is extracted

- use ORG.oclc.pears.IndexRoutines.Words for basic keywords

- use ORG.oclc.pears.IndexRoutines.Phrase for basic bound phrases

• path to field contains a list of BER tags separated by slashes

• occurrence routine (optional) specifies the routine to add proximity information to the index
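As an illustration of how a slash-separated path of BER tags selects data, the Java sketch below walks a small record tree one tag per path step. In the USMARC dump shown earlier in these slides, subfield $a of field 245 appears as tag=1 under tag=245, so a tagpath of 245/1 would select the title text. The Field class and the resolve method are hypothetical helpers written for this example; they are not part of the Pears API.

```java
import java.util.ArrayList;
import java.util.List;

public class TagpathSketch {
    // Simplified stand-in for a decoded BER element.
    public static class Field {
        public final int tag;
        public final String data;                        // null for elements that only hold children
        public final List<Field> children = new ArrayList<>();

        public Field(int tag, String data) {
            this.tag = tag;
            this.data = data;
        }

        public Field add(Field child) {
            children.add(child);
            return this;
        }
    }

    // Returns the data of every element reached by following the tags in the path.
    public static List<String> resolve(Field root, String tagpath) {
        List<Field> current = new ArrayList<>();
        current.add(root);
        for (String step : tagpath.split("/")) {
            int wanted = Integer.parseInt(step.trim());
            List<Field> next = new ArrayList<>();
            for (Field parent : current) {
                for (Field child : parent.children) {
                    if (child.tag == wanted) {
                        next.add(child);
                    }
                }
            }
            current = next;
        }
        List<String> values = new ArrayList<>();
        for (Field f : current) {
            if (f.data != null) {
                values.add(f.data);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        Field record = new Field(0, null)
            .add(new Field(245, null)
                .add(new Field(1, "Ophthalmic Anesthesia Society"))
                .add(new Field(8, "[computer file].")))
            .add(new Field(650, null).add(new Field(1, "Ophthalmology.")))
            .add(new Field(650, null).add(new Field(1, "Anesthesia.")));
        System.out.println(resolve(record, "245/1"));
        System.out.println(resolve(record, "650/1"));
    }
}
```

Because every matching element is followed at each step, a repeated field such as 650 contributes one value per occurrence.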

Database Description File Index Definition

Example:

[Title Words]
Index=2
Routine=ORG.oclc.pears.IndexRoutines.Words
Tagpath*=245/1
Tagpath*=245/2

Database Description File Term Adjacency (Optional)

• Defines positional information stored with each indexed term.

• Adjacency information is stored at build time for each record. It applies within a field, NOT across field boundaries.

• Set by the OccurrenceRoutine.

• ORG.oclc.pears.Bartlett.wordfield is most commonly used.
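The field-boundary rule can be illustrated with a short Java sketch. It stores a (field number, word position) pair for every word of one record and treats two terms as adjacent only when both fall in the same field at consecutive positions. The class, the lower-casing and the word-splitting rule are assumptions made for this example; the actual format written by ORG.oclc.pears.Bartlett.wordfield is not described in these slides.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AdjacencySketch {
    // term -> list of {fieldNumber, wordPosition} pairs for ONE record
    private final Map<String, List<int[]>> occurrences = new HashMap<>();
    private int fieldCount = 0;

    // Records the position of each word in a field; positions restart at 0 in every field.
    public void addField(String text) {
        int field = fieldCount++;
        int position = 0;
        for (String word : text.toLowerCase().split("[^a-z0-9]+")) {
            if (word.isEmpty()) {
                continue;
            }
            occurrences.computeIfAbsent(word, k -> new ArrayList<>()).add(new int[] {field, position});
            position++;
        }
    }

    // True when 'second' directly follows 'first' inside the same field (terms given in lower case).
    public boolean adjacent(String first, String second) {
        List<int[]> firstHits = occurrences.getOrDefault(first, Collections.emptyList());
        List<int[]> secondHits = occurrences.getOrDefault(second, Collections.emptyList());
        for (int[] a : firstHits) {
            for (int[] b : secondHits) {
                if (a[0] == b[0] && b[1] == a[1] + 1) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        AdjacencySketch record = new AdjacencySketch();
        record.addField("Tennis Made Easy");      // e.g. the data selected by one tagpath
        record.addField("Reader Guide");          // a second field of the same record
        System.out.println(record.adjacent("tennis", "made"));   // same field, consecutive words
        System.out.println(record.adjacent("easy", "reader"));   // the words span a field boundary
    }
}
```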

Database Description File Index Definition with Adjacency

Example:

[Title Words]
Index=2
Routine=ORG.oclc.pears.IndexRoutines.Words
OccurrenceRoutine=ORG.oclc.pears.Bartlett.wordfield
Tagpath*=245/1
Tagpath*=245/2


Database Description File Global Stopwords

• List of terms NOT indexed

• Syntax:

[Stopwords]

index=0

routine=ORG.oclc.pears.IndexRoutines.StopwordEnforcer

tagpath=none

stopword*=<word>


• Example:

[Stopwords]

index=0

routine=ORG.oclc.pears.IndexRoutines.StopwordEnforcer

tagpath=none

stopword*=and

stopword*=the
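The effect of a stopword list on keyword extraction can be sketched in Java as follows. The indexTerms method is a hypothetical helper; lower-casing and splitting on non-alphanumeric characters are assumptions, and the real behaviour of ORG.oclc.pears.IndexRoutines.Words and StopwordEnforcer may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopwordSketch {
    // Splits field text into lower-case words and drops any word found in the stopword set.
    public static List<String> indexTerms(String fieldText, Set<String> stopwords) {
        List<String> terms = new ArrayList<>();
        for (String word : fieldText.toLowerCase().split("[^a-z0-9]+")) {
            if (!word.isEmpty() && !stopwords.contains(word)) {
                terms.add(word);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Same two stopwords as in the example above.
        Set<String> stopwords = new HashSet<>(Arrays.asList("and", "the"));
        System.out.println(indexTerms("The Old Man and the Sea", stopwords));
    }
}
```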


Database Description File Index Specific Stopwords

• Syntax:

[<index name>]

Index=<index number>

Routine=<index routine>

Tagpath*=<path to field>

Stopword*=<word>

Database Description File Index Definition with Stopwords

Example:

[Title Words]
Index=2
Routine=ORG.oclc.pears.IndexRoutines.Words
OccurrenceRoutine=ORG.oclc.pears.Bartlett.wordfield
Tagpath*=245/1
Tagpath*=245/2
Stopword*=and
Stopword*=the

Database Description File - Exercise 2: Identifying Database Description Indexes

• View the database description file (dbs/scifi/scifidesc.ini) that has been created for your student account. Identify what indexes will be created from this file.

C Building A Database


Building A Database Program Steps

1.) Convert Input Data

2.) Store Records and Extract Index Terms

3.) Sort Extracted Terms

4.) Update Index and Postings
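The four steps can be sketched end to end in Java. This is a toy model under stated assumptions: conversion is reduced to trimming the input line (Bartlett actually converts to BER through a RecordHandler), records are numbered from 1, terms are lower-cased words, and the index lives in memory. The class and method names are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class BuildStepsSketch {
    public static TreeMap<String, List<Integer>> build(List<String> rawInput,
                                                       Map<Integer, String> recordStore) {
        List<Map.Entry<String, Integer>> extracted = new ArrayList<>();
        int recNo = 0;
        for (String raw : rawInput) {
            // Step 1: convert the input data (stand-in for the RecordHandler conversion).
            String record = raw.trim();
            // Step 2: store the record and extract (term, record number) pairs.
            recNo++;
            recordStore.put(recNo, record);
            for (String word : record.toLowerCase().split("[^a-z0-9]+")) {
                if (!word.isEmpty()) {
                    extracted.add(new SimpleEntry<>(word, recNo));
                }
            }
        }
        // Step 3: sort the extracted terms (by term, then by record number).
        extracted.sort((a, b) -> {
            int byTerm = a.getKey().compareTo(b.getKey());
            return byTerm != 0 ? byTerm : a.getValue().compareTo(b.getValue());
        });
        // Step 4: update the index and postings, skipping repeats of a term within one record.
        TreeMap<String, List<Integer>> index = new TreeMap<>();
        for (Map.Entry<String, Integer> e : extracted) {
            List<Integer> postings = index.computeIfAbsent(e.getKey(), k -> new ArrayList<>());
            if (postings.isEmpty() || !postings.get(postings.size() - 1).equals(e.getValue())) {
                postings.add(e.getValue());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<Integer, String> store = new TreeMap<>();
        List<String> input = new ArrayList<>();
        input.add("Tennis Made Easy");
        input.add("Easy Tennis Tennis");
        System.out.println(build(input, store));
    }
}
```

Sorting before the update means all entries for one term arrive together, so each postings list is written in a single pass.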

Building a Pears Database - Program Steps, Illustrated

[Diagram: Bartlett takes the Database Description (desc.ini) and the Input Data and produces the Database (.pdb file).]

Building A Database - Bartlett

usage: Bartlett <dbname> -i<InputFileName> -d<dbdesc.ini>

[-n<numrecs>] [-s<skipnum>] [-t<numThreads>]

[-w<sorted nip filename>] [-fX]

where the -f flags (which turn things on) are:

-fg: guaranteed that all records are adds

-fn: printing to a file / use newlines

-fu: update the stored database description with a new one

All of the arguments are optional, but somehow you must specify an input file and a database file. If you specify <dbname> then the others default to -i<dbname>.recordType and -d<dbname>desc.ini

Building A Database - Validate a Database

• Use validate to verify the internal correctness of a database

• usage: java validate <dbname> [-count] [-records] [-index] [-data] [-postings] [-regions] [-all]

-count means validate the record count

-records means validate the records and implies -count

-index means validate the index structure

-data means validate the data for each index term and implies -index

-postings means validate the postings list for each term and implies -data

-all means validate everything

Building a Database - Exercise 3

Build and validate the scifi database

– cd dbs/scifi

– type: Bartlett scifi

– type: validate scifi -all

D Configuring and Testing

Configuring and Testing - Test using testgwen

testgwen is a command-line search engine that demonstrates how to embed searching in your Java applications.

usage: testgwen -p<database.properties>

scifi.properties:

database.name=scifi

implementation.class=ORG.oclc.os.pearsgwen.pDatabase

pearsgwen.inifileName=scifi.ini

#CQL Stuff

qualifier.srw.serverChoice= 1=1016

qualifier.dc.title= 1=4

structure.*= 4=6
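A file in this format can be read with the standard java.util.Properties class, as the sketch below shows. The parseAttribute helper is hypothetical, and reading a value such as 1=4 as a "type=value" attribute pair is an interpretation of the file; how testgwen itself loads and uses these properties is not shown in the slides.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class PropertiesSketch {
    // Loads properties text; lines starting with '#' are treated as comments by Properties.load.
    public static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Splits a value such as "1=4" into {1, 4} (assumed to mean attribute type and value).
    public static int[] parseAttribute(String value) {
        String[] parts = value.trim().split("=");
        return new int[] {Integer.parseInt(parts[0].trim()), Integer.parseInt(parts[1].trim())};
    }

    public static void main(String[] args) {
        String text = "database.name=scifi\n"
            + "#CQL Stuff\n"
            + "qualifier.srw.serverChoice= 1=1016\n"
            + "qualifier.dc.title= 1=4\n"
            + "structure.*= 4=6\n";
        Properties props = load(text);
        int[] title = parseAttribute(props.getProperty("qualifier.dc.title"));
        System.out.println(props.getProperty("database.name"));
        System.out.println("type=" + title[0] + " value=" + title[1]);
    }
}
```

Properties.load ends the key at the first "=" and skips the whitespace after it, so the second "=" stays inside the value.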

scifi.ini:

[Database]

ZBaseDbType=ORG.oclc.db.DbNewton

class=ORG.oclc.pears.pears

dbName= scifi

LongName = SiteSearch example USMARC database

pdbFile=scifi.pdb

# this allows for more than 1 attribute type BIB1, EXP1, ZDSR

[attributes]

type1=BIB1attributes

Configuring and Testing - Test using testgwen (scifi.ini continued)

[BIB1attributes]

OID=BIB1

default=words

parse_mode = 0

browse_default=0

stopwords= default

operator= 0

index* = titleWords

index* = subjectCategoryCodes

index* = authorWords

index* = titlePhrase

Configuring and Testing - Test using testgwen (scifi.ini continued)

[titleWords]

use=4

structure=2

alternateID=1

filter=ORG.oclc.pears.IndexRoutines.Words

[subjectCategoryCodes]

use=20

structure=2

alternateID=2

filter=ORG.oclc.pears.IndexRoutines.Words
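The sections of scifi.ini chain together: [attributes] names an attribute section, that section lists indexes with repeated index* lines, and each index section carries its own use value. The sketch below follows that chain to find the index behind a given Use attribute. It is a minimal reader written for this excerpt, assuming repeated keys should be collected into lists; parse_ini and index_for_use are hypothetical names and this is not how Pears itself loads the file.

```python
# Illustrative walk through the scifi.ini excerpt shown on the slides.
# parse_ini and index_for_use are hypothetical helpers, not Pears code.

SCIFI_INI = """\
[attributes]
type1=BIB1attributes

[BIB1attributes]
OID=BIB1
index* = titleWords
index* = subjectCategoryCodes
index* = authorWords
index* = titlePhrase

[titleWords]
use=4
structure=2

[subjectCategoryCodes]
use=20
structure=2
"""


def parse_ini(text):
    """Return {section: {key: [values]}}, keeping repeated keys such as index*."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = sections.setdefault(line[1:-1], {})
            continue
        if current is None:
            continue  # ignore anything before the first section header
        key, _, value = line.partition("=")
        current.setdefault(key.strip(), []).append(value.strip())
    return sections


def index_for_use(sections, use):
    """Find the index section whose use= value matches the Use attribute."""
    attr_section = sections["attributes"]["type1"][0]
    for name in sections[attr_section].get("index*", []):
        if sections.get(name, {}).get("use") == [str(use)]:
            return name
    return None  # index section not defined in this excerpt


sections = parse_ini(SCIFI_INI)
print(index_for_use(sections, 4))
```

Indexes that are listed but whose sections do not appear in the excerpt (authorWords, titlePhrase) simply resolve to no match here.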

Configuring and Testing - Test using testgwen

testgwen commands:

BROWSE

b[rowse] [numberOfTerms] [positionOfSeed] <browseTerm>

numberOfTerms defaults to 10

positionOfSeed defaults to numberOfTerms/2

example: b dc.author=smith

SEARCH

s[earch] <query>

example: s dog

DISPLAY DOCUMENT

d[ocument] [startpoint][-endpoint]

startpoint defaults to 1

endpoint defaults to 1

example: d 1
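The browse defaults above can be made concrete with a small parser. This is a sketch of the argument rules only, not testgwen's own code: it assumes that any prefix of "browse" is accepted as the command word, that leading all-digit arguments are the two optional numbers, and that numberOfTerms/2 is an integer division. The name parse_browse is hypothetical.

```python
# Sketch of the browse argument defaults described above.
# parse_browse is a hypothetical helper; testgwen's real parsing may differ.

def parse_browse(command):
    """Return (numberOfTerms, positionOfSeed, browseTerm) for a browse command."""
    words = command.split()
    if not words or not "browse".startswith(words[0].lower()):
        raise ValueError("not a browse command")
    args = words[1:]
    numbers = []
    # Take up to two leading numbers, but always leave the last word as the term.
    while len(args) > 1 and len(numbers) < 2 and args[0].isdigit():
        numbers.append(int(args.pop(0)))
    if not args:
        raise ValueError("missing browse term")
    number_of_terms = numbers[0] if numbers else 10
    position_of_seed = numbers[1] if len(numbers) > 1 else number_of_terms // 2
    return number_of_terms, position_of_seed, " ".join(args)


print(parse_browse("b dc.author=smith"))
print(parse_browse("browse 20 1 dog"))
```

With no numbers given, the seed term sits in the middle of a ten-term window, which is why browsing shows terms on both sides of the one you typed.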

Configuring and Testing - testgwen testing suggestions

• Test the indexes with the browse command

• Browse the top and bottom of the index; garbage in the records tends to go there

• Browse all of your indexes to verify that the indexing rules worked as intended

• Test the postings lists with searches

Configuring and Testing - testgwen testing suggestions

• Test the records with the display (d) command

– e.g. d 1 to view the first record from the latest search

Configuring and Testing - Exercise 4

Test your scifi database using testgwen

• testgwen -pscifi.properties

• b dog

• b dc.author=smith

• s dc.title="ninja turtles"

• d

• q

Configuring and Testing - Expose your database using JZKit’s ZServer

JZKit is an open-source Z39.50 server and client package:

– http://www.k-int.com/products/jzkit/index.php

We have embedded Gwen inside the JZKit server through the database interfaces that JZKit provides. This allows the JZKit server to search Pears databases.

Configuring and Testing - Expose your database using JZKit’s ZServer

Usage: ZServer <ZServer.PropertiesFile>

ZServer.props:

port=2105

evaluator=ORG.oclc.os.jzkit.GwenSearchable

Gwen.configuration=gwen.properties

#

# Record conversion configuration

#

XSLConverterConfiguratorClassName= com.k_int.IR.Syntaxes.Conversion.XMLConfigurator

ConvertorConfigFile=./SchemaMappings.xml

Configuring and Testing - Expose your database using JZKit’s ZServer

gwen.properties:

gwen.db1=scifi.properties

scifi.properties:

The same as for testgwen!

Configuring and Testing - Expose your database using JZKit’s ZServer

Converting your database records to Z39.50 records:

SchemaMappings.xml:

<SchemaMappings>

<templatesource directory="./mappings"/>

<mapping from="OCLCRecord" to="sutrs" sheet="naiveMarcBerToSutrs.xsl"/>

<mapping from="OCLCRecord" to="meta" sheet="naiveMarcBerToMeta.xsl"/>

<mapping from="meta" to="usmarc" sheet="meta_to_usmarc.xsl"/>

</SchemaMappings>

Configuring and Testing - Search your database using JZKit’s ZClient

usage: ZClient

Commands:

open hostname[:portnum] - Connect to z server on host[:port]

show n[+i] - show i records starting at n

find [rpn-string] - Process the supplied rpn query

base db1 [db2.....] - Search the specified databases

format [ xml|sutrs|grs..] - Ask the server for the specified kind of records

scan [rpn-string]

Configuring and Testing - Search your database using JZKit’s ZClient

usage: ZClient

RPN strings are composed as follows:

rpn-string = @attrset default-attrset expr

expr = [ attr-plus-term | boolean ]

attr-plus-term = attrdef [ attrdef...] { single-term | "quoted string" }

attrdef = @attr [attrset] attrtype=attrval

boolean = { @and | @or | @not } expr expr
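The grammar above can be exercised with a few string-building helpers. This is a sketch that only assembles query text following those rules; it does not talk to a server, it does not handle terms that themselves contain double quotes, and the names attr_term, boolean and rpn are hypothetical.

```python
# Sketch: build RPN query strings that follow the grammar shown above.
# attr_term, boolean and rpn are hypothetical helper names.

def attr_term(term, attrs):
    """attr-plus-term: zero or more '@attr type=value' followed by the term."""
    parts = [f"@attr {attr_type}={attr_value}" for attr_type, attr_value in attrs]
    # A multi-word term has to be sent as a quoted string.
    parts.append(f'"{term}"' if " " in term else term)
    return " ".join(parts)


def boolean(operator, left, right):
    """boolean: a prefix operator followed by its two operand expressions."""
    if operator not in ("and", "or", "not"):
        raise ValueError(f"unknown operator: {operator}")
    return f"@{operator} {left} {right}"


def rpn(expr, attrset="bib-1"):
    """rpn-string: the default attribute set followed by one expression."""
    return f"@attrset {attrset} {expr}"


any_dog = attr_term("dog", [(1, 1016), (4, 2)])
title_phrase = attr_term("ninja turtles", [(1, 4)])
print(rpn(any_dog))
print(rpn(boolean("and", any_dog, title_phrase)))
```

Because the operator comes first, nested Boolean expressions need no parentheses: each operator simply consumes the next two expressions.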

Configuring and Testing - Exercise 5

• Start ZServer

– ZServer ZServer.props &

• Test the database files with ZClient

– ZClient

– open localhost:2105

– base scifi

– find @attrset bib-1 @attr 1=1016 @attr 4=2 dog

– quit

Designing and Building Databases - Topics

A. Pears Database Building - Introduction

B. Database Description File

C. Building Databases

D. Configuring and Testing

E. Database Utilities and Maintenance

F. Advanced Database Description Concepts

Database Utilities and Maintenance - General Database Information Report

IndexLoop:

• usage: java IndexLoop <dbname> [-b<num>] [-d<num>] [-i<index>] [-n<num>] [-t<num>] [-f]

-b the number of terms from the bottom of the index to be returned (default is 0)

-d the number of terms distributed through the index to be returned (default is 0)

-n the number of the most highly posted terms to be returned (default is 100)

-t the number of terms from the top of the index to be returned (default is 0)
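To make the four selections concrete, the sketch below applies them to a toy index held as a dictionary of term to posting count. It is not the IndexLoop implementation: it assumes "top" and "bottom" mean the first and last terms in sorted order and that "distributed" means an evenly spaced sample. The name index_report and the sample data are hypothetical.

```python
# Toy model of the IndexLoop selections; not the real utility.
# index_report and the sample postings are hypothetical.

def index_report(postings, top=0, bottom=0, distributed=0, most_posted=100):
    """Pick terms from a {term: posting_count} index in four different ways."""
    terms = sorted(postings)
    report = {
        # -t: terms at the start of the sorted index
        "top": terms[:top],
        # -b: terms at the end of the sorted index
        "bottom": terms[len(terms) - bottom:] if bottom else [],
        # -d: an evenly spaced sample through the index
        "distributed": [],
        # -n: terms with the largest posting counts, ties broken alphabetically
        "most_posted": sorted(postings.items(), key=lambda kv: (-kv[1], kv[0]))[:most_posted],
    }
    if distributed > 0 and terms:
        step = max(1, len(terms) // distributed)
        report["distributed"] = terms[::step][:distributed]
    return report


sample = {"dog": 5, "cat": 3, "zebra": 1, "ant": 7, "001x": 1}
report = index_report(sample, top=1, bottom=1, distributed=2, most_posted=2)
for name, value in report.items():
    print(name, value)
```

In this toy data the stray term 001x surfaces at the top of the index, which is the kind of garbage the earlier testing suggestions recommend looking for.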

Database Utilities and Maintenance - Exercise 6: Using the Database Utilities

• Run IndexLoop against the scifi database

– IndexLoop scifi

Designing and Building Databases - Topics

A. Pears Database Building - Introduction

B. Database Description File

C. Building Databases

D. Configuring and Testing

E. Database Utilities and Maintenance

F. Advanced Database Description Concepts

Advanced Database Concepts - Topics

• Restrictors

• Replacing and Deleting Records

Advanced Database Concepts - Record Restrictions

• Used to additionally qualify indexes.

• Speeds up Boolean searching.

• Can only be used in combination with another search term.

• One database can have multiple restrictors defined.

• Can be linked with a searchable index.

– by shared id

Advanced Database Concepts - Record Restrictions

• Practical with data that has a defined range.

– categories like publication type

– range like publication date

– language

• Binary value

– set on a per-record basis.

– stored in the postings entry for each extracted term.

Advanced Database Concepts - Defining Record Restrictions

• Syntax:

[docrule<n>]

index=<index number>

routine=ORG.oclc.pears.Bartlett.termrest

parameters=<terms to use as restrictors>

• Example:

[docrule1]

index=24

routine=ORG.oclc.pears.Bartlett.termrest

parameters=english german french
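One way to picture a restrictor is as a set of per-record bits copied into every postings entry, so a search term can be narrowed without a second index lookup. The sketch below models that idea for the english/german/french example. It is a conceptual model under that assumption, not the Pears storage format or the termrest routine; all names in it are hypothetical.

```python
# Conceptual model of record restrictors as bit flags stored with postings.
# Not the Pears on-disk format; every name here is hypothetical.

RESTRICTOR_TERMS = ["english", "german", "french"]
BIT = {term: 1 << position for position, term in enumerate(RESTRICTOR_TERMS)}


def restrictor_bits(language_terms):
    """Combine the bits for every restrictor term present in one record."""
    bits = 0
    for term in language_terms:
        bits |= BIT.get(term.lower(), 0)
    return bits


def build_postings(records):
    """records: {record_id: (words, language_terms)} -> {word: [(id, bits)]}."""
    postings = {}
    for record_id, (words, language_terms) in records.items():
        bits = restrictor_bits(language_terms)
        for word in set(words):
            # The record's restrictor bits travel with each extracted term.
            postings.setdefault(word, []).append((record_id, bits))
    return postings


def search(postings, word, restrict_to=None):
    """Look up one term, optionally keeping only records with a restrictor bit."""
    hits = postings.get(word, [])
    if restrict_to is None:
        return sorted(record_id for record_id, _ in hits)
    mask = BIT[restrict_to]
    return sorted(record_id for record_id, bits in hits if bits & mask)


records = {
    1: (["dog", "star"], ["english"]),
    2: (["dog"], ["german"]),
    3: (["cat"], ["german"]),
}
postings = build_postings(records)
print(search(postings, "dog"))
print(search(postings, "dog", restrict_to="german"))
```

In this model the restriction is a single bit test per posting, and it only makes sense alongside a search term because the bits live inside that term's postings list.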

Advanced Database Concepts - Defining Record Restrictions

• Link to an index by using the same Id.

• routine - rule used for setting the restriction.

• parameters - specific to restriction routine.

Advanced Database Concepts - Replace and Delete Records

• Unique record key is in index <RecordIDIndex>.

• If a record is added that has the same unique record key as a previous record, then the new record replaces the existing record.

• HandleUSMARC uses record status values from the MARC fixed fields to delete records.
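The replace and delete rules can be modelled with a dictionary keyed by the unique record key. This sketch assumes the delete signal is the MARC record status value "d" (deleted) from the leader; exactly which values HandleUSMARC checks is not stated here, so treat that as an assumption. The name apply_update and the sample keys are hypothetical.

```python
# Model of replace-by-key and delete-by-status; not the Pears loader.
# apply_update and the sample keys are hypothetical.

def apply_update(database, key, status, record):
    """Apply one incoming record to a {unique_key: record} store."""
    if status == "d":
        # Assumed delete signal: MARC record status 'd'.
        database.pop(key, None)
    else:
        # Same key as an existing record: the new record replaces it.
        database[key] = record
    return database


db = {}
apply_update(db, "rec001", "n", "Dune, first load")
apply_update(db, "rec002", "n", "Foundation")
apply_update(db, "rec001", "c", "Dune, corrected")
apply_update(db, "rec002", "d", None)
print(db)
```

Loading the same file twice leaves the store unchanged, since every record simply replaces itself.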

Advanced Class Topics

• A class on Advanced Database Building will cover:

– Building databases with SGML data.

– Advanced restrictor concepts.

– Debugging of data errors.

– and more exciting topics too numerous to mention.

Pears - Designing and Building Databases

...and that’s how you test your new database.

What questions do you have?