
Selective Archiving of Croatian Web Resources

A Study of Processing Costs at the National and University Library of Croatia

Tanja Buzina*, Karolina Holub*, Miroslav Milinović**, Nebojša Topolšćak**, Mirna Willer*, Jasenka Zajec*

* National and University Library, Croatia, digarhiv@nsk.hr

** University of Zagreb Computing Centre, Croatia, damp@srce.hr

LIDA 2007

Contents

• Background

• Research aim and tasks involved

• Task/Staff members involved

• Assessment analysis

• Conclusions

• Acknowledgements

Background: Archiving selected Croatian web resources

The National and University Library of Croatia archives web resources selectively.

The National and University Library and the University of Zagreb Computing Centre (Srce) developed the Digital Archive of Web Resources as part of a co-operative project (2003-): Design of the System for Harvesting and Archiving Legal Deposit of Croatian Web Publications.

The Digital Archive is fully integrated with the library information system and has been running as a service since January 2004.

The development of the Digital Archive, its integration with the catalogue, the purchase of the appropriate computer system, and the training of the staff involved were all funded from the running budget.

Background: Digital Archive of Web Resources

After two years of experimenting within the Project team (2003-2005), the Library selected three members of staff to work full time in a newly created Unit for Processing Web Resources (UPWR) on the broadly defined jobs of identifying, selecting, cataloguing and archiving web resources.

The UPWR is supported by the project manager and several staff members from different Library departments working part time, and by the development and technical staff and the project manager from Srce.

The UPWR’s everyday tasks run in parallel with the third phase of the Project, which is due to finish in October 2007.

Background: The Digital Archive: statistics

• May 15, 2007
– 1,640 items
– 15,390 instances, growing at the rate of approximately 400 items per year
– total size of archive: 1 TB
– 147 web resources had disappeared from the live web
– 20 items: access restricted for commercial reasons

• May 15, 2006
– 1,364 items
– 6,041 instances
– total size of archive: 277 GB
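The two snapshots above can be compared with a few lines of arithmetic. The following is a minimal sketch in Python that uses only the item and instance counts reported on the slide. The names `Snapshot` and `instances_per_item` are illustrative and not part of the Digital Archive. The "approximately 400 items per year" figure is the authors' own estimate and is not re-derived here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Archive statistics on a given date, as reported on the slide."""
    label: str
    items: int
    instances: int


may_2006 = Snapshot("May 15, 2006", items=1_364, instances=6_041)
may_2007 = Snapshot("May 15, 2007", items=1_640, instances=15_390)


def instances_per_item(snapshot: Snapshot) -> float:
    """Average number of harvested instances held per archived item."""
    return snapshot.instances / snapshot.items


# Difference between the two reported snapshots.
new_items = may_2007.items - may_2006.items
new_instances = may_2007.instances - may_2006.instances

for snap in (may_2006, may_2007):
    print(f"{snap.label}: {instances_per_item(snap):.2f} instances per item")
print(f"Items added between snapshots: {new_items}")
print(f"Instances added between snapshots: {new_instances}")
```

The number of instances grows much faster than the number of items because each archived item is harvested repeatedly.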

Research aim and tasks involved

The aim of this research was to assess the costs for processing web resources.

Two analyses were done:

(1) time and type of task per item archived;

(2) other costs related to maintenance and development of the service.

The period of the assessment was two months during which the staff involved minutely monitored their tasks.

Tasks involved

1 Identification
2 Selection
3 Formal and Subject Cataloguing
4 Archiving
5 Updating the catalogue
6 Communication with publishers
7 Updating publisher’s register
8 Training the library staff and publishers
9 Promoting the Digital Archive
10 Communication within the Library and with other institutions & projects
11 Design of the System for Harvesting and Archiving Legal Deposit of Croatian Web Publications: the third phase of the project
12 Tasks performed by the University of Zagreb Computing Centre (Srce)

Cataloguing and archiving: workflow

[Workflow diagram. Boxes: Web resources; 1 Identification; 2 Selection; 3 Cataloguing; 4 Archiving; 4.2 Updating data in the archive; 6, 7, 10 Administrative tasks & co-ordination; Publishers.]

Total number of staff members (16) working full (3) / part (13) time on web resources, per organisational unit (10) and per type of task (12)

Name of unit | Number of staff | Type of task | Involvement
Unit for Processing Web Resources | 3 | 1-11 (except 5) | full time
CIP Unit | 1 | 6 | part time
ISSN Centre for Croatia | 2 | 1, 2, 3, 6, 7, 10 | part time
Authority Control Department | 1 | 3 | part time
Subject and Classification Department | 1 | 3 | part time
Music Collection | 1 | 1, 2, 3, 6, 10 | part time
Croatian Institute for Librarianship (project co-ordination) | 1 | 9, 10, 11 | part time
Serials Cataloguing Department (project member; ISSN and serials cataloguing co-ordination) | 1 | 9, 10, 11 | part time
Information Technology Department (project member and lead programmer) | 1 | 5 | part time
Croatian Institute for Librarianship (project member) | 1 | 11 | part time
University of Zagreb Computing Centre (Srce) | 2 | 4, 6, 8, 9, 11, 12 | part time
University of Zagreb Computing Centre (Srce) (project co-ordination) | 1 | 8, 9, 10, 11, 12 | part time

Assessment analysis: results

The assessment period was two months:
• March 15 to May 15, 2007
• 42 working days or 315 hours (7.5 hours/day or 450 minutes/day)

Items fully processed and cost analysed:
• 385 items were processed

Items dealt with but not fully processed:
• About 100 items were identified and evaluated for inclusion in the Digital Archive, but did not fulfil the selection criteria
• 14 items were password protected but the publishers/authors did not give permission for full text access during the assessment period
– these were excluded from cost analysis per item
– time given in (2) other activities

Distribution of tasks (1 Identification, 2 Selection, 3 Cataloguing, 4 Archiving) per item processed, in minutes

Data analysis: Distribution of type of resource in the sample

[Pie chart. Legend: serial (1), integrating resource (2), monographic resource (3). Values in the order extracted: 28.57%, 54.81%, 16.62%.]

Distribution of type of format: web pages (text, image, sound, video); doc/pdf (text); other

[Pie chart. Legend: doc/pdf, web pages, unknown. Values in the order extracted: 12.99%, 78.70%, 8.31%.]

Frequency of harvesting

manual = not automated harvesting

daily etc. = automated harvesting

[Pie chart. Legend: manual, daily, day in a week, day in a month, month in a year, unknown. Values in the order extracted: 59.74%, 2.60%, 1.82%, 10.65%, 19.74%, 5.45%.]

Number of harvesting parameters used

[Pie chart. Legend: 1, 2, 3, 4, >4, unknown. Values in the order extracted: 68.31%, 5.19%, 0.78%, 5.45%, 3.12%, 17.14%.]

Types of harvesting parameters per item: examples for eight randomly chosen items to be harvested

recursion_depth, unwanted_path_pattern, always_get_embeded_resources

recursion_depth

recursion_depth, unwanted_path_pattern, alternative_host, always_get_embeded_resources, remove_url_param

recursion_depth

recursion_depth

recursion_depth, alternative_host

recursion_depth, synonym, alternative_host

recursion_depth, unwanted_path_pattern, always_get_embeded_resources, remove_url_param
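The eight examples above can be tallied to show how the parameters combine in practice. The sketch below, in Python, copies the parameter names verbatim from the slide. The list layout, the `bucket` helper and the variable names are illustrative assumptions. The slide does not show parameter values or how the harvesting system stores them, so none are invented here.

```python
from collections import Counter

# Parameter lists for the eight example items, copied from the slide.
EXAMPLE_ITEMS = [
    ["recursion_depth", "unwanted_path_pattern", "always_get_embeded_resources"],
    ["recursion_depth"],
    ["recursion_depth", "unwanted_path_pattern", "alternative_host",
     "always_get_embeded_resources", "remove_url_param"],
    ["recursion_depth"],
    ["recursion_depth"],
    ["recursion_depth", "alternative_host"],
    ["recursion_depth", "synonym", "alternative_host"],
    ["recursion_depth", "unwanted_path_pattern",
     "always_get_embeded_resources", "remove_url_param"],
]


def bucket(count: int) -> str:
    """Group parameter counts the way the preceding chart does: 1, 2, 3, 4, >4."""
    return str(count) if count <= 4 else ">4"


# How many of the example items use 1, 2, 3, 4 or more than 4 parameters.
params_per_item = Counter(bucket(len(params)) for params in EXAMPLE_ITEMS)

# How often each parameter appears across the example items.
param_usage = Counter(name for params in EXAMPLE_ITEMS for name in params)

print(dict(params_per_item))
print(param_usage.most_common())
```

In this small sample `recursion_depth` is set for every item, and three of the eight items use it alone.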

(1) Items fully processed: percentage of time per task

58% archiving; 33% cataloguing; 7% selection; 2% identification

[Pie chart; legend (translated from Croatian): average for task 4, task 3, task 2, task 1.]

(1) Items fully processed: total sum of time per task

126 h archiving; 72 h cataloguing; 14.81 h selection; 4.4 h identification

[Bar chart, total minutes per task: task 1, 264; task 2, 889; task 3, 4,334; task 4, 7,572.]
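The hours and percentages quoted for the four tasks follow from the minute totals in the chart. The sketch below, in Python, reproduces that arithmetic. The task-to-minutes mapping is taken from the chart labels (task 1 to task 4), and the function and variable names are illustrative.

```python
# Total minutes recorded per task over the assessment period (from the chart).
TASK_MINUTES = {
    "identification": 264,   # task 1
    "selection": 889,        # task 2
    "cataloguing": 4334,     # task 3
    "archiving": 7572,       # task 4
}

total_minutes = sum(TASK_MINUTES.values())


def hours(minutes: int) -> float:
    """Convert minutes to hours."""
    return minutes / 60


def share(minutes: int) -> float:
    """Percentage of the total time spent on tasks 1 to 4."""
    return 100 * minutes / total_minutes


for task, minutes in TASK_MINUTES.items():
    print(f"{task:15s} {hours(minutes):7.2f} h  {share(minutes):5.1f} %")
print(f"{'total':15s} {hours(total_minutes):7.2f} h")
```

Rounded to whole numbers, the shares match the 58%, 33%, 7% and 2% shown for archiving, cataloguing, selection and identification.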

(1) Items fully processed: total sum of time per task: 126 h archiving

4 Archiving

4.1 New items: archiving process
• checking the item on its live address on the Web;
• defining the harvesting parameters; registering the item in the harvesting queue;
• checking the quality of the first harvesting:
– repeating, when necessary, the harvesting with changed parameters,
– deleting unsuccessful or poor instances of harvesting,
– checking the archived item and its display in the catalogue and the Digital Archive’s web interface;
• defining the frequency of harvesting;

4.2 Existing items: quality control of archived instances
• checking the availability of the item on its live Web address according to the monthly automatic report;
• changing harvesting parameters if a change in properties/structure has taken place:
– deactivating harvesting parameters if the web resource has disappeared from the live Web,
– control of the multiple harvesting instances,
– deleting unsuccessful harvesting;
• checking automatic daily reports on possible duplicates, and deleting them;
• changing frequency of harvesting parameters;
4.2.5 reporting on harvesting problems.
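The slides do not say how the daily report identifies possible duplicates. As a purely hypothetical sketch, in Python, one common approach is to compare a checksum of each harvested instance with earlier instances of the same item. The `Instance` record layout and the `possible_duplicates` function are assumptions made for illustration and do not describe the Digital Archive's actual implementation.

```python
import hashlib
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Instance(NamedTuple):
    """One harvested copy of an item (hypothetical record layout)."""
    item_id: str
    harvested_on: str  # ISO date, e.g. "2007-05-15"
    content: bytes


def possible_duplicates(instances: Iterable[Instance]) -> List[Instance]:
    """Return instances whose content equals an earlier harvest of the same item."""
    seen: Dict[Tuple[str, str], Instance] = {}
    duplicates: List[Instance] = []
    # Process harvests of each item in date order so the earliest copy is kept.
    for inst in sorted(instances, key=lambda i: (i.item_id, i.harvested_on)):
        key = (inst.item_id, hashlib.sha256(inst.content).hexdigest())
        if key in seen:
            duplicates.append(inst)
        else:
            seen[key] = inst
    return duplicates


report = possible_duplicates([
    Instance("item-1", "2007-05-13", b"<html>issue 12</html>"),
    Instance("item-1", "2007-05-14", b"<html>issue 12</html>"),
    Instance("item-1", "2007-05-15", b"<html>issue 13</html>"),
    Instance("item-2", "2007-05-14", b"<html>issue 12</html>"),
])
for dup in report:
    print(f"possible duplicate: {dup.item_id} harvested on {dup.harvested_on}")
```

A report of this kind only flags candidates. In the workflow described above, a staff member still checks the report and deletes the duplicates.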

(1) Items fully processed: percentage of archiving time per sub-task

52% archiving new items (4.1); 45% archiving existing items (4.2); 3% reporting on harvesting problems (4.2.5)

[Pie chart; legend (translated from Croatian): average for task 4.2.5, task 4.2, task 4.1.]

(1) Items fully processed: total sum of time per archiving sub-task

66.18 h archiving new items (4.1); 56.28 h archiving existing items (4.2); 3.73 h reporting on harvesting problems (4.2.5)

[Bar chart, total minutes per sub-task: task 4.1, 3,971; task 4.2, 3,377; task 4.2.5, 224.]

(1) Items fully processed: average time per task per type of resource processed

[Bar chart: average time in minutes (axis from 0 to 40) for task 1, task 2, task 3, task 4 and tasks 1-4 combined, shown for three groups: all items, PDF only, HTML/web. The bar values are not recoverable from the transcript.]

(2) Assessment of costs of other activities

Project 81.9 h; communication within library & others 33.7 h; training 26.8 h; communication with publishers 17.4 h; identification & evaluation 8.5 h

• Project: 46%
• Communication within library: 19%
• Training library staff: 15%
• Communication with publishers: 10%
• Identification and evaluation: 5%
• Promoting Digital Archive: 4%
• Updating publisher’s registers: 1%

Conclusions: (1) comparison

National Library of Australia:
• the costs of processing web resources (acquisitions to archiving) at the National Library of Australia [1]

[1] Phillips, Margaret E. Selective Archiving of Web Resources: A Study of Acquisition Costs at the National Library of Australia. // RLG (vol.9, no.13, 2005). Available at: http://www.rlg.org/en/page.php?Page_ID=20666#article0

Conclusions: (1) comparison: National Library of Australia vs. National and University Library, Croatia

• Identification and selection: Australia 30 min; Croatia 3 min/item + 8.5 h / 6 staff = ± 80 min/staff member
• Publisher contact, negotiating permission to archive the title, and filing of correspondence: Australia 30 min; Croatia 2.56 min/item + 7.4 h / 8 staff = < 60 min/staff member
• Gathering, quality assurance, and archiving instances: Australia 210 min; Croatia 19.08 min/item
• Cataloguing: Australia 81 min; Croatia 11.26 min/item
• Other activities (correspondence with I&A services, reference inquiries, contribution to the development of PANDAS): Australia 60 min; Croatia (Project) 81.9 h / 8 staff = ± 60 min/staff member
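Several of the Croatian per-item figures above can be traced back to the minute totals reported on the earlier slides, divided by the 385 items fully processed. The sketch below, in Python, shows that derivation. It rests on one inference: the 19.08 min/item for archiving appears to cover sub-tasks 4.1 and 4.2 only, without the 224 minutes of reporting (4.2.5). The slides do not state this explicitly. The 2.56 min/item for publisher contact is not re-derived because its minute total is not given.

```python
ITEMS_PROCESSED = 385

# Total minutes per task, as reported on the earlier slides.
MINUTES = {
    "1 identification": 264,
    "2 selection": 889,
    "3 cataloguing": 4334,
    "4.1 archiving new items": 3971,
    "4.2 archiving existing items": 3377,
    "4.2.5 reporting on harvesting problems": 224,
}


def per_item(*tasks: str) -> float:
    """Average minutes per fully processed item for the given tasks."""
    return sum(MINUTES[task] for task in tasks) / ITEMS_PROCESSED


identification_and_selection = per_item("1 identification", "2 selection")
cataloguing = per_item("3 cataloguing")
# Assumption: the slide's archiving figure leaves out 4.2.5 (reporting).
archiving = per_item("4.1 archiving new items", "4.2 archiving existing items")
all_tasks = per_item(*MINUTES)

print(f"identification and selection: {identification_and_selection:.2f} min/item")
print(f"cataloguing:                  {cataloguing:.2f} min/item")
print(f"archiving (4.1 + 4.2):        {archiving:.2f} min/item")
print(f"tasks 1 to 4 combined:        {all_tasks:.2f} min/item")
```

Under that assumption the results agree with the slide's 3, 11.26 and 19.08 min/item to within rounding, and the combined figure agrees with the 33.92 min/item quoted in the comparison with print publications.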

Conclusions: (2) comparison: National and University Library, Croatia - print publications

• As far as we are aware there are no comparable analyses of processing costs for the different tasks pertaining to print publications, so only the cost of cataloguing can be compared.
– March to May 2007: 10 items/day (monographs or serials)
• Web resources:
– Identification, selection, cataloguing & archiving: 33.92 min/item <3+6 staff> = ± 14 items/day
– Cataloguing <3+4 staff>: 11.26 min/item = ± 28 items/day <!!>

Conclusions: general observations

• The analysis is a snapshot of activities within two months: the results obtained are not absolute, and should be interpreted taking the specific conditions into consideration
• Cataloguing: the results show
– A relatively small number of entries (resources) compared to print publications
– Almost the same time used for original cataloguing and for updating existing records
– Updating due to changes in the characteristics of resources: specific to web resources vs. print publications
– Further analysis is needed to determine the percentage of original cataloguing vs. updating

Conclusions: general observations

• Archiving
– High percentage of time used for archiving
– Further training of the cataloguing staff is needed, or
– Employment of staff with technical skills: knowledge of web technology and techniques
– A staff member with technical skills should be a member of the Unit for Processing Web Resources (UPWR)
• Development
– High percentage of the UPWR’s tasks dedicated to development: research, services and tools (guidelines for cataloguing)
– Percentage of time used by UPWR (36.5 h per 1.5 staff) and ISSN Centre for Croatia (10.5 h per 3 staff) vs. percentage of time of the co-ordinator in the Croatian Institute for Librarianship (22.8 h) shows that UPWR, and to a lesser degree the ISSN Centre, have taken on much of the development as part of their everyday activities.

The Digital Archive of Croatian Web Resources is freely available to anyone, anywhere in the world:

– via the catalogue: http://katalog.nsk.hr/
– via the Digital Archive’s interface: http://www.nsk.hr/digarhiv

Acknowledgements

The authors wish to thank colleagues who took part in this assessment exercise.

They are Hrvoje Brozović (IT Department), Danijela Getliher and Renata Petrušić (ISSN), Sofija Klarin (Croatian Institute for Librarianship, Project member), Tatjana Mihalić (Music Collection), Robert Ravnić (Authority Control Department), Ingeborg Rudomino (UPWR), Tomica Vrbanc (CIP Unit) and Mirjana Vujić (Subject and Classification Department) from the National and University Library.
