Digital Libraries
Models and Content
Goals for tonight
• Finish up from last week
  – The 5S model, more formally
  – Status of the systems available
• Obtaining, describing, indexing content
  – XML
  – Dublin Core
  – Introducing content exchanges (OAI)
Applying the 5S model, informally
Choose a subject area, then answer the questions:
• Stream - What types of data? gif, jpg, avi, docx, pdf, html?
• Structure - How are the elements organized? Is there a hierarchy? Are there multiple structures?
• Spaces - How will we index the items? How will we divide them into related groups?
• Scenarios - What services will we provide? What information do we need to provide those services? What events might happen that we need to plan for?
• Societies - Who is the library intended to serve? Remember to include agents and other processes as well as users.
This is the first deliverable for your first project.
More formally: Definitions
• Definition: A stream is a sequence whose codomain is a nonempty set.
• Definition: A structure is a tuple (G, L, F), where G = (V, E) is a directed graph with vertex set V and edge set E, L is a set of label values, and F is a labeling function F : (V ∪ E) → L.
See http://www.mathsisfun.com/sets/domain-range-codomain.html for a nice description of domain, range, codomain if you need it.
Structure illustration
[Figure: a Collection node linked by three edges, each labeled "includes", to Images, Audio files, and Books.]
A very simple structure. How might it be enhanced? How would an index be included? What substructures might be added?
What are the G, L, F, V, E parts of this example?
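One possible reading of the example, sketched in Python. The edge label "includes" comes from the figure; the vertex labels "container" and "item type" are assumptions added only so that F is defined on every vertex as well as every edge.

```python
# A sketch of the slide's example as a structure (G, L, F).
# Vertex labels below are hypothetical; only "includes" appears in the figure.

V = {"Collection", "Images", "Audio files", "Books"}          # vertex set
E = {("Collection", "Images"),
     ("Collection", "Audio files"),
     ("Collection", "Books")}                                  # directed edges
G = (V, E)                                                     # directed graph
L = {"includes", "container", "item type"}                     # label values


def F(x):
    """Labeling function F : (V ∪ E) → L."""
    if x in E:
        return "includes"
    if x == "Collection":
        return "container"
    if x in V:
        return "item type"
    raise KeyError(f"{x!r} is neither a vertex nor an edge")


# Apply F to every vertex and edge to see the full labeling.
labels = {x: F(x) for x in V | E}
```

Adding an index or a substructure then amounts to adding vertices, edges, and labels, and extending F to cover them.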
Definitions, cont’d
• Definition: A space is a measurable space, measure space, probability space, vector space, topological space, or metric space.
  – A vector space is a representation for the set of elements in a collection. The vector representing each element is a set of characteristics held by that element, both connecting that element to others that are similar and distinguishing it from those that are different.
  – We will do an exercise to illustrate.
Vector space illustration
• Consider a car. What are the characteristics that you associate with a car?
  – If you want to compare one car to another, what characteristics would you choose?
  – If you wanted to distinguish a car from another type of vehicle, what characteristics would you need?
    • Distinguish from a snowmobile
    • Distinguish from a truck
• Make a vector of those characteristics.
• Then, fill in the vector for several specific cars.
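The exercise above could be sketched as follows. The feature names and the 0/1 values are invented for illustration, and cosine similarity is just one common way to compare vectors; the slides do not prescribe a measure.

```python
import math

# Hypothetical characteristics; each vehicle gets 1 if it has the feature.
FEATURES = ["has_wheels", "has_skis", "enclosed_cabin",
            "open_cargo_bed", "seats_four_or_more"]

VEHICLES = {
    "sedan":      [1, 0, 1, 0, 1],
    "coupe":      [1, 0, 1, 0, 0],
    "pickup":     [1, 0, 1, 1, 0],
    "snowmobile": [0, 1, 0, 0, 0],
}


def cosine(a, b):
    """Cosine similarity: 1.0 for identical direction, 0.0 for no overlap."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


sedan = VEHICLES["sedan"]
similarity_to_sedan = {
    name: cosine(sedan, vector) for name, vector in VEHICLES.items()
}
```

With these made-up features, the coupe scores closest to the sedan, the pickup next, and the snowmobile shares no features at all, which is the "connect similar, distinguish different" behavior the definition describes.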
Definitions - 3
• Definition: A scenario is a sequence of related transition events (e_1, e_2, …, e_n) on state set S such that e_k = (s_k, s_{k+1}) for 1 ≤ k ≤ n.
  – More easily visualized, a scenario is a path in a directed graph G = (S, Σ_e), where vertices correspond to states in the state set S, and directed edges are equivalent to events in a set of events Σ_e and correspond to transitions between states.
  – Scenarios must be implemented to make a working system.
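A minimal sketch of the definition in Python, assuming each event is stored as a (from_state, to_state) pair. The state names describe a hypothetical search service and are not taken from the slides.

```python
def is_scenario(events, states):
    """Check that events form a path: e_k = (s_k, s_{k+1}) on the state set."""
    if not events:
        return False
    # Every event must connect two known states.
    for source, target in events:
        if source not in states or target not in states:
            return False
    # Consecutive events must chain: the target of e_k is the source of e_{k+1}.
    return all(events[k][1] == events[k + 1][0]
               for k in range(len(events) - 1))


# Hypothetical states for a simple search service.
STATES = {"idle", "query_entered", "results_shown", "item_delivered"}

search = [("idle", "query_entered"),
          ("query_entered", "results_shown"),
          ("results_shown", "item_delivered")]

broken = [("idle", "query_entered"),
          ("results_shown", "item_delivered")]   # skips a transition
```

Here `is_scenario(search, STATES)` holds, while `broken` fails because its two events do not share a state.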
Definitions - 4
• Definition: A society is a tuple (C, R) where
  – C = {c_1, c_2, …, c_n} is a set of conceptual communities, each community referring to a set of individuals of the same class or type (e.g. actors, activities, components, hardware, software, data);
  – R = {r_1, r_2, …, r_m} is a set of relationships, each relationship being a tuple r_j = (e_j, i_j), where e_j is a Cartesian product c_{k_1} × c_{k_2} × … × c_{k_{n_j}}, 1 ≤ k_1 < k_2 < … < k_{n_j} ≤ n, which specifies the communities involved in the relationship, and i_j is an activity.
Projects in our DL laboratory
• Mendel 289 is the center of activity for projects related to digital libraries and similar projects.
• Summary of the projects under way, which may present opportunities for class projects or for independent study
• NSDL, CITIDEL, CSTA, Ensemble, Distributed Expertise, Computing Ontology, Interdisciplinary Computing and its relationship to the libraries ….
Our systems
• Now available
  – Fedora Linux machines, remotely accessible (use the gateway)
  – Bare machines with just the basic system
  – We can install Drupal either from the Drupal site (doing things for ourselves) or from the Bitnami site (builds the stack for us)
    • I just heard that Drupal may already be installed. Feel free to uninstall and reinstall if you wish.
• If you have a computer of your own and want to use it,
  – Fine, but you must be able to demonstrate it to the class at the end of the semester. I will need to be able to see what you are doing from time to time during the semester.
  – That means you need a static IP address.
The Digital Library Content
• Essential elements for a digital library
  – Users
  – Content
  – Services
Content - requirements
• Obtain
• Store
  – Organize
  – Describe
• Find
• Deliver
Describing the content
• How to describe content
  – Metadata
    • Machine-readable description of anything
• What description
  – Machine-readable description requires standard descriptive elements
• Dublin Core (http://dublincore.org/)
  – International standard
  – “a standard for cross-domain information resource description.”
  – 15 descriptive elements
• Other metadata schemes
  – IEEE-LOM
Metadata
• What does metadata look like?
• Metadata is data about data
  – Information about a resource, encoded in the resource or associated with the resource.
• The language of metadata: XML
  – eXtensible Markup Language
XML
• XML is a markup language
• XML describes features
• There is no standard set of XML tags; you define your own
• Use XML to create a resource type
• Separately develop software to interact with the data described by the XML codes.
Source: tutorial at w3schools.com
XML rules
• Easy rules, but very strict
• First line is the version and character set used:
  – <?xml version="1.0" encoding="ISO-8859-1"?>
• The rest is user-defined tags
• Every element has an opening tag and a closing tag
Element naming
• XML elements must follow these naming rules:
  – Names can contain letters, numbers, and other characters
  – Names must not start with a number or punctuation character
  – Names must not start with the letters xml (or XML or Xml ..)
  – Names cannot contain spaces
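The rules above can be approximated with a short check. This is a simplified, ASCII-only sketch: the function name is hypothetical, and the real XML specification permits a wider range of characters (for example, many non-ASCII letters) than this pattern accepts.

```python
import re

# Simplified pattern: start with a letter or underscore, then letters,
# digits, hyphens, underscores, or periods. No spaces anywhere.
_NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def is_valid_element_name(name):
    """Approximate the slide's naming rules for XML elements."""
    if name.lower().startswith("xml"):      # reserved prefix, any casing
        return False
    return _NAME_PATTERN.fullmatch(name) is not None


checks = {
    "recipe-title": is_valid_element_name("recipe-title"),
    "2nd-course": is_valid_element_name("2nd-course"),
    "XmlData": is_valid_element_name("XmlData"),
    "ingredient name": is_valid_element_name("ingredient name"),
}
```

Only `recipe-title` passes: the others start with a digit, start with "xml", or contain a space.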
Elements and attributes
• Use elements to describe data
• Use attributes to present information that is not part of the data
  – For example, the file type or some other information that would be useful in processing the data, but is not part of the data.
Repeating elements
In a DTD content model:
• Naming an element means it appears exactly once.
• Name+ means it appears one or more times.
• Name* means it appears 0 or more times.
• Name? means it appears 0 or one time.
Parts of an XML document
• Elements
  – The components of an XML document
  – Some contain other parts, some are empty
    • Ex: in HTML, “br” or “table”; in XML, “ingredient”
• Attributes
  – Information about elements, not data
    • Ex: in HTML, “src=”; in XML, “scale=”
• Entities
  – Special characters or strings with pre-assigned meaning
    • Ex: in HTML, &nbsp; for non-breaking space
• PCDATA
  – Parsed character data: text that will be parsed and interpreted by the reader. Tags and entities will be expanded and used in presentation.
• CDATA
  – Character data: text that will not be parsed and interpreted. It will be displayed exactly as provided.
The HTML examples are familiar; the XML examples are made up – dependent on the specific XML scheme used
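The difference between parsed text and a CDATA section can be seen with Python's standard `xml.etree.ElementTree` parser. The `note` element and its content are made up for this sketch.

```python
import xml.etree.ElementTree as ET

# Parsed character data: the entity is expanded and <b> becomes a child element.
parsed = ET.fromstring("<note>Salt &amp; <b>pepper</b></note>")

# CDATA section: the same characters are kept literally, with no parsing.
literal = ET.fromstring("<note><![CDATA[Salt &amp; <b>pepper</b>]]></note>")

parsed_text = parsed.text            # text before the <b> child
parsed_children = [child.tag for child in parsed]
literal_text = literal.text          # everything inside the CDATA section
literal_children = [child.tag for child in literal]
```

In the first document the parser sees one child element `b`; in the second, the markup is ordinary text and there are no children.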
Using XML - an example
Define the fields of a recipe collection:
<?xml version="1.0" encoding="ISO-8859-1"?>
<recipe>
  <recipe-title> </recipe-title>
  <ingredient-list>
    <ingredient>
      <ingredient-amount> </ingredient-amount>
      <ingredient-name> </ingredient-name>
    </ingredient>
  </ingredient-list>
  <directions></directions>
</recipe>
ISO 8859 is a character set.
See http://www.bbsinc.com/iso8859.html
Processing the XML data
• How do we know what to do with the information in an XML file?
  – Document Type Definition (DTD)
    • Put in the same file as the data -- immediate reference
    • Put a reference to an external description
    • Provides the definition of the legitimate content for each element
Document Type Definition
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE recipe [
<!ELEMENT recipe (recipe-title, ingredient-list, directions)>
<!ELEMENT recipe-title (#PCDATA)>
<!ELEMENT ingredient-list (ingredient)*>
<!ELEMENT ingredient (ingredient-amount, ingredient-name)>
<!ELEMENT ingredient-amount (#PCDATA)>
<!ELEMENT ingredient-name (#PCDATA)>
<!ELEMENT directions (#PCDATA)>
]>
The * means: repeat 0 or more times. It belongs on the ingredient-list content, so that a recipe can list several ingredients, each with one amount and one name.
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE recipe SYSTEM "recipe.dtd">
<recipe>
  <recipe-title>Meringue cookies</recipe-title>
  <ingredient-list>
    <ingredient>
      <ingredient-amount>3</ingredient-amount>
      <ingredient-name>egg whites</ingredient-name>
    </ingredient>
    <ingredient>
      <ingredient-amount>1 cup</ingredient-amount>
      <ingredient-name>sugar</ingredient-name>
    </ingredient>
    <ingredient>
      <ingredient-amount>1 teaspoon</ingredient-amount>
      <ingredient-name>vanilla</ingredient-name>
    </ingredient>
    <ingredient>
      <ingredient-amount>2 cups</ingredient-amount>
      <ingredient-name>mini chocolate chips</ingredient-name>
    </ingredient>
  </ingredient-list>
  <directions>Beat the egg whites until stiff. Stir in sugar, then vanilla. Gently fold in chocolate chips. Place in warm oven at 200 degrees for an hour. Alternatively, place in an oven at 350 degrees. Turn oven off and leave overnight.</directions>
</recipe>
Not the way that I want to see a recipe in a magazine!
What could we do with a large collection of such entries?
How would we get the information entered into a collection?
External reference to DTD
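As one answer to what software could do with such entries, here is a minimal sketch that reads a recipe with Python's standard `xml.etree.ElementTree`. The function name `read_recipe` is hypothetical, the sample is a shortened copy of the recipe above, and the parser does not validate against the DTD.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the recipe from the slide (two ingredients only).
RECIPE_XML = """<recipe>
  <recipe-title>Meringue cookies</recipe-title>
  <ingredient-list>
    <ingredient>
      <ingredient-amount>3</ingredient-amount>
      <ingredient-name>egg whites</ingredient-name>
    </ingredient>
    <ingredient>
      <ingredient-amount>1 cup</ingredient-amount>
      <ingredient-name>sugar</ingredient-name>
    </ingredient>
  </ingredient-list>
  <directions>Beat the egg whites until stiff. Stir in sugar.</directions>
</recipe>"""


def read_recipe(xml_text):
    """Return (title, [(amount, name), ...]) from one recipe document."""
    root = ET.fromstring(xml_text)
    title = root.findtext("recipe-title", default="").strip()
    ingredients = [
        (item.findtext("ingredient-amount", default="").strip(),
         item.findtext("ingredient-name", default="").strip())
        for item in root.iter("ingredient")
    ]
    return title, ingredients


title, ingredients = read_recipe(RECIPE_XML)
```

Run over a large collection, the same function would support services such as "find every recipe that uses egg whites", and a separate stylesheet could present each recipe the way a magazine would.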
XML exercise
• Design an XML schema for an application of your choice. Keep it simple.
• Examples -- address book, TV program listing, DVD collection, …
Another example
• A paper with content encoded with XML: http://tecfaseed.unige.ch/staf18/modules/ePBL/uploads/proj3/paper81.xml
• First few lines:
<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet href="ePBLpaper11.css" type="text/css"?>
<?xml-stylesheet href="ePBLpaper11.xsl" type="text/xsl"?>
<!DOCTYPE paper SYSTEM "ePBLpaper11.dtd">
<paper id="proj3">
  <info>
    <title>Standards E-learning and their possible support for a rich pedagogic approach in a 'Integrated Learning' context</title>
    <authors>
      <author>
        <firstname>Rodolophe</firstname>
        <familyname>Borer</familyname>
        <homepageurl>http://tecfa.unige.ch/perso/staf/borer/</homepageurl>
        <email/>
      </author>
    </authors>
"ePBLpaper11.dtd" shown on next slide
<?xml version="1.0" encoding="ISO-8859-1" ?>
<!-- _________ _____________________ -->
<!-- ePBL-project DTD for student project management & specification -->
<!-- Copyright: (2004) [email protected] -->
<!-- http://tecfa.unige.ch/~paraskev/ -->
<!-- Daniel K. Schneider -->
<!-- http://tecfa.unige.ch/tecfa-people/schneider.html-->
<!-- Created: 13/11/2002 (based on EVA_pm grammar) -->
<!-- Updated: 07/05/2004 -->
<!-- VERSIONS -->
<!-- v1.1 Adaptations to use with Morphon xml editor and addition of IDs-->
<!-- ____________________ -->
<!-- _ ENTITY DECLARATIONS ______ -->
<!ENTITY % foreign-dtd SYSTEM "ibtwsh6_ePBL.dtd">
%foreign-dtd;
<!ENTITY % id "id ID #IMPLIED">
<!-- ______ MAIN ELEMENT _________ -->
<!ELEMENT project (name, authors, date, updated, goal, state-of-the-art, research-development-questions, methodology, workpackages ) >
<!ELEMENT name (#PCDATA )>
<!ELEMENT date (#PCDATA )>
<!ELEMENT authors (#PCDATA )>
<!ELEMENT updated (#PCDATA )>
<!ELEMENT goal (title, description )>
<!ELEMENT state-of-the-art %vert.model;>
<!ATTLIST state-of-the-art %id;>
<!ELEMENT research-development-questions (question )+>
<!ELEMENT question (title, description )>
<!ELEMENT methodology %vert.model;>
<!ATTLIST methodology %id;>
<!ELEMENT workpackages (workpackage )+>
<!ELEMENT workpackage (planning, objectives, deliverables )>
<!ATTLIST workpackage %id;>
<!ELEMENT objectives (objective )+>
<!ELEMENT objective (title, description )>
<!ELEMENT deliverables (deliverable )+>
<!ELEMENT deliverable (url, title, description )>
<!ELEMENT url (#PCDATA )>
<!ELEMENT planning (from, to, progress )>
<!ELEMENT from (#PCDATA )>
<!ELEMENT to (#PCDATA )>
<!ELEMENT progress (#PCDATA )>
<!-- ________________________ -->
<!ELEMENT title (#PCDATA )>
<!ATTLIST title %id;>
<!ELEMENT description %vert.model;>
<!-- _______________________ -->
Source: http://tecfa.unige.ch/staf/staf-j/vuilleum/staf18/p6/
Vocabulary
• Given the need for processing, do you want free text or restricted entries?
• Free text gives more flexibility for the person making the entry
• Controlled vocabulary helps with
  – Consistent processing
  – Comparison between entries
• Controlled vocabulary limits
  – Options for what is said
Vocabulary example
• Recipe example
  – What text should be controlled?
  – What should be free text?
• Ingredients
  – Ingredient-amount
  – Ingredient-name
  – Should we revise how we coded ingredient amount?
• Directions
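One way to explore the question about ingredient amounts is to split the free-text amount into a number and a unit drawn from a controlled list. The unit list, the "whole" default, and the function name below are assumptions made for this sketch, not part of the recipe schema.

```python
# Hypothetical controlled vocabulary for units.
UNITS = {"cup", "teaspoon", "tablespoon", "whole"}


def parse_amount(text):
    """Split an amount such as '2 cups' into (2.0, 'cup').

    A bare number such as '3' is given the unit 'whole'.
    Raises ValueError for anything outside the controlled vocabulary.
    """
    parts = text.split()
    if not parts:
        raise ValueError("empty amount")
    quantity = float(parts[0])               # ValueError if not a number
    # Naive plural handling: strip a trailing 's' before the lookup.
    unit = parts[1].lower().rstrip("s") if len(parts) > 1 else "whole"
    if unit not in UNITS:
        raise ValueError(f"unit {unit!r} is not in the controlled vocabulary")
    return quantity, unit


amounts = [parse_amount(a) for a in ["3", "1 cup", "1 teaspoon", "2 cups"]]
```

With amounts in this form, entries can be compared or scaled consistently, while an entry like "1 handful" is rejected. That is the trade-off from the previous slide: consistent processing in exchange for fewer options for what can be said.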
Dublin Core
• Standard set of metadata fields for entries in digital libraries:
  – Title, creator, subject, description, publisher, contributor, date, type, format, identifier, source, language, relation, coverage, rights
Dublin Core elements
See: http://dublincore.org/documents/dces/
C = controlled vocabulary recommended.
• Title
• Creator: Entity primarily responsible for making content of the resource
• Subject - C
• Description
• Publisher: Entity making the resource available
• Contributor: Contributor to content of the resource
• Date: e.g., YYYY-MM-DD
• Type - C: Ex: collection, dataset, event, image
• Format - C: What is needed to display or operate the resource
• Identifier: Unambiguous ID
• Source
• Language: Standards RFC 3066, ISO639
• Relation: Ref. to related resource
• Coverage - C: Space, time, jurisdiction
• Rights: Rights management information
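To show what a Dublin Core description might look like as XML, here is a sketch using Python's standard `xml.etree.ElementTree`. The namespace URI is the published one for the 15 elements; the wrapper element `record`, the function name, and all field values are hypothetical.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"   # namespace of the 15 elements
ET.register_namespace("dc", DC_NS)

DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def make_dc_record(fields):
    """Build a <record> element holding one dc:* child per field."""
    record = ET.Element("record")            # hypothetical wrapper element
    for name, value in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"{name!r} is not a Dublin Core element")
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


# Illustrative values only.
record = make_dc_record({
    "title": "Meringue cookies",
    "type": "Text",
    "format": "text/xml",
    "language": "en",
})
xml_text = ET.tostring(record, encoding="unicode")
```

The `type`, `format`, and `language` values are where the controlled vocabularies marked "C" above would apply.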
Dublin Core Terms
• An update to the original DC elements
  – Adds the concept of range and domain
Each term has this minimal set of attributes:
• Name: A token appended to the URI of a DCMI namespace to create the URI of the term.
• Label: The human-readable label assigned to the term.
• URI: The Uniform Resource Identifier used to uniquely identify a term.
• Definition: A statement that represents the concept and essential nature of the term.
• Type of Term: The type of term as described in the DCMI Abstract Model [DCAM].
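The relationship between Name and URI can be sketched in a few lines. The class name `Term` is hypothetical, and the sample values for the `abstract` term are given as an illustration; check them against the DCMI documentation before relying on them.

```python
from dataclasses import dataclass

DCTERMS_NS = "http://purl.org/dc/terms/"     # DCMI terms namespace


@dataclass(frozen=True)
class Term:
    """The minimal attribute set listed on the slide."""
    name: str
    label: str
    definition: str
    type_of_term: str
    namespace: str = DCTERMS_NS

    @property
    def uri(self):
        # The URI is the namespace URI with the name token appended.
        return self.namespace + self.name


abstract = Term(
    name="abstract",
    label="Abstract",
    definition="A summary of the resource.",
    type_of_term="Property",
)
```

Because the URI is derived from the namespace and the name, two terms with the same name in different namespaces remain distinct.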
DC Terms
Additional Attributes possible:
• Comment: Additional information about the term or its application.
• See: Authoritative documentation related to the term.
• References: A resource referenced in the Definition or Comment.
• Refines: A Property of which the described term is a Sub-Property.
• Broader Than: A Class of which the described term is a Super-Class.
• Narrower Than: A Class of which the described term is a Sub-Class.
• Has Domain: A Class of which a resource described by the term is an Instance.
• Has Range: A Class of which a value described by the term is an Instance.
• Member Of: An enumerated set of resources (Vocabulary Encoding Scheme) of which the term is a Member.
• Instance Of: A Class of which the described term is an instance.
• Version: A specific historical description of a term.
• Equivalent Property: A Property to which the described term is equivalent.
The DC Terms – from 15 to …
abstract, accessRights, accrualMethod, accrualPeriodicity, accrualPolicy, alternative, audience, available, bibliographicCitation, conformsTo, contributor, coverage, created, creator, date, dateAccepted, dateCopyrighted, dateSubmitted, description, educationLevel, extent, format, hasFormat, hasPart, hasVersion, identifier, instructionalMethod, isFormatOf, isPartOf, isReferencedBy, isReplacedBy, isRequiredBy, issued, isVersionOf, language, license, mediator, medium, modified, provenance, publisher, references, relation, replaces, requires, rights, rightsHolder, source, spatial, subject, tableOfContents, temporal, title, type, valid
DC terms
• See http://dublincore.org/documents/dcmi-terms/
• Review the list and see what has been added
IEEE - LOM
• Example of a specialized metadata scheme
  – Learning Object Metadata
    • Specifically for collections of educational materials
    • Includes all of Dublin Core
    • See http://projects.ischool.washington.edu/sasutton/IEEE1484.html
Computing systems
• Linux machines
• Introduction to Unix: http://www.csc.villanova.edu/~lab/unix/
• DSpace: http://www.dspace.org/
  – Documentation, including installation - http://www.dspace.org/index.php?option=com_content&task=view&id=151&Itemid=116
• Najib Nadi, our system administrator, is setting up the machines. He will send a message to the class by the middle of the week with details of machine location and login.
Remember - you have the option to use your own machine, but must meet the criteria described last week.
This session
• Defined metadata and its role in digital libraries.
• Introduced XML as a language for describing a collection of content.
• Described the computing resources and how to get ready for the first DL setup.