Structured-Document Processing Languages
(5 cp), Spring 2011
Pekka Kilpeläinen
University of Eastern Finland
School of Computing
[email protected]
SDPL 2011 Notes 1: Introduction
Introduction
First: Overview and Arrangements
What is this course about?
1.1 Review of Structured Documents
2.1 XML & XML Documents
Goals of the Course
To get familiar with central models and languages for
  – manipulating,
  – transforming, and
  – querying
structured documents (or XML)
“Generic XML processing technology”
  – very little about specific XML applications or commercial systems
NOT an Exhaustive Survey
Bias in selecting course topics:
  – estimated usefulness/value
    » centrality (→ longer-lasting value)
    » maturity: Standards, or stable specifications? Robust implementations?
  – Lecturer up-to-date?
Emphasis on programmatic access and manipulation of data in the form of documents, rather than describing/modelling it
Motivation?
Practical relevance of documents and XML
Interest in models of information processing
[Figure labels: XML; Internet / intranet; e.g. order; e.g. invoice]
“Programming for XML is very different from programming for other data models”
- D. Florescu & D. Kossmann at SIGMOD’06
Course Outline
1 Introduction
  Overview and Arrangements
  1.1 Structured Documents
2 XML Basics
  - XML documents, DTDs, Namespaces
3 Programmatic Manipulation of Structured Documents (XML APIs)
  - SAX, DOM, StAX, JAXP
Course Outline (2)
4 Transforming Structured Documents
  4.1 Addressing: XPath 1.0
  4.2 XSLT 1.0
5 Querying Structured Documents
  - W3C XML Query Language XQuery
6 Review of the Course
Methodological Goals
Central professional skills
  – consulting technical specifications
  – experimenting with SW implementations
Ability to think…?
  – to find out relationships, reason, explain, ...
  – to apply knowledge in new situations
("Pidgin English" for scientific communication)
Administration
An elective Master-level special course
  – suitable for all specialisation lines (esp. media technology and SWE)
5 cp ( 133 h 20 min of work!)
Lectures March 9 – May 5, class MT3
  – Video broadcast to TB180 in Joensuu
  – Lecturer: [email protected]
NB!
Administration: Exercises
Exercises, March 14 – May 9
  – essential for learning the technology
  – normal homework assignments, hands-on practice; solutions discussed in class
  – Grading: one prepared solution -> one activity point
Plan:
  – some solutions shall be returned to Moodle, for potential feedback and scaling of activity points by solution quality
  – These will be announced on a case-by-case basis
Administration: GradingAdministration: Grading
Final Final examexam on Friday, May 13 on Friday, May 13– ≥≥ 50% of exam points required to pass50% of exam points required to pass
Grade = Grade = round(6*Exam/MaxExam + round(6*Exam/MaxExam +
2*HomeWork/MaxHomeWork - 2.5)2*HomeWork/MaxHomeWork - 2.5) Retake exam on Friday June 10Retake exam on Friday June 10
– again again 50% to pass 50% to pass– better of grades with/without homework creditsbetter of grades with/without homework credits
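A minimal sketch of the grading rule, to make the arithmetic concrete. The function names and the sample point totals are hypothetical, and the slide does not say how `round` treats exact halves; rounding halves upward is assumed here so that an exam result of exactly 50% with no homework gives grade 1.

```python
import math


def passes_exam(exam, max_exam):
    """At least 50% of the exam points are required to pass."""
    return exam >= 0.5 * max_exam


def course_grade(exam, max_exam, homework, max_homework):
    """Grade = round(6*Exam/MaxExam + 2*HomeWork/MaxHomeWork - 2.5).

    Assumption: halves round upward (the slide does not specify this).
    """
    raw = 6 * exam / max_exam + 2 * homework / max_homework - 2.5
    return math.floor(raw + 0.5)


# Hypothetical point totals: 60 exam points, 10 homework points.
print(passes_exam(30, 60))          # exactly 50% of the exam points
print(course_grade(30, 60, 0, 10))  # 3.0 + 0.0 - 2.5 = 0.5
print(course_grade(45, 60, 5, 10))  # 4.5 + 1.0 - 2.5 = 3.0
```

The sketch covers only the formula and the pass threshold. How the retake's "better of grades" comparison is computed is not stated on the slide, so it is left out.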
Material
No single textbook
Reports, specifications, articles
Course home page
  – http://www.cs.uku.fi/~kilpelai/RDK11/
  – slides, exercises, references, announcements
Possible background texts (see home page):
  – Deitel et al; Møller & Schwartzbach; Key; Walmsley
Background Check
Basic knowledge of structured documents and document standards
  – Course "Introduction to Document standards"?
Programming languages and concepts
  – Java? OO programming?
  – Functional programming? SQL?
  – Unix/Linux vs. Windows?
Formal language theory
  – Theory of Computation
  – regular expressions, context-free grammars, parse trees?
Students' Expectations?
1.1 Structured Documents
Document:
  – a structured representation of information on some medium ( message)
  – normally for a human reader
    » memos, manuals, articles, books, …
  – also application-to-application messages
    » e.g., between client and server in Web Services
  – "prose-oriented XML" vs "data-oriented XML"
  – can be treated as a unit
    » e.g., a web page vs a web site?
Text-Based Documents
We concentrate on textual or text-based documents
  – character data is the major constituent of the information content
  – as opposed to, say, multimedia documents
Next: Presentation vs Structure
Presentation vs Structure
Presentation informs the human reader about the meaning of text and the role of its parts
Markup (Finnish: merkkaus) indicates the presentation or the meaning of different parts of text
  – originally hand-written annotations for the typesetter
  – nowadays primarily codes embedded in digital documents; <Tags>
Markup
Procedural markup
  – formatting commands (start boldface, produce an empty line, indent 5 mm, ...)
  – proprietary word processor formats, nroff, TeX, ...
Descriptive or generic markup
  – indicating the logical structure of text using chosen names
  – LaTeX: \begin{abstract} ... \end{abstract}
  – HTML: <TITLE> .... </TITLE>
Markup language (Finnish: merkkauskieli)
  – a fixed set of markup notations (e.g. nroff, TeX, HTML, SVG, …)
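One practical consequence of descriptive markup can be sketched in a few lines: because the tags name logical roles rather than formatting commands, the same source can be presented in more than one way. The element names (`memo`, `title`, `warning`), the two style tables and the `render` function below are made up for illustration; only Python's standard `xml.etree.ElementTree` parser is used.

```python
import xml.etree.ElementTree as ET

# Descriptively marked-up source: tags say what each part IS, not how it looks.
SOURCE = (
    "<memo><title>Service notice</title>"
    "<warning>Unplug the device first.</warning></memo>"
)

# Two hypothetical presentations, chosen independently of the source.
PLAIN_STYLE = {"title": "{}\n", "warning": "WARNING: {}\n"}
HTML_STYLE = {"title": "<h1>{}</h1>\n", "warning": "<p><b>{}</b></p>\n"}


def render(xml_text, style):
    """Format each child of the root element according to its tag name."""
    root = ET.fromstring(xml_text)
    return "".join(style[child.tag].format(child.text) for child in root)


print(render(SOURCE, PLAIN_STYLE))
print(render(SOURCE, HTML_STYLE))
```

With procedural markup the formatting decisions would already be baked into the source, so switching presentation would mean editing the document itself.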
Structured Documents?
Most liberally, any document is structured (punctuation, words, sentences, fields, …)
but especially descriptively marked-up documents ... (e.g. well-formed XML)
especially if they adhere to a rigorous specification of structure (e.g. XML+DTD)
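The difference between "any text" and well-formed XML can be checked mechanically. The sketch below is illustrative only: the helper name `is_well_formed` is an assumption, and it relies on the fact that Python's standard `xml.etree.ElementTree` parser raises `ParseError` on input that is not well-formed. That parser does not validate against a DTD, so the stricter "XML+DTD" level is not covered here.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<a><b>x</b></a>"))    # properly nested elements
print(is_well_formed("<a><b>x</a></b>"))    # overlapping tags
print(is_well_formed("just some prose."))   # structured only in the liberal sense
```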
Structure in Documents
Hierarchy or nesting is ubiquitous
  – chapters of books, warnings in maintenance manuals, ...
Linear order essential in prose documents
  – less important in representations of data objects
Hypertext and cross-references
We'll be mainly dealing with manipulation of hierarchical, or tree-like, document structures
Next: How are these modelled?
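As a small illustration of the tree view, nested elements can be parsed into a tree and walked recursively, with document order preserved among siblings. The sample document and the `outline` function are hypothetical; this is a sketch using Python's standard `xml.etree.ElementTree`, not the document models covered later in the course.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a book with nested chapters, titles and a section.
BOOK = (
    "<book>"
    "<chapter><title>Intro</title><section/></chapter>"
    "<chapter><title>XML</title></chapter>"
    "</book>"
)


def outline(element, depth=0):
    """List element names in document order, indented by nesting depth."""
    lines = ["  " * depth + element.tag]
    for child in element:  # children are visited in document (linear) order
        lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(ET.fromstring(BOOK))))
```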