Native XML Database for Information Systems
Chris Wallace, SMRG Seminar, Feb 2006
Exploring the design space
• “design as a conversation with the materials in the situation” (Schön)
• Native XML database (NXD)
  – Storing, querying and updating XML documents without mapping into relations
  – Schema-free
  – Trees are to NXD what tables are to RDBMS
  – Tables are trees
• Information Systems
  – Focus on semi-structured data (mixture of simple data items, text and complex nested structures)
  – Searching, derived data, visualisation
  – Process support
  – Large problem space variously supported by spreadsheets, Word documents, ad-hoc databases and, increasingly, web-integrated data
eXist Native XML Database
• Open source, Java
• European team of developers led by Wolfgang Meier
• Documents (files) are organised in collections (folders) in a file store
  – XML documents stored in an efficient B+ tree structure with indexes
  – Non-XML resources (XQuery, CSS, JPEG, etc.) can be stored as binary
• Deployable in different ways
  – Embedded in a Java application
  – Part of a Cocoon pipeline
  – As a web application in Apache/Tomcat
  – With embedded Jetty HTTP server (as on stocks)
• Multiple interfaces
  – REST – to Java servlet
  – SOAP
  – XML-RPC
NXD case studies
• FOLD
  – Modules, programmes, scheme operations, staff, organisational structures, events
• Family photos and history
  – Integration of meta-data on family photos with family history (births, deaths and marriages)
• ISD3 Assignment – a web-based calculator
  – e.g. a currency converter
Research Work
• Development of the FOLD (Faculty OnLine Data) – a pilot project for UWE
• Teaching students and staff in XML languages (XML Schema, XSLT, XQuery) and NXD database design
• Links with other eXist projects
• SPA2006 Workshop on NXD
• XML Prague (eXist)
Research Areas
• Design practice for NXD
  – ‘Pattern language’ to help map from conceptual model to multiple XML schemes
  – Identifier design
  – Structuring documents by responsibility and versions
• NXD in organisational use
  – Social effects of distributed responsibility
  – Visualisation of complex relationships
  – Handling integrity problems – accept inconsistency as a way of life
  – Management of veracity
The FOLD
• Faculty OnLine Data
• Technologies
  – eXist
  – (Java) – not yet
  – XQuery
  – XSLT
  – CSS
  – PHP – to be eliminated
The FOLD (2)
• Scope
  – Module and Programme specifications
  – Modular Scheme operations (runs)
  – Staff
  – Organisational structure
  – Events
• Functionality
  – Highly linked
  – (Integrating UWE sources)
  – (Personalised interface)
FOLD - Modules and Programmes
[Figure: UML class diagram of the FOLD module and programme model. Classes and attributes recoverable from the diagram:]
• Module – moduleCode : String
• Module Specification – version : Year; faculty : Faculty; field : Field; title : String; credits : CreditsType; level : LevelType; syllabus : RestrictedHTML; readingStrategy : RestrictedHTML
• Programme – programmeCode : String; ucasCode : String [0..1]
• ProgrammeStructure – version : Year
• Stage
• OptionGroup – id : String; comment : String [0..1]; minCredits : int; maxCredits : int
• Core
• Option
• Module Combination – comment : String
• Learning Outcome – assessed in Comp A : boolean; assessed in Comp B : boolean; specification : RestrictedHTML; outcomeType : Learning Outcome
• Reading item
• Book – authors : String; title : String; year : String; source : String
• WebSite – url : URL; text : String
• Association role names: definition, structure, core, optional, pre-requisite, co-requisite, expression, Excluded
• Multiplicities shown: 1..1, 0..1, 1..* and 1..* {ordered}
• Note on expression: this is a boolean expression such as (m1 and m2 and (m4 or (m5 and m6)))
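The diagram's note describes pre-requisites as boolean expressions over module codes. A minimal Python sketch of how such an expression might be checked against the modules a student has passed is given below. The function names and the use of a set of passed codes are assumptions for illustration, not part of the FOLD itself.

```python
import re


def tokenize(expression):
    """Split an expression into '(', ')', 'and', 'or' and module codes."""
    return re.findall(r"\(|\)|[A-Za-z0-9-]+", expression)


def evaluate(expression, passed):
    """Return True if the set of passed module codes satisfies the expression."""
    tokens = tokenize(expression)
    pos = 0

    def parse_or():
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos] == "or":
            pos += 1
            right = parse_and()  # always parse the right side before combining
            value = value or right
        return value

    def parse_and():
        nonlocal pos
        value = parse_term()
        while pos < len(tokens) and tokens[pos] == "and":
            pos += 1
            right = parse_term()
            value = value and right
        return value

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if token in (")", "and", "or"):
            raise ValueError(f"unexpected token {token!r}")
        return token in passed  # a module code is true if it has been passed

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


expression = "(m1 and m2 and (m4 or (m5 and m6)))"
print(evaluate(expression, {"m1", "m2", "m4"}))
```

Here `and` binds tighter than `or`, which is a design choice of this sketch; the slides do not state a precedence rule.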
FOLD Design Issues
• Conceptual Modelling
• Conceptual – Logical – Physical mapping
• Identifiers
• Relationships and links
• Versioning
• Editing
• Views
• Responsibilities
• Processes
Mapping from Conceptual model to the Logical and Physical layers
• What criteria to use in breaking up the whole model into
  – Logical
    • Entity – a logical compound structure
  – Physical
    • Documents – a physical aggregation of entity instances
    • Collections – a physical aggregation of documents
• Examples
  – Module Specification [moduleCode]
    • Module Spec is an Entity
    • Each Module Spec is a Document
  – Module Run [moduleCode/year/runNo]
    • Module Run is an Entity
    • Set of Module Runs for a Field is a Document
• Issues
  – Where to develop Schemas?
  – No logical data in the physical layer – purely for convenience
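As an illustration of the Module Run example, the sketch below splits a composite identifier of the form moduleCode/year/runNo and groups runs into one "document" per Field. It is a Python sketch only: the record layout with `key` and `field` entries and the use of plain lists as documents are assumptions, not the FOLD's actual storage layout.

```python
from collections import defaultdict


def parse_run_key(key):
    """Split a 'moduleCode/year/runNo' identifier into its three parts."""
    module_code, year, run_no = key.split("/")
    return {"moduleCode": module_code, "year": int(year), "runNo": int(run_no)}


def runs_by_field(runs):
    """Group module-run keys into one 'document' (here, a list) per field."""
    documents = defaultdict(list)
    for run in runs:
        documents[run["field"]].append(run["key"])
    return dict(documents)


# Hypothetical records: the field names and run numbers are invented.
runs = [
    {"key": "UFIEKG-20-3/2005/1", "field": "IS"},
    {"key": "UFIEKG-20-3/2005/2", "field": "IS"},
    {"key": "X-1/2005/1", "field": "OtherField"},
]
print(parse_run_key("UFIEKG-20-3/2005/1"))
print(runs_by_field(runs))
```

Because the module code itself contains hyphens rather than slashes, a plain split on "/" is enough here; a code containing "/" would need a different separator.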
Conceptual Modelling
• Conventional normalised data model
• Generality issue, e.g. Module Run
  – Roles as Attributes
    • <ModuleLeader>Stewart Green</ModuleLeader>
  – Roles as Entities
    • <role><title>Module Leader</title><person>Stewart Green</person></role>
  – Entities enable meta data, but defeat use of tables for data entry
    • Need views
• Attributes v elements – a Conceptual/Logical mapping issue
  – <Module code="UFIEKG-20-3" level="3">…
  – <Module><ModuleCode>UFIEKG-20-3</ModuleCode>…
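The two role representations can be converted mechanically, which is one way a view could bridge them. The Python sketch below rewrites a role-named element into the generic role form using the standard library's ElementTree. The `ModuleRun` wrapper element and the `ROLE_TITLES` mapping are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from role-named elements to human-readable role titles.
ROLE_TITLES = {"ModuleLeader": "Module Leader"}


def roles_as_entities(run):
    """Return a shallow copy of a run with role-named children rewritten as <role> entities."""
    result = ET.Element(run.tag, run.attrib)
    for child in run:
        if child.tag in ROLE_TITLES:
            role = ET.SubElement(result, "role")
            ET.SubElement(role, "title").text = ROLE_TITLES[child.tag]
            ET.SubElement(role, "person").text = child.text
        else:
            result.append(child)  # non-role children are carried over unchanged
    return result


run = ET.fromstring("<ModuleRun><ModuleLeader>Stewart Green</ModuleLeader></ModuleRun>")
converted = roles_as_entities(run)
print(ET.tostring(converted, encoding="unicode"))
```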
Conceptual Modelling Tools
• UML class model closest to suitable conceptual model
  – Allows multi-valued attributes
  – Distinguishes relationship kinds
    • Composition
    • Bi-directional associations
    • Uni-directional associations (for multiplicity resolution)
  – QSEE/Rose
    • No identifiers (primary keys) ??
    • No indication of mapping to attributes or elements
    • No mapping into Entities
    • No mapping into Documents and Collections
Identifiers
• Principle adopted – use naturally occurring identifiers wherever possible
  – Persons: “Ian Beeson”
  – Rooms: “3P14”
• Plus
  – Reduces gap between real-world (RW) domain and system
  – Names in minutes of meetings and on spreadsheets are readable
• Minus
  – Duplicates
    • Duplicates not tolerable in the RW either, resolved through RW negotiation within a RW namespace, e.g. the Faculty
    • Mergers generate duplicates
  – Aliases
  – Not all entities have unique identifiers
    • Programmes – ISIS Primary Award and UCAS are candidates but don’t work
• ?
  – All names need a namespace – “Ian Beeson” at CEMS at UWE
  – Need to replace multiple naming conventions with a single naming scheme (e.g. initials)
  – URNs and semantic web
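One way to read "all names need a namespace" is that an identifier is the natural name qualified by its faculty and organisation, so duplicates only matter inside one namespace. A small Python sketch of that idea follows; the tuple layout and the `OtherFaculty` namespace are assumptions for illustration.

```python
from collections import Counter


def qualified(name, faculty, organisation):
    """Build a namespaced identifier from a natural name."""
    return (organisation, faculty, name)


def clashes(identifiers):
    """Return the identifiers that occur more than once in the same namespace."""
    counts = Counter(identifiers)
    return sorted(ident for ident, count in counts.items() if count > 1)


# 'OtherFaculty' is a hypothetical second namespace.
people = [
    qualified("Ian Beeson", "CEMS", "UWE"),
    qualified("Ian Beeson", "OtherFaculty", "UWE"),
]
print(clashes(people))
```

The same name in two faculties is not reported as a clash, while a repeat within one faculty would be, which matches the slide's point that duplicates are resolved by negotiation within a namespace.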
Alias handling
– Problem handling aliases in staff data
  • Currently a person can have multiple names
    – First is the prime
  • Better is a separate alias table
    – Look up the base table
    – If not found, try the alias table
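The two-step lookup can be sketched in a few lines of Python. The dictionaries standing in for the base and alias tables, and the alias "I. Beeson", are assumptions for the example.

```python
def find_staff(name, staff, aliases):
    """Look up a person in the base table first, then fall back to the alias table."""
    if name in staff:
        return staff[name]
    prime = aliases.get(name)  # alias table maps an alias to the prime name
    if prime is not None:
        return staff.get(prime)
    return None


staff = {"Ian Beeson": {"faculty": "CEMS"}}
aliases = {"I. Beeson": "Ian Beeson"}  # hypothetical alias

print(find_staff("I. Beeson", staff, aliases))
```

Keeping aliases in their own table means the base table holds exactly one prime name per person, so the order of names no longer carries meaning.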
Relationships and Links
• Relationships need to be implemented
  – One–Many
    • RDBMS – primary key on the One side becomes foreign key on the Many side
    • NXD – choose which side on the basis of complexity and responsibility
      – Sequence (modules in a stage)
      – Complex (pre-requisite expression)
  – Many–Many
    • RDBMS – intersection table
    • NXD – as for one-many
    • or either side as appropriate
      – Groups and subgroups
• Issues
  – Referential integrity
    • RDBMS – ‘eager’ – data not allowed in unless links OK, links maintained through updates
      – Integrity failures transient, repair outside database
    • NXD – ‘lazy’
      – Store the data and provide on-demand or on-trigger validation
      – Integrity failures can be persisted (XLinkit) and repair is inside database
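The 'lazy' approach can be illustrated with an on-demand check that reports dangling links rather than rejecting the data. The Python sketch below uses the standard library's ElementTree; the element names (`Staff`, `Person`, `Name`, `ModuleRuns`, `ModuleRun`) and the run codes are assumptions, and it does not model XLinkit.

```python
import xml.etree.ElementTree as ET


def dangling_references(runs_doc, staff_doc):
    """Return (run code, leader name) pairs whose leader is not in the staff document."""
    known = {person.findtext("Name") for person in staff_doc.iter("Person")}
    failures = []
    for run in runs_doc.iter("ModuleRun"):
        leader = run.findtext("ModuleLeader")
        if leader is not None and leader not in known:
            failures.append((run.get("code"), leader))
    return failures


staff_doc = ET.fromstring("<Staff><Person><Name>Stewart Green</Name></Person></Staff>")
runs_doc = ET.fromstring(
    "<ModuleRuns>"
    '<ModuleRun code="run-1"><ModuleLeader>Stewart Green</ModuleLeader></ModuleRun>'
    '<ModuleRun code="run-2"><ModuleLeader>S. Green</ModuleLeader></ModuleRun>'
    "</ModuleRuns>"
)
print(dangling_references(runs_doc, staff_doc))
```

The returned list is itself data, so it could be stored alongside the documents and worked through later, which is the sense in which failures are persisted and repaired inside the database.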
Versioning
• Based on yearly cycle
  – Base Year set in user’s session
  – Default set in system config
• Two different approaches
  – Module Run, Coursework Elements, ...
    • Explicit version identifier
      – ModuleCode/Year/RunNo
      – Selection is explicit [Year = $year]
  – Module Specification, Programme Structure
    • Implicit version defined by sequence of versions
Implicit Versioning
[Figure: timeline of versions 2002, 2005 and 2007]
• Year = 2006: latest version = 2005
• Year = 2004: latest version = 2002
Implicit Versioning
let $specPath := "/db/versionTest",
    $currentYear := "2005",
    $moduleCode := request:request-parameter("moduleCode", ""),
    $year := request:request-parameter("year", $currentYear),
    (: get the set of possible versions for this module :)
    $modspecs := collection($specPath)/moduleSpecification
                   [ModuleCode = $moduleCode]
                   [Version <= $year],
    (: select the version with the highest version number :)
    $modspec := $modspecs[Version = max($modspecs/Version)]
return $modspec
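The selection rule in the query above (latest version not later than the requested year) can be restated as a short Python sketch. The function name and the use of integer years are assumptions; the version years are those shown on the preceding slide.

```python
def select_version(versions, year):
    """Return the latest version not later than the given year, or None if there is none."""
    candidates = [version for version in versions if version <= year]
    return max(candidates) if candidates else None


versions = [2002, 2005, 2007]
print(select_version(versions, 2006))
print(select_version(versions, 2004))
```

A year earlier than every version yields `None` here, which corresponds to the query returning an empty sequence.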
Editing
• Table-structured Document editing
  – Allows maintenance using familiar spreadsheet tools (Excel 2003)
  – Schema is induced by Excel
  – Accommodations
    • Multi-valued fields as concatenated values
      – XPath string-join and tokenize functions
      – Embedded separator problem (a name with ‘,’ as a legitimate character)
      – Defeats indexing
    • Optional elements increase table width
    • Formatting choices not maintained (e.g. Freeze-Window)
• Structured Document editing
  – Allows maintenance with Word without a schema
    • With difficulty – no schema awareness
  – Use InfoPath to create desktop form based on schema
    • Need to redo if schema changes
• In-situ Updates
  – With XQuery-generated forms and update
  – With XForms
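The embedded separator problem is easy to reproduce. The Python sketch below joins and splits a multi-valued cell, mirroring what string-join and tokenize do in XPath. The "surname, forename" name format and the "|" alternative separator are assumptions chosen to show the failure and one possible workaround.

```python
def join_values(values, separator=","):
    """Concatenate a multi-valued field into a single spreadsheet cell."""
    return separator.join(values)


def split_values(cell, separator=","):
    """Split a cell back into its values, trimming surrounding spaces."""
    return [part.strip() for part in cell.split(separator)] if cell else []


names = ["Green, Stewart", "Beeson, Ian"]

# With ',' as separator the round trip breaks each name in two.
print(split_values(join_values(names)))

# A separator assumed never to occur in the values survives the round trip.
print(split_values(join_values(names, "|"), "|"))
```

Changing the separator only moves the problem unless the chosen character is guaranteed absent from the data, and the concatenated cell still cannot be indexed per value.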
Views
• Views arise from the need for de-normalisation
  – Coursework Element
    • As a simple element
      – Key: moduleCode/Year/runNo/elementNo
      – Data: due date
    • As a derived complex element
      – SuggestedHours (computed from Hours table)
      – Late date (computed from UWE calendar)
      – Weightings (extracted from relevant specification)
      – Module Leader (extracted from Module Run)
• Views as transient or materialised
• View definition
• View maintenance
declare function fold:courseworkElement($moduleCode, $year, $runNo, $elementNo) {
  let $mod := fold:moduleSpecification($moduleCode, $year),
      $run := fold:moduleRun($moduleCode, $year, $runNo),
      $elementRun := fold:elementRun($moduleCode, $year, $runNo, 'B', $elementNo),
      $elementSpec := $mod/Assessment/FirstAttempt/Components/ComponentB/Element[position() = $elementNo],
      $dueDate := $elementRun/DueDate,
      $returnDate := fold:workingDays($dueDate, 20),
      $componentWeight := $mod/Assessment/Weighting/ComponentWeightB,
      $weightInComponent := data($elementSpec/Weight),
      (: element weight as a percentage of the whole module :)
      $weightInModule := round($weightInComponent * $componentWeight div 100),
      $load := fold:load($mod/Level),
      (: suggested hours, scaled from the credits and hours of the level's load :)
      $hrs := round(data($mod/UWERating) div data($load/Credits) * $weightInModule div 100 * data($load/Hours))
  return
    <CourseworkElement>
      <ModuleCode>{$moduleCode}</ModuleCode>
      {$mod/Title}
      <RunNo>{$runNo}</RunNo>
      {$run/ModuleLeader}
      {$run/InternalModerator}
      {$run/ExternalExaminer}
      <Component>CW</Component>
      <ElementNo>{$elementNo}</ElementNo>
      {$elementSpec/Description}
      <SuggestedHours>{$hrs}</SuggestedHours>
      <WeightInComponent>{$weightInComponent}</WeightInComponent>
      <WeightInModule>{$weightInModule}</WeightInModule>
      <DueDate>{data($dueDate)}</DueDate>
      <ReturnDate>{data($returnDate)}</ReturnDate>
    </CourseworkElement>
};
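The two derived values in this view, the weight in the module and the suggested hours, are plain arithmetic. A Python sketch of the same formulas follows. All figures in the example (a 50% element in a 40% component, a 20-credit module, a load of 120 credits and 1200 hours) are hypothetical, and `round_half_up` is included because Python's built-in `round` treats exact halves differently from XQuery's `round`.

```python
import math


def round_half_up(value):
    """Round to the nearest integer with halves going up, as XQuery's round() does."""
    return math.floor(value + 0.5)


def weight_in_module(weight_in_component, component_weight):
    """Percentage of the whole module carried by one element."""
    return round_half_up(weight_in_component * component_weight / 100)


def suggested_hours(rating, load_credits, load_hours, module_weight):
    """Hours for an element: module share of the load, scaled by the element's weight."""
    return round_half_up(rating / load_credits * module_weight / 100 * load_hours)


# Hypothetical figures, not taken from any real module specification.
weight = weight_in_module(50, 40)
print(weight)
print(suggested_hours(20, 120, 1200, weight))
```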
Process support
• Short term – Process support
  – Form generation
  – Linkage to process documentation
• Medium term – Process monitoring
  – Online capture of significant dates
    • Coursework hand-in date
    • Date exam sent to moderator
    • Date coursework returned to students
  – Derived information
    • Workload prediction based on coursework schedule and student numbers
    • Display of latest coursework returned and SMS message to students
• Long term – Process management
  – Workflow
  – Process enactment software
Short-term
• Session-based logins to personalise the interface and specify parameters (currentYear)
• Form generation as passive documents
  – Update through the form an obvious extension
• Extend operational data with date-based status
  – Date-returned-to-students
    • If set (work has been returned)
      – Date used to generate page of coursework recently returned
      – Date used to monitor conformance to target return date(!)
• Link Forms to textual/graphical process description
  – Coursework from setting to field board
  – How to specialise a generic description?
    • By level
    • By module
    • By field
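Monitoring conformance to the target return date comes down to a date comparison. The Python sketch below assumes the target is 20 working days after the due date, the figure passed to fold:workingDays in the courseworkElement function, and it counts Monday to Friday only. Ignoring holidays and the UWE calendar is a simplification of this sketch.

```python
from datetime import date, timedelta


def add_working_days(start, days):
    """Return the date reached by stepping forward the given number of weekdays."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


def returned_on_time(due, returned, target_days=20):
    """True if the work was returned on or before the target return date."""
    return returned <= add_working_days(due, target_days)


due = date(2006, 2, 6)  # hypothetical due date
print(add_working_days(due, 20))
print(returned_on_time(due, date(2006, 3, 7)))
```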
Responsibilities
• Responsibility allocation
  – Admin / architect decision
  – Physical level design for responsibility
    • All Module Runs in a Field in one document
    • Modules and Programme Structures in Field Collections (within Year)
  – Group access rights
    • For IS Field – ISAdmin
      – Anne Moggridge
      – Peter Rawlings
      – Lilly Cooke
      – Tracey Davis
• Need for check-in check-out of documents
  – WebDAV (Web Folders)
Conclusion
• Slide from prototype to production
• Pluses and Minuses of user enthusiasm
• Go for ‘low-hanging fruit’
• Pay attention to the learning process
  – XQuery, XSLT are non-trivial languages because deeply unlike Java/PHP
• Reflection forced by presentations and workshops