version 17svoboda/courses/201-b4m36ds... · 2020. 9. 19. · lecture 2 - data formats \(5. 10....

51
B4M36DS2, BE4M36DS2: Database Systems 2 hƩp://www.ksi.mff.cuni.cz/~svoboda/courses/201-B4M36DS2/ Lecture 2 Data Formats MarƟn Svoboda marƟ[email protected] 5. 10. 2020 Charles University, Faculty of MathemaƟcs and Physics Czech Technical University in Prague, Faculty of Electrical Engineering

Upload: others

Post on 22-Jan-2021

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

B4M36DS2, BE4M36DS2: Database Systems 2h p://www.ksi.mff.cuni.cz/~svoboda/courses/201-B4M36DS2/

Lecture 2

Data FormatsMar n Svobodamar [email protected]

5. 10. 2020

Charles University, Faculty of Mathema cs and PhysicsCzech Technical University in Prague, Faculty of Electrical Engineering

Page 2: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

Lecture OutlineData formats

• XML – Extensible Markup Language• JSON – JavaScript Object Nota on• BSON – Binary JSON• RDF – Resource Descrip on Framework• CSV – Comma-Separated Values

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 2

Page 3: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

XMLExtensible Markup Language

Page 4: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

Introduc onXML = Extensible Markup Language

• Representa on and interchange of semi-structured data+ a family of related technologies, languages, specifica ons, …

• Derived from SGML, developed byW3C, started in 1996• Design goals

Simplicity, generality and usability across the Internet• File extension: *.xml, content type: text/xml• Versions: 1.0 and 1.1• W3C recommenda on

h p://www.w3.org/TR/xml11/

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 4

Page 5: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

Example

<?xml version="1.1" encoding="UTF-8"?><movie year="2007">

<title>Medvídek</title><actors>

<actor><firstname>Jiří</firstname><lastname>Macháček</lastname>

</actor><actor>

<firstname>Ivan</firstname><lastname>Trojan</lastname>

</actor></actors><director>

<firstname>Jan</firstname><lastname>Hřebejk</lastname>

</director></movie>

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 5

Page 6: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

Document StructureDocument

• Prolog: XML version + some other stuff• Exactly one root element

Contains other nested elements and/or other content

<?xml<?xml versionversion == "" versionversion "" ...... ?>?> ......

elementelement

Example<?xml version="1.1" encoding="UTF-8"?><movie>

...</movie>

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 6

Page 7: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

ConstructsElement

• Marked using opening and closing tags… or just an abbreviated tag in case of empty elements

• Each element can be associated with a set of a ributes

<< namename

attributeattribute

>> element contentelement content << // namename >>

<< namename

attributeattribute

// >>

Examples<title>...</title><actors/>

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 7

Page 8: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

ConstructsTypes of element content

• Empty content• Text content• Element content

Sequence of nested elements• Mixed content

Elements arbitrarily interleaved with text values

texttext

elementelement

elementelement

texttext

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 8

Page 9: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

ConstructsA ribute

• Name-value pair

namename == "" valuevalue ""

Escaping sequences (predefined en es)• Used within values of a ributes or text content of elements• E.g.:

&lt; for <&gt; for >&quot; for "…

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 9

Page 10: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

XML Conclusion

XML constructs• Basic: element, a ribute, text• Addi onal: comment, processing instruc on, …

Schema languages• DTD, XSD (XML Schema), RELAX NG, Schematron

Query languages• XPath, XQuery, XSLT

XML formats = par cular languages• XSD, XSLT, XHTML, DocBook, ePUB, SVG, RSS, SOAP, …

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 10

Page 11: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

JSONJavaScript Object Nota on

Page 12: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

Introduc onJSON = JavaScript Object Nota on

• Open standard for data interchange• Design goals

Simplicity: text-based, easy to read and writeUniversality: object and array data structures

– Supported by majority of modern programming languages– Based conven ons of the C-family of languages

(C, C++, C#, Java, JavaScript, Perl, Python, …)

• Derived from JavaScript (but language independent)• Started in 2002• File extension: *.json• Content type: application/json• h p://www.json.org/

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 12

Page 13: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

Example

{"title" : "Medvídek","year" : 2007,"actors" : [

{"firstname" : "Jiří","lastname" : "Macháček"

},{

"firstname" : "Ivan","lastname" : "Trojan"

}],"director" : {

"firstname" : "Jan","lastname" : "Hřebejk"

}}

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 13

Page 14: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

Data StructureObject

• Unordered collec on of name-value pairs (proper es)Correspond to structures such as objects, records, structs,dic onaries, hash tables, keyed lists, associa ve arrays, …

• Values can be of different types, names should be unique

{{

stringstring :: valuevalue

,,

}}

Examples• { "name" : "Ivan Trojan", "year" : 1964 }• { }

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 14

Page 15: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

Data StructureArray

• Ordered collec on of valuesCorrespond to structures such as arrays, vectors, lists,sequences, …

• Values can be of different types, duplicate values are allowed[[

valuevalue

,,

]]

Examples• [ 2, 7, 7, 5 ]• [ "Ivan Trojan", 1964, -5.6 ]• [ ]

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 15

Page 16: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

Data StructureValue

• Unicode stringEnclosed with double quotesBackslash escaping sequencesExample: "a \n b \" c \\ d"

• NumberDecimal integers or floatsExamples: 1, -0.5, 1.5e3

• Nested object• Nested array• Boolean value: true, false• Missing informa on: null

stringstring

numbernumber

objectobject

arrayarray

truetrue

falsefalse

nullnull

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 16

Page 17: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

JSON Conclusion

JSON constructs• Collec ons: object, array• Scalar values: string, number, boolean, null

Schema languages• JSON Schema

Query languages• JSONiq, JMESPath, JAQL, …

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 17

Page 18: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

BSONBinary JSON

Page 19: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

Introduc onBSON = Binary JSON

• Binary-encoded serializa on of JSON documentsExtends the set of basic data types of values offered by JSON(such as a string, …) with a few new specific ones

• Design characteris cs: lightweight, traversable, efficient• Used byMongoDB

Document NoSQL database for JSON documentsData storage and network transfer format

• File extension: *.bson• h p://bsonspec.org/

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 19

Page 20: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

ExampleJSON

{ "title" : "Medvídek", "year" : 2007 }

BSON24 00 00 00 02 74 69 74 6C 65 00 0A 00 00 00 4D 65 64 76 C3 AD 64 65 6B00 10 79 65 61 72 00 D7 07 00 00 00

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 20

Page 21: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

ExampleJSON

{"title" : "Medvídek","year" : 2007

}

BSON24 00 00 0002 74 69 74 6C 65 00 0A 00 00 00 4D 65 64 76 C3 AD 64 65 6B 0010 79 65 61 72 00 D7 07 00 0000

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 21

Page 22: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

Document StructureDocument = serializa on of one JSON object or array

• JSON object is serialized directly• JSON array is first transformed to a JSON object

Property names derived from numbers of posi onsE.g.:[ "Trojan", "Svěrák" ] →{ "0" : "Trojan", "1" : "Svěrák" }

• StructureDocument size (total number of bytes)Sequence of elements (encoded JSON proper es)Termina ng hexadecimal 00 byte

int32int32 elementelement 0000

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 22

Page 23: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

Document StructureElement = serializa on of one JSON property

......

0202 namename int32int32 bytebyte 0000

0101 namename doubledouble

1010 namename int32int32

1212 namename int64int64

0303 namename documentdocument

0404 namename documentdocument

0808 namename 0000

0101

0A0A namename

0909 namename int64int64

1111 namename int64int64

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 23

Page 24: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

Document StructureElement = serializa on of one JSON property

• StructureType selector

– 02 (string)– 01 (double), 10 (32-bit integer), 12 (64-bit integer)– 03 (object), 04 (array)– 08 (boolean)– 0A (null)– 09 (date me), 11 ( mestamp)– …

Property name– Unicode string terminated by 00

bytebyte 0000

Property value

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 24

Page 25: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

RDFResource Descrip on Framework

Page 26: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

Introduc onRDF = Resource Descrip on Framework

• Language for represen ng informa on about resourcesin the World Wide Web

+ a family of related technologies, languages, specifica ons, …Used in the context of the Seman c Web, Linked Data, …

• Developed byW3C• Started in 1997• Versions: 1.0 and 1.1• W3C recommenda ons

h ps://www.w3.org/TR/rdf11-concepts/– Concepts and Abstract Syntax

h ps://www.w3.org/TR/rdf11-mt/– Seman cs

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 26

Page 27: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

StatementsResource

• Any real-world en tyReferents = resources iden fied by IRI

– E.g. physical things, documents, abstract concepts, …Values = resources for literals

– E.g. numbers, strings, …

Statement about resources = one RDF triple• Three components: subject, predicate, and object

Examples<http://db.cz/movies/medvidek><http://db.cz/terms#actor><http://db.cz/actors/trojan> .

<http://db.cz/movies/medvidek><http://db.cz/terms#year>"2007" .

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 27

Page 28: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

StatementsTriple components

• SubjectDescribes a resource the given statement is aboutIRI or blank node iden fier

• PredicateDescribes the property or characteris c of the subjectIRI

• ObjectDescribes the value of that propertyIRI or blank node iden fier or literal

Although triples are inspired by natural languages, they havenothing to do with processing of natural languages

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 28

Page 29: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

Example

<http://db.cz/movies/medvidek><http://db.cz/terms#actor> <http://db.cz/actors/machacek> .

<http://db.cz/movies/medvidek><http://db.cz/terms#actor> <http://db.cz/actors/trojan> .

<http://db.cz/movies/medvidek><http://db.cz/terms#year> "2007" .

<http://db.cz/movies/medvidek><http://db.cz/terms#director> _:n18 .

_:n18<http://db.cz/terms#firstname> "Jan" .

_:n18<http://db.cz/terms#lastname> "Hřebejk" .

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 29

Page 30: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

Iden fiers and LiteralsIRI = Interna onalized Resource Iden fier

• Absolute (not rela ve) IRIs with op onal fragment iden fiers• RFC 3987• Unicode characters• Examples

http://db.cz/movies/medvidekhttp://db.cz/terms#actormailto:[email protected]:issn:0167-6423

• URLs are o en used in prac ce→ informa on about givenresources are then intended to be published / retrieved viastandard HTTP

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 30

Page 31: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

Iden fiers and LiteralsLiterals

• Plain valuesE.g.: "Medvídek", "2007"

• Typed valuesE.g.: "Medvídek"^^xs:string, "2007"^^xs:integerXML Schema simple data types are adopted and used

• Strings with language tagsE.g.: "Medvídek"@cs

• Types and language tags cannot be mutually combined

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 31

Page 32: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

Iden fiers and LiteralsBlank node iden fiers

• Blank nodes (anonymous resources)Allow to express statements about resources without explicitlynaming (iden fying) them

• Blank node iden fiers only have local scope of validityE.g. within a given file, query expression, …

• Par cular syntax depends on a serializa on formatE.g.: _:node18

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 32

Page 33: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

Data ModelDirected labeled mul graph

• Ver cesOne vertex for each IRI or literal value

• EdgesOne edge for each individual tripleEdges are directed subject predicate−−−−−→ objectProperty names (predicate IRIs) are used as edge labels

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 33

Page 34: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

Example

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 34

Page 35: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

Serializa onAvailable approaches

• N-Triples nota onh ps://www.w3.org/TR/n-triples/

• Turtle nota on (Terse RDF Triple Language)h ps://www.w3.org/TR/turtle/

• RDF/XML nota onXML syntax for RDFh ps://www.w3.org/TR/rdf-syntax-grammar/

• JSON-LD nota onJSON-based serializa on for Linked Datah ps://www.w3.org/TR/json-ld/

• …

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 35

Page 36: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

N-Triples Nota onRDF N-Triples nota on = A line-based syntax for an RDF graph

• Simple, line-based, plain text format• File extension: *.rdf• h ps://www.w3.org/TR/n-triples/

Example• Already presented…

Document• Statements are terminated by dots, delimited by EOL

tripletriple

0D 0A0D 0A tripletriple

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 36

Page 37: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

N-Triples Nota onStatement

• Individual triple components are delimited by spaces

subjectsubject predicatepredicate objectobject ..

Triple components: subject, predicate, object

IRI referenceIRI reference

blank node idblank node id

IRI referenceIRI reference IRI referenceIRI reference

blank node idblank node id

literalliteral

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 37

Page 38: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

N-Triples Nota onIRI reference

• IRIs are enclosed in angle brackets<< IRIIRI >>

Blank node iden fier

__ :: labellabel

Literal• Literals are enclosed in double quotes

"" valuevalue ""

^^^^ IRI referenceIRI reference

@@ language taglanguage tag

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 38

Page 39: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

Turtle Nota onTurtle = Terse RDF Triple Language

• Compact text format,various abbrevia ons for common usage pa erns

• File extension: *. l• Content type: text/turtle• h ps://www.w3.org/TR/turtle/

Example@prefix i: <http://db.cz/terms#> .@prefix m: <http://db.cz/movies/> .@prefix a: <http://db.cz/actors/> .m:medvidek

i:actor a:machacek , a:trojan ;i:year "2007" ;i:director [ i:firstname "Jan" ; i:lastname "Hřebejk" ] .

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 39

Page 40: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

Turtle Nota onDocument

• Contains a sequence of triples and/or declara ons• Prefix declara ons

Prefixed names can then be used instead of full IRI references• Groups of triples

Individual groups are terminated by dots

@prefix@prefix prefixprefix :: IRI referenceIRI reference ..

triplestriples

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 40

Page 41: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

Turtle Nota onTriples

• Triples sharing the same subject and object or at leastthe same subject can be grouped together

object list for a shared subject and predicatepredicate-object list for a shared subject

• Brackets can be used to define blank nodessubjectsubject predicatepredicate objectobject

,,

;;

[[ predicatepredicate objectobject

,,

;;

]]

predicatepredicate objectobject

,,

;;

..

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 41

Page 42: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

Turtle Nota onTriple components: subject, predicate, object

IRI referenceIRI reference

prefixed nameprefixed name

blank node idblank node id

IRI referenceIRI reference

prefixed nameprefixed name

IRI referenceIRI reference

prefixed nameprefixed name

blank node idblank node id

literalliteral

[[ predicate-object listpredicate-object list ]]

IRI reference / prefixed name

<< IRIIRI >> prefixprefix :: local namelocal name

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 42

Page 43: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

Turtle Nota onLiteral

• Tradi onal literals+ new abbreviated forms of numeric and boolean literals

"" valuevalue ""

^^^^ IRI referenceIRI reference

prefixed nameprefixed name

@@ language taglanguage tag

truetrue

falsefalse

integer / decimal / doubleinteger / decimal / double

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 43

Page 44: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

ExampleExample revisited

@prefix i: <http://db.cz/terms#> .@prefix m: <http://db.cz/movies/> .@prefix a: <http://db.cz/actors/> .m:medvidek

i:actor a:machacek , a:trojan ;i:year "2007" ;i:director [ i:firstname "Jan" ; i:lastname "Hřebejk" ] .

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 44

Page 45: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

RDF Conclusion

RDF statements• Subject, predicate, and object components

Schema languages• RDFS (RDF Schema)• OWL (Web Ontology Language)

Query languages• SPARQL (SPARQL Protocol and RDF Query Language)

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 45

Page 46: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

CSVComma-Separated Values

Page 47: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

Introduc onCSV = Comma-Separated Values

• Unfortunately not fully standardizedDifferent field separators (commas, semicolons)Different escaping sequencesNo encoding informa on

• RFC 4180, RFC 7111• File extension: *.csv• Content type: text/csv

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 47

Page 48: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

Example

firstname,lastname,yearIvan,Trojan,1964Jiří,Macháček,1966Jitka,Schneiderová,1973Zdeněk,Svěrák,1936Anna,Geislerová,1976

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 48

Page 49: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

Document StructureDocument

• Op onal header + list of records

namename ,, namename 0D 0A0D 0A

recordrecord 0D 0A0D 0A recordrecord

Record• Comma separated list of fields

valuevalue ,, valuevalue

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 49

Page 50: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical
Page 51: Version 17svoboda/courses/201-B4M36DS... · 2020. 9. 19. · Lecture 2 - Data Formats \(5. 10. 2020\) - Version 17 Author: Martin Svoboda - martin.svoboda@fel.cvut.cz - Czech Technical

Lecture ConclusionData formats

• Tree: XML, JSON• Graph: RDF• Rela onal: CSV

Binary serializa ons• BSON

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 5. 10. 2020 51