from data points to data lakes
TRANSCRIPT
FROM DATA POINTS TO DATA LAKES
OPEN DATA
DR J ROGEL-SALAZAR, IMPERIAL COLLEGE LONDON AND UNIVERSITY OF HERTFORDSHIRE
j.rogel@physics.org
@quantum_tunnel / @hidden_node
SOCIAL MEDIA: USE #dialogo_opendata
DATA? LET’S START AT THE BEGINNING
DATA
INTERCONNECTED KNOWLEDGE
KNOWLEDGE
LINKED INFORMATION
INFORMATION
STRUCTURED DATA
DATA EVERYWHERE!
• Lots of data is being collected and warehoused
• Scientific studies
• Web data, e-commerce
• Purchases at department/grocery stores
• Bank/Credit card transactions
• Social network
HOW MUCH DATA?
• Google processes 100 PB a day (2014)
• Facebook 600 TB/day (2014)
• Twitter 100 TB/day (2013/14)
• CERN’s Large Hadron Collider (LHC) generates 15 PB a year
640K ought to be enough for anybody.
Source: https://followthedata.wordpress.com/2014/06/24/data-size-estimates/
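To put the volumes quoted above on a common scale, the sketch below converts them to average rates. It assumes decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes), which the slide does not specify, and the helper names `bytes_per_second` and `days_to_match` are purely illustrative.

```python
# Rough unit conversions for the quoted volumes (decimal units are an assumption).
TB = 10**12  # bytes in a terabyte
PB = 10**15  # bytes in a petabyte

SECONDS_PER_DAY = 24 * 60 * 60

# Daily volumes as quoted on the slide.
daily_volumes = {
    "Google": 100 * PB,
    "Facebook": 600 * TB,
    "Twitter": 100 * TB,
}


def bytes_per_second(bytes_per_day: int) -> float:
    """Average sustained rate implied by a daily volume."""
    return bytes_per_day / SECONDS_PER_DAY


def days_to_match(target_bytes: int, bytes_per_day: int) -> float:
    """Days needed at a given daily volume to reach a target size."""
    return target_bytes / bytes_per_day


for name, volume in daily_volumes.items():
    rate_gb = bytes_per_second(volume) / 10**9
    print(f"{name}: about {rate_gb:.1f} GB/s on average")

lhc_per_year = 15 * PB
hours = days_to_match(lhc_per_year, daily_volumes["Google"]) * 24
print(f"LHC yearly output vs Google's daily volume: about {hours:.1f} hours")
```

On these figures, Google's quoted daily volume is roughly 170 times Facebook's, and a year of LHC output corresponds to a few hours of Google's daily processing.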
Maximilien Brice, © CERN
THE EARTHSCOPE
• The Earthscope is also a large science project. Designed to track North America's geological evolution, this observatory records data over 3.8 million square miles, amassing 67 terabytes of data. It analyses seismic slips in the San Andreas fault, sure, but also the plume of magma underneath Yellowstone and much, much more.
Source: http://www.msnbc.msn.com/id/44363598/ns/technology_and_science-future_of_technology/#.TmetOdQ--uI
TYPES OF DATA
• Relational Data (Tables/Transaction/Legacy Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data: Social Network, Semantic Web (RDF), …
• Streaming Data: you can only scan the data once
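The single-scan constraint on streaming data shapes the algorithms you can use. As a minimal sketch, the example below keeps a running mean and variance with Welford's online method, so each value is seen exactly once and never stored. The `sensor_readings` generator and its values are hypothetical stand-ins for a real stream.

```python
from typing import Iterable, Iterator, Tuple


def running_stats(stream: Iterable[float]) -> Tuple[int, float, float]:
    """Return (count, mean, sample variance) after one pass over the stream."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, variance


def sensor_readings() -> Iterator[float]:
    """Hypothetical stream: a generator can be consumed only once."""
    for value in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):
        yield value


count, mean, variance = running_stats(sensor_readings())
print(count, mean, variance)
```

Memory use stays constant however long the stream runs, which is the point of a one-pass method.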
WHAT TO DO WITH THESE DATA?
• Aggregation and Statistics
• Data warehouse and OLAP
• Indexing, Searching, and Querying
• Keyword based search
• Pattern matching (XML/RDF)
• Knowledge discovery
• Data Mining
• Statistical Modeling
THE DATA
• Fundamental to research
• Basis for writing papers
• Important for experiment replication
• Meet contractual/funding requirements
• Settle intellectual property claims
• Defense against a charge of fraud
Images from the front covers of Circulation Research – S. Elliott (Van Eyk Lab)
INDIVIDUAL RESPONSIBILITY DATA MANAGEMENT
Some aspects to consider:
• Ownership
• Collection
• Storage/protection of confidentiality/sharing
• Interpretation and publication
WHAT IS COPYRIGHT?
“To promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries.”
- US CONSTITUTION
NOT A TOOL TO CONTROL ALL CONTENT FOREVER IN ALL MEDIA
A SET OF RIGHTS
• The right to reproduce the work
• The right to prepare derivative works
• The right to distribute the work
• The right to perform the work
• The right to display the work
• The right to license any of the above to third parties
HOW?
First, it must meet some basic requirements:
• It must be original.
• It must have some level of creativity.
• It must be in a fixed medium.
In the old days, you would use this symbol: ©
Provide a date and register it.
NOWADAYS
IT’S INSTANT!
Copyright protects…
• Writing
• Music
• Film
• Choreography
• Visual art
• Architectural works
Copyright doesn’t protect…
• Ideas
• Data (mostly)
• Facts
• Useful articles (that’s patent)
HOW LONG DOES IT LAST?
FOR NOW…
The life of the author plus 70 years
And then?
THE PUBLIC DOMAIN
GENERAL RULES FOR STATUS
Works No Longer Protected by Copyright
• Published before 1923
• Published between 1923 and 1963, if the copyright was not renewed
• Authored by the Federal Government (US)
VERBOSE MODE…
• All works published in the United States before 1923 are in the public domain.
• Works published after 1922 but before 1978 are protected for 95 years from the date of publication. If the work was created, but not published, before 1978, the copyright lasts for the life of the author plus 70 years. However, even if the author died over 70 years ago, the copyright in an unpublished work lasts until at least December 31, 2002.
• For works published after 1977, the copyright lasts for the life of the author plus 70 years. However, if the work is a work for hire (that is, the work is done in the course of employment or has been specifically commissioned) or is published anonymously or under a pseudonym, the copyright lasts between 95 and 120 years, depending on the date the work is published.
• Lastly, if the work was published between 1923 and 1963, you must check with the U.S. Copyright Office to see whether the copyright was properly renewed. If the author failed to renew the copyright, the work has fallen into the public domain and you may use it.
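The rules above amount to a small decision procedure. The sketch below encodes them as stated on the slide, for published works only. It is a rough illustration, not a substitute for checking with the U.S. Copyright Office: the function name `us_copyright_status` and its parameters are hypothetical, and works for hire, anonymous works and unpublished works are deliberately left out.

```python
from typing import Optional


def us_copyright_status(
    publication_year: int,
    current_year: int,
    renewed: Optional[bool] = None,
    author_death_year: Optional[int] = None,
) -> str:
    """Apply the slide's rules to a work published in the United States."""
    # Published before 1923: public domain.
    if publication_year < 1923:
        return "public domain"
    # Published 1923-1963: status depends on whether the copyright was renewed.
    if publication_year <= 1963:
        if renewed is None:
            return "check renewal with the U.S. Copyright Office"
        if not renewed:
            return "public domain"
    # Published after 1922 but before 1978: 95 years from publication.
    if publication_year <= 1977:
        if current_year <= publication_year + 95:
            return "protected"
        return "public domain"
    # Published after 1977: life of the author plus 70 years.
    if author_death_year is None:
        return "unknown: author's death year needed"
    if current_year <= author_death_year + 70:
        return "protected"
    return "public domain"


print(us_copyright_status(1910, 2015))
print(us_copyright_status(1950, 2015))
print(us_copyright_status(1950, 2015, renewed=False))
print(us_copyright_status(1985, 2015, author_death_year=1990))
```

Even in this simplified form, three of the four branches need information beyond the publication date, which is part of why status is hard to determine.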
CONFUSED?
Hard to share
WHY SHARE?
ALL OF THEM CAN…
AND SHOULD BE SHARED!
ALL OF THEM WERE BUILT UPON OTHER PEOPLE’S WORK
WHY?
TERENCE TAO BLOG: https://terrytao.wordpress.com
KAGGLE: https://www.kaggle.com/competitions
A SPECTRUM OF RIGHTS
• Public Domain: least restrictive
• All Rights Reserved: most restrictive
SO…
WHAT IS OPEN DATA?
Open data is information that is available for anyone to use, for any purpose, at no cost.
GOOD OPEN DATA
• Can be linked: shared more easily
• Available in a standard format: easily processed
• Guaranteed availability and consistency: easily relied upon
• Traceable: easily trusted
FROM OPEN ACCESS TO OPEN DATA
OVERLEAF: https://www.overleaf.com
DRYAD: http://datadryad.org
DATAVERSE: https://dataverse.harvard.edu
DATA GOV UK: http://data.gov.uk
DATA GOV US: http://www.data.gov
DATOS GOB MEX: http://datos.gob.mx
ANY BIG INSTITUTION COULD PUBLISH OPEN DATA
ANYONE ELSE?
THE GUARDIAN: http://www.theguardian.com/news/datablog/
OPEN DATA 500: http://www.opendata500.com
FIGSHARE: http://figshare.com
UCI MACHINE LEARNING: http://archive.ics.uci.edu/ml/
TRANSPORT FOR LONDON: https://tfl.gov.uk/info-for/open-data-users/
WHAT CAN WE DO WITH IT?
WHERE TO FIND OPEN DATA?
DATAHUB: http://datahub.io
FIGSHARE: http://figshare.com
REGISTER OF DATA REPOS: http://www.re3data.org
DATABIB: http://databib.org
DATACITE: https://www.datacite.org
OPENDOAR: http://www.opendoar.org
CKAN: http://ckan.org
GITHUB: https://github.com
GITHUB - DATA DRYAD: https://github.com/datadryad
SPIRAL - IMPERIAL COLLEGE: https://spiral.imperial.ac.uk
DSPACE: http://www.dspace.org
SOUNDS GOOD… NOW WHAT?
SIMPLE GUIDELINES
3 THINGS
• Keep it simple
• Engage early and often
• Address common fears and misunderstandings
4 STEPS
• Choose your dataset(s)
• Licensing
• Make the data available
• Make it discoverable
DATASETS
• Asking the community
• Cost basis
• Ease of release
• Observe peers
LICENSING
Data that doesn’t explicitly have an open license is NOT open data
OWNERSHIP
COPYRIGHT OVER WORKS YOU CREATE THAT ARE ORIGINAL TO YOU.
DATABASE RIGHT OVER COLLECTIONS OF DATA YOU HAVE PUT A SUBSTANTIAL EFFORT INTO OBTAINING, VERIFYING OR PRESENTING (ONLY EU, MEXICO, BRAZIL)
CREATIVE COMMONS LICENSING
KNOW THE TYPES!!
OPEN DATA COMMONS: http://opendatacommons.org
AVAILABILITY
• Data should be complete
• In an open, machine-readable format
• It should contain metadata
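One way to meet the three availability points above is to publish the data as CSV with a small JSON metadata file alongside it. The sketch below is only one possible layout: the function `publish_dataset`, the file names, the sample rows and every metadata field shown are hypothetical, and the licence value is a placeholder you would replace with your own choice.

```python
import csv
import json
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple


def publish_dataset(
    rows: List[Dict[str, object]],
    fieldnames: List[str],
    metadata: Dict[str, object],
    directory: str,
) -> Tuple[Path, Path]:
    """Write rows to data.csv and a metadata sidecar to data.metadata.json."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Open, machine-readable format: plain CSV with a header row.
    data_path = out_dir / "data.csv"
    with data_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    # Metadata: record columns and row count so users can check completeness.
    meta_path = out_dir / "data.metadata.json"
    full_metadata = dict(metadata, columns=fieldnames, row_count=len(rows))
    meta_path.write_text(json.dumps(full_metadata, indent=2), encoding="utf-8")
    return data_path, meta_path


# Hypothetical example data and metadata.
rows = [
    {"station": "A", "reading": 1.5},
    {"station": "B", "reading": 2.25},
]
metadata = {
    "title": "Example readings",
    "license": "ODC-BY-1.0",  # placeholder licence identifier
    "source": "hypothetical sensor network",
}

with tempfile.TemporaryDirectory() as tmp:
    data_path, meta_path = publish_dataset(rows, ["station", "reading"], metadata, tmp)
    print(data_path.read_text(encoding="utf-8"))
    print(meta_path.read_text(encoding="utf-8"))
```

Recording the row count in the metadata gives users a quick check that the file they downloaded is complete.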
HOW?
• Your website
• Existing repositories
• Creating your own repository
MAKE IT DISCOVERABLE
• Publish it in public services (Datahub)
• Index it in a catalog (Databib)
• Promote it in your community
• Engage with users
DATA SCIENCE AND DATA LAKES
WHAT IS DATA SCIENCE?
A set of tools and techniques used to extract useful information from data.
An interdisciplinary, problem-oriented subject.
THE INGREDIENTS OF A DATA SCIENTIST
ONE MORE THING!
COMMUNICATION SKILLS
DATA LAKE: WHAT IS IT?
Analytics
DW
Hadoop
YOUR THOUGHTS?
THANKS!
DR J ROGEL-SALAZAR
j.rogel@physics.org
@quantum_tunnel / @hidden_node