demystifying data science - vvtesh.co.in · demystifying data science venkatesh vinayakarao...
TRANSCRIPT
![Page 1: Demystifying Data Science - vvtesh.co.in · Demystifying Data Science Venkatesh Vinayakarao venkateshv@cmi.ac.in ... Big Data is Ubiquitous •Facebook Statistics •1.5 billion people](https://reader030.vdocuments.mx/reader030/viewer/2022040608/5ec45c12e8f156666a719ba7/html5/thumbnails/1.jpg)
Venkatesh Vinayakarao (Vv)
Demystifying
Data Science
Venkatesh Vinayakarao venkateshv@cmi.ac.in
http://vvtesh.co.in
SSN School of Advanced Career Education
The world's most valuable resource is no longer oil, but data. – Economist Report, 2017.
![Page 2: Demystifying Data Science - vvtesh.co.in · Demystifying Data Science Venkatesh Vinayakarao venkateshv@cmi.ac.in ... Big Data is Ubiquitous •Facebook Statistics •1.5 billion people](https://reader030.vdocuments.mx/reader030/viewer/2022040608/5ec45c12e8f156666a719ba7/html5/thumbnails/2.jpg)
What Comes Next?
byte
kilobyte
megabyte
gigabyte
??
???
????
?????
![Page 3: Demystifying Data Science - vvtesh.co.in · Demystifying Data Science Venkatesh Vinayakarao venkateshv@cmi.ac.in ... Big Data is Ubiquitous •Facebook Statistics •1.5 billion people](https://reader030.vdocuments.mx/reader030/viewer/2022040608/5ec45c12e8f156666a719ba7/html5/thumbnails/3.jpg)
Sizes
| Name | Size |
| --- | --- |
| Byte | 8 bits |
| Kilobyte | 1024 bytes |
| Megabyte | 1024 kilobytes |
| Gigabyte | 1024 megabytes |
| Terabyte | 1024 gigabytes |
| Petabyte | 1024 terabytes |
| Exabyte | 1024 petabytes |
| Zettabyte | 1024 exabytes |
| Yottabyte | 1024 zettabytes |
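Because each unit in the table is 1024 times the previous one, the size of any unit in bytes is a power of 1024. The sketch below illustrates that arithmetic; the names `UNITS`, `unit_size_in_bytes` and `human_readable` are made up for this example and are not from the slides.

```python
# Unit names from the table, smallest to largest.
UNITS = [
    "byte", "kilobyte", "megabyte", "gigabyte", "terabyte",
    "petabyte", "exabyte", "zettabyte", "yottabyte",
]


def unit_size_in_bytes(name: str) -> int:
    """Return how many bytes one unit holds, using the 1024-based table."""
    return 1024 ** UNITS.index(name)


def human_readable(num_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps the value below 1024."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit that fits, or at the largest unit we know.
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}s"
        value /= 1024


print(unit_size_in_bytes("gigabyte"))  # 1073741824
print(human_readable(1536))            # 1.50 kilobytes
```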
![Page 4: Demystifying Data Science - vvtesh.co.in · Demystifying Data Science Venkatesh Vinayakarao venkateshv@cmi.ac.in ... Big Data is Ubiquitous •Facebook Statistics •1.5 billion people](https://reader030.vdocuments.mx/reader030/viewer/2022040608/5ec45c12e8f156666a719ba7/html5/thumbnails/4.jpg)
Big Data is Ubiquitous
• Facebook Statistics
  • 1.5 billion people are active on Facebook daily!
  • Every minute, 510,000 comments are posted and 293,000 statuses are updated!
  • More than 300 million photos are uploaded per day!
  • In total, more than 2.5 trillion posts!
Source: Forbes
![Page 5: Demystifying Data Science - vvtesh.co.in · Demystifying Data Science Venkatesh Vinayakarao venkateshv@cmi.ac.in ... Big Data is Ubiquitous •Facebook Statistics •1.5 billion people](https://reader030.vdocuments.mx/reader030/viewer/2022040608/5ec45c12e8f156666a719ba7/html5/thumbnails/5.jpg)
Source: https://www.visualcapitalist.com/big-data-keeps-getting-bigger/
![Page 6: Demystifying Data Science - vvtesh.co.in · Demystifying Data Science Venkatesh Vinayakarao venkateshv@cmi.ac.in ... Big Data is Ubiquitous •Facebook Statistics •1.5 billion people](https://reader030.vdocuments.mx/reader030/viewer/2022040608/5ec45c12e8f156666a719ba7/html5/thumbnails/6.jpg)
And, It is Growing!
![Page 7: Demystifying Data Science - vvtesh.co.in · Demystifying Data Science Venkatesh Vinayakarao venkateshv@cmi.ac.in ... Big Data is Ubiquitous •Facebook Statistics •1.5 billion people](https://reader030.vdocuments.mx/reader030/viewer/2022040608/5ec45c12e8f156666a719ba7/html5/thumbnails/7.jpg)
Data Growth
Mankind’s quest to digitize the world!
33 ZB (2018) → 175 ZB (2025): size of global datasphere*
*Source: https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf
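To get a feel for what growing from 33 ZB to 175 ZB means, one can work out the yearly growth rate those two figures imply. This is only a back-of-the-envelope sketch: it assumes smooth compound growth between 2018 and 2025, which the source does not claim, and the function name is made up for this example.

```python
def compound_annual_growth_rate(start: float, end: float, years: int) -> float:
    """Yearly growth rate that turns `start` into `end` after `years` years."""
    return (end / start) ** (1 / years) - 1


# Figures from the slide: 33 ZB in 2018, 175 ZB projected for 2025.
rate = compound_annual_growth_rate(33, 175, 2025 - 2018)
print(f"Implied growth: {rate:.1%} per year")
```

Under that assumption the datasphere grows by roughly a quarter every year.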
![Page 8: Demystifying Data Science - vvtesh.co.in · Demystifying Data Science Venkatesh Vinayakarao venkateshv@cmi.ac.in ... Big Data is Ubiquitous •Facebook Statistics •1.5 billion people](https://reader030.vdocuments.mx/reader030/viewer/2022040608/5ec45c12e8f156666a719ba7/html5/thumbnails/8.jpg)
Solitary Confinement is Cruel
![Page 9: Demystifying Data Science - vvtesh.co.in · Demystifying Data Science Venkatesh Vinayakarao venkateshv@cmi.ac.in ... Big Data is Ubiquitous •Facebook Statistics •1.5 billion people](https://reader030.vdocuments.mx/reader030/viewer/2022040608/5ec45c12e8f156666a719ba7/html5/thumbnails/9.jpg)
World needs data scientists!
![Page 10: Demystifying Data Science - vvtesh.co.in · Demystifying Data Science Venkatesh Vinayakarao venkateshv@cmi.ac.in ... Big Data is Ubiquitous •Facebook Statistics •1.5 billion people](https://reader030.vdocuments.mx/reader030/viewer/2022040608/5ec45c12e8f156666a719ba7/html5/thumbnails/10.jpg)
Data Science
Loads of (structured and unstructured) data available.
Need scientifically sound methods to capture, maintain, process,
communicate and analyze data.
![Page 11: Demystifying Data Science - vvtesh.co.in · Demystifying Data Science Venkatesh Vinayakarao venkateshv@cmi.ac.in ... Big Data is Ubiquitous •Facebook Statistics •1.5 billion people](https://reader030.vdocuments.mx/reader030/viewer/2022040608/5ec45c12e8f156666a719ba7/html5/thumbnails/11.jpg)
Modern Text Processing
Vector Space Model
![Page 12: Demystifying Data Science - vvtesh.co.in · Demystifying Data Science Venkatesh Vinayakarao venkateshv@cmi.ac.in ... Big Data is Ubiquitous •Facebook Statistics •1.5 billion people](https://reader030.vdocuments.mx/reader030/viewer/2022040608/5ec45c12e8f156666a719ba7/html5/thumbnails/12.jpg)
Which Document to Retrieve?
Indexed Content:
d1: “SSN Chennai”
d2: “XYZ Delhi”
Query = “SSN”
Retrieval Model {VSM, LDA, BM25, …}
Results = ??
![Page 13: Demystifying Data Science - vvtesh.co.in · Demystifying Data Science Venkatesh Vinayakarao venkateshv@cmi.ac.in ... Big Data is Ubiquitous •Facebook Statistics •1.5 billion people](https://reader030.vdocuments.mx/reader030/viewer/2022040608/5ec45c12e8f156666a719ba7/html5/thumbnails/13.jpg)
Vectors
• Geometric entity which has magnitude and direction
[Figure: a vector A drawn on x and y axes, with tick marks at 1]
![Page 14: Demystifying Data Science - vvtesh.co.in · Demystifying Data Science Venkatesh Vinayakarao venkateshv@cmi.ac.in ... Big Data is Ubiquitous •Facebook Statistics •1.5 billion people](https://reader030.vdocuments.mx/reader030/viewer/2022040608/5ec45c12e8f156666a719ba7/html5/thumbnails/14.jpg)
Sentences are vectors
• “SSN Chennai” as a vector
[Figure: the vector “SSN Chennai” drawn on axes labelled SSN and Chennai, with tick marks at 1]
![Page 15: Demystifying Data Science - vvtesh.co.in · Demystifying Data Science Venkatesh Vinayakarao venkateshv@cmi.ac.in ... Big Data is Ubiquitous •Facebook Statistics •1.5 billion people](https://reader030.vdocuments.mx/reader030/viewer/2022040608/5ec45c12e8f156666a719ba7/html5/thumbnails/15.jpg)
Sentences are vectors
• “The SSN Chennai” is a 3-dimensional vector
[Figure: the vector “The SSN Chennai” drawn on axes labelled SSN and Chennai, with tick marks at 1]
![Page 16: Demystifying Data Science - vvtesh.co.in · Demystifying Data Science Venkatesh Vinayakarao venkateshv@cmi.ac.in ... Big Data is Ubiquitous •Facebook Statistics •1.5 billion people](https://reader030.vdocuments.mx/reader030/viewer/2022040608/5ec45c12e8f156666a719ba7/html5/thumbnails/16.jpg)
Sentences are vectors
• In this 3D space, the “The SSN” vector lies in the plane spanned by the x (The) and z (SSN) axes.
[Figure: the vector “The SSN” drawn on axes labelled SSN and Chennai, with tick marks at 1]
![Page 17: Demystifying Data Science - vvtesh.co.in · Demystifying Data Science Venkatesh Vinayakarao venkateshv@cmi.ac.in ... Big Data is Ubiquitous •Facebook Statistics •1.5 billion people](https://reader030.vdocuments.mx/reader030/viewer/2022040608/5ec45c12e8f156666a719ba7/html5/thumbnails/17.jpg)
Comparing Sentences
• We can compare sentences using the angle between vectors
[Figure: the vectors “The SSN” and “The Chennai” drawn on axes labelled SSN and Chennai, with tick marks at 1]
![Page 18: Demystifying Data Science - vvtesh.co.in · Demystifying Data Science Venkatesh Vinayakarao venkateshv@cmi.ac.in ... Big Data is Ubiquitous •Facebook Statistics •1.5 billion people](https://reader030.vdocuments.mx/reader030/viewer/2022040608/5ec45c12e8f156666a719ba7/html5/thumbnails/18.jpg)
Angle between two vectors
• What is the angle between The and SSN vectors?
• What is the angle between SSN and Chennai vectors?
• What is the angle between The SSN and The SSN vectors?
![Page 19: Demystifying Data Science - vvtesh.co.in · Demystifying Data Science Venkatesh Vinayakarao venkateshv@cmi.ac.in ... Big Data is Ubiquitous •Facebook Statistics •1.5 billion people](https://reader030.vdocuments.mx/reader030/viewer/2022040608/5ec45c12e8f156666a719ba7/html5/thumbnails/19.jpg)
Mathematical Notation
• We represent vectors as follows:
  • Vector = (dimension1, dimension2, dimension3, …)
• First, define the dimensions
• Next, put “1” if the word is present in the sentence, else “0”
• Example
In our example, the dimensions are (The, SSN, Chennai).
So,
The Chennai = (1,0,1)
The SSN = (1,1,0)
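The two steps above (fix the dimensions, then mark each word as present or absent) can be sketched in a few lines of Python. The names `VOCABULARY` and `to_binary_vector` are made up for this illustration, and splitting on whitespace with exact, case-sensitive matching is a simplifying assumption.

```python
# Step 1: define the dimensions, in the same order as the example.
VOCABULARY = ("The", "SSN", "Chennai")


def to_binary_vector(sentence, vocabulary=VOCABULARY):
    """Step 2: put 1 if the word is present in the sentence, else 0."""
    words = set(sentence.split())  # assumption: words are separated by spaces
    return tuple(1 if term in words else 0 for term in vocabulary)


print(to_binary_vector("The Chennai"))  # (1, 0, 1)
print(to_binary_vector("The SSN"))      # (1, 1, 0)
```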
![Page 20: Demystifying Data Science - vvtesh.co.in · Demystifying Data Science Venkatesh Vinayakarao venkateshv@cmi.ac.in ... Big Data is Ubiquitous •Facebook Statistics •1.5 billion people](https://reader030.vdocuments.mx/reader030/viewer/2022040608/5ec45c12e8f156666a719ba7/html5/thumbnails/20.jpg)
Converting from “0 – 90” to “0 – 1”
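• For convenience, we convert angles in the range 0°–90° to values in the range 0–1.
  • When the vectors are the same (angle 0°), we want to output 1.
  • When the vectors are perpendicular (angle 90°), we want to output 0.
[Figure: the angle θ between a document vector d_j and the query vector q]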
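One function with exactly this behaviour is the cosine of the angle, which the next slide uses. As a quick check, the sketch below evaluates it for a few angles; the helper name `angle_to_similarity` is made up for this illustration.

```python
import math


def angle_to_similarity(angle_degrees: float) -> float:
    """Map an angle between 0 and 90 degrees to a score between 1 and 0."""
    return math.cos(math.radians(angle_degrees))


for angle in (0, 30, 45, 60, 90):
    # Rounding hides the tiny floating-point error at 90 degrees.
    print(angle, round(angle_to_similarity(angle), 3))
```

The score falls steadily from 1 at 0° to 0 at 90°, so a smaller angle always means a higher similarity.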
![Page 21: Demystifying Data Science - vvtesh.co.in · Demystifying Data Science Venkatesh Vinayakarao venkateshv@cmi.ac.in ... Big Data is Ubiquitous •Facebook Statistics •1.5 billion people](https://reader030.vdocuments.mx/reader030/viewer/2022040608/5ec45c12e8f156666a719ba7/html5/thumbnails/21.jpg)
A Way to Calculate cosθ
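• $\cos\theta = \dfrac{x \cdot y}{|x|\,|y|}$
• Here,
  • $x \cdot y$ is the “dot product” of the x and y vectors.
• So, the similarity between “The SSN” and “SSN Chennai” is

$$\frac{1 \cdot 0 + 1 \cdot 1 + 0 \cdot 1}{\sqrt{1^2 + 1^2 + 0^2}\,\sqrt{0^2 + 1^2 + 1^2}} = \frac{1}{\sqrt{2}\,\sqrt{2}} = 0.5$$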
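The same calculation can be checked in code. This is a minimal sketch using only the standard library; the function name `cosine_similarity` and the choice to return 0 for an all-zero vector are assumptions of this example, not something the slides specify.

```python
import math


def cosine_similarity(x, y):
    """cos(theta) = (x . y) / (|x| |y|) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: an empty sentence is similar to nothing
    return dot / (norm_x * norm_y)


# Dimensions are (The, SSN, Chennai), as in the earlier example.
the_ssn = (1, 1, 0)
ssn_chennai = (0, 1, 1)

similarity = cosine_similarity(the_ssn, ssn_chennai)
print(round(similarity, 2))  # 0.5

# Converting back to an angle: the two sentences are about 60 degrees apart.
print(round(math.degrees(math.acos(similarity)), 1))
```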
![Page 22: Demystifying Data Science - vvtesh.co.in · Demystifying Data Science Venkatesh Vinayakarao venkateshv@cmi.ac.in ... Big Data is Ubiquitous •Facebook Statistics •1.5 billion people](https://reader030.vdocuments.mx/reader030/viewer/2022040608/5ec45c12e8f156666a719ba7/html5/thumbnails/22.jpg)
Which Document to Retrieve?
Indexed Content:
d1: “SSN Chennai”
d2: “XYZ Delhi”
Query = “SSN”
Retrieval Model {VSM, LDA, BM25, …}
Results = ??
![Page 23: Demystifying Data Science - vvtesh.co.in · Demystifying Data Science Venkatesh Vinayakarao venkateshv@cmi.ac.in ... Big Data is Ubiquitous •Facebook Statistics •1.5 billion people](https://reader030.vdocuments.mx/reader030/viewer/2022040608/5ec45c12e8f156666a719ba7/html5/thumbnails/23.jpg)
Example
Let query q = “SSN”.
Let document, d1 = “SSN Chennai” and d2 = “XYZ Delhi”.
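In our VSM, q = (1,0,0,0), d1 = (1,1,0,0) and d2 = (0,0,1,1).

$$\text{similarity}(d_1, q) = \frac{d_1 \cdot q}{|d_1|\,|q|} = \frac{1 \cdot 1 + 1 \cdot 0 + 0 \cdot 0 + 0 \cdot 0}{\sqrt{1^2 + 1^2}\,\sqrt{1^2}} = \frac{1}{\sqrt{2}} \approx 0.71$$

$$\text{similarity}(d_2, q) = \frac{d_2 \cdot q}{|d_2|\,|q|} = \frac{1 \cdot 0 + 0 \cdot 0 + 0 \cdot 1 + 0 \cdot 1}{\sqrt{1^2 + 1^2}\,\sqrt{1^2}} = 0$$

|    | SSN | Chennai | XYZ | Delhi |
| --- | --- | --- | --- | --- |
| q  | 1 | 0 | 0 | 0 |
| d1 | 1 | 1 | 0 | 0 |
| d2 | 0 | 0 | 1 | 1 |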
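Putting the pieces together, retrieval with this model amounts to scoring every document against the query and sorting by score. The sketch below reproduces the example end to end; all names are made up for this illustration, and whitespace tokenization is a simplifying assumption.

```python
import math

# Dimensions in the same order as the table: SSN, Chennai, XYZ, Delhi.
VOCABULARY = ("SSN", "Chennai", "XYZ", "Delhi")
DOCUMENTS = {"d1": "SSN Chennai", "d2": "XYZ Delhi"}
QUERY = "SSN"


def to_binary_vector(text, vocabulary=VOCABULARY):
    """1 if the word occurs in the text, else 0, for each dimension."""
    words = set(text.split())
    return tuple(1 if term in words else 0 for term in vocabulary)


def cosine_similarity(x, y):
    """cos(theta) between two vectors; 0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms if norms else 0.0


q = to_binary_vector(QUERY)
scores = {
    name: cosine_similarity(to_binary_vector(text), q)
    for name, text in DOCUMENTS.items()
}

# Highest similarity first: this ordering is the result list.
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(name, round(scores[name], 2))
```

With these inputs d1 scores about 0.71 and d2 scores 0, so d1 (“SSN Chennai”) is the document to retrieve for the query “SSN”.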
![Page 24: Demystifying Data Science - vvtesh.co.in · Demystifying Data Science Venkatesh Vinayakarao venkateshv@cmi.ac.in ... Big Data is Ubiquitous •Facebook Statistics •1.5 billion people](https://reader030.vdocuments.mx/reader030/viewer/2022040608/5ec45c12e8f156666a719ba7/html5/thumbnails/24.jpg)
Which Document to Retrieve?
Indexed Content:
d1: “SSN Chennai”
d2: “XYZ Delhi”
Query = “SSN”
Retrieval Model {VSM, LDA, BM25, …}
Results = ??
![Page 25: Demystifying Data Science - vvtesh.co.in · Demystifying Data Science Venkatesh Vinayakarao venkateshv@cmi.ac.in ... Big Data is Ubiquitous •Facebook Statistics •1.5 billion people](https://reader030.vdocuments.mx/reader030/viewer/2022040608/5ec45c12e8f156666a719ba7/html5/thumbnails/25.jpg)
Summary
• Data is Ubiquitous
  • and it is growing too!
• Modern Text Processing
  • Vector Space Model
• Remember
  • Data processing goes beyond common sense… we need techniques and tools.
  • Products are good to learn. Principles are even more important. Don’t ignore them.
![Page 26: Demystifying Data Science - vvtesh.co.in · Demystifying Data Science Venkatesh Vinayakarao venkateshv@cmi.ac.in ... Big Data is Ubiquitous •Facebook Statistics •1.5 billion people](https://reader030.vdocuments.mx/reader030/viewer/2022040608/5ec45c12e8f156666a719ba7/html5/thumbnails/26.jpg)
Memories