![Page 1: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION](https://reader031.vdocuments.mx/reader031/viewer/2022012300/61e142b5406d5720de3ecfac/html5/thumbnails/1.jpg)
Mingon Kang, Ph.D.
Department of Computer Science, University of Nevada, Las Vegas
* Some contents are adapted from Dr. Hung Huang and Dr. Chengkai Li at UT Arlington
CS 789 ADVANCED BIG DATA ANALYTICS
DATA REPRESENTATION
![Page 2: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION](https://reader031.vdocuments.mx/reader031/viewer/2022012300/61e142b5406d5720de3ecfac/html5/thumbnails/2.jpg)
Data Representation?
How does a computer represent data?
0 and 1 in the aspect of โgeneralโ computer science
Vector/Matrix in the aspect of โMachine Learningโ
![Page 3: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION](https://reader031.vdocuments.mx/reader031/viewer/2022012300/61e142b5406d5720de3ecfac/html5/thumbnails/3.jpg)
Data Representation
Scalar single number - usually write in italics
- lower-case variable names
- e.g., ๐ โ โ, ๐ โ โ [1]
Vector array of numbers - arranged in order
- lower-case names written in bold typeface
- ๐ฑ =
๐ฅ1๐ฅ2โฎ๐ฅ๐
, ๐ฑ = {๐ฅ1, ๐ฅ2, โฆ , ๐ฅ๐}
- what is ๐ฑ๐ฌ when ๐ฌ = {1, 3, 6}?- Then, ๐ฑโ๐ฌ?
1 https://en.wikipedia.org/wiki/List_of_mathematical_symbols
![Page 4: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION](https://reader031.vdocuments.mx/reader031/viewer/2022012300/61e142b5406d5720de3ecfac/html5/thumbnails/4.jpg)
Data Representation
Matrix 2-D array of
numbers
- an element is identified by two indices
- upper-case variable name with bold
typeface, e.g., ๐- ๐ โ โ๐โ๐: matrix has a height of m and a
width of n, and elements are real numbers
- e.g., ๐ =๐ฅ11 ๐ฅ12 ๐ฅ13๐ฅ21 ๐ฅ22 ๐ฅ23
Tensor array with more than
two axes
- three indices to identify an element
![Page 5: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION](https://reader031.vdocuments.mx/reader031/viewer/2022012300/61e142b5406d5720de3ecfac/html5/thumbnails/5.jpg)
Types of Variable
Categorical variable: discrete or qualitative variables
Nominal:
โผ Have two or more categories, but which do not have an
intrinsic order
Ordinal
โผ Have two or more categories, which can be ordered or
ranked.
Continuous variable
![Page 6: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION](https://reader031.vdocuments.mx/reader031/viewer/2022012300/61e142b5406d5720de3ecfac/html5/thumbnails/6.jpg)
Data Representation
Features
An individual measurable property of a phenomenon being observed
The number of features or distinct traits that can be used to describe each item in a quantitative manner
May have implicit/explicit patterns to describe a phenomenon
Samples
Items to process (classify or cluster)
Can be a document, a picture, a sound, a video, or a patient
Reference: https://en.wikipedia.org/wiki/Feature_(machine_learning)
![Page 7: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION](https://reader031.vdocuments.mx/reader031/viewer/2022012300/61e142b5406d5720de3ecfac/html5/thumbnails/7.jpg)
Data Representation
Feature vector
An N-dimensional vector of numerical features that
represent some objects
A sample consists of feature vectors
Reference: http://www.slideshare.net/rahuldausa/introduction-to-machine-learning-38791937
![Page 8: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION](https://reader031.vdocuments.mx/reader031/viewer/2022012300/61e142b5406d5720de3ecfac/html5/thumbnails/8.jpg)
Example - Survey
Convert Data to a feature vector/sample matrix
๐ก๐๐๐ = ๐๐๐๐๐๐๐ข๐๐๐ = ๐ฆ๐๐
โฎ
![Page 9: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION](https://reader031.vdocuments.mx/reader031/viewer/2022012300/61e142b5406d5720de3ecfac/html5/thumbnails/9.jpg)
Example โ Structured data
Convert Data to a feature vector/sample matrix
๐น๐๐๐๐๐๐๐๐๐๐๐๐ก๐๐๐
โฎ
![Page 10: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION](https://reader031.vdocuments.mx/reader031/viewer/2022012300/61e142b5406d5720de3ecfac/html5/thumbnails/10.jpg)
Example โ Image data
![Page 11: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION](https://reader031.vdocuments.mx/reader031/viewer/2022012300/61e142b5406d5720de3ecfac/html5/thumbnails/11.jpg)
Example โ Unstructured data
Feature Extraction
Unstructured data (e.g., text data)
Structured data (e.g., Bag-of-Words Model)
![Page 12: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION](https://reader031.vdocuments.mx/reader031/viewer/2022012300/61e142b5406d5720de3ecfac/html5/thumbnails/12.jpg)
Data in Machine Learning
๐ฅ๐: input vector, independent variable
๐ฆ: response variable, dependent variable
๐ฆ โ {โ1, 1} or {0, 1}: binary classification
๐ฆ โ โค: multi-label classification
๐ฆ โ โ: regression
Predict a label when having observed some new ๐ฅ
![Page 13: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION](https://reader031.vdocuments.mx/reader031/viewer/2022012300/61e142b5406d5720de3ecfac/html5/thumbnails/13.jpg)
Data Visualization
Vector space model
Data is a set of features, ๐๐ข = ๐1, ๐2, โฆ , ๐๐
All data can be represented by vector
Ref: https://www.slideshare.net/pkgosh/the-vector-space-model
![Page 14: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION](https://reader031.vdocuments.mx/reader031/viewer/2022012300/61e142b5406d5720de3ecfac/html5/thumbnails/14.jpg)
Data Visualization
Hand-written data (MNIST)
High-dimensional data
Can visualize data using Principle Component Analysis