EventCube: Aviation Safety Data Analysis System
Fangbo Tao, Xiao Yu, Jiawei Han
08/10/13
The data we focus on:
Huge Collection of Logs
Following a normal approach and landing to runway 4 in roc; aircraft was taxied clear of the end of runway to the gate. several ground snow removal vehicles were operating to left of aircraft so we moved to the right side of ramp. …
Each Document
Power of Text-Rich Data Cubes
Hierarchical Data Cube:
More than 10 dimensions
Most dimensions have hierarchy
Powerful Summarization!
Real-time for huge data!
Text Analysis:
Contains huge details
Each report tells a story
Topic Analysis
Keyword Search
Sentiment Analysis
Power of Text-Rich Data Cubes
Data Cube: Efficient Summarization
Rich Text: Powerful Text Mining
More than Simply Integration!
Data Cube and Rich Text can mutually enhance each other!
Power of Text-Rich Data Cube
Slice, Roll-up, Drill-down, Dice
…
Other features
Contextual Search
Hierarchical Dimension Selection: support multiple choices
Similar Document Finding: based on Contextual Search
Keyword Frequency Distribution
Multi-gram Summarization
Contextual Search
Motivation:
Every word/concept may have an equivalent word/concept: “SVM” = “Support Vector Machine”, “Alt” = “Altitude”
Connections between words: “Kernel Method” - “SVM”, “altitude” - “flight level”
Contextual Search
We develop a contextual search framework to build the word-net.
It contains 4 different relationships:
A “Use” B: Equivalent terms, B is more common
A “RT” B: Related terms, not hierarchical
A “BT” B: B is the broader word
A “NT” B: B is the narrower word
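A minimal sketch of how the four relationships could be stored, assuming a simple in-memory graph. The `WordNet` class, its method names, and the "b-737"/"boeing" pair are hypothetical and only illustrate the relation semantics listed above.

```python
from collections import defaultdict


class WordNet:
    """Toy term graph for the "Use", "RT", "BT" and "NT" relations."""

    def __init__(self):
        # relation -> term -> set of target terms
        self._edges = defaultdict(lambda: defaultdict(set))

    def add(self, a, relation, b):
        a, b = a.lower(), b.lower()
        self._edges[relation][a].add(b)
        # Keep the graph consistent: RT is symmetric, BT and NT are inverses.
        if relation == "RT":
            self._edges["RT"][b].add(a)
        elif relation == "BT":
            self._edges["NT"][b].add(a)
        elif relation == "NT":
            self._edges["BT"][b].add(a)

    def preferred(self, term):
        """Follow "Use" links to the more common equivalent term."""
        term = term.lower()
        seen = {term}
        while self._edges["Use"].get(term):
            term = sorted(self._edges["Use"][term])[0]
            if term in seen:  # guard against accidental cycles
                break
            seen.add(term)
        return term

    def related(self, term):
        return sorted(self._edges["RT"].get(term.lower(), set()))

    def broader(self, term):
        return sorted(self._edges["BT"].get(term.lower(), set()))

    def narrower(self, term):
        return sorted(self._edges["NT"].get(term.lower(), set()))


net = WordNet()
net.add("Alt", "Use", "Altitude")
net.add("SVM", "Use", "Support Vector Machine")
net.add("altitude", "RT", "flight level")
net.add("b-737", "BT", "boeing")  # hypothetical hierarchical pair
```

Storing the inverse edge at insertion time keeps lookups in either direction a single dictionary access.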
Contextual Search
Step 1: Generate word-net when uploading dataset.
Step 2: Return the related terms while the user is entering a query.
Step 3: Automatically include the equivalent terms when searching.
Step 4: Support the operators “AND”/“OR”/“NOT”.
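Steps 3 and 4 can be sketched as follows. This is an illustration under stated assumptions, not the system's actual query engine: the `EQUIVALENT` table, the sample reports, and the function names are hypothetical, operators are evaluated strictly left to right with no precedence, and "NOT" is treated as "and not".

```python
import re

# Hypothetical equivalence table: variant -> preferred term ("Use" relation).
EQUIVALENT = {"alt": "altitude", "rwy": "runway"}


def variants(term):
    """All surface forms that should match when `term` is searched."""
    preferred = EQUIVALENT.get(term, term)
    return {preferred} | {v for v, p in EQUIVALENT.items() if p == preferred}


def matching_docs(term, docs):
    """Ids of documents containing the term or any equivalent form."""
    forms = sorted(variants(term.lower()))
    pattern = re.compile(r"\b(?:%s)\b" % "|".join(map(re.escape, forms)))
    return {doc_id for doc_id, text in docs.items() if pattern.search(text.lower())}


def search(query, docs):
    """Evaluate 'a AND b OR c NOT d' from left to right."""
    parts = re.split(r"\s+(AND|OR|NOT)\s+", query.strip())
    result = matching_docs(parts[0], docs)
    for op, term in zip(parts[1::2], parts[2::2]):
        hits = matching_docs(term, docs)
        if op == "AND":
            result &= hits
        elif op == "OR":
            result |= hits
        else:  # NOT: drop documents that match the term
            result -= hits
    return result


docs = {
    1: "descended below assigned alt during approach",
    2: "altitude deviation after takeoff in snow",
    3: "snow removal vehicles near the runway",
}
```

With these sample reports, `search("altitude", docs)` also returns report 1, which only contains the abbreviation "alt".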
Hierarchical Dimension Support
Multiple Choice Support
Each Dimension can support several levels
Powerful examples:
“B-737” vs. “B-747”
“Boeing” vs. “Airbus”
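A small sketch of multiple-choice selection over a hierarchical dimension, assuming a two-level aircraft hierarchy (model, then manufacturer). The `PARENT` table, the report records, and the function names are hypothetical.

```python
# Hypothetical two-level hierarchy for the aircraft dimension: model -> manufacturer.
PARENT = {"B-737": "Boeing", "B-747": "Boeing", "A320": "Airbus", "A330": "Airbus"}


def matches(value, choice):
    """True if a report's leaf value equals the choice or rolls up to it."""
    while value is not None:
        if value == choice:
            return True
        value = PARENT.get(value)
    return False


def select(reports, choices):
    """Keep reports whose aircraft falls under any of the selected values."""
    return [r for r in reports if any(matches(r["aircraft"], c) for c in choices)]


def compare(reports, choices):
    """Count reports per selected value, to compare cells side by side."""
    return {c: sum(matches(r["aircraft"], c) for r in reports) for c in choices}


reports = [
    {"id": 1, "aircraft": "B-737"},
    {"id": 2, "aircraft": "B-747"},
    {"id": 3, "aircraft": "A320"},
    {"id": 4, "aircraft": "B-737"},
]
```

The same `compare` call works at either level: `compare(reports, ["B-737", "B-747"])` compares two models, while `compare(reports, ["Boeing", "Airbus"])` compares two manufacturers, and choices from different levels can be mixed in one selection.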
Document List Result
Use MySQL's default “natural language full-text search”.
Extract the title from the most relevant part of the report.
Show tags of dimension values for target dimensions
Highlight the keywords
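A hedged sketch of the retrieval and highlighting steps. The table `reports` and its columns `id` and `narrative` are hypothetical, `narrative` is assumed to carry a FULLTEXT index, and the `%s` placeholders follow the style accepted by common Python MySQL drivers. The function only builds the statement; it does not connect to a database.

```python
import re


def build_fulltext_query(keywords, limit=20):
    """Return (sql, params) for a MySQL natural-language full-text search."""
    sql = (
        "SELECT id, narrative, "
        "MATCH(narrative) AGAINST (%s IN NATURAL LANGUAGE MODE) AS score "
        "FROM reports "
        "WHERE MATCH(narrative) AGAINST (%s IN NATURAL LANGUAGE MODE) "
        "ORDER BY score DESC LIMIT %s"
    )
    text = " ".join(keywords)
    return sql, (text, text, limit)


def highlight(text, keywords, left="[", right="]"):
    """Wrap every whole-word keyword occurrence in markers, ignoring case."""
    if not keywords:
        return text
    pattern = re.compile(
        r"\b(%s)\b" % "|".join(map(re.escape, keywords)), re.IGNORECASE
    )
    return pattern.sub(lambda m: left + m.group(0) + right, text)


sql, params = build_fulltext_query(["snow", "runway"])
marked = highlight("Snow removal vehicles near runway 4", ["snow", "runway"])
```

In a web front end the `left` and `right` markers would typically be replaced by the markup used for highlighting.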
Similar Document
Also built on contextual search
Step 1: Extract meaningful terms from the original report.
Step 2: Run a contextual search with these terms as input.
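The two steps might look like the sketch below. The slides do not say how "meaningful" terms are chosen, so TF-IDF is used here purely as an assumption, and a plain term-overlap ranking stands in for the full contextual search. The stopword list, sample reports, and function names are hypothetical.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "to", "and", "in", "was", "we", "on"}


def tokens(text):
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def key_terms(doc_id, docs, k=5):
    """Step 1: top-k terms of one report by TF-IDF (assumed notion of 'meaningful')."""
    df = Counter(t for text in docs.values() for t in set(tokens(text)))
    tf = Counter(tokens(docs[doc_id]))
    score = {t: tf[t] * math.log(len(docs) / df[t]) for t in tf}
    return sorted(score, key=lambda t: (-score[t], t))[:k]


def similar(doc_id, docs, k=5):
    """Step 2: rank the other reports by how many key terms they contain."""
    terms = set(key_terms(doc_id, docs, k))
    overlap = {
        d: len(terms & set(tokens(text))) for d, text in docs.items() if d != doc_id
    }
    ranked = sorted(overlap, key=lambda d: (-overlap[d], d))
    return [d for d in ranked if overlap[d] > 0]


docs = {
    1: "snow removal vehicles on the ramp near the gate",
    2: "snow on the ramp delayed taxi to the gate",
    3: "altitude deviation during climb",
    4: "engine vibration during climb",
}
```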
Top Cells
Search all cells in the target dimensions and find the most relevant ones.
A multi-dimensional cell ranking
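A sketch of one possible cell ranking. The slides do not give the relevance score, so this example assumes it is the fraction of a cell's reports that contain any query keyword. The field names and sample reports are hypothetical.

```python
from collections import defaultdict


def top_cells(reports, keywords, dims, k=3):
    """Rank cells (one value per target dimension) by keyword relevance."""
    hits = defaultdict(int)  # cell -> number of reports containing a keyword
    size = defaultdict(int)  # cell -> number of reports in the cell
    for r in reports:
        cell = tuple(r[d] for d in dims)
        size[cell] += 1
        text = r["text"].lower()
        if any(kw.lower() in text for kw in keywords):
            hits[cell] += 1
    # Assumed score: share of the cell's reports that match the keywords.
    scored = [(hits[c] / size[c], size[c], c) for c in size]
    scored.sort(key=lambda x: (-x[0], -x[1], x[2]))
    return [(c, round(s, 2)) for s, _, c in scored[:k]]


reports = [
    {"weather": "snow", "phase": "taxi", "text": "snow removal vehicles on ramp"},
    {"weather": "snow", "phase": "taxi", "text": "slid on icy taxiway"},
    {"weather": "clear", "phase": "taxi", "text": "vehicles crossed ahead"},
    {"weather": "clear", "phase": "landing", "text": "normal approach and landing"},
]

ranking = top_cells(reports, ["vehicles"], ["weather", "phase"])
```

Because a cell is a tuple of dimension values, the same function ranks one-dimensional or multi-dimensional cells depending on `dims`.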
Single Dimension Distribution Based on Keywords
Use an offline + online framework to calculate the distribution.
Offline only: the number of keyword combinations is exponential.
Online only: the whole corpus must be retrieved every time.
Strategy:
[Offline] Store the single-keyword distributions in the database.
[Online] Combine the single-keyword distributions into a new distribution.
Single Dimension Distribution Based on Keywords
Offline process:
Step 1: Map equivalent terms into one.
Step 2: Build both a keyword reverse index and a cell reverse index based on the reports.
Step 3: Compare these two reverse indexes and calculate the single-term distribution.
Online process [with a list of terms and dimensions]:
Step 1: Map each term to its equivalent term.
Step 2: For each dimension, calculate the combined distribution under the independence assumption: $\mathrm{Val}(t_1, \dots, t_n) = 1 - \prod_{i=1}^{n} \bigl(1 - \mathrm{Val}(t_i)\bigr)$
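The offline and online steps can be sketched as below. One assumption is made loudly here: Val(t) for a cell is read as the fraction of that cell's reports that contain term t, which the slides do not state explicitly. The function names, field names, and sample reports are hypothetical.

```python
from collections import defaultdict


def build_single_term_distributions(reports, dim, equivalent):
    """Offline: per term, the fraction of each cell's reports that contain it."""
    term_index = defaultdict(set)  # keyword reverse index: term -> report ids
    cell_index = defaultdict(set)  # cell reverse index: dimension value -> report ids
    for r in reports:
        cell_index[r[dim]].add(r["id"])
        for word in r["text"].lower().split():
            # Steps 1 and 2: map to the equivalent term, then index the report.
            term_index[equivalent.get(word, word)].add(r["id"])
    # Step 3: intersect the two indexes to get the single-term distribution.
    return {
        term: {
            cell: len(ids & members) / len(members)
            for cell, members in cell_index.items()
        }
        for term, ids in term_index.items()
    }


def combined_distribution(terms, single, equivalent):
    """Online: Val(t1..tn) = 1 - prod(1 - Val(ti)), computed per cell."""
    terms = [equivalent.get(t.lower(), t.lower()) for t in terms]
    cells = set()
    for t in terms:
        cells.update(single.get(t, {}))
    result = {}
    for cell in cells:
        miss = 1.0  # probability that none of the terms occurs
        for t in terms:
            miss *= 1.0 - single.get(t, {}).get(cell, 0.0)
        result[cell] = 1.0 - miss
    return result


reports = [
    {"id": 1, "weather": "snow", "text": "low alt on approach"},
    {"id": 2, "weather": "snow", "text": "ice on runway"},
    {"id": 3, "weather": "clear", "text": "altitude deviation"},
    {"id": 4, "weather": "clear", "text": "runway incursion"},
]
equivalent = {"alt": "altitude"}
single = build_single_term_distributions(reports, "weather", equivalent)
combined = combined_distribution(["alt", "ice"], single, equivalent)
```

Only one distribution per keyword is stored, so the database grows linearly with the vocabulary, and the online step touches just the stored rows for the queried terms instead of the corpus.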
Topic Distribution
Based on Topic Cube
Applies a topic model
Supports comparison between different cells
Unigram/Multigram description
Based on the paper by Qiaozhu Mei et al., “Automatic Labeling of Multinomial Topic Models”
Find multi-gram candidates from the whole text
Score each candidate based on its unigrams
Adjust the score based on the candidate's length
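A simplified sketch of the three steps, not the scoring from the cited paper. It assumes candidates are n-grams that repeat in the corpus, that a candidate's base score is the average weight of its unigrams, and that longer phrases get a small multiplicative bonus. The weights, parameters, and function names are hypothetical.

```python
import re
from collections import Counter


def candidates(texts, max_n=3, min_count=2):
    """N-grams (n >= 2) that occur at least `min_count` times in the corpus."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return [gram for gram, c in counts.items() if c >= min_count]


def score(gram, unigram_weight, length_bonus=0.1):
    """Average unigram weight, nudged upward for longer phrases (assumed form)."""
    base = sum(unigram_weight.get(w, 0.0) for w in gram) / len(gram)
    return base * (1.0 + length_bonus * (len(gram) - 1))


def label(texts, unigram_weight, k=2):
    """Return the k best multi-gram descriptions."""
    ranked = sorted(
        candidates(texts), key=lambda g: (-score(g, unigram_weight), g)
    )
    return [" ".join(g) for g in ranked[:k]]


texts = [
    "snow removal vehicles blocked the ramp",
    "snow removal vehicles near the gate",
]
# Hypothetical unigram weights, e.g. word probabilities from a topic.
unigram_weight = {"snow": 0.4, "removal": 0.3, "vehicles": 0.2, "ramp": 0.05, "gate": 0.05}
labels = label(texts, unigram_weight)
```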
Thinking
Data Cube: Efficient summary. Highly structured data.
Rich Text: Topic analysis, keyword search.
Common: ASRS, IMDB, Publication-Net, News…
Network (HIN): Good at mining, contains structural information. No information loss.
Motivation of EventCube
Combine Data Cube with Rich Text.
Combine Summary with Keyword Search
Build a general search/analysis system for rich text cube data.
1. Aviation Safety Reporting Data: Time, Weather, Location, Model… (dimensions); Flight logs (text)
2. Publication Data: Author, Conf, Time, Field, Affiliation… (dimensions); Abstract (text)
3. IMDB: Time, Country, Style, Director… (dimensions); Description (text)
Thanks