UNCLASSIFIED – DISTRIBUTION STATEMENT A – Reference Number 18-S-1486; May 9, 2018
TRMC Big Data Analysis / Knowledge Management Initiative
Ryan Norman
Big Data and Knowledge Management Initiative Lead
Test Resource Management Center
What is Big Data Analytics?
• The use of advanced statistical analytic techniques in a parallel processing high-performance computing environment against very large diverse data sets that include different types of data
• Allows analysts to make better and faster decisions using data that was previously inaccessible or unusable
• Previously under-utilized data sources can be analyzed to gain new insights resulting in significantly better and faster decisions
• Instead of analyzing small chunks of data, Big Data Analytics can give the analyst a broad view of the system, allowing the discovery of “unknown unknowns.”
• Most important (and relevant to T&E) big data analytics techniques:
– Anomaly Detection – Did something go wrong?
– Causality Detection – What contributed to it?
– Trend Analysis – What’s happening over time?
– Predicting Equipment Function and Failure – When will something go wrong?
– Regression Analysis – How is today’s data different than the past?
– Data Set Comparison – Is test repeatable? Is the simulation the same as the test? Is the perceived truth the same as the ground truth?
– Pattern Recognition – Are there hidden relationships in the data set?
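As a minimal illustration of the Trend Analysis technique listed above, the sketch below fits a least-squares line to one measurement per flight and reports its slope. The readings and the `trend_slope` helper are hypothetical and are not taken from any TRMC tool; a real analysis would also need to account for noise and confidence intervals.

```python
from statistics import mean


def trend_slope(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to estimate a trend")
    xs = range(n)
    x_bar = mean(xs)
    y_bar = mean(values)
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


# Hypothetical per-flight peak readings (made-up numbers for illustration only).
peak_readings = [610.0, 612.0, 611.0, 615.0, 618.0, 621.0]
slope = trend_slope(peak_readings)
print(f"trend: {slope:+.2f} units per flight")
```

A positive slope that persists across many flights is the kind of signal that would prompt a closer look before a component fails.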
Better tools and techniques so analysts can do their jobs
Example Big Data Analytics Return on Investment
Analysis: Brief (~300 ms) false on-ground event for sensor during flight
Big Data Analytics enables faster & more comprehensive analysis across the lifecycle of a program
Result: JSF-KM project discovered unknown problem with ground sensor
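The ~300 ms false on-ground event above is the kind of short-lived state flip that is easy to miss by eye but simple to search for once all the data is accessible. The sketch below is one hedged way to flag such events; the sample layout (millisecond timestamps paired with a boolean flag), the 500 ms threshold, and the `find_brief_events` name are all assumptions for illustration, not the JSF-KM method.

```python
def find_brief_events(samples, max_duration_ms):
    """Return (start_ms, duration_ms) for each run of True lasting at most max_duration_ms.

    samples: list of (timestamp_ms, on_ground_flag) pairs sorted by time.
    A run is measured from its first True sample to the next False sample.
    A run still open when the data ends is not reported.
    """
    events = []
    run_start = None
    for t, on_ground in samples:
        if on_ground and run_start is None:
            run_start = t                      # flag just turned on
        elif not on_ground and run_start is not None:
            duration = t - run_start           # flag just turned off
            if duration <= max_duration_ms:
                events.append((run_start, duration))
            run_start = None
    return events


# Hypothetical 10 Hz data: airborne, with a 300 ms false on-ground blip,
# followed by a real landing where the flag stays on until the data ends.
samples = [(t, 400 <= t < 700) for t in range(0, 1100, 100)]
samples += [(t, True) for t in range(1100, 2100, 100)]

events = find_brief_events(samples, max_duration_ms=500)
print(events)
```

Running the same check across every flight in a program, rather than a single test, is what turns a one-off observation into a systematic anomaly search.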
Need: An Evaluation Revolution
• Most T&E investments have been focused on the “T” rather than the “E”
– Our analysis & evaluation capabilities are not keeping up with the complexity and speed required by today’s acquisition systems
– The next-generation of acquisition systems will be exponentially more complex than today
• Impact: T&E quality is inadequate for our needs
– More data is being collected than can be properly analyzed
– Only a tiny fraction of data is looked at
– Analysis occurs on a small fraction of data
– Focus is on a single test, rather than data collected across the system lifecycle
– No systematic anomaly detection, trend analysis, regression analysis, causality analysis, pattern recognition, simulation/test comparisons, perceived truth/ground truth comparisons are being done
• Impact: T&E timeliness is inadequate for our needs
– Analyst retrieval of test data in many cases takes days/weeks rather than seconds/minutes
– Sometimes it’s easier (though not cheaper) to just re-run a test rather than find old data that may answer the question
– Long data ingest times prevent proper debriefing of test participants after a test is over, since their statements cannot be correlated with data in real time
• Impact: T&E dollars are being spent unnecessarily
– More tests than necessary are being done, sometimes at enormous expense
– Cross-program lessons learned only occur anecdotally
A systematic approach to Big Data Analytics and Knowledge Management is required to address these three serious issues
Long-term DoD T&E Big Data and Knowledge Management Vision
Result: T&E data used more effectively & efficiently during acquisition
• The primary product of T&E is data & knowledge
• Embrace KM & Big Data Analytics to efficiently handle & securely share T&E data
• Organize T&E data to build knowledge across all DoD acquisitions
• Federate distributed data repositories to enable execution & automated search scenarios that cannot occur today
• Use modern mechanisms to enable collaboration between SMEs in government and industry
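To make the idea of federating distributed repositories for automated search more concrete, the sketch below fans a single query out to several repositories in parallel and merges the hits. Everything here is hypothetical: the repository names, the record layout, and the `federated_search` helper are stand-ins, and a real deployment would issue network calls subject to security policy rather than read in-memory lists.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for distributed range repositories.
REPOSITORIES = {
    "range_a": [
        {"id": "a1", "tags": {"propulsion", "flight"}},
        {"id": "a2", "tags": {"avionics"}},
    ],
    "range_b": [
        {"id": "b1", "tags": {"propulsion", "ground"}},
    ],
}


def search_repository(name, tag):
    """Query a single repository; a real system would make a network call here."""
    return [(name, rec["id"]) for rec in REPOSITORIES[name] if tag in rec["tags"]]


def federated_search(tag):
    """Send the same query to every repository in parallel and merge the hits."""
    with ThreadPoolExecutor() as pool:
        per_repo = list(pool.map(lambda name: search_repository(name, tag), REPOSITORIES))
    return sorted(hit for hits in per_repo for hit in hits)


results = federated_search("propulsion")
print(results)
```

The analyst issues one query and does not need to know which range holds the data, which is the point of federation.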
Fundamental Functions Performed by KM and BDA
1. Understand and Document T&E challenges & needs
– (FY12) Completed Data Management for Distributed Testing (DM-DT) Study
− Result: Developed functional requirements for T&E enterprise distributed Data Management
– (FY13) Comprehensive Review of T&E Infrastructure report published
− Key Recommendation: Use DoD cloud solution for T&E data
− Key Recommendation: USD(AT&L) establish a DoD-wide KM capability for T&E to help achieve better acquisition outcomes and reduce costs
2. Execute proofs of concept that inform an enterprise approach to T&E Knowledge Management
– (FY15-18) Joint Strike Fighter Knowledge Management (JSF-KM) project
− Goal: Assess KM technologies and methodologies in support of an existing acquisition program
– (FY15-17) Collected Operational Data Analytics for Continuous Test & Evaluation (CODAC-TE) project
− Goal: Apply KM technologies and methodologies across the lifecycle
3. Develop investment plan that achieves strategic objectives:
– Integrate T&E infrastructure into cohesive Knowledge Management enterprise
– Modernize T&E practices & processes to leverage Big Data analytics techniques
– Apply Big Data analytics tools & techniques to the T&E mission space
Realizing Big Data and Improved DoD T&E Knowledge Management
Investments Path Forward: It Starts with Architecture
• The Big Data and Knowledge Management Architecture Reference Document (ARD) identifies:
– Deficiencies in current T&E data analysis and knowledge management practices
– Government, commercial and open source software and hardware that could address these deficiencies
– The end state we are looking to achieve
• TRMC has released the ARD for feedback in preparation for making it a JMETC community standard
– Reviewers should request access to BDKM User Group on TRMC website
– Standardization scheduled for August JMETC Configuration Review Board
• Once a reference architecture is standardized, we can build it
– Goal: Synergize evaluation investments across DoD T&E
https://www.tena-sda.org/display/BDKM/Documentation
What do we need?
1. Integrated Local Data
2. Cloud Analytics Capability
3. Big Data Tools
4. Trained Data Science Workforce
• Integrated
• Scalable
• Cost-Effective
• State-of-the-Art
[Figure: Cloud-Based Big Data Analytics and Knowledge Management System. Individual Range: Current Range Infrastructure (Existing Tools, Existing Storage, Existing Ingest Capabilities); Range Augmentation (Virtualized Big Data Tools, Some Processing, Some Tiered Storage, MLS Security, Enhance Ingest); Working Files; Quick-Look. Three Regional Analytics Capability blocks, each labeled Virtualized Big Data Tools, Processing, Tiered Storage, MLS Security, Data Scientists. Other labels: Schedule Info, Application Repository, Reports, Data, Video, Audio, Imagery. Legend: New, Existing.]
Big Data and TENA Relationship: The Big Data Analytics Architecture is an Extension of TENA Into the Analytic World – Seamless Integration
Event Data Is Ingested into Big Data Enterprise System
[Figure: Individual Range. Current Range Infrastructure (Existing Tools, Existing Storage, Existing Ingest Capabilities); Range Augmentation (Virtualized Big Data Tools, Some Processing, Some Tiered Storage, MLS Security, Enhance Ingest); Working Files; Quick-Look.]
Big Data Software Architecture Overview
[Figure: layered block diagram. Labels recovered from the diagram, in extraction order:
Existing Range Computing and Storage; Structured Database; Unstructured/Semi-Structured Database (Hadoop); Structured Data Engine; Unstructured Data Engine; Query Engine – Federated access for both Structured and Unstructured Data; Data Analysis Packages; User-Defined Analytic Plugins; Massively Parallel Tiered Computing, Storage, and Network Infrastructure At Multiple Independent Levels of Security.
Extract-Transform-Load; Data Sources; Analytic Services; Big Data Visualization; UC, S, TS, SAP, SAR; Security; Existing Range Databases; Flat Files; Raw Files.
Setup, Configure, and Manage; Policies; Security; Define Metadata; Prioritization; Streams; Micro-batch; Mega-batch; Parallel; Verify; Transform; Add Metadata; Index; Warehouse; Configuration; Metadata Replication; Build Queries; Quick-Look; Real-Time; Continuous; 2D/3D/Anim; Display Reports; Design Reports; Customized Displays; Display Alerts.
User Interface; Authenticate; Authorize; Access Control; Enforce Policies; Enforce Workflow; Threat Detection; Intrusion Detection; Active Defenses; Working Sets; Tables; Encryption; Audit; Alerts; Load Balancing; Fault/Recovery; MILS Secure Cloud; Statistics; Key-Value Store; Distributed File System; Generate Reports; AI Tools; Simulation; Analysis Tools; Alerting; Scheduling/Automation; Legacy Tools; SQL Services; Remote Data Replication.
T&E Specific Custom BDA Services: Anomaly Detection; Trend Analysis; Causality Detection; Regression Analysis; Ground Truth Comparison; Pattern Recognition.
Filter; Sort; Summarize; Parallelize; Optimize; Machine Learning; Data Mining; Customized UIs; Structured; Unstructured; Audio/Video; Schema; Computing Resources; Computing Resources; Create Automated Products.
Abstraction Layer (Virtualization); Hypervisor; Virtualized Legacy Tools; Infrastructure as a Service; Platform as a Service; Software as a Service; Virtualized New Tools; Simulation as a Service; Graph-Based; Schema; Audio/Video Analysis; New Databases; Provisioning; Streaming; Scripting.
COTS/GOTS Software; New Hardware/Network; TRMC-Developed Software; Existing Range HW/SW.
Applications; Resource Mgmt; VM Library; Cloud; License; Customization.
Data Services; Organization; Core Operations; Share; Serve; Messaging; Metadata; Store; Retrieve; Versioning; Tagging; Publish/Subscribe; Crawl/Index; Transfer; Transform; Catalog; Search; Verify; Administrative; COO/DR; Enforce Policies; Archive Tools; DB Admin; Config Mgmt; Sync Data/Video; Spatio-temporal; Ontologies; MPP Programming and Execution Engine; C/R/U/D Consistency; Existing Computers; Pipeline; Workflow; Range Protocols; TENA Data Lifecycle; Workflow; Create Software; IDE; SDK.]
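The architecture names Verify, Transform, Add Metadata, and Index among its Extract-Transform-Load steps. The sketch below chains four such steps over one small file to show how they might fit together. It is a toy under stated assumptions: the newline-delimited JSON format, the SHA-256 checksum, the `source` tag, and every function name are invented for illustration, and the Warehouse step is left out.

```python
import hashlib
import json


def verify(raw_bytes, expected_sha256):
    """Verify: reject a file whose checksum does not match the recorded one."""
    if hashlib.sha256(raw_bytes).hexdigest() != expected_sha256:
        raise ValueError("checksum mismatch")
    return raw_bytes


def transform(raw_bytes):
    """Transform: parse newline-delimited JSON records (an assumed file format)."""
    lines = raw_bytes.decode("utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]


def add_metadata(records, source):
    """Add Metadata: tag every record with where it came from."""
    return [{**rec, "source": source} for rec in records]


def build_index(records, key):
    """Index: map each value of `key` to the positions of the records carrying it."""
    index = {}
    for pos, rec in enumerate(records):
        index.setdefault(rec[key], []).append(pos)
    return index


# Hypothetical three-record file; the checksum stands in for one recorded at capture.
raw = b'{"sensor": "wow", "t": 0}\n{"sensor": "egt", "t": 0}\n{"sensor": "wow", "t": 1}\n'
checksum = hashlib.sha256(raw).hexdigest()

records = add_metadata(transform(verify(raw, checksum)), source="flight_001")
index = build_index(records, key="sensor")
print(index)
```

Because each step takes the previous step's output, files from several aircraft can be pushed through the same chain independently, which is what makes parallel ingest possible.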
Security Architecture: Notional MILS and CDS
[Figure: Regional Analytics Capability. Four storage stacks, each with Long-Term Storage, Med-Speed, and High-Speed tiers. Labels: Classification C; Classification B; Classification A; MLS Database; Enterprise Big Data Analysis; MILS-CDS.]
Help Wanted: DoD Data Scientists
• Unique DoD data challenges require an interdisciplinary approach with skills & analytical techniques required from 3 broad areas:
• Statistics
– Especially Bayesian statistics with multivariate analysis
– Knowledge of probability, distributions, hypothesis testing, and multivariate analysis
• Computer Science
– Databases, SQL, data structures, algorithms, parallel computing, distributed computing, etc.
• Subject Matter Expertise
– Ability to assess which models are feasible, desirable, and practical in different settings
– Clear idea of the distinction between correlation and causality
“Proliferation of sensors and large data sets are overwhelming analysts, as they lack the tools to efficiently process, store, analyze, and retrieve vast amounts of data”
ASD(R&E) Department of Defense Research & Engineering website
[Figure: Venn diagram. Labels: Data Science; Computer Science; Machine Learning; Math and Statistics; Traditional Software; Traditional Research; Subject Matter Expertise; Big Data Analytics.]
TRMC Investments Support Joint Strike Fighter Needs
[Figure: labels recovered from the diagram, in extraction order: Realistic Distributed Mission Environments; On-Board Instrumentation; CRIIS; Miniaturized Data Capture; QRIP On Board JSF; Model Validation & Improvement; Interoperable With JSE; TENA; LVC Integration; Next-Generation Threats; NCR; Cyber T&E; EWIIP; EW; Analysts, Evaluators, & Decision-Makers; Data Ingest & Validation; RAPIDS; Big Data Analytics & Evaluation; JSF-KM; KM Data Archive; MLS-JCNE; Cross Domain Solutions; MILS Network; JMETC.]
JSF T&E KM & Data Needs Addressed by TRMC
1. Data Capture: DART Pod is too large, requires significant jet modifications, and is not certified to support F-35 full operational profile
2. Data Warehousing: Flight test data should be stored in a government facility to expedite data access & discovery
3. Data Ingest: Current DART Pod test data ingest is too slow to meet multi-ship quick-look and quick-turn requirements
– examples: 2 on 2; 4 turn 2; 4 on 4 turn 4 on 4
4. Data Access: Test data should be available for quick-look analysis during mission debrief to inform decision making
5. Video: DART Pod video should be available for quick-look analysis during mission debrief to inform decision making
6. Big Data Analytics: Analysis capabilities need to proactively identify “unknown unknowns” and other anomalies impossible for a human to discern
7. Remote Operations: Analysts need a rapid reaction capability to harvest data and conduct quick-look analyses in situations / locations where a network connection is not possible
JSF-KM Improvements to Existing T&E Capabilities
[Figure: timeline comparison of DT Today, OT Today, and With JSF-KM. Note: Numbers reflect single 2 hour flight mission. Stages shown: Data Ingest (Parallel Data Ingest with JSF-KM); Raw Data Available; Video/Data at Post-Mission Debrief; Big Data Analytics; Govt. Analyst Data Request; Analysis. Durations shown, in extraction order: 30 minutes (multiple aircraft); 2 hours (per aircraft), 1 day, 1 week; 30 seconds; 1-2 hours (per aircraft), 10 minutes, 4-5 hours; 30 seconds; 90 minutes. Annotations: Data Ready for Use @ (Govt); Data Ready for Use @ LM; >20 weeks of data available online; Data Ready for Use @ (Govt), 3 weeks of data available online.]
Sample JSF-KM Success Stories
• Identified flights which experienced propulsion component failure
– During a blind analysis of 1,392 flights of propulsion data, JSF-KM data scientist was able to identify 7 of 10 flights with engine issues known to JSF analysts
– Led to creation of a predictive model* for identifying future failures (*model validation pending)
– Without JSF-KM this predictive model may not have been generated
• Video available during post-mission debrief due to JSF-KM data ingest improvements from DART Pod
– Existing tools could not process video in time to support post-mission debrief
– Without JSF-KM, there would be no flight video during post-mission debrief
• Discovery of avionics box issue during first night mission
– Pilot and Analyst discovered problem from video data available 30 minutes after landing
– Avionics Box was replaced before another mission was flown
– Without JSF-KM, problem would not have been discovered for several days
• Reduced data profile time from 5+ hours to 47 seconds per Query
– Big Data tool enabled massive improvement to data profile generation
– Without JSF-KM it would still take 5+ hours to perform data profile runs
• 9 hour routine analysis process reduced to 23 milliseconds
– Patuxent River system drastically reduced routine MATLAB analysis process from 9 hours to 23 milliseconds prior to KM system even being fully deployed
– Patuxent River leadership already identifying other airframes which could use the system
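For a sense of scale, the figures quoted above can be turned into ratios. The short calculation below uses only numbers stated on the slide; because the data profile figure is given as "5+ hours", its ratio is a lower bound. The `speedup` helper and variable names are illustrative.

```python
def speedup(before_s, after_s):
    """How many times faster the new process is than the old one."""
    return before_s / after_s


HOUR_S = 3600.0

profile_speedup = speedup(5 * HOUR_S, 47.0)    # data profile: 5+ hours -> 47 seconds
matlab_speedup = speedup(9 * HOUR_S, 0.023)    # routine analysis: 9 hours -> 23 ms
found_fraction = 7 / 10                        # known-issue flights found in the blind analysis

print(f"data profile: at least {profile_speedup:.0f}x faster")
print(f"routine analysis: about {matlab_speedup:,.0f}x faster")
print(f"known engine-issue flights identified: {found_fraction:.0%}")
```

That works out to roughly 380 times faster for the data profile query and about 1.4 million times faster for the routine analysis.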
Big Data Initiative Summary
• TRMC is acting upon recommendations from the Comprehensive Review of T&E Infrastructure. Strategic Goals:
– Integrate T&E infrastructure into cohesive Knowledge Management enterprise
– Modernize T&E practices & processes to leverage Big Data analytics techniques
– Apply Big Data analytics tools & techniques to the T&E mission space
• TRMC-funded proofs of concept are delivering proven capabilities
– Enabling Big Data analytics for JSF T&E
– Improving transfer of knowledge between fielded and next-gen systems
– Informing an investment roadmap that advises future infrastructure, process, and workforce decision-making
• Big Data Architecture Reference Document (ARD) will ensure interoperability and efficiencies in next-generation range knowledge management
– Big Data ARD will be standardized through JMETC Configuration Review Board (JCRB)
– https://www.tena-sda.org/display/BDKM/Documentation
TRMC will consider additional pilots that continue to expand big data analytics in acquisition
JMETC Points of Contact (POCs)
JMETC Program Manager: George Rumford, (571) 372-2724
TENA Software Development Activity Director: Ryan Norman, (571) [email protected]
National Cyber Range Complex Director: AJ Pathmanathan, (571) [email protected]
JMETC Information Assurance Lead: Robin Deiulio, (540) [email protected]
Event Scheduling / Event Questions
– Interoperability Events: Keith Poch, (850) [email protected]
– Cyber Events: Lizann Messerschmidt, (571) [email protected]
Help Desk
– Action Items, Questions, Tasks, Software Needs, Bug Reports: https://www.tena-sda.org/helpdesk
Connectivity / Network Questions
– JMETC MILS Network (JMN): Ben Wilson, (757) [email protected]
– JMETC Secret Network (JSN): Jeff Braget, (850) [email protected]
NCRC Expansion / Site Questions
– NCRC, Deputy Director: Rob Tamburello, (501) [email protected]
TENA Products / Software Repository
– TENA Software Development Manager: Steve Bachinsky, (703) [email protected]
Range Support and Training
– TENA User Support Manager: Gene Hudgins, (850) [email protected]
Miscellaneous Questions
– For JMETC questions: [email protected]
– For TENA questions: [email protected]
Websites
– Unclassified, FOUO, DoD-Restricted (CAC required): https://www.trmc.osd.mil
– Distribution A, Industry, non-DoD (username/password required): https://www.tena-sda.org
JTEX-03: August 21-23, 2018; Orlando, FL