Federal Big Data and Cognitive Metadata
- Goodier
Agenda
Federal big data is enhanced by cognitive metadata
1. Clearly understanding the paradigm shift
2. Review of Security and Privacy implications for the federal government
3. Cyber Threat
4. Cognitive Metadata solution
The Internet was built without a way to know who or what you were connecting to
– Federal internet service providers work around this with a patchwork of identity security controls and NIAP certifications
– No fair blaming the user – no framework, no cues, no control
1. Balancing the Cyber Big Data equation
2. Safeguarding and Sharing Information
• “One of the biggest questions is how to evolve the risk management model. What is secure enough and agile enough to support the mission? Security, agility, and transparency decisions are driven by mission priorities.”
– Major Linus J. Barloon II, Chief, J3 Cyber Operations Division at White House Communications Agency
http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-53r4.pdf
2. Safeguarding and Sharing Information
“For example, the United States Government Accountability Office (GAO) aggregates data from many agencies.
Recognizing the inherent risks, GAO sets up discrete network enclaves that are distinct from their agency-wide network, for Big Data. It assigns appropriate levels of security to each enclave driven by the sensitivity of the data therein.
• Other agencies note they ensure Big Data is stripped of personally identifiable information (PII) before it leaves the originating agency’s control.
• Data aggregation needs will expand as more elements of the critical infrastructure adopt increased cyber protection and detection capabilities that will drive enhanced data/information sharing.”
– www.meritalk.com, Beacon Report, “Balancing the Cyber Big Data Equation”
Federal agencies are required by law (e.g., the Privacy Act of 1974) to give notice to individuals, when collecting information from them, of the authority, purpose, and uses of PII when such data will be maintained as agency records that will be retrieved by individual name or other identifier.1
When agencies use a Web site to collect or share data, agencies must post a privacy policy, as required by Section 208 of the E-Gov Act and OMB guidance.2
In all cases, privacy notices must be prominent, salient, clearly labeled, written in plain language, and available at all locations where notice is needed.
2. Synopsis of Security and Privacy for Federal Big Data
Over time, agencies, digital developers, and data users may also create, discover, or propose new and innovative ways to combine, share, or otherwise leverage the power of the digital data and content collected or disseminated by their digital services or programs. If data will be re-combined, used or shared in ways that individuals did not originally contemplate or expect, agencies must consider the need, under applicable law or policy, to provide such individuals with additional or updated notice of their privacy rights and choices.4
In determining precisely when, where, and how to give such notice, agencies, their digital developers, and partners will need to exercise creativity and ingenuity to ensure that required notices are clearly communicated to individuals at the right time and place, and in the right manner, without unduly interfering with the user experience. The timing and format of such notices may need to vary, depending on the digital or mobile platform involved. 5
2. Review: Federal Big Data is different from industry
https://it.ojp.gov/default.aspx?area=privacy&page=1295
2. Federal Big Data today
http://www.google.com/intx/en/enterprise/apps/government/products.html?section=drive
https://explore.data.gov/
http://catalog.data.gov/harvest
Privacy advocates are concerned about the threat to privacy represented by increasing storage and integration of personally identifiable information; expert panels have released various policy recommendations to conform practice to expectations of privacy.[99][100][101]
Cognitive metadata are sets of innovative privacy-enhancing technologies which enable new techniques for data analytics that minimize costs to privacy.
2. FedRAMP certified commercial clouds
[Figure: cloud service stack]
Data/Compute Storage/Metadata Utility/Networking Content Delivery
• Software-platform-as-a-service
• App-components-as-a-service
• Virtual-Infrastructure-as-a-Service
• Shared physical resources
• Physical infrastructure
Data Intensive: Amazon Hadoop, Public Data Sets, Simple DB
Google App Engine
GCDS Akamai
GOV CLOUD certified government clouds
2. Before clouds swallowed the enterprise, Gov met requirements with defined EA structures
Pattern - Use Case Focus
• Internal
– Fine grained access control to data
– Auditing, etc.
• Participant
– User to service interaction
– Service to service interaction
• Sub-Enterprise
– Share security & infrastructure
– Operations
– Certification and Accreditation inheritance
• Super-Enterprise
– Enterprise alignment of sub-enterprises
– Federation
[Figure: EA Notional Pattern Schematic. Labels: Mission Service, Data, Security, Sub-Enterprise, Dept/Agency, Network, Enterprise alignment – trust, credentials, Federation]
Source: H Reed, DoD Multi Service SOA team
2. EA Privacy & Security focused on message exchange – NIEM 3.0 – and dissemination labels
[Figure: EA scope levels (Super-Enterprise, Sub-Enterprise, Participant, Internal) against Operational, Management, Programmatic, and Federation concerns]
According to the Multi-Service SOA community:
– Focus of DoD/IC Security is primarily at the “participant” and “operational” level.
– Implication is that most Service Oriented security discussion will be at this level.
GOVERNANCE
SHARED SECURITY
MESSAGE EXCHANGE
Our typical security focus was here
SERVICE CODE
Unfortunately that leaves lots of gray area for data spills!
https://www.niem.gov/training/Pages/train.aspx
2. Example: NIEM and NISS Message Exchanges
Each encounter describes an interaction with a person-of-interest (POI). A POI is one who possesses an identity that is associated with derogatory information residing in a system-of-record (SOR) containing watchlisted individuals. The Encounter specification is designed to convey encounter activity (e.g., who, what, when, where), any watchlist searches performed, and any encounter analysis results for Suspicious Activity Reports (SARs).
Testing PII incident responses at scale
3. PII Incident Federal Use Case at 4V Message scale – what is the worst that can happen?
3. As Federal Big Data apps expand, our data channels grow and our exposure to risk increases
http://www.verizonenterprise.com/DBIR/2013/
3. Federal PII Protections for April 15
• http://www.cnbc.com/id/101496551
Identity thieves are stealing billions of dollars a year through fraudulent tax refunds—and the IRS isn't the only target. The 43 states that collect an income tax are also being flooded with these bogus returns.
3. Risk Exposure goes across Federal Lines of Business
3. Risk Exposure grows as our use of Federal Shared Services grows
[Timeline]
• 1996: Clinger-Cohen
• 2001: Quicksilver
• 2002: E-Government Act
• 2003: E-Gov Initiatives, Initial 25
• 2004: Lines of Business, Initial 5 (HR, GM, FM, FHA, CM)
• 2006: Lines of Business, Round 2 (Geo, BFE, ITI, ISS)
• 2008: E-Gov Initiatives, Round 2 (DAIP, ITDS, IAD-Loans/Grants)
• 2009: Payroll Consolidation Completes
• 2010: Cloud-First
• 2011: GAO Report: Opportunities to Reduce Potential Duplication
• 2011: Shared Services
4. Ensuring adherence to Security and Privacy regulations across identities shared in the federal clouds
• To retain MEANING (aka contextual semantics)
– in loosely coupled, highly flexible
– multi-tenant environments
4. Solutions for the Federal Use Case from Industry
Amazon Fire TV review: the set-top that tries to do everything
ASAP: Advanced Streaming and Prediction
http://www.engadget.com/2014/04/09/amazon-fire-tv-review/
Movies or TV shows are buffered for playback before users hit the play button, the company says; those choices are made by analyzing users’ watch lists and recommendations. As users’ viewing habits change, the caching prediction algorithm will adjust accordingly, and personalization capabilities should get better over time.
http://www.ibmbigdatahub.com/blog/caveat-use-internet-things-behavioral-analytics
4. Solutions for the Federal Use Case from Research
Cognitive metadata: Advanced Streaming and Prediction for improved regulatory and incentive performance
Caching prediction algorithms will adjust according to risk exposure, and personal information protection capabilities should get better over time.
4. Metadata solutions shared across government at the new scale of IT
• Federal Risk and Authorization Management Program – FedRAMP
1. Align budget and acquisitions with the technology cycle;
2. improve program management;
3. streamline governance and increase accountability;
4. increase engagement with the IT community; and
5. adopt lighter technologies and shared solutions--including the adoption of a "cloud-first" policy.
– www.cio.gov
4. What is the Cognitive Metadata Solution
…cognitive metadata (i.e. metadata coming from our perception, reasoning, or intuition such as preference for a type of content), which is very useful for personalization purposes and conversely for limiting PII incidents.
Personalities and personas
We protect the personal identifying information of people that link to us, and protect what they’re interested in, so we identify and encrypt the following:
• What does this person care about?
• What are the types of things they’ll respond to?
• What’s the value-add our content offers them?
• What are their turn-ons and turn-offs?
Initially this is a mostly qualitative process, since we're manually reviewing the data. It's not a perfect science, but it does benefit from information sharing patterns that build the cognitive metadata repository to ultimately improve automated reasoning.
4. Cognitive Metadata tagging landscape
4. Federal Use Case and Cognitive Metadata
• http://en.wikipedia.org/wiki/Sensitivity_and_specificity
Imagine a study evaluating a new test that screens people for a disease. Each person taking the test either has or does not have the disease. The test outcome can be positive (predicting that the person has the disease) or negative (predicting that the person does not have the disease). The test results for each subject may or may not match the subject's actual status. In that setting:
– True positive: Sick people correctly diagnosed as sick
– False positive: Healthy people incorrectly identified as sick
– True negative: Healthy people correctly identified as healthy
– False negative: Sick people incorrectly identified as healthy
In general, positive = identified and negative = rejected. Therefore:
– True positive = correctly identified
– False positive = incorrectly identified
– True negative = correctly rejected
– False negative = incorrectly rejected
Cognitive metadata identifies PII in the context of this study so individuals involved can be protected.
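The four outcomes above can be tallied and turned into sensitivity and specificity. The following is a minimal sketch, assuming a PII detector whose output is compared against hand-labeled truth; the names `ConfusionCounts`, `tally`, `sensitivity`, and `specificity` are illustrative and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    true_pos: int   # correctly identified
    false_pos: int  # incorrectly identified
    true_neg: int   # correctly rejected
    false_neg: int  # incorrectly rejected


def tally(actual, predicted):
    """Count the four outcomes from parallel lists of booleans."""
    counts = ConfusionCounts(0, 0, 0, 0)
    for truth, guess in zip(actual, predicted):
        if truth and guess:
            counts.true_pos += 1
        elif not truth and guess:
            counts.false_pos += 1
        elif not truth and not guess:
            counts.true_neg += 1
        else:
            counts.false_neg += 1
    return counts


def sensitivity(c):
    """Share of actual positives that were identified."""
    return c.true_pos / (c.true_pos + c.false_neg)


def specificity(c):
    """Share of actual negatives that were rejected."""
    return c.true_neg / (c.true_neg + c.false_pos)


# Hypothetical labels: does each record really contain PII, and was it flagged?
contains_pii = [True, True, False, False, True]
flagged = [True, False, False, True, True]
counts = tally(contains_pii, flagged)
print(counts, sensitivity(counts), specificity(counts))
```

For PII protection, a false negative (PII that was not flagged) is usually the costlier error, so sensitivity is the figure to watch first.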
4. Federal Use Case and Machine Learning
• http://en.wikipedia.org/wiki/AdaBoost
• Problems in machine learning often suffer from the curse of dimensionality: each sample may consist of a huge number of potential … and evaluating every feature can reduce not only the speed of classifier training and execution, but in fact reduce predictive power....
• Unlike neural networks and SVMs, the AdaBoost training process selects only those features known to improve the predictive power of the model, reducing dimensionality and potentially improving execution time as irrelevant features do not need to be computed.
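The feature-selection effect described above can be seen in a small sketch of AdaBoost with one-feature decision stumps. This is an illustration under stated assumptions, not the deck's implementation: labels are -1/+1, the weak learner is an exhaustive stump search, and all function names and the toy data are hypothetical.

```python
import math


def stump(row, feature, threshold, polarity):
    """Weak learner: vote `polarity` when the feature reaches the threshold."""
    return polarity if row[feature] >= threshold else -polarity


def train_adaboost(X, y, rounds):
    """X: list of numeric feature vectors; y: labels in {-1, +1}."""
    n = len(X)
    weights = [1.0 / n] * n
    model = []  # list of (alpha, feature, threshold, polarity)
    for _ in range(rounds):
        best = None
        # Pick the single-feature stump with the lowest weighted error.
        for feature in range(len(X[0])):
            for threshold in sorted({row[feature] for row in X}):
                for polarity in (1, -1):
                    err = sum(w for w, row, label in zip(weights, X, y)
                              if stump(row, feature, threshold, polarity) != label)
                    if best is None or err < best[0]:
                        best = (err, feature, threshold, polarity)
        err, feature, threshold, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, feature, threshold, polarity))
        # Raise the weight of misclassified samples, lower the rest.
        weights = [w * math.exp(-alpha * label * stump(row, feature, threshold, polarity))
                   for w, row, label in zip(weights, X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def predict(model, row):
    score = sum(alpha * stump(row, f, t, p) for alpha, f, t, p in model)
    return 1 if score >= 0 else -1


def selected_features(model):
    """Features the trained model actually needs at prediction time."""
    return sorted({feature for _, feature, _, _ in model})


# Toy data: only feature 0 separates the classes; features 1 and 2 are noise.
X = [[1, 5, 0], [2, 3, 1], [3, 5, 0], [6, 3, 1], [7, 5, 0], [8, 3, 1]]
y = [-1, -1, -1, 1, 1, 1]
model = train_adaboost(X, y, rounds=3)
print(selected_features(model))
```

Features that never appear in `selected_features(model)` do not have to be computed when the classifier runs, which is the dimensionality reduction the bullet refers to.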
4. Current State of Language Technology
mostly solved
• Spam detection
– Let’s go to Agra! ✓
– Buy V1AGRA … ✗
• Part-of-speech (POS) tagging
– Colorless green ideas sleep furiously. (ADJ ADJ NOUN VERB ADV)
• Named entity recognition (NER)
– Einstein met with UN officials in Princeton (Einstein: PERSON, UN: ORG, Princeton: LOC)
making good progress
• Sentiment analysis
– Best roast chicken in San Francisco!
– The waiter ignored us for 20 minutes.
• Coreference resolution
– Carter told Mubarak he shouldn’t run again.
• Word sense disambiguation (WSD)
– I need new batteries for my mouse.
• Parsing
– I can see Alcatraz from the window!
• Machine translation (MT)
– 第13届上海国际电影节开幕…
– The 13th Shanghai International Film Festival…
• Information extraction (IE)
– You’re invited to our dinner party, Friday May 27 at 8:30
– Party, May 27, add
still really hard
• Question answering (QA)
– Q. How effective is ibuprofen in reducing fever in patients with acute febrile illness?
• Paraphrase
– XYZ acquired ABC yesterday
– ABC has been taken over by XYZ
• Summarization
– The Dow Jones is up; The S&P500 jumped; Housing prices rose
– Economy is good
• Dialog
– Where is Citizen Kane playing in SF?
– Castro Theatre at 7:30. Do you want a ticket?
Big Data works well
4. Cognitive metadata employs predictive algorithms from Big Data Machine Learning combined with Natural Language Processing
Cognitive metadata uses a three-step management process that translates Policy documents into formal policy rule sets that computers can understand and evaluate.
1. Policy documents are translated into digital policies, using Natural Language Processing technologies.
2. Policy deconfliction ensures consistency and operational desirability. Automated deconfliction, using Turing methods and Theorem Proving Techniques that work with the constructs defined in XML, delivers active models of the resulting policy via a Policy Based Tool GUI. DPM delivers this new user interface to data stewards and Foreign Disclosure Officers (FDOs), giving them total control over both the design and the approval of the resulting model. Then the human-approved set of deconflicted digital policies is translated into standard QOS policy-labeled services.
3. Digital policies are defined in a computer interpretable language which is also friendly to humans.
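To make the deconfliction step concrete, here is a minimal sketch that assumes step 1 has already turned policy documents into simple rules. It does not use the theorem-proving methods named above; it substitutes a plain "deny overrides" resolution, and the `Rule` fields, function names, and sample rules are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    source: str      # policy document the rule was derived from
    role: str        # who is asking
    data_label: str  # what kind of data is requested
    effect: str      # "permit" or "deny"


def find_conflicts(rules):
    """Return (role, data_label) pairs that have both permit and deny rules."""
    effects = defaultdict(set)
    for rule in rules:
        effects[(rule.role, rule.data_label)].add(rule.effect)
    return sorted(key for key, seen in effects.items() if len(seen) > 1)


def deconflict(rules):
    """Resolve conflicts with a deny-overrides strategy (an assumption)."""
    resolved = {}
    for rule in rules:
        key = (rule.role, rule.data_label)
        if resolved.get(key) != "deny":
            resolved[key] = rule.effect
    return resolved


rules = [
    Rule("Policy A", "analyst", "PII", "permit"),
    Rule("Policy B", "analyst", "PII", "deny"),
    Rule("Policy A", "fdo", "PII", "permit"),
]
print(find_conflicts(rules))  # pairs a data steward or FDO would review
print(deconflict(rules))
```

In the process described above, the conflict list is what a human approver would see before the resolved rule set is published.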
4. How cognitive metadata works
• Regular expressions (regex) play a surprisingly large role
– Sophisticated sequences of regular expressions are often the first model for any text processing task
• For many hard tasks, we use machine learning classifiers
– But regular expressions are used as features in the classifiers
– Can be very useful in capturing generalizations
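Both roles of regular expressions can be sketched in a few lines: as a first-pass PII model that redacts matches, and as a source of count features for a downstream classifier. The patterns below are illustrative assumptions only (US-style SSN, email, and phone formats); they are not a complete or authoritative definition of PII.

```python
import re

# Hypothetical first-pass patterns; real deployments need far broader coverage.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def regex_features(text):
    """Match counts per pattern, usable as features for a classifier."""
    return {name: len(pattern.findall(text))
            for name, pattern in PII_PATTERNS.items()}


def redact(text):
    """Replace every match with a tag naming the kind of PII found."""
    for name, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{name.upper()}]", text)
    return text


sample = "Contact jane.doe@example.gov or 202-555-0143; SSN 123-45-6789."
print(regex_features(sample))
print(redact(sample))
```

The feature dictionary can be fed to a learned classifier, such as a boosted model, so that the regexes capture the easy generalizations and the classifier handles the ambiguous cases.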
4. Cognitive Metadata is a result of data science
[Figure: data science Venn diagram. Circles: Hacking Skills, Math & Statistics Knowledge, Substantive Expertise. Overlaps: Machine Learning, Traditional Research, Danger Zone!, Data Science]
Convergence
Predictions that enhance machine learning fueled by knowledge at the Intersection of Our Digital Lives
4. What are some applications of Cognitive Metadata
– Machine Learning
– Question Answering: IBM’s Watson
– Paraphrase
– Summarization
– Information Extraction
– Sentiment Analysis
– Machine Translation
– Coreference resolution
– Word Sense disambiguation
– Parsing
– SPAM detection
– Part Of Speech parsing
– Named entity recognition
4. Cognitive Metadata provides automated reasoners for Federal PII policy adherence at scale
[Figure: policy-based access control architecture. Components: Attribute Service (AS); Certificate Validation Service (CVS); CERT; Metadata Service; Cognitive Metadata; Smart Data; Policy Decision Point (PDP); Repository; Policy Administration Point (PAP); Context Handler; Policy Information Point (PIP); Policy Decision Service (PDS); Ozone Widget Framework; Audit Service; Cloud Gateway; Policy Enforcement Point (PEP). Actors: IT Support Team; Data Producer; USER Team Member. Credentials: NPE Cert; Soft Cert. Flow: Access Request (X.pdf, Secure Map, Reason, Location not relevant to data), resulting in Valid Access or Invalid No Access.]
Because the Federal government has No shortage of policy…
• SCAP does NOT resolve security needs for SA when we are OUTSIDE the NETWORK.
No shortage of governance…
No shortage of standards…
But people drive standards and policy. People do not move at Cyber speed. People need cognitive metadata and data to support decision-making.
Data-driven situational awareness augments governance.
Codifying federal big data decisions
But knowing this is still a challenge …
Using Cognitive metadata Rules engines
Organize by Mission
We divide and conquer the complexity of regulatory compliance by codifying big data relationships by mission, to maintain situational awareness of all known risk mitigations and waivers.
You can apply data and metadata according to the mission’s specific risk profile and known standards and waivers.
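One way to picture codifying compliance by mission is a per-mission record of required controls, known mitigations, and waivers, from which the remaining open risks fall out. This is only a sketch: the `MissionProfile` structure is hypothetical, and the control identifiers are used as placeholders in the style of NIST SP 800-53.

```python
from dataclasses import dataclass, field


@dataclass
class MissionProfile:
    mission: str
    tier: str  # e.g. "Enterprise", "Regional", or "Local"
    required_controls: set = field(default_factory=set)
    mitigated: set = field(default_factory=set)
    waived: set = field(default_factory=set)

    def open_risks(self):
        """Controls that are required but neither mitigated nor waived."""
        return sorted(self.required_controls - self.mitigated - self.waived)


# Hypothetical mission with placeholder control identifiers.
payroll = MissionProfile(
    mission="payroll",
    tier="Regional",
    required_controls={"AC-2", "AU-6", "SC-8"},
    mitigated={"AC-2"},
    waived={"SC-8"},
)
print(payroll.open_risks())
```

Keeping one such record per mission gives each area of responsibility its own view of outstanding risk without re-evaluating every standard for every dataset.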
MISSION Area Of Responsibility
Tier 1: Enterprise
Tier 2: Regional
Tier 3: Local
To perform Continuous Monitoring
Why Cognitive Metadata?
• Cognitive metadata provides the answers you need when
– Sorting through millions of data items to pinpoint key PII incidents that may be crucial.
– By including sophisticated semantic analytics, it vastly reduces the time and budget that might otherwise be needed for a substantive analysis of the regulatory compliance for any set of records.
Cognitive metadata maps the right Context to the right Policy as an ASAP-style service
Major Categories of Content requiring Unique Identification
Columns: Intelligence Category; Focus (Intel Users); Objects of Analysis; Reporting Cycle
• Strategic or National Intelligence
– Focus (Intel Users): Understanding of current and future status and behavior of foreign nations. Estimates of the state of global activities. Indications and warnings of threats. (National policymakers)
– Objects of Analysis: Foreign policy; Political posture; National stability; Socioeconomics; Cultural ideologies; Science and technology; Foreign relationships; Military strength, intent
– Reporting Cycle: Infrequent (annual, monthly) long-duration estimates and projections (months, years); Long-term analyses (months, years); Frequent status reports (weekly, daily)
• Military Operational Intelligence
– Focus (Intel Users): Understanding of military powers, orders of battle, technology maturity, and future potential. (Military commanders)
– Objects of Analysis: Orders of battle; Military doctrine; Science and technology; Command structure; Force strength; Force status, intent
– Reporting Cycle: Continually updated status databases (weekly); Indications and warnings (hours and days); Crisis analysis (daily, hourly)
• Military Tactical Intelligence
– Focus (Intel Users): Real-time understanding of military units, force structure, and active behavior (current and future) on the battlefield. (Warfighters)
– Objects of Analysis: Military platforms; Military units; Force operations; Courses of action (past, current, potential future)
– Reporting Cycle: Weapon support (real-time: seconds to hours); Situation awareness applications (minutes, hours, days)
Cognitive metadata helps support Computer Network Defense (CND) data.
Cognitive metadata supports Executive Order (EO) 13587 for rapid response to Insider Threat.
Cognitive metadata helps support dynamic data for audit event management.
[Figure: CND service architecture. Labels: AVTR; Vulnerability WS; IAVM WS; CND User & Agent; IAVM; NVD; Service Discovery; CND Portal; Geocoding WS; Web Mapping WS; Asset; Event; Vul.; PROM; Asset WS]
SCAP Standards & CND Schemas Used
Cognitive metadata = PII protection as a service