Business Questions
Success Strategy
Project Steps
Technical Solution
Analytic Requirements
Results
Business Application
Lessons Learned
Business Questions
How well has the Office of the Inspector General (OIG) fulfilled its mission?
How can the OIG prioritize final rule reviews?
• Did common terms in public comments appear in final rules?
• What sentiment did public comments express?
Success Strategy
Sizing the Project
• Data – Available, Processable, Standardized
• Security Concerns – factor in information security governance
Seeking an Executive Champion
• Do they support the answer value?
• To what extent will they fund the project (budgetary considerations)?
Repeating a Quick Win
• Is the project repeatable to gain support for subsequent projects?
Engaged management buy-in for questions
Assessed security concerns for public-facing data
Contracted technical support and quantitative and qualitative statistical expertise
Used Amazon Web Services for infrastructure support
Used Amazon Marketplace for selecting text mining tool
Documented repeatable technical tasks
Project Steps
Technical Solution
Presenter
Presentation Notes
Good morning, and welcome to the PAWGOV conference. My name is Antuane Allen. I’m an analyst with Sanametrix, a certified AWS partner contracted to provide technical guidance. To address the business questions, two solutions were used: MarkLogic, selected from within the AWS Marketplace, and IBM AlchemyAPI, from outside the AWS Marketplace. Using Amazon Web Services to host a MarkLogic implementation, together with the AlchemyAPI service from IBM Bluemix, large disconnected data sets were processed through the REST API endpoints available from each service.
MarkLogic – platform enabled ability to parse unstructured text and calculate term frequencies
Term Frequency Normalization – where N is equal to the total number of terms within a document or set of documents
$tf(t) = \frac{w_{it}}{N}$
Gap Concept – differences between normalized frequencies of baseline terms and corpus documents
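The normalization and gap ideas on this slide can be illustrated with a minimal Python sketch. It is not the MarkLogic implementation used in the project. It assumes the numerator of the term frequency is the raw count of a term, that N is the total number of terms in the text, and that the gap is the corpus frequency minus the baseline frequency (the direction implied by the presenter notes). The function names `normalized_tf` and `term_gaps` and the sample texts are hypothetical.

```python
from collections import Counter
import re


def normalized_tf(text):
    """Return {term: count / N}, where N is the total number of terms in text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {term: count / total for term, count in counts.items()}


def term_gaps(baseline_terms, baseline_text, corpus_text):
    """Gap per baseline term: corpus frequency minus baseline frequency (assumed direction)."""
    base_tf = normalized_tf(baseline_text)
    corpus_tf = normalized_tf(corpus_text)
    return {
        term: corpus_tf.get(term, 0.0) - base_tf.get(term, 0.0)
        for term in baseline_terms
    }


# Illustrative texts only, not project data.
baseline = "risk risk audit fraud"
corpus = "audit audit fraud waste"
gaps = term_gaps(["risk", "audit", "fraud"], baseline, corpus)
print(gaps)
```

Under this convention a negative gap means the term is relatively more prominent in the baseline than in the corpus.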
Analytic Requirement #1
Presenter
Presentation Notes
A challenge in addressing the business questions was how to take unstructured text data and structure it in a format that would be useful for drawing insight. Business question 2a asks, “Did common terms in public comments appear in final rules?” To address this, an application was needed that could calculate word counts from text data and work with different document formats. After evaluating several options, MarkLogic was determined to be the appropriate solution. Using MarkLogic, each final rule document was parsed and a baseline set of terms was compiled from the most frequently occurring terms. Stop terms (prepositions and other terms that occur frequently but carry no semantic importance) were excluded from the calculation. Each baseline term has two associated term frequencies (TF), including the TF of that term in the document. The next step was refining the baseline: after the baseline set was compiled, terms were evaluated for their importance and either kept or removed at the discretion of the subject matter experts.
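The baseline steps described in these notes (parse, drop stop terms, keep the most frequent terms, then let subject matter experts prune) can be sketched in plain Python. This is only an illustration and does not use MarkLogic. The stop-term list, the `top_n` cutoff, and the `rejected` set standing in for expert review are all assumptions, and `build_baseline` is a hypothetical name.

```python
from collections import Counter
import re

# A tiny illustrative stop-term list; a real one would be far longer.
STOP_TERMS = frozenset({"the", "of", "and", "to", "in", "a", "for", "on", "by", "with"})


def build_baseline(rule_text, top_n=10, stop_terms=STOP_TERMS, rejected=frozenset()):
    """Return the most frequent non-stop terms, minus any the reviewers rejected."""
    tokens = [
        token
        for token in re.findall(r"[a-z]+", rule_text.lower())
        if token not in stop_terms
    ]
    candidates = [term for term, _ in Counter(tokens).most_common(top_n)]
    # Expert refinement step: drop terms judged unimportant.
    return [term for term in candidates if term not in rejected]


# Illustrative sentence only, not text from a final rule.
sample = "The swap dealer reports the swap to the registry, and the swap dealer confirms"
print(build_baseline(sample, top_n=2))
print(build_baseline(sample, top_n=2, rejected={"dealer"}))
```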
OIG Standards of Work
Business Question: How well has the Office of the Inspector General (OIG) fulfilled its mission?
Answer: OIG could improve its standards of audit work.
[Chart: Gap by Baseline Terms; Gap axis from -0.04 to 0.1]
Presenter
Presentation Notes
The objective of the Audit Strategic Planning Model is to examine the gap between the CFTC OIG strategic objectives and its current focus on those objectives, as measured by the topics of its previous OIG audit reports. These topics are expressed through the corpus of unstructured, public-facing, document-based material that the Office of Data and Technology (ODT) selected and provided for this report (www.cftc.gov).
OIG Mission Results
[Chart: Audit Mission Term Gap Analysis; Gap by Baseline Terms; Gap axis from -0.015 to 0.01; labeled term: risk]
“Risk” stood out among the key mission terms. This suggests that the OIG generally balances its workload to meet its mission. Since “risk” is typically associated with “control” work, the OIG has to emphasize either more internal control work or the impact of the work.
Presenter
Presentation Notes
Of the seven key terms associated with the audit strategic mission, only one of the terms, “risk,” had a term frequency that was greater in the baseline documents than in OIG corpus documents. “Efficiency” had the greatest increase in term frequency proportion from the baseline to the OIG corpus, followed by “economy,” “waste,” “abuse,” “effect/effectiveness,” and “fraud.”
Utilize TeamMate software to standardize audit planning and execution
Emphasize internal control risks with project starts
Emphasize the impact associated with business question
Strategic Planning Application
Business Question: Did common terms in public comments appear in final rules?
Answer: Yes, with varying degrees of intensity enabling differentiation.
Generally, most rules had at least 70% of key terms appearing within +/- 1% between the rule document and the public comments. Seven of the 10 rules had a higher percentage of key terms appearing more frequently in the comments than in the rule. Only 2 rules had a higher percentage of key terms appearing less frequently. One rule had key terms appearing proportionally (no difference) between the rule and the public comments.
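One way to arrive at per-rule figures like those above is to summarize the term gaps for each rule. The sketch below is a hedged illustration, not the project's method. It assumes “+/- 1%” means an absolute gap of at most 0.01 in normalized term frequency and that a positive gap means the term is more frequent in the comments. The name `summarize_gaps` and the sample values are hypothetical.

```python
def summarize_gaps(gaps, tolerance=0.01):
    """Return the share of key terms near zero gap, above zero, and below zero."""
    total = len(gaps)
    if total == 0:
        raise ValueError("at least one key term is required")
    values = list(gaps.values())
    return {
        # Share of terms whose gap is within the tolerance band (assumed +/- 0.01).
        "within_tolerance": sum(1 for g in values if abs(g) <= tolerance) / total,
        # Share of terms more frequent in the comments than in the rule.
        "more_frequent": sum(1 for g in values if g > 0) / total,
        # Share of terms less frequent in the comments than in the rule.
        "less_frequent": sum(1 for g in values if g < 0) / total,
    }


# Illustrative gaps only, not project results.
example = {"swap": 0.005, "margin": -0.002, "clearing": 0.03, "dealer": 0.0}
summary = summarize_gaps(example)
print(summary)
```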
IBM AlchemyAPI – Natural Language Processing platform, learning algorithm
Scoring Mechanism – Positive, Neutral, Negative
Sentiment Attributes – Mixed Sentiment
Limitations of Exercise
• Number of Available Comments for Each Rule
• Data Quality – Data Capture, PDF’s, Noise
• Document Level vs Entity Level
• False Positives
Analytic Requirement #2
Presenter
Presentation Notes
The sentiment analysis algorithm works by looking for positive and negative words and then aggregating them to yield an output. The document-level sentiment is output as a score between -1 and +1. A positive score indicates positive sentiment and a negative score indicates negative sentiment; neutral sentiment is scored as zero. Along with the sentiment score, AlchemyAPI also outputs another indicator, called “mixed.” A value of 1 for “mixed” indicates the presence of both positive and negative sentiments in the text.
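The scoring rules in these notes can be turned into per-rule percentages with a short Python sketch. It does not call AlchemyAPI. It assumes each scored comment has already been reduced to a dictionary with a `score` between -1 and +1 and a `mixed` flag of 0 or 1. Those field names, and the functions `label_score` and `summarize_sentiment`, are hypothetical.

```python
from collections import Counter


def label_score(score):
    """Map a document-level score to positive, negative, or neutral by its sign."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def summarize_sentiment(comments):
    """Return the share of comments per label plus the share flagged as mixed."""
    total = len(comments)
    if total == 0:
        raise ValueError("no scored comments to summarize")
    labels = Counter(label_score(c["score"]) for c in comments)
    shares = {name: labels[name] / total for name in ("positive", "neutral", "negative")}
    shares["mixed"] = sum(1 for c in comments if c.get("mixed") == 1) / total
    return shares


# Illustrative scores only, not project data.
scored = [
    {"score": 0.6, "mixed": 0},
    {"score": -0.2, "mixed": 1},
    {"score": 0.0, "mixed": 0},
    {"score": 0.1, "mixed": 1},
]
shares = summarize_sentiment(scored)
print(shares)
```

Raising an error on an empty list keeps rules with no scored comments from being reported as if they had a result.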
Business Question: What sentiment did public comments express?
Answer: The majority of public comments are positive towards
Of the 10 rules, 2 (81FR636 and 75FR55410) did not have enough scored comments to rely on the results with confidence. 75FR55410 was processed with errors by AlchemyAPI because many comments were in an unreadable .pdf format. Six of the other 8 had 50% or more of their comments scored as positive (77FR30596, 77FR20128, 76FR80674, 76FR41398, 76FR43851, 76FR71626). Rule 76FR71626 had the most comments, and overall 68.4% of the 13,782 comments analyzed were scored positive. An independent qualitative review was conducted in which subject matter experts sampled over 1,100 comments from 76FR71626 and found 90% of the comments scored by AlchemyAPI to be accurate. The rule with the highest percentage of negative comments was 77FR42559, with 79.3% of its 1,389 comments scored as negative, followed by 76FR43851, with 44.9% of its 1,135 comments scored as negative. Nearly all rules had 90% or more of their comments scored with mixed sentiment.
Text mining tools, with some limitations, are useful in prioritizing OIG reviews of final rules.
Three rules in the negative quadrants should be considered for further study.
Strategic Planning Application
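[Matrix: Sentiment (rows: Positive, Negative) by Term Frequency Gap (columns: Negative, Positive)]
Positive sentiment row: 77FR20128, 76FR41398, 76FR43851, 76FR71626, 77FR30596
Negative sentiment row: 77FR42559, 76FR53172
Other plotted rules: 75FR55410, 81FR636, 76FR80674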
Presenter
Presentation Notes
The matrix displays the intersections between positive/negative sentiment and positive/negative term frequency gaps for each Dodd-Frank rule used in this analysis. If a rule is listed in the negative “term frequency gap” column, a higher proportion of key terms in that rule were mentioned less frequently in the public comments. If a rule is listed in the positive column, a higher proportion of key terms in that rule were mentioned more frequently in the public comments. For sentiment, a rule listed in the positive row had a higher proportion of positive comments, and a rule listed in the negative row had a higher proportion of negative comments. Four rules had both positive sentiment and an overall higher proportion of key terms appearing more frequently in their public comments. Two rules had more negative sentiment and an overall higher proportion of key terms appearing more frequently in their public comments. One rule had more positive sentiment and an overall higher proportion of key terms appearing less frequently in the public comments. Rule 75FR55410 did not have results for sentiment, but it did have an overall higher proportion of key terms appearing more frequently in the public comments. Rule 81FR636 also did not have results for sentiment, and it had a higher proportion of key terms appearing less frequently in the public comments. Rule 76FR80674 had overall positive sentiment but an equal percentage of terms with positive and negative gaps.
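The placement logic described in these notes can be sketched as a small Python function that returns a rule's sentiment row and term frequency gap column. It is an illustration under assumptions, not the project's code. It takes the four shares as inputs, treats missing sentiment as `None`, and reports a tie as "even". The name `classify_rule` and the sample numbers are hypothetical.

```python
def classify_rule(positive_share, negative_share, more_frequent_share, less_frequent_share):
    """Return (sentiment_row, gap_column) for one rule."""

    def side(higher_is_positive, higher_is_negative):
        # Missing inputs mean the rule has no result on this axis.
        if higher_is_positive is None or higher_is_negative is None:
            return "no result"
        if higher_is_positive > higher_is_negative:
            return "positive"
        if higher_is_positive < higher_is_negative:
            return "negative"
        return "even"

    return (
        side(positive_share, negative_share),
        side(more_frequent_share, less_frequent_share),
    )


# Illustrative shares only, not project results.
print(classify_rule(0.7, 0.2, 0.6, 0.3))    # positive sentiment, positive gap
print(classify_rule(None, None, 0.3, 0.6))  # no sentiment result, negative gap
print(classify_rule(0.6, 0.3, 0.5, 0.5))    # positive sentiment, even gap
```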
Lessons Learned—Success Strategy
Sizing the Project
• Data – Available, Processable, Standardized
• Security Concerns – factor in information security governance
Seeking an Executive Champion
• Do they support the answer value?
• To what extent will they fund the project (budgetary considerations)?
Repeating a Quick Win
• Is the project repeatable to gain support for subsequent