Teaching an Old Log New Tricks with Machine Learning

Krista Schnell, Colin Puri, Paul Mahler, and Carl Dukatz
Accenture Technology Labs, San Jose, California

INDUSTRY PERSPECTIVE. DOI: 10.1089/big.2014.1518. Mary Ann Liebert, Inc. Vol. 2, No. 1, March 2014, Big Data.
Abstract

To most people, the log file would not be considered an exciting area in technology today. However, these relatively benign, slowly growing data sources can drive large business transformations when combined with modern-day analytics. Accenture Technology Labs has built a new framework that helps to expand existing vendor solutions to create new methods of gaining insights from these benevolent information springs. This framework provides a systematic and effective machine-learning mechanism to understand, analyze, and visualize heterogeneous log files. These techniques enable an automated approach to analyzing log content in real time, learning relevant behaviors, and creating actionable insights applicable in traditionally reactive situations. Using this approach, companies can now tap into a wealth of knowledge residing in log file data that is currently being collected but underutilized because of its overwhelming variety and volume. By using log files as an important data input into the larger enterprise data supply chain, businesses have the opportunity to enhance their current operational log management solution and generate entirely new business insights, no longer limited to the realm of reactive IT management but extending from proactive product improvement to defense from attacks. As we will discuss, this solution has immediate relevance in the telecommunications and security industries. However, the most forward-looking companies can take it even further. How? By thinking beyond the log file and applying the same machine-learning framework to other log file use cases (including logistics, social media, and consumer behavior) and any other transactional data source.
The Problem
At the most basic level, log files are a record of events written by software as it runs. As a common source of machine data across all industries, they are usually written in plain text with a timestamp. Unfortunately, however, the syntax and semantics of log files are application- or vendor-specific, so they can be hard to understand or interpret. If log files have not been designed well, they can lack context and potentially be misleading. Since companies produce massive amounts of log files every day from their various software applications, they are difficult to monitor closely. Furthermore, it is difficult to form a comprehensive picture of what these log files from various source systems mean as a whole.
Most large companies today use operational log management software for the primary purpose of IT management. Typically, operational log management software is set up according to rules defined by a network administrator. These rules may be chosen according to recommendations by the software itself or customized to respond to recognizable patterns that correspond to specific issues. Since no one can
reasonably be expected to inspect all log files as they are generated, companies usually only consult them once a rule has been broken or a flag raised. This means that companies are often looking at log files in retrospect, when it may be much harder to rectify a situation. Furthermore, these companies are only inspecting issues that are breaking rules because of known patterns; discovering new patterns is extremely challenging.
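As a concrete sketch of the rule-based monitoring described above, the snippet below scans log lines against administrator-defined regex rules. The rule names, patterns, and log lines are entirely hypothetical, invented for illustration:

```python
import re

# Hypothetical administrator-defined rules: each maps a rule name to a
# regex that matches a known problem pattern in a log line.
RULES = {
    "disk_full": re.compile(r"No space left on device"),
    "auth_failure": re.compile(r"authentication failure for user \S+"),
}

def scan(log_lines):
    """Return (rule_name, line) pairs for every line that breaks a rule."""
    hits = []
    for line in log_lines:
        for name, pattern in RULES.items():
            if pattern.search(line):
                hits.append((name, line))
    return hits

logs = [
    "2014-03-01T10:02:11 kernel: No space left on device",
    "2014-03-01T10:02:12 sshd: authentication failure for user alice",
    "2014-03-01T10:02:13 app: request served in 12 ms",
]
print(scan(logs))
```

Note the limitation this illustrates: only lines matching a known pattern are ever flagged, so novel issues pass through silently, which is exactly the retrospective, rules-only posture described above.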
To address this, existing operational log management solutions perform some log content analytics (making sense of computer-generated records), but of a limited scope. They can aggregate log files from disparate sources, index them to make them searchable, and create dashboards that summarize the values contained in a log file. However, as the volume and variety of log files rises, it becomes increasingly difficult for log management solutions to parse log files, trace potential issues, and actually find errors, particularly when cross-log correlations come into play. Even in the best-case scenarios, it requires an experienced operator to follow event chains, filter noise, and eventually diagnose the root cause of a complex problem. Today's log content analytics relies heavily on how extensible the existing vendor solutions are, especially in terms of creating algorithms or platforms that support views of the underlying data, parallel execution of tasks that scale, discovery of correlations within and across log files, and more.
However, within these very operational log management challenges lies hope. Ironically, one of the biggest barriers to effectively managing log data, its quantity, can also be the root of a solution. One of the key missing ingredients up to now has been machine learning, which requires copious amounts of data to leverage, learn from, and, with a little guidance, act on. Using machine-learning techniques, a system can find insights through the analysis of past log events and learn to predict future events in an automated or semiautomated fashion. Think of it as log content analytics with an added dimension of functionality and virtually unlimited possibility.
Understanding the Product’s Design
Accenture believes that a log content analytics framework that allows for the customization and implementation of machine-learning algorithms can produce better results. Specifically, machine learning can be used to discover patterns, detect anomalies, extract correlations between event series both within a log file and across log files, and discover data trends, all with virtually no input required from the end user. In this way, log content analytics driven by machine learning could prove relevant to, and possible for, every company in a variety of use cases.
However, it is easier said than done to incorporate machine learning into the analysis of log files. So before jumping into the technical aspects, let us take a step back and understand what a high-level solution might look like. As Accenture's 2014 Technology Vision1 explains, it is important to have an end-to-end view of the data in an organization. Too many companies fall victim to siloed data and data solutions, limiting the potential impact and value of data throughout the enterprise. Effectively, this holistic view can be thought of as the ''data supply chain,'' which begins with data being ingested, then moving through the manipulation phase, and ending when the output is a valuable insight.
Keeping this progression in mind, we developed the Accenture Log Content Analytics asset, a lightweight framework that acts as a wrapper around existing vendors' tools and helps extend their capabilities to provide enterprises with a systematic and more effective machine-learning mechanism to understand, analyze, and visualize heterogeneous log files and ultimately discover insights. Designed to process log files at scale, either in an onsite data center or housed in the cloud, the asset uses an extensible plug-in framework with many algorithms and filters to perform various analytic tasks. For example, it helps to discover and extract information based on profiles and configurations that implement various normalizations; use mining techniques, such as process mapping; and employ analytics such as anomaly detection and filtering for specific time periods and data types. At the core of the framework engine, the system extracts the temporal causality behaviors of traces and events (in either full-scale distributed mode or emulation mode). The resulting information is presented in graphs for easy exploration.
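To make the idea of profile-driven normalization concrete, here is a minimal sketch of how per-source profiles could map heterogeneous log lines into one record shape. The log formats, field names, and profile structure are assumptions for illustration, not the asset's actual implementation:

```python
import re
from datetime import datetime

# Hypothetical per-source "profiles": each pairs an extraction regex with a
# timestamp format, so lines from different sources normalize to one shape.
PROFILES = [
    (re.compile(r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(?P<level>\w+)\] (?P<msg>.*)$"),
     "%Y-%m-%d %H:%M:%S"),
    (re.compile(r"^(?P<ts>\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}) (?P<level>\w+) (?P<msg>.*)$"),
     "%d/%b/%Y:%H:%M:%S"),
]

def normalize(line):
    """Return a normalized record, or None if no profile matches."""
    for pattern, ts_fmt in PROFILES:
        m = pattern.match(line)
        if m:
            return {
                "timestamp": datetime.strptime(m.group("ts"), ts_fmt),
                "level": m.group("level").upper(),
                "message": m.group("msg"),
            }
    return None

print(normalize("2014-03-01 09:15:00 [warn] cache nearly full"))
print(normalize("01/Mar/2014:09:15:01 info request done"))
```

Unmatched lines return None rather than raising, which leaves room for the semiautomated step: surfacing unrecognized formats to a human so a new profile can be added.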
Log content analytics is the application of analytics and semantic technologies to (semi-) automatically consume and analyze heterogeneous computer-generated log files to discover and extract relevant insights in a rationalized, structured form. This information can enable a wide range of enterprise activities, including audit or regulatory compliance, security policy compliance, digital forensic investigation, security incident response, operational intelligence, anomaly detection, error tracking, and application debugging.

To dig deeper into the specifics, while following the data supply chain paradigm, this solution consists of the normal ingestion, manipulation, and value creation phases as outlined above, but it automates or semiautomates many of these steps, going above and beyond the capabilities of typical vendor solutions. First, data is selected, ingested, and semiautomatically normalized. Next, as part of the manipulation phase, the relevant data is automatically parsed and extracted. Those results are then stored and indexed, becoming vendor agnostic in the process, and semiautomated analysis and exploration follows, with guided exploration for anomaly detection and other patterns. Finally, in the last phase of the data supply chain, the results are automatically visualized and published for value realization.
While the ingestion and value creation stages are absolutely necessary to the solution, the most innovative technical aspects occur in the manipulation stage through the use of carefully selected machine-learning algorithms. Rather than being programmed for specific tasks, machine-learning systems gain knowledge from data as ''experience'' and then generalize what they have learned to upcoming situations to accomplish defined goals.

In other words, the machine-learning mechanism transforms the scale and complexity of the data into actionable information by providing a more automated approach that analyzes log contents, learns application behaviors, and feeds this knowledge back into the system. Using this approach, companies can improve the underlying models, discover previously unknown patterns or anomalies, and enable more real-time and proactive application debugging, anomaly detection, compliance, investigation, error tracking, operational intelligence, and root cause analysis.
How the Product Solves the Problem
Within the data manipulation phase, the Accenture Log Content Analytics asset extracts correlations between trace events within a log file and the information surrounding them, such as the probability of occurrence of trace log events, the probability of transitions between particular trace log events, the execution times of trace log events, and anomalous occurrences of trace log events.

This means that the log files contain information that can be used to uniquely identify events, compute event probabilities and statistics, and discover the temporal relationships between events, essentially functioning as a transaction log. Additionally, the information can be mined and visually mapped on a graph that depicts behaviors. Through this process, we created algorithms to discover temporal causality relationships. Trace entries are linked together by discovering the unique identifier for a sequence of events. Additional statistics are mined that allow us to predict the likelihood of events following each other in time. This data can then be used to seed real-time analysis for anomaly detection and pattern recognition.
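The statistics described above can be sketched with a simple first-order transition model: group trace entries by their linking identifier, count event-to-event transitions, and flag transitions whose learned probability is low. The trace events, identifiers, and threshold below are hypothetical, and the asset's actual algorithms are surely richer than this sketch:

```python
from collections import defaultdict

# Hypothetical trace log: (trace_id, event) pairs, already linked by the
# unique identifier discovered for each event sequence.
trace_log = [
    ("req-1", "recv"), ("req-1", "auth"), ("req-1", "serve"),
    ("req-2", "recv"), ("req-2", "auth"), ("req-2", "serve"),
    ("req-3", "recv"), ("req-3", "error"),
]

def transition_probabilities(log):
    """Mine P(next_event | current_event) from per-trace event sequences."""
    traces = defaultdict(list)
    for trace_id, event in log:
        traces[trace_id].append(event)
    counts = defaultdict(lambda: defaultdict(int))
    for events in traces.values():
        for a, b in zip(events, events[1:]):
            counts[a][b] += 1
    return {
        a: {b: n / sum(nexts.values()) for b, n in nexts.items()}
        for a, nexts in counts.items()
    }

def is_anomalous(a, b, probs, threshold=0.1):
    """Seed for real-time analysis: flag transitions rarely (or never) seen."""
    return probs.get(a, {}).get(b, 0.0) < threshold

probs = transition_probabilities(trace_log)
print(probs["recv"])  # learned distribution over events following "recv"
```

A transition such as "auth" followed by "error", never observed in the training traces, would score probability zero and be flagged immediately, which is the sense in which the mined statistics seed real-time anomaly detection.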
Moving beyond the walls of the lab and into the real world, Accenture Technology Labs proved the enhanced log content analytics concept with a longtime client: a global cable company. The cable company provided time-based log files for people watching television online, including the different channels they watched and when. On the basis of these log files, we used the Accenture Log Content Analytics asset to develop an in-depth viewer segmentation. The results were extrapolated at the demographic level, and the insights could be used in a number of ways.
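A toy version of such viewer segmentation might derive, for each viewer, a favorite channel and a dominant viewing daypart from time-based viewing records. The records, channel names, and segment labels below are invented; the actual engagement used richer features and demographic extrapolation:

```python
from collections import Counter, defaultdict

# Hypothetical viewing log: (viewer_id, channel, hour_of_day) records.
viewing_log = [
    ("v1", "news", 7), ("v1", "news", 8), ("v1", "sports", 21),
    ("v2", "movies", 22), ("v2", "movies", 23),
    ("v3", "kids", 9), ("v3", "kids", 10),
]

def daypart(hour):
    """Bucket an hour of day into a coarse viewing daypart."""
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    return "evening"

def segment_viewers(log):
    """Label each viewer with dominant daypart plus favorite channel."""
    per_viewer = defaultdict(list)
    for viewer, channel, hour in log:
        per_viewer[viewer].append((channel, daypart(hour)))
    segments = {}
    for viewer, records in per_viewer.items():
        top_channel = Counter(c for c, _ in records).most_common(1)[0][0]
        top_daypart = Counter(d for _, d in records).most_common(1)[0][0]
        segments[viewer] = f"{top_daypart}-{top_channel}"
    return segments

print(segment_viewers(viewing_log))
```

Aggregating such per-viewer segments over a demographic is what enables the pricing and recommendation applications discussed next.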
For instance, the cable company could generate more informed pricing models for selling advertising to certain demographics at certain times, or create a recommendation engine to encourage viewers to watch additional channels and expand their market reach. Likewise, consumer goods and retail companies could use the correlated information across data sources to determine which online channels or specific shows are best for targeting their brands to key audiences, and algorithmically determine what time of day to advertise, or which demographics might be most interested in specific products. In this way, we not only proved that the Accenture Log Content Analytics asset works, but also proved its value outside the more typical realm of IT management.
In addition, we are now working on a theoretical pilot in the security domain. Since the IT security function is rule- and role-based, the Accenture Log Content Analytics asset could be used to discover previously unknown behaviors, such as anomalous login behaviors. The machine-learning functionality would be able to do this in an automated fashion and without human bias. In this manner, companies could potentially detect damaging behaviors and be better prepared for future attacks, as well as more proactive in stopping a security breach before it happens. Consider an attack in which an organizational insider is suddenly accessing a series of assets he or she has never used before. The Log Content Analytics asset could immediately spot an atypical access pattern for a user, even if that specific access pattern had never been seen before, and prevent particularly high-risk losses of private information.
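The insider scenario just described can be sketched as a per-user behavioral baseline: learn the set of assets each user normally touches, then flag users who suddenly access several never-seen assets. The users, asset names, and threshold are hypothetical placeholders for whatever the learned security model would actually use:

```python
from collections import defaultdict

# Hypothetical historical access log: (user, asset) pairs.
history = [
    ("alice", "crm"), ("alice", "wiki"), ("alice", "crm"),
    ("bob", "payroll"), ("bob", "wiki"),
]

def build_profiles(access_log):
    """Learn, per user, the set of assets they normally touch."""
    profiles = defaultdict(set)
    for user, asset in access_log:
        profiles[user].add(asset)
    return profiles

def atypical_accesses(profiles, new_events, max_new=1):
    """Flag users who suddenly touch more than max_new never-seen assets."""
    new_per_user = defaultdict(set)
    for user, asset in new_events:
        if asset not in profiles.get(user, set()):
            new_per_user[user].add(asset)
    return {u: sorted(a) for u, a in new_per_user.items() if len(a) > max_new}

profiles = build_profiles(history)
# An insider suddenly reads a series of assets she has never used before.
today = [("alice", "payroll"), ("alice", "hr-records"), ("bob", "wiki")]
print(atypical_accesses(profiles, today))
```

Because the baseline is learned from the user's own history rather than a hand-written rule, the access pattern being flagged need never have been seen before, which is the key advantage over rule-based monitoring.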
Furthermore, as the Accenture Log Content Analytics asset is used over time to learn from ever-increasing amounts of data, it will start to accumulate a library of anomalous patterns corresponding to specific security events. This collection of patterns could be an asset in and of itself,
such that other companies could implement the package and immediately capitalize on the security learnings of the Accenture Log Content Analytics asset, without having to spend time rediscovering and learning them on their own. These immediate predictive capabilities in a highly sensitive security domain make this solution even more valuable.
In theory, the Accenture Log Content Analytics asset could also be used to help improve the efficiency of a manufacturing supply chain process by recognizing anomalies. For example, given enough information to learn and map the complete supply chain, the Accenture Log Content Analytics asset could be used to identify patterns in a supply chain that lead to unusual delays. Then, users could potentially discover when, where, and how to reroute products during the manufacturing process in order to minimize delays. By reducing downtime and improving the overall supply chain process, manufacturers could save significant amounts of time and money.
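One simple way to spot the "unusual delays" mentioned above is to mine step durations from supply-chain event logs and flag statistical outliers. The durations and z-score threshold below are invented for illustration:

```python
from statistics import mean, stdev

# Hypothetical durations (hours) mined from event logs for the same
# supply-chain step across many shipments.
durations = [4.0, 4.2, 3.9, 4.1, 4.0, 9.5, 4.05, 3.95]

def unusual_delays(samples, z_threshold=2.0):
    """Flag durations whose z-score exceeds the threshold."""
    mu = mean(samples)
    sigma = stdev(samples)
    return [x for x in samples if abs(x - mu) / sigma > z_threshold]

print(unusual_delays(durations))  # the 9.5-hour outlier is flagged
```

Tracing each flagged duration back to its shipment and process step is what would tell users when, where, and how to reroute products.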
Conclusion
There has always been a wealth of information and insights contained in the vast troves of log files. However, as software applications grew and multiplied, so too did the number of log files produced. Insights became essentially invisible, and extracting any value became extremely difficult, requiring significant time, skills, and know-how from a meticulous subject-matter expert, especially without the aid of any tools. As such, log analysis served IT management in a limited capacity, and its use became synonymous with tedious, manual error investigation. It is no wonder that operational log file management was, and still is for many, an uninteresting and underappreciated technology domain.
However, advances in machine learning have breathed new life into operational log management solutions and the log content analytics they support. With the ability to leverage and learn from the copious amounts of log files, machine-learning techniques mask the scale and complexity of log file data and make its analysis both possible and valuable. While most operational log management software has stopped short of machine learning and advanced analytics, offering only ingestion, parsing, storing, indexing, and dashboard capabilities, the Accenture Technology Labs' Log Content Analytics asset has not. It can be used on top of existing vendor solutions to (semi-) automate many of the steps necessary to realize the full value of log files.
The most promising aspect of the Accenture Log Content Analytics asset is that its possibilities are essentially endless. Once log content analytics is feasible, practical, and worthwhile, it should no longer be limited to the realm of IT management. As discussed, this solution allows log content analytics to extend to use cases from cable television to enterprise security. But this log file analysis could also be applied to the areas of logistics, social media, and consumer behavior.

So is there still potential to think beyond the log file? Yes. For any industry with transactional data, ranging from retail to banking, and accessed through means such as databases or application programming interfaces, the Accenture Log Content Analytics asset can be applied. Since it is an end-to-end system, encapsulating all phases of the data supply chain, this holistic solution has the ability to provide significant value in many domains. As it turns out, using machine learning to harness the power of log files can be quite exciting after all.
About Accenture
Accenture is a global management consulting, technology services, and outsourcing company, with approximately 281,000 people serving clients in more than 120 countries. Combining unparalleled experience, comprehensive capabilities across all industries and business functions, and extensive research on the world's most successful companies, Accenture collaborates with clients to help them become high-performance businesses and governments. The company generated net revenues of $28.6 billion for the fiscal year ended August 31, 2013. Its home page is www.accenture.com.
About Accenture Technology Labs
Accenture Technology Labs, the dedicated technology research and development (R&D) organization within Accenture, has been turning technology innovation into business results for more than 20 years. Our R&D team explores new and emerging technologies to create a vision of how technology will shape the future and invent the next wave of cutting-edge business solutions. Working closely with Accenture's global network of specialists, Accenture Technology Labs helps clients innovate to achieve high performance. The labs are located in Silicon Valley, California; Sophia Antipolis, France; Arlington, Virginia; Beijing, China; and Bangalore, India. For more information, please visit www.accenture.com/accenturetechlabs.
Author Disclosure Statement
Copyright © 2014 Accenture. All rights reserved. Accenture, its logo, and High Performance Delivered are trademarks of Accenture, which sponsored this article. This article makes descriptive reference to trademarks that may be owned by others. The use of such trademarks herein is not an assertion of ownership of such trademarks by Accenture and is not intended to represent or imply the existence of an association between Accenture and the lawful owners of such trademarks.
Reference
1. Accenture's 2014 Technology Vision. Available online at www.accenture.com/microsites/it-technology-trends-2014/Pages/home.aspx (last accessed March 3, 2014).
Address correspondence to:
Krista Schnell
Accenture
50 West San Fernando Street
Suite 1200
San Jose, CA 95113
E-mail: krista.schnell@accenture.com