Cyber Language Analysis: Emerging Linguistic Trends in Digital Environments

Upload: lingua-brava

Post on 18-Dec-2014




By Raul Mejorado, Lingua Brava, LLC

Introduction

In the post-9/11 landscape, US stakeholders have come to recognize the critical role that language analysis plays in responding to asymmetric threats and, more recently, the growing role that cybersecurity plays in national security interests. However, we have yet to fully understand how these two seemingly disparate disciplines interact. The global nature of the Internet makes it a natural environment for language analysis. Moreover, the digital environment, by virtue of its limitations and enhancements, gives rise to linguistic conditions that can be practically applied to emerging security needs. The intersection of cybersecurity and language is becoming increasingly evident, as illustrated by homograph attacks, language artifacts in forensic analysis of malware, and the use of right-to-left-override code (to name a few examples).

Homograph Attacks. Threat actors exploit characters from non-Latin scripts that look like English letters but have entirely different underlying code values. In a homograph attack, human-readable text such as a web address is built with look-alike characters from another script (e.g., the Cyrillic letter ‘Р’, which is pronounced like an English ‘R’ but resembles an English ‘P’). In 2005, the Internet Corporation for Assigned Names and Numbers (ICANN) issued a statement and solicited comments from the community regarding homograph attacks in which web addresses were spoofed using foreign-language character sets.1
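As a minimal sketch of how a mixed-script label might be flagged, the Python below treats the first word of each character's Unicode name as a stand-in for its script. That shortcut, the function names `scripts_in` and `looks_mixed_script`, and the sample label are all assumptions made for illustration; registries and browsers apply considerably more elaborate policies.

```python
import unicodedata

def scripts_in(label):
    """Return the script names (e.g. 'LATIN', 'CYRILLIC') used by the letters in label."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # Unicode names begin with the script, e.g. 'CYRILLIC SMALL LETTER ER'.
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def looks_mixed_script(label):
    """True when the letters of a single label come from more than one script."""
    return len(scripts_in(label)) > 1

genuine = "paypal"
spoofed = "pay\u0440al"  # U+0440 CYRILLIC SMALL LETTER ER stands in for Latin 'p'

print(scripts_in(genuine), looks_mixed_script(genuine))
print(scripts_in(spoofed), looks_mixed_script(spoofed))
```

A label written entirely in one non-Latin script would not be flagged by this check, so it only addresses the mixed-script form of the attack.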

Forensic Analysis of Malware. Language and cultural skills apply directly to forensic analysis activities. For example, ‘The Mask’ malware contained an expletive particular to the dialect spoken in Spain; modules within the program also referenced the malware name ‘careto,’ a term for an ugly mask particular to European Spanish.2 Artifacts like these call for further analysis before their significance can be fully understood, and that analysis must account for the potential for misattribution of blame, since language clues can be planted deliberately. This only furthers the case that language analysts are required in this environment.

Right-to-Left-Override (RLO) Code. Hackers have exploited both intentional and unintentional effects of the code and infrastructure that manage language rendering. In 2011, the RLO control character (Unicode U+202E), which supports right-to-left languages like Hebrew, was used to make malicious file names look benign.3 By inserting the RLO code, hackers reversed the displayed order of the characters that follow it, so that file names and text in phishing emails resembled something safer to open.
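The effect can be sketched in a few lines of Python. The example below checks a file name for Unicode bidirectional control characters and approximates what a viewer would see by reversing everything after the first RLO character. The function names and the sample file name are hypothetical, and the reversal is a deliberate simplification of the full Unicode Bidirectional Algorithm.

```python
# Unicode bidirectional control characters: embeddings, overrides, and isolates.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}
RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE

def has_bidi_controls(name):
    """True if the name contains any bidirectional control character."""
    return any(ch in BIDI_CONTROLS for ch in name)

def approximate_display(name):
    """Crude approximation: everything after the first RLO is shown reversed."""
    head, sep, tail = name.partition(RLO)
    return head + tail[::-1] if sep else name

stored = "invoice" + RLO + "fdp.exe"  # hypothetical name; the real extension is .exe
shown = approximate_display(stored)

print(has_bidi_controls(stored))  # True
print(shown)                      # invoiceexe.pdf
```

Because the stored name still ends in `.exe`, a defensive check can compare the real extension against what the user is shown, or simply reject names containing these control characters.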

Preparing Cyber Language Analysts

As these examples suggest, cybersecurity professionals of the future must possess a thorough understanding of the reciprocal relationship between language and the digital environment and appreciate the opportunities and threats of this interaction. How can we best prepare cybersecurity professionals and language analysts to operate in this new environment? From deciphering human intent to defending against cybersecurity threats, overlooking the effects of language in the digital domain represents a shortcoming in fully understanding human presence both online and offline. There is a growing need to fully understand and communicate these effects to cybersecurity professionals, which must be accomplished in three crucial, simultaneous steps.

1 ICANN Statement on IDN Homograph Attacks and Request for Public Comment

2 The Mask Is Off: Cyber Spy Operation Uncovered After 7 Years

3 ‘Right-to-Left Override’ Aids Email Attacks

© 2014 Lingua Brava, LLC. All Rights Reserved. Page 1


Step One: Define Linguistic Issues. The most complex task is identifying and defining the elements and conditions that make up the emerging effects of language on the digital domain. Tackling this task will require further academic research, quite possibly at the doctoral and post-doctoral level. To support future language analysts in operating effectively within the cybersecurity context, researchers must define the environment, explore new and unique ways that language interacts with the digital domain, and provide a wealth of examples to be deconstructed and presented in training materials. This includes capturing environmental effects or simulating conditions that can then be defined for training.

Defining linguistic issues within the digital domain will be an interdisciplinary effort and will fundamentally require that researchers from the soft sciences (e.g., linguistics, sociology, psychology) be informed by research in the hard sciences (e.g., mathematics, computer science, statistics). The study and research of cyber language analysis must explore the interaction of these sciences with the digital environment (see Figure 1).

• How does morphology in chat environments inform our understanding of the user’s culture?

• How does non-English text affect automated search algorithms?

• How does the understanding of language and culture support digital forensics efforts?

• How do viral phrases and collocations relate to particular groups or threats?

• How do hackers and malware authors use language features to introduce threats into systems?

Only by researching the intermingling of these domains will we begin to understand the subtle nuances of cyber language analysis and begin to answer these questions.

Figure 1. Intersection of Linguistics with the Digital Environment

Step Two: Prepare Language Analysts. Training future Cyber Language Analysts must go beyond simply supplementing their global language knowledge with a deeper understanding of networking or other information technology disciplines. This may be effective in general familiarization and terminology, but would merely bring them to the point of this original speculation on emerging trends. In order to fully advance understanding and elucidate practical application, Cyber Language Analysts must be provided with focused education, leveraging discovery techniques. To accomplish this goal, it is necessary to first explain the kernel issues and the nature of these trends in terms of language analysis application. Only then can the language analyst be trained to predict unique variations of linguistic content or human intent through specialized topics in low-context prediction, techniques for anticipating and assigning importance to word mutation, and distinguishing natural language in coded environments.

In order to prepare language analysts to operate in the cyber domain, they must first understand: (1) the environment, (2) the threat actors, (3) the use of domain-specific language, and (4) the analytical process.

Environment: In the digital environment, language is a means of communication, but it is also a means by which the user directly and indirectly affects the digital and physical environment. To prepare Cyber Language Analysts, training must clearly address the differentiation and interplay between form and function4 within the digital environment. The digital environment can be characterized by its voluminous amounts of context-deficient graphic information. This not only forces the need for higher level language analysts but redefines the technique used by all language analysts in navigating the environment. The seasoned Cyber Language Analyst must have a basic understanding of how the digital environment is subsumed under complex networking and information technology infrastructures characterized by three overlapping layers: content, code, and metadata (see Figure 2). Cyber Language Analysts must understand how these different layers of information interact. Ultimately, it is critical that they be able to navigate the digital terrain, distinguishing between genuine code, hashing, actual language, and errant junk.

Figure 2. Layers of Digital Environment

4 The form of the digital environment is addressed in this section. The function of the digital environment relates to the human element addressed below in the “Threat Actors” section. These two aspects of the cyber domain must be addressed together to ensure the Cyber Language Analyst is able to apply this knowledge to real-world scenarios.

Threat Actors: Deciphering the human imprint in digital environments may be a discipline unto itself. The Cyber Language Analyst must be trained to identify human presence, intent, behavior, and capability (i.e., threat level). As stated above, the digital environment serves a particular function, where the code and graphic interface serve human objectives (e.g., human communication or signals controlling household appliances). Even in code, as reliant on syntax as it may be, there is room for the programmer’s personality and predilections to surface. Language analysis in tandem with cultural awareness can be applied to forensic malware analysis to ‘profile’ threat actors. Whether in plain language, jargon, or transliteration, terms can be leveraged to further determine interests or a potential file-naming schema. In cases where certain terms may indicate malicious actors or even nationalist tendencies, further characterization of intent can be made.

Domain-specific Language: Understanding the basic target language vocabulary that describes the digital domain (e.g., how to say router in Chinese) is just the beginning. In addition, Cyber Language Analysts must be trained to recognize and understand the functional purpose of various forms of script and code. This enables the Cyber Language Analyst to isolate linguistic artifacts or natural language that can then be used to determine clues about the author’s cultural affiliation or even threat level. The Cyber Language Analyst must understand the unique form that language takes within the digital domain, including how various characters, numbers, and codes are used to indicate features of the target language. In many ways, language in the digital environment, while just as valid, takes on a form not readily recognized in formalized contexts (e.g., news articles and print media).
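One way to picture the isolation of linguistic artifacts is the sketch below, which pulls runs of printable text out of raw bytes (similar in spirit to the Unix `strings` tool) and flags those containing a term of interest. The watchlist, the function names, and the sample bytes are all invented for illustration; the sample is not taken from any real malware, and a working analyst would also need to handle non-ASCII encodings.

```python
import re

# Hypothetical watchlist; 'careto' is the term reported in 'The Mask' samples.
TERMS_OF_INTEREST = {"careto"}

def printable_runs(blob, min_len=4):
    """Return runs of printable ASCII that are at least min_len bytes long."""
    pattern = re.compile(rb"[\x20-\x7e]{" + str(min_len).encode() + rb",}")
    return [m.decode("ascii") for m in pattern.findall(blob)]

def flag_terms(runs, terms=TERMS_OF_INTEREST):
    """Keep only the runs that contain one of the watched terms."""
    return [r for r in runs if any(t in r.lower() for t in terms)]

# Invented bytes: two readable runs surrounded by binary noise.
sample = b"\x00\x01MZ\x90\x00modules/careto/loader\x00\xff\xfeGetProcAddress\x00zz\x00"

runs = printable_runs(sample)
print(runs)              # ['modules/careto/loader', 'GetProcAddress']
print(flag_terms(runs))  # ['modules/careto/loader']
```

The extraction step only surfaces candidates. Deciding whether a term reflects the author's dialect, a borrowed tool, or a planted clue remains a judgment for the language analyst.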

Analytical Processes: The practical application of cyber language analysis borrows from forensic analysis, basic cryptanalysis, computational linguistics, and translation techniques. Regarding the latter, language analysts might apply techniques similar to those used to decipher isolated ancient texts, determining meaning by comparison with relevant corpora. Other required skills include, but are not limited to, the capability to analyze parts of speech, letter frequency, and word frequency. Given the context-poor nature of the digital domain, cybersecurity professionals will be reliant on the art of language analysis; they must be able to apply past experiences, examples, and a general understanding of the cyber domain in order to draw educated and actionable conclusions.
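Letter and word frequency counts of the kind mentioned above can be sketched with Python's standard library. The function names and the sample message are illustrative assumptions, and real analysis would compare such counts against reference corpora for the target language.

```python
from collections import Counter
import re

def letter_frequency(text):
    """Count each alphabetic character, ignoring case."""
    return Counter(ch for ch in text.lower() if ch.isalpha())

def word_frequency(text):
    """Count each word, ignoring case and punctuation."""
    return Counter(re.findall(r"\w+", text.lower()))

message = "The mask is off and the mask is known"

letters = letter_frequency(message)
words = word_frequency(message)

print(letters["s"], letters["k"])  # 4 3
print(words["mask"], words["off"])  # 2 1
```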

Step Three: Elucidate Practical Applications. This capability could be applied immediately in support of corporate network defense and law enforcement. However, language analysts have not historically operated within the cybersecurity domain or been sought out for their unique perspective. The resulting gaps in analysis could delay resolution or prevent a better understanding of malicious actors and their intent. Further coordination between academic, corporate, and government stakeholders is necessary to fully develop practical applications of cyber language analysis.

As we better understand these concepts and how the effects of these concepts can be practically applied, we can assume that further applications of language analysis in cybersecurity will present themselves. The scope of our evolving understanding is complex in depth and breadth, but a few specific examples of how language analysis might be applied in the digital domain are already evident.

If provided proper education and training, a Cyber Language Analyst could:

• Conduct forensic analysis of malware, malicious web sites, or phishing attacks in terms of their linguistic and cultural artifacts

• Identify environmental situations that deter or prevent the full and appropriate use of some foreign-language-based fonts

• Analyze social media corpora for digitally influenced morphology and threat language

• Define and decipher online communication constructs, naming conventions, and code comments

• Identify linguistic or cultural artifacts implicitly evident in script or code

Figure 3. Channels to Prepare Future Cyber Language Analysts


Mr. Raul Mejorado is a Principal Advisor for Lingua Brava and a Lieutenant Commander in the US Navy Reserve. He has served as the interim Deputy Senior Language Authority for US Cyber Command and has over 20 years of experience as a language analyst for the US Navy. In addition, he has over 15 years of experience in the public and private sector as an information technology professional, including specializations in web development, networking, and knowledge management. Mr. Mejorado holds a Bachelor of Arts in Foreign Area Studies with a minor in Russian from University of Maryland University College and has a Systems Engineering Certificate from Stanford University. He also served as a certified Adjunct Faculty Instructor for The National Cryptologic School.

Lingua Brava, LLC (DUNS # 078516592) is a certified Service-Disabled Veteran-Owned and Economically Disadvantaged Woman-Owned Small Business specializing in language education and training, curriculum development, program assessment and evaluation, educational technology, and professional development. Lingua Brava consultants have been involved in language training and education for over 25 years, providing services for various stakeholders, including: Department of Defense, National Security Agency, US Cyber Command, Central Intelligence Agency, Department of Commerce, Department of Education, and nonprofit organizations. Lingua Brava addresses the unique needs of Active Duty and Reserve program managers and language learners in the cryptologic and cyber operational settings.

The Monterey Institute of International Studies (MIIS) launched the Cyber Security Initiative (MCySec) in May 2013 following a year’s worth of roundtable meetings. The mission of the Cyber Security Initiative is to create an interdisciplinary platform to assess the impact of the information age on security, peace, and communications. Housed in the MIIS Office of the President, the Cyber Security Initiative provides input and coordinates existing cyber efforts in multiple fields, creates a collaborative community of interest, provides a forum for international, key-leader engagement, and increases research and education on cybersecurity.

Solution

The scope of this discussion immediately exposes a vast problem set. It should therefore be apparent that the related security issues, combined with a tenuous understanding and nascent skill sets, suggest there is a crisis in the making. From the security aspect alone, hackers have already leveraged language environments in phishing campaigns and spoofing efforts. Due to the international nature of the Internet, it should not come as a surprise that language analysts must be engaged in some way. However, based on the potential application of cyber-focused language analysis in public and private sector roles, the capability gap is far from being filled. While the issue is best understood by drawing on several fields that each have their own experts, Lingua Brava (a pioneer in this domain) brings a nexus of skills and over 25 years of experience, and is uniquely qualified to develop and teach these concepts.

Lingua Brava has partnered with the Monterey Institute of International Studies, Cyber Security Initiative to present an Executive Education Series: The Cyber Language Analysis Workshop. The goal of the Workshop is to enable language analysts to interpret presence, intent, behavior, and capability in the cyber domain through the understanding of language or culture-influenced activity conveyed in content, code, and metadata. In this way, the Workshop is carefully tailored to prepare language professionals to operate within the cyber domain. The Workshop equips attendees with knowledge about the cyber environment; the skills to identify target presence, intent, and capability; the knowledge of domain-specific language; and the analytical processes to apply lessons to real-world environments.

To address the need for additional research on the emerging effects of language on the digital domain, Lingua Brava has established the Cyber Language Analysis Working Group. The Working Group is a consortium of academic, commercial, and government thought-leaders dedicated to promoting research in this field of study.

For additional information about the Cyber Language Analysis Workshop or Cyber Language Analysis Working Group, please contact: [email protected].
