

Zachary Bornheimer | University of South Florida

    Honors College - Supervised Research

Non-statistical Language-Blind Morpheme (Candidate)* Extraction

    An Unsupervised Machine Learning Approach

*One cannot know if a particular grapheme represents a morpheme until meaning can be assigned to each morpheme. I'll use the term morpheme throughout this paper, but I'm referring to a morpheme candidate.

Submitted December 19, 2013
Revised March 5, 2014


Introduction/Abstract

During Fall 2013, I was given the opportunity to work on a project that would yield language-blind, non-statistical morpheme extraction. Morpheme lists are the keys to many research projects in linguistics; however, they have to be tailor-made for the language/experiment. I set out to determine a way to create a morpheme list without knowing the language. While I was somewhat successful in this approach, there were some flaws, namely false positives. The data delivered is promising; however, there are two hurdles: serial computing and meaning. Meaning assigned to morphemes eliminates false positives, yet we are working without a gloss, so we have quite a few false positives. With more data, however, this may not be an issue. Additionally, because of the computing power required to run the code, it would be ideal to rewrite part of it to make use of the parallel computing power of a GPU (which may yield speed enhancements of exponential orders of magnitude). Overall, this was a good first step in a project that has much grander requirements.


    Design Decisions

Paradigms

The approach for this project was non-statistical from its origins, as the majority of work being done in natural language processing is being done via statistics. I didn't really understand that approach; I came into the problem with the idea that humans are pattern-based creatures, and if languages are developed through patterns and rules, the deconstruction of a language can be done with patterns and rules. In terms of implementation, I had completed a bit of the work in Kernighan & Ritchie's book The C Programming Language (1988), or as the industry calls it, K&R, by August 2013, but I had reached the part of the book where it said to undertake a large task:

"It's possible to write useful programs of considerable size, and it would probably be a good idea if you paused long enough to do so" (Kernighan & Ritchie, 1988, last paragraph of Ch. 1 before exercises).

While this task was quite substantial, the only topics I needed to learn that were not presented in Chapter 1 of K&R were memory management (pointers and references) and structs.

Language

I decided to use an unsupervised machine learning algorithm, which would more closely mimic the human acquisition process. For this research, I undertook the program's development in the C programming language for the following reasons:

1) I could control how memory was allocated and when it was freed
2) I could manipulate the memory itself
3) I could store memory addresses instead of duplicating content

When testing a simpler, more aggressive version of the system design in Perl 6, I found that memory usage was an insurmountable difficulty. My program in Perl 6 assumed the corpora supplied used a Latin alphabet; this program makes no such assumption. I wrote that program about a year ago (Dec 2012). Writing the program in C was something that needed to be done from an artificial intelligence perspective, as C would allow me to control the speed and intensity of the program; the program written in Perl 6 was unable, with sufficient time or memory, to handle reading Moby Dick, The Adventures of Huckleberry Finn, and/or War of the Worlds. The goal of the program was to develop a morpheme list from these texts using the same basic algorithm while controlling the memory and the assumptions of the design; this could only be accomplished with the language C.

Computer Science Paradigms

As memory management was a goal, I decided early on to use the tool valgrind to help me eliminate memory leaks and to optimize performance. Additionally, I took a somewhat comfortable mixture of top-down and bottom-up programming paradigms; the mixture was dictated by the following factors:

1) Would the function be reusable for a different purpose than the calling function?
2) How complex would it be to transmit data between functions?

Oftentimes, it would be simpler to just expand the function and not have to worry about data transmission. Additionally, I am beginning the process of optimizing the code (and removing my testing code) to make it more elegant and to speed the software up. Part of this optimization process is to remove redundant and semi-redundant code; luckily, C makes it simple to implement macros. The following is an explanation of a macro that exists in the software. The purpose of this particular macro is to make sure that a particular pointer is non-null after requesting new memory (E_REALLOC is a defined constant).

#define ASSERT(condition, error_code) \
    if (condition == 0) { \
        printf("Assertion: '%s' failed.\n", #condition); \
        exit(error_code); \
    }

#define REALLOC_CHECK(arg) \
    ASSERT((arg != NULL), E_REALLOC);

The idea here is that I can manipulate the data types involved in a macro so I can pass a boolean and a char* simultaneously; this allows me to write:

    REALLOC_CHECK(array)

    Instead of:

if (array == NULL)
    exit(E_REALLOC);

and have a verbose error message while retaining code brevity. Further, going along with the paradigm of style, I started out new to C in August 2013 (for the most part) and thus did not know much about style in C, so I decided to use a mix of the style I learned directly out of K&R and the Linux kernel coding style (Kroah-Hartman, 2002).

Another paradigm used was an unsupervised machine learning algorithm (UMLA) paradigm. While the standard pragmatic flow of a UMLA is train => execute, this algorithm was able to get UMLA results by defining the algorithm in a series of variables that were only defined by the algorithm running on real data. For example, the algorithm discards morpheme candidates that don't account for THRESHOLD_CONFIRMATION percent of the found morpheme candidates (so if there are 100,000 words and it found 415,000 morpheme candidates, then to be confirmed, non-stems must account for THRESHOLD_CONFIRMATION percent of the data). Along these lines, circumfixes are identified by the following rules, sketched in code after the list:

1) The prefix and suffix occur with equal frequency


2) The words that contain the prefix and the words that contain the suffix have a percent similarity of THRESHOLD_CIRCUMFIX
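
A minimal sketch of these two checks in C follows. The struct layout and the percent_agreement() helper are assumptions for illustration; the actual data structures in the repository may differ.

#include <stddef.h>

/* Assumed helper: percent agreement between two word lists. */
int percent_agreement(char **a, size_t a_count, char **b, size_t b_count);

/* Hypothetical morpheme record carrying what the checks need. */
struct morpheme_info {
    char  *text;
    int    frequency;    /* occurrences in the corpus */
    char **words;        /* words the morpheme was extracted from */
    size_t word_count;
};

/* Nonzero if a prefix/suffix pair behaves like a circumfix:
 * rule 1, equal frequency; rule 2, word lists at least
 * THRESHOLD_CIRCUMFIX percent similar. */
int is_circumfix(const struct morpheme_info *prefix,
                 const struct morpheme_info *suffix)
{
    if (prefix->frequency != suffix->frequency)
        return 0;
    return percent_agreement(prefix->words, prefix->word_count,
                             suffix->words, suffix->word_count)
           >= THRESHOLD_CIRCUMFIX;
}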

With these data elements in mind, the algorithm generates and fills/modifies all the data structure sizes and elements depending on the data's matching of thresholds (so as to prevent outliers from corrupting the data). Specifically, it will isolate morpheme candidates based on two n-grams, generate morpheme lists, generate regexes (regular expressions) and reconstruction data for each morpheme in the list, and then tag morphemes (and the groups of words that are associated with each morpheme). This data is combined with other similar data when necessary and reconstructed when modified.

The final CS goal of the project was to successfully manage a large C project. As I had never managed a C project prior to this, I needed to learn quite a bit about it. The general consensus on the internet was:

1) Use .h files for non-functional code (prototypes, structs, macros, constants, externs)
2) Use .c files for functional code (i.e. variable and function definitions)
3) Use a Makefile for compilation

In regards to the Makefile, I ended up using make debug and make optimized: a debug version that allows for profiling and better debugging, and an optimized version that makes use of the GNU C Compiler (gcc)'s -O flag.

Computational Linguistic Goals

Another goal of this project was to be able to break down a Context-Sensitive Language (CSL) into a Context-Free Grammar (CFG). A CSL is a description of how a language's syntax and semantics vary depending on the context in which the words occur. For example, in the sentence "My word, I thought out loud, how awful; I proceeded to laugh, experiencing overwhelming schadenfreude," the word "awful" would be interpreted differently if, a few words later, the speaker didn't say that s/he was experiencing schadenfreude. This definition of awful is context-sensitive; languages which rely on this principle (namely natural languages) are considered Context-Sensitive Languages. CFGs can be derived from Context-Sensitive Languages (which, by definition, are derived from Context-Sensitive Grammars) and can also be derived from Context-Free Languages. Another way of describing this is { αAβ => αγβ }, γ ≠ ε (ε = null), where setting α = β = null transforms the definition of a CSL rule into the definition of a Context-Free Grammar rule.

Noam Chomsky (1956) defined context-sensitive grammars as such: "A rule of the form ZXW => ZYW indicates that X can be rewritten as Y only in the context Z...W" (Chomsky, 1956, p. 118).


Restated, CFGs are defined as a basic set of rules that are independent from one another, namely:

{ V => w } where V could be described as a non-terminal which yields a specific w (a terminal, non-terminal, or null).

We are treating the problem as a generation of terminals for a Context-Free Grammar consisting of the following: { <word> => <prefix>*<stem>+<suffix>* } where each token in angle brackets is a sequence of characters representing a morpheme class, * = 0 or more, and + = 1 or more; a prefix and suffix combination may be a circumfix placed equidistantly from all stems or a particular stem.

    Further, this particular CFG can be represented as such:

Word/Morpheme CFG: {

    <word>   => <prefix>*<stem>+<suffix>* | <part_word><infix><part_word>

    <prefix> => morpheme-list.chosen-morpheme if morpheme-list.chosen-morpheme.type == PREFIX |
                morpheme.prefix if morpheme-list.chosen-morpheme.type == CIRCUMFIX

    <stem>   => morpheme-list.chosen-morpheme if morpheme-list.chosen-morpheme.type == STEM

    <suffix> => morpheme-list.chosen-morpheme if morpheme-list.chosen-morpheme.type == SUFFIX |
                morpheme.suffix if morpheme-list.chosen-morpheme.type == CIRCUMFIX

} where morpheme-list.chosen-morpheme is a particular morpheme from the morpheme list and part_word is defined as part of a word which follows the affixation rule generated by a particular infix.

Luckily for us, we can generate these rules through tokenization. The program needed to be able to identify parallel environments, which identify a form of lexical environment in which each word occurs. From this, the programmed algorithm allows for the extraction of morphemes based on the comparison of words in parallel environments. This parallelism is defined in constants.h in terms of percentage of parallelism (specifically, in the constant THRESHOLD_SIMILAR_NGRAMS). The technical way the algorithm examines the parallel environments is by using a constant NGRAM_SIZE to determine the size of n-grams (groupings of words). Subsequently, it looks at the left and right halves of two n-grams and generates an array of unique elements. It then uses the following formula to determine the percent similarity between the two arrays:

int percent_similar = (double) 100.00 * ((double) (cl - ol) / (double) cl);

where cl is the length of the combined unique array and ol is the length of the original combined array.

    How Much to Implement


The biggest factor in what was implemented versus what was intended to be implemented was time. Given only 3 months, and needing to use 1 month to get through Chapter 1 of K&R, I ran into time limitations. While the program successfully analyzes and extracts morphemes, I intended to do this in addition to writing an unsupervised language-blind morphosyntactic tagger and a semantic extraction mechanism. I also cover things I would like to change in this algorithm given more time and the funding to do so in the future (this is covered in Future Ambitions).

Research Question

How do you, non-statistically, extract morphemes from prepared corpora to develop a morpheme list with an unsupervised machine learning algorithm in the C programming language?

Work Accomplished

During this project, I successfully implemented a morpheme extraction algorithm. It assumes nothing about a supplied corpus. It currently has memory leaks: about 0.10% of memory is currently being leaked. The code is about 1800 lines and consists of 21 code files, a Makefile, a LICENSE file, and a README file.

To run the program

You can download the code at https://github.com/zachbornheimer/morpheme-extraction. Once the code has been downloaded, you need the following tools:

1) make - https://www.gnu.org/software/make/
2) gcc - https://www.gnu.org/software/gcc/

In the directory where the software was downloaded, run make optimized for the optimized code. Further information on how to run the program is available in the file README.

Algorithm

The algorithm implemented uses the following procedure:

1) Identify and activate command-line changes
2) Extract the word-delimiter
3) Develop n-gram relationships
4) If --process-sequentially, --serial, --sequential, or --process has been passed, find morphemes
5) Repeat from step 3 until there are no more files
6) If --process is passed, or no argument involving processing (see 4) is passed, find morphemes
7) Write the type of the morpheme (PREFIX, SUFFIX, STEM, INFIX, or CIRCUMFIX) and the morpheme to a file specified with a command-line parameter, or to the default file
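
A minimal sketch of this control flow, using the flag names from the list above; every other identifier here is an assumed placeholder, not the repository's actual code:

/* Assumed option and helper declarations, for illustration only. */
struct options {
    char      **files;
    int         file_count;
    int         serial;      /* --process-sequentially / --serial / --sequential */
    int         process;     /* --process */
    const char *output_file;
};
struct options parse_command_line(int argc, char **argv);
char *extract_word_delimiter(const char *path);
void  develop_ngram_relationships(const char *path, const char *delim);
void  find_morphemes(void);
void  write_morpheme_list(const char *path);

void run(int argc, char **argv)
{
    struct options opt = parse_command_line(argc, argv);     /* step 1 */
    for (int i = 0; i < opt.file_count; i++) {
        char *delim = extract_word_delimiter(opt.files[i]);  /* step 2 */
        develop_ngram_relationships(opt.files[i], delim);    /* step 3 */
        if (opt.serial || opt.process)                       /* step 4 */
            find_morphemes();
    }                                                        /* step 5 */
    if (opt.process || !opt.serial)                          /* step 6 */
        find_morphemes();
    write_morpheme_list(opt.output_file);                    /* step 7 */
}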

Extract the Word-Delimiter

We are defining a word-delimiter as a character or string of characters which has and contributes no semantic value other than to delineate the end of semantic value in a grapheme sequence.


For this stage, the algorithm takes an input file and develops a unique array of all characters that occur in the file. For each character in the unique array, it splits the processed file into a sequence of non-null strings and tallies the number of non-null elements that exist. It then looks at the tallies generated for each possible word-delimiter candidate. If there is one character that has the highest generated split word count, it is returned as the word-delimiter. If there is more than one, it tests permutations according to the following algorithm:

/* PERMUTATION ALGORITHM: walk candidate orderings by repeatedly
 * moving each character one slot to the right until the original
 * string recurs. */
do {
    for (size_t position = 0; position + 1 < strlen(testing_string); position++) {
        /* move-to-the-right: swap testing_string[position] with its neighbor */
        char tmp = testing_string[position];
        testing_string[position] = testing_string[position + 1];
        testing_string[position + 1] = tmp;
    }
} while (strcmp(string, testing_string) != 0);

It tests each permutation against the file being tested to find the number of non-null strings, and it compares that against all the permutations of all other files by storing the word-delimiter that results in the largest frequency, along with said frequency. If it finds that another word-delimiter candidate has an equally large count (that isn't 0), it skips that file: given two potentially differing word-delimiters, either the file doesn't conform to the given word-delimiter definition or something else went wrong; either way, it's probably best to skip the file.
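
For concreteness, the tally step described above might look like the following in C (count_tokens is an illustrative helper, not necessarily the repository's code); the candidate character producing the largest count becomes the word-delimiter:

#include <stddef.h>

/* Count the non-null strings produced by splitting text on the
 * candidate delimiter d. */
static size_t count_tokens(const char *text, char d)
{
    size_t count = 0;
    int in_token = 0;
    for (const char *p = text; *p != '\0'; p++) {
        if (*p == d) {
            in_token = 0;      /* delimiter ends any current token */
        } else if (!in_token) {
            in_token = 1;      /* a new non-null string begins */
            count++;
        }
    }
    return count;
}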

Develop N-Gram Relationships

For this mechanism, the algorithm splits the text into words and begins generating the n-gram data structure based on a constant-defined NGRAM_SIZE (always an odd number). Thanks to C's truncating integer division, NGRAM_SIZE/2 always rounds down and is equivalent to (NGRAM_SIZE-1)/2; whenever I refer to NGRAM_SIZE/2, it refers to the value that would be computed with C, (NGRAM_SIZE-1)/2. It sets ngram.word to be the target word, and it sets NGRAM_SIZE/2 words for ngram.before and NGRAM_SIZE/2 words for ngram.after. ngram.before and ngram.after represent the words that occur before and after the target word (if they exist). If NGRAM_SIZE = 9 and this is the third word in the corpus, ngram.before[0] and ngram.before[1] will be empty, but ngram.before[2] and ngram.before[3] will contain data. If the target word occurs more than once in the corpus, instead of creating a new n-gram, it adds to the existing n-gram (so ngram.before may no longer contain a maximum of NGRAM_SIZE/2 words, but would rather contain N*(NGRAM_SIZE/2) where N is the number of times that ngram.word occurs in the corpus).
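
A minimal sketch of a structure consistent with this description; the fields beyond ngram.word, ngram.before, and ngram.after (the counts and the similarity list) are assumptions, not the program's actual layout:

#include <stddef.h>

struct ngram;   /* forward declaration for the list node */

/* Singly linked list node recording a similar, later n-gram. */
struct similar_node {
    struct ngram        *ngram;
    struct similar_node *next;
};

struct ngram {
    char  *word;          /* the target word */
    char **before;        /* words preceding each occurrence of word */
    size_t before_count;  /* grows by NGRAM_SIZE/2 per occurrence */
    char **after;         /* words following each occurrence of word */
    size_t after_count;
    struct similar_node *similar;  /* filled in by the comparison pass */
};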

Lastly, the algorithm goes through each n-gram and compares its context to every other subsequent n-gram whose target word doesn't occur within NGRAM_SIZE distance. This is accomplished by combining the .before and .after arrays into one array and comparing that array to another n-gram's merged array. If the elements that exist within those arrays have a percent similarity (as defined in Computational Linguistic Goals) greater than or equal to THRESHOLD_SIMILAR_NGRAMS, the original n-gram stores the address of the similar n-gram in a singly linked list.


Instead of creating a doubly or circular linked list, we use a singly linked list because it would have been redundant to store the relationship twice (ngram-x is related to ngram-y AND ngram-y is related to ngram-x); we don't care about the order of a relationship, but rather that a relationship exists. As implemented, each n-gram looks for similarity starting with ngram[n+1]. For example, if ngram[0] and ngram[5] are similar, the connection will be stored in ngram[0]. However, when it comes time for ngram[5] to look for and store the addresses of similar n-grams, it will start with ngram[6] as opposed to ngram[0]. This decision originates from the connection between ngram[0] and ngram[5] already being stored and not needing to be duplicated; the resulting data wouldn't change if ngram[5] stored the address of ngram[0], and in fact, it might become an infinite regression of memory addresses when processing, which would require additional precautions. Thus, for simplicity's sake, the algorithm only looks at subsequent n-grams.
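
The subsequent-only scan can be summarized in C as follows, reusing the struct ngram sketched above; percent_similarity() and store_similar() are assumed helpers, and the NGRAM_SIZE-distance exclusion is omitted for brevity:

#include <stddef.h>

int  percent_similarity(const struct ngram *a, const struct ngram *b);
void store_similar(struct ngram *from, struct ngram *to);

/* Each n-gram scans only forward, so every relationship is stored
 * exactly once, in the earlier n-gram's singly linked list. */
void link_similar_ngrams(struct ngram *ngrams, size_t ngram_count)
{
    for (size_t i = 0; i < ngram_count; i++)
        for (size_t j = i + 1; j < ngram_count; j++)
            if (percent_similarity(&ngrams[i], &ngrams[j])
                    >= THRESHOLD_SIMILAR_NGRAMS)
                store_similar(&ngrams[i], &ngrams[j]);
}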

Find Morphemes

The idea of finding morphemes is fairly simple and revolves around a code sequence called find_longest_match, which takes string0 and string1 and tries to find the longest common contiguous character string from position 0 of string0 and string1. The program executes this on two target words with their characters non-reversed and reversed (to find immediate prefixes and suffixes, respectively). It stores this data. It then runs on the internals of the mechanism by removing the first character of string1 and looking for a common contiguous string with a length greater than or equal to 2 characters. It stores this data after it makes sure that the found morpheme isn't a proper subset of a prior morpheme (like { morpheme0: subset, morpheme1: ubset }). It proceeds until string1 == NULL. If string1 == NULL and string0 != NULL, it will remove the first character of string0, reset string1, and repeat removing characters from string1 until both string0 and string1 are NULL.
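
The core matching step might look like this in C; this is a minimal sketch of the longest-common-prefix behavior described above, not the repository's actual find_longest_match. Running the same function on reversed copies of the two words finds shared suffixes.

#include <stdlib.h>
#include <string.h>

/* Return the longest contiguous character string shared by s0 and s1
 * starting at position 0 of both, or NULL if they share none.
 * The caller owns (and must free) the returned buffer. */
static char *find_longest_match(const char *s0, const char *s1)
{
    size_t len = 0;
    while (s0[len] != '\0' && s1[len] != '\0' && s0[len] == s1[len])
        len++;
    if (len == 0)
        return NULL;
    char *match = malloc(len + 1);
    if (match != NULL) {
        memcpy(match, s0, len);
        match[len] = '\0';
    }
    return match;
}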

The program runs this algorithm for each combination of target words, following the permutation algorithm defined in the description of the word-delimiter extraction.

It removes duplicate morphemes by fusion. It fuses the duplicate morphemes by generating regular expressions based off the words in which the morpheme originated and combining them character by character, creating classes within two sets of parentheses (representing arrays for pre-morpheme and post-morpheme). From there, it reconstructs the full regex from its reconstruction data (which is the raw regex stored by character position), which is analyzed for the type of morpheme.

Morphemes are tagged as stems if they occur at any point standing alone, or if they are marked as either a prefix or suffix and then occur as a suffix or prefix (respectively). Prefixes occur at the beginning of words, suffixes at the end, and infixes occur if they occur naturally and account for THRESHOLD_CONFIRMATION percent of the morpheme candidates. Prefixes and suffixes are automatically tagged as such if they meet the threshold and if they occur in the correct position. Circumfixes are tagged by looking at the current morpheme's type: if it is a suffix, it will look through all the morphemes for a prefix (if it is a prefix, for a suffix) that appears with the same frequency and whose list of associated words has a percent agreement greater than or equal to THRESHOLD_CIRCUMFIX, while both morphemes also meet the confirmation threshold. Anything marked UNDEF that meets the confirmation threshold is marked as an infix. All morphemes are tagged UNDEF until this tagging process.


Data Observation

For morpheme tagging, the priority list is:

1) STEM
2) INFIX
3) CIRCUMFIX
4) PREFIX & SUFFIX
5) UNDEF

where a lower value has higher priority.

Once something is tagged as STEM, it can't be taken down from that level. That being said, any other type can be re-tagged as STEM if the morpheme appears unbounded. This is done as a measure of accuracy, and this can be seen in the data supplied (see Data Sample).
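
As a minimal illustration of this priority rule (the enum values and the retag() helper are my own assumptions, not code from the repository):

/* Lower value = higher priority; PREFIX and SUFFIX share rank 4,
 * matching the list above. */
enum morpheme_type {
    STEM = 1, INFIX = 2, CIRCUMFIX = 3,
    PREFIX = 4, SUFFIX = 4, UNDEF = 5
};

/* Re-tag only when the proposed type outranks the current one; since
 * STEM holds the highest priority, nothing can demote a stem. */
void retag(enum morpheme_type *current, enum morpheme_type proposed)
{
    if (proposed < *current)
        *current = proposed;
}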


Data Sample

The following is a data sample that was gleaned from running on the test corpus supplied in the git repository under test-corpus; specifically, it is the first chapter of H. G. Wells's War of the Worlds.

The results took 5.118s of real time (as measured by the time program included with Mac OS X 10.9) and used 274.2 MB of RAM running the version created with make optimized serially:

./nlp --serial

Additionally, for the sake of ease, I ran the output through sort to sort it alphabetically for display:

./test-corpus/War_of_the_Worlds.txt
===================
INFIX: --
INFIX: ace
INFIX: ack
INFIX: ad
INFIX: ain
INFIX: al
INFIX: am
INFIX: an
INFIX: ance
INFIX: anet
INFIX: ar
INFIX: arer
INFIX: ars
INFIX: at
INFIX: ber
INFIX: bit
INFIX: ble
INFIX: ca
INFIX: ce
INFIX: cessar
INFIX: credibl
INFIX: ct
INFIX: de
INFIX: der
INFIX: dnight
INFIX: ea
INFIX: ead
INFIX: ec
INFIX: ects
INFIX: ed
INFIX: edibl
INFIX: een
INFIX: ef
INFIX: el
INFIX: eld
INFIX: eling
INFIX: em
INFIX: ember
INFIX: en
INFIX: ence


INFIX: nce
INFIX: ncern
INFIX: nces
INFIX: nd
INFIX: ne
INFIX: ness
INFIX: nger
INFIX: ns
INFIX: nt
INFIX: ntur
INFIX: ntury
INFIX: ob
INFIX: ol
INFIX: om
INFIX: on
INFIX: onom
INFIX: op
INFIX: ope
INFIX: ople
INFIX: ou
INFIX: oud
INFIX: ough
INFIX: out
INFIX: ow
INFIX: owe
INFIX: pe
INFIX: pear
INFIX: per
INFIX: pers
INFIX: pl
INFIX: plain
INFIX: pp
INFIX: pul
INFIX: rd
INFIX: re
INFIX: rea
INFIX: retch
INFIX: rf
INFIX: rface
INFIX: rge
INFIX: ri
INFIX: rkness
INFIX: rld
INFIX: rm
INFIX: rn
INFIX: ro
INFIX: rs
INFIX: rshaw
INFIX: rt
INFIX: rth
INFIX: ru
INFIX: rutin
INFIX: rv
INFIX: scop


INFIX: scope
INFIX: se
INFIX: serv
INFIX: sh
INFIX: si
INFIX: ss
INFIX: ssar
INFIX: st
INFIX: stan
INFIX: stance
INFIX: tch
INFIX: te
INFIX: ted
INFIX: tel
INFIX: tell
INFIX: tellects
INFIX: telligen
INFIX: telligence
INFIX: tence
INFIX: ter
INFIX: tershaw
INFIX: th
INFIX: ti
INFIX: tin
INFIX: tion
INFIX: tronom
INFIX: ts
INFIX: ul
INFIX: und
INFIX: ur
INFIX: ury
INFIX: us
INFIX: use
INFIX: vy
INFIX: wer
PREFIX: "
PREFIX: 1
PREFIX: A
PREFIX: C
PREFIX: D
PREFIX: E
PREFIX: F
PREFIX: H
PREFIX: M
PREFIX: Mar
PREFIX: N
PREFIX: O
PREFIX: Pe
PREFIX: S
PREFIX: T
PREFIX: Th
PREFIX: Thi
PREFIX: _
PREFIX: a


PREFIX: ab
PREFIX: ac
PREFIX: af
PREFIX: ap
PREFIX: app
PREFIX: appear
PREFIX: astronom
PREFIX: b
PREFIX: ba
PREFIX: beg
PREFIX: bel
PREFIX: belie
PREFIX: bi
PREFIX: bl
PREFIX: bla
PREFIX: br
PREFIX: bri
PREFIX: bu
PREFIX: c
PREFIX: cal
PREFIX: centur
PREFIX: ch
PREFIX: cl
PREFIX: clo
PREFIX: cloud
PREFIX: co
PREFIX: compla
PREFIX: con
PREFIX: concern
PREFIX: cr
PREFIX: d
PREFIX: da
PREFIX: danger
PREFIX: di
PREFIX: dis
PREFIX: dist
PREFIX: distan
PREFIX: dr
PREFIX: du
PREFIX: e
PREFIX: ear
PREFIX: emp
PREFIX: ev
PREFIX: eve
PREFIX: ex
PREFIX: exc
PREFIX: exce
PREFIX: exp
PREFIX: expl
PREFIX: explain
PREFIX: eyes
PREFIX: f
PREFIX: fa
PREFIX: fai


PREFIX: fee
PREFIX: fi
PREFIX: field
PREFIX: fir
PREFIX: fl
PREFIX: fla
PREFIX: flam
PREFIX: fo
PREFIX: fr
PREFIX: g
PREFIX: ga
PREFIX: gr
PREFIX: gre
PREFIX: gu
PREFIX: gun
PREFIX: h
PREFIX: happ
PREFIX: hav
PREFIX: hea
PREFIX: her
PREFIX: hi
PREFIX: ho
PREFIX: hou
PREFIX: house
PREFIX: hu
PREFIX: i
PREFIX: illu
PREFIX: imm
PREFIX: imme
PREFIX: immens
PREFIX: inc
PREFIX: incredibl
PREFIX: inte
PREFIX: intell
PREFIX: intelligen
PREFIX: j
PREFIX: ju
PREFIX: k
PREFIX: l
PREFIX: large
PREFIX: li
PREFIX: lo
PREFIX: m
PREFIX: ma
PREFIX: mat
PREFIX: men
PREFIX: mi
PREFIX: mid
PREFIX: mil
PREFIX: min
PREFIX: mo
PREFIX: mor
PREFIX: mu
PREFIX: mus


PREFIX: n
PREFIX: necessar
PREFIX: ni
PREFIX: o
PREFIX: obs
PREFIX: observ
PREFIX: p
PREFIX: pa
PREFIX: pla
PREFIX: plan
PREFIX: po
PREFIX: point
PREFIX: popul
PREFIX: power
PREFIX: pr
PREFIX: pro
PREFIX: prob
PREFIX: r
PREFIX: ra
PREFIX: read
PREFIX: real
PREFIX: rec
PREFIX: rece
PREFIX: rem
PREFIX: remote
PREFIX: rush
PREFIX: s
PREFIX: sa
PREFIX: sc
PREFIX: scrutin
PREFIX: sec
PREFIX: see
PREFIX: ser
PREFIX: sho
PREFIX: sl
PREFIX: sli
PREFIX: slight
PREFIX: sm
PREFIX: so
PREFIX: sou
PREFIX: sp
PREFIX: spec
PREFIX: spi
PREFIX: sta
PREFIX: stead
PREFIX: str
PREFIX: stre
PREFIX: strea
PREFIX: stretch
PREFIX: stru
PREFIX: su
PREFIX: sup
PREFIX: sur
PREFIX: sw


PREFIX: swi
PREFIX: swim
PREFIX: t
PREFIX: ta
PREFIX: telescop
PREFIX: tha
PREFIX: thi
PREFIX: thir
PREFIX: tho
PREFIX: thou
PREFIX: thr
PREFIX: thre
PREFIX: thro
PREFIX: tr
PREFIX: tra
PREFIX: tw
PREFIX: twe
PREFIX: twel
PREFIX: twent
PREFIX: u
PREFIX: un
PREFIX: v
PREFIX: va
PREFIX: ve
PREFIX: vi
PREFIX: vo
PREFIX: vol
PREFIX: w
PREFIX: wa
PREFIX: water
PREFIX: we
PREFIX: wh
PREFIX: whi
PREFIX: wi
PREFIX: win
PREFIX: wis
PREFIX: wo
PREFIX: ye
STEM: For
STEM: I
STEM: It
STEM: Mars
STEM: Ogilvy
STEM: Ottershaw
STEM: The
STEM: about
STEM: after
STEM: all
STEM: are
STEM: as
STEM: be
STEM: black
STEM: bright
STEM: by


STEM: century
STEM: cool
STEM: darkness
STEM: day
STEM: distance
STEM: do
STEM: dust
STEM: earth
STEM: ever
STEM: existence
STEM: eye
STEM: faint
STEM: feeling
STEM: for
STEM: fro
STEM: gas
STEM: green
STEM: grey
STEM: he
STEM: hour
STEM: idea
STEM: in
STEM: inhabit
STEM: intellects
STEM: intelligence
STEM: is
STEM: it
STEM: last
STEM: life
STEM: light
STEM: lit
STEM: man
STEM: mean
STEM: midnight
STEM: miles
STEM: million
STEM: my
STEM: nearer
STEM: new
STEM: night
STEM: no
STEM: of
STEM: one
STEM: or
STEM: our
STEM: ours
STEM: paper
STEM: papers
STEM: part
STEM: people
STEM: planet
STEM: pointed
STEM: red
STEM: regard


STEM: remember
STEM: round
STEM: same
STEM: seem
STEM: since
STEM: slow
STEM: small
STEM: space
STEM: star
STEM: still
STEM: sun
STEM: surface
STEM: telescope
STEM: that
STEM: the
STEM: them
STEM: to
STEM: under
STEM: up
STEM: war
STEM: warm
STEM: was
STEM: world
STEM: years
SUFFIX: ated
SUFFIX: e,
SUFFIX: ed,
SUFFIX: er,
SUFFIX: ly
SUFFIX: ous
SUFFIX: ves

What's most interesting about this data is that it doesn't seem all that far off in some sections and is wildly off in others. The suffixes seem to be about 60% correct, insofar as the algorithm discerned very familiar suffixes. The stems are 100% accurate. The prefixes and infixes are wildly off. This is in part due to the absence of meaning: once software tries to assign meaning to each morpheme, it will facilitate the eradication of false positives.


Future Ambitions

For the future, it would make sense to rewrite parts of this algorithm to make use of a GPU through CUDA or OpenCL. This would allow most of the operations involved in determining n-gram parallelism to happen simultaneously as opposed to serially. The savings would be exponential: in a 100,000-word corpus, there are 100,000! n-grams of size NGRAM_SIZE that are processed in total. Each of those n-grams could be processed simultaneously instead of sequentially, so it could, theoretically, take approximately the time of processing 1 n-gram of size NGRAM_SIZE to do the work it would otherwise take ~2.82 x 10^456,573 n-gram processing steps to complete (if there were no limits on GPU cores).

Another improvement would be to make use of a job scheduler, or to create one, which would allow this program to run on multiple computers simultaneously. The ideal host for this would be Amazon's EC2 service, as it offers GPU processing in a High Performance Cluster of machines. Running a final program on this type of machinery would yield terrific results.

Finally, the most important aspect of my future ambitions would be to create a morphosyntactic tagger and a gloss generator (semantics extractor) using a corpus and nothing else (semantics require semantic seed data, as language and meaning were not associated in a bubble, but rather are the result of associations in the real world). These two tools are the next logical steps after this program, as their code would function in a manner parallel to the morpheme extraction algorithm.


    References

Chomsky, N. (1956). Three Models for the Description of Language. IEEE Transactions on Information Theory, 2(3), 113-124.

Kernighan, B. W., & Ritchie, D. M. (1988). The C Programming Language (2nd ed.). Englewood Cliffs: Prentice Hall.

Kroah-Hartman, G. (2002, June 26). Kernel Coding Style. The Linux Kernel Archives. Retrieved December 12, 2013, from https://www.kernel.org/doc/Documentation/CodingStyle