great china novels


Upload: parashara

Post on 16-Aug-2015


Cst402 Final Project
Ruonan Wen

Project Report

1. Introduction

Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel. The Dream of the Red Chamber is one of China's Four Great Classical Novels. In this project, the author applied Hadoop MapReduce to the novel to count how often each character occurs, and compared the result with the most frequent characters in modern imaginative Chinese texts to see the differences and the development of the Chinese language.

2. Initial Motivation

Like English, the Chinese language has gone through a long period of change and development. From the ancient style to the vernacular and on to today's modern Chinese, some characters were abandoned, some were introduced, and some came to be expressed in another way. It is particularly interesting to explore the development of the Chinese language. In addition, as a new framework, Hadoop MapReduce has not been widely used, especially in China. Applying Hadoop MapReduce to Chinese data seems particularly fascinating.

3. Input File

The Dream of the Red Chamber is one of China's Four Great Classical Novels. It is generally acknowledged to be the pinnacle of classical Chinese novels. It was composed in 1784 during the Qing Dynasty, when Chinese vernacular literature started to grow. Literary works completed during the late Qing Dynasty usually combined ancient and vernacular Chinese, and are therefore also considered the start of the modern Chinese language. In this project, the author downloaded an electronic version (.txt format) of The Dream of the Red Chamber, converted the file to Unicode, and applied Hadoop MapReduce to count the character frequency. Here is a sample of The Dream of the Red Chamber in Chinese: [sample not reproduced]

4. Mapper

Read the file line by line. For each line, count the frequency of every character. Output each character together with its frequency in that line.
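The per-line counting described above can be illustrated with a short Python sketch. This is not the project's own mapper, whose listing is not preserved here. The function names map_line and reduce_counts are hypothetical, and skipping whitespace is an assumption, since the report only says that punctuation is kept in the output. So that the example runs end to end, a small summing function stands in for the reducer stage described in Section 5.

```python
from collections import Counter


def map_line(line):
    """Mapper step (sketch): count each character within one line.

    Yields (character, count_in_this_line) pairs. Whitespace is skipped,
    which is an assumption; punctuation is kept, as in the report.
    """
    counts = Counter(ch for ch in line if not ch.isspace())
    for ch, n in counts.items():
        yield ch, n


def reduce_counts(pairs):
    """Reducer stand-in (sketch): sum the per-line counts of each character."""
    totals = Counter()
    for ch, n in pairs:
        totals[ch] += n
    return dict(totals)


if __name__ == "__main__":
    # Two short illustrative lines, not taken from the input file.
    lines = ["红楼梦红", "红，楼"]
    pairs = [pair for line in lines for pair in map_line(line)]
    for ch, total in sorted(reduce_counts(pairs).items()):
        print(f"{ch}\t{total}")
```

In a real Hadoop job the framework groups the mapper output by key before the reducer sees it; here the grouping is folded into reduce_counts to keep the sketch self-contained.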
Here is the main part of the mapper algorithm: [code listing not reproduced]

5. Reducer

Add together all the frequencies of the same character. Here is the main part of the reducer algorithm: [code listing not reproduced]

6. Output File

The output file lists every character in the novel with its frequency. Punctuation marks are included in the output but were discarded during analysis. There are 4531 different characters in the novel, including punctuation and unreadable characters. Here is a sample: [sample not reproduced]

Noticeably, some ancient Chinese characters cannot be displayed correctly after Unicode conversion: [sample not reproduced]

After conversion these are no longer Chinese characters. However, the results showed that they were not frequent characters, so we ignored them in the experiment. The whole output file can be found in the file ouput.rtf.

7. Results

The file result.xlsx gives the result of my project. Here is a sample of the result file: [sample not reproduced]

There are four columns:

1. character in modern imaginative Chinese
2. character frequency in modern imaginative Chinese
3. character in the novel The Dream of the Red Chamber
4. character frequency in The Dream of the Red Chamber

The No. on the side indicates the frequency ranking. The file presents the 3999 most frequent characters, sorted by frequency, in both modern imaginative texts and The Dream of the Red Chamber. The data for modern imaginative texts was obtained from an online source: http://lingua.mtsu.edu/chinese-computing/statistics/

8. Analyses and Comparison

8.1. Chinese vernacular literature

Some characters, such as [characters not preserved], are commonly used in vernacular Chinese and even in modern Chinese literature, and are also among the most frequent characters in The Dream of the Red Chamber. Since this novel was composed in the late Qing Dynasty, my experiment supports the view that The Dream of the Red Chamber is indeed one of the works that marked the start of Chinese vernacular literature.
Frequency ranking in The Dream of the Red Chamber / Frequency ranking in modern imaginative Chinese
14 21

8.2. Characters in names of novel

The Dream of the Red Chamber is a novel about four big families involving many people. Thus, there are a lot of names in this novel. From the result, we can see that some characters, such as [characters not preserved], are very common in the novel but sit at very different positions in the list of the 4000 most frequent modern imaginative Chinese characters. It is because [two characters, not preserved] are characters in the names of the main characters in this novel. Thus, on one hand, we could use this result to investigate the main characters (people) in the novel; on the other hand, when data from the novel is used to contribute to character counts for all works of a particular period, it is important to discard this kind of character.

Frequency ranking in The Dream of the Red Chamber / Frequency ranking in modern imaginative Chinese
231853 1141347

8.3. Development of modern Chinese

Even though The Dream of the Red Chamber marked the start of Chinese vernacular literature, it is a classical work. Many characters in this novel cannot be found in the modern imaginative Chinese list. In addition, some characters commonly used in classical works sit at very different positions in the modern imaginative Chinese list. For example:

Frequency ranking in The Dream of the Red Chamber / Frequency ranking in modern imaginative Chinese
93990 32165

9. Further Works

In this project, we analyzed only The Dream of the Red Chamber. We did find some properties of this novel and some differences between it and the modern Chinese language. However, the Chinese language has a very long history and changed from dynasty to dynasty. To see the development of Chinese as a language, we should apply Hadoop MapReduce to more works from different periods.
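The rank comparison behind Section 8 is the step that would be repeated for each additional period. Here is a minimal Python sketch of it, assuming the character counts are already available as dictionaries. The function names are hypothetical, and the counts in the usage example are made up for illustration; they are not values from result.xlsx.

```python
def rank_by_frequency(counts):
    """Return {character: rank}, where rank 1 is the most frequent character.

    Ties are broken by the character itself so the ranking is deterministic
    (an assumption; the report does not say how ties were handled).
    """
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {ch: rank for rank, (ch, _) in enumerate(ordered, start=1)}


def compare_ranks(novel_counts, modern_counts):
    """List (character, rank_in_novel, rank_in_modern) in novel rank order.

    rank_in_modern is None when the character is absent from the modern list.
    """
    novel = rank_by_frequency(novel_counts)
    modern = rank_by_frequency(modern_counts)
    return [
        (ch, rank, modern.get(ch))
        for ch, rank in sorted(novel.items(), key=lambda kv: kv[1])
    ]


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    novel_counts = {"了": 50, "玉": 30, "的": 20}
    modern_counts = {"的": 100, "了": 60}
    for ch, novel_rank, modern_rank in compare_ranks(novel_counts, modern_counts):
        print(ch, novel_rank, modern_rank)
```

A character whose modern rank is None or far from its novel rank is a candidate for the name characters of Section 8.2 or the classical characters of Section 8.3.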
In addition, in this project, I applied Hadoop MapReduce to the novel only to accomplish the character counting. In further work, I could combine the frequent Chinese characters of different periods in the same file and use Hadoop MapReduce to analyze the changes and differences.

Finally, when Hadoop MapReduce was applied to Chinese data, some problems occurred in data conversion, especially for classical literary works, since some ancient characters are no longer in use.

10. Thanks

Thanks to Dr. Eric G Berkowitz for his help throughout the whole semester. Thanks to Kyle Schmitt for helping me convert the text file to Unicode. Thanks to Melissa Etling for helping me with Hadoop MapReduce.