Constructing Your Own Corpus from Written Language

Upload: leonard-greer

Post on 29-Dec-2015



TRANSCRIPT

Page 1: Constructing Your Own Corpus from Written Language

Constructing Your Own Corpus from Written Language

Page 2: Constructing Your Own Corpus from Written Language

Some likely sources for your corpus

• 1. From MS Word files
• 2. From the World Wide Web
• 3. From scanned books
• 4. From speech audio files

Page 3: Constructing Your Own Corpus from Written Language

What you need

• MS Word
• Notepad
• PDF to Text Conversion Program (Simpo PDF to Text is a very good one)

Page 4: Constructing Your Own Corpus from Written Language

Convert your files into plain text files

• Prefer UTF-8 encoding, because it can represent the characters of virtually every writing system, including Chinese, Russian, and Turkish.

• Give your files and folders meaningful names (consistent and systematic).
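
If some of your text files were saved in an older encoding, a short script can re-save them as UTF-8. The sketch below is only an illustration: the function names are made up for this example, and the default source encoding (cp1252, common on Western European Windows systems) is an assumption that you must replace with the encoding your files actually use.

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="cp1252"):
    """Re-encode a text file as UTF-8.

    source_encoding is an assumption: it must match how the file was
    actually saved, otherwise characters will come out wrong.
    """
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")


def is_valid_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Running is_valid_utf8 on every file before analysis is a cheap way to catch files that were saved with the wrong setting.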

Page 5: Constructing Your Own Corpus from Written Language

1. From MS Word Files to Text
• Open your MS Word Document.
• [File or MS Word Symbol on top left corner]
• Save as
• Other Formats
• Save as Type = Plain Text
• Save
• File Conversion window will pop up.
• Select Other Encoding
• Highlight Unicode (UTF-8)
• OK
• Close MS Word
• Go back to the folder and find the text file you just saved.
• Double click on it and it will open in Notepad. Check how it looks.
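
For many documents, the same conversion can be sketched in Python without opening MS Word. This is a minimal illustration, not a replacement for the steps above: it assumes the newer .docx format (a zip archive containing word/document.xml), it does not read the older .doc format, and the function names are invented for this example. Headers, footnotes, and text boxes may be missed or duplicated, so check the output just as you would in Notepad.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# XML namespace used by WordprocessingML, the format inside .docx files.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(docx_path):
    """Return the body text of a .docx file, one paragraph per line."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def save_as_utf8(docx_path, txt_path):
    """Write the extracted text to a UTF-8 plain text file."""
    Path(txt_path).write_text(docx_to_text(docx_path), encoding="utf-8")
```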

Page 6: Constructing Your Own Corpus from Written Language

An easier way

• Use a WordToText converter
• For example, Zilla
• http://www.pdfzilla.com/zilla_word_to_text_converter.html

Page 7: Constructing Your Own Corpus from Written Language

Clean the tables

• Check the file and clean the parts that you do not want to include in your research. For example, you might want to exclude the names of the students, tables, figures, and references.
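
Part of this cleaning can be roughed out with a script. The sketch below assumes that captions begin with "Table" or "Figure" followed by a number and that the reference list starts with a line containing only the word "References"; both patterns are assumptions about your files, and the function name is invented. A sentence that happens to begin with "Table 1" would also be dropped, so compare the result with the original by eye.

```python
import re

# Lines that start a table or figure caption, e.g. "Table 2: ..." (assumed pattern).
CAPTION = re.compile(r"^\s*(Table|Figure)\s+\d+", re.IGNORECASE)
# A line consisting only of the word "References" (assumed heading).
REFERENCES = re.compile(r"^\s*References\s*$", re.IGNORECASE)


def clean_text(text):
    """Drop caption lines and everything from the References heading on."""
    kept = []
    for line in text.splitlines():
        if REFERENCES.match(line):
            break
        if CAPTION.match(line):
            continue
        kept.append(line)
    return "\n".join(kept)
```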

Page 8: Constructing Your Own Corpus from Written Language

2. From the World Wide Web to Notepad, and Notepad to Text

• Find an article on the internet, perhaps from an online newspaper.

• Using the mouse, left click and highlight the part of the text you want, then press Ctrl + C.

• Open Notepad. Press Ctrl + V to paste it.
• File
• Save as
• Encoding = UTF-8
• Save
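
If you have saved the web page itself rather than copying from it, the visible text can be pulled out with Python's standard html.parser module. The sketch below is one possible approach, with invented class and function names and an assumed list of block-level tags. It keeps everything visible on the page, including menus and advertisements, so the copy-and-paste route above remains the more selective one.

```python
from html.parser import HTMLParser

# Tags treated as line breaks (an assumed, incomplete list).
BLOCK_TAGS = {"p", "div", "li", "br", "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Normalize spacing inside each line and drop empty lines.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

The returned string can then be written to disk with encoding="utf-8", which matches the Encoding = UTF-8 step in Notepad.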

Page 9: Constructing Your Own Corpus from Written Language

3. From scanned books

Scan every page and save as Searchable PDF files.

Convert your PDF files to text files. (You can use Simpo PDF to Text, Adobe Reader, or PDF Creator.)

Correct the mistakes. (Sometimes there are a great many of them.)

Save the text files in UTF-8 Encoding
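
Some of the most mechanical scanning mistakes can be fixed by a script before you proofread. The sketch below handles three of them: typographic ligatures, words hyphenated across a line break, and runs of spaces. The function name and the choice of fixes are assumptions for illustration. Note that a genuine compound split at a line end (for example "well-" followed by "known") will also be joined, so this does not replace checking the text by hand.

```python
import re

# Typographic ligatures that OCR software sometimes leaves in the output.
LIGATURES = {"\ufb01": "fi", "\ufb02": "fl", "\ufb00": "ff"}


def tidy_ocr_text(text):
    """Apply a few mechanical fixes to OCR output."""
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    # Rejoin words hyphenated across a line break: "cor-\npus" -> "corpus".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    return text
```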

Page 10: Constructing Your Own Corpus from Written Language

Tag Your Corpus for Other Information

You may want to tag your corpus for information other than part of speech (POS).

For example, hedges, pauses, disagreement, metaphors, grammar mistakes, and so on.

You can enter these annotations by hand, or you can use a software program designed to speed up the process.
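
When the feature you are tagging can be recognized from a word list, a first pass can be automated and then corrected by hand. The sketch below wraps each listed hedge in a tag. The word list, the <hedge> tag format, and the function name are all hypothetical choices for this example; a real study would need its own inventory and would still require manual checking, because a word such as "might" is not a hedge in every context.

```python
import re

# A small illustrative list; a real study would need its own inventory.
HEDGES = ["perhaps", "possibly", "might", "seems", "sort of"]


def tag_hedges(text, hedges=HEDGES):
    """Wrap each listed hedge in <hedge>...</hedge>, ignoring case."""
    # Longer items first so multi-word hedges win over shorter overlaps.
    ordered = sorted(hedges, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(h) for h in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(r"<hedge>\1</hedge>", text)
```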

Page 11: Constructing Your Own Corpus from Written Language

Scenario 1

• You have decided to create a corpus out of your students’ papers. You asked your students to email their papers to you in MS Word format and they did. You want to study the types of contexts in which they prefer the passive voice.

Page 12: Constructing Your Own Corpus from Written Language

Scenario 2

• You have decided to create a corpus out of the applied linguistics books and articles that you have read. You want to compare the lexical bundles in them with the ones you use in your own academic papers. Luckily, some of the articles were already in PDF format, but you had to scan some of your books.

Page 13: Constructing Your Own Corpus from Written Language

Scenario 3

• You want to create a corpus of newspaper headlines from The New York Times and USA Today to compare their lengths.

Page 14: Constructing Your Own Corpus from Written Language

Scenario 4

• You have decided to create a corpus out of your own writing. You want to use all of the academic papers you wrote during your MA years.

Page 15: Constructing Your Own Corpus from Written Language

Is there a faster way to follow these procedures?

• Yes! If you know a programming language, such as Perl, you can write a script that automates most of the procedures above.
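
As a rough illustration of what such a script might look like, here is a sketch in Python rather than Perl. It copies every .txt file from one folder into a corpus folder, saves each copy as UTF-8, and gives the copies consistent, numbered names. The function name, the "text" prefix, and the three-digit numbering are assumptions chosen for this example, and the input files are assumed to be UTF-8 already.

```python
from pathlib import Path


def build_corpus(source_dir, corpus_dir, prefix="text"):
    """Copy every .txt file into corpus_dir as UTF-8 with numbered names.

    Returns the list of new file names, in the order they were written.
    """
    source = Path(source_dir)
    target = Path(corpus_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for number, path in enumerate(sorted(source.glob("*.txt")), start=1):
        # Assumes the input is already UTF-8; undecodable bytes are
        # replaced rather than stopping the whole run.
        text = path.read_text(encoding="utf-8", errors="replace")
        new_name = f"{prefix}_{number:03d}.txt"
        (target / new_name).write_text(text, encoding="utf-8")
        written.append(new_name)
    return written
```

Keep the corpus folder separate from the source folder so that a second run does not pick up the files written by the first.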