
Post on 21-Dec-2015


CIS392 Sp 03 Assign#1 1

CIS392 Text Processing, Retrieval, and Mining

Spring 03

Instructor: Dr. Y. F. Brook Wu

BOW toolkit:

http://www.cs.cmu.edu/~mccallum/bow

CIS392 Sp 03 Assign#1 2

Logging in to AFS
On campus: go to a computer lab in GITC 2305.
At home: make sure your Internet connection has been established. (These steps assume everyone has Windows at home.)
  Click on Start, then Run.
  Type in “telnet afs1.njit.edu” (without the quotes; the first screen shows some useful information).
  Enter your user name and password.
What if your account doesn’t work? Call the help desk at 973.596.2900; they can reset your password for you.

CIS392 Sp 03 Assign#1 3

Useful UNIX commands
Note: All filenames and commands on a UNIX system are case sensitive.
General syntax:
  command [option] argument
Options modify the way a command works, and they are optional.
Arguments are usually files; sometimes they are optional too.
Ex: rm -r directory_name
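
As a minimal sketch of the command / option / argument pattern described above, the following session runs in a scratch directory so that nothing real is deleted. The `sandbox` variable and the file name `notes.txt` are illustrative and not part of the assignment.

```shell
# Work in a throwaway directory (the sandbox variable is hypothetical).
sandbox=$(mktemp -d)
cd "$sandbox"

mkdir directory_name             # command: mkdir, argument: directory_name
touch directory_name/notes.txt   # create one empty file inside it

ls                               # no option, no argument: list the current directory
ls -l directory_name             # option -l (long listing) plus an argument

rm -r directory_name             # option -r: remove the directory and everything in it
```

Without the `-r` option, `rm directory_name` would refuse to delete a directory, which shows how an option changes what a command does.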

CIS392 Sp 03 Assign#1 4

Note
Typing two “-” next to each other in MS PowerPoint makes them look like “—”. The BOW and UNIX commands shown in these slides can therefore be misleading, so please refer to the BOW help file and the UNIX documentation for their actual usage.

CIS392 Sp 03 Assign#1 5

Useful UNIX commands
man (for manual); ex: man ls (manual for the ls command)
cd (change directory)
ls (list files and attributes)
dir (list files)
mkdir (create a directory)
rm (delete a file)
rm -fr directory_name (delete the whole directory and the files inside it)

CIS392 Sp 03 Assign#1 6

Useful UNIX commands
rmdir (remove an empty directory)
cp (copy)
pwd (current working directory)
pico (a text editor)
more filename (read a plain text file one screen at a time; press the space bar to continue and “q” to quit)
quota (disk space)
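
A short practice session can tie several of these commands together. This is only a sketch: it uses a scratch directory (the `sandbox` and `practice` names are made up for illustration) and leaves out `man`, `pico`, `more`, and `quota`, which are interactive or depend on the system.

```shell
# Scratch area so the practice run cannot touch real files (hypothetical name).
sandbox=$(mktemp -d)
cd "$sandbox"
pwd                      # show the current working directory

mkdir practice           # create a directory
cd practice              # go inside it
echo "hello" > a.txt     # make a small text file to work with
cp a.txt b.txt           # copy it
ls                       # list the files: a.txt and b.txt

rm a.txt b.txt           # delete both files
cd ..                    # go back up one level
rmdir practice           # works only because the directory is now empty
```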

CIS392 Sp 03 Assign#1 7

More useful UNIX commands
http://www.njit.edu/CSD/Docs/unixcmds.html
http://www.njit.edu/Directory/Admin/CSD/Academic_Computing/Manuals/UNIX/UNIX.html

CIS392 Sp 03 Assign#1 8

How to create your home page on the AFS system
Help info: http://www-ec.njit.edu/ec_info/newuser/web/web.html

Execute this command at the UNIX prompt: /usr/ec/bin/home.page.setup

Your URL: http://www-ec.njit.edu/~yourusername

CIS392 Sp 03 Assign#1 9

Overview of Retrieval Experiment

Create a sub-directory for CIS392 assignments under ~your_user_name/public_html

Create 3 sub-directories under the above directory for the 3 automatic indexing activities

Perform 3 automatic indexing activities with 3 different options

CIS392 Sp 03 Assign#1 10

Overview of Retrieval Experiment (cont)
Perform 3 retrievals for each of the above 3 automatic indexing activities
Analyze how the different indexing options affect retrieval
Make an html page to present your results.

CIS392 Sp 03 Assign#1 11

Creating sub-directories
Change directory to public_html by typing: cd public_html
mkdir cis392 (now you’ve created a directory for your CIS392 retrieval assignments)
cd cis392 (go inside the cis392 directory)

CIS392 Sp 03 Assign#1 12

Creating three sub-directories
mkdir model1 (this directory stores results from the default settings: no stemming, and stop words removed.)
mkdir model2 (this directory stores results from the following settings: no stemming, and stop words INCLUDED.)
mkdir model3 (this directory stores results from the following settings: stemming, and stop words removed.)
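
The directory setup from the last two slides can be sketched as one session. Here `HOME_DIR` is a hypothetical stand-in for your AFS home directory (a temporary directory is used so the sketch can be tried anywhere); on AFS you would start from your real home directory, where public_html already exists once your home page is set up.

```shell
# HOME_DIR is a made-up placeholder for your AFS home directory.
HOME_DIR=$(mktemp -d)
mkdir -p "$HOME_DIR/public_html"   # only needed in this sketch

cd "$HOME_DIR/public_html"
mkdir cis392                       # directory for the CIS392 retrieval assignments
cd cis392
mkdir model1 model2 model3         # one directory per indexing setting
ls                                 # lists model1, model2 and model3
```

Passing several names to a single `mkdir` is equivalent to running it three times.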

CIS392 Sp 03 Assign#1 13

URL of your retrieval experiment

http://www-ec.njit.edu/~yourusername/cis392/cis392re.html

See a sample page created by Prof Wu: http://www-ec.njit.edu/~wu/cis392/cis392re.html

CIS392 Sp 03 Assign#1 14

Getting Access to BOW and Test Collection

There are three directories under ~wu/IR_Tools:
  bow (for the BOW system); to execute BOW, change directory to: ~wu/IR_Tools/bow/bin
  som (for the self-organizing map program. Do NOT use it now!)
  tc (test collection, Library and Information Science Abstracts); the text is under ~wu/IR_Tools/tc/lisa/text/group0 to group5

CIS392 Sp 03 Assign#1 15

Test Collection: LISA
The sample queries are stored in: ~wu/IR_Tools/tc/lisa/LISA.QUE
The relevant documents corresponding to the queries are stored in: ~wu/IR_Tools/tc/lisa/LISA.REL (“-1” marks the end of an entry.)

CIS392 Sp 03 Assign#1 16

Operating Arrow of BOW
Read the information on BOW’s web site (again, the URL is listed in the “Resources” section of the class syllabus)
Read Arrow’s help file (available on the syllabus page; you should print a copy of the help file.)

CIS392 Sp 03 Assign#1 17

Automatic Indexing
To begin the retrieval tasks, you first need to index the whole document collection.
Specify lexing options (stop word removal and/or stemming) at this time.
arrow -d ~yourusername/public_html/cis392 --index ~wu/IR_Tools/tc/lisa/text/*
The * sign is a wildcard that represents all files and directories under ~wu/IR_Tools/tc/lisa/text

CIS392 Sp 03 Assign#1 18

Automatic Indexing
The -d parameter specifies where the statistics resulting from indexing will be stored. (You will have to specify this directory both when you index and when you retrieve documents.)
The path after --index specifies the location of the text collection.
The default lexing settings of the above task are: NO stemming performed, and stop words REMOVED.
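
To plan the three indexing runs, it may help to write the commands out before executing anything. The sketch below only prints them; it does not run Arrow. Two things in it are assumptions rather than facts from these slides: the option names `--no-stoplist` and `--use-stemming` (believed to be the BOW lexing options for keeping stop words and for stemming, but confirm them in Arrow's help file), and the choice to point `-d` at the model1, model2 and model3 directories.

```shell
# Paths copied from the slides; single quotes keep ~ and * from being expanded here.
USER_DIR='~yourusername/public_html/cis392'
COLLECTION='~wu/IR_Tools/tc/lisa/text/*'

# index_cmd MODEL [LEXING_OPTION] prints one indexing command (helper name is made up).
index_cmd() {
  echo "arrow -d $USER_DIR/$1${2:+ $2} --index $COLLECTION"
}

cmd1=$(index_cmd model1)                  # defaults: no stemming, stop words removed
cmd2=$(index_cmd model2 --no-stoplist)    # ASSUMED flag: keep stop words
cmd3=$(index_cmd model3 --use-stemming)   # ASSUMED flag: turn stemming on

printf '%s\n' "$cmd1" "$cmd2" "$cmd3"
```

Once the flags are checked against the help file, each printed line can be typed at the UNIX prompt from ~wu/IR_Tools/bow/bin.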

CIS392 Sp 03 Assign#1 19

Query assigned for retrieval
Please refer to the retrieval experiment section of the online syllabus to see which query you get for the experiment. (http://web.njit.edu/~wu/teaching/sp03/CIS392/CIS392-Sp03.htm)

CIS392 Sp 03 Assign#1 20

Retrieval
First specify where the indexing statistics are stored, and then the query to be performed.
arrow -d ~yourusername/public_html/cis392/model1 --num-hits-to-show=25 --query > ~yourusername/public_html/cis392/model1/retrieved_docs
The greater-than sign (>) redirects the output to a file; the path after it gives the output filename and where it will be stored.
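
How the greater-than sign behaves can be tried without Arrow. In this sketch `printf` stands in for any command that prints results, and the text it writes is placeholder text, not real Arrow output; the `sandbox` directory is likewise made up.

```shell
sandbox=$(mktemp -d)                # scratch directory (hypothetical)
out="$sandbox/retrieved_docs"       # same file name as on the slide

printf 'first run\n' > "$out"       # > creates the file and writes the output into it
printf 'second run\n' > "$out"      # > again: the old contents are replaced
printf 'appended\n' >> "$out"       # >> adds to the end instead of replacing

cat "$out"                          # the file now holds "second run" and "appended"
```

Because `>` replaces the file each time, rerunning a retrieval with the same output path overwrites the earlier results.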

CIS392 Sp 03 Assign#1 21

Presenting your RE
Create a page named cis392re.html under your ~/public_html/cis392 directory.
This page should contain several pieces of information; see: http://web.njit.edu/~wu/cis392/cis392re.html

CIS392 Sp 03 Assign#1 22

Presenting your RE
You can create this html page with the pico editor in UNIX (if you know basic html tags), Microsoft Word (save the file in html format), or Netscape Composer.

If you use an html editor, you might need FTP software. http://www.zdnet.com/downloads/stories/info/0,10615,30994,00.html

Before due date: Please check all items on your html page and make sure all of them are displayed properly.

After due date: do not make changes. I can check when the files were last updated.