
Amazon Web Services – Plagiarism Application

Danijel Novaković

January 31st, 2012

Supervisor: Prof. Amin Anjomshoaa

Outline

Scenario

Used Tools
– The AWS Toolkit for Eclipse
– Karmasphere Studio for Amazon
– Apache PDFBox

Realization of Scenario Steps

Final Conclusions & Personal Opinion

Scenario

The sample PDF files are stored in an S3 bucket under the following Endpoint: http://exercise2.ws2011.s3-website-eu-west-1.amazonaws.com/ (first a user should be authenticated in order to access these files).

The file names are read and placed in an Amazon SQS queue for further processing.

An Amazon EC2 instance processes the queued items and extracts the paragraphs of each file as text. The result should be stored in a second Amazon S3 bucket.

As the next step, Elastic MapReduce should be applied to the data produced by the previous step. The MapReduce process should simply count words and, for each paragraph, determine the ten most frequent words. The result should then be stored in Amazon SimpleDB.

Finally, some sample queries should be provided that take keywords and return the list of paragraphs that best match those keywords.

Used Tools I

The AWS Toolkit for Eclipse

– An open source plug-in for the Eclipse Java IDE that makes it easier for developers to develop, debug, and deploy Java applications using Amazon Web Services.

– With the AWS Toolkit for Eclipse, you'll be able to get started faster and be more productive when building AWS applications.

– The AWS Toolkit for Eclipse features:

AWS SDK for Java

AWS Explorer

AWS Elastic Beanstalk Deployment and Debugging

Support for multiple AWS Accounts

– http://aws.amazon.com/eclipse/

Used Tools II

Karmasphere Studio for Amazon

– Graphical environment that supports the complete lifecycle of developing for Amazon Elastic MapReduce, including prototyping, developing, testing, debugging, deploying, and optimizing Hadoop jobs.

– By simplifying development, Karmasphere Studio increases the productivity of developers, saving time and effort.

– Comes in versions compatible with Eclipse.

– Two different licensing models:

License Included (the Karmasphere software has been licensed by AWS)

Bring-Your-Own (designed for customers who prefer to use existing Karmasphere)

– http://aws.amazon.com/elasticmapreduce/karmasphere/

– http://karmasphere.com/ksc/karmasphere-studio-for-amazon.html

Used Tools III

Apache PDFBox

– Java PDF Library

– Open source Java tool for working with PDF documents

– Used for PDF to text extraction

– http://pdfbox.apache.org/

Scenario – part I

import java.io.File;
import java.util.List;
import java.util.UUID;

import com.amazonaws.auth.PropertiesCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.PutObjectRequest;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClient;
import com.amazonaws.services.sqs.model.CreateQueueRequest;
import com.amazonaws.services.sqs.model.DeleteMessageRequest;
import com.amazonaws.services.sqs.model.DeleteQueueRequest;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
import com.amazonaws.services.sqs.model.SendMessageRequest;

// Both clients read the same credentials file from the classpath
AmazonS3 s3 = new AmazonS3Client(new PropertiesCredentials(MainClass.class.getResourceAsStream("AwsCredentials.properties")));
AmazonSQS sqs = new AmazonSQSClient(new PropertiesCredentials(MainClass.class.getResourceAsStream("AwsCredentials.properties")));

String inputBucketName = "exercise2.ws2011";
String mainBucketName = "introduction.to.cloud.computing";
String vFolderWithParagrapfsName = "pdf.extracted.paragraph";
// The random suffix keeps the queue name unique across runs
String queueName = "myQueue01" + UUID.randomUUID();
int numberOfSentMessages = 0;

CreateQueueRequest createQueueRequest = new CreateQueueRequest(queueName);
String myQueueUrl = sqs.createQueue(createQueueRequest).getQueueUrl();

// Send the key of every object in the input bucket to the queue
ObjectListing objectListing = s3.listObjects(new ListObjectsRequest().withBucketName(inputBucketName));
for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
    String fileName = objectSummary.getKey();
    sqs.sendMessage(new SendMessageRequest(myQueueUrl, fileName));
    numberOfSentMessages++;
}

Scenario – part II

// create bucket
s3.createBucket(mainBucketName);

// create virtual folder in created bucket: upload an empty file under a key that ends in "/"
String tmpFileName = "tmpFile.txt";
File tmpFile = new File(tmpFileName);
boolean successfullyCreated = tmpFile.createNewFile();
s3.putObject(new PutObjectRequest(mainBucketName, vFolderWithParagrapfsName + "/", tmpFile));
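The "virtual folder" created above is only a zero-byte object whose key ends in "/"; S3 keys form a flat namespace, and a folder listing amounts to selecting keys by prefix. The following is a minimal sketch of that idea using an in-memory sorted set instead of S3. The class and method names (PrefixListingSketch, listByPrefix) and the sample keys are hypothetical and not part of the original application.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class PrefixListingSketch {
    /** Returns all keys that start with the given prefix, in sorted order. */
    static List<String> listByPrefix(TreeSet<String> keys, String prefix) {
        List<String> result = new ArrayList<>();
        // In sorted order, keys sharing a prefix are contiguous and start at the prefix itself
        for (String key : keys.tailSet(prefix, true)) {
            if (!key.startsWith(prefix)) {
                break; // left the contiguous range of matching keys
            }
            result.add(key);
        }
        return result;
    }

    public static void main(String[] args) {
        TreeSet<String> keys = new TreeSet<>();
        keys.add("pdf.extracted.paragraph/");              // zero-byte "folder" marker
        keys.add("pdf.extracted.paragraph/paper1_1.txt");  // hypothetical paragraph files
        keys.add("pdf.extracted.paragraph/paper1_2.txt");
        keys.add("other.txt");
        System.out.println(listByPrefix(keys, "pdf.extracted.paragraph/"));
    }
}
```

Note that the marker object itself is part of the listing, which is why code that walks the bucket has to tell the marker apart from real files.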

// Poll the queue until every message sent in part I has been received
ReceiveMessageRequest receiveMessageRequest = new ReceiveMessageRequest(myQueueUrl);
int totalNumberOfReceivedMessages = 0;
int numberOfReceivedMessages = 0;

while (numberOfSentMessages != totalNumberOfReceivedMessages) {
    List<Message> messages = sqs.receiveMessage(receiveMessageRequest).getMessages();
    numberOfReceivedMessages = messages.size();
    for (Message message : messages) {
        String fileName = message.getBody();
        String messageReceiptHandle = message.getReceiptHandle();
        sqs.deleteMessage(new DeleteMessageRequest(myQueueUrl, messageReceiptHandle));

        // Download the PDF and extract its text
        // (pdfDir, fileName2, downloadFromUrl and PDFTextParser are defined in code not shown on the slides)
        String sURL = s3.generatePresignedUrl(inputBucketName, fileName, null).toString();
        downloadFromUrl(sURL, pdfDir + "/" + fileName);
        PDFTextParser pdfTextParserObj = new PDFTextParser();
        String pdfToText = pdfTextParserObj.pdftoText(pdfDir + "/" + fileName);
        pdfTextParserObj.writeTexttoFile(pdfToText, pdfDir + "/" + fileName2);
        . . .
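Once the text of a PDF has been extracted, it has to be split into paragraphs before each paragraph can be stored on its own. The slides do not show how this is done, so the following is only a sketch under a stated assumption: paragraphs in the extracted text are separated by one or more blank lines, which real PDF output does not always guarantee. The names ParagraphSplitSketch and splitParagraphs are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class ParagraphSplitSketch {
    /** Splits text into paragraphs, assuming blank lines separate them. */
    static List<String> splitParagraphs(String text) {
        List<String> paragraphs = new ArrayList<>();
        // Normalize Windows line endings, then split on runs of blank lines
        for (String block : text.replace("\r\n", "\n").split("\\n\\s*\\n")) {
            // Join wrapped lines and collapse repeated whitespace
            String paragraph = block.trim().replaceAll("\\s+", " ");
            if (!paragraph.isEmpty()) {
                paragraphs.add(paragraph);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        String text = "First line\nsecond line.\n\n\nNext  paragraph.\r\n\r\n";
        for (String paragraph : splitParagraphs(text)) {
            System.out.println(paragraph);
        }
    }
}
```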

Scenario – part III

while (numberOfSentMessages != totalNumberOfReceivedMessages) {
    List<Message> messages = sqs.receiveMessage(receiveMessageRequest).getMessages();
    numberOfReceivedMessages = messages.size();
    for (Message message : messages) {
        String fileName = message.getBody();
        String messageReceiptHandle = message.getReceiptHandle();
        sqs.deleteMessage(new DeleteMessageRequest(myQueueUrl, messageReceiptHandle));

        String sURL = s3.generatePresignedUrl(inputBucketName, fileName, null).toString();
        downloadFromUrl(sURL, pdfDir + "/" + fileName);
        PDFTextParser pdfTextParserObj = new PDFTextParser();
        String pdfToText = pdfTextParserObj.pdftoText(pdfDir + "/" + fileName);
        pdfTextParserObj.writeTexttoFile(pdfToText, pdfDir + "/" + fileName2);
        . . .
        // for each paragraph (pseudocode): store it as its own object in the virtual folder
        s3.putObject(new PutObjectRequest(mainBucketName, vFolderWithParagrapfsName + "/" + fileName3, paragrafContent));
    }
    totalNumberOfReceivedMessages += numberOfReceivedMessages;
}

// All messages have been processed: remove the queue
sqs.deleteQueue(new DeleteQueueRequest(myQueueUrl));
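The loop above terminates when the number of received messages equals the number of sent messages. Standard SQS queues deliver each message at least once, so a message can occasionally arrive twice; the counter would then overshoot and the != test would never become false. One possible safeguard, sketched here with an in-memory queue standing in for SQS, is to count distinct message IDs and compare with <. The Msg class and the drain method are hypothetical stand-ins, not part of the AWS SDK or the original code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class QueueDrainSketch {
    /** Hypothetical stand-in for a queue message: a unique ID plus a body. */
    static final class Msg {
        final String id;
        final String body;

        Msg(String id, String body) {
            this.id = id;
            this.body = body;
        }
    }

    /** Processes each distinct message once and stops after expectedDistinct messages. */
    static List<String> drain(Deque<Msg> queue, int expectedDistinct) {
        Set<String> seenIds = new HashSet<>();
        List<String> processed = new ArrayList<>();
        while (seenIds.size() < expectedDistinct && !queue.isEmpty()) {
            Msg message = queue.poll();
            // Set.add returns false for an ID seen before, so duplicates are skipped
            if (seenIds.add(message.id)) {
                processed.add(message.body);
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        Deque<Msg> queue = new ArrayDeque<>();
        queue.add(new Msg("1", "a.pdf"));
        queue.add(new Msg("2", "b.pdf"));
        queue.add(new Msg("1", "a.pdf")); // simulated duplicate delivery
        queue.add(new Msg("3", "c.pdf"));
        System.out.println(drain(queue, 3));
    }
}
```

With a real queue the emptiness check would be replaced by a timeout or a retry limit, since an empty receive does not prove that no messages remain.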

Scenario – part III (Results)

Scenario – part IV

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

import com.amazonaws.auth.PropertiesCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.PutObjectRequest;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.AmazonSimpleDBClient;
import com.amazonaws.services.simpledb.model.Attribute;
import com.amazonaws.services.simpledb.model.BatchPutAttributesRequest;
import com.amazonaws.services.simpledb.model.CreateDomainRequest;
import com.amazonaws.services.simpledb.model.DeleteAttributesRequest;
import com.amazonaws.services.simpledb.model.DeleteDomainRequest;
import com.amazonaws.services.simpledb.model.Item;
import com.amazonaws.services.simpledb.model.PutAttributesRequest;
import com.amazonaws.services.simpledb.model.ReplaceableAttribute;
import com.amazonaws.services.simpledb.model.ReplaceableItem;
import com.amazonaws.services.simpledb.model.SelectRequest;

AmazonS3 s3 = new AmazonS3Client(new PropertiesCredentials(ExecuteJobs.class.getResourceAsStream("AwsCredentials.properties")));
AmazonSimpleDB sdb = new AmazonSimpleDBClient(new PropertiesCredentials(ExecuteJobs.class.getResourceAsStream("AwsCredentials.properties")));

// domainName in Amazon SimpleDB
String domainName = "IntroductionToCloudComputing";
String mainBucketName = "introduction.to.cloud.computing";
String vFolderWithParagrapfsName = null;
String pdfDir = "pdfTemp" + UUID.randomUUID();

sdb.createDomain(new CreateDomainRequest(domainName));

ObjectListing objectListing = s3.listObjects(new ListObjectsRequest().withBucketName(mainBucketName));
HadoopJob hj = new HadoopJob();
File tmpFile = null;
for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
    if (objectSummary.getSize() > 0) {
        // it is a file read from Amazon S3, not a folder
        // code is on the next slide
    } else {
        // zero-size object: the virtual folder marker; drop the trailing "/" from its key
        vFolderWithParagrapfsName = objectSummary.getKey().substring(0, objectSummary.getKey().length() - 1);
    }
}

if (objectSummary.getSize() > 0) { // it is a file read from Amazon S3, not a folder
    String fileName = objectSummary.getKey();
    String sURL = s3.generatePresignedUrl(mainBucketName, fileName, null).toString();
    // strip the "<virtual folder>/" prefix from the key
    fileName = fileName.substring(vFolderWithParagrapfsName.length() + 1);
    String dTmpFilePath = pdfDir + "/" + fileName;
    downloadFromUrl(sURL, dTmpFilePath); // the paragraphs are stored in dTmpFilePath
    // file name without the ".txt" extension
    String baseName = fileName.substring(0, fileName.indexOf(".txt"));

    // for each paragraph n (pseudocode; n, hadoopOutputFilePath and shorterStmp are set in code not shown on the slides)
    {
        hj.doMyJob(pdfDir + "/" + "temp.txt", pdfDir + "/output" + "/" + baseName + "/" + baseName + "_" + n);
        int numberOfWords = 10;
        MyArray array = getTopWords(hadoopOutputFilePath, numberOfWords);
        sdb.batchPutAttributes(new BatchPutAttributesRequest(domainName,
            createSampleData(fileName.substring(0, fileName.indexOf("_Paragraphs.txt")), shorterStmp, n, numberOfWords, array)));
    }
}
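The slides call getTopWords but do not show it. As a rough sketch of what such a step could look like, the code below assumes the Hadoop output consists of lines of the form word, tab, count (the layout TextOutputFormat produces for this job) and returns the n most frequent words, breaking ties alphabetically. The class name TopWordsSketch, the method topWords, and the tie-breaking rule are assumptions; the original MyArray type is not reproduced.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

final class TopWordsSketch {
    /** Parses "word<TAB>count" lines and returns the n entries with the highest counts. */
    static List<Map.Entry<String, Integer>> topWords(List<String> lines, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.split("\t");
            if (parts.length != 2) {
                continue; // skip lines that do not have exactly two tab-separated fields
            }
            entries.add(new AbstractMap.SimpleEntry<>(parts[0], Integer.parseInt(parts[1].trim())));
        }
        // Highest count first; equal counts are ordered alphabetically for a stable result
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the\t5", "cloud\t7", "aws\t5", "s3\t1");
        System.out.println(topWords(lines, 3));
    }
}
```

Sorting the full list is adequate here because a single paragraph yields only a small number of distinct words.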

Scenario – part IV (Mapper)

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HadoopMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emits (word, 1) for every whitespace-separated token of the input line
    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

Scenario – part IV (Reducer)

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class HadoopReducer<Key> extends Reducer<Key, IntWritable, Key, IntWritable> {
    private IntWritable result = new IntWritable();

    // Sums all counts emitted for the same key
    @Override
    public void reduce(Key key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
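To see what the mapper and reducer compute together without a Hadoop cluster, the two phases can be imitated in plain Java. This is only an illustration of the data flow (emit a 1 per token, group by word, sum per word), not how Hadoop executes a job; the class WordCountSketch and its method are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

final class WordCountSketch {
    /** Counts whitespace-separated tokens by imitating the map, shuffle and reduce steps. */
    static Map<String, Integer> wordCount(List<String> lines) {
        // "Map" and "shuffle": emit a 1 for every token and group the ones by word
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                grouped.computeIfAbsent(itr.nextToken(), k -> new ArrayList<>()).add(1);
            }
        }
        // "Reduce": sum the grouped values for each word
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("cloud web cloud", "web services")));
        // prints {cloud=2, services=1, web=2}
    }
}
```

As in the mapper above, tokens are taken as-is, so "Cloud", "cloud" and "cloud," count as three different words unless the text is normalized first.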

Scenario – part IV (Driver)

public static void initJob(Job job) {
    // The job name is set on the Job itself
    job.setJobName("wordcount");
    job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class);
    job.setMapperClass(HadoopMapper.class);
    job.setMapOutputKeyClass(org.apache.hadoop.io.Text.class);
    job.setMapOutputValueClass(org.apache.hadoop.io.IntWritable.class);
    job.setReducerClass(HadoopReducer.class);
    job.setOutputKeyClass(org.apache.hadoop.io.Text.class);
    job.setOutputValueClass(org.apache.hadoop.io.IntWritable.class);
    job.setOutputFormatClass(org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.class);
}

public void doMyJob(String inputFileName, String outputFolderName) throws Exception {
    Job job = new Job();
    initJob(job);

    /* Tell Task Tracker this is the main */
    job.setJarByClass(HadoopJob.class);

    /* This is an example of how to set input and output. */
    FileInputFormat.setInputPaths(job, inputFileName);
    Path p = new Path(outputFolderName);
    FileOutputFormat.setOutputPath(job, p);

    /* And finally, we submit the job: waitForCompletion submits it if needed and blocks until it finishes. */
    job.waitForCompletion(true);
}

Scenario – part V

private static List<ReplaceableItem> createSampleData(String fileName, String paragraphContent, int paragraphNumber, int numberOfWords, MyArray array) throws IOException {
    List<ReplaceableItem> sampleData = new ArrayList<ReplaceableItem>();
    // One item per paragraph; every top word becomes an attribute whose value is its count
    sampleData.add(new ReplaceableItem(fileName + "_Paragraf_" + paragraphNumber).withAttributes(
        new ReplaceableAttribute("Paper", fileName + ".pdf", true),
        new ReplaceableAttribute("Paragraph_Content", paragraphContent, true),
        new ReplaceableAttribute(array.getKey(0), String.valueOf(array.getNumberOfAppearances(0)), true),
        new ReplaceableAttribute(array.getKey(1), String.valueOf(array.getNumberOfAppearances(1)), true),
        new ReplaceableAttribute(array.getKey(2), String.valueOf(array.getNumberOfAppearances(2)), true),
        …. ));
    return sampleData;
}
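createSampleData stores each word count with String.valueOf. SimpleDB keeps every attribute value as a string and compares values lexicographically, so a count of "10" sorts before "2" in a range query. A common remedy is to zero-pad numbers to a fixed width before storing them. The sketch below shows the idea; the helper name pad and the width of 5 digits are assumptions, not something the original code does.

```java
import java.util.Locale;

final class PaddedCountSketch {
    /** Formats a non-negative count as a fixed-width, zero-padded string. */
    static String pad(int count, int width) {
        if (count < 0) {
            throw new IllegalArgumentException("count must be non-negative");
        }
        // Locale.ROOT keeps the digits ASCII regardless of the default locale
        return String.format(Locale.ROOT, "%0" + width + "d", count);
    }

    public static void main(String[] args) {
        // Unpadded strings compare in the wrong order: "10" sorts before "2"
        System.out.println("10".compareTo("2") < 0);
        // Padded strings compare like the numbers they represent
        System.out.println(pad(10, 5).compareTo(pad(2, 5)) > 0);
    }
}
```

Query values would need the same padding, for example comparing against pad(1, 5) rather than '1'.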

Scenario – part V (Final Results)

Web>’1’as=’2’

Query

Query results
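Because each paragraph item carries its top words as attribute names, a keyword query can be phrased as "items that have all of these attributes". The following sketch builds such a SimpleDB select expression as a string, quoting names with backticks and using "is not null" to test for presence. The class SelectExpressionSketch and its methods are hypothetical, and the matching rule (all keywords must be among a paragraph's top words) is an assumption that only approximates "best match".

```java
import java.util.Arrays;
import java.util.List;

final class SelectExpressionSketch {
    /** Quotes a domain or attribute name with backticks, doubling embedded backticks. */
    static String quoteName(String name) {
        return "`" + name.replace("`", "``") + "`";
    }

    /** Builds a select expression matching items that have every keyword as an attribute. */
    static String buildSelect(String domain, List<String> keywords) {
        StringBuilder sb = new StringBuilder("select * from ").append(quoteName(domain));
        for (int i = 0; i < keywords.size(); i++) {
            sb.append(i == 0 ? " where " : " and ");
            sb.append(quoteName(keywords.get(i))).append(" is not null");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildSelect("IntroductionToCloudComputing", Arrays.asList("Web", "as")));
    }
}
```

The resulting string would be passed to a SelectRequest; ranking the returned paragraphs, for example by the summed counts of the matched words, would have to happen on the client.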

Final Conclusions & Personal Opinion

Amazon Web Services (AWS) is a collection of remote computing services that together make up a cloud computing platform.

The importance and advantages of cloud computing are proven in everyday practice.

Amazon Simple Storage Service (S3)
– Folder structure within buckets is not fully supported;

Amazon Simple Queue Service (SQS)
– Better suited to systems that send a large number of messages;

Amazon SimpleDB
– Service limits (http://thecloudtutorial.com/amazonsimpledb.html).

Thank you!