
CLASSIFICATION OF PHISHING SCAM IN WEBSITE

USING VOWPAL WABBIT ALGORITHM

IZZATY SYAHIRA BINTI KAMARUDDIN

BACHELOR OF COMPUTER SCIENCE

(COMPUTER NETWORK SECURITY) WITH HONOURS

FACULTY INFORMATICS AND COMPUTING

UNIVERSITI SULTAN ZAINAL ABIDIN

August 2020


DECLARATION

I hereby declare that this report is based on my original work except for quotations and citations, which have been acknowledged. I also declare that it has not been previously or concurrently submitted for any other degree at Universiti Sultan Zainal Abidin or other institutions.

Signature :…………………….

Name :…………………….

Date :…………………….


APPROVAL

This is to confirm that the research conducted and the writing of this report were carried out under my supervision.

Signature :………………………………………………..

Supervisor : Sir Ahmad Faisal Amri bin Abidin @ Bharun

Date :………………………………………………..


DEDICATION

In the name of Allah SWT, the Most Gracious and the Most Merciful, all praise is only for Him.

I would like to express my deepest appreciation to all who gave me the courage and the opportunity to complete this report. Special gratitude goes to my supervisor, Sir Ahmad Faisal Amri, for guiding me through my final year project.

I take this opportunity to thank my parents and my family for their moral support and encouragement whenever I felt like giving up. I also give special thanks to all lecturers of the Faculty of Informatics and Computing for their attention, guidance and advice during my final year project. Sincere thanks to my friends for their help with this project.

May Allah S.W.T. bless all the effort that went into completing this final year project.

Thank you.


ABSTRACT

In today's cyber world, phishing is one of the major problems that lead to financial losses for both industries and individuals. With the growth of the internet, attackers can easily launch targeted phishing attacks without victims noticing that they have been deceived. Phishing is a kind of attack in which attackers use spoofed emails and fraudulent websites to trick people. A phishing website looks very similar to its corresponding legitimate website, deceiving users into believing that they are browsing the correct site. The attackers send malicious links or attachments through phishing emails that can perform various functions, including stealing the victim's login credentials or account information. These emails can harm victims through monetary loss and identity theft. The main goal of this paper is to investigate the potential of the Vowpal Wabbit algorithm in classifying phishing websites, in order to protect users from being hacked or deceived into giving up their personal access and information. Vowpal Wabbit is a fast, parallel machine learning framework that was developed for distributed computing, and it can help to prevent attackers from causing disruption. The project will be carried out by classifying data with the Weka analysis tool.


ABSTRAK

In this thoroughly modern world, phishing is one of the main problems leading to financial losses for both industries and individuals. With the growth of the internet today, attackers can easily launch targeted phishing attacks without victims realizing that they have been deceived. Phishing is a type of attack in which attackers use fake emails and fake websites to trick people without their knowledge. Phishing websites look very similar to the legitimate websites, so that users believe they are browsing the correct site. Attackers send malicious links or attachments through phishing emails that can perform various functions, including stealing the victim's login credentials or account information. These emails can harm victims through loss of money and identity theft. The main goal of this paper is to investigate the potential of the Vowpal Wabbit algorithm in classifying phishing websites, in order to protect users from being hacked or deceived into having their access and personal information stolen. The Vowpal Wabbit algorithm is a fast, parallel machine learning framework developed for distributed computing, and it can help prevent attackers from causing disruption. This project will also be carried out by classifying data from a computer in the Weka analysis tool.


TABLE OF CONTENTS

Title

Declaration

Approval

Dedication

Abstract

Abstrak

Table Of Contents

Diagram Lists

CHAPTER 1: INTRODUCTION

1.1 Background

1.2 Problem statement

1.3 Objective

1.4 Scope

1.5 Limitation of work


1.6 Thesis Organization

CHAPTER 2: LITERATURE REVIEW

2.1 Introduction

2.2 Phishing

2.2.1 Definition of Phishing

2.2.2 How Phishing Works?

2.2.3 Types of Phishing

2.3 Scam

2.3.1 Definition of Scam

2.3.2 Types of Scam

2.4 Vowpal Wabbit Algorithm

2.5 Email Filtering Techniques

2.6 Comparison between Methods

2.7 Summary

CHAPTER 3: METHODOLOGY

3.1 Introduction


3.2 Specification and System Requirements

3.2.1 Determine Requirements

3.2.2 Hardware

3.2.3 Software

3.3 Algorithm

3.3.1 Vowpal Wabbit

3.4 General Framework

3.5 Summary

REFERENCES


CHAPTER 1

INTRODUCTION

1.1 Background

In this modernized world, security plays an important role in technology, especially in electronic communication over the internet, where targeted phishing attacks can be launched. Phishing is a criminal technique employing both social engineering and technical subterfuge to steal consumers' personal identity data and financial account credentials [1]. Phishing is also one of the many types of fraud committed today. In criminal law, fraud is defined as a deliberate deception made for the sole aim of personal gain or for smearing an individual's image [2].

Phishing websites are fake web pages created by malicious people to imitate the web pages of real websites. The attacker in a phishing attack is known as a phisher. Phishers usually create web pages that are very similar to the real ones in order to scam victims into revealing their personal information [1]. Victims are tricked into clicking a malicious link, which can lead to the installation of malware or to a web page that looks like the legitimate site but is not. The malware may freeze the system and, without the victim noticing, automatically collect personal information such as passwords, bank account numbers, social security numbers and credit card details. Users are easily deceived by this scam, and phishers can then misuse their personal information without their permission. Even worse, phishing attacks may cost companies hundreds of thousands of dollars per attack in fraud-related losses and personnel time.

To protect against phishing scams, this project applies Vowpal Wabbit, a product of recent machine learning research. This research intends to use Vowpal Wabbit together with tokenization, that is, breaking the stream of text into words, symbols, phrases or other meaningful elements. Vowpal Wabbit is also a fast learner that is designed to handle terascale datasets. The classification process will be based on different characteristics such as spelling errors, poor grammar, long URLs, generic salutations and personalization.
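The tokenization step described above can be sketched as follows. This is a minimal illustration, not the project's actual pre-processing: the function names, the `text` and `url` namespaces, and the two URL features (`length`, `dots`) are assumptions chosen for the example. The output line follows Vowpal Wabbit's plain-text input format, in which a label is followed by `|namespace` groups of `feature` or `feature:value` entries.

```python
import re

def tokenize(text):
    """Break a stream of text into lowercase word tokens."""
    # Keeping only letters, digits and apostrophes also removes ':' and '|',
    # which have a special meaning in the Vowpal Wabbit input format.
    return re.findall(r"[a-z0-9']+", text.lower())

def to_vw_line(label, text, url):
    """Build one example line: '<label> |text tok tok ... |url name:value ...'.

    label: 1 for phishing, -1 for legitimate (a convention assumed here).
    """
    tokens = " ".join(tokenize(text))
    # Hypothetical URL features: total length and number of dots.
    url_features = f"length:{len(url)} dots:{url.count('.')}"
    return f"{label} |text {tokens} |url {url_features}"

if __name__ == "__main__":
    print(to_vw_line(1, "Dear customer, verify your account!", "http://a.b/c"))
```

Each message becomes one line, so a data set is simply a text file with one such line per message.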

1.2 Problem Statement

• Attackers can steal sensitive information and use it for dangerous purposes. This can happen when the user clicks a malicious link that immediately installs malware on the user's device.

• Attackers usually use official logos from real organizations and other identifying information taken directly from legitimate websites, together with a deceptive URL linking to a scam website.

• With regard to this matter, this research intends to leverage the Vowpal Wabbit algorithm to secure email against phishing scams.


1.3 Objective

• To study the Vowpal Wabbit algorithm in order to secure websites against phishing scams.

• To modify the Vowpal Wabbit algorithm to suit Weka-based system settings.

• To test the data sets using the Vowpal Wabbit algorithm in Weka in order to detect phishing websites.

1.4 Scope

• Classify phishing scam messages and pre-process the content of the messages.

• The subject will be tested using a computer program, and data will be generated from the test.

• Classify the data using the Weka tool to obtain accurate results.


1.5 Limitation of Work

• The website system detects phishing scam messages only.

• The system will analyze the text of the message and any malicious link.

• The system focuses on a single language.

~ Only English-language text can be analyzed.

[Figure: block diagram with the labels "Data Sets", "Vowpal Wabbit algorithm", "Weka" and "Results"]


1.6 Thesis Organization

This report covers all the necessary information about the project. Chapter 1 introduces the project, including its objectives, scope and limitations. Chapter 2 mainly covers the previous research used as references for this project and its relation to this project.

Chapter 3 details the methodology. It describes the framework of the project and the software and hardware used to produce results.


CHAPTER 2

LITERATURE REVIEW

2.1 Introduction

This chapter discusses the literature on machine learning classifiers used in previous research. A literature review examines past and recent research in order to illustrate the research problem, the existing solutions and the importance of seeking a solution. A literature review is not mere information gathering. It shows an in-depth grasp of, and summarizes, prior research linked to the chosen topic. The review involves reading journals, books, articles and research papers, and then analyzing, summarizing and evaluating that reading based on its connection to the project. It serves as a guideline that establishes the credibility of the project.


2.2 Phishing

2.2.1 Definition of Phishing

Phishing is a criminal technique employing both social engineering and technical subterfuge to steal consumers' personal identity data and financial account credentials [1]. Phishing is a relatively new type of attack dating from the mid-1990s, and it soon became a major problem in online transactions. The word “phishing” appeared when Internet scammers were using email lures to “fish” for passwords and financial information from the sea of Internet users; “ph” is a common hacker replacement for “f”, which comes from the original form of hacking, “phreaking” on telephone switches [3]. The attacker in a phishing attack is known as a phisher. A phisher attempts to deceive online customers by sending an email that leads them to click through to a site falsely claiming to be an established legitimate enterprise, in an attempt to scam the user into surrendering private information that will be used for identity theft. Legitimate organizations would never request this information via email.


2.2.2 How Phishing Works?

The phishing process usually starts with a spoofed email that invites the user to log in to their account through a forged web page that closely resembles an official website, such as that of a bank or an e-shop. Spoofed emails often look like valid emails because the phishers use the same logos and graphics as the original website. In addition, the scam emails contain deceptive URLs linking to a scam website [4]. The phisher obtains the information as soon as the victim enters a username, password or credit card number. Users should therefore not forward unauthenticated emails, click on unusual links in emails, or use search engines to look for online donations and charitable organizations [5].
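One common form of deceptive URL is a link whose visible text shows one host name while the underlying `href` points to another. The sketch below flags such links in an HTML email body using only the Python standard library. It is an illustrative heuristic under stated assumptions, not a method taken from the cited works; the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class LinkCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def deceptive_links(html):
    """Return (text, href) pairs whose visible host differs from the real one."""
    parser = LinkCollector()
    parser.feed(html)
    flagged = []
    for href, text in parser.links:
        # Only consider link text that looks like a bare URL or host name.
        if not href or not text or " " in text:
            continue
        shown = urlparse(text if "://" in text else "http://" + text).hostname
        actual = urlparse(href).hostname
        if shown and "." in shown and actual and shown != actual:
            flagged.append((text, href))
    return flagged
```

A link such as `<a href="http://evil.example/login">www.mybank.example</a>` would be flagged, while a link labelled "Login" would be skipped because its text does not look like a host name.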

2.2.3 Types of Phishing

The categories of phishing are as follows [3]:

• Clone Phishing

Clone phishing creates a cloned email. The attacker does this by taking information such as content and recipient addresses from a legitimate email that was delivered previously, then sending the same email with the links replaced by malicious ones. The attacker also employs address spoofing so that the email appears to come from the original sender. The email can claim to be a re-send of the original or an updated version as a trapping strategy.


• Spear Phishing

Spear phishing targets a specific group. Instead of sending out thousands of emails at random, spear phishers target selected groups of people with something in common, for example people from the same organization. Spear phishing is also used against high-level targets, in a type of attack called “whaling”.

• Phone Phishing

Phone phishing refers to messages that claim to be from a bank and ask users to dial a phone number regarding problems with their bank accounts. Traditional phone equipment has dedicated lines, so Voice over IP (VoIP), being easy to manipulate, becomes a good choice for the phisher. Once the phone number, which is owned by the phisher and provided by a VoIP service, is dialed, voice prompts tell the caller to enter account numbers and PIN. Caller ID spoofing, which is not prohibited by law, can be used along with this so that the call appears to be from a trusted caller.

• Domain Spoofing

A domain spoofing attack uses either email or fraudulent websites. It occurs when a cybercriminal “spoofs” an organization's or company's domain to make emails look like they are coming from the official domain, or makes a fake website look like the real one by copying the real site's design and using a similar URL.


• Watering Hole Phishing

The name is reminiscent of a scene from the animal kingdom. Attackers target a business by identifying specific websites that the company or its employees visit most often and infecting one of those sites with malware.

• Evil Twin

This is a form of phishing that usually happens over Wi-Fi. It has also been referred to as the Starbucks scam because it often takes place in coffee shops. Attackers use a service set identifier (SSID) that looks the same as that of the legitimate network.
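The "similar URL" used in domain spoofing, described in the list above, can be illustrated with a simple string-similarity check. The sketch below is an assumption-laden example rather than a technique from the cited sources: the trusted list, the threshold of 0.85 and the function name are all hypothetical, and `difflib.SequenceMatcher` from the Python standard library stands in for whatever similarity measure a real system would use.

```python
from difflib import SequenceMatcher

# Hypothetical list of domains the user actually trusts.
TRUSTED = ["mybank.example", "shop.example"]

def lookalike_of(domain, trusted=TRUSTED, threshold=0.85):
    """Return the trusted domain that `domain` imitates, or None.

    A domain is treated as a look-alike when it is not itself trusted
    but is very similar to a trusted one.
    """
    domain = domain.lower()
    if domain in trusted:
        return None  # an exact match is the legitimate domain itself
    best = max(trusted, key=lambda t: SequenceMatcher(None, domain, t).ratio())
    score = SequenceMatcher(None, domain, best).ratio()
    return best if score >= threshold else None

if __name__ == "__main__":
    # "rn" imitates "m" in many fonts.
    print(lookalike_of("rnybank.example"))
```

A fixed threshold is a crude choice; it trades missed look-alikes against false alarms and would need tuning on real data.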

2.3 Scam

2.3.1 Definition of Scam

According to a group called Computer Hope, a scam is a term used to describe any fraudulent business or scheme that takes money or other goods from an unsuspecting person. With the world becoming more connected through the Internet, online scams have increased, and it is often up to users to stay cautious with people on the Internet.

2.3.2 Types of Scam

Below are the categories of scam actions :


• Online Survey Scams

➢ These are sites that claim to offer a large amount of money or gift vouchers to participants for answering questions. The main goal of an online survey scam is to collect demographic information, which the site can sell to scammers, spammers or other marketers.

• 419 Scam

➢ This scam is called the 419 or Nigerian scam. It is named after the section of the penal code under which it is prosecuted in Nigeria. Victims are promised a large amount of money, and the scam supposedly requires only their bank information so that the money can be deposited into the victim's account. This bank information is then used against the person, or any deposits are kept with no reward.

• Catfish

➢ A catfish is a person who creates a fake online profile with the intention of deceiving someone else. For example, a woman could create a fake profile on an online dating website. She can start a relationship with one or more people and make up a story in an attempt to get money from them.


• Auction Fraud

➢ An unknown seller offers something on an online auction site such as eBay that is not what it appears to be. For example, a seller offers tickets for an upcoming football match that are not official tickets.

• Donation Scam

➢ A person may claim that they, their child or someone they know has an illness and urgently needs financial assistance. Although such appeals can be real, there is also an alarming number of people who create fake accounts on donation sites in the hope of scamming people out of money.

• Cold Call Scam

➢ A person claims to be from the technical support team of a computer company such as HP, saying that they have received information that your computer is infected with a virus or has already been hacked. They then offer to connect to your computer remotely and fix the problem. It is a tactic used by scammers to con you out of money.

• Chain Email

➢ An unsolicited email containing false information for the purpose of scaring, intimidating or deceiving the recipient. The aim is to coerce the recipient into forwarding the malicious or spurious message to other unwilling recipients. It keeps spreading until someone notices that it is actually a chain email.


• Phishing

➢ Phishing is a criminal technique employing both social engineering and technical subterfuge to steal consumers' personal identity data and financial account credentials. An example is receiving an email from someone pretending to be your bank, indicating that you are overdrawn or have made a purchase you did not make, and asking you to log in and verify the information.

2.4 Vowpal Wabbit Algorithm

Vowpal Wabbit is a learning system sponsored by Microsoft Research and, previously, Yahoo! Research. The goal is to develop a single machine learning algorithm that is inherently fast and capable of running both on standalone machines and in parallel processing environments. It is also capable of handling datasets at the scale of terabytes (big data open source platform).

Its name has three references:

• The vorpal blade of Jabberwocky

• The rabbit of Monty Python

• Elmer Fudd, who hunted the wascally wabbit

It is a hybrid of stochastic gradient descent and batch learning. Briefly, in stochastic gradient descent the input is read in sequential order and used to update the predictors for future data at each step, while batch learning techniques calculate predictors by learning on the entire data set at once [6].
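The contrast between the two update styles can be sketched with a tiny logistic-regression example in plain Python. This is an illustration of the general idea only and does not reproduce Vowpal Wabbit's internal implementation; the two-feature data set (a bias term and a hypothetical "has long URL" indicator), the learning rate and the number of passes are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Probability that example x is phishing under weights w."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def sgd_pass(data, w, lr=0.1):
    """Stochastic gradient descent: read examples in order, update after each one."""
    for x, y in data:
        err = y - predict(w, x)
        w = [wi + lr * err * xi for wi, xi in zip(w, x)]
    return w

def batch_step(data, w, lr=0.1):
    """Batch learning: accumulate the gradient over the entire data set, then update once."""
    grad = [0.0] * len(w)
    for x, y in data:
        err = y - predict(w, x)
        grad = [g + err * xi for g, xi in zip(grad, x)]
    return [wi + lr * g / len(data) for wi, g in zip(w, grad)]

# Features: [bias, has_long_url]; label 1 = phishing, 0 = legitimate (toy data).
DATA = [([1.0, 1.0], 1), ([1.0, 0.0], 0)]

w_online = [0.0, 0.0]
w_batch = [0.0, 0.0]
for _ in range(200):
    w_online = sgd_pass(DATA, w_online)
    w_batch = batch_step(DATA, w_batch)
```

The online version never needs more than one example in memory at a time, which is what makes it suitable for very large data sets.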

Vowpal Wabbit runs mainly as a library or a standalone daemon service, but it is also ready to be deployed in cloud environments [7]. Four main features are used in combination to get better results:

• Input format of the data

➢ An example can consist of free-form text, which is interpreted in a bag-of-words way, and there can be multiple sets of free-form text in different namespaces.

• Speed of learning

➢ It can be effectively applied to learning problems with a sparse terafeature, that is, on the order of 10^12 sparse features.

• Scalability of the data sets analyzed

➢ The memory footprint of the program is bounded independently of the data.

➢ This means the training set is not loaded into main memory before learning starts.

• Feature pairing

➢ Subsets of features can be internally paired so that the algorithm is linear in the cross-product of the subsets.
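Feature pairing, the last item in the list above, can be illustrated by crossing two namespaces by hand. The sketch below only shows what the cross-product looks like; it is not how Vowpal Wabbit implements it internally (there the pairing is requested with the `-q` option rather than built by the user). The namespace contents and the `^` separator in the feature names are assumptions for the example.

```python
from itertools import product

def pair_features(ns_a, ns_b):
    """Cross two namespaces: each (a, b) pair becomes one new feature.

    ns_a, ns_b: dicts mapping feature name -> value.
    The value of a paired feature is the product of the two values.
    """
    return {
        f"{a}^{b}": va * vb
        for (a, va), (b, vb) in product(ns_a.items(), ns_b.items())
    }

if __name__ == "__main__":
    text_ns = {"verify": 1.0, "account": 1.0}   # bag-of-words tokens
    url_ns = {"long_url": 1.0}                  # hypothetical URL indicator
    print(pair_features(text_ns, url_ns))
```

A linear model over the paired features can then learn, for instance, that the word "verify" is more suspicious when it appears together with a long URL.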

2.5 Email Filtering Techniques

An email filter is a type of program that filters and separates email into different folders based on specified criteria [8]. Today, people waste a significant amount of time dealing with spam and scams in email. These messages also give rise to problems such as leaks of personal information, malware infection and one-click fraud. The design goals for spam and scam filtering techniques are given below:

• Accuracy of Decision

➢ The technique should give accurate results in time, minimizing the misclassification of non-spam URLs.

• Classification should be context independent

➢ Classification should be usable across different web services.

• Results in Real-Time

➢ Some services, such as social networking, work in real time, so spam and scam filtering must be done with only a small delay.

• Fine-grained Classification

➢ The system should easily recognize the difference between spam hosted on public services and non-spam content.
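The first design goal above, accurate decisions with few legitimate items misclassified, is usually measured with accuracy and the false positive rate. The following sketch shows how these could be computed from true and predicted labels; the label convention (1 = spam or scam, 0 = legitimate) and the function names are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, with 1 = spam/scam."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def accuracy(y_true, y_pred):
    """Fraction of all items classified correctly."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)

def false_positive_rate(y_true, y_pred):
    """Fraction of legitimate items wrongly flagged as spam/scam."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return fp / (fp + tn) if (fp + tn) else 0.0
```

A filter with high accuracy can still be unusable if its false positive rate is high, because legitimate messages would be lost.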

2.6 Comparison between Methods

No Title Author / Year Method Description

1. Detection of Phishing Emails using Data Mining Algorithms

Author / Year: Smadi, S., Aslam, N., Zhang, L., Alasem, R., & Hossain, M. A. (2015, December)

Method: Data Mining Algorithm; J48 Classification Algorithm

Description:

~ Enhances the overall metric values of email classification by focusing on the preprocessing phase and determining the best algorithm.

~ The extracted set of features is classified using the J48 classification algorithm.

~ Achieved 98.87% accuracy for the random forest algorithm, the highest registered.

2. Detection of Phishing Emails using Feed Forward Neural Network

Author / Year: Jameel, N. G. M., & George, L. E. (2013)

Method: Feed Forward Neural Network

Description:

~ The phishing detection model uses features extracted from the header and HTML body of an email, together with a feed forward neural network, to detect phishing emails.

~ Uses two phases, training and testing.

~ Consists of three stages, namely pre-processing, neural network training and application of phishing detection.

~ The results of the conducted tests indicated a good identification rate (98.72%) with a short required processing time (0.00067 msec).

3. Classifying Phishing Emails using Confidence-Weighted Linear Classifiers

Author / Year: Basnet, R. B., & Sung, A. H. (2010)

Method: Confidence-Weighted Linear Classifiers (CWLC)

Description:

~ Uses the contents of the emails as features, without applying any heuristic-based, phishing-specific features, and obtains highly accurate results.

~ CWLC is a new class of online learning method designed for Natural Language Processing (NLP) problems, based on the notion of parameter confidence.

~ Achieved a best F-measure of 99.83%.

4. Classification of Phishing Email using Random Forest Machine Learning Technique

Author / Year: Akinyelu, A. A., & Adewumi, A. O. (2014)

Method: Random Forest Machine Learning Technique

Description:

~ Aims at an improved phishing email classifier with better prediction accuracy and fewer features.

~ The method is an ensemble learning method for classification and regression.

~ Achieved a classification accuracy of 99.7% and low false negative (FN) and false positive (FP) rates.

5. Detecting Phishing Emails using Hybrid Features

Author / Year: Ma, L., Ofoghi, B., Watters, P., & Brown, S. (2009, July)

Method: Hybrid Features; Robust Classifiers

Description:

~ Builds a robust classifier to detect phishing emails using hybrid features, and selects features using information gain.

~ Also analyses the quality of each feature using information gain; the best feature set is selected after a recursive learning process.

~ Three types of features are defined manually based on observation of emails: content features, orthographic features and derived features.

~ Extracts feature vectors from the emails that effectively represent the instances used to detect phishing emails.

~ The decision tree produced the highest accuracy and thus builds a better classifier.

6. Detecting Phishing Emails using Text and Data Mining

Author / Year: Pandey, M., & Ravi, V. (2012, December)

Method: Text and Data Mining

Description:

~ Analyzed phishing emails after extracting 23 keywords from the email bodies using text mining.

~ Using 23 features, GP yields the best result, with 98.12 and 97.29 as accuracy and sensitivity respectively.


7. Collaborative Email-Spam Filtering with the Hashing Trick

Author / Year: Attenberg, J., Weinberger, K., Dasgupta, A., Smola, A., & Zinkevich, M. (2009, July)

Method: Hashing Trick

Description:

~ The technique can be used with a variety of classifiers and can be implemented in a few lines of code for collaborative spam filtering.

~ The method is used to scale up linear learning algorithms.

~ Also used the Vowpal Wabbit (VW) implementation of stochastic gradient descent on a square loss.

~ The result is more robust against noise and absorbs individual preferences in the context of spam classification.

8. Anti-Phishing Detection of Phishing Attacks using Genetic Algorithm

Author / Year: Shreeram, V., Suban, M., Shanthi, P., & Manjula, K. (2010, October)

Method: Genetic Algorithm

Description:

~ Detects phishing by using a rule-based system.

~ The algorithm is used to evolve rules that differentiate legitimate links from phishing links.

~ It can achieve minimal false negatives at a speed adequate for online applications.


9. Learn To Detect Phishing Scams using Learning and Ensemble Methods

Author / Year: Saberi, A., Vahidi, M., & Bidgoli, B. M. (2007, November)

Method: Learning and Ensemble Methods

Description:

~ Used three different learning methods to detect phishing scams.

~ Applied the ensemble method to the outputs of the different classifiers to increase accuracy over the individual filter results.

~ It detects 94.4% of scam emails while flagging only 0.08% of legitimate emails.

10. Detecting Phishing Websites using Associative Classification

Author / Year: Ajlouni, M. I. A., Hadi, W. E., & Alwedyan, J. (2013)

Method: Associative Classification

Description:

~ Explores the potential use of automated data mining techniques for detecting phishing websites.

~ Used two different associative classification algorithms, MCAR and CBA.

~ MCAR achieved averages of 6.8%, 6.1% and 5.4%, reported as the highest accuracy, while the CBA algorithm outperformed the SVM and NB algorithms.
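The hashing trick listed in entry 7 of the comparison above maps tokens straight to positions in a fixed-size vector, so no dictionary of known words has to be stored. The sketch below is a minimal illustration under stated assumptions: `zlib.crc32` is used only as a convenient stand-in hash and is not the hash function used by Vowpal Wabbit or by the cited paper, and the function name and the default of 18 bits are choices made for the example.

```python
import zlib

def hash_features(tokens, num_bits=18):
    """Map tokens to a sparse count vector with 2**num_bits slots.

    Returns a dict {index: count}. Different tokens may collide on the
    same index; the method accepts this in exchange for bounded memory.
    """
    size = 1 << num_bits
    vec = {}
    for tok in tokens:
        idx = zlib.crc32(tok.encode("utf-8")) % size
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec

if __name__ == "__main__":
    print(hash_features(["verify", "account", "verify"], num_bits=10))
```

Because the vector size is fixed in advance, the memory needed by a linear learner does not grow with the vocabulary, which is why the technique helps scale up linear learning algorithms.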


2.7 Summary

Based on the literature review, various methods can be applied to detect phishing. The literature review provides details of the related studies. The methods include the J48 Classification Algorithm, Feed Forward Neural Network, Confidence-Weighted Linear Classifiers, Random Forest Machine Learning Technique, Hybrid Features, Text and Data Mining, the Hashing Trick, Genetic Algorithm, Learning and Ensemble Methods, and Associative Classification. For this project, however, we propose the Vowpal Wabbit algorithm.

CHAPTER 3

METHODOLOGY

3.1 Introduction

A methodology is a systematic way of solving the research problem in order to achieve the objectives. This chapter explains the methodology used to develop this project. The methodology acts as a guide that keeps the project on the right path so that it is completed and works as planned. Different types of methodology suit different types of application, so it is important to understand the functionality of the application and to choose a methodology that is compatible with it. A methodology can be applied through a technique, an algorithm or a method. It comprises the theoretical analysis of the methods and principles associated with a branch of knowledge, and it also defines the rules, principles or procedures used to develop a system or project.

3.2 Specification and System Requirements

System requirements are needed to accomplish this project and to assist its development. They cover both hardware and software. These requirements are related to each other and together ensure that the system can be developed smoothly.

3.2.1 Determine Requirements

In this stage, we collect information about the project from previous research. We then analyze that research in terms of the security issues, the problem statements and the methods used. For this project, we analyzed research on phishing scam filtering and the algorithms it applied, in order to decide which algorithm to apply here. To overcome the problem stated in 1.2, this methodology is built around the three main objectives stated in 1.3: first, to study the Vowpal Wabbit algorithm in order to classify phishing scam websites; second, to modify the Vowpal Wabbit algorithm to suit Weka-based system settings; and lastly, to test the data sets by using the Vowpal Wabbit algorithm in Weka.
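
As an illustration of the second objective, the sketch below shows one possible way to turn a record of numeric attributes into a line of the Vowpal Wabbit text input format (`label |namespace feature:value ...`). It is only a sketch under stated assumptions: the function name `to_vw_line`, the namespace `url` and the attribute names are hypothetical and are not taken from the data set or code used in this project.

```python
def to_vw_line(features: dict, is_phishing: bool, namespace: str = "url") -> str:
    """Format one record as a line of Vowpal Wabbit text input.

    Labels are 1 (phishing) and -1 (legitimate), the convention used for
    binary classification with logistic loss.
    """
    label = 1 if is_phishing else -1
    parts = []
    for name, value in sorted(features.items()):
        # Spaces, ':' and '|' are reserved characters in the text format,
        # so they are replaced in feature names.
        safe_name = name.replace(" ", "_").replace(":", "_").replace("|", "_")
        parts.append(f"{safe_name}:{value:g}")
    return f"{label} |{namespace} " + " ".join(parts)


# Hypothetical attributes of one website record.
example = {"has_ip_address": 1, "url_length": 54}
print(to_vw_line(example, is_phishing=True))
# prints: 1 |url has_ip_address:1 url_length:54
```

Sorting the attribute names keeps the output stable between runs, which makes the converted files easier to compare.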

3.2.2 Hardware

A high-performance processor and a large amount of RAM, ideally in a high-end device, are recommended. This is because machine learning requires a fast processor to train the model when a large amount of data is involved.

3.2.3 Software

No.  Software                     Description
1.   Google Chrome                Used to search for related articles and methods for the project.
2.   Microsoft Word 2016          Used for word processing, such as creating and editing the report and documentation.
3.   Microsoft PowerPoint 2016    Used to present the results and for the project presentation.
4.   Snipping Tool                Used to capture screenshots.
5.   WEKA                         Used for classification and the main development phase of the project.
6.   WinZip                       Used to extract the data.
7.   PyCharm                      Used to modify the code.

3.3 Algorithm

This section discusses the algorithm used to carry out the project and explains why it was chosen. The methodology serves as a guideline to ensure that the project runs smoothly and according to plan. It is very important to choose the most suitable algorithm so that the analysis is not affected by other factors. Moreover, it is important to ensure that the algorithm is able to run on the device so that the study is not disrupted midway.

3.3.1 Vowpal Wabbit

Vowpal Wabbit is an open-source machine learning system that implements a range of learning algorithms. It can handle large datasets at terabyte scale. It can run on a single machine, and it typically learns a good predictor faster than most other models. Vowpal Wabbit has been used in a decision service for a personalized news recommendation system. Moreover, it is an open, interactive machine learning solution for reinforcement learning, supervised learning and other machine learning paradigms. Vowpal Wabbit supports solutions to a range of real-world problems through reductions to standard learning algorithms. This versatility makes it possible to frame learning problems effectively and to reach a good solution.
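
To make the idea of a fast online learner more concrete, the following is a minimal sketch of online logistic regression over hashed features, the style of linear learning commonly associated with Vowpal Wabbit. It is not Vowpal Wabbit's implementation: the class `OnlineLogisticLearner`, the CRC32-based hash, the table size of 18 bits, the learning rate and the feature names are all assumptions chosen for illustration.

```python
import math
import zlib


def hash_feature(name: str, num_bits: int) -> int:
    """Map a feature name to a slot in a fixed-size weight table."""
    return zlib.crc32(name.encode("utf-8")) & ((1 << num_bits) - 1)


class OnlineLogisticLearner:
    """Binary linear classifier updated one example at a time."""

    def __init__(self, num_bits: int = 18, learning_rate: float = 0.5):
        self.num_bits = num_bits
        self.learning_rate = learning_rate
        self.weights = [0.0] * (1 << num_bits)

    def score(self, features: dict) -> float:
        """Weighted sum of the feature values (positive means phishing)."""
        return sum(
            self.weights[hash_feature(name, self.num_bits)] * value
            for name, value in features.items()
        )

    def predict(self, features: dict) -> int:
        """Return 1 (phishing) or -1 (legitimate)."""
        return 1 if self.score(features) > 0 else -1

    def learn(self, features: dict, label: int) -> None:
        """Take one gradient step on the logistic loss; label is 1 or -1."""
        # The margin is clamped so that math.exp cannot overflow.
        margin = max(-30.0, min(30.0, label * self.score(features)))
        gradient = -label / (1.0 + math.exp(margin))
        for name, value in features.items():
            index = hash_feature(name, self.num_bits)
            self.weights[index] -= self.learning_rate * gradient * value


# Toy stream of (features, label) pairs; the feature names are invented.
stream = [
    ({"has_ip_address": 1.0, "long_url": 1.0}, 1),
    ({"uses_https": 1.0, "known_domain": 1.0}, -1),
] * 20

learner = OnlineLogisticLearner()
for features, label in stream:
    learner.learn(features, label)

print(learner.predict({"has_ip_address": 1.0, "long_url": 1.0}))
```

Because each example is processed once and then discarded, memory use depends on the size of the weight table rather than on the size of the data set, which is what allows this style of learner to scale to very large inputs.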

Figure: An example of a reduction stack.

3.4 General Framework

Users are exposed to phishing when they visit unknown websites. A scammer is a threat actor who sends scam advertisements to encourage users to give out private information such as usernames, passwords and banking details.

Figure 3.1: A framework of how the data are processed.

1. Install PyCharm and Weka in Windows.

2. Modify the Vowpal Wabbit code in PyCharm.

3. Split the data into train and test datasets in Weka.

4. Classify as SCAM or NON-SCAM.

5. Get the accuracy of identifying scam or non-scam.
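
The split and accuracy steps of the framework can be sketched as follows. In this project the split is performed in Weka, so the Python version below is only an assumed stand-in for illustration: the helper names `split_train_test` and `accuracy`, the 30% test fraction, the random seed and the toy records are all hypothetical.

```python
import random


def split_train_test(records: list, test_fraction: float = 0.3, seed: int = 42):
    """Shuffle a copy of the records and split it into train and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    test_size = int(round(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]


def accuracy(true_labels: list, predicted_labels: list) -> float:
    """Fraction of predictions that match the true labels."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Ten invented records, alternating between the two classes.
records = [(f"site_{i}", "SCAM" if i % 2 else "NON-SCAM") for i in range(10)]
train, test = split_train_test(records)
print(len(train), len(test))  # prints: 7 3

# Accuracy for four invented predictions, three of which are correct.
print(accuracy(["SCAM", "NON-SCAM", "SCAM", "SCAM"],
               ["SCAM", "NON-SCAM", "NON-SCAM", "SCAM"]))  # prints: 0.75
```

Fixing the seed makes the split reproducible, so that accuracy figures from different runs refer to the same test records.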

3.5 Summary

Methodology is very important in system and application development. Many different software development methodologies are available and can be used to develop any kind of application. The right methodology helps the project to be completed within the specified time. The activities in each phase of the methodology are explained so that they can be understood easily.

References

[1] Ajlouni, M. I. A., Hadi, W. E., & Alwedyan, J. (2013). Detecting phishing websites using

associative classification. image, 5(23), 36-40.

[2] Akinyelu, A. A., & Adewumi, A. O. (2014). Classification of phishing email using random

forest machine learning technique. Journal of Applied Mathematics, 2014.

[3] Nivedha, S., Gokulan, S., Karthik, C., Gopinath, R., & Gowshik, R. (2017). Improving

Phishing URL Detection Using Fuzzy Association Mining. The International Journal of

Engineering and Science (IJES), 6.

[4] Salem, O., Hossain, A., & Kamala, M. (2010, June). Awareness program and ai based tool to

reduce risk of phishing attacks. In 2010 10th IEEE International Conference on Computer and

Information Technology (pp. 1418-1423). IEEE.

[5] Chen, T. S., Jeng, F. G., & Liu, Y. C. (2006, December). Hacking tricks toward security on

network environments. In 2006 Seventh International Conference on Parallel and Distributed

Computing, Applications and Technologies (PDCAT'06) (pp. 442-447). IEEE.

[6] Agarwal, A., Chapelle, O., Dudík, M., & Langford, J. (2014). A reliable effective terascale

linear learning system. The Journal of Machine Learning Research, 15(1), 1111-1133.

[7] de Almeida, P. D. C., & Bernardino, J. (2015, June). Big data open source platforms. In 2015

IEEE International Congress on Big Data (pp. 268-275). IEEE.

[8] Revar, P., Shah, A., Patel, J., & Khanpara, P. (2017). A Review on Different types of Spam

Filtering Techniques. International Journal of Advanced Research in Computer Science, 8(5).

[9] Smadi, S., Aslam, N., Zhang, L., Alasem, R., & Hossain, M. A. (2015, December). Detection

of phishing emails using data mining algorithms. In 2015 9th International Conference on

Software, Knowledge, Information Management and Applications (SKIMA) (pp. 1-8). IEEE.

[10] Jameel, N. G. M., & George, L. E. (2013). Detection of phishing emails using feed forward

neural network. International Journal of Computer Applications, 77(7).

[11] Basnet, R. B., & Sung, A. H. (2010). Classifying phishing emails using confidence-weighted

linear classifiers. In International Conference on Information Security and Artificial Intelligence

(ISAI) (pp. 108-112).

[12] Ma, L., Ofoghi, B., Watters, P., & Brown, S. (2009, July). Detecting phishing emails using

hybrid features. In 2009 Symposia and Workshops on Ubiquitous, Autonomic and Trusted

Computing (pp. 493-497). IEEE.

[13] Pandey, M., & Ravi, V. (2012, December). Detecting phishing e-mails using text and data

mining. In 2012 IEEE International Conference on Computational Intelligence and Computing

Research (pp. 1-6). IEEE.

[13] Attenberg, J., Weinberger, K., Dasgupta, A., Smola, A., & Zinkevich, M. (2009, July).

Collaborative email-spam filtering with the hashing trick. In Proceedings of the Sixth Conference

on Email and Anti-Spam.

[14] Shreeram, V., Suban, M., Shanthi, P., & Manjula, K. (2010, October). Anti-phishing detection

of phishing attacks using genetic algorithm. In 2010 International Conference on Communication

Control and Computing Technologies (pp. 447-450). IEEE.

[15] Saberi, A., Vahidi, M., & Bidgoli, B. M. (2007, November). Learn to detect phishing scams

using learning and ensemble methods. In Proceedings of the 2007 IEEE/WIC/ACM International

Conferences on Web Intelligence and Intelligent Agent Technology-Workshops (pp. 311-314).

IEEE Computer Society.