text dan web mining
TRANSCRIPT
-
8/10/2019 Text Dan Web Mining
1/45
5/12/2014
1
Text & Web MiningData Mining Ilmu Komputer IPB
Kuliah 12
Data terstruktur
Sejauh ini kita berurusan dengan data terstruktur,
Umumnya data mining menggunakan data semacam ini
Attribute Value
Attribute Value
Attribute Value
Attribute Value
Outlook Sunny
Temperature Hot
WindyYes
Humidity High
PlayYes
-
8/10/2019 Text Dan Web Mining
2/45
5/12/2014
2
Complex Data Types
Berkembangnya data complex
Spatial data: geographic data, medical &satellite images
Multimedia data: images, audio, & video Time-series data: banking data & stockexchange data
Text data: word descriptions for objects World-Wide-Web: highly unstructured text &multimedia data
5/12/2014
Basisdata Teks
Dalam prakteknya terdapat banyak basis data teks: artikel berita paper riset buku
perpustakaan digital e-mail halaman web
Berkembang dengan cepat baik dari segi jumlah maupunkepentingan (80%)
5/12/2014
-
8/10/2019 Text Dan Web Mining
3/45
5/12/2014
3
Text Mining Text mining merujuk pada data mining yang
menggunakan dokumen teks sebagai data
Hampir semua tugas Text Mining menggunakan metodeInformation Retrieval (IR) untuk pra-proses dokumenteks.
Metode ini sedikit berbeda daripada metode pra-prosesdata yang digunakan dalam tabel relasional
Web search juga berakar pada IR
CS583, Bing Liu,
UIC
Definisi Text Mining
Discover useful and previously unknowngems of information in large text collections
-
8/10/2019 Text Dan Web Mining
4/45
5/12/2014
4
Definisi Text Mining
Text Mining=
Data Mining (applied to text data)+
basic linguistics
Text Mining is understood as a process of automaticallyextracting meaningful, useful, previously unknown andultimately comprehensible information from textualdocument repositories.
Definisi
yang tidak diketahui sebelumnya ? Definisi ketat
Informasi yang bahkan penulisnya tidak mengetahui
Contoh: menemukan metode baru untuk pertumbuhan rambutyang merupakan efek samping dari suatu prosedur
Definisi longgar
Menemukan kembali informasi yang telah ditulis pengarangdalam teksnya Contoh: secara otomatis mengekstrak nama produk dari sebuah
halaman web
-
8/10/2019 Text Dan Web Mining
5/45
-
8/10/2019 Text Dan Web Mining
6/45
5/12/2014
6
DM vs TMData Mining Text Mining
Object ofinvestigation
Numerical and categoricaldata
Texts
Object structure Relational databases Free form texts
GoalPredict outcomes of futuresituations
Retrieve relevant information,distill the meaning,categorize and target-deliver
MethodsMachine learning: SKAT,DT, NN, GA, MBR, MBA
Indexing, special neural networkprocessing, linguistics,
ontologiesCurrent marketsize
100,000 analysts at largeand midsize companies
100,000,000 corporate workersand individual users
MaturityBroad implementationsince 1994
Broad implementation starting2000
Search vs Discover
DataMining
TextMining
DataRetrieval
InformationRetrieval
Search(goal-oriented)
Discover(opportunistic)
StructuredData
UnstructuredData (Text)
-
8/10/2019 Text Dan Web Mining
7/45
5/12/2014
7
Aplikasi Text Mining Pemasaran: Menemukankelompok pembeli yangpotensial berdasarkan profilteks pengguna contoh. amazon
Industri: Mengidentifikasisitus web kelompok pesaing Produk pesaing dan harganya
Pencarian kerja:mengidentifikasi parameterdalam pencarian pekerjaan
www.flipdog.com
Aplikasi Text Mining Search engines
Enterprise portals
Knowledge management systems
e-Business systems
Vertical applications: e-mail categorization and routing
Call center notes categorization
CRM systems
-
8/10/2019 Text Dan Web Mining
8/45
5/12/2014
8
IndexingQueryOperations
Searching
Text Operations
UserInterface
Ranking
INDEX
TextDatabase
Search Subsystem
Inverted filesystem
queryparse query
stemming*
stemmedterms
stop list* non-stoplist
tokens
query tokens
Booleanoperations*
ranking*
relevantdocument set
rankeddocument set
retrieved documentset
*Indicatesoptionaloperation.
-
8/10/2019 Text Dan Web Mining
9/45
5/12/2014
9
Indexing SubsystemDocuments
break into tokens
stop list*
stemming*
term weighting*
Inverted filesystem
text
non-stoplisttokens
tokens
stemmedterms
terms withweights
*Indicates
optionaloperation.
assign document IDsdocuments
documentnumbersand *fieldnumbers
Text Mining
SampleDocuments
Transformed
Representationmodels
Learning Domain specifictemplates/models
Text document
VisualizationsLearning Working
-
8/10/2019 Text Dan Web Mining
10/45
5/12/2014
10
Text characteristics: Outline Large textual data base
High dimensionality
Several input modes
Dependency
Ambiguity
Noisy data
Not well structured text
Text characteristics
Large textual data base Efficiency consideration
over 2,000,000,000 web pages
almost all publications are also in electronic form High dimensionality (Sparse input)
Consider each word/phrase as a dimension
Several input modes e.g., Web mining: information about user is generated
by semantics, browse pattern and outsideknowledgebase.
-
8/10/2019 Text Dan Web Mining
11/45
5/12/2014
11
Text characteristics Dependency
relevant information is a complex conjunction ofwords/phrases e.g., Document categorization.
Pronoun disambiguation.
Ambiguity Word ambiguity
Pronouns (he, she )
buy, purchase
Semantic ambiguity The king saw the rabbit with his glasses. (8 meanings)
Text characteristics
Noisy data Example: Spelling mistakes
Not well structured text Chat rooms r u available ?
Hey whazzzzzz up
Speech
-
8/10/2019 Text Dan Web Mining
12/45
-
8/10/2019 Text Dan Web Mining
13/45
5/12/2014
13
Syntactic / Semantic text analysis Part of Speech (pos) tagging
Find the corresponding pos for each word
e.g., John (noun) gave (verb) the (det) ball (noun)
~98% accurate.
Word sense disambiguation Context based or proximity based
Very accurate
Parsing Generates a parse tree (graph) for each sentence
Each sentence is a stand alone graph
Feature Generation: Bag of words
Text document is represented by the words it contains(and their occurrences) e.g., Lord of the rings {the, Lord, rings, of}
Highly efficient
Makes learning far simpler and easier Order of words is not that important for certain applications
Stemming: identifies a word by its root e.g., flying, flew fly
Reduce dimensionality
Stop words: The most common words are unlikely to helptext mining e.g., the, a, an, you
-
8/10/2019 Text Dan Web Mining
14/45
5/12/2014
14
Feature Generation: D2K ExampleHi,Here is your weekly update (that unfortunately hasn't goneout in about a month). Not much action here right now.
1) Due to the unwavering insistence of a member of thegroup, the ncsa.d2k.modules.core.datatype package isnow completely independent of the d2k application.2) Transformations are now handled differently in Tables.Previously, transformations were done using aTransformationModule. That module could then be addedto a list that an ExampleTable kept. Now, there is aninterface called Transformation and a sub-interface calledReversibleTransformation.
hi, weekly update (that unfortunately gone out month).much action here right now. 1) due unwavering insistencemember group, ncsa.d2k.modules.core.datatype packagenow completely independent d2k application. 2)transformations now handled differently tables. previously,transformations done using transformationmodule. moduleadded list exampletable kept. now, interface calledtransformation sub-interface calledreversibletransformation.
hi week update unfortunate go out month much action hereright now 1 due unwaver insistence member group ncsa d2kmodules core datatype package now complete independenced2k application 2 transformation now handle different tableprevious transformation do use transformationmodule moduleadd list exampletable keep now interface call transformationsub-interface call reversibletransformation
Feature Generation: XML Current keyword-oriented search engines cannot handle rich
queries like Find all books authored by Scooby-Doo.
XML: Extensible Markup Language XML documents have a nested structure in which each
element is associated with a tag. Tags describe the semantics of elements.
The making of a bad movie Scooby-Doo Cartoons
-
8/10/2019 Text Dan Web Mining
15/45
5/12/2014
15
Feature selection Reduce dimensionality
Learners have difficulty addressing tasks with highdimensionality
Irrelevant features Not all features help!
e.g., the existence of a noun in a news article is unlikely to help
classify it as politics or sport
Feature selection: D2K Example I
hiweekupdateunfortunategooutmonthmuch
actionhererightnow1dueunwaverinsistencemembergroupncsad2kmodulesdo
coredatatypepackagecompleteindependenceapplication2transformation
handledifferenttableprevioususetransformationmoduleaddlistexampletablekeepinterfacecallsub-interfacereversibletransformation
hiweekupdateunfortunategooutmonthmuchactionhererightnowdueinsistencemembergroupncsad2kmodules
docoredatatypepackagecompleteindependenceapplicationtransformationhandledifferenttableprevioususeaddlistkeepinterfacecallsub-interface
-
8/10/2019 Text Dan Web Mining
16/45
5/12/2014
16
Feature selection: D2K Example IIhiweekupdateunfortunategooutmonthmuchactionhererightnow1dueunwaverinsistencemembergroupncsad2kmodulesdo
coredatatypepackagecompleteindependenceapplication2transformationhandledifferenttableprevioususetransformationmoduleaddlistexampletablekeepinterfacecallsub-interfacereversibletransformation
hiweekupdateunfortunategooutmonthmuchactionhererightnowdueinsistencemembergroupncsad2kmodules
docoredatatypepackagecompleteindependenceapplicationtransformationhandledifferenttableprevioususeaddlistkeepinterfacecallsub-interface
hiweekupdateunfortunatemonthactionrightdueinsistencemembergroupncsad2kmodulescore
datatypepackagecompleteindependenceapplicationtransformationhandledifferenttablepreviousaddlistinterfacecallsub-interface
Text Mining: Classification definition
Given: a collection of labeled records (training set) Each record contains a set of features (attributes), and the
true class (label)
Find: a model for the class as a function of thevalues of the features
Goal: previously unseen records should beassigned a class as accurately as possibleA test set is used to determine the accuracy of the model.
Usually, the given data set is divided into training and testsets, with training set used to build the model and test setused to validate it
-
8/10/2019 Text Dan Web Mining
17/45
5/12/2014
17
Similarity Measures:
Euclidean Distance if attributes are continuous Other Problem-specific Measures
e.g., how many words are common in these documents
Text Mining: Clustering definition Given: a set of documents and a similarity measure
among documents
Find: clusters such that: Documents in one cluster are more similar to one another
Documents in separate clusters are less similar to one another
Goal:
Finding a correct set of documents
ContohGREAT Camera., Jun 3, 2004Reviewer:jprice174 fromAtlanta, Ga.
I did a lot of research last yearbefore I bought this camera...It kinda hurt to leave behind
my beloved nikon 35mmSLR, but I was going to Italy,and I needed somethingsmaller, and digital.
The pictures coming out of thiscamera are amazing. The'auto' feature takes greatpictures most of the time. Andwith digital, you're notwasting film if the picturedoesn't come out.
Summary:
Feature1: picture
Positive: 12
The pictures coming out of this cameraare amazing.
Overall this is a good camera with areally good picture clarity.
Negative: 2
The pictures come out hazy if yourhands shake even for a moment duringthe entire process of taking a picture.
Focusing on a display rack about 20 feetaway in a brightly lit room during daytime, pictures produced by this camerawere blurry and in a shade of orange.
Feature2: battery life
CS583, Bing Liu,
UIC34
-
8/10/2019 Text Dan Web Mining
18/45
5/12/2014
18
Visual Comparison
CS583, Bing Liu,
UIC35
Summary ofreviews ofDigital camera 1
Picture Battery Size WeightZoom
Comparison ofreviews of
Digital camera 1Digital camera 2
+
_
_
+
Information ExtractionPosting from Newsgroup
Telecommunications. Solaris Systems
Administrator. 55-60K. Immediate need.
3P is a leading telecommunications firm
in need of a energetic individual to
fill the following position in the
Atlanta office:
SOLARIS SYSTEM ADMINISTRATOR
Salary: 50-60K with full benefits
Location: Atlanta, Georgia no relocation
assistance provided
FILLED TEMPLATE
job title: SOLARIS SYSTEM ADMINISTRATOR
salary: 55-60K
city: Atlanta
state: Georgia
platform: SOLARIS
area: Telecommunications
-
8/10/2019 Text Dan Web Mining
19/45
5/12/2014
19
Classification: An Example
Ex# Country MaritalStatus
IncomeHooligan
1 England Single 125K Yes
2 England Married Yes
3 England Single 70K Yes
4 Italy Married 40K No
5 USA Divorced 95K No
6 England Married 60K Yes
7 England 20K Yes
8 Italy Single 85K Yes
9 France Married 75K No
10 Denmark Single 50K No1 0 Training
SetModel
Learn
Classifier
Country MaritalStatus
IncomeHooligan
England Single 75K ?Turkey Married 50K ?
England Married 150K ?
Divorced 90K ?
Single 40K ?
Itlay Married 80K ?1 0
Test
Set
-
8/10/2019 Text Dan Web Mining
20/45
5/12/2014
20
Text Classification: An Example
Ex#Hooligan
1An English football fan
Yes
2During a game in Italy
Yes
3England has beenbeating France
Yes
4Italian football fans werecheering
No
5An average USAsalesman earns 75K
No
6The game in Londonwas horrific
Yes
7 Manchester city is likelyto win the championshipYes
8Rome is taking the leadin the football league
Yes1 0
Training
SetModel
Learn
Classifier
Test
Set
Hooligan
A Danish football fan ?
Turkey is playing vs. France.The Turkish fans
?1 0
-
8/10/2019 Text Dan Web Mining
21/45
5/12/2014
21
Web MiningData mining Ilmu Komputer IPB
Web Mining
Knowledge
WWW
-
8/10/2019 Text Dan Web Mining
22/45
5/12/2014
22
Example: Web data extraction
CS583, Bing Liu,
UIC43
Dataregion1
Dataregion2
A datarecord
A datarecord
Align and extract data items (e.g., region1)image1 EN7410
17-inchLCDMonitorBlack/Darkcharcoal
$299.99
AddtoCart
(Delivery /Pick-Up )
PennyShopping
Compare
image2 17-inch
LCDMonitor
$249.9
9
Add
toCart
(Delivery /
Pick-Up )
Penny
Shopping
Compare
image3 AL1714 17-inch LCDMonitor,Black
$269.99
AddtoCart
(Delivery /Pick-Up )
PennyShopping
Compare
image4 SyncMaster 712n 17-inch LCDMonitor,Black
Was:$369.99
$299.99
Save$70After:$70 mail-in-rebate(s)
AddtoCart
(Delivery /Pick-Up )
PennyShopping
Compare
CS583, Bing Liu, UIC
-
8/10/2019 Text Dan Web Mining
23/45
5/12/2014
23
Ads vs. search results
Reproduced from Ullman & Rajaraman with permission
Ads vs. search results
Search advertising is the revenue modelSearch advertising is the revenue model
Multi-billion-dollar industry
Advertisers pay for clicks on their ads
Interesting problemsInteresting problems
How to pick the top 10 results for a search from2,230,000 matching pages?
What ads to show for a search? If Im an advertiser, which search terms should I bid on
and how much to bid?
Reproduced from Ullman & Rajaraman with permission
-
8/10/2019 Text Dan Web Mining
24/45
5/12/2014
24
Whats Web Mining?
Web search : Google, Yahoo,MSN, Ask,
Specialized search: e.g. Froogle(comparison shopping), job ads(Flipdog)
eCommerce : Recommendations: e.g. Netflix,Amazon
improving conversion rate: nextbest product to offer
Advertising, e.g. Google Adsense Fraud detection: click fraud
detection, Improving Web site design and
performance
Discovering interesting and useful
information from Web content and usage
Web Mining
Web mining - data mining techniques toautomatically discover and extract informationfrom Web documents/services (Etzioni, 1996).
Web mining research integrate research fromseveral research communities (Kosala andBlockeel, July 2000) such as: Database (DB) Information retrieval (IR) The sub-areas of machine learning (ML) Natural language processing (NLP)
May 12, 2014 Web Mining
-
8/10/2019 Text Dan Web Mining
25/45
5/12/2014
25
Web Mining The World Wide Web may have more opportunities
for data mining than any other area However, there are serious challenges:
It is too huge Complexity of Web pages is greater than any traditional
text document collection It is highly dynamic It has a broad diversity of users
Only a tiny portion of the information is truly useful
5/12/2014
How big is the Web ?
Numberof pagesNumberof pages
Technically,
infinite
Technically,
infinite
Because of dynamicallygenerated content
Because of dynamicallygenerated content
Lots of duplication (30-40%)Lots of duplication (30-40%)
Best estimate ofunique static HTMLpages comes from
search engine claims
Best estimate ofunique static HTMLpages comes from
search engine claims
Google = 8 billion,Yahoo = 20 billionGoogle = 8 billion,Yahoo = 20 billion
Lots of marketinghype
Lots of marketinghype
Reproduced from Ullman & Rajaraman with permission
-
8/10/2019 Text Dan Web Mining
26/45
5/12/2014
26
Why Mine the Web? Enormous wealth of textual information on the Web.
Book/CD/Video stores (e.g., Amazon)
Restaurant information (e.g., Zagats)
Car prices (e.g., Carpoint)
Lots of data on user access patterns
Web logs contain sequence of URLs accessed by users
Possible to retrieve previously unknown information
People who ski also frequently break their leg. Restaurants that serve sea food in California are likely to be outside
San-Francisco
In the May 2014, 975,262,468sites 16 million more than last month
http://news.netcraft.com/archives/category/web-server-survey/
-
8/10/2019 Text Dan Web Mining
27/45
5/12/2014
27
Unique Features of the Web The Web is a huge collection of documentswhere many contain: Hyper-link information
Access and usage information
The Web is very dynamic Web pages are constantly being generated (removed)
Challenge: Develop new Web mining algorithms to . . .Exploit hyper-links and access patterns.Be adaptable to its documents source
Web Mining vs Data Mining
Web is not relation Textual information and linkage structureStructureStructure
Usage data is huge and growing rapidly Data generated per day is comparable to
largest conventional data warehousesScaleScale
Often need to react to evolving usagepatterns in real-time (e.g., merchandising)
No human in the loopSpeedSpeed
-
8/10/2019 Text Dan Web Mining
28/45
5/12/2014
28
Web Mining Taxonomy
May 12, 2014 Web Mining
Web Mining
Web ContentMining
Web UsageMining
Web StructureMining
Web Mining Taxonomy
Web Mining
Web ContentMining
Web Structure
Mining
Web Usage
Mining
Web PageContent Mining
Search ResultMining
General AccessPattern Tracking
CustomizedUsage Tracking
Identify information
within given web
pages
Distinguish personal
home pages from
other web pages
Categorizes documents
using phrases in titles
and snippets
Uses interconnections between
web pages to give weight to
pages
Understand access
patterns and trends to
improve structure
Analyzes access
patterns of a user to
improve response
-
8/10/2019 Text Dan Web Mining
29/45
5/12/2014
29
Mining the World Wide Web
May 12, 2014 Web Mining
Web Mining
Web StructureMining
Web ContentMining
Web Page Content Mining
Web Page Summarization
WebOQL(Mendelzon et.al. 1998) :
Web Structuring query languages;
Can identify information within
given web pages
(Etzioni et.al. 1997):Uses heuristics todistinguish personal home pages
from other web pages
ShopBot (Etzioni et.al. 1997): Looksfor product prices within web
pages
Search ResultMining
Web UsageMining
General AccessPattern Tracking
CustomizedUsage Tracking
Mining the World Wide Web
May 12, 2014 Web Mining
Web Mining
Web UsageMining
General AccessPattern Tracking
CustomizedUsage Tracking
Web StructureMining
Web Content
Mining
Web PageContent Mining Search Result Mining
Search Engine Result
Summarization
Clustering Search Result(Leouski and Croft, 1996,
Zamir and Etzioni, 1997):
Categorizes documents
using phrases in titles and
snippets
-
8/10/2019 Text Dan Web Mining
30/45
5/12/2014
30
Mining the World Wide Web
May 12, 2014 Web Mining
Web Mining
Web ContentMining
Web PageContent Mining
Search ResultMining
Web UsageMining
General AccessPattern Tracking
CustomizedUsage Tracking
Web Structure Mining
Using Links
PageRank (Brin et al., 1998)CLEVER (Chakrabarti et al., 1998)Use interconnections between web pages
to give weight to pages.
Using Generalization
MLDB (1994)Uses a multi-level database
representation of the Web. Counters
(popularity) and link lists are used for
capturing structure.
Mining the World Wide Web
May 12, 2014 Web Mining
Web Mining
Web StructureMining
Web ContentMining
Web PageContent Mining
Search ResultMining
Web UsageMining
General Access Pattern Tracking
Web Log Mining (Zaane, Xin andHan, 1998)
Uses KDD techniques to understand
general access patterns and trends.
Can shed light on better structure and
grouping of resource providers.
CustomizedUsage Tracking
-
8/10/2019 Text Dan Web Mining
31/45
5/12/2014
31
Mining the World Wide Web
May 12, 2014 Web Mining
Web Mining
Web UsageMining
General AccessPattern Tracking
Customized Usage Tracking
Adaptive Sites (Perkowitz and Etzioni,1997)
Analyzes access patterns of each user at
a time.
Web site restructures itself automatically
by learning from user access patterns.
Web StructureMining
Web ContentMining
Web PageContent Mining
Search ResultMining
Web Content Mining Approaches
Information Retrieval Approach
To assist or to improve the information finding orfiltering the information to the users usually basedon either inferred or solicited user profiles.
Database Approach
To model the data on the Web and to integratedthem so that more sophisticated queries other thanthe keywords based could be performed.
-
8/10/2019 Text Dan Web Mining
32/45
5/12/2014
32
Web Content Mining
5/12/2014
IR View DB View
View of Data Unstructured
Semi-structured
Semi-structured
Web site as DB
Main Data Text documents
Hypertext documents
Hypertext documents
Representation Bag of words, n-grams
Terms, phrases
Concepts or ontology
Relational
Edge-labeled graph
Relational
Methods Machine LearningStatistics
ILPAssociation rules
Applications Categorization
Clustering
Finding extraction rules
Finding patterns in textUser modeling
Finding frequent substructures
Web site schema discovery
Isu dalam Web Content Mining
Pengembangan alat cerdar untuk IR
Mencari kata kunci & frasa kunci
Menemukan aturan gramatikal & collocation
Klasifikasi/kategorisasi hyperteks Mengekstra frasa kunci dari dokumen html
Ekstraksi model/aturan pembelajaran
Hierarchical clustering
Memprediksi keterhubungan kata
Membangun web Query system (WebOQL, XMLQL)
Mining multimedia data
May 12, 2014 Web Mining
-
8/10/2019 Text Dan Web Mining
33/45
5/12/2014
33
Web Structure Mining
5/12/2014
View of Data Links structure
Main Data Links structure
Representation Graph
Methods Proprietary algorithms
Applications Categorization
Clustering
Web Structure Mining
Untuk menemukan struktur link dari hyperlinkspada level antardokumen untuk membangunringkasan struktur tentang situs web
Arah 1: berbasis hyperlinks, mengkategorikan halamanWeb & informasi yang dibangun
Arah 2: menemukan struktur dari dokumen web itusendiri
Arah 3: menemukan kealamiahan hierarki/jaringanhyperlinks pada situsweb tertentu
May 12, 2014 Web Mining
-
8/10/2019 Text Dan Web Mining
34/45
5/12/2014
34
Web Structure Mining
Menemukan halaman web yg authorative
Menemukembalikan halaman yang tidak hanya relevan, tapijuga berkualitas tinggi/authorative terhadap topik
Hyperlinks dapat merujuk authority
Web menganfung juga hyperlinks dari satu halaman kehalaman lain
Hyperlinks mengandung anotasi manusia berjumlah besar
Hyperlink yang merujuk ke halaman lain, dapat
dipertimbangkan sebagai kesukaan pengarang terhadaphalaman lain
May 12, 2014 Web Mining
Web Usage Mining
5/12/2014
View of Data Interactivity
Main Data Server logs
Browser logs
Representation Relational table
GraphMethods Machine learning
Statistics
Association rules
Applications Site construction, adaptation & managementMarketing
User modeling
-
8/10/2019 Text Dan Web Mining
35/45
5/12/2014
35
Web Usage MiningWeb usage miningjuga disebut Weblog mining Teknik mining untuk menemukan polapenggunaan yang menarik dari datasekunder yang diturunkan dari interaksipengguna ketika menjelajahi web
May 12, 2014 Web Mining
Web Usage MiningAplikasi
Menargetkan kostumer yang potensial untuk produkelektronik
Memperluas kualitas dan pengantaran Internet
Information Services kepada pengguna akhir. Memperbaiki performa sistem web server Mengidentifikasi lokasi iklan yang potensial Memfasilitasi personalisasi/situs adaptif Memperbaki desain situs Deteksi fraud/intrusion Memprediksi aksi pengguna
May 12, 2014 Web Mining
-
8/10/2019 Text Dan Web Mining
36/45
5/12/2014
36
May 12, 2014 Web Mining
Log Data - Simple Analysis
Statistical analysis of users
Length of path
Viewing time Number of page views
Statistical analysis of site
Most common pages viewed
Most common invalid URL
May 12, 2014 Web Mining
-
8/10/2019 Text Dan Web Mining
37/45
5/12/2014
37
Web Log Data Mining ApplicationsAssociation rules
Find pages that are often viewed together
Clustering
Cluster users based on browsing patterns
Cluster pages based on content
Classification Relate user attributes to patterns
May 12, 2014 Web Mining
Common Log Format
Remotehost: browser hostname or IP #
Remote log name of user (almost
always "-" meaning "unknown")
Authuser: authenticated username
Date: Date and time of the request
"request: exact request lines from client
Status: The HTTP status code returned
Bytes: The content-length of response
-
8/10/2019 Text Dan Web Mining
38/45
5/12/2014
38
SERVER LOGS
May 12, 2014 Web Mining 75
Fields
Client IP: 128.101.228.20 Authenticated User ID: - - Time/Date: [10/Nov/1999:10:16:39 -0600] Request: "GET / HTTP/1.0" Status: 200 Bytes: - Referrer: - Agent: "Mozilla/4.61 [en] (WinNT; I)"
May 12, 2014 Web Mining
-
8/10/2019 Text Dan Web Mining
39/45
5/12/2014
39
Searching the Web
Content aggregatorsThe Web Content consumersReproduced from Ullman & Rajaraman with permission
Web search basics
The Web
Ad indexes
Web Results1- 10of about 7,310,000formiele. (0.12 seconds)
Miele, Inc -- Anythingelseis acompromiseAt theheart of yourhome, Appliances byMiele. ... USA. to miele.com. ResidentialAppliances.VacuumCleaners. Dishwashers. CookingAppliances. SteamOven. CoffeeSystem...www.miele.com/ -20k- Cached - Similar pages
Miele
Welcometo Miele, thehomeof the very best appliances andkitchensintheworld.www.miele.co.uk/ -3k - Cached- Similar pages
Miele -DeutscherHerstellervon Einbaugerten, Hausgerten ... -[ Translatethispage]Das PortalzumThemaEssen& Geniessenonlineunterwww.zu-tisch.de.Miele weltweit...einLebenlang. ... WhlenSiedie MieleVertretungIhresLandes.www.miele.de/ -10k- Cached- Similar pages
Herzlichwillkommenbei Mielesterreich -[ Translatethis page]Herzlichwillkommenbei Miele sterreichWennSienicht automatischweitergeleitet werden, klickenSiebittehier! HAUSHALTSGERTE ... www.miele.at/ -3k- Cached - Similar pages
SponsoredLinks
CG ApplianceExpressDiscount Appliances (650)756-3931SameDayCertifiedInstallationwww.cgappliance.com SanFrancisco-Oakland-SanJose,CAMiele VacuumCleanersMiele Vacuums-CompleteSelectionFreeShipping!www.vacuums.com
Miele VacuumCleanersMiele-FreeAirshipping!Allmodels. Helpfuladvice.www.best-vacuum.com
Web crawler
Indexer
Indexes
Search
User
Reproduced from Ullman & Rajaraman with permission
-
8/10/2019 Text Dan Web Mining
40/45
-
8/10/2019 Text Dan Web Mining
41/45
5/12/2014
41
Layout Structure Compared to plain text, a web page is a 2D presentation Rich visual effects created by different term types, formats,
separators, blank areas, colors, pictures, etc Different parts of a page are not equally important
5/12/2014 Data Mining: Principles and Algorithms
Title: CNN.com International
H1: IAEA: Iran had secret nuke agenda
H3: EXPLOSIONS ROCK BAGHDAD
TEXT BODY (with position and font
type): The International Atomic EnergyAgency has concluded that Iran hassecretly produced small amounts ofnuclear materials including low enricheduranium and plutonium that could be usedto develop nuclear weapons according to aconfidential report obtained by CNN
Hyperlink:
URL: http://www.cnn.com/...
Anchor Text: AI oaedaImage:
URL: http://www.cnn.com/image/...
Alt & Caption: Iran nuclear
Anchor Text: CNN Homepage News
Web Page BlockBetter Information Unit
5/12/2014 Data Mining: Principles and Algorithms
Importance = Med
Importance = Low
Importance = High
Web Page Blocks
-
8/10/2019 Text Dan Web Mining
42/45
5/12/2014
42
Web Usage MiningApplications:Applications:
Simple and Basic:Simple and Basic:
Monitor performance, bandwidth usage Catch errors (404 errors- pages not found) Improve web site design
(shortcuts for frequent paths, remove links not used, etc)
Advanced and Business Critical :Advanced and Business Critical : eCommerce: improve conversion, sales, profit Fraud detection: click stream fraud,
Web Usage Mining Three Phases
-
8/10/2019 Text Dan Web Mining
43/45
5/12/2014
43
Web Usage Mining Issues Identification of exact user not possible.
Exact sequence of pages referenced by a user notpossible due to caching.
Session not well defined
Security, privacy, and legal issues
Systems Issues
Tens to hundreds ofterabytes
Web data sets canbe very large
Web data sets canbe very large
Need large farms ofservers
Cannot mine on asingle server!
Cannot mine on asingle server!
Without breaking thebank!
How to organizehardware/software
to mine multi-terabye data sets
How to organizehardware/software
to mine multi-terabye data sets
-
8/10/2019 Text Dan Web Mining
44/45
5/12/2014
44
root
furnishing
accomodation
event area
...
hotel youth hostel...
cityregion ...
wellness hotel
Ontology Learning
[Mdche, Staab: ECAI 2000]
Derived concept pairs
(wellness hotel, area)(hotel, area)(accomodation, area)
AssociationRule Mining
Generalized Conceptual
Relation
hasLocation(accomodation,area)
is-ahierarchy
Semantic Web Structure/Content Mining
Knowledge base
Hotel: Wellnesshotel
GolfCourse: Seaview
belongsTo(Seaview,Wellnesshotel)
...
ILP BasedAssociationRule Mining,
eg. [Dehaspe,Toivonen,
J. DMKD 1998]
Hotel(x), GolfCourse(y), belongsTo(y,x) hasStars(x,5)
support = 0.4 % confidence = 89 %
belongsTo
FORALL X, Y
Y: Hotel[cooperatesWith ->> X] > Y].
GolfCourse
Organization
Hotel
name Cooperat
es With
Ontology
-
8/10/2019 Text Dan Web Mining
45/45
5/12/2014
Complex Data Types Summary Emerging areas of mining complex data types:
Text mining can be done quite effectively, especially ifthe documents are semi-structured
Web mining is more difficult due to lack of suchstructure
Data includes text documents, hypertext documents, linkstructure, and logs
Need to rely on unsupervised learning, sometimes
followed up with supervised learning such as classification
5/12/2014