Trust-based web spam detection in semantic search engine


By: Soheila Dehghanzadeh

What is web spam?

One hallmark of a successful information system is how heavily spammers attack it.

Spam pages on the web use a variety of techniques to reach high ranks in search results and to mislead search engines.

Search engines have to weigh both result quality and relevance so that the large volume of information on the web remains usable.

In search engine optimization and adversarial information retrieval, the goal is to discover the search engine's scoring function and artificially raise a page's rank in the retrieved results, in order to profit commercially from pages that appear at high ranks.

Because it is infeasible to detect spam pages by human effort, the process must be automated. Because spammers keep changing their techniques to mislead search engines, countering them automatically is very difficult.

Spamming techniques (WODoc)

A new technique for misleading search engines appears roughly every 2-3 days. The important point is that spammers' techniques depend entirely on the ranking algorithms of the targeted search engine.

Spammers' techniques:

• Term spamming: gratuitous use of important search keywords.

• Link spamming: using links to mislead PageRank.

• Cloaking: serving two versions at the same address, one for users and one for search engines.

Spamming techniques (WOData)

False Labelling

Misdirection

Schema Pollution

Identity Assumption

Bait and Switch

Misattribution

Data URI Embedding

False Labelling

• The spammer simply asserts labelling triples that promote their message. Linked Data systems often display the objects of these triples when labelling resources. If the spammer targets popular subject URIs then there is a higher chance of their message appearing for users of the Linked Data system. For example:

• dbpedia:London rdfs:label "Buy more Wensleydale" .

• <http://danbri.org/foaf.rdf#danbri> foaf:name "Wensleydale fan" .
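One possible countermeasure, sketched here as an illustration rather than as part of the original slides, is to flag labelling triples whose subject is hosted somewhere other than the document that asserts them. The function name, the short property list, and the same-host rule are all assumptions made for this example.

```python
from urllib.parse import urlparse

# Labelling properties to watch; this short list is an assumption for the example.
LABEL_PROPERTIES = {
    "http://www.w3.org/2000/01/rdf-schema#label",
    "http://xmlns.com/foaf/0.1/name",
}

def suspicious_labels(triples, source_url):
    """Return labelling triples whose subject is hosted elsewhere than source_url."""
    source_host = urlparse(source_url).netloc
    flagged = []
    for subject, predicate, obj in triples:
        # A document labelling a resource on another host is treated as suspicious.
        if predicate in LABEL_PROPERTIES and urlparse(subject).netloc != source_host:
            flagged.append((subject, predicate, obj))
    return flagged

triples = [
    ("http://dbpedia.org/resource/London",
     "http://www.w3.org/2000/01/rdf-schema#label", "Buy more Wensleydale"),
    ("http://example.com/cheese",
     "http://www.w3.org/2000/01/rdf-schema#label", "Cheese"),
]
flagged = suspicious_labels(triples, "http://example.com/data.rdf")
```

A same-host rule is deliberately strict and would also flag legitimate third-party annotations, so in practice it would more likely feed a score than act as a hard filter.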

Misdirection

• The attacker asserts triples using properties that are commonly used to provide links to human-readable content. In the attack, the triple objects are resources that contain the attacker's message. Systems that use these properties may inadvertently display links to the spammer's site and content:

• dbpedia:London rdfs:seeAlso <http://example.com/buycheese> .

• dbpedia:Tim_Berners-Lee foaf:isPrimaryTopicOf <http://example.com/buycheese> .

• <http://sws.geonames.org/3333196/> mo:wikipedia <http://example.com/buycheese> .

Schema Pollution

• In this attack all of the instance data is innocuous but some of the properties used in the data are labelled with the spammer's message. When rendering data for human use, many Linked Data systems will look for schema information to label unknown predicates. This attack causes those systems to display the spammer's message:

• ex:thing dc:title "New study finds that mice can learn to sing." ; a foaf:Document ; dc:subject "mouse behaviour" ; ex:prop "Journal of mouse psychology" .

• ex:prop a rdf:Property ; rdfs:label "Lowest Wensleydale prices at bargaincheeseshop.com" .

• This attack can be combined with False Labelling, attempting to inject a message into a commonly used schema:

• dc:title rdfs:label "Lowest Wensleydale prices at bargaincheeseshop.com" .

Identity Assumption

• A common Linked Data practice is minting URIs in one URI space and using owl:sameAs to connect the resource to identical resources in other URI spaces. The attacker simply describes a resource that conveys their message and then uses owl:sameAs to make it identical to popular resources. Most Linked Data systems recognise owl:sameAs and aggregate all triples about any subjects declared to be identical.

• ex:thing dc:title "Wensleydale: the mature, smooth cheese you will love." ; owl:sameAs dbpedia:The_Beatles ; owl:sameAs dbpedia:Lady_Gaga ; owl:sameAs dbpedia:True_Blood ; owl:sameAs dbpedia:Harry_Potter .
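A simple heuristic against this pattern, given here only as a sketch, is to count how many distinct resources a single subject claims to be identical to and flag unusually high fan-out. The threshold `max_targets` and the function names are hypothetical, and a real system would need to calibrate them against legitimate owl:sameAs usage.

```python
from collections import defaultdict

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

def same_as_fanout(triples):
    """Count the distinct owl:sameAs targets declared for each subject."""
    targets = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == OWL_SAME_AS:
            targets[subject].add(obj)
    return {subject: len(objs) for subject, objs in targets.items()}

def flag_identity_assumption(triples, max_targets=2):
    """Return subjects that claim identity with more than max_targets resources."""
    fanout = same_as_fanout(triples)
    return sorted(s for s, n in fanout.items() if n > max_targets)

DBPEDIA = "http://dbpedia.org/resource/"
triples = [
    ("http://example.com/thing", OWL_SAME_AS, DBPEDIA + name)
    for name in ("The_Beatles", "Lady_Gaga", "True_Blood", "Harry_Potter")
] + [("http://example.org/london", OWL_SAME_AS, DBPEDIA + "London")]

flagged_subjects = flag_identity_assumption(triples)
```

Here the subject with four unrelated identity claims is flagged, while the one with a single owl:sameAs link is not.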

Bait and Switch

• In this vector, the spammer uses content negotiation to provide enticing Linked Data to machines and spam messages to humans. When a Linked Data system fetches a URI it indicates that it requires machine-readable data by sending an appropriate HTTP header. Web browsers under the control of a human will send a different value for the header, so servers can distinguish machines from humans and send different information. The spammer can configure their server to send innocuous Linked Data to machines while showing the spam message when the same URIs are visited by humans. (See the post "Is the semantic web destined to be a shadow?" for some of the consequences of this separation of machine and human content.)

http://iandavis.com/blog/2007/11/is-the-semantic-web-destined-to-be-a-shadow

Misattribution

• Under this attack, the spammer attributes their message to someone they hope the recipient will trust. Linked Data systems may ingest this data and display the quotation with the source, inadvertently misleading their users:

• ex:1 a bibo:Quote ; bibo:content "I always buy Wensleydale from bargaincheeseshop.com and so should you" ; dc:creator "Sergey Brin" .

Data URI Embedding

• In this attack vector the data itself is innocuous but the URIs used by the attacker use the data: scheme to embed the spam message. If these URIs are displayed to the user of a Linked Data system then they may click on them and trigger the message display:

• dbpedia:London rdfs:seeAlso <data:text/html;charset=utf-8;base64,PGEgaHJlZj0iaHR0cDovL2V4YW1wbGUuY29tL2J1eWNoZWVzZSI+bG93ZXN0IFdlbnNsZXlkYWxlIHByaWNlczwvYT4=> .
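To see what such a URI carries, the payload can be decoded before it is ever shown to a user. The following is a minimal sketch using only the Python standard library; the helper name `decode_data_uri` is an assumption for this example, and it handles only base64-encoded payloads.

```python
import base64

def decode_data_uri(uri):
    """Return (media_type, decoded_text) for a base64-encoded data: URI."""
    if not uri.startswith("data:"):
        raise ValueError("not a data: URI")
    header, _, payload = uri[len("data:"):].partition(",")
    if not header.endswith(";base64"):
        raise ValueError("only base64 payloads are handled in this sketch")
    media_type = header[: -len(";base64")]
    return media_type, base64.b64decode(payload).decode("utf-8")

uri = ("data:text/html;charset=utf-8;base64,"
       "PGEgaHJlZj0iaHR0cDovL2V4YW1wbGUuY29tL2J1eWNoZWVzZSI+"
       "bG93ZXN0IFdlbnNsZXlkYWxlIHByaWNlczwvYT4=")
media_type, text = decode_data_uri(uri)
print(text)  # <a href="http://example.com/buycheese">lowest Wensleydale prices</a>
```

A system that renders URIs could refuse to display or link any object whose scheme is data:, or at least inspect the decoded content first.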


Spam conclusion

• Most of these attack vectors can be countered through a whitelist provenance system, but such systems are not easy to scale.

• Because duplicate triples in RDF can be ignored, it is easy to bury spam inside billions of legitimate triples: simply take a copy of DBpedia and add a few spam triples.

• A casual inspection of the dataset will more than likely see only the DBpedia triples, but a Linked Data system that already has those triples will ignore them and add just the spam triples.
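The effect described above can be illustrated with plain set operations: once the triples that a system already holds are discarded, only the new ones remain, and in this scenario those are exactly the spam. This is a toy sketch with made-up data, not a description of any particular triple store.

```python
def novel_triples(incoming, known):
    """Return the triples in `incoming` that the system does not already hold."""
    return set(incoming) - set(known)

# Triples the system already has (standing in for a full DBpedia copy).
known = {
    ("dbpedia:London", "rdfs:label", "London"),
    ("dbpedia:Paris", "rdfs:label", "Paris"),
}

# The attacker republishes the same data plus one spam triple.
incoming = known | {("dbpedia:London", "rdfs:label", "Buy more Wensleydale")}

new = novel_triples(incoming, known)
```

The same difference is also a cheap review aid: inspecting only the novel triples of a dataset exposes what a casual look at the whole dump would miss.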

Search engine techniques to deal with web spam

Search engines are the most important gateways to the web.

An obvious principle for detecting spam:

"Good, high-quality pages are very unlikely to link to spam pages."

This principle is the basis of the TrustRank algorithm.

The TrustRank algorithm:

Select the seed set and invoke the oracle, using inverse PageRank and high PageRank.

Propagate trust from the seed set to other websites and identify spam, taking PageRank into account.

However, spammers sometimes place a link to their own page in the comments section of a good page, which causes problems for this algorithm.

Trust propagation must be attenuated as the distance from the seed set increases.
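The propagation step can be sketched as a biased PageRank iteration in which the random jump always returns to the trusted seeds. The code below is a minimal illustration under stated assumptions: the seed set is taken as given (seed selection and the oracle are not modelled), the damping value 0.85 and the iteration count are conventional placeholders, and the function name is hypothetical.

```python
def trust_rank(links, seeds, alpha=0.85, iterations=20):
    """Propagate trust from seed pages along outgoing links.

    links: dict mapping each page to the list of pages it links to.
    seeds: non-empty set of pages judged trustworthy.
    """
    if not seeds:
        raise ValueError("at least one seed page is required")
    pages = set(links) | {q for targets in links.values() for q in targets}
    # Static trust distribution: uniform over the seeds, zero elsewhere.
    d = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    trust = dict(d)
    for _ in range(iterations):
        nxt = {p: (1.0 - alpha) * d[p] for p in pages}
        for p, targets in links.items():
            if targets:
                # Trust is damped by alpha and split among the outgoing links.
                share = alpha * trust[p] / len(targets)
                for q in targets:
                    nxt[q] += share
        trust = nxt
    return trust

links = {
    "good": ["a"],
    "a": ["b"],
    "b": [],
    "spam": ["spam2"],
    "spam2": ["spam"],
}
trust = trust_rank(links, seeds={"good"})
```

In this toy graph the score shrinks with each hop away from the seed, which is the attenuation the slide asks for, and the two spam pages that no trusted page links to receive no trust at all.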

The TrustRank algorithm makes no distinction between different kinds of links. To apply it to Linked Data, the algorithm has to be adapted to the different link types.
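One way such an adaptation might look, offered purely as an assumption-laden sketch, is to replace the equal split of trust among outgoing links with shares proportional to a per-predicate weight. The weights below are arbitrary placeholders and `link_shares` is a hypothetical helper.

```python
from collections import defaultdict

def link_shares(typed_links, predicate_weights, default_weight=1.0):
    """Turn typed links into per-source shares that sum to 1.

    typed_links: iterable of (source, predicate, target) triples.
    predicate_weights: dict mapping a predicate to its relative weight.
    """
    raw = defaultdict(lambda: defaultdict(float))
    for source, predicate, target in typed_links:
        raw[source][target] += predicate_weights.get(predicate, default_weight)
    shares = {}
    for source, targets in raw.items():
        total = sum(targets.values())
        if total > 0:
            shares[source] = {t: w / total for t, w in targets.items()}
    return shares

# Placeholder weights: how much trust each link type is allowed to carry.
weights = {"owl:sameAs": 0.2, "rdfs:seeAlso": 0.8}
typed_links = [
    ("d1", "owl:sameAs", "d2"),
    ("d1", "rdfs:seeAlso", "d3"),
]
shares = link_shares(typed_links, weights)
```

A trust propagation loop could then multiply a source's trust by these shares instead of dividing it evenly by the number of outgoing links.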

Search engine architecture

Ranking

A two-layer model for the Web of Data

Unsupervised weighting

Description of the VOID and OPM vocabularies and how they could be used for weighting

OPM (http://openprovenance.org/model/opmx)
Classes: Agent | Artifact | Process
Properties: used | wasControlledBy | wasDerivedFrom | wasEncodedBy | wasEndedAt | wasGeneratedAt | wasGeneratedBy | wasPerformedAt | wasPerformedBy | wasStartedAt | wasTriggeredBy | wasUsedAt

VOID (http://vocab.deri.ie/void)
Classes: Dataset | Linkset | TechnicalFeature
Properties: dataDump | exampleResource | feature | linkPredicate | objectsTarget | sparqlEndpoint | statItem | subjectsTarget | subset | target | uriLookupEndpoint | uriRegexPattern | vocabulary

http://open-biomed.sourceforge.net/opmv/ns.html#Agent

http://open-biomed.sourceforge.net/opmv/ns.html#Artifact

http://open-biomed.sourceforge.net/opmv/ns.html#Process

http://open-biomed.sourceforge.net/opmv/ns.html#used

http://open-biomed.sourceforge.net/opmv/ns.html#wasControlledBy

http://open-biomed.sourceforge.net/opmv/ns.html#wasDerivedFrom

http://open-biomed.sourceforge.net/opmv/ns.html#wasEncodedBy

http://open-biomed.sourceforge.net/opmv/ns.html#wasEndedAt

http://open-biomed.sourceforge.net/opmv/ns.html#wasGeneratedAt

http://open-biomed.sourceforge.net/opmv/ns.html#wasGeneratedBy

http://open-biomed.sourceforge.net/opmv/ns.html#wasPerformedAt

http://open-biomed.sourceforge.net/opmv/ns.html#wasPerformedBy

http://open-biomed.sourceforge.net/opmv/ns.html#wasStartedAt

http://open-biomed.sourceforge.net/opmv/ns.html#wasTriggeredBy


http://vocab.deri.ie/void#Dataset

http://vocab.deri.ie/void#Linkset

http://vocab.deri.ie/void#TechnicalFeature

http://vocab.deri.ie/void#dataDump

http://vocab.deri.ie/void#exampleResource

http://vocab.deri.ie/void#feature

http://vocab.deri.ie/void#linkPredicate

Evaluation

• Spam detection is a classification problem. No dataset has been set aside for this task and no tests have been carried out in this area so far. We therefore take all the triples indexed by the Sindice search engine, divide them into two classes, spam and non-spam, and compare the result with the results of well-known classification algorithms. Measuring precision and recall and comparing them will show how well the algorithm performs.
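For reference, precision and recall for the spam class can be computed as below. This is a generic sketch with invented triple identifiers; it is not tied to the Sindice data, and the function name is an assumption.

```python
def precision_recall(predicted_spam, actual_spam):
    """Compute precision and recall for the spam class.

    predicted_spam: items the detector labelled as spam.
    actual_spam: items that really are spam according to the reference labels.
    """
    predicted_spam, actual_spam = set(predicted_spam), set(actual_spam)
    true_positives = len(predicted_spam & actual_spam)
    precision = true_positives / len(predicted_spam) if predicted_spam else 0.0
    recall = true_positives / len(actual_spam) if actual_spam else 0.0
    return precision, recall

predicted = {"t1", "t2", "t3"}        # flagged by the detector
actual = {"t2", "t3", "t4", "t5"}     # labelled spam in the reference set
precision, recall = precision_recall(predicted, actual)
```

With these sets, two of the three flagged triples are truly spam and two of the four spam triples are found, so precision is 2/3 and recall is 1/2.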
