Creating Operational Redundancy for Effective Web Data Mining

Using Operational Redundancy
Effective Web Data Mining
Jonathan LeBlanc, Head of Developer Evangelism N.A. (PayPal)
Github: http://github.com/jcleblanc
Slides: http://slideshare.net/jcleblanc
Twitter: @jcleblanc

Upload: jonathan-leblanc

Post on 28-Jan-2015

Category: Technology


DESCRIPTION

In this session, we will explore the principles behind building a highly scalable, efficient, and effective web data mining architecture, based on standard semantic principles of data collection. This type of standard collection will allow any company to turn unstructured web data into structurally sound, valuable content.

TRANSCRIPT

  • 1. Using Operational Redundancy
      Effective Web Data Mining
      Jonathan LeBlanc
      Head of Developer Evangelism N.A. (PayPal)
      Github: http://github.com/jcleblanc
      Slides: http://slideshare.net/jcleblanc
      Twitter: @jcleblanc

  • 2. Premise
      The interactions of a user can be used to personalize their experience
  • 3. Elements of Mining Redundancy
      Website Data Mining
      User Emotional State Mining
      User Interaction Mining
  • 4. Our Subject Material
      HTML content is poorly structured
      There are some pretty bad web practices on the interwebz
      You can't trust that anything semantically valid will be present
  • 5. How We'll Capture This Data
      Start with base linguistics
      Extend with available extras
  • 6. The Basic Pieces
      Page Data: Scrapey Scrapey
      Keywords: Without all the fluff
      Weighting: Word diets FTW
  • 7. Capture Raw Page Data
      Semantic data on the web is sucktastic
      Assume 5 year olds built the sites
      Language is the key
  • 8. Extract Keywords
      We now have a big jumble of words. Let's extract
      Why is "and" a top word?
      Stop words = sad panda
  • 9. Weight Keywords
      All content is not created equal
      Meta and headers and semantics oh my!
      This is where we leech off the work of others
  • 10. Questions to Keep in Mind
      Should I use regex to parse web content?
      How do users interact with page content?
      What key identifiers can be monitored to detect interest?
  • 11. Fetching the Data: cURL
      $req = curl_init($url);
      $options = array(
          CURLOPT_URL => $url,
          CURLOPT_HEADER => $header,
          CURLOPT_RETURNTRANSFER => true,
          CURLOPT_FOLLOWLOCATION => true,
          CURLOPT_AUTOREFERER => true,
          CURLOPT_TIMEOUT => 15,
          CURLOPT_MAXREDIRS => 10
      );
      curl_setopt_array($req, $options);
  • 12.
      //list of findable / replaceable string characters
      $find = array('/\r/', '/\n/', '/\s\s+/');
      $replace = array(' ', ' ', ' ');
      //perform page content modification
      //(assumed patterns: drop <script> and <style> blocks before strip_tags)
      $mod_content = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $page_content);
      $mod_content = preg_replace('#<style(.*?)>(.*?)</style>#is', '', $mod_content);
      $mod_content = strip_tags($mod_content);
      $mod_content = strtolower($mod_content);
      $mod_content = preg_replace($find, $replace, $mod_content);
      $mod_content = trim($mod_content);
      $mod_content = explode(' ', $mod_content);
      natcasesort($mod_content);
  • 13.
      //set up list of stop words and the final found stopped list
      $common_words = array("a", /* ... */ "zero");
      $searched_words = array();
      //extract list of keywords with number of occurrences
      foreach ($mod_content as $word) {
          $word = trim($word);
          if (strlen($word) > 2 && !in_array($word, $common_words)) {
              //start the counter at 0 on first sight to avoid an undefined-index warning
              $searched_words[$word] = ($searched_words[$word] ?? 0) + 1;
          }
      }
      arsort($searched_words, SORT_NUMERIC);
  • 14. Scraping Site Meta Data
      //load scraped page data as a valid DOM document
      $dom = new DOMDocument();
      @$dom->loadHTML($page_content);
      //scrape title
      $title = $dom->getElementsByTagName("title");
      $title = $title->item(0)->nodeValue;
  • 15.
      //loop through all found meta tags
      $metas = $dom->getElementsByTagName("meta");
      for ($i = 0; $i < $metas->length; $i++) {
          $meta = $metas->item($i);
          if ($meta->getAttribute("property")) {
              //Open Graph description
              if ($meta->getAttribute("property") == "og:description") {
                  $dataReturn["description"] = $meta->getAttribute("content");
              }
          } else {
              //standard description and keywords meta tags
              if ($meta->getAttribute("name") == "description") {
                  $dataReturn["description"] = $meta->getAttribute("content");
              } else if ($meta->getAttribute("name") == "keywords") {
                  $dataReturn["keywords"] = $meta->getAttribute("content");
              }
          }
      }
  • 16. Weighting Important Data
      Tags you should care about: meta (include OG), title, description, h1+, header
      Bonus points for adding in content location modifiers
  • 17. Weighting Important Tags
      //our keyword weights
      $weights = array(
          "keywords" => 3.0,
          "meta" => 2.0,
          "header1" => 1.5,
          "header2" => 1.2
      );
      //add modifier here
      if (strlen($word) > 2 && !in_array($word, $common_words)) {
          $searched_words[$word] = ($searched_words[$word] ?? 0) + 1;
      }
  • 18. Expanding to Phrases
      2-3 adjacent words, making up a direct relevant callout
      Seems easy right? Just like single words
      Language gets wonky without stop words
  • 19. Adding in Time Interactions
      Interaction with a site does not necessarily mean interest in it
      Time needs to also include an interaction component
      Gift buying seasons see interest variations
  • 20. Grouping Using Commonality
      Interests: User A
      Interests: User B
      Interests: Common
  • 21.
      Using Color Theory
      Products with a feel-good message: Happiness, energy, encouragement
      Health care (but not food!): Relatable, calm, friendly, peace, security
      Startups / innovative products: Creativity, imagination
      Auction sites (but not sales sites!): Passion, stimulation, excitement, power
  • 22. What We're Talking About
  • 23. The CSS Service Engine
      lesscss.org
      sass-lang.com
      learnboost.github.com/stylus
  • 24. Design Engine Foundation: LESS + PHP
      http://leafo.net/lessphp/
  • 25. The Basics of a Design Engine
      //create new LESS object
      $less = new lessc();
      //compile LESS code to CSS
      $less->checkedCompile("/path/styles.less", "path/styles.css");
      //create new CSS file and return new file link
      //(link markup assumed, pointing at the compiled file)
      echo "<link rel='stylesheet' href='path/styles.css' />";
  • 26. Passing Variables into LESSPHP
      //create a new LESS object
      $less = new lessc();
      //set the variables
      $less->setVariables(array(
          "color" => "red",
          "base" => "960px"
      ));
      //compile LESS into CSS and unset variables
      echo $less->compile(".magic { color: @color; width: @base - 200; }");
      $less->unsetVariable("color");
  • 27. Implementing Color Functions
      Lighten / Darken
      Saturate / Desaturate
      Adjust Hue
      Mix Colors
  • 28. Managing Irrelevant Content
      Remove / hide content based on user profile and state
  • 29. Managing Irrelevant Content
      //variables passed into LESS compilation
      $less->setVariables(array(
          "percent" => "80%",
      ));
      //LESS template (colors left unquoted so fade() receives color values)
      .highlight {
          @bg-color: #464646;
          @font-color: #eee;
          background-color: fade(@bg-color, @percent);
          color: fade(@font-color, @percent);
      }
  • 30. Acting on Disinterest / Boredom
      Traits of the Bored: Distraction, Repetition, Tiredness
      Reasons for Boredom: Lack of interest, Readiness
  • 31. Highlighting on Agitated Behavior
      Highlight relevant content to reduce agitated behavior
  • 32. Acting Upon User Cues
      //variables passed into LESS script
      $less->setVariables(array(
          "percent" => "100%",
          "size-mod" => "2"
      ));
  • 33. Acting Upon User Cues
      //LESS script logic for color / size variations
      //(values left unquoted so mix() and the addition operate on colors and units)
      .highlight {
          @bg-calm: blue;
          @bg-action: red;
          @base-font: 14px;
          background-color: mix(@bg-calm, @bg-action, @percent);
          font-size: @size-mod + @base-font;
      }
  • 34.
      Interaction and Emotion Plugin
      jQuery Behavior Miner by Cedric Dugas
      https://github.com/posabsolute/jquery-behavior-miner
  • 35. In the End
      What a person is interested in
      What a person is doing
      What their emotional state is
  • 36. Thank You! Questions?
      http://slideshare.com/jcleblanc
      Jonathan LeBlanc
      Head of Developer Evangelism N.A. (PayPal)
      Github: http://github.com/jcleblanc
      Slides: http://slideshare.net/jcleblanc
      Twitter: @jcleblanc
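
Slide 17 lists the keyword weights but leaves the weighting step itself as a placeholder (`//add modifier here`). The sketch below shows one way that step could be filled in; it is not the author's implementation. It assumes each occurrence of a word in body text counts 1.0, and each occurrence inside a weighted source (keywords meta, description meta, h1, h2) adds that source's weight on top. The function names `tokenize_words` and `score_keywords`, the additive scoring rule, and the sample markup are all hypothetical.

```php
<?php
// Hypothetical helper (not from the slides): lowercase the text, split on
// non-alphanumerics, then drop short words and stop words.
function tokenize_words(string $text, array $common_words): array
{
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_filter($words, function ($word) use ($common_words) {
        return strlen($word) > 2 && !in_array($word, $common_words, true);
    }));
}

// Hypothetical scorer: 1.0 per occurrence in body text, plus the source
// weight for every occurrence inside a weighted tag (assumed rule).
function score_keywords(string $html, array $weights, array $common_words): array
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // suppress warnings from malformed markup
    $xpath = new DOMXPath($dom);

    // Collect [text, weight] pairs, starting with visible body text.
    $sources = array();
    $query = '//body//text()[not(ancestor::script) and not(ancestor::style)]';
    foreach ($xpath->query($query) as $node) {
        $sources[] = array($node->nodeValue, 1.0);
    }
    // Meta keywords and descriptions (including Open Graph).
    foreach ($dom->getElementsByTagName("meta") as $meta) {
        $name = strtolower($meta->getAttribute("name"));
        $property = strtolower($meta->getAttribute("property"));
        if ($name === "keywords") {
            $sources[] = array($meta->getAttribute("content"), (float) ($weights["keywords"] ?? 0));
        } elseif ($name === "description" || $property === "og:description") {
            $sources[] = array($meta->getAttribute("content"), (float) ($weights["meta"] ?? 0));
        }
    }
    // Level 1 and level 2 headers.
    foreach (array("h1" => "header1", "h2" => "header2") as $tag => $key) {
        foreach ($dom->getElementsByTagName($tag) as $header) {
            $sources[] = array($header->textContent, (float) ($weights[$key] ?? 0));
        }
    }

    // Add each source's weight to every keyword it contains.
    $scores = array();
    foreach ($sources as list($text, $weight)) {
        foreach (tokenize_words($text, $common_words) as $word) {
            $scores[$word] = ($scores[$word] ?? 0.0) + $weight;
        }
    }
    arsort($scores, SORT_NUMERIC);
    return $scores;
}

// Usage with made-up sample markup.
$html = '<html><head><meta name="keywords" content="fedora, hats"></head>'
      . '<body><h1>Fedora sale</h1><p>The fedora is a hat.</p></body></html>';
$weights = array("keywords" => 3.0, "meta" => 2.0, "header1" => 1.5, "header2" => 1.2);
$scores = score_keywords($html, $weights, array("the", "and"));
print_r($scores);
```

Reading text nodes through the DOM rather than calling `strip_tags` on the whole page keeps words from adjacent elements from being glued together, which matters once headers and paragraphs sit back to back. An additive bonus is only one option; a multiplicative modifier would favor words that are already frequent in the body.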