Sourcing7 September Workshop - Data Extraction: Data Is King
DESCRIPTION
Sourcing7 made an important change to our programs for 2014: we are offering hands-on classroom training to the sourcing community. These classes will cover a wide range of subjects that sourcers should find extremely valuable, and will be facilitated by sourcing thought leaders from the local community. The classroom format will allow participants to take a deeper dive into the respective subject and engage more fully with the presenter. And did we mention the events are offered at no charge to the participants?

Our first fall training session for 2014 is on Thursday, September 18th and will feature Todd Davis and Hakon Verespej of Madrona Venture Group. This month's workshop will cover some pretty advanced topics!

Section 1 - A Peek Behind the Curtain: Hakon, a former engineer at Microsoft, will show how using JavaScript, node.js, cheerio, and other tools can unleash a sourcer's data-extraction potential!

Section 2 - The Tools of the Trade: Todd will show how tools like Kimono Labs and Facebook Sensei can, with a few clicks, extract data from websites and social media sites and help you build your pipeline of data.

Remember, bring your laptops, as this will be an activity-based workshop. Also bring your appetite, as there will be [free] food and drinks!

Date: Thursday, September 18 from 6:00-8:30pm
Agenda:
6:00-6:30pm Mingling, eating, getting laptops set up, and getting settled
6:30-8:30pm Interactive learning session, real-time problem solving, and lots of Q&A!
Address: Nytec, Inc., 416 6th Street South, Kirkland, WA 98033

TRANSCRIPT
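To give a flavor of the kind of extraction Section 1 describes, here is a minimal Node.js sketch. In practice you would fetch a live page and parse it with cheerio; to keep this self-contained it uses a static HTML snippet and a regular expression instead, and the `username` class name is a hypothetical placeholder, not real markup from any site.

```javascript
// Minimal data-extraction sketch, runnable with plain Node.js (no packages).
// A robust version would use cheerio; this one uses a regex for illustration.
const html = `
  <div class="attendee"><span class="username">alice_dev</span></div>
  <div class="attendee"><span class="username">bob_sourcer</span></div>
  <div class="attendee"><span class="username">carol_eng</span></div>
`;

// Collect the text inside every <span class="username">…</span> element.
const usernames = [...html.matchAll(/<span class="username">([^<]+)<\/span>/g)]
  .map((match) => match[1]);

console.log(usernames); // [ 'alice_dev', 'bob_sourcer', 'carol_eng' ]
```

The same select-every-matching-element idea is what the point-and-click tools in Section 2 automate for you.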
DATA EXTRACTION USING KIMONO
• Kimono lets you turn websites into APIs in seconds.
• You don't need to write any code or install any software to extract data with Kimono. The easiest way to use Kimono is to add their bookmarklet to your browser's bookmark bar. Then go to the website you want to get data from and click the bookmarklet. Select the data you want and Kimono does the rest.
• They take care of hosting the APIs that you build with Kimono and running them on the schedule you specify. Use the API output as JSON or as CSV files that you can easily paste into a spreadsheet.
Why Kimono
• Free
• Nothing to install (bookmarklet or Chrome extension)
• Simple for basic stuff
• Strong enough for complex tasks
• Pagination (i.e. 1 2 3 ... Next)
• On-demand or scheduled crawling
• Secure login & password
• Export to CSV / RSS
• Can create a mobile app from the data or embed it in a website
• Visually test and edit your scraper :)
• Historic data without duplicates
• Official Google Sheets add-on
• Email alerts
We will use the DrupalCon Austin conference attendee list as our example.
We will be using the Chrome extension mentioned on the previous page.
1. Click the Kimono Labs extension in Chrome. Kimono will open at the top of the page; it looks like a toolbar.
1. For this example we are going to select the “username” under the picture.
2. When we do, you will see that all the other usernames are highlighted in yellow, each with a check mark next to it.
3. By clicking the check mark you select all the usernames on the page.
4. With the extension open and the Kimono Labs bar at the top, give this first piece of data a title in the box that says “property1”. For this example we will call the first field Username.
• We still have other information we want to extract, so click the + sign next to the Username field you just created.
• You will see that you have now created a “property2” field; for that one, let's select the names of the people on the page.
• You will see that the colored boxes have reappeared and all the names have been selected, 50 in total.
• We need to give it a title in the box that says “property2”. For this example we will call the second field Name.
• Click the + sign next to the Name field and create a “property3” field.
• We need to add each person's title to our extraction list, so highlight the first person's title.
• You will notice that the check marks are back; just click the check mark next to the second person's title.
• We need to give it a title in the box that says “property3”. For this example we will call the third field Title.
• Click the + sign one more time and create a “property4” field.
• We need to add each person's company to our extraction list, so highlight the first person's company.
• You will notice that the check marks return; click the check mark on the second person's company. For this example we will call the fourth field Company.
• We now have four fields picked for extraction: Username, Name, Title, and Company.
• If we look at the page, we notice that we have selected 50 records, but there are 9 pages of records in total; we would be leaving a lot of information behind if we created the API and extracted the data now.
• We need to do one more step…
ONE TOOL TO RULE THEM ALL
• Pagination is simply the ability to scrape multiple pages of data. Doing this by hand would take a lot of time; Kimono has a simple way to extract ALL of the leads.
• If we look at the DrupalCon Austin website, we see that we have only selected the first page of 50, but as mentioned, there are 9 pages!
• On the Kimono toolbar, click the button that looks like an open book. This is the Pagination feature.
• Once that is selected, click the Next option at the bottom of the page. This tells Kimono to select all the following pages on the DrupalCon Austin website.
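Conceptually, the pagination feature just keeps following the Next link until there is none left, collecting records from each page along the way. A minimal sketch of that loop, using in-memory stand-ins for the pages (the page names and records are hypothetical):

```javascript
// Simulated paginated site: each page holds some records and a "next" link.
const pages = {
  page1: { records: ['rec1', 'rec2'], next: 'page2' },
  page2: { records: ['rec3', 'rec4'], next: 'page3' },
  page3: { records: ['rec5'], next: null }, // last page: no Next link
};

// Follow Next links from the start page, gathering every record.
function crawlAll(startPage) {
  const results = [];
  let current = startPage;
  while (current !== null) {
    const page = pages[current];
    results.push(...page.records); // extract this page's records
    current = page.next;           // follow the Next link (null stops the loop)
  }
  return results;
}

console.log(crawlAll('page1')); // [ 'rec1', 'rec2', 'rec3', 'rec4', 'rec5' ]
```

A real crawler would fetch each URL over HTTP and parse the Next link out of the page markup; Kimono does both for you.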
• We are ready to click the Done button on the Kimono toolbar.
• This opens a screen where we can name our API; we will call it SourceCon Test.
• We can add tags to describe the API.
• We can choose how we get the data: on demand, etc.
• We can set how many pages to scrape.
• Once you have completed that, click Create API, then click the link to visit your new API.
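Once the API exists, each crawl returns the extracted fields as JSON that you can consume from a script. The response shape below is illustrative, not Kimono's exact schema — the field names simply match the four properties we defined — and in a real script you would fetch this from the API's endpoint rather than parse a literal string:

```javascript
// Hypothetical JSON payload from our "SourceCon Test" API (illustrative shape).
const apiResponse = JSON.parse(`{
  "name": "SourceCon Test",
  "results": {
    "collection1": [
      { "Username": "alice_dev", "Name": "Alice", "Title": "Engineer", "Company": "Acme" },
      { "Username": "bob_sourcer", "Name": "Bob", "Title": "Sourcer", "Company": "Initech" }
    ]
  }
}`);

// Pull out just the names and companies for our sourcing pipeline.
const leads = apiResponse.results.collection1.map(
  (row) => `${row.Name} (${row.Company})`
);

console.log(leads); // [ 'Alice (Acme)', 'Bob (Initech)' ]
```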
• We aren't done yet; we need to complete our pagination by clicking the Start Crawl Now button.
• This tells our new API to crawl and extract the data from all 9 pages!
• It will show you what percentage has been completed, how many URLs have been crawled, the rows returned, and the time elapsed.
Let's take a look at our results! We were able to extract over 2,000 leads!
You can view the results in JSON, RSS, or CSV under the Preview tab. Let's take a look at how the data looks in CSV.
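The CSV view is just the same extracted records flattened into comma-separated rows, one line per person plus a header line. A minimal sketch of that conversion, using hypothetical records with our four fields (a real CSV writer would also escape commas and quotes inside values):

```javascript
// Hypothetical extracted records with the four fields we defined.
const records = [
  { Username: 'alice_dev', Name: 'Alice', Title: 'Engineer', Company: 'Acme' },
  { Username: 'bob_sourcer', Name: 'Bob', Title: 'Sourcer', Company: 'Initech' },
];

// Header row from the field names, then one comma-joined row per record.
const header = Object.keys(records[0]).join(',');
const rows = records.map((rec) => Object.values(rec).join(','));
const csv = [header, ...rows].join('\n');

console.log(csv);
// Username,Name,Title,Company
// alice_dev,Alice,Engineer,Acme
// bob_sourcer,Bob,Sourcer,Initech
```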
Kimono also offers a Chrome Add-On for Google Sheets - https://chrome.google.com/webstore/detail/kimono/gincecdpheaeldbkjinnmloiiomnakee?hl=en
• You can also download the CSV file.
• There is an option under the Use in Code tab where you can add it to your website, etc. They have options in Node, Ruby, and more.
• Under the API Detail tab you can set up email alerts for your API, create a mobile app, embed the data in your website, or set up webhooks.
• One other thing to consider: if the website you want to paginate through doesn't have a Next button, Kimono has an option where you can select the URL of each page you want to scrape and add them manually to your API.
• Kimono also offers an API search and the ability to clone an API.
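The manual-URL fallback amounts to iterating an explicit list of page URLs instead of following Next links. A quick sketch with hypothetical URLs and a stubbed-out extraction step:

```javascript
// Manually listed page URLs (hypothetical), as on a site with no Next button.
const pageUrls = [
  'https://example.com/attendees?page=1',
  'https://example.com/attendees?page=2',
  'https://example.com/attendees?page=3',
];

// Stub standing in for "fetch the page and extract its records".
function extractFrom(url) {
  return [`record from ${url}`];
}

// Visit every listed page and pool the results.
const allRecords = pageUrls.flatMap(extractFrom);
console.log(allRecords.length); // 3
```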
• Facebook is a password-protected website.
• Kimono works with some public data on Facebook, but not with our Facebook Graph Search results.
• Writing code against the Facebook API for this information is unreliable and unstable.
• Other tools like Facebook Sensei can extract UIDs and other information.
QUESTIONS