Sourcing7 September Workshop - Data Extraction: Data is King

July 2014 DATA EXTRACTION: DATA IS KING TODD B DAVIS MADRONA VENTURE GROUP [email protected]

Upload: todd-davis

Post on 13-Jun-2015

Category: Recruiting & HR



DESCRIPTION

Sourcing7 made an important change in our programs for 2014: we are offering hands-on classroom training to the sourcing community. These classes will cover a wide range of subjects that sourcers should find extremely valuable. The classes will be facilitated by sourcing thought leaders from the local community. The classroom format will allow the participants to take a deeper dive into the respective subject and engage more fully with the presenter. And did we mention, the events are offered at no charge to the participants.

Our first fall training session for 2014 is on Thursday, September 18th and will feature Todd Davis and Hakon Verespej of Madrona Venture Group. This month's workshop will cover some pretty advanced topics!

Section 1 - A Peek Behind the Curtain
Hakon, a former engineer at Microsoft, will show how using JavaScript, node.js, cheerio and others can unleash a sourcer's data extraction potential!

Section 2 - The Tools of the Trade
Todd will show how tools like Kimono Labs and Facebook Sensei can, with a few clicks, extract data from websites and social media sites and help you build your pipeline of data.

Remember, bring your laptops as this will be an activity-based workshop. Also bring your appetite as there will be [free] food and drinks!

Date: Thursday, September 18 from 6:00-8:30pm

Agenda:
6:00-6:30pm Mingling, eating, getting laptops set up, and getting settled
6:30-8:30pm Interactive learning session, real-time problem solving, and lots of Q&A!

Address: Nytec, Inc. 416 6th Street South Kirkland, WA 98033

TRANSCRIPT

Page 1: Sourcing7 September Workshop - Data Extraction: Data is King

July 2014

DATA EXTRACTION: DATA IS KING

TODD B DAVIS
MADRONA VENTURE GROUP
[email protected]

Page 2: Sourcing7 September Workshop - Data Extraction: Data is King


DATA EXTRACTION USING KIMONO

• Kimono lets you turn websites into APIs in seconds.

• You don't need to write any code or install any software to extract data with Kimono. The easiest way to use Kimono is to add their bookmarklet to your browser's bookmark bar. Then go to the website you want to get data from and click the bookmarklet. Select the data you want and Kimono does the rest.

• They take care of hosting the APIs that you build with Kimono and running them on the schedule you specify. Use the API output in JSON or as CSV files that you can easily paste into a spreadsheet.
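
As a rough illustration of the JSON and CSV output mentioned above, the sketch below flattens a JSON payload into CSV text using only the Python standard library. The payload layout (a `results` object holding a `collection1` list) and the field names are assumptions made for this example, not a documented Kimono response.

```python
import csv
import io
import json

# Hypothetical API output; the real payload layout may differ.
SAMPLE_JSON = """
{
  "results": {
    "collection1": [
      {"Username": "jdoe", "Name": "Jane Doe"},
      {"Username": "rroe", "Name": "Richard Roe"}
    ]
  }
}
"""


def json_to_csv(payload: str, fields: list[str]) -> str:
    """Convert the rows in the assumed JSON layout into CSV text."""
    rows = json.loads(payload)["results"]["collection1"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()    # first line: the column names
    writer.writerows(rows)  # one CSV line per extracted record
    return buffer.getvalue()


csv_text = json_to_csv(SAMPLE_JSON, ["Username", "Name"])
print(csv_text)
```

The resulting text can be saved to a `.csv` file or pasted into a spreadsheet, which is the workflow the slide describes.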

Page 3: Sourcing7 September Workshop - Data Extraction: Data is King

Why Kimono

• Free
• Nothing to install (bookmarklet or Chrome extension)
• Simple for basic stuff
• Strong enough for complex tasks
• Pagination (e.g. 1 2 3 ... Next)
• On-demand or scheduled crawling
• Secure login & password
• Export to CSV / RSS
• Can create a mobile app from data or embed in a website
• Visually test and edit your scraper :)
• Historic data without duplicates
• Official Google Sheets add-on
• Email alerts

Page 4: Sourcing7 September Workshop - Data Extraction: Data is King

We will use the DrupalCon Austin conference as our example.

We will be using the Chrome extension that was mentioned on the previous page.

1. Click the Kimono Labs extension in Chrome. Kimono will open at the top of the page; it looks like a toolbar.

Page 5: Sourcing7 September Workshop - Data Extraction: Data is King

1. For this example we are going to select the “username” under the picture.

2. When we do that, you will see that all the other usernames are highlighted in yellow with a check mark next to each one.

3. By clicking the check mark you select all the usernames on the page.

4. With the extension open and the Kimono Labs bar at the top, let's select the first piece of data we want to extract. We need to give it a title in the box that says “property1”. For this example we will call the first field Username.

Page 6: Sourcing7 September Workshop - Data Extraction: Data is King

• We still have other information we want to extract. Click the + sign next to the Username field you created.

• You will see that you have now created a “property2” field. For that, let's select the names of the people on the page.

• You will see that our “colored boxes” have reappeared and all the names have been selected, 50 total.

• We need to give it a title in the box that says “property2”. For this example we will call the second field Name.

Page 7: Sourcing7 September Workshop - Data Extraction: Data is King

• Click the + sign next to the Name field and create a “property3” field.

• We need to add each person’s Title to our extraction list, so highlight the first person’s Title.

• You will notice that our check marks are back. Just click the check mark next to the second person’s Title.

• We need to give it a title in the box that says “property3”. For this example we will call the third field Title.

Page 8: Sourcing7 September Workshop - Data Extraction: Data is King

• Click the + sign one more time and create a “property4” field.

• We need to add each person’s Company to our extraction list, so highlight the first person’s Company.

• You will notice that the check marks return. Click the check mark next to the second person’s Company.
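
For readers curious what this kind of field selection amounts to in code, here is a minimal sketch using Python's standard `html.parser`. It is not how Kimono works internally; it only shows the general idea of pulling the same four fields out of every repeated block on a page. The markup and the class names (`attendee`, `username`, `name`, `title`, `company`) are invented for the example.

```python
from html.parser import HTMLParser

# Made-up markup standing in for an attendee listing; class names are assumptions.
SAMPLE_HTML = """
<div class="attendee"><span class="username">jdoe</span>
  <span class="name">Jane Doe</span><span class="title">Developer</span>
  <span class="company">Example Co</span></div>
<div class="attendee"><span class="username">rroe</span>
  <span class="name">Richard Roe</span><span class="title">Site Builder</span>
  <span class="company">Sample Ltd</span></div>
"""

# Maps a CSS class in the sample markup to the field name used in the walkthrough.
FIELDS = {"username": "Username", "name": "Name", "title": "Title", "company": "Company"}


class AttendeeParser(HTMLParser):
    """Collects one dict per attendee block, keyed by field name."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # record being built, or None outside a block
        self._field = None    # field whose text we are currently inside

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "attendee":
            self._current = {}
        elif tag == "span" and css_class in FIELDS and self._current is not None:
            self._field = FIELDS[css_class]

    def handle_data(self, data):
        if self._field is not None:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "div" and self._current is not None:
            self.records.append(self._current)
            self._current = None


parser = AttendeeParser()
parser.feed(SAMPLE_HTML)
parser.close()
for record in parser.records:
    print(record)
```

Each record ends up with the Username, Name, Title and Company fields, mirroring the four properties defined in the toolbar.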

Page 9: Sourcing7 September Workshop - Data Extraction: Data is King

• We have 4 fields picked for extraction: Username, Name, Title and Company.

• If we look at the page now, we notice that we have selected 50 records, but there are 9 pages of records in total. We are leaving a lot of information behind if we create the API and extract the data now.

• We need to do one more step...

Page 10: Sourcing7 September Workshop - Data Extraction: Data is King


ONE TOOL TO RULE THEM ALL

• Pagination, simply put, is the ability to scrape multiple pages of data. Doing this by hand would take a lot of time; Kimono has a simple way to extract ALL of the leads.

• If we look at the DrupalCon Austin website, we see that we have only selected the first page of 50, but as I mentioned there are 9 pages!

Page 11: Sourcing7 September Workshop - Data Extraction: Data is King

• On the Kimono toolbar, click the button that looks like an open book. This is the Pagination feature.

• Once that is selected, click the Next option at the bottom of the page. This will tell Kimono to select all the following pages on the DrupalCon Austin website.
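
The idea behind following a Next link can be sketched in a few lines of Python. This is a simplified illustration, not Kimono's implementation: the "site" is an in-memory dictionary with made-up page keys and rows, so no network access is involved.

```python
# Hypothetical site: each page holds some rows and the key of the next page
# (None on the last page, where there is no Next link).
PAGES = {
    "page1": {"rows": ["jdoe", "rroe"], "next": "page2"},
    "page2": {"rows": ["asmith"], "next": "page3"},
    "page3": {"rows": ["bjones"], "next": None},
}


def crawl(start: str, max_pages: int) -> list[str]:
    """Follow 'next' links from start, collecting rows from at most max_pages pages."""
    rows = []
    current = start
    visited = 0
    while current is not None and visited < max_pages:
        page = PAGES[current]
        rows.extend(page["rows"])  # keep this page's records
        visited += 1
        current = page["next"]     # move on, or stop when there is no Next
    return rows


all_rows = crawl("page1", max_pages=9)
first_only = crawl("page1", max_pages=1)
print(len(all_rows), len(first_only))
```

Stopping after the first page (`max_pages=1`) corresponds to extracting without pagination, which leaves the rows on the remaining pages behind.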

Page 12: Sourcing7 September Workshop - Data Extraction: Data is King

• We are ready to click the Done button on the Kimono toolbar.

• This will open a screen where we can name our API; we will call it SourceCon Test.

• We can add tags to describe the API.

• We can choose how we get the data (on demand, etc.).

• We can set how many pages to scrape.

• Once you have that completed, click Create API and click the link to visit your new API.

Page 13: Sourcing7 September Workshop - Data Extraction: Data is King

• We aren't done yet. We need to complete our pagination by clicking the Start Crawl Now button.

• This tells our new API to crawl and extract the data from all 9 pages!

• It will show you what percentage has been completed, how many URLs have been crawled, the rows returned and the time elapsed.
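
The crawl statistics listed above (percentage completed, URLs crawled, rows returned, time elapsed) could be tracked with a small structure like the one below. This is only a sketch; the class and attribute names are hypothetical, and the figures of 9 URLs and 50 rows per page are taken from the walkthrough for illustration.

```python
import time
from dataclasses import dataclass, field


@dataclass
class CrawlProgress:
    """Hypothetical progress tracker for a crawl over a known number of URLs."""

    total_urls: int  # must be greater than zero
    urls_crawled: int = 0
    rows_returned: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def record_page(self, row_count: int) -> None:
        """Register one finished URL and the rows it produced."""
        self.urls_crawled += 1
        self.rows_returned += row_count

    @property
    def percent_complete(self) -> float:
        return 100.0 * self.urls_crawled / self.total_urls

    @property
    def elapsed_seconds(self) -> float:
        return time.monotonic() - self.started_at


progress = CrawlProgress(total_urls=9)
for _ in range(3):  # pretend three of the nine pages are done
    progress.record_page(row_count=50)

print(f"{progress.percent_complete:.0f}% complete, "
      f"{progress.urls_crawled} URLs crawled, "
      f"{progress.rows_returned} rows returned")
```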

Page 14: Sourcing7 September Workshop - Data Extraction: Data is King

Let's take a look at our results! We were able to extract over 2,000 leads!

Page 15: Sourcing7 September Workshop - Data Extraction: Data is King

You can view the results in JSON, RSS or CSV under the Preview tab. Let's take a look at how the data looks in CSV.
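
Once the data is in CSV form, it is easy to work with in code as well as in a spreadsheet. The sketch below loads CSV text with Python's `csv.DictReader`, counts records per company and filters by title. The column names match the four fields defined earlier; the rows themselves are invented sample data, not actual results.

```python
import csv
import io
from collections import Counter

# Invented sample rows using the four fields from the walkthrough.
SAMPLE_CSV = """Username,Name,Title,Company
jdoe,Jane Doe,Developer,Example Co
rroe,Richard Roe,Site Builder,Example Co
asmith,Alex Smith,Designer,Sample Ltd
"""


def load_records(csv_text: str) -> list[dict[str, str]]:
    """Parse CSV text into one dict per row, keyed by the header names."""
    return list(csv.DictReader(io.StringIO(csv_text)))


records = load_records(SAMPLE_CSV)

# How many leads came from each company.
by_company = Counter(record["Company"] for record in records)

# Names of everyone whose title mentions "developer" (case-insensitive).
developers = [r["Name"] for r in records if "developer" in r["Title"].lower()]

print(by_company.most_common())
print(developers)
```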

Page 16: Sourcing7 September Workshop - Data Extraction: Data is King


Kimono also offers a Chrome Add-On for Google Sheets - https://chrome.google.com/webstore/detail/kimono/gincecdpheaeldbkjinnmloiiomnakee?hl=en


Page 17: Sourcing7 September Workshop - Data Extraction: Data is King

• You can also download the CSV file.

• There is an option under the Use in Code tab where you can add it to your website, etc. They have options in Node, Ruby and more.

• Under the API Detail you can set up email alerts for your API, create a mobile app, embed in your website or set up webhooks.

Page 18: Sourcing7 September Workshop - Data Extraction: Data is King

• Another thing to consider with pagination: if the website doesn’t have a Next button, Kimono has an option where you can select the URL of each page you want to scrape and add them manually to your API.

• Kimono also offers an API search and the ability to clone an API.
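
When a site has no Next button but its page URLs follow a predictable pattern, the list of URLs to add manually can be generated rather than typed. The sketch below assumes a hypothetical base URL and a `page` query parameter; a real site may use a different pattern or start numbering at 0.

```python
from urllib.parse import urlencode

# Hypothetical listing URL; substitute the real one for the site being scraped.
BASE_URL = "https://example.com/attendees"


def page_urls(base_url: str, page_count: int, first_page: int = 1) -> list[str]:
    """Build one URL per page, assuming pages are addressed by a 'page' parameter."""
    return [
        f"{base_url}?{urlencode({'page': number})}"
        for number in range(first_page, first_page + page_count)
    ]


urls = page_urls(BASE_URL, page_count=9)
for url in urls:
    print(url)
```

The printed list can then be pasted into whatever tool accepts a manual list of pages to crawl.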

Page 19: Sourcing7 September Workshop - Data Extraction: Data is King

• Facebook is a password-protected website.

• Kimono works with some public data on Facebook but not for our Facebook Graph Search results.

• Coding, even using the Facebook API information, is unreliable and unstable.

• Other tools like Facebook Sensei can extract UIDs and information.

Page 20: Sourcing7 September Workshop - Data Extraction: Data is King


QUESTIONS