Web Crawler: A Tale of Struggle
TRANSCRIPT
![Page 1: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/1.jpg)
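Web Crawler: A Tale of Struggle
@sangjun
Kodevelopers
# Preparing for the Struggle: What Is a Crawler?
# The Struggle: The Battle of Spear and Shield
# End of the Struggle: The Ending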
![Page 2: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/2.jpg)
Slack: @sangjun
Facebook: richellin7
GitHub: richellin7
Email: [email protected]
Like: ㅅ.ㅜ.ㄹ...
Administrator of Kodevelopers, a community of Korean developers working in Japan
https://www.facebook.com/groups/1726012127643525/?fref=ts
![Page 3: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/3.jpg)
# Preparing for the Struggle: What Is a Crawler?
![Page 4: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/4.jpg)
Crawler
Crawling
Scraping
![Page 5: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/5.jpg)
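What is a crawler?
Crawler = spider = robot = bot
A crawler is a program that periodically fetches documents and images from the web and automatically stores the information it needs in a database.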
![Page 6: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/6.jpg)
What is a crawler? Well-known crawlers
Googlebot (Google)
bingbot (English version) (Microsoft Bing)
Baiduspider (Baidu)
Yetibot (Naver)
![Page 7: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/7.jpg)
What is crawling?
The technique or act by which a crawler fetches HTML or other arbitrary information from a website.
![Page 8: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/8.jpg)
What is scraping?
The technique or act of extracting arbitrary information from the fetched HTML.
![Page 9: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/9.jpg)
To put it simply...
![Page 10: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/10.jpg)
Crawler = crawling + scraping
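To make the equation above concrete, here is a minimal Python sketch using only the standard library. The function names (`crawl`, `scrape`, `crawler`) and the choice of the `<title>` element as the "needed information" are assumptions made for illustration and do not come from the talk.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def crawl(url: str) -> str:
    """Crawling: fetch the raw HTML of a page over HTTP(S)."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TitleScraper(HTMLParser):
    """Scraping: collect the text inside the <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def scrape(html: str) -> str:
    parser = TitleScraper()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


def crawler(url: str) -> str:
    """Crawler = crawling + scraping."""
    return scrape(crawl(url))
```

Keeping the two steps in separate functions means the scraping half can be exercised on a saved HTML string without touching the network.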
![Page 11: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/11.jpg)
What did the crawler set out across the web to find?
![Page 12: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/12.jpg)
One Piece
![Page 13: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/13.jpg)
= the information we need
![Page 14: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/14.jpg)
Off we go!
![Page 15: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/15.jpg)
Wait!
![Page 16: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/16.jpg)
What about the crew?
![Page 17: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/17.jpg)
Protocol (HTTP, HTTPS): curl (libcurl), mechanize, urllib2, httplib2, requests
Headless (no GUI) browser: PhantomJS, HtmlUnit, TrifleJS, Zombie.js, ENVJS, SlimerJS
Full browser (full rendering): Chrome, Safari, Firefox
![Page 18: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/18.jpg)
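Protocol (HTTP, HTTPS) → Headless (no GUI) browser → Full browser (full rendering)
Load: low at the protocol end, high at the full-browser end
Match with the user's screen (rendering): low at the protocol end, high at the full-browser end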
![Page 19: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/19.jpg)
Rendering
Protocol (HTTP, HTTPS) / Headless (no GUI) browser / Full browser (full rendering)
Steps: communication, fetching HTML and cookies, parsing HTML, interpreting JavaScript, drawing
![Page 20: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/20.jpg)
# The Struggle: The Battle of Spear and Shield
![Page 21: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/21.jpg)
Suppose we want to fetch Naver's trending search keywords.
![Page 22: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/22.jpg)
Shield vs. Spear
Spear = crawler (crawling + scraping)
![Page 23: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/23.jpg)
Spear: "I'm taking your information!"
VS
Shield: "Nope, no way."
![Page 24: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/24.jpg)
Round 1 - Header -
![Page 25: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/25.jpg)
Spear: "Crawling begins!" (User-Agent: none)
VS
Shield: "No User-Agent? Everyone like that gets rejected. Bye."
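The shield's first move can be sketched as a simple header check. This is only an illustration: the function name `shield` and the use of status 403 are assumptions, since the talk does not say how the site actually responds.

```python
def shield(headers: dict[str, str]) -> int:
    """Return an HTTP status code: 403 when User-Agent is missing or blank."""
    # Header names are case-insensitive, so normalise them before the lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    if not normalised.get("user-agent", "").strip():
        return 403  # assumed rejection status for this sketch
    return 200
```

A check like this only looks at whether the header is present, which is why the next slide's disguise works against it.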
![Page 26: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/26.jpg)
Spear: "How about User-Agent: Chrome? Ha ha."
VS
Shield: "Oh? So you're wearing a disguise. Then it's cookies!"
![Page 27: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/27.jpg)
Spear: "Cookie: buid:bjU6/wo..."
VS
Shield: "Whoa, even a cookie? Freeze. Dealing from the bottom of the deck, are you?"
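One way the spear might answer the cookie check is to replay whatever the server handed out earlier. The sketch below uses Python's standard `http.cookies` module; the helper name `cookie_header_from` and the cookie names in the usage note are hypothetical and unrelated to the truncated value on the slide.

```python
from http.cookies import SimpleCookie


def cookie_header_from(set_cookie_values: list[str]) -> str:
    """Turn received Set-Cookie values into a Cookie request header value."""
    jar = SimpleCookie()
    for value in set_cookie_values:
        jar.load(value)  # parses "name=value; Path=/; ..." into morsels
    # Only name=value pairs go back to the server; attributes such as Path do not.
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())
```

For example, `cookie_header_from(["session=abc123; Path=/; HttpOnly", "lang=ko; Path=/"])` produces a value that can be sent as the `Cookie` header on the next request.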
![Page 28: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/28.jpg)
![Page 29: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/29.jpg)
VS
![Page 30: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/30.jpg)
The header battle has begun.
Referer
Host
Accept
Accept-Encoding
Accept-Language
…
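As a rough sketch of the spear's side of this header battle, the request below carries a browser-like header set built with the standard `urllib.request.Request`. Every header value, the `example.com` URLs, and the helper name `build_request` are placeholders assumed for illustration.

```python
from urllib.request import Request

# Placeholder values that imitate what a desktop browser typically sends.
BROWSER_LIKE_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "ko-KR,ko;q=0.9,en;q=0.8",
    # urllib does not decompress responses, so the caller must handle gzip itself.
    "Accept-Encoding": "gzip, deflate",
    "Referer": "https://example.com/",
}


def build_request(url: str, cookie: str = "") -> Request:
    """Build a request with browser-like headers and an optional Cookie."""
    headers = dict(BROWSER_LIKE_HEADERS)
    if cookie:
        headers["Cookie"] = cookie
    # The Host header is filled in by urllib from the URL when the request is sent.
    return Request(url, headers=headers)
```

Building the request does not send it, so the header set can be inspected before any traffic reaches the site.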
![Page 31: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/31.jpg)
Round 1: the spear wins
![Page 32: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/32.jpg)
Round 2 - JavaScript -
![Page 33: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/33.jpg)
Spear: "Analyzing and interpreting the JavaScript... (…this is hard)"
VS
Shield: "Manipulate the DOM with JavaScript"
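Why DOM manipulation hurts a protocol-level crawler can be shown with a small sketch: the list items exist only inside a script, so a parser that never runs JavaScript finds nothing. The page content and the class name `ListItemScraper` are made up for this example.

```python
from html.parser import HTMLParser

# Hypothetical page: the ranking list is empty until the script runs in a browser.
PAGE = """
<html><body>
<ul id="ranking"></ul>
<script>
  document.getElementById("ranking").innerHTML =
      "<li>keyword A</li><li>keyword B</li>";
</script>
</body></html>
"""


class ListItemScraper(HTMLParser):
    """Collect the text of every <li> element present in the static HTML."""

    def __init__(self) -> None:
        super().__init__()
        self._in_li = False
        self.items: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_li = False

    def handle_data(self, data):
        if self._in_li:
            self.items[-1] += data


def scrape_items(html: str) -> list[str]:
    parser = ListItemScraper()
    parser.feed(html)
    parser.close()
    return parser.items
```

Calling `scrape_items(PAGE)` yields an empty list, because the parser treats the script body as plain text rather than markup. The same function works on a page where the items are present in the static HTML.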
![Page 34: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/34.jpg)
As the web evolved...
![Page 35: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/35.jpg)
The Warring States era of JavaScript
![Page 36: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/36.jpg)
Spear: "....(beau.. beau.. beauti..)" OTL
VS
Shield: "I'll make you suffer even more... You're the M and I'm the S."
(JavaScript + obfuscation + AJAX + token)
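The token part of that combination might look like the following sketch, in which the page embeds a signed, short-lived token and the AJAX endpoint refuses requests that lack a valid one. This is an assumed scheme built on Python's standard `hmac` module, not a description of any real site; `SECRET_KEY`, `issue_token`, and `verify_token` are hypothetical names.

```python
import hashlib
import hmac

SECRET_KEY = b"hypothetical-server-secret"  # assumption: known only to the server


def issue_token(session_id: str, issued_at: int) -> str:
    """Sign the session id and issue time; the page would embed this token."""
    message = f"{session_id}:{issued_at}".encode()
    signature = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()
    return f"{issued_at}:{signature}"


def verify_token(session_id: str, token: str, now: int, max_age: int = 300) -> bool:
    """Check the token an AJAX request sends back to the server."""
    try:
        issued_at_text, signature = token.split(":", 1)
        issued_at = int(issued_at_text)
    except ValueError:
        return False  # malformed token
    if not 0 <= now - issued_at <= max_age:
        return False  # expired, or issued in the future
    expected = issue_token(session_id, issued_at).split(":", 1)[1]
    return hmac.compare_digest(signature.encode(), expected.encode())
```

A crawler that requests the AJAX endpoint directly, without first loading the page and running the script that reads the token, fails this check.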
![Page 37: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/37.jpg)
![Page 38: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/38.jpg)
Spear: "I've had more than enough, so cut it out. I'll just take the rendered result... (headless)"
VS
Shield: "Uh... uh... uh... … IP block"
![Page 39: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/39.jpg)
Spear: IP bypass
VS
Shield: request rate limiting
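Request limiting on the shield's side could be sketched as a per-IP sliding window. The class name `RequestLimiter` and its parameters are assumptions for illustration; the talk does not describe a concrete algorithm.

```python
from collections import defaultdict, deque


class RequestLimiter:
    """Allow at most max_requests per IP within a sliding time window."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits: defaultdict[str, deque] = defaultdict(deque)

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: reject or block
        hits.append(now)
        return True
```

Because the counter is keyed by IP address, a crawler that spreads its requests over several addresses gets a fresh budget on each one, which is the bypass the spear relies on here.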
![Page 40: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/40.jpg)
Round 2: the spear wins
(giving up the flesh to take the bone)
![Page 41: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/41.jpg)
Round 3: Telling humans from computers
![Page 42: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/42.jpg)
Spear: "…??"
VS
Shield: "I'll give up the bone and take the flesh."
![Page 43: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/43.jpg)
![Page 44: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/44.jpg)
Spear: "GG"
VS
Shield: "I'll give up the bone and take the flesh."
![Page 45: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/45.jpg)
Round 3: the shield wins
(giving up the flesh to take the bone)
![Page 46: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/46.jpg)
# End of the Struggle: The Ending
![Page 47: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/47.jpg)
The final winner is the shield
![Page 48: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/48.jpg)
However!?
![Page 49: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/49.jpg)
Someday AI will break through today's shield.
![Page 50: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/50.jpg)
How do we stop the fight?
OPEN API
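With an open API the spear no longer needs to scrape rendered pages at all. The sketch below parses a made-up JSON payload; the response shape, the field names, and the helper `top_keywords` are all hypothetical and do not describe any real service.

```python
import json

# Hypothetical response body from an imaginary trending-keywords endpoint.
SAMPLE_RESPONSE = (
    '{"keywords": [{"rank": 2, "text": "keyword B"},'
    ' {"rank": 1, "text": "keyword A"}]}'
)


def top_keywords(payload: str) -> list[str]:
    """Return keyword texts ordered by rank from a JSON API response."""
    data = json.loads(payload)
    ranked = sorted(data["keywords"], key=lambda item: item["rank"])
    return [item["text"] for item in ranked]
```

Structured data replaces header disguises, JavaScript analysis, and HTML parsing with a single `json.loads` call.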
![Page 51: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/51.jpg)
And if you don't want to!?
![Page 52: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/52.jpg)
The fight goes on.
To be Continued
![Page 53: Web Crawler 고군분투기](https://reader035.vdocuments.mx/reader035/viewer/2022062218/58aad9181a28ab27178b5053/html5/thumbnails/53.jpg)
Thank you.