WebScalding: A Framework for Big Data Web Services
Ferosh Jacob, Aaron Johnson, Faizan Javed, Meng Zhao, and Matt McNair
WebScalding overview
What? Design, develop, and deploy Big Data solutions without delving into the accidental complexities of MapReduce or web services descriptions
    Scalding wrapper
    Written in Scala
    Highly customized for CareerBuilder (CB) R&D projects
Why? Write once, run anywhere (WORA)
    Production and test environments
    Sequential and parallel environments
    Parallel and parallel environments
Legacy solution (Python scripts)
    Lack of abstraction
    Local and Big Data modes
    Web services orchestration
How does Cascading solve WORA (write once, run anywhere)?
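The write once, run anywhere idea can be sketched in a few lines. Python is used here purely for illustration (WebScalding itself is written in Scala), and the names `PIPELINE`, `run_local`, and `run_parallel` are hypothetical, not part of Cascading, Scalding, or WebScalding. The pipeline is defined once as a list of plain functions; only the runner decides whether it executes sequentially or in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def tokenize(line):
    """Split a line of text into lowercase tokens."""
    return line.lower().split()


def count_tokens(tokens):
    """Reduce a token list to its length."""
    return len(tokens)


# The pipeline is written once, independent of where it runs.
PIPELINE = [tokenize, count_tokens]


def apply_pipeline(record, steps=PIPELINE):
    """Push one record through every step in order."""
    for step in steps:
        record = step(record)
    return record


def run_local(records):
    """Sequential runner, e.g. for a test environment."""
    return [apply_pipeline(r) for r in records]


def run_parallel(records, workers=4):
    """Parallel runner; map() keeps results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(apply_pipeline, records))
```

Both runners return the same result for the same input, which is the property that lets the same job definition move between environments.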
WebScalding overview
WebScalding libraries
Alleppey
    Training and classifying data using several algorithms
    Custom cross validation on any supported models
StringOps
    Matching, sorting, and stemming string algorithms
    Language detection
Web Services
    Library support for calling many internal web services
    Authentication/authorization
    Processing web service requests (e.g., GET/POST)
    Handling various error scenarios while processing web services
DBpedia API wrappers
    Make DBpedia access thread-safe
    Local and Hadoop execution of Wikipedia
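Two of the concerns listed above, handling error scenarios in web service calls and making a shared client thread-safe, can be sketched as follows. This is an illustration in Python under stated assumptions, not the WebScalding API: `ServiceError`, `call_with_retries`, and `ThreadSafeClient` are hypothetical names, and the "service" is any callable supplied by the caller.

```python
import threading
import time


class ServiceError(Exception):
    """Raised by a (hypothetical) web service call that failed."""


def call_with_retries(request, attempts=3, delay=0.0):
    """Call request() up to `attempts` times, re-raising the last failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return request()
        except ServiceError as err:
            last_error = err
            time.sleep(delay)  # fixed pause between attempts
    raise last_error


class ThreadSafeClient:
    """Serialize access to a lookup function that is not thread-safe."""

    def __init__(self, lookup):
        self._lookup = lookup
        self._lock = threading.Lock()

    def get(self, key):
        # Only one thread at a time may enter the wrapped call.
        with self._lock:
            return self._lookup(key)
```

A single lock is the simplest way to get thread safety; it trades throughput for correctness, so a pool of independent clients may be preferable when the wrapped call is slow.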
Case study 1: DSFSG Project
DSFSG Project: Tasks for CB Data Science team
Web service execution
    1. Carotene (CB)
    2. Autocoder (External)
Data
    22M resumes and 2.5M job postings, a total of 101 GB
Data cleaning
    Remove email addresses and phone numbers
    Remove names of the resume posters
Special processing for resumes
    Extract the job title and the job description
    Limit to the last 3 jobs
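The cleaning and resume-specific steps above might look like the following Python sketch. It is not the project's code: the regular expressions are deliberately simple (the phone pattern assumes North American style numbers), the poster's name is assumed to be known in advance and passed in, and the job list is assumed to be ordered most recent first. `clean_text` and `last_jobs` are hypothetical names.

```python
import re

# Simplified patterns; real resumes need more robust matching.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"(?:\+?\d{1,2}[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")


def clean_text(text, names=()):
    """Remove email addresses, phone numbers, and the given names."""
    text = EMAIL.sub(" ", text)
    text = PHONE.sub(" ", text)
    for name in names:
        text = re.sub(re.escape(name), " ", text, flags=re.IGNORECASE)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


def last_jobs(jobs, limit=3):
    """Keep the most recent jobs, assuming `jobs` is newest first."""
    return jobs[:limit]
```

For example, `clean_text("Jane Doe can be reached at jane.doe@example.com or 555-123-4567", names=["Jane Doe"])` returns `"can be reached at or"`.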
DSFSG Project: Defining a webservice using WebScalding
Case study 1: DSFSG Project: Defining a Webservice WebScalding execution job
DSFSG Project: Execution time Comparison
[Figure: execution time in seconds (logarithmic scale, 1E+01 to 1E+06) against size of data (lines of input to process, 0 to 2,000,000) for WebScalding and Python. Data labels of the two series: 62.86, 595.54, 4657.37, 14784.67, 32019, 203046.33 and 40.74, 45.07, 75.82, 131.11, 246.74, 726.22.]
DSFSG Project Speedup: WebScalding Hadoop v/s Python Sequential
[Figure: speedup (0 to 300) against size of data (lines of input to process, 0 to 2,000,000).]
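The speedup can be recomputed from the data labels on the execution time comparison chart. The sketch below assumes that the larger series belongs to the sequential Python run and the smaller one to WebScalding on Hadoop, and that the labels pair up in order. The extracted text does not state either of these explicitly, although the 0 to 300 range of the speedup axis is consistent with them.

```python
# Execution times in seconds, copied from the chart's data labels.
# Assumption: the larger series is the sequential Python run.
python_secs = [62.86, 595.54, 4657.37, 14784.67, 32019.0, 203046.33]
webscalding_secs = [40.74, 45.07, 75.82, 131.11, 246.74, 726.22]


def speedups(baseline, improved):
    """Speedup = baseline time / improved time, point by point."""
    return [b / i for b, i in zip(baseline, improved)]


for value in speedups(python_secs, webscalding_secs):
    print(f"{value:.1f}x")
```

Under these assumptions the speedup grows with input size, from about 1.5x on the smallest input to about 280x on the largest.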
Case study 2: Accessing HIVE tables (German skills project)
Goal: Identify all the German resumes and job postings in our database
Legacy approach: Decide whether the data (resume or job posting) is German based on where it was uploaded from
Challenge: Users can upload their data in English from a foreign-language CB website and vice versa
German skills project: workflow overview
1. Extract job posting or resume from DB
2. Use a language detector
3. Filter based on the detected language
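The workflow above can be sketched in Python as follows. The detector here is a toy stopword counter standing in for a real language detection library, and the record layout (`hostsite`, `text`) is an assumption made for the example. `detect_language` and `filter_by_language` are hypothetical names.

```python
# Tiny stopword lists; a real detector would use a trained model.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "with", "for", "is"},
    "de": {"der", "die", "das", "und", "mit", "für", "ist"},
}


def detect_language(text):
    """Return the language whose stopwords occur most often in text."""
    tokens = text.lower().split()
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def filter_by_language(records, lang="de"):
    """Keep records whose text, not host site, is in the given language."""
    return [r for r in records if detect_language(r["text"]) == lang]
```

Filtering on the detected language rather than on the host site addresses the challenge of English documents uploaded through a German site and German documents uploaded elsewhere.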
German skills project: Defining HIVE tables using WebScalding
German skills project: Hostsite and lang-detect comparison

Lang    Lang-detect    Hostsite
en      46,204,969     42,364,522
de      2,741,516      2,802,857
pl      1,666,247      1,702,878
fr      524,902        84,665
it      274,305        305,636

Lang    Lang-detect    Hostsite
en      87,048,731     112,764,535
fr      955,712        504,499
ro      384,316        15,617
es      301,511        448,943
it      289,129        520,161

Table labels on the slide (in extraction order): Resume, Job postings
Case study 3: Reading large XML files (Wiki search)
Goal: Read a large XML dataset (e.g., Wikipedia)
Approach: Split the large XML file into nodes of interest, so each node can be processed separately
Challenge: How to split the large XML based on the node of interest
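On a single machine, the approach of splitting a large XML file by node of interest can be sketched with Python's streaming parser `xml.etree.ElementTree.iterparse`. This is not how WebScalding splits input on Hadoop. It only illustrates the idea of emitting one self-contained record per node without loading the whole file. The `page`, `title`, and `text` element names are assumptions for the example; real Wikipedia dumps also carry an XML namespace, so tag names there are namespace-qualified.

```python
import io
import xml.etree.ElementTree as ET


def iter_nodes(source, tag="page"):
    """Yield one dict per <tag> element, streaming through the file."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            # Copy out the child fields before releasing the node.
            yield {child.tag: (child.text or "") for child in elem}
            elem.clear()  # drop the node's children to limit memory use


SAMPLE = (
    b"<wiki>"
    b"<page><title>A</title><text>alpha</text></page>"
    b"<page><title>B</title><text>beta</text></page>"
    b"</wiki>"
)

pages = list(iter_nodes(io.BytesIO(SAMPLE)))
```

Each yielded record is independent of the others, so the records can be handed to separate workers for processing.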
Wiki search: Summary of Wiki templates and Wiki Categories
Wiki Category                                Count
Living people                                666,299
Archived files for deletion discussions      89,118
Year of birth missing (living people)        51,482
Pending DYK nominations                      37,794
English-language films                       28,835
American films                               20,558
Year of birth unknown                        19,610
Year of birth missing                        18,546
The Football League players                  17,808
Main Belt asteroids                          17,323

Wiki Template    Count
flagicon         3,251,438
reflist          3,020,151
convert          1,356,797
persondata       1,149,443
coord            731,265
fb r             707,343
sortname         578,434
sort             541,871
flag             508,315
sfn              422,362
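Counts like the ones above could be produced by scanning each page's wikitext for template and category markup. The Python sketch below is an assumption about the method, not the project's code: the regular expressions cover only the basic `{{name|...}}` and `[[Category:Name]]` forms, and `count_markup` is a hypothetical name.

```python
import re
from collections import Counter

# Basic wikitext forms only: {{name|args}} and [[Category:Name|sortkey]].
TEMPLATE = re.compile(r"\{\{([^|{}]+)")
CATEGORY = re.compile(r"\[\[Category:([^\]|]+)")


def count_markup(pages):
    """Count template and category occurrences across page texts."""
    templates, categories = Counter(), Counter()
    for text in pages:
        templates.update(n.strip().lower() for n in TEMPLATE.findall(text))
        categories.update(n.strip() for n in CATEGORY.findall(text))
    return templates, categories
```

Calling `most_common(10)` on either counter gives a top-ten summary in the same shape as the two tables.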
Conclusion
WebScalding was developed to enable data scientists and analysts to design, develop, and deploy Big Data solutions without delving into the accidental complexities of MapReduce or web services descriptions.
At CareerBuilder, WebScalding has successfully delivered high-performing Big Data solutions involving web services in a very time-efficient manner.
Questions?