Ingest System
Steve McDonald and the ADS Team
ADS Users Group Meeting, 20-21 Nov. 2019
Ingest System Overview
● 8 Individual Pipelines
  ○ Bibliographic, Non-bibliographic
  ○ Affiliation, Citation Capture, Fulltext, myADS, ORCID
  ○ Master
● Each Sends Processed Data To Master Pipeline
  ○ Master merges and sends to persistent stores
    ■ Solr, SQL Database, API Endpoint
● Tech Stack
  ○ Python, Cron, Docker, RabbitMQ, Protobufs
● Mostly Reuses Classic Files
  ○ Needs to change!
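
The flow on this slide, where each pipeline sends processed data to the Master pipeline and Master merges it before writing to the persistent stores, might be sketched roughly as follows. This is an illustration only: the function `merge_updates`, the field names, the `_sources` key and the bibcodes are all hypothetical, and plain Python dicts stand in for the Protobuf messages and RabbitMQ queues named in the tech stack.

```python
from collections import defaultdict


def merge_updates(updates):
    """Merge per-pipeline updates into one record per bibcode.

    `updates` is an iterable of (pipeline_name, bibcode, fields) tuples.
    A later update overwrites an earlier value for the same field.
    All names here are assumptions made for illustration.
    """
    merged = defaultdict(dict)
    for pipeline, bibcode, fields in updates:
        merged[bibcode].update(fields)
        # Keep track of which pipelines contributed to this record.
        merged[bibcode].setdefault("_sources", []).append(pipeline)
    return dict(merged)


# Made-up updates from three of the pipelines listed above.
updates = [
    ("bibliographic", "2019Test..1....1A", {"title": "An example title"}),
    ("fulltext", "2019Test..1....1A", {"body": "Example body text"}),
    ("orcid", "2019Test..1....2B", {"orcid_claims": ["0000-0000-0000-0000"]}),
]
records = merge_updates(updates)
```

In this sketch, `records` holds one merged entry per bibcode, which is the unit a Master-like step would then forward to each store.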
Ingest Operations Overview
● 635,989 More Bibcodes
● 8,676,103 More Citations
● Persistent Vigilance
  ○ Immediate Action By Curators
  ○ Daily and Monthly Reports
● Number Of Failed Records In Daily Ingest
  ○ None: 38%
  ○ 1 to 5 Records: 42%
  ○ These Are Conservative Estimates!
  ○ Publisher And Non-Curation Issues
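
The failed-record percentages above read as shares of daily ingest runs falling into each bucket. A minimal sketch of how such shares could be computed from per-day failure counts is shown here. The bucket called "more than 5", the function names and the sample counts are assumptions for illustration; the slide does not say how the figures were derived.

```python
from collections import Counter


def bucket(failed):
    """Map a day's failed-record count to a reporting bucket."""
    if failed == 0:
        return "none"
    if failed <= 5:
        return "1 to 5"
    return "more than 5"  # bucket not named on the slide; assumed here


def failure_shares(daily_failed_counts):
    """Return the percentage of days that fall into each bucket."""
    counts = Counter(bucket(n) for n in daily_failed_counts)
    total = len(daily_failed_counts)
    return {name: round(100 * c / total) for name, c in counts.items()}


# Made-up failure counts for ten daily ingest runs.
daily = [0, 0, 3, 1, 12, 0, 5, 2, 7, 0]
shares = failure_shares(daily)
```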
Affiliation Pipeline
● Affiliations Data Debuted @ 2019 Winter AAS
● Pipeline In Daily Operation Since ~ Spring 2019
  ○ New records have aff_ids assigned daily
  ○ Existing records are updated as the dictionary of aff/aff_id is updated (~monthly)
● Machine-Learning Based Curation
  ○ Curation of a learning model is ongoing (slow, mainly hand-work)
  ○ Not yet in automated production, but can be used to assist curation
● Statistics: See CSG Presentation (Thursday) On Affiliations
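
The daily assignment of aff_ids from a dictionary of aff/aff_id pairs could look something like the sketch below. It assumes a simple exact match on a normalized affiliation string; the normalization rules, the function names and the ID `X0001` are hypothetical and are not taken from the ADS pipeline.

```python
import re


def normalize(aff):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    aff = re.sub(r"[^\w\s]", " ", aff.lower())
    return " ".join(aff.split())


def assign_aff_ids(affiliations, aff_dict):
    """Return one aff_id per affiliation string, or None when unmatched."""
    return [aff_dict.get(normalize(a)) for a in affiliations]


# Hypothetical curated dictionary; the ID is made up.
aff_dict = {
    normalize("Example University, Department of Astronomy"): "X0001",
}

ids = assign_aff_ids(
    ["Example University, Department of Astronomy.", "Unknown Institute"],
    aff_dict,
)
```

Strings that come back as `None` would be candidates for curation, and re-running the same lookup after a dictionary update is one way existing records could pick up newly curated matches.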
ORCID Pipeline
● Restructure To Improve Performance
  ○ Reduced Pipeline’s API Search Calls
    ■ Requests from the ORCID pipeline were ~80% of search traffic
    ■ Reduced to 1% of search traffic
● Changes To Support ORCID Microservice/UI Restructure
  ○ Functionality migrated from UI to ORCID microservice
Fulltext Pipeline Overview
● Over 5 Million Articles With Fulltext
  ○ Exponential growth
Fulltext Pipeline Updates
● Upgraded To Python 3 Compatible External Libraries
● More Robust XML Parsing
● Added Performance Monitoring
Fulltext Pipeline Updates
● PDF Extraction Analysis
  ○ Validated Current Technique
Ingest System Future
● Reimplement Non-bibliographic Pipeline
  ○ Non-metadata (reads, citations, metrics)
  ○ Changes 0.4M to 2M records daily
  ○ Investigations underway
  ○ Behind schedule
● Create New Ingest System Plan
  ○ Remove dependency on classic pipeline
    ■ Large effort
  ○ Consider framework like Kafka
● Implement Plan
  ○ Long running project
Data Timeliness
● Server Room Issues
  ○ Cooling system failed twice
  ○ Network upgrade went poorly once
● Nonbib Failed With Bad Import Once Every 2 Months
  ○ Delays data reaching prod servers by ~10 hours
● Changes To Data Links Field Not Automatically Detected
● End To End System Performance
● Misc
  ○ Replication failures, full disks
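
For the point about changes to the data links field going undetected, one common way to spot such changes is to compare a fingerprint of the relevant fields between two runs. The sketch below shows that general technique only; it is not a description of what the ADS pipelines do or plan to do, and the field name `data_links`, the function names and the sample records are assumptions.

```python
import hashlib
import json


def fingerprint(record, fields):
    """Hash the selected fields of a record in a stable key order."""
    payload = json.dumps({f: record.get(f) for f in fields}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def changed_bibcodes(previous, current, fields):
    """Return bibcodes that are new or whose selected fields differ."""
    changed = []
    for bibcode, record in current.items():
        old = previous.get(bibcode)
        if old is None or fingerprint(old, fields) != fingerprint(record, fields):
            changed.append(bibcode)
    return sorted(changed)


# Made-up snapshots of two consecutive runs.
previous = {
    "2019Test..1....1A": {"data_links": ["PDF"], "reads": 10},
    "2019Test..1....2B": {"data_links": ["PDF"], "reads": 3},
}
current = {
    "2019Test..1....1A": {"data_links": ["PDF", "DATA"], "reads": 10},
    "2019Test..1....2B": {"data_links": ["PDF"], "reads": 4},
    "2019Test..1....3C": {"data_links": [], "reads": 0},
}
to_reindex = changed_bibcodes(previous, current, ["data_links"])
```

Only the fields passed in are compared, so in this example the change in `reads` for the second record does not mark it as changed.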
Data Quality
● Generally Good
● We Avoid Repeated Bugs
  ○ Root Cause Analysis
● Any Outstanding Issues
  ○ ?
Ingest System 2019 Highlights
● New Affiliation Pipeline
● Enhancements To ORCID Pipeline
● Improvements To Fulltext Pipeline
● Bug Fixes To Improve Data Quality
Fulltext Pipeline
File Formats
Fulltext Pipeline
Extraction Analysis