![Page 1: Preventing Data Errors with Continuous Testing](https://reader033.vdocuments.mx/reader033/viewer/2022041701/62536df2ed19f9590704b9d9/html5/thumbnails/1.jpg)
Preventing Data Errors with Continuous Testing
Kıvanç Muşlu Yuriy Brun Alexandra Meliou
University of Washington University of Massachusetts
![Page 2: Preventing Data Errors with Continuous Testing](https://reader033.vdocuments.mx/reader033/viewer/2022041701/62536df2ed19f9590704b9d9/html5/thumbnails/2.jpg)
![Page 3: Preventing Data Errors with Continuous Testing](https://reader033.vdocuments.mx/reader033/viewer/2022041701/62536df2ed19f9590704b9d9/html5/thumbnails/3.jpg)
Data discrepancies
[Figure: discrepant reported times: 6:15 PM, 6:15 PM, 6:22 PM, 9:40 PM, 8:33 PM, 9:54 PM]
slide credit: Luna Dong and Divesh Srivastava
![Page 4: Preventing Data Errors with Continuous Testing](https://reader033.vdocuments.mx/reader033/viewer/2022041701/62536df2ed19f9590704b9d9/html5/thumbnails/4.jpg)
“Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?”
Charles Babbage, from Passages from the Life of a Philosopher
![Page 5: Preventing Data Errors with Continuous Testing](https://reader033.vdocuments.mx/reader033/viewer/2022041701/62536df2ed19f9590704b9d9/html5/thumbnails/5.jpg)
Dealing with software errors
• program analysis
• language features
• code reviews
• formal verification
… and testing
![Page 6: Preventing Data Errors with Continuous Testing](https://reader033.vdocuments.mx/reader033/viewer/2022041701/62536df2ed19f9590704b9d9/html5/thumbnails/6.jpg)
Key idea: Testing for data-intensive systems
• Identifying system failures caused by well-formed but incorrect data
• Using application-specific execution information
• Integrating into the system usage workflow
Bringing testing to the data domain
![Page 7: Preventing Data Errors with Continuous Testing](https://reader033.vdocuments.mx/reader033/viewer/2022041701/62536df2ed19f9590704b9d9/html5/thumbnails/7.jpg)
How is data testing different?
• Data semantics
• Test query generation
• Timeliness
• Unobtrusive and precise interface
System administrators and users are not software engineers
![Page 8: Preventing Data Errors with Continuous Testing](https://reader033.vdocuments.mx/reader033/viewer/2022041701/62536df2ed19f9590704b9d9/html5/thumbnails/8.jpg)
Data semantics
• Valid data can span multiple orders of magnitude
– thwarts statistical outlier detection
• Semantics of nearly identical data differ vastly:
617 616
339
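A minimal sketch of the first point, assuming hypothetical car prices that are not from the talk. Because valid prices span orders of magnitude, a plain z-score filter (one simple form of statistical outlier detection, with an arbitrary threshold chosen here) flags a valid expensive car and misses a well-formed but wrong price.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return the values whose population z-score exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical inventory prices; all of them are valid.
valid_prices = [900, 4_500, 12_000, 14_000, 38_000, 250_000]

# Same inventory after a $14,000 car was wrongly discounted to $9,800.
# The wrong value is well-formed and sits inside the valid price range.
corrupted_prices = [900, 4_500, 12_000, 9_800, 38_000, 250_000]

flagged = zscore_outliers(corrupted_prices)
print("flagged as outliers:", flagged)
print("wrong price detected:", 9_800 in flagged)
```

In this sketch the detector reports only the legitimate $250,000 car, which is the failure mode the slide describes.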
![Page 9: Preventing Data Errors with Continuous Testing](https://reader033.vdocuments.mx/reader033/viewer/2022041701/62536df2ed19f9590704b9d9/html5/thumbnails/9.jpg)
Test query generation
• Administrators and users have not (yet) bought into testing
• Manually written tests will come
– developers can ship these with application
• But automatic generation is needed now
– mine queries from code
– record historical queries
– adapt related work on database-use test generation
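One way to picture "mine queries from code" is a scan of application source for SQL string literals. The sketch below is an assumption for illustration only: the regular expression, the `mine_select_queries` helper, and the sample source are hypothetical, and the slides do not say how the prototype mines queries.

```python
import re

# Matches a quoted string literal whose content starts with SELECT.
SELECT_LITERAL = re.compile(r"""(["'])(\s*SELECT\b.*?)\1""", re.IGNORECASE | re.DOTALL)

def mine_select_queries(source_code):
    """Return the distinct SELECT statements found in string literals, in order."""
    mined = []
    for match in SELECT_LITERAL.finditer(source_code):
        query = " ".join(match.group(2).split())  # normalize whitespace
        if query not in mined:
            mined.append(query)
    return mined

# Hypothetical application source.
APP_SOURCE = '''
cur.execute("SELECT make, price FROM cars WHERE price < 15000")
cur.execute("UPDATE cars SET price = price * 0.7")
cur.execute("SELECT COUNT(*) FROM cars")
cur.execute("SELECT make, price FROM cars WHERE price < 15000")
'''

mined = mine_select_queries(APP_SOURCE)
for query in mined:
    print(query)
```

Each mined query could then be paired with its current result to form a candidate test. A real miner would also need to handle queries built by string concatenation, which this sketch ignores.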
![Page 10: Preventing Data Errors with Continuous Testing](https://reader033.vdocuments.mx/reader033/viewer/2022041701/62536df2ed19f9590704b9d9/html5/thumbnails/10.jpg)
Timeliness
• Data testing is a runtime activity
• Administrators and users don’t troubleshoot unless something goes wrong
• Learning about an error sooner means error has less impact
![Page 11: Preventing Data Errors with Continuous Testing](https://reader033.vdocuments.mx/reader033/viewer/2022041701/62536df2ed19f9590704b9d9/html5/thumbnails/11.jpg)
Unobtrusive and precise user interface
• Administrators and users may not understand test outcomes
• Tests must integrate into workflow
• Results must link to
– actions that caused failure, or
– data values relevant to failure
![Page 12: Preventing Data Errors with Continuous Testing](https://reader033.vdocuments.mx/reader033/viewer/2022041701/62536df2ed19f9590704b9d9/html5/thumbnails/12.jpg)
Example scenario: car dealership
• Manager wants to put cars between $10K and $15K on a 30%-off sale
![Page 13: Preventing Data Errors with Continuous Testing](https://reader033.vdocuments.mx/reader033/viewer/2022041701/62536df2ed19f9590704b9d9/html5/thumbnails/13.jpg)
Car dealership challenges
• Data semantics
– reducing all prices by 30% seems like a valid change
• Test query generation
– manager doesn’t know about writing tests
• Timeliness
– manager doesn’t know to run tests
– reporting delay causes financial losses
• Unobtrusive and precise interface
– problem cannot be reported as “test failed”
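The scenario can be sketched with a data test expressed as a query plus its expected result. Everything concrete here is an assumption: the `cars` schema, the sample rows, the test query, and the use of an in-memory SQLite database as a stand-in for a production database.

```python
import sqlite3

def make_db():
    """Create a hypothetical dealership inventory."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cars (id INTEGER PRIMARY KEY, make TEXT, price INTEGER)")
    conn.executemany(
        "INSERT INTO cars VALUES (?, ?, ?)",
        [(1, "A", 8000), (2, "B", 12000), (3, "C", 14000), (4, "D", 30000)],
    )
    return conn

def run_test(conn, query, expected):
    """Run a test query and return (passed, actual_result)."""
    actual = conn.execute(query).fetchall()
    return actual == expected, actual

# Hypothetical test: the sale must not move the cheapest or priciest car.
TEST_QUERY = "SELECT MIN(price), MAX(price) FROM cars"
EXPECTED = [(8000, 30000)]

# Intended update: 30% off only for cars between $10K and $15K.
intended = make_db()
intended.execute("UPDATE cars SET price = price * 7 / 10 WHERE price BETWEEN 10000 AND 15000")
intended_ok, _ = run_test(intended, TEST_QUERY, EXPECTED)

# Mistaken update: the WHERE clause is missing, so every price drops.
mistaken = make_db()
mistaken.execute("UPDATE cars SET price = price * 7 / 10")
mistaken_ok, actual = run_test(mistaken, TEST_QUERY, EXPECTED)

if not mistaken_ok:
    # Report the data involved instead of a bare "test failed".
    print(f"Price range changed from {EXPECTED[0]} to {actual[0]}")
```

Both updates are well-formed, so only the test distinguishes them, and the failure message points at the affected values rather than at the test itself.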
![Page 14: Preventing Data Errors with Continuous Testing](https://reader033.vdocuments.mx/reader033/viewer/2022041701/62536df2ed19f9590704b9d9/html5/thumbnails/14.jpg)
Continuous Data Testing (CDT)
• Data semantics and test query generation
– mines source code for queries
– allows manually written and history-mined tests
• Timeliness
– run tests continuously
– optimizations to trigger proper tests at proper times
• Unobtrusive and precise interface
– delegates problem to system designer
– highlights data involved in tests
![Page 15: Preventing Data Errors with Continuous Testing](https://reader033.vdocuments.mx/reader033/viewer/2022041701/62536df2ed19f9590704b9d9/html5/thumbnails/15.jpg)
Making CDT possible (optimizations)
• NaïveCDT: run tests continuously
• SimpleCDT: only run tests after updates
• SmartCDT: only relevant tests after relevant changes
• SmartCDT_TC: test compression
• SmartCDT_IT: incremental test query execution
• SmartCDT_TC+IT: all of the above
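The SimpleCDT and SmartCDT ideas can be sketched as a filter that decides which tests to run after a statement. The relevance rule below (a test is relevant if it reads the table the statement writes) and the regex-based table extraction are simplifying assumptions, not the prototype's actual analysis; all names and queries are hypothetical.

```python
import re

TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_]\w*)", re.IGNORECASE)
WRITE_TARGET = re.compile(
    r"^\s*(?:UPDATE|INSERT\s+INTO|DELETE\s+FROM)\s+([A-Za-z_]\w*)", re.IGNORECASE
)

def tables_read(query):
    """Tables a test query reads, found naively after FROM and JOIN."""
    return {name.lower() for name in TABLE_REF.findall(query)}

def table_written(statement):
    """Table a statement modifies, or None if it is not an update."""
    match = WRITE_TARGET.match(statement)
    return match.group(1).lower() if match else None

def relevant_tests(tests, statement):
    """Names of the tests worth re-running after the given statement."""
    target = table_written(statement)
    if target is None:
        return []  # SimpleCDT-style rule: nothing changed, run nothing
    # SmartCDT-style rule: keep only tests that read the changed table
    return [name for name, query in tests.items() if target in tables_read(query)]

# Hypothetical test suite.
TESTS = {
    "price_range": "SELECT MIN(price), MAX(price) FROM cars",
    "staff_count": "SELECT COUNT(*) FROM employees",
    "sold_cars": "SELECT COUNT(*) FROM sales JOIN cars ON sales.car_id = cars.id",
}

print(relevant_tests(TESTS, "UPDATE cars SET price = price * 0.7"))
print(relevant_tests(TESTS, "SELECT * FROM cars"))
```

Test compression and incremental test query execution are not covered by this sketch.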
![Page 16: Preventing Data Errors with Continuous Testing](https://reader033.vdocuments.mx/reader033/viewer/2022041701/62536df2ed19f9590704b9d9/html5/thumbnails/16.jpg)
CDT effectiveness
How effective is CDT at preventing data entry errors?
Do false positives reduce CDT’s effectiveness?
CDT’s and integrity constraints’ effect on data entry speed
![Page 17: Preventing Data Errors with Continuous Testing](https://reader033.vdocuments.mx/reader033/viewer/2022041701/62536df2ed19f9590704b9d9/html5/thumbnails/17.jpg)
Data came from real-world spreadsheets from data.gov and the EUSES corpus; tests from formulae in the spreadsheets.
96 distinct users were asked to copy numeric values into the corresponding cells
Crowd study: Amazon Mechanical Turk
Group 1: control
No highlighting. Submitting errors allowed.
Group 2: CDT
Data involved in failing tests highlighted. Submitting errors OK.
Group 3: CDT with false positives
Data involved in failing test and 40% extra data highlighted. Submitting errors OK.
Group 4: integrity constraints
Data involved in failing integrity constraints highlighted. No submitting errors.
![Page 18: Preventing Data Errors with Continuous Testing](https://reader033.vdocuments.mx/reader033/viewer/2022041701/62536df2ed19f9590704b9d9/html5/thumbnails/18.jpg)
[Bar chart, scale 0 to 100: corrected errors (%) and time to correct (sec) for the control, integrity constraints, CDT, and CDT_FP groups]
CDT, even with false positives, successfully prevented data errors
CDT much faster than integrity constraints
![Page 19: Preventing Data Errors with Continuous Testing](https://reader033.vdocuments.mx/reader033/viewer/2022041701/62536df2ed19f9590704b9d9/html5/thumbnails/19.jpg)
CDT efficiency
[Chart: overhead (%), axis 0 to 18, at update frequencies of 0.1/min, 0.2/min, 0.33/min, 1/min, and 2/min, for NaïveCDT, SimpleCDT, SmartCDT, SmartCDT_TC, SmartCDT_IT, and SmartCDT_TC+IT]
more frequent updates = more overhead
Even with frequent updates, overhead is manageable
CDT’s effect on performance of performance-intensive applications
![Page 20: Preventing Data Errors with Continuous Testing](https://reader033.vdocuments.mx/reader033/viewer/2022041701/62536df2ed19f9590704b9d9/html5/thumbnails/20.jpg)
Related work
• Generating tests for systems that use databases
Chays, Shahid, and Frankl. Query-based test generation for database applications. DBTest 2008.
Khalek and Khurshid. Systematic testing of database engines using a relational constraint solver. ICST 2011.
Li and Csallner. Dynamic symbolic database application testing. DBTest 2010.
Pan, Wu, and Xie. Database state generation via dynamic symbolic execution for coverage criteria. DBTest 2011.
Pan, Wu, and Xie. Generating program inputs for database application testing. ASE 2011.
![Page 21: Preventing Data Errors with Continuous Testing](https://reader033.vdocuments.mx/reader033/viewer/2022041701/62536df2ed19f9590704b9d9/html5/thumbnails/21.jpg)
Related work
• Generating tests for systems that use databases
• Continuous testing
Saff and Ernst. Reducing wasted development time via continuous testing. ISSRE 2003.
Saff and Ernst. An experimental evaluation of continuous testing during development. ISSTA 2004.
![Page 22: Preventing Data Errors with Continuous Testing](https://reader033.vdocuments.mx/reader033/viewer/2022041701/62536df2ed19f9590704b9d9/html5/thumbnails/22.jpg)
Related work
• Generating tests for systems that use databases
• Continuous testing
• Effects of continuous feedback
Katzan Jr. Batch, conversational, and incremental compilers. The American Federation of Information Processing Societies 1969.
Muslu, Brun, Holmes, Ernst, and Notkin. Speculative analysis of integrated development environment recommendations. OOPSLA 2012.
Boekhoudt. The big bang theory of IDEs. Queue 2003.
![Page 23: Preventing Data Errors with Continuous Testing](https://reader033.vdocuments.mx/reader033/viewer/2022041701/62536df2ed19f9590704b9d9/html5/thumbnails/23.jpg)
Related work
• Generating tests for systems that use databases
• Continuous testing
• Effects of continuous feedback
• Errors in spreadsheets
Badame and Dig. Refactoring meets spreadsheet formulas. ICSM 2012.
Barowy, Gochev, and Berger. CheckCell: Data debugging for spreadsheets. OOPSLA 2014.
Hermans and Dig. BumbleBee: A refactoring environment for spreadsheet formulas. FSE Tool Demo 2014.
![Page 24: Preventing Data Errors with Continuous Testing](https://reader033.vdocuments.mx/reader033/viewer/2022041701/62536df2ed19f9590704b9d9/html5/thumbnails/24.jpg)
Related work
• Generating tests for systems that use databases
• Continuous testing
• Effects of continuous feedback
• Errors in spreadsheets
• Data cleaning
Culotta and McCallum. Joint deduplication of multiple record types in relational data. CIKM 2005.
Domingos. Multi-relational record linkage. In Workshop on Multi-Relational Data Mining 2004.
Dong, Halevy, and Madhavan. Reference reconciliation in complex information spaces. SIGMOD 2005.
Li, Tziviskou, Wang, Dong, Liu, Maurino, and Srivastava. Chronos: Facilitating history discovery by linking temporal records. PVLDB 2012.
Kashyap and Sheth. Semantic and schematic similarities between database objects: A context-based approach. The VLDB Journal 1996.
Hernandez and Stolfo. The merge/purge problem for large databases. SIGMOD 1995.
![Page 25: Preventing Data Errors with Continuous Testing](https://reader033.vdocuments.mx/reader033/viewer/2022041701/62536df2ed19f9590704b9d9/html5/thumbnails/25.jpg)
Related work
• Generating tests for systems that use databases
• Continuous testing
• Effects of continuous feedback
• Errors in spreadsheets
• Data cleaning
• Understanding errors in databases
Meliou, Gatterbauer, Moore, and Suciu. The complexity of causality and responsibility for query answers and non-answers. PVLDB 2010.
Meliou, Roy, and Suciu. Causality and explanations in databases. PVLDB 2014.
Wang, Dong, and Meliou. Data X-Ray: A diagnostic tool for data errors. SIGMOD 2015.
Khoussainova, Balazinska, and Suciu. Towards correcting input data errors probabilistically using integrity constraints. ACM International Workshop on Data Engineering for Wireless and Mobile Access 2006.
![Page 26: Preventing Data Errors with Continuous Testing](https://reader033.vdocuments.mx/reader033/viewer/2022041701/62536df2ed19f9590704b9d9/html5/thumbnails/26.jpg)
Related work
• Generating tests for systems that use databases
• Continuous testing
• Effects of continuous feedback
• Errors in spreadsheets
• Data cleaning
• Understanding errors in databases
• A truckload of automated test generation work
![Page 27: Preventing Data Errors with Continuous Testing](https://reader033.vdocuments.mx/reader033/viewer/2022041701/62536df2ed19f9590704b9d9/html5/thumbnails/27.jpg)
Contributions
• Four challenges of data testing:
1. data semantics
2. test generation
3. timeliness
4. interface
• Continuous Data Testing prototype for PostgreSQL
• Optimizations for which tests to run when
• CDT efficient and effective at preventing errors, even when low-quality tests result in false positives
https://bitbucket.org/ameli/contest