Text Editing
LIANZA ITSIG webinar series. Tools, tips, tricks. Kim Shepherd [email protected], Digital Development Team, The University of Auckland Library.
Text Editing
Digital Development Team, The University of Auckland Library
Tools, tips, tricks
LIANZA ITSIG webinar series
Summary
• General (large) text files
– We manage and manipulate text data daily
– It’s tedious and time consuming
– Find & Replace is too limited and dangerous
– We know there must be a better way...
• Tabular data files (e.g. spreadsheets)
– We work with these all the time, usually in Excel
– What tools can help us clean messy data?
Topics
• Regular Expressions
• Text Editors
• Operating on lines, not entire files
• Google Refine
Regular Expressions
/^\s+[a-zA-Z0-9](?:\W+)/
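As a rough illustration (not from the original slides), the pattern above can be tried out in Python. The pattern is copied from the slide with the surrounding `/` delimiters dropped; the helper name and the sample strings are made up for this sketch.

```python
import re

# Pattern from the slide, without the /.../ delimiters.
#   ^            start of the string
#   \s+          one or more whitespace characters
#   [a-zA-Z0-9]  exactly one ASCII letter or digit
#   (?:\W+)      one or more non-word characters (non-capturing group)
SLIDE_PATTERN = re.compile(r"^\s+[a-zA-Z0-9](?:\W+)")


def matches_slide_pattern(text):
    """Return True if text begins with the shape described above."""
    return SLIDE_PATTERN.search(text) is not None


# Hypothetical sample strings, chosen only for illustration.
samples = ["  a, b, c", "  1. First item", "no leading space", "  ab, c"]
for sample in samples:
    print(repr(sample), matches_slide_pattern(sample))
```

The last sample fails because two word characters follow the leading whitespace, while the pattern allows only one before the run of non-word characters.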
Regular Expressions
• A way to describe a set of strings and capture parts of them
• Popularised by early UNIX tools such as ed, grep and sed
• Now built into most text editors and programming languages
• Test your regexes out on the web:
– http://gskinner.com/RegExr/
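To make "describe a set of strings and capture parts of them" concrete, here is a minimal Python sketch. The record layout ("Surname, Given (YYYY)"), the group names and the sample values are all assumptions made for this example, not something from the talk.

```python
import re

# Hypothetical record layout: "Surname, Given (YYYY)".
# Named groups capture the parts we want to pull out.
RECORD = re.compile(
    r"^(?P<surname>[^,]+),\s*(?P<given>[^(]+?)\s*\((?P<year>\d{4})\)$"
)


def parse_record(line):
    """Return the captured parts as a dict, or None if the line does not match."""
    match = RECORD.match(line)
    return match.groupdict() if match else None


print(parse_record("Smith, Jane (1999)"))
print(parse_record("not a record"))
```

The pattern describes every string of that shape, and the groups hand back the surname, given name and year separately, which plain Find & Replace cannot do.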
Text Editors & Useful Languages
sed, grep, awk
Text Editors
• Word processors aren’t text editors
• Shop around, compare features
• My favourite: Vim (UNIX, Windows, Mac)
– Wikipedia comparison of editor features
– Wikipedia list of regex software
Useful Languages / Interpreters
• Perl
– An old favourite, great for string manipulation
• Python
– The cool kids tell me it’s better than Perl
• GREL
– We’ll get to this later...
Line-by-line processing
while (my $line = <STDIN>) {
    # ... modify $line here, then output it
    print $line;
}
Line-by-line processing
• Large files are large!
– If they’re big on disk, they’ll be big in memory
• Lines are (usually!) small
– Read a line
– Do something with it
– Output the modified line
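The read / modify / output loop above can be sketched in Python as follows. This is only an illustration: the function names and the whitespace-collapsing transform are hypothetical, and in-memory `io.StringIO` objects stand in for a large input file and an output file.

```python
import io


def process_lines(source, sink, transform):
    """Read one line at a time, transform it, and write it out straight away."""
    count = 0
    for line in source:  # file-like objects yield one line per iteration
        sink.write(transform(line.rstrip("\n")) + "\n")
        count += 1
    return count


def collapse_spaces(line):
    """Hypothetical transform: squeeze runs of whitespace into single spaces."""
    return " ".join(line.split())


# In-memory stand-ins for a large input file and an output file.
source = io.StringIO("a   b\n  c\td  \n")
sink = io.StringIO()
n = process_lines(source, sink, collapse_spaces)
print(n, repr(sink.getvalue()))
```

Because only the current line is held in memory, the same function should work unchanged with `sys.stdin` and `sys.stdout`, or with files opened via `open()`, however large the input is.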
Google Refine
• Cleans messy tabular data
– Easy faceting and filtering of columns/values
– Easy transformation of values
• Google Refine Expression Language (GREL)
– Extensive use of regular expressions and other standard string manipulation techniques
• Other features
– Perform web service calls directly, reconcile row IDs
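For readers who have not seen faceting, the idea can be approximated in a few lines of plain Python. This is not GREL and not how Google Refine works internally; the function names, the clean-up rule and the sample rows are all invented for this sketch.

```python
from collections import Counter


def facet(rows, column):
    """Count how often each distinct value occurs in a column, like a text facet."""
    return Counter(row[column] for row in rows)


def transform_column(rows, column, func):
    """Apply func to every value in one column, returning new rows."""
    return [{**row, column: func(row[column])} for row in rows]


def tidy(value):
    """Hypothetical clean-up: trim, collapse inner whitespace, normalise case."""
    return " ".join(value.split()).title()


# Made-up rows standing in for a messy spreadsheet column.
rows = [
    {"id": 1, "city": "Auckland"},
    {"id": 2, "city": " auckland "},
    {"id": 3, "city": "AUCKLAND"},
    {"id": 4, "city": "Wellington"},
]

print(facet(rows, "city"))  # inconsistent spellings show up as separate values
cleaned = transform_column(rows, "city", tidy)
print(facet(cleaned, "city"))
```

The facet counts make the inconsistent spellings visible, and one transformation applied to the whole column merges them.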
Conclusion
• Our problems are solvable!
– Regular expressions
– Decent text editors for general/unformatted text
– Google Refine for tabular data
• Contact me
– Please feel free to contact me with questions, corrections or ideas
– [email protected]
– Twitter: @kimshepherd
– Google+: [email protected]