smart_open at Data Science London meetup
smart_open
Streaming large files with a simple Pythonic API to and from S3, HDFS, WebHDFS, even zip and local files
Lev Konstantinovskiy
What?
smart_open is
a Python 2 and 3 library
for efficient streaming of very large files
with a simple Pythonic API
in 600 lines of code.
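A minimal end-to-end sketch of the idea (the bucket name and file paths here are placeholders, not from the slides): stream a large local file into S3 without ever holding more than one line in RAM.

>>> import smart_open
>>> # copy a large local file to S3 line by line; only one line is in memory at a time
>>> with smart_open.smart_open('./big_corpus.txt') as fin:
...     with smart_open.smart_open('s3://mybucket/big_corpus.txt', 'wb') as fout:
...         for line in fin:
...             fout.write(line)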
Easily switch just the path when data are moved, for example from a laptop to S3.
smart_open.smart_open('./foo.txt')
smart_open.smart_open('./foo.txt.gz')
smart_open.smart_open('s3://mybucket/mykey.txt')
smart_open.smart_open('hdfs://user/hadoop/my_file.txt')
smart_open.smart_open('webhdfs://host:port/user/hadoop/my_file.txt')
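Whichever backend the data live on, the consuming code stays the same. A small sketch (the URIs are placeholders) running one loop over two different stores:

>>> # the same line-by-line loop works for any supported scheme; only the URI changes
>>> for uri in ('./foo.txt', 's3://mybucket/mykey.txt'):
...     for line in smart_open.smart_open(uri):
...         print line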
Who?
Open source under the MIT License. Maintained by RaRe Technologies, headed by Radim Rehurek, aka piskvorky.
Why?
- Originally part of gensim, an out-of-core open-source text processing library (word2vec, LDA, etc.). smart_open is used there for streaming large text corpora.
Why? Boto is not Pythonistic :(
- Study 15 pages of the boto book before using S3.

Solution: smart_open is Pythonised boto.
What is “Pythonistic”?
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
PEP 20, The Zen of Python
Write more than 5GB to S3: multipart-ing in Boto

>>> mp = b.initiate_multipart_upload(os.path.basename(source_path))
>>> # Use a chunk size of 50 MiB
>>> chunk_size = 52428800
>>> chunk_count = int(math.ceil(source_size / float(chunk_size)))
>>> # Send the file parts, using FileChunkIO to create a file-like object
>>> # that points to a certain byte range within the original file. We
>>> # set bytes to never exceed the original file size.
>>> for i in range(chunk_count):
...     offset = chunk_size * i
...     bytes = min(chunk_size, source_size - offset)
...     with FileChunkIO(source_path, 'r', offset=offset, bytes=bytes) as fp:
...         mp.upload_part_from_file(fp, part_num=i + 1)
>>> # Finish the upload
>>> mp.complete_upload()

# Note that if you forget to call either mp.complete_upload() or mp.cancel_upload(), you will be left with an incomplete upload and charged for the storage consumed by the uploaded parts. A call to bucket.get_all_multipart_uploads() can help to show lost multipart upload parts.
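A minimal cleanup sketch for those lost parts (assuming `b` is the bucket object from above; boto 2's MultiPartUpload objects provide cancel_upload()):

>>> # abort any leftover multipart uploads so you stop paying for orphaned parts
>>> for mp in b.get_all_multipart_uploads():
...     mp.cancel_upload()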
Write more than 5GB to S3: multipart-ing in smart_open
>>> # stream content *into* S3 (write mode, multiparting behind the scenes):
>>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
...     for line in ['first line', 'second line', 'third line']:
...         fout.write(line + '\n')
From S3 to memory
Boto:
>>> c = boto.connect_s3()
>>> b = c.get_bucket('mybucket')
>>> k = Key(b)
>>> k.key = 'foobar'
>>> # Create StringIO in RAM
>>> k.get_contents_as_string()
Traceback (most recent call last):
MemoryError
>>> # Workaround for the MemoryError: write to local disk first. Needs a large local disk!
smart_open:
>>> # can use context managers:
>>> with smart_open.smart_open('s3://mybucket/mykey.txt') as fin:
...     for line in fin:
...         print line
...     # bonus:
...     fin.seek(0)  # seek to the beginning
...     print fin.read(1000)  # read 1000 bytes
From large iterator to S3
Boto:
>>> c = boto.connect_s3()
>>> b = c.get_bucket('mybucket')
>>> k = Key(b)
>>> k.key = 'foobar'
>>> k.set_contents_from_string(list(my_iterator))
Traceback (most recent call last):
MemoryError
>>> # Workaround: write to local disk first. Needs a large local disk!
smart_open:
>>> # stream content *into* S3 (write mode):
>>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
...     for line in ['first line', 'second line', 'third line']:
...         fout.write(line + '\n')
>>> # Streamed input is uploaded in chunks, as soon as `min_part_size` bytes are accumulated
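The part size can be tuned with the min_part_size keyword mentioned above; a hedged sketch (check your smart_open version's signature for this keyword, and my_iterator is the placeholder from the Boto example):

>>> # assumption: min_part_size is forwarded to the S3 writer in your smart_open version
>>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb', min_part_size=50 * 1024**2) as fout:
...     for piece in my_iterator:
...         fout.write(piece)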
Un/Zipping line by line
>>> # stream from/to local compressed files:
>>> for line in smart_open.smart_open('./foo.txt.gz'):
...     print line

>>> with smart_open.smart_open('/home/radim/foo.txt.bz2', 'wb') as fout:
...     fout.write("some content\n")
Summary of Why?
Working with large S3 files using Amazon's default Python library, boto, is a pain:
- Limited by RAM. Its key.set_contents_from_string() and key.get_contents_as_string() methods only work for small files (loaded into RAM, no streaming).
- There are nasty hidden gotchas when using boto's multipart upload functionality, and a lot of boilerplate.

smart_open shields you from that. It builds on boto but offers a cleaner API. The result is less code for you to write and fewer bugs to make.
- Bonus: a gzip ContextManager even on Python 2.5 and 2.6, where gzip.GzipFile alone cannot be used in a with statement.
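To make that last point concrete, a sketch of the boilerplate smart_open spares you on old Pythons (paths are placeholders):

>>> # Python 2.5/2.6: gzip.GzipFile is not a context manager, so you need contextlib.closing
>>> import contextlib, gzip
>>> with contextlib.closing(gzip.GzipFile('./foo.txt.gz')) as fin:
...     for line in fin:
...         print line

>>> # with smart_open you get the context manager directly:
>>> with smart_open.smart_open('./foo.txt.gz') as fin:
...     for line in fin:
...         print line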
Streaming out-of-core read and write for:
- S3
- HDFS
- WebHDFS (no need to use the requests library!)
- local files
- local compressed files
smart_open is not just for S3!
Thanks!
Lev Konstantinovskiy
github.com/tmylk
@teagermylk