Big Data with rubygems.org Download Data Aja Hammerly

Upload: others

Post on 18-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Big Data with rubygems.org Download Data · bucket = storage.bucket "goruco2016-bg-files" bucket.create_file csv_path, "dependencies.csv" @thagomizer_rb Import. @thagomizer_rb What

Aja Hammerly

@thagomizer_rb

http://www.thagomizer.com

http://github.com/thagomizer

Lawyer Cat Says: Any code is copyright Google and licensed Apache V2

Big Data

DATA

Big Data

Storage is Cheap

Intimidating

OMG Statistics

Machine Learning

Exploratory

Rubygems Download Data

Overview

rubygems

Column Name   Type
id            integer
name          varchar
created_at    datetime
updated_at    datetime
slug          varchar

126,007

gem_downloads

Column Name   Type
id            integer
rubygem_id    integer
version_id    integer
count         bigint

883,848

dependencies

Column Name       Type
id                integer
requirements      varchar
rubygem_id        integer
version_id        integer
scope             varchar
created_at        datetime
updated_at        datetime
unresolved_name   varchar

3,638,968

linksets

Column Name   Type
id            integer
rubygem_id    integer
home          varchar
wiki          varchar
docs          varchar
mail          varchar
code          varchar
bugs          varchar
created_at    datetime
updated_at    datetime

125,932

versions

Column Name   Type       Column Name                 Type
id            integer    authors                     text
rubygem_id    integer    description                 text
size          integer    summary                     text
position      integer    requirements                text
number        varchar    platform                    varchar
indexed       boolean    full_name                   varchar
prerelease    boolean    licenses                    varchar
latest        boolean    required_ruby_version       varchar
yanked_at     datetime   required_rubygems_version   varchar
built_at      datetime   info_checksum               varchar
updated_at    datetime   metadata                    hstore
created_at    datetime   sha256                      varchar

757,920

Asking Questions

Domain Knowledge

Hypothesis

Examples

The gem with the most downloads is rails.

Minitest is more popular than RSpec.

Gems released in the last year require Ruby > 2.0.

Rails 3 is still more popular than Rails 4.

Fewer gems are released during summer.

Largish Data

BigQuery

What

Why

How

I ❤ BigQuery

SQL

Fast

Scales

Complex Enough

Demo

Vocabulary

Dataset

Table

Import

Streaming

gcloud

pg

require 'pg'
require 'gcloud'

ENV["GOOGLE_CLOUD_PROJECT"] = "rubygems-bigquery"
ENV["GOOGLE_CLOUD_KEYFILE"] = "#{key_path}"

gcloud = Gcloud.new
bigquery = gcloud.bigquery
bq_database = bigquery.dataset "rubygems"

postgres = PG.connect dbname: "rubygems"

bq_table ||= bq_database.create_table("gems") do |s|
  s.integer "id"
  s.string "name"
  s.timestamp "created_at"
  s.timestamp "updated_at"
end

columns = %w[id name created_at updated_at]

postgres.exec("SELECT * FROM rubygems") do |pg_table|
  pg_table.each do |row|
    hashed_row = Hash[columns.zip(row.values)]
    bq_table.insert(hashed_row)
  end
end

Zip & Hash[]

[key1, key2, key3, key4]

[val1, val2, val3, val4]

zip

[[key1, val1], [key2, val2], [key3, val3], [key4, val4]]

Hash::[]

Hash[[[key1, val1], [key2, val2], [key3, val3], [key4, val4]]]

{ key1 => val1, key2 => val2, key3 => val3, key4 => val4 }

Hash[keys.zip(values)]
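
A minimal runnable sketch of the zip and Hash[] idiom from the slides above. The column names come from the talk; the sample values are made up and stand in for what `row.values` would return from the pg gem, and `row_to_hash` is a hypothetical helper name.

```ruby
# Pairs each column name with the value at the same position,
# then turns the list of pairs into a hash.
def row_to_hash(columns, values)
  Hash[columns.zip(values)]
end

# Column names in the order used by the import script.
columns = %w[id name created_at updated_at]

# Made-up values standing in for one database row.
values = ["1", "rake", "2009-07-25 17:46:22", "2016-06-01 00:00:00"]

hashed_row = row_to_hash(columns, values)
puts hashed_row.inspect
```

`columns.zip(values).to_h` should give the same result on Ruby 2.1 and later. If `values` is shorter than `columns`, `zip` pads the missing values with `nil`; extra values are dropped.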

Batch

Formats

CSV

JSON

Avro

CSV

require 'pg'
require 'csv'
require 'gcloud'

postgres = PG.connect dbname: "rubygems"

cols = %w[id requirements created_at updated_at rubygem_id version_id scope]

query = "SELECT #{cols.join(',')} FROM dependencies"

CSV.open(csv_path, "wb") do |csv|
  postgres.exec(query) do |pg_table|
    pg_table.each do |row|
      csv << row.values
    end
  end
end
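
The same export loop can be tried without a database. This is a sketch, assuming in-memory arrays in place of the pg result rows; the sample rows and the helper name `write_rows` are hypothetical. It mainly shows that `CSV` quotes values containing commas, which requirement strings such as `~> 1.2, >= 1.2.3` do.

```ruby
require "csv"
require "tempfile"

# Writes each row (an array of values) as one CSV line; returns the row count.
def write_rows(csv_path, rows)
  CSV.open(csv_path, "wb") do |csv|
    rows.each { |row| csv << row }
  end
  rows.size
end

# Made-up rows in the column order used above:
# id, requirements, created_at, updated_at, rubygem_id, version_id, scope
sample_rows = [
  ["1", ">= 0", "2016-01-01 00:00:00", "2016-01-01 00:00:00", "10", "100", "runtime"],
  ["2", "~> 1.2, >= 1.2.3", "2016-01-02 00:00:00", "2016-01-02 00:00:00", "11", "101", "development"]
]

# Write to a temporary file and print what ended up on disk.
Tempfile.create(["dependencies", ".csv"]) do |file|
  write_rows(file.path, sample_rows)
  puts File.read(file.path)
end
```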

storage = Gcloud.new.storage
bucket = storage.bucket "goruco2016-bg-files"

bucket.create_file csv_path, "dependencies.csv"

Import

What Now?

rubygems

Simple

Rails has the most downloads.

Which gem has the most downloads?

SELECT name, count
FROM [rubygems.downloads]
JOIN rubygems.gems
  ON rubygems.gems.id = rubygems.downloads.rubygem_id
ORDER BY count DESC
LIMIT 5

name         count
rake         107,076,261
rack         100,955,906
multi_json   100,171,080
json         95,715,131
bundler      93,085,862

SELECT name, sum(count) as total
FROM [rubygems.downloads]
JOIN rubygems.gems
  ON rubygems.gems.id = rubygems.downloads.rubygem_id
GROUP BY name
ORDER BY total DESC
LIMIT 5

name         total
rake         214,152,212
rack         201,911,759
multi_json   200,342,260
json         191,430,173
bundler      186,172,479
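
The difference between the two queries is the grouping: the downloads table carries a version_id column, so a gem can own more than one row, and the first query ranks single rows rather than gems. A sketch of the `GROUP BY name` and `sum(count)` step in plain Ruby, with made-up rows and a hypothetical helper name `top_totals`:

```ruby
# Sums counts per gem name and returns the top `limit` [name, total] pairs,
# largest total first.
def top_totals(rows, limit = 5)
  totals = Hash.new(0)
  rows.each { |name, count| totals[name] += count }
  totals.sort_by { |_name, total| -total }.first(limit)
end

# Made-up [name, count] rows, several per gem.
rows = [["rake", 70], ["rack", 60], ["rake", 40], ["json", 90], ["rack", 55]]

top_totals(rows, 2).each { |name, total| puts "#{name} #{total}" }
```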

How many downloads does Rails have?

SELECT name, sum(count) as total
FROM [rubygems.downloads]
JOIN rubygems.gems
  ON rubygems.gems.id = rubygems.downloads.rubygem_id
WHERE name = 'rails'
GROUP BY name

name    total
rails   137,635,731

Minitest is more popular than RSpec.

SELECT name, sum(count) as total
FROM [rubygems.downloads]
JOIN rubygems.gems
  ON rubygems.gems.id = rubygems.downloads.rubygem_id
GROUP BY name
HAVING name IN ('minitest', 'rspec')

name       total
minitest   101,151,246
rspec      77,293,803

Gems released in the last year require Ruby > 2.0.

SELECT required_ruby_version, COUNT(*) AS total
FROM rubygems.versions
WHERE created_at > DATE_ADD(CURRENT_TIMESTAMP(), -1, "YEAR")
GROUP BY required_ruby_version
ORDER BY total DESC

required_ruby_version total

>= 0 95,857

>= 1.9.3 9,069

>= 2.0.0 4,624

>= 2.0 1,648

>= 2.1.0 1,432

Complex

Rails 3 has more downloads than the other Rails major versions.

SELECT name,
       REGEXP_EXTRACT(number, r'(\d\.)') AS major,
       SUM(rubygems.downloads.count) AS total
FROM [rubygems.versions]
JOIN rubygems.gems ON rubygems.gems.id = rubygems.versions.rubygem_id
JOIN rubygems.downloads ON rubygems.versions.rubygem_id = rubygems.downloads.rubygem_id
WHERE rubygems.gems.name = 'rails'
GROUP BY name, major
ORDER BY major

REGEXP_EXTRACT(number,r'(\d\.)') as major
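
To see what this pattern pulls out of a version string, here is a small Ruby sketch that applies the same regular expression locally. The helper name `major_of` and the sample version strings are illustrative assumptions; BigQuery evaluates the pattern with its own regex engine, so treat this as an approximation of the behaviour rather than a reproduction of it.

```ruby
# Same pattern as in the query: one digit followed by a literal dot.
MAJOR_PATTERN = /(\d\.)/

# Hypothetical helper: returns the first capture group, or nil if no match.
def major_of(number)
  number[MAJOR_PATTERN, 1]
end

# Illustrative version strings.
%w[0.9.5 2.3.18 3.2.22 4.2.6 5.0.0.beta1].each do |number|
  puts "#{number} -> #{major_of(number).inspect}"
end

# The capture includes the trailing dot, so "4.2.6" yields "4.".
# A two-digit major would be cut short: "10.1.0" yields "0.".
puts major_of("10.1.0").inspect
```

For single-digit majors this groups versions as intended. A pattern anchored at the start of the string, such as `^(\d+)`, would be one way to handle multi-digit majors.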

version downloads

0 2,890,350,351

1 2,064,535,965

2 3,991,436,199

3 16,378,651,989

4 12,662,487,252

5 963,450,117

version downloads (millions)

0 2,890

1 2,064

2 3,991

3 16,378

4 12,662

5 963
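
The shorter table appears to be the full counts truncated to whole millions. A quick Ruby sketch, using the totals shown on the previous slide, reproduces that scaling and checks which major version comes out on top. The variable names are illustrative.

```ruby
# Totals per Rails major version, copied from the slide above.
downloads_by_major = {
  0 => 2_890_350_351,
  1 => 2_064_535_965,
  2 => 3_991_436_199,
  3 => 16_378_651_989,
  4 => 12_662_487_252,
  5 => 963_450_117
}

# Integer division truncates to whole millions.
in_millions = downloads_by_major.transform_values { |n| n / 1_000_000 }

# Find the major version with the largest total.
top_major, top_total = downloads_by_major.max_by { |_major, n| n }

in_millions.each { |major, n| puts "#{major} #{n}" }
puts "most downloaded major version: #{top_major} (#{top_total})"
```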

Gems released in the last year require Ruby > 2.

SELECT required_ruby_version, COUNT(*) AS total
FROM rubygems.versions
WHERE created_at > DATE_ADD(CURRENT_TIMESTAMP(), -1, "YEAR")
GROUP BY required_ruby_version
ORDER BY total DESC

SELECT REGEXP_EXTRACT(required_ruby_version, r'(.*?\d\.?)') AS version,
       COUNT(*) AS total
FROM rubygems.versions
WHERE created_at > DATE_ADD(CURRENT_TIMESTAMP(), -1, "YEAR")
GROUP BY version
ORDER BY total DESC

version total

>= 0 95,851

>= 1 13,080

>= 2 12,944

~> 2 2,040

> 2 49
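
The pattern `(.*?\d\.?)` keeps everything up to and including the first digit, plus a dot if one follows, so requirements such as `>= 2.0.0`, `>= 2.0`, and `>= 2.1.0` collapse into one group. The Ruby sketch below applies the same pattern to the five rows from the earlier ungrouped result. Only those five rows are used, so the sums will not match the full table above, and the labels keep the trailing dot that the pattern captures.

```ruby
# Same pattern as in the query: lazy prefix, first digit, optional dot.
PREFIX_PATTERN = /(.*?\d\.?)/

# Top five rows of the earlier ungrouped result (requirement => count).
rows = {
  ">= 0"     => 95_857,
  ">= 1.9.3" => 9_069,
  ">= 2.0.0" => 4_624,
  ">= 2.0"   => 1_648,
  ">= 2.1.0" => 1_432
}

# Group by the extracted prefix and sum the counts, like GROUP BY version.
grouped = Hash.new(0)
rows.each do |requirement, total|
  grouped[requirement[PREFIX_PATTERN, 1]] += total
end

# Sort descending by total, like ORDER BY total DESC.
grouped.sort_by { |_version, total| -total }.each do |version, total|
  puts "#{version} #{total}"
end
```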

Thank You
