
Creative Computing with Deep Learning

Charles P. Martin, 2017

Deep Learning! All aboard the hype train!

Musical Examples

More Examples

Zelda Attention RNN Output

DeepJazz - 2-layer LSTM

Flow Machines

WaveNet piano

http://www.asimovinstitute.org/analyzing-deep-learning-tools-music/


https://clyp.it/ljezg5kv


https://soundcloud.com/deepjazz-ai/deepjazz-on-metheny-128-epochs


https://soundcloud.com/user-547260463/daddys-car


https://soundcloud.com/its-only-science-discover-magazine/wavenet-artificial-intelligence-playing-piano


Big Questions
1. How do Artificial Neural Networks (ANN) work?
2. How do you use them artistically?
3. Why have ANN systems worked so well for images, and so poorly for music?

Artificial Neural Networks

“Units” - Artificial Neurons

[Figure: diagram of a unit, with the bias b added to its weighted inputs (+ b)]

https://www.tensorflow.org/get_started/mnist/beginners
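A unit can be sketched in a few lines of NumPy. This is only an illustration: the sigmoid activation and the example numbers are assumptions, not something taken from the slide.

```python
import numpy as np

def unit(x, w, b):
    """One artificial neuron: weighted sum of the inputs, plus bias, then a sigmoid."""
    z = np.dot(w, x) + b              # weighted sum of the inputs, plus the bias b
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid (assumed) squashes z into (0, 1)

x = np.array([0.5, -1.0, 2.0])        # three made-up inputs
w = np.array([0.1, 0.4, -0.2])        # one weight per input
b = 0.7
out = unit(x, w, b)
```

A layer is just many of these units reading the same inputs, each with its own weights and bias.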


Artificial Neural Network

Multilayer Perceptrons

Modern “Deep” Networks
● VGG16 - Popular image recognition network design uses 16 layers. (138M parameters)
● Other deep image recognition networks have more than 150 layers.
● Networks start to look like layered multinomial functions.

Simple Example: Handwritten Numbers


How do we choose which number an input represents?





Or just:

As equations:

As vectors:





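import tensorflow as tf

# TensorFlow 1.x API. Input: any number of flattened 28x28 images (784 pixels each).
x = tf.placeholder(tf.float32, [None, 784])
# Parameters: one weight per pixel per digit class, plus one bias per class.
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
# Output: a probability for each of the 10 digits.
y = tf.nn.softmax(tf.matmul(x, W) + b)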

Check out example in Jupyter Notebook



Training
- Define some “Loss” function: L(w, b) = cross_entropy(y, y’)
- Find values for the parameters so that the loss is minimized.
- Tricky with lots of parameters
  - “Simple” example has 7850 params
- Take the partial derivative of L with respect to each parameter, then adjust the parameter a little bit so that we can expect L to be lower.
- Typical technique for finding these partials is “back propagation”, a.k.a. reverse-mode differentiation.
  - http://colah.github.io/posts/2015-08-Backprop/
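The training loop can be sketched by hand in NumPy for the 784-to-10 softmax model. This is a rough illustration, not the notebook code: the random stand-in data, the learning rate, and the helper names are all assumptions, and the gradient is written out directly rather than found by back propagation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)    # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(y_pred, y_true):
    return -np.mean(np.sum(y_true * np.log(y_pred + 1e-12), axis=1))

rng = np.random.default_rng(0)
x = rng.random((32, 784))                      # stand-in batch of 32 "images" (not MNIST)
y_true = np.eye(10)[rng.integers(0, 10, 32)]   # random one-hot labels

W = np.zeros((784, 10))
b = np.zeros(10)
n_params = W.size + b.size                     # 784 * 10 + 10 = 7850

learning_rate = 0.005                          # assumed; small because the fake data is dense
losses = []
for step in range(20):
    y_pred = softmax(x @ W + b)
    losses.append(cross_entropy(y_pred, y_true))
    # For softmax + cross-entropy, dL/dlogits = (y_pred - y_true) / batch_size
    d_logits = (y_pred - y_true) / len(x)
    grad_W = x.T @ d_logits                    # partial derivatives of L w.r.t. W
    grad_b = d_logits.sum(axis=0)              # partial derivatives of L w.r.t. b
    W -= learning_rate * grad_W                # nudge each parameter a little downhill
    b -= learning_rate * grad_b
```

With deeper networks the same nudging happens for every layer, and back propagation is what makes computing all those partials affordable.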

Demo
- See Jupyter Notebooks “MNIST-Tensorflow” and “MNIST-Keras”

Recurrent Neural Networks

Idea: Units receive input and previous output.
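A minimal sketch of that idea, assuming a plain tanh recurrent layer with made-up sizes and random weights (none of these values come from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hidden = 4, 8
W_x = rng.normal(scale=0.1, size=(n_in, n_hidden))      # input -> unit weights
W_h = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # previous output -> unit weights
b = np.zeros(n_hidden)

def rnn_step(x_t, h_prev):
    """New output depends on the current input and the previous output."""
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

sequence = rng.random((5, n_in))   # five time-steps of made-up input
h = np.zeros(n_hidden)             # no previous output at the first step
outputs = []
for x_t in sequence:
    h = rnn_step(x_t, h)
    outputs.append(h)
outputs = np.array(outputs)
```

The same weights are reused at every time-step; only `h` changes as the sequence is read.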

Training
- Training examples consist of sequences of input and output data.
- E.g., training a network on a vocabulary of four letters (h, e, l, o) and the example “hello”
  - X would be “h, e, l, l”
  - y would be “e, l, l, o”
- Each step finds the loss gradient over all examples and through time.
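The “hello” example can be built as one-hot training data like this. The helper name `one_hot` and the vocabulary ordering are illustrative choices.

```python
import numpy as np

vocab = ["h", "e", "l", "o"]
char_to_index = {c: i for i, c in enumerate(vocab)}

def one_hot(text):
    """Encode each character as a row with a single 1 in its vocabulary slot."""
    encoded = np.zeros((len(text), len(vocab)))
    for t, c in enumerate(text):
        encoded[t, char_to_index[c]] = 1.0
    return encoded

example = "hello"
X = one_hot(example[:-1])   # inputs:  h, e, l, l
y = one_hot(example[1:])    # targets: e, l, l, o
```

Each target is simply the input sequence shifted one step into the future.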

Long Short-Term Memory Cells
- Cells can have “memory” of some internal state that is preserved in between time-steps.
- LSTM (Long Short-Term Memory) cells are very popular.
- Another is GRU (Gated Recurrent Unit).
- Not much evidence that any one cell type has a clear advantage.
- Many “casual” deep learners just work with LSTM.

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
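One time-step of an LSTM cell, written out in NumPy as a rough sketch of the standard formulation described in the post linked above. The weight layout (all four gates packed into one matrix `W`), the sizes, and the random values are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One time-step of a standard LSTM cell.

    W has shape (n_in + n_hidden, 4 * n_hidden); b has shape (4 * n_hidden,).
    """
    n_hidden = h_prev.shape[0]
    z = np.concatenate([x_t, h_prev]) @ W + b
    i = sigmoid(z[0 * n_hidden:1 * n_hidden])   # input gate
    f = sigmoid(z[1 * n_hidden:2 * n_hidden])   # forget gate
    o = sigmoid(z[2 * n_hidden:3 * n_hidden])   # output gate
    g = np.tanh(z[3 * n_hidden:4 * n_hidden])   # candidate values
    c = f * c_prev + i * g                      # internal state preserved between steps
    h = o * np.tanh(c)                          # output for this step
    return h, c

rng = np.random.default_rng(2)
n_in, n_hidden = 4, 6
W = rng.normal(scale=0.1, size=(n_in + n_hidden, 4 * n_hidden))
b = np.zeros(4 * n_hidden)

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.random((10, n_in)):              # ten steps of made-up input
    h, c = lstm_step(x_t, h, c, W, b)
```

The “memory” is `c`: when the forget gate is near 1 and the input gate near 0, it passes through a time-step almost unchanged.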




“Unreasonable” Effectiveness
- RNNs seem to work really well for certain tasks.
- Generating natural language at character level is a good example
- Andrej Karpathy had a lot of fun with the char-RNN idea.

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Demo character-level RNN
- Look at charRNN-Keras Jupyter Notebook

Still possible to have kinks:

Generated: be not afraid of greatness: some are born great, stal3n3r333333ng3l33n333s333 3f the3rd3n33n33s333s3333333333333333333333333333n33l33c33333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333

Generated: be not afraid of greatness: some are born great, such so the hands;i have the ready we mad bring some which thee.

sicinio:so the man that his doliester to break their town.

second servinam:the noble her own or say the hands and the prest and down,and i am madam, sir, peterving the bark.

wance:some we hence and not down they the honour to branged,and i grace the provilly frattion, or and strive be firstbut since be disse'eret?

A note on Convolutional Neural Networks
- Mainly used in image recognition
- Convolution layers apply “learnable filters” to segments of image.
- “De-convolution” can undo this process and produce an image.
- Successful results in “deep dream” images and “style transfer”
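A small sketch of what applying a filter to segments of an image means. The filter values here are fixed by hand for illustration; in a CNN they would be learned. Like most deep learning libraries, this slides the filter without flipping it (strictly a cross-correlation).

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a small filter over the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            patch = image[r:r + kh, c:c + kw]   # the image segment under the filter
            out[r, c] = np.sum(patch * kernel)
    return out

image = np.zeros((6, 6))
image[:, 3:] = 1.0                       # left half dark, right half bright

# Hand-made vertical-edge filter (an assumption; a trained layer learns its own).
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])
response = conv2d(image, kernel)
```

The response is large only where the filter sits across the dark-to-bright boundary, which is the sense in which a filter “detects” a feature.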

Folk-RNN

arXiv:1604.08723 [cs.SD]

https://arxiv.org/abs/1604.08723



Data and Model
- Data are folk tune transcriptions in “ABC” format from thesession.org
- Cleaned corpus includes 23636 transcriptions all transposed to C.
- 137 unique tokens (vocabulary elements) covering pitch, structure, duration, meter, etc.
- Data are encoded as one-hot vectors.
- Network architecture: three LSTM layers of 512 units each.
- Paper found that folk-RNN (with 137 elements in vocab) outperforms a char-RNN on the same data despite similar numbers of params.

T: A Cup Of Tea
M: 4/4
L: 1/8
K: Ador
eAAa ~g2fg|eA~A2 BGBd|eA~A2 ~g2fg|1af (3gfe dG~G2:|2af (3gfe d2^cd||
eaag efgf|eaag ed (3Bcd|eaag efgb|af (3gfe d2^cd:|

Cleaned up:

<s> M:4/4 K:Cdor g c c c’ b 2 a b | g c c 2 d B d f | g c c 2 b 2 a b |1 c’ a (3 b a g f B B 2 :| |2 c’ a (3 b a g f 2 =e f | g c’ c’ b g a b a | g c’ c’ b g f (3 d e f | g c’ c’ b g a b d’ | c’ a (3 b a g f 2 =e f :| <\s>
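A sketch of how a cleaned-up transcription turns into one-hot vectors. It uses only the opening of the tune above (with plain apostrophes), and the mini-vocabulary is built from that excerpt alone; the real folk-RNN vocabulary has 137 tokens and its preprocessing may differ.

```python
import numpy as np

# Opening of the cleaned-up transcription; tokens are already separated by spaces.
transcription = "<s> M:4/4 K:Cdor g c c c' b 2 a b | g c c 2 d B d f |"
tokens = transcription.split()

# Hypothetical mini-vocabulary from this excerpt only (the real corpus has 137 tokens).
vocab = sorted(set(tokens))
token_to_index = {tok: i for i, tok in enumerate(vocab)}

indices = np.array([token_to_index[tok] for tok in tokens])
one_hot = np.eye(len(vocab))[indices]     # one row per token, a single 1 per row
```

Note that `K:Cdor` is a single token here, whereas a char-RNN would have to spell it out over six characters.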


Generating “Transcriptions” via LSTM
- FolkRNN produces lead sheets (“Transcriptions”) which are then interpreted by performers.
- Authors interested in evaluating and comparing music generation techniques.
- How useful are they? How much cherry picking needs to be done? What could their role be in assisting human creativity?
- https://highnoongmt.wordpress.com/2017/03/19/benchmarking-music-generation-systems/




http://www.youtube.com/watch?v=omHhyVD3PD8

Applications in composition?
- The Millennial Whoop Jig
- “Given the ubiquity of this short musical phrase called the “Millennial Whoop”, let’s compose some Celtic-style music that “features” it. I am going to use the folk-rnn deep LSTM model that we trained and presented recently. (This is a continuation of my explorations of using deep learning for assisting the process of music composition.)”
- “Let’s start with a jig, which is a 6/8 dance. Using our model trained on over 23,000 folk music transcriptions, I seed with the millennial whoop as the beginning measure and ask for 3 transcriptions in C major:”

https://soundcloud.com/sturmen-1/the-millennial-whoop-jig

https://thepatterning.com/2016/08/20/the-millennial-whoop-a-glorious-obsession-with-the-melodic-alternation-between-the-fifth-and-the-third/



https://github.com/IraKorshunova/folk-rnn

https://csmc2016.wordpress.com/proceedings/

https://highnoongmt.wordpress.com/2015/08/15/deep-learning-for-assisting-the-process-of-music-composition-part-4/





WaveNet

Generating Raw Audio
- Convolutional network. Input is a series of digital audio samples, output is predicted next sample.
- So far, primary use-case is speech synthesis
- Network is also conditioned on phonetic features to define sounds to make.
- DeepMind demonstrate a “musical” example trained on “classical piano” music, which makes piano-like sounds.
- https://deepmind.com/blog/wavenet-generative-model-raw-audio/
- WaveNet: a generative model for raw audio
  - https://arxiv.org/pdf/1609.03499.pdf
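The “predict the next sample from the previous ones” loop can be sketched with a single causal convolution. This is a toy stand-in, not WaveNet itself: the filter weights are made up, there is one layer instead of a deep stack, and the real model outputs a distribution over quantized sample values rather than a single number.

```python
import numpy as np

def causal_conv(samples, weights, dilation=1):
    """Output at time t uses only samples at t, t - dilation, t - 2 * dilation, ..."""
    out = np.zeros(len(samples))
    for t in range(len(samples)):
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:                 # positions before the start count as silence
                out[t] += w * samples[idx]
    return out

def predict_next(samples, weights):
    """Toy stand-in for the network: the last causal output is the next-sample guess."""
    return np.tanh(causal_conv(samples, weights, dilation=2)[-1])

weights = np.array([0.6, 0.3, 0.1])      # made-up filter; a real model learns these
audio = [0.0, 0.5, 1.0, 0.5]             # seed samples
for _ in range(8):                       # generate one sample at a time
    audio.append(predict_next(np.array(audio), weights))
```

Generation is slow for exactly this reason: every new sample needs another pass over the samples before it, and audio has tens of thousands of samples per second.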




Magenta - https://magenta.tensorflow.org


Coordinated MIDI Generation
- Extensive project for generating music with ANNs
- Open source: https://github.com/tensorflow/magenta
- Collection of approaches for generating music (and images).
- RNNs tuned for monophonic melodies, polyphony, melody + chords, drums, etc.
- Written in TensorFlow with abstracted data cleaning, training, generation, and models. Code is tough to read.
- Magenta-Discuss is a good resource! 50% thoughtfulness, 50% noobs.
  - https://groups.google.com/a/tensorflow.org/forum/#!forum/magenta-discuss
- Blog also good
  - https://magenta.tensorflow.org/2016/07/15/lookback-rnn-attention-rnn/

http://www.youtube.com/watch?v=QlVoR1jQrPk

Magenta Demo
- Magenta Attention Model trained on MIDI from all Legend of Zelda soundtracks (up to 2016).
- Generated 2048 notes with no primer.
- Magenta work seems useful, but lacks research goals.
- Could easily be repurposed by researchers / creators - if so, good chance Google would promote the work too!

Zelda Attention RNN Output



Neural Mashups
- Take an image
- Use a captioning network (im2txt) to generate some sort of lyrics.
- Set them to a generated song
- Use speech synth to sing
- Headline: “AI wrote a Christmas carol!”
- E.g., https://vimeo.com/192711856
- Flow Machines “pop songs” are (better developed) examples of this.

Deep Learning Handwriting Paths

LSTM + Mixture Density Network

https://greydanus.github.io/2016/08/21/handwriting/

Fake Kanji Generator

http://blog.otoro.net/2015/12/28/recurrent-net-dreams-up-fake-chinese-characters-in-vector-format-with-tensorflow/

MDN

https://nbviewer.jupyter.org/github/greydanus/adventures/blob/master/mixture_density/mdn.ipynb

Deep Learning Ensemble Interactions

RNN for Ensemble Performances

- 9 possible gestures
- 4 players input, 3 players output
- Input and output encoded as one-hot vectors (i.e., 9^4 input possibilities, 9^3 output possibilities).
- Tried a quartet and duet configuration
- Network used was similar to folkRNN (3 LSTM layers).
- Broken right now! Built for old version of TensorFlow… ;_;
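The 9^4 and 9^3 figures come from treating each combination of gestures as its own class. One way to do that, sketched here as an assumption about the encoding rather than the project's actual code, is to read the players' gestures as digits of a base-9 number:

```python
import numpy as np

N_GESTURES = 9

def encode(gestures):
    """Pack one gesture per player (each 0-8) into a single class index, base 9."""
    index = 0
    for g in gestures:
        index = index * N_GESTURES + g
    return index

def decode(index, n_players):
    """Inverse of encode: recover the per-player gestures."""
    gestures = []
    for _ in range(n_players):
        gestures.append(index % N_GESTURES)
        index //= N_GESTURES
    return gestures[::-1]

n_inputs = N_GESTURES ** 4     # 6561 possible four-player inputs
n_outputs = N_GESTURES ** 3    # 729 possible three-player outputs

x = np.zeros(n_inputs)
x[encode([3, 0, 8, 1])] = 1.0  # one-hot input vector for one moment in time
```

The network then only has to pick one of 729 output classes per step, and `decode` turns that class back into a gesture for each of the three generated players.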

ANN to recreate “tiny touch performances”

Where is this going?

Challenges
- Defining a network is super easy.
- Finding good data is hard, good representation can be very hard!
- Training can take a long time, easier with big GPUs
  - Workstation with gaming GPUs is handy for experiments
  - Big models trained on several GPUs
- Trained models can be evaluated on simpler systems - even browsers and mobile devices.
- Can be hard to figure out goals or use cases even for interesting data.
- Much blog hype on the internet focusses on existing data, and well-defined problems. But getting these two things is much of the challenge.
