Creative Computing with Deep Learning
Charles P. Martin, 2017
Deep Learning! All aboard the hype train!
Musical Examples
More Examples
Zelda Attention RNN Output
DeepJazz - 2-layer LSTM
Flow Machines
WaveNet piano
Big Questions
1. How do Artificial Neural Networks (ANN) work?
2. How do you use them artistically?
3. Why have ANN systems worked so well for images, and so poorly for music?
Artificial Neural Networks
“Units” - Artificial Neurons
https://www.tensorflow.org/get_started/mnist/beginners
[Figure: diagram of a unit with a bias term b (“+ b”)]
Artificial Neural Network
Multilayer Perceptrons
Modern “Deep” Networks
● VGG16 - Popular image recognition network design uses 16 layers. (138M parameters)
● Other deep image rec. networks > 150 layers.
● Networks start to look like layered multinomial functions.
Simple Example: Handwritten Numbers
https://www.tensorflow.org/get_started/mnist/beginners
How do we choose which number an input represents?
Simple Example: Handwritten Numbers
https://www.tensorflow.org/get_started/mnist/beginners
Or just:
As equations:
As vectors:
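The equation images from these slides did not survive in the transcript. Assuming they follow the softmax regression model from the TensorFlow MNIST tutorial linked above, they can be sketched as below. Here x is the 784-pixel input, W and b are the weights and biases, and y is the vector of 10 class probabilities; the name "evidence" comes from that tutorial and is an assumption here.

```latex
\begin{aligned}
\text{evidence}_i &= \sum_{j} W_{i,j}\, x_j + b_i \\
y_i &= \mathrm{softmax}(\text{evidence})_i
     = \frac{\exp(\text{evidence}_i)}{\sum_{k} \exp(\text{evidence}_k)} \\
y &= \mathrm{softmax}(W x + b)
\end{aligned}
```

The last line is the "or just" form: a single matrix-vector product followed by softmax.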
Simple Example: Handwritten Numbers
https://www.tensorflow.org/get_started/mnist/beginners
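import tensorflow as tf  # TensorFlow 1.x API
x = tf.placeholder(tf.float32, [None, 784])  # batch of flattened 28x28 images
W = tf.Variable(tf.zeros([784, 10]))  # weights: one column per digit class
b = tf.Variable(tf.zeros([10]))  # biases: one per digit class
y = tf.nn.softmax(tf.matmul(x, W) + b)  # predicted class probabilities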
Check out example in Jupyter Notebook
Training
- Define some “Loss” function: L(W, b) = cross_entropy(y, y’)
- Find values for the parameters so that the loss is minimized.
- Tricky with lots of parameters
  - “Simple” Example has 7850 params
- Take partial derivative of L with respect to each parameter, then adjust the parameter a little bit so that we can expect L to be lower.
- Typical technique for finding these partials is “back propagation”, a.k.a. reverse-mode differentiation.
- http://colah.github.io/posts/2015-08-Backprop/
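The training recipe above can be sketched in plain Python. This is a minimal illustration, not the notebook code: the toy sizes (4 inputs, 3 classes), the learning rate, and all function names are assumptions, and the gradient is written out by hand for softmax plus cross-entropy rather than found by back propagation.

```python
import math

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(W, b, x):
    # z_k = sum_j x_j * W[j][k] + b[k], the same layout as tf.matmul(x, W) + b.
    z = [sum(x[j] * W[j][k] for j in range(len(x))) + b[k]
         for k in range(len(b))]
    return softmax(z)

def cross_entropy(y, target):
    # Loss between predicted probabilities y and a one-hot target.
    return -sum(t * math.log(p) for p, t in zip(y, target) if t > 0)

def sgd_step(W, b, x, target, lr=0.5):
    # For softmax + cross-entropy, dL/dz_k = y_k - target_k.
    y = forward(W, b, x)
    dz = [yk - tk for yk, tk in zip(y, target)]
    for j in range(len(x)):
        for k in range(len(b)):
            W[j][k] -= lr * x[j] * dz[k]  # dL/dW[j][k] = x_j * dz_k
    for k in range(len(b)):
        b[k] -= lr * dz[k]                # dL/db[k] = dz_k

# Parameter count of the MNIST model: a 784x10 weight matrix plus 10 biases.
n_params = 784 * 10 + 10

# Toy problem (hypothetical data): one example, 4 inputs, 3 classes.
W = [[0.0] * 3 for _ in range(4)]
b = [0.0] * 3
x = [1.0, 0.0, 0.5, 0.0]
target = [0.0, 1.0, 0.0]

loss_before = cross_entropy(forward(W, b, x), target)
for _ in range(20):
    sgd_step(W, b, x, target)
loss_after = cross_entropy(forward(W, b, x), target)
print(n_params, loss_before, loss_after)
```

Each step nudges every parameter against its partial derivative, so the loss on this example falls from its starting value of log(3).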
Demo
- See Jupyter Notebooks “MNIST-Tensorflow” and “MNIST-Keras”
Recurrent Neural Networks
Idea: Units receive input and previous output.
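A minimal sketch of this idea with a single scalar unit. The tanh activation, the weight values, and the function names are assumptions chosen for illustration.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    # The new output depends on the current input AND the previous output.
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(inputs, w_x=0.5, w_h=0.9, b=0.0):
    h = 0.0  # no previous output before the first time-step
    outputs = []
    for x_t in inputs:
        h = rnn_step(x_t, h, w_x, w_h, b)
        outputs.append(h)
    return outputs

# A single pulse followed by silence: later outputs stay non-zero
# because the unit keeps feeding its own previous output back in.
outputs = run_rnn([1.0, 0.0, 0.0])
print(outputs)
```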
Training
- Training examples consist of sequences of input and output data.
- E.g., training a network on a vocabulary of four letters (h, e, l, o) and the example “hello”
  - X would be “h, e, l, l”
  - y would be “e, l, l, o”
- Each step finds Loss gradient over all examples and through time.
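The “hello” example can be built in a few lines. This is a sketch that assumes one-hot encoding of the four-letter vocabulary; the variable names are illustrative.

```python
vocab = ["h", "e", "l", "o"]
char_to_index = {c: i for i, c in enumerate(vocab)}

def one_hot(ch):
    # Vector of zeros with a single 1 at the character's vocabulary index.
    vec = [0] * len(vocab)
    vec[char_to_index[ch]] = 1
    return vec

text = "hello"
X_chars = list(text[:-1])  # inputs: every character except the last
y_chars = list(text[1:])   # targets: the character that follows each input

X = [one_hot(c) for c in X_chars]
y = [one_hot(c) for c in y_chars]
print(X_chars, y_chars)
```

The targets are the inputs shifted by one position, so the network learns to predict the next character at every time-step.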
Long Short-Term Memory Cells
- Cells can have “memory” of some internal state that is preserved in between time-steps.
- LSTM (Long Short-Term Memory Cell) is very popular
- Another is GRU (Gated Recurrent Unit).
- Not much evidence about advantages of any.
- Many “casual” deep learners just work with LSTM.
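To make the “memory” idea concrete, here is a sketch of one LSTM step with scalar weights. The gate equations are the commonly used LSTM formulation rather than anything shown on the slides, and the parameter values are hand-picked (not trained) so that the cell stores a pulse and holds it.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def lstm_step(x_t, h_prev, c_prev, params):
    # params maps a gate name to (input weight, recurrent weight, bias).
    def pre(name):
        w_x, w_h, bias = params[name]
        return w_x * x_t + w_h * h_prev + bias

    f = sigmoid(pre("forget"))       # how much old cell state to keep
    i = sigmoid(pre("input"))        # how much new information to write
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    g = math.tanh(pre("candidate"))  # the new information itself
    c = f * c_prev + i * g           # internal state carried between steps
    h = o * math.tanh(c)             # output of the cell
    return h, c

# Hand-picked parameters (assumption): always remember, write only when x is large.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (10.0, 0.0, -5.0),
    "output": (0.0, 0.0, 10.0),
    "candidate": (5.0, 0.0, 0.0),
}

h, c = 0.0, 0.0
cells = []
for x_t in [1.0, 0.0, 0.0, 0.0]:
    h, c = lstm_step(x_t, h, c, params)
    cells.append(c)
print(cells)
```

The cell state written at the first step is still almost unchanged three steps later, even though the input has gone back to zero.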
“Unreasonable” Effectiveness
- RNNs seem to work really well for certain tasks.
- Generating natural language at character level is a good example
- Andrej Karpathy had a lot of fun with the char-RNN idea.
Demo character-level RNN
- Look at charRNN-Keras Jupyter Notebook
Still possible to have kinks:
Generated: be not afraid of greatness: some are born great, stal3n3r333333ng3l33n333s333 3f the3rd3n33n33s333s3333333333333333333333333333n33l33c33333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333
Generated: be not afraid of greatness: some are born great, such so the hands;i have the ready we mad bring some which thee.
sicinio:so the man that his doliester to break their town.
second servinam:the noble her own or say the hands and the prest and down,and i am madam, sir, peterving the bark.
wance:some we hence and not down they the honour to branged,and i grace the provilly frattion, or and strive be firstbut since be disse'eret?
A note on Convolutional Neural Networks.
- Mainly used in image recognition
- Convolution layers apply “learnable filters” to segments of image.
- “De-convolution” can undo this process and produce image.
- Successful results in “deep dream” images and “style transfer”
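A sketch of what applying a filter to segments of an image means. The kernel here is fixed by hand to respond to vertical edges; in a real network its values would be learned. As in most deep learning libraries, the operation below is strictly a cross-correlation (the kernel is not flipped). The function name and the toy image are assumptions.

```python
def convolve2d(image, kernel):
    # Slide the kernel over every position where it fits entirely inside
    # the image and take the weighted sum of the covered patch.
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out

# Toy 3x4 "image": dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked 2x2 filter that responds to a dark-to-bright step.
edge_kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = convolve2d(image, edge_kernel)
print(feature_map)
```

The feature map is non-zero only in the column where the filter straddles the edge.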
Folk-RNN
arXiv:1604.08723 [cs.SD]
Data and Model
- Data are folk tune transcriptions in “ABC” format from thesession.org
- Cleaned corpus includes 23636 transcriptions all transposed to C.
- 137 unique tokens (vocabulary elements) covering pitch, structure, duration, meter, etc.
- Data are encoded as one-hot vectors.
- Network architecture: three LSTM layers of 512 units each.
- Paper found that folk-RNN (with 137 elements in vocab) outperforms a char-RNN on the same data despite similar numbers of params.
T: A Cup Of Tea
M: 4/4
L: 1/8
K: Ador
eAAa ~g2fg|eA~A2 BGBd|eA~A2 ~g2fg|1af (3gfe dG~G2:|2af (3gfe d2^cd||
eaag efgf|eaag ed (3Bcd|eaag efgb|af (3gfe d2^cd:|
Cleaned up:
<s> M:4/4 K:Cdor g c c c’ b 2 a b | g c c 2 d B d f | g c c 2 b 2 a b |1 c’ a (3 b a g f B B 2 :| |2 c’ a (3 b a g f 2 =e f | g c’ c’ b g a b a | g c’ c’ b g f (3 d e f | g c’ c’ b g a b d’ | c’ a (3 b a g f 2 =e f :| <\s>
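The cleaned-up form makes token-level encoding straightforward. The sketch below assumes tokens are separated by whitespace, as they appear in the cleaned example, and uses only its first bar (with a straight apostrophe in place of the typographic one). The vocabulary built here covers just this excerpt, not the full 137-token vocabulary.

```python
# First bar of the cleaned transcription shown above.
excerpt = "<s> M:4/4 K:Cdor g c c c' b 2 a b |"

tokens = excerpt.split()
vocab = sorted(set(tokens))  # unique tokens in this excerpt only
token_to_index = {tok: i for i, tok in enumerate(vocab)}

def one_hot(token):
    # One vector entry per vocabulary element, with a single 1.
    vec = [0] * len(vocab)
    vec[token_to_index[token]] = 1
    return vec

encoded = [one_hot(tok) for tok in tokens]
print(len(tokens), len(vocab))
```

Whole symbols such as `M:4/4` or `K:Cdor` become single vocabulary elements, whereas a char-RNN would have to spell them out one character at a time.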
Generating “Transcriptions” via LSTM
- FolkRNN produces lead sheets (“Transcriptions”) which are then interpreted by performers.
- Authors interested in evaluating and comparing music generation techniques.
- How useful are they? How much cherry picking needs to be done? What could their role be in assisting human creativity?
- https://highnoongmt.wordpress.com/2017/03/19/benchmarking-music-generation-systems/
Applications in composition?
- The Millennial Whoop Jig
- “Given the ubiquity of this short musical phrase called the “Millennial Whoop”, let’s compose some Celtic-style music that “features” it. I am going to use the folk-rnn deep LSTM model that we trained and presented recently. (This is a continuation of my explorations of using deep learning for assisting the process of music composition.)”
- “Let’s start with a jig, which is a 6/8 dance. Using our model trained on over 23,000 folk music transcriptions, I seed with the millennial whoop as the beginning measure and ask for 3 transcriptions in C major:”
WaveNet
Generating Raw Audio
- Convolutional network. Input is a series of digital audio samples, output is predicted next sample.
- So far, primary use-case is speech synthesis
- Network is also conditioned on phonetic features to define sounds to make.
- DeepMind demonstrate a “musical” example trained on “classical piano” music, which makes piano-like sounds.
- https://deepmind.com/blog/wavenet-generative-model-raw-audio/
- WaveNet: a generative model for raw audio
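The “predict the next sample, then feed it back in” loop can be sketched without any network at all. Here `predict_next` is a hypothetical stand-in for the trained model: a fixed two-tap linear predictor whose weights were chosen by hand so that it continues a sine wave. It illustrates only the autoregressive generation loop, not WaveNet's architecture.

```python
import math

def predict_next(history, weights):
    # Stand-in for the trained network (assumption): a weighted sum of the
    # most recent samples, oldest first.
    recent = history[-len(weights):]
    return sum(w * s for w, s in zip(weights, recent))

def generate(seed, weights, n_samples):
    # Autoregressive generation: each predicted sample is appended to the
    # history and becomes input for the next prediction.
    audio = list(seed)
    for _ in range(n_samples):
        audio.append(predict_next(audio, weights))
    return audio

# s[t] = 2*cos(omega)*s[t-1] - s[t-2] continues a sinusoid of frequency omega.
omega = 2 * math.pi / 16
weights = [-1.0, 2 * math.cos(omega)]
seed = [0.0, math.sin(omega)]

audio = generate(seed, weights, 30)
print(audio[:8])
```

Given two seed samples, the loop produces the rest of the waveform one sample at a time, which is also why sample-level generation is slow for real audio rates.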
Magenta - https://magenta.tensorflow.org
Coordinated MIDI Generation
- Extensive project for generating music with ANNs
- Open source: https://github.com/tensorflow/magenta
- Collection of approaches for generating music (and images).
- RNNs tuned for monophonic melodies, polyphony, melody + chords, drums, etc.
- Written in TensorFlow with abstracted data cleaning, training, generation, and models. Code is tough to read.
- Magenta-Discuss is a good resource! 50% thoughtfulness, 50% noobs.
- Blog also good
Magenta Demo
- Magenta Attention Model trained on MIDI from all Legend of Zelda soundtracks (up to 2016).
- Generated 2048 notes with no primer.
- Magenta work seems useful, but lacks research goals.
- Could easily be repurposed by researchers / creators - if so, good chance Google would promote the work too!
Zelda Attention RNN Output
Neural Mashups
- Take an image
- Use a captioning network (im2txt) to generate some sort of lyrics.
- Set them to a generated song
- Use speech synth to sing
- Headline: “AI wrote a Christmas carol!”
- E.g., https://vimeo.com/192711856
- Flow Machines “pop songs” are (better developed) examples of this.
Deep Learning Handwriting Paths
LSTM + Mixture Distribution Network
https://greydanus.github.io/2016/08/21/handwriting/
Fake Kanji Generator
MDN
Deep Learning Ensemble Interactions
RNN for Ensemble Performances
- 9 possible gestures
- 4 players input, 3 players output
- Input and output encoded as one-hot vectors (i.e., 9^4 input possibilities, 9^3 output possibilities).
- Tried a quartet and duet configuration
- Network used was similar to folkRNN (3 LSTM layers).
- Broken right now! Built for old version of TensorFlow… ;_;
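The sizes of the one-hot vectors follow from treating each combination of gestures as a single class. The sketch below assumes the class index is formed by reading the players' gestures as base-9 digits; the slides do not say how the index was actually computed, and the function names are illustrative.

```python
N_GESTURES = 9

def encode(gestures):
    # Combine one gesture per player (each in 0..8) into a single class index.
    index = 0
    for g in gestures:
        if not 0 <= g < N_GESTURES:
            raise ValueError("gesture out of range")
        index = index * N_GESTURES + g
    return index

def decode(index, n_players):
    # Inverse of encode: recover the per-player gestures from a class index.
    gestures = []
    for _ in range(n_players):
        gestures.append(index % N_GESTURES)
        index //= N_GESTURES
    return gestures[::-1]

n_input_classes = N_GESTURES ** 4   # four players in: 6561 combinations
n_output_classes = N_GESTURES ** 3  # three players out: 729 combinations

example = encode([1, 2, 3, 4])
print(n_input_classes, n_output_classes, example, decode(example, 4))
```

The vocabulary grows exponentially with the number of players, which is one reason larger ensembles would be harder to model this way.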
ANN to recreate “tiny touch performances”
Where is this going?
Challenges
- Defining a network is super easy.
- Finding good data is hard, good representation can be very hard!
- Training can take a long time, easier with big GPUs
  - workstation with gaming GPUs is handy for experiments
  - Big models trained on several GPUs
- Trained models can be evaluated on simpler systems - even browsers and mobile devices.
- Can be hard to figure out goals or use cases even for interesting data.
- Much blog hype on the internet focusses on existing data, and well-defined problems. But getting these two things is much of the challenge.