BSC THESIS TOPIC
Kristóf Máté Horváth
Graduating Student in Electrical Engineering
Analyzing Blockchain Data with Deep Learning
Bitcoin, the first cryptocurrency, and the underlying blockchain technology were
established around a decade ago. The blockchain is a distributed database that contains a
chain of blocks; each block contains transaction data, including timestamps, endpoints,
and values. This data is freely accessible, cryptographically secure and immutable. Because
all transactions are public, searching for patterns, anomalies and correlations within the
blockchain, and between the blockchain and additional data sources, poses a challenging
task for data scientists and machine learning researchers.
Due to the significant increase in the amount of available data, the continuous
advancement of high-performance GPUs (Graphics Processing Units) and novel scientific
results, deep learning has become one of the most actively researched areas of machine
learning. Deep architectures with tens or even hundreds of layers are able to
simultaneously learn representations and model the input data efficiently.
The goal of this thesis work is to investigate blockchain technology and analyze the
transaction database, concentrating on deep learning-based methods.
The following subtasks should be elaborated:
• Review the most important scientific papers on deep learning and
blockchain technology.
• Investigate the structure and possible origins of the blockchain database and related
data sources.
• Create a solution for downloading and storing this data in a Linux environment,
considering performance issues.
• Analyze the gathered data and create possible training and test sets for deep learning
systems.
• Implement and test at least one deep neural network to jointly model transactions
and a related data feed (e.g., derived from the asset prices) in a demonstration
system.
• Evaluate the results of the demonstration system.
• Explore the possibilities of extending this solution to more complex deep learning
systems.
• Prepare detailed documentation of the work, summarize the results, and
write a conclusion outlining possible future work.
Academic supervisor: Bálint Gyires-Tóth, PhD
Budapest, 27th September 2018
Gábor Magyar, PhD
/Head of Department/
Budapest University of Technology and Economics
Faculty of Electrical Engineering and Informatics
Department of Telecommunications and Media Informatics
Kristóf Máté Horváth
ANALYZING BLOCKCHAIN DATA
WITH DEEP LEARNING
ACADEMIC SUPERVISOR
Bálint Gyires-Tóth, PhD
BUDAPEST, 2018
Contents
1. Introduction ................................................................................................................ 1
2. Public-key cryptography ........................................................................................... 2
2.1 One-way hash functions .......................................................................................... 2
2.2 Private and Public Keys .......................................................................................... 4
2.3 The Elliptic Curve Digital Signature Algorithm (ECDSA) .................................... 6
2.3.1 Finite Fields ...................................................................................................... 6
2.3.2 The Finite Field Fp ........................................................................................... 6
2.3.3 Elliptic curves over Finite fields ...................................................................... 6
2.3.4 Group order and group structure ...................................................................... 8
2.3.5 ECDSA domain parameters ............................................................................. 8
2.3.6 Randomly generating an elliptic curve ............................................................. 9
2.3.7 Domain parameter generation ........................................................................ 10
2.3.8 ECDSA Key Pair generation, public key validation and proof of possession
of the private key ............................................................................................ 10
3. The blockchain of the Bitcoin network .................................................................. 13
3.1 Private Keys ......................................................................................................... 13
3.2 Public Keys .......................................................................................................... 13
3.3 Bitcoin addresses .................................................................................................. 14
3.4 Transactions ......................................................................................................... 15
3.4.1 Structure of a transaction ................................................................................ 16
3.4.2 Transaction inputs and outputs ....................................................................... 17
3.4.3 Transaction fees .............................................................................................. 18
3.4.4 Transaction validation conditions .................................................................. 19
3.4.5 Orphan transactions ........................................................................................ 20
3.5 The data structure of the blockchain ..................................................................... 21
3.5.1 A block’s data fields ....................................................................................... 21
3.5.2 Block Header .................................................................................................. 22
3.5.3 Merkle Trees ................................................................................................... 22
3.6 Decentralised consensus through proof of work ................................................... 24
3.6.1 Aggregation of transactions ............................................................................ 24
3.6.2 Proof of work .................................................................................................. 25
3.6.3 Validation of a new block .............................................................................. 26
3.6.4 Blockchain forks ............................................................................................. 27
4. Analyzing Bitcoin’s blockchain with deep learning algorithms........................... 28
4.1 Collecting Bitcoin’s blockchain data .................................................................... 28
4.1.1 Blockchain.com API ...................................................................................... 29
4.1.2 Drawing transaction networks with NetworkX .............................................. 29
4.1.3 Additional features from blocks ..................................................................... 32
4.1.4 Storing data in HDF5 files .............................................................................. 33
4.1.5 Volatility estimators ....................................................................................... 36
4.2 Deep learning ........................................................................................................ 38
4.3 Predicting price and volatility with different architectures ................................... 41
4.3.1 Determining the number of transactions from the transaction graphs ........... 48
4.4 Different approaches for predictions, system usage, extensions .......................... 50
4.4.1 Analyzing the correlations of block features with market data ...................... 50
4.4.2 Long Short-Term Memory network with block features ............................... 54
4.4.3 Application and integration of an operative prediction system ...................... 57
4.4.4 Possible future experiments ............................................................................ 60
5. Summary ................................................................................................................... 62
Acknowledgements ....................................................................................................... 64
References ...................................................................................................................... 65
6. Appendix .................................................................................................................... 68
A.1. Secure Hash Algorithm (SHA) ........................................................................ 68
A.2. The domain parameters of the Koblitz curve, secp256k1 ............................... 74
STUDENT DECLARATION (Hallgatói nyilatkozat)
I, the undersigned Kristóf Máté Horváth, graduating student, declare that I prepared this
thesis myself, without unauthorized assistance, and that I used only the cited sources
(literature, tools, etc.). Every part that I have taken from other sources verbatim, or with
the same meaning but rephrased, is clearly marked with a reference to the source.
I consent to the publication by BME VIK of the basic data of this work (author(s), title,
abstracts in English and Hungarian, year of preparation, name(s) of supervisor(s)) in a
publicly accessible electronic form, and to the publication of the full text of the work
through the university's internal network (or for authenticated users). I declare that the
submitted work and its electronic version are identical. For theses classified with the
Dean's permission, the text of the thesis becomes accessible only after 3 years.
Dated: Budapest, 6 December 2018
...…………………………………………….
Horváth Kristóf Máté
Abstract (Kivonat)
Blockchain is a distributed peer-to-peer network that allows unknown parties to securely
send transactions to each other in the form of digital currencies. All of this happens
without a central supervising authority intervening in the transaction process.
Bitcoin is currently the world's most valuable digital currency, and it is now traded both
on cryptocurrency exchanges and, through futures contracts, on traditional exchanges. The
public ledger of the Bitcoin blockchain creates the opportunity to extract new information
from the data with deep learning algorithms, and to use the statistical analysis of the data
for designing automated trading systems.
In my thesis I first present the fundamental mathematical definitions and operations
that form the basis of public-key, or asymmetric, cryptography. Asymmetric cryptographic
methods are responsible for the security of blockchains, and they enforce mutual trust
between the participants of the distributed system.
In the second chapter I discuss in detail the structure, operation and data structure of
the Bitcoin blockchain.
In the second half of the thesis I present the process through which I collected data
about the Bitcoin network and its market value. I discuss the efficient storage,
transformation and analysis of the data, which I carried out with deep learning algorithms
in order to predict the future values of certain variables. At the end of the thesis I mention
possible further research directions and a use case for an operational prediction system.
Today, the largest share of trades on traditional and cryptocurrency exchanges is
executed by automated systems. These systems have eliminated the emotional errors of
human traders; they better exploit pattern-based recognition, adherence to precise trading
strategies, and extremely fast information processing in order to maximize profit.
Abstract
Blockchain is a distributed peer-to-peer network, which allows clients to anonymously
and securely transfer digital currencies without the intervention of a centralized authority.
Blockchain technology is also called a public ledger, because the network's transactions
are public.
Bitcoin is the most valuable digital asset, and it is traded on cryptocurrency exchanges.
The public nature of the Bitcoin ledger creates an opportunity to combine blockchain data
and deep learning algorithms in order to leverage possible new sources of information for
automated trading. In this thesis, I first introduce the basic definitions, mathematical
formulas and operations of the public-key cryptographic methods that allow blockchain
technology to operate without a central authority and establish so-called decentralised
trust between anonymous parties. Then I discuss in detail the data structure and the
operation of the Bitcoin blockchain. The second half of this thesis presents the process
through which I collected, transformed and analyzed data about Bitcoin, and utilized
deep learning algorithms in order to predict future properties of Bitcoin and its network.
At the end of the document I mention a possible use case of a prediction system and some
opportunities for future investigation.
Traditional stock market and cryptocurrency trading are today mostly based on
algorithmic trading. Trading algorithms exploit pattern recognition, strict adherence to
precise trading strategies and rapid information processing in order to outperform human
traders.
- 1 -
1. Introduction
Blockchain and deep learning are both prominent computer technologies that have
gained importance in recent years. Blockchain is a distributed peer-to-peer network that
allows the network's participants to securely and anonymously transfer digital assets to
each other without the intervention of a central authority. Deep learning is a subfield of
machine learning, and it is used to analyse large sets of data and to map input variables to
a desired output. In this thesis I utilize Bitcoin blockchain data in order to predict
Bitcoin's price and its volatility with deep learning algorithms.
In the second chapter of my thesis I introduce the concepts of cryptographic hash
functions, basic encryption schemes, private and public keys. I discuss in detail the
mathematical background of the Elliptic Curve Digital Signature Algorithm, which is
widely utilized by blockchain technology in order to generate private and public key pairs.
These concepts and innovations are the fundamental blocks of blockchains, which
facilitate the secure operation of the peer to peer networks and eliminate the necessity of
a central authority.
In the third chapter I connect the previously introduced mathematical
background to the Bitcoin blockchain. The key pairs, addresses, the structure of
transactions and the data fields are discussed, which will be used in the subsequent
chapters. The concept of the Merkle tree, which aggregates the network's transactions, is
also explained, and then the proof-of-work algorithm is interpreted, which constitutes the
so-called decentralized consensus and creates the possibility for blockchain forks to occur.
In the fourth chapter I discuss in detail the process of data collection about the Bitcoin
blockchain, the storing, transformation and analysis of the data. I introduce transaction
graphs that I created from each Bitcoin block in order to feed them to deep neural
networks. I experimented with different convolutional neural network architectures to
predict Bitcoin’s price and its volatility from the transaction graphs and I utilized
additional block features to train long short-term memory networks. In this part of my
thesis I also summarize the results of the investigations and the detected correlations
between the Bitcoin network and Bitcoin's market data. In the remainder of the chapter I
introduce a possible use case of an operative prediction system and propose potential
further experiments.
2. Public-key cryptography
Permissionless blockchain protocols like Bitcoin are based on P2P networks,
cryptography and game theory. The participants of blockchain networks reach consensus
on which transactions are valid without the help of a central authority. Cryptography is
used to preserve privacy and transparency at the same time. Public-key cryptography, or
asymmetric cryptography, is a cryptographic system that relies on a pair of keys: a private
key that is kept secret and a public key that can be broadcast to the network. The
cryptographic system ensures the authenticity and integrity of a message. Bitcoin's wallet
creation, the signing and verification of transactions, and the common consensus over the
network are all blockchain activities that rely on public-key cryptography techniques.
2.1 One-way hash functions
The mathematical one-way functions are the key to public-key cryptography. These
functions are the fundamental building blocks of secure communications over an insecure
channel.
One-way functions are easy to compute but practically impossible to reverse. For a given
value x, it is easy to compute f(x), but from f(x) alone, x is not computable.
Only with a brute-force attack (that is, trying every possible value that might
produce f(x)) could one recover the secret data x. For this reason, many
protocols rely on one-way hash functions, because they transform valuable information
into uniquely distinguishable, fixed-length data known as the data's digital
fingerprint [1].
Hash functions take variable-length data as input and produce a hash value with a fixed
length. Hash values may contain many leading zeros, since the output always has the
required fixed length. The following properties must be mathematically satisfied by
cryptographic hash functions that create digital fingerprints [2]:
• Providing hash values for any kind of data quickly
• Being pseudorandom
• Being deterministic
• Being one-way functions
• Being collision resistant
Providing hash values for any kind of data quickly means that the algorithm producing
the fixed-length output should not be computationally intensive for any kind of input,
and the output must be returned quickly.
Definition 1.1 (pseudorandom functions):
A pseudorandom function is an efficient (deterministic) algorithm [3] which, given an
𝑛-bit seed 𝑠 and an 𝑛-bit argument x, returns an 𝑛-bit string, denoted 𝑓𝑠(𝑥), such that it
is infeasible to distinguish the responses of 𝑓𝑠, for a uniformly chosen 𝑠, from the
responses of a truly random function.
The hash value returned by a pseudorandom function changes unpredictably
with any change of the input data. It should be impossible to predict how the output of
the hash function changes when the input data is modified.
Deterministic functions return identical hash values for the same inputs. Equivalent
data given to a hash function must have equivalent digital fingerprints, so that the data
can be identified correctly.
Definition 1.2 (one-way functions):
A function 𝑓: {0, 1}*↦ {0, 1}* is called one-way, if
• easy direction: there is an efficient algorithm which on input x outputs f(x).
• hard direction: given f(x), where x is uniformly selected, it is infeasible to find,
with non-negligible probability, a preimage of f(x). That is, any feasible algorithm
which tries to invert f may succeed only with negligible probability, where the
probability is taken over the choices of x and the algorithm’s coin tosses.
One-way functions are non-invertible; therefore it is practically impossible to recover the
original input data from the hash value alone.
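The properties above can be observed directly with a standard hash function. A minimal sketch using Python's hashlib and SHA-256 (the hash function Bitcoin also relies on), illustrating the fixed output length, determinism, and the unpredictable change of the digest under a small input change:

```python
import hashlib

def fingerprint(data: bytes) -> str:
    # SHA-256 maps variable-length input to a fixed 256-bit (64 hex digit) digest.
    return hashlib.sha256(data).hexdigest()

# Fixed length for any input size
assert len(fingerprint(b"a")) == 64
assert len(fingerprint(b"a" * 10_000)) == 64

# Deterministic: equal inputs yield equal fingerprints
assert fingerprint(b"blockchain") == fingerprint(b"blockchain")

# Pseudorandom: a one-character change alters the digest completely
d1 = fingerprint(b"blockchain")
d2 = fingerprint(b"Blockchain")
differing = sum(c1 != c2 for c1, c2 in zip(d1, d2))
print(differing)  # typically around half of the 64 hex digits differ
```

Inverting the function, i.e. recovering the input bytes from `d1` alone, is believed to be infeasible except by brute force over all candidate inputs.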
Definition 1.3 (Collision-Free Hashing):
Consider a family of hash functions [4], indexed by strings, F ≝ { 𝑓𝛼 : {0, 1}^(2|𝛼|) ↦
{0, 1}^(|𝛼|) }𝛼, such that there exists a polynomial-time algorithm for evaluating F (i.e., on
input 𝛼 and 𝑥 it returns 𝑓𝛼(𝑥)). The family F is called collision-free with respect to (w.r.t.)
complexity 𝑐(∙) if, for every non-uniform family of circuits {𝐶𝑛} with size bounded by
𝑐(∙) and all sufficiently large 𝑛, the probability that 𝐶𝑛, given a uniformly chosen 𝛼 ∈
{0, 1}^𝑛, outputs a pair (𝑥, 𝑦) with 𝑥 ≠ 𝑦 and 𝑓𝛼(𝑥) = 𝑓𝛼(𝑦), is bounded above by 1/𝑐(𝑛).
The family F is called collision-free if it is collision-free w.r.t. all polynomials, and is called
strongly collision-free if, for some ϵ > 0, it is collision-free w.r.t. the function
𝑐(𝑛) ≝ 2^(𝑛^ϵ).
Collision-free functions exist assuming that factoring integers is intractable (i.e., not
possible in polynomial time). Strongly collision-free functions exist if n-bit integers cannot
be factored in time 2^(𝑛^ϵ) for some ϵ > 0. Collision resistance means that the probability
of creating identical hash values from two distinct inputs is approximately zero.
The above conditions must be satisfied for a hash function to create digital fingerprints.
Each property is analogous to a human fingerprint: a human fingerprint is quickly
captured by a proper camera; the fingerprint changes when the finger is injured; every
time the finger is sampled it produces the same pattern; someone who sees only the
fingerprint cannot guess the corresponding person; and two different people, even twins,
never have identical fingerprints.
2.2 Private and Public Keys
The idea of asymmetric cryptography which is also known as public key cryptography
was proposed by Merkle, Diffie and Hellman in the mid-1970s. This cryptographic
standard is a set of techniques that allows two parties to communicate securely by
eliminating the possibilities for eavesdropping, tampering and impersonation attacks.
It provides:
• Encryption
• Tamper detection
• Authentication
• Non-repudiation
Two parties that want to exchange confidential information must encrypt and decrypt
the data that carries the information. The raw data, called plaintext, which represents
readable information, is encrypted by the sender with an encryption algorithm using the
receiver's public key. The encryption algorithm produces an uninterpretable ciphertext,
which is transmitted over a shared medium. The receiver decrypts the ciphertext with the
corresponding private key to read the plaintext. The public and private keys are
interconnected, in the sense that the public key is generated from the private key. The
public key allows anybody to encrypt data, but only the owner of the private key can
decrypt it. The receiver of the information can verify that the data has not been modified
during transmission: an adversary's attempt to modify the data causes a detectable change
in the message. This is called tamper detection. Authentication provides a method to
prove the identity of the sender, and therefore excludes impersonation attacks.
Non-repudiation prevents the sender from later claiming that the information was
never sent.
The mathematical definition of a public-key encryption scheme is given in the
following [5].
Definition 2.2.1
Let κ ∈ ℕ be a security parameter. An encryption scheme is defined by the following
spaces in (all depending on the security parameter κ) and algorithms in Table 1.
Table 1. Spaces and algorithms of an encryption scheme
𝑀κ The space of all possible messages.
𝑃𝐾κ The space of all possible public keys.
𝑆𝐾κ The space of all possible private keys.
𝐶κ The space of all possible ciphertexts.
KeyGen
A randomised algorithm that takes the
security parameter κ, runs in expected
polynomial time (i.e., 𝑂(κ^𝑐) bit operations
for some constant c ∈ ℕ) and outputs a
public key pk ∈ 𝑃𝐾κ and a private key sk
∈ 𝑆𝐾κ.
Encrypt
A randomised algorithm that takes as
input m ∈ 𝑀κ and pk, runs in expected
polynomial time (i.e., 𝑂(κ^𝑐) bit operations
for some constant c ∈ ℕ) and outputs a
ciphertext c ∈ 𝐶κ.
Decrypt
An algorithm (not usually randomised)
that takes c ∈ 𝐶κ and sk, runs in
polynomial time and outputs either m ∈
𝑀κ or the invalid ciphertext symbol ⊥.
It is required that
Decrypt(Encrypt(m, pk), sk) = m
if (pk, sk) is a matching key pair. It is also a requirement that the fastest known attack on
this system should require at least 2^κ bit operations.
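The correctness requirement Decrypt(Encrypt(m, pk), sk) = m can be illustrated with a toy textbook-RSA instantiation of the KeyGen/Encrypt/Decrypt interface. This is a sketch with fixed tiny primes and no padding, purely illustrative (a real scheme derives its key size from the security parameter κ, and RSA is not the scheme Bitcoin itself uses):

```python
import random

def keygen():
    # Toy parameters: two small fixed primes (completely insecure sizes).
    p, q = 61, 53
    n = p * q                      # modulus
    phi = (p - 1) * (q - 1)
    e = 17                         # public exponent, coprime to phi
    d = pow(e, -1, phi)            # private exponent (modular inverse, Python 3.8+)
    return (n, e), (n, d)          # (pk, sk)

def encrypt(m, pk):
    n, e = pk
    return pow(m, e, n)            # ciphertext c = m^e mod n

def decrypt(c, sk):
    n, d = sk
    return pow(c, d, n)            # recovers m = c^d mod n

pk, sk = keygen()
m = random.randrange(2, pk[0])
assert decrypt(encrypt(m, pk), sk) == m   # the required correctness property
```

Note that only pk is needed to encrypt, while decryption requires the secret exponent in sk; the ciphertext itself is uninterpretable without it.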
2.3 The Elliptic Curve Digital Signature Algorithm (ECDSA)
The Elliptic Curve Digital Signature Algorithm creates a digital signature of the input
data. The digital signature is used to verify the authenticity of the underlying data without
compromising its security. The following sections discuss the mathematics of ECDSA
in detail.
2.3.1 Finite Fields
A finite field consists of a finite set of elements F [8]. The order of a finite field is the
number of elements in the field. A finite field of order q exists if and only if q is a prime
power. If q is a prime power, then there is essentially only one finite field of order q, and
it is denoted by 𝐹𝑞. If 𝑞 = 𝑝^𝑚, where p is a prime and m is a positive integer, then p is
called the characteristic of 𝐹𝑞 and m is called the extension degree of 𝐹𝑞. Most standards
which specify elliptic curve cryptographic techniques restrict the order of the
underlying finite field to be an odd prime (q = p) or a power of 2 (𝑞 = 2^𝑚).
2.3.2 The Finite Field 𝑭𝒑
Let p be a prime number. The finite field 𝐹𝑝 is called prime field. 𝐹𝑝 consists of the set of
integers {0, 1, 2, …, p-1}. The following operations are defined on 𝐹𝑝:
• Addition: If a, b ∈ 𝐹𝑝, then a + b = r, where r is the remainder when a + b is
divided by p and 0 ≤ r ≤ p – 1. This is known as addition modulo p.
• Multiplication: If a, b ∈ 𝐹𝑝, then a ∙ b = s, where s is the remainder when a ∙ b is
divided by p and 0 ≤ s ≤ p – 1. This is known as multiplication modulo p.
• Inversion: If a is a non-zero element in 𝐹𝑝, the inverse of a modulo p, denoted
𝑎^−1, is the unique element c ∈ 𝐹𝑝 for which a ∙ c ≡ 1 (mod p).
Example 1. The elements of the finite field 𝐹43 are {0, 1, 2, …, 42}. Examples of
addition, multiplication and inversion, respectively: 41 + 22 = 20, 4 ∙ 12 = 5, 9^−1 = 24.
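The three field operations can be checked directly in Python. A minimal sketch with p = 43, matching Example 1; `pow(a, -1, p)` computes the modular inverse (Python 3.8+):

```python
p = 43  # an odd prime, so the integers modulo p form the field F_p

def add(a, b):
    return (a + b) % p          # addition modulo p

def mul(a, b):
    return (a * b) % p          # multiplication modulo p

def inv(a):
    assert a % p != 0           # only non-zero elements are invertible
    return pow(a, -1, p)        # modular inverse via built-in pow

print(add(41, 22))  # 20
print(mul(4, 12))   # 5
print(inv(9))       # 24, since 9 * 24 = 216 = 5 * 43 + 1
```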
2.3.3 Elliptic curves over Finite fields
Let p > 3 be an odd prime. An elliptic curve E over 𝐹𝑝 is defined by an equation of the
form 𝑦² = 𝑥³ + 𝑎𝑥 + 𝑏, where a, b ∈ 𝐹𝑝 and 4𝑎³ + 27𝑏² ≢ 0 (mod p). The set 𝐸(𝐹𝑝)
consists of all points (x, y), x ∈ 𝐹𝑝, y ∈ 𝐹𝑝, which satisfy the defining equation, together
with a special point 𝒪 called the point at infinity.
Addition of two points on an elliptic curve 𝐸(𝐹𝑝) is defined according to the chord-
and-tangent rule. Let P = (x1, y1) and Q = (x2, y2) be two distinct points on an elliptic
curve E. The sum of P and Q, denoted R = (x3, y3), is defined as follows. First draw
the line through P and Q; this line intersects the elliptic curve in a third point. Then R is
the reflection of this point in the x-axis. The geometric description is depicted in Figure
1, where the curve is drawn over the real numbers for illustration and consists of an oval
and an unbounded branch.
Figure 1. Addition of two distinct elliptic curve points (Source: [8])
The double of P = (x1, y1), denoted R = (x3, y3), is defined as follows. A tangent line
is drawn to the curve at P. The second intersection of the tangent line and the elliptic
curve is −R. Then R is the reflection of −R in the x-axis. Figure 2 depicts this process.
Figure 2. The doubling of a point on an elliptic curve (Source: [8])
The algebraic formulas for the sum of two points and the double of a point are derived
from the geometric description.
1. 𝑃 + 𝒪 = 𝒪 + 𝑃 = 𝑃 for all 𝑃 ∈ 𝐸(𝐹𝑝).
2. If 𝑃 = (𝑥, 𝑦) ∈ 𝐸(𝐹𝑝), then (𝑥, 𝑦) + (𝑥, −𝑦) = 𝒪. (The point (𝑥, −𝑦) is
denoted by −𝑃, is called the negative of 𝑃, and it is indeed a point on the curve.)
3. (Point addition) Let 𝑃 = (𝑥1, 𝑦1), 𝑄 = (𝑥2, 𝑦2) ∈ 𝐸(𝐹𝑝), where 𝑃 ≠ ±𝑄.
Then 𝑃 + 𝑄 = (𝑥3, 𝑦3), where
𝑥3 = ((𝑦2 − 𝑦1)/(𝑥2 − 𝑥1))² − 𝑥1 − 𝑥2 and 𝑦3 = ((𝑦2 − 𝑦1)/(𝑥2 − 𝑥1))(𝑥1 − 𝑥3) − 𝑦1.
4. (Point doubling) Let 𝑃 = (𝑥1, 𝑦1) ∈ 𝐸(𝐹𝑝), where 𝑃 ≠ −𝑃. Then 2𝑃 = (𝑥3, 𝑦3), where
𝑥3 = ((3𝑥1² + 𝑎)/(2𝑦1))² − 2𝑥1 and 𝑦3 = ((3𝑥1² + 𝑎)/(2𝑦1))(𝑥1 − 𝑥3) − 𝑦1.
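The formulas above translate directly into code, with each division becoming a modular inversion. A minimal sketch over a small prime field, using the illustrative toy curve 𝑦² = 𝑥³ + 2𝑥 + 2 over 𝐹17 (parameters chosen here purely for demonstration, not taken from the thesis):

```python
p, a, b = 17, 2, 2          # toy curve y^2 = x^3 + 2x + 2 over F_17
O = None                    # the point at infinity

def on_curve(P):
    if P is O:
        return True
    x, y = P
    return (y * y - (x ** 3 + a * x + b)) % p == 0

def point_add(P, Q):
    # Chord-and-tangent addition implementing rules 1-4.
    if P is O:
        return Q                                           # rule 1
    if Q is O:
        return P                                           # rule 1
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return O                                           # rule 2: P + (-P) = O
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p   # rule 4: tangent slope
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, p) % p          # rule 3: chord slope
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)

P = (5, 1)                  # a point on the curve: 1 = 125 + 10 + 2 (mod 17)
print(point_add(P, P))      # (6, 3), i.e. 2P
```

The doubling formula is the addition formula with the chord replaced by the tangent; note that for doubling, 𝑥3 = λ² − 2𝑥1 is exactly λ² − 𝑥1 − 𝑥2 with 𝑥2 = 𝑥1.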
2.3.4 Group order and group structure
Let E be an elliptic curve over a finite field 𝐹𝑞. According to Hasse's theorem, the number
of points on an elliptic curve (including the point at infinity) is #𝐸(𝐹𝑞) = 𝑞 + 1 − 𝑡,
where |𝑡| ≤ 2√𝑞. #𝐸(𝐹𝑞) is called the order of E and t is called the trace of E. In other
words, the order of an elliptic curve 𝐸(𝐹𝑞) is approximately equal to the size q of the
underlying field.
𝐸(𝐹𝑞) is an abelian group of rank 1 or 2. 𝐸(𝐹𝑞) is isomorphic to ℤ𝑛1 × ℤ𝑛2, where n2
divides n1, for unique positive integers n1 and n2. ℤ𝑛 denotes the cyclic group of order
n. Moreover, n2 divides q − 1. If n2 = 1, then 𝐸(𝐹𝑞) is said to be cyclic. In this case 𝐸(𝐹𝑞)
is isomorphic to ℤ𝑛1, and there exists a point 𝑃 ∈ 𝐸(𝐹𝑞) such that 𝐸(𝐹𝑞) = {𝑘𝑃 : 0 ≤ k
≤ n1 − 1}. Such a point 𝑃 is called a generator point of 𝐸(𝐹𝑞).
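For a small prime field the group order can be counted by brute force and checked against Hasse's bound. A sketch on the illustrative toy curve 𝑦² = 𝑥³ + 2𝑥 + 2 over 𝐹17 (an assumed example, not a curve used in practice):

```python
import math

p, a, b = 17, 2, 2

# Count all affine points (x, y) satisfying the curve equation, plus O.
order = 1  # start at 1 for the point at infinity
for x in range(p):
    for y in range(p):
        if (y * y - (x ** 3 + a * x + b)) % p == 0:
            order += 1

t = p + 1 - order                        # the trace of E
print(order, t)                          # prints "19 -1"
# Hasse: |t| <= 2*sqrt(p); checked with a slightly loosened integer bound.
assert abs(t) <= 2 * math.isqrt(p) + 1
```

The count here is 19, a prime, so the group is cyclic and every point other than 𝒪 is a generator; the trace t = −1 is well inside the Hasse interval |t| ≤ 2√17 ≈ 8.2.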
2.3.5 ECDSA domain parameters
The domain parameters for ECDSA are an elliptic curve E, defined over a finite field 𝐹𝑞
of characteristic p, and a base point 𝐺 ∈ 𝐸(𝐹𝑞). Restrictions are placed on the
underlying field size q, the representation of the elements of 𝐹𝑞, the elliptic curve E,
and the order of the base point. These restrictions are necessary to facilitate
interoperability and to avoid known attacks.
1. The field size q, usually an odd prime q = p, in which case the underlying finite
field is 𝐹𝑝, the integers modulo p.
2. An indication FR of the representation used for the elements of 𝐹𝑞.
3. An optional bit string seedE of length at least 160 bits.
4. Two field elements a and b in 𝐹𝑞 which define the equation of the elliptic curve E
over 𝐹𝑞.
5. Two field elements 𝑥𝐺 and 𝑦𝐺 in 𝐹𝑞 which define a finite point G = (𝑥𝐺, 𝑦𝐺) (also
called the generator point) of prime order in 𝐸(𝐹𝑞).
6. The order n of the point G, with n > 2^160 and n > 4√𝑞.
7. The cofactor ℎ = #𝐸(𝐹𝑞)/𝑛.
2.3.6 Randomly generating an elliptic curve
The following algorithm is a verifiably random method to generate an elliptic curve. The
algorithm will be referenced as Algorithm 1 in further explanations. The notations 𝑡 =
⌈log₂ 𝑝⌉, 𝑠 = ⌊(𝑡 − 1)/160⌋, and 𝑣 = 𝑡 − 160 ∙ 𝑠 are used.
Algorithm 1: Generating a random elliptic curve over 𝐹𝑝.
Input: A field size p, where p is an odd prime.
Output: A bit string seedE of length at least 160 bits and field elements 𝑎, 𝑏 ∈ 𝐹𝑝
which define an elliptic curve 𝐸 over 𝐹𝑝.
1. Choose an arbitrary bit string seedE of length 𝑔 ≥ 160 bits.
2. Compute 𝐻 = SHA256(𝑠𝑒𝑒𝑑𝐸) and let 𝑐0 denote the bit string of length 𝑣 bits
obtained by taking the 𝑣 rightmost bits of 𝐻.
3. Let 𝑊0 denote the bit string of length 𝑣 bits obtained by setting the leftmost bit of
𝑐0 to 0. (This ensures that 𝑟 < 𝑝.)
4. Let 𝑧 be the integer whose binary expansion is given by the 𝑔-bit string 𝑠𝑒𝑒𝑑𝐸.
5. For i from 1 to s do:
- Let 𝑠𝑖 be the 𝑔-bit string which is the binary expansion of the integer
(𝑧 + 𝑖) mod 2^𝑔.
- Compute 𝑊𝑖 = SHA256(𝑠𝑖).
6. Let 𝑊 be the bit string obtained by concatenating 𝑊0, 𝑊1, …, 𝑊𝑠 as follows:
𝑊 = 𝑊0 ∥ 𝑊1 ∥ ⋯ ∥ 𝑊𝑠.
7. Let 𝑟 be the integer whose binary expansion is given by 𝑊.
8. If 𝑟 = 0 or if 4𝑟 + 27 ≡ 0 (mod 𝑝), then go to step 1.
9. Choose arbitrary integers 𝑎, 𝑏 ∈ 𝐹𝑝, not both 0, such that 𝑟 ∙ 𝑏² ≡ 𝑎³ (mod 𝑝).
10. The elliptic curve chosen over 𝐹𝑝 is 𝐸 : 𝑦² = 𝑥³ + 𝑎𝑥 + 𝑏.
11. Output (𝑠𝑒𝑒𝑑𝐸, 𝑎, 𝑏)
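A minimal Python sketch of Algorithm 1 using hashlib. Two simplifications are my own assumptions: the full 256-bit SHA-256 digests are concatenated (so r may exceed p and is reduced mod p, whereas the 160-bit bookkeeping in the algorithm text stems from SHA-1-era standards), and step 9 uses the simple choice a = b = r mod p, which trivially satisfies r·b² ≡ a³ (mod p):

```python
import hashlib

def generate_curve(p: int, seed: bytes):
    """Derive curve coefficients a, b over F_p from a seed (Algorithm 1 sketch)."""
    t = p.bit_length()                 # t = ceil(log2 p) for non-power-of-two p
    s = (t - 1) // 160
    v = t - 160 * s
    g = 8 * len(seed)                  # seed length in bits (must be >= 160)
    H = int.from_bytes(hashlib.sha256(seed).digest(), "big")
    c0 = H & ((1 << v) - 1)            # the v rightmost bits of H
    W0 = c0 & ~(1 << (v - 1))          # clear the leftmost of the v bits
    z = int.from_bytes(seed, "big")
    W = W0
    for i in range(1, s + 1):
        s_i = ((z + i) % (1 << g)).to_bytes(g // 8, "big")
        W_i = int.from_bytes(hashlib.sha256(s_i).digest(), "big")
        W = (W << 256) | W_i           # concatenate the full 256-bit digests
    r = W
    if r == 0 or (4 * r + 27) % p == 0:
        raise ValueError("retry with a new seed")   # step 8's "go to step 1"
    a = b = r % p                      # step 9: r * b^2 = r^3 = a^3 (mod p)
    return a, b

# Example with the 192-bit prime 2^192 - 2^64 - 1 and an arbitrary 160-bit seed.
p = 2**192 - 2**64 - 1
a, b = generate_curve(p, seed=b"\x01" * 20)
assert (4 * a**3 + 27 * b**2) % p != 0   # the resulting curve is non-singular
```

Because a, b are derived deterministically from seedE via a hash, anyone can re-run the derivation and verify that the curve was not chosen with a hidden special structure.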
2.3.7 Domain parameter generation
There are several ways to generate cryptographically secure domain parameters. Some of
the methods used in practice are Koblitz curves [9], the Atkin-Morain method [10] and
Schoof's algorithm [11]. The following method is one way to generate secure domain
parameters:
1. Select coefficients a and b from 𝐹𝑞 verifiably at random using Algorithm 1. Let E
be the curve 𝑦² = 𝑥³ + 𝑎𝑥 + 𝑏.
2. Compute 𝑁 = #𝐸(𝐹𝑞).
3. Verify that 𝑁 is divisible by a large prime 𝑛 (𝑛 > 2^160 and 𝑛 > 4√𝑞). If not,
then go to step 1.
4. Verify that 𝑛 does not divide 𝑞^𝑘 − 1 for each 𝑘, 1 ≤ 𝑘 ≤ 20. If not, then go to
step 1.
5. Verify that 𝑛 ≠ 𝑞. If not, then go to step 1.
6. Select an arbitrary point 𝐺′ ∈ 𝐸(𝐹𝑞) and set 𝐺 = (𝑁/𝑛)𝐺′. Repeat until 𝐺 ≠ 𝒪.
2.3.8 ECDSA Key Pair generation, public key validation and proof of
possession of the private key
A specific ECDSA key pair is characterized by the elliptic curve domain parameters D = (q, FR, a, b, G, n, h). The entity that possesses the key pair must ensure that the domain parameters are valid. ECDSA key pair generation consists of the following three steps:
1. Select a random or pseudorandom integer d in the interval [1, 𝑛 − 1].
2. Compute 𝑄 = 𝑑𝐺.
3. The public key is 𝑄, the private key is d.
The private key is a randomly generated integer and the public key is derived from it by multiplying the base point by the private key.
The validation of the public key is required to avoid known attacks and errors, such as
malicious insertion of an invalid public key and inappropriate coding or transmission.
The following algorithm (Algorithm 2) validates that a public key is consistent with the domain parameters. It does not, however, guarantee that the corresponding private key exists or that the claimed owner possesses it.
Algorithm 2.: Explicit validation of an ECDSA public key.
Input: A public key 𝑄 = (𝑥𝑄 , 𝑦𝑄) associated with valid domain parameters:
(𝑞, 𝐹𝑅, 𝑎, 𝑏, 𝐺, 𝑛, ℎ).
Output: Acceptance or rejection of the validity of 𝑄.
1. Check that Q ≠ O.
2. Check that x_Q and y_Q are properly represented elements of F_q (integers in the interval [0, q − 1]).
3. Check that Q lies on the elliptic curve defined by a and b.
4. Check that nQ = O.
5. If any check fails, then Q is invalid; otherwise Q is valid.
The ECDSA signature generation and verification is the consequence of all previously
mentioned methods. Transmission of information between two parties and proof that the
message was originated from a trusted and authentic source is described in the following.
ECDSA signature generation is the signing of a message m. An entity A with domain
parameters 𝐷 = (𝑞, 𝐹𝑅, 𝑎, 𝑏, 𝐺, 𝑛, ℎ) and associated key pair (𝑑, 𝑄) does the following:
1. Select a random or pseudorandom integer 𝑘, 1 ≤ 𝑘 ≤ 𝑛 − 1.
2. Compute kG = (x1, y1) and convert x1 to an integer x1′.
3. Compute r = x1′ mod n. If r = 0, then go to step 1.
4. Compute 𝑘−1 𝑚𝑜𝑑 𝑛.
5. Compute 𝑆𝐻𝐴256(𝑚) and convert this bit string to an integer 𝑒.
6. Compute 𝑠 = 𝑘−1(𝑒 + 𝑑𝑟) 𝑚𝑜𝑑 𝑛. If 𝑠 = 0 then go to step 1.
7. A’s signature for the message m is (𝑟, 𝑠).
In order to verify A’s signature (𝑟, 𝑠) on m, B obtains an authentic copy of A’s domain
parameters 𝐷 = (𝑞, 𝐹𝑅, 𝑎, 𝑏, 𝐺, 𝑛, ℎ) and associated public key 𝑄. It is also recommended
for B to validate the domain parameters D and the public key Q. To verify the signature
B does the following:
1. Verify that 𝑟 and 𝑠 are integers in the interval [1, 𝑛 − 1].
2. Compute 𝑆𝐻𝐴256(𝑚) and convert this bit string to an integer 𝑒.
3. Compute 𝑤 = 𝑠−1 𝑚𝑜𝑑 𝑛.
4. Compute u1 = ew mod n and u2 = rw mod n.
5. Compute 𝑋 = 𝑢1𝐺 + 𝑢2𝑄.
6. If X = O, reject the signature. Otherwise convert the x-coordinate x1 of X to an integer x1′ and compute v = x1′ mod n.
7. Accept the signature if and only if 𝑣 = 𝑟.
If a signature (r, s) on a message m was indeed generated by A, then s = k^−1(e + dr) mod n. Rearranging gives
k ≡ s^−1(e + dr) ≡ s^−1 e + s^−1 rd ≡ we + wrd ≡ u1 + u2 d (mod n).
Thus u1 G + u2 Q = (u1 + u2 d)G = kG, and so v = r as required.
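The signing and verification steps, and the identity that makes them work, can be demonstrated with a short Python sketch (Python 3.8+ for `pow(x, -1, m)`). To keep the numbers readable it uses a toy curve y^2 = x^3 + 2x + 2 over F_17 with base point G = (5, 1) of order n = 19, a textbook example rather than a curve used in practice, and SHA-256 as the hash.

```python
import hashlib

p, a, n = 17, 2, 19            # toy curve y^2 = x^3 + 2x + 2 over F_17
G = (5, 1)                     # base point of order n = 19

def ec_add(P, Q):
    """Add two curve points; None represents the point at infinity O."""
    if P is None: return Q
    if Q is None: return P
    if P[0] == Q[0] and (P[1] + Q[1]) % p == 0:
        return None
    if P == Q:
        lam = (3 * P[0] ** 2 + a) * pow(2 * P[1], -1, p) % p
    else:
        lam = (Q[1] - P[1]) * pow(Q[0] - P[0], -1, p) % p
    x = (lam * lam - P[0] - Q[0]) % p
    return (x, (lam * (P[0] - x) - P[1]) % p)

def ec_mul(k, P):
    """Double-and-add scalar multiplication kP."""
    R = None
    while k:
        if k & 1:
            R = ec_add(R, P)
        P = ec_add(P, P)
        k >>= 1
    return R

def sign(d, m):
    """Sign message m with private key d (signing steps 1-7 above)."""
    e = int.from_bytes(hashlib.sha256(m).digest(), "big") % n
    for k in range(1, n):      # deterministic scan; use a random k in practice
        r = ec_mul(k, G)[0] % n
        if r == 0:
            continue           # step 3: retry with another k
        s = pow(k, -1, n) * (e + d * r) % n
        if s == 0:
            continue           # step 6: retry with another k
        return (r, s)
    raise RuntimeError("no valid k found")

def verify(Q, m, sig):
    """Verify signature (r, s) on m against public key Q (steps 1-7)."""
    r, s = sig
    if not (1 <= r < n and 1 <= s < n):
        return False
    e = int.from_bytes(hashlib.sha256(m).digest(), "big") % n
    w = pow(s, -1, n)
    X = ec_add(ec_mul(e * w % n, G), ec_mul(r * w % n, Q))
    return X is not None and X[0] % n == r

d = 7                          # example private key
Q = ec_mul(d, G)               # public key Q = dG
sig = sign(d, b"message")
```

Running `verify(Q, b"message", sig)` reproduces the algebraic identity above: X = u1 G + u2 Q equals kG, so its x-coordinate reduces to r.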
3. The blockchain of the Bitcoin network
Bitcoin is a virtual network with separated participants, rules and a digital currency. The
network consists of participants that operate the network, follow the same rules and thus
eliminate the need for a central authority. Users of the network can transfer digital
currency, called bitcoin, to other peers. Every transaction is validated by the network's operators and added to the public ledger, called the blockchain [22]. The blockchain is a
chain of blocks, which aggregate transactions. The security of the network is maintained
despite its publicity through emergent decentralised consensus between the network
operators or mining nodes by using cryptographic hash functions and by taking the
advantages of these functions. Public keys play the role of traditional bank account numbers: they are used to generate public addresses that can receive currency. Private keys represent the ownership of funds, which can be transferred to other peers of the network. I relied on the book Mastering Bitcoin: Unlocking Digital Cryptocurrencies [6] in the following investigation of the Bitcoin blockchain.
3.1 Private Keys
A private key authorizes its owner to access and spend the bitcoin funds, which belong to
a specific account or bitcoin address. A private key is a random number generated by a
cryptographically secure source of entropy. It can be any number between 1 and 1.1568 × 10^77 − 1, slightly less than 2^256 − 1. This number is the order of the elliptic curve that secp256k1 defines. In general, the SHA-256 algorithm (see Appendix, A.1. Secure
Hash Algorithm (SHA) for further details) is used to generate this number by feeding the
algorithm with a large string of random bits. The private key is almost never shown to the
owner. Different software wallets use different methods for the generation of a private
key, like using the underlying operating system random number generators to produce
256 bits of entropy or using the user's mouse movements for generation. A robust way is to feed the collected entropy through a one-way hash function, which produces a uniformly random-looking sequence of bits.
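This generation step can be sketched with Python's `secrets` module as the cryptographically secure entropy source, hashing the entropy with SHA-256 as described; the rejection loop enforces the valid range. The constant below is the published order n of secp256k1.

```python
import hashlib
import secrets

# Order n of the secp256k1 curve: valid private keys lie in [1, n-1]
N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141

def generate_private_key() -> int:
    """Draw 256 random bits, hash them, and retry until the result
    falls in the valid range [1, n-1]."""
    while True:
        entropy = secrets.token_bytes(32)   # 256 bits of CSPRNG output
        d = int.from_bytes(hashlib.sha256(entropy).digest(), "big")
        if 1 <= d < N:
            return d

d = generate_private_key()
```

Because n is very close to 2^256, the rejection loop almost never has to repeat; it merely guards against the rare out-of-range draw.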
3.2 Public Keys
The public key is calculated from the private key using elliptic curve multiplication,
as described previously. The secp256k1 Koblitz curve (for further details, see
Appendix, A.2. The domain parameters of the Koblitz curve, secp256k1) with its
predefined properties is used to produce irreversible steps on an elliptic curve and to
create the public key. 𝐾 = 𝑘 ∗ 𝐺, where 𝑘 is the private key, 𝐺 is the generator point and
K is the public key. The operation is non-invertible in practice: it is computationally infeasible to recover the private key from the public key. Because G is the same for all bitcoin users, a private key
multiplied by G will always result in the same public key. The multiplication of the
generator point 𝐺 with 𝑘 is the same as adding 𝐺 to itself 𝑘 times in a row, according to
the mathematics of elliptic curves over finite fields. Figure 3. below shows the iterative process of drawing a tangent line at the point G, finding where it intersects the curve, and then reflecting that point on the x-axis. This procedure repeats itself for 2G, 4G, …, k·G.
Figure 3. Visualization of the multiplication of a point G by an integer k on an elliptic
curve (Source: [6])
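The derivation K = k·G can be sketched in Python with the published secp256k1 constants and textbook double-and-add point multiplication (Python 3.8+ for `pow(x, -1, m)`). This is an illustration only; production wallets use constant-time library implementations, and the example private key is an arbitrary choice.

```python
# secp256k1 domain parameters (published constants)
P  = 2 ** 256 - 2 ** 32 - 977                  # prime field modulus
Gx = 0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798
Gy = 0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8
G  = (Gx, Gy)

def ec_add(A, B):
    """Point addition on y^2 = x^3 + 7 over F_P (None = point at infinity)."""
    if A is None: return B
    if B is None: return A
    if A[0] == B[0] and (A[1] + B[1]) % P == 0:
        return None
    if A == B:
        lam = 3 * A[0] ** 2 * pow(2 * A[1], -1, P) % P
    else:
        lam = (B[1] - A[1]) * pow(B[0] - A[0], -1, P) % P
    x = (lam * lam - A[0] - B[0]) % P
    return (x, (lam * (A[0] - x) - A[1]) % P)

def pubkey(k: int):
    """K = k*G via double-and-add: G is added to itself per set bit of k."""
    R, A = None, G
    while k:
        if k & 1:
            R = ec_add(R, A)
        A = ec_add(A, A)
        k >>= 1
    return R

K = pubkey(112233445566778899)   # arbitrary example private key
```

Double-and-add needs only about 256 doublings and additions instead of k repeated additions, which is what makes K = kG computable while the reverse direction stays infeasible.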
3.3 Bitcoin addresses
A bitcoin address is like a bank account number. It represents an account that is eligible
for receiving bitcoins. It can be shared with anyone who wants to send bitcoins to the
owner of the account and it is also publicly available in bitcoin’s public ledger. Anyone
can query a specific address and the corresponding holdings; however, the account owner remains anonymous. Bitcoin addresses are produced from the public keys and begin with the
digit 1. SHA-256 and RIPEMD-160 hash functions are used in combination with the
public key K to produce a bitcoin address A. Equation (1) represents the generation of
A.
𝐴 = 𝑅𝐼𝑃𝐸𝑀𝐷160(𝑆𝐻𝐴256(𝐾)) (1)
Because the outer function is RIPEMD-160 [27], the resulting address is a 160-bit (20-byte) number. For the convenience of the user, Base58Check encoding is used by software wallets to represent a bitcoin address in a human-readable and shorter format. This type of encoding was developed for use in bitcoin. It is based on Base-64, which uses 26 lowercase letters, 26 capital letters, 10 numerals and two additional characters. Base-58 is Base-64 without 0 (number zero), O (capital o), l (lower L), I (capital i) and the symbols +, /. Base58Check additionally introduces a four-byte built-in
wallets check mistyped bitcoin addresses, so they do not get accepted as a valid
destination for a transaction, therefore the funds cannot get lost this way.
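The Base58Check step can be sketched with the Python standard library. The checksum is the first four bytes of a double SHA-256 over the version byte plus payload; the 20-byte input below is a placeholder of all zeros, where a real wallet would supply RIPEMD160(SHA256(K)).

```python
import hashlib

# Base-58 alphabet: no 0, O, l, I and no +, /
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58check(payload: bytes, version: bytes = b"\x00") -> str:
    """Encode version byte + payload + 4-byte double-SHA256 checksum."""
    data = version + payload
    checksum = hashlib.sha256(hashlib.sha256(data).digest()).digest()[:4]
    data += checksum
    # Each leading zero byte is encoded as the character '1'
    leading = len(data) - len(data.lstrip(b"\x00"))
    num = int.from_bytes(data, "big")
    out = ""
    while num:
        num, rem = divmod(num, 58)
        out = ALPHABET[rem] + out
    return "1" * leading + out

addr = base58check(b"\x00" * 20)   # placeholder hash160 of all zeros
```

Flipping any character of the result changes the recomputed checksum, which is how wallets detect mistyped addresses before accepting them as a destination.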
3.4 Transactions
Bitcoin transactions represent a transfer of value from one party to another. Like traditional currencies, bitcoin can be divided into smaller units, called satoshis. One bitcoin is equal to 10^8 satoshis. Bitcoin's decentralized system is resistant to traditional inflationary effects, because the maximum number of bitcoins is fixed at 21 million. However, not all of this quantity is circulating yet. New bitcoins are added to the system by the activity of miners, while they validate the genuineness of spendable transactions, until the circulating bitcoins reach 21 million.
When someone would like to send bitcoins to another party, the transaction has to be signed cryptographically by the appropriate private key, representing ownership. The
signed transaction is then propagated to a few nodes of the bitcoin network. These nodes validate the signature, and if the transaction is validated successfully, it is broadcast to more peers until it reaches every node. Transactions do not contain any confidential information about the users, so they can be propagated through insecure channels like NFC, Wi-Fi, etc. Once a transaction becomes valid, it is sent to a common pool that collects
transactions, called the memory pool. Operators of the bitcoin networks, called miners,
compete to summarize the collected transactions in a block, which is then added to the
blockchain, also called a public ledger. It is public, because every peer in the network can
check and query information about the anonymous transactions. The only thing that matters is that a person who wants to spend bitcoins actually has the right to spend them. The incentive behind the reliable activity of the miners is the newly created bitcoins issued with every new block. After every 210,000 blocks, the reward amount is halved, until the total number of circulating coins reaches the fixed amount. When the network
was launched with the mining of the so-called genesis block, the reward was 50 bitcoins
for every new block. As of 22 October 2018, the reward is 12.5 bitcoins.
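The reward schedule described above can be expressed in a few lines of Python, with values in satoshis so that 50 BTC is 5,000,000,000 satoshis; the example height of 546,000 is approximately the chain height in October 2018.

```python
def block_reward(height: int) -> int:
    """Block subsidy in satoshis: 50 BTC at launch, halved every
    210,000 blocks until it eventually rounds down to zero."""
    halvings = height // 210_000
    return (50 * 100_000_000) >> halvings

genesis_reward = block_reward(0)        # 50 BTC for the genesis era
reward_2018 = block_reward(546_000)     # two halvings in: 12.5 BTC
```

The right shift halves the integer subsidy exactly, which is also why the total supply converges to just under 21 million coins.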
3.4.1 Structure of a transaction
In general, there are two kinds of transactions: normal and Coinbase transactions. Normal transactions are used by parties to transfer value to each other. These transactions have inputs and outputs. Unlike traditional bank accounts, these input and output values belong to private keys instead of identities. Once a private key is lost, the corresponding funds are also lost forever, because of the huge address space. A Coinbase transaction is the first transaction in every new block and it only has outputs, usually to the address of the miner who successfully created the block.
Table 2. describes the data structure of a transaction.
Table 2. Data structure of a transaction
Field | Description | Size
Version | Version control for software updates and developments | 4 bytes
Input Counter | Number of inputs | 1-9 bytes (VarInt)
Inputs | Transaction inputs | Variable
Output Counter | Number of outputs | 1-9 bytes (VarInt)
Outputs | Transaction outputs | Variable
Locktime | Unix timestamp or block number | 4 bytes
Transactions form chains of arbitrary length. They lock spendable bitcoins, which change their owner from time to time. The chain can be inspected with an online block explorer by following transaction inputs recursively. The lock time field defines timing conditions for when the transaction can be added to the blockchain. If the field's value is above 500 million, it is interpreted as a Unix Epoch timestamp and the transaction is not included in the blockchain prior to the specified date. If the lock time is between zero and 500 million, it is interpreted as a block height (blocks are indexed by integer numbers, called the block height), which specifies the block index from which the transaction can be included in the blockchain.
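The interpretation rule for the lock time field can be captured in a small helper; the 500-million threshold is the protocol's rule, while the function itself is an illustrative sketch.

```python
LOCKTIME_THRESHOLD = 500_000_000   # below: block height; at or above: timestamp

def interpret_locktime(locktime: int):
    """Classify a transaction's lock time field per the rule above."""
    if locktime == 0:
        return ("none", 0)                      # no time lock at all
    if locktime < LOCKTIME_THRESHOLD:
        return ("block_height", locktime)       # earliest including block
    return ("unix_timestamp", locktime)         # earliest Unix Epoch time
```

A value of zero, the common case, means the transaction can be mined immediately.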
3.4.2 Transaction inputs and outputs
At the lowest level, every circulating quantity of bitcoin is locked to its owner by unspent transaction outputs, called UTXOs. UTXOs are the basic elements of every transaction. Hundreds or thousands of UTXOs can belong to an identity who wants to spend bitcoins. Wallet software, which provides convenient methods for using the bitcoin network, collects all UTXOs that belong to a specific person to display the available balance. The bitcoin network nodes also maintain a database that contains every UTXO and ownership pair. When a transaction is created, it consumes an adequate amount of UTXOs, unlocks them with the signature of the current owner, creates new UTXOs and locks them to the new owners. Although transactions are anonymous, with sophisticated methods the frequent use of the same bitcoin addresses can lead to a traceback to the owner. In consequence, mature wallet software takes advantage of different public address creation methods and the available address space by creating a change address for every transaction. This change address will hold the remaining value that is not spent by the user.
UTXOs are tracked by full-node bitcoin clients and stored in a database held in memory, called the UTXO pool. New transactions are created by consuming one or more of these unspent outputs. Locking scripts are used to specify the conditions which must be satisfied to spend the outputs, or coins. Table 3. describes the data fields of a transaction output.
Table 3. Data structure of a transaction output
Field | Description | Size
Amount | Transferable value denominated in satoshis | 8 bytes
Locking-script size | Locking script length in bytes | 1-9 bytes (VarInt)
Locking-script | A script that defines the conditions required to spend the output | Variable
Transactions are identified by their hashes, which are produced by the SHA-256 hash function. A transaction's inputs are pointers to UTXOs: each references a transaction hash and an output index that identifies the UTXO record in the blockchain. UTXOs can only be spent if the unlocking script satisfies the required conditions. The script contains a signature which proves ownership of the address that the UTXOs belong to. Table 4. describes the data fields of a transaction's input.
Table 4. Data structure of a transaction input
Field | Description | Size
Transaction hash | Pointer to the transaction containing the UTXO being spent | 32 bytes
Output index | The index number of the UTXO, starting from 0 | 4 bytes
Unlocking-script size | Unlocking-script length in bytes | 1-9 bytes (VarInt)
Unlocking-script | A script that satisfies the spending conditions | Variable
Sequence number | Currently disabled feature | 4 bytes
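Several fields in Tables 2-4 use bitcoin's variable-length integer (VarInt) encoding: values below 0xfd fit in a single byte, while larger values get a one-byte prefix (0xfd, 0xfe or 0xff) followed by 2, 4 or 8 little-endian bytes. A sketch of the encoder:

```python
def encode_varint(n: int) -> bytes:
    """Encode n with bitcoin's variable-length integer scheme."""
    if n < 0xFD:
        return n.to_bytes(1, "little")          # 1 byte, value itself
    if n <= 0xFFFF:
        return b"\xfd" + n.to_bytes(2, "little")  # prefix + 2 bytes
    if n <= 0xFFFF_FFFF:
        return b"\xfe" + n.to_bytes(4, "little")  # prefix + 4 bytes
    return b"\xff" + n.to_bytes(8, "little")      # prefix + 8 bytes
```

This is why the counter fields occupy "1-9 bytes": a small transaction with a handful of inputs needs only one byte per counter.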
3.4.3 Transaction fees
Miners compete for bitcoin rewards that they earn by the successful summarization of the
transactions into a new block. The new block is then appended to the chain of blocks. The
winning miner also earns transaction fees for each transaction that is summarized into the
block. Mining fees and block rewards serve as an incentive to prevent any malicious
activity or abuse against the network. Transaction fees are calculated based on the transaction size in kilobytes. However, the network users who spend their bitcoins can also determine the fees that they are willing to pay to the miners. Miners prioritize transactions by fee, so common market forces between the peers prevail. There is a minimum fee that is currently fixed at 0.0001 bitcoin.
Transaction fees are calculated as the sum of the input UTXOs minus the sum of the output UTXOs, as described by Equation (2).
Fees = sum(inputs) − sum(outputs) (2)
For this reason, wallet software calculates the fees based on the current market
conditions that determine the prevailing fees on the market. In most applications fees are
also adjustable by the users, giving them the opportunity to prioritize their urgency.
The age of the UTXO-s that are being spent in a transaction input also determines the
priority of the transaction.
The priority is calculated with Equation (3).
Priority = sum(input value × input age) / transaction size (3)
The value is denominated in satoshis and the age of an input is measured in blocks elapsed since the transaction was recorded on the network; the age therefore expresses how many blocks deep in the blockchain the transaction is. High-priority transactions can be validated without any fees if the transaction fits into the remaining space of the block. The original block size was 1 megabyte, but with the adoption of a new bitcoin protocol extension called SegWit, the effective size is increased to about 2 megabytes. A transaction is considered high priority if its priority exceeds 57,600,000, which corresponds to one bitcoin (10^8 satoshis) aged one day (approximately 144 blocks) in a transaction with a size of 250 bytes, as described by Equation (4).
(100,000,000 satoshis × 144 blocks) / 250 bytes = 57,600,000 (4)
In every bitcoin block the first 50 kilobytes are reserved for high priority transactions,
without the consideration of the transaction fees. The remaining space is filled with
transactions that pay the minimum fee, prioritizing the highest fees on a per kilobyte basis.
The transactions that remain in the memory pool get older as new blocks are added to the
chain.
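Equations (2)-(4) translate directly into code. The UTXO values and ages below are made up for illustration; the 57,600,000 threshold comes from Equation (4).

```python
def transaction_fee(input_values, output_values):
    """Equation (2): fee = sum(inputs) - sum(outputs), in satoshis."""
    return sum(input_values) - sum(output_values)

def transaction_priority(inputs, tx_size_bytes):
    """Equation (3): priority = sum(value * age in blocks) / size in bytes."""
    return sum(value * age for value, age in inputs) / tx_size_bytes

HIGH_PRIORITY = 57_600_000   # threshold from Equation (4)

# Illustrative transaction: two inputs spent into one output
fee = transaction_fee([60_000_000, 50_000_000], [100_000_000])

# One input of 1 BTC aged 144 blocks in a 250-byte transaction
prio = transaction_priority([(100_000_000, 144)], 250)
```

The second example reproduces Equation (4) exactly, so `prio` lands right on the high-priority boundary.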
3.4.4 Transaction validation conditions
A bitcoin node verifies several criteria to consider a transaction valid. If the transaction satisfies all conditions, it is propagated to the connected nodes; otherwise it is discarded. The following list of criteria is validated when a transaction is received by a node:
• The transaction’s syntax and data structure must be correct
• Neither the list of inputs nor the list of outputs is empty
• The transaction size in bytes is less than MAX_BLOCK_SIZE
• Each output value and the total must be within the allowed range of values (more
than 0 and less than 21 million coins)
• None of the inputs is a Coinbase input (Coinbase transactions must not be relayed)
• nLockTime is less than or equal to INT_MAX
• The transaction size in bytes is greater than or equal to 100
• The number of signature operations contained in the transaction is less than the
signature operation limit
• The unlocking script (called scriptSig) can only push numbers on the stack and the locking script (called scriptPubkey) must match the standard forms (rejection of nonstandard transactions)
• A matching transaction in the pool or in a block in the main branch must exist
• For each input, if the referenced output exists in any other transaction in the pool, reject the transaction (prevention of double spending)
• For each input, look in the main branch and the transaction pool to find the
referenced output transaction. If the output transaction is missing for any input,
this will be an orphan transaction. Add this transaction to the orphan transactions
pool, if it is not already in the pool
• For each input, if the referenced output transaction is a Coinbase input, it must
have at least COINBASE_MATURITY (100) confirmations
• For each input there must be a referenced output that is not spent
• Reject if the sum of input values < sum of output values
• Reject if the transaction fee would be too low to get into an empty block
• The unlocking scripts for each input must validate against the corresponding
output locking scripts
3.4.5 Orphan transactions
Transactions form a chain: the parent transaction's outputs are spent by the child transaction, the child's outputs by the grandchild, and so on. There are different kinds of complex transactions, like CoinJoin transactions, where transactions are joined together by multiple parties to protect their privacy. In such cases, chains of transactions that depend on each other can arise. Transactions are transmitted between peers and do not always arrive in the same order. Because the child's signature is required before the parent is signed, a situation can emerge when a child references a parent transaction that is not yet known to the node. Instead of rejecting the transaction, the node puts it into a temporary pool known as the orphan pool. The transaction then waits in the pool until its parent arrives with the correct UTXO reference. The orphan pool is stored in memory and for this reason the total number of
transactions that can be stored is fixed by a constant called
MAX_ORPHAN_TRANSACTIONS.
3.5 The data structure of the blockchain
The blockchain forms a back-linked list of blocks, each containing an aggregation of transactions. Each block in the chain is identified by a hash, generated using the SHA-256 hash function, and by an integer index called the block height. Each block contains a reference to the previous block, called the parent block, within its header field. The links pointing back to the previous hashes constitute a chain in which every element is cryptographically connected to the others. The hashes, from the most basic level of aggregating the transactions up to linking the blocks to each other, are calculated based on the previous values. Therefore, if anyone tries to forge a value in a transaction or anywhere in the blocks, all subsequent links of the chain also change. Such changes are detected immediately by the validating nodes and the forged data is rejected. The property that every piece of information is encapsulated in a chain and relies on previous elements provides bitcoin with strong security.
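The tamper-evidence property can be demonstrated with a simplified hash chain; the "blocks" here are plain strings rather than real bitcoin blocks, and the genesis parent hash is an arbitrary placeholder.

```python
import hashlib

def block_hash(prev_hash: str, data: str) -> str:
    """Hash a simplified block: the parent's hash is part of the input,
    so changing any ancestor changes every later hash."""
    return hashlib.sha256((prev_hash + data).encode()).hexdigest()

def build_chain(blocks):
    """Link a sequence of block payloads into a chain of hashes."""
    hashes, prev = [], "00" * 32          # placeholder genesis parent
    for data in blocks:
        prev = block_hash(prev, data)
        hashes.append(prev)
    return hashes

original = build_chain(["tx: A->B 1", "tx: B->C 2", "tx: C->D 3"])
tampered = build_chain(["tx: A->B 9", "tx: B->C 2", "tx: C->D 3"])
```

Forging the very first payload changes its hash, and because each later block hashes its parent's hash, every subsequent hash changes too, which is exactly what validating nodes detect.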
3.5.1 A block’s data fields
A block is characterized by four data fields, each with a different length and meaning. The Block Size field contains the size of the block in bytes. The Block Header contains several fields whose combined size is 80 bytes. The Transaction Counter is
a variable integer and indicates the number of transactions that are settled in a block. The
Transactions field contains the recorded transactions for the block with a variable length.
Table 5. describes a block’s data fields.
Table 5. Data fields of a bitcoin block
Field | Description | Size
Block Size | The size of the block in bytes | 4 bytes
Block Header | Different fields that form the block's header | 80 bytes
Transaction Counter | Number of transactions in the block | Variable, from 1 to 9 bytes
Transactions | The transactions that construct the Merkle Tree | Variable
3.5.2 Block Header
The block header consists of six fields, each containing different metadata. Table 6. represents the data fields of a block's header.
Table 6. Data fields of a bitcoin block's header
Field | Description | Size
Previous Block Hash | A hash reference to the previous block in the chain | 32 bytes
Merkle Root | Hash of the Merkle Tree's root, summarizing the block's transactions | 32 bytes
Timestamp | The estimated creation time of this block (Unix Epoch) | 4 bytes
Difficulty Target | Difficulty target of the proof of work algorithm | 4 bytes
Nonce | A counter used for the proof of work algorithm | 4 bytes
Version | Software version number | 4 bytes
3.5.3 Merkle Trees
A block summarizes transactions in a data structure called Merkle Tree. A Merkle Tree
is a Binary Hash Tree which is used to efficiently summarize and verify the integrity of
large datasets. Merkle Tree’s structure is similar to the mathematical tree structure, except
it contains cryptographic hashes.
Transactions collected by the network have to be validated before they are encapsulated in a block. A Merkle tree is constructed in a recursive manner. Transactions that are collected in a pool are used as inputs to a one-way cryptographic hash function, usually the Secure Hash Algorithm 2 with 256-bit output (SHA-256). After hashing the transactions individually, the hashes are concatenated in binary pairs and the concatenations are hashed again. This process repeats recursively until only one hash remains: the Merkle tree's root.
Let's consider a simple example by constructing a Merkle tree. There are four transactions collected in the pool: A, B, C and D. Each transaction's data is hashed by applying SHA-256 twice:
HA = SHA256(SHA256(Transaction A))
HB = SHA256(SHA256(Transaction B))
The same procedure is repeated on every remaining transaction, in this example on C and D. These hashes are the leaves of the Merkle tree. A parent node is constructed from every binary pair by concatenating the two 32-byte hashes, producing a 64-byte string. On this string SHA-256 is applied twice to produce the parent node's hash, a 32-byte string.
HAB = SHA256(SHA256(HA + HB))
This process is repeated for every remaining leaf pair and then for the parents as well, as illustrated in Figure 4.
Figure 4. Aggregation of transactions in a Merkle tree structure
The top node of the Merkle tree is the Merkle root and it is stored in the block header, summarizing all of the underlying transactions' data. No matter how many transactions are included in a block, the Merkle root always summarizes them in 32 bytes.
The recursive construction of a Merkle tree generalizes to any even number of transactions, so trees of any size can be built. If there is an odd number of transactions, the last transaction hash is duplicated to create an even number of leaves, resulting in a balanced tree.
This data structure is very efficient for verification, because at most 2 × log2(N) calculations are needed to check whether a specific element is included in the tree.
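The recursive construction above, including the duplication of the last hash for an odd number of leaves, can be sketched as follows; the transaction payloads are placeholder byte strings.

```python
import hashlib

def sha256d(data: bytes) -> bytes:
    """Double SHA-256, as used throughout bitcoin."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def merkle_root(transactions):
    """Reduce a list of raw transactions to the 32-byte Merkle root."""
    level = [sha256d(tx) for tx in transactions]      # the leaf hashes
    while len(level) > 1:
        if len(level) % 2 == 1:                       # odd count: duplicate last
            level.append(level[-1])
        level = [sha256d(level[i] + level[i + 1])     # hash concatenated pairs
                 for i in range(0, len(level), 2)]
    return level[0]

root = merkle_root([b"tx A", b"tx B", b"tx C", b"tx D"])
```

However long the input list, the loop halves it at each level, so the result is always a single 32-byte digest.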
3.6 Decentralised consensus through proof of work
Bitcoin mining is a process through which transactions are validated and added to the
public ledger by the network's mining nodes. Mining is incentivized by mining rewards that the competitors can earn with every new block creation. This reward is halved approximately every four years, or 210,000 blocks, until the reward reaches 1 satoshi. After about the year 2140, new bitcoins will no longer be issued and miners will exclusively receive rewards through mining fees. The main purpose of mining is to secure the bitcoin network
by forcing the network’s participants to individually validate every transaction. Validated
transactions become part of a block that is added to the blockchain. From then on, new owners of bitcoin can spend their received currency. New blocks are added to the blockchain by miners, who solve cryptographic hash puzzles by computing trillions of hashes, searching for the appropriate hash that matches the network's so-called difficulty target. This process is called proof of work, an algorithm through which decentralised consensus emerges and propagates through the bitcoin network.
Bitcoin has no central authority. Every node stores a copy of the public ledger it can trust. The decentralised consensus emerges from the independent operation of the mining nodes. Although their operation is independent, they follow the same rules. Mining nodes independently verify each transaction based on the list of criteria described in section 3.4.4. The transactions are aggregated into new blocks and a field value is added to the
block header which proves that the miner satisfied the work that is required to add a new
block. Every new block is verified by every node, then the new block is added to each
miner’s chain independently. The nodes select the main chain with the most cumulative
computation.
3.6.1 Aggregation of transactions
Transactions are validated immediately when they are received. Valid transactions are then added to the memory pool, where they are kept until they are mined. When a node receives a new block, it checks whether transactions in the memory pool are included in the new block and, if so, removes them. Transactions are prioritized by the age of the
UTXO that is being spent in their inputs. Transactions with high priority can be sent
without any fees. In every block the first 50 kilobytes of the transaction space are reserved
for high priority transactions, regardless of fees. The rest of the block is filled with
transactions that pay the minimum fee, preferring those with the highest fee on a per
kilobyte basis. If there is a remaining space in the block it can be filled with transactions
without fees. Transactions that remain in the memory pool get older, therefore their
priority will increase over time. Transactions are aggregated in a Merkle tree structure as described in section 3.5.3. When a node solves the hash puzzle, it constructs a generation transaction, or Coinbase transaction, that has no inputs and whose output references the miner's bitcoin address. The reward is calculated based on the block height and on the halving that occurs every 210,000 blocks. The mining fees are added to the reward and together they represent the output of the Coinbase transaction. The generation transaction has the
following data structure, as described with Table 7.
Table 7. Data structure of a Coinbase transaction
Field | Description | Size
Transaction hash | All bits are zero, because it is not a transaction hash reference | 32 bytes
Output index | All bits are ones | 4 bytes
Coinbase data size | Length of Coinbase data | 1-9 bytes (VarInt)
Coinbase data | Arbitrary data used for extra nonce and mining tags | Variable
Sequence number | All bits are ones | 4 bytes
3.6.2 Proof of work
Mining is the process through which trillions of hash values are created with the SHA-256 hash function [22]. After transactions are aggregated by a node, it creates a block header with the appropriate fields as described in section 3.5.2. The node then repeatedly hashes the block header, changing the Nonce field at every iteration, until the hash matches a criterion. Because the output of the hash function is unpredictable, the solution can only be found by trial and error, similar to a brute-force approach. The criterion that the hash value of the block header must satisfy is the network's difficulty target. The difficulty target is represented in a coefficient/exponent format, where the first two hexadecimal digits represent the exponent and the next six hexadecimal digits represent the coefficient. Equation (5). is used to calculate the difficulty target:
target = coefficient × 2^(8 × (exponent − 3)) (5)
Let's consider an example. A block explorer website, https://www.blockchain.com/, is used to search for block 318,516. On the site the bits field specifies 405675096, which in hexadecimal is 0x182E1C58. The exponent is 0x18 and the coefficient is 0x2E1C58. Using the formula, the target is 3,021,912 × 2^168 in decimal. Represented on 256 bits, this value has 66 leading zero bits. The miner who created block 318,516 had to produce a hash value that is less than this target, i.e., a hash value with at least 66 leading zero bits. While creating a new block, the winning miner produces hash values from the block header, varying the nonce field of the header until the hash value is less than the network's difficulty target. When the correct nonce is found, the block is created and propagated through the network, where each node independently validates the new block and adds it to its own blockchain.
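Equation (5) and the worked example can be reproduced with a short decoder; the bits value 0x182E1C58 is the one quoted above for block 318,516.

```python
def bits_to_target(bits: int) -> int:
    """Decode the compact 'bits' field: the top byte is the exponent,
    the remaining three bytes are the coefficient (Equation (5))."""
    exponent = bits >> 24
    coefficient = bits & 0xFFFFFF
    return coefficient * 2 ** (8 * (exponent - 3))

target = bits_to_target(0x182E1C58)             # block 318,516
leading_zero_bits = 256 - target.bit_length()   # zeros a valid hash must have
```

A valid block hash, read as a 256-bit integer, must be below this target, which for this block means at least 66 leading zero bits.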
The bitcoin network's difficulty target is dynamically adjusted by the protocol based on the computational or hashing power that operates the system. Blocks are created on average every 10 minutes and the difficulty is adjusted to keep this pace. If block creation is slower, the difficulty decreases; otherwise it increases. Because the difficulty is independent of the number of transactions, the hashing power reflects market forces as new miners enter the market to earn the reward.
3.6.3 Validation of a new block
Once a mining node finds a solution for the hash puzzle, it propagates the new block
through the network. The peers independently verify the block by checking the following
criteria:
• The block data structure is valid
• The block header hash is less than the target difficulty
• The block timestamp is less than two hours in the future
• The block size is within the limit
• The first transaction is a Coinbase transaction
• All transactions are valid
If a block does not satisfy the conditions, each node rejects it. If a block is rejected, the competition restarts; otherwise the race begins for the next block. The individual validation of every transaction and block enforces a common consensus among the nodes, preventing any node from cheating the system.
- 27 -
The decentralised consensus is achieved through the rules that every node follows to
validate transactions and blocks.
3.6.4 Blockchain forks
The bitcoin network’s topology is a loosely connected mesh like object where every node
is interconnected with a few other peers. Because peers are not connected with every other
node, the information propagation is limited in time. A situation can consist for a short
time, when two different newly mined blocks are added to the same chain, or in other
words, two different chains compete to be considered as the main chain. Due to the
bitcoin’s network protocol this situation happens on average every week. However,
according to the protocol nodes must select the longest chain with the most cumulative
difficulty that represents the most proof of work. Blockchain forks under normal
conditions are temporary inconsistencies between versions of the blockchain, which are
resolved by the reconvergence as new blocks are added to one of the forks. Blockchain
forks can also occur when there is an upgrade in the network’s protocol and a considerable
percentage of the nodes decide to follow the new rules. In this case, there is no
reconvergence and both chains will exist. This incident is called a hard fork.
4. Analyzing Bitcoin’s blockchain with deep learning
algorithms
Machine learning is a data analysis method that automates analytical model building. The purpose of machine learning algorithms is to let software applications predict outcomes without being explicitly programmed to do so. These algorithms operate on huge datasets, each containing millions of records described by several features. The learning algorithms require preprocessed datasets that fulfil the requirements of the specific algorithm in order to operate correctly and produce meaningful results. Several model architectures exist that can solve different mathematical problems by recognising hidden patterns in the datasets.
In this thesis I utilize the Bitcoin blockchain data in order to predict bitcoin's price and volatility. Each record of the dataset belongs to a specific date, thus it is a temporal dataset. It is not a time-series dataset, because the time intervals between the samples are not equal. The dataset must be split into train, validation and test sets in order to train and evaluate machine learning models. Because of the temporal property of the dataset, the three separate sets must be ordered sequentially in time. Different models are trained on the train set and, during training, their performance is evaluated on the validation set. After each model has finished the training process, it is applied to the test set in order to make predictions for bitcoin's price and volatility values. The train, validation and test sets contain 51,712, 6,463 and 6,459 records, covering 2017.01.02. to 2018.01.02., 2018.01.02. to 2018.02.15. and 2018.02.15. to 2018.04.05., respectively.
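A temporally ordered split like the one above can be sketched with pandas (the timestamp column name is an assumption for illustration):

```python
import pandas as pd

def temporal_split(df: pd.DataFrame, n_train: int, n_val: int):
    """Split a chronologically ordered DataFrame into train, validation
    and test sets without shuffling, so each set covers a later period
    than the previous one."""
    df = df.sort_values("creation_time")   # hypothetical timestamp column
    train = df.iloc[:n_train]
    val = df.iloc[n_train:n_train + n_val]
    test = df.iloc[n_train + n_val:]
    return train, val, test
```

With the thesis proportions this would be called with n_train=51712 and n_val=6463.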
In this thesis, first a graph representation of the transactions in each block is examined in order to forecast the aforementioned target values. The main idea behind this approach is the assumption that if unique structures characterise each block's transaction network under different market conditions, then bitcoin's price and volatility could be predicted from these transaction networks.
4.1 Collecting Bitcoin’s blockchain data
Data mining is a process through which data is collected, processed and transformed in order to feed machine learning algorithms with properly formatted data.
4.1.1 Blockchain.com API
Blockchain.com is a bitcoin block explorer website that provides an application programming interface (API)1. An API is a set of standardized requests that defines the proper way for an application to request services from another application. Because the bitcoin blockchain's size is hundreds of gigabytes, I used Blockchain.com's API to query bitcoin blocks.
Python is the most widely used language for machine learning problems, therefore I exploited its capabilities in this research to achieve my goals. Blocks, transactions, addresses and balances can be queried through blockchain.com's API in different ways. At first, I queried every block from 2017.01.02. to 2018.04.05. There is a specific HTTPS request provided by the API which enables users to query blocks. I wrote a function that generates datetime objects using the Python datetime library. The function generates datetime objects from the start to the end date, day by day, and then converts the dates to milliseconds, which is the date format the API query requires. The function then returns a list of millisecond timestamps. I made an HTTPS request for every element of the list to get the blocks mined on the specified days. For one call, the API responds with the blocks' heights, block header hashes and approximate creation times. The time property is in UNIX epoch format, a common format worldwide that measures the time elapsed since 1970.01.01. 00:00:00 UTC. The block hashes are needed for further data collection. Another API call that blockchain.com provides is a request through which individual blocks can be queried by their block header hashes. The individual blocks contain all the information described in the previous sections. Each block contains a list of transactions, with further lists of the inputs and outputs belonging to each transaction. I separated these transactions from each block's data in order to build and visualize them as mathematical graph structures.
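The date generation and block queries described above can be sketched as follows (the endpoint URL follows the blockchain.info API documentation; the response fields should be checked against it):

```python
import datetime

def daily_timestamps_ms(start: datetime.date, end: datetime.date):
    """Yield one UNIX-epoch millisecond timestamp per day, the format
    the block-listing query expects."""
    day = start
    while day <= end:
        dt = datetime.datetime(day.year, day.month, day.day,
                               tzinfo=datetime.timezone.utc)
        yield int(dt.timestamp() * 1000)
        day += datetime.timedelta(days=1)

def blocks_for_day(ms: int):
    """Fetch the summaries (height, hash, time) of blocks mined
    on the given day."""
    import requests  # third-party HTTP library
    url = f"https://blockchain.info/blocks/{ms}?format=json"
    return requests.get(url, timeout=30).json()
```

Each block summary's hash can then be passed to the per-block endpoint to download the full transaction lists.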
4.1.2 Drawing transaction networks with NetworkX
NetworkX is a Python package for the creation, manipulation and study of the structure, dynamics and functions of mathematical graphs [13]. Almost every graph structure and algorithm used for analyzing networks is implemented in this library. I used NetworkX to build a transaction network from each bitcoin block's transactions, separately.
1 https://blockchain.info/api downloaded at: 2018.09.20
I chose a class called MultiDiGraph, which is a graph type with directed edges that permits multiple directed edges between the same pair of nodes. In a bitcoin transaction, the inputs and outputs do not correspond to each other explicitly. Inputs are collected from remaining UTXOs and can be spent to different destinations, like multiple addresses and a change address that is used to provide more anonymity for the user. For this reason, I added an auxiliary node to every transaction that collects inputs from and emits outputs to addresses. Figure 5. illustrates the problem and my solution.
Figure 5. Illustration of a bitcoin transaction
Nodes of the MultiDiGraph network represent bitcoin addresses and edges represent transactions between them. The edges also store the bitcoin amount transferred between addresses, although these cannot be visualized efficiently because of the density of the networks.
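The auxiliary-node construction can be sketched with NetworkX as follows (the simplified (address, amount) pairs stand in for the real API fields):

```python
import networkx as nx

def add_transaction(g: nx.MultiDiGraph, tx_id: str, inputs, outputs):
    """Insert one transaction through an auxiliary node: input addresses
    point to the node, the node points to output addresses, and each
    edge stores the transferred amount."""
    aux = f"tx:{tx_id}"                    # auxiliary transaction node
    for addr, amount in inputs:
        g.add_edge(addr, aux, value=amount)
    for addr, amount in outputs:
        g.add_edge(aux, addr, value=amount)

g = nx.MultiDiGraph()
add_transaction(g, "abc123",
                inputs=[("addr_in1", 0.01615), ("addr_in2", 0.1897)],
                outputs=[("addr_out1", 0.2), ("addr_change", 0.00585)])
```

The amounts here mirror the example transaction discussed below Figure 6.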
Figure 6. depicts an example bitcoin transaction with two inputs and two outputs, which I drew and visualized with NetworkX and Matplotlib.
Figure 6. Illustration of a Bitcoin transaction created with NetworkX
It can be seen that the locking script, represented by the middle red node with a long character sequence, collects two UTXOs of 0.01615 and 0.1897 bitcoins, which were sent to two output addresses. One output address received 0.2 bitcoins and the other received the remainder of the input UTXOs. Presumably, the latter address was the change address, where the original owner of the coins kept the unspent remainder of his UTXOs.
I drew the graphs with the Fruchterman-Reingold algorithm, which is implemented in NetworkX. It is a force-directed algorithm; the goal of force-directed algorithms is to display graphs with a huge number of nodes and edges in an aesthetically pleasing way [14]. The algorithm places the nodes in two- or three-dimensional space so that as few edges as possible intersect each other. This is achieved by applying Hooke's law to the nodes: every pair of nodes repels each other, while adjacent nodes also attract each other. Alternatively, the acting force between the nodes can be calculated with the Kamada-Kawai algorithm, which assigns force values to node pairs proportional to the shortest path between them. After the system has converged to an equilibrium state, adjacent nodes have edges of roughly equal length, while non-adjacent nodes are placed farther from each other. In total, I drew 64,636 pictures of the transaction networks in Bitcoin blocks. Each picture depicts an individual block. Figure 7. shows some of the pictures with different densities.
Figure 7. Transaction networks in Bitcoin blocks
These pictures are trimmed with the Python Imaging Library (PIL), because NetworkX's plotting originally left superfluous white space at the edges. Each picture is identified by the corresponding block's height. The files were saved with the png extension, in 1024x1024 resolution.
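Drawing and trimming one block's picture can be sketched like this (layout and styling parameters are illustrative, not the exact settings used):

```python
import matplotlib
matplotlib.use("Agg")                      # render without a display
import matplotlib.pyplot as plt
import networkx as nx
from PIL import Image, ImageChops

def save_block_graph(g: nx.MultiDiGraph, height: int) -> str:
    """Draw a transaction graph with the Fruchterman-Reingold (spring)
    layout, save it as <height>.png, then crop the white margins."""
    pos = nx.spring_layout(g)              # Fruchterman-Reingold layout
    fig = plt.figure(figsize=(10.24, 10.24), dpi=100)   # 1024x1024 px
    nx.draw(g, pos, node_size=20, width=0.5)
    path = f"{height}.png"
    fig.savefig(path)
    plt.close(fig)
    img = Image.open(path).convert("RGB")
    background = Image.new("RGB", img.size, (255, 255, 255))
    bbox = ImageChops.difference(img, background).getbbox()
    if bbox:                               # crop to the non-white region
        img.crop(bbox).save(path)
    return path
```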
4.1.3 Additional features from blocks
I created additional features from each block’s data. These features are described in
Table 8.
Table 8. Features calculated from Bitcoin blocks
• Block height: the index of the block in the time order of creation
• Creation time: an approximation of when the block was created
• Number of transactions: the number of transactions contained in the block
• Block size: the size of the block in kilobytes
• Nonce: the data that solves the hash puzzle
• Block hash: the header hash of the block
• Average transaction size: the average transaction size in the block
• Mining fee: the mining fee denominated in Bitcoin
• Mining fee in USD: the mining fee denominated in USD
• All reward: the block creation reward plus the mining fee, denominated in Bitcoin
• All reward in USD: the block creation reward plus the mining fee, denominated in USD
• Difficulty target: the network's difficulty
• Total BTC output: all Bitcoin transferred in the block, denominated in Bitcoin
• Total BTC output in USD: all Bitcoin transferred in the block, denominated in USD
The first six features in the table can be extracted directly from each block's data fields.
The average transaction size can be calculated by iterating through every transaction and extracting its size from the 'size' data field.
The mining fee can be calculated from the Coinbase transaction, which is the first transaction in every block. The zero-indexed transaction's 'out' field's zero-indexed 'value' field contains the miner's earnings for the creation of the block. The earnings are denominated in Satoshis, thus I divided the value by 10^8 to get the Bitcoin representation. The mining reward was 12.5 Bitcoin throughout the period that I investigated, so I subtracted it from the earnings in order to get the mining fee.
The difficulty can be calculated as described in section 3.6.2. The block's 'bits' field contains the compact value, whose first two hexadecimal digits represent the exponent and the remaining six the coefficient. The result of the equation can then be calculated in the binary, hexadecimal or decimal numeral system. The result is an extremely large number, and for machine learning I had to represent it in decimal instead of hexadecimal.
The total bitcoin output of each block can be calculated with nested loops. The block's 'tx' field contains every transaction, and each transaction's 'out' field contains its outputs. These can be summed to get all the Satoshis transferred in the block, which can be converted to Bitcoin as previously described.
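The fee and output calculations can be sketched as below; the field names ('tx', 'out', 'value') follow the raw-block responses described earlier and should be treated as assumptions:

```python
SATOSHI = 10 ** 8
BLOCK_REWARD_BTC = 12.5      # block subsidy during the examined period

def block_features(block: dict) -> dict:
    """Derive the mining fee and the total output value from one raw
    block dictionary."""
    txs = block["tx"]
    # Coinbase output: the miner's earnings = subsidy + fees, in Satoshis
    earnings_btc = txs[0]["out"][0]["value"] / SATOSHI
    mining_fee_btc = earnings_btc - BLOCK_REWARD_BTC
    # sum every output of every transaction for the total BTC moved
    total_satoshi = sum(out["value"] for tx in txs for out in tx["out"])
    return {"mining_fee_btc": mining_fee_btc,
            "total_btc_output": total_satoshi / SATOSHI}
```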
Features denominated in USD were calculated by multiplying the appropriate features by Bitcoin's USD market value at the block's creation time. I obtained a minutely sampled price dataset from Kaggle, as I will describe in a later section. The features were collected in a Pandas DataFrame and saved as a comma-separated values ('.csv') file.
4.1.4 Storing data in HDF5 files
Hierarchical Data Format version 5 (HDF5) is a file format developed to store and hierarchically organize huge amounts of data [15].
The file structure is built from two kinds of objects:
• Datasets, which are multidimensional arrays of a single data type
• Groups, which contain datasets and further groups
Using this structure, a completely hierarchical file layout is created in which the stored data is accessed with a POSIX-like syntax: /path/to/resource. Additional metadata is stored in user-defined attributes, which are attached to either groups or datasets. The power of HDF5 lies in its ability to read and write huge amounts of data efficiently. An example of this is chunked storage, by which the user can pre-define an arbitrary smaller block size for a bigger dataset. These smaller chunks can then be accessed instead of the whole dataset, which might hardly fit into memory. For example, an image of size 1024x1024 can be stored in blocks of 64x64 pixels. The chunks are indexed with a B-tree to preserve their order. Different filter operations and compression techniques can be defined on the chunks. The filter operations include checksums, added metadata, or any other operation to be applied to the chunks. For compression, GZIP, SZIP, LZF or other third-party filters can be chosen.
My storing solution creates a file named 'BTC_dataset.hdf5', if it does not already exist, in the directory from which the main program is run. After creating the file, the program checks whether the file contains a group named transaction_matrices. If not, it creates one; then it appends the current block's network graph adjacency matrix and the graph picture to the transaction_matrices group, identified by the corresponding block height. I attached additional metadata to the datasets, like creation time, block height, nonce, total number of transactions, aggregated transaction fees, average cost per transaction, total output value and estimated transaction value. Some of these additional features are calculated from each block independently. The block height also identifies the blocks sequentially in time. I stored the adjacency matrices of the graphs with GZIP compression and the shuffle filter turned on. GZIP is the simplest and most portable compression method: every HDF5 version includes it and it operates on every HDF5 file. The shuffle filter reorders the bytes of the data to improve the compression ratio. I tried all built-in compression methods, and the smallest file sizes were achieved with GZIP. Figure 8. illustrates the storing structure of my solution.
Figure 8. The storing structure of blockchain data
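The storing routine can be sketched with h5py (the dataset naming and attribute keys follow my own conventions above):

```python
import numpy as np
import h5py

def store_block(path: str, height: int, adjacency: np.ndarray, meta: dict):
    """Append one block's adjacency matrix to the transaction_matrices
    group with GZIP compression and the shuffle filter, attaching the
    metadata as HDF5 attributes."""
    with h5py.File(path, "a") as f:            # create file if missing
        group = f.require_group("transaction_matrices")
        ds = group.create_dataset(str(height), data=adjacency,
                                  compression="gzip", shuffle=True,
                                  chunks=True)
        for key, value in meta.items():
            ds.attrs[key] = value

store_block("BTC_dataset.hdf5", 318516,
            np.zeros((8, 8), dtype=np.uint8),
            {"n_tx": 4, "creation_time": 1412899877})
```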
I also created another HDF5 file that contains the separated train, validation and test sets. These datasets can be represented as arrays with shapes of (51712, 128, 128, 3), (6463, 128, 128, 3) and (6459, 128, 128, 3), respectively. The first number of each shape is the number of pictures in the train, validation and test sets. The images of the block transaction networks were originally created in 1024x1024 pixel resolution, which is quite large for machine learning algorithms because of their GPU memory requirements. Therefore, I resized the pictures to 128x128 with the OpenCV (cv2) Python library [21]. The number 3 in the shapes corresponds to the 3 colour channels, RGB. The resized versions of the pictures previously introduced in Figure 7. can be seen in Figure 9.
Figure 9. Resized transaction graph pictures
I added six more vectors to the HDF5 file, which contain the target values for every dataset. Each dataset is coupled with weighted prices (the market value of Bitcoin denominated in USD) and volatilities, thus two target values. The test set also contains target values in order to verify and evaluate the machine learning models' performance after they have completed their training.
The array representation is needed because some learning models require time-distributed sequences of data. These models have a built-in memory and make predictions after sequences, rather than after each input.
4.1.5 Volatility estimators
In finance, volatility is the degree of variation of a trading price series over time [16]. It can be measured by the standard deviation or variance of returns. A tradable asset, like a security, currency or market index, is considered riskier the higher its volatility is. Historical volatility is measured from a time series of past market prices, while implied volatility is compared against historical volatility to judge whether options on an underlying asset are cheap or expensive.
Market data describing an underlying asset can be obtained at different time resolutions, for example by ticks (usually meaning seconds), minutely, hourly, 4-hourly, daily or weekly. These are the most common resolutions. Time intervals larger than tick-level are represented by 5 values: the underlying asset's open, high, low and close prices and its volume (OHLCV). Let's consider an example: for an interval from 14:00 to 15:00, the open price is the price at 14:00, the high and low prices are the highest and lowest prices of the asset during the examined period, and the close price is the price at 15:00. Volume is the quantity of the asset that changed hands during the trading period.
In the following, a few volatility estimators, their advantages and disadvantages will be
introduced. Table 9. contains notations that will be used.
Table 9. Notations used in volatility estimator formulas
• N: the chosen sample size
• F: a scaling factor, equal to the number of trading days in a year
• $o_i$: the $i$-th open price in a time interval
• $h_i$: the $i$-th highest price in a time interval
• $l_i$: the $i$-th lowest price in a time interval
• $c_i$: the $i$-th close price in a time interval
• $x'$: the average of the $x_i$ values, also called the drift
Volatility is defined as the annualised standard deviation of logarithmic returns. Close-to-close volatility is the usual measure for historical volatility. It requires at least 5 samples to be used.
Close-to-close volatility is calculated with Equation (6):

$\sigma_{cc} = \sqrt{\dfrac{F}{N-1}\sum_{i=1}^{N}\left(x_i - x'\right)^2}, \qquad x_i = \mathrm{Ln}\!\left(\dfrac{c_i}{c_{i-1}}\right)$  (6)
The Parkinson estimator is the first advanced volatility estimator, created by Parkinson in 1980. It uses high and low prices instead of closing prices. The drawback of the estimator is that it assumes continuous trading, therefore it underestimates the volatility, as potential movements while the market is shut are ignored. Today, there are exchanges that provide pre- and after-hours trading, which is isolated from normal trading hours and markets. These markets are characterized by high volatility and low liquidity.
The formula of the Parkinson estimator is represented by Equation (7):

$\sigma_{P} = \sqrt{\dfrac{F}{N}\cdot\dfrac{1}{4\,\mathrm{Ln}(2)}\sum_{i=1}^{N}\left(\mathrm{Ln}\dfrac{h_i}{l_i}\right)^2}$  (7)
An extension of the Parkinson estimator is the Garman-Klass estimator, which
includes opening and closing prices. It also underestimates the volatility because it
ignores overnight jumps.
The Garman-Klass estimator is represented by Equation (8):

$\sigma_{GK} = \sqrt{\dfrac{F}{N}\sum_{i=1}^{N}\left[\dfrac{1}{2}\left(\mathrm{Ln}\dfrac{h_i}{l_i}\right)^2 - \left(2\,\mathrm{Ln}(2)-1\right)\left(\mathrm{Ln}\dfrac{c_i}{o_i}\right)^2\right]}$  (8)
The Garman-Klass estimator was modified by Yang and Zhang in order to handle overnight jumps. The measurement assumes zero drift, hence it overestimates the volatility if the underlying asset has a non-zero mean return.
The modified formula is described by Equation (9):

$\sigma_{GKYZ} = \sqrt{\dfrac{F}{N}\sum_{i=1}^{N}\left[\left(\mathrm{Ln}\dfrac{o_i}{c_{i-1}}\right)^2 + \dfrac{1}{2}\left(\mathrm{Ln}\dfrac{h_i}{l_i}\right)^2 - \left(2\,\mathrm{Ln}(2)-1\right)\left(\mathrm{Ln}\dfrac{c_i}{o_i}\right)^2\right]}$  (9)
Bitcoin is traded on so-called cryptocurrency exchanges. These exchanges allow customers to trade digital currencies, like Bitcoin, for other digital assets or traditional fiat money. The main difference between traditional and crypto exchanges is that the latter operate continuously, without closing hours. Crypto exchanges usually provide functional APIs to their customers for implementing automated trading based on various strategies.
In my experiment, I used the so-called BVOL Annualized Historical Volatility Index, which is a common estimator for Bitcoin's volatility in the crypto community.
The calculation of the index is represented by Equation (10):

$\mathrm{BVOL\ Index} = \mathrm{Stdev}\left(\mathrm{Ln}\dfrac{P_1}{P_0},\ \mathrm{Ln}\dfrac{P_2}{P_1},\ \ldots,\ \mathrm{Ln}\dfrac{P_i}{P_{i-1}}\right)\cdot\sqrt{365}$  (10)
For $P_i$, I used the weighted prices provided in the minutely sampled Bitcoin price dataset, which I obtained from Kaggle2. This dataset, called Bitcoin Historical Data, contains one-minute Bitcoin price data from the Bitstamp and Coinbase exchanges and is updated frequently. I resampled it to 5-minute intervals, which means I obtained a dataset that contains a volatility value for every 5 minutes. In Equation (10), 365 denotes the trading days of Bitcoin in a year; I replaced this value with 288, which is the number of 5-minute intervals in a day.
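The index calculation with the adjusted scaling factor can be sketched with pandas (the rolling-window handling is illustrative; the Kaggle column names should be checked against the dataset):

```python
import numpy as np
import pandas as pd

def bvol_index(prices: pd.Series, window: int, periods: int) -> pd.Series:
    """Rolling BVOL-style volatility: the standard deviation of
    logarithmic returns over `window` samples, scaled by the square
    root of the number of periods (288 for 5-minute sampling)."""
    log_returns = np.log(prices / prices.shift(1))
    return log_returns.rolling(window).std() * np.sqrt(periods)

# e.g., with minutely weighted prices resampled to 5-minute intervals:
# vol = bvol_index(weighted_5min, window=288, periods=288)
```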
4.2 Deep learning
Artificial intelligence, or AI, was born in the 1950s. The field first emerged in order to automate, with computers, intellectual tasks that humans normally perform. AI is a general field that encompasses machine learning and deep learning, but also includes other approaches that do not involve any learning [23].
Symbolic AI was the dominant paradigm in AI from the 1950s to the late 1980s. In this approach, experts believed that human-level artificial intelligence could be achieved with a sufficiently large set of explicit rules programmed into machines. Symbolic AI provided satisfactory solutions to logical problems, such as playing board games. However, it turned out to be unsuitable for more complex tasks where explicit rules could not be given, like fuzzy problems, image classification, categorization tasks, speech recognition and language translation. A new subset of AI emerged, with the help of mathematicians, called machine learning. Questions like whether a computer could automatically learn rules by looking at data, and learn on its own how to perform a specified task, led to a new programming paradigm. In machine learning, humans input the data together with the answers expected from the data, and the models create the rules. After the models are trained, these rules can be applied to new data to produce answers. Machine learning models are trained on huge amounts of data to find statistical structures in them and come up with rules that automate specific tasks. Therefore, they are trained rather than explicitly programmed. Machine learning started to become more popular in the 1990s,
2 https://www.kaggle.com/mczielinski/bitcoin-historical-data#bitstampUSD_1- min_data_2012-01-
01_to_2018-06-27.csv, downloaded at: 2018.09.20.
when hardware capable of performing the large number of calculations these models need became available.
Every machine learning task requires input data that properly describes the feature space and from which the output can presumably be calculated. Examples of the expected outputs are needed to bind the specific inputs to the desired outputs. Mathematical functions are used to measure the performance of the machine learning algorithms and to make adjustments that optimize them. This optimization, during which the parameters of the algorithm are updated, is called learning. Machine learning models consist of layers. These layers learn different, unique representations of the input data to associate inputs with outputs. Each layer is parameterized by its weights. During training, the optimal parameters of the layers are searched for, so that the whole network correctly maps the inputs to the associated targets. The loss function measures how far the network's current output is from the true target by calculating a distance score. An optimizer then adjusts the weights in a direction that lowers the loss score for the current inputs. The gradients that drive these weight updates are computed with backpropagation, the algorithm that made training multi-layer networks practical and enabled machine learning to gain ground. The weights are initially set to random values and are updated after a batch of inputs. A batch can be the whole dataset, a subset of it, or even a single input. The typical batch sizes used are powers of two. There is no explicit rule to determine the batch size that results in the best performance; it is a matter of trial and error. An epoch is one cycle through which the whole training dataset is fed to the network once. Usually, models are trained for tens or hundreds of epochs, as long as their performance improves. Figure 10. shows a block diagram of a machine learning model's training process.
Figure 10. A block diagram of a machine learning model
Deep learning is a specific subfield of machine learning. It is called deep because networks in this field often have tens or even hundreds of successive layers. The main idea behind this approach is that different layers learn different patterns and hierarchical representations of the input data. A model's depth is the number of layers it has. Neural networks are the most often used models in deep learning. They were initially developed from theories based on the understanding of the human brain, but currently there is no evidence that the brain implements mechanisms like those used in deep learning models.
In the early 2000s, companies focusing on the production of massively parallel chips called graphics processing units, or GPUs, developed products that became capable of running huge deep neural networks, which require millions of matrix multiplications and tensor operations. These chips were first used by gamers to render complex 3D scenes in real time, but later AI experts wrote implementations of neural networks to run on GPUs. Today, the most advanced chips are capable of executing hundreds of teraflops, where one teraflop is 10^12 floating-point operations per second. Deep learning has reached many real-world applications, and large companies have started to develop specialized hardware, like Google's tensor processing unit, or TPU. Nowadays, several libraries contain implementations of deep learning models, like Keras, Theano, Tensorflow and Pytorch. These libraries can be installed to support CPU or GPU devices, depending on the user's equipment.
The most advanced deep learning applications and pioneering breakthroughs are the
followings:
• Language translation
• Near-human level autonomous driving
• Digital assistants
• Improved ad targeting systems
• Improved search engines
• Chatbots
• Board and computer games played by AI, which defeats humans
Artificial intelligence continuously transforms how people live. Although there are pioneering achievements that are the results of AI, the technology's true potential has likely not yet surfaced.
4.3 Predicting price and volatility with different architectures
Convolutional neural networks, also known as convnets, are used in computer vision applications. These networks are commonly used for image-classification problems. Different convnet architectures have been designed and implemented by groups of artificial intelligence experts from large technology companies. The mathematical building blocks of these models are implemented in several machine learning libraries, like Keras. The models are trained on huge datasets containing millions of images. After the models complete training, their inner state with the learned weights is saved to files. The saved files can later be loaded with the learned weights in order to reuse these models. Alternatively, the models can be retrained and modified arbitrarily, to exploit only their architectures and reuse them to solve different problems. In this thesis, I chose the following architectures in order to feed them with the transaction graph pictures:
• Inceptionv3 [17]
• MobileNet [18]
• NASNet Mobile [19]
• DenseNet 121 [20]
In Keras, models and layers can be stacked on top of each other. Therefore, in all cases when I used the aforementioned models, the implementation of the final architecture was the same. First, I added the convolutional base to the sequence, for example Inceptionv3. I loaded the model with a 128x128x3 input shape and specified in the arguments not to include pre-trained weights or the top layer. In the original architectures, the top layer was used for classification tasks, but volatility and weighted price prediction is a regression problem where the target values are continuous. After the convolutional base, I added a Flatten layer, which turns tensors into one-dimensional vectors. It is followed by a fully-connected Dense layer of size 512 with a rectifier activation function, called ReLU. The top layer I finally added is a Dense layer with 1 neuron and a linear activation function. I set the convolutional base to trainable in order to make its weights updatable.
I chose the Root Mean Square Propagation (RMSprop) optimizer to update the model weights during training. I left the optimizer parameters at their defaults, as suggested by the developers of Keras. The loss function that I monitored was the Mean Squared Error, or MSE.
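The stacked architecture and compilation step can be sketched in Keras as follows (import paths may differ between Keras/TensorFlow versions):

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionV3

# convolutional base without pre-trained weights or the top layer
base = InceptionV3(weights=None, include_top=False,
                   input_shape=(128, 128, 3))
base.trainable = True                       # keep the base updatable

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(512, activation="relu"),
    layers.Dense(1, activation="linear"),   # single regression output
])
model.compile(optimizer="rmsprop", loss="mse")
```

Swapping the first element of the Sequential list for MobileNet, NASNet Mobile or DenseNet 121 yields the other three variants.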
Callbacks are functions that can be applied to influence the models at given stages of the training procedure. They can be used to view internal states and make pre-defined adjustments. I used EarlyStopping to monitor the validation loss during training with a patience value of 20: this function stopped the training process if the validation loss had not decreased for 20 epochs, or training cycles. I set the initial learning rate to 0.1, which was reduced by a factor of 0.02 if the validation loss had not improved for the last 5 epochs. This was achieved with the ReduceLROnPlateau function. I exploited ModelCheckpoint's capabilities to save the best model with the corresponding weights to an HDF5 file during training. Each model's learning attributes were saved to comma-separated values ('.csv') files with CSVLogger after each epoch.
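The callback configuration can be sketched like this (the file names are illustrative placeholders):

```python
from tensorflow.keras.callbacks import (CSVLogger, EarlyStopping,
                                        ModelCheckpoint, ReduceLROnPlateau)

callbacks = [
    EarlyStopping(monitor="val_loss", patience=20),
    ReduceLROnPlateau(monitor="val_loss", factor=0.02, patience=5),
    ModelCheckpoint("best_model.hdf5", monitor="val_loss",
                    save_best_only=True),
    CSVLogger("training_log.csv"),
]
# the list is passed to model.fit(..., callbacks=callbacks)
```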
In order to feed each model with data, I used the ImageDataGenerator class. I rescaled each image's pixels to between 0 and 1, which can be done at the initialization of the class. ImageDataGenerator has different methods to generate batches of data. I chose the 'flow_from_dataframe' method, because my target values were stored in '.csv' files with the corresponding block height values, which also identify the pictures of the transaction networks. Keras automatically infers the file extension from the names if the extension is not provided.
Table 10. depicts the Pandas DataFrame object, which was used to generate the input
and target values for each convnet.
Table 10. Table of input and target values
In the generator function I set the class mode to 'other', which is the parameter that should be used for regression; the shuffle parameter to 'False', in order to keep the temporal property of the dataset; the target size to 128x128; and the batch size to 16. I tried larger batch sizes as well, but all of them resulted in resource-exhausted errors due to lack of GPU memory.
For the test generator, I moved the test dataset pictures to a separate directory called test. Keras' 'flow_from_directory' method yielded the test files from this directory in order to test the models after training.
I initially set the number of epochs to 100, but all training processes exited after around 30 epochs, since the validation loss had not improved and the callback function therefore stopped the training.
The following figures, Figure 11. and Figure 12. summarize the performance of the
models during the training:
Figure 11. Train and validation loss, when the price of Bitcoin was the target value
Figure 12. Train and validation loss, when the volatility of Bitcoin's price was the target
value
It can be seen in Figures 11-12. that the train and validation losses decreased from very
large initial value ranges. They converged after a few epochs to intervals where they
settled and oscillated for the remaining epochs.
Figures 13-14. show the convergence of the loss functions in more detail.
Figure 13. Enlarged picture of train and validation loss, when weighted price was the
target
Figure 14. Enlarged picture of train and validation loss, when volatility was the target
Although the loss of each model decreased, a large amount of loss remained. It was
therefore expected that the models could not learn unique patterns from the transaction
networks to predict the target values, i.e., the weighted prices and volatilities.
Figures 15-18. illustrate the predictions for both target values. I zoomed into some
diagrams for better visibility.
Figure 15. Each model’s prediction for Bitcoin's weighted price
Figure 16. Each model's prediction for the volatility of Bitcoin's price
Figure 17. Each model's prediction for the volatility of Bitcoin's price
Figure 18. Each model's prediction for the volatility of Bitcoin's price
Figures 15-18. confirm the previous assumption that the models were not able to learn
the target values from the pictures. In the case of the weighted price targets, only
DenseNet could predict a notable range of values; the other models averaged the targets
and predicted constant values. For the volatilities, NasNet could predict highly
oscillating values.
The training of each model took up to 2-3 days on an Nvidia Titan X GPU with 12 GB
of memory, access to which was provided by my department. I tried a data augmentation
technique (random rotation of the pictures), training with grayscale images, and dividing
the datasets into volatile and non-volatile periods as well, but none of the attempts
ended with different results. Training the models with images of a higher resolution
than 128x128x3 pixels could also be tried, but due to the lack of GPU memory it was
not an option.
4.3.1 Determining the number of transactions from the transaction graphs
As the previous section revealed, the different architectures were not able to associate
the pictures of the transaction networks with price fluctuations and price values.
However, I carried out further experiments in which I attached different target values to
the convnets, such as the number of transactions in each block.
Figure 19. shows the predictions of the different convnet architectures for the number of
transactions. It is very interesting that three different models were able to determine the
number of transactions from the pictures alone. This means that the pictures have some
representational ability. The best predictors were NasNet and InceptionV3; both
achieved a test Root Mean Squared Error (RMSE) of 157.
Unfortunately, this experiment has no practical application, because the number of
transactions in a block can be explicitly queried from the blockchain. However, the results
are interesting and provide reasons for further research.
Figure 19. Predictions of convnets for the number of transactions
4.4 Different approaches for predictions, system usage, extensions
I devote this chapter to introducing new experiments and results, future investigation
opportunities and the application of an operative prediction system.
Recurrent neural networks (RNN) process sequences of data by iterating through the
elements of the sequences. These networks have an internal loop in order to maintain a
state that contains information about the input sequence. Simple RNNs are unable to
learn long-term dependencies due to the vanishing gradient problem[28], which arises
from the layer depth of neural networks. The Long Short-Term Memory (LSTM)
algorithm was developed to solve the vanishing gradient problem. It is capable of
carrying information across several timesteps; hence the algorithm has a built-in
memory.
4.4.1 Analyzing the correlations of block features with market data
The block features that I calculated from each Bitcoin block and presented previously are
continuous variables. The values of these variables provide information about the Bitcoin
blockchain.
The diagrams in Figure 20. reflect the values and the corresponding 150-long moving
averages of the block features, Bitcoin's price and volatility. It can clearly be seen that
Bitcoin's price started a long-term rally around July 2017. There is a correlation between
the upward tendency of the number of transactions in a block, the mining fee and the
total Bitcoin amount in a block and Bitcoin's price, although this does not hold for the
whole bull market, when Bitcoin's price was on an uptrend. The number of transactions
and the mining fee started to increase around August 2017. The Bitcoin amount
transferred in each block started to increase, with a little lag, around October 2017. It
reached its peak value well before the bear market started, around 2017.12.18.
It is noticeable that as Bitcoin's price started a downward tendency, the usage of the
network also started to drop. The drop in the number of transactions and in the amount
of transferred Bitcoins confirms this statement. As Bitcoin mining became less
profitable, the network's difficulty also decreased, because significant hashing power
left the system. The network automatically adjusted its difficulty target according to the
hashing power present in the system.
Figure 20. Diagrams of block features and their 150 long moving averages
The following matrix on Figure 21. describes the correlation between the previously
mentioned variables:
Figure 21. Correlation matrix of block features and market data
The correlation matrix in Figure 21. represents the correlation coefficients between
the variables. The closer a coefficient is to 1, the stronger the correlation; coefficients in
the negative territory represent negative correlation. I created the matrix by resampling
the dataset at a daily frequency, which means that I averaged the variables on a per-day
basis. The matrix shows that there are positive correlations between Bitcoin's price, the
mining fee, the size of the transactions and, trivially, the volatility. However, because
this matrix was created from both bull and bear market data, the correlation values are
misleading.
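The daily resampling and correlation computation can be sketched with Pandas on a synthetic stand-in dataset; the column names and values below are illustrative assumptions, not the thesis's exact block features.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the per-block dataset: in the thesis the rows are
# Bitcoin blocks; the column names are illustrative assumptions.
rng = np.random.default_rng(0)
idx = pd.date_range('2017-07-01', periods=30 * 144, freq='10min')
trend = np.linspace(0.0, 1.0, len(idx))
df = pd.DataFrame({
    'weighted_price': 2500 + 2000 * trend + rng.normal(0, 30, len(idx)),
    'mining_fee': 0.5 + 0.4 * trend + rng.normal(0, 0.05, len(idx)),
    'n_transactions': 1500 + 800 * trend + rng.normal(0, 50, len(idx)),
}, index=idx)

# Resample to a daily frequency by averaging the variables per day,
# then compute the Pearson correlation matrix.
daily = df.resample('D').mean()
corr = daily.corr()
print(corr.round(2))
```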
The following two figures, Figure 22-23. represent the correlation matrices of
separated bull and bear markets:
Figure 22. Correlation matrix of Bitcoin’s bull market
Figure 20. clearly shows the fluctuation of the correlating features during the bull trend
of Bitcoin's price. Therefore, in Figure 22. the aforementioned correlations are not as
strong as in the following bear market, where the value of the reward, the block size, the
mining fee, the number and size of transactions in a block and the total circulating
Bitcoins in the network strongly correlated with the downward movement of Bitcoin's
price.
The correlations during the bear market are visible in Figure 23. The conclusion of
the analysis is the following: during a bull market, the utilization of Bitcoin's blockchain
by the network's users increases, while during a bear market it decreases.
Figure 23. Correlation matrix of Bitcoin’s bear market
4.4.2 Long Short-Term Memory network with block features
LSTM networks are designed to process sequential or time series data. These networks
are capable of utilizing previous values of a sequence in order to forecast the next values.
The lengths of the input and output sequences, as well as the time lag, can be defined
arbitrarily. The time lag is the number of time steps left out between the input and the
output.
I trained LSTM models with 300 neurons. I used every block feature, with different
sequence lengths, in order to predict the next price value. I divided the dataset into 10-
and 50-long sequences.
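A minimal sketch of such an LSTM regressor, assuming eight block features and the Adam optimizer (the text does not name the optimizer); the windowing helper implements the one-step-ahead targeting described above.

```python
import numpy as np
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

SEQ_LEN = 50       # the other sequence length that was tried is 10
N_FEATURES = 8     # assumed number of block features

def make_sequences(features, prices, seq_len):
    """Slice the temporal dataset into (sequence, next-price) pairs."""
    X, y = [], []
    for i in range(len(features) - seq_len):
        X.append(features[i:i + seq_len])
        y.append(prices[i + seq_len])   # one time step ahead
    return np.array(X), np.array(y)

# One LSTM layer with 300 neurons and a linear output for regression.
model = Sequential([
    Input(shape=(SEQ_LEN, N_FEATURES)),
    LSTM(300),
    Dense(1),
])
model.compile(optimizer='adam', loss='mse')

# Smoke test with random data standing in for the real block features.
X, y = make_sequences(np.random.rand(60, N_FEATURES), np.random.rand(60), SEQ_LEN)
model.fit(X, y, epochs=1, verbose=0)
print(model.predict(X[:2], verbose=0).shape)  # (2, 1)
```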
Figure 24. demonstrates the training and validation loss of the models; Figures 25-28.
show their corresponding predictions.
Figure 24. Training and validation loss of LSTM models with different sequence lengths
Figure 25. Predictions of the LSTM model trained on 50-long sequences for 100 epochs
Figure 26. Predictions of the LSTM model trained on 50-long sequences for 400 epochs
Figure 27. Predictions of the LSTM model trained on 10-long sequences for 400 epochs
Figure 28. Predictions of the LSTM model trained on 10-long sequences for 1000 epochs
There is visible underfitting to the dataset in Figure 25. The LSTM model trained on
50-long sequences produced better results when the training took more epochs. The
same statement holds true for the LSTM model trained on 10-long sequences, although
the latter obviously produced better results after 100 epochs than the model trained on
50-long sequences.
These experiments were carried out with different lengths of sequential input data, and
only a one-step time lag was attached to each input. It can clearly be seen in Figure 28.
that the LSTM was able to forecast drops in the price several times before they actually
happened. However, additional investigation is needed in order to predict multiple time
lags.
4.4.3 Application and integration of an operative prediction system
An operational prediction system that can accurately predict the price of a crypto asset
like Bitcoin, or the volatility of its price, can be effectively used for profitable trading.
Such a system could be integrated into the strategy module of an event-driven trade
system.
Event-driven trade systems are built in order to realize semi-automated and fully
automated trading[24]. Semi-automated systems produce signals about evolving entry
points on markets, which are utilized by users in order to open new positions. Fully
automated systems are capable of opening positions on their own upon receiving a
signal. Essentially, an event-driven trade system operates like a computer game. All
calculations are generated from an infinitely running cycle in which different objects are
placed at the frequency of the incoming data. Because market data flows continuously,
the system has to operate at a high frequency. Cryptocurrency exchanges afford suitable
data access for this task, while traditional stock exchanges impose strict conditions and
require large amounts of money to provide real-time market data and support for
automated trading.
An event-driven trade system has several advantages:
• The source code of the system is reusable. Its components can easily be replaced
to test the system on historical data or to use it for real-time trading.
• Look-ahead bias is excluded: future data cannot be used, because the data flows
with the event objects sequentially, so the system operates like a real-time system.
• The system works realistically. Any trade order, including commissions, can be
simulated arbitrarily.
Figure 29. illustrates an event-driven trade system. The components of the system,
called objects, are its most standard elements, which I describe in detail.
• Event – Every object's reaction at the adequate time is based on the reception of
event objects. The essential types of event objects are the Market, Signal, Order
and Fill objects. The different objects inherit the properties of an abstract base
class.
• Event Queue – A Python Queue object stored in memory, which stores every
descendant event object generated by the other classes as reactions to the data
flow.
Figure 29. The block diagram of an event-driven trade system
• DataHandler – An abstract base class (ABC). It provides a common interface for
handling historical and real-time data. In this way, the strategy and portfolio
objects are reusable for both approaches. The DataHandler object generates
MarketEvent objects at the Backtest Event Queue's frequency, which are then
handled by the Strategy.
• Strategy – The Strategy object is also an ABC, with an interface that
communicates with the DataHandler. It interprets the market data adequately and
generates SignalEvent objects accordingly. When it signals a new position, the
SignalEvent contains the trading asset's ticker symbol (like BTC-USD), the
direction of the position (long or short) and a timestamp. In the case of
cryptocurrencies, the direction of the trade is mostly a long position, i.e., a buy
order or the closing of an open long position.
• Portfolio – It maintains a database of the user's balance and of the historical
trades, together with statistics. The Portfolio object also calculates the size of each
position, which is proportionate to the total available balance (excluding already
invested money).
• ExecutionHandler – It simulates the connection to an exchange or, in the case of
real-time trading, realizes the connection. The ExecutionHandler is the gateway of
the system, which is used to connect to the API interface of a specific exchange.
The handler also receives orders from the queue, which are then transmitted to the
API. If an order is filled, the handler generates a FillEvent object that contains the
information about the filled order: the filled quantity, the commission that was
paid and a possible drift (slippage). Drift can only happen in the case of market
orders, when the desired trading price is not fixed.
• Backtest – Every object is collected in a common event cycle, from where
different events are directed to the adequate components of the system.
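The interplay of the components above can be sketched in Python; the class and method names below are illustrative, not the implementation used in the thesis.

```python
import queue
from abc import ABC, abstractmethod
from dataclasses import dataclass

# Event objects: every component reacts to descendants of this base class.
@dataclass
class Event:
    type: str

@dataclass
class MarketEvent(Event):
    type: str = 'MARKET'

@dataclass
class SignalEvent(Event):
    type: str = 'SIGNAL'
    symbol: str = 'BTC-USD'   # ticker symbol of the traded asset
    direction: str = 'LONG'   # long or short
    timestamp: float = 0.0

class Strategy(ABC):
    """ABC that turns market data into SignalEvent objects."""
    @abstractmethod
    def calculate_signals(self, event, events): ...

class BuyEveryTick(Strategy):
    # Toy strategy standing in for a prediction-based one.
    def calculate_signals(self, event, events):
        events.put(SignalEvent(timestamp=1.0))

# The backtest loop: an infinitely running cycle draining the event queue.
events = queue.Queue()
strategy = BuyEveryTick()
signals = 0
for _ in range(3):                 # three incoming market data updates
    events.put(MarketEvent())
    while not events.empty():
        e = events.get()
        if e.type == 'MARKET':
            strategy.calculate_signals(e, events)
        elif e.type == 'SIGNAL':
            signals += 1           # a Portfolio object would size the order
print(signals)  # 3
```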
The strategies of an event-driven trade system mostly consist of signals that are based
on technical indicators. These indicators are mathematical calculations based on the
price, volume or open interest of a security or contract. Several strategies use multiple
indicators and combinations of them. A prediction system that can predict the price
movements or the volatility of the price, even from transaction networks or block
features, could be integrated into the Strategy component of a trade system.
4.4.4 Possible future experiments
I propose two different methods to further investigate the topic of training deep learning
networks on the pictures of transaction graphs. The first method suggests a combination
of a convolutional and a recurrent neural network and the second method describes a data
separation process to train different models on the segregated data.
The combination of a convolutional and a recurrent neural network is usually used as a
next-frame predictor for video input. The CNN is used as a deep hierarchical feature
extractor, and the LSTM is capable of recognizing and synthesizing temporal dynamics.
A long-term recurrent convolutional network (LRCN)[25] or a convolutional, long
short-term memory, fully connected deep neural network (CLDNN)[26] could be
applied to process sequences of transaction graph pictures with the corresponding
weighted price or volatility target sequences. The sequence length is an arbitrarily
adjusted parameter, which can be optimized by trial and error. The final layer of these
architectures must be modified in order to adapt them to regression.
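Such a combined architecture can be sketched in Keras; this is a minimal LRCN-style illustration with assumed layer sizes and a reduced 64x64x3 resolution, not a proposed final architecture.

```python
import numpy as np
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv2D, MaxPooling2D, Flatten,
                                     TimeDistributed, LSTM, Dense)

SEQ_LEN = 5          # number of consecutive transaction-graph pictures
IMG = (64, 64, 3)    # reduced resolution for the sketch

# LRCN-style model: a small CNN feature extractor applied to every frame
# (TimeDistributed), followed by an LSTM and a linear regression head.
model = Sequential([
    Input(shape=(SEQ_LEN, *IMG)),
    TimeDistributed(Conv2D(16, 3, activation='relu')),
    TimeDistributed(MaxPooling2D()),
    TimeDistributed(Flatten()),
    LSTM(64),
    Dense(1),        # regression output: weighted price or volatility
])
model.compile(optimizer='adam', loss='mse')

# Smoke test on random data in place of real picture sequences.
x = np.random.rand(2, SEQ_LEN, *IMG).astype('float32')
print(model.predict(x, verbose=0).shape)  # (2, 1)
```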
An additional possible research direction could be to train different models on separated
time intervals where the sequences of the price and volatility correlate with each other.
In order to demonstrate this idea, I divided my temporal dataset into 15-long subsets,
each of them containing an equal number of elements. Then, I iterated through every
subset and searched for matching subsets with a correlation value higher than 0.8.
Table 11. illustrates the resulting DataFrame:
Table 11. DataFrame of correlating subintervals
The interpretation of the DataFrame in Table 11. is identical to that of a correlation
matrix of features. The third column, with index 2, represents the third subset of my
temporal dataset, which has a higher than 0.8 correlation with the 175th, 434th, 681st,
691st, etc. subsets. There are 15 groups of subsets in the DataFrame which have more
than 40 elements. This means 15 groups with 40 elements each, where every element
contains 15 records of temporal data.
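The subset search can be sketched as follows on a synthetic series standing in for the temporal dataset; the series and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

WINDOW = 15       # length of each temporal subset
THRESHOLD = 0.8   # minimum correlation to count as a match

# Synthetic series standing in for the temporal dataset.
rng = np.random.default_rng(1)
series = np.sin(np.linspace(0, 40, 600)) + rng.normal(0, 0.1, 600)

# Split the series into consecutive, non-overlapping 15-long subsets.
n = len(series) // WINDOW
windows = series[:n * WINDOW].reshape(n, WINDOW)

# Pairwise correlations between the subsets, then the matches per subset.
corr = np.corrcoef(windows)
matches = {
    i: [j for j in range(n) if j != i and corr[i, j] > THRESHOLD]
    for i in range(n)
}

# One column per subset, listing the indices of its correlating subsets.
result = pd.DataFrame({i: pd.Series(m, dtype='Int64')
                       for i, m in matches.items()})
print(result.shape[1])  # 40
```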
The diagrams in Figures 30-31. illustrate the elements of the previous DataFrame's 3rd
and 53rd columns:
Figure 30. Correlating time intervals, 53rd column of the DataFrame on Table 11.
Figure 31. Correlating time intervals, 3rd column of the DataFrame on Table 11.
It can be seen in Figures 30-31. that these time intervals are highly correlated. The
previously mentioned CNN architectures could produce different results if they were
trained on such separated datasets.
5. Summary
In this thesis, I introduced the basic mathematical background that blockchain networks
rely on. I discussed in detail the data structure of the Bitcoin blockchain, its protocol and
its operation. I presented the process through which I collected, stored, analyzed and
transformed the Bitcoin network's data in order to feed them into deep learning networks
and make predictions for future price, volatility and transaction quantity.
During my work I learned about public-key cryptography. This field of cryptography
builds on cryptographic hash functions, finite fields and mathematical operations on
elliptic curves. These innovations jointly secure blockchain networks and allow them to
operate without a central authority. Blockchain networks are recent inventions that create
trust between untrusting parties. Today, there are several untapped possibilities for which
blockchain technology could be used. However, most blockchain networks provide
digital or cryptocurrencies and make these tokens transferable between two parties. The
value of cryptocurrencies is also denominated in fiat currencies, and therefore they are
traded on so-called cryptocurrency exchanges. The publicly available historical data of
every blockchain creates new opportunities for trading strategies.
Deep learning is a subfield of machine learning. Artificial neural networks with several
layers are capable of learning interrelations between input data and the desired output
which are otherwise impossible to associate explicitly with traditional functions. I
collected one and a half years of temporal data about the Bitcoin network and stored
them in HDF5 files and in the other data structures required by deep learning networks.
From every Bitcoin block, I created graph pictures of the transaction networks in order
to investigate the possible relationships between unique graph structures and subsequent
price and volatility data. I also carried out experiments to determine the number of
transactions from the graph pictures with the help of different convolutional neural
network architectures.
Because the convolutional neural network architectures were not able to learn weighted
prices and volatilities from the transaction graphs, I used a recurrent neural network, a
long short-term memory, to predict the desired targets from the temporal sequences of
block features. The LSTM could learn the sequences, but there was a time lag between
the true values and its predictions, and therefore it is not perfectly usable for predictions
in a real-time environment.
In the last chapters of this thesis, I introduced the application of an operational
prediction system in an event-driven trade system's strategy module. I also proposed
future investigation opportunities for processing sequences of transaction networks, in
the form of a combined convolutional neural network and long short-term memory
architecture. Another possibility for further investigating the topic is to train different
deep learning models on separated but correlating time intervals.
Acknowledgements
I would first like to thank my thesis advisor, Dr. Bálint Gyires-Tóth of the Department of
Telecommunications and Media Informatics at Budapest University of Technology and
Economics. He helped me a lot to find the direction that fits my interests. He consistently
allowed me to do my own work, steered me in the right directions, and his door was
always open to discuss current topics whenever I got stuck in a subtask.
I would also like to thank my friend Róbert M. Németh, who helped me to create the
illustrations and figures with his graphic designer experience.
Finally, I must express my very profound gratitude to my Dad, Mom, Grandma and
other family members for instilling energy in me, providing me with constant support
and love, and making my studies possible. They cooked me delicious dishes, which
obviously helped me to tackle this road.
References
[1] Schneier, B., 1996. Applied cryptography-protocols, algorithms, and source code in
C. John Wiley & Sons., pp. 56-57.
[2] Drescher, D., 2017. Blockchain basics. Apress. pp. 71-81.
[3] Goldreich, O., 1998. Modern cryptography, probabilistic proofs and
pseudorandomness (Vol. 17). Springer Science & Business Media. pp. 11.
[4] Goldreich, O., 1998. Modern cryptography, probabilistic proofs and
pseudorandomness (Vol. 17). Springer Science & Business Media. pp. 65-66.
[5] Galbraith, S.D., 2012. Mathematics of public key cryptography. Cambridge
University Press. pp. 4-7.
[6] Antonopoulos, A.M., 2014. Mastering Bitcoin: unlocking digital
cryptocurrencies. O'Reilly Media, Inc.
[7] Standard, S.H., 2002. FIPS PUB 180-2. National Institute of Standards and
Technology.
[8] Johnson, D., Menezes, A. and Vanstone, S., 2001. The elliptic curve digital
signature algorithm (ECDSA). International Journal of Information Security, 1(1),
pp.36-63.
[9] Koblitz, N., 1991, August. CM-curves with good cryptographic properties.
In Annual International Cryptology Conference. Springer, Berlin, Heidelberg, pp.
279-287.
[10] Morain, F., 1991, April. Building cyclic elliptic curves modulo large primes.
In Workshop on the Theory and Application of Cryptographic Techniques.
Springer, Berlin, Heidelberg, pp. 328-336.
[11] Koblitz, N., 1990, August. Constructing elliptic curve cryptosystems in
characteristic 2. In Conference on the Theory and Application of Cryptography,
Springer, Berlin, Heidelberg, pp. 156-167.
[12] Qu, M., 1999. SEC 2: Recommended elliptic curve domain parameters. Certicom
Res., Mississauga, ON, Canada, Tech. Rep. SEC2-Ver-0.6.
[13] Hagberg, A., Schult, D. and Swart, P., 2012. NetworkX Reference. Python
Package.
[14] Kobourov, S.G., 2004. Force-directed drawing algorithms. University of Arizona,
pp. 383-403.
[15] Collette, A., 2013. Python and HDF5: Unlocking Scientific Data. O'Reilly Media,
Inc, pp. 21-110.
[16] Bollen, B. and Inder, B., 2002. Estimating daily volatility in financial markets
utilizing intraday data. Journal of Empirical Finance, 9(5), pp.551-562.
[17] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. and Wojna, Z., 2016. Rethinking
the inception architecture for computer vision. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pp. 2818-2826.
[18] Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T.,
Andreetto, M. and Adam, H., 2017. Mobilenets: Efficient convolutional neural
networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
[19] Zoph, B., Vasudevan, V., Shlens, J. and Le, Q.V., 2017. Learning transferable
architectures for scalable image recognition. arXiv preprint
arXiv:1707.07012, 2(6).
[20] Huang, G., Liu, S., van der Maaten, L. and Weinberger, K.Q., 2017. CondenseNet:
An Efficient DenseNet using Learned Group Convolutions.
[21] Mordvintsev, A. and Abid, K., 2014. OpenCV-Python tutorials
documentation. Available at: https://media.readthedocs.org/pdf/opencv-python-
tutroals/latest/opencv-python-tutroals.pdf.
[22] Nakamoto, S., 2008. Bitcoin: A peer-to-peer electronic cash system, pp. 1-9.
[23] Chollet, F., 2017. Deep learning with Python. Manning Publications Co, pp. 3-93.
[24] Kim, K., 2010. Electronic and algorithmic trading technology: the complete guide.
Academic Press.
[25] Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S.,
Saenko, K. and Darrell, T., 2015. Long-term recurrent convolutional networks for
visual recognition and description. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition pp. 2625-2634.
[26] Sainath, T.N., Vinyals, O., Senior, A. and Sak, H., 2015, April. Convolutional, long
short-term memory, fully connected deep neural networks. In IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4580-
4584.
[27] Dobbertin, H., Bosselaers, A. and Preneel, B., 1996, February. RIPEMD-160: A
strengthened version of RIPEMD. Springer, Berlin, Heidelberg. In International
Workshop on Fast Software Encryption, pp. 71-82.
[28] Hochreiter, S., 1998. The vanishing gradient problem during learning recurrent
neural nets and problem solutions. International Journal of Uncertainty, Fuzziness
and Knowledge-Based Systems, 6(02), pp.107-116.
6. Appendix
A.1. Secure Hash Algorithm (SHA)
Secure Hash Algorithm is a hash function that was developed by the National Institute of
Standards and Technology (NIST) and published as a Federal Information Processing
Standard (FIPS 180) after approval by the Secretary of Commerce pursuant to Section
5131 of the Information Technology Management Reform Act of 1996 (Public Law 104-
106) and the Computer Security Act of 1987 (Public Law 100-235). Weaknesses were
discovered in SHA, therefore revised versions were issued in the following years.
SHA-256, SHA-384 and SHA-512 are the 256-, 384- and 512-bit versions, respectively;
they were introduced by NIST in 2002. The document that describes SHA-256, and
which is the subject of the subsequent investigation, is known as FIPS 180-2 [7].
The FIPS 180-2 standard specifies the SHA-256 hash function for generating message
digests. Digests can be used to detect changes in a message after the digest was
generated. SHA-256 is considered secure because it is computationally infeasible to find
a message that corresponds to a given message digest (the one-way property) or to find
two different messages that produce the same message digest. Two messages with even
a small dissimilarity will produce different message digests with very high probability.
For this reason, a change in the message will result in a verification failure when the
algorithm is used with a digital signature algorithm.
SHA-256 has properties that are used by the algorithm for the generation of the message
digest. The following table illustrates these properties and the conditions that should be
met.

Algorithm   Message Size (bits)   Block Size (bits)   Word Size (bits)   Message Digest Size (bits)
SHA-256     < 2^64                512                 32                 256
A bit indicates a binary digit with a value of 0 or 1. A byte is a group of eight bits
and a word is a group of 32 bits.
The following table describes the parameters that are used by the secure hash
algorithm.
a, b, c, …, h   Variables that are w-bit words used in the computation of the hash values, H(i).
H(i)            The i-th hash value. H(0) is the initial and H(N) is the final hash value. They are used in the construction of the message digest.
H_j(i)          The j-th word of the i-th hash value, where H_0(i) is the left-most word of hash value i.
K_t             Constant value used for iteration t of the hash computation.
k               The number of zeroes appended to a message during the padding step.
l               The length of the message M in bits.
m               The number of bits in a message block, M(i).
M               The message to be hashed.
M(i)            Message block i, with a size of m bits.
M_j(i)          The j-th word of the i-th message block, where M_0(i) is the left-most word of message block i.
n               The number of bits to be rotated or shifted when a word is operated upon.
N               The number of blocks in the padded message.
T               Temporary w-bit word used in the hash computation.
w               The number of bits in a word.
W_t             The t-th w-bit word of the message schedule.
The following symbols represent binary operators; each operates on w-bit words.

<<   Left-shift operator, where x << n means that every bit is shifted to the left by n positions, discarding the left-most n bits of x and padding the result with n zeroes on the right.
>>   Right-shift operator, where x >> n means that every bit is shifted to the right by n positions, discarding the right-most n bits of x and padding the result with n zeroes on the left.
∧    Bitwise AND operator.
∨    Bitwise OR operator.
¬    Bitwise complement operator.
⊕    Bitwise XOR operator.
+    Addition modulo 2^w.
These symbols are general in computer science. The following operators are specific
to the specification of SHA-256.

ROTL^n(x)   Rotate-left (circular left shift) operator, where x is a w-bit word and n is an integer with 0 ≤ n < w, defined by ROTL^n(x) = (x << n) ∨ (x >> (w - n)).
ROTR^n(x)   Rotate-right (circular right shift) operator, where x is a w-bit word and n is an integer with 0 ≤ n < w, defined by ROTR^n(x) = (x >> n) ∨ (x << (w - n)).
SHR^n(x)    Right-shift operator, where x is a w-bit word and n is an integer with 0 ≤ n < w, defined by SHR^n(x) = x >> n.

The following equivalence relationships exist between the rotation operators:
ROTL^n(x) ≡ ROTR^(w-n)(x)
ROTR^n(x) ≡ ROTL^(w-n)(x)
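Assuming w = 32, these operators and the equivalence relationships can be checked directly in Python (the function names mirror the notation above):

```python
W = 32                    # SHA-256 operates on 32-bit words
MASK = (1 << W) - 1       # truncates results to w bits

def ROTL(x, n):
    """Circular left shift: (x << n) OR (x >> (w - n))."""
    return ((x << n) | (x >> (W - n))) & MASK

def ROTR(x, n):
    """Circular right shift: (x >> n) OR (x << (w - n))."""
    return ((x >> n) | (x << (W - n))) & MASK

def SHR(x, n):
    """Plain right shift: the right-most n bits are discarded."""
    return x >> n

x = 0x82af7129            # an example 32-bit word
print(hex(ROTL(x, 4)))    # 0x2af71298

# The equivalence relationships stated above:
assert ROTL(x, 7) == ROTR(x, W - 7)
assert ROTR(x, 7) == ROTL(x, W - 7)
```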
The abovementioned notations require some explanations:
• A hexadecimal digit is an element of the set {0, 1, 2, …, 9, a, b, c, d, e, f} and is
the representation of a 4-bit string.
• A word is a w-bit string that can be represented as a sequence of hexadecimal digits
by converting 4-bit strings to their hexadecimal equivalents. For example, the 32-bit
string 1000 0010 1010 1111 0111 0001 0010 1001 can be expressed as 82af7129.
Within each word the 'big-endian' convention is used, so the most significant bit
is stored in the left-most bit position.
• A word or a pair of words can represent an integer. The padding technique used in
the SHA-256 algorithm requires the message length, l, in bits, to be represented as
a word or a pair of words. An integer between 0 and 2^32 - 1 inclusive can be
represented as a 32-bit word. The least significant four bits of the integer are
represented by the right-most hexadecimal digit of the word. For example, the
integer 314 = 2^8 + 2^5 + 2^4 + 2^3 + 2^1 = 256 + 32 + 16 + 8 + 2 can be
represented by the word 0000013a.
• The following property is used by SHA-256: if Z is an integer with 0 ≤ Z < 2^64,
then Z = 2^32 · X + Y, where 0 ≤ X < 2^32 and 0 ≤ Y < 2^32. Let x and y be the
word representations of X and Y, respectively; then the pair of words (x, y) is the
representation of Z.
• The addition modulo 2^w operation x + y is defined as Z = (X + Y) mod 2^w,
where X and Y are integers represented by the words x and y, respectively. For
positive integers U and V, U mod V is the remainder when dividing U by V. For
Z it holds that 0 ≤ Z < 2^w, so the integer Z can be converted to a word z, and
z = x + y is defined.
• SHA-256 operates on 32-bit words (w = 32).
Several functions are used by SHA-256 in order to hash the message. Each function
operates on 32-bit words; these words are represented by x, y and z. The following table
defines the functions; each outputs a new 32-bit word.

Ch(x, y, z)  = (x ∧ y) ⊕ (¬x ∧ z)
Maj(x, y, z) = (x ∧ y) ⊕ (x ∧ z) ⊕ (y ∧ z)
Σ256_0(x) = ROTR^2(x) ⊕ ROTR^13(x) ⊕ ROTR^22(x)
Σ256_1(x) = ROTR^6(x) ⊕ ROTR^11(x) ⊕ ROTR^25(x)
σ256_0(x) = ROTR^7(x) ⊕ ROTR^18(x) ⊕ SHR^3(x)
σ256_1(x) = ROTR^17(x) ⊕ ROTR^19(x) ⊕ SHR^10(x)
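The six functions can be implemented directly from the definitions above (a Python sketch with w = 32):

```python
MASK = 0xffffffff  # keep every result a 32-bit word (w = 32)

def ROTR(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def SHR(x, n):
    return x >> n

def Ch(x, y, z):
    # "Choose": bits of y where x is 1, bits of z where x is 0.
    return ((x & y) ^ (~x & z)) & MASK

def Maj(x, y, z):
    # "Majority": the bit value that occurs in at least two of x, y, z.
    return (x & y) ^ (x & z) ^ (y & z)

def Sigma0(x):  # Σ256_0
    return ROTR(x, 2) ^ ROTR(x, 13) ^ ROTR(x, 22)

def Sigma1(x):  # Σ256_1
    return ROTR(x, 6) ^ ROTR(x, 11) ^ ROTR(x, 25)

def sigma0(x):  # σ256_0
    return ROTR(x, 7) ^ ROTR(x, 18) ^ SHR(x, 3)

def sigma1(x):  # σ256_1
    return ROTR(x, 17) ^ ROTR(x, 19) ^ SHR(x, 10)

print(hex(Ch(0xffffffff, 0x12345678, 0x9abcdef0)))  # 0x12345678
```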
SHA-256 uses sixty-four constant 32-bit words in the computation of the hash value.
These constant words are denoted K256_0, K256_1, K256_2, …, K256_63, and they
represent the first thirty-two bits of the fractional parts of the cube roots of the first
sixty-four prime numbers. The superscript of each K indicates the 256-bit version of
SHA, because the different versions of SHA use different constants. The following table
represents these constants in hexadecimal format.
428a2f98 71374491 b5c0fbcf e9b5dba5 3956c25b 59f111f1 923f82a4 ab1c5ed5
d807aa98 12835b01 243185be 550c7dc3 72be5d74 80deb1fe 9bdc06a7 c19bf174
e49b69c1 efbe4786 0fc19dc6 240ca1cc 2de92c6f 4a7484aa 5cb0a9dc 76f988da
983e5152 a831c66d b00327c8 bf597fc7 c6e00bf3 d5a79147 06ca6351 14292967
27b70a85 2e1b2138 4d2c6dfc 53380d13 650a7354 766a0abb 81c2c92e 92722c85
a2bfe8a1 a81a664b c24b8b70 c76c51a3 d192e819 d6990624 f40e3585 106aa070
19a4c116 1e376c08 2748774c 34b0bcb5 391c0cb3 4ed8aa4a 5b9cca4f 682e6ff3
748f82ee 78a5636f 84c87814 8cc70208 90befffa a4506ceb bef9a3f7 c67178f2
The algorithm starts by preprocessing the message M to be hashed. First, padding is applied so that the length of the message becomes a multiple of 512 bits. Suppose that the length of M is l bits, where l < 2^64. A ‘1’ bit is appended to the end of M, followed by k zero bits, where k is the smallest non-negative solution of l + 1 + k ≡ 448 (mod 512). Then a 64-bit block containing the binary representation of l is appended. For example, consider the message ‘halo’, where each character is coded in 8-bit ASCII:
M = 01101000 01100001 01101100 01101111 = ‘halo’ in ASCII coding.
A ‘1’ bit is appended to the end of M, followed by 415 zero bits, because 448 − (32 + 1) = 415; then the length l = 32 is appended as a 64-bit binary block. The message thus becomes a single 512-bit padded message. Whatever the original length, the padded message is always a multiple of 512 bits.
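The padding step can be sketched in a few lines of Python (the helper name `sha256_pad` is my own; the byte 0x80 supplies the ‘1’ bit followed by seven of the required zero bits):

```python
def sha256_pad(message: bytes) -> bytes:
    """Pad a message for SHA-256: a '1' bit, k zero bits, then l as 64 bits."""
    l = len(message) * 8                    # message length l in bits
    padded = message + b'\x80'              # the '1' bit plus 7 zero bits
    # add whole zero bytes until l + 1 + k = 448 (mod 512)
    padded += b'\x00' * ((56 - len(padded)) % 64)
    return padded + l.to_bytes(8, 'big')    # l as a 64-bit big-endian block

padded = sha256_pad(b'halo')
print(len(padded) * 8)   # 512: one padded block, as in the example above
```

For the four-byte message ‘halo’ this produces exactly the 512-bit block derived by hand above.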
When padding is completed, the padded message is parsed into N 512-bit blocks, M^{(1)}, M^{(2)}, …, M^{(N)}. Each input block is expressed as sixteen 32-bit words, where the first 32 bits of message block i are denoted M_0^{(i)}, the next 32 bits are M_1^{(i)}, and so on up to M_15^{(i)}.
The computation of the hash value requires an initial hash value, H^{(0)}, to be set. It is made up of eight 32-bit words with the following values:
H_0^{(0)} = 6a09e667
H_1^{(0)} = bb67ae85
H_2^{(0)} = 3c6ef372
H_3^{(0)} = a54ff53a
H_4^{(0)} = 510e527f
H_5^{(0)} = 9b05688c
H_6^{(0)} = 1f83d9ab
H_7^{(0)} = 5be0cd19
During the hash computation, SHA-256 uses a message schedule of sixty-four 32-bit words labeled W_0, W_1, …, W_63, eight 32-bit working variables labeled a, b, c, d, e, f, g, h, and a hash value of eight 32-bit words labeled H_0^{(i)}, H_1^{(i)}, …, H_7^{(i)}, which initially holds the initial hash value H^{(0)}. After each message block is processed, the hash value is replaced by an intermediate hash value H^{(i)}, until the iteration ends with H^{(N)}, the final hash value. Two temporary words, T_1 and T_2, are also used by the algorithm.
The following steps are repeated N times, until all message blocks have been processed. The result is the 256-bit message digest, a digital fingerprint of the message M, in the form
H_0^{(N)} ∥ H_1^{(N)} ∥ H_2^{(N)} ∥ H_3^{(N)} ∥ H_4^{(N)} ∥ H_5^{(N)} ∥ H_6^{(N)} ∥ H_7^{(N)}
The message blocks M^{(1)}, M^{(2)}, …, M^{(N)} are processed in order, using the following steps:
For i = 1 to N:
{
    W_t = M_t^{(i)}                                                    for 0 ≤ t ≤ 15
    W_t = σ1^{256}(W_{t−2}) + W_{t−7} + σ0^{256}(W_{t−15}) + W_{t−16}  for 16 ≤ t ≤ 63
    a = H_0^{(i−1)}
    b = H_1^{(i−1)}
    c = H_2^{(i−1)}
    d = H_3^{(i−1)}
    e = H_4^{(i−1)}
    f = H_5^{(i−1)}
    g = H_6^{(i−1)}
    h = H_7^{(i−1)}
    For t = 0 to 63:
    {
        T_1 = h + Σ1^{256}(e) + Ch(e, f, g) + K_t^{256} + W_t
        T_2 = Σ0^{256}(a) + Maj(a, b, c)
        h = g
        g = f
        f = e
        e = d + T_1
        d = c
        c = b
        b = a
        a = T_1 + T_2
    }
    H_0^{(i)} = a + H_0^{(i−1)}
    H_1^{(i)} = b + H_1^{(i−1)}
    H_2^{(i)} = c + H_2^{(i−1)}
    H_3^{(i)} = d + H_3^{(i−1)}
    H_4^{(i)} = e + H_4^{(i−1)}
    H_5^{(i)} = f + H_5^{(i−1)}
    H_6^{(i)} = g + H_6^{(i−1)}
    H_7^{(i)} = h + H_7^{(i−1)}
}
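Putting the padding, message schedule and compression loop together, the whole algorithm fits in a short pure-Python sketch. The variable names mirror the pseudocode above; this is an illustrative implementation only, with the standard library's `hashlib` used to cross-check the result:

```python
import hashlib
import struct

MASK32 = 0xFFFFFFFF

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK32

# The sixty-four constants K_t^{256} from the table above.
K = [
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
    0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
    0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
    0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
    0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
    0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
]

def sha256(message: bytes) -> str:
    # Preprocessing: append the '1' bit, k zero bits, then l as a 64-bit block.
    l = len(message) * 8
    padded = message + b'\x80'
    padded += b'\x00' * ((56 - len(padded)) % 64)
    padded += l.to_bytes(8, 'big')

    # Initial hash value H^(0).
    H = [0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
         0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19]

    for i in range(0, len(padded), 64):          # one 512-bit block at a time
        W = list(struct.unpack('>16I', padded[i:i + 64]))
        for t in range(16, 64):                  # message schedule W_16..W_63
            s0 = rotr(W[t-15], 7) ^ rotr(W[t-15], 18) ^ (W[t-15] >> 3)
            s1 = rotr(W[t-2], 17) ^ rotr(W[t-2], 19) ^ (W[t-2] >> 10)
            W.append((s1 + W[t-7] + s0 + W[t-16]) & MASK32)
        a, b, c, d, e, f, g, h = H
        for t in range(64):                      # compression loop
            T1 = (h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25))
                  + ((e & f) ^ (~e & g)) + K[t] + W[t]) & MASK32
            T2 = ((rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22))
                  + ((a & b) ^ (a & c) ^ (b & c))) & MASK32
            h, g, f, e, d, c, b, a = g, f, e, (d + T1) & MASK32, c, b, a, (T1 + T2) & MASK32
        H = [(v + w) & MASK32 for v, w in zip(H, (a, b, c, d, e, f, g, h))]

    return ''.join(format(v, '08x') for v in H)

# Cross-check against the standard library implementation.
assert sha256(b'halo') == hashlib.sha256(b'halo').hexdigest()
```

The final assertion confirms that the sketch agrees with `hashlib` on the ‘halo’ example used throughout this appendix.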
A.2. The domain parameters of the Koblitz curve, secp256k1
The elliptic curve called secp256k1 is a Koblitz curve. Its domain parameters over F_p are specified by the sextuple T = (p, a, b, G, n, h), where the finite field F_p is defined by [12]:
p = FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFE FFFFFC2F
  = 2^256 − 2^32 − 2^9 − 2^8 − 2^7 − 2^6 − 2^4 − 1
The curve E: y^2 = x^3 + ax + b over F_p is defined by:
a = 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
b = 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000007
The compressed form of the base point G is:
G = 02 79BE667E F9DCBBAC 55A06295 CE870B07 029BFCDB 2DCE28D9 59F2815B 16F81798
The order n of G is:
n = FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFE BAAEDCE6 AF48A03B BFD25E8C D0364141
The cofactor h is:
h = 01
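These parameters can be sanity-checked in plain Python. The sketch below (variable names are my own) verifies the sparse form of p and decompresses the 02-prefixed G: since p ≡ 3 (mod 4), a square root of x^3 + 7 modulo p is obtained as (x^3 + 7)^((p+1)/4) mod p, and the 02 prefix selects the even root.

```python
# Prime field modulus of secp256k1 and its sparse form.
p = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEFFFFFC2F
assert p == 2**256 - 2**32 - 2**9 - 2**8 - 2**7 - 2**6 - 2**4 - 1

# x-coordinate of G, taken from the compressed form above (02-prefix dropped).
Gx = 0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798

# Decompress: y^2 = x^3 + 7 over F_p; p = 3 (mod 4) allows sqrt by exponentiation.
rhs = (pow(Gx, 3, p) + 7) % p
y = pow(rhs, (p + 1) // 4, p)
if y % 2 != 0:            # the 02 prefix means the even root
    y = p - y

assert (y * y) % p == rhs  # G lies on the curve E: y^2 = x^3 + 7
print(hex(y))              # the recovered y-coordinate of G
```

This confirms that the compressed base point indeed encodes a valid point on E over F_p.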