
A Model for General Video Game Learning with HTM

Leonardo Arturo Quiñonez Perez
Computer Science
Code: 2702300

Universidad Nacional de Colombia
Facultad de Ingeniería
Departamento de Sistemas e Industrial
Bogotá, D.C.
May 2016


A Model for General Video Game Learning with HTM

Leonardo Arturo Quiñonez Perez
Computer Science
Code: 2702300

Dissertation presented for the degree of
Master in Computer Science

Advisor
Jonatan Gomez Perdomo, Ph.D.
Ph.D. in Computer Science

Research line
Artificial Intelligence

Research group
ALIFE

Universidad Nacional de Colombia
Facultad de Ingeniería
Departamento de Sistemas e Industrial
Bogotá, D.C.
May 2016


Title in English

A Model for General Video Game Learning with HTM

Título en español

Modelo para el aprendizaje general de videojuegos con HTM

Abstract: A model-based agent for general game playing, in the context of Artificial General Intelligence, is proposed. The agent is structured as a utility agent that models games through two functions, Transition and Reward. With these functions, a planning phase is carried out using a tree of actions. The agent is tested with two Atari games, Breakout and Pong.

Resumen: Se propone un agente basado en modelos para el aprendizaje general de videojuegos. El agente se estructura con dos funciones, de transición y de recompensa, que se usan en una fase de planeación. El modelo propuesto se prueba con dos juegos de Atari, Breakout y Pong.

Keywords: Artificial General Intelligence, AGI, Video Game Playing, Atari, Pong, Breakout, Hierarchical Temporal Memory, HTM, Cortical Learning Algorithm, CLA, Model-based, Avatar, Environment, Dynamic Objects

Palabras clave: Inteligencia artificial general, AGI, Atari, Pong, Breakout, HTM, CLA, agente basado en modelos, avatar, ambiente, objetos dinámicos



Contents

Introduction

1. Artificial General Intelligence - Preliminaries
  1.1 Definitions
    1.1.1 Artificial Intelligence
    1.1.2 Artificial General Intelligence
  1.2 AGI Models
    1.2.1 AGI Model Classification
    1.2.2 Memory Prediction Framework - Hierarchical Temporal Memory
      1.2.2.1 Memory Prediction Framework
      1.2.2.2 Hierarchical Temporal Memory
      1.2.2.3 HTM implementations
      1.2.2.4 HTM - Uses
      1.2.2.5 HTM - Modifications
    1.2.3 Deep Learning
    1.2.4 Universal Artificial Intelligence: AIXI
    1.2.5 Novamente - OpenCogBot
    1.2.6 Other Ideas
      1.2.6.1 Hardware
      1.2.6.2 Blue Brain
      1.2.6.3 HMAX
  1.3 Future of AGI
    1.3.1 Long Term
    1.3.2 Short Term
  1.4 Testing for AGI
    1.4.1 Defining a Test
    1.4.2 Game Playing
    1.4.3 General Game Learning and Playing
    1.4.4 Video Game Learning
    1.4.5 Arcade Learning Environment - ALE
      1.4.5.1 Atari
      1.4.5.2 Arcade Learning Environment
      1.4.5.3 Pre-processing
      1.4.5.4 ALE agents
  1.5 Summary

2. Learning Model
  2.1 Transition Function
    2.1.1 Environment (E)
    2.1.2 Avatar (A)
    2.1.3 Dynamic objects (D)
  2.2 Reward Function
  2.3 Planning Phase
  2.4 Training the model

3. Learning Experiments and Results
  3.1 Learning the Transition Function
    3.1.1 Avatar
      3.1.1.1 Learning action effects with HTM
      3.1.1.2 Learning action effects with Neural Net
    3.1.2 Dynamic objects
      3.1.2.1 Learning from a whole game frame
      3.1.2.2 Using viewports
  3.2 Learning the Reward Function
    3.2.1 Using whole game frame
    3.2.2 Using viewport
      3.2.2.1 With Environment
      3.2.2.2 Without environment
  3.3 Playing
    3.3.1 Breakout
      3.3.1.1 Using whole game frames
      3.3.1.2 Using viewports
    3.3.2 Pong
  3.4 Other agents
    3.4.1 Random Agent
    3.4.2 Test with Neuroevolution
  3.5 Results

Conclusions

Future Work

Bibliography


Introduction

Understanding human intelligence and replicating its behavior has constantly awakened interest, but the first important milestone in this search came only with the development of the computer. With the computer, the idea of replicating intelligence became a tangible goal, and the term artificial intelligence was coined. Artificial Intelligence was born in a 1956 workshop as a research area with the purpose of understanding and replicating intelligence. The first exercises of Artificial Intelligence (AI) were fruitful and thrilling; for example, the reasoning program called the Logic Theorist was able to develop a shorter proof of a theorem presented in Principia Mathematica [69].

After the initial thrust, AI decelerated when the initial solutions could not be generalized; it was clear that human intelligence was more complex than imagined. As a consequence of this complexity, AI turned towards specific abilities of human intelligence: visualization, classification, pattern recognition, etc. In the last 50 years, AI has accomplished many challenges, but few towards simulating human intelligence and its generality, understood as the capacity of human intelligence to work in multiple contexts. This generality is what allows humans to find patterns in the world and transfer the acquired knowledge to different domains.

Currently, some researchers believe that the state of the art in AI and the speed and resources of modern computers allow a focus, once again, on human intelligence and its generality [34, 77, 29]. The projects that work under this rationale are grouped under the name of Artificial General Intelligence (AGI), concentrating on replicating general intelligence rather than specific intelligence.

Each AGI project has a different approach [28]. Some of them are bio-inspired (Hierarchical Temporal Memory [33]), some are essentially mathematical approaches (AIXI [39]) and others combine multiple concepts (OpenCog [27]). These approaches are at different stages of development, ranging from theoretical work to initial implementations.

In 2010, a workshop on AGI was carried out [1]. The goal of the workshop was to establish a path that could guide researchers towards the goal of Artificial General Intelligence. In [1], multiple tests for intelligent agents were proposed.¹ The tests are ordered on a scale that depends on the individual capacity theory of Piaget and the socio-cultural engagement theory of Vygotsky. The simplest test presented, with the lowest requirements on both axes, is game playing. One of the most interesting characteristics of game playing is that the complexity of the games can vary, allowing different abilities to be tested.

¹An agent, in the context of AI, is an entity (commonly a computer program) that acts; in most cases this is accomplished by perceiving its environment and taking rational decisions.

Computer agents for playing games have already been developed. The most famous one is Deep Blue [12], the chess-playing agent that was able to beat the reigning chess champion. The problem with Deep Blue and many other playing agents is that they lack generality. For example, Deep Blue is incapable of playing checkers² or even of learning how to play the game. Common game-playing agents work in just one context. They are developed to play board games, to be opponents in video games or as non-playable characters (NPC), but what they all have in common is the need for a human expert. This expert extracts the rules of the game and includes his own perceptions in the agent. In consequence, these playing agents are not general.

In contrast, some games have been proposed to motivate general solutions rather than specific ones. The Metagame [63] and the General Game Playing Competition [63] are examples of such competitions, the latter of which is still active.

In the context of games and the Hierarchical Temporal Memory (HTM), introduced before, an HTM network was modified to be part of an adaptive agent that was able to learn how to play "rock-paper-scissors" [67].

Following the game-playing scenario proposed in [1] and what is shown in [67], we propose to test the abilities of an HTM agent in the context of general game playing. There, an HTM agent attempts to learn and play two different games with no previous knowledge about them. The learning happens as in standard reinforcement learning, where the game points are the feedback.

This thesis is structured as follows: the first chapter deals with the state of the art in three main topics: models for AGI, testing for AGI, and general game learning. The second chapter presents the proposed playing model. The third chapter shows how the different parts of the model are trained, presents tests with Breakout and Pong, and compares the results with other agents. Finally, conclusions of the whole process are drawn.

²Checkers is a strategy board game, played on the same type of board as chess, using uniform game pieces placed on the dark squares of the board.


CHAPTER 1

Artificial General Intelligence - Preliminaries

1.1 Definitions

1.1.1 Artificial Intelligence

Artificial Intelligence (AI) is the field that seeks to understand and, more importantly, to build intelligent entities [69]. The way this goal is approached, and the exact definition of what intelligence is, changes the path taken to develop AI. In general, as described by Russell in [69], there are four perspectives: thinking humanly, thinking rationally, acting humanly and acting rationally. Different approaches to AI tackle each of these perspectives.

Since the term Artificial Intelligence was coined in 1956, there have been many changes in the field. These changes can be summarized in two historical periods: classic and modern AI.

In the first days of AI, there were multiple developments oriented towards logic, such as the Logic Theorist (which was able to prove theorems from a mathematics book and even came up with a shorter proof for one of them). There were also many other developments, among them ELIZA (the psychologist program), expert systems, and even the idea of neural networks. Around those developments and their early success, the belief in AI grew like a bubble. But most of those solutions were very limited and could not be extended. Because of that, AI was not adopted by industry and quickly came to a standstill; the bubble burst when the initial achievements could not be generalized.

Later came a new perspective on AI, which will be called modern AI. Modern AI is the consequence of the unfruitful development of classic AI. Modern AI sets aside the development of intelligence as a whole and concentrates its efforts on specific tasks, such as pattern recognition, classification and others. The problem is that those solutions must be adjusted by an expert in the field of the problem. Modern AI, in contrast with classic AI, is widely accepted and used by industry, but has no special focus on intelligence as classic AI had. The focus of modern AI is on specific problems.


1.1.2 Artificial General Intelligence

As introduced by Pei Wang in his "Artificial General Intelligence, A Gentle Introduction"¹, new ideas about intelligence were born around 2004. The terrain was prepared for the revival of the original ideas of AI. Multiple books with multiple types of initiatives were written [4, 34, 77, 29] and conferences that shared this view were held. An important achievement used at the time was Olshausen and Field's proposal of sparseness [62], which states that during the learning process, humans save resources and try to learn without memorizing every little detail. For example, with regard to image recognition, humans learn some characteristics and then represent images as a small subset of those characteristics. This subset is known as the sparse coding representation. Using those ideas of sparse coding, an algorithm was able to beat the state of the art in object recognition [45]. This then joined with the general belief in the effectiveness of using brain-inspired models to achieve general artificial intelligence.

Although it seems there is no settled name for this trend yet, multiple names are given to these ideas. Artificial General Intelligence (AGI), Strong Artificial Intelligence or Human-level Artificial Intelligence are terms used to describe these general ideas that come from the beginning of AI but were set aside due to their difficulty, limited understanding or little utility.

In consequence, AGI can be defined as "an emerging field aiming at the construction of 'thinking machines'; that is, general-purpose systems with intelligence comparable to that of the human mind"². It is important to note that, even though the previous definition could be stated as the original goal of AI, AGI pursues a more direct approach to general intelligence, in contrast with the bottom-up approach of modern AI.

1.2 AGI Models

1.2.1 AGI Model Classification

Using Pei Wang's classification of AGI³, extending it to different models and allowing multiple categories for some models, the different models presented here are classified in Table 1.1. This classification is made by mapping the answers to two questions. First, what is the concrete goal of the project or model? This has five common answers:

• structure: to model the human brain

• behavior: to simulate human performance

• capability: to solve practical problems

• function: to have cognitive faculties

• principle: to obey rational norms

The second question is: what is the technical path to achieve the goal? The possible answers are:

¹http://www.cis.temple.edu/~pwang/Writing/AGI-Intro.html (last checked March 2014)
²http://www.agi-society.org/
³Presented in his web document "Artificial General Intelligence, A Gentle Introduction".


• hybrid: to connect existing AI techniques

• integrated: to combine modules based on different techniques into an overall architecture

• unified: to extend and augment a core technique in various ways

Table 1.1. Classification of different AGI approaches

goal \ path   Integrated                            Unified
Structure     Blue Brain                            HTM, DeSTIN, HMAX
Principle                                           AIXI, NARS
Function      Novamente, OpenCog
Behavior      OpenCog
Capability    Deep learning, hardware approaches    HTM

1.2.2 Memory Prediction Framework - Hierarchical Temporal Memory

Research on this path was described in the book "On Intelligence" [34]. The introduction of this book talks about Hawkins' efforts to work on Artificial Intelligence with some bio-inspiration. However, as he says, the academic world was not ready for that approach at the time; in the 1980s, the momentum needed for the development of this class of systems did not exist. At the core of Hawkins' story, and his argument, is that all methods until now have focused more on what they could do rather than on how they should do it. He expresses how most methods have had little or no influence on brain understanding, and how even those that have used some knowledge about the brain, such as neural networks, have not given enough attention to the task of comprehending the brain and using the solutions that evolved in humans over hundreds of years.

Hawkins centers his ideas on one specific part of the brain, the neocortex. He explains that research seems to lead to the conclusion that the neocortex runs one algorithm that is used for every learning task; for example, if sensory input were rewired, we would learn to see with a different part of the brain. The consequence of this is that the neocortex does not have a division of tasks; it is not specialized for each different task. It just knows how to do one thing well, and repeats it for all its stimuli. This recalls the Stoic philosophical theory of "tabula rasa", in which humans are born as a blank tablet that is then written on by their sensations [16].

The Memory Prediction Framework (MPF) theory is the product of joint research in neuroscience and artificial intelligence, first presented in "On Intelligence" [34]. There, the general ideas of Artificial General Intelligence are set out, although not under the name AGI. As a consequence of these ideas, the authors proposed the development of a model that differs from classic artificial intelligence. This bio-inspired model, called Hierarchical Temporal Memory (HTM), encompasses the ideas of how the human neocortex works.

The main concepts of the HTM can be summed up in three words: memory, time and hierarchy. The model works on a hierarchy of inputs, where the raw inputs are at the bottom and, as they flow through the hierarchy, they represent more abstract concepts. These abstractions are extracted from the temporal and spatial patterns that the inputs have. With these patterns flowing up and down the hierarchy, the model constantly predicts the inputs that should arrive. For instance, when someone reaches for a known door knob, the neocortex predicts the position where the knob should be, so that if it were moved a few millimeters the neocortex would know that something was wrong.

1.2.2.1 Memory Prediction Framework

The Memory Prediction Framework is the cornerstone of Hawkins' ideas, born from the desire to tackle the AI problem with a top-down model. This contrasts with most known AI research, which was done in a bottom-up fashion, where each individual task was thought out and researched as a single problem. This approach led to the development of efficient algorithms for those tasks, but on the path to understanding intelligence the bottom-up approach was not giving results. As he explains [34], the task could be compared to solving a puzzle in which the final result is not known, the pieces may or may not fit, some pieces will not be used, and each month new pieces arrive, making it a "Herculean" task. That is why Hawkins proposes his memory model as a top-down view of the problem, in the hope that it would lead to a milestone in the more than 50-year search for human-level artificial intelligence.

The memory prediction framework encapsulates Hawkins' ideas. It begins with the observation of two main aspects of the brain, memory and time: human brains not only learn spatial patterns but also temporal ones. Joined with these is the continuously occurring process of the mind in which it constantly makes predictions about what, how and when it is going to feel something. One interesting example of this is when a "known" door is opened. If someone were to change the position of the door knob, the user of the door would realize something is wrong as he extends his hand to open the door, reaching for the knob in its usual place but finding his predictions about it false. This is used as an example of how every action is mapped to an expected feeling.

With this, the framework proposes a hierarchy of brain elements in which perceptions enter at the lower part of the hierarchy; while they are unfamiliar they go upwards in the hierarchy to a more general representation, and as they become known they move downwards. So when a child is beginning to read, each symbol (letter) goes all the way up through the hierarchy, but with time only the first levels are needed to recognize it. Furthermore, understanding a word at the beginning requires that, after each letter is identified, it goes up the hierarchy to be recognized. In that way, spatial and temporal patterns are extracted in the upper levels of the hierarchy, and the information flows up and down as a consequence of how novel or predictable the input is. This happens in such a way that when a prediction is false, the brain may look higher in the hierarchy for an "explanation" of the unpredicted event and for new predictions, transforming invariant representations into more specific ones.

1.2.2.2 Hierarchical Temporal Memory

The ideas in [34], and in general those of the memory prediction framework, were used and later developed by a few authors. One interesting early work is by Garalevicius [22], where the ideas of the framework are reviewed and a complete analysis of the Bayesian model is made. The advantages and disadvantages are discussed, and expectations are raised that further development of the model would lead to "more accurate models".

Subsequently, a model was presented by Dileep George in his doctoral thesis, developed while he was working at the RNI (Redwood Neuroscience Institute) and later at Numenta [25]. There, the theoretical framework of Jeff Hawkins [34] was used to develop what would be called HTM (Hierarchical Temporal Memory). HTM is presented as a result of the search for a cortical algorithm that makes machines capable of intelligence, in the sense of flexible model generation, and as an effort towards understanding the brain, the two being a duality that should develop in symbiosis. The HTM is presented, developed and tested as a visual pattern recognition model [25].

Hawkins founded a research company called Numenta through which the ideas of the Memory Prediction Framework were to be put to use. The company then released software (NuPIC) for the use, testing and development of the HTM (Hierarchical Temporal Memory) [59]. This software was open source, so anyone could use this implementation of the Memory Prediction Framework to put it to the test and build on it. Many documents were released by Numenta explaining the theory and the implementation, and even making comparisons with current machine learning approaches [58].

That report [58] presents the existing models in machine learning under a simple classification and tries to explain how each model relates to HTM and how HTM is an improvement over it. It also discusses how each model could be useful for the HTM. The main argument is the conjunction of hierarchy over time and space, which is not found as a whole in any of the models presented. This leads to another report from Numenta in which the capabilities of these algorithms are expressed [35] as:

1. Discovering causes in the world

2. Inferring causes of novel input

3. Making predictions

4. Using predictions to direct motor behavior.

An example of a simple HTM hierarchy is presented in Figure 1.1, which pictures how sensory data enters the lowest levels of the hierarchy and goes up through the multiple layers.

Figure 1.1. Example of a simple HTM hierarchy with its belief propagation. Taken from [35]

The implementation of this model is made via a Bayesian network that is organized, as presented in the graph, in a tree form. In that way the sensory data changes beliefs at the lowest level, and these are propagated through the network such that when a belief reaches the highest level in the hierarchy, beliefs in all nodes are consistent. A good example of this is the following [35]: if the visual information creates a belief of dog or cat (with a stronger belief toward the dog), but the auditory input suggests a pig or a cat (with a stronger belief toward the pig), then a higher level will decide that, for a consistent belief, the input must come from a cat, even though the cat had the lowest probability among the lowest-level nodes.
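The following toy sketch (not Numenta's implementation; the probabilities are invented for illustration) shows the gist of that example: each child node reports a belief over the same set of causes, and the parent keeps the cause most consistent with all of its children.

```python
# Toy sketch of the dog/cat/pig example above (illustrative numbers only).
visual   = {"dog": 0.45, "cat": 0.40, "pig": 0.15}  # vision leans toward "dog"
auditory = {"pig": 0.45, "cat": 0.40, "dog": 0.15}  # hearing leans toward "pig"

def combine(*child_beliefs):
    """Multiply the children's beliefs cause by cause and renormalize."""
    causes = child_beliefs[0]
    joint = {c: 1.0 for c in causes}
    for belief in child_beliefs:
        for c in causes:
            joint[c] *= belief[c]
    total = sum(joint.values())
    return {c: v / total for c, v in joint.items()}

parent = combine(visual, auditory)
print(max(parent, key=parent.get))  # -> 'cat', the only cause both modalities support
```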

The HTM has had multiple applications, mostly promoted by Numenta or even made using the software (NuPIC) they released. An interesting report describes what kinds of problems fit the HTM [60]; in it, an exercise is proposed in which some hierarchy should be easily found in the problem. Then, after applying the HTM implementation, the hierarchies thought of, or even novel ones, should appear. This suggests that problems in which hierarchy is an important feature are a good fit.

After that, Numenta turned from a research company into a product company, making their algorithms difficult to find. Their main product is Grok, which is described as a cloud-based HTM. An important aspect of this is that Grok is said to have prediction capabilities, which had been asked for but not seen before in official Numenta software. Later, in 2013, a new version of their algorithms, under the name NuPIC, was released as open source.

HTM, or Hierarchical Temporal Memory, is the model that results from the MPF theory. This model wraps a set of theories that build a model of the neocortex, some of which have already been implemented. An HTM is a neural network in which information flows through a hierarchy. It is important to note that it is not a classic neural network: it models neurons (called cells in the HTM literature) that are arranged in columns, layers, regions and a hierarchy [33].

The different layers of the HTM play different roles, although all of them behave in a similar way. Some layers receive direct input from the senses, while others obtain it from previous layers. With this model of the neocortex, the HTM aims to achieve a good level of human-like intelligent behavior.

One of the most important details in the behavior of an HTM is the sparse encoded representation of the world, where each state activates few neurons, but just enough to encode the inputs. This sparse representation of the inputs allows similar inputs to be recognized even if they are not completely equal. For example, a known word spoken by an unknown person would activate almost the same sparse set of neurons as if it were spoken by a known person.
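As a rough illustration of why that works (a toy sketch, not tied to any particular HTM implementation; the cell indices are invented), similarity between two sparse representations can be measured simply as the overlap of their active cells:

```python
# Two sparse distributed representations, written as sets of active cell indices
# (e.g. ~10 active cells out of a much larger population).
word_known_speaker = {7, 93, 212, 401, 788, 901, 1203, 1640, 1777, 2001}
word_new_speaker   = {7, 93, 212, 405, 788, 901, 1203, 1640, 1790, 2001}

overlap = len(word_known_speaker & word_new_speaker)
print(overlap, "of", len(word_known_speaker), "active cells shared")
# A large overlap (here 8 of 10) is what lets the memory treat both inputs as
# the same word, even though the raw signals are not completely equal.
```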

In consequence, the main features of the HTM revolve around its hierarchical structure and its constant predictions; the constant prediction of the following inputs allows the recognition of new, unexpected inputs. These unexpected inputs flow through multiple levels, activating neurons and, with repetition, forming new memories, in such a way that when this "new" input appears again it is no longer new and is predicted.

Although the theory is public, the translation of that theory into an implementation is a work in progress. In consequence, there are multiple implementations of HTM, which differ from one another.


1.2.2.3 HTM implementations

The publicly available HTM implementations can be grouped in two. Most of them are implemented using concepts from Bayesian networks or extensions of them. Others use different learning techniques.

Bayesian. The Bayesian implementations found can be split in two: the HTM authors' Bayesian implementations, and implementations made by others.

The HTM authors have developed multiple implementations; one of the first can be found in [24]. This can be called a pre-HTM model, where some of the neurological theories were used and tested. That model has a Matlab implementation that is not open for use.

After the explicit description of the model, the authors of the HTM model built a company named Numenta, and under this name they developed the implementations of HTM. The current, most advanced implementation is called CLA (Cortical Learning Algorithm), which the company uses for its commercial activities but which is developed as open source (https://github.com/numenta/nupic).

Other authors have developed implementations of HTM using the theories and the white paper by Numenta [33]. Two are worth mentioning: the implementation by Salvius as an open-source project, and OpenHTM, which is also open source.

Not Bayesian. There are a few implementations of the HTM that avoid using Bayesian networks; the most relevant of them is the one presented in [67]. There, the authors use Self-Organizing Maps to implement a hierarchical learning network that keeps the same properties as the HTM theory, and they use this network in an adaptive agent.

Cortical Learning Algorithm. The Cortical Learning Algorithm (CLA) simulates a region of the neocortex as a collection of cells (neurons). The main difference between the CLA and normal neural nets is its inclusion of distal dendrite segments, which have been seen to create effects of inhibition and excitation [61, 33]. The main principle of the CLA is continuous, context-sensitive prediction (using online learning).

The CLA is split into two parts, spatial pooling and temporal pooling [33]. Spatial pooling is focused on creating stable sparse distributed representations of the input data. Temporal pooling creates context-sensitive predictions by modeling distal dendrite segments. The main idea behind this model is explained in Fig. 1.2. Implementations of this algorithm are normally called HTM-CLA. A simplified sketch of the spatial pooling step is given below.
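The sketch below illustrates only the core idea of spatial pooling, choosing a small, fixed number of winner columns by input overlap; it is a toy, not Numenta's CLA (no learning, boosting or local inhibition), and all sizes are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
n_input, n_columns, n_active = 64, 32, 4         # toy sizes, ~12% column sparsity

# Each column "sees" a random subset of the input bits (its potential synapses).
potential = (rng.random((n_columns, n_input)) < 0.5).astype(int)

def spatial_pool(input_bits):
    """Return the indices of the columns whose synapses overlap the input most."""
    overlap = potential @ input_bits             # active input bits seen per column
    return np.argsort(overlap)[-n_active:]       # sparse set of winner columns

frame = rng.integers(0, 2, n_input)              # a binary input vector
print(spatial_pool(frame))                       # stable, sparse representation
```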

1.2.2.4 HTM - Uses

Even though HTM was introduced only a short time ago, in 2007, multiple different applications have been published. The majority of them are for classification. As seen in the first paper by Dileep George, whose emphasis was on visual pattern recognition, most applications of HTM are, at their core, some sort of classification, mostly of visual input, although there are also other kinds of input.


Figure 1.2. Column of CLA cells predicting two sequences, BA and CA. The temporal pooler is modeled as a network of cells divided into columns; in the figure, each group of four vertical cells is a column. Each cell can be connected to cells in other neighboring columns. When a column becomes active through direct input (B and C), it sends signals through its cells' connections, and when a cell receives enough signals it activates as a prediction (black cells). [61]
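A toy sketch of the prediction step pictured in Figure 1.2 (again not the CLA itself; cell labels, connections and the threshold are invented): cells that receive enough lateral signals from currently active cells enter a predictive state.

```python
# Cells are labeled "<column><row>". connections[c] is the set of cells that
# send lateral signals to cell c (its distal connections). Invented example data.
connections = {
    "A1": {"B1", "B2"},   # A1 listens to column B cells  -> encodes sequence BA
    "A2": {"C1", "C2"},   # A2 listens to column C cells  -> encodes sequence CA
}
THRESHOLD = 2             # signals needed for a cell to become predictive

def predictive_cells(active_cells):
    """Cells receiving at least THRESHOLD signals from active cells are predicted."""
    return {cell for cell, presynaptic in connections.items()
            if len(presynaptic & active_cells) >= THRESHOLD}

print(predictive_cells({"B1", "B2", "B3", "B4"}))  # column B active -> {'A1'}
print(predictive_cells({"C1", "C2", "C3", "C4"}))  # column C active -> {'A2'}
```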

Visual Pattern Classification. Visual pattern classification is the main use for HTM. It starts out with the examples provided in the official NuPIC package, which are then extended, as in [7]. There, HTM is used to implement a system that helps in content-based image retrieval of architectural images drawn with pen and paper. Here the extension of NuPIC proves useful, allowing the user to formulate working queries in more natural language.

During this time, the main authors of HTM, namely Jeff Hawkins and Dileep George, kept working on the algorithms. Two important examples are [36] and [26]. In the first, the HTM is explained as a hierarchical spatial and temporal model, and a test using motion capture is presented as an example of the need for temporal patterns. In the second, the authors use the HTM to recognize Kanizsa diagrams (recognizing a square in a diagram that only shows the corners). This is an interesting case of visual classification, given that such inputs were never presented during training.

Afterwards, multiple papers were written using HTM for visual classification; one interesting detail is how image classification is used in these papers as a tool to achieve something more than simple picture classification. For example, in [30] the HTM learns some standard design models for the backs of chairs and is then used, via classification, to decide whether proposed new designs are good or bad, with interesting results in comparison with students' decisions. Two other good examples of this are [52] and [18]. In the first, an HTM is used with different types of color inputs to recognize different images such as traffic signs. In the second, the HTM is used in a real-world situation to learn to recognize the soccer field and the relationships among the objects inside it in a RoboCup scenario; some modifications to the HTM are made and unsupervised learning is used, after which the robot is capable of recognizing the patterns with 83% accuracy. These two uses are also similar to [64], where HTM is used for scanning aerial photos and deciding the use of the land, such as irrigation, building or different types of plants. One interesting aspect of some of these papers is that in the discussion it is common to find the desire that in the future the HTM will be able to make predictions and not only classification, which, as explained by some authors, would lead to more interesting results, applications and extensions.

Two of the latest papers are [10] and [73]. In the first, the theories from MPF and HTM are used to create an unsupervised identification system that learns by itself, creating categories of objects in its environment, with occasional intervention by an operator to label the categories found. In the second paper, a common handwriting recognition system is implemented with average results.

Sounds. As the original model states, the cortical algorithm is used by the brain throughout the entire neocortex, meaning that HTM should be capable of being used not only for visual input but also for auditory input. One important detail is that for auditory input the temporal part of the model is of greater importance than for visual classification. There have been some approaches to it, but not as numerous as for visual input.

One of the first approaches is [15], where an HTM was trained, with a small data set, for spoken digit recognition with an 11.63% word error rate. An important detail is that at the time the temporal properties of the model had not been implemented in the HTM used. HTM was later also used for speech recognition [49], where the unsupervised learning of phonemes is shown. This is also important as it is one of the few examples where unsupervised learning is used.

Other Uses. Apart from the previously presented uses, other implementations have also been developed. Multiple other uses of HTM are found in the literature, and they differ widely: it is used for classifying arm movements (given by a movement sensor) into different activities such as drinking or eating [78], or for describing the movements of generated robots, as in [71], where a different yet interesting use of the HTM is presented: an authoring program for building robots in which the motion of the robot is expressed in an HTM, making it easy for children or the elderly to move the robot, although the output of the HTM is just a class.

In [72] a different scenario is presented, where HTMs are used to model asymmetric warfare. The example case used is "Joint Urban Operations in Iraq after the regime of Saddam Hussein", which makes it an unusual use of HTMs. Here multiple parameters of the state of Iraq are input to the HTM, which in the end should classify the state given the labeled input data.

1.2.2.5 HTM - Modifications

Given that HTM is a model still in development, multiple modifications or extensions have been attempted to improve it. There are some papers in which it is clear that HTM implementations are not definitive yet, and multiple modifications are proposed. An interesting adaptation, bearing in mind the speech recognition capabilities, is the proposal of the HSSM [50] as an adaptation of the HTM to build representations of musical knowledge. A more interesting modification is the rHTM [13], in which, following brain-inspired models, the author proposes the addition of a reward parameter to the networks; this is done in such a way that reinforcement learning becomes similar to how dopamine works in biology. This results in the model being able to reproduce behavior grounded in neurology.

Some other models are presented in search of a closer approach to the MPF. For example, [19] uses HTM and the HQSOM (Hierarchical Quilted Self-Organizing Map), a variant of the self-organizing map, and [70] achieves better feature representation via a DHNBP (Dynamic Hierarchical Nonparametric Belief Propagation) model. Neither DHNBP nor the proposed model has a public implementation.

Another approach to improving the HTM is to increase its speed; this is not common, but a good example is [76], where an alternative model to HTM (HTSOM) is proposed, making low-cost implementations suitable for the consumer market more feasible.

More recently, some papers have been written with a more general view. A very interesting application of the HTM is presented in [67], where an HTM is used in various scenarios, one of them being an agent that plays "Rock, Paper, Scissors". The trained HTM not only predicts the next move, but also decides which move increases its reward. The reward component is, as in the previous example, included in the model, but in a more practical way. The only problem is that the opponent just cycles through the options in a fixed way, so it should be easy to learn.

Another important paper with this general view is [44], where most of the work on HTM is taken into account and an optimization of it is proposed. It proposes some changes in the temporal and spatial modules. One of the interesting changes is the use of a log-polar transformation that simulates how the eye delivers information to the brain. This results in an improvement in object recognition and classification.

It is important to note that [2] is described by some authors as a better hierarchical model than HTM.

1.2.3 Deep Learning

Another path, which is not considered by Pei Wang, is the one introduced by Geoffrey Hinton [37]. This research path, although leaning more towards machine learning than other AGI models, is important because some brain-inspired techniques are implemented. This path has had good momentum and in recent years has become very popular. With deep learning, multiple systems have beaten state-of-the-art performance with ease. The principal idea behind deep learning is the use of a generative model that results in a joint distribution over the data and the labels. It has the ability not only to predict the label from the data, but also to predict the data given the label. These DBNs (Deep Belief Networks) have a general structure as shown in Figure 1.3 [3]. Later, the introduction of convolutional belief networks expanded the flexibility of DBNs.

One really interesting application is [46], where a deep neural network is trained on a cluster of 1,000 machines (16,000 cores) to be "selective for high-level concepts using entirely unlabeled data". Drawing on multiple concepts and ideas from across AI, the authors state that the main reason deep networks had only been able to find shallow hierarchical representations was their long training time. So, using the resources at hand, they implemented a 1-billion-parameter system and trained it with a large amount of data. As a result, human faces, bodies and cat faces are recognized, with a 70% improvement over state-of-the-art classification over 22k categories, where random guessing achieves less than 0.005% accuracy.


Figure 1.3. Deep Belief Network structure, from [3]

It is important to note that deep learning is also influenced by the brain: the brain also has a deep architecture and, more importantly, there is the idea that the brain does not pre-process input data before introducing it to its network, but rather learns from it directly.

One of the latest advances in this line is shown in [54], where a deep network was trained to play Atari games.

1.2.4 Universal Artificial Intelligence: AIXI

Universal Artificial Intelligence is promoted by Marcus Hutter [38]. In contrast with the previously presented approaches, Universal Artificial Intelligence is not particularly inspired by the brain, nor does it search for immediate state-of-the-art results. Rather, it tries to explain and mathematically express an agent that would perform as the most intelligent unbiased agent possible. The general idea is that AI systems can be stated as goal-driven systems that try to maximize something; in life it would be survival and procreation. Given that, AIXI is the mathematical theory that would lead to such an agent.

Induction and many other mathematical tools are used in the definition of AIXI, but it assumes the availability of unlimited computational resources, which is its major drawback: it is not computable. Some reductions are made to develop an algorithm, AIXItl⁵, which still conserves a good level of intelligence, but its complexity means it is not testable in practice. Thus the theories set out in Universal Artificial Intelligence will not develop into applicable systems or Artificial Intelligence for a while.

⁵Taken from Hutter's web page, www.hutter1.net


1.2.5 Novamente - OpenCogBot

Novamente is also different from the previous approaches. It focuses on artificial general intelligence and results in a testable product. Its principle is that an intelligent system is "a system that can achieve complex goals in complex environments" [48]. Thus Novamente is not developed as a single algorithm that results in generality; rather, it involves many components working together. These components are called lobes; each of them has a specific task.

Many of these components use multiple techniques found across AI; with that in mind, it is possible to see Novamente as a more direct attempt at achieving general intelligence using the bottom-up approach that AI has been following, resulting in what could be called an attempt to unite the different individual solutions that have been developed through the years.

In [27], the exploration of the idea of a robot with intelligence comparable to that of a preschool kid is really interesting, based on new connections and an architecture of already known AI systems: vision, language comprehension and action planning. The project is at an early stage and some work is still needed.

1.2.6 Other Ideas

There are many ideas and research paths that touch on multiple common points, depending on what is taken as the final goal. There are two important surveys that show many of them, again each one including or excluding models depending on its approach. One is Pei Wang's gentle introduction to Artificial General Intelligence (an online document), the other a survey of cognitive models [28]. Presented here are some other approaches not included in those reviews.

1.2.6.1 Hardware

One line of research states that Artificial General Intelligence is not reachable with the hardware we currently have. It is suggested that it is not only necessary to understand the brain, but that a more adequate simulation of its parts is needed to achieve some level of intelligence. The algorithmic and logic hardware we have differs greatly from how the brain is built. Thus, a line of research is motivated by the idea of bringing to artificial hardware the concepts understood from the brain [41, 14].

In relation to neural networks, it is believed that some of their limitations are imposed by current hardware. The available computing power is not able to train or use effectively a network with many neurons and many layers. As seen in [46], a big cluster of machines (1,000), used for 3 days, is needed to learn a network that yields good generalizations. Hence the idea that hardware-implemented neurons could be much more effective than current processors and therefore make it possible to implement denser neural networks, which could possibly lead to general intelligence.


1.2.6.2 Blue Brain

Another approach that comes more from neuroscience is the Blue Brain project. This project began in 2005 with a BlueGene/L supercomputer, with which the researchers are looking to make a complete model of the human brain. They have already been able to simulate a rat cortical column.

Through this long approach, a lot of information is being found and tested jointly in biology and in simulations. With it, a greater comprehension of the human brain should be reached, but more interesting is the general idea of building a whole simulated artificial brain. Although it seems a giant task, it is said to be no more than ten years away. And given that it would be a complete simulation, it is expected that it would work like a human brain, thus achieving human-level intelligence.

1.2.6.3 HMAX

HMAX stands for "Hierarchical Model and X"; it is an approximation of a computational model of object recognition in the cortex. Given that it is close to what the HTM is, it is sometimes used as a comparison with HTM. HMAX is closer to how the brain works, but less general, mainly because it focuses on brain simulation and not on applications.

1.3 Future of AGI

The future of the area can be viewed in the short term and in the long term. Strangely enough, long-term predictions seem easier to elaborate than short-term ones, maybe because in long-term predictions the "how" does not matter, while in short-term ones it does. Long-term predictions are made by almost every author and every theory. One example is given by Adams in [1], where a path of the required milestones towards human-level artificial intelligence is presented.

1.3.1 Long Term

It is enlightening to see how the development of the field has led many authors to believe that human-level artificial intelligence is coming closer. The singularity is a growing topic [40], and authors are sitting in their chairs thinking about how AI could change human life forever. If AI is joined with brain understanding, the wildest ideas seem possible; even the uploading of brains could become a reality.

Less wild thoughts lead to a much greater understanding of the world. Animal behavior, weather and many other highly variable systems could finally be understood and maybe predicted. The development of human-level intelligence is thought to rapidly take us to an understanding that has not been reached by humans, because an artificial system with human-level intelligence would have faster access to more information, and patterns that have not yet been discovered by human eyes could be obvious to it.

A really interesting thought of Hawkins' [34] is that machines could, and most probably would, have other sets of sensors; thus their input, their sensations, would be completely different. They could see different things than humans: they could see a wider electromagnetic spectrum, or in more detail. They could hear more. All these sensory changes would also change the way they perceive the world and how they understand it. Artificial Intelligence could achieve very interesting generalizations and discover patterns that are hidden from our limited senses.

Although this increasing momentum and belief in finally arriving at an Artificial General Intelligence, which has been evading researchers for more than 50 years, is impressive, the return to the origins of Artificial Intelligence evokes memories of old mistakes. It is known that when the challenge of Artificial Intelligence was first presented it was taken as a simple task. The first successful attempts led researchers to think that AI was waiting for humanity around the corner, but they could not have been more wrong. So the current momentum and growing belief in human-level AI should be taken slowly, because it has happened before, and on that occasion we were wrong.

1.3.2 Short Term

As said before, short-term perspectives on the field are difficult. Each of the research paths has its drawbacks; each one has problems to tackle. And most of them have yet to show their potential, so constant review of the area is needed.

As stated in [1], there are multiple approaches, but one to note is general video game learning. There, a system that can learn to play many games, without previous knowledge, is presented as a first step towards human-level intelligence. The approach of the "paper-scissors-rock" game in [67] could be a beginning, but the environment and opponent are excessively simple.

1.4 Testing for AGI

1.4.1 Defining a Test

Adams et al. suggested a set of tests to compare AGI algorithms. In [1], the authors discussed the characteristics of such tests and proposed a road map for the development of AGI. This road map is presented as a 2D plane where Piaget's development stages lie on the x-axis and Vygotsky's socio-cultural phases lie on the y-axis. On the plane, the suggested tests are placed (see Fig. 1.4) such that simple tests are in the lower-left corner and complex ones in the upper-right. Using this ordering, the simplest suggested test is general video-game learning, where an agent should be able to learn how to play as many different games as possible.

1.4.2 Game Playing

The idea of a game-playing agent is not new; the old mechanical chess player known as "The Turk" [69], which deceived people into thinking it was an autonomous chess master, is a demonstration of the old idea of an autonomous agent. With the development of computers such agents seemed more viable, and there were many attempts at building them. An important peak for game-playing agents was chess, considering that chess has been associated with high intelligence. Thus, the development of a chess-playing agent that could maneuver better than a human was considered a great accomplishment. In 1997 it was achieved [12], but afterwards it was absolutely clear that Deep Blue, the agent that accomplished the feat, was not truly intelligent. This lack of intelligence is mainly due to the domain knowledge embedded in it.

Figure 1.4. Milestones on the AGI Landscape, from [1]

As with Deep Blue, multiple game agents have been designed to play video games, but it is difficult to call them intelligent, as most of the agents play just one game and require specific heuristics and domain knowledge. These agents use multiple strategies: some use the domain knowledge to check the benefits of all possible actions using algorithms such as A*, minimax or Monte Carlo search. Others use the domain knowledge to extract the most important features and learn which actions in which situations return the best reward, as with the TD-gammon game [54].

Video games are a good source of motivation for the development of agents, either to be foes to human players or to compare their abilities with human players. The first case is very common; the first foes were agents with a pre-written set of rules that would make them "worthy" enemies. After these initial enemies, more flexible foes have been developed, for example using reinforcement learning to develop dynamic rules [65]. The second case is using video games as a tool to test intelligent agents. Many games have been used for the purpose of testing new intelligent agents; for example, Pong was used by Nadh [57] to test an associative memory agent. Another example is the Mario game, as used by Mohan in [55]. In that paper, different forms of reinforcement learning agent design are presented. The different agents use heuristics specific to each game, resulting in good agents for their designed game, for example playing Infinite Mario, a variation of Nintendo's Super Mario game. Also, in 2009 Robles [68] used tree search to play Ms. Pac-Man, and Buro in [11] mentions a StarCraft competition to test intelligent agents. These are some of the games that have been used to develop and test intelligent agents.


1.4.3 General Game Learning and Playing

The test suggested by Adams et al. goes beyond the common agents used for game playing and forces the agents to avoid including specific domain knowledge. This kind of test is normally referred to as General Game Playing. There have been some previous attempts at such tests. In 1992, Metagame [63] was presented as a first idea of a competition where agents are not programmed to play a specific game but rather a class of games. In the analysis of Metagame, some of the most relevant concepts of developing general agents, in contrast with specific agents, are discussed. One of the most important of these concepts is evaluation, where it is questioned whether the contest is evaluating the performance of the agent or the analysis of the researcher.

Afterwards, in 2005, the General Game Playing competition was presented as an approach to General Game Playing [23]. In this competition a player is given the rules for an unknown game via the Game Description Language [75]. Because the agents focus on extracting the best possible strategy from the rules, but are not confronted with the problem of learning what the rules are, the competition targets General Game Playing and not General Game Learning. The General Game Playing competition has been running since 2005, and multiple papers about the agents developed have been published. Some examples are [20, 51, 21]; in these, as in most other agents of the competition, the strategy is based on searching the tree of possible actions, inspired by the minimax method traced to a 1912 paper by Ernst Zermelo (as Russell states in [69]).

1.4.4 Video Game Learning

1.4.5 Arcade Learning Environment - ALE

The Arcade Learning Environment (ALE) [6] was proposed by Bellemare et al. as a platform for evaluating general, domain-independent AI technology. ALE provides an interface between an Atari 2600⁶ emulator and an agent. ALE sends each game frame (as an array of pixels) and the current points obtained; the agent in return sends the action to take. Thus, ALE allows tests of intelligent agents to be developed in a fast and systematic way.

1.4.5.1 Atari

Pong  Pong (Fig. 1.5) is one of the first video games developed, released in 1977 for the Atari 2600. It is a simple game in which two players, represented by vertical lines, move to score goals with a ball without being scored on. Pong can be played by one or two players; to execute the general game playing tests the one-player version is used, where the foe is a computer player included in the game. It is important to note that, although the player acts through up and down movements, these actions are executed with the left and right buttons.

The game score in this game is the number of goals scored by each player, and the game ends when one player gets 21 points. This game score is transformed into a single number by calculating the difference between the goals of the two players.

⁶ Atari 2600 is a popular video game console with more than 500 games and a 1.19 MHz CPU.


Figure 1.5. Pong game image

Figure 1.6. Breakout game image

With this transformation, the game points for the agent range from -21 to 21. For example, five goals for the agent and eight goals for the computer result in negative three points.

Breakout  Breakout, released in 1976, is an Atari video game where the player uses a paddle to redirect a bouncing ball in order to break several lines of bricks (Fig. 1.6). Breakout is a deterministic one-player game; it has one object (the paddle) moved by the player and one object (the ball) that moves independently but can be affected indirectly by the player. The game score is a multiple of the number of bricks broken.

1.4.5.2 Arcade Learning Environment

The Arcade Learning Environment (ALE) [6] is an open-source general game playing platform that uses the Atari 2600, a popular game console with many released games. Such a common console with many games provides a good benchmark for intelligent agents, where the agents can be tested and the results can be compared. Furthermore, the Atari 2600 had a 1.19 MHz processor and a RAM capacity of 128 to 256 bytes; these low resource requirements allow running Atari games at the fast speeds needed to train an intelligent agent.

ALE is composed of different parts. First, ALE uses an open-source Atari emulator called Stella [8], which loads the game code (ROMs) and runs games as the real hardware would. ALE's main code, written in C++, works as an interface to this Stella emulator, offering the game images and score, receiving the actions to execute, and managing different commands to control the emulation, such as loading games and starting or stopping them. The ALE code runs as a process and is able to communicate with an intelligent agent (another running process) that makes the action decisions. There are several ways to enable the communication between ALE and the agent, but the most flexible is by using pipes, a one-way interprocess communication tool with a FIFO-like buffer structure.


Using two pipes, one to receive the game images and score, and one to send the actions, any intelligent agent (in any programming language) can communicate with ALE and play Atari games with the restrictions suggested by Adams et al. in [1]. Throughout this document this last structure is used, with Java agents that extend directly from the Java agent example included in the ALE code.
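As an illustration of this two-pipe setup, the following minimal Java sketch shows the read/decide/write loop an agent would run. The pipe file names and the parseFrame, parseReward and decideAction helpers are placeholders; the exact ALE message format is not detailed here, so the sketch only captures the structure of the communication, not the real protocol.

import java.io.*;

// Minimal sketch of the two-pipe loop described above. Placeholder helpers
// stand in for the real ALE message decoding and for the agent's logic.
public class PipeAgentSkeleton {
    public static void main(String[] args) throws IOException {
        try (BufferedReader fromAle = new BufferedReader(new FileReader("ale_fifo_out"));
             BufferedWriter toAle = new BufferedWriter(new FileWriter("ale_fifo_in"))) {
            String line;
            while ((line = fromAle.readLine()) != null) {   // one message per frame
                int[][] frame = parseFrame(line);           // placeholder: decode pixels
                double reward = parseReward(line);          // placeholder: decode reward
                int action = decideAction(frame, reward);   // agent logic goes here
                toAle.write(action + "\n");                 // send the chosen action back
                toAle.flush();
            }
        }
    }

    // Placeholders so the sketch compiles; a real agent implements the ALE format.
    static int[][] parseFrame(String msg) { return new int[210][160]; }
    static double parseReward(String msg) { return 0.0; }
    static int decideAction(int[][] frame, double reward) { return 0; } // 0 = NOOP (assumed)
}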

One additional important detail of ALE is the way it sends the game score. The game score in ALE is presented as a single scalar value, called the reward. This reward is a direct representation of the in-game points, with some game-specific details. For example, in games where there are two scores, like Pong, the difference between them is the reward delivered by ALE. Also, in games where other information, additional to the score, is present, like the number of lives left, this information is not available through ALE.

1.4.5.3 Pre-processing

It is common, in intelligent agents, to reduce the size of the input space in order to simplify the problem and reduce processing time. The following process tries to reduce the dimensionality of the game frames without removing vital information.

Color Reduction  Atari games use 8 bits to represent colors, and normally objects are clearly differentiable by their colors. However, the two games studied in this document attach no strong meaning to the colors present; so, the games are transformed to a black and white scale, separating the background (black) from the rest. This color reduction decreases the size of the game frame representation without any important information loss; a human could still play these games.

Scale Reduction  A common way to reduce the dimensions of the image is to downscale it, reducing resolution but conserving the positions and general shapes of game objects. The agents used in this document execute the same downscaling process used by Bellemare and Hausknecht [6, 32] to reduce image dimensionality (see Fig. 1.7). This process consists of taking groups of pixels and joining their colors; the grouping works best when the resulting number of groups is a divisor of the original number of pixels. Therefore, the frames are reduced from 160 pixels wide and 210 pixels high to 80 pixels wide and 52 pixels high, downscaling the width by half and the height by a factor of four. This asymmetric scaling keeps the image as small as possible without much information loss.

The factor is chosen by visual comparison of different scale factors. Downscaling by a factor of 5 produces an image where the ball is difficult to follow; sometimes it spans two vertical pixels, sometimes two horizontal pixels, and in other frames just one pixel. A factor of 4 produces a cleaner image in the vertical direction but still shows differences in the horizontal direction. Thus, an asymmetric downscaling by two (width) and four (height) yields a clean image (see Fig. 1.7).
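The following Java sketch illustrates the pre-processing described above: black and white conversion followed by asymmetric 2x4 downscaling, where a block is active if any pixel in it is active. The background color value and the handling of the two leftover rows (210 is not a multiple of 4) are assumptions made for the sketch, not details taken from ALE.

// Sketch of the pre-processing: pixels different from the background become 1,
// everything else 0, and the binary frame is grouped into 2x4 blocks.
public class Preprocess {
    static final int BACKGROUND = 0; // assumed background color code

    public static int[][] binarizeAndDownscale(int[][] frame) {
        int h = frame.length, w = frame[0].length;   // 210 x 160 for Atari frames
        int outH = h / 4, outW = w / 2;              // 52 x 80 after grouping
        int[][] out = new int[outH][outW];
        for (int y = 0; y < outH * 4; y++) {
            for (int x = 0; x < outW * 2; x++) {
                if (frame[y][x] != BACKGROUND) {
                    out[y / 4][x / 2] = 1;           // any active pixel turns the block on
                }
            }
        }
        return out;
    }
}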

Frame skipping  Some frames of the game are skipped, processing only every eighth frame. This reduces the frame frequency from 60 fps to about 7 fps and allows better learning. This technique is used by Bellemare [6] and by DeepMind [54], and its benefits are discussed by Braylan [9].


Figure 1.7. First Breakout frame after down-scaling and black and white conversion

Figure 1.8. Second Breakout frame moving left, after down-scaling and black and white conversion

Additionally, every fourth frame a Fire action is sent, which is required, in the games selected, to start the games (see Fig. 1.7 and Fig. 1.8).

1.4.5.4 ALE agents

ALE has been used to test intelligent agents. For example, Hausknecht in [31] used an algorithm called HyperNEAT to evolve neural networks for the Atari games Asterix and Freeway. In 2013, Hausknecht [32] extended the previous paper to include different methods such as reinforcement learning and neuro-evolution, combining them with some feature extraction techniques and comparing on more than 30 games. Bellemare in [6] shows some agents with common reinforcement learning and planning techniques. Naddaf in [56] shows another set of reinforcement learning agents and, separately, planning agents with object feature extraction.

Finally, with an architecture very similar to ALE, also based on the Stella emulator, an approach using a Q-learning modification called deep Q-learning achieved state-of-the-art results in many of the games ALE supports [54].


1.5 Summary

Neuroscience and machine learning have been developing over the last years, mostly in separate directions. They have gotten better at understanding the brain and at discovering and learning patterns, respectively. But both have encountered limitations. Also, for a long time, there has been a profound desire to develop a machine with human capabilities, a machine that would think or even act as a human. This has proven to be a really difficult task. With these ideas in mind a research path has developed in which neuroscience and machine learning should be able to collaborate: inspiration from the internal workings of the brain should aid in developing more intelligent models and agents, and these models should help to test and improve theories in neuroscience. With these thoughts, many kinds of models with different inspirations, biological or not, have been developed. There seems to be momentum building up towards a more classical artificial intelligence, a real artificial intelligence.

But these ideas are still young and much work is needed, not only to clean up the idea space, but also to develop the most promising ideas. HTM is one of those; it seems promising, but this is not yet certain.

Given that HTM has an implementation, an active community and is biologically inspired, it seems a good model to test for AGI. General game learning is shown as a good way to test for AGI in [1], considering that it requires the least individual capacity and the least social-cultural engagement, but is capable of increasing its requirements on both theories (Piaget and Vygotsky).

Surrounding game learning there are multiple projects and initiatives; although a few of them center on general game playing, most depend on feature extraction or rule description. A good example is found in [32], where neural networks are evolved to play Atari video games. Although the most successful variants require feature extraction, the authors also test some with direct pixels and no feature extraction. Using the same Atari testing framework [6], which was developed to test for AGI, in [32] an AGI agent based on reinforcement learning is presented with very good results.

HTM has also been used for general game learning, although with smaller games: in [67] an HTM network learns to play "rock, paper, scissors" through reinforcement learning.

Given this, we propose to test HTM in the context of general game playing, using reinforcement learning as the tool that connects the HTM predictions with the in-game actions. For this, two games will be chosen and the agent will play both games, comparing the in-game points obtained under different learning orders (first one game, then the other).


CHAPTER 2

Learning Model

In chapter 1, general game playing was presented as a good test frame for general artificial intelligence. The general game playing test (proposed by Adams et al. in [1]) and the Arcade Learning Environment (developed by Bellemare et al. [6]) limit agents to a sensory-motor interface. This sensory-motor interface sends to the agent the game images and the in-game score, while it receives from the agent the actions to be executed. Using this interface the agent explores the environment and should choose the actions that increase the score received. This process of trial-and-error interaction (with the environment) and hedonistic (pain/pleasure) maximization of a numerical reward signal is the distinguishing feature of reinforcement learning problems [74, 43]; in consequence, general game learning can be appropriately modeled as a reinforcement learning problem.

Reinforcement learning includes different approaches which range from direct learning but indirect use, to direct use but indirect learning. Some well-known approaches to reinforcement learning are SARSA and Q-learning, both of them used in general game playing [53, 57]. However, these approaches employ indirect, model-free learning (the agent does not develop a model of the environment), making the agent less capable of transferring its knowledge among games. In addition, other techniques have been used to develop game playing agents [65, 55, 11, 32, 6, 66]; none of them is capable of learning a model of the game.

Even though general game playing is a reinforcement learning problem, the agents developed to solve it can be structured with the concepts of intelligent agents [69]. This type of agent is called a model-and-utility-based agent by Russell and Norvig [69] and its architecture is shown in Fig 2.1. In the case of game playing, the world state mentioned in Russell's agent is represented by a game image, or a series of game images, and the utility is directly related to the game score. The questions presented in Russell's architecture (How does the world evolve?, What do my actions do?, and How happy am I in state x?) correspond, in reinforcement learning, to the transition function and the reward function. The transition function answers the questions about the evolution of the world and about the effects of the actions. The reward function answers "How happy am I in state x?".

Based on Russell’s architecture and reinforcement learning theory a general game learn-ing architecture is proposed in this document (see Figure 2.2). The agent receives fromthe environment the game images and the reward for the current game state. This infor-mation is then used to improve the reward and transition functions; the later also uses


Figure 2.1. Russell’s model-based, utility-based agent. Taken from [69]

Figure 2.2. Agent architecture

After the possible improvement of the functions, the planning phase takes place. There, the possible actions and the current game state lead to predictions of future game states, obtained by applying the transition function; these predictions are then tested for utility by applying the reward function. Using this utility information the agent decides which action to execute, and sends it to the actuators.

Throughout this chapter, the Transition Function, Reward Function and Planning Phase, the components of the proposed architecture, are detailed, using Breakout as a running example of the process.
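A structural sketch of this loop, in Java, is given below. The TransitionModel, RewardModel and Planner interfaces are hypothetical stand-ins for the components detailed in the following sections; the sketch only shows how the pieces of Figure 2.2 could fit together, not the actual implementation.

// Structural sketch of the agent loop in Figure 2.2.
public class ModelBasedAgent {
    interface TransitionModel {
        void update(int[][] prev, int action, int[][] next);
        int[][] predict(int[][] state, int action);
    }
    interface RewardModel {
        void update(int[][] state, double reward);
        double predict(int[][] state);
    }
    interface Planner {
        int chooseAction(int[][] state, TransitionModel t, RewardModel r);
    }

    private final TransitionModel transition;
    private final RewardModel reward;
    private final Planner planner;
    private int[][] lastState;
    private int lastAction;

    ModelBasedAgent(TransitionModel t, RewardModel r, Planner p) {
        transition = t; reward = r; planner = p;
    }

    // Called once per (pre-processed) game frame.
    public int step(int[][] state, double gameReward) {
        if (lastState != null) {
            transition.update(lastState, lastAction, state); // learn T'
            reward.update(state, gameReward);                // learn R
        }
        lastAction = planner.chooseAction(state, transition, reward); // planning phase
        lastState = state;
        return lastAction;
    }
}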


2.1 Transition Function

The transition function models how a game evolves as different actions are taken. In other words, the transition function predicts the next game state (game image) s′ from the current state s and an action a taken by the agent. This function is defined in Equations 2.1 and 2.2, where E is the set of all possible game states and A is the set of all possible actions.

T(s, a) = s′    (2.1)

T : E × A → E    (2.2)

The objective of a model-based agent is to learn an approximation of this transition function (see Equation 2.3) that allows the agent to confidently predict the effects of its actions (see Equation 2.4).

T′(s_t, a_t) ≈ T(s, a)    (2.3)

s_{t+1} = T′(s_t, a_t)    (2.4)

By applying the transition function multiple times, on a list of actions, an agent can try to predict k steps ahead (see Equations 2.5, 2.6 and 2.7).

s_{t+k} = T′(T′(s_{t+k−2}, a_{t+k−2}), a_{t+k−1})    (2.5)

s_{t+k} = T′(T′(··· T′(s_t, a_t) ···, a_{t+k−2}), a_{t+k−1})    (2.6)

s_{t+k} = T′^k(s_t, a_k)    (2.7)

The transition function can be approximated in many ways. Here, an approximation composed of three disjoint parts of a game state is proposed (see Equation 2.8). The three parts are the following: the environment, the pixels on the screen that stay the same throughout the game (or change on few occasions); the avatar, the object that is directly affected by the actions executed by the agent; and the set of dynamic objects, which move independently and possibly interact with the agent.

GameState = Environment ⊎ Avatar ⊎ Dynamic    (2.8)

In consequence, the transition function is defined (see Equation 2.9) as the disjoint union of three functions, each modeling one of the parts of the game state.

T′(s, a) = E(s, a) ⊎ A(s, a) ⊎ D(s, a)    (2.9)
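One possible reading of Equation 2.9 is sketched below in Java: the predicted frame is the pixel-wise union of three independent predictions, one per part of the game state. The PartModel interface is hypothetical; it simply stands for whatever predictor is used for each part.

// Sketch: the predicted frame as the union of three per-part predictions.
public class ComposedTransition {
    interface PartModel { int[][] predict(int[][] state, int action); }

    private final PartModel environment, avatar, dynamics;

    ComposedTransition(PartModel e, PartModel a, PartModel d) {
        environment = e; avatar = a; dynamics = d;
    }

    public int[][] predict(int[][] state, int action) {
        int[][] e = environment.predict(state, action);
        int[][] a = avatar.predict(state, action);
        int[][] d = dynamics.predict(state, action);
        int[][] out = new int[state.length][state[0].length];
        for (int y = 0; y < out.length; y++)
            for (int x = 0; x < out[0].length; x++)
                out[y][x] = e[y][x] | a[y][x] | d[y][x]; // disjoint parts never overlap
        return out;
    }
}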


2.1.1 Environment (E)

The environment (static objects) of the game is composed of pixels that either stay constant throughout the game (such as the background) or hardly change (such as breakable parts of the background). To find the environment, different actions are executed and repeated; then, only those pixels that remained active during all actions are selected as the environment (this is valid for games where the background is static).

Algorithm 1 Find Environment

 1: function runAndSum(n, action)
 2:   for i from 0 to n do
 3:     sendAction(action)
 4:     sum ← sum + getGameFrame()
 5:   end for
 6: end function
 7: function findEnvironment(n, α)
 8:   for action a in A do
 9:     sum ← sum + runAndSum(n, a)
10:   end for
11:   staticEnv ← {x | x in sum, x > α·|A|·n}
12: end function

Algorithm 1 is used to find the environment. First, an auxiliary function, named runAndSum (line 1), is defined. This function repeats an action n times and sums each frame obtained into a matrix; thus, it counts how many times each pixel is active during the n frames. Finally, runAndSum returns the summed matrix. In order to find the environment, the function findEnvironment, on line 7, calls runAndSum for every possible action and sums all the resulting matrices. From this cumulative matrix (see Fig 2.3) the environment is extracted by selecting the pixels that were active in more than α|A|n frames, where α is the ratio of time a pixel must be active to be considered part of the environment (see Fig. 2.4 for the result).
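A Java transcription of Algorithm 1 could look as follows; sendAction and getGameFrame are assumed wrappers around the ALE interface, and the frames are assumed to be already binarized and downscaled.

// Transcription of Algorithm 1 (Find Environment) for binary 52x80 frames.
public class EnvironmentFinder {
    int height = 52, width = 80;

    int[][] runAndSum(int n, int action) {
        int[][] sum = new int[height][width];
        for (int i = 0; i < n; i++) {
            sendAction(action);
            int[][] frame = getGameFrame();
            for (int y = 0; y < height; y++)
                for (int x = 0; x < width; x++)
                    sum[y][x] += frame[y][x];           // count how often each pixel is active
        }
        return sum;
    }

    boolean[][] findEnvironment(int n, double alpha, int[] actions) {
        int[][] total = new int[height][width];
        for (int a : actions) {
            int[][] part = runAndSum(n, a);
            for (int y = 0; y < height; y++)
                for (int x = 0; x < width; x++)
                    total[y][x] += part[y][x];
        }
        boolean[][] env = new boolean[height][width];
        double threshold = alpha * actions.length * n;  // active in more than a fraction alpha of frames
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                env[y][x] = total[y][x] > threshold;
        return env;
    }

    // Assumed ALE wrappers, implemented elsewhere in the agent.
    void sendAction(int action) { /* write to the action pipe */ }
    int[][] getGameFrame() { return new int[height][width]; }
}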

Figure 2.3. Sum of all frames for environment extraction Breakout


Figure 2.4. Static Environment Breakout

2.1.2 Avatar (A)

According to Bellemare et al., "Contingency awareness is the recognition that components of a future observation can be affected by one's choice of action" [5]. This concept from cognitive science, which is thought to be important for child development, is the essence of the notion of the avatar. In Atari games, contingency awareness means recognizing which object represents the player, i.e. which object is its avatar. For example, in Breakout the avatar is the paddle, moved by the actions and used to bounce the moving ball.

To find the avatar, a human player commonly executes the set of possible actions and deduces which block of pixels is directly affected by these actions and what effect those actions have. The proposed agent assumes that the avatar stays still when the no-operation (NOOP) action is sent, and uses this to remove the environment and find the pixels that stay still when only NOOP is sent. This is described by Algorithm 2.

Algorithm 2 Agent Extraction

1: function extractAgent(n, β)
2:   staticEnv ← findEnvironment(n, α)
3:   movableNoopEnv ← runAndSum(n, ACTION.NOOP) − staticEnv
4:   agent ← movableNoopEnv > βn
5:   extractPixels(agent)
6: end function

The algorithm sums all the frames obtained while sending the NOOP (no operation) action and removes all pixels of the environment (see Fig. 2.5). Afterwards, it extracts all pixels that are active in more than half of the frames (the threshold βn in Algorithm 2, see Fig. 2.6). With that, the agent obtains all pixels that do not change when the action is NOOP. For some games this is enough to find the avatar; for others, additional steps (including using all other actions) might be necessary. Finally, with the given active pixels the agent locates the coordinates of the smallest rectangle that encapsulates the avatar and its pixels. For the current example (see Fig. 2.7) the coordinate ranges are x: [19, 22] and y: [37, 38].

For the following frames, the last position of the avatar is used as the center of a search box. In this search box, of m times the dimensions of the avatar, the new position is searched for by matching the known avatar shape.
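The search-box matching could be implemented as in the following sketch, where the known avatar shape is used as a binary template and matched exhaustively inside a box of m times the avatar size. Returning the previous position when no match is found is an assumption of the sketch; the two failure cases are handled separately, as described next.

// Sketch of search-box tracking by exhaustive template matching.
public class AvatarTracker {
    public static int[] locate(int[][] frame, int[][] template, int prevX, int prevY, int m) {
        int th = template.length, tw = template[0].length;
        int boxH = th * m, boxW = tw * m;
        int y0 = Math.max(0, prevY - boxH / 2), y1 = Math.min(frame.length - th, prevY + boxH / 2);
        int x0 = Math.max(0, prevX - boxW / 2), x1 = Math.min(frame[0].length - tw, prevX + boxW / 2);
        for (int y = y0; y <= y1; y++) {
            for (int x = x0; x <= x1; x++) {
                if (matches(frame, template, x, y)) return new int[]{x, y};
            }
        }
        return new int[]{prevX, prevY};      // not found (e.g. game restart): caller handles it
    }

    private static boolean matches(int[][] frame, int[][] template, int x, int y) {
        for (int dy = 0; dy < template.length; dy++)
            for (int dx = 0; dx < template[0].length; dx++)
                if (frame[y + dy][x + dx] != template[dy][dx]) return false;
        return true;
    }
}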

With this process, the agent's avatar is found in most frames, with the exception of two cases. The first case happens when a game restarts and the avatar goes back to its initial position, thus moving out of the search box. To reposition the search box around the avatar, the restart status of the game is sensed, and the search box is moved to its initial position when the game restarts.


Figure 2.5. Noop pixels without the environment Breakout

Figure 2.6. First Breakout frame without the environment (pixels which were on after some game episodes)

Algorithm 3 Agent Extraction

1: procedure extractAgent
2:   noopMovable ← sumNoop − staticEnv
3:   avatarInNoop ← {x | x in noopMovable, x = n}
4:   oneLeftAction ← runAndSum(2, ACTION.LEFT)
5:   leftEffect ← {x | x in oneLeftAction, x = 1}
6:   avatar ← (avatarInNoop + leftEffect) = 2
7: end procedure


Figure 2.7. Only pixels of agent active for Breakout, and extracted agent found by Algorithm 3.

The second case is when the avatar is only partially visible, like when the paddle in Breakout moves slightly out of the screen. Since the algorithm uses the exact shape of the avatar to determine its position, it cannot locate the avatar when this happens. To solve this problem, the algorithm cuts the avatar shape in half and uses this half to locate the avatar in the search box. This last solution might not work in all Atari games, especially when the avatar has more details; but it is enough for games like Breakout and Pong.

2.1.3 Dynamic objects (D)

The last part of the transition function learns how objects move and interact among themselves. In 1984, Duncan tested the limits of visual attention and showed that humans can only focus their attention on one object at a time [17]. This fact is used, in the proposed model, to learn the way objects move. So, for each image, a box is computed around each moving object (see Fig 2.9), and this subset of the whole image is used to learn the movement of the objects.

To compute this box, the model presumes that these moving objects are represented by the game pixels remaining after removing the environment and the agent's avatar, like the ball in Breakout. To identify those objects, the algorithm removes the environment and the avatar from each frame (see Fig. 2.8), and uses an exhaustive search to locate the objects and calculate their sizes.

Using the position and size of an object, a box is placed around each object (see Fig. 2.9). The active pixels inside this box are taken as a representation of the dynamic object state, called the viewport from here on (see Fig. 2.10). In order to have better information about the object state, the box used to create the viewport is also placed in consecutive game frames, creating the history of a given viewport (see Fig. 2.11).

Finally, since the viewport is centered on the object, on some occasions part of the viewport can lie outside of the game frame (see Fig. 2.12). In these occasions, the part of the viewport that is outside of the image is filled with inactive pixels (see Fig. 2.13).
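A minimal sketch of the viewport extraction, including the zero-padding for parts that fall outside the frame, is shown below; the viewport size and the centering convention are the only inputs.

// Sketch: a size x size window centred on the object, with out-of-frame
// positions left as inactive (0) pixels.
public class Viewport {
    public static int[][] extract(int[][] frame, int centerX, int centerY, int size) {
        int[][] view = new int[size][size];   // initialised to 0 (inactive)
        int half = size / 2;
        for (int dy = 0; dy < size; dy++) {
            for (int dx = 0; dx < size; dx++) {
                int y = centerY - half + dy;
                int x = centerX - half + dx;
                if (y >= 0 && y < frame.length && x >= 0 && x < frame[0].length) {
                    view[dy][dx] = frame[y][x];
                }
            }
        }
        return view;
    }
}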


Figure 2.8. Breakout without environment and avatar, just dynamic objects (ball).

Figure 2.9. Breakout image with viewport box marked

Figure 2.10. Breakout viewport from Figure 2.9

Figure 2.11. Example of three consecutive viewports of a ball moving left-down


Figure 2.12. Example of a viewport lying outside the game image

Figure 2.13. Example of the resulting viewport, when part of it is outside the game image (see Fig. 2.12).

2.2 Reward Function

The reward function measures the utility of a given game state s_t by predicting the reward r_t that such a state will obtain (see Equation 2.10).

R : E → ℤ
    s_t ↦ r_t    (2.10)

where E is the set of all possible game states.

2.3 Planning Phase

To decide the action to be taken, the transition and reward functions are used in a breadth-first search through the action tree. The general concept of this action decision process is to select the action that maximizes the future reward, as in the minimax algorithm (traced to a 1912 paper by Ernst Zermelo [69]). This is accomplished by making predictions of future game states for multiple permutations of the set of possible actions, and applying the reward function on each predicted game state to obtain its utility (see Equations 2.11 and 2.12).

actionPlan = argmax_{a_k} R(T′^k(s_t, a_k)),   ∀ a_k ∈ A^k    (2.11)

where A^k is the set of all possible permutations of k consecutive actions.


Figure 2.14. Action tree with rewards for each state

a_{t+i} = actionPlan[i]    (2.12)

An action tree (see Fig. 2.14) is created by starting from the current state and applying the Transition function for each possible action (Left, Right, Noop for Breakout). To assess the utility of the predictions made by the Transition function, each result is sent to the reward function. If no prediction yields a positive reward, the process is repeated for each prediction, thus applying the Transition function a second time (T²). This process continues iterating until there is a predicted state that yields a positive reward; then, the list of actions taken to arrive at such a reward is stored and followed (as the action plan of Equation 2.12).

This algorithm can require a very long time to execute; therefore, to reduce the execution time two optimizations are used. First, a limit on the depth of the search is imposed. If the depth limit is reached before a positive reward is found, the action with the maximum predicted reward is executed, but no action plan is generated or stored, thus requiring another search through the tree in the next time step. The second optimization is to ignore game states that are equal to previously processed game states.

This approach is enough to test the accuracy of the Transition and Reward functions, but it can be improved by using a more efficient search algorithm, for example Monte Carlo tree search, which is used by the agents competing in the General Game Playing competition [23].
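The following Java sketch outlines this breadth-first search with the two optimizations just described (depth limit and duplicate-state skipping). The Transition and Reward interfaces are hypothetical stand-ins for the learned functions, and returning the best partial plan when the depth limit is reached is a simplification of the behaviour described above, where only the first action would be executed and no plan stored.

import java.util.*;

// Breadth-first search over the action tree of Figure 2.14.
public class TreePlanner {
    interface Transition { int[][] predict(int[][] state, int action); }
    interface Reward     { double predict(int[][] state); }

    private record Node(int[][] state, List<Integer> plan) {}

    public static List<Integer> plan(int[][] start, int[] actions, int maxDepth,
                                     double positiveThreshold, Transition t, Reward r) {
        Deque<Node> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        frontier.add(new Node(start, List.of()));
        List<Integer> bestPlan = List.of(actions[0]);
        double bestReward = Double.NEGATIVE_INFINITY;
        while (!frontier.isEmpty()) {
            Node node = frontier.poll();                     // FIFO order = breadth first
            if (node.plan().size() >= maxDepth) continue;    // depth limit optimisation
            for (int a : actions) {
                int[][] next = t.predict(node.state(), a);
                if (!seen.add(Arrays.deepToString(next))) continue; // repeated state: skip
                List<Integer> plan = new ArrayList<>(node.plan());
                plan.add(a);
                double reward = r.predict(next);
                if (reward > positiveThreshold) return plan; // first positive reward wins
                if (reward > bestReward) { bestReward = reward; bestPlan = plan; }
                frontier.add(new Node(next, plan));
            }
        }
        return bestPlan; // no positive reward within the depth limit
    }
}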

2.4 Training the model

The proposed model for the Transition function is composed of three parts: Environment, Avatar and Dynamic Objects. Each of these parts should be accurate in its predictions in order to correctly predict each game frame. Thus, learning how each part behaves and evolves is central to the proposed model.

To learn the Environment, the proposed algorithm assumes the environment stays still, which is enough for many Atari games. Some games have a moving background, and for those games another algorithm and learning technique would be needed.

To learn the Avatar, two aspects are important. First, an agent should find which set of pixels represents the avatar. Then, the relations between the actions taken and the Avatar positions should be learned. Those two aspects might be highly related: the effects of the actions make the avatar recognizable.

Multiple approaches to learning these two aspects of the Avatar are possible. For example, an agent could consider both aspects together and learn the pixel-wise effects of the taken actions.


On the other hand, an agent could find the Avatar with an algorithm similar to the one previously described, and use its position to learn the action effects with a function approximator. Also, once the position is known, an agent could use the speed as a variable that helps to predict avatar positions.

For the last part of the transition function, Dynamic Objects, another set of approaches is possible. Pixel-wise predictors like an HTM-CLA could be used to learn object movement, or other temporal approaches might be used. Also, as with the Avatar, finding the position of the objects could be useful. Following each object could allow the agent to learn the patterns in its movement and its interactions with the avatar or other objects; to accomplish this, a pattern finding technique or a predictor could be used.

Finally, the reward function should guide an agent towards the best rewarding game states. To predict those best game states the reward function could use pixel-wise approaches or split the problem into the same parts used in the Transition function. Additionally, the reward function could either directly model the rewards obtained from games or use concepts from reinforcement learning like eligibility traces, predicting rewards that are not actually delivered but that guide towards real ones.


CHAPTER 3

Learning Experiments and Results

3.1 Learning the Transition Function

As described in Section 2.1, the game state is defined in terms of three disjoint parts. In the previous sections, the algorithms to find those parts were detailed. Throughout this section, the algorithms to learn the movement of those parts are described and tested with Breakout.

3.1.1 Avatar

In order to learn the effect of each action on the avatar, two approaches are carried out. In the first one, an HTM-CLA is modified in such a way that it works as a direct transition function. In the second one, the avatar position is calculated and used as input and output of a neural network.

3.1.1.1 Learning action effects with HTM

Since the HTM-CLA model has no concept of actions, which are necessary to predict the avatar position, n cells, each one representing a single possible action, are added to the input of the basic HTM-CLA algorithm, and the neighbor definition is changed in order to include such cells. Thus, every cell can develop synapses to the action cells, allowing predictions that depend on the previous action.
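The input change can be pictured as in the sketch below: the binary frame is flattened and n extra action bits are appended, with exactly one of them (the previous action) active. How the HTM-CLA neighbourhood is widened so that cells can reach these extra bits is implementation-specific and not shown here.

// Sketch: append one-hot action cells to the flattened binary game frame.
public class ActionAugmentedInput {
    public static boolean[] build(int[][] frame, int previousAction, int numActions) {
        int h = frame.length, w = frame[0].length;
        boolean[] input = new boolean[h * w + numActions];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                input[y * w + x] = frame[y][x] != 0;   // pixel cells
        input[h * w + previousAction] = true;          // a single active action cell
        return input;
    }
}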

The changed HTM-CLA was compared against the basic HTM-CLA model on hand-crafted synthetic game frames. The test showed better accuracy of the avatar position predictions with the changed HTM-CLA. However, when tested on an Atari game, the avatar predictions were not as good as expected. This may be explained by considering that, if a cell has synapses to all the action cells, the sum of active synapses will always be the same for every action. So, this approach was not further explored.


3.1.1.2 Learning action effects with Neural Net

The second tested approach uses a neural network to predict the avatar position. To accomplish this, a feed-forward network with the following properties is used:

• Five neurons in the input layer, two for the scaled paddle coordinates and three for the one-hot encoding¹ of the actions (Left, Right, No-op).

• Two neurons in the output layer, which predict the paddle coordinates (scaled to [0,1]).

• Two hidden layers with 20 neurons each².

• A tanh activation function, as LeCun suggests in [47].

For the training process, the game is run through 50 episodes while executing random actions. A set of unique examples of avatar positions and actions is stored; as soon as 300 unique examples are gathered, the network is trained using backpropagation through 10 000 iterations. After the training process finishes, the current avatar network is replaced with the trained network.
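For illustration, the sketch below builds one training example in this format (two scaled coordinates plus a one-hot action as input, two scaled coordinates as target) and stores it in a set so that duplicates are discarded, as the training procedure above requires. The exact scaling constants and the example values are assumptions of the sketch.

import java.util.*;

// Sketch: building unique training examples for the avatar network.
public class AvatarExamples {
    static final int WIDTH = 80, HEIGHT = 52;

    static double[] example(int x, int y, int action, int nextX, int nextY) {
        double[] e = new double[5 + 2];          // 5 inputs followed by 2 targets
        e[0] = x / (double) WIDTH;               // scaled avatar position (input)
        e[1] = y / (double) HEIGHT;
        e[2 + action] = 1.0;                     // one-hot action encoding (LEFT=0, RIGHT=1, NOOP=2 assumed)
        e[5] = nextX / (double) WIDTH;           // scaled next position (target)
        e[6] = nextY / (double) HEIGHT;
        return e;
    }

    public static void main(String[] args) {
        Set<List<Double>> unique = new HashSet<>();
        double[] e = example(19, 37, 0, 18, 37); // a LEFT action moving the paddle one pixel left
        unique.add(Arrays.stream(e).boxed().toList()); // duplicates are silently discarded
        System.out.println(unique.size() + " unique example(s) stored");
    }
}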

Table 3.1 shows the prediction error of the net through three variables: the x and y errors (as a percentage of the width and height of the game frame), and the Euclidean distance between the predicted and the real coordinates. The evolution of those errors throughout the training process is shown in Figures 3.1 to 3.3. These accurate predictions indicate that the neural network was able to learn the relation between actions and avatar position.

Measurement           median   sd
y percentage error    0        0
x percentage error    1.25     0.52
error as distance     1        0.42

Table 3.1. Prediction error as a percentage of the maximum possible error for a neural net trained (10 times) to learn Breakout action consequences

3.1.2 Dynamic objects

In order to learn the evolution of dynamic objects, two approaches are developed: learning from the whole game frame and learning from a viewport that follows the dynamic objects.

3.1.2.1 Learning from a whole game frame

Both an HTM-CLA and a feed-forward neural network are trained to predict future game states from the whole game frames of previous game states, the better of the two being the HTM-CLA.

¹ One-hot encoding uses n bits to encode n states, and only one bit can be on (hot) at any time.
² Tests with a higher number of neurons in the hidden layers showed no significant improvement.


Figure 3.1. Percentage of wrong pixels —before and after learning— of paddle prediction in the x axis (Breakout) vs game time.

Figure 3.2. Percentage of wrong pixels —before and after learning— for paddle prediction in the y axis (Breakout) vs game time.

Figure 3.3. Percentage of wrong pixels —before and after learning— for paddle prediction as a distance (Breakout) vs game time.


Figure 3.4. HTM-CLA percentage of wrong pixels —before and after learning— for game frame prediction (Breakout) vs game time.

Neural Network  The neural network used receives the previous four game frames as input (those four frames provide enough information about the objects' movement) and returns the predicted game frame as output. So, the neural network has 16 640 inputs (4 game frames of 80x52 each) and 80x52 outputs. It is trained with 5 000 examples, and the resulting network is highly error prone and has problems predicting from its own predictions (it adds noise after multiple predictions). Thus, this approach was not further explored.

HTM-CLA  A second test is completed using the Cortical Learning Algorithm (CLA). The HTM-CLA used has 80x52 inputs and uses its temporal pooler to predict the next 80x52 outputs. During training, different parameters of the HTM-CLA are hand-adjusted; in general, the learning radius and the activation threshold have the highest impact on the accuracy of the predictions.

Game frames created by a random agent are used to train the HTM-CLA. The HTM-CLA was trained through 50 episodes in 10 experiments (see Fig. 3.4), with a mean error of 0.64% and a standard deviation of 0.061 (0.64% error corresponds to around 26 wrong pixels per frame prediction). A qualitative analysis of different predictions shows that the HTM-CLA learns the ball movement of game frames that repeat continuously. For example, the ball movement at the start of each episode is learned and predicted accurately, because it repeats on every episode. However, the HTM-CLA makes no interpolations, which means ball predictions may fail for unseen game frames. Additionally, on frames that repeat on few occasions, like the ball hitting the paddle or a wall, the HTM-CLA makes noisy predictions. This last detail is a problem for the planning algorithm, because noisy predictions of a game state, where the ball might even disappear, are useless when measuring the utility of a given game state.


Figure 3.5. Example of viewport history and CLA correct prediction.

3.1.2.2 Using viewports

A second approach to learning the evolution of dynamic objects is to use viewports. These viewports reduce the dimensionality of the data and allow simple movement patterns to arise. For example, when an object is moving at a constant speed, viewports extracted at different times may be equal in content. This means that an object that moves from position (0,0) to (40,40) in the full game frame will produce around eight 6x6 viewports of the object moving from (1,1) to (5,5), possibly with the same content.

In order to generate useful viewports, the size of the viewport should be enough to fit the object's movement, i.e., the object should appear in the viewport at time steps t−1, t and t+1. Thus, the viewport should be at least as big as the object plus some extra pixels, in order to fit the object after it moves. The required size therefore depends on how fast the object moves, and this depends on how frequently the game frames are sensed (as described before, some of the frames are skipped, see Section 1.4.5.3). Tests for the size of the viewport and the best number of frames to skip were carried out, and the best performing combination was a 6x6 viewport with no skipped frames; higher skip rates require bigger viewports, which increase the prediction error and the training time.

HTM-CLA  A first approach to using the viewports is to train an HTM-CLA network to learn how an object evolves in a viewport. Initial tests show that the HTM-CLA learns to predict the ball movement when the ball is alone in the center of the game frame (see Fig. 3.5 for a 12x12 example); but the HTM-CLA struggles to learn the ball movement when it bounces off the walls or the paddle (see Fig. 3.6, 12x12 example), even though the HTM-CLA is fully trained.

Throughout ten runs, with 50 episodes each, the median error of the HTM-CLA is 5.5% with a standard deviation of 1.96 (see Fig. 3.7). This error is defined as the percentage of wrong pixels, which means that the HTM-CLA makes mistakes in around 2 pixels of every 6x6 predicted viewport. A qualitative analysis shows that the HTM-CLA accurately predicts the position of the ball when it is alone in the game frame. So, it perfectly predicts some viewports, predicts some with one missing pixel (3% error), and predicts others with the ball slightly misplaced (one pixel up or down), producing an error of 4 pixels (11% error).

Neural Network  A feed-forward neural network was also used to predict future viewports, with 72 inputs (two 6x6 viewports) and 36 outputs.


Figure 3.6. Example of a ball bounce and CLA incorrect prediction.

Figure 3.7. HTM-CLA percentage of wrong pixels —before and after learning— for viewport prediction in Breakout vs game time.

The network was trained (see Fig. 3.8) through 50 episodes on 300 non-repeating training examples with the environment removed, in order to only generate predictions of the ball. The resulting network has a median prediction error of 2.8 pixels, with a standard deviation of 1.69. Most of the predictions of the network have one erroneous pixel, as shown by the histogram in Figure 3.9.

3.2 Learning the Reward Function

To approximate the reward function two approaches are tested. In the first one, a neural network is used to predict the expected reward from the whole game frame. In the second one, a neural network is used to predict the reward from two consecutive viewports. In both approaches the reward is scaled to the range [0,1], where 0 means a negative reward and 1 means a positive reward (0.5 represents no reward).
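The scaling itself is straightforward; a minimal sketch is shown below, mapping the sign of the ALE reward to the [0,1] encoding just described.

// Sketch: scale the sign of the ALE reward to the [0,1] training target.
public class RewardScaling {
    public static double scale(double aleReward) {
        if (aleReward > 0) return 1.0;   // positive reward
        if (aleReward < 0) return 0.0;   // negative reward
        return 0.5;                      // no reward
    }
}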

3.2.1 Using whole game frame

The network used has 80x52 inputs (a game frame) and one output (the scaled reward). Training examples are taken from 25 episodes with random actions, and the network is trained after gathering 10 000 examples.


Figure 3.8. Neural net wrong pixels —before and after learning— for viewport prediction in Breakout vs game time.

Figure 3.9. Histogram of pixel errors through 50 episodes

Additionally, not only the frames labeled as reward frames by the game are trained with positive rewards: a history of the l previous frames of each reward frame is also labeled as positive (see Equation 3.1).

r_{t−k} = r_t × γ^k,   ∀ k with 1 ≤ k ≤ l,   ∀ t with r_t = 1    (3.1)

Here, γ is the discount factor, l is the length of the history used, and r_t is the reward at time step t.
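A sketch of this labelling rule is shown below; it operates on the raw 0/1 reward labels before any further scaling or use, which is an assumption about the ordering of the steps rather than something stated above.

// Sketch of Equation 3.1: each reward frame propagates a discounted positive
// label to its l previous frames.
public class RewardLabeling {
    public static double[] label(double[] rewards, double gamma, int l) {
        double[] labels = rewards.clone();
        for (int t = 0; t < rewards.length; t++) {
            if (rewards[t] == 1.0) {                           // reward frame
                for (int k = 1; k <= l && t - k >= 0; k++) {
                    double discounted = Math.pow(gamma, k);    // r_{t-k} = r_t * gamma^k
                    labels[t - k] = Math.max(labels[t - k], discounted);
                }
            }
        }
        return labels;
    }
}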


Figure 3.10. Current reward (green) and Predicted reward (red) vs game iteration.

The neural network returns a positive reward for frames that lead to a game reward. Figure 3.10 compares the predicted reward (in red) against the real reward (in green) after training the network. The trained net appropriately increases its predictions until it arrives at a reward state. This is useful for the planning phase, where it is better to require fewer prediction steps to find a reward and consequently an action to execute.

3.2.2 Using viewport

The second attempt uses the object viewport (at t and t−1) as input, such that the reward network learns from local ball interactions. Two kinds of viewports are used: first a normal viewport, and then a viewport without the environment, where only the object is active.

3.2.2.1 With Environment

The first test is run by using a normal viewport (see the left example in Fig. 3.11) as input. Through 50 episodes taking random actions, the network gathers 300 non-repeating examples and is then trained through 10 000 iterations. After every additional 100 examples the network is retrained. The training graph is shown in Figure 3.12, and the graph of the predicted reward vs the obtained reward is shown in Figure 3.13. The resulting network makes perfect predictions on 53% of the frames.

3.2.2.2 Without environment

A second test is executed with the environment removed from the viewports (see the right example in Figure 3.11), so the network must focus on the object movement to make its predictions. The network is trained with the same parameters as before, which results in perfect predictions 45% of the time; but, compared with the previous net, it behaves better when used together with the transition function.


Figure 3.11. Examples of the same viewport; left with environment, right without the environment.

Figure 3.12. Reward training error for viewports with environment vs training epoch.

Figure 3.13. Current reward (green) vs Predicted reward (red), with environment.

3.3 Playing

The best results among the tested approaches are selected to play Breakout and Pong. In order to make avatar predictions, the network with avatar coordinates is used; to predict object movements, both the whole game frame and the viewport approach are selected. These approaches are first tested on Breakout and the best one is then tested on Pong.

3.3.1 Breakout

3.3.1.1 Using whole game frames

In a first test, the trained HTM-CLA of Section 3.1.2.1 and the reward network of Section 3.2.1 are used to play Breakout. A first test run shows that all parts of the model work well together, meaning that the action prediction tree is correctly explored.


Figure 3.14. Two levels of tree search of an HTM-CLA predicting the whole image, with the avatar position predicted by a neural net; some similar second level predictions are omitted to make the image clearer.

Figure 3.15. Continuing the tree search of Figure 3.14, shows the ball ready to bounce off the paddle.

However, the ball predictions are not perfect, mostly failing on collisions. As a result of those failed predictions, the agent is not able to obtain a positive reward from the reward function, so it plays as badly as a random agent.

However, an analysis of the agent's tree search shows that, if the HTM-CLA could predict ball bounces, the agent should be able to obtain the desired rewards. Therefore, in order to test the agent's potential capabilities, the previously trained HTM-CLA is retrained with hand-picked examples of the first possible bounce of the ball. After repeating the bounce examples ten times, the HTM-CLA was able to correctly predict the trained bounce. When this new HTM-CLA was used to play Breakout, the agent was able to correctly predict future frames of the ball and the avatar movement, triggering a reward in the reward function.

The steps taken to arrive at that reward prediction are as follows. First, predictions for t+1 with actions LEFT, NOOP and RIGHT are computed (see the first level of Figure 3.14). Then, from each one of those predictions, t+2 predictions are computed (see the second level of Figure 3.14). Further on, predictions for time steps t+10 and t+11, for a combination of LEFT and NOOP actions, show the ball bouncing off the paddle (see Figures 3.15 and 3.16). Using such a predicted game state, the reward function returns a positive reward; therefore, the actions that lead to that state are selected and executed by the agent.

These interactions result in an agent that obtains 2 points per episode, beating the random agent.


Figure 3.16. Continuing the tree search of Figure 3.15, shows the ball after bouncing off the paddle and activating a reward in the reward function (neural net).

Figure 3.17. Predicted State that returns positive reward.

Figure 3.18. Viewport of 3.17 that returns positive reward.

3.3.1.2 Using viewports

In a second test, the network from Section 3.1.2.2 and the reward network from Section 3.2.2.2 are used to play Breakout. The resulting agent is able to accurately predict future game frames, and the reward function returns positive rewards on some of the paddle-ball bounces. For example, in the first planning phase of the game, it predicts 28 steps and actions ahead (first left, then noop) until the ball and the paddle are close enough (see Fig. 3.17). The reward function, for that viewport prediction, returns 0.89 (see Fig. 3.18); so the agent follows the action plan that leads to that game state. Continuing this process, the agent is able to obtain 12 game points.

3.3.2 Pong

To test the proposed model with an unseen game, Pong is used. First, the original Pong image, shown in Figure 3.19, is downscaled (Figure 3.20).

Then, the different algorithms are executed and the different parts are found and trained. First, the environment (Figure 3.21) and the avatar (Figure 3.22) are correctly found. With the avatar, the effects of the actions are trained in a neural network, as in Section 3.1.1.2.


Figure 3.19. Original Pong image

Figure 3.20. Downscaled black and white Pong image

Figure 3.21. Environment found for Pong

Figure 3.22. Only avatar of Pong

The error graph in Figure 3.23 shows that the avatar positions were correctly learned. Afterwards, the ball movement is learned; training error results are shown in the error plot in Figure 3.24. All this shows that the algorithms to find the Environment, the Avatar and the Dynamic Objects work well for Pong. Additionally, the training process for the avatar and the objects shows errors of only a few pixels, although the viewport error is bigger than in Breakout, which had no distance error larger than ten pixels.


Figure 3.23. Avatar learning distance error vs iteration (Pong)

Figure 3.24. Viewport learning error percentage vs iteration (Pong)

Figure 3.25. Current reward (red) and Predicted reward (green) vs game iteration.

When training the reward function, a behavior different from the one observed when training Breakout was perceived. This can be explained by the negative rewards, which were not present in Breakout. The trained network has good accuracy on negative rewards, but many false positives on positive rewards.

Finally, the agent plays Pong and makes accurate predictions of the ball movement, but has two problems. First, it is not able to predict the enemy movement, so it assumes most of its movements lead to a reward when the ball is moving towards the left. Second, it is not pressured to bounce the ball, since the negative rewards for missing a ball are not predicted correctly. As a result, the agent is not able to play Pong.

3.4 Other agents

In order to compare the results of the proposed agents, two baseline solutions were developed: one by creating an agent that executes actions at random, and another by using a neuro-evolution approach.


Figure 3.26. Neuro-evolved agent action decision structure.

3.4.1 Random Agent

The agent randomly selects among the possible actions and sends them to ALE. This agent obtains an average of 0.7 points per episode in Breakout and -20.7 in Pong.

3.4.2 Test with Neuroevolution

A second baseline was developed by using a neuroevolution approach. An agent was structured with a neural network using the game images as input and actions as output. Then, the network is trained using a genetic algorithm where the reward works as the fitness function [66]. The architecture of the agent is described in Figure 3.26. Interestingly, this agent was able to beat some of the agents in the literature. The agents were able to get 9 and -13 points in Breakout and Pong respectively.

3.5 Results

Table 3.2 shows the results of the proposed model, compared with some results found in the literature. These results show a good performance of the proposed agent for Breakout, where it is better than most agents, except DQN. On the other hand, the proposed model shows a low performance in Pong, which should improve when the object and reward functions are improved. Regarding the differences between variants of the proposed model, the viewport approach clearly performs better than the whole-frame approach.


Agent                   Breakout   Pong
Viewport-ANN Model      12         -21
HTM-Model*              2          –
Neuro-Ev Br,Po          9.7        -19
Neuro-Ev Po,Br          7.7        -13
Random                  0.7        -20
Human                   864 [42]   21
Random [32]             0.8        -20.7
Human [32]              825        –
HNEAT-Pixel [32]        4          -16
LSH [6]                 2.5        -19.9
DQN [54]                168        20

Table 3.2. Game reward comparison of general game playing agents


Conclusions

Learning to play video games in a general way is a difficult task in which few approaches have shown good results. To accomplish this task, model-free reinforcement learning approaches, like Q-learning or SARSA, are commonly used; but since they are not able to transfer knowledge among games, it is difficult for these approaches to show good results. According to the results obtained here, it is possible to use a model-based agent for general game learning. The use of a game model allows the agent to learn the game (transition function) and the rewards separately, which should help in generalizing; for example, similar games with different objectives (rewards) would benefit from this approach. Additionally, a model-based agent allows the use of planning algorithms to select actions, which have shown the best performance among different agents³ [32]. The problem in using model-based learning is that training such a model is not an easy task; furthermore, the time required to choose actions with a planning algorithm increases exponentially with the distance between actions and rewards.

Learning a transition function directly from pixels seems infeasible. Several approaches were tested in this thesis and none showed good results. First, a neural network was used as a function approximator to predict the game states, but the neural network only learned the most commonly appearing pixels, ignoring other pixels like those of the ball. In contrast, an HTM-CLA can learn the temporal relations of the ball movements, including simple objects in its predictions; but its predictions still have errors. Moreover, it was clear that action information is essential for learning the game model and for its subsequent use by a planning algorithm. So, the action information was added to an HTM-CLA without a difference in the results; it was not able to learn from the actions. From this, it is concluded that an HTM-CLA cannot correctly use action information; hence it is not useful for learning a transition function. Other techniques could be applied to use direct pixel images, but they were outside the scope of this thesis.

Extracting information beyond raw pixels is another viable approach; although it uses higher-level knowledge, this knowledge can be extracted automatically without losing generality. According to the work carried out in this thesis, every video game can be divided into three parts: Environment, Avatar and Dynamic Objects. Splitting the games into those parts allows each part to be learned individually. Since most games draw a static environment that has low impact on the game, the environment can be removed to focus on higher impact objects. To find the Environment, a simple analysis of constantly active pixels is enough for most Atari games; other techniques could be used for this task, and could be required for other kinds of (modern) games. The use of constantly active pixels as environment was enough for the games tested.

³ These planning algorithms use perfect predictions from an emulator.


Furthermore, a distinction between environment and background might be necessary for other games.

Using the movements caused by the actions proved to be enough to identify the Avatar in Atari games, so such an algorithm was proposed. This algorithm should work with other, untested games, provided that in those games the taken actions move the avatar, rather than shooting or having other non-positional effects. For games with those last kinds of actions, changes to the algorithm might be required. Also, the proposed algorithm could be improved to better consider the different pixel shapes the avatar can take.

Using the coordinates in a neural network, used as a function approximator, efficiently learns the relation between avatar and actions. This can be improved by using differences in positions. That means that, instead of learning that action a in position p leads to position p′, an agent could learn that action a changes the position by (+x, +y) pixels. This approach might allow better generalization and knowledge transfer among games.

The concept of attention focus, used as viewports, is an effective way of learning object movement. The viewport concept showed improvements in learning how objects move, either with neural networks or with an HTM-CLA, but viewports have two main problems. The first problem is with collisions, where viewports showed low performance with the tested models. The second problem is with learning dynamic objects that do not follow a clear movement pattern, i.e. they react to other objects or are controlled by other agents. Consequently, viewports were useful for predicting an object's movement, but could benefit from additional techniques that involve the information of all objects.

With regard to the reward function, only the surface was scratched. It can be improved by using avatar information and the information of all objects, such as their coordinates. Eligibility traces proved useful, although they cannot be used with a viewport approach. Further research is required to find a good model for learning the reward function, which in Atari games can depend on various factors, although it is usually related to an object's position or to collisions between objects.
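For reference, the accumulating eligibility-trace update that served as a starting point can be written as a standard TD(λ) step over abstract state features (for example, discretised object positions); the step size, discount and decay values below are illustrative.

```python
from collections import defaultdict

def td_lambda_update(values, traces, state, next_state, reward,
                     alpha=0.1, gamma=0.99, lam=0.8):
    """One TD(lambda) step with accumulating eligibility traces."""
    delta = reward + gamma * values[next_state] - values[state]
    traces[state] += 1.0                      # accumulate on the visited state
    for s in list(traces):
        values[s] += alpha * delta * traces[s]
        traces[s] *= gamma * lam              # decay every trace
    return values, traces

values = defaultdict(float)   # estimated value per abstract state
traces = defaultdict(float)   # eligibility trace per abstract state
```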

According to the results, the HTM-CLA used is able to learn the ball's movement efficiently, but it does not scale well when multiple patterns must be learned, becoming less reliable as the number of patterns increases. Moreover, the number of parameters that the HTM-CLA requires makes it difficult to use across games, possibly needing an adjustment for each game. Tests with those parameters showed that the learningRadius has a high impact on the accuracy of object movement predictions: the learning radius should contain the object while it is moving, so that predictions can be made from previous object positions. Likewise, the activationThreshold should be similar to, but smaller than, the size of the objects to predict.
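The two observations about the parameters can be condensed into a rule of thumb, summarised below; the concrete numbers are purely illustrative (they assume an object of roughly 4 × 2 active pixels that moves at most 6 pixels per frame), and only the two parameter names discussed above are taken from the CLA configuration.

```python
# Rule of thumb observed in the experiments (values are illustrative):
#  - learningRadius must be large enough to contain the object across one
#    step of its movement, so earlier positions can predict the next one.
#  - activationThreshold should be close to, but smaller than, the number
#    of active pixels of the object to be predicted.
object_active_pixels = 4 * 2     # assumed ball/paddle footprint
max_step = 6                     # assumed fastest per-frame movement

cla_params = {
    "learningRadius": max_step + 2,                   # covers the movement
    "activationThreshold": object_active_pixels - 2,  # slightly below object size
}
```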

The HTM-CLA also depends on randomness: two HTM-CLAs trained with the same set of examples might produce different results, and convergence is not always ensured. Hence, when trying to improve the accuracy of an HTM-CLA, training a new one might work better than retraining the old one. In general, in the scenarios where the HTM-CLA was tested, it rapidly learned the movement pattern of a single object but then got stuck in a local minimum. All these facts lead to the conclusion that the HTM-CLA, as it is currently conceived, is not useful as a feature extractor for general game playing; as a predicting system it might work, but its learning capacity is limited by the number of different patterns it can learn.


During the training of the feed-forward neural networks used, some useful observations about learning were made. First, a tanh activation function trains much faster than a logistic function and with fewer errors. Also, sets of non-repeating examples worked better than bigger sets with repetitions; in particular, they were better than training with all available examples. Another observed behavior is that retraining a network after gathering additional examples might decrease its accuracy. Some of these observations are supported by the literature [47].
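The first two observations are easy to restate in code: with inputs roughly centred on zero, tanh gives zero-centred activations and a larger gradient at the origin than the logistic function, which is one reason the tanh networks converged faster (cf. [47]); and building the training set from unique examples is a one-line filter. The helpers below are an illustrative sketch, not the training code of the experiments.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))   # gradient at 0 is 0.25

def tanh(x):
    return np.tanh(x)                  # zero-centred; gradient at 0 is 1.0

def unique_examples(examples):
    """Drop repeated (input, target) pairs; smaller non-repeating sets
    worked better here than larger sets with repetitions."""
    seen, out = set(), []
    for x, t in examples:
        key = (tuple(x), tuple(t))
        if key not in seen:
            seen.add(key)
            out.append((x, t))
    return out
```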

ALE proved to be a good framework for testing and comparing AGI agents, although its use of named pipes hinders the ability to execute multiple tests in parallel. Another inconvenience is that each game has to be pre-processed so that ALE can return its reward, which is a problem when additional reward information is needed, such as the remaining lives in a game. That number of lives can be used by humans to play better; in Breakout, for example, it indicates which scenarios are negative for the agent. In short, ALE is a useful framework with increasing usage in the literature, which, in time, could be extended to other video-game consoles.
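For context, a minimal agent loop with ALE's Python bindings looks roughly like the sketch below (assuming the ale_python_interface package; the ROM path and the random action choice are placeholders). Note that the reward returned by act() is only the pre-processed score difference, while extra signals such as the remaining lives must be queried separately.

```python
from ale_python_interface import ALEInterface
import random

ale = ALEInterface()
ale.setInt(b"random_seed", 42)
ale.loadROM(b"breakout.bin")          # placeholder ROM path

actions = ale.getMinimalActionSet()
total_reward = 0
while not ale.game_over():
    a = random.choice(actions)        # a real agent would plan here
    total_reward += ale.act(a)        # pre-processed score difference
    lives = ale.lives()               # extra signal not part of the reward
    screen = ale.getScreenRGB()       # raw pixels for the learning model
ale.reset_game()
```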

The proposed model can be used in most Atari games and could even work in other video games. In general, it works on any game that can be split into Environment, Avatar and Dynamic Objects; depending on the complexity of those parts, the proposed algorithms or new ones can be used. A model-based approach requires the synergistic interaction of its elements (transition function, reward function and planning phase), yet it can shine at transferring knowledge across games.


Future Work

Even though the proposed model has many characteristics that make it partially functional, the results obtained are not conclusive, so the model leaves several interesting directions for future work. Some of the possible improvements are the following:

• Testing the model with other games.

• Improving object finding, following and learning, for example with deep learning, PCA or auto-encoders.

• Testing other tools for viewport predictions, such as temporal neural networks.

• Using a better planning algorithm, like Monte Carlo.

• Creating an automatic classification of different object interactions and movements, which could be used to learn a predictor for each category.

• Improving the reward function by using avatar and object information.

• Testing the model without downscaling, which would not be as computationally expensive if only viewports are used.

• Setting up a test framework based on ALE that focuses on knowledge transfer, and testing whether knowledge transfer gives an advantage to AGI agents.

• Designing a complete set of tests in order to definitively establish the limits and capacities of the HTM-CLA in this context.

• Testing a possibly better planning algorithm, inspired by Monte Carlo methods, that uses information about the best strategy found so far, thus increasing the probability of searching the most rewarding paths.

• Developing a planning algorithm that addresses the exploration vs. exploitation problem by using backtracking and a notion of boredom to trigger the backtrack. For example, the algorithm could keep following (exploiting) good rewards until those rewards fail to increase, and then backtrack to test better action plans; it could work at the level of actions or perhaps at higher levels (using game concepts), as humans do. A sketch of this idea is given below.
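To make the last idea concrete, a purely hypothetical sketch of such a planner follows: it keeps extending the current action plan while the predicted reward keeps improving, and a growing boredom counter, fed by steps without real progress, eventually forces a backtrack. The plan representation, the simulate function and the boredom limit are assumptions about future work, not something implemented in this thesis.

```python
def bored_planner(simulate, actions, start_plan, boredom_limit=20):
    """Hypothetical exploit-until-bored planning loop (future-work sketch).

    simulate(plan) -> total predicted reward for executing `plan` from the
    current state, e.g. by rolling the learned model forward.
    """
    plan = list(start_plan)
    reward = simulate(plan)
    best_plan, best_reward = list(plan), reward   # best plan seen so far
    boredom = 0
    while boredom < boredom_limit:
        # Greedy one-step extension: exploit while the reward keeps growing.
        r, a = max(((simulate(plan + [act]), act) for act in actions),
                   key=lambda t: t[0])
        if r > reward:
            plan.append(a)
            reward = r
            if r > best_reward:
                best_plan, best_reward = list(plan), r
                boredom = 0                       # real progress: not bored
            else:
                boredom += 1                      # re-treading known ground
        else:
            boredom += 1                          # no improvement: getting bored
            if plan:
                plan.pop()                        # backtrack one action
                reward = simulate(plan)
    return best_plan, best_reward
```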


Bibliography

[1] Sam S. Adams, Itamar Arel, Joscha Bach, Robert Coop, Rod Furlan, Ben Goertzel, J. Storrs Hall, Alexei Samsonovich, Matthias Scheutz, Matthew Schlesinger, Stuart C. Shapiro, and John Sowa, Mapping the Landscape of Human-Level Artificial General Intelligence, AI Magazine 33 (2012), 25.

[2] Itamar Arel, Derek Rose, and Robert Coop, DeSTIN: A Scalable Deep Learning Architecture with Application to High-Dimensional Robust Pattern Recognition, AAAI, 2009.

[3] Itamar Arel, Derek C. Rose, and Thomas P. Karnowski, Deep Machine Learning — A New Frontier in Artificial Intelligence Research, IEEE Computational Intelligence Magazine (2010), no. November, 13–18.

[4] Eric Baum, What Is Thought?, MIT Press, 2003.

[5] Marc G. Bellemare, Joel Veness, and Michael Bowling, Investigating Contingency Awareness Using Atari 2600 Games, The Twenty-Sixth AAAI Conference on Artificial Intelligence (2012), 864–871.

[6] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling, The Arcade Learning Environment: An evaluation platform for general agents, Journal of Artificial Intelligence Research 47 (2013), 253–279.

[7] Bruce A. Bobier and Michael Wirth, Content-based image retrieval using hierarchical temporal memory, MM'08 - Proceedings of the 2008 ACM International Conference on Multimedia, with co-located Symposium and Workshops (New York, New York, USA), ACM Press, oct 2008, pp. 925–928.

[8] Bradford W. Mott, Stephen Anthony, and the Stella Team, Stella.

[9] Alex Braylan, Mark Hollenbeck, Elliot Meyerson, and Risto Miikkulainen, Frame Skip Is a Powerful Parameter for Learning to Play Atari, Learning for General Competency in Video Games: Papers from the 2015 AAAI Workshop, 2015, pp. 10–11.

[10] Marek Bundzel, Object Identification in Dynamic Images Based on the Memory-Prediction Theory of Brain Function, Journal of Intelligent Learning Systems and Applications 2 (2010), no. 4, 212–220.

[11] Michael Buro and David Churchill, Real-time strategy game competitions, AI Magazine (2012), 106–108.


[12] Murray Campbell, A. Joseph Hoane, and Feng-hsiung Hsu, Deep Blue, Artificial Intelligence 134 (2002), no. 1-2, 57–83.

[13] Hansol Choi, Jun-Cheol Park, Jae Hyun Lim, Jae Young Jun, and Dae-Shik Kim, Reward hierarchical temporal memory: Model for Memorizing and Computing Reward Prediction Error by Neocortex, Neural Networks (IJCNN), The 2012 International Joint Conference on (Brisbane, Australia), IEEE, jun 2012, pp. 1–7.

[14] Swadesh Choudhary, Steven Sloan, Sam Fok, and Alexander Neckar, Silicon Neurons that Compute, International Conference on Artificial Neural Networks, Springer, Heidelberg, 2012, pp. 2–9.

[15] Joost Van Doremalen, Hierarchical Temporal Memory Networks for Spoken Digit Recognition, Ph.D. thesis, Radboud University Nijmegen, 2007, p. 71.

[16] C. Duka, Philosophy of Education' 2006 Ed., Rex Bookstore, Inc.

[17] J. Duncan, Selective attention and the organization of visual information, Journal of Experimental Psychology: General 113 (1984), no. 4, 501–517.

[18] N. Farahmand, M. H. Dezfoulian, H. GhiasiRad, A. Mokhtari, and A. Nouri, Online temporal pattern learning, 2009 International Joint Conference on Neural Networks, IEEE, jun 2009, pp. 797–802.

[19] Rafael Coimbra Pinto, A Neocortex Inspired Hierarchical Spatio-Temporal Pattern Recognition System, Ph.D. thesis, Universidade Federal do Rio Grande do Sul, 2009.

[20] Hilmar Finnsson, CADIA-Player: A general game playing agent, M.Sc. thesis, Reykjavik University, 2007.

[21] Hilmar Finnsson and Yngvi Bjornsson, Simulation-based approach to general game playing, The Twenty-Third AAAI Conference on Artificial Intelligence, 2008, pp. 259–264.

[22] Saulius J. Garalevicius, Memory-Prediction Framework for Pattern Recognition: Performance and Suitability of the Bayesian Model of Visual Cortex, FLAIRS Conference, Florida, 2007, pp. 92–97.

[23] Michael Genesereth, Nathaniel Love, and Barney Pell, General game playing: Overview of the AAAI competition, AI Magazine 26 (2005), no. 2, 62.

[24] D. George and J. Hawkins, A hierarchical bayesian model of invariant pattern recognition in the visual cortex, Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005. 3 (2005), 1812–1817.

[25] Dileep George, How the brain might work: A hierarchical and temporal model for learning and recognition, Ph.D. thesis, Stanford University, 2008, p. 117.

[26] Dileep George and Jeff Hawkins, Towards a mathematical theory of cortical microcircuits, PLoS Computational Biology 5 (2009), no. 10, e1000532.


[27] Ben Goertzel, Hugo De Garis, Cassio Pennachin, and Nil Geisweiller, OpenCogBot: Achieving Generally Intelligent Virtual Agent Control and Humanoid Robotics via Cognitive Synergy, ICAI International Conference on Artificial Intelligence, 2010, pp. 1–12.

[28] Ben Goertzel, Ruiting Lian, Itamar Arel, Hugo de Garis, and Shuo Chen, A world survey of artificial brain projects, Part II: Biologically inspired cognitive architectures, Neurocomputing 74 (2010), no. 1-3, 30–49.

[29] Ben Goertzel and Cassio Pennachin, Artificial General Intelligence, vol. 6830, Springer, 2007.

[30] Josh Hartung, Jay McCormack, and Frank Jacobus, Support for the Use of Hierarchical Temporal Memory Systems in Automated Design Evaluation: A First Experiment, Volume 8: 14th Design for Manufacturing and the Life Cycle Conference; 6th Symposium on International Design and Design Education; 21st International Conference on Design Theory and Methodology, Parts A and B, vol. 8, ASME, 2009, pp. 853–862.

[31] Matthew Hausknecht, Piyush Khandelwal, Risto Miikkulainen, and Peter Stone, HyperNEAT-GGP: A HyperNEAT-based Atari General Game Player, Proceedings of the Fourteenth International Conference on Genetic and Evolutionary Computation, 2012, pp. 217–224.

[32] Matthew Hausknecht, Joel Lehman, Risto Miikkulainen, and Peter Stone, A neuroevolution approach to general Atari game playing, IEEE Transactions on Computational Intelligence and AI in Games 6 (2014), no. 4, 355–366.

[33] Jeff Hawkins, Subutai Ahmad, and Donna Dubinsky, HTM Cortical Learning Algorithms, Tech. report, Numenta, Inc., 2011.

[34] Jeff Hawkins and Sandra Blakeslee, On Intelligence, St. Martin's Press, 2005.

[35] Jeff Hawkins and Dileep George, Hierarchical temporal memory: Concepts, theory and terminology, Tech. report, Numenta Inc., 2006.

[36] Jeff Hawkins, Dileep George, and Jamie Niemasik, Sequence memory for prediction, inference and behaviour, Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences 364 (2009), no. 1521, 1203–1209.

[37] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh, A fast learning algorithm for deep belief nets, Neural Computation 18 (2006), no. 7, 1527–1554.

[38] Marcus Hutter, Universal Artificial Intelligence, Machine Learning 1 (2005), no. 2, 1–82.

[39] Marcus Hutter, Universal Algorithmic Intelligence: A Mathematical Top→Down Approach, Artificial General Intelligence (B. Goertzel and C. Pennachin, eds.), Cognitive Technologies, Springer, Berlin, 2007.

[40] Marcus Hutter, Can Intelligence Explode?, Journal of Consciousness Studies 19 (2012), no. 1-2, 143–146.


[41] Giacomo Indiveri, Bernabe Linares-Barranco, Tara Julia Hamilton, Andre van Schaik, Ralph Etienne-Cummings, Tobi Delbruck, Shih-Chii Liu, Piotr Dudek, Philipp Hafliger, Sylvie Renaud, Johannes Schemmel, Gert Cauwenberghs, John Arthur, Kai Hynna, Fopefolu Folowosele, Sylvain Saighi, Teresa Serrano-Gotarredona, Jayawan Wijekoon, Yingxue Wang, and Kwabena Boahen, Neuromorphic silicon neuron circuits, Frontiers in Neuroscience 5 (2011), 73.

[42] JVGS, Atari 2600 High Scores.

[43] Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore, Reinforcement Learning: A Survey, Journal of Artificial Intelligence Research 4 (1996), 237–285.

[44] I. Kostavelis and A. Gasteratos, On the optimization of Hierarchical Temporal Memory, Pattern Recognition Letters 33 (2012), no. 5, 670–676.

[45] J. Laserson, From Neural Networks to Deep Learning: zeroing in on the human brain, XRDS: Crossroads, The ACM Magazine for Students - Neuroscience and Computing: Technology on the Brain 18 (2011), no. 1, 29–34.

[46] Quoc V. Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeff Dean, and Andrew Y. Ng, Building high-level features using large scale unsupervised learning, International Conference on Machine Learning, vol. 28, 2013, pp. 8595–8598.

[47] Yann A. LeCun, Leon Bottou, Genevieve B. Orr, and Klaus-Robert Muller, Efficient BackProp, Lecture Notes in Computer Science 7700 (2012), 9–48.

[48] Moshe Looks, Ben Goertzel, and Cassio Pennachin, Novamente: An integrative architecture for general intelligence, AAAI Fall Symposium, Achieving Human-Level Intelligence, 2004.

[49] L. L. Majure, Unsupervised Phoneme Acquisition Using Hierarchical Temporal Models, Ph.D. thesis, University of Illinois at Urbana-Champaign, 2009, p. 33.

[50] J. B. Maxwell, Philippe Pasquier, and Arne Eigenfeldt, Hierarchical sequential memory for music: A cognitive model, ISMIR'09, 2009, pp. 429–434.

[51] Jean Mehat and Tristan Cazenave, Ary, a general game playing program, Board Games Studies Colloquium, 2010.

[52] Wim J. C. Melis and Michitaka Kameyama, A Study of the Different Uses of Colour Channels for Traffic Sign Recognition on Hierarchical Temporal Memory, 2009 Fourth International Conference on Innovative Computing, Information and Control (ICICIC), IEEE, dec 2009, pp. 111–114.

[53] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis, Human-level control through deep reinforcement learning, Nature 518 (2015), no. 7540, 529–533.


[54] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller, Playing Atari with Deep Reinforcement Learning, arXiv preprint arXiv: ... (2013), 1–9.

[55] Shiwali Mohan and John E. Laird, Learning to play Mario, Center for Cognitive Architecture, University of Michigan, Tech. Rep. CCA-TR-2009-03 (2009).

[56] Yavar Naddaf, Game-independent AI agents for playing Atari 2600 console games, Master's thesis, 2010.

[57] Kailash Nadh and Christian R. Huyck, A Pong playing agent modelled with massively overlapping cell assemblies, Neurocomputing 73 (2010), no. 16-18, 2928–2934.

[58] Numenta, Hierarchical Temporal Memory - Comparison with Existing Models, Tech. report, Numenta Inc., 2007.

[59] Numenta, Getting Started With NuPIC, 2008, pp. 1–92.

[60] Numenta Inc, Problems that Fit HTM, Tech. report, Numenta Inc., 2007.

[61] Numenta Inc, Introduction to the CLA algorithm, 2013, pp. 1689–1699.

[62] Bruno A. Olshausen and David J. Field, Emergence of simple-cell receptive field properties by learning a sparse code for natural images, Nature 381 (1996), no. 6583, 607–609.

[63] Barney Pell, METAGAME: A new challenge for games and learning, Heuristic Programming in Artificial Intelligence 3 (1992), 237–251.

[64] A. J. Perea, J. E. Merono, and M. J. Aguilera, Application of Numenta (R) Hierarchical Temporal Memory for land-use classification, South African Journal of Science 105 (2009), no. 9-10, 370–375.

[65] Marc Ponsen and Pieter Spronck, Improving Adaptive Game AI with Evolutionary Learning, Proceedings of the Computer Games: Artificial Intelligence, Design and Education Conference (2004), 389–396.

[66] Leonardo Quinonez and Jonatan Gomez, General videogame learning with neural-evolution, 2014 9th Computing Colombian Conference (9CCC), IEEE, sep 2014, pp. 207–212.

[67] David Rawlinson and Gideon Kowadlo, Generating adaptive behaviour within a memory-prediction framework, PLoS ONE 7 (2012), no. 1, e29264.

[68] David Robles and S. M. Lucas, A simple tree search method for playing Ms. Pac-Man, Computational Intelligence and Games, 2009. CIG 2009 (2009), 249–255.

[69] Stuart Jonathan Russell, Peter Norvig, John F. Canny, Jitendra M. Malik, and Douglas D. Edwards, Artificial intelligence: a modern approach, vol. 74, Prentice Hall, Englewood Cliffs, 1995.

[70] A. D. Schwartz and R. L. Jones, On the pattern classification of structured data using the neocortex-inspired memory-prediction framework, Ph.D. thesis, University of Southern Denmark, 2009.


[71] K.-H. Seok and Y. S. Kim, A new robot motion authoring method using HTM, 2008 International Conference on Control, Automation and Systems, ICCAS 2008, IEEE, oct 2008, pp. 2058–2061.

[72] Jason Sherwin and Dimitri Mavris, Hierarchical Temporal Memory algorithms for understanding asymmetric warfare, 2009 IEEE Aerospace Conference, IEEE, mar 2009, pp. 1–10.

[73] Svorad Stolc and Ivan Bajla, On the Optimum Architecture of the Biologically Inspired Hierarchical Temporal Memory Model Applied to the Hand-Written Digit Recognition, Measurement Science Review 10 (2010), no. 2, 28–49.

[74] Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.

[75] Michael Thielscher, General game playing in AI research and education, KI 2011: Advances in Artificial Intelligence, Springer, 2011, pp. 26–37.

[76] R. J. A. van Gastel, Evaluation and mapping of hierarchical-temporal memory networks on an efficient platform, (2011), no. January, 1–13.

[77] Pei Wang, Rigid Flexibility, Springer, 2006.

[78] S. Zhang, M. H. Ang Jr., W. Xiao, and C. K. Tham, Detection of activities by wireless sensors for daily life surveillance: Eating and drinking, e-health Networking, Applications and Services, 2008. HealthCom 2008. 10th International Conference on, vol. 9, 2009, pp. 1499–1517.