
A Bayesian model for canonical circuits in the neocortex for parallelized and incremental learning of symbol representations

Martin Dimkovski, Aijun An
York University, 4700 Keele St., Toronto, ON, Canada, M3J 1P3

Article info

Article history:
Received 28 October 2013
Received in revised form 28 June 2014
Accepted 2 September 2014
Communicated by A. Belatreche
Available online 16 September 2014

Keywords:
Neocortical canonical circuits
Bayesian brain
Symbolic abstraction
Incremental Metropolis-Hastings
Data stream learning

Abstract

We present a Bayesian model for parallelized canonical circuits in the neocortex, which can partition a cognitive context into orthogonal symbol representations. The model is capable of learning from infinite sensory streams, updating itself with every new instance and without having to keep instances older than the last seen instance per symbol. The inherently incremental and parallel qualities of the model allow it to scale to any number of symbols as they appear in the sensory stream, and to transparently follow non-stationary distributions for existing symbols. These qualities are made possible in part by a novel Bayesian inference method, which can run Metropolis-Hastings incrementally on a data stream, and significantly outperforms particle filters in a Bayesian neural network application.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

We present a model for a hypothetical functional unit of the neocortex and its relationship with proximal peers that share the same cognitive context. Our model is not one of the neocortex at large, which would require a network of cognitive contexts, but we hope it offers a building block for such an objective. Our approach is to first identify key computational aspects of the neocortex, and then build the model upon those assumptions. We demonstrate how the resulting model can orthogonalize a cognitive context, developing representations for the cognitive symbols. The model is evaluated on popular machine learning datasets.

Our assumptions are detailed in Section 2. The principal assumption is the existence of an elementary functional unit in the neocortex, identified as a canonical circuit. Next, we assume that each canonical circuit develops to represent a particular cognitive symbol, by learning towards data associated with its symbol, and by learning away from the data of all other symbols. Another assumption is that each canonical circuit must execute concurrently with all other canonical circuits, in complete task parallelism. Next, we assume that neocortical computation is analogous to Bayesian inference, and we approach this aspect through Marr's three levels of analysis. Lastly, we assume that canonical circuits must operate inherently incrementally, learning from only a few examples at a time from infinite data streams, without having to store old data or use multiple epochs.

There is existing work in neocortical computational modeling which covers the previously listed assumptions to a certain extent, either individually or in subsets. However, we are not aware of any work which covers all the assumptions jointly. One of our contributions is that we identify the state of each aspect in current neuroscience, propose correlations between them, and propose a model that puts them all in a common framework.

In our model, each cognitive symbol is represented by a canonical circuit in the form of an independent Bayesian neural network. Each of these neural networks updates with its own Bayesian inference process. The inference processes of all cognitive symbols are coupled in an inhibitory way, so that each pursues uniqueness and the overall result is orthogonalization of the cognitive context which describes the data stream. The model starts blank and adds a canonical circuit for each new symbol as it shows up in the stream. For example, a cognitive context could be “direction of motion,” with its symbols being “up”, “down”, “right”, etc. Fig. 1 shows a simplified visualization of how canonical circuits orthogonalize a cognitive context. Due to the task parallelism, regardless of how many canonical circuits become involved, the model runs in constant time, and the canonical circuits can be distributed to different processors or machines across the network.

In order to meet the requirements for incremental learning, we developed a novel Bayesian inference method that runs Metropolis-Hastings (MH) on a data stream. The method makes it possible for a single data instance to be sufficient in forming a useful representation of a symbol, and for each symbol to update with each new data instance efficiently. No data instances before the last seen one per symbol have to be kept for subsequent updates. We discuss how it is possible, in an optimal model, for not even the latest instance to be required. We call our method Incremental Metropolis-Hastings (IMH). IMH recurrently re-uses the last posterior as a new prior. Priors and posteriors are represented as non-parametric probability distributions, utilized through Monte Carlo or kernel density estimators. Therefore, the inference does not suffer from limitations of point-based approximation such as Maximum-a-Posteriori.

From a purely computational perspective, we contribute a Bayesian classification model which is capable of supervised learning of an unlimited number of symbols (classes) from an infinite data stream, and which has simple parameterization. Because of its incremental learning qualities, the model is unique in handling concept drift transparently, i.e. it inherently supports non-stationary class distributions. We show that it matches the performance of state-of-the-art incremental learning methods. IMH is also a computational contribution in itself, because we show how it vastly outperforms particle filters for incremental Bayesian inference, at least with a neural network model.

In the following section we present the background for our model, structured as a literature review of the principal assumptions, each of them identified as a subsection. We attempt to relate them by expressing them in a shared terminology, which allows for a unified perspective upon which we build the model. In the third section we describe our model in detail. The fourth section reviews existing and related models. In the fifth section we present the evaluation results, after which we finish with a Conclusions section.

2. Background

2.1. Canonical circuits in the neocortex

The idea of elementary circuits as functional modules in the neocortex was hypothesized as early as 1938 [1], though it remains an open question [2]. A prominent hypothesis of this type is the columnar view of the neocortex, based on functional identification of neural circuits perpendicular to the pial surface [3–5], as well as a repeating template of neural distribution and connectivity found in such circuits [6–8]. In the columnar hypothesis, the smallest circuit is called a mini-column, and proximally connected mini-columns form a column. Depending on the area of neocortex, a mini-column is linked to a specific representation such as line orientation, isofrequency tones, angles, or direction of motion. Inside a column, its mini-columns typically cover the full range of their representation type, such as the full 180° covered for orientation or direction of movement, or the full tone frequency spectrum [3]. Columns are proximally or distally interconnected with many peer columns across the neocortex.

Even though the direct and indirect legacy of the columnar hypothesis is undeniable [9], it hinges too much on anatomical modularity, which is debatable [2,10]. However, there is a general consensus that there is some functional differentiation between neocortical circuits [9]. da Costa and Martin [9] review the legacy of the columnar hypothesis, its criticism, and recent perspectives, and then propose that we focus on the most salient aspects of the idea by using the term canonical circuits instead of mini-columns, and not trying to constrain these circuits into rigid physical columns. Their canonical circuit "embodies the idea of a repeated local circuit that performs some fundamental computations that are common to all areas of the neocortex." In their view, the exact physical location and configuration of canonical circuits can change and even be overlapping [9].

2.2. Symbols and cognitive contexts

The canonical circuit has a multifaceted input [2,7]. One part of the input is from the environment, coming from subcortical structures such as the thalamus. Another part of the input comes from peer cortical circuits which could be anywhere in the neocortex. The environmental and the peer inputs come into the canonical circuit at different locations, and can therefore be seen as contextual to one another, with the canonical circuit acting as a context processing element. Therefore, we will refer to the combined input as cognitive context.

Certain facts put forward in the columnar hypothesis suggest that a column, and therefore all its mini-columns, share a similar set of inputs [5]. The width of a column is also linked to the termination width of the sensory (thalamic) input [4]. In other words, any particular cognitive context could be seen as being processed by a group of canonical circuits. Each canonical circuit in the group represents a particular aspect of the cognitive context, for example, a particular angle in the cognitive context "line orientation." We will refer to any particular representation of a cognitive context as a cognitive symbol. Any such symbol has meaning only given its context. The term cognitive symbol has been defined previously, in a compatible perspective, as the internal categorical representation of an external physical or information entity [11]. Identifying the symbol representation mechanism in neural circuits is a critical challenge [12].

2.3. Requirements for parallelism

The assumption of canonical circuits implies that each circuit is distinct from others and can execute in parallel [13]. Functional magnetic resonance imaging during various behaviors shows that multiple distal areas of the neocortex appear active at the same time. Therefore, it is unlikely that canonical circuits involved in a particular behavior have close correlations of their internal states. We assume that the learning method inside one canonical circuit must be able to execute without knowing the states in other canonical circuits, i.e. in complete task parallelism. In other words, each canonical circuit should be able to run on a separate machine.

(Figure 1: two plates, "Development of a Novel Cognitive Symbol" and "Ongoing Orthogonalization", for the example cognitive context "direction of motion" with symbols "up", "down", "right" and circuits ccA, ccB, ccC.)

Fig. 1. Orthogonalization example of a cognitive context. The left plate shows the development of a canonical circuit ccA for the first symbol that shows up in the streaming sensory input. The other canonical circuits are not initialized yet. The right plate shows the state after each canonical circuit has associated with a symbol. From here on, they all continually specialize for their symbols while opposing each other.


2.4. Why Bayesian and how?

Bayesian probabilistic inference is becoming a leading candidate for explaining the operation of the brain [14,7,12,15]. Bayesian probability is an evidential probabilistic interpretation, where probability is an abstract concept describing a state of knowledge, as opposed to the view of probability as a frequency [16,17]. Given new relevant data and given a probability model for that data (called the likelihood) in terms of some parameters of interest θ, Bayesian inference provides a way to update an assumed probability of the parameters p(θ) (called the prior before the update, and the posterior after the update) using a simple formula:

p(\theta \mid \mathrm{data}) = \frac{p(\theta)\, p(\mathrm{data} \mid \theta)}{p(\mathrm{data})} \qquad (1)

The challenge in Bayesian inference comes from how the probability models and distributions are framed, learned, and utilized, and how the inference equation (1) is executed. Solutions from Bayesian inference could be either models of the whole posterior distribution, or point-based approximations such as Maximum-a-Posteriori (MAP). Using the whole distribution as a solution is much more informative because it describes the probability of the parameters at any possible value. This is important when there is uncertainty, as is usually the case, because the probability distribution can account for it, while a singular value is over-confident. This is especially relevant when the goal is to learn from few examples.
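As an illustration of Eq. (1) and of the difference between a point estimate and the whole posterior, the following minimal Python sketch (ours, not from the paper) updates a flat prior on a discretized grid with a Bernoulli likelihood; the grid, counts, and summaries are purely illustrative.

import numpy as np

theta = np.linspace(0.0, 1.0, 101)            # candidate values of a single parameter
prior = np.ones_like(theta) / theta.size      # flat prior p(theta)
successes, trials = 3, 4                      # illustrative data
likelihood = theta**successes * (1.0 - theta)**(trials - successes)   # p(data | theta)
posterior = prior * likelihood
posterior /= posterior.sum()                  # p(data) acts only as a normalizer
theta_map = theta[np.argmax(posterior)]       # point-based summary (MAP)
theta_mean = np.sum(theta * posterior)        # the full posterior supports richer summaries
print(theta_map, theta_mean)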

Tenenbaum et al. [12] note how Bayesian inference quickly becomes a suspect when we consider how the brain learns and generalizes all too well from sparse, noisy, and ambiguous data: “If the mind goes beyond the data given, another source of information must make up the difference. Some more abstract background knowledge.” The authors note how in different disciplines that study the brain this abstract background knowledge is known under different names, such as constraints in psychology and linguistics, inductive bias in machine learning, and priors in statistics.

To consider the idea of a Bayesian brain systematically, we can look at it through the three levels of analysis in Marr's approach to understanding information processing systems [18]. The names of Marr's levels can be misleading; however, the underlying questions are intuitively distinguishable: the top level (called computational) considers what the system does and why, the middle level (called algorithmic or representational) considers what representations are used and how they are built and manipulated, and the bottom level (called physical) considers how the whole system is implemented physically.

The top Marr's level considers conscious action. Whether or not Bayesian inference applies at this level is debatable. Critics argue that many conscious processes are notoriously deviant from Bayesian expectations. However, there are arguments that this is only so for a single individual, where a particular behaviour is just one sample, and that when looked at collectively, at the population distribution level, behaviour is faithfully Bayesian [19]. There is work showing that the brain operates akin to sampling from the population behaviour distributions. For example, Sanborn et al. [20] show that the rational model of categorization in cognitive science works better with a Gibbs sampler and particle filters than with MAP.

The representational and physical Marr's levels can be described more as unconscious. Existing Bayesian brain theories focus on the representational level, where they attempt to explain how prior and posterior probability distributions are built and maintained on a conceptual level. We present key examples in Section 4.1. There are few attempts in the literature that attempt to explain the prior and posterior probability distributions on the physical level in terms of neuron and circuit specifics [21,22]. In general, this issue is considered unresolved [15,12]. “Uncovering their neural basis is arguably the greatest computational challenge in cognitive neuroscience…” [12].

2.5. Learning with few examples

We consider it unlikely that the brain archives multiple data instances in their original form for later use in an epoch-like processing. It is more likely that the streaming data, which the brain constantly experiences, is used only at the time of exposure, after which only their contribution remains in the form of adjustments made to neural networks. Under this assumption, a faithful neuromorphic algorithm has to learn with few examples at a time. This means the algorithm should start developing useful symbol representations after the first few related instances, and use few instances at each update stage.

The quality of learning with few instances has been noted as essential in cognitive science. Tenenbaum et al. [12] address this topic and discuss how the brain needs only a few examples of a symbol in order to form an understanding. They point out how children can learn to use a new word after just a few examples of it.

3. Proposed model

3.1. Overview and definitions

This work presents two independent but complementing contributions. The principal contribution is a model for incremental and Bayesian orthogonalization of a cognitive context, where each resulting cognitive symbol is adopted by a canonical circuit, and where all canonical circuits execute in parallel. The model describes how canonical circuits relate in a cognitive context, and how cognitive symbols develop while inhibiting each other. We refer to this model as Cognitive Context Orthogonalizing Networks (CCON).

In order for each canonical circuit to run, CCON needs an incremental Bayesian inference method over a non-linear, high-dimensional neural network. CCON also needs the ability to learn from an infinite data stream, without needing multiple epochs or knowing the whole dataset. The most fitting existing method is particle filters. However, as discussed later in Sections 4.2 and 5, CCON is too complex for particle filters. Therefore, we develop the Incremental Metropolis-Hastings (IMH) method. IMH was developed for the purposes of CCON, but it can also be used independently for any sequential Bayesian inference. IMH is the result of pursuing a more biologically plausible incremental inference, one which could be explained better on the physical Marr's level.

An overview of CCON is illustrated in Fig. 2. It is composed of K canonical circuits (denoted as cc_1, cc_2, …, cc_K), each specializing for a particular cognitive symbol of a cognitive context. A cognitive symbol is equivalent to a class in a classification problem. A canonical circuit cc_k, where k = 1, …, K, is a construct consisting of an IMH Bayesian process B_k and a multi-layer feed-forward neural network NN_k whose synapses are controlled by B_k. Each canonical circuit cc_k acts as a binary classifier that outputs the probability for the presence of its associated symbol in the current data stream instance.

The data stream from which the model needs to learn is potentially infinite and defined as S = ⟨e_1, e_2, …, e_∞⟩, where e_i denotes the labeled training example that arrives at the i-th time point. A training example e_i consists of x_i and y_i, which represent, respectively, the environmental and peer input parts defined earlier. Specifically, x_i is a vector of sensory attribute values, and this is the same for all cc_k. y_i is the supervision information coming from distal peers outside the cognitive context, suggesting which symbol is related to x_i. Because each cc_k is a binary classifier, y_i needs to be a binary class label. Therefore, any original multinomial supervision label is converted to multiple binary labels, one for each cc_k. For example, if there are three classes A, B, C and a training example e_i belongs to class B, then the class label y_i of e_i for cc_B will be 1, while y_i for cc_A and cc_C will be 0.
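To make the bookkeeping concrete, the following Python sketch (our illustration; the class and the helpers init_circuit and update_circuit are hypothetical names, not from the paper) shows one canonical circuit per symbol, created lazily when a symbol first appears, together with the one-vs-rest binary labels described above.

class CCON:
    """Illustrative container for the canonical circuits of one cognitive context."""
    def __init__(self, n_inputs, n_hidden=3):
        self.n_inputs, self.n_hidden = n_inputs, n_hidden
        self.circuits = {}                          # symbol label -> circuit state (theta samples, etc.)

    def observe(self, x, symbol):
        if symbol not in self.circuits:             # first time this symbol is seen: grow the model
            self.circuits[symbol] = init_circuit(x, self.n_hidden)   # hypothetical initializer
        for label, circuit in self.circuits.items():
            y = 1 if label == symbol else 0         # binary label for this circuit
            update_circuit(circuit, x, y)           # each circuit runs its own update; can be task-parallel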

The neural network NN_k in a canonical circuit cc_k is approximated with a multi-layer feed-forward neural network, fully connected between adjacent layers. The input layer is determined by the dimensionality of the input x, and the output layer has a single output neuron v_out. Calling upon the universal approximation theorem [23], we use a single hidden layer consisting of hidden neurons v_h, where h = 1, …, H, and the number of hidden neurons H is determined by the user. The parameters of NN_k are the weights from input to hidden neurons w^(k)_inp2h, the weights from hidden neurons to the output one w^(k)_h2out, the biases of the hidden neurons b^(k)_h, and the bias of the output neuron b^(k)_out. All these parameters are also jointly represented as θ^(k), i.e. θ^(k) = {w^(k)_inp2h, w^(k)_h2out, b^(k)_h, b^(k)_out}. The output neuron v_out uses a log-sigmoid activation function in order to contain the output within 0 and 1. All other neurons use a hyperbolic tangent sigmoid transfer function tanh.¹ Thus, given an instance with sensory input x, the output of NN_k is

\xi_k(x, \theta^{(k)}) = \frac{1}{1 + \exp\!\left[ -b^{(k)}_{out} - \sum_{h=1}^{H} w^{(k)}_{h2out}\, v_h(x, \theta^{(k)}) \right]} \qquad (2)

where v_h(x, θ^(k)), the output of hidden layer neuron h, is

v_h(x, \theta^{(k)}) = \tanh\!\left( b^{(k)}_h + \sum_{j=1}^{\dim(x)} w^{(k)}_{inp2h,j}\, x_j \right) \qquad (3)
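A direct Python transcription of Eqs. (2) and (3) is sketched below (our code; the variable names and example shapes are ours and purely illustrative):

import numpy as np

def nn_output(x, W_inp2h, b_h, w_h2out, b_out):
    """xi_k(x, theta^(k)): recognition level of one canonical circuit, in (0, 1)."""
    v_h = np.tanh(b_h + W_inp2h @ x)          # Eq. (3): hidden-layer activations
    z = b_out + w_h2out @ v_h                 # pre-activation of the output neuron
    return 1.0 / (1.0 + np.exp(-z))           # Eq. (2): log-sigmoid output

# example shapes: H hidden neurons, d sensory attributes (values are illustrative)
H, d = 3, 13
theta_k = dict(W_inp2h=np.zeros((H, d)), b_h=np.zeros(H), w_h2out=np.zeros(H), b_out=0.0)
print(nn_output(np.zeros(d), **theta_k))      # 0.5 for an all-zero configuration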

The following interpreted execution log provides an idea of how the model works on one of our evaluation datasets (Wine). Out of the first 11 instances, only a select few instructive ones are shown. We can see that CCON starts differentiating early and progressively better. By the 11th instance it is already working with high precision.

Instance 1:
Its symbol: A
First time symbol A is seen, therefore a new ccA is initialized
How all canonical circuits see this instance at start:
ccA: Match/Recognition: 0.99999
How all canonical circuits see this instance after learning:
ccA: Match/Recognition: 0.99999

Instance 3:
Its symbol: B
First time symbol B is seen, therefore a new ccB is initialized
How all canonical circuits see this instance at start:
ccA: Match/Recognition: 0.99999
ccB: Match/Recognition: 0.99999
How all canonical circuits see this instance after learning:
ccA: Match/Recognition: 0.99995
ccB: Match/Recognition: 0.99999

Instance 5:
Its symbol: C
First time symbol C is seen, therefore a new ccC is initialized
How all canonical circuits see this instance at start:
ccA: Match/Recognition: 0.99999
ccB: Match/Recognition: 0.97186
ccC: Match/Recognition: 0.99999
How all canonical circuits see this instance after learning:
ccA: Match/Recognition: 0.83387
ccB: Match/Recognition: 0.48339
ccC: Match/Recognition: 0.87383

Instance 11:
Its symbol: A
ccA already exists for it
How all canonical circuits see this instance at start:
ccA: Match/Recognition: 0.93646
ccB: Match/Recognition: 0.00132
ccC: Match/Recognition: 0.04108
How all canonical circuits see this instance after learning:
ccA: Match/Recognition: 0.98506
ccB: Match/Recognition: 0.00015
ccC: Match/Recognition: 0.04137

Fig. 2. Overview of the CCON model, showing K canonical circuits of a single cognitive context, each circuit with its B_k Bayesian inference process and its NN_k neural network. The same input x feeds all canonical circuits, but each of them has its own supervision information y_i. At the top, the canonical circuits are shown to be mutually inhibitive, resulting in orthogonalization of the cognitive context.

¹ A log-sigmoid transfer function was also tried, without noticeable impact.

3.2. Cognitive context orthogonalizing networks (CCON)

We recall that the goal of CCON is incremental and Bayesian orthogonalization of a cognitive context into symbols, each of which is maintained by a canonical circuit, executing in parallel with all other circuits. In computational terms only, CCON is a Bayesian classifier that can learn from an infinite data stream, with few instances at a time, and scale dynamically to any number of classes as they show up in the stream, processing each class in parallel.

CCON starts from a blank state, and adds a new canonical circuit for each new symbol when it first shows up in the data stream. This means the number of known classes starts from zero, and increments as they show up for the first time in the data stream. The neural network of a new canonical circuit is initialized using gradient descent so that ξ_k = 1 for the first data instance. Once a canonical circuit cc_k is initialized, its B_k maintains a conditional probability distribution for the configuration of NN_k, i.e. for θ^(k) = {θ^(k)_1, θ^(k)_2, …, θ^(k)_D}, while distancing the distribution from the ones maintained in the other canonical circuits. In other words, B_k maintains a recognizer for the symbol of cc_k while maintaining its uniqueness in its context. Their concurrent mutual inhibition results in orthogonalization of the context.
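The initialization step ("gradient descent so that ξ_k = 1 for the first data instance") could be realized as in the following sketch, which performs a few gradient ascent steps on log ξ_k for the first instance; the step size, iteration count, and random initialization scale are our illustrative choices, not values from the paper.

import numpy as np

def init_circuit(x, H, steps=200, lr=0.1, rng=np.random.default_rng(0)):
    """Drive xi_k(x) towards 1 on the circuit's first instance x (illustrative)."""
    d = x.size
    W = 0.1 * rng.standard_normal((H, d)); b_h = np.zeros(H)
    w = 0.1 * rng.standard_normal(H); b_out = 0.0
    for _ in range(steps):
        v = np.tanh(b_h + W @ x)                         # Eq. (3)
        xi = 1.0 / (1.0 + np.exp(-(b_out + w @ v)))      # Eq. (2)
        g = 1.0 - xi                                     # d log(xi) / d z for a logistic output
        dv = g * w * (1.0 - v**2)                        # back-propagate through tanh
        b_out += lr * g; w += lr * g * v                 # gradient ascent on log xi
        b_h += lr * dv; W += lr * np.outer(dv, x)
    return dict(W_inp2h=W, b_h=b_h, w_h2out=w, b_out=b_out)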

To be more specific, let us suppose that a new input (x, y) arrives at time t and we look at a single canonical circuit cc_k. The most recent probability distribution of θ^(k) maintained by B_k is p^(t−1)(θ^(k)), last updated at time t−1, and NN_k is parameterized with a sample from it. The level of recognition ξ_k(x, θ^(k)) can be used to measure the likelihood of θ^(k) given x and y, based on a Bernoulli discriminative data likelihood model ξ_k^y (1−ξ_k)^(1−y). To simulate inhibition between different canonical circuits, we formulate the data likelihood using a specialized set of i.i.d. instances for each canonical circuit. At time t, each canonical circuit cc_k has its own update set of instances D_k^(t) = {(x^(1), y^(1)), (x^(2), y^(2)), …, (x^(K), y^(K))}, where each instance is the last seen of each of the K classes, and only y^(k) = 1, while all others are 0. Therefore, all the D_k^(t) share the same x^(k) but have different y^(k). The data likelihood for cc_k at time t is then

p(D_k^{(t)} \mid \theta^{(k)}) = \prod_{i=1}^{|D_k^{(t)}|} \xi_k(x^{(i)}, \theta^{(k)})^{y^{(i)}} \left(1 - \xi_k(x^{(i)}, \theta^{(k)})\right)^{1 - y^{(i)}} \qquad (4)

All the instances which have y = 0, i.e. all those associated with symbols of other canonical circuits, are used as a representation of what cc_k needs to distance itself from. In other words, the inhibition is simulated by the (1−ξ)^(1−y) factors in Eq. (4). It would be more biologically plausible if the model did not have to store even the last seen instance per symbol, and could learn with the current instance only. We believe this is possible if all the NN_k are modeled as recurrent neural networks and the desired inhibition between them is implemented through their attractors and synchrony. We attempted this approach but have so far been unsuccessful, and this remains key future work.
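In code, the likelihood of Eq. (4) for one circuit, evaluated over its update set D_k^(t), could look like the following sketch (our illustration, reusing the nn_output helper from the earlier sketch):

def data_likelihood(theta_k, update_set):
    """Eq. (4): update_set is a list of (x, y) pairs, with y = 1 only for this circuit's own symbol."""
    lik = 1.0
    for x, y in update_set:
        xi = nn_output(x, **theta_k)
        lik *= xi**y * (1.0 - xi)**(1 - y)    # the (1 - xi)^(1 - y) factors implement the inhibition
    return lik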

The Bayesian inference process executed in each cc_k can then be expressed as

p^{(t)}(\theta^{(k)} \mid D_k^{(t)}) \propto p^{(t-1)}(\theta^{(k)})\, p(D_k^{(t)} \mid \theta^{(k)}) \qquad (5)

where p^(t−1)(θ^(k)) is the previous posterior, used as an empirical prior at time t. In addition to using only the last seen instance per symbol, one could optionally use multiple last seen instances per symbol, which improves the inhibition simulation.

As data stream instances keep coming in, each instance replaces the older instance of its symbol in all the D_k^(t) sets, and for each new instance all canonical circuits update again in parallel using equation (5). The process recurrently reuses the last posterior as the new empirical prior, developing p(θ^(k)) with each new instance, and constantly sampling from it for the latest NN_k parameter configuration.

So far we have described how CCON learns. The model is used for predictions at any time by getting the recognition ξ_k of each of its canonical circuits and then using the softmax model:

p(y_{new} = k \mid x_{new}) = \frac{\xi_k(x_{new}, \theta^{(k)})}{\sum_{l=1}^{K} \xi_l(x_{new}, \theta^{(l)})} \qquad (6)
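Prediction with Eq. (6) then amounts to normalizing the recognition levels of all circuits, as in this short sketch (ours, again reusing nn_output):

def predict(x_new, circuits):
    """circuits: dict mapping each symbol to its current parameter sample theta^(k)."""
    xi = {k: nn_output(x_new, **theta_k) for k, theta_k in circuits.items()}
    total = sum(xi.values())
    return {k: v / total for k, v in xi.items()}    # p(y_new = k | x_new), Eq. (6)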

3.3. Incremental metropolis hastings (IMH)

We believe that a biologically plausible inference method for a single canonical circuit cc_k in CCON operates as follows at a high level: it contains a generative process which constantly issues parameters for the neural network NN_k; while being constantly reconfigured, NN_k persistently tries to recognize its associated symbol in the streaming sensory data; supervision information is used to determine how well the recognition performs, thus measuring the likelihood of the proposed parameters; the current input persistently activates the neural network, and the generative process is able to keep proposing changes to the parameters while observing how their likelihood changes under the current input. Changes are accepted or rejected based on their likelihood effect, and the neural network progresses through a chain of successively better configurations. The generative process can be imagined as turning a dial on each parameter with a speed dependent on the likelihood effect, turning it faster when the likelihood gets worse, to look for a new value domain, and turning it slower when the likelihood improves, to optimize in the current value domain.

The method above is reminiscent of the Metropolis-Hastings (MH) Markov chain Monte Carlo algorithm [16]. However, MH cannot learn from data streams because it requires all the data at each update step. Therefore, we develop IMH, which updates with a single instance instead of all the data, and transfers the knowledge from previous updates through the prior. We also adopt the acceptance criterion from MH, by which proposals which improve the likelihood are always accepted, while proposals that degrade the likelihood can be tolerated to some extent for exploration benefits.

In this paper we only propose how IMH works on Marr's representational level. How IMH works on Marr's physical level, we leave as an open question. However, it is interesting to note that we were led to the idea for IMH from our ongoing research into the physical level, where we are looking into a biological process which engulfs and manipulates many synapses inside a canonical circuit. This process has wave-based dynamics and can be interpreted probabilistically as a generative process which continually samples neural network configurations, while learning from their performance.

Conventional MH builds a Markov chain which converges to a desired distribution p^(t). It starts from a random value, and at each iteration proposes a new value θ_new from a proposal distribution q, to be a possible successor to the last chain element θ_old. The new value is accepted with probability:

\min\!\left(1, \frac{p^{(t)}(\theta_{new})\, q(\theta_{old})}{p^{(t)}(\theta_{old})\, q(\theta_{new})}\right) \qquad (7)

Intuitively, θ_new has greater chances of acceptance if it is more probable in the target distribution p^(t), and MH cares more for candidates which are harder to propose by q. The proposal distribution q could be a normal distribution centered on the previous value, and with a certain standard deviation. If the proposal is rejected, θ_old is copied as the new chain element. The empowering aspect of MH is that the normalizing constant of p^(t) cancels out in Eq. (7). Because MH starts from a random value, the Markov chain needs to spend a certain time getting close to the target distribution before its values can be used as a representative sample. This initial part of the chain is often referred to as burn-in and is removed after manual analysis.

IMH is based on the idea that instead of running a single long Markov chain requiring all the data for each update, we run a short update chain for each new data instance, using only D_k^(t) for the data likelihood. Starting with a random value, each update chain at time t uses the previous posterior p^(t−1) as an empirical prior, and an acceptance probability based on Eqs. (5) and (7):

\min\!\left(1, \frac{p^{(t-1)}(\theta_{new})\, p(D_k^{(t)} \mid \theta_{new})\, q(\theta_{old})}{p^{(t-1)}(\theta_{old})\, p(D_k^{(t)} \mid \theta_{old})\, q(\theta_{new})}\right) \qquad (8)

The product of each update chain is an updated posterior p^(t) in the form of a new set of samples. Since the posteriors used in IMH are sets of samples, we cannot use them in their original form as the next prior, since for a prior we need a functional form which can evaluate the probability of any proposal θ_new. To bridge this gap we use kernel density estimation (KDE). Given a sample {θ_1, θ_2, …, θ_S} of size S from a probability distribution p, KDE can estimate the probability p(θ_new) for any θ_new, by averaging the contributions of a symmetric kernel Λ centered on each point of the sample:

p_{KDE}(\theta_{new}) \approx \frac{1}{S} \sum_{s=1}^{S} \Lambda(\theta_s, \theta_{new}) \qquad (9)

For Λ we use Gaussian kernels with bandwidth (4/(3S))^{0.2} σ, where σ is the sample standard deviation. This is a normal-optimal bandwidth heuristic [24], which we found allows for multiple modes, while preventing unbounded growth by merging proximal modes as their number grows.
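A per-parameter version of the KDE prior in Eq. (9), with the normal-optimal bandwidth above, could be sketched as follows (our code; the small lower bound on the bandwidth is an illustrative numerical guard):

import numpy as np

def kde_prior(samples):
    """Return a density estimate p_KDE(theta_new) built from posterior samples of one parameter."""
    samples = np.asarray(samples, dtype=float)
    S = samples.size
    h = max((4.0 / (3.0 * S)) ** 0.2 * samples.std(), 1e-6)    # normal-optimal bandwidth
    def density(theta_new):
        z = (theta_new - samples) / h
        return np.mean(np.exp(-0.5 * z**2) / (h * np.sqrt(2.0 * np.pi)))   # Eq. (9), Gaussian kernel
    return density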

Each update chain starts from a random sample from the previous posterior, thus using previous knowledge as a new starting point. Combined with the use of the previous posterior as a very informative prior, a transfer of knowledge is achieved between update chains, allowing for much shorter chains (in our case, several orders of magnitude shorter). The transfer of knowledge also removes the need for burn-in analysis, because after the first instance each update chain no longer starts with a random value, but in some vicinity of its target distribution.

IMH has simple parameterization because it automatically sets most parameters based on how well NN_k recognizes its symbol instances, and how well it rejects those of other symbols. To measure this quality we use the data likelihood from Eq. (4), and call it the fitness of a canonical circuit. The fitness is used to set the proposal distributions for the update chains, as well as their length.

For the proposal distribution q, we use a Gaussian centered on the current value and with a standard deviation equal to the log of the fitness. As a result, the worse a circuit does on predicting a new instance (i.e. away from its symbol or towards foreign symbols), the standard deviation will be exponentially larger, pushing it to propose alternatives more forcefully.

For the length of each update chain, IMH takes a minimum and a maximum as parameters, and sets the actual length automatically as maximum × (1 − fitness), with a lower bound of minimum, which gives less-fit canonical circuits longer chains and vice versa. Because of this, IMH goes through familiar instances faster, and slows down to deal with novelty.
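Putting Eq. (8), the KDE prior, and the fitness-based settings together, one IMH update chain for one circuit could be sketched as below. This is our reading of the method, not the authors' code: we treat the proposal scale as the magnitude of the log-fitness, assume a symmetric proposal (so q cancels in the ratio), and use a hypothetical unpack helper that maps a flat parameter vector to the NN_k parameter dictionary expected by data_likelihood.

import numpy as np

def imh_update(prev_samples, update_set, unpack, min_len=5, max_len=50,
               rng=np.random.default_rng()):
    """prev_samples: (S, D) array of posterior samples from time t-1; returns samples of p^(t)."""
    S, D = prev_samples.shape
    kdes = [kde_prior(prev_samples[:, j]) for j in range(D)]     # independent per-parameter priors
    prior = lambda th: np.prod([kdes[j](th[j]) for j in range(D)])
    lik = lambda th: data_likelihood(unpack(th), update_set)

    theta = prev_samples[rng.integers(S)].copy()                 # start from a previous-posterior sample
    fitness = lik(theta)
    chain_len = max(min_len, int(max_len * (1.0 - fitness)))     # less-fit circuits get longer chains
    chain = []
    for _ in range(chain_len):
        for j in rng.permutation(D):                             # one parameter at a time, random order
            scale = max(abs(np.log(max(fitness, 1e-300))), 1e-3) # worse fitness -> wider proposals
            proposal = theta.copy()
            proposal[j] += rng.normal(0.0, scale)
            ratio = (prior(proposal) * lik(proposal)) / max(prior(theta) * fitness, 1e-300)
            if rng.random() < min(1.0, ratio):                   # acceptance test of Eq. (8)
                theta, fitness = proposal, lik(proposal)
        chain.append(theta.copy())
    return np.array(chain)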

The complexity of IMH is similar to MH because instead of running thousands of iterations using the whole dataset for each iteration, as is usually the case with MH, IMH can run much shorter chains for each data instance in sequence. IMH does have the minor added cost of calculating the KDE for each update chain.

We recall that each canonical circuit cc_k learns independently, i.e. it runs its own IMH which is independent from the IMH processes of the other canonical circuits. In terms of the update order of the neural network parameters θ^(k) = {θ^(k)_1, θ^(k)_2, …, θ^(k)_D}, they are updated one at a time and in a random order which is permuted each iteration. We also assume independence of the prior components, i.e. p(θ^(k)) = ∏_{j=1}^{D} p(θ^(k)_j).

4. Related work

4.1. Biomorphic perspective

Computational models of the neocortex vary greatly in their objectives and biological inspiration. Many of them do not pursue a generic function but are specialized models, such as, for example, in modeling ocular dominance with a 2D grid over which simple Hebbian-type relationships are simulated [25]. Models which are more generic cover only a subset of our five principal assumptions. Even if we ignore these assumptions, existing models usually present either no evaluation or one done on simplified datasets. In addition, they model either pattern completion or time-series predictions. If we consider, under the symbolic abstraction assumption, that a flow of reasoning can be seen as parallel cascades and co-occurrences of cognitive symbols, then we can argue that pattern completion and time-series predictions are relevant for the relationships between the different cognitive contexts, while a separate mechanism is still needed to select the symbol in each context. Since our work focuses on symbol identification in a cognitive context, it can therefore be seen as complementary to many existing models. Such complements can lead to models which combine symbolic and sub-symbolic (or connectionist) computation, which has been considered essential for artificial intelligence [26,27]. Below we review these aspects through some key related work, in chronological order.

Fransen and Lansner [28] offer a model of associatively-networked columns using very biologically plausible neurons. Each of their columns is built from a single densely-connected recurrent neural network, and then the columns are interconnected and trained with a Bayesian correlation-based (Hebbian) learning rule. If these columns are assumed to be canonical circuits, then they do not provide symbol abstraction, but only operate sub-symbolically on raw input and transform it into a co-occurrence pattern. It is also not clear if they are parallelizable. The model is evaluated only on a simple artificial pattern completion problem. This work provides a great starting point for using more biologically plausible neurons in our canonical circuits.

Maass et al. [29] present a system of recurrent networks of integrate-and-fire neurons, called Liquid State Machines, where they identify a column with each recurrent network. It is a type of reservoir computing, where each recurrent network is a reservoir which takes the input, forms transient dynamics, and offers opportunity for stable output neurons. In essence, it performs sophisticated input transformation. The output neurons can then be employed in some task like time-series prediction or classification. However, the use of the outputs requires an additional method, such as a perceptron-like local learning rule or linear regression, and it is unclear how these additional mechanisms relate to cortical circuits, or how they would be implemented biologically. Reservoirs are likely relevant to signal transformation in the brain; however, by themselves they do not offer a way for symbol abstraction. Perhaps the characteristic dynamics inside a reservoir could be related somehow to symbols; however, this model has no method for this, and the authors simply say they would be "optimized genetically and through development." In addition, this model is not Bayesian.


Simen et al. [30] propose a neocortical model for a hybrid sub-symbolic and symbolic computation. Their equivalent of a canonical circuit is given the function of signal propagation delay and low-pass filtering, which is more of an auxiliary function, compared to our approach where a canonical circuit is an explicit representative of a cognitive symbol. Their model then proposes that circuits can be organized with lateral inhibition as future work, but does not give any computational details. In addition, there is no evaluation.

Nouri and Nikmehr [31] present a hierarchical Bayesian approach to reservoir computing with the objective of implementing a top-down Memory Prediction Framework (MPF). MPF essentially states that the system constantly tries to predict the future input based on its past experience. The proposed model is framed as a model of the neocortex and thalamus because of its hierarchical regions, and is strictly a time-series solution. The time-series prediction happens directly on raw sensory input, and there is no consideration for cognitive symbols. However, if the same idea is applied over our cognitive symbol probabilities instead of raw sensory inputs, then the regions in this model can relate to cognitive contexts, and it can be used for time-series predictions over cascades of contexts and symbols.

George and Hawkins [15] also present a hierarchical and temporal Bayesian model inspired by the columnar hypothesis, which learns temporal coincidences. As discussed above, this perspective is relevant to the network of cognitive contexts. They do not address the issue of symbol abstraction from raw input inside a single context. Instead, they skip this step and at the bottom layer present pre-identified symbols (Gabor filters). The authors state that "[the] model would require modifications to include sparse-distributed representations within an HTM node," and this is where our model can be complementary.

Rinkus [2] presents a model for a generic function of the canonical circuit, which agrees with our perspective that such a role becomes clear only in the context of its proximal peer group. It proposes that the proximal peer group stores a sparse representation of the input. The principal difference with our model is that there is no symbolic abstraction. Also, this work has no computational implementation or evaluation, is not Bayesian, and it uses only binary input attributes and binary synapses.

The most current, and perhaps most relevant to our model, is the work of Bastos et al. [13], which employs a Bayesian brain theory based on predictive coding and free energy minimization [32], and relates it remarkably to canonical circuits. Predictive coding is based on a hierarchical model where a higher area predicts the input of the lower area using a generative process, while the lower area computes the error between the prediction and the actual input, and sends back only the error to the higher area. The goal of the system is to minimize the error feedback, or the surprise, which requires better feedforward predictions. The model is conceptual only and there is no evaluation on datasets. It is also not clear how symbols relate on the same level in the hierarchy or within a context. This presents an opportunity to combine this model with ours.

4.2. Computational perspective

Due to the data stream learning requirements in our objective, we do not consider learning models that need all the data to be available at the start. For example, Metropolis-Hastings has been applied to learn neural network parameters given a set of static training data [33], but requiring the complete data set for each proposal evaluation. Another example is the work of Wu et al. [34], where the authors propose a model for intuitive representations of classes (symbols) in a neural network framework. However, their model cannot learn incrementally (online) because it requires statistics from the whole dataset at the beginning. In addition, this model is not probabilistic.

One related learning model is the work of de Freitas et al. [35], which, even though in much lower dimensional problems and mostly for time series analysis, uses particle filters to perform sequential Bayesian inference on neural network parameters. Therefore, we compare using particle filters in place of IMH as the Bayesian inference method for CCON. There are also other sequential Bayesian inference methods; however, they are not applicable to our model due to its continuous state-space and the highly non-linear probability space of neural network parameters.

Particle filters are based on importance sampling, which tackles an intractable distribution of θ by sampling from an approximating tractable distribution and weighting the samples with an importance weight w equal to the ratio of these two distributions. Unlike IMH, which keeps a single parameter value at each moment, particle filters keep many versions (particles) of a parameter, and use all of them based on their weights. Under a Markov process assumption, this approach can be formulated recursively so that it can be used on streaming data. Starting with an initial set of approximating particles for θ, each iteration updates their weights and performs various additional techniques to help with a known particle degeneration issue where few particles end up with weights close to 1 while the rest end up close to 0. Aside from the degeneration issue getting worse with increasing model complexity, particle filters also inherit the limitation of importance sampling in scaling with higher dimensionality.
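For contrast, a generic sequential importance-resampling step of the kind described above is sketched below (ours; this is a textbook scheme rather than the exact configuration evaluated in Section 5, and the Gaussian transitional jitter and resampling threshold are illustrative choices):

import numpy as np

def particle_filter_step(particles, weights, update_set, unpack,
                         transition_scale=0.1, rng=np.random.default_rng()):
    """particles: (N, D) parameter versions; weights: (N,) normalized importance weights."""
    N = len(particles)
    particles = particles + rng.normal(0.0, transition_scale, particles.shape)   # transitional prior
    for i in range(N):
        weights[i] *= data_likelihood(unpack(particles[i]), update_set)          # re-weight by new data
    weights = weights / weights.sum()
    if 1.0 / np.sum(weights**2) < 0.5 * N:                 # effective sample size collapsed: degeneracy
        idx = rng.choice(N, size=N, p=weights)             # resample and reset the weights
        particles, weights = particles[idx], np.full(N, 1.0 / N)
    return particles, weights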

5. Evaluation

The objective of the evaluation is to see how CCON works as a supervised classifier on data streams. Part of the evaluation is also to compare IMH to particle filters used as inference methods in CCON. It is a limitation of our work that we do not evaluate on datasets that are more related to biological behavior. We are partially prevented from doing this because we would need a network of cognitive contexts, which is part of the future work.

We evaluate CCON in classification tasks on data streams using 10 datasets. Two of the datasets, SEA [36] and Circles [37], are popular for data stream evaluation and we use versions with 4 concept drifts each, at 25% intervals. The rest of the datasets are from the UCI repository²: Waveform, Iris, Wine, Vehicle, Balance, Liver, Vertebral, and Diabetes. The dataset Waveform is also popular in data stream evaluation, though it has no concept drifts. The other UCI datasets are not data streams originally; however, we shuffle them in a single fixed order and assume the dataset is presented one instance at a time. All data is normalized in the range [−1, 1] before being used, as is customary before feeding data into neural networks.

For all experiments using random functions, 10 trials with different random initializations were used and their averages reported. Parallel programming was used to have all canonical circuits execute simultaneously. Accuracy was evaluated using the prequential metric [38], which is calculated by making a prediction on every new instance, comparing to the true label, and keeping a shifting sum of a loss function (0 for a correct prediction and 1 otherwise). Its advantage over holdout validation is that it can be used on any data stream without losing training data, while it does converge to holdout estimates [38]. The shifting sum is over the last 50 instances, which is an evaluation (not model) parameter and chosen empirically, noting that evaluation results are not sensitive to it.
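The prequential loop used for accuracy can be written compactly as in the following sketch (ours; the model.predict/model.learn interface is a hypothetical stand-in for CCON's prediction and update steps):

from collections import deque

def prequential_accuracy(stream, model, window=50):
    """Test-then-train on every instance; report accuracy over a sliding window of losses."""
    losses, curve = deque(maxlen=window), []
    for x, y in stream:
        loss = 0 if model.predict(x) == y else 1    # 0/1 loss on the incoming instance
        losses.append(loss)
        model.learn(x, y)                           # only then is the instance used for learning
        curve.append(1.0 - sum(losses) / len(losses))
    return curve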

² http://archive.ics.uci.edu/ml/.


We present results for 3 and 7 hidden layer neurons in each canonical circuit's NN. Theoretically, this parameter can also be made self-adjusting. The number of MH iterations for each update is self-adjusting based on a minimum of 5 and a maximum of 50. Setting any of these parameters does not pose a parameter space search problem. They are all monotonic measures of allocated computational power. To calculate the data likelihood, we try with only the last seen instance per symbol, or the last 7 per symbol, to show how this impacts performance.

Fig. 3 shows IMH significantly outperforming particle filters (p-value of 0.0001% in pairwise t-tests) when used as the incremental Bayesian inference method in CCON. The particle filter configuration shown is the best of 6 versions, which tried two different resampling techniques [39], sample sizes of 50 and 500, as well as 1 and 50 sample evolution iterations per data instance. For the transitional prior in our particle filters, we use the same proposal distribution from our IMH update step.

CCON is next compared, using IMH, to 4 leading methods for data stream learning: Hoeffding Adaptive Tree (also known as Very Fast Decision Tree), Naïve Bayes with Drift Detection Mechanism, and accuracy weighted ensemble classifiers using the above two methods as their base classifiers [38]. All methods have concept drift handling ability. Fig. 4 shows that when using only the last seen instance in the likelihood, CCON is comparable to the other methods for roughly half the datasets. However, when allowing 7 last seen instances, CCON becomes comparable for all the datasets, since this improves the simulation of inhibition between canonical circuits.

Fig. 3. Comparing Bayesian inference methods for CCON. The left-most bar (in a diagonal pattern) shows the best particle filter configuration we found. The other bars (in crossed and solid patterns) show various configurations of IMH as detailed in the legend on top.

Fig. 4. Comparing CCON to leading data stream learning methods. The left-most two bars (in crossed patterns) represent two IMH configurations: the first using only the last seen instance per symbol, and the second using the 7 previous per symbol. The other bars (in diagonal patterns) show other methods as described in the legend on top.

Fig. 5. Concept drift handling on the Circles dataset. The symbol distributions change every 200 instances, at vertical lines.

CCON is innately incremental because the full Bayesian inference at each update modifies the prior only where it does not fit with the new instance, preserving older and compliant knowledge. Figs. 5 and 6 show that CCON handles concept drifts (marked with vertical lines) as well as any of the other methods.

While CCON matches the performance of state-of-the-art data stream classifiers, it has the following advantages: it is fully Bayesian and hence better with uncertainty, is not tied to a fixed number of classes and can add classes as they show up in the data stream, has innate concept drift detection, and is much easier to parameterize. The other methods must keep a batch of old data or sufficient statistics and use special mechanisms to forget old knowledge in order to perform incremental learning. Therefore, they require constants such as sliding windows, forgetting parameters, speed or bounds of change, all of which present their own parameter search problems [40].

6. Conclusion

We started by identifying five assumptions about neocortical computation: the existence of canonical circuits, their relationship to cognitive symbols and contexts, Bayesian aspects, parallelism, and incremental requirements. Following these requirements, we built our CCON model for a single cognitive context which is continually orthogonalized into symbols by canonical circuits, can follow the evolution of non-stationary symbols, and can grow the number of symbols dynamically as the context itself evolves.

We show that existing inference methods are insufficient to run the complexity of CCON. Following biological inspiration, we develop IMH as a novel inference method which enables CCON to perform as well as state-of-the-art data stream learners, evaluated on popular machine learning datasets. IMH is a contribution in itself as it can be used in other machine learning problems.

CCON covers only a single cognitive context. We are currently working on networking multiple cognitive contexts where they will co-supervise each other. This will allow us to model higher-level neocortical circuits and evaluate against biological behavior datasets.

References

[1] R.L. de Nó, Architecture and structure of the cerebral cortex, in: J.F. Fulton (Ed.), Physiology of the Nervous System, Oxford University Press, New York, 1938.

[2] G.J. Rinkus, A cortical sparse distributed coding model linking mini- and macrocolumn-scale functionality, Front. Neuroanat. 4 (2010).

[3] V.B. Mountcastle, The columnar organization of the neocortex, Brain 120 (1997) 701–722.

[4] D.P. Buxhoeveden, M.F. Casanova, The minicolumn hypothesis in neuroscience, Brain 125 (2002) 935–951.

[5] J. Szentágothai, The Ferrier lecture, 1977: the neuron network of the cerebral cortex: a functional interpretation, Proc. R. Soc. Lond. Ser. B, Biol. Sci. (1978) 219–248.

[6] S. Haeusler, W. Maass, A statistical analysis of information-processing properties of lamina-specific cortical microcircuit models, Cereb. Cortex 17 (2007) 149–162.

[7] A.M. Bastos, W.M. Usrey, R.A. Adams, G.R. Mangun, P. Fries, K.J. Friston, Canonical microcircuits for predictive coding, Neuron 76 (2012) 695–711.

[8] H. Markram, M. Toledo-Rodriguez, Y. Wang, A. Gupta, G. Silberberg, C. Wu, Interneurons of the neocortical inhibitory system, Nat. Rev. Neurosci. 5 (2004) 793–807.

[9] N.M. da Costa, K. Martin, Whose cortical column would that be? Front. Neuroanat. 4 (2010).

[10] J. DeFelipe, H. Markram, K.S. Rockland, The neocortical column, Front. Neuroanat. 6 (2012).

[11] S. Harnad, The symbol grounding problem, Phys. D: Nonlinear Phenom. 42 (1990) 335–346.

[12] J.B. Tenenbaum, C. Kemp, T.L. Griffiths, N.D. Goodman, How to grow a mind: statistics, structure, and abstraction, Science 331 (2011) 1279–1285.

[13] A.M. Bastos, W.M. Usrey, R.A. Adams, G.R. Mangun, P. Fries, K.J. Friston, Canonical microcircuits for predictive coding, Neuron 76 (2012) 695–711.

[14] K.J. Friston, K.E. Stephan, Free-energy and the brain, Synthese 159 (2007) 417–458.

[15] D. George, J. Hawkins, Towards a mathematical theory of cortical microcircuits, PLoS Comput. Biol. 5 (2009).

[16] A. Gelman, J.B. Carlin, H.S. Stern, D.B. Rubin, Bayesian Data Analysis, Chapman and Hall, Boca Raton, 2004.

[17] C.M. Bishop, Pattern Recognition and Machine Learning, Springer, New York, 2006.

[18] D. Marr, Vision: A Computational Investigation into the Human Representation and Processing of Visual Information, Henry Holt and Co., New York, NY, 1982, pp. 2–46.

[19] E. Vul, N.D. Goodman, T.L. Griffiths, J.B. Tenenbaum, One and done? Optimal decisions from very few samples, in: Proceedings of the 31st Annual Conference of the Cognitive Science Society, vol. 1, 2009, pp. 66–72.

[20] A.N. Sanborn, T.L. Griffiths, D.J. Navarro, A more rational model of categorization, in: Proceedings of the 28th Annual Conference of the Cognitive Science Society, 2006, pp. 726–731.

[21] R. Rao, Hierarchical Bayesian inference in networks of spiking neurons, in: Advances in Neural Information Processing Systems, 2004, pp. 1113–1120.

[22] S. Deneve, Bayesian inference in spiking neurons, in: Advances in Neural Information Processing Systems, vol. 17, 2005, pp. 353–360.

[23] B.C. Csáji, Approximation with Artificial Neural Networks (Master's thesis), Faculty of Sciences, Eötvös Loránd University, Hungary, 2001.

[24] A.W. Bowman, A. Azzalini, Applied Smoothing Techniques for Data Analysis: The Kernel Approach with S-Plus Illustrations, Oxford University Press, Oxford, 1997.

[25] G. Goodhill, Topography and ocular dominance: a model exploring positive correlations, Biol. Cybern. 69 (1993) 109–118.

[26] M.L. Minsky, Logical versus analogical or symbolic versus connectionist or neat versus scruffy, AI Mag. 12 (1991) 34.

[27] J. Wallace, K. Bluff, Neurons, glia and the borderline between subsymbolic and symbolic processing, Progr. Artif. Intell. 990 (1995) 201–211.

[28] E. Fransen, A. Lansner, A model of cortical associative memory based on a horizontal network of connected columns, Network 9 (1998) 235–264.

[29] W. Maass, T. Natschlager, H. Markram, Real-time computing without stable states: a new framework for neural computation based on perturbations, Neural Comput. 14 (2002) 2531–2560.

[30] P. Simen, T. Polk, R. Lewis, E. Freedman, Universal computation by networks of model cortical columns, in: Proceedings of the International Joint Conference on Neural Networks, vol. 1, IEEE, 2003, pp. 230–235.

[31] A. Nouri, H. Nikmehr, Hierarchical Bayesian reservoir memory, in: Proceedings of the 14th International Computer Conference (CSICC), IEEE, 2009, pp. 582–587.

[32] K.J. Friston, K.E. Stephan, Free-energy and the brain, Synthese 159 (2007) 417–458.

[33] R.M. Neal, Bayesian Learning for Neural Networks, Lecture Notes in Statistics, No. 118, Springer-Verlag, New York, 1996.

[34] Q. Wu, M. McGinnity, D.A. Bell, G. Prasad, A self-organizing computing network for decision-making in data sets with a diversity of data types, IEEE Trans. Knowl. Data Eng. 18 (2006) 941–953.




[35] J.F.G. de Freitas, M. Niranjan, A.H. Gee, Hierarchical Bayesian models for regularization in sequential learning, Neural Comput. 12 (2000) 933–953.

[36] W.N. Street, Y. Kim, A streaming ensemble algorithm (SEA) for large-scale classification, in: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2001, pp. 377–382.

[37] K. Nishida, K. Yamauchi, Detecting concept drift using statistical testing, Discov. Sci. (2007) 264–269.

[38] J. Gama, Knowledge Discovery from Data Streams, 1st ed., Chapman and Hall, Boca Raton, 2010.

[39] R.V.D. Merwe, A. Doucet, N.D. Freitas, E. Wan, The unscented particle filter, in: Advances in Neural Information Processing Systems, 2001, pp. 584–590.

[40] A. Bifet, R. Gavalda, Adaptive learning from evolving data streams, Adv. Intell. Data Anal. VIII (2009) 249–260.

Martin Dimkovski is a Computer Science Ph.D. student at York University in Toronto, Canada, in Dr. Aijun An's team. He is an Elia Scholar, an award given at York University for achievements in liberal education and interdisciplinary studies. Martin holds an M.Sc. in Computer Science, an M.Sc. in Information Technology, and a B.Sc. in Computer Science. His research interests are in artificial intelligence and knowledge technology, in particular related to modeling biological intelligence.

Aijun An is a Professor of Computer Science at York University, Toronto, Canada. She received her Ph.D. degree in computer science from the University of Regina in 1997, and held research positions at the University of Waterloo from 1997 to 2001. She joined York University in 2001. Her research area is data mining. She has published widely in premier journals and conference proceedings on various topics of data mining, including classification, clustering, data stream mining, transitional and diverging pattern mining, high utility pattern mining, sentiment and emotion analysis from text, topic detection, keyword search on graphs, social network analysis, and bioinformatics.
