
Mathematical Model of Stored Logic Based Computation

Higinio Mora-Mora*, Jerónimo Mora-Pascual, María Teresa Signes-Pont, José Luis Sánchez Romero

Specialized Processor Architectures Laboratory, University of Alicante 03690 San Vicente del Raspeig, Alicante, Spain

Abstract

Emerging VLSI and ULSI integration technologies provide new possibilities for developing computational paradigms based on memories with pre-calculated data. A memory can behave like a processor with complete functionality by simulating a classic Turing machine. By means of this stored logic based architecture, the processor adopts a simple and regular internal organization. In addition, it makes the most of the inherent features of memories and offers the possibility of reconfiguration by writing new results into it. This paper reviews different possibilities for using memory elements to perform computations that would replace the combinational logic processing of a given function. A computational model based on stored logic is described, the main issues concerning the use of memory in processing are analyzed, and a small functional unit based on these notions is presented.

Keywords: digital VLSI systems; microprocessor design; memories; look-up tables

1 Introduction

The main aim of computation is to process information. To carry out this task, a computer is regarded as a universal computing device with the processor being the central part that performs the processing. Inside the processor, a set of operators carry out the functions and procedures that make up the computation [1].

Traditionally, these operators have been based on the principles of Boolean algebra, so the electronic circuits that implement them can be built from logic gates according to the rules of digital design. In this context, digital hardware nowadays plays a fundamental role in computer engineering, mainly due to the rapid increase in the integration density of electronic components and the fall in the price of the implementation technologies. However, in spite of this progress, digital design is still based on the same principles as over 30 years ago.

Although there are lines of research studying the feasibility of alternatives to electronic technology [2,3], we believe a complementary effort could be made to improve how current technology is used. Recent advances in electronic technology and its greater capacity allow for the design of operators based on elements at a higher processing level than logic gates [4,5],

* Corresponding Author.

E-Mail addresses: [email protected] (H. Mora-Mora), [email protected] (J. Mora-Pascual), [email protected] (M.T. Signes-Pont), [email protected] (J.L. Sánchez-Romero)


that is, elements capable of processing more information than a single bit. These operators must make the most of the greater integration density in order to build more robust and faster units.

In this paper, we suggest the possibility of creating operators based on stored logic which, in certain application contexts, perform computation more advantageously than the corresponding design in combinational logic. We think that designers of low-level computation systems should take this paradigm into account in order to increase performance. In most universities and colleges, logic design is taught only by means of combinational logic [6,7]; without wishing to discredit that teaching, the stored logic alternative provides an explicit abstraction of the circuit itself, since it regards the operator as a black box.

Digital circuits are an inseparable part of electronic systems. In order to determine their applicability to solving a problem, two areas of computational theory must be analyzed: computability and complexity. Computability deals with deciding whether or not a problem can be solved by means of a computer, that is, whether an algorithm exists; complexity deals with the cost associated with the solution [1]. In this work, we analyze these characteristics for systems based on stored logic in order to determine their capacity to replace combinational circuits and to identify the problems for which they are most favorable.

This paper is structured in the following way: Section 2 reviews the use of memories in processing, in particular designs based on look-up tables with precalculated results. Section 3 presents the computational model proposed in this work, and Section 4 analyzes the performance and features relevant to this technique. Finally, Section 5 describes an arithmetic unit with operations implemented with the proposed model, with special attention to the addition and multiplication operations.

2 A Survey of Memories in Processing

The use of stored data in operator design is not a new technique. It is frequently used as a way of providing constant values in the execution of an algorithm. Generally speaking, these values are either an initial seed from which the processing begins or constants used in carrying out an algorithm. However, memories have been relegated to a secondary, accessory role in the structure of an operator; proof of this is that references to complexity in operator design are usually about gate levels and not about the characteristics of the memory elements it contains.

The aim of this article is to configure a processing technique based on memory elements that results in a computational model based on stored logic. First, it should be pointed out how memories are already used in processing, since such uses mark the starting point of the approach presented in this article. Computer architecture usually associates memories with elements that store data to be processed later, such as general-purpose caches or registers [8,9]. In this review of the state of the art, however, we focus on the application of memories as part of operator design itself.

As we have mentioned before, one of the most frequent applications of memories is to provide a piece of information or initial seed that will begin the processing of certain algorithms of arithmetic calculation. Several representative works are mentioned below.

The well-known Newton-Raphson and Goldschmidt methods for the calculation of polynomial roots are frequently used to compute, among other operators, division [10,11], square roots and inverse square roots [12,13]. These algorithms have an iterative structure that begins with an initial value upon which the functional iteration is performed until the desired degree of precision is reached. This seed is supplied by a rapid-access memory close to the processing logic called a look-up table (LUT). The more precise this information is, the fewer iterations are necessary [10,14]. Some studies show that in applications where little precision is required, it suffices to access the right LUT and carry out several simple operations [15,16].
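To make the role of the seed table concrete, the following Python sketch computes a reciprocal by Newton-Raphson iteration starting from a LUT entry. The table width of 6 bits and the use of floating point are illustrative assumptions, not values taken from the works cited above.

```python
# Sketch: Newton-Raphson reciprocal seeded from a look-up table.
SEED_BITS = 6  # assumed table size: indexed by the 6 most significant fraction bits

# Precompute seeds for 1/x with x in [1, 2): one entry per 6-bit fragment.
SEED_TABLE = [1.0 / (1.0 + i / 2 ** SEED_BITS) for i in range(2 ** SEED_BITS)]

def reciprocal(x: float, iterations: int = 2) -> float:
    """Approximate 1/x for x in [1, 2) from a LUT seed plus NR iterations."""
    index = int((x - 1.0) * 2 ** SEED_BITS)   # top fraction bits select the seed
    y = SEED_TABLE[index]                     # initial approximation from the LUT
    for _ in range(iterations):
        y = y * (2.0 - x * y)                 # Newton-Raphson step for f(y) = 1/y - x
    return y

print(reciprocal(1.51))  # ~0.66225; a more precise seed needs fewer iterations [10,14]
```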

Methods based on polynomial expansion approximate the function value by means of a polynomial expression that tends to the desired value. Implementations of these algorithms use memories to improve performance: in general, the combinational calculation of the first terms of the polynomial is replaced by a value selected directly from a LUT, and the precision is then increased by calculating the later terms [18,19]. Other function designs require a set of numerical constants which, combined with the rest of the circuit, produce the final result of the operator. These numbers are calculated beforehand and stored in memories together with the rest of the operator's logic. They usually correspond to complicated mathematical expressions with well-known values; for example, in approximations using the Taylor expansion, it is common to precalculate the reciprocals of the successive factorials in the denominators of the terms. In other cases, the Taylor polynomials of the functions are combined with other methods for calculating polynomial approximations, such as Remes's algorithm [20], which finds the best minimax approximation to a continuous function on a finite interval, or Horner's recursion for the evaluation of polynomials [21]. Tang's proposals for the calculation of transcendental functions belong to this group [18]. With this technique a uniform approximation of continuous functions by low-degree polynomials is achieved, which lowers the cost of calculating the function to a set of multiplications and additions. The technique is based on a set of precalculated results for the function to be calculated, which are stored in memories addressed by a reduced-length operand [25].
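As a minimal sketch of this idea, the code below evaluates a truncated Taylor series with Horner's recursion [21], where the reciprocal factorials are precalculated constants rather than run-time computations. The choice of exp and of degree 8 is our own illustrative assumption.

```python
import math

DEGREE = 8
# Stored constants: reciprocals of the successive factorials, computed beforehand.
INV_FACT = [1.0 / math.factorial(k) for k in range(DEGREE + 1)]

def exp_taylor(x: float) -> float:
    """Evaluate the truncated Taylor series of exp(x) with Horner's recursion."""
    result = INV_FACT[DEGREE]
    for k in range(DEGREE - 1, -1, -1):
        result = result * x + INV_FACT[k]  # Horner step: only multiplies and adds
    return result

print(exp_taylor(0.5), math.exp(0.5))  # close agreement for small arguments
```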

For the processing of elementary non-linear functions, such as logarithmic, exponential and trigonometric functions, there are proposals that combine the use of look-up tables with polynomial approximations. In this group, Wong and Goto's [22] and Ercegovac's [17] methods for calculating exponential and logarithmic functions can be highlighted. Wong and Goto's algorithm exploits the properties of the operations to break their arguments into parts with fewer significant digits. The result of the function for each of these fragments is obtained by consulting a look-up table; the partial results are then combined by means of simple operations to give the result of the function. Ercegovac's method transforms the operation by means of prior preprocessing; afterwards, a look-up table is used as a post-processing stage that provides the final result of the function.
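The following sketch is written in the spirit of this argument-splitting scheme, not as a reproduction of the cited algorithms: the argument of exp is broken into a coarse and a fine fragment, each fragment's exponential is read from a small table, and one multiplication combines the partial results, since exp(h + l) = exp(h) * exp(l). The fragment widths are our assumptions.

```python
import math

HI_BITS, LO_BITS = 4, 4  # assumed split: 8 fraction bits into two 4-bit fragments

EXP_HI = [math.exp(i / 2 ** HI_BITS) for i in range(2 ** HI_BITS)]
EXP_LO = [math.exp(i / 2 ** (HI_BITS + LO_BITS)) for i in range(2 ** LO_BITS)]

def exp_split(x: float) -> float:
    """Approximate exp(x) for x in [0, 1) with two look-ups and one product."""
    frac = int(x * 2 ** (HI_BITS + LO_BITS))          # 8-bit fixed-point argument
    hi, lo = frac >> LO_BITS, frac & (2 ** LO_BITS - 1)
    return EXP_HI[hi] * EXP_LO[lo]

print(exp_split(0.3), math.exp(0.3))  # agrees to the precision of the 8-bit argument
```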

Other research into the design of arithmetic operators uses memories and the precalculated results stored in them to obtain more accurate results by means of functional interpolation. Along these lines, the bipartite table methods consist of fragmenting the operands into parts, obtaining the functional value of each part by accessing a look-up table, and then producing the final result of the function by interpolating the results obtained from the memories [22,26]. In this group we mention the methods proposed by Gal [23,24]. His use of memories consists of storing a collection of results of the functions to be computed, which have a limited accuracy and are representable by the computer. These values are not uniformly spaced and act as a basis for obtaining the correct result, or one with suitable precision. Thus, for example, for the calculation of the trigonometric sine and cosine functions, their values are precalculated to a precision of a few bits, and by means of interpolation the result for a greater number of bits is obtained.
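A minimal sketch of the interpolation idea follows: a coarse table of sine samples is refined by linear interpolation between neighboring entries. The uniform spacing, the table size and the use of plain linear interpolation are simplifying assumptions; the cited methods use more elaborate sample placement and combination.

```python
import math

TABLE_BITS = 8
# Coarse samples of sin on [0, pi/2], stored with limited precision in hardware.
SIN_TABLE = [math.sin(i * (math.pi / 2) / 2 ** TABLE_BITS)
             for i in range(2 ** TABLE_BITS + 1)]

def sin_interp(x: float) -> float:
    """Approximate sin(x) for x in [0, pi/2] from the table plus interpolation."""
    position = x / (math.pi / 2) * 2 ** TABLE_BITS
    i = int(position)
    t = position - i                      # fractional distance between samples
    return SIN_TABLE[i] * (1 - t) + SIN_TABLE[min(i + 1, 2 ** TABLE_BITS)] * t

print(sin_interp(0.5), math.sin(0.5))  # interpolation recovers bits the table lacks
```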

One of the most popular algorithms for the calculation of functions is the vectorial rotation method for trigonometric calculations known by its acronym, CORDIC [29,30]. This method calculates a great variety of functions using only the basic operations of shift and addition. However, it also requires precalculated results in order to obtain the correct answer: depending on the number of rotations performed, the result has to be adjusted in order to keep the modulus of the rotated vector constant. This adjustment is performed by multiplying by precalculated constants derived from the cosines of the rotation angles, which are stored in a memory. To improve performance, only one multiplication is made, by a single constant that groups together the previous factors and is also stored in the table. This algorithm is currently used to calculate a wide range of algebraic functions [31,32].
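The sketch below shows the two precalculated ingredients just named: the table of rotation angles and the single grouped constant that keeps the modulus of the rotated vector equal to one. The iteration count and the floating-point arithmetic are illustrative assumptions; hardware implementations use fixed-point shift-and-add.

```python
import math

ITERATIONS = 16
ANGLES = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]    # stored angle table
K = 1.0
for i in range(ITERATIONS):
    K *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))               # grouped scale constant

def cordic_sin_cos(theta):
    """Compute (sin, cos) of theta in [-pi/2, pi/2] by shift-and-add rotations."""
    x, y, z = K, 0.0, theta   # pre-scaling by K replaces the final multiplication
    for i in range(ITERATIONS):
        d = 1.0 if z >= 0 else -1.0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ANGLES[i]
    return y, x

s, c = cordic_sin_cos(0.7)
print(s, math.sin(0.7), c, math.cos(0.7))
```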

In the construction of reconfigurable systems such as FPGAs, small look-up tables are the fundamental pieces of the design, due to their versatility and adaptability to processing needs. The FPGA fits the logic into the LUTs and their interconnections.

Finally, to close this brief review of arithmetic operator designs that use memories with precalculated results, we must mention the works that propose a table-oriented design for elementary primitive operations such as addition and multiplication. In these proposals the operands are broken up into blocks which are processed pairwise, in an orderly way, by means of accesses to look-up tables; the final result of the operation is obtained by combining the partial results [27,28].

After this review of the use of memories in processing, it is evident that they have been a valid and widely accepted solution for calculating mathematical functions. In most of the proposals presented, the alternatives based on LUTs offer a lower temporal cost than the corresponding ones based on combinational logic. Moreover, without their collaboration, some operators would not be able to reach acceptable performance levels.

Research into memories mentions their size as a disadvantage for their general use. Other problems concern the underutilization of resources: stored logic keeps all possible results of a computation, and many of them may rarely be used. To get around this problem, some of the proposals include preprocessing of the operands in order to reduce their length, as well as post-processing that restores or increases the precision of the results obtained. Likewise, compression techniques can be used to reduce table size and accommodate operands of greater length. Recent advances in the development of electronic technology in general and of data storage in particular [4,6,33] make the processing of larger operands possible with the previous algorithms, and mean that the development of algorithms based on memories with precalculated results is an interesting strategy to take into account.

In the following sections, a design model aimed at stored logic is described, some aspects of the use of memories in processing are analyzed, and a small functional arithmetic unit based on these principles is presented.


3 Processor Design based on Stored Logic

Let a set of functions be

Φ = {f1, f2, ..., fn}    (1)

and let us say that the architecture

ΛΦ = {Γf1, Γf2, ..., Γfn}    (2)

offers the functionality Φ if and only if there is an implementation Γfi of each of these functions in Λ. A processor is characterized by the set of functions it offers and by the way it implements them, that is, by its architecture Λ. According to these principles, a processor Π is defined as

Π(Φ, ΛΦ)    (3)

From the point of view of the user or programmer, there is no difference between the results given by a combinational processor and those given by a memory. A processor can offer the same functionality, or set of instructions, irrespective of which of these two options is implemented: the processor or computation system is a black box with different implementation alternatives. These alternatives may be a combinational circuit or a memory holding the precalculated results, which must be generated and written into it beforehand.

The conclusions drawn from the study of the state of the art in the previous section are that the main disadvantages of calculation methods using stored logic are the delay in obtaining the response and the area they take up. However, the new capacities in circuit manufacture offered by VLSI and ULSI high-density integration open up new possibilities in the design of arithmetic operators.

The main idea of this work is that a memory with precalculated results can behave like a processor with full functionality in the operations it offers. That is, the control logic and operator implementation that a processor provides can be included in the memory itself, making up its own architecture. In this approach, it is the inputs to the processor, the arguments of each function, that address the positions of the memory where the results are to be found. The simplest case is a processor that implements only one operator, a function f:

Π({f}, Γf)    (4)

In this case, the structure is very simple. The architecture consists of a memory holding the results of the function for all the instances of its operands at a specific precision, together with a set of registers for placing the operands and the calculated results. A program for this processor consists of a list of calls to this operator; the processor returns a result in each case according to the value of the function parameters and the contents of the memory they address. The design of this processor is represented in Figure 1a.
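A behavioral sketch of this single-operator processor of equation (4) follows: the architecture is just a memory addressed by the operand. The choice of integer square root as f and the 8-bit word length are our assumptions for the illustration.

```python
import math

N = 8  # assumed operand length in bits

# Gamma_f: the complete memory of precalculated results for f(a) = floor(sqrt(a)).
MEMORY = [math.isqrt(a) for a in range(2 ** N)]

def processor(operand: int) -> int:
    """One 'instruction': place the operand, read the result from memory."""
    return MEMORY[operand]

# A program is simply a list of calls to the operator.
print([processor(a) for a in (4, 10, 200)])  # [2, 3, 14]
```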

For processors that offer greater functionality, the memory elements can be organized in different ways. One option is to integrate one memory module for each of the functions the processor provides. In this case, the sizes of the memories can vary according to the operands of each function and the precision required in each of them, and a control module selects the result from the appropriate memory. A program for this processor is processed by means of a series of accesses to these memories, which return the results of the instructions; these results may, in turn, be taken as parameters for successive operations. Figure 1b shows a functional diagram of the structure of this processor.

Fig. 1: Stored logic based operator implementation. a) single operator; b) multiple operators

Alternatively, the processor can be implemented by means of a single memory that contains the results of all the operators it provides. In this way, one memory provides multiple functionality: an additional parameter indicates which section of the memory must be accessed to obtain the result of each operation. That is, one of the input parameters is used to segment the memory into portions in which the results of the different operations are placed.

It is easy to see how, by following this procedure, any set of functions can be integrated into a processor implemented by a single memory. For example, Figure 2 shows the contents of a memory that implements the addition, multiplication, square and square root operations for two-bit operands.

The memory address is formed by the operation code followed by the operands, and the memory content is the precalculated result:

addition (rows: operand 1; columns: operand 2):
        00    01    10    11
  00   0000  0001  0010  0011
  01   0001  0010  0011  0100
  10   0010  0011  0100  0101
  11   0011  0100  0101  0110

multiplication (rows: operand 1; columns: operand 2):
        00    01    10    11
  00   0000  0000  0000  0000
  01   0000  0001  0010  0011
  10   0000  0010  0100  0110
  11   0000  0011  0110  1001

square:
        00    01    10    11
       0000  0001  0100  1001

square root:
        00    01    10    11
       0000  0001  0001  0001

Fig. 2: Stored logic processor

Note that the operands and the operation together make up the address that the memory uses to obtain the processing result. With this architecture, the input is a 6-bit word: <operation, operand1, operand2>. In the following section the performance of the processor is studied in terms of cost, and the advantages and disadvantages of this architecture are analyzed.
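As a behavioral illustration of Fig. 2, the sketch below builds the 64-entry memory and addresses it with the 6-bit word <operation, operand1, operand2>. The encoding of the operation field and the treatment of the monadic operations (operand2 ignored) are our assumptions for the illustration.

```python
import math

OPS = [lambda a, b: a + b,            # 00: addition
       lambda a, b: a * b,            # 01: multiplication
       lambda a, b: a * a,            # 10: square (operand2 ignored)
       lambda a, b: math.isqrt(a)]    # 11: square root (operand2 ignored)

# Build the whole 64-entry memory: address = op(2 bits) | a(2 bits) | b(2 bits).
MEMORY = [OPS[addr >> 4]((addr >> 2) & 0b11, addr & 0b11) for addr in range(64)]

def processor(op: int, a: int, b: int = 0) -> int:
    return MEMORY[(op << 4) | (a << 2) | b]

print(processor(0b00, 2, 3))  # addition: 5
print(processor(0b11, 3))     # square root of 3 -> 1 (two-bit precision)
```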

4 Evaluation of the Design based on Memories

The previous section has shown how a memory can function as a processor by storing the precalculated results of the operations it implements. From the point of view of computability, an architecture based on stored logic is equivalent to one made with combinational circuits and logic gates: both are capable of solving the same computational problems.

The differences between the two architectures lie in complexity. Here, as in many other areas of information technology, there is a trade-off between the area cost of an architecture and the delay of its processing: reducing spatial complexity is usually to the detriment of temporal complexity, and vice versa. The degree to which the processing delay benefits from an increase in area depends on the characteristics of the function to be processed. Among these characteristics, research has found that the length of the operands and the precision of the results, together with the mathematical complexity of the algorithm that calculates the function, are the most influential. The study of these characteristics for each specific operator may indicate whether an operator based on stored logic is advisable. However, other aspects inherent to the use of memories, which offer advantages both in the processing itself and in processor design and implementation, must also be taken into account.

As mentioned previously, the implementation of circuits by means of stored logic reaches a higher VLSI integration density than methods based on combinational designs [35,36]. Recently, feature sizes of 32 nm have been achieved in chip manufacture, and the first circuits produced with such technologies are not processors but memory chips, which have less internal complexity than a processor. In addition, the use of stored logic reduces hardware development costs and limits the number and variability of the modules required in the design of microprocessors, which improves the maintainability of the arithmetic unit.

Memories also facilitate reusability and provide a high degree of parallelism: multiport memories with several access channels enable parallel results to be obtained from the same storage chip. Moreover, memory construction offers the system reliability and flexibility. The use of read-write memories enables several different functions to be configured in the same chip, and memory modules can incorporate error detection and correction elements and, therefore, strengthen the results they produce. Below, a more detailed analysis is given of the implications of our proposal for spatial and temporal cost, within a technology-independent framework.

4.1. Area cost

The number of operands, their length and the precision of the results directly determine the spatial cost of a circuit based on stored logic [18,28]. There are implementation proposals aimed at reducing this cost as much as possible. One of the most frequently used techniques consists of breaking the operands into parts, processing each part using stored logic, and subsequently combining the results [20,27]. Depending on the function to be calculated, data pre- or post-processing stages can be included to reduce the lengths of the operands and the results [34,35]. These strategies add complexity in exchange for a suitable area cost, so a compromise between operand length and the size of the memory required must be found. Generally, comparisons with combinational circuits show a clear disadvantage in area, as expected; however, this is the price to be paid for obtaining other kinds of benefits. Table 1 shows the area complexity of several operators based on stored logic.

H. Mora et al - Mathematical Model of Stored Logic Based Computation

Table 1: Area complexity for various operations

Operand length   Addition   Multiplication   Square    Square root
4                160 B      256 B            16 B      16 B
6                3.5 KB     6 KB             96 B      96 B
8                72 KB      128 KB           512 B     512 B
12               26 MB      48 MB            12 KB     12 KB
16               ≈ 8.5 GB   ≈ 16 GB          256 KB    256 KB

In order to compare the performance of operators based on stored logic with combinational designs, we express area cost in terms of the area of complex gates. Let τa be the area cost of a complex gate, such as a full adder. According to the analyses of Refs. 12, 16 and 37 (implementations using a family of standard gates from the AMS 0.35 µm CMOS library) and our own estimates [28], we assume a rate of 40 τa/Kbit for tables addressed by words of up to 6 bits, 35 τa/Kbit for tables with 7-11 input bits, 30 τa/Kbit for 12-13 input bits and 25 τa/Kbit for 14-16 input bits, where the rate τa/Kbit is the number of complex gates that correspond to each Kbit of memory.
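The area model can be stated compactly in code. The sketch below converts a table size in Kbit into complex-gate equivalents using the rates just given; the result widths of n+1 bits for addition and 2n bits for the square are consistent with the sizes in Table 1.

```python
def table_kbits(input_bits: int, result_bits: int) -> float:
    """Size in Kbit of a table with 2^input_bits entries of result_bits each."""
    return 2 ** input_bits * result_bits / 1024

def area_in_gates(input_bits: int, result_bits: int) -> float:
    """Area in units of tau_a (area of one complex gate, e.g. a full adder)."""
    if input_bits <= 6:
        rate = 40
    elif input_bits <= 11:
        rate = 35
    elif input_bits <= 13:
        rate = 30
    else:
        rate = 25  # 14-16 input bits
    return rate * table_kbits(input_bits, result_bits)

n = 8
print(area_in_gates(2 * n, n + 1))  # addition table for two n-bit operands
print(area_in_gates(n, 2 * n))      # square table for one n-bit operand
```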

4.2. Time cost

The computational delay is the time required to reach the memory and collect the operation result. It depends on the memory communication routes (access time) and on its internal structure and manufacturing technology (response time) [38,39]; it is constant and independent of the value of the operands. Let:

T(Γf) = temporal cost of executing function f according to implementation Γ;
T(ΔM) = response time of a memory according to implementation Δ;
TΔM = access time of the memory according to implementation Δ.

The following inequality determines the functions for which calculation methods based on look-up tables offer advantages over another implementation Γ of function f:

T(Γf) > T(ΔM) + TΔM    (5)

As the expression shows, the mathematical complexity of the algorithm that calculates the function f is an important aspect when deciding on the design of the operator. The more difficult the calculation of the function, the greater the cost T(Γf) of the combinational implementation, even more so if the function is derived from other, simpler primitives of the processor.

With a stored logic implementation, the cost T(ΔM) of obtaining data from a memory is constant and independent of the function to be calculated; it depends exclusively on the size of the memory, its architecture and its manufacturing technology [33,35,39]. In LUT-based design, advances in technology play a determining role in performance improvement and in the reduction of the time delay. Locating the look-up tables inside the arithmetic unit, next to the rest of the operation's logic, also reduces the access time TΔM [40]. Memories embedded inside the arithmetic unit, near the control nucleus of the processor, are therefore a good option for accommodating these operators [4,33].

The time delay can be expressed in terms of complex gates by means of the technology-independent model introduced in the previous section (Refs. 12, 16, 28 and 37), so that the costs of different operator implementations can be compared. Let τt be the delay of a complex gate. The stored logic model assumes a joint delay for T(ΔM) and TΔM of 3.5 τt for tables addressed by words of up to 8 bits, 5 τt for tables with 12-13 input bits and 6.5 τt for tables with 16 input bits.
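The delay side of the model, together with the decision rule of equation (5), is sketched below. The stated rates leave tables of 9-11 input bits unspecified; grouping them with the 5 τt band is our own assumption, and the combinational delay passed in would come from the literature for each operator.

```python
def lut_delay(input_bits: int) -> float:
    """Joint response + access delay of a table, in complex-gate delays tau_t."""
    if input_bits <= 8:
        return 3.5
    if input_bits <= 13:
        return 5.0  # assumption: 9-11 input bits grouped with the 12-13 band
    return 6.5      # up to 16 input bits

def stored_logic_wins(combinational_delay: float, input_bits: int) -> bool:
    """Equation (5): T(Gamma_f) > T(Delta_M) + T_Delta_M."""
    return combinational_delay > lut_delay(input_bits)

# Hypothetical example: a combinational design costing 12 tau_t on a 12-bit argument.
print(stored_logic_wins(12.0, 12))  # True: the LUT answers in 5 tau_t
```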

With these data and the time-cost estimates provided by other researchers in their work on the implementation of arithmetic functions, we can compare operator delay curves and decide under what conditions implementations based on stored logic compete with combinational designs. Functions of two operands have been left out of this comparison because they require much more memory than monadic operations to store their results; strategies to reduce the area required for these functions can be found in other works [27,28].

5 Construction of Arithmetic Units based on Stored Logic

The construction of an arithmetic unit based on the principles of stored logic described in this article is much more accessible than the manufacture of a unit with a combinational implementation. Low-level study of the computer is thus within the reach of researchers who do not have a processor factory in which to fabricate a combinational nucleus.

For example, we can build a small functional unit that provides the functions of the MS Windows calculator simply with a memory in which the results of these operations are stored. The set of functions this calculator provides in its scientific mode is the following:

Dyadic operations: Φ1 = {addition, product, division, power}

Monadic operations: Φ2 = {sine, cos, tan, exp, square, cube, log, ln, fact, inverse}

For a word length of 16 bits, each dyadic operation of Φ1 requires 8 GB in order to store all the operand combinations (without taking into account the fragmentation of operands or pre- and post-processing stages, which reduce this size). The monadic operations of Φ2, on the other hand, require only 128 KB each to store their results. Therefore, a storage capacity of a little over 32 GB is required to manufacture the unit.
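The storage requirement follows directly from the word length, as this short calculation shows: each dyadic table holds 2^32 results of 16 bits, and each monadic table 2^16 results of 16 bits.

```python
WORD = 16
DYADIC = 4    # addition, product, division, power
MONADIC = 10  # sine, cos, tan, exp, square, cube, log, ln, fact, inverse

dyadic_bytes = DYADIC * 2 ** (2 * WORD) * WORD // 8    # 8 GB per dyadic operation
monadic_bytes = MONADIC * 2 ** WORD * WORD // 8        # 128 KB per monadic operation

total_gb = (dyadic_bytes + monadic_bytes) / 2 ** 30
print(f"{total_gb:.3f} GB")  # a little over 32 GB, as stated above
```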

Although some important aspects, such as the execution control logic, the sequencing of instructions and the operating frequency, are omitted in this example, our aim is to show the versatility of this alternative. With this approach, the following questions arise: which is cheaper, a processor or a hard disk with sufficient capacity? And what storage capacity is necessary to provide suitable functionality for a given application?

In the field of computer architecture research, this operator design method opens up new possibilities for adapting processors to the requirements of specific applications. The Specialized Processor Architectures Laboratory of the University of Alicante has been researching new calculation methods for real-time applications [27,28]. One of the most interesting lines of research is the development of a real-time processor that especially considers the flexibility and determinism of calculations at the low level of the architecture.

Operations based on stored logic make up the arithmetic unit modules of this processor. Let us give a pair of simple implementation examples: the addition and multiplication operators. We briefly describe the implementation of these operations, in which the operands are fragmented to reduce spatial complexity. The temporal complexity and time cost of these designs are explained in depth in Refs. 27 and 28. The LUT implementation corresponds to the design presented in Refs. 38 and 39 and has been integrated into the rest of the circuit. Figure 3 shows a diagram of these implementations.


Fig. 3: Stored-logic implementation. a) Addition operator; b) multiplication operator

5.1. Addition operation

Addition is one of the most frequent operations of the arithmetic unit and has been the object of intense research. The proposed adder is based on the carry-select scheme: first, the operands are fragmented into blocks of k bits; next, the sum of each pair of blocks is obtained directly from a look-up table with precalculated results; finally, the block results are combined by means of carry selection. Table 2 shows the comparison with well-known adders [41], obtained after synthesis and FPGA simulation for several word lengths.

Table 2: Delays in the calculation of the sum (ns) for several word lengths

Adder                                8 bits   16 bits   32 bits   64 bits   128 bits
Carry Lookahead Adder (standard)     11.1     25.5      54.0      110.9     224.8
Parallel-Prefix Adder (Sklansky)     9.4      20.4      34.5      37.5      48.6
Parallel-Prefix Adder (Brent-Kung)   9.4      13.7      19.8      24.3      32.7
LUT based Adder (k=4)                3.3      4.8       7.3       13.1      23.2
LUT based Adder (k=8)                3.1      3.7       5.2       7.7       13.5
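A behavioral sketch of this adder follows; the Python list stands in for the hardware LUT, and carries are resolved sequentially here, whereas the hardware computes both precalculated outcomes per block in parallel and merely selects between them. The block size k = 4 matches one of the configurations in Table 2.

```python
K = 4
MASK = (1 << K) - 1

# LUT indexed by <a, b, cin>: the (k+1)-bit sum of two k-bit blocks.
SUM_LUT = [a + b + cin
           for a in range(1 << K) for b in range(1 << K) for cin in range(2)]

def lut_add(x: int, y: int, n: int) -> int:
    """Add two n-bit words by one table access per k-bit block pair."""
    result, carry = 0, 0
    for block in range(n // K):
        a = (x >> (K * block)) & MASK
        b = (y >> (K * block)) & MASK
        s = SUM_LUT[(a << (K + 1)) | (b << 1) | carry]   # one LUT access
        result |= (s & MASK) << (K * block)
        carry = s >> K                                   # select input for next block
    return result

print(lut_add(0xBEEF, 0x1234, 16) == 0xBEEF + 0x1234)  # True
```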

5.2. Multiplication operation

The calculation method is made up of the following steps: generation of the partial products, reduction of the number of partial products, and final addition. We focus on the first step; the other steps can be implemented by any of the known methods.

The partial product generation process is crucial to the operation's overall performance. Two aspects must be taken into account in its design: the complexity of the generating circuit and the number of partial products generated. The first aspect determines the time taken to generate each partial product, whereas the second affects the time invested in subsequently combining them into the final result. The proposed design consists of fragmenting the numbers to be multiplied into k-bit blocks and obtaining the product of each pair of blocks directly from a LUT k-multiplier. Table 3 shows the results obtained after synthesis and FPGA simulation for several block lengths k, in a homogeneous comparison.

Table 3: Delays of partial product generation in homogeneous comparison (ns)

Method       Generation delay   Reduction stage   Total delay
Simple       0.573              4.704             5.277
Booth2       2.181              2.352             4.533
LUT (k=4)    2.754              2.352             5.106
LUT (k=8)    3.132              0                 3.132
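The following behavioral sketch illustrates the partial product generation step: the factors are fragmented into k-bit blocks, every pair of blocks is multiplied by one access to a k-multiplier LUT, and the weighted partial products then go to an ordinary reduction and addition stage (here simply Python's sum, standing in for any of the known methods).

```python
K = 4
MASK = (1 << K) - 1
MUL_LUT = [a * b for a in range(1 << K) for b in range(1 << K)]  # 2k-bit products

def blocks(value: int, n: int):
    """Split an n-bit word into its k-bit blocks, least significant first."""
    return [(value >> (K * i)) & MASK for i in range(n // K)]

def lut_multiply(x: int, y: int, n: int) -> int:
    partials = [MUL_LUT[(a << K) | b] << (K * (i + j))   # one LUT access per pair
                for i, a in enumerate(blocks(x, n))
                for j, b in enumerate(blocks(y, n))]
    return sum(partials)  # reduction and final addition by any known method

print(lut_multiply(0xBEEF, 0x1234, 16) == 0xBEEF * 0x1234)  # True
```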


6 Conclusions

The new features provided by integration technologies open up new possibilities for processor development. The main idea of this paper is that although the implementation of a function by means of a memory with precalculated results may seem trivial, its integration into the design of processors can be considered as an alternative that carries out, totally or partially, the computable instructions. The use of a memory to perform the processing is a direct application of the classic Turing machine model.

The review of the use of memory in the processing of functions reveals a great amount of scientific work based on precalculated results, and more and more designers are taking this paradigm into account in their implementations. However, although the latest-generation processors incorporate more and more memory, there is still a reluctance to integrate large volumes of data inside them in spite of the advantages this brings, doubtless due to the excess power consumption or the area it would take up with current technology. We think that precalculated processing in memories will gain ground as its advantages in certain applications become known.

References

[1] P.P. Chu, Computation Theory in a Digital Systems Course, ASEE/IEEE Frontiers in Education Conference (2002).
[2] D.R. Simon, On the Power of Quantum Computation, SIAM Journal on Computing, 25 (1997), pp. 1474-1483.
[3] D. Woods, T.J. Naughton, An Optical Model of Computation, Theoretical Computer Science (2004).
[4] ISSCC Roundtable, Embedded Memories for the Future, IEEE Design and Test of Computers, 20 (4) (2003), pp. 66-81.
[5] S. Borkar, Exponential Challenges, Exponential Rewards — The Future of Moore's Law, IFIP WG 10.5 Conference on Very Large Scale Integration of System-on-Chip (2003).
[6] S. Areibi, A First Course in Digital Design Using VHDL and Programmable Logic, ASEE/IEEE Frontiers in Education Conference, 31 (2001), pp. 19-23.
[7] N.L.V. Calazans, F.G. Moraes, Integrating the Teaching of Computer Organization and Architecture with Digital Hardware Design Early in Undergraduate Courses, IEEE Transactions on Education, 44 (2001), pp. 109-119.
[8] A. Chandrakasan, W.J. Bowhill, F. Fox, Design of High-Performance Microprocessor Circuits, IEEE Press (2001).
[9] J.L. Hennessy, D.A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann (2002).
[10] M.J. Schulte, Optimal Initial Approximations for the Newton-Raphson Division Algorithm, Computing, 53 (1994).
[11] S.F. Oberman, M.J. Flynn, Division Algorithms and Implementations, IEEE Transactions on Computers, 46 (8) (1997), pp. 833-854.
[12] T. Lang, P. Montuschi, Very High Radix Square Root with Prescaling and Rounding and a Combined Division/Square Root Unit, IEEE Transactions on Computers, 48 (8) (1999), pp. 827-841.
[13] E. Antelo, T. Lang, J.D. Bruguera, Computation of √x/d in a Very High Radix Combined Division/Square-Root Unit with Scaling, IEEE Transactions on Computers, 47 (2) (1998), pp. 152-161.
[14] M. Ito, N. Takagi, S. Yajima, Efficient Initial Approximation and Fast Converging Methods for Division and Square Root, 12th IEEE Symposium on Computer Arithmetic (1995), pp. 2-9.
[15] H. Hassler, N. Takagi, Function Evaluation by Table Look-Up and Addition, 12th IEEE Symposium on Computer Arithmetic (1995), pp. 10-16.
[16] W.F. Wong, E. Goto, Fast Evaluation of the Elementary Functions in Single Precision, IEEE Transactions on Computers, 44 (3) (1995), pp. 453-457.
[17] M.D. Ercegovac, T. Lang, J.-M. Muller, A. Tisserand, Reciprocation, Square Root, Inverse Square Root, and Some Elementary Functions Using Small Multipliers, IEEE Transactions on Computers, 49 (7) (2000).
[18] P.T.P. Tang, Table Look-Up Algorithms for Elementary Functions and Their Error Analysis, IEEE Symposium on Computer Arithmetic (1991).
[19] D.M. Priest, Fast Table-Driven Algorithms for Interval Elementary Functions, IEEE Symposium on Computer Arithmetic (1991).
[20] E. Remes, Sur un procédé convergent d'approximations successives pour déterminer les polynômes d'approximation, C.R. Acad. Sci. Paris (1934).
[21] W.G. Horner, Philosophical Transactions of the Royal Society of London (1819).
[22] W.F. Wong, E. Goto, Fast Evaluation of the Elementary Functions in Single Precision, IEEE Transactions on Computers, 44 (3) (1995), pp. 453-457.
[23] S. Gal, Computing Elementary Functions: A New Approach for Achieving High Accuracy and Good Performance, in Accurate Scientific Computations, Lecture Notes in Computer Science, 235 (1986), pp. 1-16.
[24] S. Gal, B. Bachelis, An Accurate Elementary Mathematical Library for the IEEE Floating Point Standard, ACM Transactions on Mathematical Software, 17 (1) (1991), pp. 26-45.
[25] D.-U. Lee, O. Mencer, D.J. Pearce, W. Luk, Automating Optimized Table-with-Polynomial Function Evaluation for FPGAs, International Conference on Field-Programmable Logic and its Applications (2004), LNCS 3203, pp. 364-373.
[26] M.J. Schulte, J.E. Stine, Symmetric Bipartite Tables for Accurate Function Approximation, 13th IEEE Symposium on Computer Arithmetic (1997).
[27] H. Mora-Mora, J.M. Mora-Pascual, J.M. García-Chamizo, A. Jimeno-Morenilla, Real-Time Arithmetic Unit, Real-Time Systems, 34 (1) (2006), pp. 53-79.
[28] H. Mora-Mora, J. Mora-Pascual, J.L. Sánchez-Romero, J.M. García-Chamizo, Partial Product Reduction by Using Look-Up Tables for M×N Multiplier, Integration, the VLSI Journal, 41 (4) (2008).
[29] J. Volder, Binary Computation Algorithms for Coordinate Rotation and Function Generation, Convair Report IAR-1148, Aeroelectrics Group (1956).
[30] R. Andraka, A Survey of CORDIC Algorithms for FPGA Based Computers, International Symposium on Field Programmable Gate Arrays (1998).
[31] T.C. Chen, Automatic Computation of Exponentials, Logarithms, Ratios and Square Roots, Numerical Computation, 16 (1972).
[32] J.S. Walther, A Unified Algorithm for Elementary Functions, Spring Joint Computer Conference (1971), pp. 379-385.
[33] R. Rajsuman, Design and Test of Large Embedded Memories: An Overview, IEEE Design and Test of Computers, 18 (3) (2001), pp. 16-27.
[34] B. Parhami, Alternate Memory Compression Schemes for Modular Multiplication, IEEE Transactions on Signal Processing, 41 (1993), pp. 1378-1385.
[35] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, Oxford University Press (2000).
[36] D. DasSarma, D.W. Matula, Measuring the Accuracy of ROM Reciprocal Tables, IEEE Transactions on Computers, 43 (8) (1994), pp. 932-940.
[37] J.A. Piñeiro, M.D. Ercegovac, J.D. Bruguera, High-Radix Logarithm with Selection by Rounding, IEEE ASAP Conference (2002).
[38] H. Nambu, K. Kanetu, K. Higeta, M. Usami, T. Kusonoki, K. Yamaguchi, N. Homma, A 1.8 ns Access, 550 MHz, 4.5 Mb CMOS SRAM, IEEE ISSCC (1998).
[39] T. Wada, S. Rajan, S.A. Przybylski, An Analytical Access Time Model for On-Chip Cache Memories, IEEE Journal of Solid-State Circuits, 27 (8) (1992).
[40] S. Carr, Memory Hierarchy Management, PhD Thesis, Rice University (1993).
[41] R. Zimmermann, Binary Adder Architectures for Cell-Based VLSI and Their Synthesis, PhD Thesis, Swiss Federal Institute of Technology (1997).