Introduction to multicellular processors

General description

The multicellular processor is the first processor with a fundamentally new (post-von Neumann) multicellular architecture. It is designed for control and digital signal processing tasks in applications that require both minimal power consumption and high performance, for instance audio processing.

The multicellular processor consists of 4, 8 or 16 cells connected by an intelligent commutation environment. The instruction set of the cells is based on a language of triads. The supported data types are integer and fractional numbers (both signed and unsigned) of single precision, 16 (24) bits, or double precision, 32 (48) bits, as well as signed and unsigned fractional packed (complex) numbers of single precision, 32 (48) bits.
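The fractional and packed complex types can be illustrated with a minimal sketch. It assumes that a 16-bit signed fractional number is a conventional fixed-point value with 1 sign bit and 15 fraction bits, and that a packed complex number holds the real part in the upper 16 bits and the imaginary part in the lower 16 bits. Both are assumptions made for illustration; the text above does not specify the bit layout, and all function names are hypothetical.

```python
FRAC_BITS = 15  # assumption: 1 sign bit + 15 fraction bits in a 16-bit word


def to_frac16(x):
    """Quantize a real number to a signed 16-bit fractional integer, saturating."""
    n = round(x * (1 << FRAC_BITS))
    return max(-(1 << FRAC_BITS), min((1 << FRAC_BITS) - 1, n))


def from_frac16(n):
    """Convert a signed 16-bit fractional integer back to a float."""
    return n / (1 << FRAC_BITS)


def pack_complex(re, im):
    """Pack two fractional values into one 32-bit word (real part high: an assumption)."""
    return ((to_frac16(re) & 0xFFFF) << 16) | (to_frac16(im) & 0xFFFF)


def unpack_complex(word):
    """Split a 32-bit packed word into (real, imaginary) floats."""
    def signed16(v):
        return v - 0x10000 if v & 0x8000 else v
    return from_frac16(signed16(word >> 16)), from_frac16(signed16(word & 0xFFFF))


word = pack_complex(0.5, -0.25)
print(hex(word), unpack_complex(word))
```

Under these assumptions the representable range is [-1, 1), so a value of 1.0 saturates to the largest positive code.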

Architecture features

1. The multicellular architecture differs from the von Neumann model in that informational connections (data dependencies) between operations are indicated directly. Consequently, the operations need not be described in the program in any particular order.

This lack of ordering makes techniques such as superscalar execution, VLIW, superpipelining and branch prediction unnecessary. These techniques increase execution speed, but they also dramatically complicate the design of the processor and of the development software (compilers, debuggers) and increase their cost.
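The idea of directly indicated connections can be sketched in a few lines. The triad encoding below (a name, an operation and two operands, where an operand is either a literal or the name of another triad) is a hypothetical stand-in, not the actual instruction format of the processor. The sketch only shows that, once connections are explicit, the textual order of the operations does not affect the result.

```python
import random

# Hypothetical triad encoding (not the real instruction format):
# (name, (operation, left operand, right operand)).
# A string operand names the triad whose result is consumed.
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}


def run(triads):
    """Execute triads in data-driven fashion and return all results by name."""
    results = {}
    pending = dict(triads)
    while pending:
        # A triad is ready when each operand is a literal or already computed.
        ready = [
            name for name, (_, a, b) in pending.items()
            if all(not isinstance(x, str) or x in results for x in (a, b))
        ]
        if not ready:
            raise ValueError("unresolved or cyclic connections")
        for name in ready:
            op, a, b = pending.pop(name)
            left = results[a] if isinstance(a, str) else a
            right = results[b] if isinstance(b, str) else b
            results[name] = OPS[op](left, right)
    return results


# (2 + 3) * (7 - 4), first in "natural" order, then shuffled.
program = [
    ("t1", ("add", 2, 3)),
    ("t2", ("sub", 7, 4)),
    ("t3", ("mul", "t1", "t2")),
]
shuffled = program[:]
random.shuffle(shuffled)

print(run(program)["t3"], run(shuffled)["t3"])
```

Both calls yield 15 for `t3`, whatever order the shuffle produces, because each triad fires only when the values it refers to exist.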

2. It differs from the well-known non-von Neumann architectures in that it uses sequential instruction fetching, which supports imperative programming languages, and in that informational connections are indicated by dynamically generated tags rather than by instruction addresses. An instruction is executed when its data are ready and the consumers of its output are ready.

3. The cell instruction set is based on an intermediate representation of a compiled program after syntax analysis (triads); in effect, it is a hardware implementation of the input programming language. This minimizes the effort needed to create compilers, because the machine-oriented optimization and parallelization stages disappear and the code generation stage shrinks dramatically. The notion of “assembly programming” disappears, since the processor language is not visible and the processor is thus “not programmable” at that level. The software becomes truly hardware-independent.

4. If necessary, the unordered triads make it possible to produce an individual object code for every processor after every compilation. This, together with the closed nature of the triad subsets, makes unauthorized, covert external interference with the system software practically impossible.

5. Because the system code is individual and unprivileged users program only in a high-level language, new and effective tools against viruses can be created.

6. The triads make it possible to fetch and execute several instructions simultaneously without analyzing their execution order or informational connections; that is, they provide a “natural” realization of parallelism, which is inherent in the mechanisms by which operations are executed. The multicellular processor has no hardware that identifies informational connections between fetched operations (instructions) and distributes them among functional units, so there is no dynamic parallelization. There is no static parallelization either, because the triad program describes informational connections and has a linear structure, but contains no indications of what can be done in parallel or how.

7. The fully connected intelligent commutation environment, operating in “broadcast” mode, supports efficient execution of any type of task, since it imposes no topological restrictions on data exchange between cells.

All these architectural features make the processor both universal and efficiently scalable: performance increases almost in direct proportion to the number of cells.

8. The compiled program can be executed on any number of cells. Moreover, the number of cells can change dynamically, which allows the processor to degrade gracefully when cells fail. The processor can reconfigure itself and remain functional as long as the commutation environment and at least one cell are operable.

This independence of the code from the resources used allows the processor to adapt continuously to the task flow and, when new resources are added, to repair itself after failures.
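The independence of the code from the number of cells can be illustrated with a simplified model. The sketch below is a synchronous, step-based simulation with a hypothetical triad encoding and a hypothetical `cells_at_step` callback; the real processor is asynchronous and its scheduling is not described here. It only shows that one unchanged program gives the same result on four cells, on one cell, and when cells fail during the run.

```python
# Hypothetical triad encoding: (name, (operation, left, right));
# a string operand names the triad whose result is consumed.
OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}


def run_on_cells(triads, cells_at_step):
    """Fire at most cells_at_step(step) ready triads per step.

    Returns (results, number_of_steps).
    """
    results, pending, step = {}, dict(triads), 0
    while pending:
        cells = cells_at_step(step)
        if cells < 1:
            raise RuntimeError("no operable cell left")
        ready = [
            name for name, (_, a, b) in pending.items()
            if all(not isinstance(x, str) or x in results for x in (a, b))
        ]
        if not ready:
            raise ValueError("unresolved or cyclic connections")
        for name in ready[:cells]:
            op, a, b = pending.pop(name)
            left = results[a] if isinstance(a, str) else a
            right = results[b] if isinstance(b, str) else b
            results[name] = OPS[op](left, right)
        step += 1
    return results, step


# Dot product 1*5 + 2*6 + 3*7 + 4*8 written as seven triads.
program = [
    ("p0", ("mul", 1, 5)), ("p1", ("mul", 2, 6)),
    ("p2", ("mul", 3, 7)), ("p3", ("mul", 4, 8)),
    ("s0", ("add", "p0", "p1")), ("s1", ("add", "p2", "p3")),
    ("s2", ("add", "s0", "s1")),
]

full, steps_full = run_on_cells(program, lambda step: 4)
one, steps_one = run_on_cells(program, lambda step: 1)
# Three of the four cells fail after the first step.
degraded, steps_degraded = run_on_cells(program, lambda step: 4 if step == 0 else 1)

print(full["s2"], one["s2"], degraded["s2"])
print(steps_full, steps_one, steps_degraded)
```

In this model the result is 70 in all three runs; only the number of steps changes (3 with four cells, 7 with one cell, 4 in the degraded run).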

9. The asynchronous and decentralized organization of the multicellular processor, both at the system level (between cells, during parallel execution) and at the intracellular level (between the units of a cell, during instruction execution), additionally provides:

a minimal number of design objects and reduced complexity;

a smaller die area, because decentralized control requires less hardware than centralized control;

a severalfold increase in performance and decrease in power consumption (see features), because the computing process is organized efficiently;

the possibility of an individual synchronization system for every cell when, in the future, tens or hundreds of cells are implemented on one chip.

As a result, the system is well structured and modular, which considerably simplifies the processor and consequently reduces labour costs and improves design quality.

Status

To date, the processor description for 4, 8 and 16 cells has been developed and tested on an RTL model. The 4-cell processor has been tested on an FPGA model (XC2V4000) and synthesized for a 0.18 µm, 1.8 V technological process (versions: 10 MHz/40 MIPS; 50 MHz/200 MIPS). Estimates of performance and power consumption have been obtained.

Properties

Table 1. Performance

	MCc0401100000	MCc0801100000	MCc1601100000	TI C64xx-C*	TMS320C647x**
Operations fetched and executed per clock cycle	4	8	16	8	24 (3×8)
CFFT-256 (clock cycles)	1192 (radix-2)	639 (radix-2)	338 (radix-2)	1246 (radix-4)	806 (radix-4)

* Single-core processor. The C64xx core has a VLIW architecture. The instruction word consists of 8 fields that specify 8 operations which can be processed simultaneously. See: Buyer's Guide to DSP Processors, 2001 Edition (Berkeley Design Technology, Inc. (BDTI)), p. 645.

** The processor includes three C64xx cores. See:
http://focus.ti.com/dsp/docs/dspplatformscontento.tsp?sectionId=2&familyId=1635&tabId=2432
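The CFFT-256 cycle counts in Table 1 give a rough check of how performance scales with the number of cells. The short calculation below uses only the three multicellular figures from the table and takes the 4-cell version as the baseline; the variable names are illustrative.

```python
# CFFT-256 clock cycles from Table 1, keyed by number of cells.
cycles = {4: 1192, 8: 639, 16: 338}
base_cells = 4

speedups = {}
for cells, count in cycles.items():
    speedup = cycles[base_cells] / count      # measured gain over 4 cells
    ideal = cells / base_cells                # gain if scaling were exactly linear
    speedups[cells] = (speedup, speedup / ideal)
    print(f"{cells:2d} cells: speedup {speedup:.2f}x, "
          f"{speedup / ideal:.0%} of the ideal {ideal:.0f}x")
```

On this benchmark, doubling the cells from 4 to 8 gives about 1.87x, and going to 16 cells gives about 3.53x, that is, roughly 93% and 88% of linear scaling.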

Table 2. Power consumption

Property	Unit	MCc0401100000 (synthesis results)	MCc0401100000 (design value*)	MCc0401100000 (forecast**)	TMS320VC5504***
Topological norm	µm	0.18	0.13	0.13	0.09
Voltage	V	1.8	1.2	1.2	1.05
Power consumption for the CFFT-256 task	µW/MHz	590	136.6	54.6	-
Power consumption for the CFFT-256 task	µW/MIPS	147.5	34.1	13.64	-
Power consumption for a mix of 75% DMAC + 25% ADD (Typical Sine Wave Data Switching)	µW/MHz	425	98.4	39.4	150
Power consumption for a mix of 75% DMAC + 25% ADD (Typical Sine Wave Data Switching)	µW/MIPS	106.2	24.6	9.8	75

* The calculation takes into account only the reduction of the topological norm and the voltage.

** The forecasts assume a 60% decrease in power consumption after RTL code optimization (full-scale application of the following power reduction methods: “clock gating”, “operand isolation for functional units”, “operand isolation for multiplexers”, “latching of register addresses in the instruction decoder”; see http://www.retarget.com/resources/pdfs/goossens-ip07.pdf ).

*** The TMS320VC5504 processor was announced in August 2009 as a processor with ultra-low power consumption. See: http://focus.ti.com/lit/ds/symlink/tms320vc5504.pdf .
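The relations between the columns of Table 2 can be checked with a little arithmetic. The sketch below assumes that the forecast is the design value reduced by the 60% stated in footnote **, and that µW/MIPS is µW/MHz divided by 4 operations per clock cycle for the 4-cell processor (the 10 MHz/40 MIPS ratio given in the Status section). The function names are illustrative.

```python
OPS_PER_CYCLE = 4            # 4-cell processor: 10 MHz / 40 MIPS (Status section)
OPTIMIZATION_SAVING = 0.60   # footnote **: 60% lower power after RTL optimization


def forecast(design_value):
    """Forecast = design value reduced by the expected optimization saving."""
    return design_value * (1 - OPTIMIZATION_SAVING)


def per_mips(uw_per_mhz, ops_per_cycle=OPS_PER_CYCLE):
    """Convert µW/MHz to µW/MIPS, assuming ops_per_cycle operations per clock."""
    return uw_per_mhz / ops_per_cycle


# (task, design value in µW/MHz) taken from Table 2.
for task, design in [("CFFT-256", 136.6), ("75% DMAC + 25% ADD", 98.4)]:
    value = forecast(design)
    print(f"{task}: forecast {value:.2f} µW/MHz, {per_mips(value):.2f} µW/MIPS")
```

Under these assumptions the computed values agree with the Forecast column and with the µW/MIPS rows to within rounding.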

11.10.2011