Low-Power Embedded Processor


ECE 290 Final Report

Introduction:

An embedded processor is a processor that has been “embedded” into a device and can be programmed to interact with the device's other hardware.  In terms of performance, an embedded processor typically outperforms a microcontroller but falls short of a general-purpose microprocessor.

Low-power embedded processors are used in a wide variety of applications, including cars, phones, digital cameras, and printers.  They are widely used because they are small: they take up little die area and are cheap to fabricate.  Embedded processors are also pre-verified, which reduces the engineering hours spent tracking down hardware flaws.  Another advantage is that they run software, which makes it possible to adapt as system requirements and specifications change.

Low-power processors are key to realizing portable electronic devices, in which power consumption is an important factor.  Low power consumption helps to reduce heat dissipation, lengthen battery life, and increase device reliability.  In this project we will implement a 16-bit RISC-type embedded processor that supports a pre-defined instruction set.  The processor follows a RISC architecture because it allows for a simpler implementation of our design.  Various power-saving techniques, such as reduced supply voltage, clock gating, and full custom design, will be employed in the processor architecture.

Ways of reducing power consumption:

There are many ways to reduce power consumption of a processor.  Some of the methods that we will employ in our design are listed below.

Reduced supply voltage:  Power consumption (P) of a CMOS based processor is related to the supply voltage (V), switching frequency (f), and CMOS gate capacitance (C).  The relationship can be described as:

P ∝ C · f · V²

The above relationship shows that reducing the supply voltage reduces power consumption.  However, the achievable switching frequency is, to a first approximation, directly proportional to the supply voltage; that is:

f ∝ V

Therefore, while lowering the supply voltage reduces power consumption, it also lowers the switching frequency and thus the processor speed.  Speed must be traded for reduced power usage.
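The trade-off can be made concrete with a small calculation.  The Python sketch below combines the two proportionalities above; the function name and the choice to express everything relative to a nominal operating point are illustrative assumptions, not part of the design.

```python
def relative_power(v_scale: float, scale_frequency: bool = True) -> float:
    """Dynamic power relative to nominal when the supply is scaled by v_scale.

    Uses P ∝ C·f·V² with C held constant.  When scale_frequency is True,
    f is assumed to scale linearly with V (the f ∝ V approximation),
    so power scales with the cube of the voltage.
    """
    f_scale = v_scale if scale_frequency else 1.0
    return f_scale * v_scale ** 2


for v in (1.0, 0.8, 0.5):
    print(f"V x{v:.1f}: power x{relative_power(v):.3f}, speed x{v:.1f}")
```

Under these assumptions, halving the supply voltage cuts power to one eighth of nominal while also halving the clock speed.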

Full custom design:  Full custom design helps to reduce the number of logic gates needed to implement the processor's functionality.  Since each logic gate consumes power when it switches, a lower gate count means less switching and therefore lower power consumption.

Clock gating:  Clock gating is a method in which certain parts of the processor are prevented from receiving the clock signal.  If a part of the processor is not needed for a given operation, the clock signal to that part can be stopped.  Since switching requires power and no switching takes place in the absence of the clock signal, gating the clock lowers power consumption.
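As a rough illustration of the idea, the Python sketch below counts how many clock edges reach a gated block over a short instruction trace.  It assumes one instruction per cycle and that the multiplier is clocked only for Mul and Muli; the trace and the function name are hypothetical.

```python
# Instructions that need the multiplier (mnemonics from the instruction table).
MULTIPLY_INSTRUCTIONS = {"Mul", "Muli"}


def gated_clock_edges(trace, needed_by):
    """Count clock edges delivered to a block whose clock is gated.

    The block sees a clock edge only in cycles where the current
    instruction is in needed_by (gated clock = clock AND enable).
    """
    return sum(1 for instruction in trace if instruction in needed_by)


trace = ["Add", "Mul", "Lw", "Sub", "Muli", "Nop"]  # hypothetical program
edges = gated_clock_edges(trace, MULTIPLY_INSTRUCTIONS)
print(f"multiplier clocked in {edges} of {len(trace)} cycles")
```

Without gating, the multiplier would receive all six clock edges in this trace.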

Block Diagram:

The block diagram for the processor is shown in diagram 1 below.

Diagram 1: Processor Block Diagram

 

Program Counter (PC):  The program counter holds the address of the next instruction to be fetched from the instruction memory.  After each fetch cycle, the PC is incremented by 2 to point to the next instruction; if the previous instruction was a taken branch, the PC instead holds the branch target address.
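The update rule can be sketched in a few lines of Python.  The function and parameter names are illustrative, and wrapping the address at 16 bits is an assumption rather than something the report specifies.

```python
ADDRESS_MASK = 0xFFFF  # 16-bit address bus


def next_pc(pc: int, branch_taken: bool, branch_target: int) -> int:
    """Return the PC value for the next fetch cycle."""
    if branch_taken:
        return branch_target & ADDRESS_MASK
    # Instructions are 16 bits (2 bytes) wide, so step by 2.
    return (pc + 2) & ADDRESS_MASK
```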

Instruction Memory:  As the name indicates, the instruction memory holds the instructions that the processor will execute.  Since the address bus is 16 bits wide, the size of the instruction memory can be at most 2^16 bytes, or 64 kilobytes.

Control Unit:  The control unit is responsible for decoding the opcode and generating the necessary control signals.  The control signals generated by this unit go to the ALU, multiplier, data memory, register file, and the branch decide unit.  These signals decide which of the module(s) to use for any given instruction.

Register file:  This two-port register file contains all sixteen general-purpose registers supported by this processor.  Each register is 16 bits wide.  The unit supports two concurrent reads and one write in each clock cycle.

Sign extension unit:  This unit takes an 8-bit input and sign-extends the value to 16 bits.  It is necessary for instructions that operate on immediate values.  Since immediate values specified in instructions are 8 bits wide, they must be sign-extended to 16 bits before going to the next stage.
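Sign extension copies bit 7 of the immediate into bits 15-8 of the result.  A minimal Python sketch follows; the function name is illustrative.

```python
def sign_extend_8_to_16(imm8: int) -> int:
    """Sign-extend an 8-bit value to a 16-bit two's-complement pattern."""
    imm8 &= 0xFF
    if imm8 & 0x80:            # sign bit set: fill the upper byte with ones
        return imm8 | 0xFF00
    return imm8                # sign bit clear: upper byte stays zero


print(hex(sign_extend_8_to_16(0x7F)), hex(sign_extend_8_to_16(0x80)))
```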

ALU Control:  This unit provides the signal that specifies which ALU operation is to be performed on the present set of data.

Arithmetic Logic Unit (ALU):  The ALU is responsible for performing all the arithmetic and logical operations on data.  Some of these operations require two operands, while others operate on only one.  The operations include add, subtract, compare, and, or, not, xor, logical shift, and arithmetic shift.  The output of the ALU goes either to the data memory (when the output is an address) or through a multiplexer back to the register file.

Multiplier:  This unit is responsible for performing the multiplication operation.  Its inputs are two 16-bit numbers and its output is a 32-bit number.  The output of the multiplier goes back to the register file through a multiplexer.

Flag Register:  This register holds all the flag bits.  The bits are the Z (Zero), V (Overflow), C (Carry-out), and N (Negative).
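The report does not spell out how each flag is derived, so the Python sketch below uses the conventional definitions for a 16-bit addition as an assumption: Z is set when the result is zero, N mirrors bit 15, C is the carry out of bit 15, and V is set when both operands share a sign that differs from the result's sign.

```python
def add_with_flags(a: int, b: int):
    """16-bit add returning (result, flags) under conventional flag rules."""
    a &= 0xFFFF
    b &= 0xFFFF
    full = a + b                      # up to 17 bits
    result = full & 0xFFFF
    flags = {
        "Z": int(result == 0),
        "N": result >> 15,
        "C": full >> 16,
        # Overflow: both operand sign bits differ from the result sign bit.
        "V": int(((a ^ result) & (b ^ result) & 0x8000) != 0),
    }
    return result, flags


print(add_with_flags(0x7FFF, 0x0001))
```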

Branch Decide Unit:  This unit is responsible for deciding whether a branch is taken.  It compares the Zero flag with the Branch signal from the control unit to make the decision.  The output of this unit is a one-bit value, which is ‘1’ when the branch is taken and ‘0’ otherwise.
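One way to picture the decision is the Python sketch below.  The encoding of the Branch signal as a mnemonic string, and the mapping of Beq and Bne onto the Zero flag, are assumptions made for illustration; the report does not specify the signal encoding.

```python
def branch_decide(branch_kind, zero_flag: int) -> int:
    """Return 1 if the branch is taken, 0 otherwise.

    branch_kind is a hypothetical stand-in for the control unit's Branch
    signal: "Beq", "Bne", "Ba", or None for non-branch instructions.
    """
    if branch_kind == "Ba":
        return 1
    if branch_kind == "Beq":
        return int(zero_flag == 1)
    if branch_kind == "Bne":
        return int(zero_flag == 0)
    return 0
```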

Data Memory:  This unit is similar to the instruction memory unit, but it holds data instead of instructions.  As with the instruction memory, the maximum possible size of the data memory is 64 kilobytes.

 

Instruction Architecture:

The instruction set that the processor will support is shown in table 1 below.

Instruction  Description                  Arguments
-----------  ---------------------------  ------------------------
Add          Addition                     Register-Register
Addu         Unsigned addition            Register-Register
Addi         Addition (immediate)         Register-Immediate value
Sub          Subtraction                  Register-Register
Subu         Unsigned subtraction         Register-Register
Subi         Subtraction (immediate)      Register-Immediate value
Mul          Multiplication               Register-Register
Muli         Multiplication (immediate)   Register-Immediate value
Cmp          Compare                      Register-Register
And          AND                          Register-Register
Andi         AND (immediate)              Register-Immediate value
Or           OR                           Register-Register
Ori          OR (immediate)               Register-Immediate value
Not          NOT                          Register
Xor          XOR                          Register-Register
Sll          Logical shift left           Register
Srl          Logical shift right          Register
Sla          Arithmetic shift left        Register
Sra          Arithmetic shift right       Register
Lw           Load word                    Register-Register
Sw           Store word                   Register-Register
Mov          Move data between registers  Register-Register
Movi         Move data (immediate)        Register-Immediate value
Beq          Branch if equal to 0         Memory Target
Bne          Branch if not equal to 0     Memory Target
Ba           Branch always                Memory Target
movy         Move data from Y register    Y-Register
Nop          No operation                 N/A

Table 1: Instruction Table

 

Instruction Formats:

There are three separate instruction formats for our processor.  The first is the register-register format.  In this format, as in all formats, bits 15-12 hold the opcode.  Bits 11-8 identify the source register.  Bits 7-4 identify the register that serves as both a source register and the destination register, and bits 3-0 hold the function field.  The function field is valid only if the opcode calls for an ALU operation; otherwise bits 3-0 are ignored.

 

Bits:   15-12    11-8   7-4   3-0
Field:  opcode   Rs     Rd    Function

Register-Register Instruction

The next instruction format is the register-immediate format.  Bits 11-8 now identify the register that serves as both the source and the destination.  Bits 7-0 hold the immediate value that is used in the operation together with the value in the specified register.

 

Bits:   15-12    11-8   7-0
Field:  opcode   Rs     Immediate

Register-Immediate Instruction

The last instruction format is the branch format.  As in all formats, bits 15-12 hold the opcode; in this format, however, bits 11-0 hold the target memory address of the branch.

 

Bits:   15-12    11-0
Field:  opcode   Branch target

Branch Instruction
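Because all three formats share the opcode field, a decoder can slice every field out of the 16-bit word and let the opcode determine which ones are meaningful.  The Python sketch below shows the bit slicing; the function name and dictionary keys are illustrative.

```python
def decode_fields(word: int) -> dict:
    """Slice a 16-bit instruction word into every field of the three formats."""
    word &= 0xFFFF
    return {
        "opcode": (word >> 12) & 0xF,    # bits 15-12, all formats
        "rs": (word >> 8) & 0xF,         # bits 11-8, register formats
        "rd": (word >> 4) & 0xF,         # bits 7-4, register-register
        "function": word & 0xF,          # bits 3-0, register-register
        "immediate": word & 0xFF,        # bits 7-0, register-immediate
        "branch_target": word & 0xFFF,   # bits 11-0, branch
    }


print(decode_fields(0xD12C))
```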

The opcodes and the instructions they correspond to are shown in table 2.

Opcode  Instruction  Function Code
------  -----------  -------------
0000    Nop          N/A
0001    Andi         N/A
0010    Addi         N/A
0011    Subi         N/A
0100    Ori          N/A
0101    Lw           N/A
0110    Sw           N/A
0111    Mov          N/A
1000    Movi         N/A
1001    Beq          N/A
1010    Bne          N/A
1011    Ba           N/A
1100    movy         N/A
1101    ALUop        0000-1100
1110    Mul          N/A
1111    Muli         N/A

Table 2: Opcode Table

The ALU functions and their associated function codes are listed in table 3.

ALU Function  Function Code
------------  -------------
Add           0000
Sub           0001
Addu          0010
Subu          0011
Cmp           0100
And           0101
Or            0110
Not           0111
Sll           1000
Srl           1001
Sla           1010
Sra           1011
Xor           1100

Table 3: ALU function Table
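To show how the opcode and function-code tables combine with the instruction formats, the Python sketch below assembles two instruction words.  The dictionaries copy a subset of tables 2 and 3; the function names are illustrative and no assembler of this form is described in the report.

```python
ALU_OPCODE = 0b1101  # "ALUop" row of the opcode table

# Subset of the opcode table (register-immediate instructions).
IMMEDIATE_OPCODES = {"Andi": 0b0001, "Addi": 0b0010, "Subi": 0b0011,
                     "Ori": 0b0100, "Movi": 0b1000, "Muli": 0b1111}

# The ALU function table.
ALU_FUNCTIONS = {"Add": 0b0000, "Sub": 0b0001, "Addu": 0b0010,
                 "Subu": 0b0011, "Cmp": 0b0100, "And": 0b0101,
                 "Or": 0b0110, "Not": 0b0111, "Sll": 0b1000,
                 "Srl": 0b1001, "Sla": 0b1010, "Sra": 0b1011,
                 "Xor": 0b1100}


def encode_alu(name: str, rs: int, rd: int) -> int:
    """Register-register format: opcode | Rs | Rd | function."""
    return (ALU_OPCODE << 12) | ((rs & 0xF) << 8) | ((rd & 0xF) << 4) | ALU_FUNCTIONS[name]


def encode_immediate(name: str, reg: int, imm: int) -> int:
    """Register-immediate format: opcode | Rs | 8-bit immediate."""
    return (IMMEDIATE_OPCODES[name] << 12) | ((reg & 0xF) << 8) | (imm & 0xFF)


print(hex(encode_alu("Xor", 1, 2)), hex(encode_immediate("Addi", 3, -1)))
```

Masking the immediate with 0xFF stores a negative value such as -1 as its 8-bit two's-complement pattern.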

Multiplier:

The high-level block diagram of the multiplier is shown in diagram 2 below.  It consists of four distinct components: the Booth Encoder, the Partial Product Generator, the Carry Save Adder, and the Carry Lookahead Adder.  There are two main techniques for increasing the speed of multiplication.  The first is to reduce the number of partial products, and the second is to increase the speed at which the partial products are added.  The proposed architecture employs both techniques.  The individual components shown in diagram 2 are explained in detail below.

Diagram 2: Architecture of the Multiplier

 

Booth Encoder:  This module encodes the 16-bit multiplier using the radix-4 Booth algorithm.  Radix-4 encoding halves the total number of multiplier digits, which in this case reduces them from 16 to 8.  The algorithm divides the original multiplier into overlapping groups of three consecutive bits, where the outermost bit of each group is shared with the adjacent group.  Each group of three bits then corresponds to one of the numbers from the set {2, 1, 0, -1, -2}.  Each encoder produces a 3-bit output: one bit selects the multiplicand itself (1), one selects twice the multiplicand (2), and the third (NEG) indicates whether the selected multiple is negative.  Since there are 16 input bits, there are a total of 8 Booth encoder modules in the overall multiplier architecture.  The way the outputs are determined is shown in table 4 below.

  

Multiplier Bits            Output bits     Operation on
y_{i+1}  y_i  y_{i-1}      NEG   2   1     Multiplicand
-------  ---  -------      ---   -   -     ------------
   0      0      0          0    0   0     0x
   0      0      1          0    0   1     +1x
   0      1      0          0    0   1     +1x
   0      1      1          0    1   0     +2x
   1      0      0          1    1   0     -2x
   1      0      1          1    0   1     -1x
   1      1      0          1    0   1     -1x
   1      1      1          1    0   0     0x

Table 4: Booth algorithm
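The encoding in table 4 can be checked with a short Python sketch.  The lookup table below copies table 4; the function names are illustrative, and y_{-1} is taken to be 0 as in the standard algorithm.  The eight recoded digits, weighted by powers of 4, sum to the two's-complement value of the 16-bit multiplier.

```python
# (y_{i+1}, y_i, y_{i-1}) -> (NEG, 2, 1), copied from the Booth table
BOOTH_TABLE = {
    (0, 0, 0): (0, 0, 0), (0, 0, 1): (0, 0, 1),
    (0, 1, 0): (0, 0, 1), (0, 1, 1): (0, 1, 0),
    (1, 0, 0): (1, 1, 0), (1, 0, 1): (1, 0, 1),
    (1, 1, 0): (1, 0, 1), (1, 1, 1): (1, 0, 0),
}


def booth_encode(multiplier: int):
    """Return the 8 (NEG, 2, 1) triples for a 16-bit multiplier, low group first."""
    y = multiplier & 0xFFFF
    # bits[0] is the implicit y_{-1} = 0; bits[k + 1] is y_k.
    bits = [0] + [(y >> k) & 1 for k in range(16)]
    return [BOOTH_TABLE[(bits[i + 2], bits[i + 1], bits[i])]
            for i in range(0, 16, 2)]


def digit_value(neg: int, two: int, one: int) -> int:
    """Convert one encoder output to its digit in {-2, -1, 0, 1, 2}."""
    magnitude = 2 * two + one
    return -magnitude if neg else magnitude


def booth_value(multiplier: int) -> int:
    """Sum the recoded digits with radix-4 weights."""
    return sum(digit_value(*group) * 4 ** k
               for k, group in enumerate(booth_encode(multiplier)))


print(booth_encode(0x0003)[:2], booth_value(0x0003))
```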

 

Partial Product Generator (PPG):  This module uses the output of the Booth encoders to generate the partial products.  Since there are eight Booth encoders, there are a total of eight partial products.  Multiplication by two is implemented by shifting the multiplicand left by one bit, and negation is implemented by taking the two’s complement of the multiplicand.  The architecture of the partial product generator is shown in diagram 3.

Diagram 3: Partial Product Generator

 

Each row of the diagram corresponds to one partial product.  Even though the diagram does not show it, there are eight such rows corresponding to eight partial products.  Also, each partial product is shifted two bits to the left relative to the partial product above it to account for the radix 4 Booth encoding of the multiplier.
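The Python sketch below generates the eight shifted partial products and confirms that they sum to the product.  It treats both operands as 16-bit two's-complement values and computes each Booth digit arithmetically as -2·y_{i+1} + y_i + y_{i-1}, which matches the encoder table; the hardware's shift and two's-complement steps are modelled here with ordinary integer arithmetic, and the function names are illustrative.

```python
def to_signed16(x: int) -> int:
    """Interpret the low 16 bits of x as a two's-complement value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def partial_products(multiplicand: int, multiplier: int):
    """Return the 8 radix-4 Booth partial products, already shifted into place."""
    x = to_signed16(multiplicand)
    y = multiplier & 0xFFFF
    products = []
    previous = 0                       # y_{-1}
    for row in range(8):
        y_low = (y >> (2 * row)) & 1
        y_high = (y >> (2 * row + 1)) & 1
        digit = -2 * y_high + y_low + previous   # in {-2, -1, 0, 1, 2}
        # Each row sits two bit positions to the left of the row above it.
        products.append((digit * x) << (2 * row))
        previous = y_high
    return products


rows = partial_products(7, 3)
print(rows, sum(rows))
```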

 

Wallace Tree:  This module adds the partial products generated in the PPG module.  It uses 3-to-2 carry save adders (CSAs) to implement the Wallace Tree.  The individual CSAs are ordinary full adders; the difference lies in how their carry-ins and carry-outs are connected.  Each column of bits in the partial products is added using this method.  Diagram 4 below shows how the method works for adding 8 bits.  The carry-outs generated in each stage of addition are passed to the Wallace Tree of the next column to the left, and the carry-ins come from the column to the right.  The advantage of the Wallace Tree structure is that the result of adding eight bits is available after only four full-adder delays.  Performing the same addition with a ripple carry adder would require seven full-adder delays.  Therefore, although the adder structure is somewhat more complicated, it considerably increases the speed of addition.

 

Diagram 4: Wallace Tree
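The reduction can be modelled at word level in Python.  In the sketch below each 3-to-2 carry save adder compresses three operands into a sum word and a carry word, and levels are applied until two operands remain; the final two-operand addition stands in for the carry lookahead adder.  The grouping order and function names are illustrative assumptions, and operands are taken to be non-negative integers.

```python
def carry_save_add(a: int, b: int, c: int):
    """3-to-2 compressor: a bitwise full adder with no carry propagation."""
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1  # carries move one column left
    return sum_word, carry_word


def wallace_reduce(operands):
    """Reduce operands to two with CSA levels; return (total, level count)."""
    current = list(operands)
    levels = 0
    while len(current) > 2:
        grouped = len(current) - len(current) % 3
        reduced = []
        for i in range(0, grouped, 3):
            reduced.extend(carry_save_add(*current[i:i + 3]))
        reduced.extend(current[grouped:])   # leftovers pass to the next level
        current = reduced
        levels += 1
    return sum(current), levels             # final add stands in for the CLA


print(wallace_reduce([1, 2, 3, 4, 5, 6, 7, 8]))
```

With eight operands the operand count shrinks 8, 6, 4, 3, 2, which gives the four full-adder levels mentioned above.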

 

Carry Lookahead Adder (CLA):

Thi