Low-Power Embedded Processor
ECE 290 Final Report

Introduction:

An embedded processor is a processor that has been “embedded” into a device and can be programmed to interact with different pieces of hardware. In terms of performance, an embedded processor can outperform a microcontroller but falls short of a general-purpose microprocessor. Low-power embedded processors are used in a wide variety of applications, including cars, phones, digital cameras, and printers. They are widely used because they are small: they take up little die area and are cheap to fabricate. Embedded processors are also pre-verified, which eliminates the need to spend additional engineering man-hours tracking down hardware flaws. Another advantage is that they run software, which makes it possible to accommodate changing specifications as system requirements change.

Low-power processors are key to the realization of portable electronic devices, in which power consumption is an important factor. Low power consumption helps to reduce heat dissipation, lengthen battery life, and increase device reliability.

In this project we will implement a 16-bit RISC-type embedded processor that supports a pre-defined instruction set. The processor follows a RISC architecture because it allows for a simpler implementation of our design. Various power-saving techniques, such as reducing the supply voltage, clock gating, and full custom design, will be employed in the processor architecture.

Ways of reducing power consumption:

There are many ways to reduce the power consumption of a processor. The methods that we will employ in our design are listed below.

Reduced supply voltage: The power consumption (P) of a CMOS-based processor is related to the supply voltage (V), switching frequency (f), and CMOS gate capacitance (C).
The relationship can be described as: $P \propto C \cdot f \cdot V^2$

The above relationship shows that reducing the supply voltage reduces power consumption. However, we also need to be aware that, to a first approximation, the achievable switching frequency is directly proportional to the supply voltage; that is: $V \propto f$

Therefore, while lowering the supply voltage reduces power consumption, it also lowers the switching frequency and thus the processor speed. Speed must be traded off in order to reduce power usage.

Full custom design: Full custom design helps to reduce the number of logic gates needed to implement the processor's functionality. Since each logic gate consumes power when it switches, a lower gate count means less switching and thus lower power consumption.

Clock gating: Clock gating is a method in which certain parts of the processor are prevented from receiving the clock signal. If a part of the processor is not needed for a given operation, the clock signal to that part can be stopped. Since switching requires power and no switching takes place in the absence of the clock signal, gating the clock lowers power consumption.

Block Diagram:

The block diagram for the processor is shown in diagram 1 below.
Diagram 1: Processor Block Diagram
Program Counter (PC): The program counter holds the address of the next instruction to be fetched from the instruction memory. After each fetch cycle, the PC is incremented by 2 to point to the next instruction; if the previous instruction was a taken branch, the PC instead holds the branch target address.

Instruction Memory: As the name indicates, the instruction memory holds the instructions that the processor will execute. Since the address bus is 16 bits wide, the instruction memory can be at most $2^{16}$ bytes, or 64 kilobytes.

Control Unit: The control unit is responsible for decoding the opcode and generating the necessary control signals. The control signals generated by this unit go to the ALU, multiplier, data memory, register file, and the branch decide unit. These signals decide which module(s) to use for any given instruction.

Register file: This register file, with two read ports and one write port, contains all sixteen general-purpose registers supported by the processor. Each register is 16 bits wide. The unit supports two concurrent reads and one write in each clock cycle.

Sign extension unit: This unit takes an 8-bit input and sign-extends the value to 16 bits. It is needed for instructions that operate on immediate values: since the immediate values specified in instructions are 8 bits wide, they must be sign-extended to 16 bits before going to the next stage.

ALU Control: This unit provides the signal that specifies which ALU operation is to be performed on the present set of data.

Arithmetic Logic Unit (ALU): The ALU is responsible for performing all arithmetic and logical operations on data. Some of these operations require two operands, while others operate on only one. The operations include add, subtract, compare, and, or, not, xor, logical shift, and arithmetic shift. The output of the ALU goes either to the data memory (when the output is an address) or through a multiplexer back to the register file.

Multiplier: This unit is responsible for performing the multiplication operation. The inputs to this unit are two 16-bit numbers and the output is a 32-bit number. The output of the multiplier goes back to the register file through a multiplexer.

Flag Register: This register holds all the flag bits: Z (Zero), V (Overflow), C (Carry-out), and N (Negative).

Branch Decide Unit: This unit is responsible for deciding whether a branch is taken. It combines the Zero flag with the Branch signal from the control unit to make the decision. The output of this unit is a one-bit value, which is ‘1’ when the branch is taken and ‘0’ otherwise.

Data Memory: This unit is similar to the instruction memory, but it holds data instead of instructions. Like the instruction memory, the data memory can be at most 64 kilobytes.
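The behavior of a few of the units described above can be sketched in software. The following Python model is only an illustration, not the hardware design: the function names are hypothetical, the branch condition is assumed to be "Branch signal AND Zero flag", and the branch target is assumed to be used directly as the new PC value.

```python
MASK16 = 0xFFFF

def sign_extend8(imm8):
    """Sign-extend an 8-bit immediate to a 16-bit bit pattern."""
    imm8 &= 0xFF
    return (imm8 | 0xFF00) if (imm8 & 0x80) else imm8

def add16_flags(a, b):
    """16-bit addition that also returns the Z, V, C and N flag bits."""
    a &= MASK16
    b &= MASK16
    total = a + b
    result = total & MASK16
    flags = {
        "Z": int(result == 0),            # result is zero
        "C": int(total > MASK16),         # carry out of bit 15
        "N": (result >> 15) & 1,          # sign bit of the result
        # Overflow: both operands have a sign that differs from the result's sign.
        "V": int(((a ^ result) & (b ^ result) & 0x8000) != 0),
    }
    return result, flags

def next_pc(pc, branch, zero, target):
    """PC update: PC + 2 normally, or the branch target when the branch is taken.

    Assumption: the branch decide unit outputs branch AND zero.
    """
    taken = branch & zero
    return (target if taken else pc + 2) & MASK16
```

For example, `sign_extend8(0x80)` yields `0xFF80`, and `add16_flags(0x7FFF, 1)` sets both the N and V flags because the signed sum does not fit in 16 bits.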
Instruction Architecture:

The instruction set that the processor will support is shown in table 1 below.
Table 1: Instruction Table
Instruction Formats: There are three separate instruction formats for our processor. The first instruction format is a register-register operation. In this format, as in all formats, bits 15-12 represent the opcode. Bits 11-8 identify the register that is used as a source register. Bits 7-4 identify the register that serves as both a source register and the destination register, and bits 3-0 represent the function field. The function field is only valid if the opcode calls for an ALU operation; otherwise bits 3-0 are ignored.
Register-Register Instruction

The next instruction format is a register-immediate operation. Bits 11-8 now represent both the source register and the destination register. Bits 7-0 hold the immediate value that is used in the operation with the value in the specified register.
Register-Immediate Instruction

The last instruction format is that of a branch instruction. As in all formats, bits 15-12 are used as the opcode field; however, in this format bits 11-0 represent the target memory address to branch to.
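The three formats can be illustrated with a small field extractor. This Python sketch is an assumption-laden illustration: the function name `decode` and the field names (`rs`, `rd`, `funct`, `imm8`, `target`) are hypothetical, and because the opcode assignments are not reproduced here, it simply returns all three views of the word and leaves the choice of format to the caller.

```python
def decode(word):
    """Split a 16-bit instruction word into the fields of each format."""
    word &= 0xFFFF
    return {
        "opcode": (word >> 12) & 0xF,            # bits 15-12, common to all formats
        "reg_reg": {
            "rs": (word >> 8) & 0xF,             # bits 11-8: source register
            "rd": (word >> 4) & 0xF,             # bits 7-4: source and destination
            "funct": word & 0xF,                 # bits 3-0: ALU function field
        },
        "reg_imm": {
            "rd": (word >> 8) & 0xF,             # bits 11-8: source and destination
            "imm8": word & 0xFF,                 # bits 7-0: immediate value
        },
        "branch": {
            "target": word & 0xFFF,              # bits 11-0: branch target address
        },
    }
```

For the word `0x1A3C`, the opcode field is `0x1`; read as a register-register instruction the fields are `rs = 0xA`, `rd = 0x3`, `funct = 0xC`, and read as a branch the target is `0xA3C`.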
Branch Instruction

The list of opcodes and the instructions they correspond to is shown in table 2.

Table 2: Opcode Table

The ALU functions and the function codes associated with them are listed in table 3.
Table 3: ALU function Table

Multiplier:

The high-level block diagram of the multiplier is shown in diagram 2 below. It consists of four distinct components: the Booth Encoder, the Partial Product Generator, the Carry Save Adder, and the Carry Lookahead Adder. There are two main techniques that can be used to increase the speed of multiplication. The first is to reduce the number of partial products, and the second is to increase the speed at which the partial products are added. The proposed architecture employs both techniques. The individual components shown in diagram 2 are explained in detail below.
Diagram 2: Architecture of the Multiplier
Booth Encoder: This module encodes the 16-bit multiplier using the radix-4 Booth algorithm. Radix-4 encoding reduces the total number of multiplier digits by a factor of two, which in this case means a reduction from 16 digits to 8. The algorithm divides the multiplier into overlapping groups of three consecutive bits, where the highest bit of each group is also the lowest bit of the next group. Each group of three bits then corresponds to one of the numbers from the set {2, 1, 0, -1, -2}. Each encoder produces a 3-bit output: the first bit selects a magnitude of 1, the second bit selects a magnitude of 2, and the third bit indicates whether the selected value is negative. Since there are 16 input bits, there are a total of 8 Booth encoder modules in the overall multiplier architecture. The way the outputs are determined is shown in table 4 below.
Table 4: Booth algorithm
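The encoding step can be sketched in Python as follows. The lookup table below is the standard radix-4 Booth recoding; it is an assumption that it matches table 4 exactly (in particular, how the two all-zero and all-one groups set the sign bit may differ). The names `BOOTH_TABLE`, `booth_encode`, and `digit_value` are hypothetical.

```python
# Key: (b[2i+1], b[2i], b[2i-1]); value: (one, two, neg) encoder outputs.
BOOTH_TABLE = {
    (0, 0, 0): (0, 0, 0),  #  0
    (0, 0, 1): (1, 0, 0),  # +1
    (0, 1, 0): (1, 0, 0),  # +1
    (0, 1, 1): (0, 1, 0),  # +2
    (1, 0, 0): (0, 1, 1),  # -2
    (1, 0, 1): (1, 0, 1),  # -1
    (1, 1, 0): (1, 0, 1),  # -1
    (1, 1, 1): (0, 0, 0),  #  0
}

def booth_encode(multiplier):
    """Return the eight (one, two, neg) triples for a 16-bit multiplier."""
    padded = (multiplier & 0xFFFF) << 1   # append the implicit b[-1] = 0
    outputs = []
    for i in range(8):
        group = (padded >> (2 * i)) & 0b111
        key = ((group >> 2) & 1, (group >> 1) & 1, group & 1)
        outputs.append(BOOTH_TABLE[key])
    return outputs

def digit_value(one, two, neg):
    """Convert one encoder output back to a digit in {-2, -1, 0, 1, 2}."""
    magnitude = one + 2 * two
    return -magnitude if neg else magnitude
```

As a check on the recoding, weighting digit i by 4^i and summing the eight digits reproduces the multiplier interpreted as a signed 16-bit number; for the multiplier 6, the two lowest digits are -2 and +2, giving -2 + 2·4 = 6.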
Partial Product Generator (PPG): The output from the Booth encoder is used in this module to generate the partial products. Since there are eight Booth encoders, there are a total of eight partial products. Multiplication by two is implemented by shifting the multiplicand left by one bit, and negation is implemented by taking the two’s complement of the multiplicand. The architecture of the partial product generator is shown in diagram 3.

Diagram 3: Partial Product Generator
Each row of the diagram corresponds to one partial product. Even though the diagram does not show it, there are eight such rows, corresponding to the eight partial products. Also, each partial product is shifted two bits to the left relative to the partial product above it to account for the radix-4 Booth encoding of the multiplier.
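A minimal Python sketch of partial product generation is given below, assuming signed 16-bit operands and a 32-bit result. It is a behavioral model only; the function names are hypothetical, and the Booth digits are computed arithmetically rather than through the 3-bit encoder outputs.

```python
MASK32 = 0xFFFFFFFF

def booth_digits(y):
    """Radix-4 Booth digits of a 16-bit multiplier, least significant first."""
    y &= 0xFFFF

    def bit(k):
        return (y >> k) & 1 if k >= 0 else 0   # b[-1] is defined as 0

    return [-2 * bit(2 * i + 1) + bit(2 * i) + bit(2 * i - 1) for i in range(8)]

def partial_products(x, y):
    """Eight 32-bit partial products for signed 16-bit x (multiplicand) and y."""
    rows = []
    for i, d in enumerate(booth_digits(y)):
        if d == 0:
            value = 0
        elif abs(d) == 2:
            value = x << 1          # times two: shift the multiplicand left once
        else:
            value = x
        if d < 0:
            value = -value          # negation (two's complement after masking)
        # Row i is shifted left by 2*i bits relative to row 0.
        rows.append((value << (2 * i)) & MASK32)
    return rows
```

Summing the eight rows modulo 2^32 gives the two's complement product, so `sum(partial_products(x, y)) & MASK32` equals `(x * y) & MASK32` for any signed 16-bit `x` and `y`.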
Wallace Tree: This module is responsible for adding the partial products that were generated in the PPG module. It uses 3-to-2 carry save adders (CSAs) to implement the Wallace Tree. The individual CSAs are nothing more than full adders, except that the carry-ins and carry-outs are handled in a special way. Each column of bits in the partial products is added using this method. Diagram 4 below shows how this method works for adding 8 bits. The carry-outs generated in each stage of addition are transferred to the Wallace Tree of the column to the left, and the carry-ins come from the column to the right. The advantage of using a Wallace Tree structure is that, when adding eight bits, the result is available after only four full adder delays. If the same addition were performed using a ripple carry adder, it would require seven full adder delays. Therefore, although the structure of the adder is somewhat more complicated, it greatly increases the speed of addition.
Diagram 4: Wallace Tree
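The reduction can be sketched in Python by applying a 3-to-2 carry save adder to whole words at once, so that each bit position acts as one full adder and each carry moves one column to the left. This is a behavioral illustration under assumptions: the grouping of rows at each level is one possible arrangement and not necessarily the one in diagram 4, and the names `csa` and `wallace_reduce` are hypothetical.

```python
def csa(a, b, c):
    """3-to-2 carry save adder: a + b + c == sum_word + carry_word."""
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1   # carries go one column left
    return sum_word, carry_word

def wallace_reduce(operands, width=32):
    """Reduce the operands to at most two rows; also return the CSA level count."""
    mask = (1 << width) - 1
    rows = [r & mask for r in operands]
    levels = 0
    while len(rows) > 2:
        next_rows = []
        full = len(rows) - len(rows) % 3          # rows that form complete triples
        for k in range(0, full, 3):
            s, c = csa(rows[k], rows[k + 1], rows[k + 2])
            next_rows += [s & mask, c & mask]
        next_rows += rows[full:]                  # leftover rows pass through
        rows = next_rows
        levels += 1
    return rows, levels
```

With eight operands the row count goes 8, 6, 4, 3, 2, so the reduction takes four levels, in line with the four full adder delays mentioned above. The two remaining rows are then added by a conventional adder; in this model `sum(rows) & 0xFFFFFFFF` stands in for that final addition.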
Carry Lookahead Adder (CLA): Thi