[Direct Feature]
Power Play For The SoC Developers
Chris Rowen explains how computational performance can be boosted by flexible length instruction extensions.
Chris Rowen
ED Online ID #9459
September 2004
The Xtensa LX processor uses Tensilica's innovative FLIX (Flexible Length Instruction eXtensions) architecture, a highly efficient implementation of the Xtensa instruction set architecture (ISA) that gives designers more options for cost/performance tradeoffs. FLIX technology provides the flexibility to freely and modelessly intermix single-operation RISC instructions, simple- and compound-operation TIE instructions, and multiple-operation FLIX instructions. By packing multiple operations into a wide 32- or 64-bit instruction word, FLIX technology allows designers to accelerate a broader class of 'hot spots' in embedded applications, while eliminating the performance and code-size drawbacks of VLIW processor architectures.
Instruction-set performance relates to the number of useful operations that can be executed per unit of time or per clock. High performance does not guarantee good flexibility, however. Instruction-set flexibility relates to the diversity of applications whose computations can be efficiently encoded in the instruction stream. A longer instruction word generally allows a greater number and diversity of operations and operand specifiers to be encoded in each word.
RISC architectures generally encode one primitive operation per instruction. Long-instruction-word architectures encode a number of independent sub-instructions per instruction, with operation and operand specifiers for each sub-instruction. The sub-instructions may be primitive generic operations similar to RISC instructions or they may each be more sophisticated, application-specific operations, such as those described previously in this chapter as processor extensions. Making the instruction word longer, for any given number of operands and operations, makes instruction encoding simpler and more orthogonal.
It is worth noting that long-instruction-word processors are not always faster than RISC processors. Sometimes the simplicity of RISC execution units boosts maximum clock frequency, and executing several distinct RISC instructions per cycle can compensate for the relative austerity of RISC instruction sets. Nevertheless, when RISC instruction sets are used for the most demanding data-intensive tasks, they typically rely on superscalar implementations that attempt to execute multiple instructions per cycle, mimicking the greater intrinsic operational parallelism of long instruction words.
Figure 1 shows an example of a basic long-instruction operation encoding. The figure lays out a 64-bit instruction word with three independent sub-instruction slots, each of which specifies an operation and operands. The first sub-instruction (sub-instruction 0) has an opcode and four operand specifiers: two source registers, an immediate field, and one destination register. The second and third sub-instructions (sub-instructions 1 and 2) each have an opcode and three operand specifiers: two source registers and one source/destination register. The 2-bit format field on the left designates this particular grouping of sub-instructions. It may also designate the overall length of the instruction if the processor supports variable-length encoding.
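The layout described for Figure 1 can be sketched in a few lines of Python. This is only an illustration: the text gives the 64-bit total, the 2-bit format field and the kinds of fields in each slot, but not the individual field widths, so every width below (6-bit opcodes, 4-bit register specifiers, an 8-bit immediate) and every function name is an assumption chosen so that the fields add up to 64 bits.

```python
# Hypothetical field widths; the article does not specify them.
FORMAT_BITS = 2
# Slot 0: opcode, two sources, immediate, destination (26 bits here).
SLOT0_FIELDS = [("opcode", 6), ("src1", 4), ("src2", 4), ("imm", 8), ("dst", 4)]
# Slots 1 and 2: opcode, two sources, one source/destination (18 bits here).
SLOT12_FIELDS = [("opcode", 6), ("src1", 4), ("src2", 4), ("srcdst", 4)]


def slot_width(fields):
    """Total number of bits occupied by a list of (name, width) fields."""
    return sum(width for _, width in fields)


def pack_fields(fields, values):
    """Pack named values into an integer, first field most significant."""
    word = 0
    for name, width in fields:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack_fields(fields, word):
    """Inverse of pack_fields: split an integer back into named values."""
    values = {}
    for name, width in reversed(fields):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values


def encode(fmt, slot0, slot1, slot2):
    """Build one instruction word laid out as: format | slot 0 | slot 1 | slot 2."""
    if not 0 <= fmt < (1 << FORMAT_BITS):
        raise ValueError("format field out of range")
    word = fmt
    for fields, values in ((SLOT0_FIELDS, slot0),
                           (SLOT12_FIELDS, slot1),
                           (SLOT12_FIELDS, slot2)):
        word = (word << slot_width(fields)) | pack_fields(fields, values)
    return word


def decode(word):
    """Split an instruction word into (format, slot0, slot1, slot2)."""
    slots = []
    # Peel slots off the least significant end: slot 2, then slot 1, then slot 0.
    for fields in (SLOT12_FIELDS, SLOT12_FIELDS, SLOT0_FIELDS):
        width = slot_width(fields)
        slots.append(unpack_fields(fields, word & ((1 << width) - 1)))
        word >>= width
    slot2, slot1, slot0 = slots
    return word, slot0, slot1, slot2
```

With these assumed widths, 2 + 26 + 18 + 18 = 64 bits, and `decode(encode(...))` returns the original format value and slot contents. A real encoding would choose widths to match its opcode count and register-file depth.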
Clearly there is a hardware cost associated with long instruction words. Instruction memory is wider, decode logic is bigger, and a larger number of execution units and register files (or register file ports) must be implemented to deliver instruction parallelism. Larger numbers of bigger logic blocks are incrementally harder to optimise, so maximum clock frequency can drop compared to simpler, narrower instruction encodings such as RISC. Nevertheless, the performance and flexibility benefits can be substantial, particularly for data-intensive applications with high inherent parallelism.
In some long-instruction-word architectures, each sub-instruction has almost completely independent resources: dedicated execution units, dedicated register files, and dedicated data memories. In other architectures, the sub-instructions share common register files and data memories and require a number of ports into common storage structures to allow effective and efficient data sharing.
Long-instruction-word architectures also vary widely on the question: How 'long' is a long instruction? For high-end computer-system processors, such as Intel's Itanium family, and for high-end embedded processors, such as Texas Instruments' TMS320C6400 DSP family, the instruction word is very 'long' indeed: hundreds of bits. For more cost- and power-sensitive embedded applications, 'long' may be just 64 bits. The essential processor architecture principles are largely the same, however, once multiple independent sub-instructions are packed into each instruction word.
CODE SIZE AND LONG INSTRUCTIONS
One common liability of long-instruction-word architectures is large code size, compared to architectures that encode one independent operation per instruction. The problem is common to VLIW architectures in general, but it is especially important for SOC designs, where instruction memories may consume a significant fraction of total silicon area. Compared to code compiled for code-efficient architectures, VLIW code can often require two to five times more code storage. Figure 2 compares the total code size of a VLIW DSP (TI TMS320C6203) with that of Tensilica's Xtensa processor for the EEMBC Telecom suite (discussed in Chapter 3), using both straight compilation from unmodified C and optimised C code. No assembly code was used.
Similarly, Figure 3 compares the total code size of a VLIW media processor (Philips Trimedia TM1300) with that of Tensilica's Xtensa processor for the EEMBC Consumer suite, using both straight compilation from unmodified C and fully optimised C. No handwritten assembly code was created for the optimised Tensilica processor.
Code bloat stems, in part, from instruction-length inflexibility. If, for example, the compiler can find only one operation whose source operands and execution units are ready, it may be forced to encode several sub-instruction fields as NOPs (no operation). Instruction storage is already a major portion of embedded SOC silicon area, so code expansion translates into higher cost, poorer instruction-cache performance, or both.
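The NOP-padding effect can be made concrete with a small model. The sketch below assumes a hypothetical fixed-width machine with three slots per 64-bit word (numbers borrowed from the Figure 1 example, not from any specific VLIW product) and a made-up schedule listing how many useful operations the compiler found for each issue cycle.

```python
SLOTS_PER_WORD = 3   # assumed slot count for the illustration
WORD_BITS = 64       # assumed fixed instruction width


def check_schedule(ops_per_cycle):
    """Each cycle must issue between 1 and SLOTS_PER_WORD useful operations."""
    for ops in ops_per_cycle:
        if not 1 <= ops <= SLOTS_PER_WORD:
            raise ValueError(f"cannot issue {ops} operations in one word")


def fixed_width_code_bits(ops_per_cycle):
    """Every cycle costs one full word, however few slots are useful."""
    check_schedule(ops_per_cycle)
    return len(ops_per_cycle) * WORD_BITS


def nop_fraction(ops_per_cycle):
    """Fraction of all slots that must be filled with NOPs."""
    check_schedule(ops_per_cycle)
    total_slots = len(ops_per_cycle) * SLOTS_PER_WORD
    return (total_slots - sum(ops_per_cycle)) / total_slots


# Hypothetical schedule: mostly one ready operation per cycle.
schedule = [1, 1, 3, 1, 2, 1]
code_bits = fixed_width_code_bits(schedule)   # 6 words of 64 bits
wasted = nop_fraction(schedule)               # 9 of 18 slots hold NOPs
```

In this made-up schedule, half of the encoded slots carry no useful work, yet all six words still occupy instruction memory and cache.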
A second source of VLIW code bloat is the loose encoding of frequent operations commonly found in VLIW processors. The TI TMS320C6203 DSP, for example, requires 32 bits of instruction to specify a 16-bit multiplication and 32 bits to specify a 16-bit add, so the common multiply/accumulate (MAC) combination takes at least 64 bits. If a loop containing many MACs is unrolled four times (to amortise the cost of branch and address calculations), the resulting eight MAC operations require 512 bits of instruction storage, not counting the additional bits for any loads, stores, branches or address-calculation instructions.
However, long instructions do not necessarily lead to VLIW code bloat. A long-instruction-word implementation of Tensilica's Vectra LX DSP architecture needs about 20 bits within the instruction stream to specify eight 16-bit MACs executing in SIMD fashion, not counting the additional bits for any loads, stores, branches, or address-calculation instructions.
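The arithmetic behind the two MAC examples above can be checked directly. This is a back-of-the-envelope calculation using only the figures quoted in the text; the variable names are illustrative, and the 20-bit figure is the article's approximate value rather than an exact encoding size.

```python
# Figures quoted in the text for the TI TMS320C6203.
BITS_PER_MULTIPLY = 32
BITS_PER_ADD = 32
bits_per_mac = BITS_PER_MULTIPLY + BITS_PER_ADD   # at least 64 bits per MAC

MAC_COUNT = 8
vliw_mac_bits = MAC_COUNT * bits_per_mac          # 8 * 64 = 512 bits

# Approximate figure quoted for eight SIMD MACs on Vectra LX.
simd_mac_bits = 20

# Rough ratio of instruction bits spent on the MACs alone.
ratio = vliw_mac_bits / simd_mac_bits
```

On these numbers the MAC portion of the loop body is roughly 25 times smaller in the SIMD encoding. Loads, stores, branches and address calculations are excluded on both sides, as in the text.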
One attractive solution for long-instruction-word code bloat is to use a more flexible range of instruction lengths. If the processor allows multiple instruction lengths, including short instructions that encode a single operation, the compiler can achieve significantly better code size and instruction storage efficiency, compared to traditional VLIW processor designs with fixed-length instruction words. Reducing code size for long-instruction-word processors also tends to decrease bus-bandwidth requirements and reduces the power dissipation associated with instruction fetches. Tensilica's Xtensa LX processor, for example, incorporates flexible-length instruction extensions (FLIX). This architectural approach addresses the code size challenge by offering 16-bit, 24-bit, and a choice of either 32- or 64-bit instruction lengths. Designer-defined instructions can use the 24-, 32-, and 64-bit instruction formats.
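A toy model shows why mixing instruction lengths helps. The sketch below is not Tensilica's assembler or compiler logic: it simply assumes that a lone operation is emitted as a 16-bit instruction when it fits a narrow encoding and as a 24-bit instruction otherwise, and that any multi-operation bundle uses the wide format. The function names and the sample instruction stream are hypothetical.

```python
def instruction_bits(op_count, fits_16bit, wide_bits=64):
    """Pick an instruction length for one bundle under the assumed rules."""
    if wide_bits not in (32, 64):
        raise ValueError("wide format must be 32 or 64 bits")
    if op_count < 1:
        raise ValueError("a bundle needs at least one operation")
    if op_count == 1:
        return 16 if fits_16bit else 24
    return wide_bits


def flexible_code_bits(bundles, wide_bits=64):
    """Total size when each bundle uses the shortest allowed length."""
    return sum(instruction_bits(n, narrow, wide_bits) for n, narrow in bundles)


def fixed_code_bits(bundles, wide_bits=64):
    """Total size when every bundle occupies a full wide word."""
    return len(bundles) * wide_bits


# Hypothetical stream: (operation count, fits a 16-bit encoding?)
stream = [(1, True), (1, False), (3, False), (1, False), (2, False), (1, True)]
flexible = flexible_code_bits(stream)   # 16 + 24 + 64 + 24 + 64 + 16
fixed = fixed_code_bits(stream)         # 6 * 64
```

For this made-up stream the flexible encoding needs 208 bits against 384 bits for fixed 64-bit words. The saving depends entirely on how often the compiler finds only one operation to issue.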
Long instructions allow more encoding freedom, where a large number of sub-instruction or operation slots can be defined (although three to six independent slots are typical) depending on the operational richness required in each slot. The operation slots need not be equally sized. Big slots (20 to 30 bits) accommodate a wide variety of opcodes, relatively deep register files (16 to 32 entries), and three or four register-operand specifiers. Developers should consider creating processors with big operation slots for applications with modest degrees of parallelism, but a strong need for flexibility and generality within the application domain.
Small slots (8 to 16 bits) lend themselves to direct specification of movement among small register sets and allow a large number of independent slots to be packed into a long instruction word. Each of the larger number of slots offers a more limited range of operations, fewer specifiers and shallower register files. Developers should consider creating processors with many small slots for applications with a high degree of parallelism among many specialised function units.
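The trade-off between big and small slots comes down to a simple bit budget. The sketch below assumes a slot needs enough opcode bits to distinguish its operations plus one register specifier per operand, and that a 64-bit word reserves a 2-bit format field as in the Figure 1 example. The opcode counts, register-file depths and function names are illustrative assumptions, not figures from any real configuration.

```python
def bits_for(choices):
    """Minimum number of bits needed to distinguish `choices` alternatives."""
    if choices < 1:
        raise ValueError("need at least one alternative")
    return (choices - 1).bit_length()


def slot_bits(opcode_count, regfile_entries, operand_count):
    """Opcode bits plus one register specifier per operand."""
    return bits_for(opcode_count) + operand_count * bits_for(regfile_entries)


def slots_per_word(bits_per_slot, word_bits=64, format_bits=2):
    """How many equal-sized slots fit after the format field."""
    return (word_bits - format_bits) // bits_per_slot


# Assumed big slot: 256 opcodes, 16-entry register file, 3 operands.
big = slot_bits(256, 16, 3)        # 8 + 3 * 4 = 20 bits
# Assumed small slot: 16 opcodes, 8-entry register file, 2 operands.
small = slot_bits(16, 8, 2)        # 4 + 2 * 3 = 10 bits

big_slots = slots_per_word(big)      # 62 // 20
small_slots = slots_per_word(small)  # 62 // 10
```

Under these assumptions a 64-bit word holds three big slots or six small ones, which matches the typical range of three to six slots mentioned above. Deeper register files or a fourth operand quickly reduce the number of slots that fit.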