Note: This blog entry continues with the excellent short course in low power design that Professor Jan Rabaey taught at the January meeting of the Santa Clara Valley Chapter of the IEEE Solid State Circuits Society.
Low-Power Design Essentials, Part 2
There are many levels within the design hierarchy that provide you with opportunities to reduce power consumption. From highest benefit to lowest, these levels are:
- System/Application
- Software
- Microarchitecture
- Logic/Register Transfer Level
- Circuit level
- Devices
Notice that the circuit and device levels, which offer the least leverage over power consumption, are the levels we’ve relied on most heavily for cutting system power. Stop and think about that for a minute. Why should this be? Why have we relied on the least effective tools for controlling power consumption?
It’s because of the immense, powerful capability of Moore’s Law and Dennard Scaling. With Moore’s Law seemingly unstoppable and producing a new IC process generation every 18 months to two years, Dennard Scaling ensured that each new IC process node delivered transistors that were twice as fast and consumed half the power of previous-generation transistors.
That is, until recently.
Dennard Scaling broke at or near the 90nm node. We no longer get 50% power reductions with each new IC process node. According to Tom Beckley, Senior VP of R&D for Custom IC and Signoff at Cadence, the 20nm node—the node now being readied for production volume—potentially provides upwards of 20% better performance (not 100%), and a 30% power savings (not 50%). (See “Scaling the 20nm peaks to look at the 14nm cliff, Part 1: Tom Beckley from Cadence maps the challenges of advanced node design at ISQED”.)
At the system/application level, your algorithmic choices massively affect power consumption. You also have control over the amount of concurrency. The move towards multiple processor cores running at moderate clock rates, rather than one processor core running at an extreme clock rate at the limit of a process technology, is one example of the contemporary trend towards concurrency. Additional examples include the sudden appearance of algorithm- or application-specific hardware engines such as multi-core graphics processing units (GPUs), hardware H.264 video decoders, and separate audio processors in SoC design.
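To make the multicore argument concrete, here is a rough sketch based on the textbook switching-power approximation P = activity × C × V² × f, which this summary of the talk does not spell out. Every number below is an illustrative assumption, including the idea that a core clocked at half speed can run from a 0.7 V supply instead of 1.0 V. The sketch also assumes the workload splits perfectly across two cores and ignores leakage and area overhead.

```python
# Rough comparison of one fast core vs. two slower cores using the
# textbook dynamic-power approximation P = activity * C * V^2 * f.
# All numeric values are illustrative assumptions, not measured data.

def dynamic_power(activity, capacitance_f, vdd_v, freq_hz):
    """Switching power in watts for one core."""
    return activity * capacitance_f * vdd_v ** 2 * freq_hz

ACTIVITY = 0.1        # assumed fraction of capacitance switched per cycle
CAPACITANCE = 1e-9    # assumed switched capacitance per core, in farads

# One core at the full clock rate and nominal supply voltage.
single = dynamic_power(ACTIVITY, CAPACITANCE, vdd_v=1.0, freq_hz=2e9)

# Two cores at half the clock rate. The lower frequency is assumed to
# permit a reduced supply voltage (0.7 V is a hypothetical figure).
dual = 2 * dynamic_power(ACTIVITY, CAPACITANCE, vdd_v=0.7, freq_hz=1e9)

ratio = dual / single
print(f"single core: {single:.3f} W")
print(f"two cores:   {dual:.3f} W ({ratio:.0%} of single-core power)")
```

Under these assumptions the two-core version delivers the same aggregate clock cycles per second for roughly half the switching power, because power falls with the square of the supply voltage while the core count only doubles.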
At the Microarchitecture level, you have choices between executing algorithms in parallel versus taking a pipelined approach. You can also choose between general-purpose execution engines (CPUs) and application-specific engines such as the ones noted at the end of the previous paragraph.
Dropping down to the hardware-implementation level (Logic/RT), you can pick a logic family (general-purpose versus low-power) and choose between full-custom IC design and standard-cell design. At the circuit level, you control device sizing, power-supply voltages, and transistor thresholds. Finally, at the device level, you can select the substrate material: bulk silicon versus SOI, for example.
Nothing comes for free, said Rabaey. Power, Performance, Cost—you must always pay for one with one or both of the others. We are quite used to designing in the power/performance/cost design space (the power/performance/area or PPA space for semiconductor design—area = cost to first approximation in IC design), but there’s an equally valid power/delay space we could be operating within.
In many markets, we’ve been pushing performance hard, feeling the need to go as fast as possible. Marketing for electronic products, like marketing for automobiles, frequently emphasizes and glorifies speed. Some applications still need to run as fast as possible. But fewer and fewer systems these days really need more speed. Many systems can tolerate more application latency, and we can, if we wish, back off on speed to save energy. Increasingly, we find ourselves in an energy-constrained world.
Operating in the power/delay design space, we can choose from two extremes:
- Go as fast as possible, shoot for minimum delay and pay dearly in power consumption
- Maximize performance for a given energy budget
If you want to optimize both power and performance, you will need to cross-optimize design at all six of the levels listed above. Here are Rabaey’s four concrete guidelines for “energy-inspired” design:
- For maximum performance, maximize concurrency. You pay for this concurrency with area and power.
- For a given performance level, choose an optimal amount of concurrency to minimize energy consumption.
- For a given energy-consumption level, use the least amount of concurrency that meets performance goals.
- For minimum energy consumption, pick a design with minimum overhead—direct mapping of function to architecture.
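The second guideline, an optimal amount of concurrency for a given performance level, can be illustrated with a toy model. The sketch below is not Rabaey's model: the linear voltage-versus-frequency relation, the per-unit leakage figure, and the 5% overhead per extra unit are all hypothetical parameters chosen only to show the shape of the trade-off. More parallel units allow a lower clock and supply voltage, which cuts switching energy, but each added unit also adds leakage and coordination overhead.

```python
# Toy model: energy per operation vs. number of parallel units at a
# fixed throughput target. Every parameter is a hypothetical value.

def energy_per_op(n_units, throughput=1e9, f_max=1e9, v_min=0.4,
                  v_max=1.0, c_eff=1e-10, leak_w=0.005, overhead=0.05):
    """Return joules per operation when n_units share the workload."""
    freq = throughput / n_units                  # clock rate each unit needs
    # Assumed linear relation between clock rate and required supply voltage.
    vdd = v_min + (v_max - v_min) * freq / f_max
    # Switching energy, inflated by an assumed overhead per extra unit.
    dynamic = c_eff * vdd ** 2 * (1 + overhead * (n_units - 1))
    # Leakage power of all units, spread over the operations per second.
    leakage = n_units * leak_w * vdd / throughput
    return dynamic + leakage

candidates = range(1, 17)
best_n = min(candidates, key=energy_per_op)

for n in candidates:
    marker = "  <- minimum" if n == best_n else ""
    print(f"{n:2d} units: {energy_per_op(n) * 1e12:6.2f} pJ/op{marker}")
```

With these made-up parameters, energy per operation first drops steeply as units are added, bottoms out at a modest unit count, and then climbs again as leakage and overhead dominate. Real designs would need measured voltage, delay, and leakage data in place of the assumed relations.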
One thing to get very clear in your head: we have often implemented functions inefficiently because we were not paying attention to energy efficiency when these functions were designed. Once designed in, legacy and inertia can keep these less-efficient functional designs in use for a long time.
Inefficiencies in function-block designs arise from:
- Over-dimensioning and over-design
- Building generality into a design where none is needed
- Inefficient design methodologies—using methodologies that do not even consider energy consumption
- Limited design time, forcing the selection of less-than-optimal designs
- The need for flexibility, re-use, and programmability to accommodate unknown future enhancements
Professor Rabaey then supplied some simple guidelines for improving computational energy efficiency:
- Match computation to architecture. Dedicated functional solutions are far more energy-efficient than general-purpose solutions.
- Preserve an algorithm’s locality. Don’t move data very far unless needed. In other words, keep data local in registers or closely bound RAM, not out in bulk SDRAM whenever possible.
- Exploit signal statistics. Correlated data contains fewer transitions than random data—that’s the concept behind a 1-bit audio D/A converter, for example. Most of the time, audio signals transition smoothly from one value to the immediately adjacent value.
- Use energy only when demanded. It seems like a simple idea, but first consider how much time a PC processor spends in a loop waiting for a user to hit a key or move a mouse. Then think about the amount of power wasted in that loop. For contemporary desktop PC processors, there are tens of amps flowing through that processor in leakage current alone, not to mention the millions of instructions executed in the millions of wait-loop iterations. There are countless examples of wait loops burning power and consuming energy while waiting for something to happen in all sorts of electronic system designs. PCs are not exceptional in this regard.
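The signal-statistics guideline above can be checked with a small experiment. The sketch below counts bit flips on a hypothetical 8-bit bus for two sample streams: a slowly varying sine wave standing in for correlated data, and uniformly random values. The bus width, sample count, and waveform are assumptions made for illustration. Bit flips serve here as a crude proxy for switching energy.

```python
import math
import random

def bit_transitions(samples, width=8):
    """Count bit flips between consecutive samples on a width-bit bus."""
    mask = (1 << width) - 1
    total = 0
    for prev, curr in zip(samples, samples[1:]):
        # XOR marks the bits that changed; count the ones.
        total += bin((prev ^ curr) & mask).count("1")
    return total

N = 1000  # assumed number of samples

# Correlated data: one period of a sine wave quantized to 8 bits, so
# consecutive samples differ by at most one level.
smooth = [int(127.5 + 127.5 * math.sin(2 * math.pi * i / N)) for i in range(N)]

# Uncorrelated data: uniformly random 8-bit values (fixed seed so the
# run is repeatable).
rng = random.Random(0)
noisy = [rng.randrange(256) for _ in range(N)]

smooth_flips = bit_transitions(smooth)
noisy_flips = bit_transitions(noisy)
print(f"smooth signal: {smooth_flips} bit flips")
print(f"random signal: {noisy_flips} bit flips")
```

The smooth stream produces far fewer bit flips than the random one, even though both carry the same number of 8-bit samples. An encoding chosen with the signal's statistics in mind could reduce the count further, but that is beyond what this sketch shows.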
Since the 1971 introduction of the commercial microprocessor by Intel, hardware programmability and flexibility have been the name of the game. Microprocessors have transformed system design for a variety of excellent reasons. They have shortened time to market: many of today’s systems could not be implemented in any practical manner without programmable microprocessors. Microprocessors encourage a lot of hardware design reuse. Often, it’s possible to base an entire product family on one board design that employs different firmware sets for different product family members.
We’ve also become dependent on the microprocessor’s ability to permit field updates. Just last month, I uploaded a needed firmware update for my video camcorder. The camcorder design predated the invention of Class 10 SDHC memory cards, so I discovered that the camcorder was not able to recognize the newer Flash memory media when I purchased a 32Gbyte SDHC card. A firmware update cured the problem.
However, says Rabaey, all of this flexibility through programmability comes at a large efficiency cost. Dedicated hardware is more energy-efficient and provides better throughput and latency, but the use of dedicated hardware runs counter to the processor-centric design trend of the last 40 years.
So the trick is to find a way to combine the flexibility of processor-based, firmware-centric design with the efficiency of dedicated hardware. Rabaey says that you do this by selecting simple processors over complex ones where possible, choosing concurrency over clock frequency whenever possible, stepping back from a “completely flexible” design mindset and adopting a design approach that relies more on “somewhat dedicated”—thus more efficient—hardware, and considering novel architectural solutions such as some form of hardware reconfigurability.
Note: This blog entry is Part 2 of a series based on a comprehensive one-evening course in low-power design essentials that UC Berkeley EECS Professor Jan Rabaey presented to about 100 people attending a meeting of the Santa Clara Valley chapter of the IEEE Solid State Circuits Society. You can find Part 1 here.