Designing Low-Power Systems with FPGAs, Part 2

February 1, 2010 on 5:34 pm | In Design, FPGA, SOC | No Comments

Literally within an hour of posting my last blog entry on designing low-power systems with FPGAs, Altera’s marketing engine issued a related email and dropped it into my inbox. Altera’s email pre-announces the company’s upcoming FPGAs based on 28nm lithography. The email included the following marketing graph (with no scale) to explain the advantages of the smaller geometries for FPGA manufacture.

Altera 28nm devices

The first set of bars in the graph sets the baseline using Altera’s 40nm devices as a reference. The next set of bars shows that the feature shrink alone improves FPGA gate density by 25% and power consumption by about 12.5%. (Note: That’s my eyeball talking, not Altera’s official numbers.)

The next set of bars shows what happens incrementally when Altera takes some major logic blocks and hard-codes them. Suddenly, gate density doubles and power consumption drops by 40% compared to 40nm FPGA.

The last set of bars shows what happens when you combine the lithography shrink and hard-coded IP. Suddenly you’re getting 4x the gate density at a mere 25% of the power consumption compared to 40nm devices. (Note: I’m not sure what suddenly happened to the transceiver count, that third bar in the group, which had been constant until everything got combined in the last set. My guess is that the marketing artist who drew the graph got overzealous, cut everything 75% for visual consistency, and the proofreaders missed it. I think the number of transceivers is supposed to stay constant, based on the first three sets of bars in the graph.)

Two things to note here. First, you get a lot of bang out of hard-coded IP. Coincidentally, MIPS announced that Altera had licensed the MIPS32 architecture back in October, 2008 but Altera was mum on the subject back then. RISC processor cores make lousy targets for programmable FPGA fabrics, largely because of the routing congestion around their large register files, so processor core IP is one of the IP types that really should be hard-coded onto an FPGA. Although neither Altera nor Xilinx had much success with their first-generation FPGAs that incorporated hard-coded processor cores, that doesn’t mean they’re not going to try again, and the MIPS announcement late last year telegraphed that move.

Want more proof? Last week at the Real Time Embedded Computing Conference held in Santa Clara, California, Xilinx’s Senior VP of Worldwide Marketing and Business Development Vin Ratford did more than telegraph his company’s intent to put processor cores back into FPGAs. He announced and elaborated on that intent. Xilinx will be adopting the ARM architecture and an FPGA-friendly version of ARM’s AMBA interconnect in future FPGA generations.

Make no mistake. Processors are coming to FPGAs for several reasons. First, a RISC processor core consumes between 25,000 and 50,000 gates. You can drop one of those puppies into an FPGA fabric and never see it. In essence, those transistors are “free.” That’s the nature of an FPGA’s programmable interconnect. Logic just sort of disappears.

Second, you can’t build a system without at least one processor these days. Which immediately leads to the third reason. If Xilinx and Altera truly wish to convert their “We’re taking over everything” or “All your chips are belong to us” attitudes into reality, then the processor will just have to live on the FPGA silicon. Otherwise, the FPGA companies don’t get all of the chips. It’s as simple as that.

However, as both Altera and Xilinx discovered last time they tried this, dropping a processor core into an FPGA and making it usable is not just a matter of burying some gates into the FPGA fabric. Effective ways of connecting the processor to the programmable FPGA fabric must also exist and the software developers—who represent more than 90% of modern embedded development teams—must also be happy with the integration. You only make them happy with good development, profiling, and debugging tools.

And there’s the rub.

(It’s possible that Shakespeare’s Hamlet was indeed an embedded systems developer.)

More on Mentor’s Catapult C from John Cooley and Other Designers

December 18, 2009 on 11:15 pm | In Design, EDA, SOC | No Comments

Earlier this month, I wrote about Mentor’s C-to-gates synthesis tool Catapult C and low-power design. The EDA industry’s self-appointed gadfly and uber-user John Cooley has just written an extensive blog posting about Catapult C complete with detailed comments from several of his reader/users. These comments and Cooley’s conclusions are very, very interesting for people in the ESL space, as well as anyone involved in chip design, so I thought I’d highlight some of Cooley’s conclusions.

First, Cooley quotes EDA analyst Gary Smith’s published numbers to quantify Catapult C’s lead in the high-level synthesis arena. He then uses the anecdotal evidence of the large number of comments (both good and bad) that his reader/users make about Catapult C relative to the other high-level synthesis tools to conclude that there do seem to be more IC designers using Catapult C than competing tools.

Cooley then hands the microphone over to his designer/readers for comments. One thing that really strikes me about these comments is the number of people who want to use C++ to describe hardware. Now C++ compilers have a tough enough time creating streamlined object code out of C++ descriptions. C++ allows such a high level of abstract description that algorithmic descriptions more resemble poetry than precise engineering-style descriptions. My opinion is that expecting any and all C++ descriptions to result in efficient hardware is a bit of a reach. No matter how good the compiler is, C++ descriptions can be so abstract that it can be tremendously difficult to infer any sort of efficient hardware design from such descriptions. The likelihood of developing mind-reading compilers in the near future seems mighty slim to me.

Other designer/readers seem to share my concerns. One engineer who sent a comment to Cooley and who preferred to remain anonymous wrote: “I remain concerned that quality-of-results derived from designs developed in ANSI C/C++ will not compare well to hand-coded RTL for our design area (hardware accelerators for broadband communications), regardless of claimed market share of Catapult C.”

Now don’t make something out of this skepticism (mine and “anonymous”) that’s not there. When logic synthesis first appeared in the early 1980s, it too “suffered” from a quality-of-results issue. The earliest logic-synthesis tools could not generate gate-level designs that were as efficient as designs created manually by even moderately good human logic designers, just as C compilers could not initially generate assembly code that was as efficient as code written by a good human codesmith. However, two things happened to make this issue become a non-issue.

First, the tools simply got better. Adoption of Verilog and VHDL as description languages helped to standardize the sea of HDL slopping over the EDA bucket back in the 1980s. Standardization gave compiler designers a focused target and channeled their creative energy into building better synthesis tools rather than creating ever-more-elegant description languages.

Second, Moore’s Law made irrelevant the difference between 10 and 100 gates or even between 100 and 1000 gates. At some point, we stopped counting gates just as we’d previously stopped counting polygons and transistors. We don’t really know how many gates there are on a chip any more and what’s more, we no longer care. Not really. Because it’s square millimeters of silicon that actually cost money, not gates. So today we use square millimeters and then use a fudge factor to estimate the number of gates represented by the square-millimeter metric.

Perhaps something like that will happen with high-level synthesis. The jury’s still out.

Laser Spike Annealing of Nickel in Nanometer CMOS ICs Cuts Leakage 10x

December 6, 2009 on 8:22 pm | In CMOS, Design, EDA, Green Design, Low-Power, SOC | No Comments

One of the sad facts of life for nanometer silicon has been the rise of leakage current as device geometries shrink. At 65nm, CMOS leakage currents roughly equal operating currents, making it virtually impossible to reduce overall operating current by more than half. I’ve long thought this was the result of low-Vt transistors that can never fully turn off, a consequence of the drive to recover speed that’s lost when supply voltages are cut to reduce operating power. Turns out there’s another culprit: nickel contamination that occurs when nickel atoms drift away from the nickel-silicide interface layer used to improve the connectivity of metal inter-layer contact plugs. The nickel atoms drift during the annealing process, which is used to drive the deposited nickel atoms into the transistors’ source and drain contact pads. The first of two annealing cycles drives the metallic nickel atoms into the silicon source and drain pads creating Ni2Si silicide. A second, higher-temperature annealing process converts the Ni2Si into NiSi, which has lower resistance and thus provides good electrical connectivity between the contact pad and the metal interconnect plug.
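The "no more than half" ceiling falls straight out of the arithmetic. Here is a minimal sketch in Python, assuming normalized current values (1.0 each for operating and leakage current) that are purely illustrative and not measured data; the function name is my own.

```python
def total_current_reduction(dynamic, leakage, dynamic_cut):
    """Fraction by which total current falls when only the dynamic
    (operating) current is cut by dynamic_cut (0.0 to 1.0)."""
    before = dynamic + leakage
    after = dynamic * (1.0 - dynamic_cut) + leakage
    return 1.0 - after / before

# Hypothetical 65nm-style case: leakage equals operating current.
print(total_current_reduction(1.0, 1.0, 0.5))  # 0.25: halving dynamic current saves a quarter
print(total_current_reduction(1.0, 1.0, 1.0))  # 0.5: even zero dynamic current saves only half
```

However hard you squeeze the operating current, the untouched leakage term sets the floor.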

It turns out that current “soak” annealing processes, which last for tens of seconds, allow the nickel atoms to drift far afield. Like beach sand in your bathing suit, the nickel gets into places you’d rather not have it. The drifting nickel atoms seem to have an affinity for silicon lattice discontinuities, which can be found at the outside ends of the transistor where source and drain diffusions meet the isolation trenches and in long, narrow voids that run from the source and drain regions towards and into the FET channel. Both of these hiding places cause leakage because the metallic nickel conducts electricity where there should be insulator or semiconductor material. Nickel at the ends of the transistor causes substrate leakage and nickel atoms in the channel naturally cause channel leakage.

Applied Materials and European semiconductor research powerhouse IMEC have jointly developed a laser-annealing process with one-millisecond duration instead of taking tens of seconds. As a result, the diffusing nickel doesn’t have time to drift into these unwanted places during the second annealing step that generates NiSi. Applied Materials described a similar laser-spike annealing process back in 2004 (see article here), but reportedly achieved only a 3-4% leakage reduction back then. This latest development appears to be a refinement of that earlier technique. The two companies will be presenting their findings at this week’s IEDM conference in Baltimore, Maryland.

IMEC and Applied Materials will indeed have pulled a rabbit out of the hat if this laser-spike annealing process, together with the application of appropriate transistor-design rules, results in cutting leakage currents by 90% for nanometer CMOS. Leakage-driven power loss has become a significant problem for advanced IC design and had appeared to be insurmountable, even with the addition of high-K and metal-gate processing. Now, it appears there’s a real solution with the best of all possible implications for system and logic designers: they don’t need to learn anything new. They can leave this fix to the design tools and to the process engineers and once again skirt the system-level and architectural issues of low-power design.

C-to-Gates Synthesis and Low-Power Design

December 4, 2009 on 1:42 pm | In Design, Low-Power, SOC | No Comments

One of the many “pushbutton” design-automation tools that chip designers have sought is a “C-to-Gates” tool that would allow the automated development of hardware from algorithmic descriptions written in the C programming language.

The place to start almost any system design is with the most fundamental aspect of system design: algorithm development. Systems are collections of independently and dependently operating algorithms. For example, a DVD player uses one algorithm to decompress the combined media stream, another to decode the resulting video stream, and yet another algorithm to decode the resulting audio stream. All systems are based on the execution of one or more algorithms.

Most algorithm development begins and ends with C. One exception to this rule is MATLAB from The Mathworks, which is very popular with many algorithm developers. However, there are ways to get MATLAB to produce C from MATLAB-based algorithms as well.

Given the close association between algorithm development and the C language, it’s only natural to want a tool that automatically converts C descriptions to hardware netlists. However, early attempts to create such tools didn’t meet with much commercial success, largely because the quality of results (number of gates, operational speed, and resulting power consumption) compared poorly to manual RTL-generation design techniques.

The tools have been getting better over the years, leading to this recent announcement by Mentor Graphics:

“Mentor Graphics Corp. (NASDAQ: MENT), today announced that Fujitsu Kyushu Network Technologies Limited (Fujitsu QNET) has chosen the Catapult C Synthesis tool for use in its design tool environment to implement complex algorithms in hardware that were previously processed by a processor implemented on LSI. Fujitsu’s growing expertise with the Catapult C tool is also a key enabler in the expansion of their design services business.

Fujitsu QNET was able to dramatically cut power consumption by using the Catapult C Synthesis tool to create a dedicated hardware accelerator for mobile voice processing algorithm versus running in software. The resulting silicon implementation yielded a reduction in power consumption of 83%. This was made possible by the ability of the Catapult C tool to find the optimal trade-off between power, performance and area, in this case, implementing a design satisfying voice performance requirements while running at a lower clock frequency than the previous implementation using a processor.”

Based on this announcement, it would seem that C-to-gates tools, at least Mentor’s Catapult C, are getting closer to reality. In fact, the above announcement would lead you to believe that such tools are here and production-ready today. Indeed, Mentor’s description of the Catapult C synthesis tool appears mighty attractive:

“Catapult C Synthesis reduces design time and verification effort. When writing pure C++, designers focus on the functional intent of their application. Timing and architectural information is abstracted away from the source description. With fewer details in the model, testbench development is also simplified.

Implementation of specific details are automatically added during the synthesis process, eliminating error-prone manual interventions and resulting in RTL designs correct by construction. Debug of the resulting RTL is in turn eliminated, further reducing the overall verification effort.

The Catapult C automated verification environment allows any RTL implementation of a C++ model to be verified using the original C++ testbench. This eliminates the need to write pin-level interfacing and bit-timed RTL environments to verify the RTL blocks created by Catapult before moving to system integration.”

There’s a caveat or two to remember, however. First, there’s good C code and bad C code whether you’re writing code to run on a processor or code that’s to be synthesized into gates. In the case of C-to-gates synthesis, good code signals the design intent as clearly as possible so that the synthesis tool needs to infer as little as possible. Machine inference is the second caveat. Every detail that Catapult C—or any synthesis tool for that matter—must infer is a design detail that you didn’t put into the design.

Conventional RTL-driven logic synthesis makes such inferences all the time and, over the years, designers have gotten savvy to the kinds of inferences that will be made by their logic-synthesis tools and have compensated by adapting their code-writing styles when writing hardware descriptions in the Verilog and VHDL hardware description languages. However, C has largely been used as a sequential algorithm description tool to create software that runs on single processors, and the way engineers use it reflects that long history. In addition, C describes algorithms at a higher level of abstraction than descriptions written in hardware description languages. As a result, there’s always more to infer from a C algorithm description just due to the higher abstraction level.

So eye that 83% power-reduction that Fujitsu QNET achieved for that voice-processing algorithm with envy. Just remember that engineering isn’t as simple as pushing a button.

NOCs: The Undead of the SOC World

November 8, 2009 on 6:14 pm | In SOC, Uncategorized | 3 Comments

The 7th International SOC Conference in Newport Beach featured a session on NOCs (networks on chip). Perhaps it’s the undue influence of the recent Halloween festivities, but NOCs remind me of vampires, of the undead. They just keep coming back no matter what, despite the lack of uptake in the commercial sector.

Academics love NOCs because they can be analyzed to death and they provide wonderful fodder for postgraduate work. You can come up with increasingly elegant, time-consuming, and costly routing algorithms for NOCs, which has permitted the creation of many, many academic papers. Each and every paper lists the prior failings of earlier NOC approaches, analyzes the shortcomings, and then proposes an even more elegant and costly NOC that solves the technical problems of predecessors. But these more elegant solutions have even less commercial potential because of the costs.

When will it end?

Perhaps never.

One of the speakers at last week’s International SOC Conference was Professor Nader Bagherzadeh of UC Irvine’s EECS Department. His presentation was sensibly titled “Is Network-on-Chip (NoC) a Viable Choice for the Future?” That’s a very reasonable question and Professor Bagherzadeh gave a reasoned presentation. One of his first slides contrasted three approaches to SOC interconnect design. The first approach, popular with most of today’s SOC designers, is the use of bus hierarchies.

Buses are the dinosaurs of system design. The fossils of bus-based, board-level designs from decades past form the bones of new SOC designs even though the economics of on-chip nanometer silicon interconnect now bear no resemblance to the copper-and-fiberglass design rules and economics of the 1980s. As Professor Bagherzadeh said, bus-based designs are not scalable, they enforce centralized control in increasingly decentralized systems of growing complexity, and they force the use of long wires on the SOC, which severely degrades performance and needlessly exposes system designs to the newest bugaboo for deep-submicron design: on-chip variability.

The current leader for efficient, fast SOC designs is point-to-point interconnect, which offers low latency, application-specific optimization, very high bandwidth, and low cost. Deep-submicron wires are plentiful and cheap. System designers should use them accordingly.

And then there are NOCs, which also promise shorter wiring runs between on-chip routers. High levels of interconnectivity mean that NOCs can provide high bandwidth with distributed traffic control. However, said Professor Bagherzadeh, NOCs are not as efficient as point-to-point wiring for carrying traffic on application-specific SOCs and consequently we have still not seen many tapeouts that use NOCs for real chips in real applications.

But that doesn’t mean that NOCs are elegantly useless. I think Professor Bagherzadeh made a good case for NOCs to be used as flexible interconnect when designing a platform chip. Here, you don’t have all of the knowledge to predict traffic flows over an entire chip and need some flexibility when routing high-bandwidth traffic. In such cases, you might be willing to suffer the silicon overhead of a NOC in exchange for interconnect flexibility.

It was at that point that Professor Bagherzadeh started to discuss his work with a 7-channel NOC router, which is even bigger, better, and more elegant than the conventional 5-port NOC router, offers more effective traffic bandwidth and throughput, and requires even more elegant routing algorithms. We now return you to our regular NOC programming where the usual solution to low uptake in NOC usage is to create bigger, better, and more elegant NOC hardware and routing algorithms.

The Surprising Popularity Rise of On-Chip Memory

November 8, 2009 on 4:53 pm | In CMOS, DRAM, Design, Low-Power, SOC | No Comments

I attended the 7th International SOC Conference in Newport Beach last week and several of the speakers addressed issues relating to SOC and system power. One of these speakers was Bob Madge, Director of Technology Marketing at LSI Corp (formerly LSI Logic). In case you didn’t know, LSI has been evolving its business from its original focus on developing ASICs and SOCs for customers to a focus on programmable ASSPs (application-specific standard products) and custom silicon specifically aimed at the networking and storage markets. Madge’s first slide explained the reasoning: annual storage-capacity growth is a projected 49% per year and annual network-traffic growth is a projected 42% per year. Good growth numbers for a business to target.

To deliver competitive parts, LSI stays on top of IC design and manufacturing trends. One trend that caught LSI and the semiconductor industry by surprise has been the rapid growth in on-chip memory use. On-chip memory makes sense for two reasons. First and foremost, it provides better performance than off-chip memory because putting memory on the chip along with the logic circuitry eliminates two sets of off-chip drivers and receivers, which reduces power consumption for memory transactions. Second, on-chip logic can communicate with on-chip memory over extremely wide memory interfaces—pin count is not an issue if you stay on the chip. A wide memory interface reduces the number of transfers needed to move a given amount of data and lower transfer rates cut power as well.
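To put a rough number on the wide-interface argument, here is a small Python sketch. The 64-KB block size and the three interface widths are hypothetical values chosen for illustration; none of them come from Madge’s talk.

```python
def transfers_needed(total_bits, bus_width_bits):
    """Number of bus transfers needed to move total_bits over an
    interface bus_width_bits wide (rounded up to whole transfers)."""
    return (total_bits + bus_width_bits - 1) // bus_width_bits

block_bits = 64 * 1024 * 8  # hypothetical 64-KB block of data

# 32 bits is a plausible off-chip width; the wider ones are only
# practical when pin count is not a constraint, i.e. on chip.
for width in (32, 256, 1024):
    print(f"{width:5d}-bit interface: {transfers_needed(block_bits, width)} transfers")
```

Under these assumptions, the 1024-bit interface moves the same block in 512 transfers instead of the 16,384 that the 32-bit interface needs, so it can run at a far lower transfer rate for the same bandwidth.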

However, merging logic and memory on one piece of silicon has always presented design and manufacturing issues. Bulk, high-volume, high-capacity memory manufacturing processes differ from logic manufacturing processes because the two processes must optimize different parameters. Memory processes emphasize low cost manufacturing and tend to have fewer metal layers than logic processes, which emphasize speed and on-chip connectivity. “Frequency, density, and power are always a challenge,” said Madge.

For example:

  • Today’s network routers use 400-Mbit buffers. Switches need 512 Mbits of storage or more. In the future, said Madge, these devices will need as much as 1 Gbit of on-chip memory in multiple configurations.
  • IP controllers used in network storage applications currently use 60 to 100 Mbits of cache memory. In the future, these devices will need 200 Mbits of memory or more.
  • Media processors currently use 60 to 80 Mbits of memory running at 500 MHz. Future needs will be on the order of 100 to 200 Mbits of memory running at 600 to 700 MHz.

All of these examples demonstrate the coming challenges for fast, dense, on-chip memory.

LSI is looking at embedded (on-chip) DRAM and the use of 3D, through-silicon via technology for chip-to-chip stacking as ways of increasing the amount of on-chip memory. The company is doing this because it sees a continued and rapid rise in the amount of on-chip memory needed for its networking and storage chips.

Embedded DRAM uses a 1T (one-transistor) cell, which obviously improves density over a 4T or 6T static RAM cell. Embedded DRAM also reduces static and dynamic power consumption because the fewer transistors use less power and leak less current than the greater number of transistors required to build the same amount of SRAM memory.
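A back-of-the-envelope Python sketch shows how big the transistor-count gap gets at the 512-Mbit capacity mentioned above. It counts cell-array transistors only, ignoring sense amplifiers, decoders, refresh logic, and the DRAM cell’s storage capacitor, and it assumes 1 Mbit = 2^20 bits; both simplifications are mine.

```python
MBIT = 2**20  # assumption: binary megabit

def cell_array_transistors(capacity_mbits, transistors_per_cell):
    """Transistors in the storage array alone (no peripheral circuitry)."""
    return capacity_mbits * MBIT * transistors_per_cell

capacity = 512  # Mbits, the switch buffer size cited earlier
for name, per_cell in (("1T eDRAM", 1), ("4T SRAM", 4), ("6T SRAM", 6)):
    count = cell_array_transistors(capacity, per_cell)
    print(f"{name}: {count / 1e9:.2f} billion transistors")
```

By this crude count, a 6T SRAM array needs six times as many transistors as the embedded DRAM array for the same capacity, which is where the density and leakage advantages come from.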

LSI is also investigating other power-saving features that become possible when you move memory onto the logic chip including a sleep mode for the memory, dual power rails, and low-voltage operation. However, said Madge, the biggest benefit appears to be a move to embedded DRAM because of the huge reduction in transistor counts.

Green Chips in Newport Beach

November 6, 2009 on 6:05 pm | In Design, Green Design, Low-Power, SOC | No Comments

Yesterday, I moderated a panel on green chip design in Newport Beach at the 7th International SOC Conference. Chances are you didn’t see or hear any of it because there were only 100 people at this conference in total. That’s really too bad because we had a great set of panelists:

1. Michel Laurence co-founded Octasic, which is a Montreal specialist in echo cancellation and has mastered the art of self-clocking or self-timed (asynchronous) logic design.
2. Jauher Zaidi, CEO, PalmChip Corporation, which was in the chip-design business but has now spun off those activities to focus more on SOC platform software.
3. Alan Ruberg, architect for SPMT, the Serial Port Memory Technology consortium, which is developing a high-performance, low-power, next-generation memory interface to replace the DDR families with an interface that uses fewer pins.
4. Dr. Simiack Haghighi, Principal Architect of Qualcomm’s CDMA Technology Architecture Group, which should need no introduction…and
5. Steve Carlson, VP of Product Marketing at Cadence (who kindly volunteered from the audience at the last minute when a panelist from another leading EDA company didn’t show).

I tossed out questions from readers of my blogs, some I developed on my own, and some that came from the panelists themselves. Here’s the first question and the panelists’ answers.

This was a cynical question from one of my blog readers: What’s the difference between the design of a Green Chip and one of those greyish Silicon ones?….More to my point – doesn’t Darwin take care of those companies who don’t design to a lower power solution than their competitors and, therefore, is this “Green Chip” thing just hopping aboard the Hype Bandwagon?

Cadence’s Steve Carlson quickly snagged this first question. His cool reply was that the industry’s focus on green technology has little to do with tree hugging. It’s all about business. Business issues such as cost and power happen to converge with the social issues that draw the general press coverage, but business drives the design decisions.

Palmchip’s Jauher Zaidi agreed. There’s lots of hype about “green” these days, he said, but cost drives everything. Energy costs drive purchasing decisions at data centers, which use a lot of electrical power. Chip-design teams need power engineers now, he concluded.

SPMT’s Alan Ruberg also chimed in for this first question. Every Watt dissipated by a system requires another Watt in cooling, he said. So every Watt you save in a design delivers a 2-for-1 return in terms of energy savings. He added, “By the year 2020, if trends continue on the present course, you’ll only be able to power 9% of a chip at any given time.”

How can that be? Why not just omit the other 91% of the design? Because all of the panelists can foresee a time in the rapidly approaching future when there will be specialized blocks for all the tasks performed by a chip, but not all tasks need be running simultaneously. For example, a mobile handset chip with functional blocks for a still camera and a video camera need power up only one of those blocks at a time because they share the lens and cannot operate simultaneously, yet they each require different optimizations so it makes sense to design special-purpose blocks (or get the relevant predesigned IP) for both functions and then power the one that’s needed.

There were some interesting independent observations that I noted in addition to the answers to the questions asked during the panel session:

Palmchip’s Jauher Zaidi noted that Amdahl’s Law applies to the power component of systems as well as to its usual application for analyzing execution time. It does no good to reduce the power of one system block by 10x in a system like the iPhone, for example, if it represents only a small part of the overall system power consumption. You end up reducing the system’s energy consumption very little. Power reduction requires a systemic approach.
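Zaidi’s point can be sketched in a few lines of Python using the same form as Amdahl’s Law, with power share substituted for execution-time share. The 5% and 50% block shares below are hypothetical figures for illustration, not numbers from the panel.

```python
def system_power_reduction(block_fraction, block_improvement):
    """Fractional cut in total system power when one block, which
    consumes block_fraction of the total (0.0 to 1.0), has its power
    reduced by a factor of block_improvement (10 means 10x)."""
    remaining = (1.0 - block_fraction) + block_fraction / block_improvement
    return 1.0 - remaining

# A 10x improvement to a block drawing 5% of system power:
print(f"{system_power_reduction(0.05, 10):.1%}")  # 4.5%
# The same 10x improvement to a block drawing 50% of system power:
print(f"{system_power_reduction(0.50, 10):.1%}")  # 45.0%
```

The heroic 10x effort on the small block buys less than 5% at the system level, which is why the power budget has to be attacked where the power actually goes.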

Zaidi also noted that he needs to charge his iPhone three times a day and that he also manually turns functions on and off to extend battery life. He recommended that designers make it easier for users to manage their increasingly complex devices. I countered, saying that my kitchen doesn’t require such management. The microwave oven doesn’t turn itself on spontaneously and the refrigerator turns itself on and off automatically to maintain set temperatures. Surely we can manage more of the functions in today’s more intelligent systems in a… more intelligent manner. I submit that we’re better off trying to engineer smarter systems than smarter customers. Social engineers we’re not.

Cadence’s Steve Carlson estimated that less than one third of SOC designs today use DVFS (dynamic voltage and frequency scaling) and 10% or more don’t even use clock gating. Those are pretty dismal numbers in my opinion for practices that reduce an SOC’s power consumption and are known to work well.
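Why these two practices work is visible in the standard first-order model for CMOS dynamic power, P = a·C·V²·f (activity factor, switched capacitance, supply voltage, clock frequency). The Python sketch below applies that textbook model; the scaling values are hypothetical, and the model ignores leakage and the practical limits on how far voltage can follow frequency.

```python
def relative_dynamic_power(v_scale, f_scale, activity_scale=1.0):
    """Dynamic power relative to baseline under P = a * C * V^2 * f,
    with capacitance held constant. Each argument is a ratio to the
    baseline value (1.0 means unchanged)."""
    return activity_scale * v_scale**2 * f_scale

# DVFS: halve the clock and drop the supply to 80% of nominal.
print(f"DVFS:         {relative_dynamic_power(0.8, 0.5):.0%} of baseline")
# Clock gating: same voltage and clock, but gated logic toggles
# only 60% as often (hypothetical activity reduction).
print(f"Clock gating: {relative_dynamic_power(1.0, 1.0, 0.6):.0%} of baseline")
```

Because voltage enters as a square, lowering it along with frequency saves far more than slowing the clock alone.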

Carlson noted that there’s lots of room for improvement.

Amen.
