SPMT engulfs LPDDR2 standard, making adoption a no-brainer. Meanwhile Marvell jumps on the bandwagon.

June 7, 2010 on 9:00 am | In DRAM, Design, LPDDR2, Low-Power, SDRAM, SOC | No Comments

An insidious power problem has slowly crept up on embedded-system designers. While most of us were firmly focused on the power dissipation of our ever-expanding logic designs, with their increasing number of processor cores, we mostly ignored the huge leaps in power consumption being caused by the rapid growth in memory size and big jumps in memory-access speeds and memory bandwidth. To cut memory costs, most high-end mobile and embedded designs today employ one high-bandwidth SDRAM device or array to satisfy all of a system’s memory requirements. Yet we think very little about the power impact of hooking big DDR SDRAMs up to our SOCs and ASICs, even though these SDRAMs run at clock rates measured in hundreds of MHz or GHz, at transfer rates that are double the clock rate. It takes some real power to sling bits between a processor and SDRAM at transfer rates approaching or exceeding 1 Gtransfers/sec. Even though the supply and I/O voltages on SDRAM have been dropping, keeping memory power somewhat in check (only somewhat), wide DDR2 and DDR3 memory interfaces that deliver the highest bandwidths may now consume watts of power. Watts! This simply cannot stand.

Not coincidentally, that’s the position of the SPMT (Serial Port Memory Technology) Consortium, which has been developing a low-power, high-performance memory interface for mobile and embedded applications. The low-power aspect arises primarily from SPMT’s use of low-voltage differential signaling (LVDS), which transfers information using 150 mV differential signal swings instead of single-ended, ground-referenced signal swings of more than a volt. The high-performance aspect arises from the use of multi-Gbits/sec transfer rates per SPMT data lane.

But there’s been a big, ugly fly in the SPMT ointment. Memory vendors know that more than 80% of all DRAMs go into PCs and servers and they stick with memory designs, and memory interfaces in particular, that best suit the needs of PC and server designers. Today, that means DDR2 memory, which is the mainstream DRAM technology, but the industry is quickly switching to DDR3. DDR4 is as yet undefined, but it too is a rapidly approaching memory-interface specification that will most assuredly “fix” the problems we have with DDR3. These PC- and server-centric, high-speed parallel SDRAM interfaces burn a lot of power to deliver high bandwidth, which creates the niche opportunity that the SPMT Consortium has been trying to fill for mobile and embedded designs. Unfortunately, DDR memory has such a huge presence in the DRAM arena that there’s been little chance for any other interface approach to take hold.

Until now.

Today, the SPMT Consortium announced a major revision to the SPMT standard that may well spell the difference between an interesting technical exercise and an immensely successful new memory-interface standard. Previously, the SPMT specification multiplexed read/write commands and the data on the same unidirectional LVDS lanes. Doing so somewhat reduced the throughput on the data lines but it also reduced the memory pin count because SPMT memory didn’t need separate control/address (CA) lines. The reduced pin count was considered a major benefit that reduced the cost of packaged SPMT memory devices. The new SPMT specification, which completely supersedes the prior specification, does away with this control/address/data multiplexing in favor of using the same CA signal and pin definitions that LPDDR2 memory uses to carry control and address signaling.

This is a significant and important change to the SPMT spec because LPDDR2 is already poised to take over the mobile and embedded design spaces. (See LPDDR2: The new mainstream memory for embedded and mobile applications? on Denali Software’s Memory Report blog.) Further, four pairs of unidirectional SPMT data lanes now precisely overlap the 16 bidirectional data lines of a x16 LPDDR2 memory, making it possible to build one memory chip that can support both LPDDR2 and SPMT protocols using the same set of pins. What that means is that with only a few changes to the memory controller and memory PHY, an SOC or embedded processor can accommodate both LPDDR2 and SPMT memory using exactly the same set of interface pins. It also means that SDRAMs designed to the new SPMT specification can be used as LPDDR2 SDRAMs, ensuring a ready market when commercial SPMT SDRAMs first hit the market near the end of 2011—assuming things go according to the SPMT Consortium’s current plans.

So where’s the power advantage? It kicks in after the required SDRAM transfer rate hits a critical level. For example, the SPMT Consortium estimates that a x32 LPDDR2 memory interface operating at 400MHz dissipates about 180mW while providing 3.2 Gbytes/sec of peak data throughput over 32 data lines (800 Mbits/sec/pin) and 360mW at a peak data throughput of 6.4 Gbytes/sec over 64 data lines. (Regular old DDR2 and DDR3 SDRAM interfaces would consume a lot more power than this.) By contrast, the SPMT interface dissipates 180mW while transferring 6.4 Gbytes/sec over eight data lanes (8 Gbits/sec/lane) and 360mW when transferring 12.8 Gbytes/sec over 16 data lanes. So the SPMT interface appears to be about twice as power efficient as the LPDDR2 interface at higher data rates, which LPDDR2 memory can’t attain without resorting to a very wide data bus and using several memory devices in the bargain. However, the LPDDR2 parallel interface has a power advantage over the SPMT serial interface at lower transfer rates. So LPDDR2 memory might suffice for today’s embedded and mobile applications and might also suffice for low-activity modes in future applications.
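To make the “about twice as power efficient” claim concrete, here is a back-of-the-envelope sketch in Python that divides each quoted power figure by its quoted peak throughput. The numbers are the SPMT Consortium estimates cited above; the function and variable names are mine, purely for illustration, and the calculation says nothing about behavior at low data rates.

```python
# SPMT Consortium estimates quoted in the text:
# (interface, power in mW, peak throughput in Gbytes/sec)
ESTIMATES = [
    ("LPDDR2 x32", 180.0, 3.2),
    ("LPDDR2 x64", 360.0, 6.4),
    ("SPMT 8 lanes", 180.0, 6.4),
    ("SPMT 16 lanes", 360.0, 12.8),
]


def mw_per_gbyte_per_sec(power_mw, gbytes_per_sec):
    """Interface power spent per Gbyte/sec of peak throughput (lower is better)."""
    return power_mw / gbytes_per_sec


def efficiency_ratio(lpddr2, spmt):
    """How many times more peak throughput per mW the SPMT entry delivers."""
    _, lpddr2_mw, lpddr2_bw = lpddr2
    _, spmt_mw, spmt_bw = spmt
    return mw_per_gbyte_per_sec(lpddr2_mw, lpddr2_bw) / mw_per_gbyte_per_sec(spmt_mw, spmt_bw)


for name, power_mw, bandwidth in ESTIMATES:
    cost = mw_per_gbyte_per_sec(power_mw, bandwidth)
    print(f"{name:14s} {cost:6.2f} mW per Gbyte/sec")

# Compare the two 180mW configurations with each other.
print("SPMT advantage at 180mW:", round(efficiency_ratio(ESTIMATES[0], ESTIMATES[2]), 2))
```

Both LPDDR2 configurations work out to roughly 56 mW per Gbyte/sec and both SPMT configurations to roughly 28, which is where the factor of two comes from.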

The graph below, supplied by SPMT, tells the story. It shows that at low data rates, LPDDR2 memory dissipates less power than SPMT memory, largely because of the DLL integrated into SPMT memory. (DLLs consume non-negligible amounts of power and although DDR2 and DDR3 memories incorporate DLLs, LPDDR2 memory does not.) So the SPMT Consortium has done something very smart and has developed an integrated mode-switching mechanism called SerialSwitch, which allows an SDRAM controller to programmably shift an SPMT memory between its LPDDR2 and SPMT serial interface modes using a control register built into the memory device.

 

[Figure: Memory Crossover]

 

Mobile phone vendors and other embedded/mobile system designers know that video will be heavily used in many future products and they also know that memory transfer-rate and bandwidth requirements will only go up as a result. SPMT’s SerialSwitch mechanism provides a way for one memory device to support both low- and high-bandwidth operating modes with an appropriate level of power consumption depending on a system’s instantaneous bandwidth requirements. By definition, all commercial SPMT memories will incorporate the SerialSwitch feature. The following figure shows how the SPMT SerialSwitch mechanism works.

 

[Figure: SerialSwitch]

 

During time Tg, the figure shows SPMT memory operating as a x16 LPDDR2 memory. Note that the data lines (DQ/HS) employ full-voltage, single-ended signaling in this mode. During this time, the memory’s DLL is off, which saves power. At the beginning of time Th, the system determines that more bandwidth is or soon will be needed, so it directs the memory controller to send a command to the memory to spin up the DLL in preparation for switching to SPMT serial mode. That process takes 5 to 10 microseconds. During this time, the memory continues to operate as an LPDDR2 memory, so the DLL spin-up time is hidden and doesn’t interfere with system operation, but power consumption will rise. Once the SPMT memory’s DLL has spun up, at time Ti, the system’s memory controller commands the SPMT memory to switch to serial communications mode. This transition takes a maximum of 10 clock cycles. After that and during time Tj in the figure above, the memory operates in SPMT serial-communications mode. Note that the data lines have switched to LVDS signaling, as shown in the figure. LVDS signaling reduces the memory interface’s power consumption. At some later time depending on system requirements, the memory controller can power down the memory (shown as time Tk) or switch back to LPDDR2 mode (the period following the period that starts at time Tk in the above figure). Don’t be misled by this figure, by the way: SPMT memory need not pass through the power-down mode to switch from SPMT serial-communications mode to LPDDR2 mode.
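The sequence just described reads like a small state machine, so here is a minimal Python sketch of it. This is my own model of the description above, not anything published by the SPMT Consortium: the state names, the command names, and the assumption that the DLL is off again after a switch back to LPDDR2 mode are all hypothetical. The 400MHz clock in the example is borrowed from the LPDDR2 figures quoted earlier.

```python
from enum import Enum


class Mode(Enum):
    LPDDR2 = "LPDDR2, DLL off (Tg)"
    LPDDR2_DLL_SPINUP = "LPDDR2, DLL spinning up (Th)"
    SPMT_SERIAL = "SPMT serial, LVDS signaling (Tj)"
    POWER_DOWN = "power-down (Tk)"


# (current mode, hypothetical command name) -> next mode
TRANSITIONS = {
    (Mode.LPDDR2, "start_dll"): Mode.LPDDR2_DLL_SPINUP,
    (Mode.LPDDR2_DLL_SPINUP, "switch_to_serial"): Mode.SPMT_SERIAL,
    (Mode.SPMT_SERIAL, "power_down"): Mode.POWER_DOWN,
    # Serial mode can return to LPDDR2 directly, without passing through power-down.
    (Mode.SPMT_SERIAL, "switch_to_lpddr2"): Mode.LPDDR2,
    (Mode.POWER_DOWN, "switch_to_lpddr2"): Mode.LPDDR2,
}


class SerialSwitchModel:
    """Toy model of the mode sequence; it tracks state only, not timing or data."""

    def __init__(self):
        self.mode = Mode.LPDDR2

    def command(self, name):
        key = (self.mode, name)
        if key not in TRANSITIONS:
            raise ValueError(f"{name!r} is not valid in mode {self.mode.name}")
        self.mode = TRANSITIONS[key]
        return self.mode


def worst_case_switch_up_us(clock_mhz, dll_spinup_us=10.0, switch_cycles=10):
    """Upper bound on LPDDR2-to-serial latency: DLL spin-up plus the switch itself."""
    return dll_spinup_us + switch_cycles / clock_mhz  # cycles / MHz = microseconds


memory = SerialSwitchModel()
for cmd in ("start_dll", "switch_to_serial", "switch_to_lpddr2"):
    print(f"{cmd:18s} -> {memory.command(cmd).value}")
print("worst case at 400MHz:", worst_case_switch_up_us(400), "microseconds")
```

The latency helper makes one point clear: the 10-clock switch is negligible next to the 5 to 10 microseconds of DLL spin-up, which is why hiding the spin-up behind continued LPDDR2 operation matters.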

Systems can use SPMT memory in LPDDR2 mode at boot time and whenever the system is operating in a mode with low memory-bandwidth requirements. The system can quickly switch to the LVDS SPMT-serial mode whenever it requires higher memory data rates—for example when video is activated, when multiple operating modes are in use simultaneously, or when multiple processors are running in a multicore device. The SPMT Consortium estimates that the optimum crossover point between LPDDR2 and SPMT serial interface data rates for a x16/8-lane LPDDR2/SPMT-serial memory device is around 1.6 Gbytes/sec based on energy considerations.
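As a rough illustration of how a system might act on that crossover estimate, the Python sketch below picks an interface mode from the required bandwidth. Only the 1.6 Gbytes/sec threshold comes from the SPMT Consortium's estimate; the function name, the workload names, and their bandwidth figures are hypothetical, and a real controller would presumably add some hysteresis so that it doesn't thrash between modes near the threshold.

```python
# SPMT Consortium estimate for a x16/8-lane LPDDR2/SPMT-serial device.
CROSSOVER_GBYTES_PER_SEC = 1.6


def pick_interface_mode(required_gbytes_per_sec, crossover=CROSSOVER_GBYTES_PER_SEC):
    """Return the lower-energy mode for the required bandwidth (ties go to LPDDR2)."""
    if required_gbytes_per_sec < 0:
        raise ValueError("bandwidth cannot be negative")
    return "SPMT serial" if required_gbytes_per_sec > crossover else "LPDDR2"


# Hypothetical workloads, for illustration only.
for workload, gbytes_per_sec in [("boot", 0.2), ("idle UI", 0.5), ("HD video", 3.0)]:
    print(f"{workload:8s} {gbytes_per_sec:4.1f} Gbytes/sec -> {pick_interface_mode(gbytes_per_sec)}")
```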

By subsuming the LPDDR2 standard and making SPMT memories wholly superset compatible with LPDDR2 memories, I think the SPMT Consortium has significantly raised the likelihood of adoption when commercial SPMT memories finally appear late next year. I also think the likelihood of such memories appearing is pretty high considering that the top two DRAM vendors, Samsung and Hynix, are members of the SPMT Consortium. Together, Samsung and Hynix have a bit more than half of the overall DRAM market according to the latest stats from the DRAMeXchange (http://j.mp/aNaNiY).

On the embedded processor side of the equation, Marvell has announced that it too has joined the consortium, which further improves SPMT’s chances of success. In fact, Marvell supplied a canned quote for the SPMT Consortium’s press release with one of the strongest statements I’ve seen in such press releases, so I am suspending my usual cynicism about such quotes and reproduce it here:

“Today’s mobile DRAM technology is geared to support the bandwidth needs of single core processors. As devices evolve to integrate multi-core CPU, multi shader 3D graphic engines at multi-GigaHertz speeds, it’s clear that DRAM will be the single performance bottleneck, especially for handheld systems where power budget is a major constraint,” said Dr. Sehat Sutardja, chairman, president and chief executive officer at Marvell. “Marvell is joining the SPMT Consortium to actively promote Serial Port Memory Technology as an industry standard and address the immediate needs of the industry. We encourage other companies active in the sector to join us in our mission.”

Strong backing like this from a market maker like Marvell can only help SPMT’s cause. Whether or not SPMT actually reaches critical mass is something that we’ll all be watching as events unfold in the hotly competitive memory arena over the next 18 to 24 months.

More on the Xilinx EPP: Three ways to communicate with on-chip peripherals

June 2, 2010 on 3:11 am | In Design, FPGA, SOC | No Comments

Last month I discussed the newly introduced Xilinx Extensible Processing Platform (EPP), which represents a new product line and a new venture for FPGA leader Xilinx. To briefly recap, devices in the EPP device family are essentially a high-end microcontroller or embedded processor based on two ARM Cortex-A9 32-bit RISC processor cores (implemented as hard IP cores and not soft cores in the FPGA fabric), some amount of SRAM used largely for processor cache, some standard peripheral blocks implemented as hard IP cores, and multiple AMBA 4 interconnect buses that link the hard-core, on-chip IP blocks with an FPGA fabric that you can use to create additional peripheral devices or anything else you might need for the digital portion of your embedded design. These Xilinx devices will sell for the low tens of dollars and will consume much less power than full-tilt FPGAs, making them very attractive replacements for 32-bit microcontrollers and standalone processors in certain applications. This month, I want to focus on how you might use those multiple on-chip AMBA 4 buses to communicate with whatever you’ve implemented in the EPP’s FPGA fabric. Xilinx hasn’t yet discussed this sort of technical information, but it’s not too hard to project some basic facts.

There are essentially only three fundamental ways to use the Xilinx EPP’s on-chip AMBA 4 buses to communicate with peripheral devices whether they are hard cores outside of the FPGA fabric or soft cores implemented in the FPGA fabric. Those three ways are: registers, memory-mapped RAM, or streaming. Each of these communications approaches has advantages and disadvantages depending on application needs.

I/O data, control, and status registers date back to the earliest days of peripheral chips that were introduced along with the very first wave of microprocessors back in the 1970s. Back then, registers were generally no wider than eight bits. Data registers were almost always eight bits wide and permitted the passing of individual bytes back and forth between the processor and whatever I/O device lay beyond the peripheral chip. There were peripheral chips for simple parallel I/O, UARTs (universal asynchronous receiver/transmitters) for serial I/O, timer chips, interrupt controllers, and that was pretty much all there was at first.  Each control and status register in these peripheral chips had individual bits and bit groups that implemented specific functions such as “set the output pins to be low-true” or “enable the interrupt pin.”

I/O registers were implemented as individual latches, so it was easy to take the output of a latch bit and use it for driving another piece of hardware inside of the peripheral chip or to take a signal and connect it to the D input of a status-register bit. We still use I/O status and control registers in precisely the same way today, inside of large peripheral blocks like Ethernet and video controllers. We simply use a lot more registers than before and they tend to be wider than eight bits these days.
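For readers who haven't written a register-level driver lately, here is a small Python simulation of the pattern. Everything in it is hypothetical: the bit assignments, the register names, and the peripheral itself are invented for illustration and do not describe any real chip or anything Xilinx has disclosed about the EPP.

```python
# Hypothetical 8-bit control and status register bit assignments.
CTRL_IRQ_ENABLE = 0x01       # "enable the interrupt pin"
CTRL_OUTPUT_LOW_TRUE = 0x02  # "set the output pins to be low-true"
STATUS_DATA_READY = 0x80     # set by hardware when a byte has arrived


class SimplePeripheral:
    """Toy peripheral with one control, one status, and one data register."""

    def __init__(self):
        self.control = 0x00
        self.status = 0x00
        self.data = 0x00

    # Processor-side accesses.
    def write_control(self, value):
        self.control = value & 0xFF

    def read_status(self):
        return self.status

    def read_data(self):
        self.status &= ~STATUS_DATA_READY & 0xFF  # reading the data clears the flag
        return self.data

    # Hardware-side events.
    def receive_byte(self, byte):
        self.data = byte & 0xFF
        self.status |= STATUS_DATA_READY

    def irq_asserted(self):
        enabled = bool(self.control & CTRL_IRQ_ENABLE)
        ready = bool(self.status & STATUS_DATA_READY)
        return enabled and ready


dev = SimplePeripheral()
# Read-modify-write: set one control bit without disturbing the others.
dev.write_control(dev.control | CTRL_IRQ_ENABLE)
dev.receive_byte(0x41)
if dev.read_status() & STATUS_DATA_READY:
    print("irq:", dev.irq_asserted(), "data:", hex(dev.read_data()))
```

Each control bit drives a piece of behavior and each status bit reflects a piece of hardware state, which is the latch-per-bit arrangement described above.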

Memory-mapped I/O maps a large array of bus-addressed memory locations into a linear memory array inside of the peripheral device. Often, this memory array is implemented as a RAM inside of the peripheral device but if the memory array is small enough, it might be implemented as a large register bank instead of RAM.

The earliest use for such memory-mapped arrays in I/O chips was for memory-mapped video. The CPU could write an image to memory-mapped video RAM and a simple sequencing controller read out the video and sent it to the display. Initially, access to the video RAM had to be interleaved between processor and display sequencer but eventually as display speeds and resolution increased, video RAM became dual-ported to handle the rising number of access cycles per unit time.

Originally, it took an entire board to create a memory-mapped video controller. I recall using a Vector Graphics Flashwriter video display card in my North Star Horizon S-100 computer to implement fast video for an early WordStar editing system. I had to write the low-level video drivers in Z80 assembly code to connect the Flashwriter to the CP/M operating system and to WordStar itself. That was back in 1979 and things were mighty primitive back then. The advantage of memory-mapped video back then was performance. The North Star’s Z80 CPU could directly manipulate every character location on the video display without using the serial escape sequences mandated by the use of RS-232 terminals. The processor would write characters directly to the screen with a simple byte move; it could examine characters with a simple byte read; and it could change the character’s attribute with a simple read-modify-write instruction sequence.
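Those three operations (byte move, byte read, read-modify-write) are easy to show with a simulated text-mode video RAM in Python. The layout here is an assumption made for illustration: an 80x24 screen, one byte per character cell, with the top bit used as a reverse-video attribute. It is not a description of the Flashwriter's actual memory map.

```python
COLS, ROWS = 80, 24
ATTR_REVERSE = 0x80  # assumed attribute bit; the low 7 bits hold the ASCII code

# The "video RAM": one byte per character cell, initially all spaces.
video_ram = bytearray(b" " * (COLS * ROWS))


def cell_offset(row, col):
    if not (0 <= row < ROWS and 0 <= col < COLS):
        raise IndexError("character cell is off screen")
    return row * COLS + col


def put_char(row, col, ch):
    """Simple byte move: store the character code straight into video RAM."""
    video_ram[cell_offset(row, col)] = ord(ch) & 0x7F


def get_char(row, col):
    """Simple byte read: examine what is on screen at this cell."""
    return chr(video_ram[cell_offset(row, col)] & 0x7F)


def set_reverse(row, col):
    """Read-modify-write: change the attribute bit, leave the character alone."""
    video_ram[cell_offset(row, col)] |= ATTR_REVERSE


put_char(0, 0, "A")
set_reverse(0, 0)
print(get_char(0, 0), hex(video_ram[0]))
```

No escape sequences and no serial link are involved; every screen update is an ordinary memory access, which is where the performance came from.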

In an era where processors were relatively expensive, it made sense to use the CPU running the application code to directly manipulate video on the screen as well. In the 21st century, microprocessors are so cheap and CPUs are so isolated from peripheral devices by caches and bus hierarchies that we have radically changed the way video works in most computers and embedded systems. Most systems now employ separate video processors, but certain non-video applications and certain peripheral devices can still make effective use of memory-mapped I/O to provide direct processor access to peripheral memory.

Finally there’s stream I/O, which directs long transaction bursts to one memory or port address. Large operating systems, Linux in particular, have a great affinity for stream I/O and it’s an essential I/O protocol for streaming audio and video media. (No coincidence there.) Generally, a peripheral processor is required in such streaming applications to interpret commands embedded within the data stream and to separate multiplexed data streams (such as merged audio/video streams, which have become extremely common). Often, it’s advisable to place a FIFO at the input port of a streaming-I/O peripheral to help buffer the incoming data stream. Buffering helps to bridge mismatched data rates or inter-burst latencies between the streaming transmitter and receiver.
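To show why that input FIFO helps, here is a small cycle-by-cycle simulation in Python. It is a sketch under simple assumptions of my own: the transmitter offers words in bursts, the receiver drains exactly one word per cycle, and a full FIFO stalls the transmitter. The class and function names are hypothetical.

```python
from collections import deque


class StreamFifo:
    """Bounded FIFO sitting at the input port of a streaming peripheral."""

    def __init__(self, depth):
        self.depth = depth
        self.words = deque()

    def push(self, word):
        """Accept a word if there is room; a False return means back-pressure."""
        if len(self.words) >= self.depth:
            return False
        self.words.append(word)
        return True

    def pop(self):
        return self.words.popleft() if self.words else None


def simulate(burst_schedule, depth):
    """burst_schedule[i] is the number of words the transmitter offers in cycle i.

    Returns (words received in order, peak FIFO occupancy, transmitter stall cycles).
    """
    fifo = StreamFifo(depth)
    pending = deque()  # words the transmitter is still trying to send
    received, peak, stalls, next_word, cycle = [], 0, 0, 0, 0
    while cycle < len(burst_schedule) or pending or fifo.words:
        if cycle < len(burst_schedule):
            for _ in range(burst_schedule[cycle]):
                pending.append(next_word)
                next_word += 1
        while pending and fifo.push(pending[0]):
            pending.popleft()
        if pending:
            stalls += 1  # FIFO full: the transmitter has to wait this cycle
        peak = max(peak, len(fifo.words))
        word = fifo.pop()  # the receiver drains one word per cycle
        if word is not None:
            received.append(word)
        cycle += 1
    return received, peak, stalls


print(simulate([4, 0, 0, 0], depth=4))  # burst fits, transmitter never stalls
print(simulate([4, 0, 0, 0], depth=2))  # shallower FIFO, transmitter stalls
```

In both runs every word arrives in order; the only difference is how long the transmitter is held off, which is the trade a designer makes when sizing the FIFO.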

Xilinx hasn’t discussed any of these details but it’s likely that the EPP will support all three types of I/O transactions. What remains to be seen is what will be supported in hard-core IP and what will need to be implemented in the FPGA fabric.

Xilinx redefines the high-end microcontroller with its ARM-based Extensible Processing Platform – Case Studies – Part 2

May 1, 2010 on 8:22 pm | In Design, FPGA, Low-Power, SOC | No Comments

In my previous blog, I discussed the hard-core features of Xilinx’s new Extensible Processing Platform (EPP) and explained the device at the 50,000-foot level. In this blog, I’ll dig a bit deeper into the thinking behind the EPP’s FPGA fabric and I’ll show some case studies that indicate why Xilinx may have come up with a product family that will revolutionize high-end embedded system design.

Two features of Xilinx’s EPP architecture differentiate it from other microcontrollers. The first, discussed in Part 1, is the presence of a dual-core ARM Cortex-A9 processor. Most microcontrollers contain only one processor core. The EPP has two. So it’s already starting from a high-end position. The second differentiating feature is the inclusion of an unidentified amount of FPGA fabric on the device. Since the Xilinx EPP represents a family of parts, it’s safe to assume that various family members will contain differing amounts of FPGA fabric. That’s an especially safe assumption because the Xilinx presentation showed two EPP examples with different amounts of FPGA fabric. So we know that the family will likely include at least two parts—and probably many more if the product line proves successful.

What do you do with this FPGA fabric? Well the hard-core section of the EPP already gives you two 32-bit processor cores, some microprocessor peripherals, a memory controller, and some SRAM cache. So you might use the fabric to add some standard peripherals that your design needs that are not included in the standard hard-core set. Because the EPP is based on the AMBA-AXI bus, there are already many such peripheral devices available as synthesizable IP to choose from and the mere presence of Xilinx’s EPP is likely to increase the number of choices substantially as IP vendors decide to jump on the bandwagon.

Perhaps more likely, you will develop custom accelerators for application-specific tasks that permit the EPP to perform task-specific computations really, really fast. Bolt-on, bus-connected acceleration is the preferred design style for many embedded systems architects and it appears to me that the Xilinx EPP heartily supports this design style. I expect the Xilinx EPP offerings to flourish because it complements in-favor system design styles so well. So let’s take a look at two case studies provided by Xilinx to illustrate how the EPP can reduce a system design’s parts count, cost, and power consumption.

The first example is for an automotive optical-recognition system that provides a driver with a number of assist features for collision avoidance, blind spot detection, visually assisted cruise control, night vision, a self-parking system, and a lane-departure warning system. An automotive vendor wanted to develop such a system in a compact package that could be installed high on the windshield between the glass and the rear-view mirror. The system needed to be passively cooled (not an easy feat considering the location of the system). Sensors feeding the system will include video cameras, passive infrared sensors, and active RADAR sensors. The vendor wished for the system to be scalable, based on which and how many sensors are used in the vehicle.

The total processing requirement for this system included 1600 DMIPS from the supervisory processor and 32 GMACs for the sensor processing. Cost and power targets for this system were $50 and 5W. A design based on a processor-based ASSP backed with two auxiliary DSPs (needed to provide the 32 GMACs) came in at $45.75 and 6.6W, so the cost target was achieved but the power consumption was too high. A second design based on a Xilinx EPP came in at “less than” $40.75 (less than because Xilinx is still somewhat secretive about pricing for an unannounced product, so the listed EPP costs “less than $25”) and 4.2W, so the power consumption is about 16% below budget. More important, the EPP design provides roughly 200% of the DMIPS and GMACs the design needs, delivering 3335 DMIPS and 60 GMACs. Even with these cost and power advantages, the Xilinx EPP would be far less attractive if it forced the software team to use an unfamiliar hardware architecture. One of the biggest advantages of the Xilinx approach is the familiar nature of the EPP’s foundation hardware.

The second case study involves an intelligent video surveillance system that can monitor a scene and raise alarms or generate alerts based on the scene. The estimate for processing requirements was 3100 MIPS from the supervisor processor and 49 GMACs for video processing. Cost and power targets were $100 and 10W. A system design based on separate host and video processors came in just above the processing requirements, with a part cost of $93 and a power dissipation of 10W. So this discrete design just meets spec with very little processing headroom and no leeway in power dissipation. A second system design based on a Xilinx EPP delivers 3335 DMIPS and 60 GMACs, so there’s ample video-processing headroom. Parts cost dropped to “less than $87” (again, Xilinx is being cagey with quoting EPP costs) and power dissipation dropped to 7.9W (about 21% under the power goal).
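The margins in both case studies are easy to check. The short Python sketch below recomputes them from the figures Xilinx supplied; the dictionary layout and function names are mine, and I've treated the surveillance system's 3100 MIPS requirement as directly comparable with the EPP's DMIPS figure, which is an assumption.

```python
# Delivered by the EPP-based design in both case studies.
EPP_DMIPS, EPP_GMACS = 3335, 60

CASE_STUDIES = {
    "automotive": {
        "power_budget_w": 5.0, "discrete_power_w": 6.6, "epp_power_w": 4.2,
        "need_dmips": 1600, "need_gmacs": 32,
    },
    "surveillance": {
        "power_budget_w": 10.0, "discrete_power_w": 10.0, "epp_power_w": 7.9,
        "need_dmips": 3100, "need_gmacs": 49,
    },
}


def percent_under_budget(budget, actual):
    """Positive means under budget; negative means the design blows the budget."""
    return (budget - actual) / budget * 100.0


def headroom(delivered, needed):
    """Delivered processing as a multiple of the requirement."""
    return delivered / needed


for name, c in CASE_STUDIES.items():
    discrete = percent_under_budget(c["power_budget_w"], c["discrete_power_w"])
    epp = percent_under_budget(c["power_budget_w"], c["epp_power_w"])
    print(f"{name}: discrete {discrete:+.0f}%, EPP {epp:+.0f}% versus the power budget; "
          f"EPP headroom {headroom(EPP_DMIPS, c['need_dmips']):.2f}x DMIPS, "
          f"{headroom(EPP_GMACS, c['need_gmacs']):.2f}x GMACs")
```

Note that the headroom is very different in the two studies: about 2x on both measures for the automotive design, but only about 1.08x on the supervisory processor and 1.22x on the GMACs for the surveillance design.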

Both of these case studies illustrate the Xilinx EPP’s applicability in high-end embedded systems with big processing requirements. In such systems, the EPP’s standardized, high-end, hard-core, dual-processor core (an ARM Cortex-A9 MP cluster) and the high-performance, 28nm FPGA fabric coupled to it through multiple high-performance buses are significant assets, well suited to such high-end applications. Even though these are high-end applications, they are likely to boost sales of Xilinx’s EPP-based devices to levels rarely achieved by Xilinx’s more expensive FPGAs. EPP component costs listed in these two case studies suggest that Xilinx plans to sell these parts for tens of dollars, not hundreds or thousands of dollars. This feat is possible only because the standardized components within the EPP are hard cores, and they consequently consume only 5-10% of the silicon they’d require if implemented with an FPGA fabric.

Xilinx redefines the high-end microcontroller with its ARM-based Extensible Processing Platform – Part 1

May 1, 2010 on 7:10 pm | In DRAM, Design, FPGA, Low-Power, SOC | No Comments

Last week at the Embedded Systems Conference (ESC) held in San Jose, California, Xilinx disclosed additional information about its upcoming Extensible Processing Platform (EPP), which I previously discussed in a February 1 blog entry written just after RTECC (the Real Time Embedded Computing Conference, see Designing Low-Power Systems with FPGAs, Part 2). This past week at a press conference, Xilinx’s Senior VP of Worldwide Marketing and Business Development Vin Ratford again spoke of the upcoming processor-centric devices Xilinx plans to introduce next year, but this time he provided far more detail. As promised, the devices fuse features of a high-end microcontroller (hard-core implementations of a 32-bit processor, memory, and I/O) with an FPGA fabric.

But wait, you say, haven’t both Xilinx and Altera (and other FPGA vendors) tried this before? Yes, they have, with uninspiring results. However, I submit that Xilinx’s EPP is substantially different and it stands a very good chance of capturing significant market share from microcontrollers and from discrete processors. It may also be very attractive to design teams considering the development of certain types of SOCs. Consequently, the Xilinx EPP family may well become the family of high-volume parts Xilinx wants to have in its product catalog.

Ratford provided so much information in his ESC announcement that I’ll need multiple blog entries to cover it all. In this first entry, I’ll describe what Xilinx’s EPP is and I’ll cover some of the thinking behind the architecture; in the second entry, I’ll describe some case studies that illustrate why this component family might be very attractive for a certain class of embedded product, because it promises lower parts count, lower cost, and higher performance with lower power consumption. Please understand that Xilinx stopped short of announcing actual products. Ratford described an architecture that will be used to produce a product family with actual products starting to appear next year.

 There are two major components to Xilinx’s EPP: a hard-wired, high-end, microcontroller-like block and a connected FPGA fabric based on Xilinx’s 28nm unified FPGA logic-cell design as shown in the diagram below.

 

[Figure: Xilinx EPP Block Diagram]

 

 

First, let’s look at the hard-wired portion. It’s well known that processors don’t run very fast when implemented with FPGAs. The reason mostly revolves around the wiring congestion associated with the large register files of 32-bit RISC processors. Wiring congestion translates into “slow” and you can figure on giving up 50-75% or more of the processor’s maximum clock rate in a given process technology when comparing a synthesized ASIC implementation against a synthesized FPGA implementation. Hand optimization can reclaim some of that speed but if you’re planning on using a standard processor architecture anyway, it makes perfect sense to implement the processor on the FPGA as a hard core using a standard ASIC synthesis flow. That way, you get the full speed of the IC process technology along with the full logic density and therefore a much lower silicon cost.

Xilinx has chosen ARM’s Cortex-A9 32-bit RISC processor core for the EPP but has gone a step farther by implementing a dual-core version of this processor. That choice immediately puts the Xilinx EPP family at the high end of the microcontroller spectrum. First, there are two 32-bit processor cores. Second, a Cortex-A9 processor can run at 2 GHz in TSMC’s 40nm, high-performance process technology. That’s one fast processor, much faster than many embedded applications require. A dual-core version, as is employed in Xilinx’s EPP family, is faster still.

In choosing a standard processor core from ARM’s extremely successful stable of processors, Xilinx has plugged directly into a broad community of embedded software developers. In other words, choosing the widely used ARM architecture telegraphs Xilinx’s recognition that embedded software development is now the largest and most expensive part of any high-end embedded project. In many such projects, software developers often outnumber hardware developers by 10:1. In announcing the EPP, Xilinx shows that it fully recognizes the need to make the software development team happy first. The company’s selection of an ARM processor core also leverages the associated large and familiar development-tool set, the good selection of operating systems, and the extended ecosystem that goes with the ARM architecture’s large and growing market dominance in the embedded space. All of these factors make the ARM processor very attractive to embedded development teams.

To the dual-core ARM Cortex-A9 processor, Xilinx has added a number of hard-core peripherals including SRAM caches, timers, interrupt controllers, switches, memory controllers, and commonly used I/O peripherals certain to be useful for many high-end embedded designs. Because these additional blocks are all hard-core implementations, they too take little room on the chip and consume much less power than they’d need if implemented in an FPGA fabric. Note that the EPP chips will contain enough SRAM for caches and small scratchpads; however, bulk memory, generally implemented with DRAM, will be off-chip. Consequently, the EPP architecture includes hard-core DRAM controllers to manage off-chip memory. Ratford’s talk at ESC did not elaborate on the type of memory the on-chip controller can handle; however, DDR2, DDR3, or both would probably be a good guess, considering the high-end nature of the EPP family. The targeted applications will need a lot of memory and DDR2 and DDR3 DRAM are now the best choices in terms of cost/bit.

Key to the software-friendly approach Xilinx is taking with the EPP, the architecture boots code upon power up just like a microcontroller. Only then is the FPGA fabric configured. This approach makes the EPP look very familiar to software developers who are not at all comfortable with writing code for a fluid, amorphous system that’s not well-defined when power comes up. The FPGA vendors spent a lot of money on reconfigurable architectures learning this lesson. In addition, HLL compilers don’t much care for undefined hardware either—undefined hardware just doesn’t fit the standard software-programming models. So the implementation of a complete, hard-wired microcontroller within the EPP cuts out a lot of that old unfamiliar strangeness associated with previous attempts to marry hard processor cores and FPGA fabrics.

Speaking of the FPGA fabric, Xilinx will be using the unified 28nm FPGA fabric in the EPP. Xilinx developed this fabric for its next-generation Spartan and Virtex FPGAs. (If you want more details about this FPGA fabric, take a look at the White Paper here.) According to Ratford, Xilinx’s Virtex and Spartan FPGAs will both employ this fabric, which is the first time that Xilinx has used the same FPGA fabric for its high-performance and its low-cost FPGA product families. Using the same fabric for the two Xilinx FPGA product lines and for the EPP means that Xilinx need only develop one set of hardware-design tools for the 28nm node and it also means that hardware designers only need to learn one set of tools as well.

The EPP’s hard-core embedded microcontroller communicates with the on-chip FPGA fabric using ARM’s newly announced AMBA 4/AXI bus. Ratford said at RTECC and repeated at ESC that Xilinx worked with ARM to develop a version of this new bus specifically for FPGA use, but he has not provided details. The diagram of the EPP Ratford projected (reproduced above) shows multiple buses connecting the EPP’s hard-core embedded microcontroller and the on-chip FPGA fabric. Although Ratford provided no additional details, I plan to write a third blog entry discussing possible ways of optimally connecting the processor cores to the FPGA fabric. In the next installment of this blog, I’ll discuss some specific case studies Ratford covered in his ESC presentation that show how the EPP can reduce the parts count, cost, and the power consumption of high-end embedded systems.

(You can find a White Paper describing the Xilinx EPP here.)

Designing Low-Power Systems with FPGAs, Part 2

February 1, 2010 on 5:34 pm | In Design, FPGA, SOC | 1 Comment

Literally within an hour of posting my last blog entry on designing low-power systems with FPGAs, Altera’s marketing engine issued a related email and dropped it into my inbox. Altera’s email pre-announces the company’s upcoming FPGAs based on 28nm lithography. The email included the following marketing graph (with no scale) to explain the advantages of the smaller geometries for FPGA manufacture.

Altera 28nm devices

The first set of bars in the graph sets the baseline using Altera’s 40nm devices as a reference. The next set of bars shows that the feature shrink alone improves FPGA gate density by 25% and power consumption by about 12.5%. (Note: That’s my eyeball talking, not Altera’s official numbers.)

The next set of bars shows what happens incrementally when Altera takes some major logic blocks and hard-codes them. Suddenly, gate density doubles and power consumption drops by 40% compared to 40nm FPGAs.

The last set of bars shows what happens when you combine the lithography shrink and hard-coded IP. Suddenly you’re getting 4x the gate density at a mere 25% of the power consumption compared to 40nm devices. (Note: I’m not sure what suddenly happened to the transceiver count, that third bar in the group, which had been constant until everything got combined in the last set. My guess is that the marketing artist who drew the graph got overzealous, cut everything 75% for visual consistency, and the proofreaders missed it. I think the number of transceivers is supposed to stay constant, based on the first three sets of bars in the graph.)

Two things to note here. First, you get a lot of bang out of hard-coded IP. Coincidentally, MIPS announced that Altera had licensed the MIPS32 architecture back in October, 2008 but Altera was mum on the subject back then. RISC processor cores make lousy targets for programmable FPGA fabrics, largely because of the routing congestion around their large register files, so processor core IP is one of the IP types that really should be hard-coded onto an FPGA. Although neither Altera nor Xilinx had much success with their first-generation FPGAs that incorporated hard-coded processor cores, that doesn’t mean they’re not going to try again and the MIPS announcement late last year telegraphed that move.

Want more proof? Last week at the Real Time Embedded Computing Conference held in Santa Clara, California, Xilinx’s Senior VP of Worldwide Marketing and Business Development Vin Ratford did more than telegraph his company’s intent to put processor cores back into FPGAs. He announced and elaborated on that intent. Xilinx will be adopting the ARM architecture and an FPGA-friendly version of ARM’s AMBA interconnect in future FPGA generations.

Make no mistake. Processors are coming to FPGAs for several reasons. First, a RISC processor core consumes between 25,000 and 50,000 gates. You can drop one of those puppies into an FPGA fabric and never see it. In essence, those transistors are “free.” That’s the nature of an FPGA’s programmable interconnect. Logic just sort of disappears.

Second, you can’t build a system without at least one processor these days. Which immediately leads to the third reason. If Xilinx and Altera truly wish to convert their “We’re taking over everything” or “All your chips are belong to us” attitudes into reality, then the processor will just have to live on the FPGA silicon. Otherwise, the FPGA companies don’t get all of the chips. It’s as simple as that.

However, as both Altera and Xilinx discovered last time they tried this, dropping a processor core into an FPGA and making it usable is not just a matter of burying some gates into the FPGA fabric. Effective ways of connecting the processor to the programmable FPGA fabric must also exist and the software developers—who represent more than 90% of modern embedded development teams—must also be happy with the integration. You only make them happy with good development, profiling, and debugging tools.

And there’s the rub.

(It’s possible that Shakespeare’s Hamlet was indeed an embedded systems developer.)

More on Mentor’s Catapult C from John Cooley and Other Designers

December 18, 2009 on 11:15 pm | In Design, EDA, SOC | No Comments

Earlier this month, I wrote about Mentor’s C-to-gates synthesis tool Catapult C and low-power design. The EDA industry’s self-appointed gadfly and uber-user John Cooley has just written an extensive blog posting about Catapult C complete with detailed comments from several of his reader/users. These comments and Cooley’s conclusions are very, very interesting for people in the ESL space, as well as anyone involved in chip design, so I thought I’d highlight some of Cooley’s conclusions.

First, Cooley quotes EDA analyst Gary Smith’s published numbers to quantify Catapult C’s lead in the high-level synthesis arena. He then uses the anecdotal evidence of the large number of comments (both good and bad) that his reader/users make about Catapult C relative to the other high-level synthesis tools to conclude that there do seem to be more IC designers using Catapult C than competing tools.

Cooley then hands the microphone over to his designer/readers for comments. One thing that really strikes me about these comments is the number of people who want to use C++ to describe hardware. Now C++ compilers have a tough enough time creating streamlined object code out of C++ descriptions. C++ allows such a high level of abstract description that algorithmic descriptions more resemble poetry than precise engineering-style descriptions. My opinion is that expecting any and all C++ descriptions to result in efficient hardware is a bit of a reach. No matter how good the compiler is, C++ descriptions can be so abstract that it can be tremendously difficult to infer any sort of efficient hardware design from such descriptions. The likelihood of developing mind-reading compilers in the near future seems mighty slim to me.

Other designer/readers seem to share my concerns. One engineer who sent a comment to Cooley and who preferred to remain anonymous wrote: “I remain concerned that quality-of-results derived from designs developed in ANSI C/C++ will not compare well to hand-coded RTL for our design area (hardware accelerators for broadband communications), regardless of claimed market share of Catapult C.”

Now don’t make something out of this skepticism (mine and “anonymous”) that’s not there. When logic synthesis first appeared in the early 1980s, it too “suffered” from a quality-of-results issue. The earliest logic-synthesis tools could not generate gate-level designs that were as efficient as designs created manually by even moderately good human logic designers, just as C compilers could not initially generate assembly code that was as efficient as code written by a good human codesmith. However, two things happened to make this issue become a non-issue.

First, the tools simply got better. Adoption of Verilog and VHDL as description languages helped to standardize the sea of HDL slopping over the EDA bucket back in the 1980s. Standardization gave compiler designers a focused target and channeled their creative energy into building better synthesis tools rather than creating ever-more-elegant description languages.

Second, Moore’s Law made irrelevant the difference between 10 and 100 gates or even between 100 and 1000 gates. At some point, we stopped counting gates just as we’d previously stopped counting polygons and transistors. We don’t really know how many gates there are on a chip any more and what’s more, we no longer care. Not really. Because it’s square millimeters of silicon that actually cost money, not gates. So today we use square millimeters and then use a fudge factor to estimate the number of gates represented by the square-millimeter metric.

Perhaps something like that will happen with high-level synthesis. The jury’s still out.

Laser Spike Annealing of Nickel in Nanometer CMOS ICs Cuts Leakage 10x

December 6, 2009 on 8:22 pm | In CMOS, Design, EDA, Green Design, Low-Power, SOC | No Comments

One of the sad facts of life for nanometer silicon has been the rise of leakage current as device geometries shrink. At 65nm, CMOS leakage currents roughly equal operating currents, making it virtually impossible to reduce overall operating current by more than half. I’ve long thought this was the result of low-Vt transistors that can never fully turn off, a consequence of the drive to recover speed that’s lost when supply voltages are cut to reduce operating power. Turns out there’s another culprit: nickel contamination that occurs when nickel atoms drift away from the nickel-silicide interface layer used to improve the connectivity of metal inter-layer contact plugs. The nickel atoms drift during the annealing process, which is used to drive the deposited nickel atoms into the transistors’ source and drain contact pads. The first of two annealing cycles drives the metallic nickel atoms into the silicon source and drain pads creating Ni2Si silicide. A second, higher-temperature annealing process converts the Ni2Si into NiSi, which has lower resistance and thus provides good electrical connectivity between the contact pad and the metal interconnect plug.

It turns out that the current “soak” annealing processes, which last for tens of seconds, allow the nickel atoms to drift far afield. Like beach sand in your bathing suit, the nickel gets into places you’d rather not have it. The drifting nickel atoms seem to have an affinity for silicon lattice discontinuities, which can be found at the outside ends of the transistor where source and drain diffusions meet the isolation trenches and in long, narrow voids that run from the source and drain regions towards and into the FET channel. Both of these hiding places cause leakage because the metallic nickel conducts electricity where there should be insulator or semiconductor material. Nickel at the ends of the transistor causes substrate leakage and nickel atoms in the channel naturally cause channel leakage.

Applied Materials and European semiconductor research powerhouse IMEC have jointly developed a laser-annealing process that lasts one millisecond instead of tens of seconds. As a result, the diffusing nickel doesn’t have time to drift into these unwanted places during the second annealing step that generates NiSi. Applied Materials described a similar laser-spike annealing process back in 2004 (see article here), but reportedly achieved only a 3-4% leakage reduction back then. This latest development appears to be a refinement of that earlier technique. The two companies will be presenting their findings at this week’s IEDM conference in Baltimore, Maryland.

IMEC and Applied Materials will indeed have pulled a rabbit out of the hat if this laser-spike annealing process plus the application of appropriate transistor-design rules result in cutting leakage currents by 90% for nanometer CMOS. Leakage-driven power loss has become a significant problem for advanced IC design and had appeared to be insurmountable, even with the addition of high-K and metal-gate processing. Now, it appears there’s a real solution with the best of all possible implications for system and logic designers: they don’t need to learn anything new. They can leave this fix to the design tools and to the process engineers and once again skirt the system-level and architectural issues of low-power design.
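
To put rough numbers on why leakage matters so much, here is a back-of-the-envelope sketch. It assumes normalized currents (operating current of 1.0 and an equal leakage current, per the 65nm rule of thumb above) and treats the hoped-for 90% leakage cut as a given; the function names are mine and purely illustrative.

```python
def total_current(operating: float, leakage: float) -> float:
    """Total supply current is simply operating current plus leakage current."""
    return operating + leakage


def best_case_saving(operating: float, leakage: float) -> float:
    """Fraction of total current saved if operating current could be driven to zero."""
    return operating / total_current(operating, leakage)


# Assumed, normalized values: leakage roughly equals operating current at 65nm.
operating = 1.0
leakage_before = 1.0
leakage_after = leakage_before * 0.10  # the hoped-for 90% leakage reduction

# With equal currents, even a perfect design effort can save at most half.
ceiling_before = best_case_saving(operating, leakage_before)

# Saving delivered by the process fix alone, with no design changes at all.
saving_from_anneal = 1.0 - (
    total_current(operating, leakage_after)
    / total_current(operating, leakage_before)
)

# Once leakage shrinks, design effort on operating current pays off again.
ceiling_after = best_case_saving(operating, leakage_after)

print(f"best-case design saving before the fix: {ceiling_before:.0%}")
print(f"saving from the annealing fix alone:    {saving_from_anneal:.0%}")
print(f"best-case design saving after the fix:  {ceiling_after:.0%}")
```

Under these assumptions the process fix alone trims total current by a bit less than half, and it raises the ceiling on what designers could save afterwards from one half to roughly nine tenths.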

C-to-Gates Synthesis and Low-Power Design

December 4, 2009 on 1:42 pm | In Design, Low-Power, SOC | No Comments

One of the many “pushbutton” design-automation tools that chip designers have sought is a “C-to-Gates” tool that would allow the automated development of hardware from algorithmic descriptions written in the C programming language.

The place to start almost any system design is with the most fundamental aspect of system design: algorithm development. Systems are collections of independently and dependently operating algorithms. For example, a DVD player uses one algorithm to decompress the combined media stream, another to decode the resulting video stream, and yet another algorithm to decode the resulting audio stream. All systems are based on the execution of one or more algorithms.

Most algorithm development begins and ends with C. One exception to this rule is MATLAB from The MathWorks, which is very popular with many algorithm developers. However, there are ways to get MATLAB to produce C from MATLAB-based algorithms as well.

Given the close association between algorithm development and the C language, it’s only natural to want a tool that automatically converts C descriptions to hardware netlists. However, early attempts to create such tools didn’t meet with much commercial success, largely because the quality of results (number of gates, operational speed, and resulting power consumption) compared poorly to manual RTL-generation design techniques.

The tools have been getting better over the years, leading to this recent announcement by Mentor Graphics:

“Mentor Graphics Corp. (NASDAQ: MENT), today announced that Fujitsu Kyushu Network Technologies Limited (Fujitsu QNET) has chosen the Catapult C Synthesis tool for use in its design tool environment to implement complex algorithms in hardware that were previously processed by a processor implemented on LSI. Fujitsu’s growing expertise with the Catapult C tool is also a key enabler in the expansion of their design services business.

Fujitsu QNET was able to dramatically cut power consumption by using the Catapult C Synthesis tool to create a dedicated hardware accelerator for mobile voice processing algorithm versus running in software. The resulting silicon implementation yielded a reduction in power consumption of 83%. This was made possible by the ability of the Catapult C tool to find the optimal trade-off between power, performance and area, in this case, implementing a design satisfying voice performance requirements while running at a lower clock frequency than the previous implementation using a processor.”

Based on this announcement, it would seem that C-to-gates tools, at least Mentor’s Catapult C, are getting closer to reality. In fact, the above announcement would lead you to believe that such tools are here and production-ready today. Indeed, Mentor’s description of the Catapult C synthesis tool appears mighty attractive:

“Catapult C Synthesis reduces design time and verification effort. When writing pure C++, designers focus on the functional intent of their application. Timing and architectural information is abstracted away from the source description. With fewer details in the model, testbench development is also simplified.

Implementation of specific details are automatically added during the synthesis process, eliminating error-prone manual interventions and resulting in RTL designs correct by construction. Debug of the resulting RTL is in turn eliminated, further reducing the overall verification effort.

The Catapult C automated verification environment allows any RTL implementation of a C++ model to be verified using the original C++ testbench. This eliminates the need to write pin-level interfacing and bit-timed RTL environments to verify the RTL blocks created by Catapult before moving to system integration.”

There’s a caveat or two to remember, however. First, there’s good C code and bad C code whether you’re writing code to run on a processor or code that’s to be synthesized into gates. In the case of C-to-gates synthesis, good code signals the design intent as clearly as possible so that the synthesis tool needs to infer as little as possible. Machine inference is the second caveat. Every detail that Catapult C—or any synthesis tool for that matter—must infer is a design detail that you didn’t put into the design.

Conventional RTL-driven logic synthesis makes such inferences all the time and, over the years, designers have gotten savvy to the kinds of inferences that will be made by their logic-synthesis tools and have compensated by adapting their code-writing styles when writing hardware descriptions in the Verilog and VHDL hardware description languages. However, C has largely been used as a sequential algorithm description tool to create software that runs on single processors, and engineers’ usage patterns reflect that long history. In addition, C describes algorithms at a higher level of abstraction than descriptions written in hardware description languages. As a result, there’s always more to infer from a C algorithm description just due to the higher abstraction level.

So eye that 83% power-reduction that Fujitsu QNET achieved for that voice-processing algorithm with envy. Just remember that engineering isn’t as simple as pushing a button.

NOCs: The Undead of the SOC World

November 8, 2009 on 6:14 pm | In SOC, Uncategorized | 3 Comments

The 7th International SOC Conference in Newport Beach featured a session on NOCs (networks on chip). Perhaps it’s the undue influence of the recent Halloween festivities, but NOCs remind me of vampires, of the undead. They just keep coming back no matter what, despite the lack of uptake in the commercial sector.

Academics love NOCs because they can be analyzed to death and they provide wonderful fodder for postgraduate work. You can come up with increasingly elegant, time-consuming, and costly routing algorithms for NOCs, which has permitted the creation of many, many academic papers. Each and every paper lists the prior failings of earlier NOC approaches, analyzes the shortcomings, and then proposes an even more elegant and costly NOC that solves the technical problems of predecessors. But these more elegant solutions have even less commercial potential because of the costs.

When will it end?

Perhaps never.

One of the speakers at last week’s International SOC Conference was Professor Nader Bagherzadeh of UC Irvine’s EECS Department. His presentation was sensibly titled “Is Network-on-Chip (NoC) a Viable Choice for the Future?” That’s a very reasonable question and Professor Bagherzadeh gave a reasoned presentation. One of his first slides contrasted three approaches to SOC interconnect design. The first approach, popular with most of today’s SOC designers, is the use of bus hierarchies.

Buses are the dinosaurs of system design. The fossils of bus-based, board-level designs from decades past form the bones of new SOC designs even though the economics of on-chip nanometer silicon interconnect now bear no resemblance to the copper-and-fiberglass design rules and economics of the 1980s. As Professor Bagherzadeh said, bus-based designs are not scalable, they enforce centralized control in increasingly decentralized systems of growing complexity, and they force the use of long wires on the SOC, which severely degrades performance and needlessly exposes system designs to the newest bugaboo for deep-submicron design: on-chip variability.

The current leader for efficient, fast SOC designs is point-to-point interconnect, which offers low latency, application-specific optimization, very high bandwidth, and low cost. Deep-submicron wires are plentiful and cheap. System designers should use them accordingly.

And then there are NOCs, which also promise shorter wiring runs between on-chip routers. High levels of interconnectivity mean that NOCs can provide high bandwidth with distributed traffic control. However, said Professor Bagherzadeh, NOCs are not as efficient as point-to-point wiring for carrying traffic on application-specific SOCs and consequently we have still not seen many tapeouts that use NOCs for real chips in real applications.

But that doesn’t mean that NOCs are elegantly useless. I think Professor Bagherzadeh made a good case for NOCs to be used as flexible interconnect when designing a platform chip. Here, you don’t have all of the knowledge to predict traffic flows over an entire chip and need some flexibility when routing high-bandwidth traffic. In such cases, you might be willing to suffer the silicon overhead of a NOC in exchange for interconnect flexibility.

It was at that point that Professor Bagherzadeh started to discuss his work with a 7-channel NOC router, which is even bigger, better, and more elegant than the conventional 5-port NOC router, offers more effective traffic bandwidth and throughput, and requires even more elegant routing algorithms. We now return you to our regular NOC programming where the usual solution to low uptake in NOC usage is to create bigger, better, and more elegant NOC hardware and routing algorithms.

The Surprising Popularity Rise of On-Chip Memory

November 8, 2009 on 4:53 pm | In CMOS, DRAM, Design, Low-Power, SOC | No Comments

I attended the 7th International SOC Conference in Newport Beach last week and several of the speakers addressed issues relating to SOC and system power. One of these speakers was Bob Madge, Director of Technology Marketing at LSI Corp (formerly LSI Logic). In case you didn’t know, LSI has been evolving its business from its original focus on developing ASICs and SOCs for customers to a focus on programmable ASSPs (application-specific standard products) and custom silicon specifically aimed at the networking and storage markets. Madge’s first slide explained the reasoning: annual storage-capacity growth is a projected 49% per year and annual network-traffic growth is a projected 42% per year. Good growth numbers for a business to target.
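
For a feel of what those growth rates imply, here is a quick compound-growth sketch. It assumes the projected rates hold steady year after year, which is my simplification and not something Madge claimed; the function names are illustrative only.

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years needed to double at a constant compound annual growth rate."""
    return math.log(2) / math.log(1.0 + annual_growth)


def growth_factor(annual_growth: float, years: int) -> float:
    """Cumulative multiplier after the given number of years."""
    return (1.0 + annual_growth) ** years


# Projected rates quoted in the presentation; constant growth is an assumption.
storage_growth = 0.49
network_growth = 0.42

for label, rate in (("storage capacity", storage_growth),
                    ("network traffic", network_growth)):
    print(f"{label}: doubles in about {doubling_time_years(rate):.1f} years, "
          f"grows {growth_factor(rate, 5):.1f}x in five years")
```

In other words, both markets would roughly double every two years if the projections hold.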

To deliver competitive parts, LSI stays on top of IC design and manufacturing trends. One trend that caught LSI and the semiconductor industry by surprise has been the rapid growth in on-chip memory use. On-chip memory makes sense for two reasons. First and foremost, it provides better performance than off-chip memory because putting memory on the chip along with the logic circuitry eliminates two sets of off-chip drivers and receivers, which reduces power consumption for memory transactions. Second, on-chip logic can communicate with on-chip memory over extremely wide memory interfaces—pin count is not an issue if you stay on the chip. A wide memory interface reduces the number of transfers needed to move a given amount of data and lower transfer rates cut power as well.
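
The wide-interface argument is easy to check with a little arithmetic. The sketch below counts the transfers needed to move one block of data over interfaces of different widths; the 4096-bit block and the 32-bit and 512-bit widths are hypothetical values I picked for illustration, not LSI figures.

```python
def transfers_needed(payload_bits: int, bus_width_bits: int) -> int:
    """Number of bus transfers required to move a payload, rounding up."""
    if payload_bits < 0 or bus_width_bits <= 0:
        raise ValueError("payload must be non-negative and bus width positive")
    # Ceiling division using integers only.
    return -(-payload_bits // bus_width_bits)


# Hypothetical example: one 4096-bit block of data.
block_bits = 4096

narrow = transfers_needed(block_bits, 32)    # a typical off-chip style width
wide = transfers_needed(block_bits, 512)     # a wide on-chip interface

print(f"32-bit interface:  {narrow} transfers")
print(f"512-bit interface: {wide} transfers")
print(f"the wide interface needs {narrow // wide}x fewer transfers")
```

For the same delivered bandwidth, the wide interface can therefore run at a proportionally lower transfer rate, which is where the power saving comes from.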

However, merging logic and memory on one piece of silicon has always presented design and manufacturing issues. Bulk, high-volume, high-capacity memory manufacturing processes differ from logic manufacturing processes because the two processes must optimize different parameters. Memory processes emphasize low cost manufacturing and tend to have fewer metal layers than logic processes, which emphasize speed and on-chip connectivity. “Frequency, density, and power are always a challenge,” said Madge.

For example:

  • Today’s network routers use 400-Mbit buffers. Switches need 512 Mbits of storage or more. In the future, said Madge, these devices will need as much as 1 Gbit of on-chip memory in multiple configurations.
  • IP controllers used in network storage applications currently use 60 to 100 Mbits of cache memory. In the future, these devices will need 200 Mbits of memory or more.
  • Media processors currently use 60 to 80 Mbits of memory running at 500 MHz. Future needs will be on the order of 100 to 200 Mbits of memory running at 600 to 700 MHz.

All of these examples demonstrate the coming challenges for fast, dense, on-chip memory.

LSI is looking at embedded (on-chip) DRAM and the use of 3D, through-silicon via technology for chip-to-chip stacking as ways of increasing the amount of on-chip memory. The company is doing this because it sees a continued and rapid rise in the amount of on-chip memory needed for its networking and storage chips.

Embedded DRAM uses a 1T (one-transistor) cell, which obviously improves density over a 4T or 6T static RAM cell. However, embedded DRAM also reduces static and dynamic power consumption because its fewer transistors use less power and leak less current than the greater number of transistors required to build the same amount of SRAM.

LSI is also investigating other power-saving features that become possible when you move memory onto the logic chip including a sleep mode for the memory, dual power rails, and low-voltage operation. However, said Madge, the biggest benefit appears to be a move to embedded DRAM because of the huge reduction in transistor counts.
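
To see how large that reduction is, here is a rough count of memory-cell transistors for a 512-Mbit array, the switch buffer size mentioned above. It is only a sketch: it counts the cell array alone, ignores the DRAM cell's storage capacitor, refresh logic, and all peripheral circuitry, and assumes 1 Mbit means 2**20 bits.

```python
MBIT = 2 ** 20  # assumption: binary megabits


def cell_transistors(capacity_mbits: int, transistors_per_cell: int) -> int:
    """Transistors in the cell array alone (no decoders, sense amps, or refresh)."""
    return capacity_mbits * MBIT * transistors_per_cell


capacity = 512  # Mbits, matching the switch example in the list above

sram_6t = cell_transistors(capacity, 6)
sram_4t = cell_transistors(capacity, 4)
edram_1t = cell_transistors(capacity, 1)

print(f"6T SRAM:  {sram_6t:,} transistors")
print(f"4T SRAM:  {sram_4t:,} transistors")
print(f"1T eDRAM: {edram_1t:,} transistors")
print(f"eDRAM uses {sram_6t // edram_1t}x fewer cell transistors than 6T SRAM")
```

Real savings will be smaller than the raw ratio suggests because of the overheads this sketch leaves out, but the order of magnitude explains why Madge singled out embedded DRAM.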
