“Power is the new timing”: A review of Paul McLellan’s new book—EDA Graffiti

September 2, 2010 on 1:54 am | In Uncategorized | No Comments

I’ve taken the title for this blog entry from the name of a section in Paul McLellan’s new self-published book: EDA Graffiti. This phrase directly refers to one of the profound changes taking place in semiconductor design. In this section of the book, McLellan writes: “The 1990s were the decade of timing, when all [EDA] tools became timing driven with a completely synchronous design methodology. … The 2000s seem to be the decade of power, where the biggest headache is now meeting the power budget.” Now, if you had not read the entire 230 pages in the book preceding these statements, you might think that nothing else in IC design had changed.

But indeed, a lot has changed. The term SOC (system on chip) only came into common use around 1995. The term distinguishes an ASIC with one or more on-chip processor cores, which we now call an SOC, from an ASIC without one. There weren’t many SOCs before 1995 so we didn’t need to name such things. Fully 15 years later, you’d be hard-pressed to find any ASIC design underway that doesn’t have at least one on-chip processor. The norm is more like six or 10 processor cores per chip and some chips have more than 100 processors. Given this radical and rapid escalation in chip complexity, plus the death of Dennard scaling at 90nm, which has curbed the downward trend in per-transistor current consumption, it’s no wonder that power consumption and power dissipation have come to the fore as a chief source of grief for chip-design teams.

If you were to purchase this relatively inexpensive ($25) book from McLellan just for this section alone, it would probably be worth your time and money, but EDA Graffiti is much, much more valuable. It contains thick slices of well-cogitated observation and advice from someone who has self-admittedly spent more than three decades thinking about IC EDA and working in the EDA industry. McLellan was at VLSI Technology when the ASIC revolution started in the early 1980s. He became president of Compass Design Automation when it spun out of VLSI Technology. He’s been an executive manager at key EDA vendors including Cadence, Ambit, and VaST. And he was CEO for a year at Envis, a power-centric EDA vendor. McLellan’s longevity in the industry and his varied experiences in the EDA world give him more than enough street cred to carry off a gritty book like this. There is a ton of sharp, cogent information trapped within its covers for the inquisitive to find.

The book starts with an overview of the semiconductor industry. It pays homage to Moore’s Law, but it’s careful to explain what Moore’s Law really said (more transistors every generation) and what it did not say (faster transistors, lower power—things predicted by Dennard scaling). The book also covers semiconductor business economics in a way most technologists never consider. How many dollars per second do you need to capture in revenue to pay for a chip fab? You’d better know you’re going to make that kind of revenue and how many years that revenue will flow into your company before you drop a few billion dollars to build and equip the fab.
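
The dollars-per-second question comes down to one division. Here is a minimal sketch, assuming a hypothetical $3 billion fab whose cost must be recovered over five years; both numbers and the function name are illustrative placeholders of mine, not figures from the book.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # ignores leap years


def dollars_per_second(fab_cost_usd: float, payback_years: float) -> float:
    """Revenue rate needed just to recover the fab's cost over the payback period."""
    return fab_cost_usd / (payback_years * SECONDS_PER_YEAR)


# Hypothetical inputs: a $3 billion fab paid back over 5 years.
rate = dollars_per_second(3e9, 5)
print(f"${rate:.2f} per second, around the clock")
```

With those placeholder inputs the answer is roughly $19 every second, day and night, before the fab has earned a cent of profit.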

McLellan also delves deeply into EDA economics, marketing, and sales and his perspectives will help you to understand why EDA companies do what they do. As I now work in the EDA industry myself, this chapter on EDA is now heavily highlighted in my copy of the book. Nearly every sentence in this chapter on EDA contains deep insight, dearly won. The chapters on EDA marketing and engineering are similarly highlighted in my copy of the book. The chapter on Silicon Valley, not so much. That chapter contains McLellan’s personal assessments of immigration policy as it applies to high-tech and it contains other lifestyle odds and ends such as a profile of Cypress Semiconductor founder and CEO TJ Rodgers that will perhaps interest Silicon Valley insiders more than others.

The chapters include:

 

  • Semiconductor industry
  • EDA industry
  • Silicon Valley
  • Management
  • Sales
  • Marketing
  • Presentations (as in PowerPoint)
  • Engineering
  • Investment and Venture Capital

 

McLellan’s EDA Graffiti book started as a long series of blog entries on the EDN Web site. He has taken these blog entries, expanded them, and added more material in the form of new sections. So even if you read the original blog entries, you’ll find more than enough new material in the book to make it worth the purchase.

The book is not without flaws. Most obviously, it’s not indexed so you’ll have to search for that perfect pithy sentence you remember reading, just as I’ve had to do while writing this book review. In addition, the book clearly needs copy editing. There are misspellings; there are funny line breaks and strange paginations; and there are sentences that were clearly cut up during editing and then poorly respliced. However, these are really minor nits. You cannot help but ladle out generous portions of insight and knowledge about the IC and EDA industries wherever you dip into this book. Want perfect grammar? Buy a copy of Strunk and White.

If you want a copy of the book—and you should—you can order one at https://www.createspace.com/3452185. It’s also on Amazon.com, which always reports that the book is out of stock. That’s because McLellan self-published this book using the increasingly popular book-on-demand format. My copy was printed on May 13, 2010. (It says so on the back page.) Do not think that McLellan has taken this self-published route for lack of an interested publisher. Given the Internet, email, and 30+ years’ worth of deep industry connections, an experienced businessman and marketer like McLellan can do far more with a book like this through direct marketing than can a traditional book publisher.

By the way, if you’ve got a CEO job in EDA that needs filling, or if you just want to tap into his experience, you’ll find McLellan at www.greenfolder.com. He’s just published a 273-page resume.

Ethernet ports, low power, and multimedia, Part 2

August 1, 2010 on 10:28 pm | In Design, Green Design, Low-Power, Networking | No Comments

In the previous post, I discussed the huge potential power savings being enabled by the IEEE’s 802.3az Energy Efficient Ethernet specification, now under development and in early deployment. While IEEE 802.3az promises to save significant amounts of power and energy through the use of sleep modes for inactive Ethernet ports, continuous stream-based multimedia applications of the various Ethernet standards cannot endure the power-down and wake-up delays associated with the new specification. Consequently, the IEEE is also developing new standards to help Ethernet connections better handle these multimedia applications. But before discussing those new standards, it’s helpful to step back and take a look at the current state of multimedia networking because it closely resembles the networking situation with computers if you set your personal time machine to take you back in time by 30 years or so. (No DeLorean needed!)

Three decades ago, computer networking was in chaos. Each of the major mainframe and minicomputer vendors had a unique and mutually incompatible networking scheme. The electronics didn’t match. The bit rates didn’t match. The packetizing and de-packetizing schemes didn’t match. The error-detection and -correction schemes didn’t match. And just as important, the cables and connectors didn’t match. Any data center that supported hardware from multiple computer vendors needed a big box of cables with all sorts of connectors just to handle the incompatible networking schemes.

Chances are good that you have a similar box of cables at home to help you connect all of your multimedia devices together. I know I can connect one of my televisions with an RF coax cable, simple audio and video coax cables terminated with RCA plugs, RCA RGB component cables, or HDMI cables. My audio connections include simple RCA-plug audio cables, coaxial and optical TOSlink cables, and speaker wire without connectors. I do indeed have boxes full of AV cables that no longer match any of the AV components I now use. Worst of all, these are all dumb, dumb, dumb connections. The AV system components have no idea what’s coming over these cables. I need to configure each box (usually through a remote control I can’t find) to tell it which of the many back-panel ports to use for the audio and video signals and how the streamed information is encoded. For example, I need to manually tell my system which of the DVD audio streams to use while watching a video. You would think that reasonably good equipment would be able to detect and optimize the experience automatically, but my equipment can’t.

So it’s not much of a reach to envision a world where Ethernet-enabled AV equipment automagically discovers the abilities of the other equipment in a local AV network cloud and then collaborates with the other connected equipment to optimize each viewing or listening experience. That end is precisely the goal of the IEEE 802.1 AVB (audio video bridging) working groups. However, the goals go much farther than that. Imagine AV systems with multiple content sources and multiple listeners. Then imagine a network of AV components that can automatically optimize the listening and viewing experience for each AV network user simultaneously in real time. That scenario is also within the goals of the 802.1 AVB efforts. Part of the need is for components to discover the capabilities of other Ethernet-connected devices in the local AV component cloud. Part of the need is to reserve a substantial part of the cloud’s networking bandwidth for content streams that absolutely require low-latency, high-bandwidth content delivery.

This effort relies on three interwoven specifications:

 

  • IEEE 802.1AS – A timing synchronization standard
  • IEEE 802.1Qat – A stream-reservation protocol
  • IEEE 802.1Qav – A packet forwarding and queuing protocol that can accommodate isochronous and non-isochronous AV traffic using reserved bandwidth and regular data-type Ethernet traffic using best-effort packet delivery.
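
To make the bandwidth-reservation idea concrete, here is a rough sketch of the kind of admission control a stream-reservation scheme implies: a stream is admitted only if it fits under a ceiling on reserved traffic, and whatever is left over goes to best-effort data. This is not the protocol itself. The `Link` class, its method names, the stream names, and the 75% ceiling are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    capacity_mbps: float
    reservable_fraction: float = 0.75  # assumption: ceiling on reserved AV traffic
    reservations: dict = field(default_factory=dict)

    def reserved_mbps(self) -> float:
        return sum(self.reservations.values())

    def reserve(self, stream_id: str, mbps: float) -> bool:
        """Admit the stream only if total reservations stay under the ceiling."""
        limit = self.capacity_mbps * self.reservable_fraction
        already = self.reservations.get(stream_id, 0.0)
        if self.reserved_mbps() - already + mbps > limit:
            return False  # refused: the stream would squeeze out other traffic
        self.reservations[stream_id] = mbps
        return True

    def release(self, stream_id: str) -> None:
        self.reservations.pop(stream_id, None)

    def best_effort_mbps(self) -> float:
        """Bandwidth left for ordinary data traffic."""
        return self.capacity_mbps - self.reserved_mbps()


link = Link(capacity_mbps=100)
print(link.reserve("hd-video", 40))      # fits under the 75 Mbps ceiling
print(link.reserve("surround", 30))      # 70 Mbps reserved so far, still fits
print(link.reserve("second-video", 10))  # would reach 80 Mbps, so it is refused
print(link.best_effort_mbps())
```

The point of refusing a stream up front is that the streams already admitted keep their guaranteed delivery; ordinary data simply makes do with what remains.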

 

Together with the Energy Efficient Ethernet specification (802.3az) discussed in the previous blog entry (even AV components sleep sometimes), the IEEE 802.1 AVB specifications ensure even longer life for the Ethernet protocol, now going on its fourth decade of ever-widening deployment.

Ethernet ports, low power, and multimedia, Part 1

August 1, 2010 on 10:26 pm | In Design, Green Design, Low-Power, Networking | No Comments

When you think about the massive effects that the Ethernet standards have had on system design, the overall impact is no less than staggering—and I do not use that term lightly. Thirty years ago, when Ethernet was new, networking was both provincial and fragmented. The only machines deemed worthy of “internetworking” were mainframes and minicomputers. The big-iron houses like IBM and the minicomputer vendors such as Digital Equipment Corp (DEC) and Hewlett-Packard all had proprietary networking standards. For example, IBM had SNA and DEC had DECnet. Then in 1980 the IEEE started a working group—802—to standardize a network based on the Ethernet protocols that Bob Metcalfe, David Boggs, Chuck Thacker, and Butler Lampson developed at Xerox PARC, building on ideas from ALOHAnet, the University of Hawaii radio network that Metcalfe analyzed in his PhD thesis. Seeing the original Ethernet hardware paraphernalia, including its thick and unwieldy yellow coaxial backbone cable, its vampire-tap MAUs (medium attachment units), and its special coaxial-cable coring tool, you might be excused for not predicting that Ethernet would take over the planet. But it did, through ruthless evolution and continuous cost cutting, which reduced the cost of connection to less than a dollar. And as a result, every computer manufactured these days has either a wired or wireless Ethernet port or multiple ports of various Ethernet flavors.

The interoperability of today’s Ethernet-enabled devices is staggering. To see an iPad in a Starbucks coffee shop surfing Web servers in far-flung places such as Eastern Europe, India, Asia, or even North America through a casual WiFi connection is stunning—yet it is so commonplace that the feat rarely reaches our conscious minds these days. It just happens. Conjoined with that ease of connectivity, however, is a dark cloud: wasted energy to keep those billions of Ethernet connections alive even when they’re not carrying data. The energy numbers boggle the mind. In a recent Webinar, John Swanson from Synopsys listed a few jolting energy-consumption figures. In the US alone, Ethernet ports attached to servers, network storage devices, routers, switches, and other networking equipment burn about 0.5 terawatt-hours per year! Ethernet ports on computers, printers, edge switches, and other local devices installed in commercial, research, and educational institutions burn another 1.5 terawatt-hours per year! Home-based ports burn about 2 terawatt-hours per year! In all, that’s about 5 terawatt-hours or $400 million worth of electricity per year just to keep the bits moving along all of the Ethernet ports in the US alone. And you know that all of those ports aren’t active all the time, yet most of them burn power 24/7. No wonder that the IEEE is now addressing the issue of wasted networking power through several new standards designed to make Ethernet use more efficient.
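
A quick back-of-the-envelope check on those numbers, reading them as terawatt-hours of energy per year (that reading is my assumption). The sketch adds up the three itemized categories and works out the electricity price implied by the quoted total and dollar figure; the category labels and variable names are my own shorthand.

```python
KWH_PER_TWH = 1e9

# Figures as quoted from the webinar, read as terawatt-hours per year.
itemized_twh = {
    "servers, storage, and network gear": 0.5,
    "institutional computers and printers": 1.5,
    "home ports": 2.0,
}
quoted_total_twh = 5.0   # "about 5"
quoted_cost_usd = 400e6  # "$400 million"

itemized_sum = sum(itemized_twh.values())
implied_price = quoted_cost_usd / (quoted_total_twh * KWH_PER_TWH)

print(f"Itemized categories: {itemized_sum} of about {quoted_total_twh}")
print(f"Implied price: ${implied_price:.2f} per kWh")
```

The three itemized categories account for 4 of the roughly 5 quoted, and the quoted cost works out to about 8 cents per kilowatt-hour.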

The key low-power standard under development is called the 802.3az specification and it employs LLDPDUs (link layer discovery protocol data units) to allow a switch and a device to negotiate sleep and quiet times when the Ethernet ports can actually be powered down. Implementations based on the 802.3az specification will require new hardware and software but these new ports are backward compatible with the old, power-wasting kind of Ethernet port. If there’s no negotiation, there’s no sleep time and the ports will operate normally. However, when negotiation between two 802.3az ports does take place, both ends of an Ethernet connection can time out and can power down their respective Ethernet MACs and PHYs, which will result in substantial power savings.
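
The backward-compatibility rule is easy to state as logic: a link may sleep only if both partners negotiated it, and otherwise it behaves like any ordinary port. The sketch below models just that rule. It does not model the actual discovery-protocol exchange, and the `Port` class, its field, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Port:
    name: str
    supports_sleep: bool  # True for a port built to the new low-power spec


def link_can_sleep(a: Port, b: Port) -> bool:
    """Sleep is possible only when both link partners can negotiate it."""
    return a.supports_sleep and b.supports_sleep


def link_state(a: Port, b: Port, idle: bool) -> str:
    """An idle link sleeps only after successful negotiation; otherwise it stays up."""
    return "sleep" if idle and link_can_sleep(a, b) else "active"


new_switch = Port("switch", supports_sleep=True)
new_nic = Port("new NIC", supports_sleep=True)
old_nic = Port("legacy NIC", supports_sleep=False)

print(link_state(new_switch, new_nic, idle=True))   # both ends negotiated: sleep
print(link_state(new_switch, old_nic, idle=True))   # no negotiation: stays active
print(link_state(new_switch, new_nic, idle=False))  # carrying traffic: active
```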

Currently, version D3.2 of the working group specification is circulating. More important perhaps, Ethernet controllers with 802.3az port compatibility are already available, as are some early PHY chips. The MAC part of the 802.3az specification has not changed in a while according to Synopsys’ Swanson and only PHY changes are expected in the future.

Data applications for Ethernet can benefit greatly from the power reductions made possible by the 802.3az specification but throughput- and latency-sensitive applications such as audio and video over Ethernet need additional support, which I’ll cover in the next blog post.

Freescale’s earthquake: ColdFire+ and Kinetis families shake up the 32-bit microcontroller landscape

July 5, 2010 on 5:24 pm | In Design, Low-Power | No Comments

Yesterday, while I was walking the San Andreas fault in the Los Trancos Open Space Preserve in the Santa Cruz Mountains above Palo Alto, docent Paul Billig pointed out some nondescript rocks half buried along the hiking path. “These rocks don’t come from here,” he said. “They come from there,” said Paul as he pointed at a distant mountain poking through the haze about 30 miles to the south. “That’s Loma Prieta.” We know these rocks came from Loma Prieta because the chemical composition precisely matches the composition of rocks still on the mountain. Over the eons, as the Pacific plate has slipped north against the North American plate along the San Andreas fault, these rocks, which rolled off of Loma Prieta and across the fault line, have been transported north at the rate of about an inch and a half per year. That description could also fit one of Freescale’s two newest 32-bit microcontroller families: the Freescale ColdFire+ family that’s based on the 68000 microprocessor architecture introduced by Motorola Semiconductor in 1979. “RISCification” of the 68000 instruction set coupled with increasingly advanced process technology has dragged this processor architecture forward 30 years (rather than 30 miles) into the 21st century—and it holds up remarkably well. Together with Freescale’s new Kinetis microcontroller family based on the 32-bit ARM Cortex-M4 processor core, Freescale is striding into the current hot spot in the microcontroller war zone—the 32-bit zone. Loyalties and market shares for 8- and 16-bit microcontroller families are pretty well settled. The new front is at 32 bits and Freescale’s massive foray into this battle is a sign that there’s still territory to win.

In a world where some processor vendors are designating chips that dissipate Watts of power as “low-power” devices, Freescale’s ColdFire+ and Kinetis microcontroller families are truly low-power devices. Both families are based on 90nm process technology—for good clock speed (100-150MHz) and relatively low leakage—and feature ten similar software-selectable run, wait, and stop operating modes from full run to three successive levels of “very low leakage” stop. The ColdFire+ microcontrollers draw less than 150 microamps/MHz and the ARM Cortex-M4-based Kinetis microcontrollers draw less than 200 microamps/MHz. Their real “low-power” operation occurs when the microcontrollers enter a VLP (very low power) run mode that restricts the processor core and peripherals to a 2MHz clock, which translates into an operating current of less than 300 microamps for the two ColdFire+ microcontroller families and less than 400 microamps for the seven Kinetis families.
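
The VLP run-mode currents follow directly from the per-MHz figures and the 2MHz clock limit. Here is a small sketch of that arithmetic, with one extra step that is purely my own illustration: how long a hypothetical 220mAh coin cell would last at that current if nothing else in the system drew power.

```python
VLP_CLOCK_MHZ = 2

# Upper bounds quoted above, in microamps per MHz.
MAX_UA_PER_MHZ = {"ColdFire+": 150, "Kinetis": 200}


def vlp_run_current_ua(family: str) -> float:
    """Worst-case VLP run current: per-MHz draw times the 2MHz clock limit."""
    return MAX_UA_PER_MHZ[family] * VLP_CLOCK_MHZ


def hours_on_battery(capacity_mah: float, current_ua: float) -> float:
    """Idealized runtime; ignores self-discharge and every other load."""
    return capacity_mah / (current_ua / 1000.0)


for family in MAX_UA_PER_MHZ:
    current = vlp_run_current_ua(family)
    # 220 mAh is a hypothetical coin-cell capacity, not a Freescale figure.
    hours = hours_on_battery(220, current)
    print(f"{family}: under {current} uA, at least {hours:.0f} hours on 220 mAh")
```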

The ColdFire+ microcontroller introduction includes two families: Qx and Jx. Both ColdFire+ microcontroller families include several family members with different amounts of RAM, Flash EPROM, a new type of EEPROM memory called FlexMemory (for more on the innovative FlexMemory, see this blog entry from the Denali Memory Report), and a variety of peripherals. The Jx variants include USB OTG (On the Go) support. In all, there will be 40 members of this new ColdFire-based microcontroller series. The Kinetis introduction includes seven different families (K10, K20, K30, K40, K50, K60, and K70) and there will be more than 200 variants within these seven families. All seven families are upward pin-compatible and share a common set of peripheral devices making it easy to move up to more capable family members if needed.

The amount of information about these microcontrollers is massive and will take a while for you to digest, but this is a blog about low-power design so the ten operating modes deserve a bit more discussion here. The ten modes are:

 

  • Run: Normal run mode
  • VLP (Very Low Power) Run: CPU and peripheral clocks limited to 2MHz. Flash access limited to 1MHz. LVD (low-voltage detection) is off.
  • Wait: Peripherals function at full speed but the processor core sleeps.
  • VLP Wait: CPU is in sleep mode. Peripheral clocks limited to 2 MHz.
  • Stop: Processor core in static state. Register contents maintained. LVD on.
  • VLP Stop: Processor core in static state. LVD off. Some peripherals and pin interrupts operational.
  • LL (low-leakage) Stop: Processor core voltage reduced to low-leakage level. Register contents retained. Exit using interrupts from various peripherals and from interrupt pins.
  • VLL (very low leakage) Stop 3: Processor core voltage reduced to low-leakage level. Most internal logic powered down. All system RAM contents retained. I/O states held.
  • VLL Stop 2: Like VLL Stop 3 but only some of the system RAM contents are maintained.
  • VLL Stop 1: Like VLL Stop 3 but only 32 bytes of the register file are maintained.

 

Naturally, the deeper you go into the power-down modes, the longer it takes to wake up the microcontroller. The Kinetis microcontrollers need 4 microseconds to awake from the VLP Wait and VLP Stop modes, 35 microseconds to awake from the VLL Stop 3 and VLL Stop 2 modes, and more than 100 microseconds to awake from the VLL Stop 1 mode and to restore RAM. These ten operating modes give the embedded design team tremendous flexibility in managing system power consumption.
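
One way a design team might use these wake-up figures is to pick the deepest mode whose wake-up time still fits the application’s latency budget. The sketch below does that with the Kinetis numbers quoted above. The ordering from shallow to deep simply follows the mode list, LL Stop is left out because no wake-up time is quoted for it, and the function name is hypothetical.

```python
# (mode, wake-up time in microseconds, True if the figure is only a lower bound)
# Ordered from shallowest to deepest, following the order of the mode list.
WAKE_TIMES = [
    ("VLP Wait", 4, False),
    ("VLP Stop", 4, False),
    ("VLL Stop 3", 35, False),
    ("VLL Stop 2", 35, False),
    ("VLL Stop 1", 100, True),  # quoted as "more than 100 microseconds"
]


def deepest_mode(budget_us: float):
    """Return the deepest listed mode that can wake within budget_us, or None."""
    choice = None
    for mode, wake_us, lower_bound_only in WAKE_TIMES:
        # A lower-bound figure must be strictly beaten; the real value is larger.
        fits = wake_us < budget_us if lower_bound_only else wake_us <= budget_us
        if fits:
            choice = mode
    return choice


for budget in (2, 10, 50, 500):
    print(f"{budget} us budget -> {deepest_mode(budget)}")
```

Because the VLL Stop 1 figure is only a floor, a real design would need the data-sheet number before trusting that last choice.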

In addition to these 240 new 32-bit devices, Freescale has rolled out substantial development support for them in the form of an Eclipse-based CodeWarrior IDE that includes compilers for both the ColdFire+ and ARM Cortex-M4 processor cores. Freescale also provides the MQX RTOS and associated software stacks at no additional charge to ColdFire+ and Kinetis customers. There’s also a large and growing ecosystem for these parts.

Multicore server, PC, and embedded designs push memory power, drive use of advanced DDR3 SDRAMs

July 2, 2010 on 9:32 pm | In DRAM, Design, Green Design, Low-Power, SDRAM | 4 Comments

Systems designers try all sorts of methods to reduce system power consumption. For years, we’ve relied on circuit tricks and have been reducing logic supply levels from the 5V power supplies that were so common from the 1970s and throughout the 1980s to the 1V levels we now employ with today’s advanced logic chips. Memory supply voltages have dropped as well. For example, the original DDR SDRAMs had a 2.5V supply voltage and DDR2 SDRAM employs a 1.8V supply voltage. That’s nearly double today’s SOC, processor, and microcontroller core voltages. The reason for this lag in supply-voltage reduction is that memory vendors prefer to stay in the economic sweet spot for IC lithography as opposed to logic design, which prefers to stay on or near the bleeding edge. Consequently, memory’s share of a system’s power-consumption pie has been rising and there really hasn’t been much attention paid to reducing memory power consumption. The advent of DDR3 SDRAM provides another opportunity to cut memory power through further reductions in memory supply voltage. Using advanced process technology, Samsung has attained a supply voltage of 1.35V for its 40nm DDR3 SDRAMs. This drop in memory supply voltage can produce a 38% cut in server power consumption, according to Samsung.
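
To see why supply voltage matters so much, recall the textbook approximation that CMOS switching power scales as C × V² × f. The sketch below applies that rule to the supply steps mentioned above, holding capacitance and clock rate constant. That is an idealization of mine, not Samsung’s methodology, and it ignores leakage, I/O termination, and process changes.

```python
def dynamic_power_ratio(v_new: float, v_old: float) -> float:
    """Switching-power ratio after a supply change, assuming P ~ C * V^2 * f
    with capacitance C and frequency f held constant."""
    return (v_new / v_old) ** 2


# Supply steps mentioned in the text: DDR at 2.5V, DDR2 at 1.8V, Samsung DDR3 at 1.35V.
for v_old, v_new in [(2.5, 1.8), (1.8, 1.35)]:
    saving = 1 - dynamic_power_ratio(v_new, v_old)
    print(f"{v_old}V -> {v_new}V: roughly {saving:.0%} less switching power")
```

Under that assumption each voltage step alone trims switching power by a bit less than half.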

 

Performance isn’t really the engine that drives DDR3 adoption. The real driver is bandwidth and there are two design trends that force the quest for ever-increasing amounts of memory bandwidth. The first such design trend is the wholesale adoption of homogeneous and heterogeneous multicore architectures. As an industry, we’ve embraced the use of multiple processor cores as a solution to the death of Dennard scaling. Although most people attribute the increase in operating frequency and the decrease in per-transistor power consumption through lithographic shrinks to Moore’s Law, which Gordon Moore codified in an article he published in 1965 while working at Fairchild Semiconductor, that attribution is not factually correct. Moore simply predicted that the number of transistors on a chip would grow exponentially over time as lithographies shrank. It was IBM’s Robert Dennard who observed in 1974 that lithographic advances in IC manufacturing also consistently produced faster transistors that consumed less power. For decades, we’ve used Dennard scaling to produce faster and faster processors (while attributing the improvements to Moore’s Law).

 

The semiconductor industry has poured billions of dollars into keeping Moore’s Law alive but Dennard scaling died at 90nm. We continue to get more transistors on a chip with each advance in IC lithographic scaling, but the transistors no longer get appreciably faster, so the MHz wars have ended. Worse, pushing transistors to their performance limit now produces leaky transistors that dissipate as much power when off as when on. We now recognize that the way to get more performance is to use the transistor bounty to increase the number of processors and to distribute the work load across these processors without striving for multi-GHz clock rates.

 

With all of these on-chip processors executing code and accessing data on a multicore chip, system designers must find a way to make large amounts of inexpensive memory available to these processors. For the last decade, the most cost-effective way to provide a system with large amounts of low-cost memory has been the SDRAM. The classic system design teams a multicore processor or SOC with one or more SDRAM channels. As memory bandwidth needs rise, the SDRAMs’ per-channel transfer rate and the number of SDRAM channels used have increased. DDR transfer rates have now reached and exceeded 1600 Mtransfers/sec and it’s not uncommon to find server processors with three SDRAM channels, for example. Because of the constant thirst for memory bandwidth, DDR3 SDRAM sales exceeded DDR2 SDRAM sales beginning with the first quarter of 2010, according to the leading SDRAM vendor Samsung, and the company expects DDR2’s share of SDRAM market sales to drop below 20% by the end of the year.
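
Those two knobs, transfer rate and channel count, multiply together. A minimal sketch of the peak-bandwidth arithmetic follows, assuming a 64-bit data bus per channel; the bus width is my assumption (it is the usual width of a DIMM channel) and is not stated above.

```python
def peak_bandwidth_gbytes_per_sec(mtransfers_per_sec: float,
                                  bus_width_bits: int = 64,
                                  channels: int = 1) -> float:
    """Theoretical peak: transfers/sec times bytes per transfer times channels."""
    bytes_per_transfer = bus_width_bits / 8
    return mtransfers_per_sec * bytes_per_transfer * channels / 1000.0


one_channel = peak_bandwidth_gbytes_per_sec(1600)
three_channels = peak_bandwidth_gbytes_per_sec(1600, channels=3)
print(f"One channel: {one_channel:.1f} Gbytes/sec")
print(f"Three channels: {three_channels:.1f} Gbytes/sec")
```

Real systems deliver less than these peak figures because of refresh, bank conflicts, and bus turnaround.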

 

When you move that much data between a processor and memory, you’re likely to dissipate a considerable amount of power and indeed, memory power consumption has been on the rise. Lowering memory power consumption can substantially lower system-level power consumption. For example, states Samsung, going to 40nm, 2-Gbit DDR3 SDRAM with a 1.35V power supply can cut a server’s memory power consumption by 80% compared to the equivalent number of storage bits implemented with 60nm, 1-Gbit, DDR2 SDRAMs running at 1.8V and can even cut memory power consumption by 38% compared to equal-sized memory arrays consisting of 60nm, 1-Gbit, DDR2 SDRAMs running at 1.5V.

As a result, according to Samsung’s measurements, 40nm, 2-Gbit DDR3 SDRAMs running at 1.35V can cut power by an astonishing 38% at the system level for servers. To put that into economic perspective, says Samsung, the use of 1.35V DDR3 SDRAMs in a server can save 2564 kilowatt-hours per year. Samsung estimates that there will be 32 million servers operating in data centers worldwide by the end of this year. If they all were equipped with 1.35V DDR3 memory, the annual power consumption would be reduced by 82 terawatt-hours, worth an estimated $28 billion. That kind of money gets any data-center manager’s attention.
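
Samsung’s fleet-wide figure is easy to reproduce from the per-server number. This sketch multiplies it out and also derives the value per kilowatt-hour implied by the $28 billion estimate; the variable names are mine.

```python
KWH_PER_TWH = 1e9

saving_per_server_kwh = 2564  # per year, per Samsung
servers = 32e6                # Samsung's estimate of servers in data centers
quoted_value_usd = 28e9       # Samsung's estimated worth of the savings

total_twh = saving_per_server_kwh * servers / KWH_PER_TWH
implied_usd_per_kwh = quoted_value_usd / (total_twh * KWH_PER_TWH)

print(f"Fleet-wide saving: {total_twh:.1f} terawatt-hours per year")
print(f"Implied value: ${implied_usd_per_kwh:.2f} per kWh saved")
```

The product comes to just over 82 terawatt-hours, matching the quoted figure, and the dollar estimate implies about 34 cents for each kilowatt-hour saved.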

The same sort of energy saving applies to any multicore system whether it’s a server, a PC, or an embedded system based on a heterogeneous multicore processor design.

SPMT engulfs LPDDR2 standard, making adoption a no-brainer. Meanwhile Marvell jumps on the bandwagon.

June 7, 2010 on 9:00 am | In DRAM, Design, LPDDR2, Low-Power, SDRAM, SOC | No Comments

An insidious power problem has slowly crept up on embedded-system designers. While most of us were firmly focused on the power dissipation of our ever-expanding logic designs with their increasing number of processor cores in multicore designs, we mostly ignored the huge leaps in power consumption being caused by the rapid growth in memory size and big jumps in memory-access speeds and memory bandwidth. To cut memory costs, most high-end mobile and embedded designs today employ one high-bandwidth SDRAM device or array to satisfy all of a system’s memory requirements. Yet we think very little about the power impact of hooking big DDR SDRAMs up to our SOCs and ASICs—and these SDRAMs run at clock rates measured in hundreds of MHz or GHz, at transfer rates that are double the clock rate. It takes some real power to sling bits between a processor and SDRAM at transfer rates approaching or exceeding 1 Gtransfers/sec and even though the supply and I/O voltages have been dropping on SDRAM keeping memory power somewhat in check (only somewhat), wide DDR2 and DDR3 memory interfaces that deliver the highest bandwidths may now consume Watts of power. Watts! This simply cannot stand.

Not coincidentally, that’s the position of the SPMT (Serial Port Memory Technology) Consortium, which has been developing a low-power, high-performance memory interface for mobile and embedded applications. The low-power aspect arises primarily from SPMT’s use of low-voltage differential signaling (LVDS), which transfers information using 150 mV differential signal swings instead of single-ended, ground-referenced signal swings of more than a volt. The high-performance aspect arises from the use of multi-Gbits/sec transfer rates per SPMT data lane.

But there’s been a big, ugly fly in the SPMT ointment. Memory vendors know that more than 80% of all DRAMs go into PCs and servers and they stick with memory designs—and memory interfaces in particular—that best suit the needs of PC and server designers. Today, that means DDR2 memory, which is the mainstream DRAM technology, but the industry is quickly switching to DDR3. DDR4 is as yet undefined but it too is a rapidly approaching memory-interface specification that will most assuredly “fix” the problems we have with DDR3. These PC- and server-centric, high-speed parallel SDRAM interfaces burn a lot of power to deliver high bandwidth, which creates the niche opportunity that the SPMT Consortium has been trying to fill for mobile and embedded designs. Unfortunately, DDR memory has such a huge presence in the DRAM arena that there’s been little chance for any other interface approach to take hold.

Until now.

Today, the SPMT Consortium announced a major revision to the SPMT standard that may well spell the difference between an interesting technical exercise and an immensely successful new memory-interface standard. Previously, the SPMT specification multiplexed read/write commands and the data on the same unidirectional LVDS lanes. Doing so somewhat reduced the throughput on the data lines but it also reduced the memory pin count because SPMT memory didn’t need separate control/address (CA) lines. The reduced pin count was considered a major benefit that reduced the cost of packaged SPMT memory devices. The new SPMT specification, which completely supersedes the prior specification, does away with this control/address/data multiplexing in favor of using the same CA signal and pin definitions that LPDDR2 memory uses to carry control and address signaling.

This is a significant and important change to the SPMT spec because LPDDR2 is already poised to take over the mobile and embedded design spaces. (See LPDDR2: The new mainstream memory for embedded and mobile applications? on Denali Software’s Memory Report blog.) Further, four pairs of unidirectional SPMT data lanes now precisely overlap the 16 bidirectional data lines of a x16 LPDDR2 memory, making it possible to build one memory chip that can support both LPDDR2 and SPMT protocols using the same set of pins. What that means is that with only a few changes to the memory controller and memory PHY, an SOC or embedded processor can accommodate both LPDDR2 and SPMT memory using exactly the same set of interface pins. It also means that SDRAMs designed to the new SPMT specification can be used as LPDDR2 SDRAMs, ensuring a ready market when commercial SPMT SDRAMs first hit the market near the end of 2011—assuming things go according to the SPMT Consortium’s current plans.

So where’s the power advantage? It kicks in after the required SDRAM transfer rate hits a critical level. For example, the SPMT Consortium estimates that a x32 LPDDR2 memory interface operating at 400MHz dissipates about 180mW while providing 3.2 Gbytes/sec of peak data throughput over 32 data lines (800 Mbits/sec/pin) and 360mW at a peak data throughput of 6.4 Gbytes/sec over 64 data lines. (Regular old DDR2 and DDR3 SDRAM interfaces would consume a lot more power than this.) By contrast, the SPMT interface dissipates 180mW while transferring 6.4 Gbytes/sec over eight data lanes (8 Gbits/sec/lane) and 360mW when transferring 12.8 Gbytes/sec over 16 data lanes. So the SPMT interface appears to be about twice as power efficient as the LPDDR2 interface at higher data rates, which LPDDR2 memory can’t attain without resorting to a very wide data bus and using several memory devices in the bargain. However, the LPDDR2 parallel interface has a power advantage over the SPMT serial interface at lower transfer rates. So LPDDR2 memory might suffice for today’s embedded and mobile applications and might also suffice for low-activity modes in future applications.
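
The “twice as power efficient” claim is easiest to see as energy per bit moved. Here is a small sketch that converts the consortium’s quoted power and throughput figures into picojoules per bit; the function name is my own, and the figures are peak numbers, so treat the result as a best case.

```python
def picojoules_per_bit(power_mw: float, gbytes_per_sec: float) -> float:
    """Interface energy per transferred bit at peak throughput."""
    bits_per_sec = gbytes_per_sec * 1e9 * 8
    watts = power_mw / 1000.0
    return watts / bits_per_sec * 1e12


# (interface, power in mW, peak throughput in Gbytes/sec), as quoted above
points = [
    ("LPDDR2 x32", 180, 3.2),
    ("LPDDR2 x64", 360, 6.4),
    ("SPMT, 8 lanes", 180, 6.4),
    ("SPMT, 16 lanes", 360, 12.8),
]

for name, mw, gbs in points:
    print(f"{name}: {picojoules_per_bit(mw, gbs):.2f} pJ/bit")
```

Both LPDDR2 points work out to about 7 pJ per bit and both SPMT points to about 3.5 pJ per bit, which is the factor of two at these data rates.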

The graph below, supplied by SPMT, tells the story. The graph shows that at low data rates, LPDDR2 memory dissipates less power than SPMT memory, largely because of the DLL integrated into SPMT memory. (DLLs consume non-negligible amounts of power and although DDR2 and DDR3 memories incorporate DLLs, LPDDR2 memory does not.) So the SPMT Consortium has done something very smart and has developed an integrated mode-switching mechanism called SerialSwitch, which allows an SDRAM controller to programmably shift an SPMT memory between its LPDDR2 and SPMT serial interface modes using a control register built into the memory device.

 

 Memory Crossover

 

Mobile phone vendors and other embedded/mobile system designers know that video will be heavily used in many future products and they also know that memory transfer-rate and bandwidth requirements will only go up as a result. SPMT’s SerialSwitch mechanism provides a way for one memory device to support both low- and high-bandwidth operating modes with an appropriate level of power consumption depending on a system’s instantaneous bandwidth requirements. By definition, all commercial SPMT memories will incorporate the SerialSwitch feature. The following figure shows how the SPMT SerialSwitch mechanism works.

 

SerialSwitch

 

During Tg, the figure shows SPMT memory operating as a x16 LPDDR2 memory. Note that the data lines (DQ/HS) employ full-voltage, single-ended signaling in this mode. During time Tg, the memory’s DLL is off, which saves power. At the beginning of time Th, the system determines that more bandwidth is or soon will be needed, so it directs the memory controller to send a command to the memory to spin up the DLL in preparation for switching to SPMT serial mode. That process takes 5 to 10 microseconds. During this time, the memory continues to operate as an LPDDR2 memory, so the DLL spin-up time is hidden and doesn’t interfere with system operation, but power consumption will rise. Once the SPMT memory’s DLL has spun up, at time Ti, the system’s memory controller commands the SPMT memory to switch to serial communications mode. This transition takes a maximum of 10 clock cycles. After that and during time Tj in the figure above, the memory operates in SPMT serial-communications mode. Note that the data lines have switched to LVDS signaling, as shown in the figure. LVDS signaling reduces the memory interface’s power consumption. At some later time depending on system requirements, the memory controller can power down the memory (shown as time Tk) or switch back to LPDDR2 mode (the period that follows Tk in the figure above). Don’t be misled by this figure, by the way: SPMT memory need not pass through the power-down mode to switch from SPMT-serial communications to LPDDR2 mode.
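The sequence in the figure amounts to a small state machine. Here is a minimal Python sketch of it as I read the figure. The state and command names are my own inventions for illustration, not names from the SPMT specification, and the exit from power-down into LPDDR2 mode is inferred from the figure rather than stated by the Consortium.

```python
from enum import Enum, auto


class Mode(Enum):
    LPDDR2_DLL_OFF = auto()     # period Tg: x16 LPDDR2, single-ended signaling, DLL off
    LPDDR2_DLL_SPINUP = auto()  # period Th: still LPDDR2 while the DLL spins up (5 to 10 us)
    SPMT_SERIAL = auto()        # period Tj: LVDS serial lanes
    POWER_DOWN = auto()         # period Tk


# (current mode, hypothetical command name) -> next mode
TRANSITIONS = {
    (Mode.LPDDR2_DLL_OFF, "start_dll"): Mode.LPDDR2_DLL_SPINUP,
    (Mode.LPDDR2_DLL_SPINUP, "enter_serial"): Mode.SPMT_SERIAL,
    (Mode.SPMT_SERIAL, "power_down"): Mode.POWER_DOWN,
    # Serial mode can return to LPDDR2 directly, without passing through power-down.
    (Mode.SPMT_SERIAL, "enter_lpddr2"): Mode.LPDDR2_DLL_OFF,
    (Mode.POWER_DOWN, "enter_lpddr2"): Mode.LPDDR2_DLL_OFF,
}


class SpmtMemoryModel:
    """Toy model of the mode sequencing only; it moves no data."""

    def __init__(self):
        self.mode = Mode.LPDDR2_DLL_OFF

    def command(self, name):
        key = (self.mode, name)
        if key not in TRANSITIONS:
            raise ValueError(f"{name!r} is not allowed in {self.mode.name}")
        self.mode = TRANSITIONS[key]
        return self.mode


if __name__ == "__main__":
    memory = SpmtMemoryModel()
    for cmd in ("start_dll", "enter_serial", "enter_lpddr2"):
        print(cmd, "->", memory.command(cmd).name)
```

The one rule the table enforces is that serial mode is reachable only after the DLL has been started, which is the ordering the figure insists on.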

Systems can use SPMT memory in LPDDR2 mode at boot time and whenever the system is operating in a mode with low memory-bandwidth requirements. The system can quickly switch to the LVDS SPMT-serial mode whenever it requires higher memory data rates—for example when video is activated, when multiple operating modes are in use simultaneously, or when multiple processors are running in a multicore device. The SPMT Consortium estimates that the optimum crossover point between LPDDR2 and SPMT serial interface data rates for a x16/8-lane LPDDR2/SPMT-serial memory device is around 1.6 Gbytes/sec based on energy considerations.
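A system controller could turn that crossover estimate into a very simple policy. The sketch below is hypothetical: the 1.6 Gbytes/sec threshold is the Consortium's estimate quoted above, but the hysteresis band is my own assumption, added so that a workload hovering near the crossover does not flip modes continuously.

```python
CROSSOVER_GBYTES_PER_SEC = 1.6  # Consortium estimate for a x16/8-lane device
HYSTERESIS_GBYTES_PER_SEC = 0.2  # assumed guard band, not from the SPMT spec


def next_mode(current, demand_gbytes_per_sec,
              crossover=CROSSOVER_GBYTES_PER_SEC,
              band=HYSTERESIS_GBYTES_PER_SEC):
    """Pick the interface mode for the bandwidth the system expects to need."""
    if current == "LPDDR2" and demand_gbytes_per_sec > crossover + band:
        return "SPMT-serial"
    if current == "SPMT-serial" and demand_gbytes_per_sec < crossover - band:
        return "LPDDR2"
    return current


if __name__ == "__main__":
    mode = "LPDDR2"  # boot in LPDDR2 mode
    for demand in (0.4, 1.7, 3.0, 1.5, 0.8):
        mode = next_mode(mode, demand)
        print(f"{demand:.1f} Gbytes/sec -> {mode}")
```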

By subsuming the LPDDR2 standard and making SPMT memories wholly superset compatible with LPDDR2 memories, I think the SPMT consortium has significantly raised the likelihood of adoption when commercial SPMT memories finally appear late next year. I also think the likelihood of such memories appearing is pretty high considering that the top two DRAM vendors, Samsung and Hynix, are members of the SPMT Consortium. Together, Samsung and Hynix have a bit more than half of the overall DRAM market according to the latest stats from the DRAMeXchange (http://j.mp/aNaNiY).

On the embedded processor side of the equation, Marvell has announced that it too has joined the consortium, which further improves SPMT’s chances of success. In fact, Marvell supplied a canned quote for the SPMT Consortium’s press release with one of the strongest statements I’ve seen in such press releases, so I am suspending my usual cynicism about such quotes and reproduce it here:

“Today’s mobile DRAM technology is geared to support the bandwidth needs of single core processors. As devices evolve to integrate multi-core CPU, multi shader 3D graphic engines at multi-GigaHertz speeds, it’s clear that DRAM will be the single performance bottleneck, especially for handheld systems where power budget is a major constraint,” said Dr. Sehat Sutardja, chairman, president and chief executive officer at Marvell. “Marvell is joining the SPMT Consortium to actively promote Serial Port Memory Technology as an industry standard and address the immediate needs of the industry. We encourage other companies active in the sector to join us in our mission.”

Strong backing like this from a market maker like Marvell can only help SPMT’s cause. Whether or not SPMT actually reaches critical mass is something that we’ll all be watching as events unfold in the hotly competitive memory arena over the next 18 to 24 months.

More on the Xilinx EPP: Three ways to communicate with on-chip peripherals

June 2, 2010 on 3:11 am | In Design, FPGA, SOC | No Comments

Last month I discussed the newly introduced Xilinx Extensible Processing Platform (EPP), which represents a new product line and a new venture for FPGA leader Xilinx. To briefly recap, devices in the EPP device family are essentially a high-end microcontroller or embedded processor based on two ARM Cortex-A9 32-bit RISC processor cores (implemented as hard IP cores and not soft cores in the FPGA fabric), some amount of SRAM used largely for processor cache, some standard peripheral blocks implemented as hard IP cores, and multiple AMBA 4 interconnect buses that link the hard-core, on-chip IP blocks with an FPGA fabric that you can use to create additional peripheral devices or anything else you might need for the digital portion of your embedded design. These Xilinx devices will sell for the low tens of dollars and will consume much less power than full-tilt FPGAs, making them very attractive replacements for 32-bit microcontrollers and standalone processors in certain applications. This month, I want to focus on how you might use those multiple on-chip AMBA 4 buses to communicate with whatever you’ve implemented in the EPP’s FPGA fabric. Xilinx hasn’t yet discussed this sort of technical information, but it’s not too hard to project some basic facts.

There are essentially only three fundamental ways to use the Xilinx EPP’s on-chip AMBA 4 buses to communicate with peripheral devices whether they are hard cores outside of the FPGA fabric or soft cores implemented in the FPGA fabric. Those three ways are: registers, memory-mapped RAM, or streaming. Each of these communications approaches has advantages and disadvantages depending on application needs.

I/O data, control, and status registers date back to the earliest days of peripheral chips that were introduced along with the very first wave of microprocessors back in the 1970s. Back then, registers were generally no wider than eight bits. Data registers were almost always eight bits wide and permitted the passing of individual bytes back and forth between the processor and whatever I/O device lay beyond the peripheral chip. There were peripheral chips for simple parallel I/O, UARTs (universal asynchronous receiver/transmitters) for serial I/O, timer chips, interrupt controllers, and that was pretty much all there was at first.  Each control and status register in these peripheral chips had individual bits and bit groups that implemented specific functions such as “set the output pins to be low-true” or “enable the interrupt pin.”

I/O registers were implemented as individual latches, so it was easy to take the output of a latch bit and use it for driving another piece of hardware inside of the peripheral chip or to take a signal and connect it to the D input of a status-register bit. We still use I/O status and control registers in precisely the same way today, inside of large peripheral blocks like Ethernet and video controllers. We simply use a lot more registers than before and they tend to be wider than eight bits these days.
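To make the register style concrete, here is a small Python model of a peripheral with one control, one status, and one data register. The bit assignments are hypothetical, chosen only to echo the "low-true outputs" and "enable the interrupt pin" examples above; they do not describe any real chip or anything Xilinx has disclosed.

```python
# Hypothetical bit assignments for an 8-bit peripheral.
CTRL_OUTPUT_LOW_TRUE = 0x01  # control bit 0: drive outputs low-true
CTRL_IRQ_ENABLE = 0x02       # control bit 1: enable the interrupt pin
STATUS_DATA_READY = 0x80     # status bit 7: a byte is waiting in the data register


class PeripheralRegisters:
    def __init__(self):
        self.control = 0x00
        self.status = 0x00
        self.data = 0x00

    # Processor side -------------------------------------------------
    def set_bits(self, mask):
        self.control = (self.control | mask) & 0xFF

    def clear_bits(self, mask):
        self.control &= ~mask & 0xFF

    def read_data(self):
        """Reading the data register clears the data-ready status bit."""
        self.status &= ~STATUS_DATA_READY & 0xFF
        return self.data

    # Device side ----------------------------------------------------
    def device_receives(self, byte):
        """Hardware latches a byte and raises the data-ready status bit."""
        self.data = byte & 0xFF
        self.status |= STATUS_DATA_READY

    def irq_asserted(self):
        """The interrupt pin is the AND of an enable bit and a status bit."""
        return bool(self.control & CTRL_IRQ_ENABLE) and bool(
            self.status & STATUS_DATA_READY
        )


if __name__ == "__main__":
    regs = PeripheralRegisters()
    regs.set_bits(CTRL_IRQ_ENABLE)
    regs.device_receives(0x41)
    print("IRQ asserted:", regs.irq_asserted())
    print("Data read:", hex(regs.read_data()))
    print("IRQ asserted after read:", regs.irq_asserted())
```

The `irq_asserted` method is the software analog of wiring a latch output straight into other hardware: each register bit is just a signal that logic inside the peripheral can use.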

Memory-mapped I/O maps a large array of bus-addressed memory locations into a linear memory array inside of the peripheral device. Often, this memory array is implemented as a RAM inside of the peripheral device but if the memory array is small enough, it might be implemented as a large register bank instead of RAM.

The earliest use for such memory-mapped arrays in I/O chips was for memory-mapped video. The CPU could write an image to memory-mapped video RAM and a simple sequencing controller read out the video and sent it to the display. Initially, access to the video RAM had to be interleaved between processor and display sequencer but eventually as display speeds and resolution increased, video RAM became dual-ported to handle the rising number of access cycles per unit time.

Originally, it took an entire board to create a memory-mapped video controller. I recall using a Vector Graphics Flashwriter video display card in my North Star Horizon S-100 computer to implement fast video for an early WordStar editing system. I had to write the low-level video drivers in Z80 assembly code to connect the Flashwriter to the CP/M operating system and to WordStar itself. That was back in 1979 and things were mighty primitive back then. The advantage of the memory-mapped video back then was performance. The North Star’s Z80 CPU could directly manipulate every character location on the video display without using the serial escape sequences mandated by the use of RS-232 terminals. The processor would write characters directly to the screen with a simple byte move; it could examine characters with a simple byte read; and it could change the character’s attribute with a simple read-modify-write instruction sequence.
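Those three operations look like this in a Python sketch, with a bytearray standing in for the video RAM. The 80x24 screen size and the use of the high bit as an attribute flag are assumptions for illustration; I am not reproducing the Flashwriter's actual memory layout here.

```python
COLS, ROWS = 80, 24  # assumed screen geometry
ATTR_BIT = 0x80      # assumed: high bit of each byte selects reverse video

# One byte per character cell, mapped row by row.
video_ram = bytearray(b" " * (COLS * ROWS))


def offset(row, col):
    return row * COLS + col


def put_char(row, col, ch):
    """A simple byte move: the character appears on screen immediately."""
    video_ram[offset(row, col)] = ord(ch) & 0x7F


def get_char(row, col):
    """A simple byte read, with the attribute bit masked off."""
    return chr(video_ram[offset(row, col)] & 0x7F)


def toggle_attribute(row, col):
    """Read-modify-write: flip the attribute bit, leave the character alone."""
    i = offset(row, col)
    video_ram[i] = video_ram[i] ^ ATTR_BIT


if __name__ == "__main__":
    for col, ch in enumerate("WordStar"):
        put_char(0, col, ch)
    toggle_attribute(0, 0)
    print("".join(get_char(0, col) for col in range(8)))
    print(hex(video_ram[offset(0, 0)]))
```

No escape sequences and no serial link: every screen update is a memory access, which is the whole performance argument.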

In an era where processors were relatively expensive, it made sense to use the CPU running the application code to directly manipulate video on the screen as well. In the 21st century, microprocessors are so cheap and CPUs are so isolated from peripheral devices by caches and bus hierarchies that we have radically changed the way video works in most computers and embedded systems. Most systems now employ separate video processors but there are still certain non-video applications and certain peripheral devices that can still make effective use of memory-mapped I/O to provide direct processor access to peripheral memory.

Finally there’s stream I/O, which directs long transaction bursts to one memory or port address. Large operating systems, Linux in particular, have a great affinity for stream I/O and it’s an essential I/O protocol for streaming audio and video media. (No coincidence there.) Generally, a peripheral processor is required in such streaming applications to interpret commands embedded within the data stream and to separate multiplexed data streams (such as merged audio/video streams, which have become extremely common). Often, it’s advisable to place a FIFO at the input port of a streaming-I/O peripheral to help buffer the incoming data stream. Buffering helps to bridge mismatched data rates or inter-burst latencies between the streaming transmitter and receiver.
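A toy simulation shows why the FIFO matters. In this Python sketch a transmitter sends one word per cycle in bursts while the receiver drains one word every other cycle. The burst pattern, drain rate, and FIFO depths are all made-up numbers chosen for illustration.

```python
from collections import deque


class StreamFifo:
    def __init__(self, depth):
        self.depth = depth
        self.words = deque()

    def push(self, word):
        """Return False when full so the transmitter knows to stall."""
        if len(self.words) >= self.depth:
            return False
        self.words.append(word)
        return True

    def pop(self):
        return self.words.popleft() if self.words else None


def simulate(depth, burst_len=8, idle_len=8, cycles=64):
    """Bursty producer, half-rate consumer; returns (stalls, peak fill, received)."""
    fifo = StreamFifo(depth)
    stalls = 0
    peak = 0
    next_word = 0
    received = []
    for cycle in range(cycles):
        in_burst = (cycle % (burst_len + idle_len)) < burst_len
        if in_burst:
            if fifo.push(next_word):
                next_word += 1
            else:
                stalls += 1  # transmitter must retry this word later
        peak = max(peak, len(fifo.words))
        if cycle % 2 == 1:  # receiver accepts a word every other cycle
            word = fifo.pop()
            if word is not None:
                received.append(word)
    return stalls, peak, received


if __name__ == "__main__":
    for depth in (2, 8):
        stalls, peak, received = simulate(depth)
        print(f"depth={depth}: stalls={stalls}, peak fill={peak}, words={len(received)}")
```

With the shallow FIFO the transmitter stalls during every burst; with a deep enough FIFO the bursts are absorbed and the receiver catches up during the gaps between them.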

Xilinx hasn’t discussed any of these details but it’s likely that the EPP will support all three types of I/O transactions. What remains to be seen is what will be supported in hard-core IP and what will need to be implemented in the FPGA fabric.

Xilinx redefines the high-end microcontroller with its ARM-based Extensible Processing Platform – Case Studies – Part 2

May 1, 2010 on 8:22 pm | In Design, FPGA, Low-Power, SOC | No Comments

In my previous blog, I discussed the hard-core features of Xilinx’s new Extensible Processing Platform (EPP) and explained the device at the 50,000-foot level. In this blog, I’ll dig a bit deeper into the thinking behind the EPP’s FPGA fabric and I’ll show some case studies that indicate why Xilinx may have come up with a product family that will revolutionize high-end embedded system design.

Two features of Xilinx’s EPP architecture differentiate it from other microcontrollers. The first, discussed in Part 1, is the presence of a dual-core ARM Cortex-A9 processor. Most microcontrollers contain only one processor core. The EPP has two. So it’s already starting from a high-end position. The second differentiating feature is the inclusion of an unidentified amount of FPGA fabric on the device. Since the Xilinx EPP represents a family of parts, it’s safe to assume that various family members will contain differing amounts of FPGA fabric. That’s an especially safe assumption because the Xilinx presentation showed two EPP examples with different amounts of FPGA fabric. So we know that the family will likely include at least two parts—and probably many more if the product line proves successful.

What do you do with this FPGA fabric? Well the hard-core section of the EPP already gives you two 32-bit processor cores, some microprocessor peripherals, a memory controller, and some SRAM cache. So you might use the fabric to add some standard peripherals that your design needs that are not included in the standard hard-core set. Because the EPP is based on the AMBA-AXI bus, there are already many such peripheral devices available as synthesizable IP to choose from and the mere presence of Xilinx’s EPP is likely to increase the number of choices substantially as IP vendors decide to jump on the bandwagon.

Perhaps more likely, you will develop custom accelerators for application-specific tasks that permit the EPP to perform task-specific computations really, really fast. Bolt-on, bus-connected acceleration is the preferred design style for many embedded systems architects and it appears to me that the Xilinx EPP heartily supports this design style. I expect the Xilinx EPP offerings to flourish because it complements in-favor system design styles so well. So let’s take a look at two case studies provided by Xilinx to illustrate how the EPP can reduce a system design’s parts count, cost, and power consumption.

Xilinx EPP Auto Application

The first example is for an automotive optical-recognition system that provides a driver with a number of assist features for collision avoidance, blind spot detection, visually assisted cruise control, night vision, a self-parking system, and a lane-departure warning system. An automotive vendor wanted to develop such a system in a compact package that could be installed high on the windshield between the glass and the rear-view mirror. The system needed to be passively cooled (not an easy feat considering the location of the system). Sensors feeding the system will include video cameras, passive infrared sensors, and active RADAR sensors. The vendor wished for the system to be scalable, based on which and how many sensors are used in the vehicle.

The total processing requirement for this system included 1600 DMIPS from the supervisory processor and 32 GMACs for the sensor processing. Cost and power targets for this system were $50 and 5W. A design based on a processor-based ASSP backed with two auxiliary DSPs (needed to provide the 32 GMACs) came in at $45.75 and 6.6W, so the cost target was achieved but the power consumption was too high. A second design based on a Xilinx EPP came in at “less than” $40.75 (less than because Xilinx is still somewhat secretive about pricing for an unannounced product, so the listed EPP costs “less than $25”) and 4.2W, so the power consumption is about 16% below budget. More important, the EPP design provides roughly twice the processing power the design needs, delivering 3335 DMIPS and 60 GMACs. Even with these cost and power advantages, the Xilinx EPP would be far less attractive if it forced the software team to use an unfamiliar hardware architecture. One of the biggest advantages of the Xilinx approach is the familiar nature of the EPP’s foundation hardware.

The second case study involves an intelligent video surveillance system that can monitor a scene and raise alarms or generate alerts based on the scene. The estimate for processing requirements was 3100 MIPS from the supervisor processor and 49 GMACs for video processing. Cost and power targets were $100 and 10W. A system design based on separate host and video processors came in just above the processing requirements, with a part cost of $93 and a power dissipation of 10W. So this discrete design just meets spec with very little processing headroom and no leeway in power dissipation. A second system design based on a Xilinx EPP delivers 3335 DMIPS and 60 GMACs, so there’s ample video-processing headroom. Parts cost dropped to “less than $87” (again, Xilinx is being cagey with quoting EPP costs) and 7.9W for power dissipation (20% under the power goal).
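For readers who want to check the headroom claims in both case studies, here is the arithmetic as a short Python sketch. All inputs are the Xilinx figures quoted above; the only assumption of mine is treating the surveillance system's "3100 MIPS" requirement as directly comparable to the EPP's DMIPS rating.

```python
EPP_DMIPS = 3335
EPP_GMACS = 60

CASES = {
    "automotive": {
        "need_dmips": 1600, "need_gmacs": 32,
        "power_budget_w": 5.0, "epp_power_w": 4.2,
    },
    "surveillance": {
        "need_dmips": 3100, "need_gmacs": 49,  # quoted as MIPS; assumed comparable
        "power_budget_w": 10.0, "epp_power_w": 7.9,
    },
}


def margins(case):
    """Processing headroom ratios and power margin for an EPP-based design."""
    budget = case["power_budget_w"]
    return {
        "dmips_ratio": EPP_DMIPS / case["need_dmips"],
        "gmacs_ratio": EPP_GMACS / case["need_gmacs"],
        "power_under_budget_pct": (budget - case["epp_power_w"]) / budget * 100,
    }


if __name__ == "__main__":
    for name, case in CASES.items():
        m = margins(case)
        print(
            f"{name:12s} DMIPS x{m['dmips_ratio']:.2f}  "
            f"GMACs x{m['gmacs_ratio']:.2f}  "
            f"power {m['power_under_budget_pct']:.0f}% under budget"
        )
```

The automotive design has roughly 2x processing headroom, while the surveillance design's headroom is mostly on the GMAC side; its supervisory-processor margin is slim.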

Both of these case studies illustrate the Xilinx EPP’s applicability in high-end embedded systems with big processing requirements. In such systems, the EPP’s standardized, high-end, hard-core, dual-processor core (an ARM Cortex-A9 MP cluster), coupled to a high-performance, 28nm FPGA fabric through multiple high-performance buses, is a significant asset that is well suited to such high-end applications. Even though these are high-end applications, they are likely to boost sales of Xilinx’s EPP-based devices to levels rarely achieved by Xilinx’s more expensive FPGAs. EPP component costs listed in these two case studies suggest that Xilinx plans to sell these parts for tens of dollars, not hundreds or thousands of dollars. This feat is possible only because the standardized components within the EPP are hard cores, and they consequently consume only 5-10% of the silicon they’d require if implemented with an FPGA fabric.

Xilinx redefines the high-end microcontroller with its ARM-based Extensible Processing Platform – Part 1

May 1, 2010 on 7:10 pm | In DRAM, Design, FPGA, Low-Power, SOC | No Comments

Last week at the Embedded Systems Conference (ESC) held in San Jose, California, Xilinx disclosed additional information about its upcoming Extensible Processing Platform (EPP), which I previously discussed in a February 1 blog entry written just after RTECC (the Real Time Embedded Computing Conference, see Designing Low-Power Systems with FPGAs, Part 2). This past week at a press conference, Xilinx’s Senior VP of Worldwide Marketing and Business Development Vin Ratford again spoke of the upcoming processor-centric devices Xilinx plans to introduce next year, but this time he provided far more detail. As promised, the devices fuse features of a high-end microcontroller (hard-core implementations of a 32-bit processor, memory, and I/O) with an FPGA fabric. But wait, you say, haven’t both Xilinx and Altera (and other FPGA vendors) tried this before? Yes, they have, with uninspiring results. However, I submit that Xilinx’s EPP is substantially different and it stands a very good chance of capturing significant market share from microcontrollers and from discrete processors. It may also be very attractive to design teams considering the development of certain types of SOCs. Consequently, the Xilinx EPP family may well become the family of high-volume parts Xilinx wants to have in its product catalog. Ratford provided so much information in his ESC announcement that I’ll need multiple blog entries to cover it all. In this first entry, I’ll describe what Xilinx’s EPP is and I’ll cover some of the thinking behind the architecture. In the second entry, I’ll describe some case studies that illustrate why this component family might be very attractive for a certain class of embedded product, because it promises lower parts count, lower cost, and higher performance with lower power consumption. Please understand that Xilinx stopped short of announcing actual products. Ratford described an architecture that will be used to produce a product family with actual products starting to appear next year.

 There are two major components to Xilinx’s EPP: a hard-wired, high-end, microcontroller-like block and a connected FPGA fabric based on Xilinx’s 28nm unified FPGA logic-cell design as shown in the diagram below.

 

Xilinx EPP Block Diagram


 

 

First, let’s look at the hard-wired portion. It’s well known that processors don’t run very fast when implemented with FPGAs. The reason mostly revolves around the wiring congestion associated with the large register files of 32-bit RISC processors. Wiring congestion translates into “slow” and you can figure on giving up 50-75% or more of the processor’s maximum clock rate in a given process technology when comparing a synthesized ASIC implementation against a synthesized FPGA implementation. Hand optimization can reclaim some of that speed but if you’re planning on using a standard processor architecture anyway, it makes perfect sense to implement the processor on the FPGA as a hard core using a standard ASIC synthesis flow. That way, you get the full speed of the IC process technology along with the full logic density and therefore a much lower silicon cost.

Xilinx has chosen ARM’s Cortex-A9 32-bit RISC processor core for the EPP but has gone a step further by implementing a dual-core version of this processor. That choice immediately puts the Xilinx EPP family at the high end of the microcontroller spectrum. First, there are two 32-bit processor cores. Second, a Cortex-A9 processor can run at 2 GHz in TSMC’s 40nm, high-performance process technology. That’s one fast processor, much faster than many embedded applications require. A dual-core version, as is employed in Xilinx’s EPP family, is faster still.

In choosing a standard processor core from ARM’s extremely successful stable of processors, Xilinx has plugged directly into a broad community of embedded software developers. In other words, choosing the widely used ARM architecture telegraphs Xilinx’s recognition that embedded software development is now the largest and most expensive part of any high-end embedded project. In many such projects, software developers often outnumber hardware developers by 10:1. In announcing the EPP, Xilinx shows that it fully recognizes the need to make the software development team happy first. The company’s selection of an ARM processor core also leverages the associated large and familiar development-tool set, the good selection of operating systems, and the extended ecosystem that goes with the ARM architecture’s large and growing market dominance in the embedded space. All of these factors make the ARM processor very attractive to embedded development teams.

To the dual-core ARM Cortex-A9 processor, Xilinx has added a number of hard-core peripherals including SRAM caches, timers, interrupt controllers, switches, memory controllers, and commonly used I/O peripherals certain to be useful for many high-end embedded designs. Because these additional blocks are all hard-core implementations, they too take little room on the chip and consume much less power than they’d need if implemented in an FPGA fabric. Note that the EPP chips will contain enough SRAM for caches and small scratchpads; however, bulk memory, generally implemented with DRAM, will be off-chip. Consequently, the EPP architecture includes hard-core DRAM controllers to manage off-chip memory. Ratford’s talk at ESC did not elaborate on the type of memory the on-chip controller can handle; however, DDR2, DDR3, or both would probably be a good guess, considering the high-end nature of the EPP family. The targeted applications will need a lot of memory and DDR2 and DDR3 DRAM are now the best choices in terms of cost/bit.

Key to the software-friendly approach Xilinx is taking with the EPP, the architecture boots code upon power up just like a microcontroller. Only then is the FPGA fabric configured. This approach makes the EPP look very familiar to software developers who are not at all comfortable with writing code for a fluid, amorphous system that’s not well-defined when power comes up. The FPGA vendors spent a lot of money on reconfigurable architectures learning this lesson. In addition, HLL compilers don’t much care for undefined hardware either—undefined hardware just doesn’t fit the standard software-programming models. So the implementation of a complete, hard-wired microcontroller within the EPP cuts out a lot of that old unfamiliar strangeness associated with previous attempts to marry hard processor cores and FPGA fabrics.

Speaking of the FPGA fabric, Xilinx will be using the unified 28nm FPGA fabric in the EPP. Xilinx developed this fabric for its next-generation Spartan and Virtex FPGAs. (If you want more details about this FPGA fabric, take a look at the White Paper here.) According to Ratford, Xilinx’s Virtex and Spartan FPGAs will both employ this fabric, which is the first time that Xilinx has used the same FPGA fabric for its high-performance and its low-cost FPGA product families. Using the same fabric for the two Xilinx FPGA product lines and for the EPP means that Xilinx need only develop one set of hardware-design tools for the 28nm node and it also means that hardware designers only need to learn one set of tools as well.

The EPP’s hard-core embedded microcontroller communicates with the on-chip FPGA fabric using ARM’s newly announced AMBA 4/AXI bus. Ratford said at RTECC and repeated again at ESC that Xilinx worked with ARM to develop a version of this new bus specifically for FPGA use but he’s not provided details. The diagram of the EPP Ratford projected (reproduced above) shows multiple buses connecting the EPP’s hard-core embedded microcontroller and the on-chip FPGA fabric. Although Ratford provided no additional details, I plan to write a third blog entry discussing possible ways of optimally connecting the processor cores to the FPGA fabric. In the next installment of this blog, I’ll discuss some specific case studies Ratford covered in his ESC presentation that show how the EPP can reduce the parts count, cost, and the power consumption of high-end embedded systems.

(You can find a White Paper describing the Xilinx EPP here.)

Tabula FPGA Scatters Logic, Memory, and Power Across Space and Time

April 1, 2010 on 3:20 pm | In CMOS, Design, FPGA, Low-Power | No Comments

Here’s a head-scratcher for you. Why not create tesseract FPGAs? A tesseract is the 4-dimensional version of a 3D cube. (Just as a 3D cube can be unfolded to make a set of six connected 2D squares, a tesseract can be unfolded into a set of eight connected 3D cubes.) I’ve loved the word ever since I learned it by reading Robert A. Heinlein’s classic science fiction short story from 1940 called “And He Built a Crooked House” in which an earthquake causes a house built in the unfolded 3D shape of a tesseract to fold into an actual 4D tesseract, trapping the unfortunate occupant inside. If you fold an FPGA into time, you can extrude some of the physical computational circuitry into elsewhen and reduce the amount of circuitry needed to implement your functions. And that is exactly what the new FPGA vendor Tabula has done. The company’s ABAX 3D FPGA architecture gets octuple duty from a LUT cell by fencing it in with eight sets of input/output latches and eight LUT configuration tables. Then, at 8x the “user” clock rate, the FPGA quickly reconfigures the LUT cell, runs part of a calculation, stores the partial result, and proceeds to the next step. The current FPGA design, just announced by Tabula, runs the user clock at 200 MHz and the “Spacetime” clock at 1.6 GHz. As a result, Tabula can offer really “large” FPGAs (in terms of logic cells) at really low prices compared to the big guys: Altera and Xilinx.
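To make the time-folding idea concrete, here is a toy Python model of a single physical 4-input LUT that is reconfigured on every fast-clock tick and latches its result for later ticks to use. It is my own illustration of the general principle, not Tabula's actual cell design or tool output. The example folds a 9-bit parity tree, which would need three LUTs in a conventional FPGA, onto one physical LUT using three of the eight available folds.

```python
class FoldedLut:
    """One physical 4-input LUT reused across up to eight folds per user clock."""

    FOLDS = 8

    def __init__(self, configs):
        # Each config: (16-bit truth table, four input signal names, output name)
        if len(configs) > self.FOLDS:
            raise ValueError("more configurations than folds")
        self.configs = configs

    def user_cycle(self, inputs):
        """Run every fold once, as the fast clock would within one user clock."""
        signals = dict(inputs)  # latched values, visible to later folds
        for table, in_names, out_name in self.configs:
            index = 0
            for bit, name in enumerate(in_names):
                index |= (signals[name] & 1) << bit
            signals[out_name] = (table >> index) & 1  # look up, then latch
        return signals


# Truth table for a 4-input XOR: bit i holds the parity of the index i.
XOR4 = sum((bin(i).count("1") & 1) << i for i in range(16))

PARITY9 = FoldedLut([
    (XOR4, ("a0", "a1", "a2", "a3"), "p0"),
    (XOR4, ("p0", "a4", "a5", "a6"), "p1"),
    (XOR4, ("p1", "a7", "a8", "zero"), "parity"),
])


def parity9(bits):
    inputs = {f"a{i}": b for i, b in enumerate(bits)}
    inputs["zero"] = 0  # constant tie-off for the unused LUT input
    return PARITY9.user_cycle(inputs)["parity"]


if __name__ == "__main__":
    print(parity9([1, 0, 1, 1, 0, 0, 1, 0, 1]))
```

Note that each fold depends on the result latched by the previous one. That data dependency is exactly what a folding tool has to schedule around.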

Now to do this, you need some magic and you need to value logic-cell capacity over power consumption. First, the magic. Unless you’re going to retrain FPGA users to manually spread their designs across eight time slices, you need to make the 1.6GHz reconfiguration trick work in the background. Altera and Xilinx spent more than a decade trying to sell the idea of spreading designs across time using “on-the-fly reconfigurable logic” and most designers just never latched onto the idea. For some reason, engineers can understand software overlays and DLLs (dynamic-linked libraries) but cannot come to grips with on-the-fly hardware reconfigurability. I think the issue is training more than anything else, but the big FPGA guys just couldn’t sell the idea broadly after trying for years. So there needs to be magic—or some appropriately advanced technology that looks like magic to most of us—to make this trick work.

And there is such magic in the form of an appropriate synthesis tool from Tabula that understands the extra-dimensional aspects of Tabula’s FPGA. The tool takes standard logic designs and “folds” them into time. However, like much of the magic in the Harry Potter book series, this magic isn’t perfect. You don’t necessarily get 8x the logic circuitry from a 1x FPGA. You get about 2.5x according to Tabula, depending on the design. And you get about 2.9x from the 8-ported, 1.6GHz memories on the chip, again, depending on the design. This gap between the real and the ideal reflects the difficulty in developing automated algorithms that can re-pipeline a datapath for additional stages. It’s an art not a science, as any CPU/processor/microprocessor architect will tell you. You can’t always partition one datapath pipeline stage into eight because there just isn’t enough computation taking place in that pipeline stage to allow such expansion or re-pipelining. So, according to Tabula, the average LUT reuse is about 2.5x based on whatever test cases the company used to develop that number.

Now for the power-consumption ramifications. Tabula’s FPGAs reduce die area (in terms of LUTs and on-chip memories), and therefore silicon cost, at the expense of power consumption. Running most of the on-chip circuitry at 1.6GHz while delivering the performance of a 200MHz FPGA must cost additional power. In the real world of chip design, power scales linearly with area but superlinearly with frequency, largely due to voltage-rail considerations. You need more voltage to operate at higher clock rates. There’s also the leakage that comes from setting transistor thresholds low enough to operate at 1.6GHz. So it’s bound to be a bad tradeoff in terms of power. (I don’t actually know this because it doesn’t seem that Tabula’s been forthcoming about power numbers, but some physics just can’t be bypassed as long as you’re still using off-the-shelf CMOS.)
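To put a first-order shape on that argument, here is the textbook CMOS dynamic-power relation applied to time folding. This is my own back-of-the-envelope sketch, not Tabula data. The symbols are assumptions for illustration: $k$ is the fold factor, $\alpha$ the activity factor, $C$ the switched capacitance of the unfolded design, $f$ the user clock rate, and $V_1$ and $V_k$ the supply voltages needed to run at $f$ and at $kf$. The sketch assumes ideal folding (capacitance shrinks by the full factor $k$) and ignores leakage and the overhead of the extra latches and configuration storage.

```latex
P_{\text{unfolded}} = \alpha\, C\, V_1^{2}\, f
\qquad
P_{\text{folded}} \approx \alpha\, \frac{C}{k}\, V_k^{2}\, (k f) = \alpha\, C\, V_k^{2}\, f
\qquad
\frac{P_{\text{folded}}}{P_{\text{unfolded}}} \approx \left(\frac{V_k}{V_1}\right)^{2}
```

Under these assumptions, folding is power-neutral only if the fast fabric can run at the same supply voltage as the slow one. Any extra voltage needed to reach the higher clock rate shows up squared, and a real-world fold factor below the ideal makes the comparison worse.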

It’s true that you can sacrifice half of the virtualized Spacetime LUTs and get 400MHz or some other combinations, but folks it’s a 1.6GHz device. Not designed for low power. Design tradeoffs obviously favored device cost, which you can see in the low, blink-inducing prices for the devices. Those prices are indeed mighty attractive for such high logic capacities. However, just about everyone’s worried about power these days, even people designing equipment for those power-sucking data centers that are cooled by diverting nearby rivers through the equipment racks. Every Watt of operating power supplied to the equipment requires an additional Watt for cooling (roughly speaking). A megawatt here, a megawatt there, and pretty soon you’re talking about some real energy consumption. And some real energy costs, which is what truly gets the attention of the data-center managers and owners.

I’ve heard about the Tabula announcements from several sources starting with a morning-of article in the San Jose Mercury News. One of the best technical write-ups I’ve seen so far is this article by Kevin Morris from FPGA Journal. Online comments to Morris’ article suggest that there’s a lot of skepticism in the design community with respect to this new FPGA technology. As with any new technology, even a tesseract FPGA, time will tell if the market accepts this idea or if it will end up on the shelf next to the long-dead and now-dusty remains of reconfigurable logic.
