Steve Leibson » CMOS
http://low-powerdesign.com/sleibson
Leibson's Laws and the Penalties for Breaking Them
Thu, 02 Sep 2010 04:28:00 +0000

Tabula FPGA Scatters Logic, Memory, and Power Across Space and Time
http://low-powerdesign.com/sleibson/2010/04/01/tabula-fpga-scatters-logic-memory-and-power-across-space-and-time/
Thu, 01 Apr 2010 15:20:51 +0000, sleibson321

Here’s a head-scratcher for you. Why not create tesseract FPGAs? A tesseract is the 4-dimensional version of a 3D cube. (Just as a 3D cube can be unfolded to make a set of six connected 2D squares, a tesseract can be unfolded into a set of eight connected 3D cubes.) I’ve loved the word ever since I learned it by reading Robert A. Heinlein’s classic science fiction short story from 1941 called “And He Built a Crooked House” in which an earthquake causes a house built in the unfolded 3D shape of a tesseract to fold into an actual 4D tesseract, trapping the unfortunate occupant inside. If you fold an FPGA into time, you can extrude some of the physical computational circuitry into elsewhen and reduce the amount of circuitry needed to implement your functions. And that is exactly what the new FPGA vendor Tabula has done. The company’s ABAX 3D FPGA architecture gets octuple duty from a LUT cell by fencing it in with eight sets of input/output latches and eight LUT configuration tables. Then, at 8x the “user” clock rate, the FPGA quickly reconfigures the LUT cell, runs part of a calculation, stores the partial result, and proceeds to the next step. The current FPGA design, just announced by Tabula, runs the user clock at 200 MHz and the “Spacetime” clock at 1.6 GHz. As a result, Tabula can offer really “large” FPGAs (in terms of logic cells) at really low prices compared to the big guys: Altera and Xilinx.

Now to do this, you need some magic and you need to value logic-cell capacity over power consumption. First, the magic. Unless you’re going to retrain FPGA users to manually spread their designs across eight time slices, you need to make the 1.6GHz reconfiguration trick work in the background. Altera and Xilinx spent more than a decade trying to sell the idea of spreading designs across time using “on-the-fly reconfigurable logic” and most designers just never latched onto the idea. For some reason, engineers can understand software overlays and DLLs (dynamic-linked libraries) but cannot come to grips with on-the-fly hardware reconfigurability. I think the issue is training more than anything else, but the big FPGA guys just couldn’t sell the idea broadly after trying for years. So there needs to be magic—or some appropriately advanced technology that looks like magic to most of us—to make this trick work.

And there is such magic in the form of an appropriate synthesis tool from Tabula that understands the extra-dimensional aspects of Tabula’s FPGA. The tool takes standard logic designs and “folds” them into time. However, like much of the magic in the Harry Potter book series, this magic isn’t perfect. You don’t necessarily get 8x the logic circuitry from a 1x FPGA. You get about 2.5x according to Tabula, depending on the design. And you get about 2.9x from the 8-ported, 1.6GHz memories on the chip, again, depending on the design. This gap between the real and the ideal reflects the difficulty in developing automated algorithms that can re-pipeline a datapath for additional stages. It’s an art, not a science, as any processor architect will tell you. You can’t always partition one datapath pipeline stage into eight because there just isn’t enough computation taking place in that pipeline stage to allow such expansion or re-pipelining. So, according to Tabula, the average LUT reuse is about 2.5x based on whatever test cases the company used to develop that number.
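
To make the folding arithmetic concrete, here is a minimal Python sketch. The function names and the 100,000-LUT device size are my own illustrative assumptions, not Tabula figures. Only the 1.6 GHz Spacetime clock, the eight folds, and the roughly 2.5x reuse factor come from the announcement.

```python
def user_clock_mhz(spacetime_clock_mhz: float, folds: int) -> float:
    """User-visible clock when each physical LUT is reconfigured `folds` times per user cycle."""
    if folds < 1:
        raise ValueError("folds must be at least 1")
    return spacetime_clock_mhz / folds


def effective_luts(physical_luts: int, reuse_factor: float) -> int:
    """Logic capacity seen by the designer, given the average LUT reuse achieved by the tool."""
    return int(physical_luts * reuse_factor)


if __name__ == "__main__":
    # Eight folds of a 1.6 GHz Spacetime clock give the announced 200 MHz user clock.
    print(user_clock_mhz(1600, 8))
    # Using only four folds trades capacity for a faster user clock.
    print(user_clock_mhz(1600, 4))
    # Hypothetical 100,000-LUT device with the quoted ~2.5x average reuse (not the ideal 8x).
    print(effective_luts(100_000, 2.5))
```

The gap between the ideal and the quoted reuse shows up directly: the same hypothetical device would look like 800,000 LUTs at a perfect 8x, but only about 250,000 at 2.5x.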

Now for the power-consumption ramifications. Tabula’s FPGAs reduce die area (in terms of LUTs and on-chip memories), and therefore silicon cost, at the expense of power consumption. Running most of the on-chip circuitry at 1.6GHz while delivering the performance of a 200MHz FPGA must cost additional power. In the real world of chip design, power scales linearly with area but superlinearly with frequency, largely due to voltage-rail considerations. You need more voltage to operate at higher clock rates. There’s also the leakage that comes from setting transistor thresholds low enough to operate at 1.6GHz. So it’s bound to be a bad tradeoff in terms of power. (I don’t actually know this because it doesn’t seem that Tabula’s been forthcoming about power numbers, but some physics just can’t be bypassed as long as you’re still using off-the-shelf CMOS.)
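
The "superlinear with frequency" point follows from the standard first-order CMOS dynamic-power relation, P = a·C·V²·f. The Python sketch below compares two operating points. The specific ratios (a doubled clock, a 20% higher rail) are hypothetical numbers I picked for illustration and say nothing about Tabula's actual parts.

```python
def relative_dynamic_power(cap_ratio: float, volt_ratio: float, freq_ratio: float) -> float:
    """Ratio of dynamic power P = a*C*V^2*f between two operating points.

    Assumes the same activity factor `a` at both points (a simplification).
    cap_ratio tracks switched capacitance, which grows roughly with active area.
    """
    return cap_ratio * volt_ratio ** 2 * freq_ratio


if __name__ == "__main__":
    # Frequency doubled at the same voltage: power scales linearly with frequency.
    same_rail = relative_dynamic_power(1.0, 1.0, 2.0)
    # Frequency doubled, but the faster clock needs a 20% higher supply rail (hypothetical).
    higher_rail = relative_dynamic_power(1.0, 1.2, 2.0)
    print(round(same_rail, 2), round(higher_rail, 2))
```

Because voltage enters as a square, any rail increase needed to hit the higher clock pushes power up faster than the clock itself. Leakage is not modeled here at all.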

It’s true that you can sacrifice half of the virtualized Spacetime LUTs and get 400MHz or some other combination, but, folks, it’s a 1.6GHz device. Not designed for low power. Design tradeoffs obviously favored device cost, which you can see in the low, blink-inducing prices for the devices. Those prices are indeed mighty attractive for such high logic capacities. However, just about everyone’s worried about power these days, even people designing equipment for those power-sucking data centers that are cooled by diverting nearby rivers through the equipment racks. Every watt of operating power supplied to the equipment requires an additional watt for cooling (roughly speaking). A megawatt here, a megawatt there, and pretty soon you’re talking about some real energy consumption. And some real energy costs, which is what truly gets the attention of the data-center managers and owners.

I’ve heard about the Tabula announcements from several sources starting with a morning-of article in the San Jose Mercury News. One of the best technical write-ups I’ve seen so far is this article by Kevin Morris from FPGA Journal. Online comments to Morris’ article suggest that there’s a lot of skepticism in the design community with respect to this new FPGA technology. As with any new technology, even a tesseract FPGA, time will tell if the market accepts this idea or if it will end up on the shelf next to the long-dead and now-dusty remains of reconfigurable logic.

Intel cuts IC power by allowing, detecting, and correcting errors
http://low-powerdesign.com/sleibson/2010/04/01/cut-power-by-allowing-detecting-and-correcting-errors/
Thu, 01 Apr 2010 14:59:39 +0000, sleibson321

The low-power IC-design train has long ridden the rails of lowered supply voltage. However, these lowered supply rails are asymptotically approaching transistor threshold voltages and have long been headed for a serious collision because transistors in large, nanometer ICs run closer and closer to their switching limits. When designing these large circuits, chip designers and EDA tools must make allowances for noise or voltage droop on the supply rails and noise on the signal interconnects within the chip. That means the designs can’t really run the transistors as fast as possible or at the lowest possible voltage without risking imperfect operation. And who wants to risk imperfect circuit operation? Well, Intel for one.

In a recent article published in the MIT Technology Review, Katherine Bourzac writes up a report from Intel Labs about an experimental 45nm chip that allows circuits to run at sub-optimum voltage and somewhat-too-fast frequency settings. Most of the time, there’s no problem because there’s not enough noise or droop to cause the circuits to compute incorrectly. However, sometimes, under certain conditions, there will be errors. What to do? Add error-detection circuitry to detect errors when they happen and then back up one step in the calculation, raise the operating voltage a bit or drop the operating frequency a bit, re-run the calculation to get the right result, and then back the supply voltage down to normal. This is research into what Intel Labs calls “resilient circuits.”
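
The detect, back up, and replay loop described above can be sketched as a toy model in Python. Everything here is hypothetical: the voltages, the `min_safe_v` threshold, and the way an error is modeled (a flipped bit plus a flag standing in for the on-chip error detector). It is not Intel's circuit, just the control flow in miniature.

```python
def make_adder(a: int, b: int, min_safe_v: float):
    """Toy operation: returns (value, error_flag) for a given supply voltage.

    Below min_safe_v the result is corrupted, modeling a timing error;
    the flag stands in for the hardware error-detection circuitry.
    """
    def add(voltage: float):
        error = voltage < min_safe_v
        value = (a + b) ^ 1 if error else a + b   # flip the low bit on error
        return value, error
    return add


def run_resilient(operation, nominal_v: float, boosted_v: float):
    """Run at the aggressive nominal voltage; on a detected error, replay once at boosted voltage.

    Returns (value, replays). The next call starts at nominal_v again,
    which models backing the supply down to normal after the replay.
    """
    value, error = operation(nominal_v)
    replays = 0
    if error:
        replays = 1
        value, error = operation(boosted_v)
        if error:
            raise RuntimeError("error persists even at boosted voltage")
    return value, replays


if __name__ == "__main__":
    quiet = make_adder(2, 3, min_safe_v=0.85)   # little noise: nominal voltage is enough
    noisy = make_adder(2, 3, min_safe_v=0.95)   # droop: nominal voltage is too low
    print(run_resilient(quiet, nominal_v=0.9, boosted_v=1.0))
    print(run_resilient(noisy, nominal_v=0.9, boosted_v=1.0))
```

The power argument rests on the first case being the common one: most operations finish at the low voltage, and only the occasional replay pays for the extra margin.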

Is there a benefit to this approach? Specifically, is there a power benefit? Apparently, there is. Bourzac quotes Wen-Hann Wang, director of circuits and systems research at Intel and vice president of Intel Labs, who says that even with the extra error-detection circuitry, the net power savings can be a whopping 37%. (Or, if you’re a speed freak, you can get 21% faster operation without reducing operating power.) Wang points out that today’s chips are designed to operate in demanding, multimode scenarios such as “playing a graphics-rich game, uploading video to Facebook, and surfing the Web” (Isn’t it amazing how cell-phone scenarios have replaced computer-use scenarios these days?) and that today’s devices must be designed to handle such scenarios correctly, which means that the chip’s circuits will be overdesigned and will use excessive power most of the time, when simpler operating modes are in use. An error-detection-and-correction scheme allows the design of chips that only use additional power when it’s needed—when there’s an error.

There are at least two more factors to consider as well. First, chips age. As they do, device thresholds change and metal migrates, leading to minute changes in the currents flowing within the chip—changes that deviate from modeled operating scenarios created during chip design. The normal result of these changes for devices that are designed to run perfectly all the time is that the circuitry eventually does not run perfectly and the chip effectively dies even though it actually could operate properly at a slightly higher operating voltage or a lower operating frequency. Apparently, according to Bourzac’s article, the addition of error-detecting-and-correcting circuitry and algorithms also compensates for the problems associated with chip aging.

Second, as Moore’s Law takes the industry down the rabbit hole of shrinking geometries, many more error sources appear. That makes error-detection-and-correction schemes even more attractive and no doubt that is why Intel Labs is looking into the design of such circuitry now rather than later.

I think that the advent of real error-detecting-and-correcting computational circuitry is long overdue. On-chip variability already causes enough headaches to trigger more research into how digital circuitry must deal with errors in a probabilistic world, not the absolutely perfect Boolean world we’ve come to assume over the 70-some years of digital design. The storage and memory worlds got the call long ago. Disk drives became probabilistic with the adoption of PRML (partial-response, maximum likelihood) coding more than a decade ago and have always had to use error detection and correction to deal with real-world, flawed storage media. DRAM and NAND manufacturers long ago adopted redundant design to allow for dead bits, rows, and columns in their devices. Viterbi, Turbo, and other algorithms protect digital data from errors inherent in the transmission over the air, with all the associated noise and reflections that are part of everyday cellular telephony. So, is digital design at the chip level different? Apparently not.

Laser Spike Annealing of Nickel in Nanometer CMOS ICs Cuts Leakage 10x
http://low-powerdesign.com/sleibson/2009/12/06/laser-spike-annealing-of-nickel-in-nanometer-cmos-ics-cuts-leakage-10x/
Sun, 06 Dec 2009 20:22:55 +0000, sleibson321

One of the sad facts of life for nanometer silicon has been the rise of leakage current as device geometries shrink. At 65nm, CMOS leakage currents roughly equal operating currents, making it virtually impossible to reduce overall operating current by more than half. I’ve long thought this was the result of low-Vt transistors that can never fully turn off, a consequence of the drive to recover speed that’s lost when supply voltages are cut to reduce operating power. Turns out there’s another culprit: nickel contamination that occurs when nickel atoms drift away from the nickel-silicide interface layer used to improve the connectivity of metal inter-layer contact plugs. The nickel atoms drift during the annealing process, which is used to drive the deposited nickel atoms into the transistors’ source and drain contact pads. The first of two annealing cycles drives the metallic nickel atoms into the silicon source and drain pads creating Ni2Si silicide. A second, higher-temperature annealing process converts the Ni2Si into NiSi, which has lower resistance and thus provides good electrical connectivity between the contact pad and the metal interconnect plug.

It turns out that the current “soak” annealing processes (which last for tens of seconds) allow the nickel atoms to drift far afield. Like beach sand in your bathing suit, the nickel gets into places you’d rather not have it. The drifting nickel atoms seem to have an affinity for silicon lattice discontinuities, which can be found at the outside ends of the transistor where source and drain diffusions meet the isolation trenches and in long, narrow voids that run from the source and drain regions towards and into the FET channel. Both of these hiding places cause leakage because the metallic nickel conducts electricity where there should be insulator or semiconductor material. Nickel at the ends of the transistor causes substrate leakage and nickel atoms in the channel naturally cause channel leakage.

Applied Materials and European semiconductor research powerhouse IMEC have jointly developed a laser-annealing process that takes one millisecond instead of tens of seconds. As a result, the diffusing nickel doesn’t have time to drift into these unwanted places during the second annealing step that generates NiSi. Applied Materials described a similar laser-spike annealing process back in 2004 (see article here), but reportedly achieved only a 3-4% leakage reduction back then. This latest development appears to be a refinement of that earlier technique. The two companies will be presenting their findings at this week’s IEDM conference in Baltimore, Maryland.

IMEC and Applied Materials will indeed have pulled a rabbit out of the hat if this laser-spike annealing process plus the application of appropriate transistor-design rules result in cutting leakage currents by 90% for nanometer CMOS. Leakage-driven power loss has become a significant problem for advanced IC design and had appeared to be insurmountable, even with the addition of high-K and metal-gate processing. Now, it appears there’s a real solution with the best of all possible implications for system and logic designers: they don’t need to learn anything new. They can leave this fix to the design tools and to the process engineers and once again skirt the system-level and architectural issues of low-power design.
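
To see what a 90% leakage cut could mean at the chip level, here is a small worked example in Python. It takes the rule of thumb quoted earlier, that at 65nm leakage current roughly equals operating current, and normalizes both to 1.0. Those normalized values and the function name are my assumptions for illustration, not measured data.

```python
def total_current(dynamic: float, leakage: float, leakage_reduction: float = 0.0) -> float:
    """Total supply current after cutting leakage by the given fraction (0.0 to 1.0)."""
    if not 0.0 <= leakage_reduction <= 1.0:
        raise ValueError("leakage_reduction must be between 0 and 1")
    return dynamic + leakage * (1.0 - leakage_reduction)


if __name__ == "__main__":
    # Normalized 65nm case: leakage roughly equals operating (dynamic) current.
    before = total_current(1.0, 1.0)
    after = total_current(1.0, 1.0, leakage_reduction=0.90)
    saving = 1.0 - after / before
    print(f"total current falls by about {saving:.0%}")
```

Under those assumptions the total drops by roughly 45%, which is close to the "no more than half" ceiling that leakage imposed before.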

The Surprising Popularity Rise of On-Chip Memory
http://low-powerdesign.com/sleibson/2009/11/08/the-surprising-popularity-rise-of-on-chip-memory/
Sun, 08 Nov 2009 16:53:04 +0000, sleibson321

I attended the 7th International SOC Conference in Newport Beach last week and several of the speakers addressed issues relating to SOC and system power. One of these speakers was Bob Madge, Director of Technology Marketing at LSI Corp (formerly LSI Logic). In case you didn’t know, LSI has been evolving its business from its original focus on developing ASICs and SOCs for customers to a focus on programmable ASSPs (application-specific standard products) and custom silicon specifically aimed at the networking and storage markets. Madge’s first slide explained the reasoning: annual storage-capacity growth is a projected 49% per year and annual network-traffic growth is a projected 42% per year. Good growth numbers for a business to target.

To deliver competitive parts, LSI stays on top of IC design and manufacturing trends. One trend that caught LSI and the semiconductor industry by surprise has been the rapid growth in on-chip memory use. On-chip memory makes sense for two reasons. First and foremost, it provides better performance than off-chip memory because putting memory on the chip along with the logic circuitry eliminates two sets of off-chip drivers and receivers, which reduces power consumption for memory transactions. Second, on-chip logic can communicate with on-chip memory over extremely wide memory interfaces—pin count is not an issue if you stay on the chip. A wide memory interface reduces the number of transfers needed to move a given amount of data and lower transfer rates cut power as well.
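
The wide-interface argument is simple arithmetic, sketched here in Python. The 4,096-bit payload and the 64-bit and 1,024-bit interface widths are hypothetical values chosen only to show the effect. They are not LSI numbers.

```python
def transfers_needed(payload_bits: int, interface_width_bits: int) -> int:
    """Number of memory transfers needed to move a payload over an interface of given width."""
    if payload_bits < 0 or interface_width_bits <= 0:
        raise ValueError("payload must be non-negative and width must be positive")
    # Integer ceiling division: a partial final word still costs a full transfer.
    return (payload_bits + interface_width_bits - 1) // interface_width_bits


if __name__ == "__main__":
    payload = 4096  # bits to move (hypothetical)
    print(transfers_needed(payload, 64))     # narrow, pin-limited off-chip style interface
    print(transfers_needed(payload, 1024))   # wide on-chip interface
```

With these example numbers the wide interface needs 4 transfers instead of 64, so the same data moves at a much lower transfer rate.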

However, merging logic and memory on one piece of silicon has always presented design and manufacturing issues. Bulk, high-volume, high-capacity memory manufacturing processes differ from logic manufacturing processes because the two processes must optimize different parameters. Memory processes emphasize low cost manufacturing and tend to have fewer metal layers than logic processes, which emphasize speed and on-chip connectivity. “Frequency, density, and power are always a challenge,” said Madge.

For example:

  • Today’s network routers use 400-Mbit buffers. Switches need 512 Mbits of storage or more. In the future, said Madge, these devices will need as much as 1 Gbit of on-chip memory in multiple configurations.
  • IP controllers used in network storage applications currently use 60 to 100 Mbits of cache memory. In the future, these devices will need 200 Mbits of memory or more.
  • Media processors currently use 60 to 80 Mbits of memory running at 500 MHz. Future needs will be on the order of 100 to 200 Mbits of memory running at 600 to 700 MHz.

All of these examples demonstrate the coming challenges for fast, dense, on-chip memory.

LSI is looking at embedded (on-chip) DRAM and the use of 3D, through-silicon via technology for chip-to-chip stacking as ways of increasing the amount of on-chip memory. The company is doing this because it sees a continued and rapid rise in the amount of on-chip memory needed for its networking and storage chips.

Embedded DRAM uses a 1T (one-transistor) cell, which obviously improves density over a 4T or 6T static RAM cell. Embedded DRAM also reduces static and dynamic power consumption because the fewer transistors use less power and leak less current than the greater number of transistors required to build the same amount of SRAM.
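
A quick Python sketch shows the size of the transistor-count gap, using the 400-Mbit router buffer mentioned above as the example capacity. It counts cell transistors only. Sense amps, decoders, redundancy, and the DRAM cell's storage capacitor are ignored, and treating a megabit as 10^6 bits is my assumption.

```python
def cell_transistors(capacity_bits: int, transistors_per_cell: int) -> int:
    """Transistors in the bit-cell array alone, excluding all peripheral circuitry."""
    return capacity_bits * transistors_per_cell


if __name__ == "__main__":
    capacity = 400 * 10**6  # 400-Mbit buffer, assuming 1 Mbit = 10^6 bits
    for name, per_cell in (("6T SRAM", 6), ("4T SRAM", 4), ("1T eDRAM", 1)):
        print(f"{name}: {cell_transistors(capacity, per_cell):,} cell transistors")
```

Going from 6T SRAM to a 1T cell removes about two billion cell transistors for that one buffer, which is the reduction the power argument leans on.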

LSI is also investigating other power-saving features that become possible when you move memory onto the logic chip including a sleep mode for the memory, dual power rails, and low-voltage operation. However, said Madge, the biggest benefit appears to be a move to embedded DRAM because of the huge reduction in transistor counts.

Give OTP a chance for low-power, on-chip storage
http://low-powerdesign.com/sleibson/2009/10/04/give-otp-a-change-for-low-power-on-chip-storage/
Sun, 04 Oct 2009 18:58:37 +0000, sleibson321

The on-chip memories that get most of the attention are read/write memories such as SRAM, DRAM, Flash, and MRAM (which I just covered in my previous blog entry). However, there’s a place for OTP (one-time programmable) memory on chip, so the technology bears some thought. I discussed OTP at last week’s GSA Emerging Opportunities Expo and Conference in Santa Clara, California with Jim Lipman of Sidense, a vendor that offers hard IP for on-chip OTP memory.

Sidense’s SiPROM memory cell consists of one specially designed FET as shown in the figure below. The special part of the FET’s design is a stepped gate-oxide layer with two thicknesses: thick and thin. Unprogrammed, the FET looks like a FET. Programming causes a controlled disruption in the thin part of the FET’s gate-oxide insulation to produce a conduction path from the FET’s gate to the conduction channel. Charge-coupled sense amps detect whether or not a FET in the OTP array has been programmed.

It’s because of the charge-coupled sense amps that Sidense’s SiPROM technology qualifies as a low-power memory technology. These sense amps are only on for tens of nanoseconds during a read cycle and are not powered continuously. This is a patented feature of Sidense’s technology.

Although designers have an obvious bias towards read/write technologies for on-chip memory, OTP memory can be quite useful for storing infrequently programmed or reprogrammed data such as calibration and trim settings, serial numbers, configurations, boot code, and security keys. This last application is particularly interesting. Lipman provided an example. The security keys for the HDMI digital display interface spec need about 2.5 kbits for storage. However, there’s the possibility that the security can be broken and that new keys will need to be distributed. A 16-kbit array of OTP memory can store about six sets of HDMI keys, which should be enough storage to last beyond the expected life of the end equipment.
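
Lipman's HDMI example is easy to check with a few lines of Python. The function name is mine, and whether a "kbit" means 1,000 or 1,024 bits is an assumption. The sketch tries both and the answer comes out the same.

```python
def key_sets_that_fit(array_bits: int, bits_per_key_set: int) -> int:
    """Whole key sets that fit in an OTP array; a partial set is useless, so round down."""
    if bits_per_key_set <= 0:
        raise ValueError("bits_per_key_set must be positive")
    return array_bits // bits_per_key_set


if __name__ == "__main__":
    # Decimal kilobits: 16 kbit array, 2.5 kbit per HDMI key set.
    print(key_sets_that_fit(16_000, 2_500))
    # Binary kilobits: 16 * 1024 bits, 2.5 * 1024 bits per key set.
    print(key_sets_that_fit(16 * 1024, 2_560))
```

Either way the array holds six complete key sets, so the original keys plus five replacements fit, since OTP cells cannot be rewritten and each new set consumes fresh cells.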

You should also be aware of the factors that argue in favor of on-chip OTP memory. Sidense’s cells are about 1.2x the size of ROM cells, so there’s a 20% size penalty in exchange for the flexibility of programmability. In return, there’s no need for a mask change if the data stored in the OTP ROM needs to be changed in the factory or in the field (for an update).

In addition, Sidense’s OTP memory easily tracks IC manufacturing process changes. It is hard IP, though, so Sidense must tailor the IP for each vendor’s process technology. Sidense’s SiPROM products are currently available from 180nm to 55nm and are portable to 40nm and below. Supported foundries include TSMC, UMC, Fujitsu Microelectronics, SMIC, Tower, IBM and Chartered.

It’s also interesting to compare OTP memory with Flash. Lipman says that Sidense’s OTP SiPROM cells are about half the size of Flash cells for a given semiconductor technology. In addition, the creation of Flash-cell floating gates adds process changes that can add roughly 30% to wafer production costs. Finally, Flash process technology is clearly getting into trouble as lithographies shrink. Some presenters at the recent Flash Memory Summit were predicting that the 22nm node might be the last node to support Flash memory, although such end-of-the-world prognostications from the semiconductor pundits are often wrong. By contrast, Sidense’s SiPROM cells require only standard CMOS processing, so the company claims it’s easier for their OTP memory than it is for Flash cells to track process improvements.

Could A Low-Power Middle Ground Between ASICs/SOCs and FPGAs Help You?
http://low-powerdesign.com/sleibson/2009/09/05/could-a-low-power-middle-ground-between-asicssocs-and-fpgas-help-you/
Sat, 05 Sep 2009 16:24:04 +0000, sleibson321

You can’t always get what you want,
But if you try sometime,
You’ll find,
You get what you need.

Those lyrics from a Rolling Stones song describe the situation with ASICs/SOCs and FPGAs. For low power, you want an ASIC or SOC. However, there are huge obstacles to using an ASIC or SOC. First, you need a team that knows how to design custom silicon or you need to rent one, which is expensive. If you have your own design team, you should be prepared to drop a million dollars or so on design tools and another million or so on NRE charges. Also be prepared for a 6-18 month design cycle, lots of painstaking verification, and the risk of at least one silicon respin due to design errors or spec changes. High risk indeed.

On the other hand, there are FPGAs. The NRE cost is zilch. The design tools are low-cost or no-cost. There’s no physical chip design required, hence a lot less verification. In short, it’s much easier to design a system based on FPGAs than on SOCs or ASICs, but there’s a price to pay: higher unit cost, less performance, and higher power consumption. All three figures of merit are 10-20x out of whack for FPGAs versus ASICs/SOCs. In addition, you’ll not get the same maximum gate count in an FPGA, not by a long shot.
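
The ASIC-versus-FPGA cost tradeoff comes down to a break-even volume, sketched here in Python. The $2 million up-front figure combines the "million or so" for tools and the "million or so" for NRE quoted above. The $5 ASIC unit cost is purely hypothetical, and the FPGA unit cost is taken as 10x or 20x that, per the stated range.

```python
import math


def break_even_units(upfront_dollars: float, asic_unit_cost: float, fpga_unit_cost: float) -> int:
    """Smallest volume at which the ASIC's up-front cost is paid back by its lower unit cost."""
    saving_per_unit = fpga_unit_cost - asic_unit_cost
    if saving_per_unit <= 0:
        raise ValueError("the ASIC must be cheaper per unit for a break-even point to exist")
    return math.ceil(upfront_dollars / saving_per_unit)


if __name__ == "__main__":
    upfront = 2_000_000   # roughly $1M tools + $1M NRE
    asic_unit = 5.0       # hypothetical ASIC unit cost in dollars
    for multiple in (10, 20):
        units = break_even_units(upfront, asic_unit, asic_unit * multiple)
        print(f"FPGA at {multiple}x unit cost: break-even at {units:,} units")
```

Below the break-even volume the FPGA wins on total cost despite its unit-price penalty. This ignores respins, schedule risk, and team costs, all of which push the real break-even point higher.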

So if you need an ASIC or SOC, then you need one. If not, and if an FPGA’s part cost, power consumption, and/or performance aren’t where your design needs to be, there is a middle ground. In the recent past, this middle-ground component has been called a “structured ASIC.” That’s become a tarnished name. In the distant past, a similar sort of device might have been called a “gate array.” Today, eASIC calls it a “new ASIC.”

What’s a “new ASIC”? If it’s an eASIC Nextreme or Nextreme2, then it’s a predesigned field-of-LUTs device with a preconfigured routing fabric on the metal layers. The only unconfigured layer is the via 6 layer. Standard Nextreme wafers are processed to metal layer 6 and stored. When a design is sent in, the via 6 and metal 7 and 8 layers are added. Depending on how fast the part needs to be made, the via 6 layer is customized using either direct-write e-beam or a standard lithographic mask and then the standard metal 7 and 8 layers are added on top.

So, what do you get from this technology? You get a zero-NRE, FPGA-like device that has much higher silicon density than an FPGA because there are no switches or configuration RAM cells in the routing fabric—just fast, tiny layer-6 configuration vias. Consequently, you get a chip that can clock faster than an FPGA—250 MHz (typical) for a 90nm Nextreme New ASIC and 500 MHz (typical) for a 45nm Nextreme2 “new ASIC.” You get a device that operates at lower power than an FPGA and you get a device that offers more gates/chip at lower component costs (but not as low as for an ASIC/SOC). You also get a chip that’s easier to design than an ASIC/SOC and one that can be delivered in as little as 4 weeks. Design-tool cost is lower than for ASICs/SOCs as well because eASIC offers a specialized, Nextreme-specific version of Magma’s design tools for as little as $8k per seat.

What are Nextreme parts used for? I asked Jasbinder (Jazz) Bhoot, eASIC’s VP of Worldwide Marketing, that question. His answer was both interesting and a bit surprising:

  • Cell phone microprojectors (where cost and power dissipation are critical)
  • Other microprojectors
  • Medical devices such as ultrasound imagers where power is not so much of a problem but device cooling is a big problem
  • Portable medical devices that run on batteries
  • Wired networking products where Nextreme parts are consolidating several FPGA designs into one chip with much lower power consumption
A Hunka, Hunka Burning CMOS (All About Latchup)
http://low-powerdesign.com/sleibson/2009/07/05/a-hunka-hunka-burning-cmos-all-about-latchup/
Sun, 05 Jul 2009 18:47:52 +0000, sleibson321

You’re a mere 10 minutes from completely understanding and preventing CMOS latchup in your low-power designs. Wizard of Oz Dave Jones has just posted his sixteenth EE Video Blog on these topics. Here it is:

 

 

 

