Design Articles

Good Embedded Communications is the Key to Multicore Hardware Design Success

Self-timed NoC interconnects can solve a lot of the problems with overloaded data buses.

By David Lautzenheiser, Vice President of Marketing, Silistix

While multicore processors have certainly become an important part of many SoC designs, there are still several obstacles designers face in dealing with more than one processing engine on a chip. Software engineers face the problem of trying to efficiently program multiple processor cores on the same piece of silicon. On the hardware side, chip developers – from architects down to physical implementation engineers – face difficult communication issues between the various processing and other IP cores and in accessing off-chip DRAM.


Concentrating on the hardware aspects of multicore chip design, a major problem is the industry’s reliance on hierarchical, clock-based bus structures to move data among the various processing cores and the memories – both embedded and off-chip.  It’s time to look at self-timed network on chip (NoC) interconnect fabrics for embedded communication networks.  This article will review the challenges of clock-based buses being used as the main communications mechanism and discuss how self-timed NoC interconnects improve on-chip data flow, simplify and enhance power management, and increase shared memory efficiency for multicore processor SoCs.

Why Multicore?

As SoC designers began “hitting the wall” developing single-processor chips for tackling the increased demands of high-definition video processing and other user requirements, they started developing chips using multiple processing cores.  The multicore approach – based on the assumption that an overall processing job can be broken into multiple tasks executed concurrently by several processors – resulted in chips with better performance than single-processor designs and lower power dissipation, which meant less heat.  Since increased heat and higher chip operating temperatures were a big problem with cranking up the clock rate on a single-core processing chip, multicore architectures appeared to be a viable solution for keeping clock rates and heat production manageable.  Other advantages of using multiple cores at lower clock rates included fewer signal integrity problems, less electromagnetic interference (EMI), and fewer problems associated with distributing very high frequency clocks on silicon.
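
The power argument can be sketched with the classic CMOS dynamic-power relation P ≈ C·V²·f.  The core counts, voltages, and frequencies below are hypothetical, chosen only to illustrate why several slower cores can outperform one fast core on power:

```python
# Illustrative dynamic-power comparison: one fast core vs. four slower cores.
# Assumes the classic CMOS dynamic-power model P ~ C * V^2 * f, and that
# supply voltage can be lowered along with clock frequency.
# All numbers are hypothetical, in relative units.

def dynamic_power(cap, volt, freq):
    """Relative dynamic power for switched capacitance cap at volt, freq."""
    return cap * volt ** 2 * freq

# Single core at 2.0 GHz, nominal 1.0 V.
single = dynamic_power(cap=1.0, volt=1.0, freq=2.0)

# Four cores, each at 0.5 GHz; voltage scaled down with frequency to 0.7 V.
quad = 4 * dynamic_power(cap=1.0, volt=0.7, freq=0.5)

print(f"single-core power: {single:.2f}")   # 2.00
print(f"quad-core power:   {quad:.2f}")     # 0.98
```

With these (hypothetical) scaling assumptions, the four slower cores deliver the same aggregate clock throughput at roughly half the dynamic power, because the V² term rewards running each core slower at a lower voltage.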

However, there are several barriers to designing high-end multicore processor chips.  Software development – not the subject of this paper – becomes far more complex due to the difficulties in breaking a single processing task into multiple parts that can be processed separately and then reassembled later.  This reflects the fact that certain processor jobs cannot be easily parallelized to run concurrently on multiple processing cores and that load balancing between processing cores – especially heterogeneous cores – is very difficult.

The other set of problems with multicore chip design is hardware-based.  This is the topic of this paper: the difficulties associated with current multicore architectures and what can be done to overcome some of them.

Challenges with Current Multicore Architectures

Today’s multicore processor chips usually communicate through a traditional clock-based bus system, often a hierarchical bus architecture with very tight coupling between the different levels of the bus hierarchy.  Data is moved in and out of each processor, and between processors and memory, on a clock edge.  However, having a clock “control” the flow of data between processors, memories and peripherals on a multicore chip is a far more complex problem than with a chip using a single processing engine.

Insufficient interconnect bandwidth
For multicore systems, the data communications problems increase due to the need for load sharing – keeping all processor cores busy by feeding them data at the right times.  If the bus system cannot handle the data flow requirements of the processing cores, data congestion or processor “starvation” – a processor idling while waiting for data - may result and processor efficiency suffers.  For example, in a complex multicore system many processors may simultaneously attempt to initiate transactions to many destinations.

Forcing all transaction traffic to travel one transaction per clock across the already heavily loaded bus can quickly create a bottleneck that adds large queuing delays to transaction delivery.  This is particularly true for multicore systems that use multiple cache levels on a shared system bus.  Standard bus protocols such as AMBA introduce arbitration overhead, and inefficient arbitration can stall cycles and leave a processor core idle, reducing performance and increasing power consumption.  The latency of physically long bus lines worsens this situation – an inherent problem when clock-based buses are used to transfer data between cores.
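
The queuing effect can be illustrated with a toy discrete-time model of a bus that completes one transaction per clock.  The initiator counts and per-cycle request probabilities below are hypothetical, chosen to contrast light and heavy offered load:

```python
# Toy model of a shared bus that serves one transaction per clock cycle.
# Several initiators enqueue requests each cycle; mean queuing delay grows
# quickly as offered load approaches the bus's single-transaction capacity.
# Arrival pattern and parameters are hypothetical.
import random

def mean_queuing_delay(n_initiators, p_request, cycles=20000, seed=1):
    rng = random.Random(seed)
    queue = []          # arrival cycle of each pending transaction
    delays = []
    for t in range(cycles):
        for _ in range(n_initiators):
            if rng.random() < p_request:
                queue.append(t)
        if queue:                      # bus serves one transaction per clock
            delays.append(t - queue.pop(0))
    return sum(delays) / len(delays)

light = mean_queuing_delay(n_initiators=4, p_request=0.10)   # ~40% load
heavy = mean_queuing_delay(n_initiators=8, p_request=0.11)   # ~88% load
print(f"mean delay at ~40% load: {light:.1f} cycles")
print(f"mean delay at ~88% load: {heavy:.1f} cycles")
```

Doubling the number of contending initiators more than doubles the mean queuing delay, because delay rises non-linearly as utilization nears the bus's one-transaction-per-clock ceiling.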

Memory Access Issues
Figure 1: As the number of processing cores in a multicore SoC increases, using clock-based buses to move data between the processors and shared resources such as memories becomes an increasingly difficult problem.  This limits the scalability of traditional bus architectures for large numbers of processing cores.

As process nodes shrink and processors become faster, the disparity between memory and processor speed grows.  Increasing the number of processor cores on a chip increases contention for memory bandwidth, and the common practice of sharing memory resources among multiple cores lowers the bandwidth available to each processor.  This is particularly evident when processors access off-chip DRAM, which is one of the drivers for faster DDR speeds (the industry is currently transitioning to DDR3 – with data rates up to 1.6GT/s – and is already looking into a specification for a higher-performance DDR4 memory).  The problem is that raising memory access speed by raising the clock rate negates a big advantage of multicore architectures – lower clock speeds, less power dissipation and less heat generation.
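
A back-of-the-envelope calculation shows how quickly a fixed, shared memory bandwidth is diluted as core counts grow.  The 64-bit interface width and the even-sharing assumption are simplifications for illustration; real contention makes the per-core share worse:

```python
# How a fixed, shared DRAM bandwidth divides as core counts grow.
# DDR3-1600 transfers at 1.6 GT/s; a 64-bit (8-byte) interface gives
# 12.8 GB/s peak. Even sharing among cores is a simplifying assumption.

peak_gbps = 1.6e9 * 8 / 1e9      # DDR3-1600, 64-bit interface: 12.8 GB/s peak

for cores in (1, 2, 4, 8, 16):
    share = peak_gbps / cores
    print(f"{cores:2d} cores -> {share:5.2f} GB/s per core")
```

At sixteen cores, each core's fair share of a 12.8 GB/s interface is only 0.8 GB/s – before any arbitration or row-miss overhead is counted.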

Core limits
Using hierarchical bus architectures for multicore on-chip communications is effective – up to a point.  The bus concept scales to a relatively small number of processor cores, around four to eight.  Beyond that, a bus communications structure becomes very difficult to use, due to shared-resource management and bandwidth limitations.  The difficulty of providing adequate communications among the processors and shared resources increases rapidly as the number of processing cores grows (Figure 1).

Achieving Timing Closure
Any clock-based data transfer between processors and other cores requires a high-speed clock distribution network across the chip, and the tight phase control this demands of a hierarchical bus interconnect makes timing closure very difficult.  The chip designer not only has to deal with timing closure for the various IP cores on the chip, but for the clocked hierarchical bus system that connects them as well.  Interconnect timing closure can add significant time to the total chip hardware design effort.

Advantages of a Self-Timed Communications Network

Replacing a traditional bus interconnect system with a self-timed network provides several advantages for multicore processor chips, as shown in Table 1.


|                                          | Synchronous bus | Self-timed interconnect |
|------------------------------------------|-----------------|-------------------------|
| System bandwidth                         | Slowest IP core on the bus is often the limiting factor in bus performance | Data moves at wire speed on interconnect between cores, not limited by a clock rate |
| Ability to add pipeline latches to increase throughput | Possible, at the expense of all blocks having to cope with the faster interconnect | Simple, since only faster blocks need to operate at the higher speed |
| Power consumption                        | Clock and the cores it drives consume power even when idle | Only consumes power when transferring data |
| System power management                  | Difficult due to coupling between various IP cores | Core independence allows maximum flexibility in power management of each core |
| Additional wiring cost                   | Large, due to the global clock distribution network and the various buses, which must run from the IP cores to the processors that access them | Lower, due to shorter local wires in the datapath and acknowledge |
| Flexibility to trade low-frequency parallel vs. high-frequency serial operation | Difficult due to fixed clock frequency within a synchronous time domain | Automatic, since every communication is self-timed |
| Timing closure cost                      | Much validation and many design iterations | Much less validation and far fewer iterations |
| Radio frequency interference             | High-amplitude, frequency-phased emissions | Low amplitude, spread across the spectrum, and not coupled to the system clock frequency |
| Core scalability                         | Limited to a relatively small number of cores, since system complexity becomes unmanageable for a large number of cores | Network topology model proven effective in scaling to connect large numbers of processing systems |

Table 1. Using a self-timed interconnect network instead of a clock-based synchronous bus provides
many design and system performance benefits.

Power Management
A self-timed communications fabric allows each processor core to operate truly independently of the other processors on the chip – not coupled with master-clock-dependent bus lines.  This gives the chip architect maximum flexibility in the power management of each processor and of the entire SoC.  The system developer can scale the clock rates of different cores or even turn off ones that are not needed for a particular application, thus adding to overall system power efficiency.  In addition, the behavior of the interconnect is independent of individual core (processor or other) power management.

An additional benefit of a self-timed interconnect fabric between the various cores on the chip is power reduction.  Unlike a clock-based bus, which dissipates power on every clock cycle, self-timed interconnect only dissipates dynamic power when data is moving between two cores.

Since a global clock network is a major source of on-chip power dissipation, as much as 25% of total chip power, the power savings with self-timed interconnect can be significant.  Eliminating the need for a global clock distribution network also eliminates the need for developing a means of maintaining low clock skew throughout the chip.  Tight clock phase control to keep clock skew low (generally 5% or less of the clock period) is needed for any clocked interconnect system, particularly one with a collection of coupled buses. 
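
The figures quoted above translate into concrete numbers.  The 2 W chip power and 800 MHz clock rate below are hypothetical; the 25% clock-power fraction and 5% skew budget come from the text:

```python
# Quick arithmetic on the figures quoted above: if the global clock network
# accounts for up to 25% of total chip power, removing it bounds the saving;
# and a 5% skew budget at a given clock rate is a very tight absolute margin.
# Chip power and clock rate here are hypothetical.

chip_power_mw = 2000.0                 # hypothetical 2 W SoC
clock_fraction = 0.25                  # up to 25% per the article
max_saving_mw = chip_power_mw * clock_fraction

clock_hz = 800e6                       # hypothetical 800 MHz bus clock
period_ps = 1e12 / clock_hz            # 1250 ps period
skew_budget_ps = 0.05 * period_ps      # 5% of the clock period

print(f"clock-network power bound: {max_saving_mw:.0f} mW")
print(f"skew budget at 800 MHz:   {skew_budget_ps:.1f} ps")
```

A 62.5 ps skew budget across an entire die is what makes global clock distribution so expensive in buffers, wiring, and design effort – all of which a self-timed fabric avoids.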

A further benefit is that chip implementation is greatly simplified by eliminating the power-hungry clock distribution network.  Less power dissipation also reduces concerns regarding electromigration problems and enhances chip reliability.

The network topology model has been proven effective in scaling to connect large numbers of processing systems, as the scale of the Internet demonstrates.  This suggests that chips with very large numbers of processor cores are practical only with a network topology, particularly one in which the switching is completely independent of the processors.

Data Transfer and System Bandwidth
With a bus-based interconnect system, the slowest IP core on the bus is often the limiting factor in bus performance.  This is why many chips segregate their peripheral IP cores onto a separate bus at a lower clock rate – to avoid slowing down the main processor bus(es).  With self-timed interconnect between chip cores, data travels at wire speed between the endpoints of a communication channel and is not limited by a clock rate, since interconnect data transfer is neither clock-dependent nor tied to processor operation.  Latency problems due to long lines requiring multi-cycle clocked transport are non-existent, since the self-timed interconnect lines don't use a clock.  Thus, data is presented to and consumed by a processor at a rate dictated by that processor, not by interconnect characteristics such as latency and bandwidth.

A self-timed communications channel still presents some latency to the system.  In fact, in most situations a clock-based bus will deliver a better “best-case” latency.  The challenge in designing a bus-based system, however, is that the worst-case latency is extremely difficult to predict, since it depends heavily on what the various processors accessing the bus are doing at any point in time (in other words, data transfer characteristics are not deterministic).  So a clock-based bus might have a faster best-case latency, but a longer and much less predictable worst-case latency.  With the right design tools and component libraries, designers can take anticipated worst-case traffic into consideration and size the components of a self-timed NoC to deliver data within worst-case latency requirements, at the desired bandwidth.  For example, Silistix uses CSL (Connection Specification Language) to describe the interconnect fabric between IP cores, and the CHAINworks tool suite provides a systematic way of describing the communications needs of the cores and automatically synthesizing a communications system that meets those aggregate requirements.
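
The underlying sizing idea can be sketched generically: sum each initiator's worst-case offered bandwidth, add margin, and pick the narrowest link whose capacity covers it.  This is not Silistix's actual CSL/CHAINworks flow – the function, parameters, and numbers below are hypothetical:

```python
# Generic sketch of "size the link for worst-case traffic": given each
# initiator's worst-case offered bandwidth and a design margin, pick the
# narrowest link width whose capacity covers the aggregate load.
# This is NOT the CSL/CHAINworks flow -- just the underlying idea.

def size_link(offered_mbps, widths_bits, link_mhz, margin=1.5):
    """Return the smallest link width meeting aggregate worst-case load."""
    total = sum(offered_mbps) * margin
    for width in sorted(widths_bits):
        capacity_mbps = width * link_mhz   # bits/cycle * Mcycles/s = Mb/s
        if capacity_mbps >= total:
            return width
    raise ValueError("no candidate width meets worst-case load")

# Three initiators with hypothetical worst-case demands (Mb/s), link at 400 MHz.
width = size_link(offered_mbps=[800, 1200, 1500],
                  widths_bits=[8, 16, 32, 64], link_mhz=400)
print(f"chosen link width: {width} bits")   # 16 bits
```

Because the network is sized against the aggregate worst case rather than shared cycle-by-cycle with every other initiator, the delivered worst-case latency becomes a design parameter instead of an emergent surprise.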

Memory Access
A well-designed, self-timed interconnect network can overcome many of the memory access problems associated with the hierarchical clock-based bus architectures that predominate in processor-based systems.  This is particularly true for accessing off-chip memory through an embedded DRAM controller.  For example, Silistix's tools synthesize self-timed interconnect networks from a high-level architectural description with several attributes that help optimize DRAM operation:

- Requests carry the identification of the requesting IP core, even if the protocol at the requesting end does not explicitly support such identification.
- The bandwidth of a synthesized self-timed communications network is generally much greater than that of any of its endpoints, such as the DRAM controller.
- The self-timed interconnect is completely endpoint “transparent”: any request from an initiator (for example, a processor core) is delivered unaltered.
- Adaptors (logic that services the needs of endpoints) provide request reordering for endpoints that lack it, increasing the efficiency of an endpoint such as a DRAM controller.
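
A small model shows why request reordering helps a DRAM controller: grouping requests that hit the same DRAM row avoids repeated row activate/precharge cycles.  The cost model (1 cycle for a row hit, 10 for a row miss) and the sort-by-row policy are simplified assumptions, not a description of any real adaptor:

```python
# Why reordering helps DRAM: accesses to an already-open row are cheap;
# switching rows forces an expensive activate/precharge. Costs and the
# reordering policy are illustrative simplifications.

def access_cost(requests, row_hit=1, row_miss=10):
    """Total cost of servicing (row, col) requests in the given order."""
    cost, open_row = 0, None
    for row, _col in requests:
        cost += row_hit if row == open_row else row_miss
        open_row = row
    return cost

def reorder_by_row(requests):
    """Stable-sort requests so accesses to the same row are adjacent."""
    return sorted(requests, key=lambda r: r[0])

# Interleaved requests from two initiators, ping-ponging between rows 0 and 3.
reqs = [(0, 1), (3, 4), (0, 2), (3, 7), (0, 9), (3, 1)]
print("in order: ", access_cost(reqs))                  # 6 misses: 60
print("reordered:", access_cost(reorder_by_row(reqs)))  # 2 misses + 4 hits: 24
```

A real adaptor must also respect ordering constraints between dependent requests, but the principle stands: reordering at the endpoint recovers row locality that interleaved multicore traffic destroys.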

Timing Closure
Timing closure is easier to obtain, potentially saving months of design effort, and overall chip clocking (of cores) is simplified.  Using self-timed interconnect, there are no signals with critical top-level timing closure requirements to worry about.  Once the individual IP cores have achieved timing closure, the composite top-level chip should meet timing on the first pass.

Chip Implementation
With self-timed interconnect, physical chip implementation is greatly simplified.  Without the problems associated with a high-speed clock distribution network across a chip and clock-based data transfer between processors and other cores, the time needed for chip layout and verification is reduced.  A self-timed communications fabric also provides maximum design flexibility for a multicore processing chip, since the chip developer can implement the optimum communications network topology for a particular design.  This not only simplifies the design of the chip’s hardware architecture, but may also provide more flexibility for software developers in partitioning tasks between the various processing cores.

Silistix
San Jose, CA
(408) 436-1656

This article originally appeared in the February 2008 issue of Portable Design.  Reprinted with permission.
