[News Feature]
64-Core Chip Spins SMP Design to Higher Performance Levels
The Tile64 takes advantage of the iMesh interconnect to open up parallel programming opportunities in network and video applications.
William Wong
ED Online ID #17926
December 6, 2007
Typical general-purpose
symmetrical multiprocessor
(SMP) multicore
designs contain about
eight cores. Specialized architectures,
on the other hand, push the
number of cores into the hundreds.
Tilera ups the ante for SMP with its
64-core/tile Tile64 chip (see the
figure). Its iMesh interconnect
incorporates five different packet
networks with five switches per tile
(see the table). Chips with 35 and
120 tiles are on the horizon.
Go With the Flow
The SMP non-uniform memory
access (NUMA) architecture is similar
to the HyperTransport system
used by AMD for its Opteron
series. As with AMD’s approach,
the location of peripherals and memory
is not important to the application, except at a low level of the
operating system.
The big difference is that AMD
uses the same HyperTransport
interface for all traffic, while Tilera
splits the traffic into different networks.
This enables memory transfers
to occur in parallel with other
transfers, such as peripheral data.
Data moves through non-blocking
switches at one cycle per hop.
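As a rough illustration of the one-cycle-per-hop figure, the sketch below estimates how long a packet spends crossing switches between two tiles. It assumes the 64 tiles form an 8 x 8 grid and that packets follow dimension-ordered routing, so the hop count is the Manhattan distance. The grid shape, the routing scheme, and the function names are assumptions for illustration, not details confirmed here.

```python
GRID_SIDE = 8        # assumption: 64 tiles laid out as an 8 x 8 mesh
CYCLES_PER_HOP = 1   # per-hop switch latency quoted in the article


def mesh_hops(src, dst):
    """Hop count between two (row, col) tiles under dimension-ordered routing."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def transit_cycles(src, dst):
    """Switch traversal time in cycles, ignoring contention and injection overhead."""
    return mesh_hops(src, dst) * CYCLES_PER_HOP


# Corner to corner is the longest path on the assumed grid.
print(transit_cycles((0, 0), (GRID_SIDE - 1, GRID_SIDE - 1)))  # prints 14
```

Under these assumptions, even the longest path across the chip costs 14 cycles of switch traversal.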
By splitting the traffic, different
types of transfers can be optimized.
For example, memory and
stream transfers tend to be
large, while interrupts and
UDP-style (User Datagram Protocol)
transfers are usually smaller. High-level
language support permits
socket-style communication
between nodes.
Communication can occur
between any two nodes. Each has a
matrix address. Some nodes, such
as the memory controllers, feature
more than one address to provide
higher throughput. The source
node determines which address
should be used.
Typically, the system that initializes
the operating systems on each
core will distribute the addresses.
This prevents any single address
from becoming a bottleneck.
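One simple way initialization software might spread a controller's addresses across cores is round-robin assignment. The sketch below is only an illustration of that idea. The tile coordinates, the four controller addresses, and the function name are hypothetical, and Tilera's actual boot software may distribute addresses differently.

```python
from collections import Counter
from itertools import cycle


def assign_controller_addresses(tiles, controller_addresses):
    """Give each tile one of the controller's addresses in round-robin order."""
    rotation = cycle(controller_addresses)
    return {tile: next(rotation) for tile in tiles}


# Hypothetical setup: 64 tiles and a controller reachable at four matrix addresses.
tiles = [(row, col) for row in range(8) for col in range(8)]
controller_addresses = [(8, 0), (8, 2), (8, 4), (8, 6)]

assignment = assign_controller_addresses(tiles, controller_addresses)
load = Counter(assignment.values())
print(load)
```

With 64 tiles and four addresses, each address ends up serving 16 tiles, so no single entry point carries all of the traffic.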
My Cache, Your Cache
Each tile incorporates an L1 and
larger L2 cache. A core’s L3 cache
is the sum of the other cores’ L2
caches. The memory controllers
keep track of where information is
located in the L2 cache. Accesses
from a different node are provided
with the location so subsequent
accesses can be made via the
remote L2 cache.
The response characteristics of
this approach are different from a
conventional SMP L3 cache.
However, it is far more efficient
than accessing main memory
in terms of both speed and
power. Off-chip
accesses require hundreds of
cycles and 500 pJ. An L3 access
will require 20 to 30 cycles and
yet consume only about 3 pJ.
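To put the energy figures in perspective, 500 pJ against 3 pJ is a ratio of roughly 167 to 1. The sketch below blends the two quoted figures into an average energy per access for a given fraction of accesses served by the distributed L3. The hit fraction is a hypothetical input, and the model ignores L1 and local L2 hits, so treat it as back-of-the-envelope arithmetic rather than a measured result.

```python
OFF_CHIP_PJ = 500.0  # energy per off-chip access, as quoted in the article
L3_PJ = 3.0          # energy per distributed-L3 access, as quoted in the article


def average_energy_pj(l3_hit_fraction):
    """Average energy per access when a given fraction is served by the L3."""
    if not 0.0 <= l3_hit_fraction <= 1.0:
        raise ValueError("l3_hit_fraction must be between 0 and 1")
    return l3_hit_fraction * L3_PJ + (1.0 - l3_hit_fraction) * OFF_CHIP_PJ


print(round(OFF_CHIP_PJ / L3_PJ))        # energy ratio, off-chip vs. L3
print(round(average_energy_pj(0.9), 1))  # hypothetical 90% L3 hit fraction
```

Even at a high hit fraction, the remaining off-chip accesses dominate the average, which is why keeping data in a neighbor's L2 matters for power.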
Hardware handles cache operation
and virtual memory support.
Its operation remains transparent
to applications.
Virtual Partitions
A bank of 64 cores can be
handy, but multiple subsets are
often used instead. Tilera’s
Hardwall technology logically partitions
the system into sets of tiles.
Traffic can flow through any
region to memory controllers and
peripherals. However, Hardwall prevents
communication between cores in
different regions. Of course, the L3
caching will be within a region
too. Rectangular regions are currently
supported.
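The partitioning rule can be pictured with a small model: each region is a rectangle of tiles, and two cores may talk to each other only if they sit in the same rectangle. The sketch below is an illustrative model, not Tilera's interface. The `Region` type, the inclusive coordinates, and the two example regions are all assumptions.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """A rectangular group of tiles; all bounds are inclusive (row, col) limits."""
    top: int
    left: int
    bottom: int
    right: int

    def contains(self, tile):
        row, col = tile
        return self.top <= row <= self.bottom and self.left <= col <= self.right


def region_of(tile, regions):
    """Return the region holding the tile, or None if it is unassigned."""
    for region in regions:
        if region.contains(tile):
            return region
    return None


def can_communicate(a, b, regions):
    """Core-to-core traffic is allowed only within a single region."""
    region = region_of(a, regions)
    return region is not None and region == region_of(b, regions)


# Hypothetical split of an 8 x 8 grid into an upper and a lower half.
regions = [Region(0, 0, 3, 7), Region(4, 0, 7, 7)]
print(can_communicate((0, 0), (3, 7), regions))  # same region
print(can_communicate((3, 0), (4, 0), regions))  # adjacent tiles, different regions
```

Note that the model covers only core-to-core traffic. As described above, traffic bound for memory controllers and peripherals can still pass through any region.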
A hypervisor runs on each core,
providing virtual-machine support.
Access to peripherals is still controlled
at the software level.
Nonetheless, this is relatively easy
to handle at the hypervisor level.
Moreover, the hypervisor has control
over a tile’s switches.
The Tile64 can support a range
of operating systems, but its initial
flavor is Linux. Support also
includes the Eclipse-based
Multicore Development Environment
(MDE), including the GDB debugger.
The current mix of software
includes open-source tools, plus
some proprietary software, such as
the C/C++ compiler.
Many Cores, Fewer Watts
Power management can be a significant
advantage in multicore environments.
In this case, it’s possible
to power down individual cores
while the switches continue to operate.
The design also makes extensive
use of clock gating, minimizing power requirements for sections of
the system that are inactive.
Soft Tiles
Software support includes tools
specific to the Tile64, such as a
high-level and cycle-accurate simulator.
A whole-application model for
collective debugging can single-step
multiple cores. Also, a runtime
library for socket-style streams provides
access to the tile-to-tile hardware
support mentioned earlier.
The architecture has had time to
mature. A similar system was developed
in 1994 at the Massachusetts
Institute of Technology, but it
required a rack of hardware.
Meanwhile, external links between
Tile64 chips can be established
using the Ethernet or PCI Express
interfaces. For now, iMesh operates
only within the chip.
The Tile64 should provide 40
times the performance of dual-core
DSPs and 10 times the performance
of dual-core Xeon processors while
using less power. Of course, these
are 32-bit cores, not 64-bit cores.
Applications that run on
an SMP platform should work well
without modification on the Tile64.
New designs can take advantage
of more intimate hardware
support. But gaining access to
such a large number of cores
opens new possibilities for parallel
programming. And while the
Tile64 targets network and video
applications, it should equally suit
other applications amenable to
parallel programming.