Games Flourish in a Parallel Universe
Multicore processors accelerate games if developers can take advantage of the features and live with the limitations.
William Wong
ED Online ID #17753
September 13, 2007
Gaming platforms such
as Microsoft’s Xbox
360 and Sony’s
PlayStation 3 push the
proverbial envelope when it
comes to graphics and computation,
delivering sophisticated and
realistic games. Their latest
multicore 64-bit processing
architectures let programmers
create complex, multithreaded
applications.
The computational processors
are tightly integrated with the
graphical processing units, minimizing
system response time for a
better gaming experience. Even
small delays can disrupt the flow
of a game or its multimedia presentation.
An optimal gaming experience
depends on performance and
balance on both the hardware
and software fronts.
Gamers tend to grade a system
on the basis of the game’s playing
capabilities, regardless of
how well it takes advantage of
the underlying hardware. Still,
looking under the hood shows
each system’s potential. As with
most programming platforms,
applications rarely take full
advantage of the hardware the
first time around. It takes time to
learn about system idiosyncrasies
and to mold application frameworks
to exploit the hardware.
Game developers have an additional
challenge because game
vendors often target multiple platforms
with the same game.
Obviously, this is desirable from a
vendor’s perspective, because it
widens the market. Unfortunately, even slight differences in platforms
or their capabilities can
significantly impact the software.
Differences between Microsoft’s
and Sony’s platforms are quite
substantial, so a seemingly minor
problem potentially becomes
major. The Xbox 360 uses a
more conventional symmetric
multiprocessing (SMP) architecture.
Sony’s PlayStation 3 is built on
IBM’s Cell processor. The Cell
forgoes large caches for its
eight Synergistic Processing
Elements (SPEs), forcing application
programmers to use software-based
caching support.
THE SYMMETRICAL APPROACH
Microsoft and IBM developed a
multicore chip based on the
Power architecture (Fig. 1). Its
three 3.2-GHz processing cores
are identical and have their own
32-kbyte L1 instruction and data
caches. The two-way, set-associative
caches include parity error
checking on the 128-byte lines.
Each core can run two threads.
The processing cores share a 1-Mbyte L2 cache, and this cache
has an unusual architecture.
Half of the cache runs at the
processors’ clock frequency,
while the rest of the L2 cache
runs at 1.6 GHz. Things
become more interesting with
a new instruction called Extended
Data Cache Block Touch.
The instruction is designed to
prefetch data from main memory
into the L1 cache. It’s often easier
to take advantage of this instruction
in a gaming environment,
where the size and use of data are
well-defined. Moving data straight into
the L1 cache reduces L2 thrashing,
and the instruction can be used to quickly build
up a thread’s working set. In a
conventional processor, the working
set is brought in incrementally,
slowing down the overall
thread operation.
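As a rough illustration of software prefetching, the sketch below walks an array while hinting that data a fixed distance ahead will be needed soon. It does not use the Xbox 360 instruction itself. It assumes a GCC or Clang compiler and uses the generic `__builtin_prefetch` hint as a stand-in; the function name `sum_with_prefetch` and the prefetch distance are hypothetical choices for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums `data` while asking the hardware to fetch elements `distance` slots ahead.
// ASSUMPTION: __builtin_prefetch (GCC/Clang) stands in for a platform-specific
// prefetch instruction. It is only a hint and never changes the result.
float sum_with_prefetch(const std::vector<float>& data, std::size_t distance = 64) {
    assert(distance > 0);
    float total = 0.0f;
    const std::size_t n = data.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (i + distance < n) {
            // rw = 0 (read), locality = 3 (keep in cache as long as possible)
            __builtin_prefetch(&data[i + distance], 0, 3);
        }
        total += data[i];
    }
    return total;
}
```

A game can often pick a sensible distance because buffer sizes and access order are known in advance, which is the situation the article describes. The right distance depends on the target hardware and would have to be measured.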
The processing chip accesses
main memory through the front-side
bus connected to the graphics
chip. The front-side bus runs at
5.4 GHz with a bandwidth of
21.6 Gbytes/s. The graphics chip
provides a unified memory system
to the on-chip graphics processing
unit (GPU) and the Power cores in
the processing chip. The GPU can
read data directly from the L2
cache for even better interaction
with application code.
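The two front-side bus figures can be related with a quick calculation. The sketch below assumes the 5.4-GHz figure is the bus transfer rate and that the 21.6 Gbytes/s is the total bandwidth; the helper name `bytes_per_transfer` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Bytes moved per transfer = bandwidth / transfer rate.
// Integer units (Mbytes/s and million transfers/s) avoid floating-point rounding.
std::uint64_t bytes_per_transfer(std::uint64_t mbytes_per_s,
                                 std::uint64_t mtransfers_per_s) {
    assert(mtransfers_per_s > 0);
    return mbytes_per_s / mtransfers_per_s;
}
```

Under those assumptions, `bytes_per_transfer(21600, 5400)` gives 4, so the quoted bandwidth corresponds to four bytes moved per transfer.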
The processors also support
cacheable and cache-inhibited
store operations, which are handled
by different pipelines. The
cacheable operations use eight
store-gathering, non-sequential
buffers per core, while the noncacheable
operations use four
sequential buffers. By understanding
these instructions, developers
can optimize their applications.
For example, data written to
main memory for use by the GPU
will often benefit from bypassing
the cache if the application
threads no longer need to access
this data. Running data through
the cache would simply evict other data
that might be useful later.
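The effect of bypassing the cache can be seen in a toy model. The sketch below is not Xbox 360 code: it simulates a tiny, fully associative cache with least-recently-used replacement (an assumption made for simplicity) and counts how much of a thread's working set survives a burst of output writes. `ToyCache` and `hits_after_output_burst` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy fully associative LRU cache that tracks line addresses only.
class ToyCache {
public:
    explicit ToyCache(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Accesses a line through the cache; returns true on a hit.
    bool access(int line) {
        auto it = std::find(lines_.begin(), lines_.end(), line);
        const bool hit = (it != lines_.end());
        if (hit) {
            lines_.erase(it);              // re-added below as most recent
        } else if (lines_.size() == capacity_) {
            lines_.erase(lines_.begin());  // evict the least recently used line
        }
        lines_.push_back(line);
        return hit;
    }

private:
    std::size_t capacity_;
    std::vector<int> lines_;  // front = least recently used
};

// Returns how many working-set lines still hit after a burst of output writes.
int hits_after_output_burst(bool bypass_cache) {
    const int kWorkingSet = 8;
    const int kOutputLines = 16;
    ToyCache cache(kWorkingSet);
    for (int line = 0; line < kWorkingSet; ++line) {
        cache.access(line);  // warm up the working set
    }
    for (int i = 0; i < kOutputLines; ++i) {
        if (!bypass_cache) {
            cache.access(1000 + i);  // cacheable store pollutes the cache
        }
        // A bypassing store goes straight to memory and leaves the cache alone.
    }
    int hits = 0;
    for (int line = 0; line < kWorkingSet; ++line) {
        hits += cache.access(line) ? 1 : 0;
    }
    return hits;
}
```

In this model, `hits_after_output_burst(true)` keeps all eight working-set lines, while `hits_after_output_burst(false)` loses every one of them to the output data.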
However, the cache isn’t the only
concern for software developers.
Each processing core includes a
VMX128 (Vector/SIMD
Multimedia eXtension) unit. The
VMX128 was specifically
designed to accelerate 3D graphics
and game physics. Developers
can benefit from this feature
because it was built on the VMX
accelerator, which is already
found in many Power architecture
cores like those in Apple’s G4
and G5 Power Macs. Enhancing
SIMD support in a compiler is a
relatively straightforward process
and typically allows a programmer
to exploit the underlying
hardware without significantly
modifying the software.
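A loop of the kind a vectorizing compiler can typically map onto SIMD hardware looks like the sketch below. It is plain C++ with no VMX128-specific code. Whether a given compiler actually vectorizes it depends on the compiler and its optimization settings, and the function name `integrate` is a hypothetical example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Advances each position by velocity * dt.
// Contiguous arrays and a simple, branch-free body give the compiler
// a good chance of processing several floats per SIMD instruction.
void integrate(std::vector<float>& pos, const std::vector<float>& vel, float dt) {
    assert(pos.size() == vel.size());
    float* p = pos.data();
    const float* v = vel.data();
    const std::size_t n = pos.size();
    for (std::size_t i = 0; i < n; ++i) {
        p[i] += v[i] * dt;
    }
}
```

The source stays the same across platforms; only the compiler's code generation changes, which is the portability benefit the article points to.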
There are significant advantages
to Microsoft’s more conventional
gaming hardware approach.
SMP with multilevel, transparent
coherent caches is standard fare
on PCs. Thus, it’s significantly easier
to develop multithreaded
applications that will run on different
platforms, often with minimal
application architectural changes
other than recompilation. The
same is true for use of the
VMX128, since this support is often
hidden by the compiler.
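A minimal sketch of portable multithreading in standard C++ is shown below. It splits an array across worker threads and relies on the hardware's coherent caches to make each worker's result visible after the join. The function name `parallel_sum` is hypothetical, and the choice of six workers in the usage note simply mirrors three cores running two threads each.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sums `data` using `num_threads` workers, each on its own contiguous slice.
long long parallel_sum(const std::vector<int>& data, unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back([&data, &partial, t, begin, end] {
            // Each worker writes only its own slot, so no locking is needed.
            partial[t] = std::accumulate(
                data.begin() + static_cast<std::ptrdiff_t>(begin),
                data.begin() + static_cast<std::ptrdiff_t>(end), 0LL);
        });
    }
    for (auto& w : workers) {
        w.join();  // results are visible to this thread after join
    }
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Calling `parallel_sum(values, 6)` would run one worker per hardware thread on a three-core, two-threads-per-core chip. The same source compiles unchanged for a multicore PC, which is the portability point made above.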