DMA Performance Analysis

DMA performance can be divided into two metrics.  The system
throughput is the aggregate transfer performance of all DMA master
devices combined.  The device throughput is the DMA performance
available to a single master assuming no other DMA requests.  The
system throughput is, in general, limited by the maximum Rambus
bandwidth.  The device throughput may also be limited by DMA
arbitration and Rambus interface overhead.

In designing the DMA controller, much more weight was given to system
throughput than device throughput.  This is because in the general
case more than one device is requesting DMA services and there are no
data buffers in the DMA controller to allow for overlapped transfers.

The RCP DMA controller consists of two major parts, the arbiter and
the RI.  The arbiter selects the next DMA master and enables the
master to issue a DMA request to the RI on the Cbus under the control
of the RI.  The RI accepts DMA requests, creates and issues Rambus
request packets, and channels the read/write data over the Rambus.

The following steps take place during a DMA transfer:

1) A DMA master issues a DMA request to the arbiter on the dma_request
   line.

2) When the RI signals that it is ready to accept a new DMA request,
   the arbiter enables the DMA master on the Cbus and a grant is given
   to the DMA master on the dma_grant line.

3) The DMA master places the Rambus address on the Cbus followed by
   the DMA transfer type.

4) The RI receives the DMA address and transfer type and creates one
   or more Rambus request packets.  At the appropriate time, the RI
   enables the appropriate device on the Dbus and generates the start
   and last signals to indicate when the master should begin transferring 
   data.
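
The four steps above can be sketched as an ordered list of signal
events.  This is only an illustration of the ordering, not a model of
the hardware: dma_request, dma_grant, start and last are the signal
names used in the text, while "ri_ready", "cbus_address", "cbus_type",
"rambus_request_packet" and "dbus_enable" are hypothetical labels
chosen for this sketch.  A single request packet is shown, although
the RI may create more than one.

```python
def dma_handshake(master, address, transfer_type):
    """Return the ordered events for one DMA transfer (steps 1 to 4).

    Each event is a tuple (source, signal, payload).  Event names other
    than dma_request, dma_grant, start and last are hypothetical.
    """
    return [
        (master, "dma_request", None),             # step 1: master asks the arbiter
        ("RI", "ri_ready", None),                  # step 2: RI can accept a request
        ("arbiter", "dma_grant", master),          # step 2: arbiter enables master on Cbus
        (master, "cbus_address", address),         # step 3: Rambus address on the Cbus
        (master, "cbus_type", transfer_type),      # step 3: followed by transfer type
        ("RI", "rambus_request_packet", address),  # step 4: one packet shown, may be more
        ("RI", "dbus_enable", master),             # step 4: device enabled on the Dbus
        ("RI", "start", master),                   # step 4: master begins moving data
        ("RI", "last", master),                    # step 4: marks the final data transfer
    ]


if __name__ == "__main__":
    for source, signal, payload in dma_handshake("VI", 0x100000, "read"):
        print(source, signal, payload)
```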

Assuming no other DMA or Cbus traffic, the delay from a request to the
first data transfer is:

page-mode read  : 12 clocks
page-mode write : 6 clocks

Approximately 4 clocks of each figure are due to arbitration and
request; the rest is packet and access overhead.  Non-page-mode
read/write operations take an additional 5 clocks if the previous
page was clean (not written to) or 7 clocks if the previous page was
dirty.
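
As a worked example, the latency figures above can be combined in a
small helper.  The function and parameter names are hypothetical; the
clock counts are the ones quoted in the text and assume no other DMA
or Cbus traffic.

```python
# Request-to-first-data delay in clocks, taken from the figures above.
PAGE_MODE_CLOCKS = {"read": 12, "write": 6}

# Extra clocks for a non-page-mode access, keyed by the state of the
# previously open page.
NON_PAGE_PENALTY = {"clean": 5, "dirty": 7}


def first_data_latency(op, page_hit=True, prev_page="clean"):
    """Return clocks from DMA request to first data on an idle system.

    op        -- "read" or "write"
    page_hit  -- True for a page-mode access
    prev_page -- "clean" or "dirty"; only used when page_hit is False
    """
    clocks = PAGE_MODE_CLOCKS[op]
    if not page_hit:
        clocks += NON_PAGE_PENALTY[prev_page]
    return clocks


if __name__ == "__main__":
    print(first_data_latency("read"))                                   # 12
    print(first_data_latency("read", page_hit=False))                   # 17
    print(first_data_latency("write", page_hit=False, prev_page="dirty"))  # 13
```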

The arbitration process is completely overlapped with the DMA data
transfer operation.  The only interaction is that the data transfer
for the current request must begin before arbitration for the next
device can start.  This means that for a sequence of sufficiently
long DMA requests (8 clocks or so each) the RDRAM will be fully
utilized.
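
The effect of overlapped arbitration can be illustrated with a very
simple schedule model.  It assumes that the setup for the next
transfer (arbitration plus request overhead) starts when the current
data transfer starts, and that the next data transfer begins once
both the setup has finished and the current data transfer has ended.
The setup_clocks default of 8 is an assumption taken from the "8
clocks or so" figure above, not a measured value, and the function
name is hypothetical.

```python
def pipelined_utilization(burst_clocks, setup_clocks=8):
    """Steady-state fraction of clocks that carry data.

    burst_clocks -- non-empty list of data-transfer lengths in clocks
    setup_clocks -- ASSUMED clocks needed between the start of one data
                    transfer and the earliest start of the next
    """
    if not burst_clocks:
        raise ValueError("burst_clocks must not be empty")
    data_start = 0   # clock at which the current data transfer begins
    busy = 0         # total clocks spent moving data
    for burst in burst_clocks:
        busy += burst
        data_end = data_start + burst
        # The next transfer waits for the bus to free up and for its
        # own setup, which began when the current transfer began.
        data_start = max(data_end, data_start + setup_clocks)
    return busy / data_start


if __name__ == "__main__":
    print(pipelined_utilization([8, 8, 8, 8]))   # long enough to hide setup
    print(pipelined_utilization([4, 4, 4, 4]))   # too short to hide setup
```

Under these assumptions, bursts of at least setup_clocks keep the
RDRAM busy on every clock, while shorter bursts leave idle clocks
between transfers.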

When considering the entire system, it is generally the case that
several devices are simultaneously requesting relatively long bursts
and therefore the system throughput will be mainly limited by the
RDRAM performance.  The device throughput, however, varies depending
on the capability of the DMA master.  To obtain maximum DMA
performance, the master must both request long transfers and overlap
the next request with the previous data transfer.  Currently, the only
devices which do this are the RSP and VI DMA masters.  All other
devices wait until the previous transfer is complete before requesting
the next transfer, reducing the maximum device throughput by 50% or
more.  On the upside, devices which do not overlap requests with
transfers are much less affected by heavy system DMA loading since
they are greatly under-utilizing the RDRAM bandwidth.
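
The cost of not overlapping requests can be estimated with a rough
model.  It assumes that a non-overlapping master pays the full
request-to-first-data delay before every transfer, so the fraction of
time it spends moving data is data / (data + latency).  The function
name is hypothetical and the model ignores contention from other
masters.

```python
def non_overlapped_efficiency(data_clocks, latency_clocks=12):
    """Fraction of clocks carrying data for a master that waits for
    each transfer to finish before issuing the next request.

    data_clocks    -- length of each data transfer in clocks
    latency_clocks -- request-to-first-data delay; the default of 12 is
                      the page-mode read figure quoted earlier
    """
    return data_clocks / (data_clocks + latency_clocks)


if __name__ == "__main__":
    # A 12-clock page-mode read burst spends half its time waiting.
    print(non_overlapped_efficiency(12))
    # Shorter bursts lose more than half of the available throughput.
    print(non_overlapped_efficiency(8))
    # Page-mode writes have a 6-clock delay.
    print(non_overlapped_efficiency(6, latency_clocks=6))
```

In this model a transfer no longer than the request delay yields an
efficiency of 50% or less, which is consistent with the reduction
described above.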