<html>

<head>
        <title>BB Module Flash Requirements</title>
</head>

<body>
<h1>
<p align="center">
BB Module Flash Requirements
</p>
</h1>

This document summarizes a key set of flash specifications for the memory module.
It is intended to allow sanity checking of potential flash parts. Once a part
is deemed acceptable against the specs within this document, it should still be
fully examined to ensure usability within our system.
<p>

<h2>Key Specifications</h2>

<ul>
  <li> <b>Electrical Signaling:</b> the signaling protocol must be compatible
       with SmartMedia (SM) nand flash, with the exception of the LVD (low
       voltage detect) SM signal. Key timing parameters are noted in the
       specifications that follow.<p>
  <li> <b>Operating Voltage, Current:</b> nominal 3.3V supply voltage, max 200mA
        supply current per device. <p>
  <li> <b>Page Size:</b> 528B page size (512B data + 16B OOB). <p>
  <li> <b>Block Wear Lifecycle:</b> 1E5 erase-program cycles for block 
       lifetime. <p>
  <li> <b>Bad Blocks:</b> max 124 bad blocks per 64MB over the specified
       "Block Wear Lifecycle". <p>
  <li> <b>Block Size:</b> 16KB or 32KB preferable, any power-of-two times 
       8KB probably doable. <p>
  <li> <b>Bad Block Marking:</b> bad blocks must be marked such that every
       page within a bad block has byte 517 containing at least 2 zero bits
       (SM physical format specification). The flash memory must either be
       shipped with this marking, or we must be capable of performing this
       marking at initial burn-in of the memory module.
       <p>
       Because the inability to perform this marking leads to a production
       reject for that memory module unit, we specify that the memory must be
       such that there is at most a 1E-5 probability that this marking cannot
       successfully be completed. <p>
  <li> <b>Read and Write Timing:</b> The primary goal is to be able to
       sustain a worst case 6MB/s "cartridge" DMA, and a typical writing speed
       near 1MB/s. The following nand specs provide a rule of thumb indicating
       this is possible:
       <ul>
         <li> 40us max page access on read (t<sub>R</sub>)
         <li> 50ns (at most) for minimum period in repeated page buffer access
              (t<sub>RC</sub>, t<sub>WC</sub>)
         <li> 0ns for setup times measured from CLE/ALE assertion (to WE or RE)
         <li> 1ms max page program (typical &lt;= 500us)
         <li> 10ms max block erase time per 16KB (typical &lt;= 2ms)
       </ul>
       <p>
  <li> <b>Permanent Errors (Bad Blocks):</b> these errors result from either a
       failed erase or program operation, and are specified in terms of a number
       of erase-program cycles.  This number is incorporated into the "Bad
       Blocks" number above, so it is not individually specified here. <p>
  <li> <b>Soft Errors Impacting Reads:</b> these errors lead to system
       failures when they cause 2 bit errors in the same ECC region of
       256B (2 such regions per page). There are two failure modes of
       primary interest. In the first, the flash memory holding the SK or FA
       fails. The main trouble with these failures is that both the
       player and memory module must be returned to the depot for a fix.
       Also, it is likely that the user will be confused because the UI
       will not be capable of displaying any clear indication of failure.
       This mode will be referred to as severe.
       In the second mode, game or license data is impacted. This is less
       severe since a trip to the depot with only the memory card can
       fix the problem, and the UI can clearly indicate this to the user.
       This mode will be referred to as mild.
       <p>
       The two types of soft errors that can cause these failures are briefly
       described below. To determine how to compute the various probabilities
       please refer to the more detailed analysis in the "System Impact..."
       section to follow.
       <ul>
         <li> <i>Data Retention.</i> In this case single bits may flip at
              random after some duration of time. We specify that, over XXX
              days, this form of error must occur with probability less than
              1/100000 in the severe failure mode. For the mild failure mode
              the requirement is relaxed to a probability of less than 1/10000
              (in addition, the number of days may be relaxed).
              <p>
              <i>NOTE: this error mode impacts shelf-life. It is currently
              difficult to obtain flash reliability data that covers a long
              enough period from manufacture, to shelf, to sale, through
              product life. </i>

         <li> <i>Read Disturbance.</i> In this case a bit within the same
              block as another bit being read may be unintentionally
              programmed from 1 to 0. For the severe failure mode we
              specify that for a minimum of 100000 reads the probability
              of failure should be less than 1/100000. In the mild failure
              mode we specify that for a minimum of 1000000 reads the
              probability of failure is less than 1/10000.
         
       </ul>

</ul>


<h2>System Impact of Specifications</h2>

The specifications above that require more detailed analysis to rationalize
are treated below.

<h3>Electrical Signaling</h3> 
  The signal protocol has been specified to be compatible with standard nand
  flash. However, we will still need to conduct a detailed timing analysis to
  determine acceptability based on potential timing settings of the flash
  controller's configuration register.

<h3>Page Size</h3>
  The page size must be 512B data + 16B OOB because the hardware ECC engine
  requires this layout.

<h3>Block Wear Lifecycle</h3> 
  Although the blocks containing the game binaries should not undergo
  significant re-writing, these blocks are preferably unavailable for
  wear-leveling because their placement remains relatively fixed due to the ATB
  entry address-size constraints (i.e., once a game has been layed out to
  satisfy ATB constraints we would prefer not to relocate it). Then, the pool
  of blocks available for relocation could be relatively small. For this
  specification we assume the worst case scenario that all blocks must be
  written when a game is played.  Now, if a game is played 10x per day,
  everyday for 3 years, this leads to approximately 11000 required writes for
  state saving alone. The wear number must be greater than 10E4, so to maintain
  and order of magnitude safety margin 1E5 has been chosen.
  <p>
  This number could probably be relaxed if need be.
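  <p>
  As a quick sanity check of the arithmetic above, a minimal sketch (Python,
  using the worst-case assumption stated above that every block is rewritten
  once per play):
  <pre>
# Rough block-wear estimate: assumes every block is rewritten once per play
# (the stated worst case for state saving).
plays_per_day = 10
years = 3
required_writes = plays_per_day * 365 * years   # ~11000 erase-program cycles
spec_cycles = 1e5                               # specified block lifetime
print(required_writes)                          # 10950
print(spec_cycles / required_writes)            # ~9x safety margin
  </pre>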

<h3>Bad Blocks</h3>
  Bad blocks impact our ability to present a contiguous cartridge address-space
  mapping to game code. The 
  <a href=../hw/pi-spec.html>PI</a>
  ATB mappings are used to avoid bad blocks, and
  there are approximately 190 usable ATB entries for a given game (assuming the
  game uses one cartridge address space to access the cartridge). But, the
  requirement that a given ATB entry must have its starting virtual address be
  a multiple of the contiguous physical size mapped, and that the physical size
  must be a power-of-two multiple of 16KB, further complicates matters.
  <p>
  To derive a worst-case number, take the largest game we would like to map and
  assume bad blocks are distributed evenly in flash so that we would need a
  single ATB for each contiguous region. For example, assuming the maximum game
  size is 32MB, then with 128 ATB entries, each of size 256KB, the game could
  be mapped. This leaves some margin in the total available ATB entries for
  file-system fragmentation and other issues. Further, assume a 32MB game would
  be placed on a minimum 64MB memory. To determine how many bad blocks we could
  afford and still be able to map the game with 128 ATB entries, divide the
  device size of 64MB by the segments in flash that contain the 256KB mappable
  region plus the bad block (for a 64MB part the block size is assumed to be
  16KB):
  <blockquote>
    64MB / (256KB + 16KB) = 240
  </blockquote>
  The worst case comes as the game size increases. From the game data, a 64MB
  game is the maximum previously released. Now, assuming we would support this
  game size with a flash module containing 96MB (64MB and 32MB parts), the
  computation becomes:
  <blockquote>
    96MB / (512KB + 16KB) = 186
  </blockquote>
  This yields a rate of 186 bad blocks per 96MB, consistent with the 124 bad
  blocks per 64MB specified above.
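  <p>
  The segment arithmetic above is simple enough to capture in a short sketch
  (Python; sizes in KB, with 16KB blocks assumed as in the text):
  <pre>
# Bad blocks tolerable when each bad block "costs" one ATB segment
# (the mapped region plus the bad block itself); sizes in KB.
def bad_block_budget(device_kb, map_kb, block_kb=16):
    return device_kb // (map_kb + block_kb)

print(bad_block_budget(64 * 1024, 256))   # 32MB game on a 64MB part  -> 240
print(bad_block_budget(96 * 1024, 512))   # 64MB game on a 96MB module -> 186
  </pre>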

<h3>Block Size</h3>
  Block sizes of 16KB and 32KB are usable depending on the "Bad Block Marking"
  capabilities, discussed below. The ATB requires block sizes that are
  power-of-two multiples of 16KB, and uses 16KB as the minimum granularity when
  mapping game cartridge space addresses to flash. So, a 64KB block would
  actually be legal as well. On the smaller end, an 8KB block is usable, but
  it seems unlikely a part of interest to us would use this block size.
  <p>
  <i>Are there file-system issues constraining the block size as well?</i>


<h3>Bad Block Marking</h3>
  The boot code depends on being able to determine if a 16KB "block" is bad by
  reading byte 517 of the first page within the block. If this byte contains 2
  or more 0 bits, the "block" is deemed to be bad (consistent with the
  SmartMedia spec.). Assuming that an arbitrary page within the block may be
  marked in this manner, a true 32KB block is acceptable since a single 32KB
  bad block can have pages 0 and 16 marked bad. The boot code will simply treat
  this case as two consecutive bad blocks.
  <p>
  This topic deserves more detailed coverage because of some inconsistency in
  information pertaining to the marking of bad blocks when nand is shipped. For
  Toshiba, the data sheet specifies that a bad block is determined by reading
  every bit within the block: if a single bit is '0' the block is bad. The
  Toshiba application notes, however, indicate the first page of a bad block
  will be marked per the SmartMedia spec (byte 517, as we assume in the boot
  code). Because of this confusion we asked Toshiba, and they indicated that
  they actually mark byte 517 for every page in a bad block. This last case is
  the most desirable for us, but it is not imperative: we also asked whether we
  can mark byte 517 of any page in a bad block ourselves (although a program
  operation may fail overall, turning at least 2 bits of byte 517 to 0
  <i>should</i> work), and Toshiba indicated that we can. So, we could use our
  own burn-in to ensure the boot code reacts appropriately to a 32KB physical
  block size.
  <p>
  Samsung's data sheets specify that the first OR second page within the block
  will be marked. We do not know if this implies that we cannot guarantee that
  at least 2 bits in byte 517 can be programmed with 0s. If so, we would need to
  change the boot code accordingly, but this seems unlikely for the reasons below.
  <p>
  A block must be marked as bad when an erase or program operation fails (as
  indicated by the status read that follows). Given that a block is bad, the
  reason is most likely that an erase failed to flip 0 bits to 1. So, the
  task of programming at least two 0 bits into byte 517 of any given page is
  not hindered by this failure. There may be more catastrophic failure modes
  that could preclude marking byte 517 with two 0 bits, but these are considered
  far less likely. To be safe, we specify this probability (the 1E-5 bound given
  in the "Bad Block Marking" key specification above).
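  <p>
  A minimal sketch of the boot-time check described above (Python; the
  page-read routine is hypothetical and stands in for whatever access the
  boot code actually has):
  <pre>
# A 16KB "block" is deemed bad when byte 517 of its first page contains
# two or more 0 bits (SmartMedia-style marking, as assumed by the boot code).
def block_is_bad(read_page, first_page_of_block):
    # read_page() is a hypothetical routine returning the 528-byte page
    # (512B data + 16B OOB); byte 517 lies within the OOB area.
    marker = read_page(first_page_of_block)[517]
    zero_bits = 8 - bin(marker).count("1")
    return zero_bits >= 2
  </pre>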


<h3>Read Timing</h3>
  As mentioned previously, the true target for these specifications is to
  enable a sustained "cartridge" DMA of 6MB/s. A rule-of-thumb was provided
  that is a reasonable guide in determining if a part will achieve the 6MB/s
  rate. However, the interaction of many flash timing parameters and the
  <a href=../hw/pi-spec.html>PI</a>
  timing configuration need to be considered to truly determine the DMA speed.
  <p>
  A more detailed discussion of this process and application to a number of
  current flash parts is provided <a href=nand_dma_speeds.html>here</a>.
  <p>
  Note that our initial latency will be considerably slower than the n64, but
  we have decided this is OK given we can sustain the rate above.
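  <p>
  As a rough sanity check (ignoring command/address overhead and the PI
  configuration details that the linked analysis covers), the rule-of-thumb
  numbers from "Key Specifications" already put the raw page read rate above
  6MB/s:
  <pre>
# Rough sustained read rate from the rule-of-thumb figures above
# (ignores command/address cycles and PI configuration overhead).
t_R  = 40e-6          # max page access time (seconds)
t_RC = 50e-9          # page-buffer read cycle time (seconds)
page_bytes = 528      # bytes clocked out per page
data_bytes = 512      # data bytes per page (excludes OOB)

page_time = t_R + page_bytes * t_RC       # ~66.4us per page
rate = data_bytes / page_time / 2**20     # ~7.4 MB/s sustained
print(rate)
  </pre>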

<h3>Write Timing</h3>
  Like the "Read Timing" case, the actual speed will depend on the optimumun
  allowable setting for the 
  <a href=../hw/pi-spec.html>PI</a>
  flash timing configuration, as determined
  from the set of detailed flash specs. For most cases the worst case is
  64ns per flash cycle, where the flash cycle is used to determine the time
  for accessing individual data bytes, or emitting command/address cycles.
  Then, the worst case numbers from the "Key Specifications" section,
  <ul>
    <li> 10ms block erase per 16KB
    <li> 64ns byte write period (to flash page buffer)
    <li> 64ns address and command write cycle
    <li> 1ms page program 
  </ul>
  yield (per 16KB data bytes): 32*1ms + 32*(528+5)*64ns + 10ms = 
  43.1ms/16KB, <br>
  or 0.36MB/s (the individual byte accesses are nearly negligible, so
  even if the flash configuration turned out to be slower this would
  not have much impact).
  <p>
  Using typical numbers:<br>
  <ul>
    <li> 2ms block erase per 16KB
    <li> 64ns byte write period (to flash page buffer)
    <li> 64ns address write cycle
    <li> 500us page program 
  </ul>
  yields (per 16KB): 32*500us + 32*(528+4)*64ns + 2ms = 19.1ms/16KB, <br>
  or 0.82MB/s.
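  <p>
  Both throughput figures above can be reproduced with the following sketch
  (Python, using the same per-16KB accounting):
  <pre>
# Write throughput per 16KB (32 pages of 512B data), as computed above.
def write_rate_MBps(t_erase, t_prog, cycles_per_page, t_cycle=64e-9):
    pages = 32
    total = pages * t_prog + pages * cycles_per_page * t_cycle + t_erase
    return (16 * 1024) / total / 2**20

print(write_rate_MBps(10e-3, 1e-3, 528 + 5))     # worst case: ~0.36 MB/s
print(write_rate_MBps(2e-3, 500e-6, 528 + 4))    # typical:    ~0.82 MB/s
  </pre>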
  

<h3>Soft Errors Impacting Reads</h3>
  The description of failure modes was provided earlier. Here, the
  method of computing the error probability for each type of soft
  error (on reads) is provided, given the typical form of reliability test data.
  <ul>
    <li> <i>Data Retention:</i> we assume this error affects isolated bits (no
         predisposition for neighbors to be affected). Data retention is
         generally measured with parameters such as:
         <ul>
           <li> size of parts under test (i.e., 32MB)
           <li> sample size (number of devices in test, typically 50-100)
           <li> hours (72, ..., 1000)
           <li> pre-conditioning cycles
           <li> number of cumulative failed bits
         </ul>
         The "number of cumulative failed bits" is the measured parameter,
         while the other parameters specify the test environment. The formula
         for computing the probability of error is:
         <blockquote>
                P = 1 - (1-(1/N<sub>s</sub>))<sup>(N*(M-1))</sup>  
                        (1+(M-1)/N<sub>s</sub>)<sup>N</sup>
         </blockquote>
         where:
         <ul>
           <li> N<sub>s</sub>: number of 256B regions in sample (number of parts
             times size of each part, in bytes, divided by 256).
           <li> M: number of cumulative failed bits.
           <li> N: number of 256B regions in the expected amount of
                memory covered by the failure mode. For the severe mode the
                total affected memory is estimated at 1MB (SK + FA). For the
                mild failure mode this is approximately the entire capacity of
                the memory module (i.e., so if 64MB parts are tested, and the
                module will consist of a single 64MB part, N is N<sub>s</sub>
                divided by the sample size).
         </ul>
         An upper bound that is easier to compute, less prone to round-off
         error in the computation, and appropriate given our expectation of
         typical parameters is:
         <blockquote>
                P &lt (1/2) (M/N<sub>s</sub>)<sup>2</sup> 
                            ( (N<sub>s</sub>-M+2)/(N<sub>s</sub>/N) )
         </blockquote>
         <p>
         As an example, the probability of error anywhere in the
         device (i.e., 2-bit or more errors in the same 256B data unit) is
         computed given reliability data from Toshiba for a 64MB
         TC58512FT nand flash (this computation and the read disturbance
         example below are reproduced in a short sketch following this list).
         Toshiba's data had:
         <ul>
           <li> 73 samples
           <li> 64MB per sample
           <li> 1.2e5 write-erase cycle pre-condition
           <li> 1000 hrs (42 days)
           <li> 7 cumulative failed bits
         </ul>
         This leads to P = 1.5E-8, which is quite acceptable. The only
         issue is that we need a product lifecycle of much greater than
         42 days!
         <p>
         Using the upper bound approximation results in 1.8E-8, which is
         reasonably close.
         <p>
         Similar test data from Samsung results in lower probabilities for
         failure.
           
    <li> <i>Read Disturbance:</i> this error effectively reprograms some bit
         within the block (on a different page) from 1 to 0. It is generally
         specified with parameters such as:
         <ul>
           <li> size of parts under test (i.e., 32MB)
           <li> sample size (number of devices in test, typically 50-100)
           <li> target number of cumulative failed bits (i.e., reads will
                occur over the sample until this number is reached)
           <li> pre-conditioning cycles
           <li> number of read cycles till the number of cumulative failed
                bits, specified earlier, are reached.
         </ul>
         The last parameter, "number of read cycles...", is the measured
         parameter, while the remainder specify the test environment.
         <p>
         In this case, the probability of failure, P, is computed with the same
         parameters as for the "Data Retention" case (since games are not
         composed of all 1 bits this is actually an upper bound on the error).
         However, P now measures the validity of the test environment.
         Ideally, we would like to specify the experiment to produce our
         desired P. Then the resultant number of reads obtained from the
         test will determine if the part is reliable enough.
         <p>
         Also, for the severe error mode we are not interested in a high
         number of pre-conditioning cycles, since the SK and FA will not
         be written often. A higher number (typically 100000 cycles) is
         relevant for the mild error mode.
         <p>
         The 100000 reads number for severe errors was estimated as follows.
         The SK and FA are read every time the system resets or a game is
         restarted.  Assuming the combination of these events occurs 10x/day,
         every day for 5 years, we obtain 20000 reads. To have a reasonable
         buffer we specify 100000. The number is very difficult to estimate
         during game play, so a number 10x that for severe errors is
         chosen. This number for the mild case is somewhat flexible because
         we could occasionally re-write the game "in place", though we
         would avoid this if possible.
         <p>
         As an example, the probability of error anywhere in the
         device (i.e., 2-bit or more errors in the same 256B data unit) is
         computed given reliability data from Toshiba for a 64MB
         TC58512FT nand flash. Toshiba's data had:
         <ul>
           <li> 44 samples
           <li> 64MB per sample
           <li> 1.2e5 write-erase cycle pre-condition
           <li> 10 bit errors target
           <li> 3.1E6 reads to reach target (over entire sample) worst case
         </ul>
         This leads to P = 8.9E-8, which is quite acceptable. In this case
         this means that the experiment satisfies our requirements, since
         the computed P is &lt 1/100000. The number of reads, 3.1E6, is also
         within our specification.
         <p>
         Similar test data from Samsung results in lower probabilities for
         failure.

  </ul>
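<p>
The two example computations above can be reproduced with a short sketch
(Python; the formula and the Toshiba TC58512FT test parameters are taken
directly from the bullets above):
<pre>
# Probability of 2 or more bit errors in the same 256B ECC region,
# per the "Data Retention" formula above.
#   Ns: number of 256B regions in the test sample
#   M:  cumulative failed bits observed
#   N:  number of 256B regions covered by the failure mode
def p_double_error(Ns, M, N):
    return 1 - (1 - 1 / Ns) ** (N * (M - 1)) * (1 + (M - 1) / Ns) ** N

regions_per_64MB = 64 * 2**20 // 256      # 262144 regions per 64MB device

# Data retention example: 73 samples of 64MB, 7 cumulative failed bits,
# failure anywhere in a single device.
print(p_double_error(73 * regions_per_64MB, 7, regions_per_64MB))   # ~1.5E-8

# Read disturbance example: 44 samples of 64MB, 10 bit errors target
# (same formula, as noted in the text).
print(p_double_error(44 * regions_per_64MB, 10, regions_per_64MB))  # ~8.9E-8
</pre>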

</body>
</html>