Network buffers, MPC860 and data caching

Till Straumann strauman at SLAC.Stanford.EDU
Mon Oct 30 19:34:52 UTC 2000


Eric Norum wrote:

> Till Straumann wrote:
> >
> > Eric Norum wrote:
> > > Network receive buffers are allocated by MCLGET, not by malloc.  An mbuf
> > > cluster is aligned on a cluster-sized (2 kbyte) boundary.  Network mbufs
> > > are aligned on an mbuf-sized (128 byte) boundary.  See bsd_init in
> > > src/libnetworking/rtems/rtems_glue.c.  Unless a cache line is larger
> > > than an mbuf there should be no problems.
> >
> > Sorry, I don't quite agree.
> >
> > The point is that the `databuf' starting address, i.e. where the network
> > interface chip writes data, must be cache line aligned. If I count the
> > fields in m_hdr and pkthdr, I get 7*4 bytes, i.e. the data area seems to
> > be 4-byte aligned. Assuming that nobody reads any field beyond mh_data
> > while the network interface `owns' the buffer, the first cache line could
> > be flushed before yielding the buffer to the network driver. But this will
> > only be enough on a machine with a cache line size <= 5*4 = 20 bytes,
> > e.g. the MPC860.
>
> ``What we have here is a failure to communicate.''
>
> We must be looking at different sources.  The
> c/src/lib/libcpu/powerpc/mpc8xx/console-generic/console-generic.c in the
> snapshot I've got here (rtems-ss-20000929) has calls to neither malloc
> nor free.  The CVS ID line is:
>
>   $Id: console-generic.c,v 1.5 2000/08/25 17:25:27 joel Exp $
>

OK, I looked at a different version (where the cache support was not yet
integrated). The current version uses a static buffer (which, as you point
out below, _must_ be cache block aligned).

>
> The receive and transmit buffer areas are:
> /*
>  *  I/O buffers and pointers to buffer descriptors.
>  *  Currently, single buffered input is done. This will work only
>  *  if the Rx interrupts are serviced quickly.
>  *
>  *  TODO: Add at least double buffering for safety.
>  */
> static volatile char rxBuf[NUM_PORTS][RXBUFSIZE];
> static volatile char txBuf[NUM_PORTS];
>
> I agree that there is a problem here since the rxBuf is not aligned on a
> cache boundary.  A cache line which overlapped the front or back of the
> rxBuf could overwrite the values placed in the rxBuf by the SDMA
> channel.

Likewise, data before/after rxBuf could get lost when the cache blocks
corresponding to rxBuf are invalidated, if rxBuf is not cache block aligned.

> The txBuf isn't a problem, since extra flushes would just
> write the same values into txBuf as were placed there by the
> rtems_cache_flush_multiple_data_lines call in m8xx_uart_write.  A fix
> would be to statically allocate RXBUFSIZE+CACHE_LINESIZE-1 bytes for
> each receive buffer, and then to use only the cache-aligned portion of
> the buffer.

yep.
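
For illustration, a minimal sketch of that fix, assuming placeholder values
for NUM_PORTS, RXBUFSIZE and the cache line size (the real driver defines
its own): over-allocate the static buffer and round the working pointer up
to the next cache-line boundary.

#include <stdint.h>

#define NUM_PORTS       4
#define RXBUFSIZE       16
#define CACHE_LINESIZE  16      /* MPC860 data cache line size */

/* over-allocate so an aligned region of RXBUFSIZE bytes always fits */
static volatile char rxBufRaw[NUM_PORTS][RXBUFSIZE + CACHE_LINESIZE - 1];
static volatile char *rxBuf[NUM_PORTS];

static void rxBufAlign(void)
{
  int i;

  for (i = 0; i < NUM_PORTS; i++) {
    uintptr_t a = (uintptr_t)&rxBufRaw[i][0];

    /* round up to the next cache-line boundary */
    a = (a + CACHE_LINESIZE - 1) & ~(uintptr_t)(CACHE_LINESIZE - 1);
    rxBuf[i] = (volatile char *)a;
  }
}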

>
> The network driver uses mbuf clusters for incoming packets.  These are
> aligned on 2k boundaries so there's no chance of a cache-line write
> scrambling a packet buffer.  Once a packet has been received, all the
> cache lines which refer to the data are marked invalid.
>
> Hmm...But there *is* a problem here, too.  It's not related to cache
> line boundaries, though.
> 1) CPU allocates an mbuf cluster and does some writes and reads to the
> cluster.  The cache is in writeback mode, so main memory is not updated.
> 2) CPU frees the mbuf cluster.
> 3) Driver read task allocates that mbuf cluster.
> 4) DMA engine starts reading into the mbuf cluster.
> 5) CPU decides it wants to reuse the cache line in question and flushes
> it to main memory -- kaboom -- this overwrites the value stored by the
> DMA engine!
>
> I think that this problem could be avoided by marking the cache lines
> associated with an mbuf cluster invalid *before* passing the mbuf cluster
> to the DMA engine, instead of after the DMA engine has finished filling
> the cluster as is now the case.

Of course - but that's easy to fix.
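
Concretely, the reordering would look something like this in the
receive-buffer setup path. This is only a sketch: the buffer-descriptor
type, the "empty" bit and the cache-manager prototype are placeholders,
error handling is omitted, and the alignment question comes up again below.

#include <sys/param.h>
#include <sys/mbuf.h>
#include <sys/socket.h>
#include <net/if.h>

/* RTEMS cache manager call discussed in this thread; the header that
 * declares it differs between RTEMS versions, so the prototype is
 * repeated here. */
extern void rtems_cache_invalidate_multiple_data_lines(const void *addr, size_t len);

/* placeholder for the SCC/FEC receive buffer descriptor */
typedef struct {
  volatile unsigned short status;
  volatile unsigned short length;
  volatile void          *buffer;
} rxBd_t;

#define BD_EMPTY 0x8000   /* placeholder for the real "empty" bit */

static struct mbuf *
add_rx_cluster(struct ifnet *ifp, rxBd_t *rxBd)
{
  struct mbuf *m;

  MGETHDR(m, M_WAIT, MT_DATA);
  MCLGET(m, M_WAIT);
  m->m_pkthdr.rcvif = ifp;

  /* Invalidate *before* the SDMA engine owns the cluster, so no stale
   * dirty line can later be written back on top of received data. */
  rtems_cache_invalidate_multiple_data_lines(mtod(m, void *), MCLBYTES);

  rxBd->buffer = mtod(m, void *);
  rxBd->status = BD_EMPTY;          /* hand the descriptor to the SDMA */
  return m;
}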

> The fix would be to call
> rtems_cache_invalidate_multiple_data_lines just after the MCLGET.  You
> wouldn't have to invalidate all 2kbytes, just the 1536 bytes needed to
> hold an ethernet frame.

That's the point where I do not agree (see my earlier message). If you call
rtems_cache_invalidate_multiple_data_lines on a non-aligned buffer, data
before/after the buffer held in the cache can be lost. Assume a variable of
type

struct {
     int blah;
     char buffer[1518];
}

to be 128-byte aligned (hence it is only safe to assume that `buffer' is
word-aligned). A value is then written to `blah' and ends up in the data
cache. If you subsequently invalidate all cache lines overlapping the
`buffer' field, you will also invalidate the line holding `blah', and an
old, stale value of `blah' will later be refetched from memory.
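
To make the overlap explicit, here is a tiny stand-alone illustration
(hypothetical struct name, 16-byte lines as on the MPC860): `blah' and the
start of `buffer' live in the same cache line, so invalidating every line
that overlaps `buffer' necessarily discards `blah' as well.

#include <stdio.h>
#include <stddef.h>

#define CACHE_LINE 16                   /* MPC860 data cache line size */

struct rx {
  int  blah;
  char buffer[1518];
};

int main(void)
{
  printf("blah   lives in cache line %u\n",
         (unsigned)(offsetof(struct rx, blah) / CACHE_LINE));
  printf("buffer starts in cache line %u\n",
         (unsigned)(offsetof(struct rx, buffer) / CACHE_LINE));
  /* both print 0: the first line overlapping `buffer' also holds `blah' */
  return 0;
}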

However, if you are sure that blah is never written while the network interface
`owns' the data structure, you could
  - first flush (write to memory and invalidate) the first cache line (the one that
      overlaps with `blah')
  - invalidate all other cache lines contained completely within `buffer'.
  - mark the buffer ready for the networking/DMA hardware.

If, however, a cache line is larger (in this example) than sizeof(blah), or
if you cannot guarantee that `blah' will not be written while the DMA
engine writes to `buffer', the following scenario could happen:

  - CPU writes `blah' again (after the flush described above). The rest of
      the now-dirty first cache line is filled with stale data from the
      start of `buffer'.

  - DMA engine fills the buffer memory.

  - CPU reads stale cached data for the start of `buffer' (it still sits in
      the dirty line holding `blah'), and that line will eventually be
      written back to memory, overwriting what the DMA engine stored there.

Therefore, IMHO, the only workarounds for mbufs are
 - using the M_EXT facility with a dedicated allocation/deallocation
     manager that provides cache block aligned buffer space, or
 - on the MPC860 (with a cache line size of 16 bytes), using the workaround
     described in my last message.
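
A sketch of that second workaround (flush the head line, invalidate the
rest), assuming the 16-byte MPC860 line size and the RTEMS cache manager
calls mentioned above; the function name is made up, and it relies on the
assumption that nothing in front of m_data is written while the hardware
owns the buffer.

#include <sys/param.h>
#include <sys/mbuf.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 16     /* MPC860 data cache line size */

/* RTEMS cache manager prototypes (header location varies by version) */
extern void rtems_cache_flush_multiple_data_lines(const void *addr, size_t len);
extern void rtems_cache_invalidate_multiple_data_lines(const void *addr, size_t len);

/* Prepare the data area of mbuf `m' (len bytes starting at m_data) for
 * the DMA engine without losing the header fields in front of m_data. */
static void
prepare_rx_mbuf_for_dma(struct mbuf *m, size_t len)
{
  uintptr_t start = (uintptr_t)mtod(m, char *);
  uintptr_t first = start & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
  uintptr_t next  = first + CACHE_LINE_SIZE;
  uintptr_t end   = (start + len) & ~(uintptr_t)(CACHE_LINE_SIZE - 1);

  /* 1) flush (write back and invalidate) the line overlapping the header
   *    fields, so their current values reach main memory */
  rtems_cache_flush_multiple_data_lines((void *)first, CACHE_LINE_SIZE);

  /* 2) invalidate the lines contained completely within the data area;
   *    the DMA engine is going to overwrite them anyway.  A partial tail
   *    line would need the same treatment as the head line. */
  if (end > next)
    rtems_cache_invalidate_multiple_data_lines((void *)next, end - next);

  /* 3) the buffer descriptor can now be handed to the SDMA engine */
}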

>
> For transmission, once the stack has passed an mbuf to the driver output
> routine, that mbuf is `owned' by the driver so the CPU won't be writing
> to the mbuf header after the driver does the
>       rtems_cache_flush_multiple_data_lines(txBd->buffer, txBd->length);
> All mbufs are aligned on a 128 byte boundary so activity on other mbufs
> can not cause a cache-line write to affect an mbuf which has been
> flushed to main memory.  Actually, it wouldn't make any difference even
> if activity on another mbuf followed by a cache line write *did* cause
> the mbuf in question to be written since the value written would be
> exactly the same as the value in main memory anyway. The DMA engine would
> still see the same value in main memory during the transmission.

I agree, tx buffers are not a problem.
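
For completeness, a sketch of that (unproblematic) transmit-side ordering,
with placeholder descriptor and bit names:

#include <stddef.h>

/* RTEMS cache manager prototype (header location varies by version) */
extern void rtems_cache_flush_multiple_data_lines(const void *addr, size_t len);

/* placeholder for the SCC/FEC transmit buffer descriptor */
typedef struct {
  volatile unsigned short status;
  volatile unsigned short length;
  volatile void          *buffer;
} txBd_t;

#define BD_READY 0x8000        /* placeholder for the real "ready" bit */

static void start_tx(txBd_t *txBd, const void *data, unsigned short len)
{
  txBd->buffer = (volatile void *)data;
  txBd->length = len;

  /* Alignment does not matter here: flushing a few extra header bytes
   * just rewrites values that are already in main memory. */
  rtems_cache_flush_multiple_data_lines(data, len);

  txBd->status = BD_READY;     /* SDMA engine now owns the buffer */
}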

>
> It makes no difference if the txBd->buffer is not on a cache line
> boundary.  Sure a few bytes of mbuf header get written to main memory
> along with the mbuf data, but who cares?  The important thing is that
> the bytes from buffer to buffer+length are forced out to main memory
> where the DMA engine can see them.
>
> Charles, can you fix the two problems mentioned above?  While you're at
> it, could you add support for the FEC so we can get 100baseT Ethernet?
> Joel, I wonder if other systems with caches have similar problems?
>
> >
> > Another idea: is it possible to use the M_EXT facility to provide an
> > aligned area (and a corresponding deallocation routine)?
> > Is it legal to change the flags and the extension record on an mbuf
> > obtained using MGETHDR()?
>
> Yes, you could use M_EXT but I don't see the need -- at least once the
> changes mentioned above have been made.
>

see above.

>
> > > BTW -- I see you're from SLAC.  Are you considering using EPICS with
> > > RTEMS?
> >
> > Indeed, that's what I'm envisioning. Are you going to Oak Ridge in
> > November? I hope to learn more at the collaboration meeting about the
> > efforts to untangle EPICS' OS dependencies, RTEMS, etc. It'd be nice
> > talking to you about these issues.
> >
>
> Unfortunately I can't make it to Oak Ridge.

What a pity. I hope to meet you on another occasion... Do you know of
somebody else involved with RTEMS who will be attending the meeting?

> Teaching commitments and
> starting a new job make the trip impossible.  An expense-paid trip to
> Palo Alto sometime to give a seminar or training course on EPICS/RTEMS
> would be ideal :-)

I'll have to talk to my bosses here about that...

> but I'll certainly give you what support I can by
> e-mail.  I'd really like to see more places using RTEMS with EPICS.
> We've got a dozen or so IOC's (68360's) running equipment here and
> things seem to be working just great.  It's really convenient having a
> <$500 (Canadian!) IOC.  You can use them everywhere.

Regards, Till.


Sorry for being pigheaded about this issue, but occasional data/cache
corruption will be very hard to track down, and IMHO this topic deserves
careful consideration.



