File system deadlock troubleshooting
Mathew Benson
mbenson at windhoverlabs.com
Tue Oct 8 14:30:44 UTC 2019
I don't have a test case that I can post or give away, yet. I may put one
together later.
Yes. All 4 tasks are blocked on that same condition variable wait. The
condition is a response from a block device. The test case is hammering the
RAM disk. I wrote some of the flash drivers, but as far as I know the RAM
disk was used as-is, so I haven't looked at that code. That condition wait
does not have a timeout, so it just pends forever.
Thanks. I'll try the RFS trace when I have time. Unfortunately, this is
late in the development cycle, so I don't have ready access to the article
under test, and loading it with non-production code takes time. I'm trying
to reproduce the failure in a non-production environment. For now, all I
have is shell commands and manually decoding dumped memory.
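
For when I do get time on the test article, this is the rough sketch I plan
to try, based on the header Chris pointed at. It assumes RFS trace is
compiled in and that the mask setter and shell hook in
`rtems/rfs/rtems-rfs-trace.h` behave as declared there; the helper name and
the "rfs-trace" shell command name are just placeholders I made up:

#include <rtems/rfs/rtems-rfs-trace.h>
#include <rtems/shell.h>

/*
 * Sketch only, not yet run on the article under test.  Enable the RFS
 * buffer release trace before the stress test starts hammering the RAM
 * disk, and hook the RFS trace command into the existing shell so the
 * mask can be widened at run time without rebuilding.
 */
static void enable_rfs_trace(void)
{
  /* Flag name taken from Chris's mail; assumes RFS trace is compiled in. */
  rtems_rfs_trace_set_mask(RTEMS_RFS_TRACE_BUFFER_RELEASE);

  /* "rfs-trace" is an arbitrary name for the hook into the shell. */
  rtems_shell_add_cmd("rfs-trace", "files",
                      "control RFS trace flags",
                      rtems_rfs_trace_shell_command);
}

That should at least show the bdbuf release path from the shell without a
rebuild for every flag change.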
Somebody more knowledgeable about the inner workings of the RTEMS kernel
expressed concern a while ago that heavy use of the RAM disk could
basically outrun the kernel worker thread. I noted it but didn't ask
about the mechanics of how or why that could happen. Are requests to the
worker threads held in a lossy queue? Is it possible the request is getting
dropped?
On Mon, Oct 7, 2019 at 11:30 PM Chris Johns <chrisj at rtems.org> wrote:
> On 8/10/19 12:53 pm, Mathew Benson wrote:
> > I'm using RTEMS 5 on a LEON3. I'm troubleshooting a failure condition
> > that occurs when stress-testing reads and writes to and from a RAM disk,
> > RAM disk to RAM disk. When the condition is tripped, it appears that I
> > have 4 tasks that are pending on conditions that just never happen.
>
> Do you have a test case?
>
> > The task command shows:
> >
> > ID       NAME SHED PRI STATE MODES  EVENTS WAITINFO
> > ------------------------------------------------------------------------------
> > 0a01000c TSKA UPD  135 MTX   P:T:nA NONE   RFS
> > 0a01001f TSKB UPD  135 CV    P:T:nA NONE   bdbuf access
> > 0a010020 TSKC UPD  150 MTX   P:T:nA NONE   RFS
> > 0a010032 TSKD UPD  245 MTX   P:T:nA NONE   RFS
>
> It looks like TSKA, TSKC and TSKD are waiting for the RFS lock and TSKB is
> blocked in a bdbuf access. I wonder why that is blocked?
>
> The RFS holds its lock over the bdbuf calls.
> >
> > None of my tasks appear to have failed. Nobody is pending on anything
> > noticeable except the 4 above. The condition wait is a single shared
> > resource, so any attempt to access the file system after this happens
> > results in yet another forever-pended task.
> >
> > Digging into the source code, it appears that the kernel is waiting for
> > a specific response from a block device but just didn't get what it's
> > expecting. The next thing is to determine which block device the kernel
> > is pending on, what the expected response is, and what the block device
> > actually did. Can anybody shed some light on this or recommend some
> > debugging steps? I'm trying to exhaust all I can do before I start
> > manually decoding machine code.
>
> The RFS has trace support you can access via `rtems/rfs/rtems-rfs-trace.h`.
> You can set the trace mask in your code, or you can call
> `rtems_rfs_trace_shell_command()` with suitable arguments or hook it to an
> existing shell. There is a buffer trace flag that shows the release calls
> to bdbuf:
>
> RTEMS_RFS_TRACE_BUFFER_RELEASE
>
> There is no trace call for get or read. Maybe add a get/read trace as well.
>
> The RAM disk also has trace support in the code, which can be enabled by
> editing the file.
>
> Chris
>
--
*Mathew Benson*
CEO | Chief Engineer
Windhover Labs, LLC
832-640-4018
www.windhoverlabs.com