File system deadlock troubleshooting
Mathew Benson
mbenson at windhoverlabs.com
Tue Oct 8 14:30:44 UTC 2019
I don't have a test case that I can post or give away, yet. I may put one
together later.
Yes. All 4 tasks are blocked on that same condition variable wait. The
condition is a response from a block device. The test case is hammering the
RAM disk. I wrote some of the flash drivers, but as far as I know the RAM
disk was used as-is, so I haven't looked at that code. That condition wait
does not have a timeout, so it just pends forever.
Thanks. I'll try the RFS trace when I have time. Unfortunately, this is
late in the development cycle, so I don't have ready access to the article
under test, and loading it with non-production code takes time. I'm trying
to reproduce the failure in a non-production environment. For now, all I
have is shell commands and manually decoding dumped memory.
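
For when I do get time on the test article, this is the rough sketch I plan
to try, based on the header Chris pointed at. It assumes RFS trace is
compiled in and that the mask setter and shell hook in
`rtems/rfs/rtems-rfs-trace.h` behave as declared there; the helper name and
the "rfs-trace" shell command name are just placeholders I made up:

#include <rtems/rfs/rtems-rfs-trace.h>
#include <rtems/shell.h>

/*
 * Sketch only, not yet run on the article under test.  Enable the RFS
 * buffer release trace before the stress test starts hammering the RAM
 * disk, and hook the RFS trace command into the existing shell so the
 * mask can be widened at run time without rebuilding.
 */
static void enable_rfs_trace(void)
{
  /* Flag name taken from Chris's mail; assumes RFS trace is compiled in. */
  rtems_rfs_trace_set_mask(RTEMS_RFS_TRACE_BUFFER_RELEASE);

  /* "rfs-trace" is an arbitrary name for the hook into the shell. */
  rtems_shell_add_cmd("rfs-trace", "files",
                      "control RFS trace flags",
                      rtems_rfs_trace_shell_command);
}

That should at least show the bdbuf release path from the shell without a
rebuild for every flag change.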
Somebody more knowledgeable about the inner workings of the RTEMS kernel
expressed concern a while ago that heavy use of the RAM disk could
basically outrun the kernel worker thread. I noted it but didn't ask
about the mechanics of how or why that could happen. Are requests to the
worker threads held in a lossy queue? Is it possible the request is getting
dropped?
On Mon, Oct 7, 2019 at 11:30 PM Chris Johns <chrisj at rtems.org> wrote:
> On 8/10/19 12:53 pm, Mathew Benson wrote:
> > I'm using RTEMS 5 on a LEON3. I'm troubleshooting a failure condition
> > that occurs when stress-testing reads and writes to and from a RAM disk,
> > RAM disk to RAM disk. When the condition is tripped, it appears that I
> > have 4 tasks that are pending on conditions that just never happen.
>
> Do you have a test case?
>
> > The task command shows:
> >
> > ID       NAME SHED PRI STATE MODES  EVENTS WAITINFO
> > ------------------------------------------------------------------------------
> > 0a01000c TSKA UPD  135 MTX   P:T:nA NONE   RFS
> > 0a01001f TSKB UPD  135 CV    P:T:nA NONE   bdbuf access
> > 0a010020 TSKC UPD  150 MTX   P:T:nA NONE   RFS
> > 0a010032 TSKD UPD  245 MTX   P:T:nA NONE   RFS
>
> It looks like TSKA, TSKC and TSKD are waiting for the RFS lock and TSKB is
> blocked in a bdbuf access. I wonder why that is blocked?
>
> The RFS holds its lock over the bdbuf calls.
> >
> > None of my tasks appear to have failed. Nobody is pending on anything
> > noticeable except the 4 above. The condition wait is a single shared
> > resource, so any attempt to access the file system after this happens
> > results in yet another forever-pended task.
> >
> > Digging into the source code, it appears that the kernel is waiting for
> > a specific response from a block device but just didn't get what it's
> > expecting. The next thing is to determine which block device the kernel
> > is pending on, what the expected response is, and what the block device
> > actually did. Can anybody shed some light on this or recommend some
> > debugging steps? I'm trying to exhaust all I can do before I start
> > manually decoding machine code.
>
> The RFS has trace support you can access via `rtems/rfs/rtems-rfs-trace.h`.
> You can set the trace mask in your code, or you can call
> `rtems_rfs_trace_shell_command()` with suitable arguments or hook it to an
> existing shell. There is a buffer trace flag that shows the release calls
> to bdbuf:
>
> RTEMS_RFS_TRACE_BUFFER_RELEASE
>
> There is no trace call for get or read. Maybe add a get/read trace as well.
>
> The RAM disk also has trace support in the code, which can be enabled by
> editing the file.
>
> Chris
>
--
*Mathew Benson*
CEO | Chief Engineer
Windhover Labs, LLC
832-640-4018
www.windhoverlabs.com