File system deadlock troubleshooting

Mathew Benson mbenson at windhoverlabs.com
Wed Oct 9 19:23:38 UTC 2019


I added the tracerfs command to the shell.  I'm not sure why that's not
already there.  I enabled all trace flags with "tracerfs set all".  It took a
while, but I did run into the problem.  The end of the log is below.  Can you
see what the kernel is trying to do and what it's waiting on from this?  It
will take me a while, reading through the kernel code, to understand exactly
what this means.
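
In case it helps anyone else, hooking the RFS trace control into the shell
only takes a few lines.  This is roughly what I added (a sketch, not my exact
code; the command name "tracerfs" and the usage string are mine):

    #include <rtems/shell.h>
    #include <rtems/rfs/rtems-rfs-trace.h>

    /* Register an RFS trace control command with the RTEMS shell.
       rtems_rfs_trace_shell_command() parses the set/clear and flag
       arguments (including "all") and updates the RFS trace mask. */
    static void add_tracerfs_command (void)
    {
      rtems_shell_add_cmd ("tracerfs", "rfs",
                           "tracerfs [set|clear] [flags... | all]",
                           rtems_rfs_trace_shell_command);
    }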

rtems-rfs: buffer-scan: count=58, block=1: 2252 2237 2223 2205 2188 2169
2153 2132 2119 2098 2079 2068 2049 2031 2015 2001 1984 1965 1955 1934 1915
1894 1873 1853 1838 1817 1796 1775 1761 1741 1724 1717 1696 1675 1658 1641
1622 1601 1581 1563 1544 1524 1508 1494 1476 1467 1448 1427 1406 1326 1292
1246 1236 1218 1199 1181 1162 1053 : not found
rtems-rfs: buffer-scan: count=4, block=1: 2267 1 : found block=1
rtems-rfs: buffer-request: block=1 bdbuf-read=1 refs=1
rtems-rfs: buffer-release: block=1 (dirty) refs=1
rtems-rfs: group-bitmap-alloc: block allocated: 2268
rtems-rfs: buffer-request: block=2268
rtems-rfs: buffer-scan: count=58, block=2268: 2252 2237 2223 2205 2188 2169
2153 2132 2119 2098 2079 2068 2049 2031 2015 2001 1984 1965 1955 1934 1915
1894 1873 1853 1838 1817 1796 1775 1761 1741 1724 1717 1696 1675 1658 1641
1622 1601 1581 1563 1544 1524 1508 1494 1476 1467 1448 1427 1406 1326 1292
1246 1236 1218 1199 1181 1162 1053 : not found
rtems-rfs: buffer-scan: count=4, block=2268: 1 2267 2266 2265 : not found
rtems-rfs: buffer-request: block=2268 bdbuf-get=2268 refs=1
rtems-rfs: buffer-release: block=2268 (dirty) refs=1
rtems-rfs: block-map-grow: entry: blocks=1 count=4
rtems-rfs: buffer-request: block=1
rtems-rfs: buffer-scan: count=58, block=1: 2252 2237 2223 2205 2188 2169
2153 2132 2119 2098 2079 2068 2049 2031 2015 2001 1984 1965 1955 1934 1915
1894 1873 1853 1838 1817 1796 1775 1761 1741 1724 1717 1696 1675 1658 1641
1622 1601 1581 1563 1544 1524 1508 1494 1476 1467 1448 1427 1406 1326 1292
1246 1236 1218 1199 1181 1162 1053 : not found
rtems-rfs: buffer-scan: count=5, block=1: 2268 1 : found block=1
rtems-rfs: buffer-request: block=1 bdbuf-read=1 refs=1
rtems-rfs: buffer-release: block=1 (dirty) refs=1
rtems-rfs: group-bitmap-alloc: block allocated: 2269
rtems-rfs: buffer-request: block=2269
rtems-rfs: buffer-scan: count=58, block=2269: 2252 2237 2223 2205 2188 2169
2153 2132 2119 2098 2079 2068 2049 2031 2015 2001 1984 1965 1955 1934 1915
1894 1873 1853 1838 1817 1796 1775 1761 1741 1724 1717 1696 1675 1658 1641
1622 1601 1581 1563 1544 1524 1508 1494 1476 1467 1448 1427 1406 1326 1292
1246 1236 1218 1199 1181 1162 1053 : not found
rtems-rfs: buffer-scan: count=5, block=2269: 1 2268 2267 2266 2265 : not
found
rtems-rfs: buffer-request: block=2269 bdbuf-get=2269 refs=1
rtems-rfs: buffer-release: block=2269 (dirty) refs=1
rtems-rfs: buffer-release: local cache overflow: 5
rtems-rfs: bdbuf-release: block=0 bdbuf=2265 (modified)
rtems-rfs: block-map-grow: entry: blocks=1 count=5
rtems-rfs: buffer-request: block=1
rtems-rfs: buffer-scan: count=58, block=1: 2252 2237 2223 2205 2188 2169
2153 2132 2119 2098 2079 2068 2049 2031 2015 2001 1984 1965 1955 1934 1915
1894 1873 1853 1838 1817 1796 1775 1761 1741 1724 1717 1696 1675 1658 1641
1622 1601 1581 1563 1544 1524 1508 1494 1476 1467 1448 1427 1406 1326 1292
1246 1236 1218 1199 1181 1162 1053 : not found
rtems-rfs: buffer-scan: count=5, block=1: 2269 1 : found block=1
rtems-rfs: buffer-request: block=1 bdbuf-read=1 refs=1
rtems-rfs: buffer-release: block=1 (dirty) refs=1
rtems-rfs: group-bitmap-alloc: block allocated: 2270
rtems-rfs: buffer-request: block=1
rtems-rfs: buffer-scan: count=58, block=1: 2252 2237 2223 2205 2188 2169
2153 2132 2119 2098 2079 2068 2049 2031 2015 2001 1984 1965 1955 1934 1915
1894 1873 1853 1838 1817 1796 1775 1761 1741 1724 1717 1696 1675 1658 1641
1622 1601 1581 1563 1544 1524 1508 1494 1476 1467 1448 1427 1406 1326 1292
1246 1236 1218 1199 1181 1162 1053 : not found
rtems-rfs: buffer-scan: count=5, block=1: 1 : found block=1
rtems-rfs: buffer-request: block=1 bdbuf-read=1 refs=1
rtems-rfs: buffer-release: block=1 (dirty) refs=1
rtems-rfs: group-bitmap-alloc: block allocated: 2271
rtems-rfs: buffer-request: block=2271
rtems-rfs: buffer-scan: count=58, block=2271: 2252 2237 2223 2205 2188 2169
2153 2132 2119 2098 2079 2068 2049 2031 2015 2001 1984 1965 1955 1934 1915
1894 1873 1853 1838 1817 1796 1775 1761 1741 1724 1717 1696 1675 1658 1641
1622 1601 1581 1563 1544 1524 1508 1494 1476 1467 1448 1427 1406 1326 1292
1246 1236 1218 1199 1181 1162 1053 : not found
rtems-rfs: buffer-scan: count=5, block=2271: 1 2269 2268 2267 2266 : not
found
rtems-rfs: buffer-request: block=2271 bdbuf-get=2271 refs=1
rtems-rfs: block-map-grow: upping: block-count=5
rtems-rfs: buffer-request: block=2270
rtems-rfs: buffer-scan: count=59, block=2270: 2271 2252 2237 2223 2205 2188
2169 2153 2132 2119 2098 2079 2068 2049 2031 2015 2001 1984 1965 1955 1934
1915 1894 1873 1853 1838 1817 1796 1775 1761 1741 1724 1717 1696 1675 1658
1641 1622 1601 1581 1563 1544 1524 1508 1494 1476 1467 1448 1427 1406 1326
1292 1246 1236 1218 1199 1181 1162 1053 : not found
rtems-rfs: buffer-scan: count=5, block=2270: 1 2269 2268 2267 2266 : not
found


All trace events immediately stop here.
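
If I'm reading this right, the last event is the buffer-scan miss for block
2270, so the next step inside the RFS buffer layer would be the
rtems_bdbuf_get()/rtems_bdbuf_read() call, and no trace ever fires after it.
That would match TSKB sitting in the "bdbuf access" condition wait below.
Since there is no trace point on the get/read path (as Chris notes below), I
may add one.  A rough sketch, where the insertion point just before the bdbuf
call in rtems_rfs_buffer_handle_request() (rtems-rfs-buffer.c) and the
variable names are my reading of the source, not confirmed:

    /* Hypothetical get/read trace point, placed immediately before the
       rtems_bdbuf_get()/rtems_bdbuf_read() call so the log shows we
       entered bdbuf even if the call never returns. */
    if (rtems_rfs_trace (RTEMS_RFS_TRACE_BUFFER_HANDLE_REQUEST))
      printf ("rtems-rfs: buffer-request: block=%" PRIu32 ": calling bdbuf-%s\n",
              block, read ? "read" : "get");

If that line prints and nothing follows, the hang is inside bdbuf rather than
in the RFS buffer handling.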

On Tue, Oct 8, 2019 at 9:30 AM Mathew Benson <mbenson at windhoverlabs.com>
wrote:

> I don't have a test case that I can post or give away yet.  I may put one
> together later.
>
> Yes.  All 4 tasks are in that same condition variable wait.  The condition
> is a response from a block device.  The test case is hammering the RAM
> disk.  I wrote some of the flash drivers, but as far as I know the RAM disk
> was used as-is, so I haven't looked at its code.  That condition wait does
> not have a timeout, so it just pends forever.
>
> Thanks.  I'll try the RFS trace when I have time.  Unfortunately, this is
> late in the development cycle, so I don't have ready access to the article
> under test.  Loading it with non-production code takes time.  I'm trying to
> reproduce it in a non-production environment.  For now, all I have is shell
> commands and manually decoding dumped memory.
>
> Somebody more knowledgeable on the inner workings of the RTEMS kernel
> expressed concern a while ago that heavy use of the RAM disk could
> basically outrun the kernel worker thread.  I noted it but didn't ask
> about the mechanics of how or why that could happen.  Are requests to the
> worker threads in a lossy queue?  Is it possible the request is getting
> dropped?
>
> On Mon, Oct 7, 2019 at 11:30 PM Chris Johns <chrisj at rtems.org> wrote:
>
>> On 8/10/19 12:53 pm, Mathew Benson wrote:
>> > I'm using RTEMS 5 on a LEON3.  I'm troubleshooting a failure condition
>> > that occurs when stress-testing reads and writes to and from a RAM disk
>> > (RAM disk to RAM disk).  When the condition is tripped, it appears that
>> > I have 4 tasks pending on conditions that just never happen.
>>
>> Do you have a test case?
>>
>> > The task command shows:
>> >
>> > ID       NAME                 SHED PRI STATE  MODES    EVENTS WAITINFO
>> > ------------------------------------------------------------------------
>> > 0a01000c TSKA                 UPD  135 MTX    P:T:nA   NONE   RFS
>> > 0a01001f TSKB                 UPD  135 CV     P:T:nA   NONE   bdbuf access
>> > 0a010020 TSKC                 UPD  150 MTX    P:T:nA   NONE   RFS
>> > 0a010032 TSKD                 UPD  245 MTX    P:T:nA   NONE   RFS
>>
>> It looks like TSKA, TSKC, and TSKD are waiting for the RFS lock, and TSKB
>> is blocked in a bdbuf access. I wonder why that is blocked.
>>
>> The RFS holds its lock over the bdbuf calls.
>> >
>> > None of my tasks appear to have failed, and nothing is pending on
>> > anything noticeable except the 4 above.  The condition wait is on a
>> > single shared resource, so any attempt to access the file system after
>> > this happens results in yet another forever-pended task.
>> >
>> > Digging into the source code, it appears that the kernel is waiting for
>> > a specific response from a block device but just didn't get what it's
>> > expecting.  The next thing is to determine which block device the kernel
>> > is pending on, what the expected response is, and what the block device
>> > actually did.  Can anybody shed some light on this or recommend some
>> > debugging steps?  I'm trying to exhaust all I can do before I start
>> > manually decoding machine code.
>>
>> The RFS has trace support you can access via
>> `rtems/rfs/rtems-rfs-trace.h`.  You can set the trace mask in your code,
>> or you can call `rtems_rfs_trace_shell_command()` with suitable arguments
>> or hook it into an existing shell.  There is a buffer trace flag that
>> shows the release calls to bdbuf ...
>>
>>  RTEMS_RFS_TRACE_BUFFER_RELEASE
>>
>> There is no trace call to get or read. Maybe add a get/read trace as well.
>>
>> The RAM disk also has tracing in the code, which can be enabled by
>> editing the file.
>>
>> Chris
>>
>

-- 
Mathew Benson
CEO | Chief Engineer
Windhover Labs, LLC
832-640-4018
www.windhoverlabs.com