MVME 2307 BSP Exceptions

Matt Rippa mrippa at gemini.edu
Tue Jul 4 04:40:04 UTC 2017


Hi Till,

Thanks for your advice. According to the EPICS release notes, there were
many fixes between EPICS 3.14.12.4 and 3.14.12.6:
http://www.aps.anl.gov/epics/base/R3-14/12-docs/RELEASE_NOTES.html
So my first step was to upgrade and rebuild all my support code. That is
working now, but I won't get back to the telescope for a few days.

I took a look at your electric fence. It looks very useful, and I have to
say I learned a few things about the linker that I never knew.
I built the library and linked it as instructed. My first attempts today
caused a segfault immediately on iocInit(). I've attached the relevant files.
I need more time to look at this.

Cheers,
-Matt

On Mon, Jul 3, 2017 at 9:48 AM, Till Straumann <strauman at slac.stanford.edu>
wrote:

> Your approach (associate addresses in stack trace with source code) is
> what I usually do, too.
> Also disassemble the fault location and get hints from the register values.
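The address-to-source step described above can be scripted. The following is a minimal sketch, not the procedure actually used in this thread: it pulls the addresses out of the Event #1 trace lines and assembles an addr2line command line. The tool name `powerpc-rtems4.10-addr2line` and the image name `crcs.obj` are assumptions; substitute whatever your toolchain and build produce.

```python
import re

# Trace lines copied from Event #1 in this thread.
TRACE = """\
IP: 0x011B010C, LR: 0x0006371C
--^ 0x0006371C--^ 0x00082FDC--^ 0x000E7094--^ 0x00136048--^ 0x00135F6C
"""


def trace_addresses(text):
    """Return the unique 8-digit hex addresses in text, in first-seen order."""
    seen = []
    for match in re.findall(r"0x[0-9A-Fa-f]{8}", text):
        addr = match.lower()
        if addr not in seen:
            seen.append(addr)
    return seen


def addr2line_command(addresses, image="crcs.obj",
                      tool="powerpc-rtems4.10-addr2line"):
    """Build the argument list; image and tool are hypothetical names."""
    # -f prints the enclosing function name, -e selects the ELF image.
    return [tool, "-f", "-e", image] + list(addresses)


addresses = trace_addresses(TRACE)
print(" ".join(addr2line_command(addresses)))
```

The printed command can be pasted into a shell on the build host. For the disassembly hint, objdump's `-d` together with `--start-address` and `--stop-address` narrows the output to the neighbourhood of the fault location.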
>
> I assume between #2 and #3 there was also a power-cycle.
>
> It is interesting that the fault affects different threads and areas of
> code. This has a strong smell of memory/stack corruption, i.e., your
> culprit could be code that is totally unrelated to where the fault occurs.
>
> In the first case (program exception, e.g., due to an illegal instruction)
> you'd inspect the fault address and verify that it holds an illegal
> instruction. You'd then check whether the address itself is OK, i.e.,
> somewhere in the text segment where it would make sense. If it is, the text
> may have been overwritten. Otherwise, the problem occurred earlier, e.g., by
> jumping to a corrupted pointer value.
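As a cross-check on the "illegal instruction" reading, the "Saved MSR" value in the log can be decoded. This sketch assumes the logged value is SRR1 and uses the program-exception cause bits as I understand them from the classic 32-bit PowerPC programming environments manual; please verify the masks against the manual for your CPU before relying on them.

```python
# Cause bits assumed to be set in SRR1 for a program exception on
# classic 32-bit PowerPC cores (IBM bits 11..14, where bit 0 is the MSB).
PROGRAM_CAUSE_BITS = {
    0x00100000: "floating-point enabled exception",
    0x00080000: "illegal instruction",
    0x00040000: "privileged instruction",
    0x00020000: "trap instruction",
}

# IBM bit 15: when set, SRR0 holds the address of a later instruction
# rather than the address of the instruction that caused the exception.
SRR0_IS_LATER_INSTRUCTION = 0x00010000


def decode_program_srr1(srr1):
    """Return (list of cause names, True if SRR0 points at the faulting insn)."""
    causes = [name for mask, name in PROGRAM_CAUSE_BITS.items() if srr1 & mask]
    points_at_fault = not (srr1 & SRR0_IS_LATER_INSTRUCTION)
    return causes, points_at_fault


# "Saved MSR = 0008B032" from Event #1 (exception 7).
causes, points_at_fault = decode_program_srr1(0x0008B032)
print(causes, points_at_fault)
```

Under these assumptions, Event #1 decodes to an illegal instruction at the reported address, 0x011B010C. Whether that address lies in the text segment still has to be checked against the linker map.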
>
> If you are able to locate the corrupted memory, then sometimes its contents
> can give you a hint as to what wrote to it. Otherwise an 'electric fence'
> tool can be quite useful. This is a library which wraps malloc & friends so
> that every allocation starts or ends on a page boundary. It uses the MMU to
> protect the adjacent pages and thus causes an exception on any attempt to
> write outside of the allocated region (with standard MMUs it is only
> possible to protect either the beginning or the end of the region).
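To illustrate why only one end of a region can be fenced, here is a small sketch of the placement arithmetic. It is not Till's library: the 4 KiB page size, the single guard page, and the function names are assumptions, and a real tool also has to honour alignment requirements.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB MMU pages


def guarded_layout(size, protect="end", page_size=PAGE_SIZE):
    """Return (total_bytes, user_offset, guard_offset) for one allocation.

    The block consists of whole data pages plus one inaccessible guard page.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    data_bytes = -(-size // page_size) * page_size  # round up to whole pages
    total = data_bytes + page_size
    if protect == "end":
        # Guard page follows the data; the buffer is pushed up against it.
        return total, data_bytes - size, data_bytes
    if protect == "start":
        # Guard page precedes the data; the buffer starts right after it.
        return total, page_size, 0
    raise ValueError("protect must be 'start' or 'end'")


def is_caught(offset, layout, page_size=PAGE_SIZE):
    """True if an access at this block offset lands in the guard page."""
    _, _, guard_offset = layout
    return guard_offset <= offset < guard_offset + page_size


layout = guarded_layout(100, protect="end")
_, user_offset, _ = layout
print(layout)
print(is_caught(user_offset + 100, layout))  # one byte past the end
print(is_caught(user_offset - 1, layout))    # one byte before the start
```

With `protect="end"` a one-byte overrun hits the guard page, but a one-byte underrun lands in the unused front of the data page and goes unnoticed. With `protect="start"` the situation is reversed, so it can be worth running the test once in each mode.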
>
> HTH
> - Till
>
>
> On 06/26/2017 08:40 PM, Matt Rippa wrote:
>
> Hi Till, all:
>
> Last week we recommissioned one of our telescope systems using RTEMS and
> EPICS on the MVME2307 BSP. After nearly two days of uptime (~42 hours), we
> get various unrecoverable exceptions (shown below). This occurs suddenly
> when the system is "idle" and not being used for many hours.
>
> Any suggestions for tracking down this stack trace, finding the offending
> application code, or getting this to fail faster?
> We've produced an object dump available at the following link:
> https://github.com/rcardenes/crcs-info
>
> Has anyone else used the MVME2307 BSP?
>
> Thanks,
> -Matt
>
> System details:
> RTEMS: 4.10.2
> EPICS: 3.14.12.4
> Hardware: MVME-2700 (using mvme2307 alias)
> VME Boards:
>
>    1. Bancomm 635 Time Board
>    2. PMAC 1
>    3. Xycom-240
>
>
>
> *Event #1:*
>
>> Jun 23 15:19:43  E) PORT: crcs, MSG: Exception handler called for
>> exception 7 (0x7)
>> Jun 23 15:19:43  E) PORT: crcs, MSG: #011 Next PC or Address of fault =
>> 011B010C
>> Jun 23 15:19:43  E) PORT: crcs, MSG: #011 Saved MSR = 0008B032
>> Jun 23 15:19:43  E) PORT: crcs, MSG: #011 Context: Task ID 0x0A01004B
>> Jun 23 15:19:43  E) PORT: crcs, MSG: #011 R0  = 01010101 R1  = 00AF5A40
>> R2  = 00000000 R3  = 007F8014
>> Jun 23 15:19:43  E) PORT: crcs, MSG: #011 R4  = 00000014 R5  = 007AD148
>> R6  = 00AF62BC R7  = 00000000
>> Jun 23 15:19:43  E) PORT: crcs, MSG: #011 R8  = 000031A8 R9  = 011B0101
>> R10 = 00AF62B8 R11 = 007AD148
>> Jun 23 15:19:44  E) PORT: crcs, MSG: #011 R12 = 22000022 R13 = 001E4B90
>> R14 = AEFE7BBF R15 = 77FF7FBB
>> Jun 23 15:19:44  E) PORT: crcs, MSG: #011 R16 = AA7FBFFC R17 = FF7FF5FF
>> R18 = 001E0000 R19 = 00000000
>> Jun 23 15:19:44  E) PORT: crcs, MSG: #011 R20 = 000634A0 R21 = 001929E8
>> R22 = 001A2284 R23 = 00191098
>> Jun 23 15:19:44  E) PORT: crcs, MSG: #011 R24 = 00000000 R25 = 00000001
>> R26 = 00000001 R27 = 001A85C4
>> Jun 23 15:19:44  E) PORT: crcs, MSG: #011 R28 = 00AF62BC R29 = 00000028
>> R30 = 00419260 R31 = 007AD148
>> Jun 23 15:19:44  E) PORT: crcs, MSG: #011 CR  = 22042028
>> Jun 23 15:19:44  E) PORT: crcs, MSG: #011 CTR = 011B00FF
>> Jun 23 15:19:44  E) PORT: crcs, MSG: #011 XER = 20000000
>> Jun 23 15:19:44  E) PORT: crcs, MSG: #011 LR  = 0006371C
>> Jun 23 15:19:44  E) PORT: crcs, MSG: #011 DAR = 00000000
>> Jun 23 15:19:44  E) PORT: crcs, MSG: Stack Trace:
>> Jun 23 15:19:44  E) PORT: crcs, MSG:   IP: 0x011B010C, LR: 0x0006371C
>> Jun 23 15:19:44  E) PORT: crcs, MSG: --^ 0x0006371C--^ 0x00082FDC--^
>> 0x000E7094--^ 0x00136048--^ 0x00135F6C
>> Jun 23 15:19:44  E) PORT: crcs, MSG: Suspending faulting task
>> (0x0A01004B)
>
> ...
>
> Using the memory map for exception 7, we reach the following trace:
>
> 0x00135f6c _Thread_Handler
>   /gem_sw/targetOS/RTEMS/source/rtems/rtems-4.10.2/cpukit/score/src/threadhandler.c:80
>
> 0x00136048
>   /gem_sw/targetOS/RTEMS/source/rtems/rtems-4.10.2/cpukit/score/src/threadhandler.c:145
>
> 0x000e7094
>   /gem_sw/epics/R3.14.12.4/base/src/libCom/osi/os/RTEMS/osdThread.c:169
>
> 0x00082fdc
>   /gem_sw/epics/R3.14.12.4/base/src/db/dbEvent.c:883
>
> 0x0006371c
>   /gem_sw/targetOS/RTEMS/source/tools/gcc-4.4.7/newlib/libc/stdio/vfscanf.c:270
>
>
> Power Cycle
>
> *Event #2:*
>
>> Jun 23 18:32:20  E) PORT: crcs, MSG: Exception handler called for
>> exception 3 (0x3)
>> Jun 23 18:32:20  E) PORT: crcs, MSG: #011 Next PC or Address of fault =
>> 001541E4
>> Jun 23 18:32:20  E) PORT: crcs, MSG: #011 Saved MSR = 00009032
>> Jun 23 18:32:20  E) PORT: crcs, MSG: #011 Context: Task ID 0x09010001
>> Jun 23 18:32:20  E) PORT: crcs, MSG: #011 R0  = FFF5A4DA R1  = 00366720
>> R2  = 00000000 R3  = FFF0A442
>> Jun 23 18:32:20  E) PORT: crcs, MSG: #011 R4  = 00352498 R5  = 00000008
>> R6  = 00000000 R7  = FFF0A442
>> Jun 23 18:32:20  E) PORT: crcs, MSG: #011 R8  = 00352498 R9  = 00000002
>> R10 = 00000000 R11 = 00000001
>> Jun 23 18:32:20  E) PORT: crcs, MSG: #011 R12 = 003BC0E0 R13 = 001E4B90
>> R14 = 00000000 R15 = 00000000
>> Jun 23 18:32:21  E) PORT: crcs, MSG: #011 R16 = 00000000 R17 = 00000000
>> R18 = 00000000 R19 = 00000000
>> Jun 23 18:32:21  E) PORT: crcs, MSG: #011 R20 = 00000000 R21 = 00000000
>> R22 = 00000000 R23 = 00000000
>> Jun 23 18:32:21  E) PORT: crcs, MSG: #011 R24 = 00000000 R25 = 00000000
>> R26 = 00000000 R27 = 00000000
>> Jun 23 18:32:21  E) PORT: crcs, MSG: #011 R28 = FFF0A340 R29 = 00000000
>> R30 = FFF0A341 R31 = 0034D5B0
>> Jun 23 18:32:21  E) PORT: crcs, MSG: #011 CR  = 40000004
>> Jun 23 18:32:21  E) PORT: crcs, MSG: #011 CTR = 0012E3AC
>> Jun 23 18:32:21  E) PORT: crcs, MSG: #011 XER = 00000000
>> Jun 23 18:32:21  E) PORT: crcs, MSG: #011 LR  = 000F2CDC
>> Jun 23 18:32:21  E) PORT: crcs, MSG: #011 DAR = FFF0A442
>> Jun 23 18:32:21  E) PORT: crcs, MSG: Stack Trace:
>> Jun 23 18:32:21  E) PORT: crcs, MSG:   IP: 0x001541E4, LR: 0x000F2CDC
>> Jun 23 18:32:21  E) PORT: crcs, MSG: --^ 0x00135F6C
>> Jun 23 18:32:21  E) PORT: crcs, MSG: Suspending faulting task (0x09010001)
>
>
> *Event #3:*
>
>> Jun 24 07:37:50  E) PORT: crcs, MSG: Exception handler called for
>> exception 8 (0x8)
>> Jun 24 07:37:50  E) PORT: crcs, MSG: #011 Next PC or Address of fault =
>> 0015BB08
>> Jun 24 07:37:50  E) PORT: crcs, MSG: #011 Saved MSR = 00009032
>> Jun 24 07:37:50  E) PORT: crcs, MSG: #011 Context: Task ID 0x09010001
>> Jun 24 07:37:50  E) PORT: crcs, MSG: #011 R0  = 00157754 R1  = 003664B8
>> R2  = 00000000 R3  = 00366758
>> Jun 24 07:37:50  E) PORT: crcs, MSG: #011 R4  = 0036661C R5  = 001B4C68
>> R6  = 00366610 R7  = 00000000
>> Jun 24 07:37:50  E) PORT: crcs, MSG: #011 R8  = 005F5370 R9  = 00642450
>> R10 = 0034C940 R11 = 00366688
>> Jun 24 07:37:50  E) PORT: crcs, MSG: #011 R12 = 40000048 R13 = 001E4B90
>> R14 = 001B4C68 R15 = 00000000
>> Jun 24 07:37:51  E) PORT: crcs, MSG: #011 R16 = 00000000 R17 = 00000000
>> R18 = 00000000 R19 = 00366610
>> Jun 24 07:37:51  E) PORT: crcs, MSG: #011 R20 = 00000000 R21 = 00000000
>> R22 = 00366758 R23 = 001B4CC0
>> Jun 24 07:37:51  E) PORT: crcs, MSG: #011 R24 = 0036661C R25 = 00377E40
>> R26 = 00000000 R27 = 0000002F
>> Jun 24 07:37:51  E) PORT: crcs, MSG: #011 R28 = 00377E40 R29 = 001E49B0
>> R30 = 00377E40 R31 = 00000000
>> Jun 24 07:37:51  E) PORT: crcs, MSG: #011 CR  = 40000048
>> Jun 24 07:37:51  E) PORT: crcs, MSG: #011 CTR = 00000000
>> Jun 24 07:37:51  E) PORT: crcs, MSG: #011 XER = 00000000
>> Jun 24 07:37:51  E) PORT: crcs, MSG: #011 LR  = 00157754
>> Jun 24 07:37:51  E) PORT: crcs, MSG: #011 DAR = 00000000
>> Jun 24 07:37:51  E) PORT: crcs, MSG: Stack Trace:
>> Jun 24 07:37:51  E) PORT: crcs, MSG:   IP: 0x0015BB08, LR: 0x00157754
>> Jun 24 07:37:51  E) PORT: crcs, MSG: --^ 0x00157754--^ 0x000F2820--^
>> 0x000F2DA0--^ 0x001E49BD--^ 0x00135F6C
>> Jun 24 07:37:51  E) PORT: crcs, MSG: Suspending faulting task
>> (0x09010001)
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Makefile.ef
Type: application/octet-stream
Size: 692 bytes
Desc: not available
URL: <http://lists.rtems.org/pipermail/users/attachments/20170703/f2608b0f/attachment-0004.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Makefile.ioc
Type: application/octet-stream
Size: 3258 bytes
Desc: not available
URL: <http://lists.rtems.org/pipermail/users/attachments/20170703/f2608b0f/attachment-0005.obj>