RTEMS scheduler bug ?

Wed Apr 3 15:03:15 UTC 2019

This sounds like a problem I had in 2015 on an STM32 that Sebastian helped me get around. At the end of the ordeal I wrote:

"A bit of review to begin with; I am working with an STM32F4 ARM Cortex M4F processor’s ADC section. A feature of this ADC is the ability to have conversions triggered by a timer (great for evenly sampled signals), the results transferred using double buffered DMA, giving you an interrupt when the DMA buffer is half full, then again when the buffer is full.

To let my task know when there was data ready to process, the DMA half/full complete interrupt routines would call rtems_event_send. The task would pend on the events, with a timeout in case something screwed up.

In my case, the timer would trigger 14 channels of ADC conversions to happen 400 times per second. This would yield 200 half full and 200 full interrupts per second, each calling rtems_event_send.
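As a quick sanity check on those rates, the arithmetic can be written out in C. The buffer depth is not stated in the text; the two-scan figure below is only inferred from 400 scans per second producing 200 full-buffer interrupts per second, and all names are illustrative.

```c
#include <stdint.h>

/* Figures taken from the text. */
enum {
    ADC_CHANNELS         = 14,
    SCANS_PER_SECOND     = 400,
    HALF_IRQS_PER_SECOND = 200,
    FULL_IRQS_PER_SECOND = 200
};

/* Inferred, not stated: scans held by the whole DMA buffer. */
static unsigned scans_per_buffer(void)
{
    return SCANS_PER_SECOND / FULL_IRQS_PER_SECOND;
}

/* Samples in the whole buffer under that inference. */
static unsigned samples_per_buffer(void)
{
    return ADC_CHANNELS * scans_per_buffer();
}

/* Total rtems_event_send calls per second from both interrupt routines. */
static unsigned event_sends_per_second(void)
{
    return HALF_IRQS_PER_SECOND + FULL_IRQS_PER_SECOND;
}

/* Average spacing between two consecutive event sends, in microseconds. */
static unsigned microseconds_between_sends(void)
{
    return 1000000u / event_sends_per_second();
}
```

Under these assumptions the task is woken every 2.5 ms on average, so a run of a few thousand seconds involves on the order of a million event sends before the failure shows up.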

This would run for a few thousand seconds and then the program would crash. After some painful debugging, I managed to repeatedly catch the system attempting to expire some “watchdogs”, which I believe is the cancelling of outstanding timeouts on satisfied rtems_event_receive calls.

After I had tried a whole bunch of dead ends, Sebastian Huber asked me about the priorities of the interrupts being generated.

The ARM architecture uses a vectored interrupt structure quite similar to the MC68xxx processors, where a device generates an interrupt and the address of the service routine is automatically picked up from a known place in a table without having to poll a bunch of registers to figure out what happened and branch off to the handler. The ARM processors have assignable priorities on most of the interrupts, so if two interrupts assert at the same time, or if a higher priority interrupt happens while an interrupt is in progress, you can predict what happens.

What I didn’t know is that RTEMS implements something called Non-Maskable Interrupts (NMI). The software NMIs don’t seem to be like hardware NMIs (a hardware interrupt that cannot be turned off); they just have the same name (much like the event watchdogs that aren’t like the hardware watchdogs).

What I learned was that RTEMS NMIs are interrupt routines that are not allowed to use any RTEMS facilities. So, I presume, these routines would be used for dealing with devices that don’t need to interact with task code. The upside is that the interrupts can be entered bypassing RTEMS’ overhead.

A drawback is that if you call for RTEMS facilities from within one of these routines, apparently, your code becomes rather crashy.

To differentiate between NMI routines and regular ISRs that can call RTEMS facilities, the developers use the interrupt priorities and a mask. The NMI determination is not specific to the ARM family; each architecture has a mask that selects which bits decide whether an interrupt routine is an NMI or an ISR.

ARM uses an 8-bit priority, and a priority in the range 0x00-0x7F indicates an NMI. On ARM, the lower the number, the more urgent the interrupt, so NMIs have higher urgency than ISRs that can use RTEMS facilities.

On the STM32F4, only 4 of the 8 priority bits are implemented: the 4 MSBs, with the lower 4 being set to 0 (other Cortex M4 implementations have other combinations). In ST’s CubeMX tool, you can set the interrupt priority of the various interrupt sources in the range of 0-15, and Cube generates code to take care of the bit shifting for you. In my case I had set my priorities to 1, 2, 3, and 6. Shifted, these became 0x10, 0x20, 0x30, and 0x60. Since these numbers are all below 0x80, the RTEMS code was interpreting these interrupts as NMIs, bypassing a bunch of the code needed to support RTEMS calls.
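The bit shifting and the 0x80 comparison described above can be sketched in a few lines of C. This is only an illustration: the helper names are made up, the 4-bit width is the STM32F4 case from the text, and the 0x80 limit is the value quoted above rather than something read out of the RTEMS sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption from the text: the STM32F4 implements the 4 MSBs of the
 * 8-bit priority field. */
#define IMPLEMENTED_PRIO_BITS 4u

/* Limit quoted in the text: raw priorities below 0x80 are handled as NMIs. */
#define NMI_PRIORITY_LIMIT 0x80u

/* Hypothetical helper: convert a 0-15 CubeMX priority to the raw 8-bit value. */
static uint8_t cube_to_raw_priority(unsigned cube_priority)
{
    assert(cube_priority < (1u << IMPLEMENTED_PRIO_BITS));
    return (uint8_t)(cube_priority << (8u - IMPLEMENTED_PRIO_BITS));
}

/* Hypothetical helper: true if the raw priority falls in the NMI range. */
static bool is_treated_as_nmi(uint8_t raw_priority)
{
    return raw_priority < NMI_PRIORITY_LIMIT;
}
```

With the original settings, cube_to_raw_priority(6) gives 0x60 and is_treated_as_nmi(0x60) is true, which matches the behaviour described above.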

By changing my interrupt priorities to 9, 10, 11, and 14 (shifting gives 0x90, 0xA0, 0xB0, and 0xE0), the interrupt routines lost their NMI nature and the system immediately became dead stable with a 1 kHz tick interrupt rate, 2 ADC DMA interrupts at 200 Hz each, and a CAN interrupt at about 36 Hz."
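A check along the following lines could catch this misconfiguration at start-up. It is a sketch under the same assumptions as the text (the 0x80 limit, priorities stored in the upper bits of an 8-bit field); the function names are hypothetical and are not part of RTEMS or CubeMX.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Limit quoted in the text: raw priorities below 0x80 are handled as NMIs. */
#define NMI_PRIORITY_LIMIT 0x80u

/* Smallest CubeMX-style (unshifted) priority whose raw value reaches the
 * limit, for a part that implements prio_bits (1 to 8) priority bits. */
static unsigned min_rtems_safe_priority(unsigned prio_bits)
{
    return NMI_PRIORITY_LIMIT >> (8u - prio_bits);
}

/* True if every unshifted priority in the list is outside the NMI range. */
static bool all_priorities_rtems_safe(const uint8_t *priorities,
                                      size_t count,
                                      unsigned prio_bits)
{
    unsigned minimum = min_rtems_safe_priority(prio_bits);

    for (size_t i = 0; i < count; i++) {
        if (priorities[i] < minimum) {
            return false;
        }
    }
    return true;
}
```

For the 4-bit STM32F4 case the minimum works out to 8, so the original set {1, 2, 3, 6} fails the check and the corrected set {9, 10, 11, 14} passes.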

When I ported RTEMS 5 to the STM32F7, I ran into the same issue and used the same method to get around it.

I hope this helps.

Andrei

> On 2019-April-03, at 07:46, Sebastian Huber <sebastian.huber at embedded-brains.de> wrote:
> 
> On 03/04/2019 15:41, Catalin Demergian wrote:
>> yes, I realized yesterday evening that gIntrErrs could be incremented in the second if.
>> so I rewrote it like this
>> 
>> int gIntrptErrs;
>> int gInsertErrs;
>> 
>> RTEMS_INLINE_ROUTINE void _Scheduler_priority_Ready_queue_enqueue(
>>   Chain_Node                     *node,
>>   Scheduler_priority_Ready_queue *ready_queue,
>>   Priority_bit_map_Control       *bit_map
>> )
>> {
>>   Chain_Control *ready_chain = ready_queue->ready_chain;
>>   //_Assert(_ISR_Get_level() != 0);
>>   if(_ISR_Get_level() == 0)
>>     gIntrptErrs++;
>> 
>>   cnt_before = _Chain_Node_count_unprotected(ready_chain);
>>   _Chain_Append/*_unprotected*/( ready_chain, node );
>>   cnt_after = _Chain_Node_count_unprotected(ready_chain);
>> 
>>   if(cnt_after != cnt_before + 1)
>>     gInsertErrs++;
>> 
>>   _Priority_bit_map_Add( bit_map, &ready_queue->Priority_map );
>> }
>> 
>> It didn't seem that we enter that code with interrupts enabled. Output was:
>> # cpuuse
>> -------------------------------------------------------------------------------
>>                               CPU USAGE BY THREAD
>> ------------+----------------------------------------+---------------+---------
>>  ID         | NAME                                   | SECONDS       | PERCENT
>> ------------+----------------------------------------+---------------+---------
>> *cdemergian build 11.15 gIntrptErrs=0 gInsertErrs=2*
>>  0x09010001 | IDLE                                   |    244.595117 |  99.238
>>  0x0a010001 | UI1                                    |      1.000929 |   0.406
>>  0x0a010002 | ntwk                                   |      0.099342 |   0.040
>>  0x0a010003 | SCtx                                   |      0.068705 |   0.027
>>  0x0a010004 | SCrx                                   |      0.089272 |   0.036
>>  0x0a010005 | eRPC                                   |      0.000050 |   0.000
>>  0x0a010006 | SHLL                                   |      0.550608 |   0.223
>>  0x0b010001 |                                        |      0.000096 |   0.000
>>  0x0b010002 |                                        |      0.068307 |   0.027
>> ------------+----------------------------------------+---------------+---------
>>  TIME SINCE LAST CPU USAGE RESET IN SECONDS:           246.528065
>> -------------------------------------------------------------------------------
>> [/] #
>> Not all the time; in most of the runs both globals were zero, which is weird.
>> 
>> I also tried the patch. The issue was reproduced as well.
>> [/] # cpuuse
>> -------------------------------------------------------------------------------
>>                               CPU USAGE BY THREAD
>> ------------+----------------------------------------+---------------+---------
>>  ID         | NAME                                   | SECONDS       | PERCENT
>> ------------+----------------------------------------+---------------+---------
>> *cdemergian build 16.25 gIntrptErrs=233694 gInsertErrs=1*
>>  0x09010001 | IDLE                                   |     94.488726 |  98.619
>>  0x0a010001 | UI1                                    |      1.000931 |   1.044
>>  0x0a010002 | ntwk                                   |      0.030101 |   0.031
>>  0x0a010003 | SCtx                                   |      0.021441 |   0.022
>>  0x0a010004 | SCrx                                   |      0.027176 |   0.028
>>  0x0a010005 | eRPC                                   |      0.000049 |   0.000
>>  0x0a010006 | SHLL                                   |      0.215693 |   0.225
>>  0x0b010001 |                                        |      0.000096 |   0.000
>>  0x0b010002 |                                        |      0.027211 |   0.028
>> ------------+----------------------------------------+---------------+---------
>>  TIME SINCE LAST CPU USAGE RESET IN SECONDS:              95.867059
>> -------------------------------------------------------------------------------
>> 
>> We are getting big numbers for gIntrptErrs (is that normal? I don't understand all the aspects of the patch just yet).
> 
> Can you set a break point to the gIntrptErrs++ and print the stack traces?
> 
> -- 
> Sebastian Huber, embedded brains GmbH
