Potential SIS or RTEMS/libbsd problem

Sebastian Huber sebastian.huber at embedded-brains.de
Thu May 23 05:35:25 UTC 2019

On 22/05/2019 22:34, Jiri Gaisler wrote:
> On 5/22/19 7:43 PM, Jiri Gaisler wrote:
>> On 5/22/19 9:49 AM, Sebastian Huber wrote:
>>> On 22/05/2019 09:39, Jiri Gaisler wrote:
>>>> On 5/22/19 8:03 AM, Sebastian Huber wrote:
>>>>> Hello,
>>>>> in the libbsd there is a test for the Epoch Based Reclamation:
>>>>> https://git.rtems.org/rtems-libbsd/tree/testsuite/epoch01/test_main.c
>>>>> When I run this test using the leon3 BSP on real hardware (150MHz
>>>>> NGMP FP) the test completes successfully.
>>>>> If I run the test on the SIS, it is stuck at some point (using "-m
>>>>> 1" works):
>>>>> sparc-rtems5-sis -leon3 -nouartrx -r -tlim 200 s -m 2
>>>>> build/sparc-rtems5-leon3-everything/epoch01.exe
>>>> This test needs a shorter time-slice in the simulator to succeed (-d
>>>> option). The more cpus, the lower number of clocks in the slice is
>>>> needed. Through trial-and-error, these values seem to work:
>>>> 2 CPUs: -m 2 -d 25
>>>> 3 CPUs: -m 3 -d 10
>>>> 4 CPUs will not work, even if -d 1 is set. This is most likely a
>>>> simulator problem, I will try to find time to look at it in more
>>>> detail. A quick trace shows that all CPUs are stuck in a loop
>>>> checking for a lock or similar:
>>> It seems cpu 2 and 3 are in _SMP_barrier_Wait(). The cpu 0 and 1 still
>>> do some stuff in the EBR algorithm (ck_* functions). Maybe the
>>> algorithm works only in case some random timing fluctuations occur.
>> Either that or there is a hidden race condition in the test that does
>> not show up on real hardware. I noticed that increasing the time slice
>> actually makes the test succeed even on 4 cpus ..!
>> -m 2 -d 200    PASS
>> -m 3 -d 200    PASS
>> -m 4 -d 200    FAIL
>> -m 4 -d 400    PASS!
>> BUT
>> -m 3 -d 400    FAIL!
>> I will try to add random delays to the interrupt response time to see if
>> that will make a difference. That is more in line with the real hardware ...
> Adding a pseudo-random delay of 0 - 15 clocks to each trap/interrupt causes the test to pass on all cpu configurations with the default time slice (50)..! I am not sure what this means - it could be a hidden race condition, the algorithm might need some jitter to work or it could still be a simulator issue.
> Is there any chance that you could compile this test for sis-riscv? RISC-V has different atomic operations and trap handlers so it would be interesting to see if the test behaves differently.

It locks up at the same spot:

riscv-rtems5-sis -m 4 build/riscv-rtems5-griscv-default/epoch01.exe

  SIS - SPARC/RISCV instruction simulator 2.13,  copyright Jiri Gaisler 2019
  Bug-reports to jiri at gaisler.se

  RISCV emulation enabled, 4 cpus online, delta 50 clocks

cpu0> run
nexus0: <RTEMS Nexus device>
   <EnterExit activeWorker="1">
     <Counter worker="0">1059417</Counter>
   <EnterExit activeWorker="2">
     <Counter worker="0">1059303</Counter>
     <Counter worker="1">1049390</Counter>
   <EnterExit activeWorker="3">
     <Counter worker="0">1058922</Counter>
     <Counter worker="1">1049008</Counter>
     <Counter worker="2">1061640</Counter>
   <EnterExit activeWorker="4">
     <Counter worker="0">1058540</Counter>
     <Counter worker="1">1048679</Counter>
     <Counter worker="2">1061258</Counter>
     <Counter worker="3">1061258</Counter>
   <EnterListOpExit activeWorker="1">
     <Counter worker="0">925414</Counter>
     <Removals worker="0">100</Removals>
   <EnterListOpExit activeWorker="2">
     <Counter worker="0">704898</Counter>
     <Counter worker="1">704835</Counter>
     <Removals worker="0">46</Removals>
     <Removals worker="1">45</Removals>
   <EnterListOpExit activeWorker="3">
     <Counter worker="0">589977</Counter>
     <Counter worker="1">585688</Counter>
     <Counter worker="2">592200</Counter>
     <Removals worker="0">23</Removals>
     <Removals worker="1">23</Removals>
     <Removals worker="2">23</Removals>
   <EnterListOpExit activeWorker="4">
     <Counter worker="0">505834</Counter>
     <Counter worker="1">501869</Counter>
     <Counter worker="2">507615</Counter>
     <Counter worker="3">507614</Counter>
     <Removals worker="0">19</Removals>
     <Removals worker="1">18</Removals>
     <Removals worker="2">18</Removals>
     <Removals worker="3">18</Removals>
   <EnterExitPreempt activeWorker="1">
     <Counter worker="0">275348</Counter>
   <EnterExitPreempt activeWorker="2">
     <Counter worker="0">275971</Counter>
     <Counter worker="1">280381</Counter>
   <EnterExitPreempt activeWorker="3">
     <Counter worker="0">275956</Counter>
     <Counter worker="1">280283</Counter>
     <Counter worker="2">280283</Counter>
   <EnterExitPreempt activeWorker="4">
     <Counter worker="0">275800</Counter>
     <Counter worker="1">280185</Counter>
     <Counter worker="2">280185</Counter>
     <Counter worker="3">280185</Counter>
   <EnterListOpExitPreempt activeWorker="1">
     <Counter worker="0">266212</Counter>
     <Removals worker="0">68</Removals>
  Stopped at time 975738600 (19514.772 ms)

The EBR is a core synchronization primitive in libbsd. It makes me a bit 
nervous that it depends on random timing fluctuations to make progress. 
I don't know the algorithm well enough to say whether this is the 
expected behaviour. A real machine whose processors execute instructions 
in such exact lockstep relative to each other probably does not exist.

In general, you can lock up an SMP system quite easily if you perform 
the right LL/SC pair on two processors so that they endlessly steal each 
other's reservation.
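
To illustrate the kind of lock-up meant here, below is a minimal toy 
model in C. It is only a sketch under strong assumptions and is neither 
the SIS implementation nor the ck_* code: it assumes a single 
system-wide reservation that any LL steals, two CPUs that each execute 
exactly one step per tick in a fixed order, and a three-step retry loop 
(LL, work, SC). The names simulate_llsc(), next_random() and cpu_t are 
made up for this example.

```c
#include <stdint.h>

#define CPUS 2
#define NO_OWNER (-1)

/* Steps of the hypothetical retry loop: LL, one work step, SC. */
enum { STEP_LL, STEP_WORK, STEP_SC };

typedef struct {
    int step;   /* next step this CPU will execute */
    int stall;  /* ticks to wait before executing it */
} cpu_t;

/* Small local LCG so the result does not depend on the libc rand(). */
static uint32_t next_random(uint32_t *state)
{
    *state = *state * 1103515245u + 12345u;
    return (*state >> 16) & 0x7fffu;
}

/*
 * Toy model (assumption): there is one reservation for the whole system
 * and every LL steals it. Returns the number of successful SCs within
 * the given number of ticks. With max_jitter > 0, a CPU stalls for
 * 0..max_jitter ticks after each SC, successful or not.
 */
long simulate_llsc(long ticks, int max_jitter, uint32_t seed)
{
    /* CPU 1 starts one tick after CPU 0, so the loops are out of phase. */
    cpu_t cpu[CPUS] = { { STEP_LL, 0 }, { STEP_LL, 1 } };
    int owner = NO_OWNER;
    long successes = 0;

    for (long t = 0; t < ticks; ++t) {
        for (int i = 0; i < CPUS; ++i) {
            cpu_t *c = &cpu[i];

            if (c->stall > 0) {
                --c->stall;
                continue;
            }

            switch (c->step) {
            case STEP_LL:
                owner = i;              /* steal the reservation */
                c->step = STEP_WORK;
                break;
            case STEP_WORK:
                c->step = STEP_SC;
                break;
            default:                    /* STEP_SC */
                if (owner == i) {
                    ++successes;
                    owner = NO_OWNER;
                }
                c->step = STEP_LL;      /* retry or start next update */
                if (max_jitter > 0) {
                    c->stall = (int)(next_random(&seed)
                        % (uint32_t)(max_jitter + 1));
                }
                break;
            }
        }
    }

    return successes;
}
```

In this model simulate_llsc(10000, 0, 1) returns 0: each LL lands 
between the other CPU's LL and SC, so no SC ever succeeds. With 
simulate_llsc(10000, 15, 1) the pattern breaks and SCs start to succeed. 
This only shows that such a lock-up is possible in principle under exact 
lockstep timing; it says nothing about whether this is what happens in 
the epoch01 test.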

Sebastian Huber, embedded brains GmbH

Address : Dornierstr. 4, D-82178 Puchheim, Germany
Phone   : +49 89 189 47 41-16
Fax     : +49 89 189 47 41-09
E-Mail  : sebastian.huber at embedded-brains.de
PGP     : Public key available on request.

This message is not a business communication within the meaning of the EHUG.
