Potential SIS or RTEMS/libbsd problem
Sebastian Huber
sebastian.huber at embedded-brains.de
Thu May 23 05:35:25 UTC 2019
On 22/05/2019 22:34, Jiri Gaisler wrote:
> On 5/22/19 7:43 PM, Jiri Gaisler wrote:
>> On 5/22/19 9:49 AM, Sebastian Huber wrote:
>>> On 22/05/2019 09:39, Jiri Gaisler wrote:
>>>> On 5/22/19 8:03 AM, Sebastian Huber wrote:
>>>>> Hello,
>>>>>
>>>>> in the libbsd there is a test for the Epoch Based Reclamation:
>>>>>
>>>>> https://git.rtems.org/rtems-libbsd/tree/testsuite/epoch01/test_main.c
>>>>>
>>>>> When I run this test using the leon3 BSP on real hardware (150MHz
>>>>> NGMP FP) the test completes successfully.
>>>>>
>>>>> If I run the test on the SIS, it is stuck at some point (using "-m
>>>>> 1" works):
>>>>>
>>>>> sparc-rtems5-sis -leon3 -nouartrx -r -tlim 200 s -m 2
>>>>> build/sparc-rtems5-leon3-everything/epoch01.exe
>>>>>
>>>>>
>>>> This test needs a shorter time-slice in the simulator to succeed (-d
>>>> option). The more CPUs there are, the fewer clocks per slice are
>>>> needed. Through trial and error, these values seem to work:
>>>>
>>>> 2 CPUs: -m 2 -d 25
>>>>
>>>> 3 CPUs: -m 3 -d 10
>>>>
>>>> 4 CPUs will not work, even if -d 1 is set. This is most likely a
>>>> simulator problem; I will try to find time to look at it in more
>>>> detail. A quick trace shows that all CPUs are stuck in a loop
>>>> checking for a lock or something similar:
>>>>
>>> It seems CPUs 2 and 3 are in _SMP_barrier_Wait(). CPUs 0 and 1 still
>>> do some work in the EBR algorithm (ck_* functions). Maybe the
>>> algorithm only works if some random timing fluctuations occur.
>> Either that, or there is a hidden race condition in the test that does
>> not show up on real hardware. I noticed that increasing the time slice
>> actually makes the test succeed even on 4 CPUs!
>>
>> -m 2 -d 200 PASS
>>
>> -m 3 -d 200 PASS
>>
>> -m 4 -d 200 FAIL
>>
>> -m 4 -d 400 PASS!
>>
>> BUT
>>
>> -m 3 -d 400 FAIL!
>>
>> I will try to add random delays to the interrupt response time to see if
>> that makes a difference. That is more in line with the real hardware ...
> Adding a pseudo-random delay of 0-15 clocks to each trap/interrupt causes the test to pass on all CPU configurations with the default time slice (50)! I am not sure what this means: it could be a hidden race condition, the algorithm might need some jitter to make progress, or it could still be a simulator issue.
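For illustration only (this is not the actual SIS change, and the function
names are made up): a self-contained sketch of the kind of small LCG that
could supply such a 0-15 clock trap delay.

#include <stdint.h>
#include <stdio.h>

static uint32_t jitter_state = 0x2545F491u;

/* Return a pseudo-random delay of 0 to 15 simulated clocks. */
static uint32_t trap_jitter(void)
{
	/* Numerical Recipes LCG constants. */
	jitter_state = jitter_state * 1664525u + 1013904223u;
	return (jitter_state >> 16) & 0xFu;
}

int main(void)
{
	/* Each simulated trap/interrupt would add trap_jitter() clocks
	 * to its dispatch latency. */
	for (int i = 0; i < 8; ++i)
		printf("trap %d: +%u clocks\n", i, trap_jitter());
	return 0;
}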
>
> Is there any chance that you could compile this test for sis-riscv? RISC-V has different atomic operations and trap handlers so it would be interesting to see if the test behaves differently.
It locks up at the same spot:
riscv-rtems5-sis -m 4 build/riscv-rtems5-griscv-default/epoch01.exe
SIS - SPARC/RISCV instruction simulator 2.13, copyright Jiri Gaisler 2019
Bug-reports to jiri at gaisler.se
RISCV emulation enabled, 4 cpus online, delta 50 clocks
cpu0> run
*** LIBBSD EPOCH 1 TEST ***
nexus0: <RTEMS Nexus device>
<TestEpoch01>
<EnterExit activeWorker="1">
<Counter worker="0">1059417</Counter>
</EnterExit>
<EnterExit activeWorker="2">
<Counter worker="0">1059303</Counter>
<Counter worker="1">1049390</Counter>
</EnterExit>
<EnterExit activeWorker="3">
<Counter worker="0">1058922</Counter>
<Counter worker="1">1049008</Counter>
<Counter worker="2">1061640</Counter>
</EnterExit>
<EnterExit activeWorker="4">
<Counter worker="0">1058540</Counter>
<Counter worker="1">1048679</Counter>
<Counter worker="2">1061258</Counter>
<Counter worker="3">1061258</Counter>
</EnterExit>
<EnterListOpExit activeWorker="1">
<Counter worker="0">925414</Counter>
<Removals worker="0">100</Removals>
</EnterListOpExit>
<EnterListOpExit activeWorker="2">
<Counter worker="0">704898</Counter>
<Counter worker="1">704835</Counter>
<Removals worker="0">46</Removals>
<Removals worker="1">45</Removals>
</EnterListOpExit>
<EnterListOpExit activeWorker="3">
<Counter worker="0">589977</Counter>
<Counter worker="1">585688</Counter>
<Counter worker="2">592200</Counter>
<Removals worker="0">23</Removals>
<Removals worker="1">23</Removals>
<Removals worker="2">23</Removals>
</EnterListOpExit>
<EnterListOpExit activeWorker="4">
<Counter worker="0">505834</Counter>
<Counter worker="1">501869</Counter>
<Counter worker="2">507615</Counter>
<Counter worker="3">507614</Counter>
<Removals worker="0">19</Removals>
<Removals worker="1">18</Removals>
<Removals worker="2">18</Removals>
<Removals worker="3">18</Removals>
</EnterListOpExit>
<EnterExitPreempt activeWorker="1">
<Counter worker="0">275348</Counter>
</EnterExitPreempt>
<EnterExitPreempt activeWorker="2">
<Counter worker="0">275971</Counter>
<Counter worker="1">280381</Counter>
</EnterExitPreempt>
<EnterExitPreempt activeWorker="3">
<Counter worker="0">275956</Counter>
<Counter worker="1">280283</Counter>
<Counter worker="2">280283</Counter>
</EnterExitPreempt>
<EnterExitPreempt activeWorker="4">
<Counter worker="0">275800</Counter>
<Counter worker="1">280185</Counter>
<Counter worker="2">280185</Counter>
<Counter worker="3">280185</Counter>
</EnterExitPreempt>
<EnterListOpExitPreempt activeWorker="1">
<Counter worker="0">266212</Counter>
<Removals worker="0">68</Removals>
</EnterListOpExitPreempt>
Interrupt!
Stopped at time 975738600 (19514.772 ms)
cpu0>
The EBR is a core synchronization primitive in libbsd. It makes me a bit
nervous that it depends on random timing fluctuations to make progress.
I don't know the algorithm well enough to say whether this is the
expected behaviour. A real machine with such exact relative instruction
execution probably does not exist.
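For context, here is a greatly simplified sketch of the general EBR idea
in plain C11 atomics: readers publish the epoch they entered under, and a
reclaimer advances the global epoch and spins until no reader is still
inside a section that started under an older epoch. This is only an
illustration; the libbsd EBR is the Concurrency Kit (ck_epoch)
implementation and differs in many details.

#include <stdatomic.h>

#define MAX_READERS 4

static _Atomic unsigned global_epoch = 1;

struct reader {
	_Atomic unsigned epoch;	/* 0 means "not in a read-side section" */
};

static struct reader readers[MAX_READERS];

/* Read-side section entry: publish the epoch we entered under. */
void epoch_enter(struct reader *r)
{
	unsigned e = atomic_load_explicit(&global_epoch, memory_order_acquire);

	atomic_store_explicit(&r->epoch, e, memory_order_release);
}

void epoch_exit(struct reader *r)
{
	atomic_store_explicit(&r->epoch, 0, memory_order_release);
}

/*
 * Reclaimer side: advance the global epoch, then wait until every reader
 * is either outside a section or already in the new epoch.  This is the
 * kind of loop in which a thread can spin for a long time if the readers
 * never appear to leave their old epoch.
 */
void epoch_synchronize(void)
{
	unsigned next = 1 + atomic_fetch_add_explicit(&global_epoch, 1,
	    memory_order_acq_rel);
	int i;

	for (i = 0; i < MAX_READERS; ++i) {
		unsigned seen;

		do {
			seen = atomic_load_explicit(&readers[i].epoch,
			    memory_order_acquire);
		} while (seen != 0 && seen != next);
	}
}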
In general, you can lock up an SMP system quite easily if two processors
execute just the right LL/SC pairs so that they endlessly steal each
other's reservation.
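As an illustration of that livelock, consider the lr.w/sc.w retry loop an
atomic increment typically compiles to on RISC-V (the asm below is a
hand-written sketch, not taken from RTEMS or libbsd; the portable branch
shows the equivalent CAS retry structure). If two harts run this loop in
perfect lockstep, each sc.w can keep failing because the other hart has
just grabbed the reservation, so neither makes progress. Hardware
forward-progress guarantees and timing noise eventually break the
symmetry on a real machine.

static inline void atomic_inc(volatile int *p)
{
#if defined(__riscv) && defined(__riscv_atomic)
	int tmp, fail;

	__asm__ volatile (
		"1: lr.w  %0, (%2)\n"     /* load-reserved: take the reservation */
		"   addi  %0, %0, 1\n"
		"   sc.w  %1, %0, (%2)\n" /* store-conditional: nonzero on failure */
		"   bnez  %1, 1b\n"       /* retry forever on failure */
		: "=&r" (tmp), "=&r" (fail)
		: "r" (p)
		: "memory");
#else
	/* Portable fallback with the same retry structure. */
	int old = *p;

	while (!__atomic_compare_exchange_n(p, &old, old + 1, 1,
	    __ATOMIC_RELAXED, __ATOMIC_RELAXED)) {
		/* old was refreshed by the failed exchange; just retry. */
	}
#endif
}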
--
Sebastian Huber, embedded brains GmbH
Address : Dornierstr. 4, D-82178 Puchheim, Germany
Phone : +49 89 189 47 41-16
Fax : +49 89 189 47 41-09
E-Mail : sebastian.huber at embedded-brains.de
PGP : Public key available on request.
This message is not a business communication within the meaning of the EHUG.