[PATCH] aarch64: Add tests that are failing intermittently

Kinsey Moore kinsey.moore at oarcorp.com
Thu Aug 26 23:36:33 UTC 2021


On 8/20/2021 22:06, Chris Johns wrote:
> On 21/8/21 2:38 am, Kinsey Moore wrote:
>> On 8/19/2021 18:03, Chris Johns wrote:
>>> On 20/8/21 4:55 am, Kinsey Moore wrote:
>>>> On 8/19/2021 13:32, Gedare Bloom wrote:
>>>>> On Thu, Aug 19, 2021 at 11:43 AM Kinsey Moore <kinsey.moore at oarcorp.com> wrote:
>>>>>> I've seen these failures on my local system, in our CI, and on a build
>>>>>> server that I sometimes use for development/testing, so if it's a
>>>>>> configuration issue we're being pretty consistent about misconfiguration
>>>>>> across some pretty different environments (docker, bare-metal, VM,
>>>>>> different OSs, different QEMU versions). I've seen enough of the
>>>>>> spintrcritical tests fail sporadically on QEMU to lump them all into
>>>>>> this category. These are also tests that I have seen behave badly on
>>>>>> ARMv7 QEMU on my local system (which doesn't rule out misconfiguration,
>>>>>> but it's another data point).
>>>>>>
>>>>> Yes, for example, it may be a matter of qemu process counts spawned by
>>>>> rtems-test, and the order in which tests get invoked could be a cause
>>>>> for which ones don't work. I could easily see this happening, since
>>>>> each test runtime will be fairly consistent, so you'll often see the
>>>>> same tests running concurrently with each other. But, if you change
>>>>> the order (e.g., by adding new tests), then we may see a new set of
>>>>> sporadically failing testcases. Do we just add those, or do we need
>>>>> to re-examine this indeterminate set periodically? Who will maintain
>>>>> this list? That's kind of the root of my concern here.
>>>> I understand your concern about maintenance of the failure list and I don't
>>>> have a good answer for you. I imagine going forward it would be a combination
>>>> of the current stakeholders for a given BSP and anyone who watches the
>>>> automated build output from Joel's runs for these kinds of issues.
>>>>
>>>> On the other hand, if we don't mark those tests, people will get fatigued
>>>> looking at the spurious failures and assume any new ones just fall into the
>>>> same category as others. At that point is it even worth running the
>>>> automated tests for that platform?
>>>>
>>>>>> As far as your worry about marking these indeterminate, they're only
>>>>>> being marked as such for QEMU BSPs. The ZynqMP hardware BSP doesn't
>>>>>> have these testing carve-outs and runs all these tests flawlessly.
>>> Great, this is important.
>>>
>>>>>> These failures become much more common when there is other load on the
>>>>>> system, and a lot of them disappear when you limit the tester to a
>>>>>> single QEMU instance at a time.
>>>>>>
>>>>> I'm wondering if we should sacrifice testing speed for
>>>>> coverage/quality. If throttling rtems-test leads to more reliable test
>>>>> results, then it may be a better option than basically ignoring a
>>>>> swath of our testsuite.
>>>> That would certainly mitigate some of the failures, but you'd also have to
>>>> guarantee nothing else is running on the system which could cause the same
>>>> problem. I know at least some of the current automated runs operate on a
>>>> shared system which can and does often have other intensive processes
>>>> running on it. There are also the tests that are sporadic on QEMU even
>>>> without additional load.
>>> What is it in these tests when combined with qemu that causes the tests to fail?
>>> Is there some relation to a real clock, some shared host resource or a bug in
>>> qemu? I am concerned a simulator can vary like this based on the host's load,
>>> and it makes me wonder how people use it on machines that host a number of VMs.
>> I experienced very similar results on an ARMv7 BSP (not Zynq) and assumed that this
>> was a known/accepted problem with QEMU when the same issues popped up on
>> AArch64.
> I think we have just ignored the issue. I know I have ignored it because of the
> rabbit hole it is.
>
>> My local system under no other load produces these failures for the Zynq A9
>> QEMU BSP:
>>
>>          "failed": [
>>              "spcpucounter01.exe",
>>              "psxtimes01.exe",
>>              "sp69.exe",
>>              "psx12.exe",
>>              "minimum.exe",
>>              "dl06.exe",
>>              "sptimecounter02.exe"
>>          ],
>>
>> minimum.exe
> We have discussed this test in the past and I think the end result from Joel
> was that an exit code of 0 meant it had passed, but I am not sure the exit
> code is printed because the test is minimal. Maybe it should be changed to be
> a `no-run` type test?
>
>> and dl06.exe are probably unrelated,
> Yeap and that is one I should fix when I can find the time.
>
>> but the remainder are in my problem set for AArch64 on QEMU.
> OK.
>
>> A run of the AArch64 ZynqMP ILP32 BSP produced these failures under the same
>> conditions with all the test carve-outs removed:
>>
>>          "failed": [
>>              "psx12.exe",
>>              "spcpucounter01.exe",
>>              "sptimecounter01.exe",
>>              "sptimecounter02.exe",
>>              "sp04.exe"
>>          ],
>>
>> Because of my experience with the aforementioned ARMv7 BSP and the lack of
>> failures on hardware, I chose not to weed out the root cause of the failures under
>> QEMU.
> Sure. However, it leaves open the underlying problem of why these fail with
> QEMU, and so we are caught either way.
>
>> More than anything else, this patch is documentation of our observations
>> across multiple architectures and BSPs running on QEMU.
> And it also affects the results.
>
>>> I feel with this volume of tests being tagged this way we should have a better
>>> understanding of the problem and so a means to track or not track how to resolve
>>> it. As Gedare has kindly stated, once pushed this change disappears into a dark
>>> corner and we have no means to track it.
>>>
>>> The other solution is to set `jobs` to `1` in this BSP's tester config, again
>>> something Gedare has raised. It means we get better or even valid results. What
>>> is more important, valid results or running the testsuite as fast as possible?
>> I fully support dropping the number of jobs to "half" or 1 for better results on
>> QEMU runs that display these problems.
> OK, then maybe this is the way to go.
I submitted a patch to the mailing list to set jobs=1 on all ARM and 
AArch64 QEMU tester configurations.
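
For reference, that change amounts to roughly the following shape in the
rtems-tools per-BSP tester configuration (an illustrative snippet only; the
actual file names and the surrounding keys vary per BSP):

    # tester/rtems/testing/bsps/xilinx_zynq_a9_qemu.ini (illustrative path)
    [xilinx_zynq_a9_qemu]
    ...
    # Run one QEMU instance at a time so host load does not distort the
    # guest's timer behaviour.
    jobs = 1
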
>> My comment in that regard was that other system loading (or multiple
>> simultaneous test runs) can also cause the same problem, and so this is only
>> a partial solution. Barring a fix for RTEMS or QEMU for these load-dependent
>> and sporadic failures, this at least still needs to be documented in some
>> form.
> Yes and the failures should highlight an issue on the host that needs to be
> looked into.

Since I'm working on SMP and I've had some of those tests failing
sporadically as well, I took a dive into smpschededf01.exe on AArch64.
The issue that particular test seems to be encountering is a mismatch
between the busy wait delay produced by rtems_test_busy_cpu_usage()
and the number of kernel ticks that have actually elapsed. My
hypothesis is that QEMU is prone to dumping a pile of timer ticks into
the virtual CPU all at once to catch up to wall time after returning
from a context switch on the host OS. This would be consistent with
the observation that failures are sporadic and increase under system
load. I instrumented the code and can see that the loop in
rtems_test_busy_cpu_usage() barely runs between these tick interrupts,
if at all.
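
To make that concrete, the instrumentation is conceptually nothing more than
the sketch below. This is a hand-written illustration rather than the actual
test or support code; busy_work() is a stand-in for the CPU-bound loop inside
rtems_test_busy_cpu_usage(), and the tick counts come from the classic API.

    #include <rtems.h>
    #include <stdio.h>

    /* Stand-in for the CPU-bound loop inside rtems_test_busy_cpu_usage(). */
    static void busy_work( void )
    {
      for ( volatile int i = 0; i < 100000; ++i ) {
        /* burn cycles */
      }
    }

    /*
     * Count how many clock ticks elapse across each fixed chunk of busy
     * work.  On real hardware the delta would normally be 0 or 1; a large
     * delta means a burst of catch-up ticks was delivered back to back with
     * almost no busy-loop progress in between.
     */
    static void check_tick_bunching( void )
    {
      rtems_interval last = rtems_clock_get_ticks_since_boot();
      int i;

      for ( i = 0; i < 100; ++i ) {
        rtems_interval now;

        busy_work();
        now = rtems_clock_get_ticks_since_boot();

        if ( now - last > 1 ) {
          printf( "iteration %d: %u ticks at once\n",
                  i, (unsigned) ( now - last ) );
        }

        last = now;
      }
    }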

I guess my next step is to see whether QEMU has an option to run its
timers closer to the illusion of real hardware instead of basing them
on the wall clock.
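
If it does, the obvious candidate to try is QEMU's -icount mode, which derives
guest time from the number of executed instructions instead of the host clock.
Something along these lines (an untested idea rather than a known fix; the
exact option spelling should be checked against the QEMU version in use, and
rtems-test would pass it through the BSP's QEMU options):

    # Untested: tie guest time to executed instructions rather than the
    # host clock, and don't try to catch up to real time.
    qemu-system-aarch64 ... -icount shift=auto,sleep=off ...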


Kinsey


