[PATCH] tester: Limit simultaneous QEMU jobs to 1

Tue Aug 31 23:00:45 UTC 2021

On 31/8/21 6:30 pm, Sebastian Huber wrote:
> On 31/08/2021 09:00, Chris Johns wrote:
>> On 31/8/21 3:20 pm, Sebastian Huber wrote:
>>> On 30/08/2021 20:32, Kinsey Moore wrote:
>>>> On 8/30/2021 12:12, Sebastian Huber wrote:
>>>>> On 24/08/2021 20:45, Kinsey Moore wrote:
>>>>>> diff --git a/tester/rtems/testing/bsps/a53_ilp32_qemu.ini
>>>>>> b/tester/rtems/testing/bsps/a53_ilp32_qemu.ini
>>>>>> index 3beba06..581c59c 100644
>>>>>> --- a/tester/rtems/testing/bsps/a53_ilp32_qemu.ini
>>>>>> +++ b/tester/rtems/testing/bsps/a53_ilp32_qemu.ini
>>>>>> @@ -36,3 +36,4 @@ bsp           = a53_ilp32_qemu
>>>>>>    arch          = aarch64
>>>>>>    tester        = %{_rtscripts}/qemu.cfg
>>>>>>    bsp_qemu_opts = %{qemu_opts_base} -serial mon:stdio -machine
>>>>>> virt,gic-version=3 -cpu cortex-a53 -m 4096
>>>>>> +jobs          = 1
>>>>>
>>>>> Does this overwrite the command line option or is this a default value?
>>>>>
>>>> When this is set in the tester configuration, the command line switch has no
>>>> effect but it can be overridden in the user-config.
>>>
>>> Overruling the command line option is not that great. I have a vastly different
>>> test run duration with --jobs=1 vs. --jobs=48 with more or less the same test
>>> results.
>>
>> What does more or less mean?
> 
> On Qemu some tests have no reliable outcome. If I run with --jobs=48 only two of
> these tests fail compared to --jobs=1.

It seems the experience varies between archs and hosts. It is the origin of this
patch series.

>> I appreciate the efforts Kinsey has gone to looking into why we have this
>> happening and I also believe we need to keep pushing towards repeatable results.
>> If limiting to 1 gives us repeatable results on qemu then I prefer this over
>> tainted test results with intermittent tags.
> 
> During development waiting one minute is much better than waiting 13 minutes.
> Repeatable tests are one aspect, but there are other aspects too. Overruling
> command line options is not that great. If you run with default values, it is
> all right to trade off repeatable results against a fast test run. However, if I
> want to run with --jobs=N, I want to run with N jobs and not just one.

Yes I agree. How we manage this so it is apparent seems to be the key issue here.

>>> I think this option should be split into a "force-jobs" and
>>> "default-jobs" option.
>>
>> I am sorry I do not understand these options?
> 
> force-jobs forces the jobs to N regardless of what is specified on the command
> line. Maybe a warning or error should be emitted if the command line option
> conflicts with the configuration option.
> 
> default-jobs selects the job count if no --jobs command line option is specified.

What about adding a `max-jobs` field, where 0 means no limit, that cannot be
exceeded?

Then `default-jobs` can be used as the default, where again 0 means no limit?

>> The command line is ignored because the value is fixed on purpose and I am
>> not seeing a reason to change this.
> 
> Ignoring command line options is not really a pleasant user experience.

Yes, it is not. It was added in a hurry without much thought when I added the TFTP
support.

>> When specified in a config it is a physical limit. A user being able to change
>> the number of TFTP jobs on the command line does not make sense.
> 
> Yes, for physical limits this makes sense.

We need to manage this case for new users.

>> This tool's focus is testing on hardware and I see that as more important. And
>> as I have said before if we have problematic tests maybe the test or the tool
>> generating the results needs to be investigated.
>>
>> I see this issue as something specific to the design of qemu and a few of our
>> tests. I can guess at some of the reasons qemu does this but also being able to
>> have the tick timer's clock be sync'ed with the CPU clock is important in some
>> types of simulation, i.e. our case and these problematic tests. We are a real-time
>> operating system so needing this to be closer to right in simulation does not
>> seem unreasonable.
>>
>> This discussion sends a clear message, tier 1 archs and BSPs are very important
>> to this project.
> 
> There are several ways to address the sporadic test failures on Qemu. You could
> for example also change the tests to make them independent of the simulator
> timing. For now, my approach would be to change the default jobs count for the
> Qemu BSPs and still let the user overrule the default with a custom value to get
> for example a faster test run.

This is sensible. In summary:

1. Add `max-jobs` as a config-file-only setting with a default of 0.

2. Change the config `jobs` to `default-jobs`, again with 0 as the default.

3. Let the command line override the default jobs and raise an error if over the
maximum jobs allowed.

4. Provide a clear notice at the start and end of a run if the jobs used do not
match the default.
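
As a rough illustration of how points 1 to 4 could fit together, here is a
minimal Python sketch. It is not the tester's code: the function name
`resolve_jobs`, the `JobsError` exception, and the notice wording are all
hypothetical, and it assumes that 0 ("no limit") for `default-jobs` means one
job per host CPU, which has not been decided in this thread.

```python
import os


class JobsError(ValueError):
    """Hypothetical error for a job count above the configured maximum."""


def resolve_jobs(cli_jobs=None, default_jobs=0, max_jobs=0, host_cpus=None):
    """Return (jobs, notice); notice is None when the default is in effect.

    cli_jobs     -- value of --jobs, or None if the option was not given
    default_jobs -- proposed `default-jobs` config value, 0 means no limit
    max_jobs     -- proposed `max-jobs` config value, 0 means no limit
    """
    if host_cpus is None:
        host_cpus = os.cpu_count() or 1

    # Assumption: "no limit" for the default means one job per host CPU.
    effective_default = default_jobs if default_jobs > 0 else host_cpus
    if max_jobs > 0:
        effective_default = min(effective_default, max_jobs)

    # Point 2: no --jobs on the command line, so use the default.
    if cli_jobs is None:
        return effective_default, None

    if cli_jobs < 1:
        raise JobsError("jobs must be at least 1")

    # Point 3: the command line wins, unless it exceeds the maximum.
    if max_jobs > 0 and cli_jobs > max_jobs:
        raise JobsError(f"--jobs={cli_jobs} exceeds max-jobs={max_jobs}")

    # Point 4: tell the user when the run does not use the default.
    notice = None
    if cli_jobs != effective_default:
        notice = (f"note: running with {cli_jobs} jobs, "
                  f"the BSP default is {effective_default}")
    return cli_jobs, notice
```

With this sketch a QEMU BSP config that sets `default-jobs = 1` would run one
job by default, `--jobs=48` would still run 48 jobs with a notice, and a TFTP
style config that sets `max-jobs = 1` would reject `--jobs=4` with an error.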

Chris