[tools] tester: Remove hard coded time limit for SIS

Fri Jul 8 00:33:57 UTC 2022

On 7/7/2022 5:34 pm, Sebastian Huber wrote:
> On 07/07/2022 09:29, Chris Johns wrote:
>> -1 from me.
>>
>> On 7/7/2022 4:44 pm, Sebastian Huber wrote:
>>> Remove the hard coded time limit in the SIS configuration which would overrule
>>> the general tester settings (for example the --timeout command line option).
>>> Users can set a SIS time limit in the configuration if necessary.
>> I would like this to be left as is and not to be the other way around.
>>
>> What is the problem with you having a user config for this?
>>
>> I am now starting to wonder if you have made some dependencies on this for
>> reasons I cannot explain.
> 
> I have no idea why you are so hesitant to remove this default limit. I just want
> to run the tests with a higher timeout defined by the command line --timeout
> option. That's it. I didn't anticipate that this is such an issue.

You are persistent :) and I am not comfortable removing this setting. I am
frustrated with myself that I cannot seem to clearly explain things. I know this
is complicated and detailed but I will try.

I am happy we are discussing this topic. The stability of simulations running in
parallel has been a problem ever since the tester could do it. Returning to the
topic after a period of time has been good and it has forced me to clarify my
thinking about it.

When Joel pointed out the purpose of the option and questioned removing it, I
saw that removing it was a mistake. The critical point is that the time limit
option is relative to the CPU time and not the host time.

I was happy to see the option exists because it could be part of a solution for
simulations that I have been searching for for a long time. Now that I know it
exists in the SIS simulator, you would like to disable it. :) I would rather see
us sort out the timing of tests in a more structured manner.

The period I set is wrong because it needs to be the maximum period of all
tests. I recognise the validation tests are long, complex tests and this means
timeouts in general may now be wrong. I had hoped it could be adjusted to be the
maximum. I am sorry I did not have the time to tune it correctly.

I would like this setting to stay as a reminder we have work to do to better
manage test run times. I am concerned this change will bury the issue and then
leave users needing to understand a timeout command line option and the problems
it hides to get valid test results.

For projects that depend on formal testing I think disabling the option via a
user config, local patch, etc. and using the timeout option is subtly moving an
important configuration item to the metadata of the testing process. A --timeout
value is tuned for a moment in time and may not work in the future on different
hardware.

The detail:

1. The rtems-test command should be able to run simulators for a supported BSP
on any host with any operating system and achieve the same results we see when a
release is made.

2. An option like --timeout requires detailed knowledge to know when it is OK to
use it to make the tests pass. You and I have the knowledge and confidence to
set it and say the test results are fine; users may not.

3. RTEMS is a deterministic operating system and I think a release's tests
should take the same CPU time.

4. OAR under Joel's guidance has built an awesome regression builder and tester.
They should be congratulated for their leadership here. It is costly,
complicated and difficult. Balancing the tester jobs to get reliable and
consistent test results is not easy. A long timeout setting helps, however it
increases the testing time because you are left waiting for tests to time out.
As it is, things take days to run. Test runs need to be as fast as possible.

5. A CPU real-time clock time limit lets us set an accurate timeout per test or
group of tests and that means we can get timeouts that are consistent across all
hosts running the simulator. It also means we can make the test runs as fast as
they can be and if the simulator on host hardware is faster than real target
hardware that can be a large saving of time.

6. Running ticker on my FreeBSD host takes 0.52s and the CPU saw 35s. This
highlights the disconnect between the host clock and the CPU clock, and how
setting a valid --timeout value is pure guesswork. The command line timeout
appears to work with the SIS because of the scaling happening but in reality it
is just a guess.

7. QEMU is hard because it is more of a VM system these days than a simulator
and a VM wants to make sure the time in the VM is the same as the time on the
host. It does not match well with what we want.

8. The --timeout option was added because of QEMU and failing tests, as I could
not find a means to manage it internally. The timeout used to be hard coded. It
sets the limit on the execution time and it is updated by console output. I was
forced to add it because others running the tests could not get the same
results I was.

9. I wonder if the following pins the timeout option to the setting:

  [erc32-sis]
  sis_time_limit = %{timeout}

If this works you could have that as a local setting. It however hides the fact
we need to manage test time limits better.
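
To put rough numbers on point 6, here is a small Python sketch. It is not part
of the tester. The function names are made up for illustration, and the 600
second CPU limit and the slower host ratios are hypothetical values chosen only
to show how the host timeout you would need moves around while the CPU time
limit stays the same.

```python
# Figures from point 6: the ticker test on one FreeBSD host.
HOST_SECONDS = 0.52  # wall-clock time measured on the host
CPU_SECONDS = 35.0   # time as seen by the simulated CPU


def simulated_per_host_second(cpu_seconds, host_seconds):
    """Simulated CPU seconds that elapse per host wall-clock second."""
    return cpu_seconds / host_seconds


def host_timeout_for(cpu_limit, ratio):
    """Host wall-clock seconds needed to cover cpu_limit simulated seconds."""
    return cpu_limit / ratio


ratio = simulated_per_host_second(CPU_SECONDS, HOST_SECONDS)
print(f"about {ratio:.1f} simulated seconds per host second")

# Hypothetical: a test that needs 600 simulated seconds. The slower host
# ratios are invented; only the first one comes from the measurement above.
CPU_LIMIT = 600.0
for label, r in (
    ("measured host", ratio),
    ("host half as fast", ratio / 2),
    ("heavily loaded host", ratio / 10),
):
    print(f"{label}: {host_timeout_for(CPU_LIMIT, r):.1f}s of host time")
```

The CPU time limit is the same number on every line while the host time needed
changes with the host, which is why I see a host based timeout as a guess.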

Chris