[PATCH 4/6] testsuite: Add expected-fail to psim

Joel Sherrill joel at rtems.org
Sun May 10 00:37:56 UTC 2020


On Sat, May 9, 2020, 6:56 PM Chris Johns <chrisj at rtems.org> wrote:

> On 10/5/20 9:24 am, Joel Sherrill wrote:
> >
> >
> > On Sat, May 9, 2020, 6:18 PM Chris Johns <chrisj at rtems.org> wrote:
> >
> >     On 10/5/20 4:57 am, Joel Sherrill wrote:
> >      >
> >      >
> >      > On Sat, May 9, 2020, 1:02 PM Gedare Bloom <gedare at rtems.org> wrote:
> >      >
> >      >     On Sat, May 9, 2020 at 2:09 AM Chris Johns <chrisj at rtems.org> wrote:
> >      >      >
> >      >      > On 9/5/20 11:30 am, Gedare Bloom wrote:
> >      >      > > On Wed, May 6, 2020 at 5:12 AM Chris Johns <chrisj at rtems.org> wrote:
> >      >      > >>
> >      >      > >>
> >      >      > >>> On 6 May 2020, at 8:15 pm, Sebastian Huber <sebastian.huber at embedded-brains.de> wrote:
> >      >      > >>>
> >      >      > >>> On 06/05/2020 12:00, Chris Johns wrote:
> >      >      > >>>
> >      >      > >>>>> On 6/5/20 7:35 pm, Sebastian Huber wrote:
> >      >      > >>>>>> On 06/05/2020 10:41, chrisj at rtems.org wrote:
> >      >      > >>>>>
> >      >      > >>>>>> From: Chris Johns <chrisj at rtems.org>
> >      >      > >>>>>>
> >      >      > >>>>>> Updates #2962
> >      >      > >>>>>> ---
> >      >      > >>>>>>    bsps/powerpc/psim/config/psim-testsuite.tcfg | 22
> >      >     ++++++++++++++++++++
> >      >      > >>>>>>    1 file changed, 22 insertions(+)
> >      >      > >>>>>>    create mode 100644
> >      >     bsps/powerpc/psim/config/psim-testsuite.tcfg
> >      >      > >>>>>>
> >      >      > >>>>>> diff --git
> >     a/bsps/powerpc/psim/config/psim-testsuite.tcfg
> >      >     b/bsps/powerpc/psim/config/psim-testsuite.tcfg
> >      >      > >>>>>> new file mode 100644
> >      >      > >>>>>> index 0000000000..b0d2a05086
> >      >      > >>>>>> --- /dev/null
> >      >      > >>>>>> +++ b/bsps/powerpc/psim/config/psim-testsuite.tcfg
> >      >      > >>>>>> @@ -0,0 +1,22 @@
> >      >      > >>>>>> +#
> >      >      > >>>>>> +# PSIM RTEMS Test Database.
> >      >      > >>>>>> +#
> >      >      > >>>>>> +# Format is one line per test that is_NOT_  built.
> >      >      > >>>>>> +#
> >      >      > >>>>>> +
> >      >      > >>>>>> +expected-fail: fsimfsgeneric01
> >      >      > >>>>>> +expected-fail: block11
> >      >      > >>>>>> +expected-fail: rbheap01
> >      >      > >>>>>> +expected-fail: termios01
> >      >      > >>>>>> +expected-fail: ttest01
> >      >      > >>>>>> +expected-fail: psx12
> >      >      > >>>>>> +expected-fail: psxchroot01
> >      >      > >>>>>> +expected-fail: psxfenv01
> >      >      > >>>>>> +expected-fail: psximfs02
> >      >      > >>>>>> +expected-fail: psxpipe01
> >      >      > >>>>>> +expected-fail: spextensions01
> >      >      > >>>>>> +expected-fail: spfatal31
> >      >      > >>>>>> +expected-fail: spfifo02
> >      >      > >>>>>> +expected-fail: spmountmgr01
> >      >      > >>>>>> +expected-fail: spprivenv01
> >      >      > >>>>>> +expected-fail: spstdthreads01
> >      >      > >>>>>
> >      >      > >>>>> I don't think these tests are expected to fail. If
> they
> >      >     fail, then there is a bug somewhere.
> >      >      > >>>>
> >      >      > >>>> Yes we hope no tests fail but they can and do.
> Excluding
> >      >     tests because they fail would be incorrect. In the 5.1
> >     release these
> >      >     bugs are present so we expect, or maybe it should say, we
> >     know the
> >      >     test will fail. With this change anything that appears in the
> >      >     failure column is "unexpected" and that means the user build
> >     of the
> >      >     release does not match the state we "expect" and it is worth
> >      >     investigation by the user.
> >      >      > >>>>
> >      >      > >>>> Without these tests being tagged this way the user
> would
> >      >     have no idea where they stand after a build and test run and
> that
> >      >     would mean we would have to make sure a release has no
> >     failures. I
> >      >     consider that neither practical nor realistic.
> >      >      > >>> Maybe we need another state, e.g.
> >      >     something-is-broken-please-fix-it.
> >      >      > >>
> >      >      > >> I do not think so, it is implicit in the failure or the
> >     test
> >      >     is broken. The only change is to add unexpected-pass, which
> >     will be
> >      >     on master after the 5 branch.
> >      >      > >>
> >      >      > >
> >      >      > > I disagree with this in principle,
> >      >      >
> >      >      > I did not invent this, it is borrowed from gcc. I
> >     considered their
> >      >      > mature test model as OK to follow. Look for "How to
> >     interpret test
> >      >      > results" in https://gcc.gnu.org/install/test.html.
> >      >      >
> >      >      > We have ...
> >      >      >
> >      >      >
> >      >
> >
> https://docs.rtems.org/branches/master/user/testing/tests.html#test-controls
> >      >      >
> >      >      > Is the principle the two points below?
> >      >      >
> >      >      > > and it should be reverted after we branch 5.
> >      >      >
> >      >      > I would like to understand how regressions are to be
> tracked
> >      >     before we
> >      >      > revert the change. Until this change you could not track
> >     them. We
> >      >     need
> >      >      > to capture the state somehow and I view capturing the
> state in
> >      >     the tests
> >      >      > themselves as the best method.
> >      >      >
> >      >      > > It's fine for now to get the release state sync'd, but we
> >      >      >
> >      >      > I am not following why we would only track regressions
> on a
> >      >     release
> >      >      > branch?
> >      >      >
> >      >      > > should find a long-term solution that distinguishes the
> >     cases:
> >      >      > > 1. we don't expect this test to pass on this bsp
> >      >      >
> >      >      > If a test cannot pass on a BSP for a specific reason it is
> >      >     excluded and
> >      >      > not built, e.g. not enough memory, single core. A test is
> >     expected to
> >      >      > fail because of a bug or missing feature we will not or
> >     cannot fix or
> >      >      > implement so we tag it as expected-fail or by default the
> >     test is
> >      >     tagged
> >      >      > as expected-pass. If a test may or may not pass because of
> >     some edge
> >      >      > case in a BSP it can be tagged 'indeterminate'.
> >      >      >
> >      >      > > 2. we expect this test to pass, but know it doesn't
> >     currently
> >      >      >
> >      >      > This depends on a point in time. After a change I make I
> would
> >      >     consider
> >      >      > this a regression and I would need to see what I have done
> >     in my
> >      >     change
> >      >      > to cause it. For this to happen we need a baseline where
> the
> >      >     tests that
> >      >      > fail because of a known bug or missing feature at the time
> >     I add my
> >      >      > change are tagged as expected to fail.
> >      >      >
> >      >      > An example is dl06 on the beagleboneblack:
> >      >      >
> >      >      >
> https://lists.rtems.org/pipermail/build/2020-May/014695.html
> >      >      >
> >      >      > The RAP needs to support trampolines and it does not so
> >     the test is
> >      >      > expected to fail.
> >      >      >
> >      >      > An example of a regression is a test that passes in a
> >     specific build
> >      >      > configuration and fails in another. These recent psim
> results
> >      >     from Joel
> >      >      > show this where the build without RTEMS_DEBUG passes and
> with
> >      >      > RTEMS_DEBUG fails. Here there are 2 regressions:
> >      >      >
> >      >      >
> https://lists.rtems.org/pipermail/build/2020-May/014943.html
> >      >      >
> https://lists.rtems.org/pipermail/build/2020-May/014946.html
> >      >      >
> >      >      > The regression in fsrfsbitmap01.exe with RTEMS_DEBUG
> >     explains the
> >      >      > timeout in the no RTEMS_DEBUG version. I had not noticed
> >     this before.
> >      >      > They are hard to notice without a baseline in each BSP and
> >      >     expecting us
> >      >      > to have 100% pass on all BSPs in all testing
> configurations,
> >      >     especially
> >      >      > simulation, is too hard.
> >      >      >
> >      >      > My hope is a simple rule "If you do not see 0 fails you
> >     need to check
> >      >      > your changes".
> >      >      >
> >      >      > > They are two very different things, and I don't like
> >     conflating
> >      >     them
> >      >      > > into one "expected-fail" case
> >      >      >
> >      >      > Sorry, I am not following. Would you be able to provide
> >     some examples
> >      >      > for 1. and 2. that may help me understand the issue?
> >      >      >
> >      >
> >      >     Yes. There are tests that "pass" by failing, such as the
> >     intrcritical
> >      >     tests.  These are tests that are expected to fail, always and
> >     forever,
> >      >     and are not worth looking at further if they are failing. An
> >      >     expected-fail that passes is, then, a bug/regression.
> >      >
> >      >     Then there are tests we have triaged and identified as bugs,
> >     which
> >      >     could be tagged by something such as "known-failure" that is
> not
> >      >     expected but we know it happens. This would be like spfenv
> >     tests where
> >      >     the support doesn't exist yet, or like the dl06.  These are
> >     tests that
> >      >     should be passing some day, but they are not right now. Yes,
> >      >     "known-failure" encodes a notion of time, but we must have a
> >     notion of
> >      >     time, because a regression is time-sensitive as well. The
> idea of
> >      >     "known-failure" is just a subset of what you have added to the
> >      >     "expected-failure" column. It would just be another reported
> >     statistic
> >      >     to add just like Timeouts or Benchmarks.
> >      >
> >      >
> >      > I'm concerned that we are not making a distinction between
> >     investigated
> >      > and known failures and deficiencies which have tickets and should
> >     work
> >      > if X is fixed. The Beagle issue and many of the jmr3904 failures
> >     are in
> >      > this category. Known failure should indicate a certainty that it
> >     can't
> >      > be made to work per someone who investigated. You can't add
> >     memory, the
> >      > simulator catches an invalid access before the trap handler, etc.
> As
> >      > opposed to all the TLS tests which fail because it isn't
> >     supported on an
> >      > architecture.
> >      >
> >      > Can we make a distinction between those two conditions? Something
> >     like
> >      > failure accepted pending investigation versus fails and explained
> >     versus
> >      > known failure?
> >      >
> >      > Known failure has a  comment explaining it
> >
> >     Comment where? This is the thread that pulls the design.
> >
> >      > Maybe a known issue which has a comment and ticket.
> >
> >     A ticket is a good place.
> >
> >      > Pending investigation for these you are flagging. Noted as known
> >     but no
> >      > explanation. Can serve as future tasks pool.
> >
> >     This seems like process management or resolution management. The
> >     purpose
> >     here is to automate regression detection. Separate databases or files
> >     containing values break a number of other tester requirements or it
> >     complicates the management and tester.
> >
> >      > Is that all that's needed? We don't want to lose the information
> >     that we
> >      > think these likely should pass but they haven't been investigated.
> >      >
> >      > Otherwise we have lost that no one has explained the situation.
> >
> >     For me the states are from the tester's point of view and the state
> is
> >     only metadata for the tester and nothing more. The tester is simpler
> >     when the states we deal with are simpler.
> >
> >     Can you please explain how I determine if my build of a BSP has any
> >     regressions over what the release has? I am fine with whatever labels
> >     you think should be used, or even more if you want them, but there
> >     needs to be a base requirement that a new user can build a BSP,
> >     test it,
> >     and know whether it meets the same standard as the release.
> >
> >
> > We don't have a magic database.
>
> We have the tcfg files. A database is hard for a range of reasons.
>
> > We have mailing list archives with
> > results. I would have to check if my bsp has some results and compare
> > them by hand. If my results match those posted, that's it. You couldn't
> > do that before this release.
>
> That assumes those results are golden and they may not be.
>
> > If you want no unexplained failures, then you are raising the bar.
> > Marking all as expected is wrong. They are just not investigated yet.
> > Adding that state is the realistic answer if you don't want any
> > unexpected failures.
> >
> > And I said to comment in the tcfg for explained expected failures. For
> > unexplained ones there is nothing to say.
>
> The states are from the tester's point of view to allow us to machine
> check for regressions. There was never any intention to characterise the
> test with these labels but this seems to be what is happening. It seems
> the label and the state are confusing so I can add another state to catch
> unexplained failures. Is unexplained-fail OK?
>

If it is just to move things from unexpected failure to another category,
I'm okay with it. I just don't want tests written off as expected
failures when they haven't been investigated. We have a fair number of those.
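
To sketch what that could look like in a tcfg, using examples pulled from
this thread (the expected-fail line is the format from this patch, the
unexplained-fail state is only the one proposed above, and the ticket
number is just a placeholder):

  #
  # Known failure: the RAP format lacks trampoline support, see ticket #NNNN.
  #
  expected-fail: dl06

  # Fails, but nobody has investigated or explained it yet.
  unexplained-fail: spfatal31

The comment carries the explanation and ticket, and the state name carries
the triage status, so the tester can still machine check for regressions.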

We also have the situation where some BSPs run on multiple simulators and
real hardware and the results don't always align. The tcfg file doesn't
capture that either.

But at least we aren't putting tests in a bin where they will be
ignored forever. That's a step forward and explainable as a state and a
work activity.

>
> We need to move forward from what we have and I think reverting the
> patch after a release is a step backwards.
>
> Chris
>