[GSoC - x86_64] Pre-merge issues (at -O2 optimization level) and WIP review

Tue Aug 14 14:04:46 UTC 2018

On Sun, Aug 12, 2018 at 4:47 PM, Amaan Cheval <amaan.cheval at gmail.com>
wrote:

> Hi!
>
> I've narrowed the issue down to this bintime function:
> https://github.com/RTEMS/rtems/blob/b2de4260c5c71e518742731a8cdebe
> 3411937181/cpukit/score/src/kern_tc.c#L548
>
> The watchdog ticks in _Per_CPU_Information / Clock_driver_ticks are at
> "1000", when that function is called (rtems_clock_get_tod ->
> _TOD_Get_timeval -> _Timecounter_Microtime -> microtime). The bt and
> tvp values there are:
>
> (gdb) p bt
> $2 = {sec = 599562004, frac = 18446744073709551536}
> (gdb) p *tvp
> $3 = {tv_sec = 599562004, tv_usec = 999999}
>
> The full (relevant) debug log for the "wrong" timing despite the
> Clock_driver_ticks being correct is here:
> https://gist.github.com/AmaanC/c59caf5232b03054d457dcacb5ab1c54
>
> I'm quite unfamiliar with how the low-level internals work and it
> looks like it comes from FreeBSD. This is likely a bug from the
> timecounter being "too" precise - it dispatches the task at _exactly_
> the tc_freq it promised - if it slips by 1 tick, then the values start
> looking correct.
>
> This looks more like an off-by-one in the low-level code, in that
> case, since my clock driver's timecounter returns exactly the value it
> ought to be returning (100 when 1 second has passed, for eg., when the
> tc_frequency=100Hz - in that case the bintime's returned "now.tv_sec"
> value in clockgettod.c causes the wrong second to be set in
> "time_buffer").
>
> https://github.com/AmaanC/rtems-gsoc18/blob/ac/daily-03-
> post-hello/bsps/x86_64/amd64/clock/clock.c#L51
>
> On Sun, Aug 12, 2018 at 2:48 PM, Amaan Cheval <amaan.cheval at gmail.com>
> wrote:
> > There's another issue I'm having now:
> >
> > At -O0, ticker.exe works well and passes reliably. At -O2, the TOD
> > seems to be rushed a bit:
> >
> > TA1  - rtems_clock_get_tod - 09:00:00   12/31/1988
> > TA2  - rtems_clock_get_tod - 09:00:00   12/31/1988
> > TA3  - rtems_clock_get_tod - 09:00:00   12/31/1988
> > TA1  - rtems_clock_get_tod - 09:00:04   12/31/1988
> > TA2  - rtems_clock_get_tod - 09:00:09   12/31/1988
> > TA1  - rtems_clock_get_tod - 09:00:09   12/31/1988
> > TA3  - rtems_clock_get_tod - 09:00:14   12/31/1988
> > TA1  - rtems_clock_get_tod - 09:00:14   12/31/1988
> > TA2  - rtems_clock_get_tod - 09:00:19   12/31/1988
> > TA1  - rtems_clock_get_tod - 09:00:19   12/31/1988
> > TA1  - rtems_clock_get_tod - 09:00:24   12/31/1988
> > TA3  - rtems_clock_get_tod - 09:00:29   12/31/1988
> > TA2  - rtems_clock_get_tod - 09:00:29   12/31/1988
> > TA1  - rtems_clock_get_tod - 09:00:29   12/31/1988
> > TA1  - rtems_clock_get_tod - 09:00:34   12/31/1988
> >
> > I'm not sure what it could be - I suspected my get_timecount somehow
> > not realizing that Clock_driver_ticks was volatile, but that seems to
> > be in order. The relevant code is here:
> > https://github.com/AmaanC/rtems-gsoc18/blob/ac/daily-03-
> post-hello/bsps/x86_64/amd64/clock/clock.c
> >
> > On Sun, Aug 12, 2018 at 3:43 AM, Amaan Cheval <amaan.cheval at gmail.com>
> wrote:
> >> Figured it out; turns out my code to align the stack so I could make
> >> calls without raising exceptions was messing up and corrupting the
> >> stack-pointer.
> >>
> >> Running the -O2 code now makes the clock run a bit too quickly - the
> >> calibration may have a minor issue. I'll fix that up and send patches
> >> tomorrow or Monday hopefully.
> >>
> >> I'll be traveling Tuesday, so I'd appreciate if we can get them merged
> >> upstream Monday itself - I'm okay to have a call and walk someone
> >> through the patches and whatnot if need be.
> >>
> >> Cheers!
> >>
> >> On Sun, Aug 12, 2018 at 1:25 AM, Amaan Cheval <amaan.cheval at gmail.com>
> wrote:
> >>> Hi!
> >>>
> >>> In the process of cleaning my work up, I've run into an odd problem
> >>> which only shows up when I set the optimization level to -O2. At -O0,
> >>> it's perfectly fine.
> >>>
> >>> The issue is that somehow execution ends up at address 0x0.
> >>>
> >>> This likely happens due to a _CPU_Context_switch, where the rsp is set
> >>> to a corrupted value, leading to a corrupt (i.e. 0) return address at
> >>> the end of the context switch.
> >>>
> >>> What's curious is that this corruption _seems_ to occur in
> >>> _ISR_Handler's call to _Thread_Dispatch, by somehow messing the value
> >>> of rsp up - I honestly don't know this for sure because gdb says one
> >>> thing (i.e. that rsp = 0), but setting up some code (cmpq $0, rsp) to
> >>> check this seems to say rsp is non-zero, at least.
> >>>
> >>> This is an odd heisenbug I'd like to investigate for sure - I just
> >>> thought I'd shoot this email out because:
> >>>
> >>> - If I can't figure it out tomorrow, soon, I'll just drop it so I can
> >>> create more logical commits to send as patches upstream (thereby
> >>> leaving -O0 upstream, at least temporarily)
> >>>
> >>> - If anyone's seen an odd stack corruption like this, or has any
> >>> advice on debugging it, could you let me know? I suspect something
> >>> like interrupting tasks which ought not to be interrupted (perhaps I
> >>> forgot to implement some kind of "CPU_ISR_Disable") - is there
> >>> anything you can think of of that sort?
> >>>
> >>> Also, here's a Github PR like last time with all the work (just for
> >>> the overall changes, not the specific commits!). I'd appreciate a
> >>> quick review if anyone could - sorry about sending this out over the
> >>> weekend! I've had a surprising share of Heisenbugs with QEMU in the
> >>> past week.
> >>>
> >>> https://github.com/AmaanC/rtems-gsoc18/pull/3/files
>

If I have followed this thread correctly, you have solved the -O0 vs -O2
issue by changing/fixing the stack alignment.

There is still a mystery about the clock math. If this is the case, then
have
you run the nsecs test? It tries to push some degenerate cases known to
trip up clock drivers.

Sebastian is on holiday so I would start a new thread with the clock issue
on Friday or Saturday so he is more likely to see it when he gets back.

Is the PIC support in place enough to build libbsd? It should be worth
starting
a thread to identify the path to SMP and the path to libbsd support.

And while you remember things, is the CPU supplement filled in for the
x86_64?

--joel
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.rtems.org/pipermail/devel/attachments/20180814/4a98f033/attachment.html>