[GSoC - x86_64] Pre-merge issues (at -O2 optimization level) and WIP review

Thu Aug 16 13:25:59 UTC 2018

On Tue, Aug 14, 2018 at 7:34 PM, Joel Sherrill <joel at rtems.org> wrote:
>
>
> On Sun, Aug 12, 2018 at 4:47 PM, Amaan Cheval <amaan.cheval at gmail.com>
> wrote:
>>
>> Hi!
>>
>> I've narrowed the issue down to this bintime function:
>>
>> https://github.com/RTEMS/rtems/blob/b2de4260c5c71e518742731a8cdebe3411937181/cpukit/score/src/kern_tc.c#L548
>>
>> The watchdog ticks in _Per_CPU_Information / Clock_driver_ticks are at
>> "1000", when that function is called (rtems_clock_get_tod ->
>> _TOD_Get_timeval -> _Timecounter_Microtime -> microtime). The bt and
>> tvp values there are:
>>
>> (gdb) p bt
>> $2 = {sec = 599562004, frac = 18446744073709551536}
>> (gdb) p *tvp
>> $3 = {tv_sec = 599562004, tv_usec = 999999}
>>
>> The full (relevant) debug log for the "wrong" timing despite the
>> Clock_driver_ticks being correct is here:
>> https://gist.github.com/AmaanC/c59caf5232b03054d457dcacb5ab1c54
>>
>> I'm quite unfamiliar with how the low-level internals work and it
>> looks like it comes from FreeBSD. This is likely a bug from the
>> timecounter being "too" precise - it dispatches the task at _exactly_
>> the tc_freq it promised - if it slips by 1 tick, then the values start
>> looking correct.
>>
>> This looks more like an off-by-one in the low-level code, in that
>> case, since my clock driver's timecounter returns exactly the value it
>> ought to be returning (100 when 1 second has passed, for eg., when the
>> tc_frequency=100Hz - in that case the bintime's returned "now.tv_sec"
>> value in clockgettod.c causes the wrong second to be set in
>> "time_buffer").
>>
>>
>> https://github.com/AmaanC/rtems-gsoc18/blob/ac/daily-03-post-hello/bsps/x86_64/amd64/clock/clock.c#L51
>>
>> On Sun, Aug 12, 2018 at 2:48 PM, Amaan Cheval <amaan.cheval at gmail.com>
>> wrote:
>> > There's another issue I'm having now:
>> >
>> > At -O0, ticker.exe works well and passes reliably. At -O2, the TOD
>> > seems to be rushed a bit:
>> >
>> > TA1  - rtems_clock_get_tod - 09:00:00   12/31/1988
>> > TA2  - rtems_clock_get_tod - 09:00:00   12/31/1988
>> > TA3  - rtems_clock_get_tod - 09:00:00   12/31/1988
>> > TA1  - rtems_clock_get_tod - 09:00:04   12/31/1988
>> > TA2  - rtems_clock_get_tod - 09:00:09   12/31/1988
>> > TA1  - rtems_clock_get_tod - 09:00:09   12/31/1988
>> > TA3  - rtems_clock_get_tod - 09:00:14   12/31/1988
>> > TA1  - rtems_clock_get_tod - 09:00:14   12/31/1988
>> > TA2  - rtems_clock_get_tod - 09:00:19   12/31/1988
>> > TA1  - rtems_clock_get_tod - 09:00:19   12/31/1988
>> > TA1  - rtems_clock_get_tod - 09:00:24   12/31/1988
>> > TA3  - rtems_clock_get_tod - 09:00:29   12/31/1988
>> > TA2  - rtems_clock_get_tod - 09:00:29   12/31/1988
>> > TA1  - rtems_clock_get_tod - 09:00:29   12/31/1988
>> > TA1  - rtems_clock_get_tod - 09:00:34   12/31/1988
>> >
>> > I'm not sure what it could be - I suspected my get_timecount somehow
>> > not realizing that Clock_driver_ticks was volatile, but that seems to
>> > be in order. The relevant code is here:
>> >
>> > https://github.com/AmaanC/rtems-gsoc18/blob/ac/daily-03-post-hello/bsps/x86_64/amd64/clock/clock.c
>> >
>> > On Sun, Aug 12, 2018 at 3:43 AM, Amaan Cheval <amaan.cheval at gmail.com>
>> > wrote:
>> >> Figured it out; turns out my code to align the stack so I could make
>> >> calls without raising exceptions was messing up and corrupting the
>> >> stack-pointer.
>> >>
>> >> Running the -O2 code now makes the clock run a bit too quickly - the
>> >> calibration may have a minor issue. I'll fix that up and send patches
>> >> tomorrow or Monday hopefully.
>> >>
>> >> I'll be traveling Tuesday, so I'd appreciate if we can get them merged
>> >> upstream Monday itself - I'm okay to have a call and walk someone
>> >> through the patches and whatnot if need be.
>> >>
>> >> Cheers!
>> >>
>> >> On Sun, Aug 12, 2018 at 1:25 AM, Amaan Cheval <amaan.cheval at gmail.com>
>> >> wrote:
>> >>> Hi!
>> >>>
>> >>> In the process of cleaning my work up, I've run into an odd problem
>> >>> which only shows up when I set the optimization level to -O2. At -O0,
>> >>> it's perfectly fine.
>> >>>
>> >>> The issue is that somehow execution ends up at address 0x0.
>> >>>
>> >>> This likely happens due to a _CPU_Context_switch, where the rsp is set
>> >>> to a corrupted value, leading to a corrupt (i.e. 0) return address at
>> >>> the end of the context switch.
>> >>>
>> >>> What's curious is that this corruption _seems_ to occur in
>> >>> _ISR_Handler's call to _Thread_Dispatch, by somehow messing the value
>> >>> of rsp up - I honestly don't know this for sure because gdb says one
>> >>> thing (i.e. that rsp = 0), but setting up some code (cmpq $0, rsp) to
>> >>> check this seems to say rsp is non-zero, at least.
>> >>>
>> >>> This is an odd heisenbug I'd like to investigate for sure - I just
>> >>> thought I'd shoot this email out because:
>> >>>
>> >>> - If I can't figure it out tomorrow, soon, I'll just drop it so I can
>> >>> create more logical commits to send as patches upstream (thereby
>> >>> leaving -O0 upstream, at least temporarily)
>> >>>
>> >>> - If anyone's seen an odd stack corruption like this, or has any
>> >>> advice on debugging it, could you let me know? I suspect something
>> >>> like interrupting tasks which ought not to be interrupted (perhaps I
>> >>> forgot to implement some kind of "CPU_ISR_Disable") - is there
>> >>> anything you can think of of that sort?
>> >>>
>> >>> Also, here's a Github PR like last time with all the work (just for
>> >>> the overall changes, not the specific commits!). I'd appreciate a
>> >>> quick review if anyone could - sorry about sending this out over the
>> >>> weekend! I've had a surprising share of Heisenbugs with QEMU in the
>> >>> past week.
>> >>>
>> >>> https://github.com/AmaanC/rtems-gsoc18/pull/3/files
>
>
> If I have followed this thread correctly, you have solved the -O0 vs -O2
> issue by changing/fixing the stack alignment.

Right.

>
> There is still a mystery about the clock math. If this is the case, then
> have
> you run the nsecs test? It tries to push some degenerate cases known to
> trip up clock drivers.

Nice, I'll look at it, thanks! I hadn't noticed the test there.

>
> Sebastian is on holiday so I would start a new thread with the clock issue
> on Friday or Saturday so he is more likely to see it when he gets back.

Noted. With GSoC having just ended, I'll be a bit slower now since
there's lots of other life stuff to catch up on. I'll continue
investigating as I can, though!

>
> Is the PIC support in place enough to build libbsd? It should be worth
> starting
> a thread to identify the path to SMP and the path to libbsd support.

I honestly don't know - what are the requirements for libbsd?

The PIC support right now is only there to:
1. Remap the interrupt vectors (so as to not overlap with the fixed
exception vectors)
2. Mask interrupts to disable the PIC

This is done so we can use the APIC instead, but the APIC support is
only there to really enable the APIC, set the timer up, and that's
about it.
There's basically no support for the I/O APIC.

>
> And while you remember things, is the CPU supplement filled in for the
> x86_64?

Not yet. Patches incoming soon!

>
> --joel
>