[GSoC - x86_64] Pre-merge issues (at -O2 optimization level) and WIP review
amaan.cheval at gmail.com
Sun Aug 12 09:18:06 UTC 2018
There's another issue I'm having now:
At -O0, ticker.exe works and passes reliably. At -O2, the TOD seems
to run slightly fast - TA1 prints every 5 seconds, but its first
interval below is only 4 seconds:
TA1 - rtems_clock_get_tod - 09:00:00 12/31/1988
TA2 - rtems_clock_get_tod - 09:00:00 12/31/1988
TA3 - rtems_clock_get_tod - 09:00:00 12/31/1988
TA1 - rtems_clock_get_tod - 09:00:04 12/31/1988
TA2 - rtems_clock_get_tod - 09:00:09 12/31/1988
TA1 - rtems_clock_get_tod - 09:00:09 12/31/1988
TA3 - rtems_clock_get_tod - 09:00:14 12/31/1988
TA1 - rtems_clock_get_tod - 09:00:14 12/31/1988
TA2 - rtems_clock_get_tod - 09:00:19 12/31/1988
TA1 - rtems_clock_get_tod - 09:00:19 12/31/1988
TA1 - rtems_clock_get_tod - 09:00:24 12/31/1988
TA3 - rtems_clock_get_tod - 09:00:29 12/31/1988
TA2 - rtems_clock_get_tod - 09:00:29 12/31/1988
TA1 - rtems_clock_get_tod - 09:00:29 12/31/1988
TA1 - rtems_clock_get_tod - 09:00:34 12/31/1988
I'm not sure what the cause is - I suspected that my get_timecount
handler was somehow not treating Clock_driver_ticks as volatile (and
so returning a stale tick count), but that seems to be in order. The
relevant code is here:
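In outline, all the handler needs to do is return the current tick
count. A minimal sketch of the pattern (assuming the standard
rtems_timecounter interface; the amd64_* names and tick_frequency_hz
are illustrative, not the exact code in my branch):

    #include <rtems/timecounter.h>

    /* Bumped by the clock ISR on every timer tick. */
    volatile uint32_t Clock_driver_ticks;

    /* Hypothetical: whatever value the timer calibration computed. */
    extern uint32_t tick_frequency_hz;

    static struct timecounter amd64_tc;

    static uint32_t amd64_get_timecount(struct timecounter *tc)
    {
      (void) tc;
      /* Clock_driver_ticks is volatile, so this is a fresh load on
         every call rather than a value cached in a register. */
      return Clock_driver_ticks;
    }

    static void amd64_clock_initialize(void)
    {
      amd64_tc.tc_get_timecount = amd64_get_timecount;
      amd64_tc.tc_counter_mask  = 0xffffffff;
      amd64_tc.tc_frequency     = tick_frequency_hz;
      amd64_tc.tc_quality       = RTEMS_TIMECOUNTER_QUALITY_CLOCK_DRIVER;
      rtems_timecounter_install(&amd64_tc);
    }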
On Sun, Aug 12, 2018 at 3:43 AM, Amaan Cheval <amaan.cheval at gmail.com> wrote:
> Figured it out; turns out my code to align the stack so I could
> make calls without raising exceptions was messing up and corrupting
> the stack instead.
> Running the -O2 code now makes the clock run a bit too quickly - the
> calibration may have a minor issue. I'll fix that up and send
> patches tomorrow or Monday, hopefully.
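> For reference, the invariant the alignment code has to maintain: the
> SysV x86-64 ABI requires rsp to be 16-byte aligned at every call
> instruction (GCC at -O2 emits aligned SSE stores like movaps that
> fault otherwise). A standalone illustration of the align-call-restore
> idea (illustrative only, not the port's actual stub):
>
>     /* call_aligned(f): call f() with rsp forced to 16-byte
>      * alignment, then restore the original rsp afterwards. */
>     void call_aligned(void (*f)(void));
>     __asm__ (
>       ".globl call_aligned\n"
>       "call_aligned:\n"
>       "  pushq %rbp\n"       /* rsp is 16-byte aligned after this */
>       "  movq  %rsp, %rbp\n" /* remember the pre-alignment rsp */
>       "  andq  $-16, %rsp\n" /* align down; no-op if already aligned */
>       "  callq *%rdi\n"      /* f arrives in %rdi per the ABI */
>       "  movq  %rbp, %rsp\n" /* undo the alignment */
>       "  popq  %rbp\n"
>       "  ret\n"
>     );
>
> The part that's easy to get wrong is saving the pre-alignment rsp
> somewhere that survives the call - which is where my code was
> slipping up.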
> I'll be traveling Tuesday, so I'd appreciate it if we could get
> them merged upstream on Monday itself - I'm happy to have a call and
> walk someone through the patches if need be.
> On Sun, Aug 12, 2018 at 1:25 AM, Amaan Cheval <amaan.cheval at gmail.com> wrote:
>> In the process of cleaning my work up, I've run into an odd problem
>> which only shows up when I set the optimization level to -O2. At -O0,
>> it's perfectly fine.
>> The issue is that somehow execution ends up at address 0x0.
>> This likely happens in _CPU_Context_switch, where rsp is loaded
>> with a corrupted value, so the ret at the end of the context switch
>> pops a corrupt (i.e. 0) return address.
>> What's curious is that this corruption _seems_ to occur during
>> _ISR_Handler's call to _Thread_Dispatch, which somehow mangles the
>> value of rsp - I don't know this for sure, because gdb says one
>> thing (that rsp = 0), but an inline check (a cmpq $0, %rsp) seems
>> to say rsp is non-zero, at least.
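>> For the record, the C-level equivalent of that check is roughly
>> this (a hypothetical debug helper, not committed code - it parks
>> the CPU so the wedged state can be inspected from gdb):
>>
>>     #include <stdint.h>
>>
>>     static inline void check_rsp_nonzero(void)
>>     {
>>       uint64_t rsp;
>>
>>       /* Snapshot rsp at this point in the code. */
>>       __asm__ volatile ("movq %%rsp, %0" : "=r" (rsp));
>>       if (rsp == 0) {
>>         /* Halt with interrupts off; attach gdb and inspect. */
>>         __asm__ volatile ("cli; hlt");
>>       }
>>     }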
>> This is an odd heisenbug I'd like to investigate for sure - I just
>> thought I'd shoot this email out because:
>> - If I can't figure it out soon, I'll drop it for now so I can
>> create more logical commits to send as patches upstream (thereby
>> leaving -O0 upstream, at least temporarily)
>> - If anyone has seen an odd stack corruption like this, or has any
>> advice on debugging it, could you let me know? I suspect something
>> like interrupting tasks that ought not to be interrupted (perhaps I
>> forgot to implement some kind of "CPU_ISR_Disable" - a sketch of
>> what I mean follows below) - is there anything of that sort you can
>> think of?
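>> Something along the lines of the i386 port's hooks is what I mean
>> (a rough sketch - the macro names are from the RTEMS CPU port
>> interface, the bodies are illustrative and untested; _level is
>> assumed to be a 64-bit lvalue here):
>>
>>     /* Save RFLAGS into _level and disable interrupts. */
>>     #define _CPU_ISR_Disable( _level ) \
>>       __asm__ volatile ( "pushfq; cli; popq %0" \
>>                          : "=r" (_level) : : "memory" )
>>
>>     /* Restore the saved RFLAGS, re-enabling interrupts if they
>>        were enabled at the matching _CPU_ISR_Disable. */
>>     #define _CPU_ISR_Enable( _level ) \
>>       __asm__ volatile ( "pushq %0; popfq" \
>>                          : : "r" (_level) : "memory", "cc" )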
>> Also, here's a GitHub PR like last time with all the work (just
>> for the overall changes, not the specific commits!). I'd appreciate
>> a quick review if anyone can spare the time - sorry about sending
>> this out over the weekend! I've had a surprising share of
>> heisenbugs with QEMU this past week.