<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Sun, Aug 12, 2018 at 4:47 PM, Amaan Cheval <span dir="ltr"><<a href="mailto:amaan.cheval@gmail.com" target="_blank">amaan.cheval@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi!<br>
<br>
I've narrowed the issue down to this bintime function:<br>
<a href="https://github.com/RTEMS/rtems/blob/b2de4260c5c71e518742731a8cdebe3411937181/cpukit/score/src/kern_tc.c#L548" rel="noreferrer" target="_blank">https://github.com/RTEMS/rtems/blob/b2de4260c5c71e518742731a8cdebe3411937181/cpukit/score/src/kern_tc.c#L548</a><br>
<br>
The watchdog ticks in _Per_CPU_Information / Clock_driver_ticks are at<br>
"1000" when that function is called (rtems_clock_get_tod -><br>
_TOD_Get_timeval -> _Timecounter_Microtime -> microtime). The bt and<br>
tvp values at that point are:<br>
<br>
(gdb) p bt<br>
$2 = {sec = 599562004, frac = 18446744073709551536}<br>
(gdb) p *tvp<br>
$3 = {tv_sec = 599562004, tv_usec = 999999}<br>
<br>
The full (relevant) debug log for the "wrong" timing despite the<br>
Clock_driver_ticks being correct is here:<br>
<a href="https://gist.github.com/AmaanC/c59caf5232b03054d457dcacb5ab1c54" rel="noreferrer" target="_blank">https://gist.github.com/AmaanC/c59caf5232b03054d457dcacb5ab1c54</a><br>
<br>
I'm quite unfamiliar with how the low-level internals work; it looks<br>
like this code comes from FreeBSD. This is likely a bug caused by the<br>
timecounter being "too" precise - it delivers _exactly_ the<br>
tc_frequency counts per second it promised - if it slips by 1 tick,<br>
the values start looking correct.<br>
<br>
In that case this looks more like an off-by-one in the low-level code,<br>
since my clock driver's timecounter returns exactly the value it<br>
ought to return (e.g. 100 when 1 second has passed and<br>
tc_frequency = 100 Hz) - the "now.tv_sec" value returned by bintime<br>
in clockgettod.c then causes the wrong second to be set in<br>
"time_buffer".<br>
<br>
<a href="https://github.com/AmaanC/rtems-gsoc18/blob/ac/daily-03-post-hello/bsps/x86_64/amd64/clock/clock.c#L51" rel="noreferrer" target="_blank">https://github.com/AmaanC/rtems-gsoc18/blob/ac/daily-03-post-hello/bsps/x86_64/amd64/clock/clock.c#L51</a><br>
<div class="HOEnZb"><div class="h5"><br>
On Sun, Aug 12, 2018 at 2:48 PM, Amaan Cheval <<a href="mailto:amaan.cheval@gmail.com">amaan.cheval@gmail.com</a>> wrote:<br>
> There's another issue I'm having now:<br>
><br>
> At -O0, ticker.exe works well and passes reliably. At -O2, the TOD<br>
> seems to run a bit fast:<br>
><br>
> TA1 - rtems_clock_get_tod - 09:00:00 12/31/1988<br>
> TA2 - rtems_clock_get_tod - 09:00:00 12/31/1988<br>
> TA3 - rtems_clock_get_tod - 09:00:00 12/31/1988<br>
> TA1 - rtems_clock_get_tod - 09:00:04 12/31/1988<br>
> TA2 - rtems_clock_get_tod - 09:00:09 12/31/1988<br>
> TA1 - rtems_clock_get_tod - 09:00:09 12/31/1988<br>
> TA3 - rtems_clock_get_tod - 09:00:14 12/31/1988<br>
> TA1 - rtems_clock_get_tod - 09:00:14 12/31/1988<br>
> TA2 - rtems_clock_get_tod - 09:00:19 12/31/1988<br>
> TA1 - rtems_clock_get_tod - 09:00:19 12/31/1988<br>
> TA1 - rtems_clock_get_tod - 09:00:24 12/31/1988<br>
> TA3 - rtems_clock_get_tod - 09:00:29 12/31/1988<br>
> TA2 - rtems_clock_get_tod - 09:00:29 12/31/1988<br>
> TA1 - rtems_clock_get_tod - 09:00:29 12/31/1988<br>
> TA1 - rtems_clock_get_tod - 09:00:34 12/31/1988<br>
><br>
> I'm not sure what it could be - I suspected that my get_timecount<br>
> was not treating Clock_driver_ticks as volatile, but that seems to<br>
> be in order. The relevant code is here:<br>
> <a href="https://github.com/AmaanC/rtems-gsoc18/blob/ac/daily-03-post-hello/bsps/x86_64/amd64/clock/clock.c" rel="noreferrer" target="_blank">https://github.com/AmaanC/rtems-gsoc18/blob/ac/daily-03-post-hello/bsps/x86_64/amd64/clock/clock.c</a><br>
><br>
> On Sun, Aug 12, 2018 at 3:43 AM, Amaan Cheval <<a href="mailto:amaan.cheval@gmail.com">amaan.cheval@gmail.com</a>> wrote:<br>
>> Figured it out; it turns out the code I added to align the stack (so<br>
>> I could make calls without raising exceptions) was itself corrupting<br>
>> the stack pointer.<br>
>><br>
>> Running the -O2 code now makes the clock run a bit too quickly - the<br>
>> calibration may have a minor issue. I'll fix that up and send patches<br>
>> tomorrow or Monday hopefully.<br>
>><br>
>> I'll be traveling on Tuesday, so I'd appreciate it if we could get<br>
>> them merged upstream on Monday itself - I'm happy to have a call and<br>
>> walk someone through the patches if need be.<br>
>><br>
>> Cheers!<br>
>><br>
>> On Sun, Aug 12, 2018 at 1:25 AM, Amaan Cheval <<a href="mailto:amaan.cheval@gmail.com">amaan.cheval@gmail.com</a>> wrote:<br>
>>> Hi!<br>
>>><br>
>>> In the process of cleaning my work up, I've run into an odd problem<br>
>>> which only shows up when I set the optimization level to -O2. At -O0,<br>
>>> it's perfectly fine.<br>
>>><br>
>>> The issue is that somehow execution ends up at address 0x0.<br>
>>><br>
>>> This likely happens due to a _CPU_Context_switch, where the rsp is set<br>
>>> to a corrupted value, leading to a corrupt (i.e. 0) return address at<br>
>>> the end of the context switch.<br>
>>><br>
>>> What's curious is that this corruption _seems_ to occur in<br>
>>> _ISR_Handler's call to _Thread_Dispatch, which somehow mangles the<br>
>>> value of rsp - I don't know this for sure, because gdb says one<br>
>>> thing (that rsp = 0), but a check added in code (cmpq $0, %rsp)<br>
>>> suggests rsp is non-zero, at least.<br>
>>><br>
>>> This is an odd heisenbug I'd like to investigate for sure - I just<br>
>>> thought I'd shoot this email out because:<br>
>>><br>
>>> - If I can't figure it out soon (i.e. tomorrow), I'll drop it so I<br>
>>> can create more logical commits to send as patches upstream (thereby<br>
>>> leaving -O0 upstream, at least temporarily)<br>
>>><br>
>>> - If anyone has seen an odd stack corruption like this, or has any<br>
>>> advice on debugging it, could you let me know? I suspect something<br>
>>> like interrupting tasks which ought not to be interrupted (perhaps I<br>
>>> forgot to implement some kind of "CPU_ISR_Disable") - is there<br>
>>> anything of that sort you can think of?<br>
>>><br>
>>> Also, here's a Github PR like last time with all the work (just for<br>
>>> the overall changes, not the specific commits!). I'd appreciate a<br>
>>> quick review if anyone could - sorry about sending this out over the<br>
>>> weekend! I've had a surprising share of Heisenbugs with QEMU in the<br>
>>> past week.<br>
>>><br>
>>> <a href="https://github.com/AmaanC/rtems-gsoc18/pull/3/files" rel="noreferrer" target="_blank">https://github.com/AmaanC/rtems-gsoc18/pull/3/files</a></div></div></blockquote><div><br></div><div>If I have followed this thread correctly, you have solved the -O0 vs -O2</div><div>issue by changing/fixing the stack alignment.</div><div><br></div><div>There is still a mystery about the clock math. If that is the case, have</div><div>you run the nsecs test? It tries to push some degenerate cases known to</div><div>trip up clock drivers.</div><div><br></div><div>Sebastian is on holiday, so I would start a new thread with the clock issue</div><div>on Friday or Saturday so he is more likely to see it when he gets back.</div><div><br></div><div>Is the PIC support in place enough to build libbsd? It would be worth starting</div><div>a thread to identify the path to SMP and the path to libbsd support.</div><div><br></div><div>And while you remember things, is the CPU supplement filled in for the</div><div>x86_64?</div><div><br></div><div>--joel</div></div><br></div></div>