[PATCH 6/6 v2] libdl/rtl-obj.c: synchronize cache after code relocation.

Thu Aug 25 05:22:28 UTC 2016

Hi Pavel,

Sorry about the long delay in getting to this email.

On 21/07/2016 11:23 AM, Pavel Pisa wrote:
> Hello Chris,
>
> On Thursday 21 of July 2016 00:22:30 Chris Johns wrote:
>> On 20/07/2016 18:55, Pavel Pisa wrote:
>>>  From be993a950be6c382b70609bcbc38be9bd161e1d4 Mon Sep 17 00:00:00 2001
>>> Message-Id: <be993a950be6c382b70609bcbc38be9bd161e1d4.1469004576.git.pisa at cmp.felk.cvut.cz>
>>> From: Pavel Pisa <pisa at cmp.felk.cvut.cz>
>>> Date: Wed, 20 Jul 2016 10:49:19 +0200
>>> Subject: [PATCH] libdl/rtl-obj.c: synchronize cache after code relocation.
>>> To: rtems-devel at rtems.org
>>>
>>> Memory content changes caused by relocation have to be
>>> propagated to memory/cache level which is used/snooped
>>> during instruction cache fill.
>>
>> This looks fine to me.
>>
>> Does this close https://devel.rtems.org/ticket/2438?
>>
>> If it does, could "Closes #2438" please be added to the commit comment?
>
> I have added the Trac tag.
>
> Testing on Zynq should be done as well; if the problem
> persists there, the ticket should be reopened or a new one
> added.
>
> The change has been tested on RPi1 and RPi2. There are many
> possible variants of ARM core integration, even for a single
> CPU marking.
>
> It seems that the RPi2 Cortex-A7 includes the multiprocessing
> extension, which should work in such a way that it is enough
> to clean only the instruction cache and prefetch buffer
> by virtual address range. Invalidation should be propagated
> to other cores. A new fill of the instruction caches then
> causes snoops over all CPUs, so it is not required to
> flush the data cache(s). Observation of a UP build on RPi2
> confirms this kind of integration, because even flushing
> the data cache before relocation worked on RPi2.
> The current default implementation of
> rtems_cache_instruction_sync_after_code_change()
> flushes not only the instruction cache but also all levels
> of data cache by virtual address range.
> The behavior can be optimized in cache_.h by specifying
>
>   #define CPU_CACHE_SUPPORT_PROVIDES_INSTRUCTION_SYNC_FUNCTION 1
>
> and providing an optimized _CPU_cache_instruction_sync_after_code_change()
> which does not need to flush the data cache at all in this Cortex-A
> scenario.
>
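
For reference, a minimal sketch of the override described above, not the
actual RTEMS sources. The macro name and the idea of a CPU specific sync
hook come from the paragraph above; every demo_* name and the counters are
hypothetical stand-ins so the dispatch can be compiled and exercised off
target. Whether skipping the data cache flush is safe on a given core is
exactly the open question in this thread.

```c
#include <stddef.h>

/* Set by a BSP's cache_.h when it supplies its own sync routine. */
#define CPU_CACHE_SUPPORT_PROVIDES_INSTRUCTION_SYNC_FUNCTION 1

/* Hypothetical counters standing in for real cache maintenance. */
static unsigned demo_data_flushes;
static unsigned demo_instruction_invalidates;

static void demo_flush_data_lines(const void* addr, size_t n_bytes)
{
  (void) addr;
  (void) n_bytes;
  demo_data_flushes++;
}

static void demo_invalidate_instruction_lines(const void* addr, size_t n_bytes)
{
  (void) addr;
  (void) n_bytes;
  demo_instruction_invalidates++;
}

/* Generic path: flush the data cache, then invalidate the instruction
 * cache over the same range. */
static void demo_generic_sync(const void* addr, size_t n_bytes)
{
  demo_flush_data_lines(addr, n_bytes);
  demo_invalidate_instruction_lines(addr, n_bytes);
}

/* Assumed optimized path for a core whose instruction fills snoop the
 * data cache: only the instruction side is touched. */
static void demo_cpu_sync(const void* addr, size_t n_bytes)
{
  demo_invalidate_instruction_lines(addr, n_bytes);
}

/* Dispatcher: picks the CPU specific routine when the macro is set. */
static void demo_instruction_sync_after_code_change(const void* addr,
                                                    size_t      n_bytes)
{
#if defined(CPU_CACHE_SUPPORT_PROVIDES_INSTRUCTION_SYNC_FUNCTION)
  demo_cpu_sync(addr, n_bytes);
#else
  demo_generic_sync(addr, n_bytes);
#endif
}
```

With the macro defined, a call to demo_instruction_sync_after_code_change()
bumps only the instruction counter; removing the define falls back to the
generic flush plus invalidate pair.
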
> If the multiprocessing extension is not implemented, then a
> flush of the instruction cache and the first level of data
> cache is required on Cortex-A even on a uniprocessor system.
> A flush to the inner level or the level of unification is
> required on such Cortex-A variants.
>
> The RPi1 ARMv6 does not seem to snoop the data cache, so a
> flush of the first data cache level is required. If it is not
> done after relocation, instruction opcodes are read OK but
> relocated target addresses are not seen. So an I and D L1
> cache flush is required, and the need has been observed.
> On the other hand, an L1 flush is enough, so a more limited
> flush version could be more efficient.
>
> But the current default code is OK even though its flushes
> go too deep.
>
> This behavior is my interpretation of the ARM documents,
> but I have not found a clear table which compares all
> ARM architecture levels and variants and specifies which CP15
> instructions are supported for each variant and which
> caches are snooped under which conditions. The architecture
> manuals have many IMPLEMENTATION DEFINED or IF MULTIPROCESSOR
> EXTENSION etc. statements, so it is far from being 100% clear
> to me.
>
> As for the code, it should be enough if there is no other
> executable section, code or trampoline generated in any
> of the omitted sections or elsewhere.
>
> I consider that the following mask covers all potential code sections:
>
>   RTEMS_RTL_OBJ_SECT_TEXT | RTEMS_RTL_OBJ_SECT_CONST |
>   RTEMS_RTL_OBJ_SECT_DATA | RTEMS_RTL_OBJ_SECT_BSS |
>   RTEMS_RTL_OBJ_SECT_EXEC;
>
> A problem arises if relocation changes code outside the list
> of sections of the given object. So if you load an object and
> the symbols that object adds result in relocation of code
> in another object which is already loaded and for which
> rtems_rtl_obj_synchronize_cache() has already been called,
> then there is a problem.
>
> So if there can be relocations which go outside the current
> object's section list, then these code updates should be
> flushed as well.
>
> So when I think more about the patch, it is possible
> to revert it, use the original flush after section
> loads, and add a flush in each relocation operation that
> modifies code.
>
> Even if such an approach is chosen as the next step,
> I would suggest leaving rtems_rtl_obj_synchronize_cache()
> there as an alternative option.
>
> So in the end, I am not sure now, when I think more about
> that after pushing the code.
>
> On the other hand, even if a flush is done after each
> relocation record, a final flush of the newly loaded object by
> sections, rather than by text, data and bss, can be better so
> that long alignment gaps or a nonstandard layout are flushed right.

I currently think limiting cache management to a minimum is best. I need 
to review this area to figure out the best solution. There is support to 
have object memory made read-only, so this may be affected.

A possible solution is to set a per object file dirty flag when any
object file's in-memory image changes, then have the top level exit paths
from all API calls run rtems_rtl_obj_synchronize_cache(). This could
iterate over all object files, handling the per arch cache demands for
any dirty objects.
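
A rough sketch of that idea, under stated assumptions: demo_obj is a
hypothetical stand-in for the real object descriptor, and
demo_obj_synchronize_cache() only counts calls where the real
rtems_rtl_obj_synchronize_cache() would do the per arch cache work. None
of these demo_* names exist in libdl.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the RTL object descriptor. */
typedef struct demo_obj {
  struct demo_obj* next;
  const char*      name;
  bool             dirty;  /* set when the in-memory image changes */
  unsigned         syncs;  /* counts cache syncs, for illustration only */
} demo_obj;

/* Called by anything that patches the object's image, for example a
 * relocation handler. */
static void demo_obj_mark_dirty(demo_obj* obj)
{
  obj->dirty = true;
}

/* Stand-in for rtems_rtl_obj_synchronize_cache(); here it only counts. */
static void demo_obj_synchronize_cache(demo_obj* obj)
{
  obj->syncs++;
}

/* Run on the exit path of each top level API call: sync every dirty
 * object once and clear its flag. Returns the number of objects synced. */
static size_t demo_sync_dirty_objects(demo_obj* head)
{
  size_t count = 0;
  for (demo_obj* obj = head; obj != NULL; obj = obj->next) {
    if (obj->dirty) {
      demo_obj_synchronize_cache(obj);
      obj->dirty = false;
      count++;
    }
  }
  return count;
}
```

The point of the flag is that a late relocation into an object loaded
earlier marks that object dirty again, so it gets synced on the next API
exit even though its own load finished long ago.
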

>
> But it is possible that some more flushes need to be added
> if the load of a new object can lead to an update in a previous
> one, or if a mutual object dependency is solved in such a way
> that one object is loaded, rtems_rtl_obj_synchronize_cache()
> is called, unresolved symbols are left open, then the next object
> is loaded, the previously missing symbols are resolved, the previous
> object's code relocation is updated, the second object's relocation
> is done, and rtems_rtl_obj_synchronize_cache() is then done only for
> the second object. It would work OK on Cortex-A7 with the SMP extension,
> but it is not generally correct with the current implementation.
>
> That is, if the described scenario is possible in RTL.
>

It is possible and a valid situation to have; the user is responsible
for any unresolved symbols.

I would need to take a close look at this part of the code to see what 
is needed. I will try and take a look in the coming weeks. I would like 
to start addressing the issue of veneer support for ARM so I will need 
to dig down deep in the object file's section layout to adjust the 
memory allocations to handle the veneers.

Chris

> Best wishes,
>
>                Pavel
>