[PATCH 6/6 v2] libdl/rtl-obj.c: synchronize cache after code relocation.

Thu Jul 21 01:23:30 UTC 2016

Hello Chris,

On Thursday 21 of July 2016 00:22:30 Chris Johns wrote:
> On 20/07/2016 18:55, Pavel Pisa wrote:
> >  From be993a950be6c382b70609bcbc38be9bd161e1d4 Mon Sep 17 00:00:00 2001
> > Message-Id:
> > <be993a950be6c382b70609bcbc38be9bd161e1d4.1469004576.git.pisa at cmp.felk.cv
> >ut.cz> From: Pavel Pisa <pisa at cmp.felk.cvut.cz>
> > Date: Wed, 20 Jul 2016 10:49:19 +0200
> > Subject: [PATCH] libdl/rtl-obj.c: synchronize cache after code
> > relocation. To: rtems-devel at rtems.org
> >
> > Memory content changes caused by relocation has to be
> > propagated to memory/cache level which is used/snooped
> > during instruction cache fill.
>
> This looks fine to me.
>
> Does this close https://devel.rtems.org/ticket/2438?
>
> If it does could "Closes #2438" please be added to the commit comment.

I have added the Trac tag.

Testing on Zynq should be done as well, and if there is
still a problem then the ticket should be reopened or a new
one added.

The change has been tested on RPi1 and RPi2. There are many
possible variants of ARM core integration, even within a single
CPU marketing variant.

It seems that the RPi2 Cortex-A7 includes the multiprocessing
extensions, which should work in such a way that it is enough
to invalidate only the instruction cache and prefetch buffer
by virtual address range. The invalidation should be propagated
to other cores. A subsequent fill of the instruction caches then
causes snoops over all CPUs, so it is not required to
flush the data cache(s). The observation on RPi2 with a UP
build confirms this kind of integration, because
even data cache flushing before relocation worked on RPi2.
The current default implementation of
rtems_cache_instruction_sync_after_code_change()
flushes not only the instruction cache but also all levels
of the data cache by virtual address range.
The behavior can be optimized in cache_.h by specifying

  #define  CPU_CACHE_SUPPORT_PROVIDES_INSTRUCTION_SYNC_FUNCTION 1

and providing an optimized _CPU_cache_instruction_sync_after_code_change()
which does not need to flush the data cache at all for this Cortex-A
scenario.
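
A minimal host-side sketch of that override follows. It is only a
model under stated assumptions: the model_* functions and the two
counters are hypothetical stand-ins for real cache maintenance
operations, and the dispatch function only imitates how a generic
cache manager could choose between the default and the CPU-provided
routine. It is not the RTEMS implementation.

```c
#include <stddef.h>

/* Models a cache_.h that opts in to a CPU-specific sync routine. */
#define CPU_CACHE_SUPPORT_PROVIDES_INSTRUCTION_SYNC_FUNCTION 1

/* Host-side counters standing in for real cache maintenance work. */
unsigned dcache_flush_calls;
unsigned icache_invalidate_calls;

/* Hypothetical stand-in: flush data cache lines by address range. */
void model_flush_data_lines(const void *addr, size_t n)
{
  (void) addr;
  (void) n;
  ++dcache_flush_calls;
}

/* Hypothetical stand-in: invalidate instruction cache lines by range. */
void model_invalidate_instruction_lines(const void *addr, size_t n)
{
  (void) addr;
  (void) n;
  ++icache_invalidate_calls;
}

#if defined(CPU_CACHE_SUPPORT_PROVIDES_INSTRUCTION_SYNC_FUNCTION)
/* Assumed optimized variant: instruction side only, no data flush. */
void _CPU_cache_instruction_sync_after_code_change(const void *code_addr,
                                                   size_t n_bytes)
{
  model_invalidate_instruction_lines(code_addr, n_bytes);
}
#endif

/* Models the choice between the CPU routine and the deep default. */
void model_instruction_sync_after_code_change(const void *code_addr,
                                              size_t n_bytes)
{
#if defined(CPU_CACHE_SUPPORT_PROVIDES_INSTRUCTION_SYNC_FUNCTION)
  _CPU_cache_instruction_sync_after_code_change(code_addr, n_bytes);
#else
  model_flush_data_lines(code_addr, n_bytes);
  model_invalidate_instruction_lines(code_addr, n_bytes);
#endif
}
```

With the define present, one call touches only the instruction side;
removing the define falls back to the data flush plus instruction
invalidate pair.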

If the multiprocessing extensions are not implemented, then a flush
of the instruction cache and the first level of data cache is required
on Cortex-A even on a uniprocessor system. A flush to the
inner level or to the level of unification is required on such
Cortex-A variants.

The RPi1 ARMv6 core does not seem to snoop the data cache, so a flush
of the first data cache level is required. If it is not done
after relocation, the instruction opcodes are read correctly but
the relocated target addresses are not seen. So both an I and D L1
cache flush is required, and the need has been observed.
On the other hand, an L1 flush is enough, so some limited
flush version could be more efficient.

But the current default code is OK even though it flushes deeper
than necessary.

This behavior is my interpretation of the ARM documents,
but I have not found a clean table which compares all
ARM architecture levels and variants and specifies which CP15
instructions are supported for each variant and which
caches are snooped under which conditions. The architecture
manuals have many IMPLEMENTATION DEFINED or IF MULTIPROCESSOR
EXTENSION etc. statements, so it is far from being 100% clear
to me.

As for the code, it should be enough as long as there is no other
executable section, code or trampoline generated in one
of the omitted sections or elsewhere.

I consider that the following mask covers all potential code sections:

  RTEMS_RTL_OBJ_SECT_TEXT | RTEMS_RTL_OBJ_SECT_CONST |
  RTEMS_RTL_OBJ_SECT_DATA | RTEMS_RTL_OBJ_SECT_BSS |
  RTEMS_RTL_OBJ_SECT_EXEC;
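
As a rough illustration of how such a mask selects sections, here is
a small host-side sketch. The flag names are the ones from the mask
above, but the bit values are placeholders chosen for this example
(the real definitions are in the libdl headers and may differ), and
EXAMPLE_SECT_OTHER and example_sect_needs_sync() are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values for illustration only; the real values may differ. */
#define RTEMS_RTL_OBJ_SECT_TEXT  (1u << 0)
#define RTEMS_RTL_OBJ_SECT_CONST (1u << 1)
#define RTEMS_RTL_OBJ_SECT_DATA  (1u << 2)
#define RTEMS_RTL_OBJ_SECT_BSS   (1u << 3)
#define RTEMS_RTL_OBJ_SECT_EXEC  (1u << 4)

/* Hypothetical flag for a section kind that is not part of the mask. */
#define EXAMPLE_SECT_OTHER       (1u << 5)

/* Returns true if a section with these flags is covered by the sync mask. */
bool example_sect_needs_sync(uint32_t sect_flags)
{
  const uint32_t mask =
    RTEMS_RTL_OBJ_SECT_TEXT | RTEMS_RTL_OBJ_SECT_CONST |
    RTEMS_RTL_OBJ_SECT_DATA | RTEMS_RTL_OBJ_SECT_BSS |
    RTEMS_RTL_OBJ_SECT_EXEC;

  return (sect_flags & mask) != 0;
}
```

A section carrying any of the five flags is synchronized; a section
with none of them is skipped, which is exactly where code or a
trampoline placed elsewhere would be missed.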

A problem arises if a relocation changes code outside the list
of sections of the given object. So if you load an object, and
the symbols added by that object result in relocation of code
in another object which has already been loaded and for which
rtems_rtl_obj_synchronize_cache() has already been called, then
there is a problem.

So if there can be relocations which go outside the current
object's section list, then these code updates have to be
flushed as well.

So when I think more about the patch, it is possible
to revert it, use the original flush after section
loads, and add a flush in each relocation operation that
modifies code.
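
A minimal sketch of that per-relocation alternative, runnable on a
host. Both functions are hypothetical: example_sync_after_code_change()
is a stand-in that only records the range it was given, where target
code would call rtems_cache_instruction_sync_after_code_change(), and
example_reloc_write32() assumes a relocation that patches one aligned
32-bit word.

```c
#include <stddef.h>
#include <stdint.h>

/* Records the last range passed to the stand-in sync routine. */
const void *last_sync_addr;
size_t      last_sync_size;

/* Host stand-in for the real instruction sync call on the target. */
void example_sync_after_code_change(const void *code_addr, size_t n_bytes)
{
  last_sync_addr = code_addr;
  last_sync_size = n_bytes;
}

/* Hypothetical helper: patch one relocation target, then sync that word. */
void example_reloc_write32(uint32_t *where, uint32_t value)
{
  *where = value;
  example_sync_after_code_change(where, sizeof(*where));
}
```

The cost is one sync call per patched word, but the sync follows the
write wherever it lands, including into an object loaded earlier.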

Even if such an approach is chosen as the next step,
I would suggest leaving rtems_rtl_obj_synchronize_cache()
there as an alternative option.

So in the end, I am not sure, now that I think more about
it after pushing the code.

On the other hand, even if a flush is done after each relocation
record, a final flush of the newly loaded object by sections,
rather than by text, data and bss, can be better, so that long
alignment gaps or a nonstandard layout are flushed correctly.

But it is possible that some more flushes need to be added.
The load of a new object can lead to an update in a previous one,
or a mutual dependency between objects may be resolved in such a
way that one object is loaded, rtems_rtl_obj_synchronize_cache()
is called, unresolved symbols are left open, then the next object is
loaded, the previously missing symbols are resolved, the previous
object's code relocation is updated, the second object's relocation
is done, and rtems_rtl_obj_synchronize_cache() is then called only
for the second object. It would work OK on Cortex-A7 with the SMP
extensions, but it is not generally correct with the current
implementation.

That is, if the described scenario is possible in RTL.
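
The sequence can be replayed with a toy host-side model. Everything
here is hypothetical: struct example_obj tracks only whether an object
has code patched since its last synchronization, and the example_*
functions are stand-ins, not RTL calls. Whether RTL really defers
relocations this way is the open question above.

```c
#include <stdbool.h>

/* Toy model of a loaded object: is patched code still unsynchronized? */
struct example_obj {
  bool code_dirty;
};

/* Any relocation that patches code in obj may leave stale cache lines. */
void example_relocate_into(struct example_obj *obj)
{
  obj->code_dirty = true;
}

/* Stand-in for a per-object cache synchronization after load. */
void example_synchronize_cache(struct example_obj *obj)
{
  obj->code_dirty = false;
}

/* Replays the sequence; true if the first object is left unsynchronized. */
bool example_first_object_left_dirty(void)
{
  struct example_obj a = { false };
  struct example_obj b = { false };

  example_relocate_into(&a);      /* load A, relocate what resolves now  */
  example_synchronize_cache(&a);  /* A synced, some symbols still open   */

  example_relocate_into(&a);      /* loading B resolves them, patches A  */
  example_relocate_into(&b);      /* B's own relocations                 */
  example_synchronize_cache(&b);  /* only B is synchronized              */

  return a.code_dirty && !b.code_dirty;
}
```

In this model the first object ends up with patched but unsynchronized
code, which is the case that either a per-relocation flush or a second
synchronization of the first object would cover.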

Best wishes,

               Pavel