gcc 4.3.2 vectorizes access to volatile array

Mon Jun 22 18:14:43 UTC 2009

Andrew Haley wrote:
> Till Straumann wrote:
>   
>> gcc-4.3.2 seems to produce bad code when
>> accessing an array of small 'volatile'
>> objects -- it may try to access multiple
>> such objects in a 'parallel' fashion.
>> E.g., instead of reading two consecutive
>> 'volatile short's sequentially it reads
>> a single 32-bit longword. This may crash
>> e.g., when accessing a memory-mapped device
>> which allows only 16-bit accesses.
>>
>> If I compile this code fragment
>>
>> void volarrcpy(short *d, volatile short *s, int n)
>> {
>> int i;
>>  for (i=0; i<n; i++)
>>    d[i] = s[i];
>> }
>>
>>
>> with '-O3' (the critical option seems to be '-ftree-vectorize')
>> then gcc-4.3.2 produces quite complicated code
>> but the essential section is (powerpc)
>>
>> .L7:
>>    lhz 0,0(11)
>>    addi 11,11,2
>>    lwzx 0,4,9
>>    stwx 0,3,9
>>    addi 9,9,4
>>    bdnz .L7
>>
>> or i386
>>
>> .L7:
>>    movw    (%ecx), %ax
>>    movl    (%esi,%edx,4), %eax
>>    movl    %eax, (%ebx,%edx,4)
>>    incl    %edx
>>    addl    $2, %ecx
>>    cmpl    %edx, -20(%ebp)
>>    ja  .L7
>>
>>
>> Disassembled back into C-code, this reads
>>
>> uint32_t *dst_l = (uint32_t*)d;
>> uint32_t *src_l = (uint32_t*)s;
>>
>> for (i=0; i<n/2; i++) {
>>    d[i]     = s[i];
>>    dst_l[i] = src_l[i];
>> }
>>
>> This code seems neither optimal nor correct.
>> Besides reading half of the locations twice
>> which violates the semantics of volatile
>> objects accessing such objects in a 'vectorized'
>> way (in this case: instead of reading
>> two adjacent short addresses gcc emits
>> a single 32-bit read) seems illegal to me.
>>
>> Similar behavior seems to be present in 4.3.3.
>>
>> Does anybody have some insight? Should I file
>> a bug report?
>>     
>
> I can't reproduce this with "GCC: (GNU) 4.3.3 20081110 (prerelease)"
>
> .L8:
> 	movzwl	(%ecx), %eax
> 	addl	$1, %ebx
> 	addl	$2, %ecx
> 	movw	%ax, (%edx)
> 	addl	$2, %edx
> 	cmpl	%ebx, 16(%ebp)
> 	jg	.L8
>
> I think you should upgrade.
>
> Andrew.
>   

OK, try this then:

void
c(char *d, volatile char *s)
{
int i;
    for ( i=0; i<32; i++ )
        d[i]=s[i];
}

(gcc --version: gcc (Ubuntu 4.3.3-5ubuntu4) 4.3.3)

gcc -m32 -c -S -O3
produces an unrolled sequence:

    movzbl  (%ecx), %eax
    leal    20(%ebx), %edx
    movl    (%ecx), %eax
    movl    %eax, (%edi)
    movzbl  1(%ecx), %eax
    movl    4(%ecx), %eax
    movl    %eax, 4(%edi)
    movzbl  2(%ecx), %eax
    movl    4(%ebx), %eax
    movl    %eax, 4(%esi)
    movzbl  3(%ecx), %eax
    movl    8(%ebx), %eax
    movl    %eax, 8(%esi)
... < snip >...

The 64-bit version even uses SSE registers to
load the volatile data:

(gcc -c -S -O3)

.L7:
    movzbl  (%rsi), %eax
    movdqu  (%rsi), %xmm0
    movdqa  %xmm0, (%rdi)
    movzbl  1(%rsi), %eax
    movdqu  (%rdx), %xmm0
    movdqa  %xmm0, 16(%rdi)

Not sure an upgrade helps ;-)

-- Till