Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

multi-segment data/bss #632

Closed
ua1arn opened this issue Jul 21, 2019 · 24 comments · Fixed by #977
Closed

multi-segment data/bss #632

ua1arn opened this issue Jul 21, 2019 · 24 comments · Fixed by #977
Assignees

Comments

@ua1arn
Copy link
Contributor

ua1arn commented Jul 21, 2019

In template link scripts (Device_Template_Vendor\Vendor\Device\Source\GCC\startup_Device.S as example) lengths of filled/copied areas calculated as bytes:

	.copy.table :
	{
		. = ALIGN(4);
		__copy_table_start__ = .;
		LONG (__etext)
		LONG (__data_start__)
		LONG (__data_end__ - __data_start__)
    /** Add each additional data section here */
    /*
		LONG (__etext2)
		LONG (__data2_start__)
		LONG (__data2_end__ - __data2_start__)
    */
		__copy_table_end__ = .;
	} > FLASH

	.zero.table :
	{
		. = ALIGN(4);
		__zero_table_start__ = .;
    /** Add each additional bss section here */
    /* 
    LONG (__bss2_start__)
		LONG (__bss2_end__ - __bss2_start__) 
    */
		__zero_table_end__ = .;
	} > FLASH

But, actual __cmsis_start implementation assume these values as number of uint32_t elements,

I propouse change all template and ARM-produced startup files like this:

	/* To copy multiple ROM to RAM sections,
	 * uncomment .copy.table section and,
	 * define __STARTUP_COPY_MULTIPLE in startup_ARMCMx.S */

	.copy.table :
	{
		. = ALIGN(4);
		__copy_table_start__ = .;
		LONG (__etext)
		LONG (__data_start__)
		LONG ((__data_end__ - __data_start__) / 4)
		LONG (__etext2)
		LONG (__data2_start__)
		LONG ((__data2_end__ - __data2_start__) / 4)
		__copy_table_end__ = .;
	} > FLASH


	/* To clear multiple BSS sections,
	 * uncomment .zero.table section and,
	 * define __STARTUP_CLEAR_BSS_MULTIPLE in startup_ARMCMx.S */

	.zero.table :
	{
		. = ALIGN(4);
		__zero_table_start__ = .;
		LONG (__bss_start__)
		LONG ((__bss_end__ - __bss_start__) / 4)
		__zero_table_end__ = .;
	} > FLASH

@JonatanAntoni
Copy link
Member

Thanks for pointing this out. Unfortunately the current linker script and startup code seems to be misaligned.

We have:

  1. Linker script, using number of bytes for the section sizes.
  2. Assembly startup code, assuming the section size in bytes.
  3. C startup code, assuming the section size in words.

Hence we would need to change 1 and 2 to use word count, or only 3 to use byte count.

Changes to 3 affect all users, i.e. switching between releases would need to be reflected in the linker script of the project. On the other hand changing 1 and 2 would change startup behavior of legacy components.

@ua1arn
Copy link
Contributor Author

ua1arn commented Jul 22, 2019

I think. breaking compatibility of c-language part is a less nervous solution.
Wee can make non-silent signaling by changing used in c-language ld-script provided name like:
zero_table_start change to zero_table4_start or _zero_table_u32_start

And add to template link scripts second tables set with data conforms to fixed c-language functions.
Tip: structure names also change.

@JonatanAntoni
Copy link
Member

Hi @ua1arn,

I read your suggestion that way:
Keep the current implementation of __cmsis_start and change/fix the linker script template.

As a consequence, i.e. to keep assembly and C startup variants in sync, we need to update the assembly startup routines, accordingly.

In fact this might be the least disturbing fix. Both, the linker script and the assembly startup module, are configuration templates. I.e. both are intended to be used by copy. A change to these files would not affect any project unless a user updates both files, actively.

Does that makes sense to you?

Cheers,
Jonatan

@JonatanAntoni JonatanAntoni assigned ghost Jul 22, 2019
@ua1arn
Copy link
Contributor Author

ua1arn commented Jul 22, 2019

No. I propose change __cmsis_start implementation . Change names and types of used tables.

@ua1arn
Copy link
Contributor Author

ua1arn commented Jul 22, 2019

__STATIC_FORCEINLINE __NO_RETURN void __cmsis_start(void)
{
  extern void _start(void) __NO_RETURN;
  
  typedef struct {
    uint32_t const* src;
    uint32_t* dest;
    uint32_t  wlen;
  } __copy_table4_t;
  
  typedef struct {
    uint32_t* dest;
    uint32_t  wlen;
  } __zero_table4_t;
  
  extern const __copy_table4_t __copy_table4_start__;
  extern const __copy_table4_t __copy_table4_end__;
  extern const __zero_table4_t __zero_table4_start__;
  extern const __zero_table4_t __zero_table4_end__;

  for (__copy_table4_t const* pTable = &__copy_table4_start__; pTable < &__copy_table4_end__; ++pTable) {
    for(uint32_t i=0u; i<pTable->wlen; ++i) {
      pTable->dest[i] = pTable->src[i];
    }
  }
 
  for (__zero_table4_t const* pTable = &__zero_table4_start__; pTable < &__zero_table4_end__; ++pTable) {
    for(uint32_t i=0u; i<pTable->wlen; ++i) {
      pTable->dest[i] = 0u;
    }
  }
 
  _start();
}

This modification affects only project uses c-language startup. Other work without problems

@JonatanAntoni
Copy link
Member

@ua1arn, if we change __cmsis_start and introduce new names we need to change the linker script as well. What do we gain from changing everything? Introducing new names and diverging the C implementation from the assembly one just seems to cause even more confusion to me.

@ua1arn
Copy link
Contributor Author

ua1arn commented Jul 22, 2019

Confusion and error at link phase but not silent hung at startup. Add "four" modified tables to template link scripts, Not remove old tables.

@JonatanAntoni
Copy link
Member

JonatanAntoni commented Jul 22, 2019

Where do you see the chance of a silent hung up at startup? Sure, currently having this bug is ugly. But fixing it in the linker script properly should fix everything.

Simply adding new tables would force any user to adopt the template depending on the startup code in use. There would not be a meaningful default that just works out of the box anymore. That looks even worse to me.

@ua1arn
Copy link
Contributor Author

ua1arn commented Jul 22, 2019

I propose add TWO set of tables in template link scripts and CHANGE cmsis_start implementation for use new tables.
Old assembly code works well. Project who adopt link script to buggy cmsis_start implementation simply update own link scripts.

@JonatanAntoni
Copy link
Member

@ua1arn, sorry, I think we are not yet on the same page.
Could you provide a PR containing all the changes you think we should do?

I have five scenarios in mind I'd like to check:

  1. Updating a legacy project using assembly startup.
  2. Updating an existing project using C startup as per 5.6.0 (*)
  3. Creating a new project using latest assembly startup.
  4. Creating a new project using latest C startup.
  5. Switching from assembly to C startup.

The ideal case would be if all scenarios run seamlessly. Where of course scenario 2 (*) is affected by the bug. Hence this user would need to apply the workaround (i.e. update the linker script) today and update the linker script again during update.

In summary this bug seems to be kind a nasty.

@ua1arn
Copy link
Contributor Author

ua1arn commented Jul 22, 2019

As a fact: no one can use __cmsis_start without modification of link script.
You agree? I propose way whit effects only to these projects, This way rise link time error.

@JonatanAntoni
Copy link
Member

JonatanAntoni commented Jul 22, 2019

@ua1arn, sure. Current situation is __cmsis_start and linker script templates do not work together. __cmsis_start is contained in a library header file and hence cannot be fixed by a user (*), easily. Hence a user of 5.6.0 can only apply your proposed change to the linker script.

(*) __cmsis_start in fact could be overwritten completely as a possible workaround.

@ua1arn
Copy link
Contributor Author

ua1arn commented Jul 22, 2019

I insist: overwriting of __cmsis_start must rise errors with old link scripts.
@JonatanAntoni, thank you.

@JonatanAntoni
Copy link
Member

Hi @ua1arn,

after having multiple discussions we decided to concentrate on fixing the issue as local as possible. In this case the problem came from misinterpreting the section length as word count instead of byte count.

Keeping the tables as they are since decades, i.e. not touching the linker scripts, and fixing __cmsis_start to

  for (__copy_table_t const* pTable = &__copy_table_start__; pTable < &__copy_table_end__; ++pTable) 
  {
    for(uint32_t i=0u; i<(pTable->len/4u); ++i) {
      pTable->dest[i] = pTable->src[i];
    }
  }

should do the job. This brings the C implementation back in sync with the assembly version.

I admit that moving the division by 4 to the linker script would save some instructions at run time, at least for the C version. The price is changing the linker script as well. And to keep assembly and C version in sync we would need to think about the assembly implementation. Keeping a double set of tables in the linker script by default would outweigh the optimization.

I hope this decision makes sense to you and solves the problem, ultimately.

Thanks again for pointing out this issue.

Cheers,
Jonatan

@ua1arn
Copy link
Contributor Author

ua1arn commented Jul 27, 2019

I think, change C interpretation may rise bug in project already uses __cmsis_start. This bug appear like hung at start-up. Again, changing names of link-script provided data (in .c implementation) can help easy fix, And provide TWO set of tables (copying/zeroing data) for C and ASM versions in templates.
Your This brings the C implementation back in sync with the assembly version require bold warning in release notes. More pain, but more correct way.

@alfedotov
Copy link

Just run into this issue. I vote for linker script update. Division by 4 in runtime makes no sense.

@kjbracey
Copy link
Collaborator

kjbracey commented Aug 12, 2019

I wouldn't sweat the "division by 4" - it's well within a compiler's loop optimisation capabilities to eliminate it - it effectively cancels out with the internal "multiplications by 4" occurring for the uint32_t * pointer indexing.

However, all the pointer stuff is potentially going to cause a headache, including potentially eliminating that optimisation. In particular, the compiler has to assume that the dest[i] = src[i] could be changing ptable->len, (it can't know that dest doesn't point to pTable), so it has to keep reloading it (and redividing it).

It would be beneficial to load the pTable entry into local variables prior to the inner loop. (Just loading *pTable into a local __copy_table_t and using that would do, or you could have separate locals for the pointers and len).

Doing that should save far more than added by putting that /4 in.

(Addendum - yes, __copy_table_start__ and __copy_table_end__ are extern const objects, signalling that they're real const, so maybe the compiler could assume they don't change, but you're accessing stuff between those objects. We're into realms of undefined behaviour here, and I doubt a compiler would choose to assume entries in-between are const. I would generally expect compilers to be acting as "allow cautiously" - treat pTable as a general pointer, forgoing any optimisation knowledge that it initially pointed to a const object).

@kjbracey
Copy link
Collaborator

kjbracey commented Aug 12, 2019

Further addendum - you may as well just use memcpy and memset here, right? I don't see why not.

If you did have a reason to not use them, you may in fact want to take more care to prevent the compiler from turning your manual code into those calls itself. GCC can spot that you're doing a copy loop and put in a call to memcpy/memset instead.

I've seen people add volatile to their pointers to dissuade that optimisation when executing from RAM in a flash programming routine during which the ROM is not accessible.

In your case, the potential aliasing between pTable->len and dest would be stopping that optimisation.

@JonatanAntoni
Copy link
Member

Hi @kjbracey-arm,

I had a version using memcpy and memset before, see 17acb21.
I removed this to be independent from the clib.

Do you agree that adding the division by 4 to the loops is the least intrusive fix?
Btw: pTable and src/dest must not alias due to strict aliasing rule, afaik.

Thanks,
Jonatan

@ua1arn
Copy link
Contributor Author

ua1arn commented Sep 3, 2019

I am agree with any progress in this issue.

@ua1arn
Copy link
Contributor Author

ua1arn commented Sep 3, 2019

About "intrusive fix" i'm saying before.
And EXPLICTLY say about renaming tables in existing prototypes for reflect divided-by-four or true value used as counter.

Goldweavers added a commit to motionperfect/IMU-Kinetic that referenced this issue Jan 12, 2020
Goldweavers added a commit to motionperfect/IMU-Kinetic that referenced this issue Jan 14, 2020
coder137 added a commit to coder137/STM32-Repo that referenced this issue Jun 14, 2020
@conoror
Copy link

conoror commented Aug 2, 2020

Do you agree that adding the division by 4 to the loops is the least intrusive fix?

It may be worth commenting that division by 4 is only valid if pTable->len mod 4 is zero. Thus it is a requirement that, in the linker file, __data_end__ and __data_start__ are 4 byte aligned. As __data_start__ must be aligned anyway __data_end__ might as well be but I thought it worth pointing out! I've seen linker files that didn't do this.

Btw: pTable and src/dest must not alias due to strict aliasing rule, afaik.

I agree. The comment by kjbracey-arm is the opposite of what happens in reality. On my machine, the minimum loop code is achieved by using an intermediate uint32_t len and len = pTable->wlen / 4:

for (__copy_table_t const* pTable = &__copy_table_start__; pTable < &__copy_table_end__; ++pTable) {
    len = pTable->wlen / 4;
    for(i = 0; i < len; i++)
        pTable->dest[i] = pTable->src[i];
}

which results in inner loop asm (compiled with gcc 7.2.1 and -O2 -mthumb -mcpu=cortex-m0) of:

.L4:
    ldr     r6, [r0, r3]
    str     r6, [r1, r3]
    adds    r3, r3, #4
    cmp     r2, r3
    bne     .L4

which is as good as it gets!

Conor.

@JonatanAntoni
Copy link
Member

Hi all,

can you review the proposed fix, please?

Cheers,
Jonatan

@conoror
Copy link

conoror commented Aug 5, 2020

Divide by 4 in the linker file and multiply by 4 in the assembly files. Seem fine to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants