Problem report: Struct aliasing problem causes Thread_Ready_Chain corruption in 4.6.99.3

Ralf Corsepius ralf.corsepius at rtems.org
Wed Nov 29 03:09:27 CST 2006


On Wed, 2006-11-29 at 08:49 +0100, Thomas Doerfler wrote:

> Ralf Corsepius schrieb:
> > On Tue, 2006-11-28 at 11:14 -0600, Joel Sherrill wrote:
> > 
> >>Eric Norum wrote:
> >>
> > 
> >>+ It is an OPTIMIZATION and an optional one at that.
> >>This isn't a test of manhood.  There isn't any shame in
> >>disabling it. 
> > 
> > Then you might be able to explain why
> > 
> > * HUGE projects such as Fedora and OpenSuSE are able to compile 1000's
> > of source tarballs and millions of lines of code with it enabled and are
> > only facing very few packages to break?
> > 
> > * GCC and newlib can be compiled with it enabled for RTEMS?
> 
> Oh, please note: the RTEMS kernel and packages can also be compiled with
> this option. And they also work MOST of the time. But this is not
> sufficient for a reliable RTOS.
Are you seriously trying to say that such fundamental bugs would not have
been found in those 1000's of source tarballs, in all the _years_
-fstrict-aliasing has been in effect, if this were a real problem?

So far, only very few packages (on the order of a few dozen) have
exposed this kind of problem.

Unfortunately RTEMS is one of these!

> What is your suggestion to find other potential problem areas?
I can tell you what I've been trying so far (but I am just at the
very beginning):

Compile RTEMS with and without -fno-strict-aliasing, disassemble the
object files and compare the disassembly. If the disassembled files
differ, the file qualifies as a candidate to be examined.

This results in a list of candidate files to be examined (on the order
of 100). It definitely contains many false positives, due to
-fno-strict-aliasing affecting the ordering of asm instructions;
nevertheless, this list is better than nothing.

> > Our problem is lack of testing (primary cause: way too long release
> > cycles). 
> 
> Here I must totally disagree.
Face it: RTEMS users are still using ancient tools with ancient versions
of RTEMS, therefore rtems-4.7 and its toolchain have hardly seen any
public exposure or testing at all.

If we had a "release early/release often" policy, this kind of breakdown
would be tripped over much earlier and could have been fixed much
earlier.

It's the old "release early/release often" vs. "release seldom/test to
death" dualism. One thing I've learned over the years I have been involved
in OpenSource is: The latter might work for BIG companies with BIG
testing departments, but it doesn't work for OpenSource projects,
because they don't have the time, money or personnel.

OpenSource works by "many eyes seeing many things" during
"mass-exposure".

>  You will never fix this problem by
> testing. The effort to track down ONE error has been significantly high.
Yes, and? How many errors are there? 1 ... 10 ... 100s?

I suspect very few, with most of them orbiting around "Chains" and
"Object", due to their working principle (based on aliasing types).
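
To illustrate what "based on aliasing types" can mean, here is a minimal
sketch. It is not copied from the RTEMS sources; the names node, control,
chain_head, chain_tail, chain_init and chain_layout_overlaps are made up
for illustration. It assumes a list control block whose three pointers are
meant to be read as two overlapping nodes, reached by casting to the node
type. The sketch only builds the aliased views and checks the layout; it
deliberately does not dereference them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration only; not copied from RTEMS. */
typedef struct node {
    struct node *next;
    struct node *previous;
} node;

/*
 * Three pointers that are meant to be read as two overlapping nodes:
 *   head node = { first, permanent_null }
 *   tail node = { permanent_null, last }
 */
typedef struct control {
    node *first;
    node *permanent_null;   /* always NULL */
    node *last;
} control;

/* The "head node" is the control itself, viewed through another type. */
static inline node *chain_head(control *c) { return (node *)c; }

/* The "tail node" starts at the second field of the control. */
static inline node *chain_tail(control *c)
{
    return (node *)&c->permanent_null;
}

/* Set up an empty chain, writing only through the control's own fields. */
static inline void chain_init(control *c)
{
    assert(c != NULL);
    c->first = chain_tail(c);
    c->permanent_null = NULL;
    c->last = chain_head(c);
}

/* Returns 1 if the overlap assumed above holds for this compiler/ABI. */
static inline int chain_layout_overlaps(void)
{
    return offsetof(control, first) == offsetof(node, next)
        && offsetof(control, permanent_null) == offsetof(node, previous)
        && offsetof(control, last) - offsetof(control, permanent_null)
           == offsetof(node, previous) - offsetof(node, next);
}

/*
 * The risky part is intentionally not executed here. Code such as
 *     chain_tail(c)->previous = n;    (really a store to c->last)
 *     x = c->last;
 * touches the same memory once as a member of `node` and once as a
 * member of `control`. With -fstrict-aliasing the compiler may treat
 * the two accesses as unrelated and reorder or cache them.
 */
```

Whether a given compiler actually miscompiles such an access depends on
its version, the optimization level and the surrounding code, which is why
this kind of problem can stay hidden for a long time.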

> > Instead the RTEMS community seems to prefer to "blindly shoot into the
> > crowd" on "hearsay" and to play with symptoms rather than to fix causes.
> 
> OK, can you already identify the causes? or do you have a suggestion how
> we all can do that?

> > Me suspects very few, but central points in RTEMS to be broken and
> > needing to be fundamentally redesigned.
> 
> Please list them.
Chains, chains, chains, chains. 

Having experimented with converting parts of RTEMS to using (nonaliased)
BSD queues, I am expecting the worst. I also would not be surprised to
additionally find real bugs hiding behind "wild casts".
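
For comparison, a minimal sketch of a list in the style of the LIST macros
from BSD <sys/queue.h>, written out by hand. All names (item, item_link,
item_list, list_init, list_insert_head, list_remove, list_sum) are
hypothetical, and this is not the experimental conversion mentioned above.
The point it tries to show: the head and the elements keep their own
distinct types, and the back link is the address of the pointer that
points at the element, so no cast between head and node types is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; loosely modelled on BSD LIST_* from <sys/queue.h>. */
typedef struct item item;

typedef struct item_link {
    item  *next;    /* next element, or NULL at the end of the list */
    item **prevp;   /* address of the pointer that points at this element */
} item_link;

struct item {
    int       value;
    item_link link;
};

typedef struct item_list {
    item *first;    /* first element, or NULL if the list is empty */
} item_list;

static inline void list_init(item_list *l) { l->first = NULL; }

/* Insert `it` at the front of the list. */
static inline void list_insert_head(item_list *l, item *it)
{
    it->link.next = l->first;
    if (l->first != NULL)
        l->first->link.prevp = &it->link.next;
    l->first = it;
    it->link.prevp = &l->first;
}

/*
 * Unlink `it`. The store through prevp updates either the list head or
 * the previous element's `next`; both are objects of type `item *`, so
 * the access uses the same type on both sides and needs no cast.
 */
static inline void list_remove(item *it)
{
    if (it->link.next != NULL)
        it->link.next->link.prevp = it->link.prevp;
    *it->link.prevp = it->link.next;
}

/* Walk the list and add up the values. */
static inline int list_sum(const item_list *l)
{
    int sum = 0;
    for (const item *it = l->first; it != NULL; it = it->link.next)
        sum += it->value;
    return sum;
}
```

The trade-off in this sketch is that there is no tail pointer, so appending
at the end or walking backwards is not directly available.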

> >>Bottom line is that if we want strict-aliasing on for 4.7, we
> >>will be delaying the release.  This is a very bad thing.  I
> >>am torn between Thomas' suggestions 2 and 3
> >>
> >>
> >>>2.) We set "-fno-strict-aliasing" now and forever
> > 
> > 
> > With all due respect, but to me, this would be "plain stupid".
> 
> Ralf, again with due respect, can you please explain to me why it is stupid?

The ... forever ... is stupid. 

RTEMS code is dirty and needs to be cleaned up, that's the point.

> Ralf, I agree with you that it would be nicer to have aliasing-proof code
> from the start, but I see no easy way to get it soon.
Therefore _temporary_, therefore NO -fno-strict-aliasing in rtems-4.8.

We must provoke these bugs to be able to "nail them down" and not pamper
them with "-fno-strict-aliasing".

Ralf




