Wanna Play a Detective? Find the Bug in a Function from Midnight Commander

bug


In this article, we invite you to try to find a bug in a very simple function from the GNU Midnight Commander project. Why? For no particular reason. Just for fun. Well, okay, it’s a lie. We actually wanted to show you yet another bug that a human reviewer has a hard time finding and the static code analyzer PVS-Studio can catch without effort.
A user sent us an email the other day, asking why he was getting a warning on the function EatWhitespace (see the code below). This question is not as trivial as it might seem. Try to figure out for yourself what’s wrong with this code.

static int
EatWhitespace (FILE * InFile)
  /* ----------------------------------------------------------------------- **
   * Scan past whitespace (see ctype(3C)) and return the first non-whitespace
   * character, or newline, or EOF.
   *
   *  Input:  InFile  - Input source.
   *
   *  Output: The next non-whitespace character in the input stream.
   *
   *  Notes:  Because the config files use a line-oriented grammar, we
   *          explicitly exclude the newline character from the list of
   *          whitespace characters.
   *        - Note that both EOF (-1) and the nul character ('\0') are
   *          considered end-of-file markers.
   *
   * ----------------------------------------------------------------------- **
   */
{
    int c;

    for (c = getc (InFile); isspace (c) && ('\n' != c); c = getc (InFile))
        ;
    return (c);
}                               /* EatWhitespace */


As you can see, EatWhitespace is a tiny function; its body is even smaller than the comment on it:). Now, let’s check a few details.

Here’s the description of the function getc:

int getc ( FILE * stream ); 


Returns the character currently pointed to by the internal file position indicator of the specified stream. The internal file position indicator is then advanced to the next character. If the stream is at the end-of-file when called, the function returns EOF and sets the end-of-file indicator for the stream. If a read error occurs, the function returns EOF and sets the error indicator for the stream (ferror).

And here’s the description of the function isspace:

int isspace( int ch );


Checks if the given character is a whitespace character as classified by the currently installed C locale. In the default locale, the whitespace characters are the following:

  • space (0×20, ' ');
  • form feed (0×0c, '\f');
  • line feed LF (0×0a, '\n');
  • carriage return CR (0×0d, '\r');
  • horizontal tab (0×09, '\t');
  • vertical tab (0×0b, '\v').


Return value. Non-zero value if the character is a whitespace character; zero otherwise.

The EatWhitespace function is expected to skip all whitespace characters except line feed '\n'. The function will also stop reading from the file when it encounters End of file (EOF).

Now that you know all that, try to find the bug!

The two unicorns below will make sure you don’t accidentally peek at the comment.

Figure 1. Time for bug searching. The unicorns are waiting.


Figure 1. Time for bug searching. The unicorns are waiting.

Still no luck?

Well, you see, it’s because we have lied to you about isspace. Bwa-ha-ha! It’s not a standard function at all — it’s a custom macro. Yeah, we’re baddies and we got you confused.

Figure 2. Unicorn confusing readers about isspace.


Figure 2. Unicorn confusing readers about isspace.

It’s not us or our unicorn to blame, of course. The fault for all the confusion lies with the authors of the GNU Midnight Commander project, who made their own implementation of isspace in the file charset.h:

#ifdef isspace
#undef isspace
#endif
....
#define isspace(c) ((c)==' ' || (c) == '\t')


With this macro, the authors confused other developers. The code was written under the assumption that isspace is a standard function, which considers carriage return (0×0d, '\r') a whitespace character.

The custom macro, in its turn, treats only space and tab characters as whitespace characters. Let’s substitute that macro and see what happens.

for (c = getc (InFile);
     ((c)==' ' || (c) == '\t') && ('\n' != c);
     c = getc (InFile))


The ('\n' != c) subexpression is unnecessary (redundant) since it will always evaluate to true. That’s what PVS-Studio warns you about by outputting the warning:

V560 A part of conditional expression is always true: ('\n' != c). params.c 136.

To make it clear, let’s examine 3 possible outcomes:

  • End of file reached. EOF is not a space or tab character. The ('\n' != c) subexpression is not evaluated due to short-circuit evaluation. The loop terminates.
  • The function has read some character that’s not a space or tab character. The ('\n' != c) subexpression is not evaluated due to short-circuit evaluation. The loop terminates.
  • The function has read a space or horizontal tab character. The ('\n' != c) subexpression is evaluated, but its result is always true.


In other words, the code above is equivalent to the following:

for (c = getc (InFile); c==' ' || c == '\t'; c = getc (InFile))


We have found that it doesn’t work the desired way. Now let’s see what the implications are.

A developer, who wrote the call of isspace in the body of the EatWhitespace function expected that the standard function would be called. That’s why they added the condition preventing the LF character ('\n') from being treated as a whitespace character.

It means that, besides space and horizontal tab characters, they were planning to skip form feed and vertical tab characters as well.

What’s more remarkable is that they wanted the carriage return character (0×0d, '\r') to be skipped too. It doesn’t happen though — the loop terminates when encountering this character. The program will end up behaving unexpectedly if newlines are represented by the CR+LF sequence, which is the type used in some non-UNIX systems such as Microsoft Windows.

For more details about the historical reasons for using LF or CR+LF as newline characters, see the Wikipedia page «Newline».

The EatWhitespace function was meant to process files in the same way, whether they used LF or CR+LF as newline characters. But it fails in the case of CR+LF. In other words, if your file is from the Windows world, you’re in trouble:).

While this might not be a serious bug, especially considering that GNU Midnight Commander is used in UNIX-like operating systems, where LF (0×0a, '\n') is used as a newline character, trifles like that still tend to lead to annoying problems with compatibility of data prepared on Linux and Windows.

What makes this bug interesting is that you are almost sure to overlook it while carrying out standard code review. The specifics of the macro’s implementation are easy to forget, and some project authors may not know them at all. It’s a very vivid example of how static code analysis contributes to code review and other bug detection techniques.

Overriding standard functions is a bad practice. By the way, we discussed a similar case of the #define sprintf std: printf macro in the recent article «Appreciate Static Code Analysis».

A better solution would have been to give the macro a unique name, for example, is_space_or_tab. This would have helped to avoid all the confusion.

Perhaps the standard isspace function was too slow and the programmer created a faster version, sufficient for their needs. But they still shouldn’t have done it that way. A safer solution would be to define isspace so that you would get non-compilable code, while the desired functionality could be implemented as a macro with a unique name.

Thanks for reading. Don’t hesitate to download PVS-Studio and try it with your projects. As a reminder, we now support Java too.

© Habrahabr.ru