Jonathan
Jonathan Author of Robopenguins

Software Bug Hunt

Recently, at work, I was faced with one of the more satisfying bug hunts that I’ve had in awhile.

I don’t talk about work projects much since they’re usually not be self contained. Also, the code is proprietary and would be hard to talk about in a general sense. However, I recently spent some time chasing down a bug that was pretty interesting on its own.

For some context, the bug was on an embedded device that saves some data to flash memory. To avoid reading/writing the flash unnecessarily, at boot a copy of the data is written to RAM. This RAM is then written and read from, and the flash is only updated on a save command. This data is validated with a CRC check.

The issue we were seeing was that when the device started up, the data loaded from flash was failing the CRC check. Saving the data would clear the CRC error, but it would come back after rebooting.

The first thing I needed to figure out was where the error was occurring. Was it:

  • The original CRC calculation was wrong.
  • The flash wasn’t storing the correct data.
  • The data read from flash wasn’t correct.
  • The validation CRC check was wrong.

First, I reviewed the code related to saving/loading the flash and performing the CRC checks. Nothing jumped out as a likely issue.

The next thing I did was compare the data saved to flash with the data in memory. The chip we’re using is a STM32H743AI ARM CPU. To program it we use a JLink tool. This tool lets you read the flash as well as connect with GDB for dumping the memory. I could dump the flash directly using the JFlash application.

For dumping the memory, I added a breakpoint to the code where I knew the value was failing the CRC check. From there, I could get the memory address of the data. Since the device is very memory constrained we had allocated certain chunks manually, and this flash buffer was one of them. It happens that this buffer was 2048 bytes at addresses 0x0000 - 0x0800. This is a bit unusual since that includes the “null pointer” address of 0. With the address I could dump the memory with the GDB command:

set logging file gdb_memory_dump.txt
set logging on
x/2048xb 0x00

This prints 2048 bytes starting at address zero as byte-wise hex. See https://sourceware.org/gdb/onlinedocs/gdb/Memory.html

So now I had the data and I needed to figure out which was correct. When I looked at how the CRC was being done, it turned out it was using the ST processor’s built in CRC calculator.

This was the configuration it was using:

1
2
3
4
5
hcrc.Init.DefaultPolynomialUse = DEFAULT_POLYNOMIAL_ENABLE;
hcrc.Init.DefaultInitValueUse = DEFAULT_INIT_VALUE_ENABLE;
hcrc.Init.InputDataInversionMode = CRC_INPUTDATA_INVERSION_BYTE;
hcrc.Init.OutputDataInversionMode = CRC_OUTPUTDATA_INVERSION_ENABLE;
hcrc.InputDataFormat = CRC_INPUTDATA_FORMAT_BYTES;

Which ends up being a “standard” CRC32 calculation. It still took a bit of trial and error using https://www.crccalc.com/ and https://github.com/Michaelangel007/crc32 as reference before I could get my test application to reproduce the CRC checks.

It turned out that the CRC was correct in both cases. I had actually missed it at first, but while I had initially thought the flash and RAM dumps were identical, the 70-71st bytes of the memory dump were modified from the values in flash.

Fortunately, this was reproducible between reboots. I could step through the system’s initialization and use the debugger to check the value of the affected memory location to see that it was initially being loaded correctly but eventually being corrupted by a memory error. Since the address was 0x46 I assumed the likely issue was a null pointer dereference https://cwe.mitre.org/data/definitions/476.html.

I thought this was likely the issue since the address was a small number. It would indicate that some object with a null pointer was trying to have some member at 0x46 byte offset modified. While most systems would crash on this error, since the memory access would fail, since this system had valid memory at that location, it just caused the corruption. Hypothetically, a data breakpoint would have streamlined this a bit, but for whatever reason setting one didn’t work here.

By stepping through the code execution watching for where the memory was corrupted I was able to narrow down where the error was occurring and eventually spot a race condition https://en.wikipedia.org/wiki/Race_condition that was resulting in the error.

It turned out that we were lucky the CRC happened to be checking this memory region since the error was both a race condition and would only occur when a certain set of features were enabled.

I think it goes to show that on embedded systems where the entire memory is always accessible, it’s both really easy to mess things up in a hard to find way, but you also get a huge amount of power in being able to know where everything is in memory. If you understand what’s going on you can get incredible mileage just watching the memory with debugging tools.