Thursday, November 28, 2013

Bug fixing as knowledge creation

There are lots of ways you can think about bug-fixing: it is just a job that developers do; it is problem solving; etc. Here I want to take one particular viewpoint, that it is generating new knowledge about a software system.

One was to think about software is that it is the embodiment of a set of requirements, of how something should work. For example, Moodle can be thought of as a lot of knowledge about what software is required to teach online, and how that software should be designed. Finding and fixing bugs increases that pool of knowledge by identifying errors or omissions and then correcting them.

The bug fixing process

We can break down the process of discovering and fixing a bug into the following steps. This is really trying to break the process down as finely as possible. As you read this list, please think about what new knowledge is generated during each step.

  1. Something's wrong: We start from a state of blissful ignorance. We think our software works exactly as it should, and then some blighter comes along and tells us "Did you know that sometimes ... happens?" Not what you want to hear, but just knowing that there is a problem is vital. In fact the key moment is not when we are told about the problem, but when the user encountered it. Good users report the problems they encounter with an appropriate amount of detail
  2. Steps to reproduce: Knowing the problem exists is vital, but not a great place to start investigating. What you need to know is something like "Using Internet Explorer 9, if you are logged in as a student, are on this page, and then click that link then on the next page press that button, then you get this error." and that all the details there are relevant. This is called steps to reproduce. For some bugs they are trivial. For bugs that initially appear to be random, identifying the critical factors can be a major undertaking.
  3. Which code is broken: Once the developer can reliably trigger the bug, then it is possible to investigate. The first thing to work out is which bit of code is failing. That is, which lines in which file.
  4. What is going wrong: As well as locating the problem code, you also have to understand why it is misbehaving. Is it making some assumption that is not true? Is it misusing another bit of code? Is it mishandling certain unusual input values? ...
  5. How should it be fixed: Once the problem is understood, then you can plan the general approach to solving it. This may be obvious given the problem, but in some cases there is a choice of different ways you could fix it, and the best approach must be selected.
  6. Fix the code: Once you know how you will fix the bug, you need to write the specific code that embodies that fix. This is probably the bit that most people think of when you say bug-fixing, but it is just a tiny part.
  7. No unintended consequences: This could well be the hardest step. You have made a change which fixed the specific symptoms that were reported, but have you changed anything else? Sometimes a bug fix in one place will break other things, which must be avoided. This is a place where peer review, getting another developer to look at your proposed changes, is most likely to spot something you missed.
  8. How to test this change: Given the changes you made, what should be done to verify that the issue is fixed, and that nothing else has broken? You can start with the steps to reproduce. If you work through those, there should no longer be an error. Given the previous point, however, other parts of the system may also need to be tested, and those need to be identified.
  9. Verifying the fix works: Given the fixed software, and the information about what needs to be tested, then you actually need to perform those tests, and verify that everything works.

Some examples

In many cases you hardly notice some of the steps. For example, if the software always fails in a certain place with an informative error message, then that might jump you to step 4. To give a recent example: MDL-42863 was reported to me with this error message:

Error reading from database

Debug info: ERROR: relation "mdl_questions" does not exist

LINE 1: ...ECT count(1) FROM mdl_qtype_combined t1 LEFT JOIN mdl_questi...

SELECT count(1) FROM mdl_qtype_combined t1 LEFT JOIN mdl_questions t2 ON t1.questionid = t2.id WHERE t1.questionid <> $1 AND t2.id IS NULL

[array (0 => '0',]

Error code: dmlreadexception

Stack trace:

  • line 423 of /lib/dml/moodle_database.php: dml_read_exception thrown
  • line 248 of /lib/dml/pgsql_native_moodle_database.php: call to moodle_database->query_end()
  • line 764 of /lib/dml/pgsql_native_moodle_database.php: call to pgsql_native_moodle_database->query_end()
  • line 1397 of /lib/dml/moodle_database.php: call to pgsql_native_moodle_database->get_records_sql()
  • line 1470 of /lib/dml/moodle_database.php: call to moodle_database->get_record_sql()
  • line 1641 of /lib/dml/moodle_database.php: call to moodle_database->get_field_sql()
  • line 105 of /admin/tool/xmldb/actions/check_foreign_keys/check_foreign_keys.class.php: call to moodle_database->count_records_sql()
  • line 159 of /admin/tool/xmldb/actions/XMLDBCheckAction.class.php: call to check_foreign_keys->check_table()
  • line 69 of /admin/tool/xmldb/index.php: call to XMLDBCheckAction->invoke()

I have emboldened the key bit that says where the error is. Well, there are really two errors here. One is that the Combined question type add-on refers to mdl_questions when it should be mdl_question. The other is that the XMLDB check should not die with a fatal error if presented with bad input like this. The point is, this was all immediately obvious to me from the error message.

Another recent example at the other extreme is MDL-42880. There was no error message in this case, but presumably someone noticed that some of their quiz settings had changed unexpectedly (Step 1). Then John Hoopes, who reported the bug, had to do some careful investigation to work out what was going on (Step 2). I am glad he did, because it was pretty subtle thing, so in this case Step 2 was probably a lot of work. From there, it was obvious which bit of the code was broken (Step 3).

Note that Step 3 is not always obvious even when you have an error message. Sometimes things only blow up later as a consequence of something that went wrong before. To use an extreme example, if someone fills your kettle with petrol, instead of water, and then you turn it on to make some tea and it blows up. The error is not with turning the kettle on to make tea, but with filling it with petrol. If all you have is shrapnel, finding out how the petrol ended up in the kettle might be quite hard. (I have no idea why I dreamt up that particular analogy!)

MDL-42880 also shows the difference between the conceptual Steps 4 and 5, and the code-related Steps 3 and 6. I though the problem was with a certain variable becoming un-set at a certain time, so I coded a fix to ensure the value was never lost. That led to complex code that required a paragraph-long comment to try to explain it. Then I had a chat with Sam Marshall who suggested that in fact the problem was that another bit of code was relying on the value that variable, when actually the value was irrelevant. That lead to a simpler (hence better) fix: stop depending on the irrelevant value.

What does this mean for software?

There are a few obvious consequences that I want to mention here, although they are well known good practice. I am sure there are other more subtle ones.

First, you want the error messages output by your software to be as clear and informative as possible. They should lead you to where the problem actually occurred, rather than having symptoms only manifesting later. We don't want exploding kettles. There are some good examples of this in Moodle.

Second, because Step 7, ensuring that you have not broken anything else, is hard, it really pays to structure your software well. If you software is made up of separate modules that are each responsible for doing one thing, and which communicate in defined ways, then it is easier to know what the effect of changing a bit of one component is. If your software is a big tangle, who knows the effect of pulling one string.

Third, it really pays to engage with your users and get them to report bugs. Of course, you would like to find and fix all the bugs before you release the software, but that is impossible. For example, we are working towards a new release of the OU's Moodle platform at the start of December. We have had two professional testers testing it for a month, and a few select users doing various bits of ad-hoc testing. That adds up to less than 100 person days. On the day the software is released, probably 50,000 different users will log in. 50,000 user days, even by non-expert testers, are quite likely to find something that no-one else noticed.

What does this mean for users?

The more important consequences are for users, particularly of open-source software.

  • Reporting bugs (Step 1) is a valuable contribution. You are adding to the collective knowledge of the project.

There are, however, some caveats that follow from the fact that in many projects, the number of developers available to fix bugs is smaller than the number of users reporting bugs.

  • If you report a bug that was already reported, then someone will have to find the duplicate and link the two. Rather than being a useful contribution, this just wastes resources, so try hard to find any existing bug report, and add your information there, before creating a new one.
  • You can contribute more by reporting good steps to reproduce (Step 2). It does not require a developer to work those out, and if you can do it, then there is more chance that someone else will do the remaining work to fix the bug. On the other hand, there is something of a knack to working out and testing which factors are, or are not, significant in triggering a bug. The chances are that an experienced developer or tester can work out the steps to reproduce quicker than you could. If, however, all the experienced developers are busy then waiting for them to have time to investigate is probably slower than investigating yourself. If you are interested, you can develop your won diagnosis skills.
  • If you have an error message then copy and paste it exactly. It may be all the information you need to give to get straight to Step 3 or 4. In Moodle you can get a really detailed error message by setting 'debugging' to 'DEVELOPER' level, then triggering the bug again. (One of the craziest mis-features in Windows is that most error pop-ups do not let you copy-and-paste the message. Paraphrased error messages can be worse than useless.)

Finally, it is worth pointing out that Step 9 is another thing that can be done by the user, not a developer. For developers, it is really motivating when the person who reported the bug bothers to try it out and confirm that it works. This can be vital when the problem only occurs in an environment that the developer cannot easily replicate (for example an Oracle-specific bug in Moodle).

Conclusion

Thinking about bug finding and fixing as knowledge creation puts a more positive spin on the whole process than is normally the case. This shows that lots of people, not just developers and testers, have something useful to contribute. This is something that open source projects are particularly good at harnessing.

It also shows why it makes sense for an organisation like the Open University to participate in an open source community like Moodle: Bugs may be discovered before they harm our users. Other people may help diagnose the problem, and there is a large community of developers with whom we can discuss different possible solutions. Other people will help test our fixes, and can help us verify that they do not have unintended consequences.