Thursday, May 2, 2013

Performance-testing Moodle

Background

The Open University is moving from Moodle 2.3.x to Moodle 2.4.3 in June. As is usual with a major upgrade, we (that is Rod and Derek) did some load testing to see if it still runs fast enough on our servers.

The first results were spectacularly bad! Moodle 2.4 was ten times slower. We were expecting Moodle 2.4 to be faster than 2.3. The first step was easy.

Performance advice: if you are running Moodle 2.4 with load-balanced web servers, don't use the default caching option that stores the data in moodledata on a shared network drive. Use memcache instead.

Take 2 was a lot better. Moodle 2.4 was now only about 1.5 times slower. Still not good enough, but in the right ball park. This blog post is about what we did next, which was to use the tools Moodle provides to work out what was slow and fix it.

Moodle's profiling tool

When your software is too slow, you need measurements to tell you which are the slow bits. Tools that do that are called profilers. One of the better profiling tools for PHP is called XHProf. The good news is that it has already been integrated into Moodle, and there is documenation about getting it working. Basically, you just need to install a PHP extension and turn on some options under Admin -> Development -> Profiling.

Since we already had the necessary PHP extension on our severs, that was really easy. The option I chose was to profile a page when &PROFILEME was added to the end of the URL, but there are several ways to control it.

Profiling output

Once you have profiled a page, the results appear under Admin -> Development -> Profiling runs.

This just lists the runs you have done. You need to click through to see the details of one run. That looks like a big table of all the function that were called as part of rendering the page, how many times each one was called, and how much time each function was responsible for.

Inclusive time is the amount of time taken by that function, and all the other functions it called. Exclusive time is the time taken by that function itself. Some people, like sam, seem to like that tabular view. I am a more visual person, so I tend to click on the [View full callgraph] link. That produces a crazily big image, showing graphically which functions call which other functions, and how much time is spent in each one. Here is the image for the run we are looking at:

You can click for the full-sized image. The yellow and red highlighting is applied automatically to try to highlight places where a lot of time is being spent. Sometimes it is helpful. Sometimes not. The red box in the bottom right is where we do database queries. No suprise there. We know calling the database is one of the slowest things you can do in Moodle code. The other red box is fetching data from memcache, which also involves connecting to another server.

What you have to look for is somewhere on the diagram that makes you go "What! We are spending how much time doing that?! That's surely not necessary." In this case, my eye was drawn to the far right of the yellow chain. When viewing this small course, we are fetching the course format object 134 times, and doing that is accounting for about 9% of the page-load time. There is no way we need to do that.

Fixing the problem

Once you have identified what appears to be a gross inefficiency, then you have to fix it. Mostly that follows the normal Moodle bug-fixing mechanics, but it is worth saying a bit about the different approaches you could take to changing the code:

  1. You might work out that what is being done is unnecessary. Then you can just remove it. For example MDL-39452 or MDL-39449. This is the best case. We have both improved performance and simplified the code.
  2. The next option is to take an overview of the code, and re-organise it to be more sensible. For example, in the course format case, we should probably just get the course format object once, and then use it. However, that would be a big risky change, which I did not want to do at this time (just before the Moodle 2.5 release). This approach does, however, also have the potential to simplify the code while improving performance.
  3. The next option is some other sort of refactoring. For example get_plugin_list was getting called a lot, and it in turn was calling the generic function clean_param to validate something. clean_param is quite fast, but not when you call it a thousand times. Therefore, it was worth extracting a simpler is_valid_plugin_name function. Doing that (MDL-39445) reduced the page load time by about 2%, but did make the code slighly more complex. Still, that is a worth-while trade off.
  4. The last option is to add caching. If you are doing the same thing repeatedly, and it is slow, and you can't avoid doing it repeatedly, then remember the answer the first time you compute it, and reuse it later. This should be the option of last resort because caches definitely increase the code complexity, and if you forget to clear them when necessary you introduce bugs. However, as in the course formats example we are looking at they can make a big difference. This fix reduced page-load times by 8%.

So far, we have found nine speed-ups we can make to Moodle 2.4 in the core Moodle code, and about the same in OU plugins. That is probably a 10-20% speed-up on most pages. Some of those are new problems introduced in Moodle 2.4. Others have been there since Moodle 2.0. We could really benefit from more people looking at Moodle profiling output often, and that is why I wrote this article.

13 comments:

  1. I forgot to include an important point: Before doing any profiling, change the setting on your development server to be more like a live server:

    * Theme designer mode: off
    * Cache language strings: on
    * Cache JavaScript: on
    * Debugging: ALL (or less, not DEVELOPER)

    (I think those are the main ones. I may have forgotten some.)

    ReplyDelete
  2. Thanks Tim, good post! Regarding your comment, I added the following to my config.php before doing this profiling stuff:

    $releaseversion = false;
    if ($releaseversion) {
    $CFG->debug = 0;
    $CFG->langstringcache = 1;
    $CFG->cachejs = 1;
    $CFG->themedesignermode = 0;
    $CFG->enablecssoptimiser = 1;
    }

    Now when I do profiling I just set the variable to 'true' and it fixes all those settings.

    ReplyDelete
  3. Well, interesting Tim. As a non-expert, this is a fascinating post.

    Is there any reason (in the global scheme of things) why such basic testing is not done before a major release rather than after?

    ReplyDelete
  4. We did this testing before a major release.

    As I linked to in the main article (http://docs.moodle.org/dev/Moodle_2.4_release_notes#Performance_improvements) Moodle HQ had done some testing before the 2.4 release that seemed to show it was faster.

    The extra complexity comes from all the different configurations in which Moodle can be installed. The main issue in this case is probably that number of different servers we use at the OU which all have to communicate with each other.

    Another potential issue is that artificially constructed test data may not show the same distribution as test data (however much you try) and that might change performance.

    The strength of open source software is that eventually it does get tested in many different configurations and scenarios, and the problems are found and fixed. Thus you end up with very robust software. If, however, you are the first people to push Moodle in a particular direction, then you might be the first people to trip over some problem that no-one else has found before. Fortunately, at the OU, we have a good team who can deal with that.

    ReplyDelete
  5. Tim - You mention using memcache instead of the moodledata.

    Is this all done within MUC, or is this using custom patches?

    Do you guys run a dedicated memcached server?

    ReplyDelete
  6. You just need to configure this on the MUC admin screen. (Admin -> Plugins -> Cache, I think.)

    We run a memcache server, like we run a DB server. It has to hold data that is accessible from all the web servers.

    ReplyDelete
  7. Do you have access to Google Analytics (or similar) data on your live sites?

    Our data suggests that the server only accounts for 1/3 of the time that users spend waiting for an average page and on the login page (where the user is more likely to have an empty cache) this can drop to as low as 1/10 of the time.

    And our usage is almost all on campus intranet or local broadband. The network time becomes even more important for users on mobile networks or in geographically distant places like Singapore.

    Oh, and we currently don't run any kind of PHP cache (yes, I know, don't ask), which is obviously a best practice and should further reduce the portion of the delay contributed by the server.

    As a result, I'd like to see more focus on metrics such as the "Kilobytes downloaded" one in your link for performance improvements in 2.4 (and I'd love to know how they measure that). It seems there's already been a lot of work in other areas, so maybe the remaining low hanging fruit is in the front end.

    For example, by running Google's page speed analyser I noticed that there are two links to empty javascript files (https://tracker.moodle.org/browse/MDL-35230). And I think there's more simple fixes like that to be found.

    ReplyDelete
  8. Andrew Nichols is doing work on JavaScript load performance. See https://tracker.moodle.org/browse/MDL-38284.

    Measuring Moodle page-load is hard, since normally you can start reading the page before all the JavaScript has loaded, so percieved performance for users (which is what is most important) is better than crude measurements suggest.

    But yes, we need people to continue looking at this, and finding optimisations.

    ReplyDelete
  9. THanks for this example, Tim.

    ReplyDelete
  10. If you are running windows and looking for an easy place to get the XHProf extension, you can find it at; http://dev.freshsite.pl/php-extensions/xhprof.html

    Russell Smith

    ReplyDelete
  11. Thanks Tim,
    as we are currently working in upgrading our moodle from 1.9 to 2.4, we will take your testing scenarios in cosideration when testing our new moodle environment.

    Awadh Al Belushi
    Sultan Qaboos University
    Sultanate of Oman

    ReplyDelete
  12. I get the error failed to execute cmd: "dot -Tpng". stderr: `sh: 1: dot: not found ' when pressing full callgraph. Any idea why? I installed v 0.9.3 on Ubuntu with PHP 5.3.10

    ReplyDelete
  13. Well, dot is a command-line program that renders graphs, and that must be what this is using. Presumably you don't have it installed.

    Looks like apt-get graphviz is the way to install it, but I don't really know. I just Googled that. Running 'dot -v', or 'which dot' is a way to test if it is installed.

    ReplyDelete