Friday, April 25, 2014

Load-testing Moodle 2.6.2 at the OU

At the start of June we will upgrade the OU Moodle sites to Moodle 2.6. Before then we need to know that it will still perform well when subjected to our typical peak load of 100,000 page-views per hour. This time, I got 'volunteered' to do the testing.

The testing servers

To do the testing, we have a set of 10 servers that are roughly similar to our live servers. There are six web servers for handling normal requests, and one web server that handles 'administrative' requests, that is, any URL starting /admin, /report or /backup. Those pages are often big, long-running processes, rather than quick page views, so it is better to put them on a different server that is tuned differently. One 'web' server is just for running the cron batch processes. Finally we have a database server and a file server.
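As a rough illustration of that split, here is a minimal Python sketch of the routing rule. The function name and the pool labels are made up for this example, and it assumes the prefix has to match a whole path segment; in practice the split would be done by whatever sits in front of the web servers, not by application code like this.

```python
# Prefixes taken from the description above; everything else is illustrative.
ADMIN_PREFIXES = ("/admin", "/report", "/backup")


def pick_pool(path: str) -> str:
    """Return 'admin' for administrative URLs and 'web' for everything else."""
    for prefix in ADMIN_PREFIXES:
        # Match the prefix itself or anything below it, e.g. /admin/index.php.
        # Assumption: /administrator should NOT count as an /admin URL.
        if path == prefix or path.startswith(prefix + "/"):
            return "admin"
    return "web"


if __name__ == "__main__":
    for example in ("/admin/index.php", "/course/view.php", "/backup/backup.php"):
        print(example, "->", pick_pool(example))
```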

In order to be able to make easy comparisons, we make two copies of our live site onto these servers. That is, we have two different www_root folders, which correspond to different URLs, lrn2-perf-cur and lrn2-perf-upg. In due course we will upgrade one of the copies to the new release while leaving the other running the current version of the code. This makes it easy to switch back and forth when comparing the two.

In addition to the servers running Moodle, we have 6 virtual machines to generate the simulated load.

The testing procedure

We test using JMeter. In order to test Moodle, you need to send lots of requests for different pages, many of which include numeric ids in the URLs. Therefore, the JMeter script needs to be written specifically for the site being tested. Fortunately, our former colleague James Brisland made a script that automatically generates the necessary JMeter script. We shared that script with the community, and you can find a copy here. However, we shared it a long time ago, and since then our version has probably diverged from the community version a bit. Oops!

I say this tool automatically generates the necessary JMeter script, but sadly that is an oversimplification. It fails in certain cases, for example when a forum is set to separate groups mode. So, having generated the JMeter script, you need to run it and check that it actually works. If not, you have to go into the courses and activities being tested and modify the settings. We really ought to automate that, but no one has had the time. Anyway, eventually (and this took ages) you have a working test script.
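To show why the script has to be site-specific, here is a small Python sketch that turns a list of ids into the kind of URLs a test plan needs to request. This is not the generator mentioned above, just an illustration. The two script paths are standard Moodle pages; the host name and the ids are placeholders, and a real generator would have to look the ids up in the site being tested.

```python
from urllib.parse import urlencode

# Standard Moodle page scripts. The keys are labels invented for this sketch.
PAGE_SCRIPTS = {
    "course_view": "/course/view.php",
    "forum_view": "/mod/forum/view.php",
}


def build_urls(base_url, page_type, ids):
    """Build one URL per id for the given page type."""
    script = PAGE_SCRIPTS[page_type]
    return [f"{base_url}{script}?{urlencode({'id': i})}" for i in ids]


if __name__ == "__main__":
    # Placeholder host and ids, purely for illustration.
    for url in build_urls("https://moodle.example.invalid", "course_view", [12, 34]):
        print(url)
```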

Tuning the test script

Once the test script works, in that it simulates users performing various actions without error, one at a time, then you have to start running it at high load. That is, simulating lots of users doing lots of things simultaneously. After it has settled down, you let it run for 15 or 20 minutes, and then look at what sort of load you are generating. The goal is to get about the same number of requests per second for each type of page (course view, forum view, post to forum, view resource, ...) in the test run as in real use on the live system. If not, you tweak the time delays, or number of threads, and then run again. It took about four runs to get to a simulated load that was close to (actually slightly higher than) the target request rates we had taken from the live server logs.
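The arithmetic behind each tuning round can be sketched in a few lines of Python. The only figure here that comes from our site is the peak of 100,000 page-views per hour (about 27.8 per second overall); the per-page counts, targets and thread numbers are invented for illustration, and the helper assumes throughput grows roughly in proportion to the number of threads, which is only approximately true.

```python
def observed_rates(counts, run_seconds):
    """Convert request counts from one test run into requests per second."""
    return {page: n / run_seconds for page, n in counts.items()}


def suggest_threads(current_threads, observed, target):
    """Scale the thread count in proportion to target / observed rate.

    Assumption: throughput is roughly linear in the number of threads.
    """
    if observed <= 0:
        raise ValueError("observed rate must be positive")
    return max(1, round(current_threads * target / observed))


if __name__ == "__main__":
    # Overall peak from the live site: 100,000 page-views per hour.
    print(f"overall target: {100_000 / 3600:.1f} requests/second")

    # Invented numbers: a 15-minute (900 s) run and a made-up target rate.
    rates = observed_rates({"course_view": 9000}, run_seconds=900)
    print(rates)
    print(suggest_threads(20, rates["course_view"], target=12.5))
```

After changing the thread count you still have to run the test again and re-measure, because delays and server response times interact with the number of threads.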

All that creation and tuning of the test scripts is done on the lrn2-perf-cur copy of the site. Once that is OK, then you run the same script against lrn2-perf-upg. That should give exactly the same performance, and before proceeding we want to verify that is the case. It turned out at first that it was slightly different. I had to find the few admin settings that were different between the two servers. Once the configuration was the same, the performance was the same, and we were finally in a position to start comparing the old and new systems.

Upgrade to the new version of Moodle

The next step is to upgrade lrn2-perf-upg to the new code. This code is still work-in-progress. Final testing of the code before release happens next month, but we try to keep our code in a releasable state, so it should be OK for load-testing. However, this is the first time we have run the upgrade on a copy of all our data. Unsurprisingly, we found some bugs. Fortunately they were easily fixed, and better to find them now than later.

Also, a new version of Moodle comes with a lot of new configuration options. This is the moment to consider what we should set them to. Luckily, most of the default values were right, so there was not a lot to do. Moodle prompts you for most of the new settings you need to make as part of the upgrade. However, it does not prompt you to configure any new caches, so you have to remember to go and do that.

Compare performance

At long last (about one and a half weeks into the process) you are finally ready to run the test you want. How does 2.6 performance compare to 2.5? Here is a screen-grab of today's testing:

Good news: Moodle 2.6 is mostly a bit faster (5-10%) than Moodle 2.5. Bad news: every 15 minutes, it suddenly goes slow for about 15 seconds. What?!
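If you only have the raw response times rather than a graph, a check like the following Python sketch can confirm that slow-downs are periodic. It is an illustration, not what we actually ran: the function names and the threshold of three times the median are arbitrary choices, and the data in the example is synthetic, shaped to resemble a 15-second slow-down every 15 minutes.

```python
from statistics import median


def spike_starts(samples, factor=3.0):
    """Return the times at which a slow period begins.

    samples: list of (seconds_since_start, response_ms), sorted by time.
    A sample counts as slow when it exceeds factor * the median response time.
    """
    baseline = median(ms for _, ms in samples)
    starts = []
    in_spike = False
    for t, ms in samples:
        slow = ms > factor * baseline
        if slow and not in_spike:
            starts.append(t)
        in_spike = slow
    return starts


def gaps(times):
    """Return the intervals between consecutive spike start times."""
    return [b - a for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    # Synthetic data: one sample every 5 s for an hour, 200 ms normally,
    # 2000 ms for 15 s at the 15, 30 and 45 minute marks.
    samples = [
        (t, 2000 if t >= 900 and t % 900 < 15 else 200)
        for t in range(0, 3600, 5)
    ]
    starts = spike_starts(samples)
    print(starts, gaps(starts))
```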

Problem solving

Actually, there is a logical explanation. We have cron set to run every 15 minutes, so surely the problem is caused by cron, right? No. Wrong! We stopped cron running, and the spikes remained. We tried various things to see what it might be, and could not make any sense of it. One thing we discovered was that the spikes were about as large as the spikes you get by clicking the 'Purge all caches' button. OK, so something is purging caches, but what?

To cut a long story short, you need to remember that our two test sites lrn2-perf-cur and lrn2-perf-upg are sharing the same servers. Therefore they are sharing the same memcache storage. It appears that something in cron in Moodle 2.5 purges at least some of the caches. When we stopped cron on our Moodle 2.5 site the spikes went away on our 2.6 site. I am afraid we did not try to work out why Moodle 2.5 cron was purging caches, but there is probably a bug there. It turns out that purge caches does not cause a measurable slow-down in Moodle 2.5, at least not for us, which is worth knowing.
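A toy model in Python shows how one site's purge can hurt another when they share a cache server. This is not Moodle's cache API, and we did not confirm the exact mechanism; the classes are invented for this sketch, and it assumes the purge is implemented as a flush of the whole store, which ignores the per-site key prefixes.

```python
class SharedStore:
    """Toy stand-in for one cache server used by several sites."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def flush_all(self):
        # Drops every key, whichever site wrote it.
        self._data.clear()


class SiteCache:
    """One site's view of the shared store, with its own key prefix."""

    def __init__(self, store, site):
        self.store = store
        self.site = site

    def get(self, name):
        return self.store.get(f"{self.site}:{name}")

    def set(self, name, value):
        self.store.set(f"{self.site}:{name}", value)

    def purge(self):
        # Assumption for this sketch: purge flushes the entire store.
        self.store.flush_all()


if __name__ == "__main__":
    store = SharedStore()
    cur = SiteCache(store, "cur")
    upg = SiteCache(store, "upg")
    upg.set("modinfo", "expensive-to-rebuild")
    cur.purge()                 # the other site purges its caches...
    print(upg.get("modinfo"))   # ...and this site's entry is gone too
```

With separate stores per site, or a purge that only removes keys with the site's own prefix, the second site would keep its entries.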

Why does Purge caches cause a slow-down in 2.6 but not in 2.5? I am pretty sure the reason is MDL-41436. When things slowed down, it was the course page that slowed down the most, and that is the one most dependent on the modinfo cache.

Summary

  • Moodle 2.6 is about 5-10% faster than 2.5, at least on our servers, which are RHEL5 + Postgres + memcache cluster store. (MDL-42071 - why has that not been integrated yet?)
  • In Moodle 2.5, doing Purge caches when your system is running at high load seems to cause remarkably little slow-down.
  • In Moodle 2.6, doing Purge caches does slow things down a lot, but only very briefly. Performance recovered within about 15 seconds in our test, but then the test was only using a few courses.
  • In Moodle 2.6, clicking Clear theme caches (at the top of Admin -> Appearance -> Themes -> Theme selector) causes no noticeable slow-down.

The bit about what happens when you clear the caches is important because sometimes, when you patch the system with a bug fix, you need to purge one or more caches to make the fix take effect. In the past, we did not know what effect that had. We were cautious and had people waiting up until after midnight to click the button at a time of low system load. It turns out now that is probably not necessary. We can clear caches during the working day, when staff are in the office to pick up the pieces if anything does go wrong.

9 comments:

  1. Tim, have you tested 2.7 yet? cron on it is pretty much faster and takes less memory.

  2. We won't test 2.7 until we have upgraded all our add-ons, which won't be until September.

  3. Hi Tim,

    I was wondering if you could share a bit more about your testing environment. Mainly, are you guys using PHP54 or PHP55? NginX, Apache? FastCGI+PHPFPM or just PHPCGI? Lastly, have you made the move to OpCache (if you weren't on it prior)?

    Otherwise great work. I have used the jMeter test generator to generate some quiz load tests - trying to get everyone moved to Postgres so will be running MySQL vs Postgres environment tests soon!

    Thanks for your work,
    Jason Cameron

  4. We are on Apache + PHP 5.4 as an Apache module with APC.

    We are planning to move to PHP5.5 + Apache later this year. Given the time, we would like to test:

    * FPM instead of Apache module
    * OpCache instead of APC
    * Redis instead of memcache

    We have heard good rumours about all of those changes, but would like to confirm it in testing with our workload before making the switch.

  5. Oops! I forgot one more on our list:

    * Upgrade from Postgres 9.0 to 9.3 - should also be a performance win.

  6. Hi Tim, we're doing JMeter testing too ahead of our upgrade so a very useful post thanks. Just on subject of APC ... do you foresee any probs if we upgrade from 2.4 to 2.6 and keep APC for a few weeks until we can schedule more time to do the OpCache and memcached setup?

  7. @Mike I have no idea. I can't see why it would be a problem. APC is not particularly broken, it is just that OpCache is better.

    Separating out the two different upgrades is probably a good idea. For example if something unexpected goes wrong, you will know which change was responsible.

  8. Thank you for sharing those test results. Very interesting.

    I was wondering...
    About the shared storage, do you use NFS? Or do you have a faster solution which can be resized dynamically without the need to take the storage down for maintenance?

  9. The storage back-end is some sort of SAN. However, I believe that to the web servers, it looks like an NFS mount. However, this is not really my area, so what I write here may not be 100% accurate.
