The “un-science” of learning evaluation


Evaluation is notoriously under-done in the corporate sector.

And who can blame us?

With ever-increasing pressure bearing down on L&D professionals to put out the next big fire, it’s no wonder we don’t have time to scratch ourselves before shifting our attention to something new – let alone measure what has already been and gone.

Alas, today’s working environment favours activity over outcome.

Pseudo echo

I’m not suggesting that evaluation is never done. Obviously some organisations do it more often than others, even if they don’t do it often enough.

However, a secondary concern I have with evaluation goes beyond the question of quantity: it’s a matter of quality.

As a scientist – yes, it’s true! – I’ve seen some dodgy pseudoscience in my time. From political gamesmanship to biased TV and clueless newspaper reports, our world is bombarded with insidious half-truths and false conclusions. The trained eye recognises the flaws (sometimes) but of course, most people are not science grads. They can fall for the con surprisingly easily.

The workplace is no exception. However, I don’t see it as employees trying to fool their colleagues with creative number crunching, so much as those employees unwittingly fooling themselves.

If a tree falls in the forest

The big challenge I see with evaluating learning in the workplace is how to demonstrate causality – ie the link between cause and effect.

Suppose a special training program is implemented to improve an organisation’s flagging culture metric. When the employee engagement survey is run again later, the metric goes up. Congratulations to the L&D team for a job well done, right? Not quite.

What actually caused the metric to go up? Sure, it could have been the training, or it could have been something else. Perhaps a raft of unhappy campers left the organisation and were replaced by eager beavers. Perhaps the CEO approved a special bonus to all staff. Perhaps the company opened an onsite crèche. Or perhaps it was a combination of factors.

If a tree falls in the forest and nobody hears it, did it make a sound? Well, if a few hundred employees undertake training, but no one measures its effect, did it make a difference? Without a proper experimental design, the answer remains unclear.

Evaluation by design

To determine with some level of confidence whether a particular training activity was effective, the following eight factors must be considered…

1. Isolation – The effect of the training in a particular situation must be isolated from all other factors in that situation. Then, the metric attributed to the staff who undertook the training can be compared to the metric attributed to the staff who did not undertake the training.

In other words, everything except participation in the training program must be more-or-less the same between the two groups.

2. Placebo – It’s well known in the pharmaceutical industry that patients in a clinical trial who are given a sugar pill rather than the drug being tested sometimes get better. The power of the mind can be so strong that, despite the pill having no medicinal qualities whatsoever, the patient believes they are doing something effective and so their body responds in kind.

As far as I’m aware, this fact has never been applied to the evaluation of corporate training. If it were, the group of employees who were not undertaking the special training would still need to leave their desks and sit in the classroom for three 4-hour stints over three weeks.


Why? Because it might not be the content that makes the difference! It could be escaping the emails and phone calls and constant interruptions. It could be the opportunity to network with colleagues and have a good ol’ chat. It might be seizing the moment to think and reflect. Or it could simply be an appreciation of being trained in something, anything.

3. Randomisation – Putting the actuaries through the training and then comparing their culture metric to everyone else’s sounds like a great idea, but it will skew the results. Sure, the stats will give you an insight into how the actuaries are feeling, but they won’t be representative of the whole organisation.

Maybe the actuaries have a range of perks and a great boss; or conversely, maybe they’ve just gone through a restructure and a bunch of their mates were made redundant. To minimise these effects, staff from different teams in the organisation should be randomly assigned to the training program. That way, any localised factors will be evened out across the board.

4. Sample size – Several people (even if they’re randomised) cannot be expected to represent an organisation of hundreds or thousands. So testing five or six employees is unlikely to produce useful results.

5. Validity – Calculating a few averages and generating a bar graph is a sure-fire way to go down the rabbit hole. When comparing numbers, statistically valid methods such as Analysis of Variance are required to determine whether a difference is significant or just noise.

6. Replication – Even if you were to demonstrate a significant effect of the training for one group, that doesn’t guarantee the same effect for the next group. You need to do the test more than once to establish a pattern and negate the suspicion of a one-off.

7. Subsets – Variations among subsets of the population may exist. For example, the parents of young children might feel aggrieved for some reason, or older employees might feel like they’re being ignored. So it’s important to analyse subsets to see if any clusters exist.

8. Time and space – Just because you demonstrated the positive effect of the training program on culture in the Sydney office, doesn’t mean it will have the same effect in New York or Tokyo. Nor does it mean it will have the same effect in Sydney next year.
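To make a couple of these factors a little more concrete, here is a minimal Python sketch of randomisation (factor 3) and a validity check (factor 5). It is an illustration under stated assumptions, not a recipe: the function names `assign_groups` and `permutation_p_value`, the team names and the survey scores are all invented for the example. Note too that it swaps in a permutation test rather than Analysis of Variance, purely because a permutation test needs nothing beyond the standard library.

```python
import random
from statistics import mean


def assign_groups(staff_by_team, seed=0):
    """Randomly split every team in half: one half trains, the other is the control.

    Splitting within each team spreads localised factors (a great boss,
    a recent restructure) across both groups.
    """
    rng = random.Random(seed)
    trained, control = [], []
    for team in sorted(staff_by_team):
        members = list(staff_by_team[team])
        rng.shuffle(members)
        half = len(members) // 2
        trained.extend(members[:half])
        control.extend(members[half:])
    return trained, control


def permutation_p_value(trained_scores, control_scores, n_shuffles=5000, seed=0):
    """Estimate how often a gap at least as big as the observed one arises by chance.

    The scores are pooled and reshuffled into two groups of the original
    sizes; the returned value is the share of shuffles whose gap in means
    matches or exceeds the real gap.
    """
    rng = random.Random(seed)
    observed = abs(mean(trained_scores) - mean(control_scores))
    pooled = list(trained_scores) + list(control_scores)
    n_trained = len(trained_scores)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        gap = abs(mean(pooled[:n_trained]) - mean(pooled[n_trained:]))
        if gap >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    # Hypothetical staff list and made-up culture scores, for illustration only.
    staff = {
        "actuaries": ["a1", "a2", "a3", "a4"],
        "sales": ["s1", "s2", "s3", "s4", "s5", "s6"],
    }
    trained, control = assign_groups(staff)
    print("Trained:", trained)
    print("Control:", control)

    trained_scores = [7, 8, 6, 9, 7]
    control_scores = [6, 7, 5, 7, 6]
    print("Estimated p-value:", permutation_p_value(trained_scores, control_scores))
```

A small p-value suggests the gap between the two groups is unlikely to be a fluke of who landed in which group. It says nothing about placebo effects, replication or subsets, and with only a handful of people in each group (factor 4) the estimate will be too coarse to lean on.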

Weird science

Don’t get me wrong: I’m not suggesting you need a PhD to evaluate your training activity. On the contrary, I believe that any evaluation – however informal – is better than none.

What I am saying, though, is that if you want your results to be more meaningful, a little bit of know-how goes a long way.

For organisations that are serious about training outcomes, I go so far as to propose employing a Training Evaluation Officer – someone who is charged not only with getting evaluation done, but with getting it done right.


This post was originally published on Ryan’s blog, E-Learning Provocateur on 29 November 2011.


  1. Thanks Jeevan.

    Unfortunately I missed that webinar, but it did get me thinking about evaluation, which in turn inspired me to write this post.

    I’m pleased to see the webinar’s slides and synopsis are available.

  2. Good points you make, Ryan. Evaluation is always a challenge – in my experience our business stakeholders are more interested in activity than outcomes; however, it is the starting point which we didn’t even have. But it’s our responsibility as learning professionals to educate them about why it should be more about outcomes.

    To your last point, I hired an evaluation person with a market research background – rather than a traditional profile of an org. psych – which provided a different slant on things. Our biggest challenge was getting meaningful data and automating to reduce manual data collection and analysis.

  3. Thanks Peter.

    I agree that it’s our responsibility as learning professionals to educate our customers as to why it should be more about outcomes. After all, the point of all the activity is to improve performance.

    Interesting how you hired an evaluation person with a market research background. I guess if they know the stats, why not?

    And I think you’re right in regards to the challenge of data collection and analysis. I guess it depends on the systems that are being used and how easy it is to generate useful reports. But in any case, it’s going to be a big job.

  4. Great post Ryan and great discussion Peter and Jeevan. I agree evaluation needs to be reliable and valid for it to be meaningful and useful. If evaluation findings aren’t meaningful and cannot be used to make decisions, why do it? However, since we know learning evaluation is vulnerable to so many confounding factors, the question is how robust is robust enough. Subsets, departments, time, geography, content, representation and sample size all need to be considered, but then so do things like learning format, day of week, time of day, facilitator, training category, motivation to learn, prior exposure or knowledge, management support, or even training facility etc. To take all of these factors into consideration we need a fairly large sample size over a period of time so these analyses can be properly conducted, ie we need to continue collecting this information automatically and electronically after each individual learning event. I think this is still the biggest challenge for evaluation. The importance of a well-designed study should not be overlooked if we want evaluation to be valid and reliable, but without the right data it’s difficult to capitalise on even the most scientific evaluation frameworks.

  5. Thanks Melissa.

    Yes, there are so many variables, it can go a bit crazy. I appreciate, too, that much of what I wrote in the article is a fantasy in the modern corporate environment. Or maybe “aspiration” is a fairer term.

    Having said that, I agree the limiting factor is time. If, as you suggest, we continually collect the relevant data after each learning event, eventually we should have enough data per variable to start isolating the effects from the noise.

