Howard Popeck writes:
On the face of it, subjective (ears, not machines) group testing of audio equipment, by which I mean comparing ‘n’ items against each other, is a great idea for the potential buyer. Yet the magazine reader lacks insight into what actually goes on, and the magazines, unfortunately, fail to recognize, let alone acknowledge, the severe psychological failings inherent in the practice. It is my view that the whole procedure is fatally flawed.
Consequently group testing, as historically practiced, can be fatally damaging to the buyer’s wallet. There are many causes. Let me explain the first of them (the ‘Halo Effect’) in this occasional series.
Memory – failing or otherwise
Group tests are, in the most distilled analysis, utterly dependent on memory. This is not, of course, the case with single-item reviews. In, say, a group test of six pairs of loudspeakers, without the ability to hold in memory at least any two adjacent pairs of speakers, let alone all six pairs, the entire exercise and its findings are utterly useless.
I'm not currently entering a discussion of what constitutes a ‘good’ sound, or if reviewers can really hear, or if they have good memories. All of that, although important, is a distraction right now. The point of focus here is that reliance on memory is also a reliance on the consistent accuracy of the mental filtering processes that enable pair-wise and group comparisons in listening tests to have any meaning.
It is my contention, based on many published works on the errors of judgement inherent in all facets of choice – selecting a lover, a digital camera, a car, a holiday, or an amplifier – that group tests are beyond useless. They are damaging.
Polish that Halo, son
The first error of judgement, by which I mean the most readily understood rather than the most important, is the ‘Halo Effect’.
I first came across this effect when, during my absence from the audio world, I was Head of Innovation for a company I co-founded in 1998, which eventually became, and remains known as, Cognisco Ltd. http://www.cognisco.com/history.asp
I was working on a tool that would enable employers to make better quality decisions about who to invite to first interviews and who to promote in a company while simultaneously enabling the applicant to present themselves in the most truthful light. I solved the problem (and was awarded a Patent), but only after a lot of research into the topic in general and the primary errors of judgement during decision-making in particular.
Here is an extract from one of my contributions to the business plan from 1998. In this I am referring to the business process for recruitment that I developed. However, the basic belief still holds true, in my view, regarding the group testing of audio.
“Common observations of people’s behaviour, both at work and in everyday life, suggest that most individuals possess both appropriate and inappropriate employment experience and characteristics. This is reflected in their job applications.
The individual applicant who is superior on all favourable characteristics is extremely rare as is the individual who has no redeeming features. Yet research evidence indicates that recruiters frequently perceive people, via their job applications, in these black and white terms. Applicants tend to be judged as all good or all bad.
This halo effect is particularly likely to occur where an applicant has a single outstanding characteristic revealed in their job application. For example, if an applicant is unusually high on one attribute, recruiters typically tend to minimise or ignore any weaknesses they have in other areas.”
So, and I don’t think this is an oversimplification, if you substitute a pair of loudspeakers, or a DAC, or any other piece of audio equipment for the job candidate described above, you can start to see where the problems are. Well, at least some of the problems.
The lottery of group testing
Let’s say the first pair of speakers (Pair #1) have very impressive bass extension and, most importantly, that this is their standout feature.
Let’s just focus on this for the moment. It’s highly likely that the reviewer, probably without being conscious of it, will rate the bass performance of the next pair against the preceding pair. Think about it. You would too – right? So, and please remember that for this aspect we are focusing exclusively on the bass, the bass performance of pair #2 is automatically valued unfavourably against pair #1.
We aren’t even going to consider what is meant by ‘better’ or superior; I’ll come to that in future articles. So on bass, pair #1 is judged superior to pair #2. Now then, let’s assume (we can’t really, but for simplicity we will here) that all speakers have identical efficiency, so that all six pairs are playing at identical sound pressure levels. Mr. Reviewer then tries the first two pairs with a known source of midrange detail, similarly with treble detail, and then, just to be fair, with a recording of terrific dynamics. So far, between just two pairs, there are four pair-wise comparisons.
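The memory burden here is easy to put numbers on. A minimal sketch (the facet list and function name are my own, purely for illustration): each distinct pairing of speakers must be compared on each facet, so the count grows with the square of the number of contenders.

```python
from math import comb

facets = ["bass", "midrange", "treble", "dynamics"]  # the four facets above

def total_comparisons(num_pairs: int, num_facets: int) -> int:
    """Each distinct pairing of speakers is compared on every facet."""
    return comb(num_pairs, 2) * num_facets

# Two pairs of speakers, four facets: the four comparisons in the text.
print(total_comparisons(2, len(facets)))   # 4

# The full six-pair group test: 15 pairings x 4 facets.
print(total_comparisons(6, len(facets)))   # 60
```

Two pairs yield the four comparisons above; the full six-pair group test already demands sixty separate, memory-dependent judgements.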
If you stop to consider this, then you'll probably agree that there is a lottery going on.
First, if it just so happens that, as described above, the very first pair listened to have, in isolation, terrific bass, then come what may the second pair may be downgraded on bass. And here’s the insidious bit: like it or not, Mr. Reviewer’s subconscious, over which he has little control, will very probably, irrespective of the superiority of the mid and top in pair #2, still rate pair #1 better overall.
This is simply because the halo effect demands it be so.
Having said this, if Mr. Reviewer isn’t a bass freak, he might still acknowledge the bass superiority of pair #1 – but he’ll give it less significance.
Perhaps he’s into top-end detail in a big way. It may be that pair #1 is inferior in this respect. However, and here is the catch: research by Solomon E. Asch and others seems to indicate that if bass is the first aspect to be tested, then irrespective of the importance the conscious mind attaches to bass, pair #2 will be judged inferior overall.
Does this imply that, had treble detail been the first aspect investigated, and had pair #1 been, in isolation, not too good in this respect – to the extent that pair #2 was judged superior – then even if everything else about pair #1 was markedly superior, pair #1 would be judged inferior overall? It’s my belief it would. Tricky, right? Oh yes indeed – and that’s just the start.
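This order-dependence can be caricatured in a toy model. Everything here – the scores, the facet names, the size of the primacy bonus – is invented for illustration, not drawn from Asch’s research or any real review. The point is only that over-weighting whichever facet happens to come first can flip the overall verdict between the very same two loudspeakers.

```python
# Toy model of the halo/primacy effect described above. Scores and the
# primacy bonus are hypothetical; real listening is far messier.

def overall_verdict(scores_a, scores_b, facet_order, primacy_bonus=4.0):
    """Sum facet-by-facet wins; the FIRST facet compared gets extra weight."""
    total_a = total_b = 0.0
    for i, facet in enumerate(facet_order):
        weight = primacy_bonus if i == 0 else 1.0
        if scores_a[facet] > scores_b[facet]:
            total_a += weight
        elif scores_b[facet] > scores_a[facet]:
            total_b += weight
    return "pair #1" if total_a > total_b else "pair #2"

# Pair #1 has standout bass; pair #2 is modestly better everywhere else.
pair1 = {"bass": 9, "midrange": 6, "treble": 5, "dynamics": 6}
pair2 = {"bass": 6, "midrange": 7, "treble": 8, "dynamics": 7}

# Bass first: pair #1's one strong suit carries the day.
print(overall_verdict(pair1, pair2, ["bass", "midrange", "treble", "dynamics"]))
# prints "pair #1"

# Treble first: the same two speakers, the opposite verdict.
print(overall_verdict(pair1, pair2, ["treble", "bass", "midrange", "dynamics"]))
# prints "pair #2"
```

Nothing about the speakers changed between the two calls; only the order of audition did.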
So what can be made of this?
Well, to state the blindingly obvious, humans are not machines. More importantly, they cannot be expected to behave as such. They do not possess infinite recall.
It seems that in the case of the first pair-wise comparison (bear in mind Mr. Reviewer hasn’t even started on the other four pairs) the crucial factor is not which facet is most important to this first comparison, but rather which facet is compared first. It influences the whole outcome.
The position is made even more complicated if, as is quite likely, Mr. Reviewer chooses his first comparison to be on that facet he likes best. Bass, mid, imaging, dynamics. It really doesn’t matter which.
But it does matter where he places it in the list of aspects he is comparing.
A lottery that only the magazines and one accidental winner win
In short, for the first two pairs (and all the others too, but I’ll come back to that in the next of these articles) the whole process is a lottery, entirely dependent on luck, i.e. matters over which the maker has no control.
If pair “A” are the first pair tested AND the first test is of a facet of performance in which the reviewer is very interested AND is impressed by, then, sad to report, it doesn’t really matter how well pair “B” performs in other areas. The makers of pair "B" are pretty much stuffed.
Before I finish this today, let’s just add one level of complication by the addition of pair #3. Mr. Reviewer now has to compare several aspects of the performance of three pairs. For fairness, he cannot totally reject pair #1 at this stage. Were he to do so, it would be verifiable evidence that the outcome of the review was entirely dependent on where in the queue the maker was placed. In which case any sensible maker would try to insist that theirs was the final pair to be assessed.
So, you think that’s hard?
Mr. Reviewer is compelled, in fairness, to hold in his fallible (just like mine) memory impressions of ‘n’ aspects across ‘y’ options. With each successive increase in the number of variables in the group test, the problem escalates, and continues to do so.
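To make the escalation concrete, here is a small sketch (my own framing, holding the facet count at the four used earlier) of how the impressions to be held and the pair-wise judgements to be made grow as pairs of speakers are added:

```python
from math import comb

N_FACETS = 4  # bass, midrange, treble, dynamics, as earlier

print("pairs  impressions-held  pairwise-judgements")
for y in range(2, 7):
    impressions = N_FACETS * y          # 'n' aspects across 'y' options
    judgements = comb(y, 2) * N_FACETS  # every pairing, on every facet
    print(f"{y:5d}  {impressions:16d}  {judgements:19d}")
```

Going from two pairs to six takes the judgement count from 4 to 60 – a fifteen-fold increase for a three-fold increase in contenders.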
Naturally, reviewers engaged in group tests (the problems are not of the same magnitude for reviews of items in isolation, as undertaken by Ken Kessler and others) will try to convince themselves, and the rest of us, that they do the very best they can. And by and large I guess this is true. The unpalatable fact, though, is that ‘the best we can’ is just not good enough.
The ubiquitous scapegoat
The escape clause is of course that one should not rely exclusively on the reviewer for one’s buying decision. Quite so. But the insidious subtext is that the reviewer’s comments are the prime motivator in the buyer deciding which of the six pairs of speakers to audition – and which not to.
As we have seen earlier, for reasons of bad luck in terms of where one sits in the list of products to be group-tested, plus the inevitable subconscious effects of reviewer bias, being a maker engaged in a group test is, well, fraught with potential and unavoidable misfortune.
Forthcoming features in this series about the failure of group comparative audio tests:
- Peer Pressure Within The Reviewing Team In The Group Test.
- The Primacy Effect.
- The Contrast Effect.
- The Leniency/Strictness Effect.
- The Central Tendency.
- Are Review Samples Representative Of What’s In The Shops?
- The Recency Effect.
- The Corrosive Effects Of Similarity.
- How Factual Errors Creep Into Reviews.
- Ignoring The Evidence.
- Distorting The Evidence.
- False Inferences.
- The Failure Of Intuition.
- Why We Don’t Have This Problem With Cars, Digital Cameras And Lovers.
- And anything else that occurs to me re this subject.
That’s it for now.