Wine Scores: Noise Explained (Part 1)

  • Apr 14
  • 8 min read

The Noise in the Glass: Why Wine Scores at Our Club Are Messier Than You Think


Wine scoring looks simple. You taste, you feel, you write a number. But decades of research in wine economics tell a more uncomfortable story — one that is worth sitting with honestly, especially if you are the kind of club that takes its scoring seriously.


The 1976 Judgment of Paris — the blind tasting that famously upended the wine world by ranking California wines above French classics — has been picked apart by researchers ever since. What they found was not just a surprising result but a troubling one: the judges disagreed substantially, and some were far more consistent than others. Research published in the Journal of Wine Economics showed that consistent and inconsistent tasters at the Paris tasting produced quite different results, and that findings based on the full panel masked those individual differences entirely.


The problem runs deeper than individual quirks. Tasters do not use the same scale — some are generous, some are stingy — and that divergence in baseline alone can swamp any signal about the wine's actual quality. Two people tasting the same wine may be playing completely different games. Expectations compound the noise further. Price information, prior ratings, a confident neighbour — all of these shift scores, and experienced tasters are not reliably less susceptible than novices.


Now bring this into our setting. We are enthusiastic, curious, and honest. We are not trained tasters. The deviations that researchers document in professional panels — with controlled conditions, calibrated judges, and structured protocols — are almost certainly larger around our table. A high score from one of us and a low score from another may reflect personal preference, mood, familiarity with the style, or simply a different sense of what “85 points” means. I include myself in all of this.


None of this means our scores are worthless. Averaged across enough tasters, noise tends to cancel out, and a genuine consensus can emerge. Research on the Paris tasting found that simply averaging the scores of two or more judges significantly improved the quality of judgments. The wisdom of the crowd is real — but only if we understand what the crowd is actually measuring.
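
For anyone who likes to see the arithmetic, here is a minimal simulation sketch of that cancelling-out. It rests on assumptions the research does not promise us: each taster reports a hypothetical "true" score of 85 plus independent random noise with a standard deviation of 5 points, and nobody shares a bias. The function name and every number in it are invented for illustration.

```python
import random
import statistics


def spread_of_panel_mean(n_tasters, true_score=85.0, noise_sd=5.0,
                         trials=2000, seed=0):
    """Simulate many tastings and return how much the panel average wobbles.

    Assumption: each taster reports true_score plus independent Gaussian noise.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    panel_means = []
    for _ in range(trials):
        scores = [rng.gauss(true_score, noise_sd) for _ in range(n_tasters)]
        panel_means.append(statistics.fmean(scores))
    # Standard deviation of the panel averages across all simulated tastings.
    return statistics.pstdev(panel_means)


solo_spread = spread_of_panel_mean(1)
panel_spread = spread_of_panel_mean(9)
print(f"one taster:   {solo_spread:.2f} points")
print(f"nine tasters: {panel_spread:.2f} points")
```

Under these assumptions the wobble in the panel average shrinks roughly with the square root of the number of tasters. Shared biases, such as a persuasive neighbour or a flattering position in the tasting order, do not average away like this.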


The honest takeaway: treat our collective scores as a rough compass, not a precise instrument. The wine that finishes first in our ranking probably earned it. The difference between second and fourth place is likely noise.


We Taste Blind — and That Is a Genuine Strength


Our club tastes blind. Labels are hidden, bottles are wrapped, and members rank what is in the glass rather than what is on the label. This is not a trivial commitment. Most social wine drinking is saturated with label information — the producer, the vintage, the price, the recommendation — all of which shape perception before the wine reaches the mouth. Removing that information is a genuine methodological achievement, and it is rarer than it should be.


But blind tasting removes the label from the glass, not the taster from the room. The biases documented above — scale divergence, expectation effects, inconsistency — do not require label information to operate. They are features of human perception and judgment, and they travel with us to every tasting regardless of what we can see on the bottle.


Our Method: A Useful Shortcut With Real Weaknesses


Our ranking method is a forced-choice ordinal rank-sum. Each member assigns every wine a rank from 1 (best) to 12 (worst), with no ties permitted. The ranks are then summed across all members, and the wine with the lowest total wins.
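
The tally itself is simple enough to sketch in a few lines of Python. The ballots below are invented (three members and four wines rather than twelve), and the function name rank_sum is ours for illustration, not part of any library.

```python
from typing import Dict, List


def rank_sum(ballots: Dict[str, List[str]]) -> Dict[str, int]:
    """Sum each wine's rank across members; the lowest total wins.

    Each ballot lists the wines from best (rank 1) to worst, with no ties.
    """
    wines = set(next(iter(ballots.values())))
    totals = {wine: 0 for wine in wines}
    for member, order in ballots.items():
        # Every ballot must rank every wine exactly once.
        if set(order) != wines or len(order) != len(wines):
            raise ValueError(f"{member}'s ballot must rank every wine exactly once")
        for rank, wine in enumerate(order, start=1):
            totals[wine] += rank
    return totals


# Hypothetical ballots, not real club data.
ballots = {
    "Ana": ["A", "B", "C", "D"],
    "Ben": ["B", "A", "D", "C"],
    "Cy":  ["A", "C", "B", "D"],
}
totals = rank_sum(ballots)
winner = min(totals, key=totals.get)
print(totals, "winner:", winner)
```

In this made-up example wine A wins with a total of 4, even though one member placed it second.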


This method has real virtues. It is simple, fast, and produces a clear result without requiring members to agree on what a numerical score means. It sidesteps the scale divergence problem by asking for relative judgments rather than absolute ones.


But it carries five weaknesses that are worth being honest about.


First, it discards magnitude. A member who thinks wine A is dramatically better than wine B and a member who thinks they are nearly identical will both record a rank of 1 and 2. The method cannot tell the difference.


Second, it forces discrimination where none may exist. Requiring a strict ordering of twelve wines means that even wines a member cannot meaningfully distinguish must be separated. The ranking looks precise. It may not be.


Third, it produces no measure of agreement. The final scores tell us which wine won, but not whether the group agreed or whether the result was driven by one or two strong opinions pulling in the same direction.


Fourth, ranking twelve wines in a single sitting is a significant cognitive task. Palate fatigue is real, and the rankings of wines tasted late in the flight (the tenth, eleventh, and twelfth) are almost certainly less reliable than those of wines tasted early.


Fifth, and most quietly: a rank of 1 from one member and a rank of 1 from another are treated as equivalent. Forced ranking makes the numbers comparable, but not the standards behind them. If one member's first place is a wine they loved and another's is merely the least disappointing of the flight, then summing those ranks aggregates not judgments but private reference points.
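
Two of these weaknesses, the first and the third, can be made concrete with a small sketch. The first half shows two invented score sheets collapsing into the same ballot. The second half computes Kendall's W, a standard agreement statistic for untied ranks; it is not something our club currently calculates, and it is offered here only as one possible way to put a number on agreement. All names and data are hypothetical.

```python
def to_ranks(scores):
    """Turn {wine: score} into {wine: rank}, where rank 1 is the highest score.

    Assumes no tied scores, matching our no-ties rule.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {wine: rank for rank, wine in enumerate(ordered, start=1)}


def kendalls_w(rank_rows):
    """Kendall's W for untied ranks: 0 = no agreement, 1 = identical ballots.

    rank_rows holds one list of ranks per member, wines in the same order.
    """
    m = len(rank_rows)       # number of members
    n = len(rank_rows[0])    # number of wines
    totals = [sum(row[j] for row in rank_rows) for j in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical score sheets: one decisive taster, one lukewarm taster.
decisive = {"A": 95, "B": 78, "C": 77}
lukewarm = {"A": 86, "B": 85, "C": 84}
same_ballot = to_ranks(decisive) == to_ranks(lukewarm)  # magnitude is lost

# Hypothetical ballots for four wines: a united room and a split room.
united = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
split = [[1, 2, 3, 4], [4, 3, 2, 1]]
w_united = kendalls_w(united)
w_split = kendalls_w(split)
print(same_ballot, w_united, w_split)
```

The split room still produces rank totals like any other, which is exactly why the totals alone cannot tell us how much we agreed.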


The Host’s Hidden Hand


Sequence matters more than we tend to admit. The order in which wines are tasted shapes how they are perceived, independently of what is in the glass. A lighter wine tasted after a very tannic one will seem fresher than it is. A wine with residual sweetness tasted after a bone-dry one will seem richer. These are not failures of attention — they are features of how palates work.


In a professional tasting, the sequence is carefully managed. In our club, it is managed by the convener, who may or may not be thinking about it. Blind tasting does not reduce this problem. If anything, it amplifies it: without label information to anchor perception, tasters rely more heavily on contrast effects from the wines around the one they are currently tasting.


The implication is uncomfortable. The wine that finishes in the middle of our rankings may have finished there partly because of where it sat in the sequence, not because of what was in the glass.


When the Wines Are Too Similar to Tell Apart


Some of our tastings are horizontal: same grape variety, same vintage, same appellation. The intention is rigour — by holding variables constant, we isolate the differences that matter. But horizontal tastings also push the wines closer together in profile, sometimes below the threshold of reliable discrimination.


Research on expert tasters is unsparing here. In one well-known study, the same wines were presented to expert judges on multiple occasions, and their scores were found to be distributed essentially at random — no more consistent than chance would predict. If trained professionals cannot reliably distinguish wines in a controlled setting, the implications for our club tastings are worth sitting with.


This does not mean horizontal tastings are pointless. A well-defined topic — a single appellation, a single producer across vintages — narrows the criteria gap and gives the evening genuine focus. The conversation tends to be richer when everyone is trying to answer the same question. But the ranking that emerges from a flight of very similar wines should be held lightly.


What Are We Actually Measuring?


Underlying all of this is a question that wine researchers call construct validity: what, exactly, is wine quality? Is it complexity? Balance? Typicity? Longevity? Pleasure? These are not the same thing, and different tasters weigh them differently — often without realising it.

Professional competitions address this, imperfectly, through briefing hierarchies and calibration wines. Judges are told what they are evaluating before they begin, and a reference wine is used to anchor the scale. Our club has none of this. Each member brings their own implicit criteria to the table, which are never made explicit.


The result is that each rank in our system is a private judgment against a private standard. Summing those ranks does not aggregate our judgments — it aggregates twelve separate private conversations, each conducted in a slightly different language.


Research bears this out in an interesting way: expert tasters tend to agree more on wines at the low end of quality than on wines at the high end. Obvious flaws produce consensus. Genuine excellence produces disagreement, because excellence can be achieved in more than one way, and different tasters value different paths to it.


The Conversation That Numbers Cannot Have


The most valuable part of our evening is not the ranking. It is what happens after the scores are in: the conversation the numbers cannot have on their own.


When a wine divides the room, that division is information. When someone names a quality that others had not noticed, that observation changes what everyone else tastes in the next sip. When the producer or region is revealed and the room recalibrates — “of course, that explains the acidity” — something genuinely educational has happened.


But this part of the evening has its own hazards. Dominant voices can pull the room toward a consensus that does not reflect individual experience. Quieter members may not surface their most interesting observations. The discussion can stay at the level of tasting notes — “I got blackcurrant, you got tobacco” — without ever reaching the more interesting question of why the wine is or is not working.


The qualitative discussion is where the real learning happens. It is also where the most can go wrong.


Most Clubs Do Far Less


It is worth pausing here to note that most wine clubs do not do any of this. They pour, they drink, they chat. No blind tasting, no ranking, no attempt to surface disagreement. We have tried harder than that — and I mean that genuinely. By trying harder, we have exposed problems that a more casual approach would never have encountered. That feels like the right kind of trouble to be in.


That is not a reason to abandon the method. It is a reason to be honest about what the method can and cannot tell us. Any scoring system carries these burdens. The question is whether we knowingly carry them.


The Convener


There is one person in the room for whom preparation is not optional: the convener. Just as a university tutorial rises or falls on the quality of its tutor, a wine tasting rises or falls on the convener. The parallel is exact, and so is the failure mode. The bad tutor is not ignorant. They are often knowledgeable enough to divert every question without ever resolving it: circling, reframing, filling the hour with the appearance of inquiry while the actual problem goes untouched. The bad convener does the same. They talk fluently about what they are sensing, steer the room away from difficult comparisons, and leave everyone feeling vaguely entertained but no clearer about what was in the glass. Knowledge without resolution is its own kind of bore.


The good convener is something else entirely. They have done the reading. They know the producer, the vintage, the appellation. They have thought about the sequence. They have a question they want the evening to answer. In the right hands, a wine tasting is not a social occasion with scoring attached — it is a genuine inquiry, and the scores are just one tool.

Which brings us, finally, to the only questions that truly matter: was the evening enlightening? Do you feel good about it? Did something shift — a new wine discovered, a prejudice gently overturned, a conversation that lingered longer than the glass? If the answer is yes, then the noise in the scores is beside the point. The imperfect method, the undefined criteria, the palate fatigue, the shy member who said nothing — none of it negates an evening well spent among people who care about what is in the glass. That, in the end, is what a wine club is for.


If any of the problems above felt uncomfortably familiar, the companion piece — Seven Things We Could Actually Do Differently — offers some practical responses. None of them requires a new method. Just a little more intention.

 
 
 
