Benchmarking is a bit like putting a stake in the ground so you can see how far you’ve come when you reach the end of a process.

How it relates to navigation design

– It allows you to see the scale of the problem at the start of a design process, and
– measure the outcome once it’s launched.

Ultimately your goal is to create something that doesn’t make the problem any worse.

Why you should benchmark

You might think having a sense of where you are now isn’t that important, but things can get lost in design processes. This is especially true if the time between problem identification and launch is long.

Benchmarking also gives you a bit more information on the problem you’re trying to solve (win-win).

The key reason though is to help you understand if your changes are a success or failure.

What is benchmarking?

I’ve heard a lot of people ask “what’s the industry best practice for x so we can benchmark against it”.

At which point my head usually hits my desk with a thunk.

Navigation design performance is unique to your product or service. You’re not keeping up with the Joneses.

The metrics themselves are widely used, but how you interpret the data or insights needs to be unique to you.

A lot of analytics data is a closely guarded secret within an organisation.

The only universal metric that can be benchmarked against an industry standard is bounce rate. However, even then the time threshold that defines a bounce can be adjusted in your analytics software.
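To show how much that threshold matters, here's a minimal sketch of a simplified, time-based bounce rate calculation. The session data is entirely hypothetical, and real analytics tools define bounces in their own ways:

```python
def bounce_rate(sessions, max_seconds=10):
    """Share of single-page sessions shorter than a configurable threshold.

    `sessions` is a list of (pages_viewed, duration_seconds) tuples. The
    threshold mirrors how analytics software lets you redefine a "bounce".
    """
    bounces = sum(1 for pages, secs in sessions
                  if pages == 1 and secs < max_seconds)
    return bounces / len(sessions)

# Hypothetical sessions: (pages viewed, seconds on site)
sessions = [(1, 4), (1, 45), (3, 120), (2, 60), (1, 8)]
print(f"{bounce_rate(sessions):.0%}")       # → 40%
print(f"{bounce_rate(sessions, 60):.0%}")   # → 60% with a looser threshold
```

Same sessions, two different "industry standard" bounce rates, which is exactly why comparing your number to someone else's is shaky.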

How do you benchmark?

You first need to define the parameters.

These will need to include:

– What you are benchmarking (please see understanding the problem for more information on this):

> Analytics data

> Screen capture session data

> How well someone completes a task in testing

> Number of customer support queries

> NPS findability score

> Tree test score (aka findability)

– Data parameters, which could include (this is not an exhaustive list):

> Date range – e.g. a full calendar month, three calendar months

> Volume – e.g. the percentage of people who are using Google as navigation

> Accuracy – e.g. 4 out of 5 people couldn’t find the navigation in testing

> Findability – e.g. x% of people couldn’t find what they were looking for

Once you’ve defined them, you’ll need to get the data that answers these questions.

Statistically viable data

The data you’re capturing needs to be a statistically viable representation of the problem and the audience.

One week’s worth of data is not a true representation of behavioural patterns. That week could be anomalous.

Even when working on a website like NHS Digital, which had 1 billion visits in 2021, I refused to take a week’s sample as evidence of a significant problem.

I could see problems starting to form but I still had to adopt a watch-and-wait approach. This was to see if it was:

– a coronavirus blip,

– a blip caused by some external force,

– a seasonal impact, or

– a genuine problem.

If you’re wondering, I saw all of those things happening from the search analytics data alone.

Why it’s essential it’s statistically viable

I once worked on a project where stakeholders were against a navigation approach proposed by the team working on it.

So they designed something themselves and compared it against what they currently had.

They declared what they wanted to implement was a winner, and they had data to prove it!

The problem was the sample sizes.

The sample size of their preferred option was smaller than the benchmark’s.

Both sample sizes were also very small.

The site got in excess of a million visitors a year.

The benchmark sample had around 34 people in it.

The preferred option had 26.

The two were not comparable, and both were too small to produce statistically significant results, let alone to base an important choice on.
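To illustrate why those sample sizes can’t support the claim, here’s a minimal sketch of a two-proportion z-test. Only the sample sizes (34 and 26) come from the story; the success counts are hypothetical:

```python
from math import sqrt, erf

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided two-proportion z-test; returns (z, p-value)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # p-value from the standard normal CDF, via math.erf
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical success counts for the anecdote's sample sizes (34 vs 26)
z, p = two_proportion_z_test(18, 34, 19, 26)
print(f"z = {z:.2f}, p = {p:.3f}")  # p ≈ 0.11, well above the usual 0.05 cut-off
```

Even with an apparent 20-point gap in success rates, samples this small can’t distinguish the “winner” from noise.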

If you want to sell a major change to stakeholders who don’t understand what you’re trying to do, you need a representative sample size.

Sample sizes

I’m not a statistician. I suck with numbers. I had to resit my GCSE Maths. So I’m not about to get all numbersey on you.

A while ago I did some searching on what represents a good sample size for the testing we do. I didn’t find very much (note to self: must do this again, probably before writing a piece like this).

What I did find was from Optimal Workshop, who give clear guidance on what they see as representative sample sizes on their testing platform.

Their advice holds true: I’ve run a number of different tests and found their guidance to be accurate.

The problem is, (isn’t it always), the bloody stakeholders.

If you’re using this data to get sign-off on budget or approval they will want to see numbers that represent live traffic data.

The way I get around this (because, let’s be honest, a representative sample of 1 billion visitors would blow any budget) is to be iterative.

The more data you generate as you go through something, the stronger your case is.

If you’ve got analytics data that shows a high percentage of people are not using your navigation, that’s a live, indisputable fact.

However, if you’ve identified a problem in usability testing or research it’s likely you have a small level of insight. So you would probably want to use that to look at other types of research to help you build a picture of the significance of the problem.

Tree testing or first click testing as benchmark tools

Tree testing can give you a findability score for navigation structure. I’m heavily caveating this because tree testing doesn’t allow for context created by visual language.

Using tree testing

Tree testing lets you put your taxonomy into a test and ask people to find things in it. It gives you a degree of statistical viability.

It also gives you a measure of findability.

For example, if you test what you have now and find that 68% of people couldn’t find the key things in your structure you have a problem.
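If you want to put error bars on a findability score like that, a Wilson confidence interval is one common approach. A minimal sketch with hypothetical numbers (16 of 50 participants succeeding, roughly matching the 68%-failed example):

```python
from math import sqrt

def findability_score(successes, n, z=1.96):
    """Findability as a proportion, with a 95% Wilson confidence interval."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return p, centre - half, centre + half

# Hypothetical tree test: 16 of 50 participants found the target item
score, low, high = findability_score(16, 50)
print(f"findability {score:.0%}, 95% CI {low:.0%}–{high:.0%}")
```

With 50 participants the interval is wide (roughly 21%–46% here), which is a useful reminder of how much uncertainty sits behind a single benchmark figure.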

The problem could be the labelling.

I say could be, because visual context is vital in navigation design. Your labels might work, but you have a burger menu icon and 80% of traffic is on mobile.

On the flip side, your navigation could be awesome, but your labels are meaningless.

☝🏻 A word of caution. If you use tree testing for benchmarking you will need to write the questions to work for benchmarking and testing. If you don’t you could bias the result of the testing. This isn’t easy, sorry.

First click testing

First click testing uses the taxonomy with the context of the design. Yay!

But it’s static images so you can only put one scenario into play at a time.

Using the previous example, the structure may be brilliant, but you’ve got a burger menu icon.

This is how you can tell if it’s causing a problem, with quantifiable insights.

However, if you wanted to test the labels in the context of the design you would have to show the navigation menu exposed.

The same caution applies, in that you would have to write the test scenario questions in a way that works for benchmarking and testing.

The wrap up

Benchmarking is a numbers game.

The purpose of benchmarking is:

– to gain a greater understanding of a problem

– to get some numbers to keep the stakeholders quiet

– to help you understand the impact once you’ve launched

This post is part of a series on the four phases of navigation design testing and research. Please also check out the taxonomy of testing and research, and understanding the problem.
