MaxDiff tests are incredibly helpful when you need to compare a long list of alternatives. Respondents see 3, 4, or 5 alternatives per screen and are asked to rank them. You can add 7–200 alternatives for respondents to rank.
To Assemble Your MaxDiff Test
Drag and drop your MaxDiff test from the sidebar or select it from our question type menu. Your MaxDiff will open as a blank slate. You can then enter a list of alternatives, or copy and paste them from an existing list.
- Express provides aggregate data using the Logit Regression model.
- HB provides individual utility scores using the Hierarchical Bayesian model.
- HB+TURF includes an additional TURF simulator export.
Keep in mind that the mode, the number of alternatives, and the number of items per screen determine how many questions the MaxDiff will take your respondents through.
In this example, we are using HB with 15 alternatives and 5 per screen, so respondents will go through the experiment 9 times.
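The exact experimental design is generated by the platform, but as a rough sketch of where the task count comes from (assuming each alternative is shown about three times, a common MaxDiff default; the `appearances` parameter here is an illustrative assumption, not a documented setting):

```python
import math

def estimated_tasks(n_alternatives, per_screen, appearances=3):
    """Rough estimate of the number of MaxDiff screens: if each
    alternative appears `appearances` times (an assumed default),
    total item slots divided by items per screen gives the task count."""
    return math.ceil(n_alternatives * appearances / per_screen)

print(estimated_tasks(15, 5))  # 15 alternatives, 5 per screen -> 9 tasks
```

Under that assumption, 15 alternatives at 5 per screen works out to the 9 screens mentioned above.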
You can change the number of alternatives per screen at the bottom of the question.
Respondents will then rank a randomized list of the alternatives across multiple questions. You can also have them view the options in an image grid or rank them on a best/worst scale.
If possible, we always recommend adding Instructional Text before the MaxDiff test to let your respondents know what the rest of the survey will be like. Ranking a long list of alternatives can be repetitive, and it's always useful to prepare respondents for the 6 or more questions they will be asked next.
To Analyze Your MaxDiff Experiment
When you receive the data, you'll be able to analyze it based on Preference Likelihood (#/screen), Average-based PL, or Utility Scores.
Not sure which mode is best for your analysis? Keep scrolling to see some FAQs answered by our research experts!
Preference Likelihood (#/screen)
The baseline is set at the appropriate percentage for the number of items per screen programmed in the MaxDiff. It represents the chance an item would be selected from a random set of items, where the set size matches how many items respondents saw in each MaxDiff task:
- 33% (3/screen)
- 25% (4/screen)
- 20% (5/screen)
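The baseline is simply one divided by the number of items per screen, i.e. the chance of a random pick:

```python
def baseline_pl(items_per_screen):
    """Chance an item is picked at random from a set of this size."""
    return 1 / items_per_screen

for k in (3, 4, 5):
    print(f"{k}/screen -> {baseline_pl(k):.0%}")  # 33%, 25%, 20%
```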
In the example below, we showed respondents 5 alternatives per screen, and you will see that reflected in the mode.
Average-based PL (50% baseline)
The baseline is set at 50%: the probability an item would be chosen from a set of two, no matter how many items per screen respondents interacted with.
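A common way to compute this head-to-head probability is a standard logit transformation of the zero-centered utility, where the item is compared against a hypothetical "average" item with utility 0. Treat this as a sketch of the general approach, not necessarily the platform's exact implementation:

```python
import math

def average_based_pl(utility):
    """Probability the item beats an 'average' item (utility 0)
    head-to-head, via the logit rule: exp(u) / (exp(u) + exp(0))."""
    return math.exp(utility) / (math.exp(utility) + 1.0)

print(f"{average_based_pl(0.0):.0%}")  # an average item sits at the 50% baseline
print(f"{average_based_pl(1.2):.0%}")  # above-average utility -> above 50%
```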
Utility Scores
Bars now extend left and right of center. The values are zero-centered (the average equals zero), which matches the utility score output in the exported data.
What is the best metric/output to use in my analysis?
There is not a single best metric per se; it is often a matter of personal preference.
- Preference Likelihood scores are more easily interpreted than utility scores because the values have more meaning. With preference likelihood, each percent represents the probability an item would be most preferred out of a given set.
- If you prefer the given set to reflect the task respondents completed, use Preference Likelihood (#/screen).
- If you prefer the given set to reflect a head-to-head comparison of one item versus another, use Average-based PL (50% baseline).
- Utility Scores can provide an easy high-level view of what performs above average (positive value), what performs below average (negative value), and the overall rank order. For significance testing between options, we recommend using Utility Scores.
The rank order of items varies across different metrics. Which metric should I use to report rank order?
The short answer is that we recommend using Utility Scores when looking at the overall rank order of items. Without diving too far into the math, Utility Scores are preferred because they are the rawest form of the analysis and the data is normalized.
The baseline value of Preference Likelihood (#/screen) does not match the average of all PL values. Why is this, and what does this mean?
It has to do with the mathematical transformation applied to the raw utility scores to produce preference likelihood based on the number of items per screen. The short answer is that it's easier to move upward than downward in these calculations. When there is a clear rank order and preference among the items, the average of these values will creep above the baseline value, which is the theoretical preference likelihood (simply thought of as chance). If all items performed equally, the average of the Preference Likelihood scores would more closely align with the baseline value.
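This asymmetry can be seen numerically. Without claiming these are the platform's exact formulas, a common logit-based sketch compares an item's zero-centered utility against k-1 "average" items for Preference Likelihood (#/screen), and against one average item for Average-based PL. With a hypothetical spread of utilities, the first average drifts above its baseline while the second stays at 50%:

```python
import math

def pl_per_screen(u, k):
    """Preference likelihood vs. k-1 'average' items (utility 0):
    exp(u) / (exp(u) + (k - 1))."""
    return math.exp(u) / (math.exp(u) + (k - 1))

def pl_average_based(u):
    """Head-to-head vs. one average item: exp(u) / (exp(u) + 1)."""
    return math.exp(u) / (math.exp(u) + 1)

# Hypothetical zero-centered utilities with a clear rank order
utilities = [-2, -1, 0, 1, 2]
k = 5  # items per screen -> 20% baseline

mean_screen = sum(pl_per_screen(u, k) for u in utilities) / len(utilities)
mean_avg = sum(pl_average_based(u) for u in utilities) / len(utilities)

print(f"mean PL (#/screen):    {mean_screen:.1%} (baseline 20%)")
print(f"mean Average-based PL: {mean_avg:.1%} (baseline 50%)")
```

With these example utilities, the mean Preference Likelihood (#/screen) lands above the 20% baseline, while the mean Average-based PL sits at the 50% baseline.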
Some items that perform below the baseline with Average-based PL (50% baseline) perform above the baseline with Preference Likelihood (#/screen). Why is this, and what does this mean?
As mentioned above, it is possible, and likely, for Preference Likelihood values based on the number of items per screen to creep above the theoretical average, so each item can move up a little. This effect is not observed with Average-based PL (50% baseline), owing to the mathematical simplicity of that metric. Since the two metrics behave slightly differently with regard to the baseline, it isn't fair to compare how items perform against the benchmark across the two scenarios. For those wanting to understand pure above-average and below-average performers, we recommend looking at the Utility Scores (positive values = above average, negative values = below average, values tightly around 0 are about average).