Peer Review of "Burst Beliefs - Methodological Problems in the Balloon Analogue Risk Task and Implications for its Use"

Studies in the field of psychology often employ (computerised) behavioural tasks, aimed at mimicking real-world situations that elicit certain actions in participants. Such tasks are for example used to study risk propensity, a trait-like tendency towards taking or avoiding risk. One of the most popular tasks for gauging risk propensity is the Balloon Analogue Risk Task (BART; Lejuez et al., 2002), which has been shown to relate well to self-reported risk-taking and to real-world risk behaviours. However, despite its popularity and qualities, the BART has several methodological shortcomings, most of which have been reported before, but none of which are widely known. In the present paper, four such problems are explained and elaborated on: a lack of clarity as to whether decisions are characterised by uncertainty or risk; censoring of observations; confounding of risk and expected value; and poor decomposability into adaptive and maladaptive risk behaviour. Furthermore, for every problem, a range of possible solutions is discussed, which overall can be divided into three categories: using a different, more informative outcome index than the standard average pump score; modifying one or more task elements; or using a different task, either an alternative risk-taking task (sequential or otherwise), or a custom-made instrument. It is important to make use of these solutions, as applying the BART without accounting for its shortcomings may lead to interpretational problems, including false positive and false negative results. Depending on the research aims of a given study, certain shortcomings are more pressing than others, indicating the (type of) solutions most needed. By combining solutions and openly discussing shortcomings, researchers may be able to modify the BART in such a way that it can operationalise risk propensity without substantial methodological problems.


Introduction
To a large extent, psychological science rests on the promises of operationalisation: defining fuzzy concepts as measurable variables, or in other words, changing conceptual variables into operational ones (Shuttleworth, 2008). This process is imperative as most concepts researchers hypothesise about are not straightforwardly quantifiable. By defining how a concept is measured, operationalisation allows hypotheses to take a falsifiable format and enables us to replicate findings. In a way, operationalisations are arbitrary, as concepts can be defined and thus measured in numerous ways, none of which is surely 'right'. Nonetheless, some measures may be more suitable than others.
A notable example of a concept that can be operationalised in various ways is risk-taking, which has an important place in clinical, cognitive, and developmental psychology, as well as in the fields of criminology, economics, and management (Lauriola & Weller, 2018). One way in which risk-taking is operationalised in these fields is through self-report measures, such as the Domain-Specific Risk-Taking (DOSPERT) scale (Blais & Weber, 2006) and the Financial Risk Tolerance assessment (Grable & Lytton, 1999). Another way is through computerised behavioural tasks, like the Iowa Gambling Task (Bechara, Damasio, Damasio, & Anderson, 1994), the Cambridge Gambling Task (Rogers et al., 1999), the Game of Dice Task (Brand et al., 2005), the Balloon Analogue Risk Task (Lejuez et al., 2002), and the more recent but already widely used Columbia Card Task (Figner, Mackinlay, Wilkening, & Weber, 2009). Importantly, the quality of a study largely depends on the degree to which its operational measures reflect the underlying concept; in this case, one's disposition towards taking risk. If a task is a poor proxy for a concept or is subject to methodological or interpretational problems, any data resulting from it are of limited value to our understanding of the concept. In this regard, several studies have challenged how well the most-cited risk task, the Iowa Gambling Task, operationalises risk-taking (see e.g. Brand, Labudda, & Markowitsch, 2006; Buelow & Suhr, 2009; Figner et al., 2009; Maia & McClelland, 2004). The Balloon Analogue Risk Task, which is the second-most cited, may suffer from even more severe issues, hindering its ability to operationalise risk-taking. While some individual issues have been reported in previous publications, no literature so far has discussed these collectively. The present commentary aspires to fill this gap.

The Balloon Analogue Risk Task
In the Balloon Analogue Risk Task, or BART for short, participants are presented with a computer screen showing a small balloon and a pump. They are told that every time they click the pump, the balloon expands, and a fixed amount of money (5 cents) is added to a temporary bank. Every pump also increases the chance of the balloon exploding (marked by a 'pop' sound from the computer), resulting in losing all money in the temporary bank for that particular balloon (trial). The point at which a balloon explodes varies across trials, ranging from the first pump to the point where the balloon fills the entire screen. Participants can decide to stop pumping the balloon at any point during a trial by clicking the 'collect' button (left in Figure 1), which transfers the money accumulated in their temporary bank to their permanent one, while a slot machine sound is played. Once a balloon explodes or once participants cash a balloon's proceeds, the trial ends, and a new, uninflated, balloon appears.
In the original study by Lejuez et al. (2002), participants were informed that they would complete 90 balloons: 30 orange, 30 yellow, and 30 blue ones.
Unbeknownst to participants, differently coloured balloons had a different chance of exploding. The probability distribution governing their explosion points consisted of an array of numbers from which on every pump a random number was drawn without replacement. If a 1 was drawn, the balloon exploded.
Thus, the probability of the balloon exploding on the first pump was $p_1 = 1/n$, where $n$ is the size of the array, and the probability of it exploding on pump $i$ (given no prior explosion) was $p_i = \frac{1}{n - i + 1}$. For orange balloons, the array ranged from 1 to 8 (hence $p_1 = \frac{1}{8 - 1 + 1} = 1/8$), for yellow balloons from 1 to 32 ($p_1 = \frac{1}{32 - 1 + 1} = 1/32$), and for blue ones from 1 to 128 ($p_1 = \frac{1}{128 - 1 + 1} = 1/128$). Their average explosion points were respectively 4, 16, and 64, with the same (randomly generated) sets of explosion points being used across all participants to limit extraneous variability. Neither the ranges nor the average explosion points were communicated to participants.
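The drawing-without-replacement mechanism described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, and the original study used fixed, pre-generated sets of explosion points rather than fresh random draws.

```python
import random
from fractions import Fraction


def conditional_explosion_prob(i: int, n: int) -> Fraction:
    """Chance of an explosion on pump i, given survival of pumps 1..i-1."""
    return Fraction(1, n - i + 1)


def unconditional_explosion_prob(i: int, n: int) -> Fraction:
    """Chance that the balloon explodes exactly on pump i."""
    survive = Fraction(1)
    for j in range(1, i):
        survive *= 1 - conditional_explosion_prob(j, n)
    return survive * conditional_explosion_prob(i, n)


def draw_explosion_point(n: int, rng: random.Random) -> int:
    """Simulate one balloon: draw numbers without replacement until the 1 comes up."""
    array = list(range(1, n + 1))
    rng.shuffle(array)  # a random order is equivalent to drawing without replacement
    return array.index(1) + 1


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary seed, for reproducibility only
    for colour, n in [("orange", 8), ("yellow", 32), ("blue", 128)]:
        points = [draw_explosion_point(n, rng) for _ in range(10_000)]
        print(colour, conditional_explosion_prob(1, n), sum(points) / len(points))
```

Under this mechanism every pump from 1 to the array size is equally likely to be the explosion point, so simulated means land near the midpoint of the range, roughly in line with the averages reported above.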

Miltner, 2019; Schonberg et al., 2011), and that they limit the BART's ability to measure one's propensity to take risk. The key problems in the task are 1) a lack of clarity as to whether decisions are characterised by uncertainty or risk, 2) censoring of observations, 3) confounding of risk and expected value, and 4) poor decomposability into adaptive and maladaptive risk behaviour.

Risk or Uncertainty?
In economic theories of decision-making, a key distinction is that between uncertainty and risk, which is often attributed to Knight (1921), and was introduced to psychological thinking in a seminal paper by Edwards (1954) that lies at the origin of behavioural decision theory. When making a decision under the condition of risk, the probabilities associated with the possible outcomes are known. When deciding under uncertainty (which some authors call ambiguity), this probability distribution is unknown. [...] people do not act in accordance with prospect theory, but instead underweight rare events and overweight common encounters. As people have more and more encounters (e.g. trials), their experiences will approach the precision of a priori probabilities, though in practice this is difficult to attain (Knight, 1921).
To address the inability of the BART to differentiate between complete uncertainty, experience-based risk, and description-based risk, several approaches may be used. One option is to apply a model to the BART's data that allows for participants learning through experience. An early example is a model by Wallsten, Pleskac, and Lejuez (2005) in which decision-makers update their probabilities from trial to trial, and continually re-evaluate their options.
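To make the idea of trial-to-trial updating concrete, the sketch below shows generic Bayesian (Beta-Bernoulli) learning about a burst probability. It is not a reproduction of the Wallsten et al. (2005) model: it assumes, purely for illustration, that the decision-maker treats the explosion probability as constant across pumps and starts from a flat prior, and the class name `BetaBelief` is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BetaBelief:
    """Beta(a, b) belief about a constant per-pump explosion probability."""
    a: float = 1.0  # prior pseudo-count of explosions (assumed flat prior)
    b: float = 1.0  # prior pseudo-count of safe pumps (assumed flat prior)

    def update(self, safe_pumps: int, exploded: bool) -> "BetaBelief":
        """Return the posterior belief after one balloon (trial)."""
        return BetaBelief(self.a + (1 if exploded else 0), self.b + safe_pumps)

    @property
    def mean(self) -> float:
        return self.a / (self.a + self.b)

    @property
    def variance(self) -> float:
        total = self.a + self.b
        return self.a * self.b / (total ** 2 * (total + 1))


if __name__ == "__main__":
    belief = BetaBelief()
    # Hypothetical trial outcomes: (number of safe pumps, whether the balloon burst).
    for safe_pumps, exploded in [(12, False), (20, True), (15, False), (31, True)]:
        belief = belief.update(safe_pumps, exploded)
        print(f"mean={belief.mean:.3f} variance={belief.variance:.5f}")
```

As experience accumulates, the variance of the belief shrinks, which is one way to picture the gradual shift from deciding under uncertainty towards deciding under (experience-based) risk.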
Alternatively, one could use a different task, in which decisions are either all characterised by uncertainty or risk, or which includes a well-understood shift between the two. For instance, some tasks involve only decisions made under (a priori) risk, like the Cambridge Gambling Task, the Game of Dice Task, and the Columbia Card Task, the latter of which resembles the BART's dynamic, affective nature (Schonberg et al., 2011). Unfortunately, no task with a well-understood shift has been reported yet, although the shift in the Iowa Gambling Task has been studied more thoroughly than that in the BART.

Censored Observations
Statistical censoring refers to a condition in which the value of an observation is unknown because it is beyond a certain limit. This limit can exist by design, which is common in survival analysis. If a study on a surgical intervention follows patients for up to 10 years, the longevity scores of those who live past this term are censored, as their longevity is at least 10 ( , 2008). Likewise, the between-subjects variability across these averages is reduced (Lejuez et al., 2002). Overall, the (unadjusted) average number of pumps is an ill-suited operationalisation of risk propensity.
As censoring affects all sequential risk-taking tasks like the BART (involving multiple decisions per trial) and various other research paradigms, like survival analysis, several solutions have been proposed. In the paper introducing the BART, Lejuez et al. (2002) suggest computing an adjusted pump average using only trials in which participants stopped voluntarily, that is, in which the balloon did not burst. However, by omitting explosion trials, censored observations are essentially treated as randomly missing, which is inaccurate (Pleskac et al., 2008). The more risk someone takes, the more likely it is that the balloon bursts and that the trial is forcibly ended. The termination of trials is therefore not independent of participants' behaviour. As a result, Lejuez et al.'s adjusted score tends to discard trials in which participants take a lot of risk. This causes the average number of pumps to be biased downwards, similar to the unadjusted score, but to a lesser extent.
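The direction of both biases can be illustrated with a small simulation. This is a sketch under assumptions that are not taken from the source: a hypothetical participant whose intended number of pumps varies from trial to trial around a mean of 50 (normally distributed, standard deviation 20), and explosion points drawn uniformly from 1 to 128.

```python
import random


def simulate_bart_scores(n_trials=20_000, mean_target=50.0, sd_target=20.0,
                         array_size=128, seed=1):
    """Compare intended pumps with the unadjusted and adjusted pump averages."""
    rng = random.Random(seed)
    intended, observed, voluntary = [], [], []
    for _ in range(n_trials):
        # Intended number of pumps on this trial, clipped to the possible range.
        target = min(array_size, max(1, round(rng.gauss(mean_target, sd_target))))
        explosion_point = rng.randint(1, array_size)  # pump on which the balloon bursts
        intended.append(target)
        if target < explosion_point:
            observed.append(target)           # participant stopped voluntarily
            voluntary.append(target)
        else:
            observed.append(explosion_point)  # censored: the balloon burst first
    return {
        "intended": sum(intended) / len(intended),
        "unadjusted": sum(observed) / len(observed),
        "adjusted": sum(voluntary) / len(voluntary),
    }


if __name__ == "__main__":
    for name, value in simulate_bart_scores().items():
        print(f"{name:>10}: {value:.1f}")
```

Under these assumptions the unadjusted average falls furthest below the intended average, and the adjusted average sits in between, because high-target trials are the ones most likely to end in an explosion and be dropped.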
To circumvent the problem of censoring, Pleskac et al. (2008) [...] Another solution to censoring is using a rigged task (Slovic, 1966). Participants are then told that failure can occur at any moment (in the BART, at any pump), but in fact it is set to occur at the last possible choice. Hence, participants can always stop voluntarily, and no scores are censored. To uphold credibility, 'mock' trials are added, in which failure is set to occur early on. Deciding on the number and timing of mock trials, however, is a challenge. Since behaviour in a trial is affected by previous outcomes, experiencing [...]
[...] to the likelihood function to account for censoring (Dijkstra et al., 2020; Tobin, 1958). Such models perform significantly better (i.e., have less biased predictions) than those that do not account for censoring. However, as is the case for all statistical models, their soundness hinges on the validity of their underlying assumptions (Schafer & Graham, 2002), such as that of normality, whose violation not all models are robust against (Powell, 1984).
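A minimal sketch of a censoring-aware likelihood is given below. It is a generic right-censored normal (Tobit-style) model, not the specific models of Dijkstra et al. (2020) or Tobin (1958). It assumes that the intended number of pumps is a continuous, normally distributed latent variable, that explosion points are independent of it, and that a coarse grid search is an acceptable stand-in for a proper optimiser. All function names are hypothetical.

```python
import math
import random


def censored_loglik(mu, sigma, trials):
    """Log-likelihood of (pumps, exploded) pairs under a right-censored normal model."""
    total = 0.0
    for pumps, exploded in trials:
        z = (pumps - mu) / sigma
        if exploded:
            # Censored trial: we only know the intended value was at least `pumps`.
            survival = 0.5 * math.erfc(z / math.sqrt(2))
            total += math.log(max(survival, 1e-300))
        else:
            # Uncensored trial: the intended value itself was observed.
            total += -0.5 * z * z - math.log(sigma * math.sqrt(2 * math.pi))
    return total


def simulate_trials(n_trials, mu, sigma, array_size, rng):
    """Generate hypothetical BART data with normally distributed intended pumps."""
    trials = []
    for _ in range(n_trials):
        target = rng.gauss(mu, sigma)
        explosion_point = rng.randint(1, array_size)
        if target < explosion_point:
            trials.append((target, False))
        else:
            trials.append((float(explosion_point), True))
    return trials


def fit_by_grid(trials, mus, sigmas):
    """Return the (mu, sigma) pair on the grid with the highest log-likelihood."""
    return max(((m, s) for m in mus for s in sigmas),
               key=lambda pair: censored_loglik(pair[0], pair[1], trials))


if __name__ == "__main__":
    data = simulate_trials(3000, 60.0, 15.0, 128, random.Random(7))
    mu_hat, sigma_hat = fit_by_grid(data, range(50, 71), range(10, 21))
    stopped = [p for p, exploded in data if not exploded]
    print("adjusted average:", round(sum(stopped) / len(stopped), 1))
    print("censored-model estimate of the mean:", mu_hat, "sd:", sigma_hat)
```

The key design choice is that exploded trials are kept rather than dropped: each contributes the probability that the intended number of pumps was at least as large as the observed explosion point. As noted above, the result is only as good as the normality assumption behind it.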

Confounding and Decomposability
The BART was designed to resemble real-world risk situations, where taking modest risk is generally advantageous, but taking excessive risk is increasingly unfavourable (Lejuez et al., 2002; Wallsten et al., 2005). Within a trial, every successful pump earns participants 5 cents, which are added to their temporary bank. As the amount accumulated in the bank grows, the relative gain of taking additional risk decreases, while the potential loss in case of an explosion increases. Additionally, the probability of the balloon exploding increases with every pump: from 1/128 on the first, to 1/127 on the second, and so on.
This combination of characteristics gives the task's structure a serious problem. While the balloon value increases linearly, the probability of the balloon exploding increases superlinearly, so that the expected value of pumping the balloon, that is, the product of the success chance and the reward, [...]

The Normative Solution
[...] and an optimal overall expected value. Going back to Table 1, we can see exactly why this is the optimal, or normative, solution in the BART. Up to and including the 64th pump, the expected value of pumping the balloon is positive; after 64, the expected value is (increasingly) negative. It is therefore optimal to aim for 64 pumps on every balloon, and then stop. Choosing to pump more or fewer than 64 times will decrease expected earnings; and the farther one [...]
It is yet unknown exactly why participants often stop pumping before they reach the optimal point, but various factors may play a role. [...] added. Since this manual BART itself resulted in higher averages than the original BART, the feedback and instructions likely also contributed to the effect (Lejuez et al., 2002). Recent research, however, indicates that informing participants about the optimal strategy is not necessary, and even ill-advised.
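The normative solution described above (a positive expected value up to and including the 64th pump, and a negative one afterwards) can be checked with a few lines of Python. The sketch assumes the blue-balloon parameters given earlier (an array of 128 numbers, 5 cents per pump) and treats an explosion as losing the entire temporary bank. The function names are hypothetical, and the numbers are not taken from the paper's Table 1.

```python
from fractions import Fraction

REWARD_CENTS = 5


def marginal_ev(i, n=128, reward=REWARD_CENTS):
    """Expected change in the temporary bank from making pump i (after i-1 safe pumps)."""
    p_explode = Fraction(1, n - i + 1)
    bank = reward * (i - 1)  # money at stake when pump i is made
    return (1 - p_explode) * reward - p_explode * bank


def expected_earnings(k, n=128, reward=REWARD_CENTS):
    """Expected payout of planning to stop after k pumps."""
    p_survive = Fraction(n - k, n)  # chance that none of the first k pumps bursts
    return p_survive * reward * k


def optimal_target(n=128):
    """Number of pumps that maximises expected earnings per balloon."""
    return max(range(1, n), key=lambda k: expected_earnings(k, n))


if __name__ == "__main__":
    for i in (1, 32, 64, 65, 96):
        print(f"pump {i:>3}: marginal EV = {float(marginal_ev(i)):+.3f} cents")
    print("optimal target:", optimal_target(), "pumps")
```

Under these assumptions the marginal expected value of pump $i$ simplifies to $5(n - 2i + 1)/(n - i + 1)$ cents, which changes sign between $i = 64$ and $i = 65$ for $n = 128$.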
Two studies using an automatic BART with full feedback (but without strategy instructions) found pump averages as high as did Pleskac and colleagues [...] showed that a subgroup of participants, often from a STEM background, seem to infer the optimal strategy without any help. Their repeated 64-answers therefore reflect cognitive ability rather than risk propensity, and reduce task variability. Informing participants about the optimal strategy can increase such problematic responses. Therefore, it seems best to add automatic responses and full feedback to the BART, but not strategy instructions. This will likely elicit sufficiently high pump averages, without compromising the validity of the task.

[...] censoring, which occurs in trials where the balloon explodes, as participants are then prevented from taking additional risk. As a result, the average number of times participants pump the balloon underestimates their risk propensity.
Third, the BART confounds risk with expected value. Since these constructs change simultaneously throughout a trial, participants' pump behaviour again does not reflect risk propensity, as decisions are influenced by both risk and expected value. This also means that the task is poorly decomposable, as it cannot disentangle the motives underlying a pump decision. A final problem concerns the task's normative solution. In the majority of trials, participants stop pumping before the point where expected earnings are maximised. Therefore, participants mostly take adaptive risk, which leads to higher earnings.
Maladaptive risk-taking hardly occurs, even though one would expect to see such behaviour in certain cases.
Despite these problems, much of the research up to now has focused on the empirical findings produced by the BART, rather than on the task itself, with the majority of researchers using the task without critically reviewing [...] For these reasons, it is imperative that researchers critically evaluate the 'fit' between their research and the BART before deciding on using it. For many research aims, one will now see that the original BART does not suffice.
Yet despite these 'burst beliefs', there are three types of approaches one can take to account for the BART's limitations. First, data from the original BART can be analysed using a different, more informative index than Lejuez et al.'s average adjusted pump score. For example, the models by Wallsten et al. (2005) break down behaviour into risk-taking, response consistency, and learning. In addition, computational models can be used to take into account censoring and to provide an index of uncensored risk-taking in the BART (Dijkstra et al., 2020; Tobin, 1958; Weller et al., 2019; Young & McCoy, 2019). A second way of dealing with the BART's limitations is by modifying the task, for example by rigging it (Figner et al., 2009; Slovic, 1966) [...]

Conclusion
The present paper is the first to review the methodological shortcomings of the Balloon Analogue Risk Task, a highly popular risk-taking task in psychology. The main problems identified are the ambiguity between uncertainty and risk, censoring of observations, confounding of risk and expected value, and poor decomposability into adaptive and maladaptive risk-taking. In addition, the paper reviews solutions that mitigate these problems. By presenting this first-time inventory, the paper highlights earlier mentions of problems in the BART as well as proposed solutions. It calls for a critical attitude towards the BART and experimental tasks in general, as their design deserves at least as much attention as the findings they produce. It also sets the agenda for testing and comparing different tasks and task versions, to explore which designs result in the best usability, reliability, and validity, so that risk propensity can be measured in the most accurate way possible.

Peer Reviews
Reviewer 1 Michael Young

0) General comments
• Accept manuscript either as is or with minor modifications.
• First, this paper is the closest I have come to recommending an 'accept as is' for an initial submission during my entire career.
• Yes, there are some additional modifications that could be made (see below and, likely, comments from the other reviewers), but the paper reads excellently as written and nicely integrates a wide body of literature on methodological issues with the BART.
• With that said, let me introduce some minor suggestions that the authors might consider.

1) Risk or Uncertainty?
• I was curious about the statement [at Risk or Uncertainty?, last paragraph] that "one could use a different task in which decisions are either all characterized by uncertainty or risk." Can the BART be modified to create two such versions?
-For example, a version in which people are presented the pop probability before each decision to pump or cash in would only entail risk, whereas a version where the pop probability is unknown and progresses differently on each trial (i.e., each balloon has a different but unknown color in the original Lejuez task) might represent maximal uncertainty although this probability would always need to increase with each pump.

2) Confounding and Decomposability
• [At Confounding and Decomposability, paragraph 2], the authors mention the problem that balloon value increases linearly whereas probability of explosion increases superlinearly, and thus the EV of the balloon changes across trials.
-Yes, that's an issue in the Lejuez version, but the BART can be easily modified to remove that concern. If the P(explosion) increases linearly, then the EV would be constant.

3) Discussion
• Two issues not considered are that people might pump less because they are fatigued and thus wish to reduce effort or because they want to complete the task sooner for a finite number of balloons.
-E.g., if there are 30 balloons, I could end the task after 30 pumps by pumping once per balloon rather than 10-15 times per balloon.
-Thus, 'risk propensity' might reflect laziness or boredom with the task or a desire to leave early.

Reviewer 2

0) General comments
• This manuscript summarizes the popularity and qualities of the Balloon Analogue Risk Task, one of the most popular tasks for gauging risk propensity, and puts forward and analyzes four shortcomings of the task.
• Authors' analysis and viewpoints are very important and demand, indeed, further investigation. In my opinion, this manuscript has an innovative character. I have some suggestions to improve its quality.

1) Confounding and Decomposability
a. In my view, the authors treat risk or uncertainty as a completely isolated concept when discussing the shortcomings of the BART. Some of the authors' opinions are too extreme.
• For instance, the authors believe that the BART confounds risk with the expected value. However, risk is not completely isolated in reality. Many risky decision-making situations involve an interaction between risk and expected value. In this respect, the BART has high ecological validity.
b. Besides, the authors suggest that because the BART requires participants to inflate balloons one pump at a time, it is plausible that they get tired of pumping before reaching the optimum, which may mean that the BART cannot properly differentiate between adaptive and maladaptive risk behavior.
• However, in my opinion, it is precisely this dynamic decision-making process that allows us to discover the cognitive process and neural mechanism of risk dynamics in the BART using cognitive neuroscience methods.

2) The Normative Solution
• Additionally, the authors addressed the key problems of BART by comparing the original version to subsequent modified versions, rather than other risk-taking tasks in detail.
• It seems that the goal of the present study is summarizing problems inherent to the original BART, not targeted at the task itself. It should be noted that these modified versions were used to mitigate only one or two methodological aspects, which may lead to a concern for the validity and reliability of BART, and a trade-off is needed for researchers.

3) Discussion
• Also, the authors assume that, because participants are not given any information about the explosion probabilities, they first decide under uncertainty, which then gradually shifts towards risk as they learn more about the probabilities in the task.
-However, the authors' discussion of this proposition is insufficient, because they analyze the transition in the BART from deciding under uncertainty to deciding under risk based on the assumption that decisions in the BART involve two stages: an uncertain decision and a risky decision.
-In my opinion, there is not enough experimental evidence to support this hypothesis. A recent study suggests that this task is not about risk but uncertainty. In the BART, both outcome and probability distribution are unknown, making it an uncertain task (De Groot & Thurik, 2018).
-A previous study also found that the test-retest reliability of the BART was significantly higher than that of the IGT (Xu et al., 2013), which indicates that it is more difficult for participants to learn and master the probability information of the BART during the decision-making process.
Journal of Trial and Error 2020. Manuscript by D G

4) Conclusion
• Finally, in the conclusion section, the authors put forward some methods and advice to overcome the shortcomings, but in my opinion, it is still not detailed and meaningful enough and needs further improvement.