Peer Review of Trial and Error (-Related Negativity): An Odyssey of Integrating Different Experimental Paradigms

How to read: Sections which have peer review comments are marked by ▼ or links, clicking which directs you to said comments. The reviews contain links referring back to the original section.

Abstract
Pain can be considered as a signal of “bodily error”: Errors – discrepancies between actual and optimal/targeted state – can put organisms at danger and activate behavioral defensive systems. If the error relates to the body, pain is the warning signal that motivates protective action such as avoidance behavior to safeguard our body’s integrity. Hence, pain shares the functionality of errors. On the neural level, an important error processing component is the error-related negativity (ERN), a negative deflection in the electroencephalographic (EEG) signal generated primarily in the anterior cingulate cortex within 100 ms after error commission. Despite compelling evidence that the ERN plays an important role in the development of various psychopathologies and is implicated in learning and adjustment of behavior, its relation to pain-related avoidance has not yet been examined. Based on findings from anxiety research, it seems conceivable that individuals with elevated ERN amplitudes are more prone to engage in pain-related avoidance behavior, which may, under certain conditions, be a risk factor for developing chronic pain. Consequently, this new line of research promises to contribute to our understanding of human pain. As in most novel research areas, a first crucial step for integrating the scientific fields of ERN and pain is developing a paradigm suited to address the needs from both fields. The present manuscript presents the development and piloting of an experimental task measuring both ERN and avoidance behavior in response to painful mistakes, as well as the challenges encountered herein. A total of 12 participants underwent one of six different task versions.
We describe in detail each of these versions, including their results, shortcomings, our solutions, and subsequent steps. Finally, we provide some advice for researchers aiming at developing novel paradigms.
Take-home message: Developing a new experimental paradigm is challenging, time-consuming and requires thorough testing. To make this process most efficient, we advise to clearly define the requirements at the beginning, to keep record of all decisions and adjustments made in the process, and to investigate encountered problems and failures which may help to gradually improve the paradigm.


Introduction peer reviews▼
Chronic pain affects as many as 20% of adults in Europe (Societal Impact of Pain, 2017). This poses a considerable health care challenge and tremendous individual suffering. One of the most prevalent explanations for the transition from a common acute pain episode to chronic disabling pain is the fear-avoidance model (Vlaeyen et al., 2016; Vlaeyen & Linton, 2000; Vlaeyen & Crombez, 2020). It posits that pain-related avoidance behavior, often fueled by catastrophic (mis)interpretations of pain, contributes to individuals entering into a downward spiral of fear, avoidance, inactivity, disability and negative affect. Pain-avoidance refers to individuals avoiding stimuli that are predictive of pain or pain exacerbations. For instance, persons who have sustained an injury to their back may avoid lifting objects for fear that it will lead to further pain, or for fear that lifting may create bodily harm that is signaled by pain. Typically, avoidance occurs in anticipation of pain, and hence individuals have little opportunity to test and, if necessary, correct their beliefs about the threat associated with such activities. This failure to correct mistaken beliefs maintains pain-related fear and further increases mood disturbances and pain itself (van Vliet et al., 2018; Vlaeyen & Linton, 2000).
The neural underpinnings of pain-related avoidance behavior are poorly understood. A new and promising avenue to understanding pain-related avoidance is to study its relation to error-related negativity (ERN). The ERN is an event-related potential (ERP), which occurs within ~100 ms after error commission (Hajcak, 2012). It can be measured as a negative deflection in the electroencephalogram (EEG) over fronto-central scalp positions, originating in the anterior cingulate cortex (Miltner et al., 2003). Differences in the amplitude of the ERN are affected by both contextual and individual factors such as motivational aspects and error salience, and are considered to reflect individual error sensitivity (Hajcak, 2012; Hajcak & Foti, 2008). For instance, it has been observed that the ERN is larger when errors are perceived to have worse consequences (Hajcak, 2012). Additionally, the ERN is believed to activate a defensive motivational system, in order to adjust the erroneous behavior and protect the organism (Hajcak & Foti, 2008). In line with these findings, increased ERN amplitudes have consistently been associated with anxiety disorders (Weinberg, Riesel & Hajcak, 2012) as persons affected arguably perceive their own mistakes in relation to the respective object of fear as highly threatening. In addition, avoidance behavior (including experiential avoidance) is a typical feature of anxiety disorders (Aupperle & Martin, 2010; Berman et al., 2010; Dikman & Allen, 2000; Dymond & Roche, 2009).
Given the commonalities between anxiety disorders and chronic pain (Asmundson & Katz, 2009), and the high prevalence of negative affect in chronic pain (Geisser et al., 2000), it is theoretically plausible that the ERN may be a biomarker for the development of pain-related avoidance behavior as well. In the case of anxiety the particular fear object is considered threatening, whereas individuals with chronic pain perceive pain to be highly threatening. Due to pain being such a salient and aversive stimulus, as well as conceptually signaling bodily error, avoidance responses constitute a defensive mechanism to prevent the feared outcome (Leeuw et al., 2007). As such we may see an association between the ERN amplitude and the level of avoidance exhibited by an individual, with avoidance representing a defensive behavior in response to a motivationally salient and aversive stimulus.
Our long-term goal is to investigate whether individual differences in the ERN amplitude are related to differences in pain-related avoidance behavior, which may help identify individuals more at risk for the development of chronic disabling pain. To this end, we aimed at creating an experimental paradigm that would allow us to measure the ERN and avoidance responses simultaneously, the process of which is described in the present manuscript. A task integrating both the ERN and avoidance is more efficient than measuring them in separate tasks. It ensures that there is a relation between error commissions and pain-related avoidance and puts less burden on the participant. Such a paradigm would need to meet requirements for both measures, ERN and pain-related avoidance, namely: 1. evoke an ERN; 2. yield a reliable estimate of the ERN, requiring a minimum of six errors (Olvet & Hajcak, 2009); and 3. be able to detect individual differences in avoidance behavior by providing sufficient opportunities to avoid (at least 20). Although there is some debate as to whether one needs to be aware of having committed an error to evoke the ERN (Nieuwenhuis et al., 2001), awareness is essential in this task as participants would otherwise be unable to perform the avoidance response. In addition, the task ought to evoke inhibition errors rather than mistakes due to lack of skill or knowledge. For that purpose, we decided to integrate two experimental tasks, one of which is frequently used to measure the ERN [the Eriksen Flanker task (Eriksen & Eriksen, 1974)] and the other is a common assessment of avoidance behavior [the avoidance task by Vervliet and Indekeu (2015)]. We piloted this basic paradigm and continuously adjusted it based on the pilot results, leading to a total of six task variations. The focus of the present article is the design and execution of these different tasks, what we can learn from them and suggestions for future research.
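The two numerical requirements stated above (at least six errors for a reliable ERN estimate, at least 20 opportunities to avoid) can be expressed as a simple per-participant check. The following is a minimal sketch rather than the authors' analysis code: the class and function names are hypothetical, and it assumes that every error trial counts as one opportunity to avoid.

```python
from dataclasses import dataclass

# Thresholds taken from the requirements listed in the text.
MIN_ERRORS_FOR_ERN = 6          # reliable ERN estimate (Olvet & Hajcak, 2009)
MIN_AVOIDANCE_OPPORTUNITIES = 20  # "at least 20" opportunities to avoid


@dataclass
class ParticipantSummary:
    """Hypothetical container for one participant's error count."""
    participant_id: str
    n_errors: int  # assumption: each error trial is one opportunity to avoid


def check_requirements(summary: ParticipantSummary) -> dict:
    """Report which of the two numerical requirements a participant meets."""
    return {
        "ern_reliable": summary.n_errors >= MIN_ERRORS_FOR_ERN,
        "avoidance_assessable": summary.n_errors >= MIN_AVOIDANCE_OPPORTUNITIES,
    }


# Illustrative values only, not data from the pilots.
example = ParticipantSummary(participant_id="pilot-example", n_errors=12)
result = check_requirements(example)
```

In this illustration a participant with 12 errors would pass the ERN criterion but not the avoidance criterion, which is why the stricter threshold of 20 tends to drive exclusions.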

Stimulation
Vibrotactile stimuli were administered to the left side of participants' lower back through two vibrotactors (Dancer Design, St Helens, England) placed 2 cm apart (center to center; Figure 1). The stimulus duration was 112 ms and the vibration intensity, which was set to a clearly perceptible level that remained non-painful throughout the experiment, was kept constant across participants. Tactors were activated in a semi-randomized order.
Electrocutaneous stimuli (e-stim) were generated by a constant-current stimulator (DS7A; Digitimer, Welwyn Garden City, England) and delivered for a duration of 2 ms through two 4 mm Ag/AgCl reusable snap electrodes filled with K-Y gel attached 1 cm above the vibrotactors. The intensity of these stimuli was individually calibrated to a level that was "painful and demanding some effort to tolerate". For that purpose, a series of stimuli of ascending intensity was administered, each of which participants were instructed to verbally rate on a scale ranging from 0 (no sensation) to 10 (worst imaginable pain). The calibrated stimulus intensity was kept constant throughout the task.
The lower back was selected as stimulus location for two reasons: 1. the sensory acuity at the back is relatively low compared to more typical distal stimulation sites, such as the arms or hands (Weissman-Fogel et al., 2012), which was considered advantageous with regard to task difficulty; and 2. a large, even surface area was required to attach the vibrotactors and electrodes.

Software
The experimental task was programmed and presented in Affect 5

Measures
The purpose of the pilot study was to create a task that allows the ERN and avoidance behavior to be measured simultaneously. In each task version, vibrotactors emitted vibrations on participants' left lower back, EEG in response to correct and error trials was recorded, and a painful e-stim was applied when participants made an erroneous response, which they could cancel by pressing the space bar.
Avoidance was operationalized as the number of button presses to cancel an e-stim.
After performing the task, participants were asked questions to obtain their perspective on the task and to ensure that requirements were met. Specifically, they were asked whether they were able to perceive when they made errors, how difficult they found the task and the initiation of the avoidance response, and whether they had any additional comments.
General procedure peer reviews▼
All task versions were based on a tactile task that was inspired by Eriksen's Flanker task (Eriksen & Eriksen, 1974) and a fear-avoidance paradigm established by Vervliet and Indekeu (2015). In the tactile task participants were instructed to distinguish between locations of vibrations emitted on their left lower back. Much like the arrow version of the Eriksen Flanker task, in which the direction of arrows is the cue for a correct response, participants used left and right mouse buttons to indicate whether they felt a vibration at the left or right location, respectively. Participants were instructed to respond as quickly and as accurately as possible. In addition, participants were informed that upon error commission they would receive an electrocutaneous stimulus, which they could cancel by pressing the space bar on the keyboard. Tasks employed either two or three vibrotactors (Figure 1).
A typical trial proceeded as follows (see Figure 2): a vibration was emitted (112 ms), followed by a response window in which participants indicated the location of the vibration (200-1000 ms). After a fixation period (300 ms), participants could provide an avoidance response (1000 ms). If participants chose not to avoid the e-stim after committing an error, an e-stim was delivered following this response window. The trial concluded with a jittered intertrial interval (600-1000 ms).
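The trial flow described above can be laid out as an ordered list of timed events. This is a hedged sketch only: the function name and event labels are hypothetical, the response window is treated as a single duration between 200 and 1000 ms, the jitter is assumed to be drawn uniformly in whole milliseconds, and the 2 ms e-stim duration is taken from the Stimulation section. The exact timings differed between task versions.

```python
import random

VIBRATION_MS = 112          # vibrotactile stimulus duration
FIXATION_MS = 300           # fixation period after the response window
AVOIDANCE_WINDOW_MS = 1000  # window for pressing the space bar
ESTIM_MS = 2                # e-stim duration (from the Stimulation section)
ITI_RANGE_MS = (600, 1000)  # jittered intertrial interval


def build_trial(response_window_ms, error, avoided, rng):
    """Return the ordered (event, duration_ms) pairs for one trial.

    Assumption: the response window lasts between 200 and 1000 ms.
    """
    if not 200 <= response_window_ms <= 1000:
        raise ValueError("response window must be 200-1000 ms")
    events = [
        ("vibration", VIBRATION_MS),
        ("response_window", response_window_ms),
        ("fixation", FIXATION_MS),
        ("avoidance_window", AVOIDANCE_WINDOW_MS),
    ]
    # The e-stim follows the avoidance window only after an unavoided error.
    if error and not avoided:
        events.append(("e-stim", ESTIM_MS))
    # Assumption: uniform integer jitter within the stated range.
    events.append(("intertrial_interval", rng.randint(*ITI_RANGE_MS)))
    return events


rng = random.Random(0)  # seeded only to make the sketch reproducible
error_trial = build_trial(500, error=True, avoided=False, rng=rng)
avoided_trial = build_trial(500, error=True, avoided=True, rng=rng)
```

Writing the flow out this way makes it easy to see that an avoided error and a correct response produce the same sequence of events, differing only in the participant's key presses.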
FIGURE 2 Schematic representation of the basic trial flow with exact timings depending on the task version.

To prevent 'better safe than sorry' reasoning for performing the avoidance response, participants were told that five, and later two, trials would be added to the duration of the experiment every time they cancelled an e-stim. In reality, no trials were added. As this proved to be highly aversive to the participants, leading to an unwillingness to avoid, we removed this clause after the eighth participant. Any deviations from the general procedure will be indicated. Table 1 indicates the total number of participants and trials per task version. In general, the task versions were composed of four or five blocks of 60 trials each. In total, six different versions of the task were tested: Two Tactor task, Distractor task, 100% coactivation task, Different intensities task, 50% coactivation task, and a Sequence task.
Note. Number of tasks refers to whether participants had to perform any additional tasks to the discrimination task. Tasks are listed in the order in which they were tested.
Two Tactor task peer reviews▼

Procedure
In this first version of the task, two tactors were attached to the left lower back of the participants (Figure 1, a). Participants indicated via left and right mouse button press whether they felt a vibration at the left or right location, respectively. Upon error commission participants were free to cancel the e-stim by pressing the space bar. One participant completed four blocks of this task version.

Evaluation
After running one participant it was quickly established that this task was too easy. This was reflected by participant feedback, as well as the number of errors (M = 2). The participant was aware of having made mistakes, but chose not to give any avoidance responses.

Distractor task peer reviews▼

Procedure
In order to increase task difficulty, three tactors were attached to the left lower back, with two at the original locations and the third centrally below them (Figure 1, b). This distractor tactor emitted vibrations to the beat of a widely known song (e.g., "Seven Nation Army" by The White Stripes). The participant was instructed to respond exclusively to left and right vibrations. At the end of each block they had to choose, in a multiple-choice manner, which song the third tactor had been vibrating to. One participant completed four blocks of 60 trials.

Evaluation
There were several issues with this task version. The task appeared to be difficult, which led to the participant being exclusively focused on the differentiation task (M = 10 errors). Additionally, even though participants wore earplugs to prevent them from distinguishing vibrations based on the sound they made, the third tactor acted as a small speaker, and the participant became aware of the song because they could hear it. Due to these issues we did not run any further participants on this task.

100% coactivation task

Procedure
This task version used three tactors. The distractor vibrotactor emitted vibrations simultaneously with the left and right tactors, to make it harder to distinguish between locations. The first participant for this task completed four blocks; in hopes that increasing the number of blocks would lead to an increase in error commissions, we then increased the task length to five blocks (six participants completed five blocks). One participant only completed two blocks of this task as they then completed two blocks of the 50% coactivation task.

Journal of Trial and Error 2020
All but one participant were told that additional trials would be added if they chose to cancel the e-stim. Additionally, in an attempt to balance the trade-off and increase the cost of not avoiding, the painful stimulation was made more threatening for four participants in the task by stating that a "stimulus with a slightly higher intensity than the one previously calibrated" could be delivered occasionally.

Evaluation
On average 22.5 mistakes were made across eight participants. Participant feedback indicated that the task was not too difficult and that they were conscious of having committed an error as it occurred. Note that for two out of eight participants there was no clause stating that trials would be added to the experiment for every click of the space bar.

Different intensities task

Procedure
This task used the 100% coactivation schedule, with the third tactor being set to a higher intensity than the left and right tactors. The goal, once more, was to make the task more difficult, in an attempt to raise the number of error commissions across participants. One participant performed five blocks of this task.

Evaluation
Although this task did raise the number of error commissions (M = 33), participants were less aware of having made a mistake and therefore unable to avoid the e-stim.

50% coactivation task

Procedure
In this task the third tactor only coactivated with left and right tactors 50% of the time. This adaptation was expected to make the task more difficult as participants may habituate less to the sensation of the third tactor. Two participants completed two blocks and one block of this task, respectively.

Evaluation
This task proved to be easier than its 100% coactivation counterpart, with two participants making 18.5 mistakes on average. Participants reported having difficulty recognizing error commission; as such, further task versions were explored.

Sequence task peer reviews▼

Procedure
In this last task, participants were instructed to recreate sequences of three vibrations using button presses. The left and right tactors went off in a sequence (e.g. left-right-right), and participants recreated this using the left and right mouse buttons. The third tactor followed a 100% coactivation schedule. Two participants performed five blocks of the task.

Results peer reviews▼
EEG data were recorded for ten out of the 12 participants, including both Sequence task pilots, the Different intensities pilot, one 50% coactivation pilot, and six 100% coactivation pilots. One participant on the 100% coactivation task committed fewer than six errors and was, hence, excluded from the EEG analyses. No EEG data were recorded for the Distractor task or for the Two Tactor task.
Task performance and avoidance behavior peer reviews▼
The Distractor task shows only M = 10 error commissions (SD = 0) and no avoidance responses.
We further compared when the most error commissions occurred in each task, in order to determine whether mistakes are more likely to be attributable to the novelty of the task or whether the task manages to consistently evoke error commissions. Across tasks, 56% of errors occurred in the first two blocks of the experiment. Note that for participants for whom the total number of trials was lower than four blocks, only the first block was considered. The task that evoked errors the most consistently throughout the blocks was the 100% coactivation task (45.8% of errors in the first two blocks) and the least was the Different intensities task (81.8% of errors in the first two blocks).
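The share of errors falling in the early blocks, as compared above, can be computed per participant as follows. This is a minimal sketch with a hypothetical function name; it follows the stated rule that only the first block is considered when a participant completed fewer than four blocks, and the block counts shown are illustrative rather than pilot data.

```python
from typing import Sequence


def early_error_share(errors_per_block: Sequence[int]) -> float:
    """Proportion of a participant's errors made in the early blocks.

    The first two blocks count as "early" when at least four blocks were
    completed; otherwise only the first block is considered.
    """
    if len(errors_per_block) == 0:
        raise ValueError("at least one block is required")
    total = sum(errors_per_block)
    if total == 0:
        raise ValueError("no errors recorded, share is undefined")
    n_early = 2 if len(errors_per_block) >= 4 else 1
    return sum(errors_per_block[:n_early]) / total


# Illustrative error counts per block (hypothetical, not pilot data).
four_block_share = early_error_share([6, 3, 2, 1])  # 9 of 12 errors -> 0.75
two_block_share = early_error_share([4, 6])         # 4 of 10 errors -> 0.4
```

A share close to 0.5 for a four-block session would suggest that errors are spread evenly across the task, whereas a share near 1 would point to errors driven mainly by task novelty.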

Discussion peer reviews▼
Here we describe the piloting of a novel paradigm based on the integration of two experimental tasks stemming from different fields of psychology. Our aim was to develop an instrument suited to measure both the neural activity during error commission that is followed by a negative consequence, namely pain, and the avoidance tendencies in direct response to these pain-evoking errors. As described above, this task is required to evoke an ERN and allow for its reliable measurement (≥ 6 errors), evoke conscious errors of inhibition in order to allow participants to avoid the painful stimulation, and create sufficient variability in avoidance behavior (≥ 20 potential avoidance trials, i.e., error trials, and adequate balance between costs and benefits of avoiding vs. not avoiding).
Altogether, six different task versions were put to test, which are discussed in the following.
The first task using only two tactors proved to be too easy, based on the low number of mistakes made and the reported feedback of the participant. Despite the low perceptual acuity at the lower back, the vibrations were reportedly clearly distinguishable and after a first orientation during the practice phase, the participant had no difficulty completing the task. Consequently, it was decided to add a third vibrotactor to the paradigm to distract from the target stimulation, which resembles the distracting flankers of the Eriksen Flanker task.
The Distractor task, the first version using the third tactor, was programmed to emit vibrations in the rhythm of famous songs which the participant was instructed to identify whilst responding to the left and right vibrations. This task turned out highly demanding, leaving the participant focusing entirely on the discrimination task. Although this version successfully elicited a considerable number of errors, according to the participant the task load interfered with performing the avoidance response, which was perceived as disruptive, such that none of the e-stim were avoided.
In order to decrease the demands of the task, we switched to the 100% coactivation task, in which the third tactor coactivated with the target tactor on each trial. On average, participants made a sufficient number of errors for ERN analysis, and the task allowed for reasonable variability in avoidance responses.
While this version in principle did meet the pre-specified requirements, we decided to further modify this task in an attempt to ensure that the majority of participants would commit sufficient errors.

FIGURE 3 Response-locked grand average waveforms for ERN and CRN at Cz with topography, averaged for (a) the 100% coactivation task (N = 5); (b) the 50% coactivation task (N = 2); (c) the Sequence task (N = 1); (d) the Different intensities task (N = 1); and (e) averaged across the four task versions (N = 9).
difference in the amount of error commissions.
Overall, the tasks succeeded in eliciting an ERN at fronto-central sites when participants committed errors, which resembled the typical ERN as observed in more classic tasks such as the Eriksen Flanker task (Imburgio et al., 2020).
It is noteworthy that the ERNs produced by the different task versions were slightly less pronounced and some peaked earlier than typically found with the Flanker task (Hajcak & Foti, 2008).

In conclusion, out of the six versions the 100% coactivation task proved the most reliable and useful for investigating whether individuals with larger ERN amplitudes are more prone to engage in pain-related avoidance behavior. Nevertheless, this task does have several limitations: Firstly, error commissions may remain low for a fair number of participants, so that it can be expected that a study employing this paradigm will lose many participants on the exclusion criterion of twenty errors that are needed to obtain a useful assessment of avoidance behavior. Secondly, the operationalization of avoidance behavior as a single button press may be considered simplistic. Persons with chronic pain tend to engage in a broad variety of, frequently subtle, alternative behaviors and withdrawal from activities that are thought to evoke or worsen their pain. Capturing this complexity in an experimental paradigm is highly challenging, and even though better approximations than button presses do exist (e.g., Meulders et al., 2016), the constraints of the Flanker task, a speeded reaction time clearly favored the latter: Participants were instructed that engaging in avoidance behavior would have the consequence of trials being added at the end of the task. As this did not only prolong the duration of the experiment but also further increased the likelihood of error commission, participants appeared to prefer enduring the e-stim. In an attempt to balance the trade-off and increase the cost of not avoiding, the painful stimulation was made more threatening by stating that a "stimulus with a slightly higher intensity than the one previously calibrated" could be delivered occasionally. Surprisingly, this change, too, did not increase the number of avoidance responses. Adjusting the painful stimulus in length or quality was not feasible given the timing of the trial flow. Based on the low levels of avoidance behavior throughout the pilots, possibly due to the cognitive demand of switching from the discrimination task to avoiding the painful stimulus, a surge in avoidance responses is not expected. Moreover, low-cost clinical avoidance, such as constantly carrying pain medication, is common and may be problematic as it impedes treatment and maintains fear (van Vliet et al., 2018; Vervliet & Indekeu, 2015; Volders et al., 2012). Hence, studying these low-cost avoidance behaviors is clinically relevant, and it was decided not to integrate any further avoidance costs.
Apart from these task-specific limitations, two limitations of the piloting process need to be mentioned: On the one hand, the small and unequal sample sizes for the different task versions do not allow extensive statistical comparisons.
On the other hand, some changes to the paradigm were made simultaneously, making it impossible to disentangle the individual effects. The decisions to further modify the paradigm were made as soon as clear converging evidence was obtained from behavioral performance and self-report that a given version was either too easy or too complex, often after just one or two pilots, in order to save time and participants. While this allowed us to try out various ideas and establish a functional paradigm within a few weeks' time, the paradigm clearly requires validation in future studies.
Altogether, developing this experimental task based on the integration of two well-established paradigms was demanding, despite the relative simplicity of the original paradigms. Although the Flanker and the avoidance task seemed compatible on account of their short response intervals and general set-up, many problems arose with regard to timing, task difficulty and interference of responses, and possible solutions often caused unexpected side-effects to participants' behavior. Therefore, a thorough and extensive piloting was pivotal in identifying various pitfalls and re-adjusting the task. In the following section, we have compiled a list of recommendations for those planning to create a novel experimental paradigm.

Tutorial peer reviews▼
This pilot study was an interesting, highly instructive journey. Based on this experience, we propose five key elements to consider when developing a new experimental paradigm or merging existing ones.

Peer Reviews
Reviewer 1 Ilona Domen

0) General comments
a. The manuscript describes the development and pilot testing of a new paradigm/task, to measure the ERN and avoidance behavior (related to pain) simultaneously. In general I think the manuscript was well written, understandable, and I applaud their efforts in making and recording a new paradigm (and reporting on the whole process) as well as producing a tutorial section. The manuscript did raise quite a few questions for me, and I do have some comments (and concerns) which I describe below.
b. I assume the references guidelines follow the APA 6/7 rules. Regarding this, the manuscript should be checked on APA guidelines (especially results, tables, and references).
b. Might be due to word count restrictions, but the introduction is rather short and it therefore feels like it does not explain the full circle of reasoning (pain - avoidance - more pain and fear; link ERN, why this component. Feedback is also stated, so why not the FRN; neural underpinnings, but there is no mention of fMRI research?). Also the specific need of a task that combines ERN and avoidance behavior (because goal use it in future research / no task exists / separate tasks will not provide the information needed) could be more pronounced (i.e., need/purpose research?). Also, some choices lack motivation/reasoning, such as: 'In addition, the task ought to evoke inhibition errors rather than mistakes due to lack of skill or knowledge' (this also occurs in the method/results sections, for example for the exclusion criteria (sources?)).
c. I feel like (part of) the goal of the research is clear, however, it is not clear to me whether this task is meant to be used in future research (of the researchers themselves or others). The conclusion of this manuscript leaves me wondering whether the task will actually be used or not. The manuscript focuses on three things in my opinion: pilot testing a new task, describing and trying to explain (the not useable) data, and a tutorial.
Maybe less focus on analyzing/interpreting data leaves more room for a more detailed tutorial (choices made and why, what went wrong how to fix it, roadmap for others).
d. The introduction and discussion mention a criterion of "at least 20" [introduction] errors "needed to obtain a useful assessment of avoidance behaviour" [discussion]. How was this criterion decided upon? Contrary to the criterion for the minimum number of ERNs, no reference is provided. New tasks are often introduced as aiming at raising the number of errors, but the 'why' behind this aim is only elaborated upon in the first paragraph of the discussion, which is too late (and, for me personally, was hard to understand).

2) Methods
a. I miss the justification of several choices made in the method and data analysis sections (e.g., rationale for bandpass filters, epoch length, 250 Hz for data acquisition [normally at least 500 Hz], using only one electrode [Fcz]).
b. There are no scales mentioned in the measures section (open questions?).
c. Some small questions regarding the general procedure: number of trials, number of blocks, what if participants did not press any button to indicate the location of the vibration, controlled for handedness of the participant (normal for EEG), also index finger might be more used for button presses than other fingers. And regarding the different task versions: why did participants have to guess the song, and why was it not good they were aware of the song; how was it known that participants were less aware of the errors they made in the different intensities task.
d. This manuscript would benefit greatly from a more structured overview of all tasks (N, trials, blocks, distractor yes/no, places of vibrations, song yes/no, changes relative to other task versions, choices made and why etcetera) in general, and very early in the method section. This would be a very strong point in this manuscript and would benefit other researchers very much (I made new paradigms myself and would have benefitted from it greatly I think). At this point while reading the manuscript it is a bit chaotic and hard to follow (concepts such as distractor tactor are mentioned early without the reader knowing that it is used or its role, and task names are also non-explaining). The task versions are then explained one by one, with no clear overview of what is different than the one before (and if the evaluation is more positive or negative, and why some adjustments are made). It is one of the strengths of the manuscript in my opinion, and a very structured and detailed overview would be very informative (also since the goal of the manuscript is also a tutorial section).
e. The low number of participants is understandable given the experiments are time-consuming (I made new paradigms myself). However, one participant per paradigm might be insufficient (especially since error commission differs greatly between participants, and you need a reliable measure, and as it seems in the manuscript, avoidance behavior as well). Also, some paradigms had more participants (so not equal across paradigms), but almost all participants received the penalty for avoiding errors (extension of trials) which distorts the actual measure of avoidance behavior. It also seems that the stimulus was not painful (enough) since participants did not avoid (but it could have another reason). Also, the number of blocks differed per paradigm (and participant), and changes to the paradigms were not tested one by one, but often several changes together. Subtle changes can even make large differences (neural). This pilot testing (with the goal to make a usable task for research) could benefit greatly from consistency.
Is there a possibility to collect more data (even on colleagues maybe)?

3) Results
a. I am not sure whether this manuscript should focus on data analysis (the samples are too small). It could use more space for the task overview and the tutorial, for example.
b. Results should follow APA guidelines?
c. Why were no EEG data collected for the distractor task and the two tactor task?
d. 'This might indicate that participants are unaware of having committed an error and thus are unable to avoid'. Or they did not want to avoid because of penalties/higher intensity pain stimuli?
e. Most errors occurred in the first two blocks of all task versions, but if the first 2 blocks generally evoke the most errors, then for participants with only 1 or 2 blocks, the data are less insightful?
f. The ΔERN values need some direction/reference values (what is small/large/preferred/good...).
g. Figure 3. Response-locked grand average waveforms for ERN and CRN at Cz with topography. But FCz was measured for the ERN?

4) Discussion
a. Are there other tasks normally used to measure ERN or avoidance behavior that might work better together (when they are integrated)?
b. Conclusions after the task version discussions are missing (for example: the two tactor task was unable to detect the ERN due to a lack of error commissions, but avoidance behavior was measurable? Therefore we decided to change...). What is the best version of the task, the 100% co-activation task? But is it ready/usable for research? What needs to be adjusted? Would you use it for your own research, or recommend it to others (or why is it not tested further)? As long as choices are explained, it is more acceptable.
c. Why were no EEG data collected for the distractor task and the two tactor task?
d. I miss the role of the penalties in the discussion of the task versions.
e. I miss references in the discussion section.
f. 'It is noteworthy that the ERNs produced by the different task versions were slightly less pronounced and some peaked earlier than typically found with the Flanker task (Hajcak & Foti, 2008)'. I miss the mean ms of the peaks. Could it be less pronounced because participants were unaware of the errors they made? Or why do you think this is? Results are also hard to interpret due to small sample size, few trials and blocks? In my opinion the results/discussion should not focus on interpreting EEG data?
g. In the discussion section it is mentioned that an option to add avoidance costs (resembling real-life avoidance costs in pain patients) was examined. With this information, it is easier to understand why a penalty was added, which otherwise seemed to undermine the measure of avoidance behavior. Mention this earlier on?
h. I am curious about these options: 'better approximations than button presses do exist'.
i. 'Although the Flanker and the avoidance task seemed compatible on account of their short response intervals and general set-up, many problems arose with regard to timing, task difficulty and interference of responses, and possible solutions often caused unexpected side-effects to participants' behavior. Therefore, a thorough and extensive piloting was pivotal in identifying various pitfalls and re-adjusting the task. In the following section, we have compiled a list of recommendations for those planning to create a novel experimental paradigm'. I completely agree on the need for extensive and thorough testing. However, despite the considerable work, time, and attention spent on this research, I feel that this did not completely happen. In particular, could all the problems be documented more extensively?

5) Tutorial
• When I saw that the manuscript included a tutorial section, I was really enthusiastic. I believe researchers should share their experiences. The tutorial provides good insights, but it remains on the surface and could go more in depth (for example on problems, solutions, and choices made).

6) Conclusion
• All in all I think this manuscript could be more insightful than it already is, with small (and big) adjustments. I made new paradigms myself, and I think I would have been very much helped with experiences from other researchers, so not everyone has to invent the wheel again. This manuscript could be a nice step towards more open research practices, and sharing research experiences.
• Therefore, I would recommend a Revise and Resubmit of the manuscript.

[...] potentials during error commission, and the avoidance behaviour in response to these pain-evoking errors. This new task integrates two existing tasks: a tactile version of the Eriksen Flanker task and a fear-avoidance paradigm. The new task was piloted using a sample of n = 12 participants, and was continuously adjusted based on the pilot results, resulting in a total of six examined task versions. The focus of the present paper is the development and design of these different task versions, and how such a process ensues (including experience-based tips in the tutorial). The long-term goal of the project is to apply the paradigm to the aetiology of chronic pain, hypothesising that individual differences in ERN amplitude are related to differences in pain avoidance behaviour, with persons with larger ERNs being at risk of developing chronic pain.
• The manuscript is characterised by several strengths. In particular, it is very well written. The text has an appropriate level of depth, and has a good flow. Arguments are presented in a concise and clear way, but are still approached from different perspectives. Finally, the paper's main focus is innovative and fits well with the aims of JOTE (as described in its manifesto).

1) Introduction
• The introduction states that "One of the most prevalent explanations for the transition from a common acute pain episode to chronic disabling pain is the fear-avoidance model", which "posits that pain-related avoidance behaviour, often fuelled by catastrophic (mis)interpretations of pain, contributes to individuals entering into a downward spiral of fear, avoidance, inactivity, disability and negative affect" [...]. While this explanation is indeed prevalent, it is confined to cognitive and behavioural mechanisms, thereby leaving out biological explanations. I feel like it would be fair to readers to provide a broader perspective and also discuss an equally prevalent explanation for chronic pain that focuses on the neural level: the sensitisation model. (I understand that this is not the model the present paper builds on, so a brief description would suffice.) The sensitisation model posits that after an acute pain episode the body hypersensitises the (local) nerves in an attempt to protect the damaged tissue and let it heal. This facilitates triggering pain, reminding the organism to be careful with the damaged tissue. In some cases, hypersensitivity remains, and can spread to other tissues as well. It then becomes a spiral in which the hypersensitivity triggers pain, and the pain in turn hypersensitises the nerves, and so on. See for example https://www.sciencedirect.com/science/article/pii/S1524904210000329 and https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1820749/.
• The introduction stresses the importance of the ERN for error processing, but what about the Pe (peaking at 200-400 ms, also originating from the ACC)? For readers (somewhat) familiar with error processing literature, the choice of the ERN over the Pe could be confusing, and I would advise the authors to briefly explain their choice.

a.
In its present form, I observe one major concern with the manuscript: the sample size used for the piloting. Merely n = 12 participants (n = 10 for the EEG) performed one of six task versions, with three of the piloted tasks having a sample size of just 1, and two having a sample size of 2. Hence, most decisions (like whether a task version satisfied the pre-defined requirements) were made based on only a few datapoints. For example, the two tactor task was deemed too easy based on the data (feedback and number of errors) of just one participant, who could easily have been an outlier. In general, with such extremely small samples, results can easily be coincidental rather than structural, even if the results (from that one person) seem very apparent. Related to this, the statistical analysis section (correctly) states that the study has insufficient power for inferential statistics, but I wonder whether non-parametric tests and comparing descriptive statistics are acceptable here. The task is newly developed and its score distributions are (as yet) unknown. Hence, comparing task versions using only one or two scores per version (which can originate from anywhere on the distribution) is relatively prone to chance. The large number of trials per participant obviously compensates for this a bit, but far too little. As a result, I think not much can actually be concluded from the current piloting data (which the authors agree on with regard to the hypothesised link between the ERN and avoidance). As for the reason such a small sample was used, the authors mention saving 'time and participants' [discussion], to which I am truly sympathetic, but I do think the extra cost of more participants would have been worth it considering the likely gains of a larger sample.
I would like to hear the authors' take on all of this, and would like to suggest three ways to mitigate these issues in the present paper and/or in the future:
• To me, this issue is a methodological error, which JOTE aims to publish to support learning in the scientific community. However, to educate others, I think the issue should be given more thought in the text, possibly in the limitation section and/or the tutorial section if the authors have any concrete recommendations for mitigating it.
• Could it be useful to set a predefined number of participants to be tested for every new task version (like n = 10)? This is still underpowered but would make the decision to modify and retest the paradigm less arbitrary (i.e., less based on what at any moment 'seems' right based on the responses of only a few people). It would reduce the risk of decisions being driven by chance findings that look straightforward, but are not. Such a recommendation could be added to the tutorial if the authors deem it useful (as it does reduce some of the flexibility of their current approach). This predefined number could even be included in a preregistration (see more on preregistration below). Alternatively, could doing a power analysis before piloting every new task help? I understand that these changes would increase the costs for this project, but also think that they would reduce the need for and cost of further, more extensive validation.
• Like the rest of the manuscript, the results section reads very nicely. One would almost forget that the results hinge on few data. I think it would be fair to readers to integrate Table 1 and 2 so that one immediately sees on what sample size a result is based.
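
As a rough illustration of the sample-size concern raised above, the sketch below computes the standard error of a mean for n = 1, 2 and 10 participants. It assumes independent participants and an arbitrary between-participant standard deviation (`ASSUMED_SD`, a hypothetical value, since the true spread of scores on the new task is unknown); the function names are illustrative and not taken from the manuscript.

```python
import math

# Hypothetical value: the true between-participant SD of the new task is unknown.
ASSUMED_SD = 1.0


def standard_error(sd: float, n: int) -> float:
    """Standard error of a mean based on n independent participants."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return sd / math.sqrt(n)


def relative_precision(n_small: int, n_large: int) -> float:
    """Factor by which the standard error shrinks when going from n_small to n_large."""
    return standard_error(ASSUMED_SD, n_small) / standard_error(ASSUMED_SD, n_large)


if __name__ == "__main__":
    for n in (1, 2, 10):
        se = standard_error(ASSUMED_SD, n)
        print(f"n = {n:2d}: standard error = {se:.3f} x SD")
```

Under these assumptions, going from n = 1 to n = 10 shrinks the standard error by a factor of about 3.2. Note also that with n = 1 the standard deviation itself cannot be estimated from the data, so there is no way to judge whether a single participant is typical.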

3) Participants
• Did participants receive (monetary) compensation for participating? This is not mentioned in the paper, and neither is the origin or recruitment of the participants (they seem a bit old to be regular college students; is it a general population sample?).

4) General procedure
• While the newly developed task is explained in detail, the two tasks it is based on, an Eriksen Flanker task and an avoidance task, are not. For readers unfamiliar with at least one of these base tasks, the lack of a brief, to the point explanation of what they involve makes the concept of the new task difficult to understand in early parts of the paper. Adding this information to the introduction would vastly benefit overall understanding.
• Why did the authors opt for a tactile Eriksen Flanker task? Wouldn't a regular (visual) Flanker task that triggers painful stimulation when making a mistake also be able to measure an ERN and avoidance behaviour?
• Compared to its predecessor, the 100% coactivation task 1) adds a distractor tactor; 2) tells participants that extra trials are added for avoiding pain; and 3) threatens them with triggers that are more painful. Not all participants were presented with all three changes, but Table 1 and 2 only report on the overall results for this task version. It would be helpful to see exactly what number of participants performed which subversion (Table 1), and what the results for those subgroups were (Table 2).
• For some task versions it is unclear what the 'base settings' are. For example, for the 50% coactivation task I was not sure whether the different intensities were still in place, and hence whether the 50% coactivation scheme was superimposed on the previous scheme or replaced it. Again, extending the information in Table 1 would help here. Note: I often thought the versions were better explained in the discussion than in the procedure section.

5) Sequence task
• The reasoning behind the development of the sequence task was unclear to me. It doesn't seem to logically follow from the previous task versions.

6) Results
• Like the rest of the manuscript, the results section reads very nicely. One would almost forget that the results hinge on few data. I think it would be fair to readers to integrate Table 1 and 2 so that one immediately sees on what sample size a result is based.

7) Task performance
• I observed three issues in Figure 3:
a. The topography legend is missing.
b. The windows marked in the Figure do not match the actual analysis window (which is 0-100 ms for all ERNs and CRNs). Possibly partly because of this difference, the ΔERN values in Table 2 seem different from the average difference between the ERN and CRN in Figure 3, which is also confusing.
c. The Figure portrays measurements at Cz, whereas the analysis is based around FCz. As FCz is a more logical site for examining the ERN, any effect would be more pronounced there. This is again confusing and could explain the apparent discrepancy between what we see in Figure 3 and what we read in Table 2.

8) Discussion
• The discussion states that "many problems arose with regard to timing, task difficulty and interference of responses, and possible solutions often caused unexpected side-effects to participants' behaviour". Am I correct that these issues are not further elaborated on in the paper? Would this fit in the discussion or tutorial?

9) Tutorial
• I would like to share several suggestions with regard to the tutorial:
a. I fully agree with point 1, but would like to suggest an addition where one preregisters the requirements that a to-be-developed paradigm should meet. This would, among other things, introduce more accountability into the process.
b. The same is true for point 2: the OSF platform could for example host a "detailed record of decisions, adjustments and their motivation" [...]. Second, I would advise giving an example from the present project for the statement "it is advised to check all relevant data as seemingly small adjustments of a task may have major effects on participants' perception and behaviour" [...], to make it less generic and more informative.
c. The follow-up questions discussed under point 3 are again suitable for a platform such as OSF. This may aid replication and extension attempts.