CrowdDetective: Wisdom of the Crowds for Detecting Abnormalities in Medical Scans

Machine learning (ML) has great potential for early diagnosis of disease from medical scans and, at times, has even been shown to outperform experts. However, ML algorithms need large amounts of annotated data (scans with outlined abnormalities) for good performance. The time-consuming annotation process limits the progress of ML in this field. To address the annotation problem, multiple instance learning (MIL) algorithms were proposed, which learn from scans that have been diagnosed, but not annotated in detail. Unfortunately, these algorithms are not good enough at predicting where the abnormalities are located, which is important for diagnosis and prognosis of disease. This limits the application of these algorithms in research and in clinical practice. I propose to use the "wisdom of the crowds" (internet users without specific expertise) to improve the predictions of the algorithms. While the crowd does not have experience with medical imaging, recent studies and pilot data I collected show they can still provide useful information about the images, for example by saying whether images are visually similar or not. Such information has not been leveraged before in medical imaging applications. I will validate these methods on three challenging detection tasks in chest computed tomography, histopathology images, and endoscopy video. Understanding how the crowd can contribute to applications that typically require expert knowledge will allow harnessing the potential of large unannotated sets of data, training more reliable algorithms, and ultimately paving the way towards using ML algorithms in clinical practice.



Figure 1: Supervised learning and MIL in medical imaging.
The success of machine learning (ML) is driven by the availability of powerful computational resources and sophisticated algorithms but, most of all, by large amounts of labelled data. Data is annotated at different levels: outlining the cat in the image (a strong annotation) or tagging the whole image (a weak annotation). Strong labels are the most effective, but weak labels can also be used, by weakly supervised or multiple instance learning (MIL) algorithms (Fig. 1).
In medicine, ML offers invaluable opportunities for diagnosis of disease, reaching expert performance [1] or even outperforming the experts [2,3].
However, while the amount of medical data is growing rapidly, the data is often only weakly annotated (a scan with a diagnosis, but without outlined abnormalities). Although MIL is gaining popularity in medical imaging [4-9], there is an overlooked, fundamental problem: MIL algorithms are optimized to predict weak annotations [10], but the classifier best at predicting weak annotations is often not the best at predicting strong annotations [6,8,11,12].
In practice this means that without strong annotations, MIL algorithms are poor at localizing abnormalities [13].
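To make the MIL setting concrete, the following is a minimal, self-contained sketch with synthetic data and illustrative names only: scans are "bags" of patch feature vectors, only the bag-level diagnosis is known, and a max-pooling instance scorer is trained from the bag labels alone. This is not the method proposed here; it only illustrates why bag-level performance and patch-level (localization) performance are measured separately.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_bag(abnormal, n_patches=20, dim=5):
    """One synthetic 'scan': rows are patch feature vectors; a few patches
    are shifted along one feature when the scan is abnormal."""
    X = rng.normal(size=(n_patches, dim))
    y_inst = np.zeros(n_patches, dtype=int)
    if abnormal:
        idx = rng.choice(n_patches, size=rng.integers(1, 4), replace=False)
        X[idx, 0] += 2.0
        y_inst[idx] = 1
    return X, y_inst

bag_labels = np.array([b % 2 for b in range(100)])   # weak (scan-level) labels only
bags = [make_bag(abnormal=y) for y in bag_labels]

# Instance scorer s(x) = sigmoid(w.x + c); the bag score is the maximum patch score.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
w, c = np.zeros(5), 0.0
for _ in range(300):                                  # crude gradient ascent on the bag-level likelihood
    gw, gc = np.zeros(5), 0.0
    for (X, _), y in zip(bags, bag_labels):
        s = sigmoid(X @ w + c)
        j = int(np.argmax(s))                         # max-pooling: only the most suspicious patch gets gradient
        gw += (y - s[j]) * X[j]
        gc += (y - s[j])
    w += 0.05 * gw / len(bags)
    c += 0.05 * gc / len(bags)

# Two separate evaluations: bag-level (what MIL optimizes) and instance-level (localization).
bag_pred = np.array([sigmoid(X @ w + c).max() > 0.5 for X, _ in bags])
inst_scores = np.concatenate([sigmoid(X @ w + c) for X, _ in bags])
inst_true = np.concatenate([yi for _, yi in bags])
print("bag accuracy:", (bag_pred == bag_labels).mean())
print("abnormal-patch recall:", (inst_scores[inst_true == 1] > 0.5).mean())
```

On realistic medical data the two scores can diverge substantially, which is exactly the problem described above [6,8,11-13].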
I propose a novel way of using crowdsourcing [14] to improve abnormality localization. Crowdsourcing is the process of gathering input by enlisting the services of a large number of people, either paid or unpaid, typically via the internet, allowing large-scale data annotation [15]. A widely used example is tagging people on social media, which is used to improve face recognition. Crowdsourcing has therefore been successful in computer vision [15,16] with non-expert tasks, such as recognizing everyday objects. However, internet users are not medical experts: how could they annotate abnormalities? A key insight is that the crowd does not need to mimic the experts to improve MIL for detecting abnormalities. For example, for detecting abnormalities in chest CT images, I propose to instead leverage the human visual system by asking the crowd:
⊳ To outline airways, which are easily recognizable structures (Fig. 2)
⊳ Whether patches are similar to each other
These annotations can be intuitively provided by the crowd, as demonstrated by my pilot results [17,18]. Although such annotations cannot be used to directly train algorithms, they are still informative.
The visual information in these annotations can help the MIL algorithm find better representations of the data: via multi-task learning [19] with labels for related tasks, such as airway outlines, and via similarity-based learning [20,21] with patch similarities.

Figure 2: Airway in a chest CT patch.
Originality and innovative aspects. The problem of predicting strong labels with MIL is underexplored [10,12]. Using crowdsourcing to improve MIL is novel in general: in a recent survey of crowdsourcing in computer vision [16] none of the 195 papers addresses MIL, while a recent book on MIL [22] does not mention crowdsourcing. While techniques for multi-task and similarity-based learning exist, they often rely on assumptions that are not compatible with medical images annotated by crowds. I will develop methods which better address such data.
Crowdsourcing is new to medical imaging, with most studies published in the last five years [17, 23-27]. The crowd typically mimics the experts by labelling images with a diagnosis or outlining structures in image patches, and the results are compared to expert labels. Several studies show promising results, while in other cases the crowd does not achieve the level of the experts [25,27]. Note that in most studies the crowd only labels the images; the labelled images are then not used in ML approaches. Only [27] uses the labels to train algorithms, but the labels it collects mimic the expert task, and the MIL scenario is not addressed.
There is no prior work on collecting labels for related tasks and similarities, except in my recent pilot studies [17,18].
Approach. I will investigate three strategies in which expert weak labels and crowd annotations can be combined to train more robust MIL algorithms:
⊳ Experts + crowd labels for the expert task
⊳ Experts + crowd labels for related tasks
⊳ Experts + crowd similarities
My hypothesis is that the best strategy will depend on the application.
Figure 3: Chest CT patches with healthy (left and middle) tissue and emphysema (right). Labelling these is difficult for non-experts, but they can still characterize the visual similarity of the patches.
Labels from the crowd can be combined with expert labels to create a more robust classifier by decision fusion (DF) [28,29].
Labels for related tasks, such as outlining airways when the target is to detect abnormalities, can be leveraged via multi-task learning (MTL) [19]. The intuition is that there are underlying features helpful for both tasks, which helps to find a more robust classifier. This is achieved by optimizing a joint loss function for the classifier. Such approaches are used in computer vision to simultaneously detect different everyday objects in images [33]. In medical imaging, MTL was used to simultaneously predict an Alzheimer's diagnosis and a cognitive test score [34], and to simultaneously predict different abnormalities related to lung cancer [35].
However, the computer vision approaches have assumptions that might be incompatible with medical imaging, such as well-defined boundaries for each object, and the medical imaging methods do not address MIL and expect expert labels for both tasks. I will investigate the applicability of these and similar approaches, while addressing the specifics of crowd-annotated medical images, for example, by weighting the loss function such that the expert labels are given more emphasis. I expect that the MTL strategy will be most successful when the related task is simple enough for the crowd to do, but has some relevance to the target task.
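To illustrate the weighted joint loss mentioned above, the snippet below sketches one possible form of the MTL objective: a binary cross-entropy term for the expert weak-label task plus a down-weighted term for the crowd-annotated related task (airway outlines). The function names, the weighting scheme, and the weight of 0.7 are assumptions for illustration, not the final design.

```python
import numpy as np

def bce(p, y):
    """Binary cross-entropy between predicted probabilities p and labels y."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def joint_mtl_loss(p_abnormal, y_expert_weak, p_airway, y_crowd_airway, expert_weight=0.7):
    """Weighted multi-task objective: the expert (target) task is emphasized, while the
    crowd-labelled related task acts as extra supervision for the shared representation."""
    return (expert_weight * bce(p_abnormal, y_expert_weak)
            + (1 - expert_weight) * bce(p_airway, y_crowd_airway))

# Hypothetical usage with dummy predictions from a shared model:
rng = np.random.default_rng(0)
print(joint_mtl_loss(p_abnormal=rng.uniform(size=8), y_expert_weak=rng.integers(0, 2, 8),
                     p_airway=rng.uniform(size=8), y_crowd_airway=rng.integers(0, 2, 8)))
```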
Similarities, such as "image patch A is more similar to B than to C" (Fig. 3) can also add information during training. The idea is that such judgements are more intuitive for crowds to provide, even if they are not able to label the images.
The similarities provide relative information about the expected classifier outputs, favouring classifiers for which these relationships hold. Examples include metric learning [20] and similarity embedding [21,36], which I group under the name similarity-based learning (SBL).
Again, the assumptions of current methods that focus on MIL may not be applicable. For example [37] assumes that an image label can apply to at most one patch in that image, which is not true for many abnormalities. Based on my experience with different (medical and non-medical) MIL problems [38,39], I will investigate how to create approaches with assumptions more suitable for medical images. I expect SBL to be successful when the crowd can focus on a single aspect of similarity, such as image texture. If there are multiple aspects to focus on, the similarities are likely to be inconsistent across annotators, and might hurt performance of MIL instead.
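As a minimal sketch of how crowd similarity judgements could enter training, assume they are collected as triplets ("patch A is more similar to B than to C"): each triplet then becomes a hinge constraint on distances in the learned patch embedding. The margin, the squared Euclidean distance, and the variable names are illustrative choices rather than commitments.

```python
import numpy as np

def similarity_hinge_loss(embeddings, triplets, margin=1.0):
    """Each crowd triplet (a, b, c) encodes 'a is more similar to b than to c';
    the embedding is penalized whenever d(a, b) + margin exceeds d(a, c)."""
    loss = 0.0
    for a, b, c in triplets:
        d_ab = np.sum((embeddings[a] - embeddings[b]) ** 2)
        d_ac = np.sum((embeddings[a] - embeddings[c]) ** 2)
        loss += max(0.0, d_ab - d_ac + margin)
    return loss / len(triplets)

# Hypothetical usage: 'embeddings' would be patch representations from the MIL model;
# this term would be added to the weak-label loss during training.
embeddings = np.random.default_rng(1).normal(size=(10, 4))   # 10 patches, 4-D embedding
crowd_triplets = [(0, 1, 2), (3, 4, 5), (6, 7, 8)]           # indices from crowd similarity queries
print(similarity_hinge_loss(embeddings, crowd_triplets))
```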
I will develop MIL algorithms which learn both from expert weak labels and crowd annotations, via DF, MTL or SBL. The strategies I will develop are general, therefore I will use them both with traditional [40,41] and more recent [39,42] MIL classifiers. Similarly, the strategies will apply both to traditional features and features extracted by deep learning.
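Of the three strategies, DF is the most direct to sketch. The toy function below fuses one expert weak label with several crowd labels for the same scan through a weighted vote in which the expert counts more; the weight of 3 is purely illustrative, and an actual fusion rule would be chosen from the DF literature [28,29].

```python
def fuse_weak_labels(expert_label, crowd_labels, expert_weight=3.0):
    """Weighted majority vote: each crowd label counts once, the expert label
    counts expert_weight times.  Returns the fused binary label."""
    score = expert_weight * expert_label + sum(crowd_labels)
    total = expert_weight + len(crowd_labels)
    return 1 if score / total >= 0.5 else 0

# Hypothetical usage: the expert says 'no abnormality', three of four crowd workers
# say 'abnormality'; with these weights the expert label prevails (output 0).
print(fuse_weak_labels(expert_label=0, crowd_labels=[1, 1, 1, 0]))
```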
To investigate which strategy is better in which scenario, I will apply these algorithms to three applications with different image characteristics and task difficulty:
⊳ Localizing patches with emphysema in chest CT images [43,44]. This is a task of medium difficulty, where I expect best results from MTL, because the underlying disease affects the appearance of airways, or from SBL, based on recent pilot results [18].
⊳ Localizing different surgical instruments in endoscopy videos [45]. This is a relatively easy task where DF is already likely to be successful.
⊳ Localizing cancerous regions in histopathology images [46]. This is a difficult task, where I would expect best results for SBL, or MTL if a simple but powerful related task can be defined in collaboration with experts.
The datasets have expert weak labels, but also some expert strong labels available, which I will use for validation purposes. I will collect annotations via two platforms, with a different trade-off between collection time and quality [47]:
⊳ Amazon Mechanical Turk, with paid crowds, which reduces the collection time to a matter of days, but could reduce quality.
⊳ Zooniverse with unpaid volunteers interested in science, which increases time, but also quality.

Figure 4: Learning curves; method 1 is better with few annotations and vice versa.
I will compare the resulting algorithms to algorithms trained with expert weak labels only, and to each other. In this process I will vary the number of expert labels and crowd annotations used, in order to create more scenarios than the three applications investigated.
By analysing these results from three different applications, I will aim to extract general rules on when each strategy is best to use.
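A sketch of this comparison protocol is given below; train_and_evaluate is a stand-in (here returning random scores) for the real pipeline, which would train a MIL variant with the chosen strategy and score localization against the held-out expert strong labels. Everything in the snippet is a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_and_evaluate(strategy, n_expert_weak, n_crowd):
    """Placeholder: in the project this would train the chosen MIL variant (DF/MTL/SBL)
    on n_expert_weak weak labels and n_crowd crowd annotations, and return a
    localization score computed on expert strong labels held out for validation."""
    return rng.uniform(0.5, 1.0)

# One learning curve per strategy, varying the amount of crowd input (cf. Fig. 4).
learning_curves = {
    strategy: [(n_crowd, train_and_evaluate(strategy, n_expert_weak=200, n_crowd=n_crowd))
               for n_crowd in (0, 100, 500, 1000, 5000)]
    for strategy in ("DF", "MTL", "SBL")
}
print(learning_curves)
```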

Research plan
The project has four work packages:

Potential
Research. In medical imaging, the results of my research will be beneficial to many research groups, as the annotation problem is present in many different applications. In machine learning, the research will demonstrate the potential of improving algorithms with crowd annotations, which are not typically leveraged. This is an underexplored area, but one that is likely to spark interest in the community. I expect this to assist in the development of (general-purpose) algorithms focusing on such annotations.
Furthermore I expect interest from other applications where annotated data is scarce, such as remote sensing [49] and ecology [50]. Lastly, the behaviour of the crowd (who contributed what and why) could be of interest to research groups in human-computer interaction [47] and social sciences.
Industry. There is already great interest from industry in ML, and medical imaging is catching up: for example, in 2017 the leading conference MICCAI had IBM, Nvidia, Siemens, and many medical imaging start-ups as sponsors.
The interest in my project is demonstrated by the companies in Table 2, including IBM Research. These companies could integrate the research outcomes in their products, which are designed to be user-friendly and secure.
This will enable translation of the proposed research to products and (in case of medical imaging) the clinic.

Society. The research area is relevant to several questions on the National
Science Agenda, including health (questions 105, 89, 94) and society and technology (112, 108). In health, there are long term benefits of improved medical imaging algorithms, such as better prognosis and diagnosis of disease, and better use of the experts' time.
For society, there are benefits in involving the public as annotators, which can raise awareness about health. In the long term, contributing to such projects could create jobs or volunteering opportunities accessible to people who are unable to work due to health or care responsibilities [51], increasing their well-being.

Implementation
Research. I will publish in relevant venues (MICCAI, Pattern Recognition) and make the papers, data, and code available online. I will continue organizing community events: in year 2 I will organize a workshop on the intersection of ML and crowdsourcing in label-scarce applications beyond medical imaging, and will invite researchers via my network and the online communities I am a member of (ml-news, crowd-hcomp). The impact is more difficult to predict, but I expect that at least some groups will be interested in the data and/or code generated by my project, and in translating these to their own applications.
Industry. I have established a user group (Table 2). In year 3 I will record lectures for an online course on DataCamp, where I am already setting up an image analysis course. The course will combine machine learning, medical imaging and crowdsourcing, but will not require technical prerequisites. This is a more long-term strategy than the others.

Cost estimates
Total budget requested: €250,000. Intended starting date: September 1, 2018. Application for additional grants: yes.

Data management
I will collect annotations (labels and similarities) for patches extracted from already available medical images. The annotations will be stored in the widely reusable JSON format.
I will also generate features -numerical data which describes the images, but from which the images cannot be reconstructed. The features will be stored as CSV files to facilitate reuse across different platforms.
The annotations and features (hereafter referred to as relevant data) can be reused in development of other machine learning algorithms.
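To illustrate the intended formats (file names and field names below are examples, not a fixed schema), annotations could be serialized as JSON records and features as CSV rows:

```python
import csv
import json

# Example crowd-annotation records: an airway outline label and a similarity triplet.
annotations = [
    {"patch_id": "ct_0001_p07", "task": "airway_outline", "worker": "anon_12", "label": 1},
    {"patch_id": "ct_0001_p07", "task": "similarity",
     "triplet": ["ct_0001_p07", "ct_0003_p02", "ct_0009_p11"]},
]
with open("annotations.json", "w") as f:
    json.dump(annotations, f, indent=2)

# Example feature rows: one row per patch, columns are feature values.
with open("features.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["patch_id", "f1", "f2", "f3"])
    writer.writerows([("ct_0001_p07", 0.12, 3.40, 0.98),
                      ("ct_0003_p02", 0.07, 2.90, 1.02)])
```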
Where will the data be stored during the research?
All data is stored electronically. The annotations will be collected via the cloud (Amazon Mechanical Turk or Zooniverse) and thus a secure back-up will always be available. The annotations will then be copied to secure network drives available at the host institute. This is also where the generated features will be stored. Daily back-ups are made on this storage facility.
Upon publication of a paper, I will also upload the relevant data to the Figshare repository (see below).
After the project has been completed, how will the data be stored for the long term and made available for use by third parties? To whom will the data be accessible?
I will store the relevant data on Figshare, which will ensure its long-term preservation. Figshare has agreements with publishers such as Nature, PLOS, and others to ensure the data persists for a minimum of 10 years. I will share the relevant data under a Creative Commons Attribution-NonCommercial licence.

Which facilities (ICT, (secure) archive, refrigerators or legal expertise) do you expect will be needed for the storage of data during the research and after the research? Are these available?*
*ICT facilities for data storage are considered to be resources such as data storage capacity, bandwidth for data transport and calculating power for data processing.
I will use in-house computing facilities, which are already available at the host institute.

Referee report of referee 1
Comments: ... that generating these algorithms may be a natural second step but, as it is, the application only promises to "investigate three strategies" that are likely to be dataset-dependent (as the applicant suggests). Overall, I feel the application promises to deliver solid and systematic research that, however, is far from offering new innovative concepts and contributions to the field.

Comments:
The research strategy is well described, and the aims the applicant is presenting are likely to be reached.

Assessment of the knowledge utilisation
Explanation. Criteria: knowledge utilisation (KU). NWO uses a broad definition of KU: not only innovative end-of-pipe product creation is considered, but also purposeful interaction with (potential) knowledge users, such as industry or public organisations, and knowledge transfer to other scientific disciplines. NWO asks applicants to fill out the paragraph on KU. An applicant may however explain that KU cannot be expected given the nature of the research project. In that case, we still kindly ask you to assess whether the applicant has provided reasonable argument or evidence sufficiently.
Potential:
⊳ contribution to society and/or other academic areas;
⊳ disciplines and organisations that might benefit from the results.

Implementation:
⊳ action plan to allow the outcomes of the research project to benefit the potential knowledge users;
⊳ if and how the potential knowledge users will be involved;
⊳ (concrete) outcomes for society;
⊳ the period over which KU is expected to occur.
Comments: Data sets and best practice recommendations, together with related algorithms, will be the promised output: "By analysing these results from three different applications, I will aim to extract general rules on when each strategy is best to use". Whether these general rules exist will only be known ...

Comments: no
Referee report of referee 2
Assessment of the quality of the researcher
Comments: ... is somewhat novel, but it seems that there are other approaches that would need to be combined. Just getting more data is often not enough; getting the right images annotated, i.e. those that are on the decision boundaries, would seem most important. I did not see any strategy for quality control of the crowdsourced annotations, and this seems like the major factor that is important.

Question c: What is your opinion on its potential to make a major contribution to the advancement of scholarship, science or technology (academic impact)?
Comments: There is an opportunity to advance the area of medical image annotation, but to a limited degree with the approaches given, if no quality control is done and if only weak labels are given. MIL is important, and finding a link between the annotation and the best approaches could be very interesting.
Pretty much all medical images have reports associated to them, so ignoring the available weak labels would be a pity.
These can be radiology and pathology reports and may be more effective than getting labels of limited quality.

Question d:
To what extent is the proposed method effective? Please comment.
Comments: It is very hard to judge if the method will work. Some approaches have used crowdsourcing in the past, and they show that with strong quality control this works well. It is not clear how this will be leveraged by the proposed approach.

Assessment of the knowledge utilisation
Explanation. Criteria: knowledge utilisation (KU). NWO uses a broad definition of KU: not only innovative end-of-pipe product creation is considered, but also purposeful interaction with (potential) knowledge users, such as industry or public organisations, and knowledge transfer to other scientific disciplines. NWO asks applicants to fill out the paragraph on KU. An applicant may however explain that KU cannot be expected given the nature of the research project. In that case, we still kindly ask you to assess whether the applicant has provided reasonable argument or evidence sufficiently.
Potential:
⊳ contribution to society and/or other academic areas;
⊳ disciplines and organisations that might benefit from the results.

Implementation:
⊳ action plan to allow the outcomes of the research project to benefit the potential knowledge users;
⊳ if and how the potential knowledge users will be involved;
⊳ (concrete) outcomes for society;
⊳ the period over which KU is expected to occur.
Comments: knowledge utilisation is expected.

Rebuttal
... argue that this is more innovative than developing specific methods for specific applications, which is what is regularly being done at most conferences on the topic.
In contrast to R1, R2 seems to find the proposal too innovative, and suggests it would be better to follow the existing approach: collecting labels from the crowd and comparing them to expert labels. As I discuss in the proposal, this is likely not to be an optimal strategy. My proposed methods, which focus on alternative (not yet investigated) types of annotations, are more promising in this regard. Since they rely on more intuitive characteristics of the images, quality control is also less of an issue than suggested by the reviewer. Of course, I will still perform validation, as described on page 7 of the proposal.
Existing large repositories such as TCIA and TCGA. TCGA is a repository of genomic data, which is not relevant to my proposal. TCIA could be an interesting resource, but does not provide local annotations, which is precisely what is necessary for validation/quality control. As I describe in the proposal, I choose to focus on three applications for which local annotations are available for validation.
Patient reports. Patient reports provide weak labels for images, and are indeed often the basis of the expert weak labels I have available for my datasets.
It is incorrect that I ignore these labels -these are in fact the expert labels my methods will use, in combination with the crowd labels. Processing the patient reports with natural language processing is outside the scope of my research.
Redundancy of expert and crowd labels. The reviewer writes "if expert labels exist than crowd labels do not seem necessary". This is incorrect: the use of expert weak labels alone leads to unstable MIL algorithms, as I have detailed on page 4 ("In practice this means that without strong annotations, MIL algorithms are poor at localizing abnormalities [13]").
Questions on data management. These questions, together with the other comments, suggest that the reviewer has overlooked an entire page of my proposal (page 7), where I discuss the public datasets I will use, and which already have expert labels available for validation.
Overall, given the many positive comments of the reviewers, and the fact that several weak points are not justified, I hope that the committee will consider my proposal for the interview stage. Final score: 4.8.

Qualification: Your research proposal received the qualification "good", based on the application, the reviewer reports and the rebuttal.

Quality of the Candidate
The committee and reviewers agree that the candidate has a clear research focus and an average to good publication record, although high-impact publications are still missing. One reviewer is therefore hesitant to place the candidate in the top 20% of her international cohort, which is agreed upon by the committee. The candidate was however found by the committee to be an ambitious researcher who has spent a significant amount of time on academic services (workshop and conference organisation, reviewing duties, board member) and outreach.

Quality of the Proposal
The reviewers agree that the proposal tackles a very timely and relevant research topic within medical image processing. They also notice that the methodology is logical, though not overly compelling. One reviewer questions the novelty of the envisioned contributions while another reviewer raises the issue of crowdsourcing quality control that should have been included in the proposal.
The committee shares the doubt of the reviewers on these aspects.

Knowledge Utilization
The committee and the reviewers find the knowledge utilization plan convincing.
The plan aligns well with the candidate's prior experiences and targets different audiences with diverse activities. The inclusion of an industry panel is valued by the committee. One reviewer misses more details on issues of intellectual property rights, as well as further details on the setup of the industry panel; the committee agrees that more details should be provided.