KDD CUP 2008

Hints and Machine Learning ideas that may be useful for the challenge

 

The obvious approach is to build a classifier that labels each candidate independently. Below are a few ideas that participants may want to consider to improve on that baseline.

 

1. Leverage two views of the same breast: A cancerous lesion is almost always visible in both views (MLO, CC) of the breast, and radiologists routinely correlate the two views when diagnosing a patient. In rare cases a lesion may be visible in only one view, especially in certain areas of the breast. Negative candidates, by contrast, may appear in only one view (e.g., image artifacts) or in both views (e.g., when caused by a benign cyst).

 

Unfortunately, since each view is a 2D image obtained from an orthogonal direction, the X-ray images cannot be perfectly registered (i.e., locations cannot be correlated across views) with simple algorithms such as affine transformations. However, some properties of a lesion are typically preserved across the two views: in particular its distance from the nipple, and perhaps some of the features relating to its size, texture, etc. The first idea that may be useful for this challenge is therefore to develop algorithms that simultaneously classify candidates from a pair of images of the same breast. Such algorithms could exploit correlations between the classification decisions for the same region of a breast. To support this, the training and testing data sets will include features that identify the (x,y) location of the nipple as well as the (x,y) location of each candidate.
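
As a rough illustration of the nipple-distance idea, the sketch below pairs candidates across the two views of one breast by comparing their distances to the nipple. The `Candidate` fields, the `score` input (the output of some independently trained per-candidate classifier) and the `tolerance` value are all hypothetical names chosen for this example; they are not part of the challenge data format.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    view: str        # "MLO" or "CC" (hypothetical encoding)
    x: float         # candidate location
    y: float
    nipple_x: float  # nipple location in the same image
    nipple_y: float
    score: float     # assumed output of an independent per-candidate classifier


def nipple_distance(c):
    """Euclidean distance from the candidate to the nipple in its own image."""
    return math.hypot(c.x - c.nipple_x, c.y - c.nipple_y)


def cross_view_support(candidates, tolerance):
    """For each candidate of one breast, return the highest score among
    candidates in the OTHER view whose nipple distance differs by at most
    `tolerance`; 0.0 when no such candidate exists."""
    support = []
    for c in candidates:
        d = nipple_distance(c)
        matches = [
            o.score
            for o in candidates
            if o.view != c.view and abs(nipple_distance(o) - d) <= tolerance
        ]
        support.append(max(matches, default=0.0))
    return support


# Toy data: one MLO candidate and two CC candidates of the same breast.
mlo = Candidate("MLO", 30.0, 40.0, 0.0, 0.0, 0.8)      # nipple distance 50
cc_near = Candidate("CC", 48.0, 14.0, 0.0, 0.0, 0.6)   # nipple distance 50
cc_far = Candidate("CC", 6.0, 8.0, 0.0, 0.0, 0.9)      # nipple distance 10
print(cross_view_support([mlo, cc_near, cc_far], tolerance=5.0))
```

The resulting support value could be appended to each candidate's feature vector before training a second-stage classifier. Whether that helps on the challenge data is an open question, and the tolerance would need tuning.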

 

2. Class Imbalance: Participants can draw on ideas from classifier design under extreme class imbalance (the vast majority of the regions are normal, and only a small fraction are actually malignant) and from feature selection (a large number of features are proposed, and several of them may not be very useful for the task). Note that the prevalence rate (malignant patients as a fraction of all patients) may differ between the training and testing sets.
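
One common starting point under class imbalance is to weight each training example inversely to the frequency of its class. The sketch below shows that heuristic only; the function name and the 98/2 toy split are assumptions made for illustration, and the formula is the same "balanced" rule used by several libraries, not something prescribed by the challenge.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight for each class: n_samples / (n_classes * class_count).
    Rare classes receive proportionally larger weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy data: 98 normal regions (label 0) and 2 malignant regions (label 1).
labels = [0] * 98 + [1] * 2
weights = balanced_class_weights(labels)
sample_weights = [weights[y] for y in labels]  # one weight per training example
print(weights[1])  # 25.0
```

With these weights each class contributes the same total weight to the training loss. Because the prevalence may differ between training and testing sets, weights estimated on the training set should be treated as a starting point rather than a calibrated correction.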

 

3. Exploit correlations within an image: Participants may develop novel algorithms that exploit potential correlations between the diagnoses of suspicious regions within a single image (e.g., regions that are spatially adjacent).
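
A simple way to expose spatial adjacency to a classifier is to add neighborhood features to each candidate. The sketch below is one assumed formulation: for each candidate in a single image it counts the other candidates within a given radius and records their highest score. The names `neighbor_features`, `scores` and `radius` are hypothetical, and the scores are assumed to come from a first-stage classifier that labels candidates independently.

```python
import math


def neighbor_features(points, scores, radius):
    """points: (x, y) locations of the candidates in ONE image.
    scores: first-stage score of each candidate, in the same order.
    Returns, per candidate, (number of neighbors within `radius`,
    highest neighbor score or 0.0 if there are no neighbors)."""
    features = []
    for i, (xi, yi) in enumerate(points):
        neighbor_scores = [
            scores[j]
            for j, (xj, yj) in enumerate(points)
            if j != i and math.hypot(xi - xj, yi - yj) <= radius
        ]
        features.append((len(neighbor_scores), max(neighbor_scores, default=0.0)))
    return features


# Toy data: two adjacent candidates and one isolated candidate.
points = [(0.0, 0.0), (3.0, 4.0), (100.0, 100.0)]
scores = [0.2, 0.9, 0.5]
print(neighbor_features(points, scores, radius=10.0))
# [(1, 0.9), (1, 0.2), (0, 0.0)]
```

The quadratic loop is adequate for the handful of candidates typically found in one image; a spatial index would only matter for much larger candidate sets.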

 

4. Optimize AUC only in narrow FP range: It may be useful to develop training algorithms that maximize the area under the ROC curve (AUC) in a clinically relevant false positive (FP) range, a problem that has not been adequately addressed in the current machine learning/data-mining literature.
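
To make the target quantity concrete, the sketch below computes the area under an empirical ROC curve restricted to a false positive rate interval. It is an evaluation routine only, useful for example for model selection, and not a training algorithm that optimizes this area. The function names and the interval bounds are assumptions for illustration, the x-axis here is the standard false positive rate, and the range actually used for scoring is defined in the challenge description.

```python
def roc_points(labels, scores):
    """Empirical ROC curve as (fpr, tpr) points, starting at (0, 0).
    Tied scores are grouped into a single point.
    Assumes at least one positive (1) and one negative (0) label."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    order = sorted(range(len(labels)), key=lambda i: -scores[i])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            if labels[order[j]] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def partial_auc(labels, scores, fpr_lo, fpr_hi):
    """Trapezoidal area under the ROC curve for fpr in [fpr_lo, fpr_hi]."""
    pts = roc_points(labels, scores)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        a, b = max(x0, fpr_lo), min(x1, fpr_hi)
        if b <= a:
            continue  # vertical segment, or segment outside the range
        slope = (y1 - y0) / (x1 - x0)
        ya = y0 + slope * (a - x0)  # curve height at the clipped endpoints
        yb = y0 + slope * (b - x0)
        area += 0.5 * (ya + yb) * (b - a)
    return area


# Toy data: four candidates, two malignant (1) and two normal (0).
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(partial_auc(labels, scores, 0.0, 1.0))    # 0.75 (the full AUC)
print(partial_auc(labels, scores, 0.25, 0.75))  # 0.375
```

The area is left unnormalized, so a perfect ranking scores `fpr_hi - fpr_lo` rather than 1; dividing by the interval width gives a value in [0, 1] if that is more convenient.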