Breast Cancer Research


Research article (Open Access, Highly Accessed)

Effect of training-sample size and classification difficulty on the accuracy of genomic predictors

Vlad Popovici1, Weijie Chen2, Brandon D Gallas2, Christos Hatzis3, Weiwei Shi4, Frank W Samuelson2, Yuri Nikolsky4, Marina Tsyganova5, Alex Ishkin5, Tatiana Nikolskaya4,5, Kenneth R Hess6, Vicente Valero7, Daniel Booser7, Mauro Delorenzi1,8, Gabriel N Hortobagyi7, Leming Shi9, W Fraser Symmans10 and Lajos Pusztai7*

Author Affiliations

1 Bioinformatics Core Facility, Swiss Institute of Bioinformatics, Génopode Building, Quartier Sorge, Lausanne CH-1015, Switzerland

2 Center for Devices and Radiological Health, US Food and Drug Administration, 10903 New Hampshire Ave WO62-3124, Silver Spring, MD 20993-0002, USA

3 Nuvera Biosciences, 400 West Cummings Park, Woburn, MA 01801, USA

4 GeneGo, Inc., 500 Renaissance Drive, St. Joseph, MI 49085, USA

5 Department of Systems Biology, Vavilov Institute for General Genetics, Russian Academy of Sciences, Gubkina str. 3 korp. 1, Moscow 119333, Russia

6 Department of Biostatistics, P.O. Box 301439, Houston, TX 77230-1439, USA

7 Department of Breast Medical Oncology, P.O. Box 301439, Houston, TX 77230-1439, USA

8 Swiss NCCR Molecular Oncology, Swiss Institute for Experimental Cancer Research (ISREC), School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne, Lausanne CH-1015, Switzerland

9 National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR 72079, USA

10 Department of Pathology of the University of Texas M. D. Anderson Cancer Center, P.O. Box 301439, Houston, TX 77230-1439, USA


Breast Cancer Research 2010, 12:R5 doi:10.1186/bcr2468

Published: 11 January 2010

Abstract

Introduction

As part of the MicroArray Quality Control (MAQC)-II project, this analysis examines how the choice of univariate feature-selection methods and classification algorithms may influence the performance of genomic predictors under varying degrees of prediction difficulty represented by three clinically relevant endpoints.

Methods

We used gene-expression data from 230 breast cancers (grouped into training and independent validation sets), and we examined 40 predictors (five univariate feature-selection methods combined with eight different classifiers) for each of the three endpoints. Their classification performance was estimated on the training set by using two different resampling methods and compared with the accuracy observed in the independent validation set.
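As a rough illustration of the study design described above, the following sketch enumerates the predictor grid. The abstract gives only the counts (five feature-selection methods, eight classifiers, three endpoints), so the labels below are hypothetical placeholders, not the methods the authors used.

```python
from itertools import product

# Placeholder labels (assumption): the abstract states the counts only,
# not the names of the methods, classifiers, or endpoints.
feature_selectors = [f"feature_selection_{i}" for i in range(1, 6)]  # 5 methods
classifiers = [f"classifier_{i}" for i in range(1, 9)]               # 8 classifiers
endpoints = [f"endpoint_{i}" for i in range(1, 4)]                   # 3 endpoints

# A predictor is one feature-selection method paired with one classifier.
predictors = list(product(feature_selectors, classifiers))

# Each predictor is trained separately for each endpoint, giving one model each.
models = [(e, fs, clf) for e in endpoints for fs, clf in predictors]

print(len(predictors))  # 40 predictors per endpoint
print(len(models))      # 120 models overall
```

The counts match the 40 predictors per endpoint stated above, or 120 models across the three endpoints.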

Results

The three classification problems were ranked by difficulty, and the performance of 120 models (40 predictors for each of the three endpoints) was estimated on the training set and assessed on an independent validation set. The bootstrapping estimates were closer to the validation performance than were the cross-validation estimates. The required sample size for each endpoint was estimated, and both gene-level and pathway-level analyses were performed on the obtained models.
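To make the two resampling estimates concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption for illustration: the abstract does not state the bootstrap variant, the number of folds, or the classifiers, so this uses an out-of-bag bootstrap, 5-fold cross-validation, and a one-feature nearest-centroid classifier, with no feature-selection step.

```python
import random
from statistics import mean

def make_data(n, shift, rng):
    """Synthetic two-class data: one feature, class means 0 and `shift`."""
    return [(rng.gauss((i % 2) * shift, 1.0), i % 2) for i in range(n)]

def fit_centroids(train):
    """Nearest-centroid classifier: store the mean feature value per class."""
    return {c: mean(x for x, y in train if y == c) for c in (0, 1)}

def accuracy(centroids, test):
    """Fraction of test samples assigned to the class with the nearest centroid."""
    hits = sum(
        1 for x, y in test
        if min(centroids, key=lambda c: abs(x - centroids[c])) == y
    )
    return hits / len(test)

def cross_validation(data, k, rng):
    """k-fold cross-validation: each sample is held out exactly once."""
    idx = list(range(len(data)))
    rng.shuffle(idx)
    scores = []
    for fold in (idx[i::k] for i in range(k)):
        held = set(fold)
        train = [data[i] for i in idx if i not in held]
        test = [data[i] for i in fold]
        scores.append(accuracy(fit_centroids(train), test))
    return mean(scores)

def bootstrap_oob(data, rounds, rng):
    """Out-of-bag bootstrap: train on a resample drawn with replacement,
    test on the samples that were not drawn."""
    n = len(data)
    scores = []
    for _ in range(rounds):
        drawn = [rng.randrange(n) for _ in range(n)]
        chosen = set(drawn)
        train = [data[i] for i in drawn]
        test = [data[i] for i in range(n) if i not in chosen]
        # Skip degenerate resamples (no held-out samples or a missing class).
        if not test or len({y for _, y in train}) < 2:
            continue
        scores.append(accuracy(fit_centroids(train), test))
    return mean(scores)

rng = random.Random(0)
data = make_data(60, 1.5, rng)
cv_estimate = cross_validation(data, 5, rng)
boot_estimate = bootstrap_oob(data, 200, rng)
print(f"5-fold CV accuracy estimate:   {cv_estimate:.3f}")
print(f"OOB bootstrap accuracy estimate: {boot_estimate:.3f}")
```

In a genomic setting, any feature selection would need to be repeated inside each resampling round so that held-out samples never influence which features are chosen. This toy example cannot show which estimate lies closer to independent validation performance; it only shows how the two estimates are computed.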

Conclusions

We showed that genomic predictor accuracy is determined largely by an interplay between sample size and classification difficulty. Variations in univariate feature-selection methods and the choice of classification algorithm have only a modest impact on predictor performance, and several predictors with statistically equivalent performance can be developed for any given classification problem.