Breast Cancer Research

Open Access research article

Gene expression signatures of morphologically normal breast tissue identify basal-like tumors

Greg Finak (1,2,3), Svetlana Sadekova (1,4), Francois Pepin (1,2,3), Michael Hallett (3,5), Sarkis Meterissian (6,7), Fawaz Halwani (8), Karim Khetani (9), Margarita Souleimanova (4), Brent Zabolotny (10), Atilla Omeroglu (9) and Morag Park (1,11,2,4,7)*

Author Affiliations

1 Molecular Oncology Group, McGill University Health Centre, 687 Pine Avenue West, H3A 1A1, Quebec, Canada

2 Department of Biochemistry, McGill University, 3655 Promenade Sir William Osler, H3G 1Y6, Montreal, Quebec, Canada

3 McGill Centre for Bioinformatics, McGill University, 3775 University Street, H3A 2B4, Montreal, Quebec, Canada

4 Breast Cancer Functional Genomics Group, McGill University, 3775 University Street, H3A 2B4, Montreal, Quebec, Canada

5 School of Computer Science, McGill University, 3480 University Street, H3A 2A7, Montreal, Quebec, Canada

6 Department of Surgery, McGill University, Montreal, 687 Pine Avenue West, H3A 1A1, Quebec, Canada

7 School of Medicine, McGill University, Montreal, 687 Pine Avenue West, H3A 1A1, Quebec, Canada

8 Department of Anatomical Pathology, Sunnybrook Health Sciences Center, 2075 Bayview Avenue, M4N 3M5, Ontario, Canada

9 School of Pathology, McGill University, 3775 University Street, H3A 2B4, Montreal, Quebec, Canada

10 Department of Surgery, Grace General Hospital, 300 Booth Drive, R3J 3M7, Winnipeg, Manitoba, Canada

11 Department of Oncology, McGill University, 546 Pine Ave. W, H2W 1S6, Montreal, Quebec, Canada

Breast Cancer Research 2006, 8:R58 doi:10.1186/bcr1608

Published: 20 October 2006

Additional files

Additional file 4:

A table listing complete clinical characteristics of patients in this study.

Format: PDF, Size: 28KB

Additional file 8:

A complete list of tissue-specific expression markers identified in this study.

Format: XLS, Size: 536KB

Additional file 9:

A complete list of GO categories overrepresented by the normal epithelium and normal stroma gene signatures.

Format: XLS, Size: 227KB

Additional file 3:

A table listing tissue-specific predictors of clinical characteristics based on gene expression in adjacent epithelium. The poor quality of the predictors is evident from their error rates, given in the first column of the table. The error rate is the fraction of samples that the predictor misclassifies under cross-validation. Predictors were trained on gene sets obtained by class distinction with SAM or LIMMA. For some combinations of clinical characteristic and class distinction algorithm, no genes passed the filtering criteria and no predictor could be trained; those rows are omitted from the table. The gene set size is the initial size of the candidate gene set from which a predictor is built; this set is itself selected under cross-validation. The training error is the misclassification rate for samples included in the training set. The cross-validation error rate reported by the PAM algorithm [30] does not account for the selection of the candidate gene set under cross-validation. The predictor size is the number of genes in the predictor.
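The distinction drawn here, between an error rate that includes candidate gene selection inside cross-validation and one that does not, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: gene ranking by difference in class means stands in for SAM or LIMMA, a plain nearest-centroid rule stands in for PAM, and the function names, fold assignment and simulated data are all assumptions made for illustration. Both classes are assumed to be present in every training split.

```python
import random
from statistics import mean


def select_genes(X, y, k):
    """Return indices of the k genes with the largest absolute
    difference in class means (a stand-in for SAM/LIMMA ranking)."""
    def score(g):
        a = [row[g] for row, lab in zip(X, y) if lab == 0]
        b = [row[g] for row, lab in zip(X, y) if lab == 1]
        return abs(mean(a) - mean(b))
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]


def nearest_centroid_predict(X_train, y_train, genes, sample):
    """Assign the sample to the class whose centroid (over the chosen
    genes) is closest in squared Euclidean distance."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [r for r, lab in zip(X_train, y_train) if lab == label]
        centroid = [mean([r[g] for r in rows]) for g in genes]
        dist = sum((sample[g] - c) ** 2 for g, c in zip(genes, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def cv_error(X, y, k, n_folds=5, select_inside=True):
    """Cross-validated misclassification rate.

    select_inside=True  : genes are re-selected within each training split.
    select_inside=False : genes are selected once on all samples, so the
                          held-out samples influence the gene set.
    """
    genes_all = None if select_inside else select_genes(X, y, k)
    errors = 0
    for fold in range(n_folds):
        train = [i for i in range(len(X)) if i % n_folds != fold]
        test = [i for i in range(len(X)) if i % n_folds == fold]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        genes = select_genes(X_tr, y_tr, k) if select_inside else genes_all
        for i in test:
            if nearest_centroid_predict(X_tr, y_tr, genes, X[i]) != y[i]:
                errors += 1
    return errors / len(X)


if __name__ == "__main__":
    # Simulated data with no real class signal: 20 samples, 200 noise genes.
    rng = random.Random(1)
    y = [0, 1] * 10
    X = [[rng.gauss(0, 1) for _ in range(200)] for _ in y]
    print("selection outside CV:", cv_error(X, y, k=10, select_inside=False))
    print("selection inside CV: ", cv_error(X, y, k=10, select_inside=True))
```

On pure-noise data such as this, selecting genes outside the cross-validation loop tends to give an optimistically low error, which is why an estimate that ignores the selection step should be read with caution.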

Format: PDF, Size: 23KB

Additional file 10:

A list of genes differentially expressed between cellular and pauci-cellular fibrotic stroma clusters.

Format: XLS, Size: 137KB

Additional file 11:

A list of GO terms overrepresented among genes differentially expressed between cellular and pauci-cellular fibrotic stroma clusters.

Format: XLS, Size: 57KB

Additional file 2:

A table listing tissue-specific predictors of clinical characteristics based on gene expression in adjacent stroma. The poor quality of the predictors is evident from their error rates, given in the first column of the table. The error rate is the fraction of samples that the predictor misclassifies under cross-validation. Predictors were trained on gene sets obtained by class distinction with SAM or LIMMA. For some combinations of clinical characteristic and class distinction algorithm, no genes passed the filtering criteria and no predictor could be trained; those rows are omitted from the table. The gene set size is the initial size of the candidate gene set from which a predictor is built; this set is itself selected under cross-validation. The training error is the misclassification rate for samples included in the training set. The cross-validation error rate reported by the PAM algorithm [30] does not account for the selection of the candidate gene set under cross-validation. The predictor size is the number of genes in the predictor.

Format: PDF, Size: 22KB

Additional file 12:

A figure showing principal component analysis of matched adjacent normal tissues. (a) Scree plot showing the percentage of data variation explained by the first 10 principal components of the patient-matched adjacent normal tissue. The common reference design accounts for 84.58% of the variation in gene expression observed in the data (Additional file 13), while principal components 2 and 3 are explained by variation in gene expression associated with tissue type, and components 4 through 8 are explained by variation in gene expression between individuals. (b) Scatter plot of principal component 2 against principal component 3. These two dimensions suffice to summarize the between-tissue variation observed in the data, as demonstrated by the clustering of epithelial samples on the right of the plot (red) and stromal samples on the left (black). Analogously, five dimensions suffice to explain the variation between individuals. No other clinical characteristics were significantly associated with any principal component.

Format: PDF, Size: 44KB

Additional file 13:

A figure showing the effect of the common reference design in principal component analysis. Data that exhibit no variation in gene expression correspond to an expression matrix in which each gene on each array has exactly the same expression level. A slightly more realistic case is one in which each gene has a different expression level, but the expression is just random noise (left panel). Each principal component then explains a similar, small amount of the total variation in the data. At the other extreme from the random noise example is perfectly correlated data with no noise, as might be imagined from ideal replicate arrays (middle panel). The variability in the data arises from each gene having a different level of expression; however, that expression is identical across arrays. Only one principal component is necessary to capture all of the variation in the data. The third and most realistic case consists of correlated data with random noise. This closely resembles what is observed in the normal tissue dataset with a common reference design. The arrays are highly correlated, so the first principal component explains the majority of the observed variation, and the rest is distributed among the remaining components.
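The three cases described for this figure can be reproduced on simulated data. The following Python sketch is an illustration only, not the analysis behind the figure: it treats arrays as variables and genes as observations, estimates the largest eigenvalue of the array covariance matrix by power iteration, and reports the share of total variance it captures. The function name, the array and gene counts, and the noise level are assumptions, and the input is assumed to contain some variation.

```python
import random


def first_pc_fraction(arrays, n_iter=200):
    """Fraction of total variance captured by the first principal component.

    `arrays` is a list of arrays, each a list of per-gene expression values.
    Arrays are treated as variables and genes as observations.
    """
    n_arrays, n_genes = len(arrays), len(arrays[0])
    # Centre each array on its own mean across genes.
    centered = []
    for a in arrays:
        m = sum(a) / n_genes
        centered.append([v - m for v in a])
    # Sample covariance matrix between arrays.
    cov = [[sum(x * y for x, y in zip(ci, cj)) / (n_genes - 1)
            for cj in centered] for ci in centered]
    total_variance = sum(cov[i][i] for i in range(n_arrays))

    def matvec(v):
        return [sum(cov[i][j] * v[j] for j in range(n_arrays))
                for i in range(n_arrays)]

    # Power iteration for the leading eigenvector.
    v = [1.0] * n_arrays
    for _ in range(n_iter):
        w = matvec(v)
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient gives the leading eigenvalue estimate.
    top_eigenvalue = sum(vi * wi for vi, wi in zip(v, matvec(v)))
    return top_eigenvalue / total_variance


if __name__ == "__main__":
    rng = random.Random(0)
    n_arrays, n_genes = 6, 500
    profile = [rng.gauss(0, 1) for _ in range(n_genes)]  # shared gene levels

    noise_only = [[rng.gauss(0, 1) for _ in range(n_genes)]
                  for _ in range(n_arrays)]
    ideal_replicates = [list(profile) for _ in range(n_arrays)]
    correlated_noisy = [[g + rng.gauss(0, 0.3) for g in profile]
                        for _ in range(n_arrays)]

    for name, data in [("random noise", noise_only),
                       ("ideal replicates", ideal_replicates),
                       ("correlated + noise", correlated_noisy)]:
        print(name, round(first_pc_fraction(data), 3))
```

With independent noise the first component takes only a small share of the variance, with identical arrays it takes essentially all of it, and with a shared profile plus noise it takes most but not all, which is the pattern the common reference design produces.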

Format: PDF, Size: 36KB

Additional file 6:

A figure showing heatmaps of normal tissue expression profiles clustered using published gene signatures. (a) SFT signature, (b) DTF signature [36], (c) activated CSR signature, (d) inactive CSR signature [44].

Format: PDF, Size: 185KB

Additional file 7:

A schematic outlining the gene set comparisons and filtering operations performed using the normal tissue signature and gene sets from published expression profiles. Circles denote gene sets, labeled with their name and size. Numbers in brackets denote the size of a gene set after filtering for high-variance genes (Var > 1) in normal tissue; 7.36% of genes in the normal dataset have variance greater than 1. Intersections between gene sets are labeled with p values denoting the significance of the overlap (hypergeometric test), and filtered gene sets are labeled with p values denoting the significance of the overrepresentation of high-variance genes (χ² goodness-of-fit test). The data were derived from the following sources: SFT/DTF (Additional file 6a,b) [36]; SAGE [33]; CSR (Additional file 6c,d) [44].
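As an illustration of the hypergeometric test named here, the following Python sketch computes the probability of seeing at least the observed overlap between two gene sets drawn from a common universe of genes. The function name and the numbers in the usage example are hypothetical and are not taken from the study; `math.comb` requires Python 3.8 or later.

```python
from math import comb


def hypergeom_overlap_pvalue(universe, size_a, size_b, overlap):
    """Upper-tail hypergeometric p value for a gene set overlap.

    universe : number of genes that could appear in either set
    size_a   : size of the first gene set
    size_b   : size of the second gene set
    overlap  : number of genes observed in both sets

    Returns P(X >= overlap), where X is the overlap expected if the
    second set were drawn at random from the universe.
    """
    max_overlap = min(size_a, size_b)
    # Number of ways to draw size_b genes with exactly k from set A,
    # summed over k from the observed overlap upward.
    tail = sum(comb(size_a, k) * comb(universe - size_a, size_b - k)
               for k in range(overlap, max_overlap + 1))
    return tail / comb(universe, size_b)


if __name__ == "__main__":
    # Hypothetical numbers: 10,000 genes on the array, sets of 300 and
    # 150 genes, 20 genes in common.
    print(hypergeom_overlap_pvalue(10_000, 300, 150, 20))
```

A small p value indicates that the two sets share more genes than would be expected by chance given their sizes and the size of the universe.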

Format: PDF, Size: 175KB

Additional file 1:

A table listing p values for tests of association between clinical variables and top-level clusters (red boxes, Figure 6) induced by clustering various subsets of the data. Only normal adjacent stroma shows top-level clusters with significant p values by the bootstrap. None of the clinical variables were found to be correlated with either top-level clusters or statistically significant subclusters (data not shown).

Format: PDF, Size: 23KB

Additional file 5:

A figure showing hematoxylin and eosin staining of (a) a breast reduction specimen and (b) a histologically normal specimen from an invasive breast carcinoma patient.

Format: PDF, Size: 2.4MB
