Using metafeature clustering to mine tissue-specific signals from rare variants in the cancer genome
Identifying the primary site of origin for cancers of unknown primary is an important clinical problem. Somatic mutation analysis for primary site diagnosis traditionally focuses on a small number of frequently occurring mutations, ignoring the vast number of rare mutations that may contain clinically relevant signals. Previous research (Chakraborty et al., Nature Communications 10:1-9, 2019) proposed using a Bayesian nonparametric method, originally developed in computational linguistics, to extract tissue-specific signals from the preponderance of rare genetic variation in the cancer genome. That analysis elucidated gene-specific and tissue-specific variation in rare mutations. However, it considered only approximately 300 genes that are frequently mutated in cancer, omitting tens of thousands of more sparsely mutated genes. Here, we propose a framework that extends this analysis to all genes in the cancer genome by clustering genes according to metafeatures, variables that are known correlates of somatic mutation frequency in human cancer. Our results show that the clustering method did not produce an appreciable boost in tissue-specific signal relative to null (random) gene groupings, highlighting the need to include more tissue-specific metafeatures in the clustering framework.
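The comparison between metafeature-based gene clusters and null (random) gene groupings can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name a clustering algorithm, so plain k-means is used here as a stand-in. The three metafeature columns, the gene names, and the helpers `standardize`, `kmeans`, and `size_matched_null` are all hypothetical, and the data are synthetic.

```python
import random
from collections import Counter


def standardize(rows):
    """Z-score each metafeature column so no single scale dominates distances."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 when a column has zero variance.
    sds = [(sum((x - m) ** 2 for x in c) / len(c)) ** 0.5 or 1.0
           for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, sds)] for r in rows]


def kmeans(rows, k, iters=50, seed=0):
    """Plain k-means (a stand-in clustering method); returns one label per gene."""
    rng = random.Random(seed)
    centers = [list(r) for r in rng.sample(rows, k)]
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(r, centers[j])))
            for r in rows
        ]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels


def size_matched_null(labels, seed=0):
    """Null grouping: shuffle labels so group sizes match the real clustering."""
    rng = random.Random(seed)
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled


# Synthetic stand-in data: 200 hypothetical genes, 3 hypothetical metafeatures.
data_rng = random.Random(1)
genes = [f"gene_{i}" for i in range(200)]
metafeatures = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in genes]

labels = kmeans(standardize(metafeatures), k=5)
null_labels = size_matched_null(labels, seed=2)
```

Under these assumptions, the same tissue-specificity score would be computed once for `labels` and once for each of many null groupings drawn with different seeds. Matching group sizes in the null keeps any difference in score attributable to which genes are grouped together rather than to how large the groups are.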