Quantifying Taste: Testing for Dependence in Choice
Author: Eric Feder
Institution: Columbia University
Instructor: Tian Zheng
Abstract: How can taste be quantified? We aimed to develop a way to test if, when people choose n out of k objects (whether they be photographs, apps, or anything else), their choices are simply distributed in proportion with the objects’ overall popularity or if dependence can be observed. In addition, if we detected the presence of taste, we wanted to be able to cluster the objects. Our database, the SUSHI Preference Data Set, contained the results of a survey in which people were asked to rank their favorite rolls. To see if individual taste had an impact on the data, we ran a simulation to see what results would have been expected if all choices were made independently. Simply sampling without replacement did not work since this method yielded overall frequencies which were not the same as the actual frequencies. We then found an alternative method in which we could draw samples without replacement which were still proportional to size. Using this method, we found that the actual data was quite different from the expected data, implying that taste had played a role. Furthermore, using the actual and expected matrices, we generated measurements of similarity for all pairings. We took these measurements and ran multidimensional scaling in an attempt to make groupings among the rolls. We found our groupings to be both internally and externally valid and we therefore feel confident applying this general approach to other data sets.
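The sampling problem this abstract describes can be illustrated with a minimal sketch. The abstract does not name the alternative method the author used, so the systematic probability-proportional-to-size scheme below is an assumption standing in for it, and the function names and the weights in the usage note are hypothetical. The sketch contrasts naive sequential weighted draws, whose inclusion frequencies drift away from the popularity proportions, with a scheme whose inclusion probabilities equal n times each item's popularity share.

```python
import random


def naive_weighted_sample(weights, n, rng):
    """Draw n distinct items one at a time, each draw proportional
    to the weights of the items that are still available."""
    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(n):
        pick = rng.choices(remaining, weights=[weights[i] for i in remaining])[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


def systematic_pps_sample(weights, n, rng):
    """Systematic sampling with inclusion probability n * w_i / sum(w).

    Assumed stand-in for the abstract's unnamed method. Requires
    n * w_i / sum(w) <= 1 for every item.
    """
    total = sum(weights)
    incl = [n * w / total for w in weights]
    if max(incl) > 1:
        raise ValueError("an item would need inclusion probability above 1")
    order = list(range(len(weights)))
    rng.shuffle(order)           # random order varies which items co-occur
    next_point = rng.random()    # selection points are u, u + 1, ..., u + n - 1
    cumulative = 0.0
    chosen = []
    for i in order:
        cumulative += incl[i]
        # Each item's interval has length <= 1, so it holds at most one point.
        if next_point < cumulative:
            chosen.append(i)
            next_point += 1.0
    return chosen


def inclusion_frequencies(sampler, weights, n, trials, seed=0):
    """Estimate how often each item appears in a sample of size n."""
    rng = random.Random(seed)
    counts = [0] * len(weights)
    for _ in range(trials):
        for i in sampler(weights, n, rng):
            counts[i] += 1
    return [c / trials for c in counts]
```

With hypothetical popularity weights [4, 3, 2, 1] and n = 2, the target inclusion probabilities are 0.8, 0.6, 0.4, and 0.2. Calling inclusion_frequencies with each sampler shows the systematic scheme tracking those targets while the naive scheme under-selects the most popular item.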
Identification of Clonal Variations Present in Tumor Through Clustering
Author: Sayar Karmakar
Institution: Indian Statistical Institute, Kolkata, India
Instructors: Anil Kumar Ghosh, Analabha Basu
Abstract: Identification of driver mutations in cancer cells is a challenging endeavour. We hypothesize that these driver mutations arise early in the original cancer cells, providing them a selective advantage to form distinct clones. Next-generation sequencing data provides an opportunity to construct the allele frequency spectrum for all somatic mutations in a particular tumour. Partitioning the allele frequency spectrum into distinct clusters can hence provide an idea of the number of clones present in the tumour cell mass. Here we work with tumour sequence data generated on the 454 platform (Roche Sequencing) and apply a clustering algorithm. Initial clusters were obtained using a linkage method. To decide on the number of clusters, we compare AIC, BIC, and the Gap Statistic. Having obtained the initial clusters, we propose a methodology to find the number of clones as well as the clonal proportions using the EM algorithm. The same algorithm was applied to normal blood data and tumour data in order to draw conclusions. In addition, for each type of data, somatic cell versus germline cell and new-position mutation versus insertion-deletion have been compared. Comparison between the number of initial clusters, the clones, and their proportions gave rise to some biologically significant inferences.
Applying Benford’s Law of Leading Digits to Large, Natural Data Sets
Authors: Allison Lewis, Victoria Cuff
Institution: University of Portland, Clemson University
Instructors: Steven J. Miller (Williams College), Meike Niederhausen (University of Portland)
Abstract: Benford’s Law of Leading Digits has contributed to the analysis of a variety of real-life data sets, including financial reports, election results, and scientific data. In this study, we apply Benford tests to two natural data sets: hydrology statistics from the U.S. Geological Survey and climate data used to support the theory of global warming in a paper published by two of the researchers accused of data distortion in the 2009 Climategate scandal. We analyze each data set for conformity to Benford’s Law, as well as other laws for distributions of digits, taking into account possible instances of rounding discrepancies, errors in data collection methods, and most importantly potential fraud. It is still an open question as to exactly which data sets should be governed by these laws; while frequently it suffices for the data set to be large, have a sufficient number of significant digits, and span multiple orders of magnitude, it is still possible for the data to fail to be Benford without any nefarious activity despite having met these conditions. We discuss the discrepancies and possible explanations, using the results of a Benford analysis to draw conclusions about the integrity of the data sets studied. The primary goal of this study is to get a sense of when Benford’s Law should hold for natural data sets. As such, discrepancies from Benford’s Law need not indicate fraud or nefarious activity, and it is not our intent to accuse anyone of such behavior; our goal is to see whether or not certain data sets follow Benford’s Law, and comment on the results.
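A Benford test of the kind this abstract refers to can be sketched as follows. Benford's Law gives the first-digit probabilities P(d) = log10(1 + 1/d) for d = 1, ..., 9. The choice of Pearson's chi-square statistic and the function names are assumptions for illustration; the abstract does not say which conformity test the authors used.

```python
import math
from collections import Counter

# Benford first-digit probabilities: P(d) = log10(1 + 1/d), d = 1..9.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(x):
    """Return the first significant digit of a nonzero number."""
    if x == 0:
        raise ValueError("zero has no leading digit")
    # Scientific notation always starts with the first significant digit.
    return int(f"{abs(x):.15e}"[0])


def benford_chi_square(values):
    """Pearson chi-square statistic of observed first digits against Benford.

    Zeros are skipped. Under the null hypothesis the statistic is compared
    with a chi-square distribution on 8 degrees of freedom.
    """
    digits = [leading_digit(v) for v in values if v != 0]
    n = len(digits)
    if n == 0:
        raise ValueError("no nonzero values to test")
    counts = Counter(digits)
    return sum((counts[d] - n * p) ** 2 / (n * p) for d, p in BENFORD.items())
```

At the 5% level the critical value for 8 degrees of freedom is about 15.507. As the abstract stresses, a statistic above that value signals a departure from Benford's Law, not evidence of fraud in itself.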
Evaluating Methods for the Analysis of Rare Variants in Sequence Data
Authors: Scott Powers, Alexander Luedtke
Institution: University of North Carolina at Chapel Hill, Brown University
Instructors: Nathan Tintle, Airat Bekmetjev (Hope College)
Abstract: A number of rare variant statistical methods have been proposed for analysis of the impending wave of next-generation sequencing data. To date, there are few direct comparisons of these methods on real sequence data. Furthermore, there is a strong need for practical advice on the proper analytic strategies for rare variant analysis. In this paper, we compare four recently proposed rare variant methods on simulated phenotype and next-generation sequencing data as part of Genetic Analysis Workshop 17. Overall, we find that all analyzed methods have serious practical limitations in identifying causal genes. Specifically, no method has more than a 5% true discovery rate (the percentage of genes identified as significantly associated with the phenotype that are truly causal). Further exploration shows that all methods suffer from inflated false positive error rates (the chance of a non-causal gene being identified as associated with the phenotype) due to population stratification and gametic phase disequilibrium between non-causal SNPs and causal SNPs. Furthermore, the observed true positive rates (the chance that a truly causal gene is identified as significantly associated with the phenotype) for each of the four methods were very low (less than 19%). The combination of larger than anticipated false positive rates, low true positive rates, and only ~1% of all genes being causal yields poor discriminatory ability for all four methods. This paper identifies gametic phase disequilibrium and population stratification as important areas of further research in the analysis of rare variant data.
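The way this abstract's three quantities combine can be made concrete with a small sketch. By Bayes' rule, the true discovery rate equals the share of flagged genes that are causal, which depends on the causal fraction, the true positive rate, and the false positive rate. The function name is hypothetical, and the false positive rate in the usage note is an illustrative assumption, since the abstract does not report its value.

```python
def true_discovery_rate(prevalence, tpr, fpr):
    """Fraction of flagged genes that are truly causal.

    prevalence: fraction of all genes that are causal
    tpr: chance that a causal gene is flagged as associated
    fpr: chance that a non-causal gene is flagged as associated
    """
    true_hits = prevalence * tpr          # causal genes that get flagged
    false_hits = (1 - prevalence) * fpr   # non-causal genes that get flagged
    if true_hits + false_hits == 0:
        raise ValueError("no genes are flagged under these rates")
    return true_hits / (true_hits + false_hits)
```

For instance, true_discovery_rate(0.01, 0.19, 0.05) uses the abstract's ~1% causal fraction and 19% upper bound on the true positive rate together with an assumed 5% false positive rate, and gives roughly 3.7%. Even a nominal false positive rate swamps the true hits when causal genes are this rare.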
A Novel Approach for Analyzing Kinetic Data from Variants of a Calcium-Binding Protein
Author: Suzanne Rohrback
Institution: Kenyon College
Instructor: Brad Hartlaub
Abstract: The sarcoplasmic calcium-binding protein (SCP) is expressed highly in invertebrate muscle tissue, and is believed to be involved in muscle relaxation. Three variants of SCP have been identified in the freshwater crayfish, Procambarus clarkii, but, thus far, no differences have been found between them. The purpose of this study has been to determine if there is a difference in the calcium-binding kinetics of these protein variants. Formal statistical tests (Mack-Skillings and GLM), non-linear curve fitting (log-logistic model), and multivariate analyses (PCA, KNN, and SIMCA) were applied to kinetic data collected for these proteins by measuring tryptophan fluorescence responses to exposure to different concentrations of calcium. All methods and models indicate a distinction between these SCP variants. Multivariate analyses provided the most reliable comparative analysis of the data, and indicate a novel method for this type of kinetic analysis.
OPPERA Study Further Elucidates the Relationship Between Heat Pain Sensitivity and Temporal Summation of Heat Pain
Author: Rebecca Rothwell
Institution: University of North Carolina at Chapel Hill
Instructor: Eric Bair
Abstract: The purpose of the Orofacial Pain: Prospective Evaluation and Risk Assessment (OPPERA) Study is to establish the causal determinants of temporomandibular disorder (TMD) pain. Using a prospective cohort design, the study was based on a model with enhanced pain sensitivity as an immediate phenotypic risk factor for TMD development. The results discussed here are from TMD-free controls in addition to TMD cases. We developed their baseline assessment of pain sensitivity from various forms of quantitative sensory testing (QST), including thermal pain sensitivity measures. There are two risk factors for TMD in our study: overall sensitivity to pain and temporal summation. We used existing methods for analyzing thermal data in association with case status, including first pulse ratings, area under the curve, delta ratings, maximum ratings, and regression slopes. In addition, maximum ratings and aftersensation ratings were analyzed as predictors of case status. Our analysis indicated that an optimal predictor of case status combines measures of general sensitivity to pain (such as first pulse) and measures of temporal summation (such as delta). We also show that the relationship between general sensitivity and temporal summation is more complicated than previously believed. These results have the potential to help more accurately prevent and treat TMD and other chronic pain conditions.
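The summary measures named in this abstract can be sketched for a single series of pain ratings recorded over successive heat pulses. The definitions below are assumptions for illustration and may differ from those used in the OPPERA study: delta is taken as the maximum rating minus the first, the area under the curve uses the trapezoid rule with unit spacing between pulses, and the slope comes from an ordinary least-squares fit of rating on pulse number.

```python
def summarize_ratings(ratings):
    """Summary measures for one subject's ratings over successive pulses.

    All definitions here are illustrative assumptions, not the study's own.
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two ratings")
    first = ratings[0]
    peak = max(ratings)
    # Trapezoid rule with pulses spaced one unit apart.
    auc = sum((ratings[i] + ratings[i + 1]) / 2 for i in range(n - 1))
    # Least-squares slope of rating against pulse number 1..n.
    pulses = range(1, n + 1)
    x_mean = (n + 1) / 2
    y_mean = sum(ratings) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(pulses, ratings))
    sxx = sum((x - x_mean) ** 2 for x in pulses)
    return {
        "first_pulse": first,
        "max": peak,
        "delta": peak - first,
        "auc": auc,
        "slope": sxy / sxx,
    }
```

Under these definitions, first_pulse reflects general sensitivity to pain, while delta and slope reflect how ratings build up across pulses, which is the distinction the abstract draws between general sensitivity and temporal summation.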