
Assessing the Quality of Ordinary Least Squares in General Lp Spaces

In regression analysis, standard estimation is dominated by the Ordinary Least Squares (OLS) method, which yields unbiased, consistent, and efficient estimators when the classical assumptions are satisfied. However, outliers can pull OLS estimates far from the true parameter values. The OLS method is implicitly defined on L2 spaces, which means that large residuals have a disproportionately large, i.e., squared, influence on the regression estimators.
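
The squared influence described above can be illustrated with a small sketch (simulated data, not from the paper): a single outlier drags an L2 (OLS) slope far more than an L1 (least absolute deviations) slope.

```python
import numpy as np

# Nine points on the line y = 2x, plus one large outlier.
x = np.arange(10, dtype=float)
y = 2.0 * x
y[-1] += 50.0

# OLS slope (L2 loss) via the normal equations.
X = np.column_stack([np.ones_like(x), x])
beta_l2 = np.linalg.lstsq(X, y, rcond=None)[0]

# LAD slope (L1 loss) via a coarse grid search over intercepts and slopes.
slopes = np.linspace(0, 4, 401)
intercepts = np.linspace(-5, 5, 101)
best = min((np.abs(y - (b0 + b1 * x)).sum(), b0, b1)
           for b0 in intercepts for b1 in slopes)
beta_l1 = (best[1], best[2])

print(beta_l2[1], beta_l1[1])  # the L2 slope is pulled far more by the outlier
```

Here the L1 fit recovers a slope near the true value of 2, while the L2 fit is dragged well above it, because the outlier's residual enters the L2 objective squared.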

Advanced Decision Making and Interpretability through Neural Shrubs

Advanced decision making using machine learning should be both accurate and interpretable. Many standard machine learning techniques suffer from an inherent lack of transparency with regard to how a decision was reached. In the current work, we aim to overcome this issue by introducing a hybrid learning approach that pairs classical decision trees with artificial neural networks, dubbed a neural shrub. The neural shrub methodology presented in this paper aims to maintain as much interpretability as possible without sacrificing either classification or regression accuracy.
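
One plausible reading of the hybrid idea, sketched here as an assumption rather than the paper's exact architecture: a shallow, human-readable decision tree partitions the input space, and a small neural network handles prediction within each leaf. All names and depths below are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=400)

# Interpretable top: a depth-2 tree yields at most four readable regions.
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
leaf_of = tree.apply(X)

# Accurate bottom: one small network per leaf region.
leaf_models = {}
for leaf in np.unique(leaf_of):
    idx = leaf_of == leaf
    leaf_models[leaf] = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000,
                                     random_state=0).fit(X[idx], y[idx])

def shrub_predict(Xnew):
    """Route each point through the tree, then predict with its leaf's network."""
    leaves = tree.apply(Xnew)
    out = np.empty(len(Xnew))
    for leaf, model in leaf_models.items():
        mask = leaves == leaf
        if mask.any():
            out[mask] = model.predict(Xnew[mask])
    return out

preds = shrub_predict(X)
```

The tree's splits remain fully inspectable, while the per-leaf networks absorb the local nonlinearity.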

Reducing Carryover Effects in Within-Subjects Designs

Within-subjects designs tend to have higher statistical power and require fewer participants than between-subjects designs, but they introduce measurement challenges of their own. Carryover effects, such as practice and fatigue effects, may confound the observed effects in within-subjects designs. The present study aims to remedy these confounds by presenting participants with split halves of the same psychometric scale, so that no participant encounters the same item more than once.
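
A minimal sketch of the split-half administration, under the assumption that items are simply partitioned at random (item labels are placeholders, not the study's scale):

```python
import random

items = [f"item_{i}" for i in range(1, 11)]   # a hypothetical 10-item scale

def split_halves(items, seed):
    """Randomly partition the scale so the two halves share no items."""
    pool = items[:]
    random.Random(seed).shuffle(pool)
    half = len(pool) // 2
    return pool[:half], pool[half:]            # time-1 half, time-2 half

first_half, second_half = split_halves(items, seed=42)
assert not set(first_half) & set(second_half)  # no item is ever repeated
```

Because each participant answers each item at most once, practice effects tied to item repetition cannot arise.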

The great weight debate: Constructing, exploring, and visualizing survey weights to enhance representativeness and obtain population-level estimates from NCHA-II data

College and university administrations across the US utilize results from the National College Health Assessment (NCHA) surveys conducted on their campuses to make decisions about student health policy. When the NCHA is administered through voluntary response sampling, how representative are the responses that administrations get? This project focuses on weighting responses from the NCHA-II survey conducted at a small liberal arts college to improve its representativeness and promote data-driven policy making.
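
A toy poststratification sketch of the weighting idea (the groups, shares, and outcome below are made up, not NCHA data): each respondent is weighted so that sample group shares match known population shares, e.g. from the campus registrar.

```python
import numpy as np

pop_share = {"first_year": 0.30, "upper_class": 0.70}          # known population mix
sample = np.array(["first_year"] * 50 + ["upper_class"] * 50)  # 50/50 voluntary sample

n = len(sample)
weights = np.empty(n)
for group, share in pop_share.items():
    mask = sample == group
    weights[mask] = share * n / mask.sum()   # weight = population share / sample share

# A toy binary outcome: weighted estimates now reflect the population mix.
response = np.concatenate([np.full(50, 1.0), np.full(50, 0.0)])
print(np.average(response, weights=weights))  # 0.30, the population-weighted rate
```

Without weights the naive estimate would be 0.50, driven by first-years being overrepresented in the sample.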

Automatic Variable Selection Method

Linear regression is one of the most widely used statistical methods, and with the ease of modern data collection, a response variable is often accompanied by many candidate predictors. However, most available variable selection methodologies (e.g., the elastic net and its special case, the lasso) require careful selection of a tuning parameter. This can cause numerous problems in practice and may lead to biased estimates. We present an automatic variable selection method that is simple and compares favorably with many of the currently available variable selection methods for linear models.
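
The tuning-parameter issue can be seen with the standard lasso (shown here for context; this is not the paper's automatic method): the set of selected variables depends on a penalty alpha that must itself be chosen, typically by cross-validation. All data are simulated.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
beta = np.array([3.0, -2.0, 1.5] + [0.0] * 7)   # only the first 3 predictors matter
y = X @ beta + rng.normal(size=200)

# Cross-validation picks alpha; the selected support depends on that choice.
fit = LassoCV(cv=5, random_state=0).fit(X, y)
selected = np.flatnonzero(fit.coef_ != 0)
print(fit.alpha_, selected)
```

A different alpha (or a different fold split) can admit or drop borderline predictors, which is the sensitivity an automatic method seeks to avoid.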

fec16 - An R Package Containing Relational Data From The U.S. 2016 Elections

This R package provides a set of cleaned relational datasets from the 2015-2016 United States federal election cycle, published by the Federal Election Commission (FEC). These data include authoritative information about candidates, committees, contributions, expenditures, and election results. Most datasets are included in full, while the larger ones are included as samples, with built-in functions for retrieving the complete versions. The package serves a demonstrated demand for such data in teaching.

A Comparative Assessment of Statistical Approaches for fMRI Data to Obtain Activation Maps

Functional Magnetic Resonance Imaging (fMRI) lets us peek into the human mind and identify which brain areas are associated with certain tasks without the need for an invasive procedure. However, the data collected during fMRI sessions are complex: a time series of 3D brain volumes that does not lend itself to straightforward inference.
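
A minimal sketch of one common baseline for producing activation maps (a massively univariate GLM, not the paper's full comparison): regress every voxel's time series on a task regressor and inspect the resulting t-map. All data here are simulated.

```python
import numpy as np

rng = np.random.default_rng(1)
T, shape = 120, (4, 4, 4)                    # 120 scans of a tiny 4x4x4 "brain"
task = np.tile([0.0] * 10 + [1.0] * 10, 6)   # boxcar on/off task design
data = rng.normal(size=(T, *shape))
data[:, 0, 0, 0] += 2.0 * task               # one truly active voxel

# Fit the same two-column GLM (intercept + task) to every voxel at once.
X = np.column_stack([np.ones(T), task])
Y = data.reshape(T, -1)
beta, res, *_ = np.linalg.lstsq(X, Y, rcond=None)
sigma2 = res / (T - 2)                       # residual variance per voxel

# t-statistic of the task coefficient: beta_1 / sqrt(sigma^2 * [(X'X)^-1]_11)
c_var = np.linalg.inv(X.T @ X)[1, 1]
t_map = (beta[1] / np.sqrt(sigma2 * c_var)).reshape(shape)
print(np.unravel_index(np.abs(t_map).argmax(), shape))  # the active voxel stands out
```

Real pipelines add hemodynamic convolution, motion regressors, and multiple-comparison correction on top of this skeleton.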

Using metafeature clustering to mine tissue-specific signals from rare variants in the cancer genome

Identifying the primary site of origin for cancers of unknown primary is an important clinical problem. Somatic variant mutation analysis for primary site diagnosis traditionally focuses on a small number of frequently occurring mutations, ignoring the vast number of rare mutations that may contain clinically relevant signals. Previous research (Chakraborty et al., Nature Communications 10:1-9, 2019) applied a Bayesian nonparametric method originally developed in computational linguistics to extract tissue-specific signals from the preponderance of rare genetic variation in the cancer genome.
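
An illustrative analogue only: the cited work uses a Bayesian nonparametric model, but the same document-style framing can be shown with a simple parametric topic model (LDA), treating tumors as "documents" and rare variants as "words". All counts below are simulated.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
n_tumors, n_variants = 60, 100

# Two hypothetical tissue types, each enriched for a different block of variants.
counts = rng.poisson(0.05, size=(n_tumors, n_variants))
counts[:30, :50] += rng.poisson(0.5, size=(30, 50))   # tissue A signature
counts[30:, 50:] += rng.poisson(0.5, size=(30, 50))   # tissue B signature

# Two "topics" should recover the two tissue-specific mutation signatures.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
topic = lda.transform(counts).argmax(axis=1)
```

Tumors from the same simulated tissue land predominantly in the same topic, which is the kind of tissue-specific signal the rare-variant analysis targets.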

Adaptive Learning in Macroeconomic Models - Understanding Monetary Policy

This research project considers adaptive learning algorithms for estimating the Taylor rule, with the goal of strengthening the public sector's understanding of monetary policy. Unit root tests that allow for a structural break are conducted to assess the stationarity assumption for the macroeconomic time series. In the first step, the Kalman filter simulates economic agents' expectations concerning inflation and unemployment rates. Maximum likelihood optimisation and Monte Carlo integration are then used to estimate additional parameters and to assess the statistical significance of the coefficients.
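
The expectations step can be sketched with a minimal local-level Kalman filter (parameter values and the data-generating process are illustrative assumptions, not the project's specification): an agent's inflation forecast is nudged toward each new observation by the Kalman gain.

```python
import numpy as np

def kalman_local_level(y, q=0.1, r=1.0, m0=0.0, p0=10.0):
    """Filtered estimates for y_t = m_t + eps_t, m_t = m_{t-1} + eta_t."""
    m, p = m0, p0
    out = []
    for obs in y:
        p = p + q               # predict: state variance grows by q
        k = p / (p + r)         # Kalman gain
        m = m + k * (obs - m)   # update expectation toward the surprise
        p = (1 - k) * p
        out.append(m)
    return np.array(out)

rng = np.random.default_rng(0)
inflation = 2.0 + rng.normal(scale=0.5, size=200)   # noisy series around 2%
est = kalman_local_level(inflation)
print(est[-1])  # settles near the underlying 2% rate
```

The gain plays the role of the agents' learning rate: a larger signal-to-noise ratio q/r makes expectations respond faster to recent data.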

Investigating the Most Effective Card-Upgrade Strategy in Clash Royale

Clash Royale is a real-time strategy video game that allows two players to “battle” with their decks—a combination of eight cards, each associated with a level that can be increased through upgrades. Our goal was to investigate the most effective card-upgrade strategy across different decks while also taking into account the in-game currency required for upgrades.
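
One way to formalize the budgeted upgrade problem (card names, costs, and gains below are made up, not the study's data): rank candidate upgrades by estimated benefit per unit of gold and take them greedily under the currency budget.

```python
upgrades = [  # (card, gold cost, estimated win-rate gain) -- all hypothetical
    ("Knight", 400, 0.8),
    ("Fireball", 1000, 1.5),
    ("Hog Rider", 2000, 2.4),
    ("Archers", 150, 0.2),
]
budget = 2500  # available gold

# Greedy knapsack heuristic: best gain-per-gold ratio first.
ranked = sorted(upgrades, key=lambda u: u[2] / u[1], reverse=True)
chosen, spent = [], 0
for card, cost, gain in ranked:
    if spent + cost <= budget:
        chosen.append(card)
        spent += cost

print(chosen, spent)  # ['Knight', 'Fireball', 'Archers'] 1550
```

The greedy ratio rule is a heuristic: for tight budgets an exact 0/1 knapsack solution can differ, which is precisely why comparing strategies across decks is interesting.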