Using a Linear Regression to Estimate the Average BMI of Individuals Aged 12–20

This study used linear regression to examine the relationship between BMI and internal and external variables among individuals aged 12–20. Although BMI does not directly measure body fat, research indicates that it correlates with direct measures of body fat and can indicate whether a person is underweight, normal weight, overweight, or obese. Data from the IPUMS (Integrated Public Use Microdata Series) website were collected, imported into R, and cleaned to create a final model that can be used to analyze trends in average BMI among different groups of individuals.

Predicting Sales Price for Homes at Ames, Iowa

The goal of this project is to find the best model for predicting sale prices of houses in Ames, Iowa, using a high-dimensional dataset together with exploratory data analysis and data pre-processing. The project compares the performance of lasso regression, a gradient-boosted decision tree model, and a multi-layer perceptron neural network. Cross-validation was used to tune the hyperparameters. We found that the gradient boosting model performed best on this housing dataset.
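The hyperparameter tuning step mentioned above can be illustrated with a minimal sketch of k-fold cross-validation. This is not the project's code: to stay self-contained it swaps in a one-feature ridge regression with no intercept in place of the lasso, boosting, and neural network models, and the function names (`k_fold_indices`, `fit_ridge_1d`, `cv_mse`, `tune`) and the toy data are hypothetical.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for f in range(k):
        size = n // k + (1 if f < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_ridge_1d(xs, ys, lam):
    """Closed-form slope of a one-feature ridge regression without intercept.

    Stand-in model (an assumption for illustration), with penalty lam >= 0.
    """
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def cv_mse(xs, ys, lam, k):
    """Average held-out mean squared error over k folds for one value of lam."""
    fold_errors = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        slope = fit_ridge_1d(train_x, train_y, lam)
        err = sum((ys[i] - slope * xs[i]) ** 2 for i in held_out) / len(held_out)
        fold_errors.append(err)
    return sum(fold_errors) / len(fold_errors)


def tune(xs, ys, candidates, k=5):
    """Return the candidate penalty with the lowest cross-validated error."""
    return min(candidates, key=lambda lam: cv_mse(xs, ys, lam, k))


# Toy, made-up data lying exactly on y = 2x (not the housing data).
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 12]
best_lam = tune(xs, ys, candidates=[0.0, 1.0, 10.0], k=3)
```

Because the toy data are noise-free, the unpenalized fit wins here; on real data the folds would normally be shuffled first, and the same loop would be repeated for each model family being compared.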

Investigating image quality loss while using statistical methods to filter grayscale Gaussian noise

Statisticians, as well as machine learning and computer vision experts, have been studying image enhancement through denoising in different domains of imaging, such as textual documentation, tomography, astronomical photography, and low-light photography. With the surge of interest in machine learning and deep learning, many in the computer vision field feel that current approaches to effective image denoising are moving away from statistical inference methods and toward these subfields of artificial intelligence.

Digitizing, Districting, and Data - Creating an Open Source Precinct Shapefile for Ohio

One of the most recent advances in voting rights research has been the use of GIS and computation to create new metrics for fairness of maps. Voting precinct shapefiles are necessary in order to spatially evaluate election results, bring gerrymandering cases to court, and create viable alternative districting plans. Even as more government data has been made public over the last few years, many counties still do not publish their precinct boundaries in an accessible format. Ohio is particularly challenging for precinct shapefile collection.

A Differentially Private Wilcoxon Signed Rank Test

Hypothesis tests are a crucial statistical tool for data mining and are the workhorse of scientific research in many fields. Here we present a differentially private analogue of the classic Wilcoxon signed-rank hypothesis test, which is used when comparing sets of paired (e.g., before-and-after) data values. We present not only a private estimate of the test statistic, but also a method to accurately compute a p-value and assess statistical significance. We evaluate our test on both simulated and real data.
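For readers unfamiliar with the underlying test, the following is a minimal sketch of the classic, non-private Wilcoxon signed-rank statistic for paired data. It does not reproduce the differentially private version described in the abstract, whose noise mechanism and p-value computation are not given here. The function name `wilcoxon_signed_rank` and the sample values are hypothetical, and the sketch follows the common convention of dropping zero differences and assigning average ranks to ties.

```python
def wilcoxon_signed_rank(before, after):
    """Return (W_plus, W_minus), the sums of ranks of positive and negative differences.

    Zero differences are dropped; tied absolute differences share their average rank.
    """
    diffs = [a - b for b, a in zip(before, after) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)

    i = 0
    while i < len(order):
        # Find the run of tied absolute differences starting at sorted position i.
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Made-up before/after measurements for four subjects.
w_plus, w_minus = wilcoxon_signed_rank([10, 12, 9, 11], [12, 11, 13, 11])
```

With n nonzero differences, the two sums always add up to n(n + 1)/2, which is a quick sanity check on any implementation. A private analogue has to perturb a statistic of this kind while still supporting a valid p-value, which is the problem the abstract addresses.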

The Procrastinator’s Guide to a Data Science Career

The best approach to a data science career involves discipline, organization, and patience. So what do you do if you have none of those traits? In this talk, I'll share strategies for entering a career in data science or statistics, based on my own experience working as a Data Scientist at Stack Overflow and my history as an inveterate procrastinator. With the right philosophy, procrastination can be a surprisingly productive strategy that is especially well-suited to the modern field of data science.