eUSRC 2017 - Graduate School Panel
Moderated by: Dan Sweeney, University of Michigan
Moderated by: Han Zhang and Sai Bolla, University of Michigan
The best approach to a data science career involves discipline, organization, and patience. So what do you do if you have none of those traits? In this talk, I'll share strategies for entering a career in data science or statistics, based on my own experience working as a Data Scientist at Stack Overflow and my history as an inveterate procrastinator. With the right philosophy, procrastination can be a surprisingly productive strategy that is especially well-suited to the modern field of data science.
The dataset for the ASA 2017 Datafest competition was provided by Expedia Inc., a travel company that primarily runs travel fare aggregator websites. The dataset includes over 10 million user records of searches and purchases through various Expedia websites. This paper conducts a machine learning analysis via a classification decision tree to identify potential customers who do not purchase a travel package but are similar to those who do. The paper then identifies the countries a group of potential customers is most likely to travel to, as well as the types of hotels they are most likely to book.
In the environmental sciences, portions of collected data are often reported as non-detect, meaning that the actual data point is known only to be below the detection limit of a measuring device. In mainstream statistics, this type of data is known as left-censored data. Oftentimes, data sets include two detection limits for various reasons.
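As an illustration of how left-censored observations can enter a likelihood, here is a minimal sketch. It assumes the measurements (or their logarithms) follow a normal distribution, which the abstract does not state, and the function names are hypothetical. Each detected value contributes its density, while each non-detect contributes the probability of falling below its own detection limit, so two or more limits are handled naturally.

```python
import math


def normal_logpdf(x, mu, sigma):
    """Log density of a Normal(mu, sigma) distribution at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)


def normal_cdf(x, mu, sigma):
    """Probability that a Normal(mu, sigma) variable is below x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2))))


def censored_loglik(mu, sigma, detected, nondetect_limits):
    """Log-likelihood for left-censored data under an assumed normal model.

    detected: measured values above their detection limit.
    nondetect_limits: one detection limit per non-detect observation;
        the limits may differ from one observation to the next.
    """
    loglik = sum(normal_logpdf(x, mu, sigma) for x in detected)
    loglik += sum(math.log(normal_cdf(d, mu, sigma)) for d in nondetect_limits)
    return loglik


# Hypothetical data: three detected values and non-detects at two limits.
print(censored_loglik(1.0, 0.5, [1.2, 0.9, 1.6], [0.5, 0.5, 0.8]))
```

Maximizing this function over `mu` and `sigma` with any numerical optimizer gives estimates that use the non-detects rather than discarding them or substituting an arbitrary value.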
Graph representations are used across disciplines for the analysis and visualization of relational data. Exponential random graph models (ERGMs) provide a general method for modelling the underlying stochastic process that generated the observed data, conditional on observed attributes of the vertices, or nodes. Recent developments in ERGMs have introduced the notions of local dependence and the exponential random network model, or ERNM.
Risk assessment algorithms are increasingly common in the criminal justice process to predict the chance that an individual convicted of a crime will commit another crime, or recidivate. Recent studies have sparked interest in verifying that such assessment tools predict the risk of recidivism with equal accuracy across races.
Learning to Rank (LTR) is the application of machine learning to rank search results according to their degree of relevance to the query. Salesforce Enterprise Search employs a hand-crafted ranking function to score search results and order them accordingly for users. Data about this ranking process are stored in JSON format, which is a nested tree with arbitrary depth. We present our first effort to parse this data, and extract the inputs of the ranking function into a tabular format.
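Flattening a nested JSON tree of arbitrary depth into tabular columns can be sketched as follows. This is only an illustration, not the parser described in the abstract: the record layout, the field names, and the `flatten` helper are all hypothetical. It walks the tree with an explicit stack, so deep nesting does not hit Python's recursion limit, and it names each column by the dotted path to its leaf.

```python
import json


def flatten(record):
    """Map each leaf of a nested JSON value to a dotted-path column name."""
    flat = {}
    stack = [((), record)]  # pairs of (path so far, node still to visit)
    while stack:
        path, node = stack.pop()
        if isinstance(node, dict):
            for key, value in node.items():
                stack.append((path + (str(key),), value))
        elif isinstance(node, list):
            for index, value in enumerate(node):
                stack.append((path + (str(index),), value))
        else:
            # Leaf value (string, number, boolean, or null).
            flat[".".join(path)] = node
    return flat


# Hypothetical record; the real logging schema is not given in the abstract.
raw = '{"query": "acme", "result": {"score": 1.5, "features": [{"name": "tf", "value": 0.2}]}}'
row = flatten(json.loads(raw))
print(row["result.features.0.value"])
```

Each flattened record becomes one row of a table, with the union of all paths as the columns. Empty objects and arrays produce no columns in this sketch.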
Women in tech often don’t talk about their own achievements due to fear of being labeled “bossy” or “vain.” In fact, they are far more likely than men to credit mentors rather than themselves with their success. But how do women talk about themselves when asked by other influential women? We can start to answer this question using tweets from a hashtag that trended the week after the release of the “anti-diversity” Google memo and that encouraged women in tech to “brag.” Using topic modeling and word associations, we can see that women are actually great at bragging about themselves.
Safe drinking water is a right that should be guaranteed to all populations. In the United States, we know that many urban areas have the ability to obtain safe drinking water, but can rural communities similarly do so? If there is limited access, can technologies, such as point-of-use devices, temporarily improve water quality? With this in mind, we designed an experiment to test water quality at locations in the vicinity of a Midwestern liberal arts institution. The two variables of interest were the location of the water supply and filtration, and we examined how each affects drinking water quality.