Data Science Projects

This section showcases four projects: Fraud Detection for Online Business to Business Transactions (KNN, XGBoost, naive Bayes), Predicting the popularity of Reddit posts (logistic regression, random forest, and sentiment analysis), Forecasting Shanghai License Plate Price (time series analysis), and Happiness in Major Cities (data visualization).

I will briefly summarize each of the four projects.

Relevant Skills: Python (pandas, numpy, sklearn, nltk, keras), R (dplyr, tidyr, lme4, glmnet, astsa), machine learning, time series analysis, data visualization 

Fraud Detection for Online Business to Business Transactions

This is the senior capstone project for my statistics major. A five-student team worked with our client Bill.com, a company processing 120,000+ payment requests daily.

The goal of our project was to use over 250,000 transactions provided by Bill.com to find key features and patterns indicative of fraudulent transactions, and to build an effective model for predicting fraudulent activity.

After cleaning, EDA, feature engineering, and modeling, we found that XGBoost trained on SMOTE-oversampled data was the best model, reaching a 95% success rate while flagging only a minimal number of legitimate transactions as fraud. As a result, the model saves 73.5 hours of manual checking daily and will be put into practice in summer 2019.
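To illustrate the oversampling idea, here is a minimal pure-Python sketch of what SMOTE does: it creates synthetic minority-class points by interpolating between a fraud case and one of its nearest fraud neighbours. This is not the code our team used; the function name `smote_oversample`, the two features (amount, hour of day), and the toy values are all hypothetical. In practice a library implementation such as `SMOTE` from imbalanced-learn would be the usual choice.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (the core SMOTE idea).
    Needs at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy data: (amount, hour_of_day) for a handful of fraud cases.
fraud = [(900.0, 2.0), (1200.0, 3.0), (1100.0, 1.0), (950.0, 4.0)]
new_points = smote_oversample(fraud, n_new=6)
print(len(new_points))
```

Because every synthetic point lies on a segment between two real fraud cases, the new points stay inside the region the minority class already occupies. Oversampling should only be applied to the training split, so that evaluation still reflects the real class balance.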

Predicting the popularity of Reddit posts

This is the final project for the data science tools and algorithms class in my senior year of college.

In our analysis, we sought to answer the following questions:

  1. Which features contribute to the popularity of a Reddit post? Which features are most influential?
  2. Can we employ methods from sentiment analysis, such as polarity scores, to predict popularity?

We found that random forest performed best, with an accuracy of 0.967. More importantly, we found the top three predictors of post popularity: the total number of comments, the number of subscribers of the subreddit, and the emotional score of the comments (generated from sentiment analysis).
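As a rough illustration of what a polarity score is, the sketch below averages word scores from a tiny hand-made lexicon. The lexicon, the `polarity` function, and the example comments are all hypothetical and only show the idea; a real analysis would use a full sentiment tool (for example, NLTK's VADER analyzer), which handles negation, intensity, and a far larger vocabulary.

```python
# Hypothetical mini-lexicon, for illustration only; real lexicons contain
# thousands of scored words.
LEXICON = {
    "great": 1.0, "love": 1.0, "good": 0.5,
    "bad": -0.5, "awful": -1.0, "hate": -1.0,
}


def polarity(text):
    """Average lexicon score of the words in `text`, in [-1, 1].
    Returns 0.0 when no word is in the lexicon."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


# Made-up comments, not taken from the Reddit data.
comments = ["I love this, great post!", "Awful take.", "No opinion here."]
print([polarity(c) for c in comments])  # prints [1.0, -1.0, 0.0]
```

Scores like these can be averaged over all comments on a post to give one numeric feature per post, which is the kind of input a random forest can then rank by importance.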

The poster summarized our approach.

 

Forecasting Shanghai License Plate Price

This is an independent project for my time series analysis class, completed in my junior year.

Background

Shanghai uses an auction system to sell a limited number of license plates to fossil-fuel car buyers every month. The average price of a license plate is about $13,000 (the unit in the original data is CNY), and it is often referred to as “the most expensive piece of metal in the world.” Getting plates outside Shanghai is a less appealing option, because the city doesn’t allow vehicles registered elsewhere on its elevated highways during rush hours.

Problem Statement

Given the increasing average price of license plates and the decreasing success rate, I wanted to build a relatively accurate forecast model for the average plate price.
The key questions for this work include:
  • What are the forecast and its error bounds for the average Shanghai license plate price?
  • Based only on 2002-2013 data, how does our prediction for the successful application rate differ from reality?

Analysis

I first plotted the original time series of average license plate price and success rate. Because both the ACF and PACF of average price show exponential decay and are not significant after lag 1, I first tried an ARMA(1,1) model and forecasted the next 10 months.
For success rate, I conducted periodogram analysis and found a predominant cycle of 12 months.
Based on the ACF and PACF plots, I tried several SARIMA and GARCH models, and found SARIMA(1,0,0)(1,0,0)[12] to be the best fit.
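The periodogram step can be sketched in a few lines. The example below computes a raw periodogram with a direct discrete Fourier transform and reports the period with the most power. It runs on a synthetic monthly series with a built-in yearly cycle, not on the Shanghai data, and the function name `periodogram` is my own; the original analysis was done in R.

```python
import cmath
import math


def periodogram(series):
    """Return (period, power) pairs at the Fourier frequencies k/n,
    for k = 1 .. n//2, after removing the series mean."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    result = []
    for k in range(1, n // 2 + 1):
        dft = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        result.append((n / k, abs(dft) ** 2 / n))
    return result


# Synthetic monthly series with a yearly cycle (NOT the Shanghai data).
series = [10 + 3 * math.sin(2 * math.pi * t / 12) for t in range(120)]
period, power = max(periodogram(series), key=lambda pair: pair[1])
print(round(period))  # prints 12
```

A dominant period of 12 on monthly data points to yearly seasonality, which is what motivates a seasonal term with period 12 in the model.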

Conclusion

The average price of a Shanghai license plate is well captured by a low-order ARIMA model, namely an ARIMA(1,1,1). Comparing the forecast with the actual average price for March and April 2018, the model did a good job.
The successful application rate for Shanghai license plates is well captured by a low-order seasonal ARMA model, namely a SARIMA(1,0,0)x(1,0,0)[12].
Forecasts predict the near-term behavior of the series. The long-term forecasts converge to the estimated mean of the process, as expected. However, the seasonal model for the successful application rate failed to anticipate a sharp drop in 2014, which was caused by a sudden increase in the total number of applicants (the denominator of the success rate). The reason for the increase in applicants is unclear, but we need to be mindful of the impact of future events (or people’s expectations) instead of focusing only on past data when we build a model.
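The convergence of long-term forecasts to the mean is easiest to see in a plain stationary AR(1) model, used here as a simpler stand-in for the seasonal model above. Its h-step-ahead forecast is mean + phi^h × (last value − mean), so the gap to the mean shrinks geometrically. The parameter values below are hypothetical and were not fitted to the plate data.

```python
def ar1_forecast(last_value, mean, phi, horizon):
    """h-step-ahead forecasts of a stationary AR(1) process (|phi| < 1):
    mean + phi**h * (last_value - mean) for h = 1 .. horizon."""
    return [
        mean + phi ** h * (last_value - mean)
        for h in range(1, horizon + 1)
    ]


# Hypothetical values chosen only for illustration.
forecasts = ar1_forecast(last_value=0.10, mean=0.05, phi=0.8, horizon=60)
print(round(forecasts[0], 4), round(forecasts[-1], 4))  # prints 0.09 0.05
```

This also shows why such a model cannot anticipate a shock like the 2014 jump in applicants: far enough ahead, the forecast is just the historical mean.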

Happiness in Major Cities

This is the final project for my R programming class, where we tried different visualization methods through ggplot2 in R to investigate how the distribution of happiness ratings is influenced by different variables.

My responsibilities were team leadership, data cleaning, and data visualization.

We answered three questions with plots generated from data (only a small portion of them are presented here).

Q1: Are there any similarities or significant differences between each city in happiness level distribution?

People living in Seoul and Tokyo appear less happy than those in other cities: very few respondents there gave a happiness rating of 5 (very happy).

A large proportion of people who voted very happy live in Toronto and New York.

Q2: How does age affect the distribution of happiness level in different cities?

Generally, the distributions of happiness level across all cities are relatively right-skewed, indicating that happiness generally decreases as age increases. Notably, most cities peak at the second or third layer, which suggests that middle-aged and older adults are the happiest age groups.

For Berlin, NYC, Seoul, Stockholm, Tokyo, and Toronto, young adults are less happy than middle-aged and older adults, which suggests that people in school or in the early stage of their careers are relatively unhappy in these cities. However, once people in these cities reach the first age quartile (about 30), they are usually at the peak of mean happiness across all ages in that city. Of the remaining cities, London, Milan, and Paris maintain a high happiness level from young to older adults (18-60). Beijing is an exception because its middle-aged people are relatively unhappy.

For the oldest group (above the third quartile), as the age increases, the happiness level decreases. This phenomenon may be more related to the participants’ decreasing health than the cities they live in. The only exception is New York.

Q3: What are the essential factors that determine how happy people feel in each city?

We focused on the following age × city groups, which have the lowest numbers of people who rated their happiness 1 (“not happy at all”). For each group, we counted the three factors that most often received “very satisfied” scores.

The results indicate that good welfare and safety are the fundamental factors that keep citizens happy.

City   | Age Group | 1st           | 2nd                  | 3rd
Berlin | Old       | Safety (343)  | Welfare (314)        | City Administration (159)
Berlin | Young     | Welfare (209) | Safety (189)         | Economy (178)
Milan  | Young     | Safety (265)  | Community Life (247) | City Administration (224)
NYC    | Old       | Welfare (356) | Safety (354)         | Environment (167)
