This section showcases four projects: Fraud Detection for Online Business to Business Transactions (KNN, XGBoost, naive Bayes), Predicting the popularity of Reddit posts (logistic regression, random forest and sentiment analysis), Forecasting Shanghai License Plate Price (time series analysis) and Happiness in Major Cities (data visualization).
I will briefly summarize each of the four projects.
Relevant Skills: Python (pandas, numpy, sklearn, nltk, keras), R (dplyr, tidyr, lme4, glmnet, astsa), machine learning, time series analysis, data visualization
Fraud Detection for Online Business to Business Transactions
This was the senior capstone project for my statistics major. Our five-student team worked with our client, Bill.com, a company that processes more than 120,000 payment requests daily.
The goal of the project was to use more than 250,000 transactions provided by Bill.com to identify key features and patterns indicative of fraudulent transactions, and to build an effective model for predicting fraudulent activity.
After data cleaning, EDA, feature engineering, and modeling, we found that XGBoost trained on SMOTE-oversampled data was the best model, reaching a 95% success rate while flagging only a minimal number of legitimate transactions as fraud. As a result, the model saves 73.5 hours of manual checking daily and will be put into practice in summer 2019.
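To illustrate the oversampling idea mentioned above, here is a minimal, dependency-free sketch of SMOTE: each synthetic minority point is placed on the line segment between a real minority point and one of its nearest minority neighbours. The function name, the toy features (amount, account age), and all values are hypothetical. The project itself most likely relied on a library implementation, which the write-up does not specify.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by interpolating between a
    randomly chosen minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy data: (amount, account_age_days) for fraudulent transactions
fraud = [(900.0, 2.0), (950.0, 1.0), (1000.0, 3.0), (870.0, 2.0)]
legit_count = 20  # size of the majority class in this toy example

# Generate enough synthetic fraud rows to match the majority class size
new_points = smote_oversample(fraud, n_new=legit_count - len(fraud))
balanced_fraud = fraud + new_points
print(len(balanced_fraud))  # 20
```

In practice, oversampling of this kind is applied to the training split only, so that the evaluation data still reflects the real class imbalance.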
Predicting the popularity of Reddit posts
This was the final project for the data science tools and algorithms class I took in my senior year of college.
In our analysis, we sought to answer the following questions:
- Which features contribute to the popularity of a Reddit post? Which features are most influential?
- Can we employ methods from sentiment analysis, such as polarity scores, to predict popularity?
We found that random forest performed best, with an accuracy of 0.967. More importantly, we identified the top 3 factors for predicting post popularity: the total number of comments, the number of subscribers of the subreddit, and the emotional score of the comments (generated from sentiment analysis).
The poster summarized our approach.
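As a rough illustration of how a comment-level polarity score can become a post-level feature, the sketch below averages lexicon scores per comment and then across comments. The tiny lexicon, the function names, and the sample comments are all hypothetical. A real analysis would typically use a full lexicon (for example, the VADER lexicon shipped with NLTK), and the write-up does not state which scorer the project used.

```python
import re

# Hypothetical toy lexicon; a real analysis would use a full sentiment lexicon
LEXICON = {
    "love": 2.0, "great": 1.5, "good": 1.0,
    "bad": -1.0, "awful": -2.0, "hate": -2.0,
}


def polarity(text):
    """Average lexicon score of the words in text; 0.0 if none are in the lexicon."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


def mean_comment_polarity(comments):
    """Post-level feature: mean polarity across a post's comments."""
    if not comments:
        return 0.0
    return sum(polarity(c) for c in comments) / len(comments)


# Hypothetical comments on a single post
comments = ["I love this, great post", "Awful take", "No opinion"]
feature = mean_comment_polarity(comments)
print(round(feature, 3))  # -0.083
```

A feature like this could then sit alongside counts such as the number of comments and subscribers as an input to a classifier.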
Forecasting Shanghai License Plate Price
This is an independent project for my time series analysis class, completed in my junior year.
Background
Problem Statement
- What are the forecast and the forecast error bounds for the average Shanghai license plate price?
- Based only on 2002-2013 data, how does our prediction for the successful application rate differ from reality?
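To make "forecast and forecast error bounds" concrete, here is a minimal sketch using a random walk with drift, where the approximate 95% bounds widen with the square root of the forecast horizon. This is a stand-in chosen for simplicity, not the model used in the project (which is not described here), and the price series is made up.

```python
import math
import statistics


def drift_forecast(series, horizon, z=1.96):
    """Random-walk-with-drift forecast with approximate 95% bounds.

    Returns a list of (forecast, lower, upper) for steps 1..horizon.
    The bounds ignore the uncertainty in the estimated drift.
    """
    diffs = [b - a for a, b in zip(series, series[1:])]
    drift = statistics.mean(diffs)   # average step-to-step change
    sigma = statistics.stdev(diffs)  # spread of step-to-step changes
    last = series[-1]
    out = []
    for h in range(1, horizon + 1):
        point = last + h * drift
        half_width = z * sigma * math.sqrt(h)
        out.append((point, point - half_width, point + half_width))
    return out


# Hypothetical prices, not the actual Shanghai auction data
prices = [100.0, 104.0, 103.0, 108.0, 112.0, 111.0, 116.0]
forecasts = drift_forecast(prices, horizon=3)
for point, lower, upper in forecasts:
    print(f"{point:.1f} [{lower:.1f}, {upper:.1f}]")
```

Comparing held-out observations against such bounds is one simple way to check how a prediction built on earlier data differs from what actually happened.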
Analysis
Conclusion
Happiness in Major Cities
This was the final project for my R programming class, in which we tried different visualization methods with ggplot2 in R to investigate how the distribution of happiness ratings is influenced by different variables.
My responsibilities were team lead, data cleaning, and data visualization.
We answered three questions with plots generated from the data (only a small portion of them is presented here).
Q1: Are there any similarities or significant differences between each city in happiness level distribution?
People living in Seoul and Tokyo appear less happy than those in other cities: very few of them gave a happiness rating of 5 (very happy).
A large proportion of people who voted very happy live in Toronto and New York.
Q2: How does age affect the distribution of happiness level in different cities?
Generally, the distributions of happiness level across all cities have a relatively right-skewed pattern, indicating that as age increases, the happiness level generally decreases. Notably, most of the cities peak at the second or third layer, which reflects that middle-aged and older adults are the happiest age groups.
For Berlin, NYC, Seoul, Stockholm, Tokyo and Toronto, young adults are less happy than middle-aged and older adults, which suggests that people in school or in the early stage of their careers are fairly unhappy in these cities. However, once people in these cities reach the first age quartile (about 30), mean happiness is usually at its peak across all ages in that city. Of the remaining cities, London, Milan, and Paris maintain a high happiness level from young to older adults (18-60). Beijing is an exception because its middle-aged people are relatively unhappy.
For the oldest group (above the third quartile), as the age increases, the happiness level decreases. This phenomenon may be more related to the participants’ decreasing health than the cities they live in. The only exception is New York.
Q3: What are the essential factors that determine residents' happiness level in each city?
We focused on people in the following age x city groups. These groups have the lowest number of people who rated 1 “not happy at all”. Specifically, for each group we counted the top 3 factors that received “very satisfied” scores.
The results indicate that good welfare and safety are the fundamental factors underpinning citizens' happiness.
City | Age Group | 1st | 2nd | 3rd |
Berlin | Old | Safety (343) | Welfare (314) | City Administration (159) |
Berlin | Young | Welfare (209) | Safety (189) | Economy (178) |
Milan | Young | Safety (265) | Community Life (247) | City Administration (224) |
NYC | Old | Welfare (356) | Safety (354) | Environment (167) |
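The counting behind a table like the one above can be sketched as follows: for each city and age group, tally how often each factor received a “very satisfied” score and keep the three most frequent. The project itself was done in R. This Python sketch uses made-up rows and a hypothetical row layout of (city, age group, factor, rating).

```python
from collections import Counter, defaultdict


def top_factors(rows, n=3):
    """For each (city, age_group), return the n factors most often rated 'very satisfied'."""
    counts = defaultdict(Counter)
    for city, age_group, factor, rating in rows:
        if rating == "very satisfied":
            counts[(city, age_group)][factor] += 1
    return {group: c.most_common(n) for group, c in counts.items()}


# Hypothetical survey rows, not the actual survey data
responses = (
    [("Berlin", "Old", "Safety", "very satisfied")] * 3
    + [("Berlin", "Old", "Welfare", "very satisfied")] * 2
    + [("Berlin", "Old", "City Administration", "very satisfied")]
    + [("Berlin", "Old", "Economy", "satisfied")]  # other ratings are not counted
)

top = top_factors(responses)
for (city, age_group), factors in top.items():
    print(city, age_group, factors)
```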