PORTFOLIO PROJECTS

 

2016-2022 Vancouver Crime Data Exploration and Modelling

In this project, I applied data analytics and machine learning methods to Vancouver Police Department (VPD) data to predict hourly theft crime across different neighbourhoods in Vancouver, BC. Multiple data sources were incorporated into the original VPD crime dataset for data exploration and feature engineering. The objective of the initial data analysis was to identify key crime patterns that could give direction to further analysis, dashboard building and model development. We examined the overall trends in reported crime cases and analyzed time-related and geographical patterns. Theft accounts for most of the crime in Vancouver, so I narrowed the focus to theft when building machine learning models. The ultimate goal was to support predictive policing that helps reduce crime while mitigating risk to law enforcement officers.

  • Preparing datasets by cleaning, transforming, joining and aggregating in SQL and Python

  • Visualizing time-related and geographical patterns in Tableau

  • Building a binary classification model to classify theft risk as high or low at a given location and hour

  • The model captured 81% of unseen instances where the actual theft count was greater than 3 cases per neighbourhood per hour

  • Original Project Workbook and Code
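
The 81% figure above reads like recall on the high-risk class. The following is a minimal sketch of how such a label and metric could be computed, not the project's actual code. The function names, the toy incidents and the made-up predictions are all assumptions for illustration; only the "more than 3 cases per neighbourhood per hour" threshold comes from the text.

```python
from collections import Counter

# Threshold taken from the bullet above: more than 3 thefts per
# neighbourhood per hour counts as "high risk".
HIGH_RISK_THRESHOLD = 3


def label_high_risk(incidents, threshold=HIGH_RISK_THRESHOLD):
    """Count thefts per (neighbourhood, hour) and flag cells above the threshold."""
    counts = Counter(incidents)
    return {cell: count > threshold for cell, count in counts.items()}


def recall(actual, predicted):
    """Share of truly high-risk cells that the predictions also flag."""
    positives = [cell for cell, is_high in actual.items() if is_high]
    if not positives:
        return 0.0
    hits = sum(1 for cell in positives if predicted.get(cell, False))
    return hits / len(positives)


# Toy data (hypothetical): each tuple is one reported theft
# as (neighbourhood, hour of day).
incidents = (
    [("Downtown", 18)] * 5
    + [("Downtown", 3)] * 1
    + [("Kitsilano", 18)] * 4
    + [("Kitsilano", 9)] * 2
)
actual = label_high_risk(incidents)

# Made-up model output for the same cells, standing in for a classifier.
predicted = {
    ("Downtown", 18): True,
    ("Downtown", 3): False,
    ("Kitsilano", 18): False,
    ("Kitsilano", 9): False,
}

print(recall(actual, predicted))
```

In this toy case two cells are truly high risk and the predictions catch one of them, so the recall is one half. A real evaluation would compute the same ratio on held-out hours.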

 

New York Taxi Analysis

In this project, I applied the data science pipeline to explore and use deterministic features to predict how much a cab driver can earn per hour in different areas of New York. Knowing which areas earn more or less on any given day and at any given hour allows the union to distribute cab assignments more equitably across areas, and to rotate cab drivers between higher- and lower-income areas on a daily basis. This way, no cab driver dominates a high-income area while another is continuously relegated to a low-income area.

  • Utilizing Python pandas, numpy, matplotlib for Data Exploration, Data Cleaning and Data Preparation

  • Identifying trends and visualizing time series data of taxi trips

  • Building machine learning algorithms to predict taxi fares in New York

  • Documenting each step in detail in a Jupyter notebook

  • The Python packages I utilized: pandas|numpy|matplotlib|scikit-learn

  • README | Project Workbook
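
As a rough illustration of the "earnings per hour by area" target described above, the sketch below averages fare revenue per active driver for each zone and hour. It is not the workbook's code: the field names (`zone`, `hour`, `driver_id`, `fare`), the sample trips and the choice to divide by distinct drivers are all assumptions made for this example.

```python
from collections import defaultdict


def hourly_earnings(trips):
    """Average fare revenue per active driver for each (zone, hour) cell."""
    revenue = defaultdict(float)
    drivers = defaultdict(set)
    for trip in trips:
        cell = (trip["zone"], trip["hour"])
        revenue[cell] += trip["fare"]
        drivers[cell].add(trip["driver_id"])
    # Divide total revenue in a cell by the number of distinct drivers seen there.
    return {cell: revenue[cell] / len(drivers[cell]) for cell in revenue}


# Hypothetical trips; a real dataset would have many more columns and rows.
trips = [
    {"zone": "Midtown", "hour": 8, "driver_id": "A", "fare": 20.0},
    {"zone": "Midtown", "hour": 8, "driver_id": "A", "fare": 10.0},
    {"zone": "Midtown", "hour": 8, "driver_id": "B", "fare": 30.0},
    {"zone": "Astoria", "hour": 8, "driver_id": "C", "fare": 12.0},
]

earnings = hourly_earnings(trips)
for (zone, hour), value in sorted(earnings.items()):
    print(zone, hour, value)
```

A table like this, built per zone and hour, is the kind of target a regression model could then be trained to predict.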

 

COVID-19 DATA EXPLORATIONS

  • Identifying trends and visualizing time series data of global, continental, country cases and deaths

  • Examining how different vaccine manufacturers contributed to case reduction

  • Examining the adequacy of policy responses to the pandemic by analyzing the stringency index against case counts

  • The tools I utilized: SQL|Tableau

 

POPULATION GROWTH AND ENVIRONMENTAL DESTRUCTION

  • How has the planet been adversely affected since the population boom of the past 100 years?

  • What are the links between climate change, resource depletion, natural disasters and overpopulation, and what are the implications for humanity?

  • Why is the "Great Green Goal" obsolete, and why will our existing problems keep growing? What is the role of celebrities, politicians, or even activists? Have they really accomplished anything?

  • The tools I utilized: Excel|Tableau|Canva

 

Air Transport Database Design

Before the pandemic, air travel had reached an all-time peak. At the same time, there were many disturbing incidents such as long delays and overbooked flights. Passengers were even dragged off planes, and airlines paid large fines for extreme delays. In our BCIT course project, my group and I therefore looked at the problem from the perspective of IATA to see how data could be used to analyze the inefficiency of airline operations. The database is modelled on the 7Ws dimensional modelling technique.

  • The goal of the database is to record passenger counts and delay figures for each flight

  • Developing Star Schema ER diagram in MySQL Workbench

  • Running SQL queries to narrow down problems and answer core questions

  • Transforming query results into actionable insights to tackle flight overbooking and delays

  • The project includes 3 sections: Project Overview, Database Design Process, SQL Query
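
To make the star schema idea concrete, here is a small sketch with one fact table and two dimension tables, followed by a query on delays and overbooking. It uses Python's built-in `sqlite3` with an in-memory database rather than MySQL so that it runs on its own. The table names, column names and sample rows are hypothetical and are not taken from the project's actual ER diagram.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL schema (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_airline (airline_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, flight_date TEXT);

-- Fact table: one row per flight, with measures and keys to the dimensions.
CREATE TABLE fact_flight (
    flight_id INTEGER PRIMARY KEY,
    airline_id INTEGER REFERENCES dim_airline(airline_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    seats INTEGER,
    passengers_booked INTEGER,
    delay_minutes INTEGER
);

INSERT INTO dim_airline VALUES (1, 'Airline A'), (2, 'Airline B');
INSERT INTO dim_date VALUES (1, '2019-07-01');
INSERT INTO fact_flight VALUES
    (1, 1, 1, 100, 105, 30),
    (2, 1, 1, 100, 90, 10),
    (3, 2, 1, 150, 150, 0);
""")

# Average delay and number of overbooked flights per airline.
rows = conn.execute("""
SELECT a.name,
       AVG(f.delay_minutes) AS avg_delay,
       SUM(f.passengers_booked > f.seats) AS overbooked_flights
FROM fact_flight AS f
JOIN dim_airline AS a ON a.airline_id = f.airline_id
GROUP BY a.name
ORDER BY a.name
""").fetchall()

for name, avg_delay, overbooked in rows:
    print(name, avg_delay, overbooked)
```

Keeping the measures (passenger counts, delay minutes) in the fact table and the descriptive attributes in dimensions means each core question becomes a join plus a GROUP BY.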