San Diego Trip Destination Prediction

Introduction

Our project aims to enhance a segment of SANDAG's Activity-Based Model (ABM). ABMs derive travel demand from individuals' daily activities by simulating the travel decisions of individuals and households. We focus on the ABM Trip Destination component, which forecasts where individuals and households travel and serves as a crucial step in determining activity patterns and travel demand.

ABMs have significant implications for SANDAG's infrastructure and urban planning policies in San Diego. Currently, the trip destination component takes 40 minutes to run and generates 12 million trips. Our project applies machine learning to improve the computational efficiency of SANDAG's trip destination component while preserving its predictive accuracy.

Data and Exploratory Data Analysis

The ABM uses a statistical model to generate synthetic population data from San Diego census data. The synthetic population datasets offer detailed sociodemographic profiles of regional households and individuals, including age, race, and education level, and serve as the foundation of the pipeline.

The statistical model then generates a set of synthetic trips for the synthetic population. The synthetic tours dataset contains tour details such as origin, purpose, and mode. The project uses the synthetic population and trips data to train the model and predict trip destinations. More information about each feature in the datasets is available in SANDAG's ABM GitHub repository.

Because the full San Diego County dataset contains 12 million trips, the data was filtered to households in Districts 1, 2, 5, and 6. These districts were selected as areas of interest because they represent communities of diverse socioeconomic backgrounds. Trip destinations were recorded as Traffic Analysis Zones (TAZ), which were aggregated into Land Use Zones (LUZ).
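The filtering and aggregation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the district list comes from the text, but every record, field name (household_id, district, dest_taz, dest_luz), and zone ID below is hypothetical and does not reflect SANDAG's actual schema or TAZ-to-LUZ crosswalk.

```python
from collections import Counter

# Districts named in the text. All records, field names, and IDs below are hypothetical.
SELECTED_DISTRICTS = {1, 2, 5, 6}

households = [
    {"household_id": 101, "district": 1},
    {"household_id": 102, "district": 3},
    {"household_id": 103, "district": 5},
]

trips = [
    {"household_id": 101, "dest_taz": 4001},
    {"household_id": 101, "dest_taz": 4002},
    {"household_id": 102, "dest_taz": 4003},
    {"household_id": 103, "dest_taz": 4003},
]

# Hypothetical crosswalk from fine-grained TAZ IDs to coarser LUZ IDs.
taz_to_luz = {4001: 10, 4002: 10, 4003: 11}


def filter_and_aggregate(households, trips, taz_to_luz, districts):
    """Keep trips made by households in the chosen districts and attach an LUZ label."""
    kept_ids = {h["household_id"] for h in households if h["district"] in districts}
    labeled = []
    for trip in trips:
        if trip["household_id"] in kept_ids:
            # Copy the trip and add the aggregated destination zone.
            labeled.append({**trip, "dest_luz": taz_to_luz[trip["dest_taz"]]})
    return labeled


labeled_trips = filter_and_aggregate(households, trips, taz_to_luz, SELECTED_DISTRICTS)

# Class distribution of the prediction target after aggregation.
luz_counts = Counter(trip["dest_luz"] for trip in labeled_trips)
print(luz_counts)
```

Aggregating TAZs into LUZs reduces the number of destination classes the classifier has to distinguish, at the cost of coarser spatial resolution.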

Interactive 3D Map for Synthetic Population Trip Destination Distribution

Methodology

Model selection on the synthetic dataset identified a Decision Tree Classifier as the most accurate model. The baseline Decision Tree achieved 90% training accuracy, 72% testing accuracy, a weighted-average F1 score of 72%, and a macro-average F1 score of 69%. To mitigate class imbalance, our team applied the Synthetic Minority Oversampling Technique (SMOTE) to balance the class distribution of the dataset. Grid search with cross-validation then tuned the model's hyperparameters to reduce overfitting and optimize the weighted-average F1 score.
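Because both weighted-average and macro-average F1 scores are reported, it may help to see how the two differ on an imbalanced target. The sketch below computes them from scratch in plain Python; the zone labels and predictions are made up for illustration and are not project results. Macro averaging treats every class equally, while weighted averaging weights each class by its number of true samples, so a model that favors the majority class tends to score higher on the weighted average.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} using F1 = 2*TP / (2*TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts the same."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by each class's count of true samples."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Hypothetical imbalanced example: zone 10 dominates, zones 11 and 12 are rare.
y_true = [10] * 6 + [11] * 2 + [12] * 2
y_pred = [10] * 6 + [10, 11] + [10, 10]

macro = macro_f1(y_true, y_pred)
weighted = weighted_f1(y_true, y_pred)
print(f"macro F1 = {macro:.3f}, weighted F1 = {weighted:.3f}")
```

In this toy example the rare zones are mostly misclassified, which pulls the macro average well below the weighted average. Oversampling with SMOTE is one way to address that kind of gap.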