Introduction

The New York City Taxi and Limousine Commission (TLC) provides comprehensive datasets on yellow taxi trips, including pickup and dropoff times, trip distances, and associated costs. This project analyzes yellow taxi trips from August 2024, using 1 million rows randomly sampled from the 3 million original records. The goal is to identify spatio-temporal demand patterns using clustering methods that jointly account for spatial proximity and temporal demand profiles.
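The sampled file read below (cleaned_NYC_data.csv) was prepared ahead of time; the sketch below shows one way such a sample could be drawn from the monthly TLC release. The Parquet file name, the use of the arrow package, and the seed are assumptions, and any additional cleaning (e.g., attaching zone coordinates) is not shown.

library(arrow)   # assumption: the TLC distributes monthly trip data as Parquet
library(dplyr)

set.seed(2024)                                                  # reproducible sample
full_trips <- read_parquet("yellow_tripdata_2024-08.parquet")   # hypothetical file name
sampled_trips <- full_trips %>% slice_sample(n = 1000000)       # 1 million random rows
write.csv(sampled_trips, "cleaned_NYC_data.csv", row.names = FALSE)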

Data Preparation and Cleaning

Load required libraries

library(dplyr)
library(wavelets)
library(reshape2)
library(tidyr)
library(lubridate)
library(ggplot2)
library(spatstat)
library(igraph)
library(rjags)
library(fields)
library(stringr)
library(ClustGeo)
library(spdep)
library(knitr)

Loading and Cleaning the Data

data <- read.csv("cleaned_NYC_data.csv")

# Data Cleaning
data <- data %>%
  mutate(
    trip_id = row_number(),
    tpep_pickup_datetime = as.POSIXct(tpep_pickup_datetime, format = "%Y-%m-%d %H:%M:%S"),
    tpep_dropoff_datetime = as.POSIXct(tpep_dropoff_datetime, format = "%Y-%m-%d %H:%M:%S"),
    Duration = as.numeric(difftime(tpep_dropoff_datetime, tpep_pickup_datetime, units = "mins"))
  ) %>%
  filter(!is.na(Duration), Duration > 0)

# Removing Outliers
duration_limits <- quantile(data$Duration, probs = c(0.25, 0.75), na.rm = TRUE)
duration_iqr <- IQR(data$Duration, na.rm = TRUE)
distance_limits <- quantile(data$trip_distance, probs = c(0.25, 0.75), na.rm = TRUE)
distance_iqr <- IQR(data$trip_distance, na.rm = TRUE)

data <- data %>%
  filter(
    Duration >= (duration_limits[1] - 1.5 * duration_iqr) & Duration <= (duration_limits[2] + 1.5 * duration_iqr),
    trip_distance >= (distance_limits[1] - 1.5 * distance_iqr) & trip_distance <= (distance_limits[2] + 1.5 * distance_iqr)
  )
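As a quick sanity check (a sketch, not part of the original pipeline), the retained durations and distances can be inspected to confirm they fall inside the IQR fences:

summary(data$Duration)        # trip duration in minutes after filtering
summary(data$trip_distance)   # trip distance in miles after filtering
nrow(data)                    # number of rows retained after outlier removal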

Methodology

Each taxi trip has a pickup location (origin) and a dropoff location (destination). In this analysis, we cluster the pickup and dropoff information separately, then match the clusters to trace traffic movement over time and space and to identify demand patterns such as commuting into central business districts or travel to entertainment zones.

To analyze these demand patterns, we aggregate the data into counts of taxi trips at each taxi zone by hour of day and day of week.

Temporal Basis Functions

We use the Fourier transform to extract temporal features from the aggregated demand data. The temporal demand \(f_s(t)\) at each location \(s\) and normalized time \(t \in [0, 1)\) is expressed as a truncated Fourier series: \[ f_s(t) = \beta_{s,0} + \sum_{k=1}^m \beta_{s,k} \cos(2\pi k t) + \sum_{k=1}^m \gamma_{s,k} \sin(2\pi k t), \] where \(m\) is the number of harmonics retained.
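In practice, the \(\beta_{s,k}\) and \(\gamma_{s,k}\) coefficients can be read off the discrete Fourier transform of each location's hourly demand series. The sketch below uses a hypothetical demand vector y of length 168 (24 hours × 7 days); it illustrates the formula above and is not the project code.

set.seed(1)
y <- rpois(168, lambda = 20)              # hypothetical weekly hourly demand profile
n <- length(y)
m <- 4                                    # number of harmonics retained
fft_bins <- fft(y - mean(y))              # DFT of the centered series
beta0 <- mean(y)                          # intercept beta_{s,0}
beta  <-  2 * Re(fft_bins[2:(m + 1)]) / n # cosine coefficients beta_{s,k}
gamma <- -2 * Im(fft_bins[2:(m + 1)]) / n # sine coefficients gamma_{s,k}

# Reconstruct the smoothed profile from the retained harmonics
t_norm <- (seq_len(n) - 1) / n            # time rescaled to [0, 1)
f_hat  <- beta0 + sapply(t_norm, function(t) {
  sum(beta * cos(2 * pi * (1:m) * t) + gamma * sin(2 * pi * (1:m) * t))
})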

Spatio-Temporal Clustering

We use the ClustGeo package for clustering, balancing temporal demand patterns with spatial contiguity. The combined dissimilarity measure is defined as: \[ \Delta_\alpha = (1 - \alpha) D_0 + \alpha D_1, \] where \(D_0\) is the dissimilarity between the scaled temporal (Fourier) features, \(D_1\) is a spatial dissimilarity built from a k-nearest-neighbour adjacency, and \(\alpha \in [0, 1]\) controls the trade-off between temporal similarity and spatial contiguity.
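To make the mixing formula concrete, the toy sketch below spells out \(\Delta_\alpha\) on small placeholder matrices and clusters on it directly; ClustGeo's hclustgeo() performs this mixing internally, so this block is for illustration only.

set.seed(2)
feature_toy <- matrix(rnorm(20), nrow = 10)   # 10 fake temporal feature vectors
coords_toy  <- matrix(runif(20), nrow = 10)   # 10 fake locations

D0_toy <- dist(scale(feature_toy))            # temporal dissimilarity
D1_toy <- dist(coords_toy)                    # spatial dissimilarity

alpha_toy <- 0.2
delta <- as.dist((1 - alpha_toy) * as.matrix(D0_toy) / max(D0_toy) +
                  alpha_toy      * as.matrix(D1_toy) / max(D1_toy))

# Ward clustering on the mixed dissimilarity, cut into 3 groups
tree_toy <- hclust(delta, method = "ward.D2")
cutree(tree_toy, k = 3)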

Pickup and Dropoff Clustering

Pickup Cluster

Pickup Temporal Basis Functions

Use the Fourier Transform to extract temporal features from aggregated demand data.

pickup_data <- data %>%
  select(trip_id, PU_Longitude, PU_Latitude, PULocationID, tpep_pickup_datetime) %>%
  mutate(
    pickup_hour = hour(tpep_pickup_datetime),
    pickup_day = wday(tpep_pickup_datetime, label = TRUE)
  )

hourly_demand <- pickup_data %>%
  group_by(PU_Longitude, PU_Latitude, PULocationID, pickup_hour, pickup_day) %>%
  summarize(demand = n(), .groups = "drop") %>%
  arrange(pickup_day, pickup_hour)

demand_matrix <- hourly_demand %>%
  pivot_wider(
    names_from = c(pickup_hour, pickup_day),
    values_from = demand,
    values_fill = 0
  ) %>%
  arrange(PULocationID, PU_Longitude, PU_Latitude)

## Extract Temporal Features
temporal_demand <- demand_matrix %>%
  select(-PU_Longitude, -PU_Latitude, -PULocationID)

fourier_coefficients <- apply(temporal_demand, 1, function(row) {
  fft_result <- fft(row - mean(row))  # Center the series and compute the FFT
  # Keep the real part of bins 1-4 (the DC term, which is ~0 after centering,
  # plus the first three harmonics) and the imaginary part of bins 2-5 (the
  # first four harmonics; the imaginary part of the DC bin is always zero).
  c(Re(fft_result)[1:4], Im(fft_result)[2:5])
}) %>% t()

colnames(fourier_coefficients) <- c("DC", "Real1", "Real2", "Real3", "Imag1", "Imag2", "Imag3", "Imag4")

fourier_data <- demand_matrix %>%
  select(PU_Longitude, PU_Latitude, PULocationID) %>%
  bind_cols(as.data.frame(fourier_coefficients))

scaled_temporal_data <- scale(as.matrix(fourier_data %>% select(-PU_Longitude, -PU_Latitude, -PULocationID)))

Pickup Spatial Constraints and Clustering

Use ClustGeo to determine an appropriate alpha and perform spatially constrained temporal clustering.

coords <- as.matrix(fourier_data %>% select(PU_Longitude, PU_Latitude))
k <- 5
knn <- knearneigh(coords, k = k)
nb <- knn2nb(knn)
nb <- make.sym.nb(nb)  
adj_matrix <- nb2mat(nb, style = "B", zero.policy = TRUE)
diag(adj_matrix) <- 1

D1 <- as.dist(1 - adj_matrix)
D0 <- dist(scaled_temporal_data) 

tree_unconstrained <- hclustgeo(D0)
plot(tree_unconstrained, hang = -1, label = FALSE, main = "Unconstrained Clustering with D0")
rect.hclust(tree_unconstrained, k = 5, border = c(4, 5, 3, 2, 1))

range_alpha <- seq(0, 1, 0.1)
K <- 5

png("pickup_alpha_eval.png", width = 500, height = 500, res = 120)
cr <- choicealpha(D0, D1, range.alpha = range_alpha, K = K, graph = TRUE)
dev.off()
alpha <- 0.2
tree_constrained <- hclustgeo(D0, D1, alpha = alpha)
P5 <- cutree(tree_constrained, k = 5)

fourier_data <- fourier_data %>%
  mutate(cluster = P5)

# Save pickup cluster to pickup_data
fourier_cluster <- fourier_data %>% select(PULocationID, cluster)
pickup_data <- merge(pickup_data, fourier_cluster, by = "PULocationID")

Pickup Cluster Visualization

ggplot(pickup_data , aes(x = pickup_hour, fill = factor(cluster))) +
  geom_bar(position = "dodge") +
  labs(
    title = "Pick Up Demand by Hour and Cluster",
    x = "Hour of the Day",
    y = "Number of Pickups",
    fill = "Cluster"
  ) +
  theme_minimal()

ggplot(pickup_data , aes(x = pickup_day, fill = factor(cluster))) +
  geom_bar(position = "dodge") +
  labs(
    title = "Pick Up Demand by Day and Cluster",
    x = "Day of the week",
    y = "Number of Pickups",
    fill = "Cluster"
  ) +
  theme_minimal()
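The drop-off section below includes a map of its clusters; the same spatial view can be produced for the pickup clusters (a short sketch reusing fourier_data from above):

ggplot(fourier_data, aes(x = PU_Longitude, y = PU_Latitude, color = as.factor(cluster))) +
  geom_point() +
  labs(title = "Clusters with Spatial Constraints (Pickup)", color = "Cluster") +
  theme_minimal()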

Dropoff Cluster

Carry out the same analysis as for the pickup data.

dropoff_data <- data %>%
  select(trip_id, DO_Longitude, DO_Latitude, DOLocationID, tpep_dropoff_datetime) %>%
  mutate(
    dropoff_hour = hour(tpep_dropoff_datetime),
    dropoff_day = wday(tpep_dropoff_datetime, label = TRUE)
  )

hourly_demand_dropoff <- dropoff_data %>%
  group_by(DO_Longitude, DO_Latitude, DOLocationID, dropoff_hour, dropoff_day) %>%
  summarize(demand = n(), .groups = "drop") %>%
  arrange(dropoff_day, dropoff_hour)

demand_matrix_dropoff <- hourly_demand_dropoff %>%
  pivot_wider(
    names_from = c(dropoff_hour, dropoff_day),
    values_from = demand,
    values_fill = 0
  ) %>%
  arrange(DOLocationID, DO_Longitude, DO_Latitude)

temporal_demand_dropoff <- demand_matrix_dropoff %>%
  select(-DO_Longitude, -DO_Latitude, -DOLocationID)

fourier_coefficients_dropoff <- apply(temporal_demand_dropoff, 1, function(row) {
  fft_result <- fft(row - mean(row))  # Center data and compute FFT
  c(Re(fft_result)[1:4], Im(fft_result)[2:5])  # Extract key coefficients
}) %>% t()

colnames(fourier_coefficients_dropoff) <- c("DC", "Real1", "Real2", "Real3", "Imag1", "Imag2", "Imag3", "Imag4")

fourier_data_dropoff <- demand_matrix_dropoff %>%
  select(DO_Longitude, DO_Latitude, DOLocationID) %>%
  bind_cols(as.data.frame(fourier_coefficients_dropoff))

scaled_temporal_data_dropoff <- scale(as.matrix(fourier_data_dropoff %>% select(-DO_Longitude, -DO_Latitude, -DOLocationID)))

coords_dropoff <- as.matrix(fourier_data_dropoff %>% select(DO_Longitude, DO_Latitude))
k <- 5
knn_dropoff <- knearneigh(coords_dropoff, k = k)
nb_dropoff <- knn2nb(knn_dropoff)
nb_dropoff <- make.sym.nb(nb_dropoff)  
adj_matrix_dropoff <- nb2mat(nb_dropoff, style = "B", zero.policy = TRUE)
diag(adj_matrix_dropoff) <- 1

D1_dropoff <- as.dist(1 - adj_matrix_dropoff)
D0_dropoff <- dist(scaled_temporal_data_dropoff) 

tree_unconstrained_dropoff <- hclustgeo(D0_dropoff)
plot(tree_unconstrained_dropoff, hang = -1, label = FALSE, main = "Unconstrained Clustering with D0 (Drop-off)")
rect.hclust(tree_unconstrained_dropoff, k = 5, border = c(4, 5, 3, 2, 1))

range_alpha <- seq(0, 1, 0.1)
K <- 5

png("dropoff_alpha_eval.png", width = 500, height = 500, res = 120)
cr <- choicealpha(D0 = D0_dropoff, D1 = D1_dropoff, range.alpha = range_alpha, K = K, graph = TRUE)
dev.off()
alpha <- 0.2
tree_constrained_dropoff <- hclustgeo(D0_dropoff, D1_dropoff, alpha = alpha)
P5_dropoff <- cutree(tree_constrained_dropoff, k = 5)

fourier_data_dropoff <- fourier_data_dropoff %>%
  mutate(cluster = P5_dropoff)

ggplot(fourier_data_dropoff, aes(x = DO_Longitude, y = DO_Latitude, color = as.factor(cluster))) +
  geom_point() +
  labs(title = "Clusters with Spatial Constraints (Drop-off)", color = "Cluster") +
  theme_minimal()

fourier_dropoff_cluster <- fourier_data_dropoff %>% select(DOLocationID, cluster)

dropoff_data <- merge(dropoff_data, fourier_dropoff_cluster, by = "DOLocationID")

ggplot(dropoff_data, aes(x = dropoff_hour, fill = factor(cluster))) +
  geom_bar(position = "dodge") +
  labs(
    title = "Drop-off Demand by Hour and Cluster",
    x = "Hour of the Day",
    y = "Number of Drop-offs",
    fill = "Cluster"
  ) +
  theme_minimal()

ggplot(dropoff_data, aes(x = dropoff_day, fill = factor(cluster))) +
  geom_bar(position = "dodge") +
  labs(
    title = "Drop-off Demand by Day and Cluster",
    x = "Day of the Week",
    y = "Number of Drop-offs",
    fill = "Cluster"
  ) +
  theme_minimal()

Results: Traffic Flow Analysis

Matching the Clusters

data_with_clusters <- data %>%
  left_join(pickup_data %>% select(trip_id, pickup_cluster = cluster), by = "trip_id") %>%
  left_join(dropoff_data %>% select(trip_id, dropoff_cluster = cluster), by = "trip_id")

traffic_flow <- data_with_clusters %>%
  group_by(pickup_cluster, dropoff_cluster) %>%
  summarize(flow_count = n(), .groups = "drop")

arrange(traffic_flow, desc(flow_count))
## # A tibble: 21 × 3
##    pickup_cluster dropoff_cluster flow_count
##             <int>           <int>      <int>
##  1              4               3     199527
##  2              3               3     146144
##  3              3               5     142982
##  4              4               5      88819
##  5              3               4      75719
##  6              5               3      66672
##  7              4               4      53443
##  8              2               2      30581
##  9              5               5      17219
## 10              5               4      11063
## # ℹ 11 more rows
ggplot(traffic_flow, aes(x = as.factor(pickup_cluster), y = as.factor(dropoff_cluster), fill = flow_count)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = median(traffic_flow$flow_count)) +
  labs(
    title = "Traffic Flow Between Pickup and Drop-off Clusters",
    x = "Pickup Cluster",
    y = "Drop-off Cluster",
    fill = "Number of Trips"
  ) +
  theme_minimal()
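The igraph package loaded at the top is not otherwise used here; as an optional complementary view (a sketch, not part of the original analysis), the same flow table can be drawn as a weighted directed graph between clusters:

# Cluster-to-cluster flows as a weighted directed graph
flow_graph <- graph_from_data_frame(
  traffic_flow %>% filter(!is.na(pickup_cluster), !is.na(dropoff_cluster)),
  directed = TRUE
)
plot(
  flow_graph,
  edge.width = 1 + 8 * E(flow_graph)$flow_count / max(E(flow_graph)$flow_count),
  edge.arrow.size = 0.4,
  main = "Traffic Flow Between Pickup and Drop-off Clusters"
)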

top_traffic_flows <- traffic_flow %>%
  arrange(desc(flow_count)) %>%
  slice(1:3)

data_top_flows <- data_with_clusters %>%
  semi_join(top_traffic_flows, by = c("pickup_cluster", "dropoff_cluster"))

top_traffic_flows <- data_top_flows %>%
  group_by(pickup_cluster, dropoff_cluster) %>%
  summarize(
    avg_trip_distance = round(mean(trip_distance, na.rm = TRUE), 3),
    median_trip_distance = round(median(trip_distance, na.rm = TRUE), 3),
    avg_passenger_count = round(mean(passenger_count, na.rm = TRUE), 3),
    avg_fare_amount = round(mean(fare_amount, na.rm = TRUE), 3),
    avg_duration = round(mean(Duration, na.rm = TRUE), 3),
    total_trips = n(),
    .groups = "drop"
  )

top_flows <- traffic_flow %>%
  arrange(desc(flow_count)) %>%
  head(10)

ggplot(top_flows, aes(x = reorder(paste(pickup_cluster, dropoff_cluster, sep = " → "), -flow_count), y = flow_count, fill = flow_count)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Top 10 Traffic Flows Between Clusters",
    x = "Pickup → Drop-off Cluster",
    y = "Number of Trips",
    fill = "Number of Trips"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Recall that clusters 3, 4, and 5 for both pickups and drop-offs lie within Manhattan. Most trips were both picked up and dropped off in Manhattan.



kable(top_traffic_flows, format = "latex", booktabs = TRUE, digits = 3, 
      col.names = c(
        "Pickup Cluster", 
        "Dropoff Cluster", 
        "Avg Trip Distance (miles)", 
        "Median Trip Distance (miles)", 
        "Avg Passenger Count", 
        "Avg Fare Amount ($)", 
        "Avg Duration (mins)", 
        "Total Trips"
      ),
      caption = "Summary Statistics for Top 3 Traffic Flows")

Traffic Into Manhattan

To examine traffic into Manhattan, we consider trips from pickup clusters 1 and 2 (outside Manhattan) to dropoff clusters 3, 4, and 5 (within Manhattan).

hourly_demand_into_city <- data_with_clusters %>%
  filter(pickup_cluster %in% c(1, 2), dropoff_cluster %in% c(3, 4, 5)) %>%
  select(trip_id, PU_Longitude, PU_Latitude, PULocationID, tpep_pickup_datetime) %>%
  mutate(
    pickup_hour = hour(tpep_pickup_datetime),
    pickup_day = wday(tpep_pickup_datetime, label = TRUE)
  ) %>%
  group_by(PU_Longitude, PU_Latitude, PULocationID, pickup_hour, pickup_day) %>%
  summarize(demand = n(), .groups = "drop")

demand_matrix_into_city <- hourly_demand_into_city %>%
  pivot_wider(
    names_from = c(pickup_hour, pickup_day),
    values_from = demand,
    values_fill = 0
  ) %>%
  arrange(PULocationID, PU_Longitude, PU_Latitude)

daily_demand_into_city <- hourly_demand_into_city %>%
  group_by(pickup_day) %>%
  summarize(total_demand = sum(demand), .groups = "drop")

ggplot(daily_demand_into_city, aes(x = pickup_day, y = total_demand, fill = pickup_day)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Total Into Manhattan Demand by Day",
    x = "Day of the Week",
    y = "Total Demand",
    fill = "Day"
  ) +
  theme_minimal()

hourly_heatmap_data <- hourly_demand_into_city %>%
  group_by(pickup_hour, pickup_day) %>%
  summarize(total_demand = sum(demand), .groups = "drop")

ggplot(hourly_heatmap_data, aes(x = pickup_hour, y = pickup_day, fill = total_demand)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(
    title = "Demand into Manhattan (Hourly by Day)",
    x = "Hour of the Day",
    y = "Day of the Week",
    fill = "Demand"
  ) +
  theme_minimal()

Out of Manhattan

hourly_demand_out_of_city <- data_with_clusters %>%
  filter(pickup_cluster %in% c(3, 4, 5), dropoff_cluster %in% c(1, 2)) %>%
  select(trip_id, PU_Longitude, PU_Latitude, PULocationID, tpep_pickup_datetime) %>%
  mutate(
    pickup_hour = hour(tpep_pickup_datetime),
    pickup_day = wday(tpep_pickup_datetime, label = TRUE)
  ) %>%
  group_by(PU_Longitude, PU_Latitude, PULocationID, pickup_hour, pickup_day) %>%
  summarize(demand = n(), .groups = "drop")

demand_matrix_out_of_city <- hourly_demand_out_of_city %>%
  pivot_wider(
    names_from = c(pickup_hour, pickup_day),
    values_from = demand,
    values_fill = 0
  ) %>%
  arrange(PULocationID, PU_Longitude, PU_Latitude)
daily_demand_out_of_city <- hourly_demand_out_of_city %>%
  group_by(pickup_day) %>%
  summarize(total_demand = sum(demand), .groups = "drop")

ggplot(daily_demand_out_of_city, aes(x = pickup_day, y = total_demand, fill = pickup_day)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Total Out-of-City Demand by Day",
    x = "Day of the Week",
    y = "Total Demand",
    fill = "Day"
  ) +
  theme_minimal()

hourly_heatmap_out_of_city <- hourly_demand_out_of_city %>%
  group_by(pickup_hour, pickup_day) %>%
  summarize(total_demand = sum(demand), .groups = "drop")

# Plot hourly heatmap
ggplot(hourly_heatmap_out_of_city, aes(x = pickup_hour, y = pickup_day, fill = total_demand)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(
    title = "Demand Out of City (Hourly by Day)",
    x = "Hour of the Day",
    y = "Day of the Week",
    fill = "Demand"
  ) +
  theme_minimal()

Within Manhattan

hourly_demand_within_city <- data_with_clusters %>%
  filter(pickup_cluster %in% c(3, 4, 5), dropoff_cluster %in% c(3, 4, 5)) %>%
  select(trip_id, PU_Longitude, PU_Latitude, PULocationID, tpep_pickup_datetime) %>%
  mutate(
    pickup_hour = hour(tpep_pickup_datetime),
    pickup_day = wday(tpep_pickup_datetime, label = TRUE)
  ) %>%
  group_by(PU_Longitude, PU_Latitude, PULocationID, pickup_hour, pickup_day) %>%
  summarize(demand = n(), .groups = "drop")

demand_matrix_within_city <- hourly_demand_within_city %>%
  pivot_wider(
    names_from = c(pickup_hour, pickup_day),
    values_from = demand,
    values_fill = 0
  ) %>%
  arrange(PULocationID, PU_Longitude, PU_Latitude)

daily_demand_within_city <- hourly_demand_within_city %>%
  group_by(pickup_day) %>%
  summarize(total_demand = sum(demand), .groups = "drop")

ggplot(daily_demand_within_city, aes(x = pickup_day, y = total_demand, fill = pickup_day)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Total Within-City Demand by Day",
    x = "Day of the Week",
    y = "Total Demand",
    fill = "Day"
  ) +
  theme_minimal()

hourly_heatmap_within_city <- hourly_demand_within_city %>%
  group_by(pickup_hour, pickup_day) %>%
  summarize(total_demand = sum(demand), .groups = "drop")

ggplot(hourly_heatmap_within_city, aes(x = pickup_hour, y = pickup_day, fill = total_demand)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(
    title = "Within-City Demand Heatmap (Hourly by Day)",
    x = "Hour of the Day",
    y = "Day of the Week",
    fill = "Demand"
  ) +
  theme_minimal()

Summary Statistics

into_city_summary <- data_with_clusters %>%
  filter(pickup_cluster %in% c(1, 2), dropoff_cluster %in% c(3, 4, 5)) %>%
  group_by(pickup_cluster, dropoff_cluster) %>%
  summarize(
    avg_trip_distance = round(mean(trip_distance, na.rm = TRUE), 3),
    median_trip_distance = round(median(trip_distance, na.rm = TRUE), 3),
    avg_passenger_count = round(mean(passenger_count, na.rm = TRUE), 3),
    avg_fare_amount = round(mean(fare_amount, na.rm = TRUE), 3),
    avg_duration = round(mean(Duration, na.rm = TRUE), 3),
    total_trips = n(),
    .groups = "drop"
  )
out_of_city_summary <- data_with_clusters %>%
  filter(pickup_cluster %in% c(3, 4, 5), dropoff_cluster %in% c(1, 2)) %>%
  group_by(pickup_cluster, dropoff_cluster) %>%
  summarize(
    avg_trip_distance = round(mean(trip_distance, na.rm = TRUE), 3),
    median_trip_distance = round(median(trip_distance, na.rm = TRUE), 3),
    avg_passenger_count = round(mean(passenger_count, na.rm = TRUE), 3),
    avg_fare_amount = round(mean(fare_amount, na.rm = TRUE), 3),
    avg_duration = round(mean(Duration, na.rm = TRUE), 3),
    total_trips = n(),
    .groups = "drop"
  )

within_city_summary <- data_with_clusters %>%
  filter(pickup_cluster %in% c(3, 4, 5), dropoff_cluster %in% c(3, 4, 5)) %>%
  group_by(pickup_cluster, dropoff_cluster) %>%
  summarize(
    avg_trip_distance = round(mean(trip_distance, na.rm = TRUE), 3),
    median_trip_distance = round(median(trip_distance, na.rm = TRUE), 3),
    avg_passenger_count = round(mean(passenger_count, na.rm = TRUE), 3),
    avg_fare_amount = round(mean(fare_amount, na.rm = TRUE), 3),
    avg_duration = round(mean(Duration, na.rm = TRUE), 3),
    total_trips = n(),
    .groups = "drop"
  )



into_city_summary_stats <- into_city_summary %>%
  summarize(
    avg_trip_distance_mean = mean(avg_trip_distance, na.rm = TRUE),
    avg_trip_distance_sd = sd(avg_trip_distance, na.rm = TRUE),
    avg_fare_amount_mean = mean(avg_fare_amount, na.rm = TRUE),
    avg_fare_amount_sd = sd(avg_fare_amount, na.rm = TRUE),
    avg_duration_mean = mean(avg_duration, na.rm = TRUE),
    avg_duration_sd = sd(avg_duration, na.rm = TRUE),
    avg_passenger_count_mean = mean(avg_passenger_count, na.rm = TRUE),
    avg_passenger_count_sd = sd(avg_passenger_count, na.rm = TRUE),
    total_trips_mean = sum(total_trips, na.rm = TRUE)
  )

out_of_city_summary_stats <- out_of_city_summary%>%
  summarize(
    avg_trip_distance_mean = mean(avg_trip_distance, na.rm = TRUE),
    avg_trip_distance_sd = sd(avg_trip_distance, na.rm = TRUE),
    avg_fare_amount_mean = mean(avg_fare_amount, na.rm = TRUE),
    avg_fare_amount_sd = sd(avg_fare_amount, na.rm = TRUE),
    avg_duration_mean = mean(avg_duration, na.rm = TRUE),
    avg_duration_sd = sd(avg_duration, na.rm = TRUE),
    avg_passenger_count_mean = mean(avg_passenger_count, na.rm = TRUE),
    avg_passenger_count_sd = sd(avg_passenger_count, na.rm = TRUE),
    total_trips_mean = sum(total_trips, na.rm = TRUE)
  )



within_city_summary_stats <- within_city_summary %>%
  summarize(
    avg_trip_distance_mean = mean(avg_trip_distance, na.rm = TRUE),
    avg_trip_distance_sd = sd(avg_trip_distance, na.rm = TRUE),
    avg_fare_amount_mean = mean(avg_fare_amount, na.rm = TRUE),
    avg_fare_amount_sd = sd(avg_fare_amount, na.rm = TRUE),
    avg_duration_mean = mean(avg_duration, na.rm = TRUE),
    avg_duration_sd = sd(avg_duration, na.rm = TRUE),
    avg_passenger_count_mean = mean(avg_passenger_count, na.rm = TRUE),
    avg_passenger_count_sd = sd(avg_passenger_count, na.rm = TRUE),
    total_trips_mean = sum(total_trips, na.rm = TRUE)
  )
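The direction-specific summaries above repeat the same summarize() block; a small helper (a sketch, not part of the original code) would keep them consistent and shorter:

summarize_flows <- function(df) {
  df %>%
    group_by(pickup_cluster, dropoff_cluster) %>%
    summarize(
      avg_trip_distance    = round(mean(trip_distance, na.rm = TRUE), 3),
      median_trip_distance = round(median(trip_distance, na.rm = TRUE), 3),
      avg_passenger_count  = round(mean(passenger_count, na.rm = TRUE), 3),
      avg_fare_amount      = round(mean(fare_amount, na.rm = TRUE), 3),
      avg_duration         = round(mean(Duration, na.rm = TRUE), 3),
      total_trips          = n(),
      .groups = "drop"
    )
}

# Example: the into-Manhattan summary becomes a one-liner
into_city_summary <- data_with_clusters %>%
  filter(pickup_cluster %in% c(1, 2), dropoff_cluster %in% c(3, 4, 5)) %>%
  summarize_flows()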

combined_summary_stats <- bind_rows(
  into_city_summary %>% mutate(type = "Into City"),
  out_of_city_summary %>% mutate(type = "Out of City"),
  within_city_summary %>% mutate(type = "Within City")
)

kable(combined_summary_stats, format = "html", booktabs = TRUE, caption = "Combined Summary of Traffic Flows") 
Combined Summary of Traffic Flows

| Pickup Cluster | Dropoff Cluster | Avg Distance (mi) | Median Distance (mi) | Avg Passengers | Avg Fare ($) | Avg Duration (min) | Total Trips | Flow Type |
|---|---|---|---|---|---|---|---|---|
| 2 | 3 | 4.264 | 4.280 | 1.385 | 26.889 | 22.079 | 2534 | Into City |
| 2 | 4 | 4.517 | 5.025 | 1.251 | 24.687 | 19.448 | 1986 | Into City |
| 2 | 5 | 4.624 | 5.000 | 1.315 | 28.274 | 21.754 | 1235 | Into City |
| 3 | 1 | 0.460 | 0.000 | 1.000 | 39.775 | 22.646 | 4 | Out of City |
| 3 | 2 | 4.329 | 4.310 | 1.274 | 23.030 | 20.401 | 10453 | Out of City |
| 4 | 1 | 3.712 | 4.075 | 2.333 | 57.123 | 24.438 | 4 | Out of City |
| 4 | 2 | 5.255 | 5.370 | 1.305 | 27.290 | 24.913 | 6605 | Out of City |
| 5 | 2 | 4.109 | 3.940 | 1.310 | 22.328 | 20.317 | 4559 | Out of City |
| 3 | 3 | 2.173 | 1.860 | 1.333 | 15.179 | 14.475 | 146144 | Within City |
| 3 | 4 | 1.953 | 1.530 | 1.291 | 12.889 | 10.605 | 75719 | Within City |
| 3 | 5 | 1.454 | 1.150 | 1.260 | 11.037 | 9.613 | 142982 | Within City |
| 4 | 3 | 1.490 | 1.290 | 1.389 | 12.263 | 11.730 | 199527 | Within City |
| 4 | 4 | 2.911 | 2.700 | 1.377 | 17.086 | 15.282 | 53443 | Within City |
| 4 | 5 | 1.638 | 1.500 | 1.323 | 13.017 | 12.956 | 88819 | Within City |
| 5 | 3 | 1.575 | 1.360 | 1.419 | 11.925 | 10.925 | 66672 | Within City |
| 5 | 4 | 3.297 | 2.790 | 1.404 | 18.517 | 16.328 | 11063 | Within City |
| 5 | 5 | 2.988 | 2.800 | 1.410 | 17.833 | 16.684 | 17219 | Within City |

References

  1. NYC Taxi Zones Dataset, available at NYC Open Data Portal, 2024.
  2. ClustGeo Package Documentation, available at CRAN, 2024.