Introduction

The New York City Taxi and Limousine Commission (TLC) provides comprehensive datasets on yellow taxi trips, including pickup and dropoff times, trip distances, and associated costs. This project analyzes yellow taxi trips from August 2024, using 1 million rows randomly sampled from the 3 million original records. The goal is to identify spatio-temporal demand patterns using clustering methods that jointly account for spatial proximity and temporal demand profiles.
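The sampled file read below (cleaned_NYC_data.csv) was prepared ahead of time; the sketch below shows one way such a sample could be drawn from the monthly TLC release. The Parquet file name, the use of the arrow package, and the seed are assumptions, and any additional cleaning (e.g., attaching zone coordinates) is not shown.

library(arrow)   # assumption: the TLC distributes monthly trip data as Parquet
library(dplyr)

set.seed(2024)                                                  # reproducible sample
full_trips <- read_parquet("yellow_tripdata_2024-08.parquet")   # hypothetical file name
sampled_trips <- full_trips %>% slice_sample(n = 1000000)       # 1 million random rows
write.csv(sampled_trips, "cleaned_NYC_data.csv", row.names = FALSE)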

Data Preparation and Cleaning

Load required libraries

library(dplyr)
library(wavelets)
library(reshape2)
library(tidyr)
library(lubridate)
library(ggplot2)
library(spatstat)
library(igraph)
library(rjags)
library(fields)
library(stringr)
library(ClustGeo)
library(spdep)
library(knitr)

Loading and Cleaning the Data

data <- read.csv("cleaned_NYC_data.csv")

# Data Cleaning
data <- data %>%
  mutate(
    trip_id = row_number(),
    tpep_pickup_datetime = as.POSIXct(tpep_pickup_datetime, format = "%Y-%m-%d %H:%M:%S"),
    tpep_dropoff_datetime = as.POSIXct(tpep_dropoff_datetime, format = "%Y-%m-%d %H:%M:%S"),
    Duration = as.numeric(difftime(tpep_dropoff_datetime, tpep_pickup_datetime, units = "mins"))
  ) %>%
  filter(!is.na(Duration), Duration > 0)

# Removing Outliers
duration_limits <- quantile(data$Duration, probs = c(0.25, 0.75), na.rm = TRUE)
duration_iqr <- IQR(data$Duration, na.rm = TRUE)
distance_limits <- quantile(data$trip_distance, probs = c(0.25, 0.75), na.rm = TRUE)
distance_iqr <- IQR(data$trip_distance, na.rm = TRUE)

data <- data %>%
  filter(
    Duration >= (duration_limits[1] - 1.5 * duration_iqr) & Duration <= (duration_limits[2] + 1.5 * duration_iqr),
    trip_distance >= (distance_limits[1] - 1.5 * distance_iqr) & trip_distance <= (distance_limits[2] + 1.5 * distance_iqr)
  )
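As a quick sanity check (a sketch, not part of the original pipeline), the retained durations and distances can be inspected to confirm they fall inside the IQR fences:

summary(data$Duration)        # trip duration in minutes after filtering
summary(data$trip_distance)   # trip distance in miles after filtering
nrow(data)                    # number of rows retained after outlier removal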

Methodology

Each taxi trip has a pickup location (origin) and a dropoff location (destination). In this analysis, we cluster the pickup and dropoff information separately, then match the clusters to trace traffic movement over time and space and to identify demand patterns such as commuting into central business districts or travel to entertainment zones.

To analyze these demand patterns, we aggregate the data into counts of taxi trips at each taxi zone by hour of day and day of week.

Temporal Basis Functions

We use the Fourier transform to extract temporal features from the aggregated demand data. The temporal demand \(f_s(t)\) at each location \(s\) and normalized time \(t \in [0, 1)\) is expressed as a truncated Fourier series: \[ f_s(t) = \beta_{s,0} + \sum_{k=1}^m \beta_{s,k} \cos(2\pi k t) + \sum_{k=1}^m \gamma_{s,k} \sin(2\pi k t), \] where \(m\) is the number of harmonics retained.
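In practice, the \(\beta_{s,k}\) and \(\gamma_{s,k}\) coefficients can be read off the discrete Fourier transform of each location's hourly demand series. The sketch below uses a hypothetical demand vector y of length 168 (24 hours × 7 days); it illustrates the formula above and is not the project code.

set.seed(1)
y <- rpois(168, lambda = 20)              # hypothetical weekly hourly demand profile
n <- length(y)
m <- 4                                    # number of harmonics retained
fft_bins <- fft(y - mean(y))              # DFT of the centered series
beta0 <- mean(y)                          # intercept beta_{s,0}
beta  <-  2 * Re(fft_bins[2:(m + 1)]) / n # cosine coefficients beta_{s,k}
gamma <- -2 * Im(fft_bins[2:(m + 1)]) / n # sine coefficients gamma_{s,k}

# Reconstruct the smoothed profile from the retained harmonics
t_norm <- (seq_len(n) - 1) / n            # time rescaled to [0, 1)
f_hat  <- beta0 + sapply(t_norm, function(t) {
  sum(beta * cos(2 * pi * (1:m) * t) + gamma * sin(2 * pi * (1:m) * t))
})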

Spatio-Temporal Clustering

We use the ClustGeo package for clustering, balancing temporal demand patterns with spatial contiguity. The combined dissimilarity measure is defined as: \[ \Delta_\alpha = (1 - \alpha) D_0 + \alpha D_1, \] where \(D_0\) is the dissimilarity between the scaled temporal (Fourier) features, \(D_1\) is a spatial dissimilarity built from a k-nearest-neighbour adjacency, and \(\alpha \in [0, 1]\) controls the trade-off between temporal similarity and spatial contiguity.
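To make the mixing formula concrete, the toy sketch below spells out \(\Delta_\alpha\) on small placeholder matrices and clusters on it directly; ClustGeo's hclustgeo() performs this mixing internally, so this block is for illustration only.

set.seed(2)
feature_toy <- matrix(rnorm(20), nrow = 10)   # 10 fake temporal feature vectors
coords_toy  <- matrix(runif(20), nrow = 10)   # 10 fake locations

D0_toy <- dist(scale(feature_toy))            # temporal dissimilarity
D1_toy <- dist(coords_toy)                    # spatial dissimilarity

alpha_toy <- 0.2
delta <- as.dist((1 - alpha_toy) * as.matrix(D0_toy) / max(D0_toy) +
                  alpha_toy      * as.matrix(D1_toy) / max(D1_toy))

# Ward clustering on the mixed dissimilarity, cut into 3 groups
tree_toy <- hclust(delta, method = "ward.D2")
cutree(tree_toy, k = 3)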

Pickup and Dropoff Clustering

Pickup Cluster

Pickup Temporal Basis Functions

Use the Fourier Transform to extract temporal features from aggregated demand data.

pickup_data <- data %>%
  select(trip_id, PU_Longitude, PU_Latitude, PULocationID, tpep_pickup_datetime) %>%
  mutate(
    pickup_hour = hour(tpep_pickup_datetime),
    pickup_day = wday(tpep_pickup_datetime, label = TRUE)
  )

hourly_demand <- pickup_data %>%
  group_by(PU_Longitude, PU_Latitude, PULocationID, pickup_hour, pickup_day) %>%
  summarize(demand = n(), .groups = "drop") %>%
  arrange(pickup_day, pickup_hour)

demand_matrix <- hourly_demand %>%
  pivot_wider(
    names_from = c(pickup_hour, pickup_day),
    values_from = demand,
    values_fill = 0
  ) %>%
  arrange(PULocationID, PU_Longitude, PU_Latitude)

## Extract Temporal Features
temporal_demand <- demand_matrix %>%
  select(-PU_Longitude, -PU_Latitude, -PULocationID)

fourier_coefficients <- apply(temporal_demand, 1, function(row) {
  fft_result <- fft(row - mean(row))  # Center the series and compute the FFT
  # Keep the real part of bins 1-4 (the DC term, which is ~0 after centering,
  # plus the first three harmonics) and the imaginary part of bins 2-5 (the
  # first four harmonics; the imaginary part of the DC bin is always zero).
  c(Re(fft_result)[1:4], Im(fft_result)[2:5])
}) %>% t()

colnames(fourier_coefficients) <- c("DC", "Real1", "Real2", "Real3", "Imag1", "Imag2", "Imag3", "Imag4")

fourier_data <- demand_matrix %>%
  select(PU_Longitude, PU_Latitude, PULocationID) %>%
  bind_cols(as.data.frame(fourier_coefficients))

scaled_temporal_data <- scale(as.matrix(fourier_data %>% select(-PU_Longitude, -PU_Latitude, -PULocationID)))

Pickup Spatial Constraints and Clustering

Use ClustGeo to determine an appropriate alpha and perform spatially constrained temporal clustering.

coords <- as.matrix(fourier_data %>% select(PU_Longitude, PU_Latitude))
k <- 5
knn <- knearneigh(coords, k = k)
nb <- knn2nb(knn)
nb <- make.sym.nb(nb)  
adj_matrix <- nb2mat(nb, style = "B", zero.policy = TRUE)
diag(adj_matrix) <- 1

D1 <- as.dist(1 - adj_matrix)
D0 <- dist(scaled_temporal_data) 

tree_unconstrained <- hclustgeo(D0)
plot(tree_unconstrained, hang = -1, label = FALSE, main = "Unconstrained Clustering with D0")
rect.hclust(tree_unconstrained, k = 5, border = c(4, 5, 3, 2, 1))

range_alpha <- seq(0, 1, 0.1)
K <- 5

png("pickup_alpha_eval.png", width = 500, height = 500, res = 120)
cr <- choicealpha(D0, D1, range.alpha = range_alpha, K = K, graph = TRUE)
dev.off()
alpha <- 0.2
tree_constrained <- hclustgeo(D0, D1, alpha = alpha)
P5 <- cutree(tree_constrained, k = 5)

fourier_data <- fourier_data %>%
  mutate(cluster = P5)

# Save pickup cluster to pickup_data
fourier_cluster <- fourier_data %>% select(PULocationID, cluster)
pickup_data <- merge(pickup_data, fourier_cluster, by = "PULocationID")

Pickup Cluster Visualization

ggplot(pickup_data , aes(x = pickup_hour, fill = factor(cluster))) +
  geom_bar(position = "dodge") +
  labs(
    title = "Pick Up Demand by Hour and Cluster",
    x = "Hour of the Day",
    y = "Number of Pickups",
    fill = "Cluster"
  ) +
  theme_minimal()

ggplot(pickup_data , aes(x = pickup_day, fill = factor(cluster))) +
  geom_bar(position = "dodge") +
  labs(
    title = "Pick Up Demand by Day and Cluster",
    x = "Day of the week",
    y = "Number of Pickups",
    fill = "Cluster"
  ) +
  theme_minimal()
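The drop-off section below includes a map of its clusters; the same spatial view can be produced for the pickup clusters (a short sketch reusing fourier_data from above):

ggplot(fourier_data, aes(x = PU_Longitude, y = PU_Latitude, color = as.factor(cluster))) +
  geom_point() +
  labs(title = "Clusters with Spatial Constraints (Pickup)", color = "Cluster") +
  theme_minimal()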

Dropoff Cluster

Carry out the same analysis as for the pickup data.

dropoff_data <- data %>%
  select(trip_id, DO_Longitude, DO_Latitude, DOLocationID, tpep_dropoff_datetime) %>%
  mutate(
    dropoff_hour = hour(tpep_dropoff_datetime),
    dropoff_day = wday(tpep_dropoff_datetime, label = TRUE)
  )

hourly_demand_dropoff <- dropoff_data %>%
  group_by(DO_Longitude, DO_Latitude, DOLocationID, dropoff_hour, dropoff_day) %>%
  summarize(demand = n(), .groups = "drop") %>%
  arrange(dropoff_day, dropoff_hour)

demand_matrix_dropoff <- hourly_demand_dropoff %>%
  pivot_wider(
    names_from = c(dropoff_hour, dropoff_day),
    values_from = demand,
    values_fill = 0
  ) %>%
  arrange(DOLocationID, DO_Longitude, DO_Latitude)

temporal_demand_dropoff <- demand_matrix_dropoff %>%
  select(-DO_Longitude, -DO_Latitude, -DOLocationID)

fourier_coefficients_dropoff <- apply(temporal_demand_dropoff, 1, function(row) {
  fft_result <- fft(row - mean(row))  # Center data and compute FFT
  c(Re(fft_result)[1:4], Im(fft_result)[2:5])  # Extract key coefficients
}) %>% t()

colnames(fourier_coefficients_dropoff) <- c("DC", "Real1", "Real2", "Real3", "Imag1", "Imag2", "Imag3", "Imag4")

fourier_data_dropoff <- demand_matrix_dropoff %>%
  select(DO_Longitude, DO_Latitude, DOLocationID) %>%
  bind_cols(as.data.frame(fourier_coefficients_dropoff))

scaled_temporal_data_dropoff <- scale(as.matrix(fourier_data_dropoff %>% select(-DO_Longitude, -DO_Latitude, -DOLocationID)))

coords_dropoff <- as.matrix(fourier_data_dropoff %>% select(DO_Longitude, DO_Latitude))
k <- 5
knn_dropoff <- knearneigh(coords_dropoff, k = k)
nb_dropoff <- knn2nb(knn_dropoff)
nb_dropoff <- make.sym.nb(nb_dropoff)  
adj_matrix_dropoff <- nb2mat(nb_dropoff, style = "B", zero.policy = TRUE)
diag(adj_matrix_dropoff) <- 1

D1_dropoff <- as.dist(1 - adj_matrix_dropoff)
D0_dropoff <- dist(scaled_temporal_data_dropoff) 

tree_unconstrained_dropoff <- hclustgeo(D0_dropoff)
plot(tree_unconstrained_dropoff, hang = -1, label = FALSE, main = "Unconstrained Clustering with D0 (Drop-off)")
rect.hclust(tree_unconstrained_dropoff, k = 5, border = c(4, 5, 3, 2, 1))

range_alpha <- seq(0, 1, 0.1)
K <- 5

png("dropoff_alpha_eval.png", width = 500, height = 500, res = 120)
cr <- choicealpha(D0 = D0_dropoff, D1 = D1_dropoff, range.alpha = range_alpha, K = K, graph = TRUE)
dev.off()
alpha <- 0.2
tree_constrained_dropoff <- hclustgeo(D0_dropoff, D1_dropoff, alpha = alpha)
P5_dropoff <- cutree(tree_constrained_dropoff, k = 5)

fourier_data_dropoff <- fourier_data_dropoff %>%
  mutate(cluster = P5_dropoff)

ggplot(fourier_data_dropoff, aes(x = DO_Longitude, y = DO_Latitude, color = as.factor(cluster))) +
  geom_point() +
  labs(title = "Clusters with Spatial Constraints (Drop-off)", color = "Cluster") +
  theme_minimal()

fourier_dropoff_cluster <- fourier_data_dropoff %>% select(DOLocationID, cluster)

dropoff_data <- merge(dropoff_data, fourier_dropoff_cluster, by = "DOLocationID")

ggplot(dropoff_data, aes(x = dropoff_hour, fill = factor(cluster))) +
  geom_bar(position = "dodge") +
  labs(
    title = "Drop-off Demand by Hour and Cluster",
    x = "Hour of the Day",
    y = "Number of Drop-offs",
    fill = "Cluster"
  ) +
  theme_minimal()

ggplot(dropoff_data, aes(x = dropoff_day, fill = factor(cluster))) +
  geom_bar(position = "dodge") +
  labs(
    title = "Drop-off Demand by Day and Cluster",
    x = "Day of the Week",
    y = "Number of Drop-offs",
    fill = "Cluster"
  ) +
  theme_minimal()

Results: Traffic Flow Analysis

Matching the Clusters

data_with_clusters <- data %>%
  left_join(pickup_data %>% select(trip_id, pickup_cluster = cluster), by = "trip_id") %>%
  left_join(dropoff_data %>% select(trip_id, dropoff_cluster = cluster), by = "trip_id")

traffic_flow <- data_with_clusters %>%
  group_by(pickup_cluster, dropoff_cluster) %>%
  summarize(flow_count = n(), .groups = "drop")

arrange(traffic_flow, desc(flow_count))
## # A tibble: 21 × 3
##    pickup_cluster dropoff_cluster flow_count
##             <int>           <int>      <int>
##  1              4               3     199527
##  2              3               3     146144
##  3              3               5     142982
##  4              4               5      88819
##  5              3               4      75719
##  6              5               3      66672
##  7              4               4      53443
##  8              2               2      30581
##  9              5               5      17219
## 10              5               4      11063
## # ℹ 11 more rows
ggplot(traffic_flow, aes(x = as.factor(pickup_cluster), y = as.factor(dropoff_cluster), fill = flow_count)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = median(traffic_flow$flow_count)) +
  labs(
    title = "Traffic Flow Between Pickup and Drop-off Clusters",
    x = "Pickup Cluster",
    y = "Drop-off Cluster",
    fill = "Number of Trips"
  ) +
  theme_minimal()
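The igraph package loaded at the top is not otherwise used here; as an optional complementary view (a sketch, not part of the original analysis), the same flow table can be drawn as a weighted directed graph between clusters:

# Cluster-to-cluster flows as a weighted directed graph
flow_graph <- graph_from_data_frame(
  traffic_flow %>% filter(!is.na(pickup_cluster), !is.na(dropoff_cluster)),
  directed = TRUE
)
plot(
  flow_graph,
  edge.width = 1 + 8 * E(flow_graph)$flow_count / max(E(flow_graph)$flow_count),
  edge.arrow.size = 0.4,
  main = "Traffic Flow Between Pickup and Drop-off Clusters"
)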

top_traffic_flows <- traffic_flow %>%
  arrange(desc(flow_count)) %>%
  slice(1:3)

data_top_flows <- data_with_clusters %>%
  semi_join(top_traffic_flows, by = c("pickup_cluster", "dropoff_cluster"))

top_traffic_flows <- data_top_flows %>%
  group_by(pickup_cluster, dropoff_cluster) %>%
  summarize(
    avg_trip_distance = round(mean(trip_distance, na.rm = TRUE), 3),
    median_trip_distance = round(median(trip_distance, na.rm = TRUE), 3),
    avg_passenger_count = round(mean(passenger_count, na.rm = TRUE), 3),
    avg_fare_amount = round(mean(fare_amount, na.rm = TRUE), 3),
    avg_duration = round(mean(Duration, na.rm = TRUE), 3),
    total_trips = n(),
    .groups = "drop"
  )

top_flows <- traffic_flow %>%
  arrange(desc(flow_count)) %>%
  head(10)

ggplot(top_flows, aes(x = reorder(paste(pickup_cluster, dropoff_cluster, sep = " → "), -flow_count), y = flow_count, fill = flow_count)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Top 10 Traffic Flows Between Clusters",
    x = "Pickup → Drop-off Cluster",
    y = "Number of Trips",
    fill = "Number of Trips"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Recall that clusters 3, 4, and 5 for both pickups and drop-offs lie within Manhattan. Most trips were both picked up and dropped off in Manhattan.



kable(top_traffic_flows, format = "latex", booktabs = TRUE, digits = 3, 
      col.names = c(
        "Pickup Cluster", 
        "Dropoff Cluster", 
        "Avg Trip Distance (miles)", 
        "Median Trip Distance (miles)", 
        "Avg Passenger Count", 
        "Avg Fare Amount ($)", 
        "Avg Duration (mins)", 
        "Total Trips"
      ),
      caption = "Summary Statistics for Top 3 Traffic Flows")

Traffic Into Manhattan

To examine traffic into Manhattan, we consider trips from pickup clusters 1 and 2 (outside Manhattan) to dropoff clusters 3, 4, and 5 (within Manhattan).

hourly_demand_into_city <- data_with_clusters %>%
  filter(pickup_cluster %in% c(1, 2), dropoff_cluster %in% c(3, 4, 5)) %>%
  select(trip_id, PU_Longitude, PU_Latitude, PULocationID, tpep_pickup_datetime) %>%
  mutate(
    pickup_hour = hour(tpep_pickup_datetime),
    pickup_day = wday(tpep_pickup_datetime, label = TRUE)
  ) %>%
  group_by(PU_Longitude, PU_Latitude, PULocationID, pickup_hour, pickup_day) %>%
  summarize(demand = n(), .groups = "drop")

demand_matrix_into_city <- hourly_demand_into_city %>%
  pivot_wider(
    names_from = c(pickup_hour, pickup_day),
    values_from = demand,
    values_fill = 0
  ) %>%
  arrange(PULocationID, PU_Longitude, PU_Latitude)

daily_demand_into_city <- hourly_demand_into_city %>%
  group_by(pickup_day) %>%
  summarize(total_demand = sum(demand), .groups = "drop")

ggplot(daily_demand_into_city, aes(x = pickup_day, y = total_demand, fill = pickup_day)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Total Into Manhattan Demand by Day",
    x = "Day of the Week",
    y = "Total Demand",
    fill = "Day"
  ) +
  theme_minimal()

hourly_heatmap_data <- hourly_demand_into_city %>%
  group_by(pickup_hour, pickup_day) %>%
  summarize(total_demand = sum(demand), .groups = "drop")

ggplot(hourly_heatmap_data, aes(x = pickup_hour, y = pickup_day, fill = total_demand)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(
    title = "Demand into Manhattan (Hourly by Day)",
    x = "Hour of the Day",
    y = "Day of the Week",
    fill = "Demand"
  ) +
  theme_minimal()

Out of Manhattan

hourly_demand_out_of_city <- data_with_clusters %>%
  filter(pickup_cluster %in% c(3, 4, 5), dropoff_cluster %in% c(1, 2)) %>%
  select(trip_id, PU_Longitude, PU_Latitude, PULocationID, tpep_pickup_datetime) %>%
  mutate(
    pickup_hour = hour(tpep_pickup_datetime),
    pickup_day = wday(tpep_pickup_datetime, label = TRUE)
  ) %>%
  group_by(PU_Longitude, PU_Latitude, PULocationID, pickup_hour, pickup_day) %>%
  summarize(demand = n(), .groups = "drop")

demand_matrix_out_of_city <- hourly_demand_out_of_city %>%
  pivot_wider(
    names_from = c(pickup_hour, pickup_day),
    values_from = demand,
    values_fill = 0
  ) %>%
  arrange(PULocationID, PU_Longitude, PU_Latitude)
daily_demand_out_of_city <- hourly_demand_out_of_city %>%
  group_by(pickup_day) %>%
  summarize(total_demand = sum(demand), .groups = "drop")

ggplot(daily_demand_out_of_city, aes(x = pickup_day, y = total_demand, fill = pickup_day)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Total Out-of-City Demand by Day",
    x = "Day of the Week",
    y = "Total Demand",
    fill = "Day"
  ) +
  theme_minimal()

hourly_heatmap_out_of_city <- hourly_demand_out_of_city %>%
  group_by(pickup_hour, pickup_day) %>%
  summarize(total_demand = sum(demand), .groups = "drop")

# Plot hourly heatmap
ggplot(hourly_heatmap_out_of_city, aes(x = pickup_hour, y = pickup_day, fill = total_demand)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(
    title = "Demand Out of City (Hourly by Day)",
    x = "Hour of the Day",
    y = "Day of the Week",
    fill = "Demand"
  ) +
  theme_minimal()

Within Manhattan

hourly_demand_within_city <- data_with_clusters %>%
  filter(pickup_cluster %in% c(3, 4, 5), dropoff_cluster %in% c(3, 4, 5)) %>%
  select(trip_id, PU_Longitude, PU_Latitude, PULocationID, tpep_pickup_datetime) %>%
  mutate(
    pickup_hour = hour(tpep_pickup_datetime),
    pickup_day = wday(tpep_pickup_datetime, label = TRUE)
  ) %>%
  group_by(PU_Longitude, PU_Latitude, PULocationID, pickup_hour, pickup_day) %>%
  summarize(demand = n(), .groups = "drop")

demand_matrix_within_city <- hourly_demand_within_city %>%
  pivot_wider(
    names_from = c(pickup_hour, pickup_day),
    values_from = demand,
    values_fill = 0
  ) %>%
  arrange(PULocationID, PU_Longitude, PU_Latitude)

daily_demand_within_city <- hourly_demand_within_city %>%
  group_by(pickup_day) %>%
  summarize(total_demand = sum(demand), .groups = "drop")

ggplot(daily_demand_within_city, aes(x = pickup_day, y = total_demand, fill = pickup_day)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Total Within-City Demand by Day",
    x = "Day of the Week",
    y = "Total Demand",
    fill = "Day"
  ) +
  theme_minimal()

hourly_heatmap_within_city <- hourly_demand_within_city %>%
  group_by(pickup_hour, pickup_day) %>%
  summarize(total_demand = sum(demand), .groups = "drop")

ggplot(hourly_heatmap_within_city, aes(x = pickup_hour, y = pickup_day, fill = total_demand)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(
    title = "Within-City Demand Heatmap (Hourly by Day)",
    x = "Hour of the Day",
    y = "Day of the Week",
    fill = "Demand"
  ) +
  theme_minimal()

Summary Statistics

into_city_summary <- data_with_clusters %>%
  filter(pickup_cluster %in% c(1, 2), dropoff_cluster %in% c(3, 4, 5)) %>%
  group_by(pickup_cluster, dropoff_cluster) %>%
  summarize(
    avg_trip_distance = round(mean(trip_distance, na.rm = TRUE), 3),
    median_trip_distance = round(median(trip_distance, na.rm = TRUE), 3),
    avg_passenger_count = round(mean(passenger_count, na.rm = TRUE), 3),
    avg_fare_amount = round(mean(fare_amount, na.rm = TRUE), 3),
    avg_duration = round(mean(Duration, na.rm = TRUE), 3),
    total_trips = n(),
    .groups = "drop"
  )
out_of_city_summary <- data_with_clusters %>%
  filter(pickup_cluster %in% c(3, 4, 5), dropoff_cluster %in% c(1, 2)) %>%
  group_by(pickup_cluster, dropoff_cluster) %>%
  summarize(
    avg_trip_distance = round(mean(trip_distance, na.rm = TRUE), 3),
    median_trip_distance = round(median(trip_distance, na.rm = TRUE), 3),
    avg_passenger_count = round(mean(passenger_count, na.rm = TRUE), 3),
    avg_fare_amount = round(mean(fare_amount, na.rm = TRUE), 3),
    avg_duration = round(mean(Duration, na.rm = TRUE), 3),
    total_trips = n(),
    .groups = "drop"
  )

within_city_summary <- data_with_clusters %>%
  filter(pickup_cluster %in% c(3, 4, 5), dropoff_cluster %in% c(3, 4, 5)) %>%
  group_by(pickup_cluster, dropoff_cluster) %>%
  summarize(
    avg_trip_distance = round(mean(trip_distance, na.rm = TRUE), 3),
    median_trip_distance = round(median(trip_distance, na.rm = TRUE), 3),
    avg_passenger_count = round(mean(passenger_count, na.rm = TRUE), 3),
    avg_fare_amount = round(mean(fare_amount, na.rm = TRUE), 3),
    avg_duration = round(mean(Duration, na.rm = TRUE), 3),
    total_trips = n(),
    .groups = "drop"
  )



into_city_summary_stats <- into_city_summary %>%
  summarize(
    avg_trip_distance_mean = mean(avg_trip_distance, na.rm = TRUE),
    avg_trip_distance_sd = sd(avg_trip_distance, na.rm = TRUE),
    avg_fare_amount_mean = mean(avg_fare_amount, na.rm = TRUE),
    avg_fare_amount_sd = sd(avg_fare_amount, na.rm = TRUE),
    avg_duration_mean = mean(avg_duration, na.rm = TRUE),
    avg_duration_sd = sd(avg_duration, na.rm = TRUE),
    avg_passenger_count_mean = mean(avg_passenger_count, na.rm = TRUE),
    avg_passenger_count_sd = sd(avg_passenger_count, na.rm = TRUE),
    total_trips_mean = sum(total_trips, na.rm = TRUE)
  )

out_of_city_summary_stats <- out_of_city_summary%>%
  summarize(
    avg_trip_distance_mean = mean(avg_trip_distance, na.rm = TRUE),
    avg_trip_distance_sd = sd(avg_trip_distance, na.rm = TRUE),
    avg_fare_amount_mean = mean(avg_fare_amount, na.rm = TRUE),
    avg_fare_amount_sd = sd(avg_fare_amount, na.rm = TRUE),
    avg_duration_mean = mean(avg_duration, na.rm = TRUE),
    avg_duration_sd = sd(avg_duration, na.rm = TRUE),
    avg_passenger_count_mean = mean(avg_passenger_count, na.rm = TRUE),
    avg_passenger_count_sd = sd(avg_passenger_count, na.rm = TRUE),
    total_trips_mean = sum(total_trips, na.rm = TRUE)
  )



within_city_summary_stats <- within_city_summary %>%
  summarize(
    avg_trip_distance_mean = mean(avg_trip_distance, na.rm = TRUE),
    avg_trip_distance_sd = sd(avg_trip_distance, na.rm = TRUE),
    avg_fare_amount_mean = mean(avg_fare_amount, na.rm = TRUE),
    avg_fare_amount_sd = sd(avg_fare_amount, na.rm = TRUE),
    avg_duration_mean = mean(avg_duration, na.rm = TRUE),
    avg_duration_sd = sd(avg_duration, na.rm = TRUE),
    avg_passenger_count_mean = mean(avg_passenger_count, na.rm = TRUE),
    avg_passenger_count_sd = sd(avg_passenger_count, na.rm = TRUE),
    total_trips_mean = sum(total_trips, na.rm = TRUE)
  )
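The direction-specific summaries above repeat the same summarize() block; a small helper (a sketch, not part of the original code) would keep them consistent and shorter:

summarize_flows <- function(df) {
  df %>%
    group_by(pickup_cluster, dropoff_cluster) %>%
    summarize(
      avg_trip_distance    = round(mean(trip_distance, na.rm = TRUE), 3),
      median_trip_distance = round(median(trip_distance, na.rm = TRUE), 3),
      avg_passenger_count  = round(mean(passenger_count, na.rm = TRUE), 3),
      avg_fare_amount      = round(mean(fare_amount, na.rm = TRUE), 3),
      avg_duration         = round(mean(Duration, na.rm = TRUE), 3),
      total_trips          = n(),
      .groups = "drop"
    )
}

# Example: the into-Manhattan summary becomes a one-liner
into_city_summary <- data_with_clusters %>%
  filter(pickup_cluster %in% c(1, 2), dropoff_cluster %in% c(3, 4, 5)) %>%
  summarize_flows()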

combined_summary_stats <- bind_rows(
  into_city_summary %>% mutate(type = "Into City"),
  out_of_city_summary %>% mutate(type = "Out of City"),
  within_city_summary %>% mutate(type = "Within City")
)

kable(combined_summary_stats, format = "html", booktabs = TRUE, caption = "Combined Summary of Traffic Flows") 
Combined Summary of Traffic Flows

| Pickup Cluster | Dropoff Cluster | Avg Distance (mi) | Median Distance (mi) | Avg Passengers | Avg Fare ($) | Avg Duration (min) | Total Trips | Flow Type |
|---|---|---|---|---|---|---|---|---|
| 2 | 3 | 4.264 | 4.280 | 1.385 | 26.889 | 22.079 | 2534 | Into City |
| 2 | 4 | 4.517 | 5.025 | 1.251 | 24.687 | 19.448 | 1986 | Into City |
| 2 | 5 | 4.624 | 5.000 | 1.315 | 28.274 | 21.754 | 1235 | Into City |
| 3 | 1 | 0.460 | 0.000 | 1.000 | 39.775 | 22.646 | 4 | Out of City |
| 3 | 2 | 4.329 | 4.310 | 1.274 | 23.030 | 20.401 | 10453 | Out of City |
| 4 | 1 | 3.712 | 4.075 | 2.333 | 57.123 | 24.438 | 4 | Out of City |
| 4 | 2 | 5.255 | 5.370 | 1.305 | 27.290 | 24.913 | 6605 | Out of City |
| 5 | 2 | 4.109 | 3.940 | 1.310 | 22.328 | 20.317 | 4559 | Out of City |
| 3 | 3 | 2.173 | 1.860 | 1.333 | 15.179 | 14.475 | 146144 | Within City |
| 3 | 4 | 1.953 | 1.530 | 1.291 | 12.889 | 10.605 | 75719 | Within City |
| 3 | 5 | 1.454 | 1.150 | 1.260 | 11.037 | 9.613 | 142982 | Within City |
| 4 | 3 | 1.490 | 1.290 | 1.389 | 12.263 | 11.730 | 199527 | Within City |
| 4 | 4 | 2.911 | 2.700 | 1.377 | 17.086 | 15.282 | 53443 | Within City |
| 4 | 5 | 1.638 | 1.500 | 1.323 | 13.017 | 12.956 | 88819 | Within City |
| 5 | 3 | 1.575 | 1.360 | 1.419 | 11.925 | 10.925 | 66672 | Within City |
| 5 | 4 | 3.297 | 2.790 | 1.404 | 18.517 | 16.328 | 11063 | Within City |
| 5 | 5 | 2.988 | 2.800 | 1.410 | 17.833 | 16.684 | 17219 | Within City |

References

  1. NYC Taxi Zones Dataset, available at NYC Open Data Portal, 2024.
  2. ClustGeo Package Documentation, available at CRAN, 2024.