The New York City Taxi and Limousine Commission (TLC) provides comprehensive datasets on yellow taxi trips, including pickup and dropoff times, trip distances, and associated costs. This project analyzes yellow taxi trips from August 2024, using 1 million rows randomly sampled from the 3 million original records. The goal is to identify spatio-temporal demand patterns using clustering methods that account for both spatial and temporal dynamics.
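For reference, here is a minimal sketch of how such a sample could be drawn from the raw TLC release. The parquet file name, the use of the arrow package, the seed, and the output file name are assumptions for illustration, not the exact preprocessing that produced the cleaned file read in below.
# Sketch only: draw a 1-million-trip sample from the August 2024 yellow taxi file.
# File names and the arrow/dplyr workflow here are illustrative assumptions.
library(arrow)
library(dplyr)
raw_trips <- read_parquet("yellow_tripdata_2024-08.parquet") # ~3 million records
set.seed(2024) # make the sample reproducible
sampled_trips <- raw_trips %>% slice_sample(n = 1e6) # 1 million random rows
write.csv(sampled_trips, "sampled_NYC_data.csv", row.names = FALSE)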
library(dplyr)
library(wavelets)
library(reshape2)
library(tidyr)
library(lubridate)
library(ggplot2)
library(spatstat)
library(igraph)
library(rjags)
library(fields)
library(stringr)
library(ClustGeo)
library(spdep)
library(knitr)
data <- read.csv("cleaned_NYC_data.csv")
# Data Cleaning
data <- data %>%
mutate(
trip_id = row_number(),
tpep_pickup_datetime = as.POSIXct(tpep_pickup_datetime, format = "%Y-%m-%d %H:%M:%S"),
tpep_dropoff_datetime = as.POSIXct(tpep_dropoff_datetime, format = "%Y-%m-%d %H:%M:%S"),
Duration = as.numeric(difftime(tpep_dropoff_datetime, tpep_pickup_datetime, units = "mins"))
) %>%
filter(!is.na(Duration), Duration > 0)
# Remove outliers: keep trips within [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for both duration and distance
duration_limits <- quantile(data$Duration, probs = c(0.25, 0.75), na.rm = TRUE)
duration_iqr <- IQR(data$Duration, na.rm = TRUE)
distance_limits <- quantile(data$trip_distance, probs = c(0.25, 0.75), na.rm = TRUE)
distance_iqr <- IQR(data$trip_distance, na.rm = TRUE)
data <- data %>%
filter(
Duration >= (duration_limits[1] - 1.5 * duration_iqr) & Duration <= (duration_limits[2] + 1.5 * duration_iqr),
trip_distance >= (distance_limits[1] - 1.5 * distance_iqr) & trip_distance <= (distance_limits[2] + 1.5 * distance_iqr)
)
Every taxi trip has a pickup location (origin) and a dropoff location (destination). In this analysis, we cluster the pickup and dropoff data separately, then match the resulting clusters to reveal how traffic moves across time and space, exposing demand patterns such as commuting into central business districts or travel to entertainment zones.
To analyze these demand patterns, we aggregate the trips into counts by hour of the day and day of the week for each taxi zone.
We use the Fourier transform to extract temporal features from the aggregated demand data. The temporal demand \(f_s(t)\) at location \(s\) and time \(t\) is modeled by the truncated Fourier series \[ f_s(t) = \beta_{s,0} + \sum_{k=1}^m \beta_{s,k} \cos(2\pi k t) + \sum_{k=1}^m \gamma_{s,k} \sin(2\pi k t). \]
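As a concrete illustration of how these coefficients relate to the fft() output used below, here is a small sketch on a made-up 24-hour demand vector (the numbers are hypothetical, not taken from the data):
# Toy example: recover beta_{s,k} and gamma_{s,k} from fft() for one location.
toy_demand <- c(5, 3, 2, 2, 4, 9, 20, 35, 30, 22, 18, 17,
                19, 18, 17, 20, 28, 34, 30, 25, 18, 14, 10, 7) # hypothetical hourly counts
n_hours <- length(toy_demand)
fhat <- fft(toy_demand - mean(toy_demand)) # centered FFT, as in the analysis below
k <- 1:3 # first few harmonics
beta_k <- 2 * Re(fhat[k + 1]) / n_hours # cosine coefficients beta_{s,k}
gamma_k <- -2 * Im(fhat[k + 1]) / n_hours # sine coefficients gamma_{s,k}
beta_0 <- mean(toy_demand) # intercept beta_{s,0}
# Reconstruct the demand curve from the low-order harmonics (t scaled to [0, 1))
t_frac <- (0:(n_hours - 1)) / n_hours
fit <- beta_0 + rowSums(sapply(k, function(j) beta_k[j] * cos(2 * pi * j * t_frac) +
                                              gamma_k[j] * sin(2 * pi * j * t_frac)))
The clustering below uses the raw real and imaginary parts of the centered FFT as features; for \(k \ge 1\) these are proportional to \(\beta_{s,k}\) and \(-\gamma_{s,k}\), so they carry the same temporal information.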
We use the ClustGeo package for the clustering itself, balancing temporal patterns against spatial contiguity. The combined dissimilarity is \[ \Delta_\alpha = (1 - \alpha) D_0 + \alpha D_1, \] where \(D_0\) is the dissimilarity between the temporal (Fourier) features, \(D_1\) is the spatial dissimilarity between zones, and \(\alpha \in [0, 1]\) controls the trade-off between temporal similarity and spatial contiguity.
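hclustgeo() handles the weighting, and any rescaling of \(D_0\) and \(D_1\), internally; purely for intuition, the following sketch shows the convex combination on toy data. Every object name and value below is illustrative.
# Toy illustration of Delta_alpha = (1 - alpha) * D0 + alpha * D1
set.seed(1)
toy_features <- matrix(rnorm(10 * 4), nrow = 10) # stand-in temporal (Fourier) features
toy_coords <- matrix(runif(10 * 2), nrow = 10) # stand-in longitude/latitude
D0_toy <- dist(scale(toy_features)) # temporal dissimilarity
D1_toy <- dist(toy_coords) # spatial dissimilarity
alpha_toy <- 0.2
delta_alpha <- as.dist((1 - alpha_toy) * as.matrix(D0_toy) + alpha_toy * as.matrix(D1_toy))
toy_tree <- hclust(delta_alpha, method = "ward.D2") # plain Ward clustering on the mixed dissimilarity
cutree(toy_tree, k = 3) # cluster labels for the 10 toy zones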
First, extract the temporal features from the aggregated pickup demand with the Fourier transform.
pickup_data <- data %>%
select(trip_id, PU_Longitude, PU_Latitude, PULocationID, tpep_pickup_datetime) %>%
mutate(
pickup_hour = hour(tpep_pickup_datetime),
pickup_day = wday(tpep_pickup_datetime, label = TRUE)
)
hourly_demand <- pickup_data %>%
group_by(PU_Longitude, PU_Latitude, PULocationID, pickup_hour, pickup_day) %>%
summarize(demand = n(), .groups = "drop") %>%
arrange(pickup_day, pickup_hour)
demand_matrix <- hourly_demand %>%
pivot_wider(
names_from = c(pickup_hour, pickup_day),
values_from = demand,
values_fill = 0
) %>%
arrange(PULocationID, PU_Longitude, PU_Latitude)
## Extract Temporal Features
temporal_demand <- demand_matrix %>%
select(-PU_Longitude, -PU_Latitude, -PULocationID)
fourier_coefficients <- apply(temporal_demand, 1, function(row) {
fft_result <- fft(row - mean(row)) # Center data and compute FFT
c(Re(fft_result)[1:4], Im(fft_result)[2:5]) # Extract key coefficients
}) %>% t()
colnames(fourier_coefficients) <- c("DC", "Real1", "Real2", "Real3", "Imag1", "Imag2", "Imag3", "Imag4")
fourier_data <- demand_matrix %>%
select(PU_Longitude, PU_Latitude, PULocationID) %>%
bind_cols(as.data.frame(fourier_coefficients))
scaled_temporal_data <- scale(as.matrix(fourier_data %>% select(-PU_Longitude, -PU_Latitude, -PULocationID)))
Next, use ClustGeo to determine an appropriate \(\alpha\) and perform spatially constrained temporal clustering of the pickup zones.
coords <- as.matrix(fourier_data %>% select(PU_Longitude, PU_Latitude))
k <- 5
knn <- knearneigh(coords, k = k)
nb <- knn2nb(knn)
nb <- make.sym.nb(nb)
adj_matrix <- nb2mat(nb, style = "B", zero.policy = TRUE)
diag(adj_matrix) <- 1
D1 <- as.dist(1 - adj_matrix)
D0 <- dist(scaled_temporal_data)
tree_unconstrained <- hclustgeo(D0)
plot(tree_unconstrained, hang = -1, labels = FALSE, main = "Unconstrained Clustering with D0")
rect.hclust(tree_unconstrained, k = 5, border = c(4, 5, 3, 2, 1))
range_alpha <- seq(0, 1, 0.1)
K <- 5
png("pickup_alpha_eval.png", width = 500, height = 500, res = 120)
cr <- choicealpha(D0, D1, range.alpha = range_alpha, K = K, graph = TRUE)
dev.off()
alpha <- 0.2 # chosen after inspecting the choicealpha() evaluation plot saved above
tree_constrained <- hclustgeo(D0, D1, alpha = alpha)
P5 <- cutree(tree_constrained, k = 5)
fourier_data <- fourier_data %>%
mutate(cluster = P5)
# Save pickup cluster to pickup_data
fourier_cluster <- fourier_data %>% select(PULocationID, cluster)
pickup_data <- merge(pickup_data, fourier_cluster, by = "PULocationID")
ggplot(pickup_data , aes(x = pickup_hour, fill = factor(cluster))) +
geom_bar(position = "dodge") +
labs(
title = "Pick Up Demand by Hour and Cluster",
x = "Hour of the Day",
y = "Number of Pickups",
fill = "Cluster"
) +
theme_minimal()
ggplot(pickup_data , aes(x = pickup_day, fill = factor(cluster))) +
geom_bar(position = "dodge") +
labs(
title = "Pick Up Demand by Day and Cluster",
x = "Day of the week",
y = "Number of Pickups",
fill = "Cluster"
) +
theme_minimal()
Carry out the same analysis for the drop-off data.
dropoff_data <- data %>%
select(trip_id, DO_Longitude, DO_Latitude, DOLocationID, tpep_dropoff_datetime) %>%
mutate(
dropoff_hour = hour(tpep_dropoff_datetime),
dropoff_day = wday(tpep_dropoff_datetime, label = TRUE)
)
hourly_demand_dropoff <- dropoff_data %>%
group_by(DO_Longitude, DO_Latitude, DOLocationID, dropoff_hour, dropoff_day) %>%
summarize(demand = n(), .groups = "drop") %>%
arrange(dropoff_day, dropoff_hour)
demand_matrix_dropoff <- hourly_demand_dropoff %>%
pivot_wider(
names_from = c(dropoff_hour, dropoff_day),
values_from = demand,
values_fill = 0
) %>%
arrange(DOLocationID, DO_Longitude, DO_Latitude)
temporal_demand_dropoff <- demand_matrix_dropoff %>%
select(-DO_Longitude, -DO_Latitude, -DOLocationID)
fourier_coefficients_dropoff <- apply(temporal_demand_dropoff, 1, function(row) {
fft_result <- fft(row - mean(row)) # Center data and compute FFT
c(Re(fft_result)[1:4], Im(fft_result)[2:5]) # Extract key coefficients
}) %>% t()
colnames(fourier_coefficients_dropoff) <- c("DC", "Real1", "Real2", "Real3", "Imag1", "Imag2", "Imag3", "Imag4")
fourier_data_dropoff <- demand_matrix_dropoff %>%
select(DO_Longitude, DO_Latitude, DOLocationID) %>%
bind_cols(as.data.frame(fourier_coefficients_dropoff))
scaled_temporal_data_dropoff <- scale(as.matrix(fourier_data_dropoff %>% select(-DO_Longitude, -DO_Latitude, -DOLocationID)))
coords_dropoff <- as.matrix(fourier_data_dropoff %>% select(DO_Longitude, DO_Latitude))
k <- 5
knn_dropoff <- knearneigh(coords_dropoff, k = k)
nb_dropoff <- knn2nb(knn_dropoff)
nb_dropoff <- make.sym.nb(nb_dropoff)
adj_matrix_dropoff <- nb2mat(nb_dropoff, style = "B", zero.policy = TRUE)
diag(adj_matrix_dropoff) <- 1
D1_dropoff <- as.dist(1 - adj_matrix_dropoff)
D0_dropoff <- dist(scaled_temporal_data_dropoff)
tree_unconstrained_dropoff <- hclustgeo(D0_dropoff)
plot(tree_unconstrained_dropoff, hang = -1, labels = FALSE, main = "Unconstrained Clustering with D0 (Drop-off)")
rect.hclust(tree_unconstrained_dropoff, k = 5, border = c(4, 5, 3, 2, 1))
range_alpha <- seq(0, 1, 0.1)
K <- 5
png("dropoff_alpha_eval.png", width = 500, height = 500, res = 120)
cr <- choicealpha(D0 = D0_dropoff, D1 = D1_dropoff, range.alpha = range_alpha, K = K, graph = TRUE)
dev.off()
alpha <- 0.2 # same choice as for pickups, based on the choicealpha() evaluation plot
tree_constrained_dropoff <- hclustgeo(D0_dropoff, D1_dropoff, alpha = alpha)
P5_dropoff <- cutree(tree_constrained_dropoff, k = 5)
fourier_data_dropoff <- fourier_data_dropoff %>%
mutate(cluster = P5_dropoff)
ggplot(fourier_data_dropoff, aes(x = DO_Longitude, y = DO_Latitude, color = as.factor(cluster))) +
geom_point() +
labs(title = "Clusters with Spatial Constraints (Drop-off)", color = "Cluster") +
theme_minimal()
fourier_dropoff_cluster <- fourier_data_dropoff %>% select(DOLocationID, cluster)
dropoff_data <- merge(dropoff_data, fourier_dropoff_cluster, by = "DOLocationID")
ggplot(dropoff_data, aes(x = dropoff_hour, fill = factor(cluster))) +
geom_bar(position = "dodge") +
labs(
title = "Drop-off Demand by Hour and Cluster",
x = "Hour of the Day",
y = "Number of Drop-offs",
fill = "Cluster"
) +
theme_minimal()
ggplot(dropoff_data, aes(x = dropoff_day, fill = factor(cluster))) +
geom_bar(position = "dodge") +
labs(
title = "Drop-off Demand by Day and Cluster",
x = "Day of the Week",
y = "Number of Drop-offs",
fill = "Cluster"
) +
theme_minimal()
data_with_clusters <- data %>%
left_join(pickup_data %>% select(trip_id, pickup_cluster = cluster), by = "trip_id") %>%
left_join(dropoff_data %>% select(trip_id, dropoff_cluster = cluster), by = "trip_id")
traffic_flow <- data_with_clusters %>%
group_by(pickup_cluster, dropoff_cluster) %>%
summarize(flow_count = n(), .groups = "drop")
arrange(traffic_flow, desc(flow_count))
## # A tibble: 21 × 3
## pickup_cluster dropoff_cluster flow_count
## <int> <int> <int>
## 1 4 3 199527
## 2 3 3 146144
## 3 3 5 142982
## 4 4 5 88819
## 5 3 4 75719
## 6 5 3 66672
## 7 4 4 53443
## 8 2 2 30581
## 9 5 5 17219
## 10 5 4 11063
## # ℹ 11 more rows
ggplot(traffic_flow, aes(x = as.factor(pickup_cluster), y = as.factor(dropoff_cluster), fill = flow_count)) +
geom_tile() +
scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = median(traffic_flow$flow_count)) +
labs(
title = "Traffic Flow Between Pickup and Drop-off Clusters",
x = "Pickup Cluster",
y = "Drop-off Cluster",
fill = "Number of Trips"
) +
theme_minimal()
top_traffic_flows <- traffic_flow %>%
arrange(desc(flow_count)) %>%
slice(1:3)
data_top_flows <- data_with_clusters %>%
semi_join(top_traffic_flows, by = c("pickup_cluster", "dropoff_cluster"))
top_traffic_flows <- data_top_flows %>%
group_by(pickup_cluster, dropoff_cluster) %>%
summarize(
avg_trip_distance = round(mean(trip_distance, na.rm = TRUE), 3),
median_trip_distance = round(median(trip_distance, na.rm = TRUE), 3),
avg_passenger_count = round(mean(passenger_count, na.rm = TRUE), 3),
avg_fare_amount = round(mean(fare_amount, na.rm = TRUE), 3),
avg_duration = round(mean(Duration, na.rm = TRUE), 3),
total_trips = n(),
.groups = "drop"
)
top_flows <- traffic_flow %>%
arrange(desc(flow_count)) %>%
head(10)
ggplot(top_flows, aes(x = reorder(paste(pickup_cluster, dropoff_cluster, sep = " → "), -flow_count), y = flow_count, fill = flow_count)) +
geom_bar(stat = "identity") +
labs(
title = "Top 10 Traffic Flows Between Clusters",
x = "Pickup → Drop-off Cluster",
y = "Number of Trips",
fill = "Number of Trips"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Recall that clusters 3, 4, and 5 for both pickups and drop-offs lie within Manhattan; most trips were both picked up and dropped off there.
kable(top_traffic_flows, format = "latex", booktabs = TRUE, digits = 3,
col.names = c(
"Pickup Cluster",
"Dropoff Cluster",
"Avg Trip Distance (miles)",
"Median Trip Distance (miles)",
"Avg Passenger Count",
"Avg Fare Amount ($)",
"Avg Duration (mins)",
"Total Trips"
),
caption = "Summary Statistics for Top 3 Traffic Flows")
To examine traffic into Manhattan, we consider trips from pickup clusters 1 and 2 (outside of Manhattan) to dropoff clusters 3, 4, and 5 (within Manhattan).
hourly_demand_into_city <- data_with_clusters %>%
filter(pickup_cluster %in% c(1, 2), dropoff_cluster %in% c(3, 4, 5)) %>%
select(trip_id, PU_Longitude, PU_Latitude, PULocationID, tpep_pickup_datetime) %>%
mutate(
pickup_hour = hour(tpep_pickup_datetime),
pickup_day = wday(tpep_pickup_datetime, label = TRUE)
) %>%
group_by(PU_Longitude, PU_Latitude, PULocationID, pickup_hour, pickup_day) %>%
summarize(demand = n(), .groups = "drop")
demand_matrix_into_city <- hourly_demand_into_city %>%
pivot_wider(
names_from = c(pickup_hour, pickup_day),
values_from = demand,
values_fill = 0
) %>%
arrange(PULocationID, PU_Longitude, PU_Latitude)
daily_demand_into_city <- hourly_demand_into_city %>%
group_by(pickup_day) %>%
summarize(total_demand = sum(demand), .groups = "drop")
ggplot(daily_demand_into_city, aes(x = pickup_day, y = total_demand, fill = pickup_day)) +
geom_bar(stat = "identity") +
labs(
title = "Total Into Manhattan Demand by Day",
x = "Day of the Week",
y = "Total Demand",
fill = "Day"
) +
theme_minimal()
hourly_heatmap_data <- hourly_demand_into_city %>%
group_by(pickup_hour, pickup_day) %>%
summarize(total_demand = sum(demand), .groups = "drop")
ggplot(hourly_heatmap_data, aes(x = pickup_hour, y = pickup_day, fill = total_demand)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "blue") +
labs(
title = "Demand into Manhattan (Hourly by Day)",
x = "Hour of the Day",
y = "Day of the Week",
fill = "Demand"
) +
theme_minimal()
To examine traffic out of Manhattan, we consider trips from pickup clusters 3, 4, and 5 (within Manhattan) to dropoff clusters 1 and 2 (outside of Manhattan).
hourly_demand_out_of_city <- data_with_clusters %>%
filter(pickup_cluster %in% c(3, 4, 5), dropoff_cluster %in% c(1, 2)) %>%
select(trip_id, PU_Longitude, PU_Latitude, PULocationID, tpep_pickup_datetime) %>%
mutate(
pickup_hour = hour(tpep_pickup_datetime),
pickup_day = wday(tpep_pickup_datetime, label = TRUE)
) %>%
group_by(PU_Longitude, PU_Latitude, PULocationID, pickup_hour, pickup_day) %>%
summarize(demand = n(), .groups = "drop")
demand_matrix_out_of_city <- hourly_demand_out_of_city %>%
pivot_wider(
names_from = c(pickup_hour, pickup_day),
values_from = demand,
values_fill = 0
) %>%
arrange(PULocationID, PU_Longitude, PU_Latitude)
daily_demand_out_of_city <- hourly_demand_out_of_city %>%
group_by(pickup_day) %>%
summarize(total_demand = sum(demand), .groups = "drop")
ggplot(daily_demand_out_of_city, aes(x = pickup_day, y = total_demand, fill = pickup_day)) +
geom_bar(stat = "identity") +
labs(
title = "Total Out-of-City Demand by Day",
x = "Day of the Week",
y = "Total Demand",
fill = "Day"
) +
theme_minimal()
hourly_heatmap_out_of_city <- hourly_demand_out_of_city %>%
group_by(pickup_hour, pickup_day) %>%
summarize(total_demand = sum(demand), .groups = "drop")
# Plot hourly heatmap
ggplot(hourly_heatmap_out_of_city, aes(x = pickup_hour, y = pickup_day, fill = total_demand)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "blue") +
labs(
title = "Demand Out of City (Hourly by Day)",
x = "Hour of the Day",
y = "Day of the Week",
fill = "Demand"
) +
theme_minimal()
Finally, we consider trips that both start and end in clusters 3, 4, and 5 to examine demand within Manhattan.
hourly_demand_within_city <- data_with_clusters %>%
filter(pickup_cluster %in% c(3, 4, 5), dropoff_cluster %in% c(3, 4, 5)) %>%
select(trip_id, PU_Longitude, PU_Latitude, PULocationID, tpep_pickup_datetime) %>%
mutate(
pickup_hour = hour(tpep_pickup_datetime),
pickup_day = wday(tpep_pickup_datetime, label = TRUE)
) %>%
group_by(PU_Longitude, PU_Latitude, PULocationID, pickup_hour, pickup_day) %>%
summarize(demand = n(), .groups = "drop")
demand_matrix_within_city <- hourly_demand_within_city %>%
pivot_wider(
names_from = c(pickup_hour, pickup_day),
values_from = demand,
values_fill = 0
) %>%
arrange(PULocationID, PU_Longitude, PU_Latitude)
daily_demand_within_city <- hourly_demand_within_city %>%
group_by(pickup_day) %>%
summarize(total_demand = sum(demand), .groups = "drop")
ggplot(daily_demand_within_city, aes(x = pickup_day, y = total_demand, fill = pickup_day)) +
geom_bar(stat = "identity") +
labs(
title = "Total Within-City Demand by Day",
x = "Day of the Week",
y = "Total Demand",
fill = "Day"
) +
theme_minimal()
hourly_heatmap_within_city <- hourly_demand_within_city %>%
group_by(pickup_hour, pickup_day) %>%
summarize(total_demand = sum(demand), .groups = "drop")
ggplot(hourly_heatmap_within_city, aes(x = pickup_hour, y = pickup_day, fill = total_demand)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "blue") +
labs(
title = "Within-City Demand Heatmap (Hourly by Day)",
x = "Hour of the Day",
y = "Day of the Week",
fill = "Demand"
) +
theme_minimal()
into_city_summary <- data_with_clusters %>%
filter(pickup_cluster %in% c(1, 2), dropoff_cluster %in% c(3, 4, 5)) %>%
group_by(pickup_cluster, dropoff_cluster) %>%
summarize(
avg_trip_distance = round(mean(trip_distance, na.rm = TRUE), 3),
median_trip_distance = round(median(trip_distance, na.rm = TRUE), 3),
avg_passenger_count = round(mean(passenger_count, na.rm = TRUE), 3),
avg_fare_amount = round(mean(fare_amount, na.rm = TRUE), 3),
avg_duration = round(mean(Duration, na.rm = TRUE), 3),
total_trips = n(),
.groups = "drop"
)
out_of_city_summary <- data_with_clusters %>%
filter(pickup_cluster %in% c(3, 4, 5), dropoff_cluster %in% c(1, 2)) %>%
group_by(pickup_cluster, dropoff_cluster) %>%
summarize(
avg_trip_distance = round(mean(trip_distance, na.rm = TRUE), 3),
median_trip_distance = round(median(trip_distance, na.rm = TRUE), 3),
avg_passenger_count = round(mean(passenger_count, na.rm = TRUE), 3),
avg_fare_amount = round(mean(fare_amount, na.rm = TRUE), 3),
avg_duration = round(mean(Duration, na.rm = TRUE), 3),
total_trips = n(),
.groups = "drop"
)
within_city_summary <- data_with_clusters %>%
filter(pickup_cluster %in% c(3, 4, 5), dropoff_cluster %in% c(3, 4, 5)) %>%
group_by(pickup_cluster, dropoff_cluster) %>%
summarize(
avg_trip_distance = round(mean(trip_distance, na.rm = TRUE), 3),
median_trip_distance = round(median(trip_distance, na.rm = TRUE), 3),
avg_passenger_count = round(mean(passenger_count, na.rm = TRUE), 3),
avg_fare_amount = round(mean(fare_amount, na.rm = TRUE), 3),
avg_duration = round(mean(Duration, na.rm = TRUE), 3),
total_trips = n(),
.groups = "drop"
)
into_city_summary_stats <- into_city_summary %>%
summarize(
avg_trip_distance_mean = mean(avg_trip_distance, na.rm = TRUE),
avg_trip_distance_sd = sd(avg_trip_distance, na.rm = TRUE),
avg_fare_amount_mean = mean(avg_fare_amount, na.rm = TRUE),
avg_fare_amount_sd = sd(avg_fare_amount, na.rm = TRUE),
avg_duration_mean = mean(avg_duration, na.rm = TRUE),
avg_duration_sd = sd(avg_duration, na.rm = TRUE),
avg_passenger_count_mean = mean(avg_passenger_count, na.rm = TRUE),
avg_passenger_count_sd = sd(avg_passenger_count, na.rm = TRUE),
total_trips_mean = sum(total_trips, na.rm = TRUE)
)
out_of_city_summary_stats <- out_of_city_summary%>%
summarize(
avg_trip_distance_mean = mean(avg_trip_distance, na.rm = TRUE),
avg_trip_distance_sd = sd(avg_trip_distance, na.rm = TRUE),
avg_fare_amount_mean = mean(avg_fare_amount, na.rm = TRUE),
avg_fare_amount_sd = sd(avg_fare_amount, na.rm = TRUE),
avg_duration_mean = mean(avg_duration, na.rm = TRUE),
avg_duration_sd = sd(avg_duration, na.rm = TRUE),
avg_passenger_count_mean = mean(avg_passenger_count, na.rm = TRUE),
avg_passenger_count_sd = sd(avg_passenger_count, na.rm = TRUE),
total_trips_mean = sum(total_trips, na.rm = TRUE)
)
within_city_summary_stats <- within_city_summary %>%
summarize(
avg_trip_distance_mean = mean(avg_trip_distance, na.rm = TRUE),
avg_trip_distance_sd = sd(avg_trip_distance, na.rm = TRUE),
avg_fare_amount_mean = mean(avg_fare_amount, na.rm = TRUE),
avg_fare_amount_sd = sd(avg_fare_amount, na.rm = TRUE),
avg_duration_mean = mean(avg_duration, na.rm = TRUE),
avg_duration_sd = sd(avg_duration, na.rm = TRUE),
avg_passenger_count_mean = mean(avg_passenger_count, na.rm = TRUE),
avg_passenger_count_sd = sd(avg_passenger_count, na.rm = TRUE),
total_trips_mean = sum(total_trips, na.rm = TRUE)
)
combined_summary_stats <- bind_rows(
into_city_summary %>% mutate(type = "Into City"),
out_of_city_summary %>% mutate(type = "Out of City"),
within_city_summary %>% mutate(type = "Within City")
)
kable(combined_summary_stats, format = "html", booktabs = TRUE, caption = "Combined Summary of Traffic Flows")
| Pickup Cluster | Dropoff Cluster | Avg Trip Distance (miles) | Median Trip Distance (miles) | Avg Passenger Count | Avg Fare Amount ($) | Avg Duration (mins) | Total Trips | Type |
|---|---|---|---|---|---|---|---|---|
| 2 | 3 | 4.264 | 4.280 | 1.385 | 26.889 | 22.079 | 2534 | Into City |
| 2 | 4 | 4.517 | 5.025 | 1.251 | 24.687 | 19.448 | 1986 | Into City |
| 2 | 5 | 4.624 | 5.000 | 1.315 | 28.274 | 21.754 | 1235 | Into City |
| 3 | 1 | 0.460 | 0.000 | 1.000 | 39.775 | 22.646 | 4 | Out of City |
| 3 | 2 | 4.329 | 4.310 | 1.274 | 23.030 | 20.401 | 10453 | Out of City |
| 4 | 1 | 3.712 | 4.075 | 2.333 | 57.123 | 24.438 | 4 | Out of City |
| 4 | 2 | 5.255 | 5.370 | 1.305 | 27.290 | 24.913 | 6605 | Out of City |
| 5 | 2 | 4.109 | 3.940 | 1.310 | 22.328 | 20.317 | 4559 | Out of City |
| 3 | 3 | 2.173 | 1.860 | 1.333 | 15.179 | 14.475 | 146144 | Within City |
| 3 | 4 | 1.953 | 1.530 | 1.291 | 12.889 | 10.605 | 75719 | Within City |
| 3 | 5 | 1.454 | 1.150 | 1.260 | 11.037 | 9.613 | 142982 | Within City |
| 4 | 3 | 1.490 | 1.290 | 1.389 | 12.263 | 11.730 | 199527 | Within City |
| 4 | 4 | 2.911 | 2.700 | 1.377 | 17.086 | 15.282 | 53443 | Within City |
| 4 | 5 | 1.638 | 1.500 | 1.323 | 13.017 | 12.956 | 88819 | Within City |
| 5 | 3 | 1.575 | 1.360 | 1.419 | 11.925 | 10.925 | 66672 | Within City |
| 5 | 4 | 3.297 | 2.790 | 1.404 | 18.517 | 16.328 | 11063 | Within City |
| 5 | 5 | 2.988 | 2.800 | 1.410 | 17.833 | 16.684 | 17219 | Within City |