This paper proposes a deep spatio-temporal residual network (ST-ResNet) for predicting citywide crowd flows. The model forecasts both the inflow and the outflow of crowds in each region of a city, taking spatial, temporal, and external factors into account.

ST-ResNet has four components: temporal closeness, period, trend, and external influence. The first three model the temporal properties of crowd traffic and share a similar network structure: a convolutional neural network followed by residual units, which captures spatial dependencies between both nearby and distant regions. The external component extracts features such as weather, events, and day of the week from external datasets and feeds them into a fully-connected neural network.

The model dynamically aggregates the outputs of the three temporal branches, assigning different weights to different branches and regions. The aggregated result is then combined with the output of the external component, and a Tanh function maps the final prediction to the range [-1, 1]. The model is trained using backpropagation with the Adam optimizer.

Experiments use two datasets: Beijing taxi trajectories with meteorological data, and New York City bike trajectory data. On both, ST-ResNet performs significantly better than six well-known baseline methods.
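The aggregation step described above can be illustrated with a minimal sketch. It assumes that "different weights to different regions and branches" is realized as an element-wise product between each branch output and a weight grid of the same shape, and that the external component's output has already been reshaped to a grid. The function names (`hadamard`, `add`, `fuse`), the grid representation, and the sample values are hypothetical and chosen only for illustration; in a trained model the weights would be learned rather than fixed.

```python
import math

Grid = list[list[float]]  # one flow map: rows x cols of region values


def hadamard(weights: Grid, values: Grid) -> Grid:
    """Element-wise product, so every region gets its own weight."""
    return [[w * v for w, v in zip(w_row, v_row)]
            for w_row, v_row in zip(weights, values)]


def add(*grids: Grid) -> Grid:
    """Element-wise sum of equally shaped grids."""
    return [[sum(cells) for cells in zip(*rows)] for rows in zip(*grids)]


def fuse(x_close: Grid, x_period: Grid, x_trend: Grid,
         w_close: Grid, w_period: Grid, w_trend: Grid,
         x_ext: Grid) -> Grid:
    """Weight the three temporal branches per region, add the external
    component, and squash the result into the tanh range."""
    x_res = add(hadamard(w_close, x_close),
                hadamard(w_period, x_period),
                hadamard(w_trend, x_trend))
    return [[math.tanh(v) for v in row] for row in add(x_res, x_ext)]


# Illustrative 2x2 city grid with made-up branch outputs and weights.
x_close = [[0.5, -0.2], [0.1, 0.3]]
x_period = [[0.4, 0.0], [0.2, 0.1]]
x_trend = [[0.1, 0.1], [0.0, 0.2]]
w_close = [[0.7, 0.6], [0.5, 0.8]]   # assumed values, normally learned
w_period = [[0.2, 0.3], [0.4, 0.1]]
w_trend = [[0.1, 0.1], [0.1, 0.1]]
x_ext = [[0.05, 0.05], [0.05, 0.05]]

prediction = fuse(x_close, x_period, x_trend,
                  w_close, w_period, w_trend, x_ext)
print(prediction)
```

Because the weights are grids rather than scalars, a region dominated by recent activity can lean on the closeness branch while another leans on period or trend, which is the behaviour the summary attributes to the dynamic aggregation.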
The model is applicable to various types of flow prediction tasks and can be extended to other types of flows using appropriate fusion mechanisms.