CityNav is a new dataset for language-goal aerial navigation using 3D point cloud representations from real-world cities. It includes 32,637 natural language descriptions paired with human demonstration trajectories, collected from participants via a new web-based 3D simulator. Each description specifies a navigation goal, leveraging the names and locations of landmarks within real-world cities. The dataset also provides baseline models of navigation agents that incorporate an internal 2D spatial map representing landmarks referenced in the descriptions. Results on this dataset show that aerial agent models trained on human demonstration trajectories outperform those trained on shortest path trajectories, highlighting the importance of human-driven navigation strategies. The integration of a 2D spatial map significantly enhances navigation efficiency at city scale. The dataset and code are available at https://water-cookie.github.io/city-nav-proj/.
The CityNav dataset is designed for developing intelligent aerial agents capable of locating specific geographical objects in real-world cities based on natural language descriptions. The dataset provides descriptions for the city-scale point cloud data of SensatUrban, together with the corresponding trajectories for training aerial agents. To collect large numbers of trajectories in photorealistic 3D environments, a web-based flight simulator synchronized with world maps was implemented. The dataset includes 32,637 trajectories corresponding to natural language descriptions of 5,850 objects, making it approximately four times the size of the existing aerial VLN dataset. The CityNav dataset represents the first large-scale 3D aerial navigation effort that leverages real-world 3D city data and contains a large number of human-collected geo-aware trajectories and textual descriptions.
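The baseline agents are described as keeping an internal 2D spatial map of the landmarks referenced in a description. As a rough illustration only, and not the authors' implementation, the sketch below rasterizes the footprints of mentioned landmarks onto a top-down grid. The `Landmark` record, its bounding-box format, the function name `rasterize_landmarks`, the landmark names, and the grid resolution are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Landmark:
    # Hypothetical record: a named landmark with an axis-aligned footprint,
    # given in metres in a local map frame whose origin is the grid corner.
    name: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def rasterize_landmarks(landmarks, mentioned, extent_m, cell_m):
    """Return a square top-down grid (list of rows).

    A cell is 1 if it overlaps the footprint of a landmark whose name is in
    `mentioned`, and 0 otherwise.
    """
    n = int(extent_m // cell_m)
    grid = [[0] * n for _ in range(n)]
    for lm in landmarks:
        if lm.name not in mentioned:
            continue
        # Convert the metric footprint to cell indices, clamped to the grid.
        c0 = max(0, int(lm.x_min // cell_m))
        c1 = min(n - 1, int(lm.x_max // cell_m))
        r0 = max(0, int(lm.y_min // cell_m))
        r1 = min(n - 1, int(lm.y_max // cell_m))
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                grid[r][c] = 1
    return grid


# Hypothetical landmarks; only the one named in the description is drawn.
landmarks = [
    Landmark("Park Street", 0, 0, 19, 9),
    Landmark("Mill Road", 30, 30, 39, 39),
]
grid = rasterize_landmarks(landmarks, {"Park Street"}, extent_m=40, cell_m=10)
```

A map like this gives an agent a compact, geo-referenced view of where the described landmarks lie, which it could combine with its current position when choosing the next action.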
The primary contributions of this work are: (1) a city-scale 3D aerial navigation dataset using real cities, (2) a geo-aware aerial vision-and-language navigation model, and (3) benchmarking map-based vs. map-less aerial navigation methods. The results show that the proposed map-based method outperforms the existing map-less method. The dataset is benchmarked against current aerial VLN models, and the results demonstrate that the proposed model, which utilizes 2D spatial map representations and human-generated geo-aware trajectories, enhances navigation performance. The CityNav dataset could be a valuable resource for benchmarking and training intelligent aerial agents. Limitations include the lack of agent-object interaction and dynamic elements like moving vehicles and pedestrians in urban simulations. Future work could include integrating physical interaction and real-time data to improve navigation accuracy and expand the dataset. The broader impacts of CityNav include potential improvements in urban surveillance and emergency response by enabling aerial agents to navigate via natural language. However, these technologies also raise ethical concerns regarding privacy and data security. It is crucial to consider social acceptance and regulatory issues, engage with communities to ensure equitable benefits, and address potential risks to privacy and safety.