
Cities with Nice Weather v2.0

I recently came across the blog article "Cities with Nice Weather" by Jesse Onland.

He explains his approach to finding the best cities to live in, in terms of climate.

What he's looking for is a city with an average temperature over a year close to that of Toronto, and a variance/range in temperature as small as possible.

His analysis is interesting, well written and well organized, but the data he used is far from exhaustive.

Indeed, the list he got from Wikipedia only shows temperature data for major cities, so he might be missing his dream city without knowing it...

Moreover, the list only provides temperature data. There is no information about wind speed and relative humidity, which are two important factors influencing the perceived temperature.

Let's try the same approach with another dataset, and see if we can draw the same conclusions.

Some imports for later:

%matplotlib inline
import warnings

import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pycountry_convert as pc
from adjustText import adjust_text
from geopandas import GeoDataFrame
from meteostat import Normals, Point
from tqdm.notebook import tqdm
Point.radius = 55_000  # Maximum radius for nearby stations in meters
plt.rcParams["figure.facecolor"] = (1.0, 1.0, 1.0, 1)
plt.rcParams["figure.figsize"] = [20, 10]
plt.rcParams["legend.fontsize"] = "x-large"
plt.rcParams["axes.titlesize"] = "x-large"
plt.rcParams["axes.labelsize"] = "x-large"
plt.rcParams["xtick.labelsize"] = "x-large"
plt.rcParams["ytick.labelsize"] = "x-large"
pd.set_option("display.max_rows", None)

Get Data for Cities

First, we start with getting a list of cities with the name, country, population and coordinates since we're going to need latitude and longitude to retrieve historical weather data later on.

This kind of data is fortunately quite easy to find, and more importantly, free.

dataset_1 is taken from opendatasoft. It contains information for 140,000 cities in the world.

dataset_2, from the World Cities Database, contains the same kind of data as dataset_1.

dataset_1 = pd.read_csv(r"./data/geonames-all-cities-with-a-population-1000.csv", sep=";", na_filter=False)
dataset_2 = pd.read_csv(r"./data/simplemaps_worldcities_basicv1.75.csv", sep=",", na_filter=False)
# We need na_filter=False since pandas would otherwise convert the country code "NA" (Namibia) to NaN...
dataset_1.head()
Geoname ID Name ASCII Name Alternate Names Feature Class Feature Code Country Code Country name EN Country Code 2 Admin1 Code Admin2 Code Admin3 Code Admin4 Code population Elevation DIgital Elevation Model Timezone Modification date LABEL EN Coordinates
0 8396129 Sanjiang Sanjiang Sanjiang,Sanjiang Jiedao,Sanjiang Qu,san jiang... P PPLA3 CN China 01 3402 0 14 Asia/Shanghai 2021-09-19 China 31.34813,118.36132
1 8405692 Xinmin Xinmin Xinmin,Xinmin Zhen,xin min,xin min zhen,新民,新民镇 P PPLA4 CN China 33 8739734 28033 402 Asia/Shanghai 2022-04-12 China 30.39759,107.3895
2 8416824 Jindaoxia Jindaoxia Jindaoxia,Jindaoxia Zhen,jin dao xia,jin dao x... P PPLA4 CN China 33 8739734 13752 323 Asia/Shanghai 2022-04-01 China 30.00528,106.65187
3 8420197 Jianlong Jianlong Jianlong,Jianlong Xiang,jian long,jian long xi... P PPLA4 CN China 33 8739734 18151 276 Asia/Shanghai 2022-04-01 China 29.3586,106.18522
4 8505210 Jianhua Jianhua Bukui,Bukui Jiedao,Jianhua,Jianhua Qu,bo kui,b... P PPLA3 CN China 08 2302 0 146 Asia/Shanghai 2022-03-12 China 47.35773,123.95977
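The na_filter pitfall mentioned in the comment above can be reproduced on a tiny example (a minimal sketch; the CSV content is made up):

```python
import io

import pandas as pd

# Pandas' default NA-value list includes the string "NA", so Namibia's
# country code silently becomes NaN unless na_filter is disabled.
csv = "city,country_code\nWindhoek,NA\n"

default = pd.read_csv(io.StringIO(csv))
safe = pd.read_csv(io.StringIO(csv), na_filter=False)

print(default["country_code"].isna().iloc[0])  # True: "NA" was parsed as NaN
print(safe["country_code"].iloc[0])            # "NA" is preserved
```

Alternatively, `keep_default_na=False` disables only the built-in NA strings while still honoring an explicit `na_values` list.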

It would be nice to split the Coordinates column into separate lat and lng columns:

dataset_1["Coordinates"] = dataset_1["Coordinates"].astype("string")
dataset_1[["lat", "lng"]] = dataset_1["Coordinates"].str.split(pat=",", n=1, expand=True)
dataset_1["lat"] = dataset_1["lat"].astype(float)
dataset_1["lng"] = dataset_1["lng"].astype(float)

We drop useless data:

dataset_1.drop(
    columns=[
        "Geoname ID",
        "Name",
        "Alternate Names",
        "Feature Class",
        "Feature Code",
        "Country Code 2",
        "Admin1 Code",
        "Admin2 Code",
        "Admin3 Code",
        "Admin4 Code",
        "Elevation",
        "DIgital Elevation Model",
        "Timezone",
        "LABEL EN",
        "Modification date",
        "Coordinates",
    ],
    inplace=True,
)

Some entries don't have a country name; we remove them:

dataset_1.query("`Country name EN` != ''", inplace=True)

Done with dataset_1! Let's clean the second one:

dataset_2.head()
city city_ascii lat lng country iso2 iso3 admin_name capital population id
0 Tokyo Tokyo 35.6839 139.7744 Japan JP JPN Tōkyō primary 39105000 1392685764
1 Jakarta Jakarta -6.2146 106.8451 Indonesia ID IDN Jakarta primary 35362000 1360771077
2 Delhi Delhi 28.6667 77.2167 India IN IND Delhi admin 31870000 1356872604
3 Manila Manila 14.6000 120.9833 Philippines PH PHL Manila primary 23971000 1608618140
4 São Paulo Sao Paulo -23.5504 -46.6339 Brazil BR BRA São Paulo admin 22495000 1076532519

Again, we drop useless columns:

dataset_2.drop(columns=["city", "iso2", "iso3", "admin_name", "capital", "id"], inplace=True)

Some cities lack population data; we remove them:

dataset_2 = dataset_2[dataset_2["population"].str.isnumeric()]
dataset_2["population"] = dataset_2["population"].astype(int)

Looks like dataset_1 is more exhaustive than dataset_2:

print(dataset_1.shape, dataset_2.shape)
(139571, 6) (42143, 5)

Let's check the data for a small city in the UK:

dataset_1.query("`Country Code` == 'GB' and `ASCII Name` == 'Portsmouth'")
ASCII Name Country Code Country name EN population lat lng
66901 Portsmouth GB United Kingdom 194150 50.79899 -1.09125
dataset_2.query("`country` == 'United Kingdom' and city_ascii == 'Portsmouth'")
city_ascii lat lng country population
2407 Portsmouth 50.8058 -1.0872 United Kingdom 248440

The figures aren't identical, but both seem close to reality.

What about Tokyo:

dataset_1.query("`Country Code` == 'JP' and `ASCII Name` == 'Tokyo'")
ASCII Name Country Code Country name EN population lat lng
33145 Tokyo JP Japan 8336599 35.6895 139.69171
dataset_2.query("`country` == 'Japan' and city_ascii == 'Tokyo'")
city_ascii lat lng country population
0 Tokyo 35.6839 139.7744 Japan 39105000

This time, the difference is not negligible: the population figures are wildly different.

Looks like dataset_1 considers the 23 wards that make up the boundaries of the historic city of Tokyo, while dataset_2 considers the Greater Tokyo metropolitan area, which spreads over 3 prefectures...

We notice the same issue with Manila, the capital of the Philippines:

dataset_1.query("`Country Code` == 'PH' and `ASCII Name` == 'Manila'")
ASCII Name Country Code Country name EN population lat lng
16765 Manila PH Philippines 1600000 14.6042 120.9822
dataset_2.query("`country` == 'Philippines' and city_ascii == 'Manila'")
city_ascii lat lng country population
3 Manila 14.6 120.9833 Philippines 23971000

Yes, same issue: dataset_1 covers Manila proper, while dataset_2 gives data for the larger urban area of Metro Manila.

This explains why dataset_2 is smaller: it merges the cities of large urban areas.
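Rather than spot-checking cities one by one, the two datasets could be compared systematically by merging on city name and country (a hypothetical helper; the column names follow the two datasets above):

```python
import pandas as pd


def compare_populations(ds1: pd.DataFrame, ds2: pd.DataFrame) -> pd.DataFrame:
    """Merge on (name, country) and compute the dataset_2 / dataset_1 population ratio."""
    merged = ds1.merge(
        ds2,
        left_on=["ASCII Name", "Country name EN"],
        right_on=["city_ascii", "country"],
        suffixes=("_ds1", "_ds2"),
    )
    # A ratio well above 1 hints that ds2 aggregates a whole metro area.
    merged["pop_ratio"] = merged["population_ds2"] / merged["population_ds1"]
    return merged
```

Sorting by pop_ratio would surface the Tokyo/Manila-style discrepancies at a glance.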

Last check with Singapore:

dataset_1.query("`Country Code` == 'SG' and `ASCII Name` == 'Singapore'")
ASCII Name Country Code Country name EN population lat lng
9440 Singapore SG Singapore 3547809 1.28967 103.85007
dataset_2.query("`country` == 'Singapore'")
city_ascii lat lng country population
135 Singapore 1.3 103.8 Singapore 5271000

Both datasets seem outdated: Singapore's current population is about 6M, for both the country and the city (it's a city-state).

Some data visualization would help us see if some cities/countries/continents are missing:

plt.scatter(x=dataset_1["lng"], y=dataset_1["lat"], s=dataset_1["population"] / 1e6)
plt.show()

png

People sure like living near the sea!

Same plot with dataset_2 this time:

plt.scatter(x=dataset_2["lng"], y=dataset_2["lat"], s=dataset_2["population"] / 1e6)
plt.show()

png

Both datasets seem accurate enough, so from now on we'll keep only dataset_1, the more exhaustive of the two.

df = dataset_1.copy()
del dataset_1
del dataset_2
print(df.shape)
df.head()
(139571, 6)
ASCII Name Country Code Country name EN population lat lng
0 Sanjiang CN China 0 31.34813 118.36132
1 Xinmin CN China 28033 30.39759 107.38950
2 Jindaoxia CN China 13752 30.00528 106.65187
3 Jianlong CN China 18151 29.35860 106.18522
4 Jianhua CN China 0 47.35773 123.95977

We add a column for the continent code, which might be useful later:

df["Country Code"] = df["Country Code"].astype("string")
df.query("`Country Code` != 'TL'", inplace=True)  # Remove Timor-Leste, not supported by pycountry
df.query("`Country Code` != 'EH'", inplace=True)  # Remove Western Sahara, not supported by pycountry
df["Continent Code"] = df.apply(lambda row: pc.country_alpha2_to_continent_code(row["Country Code"]), axis=1)
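Hard-coding the unsupported country codes works, but a small fallback wrapper would survive future additions to the dataset (a sketch; safe_lookup is my own name, and I'm assuming pycountry_convert raises KeyError for unsupported codes):

```python
from typing import Callable, Optional


def safe_lookup(fn: Callable[[str], str], code: str, default: Optional[str] = None) -> Optional[str]:
    """Return fn(code), or default when the code is unsupported (raises KeyError)."""
    try:
        return fn(code)
    except KeyError:
        return default


# Usage with the dataframe above (rows without a continent can then be dropped):
# df["Continent Code"] = df["Country Code"].map(
#     lambda c: safe_lookup(pc.country_alpha2_to_continent_code, c)
# )
```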

Number of cities per continent:

print(df.shape)
df["Continent Code"].value_counts()
(139552, 7)
EU    69029
NA    29755
AS    24765
SA     6701
AF     4889
OC     4413
Name: Continent Code, dtype: int64

Get Historical Weather Data

We now need some historical weather data.

As explained earlier, we can't use this list since it only contains data for "big" cities.

We need historical weather data of the last decade for all the cities shortlisted.

Only monthly statistics are needed since temperatures don't change significantly within a month. We'll compute the variance in temperature of a city only from the statistical monthly weather data.

OpenWeather provides an API with statistical monthly weather data (temp, humidity, wind, etc.) for any month of the entire year, but it's way too expensive for us.

For now, we'll use Meteostat, an open platform which provides free access to historical weather and climate data. Unfortunately, it lacks information about wind speed or humidity.

Getting weather data for Tokyo is straightforward:

tokyo = Point(lat=35.6839, lon=139.7744)
data = Normals(tokyo, 1991, 2020).fetch()
data
tavg tmin tmax prcp wspd pres tsun
month
1 6.3 2.6 10.0 59.8 NaN 1015.9 NaN
2 7.0 3.1 10.8 56.6 NaN 1015.9 NaN
3 10.0 5.9 14.0 116.6 NaN 1015.1 NaN
4 14.8 10.7 19.0 134.1 NaN 1013.9 NaN
5 19.6 15.7 23.5 139.9 NaN 1011.9 NaN
6 22.8 19.5 26.0 168.0 NaN 1009.0 NaN
7 26.7 23.4 30.0 156.5 NaN 1008.6 NaN
8 28.1 24.7 31.5 154.7 NaN 1010.2 NaN
9 24.4 21.2 27.6 222.4 NaN 1013.0 NaN
10 19.0 15.8 22.2 231.3 NaN 1016.6 NaN
11 13.8 10.3 17.2 96.5 NaN 1018.1 NaN
12 8.7 5.0 12.4 58.1 NaN 1016.9 NaN
data.plot(y=["tavg", "tmin", "tmax"])
plt.ylabel("Temperature °C")
plt.show()

png

Pretty big variations over the year for Tokyo!

Having lived there for a few weeks, I find the tmax data closer to my perceived temperature than tavg.

We code a simple function to get weather data for a specific location:

def get_mean_std_temperature_for_loc(lat: float, lon: float):
    pt = Point(lat=lat, lon=lon)
    try:
        with warnings.catch_warnings():  # disable some meteostat warnings
            warnings.simplefilter("ignore")
            data = Normals(pt, 1991, 2020).fetch()
    except pd.errors.ParserError:  # escape some meteostat bugs
        return np.nan, np.nan, np.nan, np.nan
    mean_ = data["tmax"].mean()
    std_ = data["tmax"].std()
    tmax_ = data["tmax"].max()
    range_ = data["tmax"].max() - data["tmax"].min()
    return mean_, std_, tmax_, range_

Gathering weather data for 140,000 cities takes some time. We keep only the biggest cities:

df.query("population >= 100000", inplace=True)

We iterate over cities in the dataframe and get weather data. Not the most efficient way to do it but it works...:

Tmean = []
Tstd = []
Tmax = []
Trange = []
for row in tqdm(df.itertuples(), total=df.shape[0]):
    mean_, std_, tmax_, range_ = get_mean_std_temperature_for_loc(row.lat, row.lng)
    Tmean.append(mean_)
    Tstd.append(std_)
    Tmax.append(tmax_)
    Trange.append(range_)
df["Tmean"] = Tmean
df["Tstd"] = Tstd
df["Tmax"] = Tmax
df["Trange"] = Trange
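Since each Meteostat request is I/O-bound, the loop above could be parallelized with a thread pool (a hypothetical variant; fetch_all and max_workers are my own names):

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def fetch_all(df: pd.DataFrame, fn, max_workers: int = 8) -> list:
    """Apply fn(lat, lng) to every row of df concurrently, preserving row order."""
    coords = list(zip(df["lat"], df["lng"]))
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        return list(ex.map(lambda c: fn(*c), coords))


# results = fetch_all(df, get_mean_std_temperature_for_loc)
# df[["Tmean", "Tstd", "Tmax", "Trange"]] = pd.DataFrame(results, index=df.index)
```

Threads (rather than processes) are enough here because the workers spend their time waiting on network and disk, not on the CPU.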

Some cities don't have any weather data at all; we drop them:

df.dropna(subset=["Tmean", "Tstd", "Tmax"], inplace=True)

Summary Statistics

fig, axes = plt.subplots(2, 2)

df[["Tmean"]].plot.kde(ax=axes[0, 0])
axes[0, 0].set_xlabel("Mean temperature over a year")
axes[0, 0].get_legend().remove()

df[["Tmax"]].plot.kde(ax=axes[0, 1])
axes[0, 1].set_xlabel("Max temperature over a year")
axes[0, 1].get_legend().remove()

df[["Tstd"]].plot.kde(ax=axes[1, 1])
axes[1, 1].set_xlabel("Std dev in temperature over a year")
axes[1, 1].get_legend().remove()

df[["Trange"]].plot.kde(ax=axes[1, 0])
axes[1, 0].set_xlabel("Range in temperature over a year")
axes[1, 0].get_legend().remove()

plt.show()

png

Out of curiosity, let's get the extreme values in temperature:

df.query("Tmean == Tmean.max() or Tmean == Tmean.min()")[["ASCII Name", "Country name EN", "Tmean", "Tstd"]]
ASCII Name Country name EN Tmean Tstd
37323 Mecca Saudi Arabia 38.683333 4.705864
49692 Yakutsk Russian Federation -2.883333 22.694887

Cities with the most stable/unstable temperatures over a year:

df.query("Tstd == Tstd.min() or Tstd == Tstd.max()")[["ASCII Name", "Country name EN", "Tmean", "Tstd"]]
ASCII Name Country name EN Tmean Tstd
49692 Yakutsk Russian Federation -2.883333 22.694887
122628 Jayapura Indonesia 32.241667 0.317543

Yakutsk seems like a lovely city to live in...

Get some cities to plot later:

cities_to_plot = [
    df.query("`ASCII Name` == 'Reykjavik' and `Country name EN` == 'Iceland'"),
    df.query("`ASCII Name` == 'Yakutsk' and `Country name EN` == 'Russian Federation'"),
    df.query("`ASCII Name` == 'Edinburgh' and `Country name EN` == 'United Kingdom'"),
    df.query("`ASCII Name` == 'Paris' and `Country name EN` == 'France'"),
    df.query("`ASCII Name` == 'Oslo' and `Country name EN` == 'Norway'"),
    df.query("`ASCII Name` == 'Stockholm' and `Country name EN` == 'Sweden'"),
    df.query("`ASCII Name` == 'Helsinki' and `Country name EN` == 'Finland'"),
    df.query("`ASCII Name` == 'Tokyo' and `Country name EN` == 'Japan'"),
    df.query("`ASCII Name` == 'Toronto' and `Country name EN` == 'Canada'"),
    df.query("`ASCII Name` == 'Singapore' and `Country name EN` == 'Singapore'"),
]
assert all(not city.empty for city in cities_to_plot)
plt.scatter(df["Tmean"], df["Tstd"], s=5, c="k", marker=".")
plt.xlabel("Mean temperature")
plt.ylabel("Temperature standard deviation")
for city in cities_to_plot:
    plt.scatter(
        city["Tmean"].iloc[0],
        city["Tstd"].iloc[0],
        s=300,
        label=f'{city["ASCII Name"].iloc[0]} ({city["Country Code"].iloc[0]})',
    )
plt.legend()
plt.show()

png

Nice inverse correlation!

We plot the same data on a map:

world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
ax = world.boundary.plot(edgecolor="black")
sc = plt.scatter(x=df["lng"], y=df["lat"], c=df["Tmean"], s=100, cmap="winter")
plt.colorbar(sc, label="Mean temperature °C")
plt.show()

png

Seems like many cities in Africa and South America disappeared for lack of data.

So Which Cities Are the "Best"?

Disclaimer: everything below is based only on my personal preferences...

Set some thresholds to remove cities with weather that is too warm or too variable:

df_backup = df.copy()
df.query("Tmean <= 17 and Tmax <= 30 and Tstd <= 10 and Trange <= 14", inplace=True)
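An alternative to hard thresholds, closer to the original article's idea of matching Toronto's climate, would be to rank cities by distance to a reference point in (Tmean, Tstd) space (a sketch; rank_by_similarity and climate_dist are names I made up):

```python
import numpy as np
import pandas as pd


def rank_by_similarity(df: pd.DataFrame, target_mean: float, target_std: float) -> pd.DataFrame:
    """Sort cities by Euclidean distance to a target (Tmean, Tstd) point."""
    dist = np.hypot(df["Tmean"] - target_mean, df["Tstd"] - target_std)
    return df.assign(climate_dist=dist).sort_values("climate_dist")


# e.g. rank_by_similarity(df, target_mean=12.0, target_std=8.0).head(20)
```

This keeps every city in play and avoids the cliff effect of a city missing the cut by 0.1 °C.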

Most shortlisted cities are in Europe:

print(df.shape)
print(df.groupby(["Continent Code"])["Country name EN"].value_counts())
(37, 11)
Continent Code  Country name EN
EU              United Kingdom     28
                Ireland             2
                France              1
                Iceland             1
NA              Canada              3
OC              New Zealand         2
Name: Country name EN, dtype: int64

The shortlisted Canadian cities were hit by the 2021 heat wave. No thanks, I'll pass.

df_europe = df.query("`Continent Code` == 'EU'")
df_nz = df.query("`Continent Code` == 'OC'")
fig, (ax1, ax2) = plt.subplots(1, 2)
world.boundary.plot(ax=ax1, edgecolor="black")
world.boundary.plot(ax=ax2, edgecolor="black")
sc1 = ax1.scatter(x=df_europe["lng"], y=df_europe["lat"], c=df_europe["Tmean"], s=100, cmap="winter")
sc2 = ax2.scatter(x=df_nz["lng"], y=df_nz["lat"], c=df_nz["Tmean"], s=100, cmap="winter")
ax1.set_xlim(-25, 5)
ax1.set_ylim(40, 70)
ax2.set_xlim(160, 180)
ax2.set_ylim(-50, -30)
cbar = fig.colorbar(sc2, ax=[ax1, ax2])
cbar.set_label("Mean temperature °C")
ax1.set_title("Europe")
ax2.set_title("New Zealand")
plt.show()

png

I was expecting to see Ireland and Scotland so I'm not surprised.

I thought, however, that some cities in Norway/Sweden/Finland would be shortlisted. Looks like they have a higher temperature range:

df_shortlisted = pd.concat((df_europe, df_nz))
df_nordic = df_backup.query(
    "`Country name EN` == 'Norway' or `Country name EN` == 'Sweden' or `Country name EN` == 'Finland'"
)

fig, axes = plt.subplots(2, 2)

df_nordic[["Tmean"]].plot.kde(ax=axes[0, 0])
df_shortlisted[["Tmean"]].plot.kde(ax=axes[0, 0])
axes[0, 0].legend(["Nordic (NO/SE/FI)", "IS/IE/UK/FR/NZ"])
axes[0, 0].set_xlabel("Mean temperature over a year")

df_nordic[["Tmax"]].plot.kde(ax=axes[0, 1])
df_shortlisted[["Tmax"]].plot.kde(ax=axes[0, 1])
axes[0, 1].legend(["Nordic (NO/SE/FI)", "IS/IE/UK/FR/NZ"])
axes[0, 1].set_xlabel("Max temperature over a year")

df_nordic[["Tstd"]].plot.kde(ax=axes[1, 1])
df_shortlisted[["Tstd"]].plot.kde(ax=axes[1, 1])
axes[1, 1].legend(["Nordic (NO/SE/FI)", "IS/IE/UK/FR/NZ"])
axes[1, 1].set_xlabel("Std dev in temperature over a year")

df_nordic[["Trange"]].plot.kde(ax=axes[1, 0])
df_shortlisted[["Trange"]].plot.kde(ax=axes[1, 0])
axes[1, 0].legend(["Nordic (NO/SE/FI)", "IS/IE/UK/FR/NZ"])
axes[1, 0].set_xlabel("Range in temperature over a year")

plt.show()

png

UK/IS/IE seem to have the "best" weather for me.

Possible future steps: include humidity and wind to compute the perceived temperature or heat index.
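As a starting point for that perceived-temperature step, the NWS Rothfusz regression approximates the heat index from temperature and relative humidity (a sketch; it works in °F and is only meant for hot, humid conditions, roughly above 80 °F and 40% humidity):

```python
def heat_index_f(t_f: float, rh: float) -> float:
    """NWS Rothfusz regression: heat index in °F from temperature (°F) and relative humidity (%)."""
    return (
        -42.379
        + 2.04901523 * t_f
        + 10.14333127 * rh
        - 0.22475541 * t_f * rh
        - 6.83783e-3 * t_f**2
        - 5.481717e-2 * rh**2
        + 1.22874e-3 * t_f**2 * rh
        + 8.5282e-4 * t_f * rh**2
        - 1.99e-6 * t_f**2 * rh**2
    )


# At 90 °F and 70% humidity the heat index comes out around 106 °F.
```

Meteostat normals would still need a humidity column to feed this, so it only becomes useful once a richer data source is found.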