I recently came across the blog article "Cities with Nice Weather" by Jesse Onland.
He explains his approach to finding the best cities, climate-wise, for him to live in.
What he's looking for is a city with an average temperature over a year close to that of Toronto, and a variance/range in temperature as small as possible.
His analysis is interesting, well written and well organized, but the data he used is far from exhaustive.
Indeed, the list he got from Wikipedia only provides temperature data for major cities, so he might be missing his dream city without even knowing it...
Moreover, the list only provides temperature data. There is no information about wind speed and relative humidity, which are two important factors influencing the perceived temperature.
Let's try the same approach with another dataset, and see if we can draw the same conclusions.
Some imports for later:
%matplotlib inline
import warnings
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pycountry_convert as pc
from adjustText import adjust_text
from geopandas import GeoDataFrame
from meteostat import Normals, Point
from tqdm.notebook import tqdm
Point.radius = 55_000 # Maximum radius for nearby stations in meters
plt.rcParams["figure.facecolor"] = (1.0, 1.0, 1.0, 1)
plt.rcParams["figure.figsize"] = [20, 10]
plt.rcParams["legend.fontsize"] = "x-large"
plt.rcParams["axes.titlesize"] = "x-large"
plt.rcParams["axes.labelsize"] = "x-large"
plt.rcParams["xtick.labelsize"] = "x-large"
plt.rcParams["ytick.labelsize"] = "x-large"
pd.set_option("display.max_rows", None)
First, we need a list of cities with their names, countries, populations and coordinates, since we're going to need latitude and longitude to retrieve historical weather data later on.
This kind of data is fortunately quite easy to find, and more importantly, free.
`dataset_1` is taken from opendatasoft. It contains information on 140,000 cities around the world. `dataset_2`, from the World Cities Database, contains the same kind of data as `dataset_1`.
dataset_1 = pd.read_csv(r"./data/geonames-all-cities-with-a-population-1000.csv", sep=";", na_filter=False)
dataset_2 = pd.read_csv(r"./data/simplemaps_worldcities_basicv1.75.csv", sep=",", na_filter=False)
# We need na_filter=False since pandas would otherwise convert the country code "NA" (Namibia) to NaN...
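A minimal illustration of this pitfall, on a made-up two-line CSV:

```python
import io

import pandas as pd

csv = "city;country_code\nWindhoek;NA\n"

# With default parsing, the ISO code "NA" (Namibia) is treated as a missing value...
default = pd.read_csv(io.StringIO(csv), sep=";")
assert default["country_code"].isna().all()

# ...while na_filter=False keeps it as the literal string "NA".
raw = pd.read_csv(io.StringIO(csv), sep=";", na_filter=False)
assert raw["country_code"].iloc[0] == "NA"
```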
dataset_1.head()
Geoname ID | Name | ASCII Name | Alternate Names | Feature Class | Feature Code | Country Code | Country name EN | Country Code 2 | Admin1 Code | Admin2 Code | Admin3 Code | Admin4 Code | population | Elevation | DIgital Elevation Model | Timezone | Modification date | LABEL EN | Coordinates | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 8396129 | Sanjiang | Sanjiang | Sanjiang,Sanjiang Jiedao,Sanjiang Qu,san jiang... | P | PPLA3 | CN | China | 01 | 3402 | 0 | 14 | Asia/Shanghai | 2021-09-19 | China | 31.34813,118.36132 | ||||
1 | 8405692 | Xinmin | Xinmin | Xinmin,Xinmin Zhen,xin min,xin min zhen,新民,新民镇 | P | PPLA4 | CN | China | 33 | 8739734 | 28033 | 402 | Asia/Shanghai | 2022-04-12 | China | 30.39759,107.3895 | ||||
2 | 8416824 | Jindaoxia | Jindaoxia | Jindaoxia,Jindaoxia Zhen,jin dao xia,jin dao x... | P | PPLA4 | CN | China | 33 | 8739734 | 13752 | 323 | Asia/Shanghai | 2022-04-01 | China | 30.00528,106.65187 | ||||
3 | 8420197 | Jianlong | Jianlong | Jianlong,Jianlong Xiang,jian long,jian long xi... | P | PPLA4 | CN | China | 33 | 8739734 | 18151 | 276 | Asia/Shanghai | 2022-04-01 | China | 29.3586,106.18522 | ||||
4 | 8505210 | Jianhua | Jianhua | Bukui,Bukui Jiedao,Jianhua,Jianhua Qu,bo kui,b... | P | PPLA3 | CN | China | 08 | 2302 | 0 | 146 | Asia/Shanghai | 2022-03-12 | China | 47.35773,123.95977 |
It would be nice to split the `Coordinates` column:
dataset_1["Coordinates"] = dataset_1["Coordinates"].astype("string")
dataset_1[["lat", "lng"]] = dataset_1["Coordinates"].str.split(pat=",", n=1, expand=True)
dataset_1["lat"] = dataset_1["lat"].astype(float)
dataset_1["lng"] = dataset_1["lng"].astype(float)
We drop useless data:
dataset_1.drop(
columns=[
"Geoname ID",
"Name",
"Alternate Names",
"Feature Class",
"Feature Code",
"Country Code 2",
"Admin1 Code",
"Admin2 Code",
"Admin3 Code",
"Admin4 Code",
"Elevation",
"DIgital Elevation Model",
"Timezone",
"LABEL EN",
"Modification date",
"Coordinates",
],
inplace=True,
)
Some entries don't have a country name, so we remove them:
dataset_1.query("`Country name EN` != ''", inplace=True)
Done with `dataset_1`! Let's clean the second one:
dataset_2.head()
city | city_ascii | lat | lng | country | iso2 | iso3 | admin_name | capital | population | id | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Tokyo | Tokyo | 35.6839 | 139.7744 | Japan | JP | JPN | Tōkyō | primary | 39105000 | 1392685764 |
1 | Jakarta | Jakarta | -6.2146 | 106.8451 | Indonesia | ID | IDN | Jakarta | primary | 35362000 | 1360771077 |
2 | Delhi | Delhi | 28.6667 | 77.2167 | India | IN | IND | Delhi | admin | 31870000 | 1356872604 |
3 | Manila | Manila | 14.6000 | 120.9833 | Philippines | PH | PHL | Manila | primary | 23971000 | 1608618140 |
4 | São Paulo | Sao Paulo | -23.5504 | -46.6339 | Brazil | BR | BRA | São Paulo | admin | 22495000 | 1076532519 |
Again, we drop useless columns:
dataset_2.drop(columns=["city", "iso2", "iso3", "admin_name", "capital", "id"], inplace=True)
Some cities lack population data, so we remove them:
dataset_2 = dataset_2[dataset_2["population"].str.isnumeric()]
dataset_2["population"] = dataset_2["population"].astype(int)
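An equivalent, slightly more explicit way to do this cleanup is `pd.to_numeric` with `errors="coerce"`, which turns non-numeric strings into NaN before dropping them (shown here on made-up data):

```python
import pandas as pd

df2 = pd.DataFrame({"city": ["A", "B", "C"], "population": ["1000", "", "2500"]})

# Coerce non-numeric strings (here the empty one) to NaN, then drop those rows.
df2["population"] = pd.to_numeric(df2["population"], errors="coerce")
df2 = df2.dropna(subset=["population"])
df2["population"] = df2["population"].astype(int)
assert df2["population"].tolist() == [1000, 2500]
```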
Looks like `dataset_1` is more exhaustive than `dataset_2`:
print(dataset_1.shape, dataset_2.shape)
(139571, 6) (42143, 5)
Let's check the data for a small city in the UK:
dataset_1.query("`Country Code` == 'GB' and `ASCII Name` == 'Portsmouth'")
ASCII Name | Country Code | Country name EN | population | lat | lng | |
---|---|---|---|---|---|---|
66901 | Portsmouth | GB | United Kingdom | 194150 | 50.79899 | -1.09125 |
dataset_2.query("`country` == 'United Kingdom' and city_ascii == 'Portsmouth'")
city_ascii | lat | lng | country | population | |
---|---|---|---|---|---|
2407 | Portsmouth | 50.8058 | -1.0872 | United Kingdom | 248440 |
The figures aren't identical, but both seem pretty close to reality.
What about Tokyo:
dataset_1.query("`Country Code` == 'JP' and `ASCII Name` == 'Tokyo'")
ASCII Name | Country Code | Country name EN | population | lat | lng | |
---|---|---|---|---|---|---|
33145 | Tokyo | JP | Japan | 8336599 | 35.6895 | 139.69171 |
dataset_2.query("`country` == 'Japan' and city_ascii == 'Tokyo'")
city_ascii | lat | lng | country | population | |
---|---|---|---|---|---|
0 | Tokyo | 35.6839 | 139.7744 | Japan | 39105000 |
This time the difference is not negligible: the population figures are wildly different.
It looks like `dataset_1` considers the 23 wards that make up the boundaries of the historic city of Tokyo, while `dataset_2` considers the greater Tokyo metropolitan area, which is spread over 3 prefectures...
We should see the same issue with Manila, the capital city of the Philippines:
dataset_1.query("`Country Code` == 'PH' and `ASCII Name` == 'Manila'")
ASCII Name | Country Code | Country name EN | population | lat | lng | |
---|---|---|---|---|---|---|
16765 | Manila | PH | Philippines | 1600000 | 14.6042 | 120.9822 |
dataset_2.query("`country` == 'Philippines' and city_ascii == 'Manila'")
city_ascii | lat | lng | country | population | |
---|---|---|---|---|---|
3 | Manila | 14.6 | 120.9833 | Philippines | 23971000 |
Yes, same issue: `dataset_1` considers the city of Manila proper, while `dataset_2` gives data for the larger urban area of Metro Manila.
This explains why `dataset_2` is smaller: it merges the cities of large urban areas.
Last check with Singapore:
dataset_1.query("`Country Code` == 'SG' and `ASCII Name` == 'Singapore'")
ASCII Name | Country Code | Country name EN | population | lat | lng | |
---|---|---|---|---|---|---|
9440 | Singapore | SG | Singapore | 3547809 | 1.28967 | 103.85007 |
dataset_2.query("`country` == 'Singapore'")
city_ascii | lat | lng | country | population | |
---|---|---|---|---|---|
135 | Singapore | 1.3 | 103.8 | Singapore | 5271000 |
Both datasets seem outdated, since the current population of Singapore (both the country and the city) is close to 6M.
Some data visualization would help us check whether any cities/countries/continents are missing:
plt.scatter(x=dataset_1["lng"], y=dataset_1["lat"], s=dataset_1["population"] / 1e6)
plt.show()
People sure like living near the sea!
Same plot with `dataset_2` this time:
plt.scatter(x=dataset_2["lng"], y=dataset_2["lat"], s=dataset_2["population"] / 1e6)
plt.show()
Since both datasets seem accurate enough, from now on we'll keep only the first one, which is the more exhaustive of the two.
df = dataset_1.copy()
del dataset_1
del dataset_2
print(df.shape)
df.head()
(139571, 6)
ASCII Name | Country Code | Country name EN | population | lat | lng | |
---|---|---|---|---|---|---|
0 | Sanjiang | CN | China | 0 | 31.34813 | 118.36132 |
1 | Xinmin | CN | China | 28033 | 30.39759 | 107.38950 |
2 | Jindaoxia | CN | China | 13752 | 30.00528 | 106.65187 |
3 | Jianlong | CN | China | 18151 | 29.35860 | 106.18522 |
4 | Jianhua | CN | China | 0 | 47.35773 | 123.95977 |
We add a column for the continent code, which might be useful later:
df["Country Code"] = df["Country Code"].astype("string")
df.query("`Country Code` != 'TL'", inplace=True) # Remove Timor-Leste, not supported by pycountry
df.query("`Country Code` != 'EH'", inplace=True) # Remove Western Sahara, not supported by pycountry
df["Continent Code"] = df.apply(lambda row: pc.country_alpha2_to_continent_code(row["Country Code"]), axis=1)
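Applying the conversion row by row repeats the same lookup for every city of a country; mapping the codes through a dictionary does one lookup per row and is noticeably faster on 140,000 rows. A sketch, with a tiny hand-rolled lookup table standing in for `pc.country_alpha2_to_continent_code` (the mapping below is illustrative and covers only these four codes):

```python
import pandas as pd

# Illustrative stand-in for pc.country_alpha2_to_continent_code.
ALPHA2_TO_CONTINENT = {"FR": "EU", "JP": "AS", "CA": "NA", "NZ": "OC"}

df3 = pd.DataFrame({"Country Code": ["FR", "JP", "FR", "CA", "NZ"]})

# Series.map performs one dict lookup per row instead of one Python
# function call per row as with df.apply(..., axis=1).
df3["Continent Code"] = df3["Country Code"].map(ALPHA2_TO_CONTINENT)
assert df3["Continent Code"].tolist() == ["EU", "AS", "EU", "NA", "OC"]
```

With the real library, the dictionary could be built once from `df["Country Code"].unique()`.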
Number of cities per continent:
print(df.shape)
df["Continent Code"].value_counts()
(139552, 7)
EU 69029
NA 29755
AS 24765
SA 6701
AF 4889
OC 4413
Name: Continent Code, dtype: int64
We now need some historical weather data.
As explained earlier, we can't use the Wikipedia list since it only contains data for "big" cities.
We need historical weather data of the last decade for all the cities shortlisted.
Monthly statistics are enough, since temperatures don't change significantly within a month. We'll compute the variance in temperature of a city from the monthly weather statistics alone.
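To make this concrete, here is how the yearly statistics fall out of twelve monthly values (a sketch; the temperatures below are invented):

```python
import numpy as np

# Twelve illustrative monthly mean temperatures in °C, made up for the example.
monthly = np.array([5.0, 6.0, 9.0, 13.0, 17.0, 21.0, 24.0, 25.0, 21.0, 16.0, 10.0, 6.0])

mean_ = monthly.mean()
std_ = monthly.std(ddof=1)  # sample std dev, matching pandas' default .std()
range_ = monthly.max() - monthly.min()
assert range_ == 20.0
```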
OpenWeather provides an API with statistical monthly weather data (temperature, humidity, wind, etc.), but it's way too expensive for us.
For now, we'll use Meteostat, an open platform which provides free access to historical weather and climate data. Unfortunately, it lacks information about wind speed and humidity.
Getting weather data for Tokyo is straightforward:
tokyo = Point(lat=35.6839, lon=139.7744)
data = Normals(tokyo, 1991, 2020).fetch()
data
tavg | tmin | tmax | prcp | wspd | pres | tsun | |
---|---|---|---|---|---|---|---|
month | |||||||
1 | 6.3 | 2.6 | 10.0 | 59.8 | NaN | 1015.9 | NaN |
2 | 7.0 | 3.1 | 10.8 | 56.6 | NaN | 1015.9 | NaN |
3 | 10.0 | 5.9 | 14.0 | 116.6 | NaN | 1015.1 | NaN |
4 | 14.8 | 10.7 | 19.0 | 134.1 | NaN | 1013.9 | NaN |
5 | 19.6 | 15.7 | 23.5 | 139.9 | NaN | 1011.9 | NaN |
6 | 22.8 | 19.5 | 26.0 | 168.0 | NaN | 1009.0 | NaN |
7 | 26.7 | 23.4 | 30.0 | 156.5 | NaN | 1008.6 | NaN |
8 | 28.1 | 24.7 | 31.5 | 154.7 | NaN | 1010.2 | NaN |
9 | 24.4 | 21.2 | 27.6 | 222.4 | NaN | 1013.0 | NaN |
10 | 19.0 | 15.8 | 22.2 | 231.3 | NaN | 1016.6 | NaN |
11 | 13.8 | 10.3 | 17.2 | 96.5 | NaN | 1018.1 | NaN |
12 | 8.7 | 5.0 | 12.4 | 58.1 | NaN | 1016.9 | NaN |
data.plot(y=["tavg", "tmin", "tmax"])
plt.ylabel("Temperature °C")
plt.show()
Pretty big variations over the year for Tokyo!
Having lived there for a few weeks, I find the `tmax` data closer to my perceived temperature than `tavg`.
We code a simple function to get weather data for a specific location:
def get_mean_std_temperature_for_loc(lat: float, lon: float):
pt = Point(lat=lat, lon=lon)
try:
with warnings.catch_warnings(): # disable some meteostat warnings
warnings.simplefilter("ignore")
data = Normals(pt, 1991, 2020).fetch()
except pd.errors.ParserError: # escape some meteostat bugs
return np.nan, np.nan, np.nan, np.nan
mean_ = data["tmax"].mean()
std_ = data["tmax"].std()
tmax_ = data["tmax"].max()
range_ = data["tmax"].max() - data["tmax"].min()
return mean_, std_, tmax_, range_
Gathering weather data for 140,000 cities would take quite some time, so we keep only the biggest ones:
df.query("population >= 100000", inplace=True)
We iterate over the cities in the dataframe and fetch weather data for each. Not the most efficient way to do it, but it works...:
Tmean = []
Tstd = []
Tmax = []
Trange = []
for row in tqdm(df.itertuples(), total=df.shape[0]):
mean_, std_, tmax_, range_ = get_mean_std_temperature_for_loc(row.lat, row.lng)
Tmean.append(mean_)
Tstd.append(std_)
Tmax.append(tmax_)
Trange.append(range_)
df["Tmean"] = Tmean
df["Tstd"] = Tstd
df["Tmax"] = Tmax
df["Trange"] = Trange
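If this loop ever becomes too slow, the per-city requests are mostly I/O-bound, so a thread pool should help. Here is a sketch with a dummy fetch function standing in for `get_mean_std_temperature_for_loc` (whether Meteostat's client is safe to call from multiple threads would need checking first):

```python
from concurrent.futures import ThreadPoolExecutor

# Dummy stand-in for the real per-city fetch; returns fake (mean, std, max, range).
def fake_fetch(coords):
    lat, lng = coords
    return (lat + lng, 0.0, lat, lng)

coords = [(35.68, 139.77), (50.80, -1.09), (1.29, 103.85)]

# executor.map preserves input order, so results line up with the dataframe rows.
with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(fake_fetch, coords))

assert len(results) == len(coords)
assert results[0] == (35.68 + 139.77, 0.0, 35.68, 139.77)
```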
Some cities don't have any weather data at all, so we drop them:
df.dropna(subset=["Tmean", "Tstd", "Tmax"], inplace=True)
fig, axes = plt.subplots(2, 2)
df[["Tmean"]].plot.kde(ax=axes[0, 0])
axes[0, 0].set_xlabel("Mean temperature over a year")
axes[0, 0].get_legend().remove()
df[["Tmax"]].plot.kde(ax=axes[0, 1])
axes[0, 1].set_xlabel("Max temperature over a year")
axes[0, 1].get_legend().remove()
df[["Tstd"]].plot.kde(ax=axes[1, 1])
axes[1, 1].set_xlabel("Std dev in temperature over a year")
axes[1, 1].get_legend().remove()
df[["Trange"]].plot.kde(ax=axes[1, 0])
axes[1, 0].set_xlabel("Range in temperature over a year")
axes[1, 0].get_legend().remove()
plt.show()
Out of curiosity, let's get the extreme values in temperature:
df.query("Tmean == Tmean.max() or Tmean == Tmean.min()")[["ASCII Name", "Country name EN", "Tmean", "Tstd"]]
ASCII Name | Country name EN | Tmean | Tstd | |
---|---|---|---|---|
37323 | Mecca | Saudi Arabia | 38.683333 | 4.705864 |
49692 | Yakutsk | Russian Federation | -2.883333 | 22.694887 |
Cities with the most stable/unstable temperatures over a year:
df.query("Tstd == Tstd.min() or Tstd == Tstd.max()")[["ASCII Name", "Country name EN", "Tmean", "Tstd"]]
ASCII Name | Country name EN | Tmean | Tstd | |
---|---|---|---|---|
49692 | Yakutsk | Russian Federation | -2.883333 | 22.694887 |
122628 | Jayapura | Indonesia | 32.241667 | 0.317543 |
Yakutsk seems like a lovely city to live in...
Get some cities to plot later:
cities_to_plot = [
df.query("`ASCII Name` == 'Reykjavik' and `Country name EN` == 'Iceland'"),
df.query("`ASCII Name` == 'Yakutsk' and `Country name EN` == 'Russian Federation'"),
df.query("`ASCII Name` == 'Edinburgh' and `Country name EN` == 'United Kingdom'"),
df.query("`ASCII Name` == 'Paris' and `Country name EN` == 'France'"),
df.query("`ASCII Name` == 'Oslo' and `Country name EN` == 'Norway'"),
df.query("`ASCII Name` == 'Stockholm' and `Country name EN` == 'Sweden'"),
df.query("`ASCII Name` == 'Helsinki' and `Country name EN` == 'Finland'"),
df.query("`ASCII Name` == 'Tokyo' and `Country name EN` == 'Japan'"),
df.query("`ASCII Name` == 'Toronto' and `Country name EN` == 'Canada'"),
df.query("`ASCII Name` == 'Singapore' and `Country name EN` == 'Singapore'"),
]
assert all(not city.empty for city in cities_to_plot)
plt.scatter(df["Tmean"], df["Tstd"], s=5, c="k", marker=".")
plt.xlabel("Mean temperature")
plt.ylabel("Temperature standard deviation")
for city in cities_to_plot:
plt.scatter(
city["Tmean"].iloc[0],
city["Tstd"].iloc[0],
s=300,
label=f'{city["ASCII Name"].iloc[0]} ({city["Country Code"].iloc[0]})',
)
plt.legend()
plt.show()
Nice inverse correlation!
We plot the same data on a map:
world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
ax = world.boundary.plot(edgecolor="black")
sc = plt.scatter(x=df["lng"], y=df["lat"], c=df["Tmean"], s=100, cmap="winter")
plt.colorbar(sc, label="Mean temperature °C")
plt.show()
Seems like many cities in Africa and South America disappeared for lack of data.
Disclaimer: everything below is based only on my personal preferences...
We set some thresholds to filter out cities with warm weather:
df_backup = df.copy()
df.query("Tmean <= 17 and Tmax <= 30 and Tstd <= 10 and Trange <= 14", inplace=True)
Most shortlisted cities are in Europe:
print(df.shape)
print(df.groupby(["Continent Code"])["Country name EN"].value_counts())
(37, 11)
Continent Code Country name EN
EU United Kingdom 28
Ireland 2
France 1
Iceland 1
NA Canada 3
OC New Zealand 2
Name: Country name EN, dtype: int64
The shortlisted Canadian cities were hit by the 2021 heat wave. No thanks, I'll pass.
df_europe = df.query("`Continent Code` == 'EU'")
df_nz = df.query("`Continent Code` == 'OC'")
fig, (ax1, ax2) = plt.subplots(1, 2)
world.boundary.plot(ax=ax1, edgecolor="black")
world.boundary.plot(ax=ax2, edgecolor="black")
sc1 = ax1.scatter(x=df_europe["lng"], y=df_europe["lat"], c=df_europe["Tmean"], s=100, cmap="winter")
sc2 = ax2.scatter(x=df_nz["lng"], y=df_nz["lat"], c=df_nz["Tmean"], s=100, cmap="winter")
ax1.set_xlim(-25, 5)
ax1.set_ylim(40, 70)
ax2.set_xlim(160, 180)
ax2.set_ylim(-50, -30)
cbar = fig.colorbar(sc2, ax=[ax1, ax2])
cbar.set_label("Mean temperature °C")
ax1.set_title("Europe")
ax2.set_title("New Zealand")
plt.show()
I was expecting to see Ireland and Scotland, so no surprise there.
I did think, however, that some cities in Norway/Sweden/Finland would make the shortlist. Looks like they have a higher temperature range:
df_shortlisted = pd.concat((df_europe, df_nz))
df_nordic = df_backup.query(
"`Country name EN` == 'Norway' or `Country name EN` == 'Sweden' or `Country name EN` == 'Finland'"
)
fig, axes = plt.subplots(2, 2)
df_nordic[["Tmean"]].plot.kde(ax=axes[0, 0])
df_shortlisted[["Tmean"]].plot.kde(ax=axes[0, 0])
axes[0, 0].legend(["Nordic (NO/SE/FI)", "IS/IE/UK/FR/NZ"])
axes[0, 0].set_xlabel("Mean temperature over a year")
df_nordic[["Tmax"]].plot.kde(ax=axes[0, 1])
df_shortlisted[["Tmax"]].plot.kde(ax=axes[0, 1])
axes[0, 1].legend(["Nordic (NO/SE/FI)", "IS/IE/UK/FR/NZ"])
axes[0, 1].set_xlabel("Max temperature over a year")
df_nordic[["Tstd"]].plot.kde(ax=axes[1, 1])
df_shortlisted[["Tstd"]].plot.kde(ax=axes[1, 1])
axes[1, 1].legend(["Nordic (NO/SE/FI)", "IS/IE/UK/FR/NZ"])
axes[1, 1].set_xlabel("Std dev in temperature over a year")
df_nordic[["Trange"]].plot.kde(ax=axes[1, 0])
df_shortlisted[["Trange"]].plot.kde(ax=axes[1, 0])
axes[1, 0].legend(["Nordic (NO/SE/FI)", "IS/IE/UK/FR/NZ"])
axes[1, 0].set_xlabel("Range in temperature over a year")
plt.show()
UK/IS/IE seem to have the "best" weather for me.
Possible future steps: include humidity/wind data to compute the perceived temperature or heat index.
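As a taste of that last step, here is a sketch of the heat index via the NWS Rothfusz regression, which estimates apparent temperature from air temperature and relative humidity (coefficients as published by the NWS; the full algorithm adds adjustment terms near the edges of its validity range):

```python
def heat_index_f(temp_f: float, rh: float) -> float:
    """Apparent temperature (°F) from air temperature (°F) and relative
    humidity (%), per the NWS Rothfusz regression. Roughly valid for
    temp_f >= 80 and rh >= 40."""
    t, r = temp_f, rh
    return (
        -42.379
        + 2.04901523 * t
        + 10.14333127 * r
        - 0.22475541 * t * r
        - 6.83783e-3 * t * t
        - 5.481717e-2 * r * r
        + 1.22874e-3 * t * t * r
        + 8.5282e-4 * t * r * r
        - 1.99e-6 * t * t * r * r
    )

# 90 °F at 70% humidity feels like roughly 105 °F on the NWS chart.
hi = heat_index_f(90.0, 70.0)
assert 100.0 < hi < 110.0
```

Wiring this in would require a humidity source Meteostat's normals don't provide, which is exactly the gap mentioned above.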