update readme

OuyangWenyu · Apr 13, 2024 · 36e76e5
1 parent 6aec414
Showing 5 changed files with 153 additions and 128 deletions.
diff --git a/README.md b/README.md
@@ -1,73 +1,80 @@
 <!--
  * @Author: Wenyu Ouyang
- * @Date: 2022-05-28 17:46:32
- * @LastEditTime: 2023-12-15 17:00:20
+ * @Date: 2024-04-13 18:29:19
+ * @LastEditTime: 2024-04-13 21:45:43
  * @LastEditors: Wenyu Ouyang
- * @Description: README for torchhydro
- * @FilePath: /torchhydro/README.md
- * Copyright (c) 2021-2022 Wenyu Ouyang. All rights reserved.
+ * @Description: English version of the README
+ * @FilePath: \torchhydro\README.md
+ * Copyright (c) 2023-2024 Wenyu Ouyang. All rights reserved.
 -->
-# torchhydro
+# Torchhydro
 
 
 [![image](https://img.shields.io/pypi/v/torchhydro.svg)](https://pypi.python.org/pypi/torchhydro)
 [![image](https://img.shields.io/conda/vn/conda-forge/torchhydro.svg)](https://anaconda.org/conda-forge/torchhydro)
 
+- License: BSD license
+- Documentation: https://OuyangWenyu.github.io/torchhydro  
 
-**datasets, samplers, transforms, and pre-trained models for hydrology and water resources**
+**Note: This repository is still under development**
 
+## Installation
 
--   Free software: BSD license
--   Documentation: https://OuyangWenyu.github.io/torchhydro  
+We provide a pip package for installation:
 
-**NOTE: THIS REPOSITORY IS **STILL UNDER **DEVELOPMENT**!!!****  
-
-## Features
-
--   TODO
-
-## Data source settings
-
-We set a unified data path in hydro_setting.yml in the user directory (for example, `C:\Users\username\` in Windows). You can change the data path in this file.
-
-We have some conventions for data sources and we think it could be better to follow these conventions:
-
-1. Public datasets such as CAMELS are put in the `waterism/datasets-origin` directory.
-2. The processed datasets are put in the `waterism/datasets-interim` directory.
-3. Some cache files are put in the `.hydrodataset/cache` directory.
-
-We set these conventions in the `settings.json`` file. You can specify by yourself. Don't change the "key" in the JSON file unless you are clear about what you are doing.
-
-## For developers
+```Shell
+pip install torchhydro
+```
 
-To install the environment, run the following code in the terminal:
+If you want to participate in the development as a developer, you can install the environment and download the code using the following method:
 
 ```Shell
+# fork this repository to your GitHub account -- xxxx
+git clone git@github.com:xxxx/torchhydro.git
+cd torchhydro
+# If you find it slow, you can install with mamba
+# conda install mamba -c conda-forge
+# mamba env create -f env-dev.yml
 conda env create -f env-dev.yml
 conda activate torchhydro
 ```
 
-To use this repository of dev or other branches in your existing environment:
+## Usage
+
+Currently, we provide an example of training an LSTM on the CAMELS dataset. The functions for reading CAMELS live in [hydrodataset](https://github.com/OuyangWenyu/hydrodataset), so first read its README to download the data and place it in the expected folder. For the folder configuration, check whether a hydro_setting.yml file exists in your user directory. If it does not, create one manually and refer to [here](https://github.com/OuyangWenyu/torchhydro/blob/6aec414d99e35f4f1672903eb9e18e8eebeadb09/torchhydro/__init__.py#L34) to make sure local_data_path is set correctly. If you can't download the CAMELS data, you can use the copy we uploaded to Kaggle: [kaggle CAMELS](https://www.kaggle.com/datasets/headwater/camels)
 
-1. you can fork it to your GitHub account. Don't choose "only fork the main branch" when forking in the Github page.
-2. run the following code in the terminal:
+Then you can try running the files under the experiments folder, such as:
 
 ```Shell
-# xxxxxx is your github account; here we choose to use dev branch
-pip install git+ssh://git@github.com/xxxxxx/torchhydro.git@dev
+cd experiments
+python run_camelslstm_experiments.py
 ```
 
-For the dataset we set a unified data path in settings.txt in the `.hydrodataset` directory which is located in the user directory (for example, `C:\Users\username\.hydrodataset` in Windows). You can change the data path in this file.
+More tutorials will be added gradually.
+
+## Main Modules
+
+The program mainly consists of trainers, models, datasets, and configs, plus an additional explainer module responsible for model interpretation.
+
+- **Trainers**: Designed to handle various training modes. The main one is the DeepHydro class in the deep_hydro module (a .py file). This class configures its data sources, reads the configurations for the model, data, training, and testing (details here), then initializes the model (load_model function) and the data (make_dataset function), and runs training (model_train function) and testing (model_evaluate function). The transfer learning, multitask learning, and federated learning modes inherit from this class and override the specific execution code.
+- **Models**: Declared mainly through a model_dict, which lists the models available for configuration; the choice of loss is handled here as well. The remaining modules are the individual models, such as LSTM or differentiable models coupled with physical mechanisms.
+- **Datasets**: First, we set up several datasource repositories to provide data sources, including [hydrodataset](https://github.com/OuyangWenyu/hydrodataset) for public datasets (like CAMELS) and [hydrodatasource](https://github.com/iHeadWater/hydrodatasource) for data you organize yourself. These data sources mainly provide data access; in torchhydro, specific torch datasets are then written to match the data type the model expects. The datasets are likewise recorded in a dict, followed by the specific dataset class modules.
+- **Configs**: The overall configuration, loaded when the DeepHydro class is initialized. It lives in the config module and has four parts: model (currently mode and model together), data (time range of the data, modeling objects, etc.), training (training epochs, batch size, etc.), and testing (performance metrics).
 
-Then we have some conventions for the dataset:
+## Why Torchhydro?
 
-1. Public datasets such as CAMELS is put in the `waterism/datasets-origin` directory.
-2. The processed datasets are put in the `waterism/datasets-interim` directory.
+Although there are relatively mature tools like [NeuralHydrology](https://github.com/neuralhydrology/neuralhydrology), we chose not to use it directly for several reasons:
+1. Our model-building workflow is not limited to mapping a fixed dataset to a fixed Dataset class and then connecting it to the model. Data sources, especially non-public data as is common in China, are complex enough to need a separate Datasource module that handles them before a torch Dataset is built. This extra layer of abstraction makes code reuse easier. Moreover, not everyone needs deep learning, so a separate Datasource module lets more hydrologists use it. We created [hydrodataset](https://github.com/OuyangWenyu/hydrodataset) and [hydrodatasource](https://github.com/iHeadWater/hydrodatasource) for this reason.
+2. Deep learning modes are not limited to single-variable supervised learning of runoff. Commonly used modes include transfer learning, multitask learning, and federated learning. These modes may use the same models as the conventional mode, but the code that expresses them differs significantly, so they have to be considered in the overall program design.
+3. Data traversal, normalization methods, data sampling during batch generation, and dropout during model training sometimes need extra configuration, which calls for a more flexible design that accommodates different settings.
+4. For historical reasons, we developed torchhydro independently and in parallel from the beginning and have kept it that way. The main idea is to move configuration to the outermost layer as far as possible, so that data and models can be matched and called flexibly.
 
-You can specify by yourself, but some changes are needed. We will optimize this part in the future.
+## Additional Information
 
-## Credits
+This package was inspired by:
 
-This package is inspired by [TorchGeo](https://torchgeo.readthedocs.io/en/stable/).
+- [TorchGeo](https://torchgeo.readthedocs.io/en/stable/)
+- [NeuralHydrology](https://github.com/neuralhydrology/neuralhydrology)
+- [hydroDL](https://github.com/mhpi/hydroDL)
 
-It was created with [Cookiecutter](https://github.com/cookiecutter/cookiecutter) and the [giswqs/pypackage](https://github.com/giswqs/pypackage) project template.
+This package was created using [Cookiecutter](https://github.com/cookiecutter/cookiecutter) and the [giswqs/pypackage](https://github.com/giswqs/pypackage) project template.
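
The Usage paragraph in the new README depends on `local_data_path` in `hydro_setting.yml` being set correctly. A minimal sketch of how the experiment scripts in this commit derive the CAMELS folder from that setting is shown here. It uses a stand-in dictionary rather than torchhydro's real `SETTING` object; the root folder and the `check_camels_path` helper are hypothetical, and only the `local_data_path` and `datasets-origin` keys and the `camels/camels_us` subfolders are taken from the diff.

```python
import os

# Stand-in for torchhydro's SETTING object; the root folder is a made-up example.
SETTING = {
    "local_data_path": {"datasets-origin": os.path.join("data", "datasets-origin")}
}


def camels_us_path(setting):
    """Build the CAMELS-US folder path the same way the experiment scripts do."""
    return os.path.join(
        setting["local_data_path"]["datasets-origin"], "camels", "camels_us"
    )


def check_camels_path(setting):
    """Hypothetical helper: return the resolved path and whether it exists locally."""
    path = camels_us_path(setting)
    return path, os.path.isdir(path)


path, exists = check_camels_path(SETTING)
print(path, "(found)" if exists else "(missing: check local_data_path)")
```

Running a check like this before an experiment makes a wrong `local_data_path` visible early, instead of surfacing as a read error inside the data source.
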
diff --git a/README.zh.md b/README.zh.md
@@ -0,0 +1,82 @@
+<!--
+ * @Author: Wenyu Ouyang
+ * @Date: 2024-04-13 18:29:19
+ * @LastEditTime: 2024-04-13 19:30:01
+ * @LastEditors: Wenyu Ouyang
+ * @Description: Chinese version of the README
+ * @FilePath: \torchhydro\README.zh.md
+ * Copyright (c) 2023-2024 Wenyu Ouyang. All rights reserved.
+-->
+# Torchhydro
+
+
+[![image](https://img.shields.io/pypi/v/torchhydro.svg)](https://pypi.python.org/pypi/torchhydro)
+[![image](https://img.shields.io/conda/vn/conda-forge/torchhydro.svg)](https://anaconda.org/conda-forge/torchhydro)
+
+- License: BSD license
+- Documentation: https://OuyangWenyu.github.io/torchhydro  
+
+**Note: This repository is still under development**
+
+## Installation
+
+We provide a pip package for installation:
+
+```Shell
+pip install torchhydro
+```
+
+If you want to take part in development, you can set up the environment and download the code as follows:
+
+```Shell
+# fork this repository to your GitHub account -- xxxx
+git clone git@github.com:xxxx/torchhydro.git
+cd torchhydro
+# If you find it slow, you can install with mamba
+# conda install mamba -c conda-forge
+# mamba env create -f env-dev.yml
+conda env create -f env-dev.yml
+conda activate torchhydro
+```
+
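+## Usage
+
+Currently, we provide an example of training an LSTM on the CAMELS dataset. The functions for reading CAMELS are all in [hydrodataset](https://github.com/OuyangWenyu/hydrodataset), so first read its README to download the data and place it in the specified folder. For the folder configuration, check whether there is a hydro_setting.yml file in your user directory. If not, create one manually and refer to [here](https://github.com/OuyangWenyu/torchhydro/blob/6aec414d99e35f4f1672903eb9e18e8eebeadb09/torchhydro/__init__.py#L34) to make sure local_data_path is correct. If you can't download the CAMELS data, you can use a version we uploaded to Kaggle: [kaggle CAMELS](https://www.kaggle.com/datasets/headwater/camels)
+
+Then you can try running the files under the experiments folder, for example: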
+
+```Shell
+cd experiments
+python run_camelslstm_experiments.py
+```
+
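+More tutorials will be added gradually.
+
+## Main Modules
+
+The program mainly consists of trainers, models, datasets, and configs, plus an additional explainer responsible for model interpretation.
+
+- trainers: Designed to handle various modes. The main part is the DeepHydro class in the deep_hydro module (a .py file). This class configures its data sources, reads the configurations for the model, data, training, and testing (details here), then initializes the model (load_model function) and the data (make_dataset function), and runs training (model_train function) and testing (model_evaluate function). The transfer learning, multitask learning, and federated learning modes all inherit from this class and override the specific execution code.
+- models: Models are declared through a model_dict whose values show which models can be used, which makes them easy to select in the configuration. The choice of loss is included here as well. The rest are the module files of the individual models, such as LSTM and differentiable models coupled with physical mechanisms.
+- datasets: First, we set up several datasource repositories to provide data sources, including [hydrodataset](https://github.com/OuyangWenyu/hydrodataset) for public datasets (such as CAMELS) and [hydrodatasource](https://github.com/iHeadWater/hydrodatasource) for your own data (which, unlike a ready-made dataset such as CAMELS, you need to organize yourself). These data sources mainly provide data access. In torchhydro, specific torch datasets are then written to match the data type the model expects. The datasets are also recorded in a dict, followed by the modules of the specific dataset classes. datasets also contains modules with normalization and general data-processing tools.
+- configs: The overall configuration, loaded when the DeepHydro class is initialized. It lives in the config module and has four parts: model (currently mode and model together), data (time range of the data, modeling objects, etc.), training (training epochs, batch size, etc.), and testing (performance metrics, etc.).
+
+More detailed documentation will be added later.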
+
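+## Why Torchhydro?
+
+Although there are relatively mature tools such as [NeuralHydrology](https://github.com/neuralhydrology/neuralhydrology), we chose not to use it directly, for several reasons:
+1. Our model-building workflow is not limited to mapping a fixed dataset to a fixed Dataset and then connecting it to the model. Data sources, especially once non-public data as is common in China is considered, become very complex and need to be handled independently by a dedicated Datasource module before a torch Dataset is built. This extra layer of abstraction makes code reuse easier. In addition, not everyone needs deep learning, so a separate Datasource module lets more people in hydrology use it. We created [hydrodataset](https://github.com/OuyangWenyu/hydrodataset) and [hydrodatasource](https://github.com/iHeadWater/hydrodatasource) for this reason.
+2. Deep learning modes are not limited to single-variable supervised learning of runoff. Commonly used modes include transfer learning, multitask learning, and federated learning. These modes may use the same models as the conventional mode, but the code that expresses them differs considerably, so they have to be considered in the overall program design.
+3. Data traversal, normalization methods, data sampling during batch generation, and dropout during model training sometimes need extra configuration, which calls for a more flexible design that accommodates different settings.
+4. For historical reasons, we developed torchhydro independently and in parallel from the very beginning and have kept it that way. The main idea is to move configuration to the outermost layer as far as possible, so that data and models can be matched and called flexibly.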
+
+## Additional Information
+
+This package was inspired by:
+
+- [TorchGeo](https://torchgeo.readthedocs.io/en/stable/)
+- [NeuralHydrology](https://github.com/neuralhydrology/neuralhydrology)
+- [hydroDL](https://github.com/mhpi/hydroDL)
+
+This package was created using [Cookiecutter](https://github.com/cookiecutter/cookiecutter) and the [giswqs/pypackage](https://github.com/giswqs/pypackage) project template.
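
Both README versions describe models being declared through a `model_dict` and selected by configuration. The general pattern can be sketched as a plain registry dictionary. This is an illustration only: `MODEL_REGISTRY`, `TinyLSTM`, `TinyLinear`, and `build_model` are placeholder names, not torchhydro's actual `model_dict` or model classes. The `model_name` and `model_hyperparam` argument names mirror the keyword arguments visible in the experiment scripts of this commit.

```python
class TinyLSTM:
    """Placeholder model class standing in for a real network."""

    def __init__(self, hidden_size=16):
        self.hidden_size = hidden_size


class TinyLinear:
    """Second placeholder model, to show selection by name."""

    def __init__(self, n_input=1):
        self.n_input = n_input


# Registry mapping a configurable name to the class that implements it.
MODEL_REGISTRY = {"TinyLSTM": TinyLSTM, "TinyLinear": TinyLinear}


def build_model(model_name, model_hyperparam=None):
    """Look up a model class by name and instantiate it with its hyperparameters."""
    if model_name not in MODEL_REGISTRY:
        raise KeyError(
            f"Unknown model {model_name!r}; available: {sorted(MODEL_REGISTRY)}"
        )
    return MODEL_REGISTRY[model_name](**(model_hyperparam or {}))


model = build_model("TinyLSTM", {"hidden_size": 32})
print(type(model).__name__, model.hidden_size)
```

Keeping the registry in one dictionary means a new model only needs one extra entry to become selectable from the configuration.
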
diff --git a/experiments/run_camelsdplxaj_experiments.py b/experiments/run_camelsdplxaj_experiments.py
@@ -1,7 +1,7 @@
 """
 Author: Wenyu Ouyang
 Date: 2023-09-20 20:05:10
-LastEditTime: 2024-04-09 19:59:43
+LastEditTime: 2024-04-13 19:32:48
 LastEditors: Wenyu Ouyang
 Description: A case for dPL-XAJ model
 FilePath: \torchhydro\experiments\run_camelsdplxaj_experiments.py
@@ -36,11 +36,12 @@ def run_dplxaj(train_period=None, valid_period=None, test_period=None):
     config = default_config_file()
     args = cmd(
         sub="test_camels/expdplxaj",
-        source="CAMELS",
-        source_region="US",
-        source_path=os.path.join(
-            SETTING["local_data_path"]["datasets-origin"], "camels", "camels_us"
-        ),
+        source_cfgs={
+            "source_name": "camels_us",
+            "source_path": os.path.join(
+                SETTING["local_data_path"]["datasets-origin"], "camels", "camels_us"
+            ),
+        },
         ctx=[0],
         model_name="DplLstmXaj",
         model_hyperparam={
@@ -99,7 +100,10 @@ def run_dplxaj(train_period=None, valid_period=None, test_period=None):
         target_as_input=0,
         constant_only=0,
         train_epoch=2,
-        te=2,
+        model_loader={
+            "load_way": "specified",
+            "test_epoch": 20,
+        },
         warmup_length=10,
         opt="Adadelta",
         which_first_tensor="sequence",
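
This hunk replaces `te=2` with a `model_loader` dict whose `test_epoch` is 20, while `train_epoch` stays at 2, so it may be worth double-checking that the value is intended. A small sketch of a consistency check follows. `make_model_loader` is a hypothetical helper, not part of torchhydro, and it assumes that `"load_way": "specified"` means loading the weights saved at `test_epoch`, which the diff itself does not confirm.

```python
def make_model_loader(test_epoch, train_epoch, load_way="specified"):
    """Hypothetical helper: build the model_loader dict and sanity-check it.

    Assumption: with load_way="specified", evaluation loads the weights saved
    at `test_epoch`, so that epoch must have been reached during training.
    """
    if load_way == "specified" and test_epoch > train_epoch:
        raise ValueError(
            f"test_epoch={test_epoch} exceeds train_epoch={train_epoch}; "
            "no weights would exist for that epoch"
        )
    return {"load_way": load_way, "test_epoch": test_epoch}


# Consistent settings, as in run_camelslstm_experiments.py (20 and 20).
print(make_model_loader(test_epoch=20, train_epoch=20))

# The combination in this hunk (test_epoch=20, train_epoch=2) is rejected.
try:
    make_model_loader(test_epoch=20, train_epoch=2)
except ValueError as err:
    print("rejected:", err)
```
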

diff --git a/experiments/run_camelslstm_experiments.py b/experiments/run_camelslstm_experiments.py
@@ -1,7 +1,7 @@
 """
 Author: Wenyu Ouyang
 Date: 2022-09-09 14:47:42
-LastEditTime: 2024-02-14 16:06:26
+LastEditTime: 2024-04-13 19:27:10
 LastEditors: Wenyu Ouyang
 Description: a script to run experiments for LSTM - CAMELS
 FilePath: \torchhydro\experiments\run_camelslstm_experiments.py
@@ -62,11 +62,12 @@ def run_normal_dl(
     config_data = default_config_file()
     args = cmd(
         sub=project_name,
-        source="CAMELS",
-        source_region="US",
-        source_path=os.path.join(
-            SETTING["local_data_path"]["datasets-origin"], "camels", "camels_us"
-        ),
+        source_cfgs={
+            "source_name": "camels_us",
+            "source_path": os.path.join(
+                SETTING["local_data_path"]["datasets-origin"], "camels", "camels_us"
+            ),
+        },
         ctx=[0],
         model_name="KuaiLSTM",
         model_hyperparam={
@@ -90,7 +91,10 @@ def run_normal_dl(
         rs=1234,
         train_epoch=20,
         save_epoch=1,
-        te=20,
+        model_loader={
+            "load_way": "specified",
+            "test_epoch": 20,
+        },
         gage_id_file=gage_id_file,
         which_first_tensor="sequence",
     )
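
Both experiment scripts make the same change: the three keyword arguments `source`, `source_region`, and `source_path` are folded into one `source_cfgs` dict. For callers with their own scripts, the migration could be sketched as below. `to_source_cfgs` is a hypothetical helper, and the rule for deriving `source_name` is an assumption that matches only the single case shown in this diff. The data root in the example is made up.

```python
import os


def to_source_cfgs(source, source_region, source_path):
    """Hypothetical helper: fold the three old keyword arguments into one dict."""
    return {
        # Assumption: the new name is "<source>_<region>" in lower case, which
        # matches the one case in this diff ("CAMELS" + "US" -> "camels_us").
        "source_name": f"{source}_{source_region}".lower(),
        "source_path": source_path,
    }


# Old-style arguments as they appeared before this commit (example root folder).
old_kwargs = {
    "source": "CAMELS",
    "source_region": "US",
    "source_path": os.path.join("data", "datasets-origin", "camels", "camels_us"),
}

source_cfgs = to_source_cfgs(**old_kwargs)
print(source_cfgs)
```

The resulting dict would then be passed as `source_cfgs=...` in place of the three removed arguments.
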

diff --git a/experiments/test_mean_lstm.py b/experiments/test_mean_lstm.py