docs(tutorial): add yaml explain

gnes-ai · Jul 26, 2019 · 8927cd4 · 8927cd4
1 parent afd5dda
commit 8927cd4

Showing 9 changed files with 1,503 additions and 14 deletions.
diff --git a/README.md b/README.md
@@ -545,8 +545,8 @@ The official documentation of GNES is hosted on [doc.gnes.ai](https://doc.gnes.a
 
 > 🚧 Tutorial is still under construction. Stay tuned! Meanwhile, we sincerely welcome you to contribute your own learning experience / case study with GNES! 
 
-- [How to write your GNES YAML config](tutorials/gnes-yaml-specifications.md)
-- How to write a component-wise YAML config
+- [How to write your GNES YAML config](tutorials/gnes-compose-yaml-spec.md)
+- [How to write a component-wise YAML config](tutorials/component-yaml-spec.md)
 - Understanding preprocessor, encoder, indexer and router
 - Index and query text data with GNES
 - Index and query image data with GNES

diff --git a/docs/conf.py b/docs/conf.py
@@ -55,7 +55,6 @@
     'sphinxcontrib.apidoc',
     'sphinxarg.ext',
     'recommonmark',
-    'sphinx_markdown_tables',
 ]
 
 

diff --git a/docs/index.rst b/docs/index.rst
@@ -83,9 +83,6 @@ Tutorials
    🚧 Tutorial is still under construction. Stay tuned! Meanwhile, we sincerely welcome you to contribute your own learning experience / case study with GNES!
 
 
-Miscs
------
-
 .. toctree::
    :maxdepth: 1
    :caption: Miscs

diff --git a/docs/requirements.txt b/docs/requirements.txt
@@ -1,3 +1,2 @@
 sphinx-argparse
-sphinxcontrib-apidoc
-sphinx-markdown-tables
+sphinxcontrib-apidoc
diff --git a/tutorials/component-yaml-spec.md b/tutorials/component-yaml-spec.md
@@ -0,0 +1,290 @@
+# How to write a component-wise YAML config
+
+YAML is everywhere. This is pretty much your impression when first trying GNES. Understanding the YAML config is therefore extremely important to use GNES.
+
+Essentially, GNES requires two types of YAML config:
+- [GNES-compose YAML](gnes-compose-yaml-spec.md)
+- Component-wise YAML
+
+![](./img/mermaid-diagram-20190726180826.svg)
+
+All other YAML files, including the docker-compose YAML config and Kubernetes config generated from the [GNES Board](https://board.gnes.ai) or `gnes compose` command are not a part of this tutorial. Interested readers are welcome to read their [YAML specification](https://docs.docker.com/compose/compose-file/) respectively.
+
+
+## Table of Contents
+
+* [Component-wise YAML specification](#component-wise-yaml-specification)
+* [`!CLS` specification](#--cls--specification)
+* [`parameter` specification](#-parameter--specification)
+  - [Use `args` and `kwargs` to simplify the constructor](#use--args--and--kwargs--to-simplify-the-constructor)
+* [`gnes_config` specification](#-gnes-config--specification)
+* [Every component can be described with YAML in GNES](#every-component-can-be-described-with-yaml-in-gnes)
+* [Stack multiple encoders into a `PipelineEncoder`](#stack-multiple-encoders-into-a--pipelineencoder-)
+* [What's Next?](#what-s-next-)
+
+
+
+## Component-wise YAML specification
+
+Preprocessor, encoder, indexer and router are the fundamental components of GNES. They share the same YAML specification. The component-wise YAML defines how a component behaves. At the highest level, it contains three fields:
+
+|Argument| Type | Description|
+|---|---|---|
+| `!CLS` | str | choose from all class names registered in GNES |
+| `parameter` | map/dict | a list of key-value pairs that `CLS.__init__()` accepts|
+| `gnes_config`| map/dict | a list of key-value pairs for GNES |
+
+Let's take a look at an example:
+
+```yaml
+!BasePytorchEncoder
+parameter:
+  model_dir: ${VGG_MODEL}
+  model_name: vgg16
+  layers:
+    - features
+    - avgpool
+    - x.view(x.size(0), -1)
+    - classifier[0]
+gnes_config:
+  is_trained: true
+  name: my-awesome-vgg
+```
+
+In this example, we define a `BasePytorchEncoder` that loads a pretrained VGG16 model from the path `${VGG_MODEL}`. We then label this component as trained via `is_trained: true` and set its name to `my-awesome-vgg`.
+
+## `!CLS` specification
+
+`!CLS` is a name tag chosen from the class names registered in GNES. Currently, the following names are available:
+
+|`!CLS`| Component Type |
+|---|---|
+|`!BasePreprocessor`|Preprocessor|
+|`!TextPreprocessor`|Preprocessor|
+|`!BaseImagePreprocessor`|Preprocessor|
+|`!BaseTextPreprocessor`|Preprocessor|
+|`!BaseSlidingPreprocessor`|Preprocessor|
+|`!VanillaSlidingPreprocessor`|Preprocessor|
+|`!WeightedSlidingPreprocessor`|Preprocessor|
+|`!SegmentPreprocessor`|Preprocessor|
+|`!BaseUnaryPreprocessor`|Preprocessor|
+|`!BaseVideoPreprocessor`|Preprocessor|
+|`!FFmpegPreprocessor`|Preprocessor|
+|`!ShotDetectPreprocessor`|Preprocessor|
+|`!BertEncoder`|Encoder|
+|`!BertEncoderWithServer`|Encoder|
+|`!BertEncoderServer`|Encoder|
+|`!ElmoEncoder`|Encoder|
+|`!FlairEncoder`|Encoder|
+|`!GPTEncoder`|Encoder|
+|`!GPT2Encoder`|Encoder|
+|`!PCALocalEncoder`|Encoder|
+|`!PQEncoder`|Encoder|
+|`!TFPQEncoder`|Encoder|
+|`!Word2VecEncoder`|Encoder|
+|`!BaseEncoder`|Encoder|
+|`!BaseBinaryEncoder`|Encoder|
+|`!BaseTextEncoder`|Encoder|
+|`!BaseNumericEncoder`|Encoder|
+|`!CompositionalEncoder`|Encoder|
+|`!PipelineEncoder`|Encoder|
+|`!HashEncoder`|Encoder|
+|`!BasePytorchEncoder`|Encoder|
+|`!TFInceptionEncoder`|Encoder|
+|`!CVAEEncoder`|Encoder|
+|`!FaissIndexer`|Indexer|
+|`!LVDBIndexer`|Indexer|
+|`!AsyncLVDBIndexer`|Indexer|
+|`!NumpyIndexer`|Indexer|
+|`!BIndexer`|Indexer|
+|`!HBIndexer`|Indexer|
+|`!JointIndexer`|Indexer|
+|`!BaseIndexer`|Indexer|
+|`!BaseTextIndexer`|Indexer|
+|`!AnnoyIndexer`|Indexer|
+|`!BaseRouter`|Router|
+|`!BaseMapRouter`|Router|
+|`!BaseReduceRouter`|Router|
+|`!ChunkReduceRouter`|Router|
+|`!DocReduceRouter`|Router|
+|`!ConcatEmbedRouter`|Router|
+|`!PublishRouter`|Router|
+|`!DocBatchRouter`|Router|
+
+## `parameter` specification
+
+The key-value pairs defined in `parameter` map directly onto the arguments of the constructor of `!CLS`. Let's look at the signature of the `BasePytorchEncoder` constructor as an example:
+
+<table>
+<tr>
+<th>__init__()</th><th>YAML config</th>
+</tr>
+<tr>
+<td>
+   <pre lang="python">
+def __init__(self, model_name: str,
+                 layers: List[str],
+                 model_dir: str,
+                 batch_size: int = 64,
+                 use_cuda: bool = False,
+                 *args, **kwargs):
+  # do model init...
+  # ...
+   </pre>
+</td>
+<td>
+<pre lang="yaml">
+!BasePytorchEncoder
+parameter:
+  model_dir: ${VGG_MODEL}
+  model_name: vgg16
+  layers:
+    - features
+    - avgpool
+    - x.view(x.size(0), -1)
+    - classifier[0]
+</pre>
+</td>
+</tr>
+</table>
+
+Note that if an argument is defined in `__init__()` but not in the YAML, its default value is used; see `batch_size` and `use_cuda` above.
+
+#### Use `args` and `kwargs` to simplify the constructor
+
+When you port an external package/module to GNES, the original implementation sometimes takes too many arguments. It doesn't make sense to write a super long `__init__` like this:
+
+```python
+def __init__(self, arg1, arg2, arg3, arg4, arg5, ...):
+    self.arg1 = arg1
+    ext_module.cool_model(arg2, arg3, arg4, arg5, ...)
+```
+
+GNES provides a convenient way around this. Let's take `BertEncoder` as an example, which invokes `BertClient` from the [`bert-as-service`](https://github.com/hanxiao/bert-as-service/) module. In this case, `BertClient` accepts 10 arguments.
+
+<table>
+<tr>
+<th>__init__()</th><th>YAML config</th>
+</tr>
+<tr>
+<td>
+   <pre lang="python">
+class BertEncoder(BaseTextEncoder):
+    store_args_kwargs = True
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self.bert_client = BertClient(*args, **kwargs)
+   </pre>
+</td>
+<td>
+<pre lang="yaml">
+!BertEncoder
+parameter:
+  kwargs:
+    port: $BERT_CI_PORT
+    port_out: $BERT_CI_PORT_OUT
+    ignore_all_checks: true
+gnes_config:
+  is_trained: true
+</pre>
+</td>
+</tr>
+</table>
+
+Note how we define a map under `kwargs` to describe the arguments; they will be forwarded to the constructor of `BertClient`. Similarly, one can define a list under `args` to represent unnamed arguments.
+
+## `gnes_config` specification
+
+`gnes_config` defines some meta-information of this component. It accepts the following arguments:
+
+|Argument| Type | Description|
+|---|---|---|
+| `name` | str | the name of the component, default `None` |
+| `is_trained` | bool | either `True` or `False`; indicates whether the model has been trained |
+| `batch_size` | int | the batch size, often used in `encode()`, `train()` and `index()`; default `None`, meaning everything is done in one shot |
+| `work_dir`| str | the working directory of this component, default `$GNES_VOLUME` or the current directory |
+
+`name` is important: together with `work_dir`, it determines the I/O path used for serializing and deserializing the component. If you start a component without a name, it will be assigned a random name with its class name as the prefix.
+
+## Every component can be described with YAML in GNES
+
+The examples above are all about encoders. In fact, every component (encoder, preprocessor, router and indexer) can be described with YAML and loaded into GNES. For example:
+
+```yaml
+!TextPreprocessor
+parameter:
+  start_doc_id: 0
+  random_doc_id: True
+  deliminator: "[.。！？!?]+"
+gnes_config:
+  is_trained: true
+```
+
+Sometimes it can be quite simple, e.g.
+
+```yaml
+!PublishRouter
+parameter:
+  num_part: 2
+```
+
+Or even a one-liner, e.g.
+
+```yaml
+!ConcatEmbedRouter {}
+```
+
+You can find many more examples in the [unit tests](../tests/yaml).
+
+## Stack multiple encoders into a `PipelineEncoder`  
+
+For many real-world applications, a single encoder is often not enough. For example, the output of a `BertEncoder` is 768-dimensional. One may want to follow it with a dimensionality reduction or quantization model. Of course, one can spawn every encoder as an independent container and then connect them via GNES Board/`gnes compose`. But if you don't need them to be elastic, why bother? This is where `PipelineEncoder` can be very useful: it stacks multiple `BaseEncoder` instances together, simplifying data-flow in all runtimes (i.e. training, indexing and querying).
+
+#### PipelineEncoder in the training runtime
+![](./img/mermaid-diagram-20190726183010.svg)
+
+#### PipelineEncoder in the indexing and querying runtimes
+![](./img/mermaid-diagram-20190726183216.svg)
+
+
+To define a `PipelineEncoder`, you just need to sort the encoders in the right order and put them in a list under the `component` field. Let's look at the following example:
+
+```yaml
+!PipelineEncoder
+component:
+  - !BasePytorchEncoder
+    parameter:
+      model_dir: /ext_data/image_encoder
+      model_name: resnet50
+      layers:
+        - conv1
+        - bn1
+        - relu
+        - maxpool
+        - layer1
+        - layer2
+        - layer3
+        - layer4
+        - avgpool
+        - x.reshape(x.size(0), -1)
+    gnes_config:
+      is_trained: true
+  - !PCALocalEncoder
+    parameter:
+      output_dim: 200
+      num_locals: 10
+    gnes_config:
+      batch_size: 2048
+  - !PQEncoder
+    parameter:
+      cluster_per_byte: 20
+      num_bytes: 10
+gnes_config:
+  name: my-pipeline
+``` 
+
+Note how `gnes_config` is defined for each component and also globally at the very end.
+
+## What's Next?
+
+Now that you have learned how to configure a complete GNES app, it is time to run GNES in Shell/Docker/Docker Swarm/Kubernetes!
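
Note on `tutorials/component-yaml-spec.md`: the way `!CLS`, `parameter` and `gnes_config` relate to a constructor call can be sketched in plain Python. This is a minimal illustration and not GNES's actual loader: `REGISTRY`, `build_component`, `KwargsEncoder` and the stand-in `BasePytorchEncoder` are hypothetical names, the config is written as a dict that mirrors the YAML, and only the constructor signature is taken from the tutorial.

```python
from typing import Any, Dict, List


class BasePytorchEncoder:
    """Hypothetical stand-in; only the signature mirrors the tutorial."""

    def __init__(self, model_name: str, layers: List[str], model_dir: str,
                 batch_size: int = 64, use_cuda: bool = False,
                 *args, **kwargs):
        self.model_name = model_name
        self.layers = layers
        self.model_dir = model_dir
        self.batch_size = batch_size
        self.use_cuda = use_cuda


class KwargsEncoder:
    """Hypothetical encoder that forwards everything it receives,
    in the spirit of the BertEncoder example."""

    def __init__(self, *args, **kwargs):
        self.client_args = args
        self.client_kwargs = kwargs


# Assumption: a simple name -> class lookup plays the role of the `!CLS` tag.
REGISTRY = {cls.__name__: cls for cls in (BasePytorchEncoder, KwargsEncoder)}


def build_component(config: Dict[str, Any]):
    """Turn a {'cls', 'parameter', 'gnes_config'} dict into an instance."""
    cls = REGISTRY[config['cls']]
    parameter = dict(config.get('parameter') or {})
    # `args` and `kwargs` are unpacked; all other keys are passed by name.
    args = parameter.pop('args', [])
    kwargs = parameter.pop('kwargs', {})
    component = cls(*args, **parameter, **kwargs)
    # Meta-information is kept apart from the constructor arguments.
    component.gnes_config = dict(config.get('gnes_config') or {})
    return component


vgg = build_component({
    'cls': 'BasePytorchEncoder',
    'parameter': {
        'model_dir': '/tmp/vgg',  # placeholder path for the sketch
        'model_name': 'vgg16',
        'layers': ['features', 'avgpool'],
    },
    'gnes_config': {'is_trained': True, 'name': 'my-awesome-vgg'},
})
print(vgg.batch_size, vgg.use_cuda, vgg.gnes_config)
```

Arguments missing from `parameter` (here `batch_size` and `use_cuda`) simply fall back to the defaults in `__init__()`, and a map under `kwargs` reaches the constructor as keyword arguments.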
diff --git a/tutorials/gnes-yaml-specifications.md → tutorials/gnes-compose-yaml-spec.md b/tutorials/gnes-yaml-specifications.md → tutorials/gnes-compose-yaml-spec.md
@@ -2,12 +2,24 @@
 
 YAML is everywhere. This is pretty much your impression when first trying GNES. Understanding the YAML config is therefore extremely important to use GNES.
 
-Essentially, GNES only requires two types of YAML config:
+Essentially, GNES requires two types of YAML config:
 - GNES-compose YAML
-- Component-wise YAML
+- [Component-wise YAML](component-yaml-spec.md)
+
+![](./img/mermaid-diagram-20190726180826.svg)
 
 All other YAML files, including the docker-compose YAML config and Kubernetes config generated from the [GNES Board](https://board.gnes.ai) or `gnes compose` command are not a part of this tutorial. Interested readers are welcome to read their [YAML specification](https://docs.docker.com/compose/compose-file/) respectively.
 
+## Table of Contents
+
+
+* [GNES-compose YAML specification](#gnes-compose-yaml-specification)
++ [`services` specification](#-services--specification)
++ [Sequential and parallel services](#sequential-and-parallel-services)
+* [`gRPCFrontend` and `Router`, why are they in my graph?](#-grpcfrontend--and--router---why-are-they-in-my-graph-)
+* [What's Next?](#what-s-next-)
+
+
 ## GNES-compose YAML specification
 
 The GNES-compose YAML defines a high-level service topology behind the GNES app. It is designed for simplicity and clarity, allowing the user to quickly get started with GNES. 
@@ -90,12 +102,12 @@ services:
 <tr>
 <td>
 <a href="https://gnes.ai">
-  <img src="img/mermaid-diagram-20190726144822.svg" alt="GNES workflow of example 1">
+  <img src="./img/mermaid-diagram-20190726144822.svg" alt="GNES workflow of example 1">
   </a>
 </td>
 <td>
 <a href="https://gnes.ai">
-  <img src="img/mermaid-diagram-20190726150531.svg" alt="GNES workflow of example 2">
+  <img src="./img/mermaid-diagram-20190726150531.svg" alt="GNES workflow of example 2">
   </a>
 </td>
 </tr>
@@ -122,9 +134,19 @@ services:
 
 which results a topology like the following:
 
-![](img/mermaid-diagram-20190726154922.svg)
+<p align="center">
+<a href="https://gnes.ai">
+    <img src="./img/mermaid-diagram-20190726154922.svg">
+</a>
+</p>
+
+## `gRPCFrontend` and `Router`, why are they in my graph?
+
+Careful readers may notice that `gRPCFrontend` and `Router` components may be added to the workflow graph, even though they are not defined in the YAML file. Here is the explanation:
 
+- `gRPCFrontend` serves as **the only interface** between GNES and the outside. All data must be sent to it and all results are returned from it, much like a single opening in a black box. Its data-flow pattern and its role in GNES are *so deterministic* that we don't bother users with defining it.
+- Put simply, `Router` forwards messages. It is often required when `replicas` > 1. However, the behavior of a router depends on the topology and the runtime (i.e. training, indexing and querying). Sometimes it serves as a mapper, other times as a reducer or an aggregator, and sometimes it is not required at all. In general, choosing the right router might not be straightforward for beginners. Fortunately, the type of the router can often be determined from the two consecutive layers, which is exactly what GNES Board (`gnes compose`) does.
 
 ## What's Next?
 
-The GNES-compose YAML describes a high-level picture of the GNES topology, the detailed specification of each component is defined in `yaml_path` respectively, namely the *component-wise YAML config*. In the next tutorial, you will learn how to write a component-wise YAML config.
+The GNES-compose YAML describes a high-level picture of the GNES topology, but it alone is not enough. The detailed specification of each component is defined in its respective `yaml_path`, namely the *component-wise YAML config*. In the next tutorial, you will learn how to write a component-wise YAML config.
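
Note on the `Router` paragraph in `tutorials/gnes-compose-yaml-spec.md`: the "mapper" and "reducer" roles can be pictured with a toy sketch. It is purely illustrative and makes no claim about how GNES routers are implemented: `map_route`, `reduce_route`, the round-robin split and the `(score, doc_id)` result shape are all assumptions made for the example.

```python
import heapq
from typing import Dict, List, Tuple

# Assumption: a partial result is (score, doc_id); a higher score is better.
Hit = Tuple[float, str]


def map_route(doc_ids: List[str], num_replicas: int) -> Dict[int, List[str]]:
    """Mapper-style routing: spread one batch over replicas round-robin."""
    parts: Dict[int, List[str]] = {i: [] for i in range(num_replicas)}
    for pos, doc_id in enumerate(doc_ids):
        parts[pos % num_replicas].append(doc_id)
    return parts


def reduce_route(partials: List[List[Hit]], top_k: int) -> List[Hit]:
    """Reducer-style routing: merge per-replica results into one top-k list."""
    merged = [hit for part in partials for hit in part]
    return heapq.nlargest(top_k, merged)


print(map_route(['a', 'b', 'c'], num_replicas=2))
print(reduce_route([[(0.9, 'a'), (0.1, 'c')], [(0.5, 'b')]], top_k=2))
```

With a single replica neither step does any real work, which matches the remark that a router is sometimes not required.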