📚 New IR Compiler Adaptation Design Notes #56879

Open
DrRyanHuang opened this issue Sep 1, 2023 · 0 comments
Labels: PFCC (Paddle Framework Contributor Club, https://github.com/PaddlePaddle/community/tree/master/pfcc), type/docs (documentation issue)

DrRyanHuang commented Sep 1, 2023

I. The CINN Pipeline

Under the current main framework, the main logic for offloading to the CINN backend is:

framework::ProgramDesc => ir::Graph => frontend::Program (NetBuilder layer) => hlir::Graph =>
Compute/Schedule() => AST level => Module::Builder => CodeGen + NVRTC => Runtime::Program

[image]

[image: diagram of the new optimization flow (click the image to view it full size)]

There are two major design principles here:

  1. To keep the overall flow stable, the new IR must keep every op in the graph connected to its surrounding ops. If one large graph can represent the whole compute network, avoid splitting it; instead use names or a similar mechanism for "soft" connections, so that by design the subgraphs produced by fusion merge (including the ops they contain) stay connected with the whole.
  2. Per the earlier design reviews, the CUDA C kernels generated by CINN will be placed inside a CINN JIT Instruction for execution. The CINN JIT Instruction is responsible for preparing the kernel's inputs and outputs, while garbage collection is managed uniformly by the executor.

For reference, the shape of a Phi kernel signature:

    template <typename T, typename Context>
    void ConvKernel(const Context& dev_ctx,
                    const DenseTensor& input,
                    const DenseTensor& filter,
                    const std::vector<int>& strides,
                    const std::vector<int>& paddings,
                    const std::string& padding_algorithm,
                    const std::vector<int>& dilations,
                    int groups,
                    const std::string& data_format,
                    DenseTensor* out) {
      // kernel body elided
    }

For the implementation details of the fusion merge pass, see: "Group Fusion Pass Flow Overview" (Group融合Pass流程简述).

The fusion merge architecture will be upgraded to the structure below:

[image]

Stage 1: ProgramDesc => ir::Graph => CinnCompiler, handled by build_cinn_pass

    auto compilation_key = cinn_compiler->AddGraph(std::move(subgraph));
    VLOG(4) << "Compilation Key:\n"
            << cinn_compiler->ReadableKey(compilation_key);

    // Replace the found cluster to a new cinn op node
    ReplaceSubGraphWithCinnOpNode(cluster_set,
                                  cluster_inputs,
                                  cluster_outputs,
                                  cluster_internals,
                                  compilation_key,
                                  graph);

Stage 2: ir::Graph => frontend::Program => hlir::Graph, handled by CinnCompiler

    const CinnCompiledObject& CinnCompiler::Compile(
        const Graph& graph,
        const std::map<std::string, const phi::DenseTensor*>& input_tensors,
        const Target& target,
        void* stream) {
      auto compiled_res =
          CompileGraph(graph, input_tensors, target, compiled_num, stream);
    }

    std::unique_ptr<CinnCompiledObject> CinnCompiler::CompileGraph(....) {
      CinnGraphSymbolization symbol{compiled_num, graph, target, input_tensors};
      auto frontend_program = symbol();  // <----- key point 1

      auto cinn_graph = Optimize(&frontend_program, fetch_ids, target);  // <---- key point 2

      auto graph_compiler =
          std::make_unique<GraphCompiler>(target, scope, cinn_graph);  // <--- key point 3

      auto compiled_res = graph_compiler->Build(options, std::move(fetch_ids), stream);
    }

Stage 3: hlir::Graph => Compute/Schedule => AST level => Module::Builder => CodeGen + NVRTC => Runtime::Program, handled by GraphCompiler

    GraphCompiler::CompilationResult GraphCompiler::Build(
        const GraphCompiler::CompileOptions& options,
        std::unordered_set<std::string>&& fetch_var_ids,
        void* stream) {
      for (size_t i = 0; i < groups.size(); i++) {
        if (groups[i].size() == 1) {
          // Calls impl->fcompute / impl->fschedule and returns lang::LowerVec.
          // [Important] We are already at the AST level here.
          lowered_func = GetOpFunc(groups[i][0]);
        } else {
          lowered_func = GetOpFunc(groups[i]);
        }
        local_lowered_funcs.emplace_back(std::move(lowered_func));
      }

      for (auto&& lowered_func : lowered_funcs) {
        // Calls m_builder_.AddFunction(func).
        // [Important] We are already at the AST level here.
        this->ProcessFunction(lowered_func);
      }

      auto build_module = m_builder_.Build();  // ir::Module::Builder

      auto out = codegen.Compile(build_module, CodeGenC::OutputKind::CImpl);  // CodeGen
      compiler_ = backends::Compiler::Create(target_);
      compiler_->Build(build_module, options.attached_code);  // NVRTC compiler

      auto instructions = BuildInstructions(
          groups, options.groups.empty() ? graph_->fusion_groups : options.groups);
      result.runtime_program.reset(new Program(scope_, std::move(instructions)));
    }

II. Adaptation Plan

To migrate from: framework::ProgramDesc → ir::Graph → frontend::Program (NetBuilder layer) → hlir::Graph → Compute/Schedule → AST level → Module::Builder → CodeGen + NVRTC → Runtime::Program

to: framework::ProgramDesc → ProgramTranslator → New IR → New IR Graph → Compute/Schedule → AST level → Module::Builder → CodeGen + NVRTC → Runtime::Program

[image]

the roles of the modules change as follows:

  • Unify the main framework's ir::Graph and CINN's frontend::Program / hlir::Graph onto the New IR, which represents both the Program and the Graph concepts at once.
  • Unify the NetBuilder component. CINN's core APIs, and what Build() generates dynamically, will become New IR rather than frontend::Program. The main framework already has this component; we need to consider how to serve both Phi and CINN, abstracting it further for extensibility.
  • Complete the definition of the New IR Graph, and consider whether a new Dialect or a proxy component is necessary. CINN's current logic relies on hlir::Graph for lowering preparation (e.g. preparing inputs and outputs), then calls Compute/Schedule to sink down to the AST level.
  • Whether the Module::Builder role should be kept. I lean toward keeping it: Builder, Codegen, and NVRTC are intermediate processing modules, decoupled from the IR level, forming a bridge spanning "compile time" and "run time"; no adjustment is needed for now.
  • Runtime::Program is analogous to a Phi Kernel: a run-time concept. It may be worth redefining it as a Dialect, so that it returns to the New IR system and can conveniently be handed to the execution engine.

Plan outline:

  • Early on, to decouple the work, rely on ProgramDesc + ProgramTranslator as the input entry point to the New IR.
  • build_cinn_pass becomes a non-essential dependency during the feasibility-verification stage, but during verification we should still watch for and evaluate the subsequent implementation path.
  • By design the New IR carries both op-by-op and Graph semantics; CINN depends heavily on the latter, which needs to be driven and completed through practice.
  • The early plan takes GraphCompiler as the core entry point and verifies the work through a unit-test-driven approach.

III. Some Development Notes

  • Using builder.Build<paddle::dialect::XX> requires including the header #include "paddle/fluid/ir/dialect/pd_op.h".
  • Under the new IR, Instruction.Run() reported an "illegal memory address access"; fix PR:
    • This kind of error usually means: (1) an out-of-bounds array access, or (2) access to an address outside this process's space (e.g. a dangling pointer).
    • Analysis showed that when the generated full kernel runs, its out_ptr argument is a dangling pointer; cudaMalloc was never called.
    • BuildScope takes target as a parameter but never uses it. Does Scope->Var(Tensor).Resize() allocate device memory, then? Answer: no, it only sets the shape.
    • InsertBufferHandlers dynamically splices in extra memory malloc/free Instructions.
  • In CINN, what is the relationship between ir::Tensor and Expr? The code frequently uses interfaces such as as_tensor, as_expr, and as_tensor_ref; which scenario does each serve?
  • Keeping the old- and new-IR implementations of OpLowerer isolated
[image]