📚 New IR Compiler Adaptation Design Notes #56879

Open
DrRyanHuang opened this issue Sep 1, 2023 · 0 comments
Labels: PFCC (Paddle Framework Contributor Club, https://github.com/PaddlePaddle/community/tree/master/pfcc), type/docs (documentation issue)

DrRyanHuang commented Sep 1, 2023

I. The CINN Pipeline

Under the current main framework, the main logic for offloading to the CINN backend is:

framework::ProgramDesc => ir::Graph => frontend::Program (NetBuilder layer) => hlir::Graph =>
Compute/Schedule() => AST level => Module::Builder => CodeGen + NVRTC => Runtime::Program

[image]

[image: diagram of the new optimization flow (click the image to view it full size)]

There are two major design principles here:

  1. To keep the overall flow stable, the new IR must keep every op in the graph connected to its surrounding ops. If one large graph can represent the whole compute network, avoid splitting it; instead use names or a similar mechanism for "soft" connections, so that by design the subgraphs produced by fusion merge (including the ops they contain) stay connected with the whole.
  2. Per the earlier design reviews, the CUDA C kernels generated by CINN will be placed inside a CINN JIT Instruction for execution. The CINN JIT Instruction is responsible for preparing the kernel's inputs and outputs, while garbage collection is managed uniformly by the executor.

For reference, the shape of a Phi kernel signature:

    template <typename T, typename Context>
    void ConvKernel(const Context& dev_ctx,
                    const DenseTensor& input,
                    const DenseTensor& filter,
                    const std::vector<int>& strides,
                    const std::vector<int>& paddings,
                    const std::string& padding_algorithm,
                    const std::vector<int>& dilations,
                    int groups,
                    const std::string& data_format,
                    DenseTensor* out) {
      // kernel body elided
    }

For the implementation details of the fusion merge pass, see: "Group Fusion Pass Flow Overview" (Group融合Pass流程简述).

The fusion merge architecture will be upgraded to the structure below:

[image]

Stage 1: ProgramDesc => ir::Graph => CinnCompiler, handled by build_cinn_pass

    auto compilation_key = cinn_compiler->AddGraph(std::move(subgraph));
    VLOG(4) << "Compilation Key:\n"
            << cinn_compiler->ReadableKey(compilation_key);

    // Replace the found cluster to a new cinn op node
    ReplaceSubGraphWithCinnOpNode(cluster_set,
                                  cluster_inputs,
                                  cluster_outputs,
                                  cluster_internals,
                                  compilation_key,
                                  graph);

Stage 2: ir::Graph => frontend::Program => hlir::Graph, handled by CinnCompiler

    const CinnCompiledObject& CinnCompiler::Compile(
        const Graph& graph,
        const std::map<std::string, const phi::DenseTensor*>& input_tensors,
        const Target& target,
        void* stream) {
      auto compiled_res =
          CompileGraph(graph, input_tensors, target, compiled_num, stream);
    }

    std::unique_ptr<CinnCompiledObject> CinnCompiler::CompileGraph(....) {
      CinnGraphSymbolization symbol{compiled_num, graph, target, input_tensors};
      auto frontend_program = symbol();  // <----- key point 1

      auto cinn_graph = Optimize(&frontend_program, fetch_ids, target);  // <---- key point 2

      auto graph_compiler =
          std::make_unique<GraphCompiler>(target, scope, cinn_graph);  // <--- key point 3

      auto compiled_res = graph_compiler->Build(options, std::move(fetch_ids), stream);
    }

Stage 3: hlir::Graph => Compute/Schedule => AST level => Module::Builder => CodeGen + NVRTC => Runtime::Program, handled by GraphCompiler

    GraphCompiler::CompilationResult GraphCompiler::Build(
        const GraphCompiler::CompileOptions& options,
        std::unordered_set<std::string>&& fetch_var_ids,
        void* stream) {
      for (size_t i = 0; i < groups.size(); i++) {
        if (groups[i].size() == 1) {
          // Calls impl->fcompute / impl->fschedule and returns lang::LowerVec.
          // [Important] We are already at the AST level here.
          lowered_func = GetOpFunc(groups[i][0]);
        } else {
          lowered_func = GetOpFunc(groups[i]);
        }
        local_lowered_funcs.emplace_back(std::move(lowered_func));
      }

      for (auto&& lowered_func : lowered_funcs) {
        // Calls m_builder_.AddFunction(func).
        // [Important] We are already at the AST level here.
        this->ProcessFunction(lowered_func);
      }

      auto build_module = m_builder_.Build();  // ir::Module::Builder

      auto out = codegen.Compile(build_module, CodeGenC::OutputKind::CImpl);  // CodeGen
      compiler_ = backends::Compiler::Create(target_);
      compiler_->Build(build_module, options.attached_code);  // NVRTC compiler

      auto instructions = BuildInstructions(
          groups, options.groups.empty() ? graph_->fusion_groups : options.groups);
      result.runtime_program.reset(new Program(scope_, std::move(instructions)));
    }

II. Adaptation Plan

To migrate from: framework::ProgramDesc → ir::Graph → frontend::Program (NetBuilder layer) → hlir::Graph → Compute/Schedule → AST level → Module::Builder → CodeGen + NVRTC → Runtime::Program

to: framework::ProgramDesc → ProgramTranslator → New IR → New IR Graph → Compute/Schedule → AST level → Module::Builder → CodeGen + NVRTC → Runtime::Program

[image]

the roles of the modules change as follows:

  • Unify the main framework's ir::Graph and CINN's frontend::Program / hlir::Graph onto the New IR, which represents both the Program and the Graph concepts at once.
  • Unify the NetBuilder component. CINN's core APIs, and what Build() generates dynamically, will become New IR rather than frontend::Program. The main framework already has this component; we need to consider how to serve both Phi and CINN, abstracting it further for extensibility.
  • Complete the definition of the New IR Graph, and consider whether a new Dialect or a proxy component is necessary. CINN's current logic relies on hlir::Graph for lowering preparation (e.g. preparing inputs and outputs), then calls Compute/Schedule to sink down to the AST level.
  • Whether the Module::Builder role should be kept. I lean toward keeping it: Builder, Codegen, and NVRTC are intermediate processing modules, decoupled from the IR level, forming a bridge spanning "compile time" and "run time"; no adjustment is needed for now.
  • Runtime::Program is analogous to a Phi Kernel: a run-time concept. It may be worth redefining it as a Dialect, so that it returns to the New IR system and can conveniently be handed to the execution engine.

Plan outline:

  • Early on, to decouple the work, rely on ProgramDesc + ProgramTranslator as the input entry point to the New IR.
  • build_cinn_pass becomes a non-essential dependency during the feasibility-verification stage, but during verification we should still watch for and evaluate the subsequent implementation path.
  • By design the New IR carries both op-by-op and Graph semantics; CINN depends heavily on the latter, which needs to be driven and completed through practice.
  • The early plan takes GraphCompiler as the core entry point and verifies the work through a unit-test-driven approach.

III. Some Development Notes

  • Using builder.Build<paddle::dialect::XX> requires including the header #include "paddle/fluid/ir/dialect/pd_op.h".
  • Under the new IR, Instruction.Run() reported an "illegal memory address access"; fix PR:
    • This kind of error usually means: (1) an out-of-bounds array access, or (2) access to an address outside this process's space (e.g. a dangling pointer).
    • Analysis showed that when the generated full kernel runs, its out_ptr argument is a dangling pointer; cudaMalloc was never called.
    • BuildScope takes target as a parameter but never uses it. Does Scope->Var(Tensor).Resize() allocate device memory, then? Answer: no, it only sets the shape.
    • InsertBufferHandlers dynamically splices in extra memory malloc/free Instructions.
  • In CINN, what is the relationship between ir::Tensor and Expr? The code frequently uses interfaces such as as_tensor, as_expr, and as_tensor_ref; which scenario does each serve?
  • Keeping the old- and new-IR implementations of OpLowerer isolated
[image]