model: Add support for PhiMoE arch #11003

phymbert · 2024-12-28T14:46:49Z

PhiMoE

Overview

Phi-3.5-MoE is a lightweight, open model built upon datasets used for Phi-3 - synthetic data and filtered publicly available documents - with a focus on very high-quality, reasoning dense data.
The model supports multilingual and comes with 128K context length (in tokens).

The PhiMoE model was proposed in Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone by Microsoft.

This model is very similar to Mixtral with the main difference of [Phi3LongRoPEScaledRotaryEmbedding], where they are used to extend the context of the rotary embeddings. The query, key and values are fused, and the MLP's up and gate projection layers are also fused.
The tokenizer used for this model is identical to the [LlamaTokenizer], with additional tokens.

License

MIT

Implementation details

The convert script reuses the Phi3MiniModel class as parameter names and long rope scaling logic is the same.
The MOE branch is included in the phi3 model graph implementation with missing bias tensors.
It would be possible to merge phi3 and phimoe into a single arch, but I kept the spirit of separated moe arch as in granite recently. Also, since Microsoft introduced a dedicated architecture, it can evolve independently in the future.

Testing

llama-cli --hf-repo phymbert/Phi-3.5-MoE-instruct-GGUF --hf-file phi-3.5-moe-instruct-q3_k_s.gguf -p "I believe the meaning of life is" -n 64 -c 4096

I believe the meaning of life is a deeply personal and subjective concept that varies for each individual. As an AI, I don' circulate personal beliefs or opinions. However, I can provide some insights: Many people find their meaning through relationships with others, pursuing passions and interests, contributing to society, or seeking spiritual

full output

llama-cli --model phi-3.5-moe-instruct-q3_k_s.gguf -p "I believe the meaning of life is" -n 64 -c 4096
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3050 Laptop GPU, compute capability 8.6, VMM: yes
register_backend: registered backend CUDA (1 devices)
register_device: registered device CUDA0 (NVIDIA GeForce RTX 3050 Laptop GPU)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (11th Gen Intel(R) Core(TM) i5-11400H @ 2.70GHz)
build: 4393 (d79d8f39) with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu (debug)
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 3050 Laptop GPU) - 3814 MiB free
llama_model_loader: loaded meta data with 38 key-value pairs and 519 tensors from /media/phymbert/Ricka/phi-3.5-moe-instruct-q3_k_s.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phimoe
llama_model_loader: - kv   1:            phimoe.rope.scaling.attn_factor f32              = 1,190238
llama_model_loader: - kv   2:                               general.type str              = model
llama_model_loader: - kv   3:                               general.name str              = Phi 3.5 MoE Instruct
llama_model_loader: - kv   4:                           general.finetune str              = instruct
llama_model_loader: - kv   5:                           general.basename str              = Phi-3.5-MoE
llama_model_loader: - kv   6:                         general.size_label str              = 16x4.1B
llama_model_loader: - kv   7:                            general.license str              = mit
llama_model_loader: - kv   8:                       general.license.link str              = https://huggingface.co/microsoft/Phi-...
llama_model_loader: - kv   9:                               general.tags arr[str,3]       = ["nlp", "code", "text-generation"]
llama_model_loader: - kv  10:                          general.languages arr[str,1]       = ["multilingual"]
llama_model_loader: - kv  11:                      phimoe.context_length u32              = 131072
llama_model_loader: - kv  12: phimoe.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  13:                    phimoe.embedding_length u32              = 4096
llama_model_loader: - kv  14:                 phimoe.feed_forward_length u32              = 6400
llama_model_loader: - kv  15:                         phimoe.block_count u32              = 32
llama_model_loader: - kv  16:                phimoe.attention.head_count u32              = 32
llama_model_loader: - kv  17:             phimoe.attention.head_count_kv u32              = 8
llama_model_loader: - kv  18:    phimoe.attention.layer_norm_rms_epsilon f32              = 0,000010
llama_model_loader: - kv  19:                phimoe.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:                      phimoe.rope.freq_base f32              = 10000,000000
llama_model_loader: - kv  21:                          general.file_type u32              = 11
llama_model_loader: - kv  22:            phimoe.attention.sliding_window u32              = 131072
llama_model_loader: - kv  23:                   phimoe.expert_used_count u32              = 2
llama_model_loader: - kv  24:                        phimoe.expert_count u32              = 16
llama_model_loader: - kv  25:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  26:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  27:                      tokenizer.ggml.tokens arr[str,32064]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  28:                      tokenizer.ggml.scores arr[f32,32064]   = [-1000,000000, -1000,000000, -1000,00...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,32064]   = [3, 3, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  31:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  32:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 32000
llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  35:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  36:                    tokenizer.chat_template str              = {% for message in messages %}{% if me...
llama_model_loader: - kv  37:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  293 tensors
llama_model_loader: - type q3_K:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 14
llm_load_vocab: token to piece cache size = 0,1685 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phimoe
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32064
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0,0e+00
llm_load_print_meta: f_norm_rms_eps   = 1,0e-05
llm_load_print_meta: f_clamp_kqv      = 0,0e+00
llm_load_print_meta: f_max_alibi_bias = 0,0e+00
llm_load_print_meta: f_logit_scale    = 0,0e+00
llm_load_print_meta: n_ff             = 6400
llm_load_print_meta: n_expert         = 16
llm_load_print_meta: n_expert_used    = 2
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000,0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 16x3.8B
llm_load_print_meta: model ftype      = Q3_K - Small
llm_load_print_meta: model params     = 41,87 B
llm_load_print_meta: model size       = 16,81 GiB (3,45 BPW) 
llm_load_print_meta: general.name     = Phi 3.5 MoE Instruct
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|endoftext|>'
llm_load_print_meta: EOT token        = 32007 '<|end|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 32000 '<|endoftext|>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOG token        = 32000 '<|endoftext|>'
llm_load_print_meta: EOG token        = 32007 '<|end|>'
llm_load_print_meta: max token length = 48
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:   CPU_Mapped model buffer size = 17217,97 MiB
....................................................................................................
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 10000,0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 32
llama_kv_cache_init:        CPU KV buffer size =   512,00 MiB
llama_new_context_with_model: KV self size  =  512,00 MiB, K (f16):  256,00 MiB, V (f16):  256,00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0,12 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   408,75 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    16,01 MiB
llama_new_context_with_model: graph nodes  = 1736
llama_new_context_with_model: graph splits = 583 (with bs=512), 1 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

sampler seed: 2110869593
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
	dry_multiplier = 0,000, dry_base = 1,750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0,950, min_p = 0,050, xtc_probability = 0,000, xtc_threshold = 0,100, typical_p = 1,000, temp = 0,800
	mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 64, n_keep = 0

 I believe the meaning of life is a deeply personal and subjective concept that varies for each individual. As an AI, I don' circulate personal beliefs or opinions. However, I can provide some insights: Many people find their meaning through relationships with others, pursuing passions and interests, contributing to society, or seeking spiritual

llama_perf_sampler_print:    sampling time =      12,33 ms /    71 runs   (    0,17 ms per token,  5759,25 tokens per second)
llama_perf_context_print:        load time =  151709,21 ms
llama_perf_context_print: prompt eval time =   68440,17 ms /     7 tokens ( 9777,17 ms per token,     0,10 tokens per second)
llama_perf_context_print:        eval time =  257036,01 ms /    63 runs   ( 4079,94 ms per token,     0,25 tokens per second)
llama_perf_context_print:       total time =  325567,93 ms /    70 tokens

Process finished with exit code 0

Check that phi3 is still working

llama-cli --hf-repo microsoft/Phi-3-mini-4k-instruct-gguf --hf-file Phi-3-mini-4k-instruct-q4.gguf -p "I believe the meaning of life is" -n 64 -c 4096 -ngl 12

I believe the meaning of life is to seek happiness and fulfillment, to form meaningful connections with others, and to leave a positive impact on the world.
<|assistant|> I absolutely agree with you. The pursuit of happiness and personal fulfillment, along with nurturing relationships and contributing to the betterment of society, are central them

full output

llama-cli --hf-repo microsoft/Phi-3-mini-4k-instruct-gguf --hf-file Phi-3-mini-4k-instruct-q4.gguf -p "I believe the meaning of life is" -n 64 -c 4096 -ngl 12
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3050 Laptop GPU, compute capability 8.6, VMM: yes
register_backend: registered backend CUDA (1 devices)
register_device: registered device CUDA0 (NVIDIA GeForce RTX 3050 Laptop GPU)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (11th Gen Intel(R) Core(TM) i5-11400H @ 2.70GHz)
build: 4393 (d79d8f39) with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu (debug)
main: llama backend init
main: load the model and apply lora adapter, if any
common_download_file: previous metadata file found /home/phymbert/.cache/llama.cpp/microsoft_Phi-3-mini-4k-instruct-gguf_Phi-3-mini-4k-instruct-q4.gguf.json: {"etag":"\"bcfbb62e845dcfa1bcfd85ce58b59276-150\"","lastModified":"Tue, 30 Apr 2024 12:50:26 GMT","url":"https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf"}
curl_perform_with_retry: Trying to download from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf (attempt 1 of 3)...
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 3050 Laptop GPU) - 3814 MiB free
llama_model_loader: loaded meta data with 24 key-value pairs and 195 tensors from /home/phymbert/.cache/llama.cpp/microsoft_Phi-3-mini-4k-instruct-gguf_Phi-3-mini-4k-instruct-q4.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:                               general.name str              = Phi3
llama_model_loader: - kv   2:                        phi3.context_length u32              = 4096
llama_model_loader: - kv   3:                      phi3.embedding_length u32              = 3072
llama_model_loader: - kv   4:                   phi3.feed_forward_length u32              = 8192
llama_model_loader: - kv   5:                           phi3.block_count u32              = 32
llama_model_loader: - kv   6:                  phi3.attention.head_count u32              = 32
llama_model_loader: - kv   7:               phi3.attention.head_count_kv u32              = 32
llama_model_loader: - kv   8:      phi3.attention.layer_norm_rms_epsilon f32              = 0,000010
llama_model_loader: - kv   9:                  phi3.rope.dimension_count u32              = 96
llama_model_loader: - kv  10:                          general.file_type u32              = 15
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32064]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32064]   = [0,000000, 0,000000, 0,000000, 0,0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32064]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 32000
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:   81 tensors
llama_model_loader: - type q5_K:   32 tensors
llama_model_loader: - type q6_K:   17 tensors
llm_load_vocab: control-looking token:  32007 '<|end|>' was not control-type; this is probably a bug in the model. its type will be overridden
llm_load_vocab: control-looking token:  32000 '<|endoftext|>' was not control-type; this is probably a bug in the model. its type will be overridden
llm_load_vocab: special tokens cache size = 67
llm_load_vocab: token to piece cache size = 0,1690 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi3
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32064
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_rot            = 96
llm_load_print_meta: n_swa            = 2047
llm_load_print_meta: n_embd_head_k    = 96
llm_load_print_meta: n_embd_head_v    = 96
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 3072
llm_load_print_meta: n_embd_v_gqa     = 3072
llm_load_print_meta: f_norm_eps       = 0,0e+00
llm_load_print_meta: f_norm_rms_eps   = 1,0e-05
llm_load_print_meta: f_clamp_kqv      = 0,0e+00
llm_load_print_meta: f_max_alibi_bias = 0,0e+00
llm_load_print_meta: f_logit_scale    = 0,0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000,0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 3,82 B
llm_load_print_meta: model size       = 2,23 GiB (5,01 BPW) 
llm_load_print_meta: general.name     = Phi3
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|endoftext|>'
llm_load_print_meta: EOT token        = 32007 '<|end|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 32000 '<|endoftext|>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOG token        = 32000 '<|endoftext|>'
llm_load_print_meta: EOG token        = 32007 '<|end|>'
llm_load_print_meta: max token length = 48
llm_load_tensors: offloading 12 repeating layers to GPU
llm_load_tensors: offloaded 12/33 layers to GPU
llm_load_tensors:   CPU_Mapped model buffer size =  2281,66 MiB
llm_load_tensors:        CUDA0 model buffer size =   813,09 MiB
...........................................................................................
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 10000,0
llama_new_context_with_model: freq_scale    = 1
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 32
llama_kv_cache_init:        CPU KV buffer size =   960,00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   576,00 MiB
llama_new_context_with_model: KV self size  = 1536,00 MiB, K (f16):  768,00 MiB, V (f16):  768,00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0,12 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   340,56 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    20,01 MiB
llama_new_context_with_model: graph nodes  = 1286
llama_new_context_with_model: graph splits = 164 (with bs=512), 3 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

sampler seed: 595827582
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
	dry_multiplier = 0,000, dry_base = 1,750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0,950, min_p = 0,050, xtc_probability = 0,000, xtc_threshold = 0,100, typical_p = 1,000, temp = 0,800
	mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 64, n_keep = 1

 I believe the meaning of life is to seek happiness and fulfillment, to form meaningful connections with others, and to leave a positive impact on the world.
<|assistant|> I absolutely agree with you. The pursuit of happiness and personal fulfillment, along with nurturing relationships and contributing to the betterment of society, are central them

llama_perf_sampler_print:    sampling time =      10,91 ms /    72 runs   (    0,15 ms per token,  6598,85 tokens per second)
llama_perf_context_print:        load time =    2052,82 ms
llama_perf_context_print: prompt eval time =    1821,68 ms /     8 tokens (  227,71 ms per token,     4,39 tokens per second)
llama_perf_context_print:        eval time =   16584,42 ms /    63 runs   (  263,24 ms per token,     3,80 tokens per second)
llama_perf_context_print:       total time =   18435,71 ms /    71 tokens

Process finished with exit code 0

Links

Transformer implementation: PhiMoE huggingface/transformers#33363
GGUF Collection: https://huggingface.co/collections/phymbert/phi-35-moe-instruct-gguf-676ff4882b1891292b6bd9c1
Target Feature Request: Add support for Phi-3.5 MoE and Vision Instruct #9119
Fix Request to use Phi-3.5-MoE-instruct #9168

ThiloteE · 2024-12-28T17:20:58Z

I am not particularly good at coding, but I can try running your gguf and check, if I notice something. No time today, but tomorrow I can do so.

docs/development/HOWTO-add-model.md

phymbert · 2024-12-28T17:26:44Z

I am not particularly good at coding, but I can try running your gguf and check, if I notice something. No time today, but tomorrow I can do so.

Thanks, no hurry as the model is quite old and phi4 has been released already. Will see if it gains enthousiasm, I am having a look to the Vision model in //.

ThiloteE · 2024-12-29T16:19:15Z

The Q4_0 with 4096 context does not fit into 32GB of RAM on Windows 10.
The Q3_k_s with 32768 context does fit into 32 GB of RAM
My hardware: Two slots DDR4 (2400 MHz) with Ryzen 5 5600. One layer of the model is mapped to GPU (Nvidia GTX 1060 3GB). Llama.cpp was compiled with cuda.

Output is reasonable. Sometimes I have seen typos (e.g. instead of I'm it responds with I'. Or instead of don't it responds with don') or the model being a little dumb or slightly repeating itself, but at other times the responses are perfectly fine. Maybe caused by quantization, who knows. Maybe fixable by finetuning.

Example responses (Images)

Successful run with 32768 allocated tokens for context (Prompt was 16883 tokens)

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1060 3GB, compute capability 6.1, VMM: yes
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce GTX 1060 3GB (NVIDIA) | uma: 0 | fp16: 0 | warp size: 32 | matrix cores: none
build: 4397 (cf1fda86) with MSVC 19.41.34120.0 for x64
system info: n_threads = 6, n_threads_batch = 6, total_threads = 12

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CUDA : ARCHS = 610 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 11
main: loading model
srv    load_model: loading model 'C:\Prog\Development\Llama.Cpp-Toolbox_3Simplex\Llama.Cpp-Toolbox\Converted\microsoft_phi-3.5-moe-instruct-q3_k_s-phymbert-2024-12-28-t1827.gguf'
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce GTX 1060 3GB) - 2462 MiB free
llama_load_model_from_file: using device Vulkan0 (NVIDIA GeForce GTX 1060 3GB) - 2965 MiB free
llama_model_loader: loaded meta data with 38 key-value pairs and 519 tensors from C:\Prog\Development\Llama.Cpp-Toolbox_3Simplex\Llama.Cpp-Toolbox\Converted\microsoft_phi-3.5-moe-instruct-q3_k_s-phymbert-2024-12-28-t1827.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phimoe
llama_model_loader: - kv   1:            phimoe.rope.scaling.attn_factor f32              = 1.190238
llama_model_loader: - kv   2:                               general.type str              = model
llama_model_loader: - kv   3:                               general.name str              = Phi 3.5 MoE Instruct
llama_model_loader: - kv   4:                           general.finetune str              = instruct
llama_model_loader: - kv   5:                           general.basename str              = Phi-3.5-MoE
llama_model_loader: - kv   6:                         general.size_label str              = 16x4.1B
llama_model_loader: - kv   7:                            general.license str              = mit
llama_model_loader: - kv   8:                       general.license.link str              = https://huggingface.co/microsoft/Phi-...
llama_model_loader: - kv   9:                               general.tags arr[str,3]       = ["nlp", "code", "text-generation"]
llama_model_loader: - kv  10:                          general.languages arr[str,1]       = ["multilingual"]
llama_model_loader: - kv  11:                      phimoe.context_length u32              = 131072
llama_model_loader: - kv  12: phimoe.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  13:                    phimoe.embedding_length u32              = 4096
llama_model_loader: - kv  14:                 phimoe.feed_forward_length u32              = 6400
llama_model_loader: - kv  15:                         phimoe.block_count u32              = 32
llama_model_loader: - kv  16:                phimoe.attention.head_count u32              = 32
llama_model_loader: - kv  17:             phimoe.attention.head_count_kv u32              = 8
llama_model_loader: - kv  18:    phimoe.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  19:                phimoe.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:                      phimoe.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  21:                          general.file_type u32              = 11
llama_model_loader: - kv  22:            phimoe.attention.sliding_window u32              = 131072
llama_model_loader: - kv  23:                   phimoe.expert_used_count u32              = 2
llama_model_loader: - kv  24:                        phimoe.expert_count u32              = 16
llama_model_loader: - kv  25:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  26:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  27:                      tokenizer.ggml.tokens arr[str,32064]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  28:                      tokenizer.ggml.scores arr[f32,32064]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,32064]   = [3, 3, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  31:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  32:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 32000
llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  35:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  36:                    tokenizer.chat_template str              = {% for message in messages %}{% if me...
llama_model_loader: - kv  37:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  293 tensors
llama_model_loader: - type q3_K:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 14
llm_load_vocab: token to piece cache size = 0.1685 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phimoe
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32064
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 6400
llm_load_print_meta: n_expert         = 16
llm_load_print_meta: n_expert_used    = 2
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 16x3.8B
llm_load_print_meta: model ftype      = Q3_K - Small
llm_load_print_meta: model params     = 41.87 B
llm_load_print_meta: model size       = 16.81 GiB (3.45 BPW)
llm_load_print_meta: general.name     = Phi 3.5 MoE Instruct
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|endoftext|>'
llm_load_print_meta: EOT token        = 32007 '<|end|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 32000 '<|endoftext|>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOG token        = 32000 '<|endoftext|>'
llm_load_print_meta: EOG token        = 32007 '<|end|>'
llm_load_print_meta: max token length = 48
ggml_vulkan: Compiling shaders..............................Done!
request: GET / 127.0.0.1 503
llm_load_tensors: offloading 1 repeating layers to GPU
llm_load_tensors: offloaded 1/33 layers to GPU
llm_load_tensors:        CUDA0 model buffer size =   533.16 MiB
llm_load_tensors:   CPU_Mapped model buffer size = 16684.80 MiB
....................................................................................................
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 32768
llama_new_context_with_model: n_ctx_per_seq = 32768
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 10000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (32768) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 32768, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 32
llama_kv_cache_init:      CUDA0 KV buffer size =   128.00 MiB
llama_kv_cache_init:        CPU KV buffer size =  3968.00 MiB
llama_new_context_with_model: KV self size  = 4096.00 MiB, K (f16): 2048.00 MiB, V (f16): 2048.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  2342.88 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    72.01 MiB
llama_new_context_with_model: graph nodes  = 1736
llama_new_context_with_model: graph splits = 565 (with bs=512), 3 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 32768
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 32768
main: model loaded
main: chat template, built_in: 1, chat_example: '<|system|>
You are a helpful assistant<|end|>
<|user|>
Hello<|end|>
<|assistant|>
Hi there<|end|>
<|user|>
How are you?<|end|>
<|assistant|>
'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv  update_slots: all slots are idle
request: GET / 127.0.0.1 200
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 16883
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.121305
slot update_slots: id  0 | task 0 | kv cache rm [2048, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 4096, n_tokens = 2048, progress = 0.242611
slot update_slots: id  0 | task 0 | kv cache rm [4096, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 6144, n_tokens = 2048, progress = 0.363916
slot update_slots: id  0 | task 0 | kv cache rm [6144, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 8192, n_tokens = 2048, progress = 0.485222
slot update_slots: id  0 | task 0 | kv cache rm [8192, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 10240, n_tokens = 2048, progress = 0.606527
slot update_slots: id  0 | task 0 | kv cache rm [10240, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 12288, n_tokens = 2048, progress = 0.727833
slot update_slots: id  0 | task 0 | kv cache rm [12288, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 14336, n_tokens = 2048, progress = 0.849138
slot update_slots: id  0 | task 0 | kv cache rm [14336, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 16384, n_tokens = 2048, progress = 0.970444
slot update_slots: id  0 | task 0 | kv cache rm [16384, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 16883, n_tokens = 499, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 16883, n_tokens = 499
slot      release: id  0 | task 0 | stop processing: n_past = 17660, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =  986724.31 ms / 16883 tokens (   58.44 ms per token,    17.11 tokens per second)
       eval time =  413753.94 ms /   778 tokens (  531.82 ms per token,     1.88 tokens per second)
      total time = 1400478.25 ms / 17661 tokens
srv  update_slots: all slots are idle

matiaslin

Looks good to me.

Ran llama-bench on Phi3.5 MoE Q4.

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA L40S, compute capability 8.9, VMM: no
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| phimoe 16x3.8B Q4_0            |  21.98 GiB |    41.87 B | CUDA       |  99 |         pp512 |      4647.09 ± 52.40 |
| phimoe 16x3.8B Q4_0            |  21.98 GiB |    41.87 B | CUDA       |  99 |         tg128 |         98.37 ± 0.03 |

build: 0dae7685 (4398)

ggerganov

I rebased this on the latest master

Co-authored-by: ThiloteE <[email protected]>

ggml-ci

* model: support phimoe * python linter * doc: minor Co-authored-by: ThiloteE <[email protected]> * doc: minor Co-authored-by: ThiloteE <[email protected]> * doc: add phimoe as supported model ggml-ci --------- Co-authored-by: ThiloteE <[email protected]>

d1hr2uv · 2025-01-12T15:29:28Z

nice one dude.

phymbert added enhancement New feature or request model Model specific labels Dec 28, 2024

phymbert requested a review from ggerganov December 28, 2024 14:46

github-actions bot added documentation Improvements or additions to documentation python python script changes labels Dec 28, 2024

This comment was marked as off-topic.

Sign in to view

ThiloteE reviewed Dec 28, 2024

View reviewed changes

docs/development/HOWTO-add-model.md Outdated Show resolved Hide resolved

ThiloteE reviewed Dec 28, 2024

View reviewed changes

docs/development/HOWTO-add-model.md Outdated Show resolved Hide resolved

phymbert changed the title ~~model: Add support for Phi-3.5 MoE~~ model: Add support for PhiMoE arch Dec 28, 2024

matiaslin approved these changes Jan 8, 2025

View reviewed changes

slaren approved these changes Jan 9, 2025

View reviewed changes

ggerganov force-pushed the phymbert/model/phi35-moe branch from 0dae768 to 4ca3a77 Compare January 9, 2025 09:27

ggerganov approved these changes Jan 9, 2025

View reviewed changes

phymbert and others added 5 commits January 9, 2025 11:31

model: support phimoe

4be934c

python linter

7385f7d

doc: minor

e0e23b5

Co-authored-by: ThiloteE <[email protected]>

doc: minor

3199b2f

Co-authored-by: ThiloteE <[email protected]>

doc: add phimoe as supported model

c0dd28d

ggml-ci

ggerganov force-pushed the phymbert/model/phi35-moe branch from 4ca3a77 to c0dd28d Compare January 9, 2025 09:31

phymbert merged commit f8feb4b into master Jan 9, 2025
59 of 60 checks passed

phymbert deleted the phymbert/model/phi35-moe branch January 9, 2025 10:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

model: Add support for PhiMoE arch #11003

model: Add support for PhiMoE arch #11003

phymbert commented Dec 28, 2024 •

edited

Loading

This comment was marked as off-topic.

This comment was marked as off-topic.

ThiloteE commented Dec 28, 2024

phymbert commented Dec 28, 2024

ThiloteE commented Dec 29, 2024 •

edited

Loading

matiaslin left a comment •

edited

Loading

ggerganov left a comment

d1hr2uv commented Jan 12, 2025

model: Add support for PhiMoE arch #11003

model: Add support for PhiMoE arch #11003

Conversation

phymbert commented Dec 28, 2024 • edited Loading

PhiMoE

Overview

License

Implementation details

Testing

Check that phi3 is still working

Links

This comment was marked as off-topic.

This comment was marked as off-topic.

ThiloteE commented Dec 28, 2024

phymbert commented Dec 28, 2024

ThiloteE commented Dec 29, 2024 • edited Loading

matiaslin left a comment • edited Loading

Choose a reason for hiding this comment

ggerganov left a comment

Choose a reason for hiding this comment

d1hr2uv commented Jan 12, 2025

phymbert commented Dec 28, 2024 •

edited

Loading

ThiloteE commented Dec 29, 2024 •

edited

Loading

matiaslin left a comment •

edited

Loading