support MiniCPM-V-2 #6919
base: master
Conversation
I encountered some bugs when building this PR on Windows. Log here: https://github.com/MZWNET/actions/actions/runs/8896316696/job/24430759871 |
Can 6c1c4b4 fix this? |
It builds successfully now, thanks for your great work! |
I hit another bug when converting the image encoder to GGUF. Log: python3 ./examples/minicpmv/convert-image-encoder-to-gguf.py -m ../MiniCPM-V-2 --llava-projector ../MiniCPM-V-2/llava.projector --output-dir ../MiniCPM-V-2-GGUF --image-mean 0.5 0.5 0.5 --image-std 0.5 0.5 0.5
gguf: This GGUF file is for Little Endian only
Traceback (most recent call last):
File "/content/llama.cpp/./examples/minicpmv/convert-image-encoder-to-gguf.py", line 295, in <module>
data = data.squeeze().numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first. |
BTW it may be better if you can add that, so it can take full advantage of a GPU :) |
I followed |
The script should default the device to CPU, with an option to use the GPU if desired. |
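A minimal sketch of that idea (assuming an argparse-based conversion script; the --device flag and its wiring here are my own illustration, not part of this PR):

import argparse
import torch

ap = argparse.ArgumentParser()
# hypothetical flag: default to CPU so later .numpy() calls work; opt into GPU explicitly
ap.add_argument("--device", default="cpu", choices=["cpu", "cuda"],
                help="device used while processing checkpoint tensors")
args = ap.parse_args()

use_cuda = args.device == "cuda" and torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# heavy ops may run on the chosen device, but tensors must come back to host
# memory before conversion, which avoids the "can't convert cuda:0 device type
# tensor to numpy" TypeError above:
#     data = data.squeeze().cpu().numpy()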
I fixed my problems by editing the script. Edited version:

# store these tensors in a new dictionary and torch.save them
# (.cpu() moves each tensor to host memory so it can be saved and later converted to numpy)
projector = {name: checkpoint[name].float().cpu() for name in mm_tensors}
torch.save(projector, f"{args.model}/llava.projector")

clip_tensors = [k for k, v in checkpoint.items() if k.startswith("vpm")]
if len(clip_tensors) > 0:
    clip = {name.replace("vpm.", ""): checkpoint[name].float().cpu() for name in clip_tensors}
    torch.save(clip, f"{args.model}/llava.clip")

I think it would be better to add the |
I hit another bug. Log: > python3 ./examples/minicpmv/minicpm-surgery.py -m ../MiniCPM-V-2
Loading checkpoint shards: 100% 2/2 [00:34<00:00, 17.07s/it]
Done!
Now you can convert ../MiniCPM-V-2 to a regular LLaMA GGUF file.
Also, use ../MiniCPM-V-2/llava.projector to prepare a llava-encoder.gguf file.
> python3 ./examples/minicpmv/convert-image-encoder-to-gguf.py -m ../MiniCPM-V-2 --llava-projector ../MiniCPM-V-2/llava.projector --output-dir ../MiniCPM-V-2-GGUF --image-mean 0.5 0.5 0.5 --image-std 0.5 0.5 0.5
gguf: This GGUF file is for Little Endian only
Converting to float32
resampler.pos_embed - f32 - shape = (64, 2304)
Converting to float32
...(too long, ignore)
v.post_ln.weight - f32 - shape = (1152,)
Converting to float32
v.post_ln.bias - f32 - shape = (1152,)
Done. Output file: ../MiniCPM-V-2-GGUF/mmproj-model-f16.gguf
> python3 ./convert-hf-to-gguf.py --outtype f16 --outfile ../MiniCPM-V-2-GGUF/MiniCPM-V-2.F16.gguf ../MiniCPM-V-2/MiniCPM
Loading model: MiniCPM
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
Traceback (most recent call last):
File "/content/llama.cpp/./convert-hf-to-gguf.py", line 2809, in <module>
main()
File "/content/llama.cpp/./convert-hf-to-gguf.py", line 2796, in main
model_instance.set_vocab()
File "/content/llama.cpp/./convert-hf-to-gguf.py", line 1645, in set_vocab
self._set_vocab_llama_hf()
File "/content/llama.cpp/./convert-hf-to-gguf.py", line 377, in _set_vocab_llama_hf
vocab = LlamaHfVocab(self.dir_model)
File "/content/llama.cpp/convert.py", line 523, in __init__
with open(fname_tokenizer, encoding='utf-8') as f:
FileNotFoundError: [Errno 2] No such file or directory: '../MiniCPM-V-2/MiniCPM/tokenizer.json'
> cp ../MiniCPM-V-2/tokenizer.json ../MiniCPM-V-2/MiniCPM/
> python3 ./convert-hf-to-gguf.py --outtype f16 --outfile ../MiniCPM-V-2-GGUF/MiniCPM-V-2.F16.gguf ../MiniCPM-V-2/MiniCPM
Loading model: MiniCPM
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
The repository for ../MiniCPM-V-2/MiniCPM contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co/../MiniCPM-V-2/MiniCPM.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.
Do you wish to run the custom code? [y/N] y
Traceback (most recent call last):
File "/content/llama.cpp/./convert-hf-to-gguf.py", line 2809, in <module>
main()
File "/content/llama.cpp/./convert-hf-to-gguf.py", line 2796, in main
model_instance.set_vocab()
File "/content/llama.cpp/./convert-hf-to-gguf.py", line 1645, in set_vocab
self._set_vocab_llama_hf()
File "/content/llama.cpp/./convert-hf-to-gguf.py", line 377, in _set_vocab_llama_hf
vocab = LlamaHfVocab(self.dir_model)
File "/content/llama.cpp/convert.py", line 556, in __init__
assert self.tokenizer.is_fast # assume tokenizer.json is used
AssertionError |
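If it helps, a quick way to see what that assertion trips on (a sketch assuming the transformers library is installed; the path matches the log above): convert-hf-to-gguf.py's LlamaHfVocab asserts a "fast" tokenizer backed by tokenizer.json, so loading the directory and checking is_fast shows whether it is still resolving to a slow SentencePiece tokenizer.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("../MiniCPM-V-2/MiniCPM", trust_remote_code=True)
# False here means the copied tokenizer.json is not being picked up as a fast tokenizer
print(type(tok).__name__, tok.is_fast)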
Does your MiniCPM-V-2 folder have |
Yes, I confirm that. Log: > ls ../MiniCPM-V-2/ -alh
total 8.0G
drwxr-xr-x 4 root root 4.0K May 2 06:17 .
drwxr-xr-x 1 root root 4.0K May 2 06:16 ..
drwxr-xr-x 2 root root 4.0K May 2 06:13 assets
-rw-r--r-- 1 root root 1.2K May 2 06:13 config.json
-rw-r--r-- 1 root root 11K May 2 06:13 configuration_minicpm.py
-rw-r--r-- 1 root root 111 May 2 06:13 generation_config.json
-rw-r--r-- 1 root root 1.7K May 2 06:13 .gitattributes
-rw-r--r-- 1 root root 1.5G May 2 06:17 llava.clip
-rw-r--r-- 1 root root 113M May 2 06:17 llava.projector
drwxr-xr-x 2 root root 4.0K May 2 06:21 MiniCPM
-rw-r--r-- 1 root root 4.7G May 2 06:14 model-00001-of-00002.safetensors
-rw-r--r-- 1 root root 1.8G May 2 06:13 model-00002-of-00002.safetensors
-rw-r--r-- 1 root root 70K May 2 06:13 modeling_minicpm.py
-rw-r--r-- 1 root root 20K May 2 06:13 modeling_minicpmv.py
-rw-r--r-- 1 root root 54K May 2 06:13 model.safetensors.index.json
-rw-r--r-- 1 root root 9.2K May 2 06:13 README.md
-rw-r--r-- 1 root root 5.5K May 2 06:13 resampler.py
-rw-r--r-- 1 root root 651 May 2 06:13 special_tokens_map.json
-rw-r--r-- 1 root root 3.3K May 2 06:13 tokenizer_config.json
-rw-r--r-- 1 root root 6.0M May 2 06:13 tokenizer.json
-rw-r--r-- 1 root root 2.0M May 2 06:13 tokenizer.model
> ls ../MiniCPM-V-2/MiniCPM -alh
total 12G
drwxr-xr-x 2 root root 4.0K May 2 06:21 .
drwxr-xr-x 4 root root 4.0K May 2 06:17 ..
-rw-r--r-- 1 root root 1.5K May 2 06:17 config.json
-rw-r--r-- 1 root root 11K May 2 06:21 configuration_minicpm.py
-rw-r--r-- 1 root root 111 May 2 06:17 generation_config.json
-rw-r--r-- 1 root root 4.7G May 2 06:18 model-00001-of-00003.safetensors
-rw-r--r-- 1 root root 4.7G May 2 06:20 model-00002-of-00003.safetensors
-rw-r--r-- 1 root root 2.0G May 2 06:21 model-00003-of-00003.safetensors
-rw-r--r-- 1 root root 70K May 2 06:21 modeling_minicpm.py
-rw-r--r-- 1 root root 20K May 2 06:21 modeling_minicpmv.py
-rw-r--r-- 1 root root 30K May 2 06:21 model.safetensors.index.json
-rw-r--r-- 1 root root 5.5K May 2 06:21 resampler.py
-rw-r--r-- 1 root root 765 May 2 06:21 special_tokens_map.json
-rw-r--r-- 1 root root 3.4K May 2 06:21 tokenizer_config.json
-rw-r--r-- 1 root root 2.0M May 2 06:21 tokenizer.model |
So it seems that the save_pretrained method in surgery.py does not save the tokenizer.json file. I had been manually copying tokenizer.json into the MiniCPM sub-folder before it was uploaded, and I assumed the manual copy was no longer needed once it was uploaded. Seems I was wrong. |
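One possible fix in the surgery script (a sketch under my assumptions about its argument handling, not the exact change in this PR): save the tokenizer explicitly alongside the stripped language model so tokenizer.json ends up in the MiniCPM sub-folder.

from transformers import AutoTokenizer

# save_pretrained on the LM alone only writes weights/config; saving the tokenizer
# explicitly also writes tokenizer.json when a fast tokenizer is available
tok = AutoTokenizer.from_pretrained(args.model, trust_remote_code=True)
tok.save_pretrained(f"{args.model}/MiniCPM")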
However, even if I copy it to the sub-folder, the conversion script still doesn't work :( See here: #6919 (comment)
|
fixed. |
Conversion succeeds now, thanks! However, I got a bad test result... Log here: > ./minicpmv-cli -ngl 1000000 -m ./MiniCPM-V-2-GGUF/MiniCPM-V-2.F16.gguf --mmproj ./MiniCPM-V-2-GGUF/mmproj-model-f16.gguf -c 4096 --temp 0.6 --top-p 0.8 --top-k 100 --repeat-penalty 1.0 --image ./mzwing.jpg -p "这张图里有什么?"
Log start
clip_model_load: description: image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment: 32
clip_model_load: n_tensors: 440
clip_model_load: n_kv: 18
clip_model_load: ftype: f16
clip_model_load: loaded meta data with 18 key-value pairs and 440 tensors from ./MiniCPM-V-2-GGUF/mmproj-model-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: clip.has_text_encoder bool = false
clip_model_load: - kv 2: clip.has_vision_encoder bool = true
clip_model_load: - kv 3: clip.has_llava_projector bool = true
clip_model_load: - kv 4: general.file_type u32 = 1
clip_model_load: - kv 5: general.description str = image encoder for LLaVA
clip_model_load: - kv 6: clip.projector_type str = resampler
clip_model_load: - kv 7: clip.vision.image_size u32 = 448
clip_model_load: - kv 8: clip.vision.patch_size u32 = 14
clip_model_load: - kv 9: clip.vision.embedding_length u32 = 1152
clip_model_load: - kv 10: clip.vision.feed_forward_length u32 = 4304
clip_model_load: - kv 11: clip.vision.projection_dim u32 = 0
clip_model_load: - kv 12: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 13: clip.vision.attention.layer_norm_epsilon f32 = 0.000001
clip_model_load: - kv 14: clip.vision.block_count u32 = 26
clip_model_load: - kv 15: clip.vision.image_mean arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 16: clip.vision.image_std arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 17: clip.use_gelu bool = true
clip_model_load: - type f32: 277 tensors
clip_model_load: - type f16: 163 tensors
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: Tesla T4, compute capability 7.5, VMM: yes
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 1
clip_model_load: model size: 828.18 MB
clip_model_load: metadata size: 0.17 MB
clip_model_load: params backend buffer size = 828.18 MB (440 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 88.80 MB
llama_model_loader: loaded meta data with 22 key-value pairs and 363 tensors from ./MiniCPM-V-2-GGUF/MiniCPM-V-2.F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = minicpm
llama_model_loader: - kv 1: general.name str = MiniCPM
llama_model_loader: - kv 2: minicpm.context_length u32 = 4096
llama_model_loader: - kv 3: minicpm.embedding_length u32 = 2304
llama_model_loader: - kv 4: minicpm.block_count u32 = 40
llama_model_loader: - kv 5: minicpm.feed_forward_length u32 = 5760
llama_model_loader: - kv 6: minicpm.rope.dimension_count u32 = 64
llama_model_loader: - kv 7: minicpm.attention.head_count u32 = 36
llama_model_loader: - kv 8: minicpm.attention.head_count_kv u32 = 36
llama_model_loader: - kv 9: minicpm.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 1
llama_model_loader: - kv 11: minicpm.tie_lm_head bool = false
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,122753] = ["<unk>", "<s>", "</s>", "<SEP>", "<C...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,122753] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,122753] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type f16: 282 tensors
llm_load_vocab: mismatch in special tokens definition ( 3528/122753 vs 271/122753 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = minicpm
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 122753
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 2304
llm_load_print_meta: n_head = 36
llm_load_print_meta: n_head_kv = 36
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 2304
llm_load_print_meta: n_embd_v_gqa = 2304
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 5760
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 2B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 3.01 B
llm_load_print_meta: model size = 5.60 GiB (16.00 BPW)
llm_load_print_meta: general.name = MiniCPM
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 1099 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.37 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: CPU buffer size = 539.44 MiB
llm_load_tensors: CUDA0 buffer size = 5197.65 MiB
....................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 1440.00 MiB
llama_new_context_with_model: KV self size = 1440.00 MiB, K (f16): 720.00 MiB, V (f16): 720.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.47 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 314.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 12.51 MiB
llama_new_context_with_model: graph nodes = 1368
llama_new_context_with_model: graph splits = 2
encode_image_with_clip: image embedding created: 64 tokens
encode_image_with_clip: image encoded in 455.73 ms by CLIP ( 7.12 ms per image patch)
slice_image: multiple 1
<用户>这张图里有什么?
<AI>
《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《
llama_print_timings: load time = 27998.25 ms
llama_print_timings: sample time = 35.94 ms / 256 runs ( 0.14 ms per token, 7122.19 tokens per second)
llama_print_timings: prompt eval time = 196.38 ms / 80 tokens ( 2.45 ms per token, 407.36 tokens per second)
llama_print_timings: eval time = 7742.32 ms / 255 runs ( 30.36 ms per token, 32.94 tokens per second)
llama_print_timings: total time = 36056.03 ms / 335 tokens
The image I used for testing is my GitHub avatar. |
My log is here; I cannot reproduce your result:
|
@Achazwl Can you help test the model I quantized? Link here: https://huggingface.co/mzwing/MiniCPM-V-2-GGUF |
The link you provided only contains fp16 models |
The mmproj GGUF model is actually there, I just renamed it :) Link to the mmproj GGUF model: https://huggingface.co/mzwing/MiniCPM-V-2-GGUF/blob/main/MiniCPM-V-2-mmproj.F16.gguf |
Also correct
|
I did some further tests. When I use only the CPU, the model's output is perfectly normal. However, when I switch to the GPU, the model seems... mad. Tested on Google Colab (T4 GPU). Log: > ./minicpmv-cli -ngl 35 -m ./MiniCPM-V-2-GGUF/MiniCPM-V-2.Q2_K.gguf --mmproj ./MiniCPM-V-2-GGUF/MiniCPM-V-2-mmproj.F16.gguf -c 4096 --temp 0.6 --top-p 0.8 --top-k 100 --repeat-penalty 1.0 --image ./mzwing.jpg -p "这张图里有什么?"
Log start
clip_model_load: description: image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment: 32
clip_model_load: n_tensors: 440
clip_model_load: n_kv: 18
clip_model_load: ftype: f16
clip_model_load: loaded meta data with 18 key-value pairs and 440 tensors from ./MiniCPM-V-2-GGUF/MiniCPM-V-2-mmproj.F16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: clip.has_text_encoder bool = false
clip_model_load: - kv 2: clip.has_vision_encoder bool = true
clip_model_load: - kv 3: clip.has_llava_projector bool = true
clip_model_load: - kv 4: general.file_type u32 = 1
clip_model_load: - kv 5: general.description str = image encoder for LLaVA
clip_model_load: - kv 6: clip.projector_type str = resampler
clip_model_load: - kv 7: clip.vision.image_size u32 = 448
clip_model_load: - kv 8: clip.vision.patch_size u32 = 14
clip_model_load: - kv 9: clip.vision.embedding_length u32 = 1152
clip_model_load: - kv 10: clip.vision.feed_forward_length u32 = 4304
clip_model_load: - kv 11: clip.vision.projection_dim u32 = 0
clip_model_load: - kv 12: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 13: clip.vision.attention.layer_norm_epsilon f32 = 0.000001
clip_model_load: - kv 14: clip.vision.block_count u32 = 26
clip_model_load: - kv 15: clip.vision.image_mean arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 16: clip.vision.image_std arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 17: clip.use_gelu bool = true
clip_model_load: - type f32: 277 tensors
clip_model_load: - type f16: 163 tensors
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: Tesla T4, compute capability 7.5, VMM: yes
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 1
clip_model_load: model size: 828.18 MB
clip_model_load: metadata size: 0.17 MB
clip_model_load: params backend buffer size = 828.18 MB (440 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 88.80 MB
llama_model_loader: loaded meta data with 23 key-value pairs and 363 tensors from ./MiniCPM-V-2-GGUF/MiniCPM-V-2.Q2_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = minicpm
llama_model_loader: - kv 1: general.name str = MiniCPM
llama_model_loader: - kv 2: minicpm.context_length u32 = 4096
llama_model_loader: - kv 3: minicpm.embedding_length u32 = 2304
llama_model_loader: - kv 4: minicpm.block_count u32 = 40
llama_model_loader: - kv 5: minicpm.feed_forward_length u32 = 5760
llama_model_loader: - kv 6: minicpm.rope.dimension_count u32 = 64
llama_model_loader: - kv 7: minicpm.attention.head_count u32 = 36
llama_model_loader: - kv 8: minicpm.attention.head_count_kv u32 = 36
llama_model_loader: - kv 9: minicpm.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 10
llama_model_loader: - kv 11: minicpm.tie_lm_head bool = false
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,122753] = ["<unk>", "<s>", "</s>", "<SEP>", "<C...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,122753] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,122753] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q2_K: 161 tensors
llama_model_loader: - type q3_K: 80 tensors
llama_model_loader: - type q6_K: 1 tensors
llama_model_loader: - type iq4_nl: 40 tensors
llm_load_vocab: mismatch in special tokens definition ( 3528/122753 vs 271/122753 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = minicpm
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 122753
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 2304
llm_load_print_meta: n_head = 36
llm_load_print_meta: n_head_kv = 36
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 2304
llm_load_print_meta: n_embd_v_gqa = 2304
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 5760
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 2B
llm_load_print_meta: model ftype = Q2_K - Medium
llm_load_print_meta: model params = 3.01 B
llm_load_print_meta: model size = 1.21 GiB (3.44 BPW)
llm_load_print_meta: general.name = MiniCPM
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 1099 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.37 MiB
llm_load_tensors: offloading 35 repeating layers to GPU
llm_load_tensors: offloaded 35/41 layers to GPU
llm_load_tensors: CPU buffer size = 1234.38 MiB
llm_load_tensors: CUDA0 buffer size = 809.07 MiB
.............................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 180.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 1260.00 MiB
llama_new_context_with_model: KV self size = 1440.00 MiB, K (f16): 720.00 MiB, V (f16): 720.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.47 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 465.51 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 17.01 MiB
llama_new_context_with_model: graph nodes = 1368
llama_new_context_with_model: graph splits = 59
encode_image_with_clip: image embedding created: 64 tokens
encode_image_with_clip: image encoded in 394.83 ms by CLIP ( 6.17 ms per image patch)
slice_image: multiple 1
<用户>这张图里有什么?
<AI>
%</br></h3><strong></h2></tr><br/><SEP>-《<li><SEP></h2><h3/>8--></h3>』</h1>?8<</br></strong>32<h4/>=-</h3>10</h2><tbody/>.<h3></img><h5>『7</h2><img/></h3><tr>2¥3<h4>『11</h5><CLS><li></img></h3>』《『<『%3<li>0<h2/>3!1<tr><h5>?</img><SEP></h1><h4/><CLS>=<h4>```.=<h3/>?2<!--%!-1<td>2㊣<p/><p/><SEP>。<!--¥1。。。</td>》。...<li>
<h5></h5><h2/>㊣?</img>:<</h5><h3>,?.<strong></tr><tr></strong></tbody>3<h1>4-->?-【</li></tr><h3>12<li/>.</h2><SEP>3</h2>?<table>:<br></tbody><h2/><!DOCTYPE>¥9``````=</img></h5><b><h5/>.<li>¥』《</li>-4?<li>!%『<img/><br/></h1>!》<tr>..<table></br></h1><,!《</h2>㊣</h4></tbody>¥</li>。。。。。。。。。。。。<table/></br>-<li/>./.:<《</h2>、<h5>-<<h4/>%</li>1</strong>、</strong><br><h4/>-->《</h4>...<strong/>.<b/>--><tbody/><h4/>,?,『0<img/>【</strong>
9、<tr>-5
-</h5>%<p/><h4/><h5><!DOCTYPE><table/>《6</h5></tr>
llama_print_timings: load time = 2538.88 ms
llama_print_timings: sample time = 32.09 ms / 256 runs ( 0.13 ms per token, 7978.56 tokens per second)
llama_print_timings: prompt eval time = 1714.47 ms / 80 tokens ( 21.43 ms per token, 46.66 tokens per second)
llama_print_timings: eval time = 19998.40 ms / 255 runs ( 78.43 ms per token, 12.75 tokens per second)
llama_print_timings: total time = 22909.54 ms / 335 tokens
The binary I compiled is here: https://github.com/MZWNET/actions/releases/tag/llama_cpp-minicpm-v-6c1c4b4
Link to the models I quantized: https://huggingface.co/mzwing/MiniCPM-V-2-GGUF
If you need it, my Jupyter Notebook file is here: https://github.com/mzwing/AI-related/blob/master/notebooks/MiniCPM_V_2_GGUF.ipynb
So, it seems to be a GPU-related bug :( |
So this may not be related to my PR? The correct output on CPU indicates that the conversion itself is correct. |
Is the bug happening on LLaVA? |
Oh now I find that the
So, maybe that's the root cause? But the two errors seem too different. Log: > ./llava-cli -ngl 35 -m ./LLaVA-Llama-3-8B-Instruct-GGUF/llava-llama3-8b-Q4_K_M.gguf --mmproj ./LLaVA-Llama-3-8B-Instruct-GGUF/llava-llama3-mmproj-f16.gguf --temp 0.6 --top-p 0.8 --top-k 100 --repeat-penalty 1.0 --image ./mzwing.jpg -p "这张图里有什么?"
Log start
clip_model_load: model name: openai/clip-vit-large-patch14-336
clip_model_load: description: image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment: 32
clip_model_load: n_tensors: 377
clip_model_load: n_kv: 19
clip_model_load: ftype: f16
clip_model_load: loaded meta data with 19 key-value pairs and 377 tensors from ./LLaVA-Llama-3-8B-Instruct-GGUF/llava-llama3-mmproj-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: clip.has_text_encoder bool = false
clip_model_load: - kv 2: clip.has_vision_encoder bool = true
clip_model_load: - kv 3: clip.has_llava_projector bool = true
clip_model_load: - kv 4: general.file_type u32 = 1
clip_model_load: - kv 5: general.name str = openai/clip-vit-large-patch14-336
clip_model_load: - kv 6: general.description str = image encoder for LLaVA
clip_model_load: - kv 7: clip.projector_type str = mlp
clip_model_load: - kv 8: clip.vision.image_size u32 = 336
clip_model_load: - kv 9: clip.vision.patch_size u32 = 14
clip_model_load: - kv 10: clip.vision.embedding_length u32 = 1024
clip_model_load: - kv 11: clip.vision.feed_forward_length u32 = 4096
clip_model_load: - kv 12: clip.vision.projection_dim u32 = 768
clip_model_load: - kv 13: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 14: clip.vision.attention.layer_norm_epsilon f32 = 0.000010
clip_model_load: - kv 15: clip.vision.block_count u32 = 23
clip_model_load: - kv 16: clip.vision.image_mean arr[f32,3] = [0.481455, 0.457828, 0.408211]
clip_model_load: - kv 17: clip.vision.image_std arr[f32,3] = [0.268630, 0.261303, 0.275777]
clip_model_load: - kv 18: clip.use_gelu bool = false
clip_model_load: - type f32: 235 tensors
clip_model_load: - type f16: 142 tensors
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: Tesla T4, compute capability 7.5, VMM: yes
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 1
clip_model_load: model size: 595.49 MB
clip_model_load: metadata size: 0.14 MB
clip_model_load: params backend buffer size = 595.49 MB (377 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 32.89 MB
llama_model_loader: loaded meta data with 23 key-value pairs and 291 tensors from ./LLaVA-Llama-3-8B-Instruct-GGUF/llava-llama3-8b-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = tmp
llama_model_loader: - kv 2: llama.vocab_size u32 = 128257
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.block_count u32 = 32
llama_model_loader: - kv 6: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 7: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 8: llama.attention.head_count u32 = 32
llama_model_loader: - kv 9: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 11: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 15: tokenizer.ggml.scores arr[f32,128257] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128001
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 128256
llama_model_loader: - kv 21: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens definition check successful ( 257/128257 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128257
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 4.58 GiB (4.89 BPW)
llm_load_print_meta: general.name = tmp
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128001 '<|end_of_text|>'
llm_load_print_meta: PAD token = 128256 '<pad>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_tensors: ggml ctx size = 0.30 MiB
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 291, got 290
llama_load_model_from_file: failed to load model
llava_init: error: unable to load model
main: error: failed to init llava |
The CPU gives the same error, so it's quite confusing. Maybe you should update your branch?
|
LLaVA is fixed; it was a side effect of my code. The new version of my PR has far fewer modifications outside the minicpmv folder, and thus will not affect other models now. The MiniCPM-V bug on GPU is rather hard to track down. I can reproduce the NaN issue on GPU, and here are my observations:
|
@cmp-nct Hey, could you please help us resolve this confusing issue? Thanks a lot, and apologies for any confusion this has caused. |
Hello, I tried it out. After quantizing, I found the results are much worse. Is there any way to fix this? |
Did you quantize the LLM part or the ViT part? |
./quantize ../MiniCPM-V-2/MiniCPM/ggml-model-f16.gguf ../MiniCPM-V-2/MiniCPM/ggml-model-Q4_K_M.gguf Q4_K_M
This is the command I used for quantization. |
This quantizes only the LLM part; I did not quantize the ViT part. |
Hi, does this PR support converting the new MiniCPM V2.5?
Thank you. |
This PR does not support MiniCPM-V 2.5 yet. |
Hi guys, I found an ollama variant of llama.cpp here that supports it: https://github.com/OpenBMB/llama.cpp/tree/minicpm-v2.5/examples/minicpmv |
I'm a staff member of the MiniCPM-V team. |
These PRs are unlikely to be merged since they duplicate the entire LLaVA / CLIP codebase. Try to find ways to fit the implementation into the existing code and reuse/extend the existing API. |
When I do the first step, 'make', I get this error: |
I created a folder called "minicpmv" in the examples folder of llama.cpp. More detail can be seen in llama.cpp/examples/minicpmv/README.md. The code is based on examples/llava, but the vision part is quite different.