support MiniCPM-V-2 #6919
base: master
Conversation
I encountered some bugs when building this PR on Windows. Log here: https://github.com/MZWNET/actions/actions/runs/8896316696/job/24430759871 |
Can 6c1c4b4 fix this? |
It builds successfully now, thanks for your great work! |
I hit another bug when converting the image encoder to GGUF. Log: python3 ./examples/minicpmv/convert-image-encoder-to-gguf.py -m ../MiniCPM-V-2 --llava-projector ../MiniCPM-V-2/llava.projector --output-dir ../MiniCPM-V-2-GGUF --image-mean 0.5 0.5 0.5 --image-std 0.5 0.5 0.5
gguf: This GGUF file is for Little Endian only
Traceback (most recent call last):
File "/content/llama.cpp/./examples/minicpmv/convert-image-encoder-to-gguf.py", line 295, in <module>
data = data.squeeze().numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first. |
BTW it may be better if you can add that, so it can take full advantage of a GPU :) |
I followed |
The script should default the device to CPU, with an option to use the GPU if desired. |
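A minimal sketch of that idea (assuming an argparse-based conversion script; the --device flag and its wiring here are my own illustration, not part of this PR):

import argparse
import torch

ap = argparse.ArgumentParser()
# hypothetical flag: default to CPU so later .numpy() calls work; opt into GPU explicitly
ap.add_argument("--device", default="cpu", choices=["cpu", "cuda"],
                help="device used while processing checkpoint tensors")
args = ap.parse_args()

use_cuda = args.device == "cuda" and torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# heavy ops may run on the chosen device, but tensors must come back to host
# memory before conversion, which avoids the "can't convert cuda:0 device type
# tensor to numpy" TypeError above:
#     data = data.squeeze().cpu().numpy()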
I fixed my problems by editing the script. Edited version:

# store these tensors in a new dictionary and torch.save them
# (.cpu() moves each tensor to host memory so it can be saved and later converted to numpy)
projector = {name: checkpoint[name].float().cpu() for name in mm_tensors}
torch.save(projector, f"{args.model}/llava.projector")

clip_tensors = [k for k, v in checkpoint.items() if k.startswith("vpm")]
if len(clip_tensors) > 0:
    clip = {name.replace("vpm.", ""): checkpoint[name].float().cpu() for name in clip_tensors}
    torch.save(clip, f"{args.model}/llava.clip")

I think it would be better to add the |
I hit another bug. Log: > python3 ./examples/minicpmv/minicpm-surgery.py -m ../MiniCPM-V-2
Loading checkpoint shards: 100% 2/2 [00:34<00:00, 17.07s/it]
Done!
Now you can convert ../MiniCPM-V-2 to a regular LLaMA GGUF file.
Also, use ../MiniCPM-V-2/llava.projector to prepare a llava-encoder.gguf file.
> python3 ./examples/minicpmv/convert-image-encoder-to-gguf.py -m ../MiniCPM-V-2 --llava-projector ../MiniCPM-V-2/llava.projector --output-dir ../MiniCPM-V-2-GGUF --image-mean 0.5 0.5 0.5 --image-std 0.5 0.5 0.5
gguf: This GGUF file is for Little Endian only
Converting to float32
resampler.pos_embed - f32 - shape = (64, 2304)
Converting to float32
...(too long, ignore)
v.post_ln.weight - f32 - shape = (1152,)
Converting to float32
v.post_ln.bias - f32 - shape = (1152,)
Done. Output file: ../MiniCPM-V-2-GGUF/mmproj-model-f16.gguf
> python3 ./convert-hf-to-gguf.py --outtype f16 --outfile ../MiniCPM-V-2-GGUF/MiniCPM-V-2.F16.gguf ../MiniCPM-V-2/MiniCPM
Loading model: MiniCPM
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
Traceback (most recent call last):
File "/content/llama.cpp/./convert-hf-to-gguf.py", line 2809, in <module>
main()
File "/content/llama.cpp/./convert-hf-to-gguf.py", line 2796, in main
model_instance.set_vocab()
File "/content/llama.cpp/./convert-hf-to-gguf.py", line 1645, in set_vocab
self._set_vocab_llama_hf()
File "/content/llama.cpp/./convert-hf-to-gguf.py", line 377, in _set_vocab_llama_hf
vocab = LlamaHfVocab(self.dir_model)
File "/content/llama.cpp/convert.py", line 523, in __init__
with open(fname_tokenizer, encoding='utf-8') as f:
FileNotFoundError: [Errno 2] No such file or directory: '../MiniCPM-V-2/MiniCPM/tokenizer.json'
> cp ../MiniCPM-V-2/tokenizer.json ../MiniCPM-V-2/MiniCPM/
> python3 ./convert-hf-to-gguf.py --outtype f16 --outfile ../MiniCPM-V-2-GGUF/MiniCPM-V-2.F16.gguf ../MiniCPM-V-2/MiniCPM
Loading model: MiniCPM
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
The repository for ../MiniCPM-V-2/MiniCPM contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co/../MiniCPM-V-2/MiniCPM.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.
Do you wish to run the custom code? [y/N] y
Traceback (most recent call last):
File "/content/llama.cpp/./convert-hf-to-gguf.py", line 2809, in <module>
main()
File "/content/llama.cpp/./convert-hf-to-gguf.py", line 2796, in main
model_instance.set_vocab()
File "/content/llama.cpp/./convert-hf-to-gguf.py", line 1645, in set_vocab
self._set_vocab_llama_hf()
File "/content/llama.cpp/./convert-hf-to-gguf.py", line 377, in _set_vocab_llama_hf
vocab = LlamaHfVocab(self.dir_model)
File "/content/llama.cpp/convert.py", line 556, in __init__
assert self.tokenizer.is_fast # assume tokenizer.json is used
AssertionError |
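If it helps, a quick way to see what that assertion trips on (a sketch assuming the transformers library is installed; the path matches the log above): convert-hf-to-gguf.py's LlamaHfVocab asserts a "fast" tokenizer backed by tokenizer.json, so loading the directory and checking is_fast shows whether it is still resolving to a slow SentencePiece tokenizer.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("../MiniCPM-V-2/MiniCPM", trust_remote_code=True)
# False here means the copied tokenizer.json is not being picked up as a fast tokenizer
print(type(tok).__name__, tok.is_fast)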
Does your MiniCPM-V-2 folder have |
Yes, I confirm that. Log: > ls ../MiniCPM-V-2/ -alh
total 8.0G
drwxr-xr-x 4 root root 4.0K May 2 06:17 .
drwxr-xr-x 1 root root 4.0K May 2 06:16 ..
drwxr-xr-x 2 root root 4.0K May 2 06:13 assets
-rw-r--r-- 1 root root 1.2K May 2 06:13 config.json
-rw-r--r-- 1 root root 11K May 2 06:13 configuration_minicpm.py
-rw-r--r-- 1 root root 111 May 2 06:13 generation_config.json
-rw-r--r-- 1 root root 1.7K May 2 06:13 .gitattributes
-rw-r--r-- 1 root root 1.5G May 2 06:17 llava.clip
-rw-r--r-- 1 root root 113M May 2 06:17 llava.projector
drwxr-xr-x 2 root root 4.0K May 2 06:21 MiniCPM
-rw-r--r-- 1 root root 4.7G May 2 06:14 model-00001-of-00002.safetensors
-rw-r--r-- 1 root root 1.8G May 2 06:13 model-00002-of-00002.safetensors
-rw-r--r-- 1 root root 70K May 2 06:13 modeling_minicpm.py
-rw-r--r-- 1 root root 20K May 2 06:13 modeling_minicpmv.py
-rw-r--r-- 1 root root 54K May 2 06:13 model.safetensors.index.json
-rw-r--r-- 1 root root 9.2K May 2 06:13 README.md
-rw-r--r-- 1 root root 5.5K May 2 06:13 resampler.py
-rw-r--r-- 1 root root 651 May 2 06:13 special_tokens_map.json
-rw-r--r-- 1 root root 3.3K May 2 06:13 tokenizer_config.json
-rw-r--r-- 1 root root 6.0M May 2 06:13 tokenizer.json
-rw-r--r-- 1 root root 2.0M May 2 06:13 tokenizer.model
> ls ../MiniCPM-V-2/MiniCPM -alh
total 12G
drwxr-xr-x 2 root root 4.0K May 2 06:21 .
drwxr-xr-x 4 root root 4.0K May 2 06:17 ..
-rw-r--r-- 1 root root 1.5K May 2 06:17 config.json
-rw-r--r-- 1 root root 11K May 2 06:21 configuration_minicpm.py
-rw-r--r-- 1 root root 111 May 2 06:17 generation_config.json
-rw-r--r-- 1 root root 4.7G May 2 06:18 model-00001-of-00003.safetensors
-rw-r--r-- 1 root root 4.7G May 2 06:20 model-00002-of-00003.safetensors
-rw-r--r-- 1 root root 2.0G May 2 06:21 model-00003-of-00003.safetensors
-rw-r--r-- 1 root root 70K May 2 06:21 modeling_minicpm.py
-rw-r--r-- 1 root root 20K May 2 06:21 modeling_minicpmv.py
-rw-r--r-- 1 root root 30K May 2 06:21 model.safetensors.index.json
-rw-r--r-- 1 root root 5.5K May 2 06:21 resampler.py
-rw-r--r-- 1 root root 765 May 2 06:21 special_tokens_map.json
-rw-r--r-- 1 root root 3.4K May 2 06:21 tokenizer_config.json
-rw-r--r-- 1 root root 2.0M May 2 06:21 tokenizer.model |
So it seems that the save_pretrained method in surgery.py does not save the tokenizer.json file. I had been manually copying tokenizer.json into the MiniCPM sub-folder before it was uploaded, and I assumed the manual copy was no longer needed once it was uploaded. Seems I was wrong. |
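One possible fix in the surgery script (a sketch under my assumptions about its argument handling, not the exact change in this PR): save the tokenizer explicitly alongside the stripped language model so tokenizer.json ends up in the MiniCPM sub-folder.

from transformers import AutoTokenizer

# save_pretrained on the LM alone only writes weights/config; saving the tokenizer
# explicitly also writes tokenizer.json when a fast tokenizer is available
tok = AutoTokenizer.from_pretrained(args.model, trust_remote_code=True)
tok.save_pretrained(f"{args.model}/MiniCPM")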
However, even if I copy it to the sub-folder, the conversion script still doesn't work :( See here: #6919 (comment)
|
fixed. |
Conversion succeeds now, thanks! However, I got a bad test result... Log here: > ./minicpmv-cli -ngl 1000000 -m ./MiniCPM-V-2-GGUF/MiniCPM-V-2.F16.gguf --mmproj ./MiniCPM-V-2-GGUF/mmproj-model-f16.gguf -c 4096 --temp 0.6 --top-p 0.8 --top-k 100 --repeat-penalty 1.0 --image ./mzwing.jpg -p "这张图里有什么?"
Log start
clip_model_load: description: image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment: 32
clip_model_load: n_tensors: 440
clip_model_load: n_kv: 18
clip_model_load: ftype: f16
clip_model_load: loaded meta data with 18 key-value pairs and 440 tensors from ./MiniCPM-V-2-GGUF/mmproj-model-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: clip.has_text_encoder bool = false
clip_model_load: - kv 2: clip.has_vision_encoder bool = true
clip_model_load: - kv 3: clip.has_llava_projector bool = true
clip_model_load: - kv 4: general.file_type u32 = 1
clip_model_load: - kv 5: general.description str = image encoder for LLaVA
clip_model_load: - kv 6: clip.projector_type str = resampler
clip_model_load: - kv 7: clip.vision.image_size u32 = 448
clip_model_load: - kv 8: clip.vision.patch_size u32 = 14
clip_model_load: - kv 9: clip.vision.embedding_length u32 = 1152
clip_model_load: - kv 10: clip.vision.feed_forward_length u32 = 4304
clip_model_load: - kv 11: clip.vision.projection_dim u32 = 0
clip_model_load: - kv 12: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 13: clip.vision.attention.layer_norm_epsilon f32 = 0.000001
clip_model_load: - kv 14: clip.vision.block_count u32 = 26
clip_model_load: - kv 15: clip.vision.image_mean arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 16: clip.vision.image_std arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 17: clip.use_gelu bool = true
clip_model_load: - type f32: 277 tensors
clip_model_load: - type f16: 163 tensors
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: Tesla T4, compute capability 7.5, VMM: yes
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 1
clip_model_load: model size: 828.18 MB
clip_model_load: metadata size: 0.17 MB
clip_model_load: params backend buffer size = 828.18 MB (440 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 88.80 MB
llama_model_loader: loaded meta data with 22 key-value pairs and 363 tensors from ./MiniCPM-V-2-GGUF/MiniCPM-V-2.F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = minicpm
llama_model_loader: - kv 1: general.name str = MiniCPM
llama_model_loader: - kv 2: minicpm.context_length u32 = 4096
llama_model_loader: - kv 3: minicpm.embedding_length u32 = 2304
llama_model_loader: - kv 4: minicpm.block_count u32 = 40
llama_model_loader: - kv 5: minicpm.feed_forward_length u32 = 5760
llama_model_loader: - kv 6: minicpm.rope.dimension_count u32 = 64
llama_model_loader: - kv 7: minicpm.attention.head_count u32 = 36
llama_model_loader: - kv 8: minicpm.attention.head_count_kv u32 = 36
llama_model_loader: - kv 9: minicpm.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 1
llama_model_loader: - kv 11: minicpm.tie_lm_head bool = false
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,122753] = ["<unk>", "<s>", "</s>", "<SEP>", "<C...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,122753] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,122753] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type f16: 282 tensors
llm_load_vocab: mismatch in special tokens definition ( 3528/122753 vs 271/122753 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = minicpm
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 122753
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 2304
llm_load_print_meta: n_head = 36
llm_load_print_meta: n_head_kv = 36
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 2304
llm_load_print_meta: n_embd_v_gqa = 2304
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 5760
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 2B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 3.01 B
llm_load_print_meta: model size = 5.60 GiB (16.00 BPW)
llm_load_print_meta: general.name = MiniCPM
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 1099 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.37 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: CPU buffer size = 539.44 MiB
llm_load_tensors: CUDA0 buffer size = 5197.65 MiB
....................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 1440.00 MiB
llama_new_context_with_model: KV self size = 1440.00 MiB, K (f16): 720.00 MiB, V (f16): 720.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.47 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 314.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 12.51 MiB
llama_new_context_with_model: graph nodes = 1368
llama_new_context_with_model: graph splits = 2
encode_image_with_clip: image embedding created: 64 tokens
encode_image_with_clip: image encoded in 455.73 ms by CLIP ( 7.12 ms per image patch)
slice_image: multiple 1
<用户>这张图里有什么?
<AI>
《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《
llama_print_timings: load time = 27998.25 ms
llama_print_timings: sample time = 35.94 ms / 256 runs ( 0.14 ms per token, 7122.19 tokens per second)
llama_print_timings: prompt eval time = 196.38 ms / 80 tokens ( 2.45 ms per token, 407.36 tokens per second)
llama_print_timings: eval time = 7742.32 ms / 255 runs ( 30.36 ms per token, 32.94 tokens per second)
llama_print_timings: total time = 36056.03 ms / 335 tokens
The image I used for testing is my GitHub avatar. |
My log is here; I cannot reproduce your result:
|
@Achazwl Can you help test the model I quantized? Link here: https://huggingface.co/mzwing/MiniCPM-V-2-GGUF |
The link you provided only contains fp16 models |
The mmproj GGUF model is actually there, I just renamed it :) Link to the mmproj GGUF model: https://huggingface.co/mzwing/MiniCPM-V-2-GGUF/blob/main/MiniCPM-V-2-mmproj.F16.gguf |
Also correct
|
I did some further tests. When I use only the CPU, the model's output is perfectly normal. However, when I switch to the GPU, the model seems... mad. Tested on Google Colab (T4 GPU). Log: > ./minicpmv-cli -ngl 35 -m ./MiniCPM-V-2-GGUF/MiniCPM-V-2.Q2_K.gguf --mmproj ./MiniCPM-V-2-GGUF/MiniCPM-V-2-mmproj.F16.gguf -c 4096 --temp 0.6 --top-p 0.8 --top-k 100 --repeat-penalty 1.0 --image ./mzwing.jpg -p "这张图里有什么?"
Log start
clip_model_load: description: image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment: 32
clip_model_load: n_tensors: 440
clip_model_load: n_kv: 18
clip_model_load: ftype: f16
clip_model_load: loaded meta data with 18 key-value pairs and 440 tensors from ./MiniCPM-V-2-GGUF/MiniCPM-V-2-mmproj.F16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: clip.has_text_encoder bool = false
clip_model_load: - kv 2: clip.has_vision_encoder bool = true
clip_model_load: - kv 3: clip.has_llava_projector bool = true
clip_model_load: - kv 4: general.file_type u32 = 1
clip_model_load: - kv 5: general.description str = image encoder for LLaVA
clip_model_load: - kv 6: clip.projector_type str = resampler
clip_model_load: - kv 7: clip.vision.image_size u32 = 448
clip_model_load: - kv 8: clip.vision.patch_size u32 = 14
clip_model_load: - kv 9: clip.vision.embedding_length u32 = 1152
clip_model_load: - kv 10: clip.vision.feed_forward_length u32 = 4304
clip_model_load: - kv 11: clip.vision.projection_dim u32 = 0
clip_model_load: - kv 12: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 13: clip.vision.attention.layer_norm_epsilon f32 = 0.000001
clip_model_load: - kv 14: clip.vision.block_count u32 = 26
clip_model_load: - kv 15: clip.vision.image_mean arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 16: clip.vision.image_std arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 17: clip.use_gelu bool = true
clip_model_load: - type f32: 277 tensors
clip_model_load: - type f16: 163 tensors
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: Tesla T4, compute capability 7.5, VMM: yes
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 1
clip_model_load: model size: 828.18 MB
clip_model_load: metadata size: 0.17 MB
clip_model_load: params backend buffer size = 828.18 MB (440 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 88.80 MB
llama_model_loader: loaded meta data with 23 key-value pairs and 363 tensors from ./MiniCPM-V-2-GGUF/MiniCPM-V-2.Q2_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = minicpm
llama_model_loader: - kv 1: general.name str = MiniCPM
llama_model_loader: - kv 2: minicpm.context_length u32 = 4096
llama_model_loader: - kv 3: minicpm.embedding_length u32 = 2304
llama_model_loader: - kv 4: minicpm.block_count u32 = 40
llama_model_loader: - kv 5: minicpm.feed_forward_length u32 = 5760
llama_model_loader: - kv 6: minicpm.rope.dimension_count u32 = 64
llama_model_loader: - kv 7: minicpm.attention.head_count u32 = 36
llama_model_loader: - kv 8: minicpm.attention.head_count_kv u32 = 36
llama_model_loader: - kv 9: minicpm.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 10
llama_model_loader: - kv 11: minicpm.tie_lm_head bool = false
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,122753] = ["<unk>", "<s>", "</s>", "<SEP>", "<C...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,122753] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,122753] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q2_K: 161 tensors
llama_model_loader: - type q3_K: 80 tensors
llama_model_loader: - type q6_K: 1 tensors
llama_model_loader: - type iq4_nl: 40 tensors
llm_load_vocab: mismatch in special tokens definition ( 3528/122753 vs 271/122753 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = minicpm
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 122753
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 2304
llm_load_print_meta: n_head = 36
llm_load_print_meta: n_head_kv = 36
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 2304
llm_load_print_meta: n_embd_v_gqa = 2304
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 5760
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 2B
llm_load_print_meta: model ftype = Q2_K - Medium
llm_load_print_meta: model params = 3.01 B
llm_load_print_meta: model size = 1.21 GiB (3.44 BPW)
llm_load_print_meta: general.name = MiniCPM
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 1099 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.37 MiB
llm_load_tensors: offloading 35 repeating layers to GPU
llm_load_tensors: offloaded 35/41 layers to GPU
llm_load_tensors: CPU buffer size = 1234.38 MiB
llm_load_tensors: CUDA0 buffer size = 809.07 MiB
.............................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 180.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 1260.00 MiB
llama_new_context_with_model: KV self size = 1440.00 MiB, K (f16): 720.00 MiB, V (f16): 720.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.47 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 465.51 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 17.01 MiB
llama_new_context_with_model: graph nodes = 1368
llama_new_context_with_model: graph splits = 59
encode_image_with_clip: image embedding created: 64 tokens
encode_image_with_clip: image encoded in 394.83 ms by CLIP ( 6.17 ms per image patch)
slice_image: multiple 1
<用户>这张图里有什么?
<AI>
%</br></h3><strong></h2></tr><br/><SEP>-《<li><SEP></h2><h3/>8--></h3>』</h1>?8<</br></strong>32<h4/>=-</h3>10</h2><tbody/>.<h3></img><h5>『7</h2><img/></h3><tr>2¥3<h4>『11</h5><CLS><li></img></h3>』《『<『%3<li>0<h2/>3!1<tr><h5>?</img><SEP></h1><h4/><CLS>=<h4>```.=<h3/>?2<!--%!-1<td>2㊣<p/><p/><SEP>。<!--¥1。。。</td>》。...<li>
<h5></h5><h2/>㊣?</img>:<</h5><h3>,?.<strong></tr><tr></strong></tbody>3<h1>4-->?-【</li></tr><h3>12<li/>.</h2><SEP>3</h2>?<table>:<br></tbody><h2/><!DOCTYPE>¥9``````=</img></h5><b><h5/>.<li>¥』《</li>-4?<li>!%『<img/><br/></h1>!》<tr>..<table></br></h1><,!《</h2>㊣</h4></tbody>¥</li>。。。。。。。。。。。。<table/></br>-<li/>./.:<《</h2>、<h5>-<<h4/>%</li>1</strong>、</strong><br><h4/>-->《</h4>...<strong/>.<b/>--><tbody/><h4/>,?,『0<img/>【</strong>
9、<tr>-5
-</h5>%<p/><h4/><h5><!DOCTYPE><table/>《6</h5></tr>
llama_print_timings: load time = 2538.88 ms
llama_print_timings: sample time = 32.09 ms / 256 runs ( 0.13 ms per token, 7978.56 tokens per second)
llama_print_timings: prompt eval time = 1714.47 ms / 80 tokens ( 21.43 ms per token, 46.66 tokens per second)
llama_print_timings: eval time = 19998.40 ms / 255 runs ( 78.43 ms per token, 12.75 tokens per second)
llama_print_timings: total time = 22909.54 ms / 335 tokens
The binary I compiled is here: https://github.com/MZWNET/actions/releases/tag/llama_cpp-minicpm-v-6c1c4b4
Link to the models I quantized: https://huggingface.co/mzwing/MiniCPM-V-2-GGUF
If you need it, my Jupyter Notebook file is here: https://github.com/mzwing/AI-related/blob/master/notebooks/MiniCPM_V_2_GGUF.ipynb
So, it seems to be a GPU-related bug :( |
So this may not be related to my PR? The correct output on CPU indicates that the conversion itself is correct. |
Is the bug happening on LLaVA? |
Oh now I find that the
So, maybe that's the root cause? But the two errors seem too different. Log: > ./llava-cli -ngl 35 -m ./LLaVA-Llama-3-8B-Instruct-GGUF/llava-llama3-8b-Q4_K_M.gguf --mmproj ./LLaVA-Llama-3-8B-Instruct-GGUF/llava-llama3-mmproj-f16.gguf --temp 0.6 --top-p 0.8 --top-k 100 --repeat-penalty 1.0 --image ./mzwing.jpg -p "这张图里有什么?"
Log start
clip_model_load: model name: openai/clip-vit-large-patch14-336
clip_model_load: description: image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment: 32
clip_model_load: n_tensors: 377
clip_model_load: n_kv: 19
clip_model_load: ftype: f16
clip_model_load: loaded meta data with 19 key-value pairs and 377 tensors from ./LLaVA-Llama-3-8B-Instruct-GGUF/llava-llama3-mmproj-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: clip.has_text_encoder bool = false
clip_model_load: - kv 2: clip.has_vision_encoder bool = true
clip_model_load: - kv 3: clip.has_llava_projector bool = true
clip_model_load: - kv 4: general.file_type u32 = 1
clip_model_load: - kv 5: general.name str = openai/clip-vit-large-patch14-336
clip_model_load: - kv 6: general.description str = image encoder for LLaVA
clip_model_load: - kv 7: clip.projector_type str = mlp
clip_model_load: - kv 8: clip.vision.image_size u32 = 336
clip_model_load: - kv 9: clip.vision.patch_size u32 = 14
clip_model_load: - kv 10: clip.vision.embedding_length u32 = 1024
clip_model_load: - kv 11: clip.vision.feed_forward_length u32 = 4096
clip_model_load: - kv 12: clip.vision.projection_dim u32 = 768
clip_model_load: - kv 13: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 14: clip.vision.attention.layer_norm_epsilon f32 = 0.000010
clip_model_load: - kv 15: clip.vision.block_count u32 = 23
clip_model_load: - kv 16: clip.vision.image_mean arr[f32,3] = [0.481455, 0.457828, 0.408211]
clip_model_load: - kv 17: clip.vision.image_std arr[f32,3] = [0.268630, 0.261303, 0.275777]
clip_model_load: - kv 18: clip.use_gelu bool = false
clip_model_load: - type f32: 235 tensors
clip_model_load: - type f16: 142 tensors
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: Tesla T4, compute capability 7.5, VMM: yes
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 1
clip_model_load: model size: 595.49 MB
clip_model_load: metadata size: 0.14 MB
clip_model_load: params backend buffer size = 595.49 MB (377 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 32.89 MB
llama_model_loader: loaded meta data with 23 key-value pairs and 291 tensors from ./LLaVA-Llama-3-8B-Instruct-GGUF/llava-llama3-8b-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = tmp
llama_model_loader: - kv 2: llama.vocab_size u32 = 128257
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.block_count u32 = 32
llama_model_loader: - kv 6: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 7: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 8: llama.attention.head_count u32 = 32
llama_model_loader: - kv 9: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 11: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 15: tokenizer.ggml.scores arr[f32,128257] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128001
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 128256
llama_model_loader: - kv 21: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens definition check successful ( 257/128257 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128257
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 4.58 GiB (4.89 BPW)
llm_load_print_meta: general.name = tmp
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128001 '<|end_of_text|>'
llm_load_print_meta: PAD token = 128256 '<pad>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_tensors: ggml ctx size = 0.30 MiB
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 291, got 290
llama_load_model_from_file: failed to load model
llava_init: error: unable to load model
main: error: failed to init llava |
The CPU gives the same error, so it's quite confusing. Maybe you should update your branch?
|
LLaVA is fixed; it was a side effect of my code. The new version of my PR has far fewer modifications outside the minicpmv folder, and thus will not affect other models now. The MiniCPM-V bug on GPU is rather hard to track down. I can reproduce the NaN issue on GPU, and here are my observations:
|
@cmp-nct Hey, could you please help us resolve this confusing issue? Thanks a lot, and apologies for any confusion this has caused. |
Hello, I tried it out. After quantizing, I found the results are much worse. Is there any way to fix this? |
Did you quantize the LLM part or the ViT part? |
./quantize ../MiniCPM-V-2/MiniCPM/ggml-model-f16.gguf ../MiniCPM-V-2/MiniCPM/ggml-model-Q4_K_M.gguf Q4_K_M
This is the command I used for quantization. |
This quantizes only the LLM part; I did not quantize the ViT part. |
Hi, does this PR support converting the new MiniCPM V2.5?
Thank you. |
This PR does not support MiniCPM-V 2.5 yet. |
Hi guys, I found an ollama variant of llama.cpp here that supports it: https://github.com/OpenBMB/llama.cpp/tree/minicpm-v2.5/examples/minicpmv |
I'm a staff member of the MiniCPM-V team. |
These PRs are unlikely to be merged since they duplicate the entire LLaVA / CLIP codebase. Try to find ways to fit the implementation into the existing code and reuse/extend the existing API. |
When I do the first step, 'make', I get this error: |
I created a folder called "minicpmv" in the examples folder of llama.cpp. More detail can be seen in llama.cpp/examples/minicpmv/README.md. The code is based on examples/llava, but the vision part is quite different.