
support MiniCPM-V-2 #6919

Open · Achazwl wants to merge 6 commits into master
Conversation

@Achazwl commented Apr 26, 2024

I created a folder called "minicpmv" in the examples folder of llama.cpp.
More details can be found in llama.cpp/examples/minicpmv/README.md.

The code is based on examples/llava but the vision part is quite different.

github-actions bot (Contributor) commented Apr 26, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 555 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8400.3ms p(95)=19734.55ms fails=, finish reason: stop=511 truncated=44
  • Prompt processing (pp): avg=103.65tk/s p(95)=452.32tk/s
  • Token generation (tg): avg=34.33tk/s p(95)=46.58tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=feat-minicpmv commit=70a23863dcff7458839960b304ea166a401d4d8e

[Charts for llama.cpp bench-server-baseline on Standard_NC4as_T4_v3 (duration=10m, 555 iterations): prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing]

@mzwing commented Apr 30, 2024

Encountered some bugs when building this PR on Windows.

Log here: https://github.com/MZWNET/actions/actions/runs/8896316696/job/24430759871

@Achazwl (Author) commented May 1, 2024

> Encountered some bugs when building this PR on Windows.
>
> Log here: https://github.com/MZWNET/actions/actions/runs/8896316696/job/24430759871

Can 6c1c4b4 fix this?

@mzwing commented May 1, 2024

> Can 6c1c4b4 fix this?

It builds successfully now, thanks for your great work!

@mzwing commented May 1, 2024

Got another bug when converting the image encoder to GGUF.

Log:

python3 ./examples/minicpmv/convert-image-encoder-to-gguf.py -m ../MiniCPM-V-2 --llava-projector ../MiniCPM-V-2/llava.projector --output-dir ../MiniCPM-V-2-GGUF --image-mean 0.5 0.5 0.5 --image-std 0.5 0.5 0.5

gguf: This GGUF file is for Little Endian only
Traceback (most recent call last):
  File "/content/llama.cpp/./examples/minicpmv/convert-image-encoder-to-gguf.py", line 295, in <module>
    data = data.squeeze().numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

@mzwing commented May 1, 2024

BTW, it may be better if you can add device_map="auto" in minicpm-surgery.py#L12&L42 :)

It can take full advantage of the GPU :)
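For reference, a minimal sketch of what that could look like where the surgery script loads the checkpoint (assuming it loads via transformers, as the llava example does; the actual call in the PR may differ):

from transformers import AutoModel

# device_map="auto" lets accelerate place the weights on the available GPU(s)
# and keep the rest on the CPU; it requires the `accelerate` package.
model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-V-2",
    trust_remote_code=True,
    device_map="auto",
)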

@Achazwl (Author) commented May 2, 2024

> Got another bug when converting the image encoder to GGUF.
>
> Log:
>
> python3 ./examples/minicpmv/convert-image-encoder-to-gguf.py -m ../MiniCPM-V-2 --llava-projector ../MiniCPM-V-2/llava.projector --output-dir ../MiniCPM-V-2-GGUF --image-mean 0.5 0.5 0.5 --image-std 0.5 0.5 0.5
>
> gguf: This GGUF file is for Little Endian only
> Traceback (most recent call last):
>   File "/content/llama.cpp/./examples/minicpmv/convert-image-encoder-to-gguf.py", line 295, in <module>
>     data = data.squeeze().numpy()
> TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

I followed examples/llava/convert-image-encoder-to-gguf.py. It seems that they also don't use .cpu() there, and in my environment the model is loaded onto the CPU by default.

@teleprint-me (Contributor) commented

The device should default to the CPU, with the option to set it to the GPU if desired.
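A minimal sketch of that idea (the flag name and wiring are hypothetical, not the PR's actual interface):

import argparse
import torch
from transformers import AutoModel

parser = argparse.ArgumentParser()
parser.add_argument("-m", "--model", required=True, help="path to the MiniCPM-V-2 checkpoint")
parser.add_argument("--device", default="cpu", choices=["cpu", "cuda"],
                    help="device to load the model on (defaults to CPU)")
args = parser.parse_args()

# Load on the CPU by default and only move to the GPU when explicitly requested.
dtype = torch.float16 if args.device == "cuda" else torch.float32
model = AutoModel.from_pretrained(args.model, trust_remote_code=True, torch_dtype=dtype)
model.to(args.device)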

@mzwing commented May 2, 2024

> I followed examples/llava/convert-image-encoder-to-gguf.py. It seems that they also don't use .cpu() there, and in my environment the model is loaded onto the CPU by default.

I fixed the problem on my side by editing minicpmv-surgery.py.

Edited version:

# store these tensors in a new dictionary and torch.save them
projector = {name: checkpoint[name].float().cpu() for name in mm_tensors}
torch.save(projector, f"{args.model}/llava.projector")

clip_tensors = [k for k, v in checkpoint.items() if k.startswith("vpm")]
if len(clip_tensors) > 0:
    clip = {name.replace("vpm.", ""): checkpoint[name].float().cpu() for name in clip_tensors}
    torch.save(clip, f"{args.model}/llava.clip")

I think it would be better to add the .cpu() call, since it has (maybe?) no impact in a CPU-only environment, and together with device_map="auto" it lets us make good use of the GPU.
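For completeness, the failing line from the traceback could be made device-agnostic in the same way (a sketch of the idea, not necessarily the change adopted in the PR):

# convert-image-encoder-to-gguf.py, around the line shown in the traceback:
# copy the tensor to host memory before converting to numpy; this is a no-op
# when the tensor already lives on the CPU.
data = data.squeeze().cpu().numpy()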

@mzwing commented May 2, 2024

Got another bug.

Log:

> python3 ./examples/minicpmv/minicpm-surgery.py -m ../MiniCPM-V-2

Loading checkpoint shards: 100% 2/2 [00:34<00:00, 17.07s/it]
Done!
Now you can convert ../MiniCPM-V-2 to a regular LLaMA GGUF file.
Also, use ../MiniCPM-V-2/llava.projector to prepare a llava-encoder.gguf file.

> python3 ./examples/minicpmv/convert-image-encoder-to-gguf.py -m ../MiniCPM-V-2 --llava-projector ../MiniCPM-V-2/llava.projector --output-dir ../MiniCPM-V-2-GGUF --image-mean 0.5 0.5 0.5 --image-std 0.5 0.5 0.5

gguf: This GGUF file is for Little Endian only
  Converting to float32
resampler.pos_embed - f32 - shape = (64, 2304)
  Converting to float32
...(too long, ignore)
v.post_ln.weight - f32 - shape = (1152,)
  Converting to float32
v.post_ln.bias - f32 - shape = (1152,)
Done. Output file: ../MiniCPM-V-2-GGUF/mmproj-model-f16.gguf

> python3 ./convert-hf-to-gguf.py --outtype f16 --outfile ../MiniCPM-V-2-GGUF/MiniCPM-V-2.F16.gguf ../MiniCPM-V-2/MiniCPM

Loading model: MiniCPM
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
Traceback (most recent call last):
  File "/content/llama.cpp/./convert-hf-to-gguf.py", line 2809, in <module>
    main()
  File "/content/llama.cpp/./convert-hf-to-gguf.py", line 2796, in main
    model_instance.set_vocab()
  File "/content/llama.cpp/./convert-hf-to-gguf.py", line 1645, in set_vocab
    self._set_vocab_llama_hf()
  File "/content/llama.cpp/./convert-hf-to-gguf.py", line 377, in _set_vocab_llama_hf
    vocab = LlamaHfVocab(self.dir_model)
  File "/content/llama.cpp/convert.py", line 523, in __init__
    with open(fname_tokenizer, encoding='utf-8') as f:
FileNotFoundError: [Errno 2] No such file or directory: '../MiniCPM-V-2/MiniCPM/tokenizer.json'

> cp ../MiniCPM-V-2/tokenizer.json ../MiniCPM-V-2/MiniCPM/

> python3 ./convert-hf-to-gguf.py --outtype f16 --outfile ../MiniCPM-V-2-GGUF/MiniCPM-V-2.F16.gguf ../MiniCPM-V-2/MiniCPM

Loading model: MiniCPM
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
The repository for ../MiniCPM-V-2/MiniCPM contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co/../MiniCPM-V-2/MiniCPM.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.
Do you wish to run the custom code? [y/N] y
Traceback (most recent call last):
  File "/content/llama.cpp/./convert-hf-to-gguf.py", line 2809, in <module>
    main()
  File "/content/llama.cpp/./convert-hf-to-gguf.py", line 2796, in main
    model_instance.set_vocab()
  File "/content/llama.cpp/./convert-hf-to-gguf.py", line 1645, in set_vocab
    self._set_vocab_llama_hf()
  File "/content/llama.cpp/./convert-hf-to-gguf.py", line 377, in _set_vocab_llama_hf
    vocab = LlamaHfVocab(self.dir_model)
  File "/content/llama.cpp/convert.py", line 556, in __init__
    assert self.tokenizer.is_fast  # assume tokenizer.json is used
AssertionError

@Achazwl (Author) commented May 2, 2024

Does your MiniCPM-V-2 folder have tokenizer.json? It is a newly uploaded file in https://huggingface.co/openbmb/MiniCPM-V-2/tree/main.

@mzwing commented May 2, 2024

> Does your MiniCPM-V-2 folder have tokenizer.json? It is a newly uploaded file in https://huggingface.co/openbmb/MiniCPM-V-2/tree/main.

Yes, I can confirm that.

Log:

> ls ../MiniCPM-V-2/ -alh

total 8.0G
drwxr-xr-x 4 root root 4.0K May  2 06:17 .
drwxr-xr-x 1 root root 4.0K May  2 06:16 ..
drwxr-xr-x 2 root root 4.0K May  2 06:13 assets
-rw-r--r-- 1 root root 1.2K May  2 06:13 config.json
-rw-r--r-- 1 root root  11K May  2 06:13 configuration_minicpm.py
-rw-r--r-- 1 root root  111 May  2 06:13 generation_config.json
-rw-r--r-- 1 root root 1.7K May  2 06:13 .gitattributes
-rw-r--r-- 1 root root 1.5G May  2 06:17 llava.clip
-rw-r--r-- 1 root root 113M May  2 06:17 llava.projector
drwxr-xr-x 2 root root 4.0K May  2 06:21 MiniCPM
-rw-r--r-- 1 root root 4.7G May  2 06:14 model-00001-of-00002.safetensors
-rw-r--r-- 1 root root 1.8G May  2 06:13 model-00002-of-00002.safetensors
-rw-r--r-- 1 root root  70K May  2 06:13 modeling_minicpm.py
-rw-r--r-- 1 root root  20K May  2 06:13 modeling_minicpmv.py
-rw-r--r-- 1 root root  54K May  2 06:13 model.safetensors.index.json
-rw-r--r-- 1 root root 9.2K May  2 06:13 README.md
-rw-r--r-- 1 root root 5.5K May  2 06:13 resampler.py
-rw-r--r-- 1 root root  651 May  2 06:13 special_tokens_map.json
-rw-r--r-- 1 root root 3.3K May  2 06:13 tokenizer_config.json
-rw-r--r-- 1 root root 6.0M May  2 06:13 tokenizer.json
-rw-r--r-- 1 root root 2.0M May  2 06:13 tokenizer.model

> ls ../MiniCPM-V-2/MiniCPM -alh

total 12G
drwxr-xr-x 2 root root 4.0K May  2 06:21 .
drwxr-xr-x 4 root root 4.0K May  2 06:17 ..
-rw-r--r-- 1 root root 1.5K May  2 06:17 config.json
-rw-r--r-- 1 root root  11K May  2 06:21 configuration_minicpm.py
-rw-r--r-- 1 root root  111 May  2 06:17 generation_config.json
-rw-r--r-- 1 root root 4.7G May  2 06:18 model-00001-of-00003.safetensors
-rw-r--r-- 1 root root 4.7G May  2 06:20 model-00002-of-00003.safetensors
-rw-r--r-- 1 root root 2.0G May  2 06:21 model-00003-of-00003.safetensors
-rw-r--r-- 1 root root  70K May  2 06:21 modeling_minicpm.py
-rw-r--r-- 1 root root  20K May  2 06:21 modeling_minicpmv.py
-rw-r--r-- 1 root root  30K May  2 06:21 model.safetensors.index.json
-rw-r--r-- 1 root root 5.5K May  2 06:21 resampler.py
-rw-r--r-- 1 root root  765 May  2 06:21 special_tokens_map.json
-rw-r--r-- 1 root root 3.4K May  2 06:21 tokenizer_config.json
-rw-r--r-- 1 root root 2.0M May  2 06:21 tokenizer.model

@Achazwl (Author) commented May 2, 2024

> Does your MiniCPM-V-2 folder have tokenizer.json? It is a newly uploaded file in https://huggingface.co/openbmb/MiniCPM-V-2/tree/main.
>
> Yes, I can confirm that.
>
> Log: [the directory listings quoted above]

So it seems that the save_pretrained method in surgery.py does not save the tokenizer.json file. I used to manually copy tokenizer.json into the MiniCPM sub-folder, back before tokenizer.json was uploaded to the repo, and assumed the manual copy was no longer needed once it was uploaded. Seems I was wrong.
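If the surgery script only saves the slow (SentencePiece) tokenizer, save_pretrained will emit tokenizer.model but not tokenizer.json. A rough sketch of one way to produce it, assuming the fast tokenizer can be loaded for this repo (paths are illustrative, not the PR's actual fix):

from transformers import AutoTokenizer

# Loading with use_fast=True and re-saving writes tokenizer.json, which
# convert-hf-to-gguf.py's LlamaHfVocab expects to find next to tokenizer.model.
tok = AutoTokenizer.from_pretrained("../MiniCPM-V-2", trust_remote_code=True, use_fast=True)
tok.save_pretrained("../MiniCPM-V-2/MiniCPM")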

@mzwing commented May 2, 2024

> So it seems that the save_pretrained method in surgery.py does not save the tokenizer.json file. I used to manually copy tokenizer.json into the MiniCPM sub-folder, back before tokenizer.json was uploaded to the repo, and assumed the manual copy was no longer needed once it was uploaded. Seems I was wrong.

However, even if I copy it into the sub-folder, the conversion script doesn't work either :(

See here: #6919 (comment)

cp ../MiniCPM-V-2/tokenizer.json ../MiniCPM-V-2/MiniCPM/

@Achazwl (Author) commented May 3, 2024

Fixed.

@mzwing commented May 4, 2024

The conversion succeeds now, thanks!

However, I got a bad test result...

Log here:

> ./minicpmv-cli -ngl 1000000 -m ./MiniCPM-V-2-GGUF/MiniCPM-V-2.F16.gguf --mmproj ./MiniCPM-V-2-GGUF/mmproj-model-f16.gguf -c 4096 --temp 0.6 --top-p 0.8 --top-k 100 --repeat-penalty 1.0 --image ./mzwing.jpg -p "这张图里有什么?"

Log start
clip_model_load: description:  image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    440
clip_model_load: n_kv:         18
clip_model_load: ftype:        f16

clip_model_load: loaded meta data with 18 key-value pairs and 440 tensors from ./MiniCPM-V-2-GGUF/mmproj-model-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv   0:                       general.architecture str              = clip
clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
clip_model_load: - kv   3:                   clip.has_llava_projector bool             = true
clip_model_load: - kv   4:                          general.file_type u32              = 1
clip_model_load: - kv   5:                        general.description str              = image encoder for LLaVA
clip_model_load: - kv   6:                        clip.projector_type str              = resampler
clip_model_load: - kv   7:                     clip.vision.image_size u32              = 448
clip_model_load: - kv   8:                     clip.vision.patch_size u32              = 14
clip_model_load: - kv   9:               clip.vision.embedding_length u32              = 1152
clip_model_load: - kv  10:            clip.vision.feed_forward_length u32              = 4304
clip_model_load: - kv  11:                 clip.vision.projection_dim u32              = 0
clip_model_load: - kv  12:           clip.vision.attention.head_count u32              = 16
clip_model_load: - kv  13:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
clip_model_load: - kv  14:                    clip.vision.block_count u32              = 26
clip_model_load: - kv  15:                     clip.vision.image_mean arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv  16:                      clip.vision.image_std arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv  17:                              clip.use_gelu bool             = true
clip_model_load: - type  f32:  277 tensors
clip_model_load: - type  f16:  163 tensors
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla T4, compute capability 7.5, VMM: yes
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  1
clip_model_load: model size:     828.18 MB
clip_model_load: metadata size:  0.17 MB
clip_model_load: params backend buffer size =  828.18 MB (440 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 88.80 MB
llama_model_loader: loaded meta data with 22 key-value pairs and 363 tensors from ./MiniCPM-V-2-GGUF/MiniCPM-V-2.F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = minicpm
llama_model_loader: - kv   1:                               general.name str              = MiniCPM
llama_model_loader: - kv   2:                     minicpm.context_length u32              = 4096
llama_model_loader: - kv   3:                   minicpm.embedding_length u32              = 2304
llama_model_loader: - kv   4:                        minicpm.block_count u32              = 40
llama_model_loader: - kv   5:                minicpm.feed_forward_length u32              = 5760
llama_model_loader: - kv   6:               minicpm.rope.dimension_count u32              = 64
llama_model_loader: - kv   7:               minicpm.attention.head_count u32              = 36
llama_model_loader: - kv   8:            minicpm.attention.head_count_kv u32              = 36
llama_model_loader: - kv   9:   minicpm.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:                        minicpm.tie_lm_head bool             = false
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,122753]  = ["<unk>", "<s>", "</s>", "<SEP>", "<C...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,122753]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,122753]  = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type  f16:  282 tensors
llm_load_vocab: mismatch in special tokens definition ( 3528/122753 vs 271/122753 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = minicpm
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 122753
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 2304
llm_load_print_meta: n_head           = 36
llm_load_print_meta: n_head_kv        = 36
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 2304
llm_load_print_meta: n_embd_v_gqa     = 2304
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 5760
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 2B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 3.01 B
llm_load_print_meta: model size       = 5.60 GiB (16.00 BPW) 
llm_load_print_meta: general.name     = MiniCPM
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 1099 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.37 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors:        CPU buffer size =   539.44 MiB
llm_load_tensors:      CUDA0 buffer size =  5197.65 MiB
....................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1440.00 MiB
llama_new_context_with_model: KV self size  = 1440.00 MiB, K (f16):  720.00 MiB, V (f16):  720.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.47 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   314.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    12.51 MiB
llama_new_context_with_model: graph nodes  = 1368
llama_new_context_with_model: graph splits = 2
encode_image_with_clip: image embedding created: 64 tokens

encode_image_with_clip: image encoded in   455.73 ms by CLIP (    7.12 ms per image patch)
slice_image: multiple 1
<用户>这张图里有什么?
<AI>
《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《《

llama_print_timings:        load time =   27998.25 ms
llama_print_timings:      sample time =      35.94 ms /   256 runs   (    0.14 ms per token,  7122.19 tokens per second)
llama_print_timings: prompt eval time =     196.38 ms /    80 tokens (    2.45 ms per token,   407.36 tokens per second)
llama_print_timings:        eval time =    7742.32 ms /   255 runs   (   30.36 ms per token,    32.94 tokens per second)
llama_print_timings:       total time =   36056.03 ms /   335 tokens

The image I used for testing is my GitHub avatar.


@Achazwl (Author) commented May 5, 2024

My log is below; I cannot reproduce your result:

./minicpmv-cli -m ../MiniCPM-V-2/MiniCPM/ggml-model-f16.gguf --mmproj ../MiniCPM-V-2/mmproj-model-f16.gguf -c 4096 --temp 0.6 --top-p 0.8 --top-k 100 --repeat-penalty 1.0 --image ../mzwing.jpg -p "这张图里有什么?"
Log start
clip_model_load: description:  image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    440
clip_model_load: n_kv:         18
clip_model_load: ftype:        f16

clip_model_load: loaded meta data with 18 key-value pairs and 440 tensors from ../MiniCPM-V-2/mmproj-model-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv   0:                       general.architecture str              = clip
clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
clip_model_load: - kv   3:                   clip.has_llava_projector bool             = true
clip_model_load: - kv   4:                          general.file_type u32              = 1
clip_model_load: - kv   5:                        general.description str              = image encoder for LLaVA
clip_model_load: - kv   6:                        clip.projector_type str              = resampler
clip_model_load: - kv   7:                     clip.vision.image_size u32              = 448
clip_model_load: - kv   8:                     clip.vision.patch_size u32              = 14
clip_model_load: - kv   9:               clip.vision.embedding_length u32              = 1152
clip_model_load: - kv  10:            clip.vision.feed_forward_length u32              = 4304
clip_model_load: - kv  11:                 clip.vision.projection_dim u32              = 0
clip_model_load: - kv  12:           clip.vision.attention.head_count u32              = 16
clip_model_load: - kv  13:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
clip_model_load: - kv  14:                    clip.vision.block_count u32              = 26
clip_model_load: - kv  15:                     clip.vision.image_mean arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv  16:                      clip.vision.image_std arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv  17:                              clip.use_gelu bool             = true
clip_model_load: - type  f32:  277 tensors
clip_model_load: - type  f16:  163 tensors
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/acha/Desktop/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 22906.50 MB
clip_model_load: CLIP using Metal backend
clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  1
clip_model_load: model size:     828.18 MB
clip_model_load: metadata size:  0.17 MB
clip_model_load: params backend buffer size =  828.18 MB (440 tensors)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   828.19 MiB, (  829.19 / 21845.34)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =    88.81 MiB, (  918.00 / 21845.34)
clip_model_load: compute allocated memory: 88.80 MB
llama_model_loader: loaded meta data with 22 key-value pairs and 363 tensors from ../MiniCPM-V-2/MiniCPM/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = minicpm
llama_model_loader: - kv   1:                               general.name str              = MiniCPM
llama_model_loader: - kv   2:                     minicpm.context_length u32              = 4096
llama_model_loader: - kv   3:                   minicpm.embedding_length u32              = 2304
llama_model_loader: - kv   4:                        minicpm.block_count u32              = 40
llama_model_loader: - kv   5:                minicpm.feed_forward_length u32              = 5760
llama_model_loader: - kv   6:               minicpm.rope.dimension_count u32              = 64
llama_model_loader: - kv   7:               minicpm.attention.head_count u32              = 36
llama_model_loader: - kv   8:            minicpm.attention.head_count_kv u32              = 36
llama_model_loader: - kv   9:   minicpm.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:                        minicpm.tie_lm_head bool             = false
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,122753]  = ["<unk>", "<s>", "</s>", "<SEP>", "<C...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,122753]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,122753]  = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type  f16:  282 tensors
llm_load_vocab: mismatch in special tokens definition ( 3528/122753 vs 271/122753 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = minicpm
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 122753
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 2304
llm_load_print_meta: n_head           = 36
llm_load_print_meta: n_head_kv        = 36
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 2304
llm_load_print_meta: n_embd_v_gqa     = 2304
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 5760
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 2B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 3.01 B
llm_load_print_meta: model size       = 5.60 GiB (16.00 BPW) 
llm_load_print_meta: general.name     = MiniCPM
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 1099 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.37 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =  5197.67 MiB, ( 6115.67 / 21845.34)
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors:      Metal buffer size =  5197.66 MiB
llm_load_tensors:        CPU buffer size =   539.44 MiB
...................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/acha/Desktop/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 22906.50 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =  1440.00 MiB, ( 7556.42 / 21845.34)
llama_kv_cache_init:      Metal KV buffer size =  1440.00 MiB
llama_new_context_with_model: KV self size  = 1440.00 MiB, K (f16):  720.00 MiB, V (f16):  720.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.47 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   314.02 MiB, ( 7870.44 / 21845.34)
llama_new_context_with_model:      Metal compute buffer size =   314.00 MiB
llama_new_context_with_model:        CPU compute buffer size =    12.51 MiB
llama_new_context_with_model: graph nodes  = 1368
llama_new_context_with_model: graph splits = 2
encode_image_with_clip: image embedding created: 64 tokens

encode_image_with_clip: image encoded in  2532.64 ms by CLIP (   39.57 ms per image patch)
slice_image: multiple 1
<用户>这张图里有什么?
<AI>
这幅图片展示了一个人站在一个看起来像围栏的地方,周围是蓝色的海洋。这个人正在伸手去触碰天空中的鸟群,这些鸟群以一种抽象的方式排列成一条线。这幅画的风格是水彩,给人一种梦幻、宁静的感觉。颜色以蓝色和白色为主,蓝色象征着海洋和天空,白色则代表云彩和鸟群。

llama_print_timings:        load time =   11091.30 ms
llama_print_timings:      sample time =       5.67 ms /    74 runs   (    0.08 ms per token, 13053.45 tokens per second)
llama_print_timings: prompt eval time =    8420.87 ms /    80 tokens (  105.26 ms per token,     9.50 tokens per second)
llama_print_timings:        eval time =    2901.03 ms /    73 runs   (   39.74 ms per token,    25.16 tokens per second)
llama_print_timings:       total time =   14061.83 ms /   153 tokens
ggml_metal_free: deallocating
ggml_metal_free: deallocating

@mzwing commented May 5, 2024

@Achazwl Can you help test the model I quantized? Link here: https://huggingface.co/mzwing/MiniCPM-V-2-GGUF

@Achazwl (Author) commented May 5, 2024

> @Achazwl Can you help test the model I quantized? Link here: https://huggingface.co/mzwing/MiniCPM-V-2-GGUF

The link you provided only contains fp16 models.

@mzwing commented May 5, 2024

> The link you provided only contains fp16 models.

The mmproj GGUF model is actually there; I just renamed it :)

Link to the mmproj GGUF model: https://huggingface.co/mzwing/MiniCPM-V-2-GGUF/blob/main/MiniCPM-V-2-mmproj.F16.gguf

@Achazwl (Author) commented May 6, 2024

Also correct:

./minicpmv-cli -m ../MiniCPM-V-2-GGUF/MiniCPM-V-2.F16.gguf --mmproj ../MiniCPM-V-2-GGUF/MiniCPM-V-2-mmproj.F16.gguf -c 4096 --temp 0.6 --top-p 0.8 --top-k 100 --repeat-penalty 1.0 --image ../mzwing.jpg -p "这张图里有什么?" 
Log start
clip_model_load: description:  image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    440
clip_model_load: n_kv:         18
clip_model_load: ftype:        f16

clip_model_load: loaded meta data with 18 key-value pairs and 440 tensors from ../MiniCPM-V-2-GGUF/MiniCPM-V-2-mmproj.F16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv   0:                       general.architecture str              = clip
clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
clip_model_load: - kv   3:                   clip.has_llava_projector bool             = true
clip_model_load: - kv   4:                          general.file_type u32              = 1
clip_model_load: - kv   5:                        general.description str              = image encoder for LLaVA
clip_model_load: - kv   6:                        clip.projector_type str              = resampler
clip_model_load: - kv   7:                     clip.vision.image_size u32              = 448
clip_model_load: - kv   8:                     clip.vision.patch_size u32              = 14
clip_model_load: - kv   9:               clip.vision.embedding_length u32              = 1152
clip_model_load: - kv  10:            clip.vision.feed_forward_length u32              = 4304
clip_model_load: - kv  11:                 clip.vision.projection_dim u32              = 0
clip_model_load: - kv  12:           clip.vision.attention.head_count u32              = 16
clip_model_load: - kv  13:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
clip_model_load: - kv  14:                    clip.vision.block_count u32              = 26
clip_model_load: - kv  15:                     clip.vision.image_mean arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv  16:                      clip.vision.image_std arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv  17:                              clip.use_gelu bool             = true
clip_model_load: - type  f32:  277 tensors
clip_model_load: - type  f16:  163 tensors
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/acha/Desktop/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 22906.50 MB
clip_model_load: CLIP using Metal backend
clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  1
clip_model_load: model size:     828.18 MB
clip_model_load: metadata size:  0.17 MB
clip_model_load: params backend buffer size =  828.18 MB (440 tensors)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   828.19 MiB, (  829.19 / 21845.34)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =    88.81 MiB, (  918.00 / 21845.34)
clip_model_load: compute allocated memory: 88.80 MB
llama_model_loader: loaded meta data with 22 key-value pairs and 363 tensors from ../MiniCPM-V-2-GGUF/MiniCPM-V-2.F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = minicpm
llama_model_loader: - kv   1:                               general.name str              = MiniCPM
llama_model_loader: - kv   2:                     minicpm.context_length u32              = 4096
llama_model_loader: - kv   3:                   minicpm.embedding_length u32              = 2304
llama_model_loader: - kv   4:                        minicpm.block_count u32              = 40
llama_model_loader: - kv   5:                minicpm.feed_forward_length u32              = 5760
llama_model_loader: - kv   6:               minicpm.rope.dimension_count u32              = 64
llama_model_loader: - kv   7:               minicpm.attention.head_count u32              = 36
llama_model_loader: - kv   8:            minicpm.attention.head_count_kv u32              = 36
llama_model_loader: - kv   9:   minicpm.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:                        minicpm.tie_lm_head bool             = false
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,122753]  = ["<unk>", "<s>", "</s>", "<SEP>", "<C...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,122753]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,122753]  = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type  f16:  282 tensors
llm_load_vocab: mismatch in special tokens definition ( 3528/122753 vs 271/122753 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = minicpm
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 122753
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 2304
llm_load_print_meta: n_head           = 36
llm_load_print_meta: n_head_kv        = 36
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 2304
llm_load_print_meta: n_embd_v_gqa     = 2304
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 5760
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 2B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 3.01 B
llm_load_print_meta: model size       = 5.60 GiB (16.00 BPW) 
llm_load_print_meta: general.name     = MiniCPM
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 1099 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.37 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =  5197.67 MiB, ( 6115.67 / 21845.34)
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors:      Metal buffer size =  5197.66 MiB
llm_load_tensors:        CPU buffer size =   539.44 MiB
...................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/acha/Desktop/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 22906.50 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =  1440.00 MiB, ( 7556.42 / 21845.34)
llama_kv_cache_init:      Metal KV buffer size =  1440.00 MiB
llama_new_context_with_model: KV self size  = 1440.00 MiB, K (f16):  720.00 MiB, V (f16):  720.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.47 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   314.02 MiB, ( 7870.44 / 21845.34)
llama_new_context_with_model:      Metal compute buffer size =   314.00 MiB
llama_new_context_with_model:        CPU compute buffer size =    12.51 MiB
llama_new_context_with_model: graph nodes  = 1368
llama_new_context_with_model: graph splits = 2
encode_image_with_clip: image embedding created: 64 tokens

encode_image_with_clip: image encoded in  2460.77 ms by CLIP (   38.45 ms per image patch)
slice_image: multiple 1
<用户>这张图里有什么?
<AI>
这张图片描绘了一个人站在一个看起来像是栏杆的地方,朝向天空。这个人似乎正在伸手向天空,可能是在试图捕捉或触摸星星或鸟儿。天空是深蓝色的,点缀着许多星星和散落的鸟群,给人一种浩瀚和宁静的感觉。这幅画采用了水彩画风格,柔和的水彩笔触营造出一种梦幻般、略带忧郁的氛围。

llama_print_timings:        load time =    9274.04 ms
llama_print_timings:      sample time =       5.80 ms /    76 runs   (    0.08 ms per token, 13103.45 tokens per second)
llama_print_timings: prompt eval time =    6685.98 ms /    80 tokens (   83.57 ms per token,    11.97 tokens per second)
llama_print_timings:        eval time =    2971.87 ms /    75 runs   (   39.62 ms per token,    25.24 tokens per second)
llama_print_timings:       total time =   12316.09 ms /   155 tokens
ggml_metal_free: deallocating
ggml_metal_free: deallocating

@mzwing
Copy link

mzwing commented May 6, 2024

I did some further tests.

When I use only the CPU, the model's output is perfectly normal. However, when I switch to the GPU, the model seems... mad.

Tested on Google Colab (T4 GPU).

Log:

> ./minicpmv-cli -ngl 35 -m ./MiniCPM-V-2-GGUF/MiniCPM-V-2.Q2_K.gguf --mmproj ./MiniCPM-V-2-GGUF/MiniCPM-V-2-mmproj.F16.gguf -c 4096 --temp 0.6 --top-p 0.8 --top-k 100 --repeat-penalty 1.0 --image ./mzwing.jpg -p "这张图里有什么?"

Log start
clip_model_load: description:  image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    440
clip_model_load: n_kv:         18
clip_model_load: ftype:        f16

clip_model_load: loaded meta data with 18 key-value pairs and 440 tensors from ./MiniCPM-V-2-GGUF/MiniCPM-V-2-mmproj.F16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv   0:                       general.architecture str              = clip
clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
clip_model_load: - kv   3:                   clip.has_llava_projector bool             = true
clip_model_load: - kv   4:                          general.file_type u32              = 1
clip_model_load: - kv   5:                        general.description str              = image encoder for LLaVA
clip_model_load: - kv   6:                        clip.projector_type str              = resampler
clip_model_load: - kv   7:                     clip.vision.image_size u32              = 448
clip_model_load: - kv   8:                     clip.vision.patch_size u32              = 14
clip_model_load: - kv   9:               clip.vision.embedding_length u32              = 1152
clip_model_load: - kv  10:            clip.vision.feed_forward_length u32              = 4304
clip_model_load: - kv  11:                 clip.vision.projection_dim u32              = 0
clip_model_load: - kv  12:           clip.vision.attention.head_count u32              = 16
clip_model_load: - kv  13:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
clip_model_load: - kv  14:                    clip.vision.block_count u32              = 26
clip_model_load: - kv  15:                     clip.vision.image_mean arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv  16:                      clip.vision.image_std arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv  17:                              clip.use_gelu bool             = true
clip_model_load: - type  f32:  277 tensors
clip_model_load: - type  f16:  163 tensors
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla T4, compute capability 7.5, VMM: yes
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  1
clip_model_load: model size:     828.18 MB
clip_model_load: metadata size:  0.17 MB
clip_model_load: params backend buffer size =  828.18 MB (440 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 88.80 MB
llama_model_loader: loaded meta data with 23 key-value pairs and 363 tensors from ./MiniCPM-V-2-GGUF/MiniCPM-V-2.Q2_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = minicpm
llama_model_loader: - kv   1:                               general.name str              = MiniCPM
llama_model_loader: - kv   2:                     minicpm.context_length u32              = 4096
llama_model_loader: - kv   3:                   minicpm.embedding_length u32              = 2304
llama_model_loader: - kv   4:                        minicpm.block_count u32              = 40
llama_model_loader: - kv   5:                minicpm.feed_forward_length u32              = 5760
llama_model_loader: - kv   6:               minicpm.rope.dimension_count u32              = 64
llama_model_loader: - kv   7:               minicpm.attention.head_count u32              = 36
llama_model_loader: - kv   8:            minicpm.attention.head_count_kv u32              = 36
llama_model_loader: - kv   9:   minicpm.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 10
llama_model_loader: - kv  11:                        minicpm.tie_lm_head bool             = false
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,122753]  = ["<unk>", "<s>", "</s>", "<SEP>", "<C...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,122753]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,122753]  = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q2_K:  161 tensors
llama_model_loader: - type q3_K:   80 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq4_nl:   40 tensors
llm_load_vocab: mismatch in special tokens definition ( 3528/122753 vs 271/122753 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = minicpm
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 122753
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 2304
llm_load_print_meta: n_head           = 36
llm_load_print_meta: n_head_kv        = 36
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 2304
llm_load_print_meta: n_embd_v_gqa     = 2304
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 5760
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 2B
llm_load_print_meta: model ftype      = Q2_K - Medium
llm_load_print_meta: model params     = 3.01 B
llm_load_print_meta: model size       = 1.21 GiB (3.44 BPW) 
llm_load_print_meta: general.name     = MiniCPM
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 1099 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.37 MiB
llm_load_tensors: offloading 35 repeating layers to GPU
llm_load_tensors: offloaded 35/41 layers to GPU
llm_load_tensors:        CPU buffer size =  1234.38 MiB
llm_load_tensors:      CUDA0 buffer size =   809.07 MiB
.............................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   180.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =  1260.00 MiB
llama_new_context_with_model: KV self size  = 1440.00 MiB, K (f16):  720.00 MiB, V (f16):  720.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.47 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   465.51 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    17.01 MiB
llama_new_context_with_model: graph nodes  = 1368
llama_new_context_with_model: graph splits = 59
encode_image_with_clip: image embedding created: 64 tokens

encode_image_with_clip: image encoded in   394.83 ms by CLIP (    6.17 ms per image patch)
slice_image: multiple 1
<用户>这张图里有什么?
<AI>
%</br></h3><strong></h2></tr><br/><SEP>-《<li><SEP></h2><h3/>8--></h3>』</h1>?8<</br></strong>32<h4/>=-</h3>10</h2><tbody/>.<h3></img><h5>『7</h2><img/></h3><tr>2¥3<h4>『11</h5><CLS><li></img></h3>』《『<『%3<li>0<h2/>3!1<tr><h5>?</img><SEP></h1><h4/><CLS>=<h4>```.=<h3/>?2<!--%!-1<td>2㊣<p/><p/><SEP><!--¥1。。。</td>》。...<li>
<h5></h5><h2/>㊣?</img>:<</h5><h3>?.<strong></tr><tr></strong></tbody><h1>4-->?-【</li></tr><h3>12<li/>.</h2><SEP></h2>?<table><br></tbody><h2/><!DOCTYPE>¥9``````=</img></h5><b><h5/>.<li>¥』《</li>-4?<li>!%『<img/><br/></h1>!》<tr>..<table></br></h1><,!</h2></h4></tbody></li>。。。。。。。。。。。。<table/></br>-<li/>./.:<《</h2><h5>-<<h4/></li></strong></strong><br><h4/>--></h4>...<strong/>.<b/>--><tbody/><h4/>,?,『0<img/></strong>
9、<tr>-5
</h5><p/><h4/><h5><!DOCTYPE><table/>《6</h5></tr>

llama_print_timings:        load time =    2538.88 ms
llama_print_timings:      sample time =      32.09 ms /   256 runs   (    0.13 ms per token,  7978.56 tokens per second)
llama_print_timings: prompt eval time =    1714.47 ms /    80 tokens (   21.43 ms per token,    46.66 tokens per second)
llama_print_timings:        eval time =   19998.40 ms /   255 runs   (   78.43 ms per token,    12.75 tokens per second)
llama_print_timings:       total time =   22909.54 ms /   335 tokens

The binary I compiled is here: https://github.com/MZWNET/actions/releases/tag/llama_cpp-minicpm-v-6c1c4b4

Link to models I quantized: https://huggingface.co/mzwing/MiniCPM-V-2-GGUF

If you need it, the link to my Jupyter Notebook file is here: https://github.com/mzwing/AI-related/blob/master/notebooks/MiniCPM_V_2_GGUF.ipynb

So, it seems to be a GPU-related bug :(

@Achazwl
Copy link
Author

Achazwl commented May 6, 2024

So, it seems to be a GPU-related bug :(

So this may not be related to my PR? The correct output on CPU indicates that the model conversion is correct.

@Achazwl
Copy link
Author

Achazwl commented May 8, 2024

Is the bug happening on LLaVA?

@mzwing
Copy link

mzwing commented May 8, 2024

Is the bug happening on LLaVA?

Oh, now I find that the llava-cli built from this PR cannot even load the model. It fails with an unable to load model error.

So far I have only tested in a GPU environment. See the comment below.

So maybe that's the real cause? But the two errors seem too different.

Log:

> ./llava-cli -ngl 35 -m ./LLaVA-Llama-3-8B-Instruct-GGUF/llava-llama3-8b-Q4_K_M.gguf --mmproj ./LLaVA-Llama-3-8B-Instruct-GGUF/llava-llama3-mmproj-f16.gguf --temp 0.6 --top-p 0.8 --top-k 100 --repeat-penalty 1.0 --image ./mzwing.jpg -p "这张图里有什么?"

Log start
clip_model_load: model name:   openai/clip-vit-large-patch14-336
clip_model_load: description:  image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    377
clip_model_load: n_kv:         19
clip_model_load: ftype:        f16

clip_model_load: loaded meta data with 19 key-value pairs and 377 tensors from ./LLaVA-Llama-3-8B-Instruct-GGUF/llava-llama3-mmproj-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv   0:                       general.architecture str              = clip
clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
clip_model_load: - kv   3:                   clip.has_llava_projector bool             = true
clip_model_load: - kv   4:                          general.file_type u32              = 1
clip_model_load: - kv   5:                               general.name str              = openai/clip-vit-large-patch14-336
clip_model_load: - kv   6:                        general.description str              = image encoder for LLaVA
clip_model_load: - kv   7:                        clip.projector_type str              = mlp
clip_model_load: - kv   8:                     clip.vision.image_size u32              = 336
clip_model_load: - kv   9:                     clip.vision.patch_size u32              = 14
clip_model_load: - kv  10:               clip.vision.embedding_length u32              = 1024
clip_model_load: - kv  11:            clip.vision.feed_forward_length u32              = 4096
clip_model_load: - kv  12:                 clip.vision.projection_dim u32              = 768
clip_model_load: - kv  13:           clip.vision.attention.head_count u32              = 16
clip_model_load: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000010
clip_model_load: - kv  15:                    clip.vision.block_count u32              = 23
clip_model_load: - kv  16:                     clip.vision.image_mean arr[f32,3]       = [0.481455, 0.457828, 0.408211]
clip_model_load: - kv  17:                      clip.vision.image_std arr[f32,3]       = [0.268630, 0.261303, 0.275777]
clip_model_load: - kv  18:                              clip.use_gelu bool             = false
clip_model_load: - type  f32:  235 tensors
clip_model_load: - type  f16:  142 tensors
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla T4, compute capability 7.5, VMM: yes
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  1
clip_model_load: model size:     595.49 MB
clip_model_load: metadata size:  0.14 MB
clip_model_load: params backend buffer size =  595.49 MB (377 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 32.89 MB
llama_model_loader: loaded meta data with 23 key-value pairs and 291 tensors from ./LLaVA-Llama-3-8B-Instruct-GGUF/llava-llama3-8b-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = tmp
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128257
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 15
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,128257]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,128257]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128257]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 128256
llama_model_loader: - kv  21:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 257/128257 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128257
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.58 GiB (4.89 BPW) 
llm_load_print_meta: general.name     = tmp
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: PAD token        = 128256 '<pad>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_tensors: ggml ctx size =    0.30 MiB
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 291, got 290
llama_load_model_from_file: failed to load model
llava_init: error: unable to load model
main: error: failed to init llava

@mzwing
Copy link

mzwing commented May 8, 2024

So far I have only tested in a GPU environment.

The CPU behaves the same way.

So it's quite confusing. Maybe you should update your branch?

llava-cli in the original llama.cpp repo (master branch) works as expected, in both CPU and GPU environments.

@mofosyne mofosyne added enhancement New feature or request Review Complexity : High Generally require indepth knowledge of LLMs or GPUs labels May 9, 2024
@Achazwl
Copy link
Author

Achazwl commented May 12, 2024

So far I have only tested in a GPU environment.

The CPU behaves the same way.

So it's quite confusing. Maybe you should update your branch?

llava-cli in the original llama.cpp repo (master branch) works as expected, in both CPU and GPU environments.

LLaVA is fixed; the failure was a side effect of my code. The new version of my PR has far fewer modifications outside the minicpmv folder and therefore no longer affects other models.

The MiniCPM-V bug on GPU is rather hard to track down. I can reproduce the NaN issue on GPU, and here are my observations:

  1. The output of the ViT when processing the image matches the CPU version (which means the ViT part is correct).
  2. The output of the LLM when processing the prompt text matches the CPU version (which means the LLM's computation is correct on GPU).
  3. However, when the output of the ViT is fed into the LLM as its input, the LLM outputs NaN.
  4. I eventually found that as soon as the ViT output is fed into the text model it becomes NaN, which means it has already turned into NaN at the input embedding stage (input_embed), before any transformer block is computed.
  5. In the function where the ViT output is copied from the CPU to the GPU (the ggml_backend_cuda_buffer_set_tensor function in ggml-cuda.cu), I added debug code to copy input_embed back to the CPU. The data copied back is identical to the ViT output, with no NaN appearing. However, I can't figure out what else happens between the "ViT output" and the "LLM input"; the only thing I can find is the CPU->GPU copy. If the NaN does not come from this stage, where does it come from? (A sketch of this kind of debug check is shown after this list.)
  6. I also tried allocating a new buffer and copying the ViT output into it, to rule out an out-of-bounds access, but the result was still NaN.
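
Below is a minimal, self-contained sketch of the kind of NaN check described in point 5. It is not the actual patch: the helper name and the demo data are assumptions, and in practice the check would be applied to the host-side data right before, and the device data copied back right after, the upload performed by ggml_backend_cuda_buffer_set_tensor in ggml-cuda.cu.

```cpp
// Sketch only: count NaNs in a host-side f32 buffer, as one might do around the
// CPU->GPU upload of the ViT output (input_embed). Names below are illustrative.
#include <cmath>
#include <cstdio>
#include <cstring>
#include <vector>

// 'size' is in bytes, matching the convention of the backend set_tensor calls.
static size_t count_f32_nan(const void * data, size_t size, const char * tag) {
    const size_t n = size / sizeof(float);
    std::vector<float> tmp(n);
    std::memcpy(tmp.data(), data, n * sizeof(float)); // copy out to avoid alignment issues
    size_t nan_count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (std::isnan(tmp[i])) {
            ++nan_count;
        }
    }
    std::fprintf(stderr, "debug[%s]: %zu / %zu values are NaN\n", tag, nan_count, n);
    return nan_count;
}

int main() {
    // Stand-in for the 64-token ViT embedding (n_embd = 2304) about to be uploaded.
    std::vector<float> embd(64 * 2304, 0.5f);
    count_f32_nan(embd.data(), embd.size() * sizeof(float), "vit output before upload");
    // If the same check on data copied back from the GPU reports NaN while this one
    // does not, the corruption happens after the host->device copy.
    return 0;
}
```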

@mzwing
Copy link

mzwing commented May 13, 2024

@cmp-nct Hey cmp-nct, could you please help us resolve this confusing issue? Thanks a lot!

I apologize if this has caused you confusion.

@zkh2016 zkh2016 mentioned this pull request May 13, 2024
@sunzhe09
Copy link

Hello, I gave it a try; after quantization I found the results are much worse. Is there any solution?

@mofosyne
Copy link
Collaborator

Google Translation:

Hello, I tried it. After I quantified it, the effect was much worse. Is there any solution?

@Achazwl
Copy link
Author

Achazwl commented May 17, 2024

Hello, I gave it a try; after quantization I found the results are much worse. Is there any solution?

Do you quantize the LLM part or the ViT part?

@sunzhe09
Copy link

Hello, I gave it a try; after quantization I found the results are much worse. Is there any solution?

Do you quantize the LLM part or the ViT part?

./quantize ../MiniCPM-V-2/MiniCPM/ggml-model-f16.gguf ../MiniCPM-V-2/MiniCPM/ggml-model-Q4_K_M.gguf Q4_K_M (this is the command I used for quantization)

@Achazwl
Copy link
Author

Achazwl commented May 21, 2024

Hello, I gave it a try; after quantization I found the results are much worse. Is there any solution?

Do you quantize the LLM part or the ViT part?

./quantize ../MiniCPM-V-2/MiniCPM/ggml-model-f16.gguf ../MiniCPM-V-2/MiniCPM/ggml-model-Q4_K_M.gguf Q4_K_M (this is the command I used for quantization)

This command quantizes the LLM part; the ViT part is not quantized.

@naifmeh
Copy link

naifmeh commented May 23, 2024

Hi, does this PR support converting the new MiniCPM V2.5?
I've tried running the convert-hf-to-gguf.py script but I'm getting this error:

$  python ./convert-hf-to-gguf.py --outtype f16 --outfile ./MiniCPM25-GGUF/minicpm25_f16.gguf ./models/Minicpm25

Loading model: Minicpm25
Traceback (most recent call last):
  File "/home/powerapi/Documents/nmehanna/llama.cpp/./convert-hf-to-gguf.py", line 2807, in <module>
    main()
  File "/home/powerapi/Documents/nmehanna/llama.cpp/./convert-hf-to-gguf.py", line 2787, in main
    model_class = Model.from_model_architecture(hparams["architectures"][0])
  File "/home/powerapi/Documents/nmehanna/llama.cpp/./convert-hf-to-gguf.py", line 216, in from_model_architecture
    raise NotImplementedError(f'Architecture {arch!r} not supported!') from None
NotImplementedError: Architecture 'MiniCPMV' not supported!

Thank you.

@github-actions github-actions bot added examples python python script changes labels May 23, 2024
@tc-mb tc-mb mentioned this pull request May 23, 2024
@tc-mb
Copy link
Contributor

tc-mb commented May 23, 2024

Hi, does this PR support converting the new MiniCPM V2.5? I've tried running the convert-hf-to-gguf.py script but I'm getting this error:

$  python ./convert-hf-to-gguf.py --outtype f16 --outfile ./MiniCPM25-GGUF/minicpm25_f16.gguf ./models/Minicpm25

Loading model: Minicpm25
Traceback (most recent call last):
  File "/home/powerapi/Documents/nmehanna/llama.cpp/./convert-hf-to-gguf.py", line 2807, in <module>
    main()
  File "/home/powerapi/Documents/nmehanna/llama.cpp/./convert-hf-to-gguf.py", line 2787, in main
    model_class = Model.from_model_architecture(hparams["architectures"][0])
  File "/home/powerapi/Documents/nmehanna/llama.cpp/./convert-hf-to-gguf.py", line 216, in from_model_architecture
    raise NotImplementedError(f'Architecture {arch!r} not supported!') from None
NotImplementedError: Architecture 'MiniCPMV' not supported!

Thank you.

This PR does not support MiniCPM-V 2.5 yet.
We will submit a new PR to support MiniCPM-V 2.5 ASAP.

@cmp-nct
Copy link
Contributor

cmp-nct commented May 27, 2024

Hi guys,
I'm sorry, I've been busy with RL for a while and I'll be on vacation from tomorrow, so I won't be able to look into this for now.
I tested MiniCPM 2.5 just now and the results, at first glance, were stunning: llava-1.6-level results in my very limited tests.

I found an ollama variant of llama.cpp here that supports it: https://github.com/OpenBMB/llama.cpp/tree/minicpm-v2.5/examples/minicpmv
So maybe the work has already been done and just needs to be added?

@tc-mb
Copy link
Contributor

tc-mb commented May 27, 2024

Hi guys, I'm sorry, I've been busy with RL for a while and I'll be on vacation from tomorrow, so I won't be able to look into this for now. I tested MiniCPM 2.5 just now and the results, at first glance, were stunning: llava-1.6-level results in my very limited tests.

I found an ollama variant of llama.cpp here that supports it: https://github.com/OpenBMB/llama.cpp/tree/minicpm-v2.5/examples/minicpmv So maybe the work has already been done and just needs to be added?

I'm a member of the MiniCPM-V team.
There are some differences between MiniCPM-V 2.5 and MiniCPM-V 2.0.
We temporarily maintain a fork to support MiniCPM-V 2.5.
We will integrate v2.5 and v2.0 in the next few days and submit a new PR to the official llama.cpp.

@ggerganov
Copy link
Owner

The code is based on examples/llava but the vision part is quite different.

It's unlikely that these PRs will be merged since they duplicate the entire LLaVA / CLIP codebase. Try to find ways to fit the implementation into the existing code and reuse/extend the existing API.

@ggerganov ggerganov added the demo Demonstrate some concept or idea, not intended to be merged label May 27, 2024
@Galunid Galunid mentioned this pull request May 29, 2024
@howardgriffin
Copy link

When I do the first step, 'make', I get this error:
examples/minicpmv/minicpmv-cli.cpp:136:108: error: 'class std::vector<std::__cxx11::basic_string<char> >' has no member named 'c_str'
  136 | th_filename(ctx_llava->ctx_clip, params->n_threads, params->image.c_str());
I'm a newbie, any suggestions?
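
For reference, the error means that params->image is a std::vector<std::string>, which has no c_str() member; c_str() exists on each std::string element instead. A minimal illustration under that assumption follows (the struct and field names are hypothetical stand-ins, not the PR's actual code):

```cpp
// Minimal illustration of the compile error above, assuming params->image is a
// std::vector<std::string> holding image paths (names here are hypothetical).
#include <string>
#include <vector>

struct gpt_params_like {
    std::vector<std::string> image;   // multiple image paths
};

int main() {
    gpt_params_like params;
    params.image.push_back("mzwing.jpg");

    // params.image.c_str();                        // error: std::vector has no member 'c_str'
    const char * path = params.image[0].c_str();    // call c_str() on an element instead
    (void) path;
    return 0;
}
```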
