I am trying to run the mlx-community/gemma-3-4b-it-4bit model with the mlx-vlm library to do multi-image inference, but I get the traceback below and cannot figure out how to solve it. Both single-image and multi-image inference fail with the same error.
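For context, here is a minimal sketch of how I call the model, simplified from my phi3_mlx.py wrapper. The image path and prompt are placeholders; the helper usage follows the mlx-vlm README, so exact signatures may vary with your installed version.

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
from PIL import Image

# Load the quantized Gemma 3 model together with its processor.
model_path = "mlx-community/gemma-3-4b-it-4bit"
model, processor = load(model_path)
config = load_config(model_path)

# Placeholder inputs; my real script loads camera frames from the robot.
images = [Image.open("frame_0.png")]
prompt = "Describe what you see in the image."

# Build a chat-formatted prompt with the right number of image tokens,
# then generate. This generate call is what raises the TypeError.
formatted_prompt = apply_chat_template(processor, config, prompt, num_images=len(images))
prediction = generate(model, processor, formatted_prompt, images, verbose=False)

Running this produces the following traceback: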
Traceback (most recent call last):
  File "/Users/Administrator/Documents/create/controllers/VLM_on_Robotics/main.py", line 52, in <module>
    prediction, comp_time = phi3.generate(prompt, [images_PIL[0]])
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Administrator/Documents/create/controllers/VLM_on_Robotics/Llava_Phi3/phi3_mlx.py", line 60, in generate
    prediction = generate(self.model, self.processor, formatted_prompt, images, verbose=False)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Administrator/Documents/create/venv/lib/python3.11/site-packages/mlx_vlm/utils.py", line 1117, in generate
    for response in stream_generate(model, processor, prompt, image, **kwargs):
  File "/Users/Administrator/Documents/create/venv/lib/python3.11/site-packages/mlx_vlm/utils.py", line 1018, in stream_generate
    inputs = prepare_inputs(
             ^^^^^^^^^^^^^^^
  File "/Users/Administrator/Documents/create/venv/lib/python3.11/site-packages/mlx_vlm/utils.py", line 814, in prepare_inputs
    inputs = processor(
             ^^^^^^^^^^
  File "/Users/Administrator/Documents/create/venv/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2877, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Administrator/Documents/create/venv/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2987, in _call_one
    return self.encode_plus(
           ^^^^^^^^^^^^^^^^^
  File "/Users/Administrator/Documents/create/venv/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3063, in encode_plus
    return self._encode_plus(
           ^^^^^^^^^^^^^^^^^^
  File "/Users/Administrator/Documents/create/venv/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 613, in _encode_plus
    batched_output = self._batch_encode_plus(
                     ^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: PreTrainedTokenizerFast._batch_encode_plus() got an unexpected keyword argument 'images'
Since the traceback complains about an unexpected keyword argument, I tried removing the images argument and running text-only inference, and the script works. Does this mean there is a bug in how Gemma 3 is implemented, such that vision tasks are not supported?
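For completeness, this is a sketch of the text-only variant that does run, i.e. the same call with the images argument dropped:

# Text-only sketch: same wrapper call with the images argument removed.
# This runs without error, so only the vision path seems affected.
formatted_prompt = apply_chat_template(processor, config, prompt, num_images=0)
prediction = generate(model, processor, formatted_prompt, verbose=False)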
- Please provide enough code so others can better understand or reproduce the problem. – Community Bot
1 Answer
Problem solved: the issue is related to the transformers dependency.
See: https://github.com/Blaizzy/mlx-vlm/issues/274
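The symptom appears to be that AutoProcessor falls back to a plain tokenizer when the installed transformers version does not know the Gemma 3 processor, so the tokenizer receives the images keyword it cannot handle. A quick diagnostic sketch (model id taken from the question; the interpretation follows the linked issue):

# Diagnostic sketch: check whether transformers returns a full multimodal
# processor or just a tokenizer for this model.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("mlx-community/gemma-3-4b-it-4bit")
print(type(processor).__name__)
# If this prints a tokenizer class rather than a processor class, the
# installed transformers predates Gemma 3 support; upgrading transformers
# (pip install -U transformers) should resolve the TypeError.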