
Inferencing Flan-T5 - GGML_ASSERT error #2038

@railesDev

Description


Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

I am loading a GGUF model and running inference in one of two ways:

# Case 1: plain high-level call
response = self.llm(prompt, max_tokens=t, echo=True)

# Case 2: tokenize, try to run the encoder manually, then generate
tokens = self.llm.tokenize(prompt.encode('utf-8'))
self.llm._model.encode(tokens)
response = self.llm(
    prompt,
    max_tokens=10,
    stop=["</s>", "\n"],
    echo=False,
)
generated_text = response['choices'][0]['text'].strip()

Either call should return a generated response.
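
For reference, a successful call returns an OpenAI-style completion dict. A minimal sketch of the shape I expect back (all values below are illustrative, not real output):

expected = {
    "object": "text_completion",
    "choices": [
        {"index": 0, "text": "hello world", "logprobs": None, "finish_reason": "stop"}
    ],
    "usage": {"prompt_tokens": 8, "completion_tokens": 2, "total_tokens": 10},
}
print(expected["choices"][0]["text"].strip())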

Current Behavior

llama-cpp-python raises

GGML_ASSERT(n_outputs_enc > 0 && "call llama_encode() first")

in the first case and

'LlamaModel' object has no attribute 'encode'

in the second case, even though the GGUF model itself loads successfully. No response is ever returned, only errors.
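
For context: T5 is an encoder-decoder architecture, and the assert message itself says the prompt must go through llama_encode() before any llama_decode() call. The high-level Llama wrapper apparently only issues decode calls, and the internal LlamaModel wrapper has no encode method (in llama.cpp, encoding is a context-level operation, not a model-level one), which explains both errors. Below is a hedged sketch of what I believe the missing encoder pass would look like through the low-level ctypes bindings that llama-cpp-python ships; the names mirror llama.h, but exact signatures may differ between versions, and the file path is just an example:

import llama_cpp

llama_cpp.llama_backend_init()

mparams = llama_cpp.llama_model_default_params()
model = llama_cpp.llama_load_model_from_file(b"flan-t5-base-q4_k_m.gguf", mparams)
cparams = llama_cpp.llama_context_default_params()
ctx = llama_cpp.llama_new_context_with_model(model, cparams)

# Tokenize the prompt with the model's own tokenizer.
text = b'Tell me "hello world"!'
buf = (llama_cpp.llama_token * 512)()
n_tokens = llama_cpp.llama_tokenize(model, text, len(text), buf, 512, True, False)

# Put the prompt into a batch and run the *encoder* pass first;
# this is the step the assert says is missing.
batch = llama_cpp.llama_batch_init(n_tokens, 0, 1)
batch.n_tokens = n_tokens
for i in range(n_tokens):
    batch.token[i] = buf[i]
    batch.pos[i] = i
    batch.n_seq_id[i] = 1
    batch.seq_id[i][0] = 0
    batch.logits[i] = False
llama_cpp.llama_encode(ctx, batch)  # should satisfy GGML_ASSERT(n_outputs_enc > 0)

# Generation would then feed the decoder start token through
# llama_cpp.llama_decode(ctx, ...) step by step (omitted here).
llama_cpp.llama_batch_free(batch)

Even if something like this works at the low level, the point of this report is that the high-level API cannot be used with T5 models at all.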

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

  • macOS, Apple M3; Metal disabled (CPU-only) so the issue reproduces on other machines

$ uname -rvm  # macOS has no lscpu

23.6.0 Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:30 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6030 arm64

  • SDK versions:
Python 3.10.8
GNU Make 3.81
Apple clang version 16.0.0 (clang-1600.0.26.6)

Failure Information (for bugs)

GGML_ASSERT(n_outputs_enc > 0 && "call llama_encode() first")
'LlamaModel' object has no attribute 'encode'

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

  1. Install any version of the library, even 0.3.12.
  2. Launch a script that initializes a GGUF model, for example https://huggingface.co/fareshzm/flan-t5-base-Q4_K_M-GGUF, and try to run inference (a minimal script is sketched below).
  3. Get the error and become upset that you can't run inference on your model.
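
A minimal self-contained reproduction, assuming the GGUF file from the link above has been downloaded locally (the filename is an assumption):

from llama_cpp import Llama

llm = Llama(model_path="./flan-t5-base-q4_k_m.gguf", n_ctx=256)
# The process aborts here with the GGML_ASSERT instead of returning:
response = llm('Tell me "hello world"!', max_tokens=10)
print(response["choices"][0]["text"])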

The issue occurs only with llama-cpp-python; with llama.cpp itself there is no problem.
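
For comparison, the same GGUF runs through llama.cpp's own CLI (the binary and flag names below are from memory and may vary by build):

$ ./llama-cli -m flan-t5-base-q4_k_m.gguf -p 'Tell me "hello world"!'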

Failure Logs

...
11:44:48 AM [ai-service] AI Service: llama_new_context_with_model: n_ctx      = 256
llama_new_context_with_model: n_batch    = 256
llama_new_context_with_model: n_ubatch   = 256
llama_new_context_with_model: flash_attn = 0
11:44:48 AM [ai-service] AI Service: llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
11:44:48 AM [ai-service] AI Service: llama_kv_cache_init:        CPU KV buffer size =     9.00 MiB
llama_new_context_with_model: KV self size  =    9.00 MiB, K (f16):    4.50 MiB, V (f16):    4.50 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
11:44:48 AM [ai-service] AI Service: llama_new_context_with_model:        CPU compute buffer size =    13.25 MiB
llama_new_context_with_model: graph nodes  = 425
llama_new_context_with_model: graph splits = 193
11:44:48 AM [ai-service] AI Service: AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
11:44:48 AM [ai-service] AI Service: Model metadata: {'tokenizer.ggml.add_bos_token': 'false', 'tokenizer.ggml.unknown_token_id': '2', 'tokenizer.ggml.pre': 'default', 'tokenizer.ggml.model': 't5', 'tokenizer.ggml.eos_token_id': '1', 'general.architecture': 't5', 'tokenizer.ggml.add_space_prefix': 'true', 't5.attention.layer_norm_rms_epsilon': '0.000001', 'tokenizer.ggml.add_eos_token': 'true', 't5.attention.relative_buckets_count': '32', 't5.attention.layer_norm_epsilon': '0.000001', 'tokenizer.ggml.padding_token_id': '0', 't5.attention.value_length': '64', 't5.block_count': '12', 't5.attention.head_count': '12', 'general.file_type': '15', 't5.embedding_length': '768', 'general.size_label': '248M', 't5.context_length': '512', 'general.quantization_version': '2', 'general.license': 'apache-2.0', 't5.attention.key_length': '64', 't5.feed_forward_length': '2048', 't5.decoder_start_token_id': '0', 'general.type': 'model', 'tokenizer.ggml.remove_extra_whitespaces': 'true', 'general.name': 'Flan T5 Base'}
11:44:48 AM [ai-service] AI Service: Using fallback chat format: llama-2
INFO:creative_connections:✅ FLAN-T5 model loaded successfully!
11:44:48 AM [ai-service] AI Service: DEBUG:creative_connections:LLM object after loading: <llama_cpp.llama.Llama object at 0x14fae9f60>
DEBUG:creative_connections:LLM state after _ensure_llm_loaded: <llama_cpp.llama.Llama object at 0x14fae9f60>
DEBUG:creative_connections:Checking LLM state after _ensure_llm_loaded: <llama_cpp.llama.Llama object at 0x14fae9f60>
INFO:creative_connections:Attempting FLAN-T5 analysis...
DEBUG:creative_connections:→ Entered _ensure_llm_loaded()
11:44:48 AM [ai-service] AI Service: DEBUG:creative_connections:LLM already initialized: self.llm = <llama_cpp.llama.Llama object at 0x14fae9f60>
INFO:creative_connections:Calling FLAN-T5 with prompt: Tell me "hello world"!
11:44:48 AM [ai-service] AI Service: /private/var/folders/hl/zs0vv1rj0pb3jp28__qcvkn80000gq/T/pip-install-gng95xkt/llama-cpp-python_7fb0bb26e01a4212ae72ffffb70d14e9/vendor/llama.cpp/src/llama.cpp:15122: GGML_ASSERT(n_outputs_enc > 0 && "call llama_encode() first") failed
