server : fix handling of the ignore_eos flag #14710
Conversation
The changes seem correct to me, and I can confirm that I am now getting the expected number of tokens when setting `ignore_eos`.
for (llama_token i = 0; i < llama_vocab_n_tokens(vocab); i++) {
    if (llama_vocab_is_eog(vocab, i)) {
        //SRV_DBG("%s: added %s logit bias = %f\n", __func__, common_token_to_piece(ctx, i).c_str(), -INFINITY);
        params.sampling.logit_bias.push_back({i, -INFINITY});
    }
}
If this is done for every token during generation, I suspect it is going to have a significant performance impact.
It's done once per completion request, at the beginning, when the input JSON parameters are processed.
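For reference, the flag is set per request. A minimal request that enables it (assuming the server's default port and the standard `/completion` endpoint) would look like:

```sh
curl http://localhost:8080/completion -d '{
  "prompt": "Once upon a time",
  "n_predict": 64,
  "ignore_eos": true
}'
```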
If performance is a concern we could maybe provide a list of EoG tokens to iterate over instead of iterating over all tokens and checking whether each one is EoG. Although I think iterating over all tokens once per request is going to be negligible vs. iterating over all tokens once per generated token as is being done for sampling.
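For illustration, a minimal sketch of that caching idea. The helper names (`collect_eog_tokens`, `apply_ignore_eos`) and the global cache are hypothetical, and this is not necessarily how the follow-up PR implements it; only `llama_vocab_n_tokens`, `llama_vocab_is_eog`, and `llama_logit_bias` come from the actual API:

```cpp
#include <cmath>    // INFINITY
#include <vector>

#include "llama.h"  // llama_token, llama_logit_bias, llama_vocab_*

// Hypothetical cache, filled once when the model is loaded.
static std::vector<llama_token> eog_tokens;

static void collect_eog_tokens(const llama_vocab * vocab) {
    for (llama_token i = 0; i < llama_vocab_n_tokens(vocab); i++) {
        if (llama_vocab_is_eog(vocab, i)) {
            eog_tokens.push_back(i);
        }
    }
}

// Per request: O(number of EoG tokens) instead of O(vocab size).
static void apply_ignore_eos(std::vector<llama_logit_bias> & logit_bias) {
    for (llama_token tok : eog_tokens) {
        logit_bias.push_back({tok, -INFINITY});
    }
}
```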
On my system this takes about 0.3 ms for a 150k-vocab model, so I suppose it is not that bad.
I fixed this anyway: #14721
fix #14695 (comment)

When `ignore_eos` (`slot_params.ignore_eos`) is passed in the request, we ignore all EOG tokens by adding a -INF logit bias to the sampler.
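For background on why a -INF bias is sufficient: the bias is added to the token's logit before softmax, and exp(-inf) is 0, so the biased token's sampling probability becomes exactly zero. A self-contained toy example of the mechanism (not the server's actual sampling code):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // Toy vocab of 3 tokens; pretend token 2 is an EoG token.
    std::vector<float> logits = {2.0f, 1.0f, 0.5f};
    logits[2] += -INFINITY; // apply the -INF logit bias

    // Standard max-subtracted softmax.
    float max_logit = *std::max_element(logits.begin(), logits.end());
    float sum = 0.0f;
    std::vector<float> probs(logits.size());
    for (size_t i = 0; i < logits.size(); i++) {
        probs[i] = std::exp(logits[i] - max_logit); // exp(-inf) == 0
        sum += probs[i];
    }
    for (size_t i = 0; i < logits.size(); i++) {
        std::printf("token %zu: p = %.4f\n", i, probs[i] / sum); // token 2 prints 0.0000
    }
    return 0;
}
```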