llama_cpp/server/app.py: 59 additions & 11 deletions
@@ -71,22 +71,70 @@ def get_llama():
     )
 
 class CreateCompletionRequest(BaseModel):
-    prompt: Union[str, List[str]]
-    suffix: Optional[str] = Field(None)
-    max_tokens: int = 16
-    temperature: float = 0.8
-    top_p: float = 0.95
-    echo: bool = False
-    stop: Optional[List[str]] = []
-    stream: bool = False
-    logprobs: Optional[int] = Field(None)
+    prompt: Union[str, List[str]] = Field(
+        default="",
+        description="The prompt to generate completions for."
+    )
+    suffix: Optional[str] = Field(
+        default=None,
+        description="A suffix to append to the generated text. If None, no suffix is appended. Useful for chatbots."
+    )
+    max_tokens: int = Field(
+        default=16,
+        ge=1,
+        le=2048,
+        description="The maximum number of tokens to generate."
+    )
+    temperature: float = Field(
+        default=0.8,
+        ge=0.0,
+        le=2.0,
+        description="Adjust the randomness of the generated text.\n\n" +
+        "Temperature is a hyperparameter that controls the randomness of the generated text. It affects the probability distribution of the model's output tokens. A higher temperature (e.g., 1.5) makes the output more random and creative, while a lower temperature (e.g., 0.5) makes the output more focused, deterministic, and conservative. The default value is 0.8, which provides a balance between randomness and determinism. At the extreme, a temperature of 0 will always pick the most likely next token, leading to identical outputs in each run."
+    )
+    top_p: float = Field(
+        default=0.95,
+        ge=0.0,
+        le=1.0,
+        description="Limit the next token selection to a subset of tokens with a cumulative probability above a threshold P.\n\n" +
+        "Top-p sampling, also known as nucleus sampling, is another text generation method that selects the next token from a subset of tokens that together have a cumulative probability of at least p. This method provides a balance between diversity and quality by considering both the probabilities of tokens and the number of tokens to sample from. A higher value for top_p (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text."
+    )
+    echo: bool = Field(
+        default=False,
+        description="Whether to echo the prompt in the generated text. Useful for chatbots."
+    )
+    stop: Optional[List[str]] = Field(
+        default=None,
+        description="A list of tokens at which to stop generation. If None, no stop tokens are used."
+    )
+    stream: bool = Field(
+        default=False,
+        description="Whether to stream the results as they are generated. Useful for chatbots."
+    )
+    logprobs: Optional[int] = Field(
+        default=None,
+        ge=0,
+        description="The number of logprobs to generate. If None, no logprobs are generated."
+    )
+
+
 
     # ignored, but marked as required for the sake of compatibility with openai's api
     model: str = model_field
 
     # llama.cpp specific parameters
-    top_k: int = 40
-    repeat_penalty: float = 1.1
+    top_k: int = Field(
+        default=40,
+        ge=0,
+        description="Limit the next token selection to the K most probable tokens.\n\n" +
+        "Top-k sampling is a text generation method that selects the next token only from the top k most likely tokens predicted by the model. It helps reduce the risk of generating low-probability or nonsensical tokens, but it may also limit the diversity of the output. A higher value for top_k (e.g., 100) will consider more tokens and lead to more diverse text, while a lower value (e.g., 10) will focus on the most probable tokens and generate more conservative text."
+    )
+    repeat_penalty: float = Field(
+        default=1.0,
+        ge=0.0,
+        description="A penalty applied to each token that is already generated. This helps prevent the model from repeating itself.\n\n" +
+        "Repeat penalty is a hyperparameter used to penalize the repetition of token sequences during text generation. It helps prevent the model from generating repetitive or monotonous text. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient."
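The practical effect of moving these attributes onto Field(...) is that Pydantic enforces the ge/le bounds before a request ever reaches llama.cpp, and FastAPI picks up the description strings when it generates the OpenAPI schema shown at /docs. Below is a minimal sketch of that validation behavior using a trimmed-down, hypothetical copy of the model (not the full class from this PR):

from typing import Union, List
from pydantic import BaseModel, Field, ValidationError

# Trimmed-down illustration of the request model, keeping only a few fields.
class ExampleCompletionRequest(BaseModel):
    prompt: Union[str, List[str]] = Field(
        default="", description="The prompt to generate completions for."
    )
    max_tokens: int = Field(default=16, ge=1, le=2048)
    temperature: float = Field(default=0.8, ge=0.0, le=2.0)

# Defaults are filled in for omitted fields.
req = ExampleCompletionRequest(prompt="Hello")
print(req.max_tokens)  # 16

# An out-of-range value is rejected at parse time, before reaching llama.cpp.
try:
    ExampleCompletionRequest(prompt="Hello", max_tokens=0)
except ValidationError as err:
    print(err)  # e.g. "ensure this value is greater than or equal to 1" (Pydantic v1 wording)

Since FastAPI builds its request documentation from these same models, the description strings become the parameter help text in the interactive API docs, which appears to be the main motivation for the change.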