I want a dataset of common n-grams and their log likelihoods. Normally I would download the Google Books Ngram Exports, but I wonder if I can generate a better dataset using a large language model. For example, this script uses llama_cpp.Llama.create_completion to find likely 3-grams starting with "welcome to":
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama.from_pretrained(
    repo_id="unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF",
    filename="DeepSeek-R1-Distill-Qwen-1.5B-Q2_K.gguf",
    logits_all=True,  # required for create_completion to return logprobs
)
print(
    llm.create_completion(
        "welcome to",
        max_tokens=1,
        logprobs=10,
    )["choices"][0]["logprobs"]["top_logprobs"][0],
)
Output:
{' the': np.float32(-0.18572943), ' this': np.float32(-3.444591), ' our': np.float32(-4.0559974), ' python': np.float32(-4.3010955), ' a': np.float32(-4.571982), ' bc': np.float32(-5.036485), ' module': np.float32(-5.4879394), ' week': np.float32(-5.7402453), ' all': np.float32(-6.2308974), ' thread': np.float32(-6.272795)}
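For context, here's a rough sketch of how I'd turn these continuations into full 3-gram entries. The logprobs above are conditional on the prefix, so a complete dataset would also need the prefix's own log likelihood added in (log P(w1 w2 w3) = log P(w1 w2) + log P(w3 | w1 w2)); this snippet skips that step:

# Pair each candidate third token with its conditional logprob.
prefix = "welcome to"
result = llm.create_completion(prefix, max_tokens=1, logprobs=10)
top = result["choices"][0]["logprobs"]["top_logprobs"][0]
for token, logprob in top.items():
    print(f"{float(logprob):8.3f}  {prefix + token!r}")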
One issue is that the prompt gets prefixed with a BOS token, so I only get n-grams that appear at the beginning of a sentence. I can fix this by passing a list of tokens instead of a string, along these lines (a sketch; Llama.tokenize takes an add_bos flag):
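# Tokenize without the BOS token and pass the raw token list as the prompt.
tokens = llm.tokenize(b"welcome to", add_bos=False)
print(
    llm.create_completion(
        tokens,
        max_tokens=1,
        logprobs=10,
    )["choices"][0]["logprobs"]["top_logprobs"][0],
)

But this leads to a problem when generating 1-grams, where the token list is empty: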
print(
    llm.create_completion(
        [],
        max_tokens=1,
        logprobs=10,
    )["choices"][0]["logprobs"]["top_logprobs"][0],
)
AssertionError at llama.py line 788: assert self.n_tokens > 0
Apparently llama.cpp is unable to generate text when no context is provided. I confirmed this by using llama.cpp directly, without the Python wrapper. My question: is this an arbitrary limitation of the library, or a fundamental limitation of the language model?
I found a possible clue in the model's config.json file:
"initializer_range": 0.02,
The documentation says that initializer_range is "The standard deviation of the truncated_normal_initializer for initializing all weight matrices."
I imagine that the model has a hidden state vector which is initialized with random values sampled from a normal distribution, and that these values get updated as context is added. I wonder if it's possible to sample from the model in this initial random state and get a list of the most common words by sampling multiple times with different random seeds. In other words, I'd like something like the following to work (hypothetical, since the empty prompt trips the assertion above; create_completion does take a seed parameter):
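from collections import Counter

# Hypothetical: sample one token from the model's initial state under many
# different seeds and count the results. Today this fails with the same
# AssertionError, because the context is empty.
counts = Counter()
for seed in range(1000):
    out = llm.create_completion([], max_tokens=1, seed=seed)
    counts[out["choices"][0]["text"]] += 1
print(counts.most_common(10))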