I want a dataset of common n-grams and their log likelihoods. Normally I would download the Google Books Ngram Exports, but I wonder if I can generate a better dataset using a large language model. For example, this script uses llama_cpp.Llama.create_completion to find likely 3-grams starting with "welcome to":
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama.from_pretrained(
    repo_id="unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF",
    filename="DeepSeek-R1-Distill-Qwen-1.5B-Q2_K.gguf",
    logits_all=True,  # required for create_completion to return logprobs
)
print(
    llm.create_completion(
        "welcome to",
        max_tokens=1,
        logprobs=10,
    )["choices"][0]["logprobs"]["top_logprobs"][0],
)
Output:
{' the': np.float32(-0.18572943), ' this': np.float32(-3.444591), ' our': np.float32(-4.0559974), ' python': np.float32(-4.3010955), ' a': np.float32(-4.571982), ' bc': np.float32(-5.036485), ' module': np.float32(-5.4879394), ' week': np.float32(-5.7402453), ' all': np.float32(-6.2308974), ' thread': np.float32(-6.272795)}
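For context, here's a rough sketch of how I'd turn these continuations into full 3-gram entries. The logprobs above are conditional on the prefix, so a complete dataset would also need the prefix's own log likelihood added in (log P(w1 w2 w3) = log P(w1 w2) + log P(w3 | w1 w2)); this snippet skips that step:

# Pair each candidate third token with its conditional logprob.
prefix = "welcome to"
result = llm.create_completion(prefix, max_tokens=1, logprobs=10)
top = result["choices"][0]["logprobs"]["top_logprobs"][0]
for token, logprob in top.items():
    print(f"{float(logprob):8.3f}  {prefix + token!r}")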
One issue is that the prompt gets prefixed with a BOS token, so I only get n-grams that appear at the beginning of a sentence. I can fix this by passing a list of tokens instead of a string, along these lines (a sketch; Llama.tokenize takes an add_bos flag):
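# Tokenize without the BOS token and pass the raw token list as the prompt.
tokens = llm.tokenize(b"welcome to", add_bos=False)
print(
    llm.create_completion(
        tokens,
        max_tokens=1,
        logprobs=10,
    )["choices"][0]["logprobs"]["top_logprobs"][0],
)

But this leads to a problem when generating 1-grams, where the token list is empty: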
print(
    llm.create_completion(
        [],
        max_tokens=1,
        logprobs=10,
    )["choices"][0]["logprobs"]["top_logprobs"][0],
)
AssertionError at llama.py line 788: assert self.n_tokens > 0
Apparently llama.cpp is unable to generate text when no context is provided. I confirmed this by using llama.cpp directly, without the Python wrapper. My question: is this an arbitrary limitation of the library, or a fundamental limitation of the language model?
I found a possible clue in the model's config.json file:
"initializer_range": 0.02,
The documentation says that initializer_range is "The standard deviation of the truncated_normal_initializer for initializing all weight matrices."
I imagine that the model has a hidden state vector which is initialized with random values sampled from a normal distribution, and that these values get updated as context is added. I wonder if it's possible to sample from the model in this initial random state and get a list of the most common words by sampling multiple times with different random seeds. In other words, I'd like something like the following to work (hypothetical, since the empty prompt trips the assertion above; create_completion does take a seed parameter):
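from collections import Counter

# Hypothetical: sample one token from the model's initial state under many
# different seeds and count the results. Today this fails with the same
# AssertionError, because the context is empty.
counts = Counter()
for seed in range(1000):
    out = llm.create_completion([], max_tokens=1, seed=seed)
    counts[out["choices"][0]["text"]] += 1
print(counts.most_common(10))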