This user guide explains how to run inference with text generation models using DeepSparse.
DeepSparse support for LLMs is available in DeepSparse's nightly build on PyPI:
pip install -U deepsparse-nightly[llm]
- Hardware: x86 AVX2, AVX-512, AVX-512 VNNI, and ARM v8.2+.
- Operating System: Linux (MacOS will be supported soon)
- Python: v3.8-3.11
For those using MacOS or Windows, we suggest using Linux containers with Docker to run DeepSparse.
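To confirm that the nightly build installed correctly and to see which instruction set DeepSparse detects on your machine, you can run a quick check like the sketch below (it assumes the `deepsparse.cpu.cpu_architecture()` helper is available in your DeepSparse build):
# minimal post-install sanity check (sketch; assumes deepsparse.cpu exposes
# a cpu_architecture() helper for hardware introspection)
import deepsparse
from deepsparse.cpu import cpu_architecture
print(deepsparse.__version__)
print(cpu_architecture())  # reports the detected ISA (e.g., AVX2, AVX-512, VNNI)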
DeepSparse exposes a Pipeline interface called `TextGeneration`, which is used to construct a pipeline and generate text:
from deepsparse import TextGeneration
# construct a pipeline
model_path = "zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized"
pipeline = TextGeneration(model=model_path)
# generate text
prompt = "Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: What is Kubernetes? ### Response:"
output = pipeline(prompt=prompt)
print(output.generations[0].text)
# >> Kubernetes is an open-source container orchestration system for automating the deployment, scaling, and management of containerized applications.
Note: The 7B model takes about 2 minutes to compile. Set `model_path = "hf:mgoin/TinyStories-33M-quant-deepsparse"` to use a small TinyStories model for quick compilation if you are just experimenting.
DeepSparse accepts models in ONNX format, passed either as SparseZoo stubs or local directories.
Note: DeepSparse uses ONNX graphs modified for KV-caching. We will publish specs to enable external users to create LLM ONNX graphs for DeepSparse over the next few weeks. At present, we suggest only using LLM ONNX graphs created by Neural Magic.
SparseZoo stubs identify a model in SparseZoo. For instance, `zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized` identifies a 50% pruned-quantized pre-trained MPT-7b model fine-tuned on the Dolly dataset. We can pass the stub to `TextGeneration`, which downloads and caches the ONNX file.
model_path = "zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized"
pipeline = TextGeneration(model=model_path)
Additionally, we can pass a local path to a deployment directory. Use the SparseZoo API to download an example deployment directory:
from sparsezoo import Model
sz_model = Model("zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized", "./local-model")
sz_model.deployment.download()
Looking at the deployment directory, we see it contains the HF configs and ONNX model files:
ls ./local-model/deployment
>> config.json model.onnx tokenizer.json model.data special_tokens_map.json tokenizer_config.json
We can pass the local directory path to `TextGeneration`:
from deepsparse import TextGeneration
pipeline = TextGeneration(model="./local-model/deployment")
Hugging Face models that conform to the directory structure listed above can also be run with DeepSparse by prepending `hf:` to a model id. The following runs a 60% pruned-quantized MPT-7b model fine-tuned on GSM8K.
from deepsparse import TextGeneration
pipeline = TextGeneration(model="hf:neuralmagic/mpt-7b-gsm8k-pruned60-quant")
`TextGeneration` accepts `TextGenerationInput` as input and returns `TextGenerationOutput` as output.
The following examples use a quantized 33M parameter TinyStories model for quick compilation:
from deepsparse import TextGeneration
pipeline = TextGeneration(model="hf:mgoin/TinyStories-33M-quant-deepsparse")
`TextGenerationInput` has the following fields:
- `sequences` / `prompt`: Input sequences to generate text from. String or list of strings. Required.
prompt1 = "Princess Peach jumped from the balcony"
prompt2 = "Mario ran into the castle"
output = pipeline(sequences=[prompt1, prompt2], max_new_tokens=20)
for prompt_i, generation_i in zip(output.prompts, output.generations):
    print(f"{prompt_i}{generation_i.text}")
# >> Princess Peach jumped from the balcony and landed on the ground. She was so happy that she had found her treasure. She thanked the old
# >> Mario ran into the castle and started to explore. He ran around the castle and climbed on the throne. He even tried to climb
- `streaming`: Boolean determining whether to stream the response. If True, the results are returned as a generator object that yields results as they are generated.
prompt = "Princess Peach jumped from the balcony"
output_iterator = pipeline(prompt=prompt, streaming=True, max_new_tokens=20)
print(prompt, end="")
for it in output_iterator:
    print(it.generations[0].text, end="")
# output is streamed back incrementally
# >> Princess Peach jumped from the balcony and landed on the ground. She was so happy that she had found her treasure. She thanked the old
- `generation_config`: Parameters used to control the sequences generated for each prompt. See below for more details.
- `generations_kwargs`: Arguments to override the `generation_config` defaults.
`TextGenerationOutput` has the following fields:
- `prompts`: String or list of strings. The prompts used for sequence generation. For multiple input prompts, a list of prompts is returned.
- `generations`: For a single prompt, a list of `GeneratedText` is returned. If multiple prompts are given, a list of `GeneratedText` is returned for each prompt. If streaming is enabled, the next generated token is returned; otherwise, the full generated sequence is returned.
- `created`: Time of inference creation.
`GeneratedText` has the following fields:
- `text`: The generated sequence for a given prompt. If streaming is enabled, this will be the next generated token.
- `score`: The score for the generated token or sequence. The scores have the shape `[sequence_length, vocab_size]`.
- `finished`: Whether generation has stopped.
- `finished_reason`: The reason generation stopped, defined by `FinishReason`. One of stop, length, or time.
output = pipeline(sequences=prompt, max_new_tokens=20, output_scores=True)
print(f"created: {output.created}")
print(f"output.prompts: {output.prompts}")
print(f"text: {output.generations[0].text}")
print(f"score.shape: {output.generations[0].score.shape}")
print(f"finished: {output.generations[0].finished}")
print(f"finished_reason: {output.generations[0].finished_reason}")
# >> created: 2023-10-02 13:48:47.660696
# >> prompts: Princess peach jumped from the balcony and
# >> text: landed on the ground. She was so happy that she had found her treasure. She thanked the bird and
# >> score.shape: (21, 50257)
# >> finished: True
# >> finished_reason: length
`TextGeneration` can be configured to alter several variables in a generation.
The following examples use a quantized 33M parameter TinyStories model for quick compilation:
from deepsparse import TextGeneration
model_id = "hf:mgoin/TinyStories-33M-quant-deepsparse"
pipeline = TextGeneration(model=model_id)
The `GenerationConfig` can be created in three ways:
- Via `transformers.GenerationConfig`:
from transformers import GenerationConfig
generation_config = GenerationConfig()
generation_config.max_new_tokens = 10
output = pipeline(prompt=prompt, generation_config=generation_config)
print(f"{prompt}{output.generations[0].text}")
# >> Princess peach jumped from the balcony and landed on the ground. She was so happy that she
- Via a dictionary:
output = pipeline(prompt=prompt, generation_config={"max_new_tokens" : 10})
print(f"{prompt}{output.generations[0].text}")
# >> Princess peach jumped from the balcony and landed on the ground. She was so happy that she
- Via `kwargs`:
output = pipeline(prompt=prompt, max_new_tokens=10)
print(f"{prompt}{output.generations[0].text}")
# >> Princess peach jumped from the balcony and landed on the ground. She was so happy that she
We can pass a `GenerationConfig` to `TextGeneration.__init__` or `TextGeneration.__call__`.
- If passed to `__init__`, the `GenerationConfig` becomes the default for all subsequent `__call__`s:
# set generation_config during __init__
pipeline_w_gen_config = TextGeneration(model=model_id, generation_config={"max_new_tokens": 10})
# generation_config is the default during __call__
output = pipeline_w_gen_config(prompt=prompt)
print(f"{prompt}{output.generations[0].text}")
# >> Princess peach jumped from the balcony and landed on the ground. She was so happy that she
- If passed to `__call__`, the `GenerationConfig` is used for just that generation:
# no generation_config set during __init__
pipeline_w_no_gen_config = TextGeneration(model=model_id)
# generation_config is passed during __call__
output = pipeline_w_no_gen_config(prompt=prompt, generation_config={"max_new_tokens": 10})
print(f"{prompt}{output.generations[0].text}")
# >> Princess peach jumped from the balcony and landed on the ground. She was so happy that she
The following parameters are supported by the `GenerationConfig`:
- `output_scores`: Whether to return the generated logits in addition to the sampled tokens. Default is `False`.
output = pipeline(prompt=prompt, output_scores=True)
print(output.generations[0].score.shape)
# (34, 50257) >> (tokens_generated, vocab_size)
- `num_return_sequences`: The number of sequences generated for each prompt. Default is `1`.
output = pipeline(prompt=prompt, num_return_sequences=2, do_sample=True, max_new_tokens=10)
for generated_text in output.generations[0]:
    print(f"{prompt}{generated_text.text}")
# >> Princess peach jumped from the balcony and onto her dress. She tried to get away but mummy
# >> Princess peach jumped from the balcony and ran after her. Jill jumped to the floor and followed
- `max_new_tokens`: The maximum number of tokens to generate. Default is `None`.
output = pipeline(prompt=prompt, max_new_tokens=10)
print(f"{prompt}{output.generations[0].text}")
# >> Princess peach jumped from the balcony and landed on the ground. She was so happy that she
- `do_sample`: If True, samples from the probability distribution computed from the logits rather than using deterministic greedy sampling. Default is `False`.
output = pipeline(prompt=prompt, do_sample=True, max_new_tokens=15)
print(f"{prompt}{output.generations[0].text}")
output = pipeline(prompt=prompt, do_sample=True, max_new_tokens=15)
print(f"{prompt}{output.generations[0].text}")
# >> Princess peach jumped from the balcony and flew down. She used her big teeth to pick it up and gave it some
# >> Princess peach jumped from the balcony and landed in front of her. She stood proudly and exclaimed, “I did
- `temperature`: The temperature of the sampling operation. 1 means regular sampling, 0 means always taking the highest-scoring token, and 100.0 is close to uniform probability. If `0.0`, temperature is turned off. Default is `0.0`.
# more random
output = pipeline(prompt=prompt, do_sample=True, temperature=1.5, max_new_tokens=15)
print(f"{prompt}{output.generations[0].text}")
# less random
output = pipeline(prompt=prompt, do_sample=True, temperature=0.5, max_new_tokens=15)
print(f"{prompt}{output.generations[0].text}")
# >> Princess peach jumped from the balcony and disappeared forever. All that means now is Maria staying where nothing draws herloads.
# >> Princess peach jumped from the balcony and landed on the floor. She was very scared, but she knew that her mom
- `top_k`: Integer defining the number of highest-probability tokens considered during sampling. If `0`, `top_k` is turned off. Default is `0`.
import numpy
# only 20 logits are not set to -inf == only 20 logits used to sample token
output = pipeline(prompt=prompt, do_sample=True, top_k=20, max_new_tokens=15, output_scores=True)
print(numpy.isfinite(output.generations[0].score).sum(axis=1))
# >> [20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20]
- `top_p`: Float defining the cumulative probability of tokens considered with nucleus sampling. If `0.0`, `top_p` is turned off. Default is `0.0`.
import numpy
# small set of logits are not set to -inf == nucleus sampling used
output = pipeline(prompt=prompt, do_sample=True, top_p=0.9, max_new_tokens=15, output_scores=True)
print(numpy.isfinite(output.generations[0].score).sum(axis=1))
# >> [ 5 119 18 14 204 6 7 367 191 20 12 7 46 6 2 35]
- `repetition_penalty`: The more a token is used within a generation, the more it is penalized in successive generation passes. If `0.0`, `repetition_penalty` is turned off. Default is `0.0`.
output = pipeline(prompt=prompt, repetition_penalty=1.3)
print(f"{prompt}{output.generations[0].text}")
# >> Princess peach jumped from the balcony and landed on the ground. She was so happy that she had found her treasure. She thanked the bird and went back inside to show her family her new treasure.
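These parameters can also be combined in a single call. The snippet below is an illustrative sketch that mixes several of the sampling controls described above; the exact text generated will vary from run to run:
from deepsparse import TextGeneration

pipeline = TextGeneration(model="hf:mgoin/TinyStories-33M-quant-deepsparse")
prompt = "Princess Peach jumped from the balcony"

# combine several GenerationConfig parameters in one call;
# sampled outputs differ between runs
output = pipeline(
    prompt=prompt,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.3,
    max_new_tokens=30,
)
print(f"{prompt}{output.generations[0].text}")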