[RFC]: Add support for IBM Spyre accelerator #9652
Comments
Hi @tdoublep, thanks for the thorough writeup! I had some torch.compile-related questions:
Is this a torch dynamo backend? Or does it integrate via Inductor?
Agreed, something like this would be useful for GPUs. We want to be able to write custom inductor passes where we only apply certain optimizations depending on the number of tokens we're working with. (cc @bnellnm who is working on an inductor pass like that) BTW: It sounds like the Spyre might require all shapes to be static, is that true?
Thanks for reading @tlrmchlsmth
The former; it integrates at the level of FX graphs.
Right now each compile has a static prompt length and batch size, but the number of output tokens is dynamic to an extent (we still need to define some maximum). We are able to compile multiple graphs to support different cases though, and we have some logic to pad to the nearest reasonable shape (again, sort of like how CUDA graphs works today iirc).
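As a tiny illustration of the padding idea (the shape values and helper below are made up for illustration, not taken from the Spyre code):

```python
# Hypothetical sketch of padding to the nearest pre-compiled shape.
COMPILED_SHAPES = [(64, 1), (64, 4), (256, 1), (256, 4)]  # (prompt_len, batch_size)

def nearest_compiled_shape(prompt_len: int, batch_size: int) -> tuple[int, int]:
    """Pick the smallest warmed-up shape that can hold this request."""
    candidates = [
        (p, b) for p, b in COMPILED_SHAPES if p >= prompt_len and b >= batch_size
    ]
    if not candidates:
        raise ValueError("request is larger than any compiled shape")
    return min(candidates)
```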
Update from our side: we have now open-sourced the code for Spyre support on IBM's fork of vLLM:
Nice, looks like all the needed changes are contained in this PR IBM/vllm#56? |
@tlrmchlsmth Yes, those are the changes compared with a commit from last month. We will continue developing on the open repo from here onwards, and will be pulling in changes from upstream frequently.
Motivation.
IBM has recently announced its Spyre AI accelerator at Hot Chips 2024. This accelerator has been designed, in collaboration with IBM Research, to scale up enterprise AI workloads running on IBM's mainframe systems (IBM Z), as well as on IBM's Power platform. Since IBM is building its inference stack on top of vLLM, we would like to enable support for IBM Spyre within the vLLM framework.
Spyre has been designed to fit seamlessly into the PyTorch ecosystem via torch.compile. Specifically, IBM Research has developed a new backend for torch.compile that will compile torch FX graphs for execution on the Spyre hardware. In this sense, we envision that Spyre support in vLLM can work in a similar way to how the TPU support is working today (e.g., see here).
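For readers less familiar with this style of integration, here is a minimal, generic sketch of how a torch.compile backend plugs in at the FX-graph level; the backend name and body are placeholders, not IBM's actual Spyre compiler:

```python
import torch

def spyre_backend(gm: torch.fx.GraphModule, example_inputs):
    # A torch.compile backend receives the captured FX graph and returns a
    # callable. A real backend would lower `gm` to the accelerator here;
    # this placeholder just runs the graph unmodified.
    gm.graph.print_tabular()  # inspect the captured ops
    return gm.forward

model = torch.nn.Linear(16, 16)
compiled = torch.compile(model, backend=spyre_backend)
out = compiled(torch.randn(4, 16))
```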
Today, there are two key limitations that affect this integration and need to be worked around. Specifically:

- Spyre does not yet support paged attention or continuous batching, so the scheduling algorithm needs to be adapted (see P2 and P4 below).
- The modeling code currently needs to come from IBM's foundation model stack (fms). This is only a temporary limitation; the end-goal is that, via torch.compile, we can also run the native vLLM modeling code on Spyre. We hope that recent efforts to make vLLM models torch compilable will significantly accelerate this effort.

Proposed Change.
In this RFC, we propose the following sequence of PRs to enable IBM Spyre support in vLLM:

- P1: Add support for single Spyre card via new SpyreExecutor class
- P2: Changes to scheduling algorithm to disable continuous batching
- P3: Enable TP execution across multiple Spyre cards via MultiprocessingSpyreExecutor class
- P4: Enable paged attention and continuous batching for Spyre
- P5: Enable vLLM modeling code to run on Spyre

While much of the work here (P1, P2, P3) has already been completed in a private fork, we plan to upstream the changes as a sequence of smaller PRs to make them easier to review. Below we will discuss the planned changes from each PR step.
P1: Add support for single Spyre card via new SpyreExecutor class

We will introduce a set of classes, inheriting from the core vLLM classes, that will enable execution on a single Spyre device. Architecturally, this will look very similar to the equivalent classes that were introduced for running on AWS Inferentia (e.g., NeuronExecutor, NeuronWorker, NeuronModelRunner, NeuronCausalLM). In a similar way to how the NeuronModelRunner uses the modeling code from the transformers_neuronx package, these new classes will execute the modeling code from IBM's fms package.

In the diagram below, we compare the proposed Spyre classes with the corresponding classes that already exist for the AWS Inferentia support:
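Sketched in code terms, the correspondence might look roughly like this (method names and bodies are illustrative placeholders, not the actual implementation):

```python
# Correspondence (per the description above):
#   NeuronExecutor    -> SpyreExecutor
#   NeuronWorker      -> SpyreWorker
#   NeuronModelRunner -> SpyreModelRunner
#   NeuronCausalLM    -> fms-based causal LM
# The skeletons below are illustrative only; the real code lives in IBM's fork.

class SpyreModelRunner:
    def load_model(self) -> None:
        # Load the fms model and compile it with the Spyre torch.compile
        # backend (placeholder).
        ...

    def execute_model(self, input_batch):
        # Pad the batch to a warmed-up shape, run the compiled graph,
        # and sample the next tokens (placeholder).
        ...

class SpyreWorker:
    def __init__(self) -> None:
        self.model_runner = SpyreModelRunner()

class SpyreExecutor:
    def __init__(self) -> None:
        self.driver_worker = SpyreWorker()
```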
Since Spyre works via torch.compile, we need to ensure that all compilation is triggered at init time, so that it does not occur on the critical path (i.e., while serving user requests). This PR will also introduce a routine for warming up the inference server when using Spyre, triggering compilation of all required shapes (e.g., prompt length, number of output tokens, batch size). We will write code to ensure that batches get padded to one of the compiled shapes before execution. This behaviour is akin to what happens today in vLLM for CUDA graphs, and presumably something like this warmup will also be needed once vLLM starts using torch.compile more extensively. This could be one area to explore commonality with other parts of the codebase.
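A minimal sketch of such a warmup routine, purely to illustrate the idea (shape values and method names are hypothetical):

```python
import itertools

# Hypothetical warmup shapes; the real values would come from configuration.
WARMUP_PROMPT_LENS = [64, 256, 1024]
WARMUP_NEW_TOKENS = [20, 100]
WARMUP_BATCH_SIZES = [1, 4]

def warmup(model_runner) -> None:
    """Run one dummy batch per supported shape so that every torch.compile
    graph is built and cached before the server accepts requests."""
    for prompt_len, new_tokens, batch_size in itertools.product(
        WARMUP_PROMPT_LENS, WARMUP_NEW_TOKENS, WARMUP_BATCH_SIZES
    ):
        model_runner.execute_dummy_batch(prompt_len, new_tokens, batch_size)
```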
Testing: While testing on the real hardware can only be performed internally for now, we can test the vast majority of the integration on CPU by either (a) running in eager mode or (b) using torch.compile with the inductor backend. Thus, in this PR we will also add a set of unit and integration tests to verify that everything behaves as expected. The tests will focus on offline mode, since changes to the scheduling algorithm are needed to support online mode (see P2). We will also add a Dockerfile.spyre containing all necessary dependencies (e.g., FMS) in which the tests can be executed. Whether we could have these tests running as part of vLLM's CI/CD is something we would like to discuss.
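As a rough sketch of what such a CPU-only test could look like (the helper name below is hypothetical, not the actual test code):

```python
import pytest

@pytest.mark.parametrize("backend", ["eager", "inductor"])
def test_offline_generation_on_cpu(backend: str) -> None:
    # create_spyre_llm_on_cpu is a placeholder for however the test suite
    # constructs an offline LLM instance with the given compile backend.
    llm = create_spyre_llm_on_cpu(backend=backend)
    outputs = llm.generate(["Hello, my name is"])
    assert len(outputs) == 1 and outputs[0]
```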
P2: Changes to scheduling algorithm to disable continuous batching.

We need to introduce a few changes to the scheduling algorithm to work around the lack of continuous batching support.
These changes must be conditional and must not affect the behaviour of the scheduler on existing supported devices. They could either be applied within the scheduler itself (e.g., by checking is_spyre()), or we could try to "plug in" an alternate scheduler design. This is one of the design choices we would like some feedback on.
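To make the second option a bit more concrete, here is an illustrative sketch (apart from the is_spyre() check mentioned above, all names are hypothetical):

```python
class Scheduler:
    """Existing scheduler with continuous batching (stand-in)."""
    def schedule(self):
        ...

class SpyreStaticBatchScheduler(Scheduler):
    """Hypothetical alternate scheduler: only form a new batch once the
    previous one has fully completed, and pad it to a warmed-up shape."""
    def schedule(self):
        ...

def is_spyre() -> bool:
    return True  # stand-in for the real device check

# Option A: branch inside the existing scheduler based on is_spyre().
# Option B (sketched here): plug in an alternate scheduler class up front.
scheduler_cls = SpyreStaticBatchScheduler if is_spyre() else Scheduler
scheduler = scheduler_cls()
```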
Testing: As part of this PR, we will also introduce tests to cover the integration with the MQLLMEngine and online operation.
P3: Enable TP execution across multiple Spyre cards via MultiprocessingSpyreExecutor class.

We have found that the MultiprocessingGPUExecutor can be easily adapted into a MultiprocessingSpyreExecutor to enable TP execution across multiple Spyre devices in parallel. However, to reduce code duplication we propose refactoring the common code between MultiprocessingGPUExecutor and MultiprocessingSpyreExecutor into a common parent class MultiprocessingExecutor. By inheriting from MultiprocessingExecutor and a corresponding mixin class (e.g., GPUExecutor or SpyreExecutor), it should be possible to achieve the desired behaviour with very little device-specific code. Note that something along these lines already exists for the MultiprocessingXPUExecutor (e.g., see here), but the design proposed below would give more flexibility for device-specific specialization, and would also easily allow us to create multiprocessing executors for all supported devices if we want.

The architecture would look something like this:
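As an illustrative sketch of the intended inheritance structure (the real classes would carry the actual worker-management logic; only the names above come from the proposal):

```python
class ExecutorBase:
    """Stand-in for vLLM's core executor interface."""

class GPUExecutor(ExecutorBase):
    """Single-GPU executor; knows how to create GPU workers."""

class SpyreExecutor(ExecutorBase):
    """Single-Spyre executor; knows how to create Spyre workers."""

class MultiprocessingExecutor(ExecutorBase):
    """Shared multiprocessing logic: spawn one worker process per device,
    broadcast execute_model calls, and gather the results."""

class MultiprocessingGPUExecutor(MultiprocessingExecutor, GPUExecutor):
    pass

class MultiprocessingSpyreExecutor(MultiprocessingExecutor, SpyreExecutor):
    pass
```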
Testing: We will add tests to verify that the MultiprocessingSpyreExecutor behaves as expected for tensor parallel execution when running on CPU using either eager mode or the inductor backend. Internally, we will run these tests against the real hardware.

P4: Enable paged attention and continuous batching for Spyre.
TBD
P5: Enable vLLM modeling code to run on Spyre.
TBD
Feedback Period.
2 weeks
CC List.
@njhill @simon-mo @youkaichao @zhuohan123 @comaniac @WoosukKwon @Yard1
Please cc anyone else as you see fit!
Any Other Things.
No response