[batch] E2E works with driver and request proxy #272

xinchen384 · 2024-10-03T17:13:38Z

Pull Request Description

Driver is to bind all components together to server job requests.
Update job manager to update request info within job meta.
add round_robin job scheduling to scheduler.
Request proxy is to handle calling to inference engine.
Driver test is to check job's status is consistent.

Related Issues

Resolves: part of #182

Important: Before submitting, please complete the description above and review the checklist below.

Contribution Guidelines (Expand for Details)

We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:

Pull Request Title Format

Your PR title should start with one of these prefixes to indicate the nature of the change:

[Bug]: Corrections to existing functionality
[CI]: Changes to build process or CI pipeline
[Docs]: Updates or additions to documentation
[API]: Modifications to aibrix's API or interface
[CLI]: Changes or additions to the Command Line Interface
[Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.

Submission Checklist

PR title includes appropriate prefix(es)
Changes are clearly explained in the PR description
New and existing tests pass successfully
Code adheres to project style and best practices
Documentation updated to reflect changes (if applicable)
Thorough testing completed, no regressions introduced

By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.

Jeffwan

please rebase the main change and resolve the testing issues. then I will start the review

Makefile

python/aibrix/aibrix/batch/constant.py

Jeffwan

Can you add some docs and examples?

Jeffwan · 2024-10-03T21:20:18Z

python/aibrix/aibrix/batch/scheduler.py

@@ -77,6 +99,7 @@ def schedule_get_job(self):
        else:
            print("Unsupported scheduling policy!")

+        self._job_manager.start_execute_job(job_id)


what's the role of scheduler? if it's used to select id. should the main workflow be responsible for launching the job?? or you like it to include the execution?

Yes, the main workflow is to launch the job. It is the scheduler's responsibility to select which job. Job manager is to handle jobs' state transition. What you mark here on "start_execute_job", maybe I need to change the name of this function. It is just used to change job's state.

It is the scheduler's responsibility to select which job. Job manager is to handle jobs' state transition

My point is you embed job manager inside the scheduler. should it removed from scheduler and added in into driver?

Insider of the scheduler, I have to check job's status by exposing manager to schduler.

em. scheduler could be static and you can pass job status into the scheduler. now, the coupling makes the testing etc a little bit harder. you need to mock lots of things in future

I am ok if you like to get this merged now. Let's revisit it and make necessary refactor

python/aibrix/aibrix/batch/request_proxy.py

python/aibrix/aibrix/batch/scheduler.py

python/aibrix/aibrix/batch/job_manager.py

Jeffwan · 2024-10-08T21:56:42Z

@xinchen384 seems some of the comments and TODOs are not addressed yet. Let's continue the work and try to merge it tomorrow

xinchen384 · 2024-10-09T05:28:07Z

Now all comments are addressed. Documents with examples are added as well. @Jeffwan

python/aibrix/aibrix/batch/README.md

Jeffwan · 2024-10-09T17:37:03Z

python/aibrix/aibrix/batch/README.md

+The request's format is in json. 
+The json should cover multiple attributes as specified here, https://platform.openai.com/docs/guides/batch/getting-started, such as endpoint and completion window. 
+
+## Submit job input data


this is something not expected. As we discussed, we want to support openAI style batch API interface. If we do not have that supported, it increase the user's learning curve and usage curve a lot.

For example, where's the running instance of user's python script. can it be interrupted? it's much hard to use it comparing to submit a remote API

Are you saying these interfaces are different from openAI batch API? I think that they are the same. The only missing part is a proxy that expose there interfaces to client side.

Let's leave to next PR

Jeffwan · 2024-10-09T17:37:41Z

python/aibrix/aibrix/batch/README.md

+# Batch API Tutorial
+
+## Prepare dataset
+Before submitting a batch job, you need to prepare input data as a file. 


can you give an example dataset here to reduce the efforts to prepare data?

I included here: https://github.com/aibrix/aibrix/blob/xin/driver/python/aibrix/tests/sample_job_input.json

I suggest to reference the file in the readme so people can easily find it. Otherwise, how can user know there's one in testing folder

Jeffwan · 2024-10-10T06:10:46Z

There're some minor issues we can address in future PRs. this one looks good to me

* Update manifests version to v0.1.0-rc.3 (#287) * [Misc] Add sync images step and scripts in release process (#283) Add sync images step and scripts in release process * [batch] E2E works with driver and request proxy (#272) * e2e driver and test * comment functions * check job status in test * format update * update copyright * add examples with instructions and interfaces * move batch tutorial --------- Co-authored-by: xin.chen <[email protected]> * Fix address already in use when AIRuntime start in pod (#289) add uvicorn startup into file entrypoint * Read model name from request body (#290) * Use model name from request body * rename dummy to reserved router * Fix redis bootstrap flaky connection issue (#293) * skip docs CI if no changes in /docs dir (#294) * skip docs CI if no changes in /docs dir * test docs build * Improve Rayclusterreplicaset Status (#295) * improve rayclusterreplicaset status * nit * fix lint error * improve isClusterActive logic * fix lint error * remove redundant isRayPodCreateOrDeleteFailed check --------- Signed-off-by: Yicheng-Lu-llll <[email protected]> * Add request trace for profiling (#291) * Add request trace for profiling * add to redis at 10 second interval * nit * round to nearest 10s interval * round timestamp to nearest 10s interval and aggregate data by model * add go routine to add request trace * Update the crd definiton due to runtime upgrade (#298) #295 introduce the latest kuberay api and the dependencies bumps sigs.k8s.io/controller-runtime from v0.17.3 to v0.17.5. Due to that change, make manifest update the CRD definitions * Push images to Github registry in release pipeline (#301) * Disable docker build github workflow to cut CI cost * Push images to Github registry in release pipeline * Build autoscaler abstractions like fetcher, client and scaler (#300) * minor clean up on the autoscaler controller * Extract the algorithm package algorithm is extracted to distinguish with the scaler. * Refactor scaler interface 1. Split the Scaler interface and BaseAutoscaler implementation 2. Create APA/KPA scaler separately and adopt the corresponding algorithms * Introduce the scalingContext in algorithm * Introduce k8s.io/metrics for resource & custom metrics fetching * Extract metric fetcher to cover the fetching logic * Optimize the scaler workflow to adopt fetch and client interface * Further refactor the code structure * Support pod autoscaler periodically check (#306) * Support pod autoscaler periodically check * Fix the error case * Add timeout in nc check for redis bootstrap (#309) * Refactor AutoScaler: metricClient, context, reconcile (#308) * Refactor AutoScaler: optimize metric client, context, and reconcile processes. * fix make lint-all * fix typos --------- Signed-off-by: Yicheng-Lu-llll <[email protected]> Co-authored-by: xinchen384 <[email protected]> Co-authored-by: xin.chen <[email protected]> Co-authored-by: brosoul <[email protected]> Co-authored-by: Varun Gupta <[email protected]> Co-authored-by: Yicheng-Lu-llll <[email protected]> Co-authored-by: Rong-Kang <[email protected]>

* e2e driver and test * comment functions * check job status in test * format update * update copyright * add examples with instructions and interfaces * move batch tutorial --------- Co-authored-by: xin.chen <[email protected]>

Jeffwan reviewed Oct 3, 2024

View reviewed changes

Makefile Outdated Show resolved Hide resolved

python/aibrix/aibrix/batch/constant.py Show resolved Hide resolved

xin.chen added 4 commits October 4, 2024 02:03

e2e driver and test

0232a5d

comment functions

32ed2e7

check job status in test

43db7af

format update

a4e3886

xinchen384 force-pushed the xin/driver branch from be88efa to a4e3886 Compare October 3, 2024 18:30

update copyright

0720bdc

Jeffwan reviewed Oct 3, 2024

View reviewed changes

add examples with instructions and interfaces

042c131

Jeffwan reviewed Oct 9, 2024

View reviewed changes

move batch tutorial

26d4705

Jeffwan merged commit 2f32a01 into main Oct 10, 2024
10 checks passed

Jeffwan deleted the xin/driver branch October 10, 2024 06:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[batch] E2E works with driver and request proxy #272

[batch] E2E works with driver and request proxy #272

xinchen384 commented Oct 3, 2024

Jeffwan left a comment

Jeffwan left a comment

Jeffwan Oct 3, 2024

xinchen384 Oct 4, 2024

Jeffwan Oct 4, 2024

xinchen384 Oct 9, 2024

Jeffwan Oct 9, 2024

Jeffwan Oct 9, 2024

Jeffwan commented Oct 8, 2024

xinchen384 commented Oct 9, 2024

Jeffwan Oct 9, 2024

xinchen384 Oct 10, 2024

Jeffwan Oct 10, 2024

Jeffwan Oct 9, 2024

xinchen384 Oct 9, 2024

Jeffwan Oct 10, 2024

Jeffwan commented Oct 10, 2024

[batch] E2E works with driver and request proxy #272

[batch] E2E works with driver and request proxy #272

Conversation

xinchen384 commented Oct 3, 2024

Pull Request Description

Related Issues

Pull Request Title Format

Submission Checklist

Jeffwan left a comment

Choose a reason for hiding this comment

Jeffwan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Jeffwan commented Oct 8, 2024

xinchen384 commented Oct 9, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Jeffwan commented Oct 10, 2024