-
Notifications
You must be signed in to change notification settings - Fork 742
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v1alpha2: Implementation #492
Conversation
per this design proposal kubeflow/community#30. Update API to v1alpha2
* linter: Fix Signed-off-by: Ce Gao <[email protected]> * linter: Keep v1alpha2 Signed-off-by: Ce Gao <[email protected]> * client: Update Signed-off-by: Ce Gao <[email protected]>
* utils: Add FakeServiceControl Signed-off-by: Ce Gao <[email protected]> * test: Add basic test case Signed-off-by: Ce Gao <[email protected]> * travis: Ignore vendored code Signed-off-by: Ce Gao <[email protected]> * travis: Ignore vendered code Signed-off-by: Ce Gao <[email protected]>
* controller: Fix the status outdate problem Signed-off-by: Ce Gao <[email protected]> * test: Add check for status update Signed-off-by: Ce Gao <[email protected]> * test: Remove call for KeyFunc Signed-off-by: Ce Gao <[email protected]> * pod: Add comment and remove debug statements Signed-off-by: Ce Gao <[email protected]>
* *: Fix some errors in Travis CI Signed-off-by: Ce Gao <[email protected]> * controller: Fix Signed-off-by: Ce Gao <[email protected]>
* controller: Add internal state test Signed-off-by: Ce Gao <[email protected]> * test: Remove useless log Signed-off-by: Ce Gao <[email protected]>
* controller: Separate ps and worker pods Signed-off-by: Ce Gao <[email protected]> * test: Remove log Signed-off-by: Ce Gao <[email protected]> * test: Fix travis Signed-off-by: Ce Gao <[email protected]>
* controller: Separate ps and worker pods Signed-off-by: Ce Gao <[email protected]> * test: Remove log Signed-off-by: Ce Gao <[email protected]> * test: Add Signed-off-by: Ce Gao <[email protected]>
* defaulter: Add Signed-off-by: Ce Gao <[email protected]> * default: Add Signed-off-by: Ce Gao <[email protected]> * default: Fix Signed-off-by: Ce Gao <[email protected]> * defaulter: Remove image default value Signed-off-by: Ce Gao <[email protected]> * example: Add minimal example Signed-off-by: Ce Gao <[email protected]> * utils.go: Add copyright holder Signed-off-by: Ce Gao <[email protected]> * defaulter: Remove restartpolicy Signed-off-by: Ce Gao <[email protected]>
* test: Add run test case Signed-off-by: Ce Gao <[email protected]> * test: Add enqueue test Signed-off-by: Ce Gao <[email protected]> * pod_test: Add Signed-off-by: Ce Gao <[email protected]> * service_test: Add Signed-off-by: Ce Gao <[email protected]>
So there's good news and bad news. 👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) have done so. Everything is all good there. 😕 The bad news is that it appears that one or more commits were authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request. Note to project maintainer: This is a terminal state, meaning the |
Fantastic job on test the test coverage. |
/assign jlewi |
/retest |
1 similar comment
/retest |
@jlewi Could we merge it ASAP since it blocks the continuous development of v1alpha2? |
Signed-off-by: Ce Gao <[email protected]>
@gaocegege: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
The disk quota issue is kubeflow/testing#81 and should be fixed. |
Can we split the PR please and commit the code in v1alpha2 and code gen first? Massive PRs slow down code review and block other work. Splitting a PR into multiple PRs is a great way to speed up the review process. |
Review status: 0 of 49 files reviewed at latest revision, 10 unresolved discussions, some commit checks failed. cmd/tf-operator.v2/app/server.go, line 53 at r1 (raw file):
This looks very similar to Only difference I see is we get rid of reading readControllerConfig and more importantly we use the v1alpha packages Why not refactor the code a bit to promote code reuse? What if we had a RunForVersionFunction that could be called by Run. I think this would be useful because this won't be the only time we change the API. For example at a minimum at sometime in the future we will have a v1beta1 and a v1 API. cmd/tf-operator.v2/app/options/options.go, line 15 at r1 (raw file):
Could we just reuse the existing options package? examples/simple_tf_job.yaml, line 1 at r1 (raw file):
Why do we need this? examples/tf_job_v1alpha2.yaml, line 1 at r1 (raw file):
Do we need this? pkg/controller.v2/controller.go, line 142 at r1 (raw file):
Can you explain this design? Why do we need to check number of events? Why don't we just check that if we expect workers 1....N that there are 1...N pods as expected? Are there other K8s controllers that are designed this way? pkg/controller.v2/controller_pod.go, line 67 at r1 (raw file):
Why are you only checking the number of creations? Why don't you check that the actual pods you expect exist? Suppose we expect 4 worker replicas; 1...N. But suppose we get the following create events So worker 0 is created twice but worker 3 is never created. Won't the code think all 4 workers have been created even though its not true? pkg/controller.v2/controller_utils.go, line 398 at r1 (raw file):
Why do we need a function to create pods on specific nodes? pkg/util/signals/signal.go, line 2 at r1 (raw file):
Can you move this into a separate PR? It will speed up the review process if we can break this up into small PRs so its easier to focus on the important bits in each PR. pkg/util/signals/signal_windows.go, line 17 at r1 (raw file):
Why do we have a package for windows? When would we ever run this code on windows? test/e2e/sleep-and-random-exit/Dockerfile, line 3 at r1 (raw file):
If we need this, please move this into a separate PR. Comments from Reviewable |
@jlewi API and code-gen is 0d61bf9 V1alpha2 core logic is 4e1dd8d These two commits are the most important commits and other commits mainly focus on test and I think we could just look through them. Sorry that I think it is hard to split the PRs and here is the reason: We removed the v1alpha1 during the development and added it again in one commit since you recommend to keep two implementations in master. Then if we split the PR, there are lots of work to do to keep two implementations from the beginning instead of the commit caf21fa. Or we could keep one implementation and just remove v1alpha1 then I think I could split the previous commits since the commits in v1alpha2 already do. Another option is to keep the development in v1alpha2 and wait until the downstream is ready, then we could split the commits in v1alpha2 and overwrite master. /cc @ScorpioCPH |
I really think this PR needs to be split up. There's a bunch of code in here that should be relatively straightforward to approve ; e.g. signals. So the inclusion in this PR is just a distraction. I don't understand why splitting up this PR is difficult. Create a new based of master head. That PR would be just the files for the API which should be relatively straightforward to review. Follow the same process to create a PR for the signals package. Once those PRs are committed which should be relatively straightforward. You can sync with master. At that point the only thing left is the controller package which is where I think the most attention needs to be paid. |
@jlewi You can check out branch v1alpha2, the whole commit history are there. |
@ScorpioCPH I'm not sure I follow. The point of splitting up this PR is that the vast majority of it is straightforward to review. So if we split up the PR we can get most of the files committed pretty quickly. This would leave us with the controller logic and the controller logic is where I'd like to focus during the review process because there appear to be substantial changes to how we track and reconcile resources as part of the TFJob. |
Review status: 0 of 49 files reviewed at latest revision, 10 unresolved discussions, some commit checks failed. pkg/controller.v2/controller.go, line 76 at r1 (raw file):
Do we really need exponential backoff? If we correctly configure and listen to all appropriate events for a TFJob then we should be able to set the periodic reconcile to a large value 1-2 minutes. pkg/controller.v2/controller.go, line 197 at r1 (raw file):
Do we configure this informer to periodically generate update events to be used for reconcilation like we currently do? pkg/controller.v2/controller.go, line 289 at r1 (raw file):
Can I suggest making processNextWorkItem a utility function that can be called from the v1 and v2 controller? I just fixed some issues in #501. I don't think this code is actually different so it would be useful to avoid code duplication. pkg/controller.v2/controller.go, line 325 at r1 (raw file):
It would be good to use logrus and add fields to tag log entries with job name. This will make it alot easier to debug issues processing jobs. pkg/controller.v2/controller.go, line 344 at r1 (raw file):
What is satisifiedExpections? Why don't we just call tc.reconcileTFJobs and let the reconcile function figure out what if anything needs to be done? pkg/controller.v2/controller.go, line 362 at r1 (raw file):
Why do we need to get the pods and service here? pkg/controller.v2/controller_pod.go, line 37 at r1 (raw file):
When a job is finished but not deleted, what prevents pods from being recreated? pkg/controller.v2/controller_pod.go, line 242 at r1 (raw file):
Why do we need to update the expectations? What are the expectations used for? pkg/controller.v2/controller_pod.go, line 293 at r1 (raw file):
Can we just enqueue a work item for the associated job which will cause us to call reconcile for the job? We can rely on the workQueue to aggregate events and rate limit things. pkg/controller.v2/controller_service.go, line 172 at r1 (raw file):
What are adoptions? pkg/controller.v2/controller_service.go, line 255 at r1 (raw file):
Can we just enqueue a work item for the associated job which will cause us to call reconcile for the job? We can rely on the workQueue to aggregate events and rate limit things. Comments from Reviewable |
@gaocegege How's it going with v1alpha2? |
We planned to discuss about the integration but that community meeting has no time for this topic. Maybe we could schedule a meeting for it |
SGTM. We could schedule a online meeting to discuss and make a decision after tomb-sweeping day (the traditional Qingming festival). How about Apr 8th? |
For anyone who happens to read this PR, we will split the PR into smaller ones to ease review. |
@ScorpioCPH and @gaocegege implemented the main logic and basic unit test cases (73% coverage).
I think it is time to open the PR, and we decided to add basic e2e test cases for v1alpha2 these days.
We have created a branch for v1alpha1 and when the v1alpha2 is merged into master we will remove the branch v1alpha2.
PS: Suggest to rebase the PR to master, not squash it since the commit message may be helpful for new contributors.
WDYT? @jlewi
This change is