Paddle cloud web features design #378
Conversation
Thanks for this design!
You might want to check the English writing using Grammarly.com.
doc/design/web.md (Outdated)

> ## Account Management
>
> I'll skip this section because it is a design that almost every website need.
If this document is for the Production Team's reference, we should at least give an example web site here.
In my mind, we do need a mockup for this page. At the least, once a user logs in, s/he must be able to see all his/her jobs listed, and s/he should be able to click each job to see that job's dashboard.
doc/design/web.md (Outdated)

> ## Jupiter Notebook
>
> Start a ReplicaSet using image `docker.paddlepaddle.org/book` in kubernetes cluster and add an ingress endpoint when user first enters the notebook page.
kubernetes => Kubernetes
doc/design/web.md (Outdated)

> ## Jupiter Notebook
>
> Start a ReplicaSet using image `docker.paddlepaddle.org/book` in kubernetes cluster and add an ingress endpoint when user first enters the notebook page.
ReplicaSet needs a URL as its reference (a link to the Kubernetes documentation).
doc/design/web.md (Outdated)

> ## Jupiter Notebook
>
> Start a ReplicaSet using image `docker.paddlepaddle.org/book` in kubernetes cluster and add an ingress endpoint when user first enters the notebook page.
ingress => Ingress
doc/design/web.md (Outdated)

```python
sess = paddle.framework.remote_session(
    topology=block,
```
What does this example program mean? Is it intended to run a block? I am not sure our API design could generate a block that is assignable to the `topology` parameter. Basically, our API is designed to generate a `ProgramDesc` protobuf message that includes a repeated field of `BlockDesc` messages, as described in https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/program.md.
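To make the `ProgramDesc`/`BlockDesc` relationship concrete, here is a minimal plain-Python sketch of the two messages; the field names are assumptions for illustration, not the real protobuf schema:

```python
class BlockDesc:
    """Stand-in for one BlockDesc message; field names are illustrative."""
    def __init__(self, idx, parent_idx=-1):
        self.idx = idx                # index of this block in the program
        self.parent_idx = parent_idx  # -1 means no parent (the global block)
        self.ops = []                 # operator descriptions
        self.vars = []                # variable descriptions

class ProgramDesc:
    """Stand-in for the ProgramDesc message: a repeated field of blocks."""
    def __init__(self):
        self.blocks = []

    def append_block(self, parent_idx=-1):
        block = BlockDesc(len(self.blocks), parent_idx)
        self.blocks.append(block)
        return block

prog = ProgramDesc()
root = prog.append_block()                     # block 0: the global block
sub = prog.append_block(parent_idx=root.idx)   # e.g. a loop body block
```

Under this view, a remote session would receive the whole `ProgramDesc` rather than a single block.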
Thanks, I will just use the pseudo code from https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/refactor/session.md.
doc/design/web.md (Outdated)

> After this, there will be a job description and perfomance monitoring pages able to view at "Job Dashboard"
>
> ## Job Dashboard
Which program would serve this job dashboard? As it is per-job, it seems that the master process of a job should serve it. If so, it could be part of the PaddleBoard.
No, the job dashboard will list all jobs for the current user. The job dashboard is just one web page that simply calls the Kubernetes API to get the job list.
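As a rough sketch of that idea (the `owner` label name is a hypothetical convention the submission service would set), the dashboard backend could list the current user's jobs through the Kubernetes `batch/v1` REST API:

```python
import json
import urllib.request

def jobs_url(api_server, namespace, user):
    # batch/v1 Jobs endpoint, filtered by a hypothetical "owner" label
    # attached to each job at submission time.
    return ("%s/apis/batch/v1/namespaces/%s/jobs?labelSelector=owner%%3D%s"
            % (api_server, namespace, user))

def list_user_jobs(api_server, namespace, user, token):
    req = urllib.request.Request(
        jobs_url(api_server, namespace, user),
        headers={"Authorization": "Bearer " + token})
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return [item["metadata"]["name"] for item in body.get("items", [])]
```

The web page would then render the returned job names (plus status fields) as the table rows.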
> - Upload/Download page
> - file sharing page
>
> ## Paddle Board
I didn't expect `draw_board` function calls in user programs. I am not sure how configurable TensorBoard is, but in my mind, PaddleBoard just needs to be able to present outputs from Evaluator operators, aggregated/accumulated over minibatches.
If we do not insert function calls into user programs, we need to automatically find out which variables represent the cost and the evaluator operator by default, and draw their values on the web page. I'm not sure how to do that for now.
Here is a short example of how TensorBoard configures metrics using `tf.summary`: the user explicitly specifies the values to output for drawing.
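To illustrate the pattern without depending on TensorFlow (a stand-in sketch, not PaddleBoard's or TensorBoard's actual API): the user names each scalar to plot, and values are appended to files on shared storage that the dashboard page can poll and redraw.

```python
import json
import os

class ScalarSummaryWriter:
    """Minimal tf.summary-style writer: one JSON-lines file per metric."""
    def __init__(self, logdir):
        os.makedirs(logdir, exist_ok=True)
        self.logdir = logdir

    def add_scalar(self, name, step, value):
        # Append one (step, value) point; the web page tails this file
        # to refresh the graph.
        path = os.path.join(self.logdir, name + ".jsonl")
        with open(path, "a") as f:
            f.write(json.dumps({"step": step, "value": value}) + "\n")
```

In a training loop the user would call, e.g., `writer.add_scalar("cost", step, cost_value)` once per minibatch.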
> Calling `draw_board` will output graph files on the distributed storage, and then the web page can load the data and refresh the graph.
>
> ## Serving
It seems that a serving job is different from a training job in that the former doesn't have a master process. If so, each process in a serving job needs to be able to present its own metrics, and there is no chance for them to present a PaddleBoard?
I'm not sure what metrics to display when running inference (serving): the neural network configuration may not define cost functions, and there's no label to evaluate the result against. Metrics like QPS (queries per second) are more like "monitoring", not PaddleBoard.
doc/design/web.md (Outdated)

> 1. inference network configuration in `.proto` format, or user can also define the network in Python in the webpage.
> 1. number of CPU/GPU resource in total to use for serving the model, the more resource there is, the more concurrent calls can be served.
>
> After cliking the "Langch" button, a "kubernetes deployment" will be created to serve the model. The current serving instances will be listed at the current page.
Where is the "Launch" button?
Updated following comments.
> - Account Management
>   - Registration, send email to inform if registration succeeded
>   - Account Login/Logout
>   - Password changing, find back
I think we don't need "find back"; maybe change it to "resetting", since we would probably only store a hashed password.
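A minimal sketch of why resetting is the only option: with salted PBKDF2 from the standard-library `hashlib`, only the salt and digest are stored, so the plain-text password cannot be recovered, it can only be replaced.

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None):
    # PBKDF2 with a per-user random salt; only (salt, digest) is stored.
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100000)
    return salt, digest

def verify_password(password, salt, digest):
    # Constant-time comparison to avoid timing side channels.
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100000)
    return hmac.compare_digest(candidate, digest)
```

The iteration count here is only an example; it should be tuned to the deployment's hardware.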
> - Registration, send email to inform if registration succeeded
> - Account Login/Logout
> - Password changing, find back
> - Download SSL keys
What are the SSL keys for? I thought authentication is currently done via token?
> - Datasets
>   - Public Dataset viewing
>   - Upload/Download private datasets
>   - Share datasets
Maybe this needs to be more specific: does it mean the dataset can be shared with anyone by a link, or just set visible to a certain group (similar to Unix file read permissions)?
> ## Account Management
>
> Account management page is designed to satisfy multi-tenant use cases. One account should have a unique account ID for login, and this account owns one access key to one unique [Kubernetes namespace](https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/) cluster. Multiple users can log in to this account ID and operate jobs and data files. The only "master user" can do modifications like increase quota or manage account settings.
"Log in to this account ID", or log in to one's own account which belongs to the group? If so, "master user" could be "group owner".
```python
my_metric = my_metric_graph(output, label)
my_metric_value = output

draw_board(cost, evaluator)
```
I think `draw_board` should take only one variable that returns a scalar, and an optional name. E.g., `draw_board(evaluator, "evaluate result")`.
> 1. model `tar.gz` files to the cloud.
> 1. inference network configuration in `.proto` format or user can also define the network in Python in the web page.
> 1. number of CPU/GPU resource in total to use for serving the model, the more resource there is, the more concurrent calls can be served.
Should we change "number of CPU/GPU resource" to the number of instances plus CPU/Mem/GPU per instance? Otherwise it's hard for us to figure out how many instances to run (we don't know the model's properties or the user's serving requirements).
Agree with @helinwang. Additionally, we can calculate the total resource usage and display it on the web site.
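A small sketch of that calculation (parameter names are illustrative): the total usage shown on the site is just the per-instance request multiplied by the instance count the user fills in on the launch dialogue.

```python
def total_serving_resources(num_instances, cpu_per_instance,
                            mem_gb_per_instance, gpu_per_instance=0):
    # Total resource footprint to display, derived from the per-instance
    # request entered in the launch dialogue.
    return {
        "cpu": num_instances * cpu_per_instance,
        "memory_gb": num_instances * mem_gb_per_instance,
        "gpu": num_instances * gpu_per_instance,
    }
```

For example, 4 instances at 2 CPU / 8 GB / 1 GPU each total 8 CPU, 32 GB, and 4 GPUs.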
> - Performance Monitoring
> - Quota Monitoring
> - Datasets
>   - Public Dataset viewing
Dataset => dataset
> - Serving
>   - Submit serving instances
>   - Deactivate serving
>   - Serving performance monitoring
I think we also need a feature: scale serving instances. We can use an HPA (Horizontal Pod Autoscaler) to implement auto-scaling.
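For illustration, an `autoscaling/v1` HorizontalPodAutoscaler manifest for a serving Deployment could be generated like this (the Deployment name and thresholds are examples; field names follow the Kubernetes API):

```python
def serving_hpa(name, min_replicas, max_replicas, target_cpu_percent):
    # Build an autoscaling/v1 HPA manifest that scales the serving
    # Deployment of the same name based on average CPU utilization.
    return {
        "apiVersion": "autoscaling/v1",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": name,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "targetCPUUtilizationPercentage": target_cpu_percent,
        },
    }

hpa = serving_hpa("paddle-serving", 1, 10, 80)
```

The web backend would POST this manifest to the cluster when the user enables auto-scaling for a serving job.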
> <img src="pictures/notebook.png" width="500px" align="center">
>
> Users can write a program in python in the web page and save their programs, which will be saved at cloud storage. Users also can run a script like below to submit a cluster training job:
python => Python
> A web page containing a table to list jobs satisfying user's filters. The user can only list jobs that were submitted by themselves.
>
> | jobname | start time | age | success | fails | actions |
Maybe we need more information in the job list, such as `PS_READY`, `PS_TOTAL`, `TRAINER_READY`, `TRAINER_TOTAL`.
> Datasets and Models are quite the same, both like a simple file management and sharing service.
>
> - file listing and viewing page
file => File
> ## Datasets and Models
>
> Datasets and Models are quite the same, both like a simple file management and sharing service.
Maybe we can add more information about the file sharing service, such as whether files can be shared between users, between namespaces, or just via a public link?
> Click the "Launch" button in this web page will pop up a modal dialogue to configure the job:
>
> 1. model `tar.gz` files to the cloud.
Capitalize the first letter, the same as below.
"model `tar.gz` files to the cloud"
=>
"The path of model files with suffix `tar.gz` on the cloud."
> 1. model `tar.gz` files to the cloud.
> 1. inference network configuration in `.proto` format or user can also define the network in Python in the web page.
> 1. number of CPU/GPU resource in total to use for serving the model, the more resource there is, the more concurrent calls can be served.
Agree with @helinwang. Additionally, we can calculate the total resource usage and display it on the web site.
Closing, will reopen if we are going to do this work.
Fix #377