
Paddle cloud web features design #378

Closed
wants to merge 7 commits

Conversation

typhoonzero
Collaborator

Fix #377

@typhoonzero typhoonzero changed the title Web desing Web design Sep 30, 2017
@typhoonzero typhoonzero changed the title Web design Paddle cloud web features design Sep 30, 2017

@wangkuiyi wangkuiyi left a comment


Thanks for this design!

You might want to check the English writing using Grammarly.com.


## Account Management

I'll skip this section because it is a design that almost every website needs.


If this document is for Production Team's reference, we should at least give an example Web site here.

In my mind, we do need a mockup for this page. At least, once a user logs in, s/he must be able to see all his/her jobs listed. And s/he should be able to click each job to see the job's dashboard.


## Jupyter Notebook

Start a ReplicaSet using image `docker.paddlepaddle.org/book` in kubernetes cluster and add an ingress endpoint when user first enters the notebook page.


kubernetes => Kubernetes


## Jupyter Notebook

Start a ReplicaSet using image `docker.paddlepaddle.org/book` in kubernetes cluster and add an ingress endpoint when user first enters the notebook page.


ReplicaSet needs a URL as its reference.


## Jupyter Notebook

Start a ReplicaSet using image `docker.paddlepaddle.org/book` in kubernetes cluster and add an ingress endpoint when user first enters the notebook page.


ingress => Ingress
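The notebook-launch step quoted above could be sketched as the two manifests the web backend would create. This is a minimal sketch in plain Python dicts; the namespace handling, label keys, host naming scheme, and port are assumptions for illustration, not part of the design doc.

```python
# Sketch: manifests created when a user first opens the notebook page.
# The host scheme and labels are illustrative assumptions.
def notebook_manifests(user: str):
    labels = {"app": "notebook", "user": user}
    replica_set = {
        "apiVersion": "apps/v1",
        "kind": "ReplicaSet",
        "metadata": {"name": f"notebook-{user}", "labels": labels},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": "notebook",
                        "image": "docker.paddlepaddle.org/book",
                        "ports": [{"containerPort": 8888}],
                    }]
                },
            },
        },
    }
    ingress = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": f"notebook-{user}"},
        "spec": {
            "rules": [{
                "host": f"{user}.notebook.example.com",  # assumed host scheme
                "http": {"paths": [{
                    "path": "/",
                    "pathType": "Prefix",
                    "backend": {"service": {
                        "name": f"notebook-{user}",
                        "port": {"number": 8888},
                    }},
                }]},
            }]
        },
    }
    return replica_set, ingress
```

The backend would submit both objects to the Kubernetes API (or via `kubectl apply`); returning dicts keeps the sketch independent of any particular client library.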


```python
sess = paddle.framework.remote_session(
    topology=block,
```

What does this example program mean? Is it intended to run a block? I am not sure if our API design could generate a block which is assignable to the topology parameter. Basically, our API is designed to generate a ProgramDesc protobuf message that includes a repeated field of BlockDesc messages, as described here https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/program.md.
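The structure referenced in the comment above — a ProgramDesc holding a repeated field of BlockDesc messages — can be sketched roughly in plain Python. The real definitions are protobuf messages in Paddle's design doc; the field names below are simplified assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class BlockDesc:
    idx: int = 0
    parent_idx: int = -1            # -1 means no parent block
    ops: List[str] = field(default_factory=list)   # simplified; really OpDesc messages
    vars: List[str] = field(default_factory=list)  # simplified; really VarDesc messages

@dataclass
class ProgramDesc:
    blocks: List[BlockDesc] = field(default_factory=list)  # repeated BlockDesc
```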

Collaborator Author



After this, there will be job description and performance monitoring pages viewable at the "Job Dashboard".

## Job Dashboard


Which program would serve this job dashboard? As it is per-job, it seems that the master process of a job should serve it. If so, it could be part of the PaddleBoard.

Collaborator Author


No, the job dashboard will list all jobs for the current user. The job dashboard is just one web page that simply calls the Kubernetes API to get the job list.
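That lookup could be as small as one label-selector query against the Kubernetes REST API. A minimal sketch, assuming jobs carry a hypothetical `paddle-user` label and are stored as `batch/v1` Jobs (both assumptions, not stated in the design):

```python
from urllib.parse import urlencode

def jobs_url(api_server: str, namespace: str, user: str) -> str:
    # Build the Kubernetes REST endpoint that lists only this user's jobs.
    # The "paddle-user" label key is an illustrative assumption.
    query = urlencode({"labelSelector": f"paddle-user={user}"})
    return f"{api_server}/apis/batch/v1/namespaces/{namespace}/jobs?{query}"
```

The dashboard page would GET this URL (with the user's credentials) and render the returned job list as the table described later in the doc.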

- Upload/Download page
- file sharing page

## Paddle Board


I didn't expect draw_board function calls in user programs. I am not sure how configurable TensorBoard is, but in my mind, PaddleBoard just needs to be able to present outputs from Evaluator operators aggregated/accumulated over minibatches.

Collaborator Author


If we don't insert function calls into user programs, we need to automatically find out which variables represent the cost and the evaluator operator by default and draw their values on the web page. I'm not sure how to do that for now.

Here is a short example of how TensorBoard configures metrics using tf.summary: the user explicitly specifies which values to output for drawing.


Calling `draw_board` will output graph files on the distributed storage, and then the web page can load the data and refresh the graph.
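One way the sentence above could work in practice: each `draw_board` call appends the latest scalar values to a file on the shared storage, and the web page polls that file to refresh its charts. This is a hypothetical sketch; the function name matches the doc, but the signature and JSON-lines file layout are assumptions, not the actual PaddlePaddle API.

```python
import json, os, time

def draw_board(log_dir, step, **scalars):
    # Hypothetical sketch: append one record per call to a JSON-lines file
    # on the distributed storage; the dashboard page polls this file.
    os.makedirs(log_dir, exist_ok=True)
    record = {"step": step, "time": time.time(), **scalars}
    with open(os.path.join(log_dir, "scalars.jsonl"), "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

An append-only file keeps writers simple and lets the page fetch only the tail to update its graphs incrementally.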

## Serving


It seems that a serving job is different from a training job in that the former doesn't have a master process. If so, each process in a serving job needs to be able to present its own metrics, and there is no chance for them to present a PaddleBoard?

Collaborator Author


I'm not sure what metrics to display when running inference (serving); the neural network configuration may not define cost functions, and there's no label to evaluate the result. Metrics like QPS (queries per second) are more like "monitoring" than PaddleBoard.

1. inference network configuration in `.proto` format, or user can also define the network in Python in the webpage.
1. number of CPU/GPU resource in total to use for serving the model, the more resource there is, the more concurrent calls can be served.

After clicking the "Launch" button, a Kubernetes Deployment will be created to serve the model. The current serving instances will be listed on the current page.


Where is the "Launch" button?

Collaborator Author


Updated following comments.

- Account Management
- Registration, send email to inform if registration succeeded
- Account Login/Logout
- Password changing, find back
Collaborator


I think we don't need "find back"; maybe change it to "resetting", since we would probably only store a hashed password.

- Registration, send email to inform if registration succeeded
- Account Login/Logout
- Password changing, find back
- Download SSL keys
Collaborator


What are the SSL keys for? I thought authentication is currently done via token?

- Datasets
- Public Dataset viewing
- Upload/Download private datasets
- Share datasets
Collaborator


Maybe this needs to be more specific: does it mean the dataset can be shared with anyone by a link, or just set visible to a certain group (similar to the Unix read file permission)?


## Account Management

The account management page is designed to satisfy multi-tenant use cases. One account should have a unique account ID for login, and this account owns one access key to one unique [Kubernetes namespace](https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/) in the cluster. Multiple users can log in to this account ID and operate jobs and data files. Only the "master user" can make modifications like increasing quota or managing account settings.
Collaborator


"log in to this account ID" or log in to its own account which belongs to the group. If so, "master user" could be "group owner".

```python
my_metric = my_metric_graph(output, label)
my_metric_value = output

draw_board(cost, evaluator)
```
Collaborator


I think draw_board should take only one variable that returns a scalar, and an optional name. E.g.,

draw_board(evaluator, "evaluate result")


1. model `tar.gz` files to the cloud.
1. inference network configuration in `.proto` format or user can also define the network in Python in the web page.
1. number of CPU/GPU resource in total to use for serving the model, the more resource there is, the more concurrent calls can be served.
Collaborator


Should we change "number of CPU/GPU resource" to the number of instances plus CPU / Mem / GPU per instance? Otherwise it's hard for us to figure out how many instances to run (we don't know the model's properties or the user's serving requirements).

Collaborator


Agree with @helinwang. Additionally, we can calculate the total resource usage and display it on the web site.
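Following the suggestion above, the launch dialogue would collect the number of instances plus per-instance CPU/Mem/GPU, and the backend would turn that into a Deployment. A minimal sketch; the image name, argument flags, and resource keys are assumptions for illustration:

```python
def serving_deployment(name, model_path, replicas, cpu, mem, gpu=0):
    # Per-instance resource limits; "nvidia.com/gpu" is the standard
    # Kubernetes GPU resource name.
    resources = {"cpu": str(cpu), "memory": mem}
    if gpu:
        resources["nvidia.com/gpu"] = str(gpu)
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"serving": name}},
            "template": {
                "metadata": {"labels": {"serving": name}},
                "spec": {"containers": [{
                    "name": "paddle-serving",
                    "image": "docker.paddlepaddle.org/serving",  # assumed image
                    "args": ["--model", model_path],             # assumed flag
                    "resources": {"limits": resources},
                }]},
            },
        },
    }
```

With explicit per-instance limits, the site can also sum `replicas × limits` per account to display the total resource usage mentioned in the reply.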

- Performance Monitoring
- Quota Monitoring
- Datasets
- Public Dataset viewing
Collaborator


Dataset => dataset

- Serving
- Submit serving instances
- Deactivate serving
- Serving performance monitoring
Collaborator


I think we also need a feature to scale serving instances, and we can use HPA (Horizontal Pod Autoscaler) to implement auto-scaling.


<img src="pictures/notebook.png" width="500px" align="center">

Users can write a program in python in the web page and save their programs, which will be saved at cloud storage. Users also can run a script like below to submit a cluster training job:
Collaborator


python => Python


A web page containing a table to list jobs satisfying user's filters. The user can only list jobs that were submitted by themselves.

| jobname | start time | age | success | fails | actions |
Collaborator


Maybe we need more information on the job list, such as PS_READY, PS_TOTAL, TRAINER_READY, TRAINER_TOTAL.


Datasets and Models are quite the same; both are like a simple file management and sharing service.

- file listing and viewing page
Collaborator


file => File


## Datasets and Models

Datasets and Models are quite the same; both are like a simple file management and sharing service.
Collaborator


Maybe we can add more information about the file sharing service, such as whether we can share files between users or namespaces, or just publish a link?


Clicking the "Launch" button on this web page will pop up a modal dialogue to configure the job:

1. model `tar.gz` files to the cloud.
Collaborator


Capitalize the first letter, same as below:
model tar.gz files to the cloud
=>
The path of model files with suffix `tar.gz` on the cloud.



@typhoonzero
Collaborator Author

Closing, will reopen if we are going to do this work.

@typhoonzero typhoonzero closed this Nov 1, 2017