[Improvement] Optimize data flushing and memory usage for huge partitions to improve stability #378
Comments
After discussing with my mentor, I'm raising a new proposal. Feel free to discuss more. @xianjingfeng @jerqi
What's the advantage of this strategy compared with the current one?
Reasons have been attached above.
My concern about this strategy is that we need to maintain more partition metadata. Will it add more GC pressure?
The current strategy already picks the largest partitions to flush to HDFS. You can see our implementation.
Yes, I have tested this implementation, but it looks ineffective when falling back to HDFS. That's not a big problem, though. The key point of the proposal is to avoid putting too much data into HDFS: we don't want to store shuffle data with 3 replicas in HDFS, and sometimes the performance is not good.
You can set the data replica number with the Hadoop configuration.
This will put a great burden on HDFS operations.
You misunderstand me. The replica number is a client-side configuration; we only modify Spark's client. See https://github.com/apache/incubator-uniffle/blob/884921b6d7c5452f6f3548ebc87840071c6be81f/proto/src/main/proto/Rss.proto#L422 for details.
Let me take a look.
The data written to HDFS must be larger than a certain size. If we write small chunks, it will hurt the performance of HDFS.
I know this, so I hope to use HDFS as little as possible, enabling it only for huge partition data.
If you only write data when the partition size > 50 GB and the event size > 64 MB, I feel that's OK.
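For illustration, the condition being suggested could look like the sketch below; the 50 GB and 64 MB values are only the numbers floated in this thread, and the helper class is hypothetical, not actual Uniffle code.

```java
// Hypothetical helper illustrating the rule discussed above: send a flush event to HDFS
// only when the partition is already huge AND the event itself is big enough to be
// HDFS-friendly. The thresholds are the values mentioned in this thread, not defaults.
public final class ColdStorageRule {
  private static final long HUGE_PARTITION_THRESHOLD = 50L * 1024 * 1024 * 1024; // 50 GB
  private static final long MIN_HDFS_EVENT_SIZE = 64L * 1024 * 1024;             // 64 MB

  public static boolean flushToHdfs(long partitionTotalSize, long eventSize) {
    return partitionTotalSize > HUGE_PARTITION_THRESHOLD && eventSize > MIN_HDFS_EVENT_SIZE;
  }

  private ColdStorageRule() {}
}
```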
In our production environment the disk is sufficient most of the time. It is also difficult to set a threshold on partition size, and doing so would reduce disk utilization. So we prefer to write to HDFS only when the disk is full.
@jerqi Here is our scenario: we co-located Uniffle shuffle servers with our YARN NodeManagers, so the shuffle servers have limited memory (10-20 GB) and limited disk space (1-5 TB). However, there can be as many as 1k shuffle tasks on a shuffle server at the same time, with each task taking up only 1-2 MB of memory on average. When a super large shuffle partition comes in, one of the disks quickly reaches its high watermark and memory starts to flush directly to HDFS. Since each task holds only a small amount (1-2 MB) in memory, it may flush (append) to HDFS with only 1-2 MB at a time even if the configured flush threshold is much larger. IMO, the flush threshold works only with two preconditions:
If either condition is not met, it's better to have another mechanism to better utilize the limited memory/disk resources, instead of wasting them on an uncommonly large partition. Please correct me if I'm wrong.
Yes, you're right. But if we only write 2 MB to HDFS at a time, HDFS will have poor performance.
Yes, it is difficult to set the threshold of partition size. I still think the existing mechanism can hardly handle a huge single partition, especially in our scenario mentioned by @ChengbingLiu. I want to share some practices that tried to solve this problem but failed.

First case: I enabled the storage type of MEMORY_LOCALFILE_HDFS with the fallback strategy of LocalStorageManagerFallbackStrategy. When a huge single partition came in, it quickly made the disk reach the high watermark, and the server started to flush to HDFS directly. Unfortunately, because this huge partition occupied a large amount of memory (6 GB), the shuffle dataFlushEvent was large and took a long time to consume. This caused other jobs to be unable to write data normally. The picture is shown below.

Second case: After the above practice, I changed the flush conf so that the huge partition produced several smaller dataFlushEvents. Unfortunately, this was still useless. Although the huge partition generated several dataFlushEvents waiting to be flushed to a single HDFS file, its writerHandler held the write lock and wrote them one by one. So the flush thread pool was busy but the writing speed was still slow. The case's picture is shown below.

Possible solution: The above problems are hard to avoid in a production environment. I think we need to introduce a new strategy to solve them. The following solutions are just a draft.
Maybe we could introduce multi-threaded writing to HDFS. If the file is too big, we could split it into multiple files. ByteDance CSS has a similar concept: if a file exceeds the size limit, it opens and writes another file.
Yes. The key problem is the low speed of writing a single data file. Let me take a look. But I think writing another file is not a good solution; it won't improve the write concurrency for multiple events of the same partition.
I mean that we can write multiple files at the same time.
Got it. That's an optional solution: the flush speed could improve to (HDFS speed) * (file number). And I think we also need to limit the huge-partition app's writing speed to avoid affecting other jobs.
VIP Shop has similar ideas. cc @Gustfh
I feel that it's necessary, because it will be too slow if the server only uses a single thread to flush a large partition, even though we have multiple servers.
Makes sense. I think I will do this optimization.
Drafted a PR: #396. This PR only solves the problem mentioned in the second case.
…DFS storage (#396)

### What changes were proposed in this pull request?
1. Introduce the `PooledHdfsShuffleWriteHandler` to support writing a single partition to multiple HDFS files concurrently.

### Why are the changes needed?
As mentioned in #378 (comment), the writing speed to HDFS is too slow and writes cannot happen concurrently. Especially when a huge partition exists, this slows down other apps due to the limited memory. So improving the writing speed is an important factor in flushing the huge partition to HDFS quickly.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
1. UTs
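The idea behind this PR can be shown with a small conceptual sketch (this is not the actual `PooledHdfsShuffleWriteHandler`; the class name, pool size, and file naming below are assumptions): each huge partition gets a small pool of HDFS files, and concurrent flush events borrow whichever file is free instead of serializing behind one write lock.

```java
// Conceptual sketch only, not the real PooledHdfsShuffleWriteHandler: keep a small pool
// of per-partition HDFS files so concurrent flush events of the same partition do not
// serialize behind a single file's write lock. Paths and pool size are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PooledPartitionWriterSketch implements AutoCloseable {
  private final BlockingQueue<FSDataOutputStream> pool;

  public PooledPartitionWriterSketch(Configuration conf, String partitionDir, int concurrency)
      throws IOException, InterruptedException {
    FileSystem fs = FileSystem.get(conf);
    pool = new ArrayBlockingQueue<>(concurrency);
    for (int i = 0; i < concurrency; i++) {
      // One physical file per slot; readers later scan all files of the partition.
      pool.put(fs.create(new Path(partitionDir, "partition_data_" + i)));
    }
  }

  // Each flush thread borrows a free file, appends its event, and returns the file,
  // so up to `concurrency` events of the same partition can be written in parallel.
  public void flush(byte[] eventData) throws IOException, InterruptedException {
    FSDataOutputStream out = pool.take();
    try {
      out.write(eventData);
    } finally {
      pool.put(out);
    }
  }

  @Override
  public void close() throws IOException {
    FSDataOutputStream out;
    while ((out = pool.poll()) != null) {
      out.close();
    }
  }
}
```

The point of the sketch is only that flush throughput can scale roughly with the number of files written in parallel, matching the (HDFS speed) * (file number) estimate earlier in the thread.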
If the huge partition has a high writing speed while flushing is slow, most of the memory will be occupied by the huge partition's app. To keep the other apps stable, I prefer limiting the memory usage of huge partitions. WDYT? @jerqi cc @advancedxy
This issue will track all the optimizations for huge partitions, and all sub-tasks will be linked to it. The solution for handling huge partitions is to flush them to HDFS directly and limit their memory usage; the subtasks are as follows.
cc @jerqi @advancedxy I will create some sub-task issues and PRs; feel free to discuss more.
It's OK for me.
Missed this comment. LGTM.
…ata flush (#471)

### What changes were proposed in this pull request?
1. Introduce a memory usage limit for huge partitions to keep regular partition writing stable.
2. Once a partition is marked as a huge partition, when its buffer size is greater than the `rss.server.single.buffer.flush.threshold` value, a single-buffer flush will be triggered whether single-buffer flush is enabled or not.

### Why are the changes needed?
To solve the problems mentioned in #378.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
1. UTs
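Roughly, the behavior described above can be sketched as follows (a minimal sketch; the class name and the concrete threshold values are assumptions based on the numbers discussed in this thread, not the actual Uniffle configuration defaults):

```java
// Illustrative sketch only: the class name, thresholds and method names are assumptions
// based on the values discussed in this thread, not the real Uniffle server objects.
public class HugePartitionPolicySketch {
  // A partition counts as "huge" once its cumulative size passes this limit.
  private final long hugePartitionSizeThreshold = 50L * 1024 * 1024 * 1024; // 50 GB (assumed)
  // Mirrors rss.server.single.buffer.flush.threshold: flush a buffer once it grows this big.
  private final long singleBufferFlushThreshold = 128L * 1024 * 1024;       // 128 MB (assumed)
  // Cap on how much server memory one huge partition may hold before new writes are refused.
  private final long hugePartitionMemoryLimit = 64L * 1024 * 1024;          // 64 MB (assumed)

  public boolean isHugePartition(long partitionTotalSize) {
    return partitionTotalSize > hugePartitionSizeThreshold;
  }

  // Huge partitions trigger single-buffer flush regardless of the global switch;
  // normal partitions only do so when single-buffer flush is enabled.
  public boolean shouldFlushBuffer(long partitionTotalSize, long bufferSize, boolean singleBufferFlushEnabled) {
    boolean overThreshold = bufferSize > singleBufferFlushThreshold;
    return isHugePartition(partitionTotalSize) ? overThreshold : (singleBufferFlushEnabled && overThreshold);
  }

  // Backpressure: refuse new data for a huge partition that already holds too much memory.
  public boolean shouldRejectWrite(long partitionTotalSize, long partitionMemoryUsed) {
    return isHugePartition(partitionTotalSize) && partitionMemoryUsed > hugePartitionMemoryLimit;
  }
}
```

The memory limit applies backpressure to the huge partition's app so that regular apps can still acquire buffer space, which is the stability goal stated in the PR description.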
Closed. Feel free to reopen it if you have any problems.
…ategy (#621)

### What changes were proposed in this pull request?
1. Introduce a storage manager selector to support more selection strategies for `MultiStorageManager`.
2. Introduce the conf of `rss.server.multistorage.manager.selector.class` to support different flush strategies, e.g. huge partitions flushed directly to HDFS while normal partitions are flushed to local disk when single-buffer flush is enabled.

### Why are the changes needed?
Solving the problem mentioned in #378 (comment). In the current codebase, when encountering a huge partition with single-buffer flush enabled, normal partition data will also be flushed to HDFS (which I don't want, because the local disk is free and its flushing speed is faster than HDFS). But if single-buffer flush is disabled, the huge partition's events created before it is marked as huge may be big, which causes slow flushing and then failures to acquire allocated buffers. Based on the above problems, this PR makes a single event carrying about 100 MB flush to HDFS or a local file, leveraging the conf of `rss.server.multistorage.manager.selector.class`.

### Does this PR introduce _any_ user-facing change?
Yes. Doc will be updated later.

### How was this patch tested?
1. UTs
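The selection idea can be illustrated with a minimal sketch (the class, method, and the 100 MB constant below are illustrative assumptions, not the actual selector classes configured via `rss.server.multistorage.manager.selector.class`): large events from huge partitions go to HDFS, while everything else keeps using the faster local disks.

```java
// Illustrative sketch of a per-event storage choice; not the actual Uniffle selector classes.
public class FlushTargetSelectorSketch {
  enum StorageTarget { LOCAL_DISK, HDFS }

  // Assumed from this thread: single events of roughly 100 MB go to cold storage.
  private static final long HUGE_EVENT_SIZE = 100L * 1024 * 1024;

  public StorageTarget select(boolean fromHugePartition, long eventSize) {
    // Huge-partition data is pushed to HDFS so the local disks stay free (and fast)
    // for the many small, regular partitions.
    if (fromHugePartition && eventSize >= HUGE_EVENT_SIZE) {
      return StorageTarget.HDFS;
    }
    return StorageTarget.LOCAL_DISK;
  }
}
```

Compared with routing everything to HDFS once single-buffer flush is on, a per-event selector keeps regular partitions on the faster local disks, which is the motivation stated in the PR description.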
### What changes were proposed in this pull request?
Fix the metric huge_partition_num.

### Why are the changes needed?
A follow-up PR for: #494.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing UTs.
What would you like to be improved?
Background
In our internal Uniffle cluster, when using the storage type of MEMORY_LOCALFILE, a few single huge partitions make the disk reach its high watermark, which takes the whole shuffle server out of service due to full memory occupation.

Possible solutions
1. Use the storage type of MEMORY_LOCALFILE_HDFS with the fallback strategy of LocalStorageManagerFallbackStrategy ([ISSUE-163][FEATURE] Write to hdfs when local disk can't be write #235).
2. Introduce a new strategy of dynamic disk selection ([Improvement] Optimize local disk selection strategy #373).
3. Introduce a partition-size-based strategy to flush single huge partition data to HDFS.

Currently, I prefer the 3rd solution, in which we could use cold storage (HDFS) when the cumulative size of a particular partition is above a specific threshold (like 50 GB?). Actually, if we exclude the few huge partitions, the disk free ratio of the whole shuffle server is 20%-30%.

Reasons
- We could avoid a hotspot on a single shuffle server by using HDFS to distribute the pressure. This is especially useful when the disk space of shuffle servers is limited and cannot hold a super large shuffle partition.
- Compared with the 1st solution, it is more restrained in its HDFS use and will not cause greater pressure. (Actually, we don't want to accept much shuffle data into HDFS with 3 replicas.)
- Compared with the 2nd solution, it is more effective and has better performance. It only downgrades the storage of the few huge-partition jobs to HDFS, which is effective especially when the HDFS cluster performance is not good.

Final solution
The solution for handling huge partitions is to flush them to HDFS directly and limit their memory usage; the subtasks are as follows.