Mounting s3 hosted model files using s3fs is causing startup issues #765

Status: Open
robert-moyai opened this issue Feb 28, 2025 · 1 comment
Labels: area/lora, kind/bug (Something isn't working), triage/accepted (Indicates an issue or PR is ready to be actively worked on)

Comments

@robert-moyai

I want to serve models stored in a self-managed S3 bucket. I got it working by downloading the files to local storage first, but referencing the models by their S3 path makes vLLM crash, because it interprets the S3 URL as a Hugging Face model repo (https://aibrix.readthedocs.io/latest/features/lora-dynamic-loading.html#model-registry).
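
For context, here is a rough sketch of the difference (bucket name and paths are placeholders, not my real setup). With the default load format, vLLM treats any model argument that is not an existing local directory as a Hugging Face repo id, so the s3:// form never resolves:

$ # Fails with the default load format: the s3:// string is looked up as a Hugging Face repo id
$ vllm serve s3://my-model-bucket/base-model/llama-3.1-8B-instruct

$ # Works: copy the weights to local storage first, then point vLLM at the filesystem path
$ aws s3 sync s3://my-model-bucket/base-model/llama-3.1-8B-instruct /models/llama-3.1-8B-instruct
$ vllm serve /models/llama-3.1-8B-instruct --served-model-name llama-3-1-8b-instruct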

No problem, I thought: I can just mount the S3 bucket with s3fs. The mount itself works fine, but when the vLLM engine starts it tries to read the model weights as local files, and every byte has to be pulled from S3 through s3fs, which adds a lot of latency compared to native storage, so startup stalls. I've attached the startup log from inside my pod below.
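
For reference, this is roughly how the bucket is mounted (bucket name, endpoint, and paths are placeholders, and the exact options are an assumption rather than my real command). Every read of the safetensors shards then goes through FUSE and the network, which is where the stall in the log comes from:

$ # Mount the self-managed bucket with s3fs-fuse; credentials come from a passwd file
$ s3fs my-model-bucket /mnt/s3 \
    -o passwd_file=/etc/s3fs/.passwd-s3fs \
    -o url=https://s3.my-endpoint.example.com \
    -o use_path_request_style \
    -o use_cache=/var/cache/s3fs    # optional on-disk cache; the first read is still network-bound
$ ls /mnt/s3/base-model/llama-3.1-8B-instruct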

$ kubectl logs -f llama-3-1-8b-instruct-65fbb7f8b7-rzzvb -c vllm-openai
INFO 02-28 05:36:43 __init__.py:183] Automatically detected platform cuda.
WARNING 02-28 05:36:44 api_server.py:630] Lora dynamic loading & unloading is enabled in the API server. This should ONLY be used for local development!
INFO 02-28 05:36:44 api_server.py:838] vLLM API server version 0.7.1
INFO 02-28 05:36:44 api_server.py:839] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='warning', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='/mnt/s3/base-model/llama-3.1-8B-instruct', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=12288, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=True, enable_lora_bias=False, max_loras=1, max_lora_rank=64, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['llama-3-1-8b-instruct'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
INFO 02-28 05:36:44 api_server.py:204] Started engine process with PID 29
INFO 02-28 05:36:47 __init__.py:183] Automatically detected platform cuda.
WARNING 02-28 05:36:47 api_server.py:630] Lora dynamic loading & unloading is enabled in the API server. This should ONLY be used for local development!
INFO 02-28 05:36:49 config.py:526] This model supports multiple tasks: {'score', 'embed', 'reward', 'generate', 'classify'}. Defaulting to 'generate'.
INFO 02-28 05:36:52 config.py:526] This model supports multiple tasks: {'reward', 'classify', 'score', 'generate', 'embed'}. Defaulting to 'generate'.
INFO 02-28 05:36:52 llm_engine.py:232] Initializing a V0 LLM engine (v0.7.1) with config: model='/mnt/s3/base-model/llama-3.1-8B-instruct', speculative_config=None, tokenizer='/mnt/s3/base-model/llama-3.1-8B-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=12288, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=llama-3-1-8b-instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, 
INFO 02-28 05:36:53 cuda.py:235] Using Flash Attention backend.
INFO 02-28 05:36:54 model_runner.py:1111] Starting to load model /mnt/s3/base-model/llama-3.1-8B-instruct...
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
@Jeffwan added the kind/bug, area/lora, and triage/accepted labels on Mar 1, 2025
@robert-moyai (Author) commented Mar 4, 2025 via email
