Mounting s3 hosted model files using s3fs is causing startup issues #765

Status: Open
robert-moyai opened this issue Feb 28, 2025 · 1 comment
Labels: area/lora, kind/bug (Something isn't working), triage/accepted (Indicates an issue or PR is ready to be actively worked on)

Comments

@robert-moyai

I want to serve models stored in a self-managed S3 bucket. I got it working by downloading the files to local storage first, but referencing the models by their S3 path makes vLLM crash, because it interprets the S3 URL as a Hugging Face model repo (https://aibrix.readthedocs.io/latest/features/lora-dynamic-loading.html#model-registry).
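
For context, here is a rough sketch of the difference (bucket name and paths are placeholders, not my real setup). With the default load format, vLLM treats any model argument that is not an existing local directory as a Hugging Face repo id, so the s3:// form never resolves:

$ # Fails with the default load format: the s3:// string is looked up as a Hugging Face repo id
$ vllm serve s3://my-model-bucket/base-model/llama-3.1-8B-instruct

$ # Works: copy the weights to local storage first, then point vLLM at the filesystem path
$ aws s3 sync s3://my-model-bucket/base-model/llama-3.1-8B-instruct /models/llama-3.1-8B-instruct
$ vllm serve /models/llama-3.1-8B-instruct --served-model-name llama-3-1-8b-instruct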

No problem, I thought: I can just mount the S3 bucket with s3fs. The mount itself works fine, but when the vLLM engine starts it tries to read the model weights as local files, and every byte has to be pulled from S3 through s3fs, which adds a lot of latency compared to native storage, so startup stalls. I've attached the startup log from inside my pod below.
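
For reference, this is roughly how the bucket is mounted (bucket name, endpoint, and paths are placeholders, and the exact options are an assumption rather than my real command). Every read of the safetensors shards then goes through FUSE and the network, which is where the stall in the log comes from:

$ # Mount the self-managed bucket with s3fs-fuse; credentials come from a passwd file
$ s3fs my-model-bucket /mnt/s3 \
    -o passwd_file=/etc/s3fs/.passwd-s3fs \
    -o url=https://s3.my-endpoint.example.com \
    -o use_path_request_style \
    -o use_cache=/var/cache/s3fs    # optional on-disk cache; the first read is still network-bound
$ ls /mnt/s3/base-model/llama-3.1-8B-instruct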

$ kubectl logs -f llama-3-1-8b-instruct-65fbb7f8b7-rzzvb -c vllm-openai
INFO 02-28 05:36:43 __init__.py:183] Automatically detected platform cuda.
WARNING 02-28 05:36:44 api_server.py:630] Lora dynamic loading & unloading is enabled in the API server. This should ONLY be used for local development!
INFO 02-28 05:36:44 api_server.py:838] vLLM API server version 0.7.1
INFO 02-28 05:36:44 api_server.py:839] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='warning', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='/mnt/s3/base-model/llama-3.1-8B-instruct', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=12288, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=True, enable_lora_bias=False, max_loras=1, max_lora_rank=64, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['llama-3-1-8b-instruct'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
INFO 02-28 05:36:44 api_server.py:204] Started engine process with PID 29
INFO 02-28 05:36:47 __init__.py:183] Automatically detected platform cuda.
WARNING 02-28 05:36:47 api_server.py:630] Lora dynamic loading & unloading is enabled in the API server. This should ONLY be used for local development!
INFO 02-28 05:36:49 config.py:526] This model supports multiple tasks: {'score', 'embed', 'reward', 'generate', 'classify'}. Defaulting to 'generate'.
INFO 02-28 05:36:52 config.py:526] This model supports multiple tasks: {'reward', 'classify', 'score', 'generate', 'embed'}. Defaulting to 'generate'.
INFO 02-28 05:36:52 llm_engine.py:232] Initializing a V0 LLM engine (v0.7.1) with config: model='/mnt/s3/base-model/llama-3.1-8B-instruct', speculative_config=None, tokenizer='/mnt/s3/base-model/llama-3.1-8B-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=12288, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=llama-3-1-8b-instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, 
INFO 02-28 05:36:53 cuda.py:235] Using Flash Attention backend.
INFO 02-28 05:36:54 model_runner.py:1111] Starting to load model /mnt/s3/base-model/llama-3.1-8B-instruct...
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
@Jeffwan added the kind/bug, area/lora, and triage/accepted labels on Mar 1, 2025
@robert-moyai (Author) commented Mar 4, 2025 via email
