# llama : refactor llama_build_graph to reduce code duplication #3382
Comments
Something I am thinking we should consider in the scope of this issue is decoupling the graph construction from the input setup, so that the build functions do not depend on the allocator state. For example:

```cpp
// current llm_build_llama()
struct ggml_tensor * inp_tokens = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, n_tokens);
ggml_allocr_alloc(lctx.alloc, inp_tokens);
if (!ggml_allocr_is_measure(lctx.alloc)) {
    memcpy(inp_tokens->data, batch.token, n_tokens*ggml_element_size(inp_tokens));
}
ggml_set_name(inp_tokens, "inp_tokens");

// ------

// new llm_build_llama()
struct ggml_tensor * inp_tokens = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, n_tokens);
ggml_set_name(inp_tokens, "inp_tokens");

// new llm_setup_llama()
struct ggml_tensor * inp_tokens = ggml_get_tensor(ctx, "inp_tokens");
ggml_allocr_alloc(lctx.alloc, inp_tokens);
if (!ggml_allocr_is_measure(lctx.alloc)) {
    memcpy(inp_tokens->data, batch.token, n_tokens*ggml_element_size(inp_tokens));
}
```

Having build functions that do not rely on the state of the allocator would facilitate some things around estimating the required memory. (cc @slaren for thoughts)
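As a rough sketch of how the split could look from the caller's side (the function names `llm_build_llama`/`llm_setup_llama` are taken from the comment above; everything else is illustrative, not the current API):

```cpp
struct ggml_cgraph * gf = llm_build_llama(lctx, batch); // pure graph definition, no allocator state
ggml_allocr_alloc_graph(lctx.alloc, gf);                // measure or allocate the graph tensors
llm_setup_llama(lctx, batch);                           // copy the input data into the allocated tensors
```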
I think it would be good to pre-allocate all the input and output tensors in a different buffer. In this way, these tensors would always be allocated, and the calls to `ggml_allocr_alloc` and the `ggml_allocr_is_measure` checks would not be needed in the build functions. This was already in the first version of …
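A minimal sketch of that idea, assuming a dedicated `ggml_context` whose buffer permanently holds the input tensors (`n_ctx`, `n_tokens`, and the buffer sizing are illustrative):

```cpp
// create the input tensors once, in their own context/buffer
struct ggml_init_params iparams = {
    /*.mem_size   =*/ ggml_tensor_overhead()*8 + n_ctx*sizeof(int32_t),
    /*.mem_buffer =*/ NULL,  // let ggml allocate the buffer
    /*.no_alloc   =*/ false, // tensor data lives in this buffer for the lifetime of the context
};
struct ggml_context * ctx_input = ggml_init(iparams);

struct ggml_tensor * inp_tokens = ggml_new_tensor_1d(ctx_input, GGML_TYPE_I32, n_ctx);
ggml_set_name(inp_tokens, "inp_tokens");

// per decode: the tensor is always allocated, so no ggml_allocr_* calls are needed
memcpy(inp_tokens->data, batch.token, n_tokens*ggml_element_size(inp_tokens));
```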
I don't write much C++, but I'm happy to take a stab at this.
As support for new model architectures is added, we are starting to observe a lot of repeating patterns in the code that builds their compute graphs. We should find a way to refactor and reuse the repetitive code. We should also consider splitting the implementation into separate source files if necessary.
https://github.com/ggerganov/llama.cpp/blob/0e76a8992c8200237bbc6471a53fb8796b3872f7/llama.cpp#L3997-L4026
Open to ideas and suggestions
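One possible direction (a hedged sketch, not an API proposal): a small set of shared helpers could absorb the per-layer patterns that nearly every architecture repeats, such as the RMS-norm + scale step. The helper name and the field names in the usage comment are hypothetical:

```cpp
// hypothetical shared helper - this norm pattern is nearly identical across
// the supported architectures and could be defined once:
static struct ggml_tensor * llm_build_norm(
        struct ggml_context * ctx,
        struct ggml_tensor  * cur,
        struct ggml_tensor  * norm_w,
        float                 eps) {
    cur = ggml_rms_norm(ctx, cur, eps);  // normalize
    cur = ggml_mul(ctx, cur, norm_w);    // scale by the per-layer norm weights
    return cur;
}

// usage inside each architecture's build function:
// cur = llm_build_norm(ctx0, inpL, model.layers[il].attn_norm, norm_rms_eps);
```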