This repository was archived by the owner on Aug 30, 2024. It is now read-only.

[Neural Speed] Enable StableLM-2-12B #253

Merged: 16 commits into intel:main from stablelm2 on May 15, 2024

Conversation

@aahouzi (Member) commented May 10, 2024

Type of Change

Stability.ai open-sourced StableLM-2-12B, which has a different architecture from its 1.6B and 3B counterparts. This PR adds support for these models: stabilityai/stablelm-2-12b and stabilityai/stablelm-2-12b-chat.
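For context, here is a minimal sketch of how the newly enabled model could be exercised through Neural Speed's Python API, following the pattern in the repository README. The weight_dtype/compute_dtype values are illustrative assumptions (weight_dtype="int4" roughly corresponds to the q4_0 weights used in the tests below), not settings taken from this PR:

```python
# Minimal sketch based on the Neural Speed README pattern; quantization
# settings are illustrative assumptions, not part of this PR.
from transformers import AutoTokenizer, TextStreamer
from neural_speed import Model

model_name = "stabilityai/stablelm-2-12b"  # model enabled by this PR
prompt = "Building a website can be done in 10 simple steps:"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = Model()
model.init(model_name, weight_dtype="int4", compute_dtype="int8")
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=256)
```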

Description

  • Uses GQA instead of MHA, plus a parallel MLP layer and per-head qk_normalization (a minimal sketch of these three features follows this list)
  • Model description: StableLM-2-12B
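Below is a minimal PyTorch sketch (not the Neural Speed kernels, and not code from this PR) of how the three features fit together in one decoder block. Shapes follow the 12B hparams printed in the logs below (n_embd=5120, n_head=32, n_head_kv=8, n_embd_head_k=160, ffn_hidden_size=13824). For brevity the sketch shares one LayerNorm across heads for qk_normalization, whereas the real model keeps separate per-head weights; RoPE and the KV cache are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelDecoderBlock(nn.Module):
    """Simplified StableLM-2-12B-style block: GQA attention and the MLP run
    in parallel off one shared LayerNorm, and both outputs are added to the
    residual stream."""

    def __init__(self, n_embd=5120, n_head=32, n_head_kv=8,
                 ffn_hidden=13824, eps=1e-5):
        super().__init__()
        self.n_head, self.n_head_kv = n_head, n_head_kv
        self.head_dim = n_embd // n_head  # 160, matching n_embd_head_k in the logs
        self.norm = nn.LayerNorm(n_embd, eps=eps)
        # GQA: 32 query heads but only 8 kv heads, so K/V projections are 4x smaller
        self.wq = nn.Linear(n_embd, n_head * self.head_dim, bias=False)
        self.wk = nn.Linear(n_embd, n_head_kv * self.head_dim, bias=False)
        self.wv = nn.Linear(n_embd, n_head_kv * self.head_dim, bias=False)
        self.wo = nn.Linear(n_head * self.head_dim, n_embd, bias=False)
        # per-head qk_normalization (weights shared across heads here for
        # brevity; the real model keeps a separate norm per head)
        self.q_norm = nn.LayerNorm(self.head_dim, eps=eps)
        self.k_norm = nn.LayerNorm(self.head_dim, eps=eps)
        # gated MLP branch, evaluated in parallel with attention
        self.gate = nn.Linear(n_embd, ffn_hidden, bias=False)
        self.up = nn.Linear(n_embd, ffn_hidden, bias=False)
        self.down = nn.Linear(ffn_hidden, n_embd, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        h = self.norm(x)  # one shared input norm feeds both branches
        q = self.wq(h).view(b, t, self.n_head, self.head_dim)
        k = self.wk(h).view(b, t, self.n_head_kv, self.head_dim)
        v = self.wv(h).view(b, t, self.n_head_kv, self.head_dim)
        q, k = self.q_norm(q), self.k_norm(k)  # normalize each head's q/k vector
        # expand the 8 kv heads so each serves a group of 4 query heads
        k = k.repeat_interleave(self.n_head // self.n_head_kv, dim=2)
        v = v.repeat_interleave(self.n_head // self.n_head_kv, dim=2)
        q, k, v = (z.transpose(1, 2) for z in (q, k, v))  # (b, heads, t, head_dim)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn_out = self.wo(attn.transpose(1, 2).reshape(b, t, -1))
        mlp_out = self.down(F.silu(self.gate(h)) * self.up(h))
        return x + attn_out + mlp_out  # parallel residual: both branches added
```

One payoff of GQA is visible in the logs below: with only 8 kv heads cached per layer, the 12B model's KV cache stays in the same ballpark as the much smaller 1.6B model's (compare the "kv self size" lines).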

How has this PR been tested?

  • StableLM-2-12B:
(neural-speed) C:\Users\Intel\Desktop\aahouzi\neural-speed>set NEURAL_SPEED_VERBOSE=1 && build\bin\Release\run_stablelm.exe -m stablelm12b-q4_0.bin -n 256 -p "Building a website can be done in 10 simple steps:"
Welcome to use the stablelm on the ITREX!
main: seed  = 1715364662
AVX:1 AVX2:1 AVX512F:1 AVX_VNNI:1 AVX512_VNNI:1 AMX_INT8:1 AMX_BF16:1 AVX512_BF16:1 AVX512_FP16:1
model_file_loader: loading model from stablelm12b-q4_0.bin
Loading the bin file with NE format...
load_ne_hparams  0.hparams.n_vocab = 100352
load_ne_hparams  1.hparams.n_embd = 5120
load_ne_hparams  2.hparams.n_mult = 0
load_ne_hparams  3.hparams.n_head = 32
load_ne_hparams  4.hparams.n_head_kv = 8
load_ne_hparams  5.hparams.n_layer = 40
load_ne_hparams  6.hparams.n_rot = 40
load_ne_hparams  7.hparams.ftype = 1
load_ne_hparams  8.hparams.max_seq_len = 4096
load_ne_hparams  9.hparams.alibi_bias_max = 0.000
load_ne_hparams  10.hparams.clip_qkv = 0.000
load_ne_hparams  11.hparams.par_res = 0
load_ne_hparams  12.hparams.word_embed_proj_dim = 0
load_ne_hparams  13.hparams.do_layer_norm_before = 0
load_ne_hparams  14.hparams.multi_query_group_num = 0
load_ne_hparams  15.hparams.ffn_hidden_size = 13824
load_ne_hparams  16.hparams.inner_hidden_size = 0
load_ne_hparams  17.hparams.n_experts = 0
load_ne_hparams  18.hparams.n_experts_used = 0
load_ne_hparams  19.hparams.n_embd_head_k = 160
load_ne_hparams  20.hparams.norm_eps = 0.000010
load_ne_hparams  21.hparams.freq_base = 10000.000
load_ne_hparams  22.hparams.freq_scale = 1.000
load_ne_hparams  23.hparams.rope_scaling_factor = 1.000
load_ne_hparams  24.hparams.original_max_position_embeddings = 0
load_ne_hparams  25.hparams.use_yarn = 0
load_ne_vocab    26.vocab.bos_token_id = 100257
load_ne_vocab    27.vocab.eos_token_id = 100257
load_ne_vocab    28.vocab.pad_token_id = 0
load_ne_vocab    29.vocab.sep_token_id = 0
init: n_vocab                  = 100352
init: n_embd                   = 5120
init: n_head                   = 32
init: n_layer                  = 40
init: n_ff                     = 13824
init: n_parts                  = 1
init: n_embd      = 5120
init: max_seq_len      = 4096
load: ne ctx size = 5845.07 MB
load: mem required  = 16085.07 MB (+ memory per state)
.............................................................................................
model_init_from_file: support_bestla_kv = 1
model_init_from_file: kv self size =  111.56 MB

system_info: n_threads = 24 / 48 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | F16C = 1 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, tokens_length = 12, n_batch = 512, n_predict = 256, n_keep = 0


Building a website can be done in 10 simple steps: you only need to know what is the process of creating pages, uploading (hosting) those page onto your site’s domain. To do this, we are going … Read More
WordPress Plugins give extensive capabilities and functions that extend beyond their default function as blogging platforms.
There was a time when WordPress’ primary use used solely for blogs only,but nowadays it is not so since the advancement of technology makes themsuitable also fora newspages (suchas online magazines),events sites,ecommerce site(among others).
If you want to build your website/blog,you can do this easily by installing themes and plugins. … Read More
WordPress platform was mainly created for blogging,but nowadays it is not so since there are many WP hosting options.
The primary functions of WordPress (WP)are: easy-to-manage content,immediate publishing capability,long-term storage,and the ability to work with almost any kind/categoriesof blog posts.You can also add/remove/change themes/plugin anytime. … Read More
If you want your website/blogto be easily found by others,you need first build SEO(SEO is a tool that increase site ranking among all sites).Here are some tips for beginners (as well as experienced) inorder to rank better.
The latest version released in mid-2016
model_print_timings:        load time =   268.58 ms
model_print_timings:      sample time =   264.81 ms /   256 runs   (    1.03 ms per token)
model_print_timings: prompt eval time =   267.77 ms /    12 tokens (   22.31 ms per token)
model_print_timings:        eval time = 15677.07 ms /   255 runs   (   61.48 ms per token)
model_print_timings:       total time = 16732.44 ms

Also, I ensured that my code changes don't break inference for previously enabled models:

  • StableLM-2-1.6B:
(neural-speed) C:\Users\Intel\Desktop\aahouzi\neural-speed>set NEURAL_SPEED_VERBOSE=1 && build\bin\Release\run_stablelm.exe -m stablelm16b-q4_0.bin -n 256 -p "Building a website can be done in 10 simple steps:"
Welcome to use the stablelm on the ITREX!
main: seed  = 1715365117
AVX:1 AVX2:1 AVX512F:1 AVX_VNNI:1 AVX512_VNNI:1 AMX_INT8:1 AMX_BF16:1 AVX512_BF16:1 AVX512_FP16:1
model_file_loader: loading model from stablelm16b-q4_0.bin
Loading the bin file with NE format...
load_ne_hparams  0.hparams.n_vocab = 100352
load_ne_hparams  1.hparams.n_embd = 2048
load_ne_hparams  2.hparams.n_mult = 0
load_ne_hparams  3.hparams.n_head = 32
load_ne_hparams  4.hparams.n_head_kv = 32
load_ne_hparams  5.hparams.n_layer = 24
load_ne_hparams  6.hparams.n_rot = 16
load_ne_hparams  7.hparams.ftype = 1
load_ne_hparams  8.hparams.max_seq_len = 4096
load_ne_hparams  9.hparams.alibi_bias_max = 0.000
load_ne_hparams  10.hparams.clip_qkv = 0.000
load_ne_hparams  11.hparams.par_res = 0
load_ne_hparams  12.hparams.word_embed_proj_dim = 0
load_ne_hparams  13.hparams.do_layer_norm_before = 0
load_ne_hparams  14.hparams.multi_query_group_num = 0
load_ne_hparams  15.hparams.ffn_hidden_size = 5632
load_ne_hparams  16.hparams.inner_hidden_size = 0
load_ne_hparams  17.hparams.n_experts = 0
load_ne_hparams  18.hparams.n_experts_used = 0
load_ne_hparams  19.hparams.n_embd_head_k = 64
load_ne_hparams  20.hparams.norm_eps = 0.000010
load_ne_hparams  21.hparams.freq_base = 10000.000
load_ne_hparams  22.hparams.freq_scale = 1.000
load_ne_hparams  23.hparams.rope_scaling_factor = 1.000
load_ne_hparams  24.hparams.original_max_position_embeddings = 0
load_ne_hparams  25.hparams.use_yarn = 0
load_ne_vocab    26.vocab.bos_token_id = 100257
load_ne_vocab    27.vocab.eos_token_id = 100257
load_ne_vocab    28.vocab.pad_token_id = 0
load_ne_vocab    29.vocab.sep_token_id = 0
init: n_vocab                  = 100352
init: n_embd                   = 2048
init: n_head                   = 32
init: n_layer                  = 24
init: n_ff                     = 5632
init: n_parts                  = 1
init: n_embd      = 2048
init: max_seq_len      = 4096
load: ne ctx size =  805.41 MB
load: mem required  = 2853.41 MB (+ memory per state)
............................................................................
model_init_from_file: support_bestla_kv = 1
model_init_from_file: kv self size =  121.50 MB

system_info: n_threads = 32 / 64 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | F16C = 1 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, tokens_length = 12, n_batch = 512, n_predict = 256, n_keep = 0


Building a website can be done in 10 simple steps: 1 – Decide what you want your site to say and who you’d like visitors to become. 2 – Plan the layout, font type and colors used on your site, and add photos if you have any handy.. 3– If you have graphics software installed or a design program you can create your own background image.
When it comes to creating an online store it is very important that you choose the best hosting company. Some people think because they are not using WordPress CMS that they don’t need hosting anymore, but it’s wrong.. WordPress 4.0.1 Hosting Reviews
In today’s article I am going to share some tips on building your first web app from scratch.. 5 Tips On Building Your First Web App From Scratch – The Geek Blog
If you are using Windows 10 and want to disable the auto start feature in OS, then here is how you can do that.. 6 Tips To Turn Off Auto Start Feature In Windows 10 – Tech2sme.. 7 Tips On How To Turn Off Auto Start Feature In Windows 10 – Geek Tonic.. 8 Tips On How To Disable Auto Start Feature In Windows 10 [Part]
How To Install WordPress On Hosting. Some people think because they are not
model_print_timings:        load time =    46.01 ms
model_print_timings:      sample time =   298.55 ms /   256 runs   (    1.17 ms per token)
model_print_timings: prompt eval time =    44.13 ms /    12 tokens (    3.68 ms per token)
model_print_timings:        eval time =  3624.08 ms /   255 runs   (   14.21 ms per token)
model_print_timings:       total time =  4611.76 ms
  • StableLM-3B:
(neural-speed) C:\Users\Intel\Desktop\aahouzi\neural-speed>set NEURAL_SPEED_VERBOSE=1 && build\bin\Release\run_stablelm.exe -m stablelm3b-q4_0.bin -n 256 -p "Building a website can be done in 10 simple steps:"
Welcome to use the stablelm on the ITREX!
main: seed  = 1715365250
AVX:1 AVX2:1 AVX512F:1 AVX_VNNI:1 AVX512_VNNI:1 AMX_INT8:1 AMX_BF16:1 AVX512_BF16:1 AVX512_FP16:1
model_file_loader: loading model from stablelm3b-q4_0.bin
Loading the bin file with NE format...
load_ne_hparams  0.hparams.n_vocab = 50304
load_ne_hparams  1.hparams.n_embd = 2560
load_ne_hparams  2.hparams.n_mult = 0
load_ne_hparams  3.hparams.n_head = 32
load_ne_hparams  4.hparams.n_head_kv = 32
load_ne_hparams  5.hparams.n_layer = 32
load_ne_hparams  6.hparams.n_rot = 20
load_ne_hparams  7.hparams.ftype = 1
load_ne_hparams  8.hparams.max_seq_len = 4096
load_ne_hparams  9.hparams.alibi_bias_max = 0.000
load_ne_hparams  10.hparams.clip_qkv = 0.000
load_ne_hparams  11.hparams.par_res = 0
load_ne_hparams  12.hparams.word_embed_proj_dim = 0
load_ne_hparams  13.hparams.do_layer_norm_before = 0
load_ne_hparams  14.hparams.multi_query_group_num = 0
load_ne_hparams  15.hparams.ffn_hidden_size = 6912
load_ne_hparams  16.hparams.inner_hidden_size = 0
load_ne_hparams  17.hparams.n_experts = 0
load_ne_hparams  18.hparams.n_experts_used = 0
load_ne_hparams  19.hparams.n_embd_head_k = 80
load_ne_hparams  20.hparams.norm_eps = 0.000010
load_ne_hparams  21.hparams.freq_base = 10000.000
load_ne_hparams  22.hparams.freq_scale = 1.000
load_ne_hparams  23.hparams.rope_scaling_factor = 1.000
load_ne_hparams  24.hparams.original_max_position_embeddings = 0
load_ne_hparams  25.hparams.use_yarn = 0
load_ne_vocab    26.vocab.bos_token_id = 0
load_ne_vocab    27.vocab.eos_token_id = 0
load_ne_vocab    28.vocab.pad_token_id = 0
load_ne_vocab    29.vocab.sep_token_id = 0
init: n_vocab                  = 50304
init: n_embd                   = 2560
init: n_head                   = 32
init: n_layer                  = 32
init: n_ff                     = 6912
init: n_parts                  = 1
init: n_embd      = 2560
init: max_seq_len      = 4096
load: ne ctx size = 1355.59 MB
load: mem required  = 4427.59 MB (+ memory per state)
............................................................................................
model_init_from_file: support_bestla_kv = 1
model_init_from_file: kv self size =  195.00 MB

system_info: n_threads = 24 / 48 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | F16C = 1 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, tokens_length = 11, n_batch = 512, n_predict = 256, n_keep = 0


Building a website can be done in 10 simple steps:
- Selecting topic and name. […]
- Finding the perfect host – domain registration
- Creating your first page
- Adding text information to site
- Editing HTML code
- Troubleshooting web hosting problems
- Addressing DNS server errors
- Installing Google analytics
- Tracking visitors on your website
You may think that this article is nonsense, but if you are not really tech savvy, and want to make a website without any programming experience – then read carefully. Because what I tell you here will help you build a simple, static HTML site in 10 minutes time with absolutely no programming skills at all.
This book is targeted at people who are interested into learning how to create free websites but do not know where to start from? How many tutorials they need to follow before starting their first website project? And when and how to learn about it?.
All these questions and more will be answered by this book, in short – 10 minutes of time.
This is all the wisdom that you get from reading “Ten Steps” to Web Success”.
Avoiding any kind of technical language, this guide takes off on a wild ride through a world-building experience as unique as your own. It will help you take off
model_print_timings:        load time =    89.55 ms
model_print_timings:      sample time =   143.08 ms /   256 runs   (    0.56 ms per token)
model_print_timings: prompt eval time =    88.76 ms /    11 tokens (    8.07 ms per token)
model_print_timings:        eval time =  6030.98 ms /   255 runs   (   23.65 ms per token)
model_print_timings:       total time =  6551.84 ms
  • Stable-Code-3B:
(neural-speed) C:\Users\Intel\Desktop\aahouzi\neural-speed>set NEURAL_SPEED_VERBOSE=1 && build\bin\Release\run_stablelm.exe -m stablec3b-q4_0.bin -n 256 -p "Given an int n, write a function that returns n! in Python:"
Welcome to use the stablelm on the ITREX!
main: seed  = 1715366425
AVX:1 AVX2:1 AVX512F:1 AVX_VNNI:1 AVX512_VNNI:1 AMX_INT8:1 AMX_BF16:1 AVX512_BF16:1 AVX512_FP16:1
model_file_loader: loading model from stablec3b-q4_0.bin
Loading the bin file with NE format...
load_ne_hparams  0.hparams.n_vocab = 50304
load_ne_hparams  1.hparams.n_embd = 2560
load_ne_hparams  2.hparams.n_mult = 0
load_ne_hparams  3.hparams.n_head = 32
load_ne_hparams  4.hparams.n_head_kv = 32
load_ne_hparams  5.hparams.n_layer = 32
load_ne_hparams  6.hparams.n_rot = 20
load_ne_hparams  7.hparams.ftype = 1
load_ne_hparams  8.hparams.max_seq_len = 16384
load_ne_hparams  9.hparams.alibi_bias_max = 0.000
load_ne_hparams  10.hparams.clip_qkv = 0.000
load_ne_hparams  11.hparams.par_res = 0
load_ne_hparams  12.hparams.word_embed_proj_dim = 0
load_ne_hparams  13.hparams.do_layer_norm_before = 0
load_ne_hparams  14.hparams.multi_query_group_num = 0
load_ne_hparams  15.hparams.ffn_hidden_size = 6912
load_ne_hparams  16.hparams.inner_hidden_size = 0
load_ne_hparams  17.hparams.n_experts = 0
load_ne_hparams  18.hparams.n_experts_used = 0
load_ne_hparams  19.hparams.n_embd_head_k = 80
load_ne_hparams  20.hparams.norm_eps = 0.000010
load_ne_hparams  21.hparams.freq_base = 1000000.000
load_ne_hparams  22.hparams.freq_scale = 1.000
load_ne_hparams  23.hparams.rope_scaling_factor = 1.000
load_ne_hparams  24.hparams.original_max_position_embeddings = 0
load_ne_hparams  25.hparams.use_yarn = 0
load_ne_vocab    26.vocab.bos_token_id = 0
load_ne_vocab    27.vocab.eos_token_id = 0
load_ne_vocab    28.vocab.pad_token_id = 0
load_ne_vocab    29.vocab.sep_token_id = 0
init: n_vocab                  = 50304
init: n_embd                   = 2560
init: n_head                   = 32
init: n_layer                  = 32
init: n_ff                     = 6912
init: n_parts                  = 1
init: n_embd      = 2560
init: max_seq_len      = 16384
load: ne ctx size = 1355.59 MB
load: mem required  = 4427.59 MB (+ memory per state)
............................................................................................
model_init_from_file: support_bestla_kv = 1
model_init_from_file: kv self size =  195.00 MB

system_info: n_threads = 24 / 48 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | F16C = 1 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, tokens_length = 15, n_batch = 512, n_predict = 256, n_keep = 0


Given an int n, write a function that returns n! in Python:
    >>> factorial(5) #=> 120

    n! = 1 * 2 * 3 * 4 = 24, so we have :

    f(x)! => [0] = 1*1=1; f(x+2)= 1 + 0*(x-x)....etc. "
"""
def factorial_of_a_number(n):
  if n == 0: return 1
  return n * factorial_of_a_number(n - 1)

<|endoftext|> [end of text]

model_print_timings:        load time =    65.85 ms
model_print_timings:      sample time =    62.64 ms /   109 runs   (    0.57 ms per token)
model_print_timings: prompt eval time =    64.92 ms /    15 tokens (    4.33 ms per token)
model_print_timings:        eval time =  2529.52 ms /   108 runs   (   23.42 ms per token)
model_print_timings:       total time =  2784.30 ms

@luoyu-intel added the "ready to review" label on May 13, 2024
@luoyu-intel (Contributor) commented:

Please run clang-format on your branch.

@a32543254 (Contributor) left a review:

LGTM

@a32543254 (Contributor) commented:

@intellinjun could you kindly help add an extension test for this new model, for more performance testing?

@intellinjun (Contributor) replied:

> @intellinjun could you kindly help add an extension test for this new model, for more performance testing?

Sure.

@intellinjun (Contributor) commented:

> @intellinjun could you kindly help add an extension test for this new model, for more performance testing?

We already have an extension test for stabilityai/stablelm-2-1_6b; maybe we just need a local performance test for StableLM-2-12B?

@intellinjun (Contributor) commented:

Performance test result: https://inteltf-jenk.sh.intel.com/job/neural_speed_extension/120/artifact/report.html

@luoyu-intel (Contributor) commented:

Thanks @aahouzi!

@luoyu-intel merged commit 753c158 into intel:main on May 15, 2024
11 checks passed
@aahouzi deleted the stablelm2 branch on May 15, 2024 at 08:30