This repository was archived by the owner on Aug 30, 2024. It is now read-only.

[Neural Speed] Enable StableLM-2-12B #253

Merged: 16 commits into intel:main from stablelm2 on May 15, 2024

Conversation

@aahouzi (Member) commented May 10, 2024

Type of Change

Stability.ai open-sourced StableLM-2-12B, which has a different architecture from its 1.6B and 3B counterparts. This PR adds support for these models: stabilityai/stablelm-2-12b and stabilityai/stablelm-2-12b-chat.
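For context, here is a minimal sketch of how the newly enabled model could be exercised through Neural Speed's Python API, following the pattern in the repository README. The weight_dtype/compute_dtype values are illustrative assumptions (weight_dtype="int4" roughly corresponds to the q4_0 weights used in the tests below), not settings taken from this PR:

```python
# Minimal sketch based on the Neural Speed README pattern; quantization
# settings are illustrative assumptions, not part of this PR.
from transformers import AutoTokenizer, TextStreamer
from neural_speed import Model

model_name = "stabilityai/stablelm-2-12b"  # model enabled by this PR
prompt = "Building a website can be done in 10 simple steps:"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = Model()
model.init(model_name, weight_dtype="int4", compute_dtype="int8")
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=256)
```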

Description

  • Uses GQA instead of MHA, plus a parallel MLP layer and per-head qk_normalization (a minimal sketch of these three features follows this list)
  • Model description: StableLM-2-12B
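Below is a minimal PyTorch sketch (not the Neural Speed kernels, and not code from this PR) of how the three features fit together in one decoder block. Shapes follow the 12B hparams printed in the logs below (n_embd=5120, n_head=32, n_head_kv=8, n_embd_head_k=160, ffn_hidden_size=13824). For brevity the sketch shares one LayerNorm across heads for qk_normalization, whereas the real model keeps separate per-head weights; RoPE and the KV cache are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelDecoderBlock(nn.Module):
    """Simplified StableLM-2-12B-style block: GQA attention and the MLP run
    in parallel off one shared LayerNorm, and both outputs are added to the
    residual stream."""

    def __init__(self, n_embd=5120, n_head=32, n_head_kv=8,
                 ffn_hidden=13824, eps=1e-5):
        super().__init__()
        self.n_head, self.n_head_kv = n_head, n_head_kv
        self.head_dim = n_embd // n_head  # 160, matching n_embd_head_k in the logs
        self.norm = nn.LayerNorm(n_embd, eps=eps)
        # GQA: 32 query heads but only 8 kv heads, so K/V projections are 4x smaller
        self.wq = nn.Linear(n_embd, n_head * self.head_dim, bias=False)
        self.wk = nn.Linear(n_embd, n_head_kv * self.head_dim, bias=False)
        self.wv = nn.Linear(n_embd, n_head_kv * self.head_dim, bias=False)
        self.wo = nn.Linear(n_head * self.head_dim, n_embd, bias=False)
        # per-head qk_normalization (weights shared across heads here for
        # brevity; the real model keeps a separate norm per head)
        self.q_norm = nn.LayerNorm(self.head_dim, eps=eps)
        self.k_norm = nn.LayerNorm(self.head_dim, eps=eps)
        # gated MLP branch, evaluated in parallel with attention
        self.gate = nn.Linear(n_embd, ffn_hidden, bias=False)
        self.up = nn.Linear(n_embd, ffn_hidden, bias=False)
        self.down = nn.Linear(ffn_hidden, n_embd, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        h = self.norm(x)  # one shared input norm feeds both branches
        q = self.wq(h).view(b, t, self.n_head, self.head_dim)
        k = self.wk(h).view(b, t, self.n_head_kv, self.head_dim)
        v = self.wv(h).view(b, t, self.n_head_kv, self.head_dim)
        q, k = self.q_norm(q), self.k_norm(k)  # normalize each head's q/k vector
        # expand the 8 kv heads so each serves a group of 4 query heads
        k = k.repeat_interleave(self.n_head // self.n_head_kv, dim=2)
        v = v.repeat_interleave(self.n_head // self.n_head_kv, dim=2)
        q, k, v = (z.transpose(1, 2) for z in (q, k, v))  # (b, heads, t, head_dim)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn_out = self.wo(attn.transpose(1, 2).reshape(b, t, -1))
        mlp_out = self.down(F.silu(self.gate(h)) * self.up(h))
        return x + attn_out + mlp_out  # parallel residual: both branches added
```

One payoff of GQA is visible in the logs below: with only 8 kv heads cached per layer, the 12B model's KV cache stays in the same ballpark as the much smaller 1.6B model's (compare the "kv self size" lines).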

How has this PR been tested?

  • StableLM-2-12B:
(neural-speed) C:\Users\Intel\Desktop\aahouzi\neural-speed>set NEURAL_SPEED_VERBOSE=1 && build\bin\Release\run_stablelm.exe -m stablelm12b-q4_0.bin -n 256 -p "Building a website can be done in 10 simple steps:"
Welcome to use the stablelm on the ITREX!
main: seed  = 1715364662
AVX:1 AVX2:1 AVX512F:1 AVX_VNNI:1 AVX512_VNNI:1 AMX_INT8:1 AMX_BF16:1 AVX512_BF16:1 AVX512_FP16:1
model_file_loader: loading model from stablelm12b-q4_0.bin
Loading the bin file with NE format...
load_ne_hparams  0.hparams.n_vocab = 100352
load_ne_hparams  1.hparams.n_embd = 5120
load_ne_hparams  2.hparams.n_mult = 0
load_ne_hparams  3.hparams.n_head = 32
load_ne_hparams  4.hparams.n_head_kv = 8
load_ne_hparams  5.hparams.n_layer = 40
load_ne_hparams  6.hparams.n_rot = 40
load_ne_hparams  7.hparams.ftype = 1
load_ne_hparams  8.hparams.max_seq_len = 4096
load_ne_hparams  9.hparams.alibi_bias_max = 0.000
load_ne_hparams  10.hparams.clip_qkv = 0.000
load_ne_hparams  11.hparams.par_res = 0
load_ne_hparams  12.hparams.word_embed_proj_dim = 0
load_ne_hparams  13.hparams.do_layer_norm_before = 0
load_ne_hparams  14.hparams.multi_query_group_num = 0
load_ne_hparams  15.hparams.ffn_hidden_size = 13824
load_ne_hparams  16.hparams.inner_hidden_size = 0
load_ne_hparams  17.hparams.n_experts = 0
load_ne_hparams  18.hparams.n_experts_used = 0
load_ne_hparams  19.hparams.n_embd_head_k = 160
load_ne_hparams  20.hparams.norm_eps = 0.000010
load_ne_hparams  21.hparams.freq_base = 10000.000
load_ne_hparams  22.hparams.freq_scale = 1.000
load_ne_hparams  23.hparams.rope_scaling_factor = 1.000
load_ne_hparams  24.hparams.original_max_position_embeddings = 0
load_ne_hparams  25.hparams.use_yarn = 0
load_ne_vocab    26.vocab.bos_token_id = 100257
load_ne_vocab    27.vocab.eos_token_id = 100257
load_ne_vocab    28.vocab.pad_token_id = 0
load_ne_vocab    29.vocab.sep_token_id = 0
init: n_vocab                  = 100352
init: n_embd                   = 5120
init: n_head                   = 32
init: n_layer                  = 40
init: n_ff                     = 13824
init: n_parts                  = 1
init: n_embd      = 5120
init: max_seq_len      = 4096
load: ne ctx size = 5845.07 MB
load: mem required  = 16085.07 MB (+ memory per state)
.............................................................................................
model_init_from_file: support_bestla_kv = 1
model_init_from_file: kv self size =  111.56 MB

system_info: n_threads = 24 / 48 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | F16C = 1 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, tokens_length = 12, n_batch = 512, n_predict = 256, n_keep = 0


Building a website can be done in 10 simple steps: you only need to know what is the process of creating pages, uploading (hosting) those page onto your site’s domain. To do this, we are going … Read More
WordPress Plugins give extensive capabilities and functions that extend beyond their default function as blogging platforms.
There was a time when WordPress’ primary use used solely for blogs only,but nowadays it is not so since the advancement of technology makes themsuitable also fora newspages (suchas online magazines),events sites,ecommerce site(among others).
If you want to build your website/blog,you can do this easily by installing themes and plugins. … Read More
WordPress platform was mainly created for blogging,but nowadays it is not so since there are many WP hosting options.
The primary functions of WordPress (WP)are: easy-to-manage content,immediate publishing capability,long-term storage,and the ability to work with almost any kind/categoriesof blog posts.You can also add/remove/change themes/plugin anytime. … Read More
If you want your website/blogto be easily found by others,you need first build SEO(SEO is a tool that increase site ranking among all sites).Here are some tips for beginners (as well as experienced) inorder to rank better.
The latest version released in mid-2016
model_print_timings:        load time =   268.58 ms
model_print_timings:      sample time =   264.81 ms /   256 runs   (    1.03 ms per token)
model_print_timings: prompt eval time =   267.77 ms /    12 tokens (   22.31 ms per token)
model_print_timings:        eval time = 15677.07 ms /   255 runs   (   61.48 ms per token)
model_print_timings:       total time = 16732.44 ms

Also, I ensured that my code changes don't break inference for previously enabled models:

  • StableLM-2-1.6B:
(neural-speed) C:\Users\Intel\Desktop\aahouzi\neural-speed>set NEURAL_SPEED_VERBOSE=1 && build\bin\Release\run_stablelm.exe -m stablelm16b-q4_0.bin -n 256 -p "Building a website can be done in 10 simple steps:"
Welcome to use the stablelm on the ITREX!
main: seed  = 1715365117
AVX:1 AVX2:1 AVX512F:1 AVX_VNNI:1 AVX512_VNNI:1 AMX_INT8:1 AMX_BF16:1 AVX512_BF16:1 AVX512_FP16:1
model_file_loader: loading model from stablelm16b-q4_0.bin
Loading the bin file with NE format...
load_ne_hparams  0.hparams.n_vocab = 100352
load_ne_hparams  1.hparams.n_embd = 2048
load_ne_hparams  2.hparams.n_mult = 0
load_ne_hparams  3.hparams.n_head = 32
load_ne_hparams  4.hparams.n_head_kv = 32
load_ne_hparams  5.hparams.n_layer = 24
load_ne_hparams  6.hparams.n_rot = 16
load_ne_hparams  7.hparams.ftype = 1
load_ne_hparams  8.hparams.max_seq_len = 4096
load_ne_hparams  9.hparams.alibi_bias_max = 0.000
load_ne_hparams  10.hparams.clip_qkv = 0.000
load_ne_hparams  11.hparams.par_res = 0
load_ne_hparams  12.hparams.word_embed_proj_dim = 0
load_ne_hparams  13.hparams.do_layer_norm_before = 0
load_ne_hparams  14.hparams.multi_query_group_num = 0
load_ne_hparams  15.hparams.ffn_hidden_size = 5632
load_ne_hparams  16.hparams.inner_hidden_size = 0
load_ne_hparams  17.hparams.n_experts = 0
load_ne_hparams  18.hparams.n_experts_used = 0
load_ne_hparams  19.hparams.n_embd_head_k = 64
load_ne_hparams  20.hparams.norm_eps = 0.000010
load_ne_hparams  21.hparams.freq_base = 10000.000
load_ne_hparams  22.hparams.freq_scale = 1.000
load_ne_hparams  23.hparams.rope_scaling_factor = 1.000
load_ne_hparams  24.hparams.original_max_position_embeddings = 0
load_ne_hparams  25.hparams.use_yarn = 0
load_ne_vocab    26.vocab.bos_token_id = 100257
load_ne_vocab    27.vocab.eos_token_id = 100257
load_ne_vocab    28.vocab.pad_token_id = 0
load_ne_vocab    29.vocab.sep_token_id = 0
init: n_vocab                  = 100352
init: n_embd                   = 2048
init: n_head                   = 32
init: n_layer                  = 24
init: n_ff                     = 5632
init: n_parts                  = 1
init: n_embd      = 2048
init: max_seq_len      = 4096
load: ne ctx size =  805.41 MB
load: mem required  = 2853.41 MB (+ memory per state)
............................................................................
model_init_from_file: support_bestla_kv = 1
model_init_from_file: kv self size =  121.50 MB

system_info: n_threads = 32 / 64 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | F16C = 1 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, tokens_length = 12, n_batch = 512, n_predict = 256, n_keep = 0


Building a website can be done in 10 simple steps: 1 – Decide what you want your site to say and who you’d like visitors to become. 2 – Plan the layout, font type and colors used on your site, and add photos if you have any handy.. 3– If you have graphics software installed or a design program you can create your own background image.
When it comes to creating an online store it is very important that you choose the best hosting company. Some people think because they are not using WordPress CMS that they don’t need hosting anymore, but it’s wrong.. WordPress 4.0.1 Hosting Reviews
In today’s article I am going to share some tips on building your first web app from scratch.. 5 Tips On Building Your First Web App From Scratch – The Geek Blog
If you are using Windows 10 and want to disable the auto start feature in OS, then here is how you can do that.. 6 Tips To Turn Off Auto Start Feature In Windows 10 – Tech2sme.. 7 Tips On How To Turn Off Auto Start Feature In Windows 10 – Geek Tonic.. 8 Tips On How To Disable Auto Start Feature In Windows 10 [Part]
How To Install WordPress On Hosting. Some people think because they are not
model_print_timings:        load time =    46.01 ms
model_print_timings:      sample time =   298.55 ms /   256 runs   (    1.17 ms per token)
model_print_timings: prompt eval time =    44.13 ms /    12 tokens (    3.68 ms per token)
model_print_timings:        eval time =  3624.08 ms /   255 runs   (   14.21 ms per token)
model_print_timings:       total time =  4611.76 ms
  • StableLM-3B:
(neural-speed) C:\Users\Intel\Desktop\aahouzi\neural-speed>set NEURAL_SPEED_VERBOSE=1 && build\bin\Release\run_stablelm.exe -m stablelm3b-q4_0.bin -n 256 -p "Building a website can be done in 10 simple steps:"
Welcome to use the stablelm on the ITREX!
main: seed  = 1715365250
AVX:1 AVX2:1 AVX512F:1 AVX_VNNI:1 AVX512_VNNI:1 AMX_INT8:1 AMX_BF16:1 AVX512_BF16:1 AVX512_FP16:1
model_file_loader: loading model from stablelm3b-q4_0.bin
Loading the bin file with NE format...
load_ne_hparams  0.hparams.n_vocab = 50304
load_ne_hparams  1.hparams.n_embd = 2560
load_ne_hparams  2.hparams.n_mult = 0
load_ne_hparams  3.hparams.n_head = 32
load_ne_hparams  4.hparams.n_head_kv = 32
load_ne_hparams  5.hparams.n_layer = 32
load_ne_hparams  6.hparams.n_rot = 20
load_ne_hparams  7.hparams.ftype = 1
load_ne_hparams  8.hparams.max_seq_len = 4096
load_ne_hparams  9.hparams.alibi_bias_max = 0.000
load_ne_hparams  10.hparams.clip_qkv = 0.000
load_ne_hparams  11.hparams.par_res = 0
load_ne_hparams  12.hparams.word_embed_proj_dim = 0
load_ne_hparams  13.hparams.do_layer_norm_before = 0
load_ne_hparams  14.hparams.multi_query_group_num = 0
load_ne_hparams  15.hparams.ffn_hidden_size = 6912
load_ne_hparams  16.hparams.inner_hidden_size = 0
load_ne_hparams  17.hparams.n_experts = 0
load_ne_hparams  18.hparams.n_experts_used = 0
load_ne_hparams  19.hparams.n_embd_head_k = 80
load_ne_hparams  20.hparams.norm_eps = 0.000010
load_ne_hparams  21.hparams.freq_base = 10000.000
load_ne_hparams  22.hparams.freq_scale = 1.000
load_ne_hparams  23.hparams.rope_scaling_factor = 1.000
load_ne_hparams  24.hparams.original_max_position_embeddings = 0
load_ne_hparams  25.hparams.use_yarn = 0
load_ne_vocab    26.vocab.bos_token_id = 0
load_ne_vocab    27.vocab.eos_token_id = 0
load_ne_vocab    28.vocab.pad_token_id = 0
load_ne_vocab    29.vocab.sep_token_id = 0
init: n_vocab                  = 50304
init: n_embd                   = 2560
init: n_head                   = 32
init: n_layer                  = 32
init: n_ff                     = 6912
init: n_parts                  = 1
init: n_embd      = 2560
init: max_seq_len      = 4096
load: ne ctx size = 1355.59 MB
load: mem required  = 4427.59 MB (+ memory per state)
............................................................................................
model_init_from_file: support_bestla_kv = 1
model_init_from_file: kv self size =  195.00 MB

system_info: n_threads = 24 / 48 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | F16C = 1 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, tokens_length = 11, n_batch = 512, n_predict = 256, n_keep = 0


Building a website can be done in 10 simple steps:
- Selecting topic and name. […]
- Finding the perfect host – domain registration
- Creating your first page
- Adding text information to site
- Editing HTML code
- Troubleshooting web hosting problems
- Addressing DNS server errors
- Installing Google analytics
- Tracking visitors on your website
You may think that this article is nonsense, but if you are not really tech savvy, and want to make a website without any programming experience – then read carefully. Because what I tell you here will help you build a simple, static HTML site in 10 minutes time with absolutely no programming skills at all.
This book is targeted at people who are interested into learning how to create free websites but do not know where to start from? How many tutorials they need to follow before starting their first website project? And when and how to learn about it?.
All these questions and more will be answered by this book, in short – 10 minutes of time.
This is all the wisdom that you get from reading “Ten Steps” to Web Success”.
Avoiding any kind of technical language, this guide takes off on a wild ride through a world-building experience as unique as your own. It will help you take off
model_print_timings:        load time =    89.55 ms
model_print_timings:      sample time =   143.08 ms /   256 runs   (    0.56 ms per token)
model_print_timings: prompt eval time =    88.76 ms /    11 tokens (    8.07 ms per token)
model_print_timings:        eval time =  6030.98 ms /   255 runs   (   23.65 ms per token)
model_print_timings:       total time =  6551.84 ms
  • Stable-Code-3B:
(neural-speed) C:\Users\Intel\Desktop\aahouzi\neural-speed>set NEURAL_SPEED_VERBOSE=1 && build\bin\Release\run_stablelm.exe -m stablec3b-q4_0.bin -n 256 -p "Given an int n, write a function that returns n! in Python:"
Welcome to use the stablelm on the ITREX!
main: seed  = 1715366425
AVX:1 AVX2:1 AVX512F:1 AVX_VNNI:1 AVX512_VNNI:1 AMX_INT8:1 AMX_BF16:1 AVX512_BF16:1 AVX512_FP16:1
model_file_loader: loading model from stablec3b-q4_0.bin
Loading the bin file with NE format...
load_ne_hparams  0.hparams.n_vocab = 50304
load_ne_hparams  1.hparams.n_embd = 2560
load_ne_hparams  2.hparams.n_mult = 0
load_ne_hparams  3.hparams.n_head = 32
load_ne_hparams  4.hparams.n_head_kv = 32
load_ne_hparams  5.hparams.n_layer = 32
load_ne_hparams  6.hparams.n_rot = 20
load_ne_hparams  7.hparams.ftype = 1
load_ne_hparams  8.hparams.max_seq_len = 16384
load_ne_hparams  9.hparams.alibi_bias_max = 0.000
load_ne_hparams  10.hparams.clip_qkv = 0.000
load_ne_hparams  11.hparams.par_res = 0
load_ne_hparams  12.hparams.word_embed_proj_dim = 0
load_ne_hparams  13.hparams.do_layer_norm_before = 0
load_ne_hparams  14.hparams.multi_query_group_num = 0
load_ne_hparams  15.hparams.ffn_hidden_size = 6912
load_ne_hparams  16.hparams.inner_hidden_size = 0
load_ne_hparams  17.hparams.n_experts = 0
load_ne_hparams  18.hparams.n_experts_used = 0
load_ne_hparams  19.hparams.n_embd_head_k = 80
load_ne_hparams  20.hparams.norm_eps = 0.000010
load_ne_hparams  21.hparams.freq_base = 1000000.000
load_ne_hparams  22.hparams.freq_scale = 1.000
load_ne_hparams  23.hparams.rope_scaling_factor = 1.000
load_ne_hparams  24.hparams.original_max_position_embeddings = 0
load_ne_hparams  25.hparams.use_yarn = 0
load_ne_vocab    26.vocab.bos_token_id = 0
load_ne_vocab    27.vocab.eos_token_id = 0
load_ne_vocab    28.vocab.pad_token_id = 0
load_ne_vocab    29.vocab.sep_token_id = 0
init: n_vocab                  = 50304
init: n_embd                   = 2560
init: n_head                   = 32
init: n_layer                  = 32
init: n_ff                     = 6912
init: n_parts                  = 1
init: n_embd      = 2560
init: max_seq_len      = 16384
load: ne ctx size = 1355.59 MB
load: mem required  = 4427.59 MB (+ memory per state)
............................................................................................
model_init_from_file: support_bestla_kv = 1
model_init_from_file: kv self size =  195.00 MB

system_info: n_threads = 24 / 48 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | F16C = 1 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, tokens_length = 15, n_batch = 512, n_predict = 256, n_keep = 0


Given an int n, write a function that returns n! in Python:
    >>> factorial(5) #=> 120

    n! = 1 * 2 * 3 * 4 = 24, so we have :

    f(x)! => [0] = 1*1=1; f(x+2)= 1 + 0*(x-x)....etc. "
"""
def factorial_of_a_number(n):
  if n == 0: return 1
  return n * factorial_of_a_number(n - 1)

<|endoftext|> [end of text]

model_print_timings:        load time =    65.85 ms
model_print_timings:      sample time =    62.64 ms /   109 runs   (    0.57 ms per token)
model_print_timings: prompt eval time =    64.92 ms /    15 tokens (    4.33 ms per token)
model_print_timings:        eval time =  2529.52 ms /   108 runs   (   23.42 ms per token)
model_print_timings:       total time =  2784.30 ms

@luoyu-intel added the "ready to review" label on May 13, 2024
@luoyu-intel (Contributor) commented:

Please run clang-format on your branch.

@a32543254 (Contributor) left a review:

LGTM

@a32543254 (Contributor) commented:

@intellinjun could you kindly help add an extension test for this new model, for more performance testing?

@intellinjun (Contributor) replied:

> @intellinjun could you kindly help add an extension test for this new model, for more performance testing?

Sure.

@intellinjun (Contributor) commented:

> @intellinjun could you kindly help add an extension test for this new model, for more performance testing?

We already have an extension test for stabilityai/stablelm-2-1_6b; maybe we just need a local performance test for StableLM-2-12B?

@intellinjun (Contributor) commented:

Performance test result: https://inteltf-jenk.sh.intel.com/job/neural_speed_extension/120/artifact/report.html

@luoyu-intel (Contributor) commented:

Thanks @aahouzi!

@luoyu-intel merged commit 753c158 into intel:main on May 15, 2024
11 checks passed
@aahouzi deleted the stablelm2 branch on May 15, 2024 at 08:30