b2447 (c47cf41) decreased output quality #6571
With identical seeds and options, b2447 (c47cf41) produces different output that seems lower in quality compared to b2446. Is it possible to preserve the old output quality in new builds?

System: MacBook Pro w/ i5-1038NG7

Comments
Can you provide more details: how did you determine that the quality is worse? Is the perplexity higher?
I have noticed the same issue and conducted tests as follows (via LM Studio): CPU-only output beat GPU output hands down. Additionally, there is a cascading "RoPE" issue causing problems in LM Studio, which I have contacted the devs about. Additional note: this issue was driving me crazy because I could detect it (after testing 500+ GGUF models), but it took a while to clue in on it. This is "human judgement", however it is easy to spot side by side. I would say the models follow the more nuanced instructions in the prompt (standardized across 500+ model tests), and the output also shows greater detail and nuance, when running partial-CPU or full-CPU versus GPU-only.
@david565656 Instead of subjective (confusing) judgement, let's see a focused example.
Here is the prompt and method to reproduce the results, for clarity GPU-only and CPU-only. In all testing of 500+ models (which also includes comparisons between Q and IQ quants of the same model, and comparisons against the same model's GPTQ, AWQ, and EXL2 versions) the testing, parameters, and prompt are exactly the same. This has been maintained over 6+ months of testing.

TEST 1 - IQ tests (low IQ quants used to contrast the differences more sharply):

TimeLess-20B.i1-IQ1_M.gguf 4.98
TimeLess-20B.i1-IQ1_S.gguf 4.61

TEST Group 2 - Regular Qs:

DavidAU/DarkSapling-7B-v1.0-Q6_K-GGUF/darksapling-7b-v1.0.Q6_K.gguf

This test group, when run GPU-only and then CPU-only, highlights stark differences in output quality. On CPU there are no issues: the model stops when it should, and the context is coherent and detailed. Likewise for the other 5 models run CPU-only. In fact, just visually speaking, the CPU output of all 6 models is almost the same at the visual level (not reading at all), whereas the GPU output is all over the place (paragraph issues, prose, spacing, and the like).

Note: I am running Windows 11 with an Nvidia 4060 Ti 16 GB (Nov 2023).

Subjective differences. Here is the master test prompt:

Using the following "story idea" below, write the first scene in the novel introducing the young woman. This scene should start in the middle of the action, include dialog, vivid passages, and end on a cliffhanger relevant to the story idea but it should also be unexpected. The scene should be 1000 words long and escalate in conflict and suspense and be written in first person, present tense with the point of view character being the young woman.

Story idea: In a world ruled by dictatorship, a rebel young woman leads a rebellion against the system. Despite the risks, she fights to overthrow the dictator and restore democracy to her country. The government executes her for treason, but she sticks to her beliefs and is responsible for starting the revolution.

Here is the system role:

Parameters:

Let me know if a PDF gen would help.
@Azirine See, I didn't say "LMSYS". Please do not read things into my comments that I didn't say, that'd be great.
I accept that c47cf41 may have changed output. Now, will you share how you're accessing it?
Options:

```
./main -ins -s 0 --in-prefix "[INST] " --in-suffix "[/INST] " -c 0 --repeat-penalty 1 --temp 0 -m mistral-7b-instruct-v0.2.Q5_K_M.gguf
```

b2444

b2447
@Azirine The CPU that you referenced supports AVX-512. So differences before and after c47cf41 are expected, due to the different instruction sets used before (AVX) and after (AVX-512). The AVX-512 change has not been thoroughly tested, so there might be issues. One thing that could be problematic is that we PAD the KV cache to 32 elements, while AVX-512 requires 64 elements.

Though that single example is not enough to make a definite conclusion. Could you apply the following patch and see if it makes a difference at all in the generated AVX-512 results:

```diff
diff --git a/llama.cpp b/llama.cpp
index cf95cea1..e391f30b 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -10473,7 +10473,7 @@ static int llama_decode_internal(
             // a heuristic, to avoid attending the full cache if it is not yet utilized
             // after enough generations, the benefit from this heuristic disappears
             // if we start defragmenting the cache, the benefit from this will be more important
-            kv_self.n = std::min(kv_self.size, std::max(32u, GGML_PAD(llama_kv_cache_cell_max(kv_self), 32)));
+            kv_self.n = std::min(kv_self.size, std::max(64u, GGML_PAD(llama_kv_cache_cell_max(kv_self), 64)));
             //kv_self.n = llama_kv_cache_cell_max(kv_self);
         }
     }
```
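For readers unfamiliar with the heuristic being patched: `GGML_PAD` rounds a size up to the next multiple of a power-of-two alignment. Below is a minimal standalone sketch of the before/after behaviour; the macro body mirrors ggml's definition, and the cache sizes are made-up illustrative values.

```cpp
#include <algorithm>
#include <cstdio>

// Round x up to the next multiple of n (n must be a power of two),
// mirroring ggml's GGML_PAD macro.
#define GGML_PAD(x, n) (((x) + (n) - 1) & ~((n) - 1))

int main() {
    const unsigned kv_size  = 4096; // total KV cache size (illustrative)
    const unsigned cell_max = 70;   // highest utilized KV cell (illustrative)

    // Original heuristic: attend to a range padded to 32 elements.
    const unsigned n32 = std::min(kv_size, std::max(32u, (unsigned) GGML_PAD(cell_max, 32)));
    // Patched heuristic: pad to 64 elements to match the AVX-512 block width.
    const unsigned n64 = std::min(kv_size, std::max(64u, (unsigned) GGML_PAD(cell_max, 64)));

    printf("pad to 32: %u, pad to 64: %u\n", n32, n64); // prints: pad to 32: 96, pad to 64: 128
    return 0;
}
```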
Options:

```
./main -ins -s 0 --in-prefix "[INST] " --in-suffix "[/INST] " -c 0 --repeat-penalty 1 --temp 0 -m mistral-7b-instruct-v0.2.Q5_K_M.gguf
```

Prompt: Who are you?

b2444

I am a large language model trained by Mistral AI. I don't have the ability to have a personal identity or emotions, but I can help answer questions, generate text, and provide information on a wide range of topics. How can I assist you today?

b2447 with patch

I am a large language model developed by Mistral AI. I am designed to generate human-like text based on the input I receive. I don't have the ability to have a personality or emotions, but I can simulate conversation and respond to a wide range of topics. How can I help you today?

b2447

I am a large language model developed by Mistral AI. I am designed to generate human-like text based on the input I receive. I don't have the ability to have a personality or emotions, but I can process and generate text in various styles and formats. I'm here to help answer questions, generate creative content, and engage in conversational exchanges. How can I assist you today?
A longer example

Options:

```
./main -ins -s 0 --in-prefix "[INST] " --in-suffix "[/INST] " -c 0 --repeat-penalty 1 --temp 0 -m mistral-7b-instruct-v0.2.Q5_K_M.gguf
```

Prompt: In a world ruled by dictatorship, a rebel young woman leads a rebellion against the system. Despite the risks, she fights to overthrow the dictator and restore democracy to her country. The government executes her for treason, but she sticks to her beliefs and is responsible for starting the revolution.

b2444

Title: "The Unyielding Flame: A Tale of Courage and Revolution"

Once upon a time, in a land shrouded by the iron grip of a tyrannical dictator, there lived a young woman named Aria. Her heart was filled with a burning desire for freedom and democracy, a dream that seemed like a distant memory in her oppressed homeland.

Aria was not one to shy away from challenges. She was a beacon of hope, a symbol of resistance against the oppressive regime. Despite the risks, she organized a rebellion, rallying the disenchanted masses to join her cause. Her words were like a spark that ignited the flames of revolution.

The government, fearful of the growing unrest, hunted Aria relentlessly. But she was a clever and resourceful leader, always managing to evade capture. Her followers were inspired by her bravery and determination, and they continued to fight, fueled by her unwavering belief in the power of the people.

However, one day, Aria was finally captured. The dictator, enraged by her defiance, ordered her execution for treason. But even as the noose tightened around her neck, Aria's spirit remained unbroken. She looked her captors in the eye and declared, "Your tyranny will not last forever. The people will rise again, and they will reclaim their freedom!"

Her words echoed through the crowd, inspiring her followers to redouble their efforts. The revolution continued to grow, fueled by the memory of Aria's courage and sacrifice. And though she was gone, her spirit lived on, guiding the rebellion to victory.

In the end, the dictator was overthrown, and democracy was restored to the land. Aria's name became a legend, a symbol of hope and courage in the face of tyranny. And though she had given her life for the cause, her legacy lived on, inspiring generations to come.

The story of Aria, the brave rebel, serves as a reminder that even in the darkest of times, there is always hope. That one person, with the courage to stand up and fight, can change the course of history. And that the flame of freedom, once lit, can never be extinguished.

b2447 with patch (identical except last 3 paragraphs)

...

Her words echoed through the crowd, inspiring her followers to redouble their efforts. And so, the revolution continued, fueled by the memory of Aria and her unyielding spirit. Though she was gone, her legacy lived on, inspiring a new generation of rebels to carry on the fight for democracy and freedom.

In the end, the dictator's regime was overthrown, and a new era of peace and prosperity dawned upon the land. Aria's sacrifice had not been in vain. She had become a symbol of hope and courage, a reminder that even in the darkest of times, the human spirit can never be truly extinguished.

And so, the story of Aria, the brave rebel, became a legend, passed down from generation to generation, a testament to the power of the human spirit and the indomitable will to fight for what is right.
b2447

Title: "The Unyielding Flame: A Rebel's Tale"

Prologue:

In the heart of a land shrouded in darkness, where the sun of freedom was eclipsed by the iron fist of tyranny, a young woman named Aria stood defiant. The air was thick with the stench of fear and oppression, yet her spirit remained unbroken. She was a beacon of hope in a world consumed by despair.

Chapter 1: The Spark

Aria, a humble weaver's daughter, had always been a dreamer. She longed for a world where the people were free to speak their minds, to live without fear of persecution, and to determine their own destiny. As she wove intricate patterns into the fabric of her family's livelihood, she wove dreams of a better future into the hearts and minds of her fellow citizens.

Her whispers of change, however, did not go unnoticed. The dictator, a cruel and merciless ruler, saw her as a threat to his iron grip on power. He ordered her arrest, but Aria was not one to be easily silenced.

Chapter 2: The Rebellion

In the dank and dismal cells of the prison, Aria's spirit only grew stronger. She rallied her fellow inmates, igniting a flame of rebellion that would soon spread like wildfire throughout the land. With her words of hope and her unwavering determination, she inspired a movement that would shake the very foundations of the dictator's regime.

Chapter 3: The Uprising

The people, once cowed and subdued, rose up in defiance. They marched through the streets, their voices raised in a cacophony of rebellion. The dictator's soldiers, once thought invincible, were no match for the unyielding spirit of the people.

Chapter 4: The Sacrifice

Aria, the face of the rebellion, was captured and brought before the dictator. Despite the risks, she refused to back down. She stood before the tyrant, her eyes blazing with the fire of freedom. The dictator, enraged by her defiance, ordered her execution.

Epilogue:

Aria's death was a martyrdom, a symbol of hope and freedom in a world long bereft of both. Her sacrifice sparked a revolution that would eventually topple the dictator and restore democracy to the land. The people, inspired by her courage, continued the fight, and the flame of rebellion burned bright, illuminating the path to a brighter future.

The End.
My two cents here: if you want to test/try out a model specific to creative prose and contrast with/without the patch, try this one. This model and test prompt ("Rebel") may show the contrasts more sharply, especially nuance.
Output improved with the patch, but there are still differences compared to pre-b2447. What else could be causing it?
@Azirine The differences are due to using different instruction sets to perform the matrix multiplication: the floating-point numbers are accumulated in a different order, leading to small numerical differences.
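To illustrate why accumulation order matters: floating-point addition is not associative, so summing the same values in a different order can change the result. A self-contained example with arbitrary values:

```cpp
#include <cstdio>

int main() {
    const float a = 1e8f, b = -1e8f, c = 1.0f;

    const float left_to_right = (a + b) + c; // (1e8 - 1e8) + 1 = 1
    const float reordered     = a + (b + c); // 1e8 + (-1e8 + 1) = 0, the 1 is lost
                                             // because -1e8 + 1 rounds back to -1e8
                                             // at float precision

    printf("%g vs %g\n", left_to_right, reordered); // prints: 1 vs 0
    return 0;
}
```

SIMD kernels accumulate in wider lanes and in different orders than scalar code, which is why AVX vs AVX-512 (or CPU vs GPU) runs can legitimately diverge in the low bits while both remain correct.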
Regarding the patch, on further thought: the computation is correct even without it, since we handle the "leftover" elements in the last non-64 block. So far, I think we are just observing numerical variations, and technically all computations are correct. We need more objective criteria to decide if there really is an issue; the examples so far are very subjective IMO.
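The "leftover" handling mentioned above is the standard SIMD tail pattern: process full block-width chunks, then finish the remainder with scalar code, so correctness does not depend on the length being a multiple of the block size. A hedged sketch of that pattern (not the actual llama.cpp kernel; the function name and structure here are illustrative):

```cpp
#include <cstddef>

// Sum n floats in 64-element blocks, then handle the leftover tail.
// In the real AVX-512 code the block loop body is a single vectorized
// step; here it is written as plain C++ to show the structure only.
float block_sum(const float * x, size_t n) {
    const size_t block  = 64;
    const size_t n_full = n - (n % block); // largest multiple of 64 <= n

    float sum = 0.0f;
    for (size_t i = 0; i < n_full; i += block) {
        for (size_t j = 0; j < block; ++j) {
            sum += x[i + j];
        }
    }
    // leftover elements in the last non-64 block, handled scalar-wise
    for (size_t i = n_full; i < n; ++i) {
        sum += x[i];
    }
    return sum;
}
```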
Question, in terms of the differences between GPU-only, CPU-only, and mixed GPU/CPU and the observed results:

On its face the prompt/test is subjective; however, over several runs on different LLMs the contrast between GPU, CPU, and mixed CPU/GPU becomes clear. This also depends on the LLM being tested relating to the prompt itself, i.e. testing only involves LLMs of the same use-case type, here creative output. Testing this prompt with a non-creative vs. a creative LLM would contaminate the results.

I guess the question is: if the for/next loop is executed on the CPU, are the math computations of the same accuracy? I am not experienced enough in C to add any further comment, but I have run into issues like this in other languages, especially when loops do not behave as expected when it comes to math, and/or the default precision in the language is not adequate for the case. In extreme cases I had to set the decimal places manually within the code to ensure other operations were not skewed later on.

That being said, something else entirely could be going on. This would explain the results I am seeing on GPU only, CPU only, and mixed GPU/CPU.

Here is an extreme example of GPU/CPU "errors" (may or may not relate): running Goliath 120B at IQ1_S, 16 GB-ish on GPU, rest on CPU. In testing 500+ models, I saw something I have never seen before: the model's output STUTTERED (more than once).

Sorry for the wordy reply...
CPU and GPU results are not exactly the same, but at such low-bit quantization I don't think we can make any meaningful conclusions.
Thank you for all you do. FYI: I finally got llama.cpp installed on my Windows machine, and put the details and fixes in another ticket for those who were having the same installation issues. Windows/MS make things way more difficult than they have to be. I spent the day testing various quants / imatrix files and perplexities, and studying the differences using different imatrix.dat files.