GSM8k AI News List

2026-03-27
10:57

MEMCOLLAB Breakthrough: Cross-Model Memory Boosts Llama 3 8B to 42.4% on MATH500 — Analysis and Business Impact

According to God of Prompt, researchers at Pennsylvania State University found that agent memories distilled from a single model’s reasoning traces carry model-specific biases and heuristics that hurt transfer, so performance can fall below the zero-memory baseline when the memory is moved to a different model. As reported in the tweet, which summarizes the study highlights, giving a 7B model’s memory to a 32B model reduced MATH500 accuracy from 63.8% to 50.6% and HumanEval accuracy from 68.3% to 34.1%, and the reverse transfer also degraded results. According to the same source, the proposed fix, MEMCOLLAB, builds memory from cross-model agreement: it contrasts a successful trajectory with a failed one to extract invariant reasoning principles rather than model-specific style. This raised Llama 3 8B on MATH500 from 27.4% to 42.4% and lifted average accuracy across four benchmarks from 41.7% to 53.9%. As reported by God of Prompt, Qwen 7B improved from 52.2% to 67.0% on MATH500 and from 42.7% to 74.4% on HumanEval, while reasoning turns dropped from 3.3 to 1.5 on HumanEval and from 3.1 to 1.4 on MBPP, an efficiency gain that reduces inference cost. According to the same source, cross-architecture memory construction (Qwen 32B plus Llama 8B) outperformed same-family memory on GSM8K, 95.2% vs 93.6%, which points to opportunities for vendors to standardize cross-model memory pipelines, lower token spend, and improve reliability in production agents for coding, math tutoring, and workflow automation.

Source
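
The before/after figures quoted in the MEMCOLLAB item are easier to compare as changes in percentage points. The short Python sketch below tallies them. The scores are copied from the item as reported; the labels and the function names `point_change` and `summarize` are illustrative choices, not part of the study.

```python
# Before/after accuracy (percent) as quoted in the news item above.
# The labels are our own shorthand for the comparisons the item describes.
REPORTED = {
    "Llama 3 8B, MATH500": (27.4, 42.4),
    "Average over four benchmarks": (41.7, 53.9),
    "Qwen 7B, MATH500": (52.2, 67.0),
    "Qwen 7B, HumanEval": (42.7, 74.4),
    "7B memory given to 32B, MATH500": (63.8, 50.6),
    "7B memory given to 32B, HumanEval": (68.3, 34.1),
}


def point_change(before, after):
    """Return the change in percentage points, rounded to one decimal."""
    return round(after - before, 1)


def summarize(reported):
    """Map each label to its percentage-point change."""
    return {name: point_change(b, a) for name, (b, a) in reported.items()}


if __name__ == "__main__":
    for name, delta in summarize(REPORTED).items():
        print(f"{name}: {delta:+.1f} points")
```

Positive values are gains with MEMCOLLAB-style memory; the two negative values are the losses the item attributes to transferring single-model memory across models.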

2026-02-04
09:35

Latest Analysis: Phi and Mistral Models Show 13% Accuracy Drop on GSM1k vs GSM8k, Revealing Memorization Issues

According to God of Prompt on Twitter, recent testing shows that Phi and Mistral models lost roughly 13 percentage points of accuracy when evaluated on the GSM1k benchmark rather than GSM8k, with some variants dropping by as much as 13.4 percentage points. The analysis suggests that these scores reflect memorization rather than true reasoning ability, which it attributes to the models having been exposed to the correct answers during training. This finding raises concerns about how well these models generalize and how reliable they are for business and research applications.

Source
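
A drop quoted in "percent" can mean either percentage points or a relative decrease, and the two can differ noticeably. The Python sketch below shows both calculations. The accuracy values in the usage example are hypothetical, chosen only to illustrate the arithmetic; they are not taken from the item or from any model's results.

```python
def drop_in_points(acc_reference, acc_new):
    """Absolute drop in percentage points (e.g. 80.0 -> 67.0 is 13.0 points)."""
    return acc_reference - acc_new


def relative_drop(acc_reference, acc_new):
    """Drop expressed as a percentage of the reference accuracy."""
    if acc_reference <= 0:
        raise ValueError("reference accuracy must be positive")
    return (acc_reference - acc_new) / acc_reference * 100


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    gsm8k_acc, gsm1k_acc = 80.0, 67.0
    print(f"absolute drop: {drop_in_points(gsm8k_acc, gsm1k_acc):.1f} points")
    print(f"relative drop: {relative_drop(gsm8k_acc, gsm1k_acc):.2f}%")
```

With these made-up inputs, a 13-point absolute drop corresponds to a relative drop of 16.25%, which is why it matters which of the two a headline figure refers to.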

2026-02-04
09:35

Latest Analysis Reveals 0.32 Correlation Between GSM8k Reproduction and Performance Gap in AI Models

According to God of Prompt on Twitter, researchers have identified a 0.32 correlation between an AI model's ability to reproduce GSM8k test examples and its performance gap, meaning the difference between its accuracy on GSM8k and its accuracy on new, unseen questions. In other words, models that can recite test questions tend to lose more accuracy when the questions are new. As reported by God of Prompt, the implication is that these models may be memorizing answers rather than demonstrating true problem-solving capabilities, which raises concerns about the validity of current AI evaluation benchmarks.

Source
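
To make concrete what such a correlation measures, the Python sketch below computes a rank correlation between a per-model reproduction score and a per-model performance gap. The item does not say which correlation statistic was used, so a Spearman rank correlation is assumed here. The data in the usage example are hypothetical, and the function names are illustrative.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    # Hypothetical per-model values, for illustration only.
    reproduction_rate = [0.02, 0.10, 0.05, 0.30, 0.15]
    performance_gap = [1.0, 4.0, 6.0, 9.0, 3.0]
    print(f"Spearman correlation: {spearman(reproduction_rate, performance_gap):.2f}")
```

A value near 1 would mean the models that reproduce the most test text also show the largest gaps; a value of 0.32, as reported, indicates a positive but fairly weak relationship.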

2025-09-13
16:08

GSM8K Paper Highlights: AI Benchmarking Insights from 2021 Transform Large Language Model Evaluation

According to Andrej Karpathy on X (formerly Twitter), the GSM8K paper from 2021 has become a significant reference point in the evaluation of large language models (LLMs), especially for math problem-solving capabilities (source: https://twitter.com/karpathy/status/1966896849929073106). The dataset, which consists of 8,500 high-quality grade school math word problems, has been widely adopted by AI researchers and industry experts to benchmark LLM performance, identify model weaknesses, and guide improvements in reasoning and logic. This benchmarking standard has directly influenced the development of more robust AI systems and commercial applications, driving advancements in AI-powered tutoring solutions and automated problem-solving tools (source: GSM8K paper, 2021).

Source
