Researchers from Microsoft Research, Sahara AI, and Emory University released new data this week challenging artificial general intelligence claims. Their MATHVISTA benchmark tested twelve foundation models on multimodal math reasoning tasks involving charts and diagrams. Results indicate that even the most advanced systems fail to match average human performance in this domain, suggesting the industry may be overestimating current general reasoning capabilities.
GPT-4 Vision achieved the highest score among the tested models with 49.9% accuracy on the benchmark. Human participants averaged 60.3% when solving the same problems, leaving a significant gap between machines and people. The study highlights that current technology struggles with interpreting visual information alongside mathematical logic. This discrepancy persists despite the massive computational resources dedicated to developing these systems.
Building the benchmark required annotators to distinguish simple counting tasks from problems that demand deeper mathematical reasoning. Microsoft selected Sahara AI to support the creation of more than 6,000 multimodal examples through custom workflows. These examples covered geometry, algebra, and statistics grounded in visual data such as plots and graphs, and the process involved rigorous multi-stage quality checks to ensure the dataset was reliable for testing.
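For readers unfamiliar with the format, a multimodal benchmark item of this kind typically pairs an image with a question, a verified answer, and skill labels. The sketch below is purely illustrative; the field names and values are assumptions for exposition and are not taken from the study.

```python
from dataclasses import dataclass

# Hypothetical record structure for a multimodal math benchmark item.
# Field names here are illustrative assumptions, not the study's actual schema.
@dataclass
class BenchmarkExample:
    image_path: str        # chart, plot, or diagram the question refers to
    question: str          # natural-language math question about the image
    answer: str            # gold answer written and verified by annotators
    skill: str             # e.g. "geometry", "algebra", "statistics"
    requires_vision: bool  # True if the problem cannot be solved from text alone

example = BenchmarkExample(
    image_path="plots/scatter_0042.png",
    question="Based on the scatter plot, is the correlation between x and y positive or negative?",
    answer="positive",
    skill="statistics",
    requires_vision=True,
)
```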
Hao Cheng, a principal researcher at Microsoft Research, said the goal is for machines to perform everyday tasks as well as an average person. He noted that many existing evaluation datasets included problems solvable without any visual reasoning: models often reached correct answers by relying solely on the text rather than interpreting the image. Cheng emphasized that this limitation prevents accurate measurement of true machine intelligence.
Sean Ren, CEO of Sahara AI and an associate professor at USC, warned about the risk of data contamination in benchmarks. He explained that if benchmark answers appear in a model’s training data, high scores reflect memorization rather than reasoning. This makes it difficult to determine whether AI systems are genuinely improving or just learning the test questions. Reliable benchmarks are essential for measuring progress toward broader machine intelligence.
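Ren's point can be made concrete with a simple overlap test. The sketch below is not the study's methodology; it is a minimal illustration, assuming a plain n-gram comparison, of how benchmark questions that appear in a training corpus can be flagged as possible contamination.

```python
# Minimal sketch of an n-gram overlap check for benchmark contamination.
# This is an illustrative assumption, not the method used in the study:
# if a benchmark question's n-grams largely appear in the training corpus,
# a high score may reflect memorization rather than reasoning.

def ngram_overlap(benchmark_text: str, training_text: str, n: int = 8) -> float:
    """Fraction of the benchmark text's n-grams that also occur in the training text."""
    def ngrams(text: str) -> set:
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    bench = ngrams(benchmark_text)
    if not bench:
        return 0.0
    return len(bench & ngrams(training_text)) / len(bench)


if __name__ == "__main__":
    question = "What is the slope of the line shown in the plot between x = 0 and x = 4?"
    corpus = "... What is the slope of the line shown in the plot between x = 0 and x = 4? Answer: 2 ..."
    print(f"overlap: {ngram_overlap(question, corpus):.2f}")  # a value near 1.0 flags likely contamination
```

In practice, contamination screening is more involved (large-scale deduplication, fuzzy matching, canary strings), but the underlying concern is the same: high benchmark scores are only meaningful if the test items were not seen during training.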
Artificial general intelligence remains one of the most cited milestones in the AI industry despite the lack of a clear definition. Tech executives frequently predict its arrival while investors pour billions of dollars into related research. Critics warn about the risks once such systems arrive, yet researchers still disagree on what counts as general intelligence. There is no consensus on when this capability might emerge or how anyone would recognize it once it does.
Elon Musk has argued that progress toward human-level reasoning depends on live data rather than static training sets. He claimed in a recent interview that xAI's access to data from X gives Grok 5 a competitive edge, and he assigned a 10% probability to the upcoming model achieving artificial general intelligence. That prediction contrasts with the Microsoft-led study's findings about current limitations.
Researchers point to limits in publicly available training data as a potential bottleneck for future advances. Much of the existing material does not support the complex visual reasoning that general intelligence tasks require. Without better training data, measuring progress toward broader machine intelligence becomes increasingly difficult, and the industry will need more robust evaluation datasets before claims that AGI is near carry weight. Further progress depends on resolving these data limitations and on significant improvements in model architectures.