AI language models may have far surpassed humans in some computational areas, but quantitative reasoning remains a stumbling block for them.
One of the advantages of deploying artificial intelligence is its immense computational power, which lets it run calculations in a fraction of the time a human would need.
While number crunching may now be the realm of computers, quantitative reasoning, or applying mathematics to real-world problems, appears to have stumped even the most sophisticated AI language models.
The Center for AI Safety developed a data set for quantitative reasoning, called MATH, and put some of the top-of-the-line AI language models to the test. The results were not great: the models averaged seven percent on the test, far below the 40 percent scored by a human grad student. Math Olympiad champions scored 90 percent on the test.
The low scores reflect the fact that quantitative reasoning demands a combination of skills: parsing a question, recalling the relevant formulas, and interpreting the problem through a step-by-step solution. If the AI slips up at any one of these steps, the error propagates and can push the final answer far from the correct one. Misreading "1.5 hours" as "15 hours" when parsing a speed problem, for instance, corrupts every calculation that follows.
According to reporting from the technology and science magazine IEEE Spectrum, models from the University of California, Berkeley, OpenAI, and Google have all fared better on a less demanding data set called GSM8K, which was produced by OpenAI and features grade-school-level problems.
Google's Minerva, which is built on the company's Pathways Language Model (PaLM), has seen the most success: in June, Google announced that the model had reached 78 percent accuracy. That was ahead of OpenAI's expectations, as the company had previously said its GPT models would need to be trained on 100 times more data to achieve 80 percent accuracy.
Google says it achieved this improvement in accuracy with minimal scaling up of the model, through "chain-of-thought prompting", which breaks a larger problem into more manageable steps, alongside majority voting, which runs the same problem 100 times instead of just once and picks the answer the model produces most often. The same techniques have also improved Minerva's accuracy on the MATH data set, recently reaching 50 percent.
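Combined, the two techniques look roughly like the sketch below. This is a minimal illustration rather than Google's implementation: generate_cot_answer is a hypothetical stand-in for sampling a chain-of-thought completion from a real model, and its canned answers exist only to make the voting visible.

```python
from collections import Counter
import random

def generate_cot_answer(prompt: str) -> str:
    """Hypothetical stand-in for one sampled model completion.

    A real system would sample a full chain-of-thought response at
    nonzero temperature and extract the final answer; here we return
    canned answers with some noise so the vote has something to do.
    """
    return random.choice(["80", "80", "80", "8", "120"])

def solve_with_majority_voting(question: str, num_samples: int = 100) -> str:
    # Chain-of-thought prompting: nudge the model to reason step by
    # step before committing to a final answer.
    prompt = f"{question}\nLet's think step by step."

    # Majority voting: sample the same problem many times...
    answers = [generate_cot_answer(prompt) for _ in range(num_samples)]

    # ...and keep the answer that appears most often.
    winner, count = Counter(answers).most_common(1)[0]
    print(f"{count}/{num_samples} samples agreed on {winner}")
    return winner

if __name__ == "__main__":
    solve_with_majority_voting(
        "A train travels 120 km in 1.5 hours. What is its average speed in km/h?"
    )
```

The voting step pays off because there are many ways to reason incorrectly but typically only one way to be right: independent samples tend to scatter across different wrong answers while concentrating on the correct one.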
“Our approach to quantitative reasoning is not grounded in formal mathematics,” said Ethan Dyer and Guy Gur-Ari, research scientists at Google Research. “Minerva parses questions and generates answers using a mix of natural language and LaTeX mathematical expressions, with no explicit underlying mathematical structure. This approach has an important limitation, in that the model’s answers cannot be automatically verified. Even when the final answer is known and can be verified, the model can arrive at a correct final answer using incorrect reasoning steps, which cannot be automatically detected.”
Artificial intelligence has been marketed as capable of more than calculation and arithmetic, with the promise of making more accurate, data-driven decisions than humans. To that end, building AI language models that can apply critical thinking and quantitative reasoning at a high level looks like a necessity if humans are ever going to seriously contemplate shifting parts of decision making to algorithms.