If GPT-4 were a student, it would be one of the most brilliant. OpenAI itself evaluated its abilities with a series of exams designed for human beings, and it achieved spectacular results on them. It would in fact rank among the top 10% of test-takers, but some argue that this doesn't actually mean much.
What happened. OpenAI put GPT-4 through academic exams of various kinds, such as the Uniform Bar Exam, the most widely used test in the US to become a lawyer, or the LSAT, the test used for admission to law schools such as Columbia. It was also given the GRE Quantitative test, which measures the ability to reason about and understand mathematical concepts. In almost all of them its score was exceptional, which seemed to make GPT-4 superior to most human students. A recent study by two researchers suggests there are problems with that perception.
I suspect GPT-4’s performance is influenced by data contamination, at least on Codeforces.
Of the easiest problems on Codeforces, it solved 10/10 pre-2021 problems and 0/10 recent problems.
This strongly points to contamination.
1/4 https://t.co/wKtkyDRGGG pic.twitter.com/wm6yP6AmGx
— Horace He (@cHHillee) March 14, 2023
Data contamination. To begin with, the researchers verified that GPT-4 knew the answers by heart… as long as its memory reached that far. The model is known to have been trained on data from before September 2021. When given programming problems published before that date, it answered correctly, but it failed every test based on later problems, even when those problems were simple.
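A minimal sketch of how such a check might look, assuming a hypothetical ask_model() wrapper around the model under test and a list of Codeforces-style problems tagged with publication dates (all names here are illustrative, not the study's actual code):

```python
from datetime import date

CUTOFF = date(2021, 9, 1)  # GPT-4's reported training-data cutoff

def solve_rate(problems, ask_model):
    """Return the model's solve rate on problems published before and after the cutoff."""
    buckets = {"pre_cutoff": [0, 0], "post_cutoff": [0, 0]}  # [solved, attempted]
    for prob in problems:
        key = "pre_cutoff" if prob["published"] < CUTOFF else "post_cutoff"
        answer = ask_model(prob["prompt"])   # query the model under test
        buckets[key][1] += 1
        if prob["check"](answer):            # problem-specific correctness check
            buckets[key][0] += 1
    return {k: (s / t if t else 0.0) for k, (s, t) in buckets.items()}
```

If the model aces pre-cutoff problems of a given difficulty but fails equally easy post-cutoff ones, as in the 10/10 versus 0/10 split reported in the tweet above, memorization is a more plausible explanation than reasoning.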
This problem is known as "data contamination", and even changing small details in how a problem is worded can confuse ChatGPT (which was a mediocre student) and probably GPT-4, something the researchers note would not happen with a human.
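One way to probe that sensitivity is to reword a problem without changing its meaning and compare the answers. The sketch below is a plausible version of that idea under the same assumed ask_model() helper; the paraphrases are toy examples, not the study's test set:

```python
def perturbation_test(prompt, paraphrases, ask_model, check):
    """Check whether correctness survives meaning-preserving rewordings.

    A model that actually reasons should score similarly on the paraphrases;
    a model that memorized the original wording often does not.
    """
    baseline = check(ask_model(prompt))
    variants = [check(ask_model(p)) for p in paraphrases]
    return baseline, variants

# Illustrative rewordings of the same toy task:
original = "Given n integers, print the largest one."
reworded = [
    "You receive a list of n whole numbers; output its maximum.",
    "Among the n numbers provided, print the biggest value.",
]
```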

These are exams for humans, not for machines. "Memorization is a spectrum", the authors explain. Even if a model like GPT-4 has not seen an exact problem in its training set, "it has inevitably seen quite similar examples, simply because of the size of the training corpus". This allows the model to get by with "a much shallower level of reasoning". The result is language models that pass these exams without the reasoning ability that human test-takers need, and then go on to apply, in the real world.
Comparisons are odious. Exams such as the bar exam "put too much emphasis on subject-matter knowledge and too little on real-world skills, which are far harder to measure in a standardized way". In other words: these exams not only emphasize the wrong thing, they emphasize "precisely what language models are good at". For the authors of the study, the choice of these tests to evaluate GPT-4 is "unfortunate".
Quality, not quantity. The researchers argue that qualitative studies are needed, not quantitative ones. Although they acknowledge that GPT-4 "is really exciting" and can solve many problems for professionals, such as automating routine tasks, this kind of evaluation with exams like those used by OpenAI can be misleading.
In Xataka | How to educate and prepare for a future in which robots do most of the work
In Xataka | Students no longer copy, they use ChatGPT: universities are starting to monitor the use of artificial intelligence