First Proof, Second Proof
There will be a webinar at 1pm east coast time today to announce the results of First Proof Second Batch.
https://cmsa.fas.harvard.edu/first-proof-second-batch/
I participated in the development of the second batch of research-level mathematical problems to test frontier models with. I’m not sure if I am supposed to say whether “my” problem was selected or not so you’ll have to read until the end to figure that out! During the workshop, I wrote up my reflections and have lightly edited them here. I shared these privately with a few colleagues when I returned but now that there’s a webinar, I’ll go public.
This is the second batch of problems to form a dataset; the group has already been through this process once before and we are at a stage of refining the process. Also, the AI companies have been taking this seriously now. Everything seems more existentially critical at this point. OpenAI wants to “win” this; they have an IPO planned for sometime in ‘27 (hear that with a tone of both disbelief and hubris on their part). Altman is clearly flying close to the sun, even to the point of his personal safety. DeepMind wants to win a Fields Medal or solve one of the Millennium prizes; they already have a Nobel prize, why not go for the whole lot of them?!
The design of the experiment is as follows:
Problem statements -- should be fairly short and self-contained, in LaTeX.
Problem solutions -- should be similarly short, no more than 8 pages in 12pt font, again produced in LaTeX and then compiled to the finished product.
Novelty statements -- address the extent to which this problem was “new”
Tests on publicly available models, with explicit details on number of tokens, costs of said tokens, and everyone uses the same harness. I had to ask what a harness was. We need transparency and reproducibility and replicability. We need to know that the public could do this and at what cost. The experiments will run for 24 hours and no more. There is a set date for these tests.
The human solutions will be assessed by professional mathematicians without exposure to LLMs a few weeks before the experiment. They will check correctness, style, etc.
Then, we will have experts converge on the test site for two days and “grade” the machines’ answers. The grading scheme is similar to that of editorial refereeing: accept no revisions, accept minor revisions (there are minor flaws, details to be cleaned up, but the proof is essentially correct), accept/reject major revisions (I actually forget what goes in here), and finally outright reject. The proofs have to get citations correct and everything needs to be properly cited.
This process and the results are what the First Proof folks are discussing today via webinar. I am not an official spokesperson so listen to the webinar for the real information. As for the preliminary evidence and the initial dataset building stages, here is my “behind the scenes” report.
There were on the order of 20 questions that we prepared. We sent them in (in LaTeX, for those in the know) and the team gathered preliminary evidence as to whether a problem was “broken” (i.e., could a frontier model solve the problem right now or did it get pretty close), how the frontier models were “thinking” (was the thought process a random walk off into intellectual oblivion or was it sort of on the right track?)? We used OpenRouter as a front end/API to access all the possible publicly available models (although just the big three: ChatGPT, Claude, and Gemini) with a zero-data-retention (ZDR) policy. The ZDR meant that none of the models would “save” or retain information about our problem. FirstProof had budget for tokens for OpenRouter and it was fun to play with the bleeding edge models that I’m not paying for right now!
We had some issues with hallucinations of references and citations using the ZDR version of the engines as they did not make web queries. There was discussion about how NOT having access to the web made the proofs more likely to be correct, empirically. Which is kind of interesting because that is the sort of thing we would tell our students -- don’t go to the library for a couple of weeks, try to solve this yourself. In fact, having external input can be really distracting when trying to solve a problem. My advisor used to tell the story that when she was a student, the French and the Flemish universities were divorcing and the library was used as a custody pawn so she didn’t have access to it. It made her fearless in solving problems and in working out examples.
I’ll leave the technical stuff for the webinar (including how and which problems were chosen) and stick to sociological commentary and reportage of conversations amongst mathematicians. The one junior faculty member in our group was the most vocal about being freaked out by this process. It’s possible the rest of us were equally freaked out or were more mature about the process. I suspect a combination thereof. (I have been curious, fascinated, and awakened from my burned-out, cynical, demoralized state.) We did reassure this young person that 1. He was going to get tenure, 2. One’s career takes many twists and turns, 3. Life/math was a constantly evolving process. Someone suggested that he have a child, his perspective would shift. (Yes, oh so much!) He joked that he was missing a preliminary step towards that.
We spent a while discussing ethical questions about this experiment. Should problem generators be compensated? We all tend to think of problems belonging to the community and we do this work “for free” (except for travel and food, perhaps) but in business, publishing, etc. one would certainly be paid. This data is incredibly valuable!
People in the news, commentariat, are debating just what AI means for early stage careers and for young people. There is considerable debate over the role of AI in the recent layoffs and hiring slowdowns at tech companies. Similarly for the shift or no shift in economic productivity. We mathematicians are having the same conversations. We spent considerable time discussing what our efforts meant for graduate students, young faculty, and teaching people! We debated just what doing mathematics meant. Is it the solving of the problems or is it all the other bits such as finding the problem, articulating why it matters, putting that problem into context, crafting new ideas, etc.? We discussed how we train our graduate students (or even whether we did!). We talked about what roles these students played in our own research and what roles AI was playing.
What would we do if AI “put us out of business?” This is akin to the cocktail party conversation, “What was your Plan B?” Those of us who graduated in the mid-nineties (or, heck, anyone with a Math/Physics PhD any time since the 70s) frequently asked ourselves what our backup plans in case we couldn’t get a job after grad school. Or, maybe didn’t get the kind of job we wanted. Plans B then included — and now include — social worker, physician, lawyer, high school teacher...
All mathematicians should understand the difference between finding or stating problems and solving them. (It fascinates me that some folks don’t know this dichotomy and that they don’t think about this! It’s also interesting which subfields of math emphasize one of these skills over the other.) Part of the preliminary evidence gathering entailed rephrasing our questions and adjusting them so as to get the most out of the LLMs. One problem proposer determined that his problem, once suitably phrased, was solvable by the frontier models.
The junior faculty member present had not thought much about the dichotomy of problem solvers vs problem finders. Fortunately for all of us, he is the one who did the following clever experiment. He uploaded several recent papers in his field and asked the models to produce follow-on research questions based on those papers. To everyone’s surprise, the models produced a question that was a research level question and, if answered, would constitute a paper. He was stunned, excited, and demoralized all at the same time.
We’d been gathered to ensure a mixture of problems from various subfields and our expertise reflected different strengths. One theoretical statistician commented or hypothesized that in younger fields, the problem statement was more important than the solution while in areas of mathematics, we have had community-identified lists of open problems for decades, even centuries. That’s why Fermat’s Last Theorem or the Riemann Hypothesis or the Millennium Prizes or Hilbert’s Problems are so important. Certainly, in Theoretical Computer Science, how to “mathematize” a problem is critical. Only after doing this step, can one attempt to solve it. The same is probably true of theoretical physics.
And, of course, the gossip. There is considerable scuttlebutt out of the AI companies. OpenAI has hoovered up most of the Stanford CS faculty. They’ve plucked prominent mathematicians out of top departments. The effort at DeepMind on Navier-Stokes is floundering. Anthropic is focusing on business and writing code (that’s where the money is, duh!), not solving math problems. We all know there’s no money in actually solving math problems, only the glory of boasting about IMO gold medals, and then an IPO. OpenAI is working its way through Erdos problems. They’ve optimized their “math team” for this task. (Math team reminds me of high school all over again.)
There’s quite a bit of “us vs them” as a result of these rumors. Don’t join the AI companies, don’t test your questions on existing software, don’t run your proofs through their codes. Use protection, folks! We struggle to keep the process anonymous, we don’t want to be smeared on social media and we don’t want to lend extra glory to these problems, the process, or the AI companies simply because Prof. XYZ was involved. Or, Prof. XYZ said his problem was a two-hint homework problem and yet, none of the AIs could solve it. Just what does that mean?! This all feels separate from the standard eschewing of conflict of interest in scientific research, which we are working assiduously to maintain. There is an analogy to having drug companies sponsor faculty, sponsor research, or even design the research trials. And yet, all of this feels more existential. There are potentially trillions of dollars riding on bragging rights and we are potentially on the cusp of a sea change in our economy and, for us mathematicians, on our way of life. Don’t worry, no AI company has plucked me out of my department and I’m not funded by anyone. Hello?!
So, what happened with my problem? My problem was a mixture of probability and combinatorics (graph theory, to be precise) and while the preliminary evidence suggested that the publicly available LLMs were struggling on one important discrete math (as opposed to probabilistic analysis) point in the proof, the pace at which solution announcements in combinatorics were being made by OpenAI in the few months since our meeting and the actual experiment (Erdős problems solved, famous conjectures disproven, follow-on results!) suggested that this small element of a proof would be easily broken too. My problem is in the public section of the dataset. Reader, we will find out together what that means.


