14 Comments

There's absolutely no way. I haven't played with GPT-4, but ChatGPT-3.5 can't even answer questions about a single chapter of a novel, much less a full book-- neither can Claude. Love to know more details on this wager... Who would judge and audit this wager, both to ensure fair escrow and funds transfer, and to make sure that answers to questions aren't just added to the context window? You say a 'GPT-4 system'; how is a system different from just GPT-4? How do we avoid simply prompting the system to do a web search rather than infer real meaning from the book content? What is a satisfactory level?


Finally someone putting their money in the same postcode as their mouth - let me do my part by trying to set the clearest and most favourable possible terms.

Tech:

1. The questions will be chosen by the counterpart (i.e., you) and passed to me right before the trial.

2. From the moment GPT starts "reading" the book to the end of the challenge, the system will not have access to the internet, with the exception of the OpenAI API, a database (for recall), and an interface for asking questions.

3. The system is composed of some Python code, prompts, API calls, and a graph / vector DB (roughly along the lines of the sketch below).
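To make that concrete, here's a minimal sketch of the shape of the thing, assuming the current openai Python client -- the model names, chunk sizes, and the in-memory cosine-similarity store are illustrative stand-ins, not the actual setup:

```python
# Minimal sketch: chunk the novel, embed the chunks, retrieve the most
# relevant ones per question, and ask GPT-4 to answer from those alone.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chunk_text(text, size=2000, overlap=200):
    """Split the novel into overlapping character windows."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def embed(texts):
    """Embed a list of strings, one vector per string."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def answer(question, chunks, chunk_vecs, k=8):
    """Retrieve the k most similar chunks and answer from them only."""
    q = embed([question])[0]
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    context = "\n---\n".join(chunks[i] for i in np.argsort(sims)[-k:])
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system", "content": "Answer using only the excerpts provided."},
            {"role": "user", "content": f"Excerpts:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

# Usage: index once, then answer each question with no further interaction.
# novel = open("novel.txt").read()   # chapter text only
# chunks = chunk_text(novel)
# vecs = embed(chunks)
# print(answer("Does the narrator's wife survive?", chunks, vecs))
```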

Conditions:

1. Satisfactory level: I think this is best determined by example. Send me a series of questions and answers about a book, and I will confirm whether that's the level we can expect.

2. Victory condition: we can do it trust-based (you and I should agree, or a judge will break the tie) or as a round-robin Turing test (two people and a computer; each of the people will judge the computer's and the other person's responses and decide which is better. If the results are equal or better for the computer, I win).

Escrow and insurance:

David Chapman (@meaningness) volunteered some time ago. He's a bit of an LLM skeptic, but more importantly he's one of the most morally upright people that I know. That would be my go-to; if you're also familiar with him, you'll know there's no way his behaviour will be less than stellar.

LMK if there's anything I missed.


Sorry for the slow response-- I've been traveling this week... and will still be traveling next week as well. I'm interested in pursuing this if we can agree on terms. Given my schedule, it'd most likely be the first or second week of October before we could actually schedule something. Here are my conditions for what I would expect in order to accept a solution:

1.) The solution needs to be both auditable and reproducible. All Python code, configuration files, and similar scaffolding would need to be published to a public GitHub repository.

2.) The only inputs to the LLM would be the source text of the novel and the questions as provided. No prompt engineering, and no additional commentary or context added to the vector database. (This is also in line with being able to audit the solution, so that the vector database can be recreated.)

3.) In line with the above, neither the title of the book nor the name of the author will be included as input-- just the text of the chapters. So, for example, if the input text is the book "Ender's Game", then neither "Orson Scott Card" nor "Ender's Game" will be included in the vector database or the question prompts. This is to (somewhat) prevent overt association with Wikipedia or other commentary that GPT will already have been trained on, since those sources often include commentary or summarization.

4.) The same version of GPT with the same configuration must be used for all questions. More explicitly, the first "most likely" response from GPT is considered the answer-- no regeneration, no picking the 3rd or 7th most likely response. The idea is that the same model provides its best answer for each of the questions, and not a tuned model per question. (See the sketch after this list.)

5.) The LLM system needs to score a passing grade; if provided with 5 questions, answering 4 correctly (80%, B-) would be passing, while answering 3 correctly (60%, D-) would not. Grading is on whether the answers are correct, not on any variant of a Turing test-- we want to know if the system can understand and *correctly* answer questions about a novel, not whether it can provide BS answers comparable to an arbitrary human who is also providing BS.
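For concreteness, condition 4 might be pinned down in code along these lines-- the snapshot name here is only an example, and temperature-0 decoding is the closest practical stand-in for "first most likely response":

```python
# Hypothetical sketch of condition 4: one pinned model snapshot, greedy
# decoding, and a single completion per question, so the first "most likely"
# response is always the answer.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MODEL = "gpt-4-0613"  # a dated snapshot, so the weights can't change mid-trial

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        temperature=0,  # most likely response; no sampling
        n=1,            # exactly one completion: no regeneration, no cherry-picking
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```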

I have two novels picked out from memory, but would need to look over the hard copies before finalizing questions. One novel is difficult, but fair-- it's a college-level text that I don't think GPT can answer questions about correctly (hence the wager), but it is at least possible that it might come close or even succeed. I consider the other novel, which is a graduate-level postmodern text, to be verging on impossible for GPT; I am confident enough that GPT can't answer questions about it that I would be willing to re-raise your wager, but I also think it is likely an unfair case that verges on adversarial. Depending on how confident you are feeling, we can do one book, or the other, or a paired wager that tries both depending on success/failure.

As for the types of questions, I'll give some examples from 'Memento'. Memento is a movie, not a book, but it reflects the plot complexity of the novels that I have in mind. Here are some questions that I might ask about the plot of Memento:

Q1.) What is John Edward Gammell's nickname?

Answer-- Teddy. Teddy reveals that his mother calls him 'Teddy', and his full name is said only once. The above question is much harder than the inverse question "What is Teddy's real/full name?", because the real name is only mentioned in passing, so there is less activation/attention on it than on "Teddy", which occurs throughout the movie.

Q2.) Does Catherine Shelby survive the attack by intruders?

Answer-- Yes, at least initially. This question is hard for two reasons-- first, the system needs to recognize that Catherine Shelby is Leonard's wife. Second, there is repeated contradiction as to whether she survives or dies; Leonard repeatedly states that she died in the attack, but Teddy says she didn't... but Teddy is also unreliable. However, Teddy's account is corroborated by Leonard's own memory-- he is seen remembering her blink while on the floor of his house after the attack, so we know that she is at least alive when they are on the floor.

Q3.) Put the events of the story in chronological order.

I won't give a full answer here for brevity's sake, but this is the type of task that a human could certainly do... it should be obvious why it's hard for an LLM to do the same.

Q4.) Why does Leonard sleep with Natalie when she is mean to him (berates him, spits in his drink)?

I would probably formulate this as a multiple-choice question with several competing answers (Leonard has a humiliation kink; Natalie is attractive enough that he doesn't care; etc.), but answering correctly gets to the whole point of the movie, and tests whether or not GPT actually understands the central plot line.

If you do want to try the harder example, whether you succeed or fail, I would ask that each of us agree to be co-authors on any blog/paper/publication that comes out of it-- I expect that it's either an interesting publishable case if GPT can't answer, or, conversely, you'll have done something extremely clever to modify the LLM or vector-database processing flow to get it to understand. Also, for my own edification, I'd like you to disclose whether or not you've read the books, once they're provided at the beginning of the week.

Thoughts?


Directionally, I'm in! It would be amazing, if you have time (perhaps with GPT's help), to put together a set of Q's about an actual novel of roughly the target's length and complexity. That way, in case of controversy, we could get an appropriately selected sample audience to confirm they were more or less at the level of the ones constituting the challenge, and I could do a private test run.

What do you think?

Some smaller tidbits:

Input data: you mean nothing added to the prompts or the DB after the text and questions have been revealed, correct?

Scrapping titles: I would actually recommend we pick a novel published after the model's cutoff date, to be sure.

Same model: more than that. The idea is that, once you reveal the novel and questions, we feed them to the system and do not interact with it further until we get the answers.

How's that sound?


I hope this comes to fruition


me too - feel free to take over the bet if you wish (:


I like the way you think, new subscriber here so will have to look back at your earlier posts. Happy to have found you. 🙏👍


Really enjoyed this, and would be happy to see more like it-- that's not to say end the heavier ones, which I think are better, but I know for a fact you have plenty worth sharing.

(Still waiting on the Bing architecture etc.)


Aw thanks!

The thing is, the Bing architecture has been heavily simplified in the meantime, with the main model doing most of the non-live-info-gathering work. While the apology model remains, I'm not sure it's worth a post in its current state - and a non-replicable one about the previous implementation is kinda less interesting.


Really enjoyed this.


hey manbro, unblock me on Twitter @wildtxyz - it wasn't supposed to be offensive.


"I wager 0.2 BTC on being able to demonstrate, within a week of the bet being accepted, a GPT-4-based system which will take a novel and some questions as input and answer the questions at a satisfactory level."

I can't take the bet but I'd be quite surprised if a current GPT-4 based system could do that. I suppose a lot hinges on what you would consider a satisfactory level.

Also, you and your counterparty should select a novel that was published after GPT's cutoff date, to avoid the risk of it parroting back human commentary about the novel.


If you'd like to take a smaller bet, and operationalise "satisfactory", I'll be happy to go for it.


Thanks, but no. I do appreciate your offer to lower the stakes and I hope you are able to find a counterparty.
