Validating Finite Automata Reduction results

We are getting very close to being in a position to officially accept @UncombedCoconut 's [Decider] Finite Automata Reduction decider (the method is also presented extensively here).

In a nutshell, the method gives a very efficient algorithm for finding a regular language that describes the eventually-halting configurations of a TM. That way, the TM can be proven to be non-halting if the initial, all-0 configuration is not part of the language. This approach is closely related to @heiner's and @sligocki's CTL method.
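To make the membership test concrete, here is a minimal sketch (the automaton, the toy language, and all names below are invented for illustration; actual FAR certificates encode TM configurations differently): encode a configuration as a word, run the automaton for the eventually-halting language, and conclude non-halting when the all-0 initial configuration is rejected.

```python
# Hypothetical sketch: run a DFA, given as a transition dict, on a word.
# In FAR the language would describe eventually-halting TM configurations;
# here a toy language ("contains a 1") serves as a stand-in.

def dfa_accepts(transitions, accepting, word, start=0):
    """Return True if the DFA accepts the word."""
    state = start
    for sym in word:
        state = transitions[(state, sym)]
    return state in accepting

# Toy DFA over {0, 1} accepting words that contain at least one 1.
trans = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}
accepting = {1}

# The all-zero word (stand-in for the blank initial tape) is rejected,
# so a machine with this "halting language" would be proven non-halting.
print(dfa_accepts(trans, accepting, [0, 0, 0, 0]))  # False
```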

Following the requirements of [Debate & Vote] Deciders’ validation process, the method is close to the finish line because:

  1. The decider follows the principles of our reproducibility and verifiability statement.
  2. It decides 99% of the currently undecided machines (it leaves only a few hundred machines undecided out of the current 535,801).
  3. The decider has been reproduced independently by @TonyG (here) and by @cosmo (here).
  4. The method has been formally presented and proven correct here (still WIP but currently in a good readable state).

One notable feature of the method is that it is very well suited to verification: for a given machine decided by Finite Automata Reduction, it is easy to provide a certificate with the necessary and sufficient data (mainly consisting of a specific finite automaton) which allows one to conclude that the machine doesn’t halt. Moreover, the task of verifying that such certificates are valid is computationally simple.

Hence, a verification format for FAR has been proposed by @TonyG and @UncombedCoconut, and several independent verifiers for FAR proofs have been built (here and here).

This brings us to the point of this topic: re-finding from scratch all FAR proofs that have been found by @UncombedCoconut (and combined in this ~15 MB DVF file (DVF stands for Decider Validation File)) requires months of compute and quite some expertise to accelerate the search algorithm in certain cases. However:

  1. All those proofs, combined in the ~15 MB DVF file, are easy to verify: it takes roughly 5 minutes on an 8-core laptop to verify all ~500k proofs using the slow (but hopefully easy to follow) Python implementation of the verifier.

  2. Finding from scratch ~90% of those proofs requires only a little compute time (~10 minutes using @UncombedCoconut’s Rust implementation).

Hence, here is what I propose, and please feel free to express your opinion on this post:

  1. We officially accept all machines that have a FAR certificate present in the DVF file, regardless of how hard the proof was to produce.
  2. We require that ~70% of the proofs in the DVF file be re-found independently. The independent Python implementation of the decider, available here, could achieve this goal using a few hours of compute time.

So, maybe put a heart reaction if you are OK with that plan and/or express your opinion in this thread :slight_smile:


Note: there’s an extra wrinkle because the proof file solving all but <=141 of the machines is not limited to proofs my decider did find, or necessarily even could find, via search.
Explanation: the CTL proof technique, and the verification data format we’ve used to represent it, is highly general. So, not only are TM solutions found by Iijil’s Bruteforce-CTL program importable, but also CPS (closed position set) solutions found by Nathan F’s “simple n-gram CPS” decider or certs emitted by savask’s CPS.
I can’t import TonyG’s bouncer proofs yet, but that’s possible too.
In case the solution is representable with 25 DFA states or fewer on one of the sides, you could theoretically wait long enough and be fully confident my program’s search would find it (or an alternate one). You’d die before it finished the larger ones, though, unless you took other manual steps to guide the search.
Above 25, my Rust program’s search won’t even output the solution yet.

So, there is a wide spectrum of possible responses: take all successfully verified solutions, take all my program could do, take all my program will do in a fixed depth… Or introduce a distinction between TMs with known solutions and TMs whose solutions pop out of practically replicable steps.

I prefer simply adopting the full list of verifiable solutions – while also explaining in the write-up that they came from diverse sources, many authors deserve credit, and new deciders are worth accepting even if they solve “already decided” TMs.
Basically: reflect the frontier of BB(5) progress, while maintaining transparency.


This is so exciting! Amazing work @UncombedCoconut and everyone else working on this! I agree completely with this plan. From my point of view, a verification certificate that can be independently verified is sufficient and will definitely be necessary for many holdout TMs.

Oh, interesting, I didn’t realize you were pulling in CPS proofs, that’s really cool that the verification format is so generalizable!

It does feel like we sort of have two categories of “proven” TMs now: Ones for which we have a verified proof (produced by who knows what) and ones that can be proven reasonably quickly using published code. And also that we have sort of two meanings of decider. Previously each decider was a specific program/algorithm, here we have many deciders, but one uniting verification format. It would be cool if we could associate each TM with what program decided it (in addition to associating them all to the common verification format).

In general, I really like this idea: diverse (unproven, potentially unreproduced) deciders feeding a single proven, reproduced verification format. I’ve been thinking lately about such a format for the sort of “rule proving systems” that uni and I use. One thing that I find really interesting about these is that it could help support adding verification to human-constructed proofs of complicated holdouts! Maybe someone will discover that a holdout has a CTL representation that has a lot of states (but maybe enough structure to be human readable) and can write their own machine-verifiable proof! (not quite Lean, but a good step on that path :slight_smile: )

Are you saying that all of these proof techniques (including, say CPS) are converted into the Direct DFA/NFA style proof described in the paper? IIUC, the DVF verification formats ID 10 & 11 both only work for this specific restriction of the structure of the finite automaton? Or are you also including the general NFA verification format (listed w/o ID on GitHub - UncombedCoconut/bbchallenge-nfa-verification: Tentative DeciderInfo format for NFA proofs.) which seems to allow for general NFA without restriction?

Yes, I’m determinizing on one side (actually, trying what I’m about to describe on both sides and keeping whichever one results in a smaller proof).
Typically, there’s a transition function in the picture. For example, with n-gram CPS, there is a set of n-grams allowed on the side of interest – so, I know how to transition from one set of possible n-grams to the next. (Shift each one, append the bit which has been read, and intersect the resulting set with the allowed set.)
Instead of worrying about the head-symbol transitions, I can stop there, calculate the BFS-ordered DFA on the {0,1} alphabet, and complete it.
The conversion process tends to be inefficient, so I also test quotients of this DFA to see if they work. (That is, I identify pairs of states, and see if the resulting DFA still completes.) This can improve the state count by dozens or more. In many cases I end up with CTL data smaller than the original CPS data.
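For concreteness, here is a hedged sketch of that subset construction (the function name and the toy 2-gram set are mine, not from the actual decider, which would also go on to test quotients): a DFA state is the set of n-grams still possible on the side of interest, and reading a bit shifts each n-gram, appends the bit, and intersects with the allowed set, numbering states in BFS order.

```python
# Illustrative sketch of determinizing an n-gram CPS on one side.
# A DFA state is the set of n-grams (tuples of bits) still possible.

def ngram_dfa(allowed):
    """BFS-ordered DFA over {0, 1} from a set of allowed n-grams."""
    start = frozenset(allowed)
    ids = {start: 0}          # state-set -> BFS index
    order = [start]
    delta = {}
    i = 0
    while i < len(order):
        src = order[i]
        i += 1
        for bit in (0, 1):
            # Shift each n-gram, append the bit read, intersect with allowed.
            dst = frozenset(g[1:] + (bit,) for g in src
                            if g[1:] + (bit,) in allowed)
            if dst not in ids:
                ids[dst] = len(order)
                order.append(dst)
            delta[(ids[src], bit)] = ids[dst]
    return delta, ids

# Toy example: 2-grams forbidding "11" (no two adjacent ones on this side).
allowed = {(0, 0), (0, 1), (1, 0)}
delta, ids = ngram_dfa(allowed)
print(len(ids))  # 6 states, including the empty "dead" set
```

The quotient step described above would then try to merge pairs of these states and keep a merge whenever the resulting smaller DFA still completes.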


I think we should only officially accept those proofs that can easily be found from scratch with the FAR decider.

My reasoning for that is that we want the “undecided machines” to roughly represent the work we still need to do. And we probably need to do the work of validating a CPS and a Bouncer decider at some point.

A final proof that only briefly mentions CPS and Bouncers as ways to generate FAR certificates would be very unsatisfying. And with the work of validating those in the future anyway it seems disingenuous to reach ahead with FAR and include the converted results.

At some point we might have machines that no decider that is likely to be validated can prove easily from scratch. Only when that happens should we consider applying the FAR certificates that we have gotten from experimental searches or deciders that are too complicated to prove.

The number on the page then won’t reflect the “true frontier” that we are working with on discord, but it has never really done that at any point before. I don’t see a reason to change that now.

I wouldn’t be against displaying a second number for the discord frontier. Though picking that value might require a separate discussion. After converting the output of the unvalidated deciders into FAR certificates we are only missing a handful of hand-proofs and would need to agree on which existing proofs are good enough.

This is an interesting point. (I also think the rest of the post makes perfect sense. My preference expressed above is not strong. If we can iron out the details, a dual-status UI seems like the best way to address all of the goals that are in tension.)
I think the answer would be: start by accepting 0 of the manual proofs, follow the acceptance procedure for new verifiers and individual proofs, add certs to accepted verifiers if the verifiers accept. (We’re talking about ~30 beavers of ~8 types – the process doesn’t have to be perfectly streamlined.)

> A final proof that only briefly mentions CPS and Bouncers as ways to generate FAR certificates would be very unsatisfying.

> And we probably need to do the work of validating a CPS and a Bouncer decider at some point.

This touches on an interesting question which @savask also broached at length in Discord: In a world of separate deciders and verifiers, do we want to apply the full rigorous acceptance procedure to both deciders and verifiers, only to verifiers, some mix?

From a purely mathematical truth point of view, it seems that only verifiers need to be proven, reproducible, etc. If they are correct, then the deciders could effectively just be heuristics. Sort of like how in a math paper, nobody says you have to explain how you happened to think of the proof, only what the proof is and that it is valid.

However, this has two drawbacks as far as I’m concerned (and I’m mostly stealing from @savask’s comments now):

(1) It’s unsatisfying / unhelpful to future researchers. We would like to share how we got results, not just that we got them. Towards this end, there seems to be agreement that for the paper (for example) we would prefer to be able to prove as many TMs as possible via algorithms that can be clearly stated, reproduced and run reasonably quickly. But if that’s true, then, as @Iijil mentions, we’ll have to validate and prove these other methods eventually. And if so, then maybe it does make sense to have the webpage reflect that.

(2) Verifiers have more potential for false-positive bugs. Since independent verifiers are only tested on files containing all “correct” proofs, it’s easier to detect false negatives (when your verifier mistakenly rejects a valid proof), since you are basically testing it on every valid proof, and much harder to detect false positives (when your verifier mistakenly accepts an invalid proof).
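This asymmetry is easy to illustrate with a toy sketch (entirely invented; the real FAR verifiers check different, TM-specific conditions): a file of all-valid certificates only exercises the accept path, so deliberately mutated certificates are needed to exercise the reject path.

```python
# Toy "verifier": a certificate is a set of accepting DFA states, and the
# (made-up) invariant is that the accepting set must be a trap, i.e. closed
# under all transitions -- loosely analogous to FAR's closure conditions.

def toy_verify(delta, accepting, symbols=(0, 1)):
    return all(delta[(q, b)] in accepting for q in accepting for b in symbols)

delta = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}

assert toy_verify(delta, {1})        # a valid toy certificate passes
# Negative-example testing: mutate the certificate, expect rejection.
assert not toy_verify(delta, {0})    # a false-positive bug would slip through here
```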

This all makes me wonder if we might want to split up the formal requirements for a decider to be included in bbc: That the decider must be reproducible by someone else, but that the verifier is the only part that needs to be formally proven?

> I wouldn’t be against displaying a second number for the discord frontier.

FWIW, I think calling this a “discord frontier” is a bit misleading. Up until now, we have had an informal discord frontier where people share their lists of holdouts, growing steadily smaller. But these lists were extremely unproven: maybe someone posts a list of 1k holdouts from their method, then someone else takes that list and whittles it down to 100 with their method, etc. Can you trust these numbers? I wouldn’t too much; they are based on a lot of steps, any one of which might contain a mistake. In contrast, the “second number” here would represent a count that had been proven (to some extent) through written proof and validation certificate, verified by multiple independent verifiers. To me, if we were to include 2 numbers it would basically be: (1) TM Holdouts for which we want a better proof and (2) TM Holdouts for which we have no proof.

FWIW, I think it’s wrong to draw this conclusion from the paucity of bad/adversarial examples. A FAR verifier is a simple program with a default-deny design. Deciders start at a catastrophic disadvantage, and thank goodness negative-example testing can mitigate it!
That said, I’m watching the formally-verified effort (singular for now) with great interest. The more assurance the better.

Ok, I think you make strong points (related to @savask’s points on discord).

I am then leaning towards the option of validating FAR proofs that are easy to find, which currently would set the number of undecided machines to around 32k.

I am not really keen on displaying another counter on the website as I feel it clutters the story. Methods for deciding machines either pass our “quality standards” or they don’t, having something in the middle seems odd to me, especially when we would only have to wait for a few more months to see the 32k counter go down again.

I should add something important:

To me the main point is that we do not currently have 100% certainty in the validity of the FAR verifiers, mainly for lack of testing for false positives. In a world where we had that certainty, I would be 100% for advertising the ~100 holdouts number officially on the main page, as to me this number represents, to date, the number of undecided machines that we are sure of.

That number does not reflect how the story will be presented in the final paper, and that is a separate discussion to me (I agree that it is not 100% satisfying just to dump all the verified proofs on the reader, even if we have absolute confidence in the verifiers).


In the end we went with the conservative option, see [April 9th 2023] 503,169 new machines have been decided! Only 32,632 left to go!

Thank you all for participating in this debate!