Standard TM Text format

sligocki · June 8, 2022, 8:03pm

There are many potential formats for sharing TMs. It seems that bbchallenge, I and Nick Drozd have converged on extremely similar (but not identical) formats. I have wanted to change my format for a couple of reasons and thought it might be nice to build consensus around a new format here (and maybe y’all would be interested in adopting it as well?).

I have noticed 3 closely related formats:

IIUC, bbchallenge uses formats like 1RB---0LC1LC0RD0LB1RA0RE1RD1RH (no spaces)
In our codebase, we use formats like 1RB --- 0LC 1LC 0RD 0LB 1RA 0RE 1RD 1RZ (it does not get rendered correctly here. 1 Space between columns, 2 spaces between rows.)
On Nick Drozd’s blog he used 1RB --- 0LC 1LC 0RD 0LB 1RA 0RE 1RD 1RZ (1 Space everywhere) and more recently 1RB --- ; 0LC 1LC ; 0RD 0LB ; 1RA 0RE ; 1RD 1RZ (; between rows).

There are two things that I like about my format:

It’s a little more human readable with some spaces, it gives a visual sense for where the row changes happen (if the medium preserves spaces).
It allows specifying multi-symbol machines in the same format (because row breaks are explicitly indicated).

but there are also things I dislike about it. Mainly that many programs deal poorly with the spaces:

Markdown (and HTML renderers in general) seems to ignore double spaces in code blocks. So, this forum and my blog all lose that double space info.
Command lines also treat spaces specially and ignore double spaces, so I cannot pass a TM text spec in directly into the command line, I have to quote it or something annoying.
URLs don’t like spaces, thus I see why y’all chose the no-space version.

Therefore, I’d like to come up with a new format which (A) allows distinguishing row-breaks clearly in the format (so that multi-symbol TMs can be represented) but also (B) works well with command line, markdown/HTML, URLs. My current thought is basically to replace the spaces in my format with some new pair of chars. Something like:

1RB_---__0LC_1LC__0RD_0LB__1RA_0RE__1RD_1RZ (replace spaces with _. Use 2 _ as row break).
1RB_---/0LC_1LC/0RD_0LB/1RA_0RE/1RD_1RZ (/ for row breaks. _ for column breaks.)
1RB_---;0LC_1LC;0RD_0LB;1RA_0RE;1RD_1RZ (; for row breaks)

Honestly, I don’t really like any of these. I was most interested in (2), but that seems bad for URLs … I guess (3) is not great for command line …

What do y’all think? Do you have any alternative suggestions? I think it’s so great that we have such aligned formats in general (We’ve all agreed to say 1RB not B1L, for example. We’ve agreed to list these in row order where rows represent states, --- for undefined, etc.) It’s just this one detail of row and column separators that differs.

cosmo · June 8, 2022, 9:13pm

Thank you for your post! I think that agreeing on formats and trying to promote a unified one is very important.

I agree with your point about column/row separators, bbchallenge’s format is unsatisfyingly ambiguous and not very readable. For this latter concern, I will although note that I was more concerned with it being easily copiable/pastable than clearly readable: I tend to find spaces scary when defining a format because they easily go missing or redundant, and as you said, not very URL friendly. The initial encodings we were using were absolutely terrible (bad idea from me): it was a base-64 encoding of the machine’s binary rep. Completely not humanely parasable (but URL safe…, this was changed with this issue, shout out to @modderme123 ).

In fact, I never really relied on 1RB0RE1LC0RE---0LD1LE0LB1RA0RC to be an easily readable representation of the machine, but to be an easy way for me to copy and paste the machine around with low error probability and provide a link, where a readable transition table is printed.

There seems to me that there are two separate concerns:

Human readability. For this matter I would argue that, perhaps, no linear representation is really satisfying and that, because mediums tend to be more and more powerful in their rendering capabilities, a markdown/html table will probably be increasingly easy to share almost everywhere and will be human readable in essence. For instance:

	0	1
A	1RB	1LC
B	1RC	1RB
C	1RD	0LE
D	1LA	1LD
E	—	0LE

|   | 0   | 1   |
|---|-----|-----|
| A | 1RB | 1LC |
| B | 1RC | 1RB |
| C | 1RD | 0LE |
| D | 1LA | 1LD |
| E | --- | 0LE |

(I used this website to generate it: Markdown Tables generator - TablesGenerator.com)

Usability with software. The main advantage of a linear representation is the ability to be easily parsed and outputted by the softwares that we use and be easily pastable/copyable around. As I mentioned above I would tend to dislike the use of spaces because they are so easy to miss/repeat. I had had the following idea: to add a semi-colon separated header giving the number of symbols if not 2. For instance:
- 1RB0RE1LC0RE---0LD1LE0LB1RA0RC for a 2-symbol machine
- 3;2RB1RE2RB0LC0RE0RE---2LD2RE2RB1RE2RB2RB1RE2RB for a 3 symbol machine
  The format could allow, optionally, to add some row separators in for readability:
- 3;2RB1RE2RB_0LC0RE0RE_---2LD2RE_2RB1RE2RB_2RB1RE2RB
  I am personally not sure that column separators would really help improve the readability of a linear format which is, by essence not very readable, and that they would probably clutter the space more than they would help.

What do you think?

nickdrozd · June 8, 2022, 10:05pm

Tables are great for general exposition, but for quick communication among specialists (like us) I think a one-line human-readable format is extremely useful. It allows us to, for example, send an email with a list of programs. Of course we don’t usually spend a lot of time reading these by the naked eye, but it’s possible to make some cursory analysis – for example, does a given program have any reflexive nodes?

For a while I was using just one space between all instructions. But that makes it difficult to distinguish between the states and colors. So I switched to Shawn’s format with the two spaces between different states. But as he says, those extra spaces sometimes get dropped, which is annoying.

Lately I’ve been thinking that the semicolon is the best solution. It remains human-readable, colors and states are distinguished, and there’s less chance of renderers taking liberties with the characters used.

Examples:

4/2 champ: 1RB 1LB ; 1LA 0LC ; 1RZ 1LD ; 1RD 0RA
2/4 champ: 1RB 2LA 1RA 1RA ; 1LB 1LA 3RB 1RZ

The spaces can be replaced with underscores in certain circumstances, but IMO that detracts from readability. (I’m fine using quotation marks at the command line.)

A related issue is what to do with the halt instruction. I don’t love the Z, and recently I switched to using underscores. But that doesn’t look great, and maybe there’s something better.

cosmo · June 9, 2022, 8:56am

Ok! I like ;

Why not - for halt: 1R- ? This unifies well with --- which we use for undefined transitions at bbchallenge as we are not too concerned with the last symbol and move that are done before halting. IIUC, @sligocki is using --- as well.

Then, the specification of the format could be:

1RB 2LA 1RA 1RA ; 1LB 1LA 3RB 1R-
Halting transitions can either be specified as completely undefined --- or partially undefined e.g. 1R- (write symbol and move direction must be given).
Spaces are optional and “meaningless” (i.e. having 0, 1, 2, 3, etc… spaces between transitions (or around the ;) does not change the meaning)
Hence, the following is also valid encoding of the previous machine (and useful for URLs): 1RB2LA1RA1RA;1LB1LA3RB1R-

Having spaces be meaningless is practical because then, parsers can just trim the input (remove all spaces) and be left with a machine specification that is very simple to parse.

What do you think?

sligocki · June 10, 2022, 9:30pm

Cool, I think we are converging on a good format that works in many situations I am very happy with the format being like 1RB 1LB ; 1LA 0LC ; 1RZ 1LD ; 1RD 0RA but with spaces optional.

I don’t feel strongly about the Z for halt. I use it because at one time I started using machines with 8 states and so H became ambiguous (of course with 9 states you get state I which is also not great for readability, sigh). I am happy with either 1R_ or 1R-. They both have the advantage of drawing attention to the halting transition which is nice

cosmo · June 11, 2022, 9:36am

Cool!

Could I add the proposal that ; be optional in the case of 2-symbol machines? (that way the currently used bbchallenge format would remain valid).

sligocki · June 13, 2022, 7:28pm

@cosmo, my preference is no, but let’s discuss. Here are my thoughts (sorry, this ended up a bit long-winded):

I am interested in developing a standard here. What is the value of a standard? I do not want it to be a situation where everyone feels forced to use the standard, instead I want it to be something that I can write a blog post about to describe precisely and let people know that this is the format I will be using to share machines. But, I will be very excited to have a standard which is multiple folks active in the community are also using. Notably here, if we can can produce a standard here which all 3 of us implement, that would be awesome

What is a standard? And what does it mean to implement a standard? For a format which has a unique way to represent, it’s pretty simple: You have to be able to read and write the correct format. But we are discussing a flexible format which would have many ways to be written. From my point of view implementing this standard means that your system can read all valid written forms of the format and write to one of the forms.

So by adding special cases (like ; optional for 2-symbol machines) we make all parsers more complicated (since everyone implementing the standard must be able to read this). That means that describing the format is harder and there is more barrier to entry for anyone else implementing a reader for this standard.

Furthermore, what is the value of loosening our standard to fit the existing bbchallenge format? I think the value is that then we could do things like copy the bbchallenge format string from the website into one of Nick or my tools and it would “just work” without needing to change formats. That is valuable and I do hope we end up in a place where that works! But what about the other direction? What if Nick or I find a TM and want to search for it on your site? Would we be able to enter it in our format with ;? Or would the website only parse TMs in the non-; format? If so, it is not really implementing the standard.

So I guess what I’m getting at is that if you want to implement parsing for this standard on bbchallenge you will have to make code changes to allow it to read the different forms of the same TM anyway. Once you’re doing that maybe it makes sense to just write TMs out in the ;-ful form anyway? (This question is not rhetorical, let me know what you think, it’s just my guess) Also, even if the ;-less form is not part of the standard, you could still continue to accept it as an input format on bbchallenge if you liked (for legacy reasons). That would mean your parser implemented the standard (and additionally accepts a non-standard format).

But let me know what you think. I am open to the idea of allowing a shortcut for 2-symbol cases as I know that is the case which is most common for most folks.

cosmo · June 13, 2022, 8:12pm

I agree with your points.

We could stick with this definition:

I would add the point that, in the special case of URLs, spaces are not allowed (i.e. bbchallenge would not parse a machine of the format where spaces have been replaced by %20 for URL compliance).

Also, the point about halting transitions could simplified/relaxed in the following way:

The symbol - means undefined and is allowed in a transition for all or either of write/move/goto with the consequence that if at least one of them is undefined the machine halts.

cosmo · June 16, 2022, 8:04am

@sligocki @nickdrozd Pavel has made the very good point that ; is not CLI friendly at all (bash will parse this symbol for something else).

Hence I like Pavel’s idea: ; → _ which would give 1RB2LA1RA1RA_1LB1LA3RB1R-.

Pavel is quite against having human-readability as a criterion (which I tend to agree with) and argues that it is better to have a format that is always copy-pastable by double click (symbol - does not help in that regards).

He also makes the point that using - for halting forces to modify his parsers a bit whereas using the rule "any letter that is equal or bigger than A + nb_states" has proven to be very easy to parse and use for him. This way of doing things gave him easy interoperability with Marxen’s format.

Those are interesting points that we should consider.

sligocki · June 16, 2022, 4:10pm

Pavel also brought up on Discord interoperability with Heiner Marxens format which looks like “B1L” instead of “1RB”. My parser was also (originally) flexible to support this format, but now I have moved away from that compatibility over time.

sligocki · June 16, 2022, 4:22pm

It sounds to me that @cosmo @nickdrozd @uni and I are all in agreement that we would like to develop a common standard and the question is now down to the details. I will say from the start that if you are all committed to supporting a standard, then I am willing to have almost any variation Let’s make this happen!

Details to decide:

Have spaces between transitions? This improves human readability but adds complications for computers.
Allow optional spaces? This complicates parsing and makes the standard less “standard”, but perhaps supports different users’s preferences and thus encourages adoption of the standard(?)
Row separator char? Should we use ;? _? Other?
Allow row separator to be dropped for 2-state case? Complicates parser, but maybe makes common case simpler.
Transition format. Should we require 1RB format? Or also allow Marxen’s B1L format? (They are distinguishable as long as there are less than L states)
Halt state char? Z? -? H? …
Undefined transition format? ---? Or should we not special case this in the format and just use 1RZ (or whatever we use for halt trans)?

Let me know if y’all can think of any other details we are undecided on.

sligocki · June 16, 2022, 4:32pm

I think the best way to move forward would be for each of us to start by listing our preferences on these and then we can try to converge. I’ll start:

I have mixed feelings here. I care about readability (even for the 1-line format). Especially for sharing with folks who are not intimately aware of our systems, I think the format with spaces looks like things they may have seen before like Pascal’s website: Historical survey . On the other hand, I agree with Pavel that allowing these strings to be selectable with double click and work without quotes on command line would be a huge bonus (URLs as well). So I’m really torn here.
I’m open. The more we talk, the more I’m preferring that we pick a single format (without optional differences) to simplify parsing and to make it obvious that we’re all using the same format (sort of developing a “brand” for the standard).
I’m happy with anything as long as we support multi-symbol as a “first-class citizen”! Ex: ; or _ are fine by me. I do like _ a bit more since it will work in command line and double-click selection.
As described in a previous post, I am against this.
I prefer the standard to be to use exactly one format. My pref is 1RB, but I’m open to the other.
I don’t really care too much here. I slightly dislike H because it limits us to 7 states (which is awkwardly close to where the fun is happening ). I use Z and like it, but I’m open to other options.
I prefer --- to represent undefined trans and to consider that as different than a normal halt. However, I’m flexible on this. In my system the difference is important (to decide if we need to continue TNF or not) but for an interchange format, perhaps the difference is unnecessary (or even confusing).

univerz · June 16, 2022, 10:40pm

spaces look good, but are not human friendly when copypasting and the ratio between reading line format vs copypasting is probalbly low. i’m ok with 1RB_2LA_1RA_1RA__1LB_1LA_3RB_1RZ if it helps.
i’m ok with optional something, if it is compatible with copypasting and easily detectable.
_ - readability comparable with space & works in url / commandline
i do not play with >2 symbol machines yet & do not have this in my code, but i understand & support shawn’s view.
i was probably the last to actively use the B1R format, but changed it to shawn’s recently exactly because of interoperability between us. => there is no need for you to support it, i just need it because of my old data.
i actually like my solution - any letter "bigger" then machine size - it’s parser compatible with H & Z. but i can hardcode Z for example if you want a fixed letter. i understand visual appeal & sort of similarity (but halting is not undefined) - vs ---, but - at the end makes machine not double click selectable and you have to be able to print halting configuration - 0 <- 101 looks weird & your code probably looks like this (state + 'A' as u8) as char anyway.
i’m probably ok with ---, used it in my old code. do not have this in new codebase yet, so i rewrite it to 1RZ temporarily.

Especially for sharing with folks who are not intimately aware of our systems

i’m not advocating banning spaces, i just like the idea to speed up most common “fastpath” == us exchanging machines. it’s ok to use quotes and open space delimited machine once a year if someone doesn’t use our format. (i’ll probably just do ~replace('_', ' ') on whole input anyway.)

cosmo · June 26, 2022, 2:39pm

Thank you @univerz for giving you opinion here!

I am not in favor of spaces in this format. To me, its primary use is to exchange information that we can easily plug in our simulators/programs and no space is clearly better for this purpose. Also, I still argue that even with space a linear representation of TM is hard to read.
Inspired by comments above, I don’t think we should allow for optional features in this format. Having one clearly and uniquely defined format makes everything simpler and hopefully more future proof.
I vote for _, for TTY compatibility.
No optional: _ is mandatory in all cases.
Let’s settle on one format. 1RB is the most intuitive to me: the machine writes a 1, moves to the right and then switches to instruction B.

6 and 7 are currently the trickier points for me. I believe that undefined transitions (---) are an elegant way to handle halting and I am not personally very interested in Rado’s \Sigma function hence whether the machine writes a 0 or a 1 before halting is not important to me, hence writing nothing is a good option to me.

If I had to vote following my heart:

6 & 7. Undefined transitions == halting transitions and are of the form --- uniquely.

However I understand that I might be minority here in my view for 6&7 and that, as @univerz mentioned this requires to handle a special case in parsers. I like his rule of 1R<what ever state that is not in the machine> for halting.

sligocki · June 29, 2022, 9:31pm

OK, we’ve heard from everyone except @nickdrozd . My interpretation of the consensus is:

No spaces.
No optional things. Everyone using exchange format uses same format.
Row separator: _.
_ explicit row separator mandatory in all cases (no optional things).
Require standard 1RB format.
@univerz and @cosmo like “any letter > num states” as halt state. I’m not a fan of this in general, but I’m fine with this as long as the norm is to only use H or Z. (I think it’s confusing if someone used like L or S, or num_states + 1 or something that would make it seem like they had made a mistake).
Undefined transitions appears to be the most contentious aspect remaining. @univerz prefers to always use 1RZ (no undefined). @cosmo prefers to always use --- (undefined = halt). I prefer to allow both as separate meaning. Internally in my system, those two have different meanings (pre-split / post-split in TNF algorithm). So, my preference is to continue to allow both in the standard. But I could be convinced to force them all to a common format for interchange.

univerz · June 30, 2022, 9:43am

@univerz prefers to always use 1RZ (no undefined).

no, i’m 95% ok with separate --- & 1RZ with the same meaning as you use (undefined & halt) - i used something like that in my old code. i just don’t have it in my new code yet, so i’m not 100% sure. but it should be ok.

1RB---_0LC1LC_0RD0LB_1RA0RE_1RD1RZ?

cosmo · July 9, 2022, 10:25am

After discussing it on discord I got convinced that having both 1RZ and --- is important, so

1RB---_0LC1LC_0RD0LB_1RA0RE_1RD1RZ

is good for me.

If everyone is OK I will then advertise this format on the method section of bbchallenge and use it in URLs.

sligocki · July 9, 2022, 3:01pm

Yes, sounds great

cosmo · July 9, 2022, 7:10pm

@star and @univerz have raised the point that --- is not convenient for double-click copy/paste:

1RB---_0LC1LC_0RD0LB_1RA0RE_1RD1RZ

Among the alternatives that were proposed:

1RBxxx_0LC1LC_0RD0LB_1RA0RE_1RD1RZ
1RBXXX_0LC1LC_0RD0LB_1RA0RE_1RD1RZ
1RBuuu_0LC1LC_0RD0LB_1RA0RE_1RD1RZ
1RBUUU_0LC1LC_0RD0LB_1RA0RE_1RD1RZ
1RBvvv_0LC1LC_0RD0LB_1RA0RE_1RD1RZ
1RBVVV_0LC1LC_0RD0LB_1RA0RE_1RD1RZ

To me 1RBXXX_0LC1LC_0RD0LB_1RA0RE_1RD1RZ and 1RBUUU_0LC1LC_0RD0LB_1RA0RE_1RD1RZ look good as they do not mix small letters with caps and they are visually noticeable. I slightly prefer U as it is the first letter of undefined.

What are your feelings on this?

sligocki · July 12, 2022, 2:16am

Makes sense. In general, my biggest hope is just that we solidify this soon and stop proposing changes, lol So my proposal is to put this up to a vote. Voting will be open until Friday. Anyone who will use this format can vote. I propose the following topics to vote on:

--- or letters for undefined transitions?
If letters, upper case or lower case?
If letters, which letters do you prefer?

Majority wins, but losers can try to convince others to join them and change vote. If you don’t like my voting system, feel free to propose an alternate