About strings in Coq

Tags: / coq strings unicode /

About strings and their representation in Coq..


Strings in Coq are represented using terms of type string, like

Require Import String.

Check "hello"%string.

But the string "hello" shown in this example is just a 'pretty-printed' (using string notations) version of its Coq representation.

string is actually defined as something like a list of characters.

Inductive string : Set :=
| EmptyString : string
| String : Ascii.ascii -> string -> string.

Arguments String _%char_scope _%string_scope

We can see it if we turn off notations while printing:

(* Disable use of notations while printing *)
Unset Printing Notations.

Check "ab"%string.
(*
String (Ascii.Ascii true false false false false true true false)
  (String (Ascii.Ascii false true false false false true true false)
     EmptyString)
     : string
*)

(* Enable notations while printing *)
Set Printing Notations.

Here the Ascii.ascii values are byte values representing the characters of the string.

(The name ascii used for the type Ascii.ascii may be a bit misleading as they are not exactly representing just ASCII values. ¹⁴)

The resemblance with the definition of list is evident. list looks like

Inductive list (A : Type) : Type :=
| nil : list A
| cons : A -> list A -> list A.

EmptyString and String are like the nil and cons respectively of list.

Ascii.ascii type

Ascii.ascii is defined as

Inductive ascii : Set :=
    Ascii : bool -> bool -> bool -> bool ->
            bool -> bool -> bool -> bool -> Ascii.ascii

Arguments Ascii.Ascii (_ _ _ _ _ _ _ _)%bool_scope

Ascii.Ascii, the sole constructor of Ascii.ascii (easy to confuse between the two names!), accepts 8 boolean values where each bool represents one bit (true for 1) that constitute an 8-bit integer which is the integer value of the character.

Note that the most-significant-bit (MSB) is the last bool value. ie, the LSB appears first.

This integer value may be thought of as the value of the character when encoded in UTF-8.

For example, the letter 'A', with binary value 01000001 would look like

Ascii.Ascii true false false false false false true false.

But remember that ASCII encoding is essentially a 7-bit value and not really of 8-bits.

A surprise

I looked if there's a function to convert an Ascii.Ascii value to a (sort of single-character) string:

Search (Ascii.ascii -> string).

This gave no output. So I guess no such function has been loaded as of now. But it's an easy enough function to make:

Definition str_of_ascii (a : Ascii.ascii) : string :=
  String a EmptyString.

str_of_ascii converts an Ascii.ascii value to a string with just the single Ascii.Ascii value in it.

Trying it out:

Compute str_of_ascii
  (Ascii.Ascii true false false false false false true false).
(*
= "A"%string
     : string
*)

Compute str_of_ascii (Ascii.ascii_of_nat 65).
(*
= "A"%string
     : string
*)

I was a bit surprised to find that the string form of 255 wasn't ÿ, whose integer value is 255.

Compute str_of_ascii (Ascii.ascii_of_nat 255).
(*
= String (Ascii.Ascii true true true true true true true true) ""
     : string
*)

Unset Printing Notations.
Compute str_of_ascii (Ascii.ascii_of_nat 255).
(*
= String (Ascii.Ascii true true true true true true true true)
         EmptyString
     : string
*)

So I checked how ÿ looks like

Compute "ÿ"%string.
(*
= "ÿ"%string
     : string
*)

Unset Printing Notations.
Compute "ÿ"%string.
(*
= String (Ascii.Ascii true true false false false false true true)
         (String (Ascii.Ascii true true true true true true false true)
            EmptyString)
     : string
*)

Instead of a single Ascii.Ascii, there are two of them!

I wondered why. I mean, a single Ascii.Ascii denotes 8 bits and can very well represent any value less than 256 by itself. Why did coq use two of them then? Sounded like something unnecessary.

That is until I got to know that that UTF-8 representation of ÿ required 2 bytes (16 bits).

$ unicode ÿ

U+00FF LATIN SMALL LETTER Y WITH DIAERESIS
UTF-8: c3 bf UTF-16BE: 00ff Decimal: ÿ Octal: \0377
ÿ (Ÿ)
Uppercase: 0178
Category: Ll (Letter, Lowercase)
Unicode block: 0080..00FF; Latin-1 Supplement
Bidi: L (Left-to-Right)
Decomposition: 0079 0308

Notice that the decimal form is 255 but the hexcode for UTF-8 is c3bf, which is a 2 byte value.

The reason is that ASCII values are limited to 7 bits. UTF-8 is something quite different that merely happen to use the same values as that of ASCII till 127 (including 127).

About UTF-8

UTF-8 (Unicode Transformation Format - 8 bit) is an encoding defined by unicode that can represent any valid unicode character using 1 to 4 bytes.

(See here for the story about UTF-8's origin.)

UTF-8 is a variable width character encoding. ie, it uses different number of bytes to represent different ranges of characters. 1 or 2 or 3 or 4 bytes are used depending on the character.

Byte 1 Byte 2 Byte 3 Byte 4
0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The bits marked x are the bits that are actually available to represent the character.

Byte count Free bit count
1 7 = 7
2 5 + 6 = 11
3 4 + 6 + 6 = 16
4 3 + 6 + 6 + 6 = 21

The range of characters that can be represented with 1 byte (effectively 7 bits after accounting for the zero-bit prefix) are the ASCII characters (hence the characters have same integer value in both UTF-8 and ASCII).

All other characters need at least 2 bytes.

That's why ÿ needed two Ascii.ascii values:

Definition strbool (b : bool) : string :=
  match b with
  | false => "0"%string
  | true => "1"%string
  end.
Definition binstr_of_ascii (a : Ascii.ascii) : string :=
  match a with
  | Ascii.Ascii b0 b1 b2 b3 b4 b5 b6 b7 =>
      (strbool b7) ++ (strbool b6) ++ (strbool b5) ++ (strbool b4) ++
      (strbool b3) ++ (strbool b2) ++ (strbool b1) ++ (strbool b0)
  end.

Definition get' (n : nat) (s : string) : Ascii.ascii :=
  match (String.get n s) with
  | Some a => a
  | None => Ascii.ascii_of_nat 0
  end.

Compute binstr_of_ascii (get' 0 "ÿ"%string).
(*
= "11000011"%string
     : string
*)

Compute binstr_of_ascii (get' 1 "ÿ"%string).
(*
= "10111111"%string
     : string
*)

Because ÿ is 0xc340.

Ascii.ascii beyond 255

While making an Ascii.ascii value, we can use all 8 bits to represent characters.

Compute Ascii.ascii_of_nat 0.    (* NUL *)
(*
= Ascii.Ascii false false false false false false false false
     : Ascii.ascii
*)

Compute Ascii.ascii_of_nat 255.
(*
= Ascii.Ascii true true true true true true true true
     : Ascii.ascii
*)

But the character with the 8-bit value 0xff (ie, 255) is incomplete in UTF-8 as characters which can be represented in just 1-byte start with 0 whereas 255's binary value starts with 1.

Just for the fun of it, I also tried converting a nat value bigger than 255 to Ascii.ascii:

Compute Ascii.ascii_of_nat 256.  (* same as ascii_of_nat 0 *)
(*
= Ascii.Ascii false false false false false false false false
     : Ascii.ascii
*)

So no error. Just overflow causing bit patterns to be reused.

Strings with quotes

The default notation for strings needs a pair of double quotes to kick in. But then how would we write string with double quotes in them? For example, a string like hello "hiya" hi?

We can use "" (two double quotes) to stand in for a single literal double quote.

Compute "hello ""hiya"" hi"%string.
(*
= "hello ""hiya"" hi"%string
     : string
*)

Compute """"%string.  (* a pair of double quotes *)
(*
= """"%string
     : string
*)

Multi-line strings

We can write strings with newline in them (ie, multi-line strings) by simply writing it one line after the other.

Definition foo := "a
b"%string.

Compute foo.
(*
= "a
b"%string
     : string
*)

The newline is showing up. Let's confirm by checking the string length:

Compute length foo.
(* 3 : nat *)

Sounds about right: 'a', '' and 'b' are the characters present in the string.

Let see what's the integer value of the second character in the string. Newline character has integer value of 10.

Compute Ascii.nat_of_ascii
 match (String.get 1 foo) with
 | Some x => x
 | None => Ascii.ascii_of_nat 0
 end.
(* 10 : nat *)

Okay, newline is there.

(String.get i str takes a nat value i and a string str and returns an option Ascii.ascii value denoting the i-th Ascii.ascii value in str, if any.)

Check String.get.
(*
get
     : nat -> string -> option ascii
*)

Another way of explicitly specifying a newline character is with its integer value as a string under the char notation scope:

Require Import Ascii.

Compute "010"%char.
(*
= "010"%char
     : ascii
*)

Finding a way to 'escape' characters

In programming languages like C, some special characters can be represented by 'escaping' certain characters. For example, newline can be represented with '\n'.

Couldn't find a way to do this sort of 'escaping' in Coq strings.

Compute "a\nb"%string.
(*
= "a\nb"%string
     : string
*)

Compute String.length "a\nb"%string.
(*
= 4
     : nat
*)

The \n was counted as literal \ and n.

It is possible to have a newline made using its Ascii.ascii equivalent though:

(* 10 is ASCII value of '\n' *)
Compute ("a" ++
         String (Ascii.ascii_of_nat 10) EmptyString ++
         "b")%string.
(*
= "a
b"%string
     : string
*)

So, as an attempt to get a better idea of coq strings, I made a function fmt to 'escape' newlines in strings so that it is possible to just write '' in the usual notation for strings itself.

Fixpoint fmt (s : string) : string :=
  match s with
  | EmptyString => EmptyString
  | String x xs =>
      match (Ascii.nat_of_ascii x) with
      | 92 => fmt_aux xs        (* '\' *)
      | _ => String x (fmt xs)  (* ignore next character *)
      end
  end
with fmt_aux (s : string) : string :=
  match s with
  | EmptyString => EmptyString
  | String x xs =>
      match (Ascii.nat_of_ascii x) with
      | 110 => String (Ascii.ascii_of_nat 10) (fmt xs) (* 'n' *)
      | _ => fmt xs
      end
  end.

This function recognizes only the character 'n' for escaping. Escaping starts with a '\'. If the character following '\' is not 'n', that character is simply skipped.

Let's try it out with a \n (newline):

Compute fmt "a\nb".
(*
= "a
b"%string
     : string
*)

Seems to be working. Let's try how it works with an unknown character after the \ escape:

Compute fmt "a\zb".
(*
= "ab"%string
     : string
*)

Cool, the 'z' that was escaped got skipped.

References

Update (28-Jul-2022): Had stumbled on a small bug in coqtail while trying the stuff in this blog post. It has since been fixed by the maintainer.