Tags: /
coq strings unicode
/
Strings in Coq are represented using terms of type string
, like
Require Import String.
Check "hello"%string.
But the string "hello"
shown in this example is just a 'pretty-printed' (using string notations) version of its Coq representation.
string
is actually defined as something like a list of characters.
Inductive string : Set :=
| EmptyString : string
| String : Ascii.ascii -> string -> string.
Arguments String _%char_scope _%string_scope
We can see it if we turn off notations while printing:
(* Disable use of notations while printing *)
Unset Printing Notations.
Check "ab"%string.
(*
String (Ascii.Ascii true false false false false true true false)
(String (Ascii.Ascii false true false false false true true false)
EmptyString)
: string
*)
(* Enable notations while printing *)
Set Printing Notations.
Here the Ascii.ascii
values are byte values representing the characters of the string.
(The name ascii
used for the type Ascii.ascii
may be a bit misleading as they are not exactly representing just ASCII values. ¹⁴)
The resemblance with the definition of list
is evident. list
looks like
Inductive list (A : Type) : Type :=
| nil : list A
| cons : A -> list A -> list A.
EmptyString
and String
are like the nil
and cons
respectively of list
.
Ascii.ascii
typeAscii.ascii
is defined as
Inductive ascii : Set :=
Ascii : bool -> bool -> bool -> bool ->
bool -> bool -> bool -> bool -> Ascii.ascii
Arguments Ascii.Ascii (_ _ _ _ _ _ _ _)%bool_scope
Ascii.Ascii
, the sole constructor of Ascii.ascii
(easy to confuse between the two names!), accepts 8 boolean values where each bool
represents one bit (true
for 1
) that constitute an 8-bit integer which is the integer value of the character.
Note that the most-significant-bit (MSB) is the last bool
value. ie, the LSB appears first.
This integer value may be thought of as the value of the character when encoded in UTF-8.
For example, the letter 'A'
, with binary value 01000001
would look like
Ascii.Ascii true false false false false false true false.
But remember that ASCII encoding is essentially a 7-bit value and not really of 8-bits.
I looked if there's a function to convert an Ascii.Ascii
value to a (sort of single-character) string
:
Search (Ascii.ascii -> string).
This gave no output. So I guess no such function has been loaded as of now. But it's an easy enough function to make:
Definition str_of_ascii (a : Ascii.ascii) : string :=
String a EmptyString.
str_of_ascii
converts an Ascii.ascii
value to a string
with just the single Ascii.Ascii
value in it.
Trying it out:
Compute str_of_ascii
(Ascii.Ascii true false false false false false true false).
(*
= "A"%string
: string
*)
Compute str_of_ascii (Ascii.ascii_of_nat 65).
(*
= "A"%string
: string
*)
I was a bit surprised to find that the string form of 255 wasn't ÿ
, whose integer value is 255.
Compute str_of_ascii (Ascii.ascii_of_nat 255).
(*
= String (Ascii.Ascii true true true true true true true true) ""
: string
*)
Unset Printing Notations.
Compute str_of_ascii (Ascii.ascii_of_nat 255).
(*
= String (Ascii.Ascii true true true true true true true true)
EmptyString
: string
*)
So I checked how ÿ
looks like
Compute "ÿ"%string.
(*
= "ÿ"%string
: string
*)
Unset Printing Notations.
Compute "ÿ"%string.
(*
= String (Ascii.Ascii true true false false false false true true)
(String (Ascii.Ascii true true true true true true false true)
EmptyString)
: string
*)
Instead of a single Ascii.Ascii
, there are two of them!
I wondered why. I mean, a single Ascii.Ascii
denotes 8 bits and can very well represent any value less than 256 by itself. Why did coq use two of them then? Sounded like something unnecessary.
That is until I got to know that that UTF-8 representation of ÿ
required 2 bytes (16 bits).
$ unicode ÿ
U+00FF LATIN SMALL LETTER Y WITH DIAERESIS
UTF-8: c3 bf UTF-16BE: 00ff Decimal: ÿ Octal: \0377
ÿ (Ÿ)
Uppercase: 0178
Category: Ll (Letter, Lowercase)
Unicode block: 0080..00FF; Latin-1 Supplement
Bidi: L (Left-to-Right)
Decomposition: 0079 0308
Notice that the decimal form is 255
but the hexcode for UTF-8 is c3bf
, which is a 2 byte value.
The reason is that ASCII values are limited to 7 bits. UTF-8 is something quite different that merely happen to use the same values as that of ASCII till 127 (including 127).
UTF-8 (Unicode Transformation Format - 8 bit) is an encoding defined by unicode that can represent any valid unicode character using 1 to 4 bytes.
(See here for the story about UTF-8's origin.)
UTF-8 is a variable width character encoding. ie, it uses different number of bytes to represent different ranges of characters. 1 or 2 or 3 or 4 bytes are used depending on the character.
Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|
0xxxxxxx | |||
110xxxxx | 10xxxxxx | ||
1110xxxx | 10xxxxxx | 10xxxxxx | |
11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
The bits marked x
are the bits that are actually available to represent the character.
Byte count | Free bit count |
---|---|
1 | 7 = 7 |
2 | 5 + 6 = 11 |
3 | 4 + 6 + 6 = 16 |
4 | 3 + 6 + 6 + 6 = 21 |
The range of characters that can be represented with 1 byte (effectively 7 bits after accounting for the zero-bit prefix) are the ASCII characters (hence the characters have same integer value in both UTF-8 and ASCII).
All other characters need at least 2 bytes.
That's why ÿ
needed two Ascii.ascii
values:
(Ascii.Ascii true true false false false false true true)
: 11000011₂ = 0xc3(Ascii.Ascii true true true true true true false true)
: 01000000₂ = 0x40Definition strbool (b : bool) : string :=
match b with
| false => "0"%string
| true => "1"%string
end.
Definition binstr_of_ascii (a : Ascii.ascii) : string :=
match a with
| Ascii.Ascii b0 b1 b2 b3 b4 b5 b6 b7 =>
(strbool b7) ++ (strbool b6) ++ (strbool b5) ++ (strbool b4) ++
(strbool b3) ++ (strbool b2) ++ (strbool b1) ++ (strbool b0)
end.
Definition get' (n : nat) (s : string) : Ascii.ascii :=
match (String.get n s) with
| Some a => a
| None => Ascii.ascii_of_nat 0
end.
Compute binstr_of_ascii (get' 0 "ÿ"%string).
(*
= "11000011"%string
: string
*)
Compute binstr_of_ascii (get' 1 "ÿ"%string).
(*
= "10111111"%string
: string
*)
Because ÿ
is 0xc340.
Ascii.ascii
beyond 255While making an Ascii.ascii
value, we can use all 8 bits to represent characters.
Compute Ascii.ascii_of_nat 0. (* NUL *)
(*
= Ascii.Ascii false false false false false false false false
: Ascii.ascii
*)
Compute Ascii.ascii_of_nat 255.
(*
= Ascii.Ascii true true true true true true true true
: Ascii.ascii
*)
But the character with the 8-bit value 0xff (ie, 255) is incomplete in UTF-8 as characters which can be represented in just 1-byte start with 0 whereas 255's binary value starts with 1.
Just for the fun of it, I also tried converting a nat
value bigger than 255 to Ascii.ascii
:
Compute Ascii.ascii_of_nat 256. (* same as ascii_of_nat 0 *)
(*
= Ascii.Ascii false false false false false false false false
: Ascii.ascii
*)
So no error. Just overflow causing bit patterns to be reused.
The default notation for strings needs a pair of double quotes to kick in. But then how would we write string with double quotes in them? For example, a string like hello "hiya" hi
?
We can use ""
(two double quotes) to stand in for a single literal double quote.
Compute "hello ""hiya"" hi"%string.
(*
= "hello ""hiya"" hi"%string
: string
*)
Compute """"%string. (* a pair of double quotes *)
(*
= """"%string
: string
*)
We can write strings with newline in them (ie, multi-line strings) by simply writing it one line after the other.
Definition foo := "a
b"%string.
Compute foo.
(*
= "a
b"%string
: string
*)
The newline is showing up. Let's confirm by checking the string length:
Compute length foo.
(* 3 : nat *)
Sounds about right: 'a', '' and 'b' are the characters present in the string.
Let see what's the integer value of the second character in the string. Newline character has integer value of 10.
Compute Ascii.nat_of_ascii
match (String.get 1 foo) with
| Some x => x
| None => Ascii.ascii_of_nat 0
end.
(* 10 : nat *)
Okay, newline is there.
(String.get i str
takes a nat
value i
and a string str
and returns an option Ascii.ascii
value denoting the i
-th Ascii.ascii
value in str
, if any.)
Check String.get.
(*
get
: nat -> string -> option ascii
*)
Another way of explicitly specifying a newline character is with its integer value as a string under the char
notation scope:
Require Import Ascii.
Compute "010"%char.
(*
= "010"%char
: ascii
*)
In programming languages like C, some special characters can be represented by 'escaping' certain characters. For example, newline can be represented with '\n'
.
Couldn't find a way to do this sort of 'escaping' in Coq strings.
Compute "a\nb"%string.
(*
= "a\nb"%string
: string
*)
Compute String.length "a\nb"%string.
(*
= 4
: nat
*)
The \n
was counted as literal \
and n
.
It is possible to have a newline made using its Ascii.ascii
equivalent though:
(* 10 is ASCII value of '\n' *)
Compute ("a" ++
String (Ascii.ascii_of_nat 10) EmptyString ++
"b")%string.
(*
= "a
b"%string
: string
*)
So, as an attempt to get a better idea of coq strings, I made a function fmt
to 'escape' newlines in strings so that it is possible to just write '' in the usual notation for strings itself.
Fixpoint fmt (s : string) : string :=
match s with
| EmptyString => EmptyString
| String x xs =>
match (Ascii.nat_of_ascii x) with
| 92 => fmt_aux xs (* '\' *)
| _ => String x (fmt xs) (* ignore next character *)
end
end
with fmt_aux (s : string) : string :=
match s with
| EmptyString => EmptyString
| String x xs =>
match (Ascii.nat_of_ascii x) with
| 110 => String (Ascii.ascii_of_nat 10) (fmt xs) (* 'n' *)
| _ => fmt xs
end
end.
This function recognizes only the character 'n' for escaping. Escaping starts with a '\'. If the character following '\' is not 'n', that character is simply skipped.
Let's try it out with a \n
(newline):
Compute fmt "a\nb".
(*
= "a
b"%string
: string
*)
Seems to be working. Let's try how it works with an unknown character after the \
escape:
Compute fmt "a\zb".
(*
= "ab"%string
: string
*)
Cool, the 'z' that was escaped got skipped.
—
Update (28-Jul-2022): Had stumbled on a small bug in coqtail while trying the stuff in this blog post. It has since been fixed by the maintainer.