A String in Elixir is a UTF-8 encoded binary.
Codepoints and graphemes
The functions in this module act according to the Unicode Standard, version 6.3.0. As per the standard, a codepoint is a Unicode character, which may be represented by one or more bytes. For example, the character "é" is represented with two bytes:
iex> byte_size("é")
2
However, this module returns the proper length:
iex> String.length("é")
1
Furthermore, this module also presents the concept of graphemes, which are multiple characters that may be "perceived as a single character" by readers. For example, the same "é" character written above could be represented by the letter "e" followed by the combining acute accent (U+0301):
iex> string = "\x{0065}\x{0301}"
iex> byte_size(string)
3
iex> String.length(string)
1
Although the example above is made of two characters, it is perceived by users as one.
Graphemes can also be two characters that are interpreted as one by some languages. For example, some languages may consider "ch" as a grapheme. However, since this information depends on the locale, it is not taken into account by this module.
In general, the functions in this module rely on the Unicode Standard, but do not contain any of the locale-specific behaviour.
More information about graphemes can be found in the Unicode Standard Annex #29. The current Elixir version implements the Extended Grapheme Cluster algorithm.
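The difference between codepoints and graphemes can be seen by splitting the same string both ways. The snippet below is a minimal sketch: it builds the string from integer codepoints with the `::utf8` binary modifier, so it does not depend on any particular escape syntax.

```elixir
# "e" (U+0065) followed by the combining acute accent (U+0301).
string = <<0x0065::utf8, 0x0301::utf8>>

# Split by codepoint: the letter and the accent are separate entries.
codepoints = String.codepoints(string)
# length(codepoints) == 2

# Split by grapheme: the letter and its accent form a single entry.
graphemes = String.graphemes(string)
# length(graphemes) == 1

# The underlying binary holds one byte for "e" and two for the accent.
bytes = byte_size(string)
# bytes == 3
```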
String and binary operations
To act according to the Unicode Standard, many functions in this module run in linear time, as they need to traverse the whole string considering the proper Unicode codepoints.
For example, String.length/1 is going to take longer as the input grows. On the other hand, byte_size/1 always runs in constant time (i.e. regardless of the input size).
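As a rough illustration, the two calls can be timed on a large input with `:timer.tc/1` from the Erlang standard library. This is only a sketch: the input size is an arbitrary choice, and the measured times vary by machine and Elixir version, so no specific timings are shown.

```elixir
# 100_000 copies of "é": 100_000 characters stored in 200_000 bytes.
big = String.duplicate("é", 100_000)

# :timer.tc/1 returns {elapsed_microseconds, result}.
{length_time, length_result} = :timer.tc(fn -> String.length(big) end)
{size_time, size_result} = :timer.tc(fn -> byte_size(big) end)

IO.puts("String.length/1: #{length_result} characters in #{length_time} microseconds")
IO.puts("byte_size/1: #{size_result} bytes in #{size_time} microseconds")
```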
This means that using the functions in this module often carries a performance cost compared to the lower-level operations that work directly with binaries:
- Kernel.binary_part/3 - retrieves part of the binary
- Kernel.bit_size/1 and Kernel.byte_size/1 - size related functions
- Kernel.is_bitstring/1 and Kernel.is_binary/1 - type checking functions
- Plus a number of functions for working with binaries (bytes) in the :binary module
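The functions listed above all work in bytes (or bits) rather than characters. The following sketch applies each of them to a sample string of our own choosing, which contains one two-byte character.

```elixir
full = "Mr. José"

# byte_size/1 and bit_size/1 count storage units, not characters.
bytes = byte_size(full)
# bytes == 9, because "é" takes two bytes
bits = bit_size(full)
# bits == 72

# binary_part/3 takes a byte offset and a byte length.
name = binary_part(full, 4, bytes - 4)
# name == "José"

# Every binary is a bitstring, but not every bitstring is a binary.
partial = <<1::size(3)>>
checks = {is_binary(full), is_bitstring(partial), is_binary(partial)}
# checks == {true, true, false}

# :binary.match/2 returns {byte_offset, byte_length} of the first match.
found = :binary.match(full, "José")
# found == {4, 5}
```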
There are many situations where using the String module can be avoided in favor of binary functions or pattern matching. For example, imagine you have a string prefix and you want to remove this prefix from another string named full.
One may be tempted to write:
iex> take_prefix = fn full, prefix ->
...> base = String.length(prefix)
...> String.slice(full, base, String.length(full) - base)
...> end
...> take_prefix.("Mr. John", "Mr. ")
"John"
Although the function above works, it performs poorly. To calculate the length of the string, we need to traverse it fully, so we traverse both prefix and full strings, then slice the full one, traversing it again.
A first attempt at improving it could be with ranges:
iex> take_prefix = fn full, prefix ->
...> base = String.length(prefix)
...> String.slice(full, base..-1)
...> end
...> take_prefix.("Mr. John", "Mr. ")
"John"
While this is much better (we don't traverse full twice), it could still be improved. In this case, since we want to extract a substring from a string, we can use byte_size/1 and binary_part/3 as there is no chance we will slice in the middle of a codepoint made of more than one byte:
iex> take_prefix = fn full, prefix ->
...> base = byte_size(prefix)
...> binary_part(full, base, byte_size(full) - base)
...> end
...> take_prefix.("Mr. John", "Mr. ")
"John"
Or simply use pattern matching:
iex> take_prefix = fn full, prefix ->
...> base = byte_size(prefix)
...> <<_ :: binary-size(base), rest :: binary>> = full
...> rest
...> end
...> take_prefix.("Mr. John", "Mr. ")
"John"
On the other hand, if you want to dynamically slice a string based on an integer value, then using String.slice/3 is the best option as it guarantees we won't incorrectly split a valid codepoint into multiple bytes.
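To see why this matters, compare slicing by character with slicing by byte at the same integer position. The sketch below uses a sample string of our own and `String.valid?/1` to check each result.

```elixir
string = "olá"

# String.slice/3 counts characters, so "á" comes back whole.
by_character = String.slice(string, 2, 1)
# by_character == "á"

# binary_part/3 counts bytes; a single byte at offset 2 is only
# the first half of the two-byte "á".
by_byte = binary_part(string, 2, 1)
# by_byte == <<195>>

character_ok = String.valid?(by_character)
# character_ok == true
byte_ok = String.valid?(by_byte)
# byte_ok == false
```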
Integer codepoints
Although codepoints could be represented as integers, this module represents all codepoints as strings. For example:
iex> String.codepoints("olá")
["o", "l", "á"]
There are a couple of ways to retrieve a character's integer codepoint. One may use the ? construct:
iex> ?o
111
iex> ?á
225
Or also via pattern matching:
iex> << eacute :: utf8 >> = "á"
iex> eacute
225
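The same `utf8` modifier can be applied to a whole string. The following is a minimal sketch that uses a binary comprehension to decode every codepoint into an integer and then encodes the integers back into a string.

```elixir
# Decode each UTF-8 codepoint of the string into its integer value.
integers = for <<codepoint::utf8 <- "olá">>, do: codepoint
# integers == [111, 108, 225]

# Encode the integers back into a UTF-8 binary.
rebuilt = for codepoint <- integers, into: "", do: <<codepoint::utf8>>
# rebuilt == "olá"
```

Note that a list of integers such as this one may be printed as a charlist by IO.inspect/1, which is why the values are shown in comments instead.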
As we have seen above, codepoints can be inserted into a string by their hexadecimal code:
"ol\x{0061}\x{0301}" #=> "olá"
Self-synchronization
The UTF-8 encoding is self-synchronizing. This means that if malformed data (i.e., data that is not possible according to the definition of the encoding) is encountered, only one codepoint needs to be rejected.
This module relies on this behaviour to ignore such invalid characters. For example, length/1 is going to return a correct result even if an invalid codepoint is fed into it.
In other words, this module expects invalid data to be detected when retrieving data from the external source. For example, a driver that reads strings from a database will be the one responsible for checking the validity of the encoding.
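One way such a boundary check could look is sketched below. The `BoundarySketch` module and its `accept/1` function are hypothetical names used only for illustration. The check itself relies on `String.valid?/1`.

```elixir
defmodule BoundarySketch do
  # Hypothetical helper: validate data once, at the point where it
  # enters the system, before handing it to String functions.
  def accept(data) when is_binary(data) do
    if String.valid?(data) do
      {:ok, data}
    else
      {:error, :invalid_encoding}
    end
  end
end

valid = BoundarySketch.accept("olá")
# valid == {:ok, "olá"}

# 0xFF can never appear in well-formed UTF-8.
invalid = BoundarySketch.accept(<<0xFF, "abc">>)
# invalid == {:error, :invalid_encoding}
```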