A string is an immutable array of bytes. Strings usually store text, in which case the bytes are the UTF-8 encoding of the text’s code points. But you can store any byte values you want, including null bytes or invalid UTF-8.
There are a few ways to think of a string:
As a searchable chunk of text composed of a sequence of textual code points.
As an iterable sequence of code point numbers.
As a flat array of directly indexable bytes.
All of those are useful for some problems, so the string API supports all three. The first one is the most common, so that’s what methods directly on the string class cater to.
In UTF-8, a single Unicode code point—very roughly a single “character”—may encode to one or more bytes. This means you can’t efficiently index by code point. There’s no way to jump directly to, say, the fifth code point in a string without walking the string from the beginning and counting them as you go.
Because counting code points is relatively slow, the indexes passed to string methods are byte offsets, not code point offsets. When you do:
someString[3]
That means “get the code point starting at byte three”, not “get the third code point in the string”. This sounds scary, but keep in mind that the methods on strings return byte indexes too. So, for example, this does what you want:
var metalBand = "Fäcëhämmër"
var hPosition = metalBand.indexOf("h")
System.print(metalBand[hPosition]) //> h
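To make the gap between byte offsets and code point positions concrete, here is a small sketch using the same string. The variable names are illustrative, and the expected offset assumes the source file is saved as UTF-8 with precomposed ä and ë (two bytes each).

```wren
var band = "Fäcëhämmër"

// "F" is 1 byte, "ä" is 2, "c" is 1 and "ë" is 2, so "h" starts at byte 6,
// even though it is only the fifth code point in the string.
var hOffset = band.indexOf("h")
System.print(hOffset) //> 6
System.print(band[hOffset]) //> h
```

Because indexOf() and the subscript operator both work in byte offsets, the result of one can be passed straight to the other without any conversion.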
A string can also be indexed with a Range, which will return a new string as a substring of the original.
var example = "hello wren"
System.print(example[0...5]) //> hello
System.print(example[-4..-1]) //> wren
If you want to work with a string as a sequence of numeric code points, call the codePoints getter. It returns a Sequence that decodes UTF-8 and iterates over the code points, returning each as a number.
If you want to get at the raw bytes, call bytes. This returns a Sequence that ignores any UTF-8 encoding and works directly at the byte level.
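As a minimal sketch of the two views side by side, the snippet below converts each sequence to a list with toList, which comes from Sequence. The variable name is illustrative.

```wren
var nose = "ᴥ" // U+1D25, which UTF-8 encodes as three bytes

// codePoints decodes the UTF-8 and yields one number per code point.
System.print(nose.codePoints.toList) //> [7461]

// bytes ignores the encoding and yields each raw byte value.
System.print(nose.bytes.toList) //> [225, 180, 165]
```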
Creates a new string containing the UTF-8 encoding of codePoint.
String.fromCodePoint(8225) //> ‡
It is a runtime error if codePoint is not an integer between 0 and 0x10ffff, inclusive.
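A short round-trip sketch may help here: it builds a string from a code point, then reads the value back through codePoints and checks the encoded length through bytes. The variable name is illustrative.

```wren
var dagger = String.fromCodePoint(8225) // U+2021

// Decoding the string gives back the original code point.
System.print(dagger.codePoints[0]) //> 8225

// Code points from U+0800 to U+FFFF take three bytes in UTF-8.
System.print(dagger.bytes.count) //> 3
```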
Creates a new string containing the single byte byte.
String.fromByte(255) //> �
It is a runtime error if byte is not an integer between 0 and 0xff, inclusive.
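The byte 255 never appears in valid UTF-8, which is why it prints as a replacement character. A small sketch of inspecting such a string through bytes instead (the variable name is illustrative):

```wren
var raw = String.fromByte(255)

// The string holds exactly one byte, and its value is preserved as-is.
System.print(raw.bytes.count) //> 1
System.print(raw.bytes[0]) //> 255
```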
Gets a Sequence that can be used to access the raw bytes of the string and ignore any UTF-8 encoding. In addition to the normal sequence methods, the returned object also has a subscript operator that can be used to directly index bytes.
System.print("hello".bytes[1]) //> 101 (for "e")
The count method on the returned sequence returns the number of bytes in the string. Unlike count on the string itself, it does not have to iterate over the string, and runs in constant time instead.
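As an illustration of the two counts diverging, consider the face used elsewhere in this section, where each of the three middle characters takes three bytes. The variable name is illustrative.

```wren
var face = "(ᵔᴥᵔ)"

// Two 1-byte parentheses plus three 3-byte code points.
System.print(face.bytes.count) //> 11

// Counting code points has to walk the whole string.
System.print(face.count) //> 5
```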
Gets a Sequence that can be used to access the UTF-8 decoded code points of the string as numbers. Iteration and subscripting work similarly to the string itself. The difference is that instead of returning single-character strings, this returns the numeric code point values.
var string = "(ᵔᴥᵔ)"
System.print(string.codePoints[0]) //> 40 (for "(")
System.print(string.codePoints[4]) //> 7461 (for "ᴥ")
If the byte at index does not begin a valid UTF-8 sequence, or the end of the string is reached before the sequence is complete, returns -1.
var string = "(ᵔᴥᵔ)"
System.print(string.codePoints[2]) //> -1 (in the middle of "ᵔ")
Checks if other is a substring of the string.
It is a runtime error if other is not a string.
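A brief usage sketch of the substring check, assuming a byte-for-byte comparison, which makes it case-sensitive. The variable name is illustrative.

```wren
var greeting = "hello wren"
System.print(greeting.contains("wren")) //> true

// The comparison is on bytes, so a different case does not match.
System.print(greeting.contains("Wren")) //> false
```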
Returns the number of code points in the string. Since UTF-8 is a variable-length encoding, this requires iterating over the entire string, which is relatively slow.
If the string contains bytes that are invalid UTF-8, each byte adds one to the count as well.
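A small sketch of how an invalid byte affects the count: it appends the byte 255, which can never occur in valid UTF-8, to a three-byte code point. The variable name is illustrative.

```wren
var mixed = "ᴥ" + String.fromByte(255)

// Three bytes for "ᴥ" plus the one invalid byte.
System.print(mixed.bytes.count) //> 4

// One valid code point plus one for the invalid byte.
System.print(mixed.count) //> 2
```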
Checks if the string ends with suffix.
It is a runtime error if suffix is not a string.
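One typical use is checking a file extension. The file name here is a made-up example.

```wren
var file = "main.wren"
System.print(file.endsWith(".wren")) //> true
System.print(file.endsWith(".txt")) //> false
```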
Returns the index of the first byte matching search in the string or -1 if search was not found.
It is a runtime error if search is not a string.
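A minimal sketch of both outcomes, using an ASCII-only string so that byte offsets and character positions happen to coincide. The variable name is illustrative.

```wren
var text = "hello wren"

// "wren" begins at byte offset 6.
System.print(text.indexOf("wren")) //> 6

// A missing substring is reported as -1 rather than as an error.
System.print(text.indexOf("xyz")) //> -1
```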
Returns the index of the first byte matching search in the string, starting the search at byte offset start, or -1 if search was not found. The start offset can also be negative, in which case it is counted relative to the end of the string instead. Searches forward, from the offset to the end of the string.
It is a runtime error if search is not a string or start is not an integer index within the string’s byte length.
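As a sketch of how the start offset changes the result, the snippet below searches a string that contains the same word three times. The variable name is illustrative.

```wren
var repeated = "abc abc abc"

// Starting at byte 1 skips the match at offset 0 and finds the next one.
System.print(repeated.indexOf("abc", 1)) //> 4

// A start of -3 counts back from the end of this 11-byte string, so it means byte 8.
System.print(repeated.indexOf("abc", -3)) //> 8
```

The returned index is always an offset from the beginning of the string, whether or not start was negative.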
Implements the iterator protocol for iterating over the code points in the string:
var codePoints = []
for (c in "(ᵔᴥᵔ)") {
  codePoints.add(c)
}
System.print(codePoints) //> [(, ᵔ, ᴥ, ᵔ, )]
If the string contains any bytes that are not valid UTF-8, this iterates over those too, one byte at a time.
Returns a new string with all occurrences of old replaced with swap.
var string = "abc abc abc"
System.print(string.replace(" ", "")) //> abcabcabc
Returns a list of one or more strings separated by separator.
var string = "abc abc abc"
System.print(string.split(" ")) //> [abc, abc, abc]
It is a runtime error if separator is not a string or is an empty string.
Checks if the string starts with prefix.
It is a runtime error if prefix is not a string.
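A short usage sketch of the prefix check; the path is a made-up example.

```wren
var path = "src/main.wren"
System.print(path.startsWith("src/")) //> true
System.print(path.startsWith("test/")) //> false
```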
Returns a new string with whitespace removed from the beginning and end of this string. “Whitespace” is space, tab, carriage return, and line feed characters.
System.print(" \nstuff\r\t".trim()) //> stuff
Returns a new string with all code points in chars removed from the beginning and end of this string.
System.print("ᵔᴥᵔᴥᵔbearᵔᴥᴥᵔᵔ".trim("ᵔᴥ")) //> bear
Like trim() but only removes from the end of the string.
System.print(" \nstuff\r\t".trimEnd()) //> " \nstuff"
Like trim(chars) but only removes from the end of the string.
System.print("ᵔᴥᵔᴥᵔbearᵔᴥᴥᵔᵔ".trimEnd("ᵔᴥ")) //> ᵔᴥᵔᴥᵔbear
Like trim() but only removes from the beginning of the string.
System.print(" \nstuff\r\t".trimStart()) //> "stuff\r\t"
Like trim(chars) but only removes from the beginning of the string.
System.print("ᵔᴥᵔᴥᵔbearᵔᴥᴥᵔᵔ".trimStart("ᵔᴥ")) //> bearᵔᴥᴥᵔᵔ
Returns a new string that concatenates this string and other.
It is a runtime error if other is not a string.
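Since the right-hand operand must be a string, other values need converting first. The sketch below uses toString for that; the variable name is illustrative.

```wren
var joined = "hello" + " " + "wren"
System.print(joined) //> hello wren

// A number must be converted before it can be appended.
System.print("count: " + 3.toString) //> count: 3
```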
Returns a new string that contains this string repeated count times.
It is a runtime error if count is not a positive integer.
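A minimal usage sketch of repetition; the variable name is illustrative.

```wren
var chant = "ab" * 3
System.print(chant) //> ababab
```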
Checks if the string is equal to other.
Checks if the string is not equal to other.
Returns a string containing the code point starting at byte index.
System.print("ʕ•ᴥ•ʔ"[5]) //> ᴥ
Since ʕ is two bytes in UTF-8 and • is three, byte index 5 points to the start of the bear’s nose.
If index points into the middle of a UTF-8 sequence or at otherwise invalid UTF-8, this returns a one-byte string containing the byte at that index:
System.print("I ♥ NY"[3]) //> (one-byte string [153])
It is a runtime error if index is greater than the number of bytes in the string.