Strings and Characters
|Properties of Strings||Special Characters in Strings|
|Operations on Strings||Newlines and Tabs in Strings|
|Characters in Strings||Character Codes|
|String Patterns||Raw Character Encodings|
Much of what the Wolfram Language does revolves around manipulating structured expressions. But you can also use the Wolfram Language as a system for handling unstructured strings of text.
When you input a string of text to the Wolfram Language you must always enclose it in quotes. However, when the Wolfram Language outputs the string it usually does not explicitly show the quotes.
You can see the quotes by asking for the input form of the string. In addition, in a Wolfram System notebook, quotes will typically appear automatically as soon as you start to edit a string.
The fact that the Wolfram Language does not usually show explicit quotes around strings makes it possible for you to use strings to specify quite directly the textual output you want.
You should understand, however, that even though the string "x" often appears as x in output, it is still a quite different object from the symbol x.
You can test whether any particular expression is a string by looking at its head. The head of any string is always String.
All strings have head String:
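As a small illustration (the string "x" is an arbitrary choice, and the symbol x is assumed to have no value assigned), the following compares a string with the symbol of the same name:

```wolfram
Head["x"]        (* String *)
Head[x]          (* Symbol, assuming x has no assigned value *)
StringQ["x"]     (* True *)
InputForm["x"]   (* shown with explicit quotes: "x" *)
```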
You can use strings just like other expressions as elements of patterns and transformations. Note, however, that you cannot assign values directly to strings.
The Wolfram Language provides a variety of functions for manipulating strings. Most of these functions are based on viewing strings as a sequence of characters, and many of the functions are analogous to ones for manipulating lists.
|s1<>s2<>… or StringJoin[{s1,s2,…}]||join several strings together|
|StringLength[s]||give the number of characters in a string|
|StringReverse[s]||reverse the characters in a string|
StringLength gives the number of characters in a string:
StringReverse reverses the characters in a string:
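A brief sketch of these basic operations, using arbitrary example strings:

```wolfram
"aaa" <> "bb" <> "cccc"      (* "aaabbcccc" *)
StringLength["a string"]     (* 8, the space counts as a character *)
StringReverse["abcde"]       (* "edcba" *)
```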
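|StringTake[s,n]||make a string by taking the first n characters from s|
|StringTake[s,{n}]||take the nth character from s|
|StringTake[s,{n1,n2}]||take characters n1 through n2|
|StringDrop[s,n]||make a string by dropping the first n characters in s|
|StringDrop[s,{n1,n2}]||drop characters n1 through n2|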
StringTake and StringDrop are the analogs for strings of Take and Drop for lists. Like Take and Drop, they use standard Wolfram Language sequence specifications, so that, for example, negative numbers count character positions from the end of a string. Note that the first character of a string is taken to have position 1.
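The sequence specifications described above can be tried on a sample string. The variable name str and the alphabet string are illustrative choices only:

```wolfram
str = "abcdefghijklm";

StringTake[str, 5]         (* "abcde" *)
StringTake[str, {5}]       (* "e" *)
StringTake[str, {5, 10}]   (* "efghij" *)
StringTake[str, -3]        (* "klm", counting from the end *)
StringDrop[str, 3]         (* "defghijklm" *)
StringDrop[str, {2, -2}]   (* "am" *)
```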
|StringInsert[s,snew,n]||insert the string snew at position n in s|
|StringInsert[s,snew,{n1,n2,…}]||insert several copies of snew into s|
StringInsert[s,snew,n] is set up to produce a string whose nth character is the first character of snew.
This uses Riffle to add a space between the words in a list:
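A minimal sketch of both operations, with arbitrary example strings; StringJoin is used here to turn the riffled list back into a single string:

```wolfram
StringInsert["abcdefgh", "XX", 3]    (* "abXXcdefgh": "X" is now the 3rd character *)
StringInsert["abcdefgh", "XX", -1]   (* "abcdefghXX": position -1 is the end *)

(* Riffle places a space between successive words; StringJoin concatenates *)
StringJoin[Riffle[{"cat", "in", "the", "hat"}, " "]]   (* "cat in the hat" *)
```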
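|StringReplacePart[s,snew,{m,n}]||replace the characters at positions m through n in s by the string snew|
|StringReplacePart[s,snew,{{m1,n1},{m2,n2},…}]||replace several substrings in s by snew|
|StringReplacePart[s,{snew1,snew2,…},{{m1,n1},{m2,n2},…}]||replace substrings in s by the corresponding snewi|
|StringPosition[s,sub]||give a list of the starting and ending positions at which sub appears as a substring of s|
|StringPosition[s,sub,k]||include only the first k occurrences of sub in s|
|StringPosition[s,{sub1,sub2,…}]||include occurrences of any of the subi|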
You can use StringPosition to find where a particular substring appears within a given string. StringPosition returns a list, each of whose elements corresponds to an occurrence of the substring. The elements consist of lists giving the starting and ending character positions for the substring. These lists are in the form used as sequence specifications in StringTake, StringDrop, and StringReplacePart.
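The following sketch, on an arbitrary test string, shows the position lists and how one of them can be passed on to StringReplacePart:

```wolfram
StringPosition["abcdabcdaabcabcd", "abc"]
(* {{1, 3}, {5, 7}, {10, 12}, {13, 15}} *)

(* a {start, end} pair is a valid sequence specification for StringReplacePart *)
StringReplacePart["abcdefgh", "XXX", {2, 4}]   (* "aXXXefgh" *)

StringCount["abcdabcdaabcabcd", "abc"]         (* 4 *)
StringFreeQ["abcdabcd", "xy"]                  (* True *)
```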
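|StringCount[s,sub]||count the occurrences of sub in s|
|StringCount[s,{sub1,sub2,…}]||count occurrences of any of the subi|
|StringFreeQ[s,sub]||test whether s is free of sub|
|StringFreeQ[s,{sub1,sub2,…}]||test whether s is free of all the subi|
|StringReplace[s,sb->sbnew]||replace sb by sbnew wherever it appears in s|
|StringReplace[s,{sb1->sbnew1,sb2->sbnew2,…}]||replace sbi by the corresponding sbnewi|
|StringReplace[s,rules,n]||do at most n replacements|
|StringReplaceList[s,rules]||give a list of the strings obtained by making each possible single replacement|
|StringReplaceList[s,rules,n]||give at most n results|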
StringReplace scans a string from left to right, doing all the replacements it can, and then returning the resulting string. Sometimes, however, it is useful to see what all possible single replacements would give. You can get a list of all these results using StringReplaceList.
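To make the difference concrete, here is a short sketch with arbitrary example strings:

```wolfram
(* all replacements, made in one left-to-right scan *)
StringReplace["abcdabcdaabcabcd", "abc" -> "Y"]      (* "YdYdaYYd" *)

(* at most 2 replacements *)
StringReplace["abcdabcdaabcabcd", "abc" -> "Y", 2]   (* "YdYdaabcabcd" *)

(* one result per possible single replacement *)
StringReplaceList["abcdabcd", "abc" -> "XX"]         (* {"XXdabcd", "abcdXXd"} *)
```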
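|StringSplit[s]||split s into substrings delimited by whitespace|
|StringSplit[s,del]||split at delimiter del|
|StringSplit[s,{del1,del2,…}]||split at any of the deli|
|StringSplit[s,del,n]||split into at most n substrings|
|StringSplit[s,del->rhs]||insert rhs at the position of each delimiter|
|StringSplit[s,{del1->rhs1,del2->rhs2,…}]||insert rhsi at the position of the corresponding deli|
|Sort[{s1,s2,s3,…}]||sort a list of strings|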
Sort sorts strings into standard dictionary order:
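A short sketch of splitting and sorting, with arbitrary example strings (all lowercase, so that case ordering does not enter):

```wolfram
(* runs of whitespace count as a single delimiter *)
StringSplit["a bbb  cccc aa   d"]          (* {"a", "bbb", "cccc", "aa", "d"} *)

(* explicit delimiter; a leftover "-" stays attached to "ccc" *)
StringSplit["a--bbb---ccc--dddd", "--"]    (* {"a", "bbb", "-ccc", "dddd"} *)

Sort[{"fish", "cat", "catfish"}]           (* {"cat", "catfish", "fish"} *)
```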
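|StringTrim[s]||trim whitespace from the beginning and end of s|
|StringTrim[s,patt]||trim substrings matching patt from the beginning and end|
|SequenceAlignment[s1,s2]||find an optimal alignment of s1 and s2|
|Characters["string"]||convert a string to a list of characters|
|StringJoin[{"c1","c2",…}]||convert a list of characters to a string|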
StringJoin converts the list of characters back to a single string:
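For instance, with an arbitrary example string (the name chars is an illustrative choice):

```wolfram
chars = Characters["A string."]
(* {"A", " ", "s", "t", "r", "i", "n", "g", "."} *)

StringJoin[chars]   (* "A string." *)
```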
|DigitQ[string]||test whether all characters in a string are digits|
|LetterQ[string]||test whether all characters in a string are letters|
|UpperCaseQ[string]||test whether all characters in a string are uppercase letters|
|LowerCaseQ[string]||test whether all characters in a string are lowercase letters|
Not all the letters are uppercase, so the result is False:
|ToUpperCase[string]||generate a string in which all letters are uppercase|
|ToLowerCase[string]||generate a string in which all letters are lowercase|
|CharacterRange["c1","c2"]||generate a list of all characters from c1 to c2|
CharacterRange will usually give meaningful results for any range of characters that have a natural ordering. The way CharacterRange works is by using the character codes that the Wolfram Language internally assigns to every character.
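The character tests, case conversions and CharacterRange can be sketched together as follows, using arbitrary example strings:

```wolfram
DigitQ["1234"]               (* True *)
LetterQ["Mixed"]             (* True *)
UpperCaseQ["Mixed"]          (* False, since only "M" is uppercase *)

ToUpperCase["Mixed Form"]    (* "MIXED FORM" *)
ToLowerCase["Mixed Form"]    (* "mixed form" *)

(* successive character codes from "a" to "h" *)
CharacterRange["a", "h"]     (* {"a", "b", "c", "d", "e", "f", "g", "h"} *)
```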
An important feature of string manipulation functions like StringReplace is that they handle not only literal strings but also patterns for collections of strings.
You can specify patterns for strings by using string expressions that contain ordinary strings mixed with Wolfram Language symbolic pattern objects.
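|"string1"~~"string2"~~… or StringExpression["string1","string2",…]||a sequence of strings and pattern objects|
|StringMatchQ["s",patt]||test whether "s" matches patt|
|StringFreeQ["s",patt]||test whether "s" is free of substrings matching patt|
|StringCases["s",patt]||give a list of the substrings of "s" that match patt|
|StringCases["s",lhs->rhs]||replace each case of lhs by rhs|
|StringPosition["s",patt]||give a list of the positions of substrings that match patt|
|StringCount["s",patt]||count how many substrings match patt|
|StringReplace["s",lhs->rhs]||replace every substring that matches lhs|
|StringReplaceList["s",lhs->rhs]||give a list of all ways of replacing lhs|
|StringSplit["s",patt]||split s at every substring that matches patt|
|StringSplit["s",lhs->rhs]||split at lhs, inserting rhs in its place|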
You can use all the standard Wolfram Language pattern objects in string patterns. Single blanks (_) always stand for single characters. Double blanks (__) stand for sequences of one or more characters.
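As a sketch of how blanks and other pattern objects combine with literal strings through ~~ (the example strings are arbitrary):

```wolfram
(* "a", then any sequence of zero or more characters, then "b" *)
StringMatchQ["apppbb", "a" ~~ ___ ~~ "b"]          (* True *)

(* "ab" followed by any single character *)
StringReplace["abc abcb abdc", "ab" ~~ _ -> "X"]   (* "X Xb Xc" *)

(* runs of one or more letters *)
StringCases["abc-def-ghi", LetterCharacter ..]     (* {"abc", "def", "ghi"} *)
```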
|"string"||a literal string of characters|
|_||any single character|
|__||any sequence of one or more characters|
|___||any sequence of zero or more characters|
|x_, x__, x___||substrings given the name x|
|x:pattern||pattern given the name x|
|pattern..||pattern repeated one or more times|
|pattern...||pattern repeated zero or more times|
|{patt1,patt2,…} or patt1 | patt2 | …||a pattern matching at least one of the patti|
|patt/;cond||a pattern for which cond evaluates to True|
|pattern?test||a pattern for which test yields True for each character|
|Whitespace||a sequence of whitespace characters|
|NumberString||the characters of a number|
|charobj||an object representing a character class (see below)|
|RegularExpression["regexp"]||substring matching a regular expression|
You can use standard Wolfram Language constructs such as Characters["c1c2…"] and CharacterRange["c1","c2"] to generate lists of alternative characters to use in string patterns.
In addition to allowing explicit lists of characters, the Wolfram Language provides symbolic specifications for several common classes of possible characters in string patterns.
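|{"c1","c2",…}||any of the "ci"|
|Characters["c1c2…"]||any of the "ci"|
|CharacterRange["c1","c2"]||any character in the range "c1" to "c2"|
|WhitespaceCharacter||space, newline, tab or other whitespace character|
|WordCharacter||letter or digit|
|Except[p]||any character except ones matching p|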
String patterns are often used as a way to extract structure from strings of textual data. Typically this works by having different parts of a string pattern match substrings that correspond to different parts of the structure.
ToExpression converts them to ordinary symbols and numbers:
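A minimal sketch of this kind of extraction. The input text and the pattern names v and n are illustrative and are assumed to have no values assigned; Map at level 2 is used here to apply ToExpression to each extracted substring:

```wolfram
(* pick out name=value pairs as pairs of substrings *)
pairs = StringCases["a1=6.7, b2=8.87",
   (v : (WordCharacter ..)) ~~ "=" ~~ (n : NumberString) -> {v, n}]
(* {{"a1", "6.7"}, {"b2", "8.87"}} *)

(* convert each substring to an ordinary symbol or number *)
Map[ToExpression, pairs, {2}]
(* {{a1, 6.7}, {b2, 8.87}} *)
```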
In many situations, textual data may contain sequences of spaces, newlines or tabs that should be considered "whitespace", and perhaps ignored. In the Wolfram Language, the symbol Whitespace stands for any such sequence.
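For example, in this sketch each run of spaces in an arbitrary string is replaced by a single comma:

```wolfram
StringReplace["aa b  cc   d", Whitespace -> ","]   (* "aa,b,cc,d" *)
```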
String patterns normally apply to substrings that appear at any position in a given string. Sometimes, however, it is convenient to specify that patterns can apply only to substrings at particular positions. You can do this by including symbols such as StartOfString in your string patterns.
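|StartOfString||start of the whole string|
|EndOfString||end of the whole string|
|StartOfLine||start of a line|
|EndOfLine||end of a line|
|WordBoundary||boundary between word characters and others|
|Except[StartOfString], etc.||anywhere except at the particular positions StartOfString, etc.|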
String patterns allow the same kind of /; and other conditions as ordinary Wolfram Language patterns.
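The following sketch combines position symbols and a /; condition. The example strings and the pattern name d are arbitrary, and d is assumed to have no value assigned:

```wolfram
(* only the run of "a" at the very start of the string is replaced *)
StringReplace["aaa bbb aaa", StartOfString ~~ ("a" ..) -> "X"]
(* "X bbb aaa" *)

(* "cat" is replaced only where it stands as a whole word *)
StringReplace["the cat in the cathedral",
  WordBoundary ~~ "cat" ~~ WordBoundary -> "dog"]
(* "the dog in the cathedral" *)

(* keep only digit runs of length 2 or more *)
StringCases["a1b22c333", (d : (DigitCharacter ..)) /; StringLength[d] >= 2]
(* {"22", "333"} *)
```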
When you give an object such as x__ or e.. in a string pattern, the Wolfram Language normally assumes that you want this to match the longest possible sequence of characters. Sometimes, however, you may instead want to match the shortest possible sequence of characters. You can specify this using Shortest[p].
|Longest[p]||the longest consistent match for p (default)|
|Shortest[p]||the shortest consistent match for p|
Shortest specifies that instead the shortest possible match should be found:
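The contrast can be sketched on an arbitrary string containing several parenthesized groups:

```wolfram
(* default: __ matches as much as possible, so one long match is found *)
StringCases["-(a)--(bb)--(c)-", "(" ~~ __ ~~ ")"]
(* {"(a)--(bb)--(c)"} *)

(* Shortest stops at the first closing parenthesis each time *)
StringCases["-(a)--(bb)--(c)-", "(" ~~ Shortest[__] ~~ ")"]
(* {"(a)", "(bb)", "(c)"} *)
```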
The Wolfram Language by default treats characters such as "X" and "x" as distinct. But by setting the option IgnoreCase->True in string manipulation operations, you can tell the Wolfram Language to treat all such uppercase and lowercase letters as equivalent.
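A short sketch of the effect of the option, on an arbitrary example string:

```wolfram
StringReplace["abcde ABCDE", "a" -> "X"]                       (* "Xbcde ABCDE" *)
StringReplace["abcde ABCDE", "a" -> "X", IgnoreCase -> True]   (* "Xbcde XBCDE" *)
```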
In some string operations, one may have to specify whether to include overlaps between substrings. By default StringCases and StringCount do not include overlaps, but StringPosition does.
StringPosition includes overlaps by default:
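The difference in defaults can be seen in this sketch, where successive matches on an arbitrary string share a "-" character:

```wolfram
(* StringCases skips overlapping matches by default *)
StringCases["-a-b-c-d-", "-" ~~ _ ~~ "-"]
(* {"-a-", "-c-"} *)

(* StringPosition reports them *)
StringPosition["-a-b-c-d-", "-" ~~ _ ~~ "-"]
(* {{1, 3}, {3, 5}, {5, 7}, {7, 9}} *)

(* the Overlaps option makes StringCases include them as well *)
StringCases["-a-b-c-d-", "-" ~~ _ ~~ "-", Overlaps -> True]
(* {"-a-", "-b-", "-c-", "-d-"} *)
```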
|Overlaps->All||include all overlaps|
|Overlaps->True||include at most one overlap beginning at each position|
|Overlaps->False||exclude all overlaps|
General Wolfram Language patterns provide a powerful way to do string manipulation. But particularly if you are familiar with specialized string manipulation languages, you may sometimes find it convenient to specify string patterns using regular expression notation. You can do this in the Wolfram Language with RegularExpression objects.
|RegularExpression["regex"]||a regular expression specified by "regex"|
RegularExpression in the Wolfram Language supports all standard regular expression constructs.
|c||the literal character c|
|.||any character except newline|
|[c1c2…]||any of the characters ci|
|[c1-c2]||any character in the range c1–c2|
|[^c1c2…]||any character except the ci|
|p*||p repeated zero or more times|
|p+||p repeated one or more times|
|p?||zero or one occurrence of p|
|p{m,n}||p repeated between m and n times|
|p*?, p+?, p??||the shortest consistent strings that match|
|(p1p2…)||strings matching the sequence p1p2…|
|p1|p2||strings matching p1 or p2|
There is a close correspondence between many regular expression constructs and basic general Wolfram Language string pattern constructs.
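As a sketch of this correspondence, the two calls below find the same digit runs in an arbitrary string, once with a regular expression and once with a general string pattern:

```wolfram
StringCases["a1b22c333", RegularExpression["[0-9]+"]]   (* {"1", "22", "333"} *)
StringCases["a1b22c333", DigitCharacter ..]             (* {"1", "22", "333"} *)

(* a character class used in a replacement *)
StringReplace["abcd acbd", RegularExpression["[ab]"] -> "XX"]
(* "XXXXcd XXcXXd" *)
```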
Just as in general Wolfram Language string patterns, there are special notations in regular expressions for various common classes of characters. Note that you need to use double backslashes (\\) to enter most of these notations in Wolfram Language regular expression strings.
|\\d||digit 0–9 (DigitCharacter)|
|\\s||space, newline, tab, or other whitespace character (WhitespaceCharacter)|
|\\w||word character (letter, digit, or _) (WordCharacter)|
|[[:class:]]||characters in a named class|
|[^[:class:]]||characters not in a named class|
The Wolfram Language supports the standard POSIX character classes alnum, alpha, ascii, blank, cntrl, digit, graph, lower, print, punct, space, upper, word, and xdigit.
|^||the beginning of the string (StartOfString)|
|$||the end of the string (EndOfString)|
|\\b||word boundary (WordBoundary)|
In general Wolfram Language patterns, you can use constructs like x_ and x:patt to give arbitrary names to objects that are matched. In regular expressions, there is a way to do something somewhat like this using numbering: the nth parenthesized pattern object (p) in a regular expression can be referred to as \\n within the body of the pattern, and as $n outside it.
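A minimal sketch of both forms of numbered reference, with arbitrary example strings:

```wolfram
(* \\1 inside the pattern: any character followed by a copy of itself *)
StringCases["aabbcdd", RegularExpression["(.)\\1"]]
(* {"aa", "bb", "dd"} *)

(* $1 and $2 outside the pattern: swap the two matched words *)
StringReplace["hello world",
  RegularExpression["(\\w+) (\\w+)"] -> "$2 $1"]
(* "world hello" *)
```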
In addition to the ordinary characters that appear on a standard keyboard, you can include in Wolfram Language strings any of the special characters that are supported by the Wolfram Language.
In a Wolfram System notebook, special characters can always be displayed directly. But if you use a text‐based interface, then often the only characters that can readily be displayed are the ones that appear on your keyboard. Exactly which special characters can be displayed is inferred from the value of $CharacterEncoding.
As a result, what the Wolfram System does in such situations is to try to approximate special characters by similar‐looking sequences of ordinary characters. And when this is not practical, the Wolfram System just gives the full name of the special character.
In a Wolfram System notebook using StandardForm, special characters can be displayed directly:
In OutputForm, however, special characters that cannot be displayed exactly are approximated when possible by sequences of ordinary ones:
When using InputForm or FullForm, special characters are not approximated. The Wolfram Language uses full names for non-representable special characters in InputForm, while FullForm always uses long names, even in the notebook interface.
In InputForm, all characters not part of the encoding—in this case the special characters other than é—are written using long names:
In FullForm, all special characters are written using long names:
By default, the Wolfram System uses the character encoding "PrintableASCII" when saving notebooks and packages. This means that when special characters are written out to files or external programs, they are represented purely as sequences of ordinary characters. This uniform representation is crucial in allowing special characters in the Wolfram Language to be used in a way that does not depend on the details of particular computer systems.
In InputForm, all special characters are written out fully when using "PrintableASCII":
|x||a literal character|
|\[Name]||a character specified using its full name|
|\"||a " to be included in a string|
|\\||a \ to be included in a string|
In InputForm there is an explicit \n to represent the newline:
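As a sketch, a string containing a newline can be inspected as follows. The example string is arbitrary, and ToString with InputForm is used here only to capture the textual form:

```wolfram
StringLength["a\nb"]          (* 3: the newline is a single character *)
Characters["a\tb"]            (* {"a", "\t", "b"} *)

(* the InputForm text contains a backslash followed by n, not a raw newline *)
ToString["a\nb", InputForm]   (* the 6 characters "a\nb", quotes included *)

(* \" and \\ each enter a single character *)
StringLength["\"\\"]          (* 2 *)
```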
You should realize that even though it is possible to achieve some formatting of Wolfram Language output by creating strings which contain raw tabs and newlines, this is rarely a good idea. Typically a much better approach is to use the higher-level Wolfram Language formatting primitives discussed in "String-Oriented Output Formats", "Output Formats for Numbers", and "Tables and Matrices". These primitives will always yield consistent output, independent of such issues as the positions of tab settings on a particular device.
The front end formatting construct Column gives more control. Here text is aligned on the right:
|ToCharacterCode["string"]||give a list of the character codes for the characters in a string|
|FromCharacterCode[n]||construct a character from its character code|
|FromCharacterCode[{n1,n2,…}]||construct a string of characters from a list of character codes|
The Wolfram Language assigns every character that can appear in a string a unique character code. This code is used internally as a way to represent the character.
FromCharacterCode reconstructs the original string:
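A short round trip, sketched with an arbitrary example string (the name codes is an illustrative choice):

```wolfram
codes = ToCharacterCode["ABCD abcd"]
(* {65, 66, 67, 68, 32, 97, 98, 99, 100} *)

FromCharacterCode[codes]        (* "ABCD abcd" *)

(* special characters have codes too *)
ToCharacterCode["\[Alpha]"]     (* {945} *)
```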
|CharacterRange["c1","c2"]||generate a list of characters with successive character codes|
The Wolfram Language assigns names such as \[Alpha] to a large number of special characters. This means that you can always refer to such characters just by giving their names, without ever having to know their character codes.
The Wolfram Language has names for all the common characters that are used in mathematical notation and in standard European languages. But for languages such as Japanese, Chinese, and Korean, there are thousands of additional characters, and the Wolfram Language does not assign an explicit name to each of them. Instead, it refers to such characters by standardized character codes.
In FullForm, these characters are referred to by standardized character codes. The character codes are given in hexadecimal:
The notebook front end for the Wolfram System is set up so that when you enter a character, the Wolfram System will automatically work out the character code for that character.
Sometimes, however, you may find it convenient to be able to enter characters directly using character codes.
|\.nn||a character with hexadecimal code nn|
|\:nnnn||a character with hexadecimal code nnnn|
|\|nnnnnn||a character with hexadecimal code nnnnnn|
For characters with character codes below 256, you can use \.nn. For characters with character codes of 256 or above, you must use either \:nnnn or \|nnnnnn. Note that in all cases you must give a fixed number of hexadecimal digits, padding with leading 0s if necessary.
This enters the characters using their character codes. Note the leading 0 that must be inserted when a code has fewer than four hexadecimal digits, as in \:03b1 for \[Alpha]:
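A minimal sketch, with three arbitrarily chosen characters: "A" (hexadecimal 41), \[AGrave] (hexadecimal e0) and \[Alpha] (hexadecimal 03b1):

```wolfram
(* two-digit codes use \. and four-digit codes use \: *)
ToCharacterCode["\.41\.e0\:03b1"]   (* {65, 224, 945} *)

(* the hexadecimal form and the named form denote the same character *)
"\:03b1" === "\[Alpha]"             (* True *)
```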
In assigning codes to characters, the Wolfram Language follows three compatible standards: ASCII, ISO Latin‐1, and Unicode. ASCII covers the characters on a normal American English keyboard. ISO Latin‐1 covers characters in many European languages. Unicode is a more general standard which defines character codes for several tens of thousands of characters used in languages and notations around the world.
ASCII control characters
printable ASCII characters
lowercase English letters
ISO Latin‐1 characters
letters in European languages
Unicode standard public characters
Chinese, Japanese, and Korean characters
modified letters used in mathematical notation
mathematical symbols and operators
Unicode private characters defined specially by the Wolfram Language
Here are some special characters used in mathematical notation. The empty boxes correspond to characters not available in the current font:
The Wolfram Language always allows you to refer to special characters by using names such as \[Alpha] or explicit hexadecimal codes such as \:03b1. And when the Wolfram Language writes out files, it by default uses these names or hexadecimal codes.
But sometimes you may find it convenient to use raw encodings for at least some special characters. What this means is that rather than representing special characters by names or explicit hexadecimal codes, you instead represent them by raw bit patterns appropriate for a particular computer system or particular font.
|$CharacterEncoding=None||use printable ASCII names for all special characters|
|$CharacterEncoding="name"||use the raw character encoding specified by name|
|$SystemCharacterEncoding||the default raw character encoding for your particular computer system|
When you press a key or combination of keys on your keyboard, the operating system of your computer sends a certain bit pattern to the Wolfram System. How this bit pattern is interpreted as a character within the Wolfram System will depend on the character encoding that has been set up.
The notebook front end for the Wolfram System typically takes care of setting up the appropriate character encoding automatically for whatever font you are using. But if you use the Wolfram System with a text‐based interface or via files or pipes, then you may need to set $CharacterEncoding explicitly.
By specifying an appropriate value for $CharacterEncoding you will typically be able to get the Wolfram Language to handle raw text generated by whatever language‐specific text editor or operating system you use.
You should realize, however, that while the standard representation of special characters used in the Wolfram Language is completely portable across different computer systems, any representation that involves raw character encodings will inevitably not be.
|"PrintableASCII"||printable ASCII characters only|
|"ASCII"||all ASCII including control characters|
|"ISOLatin1"||characters for common western European languages|
|"ISOLatin2"||characters for central and eastern European languages|
|"ISOLatin3"||characters for additional European languages (e.g. Catalan, Turkish)|
|"ISOLatin4"||characters for other additional European languages (e.g. Estonian, Lappish)|
|"ISOLatinCyrillic"||English and Cyrillic characters|
|"AdobeStandard"||Adobe standard PostScript font encoding|
|"MacintoshRoman"||Macintosh roman font encoding|
|"WindowsANSI"||Windows standard font encoding|
|"Symbol"||symbol font encoding|
|"ZapfDingbats"||Zapf dingbats font encoding|
|"ShiftJIS"||shift‐JIS for Japanese (mixture of 8‐ and 16‐bit)|
|"EUC"||extended Unix code for Japanese (mixture of 8‐ and 16‐bit)|
|"UTF-8"||Unicode transformation format encoding|
The Wolfram System knows about various raw character encodings, appropriate for different computer systems and different languages. Copying of characters between the Wolfram System notebook interface and user interface environment on your computer generally uses the native character encoding for that environment. Wolfram Language characters which are not included in the native encoding will be written out using standard Wolfram Language full names or hexadecimal codes.
The Wolfram Language kernel can use any character encoding you specify when it writes or reads text files. By default, Put and PutAppend produce an ASCII representation for reliable portability of Wolfram Language files from one system to another.
The Wolfram Language supports both 8‐ and 16‐bit raw character encodings. In an encoding such as "ISOLatin1", all characters are represented by bit patterns containing 8 bits. But in an encoding such as "ShiftJIS" some characters instead involve bit patterns containing 16 bits.
Most of the raw character encodings supported by the Wolfram Language include basic ASCII as a subset. This means that even when you are using such encodings, you can still give ordinary Wolfram Language input in the usual way, and you can specify special characters using \[ and \: sequences.
Some raw character encodings, however, do not include basic ASCII as a subset. An example is the "Symbol" encoding, in which the character codes normally used for a and b are instead used for α and β.
|ToCharacterCode["string"]||generate codes for characters using the standard Wolfram Language encoding|
|ToCharacterCode["string","encoding"]||generate codes for characters using the specified encoding|
|FromCharacterCode[{n1,n2,…}]||generate characters from codes using the standard Wolfram Language encoding|
|FromCharacterCode[{n1,n2,…},"encoding"]||generate characters from codes using the specified encoding|
Here are the codes in the Windows standard encoding. There is no code for \[Pi] in that encoding:
The character codes used internally by the Wolfram Language are based on Unicode. But externally the Wolfram Language by default always uses plain ASCII sequences such as \[Name] or \:nnnn to refer to special characters. By telling it to use the "UTF-8" character encoding, however, you can get the Wolfram Language to read and write characters in a standard Unicode form.
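The sketch below shows codes in two encodings. The example strings are arbitrary; the encoding names "WindowsANSI" and "UTF-8" are the ones assumed here for the Windows standard and Unicode transformation format encodings:

```wolfram
(* codes in the Windows standard encoding; the first four are 97, 98, 99, 233,
   and the text above indicates that \[Pi] has no code in this encoding *)
ToCharacterCode["abc\[EAcute]\[Pi]", "WindowsANSI"]

(* in UTF-8 each Greek letter is represented by two bytes *)
ToCharacterCode["\[Alpha]\[Beta]", "UTF-8"]         (* {206, 177, 206, 178} *)

(* decoding the bytes gives the original characters back *)
FromCharacterCode[{206, 177, 206, 178}, "UTF-8"]    (* the string of \[Alpha] and \[Beta] *)
```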