Introduction

Unicode allows Audible to display, consume, and process text in the languages spoken by our customers. It also allows us to handle emojis, which are appearing in more and more product descriptions, customer reviews, etc. Every developer should be familiar with Unicode and UTF-8, and should be comfortable dealing with non-US-ASCII text (e.g., Ü Ý Þ ß Ж З И ओ औ 並 书 😃). In this article, we’ll introduce you to the basics of Unicode and UTF-8, and examine some common mistakes that Java developers make when processing non-US-ASCII text. Along the way, we’ll dispel some myths about character data (e.g., that any Unicode character can be stored in a Java char).

Before There Was Unicode

A character encoding is a mapping between characters and bytes. For example, the US-ASCII character encoding maps the letter A to the binary number 01000001 (which is decimal 65 or hexadecimal 41). Before Unicode, many different character encodings were used throughout the world, including the following:

  • US-ASCII
  • EBCDIC (CP37, CP930, CP1047)
  • ISO 8859:
    • ISO 8859-1 Western Europe
    • ISO 8859-2 Western and Central Europe
    • ISO 8859-3 Western Europe and South European (Turkish, Maltese plus Esperanto)
    • ISO 8859-4 Western Europe and Baltic countries (Lithuania, Estonia, Latvia and Lapp)
    • ISO 8859-5 Cyrillic alphabet
    • ISO 8859-16 Central, Eastern and Southern European languages …
  • MS-Windows character sets:
    • Windows-1250 for Central European languages that use Latin script, (Polish, …
    • Windows-1251 for Cyrillic alphabets
    • Windows-1252 for Western languages
    • Windows-1253 for Greek
    • Windows-1258 for Vietnamese
  • Chinese Guobiao
    • GB 2312
    • GBK (Microsoft Code page 936)
    • GB 18030
  • JIS X 0208
    • Shift JIS (Microsoft Code page 932 is a dialect of Shift_JIS)
    • EUC-JP
    • ISO-2022-JP
  • Many others

(This list is a subset of the list at Character encoding.)

US-ASCII is a 7-bit encoding that was originally published in 1963. It defines only 128 characters, including the upper- and lowercase letters of the English alphabet, digits, punctuation, and control characters.

The character encodings for Western languages were usually extensions of US-ASCII making use of the 8th bit to include another 128 characters. 

As you can see, each character encoding covered only a limited number of languages. Thus, it was difficult to develop applications that processed text from many languages.

Quiz!

Which language can use US-ASCII to encode all its characters? No, it’s not English, since English includes loanwords with diacritical marks borrowed from other languages; for example, résumé, exposé, façade, piñata, naïve, entrée, etc. The answer is Rotokas, which is a language spoken by roughly 4,000 people on the island of Bougainville, which is part of Papua New Guinea (near Indonesia).

What is Unicode?

Unicode is an international computing industry standard for the representation of text. 

It’s concerned with characters (i.e., letters, numerals, pictographic characters, ideographic characters, punctuation, diacritical marks, mathematical symbols, Braille, musical notation, emojis, etc.). Unicode is not concerned with formatting (e.g., italics, bold, etc.).

Unicode currently includes over 128,000 characters covering 135+ modern & historic scripts, plus various symbol sets. (Unicode supports over 1.1 million possible characters.) Unicode strives to avoid duplicate characters across scripts.

As we’ll see below, Unicode, unlike US-ASCII, supports multiple character encodings (e.g., UTF-8, UTF-16, UTF-32, etc.). 

The following table (which is a subset of Richard Tobin’s table) includes a tiny portion of Unicode’s characters. Each bullet is one of the blocks in Unicode. A block is defined as a contiguous range of characters. Blocks do not overlap. A block can have unassigned character slots.

  • Basic Latin: ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~
  • Latin-1 Supplement:   ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
  • Cyrillic: Ё Ђ Ѓ Є Ѕ І Ї Ј Љ Њ Ћ Ќ Ў Џ А Б В Г Д Е Ж З И Й К Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я а б в г д е ж з и й …
  • Devanagari: ऎ ए ऐ ऑ ऒ ओ औ क ख ग घ ङ च छ ज झ ञ ट ठ ड ढ ण त थ द ध न ऩ प फ ब भ म य र ऱ ल ळ ऴ व श ष स ह …
  • CJK Unified Ideographs: 一 丁 丂 七 丄 丅 丆 万 丈 三 上 下 丌 不 与 丏 丐 丑 丒 专 且 丕 世 丗 丘 丙 业 丛 东 丝 丞 丟 丠 両 丢 丣 两 严 並 丧 丨 丩 个 丫 丬 中 丮 丯 丰 …
  • General Punctuation: ‐ ‒ – — ― ‖ ‗ ‘ ’ ‚ ‛ “ ” „ ‟ † ‡ • ‣ ․ ‥ … ‧ ⁇ ⁈ ⁉ ⁊ ⁋ ⁌ ⁍ ⁎ ⁏ ⁐ ⁑ ⁒ ⁗ …
  • Currency Symbols: ₠ ₡ ₢ ₣ ₤ ₥ ₦ ₧ ₨ ₩ ₪ ₫ € ₭ ₮ ₯ ₰ ₱
  • Mathematical Operators: ∀ ∁ ∂ ∃ ∄ ∅ ∆ ∇ ∈ ∉ ∊ ∋ ∌ ∍ ∎ ∏ ∐ ∑ − ∓ ∔ ∕ ∖ ∗ ∘ ∙ √ ∛ ∜ ∝ ∞ ∟ ∠ ∡ ∢ ∣ ∤ ∥ ∦ ∧ ∨ ∩ ∪ ∫ ∬ ∭ ∮ ∯ …
  • Miscellaneous Symbols: ☀ ☁ ☂ ☃ ☄ ★ ☆ ☇ ☈ ☉ ☊ ☋ ☌ ☍ ☎ ☏ ☐ ☑ ☒ …
  • Etc.

The first block is Basic Latin, which contains all the printable characters and control codes of US-ASCII. As we’ll see below, it’s the only block where each character is encoded as 1 byte in UTF-8.

The Unicode standard specifies various properties of characters, including the following:

  • Name - for example, “LATIN CAPITAL LETTER A”.
  • Code Point - a unique number assigned to each Unicode character. Code Points are usually expressed in Hexadecimal as U+xxxx or U+xxxxx. The code point for ‘A’ is U+0041.
  • Category - for example, “Letter, Uppercase”.
  • Block - a contiguous range of code points; for example, “Basic Latin”.
  • Script - a collection of characters used to represent text in one or more languages. For example, “Latin”, “Han”, “Cyrillic”, etc.
  • Directionality - for example, “Left to Right”.
  • Lower Case - the corresponding lower case character(s), if any.
  • Upper Case - the corresponding upper case character(s), if any.
  • Picture - a picture of the character.
  • Number value for Numerals - for example, 10 for the Roman Numeral “X”.
  • Etc.

The following table provides several properties of the characters “A”, “”, “书” and “🔊”. They will be recurring characters throughout our Unicode story.

Property       | A                      | Â                                      | 书                               | 🔊
Name           | LATIN CAPITAL LETTER A | LATIN CAPITAL LETTER A WITH CIRCUMFLEX | book, letter, document; writings | SPEAKER WITH THREE SOUND WAVES
Code Point     | U+0041                 | U+00C2                                 | U+4E66                           | U+1F50A
Script         | Latin                  | Latin                                  | Han                              | Common
Category       | Letter, Uppercase      | Letter, Uppercase                      | Letter, Other                    | Symbol, Other
Block          | Basic Latin            | Latin-1 Supplement                     | CJK Unified Ideographs           | Miscellaneous Symbols and Pictographs
Directionality | Left to Right          | Left to Right                          | Left to Right                    | Other Neutrals
Lower Case     | U+0061                 | U+00E2                                 | (none)                           | (none)

The 🔊 character is an emoji. Unicode includes over 1,000 emojis.

Combining Marks

There’s something you should know about characters that have diacritical marks (e.g., Â, whose diacritical mark is the circumflex on top of the A). Unicode allows such a character to be represented as the base character followed by one or more combining marks. For the character  (U+00C2), the base character is A (U+0041), and the combining mark is the circumflex (U+0302).
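
Here’s a minimal sketch (using the JDK’s java.text.Normalizer class) showing that the two representations of  render identically but are different strings, and that normalization converts between them:

    // "Â" as one code point vs. base character + combining mark:
    String precomposed = "\u00C2"; // Â (U+00C2)
    String decomposed = "A\u0302"; // A followed by COMBINING CIRCUMFLEX ACCENT (U+0302)

    precomposed.equals(decomposed) // returns false
    precomposed.length()           // returns 1
    decomposed.length()            // returns 2

    // NFC normalization composes "A" + U+0302 into the single code point U+00C2:
    precomposed.equals(Normalizer.normalize(decomposed, Normalizer.Form.NFC)) // returns true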

UTF-8, UTF-16, and UTF-32

Unicode supports multiple character encodings. We mentioned above that a character encoding is a mapping between characters and bytes. For example, the US-ASCII character encoding maps the letter A to the binary number 01000001 (which is decimal 65 or hexadecimal 41). The most common Unicode character encodings are UTF-8, UTF-16 and UTF-32. As we’ll see next, each has advantages and disadvantages.

UTF-8

UTF-8 is the de facto character encoding for Unicode. Roughly 87% of all web pages use the UTF-8 encoding. UTF-8 is used by FreeBSD and most recent Linux distributions. It’s the default encoding for XML and HTML.

UTF-8 is an 8-bit, variable-width encoding, which encodes each Unicode character using 1 to 4 bytes. In UTF-8, each US-ASCII character (e.g., “A”) is encoded as 1 byte. In fact, UTF-8 is backwards compatible with US-ASCII: a file containing only US-ASCII text is also a valid UTF-8 file.

UTF-8 uses 2, 3, or 4 bytes to encode non-US-ASCII characters. 

Let’s look at the number of bytes required to encode our four favorite characters using UTF-8:

  • 1 byte for A  (i.e., U+0041)
  • 2 bytes for  (i.e., U+00C2)
  • 3 bytes for 书 (i.e., U+4E66)
  • 4 bytes for 🔊 (i.e., U+1F50A)

As you can see, UTF-8 is efficient for US-ASCII text, but it’s not very efficient for Asian text.

UTF-16

UTF-16 encodes each Unicode character as one or two 16-bit code units (i.e., each character is encoded using 2 bytes or 4 bytes).

Let’s look at the number of bytes required to encode our four favorite characters in UTF-16:

  • 2 bytes for A  (i.e., U+0041)
  • 2 bytes for  (i.e., U+00C2)
  • 2 bytes for 书 (i.e., U+4E66)
  • 4 bytes for 🔊 (i.e., U+1F50A)

As you can see, UTF-16 is efficient for Asian languages, but not for US-ASCII.

UTF-16 uses two bytes for the code points from U+0000 to U+FFFF; it uses four bytes for code points between U+10000 and U+10FFFF, which are called the supplementary characters.

UTF-32

UTF-32 encodes each Unicode character as one 32-bit code unit (i.e., each character is encoded using 4 bytes).

The following table compares the number of bytes required to encode our four favorite characters in UTF-8, UTF-16, and UTF-32:

Number of Bytes

    Character      UTF-8   UTF-16   UTF-32
    A  (U+0041)      1       2        4
      (U+00C2)      2       2        4
    书 (U+4E66)      3       2        4
    🔊 (U+1F50A)     4       4        4

As you can see, UTF-32 is inefficient for all languages! Note that Unicode is a 21-bit standard; thus, UTF-32 uses only 21 of its 32 bits.
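
You can verify the table from Java. Here’s a minimal sketch (note that StandardCharsets has no UTF-32 constant, so we look that charset up by name; the BE variants are used so that no BOM is counted):

    // Prints the encoded size of each character in UTF-8, UTF-16, and UTF-32.
    for (String c : new String[] { "A", "Â", "书", "🔊" }) {
        System.out.printf("%s UTF-8: %d UTF-16: %d UTF-32: %d%n",
                c,
                c.getBytes(StandardCharsets.UTF_8).length,
                c.getBytes(StandardCharsets.UTF_16BE).length,
                c.getBytes(Charset.forName("UTF-32BE")).length);
    }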

Byte Order Mark (BOM)

UTF-16 and UTF-32 have to deal with the issue of Big Endian (BE) vs Little Endian (LE) byte order because they use multi-byte code units. On a BE machine, the most significant byte precedes the least significant byte in a word; on an LE machine, the least significant byte precedes the most significant byte. If a BE machine sends UTF-16 text to an LE machine, the consuming software on the LE machine will have difficulty converting the UTF-16 data into characters. Unicode has a solution for this issue, and it’s da BOM!

The Byte Order Mark (BOM) is a special code point (U+FEFF) that can appear at the start of Unicode text to indicate whether the code units use BE ordering or LE ordering. In big-endian data, the BOM appears as the 0xFE byte followed by the 0xFF byte; in little-endian data, it appears as the 0xFF byte followed by the 0xFE byte. The consumer of UTF-16 or UTF-32 text uses the BOM to figure out the byte order (i.e., the endianness) of the data.

The BOM is optional; if used, it appears at the start of the Unicode text. The BOM is not needed if the encoding of the text is explicitly specified as UTF-16LE or UTF-16BE (or UTF-32LE or UTF-32BE).

The BOM is not needed for UTF-8, and the Unicode standard states that the BOM is not recommended for UTF-8.
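
You can see the BOM from Java. The JDK’s “UTF-16” charset writes a big-endian BOM when encoding, while “UTF-16BE” and “UTF-16LE” (whose byte order is explicit) do not; a minimal sketch:

    // The "UTF-16" charset prepends a big-endian BOM when encoding:
    "A".getBytes(StandardCharsets.UTF_16)   // returns the bytes 0xFE 0xFF 0x00 0x41

    // The "UTF-16BE" charset does not, because the byte order is explicit:
    "A".getBytes(StandardCharsets.UTF_16BE) // returns the bytes 0x00 0x41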

Java and Unicode

Java supports Unicode. It uses UTF-16 for its internal text representation; the Java char data type is stored as 16 bits in memory. The Unicode character “书” (i.e., U+4E66) can be represented in a string literal as “\u4E66”. Thus, the following Java expression evaluates to true:

    "书".equals("\u4E66") // returns true

JDK 1.0 supported Unicode 1.1.5, which was a 16-bit standard that supported only 65K characters; back then, the Java 16-bit char was sufficient for all Unicode characters.

Unicode 3.1 introduced supplementary characters (i.e., code points above U+FFFF). Unicode was now a 21-bit standard supporting over 1.1 million possible characters. Java’s 16-bit char data type was no longer sufficient for all Unicode characters.

The supplementary characters include ancient scripts, such as Gothic and Cuneiform. Some supplementary characters are used in Chinese and Japanese personal names. And the supplementary characters include many emojis (e.g., 🔊, 😊, 😉, etc.). Unicode includes over 1,000 emojis, and the vast majority of them are supplementary characters.

Java 5 (i.e., J2SE 5.0) introduced support for Unicode 4.0 in 2004. Java finally supported code points above U+FFFF (i.e., the supplementary characters).

Surrogate Pairs

Java’s designers decided against introducing a new char32 data type (or widening the existing char data type) to support the supplementary characters. Instead, they decided to support supplementary characters using surrogate pairs. (If you’re new to surrogate pairs, you might want to sit down before reading further!)

Each supplementary character (i.e., code point above U+FFFF) is represented as a surrogate pair, which Java stores as a pair of char values. You heard me right; a single Unicode character can be represented as two adjacent Java char values!

For example, in Java, the 🔊 character is represented as a surrogate pair (i.e., two Java char values). Thus, the following expression evaluates to true:

    "🔊".length() == 2

To determine the number of Unicode characters (i.e., code points) in a Java string, you can use the String.codePointCount() method:

    "🔊".codePointCount(0, "🔊".length()) == 1

Unicode specifies the surrogate pair representation of each supplementary character. For example, the surrogate pair for 🔊 (i.e., U+1F50A) is U+D83D followed by U+DD0A. Thus, the following Java expression evaluates to true:

    "🔊".equals("\uD83D\uDD0A")  // returns true

Surrogate pairs may be ugly, but they build character! 

How do you find the surrogate pair for a supplementary character? If you have a JDK older than Java 9 installed (the tool was removed in JDK 9), you can use the native2ascii command:

    $ echo 🔊 | native2ascii
    \ud83d\udd0a
    $

You can also get the hexadecimal numbers for a surrogate pair at the Unicode Character Search.

You can use Java’s Character.toChars() method to convert a Unicode code point into a Java string. For example, the following Java expression evaluates to true:

    "🔊".equals( new String( Character.toChars(0x1F50A) ) ) // returns true

Do I Really Need to Worry about Supplementary Characters?

By now you’re probably asking yourself:

Do I really need to worry about supplementary characters? My applications don’t process ancient scripts, and surrogate pairs are scary!

The answer is yes. Even if your text processing applications do not process Gothic text, or Chinese and Japanese names that contain supplementary characters, they will most likely need to process emojis. And, as I’m sure you’ve noticed:

🔊 “Emojis are everywhere!”

Emojis are appearing in book titles, product descriptions, customer reviews, customer feedback, etc. As mentioned above, Unicode includes over 1,000 emojis, and the vast majority of them are supplementary characters.

Common Unicode Mistakes in Java Apps

Next, we’ll look at some common mistakes that Java developers make when processing non-US-ASCII text.

String.length()

As we saw earlier, String.length() does not return the number of Unicode characters in a string. Instead, it returns the number of Java char values in the string. For example:

    "🔊".length() == 1 // returns false!

The expression “🔊”.length() returns 2 because the “🔊” character is represented as a surrogate pair, which is stored as a pair of Java char values.

The String.codePointCount(int beginIndex, int endIndex) method returns the number of Unicode characters in a string. For example:

    "🔊".codePointCount(0, "🔊".length()) == 1 // returns true

String.substring()

The String.substring() method does not handle the supplementary characters. For example, the following invocation of the substring() method does not return the first Unicode character of the string “🔊书”; instead, it returns a string containing only the first char of the surrogate pair for “🔊”, which is not a valid character on its own and typically displays as “?”:

    "🔊书".substring(0,1) // returns "?" rather than "🔊"

You can use the String.offsetByCodePoints(int index, int codePointOffset) method as follows to determine the correct endIndex of the first Unicode character of a string:

    "🔊书".substring(0, "🔊书".offsetByCodePoints(0, 1)) // returns "🔊"

Similarly, you shouldn’t use String.substring() to get the first N characters of a string. The following expression does not return the first two Unicode characters of the string “A🔊书”; instead, it returns “A” followed by the first char of the surrogate pair for “🔊”, which is not a valid character on its own:

    "A🔊书".substring(0, 2) // returns "A?" rather than "A🔊"

Finding the prefix of a string is further complicated by Unicode’s combining marks. For example, the first two Unicode characters of the string “AA\u0302” are in fact “AA”. However, a person will tell you that the first two characters are “A” (i.e., all of “AA\u0302”). That’s because “A\u0302” represents the Unicode character Â.

The java.text.BreakIterator class can be used to iterate through the characters of a string. It handles the supplementary characters, and it treats “A\u0302” as a single character. The following Java method returns the prefix of a string using the BreakIterator class; when counting numCharacters, it treats “A\u0302” as a single character:

 
    // Requires: import java.text.BreakIterator; import java.util.Locale;
    public static String getPrefix(String string, int numCharacters, Locale locale) {
        if (string == null) return null;
        if (numCharacters < 1) return "";

        BreakIterator breakIterator = BreakIterator.getCharacterInstance(locale);
        breakIterator.setText(string);
        int charCount = 0;
        breakIterator.first();
        for (int end = breakIterator.next(); end != BreakIterator.DONE; end = breakIterator.next()) {
            ++charCount;
            if (charCount >= numCharacters) {
                return string.substring(0, end);
            }
        }
        return string;
    }
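
A quick usage example (the test string below mixes a combining mark with a surrogate pair, and getPrefix counts each of them as one character):

    // "A\u0302" renders as Â, and 🔊 is stored as two char values:
    getPrefix("A\u0302🔊书", 2, Locale.US) // returns "Â🔊" (i.e., "A\u0302🔊")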

Sorting Strings

The following is a common way to sort US-ASCII strings in Java:

    List<String> strings = Arrays.asList("airplane", "airbag");
    Collections.sort(strings);

This works great for US-ASCII strings, but it doesn’t work for non-English strings. In fact, it doesn’t even work for all English strings:

    List<String> strings = Arrays.asList("fake", "façade");
    Collections.sort(strings);
    // Result of sort is incorrect: ["fake", "façade"]

To sort Unicode strings correctly, you should use the java.text.Collator class:

    List<String> strings = Arrays.asList("fake", "façade");
    Collator collator = Collator.getInstance(Locale.US);
    Collections.sort(strings, collator);
    // Result of sort is correct: ["façade", "fake"]

The rules for sorting strings vary by locale. Here’s an example of sorting German strings:

    List<String> strings = Arrays.asList("kreativ", "können");
    Collator collator = Collator.getInstance(Locale.GERMAN);
    Collections.sort(strings, collator);
    // Result of sort is correct: ["können", "kreativ"]

Equality of Strings

The String.equals() method does not handle Unicode’s combining marks. For example:

    "Â".equals("A\u0302") // returns false

One solution to this problem is to use the java.text.Normalizer class to normalize the strings before comparing them via String.equals(). But there’s an easier way: the Collator.equals(String source, String target) method handles Unicode’s combining marks. For example:

    Collator.getInstance(Locale.US).equals("Â", "A\u0302") // returns true
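
Note that the result of a Collator comparison depends on its strength setting. At the default (tertiary) strength, the combining-mark forms above compare equal, but accented and unaccented letters do not; lowering the strength to PRIMARY makes comparisons accent-insensitive, as in this sketch (whether that’s desirable depends on your application):

    Collator collator = Collator.getInstance(Locale.US);
    collator.equals("façade", "facade") // returns false at the default strength

    collator.setStrength(Collator.PRIMARY); // compare base letters only
    collator.equals("façade", "facade") // returns true at PRIMARY strength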

Character.toUpperCase()

It’s not a good idea to use the Character.toUpperCase(int codePoint) method (or the Character.toUpperCase(char ch) method) to convert a Unicode character to upper case; that’s because these methods can only return a single character. In German, the ‘ß’ character (U+00DF) converts to “SS” in upper case, which is two Unicode characters. Unfortunately, the following two expressions return ‘ß’ rather than “SS”:

    Character.toUpperCase('ß') // returns 'ß' rather than "SS"

    Character.toUpperCase(0x00DF) // returns 'ß' rather than "SS"

The solution is to use the String.toUpperCase(Locale locale) method instead:

    "ß".toUpperCase(Locale.GERMAN) // returns "SS"

Similarly, it’s better to use the String.toLowerCase(Locale locale) method rather than the Character.toLowerCase(int codePoint) method.

Character.isLetter()

The Character.isLetter(char ch) method does not support the supplementary characters. We can illustrate this using the supplementary character 𠜎 (i.e., U+2070E), which is a letter. The following expression returns false because the first Java char of the surrogate pair for “𠜎” is not a letter:

    Character.isLetter("𠜎".charAt(0)) // returns false!

The solution is to use the Character.isLetter(int codePoint) method, which does support supplementary characters:

    Character.isLetter(0x2070E) // returns true

Other Character Methods to Avoid

There are a bunch of other java.lang.Character methods that do not support supplementary characters; for example, getDirectionality(char ch), isDefined(char ch), isWhitespace(char ch), toTitleCase(char ch), etc.

What do all these methods have in common? You guessed it! The parameter to these methods is a char value. And we know that a Java char value can’t hold a supplementary character.

Fortunately, for each methodName(char ch) method in the Character class, there is usually a corresponding methodName(int codePoint) method that does support the supplementary characters. For example, as we saw above, Character.isLetter(char ch) does not support the supplementary characters, but Character.isLetter(int codePoint) does.
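
If you’re on Java 8 or later, the String.codePoints() method is a convenient way to feed those int-based methods: it streams the code points of a string, pairing up surrogates for you. A minimal sketch:

    // Counts the letters in a string containing a supplementary character.
    // A, 书, and 𠜎 are letters; 🔊 is not.
    "A书🔊𠜎".codePoints()
             .filter(Character::isLetter)
             .count() // returns 3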

Java and Charsets

A very common Unicode mistake made by Java developers is to convert a String into bytes without specifying the charset (i.e., the character encoding). For example:

    String string = "AÂ书🔊";
    byte[] bytes = string.getBytes();

The getBytes() method uses the platform’s default charset. If the default charset is US-ASCII, then the getBytes() call above will return garbage: every non-US-ASCII character is replaced with ‘?’.

Even if the default charset of your platform is UTF-8 (and it probably is), you should still explicitly specify the charset when converting a string to bytes (or vice versa). For example:

    String string = "AÂ书🔊";
    byte[] bytes = string.getBytes(StandardCharsets.UTF_8);

Similarly, you should specify the charset when converting a byte array into a String:

     new String(bytes, StandardCharsets.UTF_8)
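
To see why this matters, here’s a minimal sketch of the classic mojibake failure, where text is encoded with one charset and decoded with another:

    byte[] utf8 = "AÂ书🔊".getBytes(StandardCharsets.UTF_8);

    new String(utf8, StandardCharsets.ISO_8859_1) // returns garbled text (mojibake), not "AÂ书🔊"

    new String(utf8, StandardCharsets.UTF_8)      // returns "AÂ书🔊"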

Quiz!

How many bytes will be returned by the following expression?

    "AÂ书🔊".getBytes(StandardCharsets.UTF_8)

The answer is calculated as follows: 1 byte for “A”, plus 2 bytes for “”, plus 3 bytes for “书”, plus 4 bytes for “🔊”. Thus, the answer is ten.

Another common Unicode mistake is to wrap an InputStream or OutputStream in a Reader or Writer without specifying the charset. For example, you should avoid doing this:

    Reader reader = new InputStreamReader(inputStream);
    Writer writer = new OutputStreamWriter(outputStream);

Instead, you should specify the charset as follows:

    Reader reader = new InputStreamReader(inputStream, StandardCharsets.UTF_8);
    Writer writer = new OutputStreamWriter(outputStream, StandardCharsets.UTF_8);
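
The same rule applies when reading or writing files. Here’s a minimal sketch using java.nio (the file name is hypothetical):

    // Requires: import java.io.BufferedReader; import java.nio.file.*;
    // Files.newBufferedReader lets you pass the charset explicitly:
    try (BufferedReader reader =
            Files.newBufferedReader(Paths.get("reviews.txt"), StandardCharsets.UTF_8)) {
        reader.lines().forEach(System.out::println);
    }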

Summary of Common Unicode Mistakes

To avoid these common Unicode mistakes, remember the following when writing Java code:

  • Not all text is US-ASCII. In fact, not all English text is US-ASCII. The English word “façade” contains a non-US-ASCII character (i.e., “ç”).
  • Not all Unicode characters fit into a Java char. In Java, the supplementary characters are represented as a surrogate pair, which is a pair of Java char values.
  • With the popularity of emojis, supplementary characters and surrogate pairs are becoming unavoidable.
  • Remember that String.length() counts Java char values; it doesn’t count the number of Unicode characters.
  • Remember that a character with a diacritical mark can be represented as multiple Unicode characters (i.e., a base character followed by one or more combining marks).
  • When converting between Strings and byte arrays, always specify the charset (e.g., StandardCharsets.UTF_8).
  • When wrapping a java.io.InputStream with an InputStreamReader, always specify the charset (e.g., StandardCharsets.UTF_8). Similarly, when wrapping a java.io.OutputStream with an OutputStreamWriter, always specify the charset.
  • Avoid the String and Character methods that do not support supplementary characters (e.g., emojis): String.substring(), Character.toUpperCase(), Character.toLowerCase(), Character.isLetter(char ch), Character.isUpperCase(char ch), etc.
  • When sorting text, be sure to use the java.text.Collator class and to specify the correct Locale.
  • The String.equals() method does not normalize the strings; the Collator.equals(String source, String target) method does.
  • When converting text to upper or lower case, use the String.toUpperCase(Locale locale) and String.toLowerCase(Locale locale) methods, and be sure to specify the correct Locale. Don’t use the toUpperCase()/toLowerCase() methods of Character class.

HTML and Unicode

The easiest way to support Unicode in your HTML pages is to use the UTF-8 charset. This means that each HTML page contains Unicode text and is transmitted to the browser as a UTF-8 byte stream.

There are two common ways to identify the charset of your HTML pages. You can specify the charset in the Content-Type field of the HTTP header as follows:

    Content-Type: text/html; charset=UTF-8

You can also specify the charset using the meta tag in your HTML pages:

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>

If you are using XHTML, then you’ll be happy to know that the default charset for XHTML is UTF-8. You can also explicitly specify the charset of your XHTML using the XML declaration:

    <?xml version="1.0" encoding="UTF-8"?>

HTML supports numeric character references, which allow you to include any Unicode character in your HTML page by specifying the character’s code point. The numeric character reference for “🔊” (i.e., U+1F50A) is the following: &#x1F50A;

Thus, both of the following HTML paragraphs are equivalent:

    <p>🔊</p>

    <p>&#x1F50A;</p>

Recall that the supplementary character “🔊” can be included in a Java string by specifying the code points of its surrogate pair: “\uD83D\uDD0A”

This does not mean that you can include a supplementary character in an HTML page by specifying the numeric character references of the character’s surrogate pair. For example, you cannot include the supplementary character “🔊” in an HTML page using the following pair of numeric character references: &#xD83D;&#xDD0A;

The user’s browser would probably display &#xD83D;&#xDD0A; as follows: ��

Earlier versions of the popular StringEscapeUtils.escapeHtml() method of Apache Commons Lang had this problem with supplementary characters. The bug was fixed in version 3.0, where the escapeHtml() method was replaced with StringEscapeUtils.escapeHtml4().
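
If you ever need to generate numeric character references yourself, make sure you emit one reference per code point, never per char. Here’s a minimal sketch (escapeNonAscii is a hypothetical helper, not a library method):

    public static String escapeNonAscii(String s) {
        // codePoints() yields U+1F50A as a single value, so a surrogate
        // pair is never split into two invalid references.
        return s.codePoints()
                .mapToObj(cp -> cp < 0x80 ? String.valueOf((char) cp)
                                          : String.format("&#x%X;", cp))
                .collect(java.util.stream.Collectors.joining());
    }

    escapeNonAscii("🔊") // returns "&#x1F50A;"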

You can use numeric character references for combining marks. Recall that the “” character (U+00C2) can be represented in Unicode as the base character (“A”) followed by the circumflex combining mark (U+0302). Thus, the following HTML paragraphs are equivalent, in that each of them will be displayed as “” in the user’s browser:

    <p>A&#x0302;</p>

    <p>Â</p>

    <p>&#x00C2;</p>

Summary

Wow! I’m impressed that you made it to the end of this article! Let’s summarize what we’ve learned:

  • Not all text is US-ASCII. In fact, not all English text is US-ASCII. The English word “façade” contains a non-US-ASCII character (i.e., “ç”).
  • Unicode is an international computing industry standard for the representation of text. It currently includes over 128,000 characters covering 135+ modern & historic scripts, plus various symbol sets. Unicode supports over 1.1 million possible characters.
  • The most popular Unicode character encoding is UTF-8. It’s backwards compatible with US-ASCII. Roughly 87% of all web pages use the UTF-8 encoding. UTF-8 uses 1, 2, 3, or 4 bytes to encode Unicode characters. 
  • Java uses UTF-16 to represent text internally. Each Unicode character from code point U+0000 to code point U+FFFF is represented as a 16-bit Java char value. The code points between U+10000 and U+10FFFF are the supplementary characters. Each supplementary character is represented in Java as a surrogate pair (i.e., a pair of Java char values).
  • With the popularity of emojis, supplementary characters (and surrogate pairs) are becoming unavoidable. The vast majority of Unicode’s emojis are supplementary characters.
  • Avoid the common Unicode mistakes that are made in Java programs: avoid the many String/Character methods that do not support supplementary characters; don’t use the default charset; etc. Also, don’t forget to specify the locale when sorting text or when converting between upper and lower case.
