Character Sets & Encoding

Correct character encoding is essential for reading and writing text across systems. Java represents text as Unicode internally and uses the Charset API to convert between characters and bytes.

Unicode and Java

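A Java char is a 16-bit UTF-16 code unit, not necessarily a complete character
String behaves as a sequence of UTF-16 code units (since Java 9, Latin-1-only strings may be stored one byte per char internally)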
Supplementary characters (e.g., emoji) use two char values (surrogate pairs)
Use codePointAt() for full Unicode characters:

String emoji = "Hello 👋";
int codePoint = emoji.codePointAt(6);                     // 128075 (U+1F44B, waving hand)
int length = emoji.length();                              // 8 (UTF-16 code units, not characters)
int charCount = emoji.codePointCount(0, emoji.length());  // 7 (code points)
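
Because length() and substring() count UTF-16 code units, cutting a string at an arbitrary index can split a surrogate pair. The following is a minimal sketch of code-point-aware counting and truncation; the class name CodePointDemo and the helper truncate are hypothetical names chosen for illustration, and the emoji is written with escapes so the example does not depend on the source file's encoding.

```java
class CodePointDemo {
    // Counts Unicode code points instead of UTF-16 code units.
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Keeps at most maxCodePoints code points (assumed non-negative),
    // never cutting a surrogate pair in half.
    static String truncate(String s, int maxCodePoints) {
        if (countCodePoints(s) <= maxCodePoints) {
            return s;
        }
        int endIndex = s.offsetByCodePoints(0, maxCodePoints);
        return s.substring(0, endIndex);
    }

    public static void main(String[] args) {
        String text = "Hi \uD83D\uDC4B!"; // "Hi 👋!" written with escapes
        System.out.println(text.length());          // 6 code units
        System.out.println(countCodePoints(text));  // 5 code points
        System.out.println(truncate(text, 4));      // keeps the emoji intact
    }
}
```

A plain text.substring(0, 4) on the same input would end in a lone high surrogate, which is usually rendered as a replacement glyph.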

Charset API

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

Charset utf8 = StandardCharsets.UTF_8;
Charset iso8859 = StandardCharsets.ISO_8859_1;

// Available charsets
Charset.availableCharsets().keySet().forEach(System.out::println);

Encoding and Decoding

String text = "Hello, 世界";

// String → bytes
byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
byte[] utf16Bytes = text.getBytes(StandardCharsets.UTF_16); // big-endian, prefixed with a byte order mark

// bytes → String
String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);
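
The same string produces different byte sequences, and different lengths, depending on the charset. The sketch below compares a few encodings of the sample text; EncodedLengthDemo and encodedLength are illustrative names, and the CJK characters are written as escapes (\u4E16\u754C is 世界).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedLengthDemo {
    // Number of bytes the string occupies in the given charset.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String text = "Hello, \u4E16\u754C"; // 9 chars: 7 ASCII + 2 CJK

        // UTF-8: 1 byte per ASCII char, 3 bytes per CJK char here
        System.out.println(encodedLength(text, StandardCharsets.UTF_8));    // 13
        // UTF-16: 2-byte byte order mark + 2 bytes per char
        System.out.println(encodedLength(text, StandardCharsets.UTF_16));   // 20
        // UTF-16BE: fixed byte order, no byte order mark
        System.out.println(encodedLength(text, StandardCharsets.UTF_16BE)); // 18
        // ISO-8859-1 cannot represent the CJK chars; getBytes substitutes '?'
        System.out.println(encodedLength(text, StandardCharsets.ISO_8859_1)); // 9
    }
}
```

Note that the ISO-8859-1 result is lossy: getBytes silently replaces unmappable characters, so decoding those bytes does not give back the original text.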

With Charset Encoder/Decoder

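import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.*;

CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();

// encode() and decode() throw CharacterCodingException on malformed or unmappable input
ByteBuffer bytes = encoder.encode(CharBuffer.wrap("Hello"));
CharBuffer chars = decoder.decode(bytes);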
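
One reason to reach for CharsetEncoder instead of String.getBytes is error handling: getBytes silently substitutes a replacement byte for characters the charset cannot represent, while an encoder configured with CodingErrorAction.REPORT throws. The sketch below illustrates this; StrictEncodeDemo and canEncodeStrictly are hypothetical names.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictEncodeDemo {
    // Returns true only if every character can be encoded without loss.
    static boolean canEncodeStrictly(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            encoder.encode(CharBuffer.wrap(text));
            return true;
        } catch (CharacterCodingException e) {
            return false; // malformed surrogate or unmappable character
        }
    }

    public static void main(String[] args) {
        String cjk = "\u4E16"; // 世
        System.out.println(canEncodeStrictly(cjk, StandardCharsets.UTF_8));    // true
        System.out.println(canEncodeStrictly(cjk, StandardCharsets.US_ASCII)); // false
        // getBytes does not fail; it emits a single '?' byte instead
        System.out.println(cjk.getBytes(StandardCharsets.US_ASCII)[0] == '?');
    }
}
```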

File I/O with Encoding

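import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Path.of, Files.writeString, and Files.readString require Java 11+
Path file = Path.of("data.txt");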

// Write with explicit encoding
Files.writeString(file, "Hello, 世界", StandardCharsets.UTF_8);

// Read with explicit encoding
String content = Files.readString(file, StandardCharsets.UTF_8);

// BufferedReader with charset
try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
    reader.lines().forEach(System.out::println);
}

Common Charsets

Charset	Description
`UTF-8`	Default for the web, Linux, and macOS; variable length (1 to 4 bytes), ASCII-compatible
`UTF-16`	Java's in-memory string model; also used by Windows internals
`ISO-8859-1`	Latin-1, one byte per character (covers only U+0000 to U+00FF)
`US-ASCII`	7-bit ASCII

Always use UTF-8 unless you have a specific reason not to.

Default Charset Pitfall

// ❌ Uses the platform default charset, which can differ between Windows and Linux
byte[] defaultBytes = text.getBytes();
String fromDefault = new String(defaultBytes);

// ✅ Always specify the charset
byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
String fromUtf8 = new String(utf8Bytes, StandardCharsets.UTF_8);

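Since Java 18 (JEP 400), the default charset is UTF-8 on all platforms unless it is overridden, for example with -Dfile.encoding=COMPAT. Explicit specification is still best practice, because the code may run on an older JVM or with the default changed.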
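
To see what a charset mismatch looks like in practice, the sketch below encodes text as UTF-8 and then deliberately decodes it as ISO-8859-1, which is what effectively happens when writer and reader rely on different defaults. MojibakeDemo and decodeUtf8AsLatin1 are illustrative names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encodes as UTF-8, then decodes with the wrong charset on purpose.
    static String decodeUtf8AsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // What the no-argument getBytes() / new String(bytes) would use on this JVM
        System.out.println("Default charset: " + Charset.defaultCharset());

        // U+00E9 (é) is two bytes in UTF-8 (C3 A9); Latin-1 reads them as two chars
        String garbled = decodeUtf8AsLatin1("caf\u00E9");
        System.out.println(garbled.length()); // 5, not 4
    }
}
```

The garbled result is "caf" followed by U+00C3 and U+00A9, the familiar "Ã©" pattern that signals UTF-8 bytes read as Latin-1.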

Detecting Encoding

Java does not provide built-in encoding detection. Options:

Assume UTF-8 (most common)
Use libraries like ICU4J or juniversalchardet
Store encoding metadata alongside files
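
The "assume UTF-8" option can be made safer by validating the bytes first, since most non-UTF-8 data containing non-ASCII bytes is not valid UTF-8. The following is a rough sketch of that heuristic using a strict CharsetDecoder; the names Utf8Check, isValidUtf8, and decodeWithFallback are hypothetical, and the ISO-8859-1 fallback is only an assumption for illustration, not a real detection algorithm.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {
    // True if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Heuristic: prefer UTF-8, otherwise fall back to Latin-1 (assumed fallback).
    static String decodeWithFallback(byte[] data) {
        return isValidUtf8(data)
                ? new String(data, StandardCharsets.UTF_8)
                : new String(data, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] latin1 = {'h', (byte) 0xE9}; // "hé" in ISO-8859-1, invalid as UTF-8
        System.out.println(isValidUtf8(latin1));        // false
        System.out.println(decodeWithFallback(latin1)); // decoded as Latin-1
    }
}
```

Pure ASCII input passes the check under either assumption, so this heuristic only helps once non-ASCII bytes are present; a dedicated library remains the better choice when many encodings are possible.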

Properties Files

// Properties.load(Reader) decodes with the reader's charset;
// Properties.load(InputStream) always assumes ISO-8859-1
try (Reader reader = Files.newBufferedReader(
        Path.of("app.properties"), StandardCharsets.UTF_8)) {
    Properties props = new Properties();
    props.load(reader);
}
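
The difference between the two load overloads is easy to miss, so here is a small in-memory sketch that loads the same UTF-8 bytes both ways. PropertiesCharsetDemo, viaReader, and viaStream are hypothetical names; the property value \u4E16\u754C is 世界 written with escapes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class PropertiesCharsetDemo {
    // Loads through a Reader, so the bytes are decoded as UTF-8.
    static String viaReader(byte[] data, String key) {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    // Loads through an InputStream, which Properties reads as ISO-8859-1.
    static String viaStream(byte[] data, String key) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        byte[] data = "greeting=\u4E16\u754C\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(viaReader(data, "greeting").length()); // 2 chars, as written
        System.out.println(viaStream(data, "greeting").length()); // 6 chars, one per byte
    }
}
```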

Best Practices

Always specify StandardCharsets.UTF_8 explicitly
Never rely on the platform default charset in production code
Use Files.readString/writeString with charset parameter
Validate that external data declares its encoding (HTTP Content-Type, XML declaration)
Test with non-ASCII characters (Chinese, emoji) during development
