Character Sets & Encoding
Correct character encoding is essential for reading and writing text across systems. Java represents text as Unicode internally and uses the Charset API to convert between characters and bytes.
Unicode and Java
- Java char is a 16-bit UTF-16 code unit, not necessarily a full Unicode character
- String exposes its content as a sequence of UTF-16 code units
- Supplementary characters (e.g., emoji) use two char values (surrogate pairs)
- Use codePointAt() for full Unicode characters:
String emoji = "Hello 👋";
int codePoint = emoji.codePointAt(6); // 128075 (waving hand)
int length = emoji.length(); // 8 (char units, not characters)
int charCount = emoji.codePointCount(0, emoji.length()); // 7
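The difference between char units and code points matters most when slicing strings. The following is a minimal sketch, not a definitive implementation: the class name CodePointDemo and the helper truncate are hypothetical names chosen for illustration. It uses only String.codePoints(), codePointCount() and offsetByCodePoints(), and writes the emoji as an explicit surrogate pair so the source file's own encoding does not matter.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointDemo {
    // Splits a string into Unicode code points rather than UTF-16 code units.
    static List<String> codePoints(String s) {
        List<String> result = new ArrayList<>();
        s.codePoints().forEach(cp -> result.add(new String(Character.toChars(cp))));
        return result;
    }

    // Keeps at most maxCodePoints code points without cutting a surrogate pair in half.
    static String truncate(String s, int maxCodePoints) {
        if (s.codePointCount(0, s.length()) <= maxCodePoints) {
            return s;
        }
        int end = s.offsetByCodePoints(0, maxCodePoints);
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String text = "Hello \uD83D\uDC4B!"; // U+1F44B (waving hand) as a surrogate pair
        System.out.println(text.length());            // 9 char units
        System.out.println(codePoints(text).size());  // 8 code points
        System.out.println(truncate(text, 7));        // ends with the complete emoji
        // A naive substring(0, 7) would end with a lone high surrogate:
        System.out.println(Character.isHighSurrogate(text.substring(0, 7).charAt(6))); // true
    }
}
```

Note that a code point is still not always what a user perceives as one character (combining marks and some emoji sequences span several code points); this sketch only addresses surrogate pairs.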
Charset API
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
Charset utf8 = StandardCharsets.UTF_8;
Charset iso8859 = StandardCharsets.ISO_8859_1;
// Available charsets
Charset.availableCharsets().keySet().forEach(System.out::println);
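Charset names often arrive as strings (from configuration or a protocol header), in which case Charset.forName is the lookup to use. The sketch below is illustrative: the class CharsetLookupDemo and the helper lookupOrUtf8 are hypothetical, and falling back to UTF-8 is an assumption of this example rather than JDK behavior.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

class CharsetLookupDemo {
    // Resolves a charset by name or alias (case-insensitive).
    // Falls back to UTF-8 when the name is malformed or unknown to this JVM;
    // that fallback is a choice made for this sketch.
    static Charset lookupOrUtf8(String name) {
        try {
            return Charset.forName(name);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return StandardCharsets.UTF_8;
        }
    }

    public static void main(String[] args) {
        System.out.println(lookupOrUtf8("utf8"));            // alias of UTF-8
        System.out.println(lookupOrUtf8("latin1"));          // alias of ISO-8859-1
        System.out.println(lookupOrUtf8("no-such-charset")); // unknown, falls back to UTF-8
        System.out.println(Charset.isSupported("UTF-16"));   // true on every Java platform
    }
}
```

Only the six charsets in StandardCharsets are guaranteed on every JVM; anything else should be looked up defensively like this.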
Encoding and Decoding
String text = "Hello, 世界";
// String → bytes
byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
byte[] utf16Bytes = text.getBytes(StandardCharsets.UTF_16);
// bytes → String
String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);
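To make the effect of the charset choice concrete, here is a small sketch that compares encoded sizes and shows what happens when the encoding and decoding charsets disagree. The class name EncodingSizeDemo and its helpers are hypothetical; the sample text is the same "Hello, 世界" string, written with Unicode escapes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSizeDemo {
    // Number of bytes the string occupies in the given charset.
    static int byteLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    // Encodes with one charset and decodes with another.
    static String recode(String s, Charset encodeWith, Charset decodeWith) {
        return new String(s.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        String text = "Hello, \u4E16\u754C"; // 9 chars: 7 ASCII + 2 CJK

        System.out.println(byteLength(text, StandardCharsets.UTF_8));    // 13 = 7 x 1 + 2 x 3
        System.out.println(byteLength(text, StandardCharsets.UTF_16));   // 20 = 2-byte BOM + 9 x 2
        System.out.println(byteLength(text, StandardCharsets.UTF_16BE)); // 18, no BOM

        // ISO-8859-1 cannot represent the CJK characters; getBytes substitutes '?'.
        System.out.println(recode(text, StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1));

        // Decoding UTF-8 bytes as ISO-8859-1 silently produces mojibake: 13 chars instead of 9.
        System.out.println(recode(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1).length());
    }
}
```

Neither getBytes nor the String constructor throws on bad data; both substitute replacement characters, which is why mismatched charsets tend to surface as corrupted text rather than as exceptions.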
With Charset Encoder/Decoder
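import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
ByteBuffer bytes = encoder.encode(CharBuffer.wrap("Hello")); // throws CharacterCodingException on unmappable input
CharBuffer chars = decoder.decode(bytes); // throws CharacterCodingException on malformed bytes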
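One practical reason to reach for CharsetDecoder instead of new String(bytes, charset) is strict validation: the decoder can report malformed input, whereas the String constructor silently substitutes U+FFFD. A minimal sketch, assuming a hypothetical helper isValidUtf8:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecodeDemo {
    // Returns true only if the bytes form well-formed UTF-8.
    static boolean isValidUtf8(byte[] data) {
        // REPORT is already the default for a new decoder; it is set here to make the intent explicit.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] valid = "caf\u00E9".getBytes(StandardCharsets.UTF_8);
        byte[] invalid = {(byte) 0xC3, (byte) 0x28}; // lead byte followed by a non-continuation byte

        System.out.println(isValidUtf8(valid));   // true
        System.out.println(isValidUtf8(invalid)); // false

        // The String constructor does not fail; it inserts the replacement character U+FFFD.
        System.out.println(new String(invalid, StandardCharsets.UTF_8).contains("\uFFFD")); // true
    }
}
```

CodingErrorAction.REPLACE and CodingErrorAction.IGNORE are the other two options if lossy decoding is acceptable.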
File I/O with Encoding
Path file = Path.of("data.txt");
// Write with explicit encoding
Files.writeString(file, "Hello, 世界", StandardCharsets.UTF_8);
// Read with explicit encoding
String content = Files.readString(file, StandardCharsets.UTF_8);
// BufferedReader with charset
try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
    reader.lines().forEach(System.out::println);
}
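Unlike new String(bytes, charset), the Files read methods fail when the file's bytes are not valid in the requested charset. The sketch below illustrates that with a temporary file written as ISO-8859-1 and then read back as UTF-8. The class FileCharsetDemo and the helper readOrNull are hypothetical, and returning null on a decoding failure is a simplification for the example.

```java
import java.io.IOException;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class FileCharsetDemo {
    // Returns the decoded text, or null if the bytes are not valid in the given charset.
    // Other I/O errors still propagate as IOException.
    static String readOrNull(Path file, Charset cs) throws IOException {
        try {
            return Files.readString(file, cs);
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("charset-demo", ".txt");
        try {
            // "café" in ISO-8859-1 ends with the single byte 0xE9, which is not valid UTF-8 on its own.
            Files.writeString(file, "caf\u00E9", StandardCharsets.ISO_8859_1);

            System.out.println(readOrNull(file, StandardCharsets.ISO_8859_1)); // café
            System.out.println(readOrNull(file, StandardCharsets.UTF_8));      // null
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```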
Common Charsets
| Charset | Description |
|---|---|
| UTF-8 | Default for web, Linux, macOS; variable length, ASCII-compatible |
| UTF-16 | Java internal, Windows internals |
| ISO-8859-1 | Latin-1, one byte per character |
| US-ASCII | 7-bit ASCII |
Always use UTF-8 unless you have a specific reason not to.
Default Charset Pitfall
// ❌ Uses the platform default charset, which can differ between Windows and Linux
byte[] platformBytes = text.getBytes();
String platformString = new String(platformBytes);
// ✅ Always specify the charset
byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
String utf8String = new String(utf8Bytes, StandardCharsets.UTF_8);
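Since Java 18 (JEP 400), the default charset is UTF-8 on all platforms unless it is overridden with the file.encoding system property. Explicit specification is still best practice, since the same code may run on older JVMs.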
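To check what a particular JVM will actually use, the default can be inspected at runtime. A small sketch follows; the class DefaultCharsetDemo and the method describe are hypothetical names. The native.encoding property is only available from Java 17 onward and reads as null on older releases.

```java
import java.nio.charset.Charset;

class DefaultCharsetDemo {
    // Summarizes the charset-related settings of the running JVM.
    static String describe() {
        return "default=" + Charset.defaultCharset().name()
                + ", file.encoding=" + System.getProperty("file.encoding")
                + ", native.encoding=" + System.getProperty("native.encoding");
    }

    public static void main(String[] args) {
        // Output depends on the JVM version, the OS locale and any -Dfile.encoding override.
        System.out.println(describe());
    }
}
```

Logging this once at startup can help diagnose encoding differences between development machines and servers.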
Detecting Encoding
Java does not provide built-in encoding detection. Options:
- Assume UTF-8 (most common)
- Use libraries like ICU4J or juniversalchardet
- Store encoding metadata alongside files
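When a library is not an option, a rough heuristic can cover the common cases: look for a UTF-16 byte order mark, then try strict UTF-8, then fall back to a single-byte charset. The sketch below is an assumption-laden illustration and not real detection: the class CharsetGuessDemo, the method guess and the ISO-8859-1 fallback are all choices made for this example, and it cannot tell single-byte encodings apart.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetGuessDemo {
    // Heuristic guess: UTF-16 if a BOM is present, UTF-8 if the bytes decode strictly,
    // otherwise ISO-8859-1 (an assumed fallback; every byte sequence is valid Latin-1).
    static Charset guess(byte[] data) {
        if (startsWith(data, 0xFE, 0xFF) || startsWith(data, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16; // the UTF-16 decoder reads the BOM to pick the byte order
        }
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(data));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            return StandardCharsets.ISO_8859_1;
        }
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(guess("h\u00E9llo".getBytes(StandardCharsets.UTF_8)));  // UTF-8
        System.out.println(guess("hi".getBytes(StandardCharsets.UTF_16)));         // UTF-16
        System.out.println(guess(new byte[] {0x63, 0x61, 0x66, (byte) 0xE9}));     // ISO-8859-1
    }
}
```

Strict UTF-8 validation works well as a first test because text in legacy single-byte encodings rarely happens to form valid multi-byte UTF-8 sequences, but short or ASCII-only inputs remain ambiguous.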
Properties Files
// Properties.load(InputStream) assumes ISO-8859-1; pass a Reader to load UTF-8
try (Reader reader = Files.newBufferedReader(
        Path.of("app.properties"), StandardCharsets.UTF_8)) {
    Properties props = new Properties();
    props.load(reader);
}
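The reason a Reader matters here is that Properties.load(InputStream) always decodes bytes as ISO-8859-1. The following sketch loads the same UTF-8 bytes both ways to show the difference. The class PropertiesCharsetDemo and its two helpers are hypothetical names, and in-memory byte arrays stand in for a real file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class PropertiesCharsetDemo {
    // Loads via InputStream: the bytes are interpreted as ISO-8859-1.
    static Properties loadFromStream(byte[] data) throws IOException {
        Properties props = new Properties();
        props.load(new ByteArrayInputStream(data));
        return props;
    }

    // Loads via Reader: the caller chooses the charset (UTF-8 here).
    static Properties loadFromUtf8Reader(byte[] data) throws IOException {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            props.load(reader);
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        // "greeting=你好" encoded as UTF-8; each CJK character takes three bytes.
        byte[] utf8 = "greeting=\u4F60\u597D".getBytes(StandardCharsets.UTF_8);

        System.out.println(loadFromUtf8Reader(utf8).getProperty("greeting"));       // 你好
        System.out.println(loadFromStream(utf8).getProperty("greeting").length());  // 6 mojibake chars
    }
}
```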
Best Practices
- Always specify StandardCharsets.UTF_8 explicitly
- Never rely on the platform default charset in production code
- Use Files.readString/writeString with charset parameter
- Validate that external data declares its encoding (HTTP Content-Type, XML declaration)
- Test with non-ASCII characters (Chinese, emoji) during development