Correct character encoding is essential for reading and writing text across systems. Java represents text as Unicode internally and uses the Charset API to convert between characters and bytes.

Unicode and Java

  • A Java char is a 16-bit UTF-16 code unit, not a full Unicode character
  • String exposes its content as a sequence of UTF-16 code units
  • Supplementary characters (e.g., emoji) use two char values (surrogate pairs)
  • Use codePointAt() for full Unicode characters:
String emoji = "Hello 👋";
int codePoint = emoji.codePointAt(6); // 128075 (waving hand)
int length = emoji.length();            // 8 (char units, not characters)
int charCount = emoji.codePointCount(0, emoji.length()); // 7
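
Because length() and substring() count char units, cutting a string at an arbitrary index can split a surrogate pair. The following is a minimal sketch of code-point-aware truncation. The class name CodePointDemo and the helper truncate are hypothetical names chosen for illustration, and the emoji is written with Unicode escapes so that the source compiles regardless of file encoding.

```java
class CodePointDemo {
    /** Keeps at most maxCodePoints full Unicode characters, never splitting a surrogate pair. */
    public static String truncate(String s, int maxCodePoints) {
        if (s.codePointCount(0, s.length()) <= maxCodePoints) {
            return s;
        }
        // offsetByCodePoints converts a code point count into a char index
        int endIndex = s.offsetByCodePoints(0, maxCodePoints);
        return s.substring(0, endIndex);
    }

    public static void main(String[] args) {
        String greeting = "Hello \uD83D\uDC4B!"; // "Hello ", waving hand (U+1F44B), "!"

        // Visit each code point rather than each char
        greeting.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));

        // Seven code points are kept, which occupy eight char units
        System.out.println(truncate(greeting, 7).length()); // 8
    }
}
```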
  

Charset API

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

Charset utf8 = StandardCharsets.UTF_8;
Charset iso8859 = StandardCharsets.ISO_8859_1;

// Available charsets
Charset.availableCharsets().keySet().forEach(System.out::println);
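
Charsets outside StandardCharsets are looked up by name, and the lookup throws if the name is illegal or the charset is not installed. A possible way to handle this is sketched below. The class CharsetLookup and the method charsetOrUtf8 are hypothetical, and falling back to UTF-8 is an assumption that may not suit every application.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetLookup {
    /** Returns the named charset, or UTF-8 if the name is illegal or unsupported. */
    public static Charset charsetOrUtf8(String name) {
        try {
            return Charset.forName(name); // names are case-insensitive
        } catch (IllegalArgumentException e) {
            // Covers IllegalCharsetNameException and UnsupportedCharsetException
            return StandardCharsets.UTF_8;
        }
    }

    public static void main(String[] args) {
        System.out.println(charsetOrUtf8("iso-8859-1"));      // ISO-8859-1
        System.out.println(charsetOrUtf8("no-such-charset")); // UTF-8
    }
}
```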
  

Encoding and Decoding

String text = "Hello, 世界";

// String → bytes
byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
byte[] utf16Bytes = text.getBytes(StandardCharsets.UTF_16);

// bytes → String
String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);
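
The same string occupies a different number of bytes in each encoding, which matters when sizing buffers or protocol fields. The sketch below compares byte lengths for the sample text. EncodedLengths and byteLength are hypothetical names, and the two CJK characters are written as Unicode escapes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedLengths {
    /** Number of bytes the string occupies in the given charset. */
    public static int byteLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String text = "Hello, \u4E16\u754C"; // 7 ASCII chars followed by 2 CJK chars

        System.out.println(byteLength(text, StandardCharsets.UTF_8));    // 13: 7 x 1 byte + 2 x 3 bytes
        System.out.println(byteLength(text, StandardCharsets.UTF_16));   // 20: 2-byte BOM + 9 x 2 bytes
        System.out.println(byteLength(text, StandardCharsets.UTF_16BE)); // 18: no BOM
    }
}
```

Note that StandardCharsets.UTF_16 prepends a big-endian byte-order mark when encoding, while UTF_16BE and UTF_16LE do not.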
  

With Charset Encoder/Decoder

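import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.*;

CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();

// encode() and decode() throw CharacterCodingException on malformed or unmappable input
ByteBuffer bytes = encoder.encode(CharBuffer.wrap("Hello"));
CharBuffer chars = decoder.decode(bytes);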
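
One practical reason to reach for a CharsetDecoder is strictness: new String(bytes, charset) silently replaces bad input with U+FFFD, whereas a decoder can report it. Below is a minimal sketch of a UTF-8 validity check built on that behavior. StrictDecode and isValidUtf8 are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecode {
    /** Returns true only if data is well-formed UTF-8. */
    public static boolean isValidUtf8(byte[] data) {
        // REPORT is already the default for a new decoder; it is set here for clarity
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 0xC3 starts a 2-byte sequence, but 0x28 is not a valid continuation byte
        byte[] broken = {(byte) 0xC3, (byte) 0x28};

        System.out.println(isValidUtf8(broken)); // false

        // The String constructor never throws; it substitutes U+FFFD for the bad byte
        String lenient = new String(broken, StandardCharsets.UTF_8);
        System.out.println(lenient.indexOf('\uFFFD') >= 0); // true
    }
}
```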
  

File I/O with Encoding

Path file = Path.of("data.txt");

// Write with explicit encoding (Files.writeString and readString require Java 11+)
Files.writeString(file, "Hello, 世界", StandardCharsets.UTF_8);

// Read with explicit encoding
String content = Files.readString(file, StandardCharsets.UTF_8);

// BufferedReader with charset
try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
    reader.lines().forEach(System.out::println);
}
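
The writing side has a matching API in Files.newBufferedWriter. The sketch below writes lines to a temporary file as UTF-8 and reads them back with the same charset. Utf8FileRoundTrip and roundTrip are hypothetical names, and the temporary file prefix is arbitrary.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

class Utf8FileRoundTrip {
    /** Writes the lines to a temp file as UTF-8, then reads them back as UTF-8. */
    public static List<String> roundTrip(List<String> lines) {
        try {
            Path file = Files.createTempFile("encoding-demo", ".txt");
            try {
                try (BufferedWriter writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
                    for (String line : lines) {
                        writer.write(line);
                        writer.newLine();
                    }
                }
                return Files.readAllLines(file, StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> original = Arrays.asList("Hello", "\u4E16\u754C");
        System.out.println(roundTrip(original).equals(original)); // true
    }
}
```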
  

Common Charsets

Charset     Description
UTF-8       Default for web, Linux, macOS; variable length, ASCII-compatible
UTF-16      Java internal, Windows internals
ISO-8859-1  Latin-1, one byte per character
US-ASCII    7-bit ASCII

Always use UTF-8 unless you have a specific reason not to.
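
One concrete reason is coverage: single-byte charsets cannot represent most Unicode text, and getBytes() replaces anything unmappable with '?' without raising an error. The following sketch checks coverage up front with CharsetEncoder.canEncode. CharsetCoverage and fits are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetCoverage {
    /** Returns true if every character of text can be encoded in the given charset. */
    public static boolean fits(String text, Charset cs) {
        return cs.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String cafe = "caf\u00E9";      // ends in e with acute accent
        String world = "\u4E16\u754C";  // two CJK characters

        System.out.println(fits(cafe, StandardCharsets.US_ASCII));    // false
        System.out.println(fits(cafe, StandardCharsets.ISO_8859_1));  // true
        System.out.println(fits(world, StandardCharsets.ISO_8859_1)); // false
        System.out.println(fits(world, StandardCharsets.UTF_8));      // true

        // Lossy: each unmappable character becomes a single '?' byte (63)
        byte[] lossy = world.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(lossy.length + " bytes, first = " + lossy[0]); // 2 bytes, first = 63
    }
}
```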

Default Charset Pitfall

// ❌ Uses the platform default charset, which can differ between Windows and Linux
byte[] defaultBytes = text.getBytes();
String fromDefault = new String(defaultBytes);

// ✅ Always specify the charset
byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
String fromUtf8 = new String(utf8Bytes, StandardCharsets.UTF_8);
  

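Since Java 18 (JEP 400), the default charset is UTF-8 on all platforms unless it is overridden, for example with -Dfile.encoding=COMPAT. Explicit specification is still best practice, because the same code may run on older JVMs or under such an override.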
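
To see what a particular JVM actually uses, the relevant settings can be printed at startup. This is a small diagnostic sketch. DefaultCharsetInfo and describe are hypothetical names, and the native.encoding property is only defined on Java 17 and later, so it prints null on older versions.

```java
import java.nio.charset.Charset;

class DefaultCharsetInfo {
    /** Summarizes the charset-related settings of the running JVM. */
    public static String describe() {
        return "default=" + Charset.defaultCharset()
                + ", file.encoding=" + System.getProperty("file.encoding")
                + ", native.encoding=" + System.getProperty("native.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```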

Detecting Encoding

Java does not provide built-in encoding detection. Options:

  • Assume UTF-8 (most common)
  • Use libraries like ICU4J or juniversalchardet
  • Store encoding metadata alongside files
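
When no library or metadata is available, a rough heuristic is sometimes enough: look for a byte-order mark, then check whether the bytes are well-formed UTF-8, and otherwise fall back to a caller-supplied charset. The sketch below illustrates that idea only. EncodingGuess and guess are hypothetical names, the result is a guess rather than a detection, and UTF-32 byte-order marks are deliberately ignored.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingGuess {
    /** Guesses a charset from a BOM or UTF-8 well-formedness, else returns fallback. */
    public static Charset guess(byte[] data, Charset fallback) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(data, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(data, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        try {
            // A new decoder reports malformed input by throwing
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(data));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            return fallback;
        }
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] latin1 = {'c', 'a', 'f', (byte) 0xE9}; // 0xE9 alone is not valid UTF-8
        System.out.println(guess(latin1, StandardCharsets.ISO_8859_1)); // ISO-8859-1
    }
}
```

Pure ASCII input is reported as UTF-8 here, which is harmless because ASCII is a subset of UTF-8.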

Properties Files

// Properties.load(InputStream) assumes ISO-8859-1; pass a Reader to load UTF-8
// (Path.of requires Java 11+)
try (Reader reader = Files.newBufferedReader(
        Path.of("app.properties"), StandardCharsets.UTF_8)) {
    Properties props = new Properties();
    props.load(reader);
}
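
The difference between the Reader and InputStream overloads is easy to demonstrate in memory. In the sketch below, a property is stored as UTF-8 and then loaded both ways. Utf8Properties and its three helper methods are hypothetical names, and the key "greeting" is an arbitrary example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class Utf8Properties {
    /** Serializes the properties as UTF-8 text (store(Writer) does not escape non-ASCII). */
    public static byte[] storeUtf8(Properties props) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            props.store(writer, null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Loads through a Reader, so the bytes are decoded as UTF-8. */
    public static Properties loadUtf8(byte[] data) {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Loads through an InputStream, which Properties decodes as ISO-8859-1. */
    public static Properties loadLatin1(byte[] data) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties original = new Properties();
        original.setProperty("greeting", "\u4E16\u754C");
        byte[] data = storeUtf8(original);

        System.out.println(loadUtf8(data).getProperty("greeting").length());   // 2 chars, intact
        System.out.println(loadLatin1(data).getProperty("greeting").length()); // 6 chars of mojibake
    }
}
```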
  

Best Practices

  • Always specify StandardCharsets.UTF_8 explicitly
  • Never rely on the platform default charset in production code
  • Use Files.readString/writeString with charset parameter
  • Check the encoding that external data declares (HTTP Content-Type, XML declaration) instead of guessing it
  • Test with non-ASCII characters (Chinese, emoji) during development