Fixing CSV Mojibake — Shift_JIS, UTF-8 and the BOM

Most developers have opened a CSV file and found garbled characters. With CSV files that contain Japanese text, the usual cause is a mismatch in character encoding. This article explains why the corruption happens and how to fix it reliably.

Character Encoding Basics: `UTF-8` vs `Shift_JIS`

The character encodings that cause problems in Japanese CSV files are mainly the following two:

UTF-8

UTF-8 is the current web standard: a Unicode encoding that can represent characters from writing systems worldwide. It is the default on Linux, macOS, and in modern web applications. It is a variable-length encoding that uses 1 to 4 bytes per character, with ASCII characters taking a single byte.
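The variable-length behaviour is easy to observe directly. The following is a minimal sketch; the sample characters are arbitrary choices for illustration and do not come from any particular CSV file.

```python
# Show how many bytes UTF-8 uses for characters from different Unicode ranges.
samples = ["A", "Ω", "名", "😀"]  # ASCII, Greek, kanji, emoji (arbitrary samples)

byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in byte_lengths.items():
    print(f"{ch!r}: {n} byte(s)")
```

Japanese text in UTF-8 typically takes three bytes per character, which is one reason the same file looks so different when its bytes are read as a two-byte encoding such as Shift_JIS.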

Shift_JIS (CP932)

A Japanese character encoding used in legacy Windows applications and older versions of Excel. Strictly speaking, Windows uses CP932, Microsoft's extension of Shift_JIS. Some platform-dependent characters (circled numbers, Roman numerals, and so on) exist only in CP932 and cannot be encoded in standard Shift_JIS.
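The difference between the two can be checked in Python, whose `shift_jis` codec follows the JIS standard while `cp932` includes Microsoft's extensions. This is a small sketch; the circled digit is just one example of a CP932-only character.

```python
# Encode a CP932-only character with both codecs.
text = "①"  # circled digit one, one of the CP932 extension characters

cp932_bytes = text.encode("cp932")

try:
    text.encode("shift_jis")
    strict_sjis_ok = True
except UnicodeEncodeError:
    # The standard Shift_JIS codec has no mapping for this character.
    strict_sjis_ok = False

print(cp932_bytes.hex(), strict_sjis_ok)
```

When reading or writing files that came from Windows or Excel, naming `cp932` rather than `shift_jis` avoids errors on these characters.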

Why Character Corruption Occurs

The root cause of character corruption is simple: the character encoding used when writing the file does not match the encoding used when reading it.

For example, when a web application exports a CSV in UTF-8 and a user opens it in Excel, the characters become garbled. This is because Excel interprets the file as Shift_JIS (CP932) by default in Japanese environments.
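The mismatch can be reproduced in a few lines. The sketch below encodes a string as UTF-8 and then decodes the same bytes as CP932, which approximates what happens when Excel opens the exported file; the sample text is an arbitrary choice.

```python
# Reproduce mojibake: write as UTF-8, read back as CP932.
original = "名前"
raw = original.encode("utf-8")  # bytes as the web application wrote them

# Decoding with the wrong encoding yields unrelated kanji (or replacement marks).
garbled = raw.decode("cp932", errors="replace")

# Decoding with the encoding that was used for writing restores the text.
correct = raw.decode("utf-8")

print(garbled)
print(correct)
```

The bytes on disk never change. Only the interpretation differs, so the fix is always to make the reader and the writer agree on the encoding.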

Role of BOM (Byte Order Mark)

A BOM is a marker of a few bytes placed at the very beginning of a file to indicate its character encoding. The UTF-8 BOM is 3 bytes: 0xEF 0xBB 0xBF.
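Checking for these three bytes is straightforward. The helper below is a hypothetical name introduced for illustration; it relies on the standard library constant `codecs.BOM_UTF8`.

```python
import codecs


def has_utf8_bom(raw: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 BOM (EF BB BF)."""
    return raw.startswith(codecs.BOM_UTF8)


body = "名前,部署\r\n".encode("utf-8")
with_bom = codecs.BOM_UTF8 + body
without_bom = body

print(has_utf8_bom(with_bom), has_utf8_bom(without_bom))
```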

The important point is that Excel relies on the BOM to recognize a CSV file as UTF-8 when the file is opened directly, for example by double-clicking it. Without a BOM, Excel in a Japanese environment interprets the file as Shift_JIS (CP932), and the characters are garbled.

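<?php
// Example: output a UTF-8 CSV with a BOM from PHP
header('Content-Type: text/csv; charset=UTF-8');
header('Content-Disposition: attachment; filename="data.csv"');

// Write the UTF-8 BOM (EF BB BF) before any CSV data
echo "\xEF\xBB\xBF";

// Stream the rows directly to the response body.
// Note: fputcsv() ends rows with "\n" by default; PHP 8.1+ accepts an eol argument.
$fp = fopen('php://output', 'w');
fputcsv($fp, ['名前', 'メール', '部署']);
fputcsv($fp, ['田中太郎', 'tanaka@example.com', '開発部']);
fclose($fp);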

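# Example: write a UTF-8 CSV with a BOM from Python
import csv

# The 'utf-8-sig' codec writes the BOM automatically; newline='' lets the
# csv module control line endings (CRLF by default).
with open('data.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['名前', 'メール', '部署'])
    writer.writerow(['田中太郎', 'tanaka@example.com', '開発部'])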

The BOM has a downside, however. Some programs and shell scripts treat it as stray characters at the start of the file, which can cause errors. For CSV files returned in API responses or consumed by programs, it is safer to omit the BOM.
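A typical symptom is a first column name that silently carries an invisible U+FEFF character. The sketch below shows the effect when parsing the same bytes with the `utf-8` and `utf-8-sig` codecs; `read_header` is a hypothetical helper and the sample data is made up for illustration.

```python
import csv
import io

# CSV bytes that start with a UTF-8 BOM (sample data for illustration).
raw = b"\xef\xbb\xbfname,email\r\ntanaka,tanaka@example.com\r\n"


def read_header(data: bytes, encoding: str) -> list[str]:
    """Decode CSV bytes with the given encoding and return the first row."""
    text = io.StringIO(data.decode(encoding), newline="")
    return next(csv.reader(text))


plain = read_header(raw, "utf-8")         # BOM survives as U+FEFF in the first field
stripped = read_header(raw, "utf-8-sig")  # the codec consumes the BOM

print(repr(plain[0]), repr(stripped[0]))
```

Reading with `utf-8-sig` is a reasonable defensive default in Python, since it accepts UTF-8 input both with and without a BOM.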

Line Ending Issues

Alongside garbled characters, differences in line endings are another common source of CSV problems.

CRLF (\r\n) — Windows standard
LF (\n) — Linux / macOS standard
CR (\r) — Legacy Mac OS (rarely used today)

RFC 4180 specifies that CSV line breaks should be CRLF. However, in practice, many tools can handle CSV files with only LF without issues. Problems are more likely to occur with Excel and some legacy systems.
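In Python, the line ending written by the `csv` module is controlled by the `lineterminator` option, which defaults to CRLF. The following sketch renders the same rows both ways; `to_csv_text` is a hypothetical helper and the rows are sample data.

```python
import csv
import io

rows = [["name", "dept"], ["tanaka", "dev"]]  # sample data


def to_csv_text(rows, lineterminator):
    """Render rows as CSV text using the given line terminator."""
    buf = io.StringIO(newline="")  # newline="" prevents any newline translation
    csv.writer(buf, lineterminator=lineterminator).writerows(rows)
    return buf.getvalue()


crlf_text = to_csv_text(rows, "\r\n")  # RFC 4180 style
lf_text = to_csv_text(rows, "\n")      # Unix style

print(repr(crlf_text))
print(repr(lf_text))
```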

Command-line tools are a convenient way to check a file's encoding and line endings.

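# Check the encoding and line endings with the file command
file data.csv
# Example output (wording varies by version): data.csv: UTF-8 Unicode (with BOM) text, with CRLF line terminators

# Inspect the first bytes with xxd; a file with a UTF-8 BOM starts with ef bb bf
xxd data.csv | head -3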
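Where these commands are not available, a similar check can be scripted. The function below is a rough sketch under a hypothetical name. It counts raw byte sequences, so line breaks inside quoted fields are counted as well.

```python
def count_line_endings(raw: bytes) -> dict[str, int]:
    """Count CRLF, bare LF, and bare CR sequences in a byte string."""
    crlf = raw.count(b"\r\n")
    return {
        "CRLF": crlf,
        "LF": raw.count(b"\n") - crlf,  # LF not preceded by CR
        "CR": raw.count(b"\r") - crlf,  # CR not followed by LF
    }


sample = b"a,b\r\nc,d\ne,f\r"  # deliberately mixed line endings
counts = count_line_endings(sample)
print(counts)
```

A file with more than one non-zero count has mixed line endings, which is often a sign that it was edited on several platforms.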

Practical Solution Chart

The table below summarizes the recommended settings for each use of a CSV file.

Use Case	Character Encoding	BOM	Line Ending
Open in Excel	UTF-8	Yes	CRLF
Process by program	UTF-8	No	LF
Legacy system integration	Shift_JIS (CP932)	No	CRLF
API response	UTF-8	No	LF
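The table can be expressed as a small lookup that an export function consults. This is only a sketch: the profile names, the `PROFILES` mapping, and `render_csv` are hypothetical names introduced here, not part of any library.

```python
import csv
import io

# Hypothetical profiles mirroring the table: (Python codec, line terminator).
# "utf-8-sig" prepends the BOM when encoding; "utf-8" does not.
PROFILES = {
    "excel": ("utf-8-sig", "\r\n"),
    "program": ("utf-8", "\n"),
    "legacy": ("cp932", "\r\n"),
    "api": ("utf-8", "\n"),
}


def render_csv(rows, profile: str) -> bytes:
    """Serialize rows to CSV bytes using the settings of the given profile."""
    encoding, lineterminator = PROFILES[profile]
    buf = io.StringIO(newline="")
    csv.writer(buf, lineterminator=lineterminator).writerows(rows)
    return buf.getvalue().encode(encoding)


rows = [["名前", "部署"], ["田中太郎", "開発部"]]
excel_bytes = render_csv(rows, "excel")
api_bytes = render_csv(rows, "api")
legacy_bytes = render_csv(rows, "legacy")
```

Keeping the settings in one place makes it harder for an export endpoint to mix, for example, a BOM with LF line endings by accident.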

Test CSV Files

To verify that your application correctly handles CSV files with various character encodings, use the test files available in DevLab.

CSV Test Files by Character Encoding — UTF-8 (with/without BOM), Shift_JIS, EUC-JP, and more
Test Files by Newline Code — Various CRLF, LF, and CR files
CSV test files list — CSV files of various sizes

Summary

CSV character encoding issues can be reliably resolved by correctly understanding three key elements: character encoding, BOM (Byte Order Mark), and line endings. For Excel, the standard is UTF-8 with BOM using CRLF line endings; for programmatic processing, use UTF-8 without BOM using LF line endings. Using DevLab's various test files, verify that your application correctly handles each pattern.

❓ Frequently Asked Questions

Why does a CSV file turn into mojibake when opened in Excel?

In a Japanese environment Excel assumes CSV files are Shift_JIS (CP932). To make Excel read a UTF-8 CSV correctly, prefix the file with a BOM — the bytes 0xEF 0xBB 0xBF.

Should a CSV file use UTF-8 or Shift_JIS?

It depends on the consumer. For Excel, use UTF-8 with a BOM and CRLF line endings. For programmatic processing or API responses, use UTF-8 without a BOM and LF. For legacy system integrations, Shift_JIS (CP932) with CRLF is still the right choice.

What is a BOM, and should a CSV have one?

A BOM is a short marker at the start of a file that signals its encoding; for UTF-8 it is the three bytes 0xEF 0xBB 0xBF. Excel needs it to display a UTF-8 CSV correctly, but leave it off for CSVs consumed by programs or APIs — some tools treat the BOM as stray characters.
