Completely Solve CSV Character Encoding Issues! Fundamental Knowledge of Character Codes, BOM, and Line Endings
Most developers have opened a CSV file only to find it full of garbled characters. CSV files containing Japanese text are especially prone to this because of differences in character encoding. This article explains the root causes of character corruption and presents solutions that resolve it reliably.
Character Encoding Basics: <code>UTF-8</code> vs <code>Shift_JIS</code>
The character encodings that cause problems in Japanese CSV files are mainly the following two:
UTF-8
UTF-8 is the current web standard: a Unicode-based encoding that can represent characters from writing systems worldwide. It is the default on Linux, macOS, and in modern web applications. It is a variable-length encoding that uses 1 to 4 bytes per character, with ASCII characters taking a single byte.
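To make the variable-length behavior concrete, the following minimal sketch encodes one character from each length class and counts the resulting bytes. The sample characters are arbitrary choices for illustration.

```python
# One sample character per UTF-8 length class (chosen for illustration only).
samples = [
    "A",           # ASCII letter
    "\u00e9",      # é, Latin small letter e with acute
    "名",          # a kanji from the Basic Multilingual Plane
    "\U0001F600",  # an emoji outside the Basic Multilingual Plane
]

# Map each character to the number of bytes its UTF-8 encoding occupies.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in byte_lengths.items():
    print(f"U+{ord(ch):04X}: {n} byte(s)")
```

Japanese text is dominated by 3-byte characters, which is why a UTF-8 CSV is usually larger than the same data in Shift_JIS, where most kanji and kana take 2 bytes.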
Shift_JIS (CP932)
A Japanese character encoding used by legacy Windows applications and older versions of Excel. Strictly speaking, Windows uses CP932, Microsoft's extension of Shift_JIS, and some platform-dependent characters (circled numbers, Roman numerals, etc.) exist only in CP932, not in standard Shift_JIS.
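The difference can be observed with Python's built-in codecs, which ship separate `shift_jis` and `cp932` codecs. This is a small sketch assuming CPython's standard codec tables; other languages and libraries may name or map these encodings differently.

```python
circled_one = "\u2460"  # ①, a typical platform-dependent character

# CP932 includes the vendor extension area, so this character encodes fine.
cp932_bytes = circled_one.encode("cp932")

# The strict shift_jis codec has no mapping for it and raises an error.
try:
    circled_one.encode("shift_jis")
    shift_jis_ok = True
except UnicodeEncodeError:
    shift_jis_ok = False

print("cp932 bytes:", cp932_bytes.hex())
print("encodable in shift_jis:", shift_jis_ok)
```

When reading or writing files meant for Windows or Excel, naming the codec `cp932` rather than `shift_jis` avoids this class of error.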
Why Character Corruption Occurs
The root cause of character corruption is simple: the character encoding used when writing the file does not match the encoding used when reading it.
For example, when a web application exports a CSV in UTF-8 and a user opens it in Excel, the characters become garbled. This is because Excel interprets the file as Shift_JIS (CP932) by default in Japanese environments.
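The mismatch is easy to reproduce. The sketch below writes text as UTF-8 bytes and then decodes the same bytes as CP932, mimicking what happens in the Excel scenario; the sample string is arbitrary.

```python
original = "名前,メール"

# The "writer" side: a web application exporting UTF-8.
utf8_bytes = original.encode("utf-8")

# The "reader" side guesses CP932; errors="replace" keeps decoding going
# even where the byte sequence is not valid CP932.
garbled = utf8_bytes.decode("cp932", errors="replace")

# Decoding with the encoding that was actually used restores the text.
restored = utf8_bytes.decode("utf-8")

print("as CP932:", garbled)
print("as UTF-8:", restored)
```

The bytes on disk are identical in both cases; only the reader's assumption changes.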
Role of BOM (Byte Order Mark)
A BOM is a marker of a few bytes placed at the very beginning of a file to indicate its character encoding. The UTF-8 BOM is 3 bytes: <code>0xEF 0xBB 0xBF</code>.
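As a small illustration, the check below tests whether a byte string starts with those three bytes. The helper name `has_utf8_bom` is hypothetical; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of Python's standard library.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data begins with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


# "utf-8-sig" prepends the BOM when encoding; plain "utf-8" does not.
with_bom = "名前".encode("utf-8-sig")
without_bom = "名前".encode("utf-8")

print(has_utf8_bom(with_bom), has_utf8_bom(without_bom))
```

For a file on disk, reading only the first three bytes in binary mode is enough to run the same check.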
The important point is that <strong>a BOM is required to correctly open UTF-8 CSV files in Excel</strong>. When a UTF-8 CSV without a BOM is opened directly in Excel in a Japanese environment, Excel interprets it as Shift_JIS (CP932) and the text comes out garbled.
// Example: outputting a UTF-8 CSV with a BOM in PHP
header('Content-Type: text/csv; charset=UTF-8');
header('Content-Disposition: attachment; filename="data.csv"');
// Output the BOM before any CSV data
echo "\xEF\xBB\xBF";
$fp = fopen('php://output', 'w');
fputcsv($fp, ['名前', 'メール', '部署']);
fputcsv($fp, ['田中太郎', 'tanaka@example.com', '開発部']);
fclose($fp);
# Example: writing a UTF-8 CSV with a BOM in Python
import csv
# 'utf-8-sig' prepends the BOM; newline='' lets the csv module control line endings
with open('data.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['名前', 'メール', '部署'])
    writer.writerow(['田中太郎', 'tanaka@example.com', '開発部'])
The BOM has drawbacks, however. Some programs and shell scripts do not expect it and treat it as stray or invalid data at the start of the file, which can cause errors. For CSV files returned in API responses or consumed by other programs, it is safer to omit the BOM.
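On the reading side, a program can be made tolerant of files that arrive with a BOM anyway. The sketch below shows the typical symptom, where the BOM ends up glued to the first header name, and one way to avoid it with Python's `utf-8-sig` codec, which also accepts input without a BOM. The helper `read_header` and the sample data are illustrative assumptions.

```python
import csv
import io

# CSV bytes that start with a BOM, as an Excel-oriented export would produce.
raw = "\ufeffname,email\r\nTanaka,tanaka@example.com\r\n".encode("utf-8")


def read_header(data: bytes, encoding: str) -> list[str]:
    """Decode the bytes with the given codec and return the first CSV row."""
    text = io.StringIO(data.decode(encoding), newline="")
    return next(csv.reader(text))


naive = read_header(raw, "utf-8")     # BOM survives as U+FEFF in the first field
safe = read_header(raw, "utf-8-sig")  # BOM is stripped during decoding

print(repr(naive[0]), repr(safe[0]))
```

A lookup such as `row["name"]` fails silently in the naive case because the actual key is `"\ufeffname"`, which is visually indistinguishable in most editors.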
Line Break Code Issues
Alongside garbled characters, differences in line endings are another common source of CSV problems.
- <strong>CRLF</strong> (<code>\r\n</code>) — Windows standard
- <strong>LF</strong> (<code>\n</code>) — Linux / macOS standard
- <strong>CR</strong> (<code>\r</code>) — Legacy Mac OS (rarely used today)
RFC 4180 specifies that CSV line breaks should be CRLF. However, in practice, many tools can handle CSV files with only LF without issues. Problems are more likely to occur with Excel and some legacy systems.
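When generating CSV in Python, the line ending is controlled by the writer's `lineterminator` parameter, which defaults to CRLF for the standard `excel` dialect. The following sketch renders the same rows with both endings; the `render` helper and the sample rows are assumptions made for illustration.

```python
import csv
import io


def render(rows, lineterminator):
    """Serialize rows to CSV text using the given line terminator."""
    buf = io.StringIO(newline="")  # newline="" disables newline translation
    csv.writer(buf, lineterminator=lineterminator).writerows(rows)
    return buf.getvalue()


rows = [["id", "name"], ["1", "Tanaka"]]

crlf_csv = render(rows, "\r\n")  # RFC 4180 style
lf_csv = render(rows, "\n")      # common for Unix-side processing

print(repr(crlf_csv))
print(repr(lf_csv))
```

When writing to a real file, opening it with `newline=''` keeps the platform from translating the terminator a second time.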
Command-line tools are convenient for checking line break codes.
# Check with the file command
file data.csv
# Example output: data.csv: UTF-8 Unicode (with BOM) text, with CRLF line terminators
# Inspect the leading bytes with xxd (to see whether a BOM is present)
xxd data.csv | head -3
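Where these commands are not available, a similar check can be scripted. The following is a rough sketch that counts each line ending type in raw bytes; the function name `count_line_endings` is hypothetical, and the count also includes line breaks embedded inside quoted fields.

```python
def count_line_endings(data: bytes) -> dict[str, int]:
    """Count CRLF pairs, lone LF bytes, and lone CR bytes in the data."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "LF": data.count(b"\n") - crlf,  # LF bytes not preceded by CR
        "CR": data.count(b"\r") - crlf,  # CR bytes not followed by LF
    }


# A deliberately mixed sample: two CRLF-terminated rows and one LF-terminated row.
sample = b"a,b\r\n1,2\r\n3,4\n"
counts = count_line_endings(sample)
print(counts)
```

For a file, pass the result of `open(path, 'rb').read()`; more than one nonzero count points to mixed line endings, which is worth normalizing before import.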
Practical Solution Chart
Below is a summary of optimal settings depending on the intended use of CSV.
| Use case | Character encoding | BOM | Line endings |
|---|---|---|---|
| Open in Excel | UTF-8 | Yes | CRLF |
| Process by program | UTF-8 | No | LF |
| Legacy system integration | Shift_JIS (CP932) | No | CRLF |
| API response | UTF-8 | No | LF |
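One way to apply the chart in code is to keep the settings in a single lookup table and select a profile per export. This is only a sketch: the profile keys and the `to_csv_bytes` helper are hypothetical names, and `utf-8-sig` is Python's codec name for UTF-8 with a BOM.

```python
import csv
import io

# Use case -> (Python codec name, line terminator), following the chart above.
PROFILES = {
    "excel": ("utf-8-sig", "\r\n"),  # UTF-8 with BOM, CRLF
    "program": ("utf-8", "\n"),      # UTF-8 without BOM, LF
    "legacy": ("cp932", "\r\n"),     # Shift_JIS (CP932), CRLF
    "api": ("utf-8", "\n"),          # UTF-8 without BOM, LF
}


def to_csv_bytes(rows, use_case: str) -> bytes:
    """Serialize rows to CSV bytes using the profile for the given use case."""
    encoding, terminator = PROFILES[use_case]
    buf = io.StringIO(newline="")
    csv.writer(buf, lineterminator=terminator).writerows(rows)
    return buf.getvalue().encode(encoding)


rows = [["名前", "部署"], ["田中太郎", "開発部"]]
excel_bytes = to_csv_bytes(rows, "excel")
api_bytes = to_csv_bytes(rows, "api")
print(excel_bytes[:3].hex(), len(excel_bytes), len(api_bytes))
```

For the legacy profile, `encode` raises `UnicodeEncodeError` if a value contains characters outside CP932, so a real exporter would need a policy for such characters.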
Test CSV Files
To verify that your application correctly handles CSV files with various character encodings, use the test files available in DevLab.
- <a href="/ja/files/encoding/">CSV Test Files by Character Encoding</a> — UTF-8 (with/without BOM), Shift_JIS, EUC-JP, and more
- <a href="/ja/files/newline/">Test Files by Newline Code</a> — Various CRLF, LF, and CR files
- <a href="/ja/files/csv/">CSV test files list</a> — CSV files of various sizes
Summary
CSV character encoding issues can be reliably resolved by correctly understanding three key elements: character encoding, BOM (Byte Order Mark), and line endings. For Excel, the standard is UTF-8 with BOM using CRLF line endings; for programmatic processing, use UTF-8 without BOM using LF line endings. Using DevLab's various test files, verify that your application correctly handles each pattern.