Solving CSV Character Encoding Issues for Good: Fundamentals of Character Encodings, BOM, and Line Endings

Category: Data Processing

Most developers have opened a CSV file only to find it full of garbled characters. CSV files containing Japanese text are especially prone to this, because the encoding used to write the file often differs from the one used to read it. This article explains the underlying causes of character corruption and the settings that reliably prevent it.

Character Encoding Basics: <code>UTF-8</code> vs <code>Shift_JIS</code>

The character encodings that cause problems in Japanese CSV files are mainly the following two:

UTF-8

The current web standard and a Unicode-based character encoding capable of handling characters worldwide. UTF-8 is the default in Linux, macOS, and modern web applications. It uses variable-length encoding of 1 to 4 bytes per character, with ASCII characters represented as a single byte.
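
As a small illustration of the variable-length property described above, the following Python sketch measures how many bytes a few characters occupy in UTF-8. The sample characters and the helper name `utf8_length` are chosen here for illustration only.

```python
# Sample characters mapped to the UTF-8 byte length we expect for each.
SAMPLES = {
    "A": 1,           # ASCII letter
    "\u00a5": 2,      # yen sign
    "\u540d": 3,      # the kanji used in "名前" (name)
    "\U0001F600": 4,  # an emoji outside the Basic Multilingual Plane
}


def utf8_length(ch: str) -> int:
    """Return the number of bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# Map each sample character to its measured byte length.
lengths = {ch: utf8_length(ch) for ch in SAMPLES}
```

The practical consequence is that a CSV's size in bytes is not its length in characters: typical Japanese text takes 3 bytes per character in UTF-8.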

Shift_JIS (CP932)

A Japanese character encoding used in legacy Windows applications and older versions of Excel. To be precise, CP932 (Microsoft's extension of Shift_JIS) is used in Windows environments, and some platform-dependent characters (circled numbers, Roman numerals, etc.) can only be handled by CP932.
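
The difference between the two can be observed with Python's built-in codecs, where `shift_jis` is the stricter codec and `cp932` is Microsoft's extension. This is a minimal sketch; the helper name `can_encode` is hypothetical, and other languages or libraries may map these names differently.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text can be encoded with encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


CIRCLED_ONE = "\u2460"  # circled digit one
ROMAN_ONE = "\u2160"    # Roman numeral one

# Platform-dependent characters encode under cp932 but not under shift_jis.
results = {
    "cp932": can_encode(CIRCLED_ONE + ROMAN_ONE, "cp932"),
    "shift_jis": can_encode(CIRCLED_ONE + ROMAN_ONE, "shift_jis"),
}
```

When the target is a Windows application, choosing `cp932` rather than `shift_jis` avoids encode errors on such characters.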

Why Character Corruption Occurs

The root cause of character corruption is simple: the character encoding used when writing the file does not match the encoding used when reading it.

For example, when a web application exports a CSV in UTF-8 and a user opens it in Excel, the characters become garbled. This is because Excel interprets the file as Shift_JIS (CP932) by default in Japanese environments.
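
The mismatch can be reproduced in a few lines of Python. This sketch encodes a sample name as UTF-8 and then decodes the same bytes as CP932, which is roughly what happens when the reader assumes the wrong encoding. The variable names are illustrative.

```python
original = "田中太郎"

# The writer saves the text as UTF-8.
utf8_bytes = original.encode("utf-8")

# The reader wrongly assumes CP932; errors="replace" keeps decoding from raising.
garbled = utf8_bytes.decode("cp932", errors="replace")

# Decoding with the encoding that was actually used restores the text.
restored = utf8_bytes.decode("utf-8")
```

The bytes themselves are undamaged. Only the interpretation is wrong, which is why re-reading with the correct encoding recovers the text.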

Role of BOM (Byte Order Mark)

BOM is a marker of several bytes added to the beginning of a file that indicates the character encoding. The UTF-8 BOM is 3 bytes: <code>0xEF 0xBB 0xBF</code>.
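
To check for these three bytes programmatically, a sketch along the following lines may help. It relies on the `codecs.BOM_UTF8` constant from the Python standard library; the function names are hypothetical.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def add_utf8_bom(data: bytes) -> bytes:
    """Prepend the UTF-8 BOM unless the data already starts with one."""
    return data if has_utf8_bom(data) else codecs.BOM_UTF8 + data
```

Checking first keeps `add_utf8_bom` idempotent, so calling it twice does not produce a double BOM.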

The important point is that <strong>a BOM is required to correctly open UTF-8 CSV files in Excel</strong>. When a UTF-8 CSV without a BOM is opened directly in Excel (for example, by double-clicking it), Excel interprets it as Shift_JIS (CP932), resulting in garbled characters.

<?php
// Example: output a UTF-8 CSV with a BOM in PHP
header('Content-Type: text/csv; charset=UTF-8');
header('Content-Disposition: attachment; filename="data.csv"');

// Write the BOM before any CSV data
echo "\xEF\xBB\xBF";

$fp = fopen('php://output', 'w');
fputcsv($fp, ['名前', 'メール', '部署']);
fputcsv($fp, ['田中太郎', 'tanaka@example.com', '開発部']);
fclose($fp);

# Example: write a UTF-8 CSV with a BOM in Python
import csv

with open('data.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['名前', 'メール', '部署'])
    writer.writerow(['田中太郎', 'tanaka@example.com', '開発部'])

However, the BOM comes with caveats. Some programs and shell scripts do not recognize it and treat it as part of the data, which can cause errors such as an invisible character attached to the first column name. For CSV files returned in API responses or consumed by other programs, it is safer to omit the BOM.
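
The following Python sketch shows this kind of breakage and one way to guard against it on the reading side. The sample data and the helper name `read_header` are made up for illustration; the `utf-8` and `utf-8-sig` codecs are part of the standard library.

```python
import csv
import io

# A small CSV encoded as UTF-8 with a BOM, as a producer targeting Excel might emit.
raw = "name,email\r\ntanaka,tanaka@example.com\r\n".encode("utf-8-sig")


def read_header(data: bytes, encoding: str) -> list[str]:
    """Decode the bytes with the given encoding and return the first CSV row."""
    text = io.StringIO(data.decode(encoding), newline="")
    return next(csv.reader(text))


# Plain "utf-8" keeps the BOM as U+FEFF at the start of the first column name.
header_with_bom = read_header(raw, "utf-8")

# "utf-8-sig" strips a leading BOM if present and is harmless if it is absent.
header_clean = read_header(raw, "utf-8-sig")
```

With plain `utf-8`, a lookup such as `row["name"]` would fail because the actual key is `"\ufeffname"`. Reading with `utf-8-sig` handles files both with and without a BOM.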

Line Break Code Issues

Alongside garbled characters, mismatched line endings are another common source of CSV problems.

  • <strong>CRLF</strong> (<code>\r\n</code>) — Windows standard
  • <strong>LF</strong> (<code>\n</code>) — Linux / macOS standard
  • <strong>CR</strong> (<code>\r</code>) — Legacy Mac OS (rarely used today)

RFC 4180 specifies that CSV line breaks should be CRLF. However, in practice, many tools can handle CSV files with only LF without issues. Problems are more likely to occur with Excel and some legacy systems.
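
In Python, the `csv` module writes CRLF by default, and the `lineterminator` parameter switches it to LF. The sketch below writes to an in-memory buffer; the function name `rows_to_csv` is hypothetical.

```python
import csv
import io


def rows_to_csv(rows: list[list[str]], lineterminator: str = "\r\n") -> str:
    """Serialize rows to CSV text using the given line terminator."""
    # newline="" stops the text layer from translating line endings itself.
    buf = io.StringIO(newline="")
    writer = csv.writer(buf, lineterminator=lineterminator)
    writer.writerows(rows)
    return buf.getvalue()


crlf_text = rows_to_csv([["a", "b"], ["1", "2"]])                    # RFC 4180 style
lf_text = rows_to_csv([["a", "b"], ["1", "2"]], lineterminator="\n")  # Unix style
```

When writing to a real file, opening it with `newline=''` serves the same purpose as in the buffer above: the terminator chosen by the writer reaches the file unchanged.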

Command-line tools are a convenient way to check line endings and the presence of a BOM.

# Check with the file command
file data.csv
# Example output: data.csv: UTF-8 Unicode (with BOM) text, with CRLF line terminators

# Inspect the leading bytes with xxd (to see whether a BOM is present)
xxd data.csv | head -3
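
Where those tools are unavailable (for example, inside an application's own validation step), a similar check can be sketched in Python by counting byte sequences. The function name `count_line_endings` is hypothetical. Note that it counts every line break in the data, including any inside quoted fields.

```python
def count_line_endings(data: bytes) -> dict[str, int]:
    """Count CRLF, lone LF, and lone CR sequences in raw file bytes."""
    crlf = data.count(b"\r\n")
    # Every CRLF contains one CR and one LF, so subtract them to get lone ones.
    lone_lf = data.count(b"\n") - crlf
    lone_cr = data.count(b"\r") - crlf
    return {"CRLF": crlf, "LF": lone_lf, "CR": lone_cr}


sample = b"id,name\r\n1,a\r\n2,b\n"
counts = count_line_endings(sample)  # two CRLF endings and one lone LF
```

A file that reports more than one non-zero count has mixed line endings, which is often worth normalizing before further processing.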

Practical Solution Chart

Below is a summary of optimal settings depending on the intended use of CSV.

| Use case | Character encoding | BOM | Line ending |
|---|---|---|---|
| Open in Excel | UTF-8 | Yes | CRLF |
| Process by program | UTF-8 | No | LF |
| Legacy system integration | Shift_JIS (CP932) | No | CRLF |
| API response | UTF-8 | No | LF |
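
One way to apply these settings in code is to keep them as presets and select one per export. The following Python sketch assumes hypothetical preset names (`excel`, `program`, `legacy`, `api`) and a hypothetical helper `build_csv`; it is an illustration of the chart, not a definitive implementation.

```python
import csv
import io

# use case -> (Python codec name, line terminator); "utf-8-sig" adds the BOM.
PRESETS = {
    "excel": ("utf-8-sig", "\r\n"),
    "program": ("utf-8", "\n"),
    "legacy": ("cp932", "\r\n"),
    "api": ("utf-8", "\n"),
}


def build_csv(rows: list[list[str]], use_case: str) -> bytes:
    """Serialize rows to CSV bytes using the preset for the given use case."""
    encoding, lineterminator = PRESETS[use_case]
    buf = io.StringIO(newline="")  # no newline translation in the text layer
    csv.writer(buf, lineterminator=lineterminator).writerows(rows)
    return buf.getvalue().encode(encoding)


rows = [["名前", "部署"], ["田中太郎", "開発部"]]
excel_bytes = build_csv(rows, "excel")
api_bytes = build_csv(rows, "api")
```

For the `legacy` preset, encoding raises `UnicodeEncodeError` if the data contains characters that CP932 cannot represent, so callers may want to validate or substitute such characters beforehand.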

Test CSV Files

To verify that your application correctly handles CSV files with various character encodings, use the test files available in DevLab.

  • <a href="/ja/files/encoding/">CSV Test Files by Character Encoding</a> — UTF-8 (with/without BOM), Shift_JIS, EUC-JP, and more
  • <a href="/ja/files/newline/">Test Files by Newline Code</a> — Various CRLF, LF, and CR files
  • <a href="/ja/files/csv/">CSV test files list</a> — CSV files of various sizes

Summary

CSV character encoding issues can be reliably resolved by correctly understanding three key elements: character encoding, BOM (Byte Order Mark), and line endings. For Excel, the standard is UTF-8 with BOM using CRLF line endings; for programmatic processing, use UTF-8 without BOM using LF line endings. Using DevLab's various test files, verify that your application correctly handles each pattern.
