Unicode signature BOM (Byte Order Mark) issue for UTF-8 files

I recently ran into something odd while debugging a Chinese Zen Cart website encoded in UTF-8. The text on the page displayed normally, but when I used IE's View Source (which opens the source in Notepad), the text was garbled. Firefox did not have this problem. After a lot of searching and testing, I solved it. The cause turned out to be the Unicode signature, or BOM (Byte Order Mark), of the UTF-8 file.

The BOM (Byte Order Mark) is the character U+FEFF placed at the very start of a text stream to identify its encoding and byte order. In UTF-16 it is stored as FF FE (little-endian) or FE FF (big-endian), and in UTF-8 it becomes the three bytes EF BB BF. The mark is optional. Since UTF-8 has no byte-order problem, the BOM there serves only as a signature that lets software detect that a byte stream is UTF-8 encoded. Microsoft software does this detection, but some other software does not and treats the three bytes as normal characters.
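
As a quick check of these byte values, here is a minimal Python sketch. It is my own illustration rather than part of the original debugging, and the helper name hex_bytes is made up. It encodes the single character U+FEFF in three encoding forms and prints the resulting bytes in hex:

```python
import codecs


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex, e.g. 'EF BB BF'."""
    return " ".join(f"{b:02X}" for b in data)


BOM_CHAR = "\ufeff"  # the BOM is the single code point U+FEFF

# Encode the same character in three encoding forms and show the bytes.
for encoding in ("utf-8", "utf-16-le", "utf-16-be"):
    print(f"{encoding:10s} {hex_bytes(BOM_CHAR.encode(encoding))}")

# The standard library exposes the same byte sequences as constants.
assert BOM_CHAR.encode("utf-8") == codecs.BOM_UTF8
assert BOM_CHAR.encode("utf-16-le") == codecs.BOM_UTF16_LE
assert BOM_CHAR.encode("utf-16-be") == codecs.BOM_UTF16_BE
```

The UTF-8 line should show the same EF BB BF that a hex editor shows at the start of a file saved with a signature.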

Microsoft tools such as Notepad have traditionally written the three bytes EF BB BF at the start of the UTF-8 text files they save. Programs on Windows use these three bytes to decide whether a text file is ANSI or UTF-8. The Unicode standard permits this signature but does not require it, so it is largely a Windows convention. On other platforms, UTF-8 text files normally have no such mark.

In other words, a UTF-8 file may or may not have a BOM. How do you tell which? There are three methods:
1. Open the file with UltraEdit-32, switch to hexadecimal editing mode, and check whether the file begins with EF BB BF.
2. Open it with Dreamweaver, look at the page properties, and see whether "Include Unicode Signature BOM" is checked.
3. Open it with Windows Notepad, select "Save As", and check whether the default encoding offered is UTF-8 or ANSI. If it is ANSI, the file has no BOM.
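
If you would rather not open an editor, the same check can be scripted. The following is a small Python sketch under my own assumptions (the function names starts_with_utf8_bom and file_has_utf8_bom are hypothetical, not from any tool mentioned here). It reads only the first three bytes of a file and compares them with EF BB BF:

```python
import codecs


def starts_with_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string begins with EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)


def file_has_utf8_bom(path: str) -> bool:
    """Return True if the file at `path` begins with the UTF-8 BOM."""
    # Binary mode matters: text mode could decode or hide the BOM.
    with open(path, "rb") as f:
        return starts_with_utf8_bom(f.read(3))
```

For example, calling file_has_utf8_bom("html_header.php") on a local copy of the template file would report whether that copy carries the signature.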

I found html_header.php among the Zen Cart template files and discovered that it did not have a BOM. I re-saved it with UltraEdit-32 so that it included the BOM, then uploaded html_header.php again. Everything was normal.

Note that when using Convertz to convert a gb2312 file to a UTF-8 file, the default setting is to not include a BOM, and the garbled characters described above may appear without one. However, if a BOM is included, you should be careful with PHP include files. The bytes EF BB BF sit in front of the opening PHP tag, so they are sent to the browser as output before the script runs. That early output can cause program errors, for example a "headers already sent" warning when the script later calls header() or starts a session. One solution is to save all included files as ANSI and keep only the main file as UTF-8.

To remove the BOM from a file, open it with UltraEdit, switch to hexadecimal editing mode, replace the first three bytes (the damn EF BB BF) with 20, save the file (remember to turn off the automatic backup function before saving), then switch back to the default editing mode and delete the three leading spaces.
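
The manual hex-editing steps can also be scripted. Here is a minimal Python sketch, assuming the file fits in memory; the function names strip_utf8_bom and remove_bom_from_file are my own and purely illustrative. It drops the three leading bytes only when they are exactly EF BB BF and leaves every other file untouched:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return `data` without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def remove_bom_from_file(path: str) -> bool:
    """Rewrite the file without its BOM; return True if one was removed."""
    with open(path, "rb") as f:
        original = f.read()
    cleaned = strip_utf8_bom(original)
    if cleaned == original:
        return False  # no BOM found, file left as it was
    with open(path, "wb") as f:
        f.write(cleaned)
    return True
```

Keep a backup before running something like this over a whole directory of include files. When you only need to read such a file in Python, opening it with encoding="utf-8-sig" decodes the text and skips a leading BOM if there is one.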

I also picked up a few facts about encodings. A file saved as "Unicode" (in Notepad, for example) is actually UTF-16, whose 16-bit units happen to equal the Unicode code point for most common characters, but conceptually Unicode and UTF are two different things. Unicode is a character set that assigns a number (a code point) to each character, and a UTF is a scheme for saving and transmitting those numbers as bytes. UTF-16 comes in two byte orders: low byte first (LE, little-endian) and high byte first (BE, big-endian). The official UTF encodings also include UTF-32, which is likewise divided into LE and BE. There is also UTF-7, which is not part of the Unicode standard and was mainly used for email transmission. The single-byte part of UTF-8 is compatible with ASCII. UTF-8 exists largely because some old systems and library functions cannot handle UTF-16 correctly. For English text it also saves file space (at the expense of extra space for many non-English characters: most Chinese characters take three bytes in UTF-8 but two in UTF-16). ASCII characters take one byte in both UTF-8 and ISO-8859-1. The remaining ISO-8859-1 characters (0x80 to 0xFF) take two bytes in UTF-8, and other characters take two to four bytes.
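
A quick way to check byte order and byte counts is to encode a few sample characters and look at the result. This is a small Python sketch of my own; the sample characters and the helper name byte_len are arbitrary choices for illustration:

```python
def byte_len(ch: str, encoding: str) -> int:
    """Number of bytes needed to store `ch` in the given encoding."""
    return len(ch.encode(encoding))


# Sample characters, written as escapes so the script itself stays ASCII.
samples = {
    "A": "ASCII letter (U+0041)",
    "\u00e9": "Latin-1 letter e-acute (U+00E9)",
    "\u4e2d": "Chinese character (U+4E2D)",
}

for ch, label in samples.items():
    print(f"{label}: utf-8={byte_len(ch, 'utf-8')} bytes, "
          f"utf-16={byte_len(ch, 'utf-16-le')} bytes")

# Byte order: the same letter "A" (U+0041) in both UTF-16 variants.
print("utf-16-le:", "A".encode("utf-16-le").hex())  # low byte first
print("utf-16-be:", "A".encode("utf-16-be").hex())  # high byte first
```

Only the ASCII letter fits in one UTF-8 byte. The Latin-1 letter needs two bytes in UTF-8 even though ISO-8859-1 stores it in one, and the Chinese character needs three.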
