HTML 4.01 Specification, Chinese-English Parallel Text: HTML Document Representation (2)

5.2.2 Specifying the character encoding

How does a server determine which character encoding applies for a document it serves? Some servers examine the first few bytes of the document, or check against a database of known files and encodings. Many modern servers give Web masters more control over charset configuration than old servers do. Web masters should use these mechanisms to send out a "charset" parameter whenever possible, but should take care not to identify a document with the wrong "charset" parameter value.

服务器如何决定对外服务文档的字符编码?有一些服务器会检查文档的最开始几个字节,或者对照一个已知文件和编码的数据库进行检查。与老版本的服务器相比,很多现代的服务器为Web管理员提供了更多对字符集配置的控制。只要有可能,Web管理员就应该采用这些机制来对外发送"charset"参数,但同时也要注意,不要用错误的"charset"参数值标识文档。

How does a user agent know which character encoding has been used? The server should provide this information. The most straightforward way for a server to inform the user agent about the character encoding of the document is to use the "charset" parameter of the "Content-Type" header field of the HTTP protocol ([RFC2616], sections 3.4 and 14.17). For example, the following HTTP header announces that the character encoding is EUC-JP:

用户代理是如何知道使用的是哪种字符编码呢?相应的信息应该由服务器给出。对于服务器来讲,最直接的方式就是在HTTP协议([RFC2616]3.4以及14.17部分)的"Content-Type"头信息的"charset"参数中告知用户代理有关文档的字符编码信息。例如下面的HTTP头,声明了字符编码为EUC-JP:

Content-Type: text/html; charset=EUC-JP

Please consult the section on conformance for the definition of text/html.

有关text/html的定义请参看规范符合性部分。

The HTTP protocol ([RFC2616], section 3.7.1) mentions ISO-8859-1 as a default character encoding when the "charset" parameter is absent from the "Content-Type" header field. In practice, this recommendation has proved useless because some servers don't allow a "charset" parameter to be sent, and others may not be configured to send the parameter. Therefore, user agents must not assume any default value for the "charset" parameter.

HTTP协议在其[RFC2616]的3.7.1部分提及了在"Content-Type"头信息的"charset"参数缺失时,将采用ISO-8859-1作为缺省的字符编码方式。在实践中,这个建议是没什么用处的,因为有些服务器是不允许发送"charset"参数的,还有一些服务器可能没有被配置成发送这个参数。因此,用户代理绝对不能对"charset"参数的缺省值作任何假设。
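
As a rough illustration of the point above, the following Python sketch reads the "charset" parameter from a Content-Type value and reports its absence as None instead of falling back to ISO-8859-1. The function name `charset_from_content_type` is hypothetical and not part of the specification; parameter parsing is delegated to the standard library's `email.message.Message`.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None if absent.

    Hypothetical helper: no default is assumed when the parameter is missing.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_param returns None (its default failobj) when "charset" is not present.
    return msg.get_param("charset")


if __name__ == "__main__":
    print(charset_from_content_type("text/html; charset=EUC-JP"))
    print(charset_from_content_type("text/html"))
```

Returning None keeps the decision about what to do next (META declaration, heuristics, user default) with the caller.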

To address server or configuration limitations, HTML documents may include explicit information about the document's character encoding; the META element can be used to provide user agents with this information.

为了解决服务器本身或者配置的限制,HTML文档内可以包含显式的关于文档字符编码方式的信息;META元素可以用来为用户代理提供该类信息。

For example, to specify that the character encoding of the current document is "EUC-JP", a document should include the following META declaration:

例如,为了表述当前文档的字符编码为“EUC-JP”,文档内应该包含以下内容的META元素声明:

<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">

The META declaration must only be used when the character encoding is organized such that ASCII-valued bytes stand for ASCII characters (at least until the META element is parsed). META declarations should appear as early as possible in the HEAD element.

META元素声明只能在文档的字符编码机制对ASCII字符的处理与ASCII字节标准一致时才能使用。至少,这种一致性在META元素之前(包括其本身)要保持。META元素在HEAD元素中出现的位置应该是越早越好。
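
A minimal sketch of how a consumer might look for such a META declaration, assuming (as the paragraph above requires) that ASCII-valued bytes stand for ASCII characters at least up to the META element. The names `sniff_meta_charset` and `_MetaCharsetFinder`, and the 1024-byte scan limit, are illustrative assumptions and are not prescribed by the specification.

```python
from email.message import Message
from html.parser import HTMLParser


class _MetaCharsetFinder(HTMLParser):
    """Records the charset of the first META http-equiv="Content-Type" element."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <META> arrives as "meta".
        if tag != "meta" or self.charset is not None:
            return
        attr = dict(attrs)
        if (attr.get("http-equiv") or "").lower() == "content-type":
            msg = Message()
            msg["Content-Type"] = attr.get("content") or ""
            self.charset = msg.get_param("charset")


def sniff_meta_charset(raw, limit=1024):
    """Scan the first `limit` bytes (an arbitrary choice) for a META charset."""
    finder = _MetaCharsetFinder()
    # Non-ASCII bytes are replaced; only the ASCII-compatible markup matters here.
    finder.feed(raw[:limit].decode("ascii", errors="replace"))
    return finder.charset


if __name__ == "__main__":
    doc = b'<HTML><HEAD><META http-equiv="Content-Type" content="text/html; charset=EUC-JP"></HEAD>'
    print(sniff_meta_charset(doc))
```

Scanning only a prefix of the document reflects the advice that META declarations appear as early as possible in the HEAD element.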

For cases where neither the HTTP protocol nor the META element provides information about the character encoding of a document, HTML also provides the charset attribute on several elements. By combining these mechanisms, an author can greatly improve the chances that, when the user retrieves a resource, the user agent will recognize the character encoding.

对于HTTP协议和META元素都没有提供文档字符编码信息的情况,HTML还在若干元素上提供了charset属性来指定字符编码。通过组合使用这些机制,作者可以极大地提高用户在检索资源时用户代理识别出字符编码的机会。

To sum up, conforming user agents must observe the following priorities when determining a document's character encoding (from highest priority to lowest):

  1. An HTTP "charset" parameter in a "Content-Type" field.
  2. A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
  3. The charset attribute set on an element that designates an external resource.

总结一下,符合规范的用户代理必须按照以下的优先级顺序(从高到低)来决定一个文档的字符编码:

  1. HTTP头信息"Content-Type"中的"charset"参数。
  2. META元素声明,该声明的"http-equiv"属性值为"Content-Type",并且为"charset"设置了值。
  3. 指明外部资源的元素上设置的charset属性。

In addition to this list of priorities, the user agent may use heuristics and user settings. For example, many user agents use a heuristic to distinguish the various encodings used for Japanese text. Also, user agents typically have a user-definable, local default character encoding which they apply in the absence of other indicators.

作为上述优先级列表的补充,用户代理可以采取启发式的方式或者采用用户设置的方式来决定字符编码。例如,很多用户代理都会用启发式的方式来区分日文文本使用的多种编码方式。另外,用户代理通常都会有一个用户可设置的本地缺省字符编码,在没有其他指示时使用。
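
The priority order and the local default described above can be sketched as a simple first-match selection. The function `choose_encoding` and its parameter names are hypothetical; heuristics such as Japanese encoding detection are deliberately left out.

```python
def choose_encoding(http_charset=None, meta_charset=None,
                    attr_charset=None, user_default=None):
    """Pick a character encoding, highest priority first.

    Order: HTTP "charset" parameter, META declaration, charset attribute
    on the referring element, then the user-definable local default.
    Returns None when no indicator is available at all.
    """
    for candidate in (http_charset, meta_charset, attr_charset, user_default):
        if candidate:
            return candidate
    return None


if __name__ == "__main__":
    # The HTTP header wins over a conflicting META declaration.
    print(choose_encoding(http_charset="EUC-JP", meta_charset="Shift_JIS"))
    # With no HTTP charset, the META declaration is used.
    print(choose_encoding(meta_charset="Shift_JIS", user_default="ISO-8859-1"))
```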

User agents may provide a mechanism that allows users to override incorrect "charset" information. However, if a user agent offers such a mechanism, it should only offer it for browsing and not for editing, to avoid the creation of Web pages marked with an incorrect "charset" parameter.

用户代理可以提供一种允许用户覆盖不正确的"charset"信息的机制。然而,如果用户代理提供这样的机制,为了避免创建包含不正确"charset"参数的网页,这种机制只能用于浏览操作而不能用于编辑操作。

Note. If, for a specific application, it becomes necessary to refer to characters outside [ISO10646], characters should be assigned to a private zone to avoid conflicts with present or future versions of the standard. This is highly discouraged, however, for reasons of portability.

注释。对于特定的应用,如果一定要引用[ISO10646]字符集之外的字符,这些字符应该被分配到私有区域,以避免与标准的现在及未来版本冲突。然而,基于可移植性的考虑,强烈不推荐这样的用法。

5.3 Character references

A given character encoding may not be able to express all characters of the document character set. For such encodings, or when hardware or software configurations do not allow users to input some document characters directly, authors may use SGML character references. Character references are a character encoding-independent mechanism for entering any character from the document character set.

某个特定的字符编码可能无法表示文档字符集中的所有字符。对于这些编码,或者当软硬件的配置不允许用户直接输入一些文档字符时,作者可以使用SGML字符引用。字符引用是一种独立于字符编码的、可以输入文档字符集中任何字符的机制。

Character references in HTML may appear in two forms:

  • Numeric character references (either decimal or hexadecimal).
  • Character entity references.

Character references within comments have no special meaning; they are comment data only.

在HTML中,字符引用可以有如下两种形式:

  • 数字形式的字符引用(十进制形式和十六进制形式)。
  • 字符实体引用。

在注释中出现的字符引用是没有任何特殊意义的,即它们不会被认为是字符引用。它们只是注释数据而已。

Note. HTML provides other ways to present character data, in particular inline images.

Note. In SGML, it is possible to eliminate the final ";" after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.

注释。HTML提供了其他展现字符数据的方法,尤其是行内图片。

注释。在SGML中,在某些情况下省略字符引用最后的分号";"是可以的(例如:在折行处或者该引用后面紧跟一个标签时)。在其他情况下这个分号是不可以省略的(例如:在一个单词的中间时)。我们强烈建议在任何情况下都使用";",以避免有的用户代理强制要求该字符必须出现所带来的问题。

5.3.1 Numeric character references

Numeric character references specify the code position of a character in the document character set. Numeric character references may take two forms:

  • The syntax "&#D;", where D is a decimal number, refers to the ISO 10646 decimal character number D.
  • The syntax "&#xH;" or "&#XH;", where H is a hexadecimal number, refers to the ISO 10646 hexadecimal character number H. Hexadecimal numbers in numeric character references are case-insensitive.

数字形式的字符引用直接指定字符在文档字符集中的代码位置。数字形式的字符引用可以有如下两种形式:

  • "&#D;"语法形式,这里D是一个十进制数字,表示ISO 10646的十进制字符编号D。
  • "&#xH;"或者"&#XH;"语法形式,这里H是一个十六进制数字,表示ISO 10646的十六进制字符编号H。在数字形式的字符引用中,十六进制数字是不区分大小写的。

Here are some examples of numeric character references:

  • &#229; (in decimal) represents the letter "a" with a small circle above it (used, for example, in Norwegian).
  • &#xE5; (in hexadecimal) represents the same character.
  • &#Xe5; (in hexadecimal) represents the same character as well.
  • &#1048; (in decimal) represents the Cyrillic capital letter "I".
  • &#x6C34; (in hexadecimal) represents the Chinese character for water.

下面是一些数字形式字符引用的例子:

  • &#229; (十进制)代表头上有个小圆圈的字母"a"(例如在挪威语中使用)。
  • &#xE5; (十六进制)代表同一个字符。
  • &#Xe5; (十六进制)同样代表同一个字符。
  • &#1048; (十进制)代表西里尔大写字母"I"。
  • &#x6C34; (十六进制)代表中文汉字“水”。

Note. Although the hexadecimal representation is not defined in [ISO8879], it is expected to be in the revision, as described in [WEBSGML]. This convention is particularly useful since character standards generally use hexadecimal representations.

注释。虽然在[ISO8879]中没有定义十六进制的形式,但根据[WEBSGML]的描述,这种形式预计将出现在修订版本中。由于字符标准通常使用十六进制表示,这个约定特别有用。
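
The examples above can be checked with Python's standard `html.unescape`, which accepts both the decimal and the hexadecimal form (with either "x" or "X"). The helper names `to_decimal_ref` and `to_hex_ref` are hypothetical and only illustrate the two syntaxes.

```python
import html


def to_decimal_ref(ch):
    """Build the decimal form "&#D;" for a single character."""
    return f"&#{ord(ch)};"


def to_hex_ref(ch):
    """Build the hexadecimal form "&#xH;" for a single character."""
    return f"&#x{ord(ch):X};"


if __name__ == "__main__":
    # All three spellings decode to the same character.
    for ref in ("&#229;", "&#xE5;", "&#Xe5;"):
        print(ref, "->", html.unescape(ref))
    print(to_decimal_ref("\u0418"), to_hex_ref("\u6c34"))
```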

5.3.2 Character entity references

In order to give authors a more intuitive way of referring to characters in the document character set, HTML offers a set of character entity references. Character entity references use symbolic names so that authors need not remember code positions. For example, the character entity reference &aring; refers to the lowercase "a" character topped with a ring; "&aring;" is easier to remember than &#229;.

为了给作者一种更加直观地引用文档字符集中字符的方式,HTML提供了一组字符实体引用。字符实体引用采用符号形式的名字,所以作者就不必再记忆字符的代码位置。例如:字符实体引用 &aring; 表示头上有个圆圈的小写字母"a";"&aring;" 比 &#229; 更加容易记忆。

HTML 4 does not define a character entity reference for every character in the document character set. For instance, there is no character entity reference for the Cyrillic capital letter "I". Please consult the full list of character references defined in HTML 4.

HTML 4并没有为文档字符集中每一个字符都定义字符实体引用,比如就没有为西里尔大写字母"I"提供字符实体引用。有关HTML 4中定义的全部字符引用,请参阅字符引用完全列表部分。

Character entity references are case-sensitive. Thus, &Aring; refers to a different character (uppercase A, ring) than &aring; (lowercase a, ring).

字符实体引用是大小写敏感的。所以,&Aring; 和 &aring; 表示的是不同的字符,前者代表带圆圈的大写A,后者代表带圆圈的小写a。
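
A short sketch of entity name lookup, using the standard library table `html.entities.name2codepoint`, which covers the HTML 4 entity names. The wrapper `entity_to_char` is a hypothetical helper; it returns None for names that HTML 4 does not define.

```python
from html.entities import name2codepoint


def entity_to_char(name):
    """Map an HTML 4 entity name (without "&" and ";") to its character.

    Lookup is case-sensitive, so "aring" and "Aring" are different entries.
    """
    code = name2codepoint.get(name)
    return chr(code) if code is not None else None


if __name__ == "__main__":
    print(entity_to_char("aring"), name2codepoint["aring"])  # lowercase a with ring
    print(entity_to_char("Aring"), name2codepoint["Aring"])  # uppercase A with ring
    print(entity_to_char("ARING"))  # not a defined name
```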

Four character entity references deserve special mention since they are frequently used to escape special characters:

  • "&lt;" represents the < sign.
  • "&gt;" represents the > sign.
  • "&amp;" represents the & sign.
  • "&quot;" represents the " mark.

有四个字符实体引用需要在此特别提及,因为它们经常被用来转义特殊字符:

  • "&lt;" 代表<符号。
  • "&gt;" 代表>符号。
  • "&amp;" 代表&符号。
  • "&quot;" 代表" 符号。

Authors wishing to put the "<" character in text should use "&lt;" (ASCII decimal 60) to avoid possible confusion with the beginning of a tag (start tag open delimiter). Similarly, authors should use "&gt;" (ASCII decimal 62) in text instead of ">" to avoid problems with older user agents that incorrectly perceive this as the end of a tag (tag close delimiter) when it appears in quoted attribute values.

如果作者需要在文本中输入"<",就应该使用"&lt;"(ASCII十进制代码60),以避免与标签起始符号(开始标签起始定界符)混淆。类似地,作者应该在文本中使用"&gt;"(ASCII十进制代码62)来代替">",以避免一些老版本的用户代理在它出现在用引号框定的属性值中时,错误地将其当成标签结束符(标签结束定界符)处理。

Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter). Authors should also use "&amp;" in attribute values since character references are allowed within CDATA attribute values.

为了避免与字符引用开始符号(实体引用起始定界符)混淆,作者应该用"&amp;"(ASCII十进制代码38)来代替"&"。由于CDATA属性值中允许出现字符引用,作者也应该在属性值中使用"&amp;"。

Some authors use the character entity reference "&quot;" to encode instances of the double quote mark (") since that character may be used to delimit attribute values.

由于双引号可以用来框定属性值,所以一些作者会用字符实体引用"&quot;"来表示双引号(")。
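
The escaping rules above can be applied mechanically. The sketch below uses Python's standard `html.escape`, which replaces "&", "<" and ">" and, with its default `quote=True`, also the double quote (and the single quote, which goes beyond the four references listed here). The helper `anchor` and the sample URL are illustrative assumptions.

```python
import html


def anchor(href, label):
    """Build an A element with the attribute value and the text both escaped."""
    return f'<A href="{html.escape(href)}">{html.escape(label)}</A>'


if __name__ == "__main__":
    print(html.escape("1 < 2 & 3 > 2"))
    print(html.escape('say "hi"'))
    # "&" inside the attribute value becomes "&amp;".
    print(anchor("search?q=a&b=c", "a & b"))
```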

5.4 Undisplayable characters

A user agent may not be able to render all characters in a document meaningfully, for instance, because the user agent lacks a suitable font, a character has a value that may not be expressed in the user agent's internal character encoding, etc.

用户代理可能无法有意义地显示文档中的所有字符,例如,用户代理缺少合适的字体,或者某个字符的值在用户代理的内部字符编码中不能被表示等。

Because there are many different things that may be done in such cases, this document does not prescribe any specific behavior. Depending on the implementation, undisplayable characters may also be handled by the underlying display system and not the application itself. In the absence of more sophisticated behavior, for example tailored to the needs of a particular script or language, we recommend the following behavior for user agents:

由于在这些情况下可以采取很多不同的处理方式,本文档不对任何特定的行为作出规定。根据具体实现的不同,不可显示字符也可以交给底层的显示系统处理,而不是应用本身。在没有更成熟的处理方式(例如针对某种特定文字或者语言的需要而定制的方式)时,我们建议用户代理采用如下行为:

  1. Adopt a clearly visible, but unobtrusive mechanism to alert the user of missing resources.
  2. If missing characters are presented using their numeric representation, use the hexadecimal (not decimal) form since this is the form used in character set standards.

  1. 采用清晰可见但不突兀的机制,提示用户有资源缺失。
  2. 如果缺失的字符用它们的数字表示来展现,要使用十六进制(而不是十进制)形式,因为字符集标准使用的就是这种形式。
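
One possible reading of the second recommendation, sketched in Python: characters that cannot be displayed are replaced with a visible marker carrying the hexadecimal code position. The function `render_with_fallback`, the `displayable` predicate, and the "[U+XXXX]" marker format are all assumptions made for illustration; the specification does not prescribe any of them.

```python
def render_with_fallback(text, displayable):
    """Replace characters rejected by `displayable` with a hexadecimal marker.

    `displayable` is a caller-supplied predicate (hypothetical here) that
    stands in for "the user agent has a suitable font for this character".
    """
    return "".join(
        ch if displayable(ch) else f"[U+{ord(ch):04X}]"
        for ch in text
    )


if __name__ == "__main__":
    # Pretend only code positions below 0x100 can be displayed.
    latin1_only = lambda ch: ord(ch) < 0x100
    print(render_with_fallback("\u00e5\u6c34", latin1_only))
```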