VISCII
The Vietnamese Standard Code for Information Interchange (VISCII) is a character set comprising the Vietnamese alphabet, punctuation, and other graphemes. Vietnamese requires slightly too many letter/diacritic combinations (134) to fit in a traditional extended ASCII character set. There are essentially three possible solutions to this:
#Use a variable width encoding
#Use combining diacritical marks (as windows-1258 does)
#Replace something from ASCII
VISCII went for the last option, replacing 6 of the least problematic (e.g., least likely to be recognised by an application and acted on specially) C0 control codes (STX, ENQ, ACK, DC4, EM, and RS) with 6 of the least used uppercase letter/diacritic combinations. While this may cause issues with some programs in handling VISCII text if they use those control codes, it creates fewer complications than either of the other two solutions. However, it leaves absolutely no space available for things other than accented letters such as symbols, superscripted numbers, curved quotes, proper dashes, etc.
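As a rough illustration of the trade-off described above, the following Python sketch lists the six C0 positions named in the text (byte values taken from ASCII) and reports where they occur in a byte string. The helper name `ambiguous_c0_bytes` is hypothetical, and the sketch does not decode VISCII; it only flags bytes that a control-code-aware program might treat specially.

```python
# C0 control codes that VISCII reassigns to letters (byte values from ASCII).
VISCII_REUSED_C0 = {
    0x02: "STX",
    0x05: "ENQ",
    0x06: "ACK",
    0x14: "DC4",
    0x19: "EM",
    0x1E: "RS",
}


def ambiguous_c0_bytes(data: bytes) -> list[tuple[int, str]]:
    """Return (offset, control-code name) for each reused C0 byte in data."""
    return [
        (offset, VISCII_REUSED_C0[byte])
        for offset, byte in enumerate(data)
        if byte in VISCII_REUSED_C0
    ]


sample = b"abc\x02def\x1e"
print(ambiguous_c0_bytes(sample))  # [(3, 'STX'), (7, 'RS')]
```

Ordinary ASCII text, including tabs and newlines, contains none of these six bytes, which is presumably why they were considered the least problematic to reuse.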
Codepage layout
External links
- RFC 1456 - Conventions for Encoding the Vietnamese Language
- [http://www.vietstd.org/ Vietnamese-Standard Working Group]
- [http://www.vnet.org/vietstd/report/rep92.htm Viet-Std Report 1992]
Category:Character sets
Character set
A character encoding consists of a code that pairs a set of characters (representations of graphemes or grapheme-like units, such as might appear in an alphabet or syllabary for the communication of a natural language) with a set of something else, such as numbers or electrical pulses, in order to facilitate the storage of text in computers and the transmission of text through telecommunication networks. Common examples include Morse code, which encodes letters of the Latin alphabet as series of long and short depressions of a telegraph key; and ASCII, which encodes letters, numerals, and other symbols, both as integers and as 7-bit binary versions of those integers.
Conventionally, character set and character encoding were considered synonymous, as the same standard would specify both which characters were available and how they were to be encoded into a stream of code units (usually with a single character per code unit). Unicode broke away from this idea, separating the numbering of a series of characters from the encoding of those characters into a stream of code units. For historical reasons, MIME and systems based on it use the term charset to refer to a character encoding.
Character repertoire
In some contexts, especially computer storage and communication, it makes sense to distinguish a character repertoire (a full set of abstract characters that a system supports) from a coded character set or character encoding (which specifies how to represent characters from that set using a number of integer codes).
In earlier days of computing, the introduction of character repertoires such as ASCII (1963) and EBCDIC (1964) began the process of standardisation. The limitations of such sets soon became apparent, and a number of ad-hoc methods developed to extend them. The need to support multiple writing systems, including the CJK family of East Asian scripts, required support for a far larger number of characters and demanded a systematic approach to character encoding rather than the previous ad hoc approaches.
For example, the full repertoire of Unicode encompasses over 100,000 characters. Each of these characters has a unique integer code in the range 0 to hexadecimal 10FFFF (a little over 1.1 million, so not all integers in that range represent coded characters). Other common repertoires include ASCII and ISO 8859-1, which mirror exactly the first 128 and 256 coded characters of Unicode respectively.
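The relationship between these repertoires can be checked with Python's built-in codecs. This is only a sketch: the codec names `ascii` and `latin-1` are Python's names for ASCII and ISO 8859-1, and the variable names are chosen for illustration.

```python
import sys

# Highest code point in the Unicode code space.
print(hex(sys.maxunicode))  # 0x10ffff

# Every ISO 8859-1 byte decodes to the Unicode code point with the same number.
latin1_mirrors = all(bytes([i]).decode("latin-1") == chr(i) for i in range(256))

# Every ASCII byte does the same for the first 128 code points.
ascii_mirrors = all(bytes([i]).decode("ascii") == chr(i) for i in range(128))

print(latin1_mirrors, ascii_mirrors)  # True True
```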
Encoding forms and encoding schemes
Computer scientists sometimes overload the term character encoding to mean also how a specific sequence of bits represents characters. This involves an encoding form, which specifies the conversion of the integer code into a series of integer code values that facilitate storage in a system that uses fixed bit widths. For example, integers greater than 65535 (hex FFFF) will not fit in 16 bits, so the UTF-16 encoding form mandates representation of these integers as a surrogate pair of integers, each less than 65536 and not assigned to characters (for example, hex 10000 becomes the pair D800 DC00). An encoding scheme then converts code values to bit sequences, with attention given to things like platform-dependent byte order issues (for example, D800 DC00 might become 00 D8 00 DC on an Intel x86 architecture). A character set or character map or code page shortcuts this process by directly mapping abstract characters to specific bit patterns. [http://www.unicode.org/reports/tr17/ Unicode Technical Report #17] explains this terminology in depth and provides further examples.
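The surrogate-pair example above can be reproduced with a few lines of Python. The function name `to_surrogate_pair` is hypothetical; the arithmetic follows the UTF-16 definition, and Python's `utf-16-le` and `utf-16-be` codecs stand in for the two byte orders of the encoding scheme.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above hex FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x10000)
print(f"{high:04X} {low:04X}")                     # D800 DC00

# Encoding scheme: the same code values serialized in two byte orders.
print(chr(0x10000).encode("utf-16-le").hex(" "))   # 00 d8 00 dc
print(chr(0x10000).encode("utf-16-be").hex(" "))   # d8 00 dc 00
```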
Since most applications use only a small subset of Unicode, encoding schemes (like UTF-8 and UTF-16) and character maps (like ASCII) provide efficient ways to represent Unicode characters in computer storage or communications by using short binary words. Some of these simple encodings use data compression techniques to represent a large repertoire with a smaller number of codes.
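To make the point about short binary words concrete, this small sketch compares how many bytes a few sample characters occupy in UTF-8 and UTF-16. The sample characters are arbitrary choices for illustration.

```python
samples = ["A", "\u00e9", "\u20ac", chr(0x10000)]  # A, e-acute, euro sign, U+10000

for ch in samples:
    utf8_len = len(ch.encode("utf-8"))
    utf16_len = len(ch.encode("utf-16-le"))
    print(f"U+{ord(ch):04X}: {utf8_len} byte(s) in UTF-8, {utf16_len} in UTF-16")
```

Characters in the ASCII range take a single byte in UTF-8, while rarely used characters cost more, so text that stays within a small subset of Unicode remains compact.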
Popular character encodings
- ISO 646
- ASCII
- EBCDIC
- ISO 8859:
- ISO 8859-1, ISO 8859-2, ISO 8859-3, ISO 8859-4, ISO 8859-5, ISO 8859-6, ISO 8859-7, ISO 8859-8, ISO 8859-9, ISO 8859-10, ISO 8859-11, ISO 8859-13, ISO 8859-14, ISO 8859-15, ISO 8859-16
- DOS character sets:
- CP437, CP737, CP850, CP852, CP855, CP857, CP858, CP860, CP861, CP863, CP865, CP866, CP869
- Windows character sets:
- Windows-1250
- Windows-1251 for Cyrillic alphabets
- Windows-1252
- Windows-1253
- Windows-1254
- Windows-1255 for Hebrew
- Windows-1256 for Arabic
- Windows-1257
- Windows-1258 for Vietnamese
- KOI8-R, KOI8-U, KOI7
- ISCII
- VISCII
- Big5
- HKSCS
- Guobiao
- GB2312
- GB18030
- ISO 2022, Shift-JIS, EUC
- Unicode (and subsets thereof, such as the 16-bit 'Basic Multilingual Plane'). See UTF-8
See also
- :Category:Character encoding — articles related to character encoding in general
- :Category:Character sets — articles detailing specific character encodings
- Mojibake — character set mismap.
External links
- [http://www.iana.org/assignments/character-sets Character sets registered by Internet Assigned Numbers Authority]
- [http://www.unicode.org/unicode/reports/tr17/ Unicode Technical Report #17: Character Encoding Model]
- [http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id= SIL's freeware fonts, editors and documentation] See SIL
- [http://www.ibm.com/software/globalization/icu/demo/converters ICU Converter Explorer]
- [http://czyborra.com/charsets/cyrillic.html The Cyrillic Charset soup]
- [http://homepages.cwi.nl/~dik/english/codes/stand.html Early history of character set standardization]
- [http://www.i18nguy.com/unicode/codepages.html Character Sets And Code Pages At The Push Of A Button]
- [http://www.cs.mcgill.ca/~aelias4/encodings.html A complete introduction to Japanese character encodings]
- [http://www.cs.tut.fi/~jkorpela/chars.html A tutorial on character code issues]
Punctuation
Punctuation marks are written symbols that correspond neither to phonemes (sounds) of a spoken language nor to lexemes (words and phrases) of a written language, but which serve to organize or clarify written language. See orthography.
The rules of what punctuation marks should be used in what circumstances vary with language, location and time. The rules are constantly evolving and certain aspects of punctuation are style — the author's choice. An English language bibliography may be found at the end of this article.
Commonly used punctuation marks
Some common examples used by English and other languages using the Roman alphabet are listed below (with their Unicode preferred names, where appropriate).
Because of the limited number of characters available in ASCII, many of these punctuation characters have also been given specialized meanings in computer programs composed on ASCII keyboards. The dot and commercial at in e-mail addresses are examples of this kind of use. See the individual articles.
The individual articles listed below include information on use and misuse in English and provide examples:
- apostrophe ( ' ), ( ’ )
- bracket - i.e., parentheses (aka round brackets) ((, )), square brackets ([, ]), curly brackets (aka braces) ({, }), and angle brackets (⟨, ⟩)
- colon (:)
- comma (,)
- dash – i.e., figure dash (‒), en dash (–), em dash (—), and quotation dash (―)
- ellipsis or suspension points (...)
- exclamation mark (!) (aka bang)
- full stop or period (.)
- hyphen (-), (‐)
- interrobang (‽) (symbol resembles a question mark laid over an exclamation mark)
- question mark (?)
- quotation marks (British English: inverted commas) and guillemets ('; ‘, ’; "; “,”; ‹, ›)
- semicolon (;)
- slash or solidus (/)
- space between words to provide interword separation. Because the interword space has no mark, it is arguably not a "written symbol", but it clearly serves to organize and clarify Latin script writings.
The following typographical symbols or glyphs are not true punctuation marks:
- ampersand (&)
- asterisk (*)
- asterism (⁂)
- bullet (•)
- at (@)
- currency (¤)
- dagger or obelisk (†) and double dagger (‡)
- number sign (#) – aka pound sign, hash, crosshatch, octothorp, etc.
- prime ( ′ )
- tilde or swung dash (~)
- underscore ( _ )
- vertical bar (|)
- greater than sign ( > )
- less than sign ( < )
- section sign ( § )
- pilcrow (¶)
Also related are diacritical marks (or diacritics), which serve to distinguish among similar sounds using the same primary letter symbol, or to clarify emphasis or tone.
Each script, and each language within a script, can have its own set of punctuation marks and usage conventions.
Chinese and Japanese use a different set of punctuation marks from Western languages. These only came into use relatively recently, the ancient forms of these languages having no punctuation at all. Traditional poetry and calligraphy maintains this punctuation-free style.
Nearly all of the punctuation marks used are larger than their Western counterparts, and occupy a square area that is the same size as the characters around them. These punctuation marks are called "fullwidth" to contrast them from "halfwidth" Western punctuation marks.
Japanese and Traditional Chinese can be written horizontally or vertically, while Simplified Chinese is rarely written vertically. Some punctuation marks adapt to this change in direction: the parentheses, curved brackets, square quotation marks (Japanese and Traditional Chinese), book title marks (Chinese), ellipsis mark, dash, and wavy dash (Japanese) all rotate themselves 90 degrees when used in vertical rather than horizontal text. The three underline-like punctuation marks in Chinese (proper noun mark, wavy book title mark, and emphasis mark) rotate and shift to the left side of the text in vertical script. (Shifting to the right side of the text is also possible, but this is outmoded and can clash with the placement of other punctuation marks.)
Major differences between Western and Chinese/Japanese punctuation marks include:
- Some punctuation marks are similar in use to their equivalent Western ones. The only difference is in size: they are fullwidth instead of halfwidth:
- ！ is the exclamation mark (!).
- ？ is the question mark (?).
- ； is the semicolon (;).
- ： is the colon (:).
- （） are curved brackets or parentheses (()).
- 【】 are square brackets ([]).
- Other punctuation marks are more different, whether in shape or usage:
- The Chinese and Japanese full stop is a fullwidth small circle (。). In horizontally-written Japanese the full stop is placed in the same position as it would be in English; in vertical writing it is placed below and to the right of the last Character. In Chinese the full stop is always after the last character.
- In Japanese and Traditional Chinese, the double and single quotation marks are fullwidth 『 』 and 「 」. The double quotation marks are used when embedded within single quotation marks: 「...『...』...」.
- In Traditional Chinese, Western-style quotation marks, “” and ‘’ can also be used for horizontal texts. In Simplified Chinese, only the Western-style quotation marks are used. Here, the single quotation marks are used when embedded within double quotation marks: “...‘...’...”. These quotation marks are fullwidth in printed matter, but share the same codepoints as the Western quotation marks in Unicode, so they require a Chinese-language font to be displayed correctly.
- In Chinese, the fullwidth comma (，) has the same shape as the Western comma. In Japanese, the fullwidth comma (、) is shaped like a teardrop with the narrow sharp end pointing top-left and round end pointing bottom-right; it may be depicted on your computer in another font.
- Chinese also has a repetition comma, which must be used instead of the regular comma when separating words constituting a list. It is identical to the Japanese fullwidth comma (、). In Japanese, either the regular fullwidth comma (、) or a fullwidth middle dot (・) is used for this purpose.
- Both Chinese and Japanese use a middle dot to separate words in a foreign name, since native first and last names in Chinese or Japanese are not separated using any punctuation or spaces. For example, "Leonardo da Vinci" in Simplified Chinese: "列奥纳多·达·芬奇", in Japanese: "レオナルド・ダ・ヴィンチ". Japanese always uses the fullwidth middle dot (・). In Chinese, the middle dot is also fullwidth in printed matter, but the halfwidth middle dot (·) is used in computer input, which is then rendered as fullwidth in Chinese-language fonts.
- For emphasis, Chinese and Japanese use emphasis marks instead of italic type. Each emphasis mark is a single dot (in Chinese) or dash (in Japanese) placed under each character to be emphasized (for vertical text, the dot is placed to the left hand side of each character). Although frequent in printed matter, emphasis marks are rare online, as they cannot be represented as plain text, are not supported by HTML and most word processors, and are otherwise inconvenient to input. In Japanese, these emphasis marks are called bōten or wakiten.
- For book titles, Chinese uses fullwidth double book title marks, 《 book title》, and fullwidth single book title marks, 〈book title〉. The latter is used when embedded within the former: 《...〈...〉...》; in Traditional Chinese, the latter is also used for articles in or sections of a book. In Japanese, book titles are marked out using double quotation marks 『 』. (Italic type is never used in Chinese or Japanese.)
- A proper noun mark (an underline) is occasionally used in Chinese, such as in teaching materials and some movie subtitles. For consistency in style, a wavy underline (﹏﹏) is used instead of the regular book title marks whenever the proper noun mark is used in the same text. When the text runs vertically, the proper noun mark is written as a line to the left of the characters (to the right in some older books).
- In Chinese, the ellipsis is written with six dots (not three) occupying the same space as two characters (……) in the center of the line. Similarly, the dash is written so that it occupies the space of two characters (——) in the center of the line. There should be no breaking in the line. The Japanese ellipsis is also properly written as six dots, not three.
- When connecting two words to signify a range, Chinese generally uses a fullwidth dash occupying the space of one character (—, e.g. 1月—7月 "January to July"), while Japanese generally uses a fullwidth wavy dash occupying the space of one character (~, e.g. 1月~7月 "January to July"). The wavy dash is also sometimes used in Chinese and Korean.
- Whilst Western languages use a narrow space between each letter, and a wider space between words, Chinese and Japanese use a narrow space both between characters and between words. In this way it somewhat resembles the scriptio continua of ancient Greek and Latin.
- There are a small number of exceptions. In Japanese, a fullwidth space is often used where a colon or comma would be used in English: 大和銀行 大阪支店 (Yamato Bank, Osaka Branch). The fullwidth space is extremely rare in modern-day Chinese, but in archaic usage it may be used as an honorific marker. A modern example, found in Taiwan, is that of referring to Chiang Kai-shek as 先總統 蔣公 (Late President, Lord Chiang), where the space is an honorific marker for 蔣公; this use is also still current in very formal letters or other old-style documents. (The full width space is also sometimes used purely for spacing purposes.)
- Also, when Chinese is written entirely in Hanyu Pinyin or when Japanese is written entirely in kana, spaces are always introduced to assist in reading.
- Japanese uses iteration marks, the most common of which being 々, to indicate a repeated character. Chinese uses the iteration mark in informal or calligraphic writing, but never in careful writing or printed matter.
- There is no equivalent of the apostrophe in Chinese or Japanese.
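The fullwidth/halfwidth distinction in the list above is also recorded in the Unicode character database, which Python exposes through the standard `unicodedata` module. The following is a minimal sketch; the sample characters are chosen for illustration.

```python
import unicodedata

# East Asian Width property: F = fullwidth, W = wide, Na = narrow.
for ch in "\uff01!\u3002":  # fullwidth exclamation mark, ASCII "!", ideographic full stop
    print(f"U+{ord(ch):04X}", unicodedata.east_asian_width(ch))

# Compatibility normalization (NFKC) folds fullwidth forms to their ASCII counterparts.
fullwidth = "\uff01\uff1f\uff1b\uff1a\uff08\uff09"
print(unicodedata.normalize("NFKC", fullwidth))  # !?;:()

# The fullwidth forms sit at a fixed offset from the ASCII characters they mirror.
print(hex(ord("\uff01") - ord("!")))  # 0xfee0
```

Marks with no Western counterpart, such as the ideographic full stop, are left unchanged by this normalization.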
Korean, the third member language of CJK, currently uses mostly Western punctuation.
Like Classical Chinese, traditional Mongolian employed no punctuation at all. Now that it uses the Cyrillic alphabet, its punctuation is similar, if not identical, to that of Russian.
Other scripts
In ancient forms of Roman script, the interpunct served to separate words.
Ethiopian languages, including Amharic, Tigrinya, Ge'ez, and Afaan Oromo, make use of the following punctuation marks:
- space (፡) (resembles an English colon)
- comma (፣) (resembles an English colon with a line on top)
- sentence end (።) (resembles four dots at the corners of an imaginary square)
- semicolon (፤) (resembles an English colon with two small horizontal lines, one above and one below)
- colon (፥) (resembles an English colon with a small horizontal line between the dots)
- preface colon (፦) (resembles an English colon with a small horizontal line between the dots but more to the right than in the semicolon)
- question mark (፧) (three dots in a vertical line)
- paragraph separator (፨) (seven dots: three in a vertical line flanked by two vertical lines of two dots each, appearing as the corners of a hexagon with a dot in the center)
See also [http://www.omniglot.com/writing/ethiopic.htm Ethiopic Script].
Originally Sanskrit had no punctuation. From the 1600s onward, Sanskrit and Marathi, both written in the Devanagari script, used the vertical bar (|) to end a line of a verse and double vertical bars (||) to end the verse.
Arabic — written from right to left — uses a reversed question mark: ؟.
Legal issues
A patent has been granted for two new punctuation marks, the question comma and the exclamation comma. [http://v3.espacenet.com/textdoc?DB=EPODOC&IDX=WO9219458&F=0]
Further reading
- Eats, Shoots & Leaves: The Zero Tolerance Approach to Punctuation - Lynne Truss (Profile Books 2003 ISBN 1861976127)
- Punctuation - Robert Allen (Oxford University Press 2002)
- The King's English: a guide to modern usage - Kingsley Amis (HarperCollins 1997)
- The King's English - H. W. Fowler (Clarendon Press 1906)
- Plain Words: a guide to the use of English - Ernest Gowers ( HMSO 1948)
- Pause and Effect: An Introduction to the History of Punctuation in the West - M.B. Parkes (University of California Press 1993)
See also
- Emoticon
- Typographical syntax
- Japanese typographic symbols contains a list of Japanese punctuation and explanation of usage.
External links
- [http://www.unicode.org/charts/PDF/U2000.pdf Unicode reference glyphs for general punctuation]
- [http://www.unicode.org/charts/PDF/U3000.pdf Unicode reference glyphs for CJK symbols and punctuation]
- [http://www.unicode.org/charts/PDF/UFE30.pdf Unicode reference glyphs for CJK compatibility forms]
- [http://www.unicode.org/charts/PDF/UFE50.pdf Unicode reference glyphs for small form variants]
- [http://www.unicode.org/charts/PDF/UFF00.pdf Unicode reference glyphs for halfwidth and fullwidth forms]
- [http://home.chkpcc.net/~chi/PUNCTUATIONA.htm 標點符號的種類 (Types of Punctuation Marks)] Chinese punctuation marks and their names (In Chinese)
- [http://www.cmi.hku.hk/Ref/Article/article08/ 中華人民共和國國家標準標點符號用法 (The People's Republic of China's National Standards on the Usage of Punctuation Marks)] (In Chinese)
- [http://www.sf.airnet.ne.jp/~ts/japanese/punctuation.html Japanese Punctuation Marks]
- [http://www.kwiznet.com/p/showCurriculum.php Grammar & Punctuation Learning Resource]
Category:Diacritics
Category:Punctuation
Grapheme
A grapheme designates the atomic unit in written language. Graphemes include letters, Chinese ideograms, numerals, punctuation marks, and other symbols.
In a phonological orthography a grapheme corresponds to one phoneme. In spelling systems that are non-phonemic — such as the spellings used most widely for written English — multiple graphemes may represent a single phoneme. These are called digraphs (two graphemes for a single phoneme) and trigraphs (three graphemes). For example, the word ship contains four graphemes (s, h, i, and p) but only three phonemes, because sh is a digraph. An example of a trigraph is the tch in itch.
Different glyphs can represent the same grapheme. For example, the minuscule letter a can be seen in two variants, with a hook at the top, and without. Not all glyphs are graphemes; for example the logogram ampersand (&) represents the Latin word et (English word and), which contains two phonemes.
See also
- Digraph (orthography)
- Trigraph (orthography)
- Allograph (orthography)
- Tilde
Category:Linguistics
Extended ASCII
The term extended ASCII (or high ASCII) describes eight-bit or larger character encodings that include the standard seven-bit ASCII characters as well as others. The use of the term has sometimes been criticized, because it can be misread as implying that the ASCII standard has been updated to include more than 128 characters, which is untrue.
Motives for extending
Because the number of written symbols used in common natural languages far exceeds the limited range of the ASCII code, many extensions to it have been used to facilitate handling of those languages. Markets for computers and communication equipment outside English-speaking countries were historically open long before standards bodies had time to deliberate upon the best way to accommodate them, so there are many incompatible proprietary extensions to ASCII.
Since ASCII is a seven-bit code and most computers manipulate data in eight-bit bytes, many extensions use the additional 128 codes available by using all eight bits of each byte. This helps include many languages otherwise not easily representable in ASCII, but still not enough to cover all languages of countries in which computers are sold, so even these eight-bit extensions had to have local variants.
Proprietary extensions
Various proprietary extensions appeared on non-EBCDIC mainframe and mini-computers, especially in universities. Commodore microcomputers added many graphic symbols to their non-standard ASCII (PETSCII, based on the original ASCII standard of 1963). IBM introduced eight-bit extended ASCII codes on the original IBM PC and later produced variations for different languages and cultures. IBM called such character sets code pages and assigned numbers both to those it invented itself and to many invented and used by other manufacturers. Accordingly, character sets are very often indicated by their IBM code page number. In ASCII-compatible code pages, the lower 128 characters maintained their standard US-ASCII values, and different pages (or sets of characters) could be made available in the upper 128 characters. DOS computers built for the North American market, for example, used code page 437, which included accented characters needed for French, German, and a few other European languages, as well as some graphical line-drawing characters. The larger character set made it possible to create documents in a combination of languages such as English and French, but not, for example, in English and Greek (which required code page 737).
A set with fewer characters but more letter and diacritic combinations was used by the Digital VT-220 terminal, based on draft versions of an ISO standard that was being developed.
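The way code pages share the lower half and diverge in the upper half can be seen by decoding the same bytes under two of them. This sketch assumes Python's `cp437` and `cp737` codecs as stand-ins for the IBM code pages named above; the sample bytes are arbitrary.

```python
raw = bytes([0x41, 0x80, 0x82])  # one ASCII byte followed by two upper-half bytes

# The ASCII byte decodes identically; the upper-half bytes depend on the code page.
print(raw.decode("cp437"))  # A, then C-cedilla and e-acute
print(raw.decode("cp737"))  # A, then Greek capital alpha and gamma

# The lower 128 positions keep their US-ASCII values in both pages.
lower = bytes(range(128))
print(lower.decode("cp437") == lower.decode("cp737") == lower.decode("ascii"))
```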
ISO 8859 and proprietary adaptions
Eventually, ISO released this standard as ISO 8859 describing its own set of eight-bit ASCII extensions. The most popular was ISO 8859-1, also called ISO Latin1, which contained characters sufficient for the most common Western European languages.
Variations were standardized for other languages as well: ISO 8859-2 for Eastern European languages and ISO 8859-5 for Cyrillic languages, for example.
One notable way in which ISO character sets differ from code pages is that the character positions 128 to 159, corresponding to ASCII control characters with the high-order bit set, are specifically unused and undefined in the ISO standards, though they had often been used for printable characters in proprietary code pages, a breaking of ISO standards that was almost universal.
Microsoft later created code page 1252, a compatible superset of ISO 8859-1 with extra characters in the ISO unused range.
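A quick way to see the superset relationship is to compare the two encodings byte by byte. This is a sketch using Python's `latin-1` and `cp1252` codecs; note that Python's `cp1252` codec leaves a handful of positions in the 128 to 159 range undefined, so only selected bytes from that range are decoded here.

```python
# From 160 (hex A0) upward the two encodings agree on every byte.
same_upper = all(
    bytes([b]).decode("cp1252") == bytes([b]).decode("latin-1")
    for b in range(0xA0, 0x100)
)
print(same_upper)  # True

# In the 128-159 range, ISO 8859-1 yields control characters,
# while code page 1252 places printable characters there.
for b in (0x80, 0x8A, 0x97):
    print(hex(b), repr(bytes([b]).decode("latin-1")), repr(bytes([b]).decode("cp1252")))
```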
Code page 1252 is the standard character encoding of western European language versions of Microsoft Windows, including English versions.
ISO 8859-1 is the common character encoding used by the X Window System, and most Internet standards.
The Apple Macintosh, under Mac OS X, currently uses Unicode as its default encoding. Under Mac OS, it used MacRoman.
Input methods
One problem with eight-bit codes is that computer keyboards were originally designed for seven-bit ASCII, and users became accustomed to them. Different manufacturers have solved this problem in different ways, most by using additional shift-type keys labelled "Alt" or "Meta", and sometimes by interpreting multi-keystroke sequences. Also, MS-DOS allowed the user to enter any character by typing its three-digit code point while holding down the Alt key. While it did allow users to take advantage of the full MS-DOS code page 437 character set, it was difficult to remember and caused problems when users switched to other character sets (including Microsoft's switch to code page 1252 beginning with Windows 3.0). To ease the transition, two slightly different numeric entry methods are available in Microsoft Windows: typing a three-digit number with the Alt key down enters a character from code page 437; typing a four-digit number (beginning with 0) will enter the character from code page 1252. For example, code page 437 used code point 151 for the lowercase u with grave accent (ù). Typing Alt+151 on a Windows machine will actually produce the character at code point 249 from code page 1252, which is where code page 1252 has ù. If you want the character that is in position 151 of code page 1252 (which is the em dash, —), you must type Alt+0151. But various international keyboard drivers are available through the Control Panel in which the right Alt-key can be used to select alternate characters without this cumbersome typing of numbers. The user can easily switch between multiple keyboards. The free utility [http://allchars.zwolnet.com AllChars] is also available.
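The Alt-code example above amounts to a lookup in one code page followed by a re-encoding in another. The sketch below retraces it with Python's `cp437` and `cp1252` codecs; the variable names are illustrative only.

```python
# Alt+151: code point 151 is looked up in code page 437 ...
typed = bytes([151]).decode("cp437")
print(typed)                          # u with grave accent

# ... and the resulting character lives at a different position in code page 1252.
print(typed.encode("cp1252")[0])      # 249

# Alt+0151: code point 151 is looked up directly in code page 1252.
direct = bytes([151]).decode("cp1252")
print(f"U+{ord(direct):04X}")         # U+2014, the em dash
```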
Character set confusion
Because these ASCII extensions have so many variants, it is necessary to identify which set is being used for a particular text for it to be interpreted correctly. However, because the most-used characters (those in ASCII, the seven-bit code points) are common to all sets--even most proprietary ones like the Macintosh--failure to correctly identify a character set often suffers no adverse consequences if the user is typing in English. Further, because many Internet standards use ISO 8859-1, and because Microsoft Windows (using the code page 1252 superset of ISO 8859-1) is the dominant operating system for personal computers today, unannounced use of ISO 8859-1 is quite commonplace, and should generally be assumed without evidence to the contrary.
In many protocols, most importantly e-mail and HTTP, the character encoding of content has to be tagged with IANA-assigned character set identifiers.
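The effect of misidentifying the character set can be reproduced in a few lines. In this sketch a short text is written in code page 1252 and then read as if it were code page 437; the sample words are arbitrary.

```python
accented = "caf\u00e9"   # contains one character outside ASCII
plain = "cafe"           # ASCII only

# Bytes written under one code page, read back under another.
print(accented.encode("cp1252").decode("cp437"))  # last letter comes out as a Greek theta
print(plain.encode("cp1252").decode("cp437"))     # cafe (unchanged)

# With the correct label, the round trip is lossless.
print(accented.encode("cp1252").decode("cp1252") == accented)  # True
```

This is why untagged English text usually survives a wrong guess, while text with accented letters does not.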
Unicode
A proposal called Unicode was made in 1991 to address many of these problems, and is now widely accepted. Unicode reserves 1,114,112 (= 2^20 + 2^16) code points, and currently assigns characters to more than 96,000 of those code points. The first 256 codes precisely match those of ISO 8859-1. The majority of the 96,000 code points are, at this time, used for Chinese and Korean characters.
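The size of the code space quoted above can be verified with simple arithmetic. The decomposition into 17 planes of 65,536 code points is a standard Unicode fact rather than something stated in the text.

```python
total = 2**20 + 2**16
print(total)                    # 1114112

# Equivalent views of the same number: 0 through hex 10FFFF, or 17 planes of 2**16.
print(total == 0x10FFFF + 1)    # True
print(total == 17 * 2**16)      # True
```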
External links
- [http://www.i18nguy.com/unicode/codepages.html Character Sets and Code Pages at the Push of a Button]
- [http://allchars.zwolnet.com AllChars Utility for Windows]
- [http://developer.apple.com/intl/ Apple's page about internationalization support for Mac OS X]
- [http://www.unicode.org/ Unicode]
Category:Character sets
Combining diacritical mark
Combining characters are characters that are intended to modify other characters. The best known combining characters (at least to westerners) are the Combining diacritical marks (including combining accents). In Unicode the main block of combining diacritics for European languages and the International Phonetic Alphabet is U+0300–U+036F. Combining diacritical marks are also present in many other blocks of Unicode characters. In Unicode, diacritics are always added after the main character. It is possible to add several diacritics to the same character.
Unicode also contains many precomposed characters, so in many cases either combining diacritics or precomposed characters can be used, at the user's or application's choice. This leads to a requirement to perform Unicode normalisation before comparing two Unicode strings, and to design encoding converters carefully so that every valid way of representing a character in Unicode is mapped correctly to a legacy encoding, avoiding data loss. For example, windows-1258 uses combining diacritics whilst VISCII has a large selection of precomposed characters, so a converter that uses a simple mapping between code values and Unicode code points will corrupt text when converting between them.
On most computer systems today, the combining diacritics are added to main characters by simple superposition of glyphs. The results are usually far from perfect. A better strategy is to use algorithms that select precomposed glyphs for such combinations. This is possible with OpenType fonts.
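The need for normalisation before comparison can be demonstrated with Python's standard `unicodedata` module. This is a minimal sketch; the helper name `same_text` is hypothetical, and NFC is only one of several normalisation forms that could be used for the comparison.

```python
import unicodedata

precomposed = "\u00e9"   # e with acute as a single code point
combining = "e\u0301"    # e followed by COMBINING ACUTE ACCENT

# The two strings display the same but differ code point by code point.
print(precomposed == combining)  # False


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalising both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_text(precomposed, combining))                        # True
print(unicodedata.normalize("NFD", precomposed) == combining)   # True
```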
See also
- Diacritic
External links
- [http://www.unicode.org/charts/PDF/U0300.pdf Combining diacritics chart] (in Adobe PDF format)
- [http://www.unics.uni-hannover.de/nhtcapri/temp/combimarks.html combining marks] testpage facing combined and precomposed letters
- [http://www.alanwood.net/unicode/combining_diacritical_marks.html Alan Wood’s Unicode Resources]
Category:Unicode
Windows-1258
Windows-1258 is a codepage used in Microsoft Windows to represent Vietnamese text. It makes use of combining diacritical marks. Windows-1258 is not compatible with VISCII. It is very similar to windows-1252, the differences being that s-caron and z-caron (which were added to windows-1252 later) are missing, four of the letters with diacritics have been replaced by combining diacritics, and a few other letter/diacritic combinations have been replaced.
Use of combining diacritics means that windows-1258 can cover the large number of letter/diacritic combinations in Vietnamese without compromising coverage of control codes or symbols.
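The consequence for software is that a Vietnamese letter with two diacritics has no single byte in windows-1258 and must be supplied as a base letter plus a combining mark. The sketch below uses Python's `cp1258` codec and one sample letter; a real converter would need a decomposition step tailored to this code page, which is not shown here.

```python
import unicodedata

precomposed = "\u1ebf"           # e with circumflex and acute, one code point
base_plus_mark = "\u00ea\u0301"  # e with circumflex, then COMBINING ACUTE ACCENT

# The precomposed letter has no position in the code page.
try:
    precomposed.encode("cp1258")
    direct_ok = True
except UnicodeEncodeError:
    direct_ok = False
print(direct_ok)  # False

# The base letter and the combining mark each occupy one byte.
encoded = base_plus_mark.encode("cp1258")
print(len(encoded))  # 2

# Decoding and normalising to NFC recovers the precomposed letter.
round_trip = encoded.decode("cp1258")
print(unicodedata.normalize("NFC", round_trip) == precomposed)  # True
```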
Codepage layout
Only the upper half (128–255) of the table is shown, the lower half (0–127) being plain ASCII. Combining diacritical marks are shown applied to a plus sign, e.g. +̀, and the cells they are in also have a pink background. Other differences from windows-1252 are indicated with a light blue background. NBSP and SHY refer to the non-breaking space and soft hyphen respectively.
External links
- http://www.microsoft.com/globaldev/reference/sbcs/1258.htm
Category:Character sets
Category:Windows code pages
Category:Character sets
The category of character sets includes articles on specific character encodings (see the article for a precise definition). This includes coded character sets, character encoding forms, character encoding schemes, and character maps (historically called character sets or code pages), and even includes those that use non-numeric, pre-digital codes, such as electrical impulses. This category does not include unencoded character repertoires like the Windows Glyph List 4 or any of the articles in List of alphabets.
Articles pertaining to character encoding in general, or encoding in general, may be found in the parent categories, :Category:Character encoding and :Category:Encodings.
Much of this terminology is standardized in [http://www.unicode.org/unicode/reports/tr17/ Unicode Technical Report #17] and [http://anubis.dkuug.dk/jtc1/sc2/wg2/docs/tr15285:1998.pdf ISO/IEC TR 15285:1998].
Category:Character encoding
Category:Encodings
Silly String
Silly String is a child's toy. An aerosol can dispenses a stream of product that quickly sets into a flexible, brightly-colored plastic string. Thus, one can "shoot" a seemingly-endless strand of Silly String from the aerosol can.
Category:Toys