Cynthia Duca: Week 5 - Character Encoding and Questions

Introduction
Character Encoding

XML supports different character encodings. These are needed for text in languages other than English, such as German or Arabic, whose characters are usually not covered by ASCII.

What is ASCII?

This stands for American Standard Code for Information Interchange. ASCII defines 128 characters, numbered from 0 to 127. Each one fits in 7 bits and is normally stored in a single byte. ASCII gives text a standard representation, which makes it easier to transfer data from one computer to another.
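As a small illustration of the 0 to 127 range, here is a minimal Python sketch (Python and the helper name ascii_code are only used for illustration and are not part of the coursework). It looks up the code of a character and shows that a character outside ASCII cannot be encoded.

```python
def ascii_code(ch):
    """Return the ASCII code of a single character, or None if it is not ASCII."""
    try:
        return ch.encode("ascii")[0]
    except UnicodeEncodeError:
        return None

print(ascii_code("A"))        # 65
print(ascii_code("\u00fc"))   # None, because the German letter u-umlaut is not in ASCII
```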

What is Unicode?

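Unicode serves the same purpose as ASCII but covers a far larger range of characters. Two common encodings of Unicode are UTF-8 and UTF-16, which define how Unicode characters are stored as bytes. UTF-8 stores the ASCII characters using the same single bytes as ASCII, so plain ASCII text is already valid UTF-8.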
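The difference between the two encodings can be seen by counting bytes. This is a minimal sketch in Python; the helper name byte_lengths is made up for the example, and UTF-16 is taken in its big-endian form without a byte order mark.

```python
def byte_lengths(text):
    """Return (UTF-8 length, UTF-16 length) in bytes for the given text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))

print(byte_lengths("A"))        # (1, 2)
print(byte_lengths("\u00fc"))   # (2, 2) for the German letter u-umlaut

# ASCII text produces exactly the same bytes in UTF-8.
print("A".encode("utf-8") == "A".encode("ascii"))   # True
```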

Quick Questions
Question 1
This is a smiley <:-/>. Is it also a well-formed XML document? Say why?
Answer
The above syntax is well-formed XML. It is an empty element whose name is ":-", and it is closed properly with "/>". According to the W3C specification, an element name must start with a letter, an underscore or a colon. It cannot start with a digit or any other punctuation, and names beginning with "xml" are reserved. A hyphen is allowed after the first character. Starting a name with a colon is not banned, but it is discouraged because the colon is reserved for namespaces, and a namespace-aware parser may reject this name. So this is well-formed XML, but it is not good practice.
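The naming rules can be tried out with a parser. The sketch below uses Python's xml.parsers.expat module with namespace processing left off; the function name is_well_formed is made up for this example, and the result for the smiley is left for the reader to run.

```python
import xml.parsers.expat

def is_well_formed(text):
    """Return True if expat parses text without error (no namespace processing)."""
    parser = xml.parsers.expat.ParserCreate()   # no namespace_separator given
    try:
        parser.Parse(text, True)
        return True
    except xml.parsers.expat.ExpatError:
        return False

print(is_well_formed("<_note/>"))   # True: a name may start with an underscore
print(is_well_formed("<1note/>"))   # False: a name may not start with a digit
print(is_well_formed("<:-/>"))      # the smiley from the question
```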

Question 2

What is the difference between well-formed and valid XML?

Answer

A well-formed XML document is one which follows the syntax rules defined in XML 1.0 (Fifth Edition), such as having a single root element, proper nesting and legal element names.

A valid XML document is well-formed and also conforms to its DTD. A document can therefore be well-formed without being valid, but it cannot be valid without being well-formed.
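The two levels of checking can be sketched in Python. The standard library parser only checks well-formedness and does not validate against a DTD, so in this sketch a single hand-written rule stands in for the DTD; the note, to and body element names and the check function are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule standing in for the DTD line <!ELEMENT note (to, body)>
EXPECTED_CHILDREN = ["to", "body"]

def check(text):
    """Return (well_formed, valid) for a <note> document under the rule above."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return (False, False)   # not well-formed, so it cannot be valid
    children = [child.tag for child in root]
    valid = root.tag == "note" and children == EXPECTED_CHILDREN
    return (True, valid)

print(check("<note><to>Ann</to><body>Hi</body></note>"))   # (True, True)
print(check("<note><body>Hi</body><to>Ann</to></note>"))   # (True, False)
print(check("<note><to>Ann</body></note>"))                # (False, False)
```

A real validating parser would read the rule from the DTD itself instead of a hard-coded list.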

Question 3
Is it a good idea to start an XML document with a comment, explaining what the document is and what it's from? Say why.
Answer

Comments do not interfere with any part of the code, and it is good to put them throughout a document to explain it. If you pass the document on to someone else, they will be able to understand what is going on. However, a comment must not be placed before the XML declaration. If a declaration is present it has to be the very first thing in the document, and a parser will report an error if a comment comes before it. So an opening comment is a good idea, but it should go straight after the declaration rather than at the very top of the document.
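The effect of the comment's position can be checked with Python's standard library parser. This is a minimal sketch; the doc element and the parses helper are invented for the example.

```python
import xml.etree.ElementTree as ET

def parses(text):
    """Return True if the standard library parser accepts the document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

comment_first = '<!-- A phrase book --><?xml version="1.0"?><doc/>'
declaration_first = '<?xml version="1.0"?><!-- A phrase book --><doc/>'

print(parses(comment_first))       # False: the declaration is not at the start
print(parses(declaration_first))   # True
```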

Longer Questions

Question 1

A set of documents is to be constructed as follows. The type of document is a college textbook. Every college textbook has a title page, on which is a title and an author and the publisher; optionally, there may be an aphorism. Every college textbook has a title page verso, on which is a publisher’s address, a copyright notice, an ISBN; there may be a dedication, or there may be more than one. Every college textbook has several chapters, and each chapter has several sections, and each section has several bodies of text. A chapter is identified by a chapter number and a chapter title. A section is identified by a section number and a section title. The name of the publisher will always be Excellent Books Ltd. The address of the publisher will always be 21 Cemetry Lane, SE1 1AA, UK. The application that will process the documents can accept Unicode. Write a .dtd file for this specification.

Answer

<?xml version="1.0" encoding="utf-8"?>

<!-- Fixed values: documents refer to them as &publisherName; and &publisherAddress; -->

<!ENTITY publisherName "Excellent Books Ltd">

<!ENTITY publisherAddress "21 Cemetery Lane, SE1 1AA, UK">

<!ELEMENT college_textbook (title_page, page_verso, chapter+)>

<!ELEMENT title_page (title, author, publisher, aphorism?)>

<!ELEMENT page_verso (pub_add, copyright, isbn, dedication*)>

<!ELEMENT chapter (section+)>

<!ELEMENT section (bodytext+)>

<!ATTLIST chapter chapter_no CDATA #REQUIRED chapter_title CDATA #REQUIRED>

<!ATTLIST section section_no CDATA #REQUIRED section_title CDATA #REQUIRED>

<!ELEMENT title (#PCDATA)>

<!ELEMENT author (#PCDATA)>

<!ELEMENT publisher (#PCDATA)>

<!ELEMENT aphorism (#PCDATA)>

<!ELEMENT pub_add (#PCDATA)>

<!ELEMENT copyright (#PCDATA)>

<!ELEMENT isbn (#PCDATA)>

<!ELEMENT dedication (#PCDATA)>

<!ELEMENT bodytext (#PCDATA)>

Question 2

Write an XML document that contains the following information: the name of a London tourist attraction. The name of the district it is in. The type of attraction it is (official building, art gallery, park etc). Whether it is in-doors or out-doors. The year it was built or founded [Feel free to make this up if you don’t know]. Choose appropriate tags. Use attributes for the type of attraction and in-doors or out-doors status.

Answer
<?xml version="1.0"?>
<attractions>
  <attraction type="Gallery Museum" set="indoor">London Arts</attraction>
  <districtName>Oxford</districtName>
  <yearBuilt>1752</yearBuilt>
</attractions>
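To check that the attributes can be read back, here is a minimal Python sketch. It uses a copy of the answer's data written with straight quotation marks and the spelling "Museum"; the variable names and the parses helper are invented for the example. The last line shows that curly quotation marks around an attribute value are not accepted by the parser.

```python
import xml.etree.ElementTree as ET

doc = ('<?xml version="1.0"?><attractions>'
       '<attraction type="Gallery Museum" set="indoor">London Arts</attraction>'
       '<districtName>Oxford</districtName><yearBuilt>1752</yearBuilt>'
       '</attractions>')

root = ET.fromstring(doc)
attraction = root.find("attraction")
print(attraction.get("type"))       # Gallery Museum
print(attraction.get("set"))        # indoor
print(root.findtext("yearBuilt"))   # 1752

def parses(text):
    """Return True if the standard library parser accepts the document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# \u201d is the curly closing quotation mark that word processors insert.
print(parses('<attraction set=\u201dindoor\u201d/>'))   # False
```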

Question 3a
The following is the document element (root element) of an XML document.
It’s clear that it’s concerned with English phrases and their Russian translations. One of the start tags is <targLangPhrase> with </targLangPhrase> as its end tag. Why do you suppose this isn’t <russianPhrase> with </russianPhrase> ?

Answer
This file is concerned with English to Russian translations. However, a generic name is more useful: the same structure can hold translations into other languages without the need to change any element names. All that needs to change is the attribute on the <phraseBook> element.

Question 3b
Write a suitable prolog for this document.

Answer
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE phraseBook SYSTEM "phraseBook.dtd">

Question 3c
Write a .dtd file to act as the document type description for this document.

Answer
<?xml version="1.0" encoding="utf-8"?>

<!ELEMENT phraseBook (section+)>

<!ATTLIST phraseBook targLang CDATA #REQUIRED>

<!ELEMENT section (sectionTitle, phraseGroup+)>

<!ELEMENT sectionTitle (#PCDATA)>

<!ELEMENT phraseGroup (engPhrase, translitPhrase, targLangPhrase)>

<!ELEMENT engPhrase (#PCDATA | gloss)*>

<!-- gloss is assumed to hold plain text -->
<!ELEMENT gloss (#PCDATA)>

<!ELEMENT translitPhrase (#PCDATA)>

<!ELEMENT targLangPhrase (#PCDATA)>

Question 3d

The application that is to use this document runs on a Unix system, and was written some years ago. Is that likely to make any difference to the XML declaration?

Answer
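If the encoding is set to "UTF-8" then there should be no compatibility problem, provided the application's parser accepts UTF-8, which every conforming XML processor must. UTF-8 can represent all the characters found in the KOI8-R character set, an 8-bit Cyrillic encoding that is native to many older Unix systems, and it is fully compatible with 7-bit ASCII. If the older application can only read KOI8-R, the declaration would have to say encoding="KOI8-R" instead.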
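The relationship between the two encodings can be sketched in Python, which ships a codec named koi8_r. The sample word is chosen only for illustration.

```python
# A Russian word of three Cyrillic letters, written with escapes.
word = "\u043c\u0438\u0440"

koi8 = word.encode("koi8_r")   # legacy 8-bit encoding: one byte per letter
utf8 = word.encode("utf-8")    # UTF-8: two bytes per Cyrillic letter

print(len(koi8), len(utf8))    # 3 6

# Both byte strings decode back to the same text.
print(koi8.decode("koi8_r") == utf8.decode("utf-8"))    # True

# Plain ASCII text is byte-for-byte identical in UTF-8.
print("abc".encode("utf-8") == "abc".encode("ascii"))   # True
```

The bytes differ between the two encodings, which is why the encoding named in the XML declaration has to match the encoding the file is actually saved in.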

Wednesday, 23 November 2011
