Markup Languages
A markup language is a system for noting the attributes of a document. Historically, the term "markup" has been used to refer to the process of marking manuscript copy for typesetting, usually with directions for the use of type fonts and sizes, spacing, indentation, and other formatting features. In the electronic era, "markup" refers to the sequence of characters or other symbols that are inserted within a text or word processing file to describe the document's logical structure or indicate how the document should appear when it is displayed or printed. (Notation entered with the intention of describing logical properties is usually referred to as descriptive markup, whereas notation concerned with formatting is referred to as procedural markup.)
Unlike programming languages, which are dynamic and process data through various calculations, markup languages are static. In essence, a markup language identifies similar units of information within a document, bringing a form of instructed intelligence to a document so that applications may read and process it more effectively.
Efforts to devise electronic markup languages evolved initially along two distinct lines. Proprietary software developers, such as Microsoft, focused largely on procedural markup schemes, expressed in application-specific language and offering functions similar to printers' marks. Their efforts were concerned mainly with the quality and economy of presentation. Interest in descriptive markup languages was motivated by several factors, including the realization that the extent to which electronic documents may be manipulated depends largely on the extent and sophistication of the treatment of logical structures. There was also an awareness that common methods of treatment enhance communication, and a recognition that it would be simpler and less expensive to build backward-compatible systems if archived documents had been marked under a standardized system.
This interest grew dramatically with the advent and rapid expansion of the World Wide Web, a system in which publication and information exchange are based largely on the Hypertext Markup Language (HTML), an open, application-neutral markup language. Application-neutral markup languages have become increasingly important in the design of network-aware applications in recent years. This is in part because proprietary developers have begun to accommodate the interests of users who want to create web documents and take advantage of the other capabilities inherent in application-neutral schemes, and in part because increasing bandwidth has allowed programmers to consider the Internet as a computational environment.
Today, the most important markup languages are the Standard Generalized Markup Language (SGML), the Hypertext Markup Language (HTML), and the Extensible Markup Language (XML).
SGML
The Standard Generalized Markup Language (SGML) is "a set of rules for defining and expressing the logical structure of documents thereby enabling software products to control the searching, retrieval, and structured display of those documents," as noted on the Encoded Archival Description Official Web Site. SGML was developed at IBM in the late 1960s by a group of programmers charged with the development of an integrated information system. Led by Charles Goldfarb, the team rejected the idea of application-specific coding, opting instead for an open scheme of descriptive tags capable of accommodating the requirements of different types of documents and different computer platforms. Known first as the Generalized Markup Language (GML), SGML was expanded in its scope and further developed under the auspices of the American National Standards Institute (ANSI) and the International Organization for Standardization (ISO). It was adopted as an international standard (ISO 8879) in 1986.
Originally intended to be a method for creating interchangeable, structured documents, SGML became instead a framework for developing more specific markup languages, based largely on its implementation of the concept of a formally defined document type definition (DTD), with an explicit, nested element structure.
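The idea of an explicit, nested element structure can be illustrated with a minimal sketch. The "memo" document type, its element names, and the nesting_errors helper below are hypothetical, invented only for illustration. The check is also far simpler than a real DTD, which constrains the order and number of child elements as well; Python's standard library does not validate documents against DTDs, so this sketch only tests which elements may appear inside which.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model for a "memo" document type. Each element
# name maps to the set of child elements it may contain. A DTD would
# express the first entry roughly as: <!ELEMENT memo (to, from, body)>
MEMO_MODEL = {
    "memo": {"to", "from", "body"},
    "to": set(),
    "from": set(),
    "body": {"para"},
    "para": set(),
}


def nesting_errors(xml_text, model):
    """Return messages for elements that violate the nesting model."""
    errors = []

    def visit(element):
        if element.tag not in model:
            errors.append(f"undeclared element <{element.tag}>")
            return
        for child in element:
            if child.tag not in model[element.tag]:
                errors.append(
                    f"<{child.tag}> not allowed inside <{element.tag}>"
                )
            visit(child)

    visit(ET.fromstring(xml_text))
    return errors
```

A document whose elements nest as the model allows produces an empty list, while a paragraph placed directly inside the memo element is reported as an error.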
HTML
The Hypertext Markup Language is an SGML Document Type Definition (DTD) that was designed specifically for the World Wide Web. In essence, HTML is a set of markup codes inserted in a file that note logical structures and instruct a web browser how to display a web page's words and images.
Under HTML, markup elements are expressed in pairs to indicate when a structure or display effect begins and ends. For example,
<p> Now is the time for all good men to come to the aid of their country.</p>
instructs a web browser to treat the sentence as a paragraph. Under the HTML 4.01 version of the Hypertext Markup Language, the paragraph may be formatted through an enhancement known as "inline styling." The expression below
<p STYLE="font-family:Garamond; font-size:12pt; text-align:justify"> Now is the time for all good men…
renders the paragraph as justified text, using the font and point size specified for display.
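To see how software might read paired tags and an inline style, consider the following minimal sketch, which uses Python's standard html.parser module. The class and function names (ParagraphCollector, parse_style, collect_paragraphs) are assumptions made for this illustration. The code collects paragraph text and style declarations only; it is not a description of how a web browser renders a page.

```python
from html.parser import HTMLParser


def parse_style(style):
    """Split 'name:value; name:value' into a dictionary."""
    result = {}
    for declaration in style.split(";"):
        if ":" in declaration:
            name, value = declaration.split(":", 1)
            result[name.strip().lower()] = value.strip()
    return result


class ParagraphCollector(HTMLParser):
    """Collect the inline style and text of each <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # list of (style_dict, text) tuples
        self._style = None    # style of the <p> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # The parser lower-cases tag and attribute names.
            self._style = parse_style(dict(attrs).get("style") or "")
            self._text = []

    def handle_data(self, data):
        if self._style is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._style is not None:
            text = "".join(self._text).strip()
            self.paragraphs.append((self._style, text))
            self._style = None


def collect_paragraphs(html):
    """Return (style, text) pairs for every closed paragraph."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

Feeding this sketch a styled paragraph yields one pair: a dictionary of style declarations such as font-family and text-align, and the paragraph text between the opening and closing tags.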
The original goal of the World Wide Web was to create an information space in which hypertext links could be made from one document to another without the need to navigate any hierarchical organization of documents. HTML was devised because the web's designers wanted to create a system that would facilitate communication among users, and that would do so whether the user had a dumb terminal or a workstation running a graphical user interface (GUI). They concluded that a common and simple descriptive language was needed, deciding early on that the creation of an SGML DTD incorporating a limited and basic syntax would be the most effective way to manage documents under their system.
Over the years HTML was expanded and refined, culminating in HTML 4.01, which included enhancements for greater support of forms, tables, and style sheets. But as the World Wide Web grew in size and in the sophistication of the demands placed on it, the limitations of HTML became increasingly evident and problematic.
The main problem is that HTML provides a single way of describing the information in a document. It is not extensible and cannot be customized, so it cannot be adapted to meet special needs (such as mathematical notation, chemical formulas, or proprietary, vendor-specific tags that would extend capabilities), and it has many formatting limitations. Most important of all, HTML does not deal with content or semantics.
Developers concluded that the best way to solve these problems was to abandon the continued improvement of HTML and create a new markup language. The result is the Extensible Markup Language (XML).
XML
The Extensible Markup Language (XML) is a subset of SGML, whose purpose is "to enable generic SGML to be served, received, and processed on the web" through a system of notation that is unlimited and self-defining. Although XML has been designed to be compatible with HTML (as well as SGML), it is not another single, predefined markup language. It is a meta-language—a language for describing other languages—for designing markup.
From a functional perspective, the most significant difference between HTML and XML is that whereas HTML can describe only the logical structure of a document (p, for a paragraph, is an example), XML permits notation indicating the content of a structure (author is an example), because XML enables authors and editors to create DTDs that conform to the more specific requirements of a document. This capability also means that an XML document, or a portion of its contents, can be processed purely as data by a program or stored with similar data on another computer.
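The following minimal sketch shows what it means to process an XML document purely as data. The book, title, author, and year element names are hypothetical tags of the kind an author might define, and book_record is a helper invented for this illustration; parsing uses Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the element names say what the content is,
# not how it should look.
BOOK_XML = (
    "<book>"
    "<title>The SGML Handbook</title>"
    "<author>Charles F. Goldfarb</author>"
    "<year>1990</year>"
    "</book>"
)


def book_record(xml_text):
    """Turn a <book> document into a plain dictionary of its fields."""
    root = ET.fromstring(xml_text)
    return {
        "title": root.findtext("title"),
        "author": root.findtext("author"),
        "year": int(root.findtext("year")),
    }
```

Because the tags name the content, a program can pull out the author or the year directly, which an HTML paragraph tag alone would not allow.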
Coordinated mainly by the World Wide Web Consortium, XML is under continuing development. Owing to its extensibility, XML has engendered a substantial number of adjunct specifications, including:
- Document Object Model, which is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure, and style of documents;
- XML Query, which is intended to provide flexible query facilities to extract data from real and virtual documents on the web and support full interaction between the web and server-side databases, including databases of XML files;
- XPath, which is a language for addressing parts of an XML document;
- XSL, which is a language for stylesheets intended to support the presentation of XML documents and the specification of formatting semantics.
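Two of these specifications can be tried with Python's standard library, as in the minimal sketch below. The catalog document and both function names are assumptions made for illustration. Note also that xml.etree.ElementTree supports only a limited subset of XPath and that xml.dom.minidom is a minimal DOM implementation, so this is a rough approximation of the specifications rather than a full demonstration of them.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Hypothetical catalog used only for this illustration.
CATALOG = (
    "<catalog>"
    "<book year='1990'><title>The SGML Handbook</title></book>"
    "<book year='2000'><title>The XML Handbook</title></book>"
    "</catalog>"
)


def titles_for_year(xml_text, year):
    """Address parts of a document with an XPath-style expression."""
    root = ET.fromstring(xml_text)
    path = f"./book[@year='{year}']/title"
    return [title.text for title in root.findall(path)]


def retitle_first_book(xml_text, new_title):
    """Read and update a document through the DOM interface."""
    dom = minidom.parseString(xml_text)
    title = dom.getElementsByTagName("title")[0]
    title.firstChild.data = new_title  # replace the text node's content
    return dom.documentElement.toxml()
```

The first function selects titles by the value of the year attribute; the second changes the text of the first title element and returns the updated document as a string.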
A number of important XML applications have already been developed. Of them, the most important is XHTML, which is a reformulation of HTML 4.01 in XML that effectively treats HTML as an XML application. Its main purpose is to build a bridge for web designers from HTML to XML. It is also intended to establish a modular standard capable of supporting the provision of "richer web pages" to the increasingly wide range of browser platforms that now includes cellular phones, televisions, cars, wireless communicators, and kiosks, as well as desktop computers.
XHTML 1.0 has been formulated in three variant DTDs. XHTML Strict supports strictly structural markup, with formatting available through the use of the World Wide Web Consortium's Cascading Style Sheets (CSS) language to set the font, color, and other layout effects. XHTML Transitional enables authors to retain some of HTML's presentation features, so that documents may be successfully addressed by older browsers without support for CSS. XHTML Frameset replaces HTML Frames for use when an author wants to partition a browser window into two or more frames.
Another important XML application is the Mathematical Markup Language (MathML). MathML, which can be used to encode both mathematical notation and mathematical content, provides a rich vocabulary for documents with sophisticated mathematical content: 30 MathML tags describe abstract notational structures, while another 150 tags provide the means of specifying the intended meaning of a particular expression. Yet another important XML application is the Synchronized Multimedia Integration Language (SMIL). SMIL defines an XML-based language that allows authors to write interactive multimedia presentations. Using SMIL, an author can describe the temporal behavior of a multimedia presentation, associate hyperlinks with media objects, and describe the layout of the presentation on a screen.
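One practical consequence of treating HTML as an XML application is that an XHTML document must be well formed, so a generic XML parser can read it. The minimal sketch below illustrates the point with Python's standard xml.etree.ElementTree module. The two fragments and the is_well_formed helper are assumptions made for illustration, and the check covers well-formedness only, not validity against any of the three XHTML DTDs.

```python
import xml.etree.ElementTree as ET

# XHTML style: lower-case names, quoted attribute values, and every
# element closed, including the empty <br /> element.
XHTML_FRAGMENT = (
    '<p style="text-align:justify">First line<br />second line</p>'
)

# Older HTML style: unquoted attribute value and unclosed elements.
HTML_FRAGMENT = "<P STYLE=text-align:justify>First line<BR>second line"


def is_well_formed(markup):
    """Return True if an XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False
```

The XML parser accepts the first fragment and rejects the second, even though a typical web browser would display both.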
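As a small illustration of the MathML vocabulary, the sketch below builds presentation markup for a base raised to an exponent, such as x squared. The msup, mi, and mn elements and the namespace URI come from the MathML specification; the superscript function is a hypothetical helper, and the sketch covers notation only, not the content tags that specify meaning.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"


def superscript(base, exponent):
    """Return presentation MathML for base raised to exponent."""
    math = ET.Element("math", xmlns=MATHML_NS)
    msup = ET.SubElement(math, "msup")         # superscript construct
    ET.SubElement(msup, "mi").text = base      # identifier, e.g. x
    ET.SubElement(msup, "mn").text = exponent  # number, e.g. 2
    return ET.tostring(math, encoding="unicode")
```

Calling superscript("x", "2") returns a math element containing an msup element whose two children are the identifier and the number.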
Future Directions
A few things about the future of markup languages are clear. First, extensible languages will be dominant, owing to the extent to which they may be enhanced and customized. Second, in order to accommodate the increasing number of portable devices connected to the web and their more modest computational and display capabilities, markup languages will necessarily be modular. And, third, because XML's data description capabilities afford the opportunity to replace web browsers with more powerful applications, such as the applications that make up Microsoft Office, it seems likely that XML and its successors will shift the principal motif of web-based computing from the browser to productivity suites and other client-based applications.
see also Document Processing; E-commerce; Hypermedia and Multimedia; Hypertext; World Wide Web.
Christinger Tomer
Bibliography
Aitken, Peter G. XML: The Microsoft Way. Boston: Addison-Wesley, 2002.
Bryan, Martin. SGML and HTML Explained. Reading, MA: Addison-Wesley Longman, 1997.
Goldfarb, Charles F. The SGML Handbook. New York: Oxford University Press, 1990.
Goldfarb, Charles F., and Paul Prescod. The XML Handbook, 2nd ed. Upper Saddle River, NJ: Prentice Hall, 2000.
Graham, Ian S., and Liam Quin. XML Specification Guide. New York: Wiley, 1999.
St. Laurent, Simon. XML: A Primer. Foster City, CA: M&T Books, 1999.
Travis, Brian E., and Dale C. Waldt. The SGML Implementation Guide: A Blueprint for SGML Migration. New York: Springer, 1995.
Vint, Danny R. SGML at Work. Upper Saddle River, NJ: Prentice Hall, 1999.
Internet Resources
"Extensible Markup Language (XML) 1.0 (Second Edition); W3C Recommendation 6 October 2000." W3C—World Wide Web Consortium Website. <http://www.w3.org/TR/REC-xml>
Library of Congress, Network Development & MARC Standards Office. "Development of the Encoded Archival Description Document Type Definition." Encoded Archival Description Official Web Site. <http://lcweb.loc.gov/ead/eadback.html>