Summary of MARBI meetings at the ALA conference in San Francisco.


I have omitted a few proposals which seemed of minimal interest to our group; my report to ARLIS Update will include a full list of all the proposals brought before the committee.

Proposal 97-10

Proposes a preferred technique for encoding MARC data using a repertoire of characters from the universal character set, ISO 10646 (UCS), to which the existing USMARC character sets have been mapped.

The MARBI Character Set Subcommittee has reached consensus on mapping most of the characters in the existing USMARC sets to unique character code values in the universal set. Round-trip mapping of the ASCII clones (mostly numbers, punctuation marks, and special symbols) was problematic, however, because the basic USMARC character set for each script includes its own set of ASCII clones, while the universal set unifies these into a single repertoire. Mapping of USMARC ASCII clones to the universal set would therefore be "many-to-one," making exact reversibility back into USMARC impossible.
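A minimal sketch of the round-trip problem, in Python, with hypothetical byte values and set names standing in for the real USMARC assignments: two script-specific encodings of the same ASCII clone land on one universal code point, so the reverse mapping cannot tell them apart.

    usmarc_to_ucs = {
        ("basic-latin", 0x31): 0x0031,    # "1" in the Latin set's repertoire
        ("basic-arabic", 0x31): 0x0031,   # the Arabic set's ASCII-clone "1"
    }
    ucs_to_usmarc = {}
    for (charset, byte), ucs in usmarc_to_ucs.items():
        ucs_to_usmarc.setdefault(ucs, []).append((charset, byte))
    print(ucs_to_usmarc[0x0031])   # two candidates: the round trip is ambiguous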

The Subcommittee defined two options:

1. Map USMARC ASCII clones to a unified repertoire in the universal set

This keeps USMARC in sync with the rest of the world, which will be using UNICODE for software, printers, etc. But the original USMARC encodings for ASCII clone characters might be lost after conversion to UNICODE unless an adequate algorithm could be developed.

2. Precede each ASCII clone by a script flag character defined in private use space

This facilitates conversion back to USMARC, but creates a "library dialect" of the universal character set. Records created originally in systems using the universal set would also have to carry the script flag characters defined in private use space before they could travel into USMARC.
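A sketch of option 2, again in Python; the flag code points below are stand-ins in private use space, not actual assignments.

    LATIN_FLAG = "\uE001"    # hypothetical: "next clone came from the Latin set"
    ARABIC_FLAG = "\uE002"   # hypothetical: "next clone came from the Arabic set"

    def flag_clone(char, flag):
        """Precede an ASCII clone with its script flag so it can round-trip."""
        return flag + char

    # The "1" from the Arabic set can now travel back into USMARC intact --
    # but any non-library system will stumble over the private use character.
    flagged = flag_clone("1", ARABIC_FLAG)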

During the discussion, many subsidiary or related problems connected with UNICODE emerged. For example, UNICODE places non-spacing marks, such as diacritics, after the base letter, while USMARC places the marks before the letters. This practice goes back to the old printers: the software put the diacritic mark before the letter to tell the printer not to advance the platen a space but hold it there, allowing the base letter to print in the same column as the diacritic. It's much easier for modern systems to handle a mark after the base letter, hence the UNICODE placement. But library systems are set up to handle the USMARC placement; how will they cope with the new method--or a mixture of the two methods, since UNICODE and USMARC will have to co-exist peacefully for many years to come? Don't expect the conversion to occur overnight, or even in this century. Another interesting point: UNICODE characters are sixteen bits wide, while USMARC characters are eight. That doubles the size of your database (eek!).
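To illustrate the ordering difference, here is a rough sketch of the reordering a USMARC-to-UNICODE conversion routine would have to perform (my own illustration, not code from the proposal):

    import unicodedata

    def usmarc_order_to_unicode(chars):
        """Move each run of combining marks from before its base letter to after it."""
        out, pending = [], []
        for c in chars:
            if unicodedata.combining(c):   # a non-spacing mark
                pending.append(c)
            else:                          # base letter: emit it, then its marks
                out.append(c)
                out.extend(pending)
                pending = []
        out.extend(pending)                # stray marks with no base letter
        return "".join(out)

    # acute + "e" in USMARC order becomes "e" + acute in UNICODE order
    print(usmarc_order_to_unicode(["\u0301", "e"]))   # prints: é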

STATUS: Neither of the options for dealing with ASCII clones is completely satisfactory. The Committee agreed to appoint a task force to investigate further, since there are huge financial implications for vendors, utilities, etc., and it would be folly to stake all on one method of processing until we are sure we have chosen the best method.

Proposal 97-11

Proposes defining new subfields to accommodate codes for subentities below the country level in USMARC records: a new subfield in field 043 (Geographic Area Code) and two new subfields in field 044 (Country of Publishing/Producing Entity Code).

As USMARC use expands, more and more countries would like to be able to code for subentities (states, regions, provinces) under the country level. This proposal would enable those who wished to be more specific to use ISO geographic codes in fields 043 and 044. As well as promoting the international exchange of records (a Good Thing in itself), this could facilitate the kind of research that is directed towards regions or subareas (in other words, if you're doing research on Tuscany or on Burgundy, you could search for the codes for those regions, instead of being limited to searching by the country code).
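As an illustration only (the subfield letter below is my own invention, not quoted from the proposal), the idea is to pair an ISO subentity code with the existing country-level geographic area code:

    record_043 = {
        "a": "e-it---",  # USMARC geographic area code for Italy
        "c": "IT-52",    # ISO 3166-2 code for Tuscany (hypothetical subfield)
    }

    def matches_region(field_043, iso_code):
        """A researcher on Tuscany can match the region code directly."""
        return field_043.get("c") == iso_code

    print(matches_region(record_043, "IT-52"))   # True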

STATUS: Proposal accepted

Discussion Paper 100:

Discusses several characteristics--language, script, transliteration, nationality--that might be coded in authority records.

LANGUAGE: Most cataloging codes specify the language(s) to be used for additions to headings, notes, and, in some cases, the headings themselves. The favored language varies from country to country. To facilitate international exchange of authority records, it has been suggested that authority records contain an indication of which language was used to construct the headings. Several European libraries want this done at the heading level, so they can pick and choose among the variant name forms and use the language they want to display for a particular purpose or clientele (for example, a French library viewing an authority record created by the British Library for "Song of Roland" might opt for displaying the French language cross-reference, "Chanson de Roland," as the main heading). Alternatively, the authority record might simply indicate the language of the catalog; that would give libraries a clue as to whether they wanted to use that authority record, or look for records which identify their preferred language as the language of the catalog.
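A sketch of what heading-level language tagging would make possible; field names here are invented for illustration.

    heading_forms = [
        {"heading": "Song of Roland", "lang": "eng"},
        {"heading": "Chanson de Roland", "lang": "fre"},
    ]

    def pick_display_form(forms, preferred_lang):
        """Display the form tagged with the library's preferred language."""
        for form in forms:
            if form["lang"] == preferred_lang:
                return form["heading"]
        return forms[0]["heading"]   # fall back to the record's first form

    print(pick_display_form(heading_forms, "fre"))   # Chanson de Roland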

SCRIPT: Some agencies need to record headings and tracings in more than one script because of different orthographies; it might be useful to tag the headings, so that other libraries could display the script they preferred. The same thing might be done with vernacular/transliterated forms of a heading; at present, you can't indicate the particular transliteration scheme (e.g. Wade-Giles or Pinyin) used for the various headings. If the headings were tagged, other agencies could choose the headings using the transliteration scheme they preferred, or go with the vernacular form.

NATIONALITY: Some cataloging agencies outside the US have suggested that authority records should indicate the nationality of persons, so that a specific country's authors can be identified. US practice has never advocated this, because the information would be difficult to determine and was not seen as particularly valuable for retrieval. An IFLA Working Group on Authority Data Elements regards field-level information on language of heading, script, and transliteration scheme as particularly important, but also supports inclusion of nationality.

Sherman Clarke and Liz O'Keefe spoke up in support of including this information on authority records. Nationality is crucial in art research (as one of the MARBI committee members pointed out, it is impossible to classify a book about an artist if the nationality of the artist is unknown). Curatorial and visual resource files typically store this information as a separate data element, but they do not always have separate authority files, so the information often resides in the bibliographic records themselves. Storing this data on the authority record would be much more efficient.

No one at the meeting objected to the idea--provided that inclusion of the data is optional. It could be carried in the 008 and/or in a variable field (if multiple nationalities needed to be recorded). Some people felt that the nationality should be represented by a code, rather than free text, to facilitate international exchange (otherwise you have to search for "German" or "allemagne" or "duits" or "tedesco" or ...).
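A small sketch of why a code beats free text for exchange; the code value "gw" follows MARC country-code style and is illustrative only.

    FREE_TEXT_VARIANTS = {"German", "allemagne", "duits", "tedesco"}

    def is_german_free_text(value):
        return value in FREE_TEXT_VARIANTS   # and hope the list is complete

    def is_german_coded(value):
        return value == "gw"                 # one value, exchange-safe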

STATUS: The issues raised by this paper are so complex that several more meetings will be required to resolve the various points.

Discussion Paper 101:

Suggests the inclusion in the Holdings Format of additional note fields that contain copy-specific information but which are currently defined only in the USMARC Bibliographic Format.

Certain notes that are by nature copy-specific (e.g. provenance, source of acquisition, copy and version identification, restrictions on access) are defined only for the bibliographic format. This leads to awkwardly worded, unclear notes (e.g. "Library copy 1 [whichever that is] is presentation copy from the author" or "Copy 3, printed on vellum, is for use only in the rare book room").

Several attendees liked the idea of moving copy-specific data to the Holdings Format. But there are problems. Rare book catalogers and archivists use copy-specific notes a lot; how will the change affect them? Will this make the $5 subfield, used by rare book catalogers for local information, obsolete? There is a difference between local information and copy-specific information, but it is sometimes hard to draw the line. And what are the system implications? Some systems do not as yet handle the USMARC Holdings format; many do not index the Holdings fields or allow searches; and one attendee noted that Z39.50 doesn't deal well with Holdings data.

STATUS: Held over for future discussion/consultation

Discussion Paper 102:

Presents problems and solutions for dealing with non-filing characters associated with variable field data in USMARC records.

There was general agreement that something needs to be done about non-filing characters in fields such as the 246 (Variant title), 505 (Contents note), and subfield $t within author-title tracings (not a good word was to be heard for the current situation). Various techniques were discussed.

--Omission of initial articles. At present, this is the method used to ensure correct indexing; but it violates the principle of transcription, and often looks witless.

--Indicators. Currently used for many title fields. They let the creator of the data decide how many characters to ignore in filing, and they do not introduce extraneous graphic or control characters. But some variable fields have no position available for a non-filing indicator; 9 is the largest value an indicator can express; and indicators cannot identify characters to be ignored in other parts of a field, e.g. at the end. (A sketch contrasting this technique with the control character approach follows this list.)

--System recognition of articles. Doesn't work well, because there are so many words that can be either an article or a number ("ein", "un", "una", etc.), and so many words that can be an article in one language and a noun in another ("an" means "year" in French).

--Special subfield for non-filing characters. Separates data that belongs together conceptually and bibliographically ("The" is part of the title of "The hunting of the Snark," even though we don't want to file it under "T.")

--Graphic characters. Some German systems use the SPACING UNDERSCORE to set off non-filing characters. This pollutes the regular cataloging data, and it may be hard to find a graphic character that is never used as part of legitimate cataloging data (the underscore is used now in Internet addresses and file names).

--Special control characters to set off the non-filing characters. This was the proposal that met with the most interest. ISO 6630 (Bibliographic control set) defines control characters of this type. They would allow the demarcation of non-filing characters anywhere in a field, without adding extraneous characters that might be confused with cataloging data. The downside: special characters require system implementation that affects hardware, software, and existing data; they are not mappable to universal character set encodings; and their introduction would mean using two different methods of indicating non-filing characters (unless we got rid of non-filing indicators, which most people seemed reluctant to do).
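Here is the sketch contrasting the two main techniques; the control characters below are stand-ins, not the actual ISO 6630 assignments.

    import re

    NSB = "\u0088"   # stand-in for a "non-sorting begin" control character
    NSE = "\u0089"   # stand-in for a "non-sorting end" control character

    title = "The hunting of the Snark"

    # Technique 1: a non-filing indicator value of 4 means "skip the first
    # four characters when filing" -- but only at the start of the field.
    filing_key = title[4:]                               # "hunting of the Snark"

    # Technique 2: control characters bracket the non-filing text, so any
    # part of a field can be marked; filing strips the bracketed spans.
    marked = NSB + "The " + NSE + "hunting of the Snark"
    filing_key = re.sub(NSB + ".*?" + NSE, "", marked)   # "hunting of the Snark"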

The consensus was that the committee needs to confer with the major players. The issue is related to the implementation of UNICODE, which will change the order in which diacritic characters are linked to base characters. But there is light at the end of the tunnel; catalogers may have to find another pet peeve (there are plenty).


JOINT MEETING between CC:DA and MARBI on Metadata

The CC:DA/MARBI joint meeting about metadata featured a free-form, free-wheeling discussion of metadata and the relationship of the Dublin Core to AACR2-formulated records and other library formats. There is a website for metadata analysis, which includes MARC records mapped to Dublin Core, at

http://www.libraries.psu.edu/iasweb/personal/jca/dublin/ex1.htm
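As a flavor of what such a mapping looks like, here is a minimal, illustrative crosswalk; the field-to-element choices are my own shorthand, not the site's.

    marc_to_dublin_core = {
        "245": "Title",
        "100": "Creator",
        "650": "Subject",
        "260": "Publisher",    # loosely: the publisher name subfield
        "856": "Identifier",   # the resource's URL
    }

    def crosswalk(marc_record):
        """Map whatever MARC fields we recognize onto Dublin Core elements."""
        return {dc: marc_record[tag]
                for tag, dc in marc_to_dublin_core.items()
                if tag in marc_record}

    print(crosswalk({"245": "Alice's adventures in Wonderland",
                     "100": "Carroll, Lewis"}))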

A sampling of the points raised:

One speaker felt that anybody used to dealing with preliminary records could cope with Dublin Core records. But some speakers were uneasy about losing the sense of transcription (what are you using as the title of the resource? The title on the home page? The title in the HTML header? What is the chief source?). Some felt that title is not a major access point, and that Internet users look for author or subject; librarians are hung up on transcription because of the need to distinguish between many similar mass-produced objects. But others noted that Internet users do look for the titles they see cited on "Top Web Sites" lists or mentioned in articles or listservs. And titles are important for Internet resources such as collections of machine-readable texts (e.g. the Text Encoding Initiative).

Consistency of access is one area where catalogers can add value to metadata. Librarians should encourage the use of established lists of names and subjects, classification schemes, etc. Unfortunately, we cannot enforce the use of standards for metadata. We might be able to get some of the big search services, such as Altavista, to encourage the use of the Dublin Core.

The question arose of where to draw the line in cataloging Internet resources. Selectivity is the key. Don't catalog every footling attachment to a home page.

Liz O'Keefe
Pierpont Morgan Library
ARLIS/NA representative to the USMARC Advisory Group
eokeefe@morganlibrary.org

