'$RCSfile: eml-physical.xsd,v $'
Copyright: 2000 Regents of the University of California and the National Center for Ecological Analysis and Synthesis
For Details: http://knb.ecoinformatics.org/
'$Author: higgins $'
'$Date: 2002/04/21 22:45:30 $'
'$Revision: 1.11 $'

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

eml-physical

The eml-physical Module defines the structural characteristics of data formats as delivered over the wire or as found in a file system. One physical object (which can be a bytestream or an object in a file system) might contain multiple entities (for example, this is typical of an MS Access file that contains multiple tables of data). However, the module is typically used to describe a file or stream that is in some text-based format such as ASCII or UTF-8, and it includes the information needed to parse the data stream to extract the entity and its attributes from the stream.

Physical structure: Physical structure of an entity or entities. This generally is a detailed description of a text representation that shows how the columns and rows of a table are represented, or simply the name of a well-known binary or proprietary format (e.g., Microsoft Excel 2000).
The eml-physical module was introduced into EML 1.4 as eml-file.

Unique identifier: The unique identifier of this metadata file or object.
The identifier field provides a unique identifier for this metadata documentation. It will most likely be part of a sequence of numbers or letters that are meaningful in a larger context, such as a metadata catalog. That larger system can be identified in the "system" attribute. Multiple identifiers can be listed corresponding to different catalog systems.
Example: nceas.3.2
The 'identifier' field is derived from the eml-dataset meta_file_id field in EML 1.4.

Catalog system: The catalog system in which this identifier is used. This element gives the name of the catalog system in which this identifier is used. It is useful for determining the scope of the identifier and the semantics of the various subparts of the identifier. Unresolved issue: can or should this be a URI/URL pointing to the catalog system, or just the name?
Example: nceas.3.2
New to EML 2.0.

File format: Contains the name of the format for this file. This element contains the name of the file's format. The file's format is typically ASCII, Unicode, or some well-known binary format (e.g., Microsoft Excel 2000). It is recommended to include a complete MIME type here, such as image/jpeg or text/xml. Note that this is the format of the physical file itself.
Example: ASCII
The format element was introduced into EML 1.4.

This attribute gives the version of the format in use. For example, 'Excel' might be the format, with '3.1' being the version.
New to EML 2.0.

Citation is a simple reference that points to where a detailed description of the format can be found.
New to EML 2.0.

Character encoding: Contains the name of the character encoding used for the data. This element contains the name of the character encoding. This is typically ASCII or UTF-8, or one of the other common encodings.
Example: UTF-8
Introduced in EML 2.0.

Entity size: Describes the physical size of the entity. This element contains information about the physical size of the entity, typically in bytes.
Example: 13
The entitySize was introduced into EML 1.4.

Unit of measurement: Unit of measurement for the entity size, typically bytes. This element gives the unit of measurement for the size of the entity, and is typically bytes.
Example: 13
The unit was introduced into EML 1.4.

Authentication method: A value, typically a checksum, used to authenticate that the bitstream delivered to the user is identical to the original. This element describes authentication procedures or techniques, typically by giving a checksum method (e.g., MD5) and checksum value for the bytestream.
Example: f5b2177ea03aea73de12da81f896fe40
The authentication element was introduced into EML 1.4.

Authentication method: The method used to calculate an authentication checksum. This element names the method used to calculate an authentication checksum that can be used to validate a bytestream. Typical checksum methods include MD5 and CRC.
Example: f5b2177ea03aea73de12da81f896fe40
The authentication element was introduced into EML 1.4.

Entity's compression method: Name of the entity's compression method. This element describes any compression methods used to compress the entity, such as zip, compress, etc.
The compressed element was introduced into EML 1.4.

Encoding method: Method used for encoding the entity. This element describes the entity's encoding method, such as MIME base64 encoding or binhex encoding.
The encoded element was introduced into EML 1.4.

Header lines: Header lines in the entity. The number of header lines, or lines of other information, that precede the data.
Example: 3
The numHeaderLines element was introduced into EML 1.4.

Record delimiter character: Character used to delimit records. This element specifies the record delimiter character when the format is text. The record delimiter is usually a newline (\n) on UNIX, a carriage return (\r) on MacOS, or both (\r\n) on Windows/DOS. Multiline records are usually delimited with two line ending characters; on UNIX, for example, that would be two newline characters (\n\n).
Example: \n\r
The recordDelimiter element was introduced into EML 1.4.

Quote character: Character used to quote values for delimiter escaping. This element specifies a character to be used in the entity for quoting values so that field delimiters can be used within the value. This basically allows delimiter "escaping". The quoteCharacter is typically a " or '.
Example: "
The quoteCharacter element was taken from the NBII standard.

Literal character: Character used to escape other characters. This element specifies a character to be used for escaping character values so that the following character is treated as its literal value. This allows "escaping" for special characters like quotes, commas, and spaces when they aren't intended as a delimiter value. The literalCharacter is typically a \.
Example: \
Introduced in EML 2.0.

Start column: The starting column number for a fixed format attribute. FixedWidth fields have a set length, thus the end of the field can always be determined by adding the fieldWidth to the starting column number.
Example: any positive integer, see example in "delimiter" description
Introduced into EML 2.0.

Field width: FieldWidth specification for fixed field length. FixedWidth fields have a set length, thus the end of the field can always be determined by adding the fieldWidth to the starting column number.
Example: any positive integer, see example in "delimiter" description
The fieldWidth element was introduced into EML 1.4. Semantics changed to work identically to the NBII DTD.

Attribute delimiter: The end of the attribute (field) is delimited by a special character called a field delimiter. Variable width format fields (attributes) can vary in their field length, thus the end of the field is delimited by a special character called a field delimiter (typically a comma or a space).
Data sets are generally classified as fixedWidth format or variableWidth format, but we have determined that this is actually a per-field classification, because one may encounter fixedWidth fields mixed together with variableWidth fields in the same data file. In our encoding scheme, the start of each field is assumed to be the column after the last column of the previous field, or the first column if this is the first field in the dataset, unless the starting column is explicitly given using the "fieldStartColumn" element. The end column for each field is determined using either a special delimiter character, indicated using the "fieldDelimiter" element, or a fixed field length, indicated using the "fieldWidth" element. The delimiter for the last field in the data set can be omitted. variableWidth fields can vary in their field length, and the end of the field is delimited by a special character called a field delimiter, usually a comma or a tab character. fixedWidth fields have a set length, and so the end of the field can always be determined by adding the fieldWidth to the starting column number.

Here is an example. Assume we have the following data in a data set:

May,100aaaa,1.2,
April,200aaaa,3.4,
June,300bbbb,4.6,

The metadata indicating the physical layout of the 4 fields would include the following:

<fieldDelimiter>,</fieldDelimiter>
<fieldWidth>3</fieldWidth>
<fieldWidth>3</fieldWidth>
<fieldDelimiter>,</fieldDelimiter>

In a strictly fixed format file, the metadata would be slightly different:

May100aaaa1.2
Apr200aaaa3.4
Jun300bbbb4.6

<fieldWidth>3</fieldWidth>
<fieldWidth>3</fieldWidth>
<fieldWidth>4</fieldWidth>
<fieldWidth>3</fieldWidth>

or, one could explicitly describe the starting columns:

<fieldStartColumn>1</fieldStartColumn>
<fieldWidth>3</fieldWidth>
<fieldStartColumn>4</fieldStartColumn>
<fieldWidth>3</fieldWidth>
<fieldStartColumn>7</fieldStartColumn>
<fieldWidth>4</fieldWidth>
<fieldStartColumn>11</fieldStartColumn>
<fieldWidth>3</fieldWidth>

Example: comma, tab, white space, etc.
The delimiter element was introduced into EML 1.4. Semantics changed to work identically to the NBII DTD, and then modified to fit more cases.
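
The per-field scheme described above can be illustrated with a small Python sketch. It is not part of the schema and not a reference implementation: the `FieldSpec` class and the `split_record` function are hypothetical names introduced only for this illustration. The sketch assumes that column numbers are 1-based, that a delimited field ends just before its delimiter, and that the delimiter itself is consumed, so the next field starts on the column after it. The mixed layout used in the demo (delimiter, width 3, delimiter, delimiter) is this sketch's own choice, picked so that the result is unambiguous under those assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FieldSpec:
    """Hypothetical per-field layout record; names mirror the schema elements."""
    start_column: Optional[int] = None  # fieldStartColumn, 1-based (assumption)
    width: Optional[int] = None         # fieldWidth, for fixedWidth fields
    delimiter: Optional[str] = None     # fieldDelimiter, for variableWidth fields


def split_record(record: str, specs: List[FieldSpec]) -> List[str]:
    """Split one record into field values using per-field layout metadata."""
    values = []
    pos = 0  # 0-based index of the next unread character
    for spec in specs:
        # An explicit start column overrides the implied "next column" start.
        if spec.start_column is not None:
            pos = spec.start_column - 1
        if spec.width is not None:
            # Fixed width: end = start + width.
            end = pos + spec.width
            values.append(record[pos:end])
            pos = end
        elif spec.delimiter is not None:
            end = record.find(spec.delimiter, pos)
            if end == -1:
                # Delimiter missing (allowed for the last field): read to the end.
                values.append(record[pos:])
                pos = len(record)
            else:
                values.append(record[pos:end])
                pos = end + len(spec.delimiter)  # assumption: delimiter is consumed
        else:
            # Neither width nor delimiter given: treat as a final field.
            values.append(record[pos:])
            pos = len(record)
    return values


if __name__ == "__main__":
    fixed = [FieldSpec(width=3), FieldSpec(width=3),
             FieldSpec(width=4), FieldSpec(width=3)]
    print(split_record("May100aaaa1.2", fixed))

    mixed = [FieldSpec(delimiter=","), FieldSpec(width=3),
             FieldSpec(delimiter=","), FieldSpec(delimiter=",")]
    print(split_record("April,200aaaa,3.4,", mixed))
```

Under these assumptions the two calls yield `['May', '100', 'aaaa', '1.2']` and `['April', '200', 'aaaa', '3.4']`. Keeping the start column, width, and delimiter on each field rather than on the whole file is what lets one record mix fixedWidth and variableWidth fields.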