EDF data model and its representation using UML

F. Viallefond

February 4, 2004

1 Introduction

The ALMA Export Data Format (EDF) is structured using the data model that underlies the MeasurementSet (Kemball and Wieringa 2000). This model is built using concepts from the domain of relational data bases. The present note makes a number of remarks that need to be taken into consideration when representing the EDF data model in UML. Suggestions are made for adopting a comprehensive nomenclature for the names of attributes involving relationships between different entities in the model.

2 Context and Definitions

The structure of the data model is presented as an ensemble of tables, whose names are listed in the first column of Tab. 1.
These tables are composed of different sections, which make it possible to discriminate between different categories of attributes: the key, the non-key, the data-description and the data sections. This categorization gives the logic for the various relationships between the tables, the role of each attribute implying an association in a precisely defined manner. The network of all the relationships forms a conceptual schema. It is constructed following rules that offer the capabilities of a relational data base.

To make this presentation easier to understand and to avoid ambiguities, I provide definitions of some words used in this document.

Entity: This word is used to represent a table, the table reflecting a relation between a certain number of attributes. The entity is named by the table name. In object-oriented terms it may be considered as a class.

Attribute: This is a column. The relation reflected by the entity is defined by the ensemble of columns for that entity.

Tuple: A given row in the table. Each row provides an instantiation of the ensemble of attributes in the entity. In object-oriented terms it is an object.

Key: If the values of a minimal ensemble of attributes suffice to identify a unique tuple in the relation, a key can be assigned to this ensemble to identify it.

Association: An association is represented by a relation of the same name whose attributes are the keys of the entities that participate in the association, in addition to its own non-key attributes. This word comes from UML. Just as an entity is similar to a class, an association is also similar to a class. In UML an instance of such a class is called a link.

Link: In UML terminology it is an instance of an association.

Identifier: For associations to be possible, each of the entities that participates in the association needs to have a primary key. A primary key may be explicit or implicit. It is implicit when it corresponds to the position of a tuple in a sequence or, in other words, to the row number in the table. We define an identifier as the primary key of an entity.
For convenience we give these identifiers a name of the form entityName_ID; they are of type int.

Collection: It may be useful to consider an ensemble of tuples that are all members of the same entity. This ensemble defines a collection. In this respect a table is a collection, but a collection may also correspond to only a subset of the rows in a table. When several values have to be assigned to an attribute in a tuple, this attribute refers to a collection. There are several types of collections, in particular the set, the bag, the list and the array. Using collections requires stating the type of each of them; this is mandatory to fully describe the structure of the data base. In UML the cardinality is indicated to give the number of members in a collection.
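
The definitions above can be illustrated with a minimal Python sketch. It is not part of the EDF specification: the class name `Entity`, its methods and the sample attribute names and values are hypothetical, chosen only to show how a table, its tuples and an implicit identifier (the row number) relate to each other.

```python
class Entity:
    """A table: a named relation over a fixed list of attributes (columns)."""

    def __init__(self, name, attributes):
        self.name = name
        self.attributes = list(attributes)
        self.rows = []  # each row is one tuple of the relation

    def add(self, *values):
        """Append a tuple and return its implicit identifier (the row number)."""
        if len(values) != len(self.attributes):
            raise ValueError("one value per attribute is required")
        self.rows.append(tuple(values))
        return len(self.rows) - 1

    def get(self, identifier):
        """Return the tuple referenced by an implicit identifier, as a dict."""
        return dict(zip(self.attributes, self.rows[identifier]))


# Hypothetical content: attribute names and values are for illustration only.
antenna = Entity("ANTENNA", ["NAME", "DISH_DIAMETER"])
first_id = antenna.add("A01", 12.0)   # plays the role of ANTENNA_ID
second_id = antenna.add("A02", 12.0)
```

In this sketch the table itself is a collection of type list, and an identifier is simply an int that another table can store to refer to one tuple.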

3 EDF table properties to describe their inter-relations

With these definitions in mind, I now give the consequences, in terms of relations between a given EDF table and the other tables of the EDF data model, of the various possibilities for the section of that table in which a given identifier appears. Each of these possibilities defines the type of relation between the association under consideration and the entity in which the identifier is the primary key.
In the context of relational data bases it is not recommended (forbidden?) to have two tuples with the same values for their ensemble of key attributes. For this reason it is preferable to avoid the use of an implicit key in the case of associations. Hence, for those, the key section contains, in addition to the primary keys of the entities that participate in the association, the identifier corresponding to the tuple. This can be seen, for example, in the FEED or DOPPLER tables, which both reflect associations. In this context it is noted that e.g. the SYSCAL or CALWIDGET tables do not need a SYSCAL_ID or CALWIDGET_ID identifier in their key section, because these tables cannot contain two or more tuples with the same value of the key attribute TIME.
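
The rule that no two tuples may share the same values for their key attributes can be checked mechanically. The following is a minimal sketch under stated assumptions: the function name `duplicate_keys` is hypothetical, rows are modeled as plain dicts, and the SYSCAL-like sample values are invented for illustration.

```python
from collections import Counter


def duplicate_keys(rows, key_attributes):
    """Return the key values that are shared by more than one tuple."""
    counts = Counter(tuple(row[name] for name in key_attributes) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# Hypothetical SYSCAL-like rows; the values are for illustration only.
syscal_rows = [
    {"ANTENNA_ID": 0, "FEED_ID": 0, "SPECTRAL_WINDOW_ID": 1, "TIME": 100.0},
    {"ANTENNA_ID": 0, "FEED_ID": 0, "SPECTRAL_WINDOW_ID": 1, "TIME": 101.0},
    {"ANTENNA_ID": 0, "FEED_ID": 0, "SPECTRAL_WINDOW_ID": 1, "TIME": 101.0},
]
key = ["ANTENNA_ID", "FEED_ID", "SPECTRAL_WINDOW_ID", "TIME"]
clashes = duplicate_keys(syscal_rows, key)
```

A non-empty result means the chosen key section does not identify tuples uniquely, which is the situation an explicit identifier in the key section is meant to prevent.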

To represent the EDF tables in a UML class diagram it is necessary to consider to which section the explicit identifiers belong. The keys participating in an association provide the referential constraint for that association. The non-key attributes of the association may include optional identifiers; in that case the association is also an aggregation. An association is a composite if its non-key attributes include at least one mandatory identifier. A pure composite is an association in which all the identifier attributes are mandatory; an example is the DATA_DESCRIPTION table. More commonly the EDF tables contain both mandatory and optional identifiers. They must be represented as composites because their existence relies on the existence of one or several other entities.

According to UML it is necessary to distinguish whether a table is an atomic entity (a static or quasi-static table having a single implicit identifier), an entity to which a specialisation is added either as an aggregation or a composition, or an association with or without a role of aggregation or composition. Following this description, the status of each table of the data model is given in Tab. 1.
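
The classification rules stated above can be written down as a small decision function. This is a sketch of my reading of the rules, not a definitive algorithm: the function name `roles` and its three arguments are hypothetical, and it returns role names rather than the exact abbreviations used in Tab. 1.

```python
def roles(key_identifiers, optional_identifiers, mandatory_identifiers):
    """Derive the UML roles of a table from the sections holding identifiers.

    key_identifiers:       identifiers of other entities in the key section
    optional_identifiers:  non-key identifiers that may be absent
    mandatory_identifiers: non-key identifiers that must be present
    """
    found = []
    if key_identifiers:
        found.append("association")
    if mandatory_identifiers:
        # At least one mandatory identifier makes the table a composite,
        # even when optional identifiers are present as well.
        found.append("composition")
    elif optional_identifiers:
        found.append("aggregation")
    return found or ["entity"]


# Inputs below are illustrative readings of two table descriptions.
focus_like = roles(["ANTENNA_ID", "FEED_ID"], ["FOCUS_MODEL_ID"], [])
data_description_like = roles([], [], ["SPECTRAL_WINDOW_ID", "POLARIZATION_ID"])
```

Giving mandatory identifiers precedence over optional ones mirrors the statement that tables containing both kinds must be represented as composites.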



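Table 1: EDF tables with their inter-connections
Columns: EDF table name; Type; Primary key (identifier, visibility); Keys participating in the association; aggregation; composition
Type abbreviations: Ent = entity, Ass = association, Agg = aggregation, Com = composition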
MAIN Ass/Com

ANTENNA_ID[2] FEED_ID[2]
DATA_DESCRIPTION_ID
PROCESSOR_ID
SWITCH_PHASE_ID
FIELD_ID

EXECUTE_ID STATE_ID

ANTENNA Ent ANTENNA_ID implicit

PHASE_ARRAY_ID

BEAM Ent BEAM_ID implicit

CALWIDGET Ass

ANTENNA_ID FEED_ID
SPECTRAL_WINDOW_ID
TIME INTERVAL

DATA_DESCRIPTION Com DATA_DESCRIPTION_ID implicit

SPECTRAL_WINDOW_ID
POLARIZATION_ID

DOPPLER Ass/Com DOPPLER_ID explicit

SOURCE_ID

TRANSITION_ID

EXECUTE_SUMMARY Ass/Com EXECUTE_ID implicit

SCHEDULE_ID MAIN_ID[2]
ANTENNA_LIST

FEED Ass/Com FEED_ID explicit

ANTENNA_ID
SPECTRAL_WINDOW_ID
TIME INTERVAL

RECEIVER_ID
BEAM_ID

FIELD Agg FIELD_ID implicit

SOURCE_ID
FIELD_ID
EPHEMERIS_ID

FOCUS Ass/Agg

ANTENNA_ID FEED_ID
TIME INTERVAL

FOCUS_MODEL_ID

FREQ_OFFSET Ass/Com

ANTENNA_ID[2] FEED_ID
SPECTRAL_WINDOW_ID
TIME INTERVAL

FIELD_ID

HISTORY Ass

EXECUTE_ID TIME

PHASE_TRACKING Ass

ANTENNA_ID FEED_ID
SPECTRAL_WINDOW_ID
TIME INTERVAL

POINTING Ass/Com

ANTENNA_ID
TIME INTERVAL

POINTING_MODEL_ID

POLARIZATION Ent POLARIZATION_ID implicit

PROCESSOR Ent PROCESSOR_ID implicit

PWVM Ass/Com

ANTENNA_ID FEED_ID
DATA_DESCRIPTION_ID
PROCESSOR_ID SWITCH_PHASE_ID
FIELD_ID

EXECUTE_ID STATE_ID

PWVMCAL Ass

ANTENNA_ID
SPECTRAL_WINDOW_ID
TIME INTERVAL

RECEIVER Ent RECEIVER_ID implicit

TIME INTERVAL

SEEING Ent

TIME INTERVAL

SOURCE Ass/Agg SOURCE_ID explicit

SPECTRAL_WINDOW_ID
TIME INTERVAL

SOURCE_PARAMETER_ID

SOURCE_PARAMETER Ent/Agg SOURCE_PARAMETER_ID implicit

TIME INTERVAL

DEP_SOURCE_PAR_ID

SPECTRAL_WINDOW Ent/Agg SPECTRAL_WINDOW_ID implicit

DOPPLER_ID
ASSOC_SPW_ID

STATE Ent STATE_ID implicit

SYSCAL Ass

ANTENNA_ID FEED_ID
SPECTRAL_WINDOW_ID
TIME INTERVAL

WEATHER Ass

ANTENNA_ID
TIME INTERVAL


The content of each column is the following:
  1. The first column gives the names of the EDF tables.
  2. The second column provides, in the context just described, the properties of these EDF tables. Note that some tables play two roles simultaneously: an association plus an aggregation or composition.
  3. The primary key is given in the third column.
    For some EDF tables I do not provide a primary key. As already explained, the reason for this is that the tuples in those tables are not referenced in the conceptual model by identifiers. The tuples in such cases can be identified owing to the presence of the key TIME, possibly together with a key INTERVAL.
  4. The fourth column gives the status of the primary key, i.e. whether it is implicit or explicit.
  5. Column 5 gives the identifiers participating in the definition of associations. In a few cases I append to an identifier a number between square brackets. This indicates a cardinality different from 1.
  6. Column 6 gives the non-key, non-mandatory identifier attributes. These indicate a cardinality of 0 or 1 for their participation in the aggregation.
    Note that, so that they can be recognized, non-mandatory attributes are written between brackets in the layout of the tables.
  7. The last column gives the non-key mandatory identifier attributes. In this case the table is considered as a composite because, as explained, no tuple in that entity can exist if the tuple identified in the other participating entity does not exist.

Notes:
It is also useful to highlight some characteristics of the tables that are closely related to associations, in particular when using collections.


Recursion: In a few cases a recursive relation is needed, a tuple having in its list of non-key attributes an identifier that belongs to the entity itself. This is the case with (ASSOC_SPW_ID) in the SPECTRAL_WINDOW table, to support the concept of associated spectral windows, and with (DEP_SOURCE_PAR_ID) in the SOURCE_PARAMETER table, for dependencies of source properties between different sources or for the same source observed at a different time or frequency.
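
A recursive relation of this kind can be followed programmatically. The sketch below is only an illustration: the function name `associated_windows` is hypothetical, the mapping stands in for the (ASSOC_SPW_ID) attribute, and treating the association as transitive is an assumption of this example, not a statement of the data model.

```python
def associated_windows(assoc_spw_id, start):
    """Collect every spectral window reachable from `start`.

    assoc_spw_id maps a SPECTRAL_WINDOW_ID to the list of identifiers held
    in its optional ASSOC_SPW_ID attribute; a missing entry means none.
    """
    seen = []
    pending = [start]
    while pending:
        current = pending.pop()
        for other in assoc_spw_id.get(current, []):
            # Skip the starting window and anything already visited, so
            # that mutual references cannot cause an endless loop.
            if other != start and other not in seen:
                seen.append(other)
                pending.append(other)
    return sorted(seen)


# Hypothetical references between three spectral windows.
links = {0: [1], 1: [0, 2], 2: []}
reachable = associated_windows(links, 0)
```

Because the identifier refers back to the same entity, any traversal has to guard against cycles, which is what the `seen` list does here.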

Projection: There is a projection in an association when one of the contributing keys could be dropped, i.e. when all the tuples would be identical whatever the value of that key. Projections are identified in the EDF tables by an identifier with a value of -1. This is the case e.g. with SPECTRAL_WINDOW_ID in the FEED and CALWIDGET tables.
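
When querying a table that uses projections, the value -1 has to be treated as matching any requested value. A minimal sketch, assuming rows are modeled as dicts; the function name `matching_feeds` and the sample rows are hypothetical.

```python
def matching_feeds(feed_rows, antenna_id, spectral_window_id):
    """Select FEED tuples for a given antenna and spectral window.

    A stored SPECTRAL_WINDOW_ID of -1 marks a projection: the tuple holds
    whatever the spectral window, so it matches every requested value.
    """
    return [
        row for row in feed_rows
        if row["ANTENNA_ID"] == antenna_id
        and row["SPECTRAL_WINDOW_ID"] in (-1, spectral_window_id)
    ]


# Hypothetical FEED rows; the values are for illustration only.
feeds = [
    {"FEED_ID": 0, "ANTENNA_ID": 0, "SPECTRAL_WINDOW_ID": -1},
    {"FEED_ID": 1, "ANTENNA_ID": 1, "SPECTRAL_WINDOW_ID": 2},
    {"FEED_ID": 2, "ANTENNA_ID": 1, "SPECTRAL_WINDOW_ID": 3},
]
any_window = matching_feeds(feeds, 0, 5)
one_window = matching_feeds(feeds, 1, 2)
```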

Multi-value attributes: When several references to several tuples of a given entity need to be associated, a standard practice is to name them using an enumeration appended to the entity name. The column names ANTENNA1, ANTENNA2 and (ANTENNA3) in the MAIN table illustrate this, the entity being the ANTENNA table. Both ANTENNA1 and ANTENNA2 are mandatory (N.B. when only a single antenna needs to be specified, as is the case for single-dish observations, the “trick” used is to assign a single common value to these different ANTENNAx attributes).
Number of members in the collections:
For some tables there is a Data description section that hosts the number of values for one or several attributes in the Data section. An example is NUM_RECEPTOR in the FEED table. NUM_RECEPTOR is an attribute in itself because it is used in the tables associated with the FEED table, e.g. the SYSCAL table. If a cardinality is not indicated in the Data description section but is only implicit in the Data section itself, this cardinality will not be used explicitly when the entity participates in an association. An example is the number of associated spectral windows for a given spectral window. In this case the cardinality remains private to the object.
Arrays:
Most frequently, when an ensemble of values is assigned to an attribute, this ensemble corresponds to a collection of type “array”, as the attribute carries an ordered sequence of indexed values. Although the given order may not always be important in itself, a correspondence using that order must be shared by all the attributes whose format includes the common cardinality, e.g. NUM_RECEPTOR.
Lists:
Some entities could be considered as lists, i.e. as collections of tuples with a certain order, possibly with the same tuple appearing more than once. This could be the case with the ANTENNA table if it hosts several configurations of antenna positions, some being common to the different configurations (N.B. there is no attribute to identify (sub-)arrays in the ANTENNA table; this is justified because (sub-)arrays may be highly transient). This could be the case as well with the SPECTRAL_WINDOW table. The reference to a spectral window does not tell to which correlator setup it belongs; if this table simply hosts a certain number of setups, then two or more setups may have one or more spectral windows in common, and this is not forbidden. Antennas, like spectral windows, are referenced unambiguously in the other tables through their implicit identifiers, respectively ANTENNA_ID and SPECTRAL_WINDOW_ID.
N.B.: The EXECUTE_SUMMARY table contains an item, ANTENNA_LIST, which carries an ordered list of antenna identifiers. This list is ordered assuming an implicit rule that defines how the baselines (antenna pairs) are ordered in the data stream published by the correlator sub-system, this order being reflected by the order in which the baselines appear in the MAIN table for a given time stamp. In principle this list cannot refer twice to a given tuple in the ANTENNA table, unless we want to offer some flexibility, e.g. if it is not mandatory for all the antennas to have data of single-dish type in addition to the cross-correlations. If this reservation is excluded, ANTENNA_LIST should be renamed ANTENNA_ARRAY, and the rule adopted to set the order of the baselines must be provided as part of the data model to describe concisely its meaning as a collection of ordered and indexed antennas.
Sets:
When the order of the values has no meaning, the attributes to which a multiplicity of values is assigned should be considered as sets. A possible example would be the DEP_SOURCE_PAR_ID attribute in the SOURCE_PARAMETER table.
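
For collections of type array, the requirement that all attributes sharing a common cardinality (e.g. NUM_RECEPTOR) carry the same number of indexed values can be verified with a short check. This is a sketch under stated assumptions: the function name `cardinality_mismatches` is hypothetical, and the array attribute names POLARIZATION_TYPE and RECEPTOR_ANGLE are used only as illustrative examples of attributes dimensioned by NUM_RECEPTOR.

```python
def cardinality_mismatches(row, count_attribute, array_attributes):
    """Return the array attributes whose length differs from the stated count."""
    expected = row[count_attribute]
    return [name for name in array_attributes if len(row[name]) != expected]


# Hypothetical FEED-like tuple; names and values are for illustration only.
feed_row = {
    "NUM_RECEPTOR": 2,
    "POLARIZATION_TYPE": ["X", "Y"],   # index i refers to receptor i
    "RECEPTOR_ANGLE": [0.0],           # one value missing on purpose
}
bad = cardinality_mismatches(
    feed_row, "NUM_RECEPTOR", ["POLARIZATION_TYPE", "RECEPTOR_ANGLE"]
)
```

The check only covers the shared length; the correspondence between index positions across attributes remains a convention that the data model has to state.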

To make the attributes that refer to collections easy to distinguish, I propose to append the type of the collection to their names, e.g. ANTENNA_ARRAY, SCAN_LIST, DEP_SOURCE_PAR_SET, etc. If the collections used in the global model do not refer exclusively to ensembles of identifiers, _ID could be inserted, e.g. ANTENNA_ID_ARRAY. Currently the model includes collections that are not ensembles of identifiers, e.g. a set of coefficients for polynomial expressions.
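
The proposed naming convention is simple enough to be generated, and validated, by a small helper. The sketch below is illustrative only: the function name `collection_name` and its `of_identifiers` parameter are hypothetical.

```python
COLLECTION_TYPES = ("SET", "BAG", "LIST", "ARRAY")


def collection_name(base, collection_type, of_identifiers=False):
    """Build an attribute name that ends with the type of its collection.

    With of_identifiers=True the marker ID is inserted before the type,
    for models where some collections do not hold identifiers.
    """
    kind = collection_type.upper()
    if kind not in COLLECTION_TYPES:
        raise ValueError(f"unknown collection type: {collection_type}")
    parts = [base.upper()]
    if of_identifiers:
        parts.append("ID")
    parts.append(kind)
    return "_".join(parts)


plain = collection_name("ANTENNA", "array")
with_id = collection_name("ANTENNA", "array", of_identifiers=True)
```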

Implicit non-key attributes: Some tables contain embedded methods, e.g. in the FIELD table, to determine the directions of the delay, phase and reference centers. The values of these positions can be computed using these methods together with the value of the key TIME when the FIELD table is associated in the MAIN or POINTING table. In such cases the values of these positions can be considered as implicit values for these attributes, and queries, e.g. to retrieve data, may use these implicit values as filters.
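
As an illustration of an implicit value, the sketch below evaluates a direction from stored coefficients and the key TIME. It rests on an assumption that is not stated in this note: that the embedded method is a polynomial in time relative to a reference time. The function name `direction_at`, the coefficient layout and the numbers are hypothetical.

```python
def direction_at(coefficients, reference_time, time):
    """Evaluate a direction given as a polynomial in (time - reference_time).

    coefficients[k] is the (longitude, latitude) pair multiplying dt**k.
    """
    dt = time - reference_time
    lon = sum(c[0] * dt**k for k, c in enumerate(coefficients))
    lat = sum(c[1] * dt**k for k, c in enumerate(coefficients))
    return lon, lat


# Hypothetical coefficients: a constant term plus a linear drift in longitude.
phase_center = [(1.0, 0.5), (0.01, 0.0)]
lon, lat = direction_at(phase_center, reference_time=100.0, time=110.0)
```

A query could then filter tuples of the associating table on the computed `lon` and `lat`, exactly as it would on stored attribute values.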