EDF data model and its representation using UML
F. Viallefond
February 4, 2004
1 Introduction
The ALMA Export Data Format (EDF) is structured using the data model which underlies the
MeasurementSet (Kemball and Wieringa 2000). This model is built using concepts from the domain of
relational data bases. The present note makes a number of remarks which need to be taken into
consideration when representing the EDF data model in UML. Suggestions are made to adopt
a comprehensive nomenclature for the names of attributes involving relationships between different
entities in the model.
2 Context and Definitions
The structure of the data model is presented as an ensemble of tables, whose names are listed in the
first column of Tab. 1.
These tables are composed of different sections which distinguish between different categories of
attributes: the key, the non-key, the data-description and the data sections. This categorization
gives the logic for the various relationships between the tables, the role of each attribute
implying an association in a precisely defined manner. The network of all the relationships
forms a conceptual schema. Its construction follows rules which give it the
capabilities of a relational data base.
To make this presentation easier to understand and to avoid ambiguities, I provide some
definitions of the words used in this document.
Entity: This word is used to represent a table, this table reflecting a relation between a certain
number of attributes. The entity is named by the table name. Using object-oriented words it may be
considered as a class.
Attribute: This is a column. The relation reflected by the entity is defined by the ensemble of
columns for that entity.
Tuple: A given row in the table. Each row provides an instantiation of the ensemble of
attributes in the entity. Using object-oriented words it is an object.
Key: If an instance of a minimal ensemble of attributes identifies a unique tuple in the
relation, a key can be assigned to this ensemble to identify it.
Association: An association is represented by a relation of the same name whose attributes are
the keys of the entities which participate in the association, in addition to its own
non-key attributes. This word comes from UML. Just as an entity is similar to a class,
an association is also similar to a class. In UML an instance of such a class is called a
link.
Link: In UML terminology, an instance of an association.
Identifier: For associations to be possible, each of the entities which participates in an association
needs to have a primary key. A primary key may be explicit or implicit. It is implicit when it
corresponds to the position of a tuple in a sequence or, in other words, to the row number in the table.
We define an identifier as the primary key of an entity.
For convenience we give these identifiers a name of the form entityName_ID; they are of type
int.
Collection: It may be useful to consider an ensemble of tuples which are all members of the same entity.
This ensemble defines a collection. In this respect a table is a collection, but a collection may also
correspond to only a subset of the rows in a table. When several values have to be assigned to an
attribute in a tuple, this attribute refers to a collection. There are several types of collections, in
particular the set, the bag, the list and the array. Using collections requires stating the
type of each of them. This is mandatory to fully describe the structure of the
data base. In UML the cardinality indicates the number of members in a
collection.
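As a minimal sketch of these definitions, assume that Python dataclasses stand in for entities and that a plain list stands in for a table. The class names, the non-key fields and the sample values below are illustrative assumptions, not taken from the EDF specification. The implicit identifier of an entity is simply the row number, while the association carries the identifiers of the participating entities plus its own explicit identifier:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Antenna:                 # entity: one class per table
    name: str                  # non-key attribute (hypothetical)


@dataclass(frozen=True)
class SpectralWindow:          # entity
    ref_frequency: float       # non-key attribute (hypothetical)


@dataclass(frozen=True)
class Feed:                    # association between the two entities above
    antenna_id: int            # identifier: row number in the antenna table
    spectral_window_id: int    # identifier: row number in the window table
    feed_id: int               # explicit identifier of the association tuple
    num_receptor: int          # non-key attribute of the association


# A table is a collection of tuples; the list index plays the role of the
# implicit identifier (entityName_ID of type int).
antenna_table = [Antenna("A1"), Antenna("A2")]
spw_table = [SpectralWindow(100e9), SpectralWindow(230e9)]
feed_table = [Feed(0, 1, 0, 2), Feed(1, 1, 0, 2)]


def resolve(link):
    """Follow the identifiers of a link to the tuples it relates."""
    return antenna_table[link.antenna_id], spw_table[link.spectral_window_id]
```

Each element of `feed_table` is a link in the UML sense, and `resolve(feed_table[1])` returns the antenna and spectral window tuples which it relates.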
3 EDF table properties to describe their inter-relations
With these definitions in mind, I now give the consequences, in terms of the relations of a given EDF table
with the other tables of the EDF data model, of the various possibilities for the
table section to which a given identifier belongs. Each of these possibilities defines the type of relation between the
association under consideration and the entity in which the identifier is the primary
key.
In the context of relational data bases it is not recommended (forbidden?) to have two tuples which have
the same values for their ensemble of key attributes. For this reason it is preferable to avoid the use of an
implicit key in the case of associations. Hence, for those, the key section contains, in addition to the
list of primary keys of the entities which participate in the association, the identifier corresponding to
the tuple. This can be seen, for example, in the FEED or DOPPLER tables, which both reflect
associations. In this context it is noted that e.g. the SYSCAL or CALWIDGET tables do not need to
have a SYSCAL_ID or CALWIDGET_ID identifier in their key section; this is because these tables
cannot contain two or more tuples with the same value assigned to the key attribute
TIME.
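The uniqueness requirement can be illustrated with a small check, given here as a sketch only. Rows are represented as dictionaries, the helper name `duplicate_keys` is hypothetical, and the reduced set of FEED columns is chosen for illustration rather than taken from the full table definition:

```python
from collections import Counter


def duplicate_keys(rows, key_columns):
    """Return the key values which are shared by more than one row."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# Two feeds on the same antenna and spectral window (illustrative values).
feed_rows = [
    {"ANTENNA_ID": 0, "SPECTRAL_WINDOW_ID": 0, "FEED_ID": 0},
    {"ANTENNA_ID": 0, "SPECTRAL_WINDOW_ID": 0, "FEED_ID": 1},
]

# With only the keys of the participating entities the two tuples collide.
collisions = duplicate_keys(feed_rows, ["ANTENNA_ID", "SPECTRAL_WINDOW_ID"])

# Adding the explicit identifier of the association removes the ambiguity.
resolved = duplicate_keys(
    feed_rows, ["ANTENNA_ID", "SPECTRAL_WINDOW_ID", "FEED_ID"]
)
```

Here `collisions` contains the key `(0, 0)` while `resolved` is empty, which is the reason for placing the explicit identifier in the key section of an association.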
To represent the EDF tables using a UML class diagram it is necessary to consider to which
section the explicit identifiers belong. The keys participating in an association provide the referential
constraint for that association. The non-key attributes of the association may include optional identifiers.
In that case the association is also an aggregation. An association is a composite if its
non-key attributes include at least one mandatory identifier. A pure composite is an association in which all
the attributes which are identifiers are mandatory. An example is the DATA_DESCRIPTION table. More
commonly the EDF tables contain both mandatory and optional identifiers. They must be
represented as composites because their existence relies on the existence of one or several other
entities.
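The rules of this paragraph can be written as a small decision function. This is a sketch under the assumption that the non-key identifiers of a table are given as two lists; the function name `classify`, the returned labels and the placeholder identifier names are hypothetical:

```python
def classify(mandatory_ids, optional_ids):
    """Classify a table from its non-key identifier attributes.

    mandatory_ids: non-key identifiers which must always be present
    optional_ids:  non-key identifiers which may be absent (cardinality 0..1)
    """
    if mandatory_ids and not optional_ids:
        return "pure composite"   # every non-key identifier is mandatory
    if mandatory_ids:
        return "composite"        # at least one mandatory identifier
    if optional_ids:
        return "aggregation"      # only optional identifiers
    return "no aggregation or composition"


# Placeholder identifier names, for illustration only.
kinds = [
    classify(["X_ID", "Y_ID"], []),
    classify(["X_ID"], ["Z_ID"]),
    classify([], ["Z_ID"]),
    classify([], []),
]
```

A table with both mandatory and optional identifiers falls in the second branch, in line with the statement that such tables must be represented as composites.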
According to UML it is necessary to distinguish whether a table is an atomic entity (a static or
quasi-static table having a single implicit identifier), an entity to which a specialisation is added either
as an aggregation or a composition, or an association with or without a role of aggregation or
composition. Following this description, the status of each table of the data model is given in Tab.
1.
Table 1: EDF tables with their inter-connections
The content of each column is the following:
- The first column gives the names of the EDF tables.
- The second column provides, in the context just described, the properties of these
EDF tables. Note that some tables play two roles simultaneously, an association plus an
aggregation or composition.
- The primary key is given in the third column.
For some EDF tables I do not provide a primary key. As already explained, the reason
for this is that the tuples in those tables are not referenced in the conceptual model by
identifiers. The tuples in such cases can be identified owing to the presence of the key TIME,
possibly together with a key INTERVAL.
- The fourth column gives the status of the primary key, i.e. whether it is implicit or explicit.
- Column 5 gives the identifiers which participate in defining associations. In a few cases I
append to an identifier a number between square brackets. This indicates a cardinality
different from 1.
- Column 6 gives the non-key non-mandatory identifier attributes. These indicate a
cardinality of 0 or 1 for their participation in the aggregation.
Note that, so that they can be recognized, non-mandatory attributes are written between brackets in the
layout of the tables.
- The last column gives the non-key mandatory identifier attributes. In this case the table is
considered as a composite because, as explained, a tuple in that entity cannot exist if
the tuple identified in the other participating entity does not exist.
Notes:
It is also useful to highlight some characteristics used with the tables which are closely related to
associations, in particular when using collections.
- Recursion: In a few cases there is the need for a recursive relation, a tuple having in
its list of non-key attributes an identifier which belongs to the entity itself. This is
the case with (ASSOC_SPW_ID) in the SPECTRAL_WINDOW table, to support the
concept of associated spectral windows, and with (DEP_SOURCE_PAR_ID) in the
SOURCE_PARAMETER table, for dependencies of source properties between different sources
or the same source observed at a different time or frequency.
- Projection: There is a projection in an association when one of the contributing keys could
be dropped, i.e. if all the tuples would be identical whatever the value of that key.
Projections are identified in the EDF tables by an identifier with a value of -1. This
is the case e.g. with SPECTRAL_WINDOW_ID in the FEED and CALWIDGET tables.
- Multi-value attributes: When there is the need to associate several references to several tuples
of a given entity, a standard practice is to name them using an enumeration appended to the entity
name. The column names ANTENNA1, ANTENNA2 and (ANTENNA3) in the MAIN table
illustrate this, the entity being the ANTENNA table. Both ANTENNA1 and ANTENNA2
are mandatory (N.B. when only a single antenna needs to be specified, as is the case
for single-dish observations, the “trick” used is to assign to these different ANTENNAx
attributes a single common value).
Number of members in the collections:
For some tables there is a Data description section which holds the number
of values for one or several attributes in the Data section. An example is NUM_RECEPTOR
in the FEED table. NUM_RECEPTOR is an attribute in itself because it is used in the
tables associated with the FEED table, e.g. the SYSCAL table. If a cardinality is not
indicated in the Data description section but is only implicit in the Data section itself, this
cardinality will not be used explicitly when the entity participates in an association. An
example is the number of associated spectral windows for a given spectral window. In this
case the cardinality remains private to the object.
Arrays:
Most frequently, when an ensemble of values is assigned to an attribute, this ensemble
corresponds to a collection of type “array”, as the attribute carries an ordered sequence
of indexed values. Although the given order may not always be important in itself, a
correspondence using that order must be shared by all the attributes which include the
common cardinality, e.g. NUM_RECEPTOR, in their format.
Lists:
Some entities could be considered as lists, i.e. as collections of tuples with a certain order,
the same tuple possibly appearing more than once. This could be the case with the
ANTENNA table if it hosts several configurations of antenna positions, some being
common to the different configurations (N.B. there is no attribute to identify (sub-)arrays
in the ANTENNA table; this is justified because (sub-)arrays may be highly transient).
This could be the case as well with the SPECTRAL_WINDOW table. The reference
to a spectral window does not tell to which correlator setup it belongs; if this table
simply hosts a certain number of setups then it may happen that two or more setups
have one or more spectral windows in common, and this is not forbidden. Antennas, like spectral
windows, are referenced in the other tables through their implicit identifiers, respectively
ANTENNA_ID and SPECTRAL_WINDOW_ID, in a non-ambiguous manner.
N.B.: The table EXECUTE_SUMMARY contains an item, ANTENNA_LIST, which
carries an ordered list of antenna identifiers. This list is ordered assuming an implicit rule
which defines how the baselines (antenna pairs) are ordered in the data stream published by
the correlator sub-system, this order being reflected by the order in which the
baselines appear in the MAIN table for a given time stamp. In principle this list cannot refer
twice to a given tuple in the ANTENNA table unless we want to offer some flexibility,
e.g. if it is not mandatory for all the antennas to have data of single-dish type in addition
to the cross-correlations. If this reservation is excluded, ANTENNA_LIST should be renamed
ANTENNA_ARRAY and the adopted rule to set the order of the baselines must be provided
as part of the data model to describe concisely its meaning as a collection of ordered and
indexed antennas.
Sets:
When the order of the values has no meaning, the attributes to which a
multiplicity of values is assigned should be considered as sets. A possible example
would be the DEP_SOURCE_PAR_ID attribute in the SOURCE_PARAMETER table.
To distinguish easily the attributes which refer to collections I propose to append
to their names the type of the collection, e.g. ANTENNA_ARRAY, SCAN_LIST,
DEP_SOURCE_PAR_SET etc. If the collections used in the global model do
not refer exclusively to ensembles of identifiers, the _ID could be inserted, e.g.
ANTENNA_ID_ARRAY. Currently the model includes collections which are not ensembles
of identifiers. This is the case e.g. when providing a set of coefficients for polynomial
expressions.
- Implicit non-key attributes: In some tables there are embedded methods, e.g. in the FIELD
table, to determine the directions of the delay, phase and reference centers. The values of
these positions can be computed using these methods together with the value of the key
TIME when the FIELD table is associated with the MAIN or POINTING table. In such
cases the values of these positions can be considered as implicit values for these attributes,
and queries, e.g. to retrieve data, may use these implicit values as filters.
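Returning to the projection convention described in the notes above, where an identifier with a value of -1 marks a key which could be dropped, a lookup might treat -1 as matching any requested value. The following is a sketch only: rows are represented as dictionaries, the function names and the sample rows are hypothetical, and a real FEED table carries more columns than shown here.

```python
PROJECTED = -1  # identifier value which marks a projection


def key_matches(stored_id, wanted_id):
    """A projected key matches any requested identifier."""
    return stored_id == PROJECTED or stored_id == wanted_id


def select_feed(rows, antenna_id, spectral_window_id):
    """Return the FEED rows which apply to the given antenna and window."""
    return [
        row
        for row in rows
        if row["ANTENNA_ID"] == antenna_id
        and key_matches(row["SPECTRAL_WINDOW_ID"], spectral_window_id)
    ]


# Illustrative rows: antenna 0 has one feed description valid for all
# spectral windows, antenna 1 has one which is specific to window 2.
feed_rows = [
    {"ANTENNA_ID": 0, "SPECTRAL_WINDOW_ID": PROJECTED, "NUM_RECEPTOR": 2},
    {"ANTENNA_ID": 1, "SPECTRAL_WINDOW_ID": 2, "NUM_RECEPTOR": 2},
]

for_any_window = select_feed(feed_rows, antenna_id=0, spectral_window_id=5)
for_other_window = select_feed(feed_rows, antenna_id=1, spectral_window_id=5)
```

With these rows `for_any_window` holds the single projected row of antenna 0, whereas `for_other_window` is empty because the only row of antenna 1 is tied to spectral window 2.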