osa logo
OSA Home
Start Here
Data Overview
Known Issues
the Surveys
Coverage Maps
Schema browser
Data access
Login
Archive Listing
GetImage
MultiGetImage
Region
Freeform SQL
CrossID
SQL Cookbook
Q&A
Glossary
Release History
Gallery
Publications
Monitor
Downtime
Links
osa logo bottom
IFA     ROE
Home | Overview | Browser | Access | Login | Cookbook 
  OSA logo

An Introductory SQL Tutorial for OSA Users

Contents


1. Introduction

In this document we provide a brief introduction to the use of Structured Query Language (SQL) for accessing data in the OmegaCam Science Archive (OSA). Sections included in this document are a Primer aimed at users new both to the OSA and to SQL and a Reference which should be of use more experienced and returning users. Readers wanting a fuller introduction to SQL should consult an online tutorial or one of the legion of SQL books available: O'Reilly's SQL in a nutshell is a good introduction. The document concludes with some examples of cross-querying VST and external survey data, which we expect to be a common usage of the archive.

The OSA and its prototype, the SuperCOSMOS Science Archive (SSA) are large databases so, for test purposes, we have produced the "Personal SSA" (PSSA), which is a small subset of the SSA, containing solely the data in the region of the sky with 184 < RA (deg) < 186 and -1.25 < Dec. (deg) < 1.25, which is the same area of sky as the "Personal SkyServer" produced for the Early Data Release (EDR)of the Sloan Digital Sky Survey (SDSS). The PSSA may be downloaded from here (as a .zip file with installation instructions included) or can be queried using a web interface of the same form as for the full SSA. Similarly, the initial release of the OSA (which contains a limited amount of commissioning data) should be used if you are unfamiliar with SQL but wish to exercise queries on that database. OSA and SSA users are strongly encouraged to use these small subsets for developing and debugging queries: with databases as large as the OSA and full SSA it can take a long time to find out that the query you wrote does not do what you intended!

This OSA Cookbook is based on the SSA version. Currently, the latter provides a more comprehensive introduction to SQL and in particular gives real-world examples of astronomy application queries that were used in the construction of the SSA. Novice users are strongly recommended to familiarise themselves with the PSSA and the information given in the SSA Cookbook before attempting real science with the OSA.

2. Primer

2.1. Relational databases

The OSA is a relational database, which means that it stores data in tables composed of rows and columns. Each row comprises the information stored for one data entry – e.g.. a celestial object in the case of the OSA – and there is one column for each of the attributes recorded for that entry – e.g. RA, Dec, ellipticity, etc. The different tables comprising a database may be linked (or related), if they each have columns representing the same data value (integrity constraints can be included in the table definitions which ensure consistency between two related tables, e.g. by preventing the deletion of only one of a pair of rows in different tables thus linked). For ease of use, it is possible to define virtual tables - called views - which are subsets of the data in one or more tables and which can be queried using the same syntax as ordinary tables (which are sometimes called base tables, to distinguish them from these virtual tables). In addition to tables and views, the major constituents of a relational database are indexes (the database community prefer that spelling to "indices"), which can speed up the identification of records which satisfy the particular condition expressed in a query, and various stored procedures and functions which extend the range of operations which can be performed on data held in the tables. The collection of definitions of columns, tables, views, indexes, stored procedures and functions in a database is called its schema.

2.2. The OSA as a relational database

The OSA schema is described in detail elsewhere, but we recap here the basic features which we shall use later. For each VST survey programme in the OSA (e.g. the ATLAS Survey) two major tables called, for example, vhsDetection and vhsSource are available. The columns in *Detection are basically the attributes derived by running the pipeline image analyser over single passband frames, and these single-frame detections are then merged into multi-epoch, multi-colour records for individual celestial objects, which are stored in *Source. In addition to these major tables, there are also a number of metadata tables, which store ancillary information describing the processes involved in obtaining VST imaging data, and which enable the provenance of data values in *Source and *Detection to be traced all the way back to a given image. To aid spatial matching of objects within the OSA and between the OSA and external catalogue datasets (such as the 2MASS Point Source Catalogue and SDSS Data Release catalogues), there are also "neighbour" and "cross-neighbour" tables which record pairs of sources within 10 to 30 arcsec of one another, depending on the catalogues joined.

2.3. Structured Query Language (SQL)

SQL is the standard language for accessing and manipulating data stored in a relational database. In fact, several versions of the SQL standard exist, and most database management systems (DBMSs) actually support a subset of standard SQL, with some vendor-specific additions. The OSA is currently implemented in Microsoft's SQL Server 2000 DBMS, so OSA users will employ its SQL dialect (known as Transact-SQL, or T-SQL), although we have tried to restrict the use of vendor-specific features to a minimum. A fuller reference on T-SQL dialect than presented here is available online here.

The first thing to understand about SQL is that it is a set-based language, not a procedural language, like Fortran or C. A user submitting an SQL query to a relational database is defining the set of properties of the records to be returned from the database, not specifying the list of operations which will lead to their delivery; this latter is the responsibility of the DBMS engine, which will decide the best way to execute a given query from a set of possible execution plans. Many database vendors are adding procedural capabilities to the SQL dialects they support, and these constitute one of the main areas of difference between those dialects. These extensions will not be discussed here, as we shall concentrate on the basics of standard SQL.

2.4. The basic structure of a OSA SQL statement

For security reasons, the OSA does not allow users to execute queries which affect the basic structure and contents of the database, only those which can extract data from it. In SQL terms, this means that only SELECT statements are allowed: N.B. in this tutorial we write all SQL keywords in upper case italics and some column names in mixed case, both for clarity, although the OSA's SQL dialect is case insensitive by default. There are three basic classes of SELECT statement:

2.4.1 Projections

A projection is the retrieval of a set of full columns from a table. To retrieve the nominal RAs and Decs of the centres of all VST ATLAS Survey fields in the OSA, one would type:

SELECT ra, dec FROM atlasMergeLog

where atlasMergeLog is the name of the OSA table which records information about ATLAS fields, and ra and dec are the names of the relevant columns in that table.

2.4.2 Selections

A selection is the retrieval of the data values in particular columns for those rows in a table which satisfy certain criteria. So, if one were interested only in fields whose nominal centres lie in a 1 degree strip on the celestial equator, the appropriate SQL query would be:

SELECT ra, dec
FROM atlasMergeLog
WHERE dec BETWEEN -22.5 AND -12.5

In this example the SQL statement has been split into three lines to emphasise the SELECT…FROM…WHERE syntax, but this is still one SQL statement. The SQL Query Form in the OSA interface ignores the whitespace at the end of each line of text and generates a single query string from valid multi-line text like this. (Note that this means that users should not used double hyphens to indicate comments in multi-line queries, since this will result in all text after the first comment being ignored.)
Multiple constraints can be included in the WHERE clause of a selection, so, for example, the query above could be rewritten as:

SELECT ra, dec FROM atlasMergeLog WHERE (dec >= -22.5) AND (dec <= -12.5)

while the centres of all other fields could be selected using the following statement:

SELECT ra, dec FROM atlasMergeLog WHERE (dec < -22.5) OR (dec > -12.5)

The parentheses in these examples have been included for clarity – they are only required when needed to avoid ambiguity, and when necessary to over-rule the standard order of precedence amo ngst operators, outlined in Section 3.3.6 (users should note that the accidental omission of the WHERE clause from a selection turns it not into an invalid query, but into the projection of the columns contained in its SELECT clause, which, for tables as large as the *Source and *Detection tables of the OSA will return a lot of data). Note the units of spherical coordinates in the OSA: these are almost always decimal degrees for both RA and Dec (a notable exception to this rule is attribute raBase in table Multiframe, where the units are hours for consistency with the FITS files ingested from the processing centre. You can easily convert units in selections or WHERE clauses: e.g. SELECT raBase*15.0 ..., or WHERE (ra/15.0) > 23.0 ... etc.

Note that users should check which attributes in which tables have been indexed in the OSA, since the performance of queries that can make use of them should be significantly better than for those which do not: this information is presented in the Schema Browser as highlighted rows in the lists of table attributes.

2.4.3 Joins

A join is the retrieval of data entries from one or more tables in a database matched under some criterion. Extending our example above, a user may be interested in the dates on which the Y band images in a strip were taken. The Multiframe table in the OSA has an attribute called mjdObs, which records the Modified Julian Date of the observation. The *MergeLog and Multiframe tables are linked by having the common attribute multiframeID, which is a unique identifier for each FITS file ingested into the archive. The SQL query retrieving the desired dates here would be:

SELECT mjdObs, ra, dec
FROM atlasMergeLog, Multiframe
WHERE (dec BETWEEN -22.5 AND -12.5)
AND (gmfID = multiframeID) AND (gmfID > 0)

In this query, records in the Multiframe table and the atlasMergeLog table are being joined on condition that their values for the multiframeID attribute are equal. Furthermore, we select only those rows for which there is a Y multiframe identifier, i.e. we exclude any rows where there is no Y multiframe in the merged frame set. Such a situation is indicated by a default value for the attribute (see Section 2.5.). Note also that in general, FITS files produced by the pipeline are multi-extension files containing a paw-print of 4 individual images, and fields in the merge logs are made up from sets of individual images in different passbands and at different epochs. The observation dates of the four paw-print images will be nearly the same, so the results set should contain sets of four repeats of the observation dates matched with the different centres of the device images that comprise the multiframe.

The AS keyword can be used to rename the attributes in the SELECT clause so that their names in the output result set differs from those by which they are known in the database table. For example, a user thinking that the column names referred to in the query above are a little cryptic could rewrite it as follows:

SELECT mjdObs AS gModifiedJulianDate, ra AS fieldCentreRA, dec AS fieldCentreDec
FROM atlasMergeLog, Multiframe
WHERE (dec BETWEEN -22.5 AND -12.5)
AND (gmfID = multiframeID) AND (gmfID > 0)

and the columns returned in the output result set will be headed gModifiedJulianDate, fieldCentreRA, fieldCentreDec. This useful facility can be misused by the unwary or the very stupid. For example, it would be possible to have a query which started "SELECT ra AS dec, dec AS ra", which could confuse the subsequent analysis of the extracted result set. In the particular case of extracting data in VOTable format from the OSA, the assignment of Unified Content Descriptors (UCDs) to columns - i.e. setting the values of the ucd attributes to <FIELD> elements in the metadata section at the start of the VOTable document - is performed on the basis of the column name in the output result set, so a user swapping ra and dec, as in the example above, would end up with the wrong UCD for both those two columns, causing problems were that VOTable to be used in conjunction with other Virtual Observatory tools reliant on UCDs. For the moment, users wishing to have UCDs included in VOTable output from the OSA must not rename table columns, while, more generally, it is left to the user to ensure that any renaming they do perform is sensible (note: UCDs in the SSA and OSA have been initially assigned according to the UCD1 definitions).

It is also possible to write join queries in a way that makes it more explicit that they are joining the two tables, i.e. the example above becomes:

SELECT mjdObs, ra, dec
FROM Multiframe
JOIN atlasMergeLog ON (gmfID = multiframeID)
WHERE (dec BETWEEN -22.5 AND -12.5) AND (gmfID > 0)

This is an inner join, meaning it only returns the (mjdObs,ra,dec) triplets for matched rows, but there are other types of join, which return different combinations of data (see the SSA Cookbook for more details).

2.4.4 Subqueries

The SQL Server dialect of SQL allows the construction of nested SELECT statements, in which the WHERE clause includes a subquery which is itself a valid SELECT statement. For example, the join on atlasMergeLog and Multiframe in the previous Section could be rewritten in the following way:

SELECT mjdObs
FROM Multiframe
WHERE multiframeID IN
(SELECT gmfID FROM atlasMergeLog WHERE (dec BETWEEN -22.5 AND -12.5) AND (gmfID > 0))

Note that the ra and dec attributes from the atlasMergeLog table have now been removed from the SELECT clause of the main query. This is because the atlasMergeLog table is only present in the subquery, and its attributes are no longer accessible to the main query. The one exception to this is its gmfID attribute, but that is only accessible by value - i.e. the subquery generates a list of gmfID values and matches between this and the multiframeID column of the Multiframe table are made by use of the IN operator. Note also that now the results set only contains the distinct values of the observation dates, i.e. we get one observation date per multiframe, as opposed to previously where the results sets contained repeated observation dates for the distinct device image centres.

This query could also be written using a second logical operator, ANY, as follows:

SELECT mjdObs
FROM Multiframe
WHERE multiframeID = ANY
(SELECT gmfID FROM atlasMergeLog WHERE (dec BETWEEN -22.5 AND -12.5) AND (gmfID > 0))

where the ANY operator is used to match rows in the Multiframe table with any of the rows in the output result set from the subquery which have the same gmfID value. Note that in both these subqueries formulations the list of attributes in the SELECT clause of the subquery must be consistent with the rest of the WHERE clause in the main query, since they are to be matched by value.

These last two queries illustrate the equivalence of IN and = ANY, but care must be taken if the logical operator NOT is added to the outer query, so that it seeks matches with the complement of the set for which matches are sought in the original query. The operator which is equivalent to NOT IN is not < > ANY, as one might initially expect, but rather < > ALL - where ALL is another logical operator, which evaluates to TRUE if all of a set of comparisons are TRUE - which is clear when one thinks through the logic of that query, but perhaps not immediately apparent.

The most common use for subqueries is to express complicated joins in a simpler fashion. Up to 32 levels of nesting are allowed, in principle, although memory restrictions may prevent that from being achieved in practice. To evaluate a complicated join as a nested series of queries would often be much slower, but the database engine should recognise the equivalence of the two forms of the query and recast it in the most efficient form as part of its query optimisation process. So, there should be no difference in the performance of queries submitted as complicated joins or as nested subqueries, and the latter are to be prefered if they help ensure that the query that is executed really is the one that the user wanted to run.

2.5. Null and default values in the OSA

Our discussion to this point has implicitly assumed that the values of the attributes corresponding to each column in every row in a database table are known. This need not always be the case, as a simple example from the OSA illustrates. The *Source tables in the OSA merges information about detections made in the individual frame set images. A very blue object may well not be detected in a K band image, so what should be written for such an object in the column of the *Source table which records, say, the ellipicity of the K band detection? One answer would be a null value, which is a special type of entry to be included in a table if the value of a given attribute is not known (or is indeterminate or is not applicable) for a particular row. In designing the OSA we have decided not to use nulls in these cases, but to define default values for use in these situations instead: e.g. in the example above, we would set the K band ellipticity in the *Source table of a celestial object undetected in that band to be a recognisably meaningless value; in the case of floating point numbers, -0.9999995e9. Nulls and default values are semantically different: the query processor in a database management system (DBMS) recognises that a value marked as null is unknown or indeterminate and will not include it in, say, the computation of the mean value of all entries in a column, while, to the query processor, a default value is like any other value, and will include it, unless the user explicitly excludes it - e.g. by computing the mean magnitude only for those objects with magnitudes brighter than -0.9999995e9, in this case. Other default values in the OSA include -99999999 for 4- and 8-byte integer attributes, -9999 for 2-byte integers, and 0 for 1-byte (unsigned) integers. The schema browser generally indicates the default value for many of the table attributes, and they have all been chosen to lie well beyond the range of legitimate values found in the OSA, so it is simple to exclude them:

SELECT AVG(rAperMag3)
FROM atlasSource
WHERE rAperMag3 > 0.0

As a result of defining default values for some columns, there have to be dummy rows in some tables (i.e. rows for which every attribute takes the appropriate default value). The reason for this is illustrated by consideration of the *MergeLog and Multiframe tables in the OSA. If one of the multiframe UIDs (e.g. the Y image attribute, gmfID) in one of the *MergeLog tables has a default value because that particular frame set does not yet have a Y-band image, and if the attribute gmfID references the attribute multiframeID in table Multiframe, then table Multiframe needs an entire row of defaults for multiframeID=-99999999 in order to maintain the referential integrity of the database.

It is the responsibility of the user to ensure that the presence of default values and dummy rows will not corrupt the results of queries run on the OSA, but our decision to use them, rather than nulls greatly simplifies the logic involved in all queries run on the database. The inclusion of null values for an attribute means that an expression involving it can evaluate to TRUE, FALSE or UNKNOWN, and we believe that the simplification for users in avoiding this three-valued logic greatly outweighs the burden of having to remember that default values exist for some columns in the OSA.

3. Reference: additional options in SELECT statements

Previous sections have described the basic SELECT…FROM…WHERE… structure of an SQL query. This is the basic syntax to be employed for querying the OSA, but there are some additional options in the SELECT clause which users may find useful. Once again, a fuller reference than is presented below is available online here.

3.1 Aggregate Functions

SQL offers a number of useful aggregate functions, which can be used for deriving summary information on columns or selections from them.

3.1.1 Mathematical aggregate functions

The meanings of those mathematical aggregate functions which apply only to numeric columns are very straightforward: AVG, MAX, MIN, SUM, STDEV, STDEVP, VAR, and VARP return, respectively, the mean, maximum, minimum, sum, standard deviation, population standard variation, variance and population variance of all values in the specified column. They can be used in conjunction with a WHERE clause or not, i.e.

SELECT max(dec) FROM atlasMergeLog

will return the maximum value for the dec column found in the atlasMergeLog table, while

SELECT max(dec) FROM atlasMergeLog WHERE (ra BETWEEN 180 AND 360)

returns the maximum value found for fields within one hemisphere of the sky.

N.B. Strictly speaking, these functions only apply to non-null values found within the particular column. As discussed above, the OSA contains no null values by design, but uses default values instead. These will not be ignored in the computation of aggregate functions. For example, the SQL query

SELECT min(mjdObs) FROM Multiframe

returns the value -0.9999995e9, which is clearly nonsensical astronomically, and just illustrates the unintentional selection of the dummy row in the Multiframe table.

3.1.2 COUNT

There are additional aggregate functions which can be run on columns of all types. The most useful of these is COUNT, which can be used in a number of ways, as shown in the following example. The programmeID attribute in the ProgrammeFrame table identifies the multiframes which belong to that programme in the archive: since the OSA combines data from several programmes (e.g. ATLAS), there are several different values of programmeID. To find out how many, one could use the following query:

SELECT COUNT (DISTINCT programmeID) FROM ProgrammeFrame

In practice, this is best discovered from querying the OSA's Programme table. Note that the inclusion of the DISTINCT keyword means that only the number of distinct programmeID values was returned. If it had been omitted, then the returned value would be the total number of entries in the programmeID column, which is simply equal to the number of rows in the table, since there are no null values for programmeID in the ProgrammeFrame table.
The total number of rows in the table can be also obtained using the following syntax:

SELECT COUNT(*) FROM ProgrammeFrame

Using COUNT(*) like this is a very good way of ensuring that a query is sensible before getting data returned by running it, and users are strongly encouraged to use this method to develop and debug SQL queries before running them on the full OSA. The performance overhead involved in this two-query process is not as high as it might seem, since, depending on how heavily the database is being used, some of the data from the COUNT(*) query will still be in cache when the query is run a second time to extract the desired attributes. Users should note that COUNT should be replaced by COUNT_BIG if there is likely to be more than 2x109 records to be counted: this circumvents an SQL Server bug.

3.1.3 TOP

An interesting function specific to SQL Server’s SQL dialect is TOP, which is illustrated as follows. The query

SELECT TOP 10 mjdObs FROM Multiframe

would return the MJDs for ten rows in the Multiframe table. This will generally not be the ten highest (i.e. most recent) values in the table; remember that SQL is a set-based language, so a query yields the set of rows satisfying the criteria specified in the query, but with, by default, no particular ordering within that set. The ten highest area values can be obtained using TOP, however, with the addition an ORDER BY clause, which does impose an ordering of the rows in the result set: i.e.:

SELECT TOP 10 mjdObs FROM Multiframe ORDER BY mjdObs DESC

Note the presence of the DESC (for descending) keyword in the ORDER BY clause, which is required because the default behviour is for that clause to list rows in ascending order.

3.1.4 GROUP BY and HAVING

The GROUP BY clause allows aggregate functions to return more than a single value. For example, the user running the "SELECT COUNT(DISTINCT programmeID) FROM ProgrammeFrame" query above might want to know how many OSA multiframes are associated with each programme. That information is returned by the following query:

SELECT programmeID, COUNT(*) FROM ProgrammeFrame GROUP BY programmeID

where the rows in ProgrammeFrame are grouped by their programmeID value and separate counts are made of the number of rows in each group.

3.2 Mathematical Functions

The following mathematical functions are supported by SQL Server's SQL dialect.

  • ABS(x): returns the absolute value of x.
  • ACOS(x): returns the angle, in radians and in the range 0 to π, whose cosine is x.
  • ASIN(x): returns the angle, in radians and in the range -π/2 to π/2, whose sine is x.
  • ATAN(x): returns the angle, in radians and in the range -π/2 to π/2, whose tangent is x.
  • ATN2(y,x): returns the angle, in radians, whose tangent is y/x.
  • CEILING(x): returns the smallest integer greater than, or equal to, x.
  • COS(x): returns the cosine of angle x in radians.
  • COT(x): returns the cotangent of angle x in radians.
  • DEGREES(x): returns the angle in degrees corresponding to angle x in radians.
  • EXP(x): returns the exponential of x.
  • FLOOR(x): returns the largest integer less than or equal to x.
  • LOG(x): returns the natural logarithm of x.
  • LOG10(x): returns the base 10 logarithm of x.
  • PI(): returns the value of π.
  • POWER(x,y): returns the value of x raised to the power y.
  • RADIANS(x): returns the angle in radians corresponding to angle x in degrees.
  • RAND(x): returns a random number between 0 and 1, using an optional integer seed x if supplied.
  • ROUND(x,n,m): returns the value of x rounded to a specified length or precision. If n is a positive integer, then x is rounded to the number of decimal places specified by n. If n is a negative integer, then x is rounded n places to the left of the decimal point. ROUND can also be used to truncate x, through use of the optional argument m. If m is a positive integer, then x is truncated to the precision specified by n, rather than rounded to it.
  • SIGN(x): returns the value 1, 0, -1, depending whether x is positive, zero or negative, respectively.
  • SIN(x): returns the sine of the angle x in radians.
  • SQUARE(x): returns the square of x.
  • SQRT(x): returns the positive square root of x.
  • TAN(x): returns the tangent of angle x in radians.

Arithmetic functions (such as ABS, CEILING, DEGREES, FLOOR, POWER, RADIANS, and SIGN) return a value having the same data type as the input value, while trigonometric functions and others (such as EXP, LOG, LOG10, SQUARE, and SQRT), cast their input values to float and return a float value; this probably is of no concern to the average OSA user. All mathematical functions, except for RAND, are deterministic functions - i.e. they return the same results each time they are called with a specific set of input values - and RAND is deterministic only when a seed parameter is specified.

3.3 Operators

An operator in SQL is a symbol specifying an action that is performed on one or more expressions. For the present purposes, their major use is to provide greater flexibility in the possible forms of WHERE clauses of queries and in the columns of the result sets they can produce, which need not simply be columns drawn from the table(s) being queried. There are several classes of operator to consider.

3.3.1 Arithmetic operators

SQL Server's SQL dialect supports five arithmetic operators. The four basic ones - Addition (+), Subtraction (-), Multiplication (*) and Division (/) - plus the Modulo operation, (%), which returns the remainder of dividing one integer by another, and is used in the format "dividend%divisor". The use of the four basic arithmetic operators follows straightforwardly from their definitions.

3.3.2 Comparison operators

The Boolean comparison operators are used most frequently to filter rows via the WHERE clause of a SQL query. The most simple comparison operators ( <, >, =) were used above without introduction, but there a total of nine comparison operators which can be applied to pairs of expressions in the SQL Server dialect of SQL: = (Equal to); > (Greater than); < (Less than); >= (Greater than or equal to); <= (Less than or equal to); <>(Not equal to); != (Not equal to); !> (Not greater than); and !< (Not less than).

3.3.3 Logical operators

In a similar fashion, we have used a number of the logical operators (e.g. AND, BETWEEN, etc) above without introduction, but the following is the full list of logical operators supported by SQL Server:

  • ALL: TRUE if all of a set of comparisons are TRUE.
  • AND: TRUE if both of two Boolean expressions are TRUE.
  • ANY: TRUE if at least one of a set of comparisons is TRUE.
  • BETWEEN: TRUE if the operand is within a specified range.
  • EXISTS: TRUE if a subquery contains any rows.
  • IN: TRUE if the operand is equal to one of a list of expressions.
  • LIKE: TRUE if the operand matches a pattern.
  • NOT: Reverses the value of any other Boolean expression.
  • OR: TRUE if either of a pair of Boolean expressions is TRUE.
  • SOME: TRUE if at least one of a set of comparisons is TRUE.

The use of some of these operators (e.g. AND, BETWEEN, OR) has been illustrated above, but it is worth making a few comments on the remaining ones.

The LIKE operator is used for pattern matching. This is most commonly used for string matching. For example, a user interested in knowing how many ATLAS multiframes are present in the OSA could issue the query:

SELECT COUNT(*) FROM Multiframe WHERE project LIKE "%ATLAS%"

Note the use of the percentage sign wildcard character which matches all the project strings defined for the ATLAS. Several other wildcard characters can be used in conjunction with LIKE; these are described further in the SSA Cookbook, as are further examples of logical operators.

3.3.4 String concatenation operator

The plus sign, +, is used as a string concatenation operator in the SQL Server dialect of SQL. This is most likely to be of use to OSA users in the formatting of result sets - i.e. in the definition of SELECT clauses.

3.3.5 Unary operators

The SQL Server dialect of SQL defines three unary operators - i.e. operators which have only one operand - although none of these are likely to be of much use to most OSA users. The first of these is the positive unary operator, +, which is used principally in SQL statements not allowed by the OSA query interface. The second, -, is the negative unary operator, which returns the negative value of a numeric expression, as shown in the following query:

SELECT -MIN(decBase) FROM Multiframe WHERE MultiframeID > 0

The final unary operator, ~, the Bitwise NOT operator, converts each bit in a numeric expression of integer data type into its 1s complement (i.e. 0s are changed to 1 and vice versa). This might be of use in queries involving the quality flags in the *Source and *Detection tables.

3.3.6 Operator precedence

The operators described in this subsection have the following descending levels of precedence:

  • + (Positive Unary Operator), - (Negative Unitary Operator), ~ (Bitwise NOT)
  • * (Multiply), / (Division), % (Modulo)
  • + (Add), + (Concatenate), - (Subtract), & (Bitwise AND)
  • = , > , < , > = ,< = , < > , ! = , ! > , ! < (Comparison operators)
  • ^ (Bitwise Exclusive OR), | (Bitwise OR)
  • NOT
  • AND
  • ALL, ANY, BETWEEN, IN, LIKE, OR, SOME

When two operators in an expression have the same operator precedence level, they are evaluated left to right based on their position in the expression.

4. How to query cross-matched data in the archive

Archive catalogue tables are automatically cross-matched to themselves and to a number of external survey catalogue datasets held locally, e.g. SDSS Data Releases; FIRST, IRAS, ROSAT and 2MASS catalogues; and legacy photographic catalogues like USNO-B (for a complete list, click on "Browser" on a navigation bar or see below). Rather than prejoining any two datasets to create a static, merged table of what are assumed to be associated sources on the basis of a fixed joining algorithm, and choosing a subset of what are assumed to be useful attributes from each of the two catalogue tables to propagate into that merged table, the philosophy is to create a table of pointers between any two matched datasets. This means, for example, that any externally catalogued source co-incident or nearby an archive source is readily available, out to some predefined maximum angular search radius, and all sorts of science usages are possible because the exact matching criteria (e.g. maximum allowable displacement, consistency in morphological parameters or classification, or even the number of possible external matches for a given archive source) can be tuned at query time by simple expressions in SQL. Furthermore, all attributes of both datasets are available to querying, because no decision has been made as to which attributes to propagate into a static merged set. The flip side to this flexibility is the rather opaque syntax for querying cross-matched data in SQL: instead of querying a single merged table, e.g. SELECT * FROM MergedTable WHERE ..., in general you have to query three tables: the two cross-matched tables and the table of pointers, e.g. SELECT * FROM Table1, ExternalDB..Table2, CrossNeighboursTable2 WHERE ..., using the joining techniques discussed above to select out the rows that you require. Some real-world examples best illustrate how to query cross-matched data in the OSA, but first we discuss the naming of OSA objects (databases, tables and attributes) that pertain to catalogue joining.

4.1 Naming convention for tables and attributes

In OSA parlance, a table of pointers that associates sources within one table (i.e. that provides a list of all nearby sources for each source within one table) is called a neighbour table, while a table of pointers that associates a set of externally catalogued sources to a table of VST sources is called a cross-neighbour table. Cross-neighbour tables are likely to be used the most, but neighbour tables have their uses (for example, you may wish to check the internal consistency of VST photometry and astrometry by selecting a set of measurements of the same sources in overlap regions of adjacent frame sets - this is most easily done using a neighbour table).

A list of all available neighbour/cross-neighbour tables is most easily obtained as follows:

SELECT neighbourTable FROM RequiredNeighbours

The naming convention is simple: neighbour tables are named by appending the word Neighbours to the source table that they point to, e.g. atlasSource has a neighbour table called atlasSourceNeighbours; while cross-neighbour tables are named by concatenating the two associated table names with an X, e.g. the cross-neighbour table between the ATLAS source table and the 2MASS point source catalogue is called atlasSourceXtwomass_psc. Within these tables, there will always be the following three attributes: masterObjID, which points to the source for which neighbours have been created (the central source of the neighbourhood, if you like); slaveObjID, which points to the neighbouring sources (sources lying within the neighbourhood); and finally distanceMins which is the angular separation between the central source and the neighbour, in units of arcminutes. Depending on the external catalogue joined, there may be other attributes in cross-neighbour tables (use the Browser to examine these).

At the time of writing, external catalogue databases held within the OSA consist of:

External survey Catalogue product SQL database name Table name Description
Sloan Digital Sky Survey EDR SkyServerV3 PhotoObj SDSS Early Data Release (as released in the SkyServer RDBMS implementation, not the original SX OODBMS)
DR9 (8,7,5,3,2) BestDR9 (8,7,5,3,2) PhotoObj SDSS Data Release 8 (7,5,3,2)
2-Micron All-Sky Survey Point Source Catalogue TWOMASS twomass_psc 2MASS point source catalogue
Extended Source Catalogue TWOMASS twomass_xsc 2MASS point source catalogue
ROSAT All Sky Survey Bright Source Catalogue ROSAT rosat_bsc ROSAT all-sky survey bright source catalogue
Faint Source Catalogue ROSAT rosat_fsc ROSAT all-sky survey faint source catalogue
IRAS PSC IRAS iras_psc Infra-Red Astronomical Satellite point source catalogue
VLA-FIRST FIRST catalogue FIRST firstSource Faint Images of the Radio Sky at Twenty cm from the VLA survey
SSA Southern hemisphere survey SSA Source SuperCOSMOS Sky Survey as ingested into the SuperCOSMOS Science Archive
Millenium Galaxy Catalogue MGC MGC mgcDetection 37.5 square degree, medium-deep B-band imaging survey
NRAO VLA Sky Survey NVSS NVSS nvssSource A 1.4 GHz continuum survey covering the entire sky north of -40 deg declination.

Alternatively, you can use the OSA Browser to look at all the neighbour tables.

4.2 Selecting nearby cross-matches

Suppose a user wishes to select the identifiers and co-ordinates of all Faint Source Catalogue sources from the ROSAT All-Sky Survey that are within 6 arcsec of a ATLAS source. The SQL to do this is as follows:

SELECT atlas.sourceID, atlas.ra, atlas.dec, fsc.seqNo, fsc.ra, fsc.dec
FROM atlasSource AS atlas, ROSAT..rosat_fsc AS fsc, atlasSourceXrosat_fsc AS x
WHERE x.masterObjID=atlas.sourceID AND x.slaveObjID=fsc.seqNo AND x.distanceMins<0.1

Note: i) the use of table aliases atlas and fsc which are a convenient short-hand; ii) the use of attribute prefixes (e.g. atlas.ra) to distinguish between attributes in different tables that happen to have the same name; iii) the table joining condition in the WHERE clause, which selects associated rows (if you omit this condition, you will get all rows of each table joined to all other rows, i.e. an extremely large and useless dataset!); and finally iv) the specification of a maximum radius of 0.1 arcmin (=6 arcsec) for this query, where the maximum available for ROSAT joins is 30 arcsec. You can see the maximum joining radii (in arcsec) available to you in each neighbour/cross-neighbour table by issuing the following query:

SELECT neighbourTable, joinCriterion*3600.0 as maxRadius
FROM RequiredNeighbours

When selecting nearby cross-matches, users should note that one or more than one rows may result for each master object, since there may be more than one slave neighbour within the neighbourhood defined by the maximum join criterion specified. If you want the nearest match only, then the next section explains how to do this.

4.3 Selecting the nearest cross-match

Suppose a user wishes to select the infrared and optical point source photometry for a sample (say the first 50 rows for speed) in the ATLAS and SDSS DR2 overlap, taking the nearest SDSS source to each VST source as being the most likely match, where the maximum positional tolerance is set to 2 arcsec. The following query will do the job:

SELECT TOP 50 atlas.ra, atlas.dec, uAperMag3, gAperMag3, rAperMag3, iAperMag3, zAperMag3, psfMag_u, psfMag_g, psfMag_r, psfMag_i, psfMag_z
FROM atlasSource AS atlas, BestDR8..PhotoObj AS dr8, atlasSourceXDR8PhotoObj AS x
WHERE masterObjID=atlas.sourceID AND slaveObjID=dr8.ObjID AND distanceMins<0.033333 AND sdssPrimary=1 AND distanceMins IN (
SELECT MIN(distanceMins)
FROM atlasSourceXDR8PhotoObj
WHERE masterObjID=x.masterObjID AND sdssPrimary=1)

Note the following: i) here, we have used table aliases as attribute prefixes only where necessary (i.e. where attributes are not unambiguously identified by their names alone); ii) we select "primary" objects only from the SDSS, as we are interested in a data set with no duplicates; iii) a subquery is used to select the closest neighbour in each case, and the main query selects this nearest object if it is within the specified 2 arcsec maximum radial tolerance.

4.4 Selecting neighbours within one table

For VST survey tables that have neighbour tables defined, you can examine the neighbours of each source in the table. For example, suppose we want to check the internal consistency of OSA photometry and astrometry by selecting a sample of duplicate source measurements from overlap regions between adjacent frame sets in the ATLAS. The following query will do the job:

SELECT (s1.ra-s2.ra)*3600.0*COS(RADIANS(s1.dec)) as deltaRA,
               (s1.dec-s2.dec)*3600.0 as deltaDec,
               (s1.uAperMag3-s2.uAperMag3) as du,
               (s1.gAperMag3-s2.gAperMag3) as dg,
               (s1.rAperMag3-s2.rAperMag3) as dr,
               (s1.iAperMag3-s2.iAperMag3) as di
FROM atlasSource AS s1, atlasSource AS s2, atlasSourceNeighbours AS x
WHERE s1.uAperMag3 > 0 AND s2.uAperMag3 > 0 AND s1.gAperMag3 > 0 AND s2.gAperMag3 > 0
               AND s1.rAperMag3 > 0 AND s2.rAperMag3 > 0 AND s1.iAperMag3 > 0 AND s2.iAperMag3 > 0
               AND masterObjID=s1.sourceID AND slaveObjID=s2.sourceID AND s1.priOrSec=s1.frameSetID AND distanceMins<0.01 AND distanceMins IN (
SELECT MIN(distanceMins)
FROM atlasSourceNeighbours
WHERE masterObjID=x.masterObjID)

Note that in this case, a stringent pairing tolerance of 0.6 arcsec has been set (this corresponds roughly to a 3-sigma chop given the typical positional errors in VST astrometry). Also, we have used the useful feature of user-defined computed columns to create a tabular dataset of attributes that are most useful to us (i.e. differences in positions in arcsec and differences in magnitudes in each available passband).

5. Refining sample selection using quality bit flags

Within the detection and source tables there are quality bit flags, labelled ppErrBits in the detection tables and *ppErrBits in the source tables, where * is the filter name e.g. uppErrBits. These attributes encode the quality issues associated with a given detection such that detections subject to more serious quality issues have higher values. For a detection with no known quality issues the attribute's value is zero. The bit flag value encodes up to 32 distinct quality issues, which are divided into four different classes according to severity. This allows you to select a sample of sources where the quality is better than a certain decimal threshold value corresponding to each warning classification listed in this table:

Warning Class Threshold
Information 256
Warning 65536
Important 16777216
Severe

So, to select a sample that consists of sources with no known quality issues or just minor, informative, warnings we would add ppErrBits < 256 to the WHERE clause of our SQL query. To consider the quality information from every passband in your selection, you can merge the source table *ppErrBits attributes as: (gppErrBits | rppErrBits | ippErrBits) < 256 for as many passbands as exist in that programme's source table. As an example of such source selections in practice, suppose we want to produce a colour magnitude plot for galaxies in the ATLAS source table which have no important quality issues:

SELECT iPetroMag, rmiExt
FROM atlasSource WHERE mergedClass=1 AND iPetroMag>-9.99995e+8 AND rmiExt>-9.99995e+8
AND (rppErrBits | ippErrBits) < 65536

Initially one would wish to select just those sources with no known quality issues, and then widen the search to include more and more potentially spurious sources. To further refine our samples we can select or deselect certain quality issues according to their associated bit mask, the full list of which may be found in the glossary entries for the ppErrBits attributes. To select sources marked with a certain quality issue we would add ppErrBits & bitMask != 0 to the WHERE clause of our SQL query, where bitMask is the bit mask value (decimal or hexadecimal) assigned to that quality issue as listed in the ppErrBits glossary entry. Conversely, to exclude a certain quality issue we would add ppErrBits & bitMask = 0 to the WHERE clause of our SQL query.

For example, having initially excluded all sources with important quality issues, we might wish to include a sample that are within the ears of tiles. To do this we just extend the SQL query as follows:

SELECT iPetroMag,rmiExt
FROM atlasSource WHERE mergedClass=1 AND iPetroMag>-9.99995e+8 AND rmiExt>-9.99995e+8
AND ((rppErrBits | ippErrBits) < 65536 OR (rppErrBits | ippErrBits) & 0x00400000 != 0)

where the 0x00400000 is the bit mask, in hexadecimal notation, that represents the "source lies within a dither offset of the stacked frame boundary" quality issue.

Link to a fully detailed guide to the design of the quality bit flag attribute, ppErrBits, together with the complete list of quality issues with full descriptions and associated bit masks.

6. Refining sample selection for filter coverage

When making selections from the survey merged source tables *Source, users are reminded to think carefully about their requirements as regards filter coverage in the sample. Because releases of survey data are made prior to full completion of the surveys (obviously to expedite science exploitation), selections will be incomplete in various subtle ways depending on exactly how the sample predicates are specified.



Home | Overview | Browser | Access | Login | Cookbook
Listing | FreeSQL
Links | Credits

WFAU, Institute for Astronomy,
Royal Observatory, Blackford Hill
Edinburgh, EH9 3HJ, UK

osa-support@roe.ac.uk
23/4/2015