An Introductory SQL Tutorial for SSA Users
In this document we provide a brief introduction to the use of Structured Query Language (SQL) for accessing data in the SuperCOSMOS Science Archive (SSA). We divide this document into a Primer aimed at users new both to the SSA and to SQL, a Reference which should be of use more experienced and returning users, and an Examples section, which presents a set of 20 realistic queries used in the design of the SSA. Readers wanting a fuller introduction to SQL should consult an online tutorial or one of the legion of SQL books available: O'Reilly's SQL in a nutshell is a good introduction. Some familiarity with SuperCOSMOS and the set of parameters returned by its image analyser are assumed in what follows, so readers may wish to consult the introduction provided by Hambly et al. (2001), MNRAS, 326, 1295, while those wishing to know the differences between the data presented in the SSA and those made available previously through the SuperCOSMOS Sky Survey web interface should consult the SSA Database Overview page.
The SSA is a large database - more than 2 TB in size - so, for test purposes, we have produced the "Personal SSA" (PSSA), which is a small subset of the SSA, containing solely the data in the region of the sky with 184 < RA (deg) < 186 and -1.25 < Dec. (deg) < 1.25, which is the same area of sky as the "Personal SkyServer" produced for the Early Data Release (EDR)of the Sloan Digital Sky Survey (SDSS). The PSSA may be downloaded from here (as a .zip file with installation instructions included) or can be queried using a web interface of the same form as for the full SSA. SSA users are strongly encouraged to use the PSSA for developing and debugging queries that they want to run on the SSA: with a database as large as the full SSA it can take a long time to find out that the query you wrote does not do what you intended!
In particular, queries within this tutorial may be run on the PSSA by simply copying the highlighted text from this document and pasting it into the text box of the PSSA's SQL Query form, and after each one displayed against a highlighted background we provide a link to a copy of the output page obtained from running the query with the form's default setting of returning to the user's browser the first 30 rows of the result set. Note that these output pages may differ in detail from the ones you may obtain running the same query. As discussed below, SQL is a set-based language, and all that is guaranteed is that the same query on the same database returns the same result set, with no guarantees as to the order in which rows appear in that result set. The amount of data that users can have returned from the SSA is currently limited to 15 million row-columns - i.e. for a query returning five columns, a maximum of three million rows will be returned. The COUNT(*) command, described in Section 3.1.2 below, can be used to determine the number of rows to be returned by a query, if there is doubt whether this limited will be exceeded.
2.1. Relational databases
The SSA is a relational database, which means that it stores data in tables composed of rows and columns. Each row comprises the information stored for one data entry – i.e. a celestial object in the case of the SSA – and there is one column for each of the attributes recorded for that entry – e.g. RA, Dec, ellipticity, etc, for the SSA. The different tables comprising a database may be linked (or related), if they each have columns representing the same data value, and integrity constraints can be included in the table definitions which ensure consistency between two related tables, e.g. by preventing the deletion of only one of a pair of rows in different tables thus linked. For ease of use, it is possible to define virtual tables - called views - which are subsets of the data in one or more tables and which can be queried using the same syntax as ordinary tables (which are sometimes called base tables, to distinguish them from these virtual tables). In addition to tables and views, the major constituents of a relational database are indexes (the database community prefer that spelling to "indices"), which can speed up the identification of records which satisfy the particular condition expressed in a query, and various stored procedures and functions which extend the range of operations which can be performed on data held in the tables. The collection of definitions of columns, tables, views, indexes, stored procedures and functions in a database is called its schema.
The SSA schema is described in detail elsewhere, but we recap here the basic features which we shall use later. The two major tables in the SSA are called Detection and Source. The columns in Detection are basically the attributes derived by running the SuperCOSMOS image analyser over a single plate scan, and these single-plate detections are then merged into multi-epoch, multi-colour records for individual celestial objects, which are stored in Source. In addition to these two major tables, there are also a number of metadata tables, which store ancillary information describing the processes involved in obtaining and reducing SuperCOSMOS data, and which enable the provenance of data values in Source and Detection to be traced all the way back to a given glass plate exposed in an observation of a particular survey field made under known conditions and subsequently processed using a certain set of calibration coefficients. The SSA uses the same set of spatial access routines as the SDSS SkyServer, based on the Hierarchical Triangular Mesh (HTM) pixelisation of the celestial sphere, which was developed at Johns Hopkins University. To aid spatial matching of objects within the SSA and between the SSA and the SDSS EDR, respectively, there are also "Neighbours" and "CrossNeighboursEDR" tables which record pairs of sources within 10 arcsec of one another. Three views are defined in v1.0 of the SSA: ReliableStars, CompleteStars and ReliableGalaxies. As their names suggest, these are intended for use when well defined subsamples of stars or galaxies with high completeness or reliability are required, and they are defined in terms of selections on attributes in the Source table. Their advantage is that the user does not need to remember the constraints (detailed in Section 2.6 below) on the attributes required to define the subsample, but can simply query it using the view created to constitute that subsample. Users should check which attributes in which tables have been indexed in the v1.0 SSA, since the performance of queries that can make use of them should be significantly better than for those which do not: this information is presented in the Schema Browser.
SQL is the standard language for accessing and manipulating data stored in a relational database. In fact, several versions of the SQL standard exist, and most database management systens (DBMSs) actually support a subset of standard SQL, with some vendor-specific additions. The SSA is currently implemented in Microsoft's SQL Server 2008 DBMS, so SSA users will employ its SQL dialect, although we have tried to restrict the use of vendor-specific features to a minimum. A fuller reference on this SQL dialect than presented here is available on line here.
The first thing to understand about SQL is that it is a set-based language, not a procedural language, like Fortran or C. A user submitting an SQL query to a relational database is defining the set of properties of the records that she wants returned from the database, not specifying the list of operations which will lead to their delivery; this latter is the responsibility of the DBMS engine, which will decide the best way to execute a given query from a set of possible execution plans. Many database vendors are adding procedural capabilities to the SQL dialects they support, and these constitute one of the main areas of difference between those dialects. These extensions will not be discussed here, as we shall concentrate on the basics of standard SQL.
For security reasons, the SSA does not allow users to execute queries which affect the basic structure and contents of the database, only those which can extract data from it. In SQL terms, this means that only SELECT statements are allowed: N.B. in this tutorial we write all SQL keywords in upper case italics and some column names in mixed case, both for clarity, although the SSA's SQL dialect is case insensitive by default. There are three basic classes of SELECT statement:
A projection is the retrieval of a set of full columns from
a table. To retrieve the nominal RAs and Decs of the centres of all sky
survey fields in the SSA, one would type:
where Field is the name of the SSA table which records information about sky survey fields, and nominalRA and nominalDec are the names of the relevant columns in that table.
A selection is the retrieval of the data values in
particular columns for those rows in a table which satisfy certain critieria.
So, if one were interested only in fields whose nominal centres lie in a 10
degree strip south of the celestial equator, the appropriate SQL query would
In this example the SQL statement has been split into three lines to emphasise the SELECT…FROM…WHERE form of the selection, but this is still one SQL statement. The SQL Query Form in the SSA interface ignores the whitespace at the end of each line of text and generates a single query string from valid multi-line text like this. (Note that this means that users should not used double hyphens to indicate comments in multi-line queries, since this will result in all text after the first comment being ignored.)
Multiple constraints can be included in the WHERE clause of a selection, so, for example, the query above could be rewritten as:
[Link to demo result set]
while the field centres of all other fields could be selected using the following statement:
[Link to demo result set]
The parentheses in these examples have been included for clarity – they are only required when needed to avoid ambiguity, and when necessary to over-rule the standard order of precedence amongst operators, outlined in Section 3.4.9. (Users should note that the accidental omission of the WHERE clause from a selection turns it not into an invalid query, but into the projection of the columns contained in its SELECT clause, which, for tables as large as the Source and Detection tables of the SSA - both of which have in excess of one billion rows - will return a lot of data.)
A join is the retrieval of data entries from one or more tables
in a database matched under some criterion. Extending our example above, a user may be interested in the
dates on which SSA exposures in this equatorial strip were taken. The Plate
table in the SSA has an attribute called MJD, which records the Modified
Julian Date at the midpoint of the exposure of each photographic plate making
up the SSA. The Plate and Field tables are linked by having the
common attribute fieldID, which is a unique identifier for each sky
survey field (e.g. Field 1 in the ESO/SRC field system has a different fieldID
value to Field 1 in the Palomar system). The SQL query retrieving the desired
dates here would be:
Another way of doing this is to use a prefix which is an alias for
the table name, assigned using the AS keyword in the FROM clause
of the join statement. In this way, the previous query can be rewritten as:
Prefixes are not needed for the other attributes in this query (mjd, nominalRA and nominalDec) because they each occur in only one of the tables listed in the FROM clause, so they are unambiguous. The AS keyword can also be used to rename the attributes in the SELECT clause so that their names in the output result set differs from those by which they are known in the database table. For example, a user thinking that the column names referred to in the query above are a little cryptic could rewrite it as follows:
[Link to demo result set]
and the columns returned in the output result set will be headed MeanJulianDate, PlateCentreRA, PlateCentreDec. This useful facility can be misused by the unwary or the very stupid. For example, it would be possible to have a query which started "SELECT ra AS dec, dec AS ra", which could confuse the subsequent analysis of the extracted result set. (In the particular case of extracting data in VOTable format from the SSA, the assignment of Unified Content Descriptors (UCDs) to columns - i.e. setting the values of the ucd attributes to <FIELD> elements in the metadata section at the start of the VOTable document - is performed on the basis of the column name in the output result set, so a user swapping ra and dec, as in the example above, would end up with the wrong UCD for both those two columns, causing problems were that VOTable to be used in conjunction with other Virtual Observatory tools reliant on UCDs. For the moment, users wishing to have UCDs included in VOTable output from the SSA must not rename table columns, while, more generally, it is left to the user to ensure that any renaming they do perform is sensible.)
It is also possible to write join queries in a way that makes it more explicit that
they are joining the two tables, i.e. the example above becomes:
This is an inner join, meaning it only returns the
(mjd,nominalRA,nominalDec) triplets for matched rows, but there are other types
of join, which return different combinations of data. Firstly, there are left
and right outer joins and full joins, all of which also
return data from unmatched rows. Replacing the join clause above with
Finally, a full join
would return the appropriate attributes for the unmatched rows in both tables, as well as the matched rows.
The SQL Server dialect of SQL allows the construction of nested SELECT statements, in which the
WHERE clause includes a subquery which is itself a valid SELECT statement. For
example, the join on Field and Plate in the previous Section could be rewritten in the following way:
Note that the prefixes have been removed here, since the two occurences of fieldID are now separate and no longer ambiguous. The second thing to notice is that the nominalRA and nominalDec attributes from the FIELD table have now been removed from the SELECT clause of the main query. This is because the FIELD table is only present in the subquery, and its attributes are no longer accessible to the main query. The one exception to this is its fieldID attribute, but that is only accessible by value - i.e. the subquery generates a list of fieldID values and matches between this and the fieldID column of the Plate table are made by use of the IN operator.
This query could also be written using a second logical operator, ANY, as follows:[Link to demo result set]
where the ANY operator is used to match rows in the Plate table with any of the rows in the output result set from the subquery which have the same fieldID value. Note that in both these subqueries formulations the list of attributes in the SELECT clause of the subquery must be consistent with the rest of the WHERE clause in the main query, since they are to be matched by value.
These last two queries illustrate the equivalence of IN and = ANY, but care must be taken if the logical operator NOT is added to the outer query, so that it seeks matches with the complement of the set for which matches are sought in the original query. The operator which is equivalent to NOT IN is not < > ANY, as one might initially expect, but rather < > ALL - where ALL is another logical operator, which evaluates to TRUE if all of a set of comparisons are TRUE - which is clear when one thinks through the logic of that query, but perhaps not immediately apparent.
The most common use for subqueries is to express complicated joins in a simpler fashion. Up to 32 levels of nesting are allowed, in principle, although memory restrictions may prevent that from being achieved in practice. To evaluate a complicated join as a nested series of queries would often be much slower, but the database engine should recognise the equivalence of the two forms of the query and recast it in the most efficient form as part of its query optimisation process. So, there should be no difference in the performance of queries submitted as complicated joins or as nested subqueries, and the latter are to be prefered if they help ensure that the query that is executed really is the one that the user wanted to run.
Our discussion to this point has implicitly assumed that the values of the attributes corresponding to each column in every row in a database table are known. This need not always be the case, as a simple example from the SSA illustrates. The Source table in the SSA merges information about detections made on B, R and I band plates. A very blue object may well not be detected in an I band plate, as these are shallower than the B and R plates, so what should be written for such an object in the column of the Source table which records, say, the ellipicity of the I band detection? One answer would be a null value, which is a special type of entry to be included in a table if the value of a given attribute is not known (or is indeterminate or is not applicable) for a particular row. In designing v1.0 of the SSA we have decided not to use nulls in these cases, but to define default values for use in these situations instead: e.g. in the example above, we would set the I band ellipticity in the Source table of a celestial object undetected in that band to be -9.999e9. Nulls and default values are semantically different: the query processor in a database management system (DBMS) recognises that a value marked as null is unknown or indeterminate and will not include it in, say, the computation of the mean value of all entries in a column, while, to the query processor, a default value is like any other value, and will include it, unless the user explicitly excludes it - e.g. by computing the mean magnitude only for those objects with magnitudes brighter than -99.0, in this case.
As a result of defining default values for some columns, there have to be dummy rows in some tables (i.e. rows for which every attribute takes the appropriate default value). The reason for this is illustrated by consideration of the Source and Detection tables in the SSA, whose design is detailed here. The Source table includes attributes (called ObjIDB, ObjIDR1, ObjIDR2, ObjIDI) which store the values of the ObjID attribute in the Detection table of the records, respectively, of the B band, first epoch R band, second epoch R band and I band detections which have been merged into the Source record. In our example above, there is no I band detection merged into the Source record of our very blue object, so ObjIDI must take a default value. This, in turn, necessitates the creation of a dummy row in the Detection table, whose ObjID attribute takes the value used as the default for ObjIDB, ObjIDR1, ObjIDR2 or ObjIDI used for a non-detection in a particular band, since otherwise the integrity constraints that ensure that every detection referenced in the Source table is present in the Detection table would be violated. The other columns of this dummy row are then populated with the appropriate default values. The default values used for the attributes in the SSA are listed here and have all been chosen to lie well beyond the range of legitimate values found in the SSA, so it is simple to exclude them.
It is the responsibility of the user to ensure that the presence of default values and dummy rows will not corrupt the results of queries run on the SSA, but our decision to use them, rather than nulls greatly simplifies the logic involved in all queries run on the database. The inclusion of null values for an attribute means that an expression involving it can evaluate to TRUE, FALSE or UNKNOWN, and we believe that the simplification for users in avoiding this three-valued logic greatly outweighs the burden of having to remember that default values exist for some columns in the SSA.
Having discussed the
syntax of a SELECT statement, we can describe the definition of the
views available in v1.0 of the SSA. As outlined above, views are virtual tables
comprising a subset of data drawn from one or more tables. Views are defined
using a SELECT statement and in the SQL Server dialect of SQL this takes
the following form:
In essence, ViewName is being defined as the result set from a virtual query, specified by select_statement, which can take any of the forms outlined above.
The benefits of defining views can be illustrated by the
example of the CompleteStars view in SSA v1.0, which is defined as follows:
The CompleteStars view defines a highly complete selection of point sources from the Source table. It contains all the attributes for all those entries in Source for which there is a reasonable astrometric solution (chi2<5.0) derived from more at least two plates (Nplates>1) and which, in all the bands in which it is detected, is free of strong warnings concerning quality issues (qual<2048) and has image parameters within ranges outwith which it is highly unlikely that point sources will be found; these ranges are defined to be ellipticity <1/3 (i.e. image elongation is less that 50%) and -3.0 < profile statistic < 4.5 (i.e. the areal profile set for the image is close to that of an idealised PSF, where "close" is between a 3sigma lower limit and a 4.5 sigma upper limit on the N(0,1) profile statistic). Use of this view is recommended for analyses requiring high stellar completeness but which are not too sensitive to some contamination, particularly from misclassified sources, poorly parameterised deblends near the limiting magnitude of the plate and poorly parameterised/calibrated images in very crowded regions.
As always, the user must decide what SQL statement defines the set of sources best suited to the particular scientific
problem at hand, and in some cases, reliable star/galaxy separation will be more important than completeness, and, with this
in mind, we have defined ReliableStars and ReliableGalaxies views for SSA v1.0, using the following SQL
The definition of the ReliableStars view is best understood by contrasting it with that for CompleteStars. The definition of ReliableStars imposes tighter constraints on many parameters than CompleteStars to reduce contamination from non-stellar sources: the maximum ellipticity is reduced from one third to one quarter; the upper limit on the profile statistic is reduced to 3.0sigma; the overall image quality must be better (qual < 128) and only non-deblended (blend=0) detections are considered. Sources must be detected on at least three out of possible four plates (Nplates > 2) and these multi-epoch detections must admit a good astrometric solution (chi2 < 3.0). Finally, a cut in Galactic coordinates is made. Sources must lie at |b| > 10 degrees, and be more than 20 degrees from the Galactic centre, to avoid the most crowded areas of the plane and bulge, where image parameterisation is difficult. Note that the sample defined by this view will be incomplete in the most crowded regions included by this positional cut, due to blended stars being mis-classified as galaxies, so, if completeness is more of a concern that reliability, users may choose not to use ReliableStars, however it is recommended for those analyses where contamination by non-stellar sources must be minimised.
The selection defining the ReliableGalaxies view is simpler. It is a magnitude-limited (Bj < 20.0) subsample of sources with good quality (qual < 128) detections on UKJ, UKR and UKI plates, which are classed as non-stellar and non-blended in the UKJ and UKR surveys (which are deepr than the UKI) and which satisfy a positional constraint (|b| > 35.0 and d > 45.0) designed to excise those regions on the sky where contamination by stars in the bulge and close to the Galactic plane is a significant problem. As with the other views in the v1.0 SSA, use of ReliableGalaxies can greatly simplify the SQL for a great many of the queries to be run against the SSA, but users must be aware of its limitations, notably its incompleteness at faint magnitudes and lower Galactic latitudes.
Previous sections have described the basic SELECT…FROM…WHERE… structure of an SQL query. This is the basic syntax to be employed for querying the SSA, but there are some additional options in the SELECT clause which users may find useful.
SQL offers a number of useful aggregate functions, which can be used for deriving summary information on columns or selections from them.
The meanings of those mathematical aggregate functions which apply only to
numeric columns are very straightforward: AVG, MAX, MIN, SUM,
STDEV, STDEVP, VAR, and VARP return, respectively,
the mean, maximum, minimum, sum, standard deviation, population standard
variation, variance and population variance of all values in the specified
column. They can be used in conjunction with a WHERE clause or not, i.e.
will return the maximum value for the nominalDec column found in the field table, while
[Link to demo result set]
returns the maximum values found for fields within one hemisphere of the sky.
N.B. Strictly speaking, these functions only apply to non-null
values found within the particular column. As discussed above, the v1.0 SSA contains
no null values by design, but uses default values instead. These will not be
ignored in the computation of aggregate functions. For example, the SQL query
returns the value -999.9999990, which is clearly nonsensical astronomically, and just illustrates the unintentional selection of the dummy row in the Field table.
There are additional aggregate functions which can be run on columns of all
types. The most useful of these is COUNT, which can be used in a number
of ways, as shown in the following example. The systemID attribute in the Field
table identifies the scheme under which the field centres were defined: since
the SSA combines data from several surveys, it can take several different values. To
find out how many, one can use the following query:
which returns the answer 4 (which includes the default value of zero, and which could, in practice, also be discovered from querying the SSA's FieldSystem table). Note that the inclusion of the distinct keyword means that only the number of distinct systemID values was returned. If it had been omitted, then the returned value would be the total number of entries in the systemID column, which is simply equal to the number of rows in the table, since there are no null values for systemID in the Field table.
The total number of rows in the table could be obtained using the following query:
[Link to demo result set]
The count(*) function can also be used in conjunction with a WHERE
clause, and it is in this guise that it is most useful. Suppose a user were
interested in selecting a sample of the very large objects from the Detection
table, but didn't know the size of the largest objects. In ignorance, the user might
think that 100 pixels was a pretty large area and would query the SSA to return
a list of attributes for all such objects. Such a query would return about one
tenth of the records in the Detection table, which would probably return far
more data than the user could cope with. A better policy would be to use count(*)
to find out how many objects would be returned by an area cut at 100 pixels,
with the following query:
and if the count returned by this query is too high, it can be repeated with a higher area cut until a sensible sample size is returned - e.g. a cut a 1000 pixels returns something like 0.5% of the entries in Detection. Using COUNT(*) like this is a very good way of ensuring that a query is sensible before getting data returned by running it, and users are strongly encouraged to use this method in conjunction with the PSSA to develop and debug SQL queries before running them on the full SSA. The performance overhead involved in this two-query process is not as high as it might seem, since, depending on how heavily the database is being used, some of the data from the COUNT(*) query will still be in cache when the query is run a second time to extract the desired attributes. Users should note that COUNT should be replaced by COUNT_BIG if there is likely to be more than 2x109 records to be counted: this circumvents a SQL Server bug.
An interesting function specific to SQL Server’s SQL dialect is TOP, which is illustrated as follows. The query[Link to demo result set]
would return the area values for ten rows in the Detection table. This will generally not be the ten highest area values in the table; remember that SQL is a set-based language, so a query yields the set of rows satisfying the criteria specified in the query, but with, by default, no particular ordering within that set. The ten highest area values can be obtained using TOP, however, with the addition an ORDER BY clause, which does impose an ordering of the rows in the result set: i.e.:
[Link to demo result set]
Note the presence of the DESC (for descending) keyword in the ORDER BY clause, which is required because the default behviour is for that clause to list rows in ascending order.
3.1.4 GROUP BY and HAVING
The GROUP BY clause allows aggregate functions to return more than a
single value. For example, the user running the "SELECT COUNT(DISTINCT systemID) FROM field"
query above might want to know how many of the SSA fields come from the each of
the four systems whose presence is indicated by the result of that query. That
information is returned by the following query:
where the rows in field are grouped by their systemID value and separate counts are made of the number of rows in each group. The result set from this query includes the dummy row in the Field table (which has systemID=0), and this may be removed through the use of the HAVING clause, i.e.
[Link to demo result set]
where the HAVING clause here adds a search condition on the results of the GROUP BY clause.
The majority of queries run on astronomical databases have a spatial component to them, restricting attention to particular region of the celestial sphere, usually defined as some range in either Equatorial or Galactic coordinates. To enhance the performance of the database for these popular spatial queries would require some sort of spatial indexing, but a practical problem is that very few DBMSs support two-dimensional indexing on the surface of a sphere. Indeed, not all DBMSs support two-dimensional indexing on a plane, and SQL Server is one of those DBMSs which supports indexing in only one dimension. Indexing on RA and/or Dec (which is better, not having the complication of wrap-around at the meridian) would aid spatial queries somewhat - and, indeed, indexing on both RA and Dec is implemented in the v1.0 SSA - but what is desired is a prescription under which two-dimensional regions of the celestial sphere can be indexed using a one-dimensional index such that regions which are close on the sky have similar values for the index code, as that will facilitate efficient spatial searches.
A number of approaches can be used to map the two-dimensional surface of a sphere onto a one-dimensional index, but, as mentioned above, the v1.0 SSA uses the same spatial indexing prescription as the SkyServer database of the Sloan Digital Sky Survey (SDSS). This is based on the Hierarchical Triangular Mesh (HTM) pixelisation of the celestial sphere, which was developed at Johns Hopkins University by Peter Kunszt, Alex Szalay and Ani Thakar, and implemented in SQL Server by Jim Gray and Don Slutz of Microsoft Research: we thank them for making available their code for use in the SSA. The HTM scheme is described in much greater detail elsewhere, and we discuss here only those facets which are relevant to use of the SSA.
The HTM scheme starts with the projection of an octahedron onto the celestial sphere, which defines a base set of eight triangular regions of the sky. Each triangular face of the octahedron is then divided into four triangles, defined such that their vertices are the vertices of the original triangle and the midpoints of its sides. The projection of this set of triangles onto the sphere divide it into 32 regions. This process of splitting triangles into four can be repeated to yield an increasingly finely-grained pixelisation of the sphere. The triangles defined at each stage in this procedure can be numbered according to a set procedure, in which an integer code is assigned to each whose most significant bits are inherited from the parent triangle and whose least significant bits record which of the four triangles created from that parent it is. In this way, it is possible to define an integer pixel code - the HTMID - for each triangle in this hierarchical pixelisation of the sphere which records not only the location of the centre of the triangle on the celestial sphere, but also the level in the hierarchical decomposition of the sphere at which it is defined. A 20-level decomposition produces triangles whose average area (note that the areas on the sphere bounded by projections of the edges of the hierarchically decomposed octahedron are not equal) is less than 0.1 square arcsecond, so these are perfectly adequate for labelling individual celestial objects within the SSA unambiguously.
This 20-level HTM decomposition is the basis for the spatial indexing of all positions (locations of celestial objects in the Detection and Source tables, and of field centres in the Field table) in the v1.0 SSA: N.B. all equatorial positions within the v1.0 SSA are quoted in J2000 coordinates. The HTMID for all these defined positions is recorded in the HTMTable table, which includes columns for an ID code (the TableID, itself defined in the HTMTableID table) for the parent table in which the position is defined, together with the ID number within that table of the row in which it is found. Following SkyServer we provide a set of functions for manipulating the entries in HTMTable: these are the key to performing efficient spatial queries in the SSA, so we strongly recommend that users learn how to use them.
The first two functions are concerned with the mapping from regions of the celestial sphere to the HTMID code.
The function fHTMLookupEq takes as arguments an (RA,Dec) pair (with both coordinates in degrees) and returns
the 20-level HTMID code for that location. For example, the HTMID for the point (185.0,0.0) is found
with the following query:
The prefix dbo. must be added to the names of all functions called on the SSA: it stands for "database owner" and is part of the system by which SQL Server assigns access rights to different user account names.
A more powerful function is fHTM_Cover, which returns the HTMID values of the triangles (at a
depth in the hierarchical decomposition specified by the user) which intersect a particular region of the
celestial sphere (defined in J2000 coordinates). This region can be specified in a number of ways, the most useful
being a CIRCLE, defined by the (RA,Dec) pair of its centre (both in degrees) and its radius (in arcminutes),
and a CONVEX, which is a polygon defined by the (RA,Dec) pairs of its vertices. Whereas fHTMLookupEq
returns a single integer (the HTMID code), fHTM_Cover
is a table-valued function, meaning that it returns a table, which must be queried with a SELECT
statement of the same syntax as the static tables and views which are found in the SSA schema. The use of
fHTM_Cover is illustrated in the following query, which probes the coverage of the circular region
of radius 40 arcminutes centred on (185.0, 0.0) by Level-6 HTM triangles:
This returns a table with six rows and two columns, the latter being headed with the names HTMIDstart and HTMIDend. The list of Level-6 HTM triangles which intersect with the circular search region is obtained by adding together the six intervals which start and end, respectively, with the HTMID values listed in the HTMIDstart and HTMIDend columns: i.e. in this case, the list of triangles is [40968-40969,40996-40997,41012-41013,55300-55301,55320-55321,55352-55353]. Similarly, the use of the CONVEX form of fHTM_Cover is demonstrated by the following query:
[Link to demo result set]
which returns an eight-row table of (HTMIDstart,HTMIDend) pairs from which can be reconstructed the list of Level-8 HTM triangles which intersect the polygon with vertices at (184.6,0.3), (184.4,0.3), (185.2,-0.2) and (185.0,0.0).
The next three functions perform proximity searches within HTMTable. The first of these, fGetNearbyObjEq
searches for entries from a particular parent table that lie within a circle of a given radius centred on a location
specified by an (RA,Dec) pair. So, the following query:
selects entries from the Source table (whose members have tableID=1 in HTMTable) within a radius of 0.5 arcmin of (185.0,0.0). fGetNearbyObjEq is a table-valued function, which returns a set of six attributes for each of the selected rows: the ObjID value for the row in its parent table (which is the attribute fieldID for plate centres from the Field table); the triplet (cx,cy,cz), which is a representation of the entry's position on the celestial sphere in the Cartesian coordinates used "under the hood" of this routine; the HTMID corresponding to that position; and the distance (in arcminutes) from that source position to the centre of the search region. Since fGetNearbyObjEq is a table-valued function, joins can be made with it, just as with any other table. So, one can find out whether any of the entries from Source returned by the previous query are contained within the ReliableGalaxies view with the following query: [Link to demo result set]
which checks for the presence in ReliableGalaxies of the entries in the result set of the fGetNearbyObjEq proximity search through a join on their objID value.
The ordering of the results returned by fGetNearbyObjEq is not guaranteed, so, to enable the ready
determination of the closest entry in Source, say, to a given position on the sky without a second
selection on the distance column of the result set, another function, fGetNearestObjEq, is
defined to do just that. So, the nearest of the Source entries to position (185.0,0.0) is returned
where fGetNearestObjEq is another table-valued function, which returns the same set of six columns as fGetNearbyObjEq, while a related function, fGetNearestObjIDEq, returns only the ID value of the nearest entry. Since this is not table-valued, it is used with SELECT alone, rather than using a "SELECT * FROM" syntax - i.e the ObjID value of the entry in the Source table nearest to the position (185.0,0.0) is returned using the following query [Link to demo result set]
One of the main purposes of introducing the HTM pixelisation and its associated set of access functions was
to reduce the number of great circle distance calculations which have to be performed in proximity searches:
the great circle distance formula is a fairly complex piece of spherical trigonometry, which is easy to type
incorrectly in SQL and which is not efficiently executed by standard DBMSs, so it is desirable that its use
be kept to a minimum. It is actually used in proximity searches, such as those specified using the
fGetNearbyObjEq function, but only as a last check on the restricted number of rows which might
match the proximity criterion on the basis of their HTMID value. However, the great circle distance
formula is useful in many situation, so we define a function, fGreatCircleDist, that implements it.
Its use is illustrated by the following query:
which returns the great circle distance (in arcminutes) between (185.0,0.0) and (184.0,0.0).
The following mathematical functions are supported by SQL Server's SQL dialect.
Arithmetic functions (such as ABS, CEILING, DEGREES, FLOOR, POWER, RADIANS, and SIGN) return a value having the same data type as the input value, while trigonometric functions and others (such as EXP, LOG, LOG10, SQUARE, and SQRT), cast their input values to float and return a float value; this probably is of no concern to the average SSA user. All mathematical functions, except for RAND, are deterministic functions - i.e. they return the same results each time they are called with a specific set of input values - and RAND is deterministic only when a seed parameter is specified
An operator in SQL is a symbol specifying an action that is performed on one or more expressions. For the present purposes, their major use is to provide greater flexibility in the possible forms of WHERE clauses of queries and in the columns of the result sets they can produce, which need not simply be columns drawn from the table(s) being queried. There are several classes of operator to consider.
3.4.1 Arithmetic operators
SQL Server's SQL dialect supports five arithmetic operators. The four basic ones - Addition (+), Subtraction (-), Multiplication (*) and Division (/) - plus the Modulo operation, (%), which returns the remainder of dividing one integer by another, and is used in the format "dividend%divisor". The use of the four basic arithmetic operators is demonstrated in Section 4 below, and use of all five is straightforward from their definitions.
3.4.2 Bitwise operators
The utility of bitwise operators in SSA v1.0 is limited to the testing of quality flag in the Detection table, which is composed of flags for various quality issues set independently in a bitwise fashion, and which reappears as the set of four quality flags (qualB, qualR1, qualR2 and qualI) for the detections merged into each record in the Source table. The meaning of each of the quality flags set bitwise in these attributes is explained here, and they may be tested using the Bitwise AND operator, &. For example, Q10 of Section 4 probes the setting of the 4th bit of the quality flag (corresponding to presence of a pixel value above the highest areal profile level) by using terms in the WHERE clause of the form "qualb & 16 = 16". In addition, there are also Bitwise OR (|)and Bitwise Exclusive OR (^) operators defined, which could be used in a similar fashion.
3.4.3 Comparison operators
The Boolean comparison operators are used most frequently to filter rows via the WHERE clause of a SQL query. The most simple comparison operators ( <, >, =) were used above without introduction, but there a total of nine comparison operators which can be applied to pairs of expressions in the SQL Server dialect of SQL: = (Equal to); > (Greater than); < (Less than); >= (Greater than or equal to); <= (Less than or equal to); <>(Not equal to); != (Not equal to); !> (Not greater than); and !< (Not less than).
3.4.4 Logical operators
In a similar fashion, we have used a number of the logical operators (e.g. AND, BETWEEN, etc)
above without introduction, but the following is the full list of logical operators supported by SQL Server:
The LIKE operator is used for pattern matching. This is most commonly used for string matching. For example,
a user interested in knowing how many plates yielding data included in the SSA had the IIIaF photographic emulsion
could issue the query:
Several wildcard characters can be used in conjunction with LIKE. For example, the grain size in the photographic emulsions changed with time, so someone might be interested in finding the plates with an emulsion from the IIIa family. The query for that would be: [Link to demo result set]
where the percentage sign, %, is a wildcard for any string of zero or more characters. Three other wildcards can be used with LIKE in a similar fashion. The underscore, _, denotes a single character, so the previous query could also have been written [Link to demo result set]
If the user wanted to be sure that only plates with the IIIaF or IIIaJ emulsions were selected, this could be amended to read [Link to demo result set]
while plates with any other IIIa emulsion apart from those could be identified with this query: [Link to demo result set]
as the presence of the caret, ^, indicates that emulsion names ending in F or J should not be included in the count. A range of characters can be included by use of a hyphen, -, in the square brackets, so the query above could be rewritten as [Link to demo result set]
if one also wanted to exclude emulsions IIIaG, IIIaH, IIIaI, should they exist. This last question on its own can be determined using the following query: [Link to demo result set]
This query returns as a result set the string "Filters IIIaG, IIIaH, IIIaI do not exist" when run on the SSA, and is the first query we have seen that do not include a FROM clause; this is not required because the data to be returned, as specified in the SELECT clause, is a constant, rather than a set of values drawn from a table.
The IN logical operator determines if a given value matches any entries in a subquery or list. The
first of these is discussed in Section 2.4.4. above [add link], while the second is readily illustrated by another
reworking of the query above to count all plates taken
with IIIaF or IIIaJ emulsions. Using IN instead of LIKE it can be rewritten as:
The SOME and ANY operators are strictly equivalent and both compare a scalar expression with a
single column set of values. The count of IIIaF plates could be found (in a rather circuitous manner)
with the following query:
Many of the examples given in this Section have been somewhat contrived, as we have stuggled to find realistic astronomical queries that would make use of some of the logical operators supported by the SQL Server dialect of SQL. This probably indicates that the average SSA user is unlikely to make use of these operators, which we have discussed here solely for the sake of completeness.
The plus sign, +, is used as a string concatenation operator in the SQL Server dialect of SQL. This is most likely to be of use to SSA users in the formatting of result sets - i.e. in the definition of SELECT clauses.
3.4.8 Unary operators
The SQL Server dialect of SQL defines three unary operators - i.e. operators which have only one operand - although
none of these are likely to be of much use to most SSA users. The first
of these is the positive unary operator, +, which is used principally in SQL statements not allowed by the SSA
query interface, so this will not be of use to SSA users. The second, -, the negative unary operator, which
returns the negative value of a numeric expression, as shown in the following query:
The final unary operator, ~, the Bitwise NOT operator, converts each bit in a numeric expression of integer data type into its 1s complement (i.e. 0s are changed to 1 and vice versa). This might be of use in queries involving the quality flags in the Source and Detection tables.
3.4.9 Operator precedence
The operators described in this subsection have the following descending levels of precedence:
When two operators in an expression have the same operator precedence level, they are evaluated left to right based on their position in the expression.
In their design of SkyServer, the SQL Server implementation of the SDSS archive, the Sloan archive team were influenced by a set of 20 queries (see Data Mining the SDSS SkyServer Database by Jim Gray et al.) which were intended to be realistic examples of the kinds of queries that astronomers would want to run on SkyServer. We adopted the same approach in our design of the SSA. We took the 20 queries used for SkyServer, translated a number of them to run on the SSA, and replaced the others by new queries, often performing joins between tables in the SSA and others in the Early Data Release (EDR) version of SkyServer. We discuss these below in turn. Since these were intended to be representative of the queries which SSA users would submit, they have not necessarily been written in particularly elegant SQL, nor has the astronomical content (e.g. colour cuts used) been verified terribly rigorously. In each case we provide a link to the output obtained by running the query on the Personal SSA. To ensure that this matches exactly what a user would obtain we have amended some queries slightly, by the addition of a "TOP 30" to the select statement and an "…ORDER BY…" clause at the end of the query: recall from the discussion of ORDER BY above that the set-based nature of SQL means that the ordering of rows within a result set is not guaranteed, until explicitly dictated through use of ORDER BY.
Q1: Find the positions of all galaxies brighter than magnitude 20 in B with a local B band extinction is >0.1 mag.
The Source table of the SSA contains E(B-V) values computed from the Schlegel et al. (1998), ApJ, 500, 525 maps for the position of each entry, so this query is simply a selection on the EBmV attribute recording those values scaled to yield the AB. The scaling factor is 4.09 (Binney and Merrifield 1998) and we use the ReliableGalaxies view of the Source table, for simplicity, as that already makes our magnitude cut at B=20.[Link to demo result set]
Q2: Provide the positions and magnitudes of stars for which the magnitudes from the two R band surveys differ by more 3 magnitudes.
This simple query performs a selection on Source to look for
star-like objects detected in both red surveys in the SSA with magnitudes
differing by more than 3 mags. Stars varying this much are rare, so this
query is more likely to detect oddities in the data, rather than extreme variable stars.
Indeed, run on the CompleteStars view in the PSSA it finds no matches at all, so, to
see if it detects any artefacts in the PSSA we can relax the selection criteria for stars from
those used to define the CompleteStars view, as follows:
Q3: Find the positions of all galaxies with a profile statistic > 10 in all detected wavebands and photometric colours consistent with being an elliptical galaxy.
This query is a translation to the SSA of Query 5 from the SDSS list of “20 Queries”,
which selects sources for which a de Vaucouleurs profile is a better fit than an exponential profile.
The profile information available in the SSA is more limited, so, in adapting this query to the SSA,
we base it around a colour selection on the ReliableGalaxies view together with a cut on the
profile statistic which should only select large galaxies with profiles very disimilar from the assumed
PSF. For our colour cut, we take the original colour-magnitude locus for
low-redshift ellipticals defined in (g,r,i) space by
Eisenstein et al.
(2001), AJ, 122, 2267
and translate it into cuts in SuperCOSMOS (B,R,I) space through the use of the
transformations published by
Fukugita et al. (1996), AJ, 111, 1748 between the Sloan and
Johnson-Cousins passbands, and by
Blair and Gilmore (1982), PASP, 94, 742 for conversion
between them and the photographic filters used in the SSA, i.e.
Putting all these together gives the following query, where we neglect the first epoch R band
survey, since the definition of ReliableGalaxies ensures detection of a non-stellar source
in the other three bands and relax the colour-magnitude cut, to allow selection of lower luminosity
ellipticals than the LRGs of interest to Eisenstein et al. (2001):
Q4: Provide the mean positions and magnitudes of any stellar objects with colours and proper motions consistent with being a white dwarf.
This query, which we shall run on the ReliableStars view selects white dwarf candidates on the basis
of their Reduced Proper Motion, Hr=R + 5* log10(μ/arcsec per year) + 5, computed using proper motion (μ)
from the SSA and R band photometry from the second epoch survey. White dwarfs are then selected as satisfying
Hr>3.75*(B-R2)+13.75 and Hr>9.091*(R2-I)+14.09, and having significantly detected proper
motions(i.e. μ > 5 σμ)with the following query:
Q5: Find the positions and (B,R,I) magnitudes of all star-like objects within delta mag of 0.2 of the colours of a quasar of redshift 2.5 < z < 3.5.
The corresponding query from the SDSS “20 Queries” list on which this is based sought
quasar candidates in the redshift range 5.5<z<6.5, which are very
unlikely to be found in the SSA. All colour selection of quasars in the SSA
will yield a large number of false positives, due to the lack of U band data,
but we take the (g,r,i) selection criteria for low-z quasars from Richards et
al (2001) – i.e. –0.2 < (g-r) < 0.35 and –0.4 < (r-i) < 0.4 – and translate
that into the SSA photometric bands, using the tranformations from Q3, obtaining the following:
This selection criterion is not dissimilar to that of Irwin (1991) who argues that z<3 quasars are to be found in an ellipse centred on BJ-R~0.4, R-I~0.2.
Q6 Find unpaired objects
This query implements one of the Usages of the WFCAM Science Archive, itself devised from the UKIDSS proposal: the SSA is a prototype for the WFCAM Science Archive (WSA) in many respects. This query simply selects entries in Source which have a detection in the first or second epoch red survey, but not both.[Link to demo result set]
Q7: Provide a list of star-like objects that are 1% rare in (B-R,R-I)-space.
This query illustrates how SQL can be used to generate rough density
estimates in the space of the attributes present in a table. The query selects stellar
objects on the basis of ellipticity and profile statistics and bins them up into a unit
interval grid in (B-R,R-I) space.
The ROUND functions round the colours into integer values, which are then turn into integer type by the CAST(... AS INT) operation. The output from this query is a table listing the number of entries from ReliableStars in each of the populated bins of (B-R,R-I) space, and these enable one to calculate which are the most unusual 1% of stars in that colour space.
Q8: Create a gridded count of galaxies with B-R>1.2 and R<19 over 184<RA<186 and -1.25<Dec<1.25, on a grid of 2 arcmin
This illustrates how a SQL query might be used in a clustering analysis, by computing counts-in-cells. As in Q7, it bins objects - this time into 2x2 arcmin bins on the sky – and then returns the count of galaxies in each bin.[Link to demo result set]
Q9: Create a count of galaxies in Level-9 HTM triangles which satisfy a certain colour cut, like 0.7B-0.5R-0.2I<0.8 and R<19.
This is another counting query, but this time, counting galaxies falling
into particular Level-9 HTM triangles (regions about 1/4 square degree in size).
This binning is performed by taking the HTMID for each relevant entry in the ReliableGalaxies
view and dividing it by 222, as required to obtain the HTMID of the
Level-9 triangle in which the galaxy lies, given that the HTMID stored in ReliableGalaxies
is at Level-20.
Q10: Find the positions of all galaxies with a pixel brighter than the highest areal profile threshold in any band within 1 degree of a given point (185.0,0.0) in the sky
This query illustrates the use of a join with the result set from the table-valued function fGetNearbyObjEq. It is essentially a self-join of the Source table - i.e. a join of Source with itself - but uses the ReliableGalaxies view to restrict attention to galaxies. The call to fGetNearbyObjEq finds all objects in Source withing 1 degree of (12 20 00, 00 00 00) - i.e. RA=185 degrees, Dec=0.0 degrees - and then joins with it a selection of those entries in ReliableGalaxies which are flagged in at least one band as having been detected with at least one pixel lying above the highest areal profile threshold used by the SuperCOSMOS Image Analyser. This flag is set with the 4th bit in the qualflag, and its value is probed using the Bitwise "AND" operator, &.[Link to demo result set]
Q11: Find the plate numbers of those plates with nominal centres within 20 degrees of (185,0)
The nominal centres of all the survey fields are stored in the Field table, so this requires a
join between that and the Plate table on the fieldID attribute. The spatial selection is
performed by making a call to the table-valued function fGetNearbyObjEq in a subquery and then
joining that HTMID, too.
Q12: Find the positions and (B-R,R-I) colours of all galaxies with blue band area between 100 and 200 pixels, -20 < supergalactic latitude (sgb)/degrees < 20, and declination less than zero, and return them in colour order.
This repeats one of the SDSS “20 Queries”, but replacing a surface
brightness cut with an image area cut, since surface brightness values are not
stored within the SSA. The Source table - and, hence, the ReliableGalaxies view -
includes positions in Galactic coordinates, so this query just requires a little spherical
trigonometry to implement the selection on supergalactic latitude, and it requires a join with
Detection to retrieve the B Band area. The clause containing the cut on supergalactic latitude
is made more cumbersome by the requirement of converting angles between degrees and radians several times.
Q13: Find the positions, R band magnitudes and B-R colours of all galaxies with an area greater than 100 pixels and a major axis 10 < d/arcsec < 30 in the red band and with an ellipticity>0.5.
The SuperCOSMOS plate scale is 67.14 arcsec/mm, so a conversion factor of 1 micron
=0.067 arcsec is required to recast this query in terms of the physical units
in which the length of the major axis of an image is recorded in the Detection
table. The main constraints in this query are applied to rows in the Detection
table and the join
with the ReliableGalaxies table is made only to remove duplicate entries in the result
set, as a consequence of the two epochs of red survey data, and to extract the colour information. The
contraint that sourceID >0 is required to remove the parents of deblended images, as these are
stored in Detection for completeness, but only their children are merged to form entries in Source.
Q14: Find galaxies that are blended with a star and output the deblended magnitudes.
Entries in the Detection table which were blended together can be recognised
by having a common parent, so this query looks for such pairs of objects by
performing a self-join on Detection constrained to pick galaxies from one copy
of the table and stars from the other, with added checks that they have
sensible magnitudes and that duplicate records in the overlap regions between
plates are excised.
Q15: Find all pairs of galaxies within 10 arcsec of another that have very similar colours, and return their positions and B band magnitudes
The original SDSS query on which this is based is motivated by a
gravitational lens search. For the SSA, we take “similar colours” to mean B-R
and R-I colours within 0.1 mag, and execute it solely over the magnitude range range 17.0 < BJ < 18.0,
where the star/galaxy separation and photometric accuracy are both very good. This query uses the SSA Neighbours
table, which records which entries in Source are less than 10 arcsec
from which other ones, and which is designed to speed up queries like this,
which would otherwise require very time-consuming self-joins of Source
with a spatial match predicate. Note that the find clause prevents double-counting.
Q16: Find the positions of stars with Sloan 5-band colours and SSA proper motions which are consistent with their being subdwarfs.
This query roughly mirrors the work of
Digby et al (2003),
who constructed a
sample of subdwarf stars on the basis of their location in a reduced proper motion
(RPM) diagram computed using SSA proper motions,μ, together with r band
magnitudes and (r-i) colours from the SDSS Early Database Release (EDR) catalogue.
The reduced proper motion, Hr, is defined to be
Q17: Provide a list of positions of galaxies whose Sloan and SSA magnitudes are consistent with there having been a supernova in the galaxy at one of its epochs of observation.
This query was motivated by the idea of using the epoch difference between Sloan
and SuperCOSMOS surveys to estimate the local supernova rate, through
identifying galaxy images which appear to have changed in brightness by more
than five magnitudes; of course, in practice, this identifies more artefacts in
the data than it does real supernovae, but several have been found this way. As
with Q3, and subsequent queries, this requires use of colour equations to
transform between the SDSS and SSA photometric system. This is another query
that uses the CrossNeighboursEDR table to match (good quality) SSA galaxies with
SDSS galaxies lying within 1 arcsec, and then further constraints on colour and
magnitude are applied to find those which have changed in blue or red band
brightness by more than 5 mags.
Q18: Provide a count of high-quality star-like sources brighter than 16th magnitude which are in either the SSA but not in the SDSS.
This speculative query looks for bright optical transients, but, again, is
more likely to turn up artefacts in the data, such as saturated bright stars. (N.B. we
run this on the Personal SSA and the full EDR, and must be careful which way round we use
the two databases, so that we don't just detect objects in the EDR lying outside its overlap
with the PSSA.)
Q19: Provide the positions of star-like objects with SDSS colours consistent with being a quasar and positions consistent with not having moved between all the epochs in the SSA.
This query is motivated by the idea of selecting quasars as stellar-like
objects with zero proper motion. The astrometric accuracy of SuperCOSMOS and
the length of the time baseline between the SSA’s constituent surveys are nearly
adequate for this to be a sensible way to select quasar candidates, but it is
mainly included as another example of an SSA-SDSS join query, using CrossNeighboursEDR
and a colour-magnitude cut in the Sloan bands suitable for low-z quasars,
namely g<=22, -0.27 <= u-g < 0.71, -0.24<= g-r<0.35, -0.27<=r-i<0.57
Q20: Provide a list of SSA objects within a magnitude of their respective nominal plate limit which are unpaired in the SSA and have no SDSS counterpart.
This query is designed to detect spurious objects in the SSA, on the
assumption that all real SSA objects should be detected in the (deeper) SDSS.
It works by selecting SSA objects in Sourcewhich are detected in only
one band, and within a magnitude of the respective nominal plate limit for that
band, and then looks up the ObjID values for those sources to see
whether they are included in the list of ObjID values for Source
entries which have no SDSS counterpart with 0.5 arcsec, as revealed by the CrossNeighboursEDR
The field 287 database contains a special time-resolved dataset constructed from the detections on nearly 200 plates in the same survey field, and sits alongside the main SSA database. It shares many of the same features, including a similar relational design, but is geared towards generation of accurate light-curve information for the restricted set of sources in the 6x6 square degree field. The time baseline covered is from ~1977 to ~2002.
(Note that it doesn't matter which database - SSA or Personal SSA - you select in the drop-down box on the freeform SQL webform if you prefix database object names in this way).
In common with the SSA, F287 contains Detection, Source and Plate but there is no field table (because the plates all come from one field) and the merged sources are created from stack detections only. There are four stacks created from the best 8 plates from the middle of the run of individual plates in each of the passbands U, B, R and I. Unique identifiers relate each detection to it's plate (plateNum in both Plate and Detection tables) and each source (Source.objIDU, objIDB, objIDR and objIDI) to it's constituent stack detections (Detection.objID). Note that while plateNum is the standard UK Schmidt plate identifier without the passband letter for individual plates (i.e. 12602 for plate J12602 etc.), "plateNum" for the stacks is 1, 2, 3 and 4 for U, B, R and I respectively (this assignment being consistent with filterIDs assigned for those passbands). To see some details of the stacks,
The idea behind making a stack from a subset of all available plates in each passband is to create a "master" source list that is deeper than the deepest single plate in any passband; to crossmatch this list (see below) to the Detection table to link individual epoch measurements to their source; and to provide a reference for the relative recalibration of the photometry in the individual epochs to remove any systematic errors in that photometry as a function of field position and magnitude (such errors are large and omnipresent on Schmidt photographic plates owing to emulsion sensitivity variations). For example, to pick out one source and it's corresponding stack detections, we use the following SQL:
for example for source unique identifier (objID) number 3972. Note that the results summary page of the web interface will allow production of thumbnail images from the main SSA survey plates (which are included in the F287 set) if the results set includes non-default RA and Dec attributes.
The seeing in the F287 plate collection is not the best (typically 2 to 3 arcsec) in comparison to modern digital surveys like SDSS, and source confusion can be a problem. For example, close pairs of objects may be blended in the deep stack and some of the poorer individual epoch plates, and yet resolved and/or deblended on better individual plates. This means that the positional association of stack sources with epoch detections can be subject to significant numbers of mismatches. The same blending and quality information (except halo/spike/track flagging) as exists in the SSA is available to filter out deblends at query time; additionally, we provide a source neighbour table SourceNeighbours that can be used, amongst other things, to examine the environment of a source to see if there any close pairs. For example:
shows the first thirty sources that have one or more neighbour(s) within the default neighbourhood radius of 10 arcseconds (this could be reduced to say 5 arcseconds by adding WHERE distanceMins < 5.0/60.0). Such a query could be used as a subquery filter (without the TOP 30 and the aggregate function, of course) to remove those sources having crowded environs, e.g.:
returns a count of all sources that have no neighbour within 10 arcsec.
The association between Source and individual epoch detections in Detection is done through the cross-neighbour table SourceXDetection. Every source will have one or more detections recorded in SourceXDetection along with the angular displacement between the two records in arcminutes (distanceMins). This is because, at the very least, an isolated faint source detected in only one passband stack but in no other stacks (or the less deep individual epoch plates) will have a single crossneighbour record pointing to that individual stack detection, and a distanceMins of zero. The cross-neighbour association is always between the unique identifiers (also the relational primary keys) of the two joined tables for good query-time performance. In addition to the cross-neighbour table between Source and Detection, the F287 database also contains a cross-neighbour table between Source and a locally held copy of the Veron catalogue of known quasi-stellar objects.
The construction of the cross-neighbour tables is simple: for a given source in the master list Source, every object on each slave plate (including the stacks) within the neighbourhood radius of 10 arcseconds is located and recorded in the table via a row consisting of the master Source objID, the slave Detection objID, and the angular distance between them. It is via the SourceXDetection cross-neighbour table that light curves can be generated, while F287 source identifications of known QSOs can be found via the SourceXKnownQSO table - see below for further details and examples. The following SQL illustrates how to use the known QSO cross-neighbour table to find the F287 source IDs and names of a set of quasars having redshift 4 or greater:
Note that we specify that the crossmatch distance must be less than 3 arcsec, which is a good rule-of-thumb for data of this type. Another join via addition of F287..Source AS s in the FROM clause along with ... AND s.objID = x.masterObjID in the WHERE clause allows the user to see the F287 source information by also specifying ..., s.* in the SELECT list.
into TOPCAT and plot a histogram (if you select either FITS or VOTable output formats on the SQL webform, the results page will contain a push-button link to start this handy application automatically). It is clear that the typical (modal) error at these magnitudes is ~0.05mag. (Note the use of the scaling of MAD via a factor 1.48 to an equivalent Gaussian RMS assuming a normal distribution). We could then trawl for a sample of variable candidates with individual standard deviations (as a reasonably sensitive detector of even a few non-Gaussian outliers amongst the run of photometric measurements, in this case limited to at least 50) being 6 times larger than this, e.g.
By this simple analysis, the most variable source is objID = 741793. However, the user should beware that more often than not, "variables" selected in this way may not be genuine - there are all manner of reasons why the run of photometric measurements should contain one or more outlying points with respect to the nominal errors - it is a fact of life that spurious measurements from emulsion flaws, deblending errors, and many other effects can lead to a non-Gaussian tail in the distribution of measurements. It is most advisable to use the various morphological attributes to filter out as many false positives as possible. For example, we could insist that the candidates are not deblended and have good quality in all stacks by adding ... AND qualU < 64 AND qualB < 64 AND qualR < 64 AND qualI < 64 AND blendU = 0 AND blendB = 0 AND blendR = 0 AND blendI = 0 to the WHERE clause in the last query above. Then objID 690728 is the most variable candidate; note that while sigsCorMagB is large, madsCorMagB is only ~2 sigma outlying. The latter is a robust Gaussian RMS equivalent sigma via scaled median of absolute deviation, and is less susceptible to the odd outlying point in the sequence. So the liklihood is that the light curve of this object will show just a few outlying points.
5.5 Light curvesThe B band light curve data for the example in the previous section can be extracted as follows:
where the subquery construct ensures that only the nearest neighbour within 1 arcsecond on each plate is considered, and the final clause orders the data by ASCending time. (Note that inside the subquery we employ a feature of the construction of the Detection objIDs: these are assigned by running number in the four least significant bytes of the 8 byte integer, and with the plate number in the four most significant bytes. So dividing by the hexadecimal constant 0x100000000 right-shifts the bits by four bytes and leaves the plate number as the result. Although somewhat esoteric, this feature enables you to access the plate number of any objID without an SQL join with the Detection table.)
Finally, the table of known QSOs stored and crossmatched in the F287 database is provided to allow the study of the variability of the quasars in the field. For example, to find the objIDs of QSOs unambigously identified by proximty within 1 arcsecond and having redshift more than 3.0, use the following query:
It is then a simple matter to get the light curve in any colour by using the first query in this section, e.g. for object H 87 substitute s.objID = 154058. By combining elements from several of the above SQL fragments, we could get the R light curve of H 87 in one go via:
2MPZ can be matched to older versions of the UKIDSS and VISTA surveys through the 2MASS XSC neighbour table, e.g. UKIDSS_LAS DR10, via the WFCAM Science Archive:
Home | Overview | Browser | Access | Cookbook | Links | Credits
Radial | MenuQuery | FreeSQL | CrossID
WFAU, Institute for Astronomy,