About the Aggregate Tool

 

The goal of the tool is to provide an easy way to calculate the average signal of probes on a platform such as a microarray as a function of their position relative to a set of annotation coordinates. These coordinates can be any start and end pairs, such as gene annotations. By aggregating the signal across many locations, noise should largely cancel out, making it possible to determine whether there really is a correlation between, say, a transcription factor and its location on a gene. Alternatively, the tool can count the number of probes that fall into each bin rather than averaging the signal, thus computing a density.

 

The tool has two modes. In the first, only one annotation coordinate is considered in the calculation: a preset number of bins (see Parameters: n) extends a predetermined number of base pairs (the radius; see Parameters: Length of flanking region) in both directions from what the annotation file defines as the start positions. In the second, a certain number of bins, m, covers the region between the start and stop locations of each pair in the annotation file.
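The first mode can be sketched as follows. This is only an illustration in Python, not the tool's own code: the function name flank_bin, the equal-width bins of radius/n base pairs, and the numbering of downstream bins from 0 are all assumptions made for the example. The start passed in is assumed to be the start site after any strand inversion.

```python
import math

def flank_bin(position, start, strand, radius, n):
    """Return the bin index of `position` relative to `start`.

    Hypothetical helper: upstream bins are numbered -n..-1 and
    downstream bins 0..n-1 (an assumed numbering). Returns None
    when the position lies outside the flanking region.
    """
    offset = position - start
    if strand == "-":
        # On the reverse strand, "upstream" is toward larger coordinates.
        offset = -offset
    if offset < -radius or offset >= radius:
        return None
    width = radius / n  # assumed: all flanking bins have equal width
    return math.floor(offset / width)
```

For example, with a radius of 100 and n = 10, a probe 5 base pairs upstream of the start site lands in bin -1, and a probe 5 base pairs downstream lands in bin 0.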

Because the distance between the start and end points of genes can vary, the tool divides each start/end pair into m bins. The first bin encompasses the probes within the first 1/m of the gene, the second bin the second 1/m, and so on. The tool also reports an average signal for n bins on each side of the start/end pair. These are not scaled based on the length of the gene. Rather, they extend a certain number of base pairs (the radius) before the start site and after the stop site. Note that the tool inverts the start and end points if the annotated object in question is transcribed on the reverse strand instead of the forward strand.
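The scaled binning described above can be sketched like this. It is a minimal Python illustration under stated assumptions, not the tool's implementation: the name scaled_bin is hypothetical, bins are numbered 0 to m-1 from the transcription start, and left and right are assumed to be the smaller and larger genomic coordinates of the pair.

```python
import math

def scaled_bin(position, left, right, strand, m):
    """Return the scaled bin (0..m-1) of `position` inside a start/end pair.

    Hypothetical helper: `left` < `right` are genomic coordinates.
    Bin 0 always sits at the transcription start, so the direction
    is inverted for reverse-strand annotations.
    """
    length = right - left
    if length <= 0:
        raise ValueError("annotation must have left < right")
    if strand == "+":
        fraction = (position - left) / length
    else:
        fraction = (right - position) / length
    if fraction < 0 or fraction >= 1:
        return None  # outside the pair; handled by the flanking bins
    return math.floor(fraction * m)
```

With m = 4 and a gene spanning 1000 to 2000, position 1250 falls in bin 1 on the forward strand, while position 1999 falls in bin 3 on the forward strand and bin 0 on the reverse strand.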

 

Parameters

 

Annotation file. Tab-delimited file containing start and end positions, chromosome locations, and strand annotations, such as a BED file. By default, these files are structured like a BED file, with the chromosome in column 1, the 5’ position in column 2, the 3’ position in column 3, and the strand (“+” or “-”) in column 4. The program will automatically switch orientation based on strand. For this field, you may either upload your own file or use one of the common gene annotations found in the dropdown menu.
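As an illustration of the default layout, one annotation line could be read as in the Python sketch below. The function name parse_annotation_line is hypothetical, and the sketch assumes that column 2 holds the smaller coordinate, so that the start and end are swapped for reverse-strand entries.

```python
def parse_annotation_line(line):
    """Parse one tab-delimited annotation line into (chrom, start, end, strand).

    Hypothetical helper. Assumes column 2 < column 3 in the file;
    for "-" strand entries the two are swapped so that `start` is
    always the transcription start site.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, strand = fields[0], fields[3]
    left, right = int(fields[1]), int(fields[2])
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if strand == "+":
        return chrom, left, right, strand
    return chrom, right, left, strand
```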

 

Signal file. Tab-delimited file containing signals and positions. Depending on what type of file is specified as well as how probewidth (see below) is defined, these can either represent discrete points with an independent probe width, or they can represent the signal for all locations until the next defined spot. Take, for example, the following excerpt from a signal file:

chr1     10        2.3

chr1     40        5.0

If probewidth is, say, 5, these two lines are read to mean that locations 10 through 15 on chromosome 1 have a signal of 2.3, and locations 40 through 45 have a signal of 5.0. If probewidth is left blank, these two lines are read to mean that locations 10 through 40 have a signal of 2.3. Both of these options require selecting the “sgr” format. The “wiggle” format allows one to upload a file containing four columns: chromosome, start position, stop position, and intensity, in that order. The “density” option allows one to upload a two-column file containing chromosome and location. The program will assume that all points have a width of 1 and will calculate the number of probes that fall into the various bins, rather than the average intensity.
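The two readings of an sgr file can be sketched in Python as follows. This is an illustration only: the name sgr_intervals is hypothetical, rows are assumed to be sorted by chromosome and position, and dropping the last row of each chromosome when probewidth is blank (because no next spot exists) is an assumption, not documented behavior.

```python
def sgr_intervals(rows, probewidth=None):
    """Turn (chrom, position, signal) rows into (chrom, start, end, signal).

    Hypothetical helper. With a probewidth, each row covers
    position..position+probewidth. Without one, each row covers
    everything up to the next defined spot on the same chromosome.
    """
    intervals = []
    for i, (chrom, pos, signal) in enumerate(rows):
        if probewidth is not None:
            intervals.append((chrom, pos, pos + probewidth, signal))
        elif i + 1 < len(rows) and rows[i + 1][0] == chrom:
            intervals.append((chrom, pos, rows[i + 1][1], signal))
        # Assumed: a row with no following spot is skipped.
    return intervals

rows = [("chr1", 10, 2.3), ("chr1", 40, 5.0)]
print(sgr_intervals(rows, probewidth=5))
print(sgr_intervals(rows))
```

Run on the excerpt above, the first call yields the intervals 10 to 15 and 40 to 45, and the second yields the single interval 10 to 40 with signal 2.3.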

 

Number of bins (n). Specifies how many bins will be in each flanking region, for a total of 2n flanking bins. For jobs that do not use scaled bins in the region between start and stop sites, these are the only bins: n on each side of the start site.

 

Number of bins (m). Optional field to be defined if the regions between start and stop sites are to be considered in the aggregation process. Defines the number of bins assigned to the region between start and stop sites (for a total of m+2n bins).

 

Length of flanking region. Specifies the number of base pairs (the radius) that the program should analyze in both directions from either the start site or the start/stop pair, depending on the option specified.

 

Probe width. The width of each probe, in base pairs, for signal files that present data for probes of a specific width (see Signal file above for usage instructions).

 

Minimum gene length. Tells the program to ignore start/stop pairs that are shorter than the given distance.

 

Include intergenic regions. If checked, tells the program to scale bins between the start and stop sites. If unchecked, the program analyzes only the flanking bins in both directions from the start site.

 

Use mean. Regardless of whether this box is checked, the program ensures that each gene contributes only its median signal (or its number of probes, if the density option is selected) to the final calculation in each bin, to avoid bias against shorter genes. If Use mean is checked, the program reports the mean of these per-gene values in each bin. Otherwise, it reports their median (that is, the median of the median signal of each gene).
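The two-stage calculation can be sketched in Python as follows. This is a minimal illustration under assumptions, not the tool's code: the function name aggregate and the input layout (one dictionary per gene, mapping a bin index to the list of probe signals that fell into it) are hypothetical.

```python
from statistics import mean, median

def aggregate(per_gene_bins, use_mean=False):
    """Combine per-gene signals into one value per bin.

    Hypothetical helper. Stage 1 reduces each gene to its median
    signal per bin, so long genes with many probes do not dominate.
    Stage 2 takes the mean or the median of those per-gene values.
    """
    collected = {}
    for gene in per_gene_bins:
        for bin_index, signals in gene.items():
            if signals:  # genes with no probes in a bin contribute nothing
                collected.setdefault(bin_index, []).append(median(signals))
    combine = mean if use_mean else median
    return {b: combine(values) for b, values in sorted(collected.items())}

genes = [{0: [1.0, 2.0, 9.0], 1: [4.0]}, {0: [4.0]}, {0: [12.0]}]
print(aggregate(genes, use_mean=True))   # {0: 6.0, 1: 4.0}
print(aggregate(genes, use_mean=False))  # {0: 4.0, 1: 4.0}
```

In bin 0 the per-gene medians are 2.0, 4.0, and 12.0, so the mean is 6.0 and the median is 4.0. The first gene's outlying probe (9.0) does not affect either result.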

 

Output

 

The output is a text file containing two columns of numbers. The first column is the bin number, where negative numbers correspond to the bins before the start site, and the second column is the average intensity (or density). A sample output can be found on the website.
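For downstream plotting or analysis, the two-column output could be read with a short Python sketch like the one below. The function name read_aggregate_output is hypothetical, and the sketch assumes the columns are separated by whitespace and that bin numbers are written as integers; check both against the sample output.

```python
def read_aggregate_output(text):
    """Parse two-column output text into a list of (bin, value) pairs.

    Hypothetical helper. Assumes whitespace-separated columns with
    integer bin numbers; lines that do not have exactly two fields
    are skipped.
    """
    profile = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        profile.append((int(parts[0]), float(parts[1])))
    return profile
```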

 

Notes. Please make sure that all headers are removed from your input files. The program can take anywhere from seconds to hours, depending on the size of your data set. Unless otherwise specified, the default values for all numeric parameters are 0. The source code for this program can be found at <website>. On certain compilers, an error message concerning the “floor” function arises when trying to compile aggplayground.cpp. This can usually be resolved by explicitly casting the arguments of floor to either double or float.