BreakPtr Documentation

Release 1.0,  April 2007
Documentation provided by Jan Korbel, Peter Starr, Mark Gerstein

1. Introduction
2. Installation
3. Training and Generating an HMM (model)
4. Running the Finder and Annotator
5. Graphical User Interface (GUI)
6. Command Line Examples
7. Citing BreakPtr
8. References
9. Contact

1. Introduction

BreakPtr (abbrev. for Break-Pointer) is a computational approach developed at Yale University for fine-mapping copy-number variants (CNVs) based on high-resolution comparative genome hybridization (HighRes-CGH) data and nucleotide sequence characteristics of breakpoints. BreakPtr is described in detail in [1]. Usage is free for academic institutions; companies wishing to use BreakPtr are requested to contact Jan Korbel or Mark Gerstein (see Contact). BreakPtr's Finder module predicts breakpoints of large deletions and amplifications. Its Annotator module identifies actual dosage (copy-number) ratios, and its Flagger module identifies regions where probe cross-hybridization may have occurred.

2. Installation

2.1 System Requirements
2.2 Installation Instructions
2.3 BreakPtr Directory Map

2.1 System Requirements

BreakPtr is tested and validated on the following reference platforms:

BreakPtr Reference Platforms
Operating system     OS version   Processor architecture   Other requirements
Microsoft Windows    XP           Intel                    Sun Java 2 Runtime Environment, Java Development Kit 5
Macintosh            X            Power PC                 Sun Java 2 Runtime Environment, Java Development Kit 5
UNIX                 Suse 10.1    ...                      Sun Java 2 Runtime Environment, Java Development Kit 5

2.2 Installation Instructions

  1. Move BreakPtr.tar.gz to your preferred installation directory.
  2. Unpack BreakPtr.tar.gz by changing to this directory at the command prompt and typing:
    tar -xzvf BreakPtr.tar.gz
  3. A new directory BreakPtr/ will be created in the current directory. Change into this directory to run BreakPtr.
  4. See System Requirements for the software that must already be installed before you attempt to install BreakPtr.

2.3 BreakPtr Directory Map

Important BreakPtr Directories
Directory                            Files contained
BreakPtr/                            Main directories, BreakPtrGUI activation icon, documentation
BreakPtr/Data/Models/                All Hidden Markov Models
BreakPtr/Data/Subject/               All Log2Ratio data files on which to run the Finder and Annotator
BreakPtr/Data/Chain/                 All Local-Sequence-BlastZ-Redundancy files (only needed for the 'full' parameterization; currently available only for chromosomes 11 and 22, assembly hg17. To run the 'full' parameterization with a different HighRes-CGH array, a corresponding file has to be generated as described in [1].)
BreakPtr/Data/Training/control/      All Log2Ratio files with control regions for training
BreakPtr/Data/Training/aberrations/  All Log2Ratio files with amplification and deletion regions for training
BreakPtr/Java/bin/                   All binary executables and .jar files
BreakPtr/Output/                     All output files and the example_output/ directory
BreakPtr/Output/example_output/      All output files for the example data given with BreakPtr

3. Training and Generating an HMM (model)

3.1 Background
3.2 Example Training on New Data
3.3 Setting Transition Probabilities

3.1 Background

BreakPtr training involves making use of prior knowledge on CNVs. Ideally, the data on which BreakPtr is trained will come from the same hybridization conditions and array design used for predicting CNVs. As an alternative, a set of training data is already provided (i.e., a model trained on chromosome 22 HighRes-CGH data; note however that the accuracy of the results from using this training set on data from other hybridization conditions or array designs may be compromised).

It is usually advantageous to have some prior knowledge about copy-number variation (deletions and duplications) in the set to be analyzed. For instance, about two copy-number variants, covering both losses and gains, are usually sufficient to train the 'core' model of the Finder; alternative parameterizations are implemented based on criteria introduced by Scott [2]. If no prior knowledge on CNVs exists, a reasonable alternative is to use CNVs 'guessed' from visual/manual inspection of HighRes-CGH data (i.e. plots showing normalized log2-ratios vs. their chromosomal coordinate).

3.2 Example Training on New Data

Example: training the 'core' model (module Finder) on new data:

  1. To train the 'normal' state, place the normalized log2ratio file(s) that will serve as non-CNV control regions into BreakPtr/Data/Training/control/. In practice, if no control experiment (same genomic DNA labeled with different fluorescent dyes and hybridized to a single array) is available, we have found it usually reasonable to use one or several 'typical' array hybridizations as controls for training the 'normal' state, the reasoning being that most of the genome should be unaffected by copy-number variation. This does not hold if large portions of the genome are 'abnormal', as may be the case for some cancer tissues. List the control experiments in the file Control_experiments_for_parameter_estimation.txt in the Data folder.
  2. Place normalized log2ratio files which contain known CNV regions (or some 'guessed' CNVs) for training into BreakPtr/Data/Training/aberrations/.
  3. Following the format of BreakPtr/Data/Training/Large_Aberrations.txt, record the experiment-ID for each array used to train the 'deletion' and 'duplication' states, the dosage of each CNV (we use known dosages/copy-number ratios, or the best possible estimates according to log2-ratio values within the CNV), the chromosome, the CNV start position, and the CNV end position. Use the file Large_Aberrations.txt as a template: spaces and tabs matter, and using them incorrectly may prevent the data from being read.
  4. Follow example (1) in Command Line Examples to perform the training and generate the HMM file.

3.3 Setting Transition Probabilities

An essential aspect of parameter estimation is providing a reasonable estimate for transition probabilities between states (e.g. the probability for the HMM to move between the 'normal' and 'deletion' states) and within states. We suggest estimating the transition probability pb for each transition between states by dividing the number of previously known/presumed aberrations (based on approximately mapped deletions/duplications) in the concatenated training set by the number of probes. The transition probabilities pw within states are then set so that the probabilities in each row of the transition matrix sum to 1, i.e. according to the following equation:

    pw11 = 1 - pb12 - pb13

for pw11 (transition from state 1 to state 1, or "within state 1") for a 3-state HMM, e.g. the 'core' model of the Finder.

In practice, the initial estimates for transition probabilities currently have to be added to the HMM-model file manually, i.e. in the following way for the 'core' model in the file "example_core.hmm":

# Transition matrix:
        0.999999998     1e-9    1e-9
        1e-9    0.999999998     1e-9
        1e-9    1e-9    0.999999998
This can e.g. be altered to (assuming pb=1e-5):
# Transition matrix:
        0.99998     1e-5    1e-5
        1e-5    0.99998     1e-5
        1e-5    1e-5    0.99998

In the first row of the matrix, [0.999999998, 1e-9, 1e-9] represent the transition probability estimates pw11 (transition within state 1='normal'), pb12 (between 'normal' and 'duplication/amplification' states), and pb13 (between 'normal' and 'deletion' states); the other rows follow the same pattern. Note that the probabilities in each row should sum to 1. When refining the file manually, be careful not to modify existing tab or space characters, as this may prevent the file from being read correctly.
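The arithmetic behind these matrices can be sketched in a few lines of Python. This is only an illustration, not part of BreakPtr: the function names, the example counts (4 aberrations, 400,000 probes), and the tab-separated rendering are assumptions made for this sketch. Copy the resulting values into the .hmm file by hand rather than overwriting the file, since its exact whitespace layout matters.

```python
def estimate_between_state_probability(num_aberrations, num_probes):
    """Estimate pb as the number of known/presumed aberrations per probe."""
    if num_probes <= 0:
        raise ValueError("num_probes must be positive")
    return num_aberrations / num_probes


def build_transition_matrix(pb, num_states=3):
    """Return a square matrix with pb off the diagonal.

    The within-state probability pw is chosen so that each row sums to 1,
    i.e. pw = 1 - (num_states - 1) * pb.
    """
    pw = 1.0 - (num_states - 1) * pb
    if not 0.0 < pw <= 1.0:
        raise ValueError("pb is too large for this number of states")
    return [[pw if i == j else pb for j in range(num_states)]
            for i in range(num_states)]


def format_matrix(matrix):
    """Render rows as tab-separated text (layout is an assumption)."""
    return "\n".join("\t".join(f"{p:.10g}" for p in row) for row in matrix)


if __name__ == "__main__":
    # Hypothetical training set: 4 aberrations on an array with 400,000 probes.
    pb = estimate_between_state_probability(num_aberrations=4, num_probes=400_000)
    print(format_matrix(build_transition_matrix(pb)))
```

With these example counts pb evaluates to 1e-5, so the diagonal entries are 1 - 2 * 1e-5 = 0.99998, matching the altered matrix shown above.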

4. Running the Finder and Annotator

4.1 File Placement

First place the appropriate files in the Data/Models/, Data/Subject/, and Data/Chain/ directories as described in BreakPtr Directory Map. Then run either the Finder-Core, Finder-Full, or Annotator module of the GUI or follow examples (2), (3), and (4) in Command Line Examples.

4.2 Output Interpretation

The values of the "State" field in each output file are integers with the following meanings:

  1. Finder-Core: state 0='normal'; state 1='amplification'; state 2='deletion'
  2. Finder-Full: state 0='normal'; state 1='transition state A'; state 2='transition state A'; state 3='amplification'; state 4='transition state B'; state 5='transition state B'; state 6='deletion'
  3. Annotator: state 0='normal, or copy number ratio 2:2'; state 1='3:2'; state 2='4:2'; state 3='>4:2'; state 4='1:2'; state 5='0:2'
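As an illustration of how the Finder-Core state integers might be turned into CNV calls, the following Python sketch groups consecutive probes that share a state into segments. It is not part of BreakPtr. The layout of the output file is not described here, so the sketch assumes the "State" column has already been parsed into (position, state) pairs sorted by position; the function name and example values are hypothetical.

```python
from itertools import groupby

# State labels as listed for the Finder-Core output.
FINDER_CORE_STATES = {0: "normal", 1: "amplification", 2: "deletion"}


def call_segments(probes, labels=FINDER_CORE_STATES):
    """Group consecutive probes with the same state into segments.

    `probes` is an iterable of (position, state) pairs sorted by position.
    Returns a list of (start, end, label) tuples, one per non-normal run.
    """
    segments = []
    for state, run in groupby(probes, key=lambda probe: probe[1]):
        run = list(run)
        label = labels[state]
        if label != "normal":
            segments.append((run[0][0], run[-1][0], label))
    return segments


if __name__ == "__main__":
    # Hypothetical probe positions and states, for illustration only.
    example = [(100, 0), (200, 2), (300, 2), (400, 0), (500, 1)]
    for start, end, label in call_segments(example):
        print(start, end, label)
```

The first and last probe of each segment approximate the predicted breakpoints at probe resolution.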

5. Graphical User Interface (GUI)

To run BreakPtr modules via the GUI, click "BreakPtrGUI.jar" in the main BreakPtr/ directory. The GUI below will appear. The GUI can also be opened from the terminal, e.g. in a Linux environment, by entering the BreakPtr/ directory and typing: java -jar BreakPtrGUI.jar

[Figure: screenshot of the BreakPtr GUI]

First, choose the BreakPtr module to run from the "Select Module" menu at the left side of the window. A new input form for the selected module (Trainer-Core, Trainer-Full, Trainer-Annotator, Finder-Core, Finder-Full, or Annotator) will appear, with the paths of the example files and directories on your computer already inserted. To run the module on the example data, press "Run Module." Then wait until the path to the output file appears in the "Output File" text field before you retrieve the output. To run the module on your own data, press "Clear Fields" and input the necessary information. Paths can either be typed in manually or selected with the "Browse" button. In the Trainer-Core and Trainer-Full modules, the "Aberrations File" and "BreakPoints File" can be generated by pressing the "Create" button to the right of the "Browse" button. Finally, a "Help" menu at the top-right corner of the screen provides field descriptions and advice for each BreakPtr module.

6. Command Line Examples

To run the following examples via command line, replace "/BreakPtrHomePath/" in the classpath instruction with the location of BreakPtr on your computer. Then change into the BreakPtr directory, and copy and paste the command into the terminal. (In graphical environments, the directory location can usually be found under "Location" in the "Properties" window. In command line environments, the location can be found by entering the BreakPtr directory with the "cd" utility, then typing "pwd" into the terminal. Please find help online for basic UNIX commands if necessary.)

  1. TRAINER-CORE To run the initial training and generate an hmm file, use:
    java -classpath .:/BreakPtrHomePath/BreakPtr/Java/bin/:/BreakPtrHomePath/BreakPtr/Java/bin/allJars.jar \
    HMMTrainer -r Data/Training/Large_aberrations.txt -c Data/Training/control/ -a Data/Training/aberrations/ \
    -t 1.0E-9 -o Output/TrainerCoreOutput.txt
    where option:
    -r is the path to the "aberrations" file with known amplification and deletion regions on which to train
    -c is the path to the "control" directory with files that contain control regions on which to train
    -a is the path to the "aberrations" directory with files that contain amplification and deletion regions on which to train
    -t is the transition probability between states
    -o is the path to the output file.

    The output is in BreakPtr/Output/TrainerCoreOutput.txt, which should be identical to BreakPtr/Output/example_output/example_training.hmm.
  2. TRAINER-FULL To run the initial training and generate an hmm file, use:
    java -classpath .:/BreakPtrHomePath/BreakPtr/Java/bin/:/BreakPtrHomePath/BreakPtr/Java/bin/allJars.jar \
    HMMTrainerFull -r Data/Training/Large_aberrations.txt -c Data/Training/control/ -a Data/Training/aberrations/ \
    -b Data/Training/verified_breakpoints_chr22_TEST.txt -z Data/Chain/chr22_SelfChain_isoTm_hg17.pos.gz \
    -t 1.0E-9 -o Output/TrainerFullOutput.txt
    where option:
    -r is the path to the "aberrations" file with known amplification and deletion regions on which to train
    -c is the path to the "control" directory with files that contain control regions on which to train
    -a is the path to the "aberrations" directory with files that contain amplification and deletion regions on which to train
    -b is the path to the "breakpoints" file with regions for training transition regions
    -z is the path to the "blastZ" file with measures of nucleotide sequence redundancy
    -t is the transition probability between states
    -o is the path to the output file.

    The output is in BreakPtr/Output/TrainerFullOutput.txt, which should be identical to BreakPtr/Data/Models/example_full.hmm.
  3. FINDER-CORE To run the core parameterization of BreakPtr, use:
    java -classpath .:/BreakPtrHomePath/BreakPtr/Java/bin/:/BreakPtrHomePath/BreakPtr/Java/bin/allJars.jar \
    MakeEmissions -l Data/Subject/36320__normalized_comForRev_unav.txt.gz -h Data/Models/example_core.hmm
    where option:
    -l is the path to the "log2ratio" file on which to apply the Finder
    -h is the path to the "hmm" file with the parameters for the Finder-Core hidden Markov model
    -t is the new transition probability between states in the hmm; if this option is omitted, the transition probability is left unchanged.

    The output is in BreakPtr/Output/FinderCoreOutput.txt, which should be identical to BreakPtr/Output/example_output/example_core_output.txt
  4. FINDER-FULL The 'full' parameterization of the Finder, which we usually implement according to criteria based on Scott [2], segments CGH data and analyzes chromosomal DNA sequence. To run the full parameterization of BreakPtr, use:
    java -classpath .:/BreakPtrHomePath/BreakPtr/Java/bin/:/BreakPtrHomePath/BreakPtr/Java/bin/allJars.jar \
    MakeEmissions -l Data/Subject/36320__normalized_comForRev_unav.txt.gz -h Data/Models/example_full.hmm \
    -b Data/Chain/chr22_SelfChain_isoTm_hg17.pos.gz
    where option:
    -l is the path to the "log2ratio" file on which to apply the Finder
    -h is the path to the "hmm" file with the parameters for the Finder-Full hidden Markov model
    -b is the path to the "blastZ" file with measures of nucleotide sequence redundancy
    -t is the new transition probability between states in the hmm; if this option is omitted, the transition probability is left unchanged.

    The output is in BreakPtr/Output/FinderFullOutput.txt, which should be identical to BreakPtr/Output/example_output/example_full_output.txt
  5. ANNOTATOR To run the dosage estimator, use:
    java -classpath .:/BreakPtrHomePath/BreakPtr/Java/bin/:/BreakPtrHomePath/BreakPtr/Java/bin/allJars.jar \
    MakeEmissions -l Data/Subject/36320__normalized_comForRev_unav.txt.gz -h Data/Models/example_dosage.hmm
    where option:
    -l is the path to the "log2ratio" file on which to apply the Annotator
    -h is the path to the "hmm" file with the parameters for the Annotator hidden Markov model.

    The output is in BreakPtr/Output/AnnotatorOutput.txt, which should be identical to BreakPtr/Output/example_output/example_output_dosage_estimator.txt
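
Each example above states that the generated output should be identical to a file shipped with the example data. One way to check this is sketched below in Python. This helper is not part of BreakPtr; the function name is hypothetical, and the sketch assumes both files are plain (uncompressed) text.

```python
from itertools import zip_longest


def first_difference(produced_path, expected_path):
    """Return the 1-based number of the first differing line, or None.

    A missing line in the shorter file counts as a difference.
    """
    with open(produced_path) as produced, open(expected_path) as expected:
        pairs = zip_longest(produced, expected)
        for number, (line_a, line_b) in enumerate(pairs, start=1):
            if line_a != line_b:
                return number
    return None
```

For instance, calling first_difference("Output/AnnotatorOutput.txt", "Output/example_output/example_output_dosage_estimator.txt") from the BreakPtr directory would return None if the two files match line for line.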
Remember that the -t option [between-state transition probability] can be utilized to relax transition probabilities, and increase the sensitivity of breakpoint prediction. Simply append -t pb to your Trainer or Finder command, where pb is the new transition probability.
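
For scripted runs, the command lines above can also be assembled programmatically. The following Python sketch is a hedged illustration, not part of BreakPtr: the helper names are hypothetical, and only the class names and options shown in the examples above are used. It builds the argument list and prints it; whether BreakPtr accepts the number format produced by str() for the -t value has not been verified here.

```python
import os
import shlex


def build_classpath(home, sep=os.pathsep):
    """Build the classpath from the examples for an install under `home`."""
    bin_dir = f"{home.rstrip('/')}/BreakPtr/Java/bin"
    return sep.join([".", f"{bin_dir}/", f"{bin_dir}/allJars.jar"])


def build_command(home, main_class, options, transition_probability=None):
    """Return the argument list for one BreakPtr module.

    `options` is a list of (flag, value) pairs such as ("-l", path).
    If `transition_probability` is given, "-t <value>" is appended.
    """
    command = ["java", "-classpath", build_classpath(home), main_class]
    for flag, value in options:
        command += [flag, value]
    if transition_probability is not None:
        command += ["-t", str(transition_probability)]
    return command


if __name__ == "__main__":
    # Finder-Core example with a relaxed between-state probability.
    command = build_command(
        "/BreakPtrHomePath",
        "MakeEmissions",
        [("-l", "Data/Subject/36320__normalized_comForRev_unav.txt.gz"),
         ("-h", "Data/Models/example_core.hmm")],
        transition_probability=1e-5,
    )
    print(shlex.join(command))
    # To execute, pass `command` to subprocess.run with cwd set to the
    # BreakPtr directory.
```

Using os.pathsep as the default separator yields ":" on UNIX-like systems and ";" on Windows, which is the classpath separator Java expects on each platform.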

7. Citing BreakPtr

Please cite the following paper if you use BreakPtr in your research:

Korbel JO, Urban AE, Grubert F, Du J, Royce TE, Starr P, Zhong G, Emanuel BS, Weissman SM, Snyder M & Gerstein MB (2007) Systematic prediction and validation of breakpoints associated with copy-number variants in the human genome, submitted.

8. References

  1. Korbel JO, Urban AE, Grubert F, Du J, Royce TE, et al. (2007) Systematic prediction and validation of breakpoints associated with copy-number variants in the human genome. submitted.
  2. Scott DW (1979) On optimal and data-based histograms. Biometrika 66: 605-610.

9. Contact

Contact details can be found on Jan Korbel's website or on Mark Gerstein's website:

http://homes.gersteinlab.org/people/korbel/
http://www.gersteinlab.org/about/