BreakPtr Documentation

Release 1.0, April 2007
Documentation provided by Jan Korbel, Peter Starr, Mark Gerstein

1. Introduction
2. Installation
3. Training and Generating an HMM (model)
4. Running the Finder and Annotator
5. Examples
6. References
7. Citing BreakPtr
8. Contact

1. Introduction

BreakPtr (abbrev. for Break-Pointer) is a computational approach developed at Yale University for fine-mapping copy-number variants (CNVs) based on high-resolution comparative genome hybridization (HighRes-CGH) data and nucleotide sequence characteristics of breakpoints. BreakPtr is described in detail in the PNAS paper below. Usage is free for academic institutions; companies wishing to use BreakPtr are requested to contact Jan Korbel or Mark Gerstein (see below for contact info). BreakPtr's Finder module predicts breakpoints of large deletions and amplifications. Its Annotator module identifies actual dosage (copy-number) ratios, and its Flagger module identifies regions where probe cross-hybridization may have occurred.

2. Installation

2.1 System Requirements
2.2 Installation Instructions

2.1 System Requirements

BreakPtr is tested and validated on the following reference platforms:

BreakPtr Reference Platforms

Operating system    OS version  Processor architecture  Other requirements
Microsoft Windows   XP          Intel x86               Sun Java 2 Standard Edition 1.4.2_10, JDK 5
Macintosh           ...         ...                     Sun Java 2 Standard Edition 1.4.2_10, JDK 5
UNIX                ...         ...                     Sun Java 2 Standard Edition 1.4.2_10, JDK 5

2.2 Installation Instructions

  1. Move BreakPtr.tar.gz to your preferred installation directory.
  2. Unpack BreakPtr.tar.gz by changing to this directory at the command prompt and typing:
    tar -xzvf BreakPtr.tar.gz
  3. A new directory BreakPtr/ will be created in the current directory. Enter this directory to run BreakPtr.
  4. Edit the config.properties file to indicate where the necessary external files are located. See System Requirements for the software that must already be installed before you attempt to install BreakPtr.

3. Training and Generating an HMM (model)

3.1 Background
3.2 Example Training on New Data
3.3 Setting Transition Probabilities

3.1 Background

BreakPtr training makes use of prior knowledge on CNVs. Ideally, the data on which BreakPtr is trained come from the same hybridization conditions and array design as the data used for predicting CNVs. As an alternative, a trained model is already provided (a model trained on chromosome 22 HighRes-CGH data); note, however, that accuracy may be compromised when this model is applied to data from other hybridization conditions or array designs. It is usually advantageous to have some prior knowledge about copy-number variation (deletions and duplications) in the set to be analyzed. For instance, about 2 copy-number variants, both losses and gains, are usually sufficient to train the 'core' model of the Finder; alternative parameterizations are implemented based on criteria introduced by Scott. If no prior knowledge on CNVs exists, a reasonable alternative is to use CNVs 'guessed' by visual/manual inspection of HighRes-CGH data (i.e. plots showing normalized log2-ratios vs. their chromosomal coordinate).

3.2 Example Training on New Data

Example: training the 'core' model (module Finder) on new data:

  1. To train the 'normal' state, place the data files (normalized log2ratio file(s)) that will serve as non-CNV control regions into BreakPtr/Data/Training/normal. In practice, if no control experiment (the same genomic DNA labeled with different fluorescent dyes and hybridized to a single array) is available, we have found it usually reasonable to use one or several 'typical' array hybridizations as controls for training the 'normal' state, the reasoning being that most of the genome should be unaffected by copy-number variation. This does not hold if large portions of the genome are 'abnormal', as may be the case for some cancer tissues. List the control experiments in the file Control_experiments_for_parameter_estimation.txt in the Data folder.
  2. Place normalized log2ratio files that contain known CNV regions (or some 'guessed' CNVs) for training into BreakPtr/Data/Training/cnp.
  3. Following the format of BreakPtr/Data/Training/Large_Aberrations.txt, record, for each array used to train the 'deletion' and 'duplication' states, the experiment ID, the dosage of each CNV (we use known dosages/copy-number ratios, or the best possible estimates according to log2-ratio values within the CNV), the chromosome, the CNV start position, and the CNV end position. Use the file Large_Aberrations.txt as a template: spaces and tabs matter and, if used incorrectly, may prevent the data from being read.
  4. Follow example (1) in Examples to perform the training and generate the HMM file.

3.3 Setting Transition Probabilities

An essential aspect of parameter estimation is providing a reasonable estimate for the transition probabilities between states (e.g. the probability for the HMM to move between the 'normal' and 'deletion' states) and within states. We suggest estimating the between-state transition probability pb for each transition by dividing the number of previously known or presumed aberrations (based on approximately mapped deletions/duplications) in the concatenated training set by the number of probes. The within-state transition probabilities pw are then set so that the probabilities in each row sum to 1. For example, for a 3-state HMM such as the 'core' model of the Finder, pw11 (the transition from state 1 to state 1, or "within state 1") is:

        pw11 = 1 - pb12 - pb13
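
The estimate described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of BreakPtr: the function names are hypothetical, the same pb is used for every between-state transition, and the counts in the usage example (4 aberrations, 400,000 probes) are made up so that pb comes out at 1e-5.

```python
def estimate_transition_row(n_aberrations, n_probes, n_states=3, state=0):
    """Return one row of a transition matrix (hypothetical helper).

    pb = number of known/presumed aberrations / number of probes
    pw = 1 - (n_states - 1) * pb, so that the row sums to 1
    """
    if n_probes <= 0:
        raise ValueError("n_probes must be positive")
    pb = n_aberrations / n_probes
    pw = 1.0 - (n_states - 1) * pb
    if pw <= 0.0:
        raise ValueError("pb is too large for a valid within-state probability")
    row = [pb] * n_states
    row[state] = pw  # within-state probability sits on the diagonal
    return row


def estimate_transition_matrix(n_aberrations, n_probes, n_states=3):
    """Build the full matrix, one row per state."""
    return [
        estimate_transition_row(n_aberrations, n_probes, n_states, state=i)
        for i in range(n_states)
    ]


# Made-up counts for illustration only: 4 aberrations over 400,000 probes.
matrix = estimate_transition_matrix(n_aberrations=4, n_probes=400_000)
for row in matrix:
    print("  ".join(f"{p:.10g}" for p in row))
```

The printed values are meant to be copied by hand into the model file; the sketch deliberately does not write the file itself, because the exact whitespace layout of the HMM file must be preserved.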

In practice, the initial estimates for transition probabilities currently have to be added to the HMM-model file manually, i.e. in the following way for the 'core' model in the file "example_core.hmm":

# Transition matrix:
        0.999999998     1e-9    1e-9
        1e-9    0.999999998     1e-9
        1e-9    1e-9    0.999999998

This can be altered, for example, to (assuming pb=1e-5):

# Transition matrix:
        0.99998     1e-5    1e-5
        1e-5    0.99998     1e-5
        1e-5    1e-5    0.99998

In the first row of the matrix, [0.999999998, 1e-9, 1e-9] are the transition probability estimates pw11 (transition within state 1='normal'), pb12 (between the 'normal' and 'duplication/amplification' states), and pb13 (between the 'normal' and 'deletion' states); the other rows follow the same pattern. The probabilities in each row should sum to 1. When editing the file manually, take care not to modify existing tab or space characters, as this may prevent the file from being read correctly.
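
Since each row must sum to 1, a quick read-only check after manual editing can catch arithmetic slips. The following Python sketch is an illustration, not a BreakPtr tool: it assumes the matrix rows directly follow a "# Transition matrix:" comment line as in the excerpt above and are separated by whitespace; the rest of the HMM file layout is not known here, and the helper names are hypothetical. It only reads text and never rewrites the file, so existing tabs and spaces are left untouched.

```python
import math


def parse_transition_matrix(text):
    """Collect the numeric rows that follow a '# Transition matrix:' line."""
    rows = []
    in_matrix = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            # A comment line either opens the matrix block or closes it.
            in_matrix = stripped.lower().startswith("# transition matrix")
            if rows and not in_matrix:
                break
            continue
        if not in_matrix:
            continue
        if not stripped:
            if rows:
                break  # a blank line after the rows ends the block
            continue
        rows.append([float(value) for value in stripped.split()])
    return rows


def rows_not_summing_to_one(rows, tol=1e-9):
    """Return the indices of rows whose probabilities do not sum to 1."""
    return [
        i for i, row in enumerate(rows)
        if not math.isclose(sum(row), 1.0, rel_tol=0.0, abs_tol=tol)
    ]


EXAMPLE = """# Transition matrix:
        0.99998     1e-5    1e-5
        1e-5    0.99998     1e-5
        1e-5    1e-5    0.99998
"""

rows = parse_transition_matrix(EXAMPLE)
print("rows read:", len(rows))
print("rows with a bad sum:", rows_not_summing_to_one(rows))
```

To check a real model file, the text could be read with open(path).read() and passed to parse_transition_matrix in the same way.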

4. Running the Finder and Annotator

Place the appropriate files in the directories below and follow examples (2), (3), and (4) in Examples.

BreakPtr Directories

Directory                 Files contained
BreakPtr/Data/Models/     All HMM model files
BreakPtr/Data/Subject/    All log2-ratio data files on which to run the Finder and Annotator
BreakPtr/Data/Chain/      All Local-Sequence-BlastZ redundancy files (only needed for the 'full' parameterization, and currently only available for chromosomes 11 and 22, assembly hg17; if you want to run the 'full' parameterization using a different HighRes-CGH array, a corresponding file has to be generated as described in )

5. Examples

To run the following examples, first change into the BreakPtr directory.

  1. TRAINER To run the initial training and generate an HMM file, use:
    ./Batch_train_Crosshyb_mvnHMM.pl -i_ab Data/Chromosomal_aberrations_for_parameter_estimation.txt
    -i_norm Data/Control_experiments_for_parameter_estimation.txt -s 3 -c 1
    The output is in out.hmm, which should be identical to "Output/example_training.hmm".
  2. FINDER-CORE To run the core parameterization of BreakPtr, use:
    ./make_multivariate_emission_and_jaHMM.pl -i Data/Subject/36320__normalized_comForRev_unav.txt.gz
    -hmm Data/Models/example_core.hmm -Gauss > outfile
    The results will be in BreakPtr/Output/outfile, which should be identical to BreakPtr/Output/example_output/example_core_output.txt [state 0='normal'; state 1='amplification'; state 2='deletion']
  3. FINDER-FULL The 'full' parameterization of the Finder, which we usually implement according to criteria based on Scott, involves segmenting CGH data and analyzing chromosomal DNA sequence. To run the full parameterization of BreakPtr, use:
    ./make_multivariate_emission_and_jaHMM.pl -i Data/Subject/36320__normalized_comForRev_unav.txt.gz
    -hmm Data/Models/example_full.hmm -b_loc Data/Chain/chr22_SelfChain_isoTm_hg17.pos.gz > outfile
    The results will be in BreakPtr/Output/outfile, which should be identical to BreakPtr/Output/example_output/example_full_output.txt [state 0='normal'; state 1='transition_state_A'; state 2='transition_state_A'; state 3='amplification'; state 4='transition state B'; state 5='transition state B'; state 6='deletion']
  4. ANNOTATOR To run the dosage estimator, use:
    ./make_multivariate_emission_and_jaHMM.pl -i Data/Subject/36320__normalized_comForRev_unav.txt.gz -Gauss
    -hmm Data/Models/example_dosage.hmm > your_out_put.txt
    The results will be in your_out_put.txt, which should be identical to BreakPtr/Output/example_output/example_output_dosage_estimator.txt [state 0='normal, or copy number ratio 2:2'; state 1='3:2'; state 2='4:2'; state 3='>4:2'; state 4='1:2'; state 5='0:2']
The -t option [between-state transition probability] can be utilized to relax transition probabilities, and increase the sensitivity of breakpoint prediction.
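
For orientation, the copy-number ratios reported by the ANNOTATOR can be related to idealized log2-ratios with a short calculation. The Python sketch below is illustrative only: the function name is hypothetical, and the values are theoretical expectations for a perfect hybridization. Measured, normalized log2-ratios on real arrays usually deviate from these values, and the Annotator's actual decision is based on the trained model, not on this formula. The open-ended '>4:2' class has no single expected value and is left out.

```python
import math


def expected_log2_ratio(test_copies, reference_copies=2):
    """Idealized log2-ratio for a given test:reference copy-number ratio."""
    if reference_copies <= 0:
        raise ValueError("reference_copies must be positive")
    if test_copies < 0:
        raise ValueError("test_copies must not be negative")
    if test_copies == 0:
        # A homozygous deletion has no signal in theory: log2(0) is -infinity.
        return float("-inf")
    return math.log2(test_copies / reference_copies)


# Ratio labels as used for the dosage states above ('>4:2' omitted).
for label in ("2:2", "3:2", "4:2", "1:2", "0:2"):
    test, reference = (int(part) for part in label.split(":"))
    print(label, expected_log2_ratio(test, reference))
```

Comparing these idealized values with the log2-ratios observed inside a known CNV gives a rough sense of how strongly a given array compresses the signal, which can help when choosing dosage estimates for training.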

6. References

7. Citing BreakPtr

Please cite the following paper if you use BreakPtr in your research:

Korbel JO, Urban AE, Grubert F, Du J, Royce TE, Starr P, Zhong G, Emanuel BS, Weissman SM, Snyder M & Gerstein MB (2007) Systematic prediction and validation of breakpoints associated with copy-number variants in the human genome, submitted.

8. Contact

Contact details can be found on Jan Korbel's website or on Mark Gerstein's website:

http://homes.gersteinlab.org/people/korbel/
http://www.gersteinlab.org/about/