Release 1.0, April 2007
Documentation provided by Jan Korbel, Peter Starr, Mark Gerstein
3. Training and Generating an HMM (model)
4. Running the Finder and Annotator
7. Citing BreakPtr
BreakPtr (short for Break-Pointer) is a computational approach developed at Yale University for fine-mapping copy-number variants (CNVs) based on high-resolution comparative genome hybridization (HighRes-CGH) data and nucleotide sequence characteristics of breakpoints. BreakPtr is described in detail in the PNAS paper below. Usage is free for academic institutions; companies wishing to use BreakPtr are requested to contact Jan Korbel or Mark Gerstein (see below for contact info). BreakPtr's Finder module predicts breakpoints of large deletions and amplifications. Its Annotator identifies actual dosage (copy-number) ratios, and its Flagger identifies regions where probe cross-hybridization may have occurred.
2.1 System Requirements
2.2 Installation Instructions
BreakPtr is tested and validated on the following reference platforms:
BreakPtr Reference Platforms
|Operating system||OS version||Processor architecture||Other requirements|
|Microsoft Windows||XP||Intel x86||Sun Java 2 Standard Edition 1.4.2_10|
|Macintosh||...||...||Sun Java 2 Standard Edition 1.4.2_10|
|UNIX||...||...||Sun Java 2 Standard Edition 1.4.2_10|
tar -xzvf BreakPtr.tar.gz
3.2 Example Training on New Data
3.3 Setting Transition Probabilities
BreakPtr training makes use of prior knowledge on CNVs. Ideally, the data on which BreakPtr is trained come from the same hybridization conditions and array design as the data used for predicting CNVs. As an alternative, a set of training data is already provided (i.e., a model trained on chromosome 22 HighRes-CGH data); note, however, that accuracy may be compromised when this training set is applied to data from other hybridization conditions or array designs. It is usually advantageous to have some prior knowledge about copy-number variation (deletions and duplications) in the set to be analyzed. For instance, about 2 copy-number variants, both losses and gains, are usually sufficient to train the 'core' model of the Finder, and alternative parameterizations are implemented based on criteria introduced by Scott. If no prior knowledge on CNVs exists, a reasonable alternative is to use CNVs 'guessed' from visual/manual inspection of HighRes-CGH data (i.e. plots showing normalized log2-ratios vs. their chromosomal coordinate).
Example: training the 'core' model (module Finder) on new data:
- To train the 'normal' state, place the data files (normalized log2ratio file(s)) that will serve as non-CNV control regions into BreakPtr/Data/Training/normal. In practice, if no control experiment (same genomic DNA labeled with different fluorescent dyes hybridized to a single array) is available, we have found it is usually reasonable to use one or several 'typical' array hybridizations as controls for training the 'normal' state, the reasoning being that most of the genome should be unaffected by copy-number variation. This does not hold if large portions of the genome are 'abnormal', as may be the case for some cancer tissues. List the control experiments in the file Control_experiments_for_parameter_estimation.txt in the Data folder.
An essential aspect of parameter estimation is providing a reasonable estimate for transition probabilities between states (e.g. the probability for the HMM to transit between 'normal' and 'deletion' states) and within states. We suggest that transition probabilities pb should be estimated for each transition between states by dividing the number of previously known/presumed aberrations (based on approximately mapped deletions/duplications) in the concatenated training set by the number of probes. The transition probabilities pw within states then follow from the requirement that all probabilities leaving a state sum up to 1: pw for a state equals 1 minus the sum of the pb values for transitions out of that state, e.g. pw11 = 1 - pb12 - pb13 for pw11 (transition from state 1 to state 1, or "within state 1") in a 3-state HMM such as the 'core' model of the Finder.
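The estimate described above can be sketched in a few lines of Python. This is only an illustration, not part of BreakPtr: the function names are hypothetical, the counts are made-up numbers, and the sketch assumes the same pb is used for every transition out of a state.

```python
def estimate_pb(n_aberrations: int, n_probes: int) -> float:
    """Between-state transition probability: known/presumed aberrations per probe."""
    if n_probes <= 0:
        raise ValueError("n_probes must be positive")
    return n_aberrations / n_probes


def within_state_probability(pb: float, n_states: int = 3) -> float:
    """Within-state probability pw, assuming one shared pb for all other states.

    For a 3-state HMM this is pw11 = 1 - pb12 - pb13 with pb12 == pb13 == pb.
    """
    return 1.0 - (n_states - 1) * pb


# Hypothetical training set: 4 known aberrations across 400,000 probes.
pb = estimate_pb(n_aberrations=4, n_probes=400_000)
pw = within_state_probability(pb, n_states=3)
print(f"pb = {pb:g}, pw = {pw:.10g}")
```

If different transitions warrant different pb values (for example, deletions being more frequent than duplications in the training set), pw for each state would instead be 1 minus the sum of that state's individual pb values.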
In practice, the initial estimates for transition probabilities currently have to be added to the HMM-model file manually, i.e. in the following way for the 'core' model in the file "example_core.hmm":
# Transition matrix:
0.999999998 1e-9 1e-9
1e-9 0.999999998 1e-9
1e-9 1e-9 0.999999998
This can e.g. be altered to (assuming pb=1e-5):
# Transition matrix:
0.99998 1e-5 1e-5
1e-5 0.99998 1e-5
1e-5 1e-5 0.99998
In the first row of the matrix, [0.999999998, 1e-9, 1e-9] represent the transition probability estimates pw11 (transition within state 1='normal'), pb12 (between 'normal' and 'duplication/amplification' states), and pb13 (between 'normal' and 'deletion' states), and so forth for the remaining rows. Be aware that the probabilities in each row must sum to 1. When refining the file manually, take care not to modify existing tab or space characters, as this may compromise file reading.
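Because the matrix is edited by hand, it can help to generate the rows and confirm that each sums to 1 before pasting them into the model file. The following Python sketch is an assumption-laden helper, not a BreakPtr tool: the function names are hypothetical, a single pb is assumed for all between-state transitions, and the space separator is a guess, so match whatever separator your existing .hmm file uses.

```python
import math


def build_transition_matrix(pb: float, n_states: int = 3) -> list[list[float]]:
    """Square matrix with pb off the diagonal and pw = 1 - (n_states - 1) * pb on it."""
    pw = 1.0 - (n_states - 1) * pb
    return [[pw if i == j else pb for j in range(n_states)] for i in range(n_states)]


def rows_sum_to_one(matrix: list[list[float]], tol: float = 1e-9) -> bool:
    """True if every row sums to 1 within an absolute tolerance."""
    return all(math.isclose(sum(row), 1.0, abs_tol=tol) for row in matrix)


def format_rows(matrix: list[list[float]], sep: str = " ") -> list[str]:
    """One text line per matrix row; the separator is an assumption."""
    return [sep.join(f"{p:.10g}" for p in row) for row in matrix]


matrix = build_transition_matrix(pb=1e-5, n_states=3)
if not rows_sum_to_one(matrix):
    raise ValueError("each row of the transition matrix must sum to 1")
for line in format_rows(matrix):
    print(line)
```

Note that Python prints small values as 1e-05 rather than 1e-5; both denote the same number, but check that the notation you paste is one the model-file reader accepts.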
Place the appropriate files in the directories below and follow examples (2), (3), and (4) in Examples.
|BreakPtr/Data/Models/||All HMM model files|
|BreakPtr/Data/Subject/||All Log2Ratio data files on which to run the Finder and Annotator|
|BreakPtr/Data/Chain/||All Local-Sequence-BlastZ redundancy files (only needed for the 'full' parameterization, and currently only available for chromosomes 11 and 22, assembly hg17; if you want to run the 'full' parameterization using a different HighRes-CGH array, a corresponding file has to be generated as described in )|
To run the following examples, first change into the BreakPtr directory.
The output is in out.hmm, which should be identical to "Output/example_training.hmm".
./Batch_train_Crosshyb_mvnHMM.pl -i_ab Data/Chromosomal_aberrations_for_parameter_estimation.txt -i_norm Data/Control_experiments_for_parameter_estimation.txt -s 3 -c 1
The results will be in BreakPtr/Output/outfile, which should be identical to BreakPtr/Output/example_output/example_core_output.txt [state 0='normal'; state 1='amplification'; state 2='deletion'].
./make_multivariate_emission_and_jaHMM.pl -i Data/Subject/36320__normalized_comForRev_unav.txt.gz -hmm Data/Models/example_core.hmm -Gauss > outfile
The results will be in BreakPtr/Output/outfile, which should be identical to BreakPtr/Output/example_output/example_full_output.txt [state 0='normal'; state 1='transition state A'; state 2='transition state A'; state 3='amplification'; state 4='transition state B'; state 5='transition state B'; state 6='deletion'].
./make_multivariate_emission_and_jaHMM.pl -i Data/Subject/36320__normalized_comForRev_unav.txt.gz -hmm Data/Models/example_full.hmm -b_loc Data/Chain/chr22_SelfChain_isoTm_hg17.pos.gz > outfile
The results will be in BreakPtr/Output/outfile, which should be identical to BreakPtr/Output/example_output/example_output_dosage_estimator.txt [state 0='normal, or copy number ratio 2:2'; state 1='3:2'; state 2='4:2'; state 3='>4:2'; state 4='1:2'; state 5='0:2'].
./make_multivariate_emission_and_jaHMM.pl -i Data/Subject/36320__normalized_comForRev_unav.txt.gz -Gauss -hmm Data/Models/example_dosage.hmm > your_out_put.txt
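Each example above states that the result should be identical to a reference file shipped with BreakPtr. One way to check this is a byte-for-byte comparison, sketched here in Python with the standard-library filecmp module. The helper name outputs_match is hypothetical and not part of BreakPtr.

```python
import filecmp
from pathlib import Path


def outputs_match(produced: str, expected: str) -> bool:
    """Return True if the two files have identical contents (byte-for-byte)."""
    produced_path, expected_path = Path(produced), Path(expected)
    for path in (produced_path, expected_path):
        if not path.is_file():
            raise FileNotFoundError(f"missing file: {path}")
    # shallow=False forces a content comparison instead of comparing only
    # file size and modification time.
    return filecmp.cmp(produced_path, expected_path, shallow=False)
```

For the 'core' example, a call such as outputs_match("outfile", "Output/example_output/example_core_output.txt"), run from the BreakPtr directory, would perform the check. Adjust the first path to wherever your shell redirect actually wrote the results.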
Please cite the following paper if you use BreakPtr in your research:
Korbel JO, Urban AE, Grubert F, Du J, Royce TE, Starr P, Zhong G, Emanuel BS, Weissman SM, Snyder M & Gerstein MB (2007) Systematic prediction and validation of breakpoints associated with copy-number variants in the human genome, submitted.
Contact details can be found on Jan Korbel's website or on Mark Gerstein's website: