Introduction
The GGTC gene trap screen uses two different PCR methods to analyze the vector integration sites: Splinkerette (SPLK) PCR and 5' RACE.
These methods are used to obtain genomic sequence tags and cDNA-type sequence tags respectively. The pipeline for the analysis of the gene trap sequence tags
(GTST) combines various bioinformatic techniques and is used to localize the vector integration site on the genome and to identify the mutated gene:
 figure1
1. Sequence processing:
In the first step of the pipeline, the GTSTs are preprocessed for subsequent BLAST searches (fig.1). Analysis of mutant ES cell lines by PCR yields sequence tags
that contain both a short stretch of vector sequence and sequence of the mutated gene. The key step in the sequence processing is clipping the vector sequence.
Successful identification of the transition zone between the vector and the gene sequence is important for the prediction of the true vector integration site
(fig.2). The end of the sequence tag is determined by a PCR-specific SPLK adapter sequence. If the SPLK adapter sequence cannot be identified, the low-quality sequence
end is clipped instead. The resulting sequence passes several filtering and masking steps, e.g. a repeat-masking step and a low-complexity filter.
 figure2
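The clipping logic described above can be sketched roughly as follows. This is only an illustration, not the GGTC implementation: the function name clip_tag, the exact string matching, and the quality cutoff min_q are assumptions made for the example, and a production pipeline would more likely use a mismatch-tolerant alignment to find the vector and adapter.

```python
def clip_tag(read, vector_end, adapter, quals=None, min_q=20):
    """Return (gene_sequence, transition_index) for one raw read.

    Hypothetical helper: exact matching and the min_q cutoff are
    assumptions for this sketch, not values taken from the pipeline.
    """
    # Vector clip: the transition zone starts right after the known
    # terminal vector sequence.
    pos = read.find(vector_end)
    if pos < 0:
        return None, None  # no vector sequence found, tag cannot be anchored
    start = pos + len(vector_end)

    # The end of the tag is marked by the SPLK adapter, if it is present.
    end = read.find(adapter, start)
    if end < 0:
        # Adapter not found: clip the low-quality sequence end instead.
        end = len(read)
        if quals is not None:
            while end > start and quals[end - 1] < min_q:
                end -= 1
    return read[start:end], start


# Example with made-up sequences: vector end, gene sequence, adapter.
read = "AAACCGGTT" + "ACGTACGTAC" + "GGATCC"
print(clip_tag(read, "CCGGTT", "GGATCC"))
```

The returned transition index is kept alongside the clipped sequence because the later steps need it to deduce the integration site.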
2. BLAST alignments:
BlastN is used to map the sequence tag to the mouse genome (ENSEMBL).
3. Identification of the vector insertion point:
To locate the vector insertion site as precisely as possible, the transition zone between the vector sequence and the endogenous sequence of the trapped
gene needs to be identified. This transition zone is used to deduce the true vector insertion point from the alignment starting point (fig.2).
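As a rough illustration of this deduction, the sketch below shifts the genomic start of the alignment back by the number of tag bases that lie between the transition zone and the first aligned base. The function name, the 0-based query indices, and the simple ungapped extrapolation are assumptions made for the example.

```python
def insertion_point(q_start, s_start, strand, transition=0):
    """Extrapolate the vector insertion point from an alignment start.

    q_start    -- 0-based index in the tag where the alignment begins
    s_start    -- genomic coordinate of the first aligned base
    strand     -- +1 or -1, orientation of the alignment on the genome
    transition -- 0-based index of the first non-vector base in the tag
                  (0 if the vector sequence has already been clipped)
    All names are hypothetical; the unaligned bases are assumed to
    continue on the genome without gaps.
    """
    offset = q_start - transition  # unaligned gene bases before the alignment
    return s_start - strand * offset


# Alignment begins 5 bases after the transition, on either strand.
print(insertion_point(5, 1000, +1))
print(insertion_point(5, 1000, -1))
```

If the alignment starts directly at the transition, the offset is zero and the insertion point equals the alignment starting point.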
4. Genomic localization:
After running BLAST searches with all sequence tags, two different pipelines are used for cDNA and genomic sequence tags respectively. Due to
their origin from mRNA transcripts, cDNA tags result in gapped alignments when aligned to the mouse genome sequence, whereas genomic tags normally result in linear
alignments:
(i) genomic sequence tags:
The crucial step in localizing a sequence tag is the selection of the best alignment. The alignments are
filtered for significance and the remaining alignments are screened to identify the best matching alignment (fig.3). If both PCR analyses, 5' and 3' Splinkerette PCR,
return significant results for a particular cell line, the genomic localization of the vector integration is verified by cross-comparing the resultant
genomic coordinates, the chromosome matched, and the orientation. The distance between the vector integration points predicted by the 5' and 3' Splinkerette PCR
analyses should not exceed a threshold of 1000 nucleotides. Whether a single-end mapping (class III) or a paired-end mapping (class I) is found determines the level of
evidence of the mapping process (fig.3). If no best matching alignment has been found, the alignments of both the 3' SPLK tag and the 5' SPLK tag are screened for paired-end
mappings by an all-against-all approach and the best mapping by e-value is selected (class II). If no paired mappings can be found, both tags are omitted.
 figure3
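One possible reading of the class I/II/III decision is sketched below. Several points are assumptions made for the example rather than documented behaviour: the Hit record and its fields, the use of the lowest e-value as a stand-in for the "best matching alignment", the convention that consistent orientation means equal strand values, and ranking candidate pairs by the sum of their e-values. Only the 1000-nucleotide threshold comes from the text.

```python
from dataclasses import dataclass
from itertools import product

MAX_DIST = 1000  # maximum distance between 5' and 3' integration points (nt)


@dataclass(frozen=True)
class Hit:
    chrom: str
    pos: int       # predicted vector integration point
    strand: int    # +1 or -1
    evalue: float


def consistent(a, b):
    # Same chromosome, same orientation, integration points close together.
    return (a.chrom == b.chrom and a.strand == b.strand
            and abs(a.pos - b.pos) <= MAX_DIST)


def classify(hits5, hits3):
    """Return (class_label, hits) or None if the tags are omitted.

    hits5 / hits3 are the significance-filtered alignments of the
    5' and 3' SPLK tags of one cell line.
    """
    if not hits5 and not hits3:
        return None
    if not hits5 or not hits3:
        # Only one tag mapped: single-end mapping.
        best = min(hits5 or hits3, key=lambda h: h.evalue)
        return ("III", (best,))
    best5 = min(hits5, key=lambda h: h.evalue)
    best3 = min(hits3, key=lambda h: h.evalue)
    if consistent(best5, best3):
        return ("I", (best5, best3))
    # All-against-all screen for any consistent pair.
    pairs = [p for p in product(hits5, hits3) if consistent(*p)]
    if pairs:
        return ("II", min(pairs, key=lambda p: p[0].evalue + p[1].evalue))
    return None  # no paired mapping: both tags are omitted
```

For example, a 5' hit at chr1:1000 and a 3' hit at chr1:1400 on the same strand would be reported as class I, while the same pair found only after the all-against-all screen would be class II.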
(ii) cDNA sequence tags:
The SPLIGN software (NCBI) is used to map cDNA sequence tags to the mouse genome (fig.4). Various gene models predicted by
SPLIGN are compared to select the best matching model. Important selection parameters are the total length of the gene model, identity of the longest exon, and a
minimal deviation of the alignment starting point to the vector integration site (i.e. the transition between vector and gene sequence in terms of the GTST).
The resultant best-matching gene model is used to predict the site where the splice acceptor of the vector has been spliced to the splice donor of the endogenous
exon (most of the vectors used by the GGTC are intron trap vectors that depend on splicing). To reconcile the predicted gene model, the splice site, as deduced
from the cDNA, is compared with the vector integration site, as deduced from the genomic sequence tag analysis. If the predicted splice site deviates by more than 1 Mb from
the vector integration site, is localized on another chromosome, or lies in the opposite orientation, the selected gene model is rejected.
 figure4
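The rejection rule for gene models can be written as a small check. The sketch assumes that both sites are available as (chromosome, position, strand) tuples and that a deviation of exactly 1 Mb is still accepted; the function name and tuple layout are hypothetical.

```python
MAX_SPLICE_DIST = 1_000_000  # 1 Mb, as stated in the text


def accept_gene_model(splice_site, integration_site):
    """Return True if the gene model survives the reconciliation.

    Each argument is a (chrom, pos, strand) tuple; splice_site comes
    from the cDNA tag, integration_site from the genomic tag.
    """
    s_chrom, s_pos, s_strand = splice_site
    i_chrom, i_pos, i_strand = integration_site
    if s_chrom != i_chrom:
        return False  # localized on another chromosome
    if s_strand != i_strand:
        return False  # opposite orientation
    # Reject only if the deviation is larger than 1 Mb.
    return abs(s_pos - i_pos) <= MAX_SPLICE_DIST


print(accept_gene_model(("chr5", 1_200_000, 1), ("chr5", 1_000_000, 1)))
```

The generous distance threshold reflects that the splice site and the integration site may be separated by a long intron.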
5. Annotation:
The genomic coordinate of the vector insertion site, the orientation of the alignment, and the chromosome matched are used to identify the
trapped gene on the basis of the mouse genome annotation (ENSEMBL). The position of the vector insertion relative to
structural elements of the trapped gene (e.g. exons, introns, UTRs, stop signals etc.) is also investigated.
Relevant data from genomic localization and annotation of the sequence tags are stored in a database and presented on the web interface of
the GGTC.
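Placing an insertion site relative to the exons and introns of a gene can be illustrated as follows. The sketch assumes the exon coordinates of one transcript are already available as sorted, inclusive (start, end) pairs; how they are retrieved from the ENSEMBL annotation is not shown, and the function name and return values are hypothetical.

```python
def locate(pos, exons):
    """Classify a genomic position relative to a transcript's exons.

    exons -- list of inclusive (start, end) pairs sorted by coordinate.
    Elements are numbered in genomic order, which is the reverse of
    transcript order for genes on the minus strand.
    """
    if not exons or pos < exons[0][0] or pos > exons[-1][1]:
        return ("outside", None)
    for i, (start, end) in enumerate(exons, 1):
        if start <= pos <= end:
            return ("exon", i)
        if pos < start:
            # Position lies between exon i-1 and exon i.
            return ("intron", i - 1)
    raise AssertionError("unreachable for sorted exons")


exons = [(100, 200), (300, 400), (500, 600)]
print(locate(250, exons))
```

UTRs and stop signals could be handled in the same way by adding the coding start and end coordinates to the comparison.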