Studna :: Study of upstream DNA regions

Studna :: Study of Nucleic Acids regions

Bioplexity / Martin Saturka

(Studna version: 2.0)

Introduction
Description
Download
Support
License

Introduction

This analysis tool searches for sequence patterns. Tailored scripts make the task easy for the parts of a genome that lie upstream of genes. Such chromosome regions are of great interest since they contain sequences that regulate expression, such as promoters. Specific proteins bind to them and regulate gene expression.

DNA forms a double helix with approximately 10 nucleotides per turn. Thus, DNA-binding proteins can 'see' both contiguous oligonucleotide sequences, usually 6 to 8 base pairs long, and short separated sequences whose distances reflect the turns of the double helix.

The strands of the DNA double helix run antiparallel, and genes are positioned 'randomly' on either strand. Thus, going along a chromosome, the region upstream of a gene lies either before or after the gene location, depending on the orientation of the gene. Some regulatory sequences are located elsewhere, but the most important ones can usually be found upstream of genes (and in the near downstream region, which the Studna tools cover as well).
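The orientation rule above can be illustrated with a short Python sketch. It is not taken from the Studna scripts (which are written in Perl); the function names, the 1-based inclusive coordinates, and the '+' / '-' strand labels are assumptions made for this example only.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A<->T, C<->G)."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]


def upstream_region(chromosome, gene_start, gene_end, strand, length):
    """Return up to `length` bases upstream of a gene, read on the gene's strand.

    gene_start and gene_end are assumed to be 1-based inclusive chromosome
    coordinates with gene_start <= gene_end; strand is '+' or '-'.
    """
    if strand == "+":
        # Upstream lies before the gene when going along the chromosome.
        begin = max(gene_start - 1 - length, 0)
        return chromosome[begin:gene_start - 1]
    # Minus strand: upstream lies after the gene and is read on the other strand.
    region = chromosome[gene_end:gene_end + length]
    return reverse_complement(region)


chromosome = "AAACCCGGGTTTACGT"
print(upstream_region(chromosome, 7, 9, "+", 3))  # bases 4..6 of the chromosome
print(upstream_region(chromosome, 7, 9, "-", 3))  # bases 10..12, reverse complemented
```

For a gene on the minus strand, the upstream region is taken from after the gene and reverse complemented, so that both cases read in the same direction relative to the gene.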

Description

The core of the tool consists of six Perl scripts. First, they produce upstream sequences from available data banks. Second, they select and analyze the requested genes. Third, they help to interpret the analysis results. The scripts are:

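studna-red
- reduction of gene data

studna-green
- growing of chromosomal files with upstream sequences

studna-yellow
- yield of specific parts of upstream sequences

studna-blue
- blueprint of provided sequence groups

studna-view
- result visualization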

The Red script is for reduction of gene data (gene2refseq.gz and gene_info.gz, available at the NCBI site; see the Download section below) into a suitable source of gene information for a selected species, namely an SQLite database.
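As a rough illustration of this reduction step, the Python sketch below filters a gzip-compressed, tab-delimited gene_info stream by taxonomy ID and stores a few columns in SQLite. It is not the Red script itself: the table layout, the function name, and the choice of columns are hypothetical, and the column names (tax_id, GeneID, Symbol, chromosome) are assumed to appear in the file's header line. The sample rows are shortened and serve only as an illustration.

```python
import csv
import gzip
import io
import sqlite3


def load_gene_info(gz_stream, connection, tax_id):
    """Copy the rows of one species from a gene_info stream into an SQLite table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS genes "
        "(gene_id INTEGER PRIMARY KEY, symbol TEXT, chromosome TEXT)"
    )
    with gzip.open(gz_stream, mode="rt", newline="") as text:
        reader = csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
        # The header line is assumed to start with '#', e.g. '#tax_id'.
        header = [name.lstrip("#") for name in next(reader)]
        column = {name: index for index, name in enumerate(header)}
        for row in reader:
            if row[column["tax_id"]] != str(tax_id):
                continue  # keep only the selected species
            connection.execute(
                "INSERT OR REPLACE INTO genes VALUES (?, ?, ?)",
                (int(row[column["GeneID"]]),
                 row[column["Symbol"]],
                 row[column["chromosome"]]),
            )
    connection.commit()


# Illustrative in-memory data instead of the real gene_info.gz download.
sample = (
    "#tax_id\tGeneID\tSymbol\tLocusTag\tSynonyms\tdbXrefs\tchromosome\n"
    "9606\t1\tA1BG\t-\t-\t-\t19\n"
    "9606\t2\tA2M\t-\t-\t-\t12\n"
    "10090\t11287\tPzp\t-\t-\t-\t6\n"
)
stream = io.BytesIO(gzip.compress(sample.encode("ascii")))
db = sqlite3.connect(":memory:")
load_gene_info(stream, db, tax_id=9606)
rows = db.execute("SELECT symbol, chromosome FROM genes ORDER BY gene_id").fetchall()
print(rows)
```

Looking columns up by header name rather than by fixed position keeps the sketch independent of the exact column order in the file.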

The Green script is for growing chromosomal files with upstream sequences. It reads the sequences of assembled chromosomes (see the Download section) and fills the database file with one sequence pair for each gene: a longer upstream sequence (positions -m ... -1) and a shorter downstream sequence (positions 1 ... n). (Remember, there is no position with index 0.)
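The zero-free numbering is easy to get wrong, so here is a small Python sketch of how such positions could be mapped onto a stored sequence pair. The function names are hypothetical, and the sketch makes no claim about which biological landmark position 1 refers to in the Studna database.

```python
def base_at(upstream, downstream, position):
    """Return the base at a relative position in an upstream/downstream pair.

    Positions -len(upstream) .. -1 address the upstream sequence and
    1 .. len(downstream) address the downstream sequence; 0 is not valid.
    """
    if position == 0:
        raise ValueError("there is no position 0")
    if position < 0:
        if -position > len(upstream):
            raise IndexError("position lies before the stored upstream sequence")
        return upstream[position]  # Python's negative indexing matches -m .. -1
    if position > len(downstream):
        raise IndexError("position lies after the stored downstream sequence")
    return downstream[position - 1]


def span_length(first, last):
    """Number of bases from `first` to `last` inclusive, skipping the missing 0."""
    if first == 0 or last == 0 or first > last:
        raise ValueError("positions must be non-zero and ordered")
    length = last - first + 1
    if first < 0 < last:
        length -= 1  # the range crosses the non-existent position 0
    return length


print(base_at("ACGG", "TCAA", -1), base_at("ACGG", "TCAA", 1))
print(span_length(-3, 2))
```

Note that a span from -3 to 2 covers five bases, not six, because position 0 does not exist.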

The Yellow script is for yield of specific parts of specific upstream sequences. It produces FASTA-formatted files, one for each gene group, according to descriptive configuration files. It takes the sequences from the files generated by the Green script, with the help of the output of the Red script.
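For readers unfamiliar with the FASTA format, the following Python sketch shows how one gene group could be serialized. The header content and the line width are assumptions for this example; the headers actually written by the Yellow script are not specified here.

```python
def to_fasta(records, width=60):
    """Format (name, sequence) pairs as FASTA text with wrapped sequence lines."""
    lines = []
    for name, sequence in records:
        lines.append(">" + name)  # header line of one record
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# Hypothetical gene group with two short upstream fragments.
group = [("geneA", "ACGTACGT"), ("geneB", "TTGACA")]
print(to_fasta(group, width=4), end="")
```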

The Blue script is for taking a blueprint of the provided sequence groups. It is the heart of the Studna tool. It counts separated tuple pairs occurring in the provided sequences. Such pairs are an important motif pattern with respect to DNA-binding proteins and hence to the regulation of gene expression; see the Introduction section above. The sequences are taken in groups, one file per group. The counting allows for possible mismatches in the sequence motifs.

Mismatch: Each occurrence of a tuple pair is counted not only in its own counter, but in the counters of similar tuple pairs as well. Similar tuple pairs are those with at most one mismatch in each tuple. Thus, there are three counters for each tuple pair: one for occurrences with no mismatch, one for occurrences with at most one mismatch, and one for occurrences with at most two mismatches (one in each tuple).
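The counting scheme can be sketched in Python as follows. This is an illustration under stated assumptions, not the Blue script's code: the fixed gap between the two tuples, the function names, and the handling of ambiguous bases are hypothetical, and "at most one mismatch" is read here as at most one mismatch in the whole pair.

```python
from collections import Counter

ALPHABET = "ACGT"


def neighbours(word):
    """Return the word itself plus every word that differs at exactly one position."""
    result = [word]
    for index, original in enumerate(word):
        for base in ALPHABET:
            if base != original:
                result.append(word[:index] + base + word[index + 1:])
    return result


def count_tuple_pairs(sequences, size=3, gap=4):
    """Count tuple pairs (two words of `size` bases separated by `gap` bases).

    Returns three counters keyed by (left, right): exact occurrences,
    occurrences with at most one mismatch in the pair, and occurrences
    with at most one mismatch in each tuple.
    """
    exact, max_one, max_two = Counter(), Counter(), Counter()
    span = 2 * size + gap
    for sequence in sequences:
        sequence = sequence.upper()
        for start in range(len(sequence) - span + 1):
            left = sequence[start:start + size]
            right = sequence[start + size + gap:start + span]
            if set(left + right) - set(ALPHABET):
                continue  # skip windows with ambiguous bases such as N
            exact[left, right] += 1
            # Credit the occurrence to every similar pair as well.
            for near_left in neighbours(left):
                for near_right in neighbours(right):
                    mismatches = (near_left != left) + (near_right != right)
                    max_two[near_left, near_right] += 1
                    if mismatches <= 1:
                        max_one[near_left, near_right] += 1
    return exact, max_one, max_two


exact, max_one, max_two = count_tuple_pairs(["ACGTTTTACG"], size=3, gap=4)
print(exact["ACG", "ACG"], max_one["TCG", "ACG"], max_two["TCG", "TCG"])
```

Each occurrence is distributed to the counters of its neighbours, so a lookup for any pair directly gives the number of occurrences within the allowed number of mismatches.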

The View script is for result visualization. The results of the Blue script are tables with columns for the counts with at most zero, one, and two mismatches, and with rows for tuple pairs. The number of rows is 4 raised to the power of the sum of the tuple lengths. Currently, 4+4 and 3+3 tuple pairs are supported. The results are stored in simply formatted tables, so that they can be easily read into, e.g., the R statistics system. Estimates of count probabilities are provided as well. The View script supports understanding of the results and further exploration.
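The table sizes follow directly from the four-letter DNA alphabet. The Python sketch below enumerates all tuple pairs for given tuple lengths; the function name and the row order are assumptions for this example and need not match the order used in the Blue script's output.

```python
from itertools import product


def all_tuple_pairs(left_size, right_size):
    """Yield every (left, right) tuple pair over the alphabet A, C, G, T."""
    for letters in product("ACGT", repeat=left_size + right_size):
        word = "".join(letters)
        yield word[:left_size], word[left_size:]


rows_3_3 = sum(1 for _ in all_tuple_pairs(3, 3))
rows_4_4 = sum(1 for _ in all_tuple_pairs(4, 4))
print(rows_3_3, rows_4_4)  # 4**6 and 4**8
```

So a 3+3 table has 4,096 rows and a 4+4 table has 65,536 rows.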

The tool contains example runs of the scripts. They are named 'studna-color.sh' for the respective colors. Explanations of particular options are given in the run files.

Much research in bioinformatics works on human data. Thus, this tool was tested primarily on freely available human sequence data, which is our main data source. It was also tested on chromosome data of several other species. Do not hesitate to contact us if you encounter any new requirements for processing upstream sequences with this tool.

Download

The complete tool, i.e. scripts, documentation, and example runs, can be downloaded from the www.bioplexity.org site. Have a look at the analysis section of the Bioplexity web site.

Sequenced chromosomes and gene location data can be downloaded from the NCBI site; there are some mirrors as well. H. sapiens chromosome sequences are stored in the /genomes/H_sapiens directory. Gene data are stored in the /gene/DATA directory.

Support

Professional support for this software is available online. More information can be found on the Bioplexity support pages. Research questions can be sent directly to Martin Saturka at the personal address kvutza(at)gmail.com.

License

The author of the Studna tool is Martin Saturka. The code is distributed under the MIT/X license. In short, this means it is freely available: you can both use and distribute this software. The tool is provided as is, without any warranty, and in the hope that it is a useful utility.