LINGuA :: Logic powered Intriguing Notions on Gene microArray data

Bioplexity / Martin Saturka

(LINGuA version: 1.0)

Abstract

This survey presents the LINGuA system, i.e. 'Logic powered Intriguing Notions on Gene uArray data', a bioinformatics tool for gene expression analysis. It contains logic-based methods for data mining on microarray data, i.e. for the search for new notions in gene expression data. LINGuA is open source and available at the Bioplexity site. The main functionality is based on the Enduce library; see its documentation for graphical explanations and descriptions.

* Lingua introduction

The Lingua system is for logic-based data mining on microarray data. With many genomes sequenced, there is a large library of books that we can only partially understand. Microarrays are useful experimental tools for acquiring large amounts of gene expression data, which can help us to understand the language of the genome books. The Lingua system implements an approach intended as a base for exploration tasks on microarray data. Since Lingua is based on logic calculi, it has features not available in systems based solely on statistical ideas. While statistical computations are useful for hypothesis testing, Lingua is tailored to the search for new notions.

The Lingua system is provided as a package for the R statistical system. The library is loaded via library(lingua). The main ideas of the individual parts are described in the chapters below. The Lingua system is under active development.

* Data preparation

The actual data values of a particular microarray depend on the exact instrumental settings during sample preparation and data acquisition. It is necessary to normalize the acquired data to eliminate variations that do not depend on the biological setting.

The normalization is divided into two parts. First, the data belonging to each individual experiment are normalized. This follows standard statistical procedures, to give a solid basis for the subsequent intriguing. It is called as lingua.normalize(vect). Then the data of each single gene are unitized, a step oriented to the subsequent information search and retrieval. The unitization puts the data into the [-1, 1] interval, which is suitable for logic-based intriguing. It is called like lingua.unitize(vect, "normal", 1).
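
The two steps can be sketched in plain R as follows. This is only an illustration of the idea, not the lingua implementation: the helper names normalize_array and unitize_gene are hypothetical, and both the z-scoring used for the per-experiment step and the tanh squashing used for the per-gene step are assumptions.

```r
# Hypothetical helpers; they only illustrate the two-step idea and do
# not reproduce what lingua.normalize or lingua.unitize compute.

# Step 1: per-array normalization (assumed here: centering and scaling).
normalize_array <- function(vect) {
  (vect - mean(vect)) / sd(vect)
}

# Step 2: per-gene mapping into the [-1, 1] interval
# (assumed here: z-score of the gene profile squashed by tanh).
unitize_gene <- function(vect, steepness = 1) {
  z <- (vect - mean(vect)) / sd(vect)
  tanh(steepness * z)
}

# Toy matrix: rows are genes, columns are arrays (experiments).
expr <- matrix(c(2, 4, 6, 8,
                 1, 3, 2, 9,
                 5, 5, 7, 3),
               nrow = 3, byrow = TRUE)

per_array <- apply(expr, 2, normalize_array)        # column-wise
unit_data <- t(apply(per_array, 1, unitize_gene))   # row-wise
```

After the second step every value of unit_data lies inside [-1, 1], whatever the scale of the raw measurements was.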

Combining the data of particular genes into tuples can enhance the search for rules on interactions between genes. Lingua provides methods that produce such combinations as pairs and triplets of data. They are called as lingua.pairs(data, "or") and lingua.triplets(data, "and").
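
A possible reading of such combinations, sketched in base R. The helper combine_pairs is hypothetical, and taking "and" as the minimum and "or" as the maximum of values scaled to [0, 1] is an assumption; lingua.pairs may define the connectives differently.

```r
# Hypothetical helper: combine every pair of gene rows with a fuzzy
# connective. Assumption: "and" is taken as min, "or" as max, on values
# already scaled to [0, 1].
combine_pairs <- function(data, mode = c("and", "or")) {
  mode <- match.arg(mode)
  op <- if (mode == "and") pmin else pmax
  idx <- combn(nrow(data), 2)                 # all row index pairs
  out <- t(apply(idx, 2, function(p) op(data[p[1], ], data[p[2], ])))
  rownames(out) <- apply(idx, 2, function(p)
    paste(rownames(data)[p], collapse = paste0(" ", mode, " ")))
  out
}

genes <- rbind(A = c(0.9, 0.1, 0.6),
               B = c(0.8, 0.7, 0.2),
               C = c(0.1, 0.3, 0.5))

either <- combine_pairs(genes, "or")    # 'at least one of the two genes'
both   <- combine_pairs(genes, "and")   # 'both genes'
```

Each row of the result is a new artificial profile, e.g. "A or B", that can be passed on to the interaction search in place of a single gene.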

* Gene interactions

Interactions between genes are searched for via the contifiers methods. Contifiers are viewed as coincidences of gene (over-/under-)expression between either genes or gene tuples. The gene tuples serve intriguing based on notions like 'both genes A and B', or 'at least one of genes A, B'. Gene tuples can be prepared by the lingua.pairs and lingua.triplets methods. The search is usually run on gene clusters instead of the particular genes themselves, to reduce the computational resources needed. See below for the clustering part.

Gene interaction notions are pairs of genes (or of gene pairs) that satisfy some conditions. Working with the notions can be viewed as positioning their value pairs in the [0, 1]*[0, 1] square: the coordinates of each point are the (over-/under-)expression values of the genes of the respective notion pair. The greater the values of the coordinates, the greater the occurrences of the respective genes. The contifiers methods are contifiers.thresholds(thresh) and contifiers.modes(modes) for settings, and contifiers.aggregate(vecx, vecy) for the search itself.

Gene interactions can be directional or mutual. The directional ones, say from gene A to gene B, are for notions like: 'when an event occurs on gene A, then an event occurs on gene B', and not necessarily the other way round. In other words: 'most of the events with high occurrences of A have high occurrences of B as well'.

Mutual interactions, e.g. between genes A and B, are based on notions like: 'there are no or just a few events that have occurrences on just one of the genes A, B'. The notions can be on gene pairs as well, not only on single genes as shown above. It is easy to see that the clusters (of the 'Notion clustering' part of the documentation) are cases of mutual gene interactions, and that the symmetry halves are cases of directional gene interactions.
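
The difference between the two kinds of interaction can be illustrated with a small base R sketch. The functions directional and mutual are hypothetical and are not the contifiers methods; the assumption is that the co-occurrence of two genes in one experiment is the minimum of their [0, 1] values.

```r
# Hypothetical evaluations on expression values scaled to [0, 1].
# Assumption: the co-occurrence of two genes in one experiment is
# taken as the minimum of their values; lingua's contifiers may differ.

# Directional: share of the occurrence of 'a' that is accompanied by 'b'.
directional <- function(a, b) sum(pmin(a, b)) / sum(a)

# Mutual: co-occurrence relative to the occurrence of at least one gene.
mutual <- function(a, b) sum(pmin(a, b)) / sum(pmax(a, b))

# Gene A is high only where gene B is high, but B is also high elsewhere.
a <- c(1, 1, 0, 0, 0)
b <- c(1, 1, 1, 1, 0)

directional(a, b)   # 1    (events on A come with events on B)
directional(b, a)   # 0.5  (not the other way round)
mutual(a, b)        # 0.5
```

The directional value is asymmetric in its arguments, while the mutual one is symmetric and is lowered by every event that occurs on only one of the genes.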

* Notion clustering

Data clustering is a frequent task in bioinformatics and in data mining generally. Two main properties have to be chosen for a clustering task: first, the measure of distance between genes, and second, the clustering layout itself. Distances are used to distinguish similar from dissimilar genes. A suitable clustering layout has to be chosen to highlight the contours of the features we are concerned with.

The mutual interaction quantifiers introduced above are used as a similarity measure. They are fast to evaluate and can be used for repeated evaluations. The clustions part of lingua is prepared for the task. It uses a k-means clustering scheme and can be called via clustions.thresholds(thresh) and clustions.maxcycles(maxcyc) for settings, and via clustions.cluster(data.cases, ini.centers) and clustions.cluster2(data.cases, ini.centers) for the clustering itself. There are two modes with respect to the input data. The first is for data from the [0, 1] interval. The second is for [-1, 1]-valued data, which are taken as pairs of values.
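
A k-means style loop driven by a similarity rather than a Euclidean distance can be sketched as below, for the [0, 1] case only. The function cluster_by_similarity is hypothetical and is not clustions.cluster; the mutual similarity (sum of minima over sum of maxima) and the update of centers by plain means are assumptions.

```r
# Hypothetical sketch: k-means style loop where "nearest" means the
# highest mutual similarity (assumed: sum of minima over sum of maxima).
mutual <- function(a, b) sum(pmin(a, b)) / sum(pmax(a, b))

cluster_by_similarity <- function(cases, centers, maxcycles = 20) {
  assign_old <- rep(0L, nrow(cases))
  for (cycle in seq_len(maxcycles)) {
    # Similarity of every case (row) to every center (column).
    sim <- sapply(seq_len(nrow(centers)), function(k)
      apply(cases, 1, mutual, b = centers[k, ]))
    assign_new <- max.col(sim, ties.method = "first")
    if (identical(assign_new, assign_old)) break
    # Move each non-empty center to the mean of its members.
    for (k in unique(assign_new)) {
      centers[k, ] <- colMeans(cases[assign_new == k, , drop = FALSE])
    }
    assign_old <- assign_new
  }
  list(cluster = assign_new, centers = centers)
}

cases <- rbind(c(0.9, 0.8, 0.1, 0.0),
               c(1.0, 0.7, 0.0, 0.1),
               c(0.1, 0.0, 0.9, 1.0),
               c(0.0, 0.2, 0.8, 0.9))
centers0 <- cases[c(1, 3), ]            # initial centers
result <- cluster_by_similarity(cases, centers0)
result$cluster   # 1 1 2 2
```

The loop stops as soon as the assignment no longer changes or the cycle limit is reached, which mirrors the role a maximum cycle setting plays in any k-means scheme.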

For a group of genes under a symmetrical similarity interaction, the DNA regions around their positions along the chromosomes can be searched for regulatory sequences (with somewhat better hope). The StuDNA tool is prepared for such tasks. It can be downloaded from the Bioplexity site.

* State inference

Once expression interactions have been evaluated on some genes, the results can be used for gene expression predictions. If we know, e.g., that over-expression of genes A and B triggers under-expression of a gene C, we can use this knowledge in cases where we know the expression values of genes A and B, but not that of gene C.

This kind of data processing is covered by the dinorms part of lingua. It takes input data from the [-1, 1] interval and outputs the respective predictions. It can be called via dinorms.thresholds(thresh) and dinorms.modes(modes) for settings, and via dinorms.aggregate(vect) and dinorms.boost(vect, powers, count, limit) for the processing itself. The last method is a boost-enhanced version of dinorms usage.

Since the relations can lead to both higher and lower values, methods for bidirectional evaluation are needed here. The evaluation has to be done as a two-step process to be updatable. The two-step processing is done internally, hence users need not take care of this peculiarity. It is only reflected in the dinorms.aggregate output, which contains three values: the first two are results of the first processing step, and the last value is the overall result.
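
One way to picture a two-step bidirectional evaluation with a three-valued result is the base R sketch below. It is purely illustrative: the function aggregate_signed is hypothetical, and both the separate averaging of upward and downward evidence and their combination by difference are assumptions, not the formulas used by dinorms.aggregate.

```r
# Hypothetical two-step evaluation of signed evidence from [-1, 1].
# Assumption: step one aggregates upward and downward evidence
# separately, step two combines them; dinorms.aggregate may differ.
aggregate_signed <- function(vect) {
  up   <- mean(pmax(vect, 0))    # evidence for over-expression
  down <- mean(pmax(-vect, 0))   # evidence for under-expression
  c(up = up, down = down, overall = up - down)
}

res <- aggregate_signed(c(0.8, 0.4, -0.2, 0))
res   # up = 0.3, down = 0.05, overall = 0.25
```

Keeping the two partial results next to the overall one means that further evidence can be folded into each direction separately before they are combined again.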

* Usage background

From a theoretical point of view, the interactions are generalized quantifiers of applied logic. The considered quantifiers are something like averages of the respective interaction evaluations over particular experiments. The case of gene (over-/under-)expression values, i.e. values inside the [0, 1] interval, is a generalization of the fourfold matrix that is used for crisp data. The [0, 1]*[0, 1] square contains the fourfold matrix cases as events placed at its corner points.
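
The relation between the unit square and the fourfold matrix can be made concrete with a short base R sketch. The function fourfold is hypothetical, and using the product as the joint degree of two values is an assumption; other conjunctions, e.g. the minimum, are equally possible.

```r
# Hypothetical generalization of the fourfold (2x2) matrix to values
# from [0, 1]. Assumption: the joint degree of two values is their
# product.
fourfold <- function(x, y) {
  matrix(c(sum(x * y),       sum(x * (1 - y)),
           sum((1 - x) * y), sum((1 - x) * (1 - y))),
         nrow = 2, byrow = TRUE,
         dimnames = list(c("x", "not x"), c("y", "not y")))
}

# Crisp data sit on the corners of the unit square, so the result is
# the ordinary contingency table of counts.
crisp <- fourfold(c(1, 1, 0, 0, 1), c(1, 0, 0, 1, 1))

# Graded data spread each experiment over all four cells.
graded <- fourfold(c(0.9, 0.2), c(0.8, 0.5))
```

With the product as conjunction the four cells always sum to the number of experiments, so the graded matrix can be read in the same way as the crisp one.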

The interaction evaluations used in the Lingua system are based on formulas from exploratory, i.e. descriptive, statistics that conform to the logical foundations of notion intriguing. There are other formulas from the area of hypothesis testing that could be used as well, especially those of non-parametric methods. However, they are rather slow to compute. Still, they can be used for further processing of the intriguing results.

Many researchers put their microarray data on the web. Gene expression data can be found via web search engines, like Google, Altavista, Yahoo, and others. Some data, including microarray data, are available e.g. at the Broad Institute (MIT/Harvard), the Lymphoma and NCI60 genomic resources (Stanford), the Eisen lab (LBNL/Berkeley), the Drosophila web (Berkeley), and the web and ftp sites of the European Bioinformatics Institute (EBI).