GeneIntelligence (GI) is a fully automatic tool for exploratory data analysis. The main idea behind GI is to identify and rank associations between measurements (gene expression or methylation levels) and the phenotype of interest (for example, cancer grade or treatment response). Unlike methods such as linear regression, our algorithms make no direct assumptions about the input data. They are therefore not biased when, for example, the assumption of normality is not met.
To work with our software, you need to sign up and then prepare your data in a specific but simple format (for details about supported data types, see the data sources section). The input to GI must consist of two files in CSV (comma-separated values) format. The first is a dataset containing samples and measurements (Fig. 1). Please note that the orientation of the dataset matrix is crucial for correct data interpretation.
Fig 1: Thumbnail of the example dataset. Rows contain samples; columns contain measurements, e.g. expression or methylation levels per sample.
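Many exports store markers in rows and samples in columns, which is the opposite of the orientation required here. The following is a minimal sketch of how such a table could be transposed before upload, using only the Python standard library. The marker names, sample IDs, and the "ID" header of the first column are hypothetical and are not prescribed by GI.

```python
import csv
import io


def transpose_csv(text: str) -> str:
    """Swap rows and columns of a rectangular CSV table given as text."""
    rows = list(csv.reader(io.StringIO(text)))
    widths = {len(row) for row in rows}
    if len(widths) != 1:
        raise ValueError("CSV is empty or not rectangular; cannot transpose safely")
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(zip(*rows))
    return out.getvalue()


# Hypothetical export with markers in rows and samples in columns.
markers_in_rows = (
    "ID,S1,S2,S3\n"
    "cg0001,0.81,0.79,0.12\n"
    "cg0002,0.05,0.07,0.64\n"
)

# After transposing, each row is one sample and each column is one marker.
samples_in_rows = transpose_csv(markers_in_rows)
print(samples_in_rows)
```

For large datasets a dataframe library may be more convenient, but the principle is the same: one row per sample, one column per measurement.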
The second required file is the phenotype-of-interest (POI) file (Fig. 2). It should include the sample ID (the same as in the dataset file) and information about the phenotype. Additionally, the POI file may contain covariates (e.g. age, sex, or BMI) to adjust the analysis and, for highly heterogeneous tissue (such as blood), information about cell proportions (for details, see the cell fraction correction section).
Fig 2: Thumbnail of the example POI file. Rows contain samples; columns contain information about the sample phenotype. The covariate column contains additional information about the sample (1 = male, 0 = female).
When the files are ready, sign in to your account, navigate to the new analysis section, enter the basic information, and then press the next button (Fig. 3).
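Because the sample IDs in the POI file must match those in the dataset file, it can help to compare them locally before uploading. The sketch below assumes that the sample ID is stored in the first column of both files; that layout, the column names, and the values are illustrative assumptions, not a description of GI's own validation.

```python
import csv
import io


def sample_ids(csv_text: str) -> list[str]:
    """Return first-column values (assumed to be sample IDs), skipping the header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return [row[0] for row in rows[1:] if row]


def compare_ids(dataset_text: str, poi_text: str) -> dict[str, set[str]]:
    """Report sample IDs that appear in only one of the two files."""
    dataset = set(sample_ids(dataset_text))
    poi = set(sample_ids(poi_text))
    return {"missing_in_poi": dataset - poi, "missing_in_dataset": poi - dataset}


# Hypothetical file contents.
dataset_csv = "ID,cg0001,cg0002\nS1,0.81,0.05\nS2,0.79,0.07\nS3,0.12,0.64\n"
poi_csv = "ID,POI,sex\nS1,tumor,1\nS2,tumor,0\nS4,normal,1\n"

mismatches = compare_ids(dataset_csv, poi_csv)
print(mismatches)
```

Both sets should be empty before the files are uploaded.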
Fig 3: In the new analysis section, set the analysis name (all analyses are stored in the analysis list section) and the module type (EPIC/450K for methylation data, RNA-seq for expression data), and turn correction for cell fraction proportions on or off (this option requires additional data in the POI file).
You should now see a new window (Fig. 4) where you can upload the dataset and POI files. When the file transfer finishes, press the run button.
Fig 4: In this section, you can select and upload files to our infrastructure. When the file transfer finishes, the run button will become active.
The analysis may take up to several hours, depending on the dataset size. When it finishes, we will notify you by email, and the analytical report will be available in the analysis section (Fig. 5).
Fig 5: Here you can search, sort, archive, and view prepared analytical reports.
If the provided files do not meet the input requirements described above, our validation system will stop the analysis and inform you of the reason via email (email@example.com).
For now, GI software can handle data from EPIC/450K Illumina BeadChips containing methylation levels expressed as beta-values. Future versions of GI software will provide modules to work with gene expression or mixed-type / custom data. If you have a specific dataset currently not supported by the GI online version, please contact us.
To adjust the analysis, users may provide covariates in the POI file; these will be taken into account when evaluating marker importance.
For complex tissue such as blood, measurements need to be adjusted for cell fraction proportions, because different cell types vary in their gene-specific methylation (or expression) levels. As a result, variation in cell proportions between samples may produce differences in the observed measurements and therefore false positive or false negative results.
This method requires cell fraction data in the POI file, in separate columns with the CF_ prefix. Please note that this step is optional. If you do not know how to predict cell fraction proportions from expression or methylation data, please contact us.
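A small numerical illustration of this effect is given below. It assumes a common simplification in which the bulk measurement is the proportion-weighted average of the values in the pure cell types; the cell type names and numbers are hypothetical, and this is not a description of GI's correction method.

```python
# Hypothetical beta-values of one CpG site in two pure cell types.
beta_by_cell_type = {"neutrophils": 0.90, "lymphocytes": 0.10}


def bulk_beta(proportions: dict[str, float]) -> float:
    """Proportion-weighted average of the cell-type-specific beta-values."""
    return sum(proportions[cell] * beta for cell, beta in beta_by_cell_type.items())


# Two samples with identical methylation inside each cell type,
# differing only in their cell composition.
sample_a = {"neutrophils": 0.70, "lymphocytes": 0.30}
sample_b = {"neutrophils": 0.50, "lymphocytes": 0.50}

beta_a = bulk_beta(sample_a)
beta_b = bulk_beta(sample_b)
print(f"sample A: {beta_a:.2f}, sample B: {beta_b:.2f}")
```

The two bulk values differ even though nothing changed within either cell type, which is the kind of spurious difference that cell fraction correction is meant to remove.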
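As a sketch of the expected layout, the snippet below reads the columns whose names start with CF_ from a POI file. The specific column names (CF_Neu, CF_Lym), the position of the sample ID in the first column, and the values are assumptions made for illustration only.

```python
import csv
import io


def read_cell_fractions(poi_text: str) -> dict[str, dict[str, float]]:
    """Return {sample ID: {CF_ column: fraction}} for all CF_-prefixed columns."""
    reader = csv.DictReader(io.StringIO(poi_text))
    fields = list(reader.fieldnames or [])
    if not fields:
        raise ValueError("POI file has no header row")
    id_column = fields[0]  # assumption: the sample ID is the first column
    cf_columns = [name for name in fields if name.startswith("CF_")]
    return {
        row[id_column]: {name: float(row[name]) for name in cf_columns}
        for row in reader
    }


# Hypothetical POI file with two cell fraction columns.
poi_csv = (
    "ID,POI,sex,CF_Neu,CF_Lym\n"
    "S1,tumor,1,0.6,0.4\n"
    "S2,normal,0,0.55,0.45\n"
)

fractions = read_cell_fractions(poi_csv)
print(fractions)
```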
The tabular report provides numerical parameters useful for estimating marker importance. Marker is a variable name from the dataset file. Adj. p-value displays the p-value adjusted for the provided covariates when biased is equal to False. Biased indicates that this specific variable could not be adjusted; if the POI file does not contain any additional variables, or biased == True, Adj. p-value is calculated using a parametric or nonparametric test for differences among means. Separation indicates how well a marker distinguishes the studied groups (separation = 1 means perfect linear separation). Factor is a metric for ordering markers (the strongest have the highest factor value). Please note that this value is study-specific and cannot be directly compared between analyses. In multiclass cases, additional post-hoc tests are performed for each marker.
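One way to work with such a table is to keep markers below a chosen significance threshold and order them by factor. The sketch below is an illustration under assumptions: the rows are made up, the dictionary keys only mirror the column names described here, and the 0.05 threshold is an arbitrary example rather than a GI default.

```python
# Hypothetical report rows; keys mirror the report columns described in the text.
report_rows = [
    {"Marker": "cg0001", "Adj. p-value": 0.0004, "Biased": False, "Separation": 0.95, "Factor": 8.1},
    {"Marker": "cg0002", "Adj. p-value": 0.2000, "Biased": False, "Separation": 0.55, "Factor": 1.2},
    {"Marker": "cg0003", "Adj. p-value": 0.0100, "Biased": True, "Separation": 1.00, "Factor": 12.3},
]


def rank_markers(rows: list[dict], alpha: float = 0.05) -> list[dict]:
    """Keep markers with Adj. p-value below alpha, ordered by Factor (strongest first)."""
    significant = [row for row in rows if row["Adj. p-value"] < alpha]
    return sorted(significant, key=lambda row: row["Factor"], reverse=True)


top_markers = rank_markers(report_rows)
print([row["Marker"] for row in top_markers])
```

Because the factor is study-specific, a ranking like this is only meaningful within a single analysis.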
A cluster map is a heatmap ordered by an unsupervised clustering algorithm. This type of analysis helps reveal patterns specific to each POI group.
Principal component analysis (PCA) and t-SNE project high-dimensional data into a lower-dimensional space.
Trees are a graphical representation showing, as simply as possible, how to distinguish POI groups using selected markers.
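To make the idea of projection concrete, the following is a minimal PCA sketch based on the singular value decomposition in NumPy. It is not GI's implementation; the simulated data, group sizes, and function name are assumptions chosen for illustration, and t-SNE is not shown.

```python
import numpy as np


def pca_project(data, n_components=2):
    """Project samples (rows) onto the first principal components via SVD."""
    matrix = np.asarray(data, dtype=float)
    centered = matrix - matrix.mean(axis=0)  # center each measurement
    _, singular_values, components = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ components[:n_components].T
    explained = singular_values**2 / np.sum(singular_values**2)
    return scores, explained[:n_components]


# Simulated dataset: 20 samples x 50 measurements, two groups of 10 samples.
rng = np.random.default_rng(0)
group = np.repeat([0, 1], 10)
data = rng.normal(0.5, 0.05, size=(20, 50))
data[group == 1, :5] += 0.3  # five measurements differ between the groups

scores, explained = pca_project(data)
print(scores.shape, explained)
```

In this simulated case the first component captures the group difference, so plotting the two columns of scores would show the groups as separate clouds of points.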
If you prefer a PDF version of this documentation, you can download it here.