# Manual

### Step 1: Choose an RBP

We provide a pre-defined list of 40 RBPs. Please choose one of them from the drop-down list.

### Step 2: Upload target / background gene lists (optional)

Please upload the target / background gene lists to be used in the follow-up statistical analyses. Ensembl gene IDs are expected; you can use DAVID’s gene ID conversion tool if necessary. If you prefer not to upload your own gene lists, we will use our pre-defined target / background gene lists. Please see below for details on how we define these lists.

### Step 3: Upload gene expression data (optional)

If you have genome-wide gene expression data for a large set of samples, you can upload it at this step and the statistical analyses will be performed with this data. The data should be formatted as a tab-delimited file in which each row is a gene (identified by its Ensembl gene ID) and each column is a sample or condition.

Note that this is not mandatory as we have already compiled several large gene expression datasets.
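To check a file before uploading it, the expected layout can be sketched in Python as follows. This is only an illustration: the header row, the `gene_id` column label, the `ENSG` prefix check and the function name `read_expression_table` are assumptions made for this example and are not requirements stated by the tool.

```python
import csv
import io

def read_expression_table(handle):
    """Parse a tab-delimited matrix: one row per gene, one column per sample.

    Assumption: the first row is a header whose first cell labels the gene-ID
    column and whose remaining cells are sample names.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        gene_id, values = row[0], row[1:]
        # Assumption: human Ensembl gene IDs, which start with "ENSG".
        if not gene_id.startswith("ENSG"):
            raise ValueError(f"not an Ensembl gene ID: {gene_id!r}")
        if len(values) != len(samples):
            raise ValueError(f"{gene_id}: expected {len(samples)} values")
        table[gene_id] = [float(v) for v in values]
    return samples, table

# Made-up two-gene, two-sample file used only to show the layout.
example = (
    "gene_id\tsample_1\tsample_2\n"
    "ENSG00000000003\t12.5\t8.1\n"
    "ENSG00000000005\t0.0\t0.4\n"
)
samples, table = read_expression_table(io.StringIO(example))
print(samples, table)
```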

### Step 4: Enter your email (optional)

If you enter your email, we will send you a link to your results.

### Results

Once the submit button is clicked, a table is displayed where each row is a lncRNA. The following types of information are displayed:
• number of non-overlapping motif occurrences in the entire lncRNA sequence
• number of non-overlapping motif occurrences within the CLIP peaks that are located in lncRNAs
• log-odds enrichment score of motif frequencies (as in [1])
• dispersity score to measure the clustering of motif occurrences (as in [2])
• number of CLIP peaks
• number of eCLIP peaks
• number of CLIPdb peaks
• median and maximum expression value across the samples available in the GTEx v7, E-MTAB-2706 and E-MTAB-2770 datasets
• consensus score
• significance of correlation analysis (a tick is inserted if the analysis results in a p-value < 0.05)
• significance of regression analysis

Consensus score: the consensus score is calculated by counting how many of the constraints listed below are satisfied:
• LOD score > 1.6
• dispersity score < 36
• existence of at least one motif in an eCLIP or CLIPdb peak
• maximum expression > 5 TPM in at least one dataset
• correlation analysis results in a p-value < 0.05 in at least one dataset
• regression analysis results in a p-value < 0.05 in at least one dataset

Entries of the table are highlighted if they satisfy one of the constraints listed above.
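
The counting rule for the consensus score can be sketched as below. The thresholds are the ones listed above; the function name, the argument names and the example values are hypothetical and chosen only for illustration.

```python
def consensus_score(lod, dispersity, motifs_in_clip_peaks,
                    max_tpm_by_dataset, correlation_pvalues, regression_pvalues):
    """Count how many of the six constraints are satisfied (result is 0..6)."""
    checks = [
        lod > 1.6,
        dispersity < 36,
        motifs_in_clip_peaks >= 1,                   # motif in an eCLIP or CLIPdb peak
        any(v > 5 for v in max_tpm_by_dataset),      # > 5 TPM in at least one dataset
        any(p < 0.05 for p in correlation_pvalues),  # significant in at least one dataset
        any(p < 0.05 for p in regression_pvalues),
    ]
    return sum(checks)

# Made-up values: every constraint except the regression one is satisfied.
score = consensus_score(
    lod=2.1,
    dispersity=20.0,
    motifs_in_clip_peaks=3,
    max_tpm_by_dataset=[1.2, 7.5, 0.3],
    correlation_pvalues=[0.2, 0.01, 0.6],
    regression_pvalues=[0.3, 0.4, 0.5],
)
print(score)  # 5
```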

Graph: the occurrences of k-mers are shown with vertical bars within the lncRNA sequence.
Genome viewer: once the eye button is clicked, an integrated genome viewer is displayed for investigating the lncRNA region further. In addition to the gene and sequence tracks, this viewer displays the motifs and CLIP peaks as horizontal bars.

### Analysis

For the analysis section, the user first needs to select one of the gene expression datasets: GTEx, E-MTAB-2706, E-MTAB-2770 or the dataset that they uploaded. In addition to displaying the expression values of the lncRNA and the RBP across the samples of the selected dataset, the following analyses are performed:

#### Correlation between lncRNA expression and target / background genes of the RBP

Using the selected dataset, we calculate the Spearman correlation coefficient between the expression values of the lncRNA and the expression values of each gene in the target / background set. We compare the correlation values of target and background genes with the Wilcoxon rank-sum test and display them in a box plot.

If the lncRNA of interest acts as a sponge for an RBP that stabilizes its targets, we should see lower correlation values with target genes as the RBP activity would decrease with increased expression of the lncRNA. On the other hand, if the RBP de-stabilizes its targets, correlation values should be higher for target genes.
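
The correlation analysis can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the RBPSponge implementation: the rank-sum p-value uses the normal approximation without tie or continuity correction, all expression values are made up, and the function names are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def ranksum_pvalue(a, b):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, no corrections)."""
    ranks = average_ranks(list(a) + list(b))
    n1, n2 = len(a), len(b)
    w = sum(ranks[:n1])  # rank sum of the first group
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))

# Made-up expression values across five samples.
lncrna = [1.0, 2.0, 3.0, 4.0, 5.0]
targets = {"target_a": [5.0, 4.0, 3.0, 2.0, 1.0], "target_b": [4.0, 5.0, 3.0, 1.0, 2.0]}
background = {"bg_a": [1.0, 2.0, 3.0, 4.0, 5.0], "bg_b": [2.0, 1.0, 3.0, 5.0, 4.0]}

target_rho = [spearman(lncrna, v) for v in targets.values()]
background_rho = [spearman(lncrna, v) for v in background.values()]
p_value = ranksum_pvalue(target_rho, background_rho)
print(target_rho, background_rho, p_value)
```

In this toy example the target genes correlate negatively with the lncRNA, which is the pattern described above for a sponged RBP that stabilizes its targets. With only two genes per set, the p-value is not meaningful.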

#### Prediction of target gene expression with and without lncRNA expression

We perform a simple linear regression analysis to predict target gene expression in two settings: in one, only RBP expression is used as a feature; in the other, both RBP and lncRNA expression are used. We calculate the Spearman correlation coefficient between actual and predicted expression values of target genes on held-out data using a 10-fold cross-validation scheme. If the lncRNA of interest acts as a sponge for the RBP, we expect to see improved predictive performance when lncRNA expression is included. We test the significance of the improvement using the Wilcoxon rank-sum test and the likelihood ratio test. We calculate the percentage of genes for which the likelihood ratio test results in a p-value < 0.01 in the target and background sets. These percentages are displayed on the plot along with the Wilcoxon rank-sum p-value (e.g. 46% vs 33%).
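
For a single gene, the likelihood ratio comparison between the two regression models can be sketched as below. The sketch assumes ordinary least squares with Gaussian errors fitted on all samples, so it omits the cross-validation step; the function names and the data are hypothetical, and the actual procedure used by the tool may differ in detail.

```python
import math

def _residuals(y, x):
    """Residuals of the least-squares fit y ~ intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]

def likelihood_ratio_pvalue(target, rbp, lncrna):
    """Compare target ~ RBP against target ~ RBP + lncRNA.

    The residual sum of squares of the larger model is obtained by regressing
    both the target and the lncRNA on the RBP first (Frisch-Waugh-Lovell).
    Fails if either model fits the data perfectly or the lncRNA is an exact
    linear function of the RBP.
    """
    n = len(target)
    res_y = _residuals(target, rbp)  # residuals of the RBP-only model
    res_l = _residuals(lncrna, rbp)  # lncRNA signal not explained by the RBP
    rss_reduced = sum(r * r for r in res_y)
    gain = sum(a * b for a, b in zip(res_y, res_l)) ** 2 / sum(r * r for r in res_l)
    rss_full = rss_reduced - gain
    stat = n * math.log(rss_reduced / rss_full)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2))

# Made-up data in which the lncRNA explains most of what the RBP leaves unexplained.
rbp = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
lncrna = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]
target = [0.1, 2.9, 2.05, 4.95, 4.1, 6.9, 6.05, 8.95]
p_value = likelihood_ratio_pvalue(target, rbp, lncrna)
print(p_value < 0.01)  # True
```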

#### Expression changes upon the knockdown of lncRNA

For the lncRNAs with genome-wide knockdown datasets, we compare the distribution of expression changes of target and background genes with a cumulative distribution function (CDF) plot. Here, because lncRNA activity is minimized, we expect to see increased RBP activity and a more pronounced effect (either stabilizing or de-stabilizing) on the expression of target genes. We test the significance of the difference between the two distributions with the Wilcoxon rank-sum test.
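
The curves in such a plot are empirical CDFs, which can be sketched as follows. The log fold change values are made up for illustration and `ecdf` is a hypothetical helper name.

```python
from bisect import bisect_right

def ecdf(values):
    """Return a function F such that F(x) is the fraction of values <= x."""
    ordered = sorted(values)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n

# Made-up log fold changes after lncRNA knockdown.
target_lfc = [-1.2, -0.8, -0.5, 0.1]
background_lfc = [-0.3, 0.0, 0.2, 0.4]

f_target = ecdf(target_lfc)
f_background = ecdf(background_lfc)
print(f_target(-0.4), f_background(-0.4))  # 0.75 0.0
```

A target curve that lies to the left of the background curve, as in this toy example, indicates that target genes are shifted towards lower expression after the knockdown.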

#### Defining target / background sets

Target and background gene sets of RBPs are needed for the statistical analyses implemented by RBPSponge. The user can either upload these gene sets or use our pre-defined gene lists. To define these lists of genes for each RBP, gene annotation files are downloaded for the GRCh37 assembly from Ensembl (Release 87) and the longest 3’UTR isoform is determined for each gene. All 3’UTRs are scanned with the set of k-mers that are determined for each RBP. 3’UTRs are also intersected with CLIP peaks to determine overlaps, as the presence of a CLIP peak strongly suggests that the region is bound by the RBP of interest.

Data from shRNA knockdown assays are downloaded from the ENCODE project (see Supplementary Table 1 for the complete list). For these assays, log fold changes (LFCs) are calculated using the DESeq2 method (Love et al., 2014). For ELAVL1, the knockdown dataset from Mukherjee et al. (2011) is used, as knockdown assay data for ELAVL1 were not available in the ENCODE project.

We defined the target set as those genes with:
• at least one CLIP peak.
• at least one occurrence of one of the top 3 scoring k-mers within a CLIP peak.
• an LFC value that has an adjusted p-value < 0.05 as calculated by DESeq2 for knockdown datasets.

We defined the background set as those genes with:
• no CLIP peak.
• no occurrence of top 3 scoring k-mers.
• no significant LFC value in knockdown datasets.
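
The two definitions can be sketched as a simple classification rule. The sketch assumes that all three criteria of a set must hold at the same time, and that "significant" means an adjusted p-value < 0.05 in both definitions; the function and argument names are hypothetical.

```python
def classify_gene(n_clip_peaks, n_top_kmers_in_peaks, n_top_kmers_in_utr, knockdown_padj):
    """Return "target", "background" or None for one gene's 3'UTR.

    Assumption: the three criteria of each set are combined with a logical AND.
    """
    significant = knockdown_padj < 0.05
    if n_clip_peaks >= 1 and n_top_kmers_in_peaks >= 1 and significant:
        return "target"
    if n_clip_peaks == 0 and n_top_kmers_in_utr == 0 and not significant:
        return "background"
    return None  # the gene belongs to neither set

print(classify_gene(2, 1, 3, 0.01))  # target
print(classify_gene(0, 0, 0, 0.60))  # background
print(classify_gene(1, 0, 2, 0.50))  # None
```

Under this reading, genes that meet only some of the criteria are left out of both sets.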

#### Log odds score

To obtain the log-odds score for a lncRNA/RBP pair, we calculate the number of motif occurrences within each sliding window (of size w) of the lncRNA sequence.

Let $$n_{t,w,i}$$ denote the number of occurrences of binding motifs within the window that starts at position i of lncRNA t. Then $$n^{max}_{t,w}$$ is defined as $$\max\left(n_{t,w,1}, n_{t,w,2}, \ldots, n_{t,w,l}\right)$$, where $$l$$ is the starting position of the last sliding window within the lncRNA. This value is then normalized by the average maximum number of occurrences of the same set of binding motifs across all lncRNAs:

$$LOD_{t,w}=\frac{n^{max}_{t,w}}{\left(\sum _i\:n^{max}_{i,w}\:\right)/N}$$

where $$N$$ is the number of all lncRNAs and the sum runs over all lncRNAs. The LOD score is calculated with varying window sizes (from 50 nt to 1000 nt in steps of 50 nt) and the window size with the largest LOD score is reported. LncRNAs with high LOD scores are enriched for binding sites of the RBP of interest compared to other lncRNAs.
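
For one window size, the calculation can be sketched as below. The sketch assumes that motif occurrences are counted without overlaps and that a sequence shorter than the window is treated as a single window. The k-mer, the sequences and the function names are placeholders chosen for illustration.

```python
def count_non_overlapping(sequence, kmers):
    """Count motif hits from left to right, jumping past each hit."""
    count, i = 0, 0
    while i < len(sequence):
        hit = next((k for k in kmers if sequence.startswith(k, i)), None)
        if hit:
            count += 1
            i += len(hit)
        else:
            i += 1
    return count

def max_window_count(sequence, kmers, w):
    """n^max_{t,w}: the largest motif count over all windows of size w."""
    last_start = max(len(sequence) - w, 0)
    return max(count_non_overlapping(sequence[i:i + w], kmers)
               for i in range(last_start + 1))

def lod_scores(sequences, kmers, w):
    """Divide each n^max_{t,w} by its average over all lncRNAs.

    Raises ZeroDivisionError if no sequence contains a motif.
    """
    n_max = {name: max_window_count(seq, kmers, w) for name, seq in sequences.items()}
    mean_n_max = sum(n_max.values()) / len(n_max)
    return {name: n / mean_n_max for name, n in n_max.items()}

# Toy sequences and a placeholder k-mer; real windows are 50 to 1000 nt long.
sequences = {
    "lnc_a": "UGUAUGUAAAAA",
    "lnc_b": "AAAAUGUAAAAA",
    "lnc_c": "AAAAAAAAAAAA",
}
scores = lod_scores(sequences, kmers=["UGUA"], w=8)
print(scores)  # {'lnc_a': 2.0, 'lnc_b': 1.0, 'lnc_c': 0.0}
```

Repeating this for each window size in `range(50, 1001, 50)` and keeping the largest value per lncRNA gives the reported score described above.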

#### Dispersity score

The second metric, the dispersity score, evaluates the clustering of motif occurrences within the sequence. To calculate the dispersity score of a lncRNA/RBP pair, we build a vector of normalized $$n^{max}_{t,w}$$ values across all window sizes:

$$x_t=\left(\frac{50}{n^{max}_{t,50}},\:\frac{100}{n^{max}_{t,100}},\:\ldots,\:\frac{1000}{n^{max}_{t,1000}}\right)$$

Then, the dispersity score for lncRNA t is calculated as the standard deviation of the vector $$x_t$$. Small dispersity scores correspond to an even distribution of motif occurrences across the sequence.
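
A minimal sketch of this calculation is given below. It assumes the population standard deviation (the text does not say whether the population or sample formula is used) and at least one motif occurrence for every window size. The function name and the input values are hypothetical.

```python
import statistics

def dispersity_score(n_max_by_window):
    """Standard deviation of w / n^max_{t,w} over the window sizes w.

    Assumption: population standard deviation, and n^max_{t,w} >= 1 for all w.
    """
    x = [w / n for w, n in sorted(n_max_by_window.items())]
    return statistics.pstdev(x)

# Made-up n^max values for three window sizes.
evenly_spread = {50: 1, 100: 2, 150: 3}  # counts grow with the window size
clustered = {50: 2, 100: 2, 150: 2}      # larger windows add no new motifs

print(dispersity_score(evenly_spread))        # 0.0
print(round(dispersity_score(clustered), 2))  # 20.41
```

In the toy input, motifs that keep appearing as the window grows give a constant ratio and a score of zero, whereas a single tight cluster gives a larger score.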