[Funcoup] GIN and DOM

Andrey Alexeyenko andrej.alekseenko at scilifelab.se
Wed Oct 19 11:36:40 CEST 2011


> But it was pearson coefficients of gene pairs, so what is the problem?
As I always understood it, the correlations could be (and were) 
calculated between all genes tested in SGA, irrespective of whether they 
were tested directly against each other. Is that wrong? Then I will simply 
put the 255178 number.
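
To illustrate the point, here is a minimal sketch (not the actual SGA or FunCoup code) of how a Pearson correlation can be computed for every pair of genes from their interaction profiles, even when the two genes were never screened against each other. The gene names and profile values are invented for illustration only.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation of two equally long numeric profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical profiles: each gene's interaction scores against the same
# four partner genes. None of geneA/geneB/geneC is a partner of the others.
profiles = {
    "geneA": [0.1, -0.5, 0.3, 0.0],
    "geneB": [0.2, -0.4, 0.2, 0.1],
    "geneC": [-0.3, 0.6, -0.1, 0.0],
}

# One correlation per unordered gene pair, tested directly or not.
correlations = {
    (g1, g2): pearson(profiles[g1], profiles[g2])
    for g1, g2 in combinations(sorted(profiles), 2)
}

for pair, r in correlations.items():
    print(pair, round(r, 3))
```

With N profiled genes this yields N*(N-1)/2 correlation values, which is why the size of the correlation table says little about how many pairs were tested directly.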

>
> BTW Again, how were the 4880 defined?
Just the number of distinct genes in the 255178-table.
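
As a sketch of the two kinds of counts discussed in this thread (distinct genes versus pairs above a correlation cutoff), assuming the table is a list of (gene1, gene2, correlation) rows. The toy rows and the function names below are hypothetical, not taken from the real file.

```python
def distinct_genes(pairs):
    """Set of all genes that occur in at least one row of the pair table."""
    return {gene for g1, g2, _ in pairs for gene in (g1, g2)}


def pairs_at_or_above(pairs, cutoff):
    """Number of rows whose correlation is at least `cutoff`."""
    return sum(1 for _, _, r in pairs if r >= cutoff)


# Toy stand-in for the correlation table (invented values).
toy_pairs = [
    ("g1", "g2", 0.12),
    ("g1", "g3", 0.25),
    ("g2", "g3", 0.35),
    ("g4", "g5", 0.15),
]

n_genes = len(distinct_genes(toy_pairs))
counts = {cutoff: pairs_at_or_above(toy_pairs, cutoff) for cutoff in (0.1, 0.2, 0.3)}

print("distinct genes:", n_genes)
print("pairs per cutoff:", counts)
```

Run on the real table, the first function would give the "distinct genes" figure and the second the pair counts at each cutoff.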

A

>
> /Erik
>
> On 10/19/2011 09:57 AM, Andrey Alexeyenko wrote:
>> Please do not forget that the GIN data were not available (to me) as gene
>> pairs, only as Pearson coefficients, which makes them not the same as
>> PPI data. If I had the original table, I would of course have put that
>> number on the web page, and would not have raised the issue...
>>
>> A
>>
>> On 2011-10-19 09:39, Erik Sonnhammer wrote:
>>> Hi, my point is that since genes are the currency of FunCoup, it makes
>>> sense to convert all other currencies to that currency.  Surely we are
>>> not making models of domain pairs, but of gene pairs.
>>>
>>> BTW, the PPI nr seems to be converted from actual datapoints to gene
>>> pairs.  I don't think there are >500000 datapoints for human PPI.
>>>
>>> Ideally we should present both the 'raw' nr of datapoints and the nr of
>>> gene pairs they correspond to for each datatype.
>>>
>>> /Erik
>>>
>>> On 10/18/2011 07:44 PM, Thomas Schmitt wrote:
>>>> I haven't followed the whole conversation, but in my opinion
>>>> both GIN and DOM fall into the same category of pairwise
>>>> data as PPI. (Although PPI isn't really pairwise because interactions
>>>> are defined for more than two genes.)
>>>> I think we should therefore report for both the number
>>>> of unique pairs. Namely for GIN the number of gene pairs
>>>> and for DOM the number of domain pairs, because that's what the data is about.
>>>>
>>>> /Thomas
>>>>
>>>> On Oct 18, 2011, at 4:23 PM, Erik Sonnhammer wrote:
>>>>
>>>>> GIN: Again, how were the 4880 defined?  Surely we can't just pretend
>>>>> some part of the input wasn't there just because their LLRs were low.
>>>>> Look at MIR, those numbers are even much higher.  I think we should
>>>>> write 255178 for GIN.
>>>>>
>>>>> DOM: I could go for Ngenes for DOM as well I guess.  But Ndomains is
>>>>> like Nlocations, pretty meaningless.
>>>>>
>>>>> /Erik
>>>>>
>>>>> On 10/18/2011 04:04 PM, Andrey Alexeyenko wrote:
>>>>>>> GIN: What is 4880 then and why did you not write 255178?
>>>>>> I could agree, 255178 might look informative...
>>>>>> But as we know, it includes mostly pairs that do not deserve LLR>0.5, so it
>>>>>> is misleading if one keeps PPI data in mind.
>>>>>>
>>>>>>>
>>>>>>> DOM: Still don't get it. Maybe a toy example will help. If domains A and
>>>>>>> B interact, 5 genes have A and 5 have B, then there are 25 gene pairs.
>>>>>>> Why is this impossible to compute?
>>>>>> It is possible, I agree, but it is the same as squaring N genes on a MA
>>>>>> - while for those we provide just Nconditions (or Ngenes for SCL).
>>>>>>
>>>>>> A
>>>>>>
>>>>>>>
>>>>>>> /Erik
>>>>>>>
>>>>>>> On 10/18/2011 03:20 PM, Andrey Alexeyenko wrote:
>>>>>>>> On 2011-10-18 15:07, Erik Sonnhammer wrote:
>>>>>>>>> GIN: What was the lowest PLC (i.e. the cutoff)? Note that in total,
>>>>>>>>> Sanjit had 255178 gene pairs with correlations. AFAICR he used all of
>>>>>>>>> them.
>>>>>>>> The whole table contains 255178 pairs. Min=0.1
>>>>>>>> At min=0.2 and 0.3 we get 18210 and 4184 pairs, respectively.
>>>>>>>> Bin borders (PPI, yeast) are
>>>>>>>>
>>>>>>>> data bin sce__geni_(simple_sgadata_correlations_100308.txt)_ ppi_mt 1
>>>>>>>> upper -0.1075
>>>>>>>> data bin sce__geni_(simple_sgadata_correlations_100308.txt)_ ppi_mt 2
>>>>>>>> upper 0.1385
>>>>>>>> data bin sce__geni_(simple_sgadata_correlations_100308.txt)_ ppi_mt 3
>>>>>>>> upper 0.1905
>>>>>>>> data bin sce__geni_(simple_sgadata_correlations_100308.txt)_ ppi_mt 4
>>>>>>>> upper 0.257
>>>>>>>> data bin sce__geni_(simple_sgadata_correlations_100308.txt)_ ppi_mt 5
>>>>>>>> upper 0.3905
>>>>>>>> data bin sce__geni_(simple_sgadata_correlations_100308.txt)_ ppi_mt 6
>>>>>>>> upper 1000.835
>>>>>>>>
>>>>>>>> and LLRs, respectively:
>>>>>>>> prob ll sce__geni_(simple_sgadata_correlations_100308.txt)_ ppi_mt 1
>>>>>>>> -0.915041861965188
>>>>>>>> prob ll sce__geni_(simple_sgadata_correlations_100308.txt)_ ppi_mt 2
>>>>>>>> -0.0173913346260002
>>>>>>>> prob ll sce__geni_(simple_sgadata_correlations_100308.txt)_ ppi_mt 3
>>>>>>>> 0.645635223480877
>>>>>>>> prob ll sce__geni_(simple_sgadata_correlations_100308.txt)_ ppi_mt 4
>>>>>>>> 1.69915508802254
>>>>>>>> prob ll sce__geni_(simple_sgadata_correlations_100308.txt)_ ppi_mt 5
>>>>>>>> 2.78104091819998
>>>>>>>> prob ll sce__geni_(simple_sgadata_correlations_100308.txt)_ ppi_mt 6
>>>>>>>> 4.60004121876386
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Incidentally, the originator of the data, Charlie Boone, is talking
>>>>>>>>> here
>>>>>>>>> tomorrow at 10. Seems they used a cutoff of PLC>0.2, but I can't find
>>>>>>>>> how many links that amounts to.
>>>>>>>>>
>>>>>>>>> DOM: "how do we define a set of input genes?" Isn't this simply the nr
>>>>>>>>> of genes containing interacting domain pairs? At the very least we
>>>>>>>>> should say what versions of Pfam and UniDomInt were used, but I don't
>>>>>>>>> really see why it is so impossible to convert to gene pair counts.
>>>>>>>>
>>>>>>>> Because while we can count such genes, we cannot anticipate _PAIRS_
>>>>>>>> where both genes have mutually interacting domains. This requires
>>>>>>>> re-running everything. I think the plain Pfam domain list is more
>>>>>>>> informative.
>>>>>>>>
>>>>>>>> A
>>>>>>>>
>>>>>>>>>
>>>>>>>>> /Erik
>>>>>>>>>
>>>>>>>>> On 10/18/2011 01:33 PM, Andrey Alexeyenko wrote:
>>>>>>>>>>> GIN: I just realised that this data type is not described in the
>>>>>>>>>>> Methods
>>>>>>>>>>> section, which it should be as it is new. Could you please provide a
>>>>>>>>>>> section? I'm surprised it's only 4880 interactions - above what
>>>>>>>>>>> cutoff
>>>>>>>>>>> was that?
>>>>>>>>>> Its header is under
>>>>>>>>>> "***: distinct genes/domains", i.e. I just counted how many unique
>>>>>>>>>> genes occurred in the Pearson table.
>>>>>>>>>> This is also the reason that I do not know what to say about preparing
>>>>>>>>>> the table and hence - cannot describe it in the Methods. I used just
>>>>>>>>>> the
>>>>>>>>>> Pearson linear correlation values in the standard form.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> DOM: I'm very unhappy with this just saying 3563 for all species -
>>>>>>>>>>> this
>>>>>>>>>>> is almost meaningless. Are you saying that the mapping to genes was
>>>>>>>>>>> only done on the fly? Could I perhaps ask Dimitri to try to extract
>>>>>>>>>>> the
>>>>>>>>>>> actual gene pairs numbers?
>>>>>>>>>> That would be the hard part: how do we define a set of input genes?..
>>>>>>>>>> And I do not think it is crucial. Indeed, the pivotal dataset is
>>>>>>>>>> UniDomInt: as soon as we have an extra gene with Pfam domains in
>>>>>>>>>> it, we
>>>>>>>>>> can check it in FunCoup with this data. For comparison, in MEX, PEX or
>>>>>>>>>> PPI the gene IDs are pivotal, that's why we count them in the table.
>>>>>>>>>>
>>>>>>>>>>> I also see that we don't describe the UniDomInt usage in the Methods
>>>>>>>>>>> section - do we need to? Was some cutoff or other parameter used?
>>>>>>>>>> I think I just employed the old procedure developed for those old
>>>>>>>>>> Rhodes
>>>>>>>>>> data. And I ignored lines that had UniDomInt score 0.
>>>>>>>>>>
>>>>>>>>>> A
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Also, could I please ask everybody to go through the paper looking
>>>>>>>>>>> for
>>>>>>>>>>> omissions, unclear parts, and other bugs.
>>>>>>>>>>>
>>>>>>>>>>> /Erik
>>>>>>>>>>>
>>>>>>>>>>> On 10/18/2011 12:33 PM, Andrey Alexeyenko wrote:
>>>>>>>>>>>> http://funcoup.sbc.su.se/statistics_2.0.html fixed.
>>>>>>>>>>>>
>>>>>>>>>>>> BUT differently (see the page), as it was (close to) impossible to
>>>>>>>>>>>> calculate the exact numbers:
>>>>>>>>>>>>
>>>>>>>>>>>> - in GIN: because I do not have the original pairwise file;
>>>>>>>>>>>>
>>>>>>>>>>>> - in DOM: because we store just the domain pairs, and answering
>>>>>>>>>>>> exactly would take re-running the whole thing in the debug mode
>>>>>>>>>>>> and looking at variable values...
>>>>>>>>>>>>
>>>>>>>>>>>> Andrey
>>>>>>>>>>>>
>>>>>>>>>>>> On 2011-09-28 12:02, Erik Sonnhammer wrote:
>>>>>>>>>>>>> Great
>>>>>>>>>>>>>
>>>>>>>>>>>>> I guess you mean GIN. How about simply the nr of interactions
>>>>>>>>>>>>> (above the
>>>>>>>>>>>>> cutoff whatever it was)?
>>>>>>>>>>>>>
>>>>>>>>>>>>> for DOM there should be a nr of interactions for each species while
>>>>>>>>>>>>> GIN
>>>>>>>>>>>>> is only in yeast.
>>>>>>>>>>>>>
>>>>>>>>>>>>> /Erik
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 09/28/2011 11:56 AM, Andrey Alexeyenko wrote:
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I updated statistics_2.0.html,
>>>>>>>>>>>>>> except the columns DOM and INT where I do not know what to count.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2011-09-27 11:42, Erik Sonnhammer wrote:
>>>>>>>>>>>>>>> Here is a list:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Put 2.0 on home page
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Update release notes (text file fine) with Input dataset sizes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Erik's desktop (ubuntu), under Edge Categories, Species, “fly”
>>>>>>>>>>>>>>> becomes “...”
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Have KEGG pathway memberships and subcellular localisations been
>>>>>>>>>>>>>>> updated?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Why does not fly FBpp0289426 (NBS) align with its ortholog human
>>>>>>>>>>>>>>> NBN?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Option to turn on debugging info
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> And some suggestions:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Change to “(out of # at pfc>0.1):” under 'Network edges'.
>>>>>>>>>>>>>>> The steps >0.25, >0.5, >0.75 are a bit too coarse anyway and
>>>>>>>>>>>>>>> may not match the query.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Update example queries(?)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Add “maximum pfc” cutoff to the query – to look for novel links.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Fewer areas on the webpage. Similar options should be grouped in
>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>> area instead. A few areas with clear headers about what they
>>>>>>>>>>>>>>> contain is
>>>>>>>>>>>>>>> preferable.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> /Erik
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 09/27/2011 11:03 AM, Thomas Schmitt wrote:
>>>>>>>>>>>>>>>> Awesome! What are the issues with the website that you want to
>>>>>>>>>>>>>>>> get
>>>>>>>>>>>>>>>> fixed?
>>>>>>>>>>>>>>>> Is this something that we should do asap?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Funcoup mailing list
>>>>>>>>>>> Funcoup at sbc.su.se
>>>>>>>>>>> https://mail.sbc.su.se/mailman/listinfo/funcoup