Bull, Simon C. and Muldoon, Mark R. and Doig, Andrew J. (2013) Maximising the Size of NonRedundant Protein Datasets Using Graph Theory. PLoS ONE, 8 (2012.6). e55484. ISSN 1932-6203
This is the latest version of this item.
Abstract
Analysis of protein data sets often requires prior removal of redundancy, so that data is not biased by having multiple copies of similar proteins. This is usually achieved by pairwise comparison of sequences, followed by purging so that no two remaining sequences have a similarity above a chosen threshold. From a starting set, such as the PDB or a genome, one should remove as few sequences as possible, to give the largest possible nonredundant set for subsequent analysis. Protein redundancy can be represented as a graph, with proteins as nodes that are connected by an undirected edge if they have a pairwise similarity above the chosen threshold. The problem is then equivalent to finding the maximum independent set (MIS), in which as few nodes as possible are removed so that no edges remain. We tested seven MIS algorithms, three of which are new. We applied the methods to the PDB, subsets of the PDB, various genomes and the BHOSLIB benchmark datasets. For PDB subsets of up to 1000 proteins, we could compare to the exact MIS, found by the Cliquer algorithm. The best algorithm was the new method, Leaf. This works by adding clique members that have no edges to nodes outside the clique to the MIS, starting with the smallest cliques. For PDB subsets of up to 1000 members, it usually finds the MIS and is fast enough to apply to data sets of tens of thousands of proteins. It gives sets that are around 10% larger than those produced by the commonly used PISCES algorithm and that are of identical quality. We therefore suggest that Leaf should be the method of choice for generating nonredundant protein data sets, though it is ineffective on dense graphs, such as the BHOSLIB benchmarks. The Leaf algorithm and sets from genomes and the PDB are available at: http://www.bioinf.manchester.ac.uk/leaf/.
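The graph formulation in the abstract can be illustrated with a minimal Python sketch. It is not the authors' Leaf implementation: it only borrows the idea of keeping a protein whose neighbours form a clique among themselves (a clique member with no edges outside the clique), trying the smallest neighbourhoods first. The fallback of dropping the most connected protein when no such node exists is an assumption of this sketch. The function names `build_graph` and `nonredundant_set`, the protein labels and the similarity scores are all hypothetical.

```python
from itertools import combinations


def build_graph(similarity, threshold):
    """Map each protein to the set of proteins it is redundant with.

    `similarity` maps (protein_a, protein_b) pairs to a score; an edge is
    added when the score is above `threshold`.
    """
    graph = {}
    for (a, b), score in similarity.items():
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if a != b and score > threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def is_clique(graph, nodes):
    """True if every pair of `nodes` is joined by an edge."""
    return all(v in graph[u] for u, v in combinations(nodes, 2))


def nonredundant_set(graph):
    """Return a set of proteins with no edges between them (heuristic)."""
    graph = {n: set(nb) for n, nb in graph.items()}  # work on a copy
    kept = set()

    def remove(node):
        # Delete the node and every edge that touches it.
        for nb in graph.pop(node):
            graph[nb].discard(node)

    while graph:
        # Try nodes with the fewest neighbours first.
        for node in sorted(graph, key=lambda n: (len(graph[n]), n)):
            if is_clique(graph, graph[node]):
                # Keeping this node costs nothing: its neighbours are all
                # mutually redundant, so at most one of them could be kept.
                kept.add(node)
                for nb in list(graph[node]):
                    remove(nb)
                remove(node)
                break
        else:
            # Assumed fallback: no such node exists, so drop the most
            # connected protein and try again.
            remove(max(graph, key=lambda n: (len(graph[n]), n)))
    return kept


# Hypothetical pairwise similarities (fraction identity) for five proteins.
scores = {
    ("A", "B"): 0.90, ("B", "C"): 0.80, ("A", "C"): 0.10,
    ("C", "D"): 0.20, ("D", "E"): 0.95,
}
redundancy = build_graph(scores, threshold=0.5)
print(sorted(nonredundant_set(redundancy)))  # prints ['A', 'C', 'D']
```

With a threshold of 0.5 the edges are A-B, B-C and D-E, so three proteins can be kept. A purge that discarded the most connected protein's neighbours instead of the protein itself could keep only B and one of D or E.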
Item Type:  Article 

Uncontrolled Keywords:  graph theory, maximum independent set, bioinformatics, protein sequence data 
Subjects:  MSC 2010, the AMS's Mathematics Subject Classification > 68 Computer science; MSC 2010, the AMS's Mathematics Subject Classification > 92 Biology and other natural sciences 
Depositing User:  Dr Mark Muldoon 
Date Deposited:  01 Mar 2013 
Last Modified:  20 Oct 2017 14:13 
URI:  https://eprints.maths.manchester.ac.uk/id/eprint/1945 
Available Versions of this Item

Maximising the Size of NonRedundant Protein Datasets Using Graph Theory. (deposited 07 Jul 2012)
 Maximising the Size of NonRedundant Protein Datasets Using Graph Theory. (deposited 01 Mar 2013) [Currently Displayed]