We develop and apply a novel framework which is designed to extract information in the form of a positive definite kernel matrix from possibly crude, noisy, incomplete, inconsistent dissimilarity information between pairs of objects, obtainable in a variety of contexts. Any positive definite kernel defines a consistent set of distances, and the fitted kernel provides a set of coordinates in Euclidean space which attempt to respect the information available, while controlling for complexity of the kernel. The resulting set of coordinates are highly appropriate for visualization and as input to classification and clustering algorithms. The framework is formulated in terms of a class of optimization problems which can be solved efficiently using modern convex cone programming software. The power of the method is illustrated in the context of protein clustering based on primary sequence data. An application to the globin family of proteins resulted in a readily visualizable 3D sequence space of globins, where several sub-families and sub-groupings consistent with the literature were easily identifiable.
Citation
Technical Report No. 1107, Department of Statistics, University of Wisconsin, Madison, May, 2005.