Many applications involving clustering prefer solutions that are balanced, i.e., the clusters obtained contain roughly equal numbers of data points. This chapter describes several approaches to obtaining balanced clustering results that also scale well to large data sets. We first describe in detail how a balanced, scalable solution can be obtained by first clustering only a small subset of the data and then efficiently allocating the rest of the data to these initial clusters while simultaneously refining the solution. We also illustrate how certain graph-partitioning-based approaches to clustering naturally lead to balanced solutions. If the similarity matrix that these approaches act on is sparse, the solution scales reasonably well. Finally, we outline how the concept of "conscience" can be used for fast on-line balanced clustering, illustrating it with results from clustering text documents.
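As an illustration of the sample-then-allocate idea, the following Python sketch clusters a small random sample and then fills clusters greedily under a per-cluster capacity of ceil(n/k). The allocation heuristic, the capacity choice, and all names are ours; the chapter's method additionally refines the solution during allocation, which we omit.

```python
import numpy as np
from sklearn.cluster import KMeans

def sample_then_allocate(X, k, sample_size=1000, seed=0):
    """Balanced clustering sketch: cluster a small sample, then allocate
    every point greedily under a per-cluster capacity of ceil(n/k).
    Points with the largest margin between their best and second-best
    center are placed first, since they care most about the outcome."""
    rng = np.random.default_rng(seed)
    n = len(X)
    idx = rng.choice(n, size=min(sample_size, n), replace=False)
    centers = KMeans(n_clusters=k, n_init=10,
                     random_state=seed).fit(X[idx]).cluster_centers_
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    cap = int(np.ceil(n / k))                 # per-cluster capacity
    two_best = np.sort(dists, axis=1)[:, :2]
    order = np.argsort(two_best[:, 1] - two_best[:, 0])[::-1]
    labels = np.full(n, -1)
    sizes = np.zeros(k, dtype=int)
    for i in order:
        for c in np.argsort(dists[i]):
            if sizes[c] < cap:                # nearest center with room
                labels[i] = c
                sizes[c] += 1
                break
    return labels, centers
```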
We consider practical methods for adding constraints to the K-Means clustering algorithm in order to avoid local solutions with empty clusters or clusters having very few points. We often observe this phenomenon when applying K-Means to datasets where the number of dimensions is n ≥ 10 and the number of desired clusters is k ≥ 20. Moreover, recent studies have shown successful formulations of various other types of constraints. In particular, must-link and cannot-link constraints have been studied in several papers.
Incorporating prior knowledge, whether in the form of a fixed number of non-empty clusters or pairwise relationships, is essential in partially supervised clustering. Like the general clustering problem, the partially supervised clustering problem can be posed as an optimization problem. With partial supervision, the underlying clustering model can be used to prevent poor local solutions.
An appropriate objective function needs to be constructed to find clusters that simultaneously satisfy minimum-size, must-link, and cannot-link constraints. This requires an analysis of the applicability and the level of complexity of each constraint type.
We propose explicitly adding k constraints to the underlying clustering optimization problem, requiring that each cluster contain at least a minimum number of points. We then investigate the resulting cluster assignment step. Numerical tests on real datasets indicate that the constrained approach is less prone to poor local solutions and produces a better summary of the underlying data. We also formulate an extended optimization model that covers other types of constraints as well.
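One way to make the constrained assignment step concrete is the sketch below: each cluster receives τ "mandatory slots" filled by a rectangular minimum-cost assignment, and all remaining points go to their nearest center. The chapter poses this step exactly (as a minimum-cost network flow); the slot construction here is a cheaper heuristic of ours that still guarantees the size bounds.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def min_size_assign(X, centers, tau):
    """Assignment step with cluster-size lower bounds (heuristic sketch).

    Each cluster gets tau 'mandatory slots'; a rectangular assignment
    fills them at minimum cost, and the remaining points simply go to
    their nearest center. The exact step is a min-cost network flow;
    this is a cheaper approximation of the same idea.
    """
    n, k = len(X), len(centers)
    assert n >= k * tau, "need at least k*tau points"
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    slot_cost = np.tile(d, (1, tau))       # column j -> cluster j % k
    rows, cols = linear_sum_assignment(slot_cost)
    labels = np.argmin(d, axis=1)          # default: nearest center
    labels[rows] = cols % k                # mandatory slots override
    return labels
```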
We present an approach to semi-supervised clustering based on the observation that ``it is easier to criticize than to construct.'' Our approach allows a user to iteratively provide feedback to a clustering algorithm. The feedback is incorporated in the form of constraints which the clustering algorithm attempts to satisfy on future iterations. These constraints allow the user to guide the clusterer towards clusterings of the data that the user finds more useful. We demonstrate semi-supervised clustering with a system that learns to cluster news stories from a Reuters data set.
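A minimal sketch of such a feedback loop is shown below; `cluster_fn` and `get_user_feedback` are hypothetical stand-ins for the constrained clusterer and the interactive critique step described in the chapter.

```python
def feedback_clustering(X, k, cluster_fn, get_user_feedback, max_rounds=5):
    """Criticize-then-recluster loop (sketch). `get_user_feedback`
    returns must-link and cannot-link pairs the user adds after
    inspecting the current clustering; empty feedback ends the loop."""
    must_link, cannot_link = [], []
    labels = cluster_fn(X, k, must_link, cannot_link)
    for _ in range(max_rounds):
        ml, cl = get_user_feedback(X, labels)   # user criticizes result
        if not ml and not cl:
            break                               # user is satisfied
        must_link += ml
        cannot_link += cl
        labels = cluster_fn(X, k, must_link, cannot_link)
    return labels
```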
In many clustering problems, in addition to attribute data, we have relational information linking different data points. In this chapter, we examine the problem of relational clustering, in which data is clustered based on both attribute information and co-occurrence information. One approach to relational clustering, which we refer to as naive relational clustering, uses neighbor information and makes independent pairwise clustering decisions. An alternative approach is collective relational clustering, in which, rather than viewing the clustering as a collection of pairwise decisions, the clustering decisions depend on the clustering of neighboring points. We present algorithms for both naive and collective relational clustering, and present experimental results on several real-world datasets.
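The following sketch conveys the collective idea under assumed inputs (a pairwise attribute-similarity matrix and a neighbor list): relational similarity is computed against the current clusters of the neighbors, so each merge decision depends on earlier ones. The chapter's actual algorithms are more refined; all names here are ours.

```python
def collective_cluster(sim_attr, neighbors, alpha=0.5, threshold=0.7):
    """Greedy collective relational clustering (sketch).

    sim_attr[i][j]: attribute similarity of points i and j
    neighbors[i]:   iterable of points co-occurring with i
    Merges keep updating relational similarity, so each decision
    depends on the clusters of the neighbors, not just on i and j."""
    n = len(sim_attr)
    label = list(range(n))                     # start with singletons

    def rel_sim(i, j):
        ci = {label[v] for v in neighbors[i]}  # neighbor *clusters*
        cj = {label[v] for v in neighbors[j]}
        union = ci | cj
        return len(ci & cj) / len(union) if union else 0.0

    merged = True
    while merged:
        merged = False
        for i in range(n):
            for j in range(i + 1, n):
                if label[i] == label[j]:
                    continue
                s = (1 - alpha) * sim_attr[i][j] + alpha * rel_sim(i, j)
                if s >= threshold:             # merge clusters of i, j
                    old, new = label[j], label[i]
                    label = [new if l == old else l for l in label]
                    merged = True
    return label
```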
As a discovery technique, clustering should avoid redundancies with existing knowledge about class structures or groupings and reveal novel, previously unknown aspects of the data. We formally define this as a constrained clustering problem, which we refer to as "non-redundant clustering." We present a family of algorithms that obtain high-quality clusterings while ensuring that these solutions are novel with respect to the known structure. We discuss incorporating various forms of known structure, such as existing classifications or user-selected features, and also discuss how the techniques may be used to enumerate a sequence of multiple plausible clusterings for a data set.
In this chapter, we discuss a strategy for integrating instance-level constraints by way of a hidden Markov random field (HMRF) model. The wide applicability of this model is demonstrated for vectorial, parametric clustering models and outlined for pairwise, non-parametric ones. The computational complexity induced by the constraints in such models is tackled by means of a mean-field approximation. Furthermore, a weighting mechanism is introduced that allows controlling the trade-off between constrained and unlabelled data. A model-selection heuristic is employed to pick a sensible value of this trade-off parameter. Experiments shed light on the usefulness of this proposal in the model-based setting on various data sets.
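For concreteness, a generic mean-field update for the assignment posterior under pairwise constraints might take the following form; the symbols and the exact constraint weighting are illustrative, not the chapter's.

```latex
% Illustrative mean-field update for the assignment posterior q_i(k)
% under must-link pairs M and cannot-link pairs C; the chapter's exact
% factorization and weighting may differ.
q_i(k) \propto \pi_k \, p(x_i \mid \theta_k)\,
  \exp\!\Big( \lambda \sum_{j : (i,j) \in M} q_j(k)
            - \lambda \sum_{j : (i,j) \in C} q_j(k) \Big)
```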
We extend Gaussian mixture models (GMM) in a simple and natural way to accommodate prior belief that pairs of items should, or should not, be assigned to the same cluster. We incorporate beliefs about pairwise cluster assignments into the GMM by expressing those beliefs as Bayesian priors over the assignment of data points to clusters. The priors can express either hard constraints (items A and B must, or must not, be assigned to the same cluster) or merely preferences for certain assignments. The strengths of the preferences are under user control. The priors penalize each cluster assignment according to both the number and the strength of the assignment preferences it violates. We provide an expectation-maximization (EM) algorithm for parameter fitting. Experiments on artificial and real problems show that cluster assignment preferences expressed in the training data are effectively incorporated into the GMM parameters; clustering on test data is improved with respect to the standard GMM.
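A sketch of how such preferences can enter the E-step is given below: responsibilities are reweighted by the preferences they satisfy or violate. Because the prior couples points, the sketch iterates a mean-field-style update; this is an approximation of ours, not the chapter's exact EM derivation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def penalized_e_step(X, weights, means, covs, prefs, lam=2.0, iters=5):
    """E-step with soft pairwise assignment preferences (sketch).

    prefs: list of (i, j, s) with s=+1 for "same cluster" and s=-1 for
    "different clusters"; lam scales the preference strength. Since
    the prior couples points, responsibilities are refined iteratively."""
    n, k = len(X), len(weights)
    like = np.column_stack([
        weights[c] * multivariate_normal.pdf(X, means[c], covs[c])
        for c in range(k)
    ])
    r = like / like.sum(axis=1, keepdims=True)
    for _ in range(iters):
        bonus = np.zeros((n, k))
        for i, j, s in prefs:
            bonus[i] += s * lam * r[j]      # reward/penalize agreement
            bonus[j] += s * lam * r[i]
        r = like * np.exp(bonus)
        r /= r.sum(axis=1, keepdims=True)
    return r
```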
In recent years, several classical clustering approaches have been extended towards constrained clustering, and various types of constraints, e.g., pairwise constraints or balancing constraints, have been considered. In this chapter, we investigate a co-clustering framework (i.e., a method that provides a partition of objects and a linked partition of features). Supporting constraint-based mining for a co-clustering task has remained largely unexplored. First, we consider straightforward extensions of the classical instance-level constraints (must-link, cannot-link), extended in the sense that they can specify properties linking objects and/or features. Furthermore, we also study more original constraints that exploit sequential orders on features (e.g., when features denote time points and are thus ordered). The idea is that one might specify whether the extracted co-clusters should involve contiguous elements; in other terms, these are constraints on the shape of the biclusters. Co-clustering requires the optimization of an objective function (e.g., Goodman-Kruskal's tau coefficient or the loss of mutual information) while also ensuring that the user-defined constraints are satisfied. Instead of integrating constraint processing within a co-clustering scheme like CoCluster, we propose a local-to-global (L2G) framework. The L2G approach consists of postprocessing a collection of (constrained) local patterns (such as, e.g., closed feature sets and their supporting sets of objects) to build a global bipartition. When user-defined constraints are specified on the bipartition, our basic idea is that clustering constraints can be transformed into local counterparts (i.e., the clustering process itself is applied to local patterns that satisfy some local constraints). We then take care of constraint propagation and clustering-constraint satisfaction during the clustering computation itself. In this chapter, we study this scheme for the user-defined constraints identified so far.
In our approach, constraints are intended not as a tool to improve supervised classification results (i.e., the usual context for semi-supervised classification), but as a means to support relevant clustering w.r.t. some background knowledge. For instance, an interval constraint, which enables building partitions where each cluster is an interval w.r.t. an ordered dimension (similarly, a non-interval constraint enforces the discovery of clusters that are independent of the order information), is useful when mining gene expression data (e.g., microarray data that provide gene expression time series). In this chapter, we consider a couple of gene expression data analysis tasks where constrained co-clustering leads to interesting results, i.e., where using constraints supports the discovery of more relevant biclusters given the biological knowledge.
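As a small illustration, a co-cluster's feature set satisfies the interval constraint exactly when the selected indices along the ordered dimension are contiguous; a minimal check (ours) in Python:

```python
def is_interval(feature_idxs):
    """True iff the selected (ordered) features form a contiguous block,
    i.e., the co-cluster satisfies the interval constraint."""
    s = sorted(set(feature_idxs))
    return len(s) == 0 or s[-1] - s[0] + 1 == len(s)

# e.g. a co-cluster over time points 3,4,5 is an interval; 3,5,7 is not
assert is_interval([3, 4, 5]) and not is_interval([3, 5, 7])
```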
Gaussian Mixture Models (GMMs) are a popular generative clustering model, usually estimated in an unsupervised manner using the Expectation Maximization (EM) algorithm. In this work we show how equivalence constraints can be incorporated into this algorithm, leading to improved model estimation and improved clustering results. Equivalence constraints provide additional information on pairs of data points, indicating whether the points arise from the same source (positive constraint) or from different sources (negative constraint). Such constraints can be gathered automatically in some learning problems, and are a natural form of supervision in others. We present a closed-form EM algorithm for handling positive constraints, and a Generalized EM algorithm using a Markov network for the incorporation of negative constraints. We demonstrate that incorporating equivalence constraints leads to considerable improvement in clustering performance.
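For positive constraints, the closed-form treatment groups points into chunklets (sets known to share a source) and assigns each chunklet as a unit, so its responsibility multiplies the member likelihoods. A sketch under the i.i.d. sampling assumption, with names of ours:

```python
import numpy as np
from scipy.stats import multivariate_normal

def chunklet_e_step(X, chunklets, weights, means, covs):
    """E-step over chunklets (sketch): each chunklet is a list of
    indices into X that must share a label, so its responsibility for
    component c is proportional to pi_c * prod of member likelihoods."""
    k = len(weights)
    resp = np.zeros((len(chunklets), k))
    for ci, chunk in enumerate(chunklets):
        for c in range(k):
            resp[ci, c] = np.log(weights[c]) + np.sum(
                multivariate_normal.logpdf(X[chunk], means[c], covs[c]))
        resp[ci] -= resp[ci].max()           # stabilize before exp
        resp[ci] = np.exp(resp[ci])
        resp[ci] /= resp[ci].sum()
    return resp
```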
We consider the problem of clustering a collection of elements based on pairwise judgments of similarity and dissimilarity. Bansal et al. (in: Proceedings of 43rd FOCS, 2002, pp. 238-247) cast the problem thus: given a graph G whose edges are labeled "+" (similar) or "-" (dissimilar), partition the vertices into clusters so that the number of pairs correctly (resp., incorrectly) classified with respect to the input labeling is maximized (resp., minimized). It is worthwhile studying both complete graphs, in which every edge is labeled, and general graphs, in which some input edges might not have labels. We answer several questions left open by Bansal et al. (2002) and provide a sound overview of clustering with qualitative information.
Specifically, we demonstrate a factor 4 approximation for minimization on complete graphs, and a factor O(log n) approximation for general graphs. For the maximization version, a PTAS for complete graphs was shown by Bansal et al. (2002); we give a factor 0.7664 approximation for general graphs, noting that a PTAS is unlikely by proving APX-hardness. We also prove the APX-hardness of minimization on complete graphs.
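To make the objective concrete, the later randomized CC-Pivot scheme of Ailon, Charikar, and Newman (a 3-approximation in expectation on complete graphs, not the factor-4 algorithm above) fits in a few lines:

```python
import random

def cc_pivot(nodes, positive, seed=0):
    """Randomized pivot clustering for complete +/- graphs (sketch).

    `positive` holds the '+' (similar) pairs as frozensets; any absent
    pair is implicitly '-'. Pick a random pivot, put all its '+'
    neighbors in its cluster, recurse on the rest."""
    rng = random.Random(seed)
    nodes = list(nodes)
    clusters = []
    while nodes:
        pivot = rng.choice(nodes)
        cluster = [pivot] + [v for v in nodes if v != pivot
                             and frozenset((pivot, v)) in positive]
        clusters.append(cluster)
        nodes = [v for v in nodes if v not in cluster]
    return clusters

# tiny example: {a,b} and {a,c} similar, everything else dissimilar
print(cc_pivot("abcd", {frozenset("ab"), frozenset("ac")}))
```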
Attribute data and relationship data are two principal types of data, representing the intrinsic and extrinsic properties of entities. While attribute data have been the main source of data for cluster analysis, relationship data such as social networks or metabolic networks are becoming increasingly available. It is also common to observe that both data types carry complementary information, as in market segmentation and community identification, which calls for a joint cluster analysis of both data types so as to achieve more accurate results. For this purpose, we introduce the novel Connected k-Center problem, which takes into account attribute data as well as relationship data. We analyze the complexity of this problem and prove its NP-hardness. We also present a constant-factor approximation algorithm, based on which we further design NetScan, a heuristic algorithm that is efficient for large, real databases. Our experimental evaluation demonstrates the meaningfulness and accuracy of the NetScan results.
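The sketch below conveys the connectivity requirement in the assignment phase: clusters grow outward from their centers along graph edges, always expanding the frontier node closest (in attribute space) to its center, so each cluster stays connected. NetScan's actual assignment phase is more refined; all names here are ours.

```python
import heapq
import numpy as np

def grow_connected_clusters(X, adj, center_idxs):
    """Connected-cluster assignment sketch: every cluster is grown from
    its center through the relationship graph `adj` (adjacency lists),
    preferring attribute-space proximity to the center. Nodes that are
    unreachable from every center keep label -1."""
    n = len(X)
    label = [-1] * n
    heap = []
    for c, s in enumerate(center_idxs):
        label[s] = c
        for v in adj[s]:
            heapq.heappush(heap, (np.linalg.norm(X[v] - X[s]), v, c))
    while heap:
        d, v, c = heapq.heappop(heap)
        if label[v] != -1:
            continue
        label[v] = c                       # reached through the graph
        s = center_idxs[c]
        for u in adj[v]:
            if label[u] == -1:
                heapq.heappush(heap, (np.linalg.norm(X[u] - X[s]), u, c))
    return label
```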
Interactive Visual Clustering (IVC) is a novel method that allows a user to explore relational data sets interactively, in order to produce a clustering that satisfies their objectives. IVC combines spring-embedded graph layout techniques with user interaction and constrained clustering. This paper describes the IVC method, and gives experimental results on several synthetic and real-world data sets, showing that IVC yields better clustering performance than several alternative methods.
Various kinds of constraints can be used to derive desired clusters in cluster analysis. One kind of constraint, called the existence constraint, requires that at least a certain number of user-specified pivot objects be in each resulting cluster. In this paper, we first examine how an existence constraint can be effectively and efficiently incorporated into a popular clustering method, the c-means. We then illustrate how our algorithm can be adapted to derive a good partitioning that satisfies various forms of constraints in the k-anonymity model for privacy-preserving data publishing.
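A sketch of an assignment step honoring an existence constraint (each cluster must contain at least m of the pivot objects) can use a rectangular minimum-cost assignment over m "slots" per cluster, with everything else going to the nearest center. This heuristic is ours, not the paper's algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def existence_assign(X, centers, pivot_idxs, m):
    """Assignment step under an existence constraint (sketch): pivots
    fill m minimum-cost 'slots' per cluster; all other points (and any
    leftover pivots) simply go to the nearest center."""
    k = len(centers)
    assert len(pivot_idxs) >= k * m, "need at least k*m pivot objects"
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = np.argmin(d, axis=1)               # unconstrained default
    slot_cost = np.tile(d[pivot_idxs], (1, m))  # column j -> cluster j%k
    rows, cols = linear_sum_assignment(slot_cost)
    labels[np.asarray(pivot_idxs)[rows]] = cols % k
    return labels
```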
A distance metric is an important factor in clustering, and an appropriate metric must be used to achieve accurate results. Several methods have been proposed for learning a distance metric from pairwise feedback, such as indications that two data items are similar and must be in the same cluster (``must-be-linked'') or dissimilar and cannot be in the same cluster (``cannot-be-linked''). In this chapter, we describe a method for learning a distance metric from only cannot-be-linked example pairs. The metric learning is formulated as a convex quadratic programming problem, which can be solved much faster than the semi-definite programming problem that generally must be solved to learn a distance metric matrix. We apply the metric learning to an object identification problem, in which clustering is used to determine whether the names of people in documents or databases refer to the same person. The cannot-be-linked example pairs used in training can be collected mechanically, without human supervision, under the assumptions that different names refer to different objects and that names are arbitrary. Experiments using the DBLP data set show that the learned metric improves precision and recall for object identification.
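One possible instantiation of such a convex QP (ours, simplified: a diagonal metric, unit margins, no slack variables) can be written directly with SciPy:

```python
import numpy as np
from scipy.optimize import minimize

def learn_diag_metric(cl_pair_diffs):
    """Learn a diagonal distance metric from cannot-link pairs only
    (sketch; the chapter's exact QP differs in objective and slacks).

    cl_pair_diffs: array (m, d) of differences x_i - x_j, one row per
    cannot-link pair. We require the weighted squared distance of each
    such pair to be at least 1 while keeping the weights small:
        min ||w||^2  s.t.  sum_l w_l * diff_l^2 >= 1,  w >= 0
    """
    sq = cl_pair_diffs ** 2                      # (m, d)
    d = sq.shape[1]
    cons = [{'type': 'ineq', 'fun': lambda w, s=s: s @ w - 1.0}
            for s in sq]
    res = minimize(lambda w: w @ w, x0=np.ones(d), method='SLSQP',
                   bounds=[(0.0, None)] * d, constraints=cons)
    return res.x     # learned metric: d(x,y)^2 = sum_l w_l (x_l - y_l)^2
```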
We propose a discriminative learning approach that can incorporate pairwise constraints into a conventional margin-based learning framework. Unlike previous work, which mostly attempts to learn better distance metrics or estimate the underlying data distribution, the proposed approach directly models the classification decision boundary and thus requires fewer model assumptions in the learning process. Moreover, the proposed approach can benefit from handling both labeled data and pairwise constraints in a unified framework. We investigate two families of pairwise loss functions, namely convex and non-convex pairwise loss functions, and study the consistency of the resulting estimator as the sample size grows. We then design three pairwise learning algorithms by plugging in the hinge and logistic loss functions and derive optimization methods for them. The proposed learning algorithms were evaluated on a people identification task on two surveillance video datasets. The experiments demonstrate that the proposed pairwise learning algorithms considerably outperform baseline classifiers using only labeled data, as well as two other pairwise learning algorithms given the same amount of pairwise constraints. We also evaluate the effect of using noisy constraints in the video datasets, in order to utilize pairwise constraints while protecting human privacy.
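For illustration only, one common non-convex pairwise loss on two decision values is the product-form hinge below; the convex and non-convex families studied in the chapter are more general than this single example.

```python
def pairwise_hinge(fi, fj, y):
    """Pairwise hinge loss on two decision values (illustrative only).

    y = +1 for a must-link pair (scores should share a sign),
    y = -1 for a cannot-link pair (scores should have opposite signs).
    The product fi*fj makes this non-convex in (fi, fj); convex
    surrogates bound it from above."""
    return max(0.0, 1.0 - y * fi * fj)

# a cannot-link pair scored on the same side of the boundary is penalized
print(pairwise_hinge(0.8, 0.6, -1))   # 1.48
```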