Using Descriptor Counts in Clustering
(Directed Exploration of Territories)
Standard Wards Method
- Does a complete clustering.
- Modified Wards method.
- Can use BOOST's output as its start point.
- Compare each item in a dataset to every other item.
- Choose the "best" pairing.*
- New pair now becomes an item, replacing the two that went into it.
- Keep pairing items until done.
*"best" = least increase in summed moments of inertia.
As in BOOST, we bin the (reduced) dataset by holding the most variable bits
constant. In DiET, however, we don't have a tree, just one level of bins.
For each item in the reduced dataset, its best partner might be in
any bin. But it's more likely to be in a bin that
differs by just one or two bits than in a bin that differs by, say, ten.
As we combine pairs to build clusters, we start with restricted search spaces,
centered on each bin, then widen them in stages.
First, we look just within the same bin. As pairings start to lose
compactness, we expand our search territories to include bins that differ by one bit.
Then we expand to include two-bit differences, and so on.
As the number of bins grows, the number of items in each bin shrinks.