A demo on identifying clusters in a data set using the D-Wave quantum computer.
When dealing with large data sets, we do not always have neatly labeled data (i.e. most, if not all, the data could be unlabeled). However, all is not lost as we can still extract insights from this data through clustering.
For example, when dealing with housing data---namely, square footage and price---we can identify clusters in that data which may be indicative of different neighborhoods. Another example could be having a boolean vector of TV shows that consumers watch; clusters in this data could help identify a particular consumer demographic.
As well, if we do have a few labelled data points in our data set, we could potentially label the entire cluster based on these.
Figure: A data set of nine points that have been divided into three clusters (shown with different colors).
To run the demo with a simple, hardcoded data set:
bash
python clustering.py
To run the same demo with a slightly more sophisticated data set:
bash
python example_clusters.py
This provides a visualization of the problem on the D-Wave problem inspector, and plots of the original data set and its solution.
The D-Wave quantum computer accepts problems formulated mathematically in Binary Quadratic Model (BQM) format. The goal here is to build a BQM such that it represents our clustering problem. Namely, we want a BQM such that a low-energy solution found by the D-Wave quantum computer would correspond to a solution to our clustering problem.
Key properties of the clustering problem that we need to capture in our BQM:
Let's go through how we implement each of the key properties of our clustering problem.
<coordinate>.r
, <coordinate>.g
, and <coordinate>.b
. That way, we
can answer yes-no questions for whether a specific coordinate is in a
particular colour cluster.choose_one_group
(shown below). Each three-item-tuple below can
be interpreted as (<join-red>, <join-green>, <join-blue>)
, where the
1s and 0s indicate true and false, respectively. Hence, the
choose_one_group
is a set of all valid states. (e.g. (1, 1, 0)
is not
valid because a data point is not allowed to be in both red and green clusters
at once).bash
choose_one_group = {(0, 0, 1), (0, 1, 0), (1, 0, 0)}
clustering.py
). Each of
the four data points has three nodes associated with it - <coordinate>.r
,
<coordinate.g>
, and <coordinate.b>
- hence the twelve nodes below.max_distance
, the largest distance between any two points in the data set.bash
d = get_distance(coord0, coord1) / max_distance # rescale distance
weight = -math.cos(d*math.pi)
tanh
function.bash
d = math.sqrt(get_distance(coord0, coord1) / max_distance)
weight = -math.tanh(d) * 0.1
0.1
was applied in order to prevent this weight from
overwhelming the other weights in the BQM. The 0.1
is arbitrary and was
found by tinkering with the code.Released under the Apache License 2.0. See LICENSE file.
I tried running this with 20 points on D'wave, but it gave me an error:
ValueError: no embedding found
is this because dwavebinarycsp
can not solve a large dataset?
please add qpu tag to clustering example
constraint-satisfaction-problem machine-learning problem-inspector bqm intermediate