Grouping Customer Opinions Written in Natural Language Using Unsupervised Machine Learning
Jan Žižka (Brno, Czech Republic)
The seminar took place on 24 November 2012
The first part of the talk deals with the automatic clustering of unstructured textual documents. This well-known problem is investigated empirically, with a focus on very large real-world data: customer reviews of hotel accommodation booked online. The data come from a popular online booking service, booking.com. Using the largest selection (almost 2,000,000 freely written reviews in English), the talk addresses which clustering method should be used, which parameters of the selected algorithm are optimal, and how to estimate the correctness of the clustering.
The second part of the talk turns to another problem that arose during the clustering task and played a specific role in it: how can very large volumes of (textual) data be processed when the available computers cannot handle them because RAM is never large enough? Side experiments with (pseudo)parallel processing revealed some interesting things about how well randomly selected subsets represent the large original set. When do the subsets stop being representative because the selection omits some relevant words? Is it better to process a larger number of smaller subsets quickly, or a smaller number of larger subsets slowly?
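The question of when a random subset stops being representative can be illustrated with a small sketch. It is not the method used in the talk: the function names `vocabulary` and `subset_coverage`, the whitespace tokenization, and the six toy reviews are all assumptions made for illustration. The sketch measures what fraction of the full vocabulary survives in a randomly drawn subset of a given size.

```python
import random


def vocabulary(docs):
    """Return the set of lowercase, whitespace-separated tokens in docs."""
    vocab = set()
    for doc in docs:
        vocab.update(doc.lower().split())
    return vocab


def subset_coverage(docs, subset_size, seed=0):
    """Fraction of the full vocabulary still present in a random subset.

    Hypothetical helper: the talk does not specify how representativeness
    was measured; vocabulary coverage is only one possible proxy.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    subset = rng.sample(docs, subset_size)
    return len(vocabulary(subset)) / len(vocabulary(docs))


# Toy stand-ins for hotel reviews (invented for this example).
reviews = [
    "clean room and friendly staff",
    "noisy street but great breakfast",
    "friendly staff and great location",
    "dirty bathroom and rude staff",
    "great breakfast and clean room",
    "location was perfect and quiet",
]

for size in (2, 4, 6):
    print(size, round(subset_coverage(reviews, size), 2))
```

On real data, repeating the measurement over many seeds and subset sizes would show how quickly coverage drops as subsets shrink, which is one way to frame the trade-off between many small subsets and few large ones.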