We consider the problem of separating error messages generated in large distributed data center networks into error events. In such networks, each error event leads to a stream of messages generated by all network components affected by the event. These messages are stored in a giant message log, with no information about the associated events. We study the unsupervised learning problem of identifying the signatures of the events that generated these messages; here, the signature of an error event refers to the probability distribution of messages generated by the event. We design a low-complexity algorithm for this purpose and demonstrate its scalability on a real dataset of 97 million messages, collected over a period of 15 days from a distributed data center network that supports the operations of a large wireless service provider.
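
To make the notion of a signature concrete, the following minimal sketch computes the empirical message distribution for a single event whose messages are assumed to be known. The message names and the function `empirical_signature` are hypothetical, and this is not the algorithm from the paper. In the actual problem the event labels are missing from the log, which is what makes the learning task unsupervised.

```python
from collections import Counter


def empirical_signature(messages):
    """Return the relative frequency of each message type in `messages`."""
    counts = Counter(messages)
    total = sum(counts.values())
    return {msg: n / total for msg, n in counts.items()}


# Hypothetical message types emitted during one (known) error event.
link_failure_messages = ["LINK_DOWN", "LINK_DOWN", "BGP_RESET", "TIMEOUT"]

# The signature is a probability distribution over message types.
signature = empirical_signature(link_failure_messages)
```

Here `signature` maps each message type to its share of the event's messages, so the values sum to one.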
