Researchers Analyze Tweets for Bio-Surveillance

Micro-blogging services such as Twitter offer the potential to crowdsource epidemic surveillance in real time. However, Twitter posts (tweets) are often ambiguous and reactive to media trends. To ground user messages in epidemic response, Nigel Collier of the National Institute of Informatics in Tokyo and colleagues focused on tracking reports of self-protective behavior, such as avoiding public gatherings or increased sanitation, as the basis for further risk analysis.

The researchers created guidelines for tagging self-protective behavior based on Jones and Salathé's (2009) behavior response survey. Applying the guidelines to a corpus of 5,283 Twitter messages related to influenza-like illness showed a high level of inter-annotator agreement (kappa 0.86). They used unigrams, bigrams and regular expressions as features for two supervised classifiers (SVM and Naive Bayes) to classify tweets into four self-reported protective behavior categories plus a self-reported diagnosis. In addition to classification performance, they report a moderately strong Spearman's rho correlation between classifier output and WHO/NREVSS laboratory data for A(H1N1) in the USA during the 2009-2010 influenza season.
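
As a rough illustration of the correlation step, the sketch below computes Spearman's rho between two weekly series using only the Python standard library. It is not the authors' code: the function names (`average_ranks`, `spearman_rho`) and the weekly counts are hypothetical and were invented for this example, not taken from the study.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined for a constant series")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical weekly data (illustrative only, not from the paper):
# tweets a classifier flagged as self-protective behavior, and
# laboratory-confirmed positive specimens for the same weeks.
tweet_counts = [12, 30, 45, 80, 60, 25]
lab_positives = [100, 250, 400, 900, 1000, 200]

rho = spearman_rho(tweet_counts, lab_positives)
print(f"Spearman's rho: {rho:.3f}")
```

Because the statistic is computed on ranks, it measures whether the two series rise and fall together without assuming a linear relationship between tweet volume and case counts.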

The researchers say their study adds to evidence supporting a high degree of correlation between pre-diagnostic social media signals and diagnostic influenza case data, pointing the way toward low-cost sensor networks. They believe that the signals they have modeled may be applicable to a wide range of diseases. Their research was published in the Journal of Biomedical Semantics.

Reference: Collier N, Son NT and Nguyen NM. OMG U got flu? Analysis of shared health messages for bio-surveillance. Journal of Biomedical Semantics 2011, 2(Suppl 5):S9. doi:10.1186/2041-1480-2-S5-S9