DataArt’s Big Data Competence Center announces the beta launch of a new application. The app analyzes the U.S. and U.K. media news flow and converts it into easy-to-understand charts and infographics.
The application collects news from more than 50 original news sources: the websites of newspapers, news agencies, and television channels. These include the most influential American and British media outlets (e.g. Fox News Channel and BBC News) and even press releases from the major mobile industry companies. The technique is called structured crawling: the article, its title, author, and tags are extracted from the web page without any surrounding information noise. The app then processes this large data stream and organizes it.
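To make the idea of structured crawling concrete, here is a minimal sketch using only Python's standard library. It is not the crawler used in the app: the class name, the choice of tags treated as noise, and the sample page (including the author name) are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Keeps title, author, tags, and body text; ignores page noise."""

    NOISE = {"script", "style", "nav", "footer", "aside"}  # assumed noise tags
    FIELDS = {"h1": "title", "p": "body"}                  # assumed page layout

    def __init__(self):
        super().__init__()
        self.record = {"title": "", "author": "", "tags": [], "body": ""}
        self._noise_depth = 0   # > 0 while inside a noise element
        self._field = None      # record field currently being filled

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.NOISE:
            self._noise_depth += 1
        elif tag == "meta":
            content = attrs.get("content") or ""
            if attrs.get("name") == "author":
                self.record["author"] = content
            elif attrs.get("name") == "keywords":
                self.record["tags"] = [t.strip() for t in content.split(",") if t.strip()]
        elif tag in self.FIELDS:
            self._field = self.FIELDS[tag]

    def handle_endtag(self, tag):
        if tag in self.NOISE:
            self._noise_depth = max(0, self._noise_depth - 1)
        elif tag in self.FIELDS:
            self._field = None

    def handle_data(self, data):
        text = data.strip()
        if text and self._field and self._noise_depth == 0:
            self.record[self._field] = (self.record[self._field] + " " + text).strip()


def extract_article(html):
    """Return a dict with the structured fields found in `html`."""
    parser = ArticleParser()
    parser.feed(html)
    return parser.record


# Hypothetical page used only to exercise the sketch.
SAMPLE = """<html><head><meta name="author" content="Jane Doe">
<meta name="keywords" content="sports, ncaa"><script>var x = 1;</script></head>
<body><nav><p>Home | World</p></nav><h1>Louisville wins</h1>
<p>First paragraph.</p><p>Second paragraph.</p><footer>Share this</footer></body></html>"""

record = extract_article(SAMPLE)
```

The navigation bar, script, and footer text never reach the record, which is the "without noise" part of the description above. A production crawler would need per-site extraction rules rather than one fixed layout.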
During processing, articles are divided into 13 categories: business, culture, economics, education, entertainment, health, politics, science, showbiz, sports, technology, world, and a category we call “other”. Using a natural language processor, all articles can be sorted in two ways: by the referenced object (person, organization, or location) or by the emotions expressed (the emotional coloring of the text). In the last three months we have collected more than 100,000 articles, which amounts to over 1 GB of data.
The data is aggregated and presented in charts, which are grouped by display type and by the available categories.
GEO: All the locations mentioned in the articles are marked on the world map as hot spots. You can select any geographical location and see which persons or organizations were mentioned in articles in the context of that location (pic.1: “Barack Obama in Washington D.C.” – news trends in the U.S.).
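The lookup behind such a hot spot can be pictured as a co-occurrence count. The sketch below assumes each article already carries the entities found by the language processor; the record layout, the function name, and the sample data are hypothetical.

```python
from collections import Counter

# Hypothetical records: entities extracted from three articles.
ARTICLES = [
    {"locations": ["Washington D.C."], "persons": ["Barack Obama"], "organizations": ["Senate"]},
    {"locations": ["Washington D.C.", "London"], "persons": ["Barack Obama"], "organizations": []},
    {"locations": ["London"], "persons": [], "organizations": ["BBC News"]},
]


def mentioned_with(location, articles):
    """Count persons and organizations in articles that mention `location`."""
    counts = Counter()
    for article in articles:
        if location in article["locations"]:
            counts.update(article["persons"])
            counts.update(article["organizations"])
    return counts.most_common()  # most frequently mentioned first


washington = mentioned_with("Washington D.C.", ARTICLES)
```

Selecting a location on the map would then correspond to calling `mentioned_with` with that location's name.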
TRENDS: The chart lets you sort statistics by category and trend (pic.2). The top trends are displayed below the graph. To see how many mentions a hot topic received and what the media wrote about it, hover your mouse over it (pic.3: the weekly peak in the sports category, 378 mentions, when Louisville won the NCAA championship on April 9).
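One way to think about the numbers behind this chart is as mention counts grouped by topic and by day. The following sketch is an assumption about the aggregation, not the app's actual code; the function names and the handful of sample mentions (including their dates) are made up for illustration.

```python
from collections import Counter
from datetime import date

# Hypothetical (publication date, category, topic) mentions.
MENTIONS = [
    (date(2013, 4, 8), "sports", "Louisville"),
    (date(2013, 4, 9), "sports", "Louisville"),
    (date(2013, 4, 9), "sports", "Louisville"),
    (date(2013, 4, 9), "sports", "NCAA"),
    (date(2013, 4, 9), "technology", "Facebook Home"),
]


def top_trends(mentions, category, n=3):
    """Return the n most mentioned topics within one category."""
    topics = Counter(topic for _, cat, topic in mentions if cat == category)
    return topics.most_common(n)


def peak_day(mentions, category, topic):
    """Return (day, mention count) for the busiest day of a topic, or None."""
    per_day = Counter(
        day for day, cat, top in mentions if cat == category and top == topic
    )
    if not per_day:
        return None
    return max(per_day.items(), key=lambda item: item[1])


sports_trends = top_trends(MENTIONS, "sports")
louisville_peak = peak_day(MENTIONS, "sports", "Louisville")
```

The list under the graph would correspond to `top_trends`, and the value shown on hover to the count returned by `peak_day`.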
BUBBLES: The same data visualized in a different, original style, without displaying numeric values (pic.4).
EMOTIONS: Each word can be given an emotional characteristic, negative or positive. It's a pretty rough division, but it works. Each emotion is assigned a score on the emotional scale (most positive = 2; most negative = -2).
The user can choose one or several categories to see which news has been given positive or negative connotations (pic.5: positive news about technology).
E.g.: According to the emotions chart, the most positive news of the week of April 5-11 was the reopening of the Rijksmuseum in Amsterdam, with 1.8 points (pic.6).
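A rough word-level scoring of this kind might look like the sketch below. The lexicon entries and the averaging rule are assumptions chosen for illustration; the text above only states that words carry positive or negative scores on a scale from -2 to 2.

```python
import re

# Hypothetical lexicon: word -> score on the -2..2 scale.
LEXICON = {"reopening": 2, "celebrate": 2, "good": 1, "delay": -1, "disaster": -2}


def emotion_score(text, lexicon=LEXICON):
    """Average the scores of all lexicon words found in `text`.

    Returns 0.0 (neutral) when no word of the text is in the lexicon.
    Because every word score lies in [-2, 2], so does the average.
    """
    words = re.findall(r"[a-z']+", text.lower())
    scores = [lexicon[word] for word in words if word in lexicon]
    if not scores:
        return 0.0
    return sum(scores) / len(scores)


positive = emotion_score("Crowds celebrate the reopening")
mixed = emotion_score("Good news after a delay")
negative = emotion_score("A disaster and a delay")
```

Averaging keeps long and short articles comparable, which is one plausible way a single article could end up with a fractional score such as 1.8.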
ARTICLES: The filter mode selects articles by news source, category, date, or a particular word. Hold the Shift key to apply multiple filters simultaneously.
News titles, authors, categories, and emotional color are displayed on cards in chronological order (pic.7).
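Applying several filters at once can be read as a logical AND over the chosen criteria. The sketch below assumes that reading; the field names, the function signature, and the sample articles are hypothetical.

```python
from datetime import date

# Hypothetical article cards.
ARTICLES = [
    {"source": "BBC News", "category": "technology", "date": date(2013, 4, 10),
     "title": "Facebook Home arrives on Android"},
    {"source": "Fox News Channel", "category": "technology", "date": date(2013, 4, 9),
     "title": "Microsoft unveils update"},
    {"source": "BBC News", "category": "sports", "date": date(2013, 4, 9),
     "title": "Louisville takes the title"},
]


def filter_articles(articles, source=None, category=None, day=None, word=None):
    """Keep articles matching every filter that is set; None means 'any'."""
    result = []
    for article in articles:
        if source is not None and article["source"] != source:
            continue
        if category is not None and article["category"] != category:
            continue
        if day is not None and article["date"] != day:
            continue
        if word is not None and word.lower() not in article["title"].lower():
            continue
        result.append(article)
    # Cards are shown in chronological order.
    return sorted(result, key=lambda article: article["date"])


tech = filter_articles(ARTICLES, category="technology")
bbc_tech = filter_articles(ARTICLES, source="BBC News", category="technology")
```

Here `tech` holds two cards ordered by date, while adding the source filter narrows `bbc_tech` to a single card.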
In pictures 8 and 9 I’d like to demonstrate how the whole thing works. We noticed that there are two leading trends in the technology category: Microsoft and Facebook Home. The news about Facebook Home, a Facebook application for Android, got 1,886 mentions in 88 articles. As we verified later, most of these articles had 2+ points on our emotional scale.
The application was built using pure HTML5 and MongoDB. The product is agile and fast despite the amount of data being analyzed.
We look forward to improving the product further:
to expand the geo filtering options;
to integrate with industry-specific reference data;
to create more advanced types of graphs and analysis.