STRI: Stochastic Text-Reading Intelligence

This is my problem, and hopefully you can relate. I can spend an hour reading a dozen news articles each day and still not know the general direction of the world. My interests and biases guide what I choose to read, and so I get a skewed world picture.
If I could read hundreds of news posts indiscriminately, then I could get a general and somewhat accurate feel for the world at large.
This is my solution and hopefully you can relate. STRI (pronounced stor-ee): Stochastic Text-Reading Intelligence. STRI is (will be) a piece of self-adjusting software that reads the text of many different related articles and compiles a set of variables to describe the general idea of the text. For example, if STRI had read all the available reviews of a certain movie, it could tell you how good, how violent, or how funny the reviewers as a whole thought the movie was. If STRI had read a lot of stock market analyses recently, it could tell you if the analysts as a whole felt the price of gold was going to go up or down.
But how does one go about programming a piece of software to read text? Chat bots seem to have a hard time handling a single line of text, let alone a mass of hundreds or thousands of articles. How do you program software to weed out or compensate for deliberately misleading content? How does a piece of software differentiate between the words of a rambling moron and a leading expert, and what does it do about the difference? This is the framework I have in mind.
At the most basic level, STRI will look for key words with emotional connotations or some material meaning all on their own. Most of these words would be nouns that could tell a human reader something about the sentence if he or she were just barely skimming. Words like “smash”, “war”, “jitters”, “glory”, and “amazing” that support themselves would make good candidates for key words. Each of those words will have a set of about six different basic values, depending on what the content is about. Common basic values could be “Greed” and “Optimism” for finance, or “Quality” and “Violence” for movies.
Then, for each key word found, STRI would look through the sentence using a basic grammar parser and search for modifying words that would be likely to change the weight of the key word. Common modifiers would be “not” to reverse the weight and “very” to increase the weight. Then, after an entire article has been parsed this way, all the values are totaled up to get a basic idea of the emotional content of the article.
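To make the key word and modifier idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `LEXICON` and `MODIFIERS` tables, their numbers, and the function name `score_article` are made up, and a crude rule (a modifier applies to the next key word in the same sentence) stands in for the basic grammar parser.

```python
import re

# Hypothetical lexicon: each key word carries values on a few basic axes.
LEXICON = {
    "amazing": {"quality": 1.0, "violence": 0.0},
    "glory": {"quality": 0.6, "violence": 0.2},
    "smash": {"quality": 0.0, "violence": 0.8},
    "war": {"quality": 0.0, "violence": 1.0},
}

# Hypothetical modifiers: a multiplier applied to the next key word.
MODIFIERS = {"not": -1.0, "very": 1.5}


def score_article(text):
    """Total the weighted values of every key word found in the text."""
    totals = {}
    # Split into rough sentences so a modifier never leaks across them.
    for sentence in re.split(r"[.!?]+", text.lower()):
        multiplier = 1.0
        for word in re.findall(r"[a-z']+", sentence):
            if word in MODIFIERS:
                # Stack modifiers, e.g. "not very" gives -1.0 * 1.5.
                multiplier *= MODIFIERS[word]
            elif word in LEXICON:
                for axis, value in LEXICON[word].items():
                    totals[axis] = totals.get(axis, 0.0) + multiplier * value
                # The modifier is used up once it has hit a key word.
                multiplier = 1.0
    return totals


print(score_article("The film was not amazing. The war scenes were a very big smash."))
```

A real parser would need to work out which word a modifier actually attaches to; the "next key word" rule is only the simplest thing that shows the weighting.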
Repeat this process for a large number of related articles and the result is an average emotional content of every piece available on the subject. The user would just consult STRI about a given topic, say Kung-Fu Panda, and would be given “121 reviews of Kung-Fu Panda returned 78% quality, 33% violence, 61% humour, etc.”, and possibly a list of the reviews for more details.
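The averaging step could look something like the sketch below. It assumes each article has already been reduced to a dictionary of values between 0 and 1, one per basic value; that normalization and the function name `summarize` are assumptions of this example, not something worked out above.

```python
def summarize(topic, article_scores):
    """Average per-article scores (each 0..1) and report them as percentages."""
    if not article_scores:
        raise ValueError("need at least one article")
    count = len(article_scores)
    # Collect every axis that appears in any article, in a stable order.
    axes = sorted({axis for scores in article_scores for axis in scores})
    parts = []
    for axis in axes:
        # An article that never mentions an axis counts as 0 on it.
        average = sum(scores.get(axis, 0.0) for scores in article_scores) / count
        parts.append(f"{round(average * 100)}% {axis}")
    return f"{count} reviews of {topic} returned " + ", ".join(parts)


reviews = [
    {"quality": 0.5, "violence": 0.25},
    {"quality": 1.0, "violence": 0.25},
]
print(summarize("Kung-Fu Panda", reviews))
# prints: 2 reviews of Kung-Fu Panda returned 75% quality, 25% violence
```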
From there the process could be refined because STRI's findings could be compared to the average of the four-star ratings that were given to Kung-Fu Panda. If there was a significant difference, the values of some words could be adjusted and the calculations redone to try for a closer match. This is the stochastic process; make an educated guess from the previous results to get a better match next time. This works better with stock market reports because technical analysis already uses the changes in stock prices to determine the general mood of the stock market, so STRI would have something complex and numerical to compare its findings to.
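One simple way to do the adjusting is sketched below. It is only an assumption about how the guessing could work: it uses a plain error-driven update (nudge every word in an article by a fraction of the gap between STRI's guess and the known rating), which is a stand-in technique rather than a worked-out design. The names `refine`, `rate`, and `rounds` are hypothetical, and ratings are assumed to be rescaled to the range 0 to 1.

```python
def refine(weights, articles, targets, rate=0.1, rounds=200):
    """Nudge word values so predicted scores move toward known ratings.

    articles: list of key word lists, one list per article.
    targets:  known rating for each article, rescaled to 0..1.
    """
    weights = dict(weights)  # work on a copy, leave the input untouched
    for _ in range(rounds):
        for words, target in zip(articles, targets):
            known = [w for w in words if w in weights]
            if not known:
                continue
            # The guess for an article is the mean value of its key words.
            predicted = sum(weights[w] for w in known) / len(known)
            error = target - predicted
            # Move every contributing word a small step toward the target.
            for w in known:
                weights[w] += rate * error
    return weights


start = {"amazing": 0.5, "fun": 0.5, "dull": 0.5, "boring": 0.5}
learned = refine(start, [["amazing", "fun"], ["dull", "boring"]], [0.9, 0.2])
print(learned)
```

With each pass the gap shrinks by a fixed fraction, so the guesses settle toward the ratings instead of jumping straight to them, which keeps one odd review from rewriting a word's value.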
To compensate for sources with an obvious bias, STRI could compare the values derived from articles from a given source to the pool of related articles from all sources to see if that source used words that were consistently either skewed or exaggerated. If, for instance, Fox News had a habit of using words that were consistently more violent than other sources on the same topic, then STRI could learn to count future articles from Fox News as less violent than their wording suggests.
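As a rough sketch of the bias correction, assume STRI keeps, for one topic and one basic value such as violence, a list of scores per source. The function names `source_bias` and `corrected` and the source names in the example are hypothetical.

```python
from statistics import mean


def source_bias(scores_by_source, source):
    """How far a source's average sits above or below the whole pool."""
    pool = [s for scores in scores_by_source.values() for s in scores]
    return mean(scores_by_source[source]) - mean(pool)


def corrected(score, bias):
    """Discount a new article's score by its source's known bias."""
    return score - bias


# Hypothetical violence scores for articles on the same topic.
violence = {
    "Source A": [0.25, 0.25],
    "Source B": [0.75, 0.75],
}
bias_b = source_bias(violence, "Source B")
print(bias_b)                   # prints: 0.25
print(corrected(0.75, bias_b))  # prints: 0.5
```

Subtracting a flat offset only handles a source that is consistently skewed; a source that exaggerates in both directions would need its scores scaled rather than shifted.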
Similarly, STRI could check for basic grammar and spelling to differentiate between a rambling forum topic and a well-read professional journal and could give weight to each article accordingly. Discerning quality is difficult, but spelling is easy.
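The spelling check is the easy part, so here is a minimal sketch of it. It assumes a set of correctly spelled words is available (the tiny `KNOWN_WORDS` set below is a placeholder for a real dictionary) and uses the share of recognized words as the article's weight. The name `spelling_weight` is made up, and grammar checking is left out entirely.

```python
import re

# Placeholder for a real dictionary word list.
KNOWN_WORDS = {"the", "movie", "was", "great", "and", "funny"}


def spelling_weight(text, known_words=KNOWN_WORDS):
    """Return the fraction of words in the text that are spelled correctly."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    recognized = sum(1 for word in words if word in known_words)
    return recognized / len(words)


print(spelling_weight("The movie was great"))  # prints: 1.0
print(spelling_weight("The movie was grate"))  # prints: 0.75
```

Proper nouns and jargon would be counted as misspellings by a check this naive, so a real word list would have to be large and kept up to date.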
To get a starting point for new words, STRI could scan several online dictionaries and thesauruses for similar key words that are already in STRI's vocabulary and compose a set of mood values from those, and then use that set of values in a few articles to refine the values through a series of better guesses. For example, if the word “torture” was encountered for the first time, STRI would look at the definition of torture as well as similar words to make a guess at the meaning, then search for articles with the word “torture” in them to see if the definition values make sense and refine them accordingly.
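The first guess for a new word could be as simple as averaging the values of whatever similar words STRI already knows. The sketch below assumes the thesaurus lookup has already happened and produced a list of synonyms; the function name `seed_values`, the example words, and their numbers are all hypothetical.

```python
def seed_values(synonyms, lexicon):
    """Guess values for a new word by averaging its already-known synonyms."""
    known = [lexicon[word] for word in synonyms if word in lexicon]
    if not known:
        return {}  # nothing to go on yet
    axes = {axis for values in known for axis in values}
    return {
        axis: sum(values.get(axis, 0.0) for values in known) / len(known)
        for axis in axes
    }


# Hypothetical existing vocabulary and thesaurus result for "torture".
lexicon = {
    "pain": {"violence": 0.5},
    "cruelty": {"violence": 1.0},
}
print(seed_values(["pain", "cruelty", "agony"], lexicon))
# prints: {'violence': 0.75}
```

The result is only a starting guess; the refinement against real articles described above is what would turn it into a usable value.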
With a sufficient number of starting words in a language and enough processing time, STRI could trawl the internet until it had learned the rest of that language. After STRI had spent a sufficient amount of time in a field, a user could ask it questions about topics in that field and get the gist of each topic without having to read the pieces written on it. The key is comparison: anything with a metric that written work can be tied to, like a stock price or a movie rating, is perfect for an artificial intelligence not only to analyze but to learn from as well. From a large enough analysis, a more accurate picture of a topic should be possible.
- Jack Davis, 2008.
