In Progress

To Do Now

  • Extract a DeliciousExt data-set with the additional constraint that the chosen tags must also be extractive key-phrases for the documents collected.


  • List all features currently used by machine learning (see the Feature List).
    • Include log(1+x) as a feature, in addition to x?
  • Label 100 documents and re-train model (e.g., with Slashdot-type comments)
  • Profile freq. of occurrence of features using Weka
  • Continue feature engineering to improve classification performance using n-fold cross-validation.
  • Work on extraction of keyphrases from multiple documents (e.g., cluster of documents)
  • Create a Java interface to support clustering of news items that have been read
    • Save each read news item and document as a “FENA” (Features of Entry and Article) XML file.
    • Include meta-data such as time spent, etc.
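
For the log(1+x) question above, a minimal sketch of what the extra feature could look like. The class and method names (LogFeatures, withLog1p) are hypothetical and not part of the current code; the sketch assumes the raw features are non-negative values such as counts.

```java
import java.util.Arrays;

/** Hypothetical helper: appends log(1 + x) for every raw feature value x. */
final class LogFeatures {
    private LogFeatures() {}

    /**
     * Returns a vector twice as long as the input: the raw values first,
     * followed by log(1 + x) of each value. Assumes non-negative inputs
     * (e.g., term counts); negative values are rejected.
     */
    static double[] withLog1p(double[] raw) {
        double[] out = Arrays.copyOf(raw, raw.length * 2);
        for (int i = 0; i < raw.length; i++) {
            if (raw[i] < 0) {
                throw new IllegalArgumentException("expected a non-negative value, got " + raw[i]);
            }
            // Math.log1p is more accurate than Math.log(1 + x) for small x.
            out[raw.length + i] = Math.log1p(raw[i]);
        }
        return out;
    }
}
```

Keeping both x and log(1+x) lets the learner decide which scale is more useful; whether that helps here would still need to be checked with cross-validation.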

To Do Next

  • Add a button to the RSSOwl interface so that users can give feedback on their interest level in a given news story: like / dislike / neutral. We would like to think about how to incorporate this sort of preference information into the clustering algorithm.
  • Add a suggestion dialog box
  • Add results page to design document.
  • Investigate automatic feature binarization for Weka to enable use of other classifiers (e.g., maxent)
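
As a starting point for the binarization item above, one possible approach is equal-width binning of a numeric feature into one-hot indicators. This is only a sketch under assumptions: the class name Binarizer, the choice of equal-width bins, and the clamping of out-of-range values are illustrative, and the code does not use Weka's own filters.

```java
/** Hypothetical helper: turns one numeric feature into binary indicator features. */
final class Binarizer {
    private Binarizer() {}

    /**
     * One-hot encodes x over `bins` equal-width intervals covering [min, max].
     * Values outside the range are clamped into the first or last bin.
     */
    static int[] oneHot(double x, double min, double max, int bins) {
        if (bins < 1 || !(max > min) || Double.isNaN(x)) {
            throw new IllegalArgumentException("need bins >= 1, max > min, and a non-NaN value");
        }
        double clamped = Math.max(min, Math.min(max, x));
        int index = (int) ((clamped - min) / (max - min) * bins);
        if (index == bins) {
            index = bins - 1; // x == max belongs to the last bin
        }
        int[] out = new int[bins];
        out[index] = 1;
        return out;
    }
}
```

With min = 0, max = 1, and 4 bins, a value of 0.5 sets only the third indicator. The number of bins and the range would have to be chosen per feature, ideally from the training data only.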

To Do At Some Later Date

  • Add the category field of a feed entry to the learning algorithm
  • Add menu option to ban a Wikipedia category
  • Add position features for machine learning.
  • Check for stop words before adding a User keyphrase (or is this handled by couldBeKeyphrase()?)
  • Sentence breaking
  • Try indexing into Wikipedia by single terms, and combinations, and grab all the search results and combine them into something. Prefer all terms first.
  • Use Lingpipe named-entity detector as a feature
  • Integrate fully into the GUI, recognizing user settings, languages, etc.
  • Sync with the newest RSSOwl codebase
  • Automatically identify when an article page is just an ad; identify link to true article.

Optional Features

  • Ignore comments
  • Stop words have to be excluded? I run out of heap space otherwise.
  • Go through web page; look at the value part of attribute for either “topic” or “tag”
  • Smart folder that is a “meta-feed”

Probably Never Going to Do

  • Context-menu option to add keyphrases from RSS blurb
  • Lookup link text from other pages that link to the webpage, via a search engine.


  • Revisit Dan's comments on the data and recrawl
  • Move your action list here from Word doc.
  • Add the feed URL to the FENA
  • Add Doc. Freq. to ARFF files
  • Create Learner that saves its model to disk.
  • Have the Newsreader load the classifier from disk, and use it instead of the baseline model (maybe via switch?).
  • Somehow use user ratings to influence the learning model. Maybe just have them submit user ratings for performance analysis, and use the FENA data for more training?
  • Experiment using Wikipedia for better keywords.
  • Get query search logs from Microsoft Research and ask them about phrase position features in the MoC model.
  • Good cut-off? I’m thinking 30%
  • Validate user-entered keyphrases by checking them for existence in the FENA. What about Wikipedia non-extractive keywords?
  • Load the stop words only once
  • Cache the keyword data for the last 20 or so news items
  • Grab the Wikipedia article title.
  • Rate and save Wikipedia keyphrases
  • Thread the GUI somehow? How can we halt it when the user clicks rapidly?
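
For the caching item above, a minimal sketch of a bounded cache that keeps keyword data only for the most recently used news items. The class name KeywordCache, the use of a String key per news item, and a List<String> of keyphrases as the value are assumptions for illustration; the actual data types in the newsreader may differ.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical LRU cache: maps a news item key (e.g., its link) to its
 * keyphrases and evicts the least recently used entry once the capacity
 * is exceeded. Not thread-safe; wrap it or synchronize externally if it
 * is shared between the GUI thread and a worker thread.
 */
final class KeywordCache extends LinkedHashMap<String, List<String>> {
    private static final long serialVersionUID = 1L;
    private final int capacity;

    KeywordCache(int capacity) {
        super(16, 0.75f, true); // true = iterate in access order (LRU)
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, List<String>> eldest) {
        // Called by put(); returning true drops the least recently used entry.
        return size() > capacity;
    }
}
```

Creating it with `new KeywordCache(20)` would match the "last 20 or so" items mentioned above.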
nlp-private/intelligent-newsreader.txt · Last modified: 2015/04/22 15:06 by ryancha