Automating the Construction of Internet Portals with Machine Learning

The paper describes how machine learning can automate the creation and maintenance of domain-specific internet portals. Such portals gather and organize web content on a single topic so that it is easy to browse, retrieve, and search, but building and maintaining them by hand is time-consuming and labor-intensive. The authors propose using reinforcement learning, information extraction, and text classification to streamline the process. They describe a system that efficiently spiders the web, identifies informative text segments, and builds topic hierarchies. The system has been used to create a portal of computer science research papers containing over 50,000 papers, and the techniques are applicable to other domains as well.

The paper highlights how hard it is to find specific information on the web with general search engines, which lack precision. Domain-specific portals offer more targeted search capabilities. Examples include Camp Search, LinuxStart, Movie Review Query Engine, Crafts Search, and Travel-Finder, which let users search for specific information more effectively.

The authors present new machine learning methods for spidering in a topic-directed manner, extracting relevant information, and building browseable topic hierarchies. These methods are based on reinforcement learning, hidden Markov models, and text classification. The spidering task is framed as a reinforcement learning problem, which allows optimal behavior to be defined precisely and mathematically. The results show that a reinforcement learning spider finds domain-relevant documents more efficiently than traditional methods. Information extraction is performed with hidden Markov models, which are effective for automatically identifying textual substrings in documents. The model extracts fifteen different fields from spidered documents with 93% accuracy.
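
Extraction with a hidden Markov model can be illustrated with a minimal Viterbi decoder, which finds the most likely sequence of field labels for a sequence of tokens. This is a sketch under stated assumptions, not the authors' model: the two states, the toy probabilities, the `unk` smoothing constant, and the function name `viterbi` are all hypothetical choices made for illustration.

```python
import math


def viterbi(tokens, states, start_p, trans_p, emit_p, unk=1e-6):
    """Return the most likely state sequence for `tokens` under a discrete HMM.

    `unk` is a small probability assumed for tokens a state has never emitted.
    """
    # best[t][s]: log-probability of the best path that ends in state s at position t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s].get(tokens[0], unk))
             for s in states}]
    back = [{}]
    for t in range(1, len(tokens)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the predecessor state that maximizes the path score into s.
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s].get(tokens[t], unk))
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(tokens) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Toy model (all numbers are made up for illustration).
states = ["title", "author"]
start_p = {"title": 0.9, "author": 0.1}
trans_p = {
    "title": {"title": 0.7, "author": 0.3},
    "author": {"title": 0.1, "author": 0.9},
}
emit_p = {
    "title": {"machine": 0.2, "learning": 0.3, "portals": 0.3, "with": 0.2},
    "author": {"andrew": 0.3, "mccallum": 0.3, "kamal": 0.2, "nigam": 0.2},
}

tokens = ["machine", "learning", "andrew", "mccallum"]
labels = viterbi(tokens, states, start_p, trans_p, emit_p)
print(list(zip(tokens, labels)))
```

With these toy numbers the decoder labels the first two tokens as title and the last two as author. A real extractor would learn its transition and emission probabilities from labeled data and would cover many more fields.
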
These techniques enable efficient and accurate portal creation and maintenance.
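
To make the reinforcement learning framing of spidering more concrete, the sketch below shows one standard way to define the value of a crawl (a discounted sum of future rewards) and a best-first spider that always follows the link with the highest estimated value. Everything specific here is an assumption for illustration: the in-memory link graph, the discount factor, the hand-set value estimates that stand in for a learned estimator, and the function names `discounted_return` and `crawl`. It is not the authors' implementation.

```python
import heapq


def discounted_return(rewards, gamma=0.5):
    """Sum of rewards, where a reward t steps in the future is scaled by gamma**t."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))


def crawl(graph, on_topic, start, q_value, budget):
    """Best-first crawl: repeatedly fetch the frontier link with the highest estimate.

    Returns the on-topic pages found after fetching at most `budget` pages.
    """
    visited, found = {start}, []
    # heapq is a min-heap, so scores are negated to pop the largest first.
    frontier = [(-q_value(url), url) for url in graph.get(start, [])]
    heapq.heapify(frontier)
    while frontier and budget > 0:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        budget -= 1
        if url in on_topic:
            found.append(url)
        for nxt in graph.get(url, []):
            if nxt not in visited:
                heapq.heappush(frontier, (-q_value(nxt), nxt))
    return found


# Hypothetical toy site: papers sit two links below the "people" page.
graph = {
    "home": ["news", "people"],
    "news": ["events"],
    "people": ["alice"],
    "alice": ["paper1.ps", "paper2.ps"],
}
on_topic = {"paper1.ps", "paper2.ps"}

# Hand-set stand-ins for learned value estimates (assumed, not from the paper).
estimates = {"people": 0.5, "alice": 1.0, "paper1.ps": 1.0, "paper2.ps": 1.0}


def q_value(url):
    return estimates.get(url, 0.0)


print(discounted_return([0, 0, 1]))
print(crawl(graph, on_topic, "home", q_value, budget=4))
```

The discount factor makes a paper that is reachable sooner worth more than one reached later, which is why the spider in this toy example prefers the "people" branch over "news" even though neither page is itself a paper.
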

2000 | ANDREW KACHITES MCCALLUM, KAMAL NIGAM, JASON RENNIE, KRISTIE SEYMORE