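The information age has brought a flood of digital text. Every day this text is generated, disseminated, exchanged, stored, and accessed over the Internet, and it reaches into the daily lives of people around the world. As data accumulates, finding information becomes harder and language barriers become more serious. To overcome these difficulties, the Natural Language Computing group focuses on related research topics, including multilingual text analysis, machine translation, cross-language information retrieval, and automatic question answering. Over the years, the group has made major contributions to Microsoft products, including the Japanese and Chinese input method editors (IME); the English writing assistant for Office 2007; the Chinese couplet game, Chinese word segmentation system, pinyin search, and search engine speller that went into Windows Live products; text mining technology for SQL Server 2005 and SharePoint; and metadata extraction for MSN.
Our research strategy is data-driven statistical learning: we collect large-scale monolingual and bilingual corpora from the Web and from third parties, and use machine learning to acquire linguistic and translation knowledge. This knowledge then supports our research projects. Our main research areas are described below.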
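Corpus Collection, Classification, and Annotation
Building a large text corpus as the infrastructure for statistical learning is an ongoing effort. Corpus material can come from many kinds of documents, and in recent years the Internet has become an increasingly important source of large-scale language data. Classifying text by topic and writing style helps in building both a balanced corpus and domain-specific corpora. Corpus annotation is a challenging task. It includes word segmentation, named entity recognition, part-of-speech tagging, syntactic and semantic annotation, and annotation of coreference relations. The annotation tools can be applied directly in many natural language applications, and the annotated corpora can serve as supervised training data for learning statistical language models for different purposes.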
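Word segmentation is the first annotation step listed above. As a minimal sketch of what a segmenter does, the following uses forward maximum matching, a simple dictionary-based baseline. It is an illustration only, not the group's method; the function name `forward_max_match`, the `max_len` parameter, and the toy lexicon are assumptions made for this example.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation.

    At each position, take the longest substring (up to max_len characters)
    found in the lexicon; fall back to a single character otherwise.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon, for illustration only.
toy_lexicon = {"自然", "语言", "自然语言", "处理"}
print(forward_max_match("自然语言处理", toy_lexicon))
```

Statistical segmenters trained on annotated corpora generally handle ambiguity and unknown words better than this greedy baseline, which is one reason annotated data matters.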
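Natural Language Processing for Asian Languages
Text Information Mining and Extraction (TIME) is a platform for extracting key information from documents in different languages and formats, such as web pages, Word documents, and PowerPoint files. The extracted information can support information retrieval, search engines, machine translation, and automatic summarization. The platform covers a range of technologies, including tokenization, named entity recognition, semantic tagging or extraction of sentence skeleton information, key term extraction, and automatic summarization.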
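Statistical Machine Translation
The statistical machine translation project focuses on helping and guiding non-native English users, such as Chinese, Japanese, and Korean speakers, to search, read, and write English with ease. To this end, the Natural Language Computing group applies statistical machine translation to provide meaningful translations at different levels (words, phrases, collocations, and sentences). Building on this machine translation technology, the group is applying new techniques to search engines, for example word-based translation of search queries and sentence-based translation of search result snippets.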
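Word-based query translation can be sketched as follows: each query term is looked up in a probabilistic translation lexicon, and candidate target-language queries are ranked by the product of the word translation probabilities. This is a simplified illustration under stated assumptions, not the group's system; `translate_query`, `top_k`, and the toy lexicon with its probabilities are all hypothetical.

```python
from itertools import product


def translate_query(query_terms, lexicon, top_k=2):
    """Rank candidate translations of a query.

    lexicon maps a source term to {target_word: probability}.
    Terms missing from the lexicon are passed through unchanged.
    """
    per_term = []
    for term in query_terms:
        options = lexicon.get(term, {term: 1.0})
        # Keep only the top_k most probable translations of each term.
        best = sorted(options.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        per_term.append(best)

    candidates = []
    for combo in product(*per_term):
        words = [word for word, _ in combo]
        score = 1.0
        for _, prob in combo:
            score *= prob
        candidates.append((words, score))
    # Highest-scoring candidate query first.
    return sorted(candidates, key=lambda c: c[1], reverse=True)


# Hypothetical lexicon with made-up probabilities, for illustration only.
toy_lexicon = {
    "机器": {"machine": 0.9, "robot": 0.1},
    "翻译": {"translation": 0.8, "interpretation": 0.2},
}
for words, score in translate_query(["机器", "翻译"], toy_lexicon):
    print(" ".join(words), round(score, 2))
```

A real system would estimate the lexicon from bilingual corpora and would typically also score candidates with a target-language model, so that word choices are judged in context rather than independently.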
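Information Retrieval
Our goal is to use natural language technology to improve traditional information retrieval in indexing, spelling correction, web page relevance ranking, and other areas. We will first build search engines supported by deep NLP in several vertical domains, then gradually extend to general-domain search. We have studied optimal indexing schemes for Chinese search, new query expansion methods, related-term extraction, word similarity computation, fusion of information from multiple search engines, base noun phrase extraction, and precise query translation using statistical and example-based methods. We took part in cross-language retrieval evaluations such as TREC-9 and NTCIR-III and achieved the best results. We also took part in the TREC-10 web retrieval track.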
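Because Chinese text has no spaces between words, the choice of indexing unit matters for Chinese search. One common scheme is to index overlapping character bigrams, which needs no word segmenter. The sketch below illustrates that scheme only; the source does not say which scheme the group found best, and the names `char_bigrams`, `build_index`, and `search` and the sample documents are assumptions for this example.

```python
from collections import defaultdict


def char_bigrams(text):
    """Split text into overlapping character bigrams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


def build_index(docs):
    """Build an inverted index: bigram -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_bigrams(text):
            index[gram].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every bigram of the query."""
    grams = char_bigrams(query)
    if not grams:
        return set()
    result = set(index.get(grams[0], set()))
    for gram in grams[1:]:
        result &= index.get(gram, set())
    return result


# Hypothetical sample documents, for illustration only.
docs = {1: "信息检索系统", 2: "机器翻译系统"}
index = build_index(docs)
print(search(index, "检索"))
print(search(index, "系统"))
```

Bigram indexing favors recall and avoids segmentation errors, at the cost of a larger index and some false matches across word boundaries; word-based and hybrid indexes make the opposite trade-off.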
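Automatic Question Answering
Automatic question answering is a key technology for the next generation of search engines now under development. Given a question, a search engine user wants a precise answer rather than a long list of results. The group is developing techniques for question formulation, question rewriting, and answer extraction. Building on this work, the group also hopes to build domain-specific chatbots and to mine chatbot knowledge automatically from forums, blogs, and other web resources.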
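Question rewriting can be illustrated with a small pattern-based sketch: a question is turned into declarative phrases that are likely to appear next to its answer in text, and those phrases are then used as search strings. The rules below are hypothetical examples written for this illustration, not the group's rule set, and `rewrite_question` is an assumed name.

```python
import re

# Hypothetical rewrite rules: (question pattern, answer-bearing templates).
REWRITE_RULES = [
    (re.compile(r"^what is (?P<x>.+?)\??$", re.IGNORECASE),
     ["{x} is", "{x} refers to"]),
    (re.compile(r"^who invented (?P<x>.+?)\??$", re.IGNORECASE),
     ["{x} was invented by"]),
]


def rewrite_question(question):
    """Rewrite a question into phrases that may precede its answer.

    Falls back to the original question when no rule applies.
    """
    q = question.strip()
    for pattern, templates in REWRITE_RULES:
        match = pattern.match(q)
        if match:
            return [t.format(x=match.group("x")) for t in templates]
    return [q]


print(rewrite_question("What is machine translation?"))
print(rewrite_question("Who invented the telephone?"))
```

The text that follows a matched phrase in a retrieved passage then becomes a candidate answer, which an answer extraction step can rank.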
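Language Games
The Natural Language Computing group was the first in the world to propose a technique for generating Chinese couplets automatically, an important innovation in artificial intelligence. The system is available as a language game for Internet and mobile phone users (http://duilian.msra.cn). It takes a first line supplied by the user and generates a matching second line and a horizontal scroll. Users can learn Chinese language and traditional culture while being entertained.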
- Kun Yu, Gang Guan, and Ming Zhou. Resume Information Extraction with Cascaded Hybrid Model. ACL-2005 (http://acl.ldc.upenn.edu/P/P05/P05-1062.pdf).
- Yunbo Cao, Jun Xu, Tie-Yan Liu, Hang Li, Yalou Huang, and Hsiao-Wuen Hon. An Ordinal Regression Method for Document Retrieval. To appear in SIGIR 2006.
- Jun Xu, Yunbo Cao, Hang Li, Min Zhao, and Yalou Huang. A Supervised Learning Approach to Search of Definitions. To appear in Journal of Computer Science and Technology, 2006.
- Yunhua Hu, Hang Li, Yunbo Cao, Li Teng, Dmitriy Meyerzon, and Qinghua Zheng. Automatic Extraction of Titles from General Documents Using Machine Learning. To appear in Journal of Information Processing and Management, 2006.
- Yunbo Cao, Jingjing Liu, Shenghua Bao, and Hang Li. Research on Expert Search at Enterprise Track of TREC 2005. TREC 2005.
- Jun Xu, Yunbo Cao, Hang Li, and Min Zhao. Ranking Definitions with Supervised Learning Methods. WWW 2005.
- Hang Li, Yunbo Cao, Jun Xu, Yunhua Hu, Shenjie Li, and Dmitriy Meyerzon. A New Approach to Intranet Search Based on Information Extraction. CIKM 2005.
- Jie Tang, Hang Li, Yunbo Cao, and Zhaohui Tang. Email Data Cleaning. Proceedings of KDD 2005.
- Yunhua Hu, Guomao Xin, Ruihua Song, Guoping Hu, Shuming Shi, Yunbo Cao, and Hang Li. Title Extraction from Bodies of HTML Documents and Its Application to Web Page Retrieval. The 28th Annual International ACM SIGIR Conference (SIGIR 2005), August 2005.
- Chin-Yew Lin, Guihong Cao, Jianfeng Gao, and Jian-Yun Nie. An Information-Theoretic Approach to Automatic Evaluation of Summaries. In Proceedings of the Human Language Technology Conference - North American chapter of the Association for Computational Linguistics annual meeting (HLT/NAACL-2006), New York City, NY.
- Liang Zhou, Chin-Yew Lin, Dragos Stefan Munteanu, and Eduard Hovy. ParaEval: Using Paraphrases to Evaluate Summaries Automatically. In Proceedings of the Human Language Technology Conference - North American chapter of the Association for Computational Linguistics annual meeting (HLT/NAACL-2006), New York City, NY.
- Liang Zhou, Chin-Yew Lin, and Eduard Hovy. Summarizing Answers for Complicated Questions. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy.
- Eduard Hovy, Chin-Yew Lin, Liang Zhou, and Junichi Fukumoto. Automated Summarization Evaluation with Basic Elements. In Proceedings of the Fifth Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy.
- R. Srihari, W. Li, T. Cornel, and C. Niu. InfoXtract: A customizable intermediate level information extraction engine. Journal of Natural Language Engineering, 12(4), 1-37, 2006.
- Ying Zhang, Ke Wu, Jianfeng Gao, and P. Vines. Automatic Acquisition of Chinese-English Parallel Corpus from the Web. In Proceedings of ECIR-06, 28th European Conference on Information Retrieval, 2006.
- Zhengyu Zhou, Jianfeng Gao, Frank K. Soong, and Helen Meng. A Comparative Study of Discriminative Methods for Reranking LVCSR N-Best Hypotheses in Domain Adaptation and Generalization. ICASSP 2006.
- Jianfeng Gao, Haoliang Qi, Xinsong Xia, and Jian-Yun Nie. Linear Discriminant Model for Information Retrieval. The 28th Annual International ACM SIGIR Conference (SIGIR 2005), August 2005.
- Jianfeng Gao, Jian-Yun Nie, Guangyuan Wu, and Guihong Cao. Dependence Language Model for Information Retrieval. In SIGIR-2004, Sheffield, UK, July 25-29, 2004.
- Dong-Hui Feng, Ya-Juan Lv, and Ming Zhou. A New Approach for English-Chinese Named Entity Alignment. 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, July 2004.
More publications...