Information Extraction Technology (信息抽取技术.ppt)
Information Extraction (IE) Technology

Outline
- Introduction to and concept of information extraction (IE)
- Research activities related to information extraction
- Levels and types of information extraction
- Information extraction systems and their applications
- Developing a Chinese information extraction system

1. Introduction to and Concept of Information Extraction (IE)

Starting with the CLEF project
A Co-operative Clinical E-Science Framework (CLEF)
Funded by the UK Medical Research Council
Descriptive information:
- Clinical histories
- Radiology reports
- Pathology reports
- Annotations on genomic and image databases
- Technical literature
- Web-based resources

ROYAL MARSDEN NHS TRUST - PATIENT CASE NOTE
324A621F: MRS Dorothy Smith DOB: 12/05/44
21, Park Crescent Basingstoke B12 Q13
16 Dec 1992 Seen in General Surgical
This lady who has had a mastectomy and left open capsulotomy and removal of her prosthesis was seen by me in the clinic today on behalf of Mr Peterson. She has extensive bony lymphoedema in her left arm which does not seem to be getting any better although she is more or less reconciled to the problem. The original problem was that she complained of shooting pain in the direction of ulna nerve and although there does not seem to be any evidence of local, regional or distant recurrence the pain itself warrants management in a pain clinic. Mrs Smith could be seen in the pain clinic at the Marsden but as this would involve a lot of travelling would like to be treated nearer her home. I wonder whether it would be possible for you to investigate if there is a pain clinic available at Basingstoke as I am sure Dotty could be treated and benefit from its management. I have otherwise arranged for her to be seen in the clinic again in a years time. There are no signs of recurrence at this time.
Mr Thomas Partridge

Clinical report

#NHS TRUST-PATIENT CASE NOTE#: #DOB: 1944
CLEF-RMH-Entry-Key: 52A4F6DB2B46E
AB 1992 Seen in General Surgical
This lady who has had a mastectomy and left open capsulotomy and removal of her prosthesis was seen by me in the clinic today on behalf of XXXXXXXXXXX. She has extensive bony lymphoedema in her left arm which does not seem to be getting any better although she is more or less reconciled to the problem. The original problem was that she complained of shooting pain in the direction of ulna nerve and although there does not seem to be any evidence of local, regional or distant recurrence the pain itself warrants management in a pain clinic. XXXXXXXXX could be seen in the pain clinic at the XXXXXXX but as this would involve a lot of travelling would like to be treated nearer her home. I wonder whether it would be possible for you to investigate if there is a pain clinic available at XXXXXXXXXXX as I am sure XXXXX could be treated and benefit from its management. I have otherwise arranged for her to be seen in the clinic again in a years time. There are no signs of recurrence at this time.
5213A4F612F1

Extracting key information from text
Using templates or related knowledge resources, mark up the important information and the relations within it:
- Interventions (past treatment)
- Problems (remaining problems)
- Problem Site (site of the problem)
- Locations (places of care)
- Time (temporal attributes)

Extracting key information from text
Collect the extracted information (Interventions, Problems, Problem Site, Locations, Time), possibly across multiple documents, to build up a patient record.
Can related events be linked automatically? What happened & why? What was done & why?
Example link: mastectomy, caused_by, bony lymphoedema

12.10.20 Coryza: chest NAD: reassure
13.10.20 URTI: wheezy: amoxycillin
20.10.20 Anxiety: lump under arm: staging scan
24.10.21 PEFR: 300:
10.11.21 PEFR: 400: CXR requested
12.11.21 CXR Basal Consolidation: erythromycin
27.11.21: Chest clear:
07.03.30 Depression: recurrence: Paroxetine
19.04.30 WCC OK
01.06.31: rpt Rx paroxetine
18.10.31 Pain L arm: coproxamol
03.03.31 Viral URTI: PEFR 350: salbutamol
04.03.34 WCC Abnormal:
30.05.34: BP, ECG NAD:

Patient record summary
Produce a very short summary of the patient record.
CLEF-RMH-Entry-Key: 52A4F6DB2B46E
Maria Sklodowska-Curie

Fields responding to today's information overload and data deluge:
- Natural language processing (NLP)
- Human language technology (HLT)
- Computational linguistics (CL)
- Knowledge engineering (KE)
- Knowledge management (KM)
- Semantic Web
- Agent-based computing
- Web intelligence

Developed countries in Europe and America have put forward the concept of "Knowledge Technologies":
- Knowledge acquisition
- Knowledge modelling
- Knowledge representation and visualisation
- Knowledge interpretation and sharing
- Knowledge reuse
- Knowledge retrieval
- Knowledge publishing and distribution
- Knowledge maintenance

Two lines of research:
- The KDD and data mining line: discover new knowledge from structured data (such as data held in databases).
- The natural language processing (NLP) and text mining line: discover new knowledge from unstructured or semi-structured data (such as Word, HTML, or PDF files). "Identify and extract trends and patterns of events from large amounts of unstructured data, and turn them into useful, understandable information."
Knowledge discovery and knowledge presentation systems that combine both lines.

1. Information Extraction (IE)
Information extraction
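(Information Extraction): a text-mining approach that is steadily maturing and attracting growing attention.

1. Introduction to and Concept of Information Extraction (IE)

Hamish Cunningham: "Information Extraction (IE) is a technology based on analysing natural language in order to extract snippets of information."
Information extraction is an input/output process.
- Input: previously unseen text
- Output: fixed-format, unambiguous data (information)
The extracted data can be:
- displayed directly to the user
- stored in a database or spreadsheet for later analysis
- used in an indexing system to support later search and retrieval

Douglas E. Appelt et al.: information retrieval versus information extraction
Information retrieval only finds the relevant documents (data) in a document collection (database) and presents them to the user. Information extraction does not merely indicate that a document suits the user's need; it extracts the pieces of information that actually meet that need and hands those to the user.
- Information retrieval: returns a subset of documents relevant to the query; the user must then analyse the documents.
- Information extraction: extracts facts (events) relevant to what the user needs; the user analyses the facts (events).

Information retrieval versus information extraction: summary
- Different functions.
- Different processing techniques. IR systems usually rely on statistics and keyword matching, treat text as a bag of words, and need no deep analysis or understanding of the text. IE usually relies on natural language processing and can be completed only after the sentences and discourse of the text have been analysed.
- Different domains of applicability. Because of the techniques used, IR systems are usually domain-independent, whereas IE systems are domain-dependent and can extract only the limited kinds of facts the system was configured for in advance.

Outline
- Introduction to and concept of information extraction (IE)
- Concepts and research activities related to information extraction
- Levels and types of information extraction
- Information extraction systems and their applications
- Application prospects of information extraction technology
- Developing a Chinese information extraction system

2. Concepts and Research Activities Related to Information Extraction

The development of IE is closely tied to the following research activities:
- MUC (Message Understanding Conference)
- MET (Multilingual Entity Task Evaluation)
- ACE (Automatic Content Extraction)
- DUC (Document Understanding Conferences)
- TDT

2.1 MUC
- MUC is to IE what TREC is to IR.
- MUC stands for Message Understanding Conference; it is sometimes also described as a Message Understanding Competition.
- It was launched in the late 1980s by DARPA (Defense Advanced Research Projects Agency) of the U.S. Department of Defense.

2.1 MUC
- MUC's sole task is "information extraction": analyse free text, identify events of a particular type, and fill the information about each event into the corresponding data template.
- Seven MUCs were held in total:
  - The initial MUC-1 and MUC-2 focused on extraction from military messages (naval operations reports).
  - MUC-3 to MUC-7, held from the 1990s onward, focused mainly on extraction from news articles, on topics including terrorist activity, international joint ventures, and corporate management succession.
- MUC did much to advance the research agenda of IE, the classification of IE approaches, and the evaluation of IE systems.

2.2 MET
- MET (Multilingual Entity Task Evaluation) is another evaluation programme initiated by DARPA.
- Its main task is named entity extraction from news documents in several languages, including Japanese, Chinese, and Spanish.
- The MET-1 and MET-2 evaluations were held in 1996 and 1998 respectively.

2.3 ACE
- The ACE (Automatic Content Extraction) programme is run jointly by the U.S. National Security Agency (NSA), the National Institute of Standards and Technology (NIST, part of the U.S. Department of Commerce), and the Central Intelligence Agency (CIA).
- It targets automatic content extraction from three kinds of source: online news from the web, broadcast news obtained through ASR (automatic speech recognition), and newspaper news obtained through OCR (optical character recognition).
- Two goals:
  - to use automatic content extraction as a foundation for data mining, link analysis, automatic summarisation, and similar tasks;
  - to improve information analysis capability by delivering the right information to the right analysts.

2.3 ACE
- The programme runs for five years.
- ACE Phase 1 (July 1999 to December 2000) gave priority to Entity Detection and Tracking (EDT).
- ACE Phase 2 (2001 to present) is known as EDT+RDC.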
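
The template-filling task that the MUC slides describe (analyse free text, then write what is found into fixed slots) can be illustrated with a small Python sketch. It reuses the slot names from the CLEF example (Interventions, Problems, Problem Site, Locations, Time). Everything else is an assumption made up for illustration: the gazetteer terms, the date pattern, the `ClinicalTemplate` class, the `fill_template` function, and the abridged sample text. Plain substring lookup stands in here for the sentence and discourse analysis a real IE system would need.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Hypothetical mini-gazetteers: hand-picked terms, for illustration only.
GAZETTEER = {
    "interventions": ["mastectomy", "capsulotomy", "removal of her prosthesis"],
    "problems": ["lymphoedema", "shooting pain"],
    "problem_site": ["left arm", "ulna nerve"],
    "locations": ["pain clinic", "Marsden", "Basingstoke"],
}

# Assumed date format for this sketch, e.g. "16 Dec 1992".
DATE_PATTERN = re.compile(
    r"\b\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}\b"
)


@dataclass
class ClinicalTemplate:
    """Fixed-format output record: one list of strings per slot."""
    interventions: List[str] = field(default_factory=list)
    problems: List[str] = field(default_factory=list)
    problem_site: List[str] = field(default_factory=list)
    locations: List[str] = field(default_factory=list)
    time: List[str] = field(default_factory=list)


def fill_template(text: str) -> ClinicalTemplate:
    """Fill each slot with the gazetteer terms that occur in the text."""
    template = ClinicalTemplate()
    lowered = text.lower()
    for slot, terms in GAZETTEER.items():
        # Case-insensitive substring match; keeps gazetteer order.
        setattr(template, slot, [t for t in terms if t.lower() in lowered])
    template.time = DATE_PATTERN.findall(text)
    return template


# Abridged from the case note quoted in the slides.
SAMPLE = (
    "16 Dec 1992 Seen in General Surgical. This lady who has had a mastectomy "
    "and left open capsulotomy and removal of her prosthesis was seen by me in "
    "the clinic today. She has extensive bony lymphoedema in her left arm. "
    "She complained of shooting pain in the direction of ulna nerve. "
    "I wonder whether there is a pain clinic available at Basingstoke."
)

record = fill_template(SAMPLE)
print(record)
```

The output is a fixed-format record rather than a ranked list of documents, which is the IR/IE distinction the slides draw. The sketch is also domain-dependent in the sense the slides mention: it can only find the terms its gazetteers were given in advance.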
Link: https://www.desk33.com/p-246432.html