What is the best way to find the category of an MS Word document in C#?

Updated: 2023-02-17 16:15:11



I am trying to determine the type of MS Word documents and categorize them for a project. The aim of the project is document clustering (i.e. grouping) based on the content of the documents, using semi-supervised learning over both labelled and unlabelled data. I am reading each document word by word in C#, but I can't find a way to categorize a document based on its content. Can anyone suggest a solution? Thanks.

That's called semantic analysis of a text. The easiest way is to define, for each document category, a set of words that are common to that category. Then you count how often the words of each set occur in the document, and you pick the best-matching category.
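The word-counting idea can be sketched roughly as follows. This is only a minimal illustration, not a tuned classifier: the class name `KeywordClassifier`, the method `Classify` and the sample vocabularies used in the usage note are made up for this example, and the vocabulary words are assumed to be lower-case.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: scores a text against per-category word lists.
public static class KeywordClassifier
{
    // Returns the category whose words occur most often in the text,
    // or null when no category word occurs at all.
    public static string Classify(string text, IDictionary<string, string[]> categories)
    {
        // Count every word of the document once (lower-cased, letters only).
        var counts = Regex.Matches(text.ToLowerInvariant(), @"\p{L}+")
            .Cast<Match>()
            .GroupBy(m => m.Value)
            .ToDictionary(g => g.Key, g => g.Count());

        string best = null;
        int bestScore = 0;
        foreach (var category in categories)
        {
            // Score = total occurrences of this category's words.
            int score = category.Value.Sum(w => counts.TryGetValue(w, out var c) ? c : 0);
            if (score > bestScore)
            {
                best = category.Key;
                bestScore = score;
            }
        }
        return best;
    }
}
```

It could be called with something like `KeywordClassifier.Classify(documentText, new Dictionary<string, string[]> { ["invoice"] = new[] { "invoice", "payment", "due" }, ["resume"] = new[] { "experience", "education", "skills" } })`. Raw counts favour long documents and large word lists, so in practice you would probably normalise the scores (for example by document length) before comparing them.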
If you need a deeper analysis, you have to make use of a thesaurus (a semantic graph of a language). For English you can use this one:
http://wordnet.princeton.edu/, but not every language has such a thesaurus ready-made :(
If you have to go even deeper, you will have to do some research. Start here: http://www.sersc.org/journals/IJSIP/vol1_no1/papers/07.pdf, http://en.wikipedia.org/wiki/Document_classification


Word files have the extension .doc or .docx. Read all the files from your drive in a loop, put the content of each file into a string, check whether it contains the values you are looking for (e.g. with string.Contains), and categorize the file accordingly.

Thanks,
Ambesha


It's not clear what you're doing when you read it "word by word", but if the document is being read using Open XML, you can just get the document properties (CoreFilePropertiesPart) and look at the subject, keywords or category.
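As a rough sketch of reading those properties: the example below does not use the Open XML SDK. It treats the .docx as a plain ZIP package and reads the core-properties XML with standard .NET classes only. It assumes the core properties are stored at `docProps/core.xml`, which is where Word normally writes them, although the package format does not strictly require that path. The class name `DocxCoreProperties` and the method `Read` are invented for this example. This only works for .docx, not for the older binary .doc format.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Xml.Linq;

// Hypothetical helper: reads category, subject and keywords from a .docx stream.
public static class DocxCoreProperties
{
    // XML namespaces used by the package core-properties part.
    static readonly XNamespace Cp =
        "http://schemas.openxmlformats.org/package/2006/metadata/core-properties";
    static readonly XNamespace Dc = "http://purl.org/dc/elements/1.1/";

    // Each value is null when the property (or the whole part) is missing.
    public static (string Category, string Subject, string Keywords) Read(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            // Assumption: the conventional location of the core properties.
            var entry = zip.GetEntry("docProps/core.xml");
            if (entry == null)
                return (null, null, null);

            using (var xml = entry.Open())
            {
                var root = XDocument.Load(xml).Root;
                // Casting a missing (null) XElement to string yields null.
                return ((string)root.Element(Cp + "category"),
                        (string)root.Element(Dc + "subject"),
                        (string)root.Element(Cp + "keywords"));
            }
        }
    }
}
```

Usage would be along the lines of `using (var fs = File.OpenRead(path)) { var props = DocxCoreProperties.Read(fs); }`. These fields are only useful if the authors actually filled them in, so in many document collections they will be empty and you will still need a content-based approach.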