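Reading the contents of Doc, Docx, and Pdf documents in C#
Doc documents: Microsoft Word 14.0 Object Library (a GAC object; Word must be installed before it can be called. The COM version number depends on which version of Word is installed)
Docx documents: Microsoft Word 14.0 Object Library (a GAC object; Word must be installed before it can be called. The COM version number depends on which version of Word is installed)
Pdf documents: PDFBox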
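The mapping above can be sketched as a small dispatch on the file extension. This is only an illustration: `ReaderSelector` and `LibraryFor` are hypothetical names that do not appear in the original post, and the returned strings merely name the library that would handle each format.

```csharp
using System;
using System.IO;

// Hypothetical helper (not from the original post): picks the library
// that the list above assigns to each document type.
public static class ReaderSelector
{
    public static string LibraryFor(string path)
    {
        // Path.GetExtension returns the extension including the leading dot,
        // or an empty string when the file name has none.
        string ext = Path.GetExtension(path).ToLowerInvariant();
        switch (ext)
        {
            case ".doc":
            case ".docx":
                return "Microsoft.Office.Interop.Word"; // needs an installed Word
            case ".pdf":
                return "PDFBox";
            default:
                throw new NotSupportedException("Unsupported file type: " + ext);
        }
    }
}
```

Calling `ReaderSelector.LibraryFor(@"C:\resume.docx")` would select the Word interop path, which is also the one used for `.doc` files.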
/* Author: GhostBear
 * Blog: Http://blog.csdn.net/ghostbear */
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Text.RegularExpressions;
using org.pdfbox.pdmodel;
using org.pdfbox.util;
using Microsoft.Office.Interop.Word;

namespace TestPdfReader
{
    class Program
    {
        static void Main(string[] args)
        {
            // PDF
            PDDocument doc = PDDocument.load(@"C:\resume.pdf");
            PDFTextStripper pdfStripper = new PDFTextStripper();
            string text = pdfStripper.getText(doc);
            doc.close(); // release the file once the text has been extracted
            // Turn tabs and line breaks into spaces, then remove every space
            string result = text.Replace('\t', ' ').Replace('\n', ' ').Replace('\r', ' ').Replace(" ", "");
            Console.WriteLine(result);
            // Doc, Docx
            object docPath = @"C:\resume.doc";
            object docxPath = @"C:\resume.docx"; // pass this to Open instead of docPath to read the .docx file
            object missing = System.Reflection.Missing.Value;
            object readOnly = true;
            Application wordApp = new Application();
            string text2;
            try
            {
                // Only FileName (1st argument) and ReadOnly (3rd argument) are set; the rest use Word's defaults
                Document wordDoc = wordApp.Documents.Open(ref docPath,
                    ref missing, ref readOnly, ref missing, ref missing,
                    ref missing, ref missing, ref missing, ref missing,
                    ref missing, ref missing, ref missing, ref missing,
                    ref missing, ref missing, ref missing);
                text2 = FilterString(wordDoc.Content.Text);
                wordDoc.Close(ref missing, ref missing, ref missing);
            }
            finally
            {
                // Quit even if reading fails, so that no hidden Word process is left running
                wordApp.Quit(ref missing, ref missing, ref missing);
            }
            Console.WriteLine(text2);
            Console.Read();
        }
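        // Removes all whitespace, plus the \a (BEL) character that can appear in text read from Word
        private static string FilterString(string input)
        {
            // \s already covers \t and \n, so one character class is enough
            return Regex.Replace(input, @"[\a\s]+", "");
        }
    }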
}</pre>
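Both the PDF branch and `FilterString` in the listing remove every whitespace character, so words in the output run together. If readable text is wanted, one option is to collapse each whitespace run to a single space instead. The sketch below shows the two variants side by side; `TextCleaner`, `Strip`, and `Collapse` are hypothetical names introduced for illustration and are not part of the original code.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper (not from the original post) contrasting two ways
// of cleaning extracted text.
public static class TextCleaner
{
    // Same effect as FilterString in the listing: every run of whitespace
    // or \a (BEL) characters is deleted.
    public static string Strip(string input)
    {
        return Regex.Replace(input, @"[\a\s]+", "");
    }

    // Alternative: replace each run with one space so that word
    // boundaries survive, then trim the ends.
    public static string Collapse(string input)
    {
        return Regex.Replace(input, @"[\a\s]+", " ").Trim();
    }
}
```

For an input such as `"Jane  Doe\r\n\aEngineer "`, `Strip` yields `JaneDoeEngineer` while `Collapse` yields `Jane Doe Engineer`. Which one fits depends on whether the text is only being matched against keywords or is meant to be read.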