Convert Word doc and docx format to PDF in .NET Core without Microsoft.Office.Interop

Updated: 2023-02-08 09:21:05

I need to display Word .doc and .docx files in a browser. There's no real client-side way to do this, and for legal reasons these documents can't be shared with Google Docs or Microsoft Office 365.

Browsers can't display Word, but can display PDF, so I want to convert these docs to PDF on the server and then display that.

I know this can be done using Microsoft.Office.Interop.Word, but my application is .NET Core and does not have access to Office interop. It could be running on Azure, but it could also be running in a Docker container on anything else.

There appear to be lots of similar questions to this, but most are asking about full-framework .NET or assume that the server runs Windows, so their answers are no use to me.

How do I convert .doc and .docx files to .pdf without access to Microsoft.Office.Interop.Word?

This was such a pain, no wonder all the third party solutions are charging $500 per developer.

The good news is that the Open XML SDK recently added support for .NET Standard, so it looks like you're in luck with the .docx format.

The bad news is that, at the moment, there isn't a lot of choice for PDF generation libraries on .NET Core. Since it doesn't look like you want to pay for one and you can't legally use a third-party service, we have little choice except to roll our own.

The main problem is getting the Word document content transformed to PDF. One popular approach is to read the Docx into HTML and export that to PDF. It was hard to find, but there is a .NET Core version of the OpenXMLSDK-PowerTools that supports transforming Docx to HTML. The pull request is "about to be accepted"; you can get it from here:

https://github.com/OfficeDev/Open-Xml-PowerTools/tree/abfbaac510d0d60e2f492503c60ef897247716cf

Now that we can extract the document content to HTML, we need to convert it to PDF. There are a few libraries that convert HTML to PDF. For example, DinkToPdf is a cross-platform wrapper around libwkhtmltox, the WebKit-based HTML to PDF library.

I thought DinkToPdf was better than https://code.msdn.microsoft.com/How-to-export-HTML-to-PDF-c5afd0ce


Docx to HTML

Let's put this all together. Download the OpenXMLSDK-PowerTools .NET Core project and build it (just OpenXMLPowerTools.Core and OpenXMLPowerTools.Core.Example; ignore the other project).

Set OpenXMLPowerTools.Core.Example as the StartUp project. Add a Word document to the project (e.g. test.docx) and set the file's Copy to Output Directory property to Copy if newer.

Run the console project:

static void Main(string[] args)
{
    // Needs System.IO.Packaging, System.Xml.Linq,
    // DocumentFormat.OpenXml.Packaging and OpenXmlPowerTools.
    using (var source = Package.Open(@"test.docx"))
    using (var document = WordprocessingDocument.Open(source))
    {
        HtmlConverterSettings settings = new HtmlConverterSettings();
        XElement html = HtmlConverter.ConvertToHtml(document, settings);

        Console.WriteLine(html.ToString());

        // Write the same markup to test.html in the working directory.
        using (var writer = File.CreateText("test.html"))
        {
            writer.WriteLine(html.ToString());
        }
    }
    Console.ReadLine();
}

Make sure test.docx is a valid Word document with some text, otherwise you might get an error:

the specified package is invalid. the main part is missing

If you run the project you will see that the HTML looks almost exactly like the content in the Word document.

However, if you try a Word document with pictures or links, you will notice they're missing or broken.

This CodeProject article addresses these issues: https://www.codeproject.com/Articles/1162184/Csharp-Docx-to-HTML-to-Docx

I had to change the static Uri FixUri(string brokenUri) method to return a Uri, and I added user-friendly error messages.

static void Main(string[] args)
{
    var fileInfo = new FileInfo(@"c:\temp\MyDocWithImages.docx");
    string fullFilePath = fileInfo.FullName;
    string htmlText = string.Empty;
    try
    {
        htmlText = ParseDOCX(fileInfo);
    }
    catch (OpenXmlPackageException e)
    {
        if (e.ToString().Contains("Invalid Hyperlink"))
        {
            using (FileStream fs = new FileStream(fullFilePath,FileMode.OpenOrCreate, FileAccess.ReadWrite))
            {
                UriFixer.FixInvalidUri(fs, brokenUri => FixUri(brokenUri));
            }
            htmlText = ParseDOCX(fileInfo);
        }
    }

    var writer = File.CreateText("test1.html");
    writer.WriteLine(htmlText.ToString());
    writer.Dispose();
}
        
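public static Uri FixUri(string brokenUri)
{
    // new Uri(string) throws for anything that is not an absolute URI, so a
    // blank string or a bare e-mail address must not be passed to it directly.
    if (brokenUri.Contains("mailto:"))
    {
        // Drop the "mailto:" prefix and keep the address as a relative URI.
        string address = brokenUri.Remove(0, "mailto:".Length);
        if (Uri.TryCreate(address, UriKind.RelativeOrAbsolute, out Uri fixedUri))
            return fixedUri;
    }

    // Anything else is replaced with a placeholder link.
    return new Uri("http://broken-link/");
}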

public static string ParseDOCX(FileInfo fileInfo)
{
    try
    {
        byte[] byteArray = File.ReadAllBytes(fileInfo.FullName);
        using (MemoryStream memoryStream = new MemoryStream())
        {
            memoryStream.Write(byteArray, 0, byteArray.Length);
            using (WordprocessingDocument wDoc =
                                        WordprocessingDocument.Open(memoryStream, true))
            {
                int imageCounter = 0;
                var pageTitle = fileInfo.FullName;
                var part = wDoc.CoreFilePropertiesPart;
                if (part != null)
                    pageTitle = (string)part.GetXDocument()
                                            .Descendants(DC.title)
                                            .FirstOrDefault() ?? fileInfo.FullName;

                WmlToHtmlConverterSettings settings = new WmlToHtmlConverterSettings()
                {
                    AdditionalCss = "body { margin: 1cm auto; max-width: 20cm; padding: 0; }",
                    PageTitle = pageTitle,
                    FabricateCssClasses = true,
                    CssClassPrefix = "pt-",
                    RestrictToSupportedLanguages = false,
                    RestrictToSupportedNumberingFormats = false,
                    ImageHandler = imageInfo =>
                    {
                        ++imageCounter;
                        string extension = imageInfo.ContentType.Split('/')[1].ToLower();
                        ImageFormat imageFormat = null;
                        if (extension == "png") imageFormat = ImageFormat.Png;
                        else if (extension == "gif") imageFormat = ImageFormat.Gif;
                        else if (extension == "bmp") imageFormat = ImageFormat.Bmp;
                        else if (extension == "jpeg") imageFormat = ImageFormat.Jpeg;
                        else if (extension == "tiff")
                        {
                            extension = "gif";
                            imageFormat = ImageFormat.Gif;
                        }
                        else if (extension == "x-wmf")
                        {
                            extension = "wmf";
                            imageFormat = ImageFormat.Wmf;
                        }

                        if (imageFormat == null) return null;

                        string base64 = null;
                        try
                        {
                            using (MemoryStream ms = new MemoryStream())
                            {
                                imageInfo.Bitmap.Save(ms, imageFormat);
                                var ba = ms.ToArray();
                                base64 = System.Convert.ToBase64String(ba);
                            }
                        }
                        catch (System.Runtime.InteropServices.ExternalException)
                        { return null; }

                        ImageFormat format = imageInfo.Bitmap.RawFormat;
                        ImageCodecInfo codec = ImageCodecInfo.GetImageDecoders()
                                                    .First(c => c.FormatID == format.Guid);
                        string mimeType = codec.MimeType;

                        string imageSource =
                                string.Format("data:{0};base64,{1}", mimeType, base64);

                        XElement img = new XElement(Xhtml.img,
                                new XAttribute(NoNamespace.src, imageSource),
                                imageInfo.ImgStyleAttribute,
                                imageInfo.AltText != null ?
                                    new XAttribute(NoNamespace.alt, imageInfo.AltText) : null);
                        return img;
                    }
                };

                XElement htmlElement = WmlToHtmlConverter.ConvertToHtml(wDoc, settings);
                var html = new XDocument(new XDocumentType("html", null, null, null),
                                                                            htmlElement);
                var htmlString = html.ToString(SaveOptions.DisableFormatting);
                return htmlString;
            }
        }
    }
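    catch (OpenXmlPackageException)
    {
        // Rethrow so that Main can repair invalid hyperlinks and retry.
        throw;
    }
    catch
    {
        return "The file is either open (please close it) or contains corrupt data.";
    }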
}

You may need the System.Drawing.Common NuGet package to use ImageFormat.

Now the images come through.

If you only want to show Word .docx files in a web browser, it's better not to convert the HTML to PDF, as that will significantly increase bandwidth. You could store the HTML in a file system, in the cloud, or in a database using a VPP technology.


HTML to PDF

Next, we need to pass the HTML to DinkToPdf. Download the DinkToPdf solution (90 MB) and build it. Restoring all the packages and compiling the solution takes a while.

IMPORTANT:

The DinkToPdf library requires the libwkhtmltox.so and libwkhtmltox.dll files in the root of your project if you want to run on Linux and Windows. There's also a libwkhtmltox.dylib file for Mac if you need it.

These library files are in the v0.12.4 folder. Depending on your PC (32-bit or 64-bit), copy the 3 files to the DinkToPdf-master\DinkToPdf.TestConsoleApp\bin\Debug\netcoreapp1.1 folder.
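A missing native library usually only shows up as a load failure at conversion time, so it can help to check for it at startup. The following is a minimal sketch using only the .NET base class library. The class name NativeLibraryCheck and the choice of AppContext.BaseDirectory (the build output folder) as the place to look are assumptions for illustration, not part of DinkToPdf.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

static class NativeLibraryCheck
{
    // File name of the wkhtmltox build that matches the current operating system.
    public static string LibraryFileName()
    {
        if (RuntimeInformation.IsOSPlatform(OSPlatform.Windows)) return "libwkhtmltox.dll";
        if (RuntimeInformation.IsOSPlatform(OSPlatform.OSX)) return "libwkhtmltox.dylib";
        return "libwkhtmltox.so";
    }

    // True when the library file exists in the given directory.
    public static bool IsPresent(string directory) =>
        File.Exists(Path.Combine(directory, LibraryFileName()));
}

class Program
{
    static void Main()
    {
        // Assumption: the native files were copied next to the executable.
        string dir = AppContext.BaseDirectory;
        string name = NativeLibraryCheck.LibraryFileName();

        Console.WriteLine(NativeLibraryCheck.IsPresent(dir)
            ? "Found " + name
            : "Missing " + name + " in " + dir);
    }
}
```

This only confirms that the file is there. It does not verify that the 32-bit or 64-bit build matches the process, or that the library's own dependencies are installed.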

IMPORTANT 2:

Make sure that you have libgdiplus installed in your Docker image or on your Linux machine. The libwkhtmltox.so library depends on it.

Set DinkToPdf.TestConsoleApp as the StartUp project and change the Program.cs file to read the htmlContent from the HTML file saved with Open-Xml-PowerTools instead of the Lorem Ipsum text.

var doc = new HtmlToPdfDocument()
{
    GlobalSettings = {
        ColorMode = ColorMode.Color,
        Orientation = Orientation.Landscape,
        PaperSize = PaperKind.A4,
    },
    Objects = {
        new ObjectSettings() {
            PagesCount = true,
            HtmlContent = File.ReadAllText(@"C:\TFS\Sandbox\Open-Xml-PowerTools-abfbaac510d0d60e2f492503c60ef897247716cf\ToolsTest\test1.html"),
            WebSettings = { DefaultEncoding = "utf-8" },
            HeaderSettings = { FontSize = 9, Right = "Page [page] of [toPage]", Line = true },
            FooterSettings = { FontSize = 9, Right = "Page [page] of [toPage]" }
        }
    }
};
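The snippet above only describes the document; it does not yet produce a PDF. A minimal sketch of the conversion step might look like the following. It assumes the DinkToPdf package is referenced and libwkhtmltox can be loaded at run time. The file names test1.html and test1.pdf are placeholders, and the settings are trimmed down to keep the example short.

```csharp
using System.IO;
using DinkToPdf;

class PdfExample
{
    static void Main()
    {
        // Assumption: test1.html is the file written by the Docx to HTML step.
        string html = File.ReadAllText("test1.html");

        var doc = new HtmlToPdfDocument
        {
            GlobalSettings = { PaperSize = PaperKind.A4 },
            Objects =
            {
                new ObjectSettings
                {
                    HtmlContent = html,
                    WebSettings = { DefaultEncoding = "utf-8" }
                }
            }
        };

        // PdfTools wraps the native libwkhtmltox library;
        // BasicConverter runs the conversion on the calling thread.
        var converter = new BasicConverter(new PdfTools());
        byte[] pdf = converter.Convert(doc);

        File.WriteAllBytes("test1.pdf", pdf);
    }
}
```

In a web application that converts documents from several requests at once, DinkToPdf's SynchronizedConverter, registered as a singleton, is probably the safer choice, since the native library is not designed for concurrent use.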

Comparing the Docx with the resulting PDF, the result is quite impressive, and I doubt many people would pick out many differences (especially if they never see the original).

P.S. I realise you wanted to convert both .doc and .docx to PDF. I'd suggest making a service yourself to convert .doc to .docx using a specific non-server Windows/Microsoft technology. The .doc format is binary and is not intended for server-side automation of Office.
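If both formats arrive through the same upload path, the server needs to tell them apart before choosing a conversion route, and file extensions are not always reliable. Here is a rough sketch that looks at the first bytes of the file instead. Binary .doc files are stored in an OLE compound file and .docx files are ZIP archives, so each has a recognisable signature. The names WordFormat and WordFormatSniffer are made up for this example. The check identifies the container only, so an .xls file would also be reported as Doc and any other ZIP archive as Docx.

```csharp
using System;
using System.IO;

enum WordFormat { Unknown, Doc, Docx }

static class WordFormatSniffer
{
    // Signature of an OLE compound file, the container used by binary .doc.
    static readonly byte[] OleMagic = { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // Signature of a ZIP local file header, the container used by .docx.
    static readonly byte[] ZipMagic = { 0x50, 0x4B, 0x03, 0x04 };

    public static WordFormat Detect(Stream stream)
    {
        var header = new byte[8];
        int read = 0;

        // Stream.Read may return fewer bytes than requested, so loop until
        // the header is full or the stream ends.
        while (read < header.Length)
        {
            int n = stream.Read(header, read, header.Length - read);
            if (n == 0) break;
            read += n;
        }

        if (StartsWith(header, read, OleMagic)) return WordFormat.Doc;
        if (StartsWith(header, read, ZipMagic)) return WordFormat.Docx;
        return WordFormat.Unknown;
    }

    static bool StartsWith(byte[] data, int length, byte[] magic)
    {
        if (length < magic.Length) return false;
        for (int i = 0; i < magic.Length; i++)
            if (data[i] != magic[i]) return false;
        return true;
    }
}

class Program
{
    static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("Usage: pass the path of a Word file");
            return;
        }

        using (var fs = File.OpenRead(args[0]))
        {
            Console.WriteLine(WordFormatSniffer.Detect(fs));
        }
    }
}
```

Files detected as Docx can go straight into the Docx to HTML to PDF pipeline above, while files detected as Doc would be handed to the separate .doc to .docx conversion service first.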