c# - Extracting a Specific Text from PDF using RegEx - Stack Overflow

Without getting into too much detail about how we ended up in this situation (a lot of poor business decisions), I need to find the text: "SomeID=[Integer]" from a PDF file (e.g. SomeID=123456). Ultimately, what I need is the 123456. This text will only be on the first page of the PDF. It's actually going to be in the color white so it's "invisible".

My initial thought was to grab all the text from the first page of the PDF and then use RegEx to parse for "SomeID=[Integer]". I do not care about any of the other text in this PDF. I only care about finding "SomeID=" and the integer that follows.

What is a simple way to get all the text from a PDF without using a NuGet library?

But, I can try to get one approved if there's one that is solid.

Asked Feb 4 at 21:43 by user3121062; edited Feb 4 at 23:30 by K J.
  • If you want to hide meta data, I think you could just append it to the end of the file. Like appending a .zip file would work. Or you could just use your own byte format. – Jeremy Lakeman Commented Feb 5 at 1:01
  • "without using a Nuget Library" - is only NuGet a problem? Or are third-party tools and libs in general? – mkl Commented Feb 5 at 6:11
  • "without using a Nuget Library?" - basically: Good Luck. I wouldn't touch that with a 10ft pole. – Fildor Commented Feb 5 at 8:53
  • @mkl Third-party tools, in general. Essentially, they would like to limit that as much as possible. We have a DevExpress library, though I'm not super familiar with it, and I wonder if there's a way to do this with that. – user3121062 Commented Feb 5 at 14:54
  • Working with arbitrary PDF files without using existing libraries or tools means a lot of work. If you only need to process very special files, the situation can be easier but still not easy. – mkl Commented Feb 5 at 15:29

1 Answer

Text can be present in a PDF yet unseen in the rendered page. In this example, the "Default" text on the left is black, while SomeID on the right has no colour set for its area. That area could just as well be printer-tracking yellow, white, or any other colour that does not show up.

To see all of the PDF text regardless of its colour, we simply need to read the page as if it were plain text.

If the region of interest is at a known location, we can narrow the extraction down to that area of each page.

Any application, on any platform, can therefore shell out to such a command line and use redirection to read just that small zone (or a larger one), or even pick out the line with find and split the result.

The options -f 1 -l 1 restrict the extraction to the first page.

>pdftotext -layout -f 1 -l 1 hiddentext.pdf -|find "SomeID"
               SomeID=123456
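
Since the question is about C#, the same command can be launched from code instead of from a prompt. The following is only a minimal sketch, assuming a pdftotext executable is installed and reachable on the PATH. The class and method names (SomeIdExtractor, ReadFirstPageText, ParseSomeId) are hypothetical and chosen for illustration. It captures the digits with a regular expression rather than relying on the value having a fixed width.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

// Hypothetical helper: the names are illustrative, not from any library.
public static class SomeIdExtractor
{
    // Returns the digits that follow "SomeID=", or null when no match is found.
    public static string ParseSomeId(string pageText)
    {
        Match match = Regex.Match(pageText, @"SomeID=([0-9]+)");
        return match.Success ? match.Groups[1].Value : null;
    }

    // Assumption: "pdftotext" is on the PATH. The trailing "-" sends the
    // extracted text to standard output, as in the command line above.
    public static string ReadFirstPageText(string pdfPath)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = "pdftotext",
            Arguments = "-layout -f 1 -l 1 \"" + pdfPath + "\" -",
            RedirectStandardOutput = true,
            UseShellExecute = false, // required for output redirection
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            if (process == null)
            {
                throw new InvalidOperationException("Could not start pdftotext.");
            }

            // Read all output before waiting, so a full pipe cannot block the child.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return text;
        }
    }
}
```

A call such as SomeIdExtractor.ParseSomeId(SomeIdExtractor.ReadFirstPageText("hiddentext.pdf")) would then yield the identifier as a string, or null if the first page does not contain it. Keeping the parsing separate from the process call means the regular expression can be checked without a PDF at hand.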

You can even set an environment variable, with some system-specific contortions.

>cmd /V:ON /r pdftotext -layout -f 1 -l 1  hiddentext.pdf -|find "SomeID">%temp%\output.txt &set /p input=<%temp%\output.txt&&set output=%input:~-6%&&echo/&&set output

output=123456