c# - Extracting a Specific Text from PDF using RegEx

Without getting into too much detail about how we ended up in this situation (a lot of poor business decisions), I need to find the text: "SomeID=[Integer]" from a PDF file (e.g. SomeID=123456). Ultimately, what I need is the 123456. This text will only be on the first page of the PDF. It's actually going to be in the color white so it's "invisible".

My initial thought was to grab all the text from the first page of the PDF and then use RegEx to parse for "SomeID=[Integer]". I do not care about any of the other text in this PDF. I only care about finding "SomeID=" and the integer that follows.
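The parsing half of that idea can be sketched on its own, assuming the first page's text is already available as a string. The class and method names here (SomeIdParser, ExtractSomeId) are hypothetical and only for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class SomeIdParser
{
    // Matches the literal "SomeID=" followed by one or more ASCII digits.
    // The digits are captured in group 1.
    private static readonly Regex SomeIdPattern =
        new Regex(@"SomeID=([0-9]+)", RegexOptions.CultureInvariant);

    // Returns the integer after "SomeID=", or null when the marker is absent
    // or the digits do not fit in an int.
    public static int? ExtractSomeId(string pageText)
    {
        if (string.IsNullOrEmpty(pageText))
        {
            return null;
        }

        Match match = SomeIdPattern.Match(pageText);
        if (!match.Success)
        {
            return null;
        }

        return int.TryParse(match.Groups[1].Value, out int id) ? id : (int?)null;
    }
}
```

Calling SomeIdParser.ExtractSomeId(text) on a page containing "SomeID=123456" would yield 123456. Only the first match is used, which seems adequate if the marker appears once on the page.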

What is a simple way to get all the text from a PDF without using a NuGet library?

That said, I can try to get one approved if there is a solid one.

asked Feb 4 at 21:43 by user3121062, edited Feb 4 at 23:30 by K J

If you want to hide meta data, I think you could just append it to the end of the file. Like appending a .zip file would work. Or you could just use your own byte format. – Jeremy Lakeman Commented Feb 5 at 1:01
1 "without using a Nuget Library" - is only nuget a problem? Or are third party tools and libs in general? – mkl Commented Feb 5 at 6:11
"without using a Nuget Library?" - basically: Good Luck. I wouldn't touch that with a 10ft pole. – Fildor Commented Feb 5 at 8:53
@mkl Third-party tools in general. Essentially, they would like to limit that as much as possible. We have a DevExpress library, though I'm not super familiar with it, and I wonder if there's a way to do this with that. – user3121062 Commented Feb 5 at 14:54
1 Working with arbitrary PDF files without using existing libraries or tools means a lot of work. If you only need to process very special files, the situation can be easier but still not easy. – mkl Commented Feb 5 at 15:29


1 Answer


Text can be invisible in a rendered PDF. In this example the "Default" black text is visible on the left, while SomeID, on the right, has no colour set for its area. That area could just as well be printer-tracking yellow, white, or any other unseen colour.

To see all PDF text in black and white we simply need to read the page as if it were plain text.

If the region of interest is at a known location, we can restrict the extraction to that area of each page.

Thus any application, on any platform, can shell out to such a command line and, using redirection, read just that small zone (or a larger one), or even pick out the line with find and split the result.

-f 1 -l 1 restricts the search to the first page.

>pdftotext -layout -f 1 -l 1 hiddentext.pdf -|find "SomeID"
               SomeID=123456
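
From C#, the same command could be shelled out to with System.Diagnostics.Process and the result parsed with a regular expression instead of find. This is only a sketch under stated assumptions: it assumes a pdftotext executable is reachable on the PATH (or passed in explicitly), and the class and method names (PdfFirstPageId, BuildArguments, ParseId, ReadSomeId) are hypothetical.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class PdfFirstPageId
{
    // Same arguments as the command line above; the trailing "-" asks
    // pdftotext to write the extracted text to standard output.
    public static string BuildArguments(string pdfPath) =>
        $"-layout -f 1 -l 1 \"{pdfPath}\" -";

    // Returns the integer after "SomeID=", or null if it is not found.
    public static int? ParseId(string text)
    {
        Match m = Regex.Match(text ?? string.Empty, @"SomeID=([0-9]+)");
        return m.Success && int.TryParse(m.Groups[1].Value, out int id) ? id : (int?)null;
    }

    // Runs pdftotext on the first page and parses its output.
    // Assumption: pdftotextExe resolves to an installed pdftotext binary.
    public static int? ReadSomeId(string pdfPath, string pdftotextExe = "pdftotext")
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = pdftotextExe,
            Arguments = BuildArguments(pdfPath),
            RedirectStandardOutput = true,
            UseShellExecute = false, // required for output redirection
            CreateNoWindow = true,
        };

        using (Process process = Process.Start(startInfo))
        {
            // Read all output before waiting, so a full pipe cannot block the child.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return process.ExitCode == 0 ? ParseId(text) : null;
        }
    }
}
```

A call such as PdfFirstPageId.ReadSomeId("hiddentext.pdf") would then return the ID as an int?, with null meaning the marker was not found or pdftotext reported a failure. Matching the digits with a regex avoids depending on the ID having a fixed length.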

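You can even set an environment variable, with some system-specific contortions. Note that the %input:~-6% substring below keeps only the last six characters, so it assumes a six-digit ID.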

>cmd /V:ON /r pdftotext -layout -f 1 -l 1  hiddentext.pdf -|find "SomeID">%temp%\output.txt &set /p input=<%temp%\output.txt&&set output=%input:~-6%&&echo/&&set output

output=123456
