pdf - Find a string in scanned images in C#.NET


I want a C# program that searches for a string in a PDF file that contains scanned images. Also, the code should display the particular page on which the searched string is present.

For example, consider a PDF file that contains scanned images (.png) of receipts. I want to search for a receipt_number. The page on which that particular receipt_number is present should open in a PDF reader.

Could I use MODI for this?

Well, that's a pretty complicated task. I have tried something similar before, and the results were not great; however, the program I created did, and still does, what I need well enough. The reason it worked for me is that I focused on a specific area of the page, where I knew I would find specifically formatted text. Mind you, it only worked for printed text... stamped or hand-written text was a pain to deal with. With enough OCR training, I'm sure that could be fixed, but I didn't have additional time to devote to the project.

Your results will heavily depend on the OCR method you choose, the scan quality, whether the text is typed or written by hand, whether the scan is aligned or skewed, and so on.

I'm not going to give you the code, as you won't learn if I do, but I'll give you a few tips on how to get started. If you get stuck, post a specific question here and you'll get help.

There are many ways to do this, and the way I've tried is through conversion of the scanned PDF file into images (one per page). Then, I ran the images through a recognition algorithm in an attempt to retrieve the text (in my case, from a specific rectangle on the image).
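Once every page has been turned into text, finding the right page is plain string matching. Here is a minimal sketch of that last step, assuming the OCR output is already available as one string per page. The names `PageSearch` and `FindPages` are hypothetical, and stripping whitespace and ignoring case is only my assumption about what helps with noisy OCR output.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PageSearch
{
    // Remove all whitespace and upper-case the text, so "ab 1002" matches "AB1002".
    private static string Normalize(string s)
    {
        return new string(s.Where(c => !char.IsWhiteSpace(c)).ToArray()).ToUpperInvariant();
    }

    // pageTexts[i] holds the OCR text of page i + 1.
    // Returns the 1-based numbers of all pages whose text contains the needle.
    public static List<int> FindPages(IList<string> pageTexts, string needle)
    {
        var hits = new List<int>();
        string target = Normalize(needle ?? "");
        if (target.Length == 0)
        {
            return hits; // nothing to search for
        }

        for (int i = 0; i < pageTexts.Count; i++)
        {
            if (Normalize(pageTexts[i] ?? "").Contains(target))
            {
                hits.Add(i + 1);
            }
        }
        return hits;
    }
}
```

The returned page numbers are what you would hand to whatever opens the PDF. Exact matching like this will miss a receipt number in which OCR misread a character, so expect to loosen the comparison once you see real output.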

So, to get the images out of the PDF, you can use Magick.NET. It's available through NuGet, so this should be the easy part. Since it's a scanned PDF, you should not have issues getting the images out. There are plenty of tutorials, and if you get stuck, post a specific question on this site.
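To make the Magick.NET step concrete, here is a rough sketch that writes one PNG per page. It assumes a Magick.NET NuGet package is referenced and that Ghostscript is installed, since Magick.NET relies on it to read PDFs. The class name `PdfToImages`, the output file naming and the 300 DPI density are my own choices, not requirements.

```csharp
using System.Collections.Generic;
using System.IO;
using ImageMagick; // from a Magick.NET NuGet package (assumed to be referenced)

public static class PdfToImages
{
    // Renders every page of the PDF to a PNG in outputDir.
    // Returns the image paths in page order.
    public static List<string> Extract(string pdfPath, string outputDir)
    {
        // A higher density gives OCR more pixels to work with; 300 is an assumed value.
        var settings = new MagickReadSettings
        {
            Density = new Density(300, 300)
        };

        Directory.CreateDirectory(outputDir);
        var paths = new List<string>();

        using (var images = new MagickImageCollection())
        {
            images.Read(pdfPath, settings); // one image per PDF page

            int page = 1;
            foreach (var image in images)
            {
                string path = Path.Combine(outputDir, "page-" + page + ".png");
                image.Write(path);
                paths.Add(path);
                page++;
            }
        }
        return paths;
    }
}
```

Reading a long PDF at high density can use a lot of memory, so it may be worth lowering the density or processing the file in smaller page ranges if that becomes a problem.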

The optical character recognition part is the hard part; however, there are libraries that may assist you or, at least, get you started. I used the tessnet2 library (http://www.pixel-technology.com/freeware/tessnet2/).

There are C# wrappers available, and you may be able to find them on NuGet. Here's one place on GitHub (https://github.com/charlesw/tesseract). Also, have a look here: https://code.google.com/p/tesseractdotnet/ and here: https://github.com/charlesw/tesseract-ocr-dotnet.

Some of these are tessnet3 and some are tessnet2. I've had success with the tessnet2 32-bit version, but not with the others. So, give all of them a go and see what works for you.
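As a starting point for the OCR step, here is a small sketch using the charlesw Tesseract wrapper from the GitHub link above (not tessnet2, which is what I actually used). It assumes the `Tesseract` NuGet package is referenced and that English trained data sits in a `tessdata` folder you provide. The class name `PageOcr` and the rectangle coordinates in the usage note are made up for illustration.

```csharp
using Tesseract; // charlesw wrapper, assumed to be installed from NuGet

public static class PageOcr
{
    // Runs OCR over the whole page image and returns the recognized text.
    public static string ReadText(string imagePath, string tessdataDir)
    {
        using (var engine = new TesseractEngine(tessdataDir, "eng", EngineMode.Default))
        using (var img = Pix.LoadFromFile(imagePath))
        using (var page = engine.Process(img))
        {
            return page.GetText();
        }
    }

    // Runs OCR over one rectangle of the page only, which is the approach
    // that worked for me when the text always sat in a known area.
    public static string ReadRegion(string imagePath, string tessdataDir,
                                    int x, int y, int width, int height)
    {
        using (var engine = new TesseractEngine(tessdataDir, "eng", EngineMode.Default))
        using (var img = Pix.LoadFromFile(imagePath))
        using (var page = engine.Process(img, new Rect(x, y, width, height)))
        {
            return page.GetText();
        }
    }
}
```

A call such as `PageOcr.ReadRegion("page-1.png", "./tessdata", 100, 50, 600, 120)` would read only that strip of the page. Creating the engine is slow, so in a real program you would probably create it once and reuse it for every page rather than per call as shown here.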

Know ahead of time that you're diving into a pretty complicated area, and if you get stuck or fail to understand things, don't get frustrated... give it time.

