FREE PYTHON CODES

Thursday, 7 July 2016

Python code to convert all pdf files to .txt files in a folder

# Python 2 only: uses raw_input, the print statement and the pyPdf package.
import os
from time import strftime

import pyPdf


def check_path(prompt):
    ''' (str) -> str

    Prompts until the user enters a path that exists, then returns it.
    '''
    abs_path = raw_input(prompt)
    while not os.path.exists(abs_path):
        print "\nThe specified path does not exist.\n"
        abs_path = raw_input(prompt)
    return abs_path


print "\n"
folder = check_path("Provide absolute path for the folder: ")

# Collect every PDF under the folder, subfolders included.
pdf_paths = []
for root, dirs, files in os.walk(folder):
    for filename in files:
        if filename.endswith('.pdf'):
            # Join with root (the folder being visited), not the top-level folder.
            pdf_paths.append(os.path.join(root, filename))

# A for loop visits each file exactly once; the earlier while loop never
# advanced its index and ran past the end of the list.
for pdf_path in pdf_paths:
    # Write the text next to the PDF, with the extension swapped to .txt.
    txt_path = os.path.splitext(pdf_path)[0] + ".txt"

    content = ""
    # Load PDF into pyPdf
    pdf_file = open(pdf_path, "rb")
    pdf = pyPdf.PdfFileReader(pdf_file)
    # Iterate pages
    for page_number in range(pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(page_number).extractText() + "\n"
    pdf_file.close()

    print strftime("%H:%M:%S"), " pdf  -> txt "
    f = open(txt_path, 'w')
    f.write(content.encode("UTF-8"))
    f.close()
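The output file is the PDF's own path with the extension swapped to .txt. Here is a minimal sketch of that one step in isolation, using os.path.splitext so that no path separator has to be hard-coded. The helper name txt_path_for is hypothetical and not part of the original post.

```python
import os


def txt_path_for(pdf_path):
    """Return the path of the .txt file that sits next to pdf_path.

    Hypothetical helper: only the final extension is replaced, so a name
    such as 'scan.v2.pdf' becomes 'scan.v2.txt'.
    """
    base, _extension = os.path.splitext(pdf_path)
    return base + ".txt"


# Example: the folder part is kept, only the extension changes.
print(txt_path_for(os.path.join("reports", "summary.pdf")))
```

Because the folder part is never split off and re-joined, the same line should work on Windows, Linux and macOS.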

Python 3 users can use the code below.
It relies on the older PyPDF2 interface (PdfFileReader), which was removed in PyPDF2 3.0, so install a release below 3.0:
pip install "pypdf2<3"

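import os
from time import strftime

import PyPDF2


def check_path(prompt):
    ''' (str) -> str

    Prompts until the user enters a path that exists, then returns it.
    '''
    abs_path = input(prompt)
    while not os.path.exists(abs_path):
        print("\nThe specified path does not exist.\n")
        abs_path = input(prompt)
    return abs_path


print("\n")
folder = check_path("Provide absolute path for the folder: ")

# Collect every PDF under the folder, subfolders included.
pdf_paths = []
for root, dirs, files in os.walk(folder):
    for filename in files:
        if filename.endswith('.pdf'):
            # Join with root (the folder being visited), not the top-level folder.
            pdf_paths.append(os.path.join(root, filename))

for pdf_path in pdf_paths:
    # Write the text next to the PDF, with the extension swapped to .txt.
    txt_path = os.path.splitext(pdf_path)[0] + ".txt"

    content = ""
    # PdfFileReader, getNumPages, getPage and extractText exist only in
    # PyPDF2 releases before 3.0; later releases use PdfReader,
    # len(reader.pages), reader.pages[i] and extract_text() instead.
    with open(pdf_path, "rb") as pdf_file:
        pdf = PyPDF2.PdfFileReader(pdf_file)
        for page_number in range(pdf.getNumPages()):
            content += pdf.getPage(page_number).extractText() + "\n"

    print(strftime("%H:%M:%S"), " pdf  -> txt ")

    # 'w' overwrites the result of an earlier run instead of appending to it;
    # the explicit UTF-8 encoding avoids UnicodeEncodeError on Windows.
    with open(txt_path, 'w', encoding='utf-8') as out:
        out.write(content)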
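Two details matter when this runs on a real folder in Python 3. Files found in subfolders have to be joined with the folder that os.walk is currently visiting. Extracted text also often contains characters such as the "ﬁ" ligature (U+FB01) that the default Windows encoding cannot store. The following is a minimal sketch of both points, not the post's own code: the helper names find_pdfs and save_text are hypothetical, and the demo uses empty placeholder files instead of real PDFs.

```python
import os
import tempfile


def find_pdfs(folder):
    """Return the sorted full paths of every .pdf file under folder."""
    found = []
    for root, _dirs, files in os.walk(folder):
        for filename in files:
            if filename.lower().endswith(".pdf"):
                # root is the folder currently being visited by os.walk.
                found.append(os.path.join(root, filename))
    return sorted(found)


def save_text(txt_path, text):
    """Write text as UTF-8, independent of the platform's default encoding."""
    with open(txt_path, "w", encoding="utf-8") as out:
        out.write(text)


# Demo on a throwaway folder containing empty placeholder files.
with tempfile.TemporaryDirectory() as demo:
    os.makedirs(os.path.join(demo, "sub"))
    for rel in ("a.pdf", os.path.join("sub", "b.PDF"), "notes.txt"):
        open(os.path.join(demo, rel), "wb").close()

    print([os.path.relpath(p, demo) for p in find_pdfs(demo)])

    # U+FB01 is the character reported in the 'charmap' errors.
    save_text(os.path.join(demo, "a.txt"), "\ufb01rst page")
```

Matching on the lower-cased name is a design choice that also picks up files ending in .PDF; drop the .lower() call if only lower-case extensions should be converted.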

42 comments:

Unknown, 3 January 2017 at 10:51
It is showing the error "No module installed".
Daelius, 1 February 2017 at 16:20
please, contact me... I need your advice to create a script that does that on python3. brunovtf@gmail.com.
See ya
dghind, 7 February 2017 at 02:41
Hi, nice program. Thanks for this. I think you forgot to add an increment to "i" in the while loop, which is why it only converts the first file it finds in the given folder.
Bijon Mathew, 21 February 2017 at 07:25
Guys, if you require any Python programs, tell me in the comments.
Unknown, 27 February 2017 at 08:19
Is there a way to use this code for Python 3.6? I keep getting an error for no module named pdf.
Unknown, 5 March 2017 at 02:07
Can you help me convert all HTML files to .txt files in a folder? I tried to use pyhtml but it always gives an error :'(
Unknown, 1 August 2017 at 10:05
Hi Bijon Mathew,
I am very new to Python programming. Thanks a lot for the above program.
I tried giving a scanned PDF file as input, because I want to convert it into a text file or extract the text from the image as a text file. I only see empty text files in the output. Can you please help me solve this using a scanned PDF as input?
BePositive, 11 August 2017 at 02:11
Hi. When I enter the path, I get this error message:
Traceback (most recent call last):
File "C:/Python27/CreditConvert 2.py", line 24, in <module>
path=list[i]
IndexError: list index out of range
Unknown, 22 August 2017 at 01:07
Hello, after I run it I get this message:
Traceback (most recent call last):
File "E:/Python/1.py", line 5, in <module>
import pyPdf
ImportError: No module named pyPdf
Anonymous, 25 September 2017 at 00:44
How do I convert a bunch of PDF files to CSV in Python?
Unknown, 17 December 2017 at 13:54
Hi Bijon, I need help with the following code.
I have to connect a MySQL database to the buttons of the user interface so that I can retrieve the data within a required range. I'm using Python 3.4.
The code is as follows:
from tkinter import *

root = Tk()

root.geometry("400x200")

def retrieve_input():
    inputValue1 = textBox1.get("1.0","end-1c")
    print(inputValue1)
    inputValue2 = textBox2.get("1.0","end-1c")
    print(inputValue2)

w1=Label(root, height=2, width=10, text='range1')
w1.pack()

textBox1= Text(root, height=2, width=10)
textBox1.pack()

w2=Label(root, height=2, width=10, text='range2')
w2.pack()

textBox2 = Text(root, height=2, width=10)
textBox2.pack()

buttonCommit=Button(root, height=1, width=10, text="ok",
                    command=lambda: retrieve_input())

buttonCommit.pack()

mainloop()

When I click OK, I should get the data.
jeya, 2 April 2018 at 04:10
How do I convert docx and doc files to txt in a folder?
ravi, 4 June 2018 at 22:22
How do I convert HTML to PDF?
Unknown, 11 June 2018 at 22:29
I am getting a UnicodeEncodeError. How do I fix it?
Unknown, 16 September 2018 at 00:09
I'm trying to convert .eml files to docx files. Is this possible?
GANG JUSTIN, 28 October 2018 at 05:40
Provide absolute path for the folder: D:\pdf
18:05:09 pdf -> txt
Traceback (most recent call last):
File "C:/Users/bhagyaraj/PycharmProjects/untitled1/pdftotext.py", line 65, in <module>
out.write(content.encode("UTF-8"))
TypeError: write() argument must be str, not bytes
Reddy, 14 February 2019 at 10:38
Hi Bijon,

The Python version 3 code is running perfectly for me and I am able to convert the PDF files to .txt files.
But it is not picking up ‘;’ or the text after ‘;’.

Can you please advise me on how to get the text values after ‘;’?

Many thanks

Regards,
Reddy
Unknown, 2 May 2019 at 05:20
UnicodeEncodeError: 'charmap' codec can't encode character '\ufb01' in position 6477: character maps to <undefined>
Unknown, 15 May 2019 at 01:38
Hi Bijon, thanks for the code. I tried to convert 5 PDFs which are on my machine. I am using Python 3 and Jupyter Notebook, and I am a learner in Python. When I ran the code for Python 3 there was no error message, but I can't see any txt files. Please guide me. I can convert each PDF to a txt file using PyPDF2; however, I have 20 PDFs and need to extract the text from them and then do some text classification on it. Please let me know if you could help me with that. Thanks a lot in advance.
Unknown, 4 December 2019 at 23:28
Hello, what is the name variable in line no. 91?
Unknown, 17 January 2020 at 17:26
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
in
90
91 with open(name,'a') as out:
---> 92 out.write(content)

~\Anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
17 class IncrementalEncoder(codecs.IncrementalEncoder):
18 def encode(self, input, final=False):
---> 19 return codecs.charmap_encode(input,self.errors,encoding_table)[0]
20
21 class IncrementalDecoder(codecs.IncrementalDecoder):

UnicodeEncodeError: 'charmap' codec can't encode character '\ufb01' in position 7251: character maps to <undefined>
Unknown, 18 February 2020 at 04:20
I want Python code to push all PDF data to Excel files.
aRandomMoron, 18 March 2020 at 08:35
Hey, sorry to take your time. I'm brand new to Python and I really struggle. Could someone help me, please? I have this error:
File "C:/Users/Gabriel/PycharmProjects/frompdf2mybe/Test2PDF.py", line 66

^
SyntaxError: unexpected EOF while parsing

Process finished with exit code 1