
Thursday, 7 July 2016

Python code to convert all pdf files to .txt files in a folder

import os
from time import strftime

import pyPdf


def check_path(prompt):
    ''' (str) -> str

    Prompts until the user enters a path that exists, then returns it.
    '''
    abs_path = raw_input(prompt)
    while not os.path.exists(abs_path):
        print "\nThe specified path does not exist.\n"
        abs_path = raw_input(prompt)
    return abs_path


print "\n"

folder = check_path("Provide absolute path for the folder: ")

# Collect every PDF under the folder, including subfolders.
pdf_paths = []
for root, dirs, files in os.walk(folder):
    for filename in files:
        if filename.lower().endswith('.pdf'):
            # Join with root (not folder) so files in subfolders resolve.
            pdf_paths.append(os.path.join(root, filename))

# Loop over the paths directly: no index to increment or overrun.
for pdf_path in pdf_paths:
    # Write the text next to the PDF, swapping the extension for .txt.
    txt_path = os.path.splitext(pdf_path)[0] + ".txt"

    content = ""
    # Load PDF into pyPdf
    pdf_file = open(pdf_path, "rb")
    pdf = pyPdf.PdfFileReader(pdf_file)
    # Iterate pages (a separate name keeps the page counter apart from the file loop)
    for page_number in range(pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(page_number).extractText() + "\n"
    pdf_file.close()

    print strftime("%H:%M:%S"), " pdf  -> txt "
    f = open(txt_path, 'w')
    f.write(content.encode("UTF-8"))
    f.close()
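
Readers in the comments below report that only the first file gets converted, or that the script stops with "IndexError: list index out of range". Both symptoms come from hand-managed index counters around the list of paths. Here is a minimal sketch of the safer pattern, written for Python 3. The names convert_all and fake_extract are hypothetical and exist only for this illustration; fake_extract stands in for the PDF-reading step so the sketch runs without any PDF files.

```python
def convert_all(paths, extract):
    """Call extract(path) once per path and return {path: text}.

    `convert_all` and `extract` are hypothetical names for this sketch;
    `extract` stands in for the PDF-reading step.
    """
    results = {}
    # Iterating over the list directly avoids index bookkeeping entirely,
    # so there is no counter to forget to increment or to run past the end.
    for pdf_path in paths:
        results[pdf_path] = extract(pdf_path)
    return results


def fake_extract(path):
    # Stand-in extractor: pretends each file contains its own name.
    return "text of " + path


converted = convert_all(["a.pdf", "b.pdf", "c.pdf"], fake_extract)
print(len(converted))
```

Because the loop variable is the path itself, an inner loop over pages can use any counter it likes without disturbing the outer loop.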





  For Python 3 users: follow the code given below.
  For the code to work you have to install PyPDF2:
  pip install pypdf2

 For converting HTML files to text files: Click here
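
Several readers hit "No module named ..." errors, so it can help to check which imports are available before running the converter. The following is only a sketch: missing_modules is a hypothetical helper name, and it relies on the standard library's importlib.util.find_spec, which returns None for a top-level module that cannot be found.

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that cannot be imported here.

    `missing_modules` is a hypothetical helper for this sketch.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


# "PyPDF2" is the import name used by the script; "os" and "time" ship with Python.
print(missing_modules(["os", "time", "PyPDF2"]))
```

Any name printed in the list still needs to be installed with pip. An empty list means all three imports should succeed.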

import os
from time import strftime

import PyPDF2


def check_path(prompt):
    ''' (str) -> str

    Prompts until the user enters a path that exists, then returns it.
    '''
    abs_path = input(prompt)
    while not os.path.exists(abs_path):
        print("\nThe specified path does not exist.\n")
        abs_path = input(prompt)
    return abs_path


print("\n")

folder = check_path("Provide absolute path for the folder: ")

# Collect every PDF under the folder, including subfolders.
pdf_paths = []
for root, dirs, files in os.walk(folder):
    for filename in files:
        if filename.lower().endswith('.pdf'):
            # Join with root (not folder) so files in subfolders resolve.
            pdf_paths.append(os.path.join(root, filename))

for pdf_path in pdf_paths:
    # Write the text next to the PDF, swapping the extension for .txt.
    txt_path = os.path.splitext(pdf_path)[0] + ".txt"

    content = ""
    # PdfReader, .pages and extract_text() are the names in PyPDF2 2.x and 3.x;
    # older releases call them PdfFileReader, getNumPages()/getPage() and extractText().
    pdf = PyPDF2.PdfReader(pdf_path)
    for page in pdf.pages:
        content += page.extract_text() + "\n"

    print(strftime("%H:%M:%S"), " pdf  -> txt ")

    # Write as UTF-8 so characters such as '\ufb01' do not raise UnicodeEncodeError.
    # Mode 'w' replaces an existing .txt file instead of appending to it.
    with open(txt_path, 'w', encoding='utf-8') as out:
        out.write(content)
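
The part of the script that is easiest to get wrong is building the input and output paths. os.walk yields a different root for every subfolder, so joining each filename with root rather than the top-level folder matters, and os.path.splitext swaps the extension without hard-coding a Windows backslash. A small sketch of just that step follows. The helper name find_pdfs and the demo file names are assumptions made for illustration, and the demo uses empty placeholder files rather than real PDFs.

```python
import os
import tempfile


def find_pdfs(folder):
    """Return (pdf_path, txt_path) pairs for every .pdf under folder.

    `find_pdfs` is a hypothetical helper name for this sketch.
    """
    pairs = []
    for root, dirs, files in os.walk(folder):
        for filename in sorted(files):
            if filename.lower().endswith(".pdf"):
                # root is the folder currently being walked, so nested files resolve.
                pdf_path = os.path.join(root, filename)
                # splitext drops ".pdf" whatever the path separator is.
                txt_path = os.path.splitext(pdf_path)[0] + ".txt"
                pairs.append((pdf_path, txt_path))
    return pairs


# Demo on a throwaway folder containing empty placeholder files.
with tempfile.TemporaryDirectory() as demo_dir:
    os.makedirs(os.path.join(demo_dir, "sub"))
    for rel in ("top.pdf", os.path.join("sub", "nested.pdf"), "notes.docx"):
        open(os.path.join(demo_dir, rel), "wb").close()
    demo_pairs = find_pdfs(demo_dir)
    for pdf_path, txt_path in demo_pairs:
        print(os.path.relpath(pdf_path, demo_dir), "->", os.path.relpath(txt_path, demo_dir))
```

The .docx file is skipped, and the nested PDF gets an output path inside its own subfolder.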

42 comments:

  1. It is showing the error "No module installed".

    Replies
    1. You need to install the required modules using pip.
    2. What modules? Can you list them?
  2. Please contact me... I need your advice to create a script that does this in Python 3. brunovtf@gmail.com.
    See ya

    Replies
    1. I will add the Python 3 version soon. By the way, have you installed all the required imports mentioned at the start of the program?
  3. Hi, nice program. Thanks for this. I think you forgot to add an increment to "i" in the while loop, which is why it only converts the first file it finds in the given folder.

    Replies
    1. Hello, I am facing the same issue. I am not able to iterate through all the PDFs. How do I fix this?
  4. Guys, if you require any Python programs, tell me in the comments.

    Replies
    1. I want to convert all docx and doc files in a folder to txt format.
    2. Can you suggest code to convert PDF files to HTML without using libraries?
    3. Bro, I need a program to convert PDF files to xls (Excel). Please help me!
  5. Is there a way to use this code with Python 3.6? I keep getting an error: no module named pdf.

    Replies
    1. Can you show me a screenshot of the error? By the way, have you installed the pyPdf module?
    2. I have put the Python 3 version of the code above.
    3. The Python 3 version above worked perfectly. Thank you very much for the help!
  6. Can you help me convert all HTML files in a folder to .txt files? I tried using pyhtml but it always gives an error :'(

    Replies
    1. Okay, but all the code will be for Python 2.x.
    2. I have put the link above for converting HTML files to text files.
    3. I want to convert HTML pages to PDF files.
  7. Hi Bijon Mathew,
    I am very new to Python programming. Thanks a lot for the above program.
    I tried giving a scanned PDF file as input, as I want to convert it into a text file or extract the text from the image as a text file. I only see empty text files in the output. Can you please help me solve this using a scanned PDF as input?
  8. Hi. When I enter the path, I get this error message:
    Traceback (most recent call last):
    File "C:/Python27/CreditConvert 2.py", line 24, in <module>
    path=list[i]
    IndexError: list index out of range
  9. Hello, after I run it I get this message:
    Traceback (most recent call last):
    File "E:/Python/1.py", line 5, in <module>
    import pyPdf
    ImportError: No module named pyPdf
    >>>

    Replies
    1. That means you have not installed the pyPdf package. Try installing it using pip.
  10. How do I convert a bunch of PDF files to CSV in Python?
  11. Hi Bijon, I need help with the following code.
    I have to write code that connects a MySQL database to the buttons of the user interface so that I can retrieve the data within a required range. I'm using Python 3.4.
    The code is as follows:
    from tkinter import *

    root = Tk()

    root.geometry("400x200")

    def retrieve_input():
        inputValue1 = textBox1.get("1.0","end-1c")
        print(inputValue1)
        inputValue2 = textBox2.get("1.0","end-1c")
        print(inputValue2)

    w1=Label(root, height=2, width=10, text='range1')
    w1.pack()

    textBox1= Text(root, height=2, width=10)
    textBox1.pack()

    w2=Label(root, height=2, width=10, text='range2')
    w2.pack()

    textBox2 = Text(root, height=2, width=10)
    textBox2.pack()

    buttonCommit=Button(root, height=1, width=10, text="ok",
                        command=lambda: retrieve_input())

    buttonCommit.pack()


    mainloop()

    When I click OK, I should get the data.
  13. How do I convert docx and doc files in a folder to txt?
  14. How do I convert HTML to PDF?
  15. I am getting a UnicodeEncodeError. How do I fix it?
  16. I'm trying to convert .eml files to docx files. Is this possible?
  17. Provide absolute path for the folder: D:\pdf
    18:05:09 pdf -> txt
    Traceback (most recent call last):
    File "C:/Users/bhagyaraj/PycharmProjects/untitled1/pdftotext.py", line 65, in <module>
    out.write(content.encode("UTF-8"))
    TypeError: write() argument must be str, not bytes
  18. Hi Bijon,

    The Python 3 code is running perfectly for me and I am able to convert the PDF files to .txt files.
    But it is not picking up ‘;’ or the text after ‘;’.

    Can you please advise me on how to get the text values after ‘;’?

    Many thanks

    Regards,
    Reddy
  19. UnicodeEncodeError: 'charmap' codec can't encode character '\ufb01' in position 6477: character maps to <undefined>
  20. Hi Bijon, thanks for the code. I tried to convert 5 PDFs that are on my machine. I am using Python 3 and Jupyter Notebook, and I am a learner in Python. When I ran the Python 3 code there was no error message, but I can't see any txt files. Please guide me. I can convert each PDF to a txt file using PyPDF2; however, I have 20 PDFs and need to extract the text from them and then do some text classification on it. Please let me know if you could help me with that. Thanks a lot in advance.
  21. Hello, what is the name variable in line no. 91?

  22. ---------------------------------------------------------------------------
    UnicodeEncodeError Traceback (most recent call last)
    in
    90
    91 with open(name,'a') as out:
    ---> 92 out.write(content)

    ~\Anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
    17 class IncrementalEncoder(codecs.IncrementalEncoder):
    18 def encode(self, input, final=False):
    ---> 19 return codecs.charmap_encode(input,self.errors,encoding_table)[0]
    20
    21 class IncrementalDecoder(codecs.IncrementalDecoder):

    UnicodeEncodeError: 'charmap' codec can't encode character '\ufb01' in position 7251: character maps to <undefined>
  23. I want Python code to push all PDF data to Excel files.
  24. Hey, sorry to take your time. I'm brand new to Python and I really struggle. Could someone help me please? I have this error:
    File "C:/Users/Gabriel/PycharmProjects/frompdf2mybe/Test2PDF.py", line 66

    ^
    SyntaxError: unexpected EOF while parsing

    Process finished with exit code 1
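
Several of the comments above report a UnicodeEncodeError for '\ufb01' (the "fi" ligature) or "TypeError: write() argument must be str, not bytes". Both usually come down to how the output file is opened in Python 3: a text-mode file expects str, and on Windows its default encoding is often cp1252, which has no code for that ligature. The sketch below shows one way around it, opening the file with an explicit UTF-8 encoding. The sample text and file name are made up for the demonstration.

```python
import os
import tempfile

text = "con\ufb01guration"  # contains the "fi" ligature reported in the comments

# cp1252 (a common Windows default) has no slot for U+FB01, so encoding fails.
try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "out.txt")
    # Text mode plus an explicit encoding: write str, not content.encode(...).
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(text)
    with open(out_path, encoding="utf-8") as check:
        round_trip = check.read()

print(cp1252_ok, round_trip == text)  # False True
```

Passing encoding="utf-8" to open() keeps the write call free of manual .encode() calls, which is what triggers the TypeError in text mode.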