1.Program to get all the text from a site and store it in a .txt file
(Requires BeautifulSoup )
Regex code to find all email address :
Program to find a particular file type (eg. .mp3,.doc,.mp4 etc) in the PC along with file path
converting doc,docx files to txt files
(Requires BeautifulSoup )
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 | import urllib import re from bs4 import BeautifulSoup var1=raw_input('Enter url: ') url =var1 html = urllib.urlopen(url).read() soup = BeautifulSoup(html) for script in soup(["script", "style"]): script.extract() text = soup.get_text() lines = (line.strip() for line in text.splitlines()) chunks = (phrase.strip() for line in lines for phrase in line.split(" ")) text = '\n'.join(chunk for chunk in chunks if chunk) text = text.encode('ascii', 'ignore').decode('ascii') print('Text copied to Data.txt file') fhand=open('Data.txt','w') fhand.write(text) fhand.close() |
1 | re.findall(r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)",string_name) |
Program to find a particular file type (eg. .mp3,.doc,.mp4 etc) in the PC along with file path
1 2 3 4 5 6 7 8 9 10 | import fnmatch import os rootdir='/' var1=raw_input('Enter file type to search prefix * : ') pattern=var1 for root,subdirname,filelist in os.walk(rootdir): for filename in fnmatch.filter(filelist,pattern): t=(os.path.join(root,filename)) print(t) print('ALL FILES FOUND') |
converting doc,docx files to txt files
No comments:
Post a Comment