
Friday, 17 June 2016

1. Program to get all the text from a site and store it in a .txt file

  (Requires BeautifulSoup)

  



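import urllib.request

from bs4 import BeautifulSoup

url = input('Enter url:   ')

# Download the page and parse it with the standard-library HTML parser
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# Remove script and style elements so only the visible text remains
for script in soup(["script", "style"]):
    script.extract()

text = soup.get_text()

# Strip each line, split it on double spaces, and drop the empty pieces
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
text = '\n'.join(chunk for chunk in chunks if chunk)

# Drop any non-ASCII characters
text = text.encode('ascii', 'ignore').decode('ascii')

# Write the cleaned text to Data.txt, then report success
with open('Data.txt', 'w') as fhand:
    fhand.write(text)

print('Text copied to Data.txt file')
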
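
The whitespace clean-up in the middle of the program can be tried on its own, without a network connection or BeautifulSoup. The following is a minimal sketch; the function name clean_text and the sample string are made up for illustration and are not part of the program above.

```python
def clean_text(text):
    """Strip each line, split it on double spaces, and drop empty pieces."""
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return "\n".join(chunk for chunk in chunks if chunk)


# Hypothetical sample of the kind of ragged text get_text() can return
raw = "  Home   About  \n\n\n   Welcome to the site  \n"
print(clean_text(raw))
```

Each menu item or phrase that was separated by two or more spaces ends up on its own line, and blank lines disappear.
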
Regex code to find all email addresses:

import re
re.findall(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+", string_name)
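
As a rough sketch of how the pattern might be applied to a block of text: when searching inside longer text it is used without the ^ and $ anchors, because those only allow a match when the whole string is a single address. The names EMAIL_PATTERN and find_emails and the sample addresses are assumptions for illustration. The pattern is deliberately simple and will not cover every valid address.

```python
import re

# Simple email pattern: local part, "@", domain label, ".", rest of domain
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+")


def find_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL_PATTERN.findall(text)


# Hypothetical sample text
sample = "Contact alice@example.com or bob.smith@mail.example.org for details"
print(find_emails(sample))
```

The same function could be called on the text saved in Data.txt by the first program.
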


Program to find files of a particular type (e.g. .mp3, .doc, .mp4) on the PC, along with their file paths

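import fnmatch
import os

rootdir = '/'
# The pattern is a shell-style wildcard, so prefix the extension with *
pattern = input('Enter file type to search, prefixed with * (e.g. *.mp3): ')

# Walk every directory under rootdir and print the full path of each match
for root, subdirname, filelist in os.walk(rootdir):
    for filename in fnmatch.filter(filelist, pattern):
        print(os.path.join(root, filename))
print('ALL FILES FOUND')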
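
The same directory walk can be wrapped in a function that yields the paths instead of printing them, which makes it easier to reuse or to limit the search to one folder. This is only a sketch; find_files is a hypothetical name, not something from the program above.

```python
import fnmatch
import os


def find_files(rootdir, pattern):
    """Yield the full path of each file under rootdir whose name matches pattern."""
    for root, _subdirs, filelist in os.walk(rootdir):
        for filename in fnmatch.filter(filelist, pattern):
            yield os.path.join(root, filename)


# Example use (the starting folder here is only an illustration):
# for path in find_files(os.path.expanduser("~"), "*.mp3"):
#     print(path)
```

Starting from a specific folder rather than '/' usually finishes much faster and avoids directories the user cannot read.
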



 


 

Converting doc, docx files to txt files
 

 
