Word Extractor.

Discussion in 'Web Programming' started by r0ut3r, May 28, 2009.

  1. #1 - May 28, 2009 at 9:22 PM
  2. r0ut3r

    Code:
    #!/usr/bin/python
    # Word Extractor from a site.

    import sys, urllib2, re, sets

    # Min length of word
    MIN_LENGTH = 3
    # Max length of word
    MAX_LENGTH = 10

    def StripTags(text):
    	# Repeatedly cut out the first "<...>" span until none is left.
    	while True:
    		start = text.find("<")
    		if start < 0:
    			break
    		stop = text.find(">", start)
    		if stop < 0:
    			break
    		text = text[:start] + text[stop+1:]
    	return text
    			
    if len(sys.argv) != 3:
    	print "\nUsage: ./wordextract.py <site> <file to save words>"
    	print "Ex: ./wordextract.py http://www.test.com wordlist.txt\n"
    	sys.exit(1)
    
    site = sys.argv[1]
    # Only add a scheme when none was given, so https:// URLs are left alone
    if not site.startswith(("http://", "https://")):
    	site = "http://"+site
    	
    print "\n[+] Retrieving Source:",site
    source = StripTags(urllib2.urlopen(site).read())
    words = re.findall(r"\w+",source)
    # Drop duplicates
    words = list(set(words))
    l = len(words)
    print "[+] Found:",l,"words"
    print "[+] Trimming words to length"
    # Build a new list; removing from a list while looping over it skips elements
    words = [word for word in words if MIN_LENGTH <= len(word) <= MAX_LENGTH]
    print "\n[+] Removed:",l-len(words),"words"
    print "[+] Writing:",len(words),"words to",sys.argv[2]
    # Append mode: repeated runs add to the same wordlist
    out = open(sys.argv[2],"a")
    for word in words:
    	out.write(word+"\n")
    out.close()
    print "\n[-] Complete\n"
    PM me if you find any flaws and I'll fix them. I changed the max length from 5 to 10.
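
    For anyone on Python 3, here is a rough sketch of the same idea (strip the markup, pull out the words, drop duplicates, keep only lengths 3 to 10). It is not a line-for-line port: it swaps the find-based tag stripping for the standard library's html.parser, and it also skips the contents of script and style tags, which the script above does not do. The names strip_tags, extract_words and _TextOnly are made up for illustration, and fetching the page is left out so the functions can be tried on a plain string.

```python
import re
from html.parser import HTMLParser

MIN_LENGTH = 3   # same limits as the script above
MAX_LENGTH = 10


class _TextOnly(HTMLParser):
    """Hypothetical helper: keeps text nodes, skips <script>/<style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # > 0 while inside a script or style element

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML string, pieces joined by spaces."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return " ".join(parser.parts)


def extract_words(html, min_length=MIN_LENGTH, max_length=MAX_LENGTH):
    """Return the unique words within the length limits, sorted."""
    words = set(re.findall(r"\w+", strip_tags(html)))
    # Filter into a new sequence instead of removing items during iteration
    return sorted(w for w in words if min_length <= len(w) <= max_length)


page = "<p>Hello <b>world</b>, hello again to an extraordinarily long word!</p>"
print(extract_words(page))
```

    To run it against a live site you would pass in the decoded page source, for example from urllib.request.urlopen(site).read().decode("utf-8", "replace"). The encoding there is a guess and may need adjusting per site.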
     
  3. #2 - May 28, 2009 at 9:25 PM
  4. i am java

    What does it do again?
     
