Job Budget: Between $200 and $500 Bids/Views: 2 / 272
Time Remaining: 9d 23h 22m (ends Nov 18, 2008 21:15 U.S. Eastern Time)
Job Description
I need a script (Perl or PHP) that runs on a server and extracts emails either from a web site that is a public interface to a database (http:// and https://) or from web sites containing the information I want. The script also needs to be compiled as a stand-alone .exe program for Windows Server 2003 and Windows XP. This shouldn't be a big issue for people who already have data scrapers/extractors, since the script does just that: download pages, grab emails from them, and save them in an Excel or Access file.
- Extract/capture the email and, if available, information about the email's owner, plus the URL/ID, from a specified database or domain folder through URL addresses. For a US address: given name, family name, organization name, city, state, postcode and phone #. For a non-US address: given name, family name, organization name, city and state/province, country name, postcode and phone #. For example, each of the following addresses is linked to a person's profile in a database or domain folder, including name, email address, organization name, location, phone #, etc. Each address below has its own profile format. I will let you know the wanted databases in detail after you win this bid. We need options to manually select the items we want to collect. For example, sometimes we may want to collect nothing but emails; at other times we may want names, emails, and organization names (but nothing else), depending on the URL we are visiting.
- Once we input a person's database address like the ones below, the program should stay in the same database and keep looping to find the wanted information for all the people in the database, by increasing and decreasing the ID number (multi-threaded), until finished or manually stopped. The program should periodically save results so that an unexpected outage or error does not lead to data loss. Note that the addresses below all contain the string “ID=” or “id=”. The program should automatically change the number right after “ID=” or “id=”, retrieve the wanted information, save it in an Access or Excel file, and then move on to the next number until the database has been fully examined. Some numbers may have no information; in that case the program should just move on to the next one.
http://www.domain.org/index.cfm?page...l.cfm&ID=49312
https://secure.domain.org/xxxxx/dire...px?DirID=79053
http://subdomain.domain.edu/WhitePag...&a=hs&r=83&kw=
http://subdomain.domain.edu/WhitePag...&a=hs&r=83&kw=
- Extract emails from a folder or subfolder in the domain: either from all of domain.com, or only from domain.com/folder and below, not from the root. Sometimes an email address is embedded under a name as a link. In that case, collect the name as well as the email from the embedded link. For example,
http://www.domain.org/aids/faculty.asp
- Crawl pages only in the URL specified, or in a folder within the URL (domain.com/folder), with a maximum crawl depth of 7-10 levels. Capture emails that cannot be manually copied.
- Multithreaded extraction of emails: connect to URLs in multiple threads for faster speed.
- Delete duplicate emails automatically at the end of the job.
- Delete all emails that were extracted from a given URL (if we tick the option).
- Authentication details. If it's a forum, a member needs to enter a user name/password. The script should allow entering a user name/password and get identified...
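
To make the ID stepping, folder restriction, and duplicate removal in the list above more concrete, here is a rough sketch of those three pieces. It is written in Python purely for illustration (the job itself asks for Perl or PHP), it does not download anything, and it is single-threaded. All function names and the example.org URLs are made up for this sketch and are not part of the actual databases.

```python
import re
from urllib.parse import urlsplit

# Matches "ID=" or "id=" (this also covers names such as "DirID=") followed by digits.
ID_PATTERN = re.compile(r"(id=)(\d+)", re.IGNORECASE)


def with_id(url, new_id):
    """Return url with the number after the first ID=/id= replaced by new_id."""
    result, count = ID_PATTERN.subn(
        lambda m: m.group(1) + str(new_id), url, count=1
    )
    if count == 0:
        raise ValueError("no ID= or id= followed by a number in URL")
    return result


def id_sequence(start, limit):
    """Yield IDs alternately above and below start, never going below 0."""
    yield start
    for step in range(1, limit + 1):
        yield start + step
        if start - step >= 0:
            yield start - step


def in_scope(url, base_url):
    """True if url is on the same host as base_url and inside its folder."""
    base = urlsplit(base_url)
    target = urlsplit(url)
    if target.netloc.lower() != base.netloc.lower():
        return False
    # Compare with trailing slashes so "/folderX" does not match "/folder".
    base_path = base.path if base.path.endswith("/") else base.path + "/"
    return (target.path + "/").startswith(base_path)


def dedupe_emails(emails):
    """Drop duplicates case-insensitively, keeping the first spelling seen."""
    seen = set()
    unique = []
    for email in emails:
        cleaned = email.strip()
        key = cleaned.lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique
```

For example, with_id("http://www.example.org/index.cfm?ID=49312", 49313) gives the same address with ID=49313, and id_sequence(49312, 1000) would produce the numbers to try on both sides of the starting ID. A depth counter per page and the periodic saving would sit in the crawl loop around these helpers, which is not shown here.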
For further details, go to
Job:Email Extractor | myTino - The World's Leading Online Outsourcing Network