Problem:
The user needs a fast way to collect data from a website. Previously, they were manually copying and pasting it line by line into Excel; to make it harder, some of the information sits inside a separate "more details info" page.
The data needs to be refreshed monthly to pick up newly updated details.
Solution:
Create a scraper script.
Schedule it with a cron job.
Update suggestion:
Next, store the results in our own database and expose our own API for other apps to call, along with additional custom data points.
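For the database part, a minimal sketch using SQLite (the schema, table name, and sample row here are made-up placeholders, not the real data model):

import sqlite3

conn = sqlite3.connect('contractors.db')  # hypothetical database file
conn.execute("""
    CREATE TABLE IF NOT EXISTS contractors (
        ssm_no TEXT PRIMARY KEY,
        name   TEXT,
        grade  TEXT,
        note   TEXT  -- room for an additional custom data point
    )
""")
# upsert one scraped row (placeholder values)
conn.execute("INSERT OR REPLACE INTO contractors (ssm_no, name, grade) VALUES (?, ?, ?)",
             ('0000000-X', 'Example Sdn Bhd', 'G7'))
conn.commit()
conn.close()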
Details:
- Get the link of the main page data; in this case it is a web form.
- Open Chrome DevTools (the browser's inspect tool).
- Select the "Network" tab.
- We will watch the requests in this panel.
- Press the Search button; the request that returns the data will show up.
- It is as if we found a hidden API that we can send a payload to and receive a response from.
- Note: some websites cross-check cookies and need an extra layer of authentication when you call the API directly. One way to handle this is to reuse the site's cookies across requests, as in the sketch below.
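A minimal sketch using requests.Session to carry the cookies along (this assumes the site only checks its own session cookies; a real login flow would need credentials or tokens on top):

import requests

session = requests.Session()
# GET the search page first so the server sets its session cookies
session.get('http://cims.cidb.gov.my/SMIS/regcontractor/reglocalsearchcontractor.vbhtml')
# later POSTs through the same session resend those cookies automatically
response = session.post('http://cims.cidb.gov.my/SMIS/regcontractor/reglocalsearchcontractor.vbhtml',
                        data={'ComState': '7'})
print(response.status_code)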
- Playing with the API.
- Install the RapidAPI extension in VS Code.
- Select POST, enter the URL we are testing, and copy the payload from the Chrome inspector.
- Try changing the payload form data and submitting it; the response will differ based on the key values you give, as in the sketch below.
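For example, a trimmed-down sketch that resubmits the form with different ComGradeID values and compares the responses (the real form sends more keys, as in the full script below):

import requests

url = 'http://cims.cidb.gov.my/SMIS/regcontractor/reglocalsearchcontractor.vbhtml'
for grade in ['1', '2', '7']:  # sample ComGradeID values to compare
    response = requests.post(url, data={'ComGradeID': grade, 'seltype': '1'})
    # a different grade should return a different result set
    print(grade, response.status_code, len(response.text))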
- Next we will use Python with BeautifulSoup4 (bs4) to parse the data properly.
import requests
from bs4 import BeautifulSoup
from tqdm.auto import tqdm
import csv
import time
import re
start = time.time()
# save the results as "G1.csv"
with open("G1.csv", "w", newline='') as fp:
    payload = {
        'comName': '',
        'ComState': '7',          # choose the state by number: 1-Kelantan, 2-Selangor, etc.
        'ComCategoryID': '',      # choose the category: 1-building, etc.
        'SSMNo': '',
        'ComDistrict': '',
        'ComSpecID': '',
        'ComGradeID': '7',        # choose the grade: 1-G7, 2-G6, etc.
        'seltype': '1',
        'hdnpagesize': '25000',   # change the page size
        'hdnctpage': '1',
        'hdntotpage': '1000',
        'hdnsortcol': 'ComGrade',
        'hdnsortdir': '0',
        'hdntotalrecs': '25000',  # change the total record count
        'hdnexportopts': '0',
        'txtgoto': '',
        'selpagesize': '25000',   # change the page size here too
        'selvalidity': '1'        # validity: 0-All, 1-Valid, 2-Expired
    }
    post_response = requests.post(url='http://cims.cidb.gov.my/SMIS/regcontractor/reglocalsearchcontractor.vbhtml', data=payload)
    search_html = post_response.text
    # print(search_html)
    url_papar = "http://cims.cidb.gov.my/SMIS/regcontractor/reglocalsearch_view.vbhtml?search=P&comSSMNo="
    soup = BeautifulSoup(search_html, 'lxml')
    # use a regex to find the word "Jumlah" (Malay for "total")
    Ptotal = soup.find(string=re.compile("Jumlah"))
    print("G1 " + Ptotal)
    Pbtotal = re.sub(r'\D', '', Ptotal)  # strip everything except the digits
    with tqdm(total=int(Pbtotal)) as progress_bar:
        for index, ssmNo in enumerate(soup.select('a.open-AddBookDialog1')):
            progress_bar.update(1)
            ssm_No = ssmNo.get('data-flag')
            while True:
                try:
                    papar_response = requests.get(url_papar + ssm_No)
                    papar_html = papar_response.text
                    break  # you can also check the returned status before breaking the loop
                except requests.exceptions.RequestException:
                    print('Internet problem, retrying in 10 seconds')
                    time.sleep(10)  # wait 10 seconds before retrying
            try:
                soup2 = BeautifulSoup(papar_html, 'lxml')
                # print(index)
                company = []
                table = soup2.find_all('table')[0]
                rows = table.find_all('tr')
                title = soup2.find('h3').text
            except (IndexError, AttributeError):  # page has no table or no <h3>
                continue
            # skip companies that have no data
            if title == 'CONTRACTOR PROFILE REGISTERED WITH CIDB':
                for row in rows:
                    info = row.find_all('th')[1].text.strip().replace("\r\n ", ",")
                    company.append(info)
                table2 = soup2.find_all('table')[1]
                rows2 = table2.find_all('tr')[1]
                grade = rows2.find('td').text
                company.append(grade)
                wr = csv.writer(fp, dialect='excel')
                wr.writerow(company)
            else:
                continue
end = time.time()
total_seconds = end - start
print(time.strftime("%H:%M:%S", time.gmtime(total_seconds)))
print('Done')
- Next, we automate the monthly data collection update using a cron job.
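For example, a crontab entry like this would rerun the scraper at 02:00 on the 1st of every month (the interpreter and file paths here are assumptions; adjust them to your machine):

# run at 02:00 on the 1st day of every month
0 2 1 * * /usr/bin/python3 /home/user/scraper.py >> /home/user/scraper.log 2>&1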