Scraping Data from your Bank in Python
In my previous posts I talked about ledger and plain text accounting. The only part missing is a way to import transactions from your bank. Until now I had been doing this by hand, bi-weekly, and each time I had to do the following:
- Log in to online banking
- Go to the transactions page
- Select the date range for transactions I needed (double check last date of transaction in ledger at this point)
- Download the Microsoft Excel file (which wasn’t in the format I needed)
- Convert this Excel file into a CSV file that matched my import format (watch the dates, is it YYYY-MM-DD or DD/MM/YYYY?)
- Finally import the CSV file into ledger
- Check the balance matches between my online banking and ledger
Sounds like a lot of work, right?
That’s when I decided to write a script to do this (well, some of it) automatically. I chose Python to refresh my skills, and set myself the extra challenge of writing the code entirely on my iPad Pro. This made things harder, due to the limited selection of built-in modules in the Pythonista app. I looked into modules I could install via pip, but a lot of these simply won’t work on iOS, as they expect Google Chrome or Mozilla Firefox (i.e. they are Selenium based). In the end I settled on the excellent requests and Beautiful Soup Python modules.
This creates a bit of extra work, since we can’t execute any JavaScript on the pages. So I first fetched the contents of my bank’s login page and dumped the HTML (roughly as in the sketch after this list) to find:
- the names of the form fields
- the endpoint (the form’s submit action) where the data is sent
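If you want to try that inspection step yourself, here is a minimal sketch of it, using the same requests session and Beautiful Soup parsing (and the same login URL) as the full script at the end of this post:

import requests
from bs4 import BeautifulSoup

LOGIN_URL = "https://www.open24.ie/online/Login/"

session = requests.session()
page = session.get(LOGIN_URL)
soup = BeautifulSoup(page.content, "html.parser")

# Dump each form's submit action (the endpoint) and the names of its input fields
for form in soup.find_all("form"):
    print(form.get("action"))
    for field in form.find_all("input"):
        print("  %s" % field.get("name"))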
I then built the payload myself, with the data laid out just as the browser would send it, and got requests to post it to the correct endpoint (exactly as a browser does). At first this didn’t work, and then I remembered there might be a CSRF token. Sure enough, when I checked the HTML source there was one, so I wrote a function to pull it out of the page and send it along with the rest of the data. Success! I got to the second login page…
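In code, the fix boils down to grabbing the hidden __RequestVerificationToken field from the login form and sending it with the rest of the payload. This is a condensed slice of the full script below (soup and session carry over from the sketch above; USERNAME and PASSWORD are as defined in the full script):

# Grab the hidden CSRF token from the login form
token = soup.find("input", {"name": "__RequestVerificationToken"}).get("value")

# Build the payload exactly as the browser would submit it
payload = {
    "__RequestVerificationToken": token,
    "section": "",
    "login-number": USERNAME,
    "login-password": PASSWORD,
}
result = session.post(LOGIN_URL, data=payload, headers={"referer": LOGIN_URL})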
After this came my PIN, but the bank only asks for three of its digits. I simply put my PIN in a list and scraped which digit positions were being asked for, subtracting 1 from each (since lists are 0-indexed). I posted this data exactly the same way as before, and I was in! I could display all my balance info.
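To make the off-by-one handling concrete, here is the idea with a made-up example: the page labels tell you which digit positions are wanted (1-based), and subtracting 1 turns them into list indices.

pin = [1, 2, 3, 4]      # PIN stored digit by digit, as in the script below
requested = [2, 4, 1]   # hypothetical digit positions asked for on the page

# Positions on the page are 1-based, Python lists are 0-based, so subtract 1
answer = [pin[d - 1] for d in requested]
print(answer)           # -> [2, 4, 1], i.e. digits 2, 4 and 1 of the PIN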
The last step was to get the transactions. This proved a bit tricky too, since my bank’s download button for the Excel file is built in Flash. I was stumped until I noticed that the transactions are rendered with an open source library called DataTables, where every cell is nicely marked up with the kind of data it contains (date, description, money in, money out, and so on). I used Beautiful Soup to grab this data and then wrote it out to a CSV file. I was quite happy with the script, and with how fast I could pull it together in Python!
Warning: the code below is just to show how it could be done, and it was written in a hurry to get a job done. Use it for ideas, not as a style guide for well-named variables and good documentation!
#!/usr/bin/python2
from __future__ import print_function
import requests
from datetime import datetime
import urlparse
import urllib
import dropbox
import csv
from bs4 import BeautifulSoup
USERNAME = ""
PASSWORD = ""
pin = [1,2,3,4]
FILENAME="bank.csv"
start_date = urllib.quote_plus(datetime(2017, 05, 15).strftime("%d/%m/%Y"))
end_date = urllib.quote_plus(datetime.today().strftime("%d/%m/%Y"))
ROOT_URL = "https://www.open24.ie/online"
LOGIN_URL = "https://www.open24.ie/online/Login/"
FORM_ENDPOINT_URL = "https://www.open24.ie/online/"
FORM2_ENDPOINT_URL = "https://www.open24.ie/online/Login/Step2"
TX_ENDPOINT = "https://www.open24.ie/online/Accounts/Details/RecentTransactions"
def get_data(session, url):
    result = session.get(url)
    return BeautifulSoup(result.content, "html.parser")
def post_data(session, url, data):
    result = session.post(url, data=data, headers=dict(referer=LOGIN_URL))
    return BeautifulSoup(result.content, "html.parser")
"""
Finds out which digits are requested from PIN
"""
def get_pin_digits(soup):
    t1 = int(soup.find("label", {'for': 'login-digit-1'}).string.replace("Digit ", "")) - 1
    t2 = int(soup.find("label", {'for': 'login-digit-2'}).string.replace("Digit ", "")) - 1
    t3 = int(soup.find("label", {'for': 'login-digit-3'}).string.replace("Digit ", "")) - 1
    return t1, t2, t3
"""
Gets the Verification token used to submit forms
"""
def get_verif_token(soup):
return soup.find("input", {'name': '__RequestVerificationToken'}).get('value')
"""
Gets the account UUID generated on login
"""
def get_acct_id(soup):
    acct_id_url = soup.find("h2", {'class': 'heading-general'}).contents[0]['href']
    url_data = urlparse.urlparse(acct_id_url)
    query = urlparse.parse_qs(url_data.query)
    acct_id = query["accountId"][0]
    return acct_id
def main():
    session_requests = requests.session()
    # Get login csrf token
    soup = get_data(session_requests, LOGIN_URL)
    authenticity_token = get_verif_token(soup)
    # Create payload
    payload = {
        "__RequestVerificationToken": authenticity_token,
        "section": "",
        "login-number": USERNAME,
        "login-password": PASSWORD,
    }
    # Perform initial login
    soup = post_data(session_requests, LOGIN_URL, payload)
    authenticity_token = get_verif_token(soup)
    t1, t2, t3 = get_pin_digits(soup)
    # Create payload
    payload = {
        "__RequestVerificationToken": authenticity_token,
        "login-digit-1": pin[t1],
        "login-digit-2": pin[t2],
        "login-digit-3": pin[t3]
    }
    # Perform 2nd login
    soup = post_data(session_requests, FORM2_ENDPOINT_URL, payload)
    avail_funds = soup.find("span", {'class': 'fund-1'}).string.encode('ascii', 'ignore')
    cur_bal = soup.find("span", {'class': 'fund-2'}).string.encode('ascii', 'ignore')
    acct_id = get_acct_id(soup)
    print("Available Funds: ", avail_funds)
    print("Current Balance: ", cur_bal)
    print("Account ID: ", acct_id)
    url = TX_ENDPOINT + '?accountId=' + acct_id + '&from-date=' + start_date + '&to-date=' + end_date
    soup = get_data(session_requests, url)
    valid_trs = [item for item in soup.find_all('tr') if "data-uid" in item.attrs]
    outputFile = open(FILENAME, 'w')
    outputWriter = csv.writer(outputFile, lineterminator='\n')
    outputWriter.writerow(['Date', 'Desc', 'In', 'Out', 'Total'])
    for tr in valid_trs:
        date = datetime.strptime(tr.find("td", {'class': 'date'}).string, '%d %b %y').strftime('%d-%b-%y')
        desc = tr.find("td", {'class': 'desc'}).string
        if desc is None:
            desc = tr.find("td", {'class': 'desc'}).find('a', {'class': 'underline'}).string
        desc = str(desc).strip()
        tds = tr.find_all("td", {'class': 'currency'})
        monin = tds[0].string
        monout = tds[1].string
        total = tds[2].string
        outputWriter.writerow([date, desc, monin, monout, total])
    outputFile.close()
if __name__ == '__main__':
    main()