Video

Oil prices with ease

The web is alive with fantastic sources of information on all manner of topics, and many websites provide an RSS feed or API so you can link into their data. However, many sites do not, and we have to resort to more elaborate ways of extracting the data from them. Well, it would be great if we could get the computer to automatically query the web for the latest prices and return them to us. In the example here, we are going to pull in the latest Brent Crude price from Bloomberg, the Financial Times and the BBC, work out the average price and save the values off into text files before displaying them.

So… what's the idea?

One of the many great things about Python is how easy it is to throw together scripts for relatively complex tasks which, in other languages, would take much more code. The standard library in Python is rich, and there's an extensive set of additional libraries we can plug into. In this task, urllib2 will be used to connect to the websites and read the underlying HTML code, Beautiful Soup will help process that information, and a regular expression or two will finish things off.
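To give a flavour of the first step, here's a minimal sketch of the fetch stage, assuming nothing beyond the standard library (the URL is a placeholder rather than one of the actual sources):

[sourcecode language="python"]
# Minimal fetch sketch: urllib2.urlopen() returns a file-like object
# whose read() method hands back the raw HTML as a single string.
import urllib2

html = urllib2.urlopen("http://www.example.com/").read()
print len(html), "bytes of HTML downloaded"
[/sourcecode]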

Beautiful Soup makes it all so easy

Python has an inbuilt HTML parser, but we can do an awful lot more using the rather fantastic Beautiful Soup, which makes pulling and processing data from websites really simple; pass in the page and it will pull out all the links, find selected tables based on any criteria, or simply match whatever you'd like to find with ease. We'll use Beautiful Soup on two of the websites we visit to pick out the prices from certain rows.
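As a taste of what that looks like in practice, here's a small, self-contained sketch of the Beautiful Soup calls we'll rely on, run against an inline HTML snippet (with a made-up price) so no network access is needed:

[sourcecode language="python"]
from BeautifulSoup import BeautifulSoup

html = """
<html><body>
<a href="/markets">Markets</a>
<table><tr><td>Brent Crude</td><td>111.55</td></tr></table>
</body></html>
"""
soup = BeautifulSoup(html)

# Pull out every link on the page...
for link in soup.findAll('a'):
    print link['href'], link.string

# ...or index straight into a particular table row and its cells
row = soup.findAll('tr')[0]
print row.findAll('td')[1].string  # prints 111.55
[/sourcecode]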

The first stage is to import the four libraries we're going to use. urllib2, re and time are part of the standard library and will be part of your Python build automatically. BeautifulSoup will need to be installed (see link below); I'm running 3.2.0, which is fully compatible with Python 2.7. After this, the current system time is stored in a variable, which we could use to calculate the total elapsed time come the end of the import. Strictly speaking, this is a superfluous step and one which can be removed in a production environment.
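If you do want to keep the timing step, here's a sketch of how the elapsed time could be reported (the script below stores the start time but never actually prints it):

[sourcecode language="python"]
import time

start = time.time()
# ... the scraping work would go here ...
elapsed = time.time() - start
print "Elapsed time: %.2f seconds" % elapsed
[/sourcecode]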

The three websites get scraped next. In each case, urllib2 is passed the URL, the raw contents of the page are read and then passed to BeautifulSoup for data extraction. The page elements are static but populated by a backend data source, and thus the locations of the table rows can be hard coded – BloomSoup.findAll('tr')[14], FTsoup.findAll('tr')[0] and BBCSoup.findAll('tr')[14]. Finally, for each of the three sources, a simple regular expression is used to extract the actual price from the parsed data element.
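The technique is easier to see on a canned snippet than on a live page; here's a self-contained sketch using stand-in HTML (the real pages and row indices appear in the full script below):

[sourcecode language="python"]
import re
from BeautifulSoup import BeautifulSoup

html = "<table><tr><td>Brent Crude ($/Brl)</td><td>111.55</td></tr></table>"
soup = BeautifulSoup(html)

row = soup.findAll('tr')[0]  # hard-coded row index, as on the real pages
# str(row.contents) flattens the row to text; the regular expression
# then pulls out the first decimal number it finds
price = float(re.search(r"\d+\.\d*", str(row.contents)).group())
print price  # 111.55
[/sourcecode]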

Saving and displaying

The objects BloomPrice, FTPrice and BBCPrice now hold the three extracted oil prices, and the float() function has converted them from strings into numerical values. The penultimate step is to use the print statement to display the values on the console along with a short mention of the source. An average could be calculated with the help of an external library but, for our purposes, that would be overkill; a simple arithmetic mean is calculated by summing the three prices and dividing by three.
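Written out with stand-in values for the three scraped prices, the mean is just:

[sourcecode language="python"]
# Stand-in values; in the script these come from the three websites
prices = [112.10, 111.98, 112.05]
average = sum(prices) / len(prices)
print "Average : %.2f" % average
[/sourcecode]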

Finally, the prices are saved to the location given in the OutputPath variable and then presented in the console for 10 seconds.
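As an aside, the file handles in the script below are left for the interpreter to close; a slightly more careful version of the save step might use context managers instead (the path and prices here are stand-ins, and the output folder is assumed to exist):

[sourcecode language="python"]
# Sketch of the save step using with-blocks, which close each file
# explicitly; the values and path are stand-ins for those in the script
OutputPath = "C:\\Test\\"
BloomPrice, FTPrice, BBCPrice = 112.10, 111.98, 112.05

for name, value in [("Bloomberg", BloomPrice), ("FT", FTPrice),
                    ("BBC", BBCPrice),
                    ("Average", (BloomPrice + FTPrice + BBCPrice) / 3)]:
    with open(OutputPath + name + ".txt", "w") as f:
        f.write("%.2f" % value)
[/sourcecode]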

Python Source Code

The source code presented here has been updated – thanks to SoonerBourne34 – to reflect changes to the BBC and Bloomberg websites. Therefore, the code displayed here does not perfectly match that shown in the YouTube video.

 

[sourcecode language="python"]
from BeautifulSoup import BeautifulSoup
import urllib2, re, time

start = time.time()

# Find Bloomberg Brent Price
rawBloomData = urllib2.urlopen("http://www.bloomberg.com/energy/").read()
BloomSoup = BeautifulSoup(rawBloomData)
brent = BloomSoup.findAll('tr')[14]
BloomPrice = float(re.search(re.compile(r"\d+\.\d*"), str(brent.contents)).group())

# Find FT Brent Price
rawFTData = urllib2.urlopen("http://markets.ft.com/tearsheets/performance.asp?s=1054972").read()
FTsoup = BeautifulSoup(rawFTData)
FT = FTsoup.findAll('tr')[0]
FTPrice = float(re.search(re.compile(r"\d+\.\d*"), str(FT)).group())

# Find BBC Brent Price
rawBBCData = urllib2.urlopen("http://www.bbc.co.uk/news/business/market_data/commodities/default.stm").read()
BBCSoup = BeautifulSoup(rawBBCData)
oyell = BBCSoup.findAll('tr')[14]
BBCPrice = float(re.search(re.compile(r"\d+\.\d*"), str(oyell)).group())

# Compile for display
print " "
print " Brent Crude ($/Brl)"
print " ------------------------"
print " Bloomberg       : %.2f" % (BloomPrice)
print " Financial Times : %.2f" % (FTPrice)
print " BBC             : %.2f" % (BBCPrice)
print " ------------------------"
print " Average         : %.2f" % ((BloomPrice + FTPrice + BBCPrice) / 3)
print " ------------------------"
print " "

# Write to files
OutputPath = "C:\\Test\\"
open(OutputPath + "Bloomberg.txt", "w").write("%.2f" % (BloomPrice))
open(OutputPath + "FT.txt", "w").write("%.2f" % (FTPrice))
open(OutputPath + "BBC.txt", "w").write("%.2f" % (BBCPrice))
open(OutputPath + "Average.txt", "w").write("%.2f" % ((BloomPrice + FTPrice + BBCPrice) / 3))

print "\n"
time.sleep(10)
[/sourcecode]

Links

BeautifulSoup : http://www.crummy.com/software/BeautifulSoup/

3 COMMENTS

  1. As an alternative to using a regexp: since you're already using BeautifulSoup to find a hardcoded table row, you can use it to extract the data element directly, e.g.

    BloomPrice = float(BloomSoup.findAll('tr')[14].findAll('td')[1].contents[0])

    • Thanks for that Brendan – this is the approach I would take now that I am more familiar with BS/Python, and it is appreciated that you took the time to mention it.
