The pricing of most goods and services is based on supply and demand, and electricity is no exception. The problem of electricity price forecasting is related to, yet distinct from, that of electricity load forecasting. Although demand (load) and price are correlated, their relationship is non-linear. The load is influenced by factors such as the non-storability of electricity, consumers’ behavioural patterns, and seasonal changes in demand. The price, on the other hand, is affected by those factors as well as additional aspects such as financial regulations, competitors’ pricing, dynamic market factors, and various other macro- and microeconomic conditions.
As a result, the price of electricity is far more volatile than the electricity load. Interestingly, when dynamic pricing strategies are introduced, prices become even more volatile: the daily average price can change by up to 50%, while other commodities may exhibit changes of about 5%. Load forecasting has progressed to the point where the load can be predicted with up to 98% accuracy in some cases, yet current state-of-the-art techniques in price forecasting are at most about 95% accurate. A more accurate price forecasting system is therefore necessary, since many retailers and their businesses depend on the price of electricity.
What’s the appeal of machine learning?
Since the invention of the computer, there have been people talking about the things that computers will never be able to do. Whether it was beating a grand master at chess or winning on Jeopardy!, these predictions have always been wrong. Whilst computers have become vastly more powerful, there are still tasks that human brains are far more adept at tackling: sorting images, reading text or recognising faces. To get around the limitations of classic rules-based approaches, researchers began to model the way that neurons are linked and fire in the brain, and thus created artificial neural networks.
There’s a big advantage to working with many simple actors rather than a single complex one: simple actors can self-correct. There have been attempts at self-editing versions of regular software, but it’s artificial neural networks that have taken the concept of machine learning to new heights.
You’ll hear the word “non-deterministic” used to describe the function of a neural network, and that’s in reference to the fact that our software neurons often have weighted statistical likelihoods associated with different outcomes for data; there’s a 40% chance that an input of type A gets passed to this neuron in the next layer, and a 60% chance it gets passed to that one instead. These uncertainties quickly add up as neural networks get larger or more elaborately interconnected, so that the exact same starting conditions might lead to many different outcomes or, more importantly, get to the same outcome by many different paths.
So, we introduce the idea of a “learning algorithm.” A simple example is improving efficiency: send the same input into the network over and over and over, and every time it generates the correct output, record the time it took to do so. Some paths from A to B will be naturally more efficient than others, and the learning algorithm can start to reinforce neuronal behaviours that occurred during those runs that proceeded more quickly.
Much more complex ANNs can strive for more complex goals, like correctly identifying the species of animal in a Google image result. The steps in image processing and categorisation get adjusted slightly, relying on an evolution-like sifting of random and non-random variation to produce a cat-finding process the ANN’s programmers could never have directly devised.
Non-deterministic ANNs become much more deterministic as they restructure themselves to be better at achieving certain results, as determined by the goals of their learning algorithms. This is called “training” the ANN — you train an ANN with examples of its desired function, so it can self-correct based on how well it did on each of these runs. The more you train an ANN, the better it should become at achieving its goals.
Stochastic neural networks are a type of artificial neural network built by introducing random variations into the network, either by giving the network’s neurons stochastic transfer functions or by giving them stochastic weights. This makes them useful tools for optimisation problems, since the random fluctuations help them escape from local minima.
There’s also the idea of “unsupervised” or “adaptive” learning, in which you run the algorithm with no desired outputs in mind, but let it start evaluating results and adjusting itself according to its own… whims? As you might imagine, this isn’t well understood just yet, but it’s also the most likely path down which we might find true AI — or just really, really advanced AI. If we’re ever truly going to send robots out into totally unknown environments to figure out totally unforeseen problems, we’re going to need programs that can assign significance to stimuli on their own, in real time.
That’s where the power of ANNs truly lies; since their structure allows them to make iterative changes to their own programming, they have the ability to find answers that their own creators never could have. Whether you’re a hedge fund, an advertising company, or an oil prospector, the sheer potential of combining the speed of a computer with the versatility of a brain is impossible to ignore. That’s why being able to program “machine learning” algorithms is now one of the most sought-after skill sets in the world. In the coming century we may very well be less concerned with solving problems than with teaching computers to learn to solve problems for us.
An artificial neural network is an interconnected group of nodes, akin to the vast network of neurons in a brain. Here, each circular node represents an artificial neuron and an arrow represents a connection from the output of one neuron to the input of another.
Many papers have been published in the area of using Artificial Neural Networks (ANNs) for price forecasting. They almost all start by extracting the best features from a pool of market features and training the ANN with these features in order to create a real-time forecasting model. Lagged prices are generally used in price forecasting because of their high auto-correlation with electricity market prices. However, in a real-time setup, apart from the system load and price during the previous hour, no other features are available, restricting us to features from the available pool.
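To get a feel for that auto-correlation, here is a minimal sketch on synthetic hourly data. The series, its daily pattern and all of its parameters are invented for illustration; real market data would be used in practice.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly price series: a base level, a daily pattern and
# some noise (synthetic data, for illustration only).
rng = np.random.default_rng(0)
hours = pd.date_range("2023-01-01", periods=24 * 60, freq="h")
daily_pattern = 10 * np.sin(2 * np.pi * hours.hour / 24)
prices = pd.Series(50 + daily_pattern + rng.normal(0, 2, len(hours)),
                   index=hours)

# Auto-correlation at lags of one hour and one day: both are high for
# series like this, which is why lagged prices make natural features.
print(prices.autocorr(lag=1))
print(prices.autocorr(lag=24))
```

Both values come out close to 1 here because the synthetic series is dominated by its smooth daily cycle; real prices are noisier, but the same lag structure is what lagged-price features exploit.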
Practical Machine Learning
Due to the importance of accurate price forecasting in volatile electricity markets, a number of approaches have been presented in the literature. These approaches range from traditional time series analysis to machine learning techniques for forecasting future prices. ARIMA and GARCH models are examples of traditional methods, while artificial neural networks, Hidden Markov Models, fuzzy inferred neural networks and support vector regression are examples of machine learning techniques.
Feature creation and selection are the first steps in either classification or regression and are widely used processes in machine learning that involve either the creation of new features or the selection of an optimal subset from a pool of existing features. The selected subset will contain key features which contribute to the accuracy of the forecasts and also help reduce over-fitting of the model.
The electricity market data comes in the form of a time series, i.e. as (time, value) pairs, and does not provide any specific features for use with an ANN. Thus we have to create features from the available past data to be used as inputs to the ANN. In price forecasting, it is important to take into account both short- and long-term trends and also seasonal patterns. Sudden changes in the price might be caused by seasonal behaviours and other factors. In order to capture this behaviour we create putatively relevant features based on historical data spanning longer periods, such as the last year/same day/same hour price, the last year/same day/same hour price fluctuation, the last week/same day/same hour price, the last week/same hour price fluctuation, etc.
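This kind of lagged-feature creation can be sketched with pandas. The column names and the synthetic series below are my own assumptions for illustration, not the features used in any particular paper.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly price data (synthetic, for illustration only).
rng = np.random.default_rng(1)
idx = pd.date_range("2022-01-01", periods=24 * 400, freq="h")
df = pd.DataFrame({"price": 50 + rng.normal(0, 5, len(idx))}, index=idx)

# Same-hour features from past data, expressed as shifts in hours:
df["price_last_week_same_hour"] = df["price"].shift(24 * 7)
df["price_last_year_same_hour"] = df["price"].shift(24 * 365)
# Price fluctuation (hour-on-hour change) a week ago at the same hour:
df["fluct_last_week_same_hour"] = (df["price"].shift(24 * 7)
                                   - df["price"].shift(24 * 7 + 1))

# Rows early in the series have no history yet, so drop them
# before training.
features = df.dropna()
```

Because the longest lag used here is one year, the first year of the series yields no usable rows; in practice you would hold at least that much history before training.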
In machine learning, pattern recognition and in image processing, feature extraction starts from an initial set of measured data and builds derived values (features) intended to be informative and non-redundant, facilitating the subsequent learning and generalisation steps, and in some cases leading to better human interpretations. Feature extraction is related to dimensionality reduction.
Feature selection is a very important step towards building a robust forecasting model. The number of features that are suitable for an ANN will vary depending upon the model, but we can imagine that after creating a number of features that capture long- and short-term trends, it’s important to perform feature selection to find the best set of features. Features generated using historical prices already give good forecasting results, but it is possible to improve the accuracy by considering other features that are not directly associated with price data. Other parameters that could plausibly affect the load or price in the market, such as temperature, day of the week and the occurrence of holidays, can be incorporated into the generated feature set.
Feature selection techniques should be distinguished from feature extraction. Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features. Feature selection techniques are often used in domains where there are many features and comparatively few samples (or data points). Archetypal cases for the application of feature selection include the analysis of written texts and DNA microarray data, where there are many thousands of features, and a few tens to hundreds of samples.
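As one concrete sketch of the selection step, here is a univariate filter using scikit-learn’s SelectKBest on synthetic data. The library, the scoring function and the data are all my own choices for illustration; many other selection methods exist.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(2)
n = 500
# Synthetic pool of eight candidate features; only the first two
# actually drive the target, mimicking a mix of useful and
# irrelevant features.
X = rng.normal(size=(n, 8))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, n)

# Keep the k features with the strongest univariate relationship to y.
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.get_support(indices=True))
```

A univariate filter like this is cheap but blind to feature interactions; wrapper or embedded methods (e.g. recursive feature elimination) trade more computation for that awareness.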
In order to forecast the price in the following hour, we would need the temperature in that particular hour. As this is not available in real time, it’s common to use the predicted temperature provided by a weather forecasting service. For training and testing purposes historical data is used, but for real-time use we can use a service such as wunderground.com to get the forecasted temperature value. For the holiday data, we use predefined holidays in the tested region and bind them directly to the existing data.
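Binding predefined holidays (and day-of-week information) to the existing data might look like the following pandas sketch; the holiday list is a made-up example for a hypothetical region.

```python
import pandas as pd

# Hypothetical predefined holidays for the tested region.
holidays = pd.to_datetime(["2023-01-01", "2023-07-04", "2023-12-25"])

# One week of hourly data starting on 1 January 2023.
idx = pd.date_range("2023-01-01", periods=24 * 7, freq="h")
df = pd.DataFrame({"load": 1.0}, index=idx)

# Bind holiday and day-of-week information directly to the data:
# normalize() drops the time-of-day so each hour matches its date.
df["is_holiday"] = df.index.normalize().isin(holidays).astype(int)
df["day_of_week"] = df.index.dayofweek

print(df["is_holiday"].sum())  # all 24 hours of 1 January are flagged
```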
Constructing the Neural Network
Many models of ANN have been proposed for classification and regression (forecasting) problems in machine learning, and we will look to cover many of them in more depth in later blog posts.
To perform forecasting using a neural network, two basic steps are required: learning (training) and forecasting. We assume that a training set containing historical data along with the desired outputs is available. In the learning step the neural network learns to reconstruct the input-output mapping by updating the weights and biases at the end of each iteration. Back-propagation is the most common learning algorithm, in which, at the end of each iteration, the output error is propagated back towards the input, adjusting the weights and biases. To overcome the slow convergence rate of the back-propagation algorithm, two parameters, the learning rate and momentum, can be adjusted.
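A minimal sketch of this training step, assuming scikit-learn’s MLPRegressor as the network (my choice for brevity, not necessarily what a production forecaster would use). With the stochastic gradient descent solver, the two parameters just mentioned appear directly as `learning_rate_init` and `momentum`; the toy mapping is invented.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# A toy input-output mapping standing in for the historical
# (features, price) training set.
rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(400, 3))
y = X[:, 0] + 0.5 * X[:, 1] ** 2

# Back-propagation with stochastic gradient descent; the learning rate
# and momentum are exposed as constructor parameters.
net = MLPRegressor(hidden_layer_sizes=(16,), solver="sgd",
                   learning_rate_init=0.05, momentum=0.9,
                   max_iter=2000, random_state=0)
net.fit(X, y)
print(net.score(X, y))  # R^2 on the training data
```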
What is the learning rate?
The learning rate is how quickly a network abandons old beliefs for new ones. To use a contrived example: if a child sees 10 examples of cats and all of them have orange fur, she will think that cats have orange fur and will look for orange fur when trying to identify a cat. Now she sees a black cat and her parents tell her it’s a cat (supervised learning). With a large “learning rate”, she will quickly realise that orange fur is not the most important feature of cats. With a small learning rate, she will think that this black cat is an outlier and that cats are still orange.
Whilst that example is a bit of a stretch, the point is that a higher learning rate means the network changes its mind more quickly. That can be good in the case above, but it can also be bad. If the learning rate is too high, the network might start to think that all cats are black even though it has seen more orange cats than black ones.
In general, you want to find a learning rate that is low enough that the network converges to something useful, but high enough that you don’t have to spend years training it.
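That trade-off is easy to see on a toy problem. Here is plain gradient descent on f(w) = w², whose gradient is 2w; the objective and numbers are contrived purely to show the effect of the learning rate.

```python
# Gradient descent on f(w) = w**2, gradient 2*w. The learning rate
# controls how fast (or whether) the iterates converge to w = 0.
def descend(lr, steps=50, w=10.0):
    for _ in range(steps):
        w -= lr * 2 * w  # one gradient step
    return w

print(abs(descend(lr=0.1)))   # low rate: steady convergence
print(abs(descend(lr=0.45)))  # higher rate: converges much faster
print(abs(descend(lr=1.1)))   # too high: the iterates diverge
```

Each step multiplies w by (1 − 2·lr), so convergence requires that factor to stay below 1 in magnitude; real networks have no such clean formula, which is why the rate is tuned empirically.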
What is momentum?
In neural networks, we use the gradient descent optimisation algorithm to minimise the error function and reach a global minimum. To avoid the sub-optimal results of the algorithm getting stuck in a local minimum, we add a momentum term to the weight update: a value between 0 and 1 that increases the size of the steps taken towards the minimum, helping the algorithm jump out of local minima.
If the momentum term is large then the learning rate should be kept smaller. A large value of momentum also means that convergence will happen quickly, but if both the momentum and the learning rate are large, you might skip over the minimum with a huge step. A small value of momentum cannot reliably avoid local minima, and can also slow down the training of the system. Momentum also helps to smooth out the variations if the gradient keeps changing direction. The right value of momentum can be found either by trial and error or through cross-validation.
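The update rule itself fits in a few lines. This is classical momentum (one common formulation, my choice) on a toy quadratic objective f(w) = w²; the velocity term accumulates past gradients, which is what smooths out direction changes.

```python
# Classical momentum on f(w) = w**2 (gradient 2*w). velocity
# accumulates past gradients; momentum is the 0-to-1 factor
# described above.
def descend_momentum(lr, momentum, steps=200, w=10.0):
    velocity = 0.0
    for _ in range(steps):
        velocity = momentum * velocity - lr * 2 * w  # accumulate gradient
        w += velocity                                # take the smoothed step
    return w

# At this small learning rate, adding momentum speeds up convergence
# towards w = 0 compared with plain gradient descent.
print(abs(descend_momentum(lr=0.01, momentum=0.9)))
print(abs(descend_momentum(lr=0.01, momentum=0.0)))
```

With momentum = 0 this reduces to plain gradient descent; pushing both the rate and the momentum up eventually makes the iterates overshoot and oscillate, which is the interaction described above.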
The road ahead
This blog post is the first in a Machine Learning series and I hope you’ll join me as I take my first steps down the path, learning some of the basics of machine learning, discovering how they can be applied to practical problems, and sharing some of the cutting-edge research in this incredibly interesting field. If you’ve found this blog helpful and would like other topics covered, please feel free to drop me an email with suggestions. You’re welcome to subscribe using the ‘Subscribe to Blog via Email’ section, which will get you the latest posts straight to your inbox before they’re available anywhere else.