Forecasting models
RAJEEVA KARANDIKAR
THIS article discusses some issues related to forecasting election outcomes in the Indian context drawing upon my team’s experience of forecasting election outcomes for Indian Parliament and state assembly elections over the last two decades. Even as we developed a scientific approach to the whole exercise, at times we had to resort to the art of the feasible. We will describe this approach and also point out some limitations of this endeavour.
As will be evident, there is science behind this exercise – sampling, estimation, and so on, ideas from statistics. But the data size required to predict the winner in each of the 543 constituencies is too big to be practical – we simply do not have resources (money as well as trained and reliable manpower) to conduct a survey with say 10,00,000 respondents. This is where art enters the picture. We have to explore crude models and zero in on something that works in practice – in other words, explore the art of the feasible.
Let us start with a question that has been repeatedly asked ever since the advent of opinion polls: How can the opinion of, say, 40,000 people tell anyone how the 83.4 crore (834 million) voters in India are going to vote?1 The simple explanation is that the accuracy of a sampling scheme essentially depends upon the sample size and not on the sampling fraction (sample size as a proportion of population size).2 This is counter-intuitive but true. In principle, even a sample size of 3381 across the country can tell us the approximate vote share of major political parties (to within 2%).
In any statistical exercise, one must start with an overall objective, then take into account what historical data is available and what fresh data can be collected with the resources available. Finally, taking into account domain expertise, one needs to build a statistical model that links the objective to the data.
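A minimal sketch of this claim: the margin of error of an estimated vote share depends only on the sample size, since the population size does not appear in the formula. Here z ≈ 2 (roughly a 95% confidence level) and the worst-case proportion p = 0.5 are standard textbook assumptions, not figures from the article:

```python
import math

def margin_of_error(n, z=2.0, p=0.5):
    """Half-width of the confidence interval for an estimated
    proportion, using the conservative worst case p = 0.5."""
    return z * math.sqrt(p * (1 - p) / n)

# The margin depends only on the sample size n, not on the
# population size: the same formula applies whether the
# electorate is one lakh or 83.4 crore.
for n in (1000, 3381, 10000, 40000):
    print(n, round(100 * margin_of_error(n), 2))
```

With n = 3381 the margin works out to under 2 percentage points, consistent with the figure quoted above.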
The objective of a nationwide poll is prediction of seats for major parties. If we conduct a survey with 3381 respondents (chosen randomly) in each constituency, we will be able to predict the winner in each constituency with reasonable accuracy and thereby forecast the number of seats each party would get. However, this involves canvassing about 18,35,000 respondents! It is unlikely that any pollster has this sort of money or trained and reliable manpower to carry out such an exercise.
An opinion poll will yield estimates of vote shares for different parties. However, no simple formula can translate this vote share into seats, as the number of seats that various parties win also depends on the way the votes for these parties are distributed across the constituencies. Note that there will not be a large enough sample in each constituency to permit a stand-alone prediction of a winner – indeed, in some constituencies there may be no data at all. Thus, one needs to build a statistical model that would help deduce voting intentions in a constituency based on opinion poll data in that and other constituencies, say neighbouring ones or those in the same state.
To build a model, let us take stock of the relevant data that is available. We will discuss this for a parliamentary election with 543 constituencies. The historical election data consists of the following: for each election in each constituency, the total electorate and its break-up between male and female voters, total votes polled and its break-up between male and female voters, total valid votes, and total votes for each candidate (and the party of each candidate, if any). When elections are announced, voter lists for the ensuing elections are published and organized as follows: in each constituency, a list of polling booths is available, and for each polling booth one has a list of eligible voters along with their gender, age and address. No other socio-economic variable (education, religion, caste, income, etc.) is available.
The population profile on socioeconomic variables is available at the state level, as also at the district level, but not at the constituency level, since each electoral constituency usually comprises parts of several districts. Since we do not have profiles of constituencies on socioeconomic variables, it is not possible to use these variables in a model. Another important feature of Indian electoral reality is the volatility of public opinion.3 These two factors, namely the absence of socioeconomic profiles at the constituency level and the high volatility of public opinion, mean that one cannot use the UK models for poll prediction in India.
Let us examine what is feasible. Given the resources (monetary and reliable manpower), we can get reasonable vote estimates for parties for any group of 15-20 constituencies (or more). Noting that in India the major parties in contention for winning seats vary from state to state and, further, that the voting pattern in one state seems to have little or no effect on another state, we model votes for various parties in each state using only the opinion poll data in that state. In other words, we treat each state in isolation.
The swing across a group of constituencies for a given party/alliance is defined as the change in the percentage of votes for the party/alliance from the previous election to the current election. Once we estimate votes for the party/alliance in a given group of constituencies, using historical data we can estimate the swing across that group for all major parties. Each large state can be divided into geographic sub-regions (e.g. Maharashtra into Mumbai and suburbs, western Maharashtra, Vidarbha, Konkan, Marathwada). Likewise, the state can be divided into rural and urban constituencies, or into groups of reserved and general constituencies.
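The swing computation above can be sketched as follows; the dictionary layout of the vote data is a hypothetical choice for illustration:

```python
def swing(group_prev, group_curr, party):
    """Swing for `party` across a group of constituencies: the change in
    its aggregate vote percentage from the previous election (actual
    votes) to the current one (survey estimates)."""
    prev_pct = 100.0 * sum(c[party] for c in group_prev) / sum(c['total'] for c in group_prev)
    curr_pct = 100.0 * sum(c[party] for c in group_curr) / sum(c['total'] for c in group_curr)
    return curr_pct - prev_pct

# Toy figures for two constituencies in one group:
prev = [{'A': 400, 'total': 1000}, {'A': 350, 'total': 1000}]  # last election
curr = [{'A': 450, 'total': 1000}, {'A': 390, 'total': 1000}]  # survey estimate
print(swing(prev, curr, 'A'))  # 4.5 (37.5% -> 42.0%)
```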
We make the assumption that the swing for a party/alliance in a constituency is a convex combination of the swing across the state, the swing across the geographical region to which the constituency belongs, the swing across rural/urban constituencies (as the case may be) and the swing across reserved/general constituencies (as the case may be). Indeed, we could also divide the state into groups by some other criterion, such as the phase of the election, when elections are spread across days (as is often the case in recent times). Since 1998, we have used the model described in the Appendix, with minor modifications, on numerous occasions and it has served its purpose.4 Of course, we have attempted to improve upon this model, but so far have not been able to come up with a better one.
Next we describe how to handle cases where there is a change of alliances. For this, let us understand the role of the historical data (where we had assumed that alliances are the same as in the current election). The opinion poll data tells us the overall level of support for a party/alliance, while the historical data essentially tells us how these votes for the party/alliance are distributed across the state. So we need to make an assessment of how people would have voted in the earlier election had the alliance structure been what it is today, and come out with revised figures of historical votes X_ip for party p in constituency i. We will call these the simulated historical votes.
Let us consider the case where two parties which had an alliance contested separately (as in Maharashtra, where the BJP and Shiv Sena were in an alliance till the 2014 Lok Sabha elections but contested separately during the Vidhan Sabha election the same year). In such a case, we make an assessment of the overall strength of the two partners and split the alliance votes in that proportion, with more votes for whichever party had its candidate in a given constituency. So if in our judgement one partner A has 60% support while the other, B, has 40%, then wherever in the previous election the candidate was from party A, we assign 70% of the alliance votes from the previous election to party A and 30% to party B. And in constituencies where party B had its candidate in the previous election, we assign 50% of the votes to each party.
Now let us come to the formation of a new alliance. We cannot simply add the votes of the two parties to come up with the simulated historical votes for the alliance. This is because all votes of one party may not be transferred to the candidate of the other party. Here again, we need to make a political assessment as to what fraction of votes would be transferred. Assume parties C and D come together. In a constituency where C has a candidate in the current election, we take the historical votes for party C and add to that a proportion a of the votes of party D. Likewise, wherever party D has a candidate in the current election, we take the historical votes of party D and add a proportion b of the votes of C. This way we construct the simulated historical data. The proportions a, b are chosen based on our political judgement. This (as well as the choice of the coefficients in the swing model) is the art part of election forecasting.
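The vote-splitting rule for a new alliance can be sketched as below; the function name and the default transfer fractions are hypothetical, since the article says a and b are chosen by political judgement:

```python
def simulate_alliance_votes(hist_c, hist_d, candidate_from, a=0.8, b=0.8):
    """Simulated historical vote for a new alliance of parties C and D
    in one constituency. `candidate_from` says which partner fields the
    alliance candidate there in the current election; `a` (resp. `b`)
    is the judged fraction of D's (resp. C's) votes that transfer."""
    if candidate_from == 'C':
        return hist_c + a * hist_d
    else:
        return hist_d + b * hist_c

# C polled 30% and D 10% last time; C fields the alliance candidate,
# and we judge that 70% of D's votes transfer:
print(simulate_alliance_votes(30.0, 10.0, 'C', a=0.7))  # 37.0
```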
When some party switches alliances from one election to the next (as has happened several times in Tamil Nadu), we first split the alliances and then take the simulated vote as the starting point. When a new party makes an appearance (such as AAP in Delhi), it is tough to assign simulated historical votes. This is purely a matter of judgement.
We have thus explained the model and mechanics of forecasting seats based on opinion poll data. We start with the historical data (of the last election). If alliances have changed, we create simulated historical data. Then, based on the survey data and the swing model, we estimate votes for all major parties in each constituency. Finally, we obtain seats for parties by the probabilistic count method. Needless to add, all this methodology will be of no use unless we have reliable survey data collected following proper statistical methodology.
In the rest of the article we dwell upon some related issues. Opinion polls are touted by the media as the truth, the whole truth and nothing but the truth. On the other hand, representatives of parties who are predicted to lose seem to trash them. The public at large perhaps looks at them with amusement. The fact is that opinion polls can at best give an indication of what is likely to happen – will some party/alliance get an absolute majority and, if so, which party? Will it be a borderline or comfortable majority? One should not read too much into the exact numbers, but rather take them as an indication of what is likely to happen. Apart from the numbers, opinion polls give us an insight into why people may have voted the way they did. This is what interests social scientists. From the point of view of a statistician, this is one time when we have the resources to do a sample survey, analyze it and publish results that are taken note of, since this is soon followed by actual voting and counting. This gives us an opportunity to showcase the power of statistical techniques – how even a sample size of 25,000 across the country can give us insights into what is likely to happen.
We have talked about the high volatility of public opinion. This makes any forecast based on pre-election opinion polls highly suspect, especially since in India we have multi-phase polls lasting about 3 to 5 weeks, and during this period one cannot make public any forecast based on opinion polls. Both the survey and its analysis have to be completed two days before the first phase begins. A survey done weeks ahead of polling in major parts of the country/state can at best give an insight into what the public at large was thinking at the time. This could well change by the time voting takes place. We have been very careful to make this point whenever we have done pre-election opinion polls.
Apart from the volatility of public opinion, there is another factor that raises serious questions about the predictive power of any forecast based on pre-election opinion polls. The survey at best gives an estimate of the voting intention of registered voters at large. However, only between 50% and 65% of voters usually cast their votes. Moreover, the lack of enthusiasm in exercising the vote is not uniform across socioeconomic classes. Generally, the percentage of those who cast their votes is higher among economically weaker sections of society. The same is the case with the uneducated sections as well as among scheduled castes and backward castes. Thus parties which enjoy higher support among underprivileged groups are likely to do better in the actual poll than in the survey-based findings.
If we are able to conduct several opinion polls over a period of time and if there is a movement in one direction across most socioeconomic classes – as was the case in the year leading up to the 2014 Lok Sabha poll – we could then forecast the outcome with some confidence. Generally speaking, however, forecasting the outcome of elections using pre-election opinion polls is risky, given the time lag, during which voter mood may change, and because not all voters exercise their franchise. Exit polls are designed to take care of both these concerns since, as the term suggests, the survey is done as voters are exiting the polling booths. However, this also poses a challenge – how does one randomly pick respondents? We can follow a rigorous statistical process to choose the polling booths where we sample, but have to leave the choice of respondents to the person in the field. We can only give a rule of thumb, such as every 7th person or every 10th person. Note that on some occasions we found the socioeconomic profile of the sample to be very different from the census profile (whereas in door-to-door surveys, where respondents are chosen via randomisation, the sample profile is much closer to the population profile).
Given that multi-phase polls have now become the norm, we prefer post-polls, where we do a door-to-door survey of respondents drawn randomly following proper statistical methodology during the two or three days following the voting, in constituencies which get chosen in the randomisation process. This means that on the last day of polling, when various other agencies present their forecasts based on exit polls, we have data on all but the last phase. On some occasions, we have done an exit poll for the last-phase constituencies and made a forecast along with others on the last day of polls. On other occasions, we extrapolated and made a tentative forecast on the last day of polls and revised it a few days later based on the post-poll in the last phase.
Our experience is that post-poll based forecasts are generally better, though it means that we can complete the exercise only a day or two before counting day.
Another major problem involves overestimation of the percentage of respondents who claim that they have voted (we ask this question during the post-poll). So we also ask our interviewers to observe whether the respondents have ink marks on their fingers and, for the prediction of seats, we only take into account those responses where the mark is visible. Yet another practice is to compensate for incorrect answers by respondents to the question of whom they have voted for or intend to vote for. The practice is to impute their true voting intention based on their responses to other questions, such as their like or dislike of the previous government, their views on various personalities, and so on. Indeed, in some cases investigators may not even ask for whom the respondents have voted or intend to vote. It is true that many will hesitate to answer such a direct question. However, our interviewers carry old-style ballot papers with the names and party symbols of the candidates, and after all the other questions are done, the respondents are asked to mark their vote on this ballot (away from the eyes of the interviewer), fold the paper and put it in a box. Most people are comfortable doing this and so we have refrained from imputing votes based on other responses. We feel that doing so may introduce more errors than it corrects.
To conclude, opinion polls, for all the power of statistics behind them, have their limitations and can be off the mark. In a neck-and-neck contest, no survey can predict the winner with any confidence, though a psephologist is on much safer ground than a soothsayer. With good opinion polls one usually gets the basic story right – who will get the largest number of seats, and whether this party (or alliance) will cross the half-way mark comfortably, be around that number, or fall well short. Moreover, opinion polls help us understand why people voted the way they did, the issues that determined their vote, and so on.
This is one way to get an insight into the mind of the voter.
Appendix: Though simplistic, this model is a good starting point. For simplicity of notation, we fix a large state, say j. Assume that there is no change in the alliance structure from the previous election to the current election.
We write down the model for the swing C_ip in a constituency i (in the state j) for a party p as follows. Suppose we have divided the constituencies in the state j into groups G_j21, G_j22, ..., G_j2A (say by sub-regions); into G_j31, G_j32, ..., G_j3B (say rural/urban, with B = 2); into G_j41, G_j42, ..., G_j4C (say reserved/general, with C = 2); and into G_j51, G_j52, ..., G_j5D (say by voting phase). Then
C_ip = b_j1 S_jp + b_j2 S_jp2a + b_j3 S_jp3b + b_j4 S_jp4c + b_j5 S_jp5d .............. (1)
where
b_j1 + b_j2 + b_j3 + b_j4 + b_j5 = 1, .............. (2)
the constituency i belongs to G_j2a, G_j3b, G_j4c and G_j5d; S_jp is the swing in state j for party p, and S_jpka is the swing for party p in the group G_jka. We can estimate S_jp and S_jpka from the opinion poll data and the historical data. This gives us an estimate of C_ip, and if X_ip denotes the percentage of votes for party p in constituency i in the last election, then
Y_ip = X_ip + C_ip
gives us the estimate of votes for party p in constituency i.
In a small state (say with fewer than 15 seats), the model would simply be C_ip = b_j1 S_jp.
Thus, if there is no change in the alliance structure from one election to the next, this model gives us a way to estimate the vote share Y_ip of party p in constituency i, and hence an estimate of the vote shares of all major parties across all constituencies. We still need to describe how one obtains the coefficients b_j1, b_j2, b_j3, b_j4, b_j5; we will come to that later. For now, we proceed assuming that we have been given the coefficients.
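The swing model of equation (1) can be written as a small function. A minimal sketch; the coefficient values and swing figures in the usage lines are illustrative, not from the article:

```python
def constituency_swing(b, state_swing, region_swing, ru_swing, res_swing, phase_swing):
    """Equation (1): swing C_ip for party p in constituency i as a convex
    combination of the state, sub-region, rural/urban, reserved/general
    and phase-group swings. b = (b_j1, ..., b_j5) must sum to 1."""
    assert abs(sum(b) - 1.0) < 1e-9
    swings = (state_swing, region_swing, ru_swing, res_swing, phase_swing)
    return sum(w * s for w, s in zip(b, swings))

def predicted_share(x_ip, c_ip):
    """Y_ip = X_ip + C_ip: last election's share plus the modelled swing."""
    return x_ip + c_ip

# Illustrative weights and group swings (in percentage points):
c_ip = constituency_swing((0.4, 0.3, 0.1, 0.1, 0.1), 2.0, 3.0, 1.0, 0.0, -1.0)
y_ip = predicted_share(34.0, c_ip)
```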
Let us now come to the question of how to convert these estimates of vote shares into estimates of seats. One simple method would be to treat our vote estimates as true votes and count winners: for each constituency i, the party q with the highest predicted vote Y_iq (among the Y_ip) is counted as the winner, and this gives a nationwide estimate of seats. However, the greater the difference between the predicted vote shares of the two leading parties, the higher is our confidence in predicting the winner. We need to take this into account while counting winners. Thus, we need to translate the vote shares into a predicted probability of winning (or, in Bayesian terminology, a posterior probability of winning) for the candidates in each seat, in such a way that the larger the difference, the higher the probability of winning for the leading candidate.
Consider the case when there are only two candidates and the leading candidate (say L) is predicted to get a% of the votes and the trailing candidate (say T) (100-a)%. What is the probability that L is the winner (where a > 50)?
Let us assume that the standard deviation of the estimate a is s.
One method would be to put prior probabilities on the percent of votes in the population for each of the two candidates and work out the posterior probabilities of win for them respectively.
Assuming a uniform prior on vote share for each candidate [over (0,100) with their sum being 100], it can be seen that the posterior probability that L will indeed win equals
P(Z > (50-a)/s) ........................... (3)
where Z has the standard normal distribution. Likewise, the probability of T winning is
P(Z > (a-50)/s) ........................... (4)
This has another interesting interpretation: (4) is the probability of the best-case scenario from the point of view of T – that the two (L and T) are in fact neck and neck with T having a slight edge, and the estimates nevertheless show a gap of 2a-100 between their vote shares.
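The win probability P(Z > (50-a)/s) and the seat count obtained by summing these probabilities can be sketched as follows; the standard deviation s = 4 used in the usage line is an illustrative value, not a figure from the article:

```python
from math import erf, sqrt

def win_probability(a, s):
    """Posterior probability that the leading candidate (predicted a% of
    the two-candidate vote, a > 50, with standard deviation s of the
    estimate) wins: P(Z > (50 - a)/s) for standard normal Z."""
    z = (50.0 - a) / s
    # Survival function of the standard normal via erf.
    return 1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Probabilistic count: sum win probabilities over constituencies
# instead of counting every predicted lead as a certain seat.
leads = (52.0, 55.0, 61.0)   # predicted vote shares of the leader
seats = sum(win_probability(a, 4.0) for a in leads)
```

Summing probabilities rather than counting every lead as a seat is what makes the method conservative for the winner, as noted above.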
Either way, this requires an assessment of the standard deviation of the constituency-wise vote estimates. We have to remember that the errors include both sampling errors and modelling errors – in other words, the postulated relationship (1) itself has an error component. This analysis can be extended to the first three candidates, with each of them assigned a probability of winning. Adding the probabilities of winning for a party across the country then gives its estimated seats. We can see from the second interpretation of the assigned probabilities that this method tends to be conservative for the winner. For future reference, let us call this the probabilistic count method.
Let us now come to the question of the coefficients b_j1, b_j2, b_j3, b_j4, b_j5. One possibility is to use data from two elections in the recent past, treat equation (1) as a regression problem and estimate b_j1, b_j2, b_j3, b_j4, b_j5 subject to the constraint (2). For example, while developing this methodology for the 1998 elections, we used 1991 data as historical data and 1996 data as current data, and obtained estimates of the coefficients. We found that the error sum of squares was very high – in other words, the model (1) is not a good fit. So instead, we chose the coefficients based on a political understanding of the state. In some states the overall state effect is large, in which case we chose b_j1 high and the other coefficients small. In other states the regional effect is stronger, in which case we assigned a higher value to b_j2. Likewise, we assigned values to these coefficients state by state based on a political understanding of the historical voting pattern.
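The regression described above can be set up as ordinary least squares with the sum-to-one constraint (2) absorbed by substitution (b_j5 = 1 - b_j1 - ... - b_j4). A sketch using synthetic data; the article reports that on real election data such a fit was poor:

```python
import numpy as np

def fit_coefficients(S, C):
    """Least-squares fit of equation (1): C ≈ S @ b with sum(b) = 1.
    S is an (n_constituencies, 5) array of group swings (state,
    sub-region, rural/urban, reserved/general, phase) and C is the
    vector of observed constituency swings. The constraint is absorbed
    by writing b5 = 1 - b1 - b2 - b3 - b4."""
    A = S[:, :4] - S[:, 4:5]      # columns (S_k - S_5), k = 1..4
    y = C - S[:, 4]
    b_head, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.append(b_head, 1.0 - b_head.sum())

# Synthetic check: recover known coefficients from noiseless data.
rng = np.random.default_rng(0)
S = rng.normal(size=(12, 5))
b_true = np.array([0.4, 0.2, 0.2, 0.1, 0.1])
b_hat = fit_coefficients(S, S @ b_true)
```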
We then resorted to back-testing, as if we were in 1996 and some survey had produced exact estimates of vote shares. We used only these vote shares and the actual data from 1991, ignoring the vote data from individual constituencies for 1996. We then used model (1) to estimate the vote shares Y_ip; of course, the estimates were way off from the actual vote shares (as the regression fit had been bad even with the best choice of coefficients). Next, we used the probabilistic count method to estimate seats in the Lok Sabha for the major parties. It was satisfying to see that while the vote estimates at the constituency level were off the mark, the nationwide estimates of seats were quite good. However, the state-wise estimates of seats were quite off the mark. We conclude that while at the micro level (the individual constituency level) the model fit is not good, it serves the objective of predicting seats at the national level.
* The sampling scheme has been developed in collaboration with CSDS-Lokniti. The model for vote share prediction was developed along with Yogendra Yadav and Clive Payne during the 1998 Lok Sabha elections. The probabilistic count method was developed along with Rahul Roy and Abhay Bhatt, colleagues from the Indian Statistical Institute, Delhi.
Footnotes:
1. In 2014, the total number of eligible voters in India was 83.4 crore (834 million) and 55.3 crore (553 million) citizens voted in the 2014 Lok Sabha polls.
2. For a detailed note on sampling, see Rajeeva Karandikar, ‘Power and Limitations of Opinion Polls: My Experiences’, The Hindu Centre for Public Policy, April 2014. Accessible at: http://www.thehinducentre.com/verdict/commentary/article5739722.ece
3. Volatility of public opinion between two time points is defined as the percentage of people who changed their voting intention during that period. It is believed by all experts that volatility over a five-year period (from one election to the next) is very high. Based on aggregate data, we can see that it is likely to be over 10%. In 1998, CSDS had conducted a pre-election opinion poll (for India Today) and also a post-poll (for the early seat projections we were making) in which the same set of respondents was approached. We found that about 30% of respondents had changed their mind. The time gap was about eight days for a third of the voters, 14 days for another third and 22 days for the rest. Thus volatility was about 30%.
4. For the detailed technical note see, R.L. Karandikar, C. Payne and Y. Yadav, ‘Predicting the 1998 Indian Parliamentary Election’, Electoral Studies 21, 2002, pp.69-89.