In this blog post, Ellen van de Poel (Erasmus University Rotterdam) summarizes discussions held in Bergen between academics and PBF implementers about the desirability and feasibility of various research designs to identify causal impact of PBF. She covers discussions held within the dedicated working group (facilitated by Atle Fretheim from the Norwegian Knowledge Center for the Health Services, Oslo) and in plenary sessions.
As a novice to the field of Performance Based Financing (PBF), I came away from this interesting two-day workshop with quite a few new insights, but also with many questions. Does paying for performance crowd out intrinsic motivation? Are the poor benefiting most from PBF, or do these schemes mostly cater to the better off? If PBF increases health care utilization, does it do so cost-effectively compared to traditional input-based financing? Many interesting questions and ideas for research, but, being a quantitative researcher, the most burning question on my mind is whether PBF indeed has a substantial impact on access to quality care in low- and middle-income countries. Evidence on the impact of PBF has also been urgently called for by a recent Cochrane review, but robust impact evaluations seem hindered by the complexity of the intervention. This blog summarises our discussions in Bergen.
Getting the research question right
Let’s first make sure we got the research question right. “Does PBF work?” is too simple.
For whom does it work? Which populations are we interested in: patients using the health facilities, all households in catchment areas, poor or better-off individuals? Defining the right target population is important for data collection. Exit interviews reveal valuable information about patient experiences but are limited to users of care; household surveys provide a more general picture but might be hard to link to facility information. And average effects may mask important inequities.
What do we mean by PBF? Various terminologies like PBF, P4P, RBF and contracting are often used interchangeably while such programs can vary to a great extent. In Bergen, practitioners stressed several key features. Let’s agree here that with ‘PBF’ we mean programs including incentivized payments, additional financial resources, reinforced supervision and increased managerial autonomy.
Works compared to what? Are we interested in evaluating PBF against the counterfactual of traditional input based financing, holding resources constant or at least equal – the incentive effect? Or do we want to measure average effects of the complete PBF program – resource and incentive effect? This distinction obviously has important consequences for defining an appropriate counterfactual and knowing what we capture.
Works to achieve what? What are the important hypotheses, and main outcomes of interest? To avoid ending up testing for a significant effect on each (of the many!) incentivized outcomes, and finding some effects by chance only, researchers should be more transparent about the main hypotheses that they want to test before conducting the study.
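To see why pre-specifying hypotheses matters, here is a minimal sketch (the number of indicators is a hypothetical example) of how testing many incentivized outcomes at the conventional 5% level inflates the chance of a spurious finding, and how a simple Bonferroni correction compensates:

```python
# Hypothetical illustration: if we test m independent outcomes that are all
# truly unaffected by PBF, each at alpha = 0.05, the chance of at least one
# "significant" result by luck alone is 1 - (1 - alpha)**m.
def familywise_error_rate(m, alpha=0.05):
    """Probability of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m

# A Bonferroni correction keeps that familywise rate below alpha by testing
# each outcome at the stricter threshold alpha / m.
def bonferroni_threshold(m, alpha=0.05):
    return alpha / m

# With, say, 20 incentivized indicators (a made-up count):
print(round(familywise_error_rate(20), 3))  # 0.642: a 64% chance of a spurious "effect"
print(bonferroni_threshold(20))             # 0.0025 per-test threshold
```

Pre-registering a small set of primary outcomes avoids paying this multiple-testing penalty across every incentivized indicator.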
Once we agree on the research question, let’s see how it can be answered in a scientifically rigorous and robust way. Let’s go through the standard impact evaluation toolbox:
We agreed that the simplest thing to do is to compare districts/facilities with PBF to those without. Easy enough, but the difference between the two can be driven by many things other than the PBF intervention, so drawing causal conclusions is really not possible here.
We could compare outcomes within the same district/facility from a period before PBF to a period after. While this approach is useful for monitoring purposes, and is therefore typically embedded in the project cycle of PBF, it leaves too much room for factors other than PBF (e.g. other nationwide programs) to drive the change for us to really claim causality.
To net out the effect of such nationwide changes that might bias the before-after comparison of PBF, we can compare the trend in outcomes in PBF facilities/districts to the trend in other, non-PBF, facilities/districts and attribute only the difference between the two trends to the PBF intervention. In Bergen, some presented studies used this so-called difference-in-differences design. Basically, the non-PBF units (controls) serve as a counterfactual for what would have happened to the PBF'ed ones (treated) had there been no PBF. This design can produce very robust impact estimates, but the validity of the counterfactual is crucial. Many of the controlled before-after studies presented in Bergen showed baseline differences between treated and controls that were much larger than the identified PBF effects. Furthermore, many of these effects seemed to derive from a negative trend among the controls, not from large increases among the treated. This raises questions about comparability. When a country moves forward with PBF in some areas, it is possible that attention/resources are being reduced in the 'control' areas, thereby biasing the counterfactual. In such instances, contextual and more qualitative information is crucial to build credibility in the estimate.
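The difference-in-differences logic, and the caution about declining controls, can be sketched with made-up numbers (all facility means below are hypothetical, chosen only to show the arithmetic):

```python
# A minimal difference-in-differences sketch: the change among treated units
# minus the change among controls.
def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """DiD estimate: treated change minus control change."""
    return (treated_after - treated_before) - (control_after - control_before)

# Say monthly facility deliveries (hypothetical): PBF facilities rise from
# 40 to 55, while non-PBF controls drift from 42 to 44.
print(diff_in_diff(40, 55, 42, 44))  # 13: the trend difference attributed to PBF

# The caution raised in Bergen: the very same estimate appears if the treated
# do not improve at all and the controls simply collapse (42 -> 29), so
# inspecting both underlying trends, not just the difference, matters.
print(diff_in_diff(40, 40, 42, 29))  # also 13
```

The estimate alone cannot distinguish improving treated facilities from deteriorating controls, which is why the plausibility of the counterfactual has to be argued with contextual evidence.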
Ideally, the allocation of districts/facilities to control or intervention groups is done randomly to make sure that there are no systematic differences between them – the so-called Randomized Controlled Trial (RCT).
The argument often made that it is unethical to randomize subjects in social policy experiments seems unjustified – in fact, there is nothing more ethical than tossing a coin if you lack the resources to cater to everyone in one go – but RCTs can be challenging in practice.
Randomization can be politically difficult. How do you explain to some districts/facilities that they will have to wait, and how do you convince local health officials/providers that the assignment was done fairly? Transparency of the randomization process is crucial and can go a long way in mitigating these difficulties.
Another challenge relates to the design of the RCT. Randomization really only works if there is a sufficient number of units to sample from. Randomly drawing 200 from a total of 500 facilities obviously works better than drawing two provinces from a total of five. A lower level of randomization not only ensures a higher degree of comparability between treated and controls but also increases statistical power, i.e. it becomes easier to detect small effects that are statistically significant.
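A quick simulation makes the point about the number of randomized units concrete. The baseline outcome values below are entirely made up; the sketch only compares how balanced treated and control groups are, by chance, when randomizing 200 of 500 facilities versus 2 of 5 provinces:

```python
import random
import statistics

random.seed(1)

def baseline_imbalance(n_units, n_treated, sims=2000):
    """Average absolute treated-vs-control gap in (hypothetical) baseline means,
    over many simulated random assignments."""
    gaps = []
    for _ in range(sims):
        # Made-up baseline outcomes for each unit (mean 50, sd 10).
        outcomes = [random.gauss(50, 10) for _ in range(n_units)]
        idx = list(range(n_units))
        random.shuffle(idx)  # random assignment to treatment
        treated = [outcomes[i] for i in idx[:n_treated]]
        control = [outcomes[i] for i in idx[n_treated:]]
        gaps.append(abs(statistics.mean(treated) - statistics.mean(control)))
    return statistics.mean(gaps)

# Many small units: groups end up closely comparable at baseline.
print(round(baseline_imbalance(500, 200), 2))
# Two of five provinces: large chance imbalances that can swamp a real PBF effect.
print(round(baseline_imbalance(5, 2), 2))
```

With only a handful of large units, chance differences at baseline are typically of the same order as plausible program effects, which is exactly the comparability problem noted above.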
So ideally researchers would want to randomize facilities into a PBF program. But is PBF really a facility-level intervention, given that it changes not only the way providers are rewarded and monitored but also the way districts supervise and operate? While it may be difficult – but not impossible? – to evaluate the entire PBF program through randomization at the facility level, an RCT design can be very feasible for identifying the effects of program components. For example, a design in which randomly chosen treated facilities obtain the full PBF program, while controls (in the same area) only receive the additional supervision, managerial autonomy and financial resources, could very well identify the effect of paying for performance alone. Such a design is currently being tested in Cameroon, but some practitioners have been sharply critical of such designs, which may be a nightmare for PBF implementers and district managers.
A more acceptable design is to randomly vary the fee structure across facilities within a district. This could be a way to identify the optimal reward scheme (though it will not answer the question of whether PBF works). There are some nice possible sophistications, such as randomly adding bonuses for equity targets to existing PBF programs in some facilities, which can help identify their effect on the distributional impact of PBF. This is the research design Burkina Faso is heading for.
Another method to evaluate the effects of PBF is interrupted time series analysis – two studies presented in Bergen used this technique. Basically, the idea is to look for kinks in high-frequency (administrative) data that coincide with the starting date of the PBF program. Ideally this is compared to data from a control group in which no such kink is evident. These studies can be quite convincing, but are subject to some constraints. First, one needs high-quality and frequently collected data. Second, to convincingly attribute the kink in any trend to the PBF program, it is important to establish that the trend was relatively stable prior to the intervention. Third, as an important component of any PBF program is to improve the reporting and verification of the health care system, we need to be careful interpreting a kink in the trend in reported outcomes, as it may very well be due to facilities having improved their registration of services provided rather than having increased the volume. So while looking at time series can be useful and powerful for evaluating modifications to PBF programs that do not touch the monitoring system, it may be less suitable for evaluating entire PBF programs. Yet, in some settings, it may prove to be the main design available. It could also be an interesting track in countries where data reliability has already been improved. This could be the case for instance in Burundi, where the forthcoming impact evaluation will look at the addition of new indicators within a nationwide PBF system.
Finally, it may be worthwhile for researchers to consider using secondary data for evaluating PBF programs. Supplementing household survey data that is collected for many countries on a regular basis (like the Demographic and Health Surveys, or the Living Standards Measurement Study surveys) with information on the (sequential) PBF rollout can yield robust results if appropriate statistical techniques are used to correct for the non-randomized rollout.
A lot of research questions related to PBF, a lot of methods to use, so what are we waiting for? Well, from the Bergen workshop I realized 'we' are not waiting at all. Research teams from LSHTM and Heidelberg University talked about interesting impact evaluations currently being done in Tanzania and Malawi, and the World Bank is also funding a large number of impact evaluations. It would be useful if such projects contributed not only to the evidence base on the effects of PBF, but also to the evidence base on the practicalities of setting up prospective studies (see also here). Many of these efforts seem to be unnecessarily duplicated because this information is not easily available to researchers setting up new impact evaluations. These new evaluations are indeed needed to establish whether paying for performance is a cost-effective way of increasing access to good health care, but also to find out how the incentives should be designed to get the biggest bang for the buck.