Democratising Public Data: An Education Case Study
by Zen family member Krishna Bholah
‘Public’ and ‘democracy’ are associated words and for good reason. ‘Public’ can be defined as ‘of or concerning the people as a whole’,[i] while democracy is derived from the Greek term dēmos meaning ‘people’ and kratos meaning ‘rule’.[ii] Yet, while we have seen progress in the accessibility of public data, its enablement of people to ‘rule’ through informed decision-making is yet to be seen.
In this blog, we discuss how, despite government efforts to improve its data strategy, several challenges prevent non-technical professionals from using public data for the greater public good. This can hinder both public and private decision-making on prominent issues faced today, including climate change. Using a simple example from the education sector, we demonstrate how analysing two publicly available datasets to answer a basic question can become complex.
The latest UK National Data Strategy emphasised a commitment to improving data use in government, enabling ‘businesses and organisations to … innovate, experiment and drive a new era of growth.’[iii] While government organisations have opened access to datasets and increased partnerships with regulators and the private sector, the use of this data by smaller companies, as opposed to ‘tech giants’, has been recognised as a challenge. A lack of resources, software, knowledge, and experience means that ‘data democratisation’ for small companies and private individuals is far from resolved.
On average, 25% of small to medium sized businesses do not employ someone in a data role.[iv] While larger businesses are more likely to recruit for data roles, there remains a significant UK data skills gap among both recent graduates and existing employees. This upward pressure has led to 46% of employers struggling to recruit for these roles over the last two years. With strong demand and weak supply, salary expectations for new hires (22%) and the financial cost of implementation (21%) are considered the two greatest barriers to establishing a data role or team for UK businesses. This can disproportionately affect smaller businesses as they progress in data maturity and seek to reap the benefits of data-driven insights for the greater public good.[v]
Education sector case study: Can a secondary school assess its current energy performance against its peers?
We looked at publicly available government datasets related to the education sector and building energy performance, considering the question below to identify some of the data challenges faced.
How does a secondary school assess its current energy performance against others in the education sector?
On average, a secondary school has 1,014 pupils and 63 teaching staff members, putting most secondary schools into the ‘medium’ sized business category.[vi] Reports suggest that the education sector is significantly less likely to have existing data roles compared to other industries, with the largest skills gaps being:[vii]
This case study demonstrates the immediate potential impact of data challenges on the education sector’s ability to make informed decisions about improving its energy performance. Moreover, it demonstrates an opportunity cost: the education system is not positioned to help educate future data-literate individuals. Below, we outline the six steps taken to conduct our data analysis, discussing some of the key challenges faced in the process.
1. Data collection:
We identified two datasets required for the analysis:
- A list of open schools in England and Wales
The Department for Education (DfE) publishes and maintains a register of schools and colleges in England and Wales. This dataset can be accessed without registration, is available for immediate download, and includes filtering options.
- A report on the energy performance of public non-domestic buildings, including schools
The Ministry of Housing, Communities & Local Government publishes Display Energy Certificates (DECs) for domestic and non-domestic buildings. A DEC uses band ratings from ‘A’ to ‘G’ to categorise each building’s energy efficiency. A is the most efficient and G is the least.[viii] Public users can gain access through a short online registration process. This allows users to search for a public building’s DEC using an address, or to download the entire dataset.
We found no significant barriers to entry for data collection, regardless of data literacy and/or data skills.
2. Data access:
Some datasets can be easily accessed (i.e., viewed in a readable format once collected) in software such as Microsoft Excel. Other datasets are so large that these well-known tools become practically unworkable. While the DfE Information Application Programming Interface (API) provides a sustainable option for data access, using it requires specialist skills unlikely to be held within a school. The alternative was accessing the data via a .csv file, which can be opened using Microsoft Excel. With only 50,000 rows, the dataset can be accessed without major impact on performance.
Accessing the DEC data demonstrates the challenges of slightly larger datasets. While DEC data is available for direct access via Microsoft Excel, the downloads are split across separate files due to the file size. This requires file concatenation, i.e., importing all rows from each separate file into one so that an analysis can be made across the entire dataset, which is something non-data professionals may struggle with. Once concatenated, the entire DEC dataset contained over 400,000 rows. While this is under the maximum number of rows in Microsoft Excel, a dataset of this size may start to impact performance on older computers.
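For readers curious what the concatenation step described above might look like in code, here is a minimal sketch using only Python's standard library. The function name and file names are hypothetical, and the sketch assumes every downloaded file shares the same header row; it illustrates the idea rather than the exact process we followed.

```python
import csv


def concatenate_csv_files(input_paths, output_path):
    """Append the rows of several CSV files that share a header into one file.

    Returns the number of data rows written (the header is not counted).
    """
    rows_written = 0
    header = None
    writer = None
    with open(output_path, "w", newline="", encoding="utf-8") as out_file:
        for path in input_paths:
            with open(path, newline="", encoding="utf-8") as in_file:
                reader = csv.DictReader(in_file)
                if writer is None:
                    # The first file decides the column order of the output.
                    header = reader.fieldnames
                    writer = csv.DictWriter(out_file, fieldnames=header)
                    writer.writeheader()
                elif reader.fieldnames != header:
                    # Assumption: all parts must have identical columns.
                    raise ValueError(f"{path} has a different header")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

A call such as `concatenate_csv_files(["dec_part1.csv", "dec_part2.csv"], "dec_all.csv")` (hypothetical file names) would produce one combined file. Reading row by row keeps memory use low, which matters for files of this size on older computers.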
We found some potential barriers to entry for data access.
3. Data cleansing:
A key challenge faced by data professionals is the need to cleanse a dataset before being able to draw accurate, informative, and useful insights. In this example, our DEC dataset contained all public non-domestic buildings, including previous DECs conducted on the same property. Therefore, you may see one DEC for each year that the school has been operating. Out-of-date DECs are irrelevant for our immediate purpose, since the question involves understanding a school’s current performance. It is necessary to programmatically remove all historical data, retaining only the most recent DEC information for each organisation.
The programmatic removal was a significant challenge due to the lack of a unique identifier for each school property. While there was a unique building reference number, this did not distinguish separate properties technically registered to the same building. Data cleansing can be a time-consuming and challenging task, even for those with technical data skills.
Another major limitation of the DEC dataset is that it is not a live feed. As it is published ‘every four to six months’, a number of DECs may have been updated between the last publication date and the date the data is accessed. This can render the dataset out of date and therefore provide an inaccurate answer.
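As an illustration of the cleansing step, the sketch below keeps only the most recent certificate for each property. It makes several assumptions: the column names `BUILDING_REFERENCE_NUMBER` and `LODGEMENT_DATE` are assumed rather than confirmed, dates are assumed to be in ISO format (YYYY-MM-DD), and combining the building reference with the first address line is just one possible workaround for the missing unique identifier, not necessarily the approach we took.

```python
from datetime import date


def latest_certificates(records,
                        key_fields=("BUILDING_REFERENCE_NUMBER", "ADDRESS1"),
                        date_field="LODGEMENT_DATE"):
    """Keep only the most recent record for each combination of key fields.

    `records` is an iterable of dicts, e.g. rows from csv.DictReader.
    Column names are assumptions made for this illustration.
    """
    latest = {}
    for record in records:
        # Build a composite key; trimming and upper-casing avoids treating
        # "Main Block" and "MAIN BLOCK " as different properties.
        key = tuple(record[field].strip().upper() for field in key_fields)
        lodged = date.fromisoformat(record[date_field])
        if key not in latest or lodged > latest[key][0]:
            latest[key] = (lodged, record)
    return [record for _, record in latest.values()]
```

Even with a composite key like this, inconsistent address spellings between years could still leave duplicates behind, which is part of why this step is time-consuming in practice.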
4. Data linking:
To connect all schools in England with their DEC data, a process called data linking is required. This ‘… is the process of joining data sets through deciding whether two records, in the same or different data sets, belong to the same entity.’ In this example, we wanted to link DfE establishment names and their unique reference numbers to their DEC ratings so we could start to analyse schools registered with the DfE. While Microsoft Excel can partially cater to matching, the process is inefficient and prone to mistakes. We were therefore required to use more advanced software requiring intermediate data analysis capabilities.
Even with access to software more suitable for data linking, we found several practical challenges in the matching process. Matching the DfE school name with the ‘ADDRESS1’ field in the DEC data was unfruitful. Due to differences in naming conventions and ‘no universally accepted standard for the “correct” form of the address’, not all organisations matched.[ix] Matching based on postcode returned multiple organisations registered at the same location. More advanced data techniques, such as address matching algorithms, can help resolve these issues, but these skills are unlikely to be present within a school.
The Office for National Statistics recognises that matching addresses remains a ‘significantly complicated task’ in many instances.[x] The challenge persists even for the government entities who publish the very public data they seek to link.[xi]
We found overwhelming barriers to entry for data linking, and manual alternatives were unfeasibly time-consuming.
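To make the matching problem concrete, here is a simplified sketch that narrows candidates by postcode and then compares names with a basic string similarity score from Python's standard library. This is not a full address matching algorithm and not necessarily the software we used. The school column names (`URN`, `EstablishmentName`, `Postcode`), the DEC column `POSTCODE`, and the 0.8 threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def normalise_postcode(postcode):
    """Remove all whitespace and upper-case, so 'ab1 2cd' equals 'AB12CD'."""
    return "".join(postcode.split()).upper()


def normalise_name(name):
    """Lower-case, drop punctuation, and collapse repeated spaces."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def link_schools_to_decs(schools, decs, threshold=0.8):
    """Return {school URN: best matching DEC record, or None if no good match}."""
    decs_by_postcode = defaultdict(list)
    for dec in decs:
        decs_by_postcode[normalise_postcode(dec["POSTCODE"])].append(dec)

    links = {}
    for school in schools:
        school_name = normalise_name(school["EstablishmentName"])
        candidates = decs_by_postcode.get(normalise_postcode(school["Postcode"]), [])
        best, best_score = None, 0.0
        for dec in candidates:
            # Similarity ratio between 0.0 (nothing shared) and 1.0 (identical).
            score = SequenceMatcher(None, school_name, normalise_name(dec["ADDRESS1"])).ratio()
            if score > best_score:
                best, best_score = dec, score
        # Reject weak matches rather than risk linking the wrong organisation.
        links[school["URN"]] = best if best_score >= threshold else None
    return links
```

The threshold trades missed matches against false matches, and any value chosen would need checking against a manually verified sample. Schools left unmatched would still need manual review.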
5. Data analysis:
It was only at this point that we were able to answer the basic question of how well a school is performing against others in the education sector. To answer the immediate question, a simple analysis can be made by averaging the DEC ratings for all open schools to provide a national average. Comparing the school’s individual DEC rating against this will show whether the school is performing better or worse than the average.
The benefit of using raw data is that you can provide deeper insights and observe nuances. For example, instead of looking at the national average, you can restrict your average to schools in the same local authority to compare locally. A multi-faceted perspective can be developed by linking more data sources. This can help you understand more complex, potentially weaker correlations. For example, the Ministry of Justice and DfE share data to understand the relationship between crime and education.[xii]
We found no barriers to entry for data analysis once the data was cleansed and linked.
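The averaging described above could be sketched as follows. Because DEC bands are letters, this example converts them to numbers first (A = 1 through G = 7, so a lower score is better); that mapping, along with the field names `BAND` and `LocalAuthority`, is an assumption made for illustration rather than a description of our exact calculation.

```python
from collections import defaultdict
from statistics import mean

# Assumed numeric scale for the letter bands: lower means more efficient.
BAND_SCORES = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7}


def average_band_score(records, band_field="BAND"):
    """Average numeric band score across all records (the 'national' figure)."""
    return mean(BAND_SCORES[record[band_field]] for record in records)


def average_by_group(records, group_field, band_field="BAND"):
    """Average numeric band score for each value of group_field."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_field]].append(BAND_SCORES[record[band_field]])
    return {group: mean(scores) for group, scores in groups.items()}


def compare_to_average(school_band, average_score):
    """Say whether a school's band is better than, worse than, or equal to an average."""
    score = BAND_SCORES[school_band]
    if score < average_score:
        return "better"
    if score > average_score:
        return "worse"
    return "equal"
```

Calling `average_by_group(records, "LocalAuthority")` gives the local comparison, while `average_band_score(records)` gives the national one. Treating bands as evenly spaced numbers is a simplification, so the result is best read as a rough indicator.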
6. Data visualisation and presentation:
An often forgotten but vital part of the process involves the appropriate presentation of findings. We encountered no issues exporting our linked data into Excel and creating charts to visualise and summarise our analysis. When dealing with larger quantities of unstructured and/or raw data, programming languages are more suitable, although they require knowledge of the chosen language.
Our case study demonstrates significant challenges in enabling the education sector to take the initiative and use open data published by government bodies to achieve its sustainability goals. Initiatives such as Streamlined Energy and Carbon Reporting (SECR) have related goals ‘designed to increase awareness of energy costs’ but are equally disconnected from DEC data.[xiii] This increases the workload and difficulty for education stakeholders trying to make informed decisions, hindering progress in achieving a net-zero and carbon positive future.
The challenges highlighted in our analysis demonstrate an overwhelming barrier to entry for non-data professionals and individuals seeking to utilise public data within and outside the education sector. While governments can develop web-based platforms for drag-and-drop analytics, this will not tackle more deep-rooted issues and therefore only partially solves the challenge.
As the UK data skills gap remains significant, the costs associated with building out a new data team may be considered unfeasible. Alternative options include upskilling employees, enabling them to become more data literate and help move their business further along in data maturity.[xiv] While 70% of UK ‘workers are interested in seeking out data skills training’ to maintain ‘pace with the changing requirements of [their] role’, only 49% of those surveyed had received data skills training within the previous two years.[xv]
A lack of emphasis on employment and training in this area can often be due to a ‘lack of understanding of how data skills can benefit [businesses] in the future.’[xvi] The education sector should employ more data-driven individuals, enabling them to teach students and bridge the data skills gap. This can help develop future business executives who understand the advantages of leading data-driven organisations, enabling change for the better.
Krishna Bholah is a Business Intelligence Analyst at Zenergi with a passion for helping drive organisations further along the data maturity model. He currently supports Zenergi’s sales and marketing team by building the right infrastructure and processes to make informed data-driven decisions.
For more information on data maturity, we would recommend this recent comparative analysis of eleven maturity models.
[i] Definition of public [online]. Oxford University Press. https://www.lexico.com/definition/public (Accessed: 9th November 2021).
[ii] Fleck, Robert K., and F. Andrew Hanssen. “The Origins of Democracy: A Model with Application to Ancient Greece.” The Journal of Law & Economics, vol. 49, no. 1, [The University of Chicago Press, The Booth School of Business, University of Chicago, The University of Chicago Law School], 2006, pp. 115–46, https://doi.org/10.1086/501088.
[iii] Department for Digital, Culture, Media & Sport, “UK National Data Strategy” (published 9 September 2020) (UK National Data Strategy).
[iv] Department for Digital, Culture, Media & Sport, “Policy paper: Quantifying the UK Data Skills Gap – Full report”. Micro being defined as 2-9 employees, small as 10 – 49 employees, medium as 50 – 249 employees and large 250+ employees. (Quantifying the UK Data Skills Gap)
[v] UK National Data Strategy.
[vi] The Department for Education, “Class Size and education in England evidence report” (published December 2011). Based on the most recent available data reporting the Pupil Teacher Ratio of 16:1 in 2011.
[vii] Quantifying the UK Data Skills Gap.
[viii] https://www.gov.uk/check-energy-performance-public-building (last accessed 9th November 2021).
[ix] Office for National Statistics, “ONS working paper series no 17 – Using data science for the address matching service”. (last accessed 9 September 2021). https://www.ons.gov.uk/methodology/methodologicalpublications/generalmethodology/onsworkingpaperseries/onsworkingpaperseriesno17usingdatasciencefortheaddressmatchingservice (ONS working paper series no 17)
[x] ONS working paper series no 17
[xi] Office for National Statistics, “Joined up data in government: the future of data linking methods” (published 25 August 2020, last accessed 9 September 2021) https://www.gov.uk/government/publications/joined-up-data-in-government-the-future-of-data-linking-methods/joined-up-data-in-government-the-future-of-data-linkage-methods (Joined up data in government: the future of data linking methods)
[xii] Joined up data in government: the future of data linking methods.
[xiii] Education & Skills Funding Agency, “Guidance: Streamlined Energy and Carbon Reporting (SECR) for academy trusts” (updated 4 August 2021, last accessed 9 September 2021). https://www.gov.uk/government/publications/academy-trust-financial-management-good-practice-guides/streamlined-energy-and-carbon-reporting
[xiv] For more information, see Król, Karol & Zdonek, Dariusz. (2020). Analytics Maturity Models: An Overview. Information. 11. 142. 10.3390/info11030142.
[xv] Based on a survey of 5,000 workers (Quantifying the UK Data Skills Gap).
[xvi] Quantifying the UK Data Skills Gap.