Overview
The Data Science Campus leverages big data sources and modern technology to support the National Institute of Statistics of Rwanda (NISR), the government and other stakeholders. The Campus is mandated to modernize official statistics by implementing innovation policies such as the Rwanda Data Revolution Policy.
The Data Revolution Policy (DRP) focuses on building big data and analytics capabilities to derive insights that yield substantial socio-economic benefits, including informed policy decision-making, enhanced transparency and citizen participation, contribution to GDP, monitoring of national development progress and the SDGs, support for research and development, business intelligence, and innovation in data-enabled applications, among others.
Data Science Projects
Job Vacancies Web Scraping
As the use of technology grows, organizations increasingly announce job openings on websites. In Rwanda, several platforms are used, such as Job in Rwanda, the MIFOTRA E-recruitment portal, ndangira.net and LinkedIn. These websites provide employment-related data such as the number of jobs, job domain, required skills and experience, and level of education. These data can complement data gathered from the Labor Force Survey conducted by NISR, which provides statistics on the labor supply side.
To leverage information from these websites, NISR staff, in collaboration with the Office for National Statistics (ONS), UK, applied data science methods in a six-month project on web scraping of job vacancies. They wrote Python code to clean and analyze the data and to generate an automatic report from vacancies scraped from both Job in Rwanda and MIFOTRA. In addition, NISR staff pull new job postings from the two websites each week.
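The scraping and weekly update steps described above can be illustrated with a minimal sketch. It is not the project's actual code: the HTML structure in `SAMPLE_PAGE`, the `vacancy` class name and the helper names `parse_vacancies` and `new_vacancies` are all assumptions made for illustration, and a real scraper would first download each page and adapt the parser to the site's markup.

```python
from html.parser import HTMLParser

# Hypothetical markup: the real structure of the job sites is not described here.
SAMPLE_PAGE = """
<ul>
  <li class="vacancy"><a href="/jobs/101">Data Analyst</a></li>
  <li class="vacancy"><a href="/jobs/102">  Civil Engineer </a></li>
  <li class="advert"><a href="/ads/7">Sponsored link</a></li>
</ul>
"""


class VacancyParser(HTMLParser):
    """Collect url/title pairs from <a> tags nested in <li class="vacancy">."""

    def __init__(self):
        super().__init__()
        self.vacancies = []
        self._in_vacancy = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            self._in_vacancy = attrs.get("class") == "vacancy"
        elif tag == "a" and self._in_vacancy:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside a vacancy link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Basic cleaning: collapse repeated whitespace in the title.
            title = " ".join("".join(self._text).split())
            self.vacancies.append({"url": self._href, "title": title})
            self._href = None
        elif tag == "li":
            self._in_vacancy = False


def parse_vacancies(html):
    """Return the list of vacancies found in one HTML page."""
    parser = VacancyParser()
    parser.feed(html)
    return parser.vacancies


def new_vacancies(scraped, seen_urls):
    """Keep only vacancies whose URL was not collected in an earlier pull."""
    return [v for v in scraped if v["url"] not in seen_urls]
```

Calling `new_vacancies(parse_vacancies(SAMPLE_PAGE), seen_urls)` with the set of URLs collected in previous weeks returns only the postings that are new in the current pull.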
Post Enumeration Survey Data Linkage
The National Institute of Statistics of Rwanda (NISR) conducted the Post Enumeration Survey in 2022 (PES-2022). The main objective of this survey was to evaluate the quality of the Rwanda Population and Housing Census 2022 (RPHC 2022). Assessing census coverage requires matching, which involves checking whether two records from different datasets (the Census and the PES in our case) relate to the same person.
Matching required robust tools and well-trained personnel. A dedicated matching algorithm was therefore developed in Python in collaboration with the Office for National Statistics, UK, which also trained NISR staff in automatic and manual matching techniques.
Matching was performed in successive stages, at household, enumeration area (EA), district and country level, using both automatic and clerical methods. In automatic matching, the computer decides whether a pair of records matches based on predefined rules; in clerical matching, the decision is made by human judgement.
In our context, matching was done mainly using a deterministic approach, in which match keys (rules) were developed based on the variables most likely to identify the same person in both the RPHC 2022 and PES 2022 datasets. These are: household identification (HHID), names, age, sex, marital status and relationship to the head of household.
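The idea of deterministic matching with match keys can be sketched as follows. This is an illustrative example under stated assumptions, not the algorithm developed for PES-2022: the record fields (`id`, `hhid`, `name`, `age`, `sex`), the two keys in `MATCH_KEYS` and the rule of accepting only one-to-one agreements are all hypothetical choices.

```python
from collections import defaultdict

# Hypothetical match keys, applied from strictest to loosest.
MATCH_KEYS = [
    ("hhid", "name", "age", "sex"),
    ("hhid", "name", "sex"),
]


def normalise(value):
    """Lower-case a value and collapse whitespace so trivial differences are ignored."""
    return " ".join(str(value).lower().split())


def key_of(record, fields):
    """Build the match key of a record for the given fields."""
    return tuple(normalise(record[f]) for f in fields)


def deterministic_match(census, pes, match_keys=MATCH_KEYS):
    """Return {pes_id: census_id} for pairs that agree uniquely on a match key."""
    matches = {}
    matched_census = set()
    for fields in match_keys:
        # Index the still-unmatched records on both sides by the current key.
        census_index = defaultdict(list)
        for rec in census:
            if rec["id"] not in matched_census:
                census_index[key_of(rec, fields)].append(rec["id"])
        pes_index = defaultdict(list)
        for rec in pes:
            if rec["id"] not in matches:
                pes_index[key_of(rec, fields)].append(rec["id"])
        for key, pes_ids in pes_index.items():
            census_ids = census_index.get(key, [])
            # Accept only one-to-one agreement; ambiguous or missing
            # candidates are left for looser keys or clerical review.
            if len(pes_ids) == 1 and len(census_ids) == 1:
                matches[pes_ids[0]] = census_ids[0]
                matched_census.add(census_ids[0])
    return matches
```

Applying the strictest key first means a looser key is only tried on records that remain unmatched, which limits the risk of false automatic matches; records that no key resolves would go to clerical matching.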
Prior to the actual matching, there was a tuning period, which coincided with PES data collection, when some but not all of the PES data was available. The tuning aimed at testing and updating the developed match keys. This period is very important because the matching algorithm must be tuned on real data so that automatic matching captures as many correct matches as possible while, ideally, making no false matches.
The matching process was a success, as demonstrated by the precision and recall computed to assess the quality of the matching algorithm: a precision of 99.94% and a recall of 99.98% were achieved.
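For reference, precision is the share of automatically matched pairs that are correct, and recall is the share of true matches that were found. A minimal sketch of how both could be computed is given below, assuming a hypothetical set of known true pairs (for example from a clerically verified sample); the function name `precision_recall` and the pair representation are illustrative, not taken from the project.

```python
def precision_recall(predicted_pairs, true_pairs):
    """Compute precision and recall of predicted (pes_id, census_id) pairs."""
    predicted = set(predicted_pairs)
    truth = set(true_pairs)
    # Pairs that were predicted and are also in the verified set.
    true_positives = len(predicted & truth)
    # Guard against division by zero when either set is empty.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall
```

High precision indicates that few false matches were made automatically, while high recall indicates that few true matches were missed.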