Transparently handle dot character in ICD-10/OPCS-4 codes #2333

evansd · 2024-12-20T09:34:10Z

Canonically, ICD-10 and OPCS-4 codes are written with a dot between the 3rd and 4th characters, e.g. A01.1.

However, in the data we currently have, these dots are omitted and the equivalent code is written A011. (For example, see the apcs.all_procedures field.)

Our syntactic validation for these codes currently requires them in dotless format:

ehrql/ehrql/codes.py

Lines 91 to 111 in 784d011

    
    class ICD10Code(BaseCode):
        "ICD-10"

        regex = re.compile(r"[A-Z][0-9]{2,3}")


    class OPCS4Code(BaseCode):
        "OPCS-4"

        # The documented structure requires three digits, and a dot between the 2nd and 3rd
        # digit, but the codes we have in OpenCodelists omit the dot and sometimes have only
        # two digits.
        # https://en.wikipedia.org/wiki/OPCS-4#Code_structure
        regex = re.compile(
            r"""
            # Uppercase letter excluding I
            [ABCDEFGHJKLMNOPQRSTUVWXYZ]
            [0-9]{2,3}
            """,
            re.VERBOSE,
        )

But it would be nicer if we accepted strings either with or without the dot and converted them to the dotless format at the point we cast them to a code type.
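As a rough illustration of that idea, here is a minimal standalone sketch. It does not use the real `BaseCode` machinery: the names `ICD10_PATTERN`, `OPCS4_PATTERN` and `to_dotless` are hypothetical, and it assumes the dot is only ever valid between the 3rd and 4th characters and that patterns are applied as a full match.

```python
import re

# Hypothetical patterns: the existing character rules, plus an optional dot
# that may appear only between the 3rd and 4th characters.
ICD10_PATTERN = re.compile(r"[A-Z][0-9]{2}(?:\.?[0-9])?")
# [A-HJ-Z] is any uppercase letter except I, as in the current OPCS-4 regex.
OPCS4_PATTERN = re.compile(r"[A-HJ-Z][0-9]{2}(?:\.?[0-9])?")


def to_dotless(value: str, pattern: re.Pattern) -> str:
    """Validate a dotted or dotless code and return its dotless form."""
    if pattern.fullmatch(value) is None:
        raise ValueError(f"Invalid code: {value!r}")
    return value.replace(".", "")


print(to_dotless("A01.1", ICD10_PATTERN))  # A011
print(to_dotless("A011", ICD10_PATTERN))   # A011
print(to_dotless("K40.2", OPCS4_PATTERN))  # K402
```

Making the dot and the fourth character optional as a group means a trailing dot (`A01.`) or a misplaced dot (`A0.11`) is still rejected, so the only extra input accepted is the canonical dotted form.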

If we ever end up with data that stores the codes in the dotted format, then we can make a new type which does the reverse (i.e. converts dotless to dotted). This would allow us to use the existing codelists with the new field without having to worry about arbitrary syntactic variation.
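The reverse conversion could look something like the sketch below. The function name `to_dotted` is hypothetical, and it assumes the input has already passed syntactic validation and that three-character codes carry no dot in the dotted format.

```python
def to_dotted(code: str) -> str:
    """Insert a dot after the 3rd character of an already-validated code."""
    # Assumption: codes of three characters or fewer, and codes that
    # already contain a dot, are returned unchanged.
    if "." in code or len(code) <= 3:
        return code
    return f"{code[:3]}.{code[3:]}"


print(to_dotted("A011"))   # A01.1
print(to_dotted("A01"))    # A01
print(to_dotted("A01.1"))  # A01.1
```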

Slack thread:
https://bennettoxford.slack.com/archives/C069YDR4NCA/p1734627474856169
