hyphenated housenumber parsing #204

Open
missinglink opened this issue May 22, 2019 · 15 comments

@missinglink
Member

missinglink commented May 22, 2019

We have a conservative setting for parsing hyphenated house numbers.

i.e. is 4-6 a 'number range' or a 'house number and apartment number'?

In some countries, such as Canada, the postal authority recommends separating the house number and apartment number with a hyphen.
https://en.wikipedia.org/wiki/Address#Canada

If there is an apartment number it should be written before the house number and separated by a hyphen.

As we cannot reliably determine the postal addressing format, we discard the address rather than potentially corrupting the number series with an incorrect value.

We have tests for this behaviour here: https://github.com/pelias/interpolation/blob/master/test/lib/analyze.js#L41

It would probably be better to assume these numbers are ranges and then try to detect countries where hyphens are used to delimit apartment numbers (possibly via a bbox check) and then only apply the conservative logic for these countries.
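
A minimal sketch of how that proposal might look, assuming each address comes with a coordinate. Everything named here is hypothetical and not part of pelias/interpolation: the function `interpretHyphenated`, the `APARTMENT_HYPHEN_BBOXES` table, and the very rough Canada bounding box, which also covers parts of the USA and would need replacing with proper country geometry.

```javascript
// Hypothetical sketch: not the pelias/interpolation API.
// Very rough bounding boxes for countries where a hyphen may separate
// apartment and house number. The values are illustrative only.
const APARTMENT_HYPHEN_BBOXES = [
  { country: 'CA', minLon: -141.0, minLat: 41.7, maxLon: -52.6, maxLat: 83.1 }
];

function inBbox(bbox, lon, lat) {
  return lon >= bbox.minLon && lon <= bbox.maxLon &&
         lat >= bbox.minLat && lat <= bbox.maxLat;
}

// Returns { type: 'range', start, end } when the value is assumed to be a
// number range, { type: 'ambiguous' } when the conservative logic applies,
// or null when the value is not a simple hyphenated pair.
function interpretHyphenated(housenumber, lon, lat) {
  const match = /^\s*(\d+)\s*-\s*(\d+)\s*$/.exec(housenumber);
  if (!match) { return null; }

  const conservative = APARTMENT_HYPHEN_BBOXES.some((bbox) => inBbox(bbox, lon, lat));
  if (conservative) { return { type: 'ambiguous' }; }

  const start = parseInt(match[1], 10);
  const end = parseInt(match[2], 10);
  // a descending or equal pair is unlikely to be a range, so stay conservative
  if (end <= start) { return { type: 'ambiguous' }; }
  return { type: 'range', start, end };
}
```

With these assumptions, `interpretHyphenated('104-114', -0.33, 51.42)` is treated as a range, while the same string at a coordinate inside the Canada box is flagged as ambiguous and can be discarded as it is today.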

@missinglink
Member Author

We already have some configurable values to control how we handle hyphens:
https://github.com/pelias/interpolation/blob/master/lib/analyze.js#L1-L8

@vicchi
Contributor

vicchi commented Jul 15, 2020

@missinglink Putting my "UK centric addressing format" hat on for a second ...

When running ./interpolate to create address.db, based on a GB OSM extract, I see lots of messages like ...

could not reliably parse housenumber 6 & 8
could not reliably parse housenumber 104-114

... both of which make sense for the UK. Apartment/flat/unit numbers are (almost) always expressed as Apt 1 or Flat 1 followed by the rest of the address, so 104-114 is (almost) always a building number range. Of course, except when it's not. But mostly it is. This is not an exact science as you well know.

Also, given that a significant number of UK building number allocations follow odd numbers on the one side of the road and even numbers on the other, then 6 & 8 is really a range of two adjacent buildings. Except when they're not.

Looking at the constants at https://github.com/pelias/interpolation/blob/master/lib/analyze.js#L1-L8 ... I'd welcome some suggestions on what magic values to drop in here and tweak to make the interpolate script treat cases such as these as ranges, since the data sets I'm using for Pelias are (currently) only for the UK.

@missinglink
Member Author

missinglink commented Jul 15, 2020

Hmm, yeah, we can totally add country-specific logic, but there are some potential issues with adding that:

  • We don't actually know which country each address belongs to!
  • The function signatures would need to be updated to allow us to pass this info in, and possibly to return multiple values.

OSM has the concept of interpolation ranges; these are much more reliable and already supported out-of-the-box, as are TIGER ranges.

You should also consider just doing nothing, which I know sounds like an anti-solution but let me explain 😄

Interpolation ranges are only valuable when they are valid; if one or more erroneous members are introduced into the range then it can screw up most of the street.

However, if we have fewer points then we only lose out on precision, so a valid sparse index is probably preferable to a dense range with errors, if that makes sense?

Maybe you could send me an example of a street which you'd like to improve?

@missinglink
Member Author

Ugh, the address coverage in the UK is just so bad. Whatever happened to the OpenAddressesUK project and the rumours of Ordnance Survey opening some block range data up?

There's an interactive demo where you can click streets to see the coverage, which just proves how sparse the coverage is in the UK, even in London.

Maybe I missed a bunch of data in the last import?

@vicchi
Contributor

vicchi commented Jul 15, 2020

Here's a good example, which happens to be my local supermarket ... Tesco, 20-28 Broad St, Teddington TW11 8RF

AFAIK OpenAddresses UK almost got there, but then died due to claims of legal rights over the data from an "organisation", which resulted in almost half of the data being excised, so the project ... expired. See also (cough cough) this.

There is a whole new load of OS open data coming this month as a result of the UK Geospatial Commission shaking things up, and I'm waiting eagerly to see just what gets released and whether I can a) use this in my Pelias instance and then (of course) b) contribute this back to Pelias. But right now ... I'm waiting.

@missinglink
Member Author

END OF RANT. To answer your question: the easiest thing to do is split the data yourself, so a single row in your file becomes two rows, one for the beginning number and one for the end.

That's it; there is no added value in generating the rest of the values within the range, as they can be interpolated.
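
A rough sketch of that splitting step, assuming your input rows look something like `{ number, street, lat, lon }`. The row shape and the function name `splitRangeRow` are hypothetical, and the odd/even parity check is an extra UK-flavoured assumption rather than anything the importer requires.

```javascript
// Hypothetical sketch: the row shape { number, street, lat, lon } is an
// assumption for illustration, not a format defined by pelias/interpolation.
function splitRangeRow(row) {
  // accept "20-28" and "6 & 8"; anything else is returned untouched
  const match = /^\s*(\d+)\s*(?:-|&)\s*(\d+)\s*$/.exec(row.number);
  if (!match) { return [row]; }

  const start = parseInt(match[1], 10);
  const end = parseInt(match[2], 10);

  // only treat the value as a range when it ascends and both ends share
  // parity (odd/even), as is common for one side of a UK street
  if (end <= start || (start % 2) !== (end % 2)) { return [row]; }

  // one row for the beginning number and one for the end; the values in
  // between are left for interpolation
  return [
    Object.assign({}, row, { number: String(start) }),
    Object.assign({}, row, { number: String(end) })
  ];
}
```

For example, `splitRangeRow({ number: '20-28', street: 'Broad St' })` gives two rows numbered `'20'` and `'28'`. Both rows keep whatever coordinate the original row had; whether to offset them is left open here.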

@vicchi
Contributor

vicchi commented Jul 15, 2020

That makes a lot of sense and I'll give that a try. Also, I appreciated the rant about open addressing data in the UK. I feel that way ... a lot

@missinglink
Member Author

Amazing, I've been waiting 6 years for this day. If/when that happens we should jump on a call.

@vicchi
Contributor

vicchi commented Jul 15, 2020

@missinglink Hmm ... three new open data sets are now up on the new OS Data Hub: Open TOIDs, Open UPRNs and Open USRNs ... sadly I'm underwhelmed at first glance. Not what I'd hoped for. They're just identifiers which are linkages into proprietary data sets such as AddressBase and MasterMap ... https://osdatahub.os.uk/downloads/open

@missinglink
Member Author

👑 📧 is the 😈

@missinglink
Member Author

missinglink commented Jul 15, 2020

What I would love to have (at minimum) is 4 house numbers per street: just the start-left, start-right, end-left & end-right house numbers. From this we can figure out quite a lot, and if those 4 points also had postal code info then this would make a huge difference.

What I'm describing is the TIGER file I'm using for the USA; to some degree we could delete all of OSM and OA for the USA and it wouldn't be too bad.
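
To illustrate why four numbers go a long way, here is a sketch of plain linear interpolation over one start/end pair per side. The `street` shape and the function `estimatePosition` are hypothetical, not the TIGER schema or the pelias/interpolation data model, and the sketch assumes each side holds numbers of a single parity.

```javascript
// Hypothetical sketch: the street shape below is an assumption, not the
// TIGER schema or the pelias/interpolation data model.
function estimatePosition(street, housenumber) {
  for (const side of ['left', 'right']) {
    const range = street[side];
    if (!range) { continue; }
    const lo = Math.min(range.start, range.end);
    const hi = Math.max(range.start, range.end);

    // the number must fall inside the range and match its parity
    if (housenumber < lo || housenumber > hi) { continue; }
    if ((housenumber % 2) !== (range.start % 2)) { continue; }

    // fraction of the way along the street, 0 at the start, 1 at the end
    const fraction = range.end === range.start
      ? 0
      : (housenumber - range.start) / (range.end - range.start);
    return { side, fraction };
  }
  return null;
}
```

Given `{ left: { start: 1, end: 99 }, right: { start: 2, end: 100 } }`, number 50 lands on the right side roughly halfway along the street, and a number outside both ranges returns `null`.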

@vicchi
Contributor

vicchi commented Jul 15, 2020

Hmmm ... with some custom preprocessing and tooling you might be able to cobble that together from (off the top of my head) ONS PD, OS OpenNames, CodePoint Open and OpenRoads and WOF. Maybe. Plus some interpolation into OSM. Though that may veer into (OSM ODbL) derived data set licensing horrendousness.

@vicchi
Contributor

vicchi commented Jul 15, 2020

But that would only be for England and Wales. Maybe for Scotland too. But definitely not for Northern Ireland. Because history and politics.

@missinglink
Member Author

Hmm... I had a quick look at those data sources today and I couldn't find anything more granular than a street 😢

@vicchi
Contributor

vicchi commented Jul 16, 2020

@missinglink There's a conversation about UK open data, probably mainly about admin polygons, going on over on Gitter which would be good to get your take on when you have a moment
