Kylin-based analytics option #227
Some of our data from Elasticsearch comes back as integers, some as strings, depending on how it was originally inserted.
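As a rough illustration (the field names here are hypothetical, not the actual schema), a small normalization step could coerce the mixed-type values into integers before they're handed to the rest of the pipeline:

```python
def coerce_int(value):
    """Return an int for values that may arrive as either int or string."""
    if value is None:
        return None
    return int(value)

# A hit whose numeric fields were inserted inconsistently:
hit = {"response_status": "200", "response_time": 45}
normalized = {field: coerce_int(hit.get(field))
              for field in ("response_status", "response_time")}
print(normalized)  # {'response_status': 200, 'response_time': 45}
```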
Place files in /YYYY/MM directories (instead of /YYYY-MM)
Roll back the Parquet version so it's compatible with the version of Hive we're loading into. Also place the Parquet files in the directories expected by Hive partitioning. This also includes some unused code related to testing TSV and Avro outputs.
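As a minimal sketch of the directory layout change (the base path is a placeholder), each month's output lands under `/YYYY/MM` rather than `/YYYY-MM`, so a Hive partition can point at a single month directory:

```python
from datetime import datetime, timezone

def partition_dir(base, ts):
    """Place output under /YYYY/MM (rather than /YYYY-MM) so each
    month lands in a directory a Hive partition can be mapped to."""
    return f"{base}/{ts:%Y}/{ts:%m}"

print(partition_dir("/logs", datetime(2015, 4, 9, tzinfo=timezone.utc)))
# /logs/2015/04
```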
Tests seem to indicate ORC has the best performance for us. It also offers the best compatibility with the older version of Hive we're using for HDP 2.2/Kylin compatibility (allowing us to support things like timestamps, which aren't supported on the older version of Parquet). This cleans up our dependencies, removes all the file outputs except ORC, and tunes a few of the output fields to be specific to ORC files (using SMALLINTs, etc).
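For context, a hedged sketch of what an ORC-backed Hive table along these lines might look like; the table and column names are illustrative, not the project's actual schema:

```python
# Illustrative HiveQL DDL, held as a string for use with whatever Hive
# client executes it. Types are tuned for ORC (SMALLINT/TINYINT for
# small numeric fields); the timestamp is a BIGINT (see the next note).
CREATE_TABLE_DDL = """
CREATE TABLE api_requests (
  request_at BIGINT,         -- epoch timestamp stored as a raw integer
  response_status SMALLINT,  -- small numeric fields tuned to SMALLINT
  request_host STRING
)
PARTITIONED BY (request_at_year SMALLINT, request_at_month TINYINT)
STORED AS ORC
"""
```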
Kylin doesn't support MAX() on timestamp fields, so we need to store the timestamp as a raw integer value.
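A minimal sketch of that workaround, assuming epoch milliseconds in a BIGINT column (names are hypothetical): convert at the edges, and aggregate over the integer in between.

```python
from datetime import datetime, timezone

def to_epoch_ms(ts):
    """Store timestamps as integer epoch milliseconds so MAX() works."""
    return int(ts.timestamp() * 1000)

# The analytics query can then aggregate over the integer column, e.g.:
QUERY = "SELECT MAX(request_at) FROM api_requests"  # request_at is a BIGINT

ms = to_epoch_ms(datetime(2015, 4, 9, 16, 0, tzinfo=timezone.utc))
print(ms)                                                  # 1428595200000
print(datetime.fromtimestamp(ms / 1000, tz=timezone.utc))  # 2015-04-09 16:00:00+00:00
```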
This solution isn't exactly pretty or optimized, but for this first pass, I think we'll just try to emulate the ElasticSearch output verbatim, so we don't have to update any other parts of the app. But this could be cleaned up and optimized if we eventually just have to worry about supporting SQL.
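To make the "emulate the Elasticsearch output verbatim" idea concrete, here's a rough sketch (the aggregation name and response shape are assumptions modeled on Elasticsearch "terms" aggregations) of translating SQL GROUP BY rows back into the response structure the rest of the app already consumes:

```python
def rows_to_es_buckets(rows):
    """rows: iterable of (key, count) tuples from a SQL GROUP BY."""
    return {
        "aggregations": {
            "top_hosts": {
                "buckets": [
                    {"key": key, "doc_count": count} for key, count in rows
                ]
            }
        }
    }

print(rows_to_es_buckets([("api.data.gov", 1200), ("example.com", 34)]))
```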
This is still a work in progress, but a few notes based on what we've figured out so far.
This shifts our caching location of the city data into MongoDB. This might not be the best final location for it (we may want to move it into SQL land), but for a first pass, this will let the same basic approach work regardless of where the raw log data is stored (elasticsearch or kylin).
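As a hedged sketch of that caching approach (the database, collection, and field names here are hypothetical), city geocoding results get upserted into a MongoDB collection that either analytics backend can consult:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
cities = client["api_umbrella"]["analytics_cities"]

# Cache the geocoded location for a city, regardless of whether the raw
# log data lives in Elasticsearch or Kylin.
cities.update_one(
    {"country": "US", "region": "DC", "city": "Washington"},
    {"$set": {"location": {"type": "Point",
                           "coordinates": [-77.0369, 38.9072]}}},
    upsert=True,
)
```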
It's pretty hacky and terribly inefficient for now, but it loads.
- The drilldown queries were erroring at the top host level.
- Some of the filter logs queries were bombing because we were inadvertently performing extra group by and aggregation operations on unrelated queries.
- Optimized some odd slowness with the request_ip IS NULL queries.
This is to answer queries that cannot be satisfied by the Kylin data cubes.
The basic idea is that we'll still use Heka to log to possibly different destinations (Elasticsearch, Postgres, or HDFS). For HDFS (for use with Hive and Kylin), we'll log from Heka to Flume, which has better integration for writing directly to HDFS. This updates the various logged field names for better consistency with the SQL design of the table.
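For illustration only, a Flume agent along these lines could receive events and write them into HDFS; the source type, names, ports, and paths below are assumptions, not the project's actual configuration:

```properties
# Hypothetical Flume agent: receive events (e.g. forwarded from Heka)
# and roll them into /YYYY/MM directories in HDFS.
agent.sources = log-in
agent.channels = mem
agent.sinks = hdfs-out

agent.sources.log-in.type = avro
agent.sources.log-in.bind = 0.0.0.0
agent.sources.log-in.port = 4545
agent.sources.log-in.channels = mem

agent.channels.mem.type = memory

agent.sinks.hdfs-out.type = hdfs
agent.sinks.hdfs-out.channel = mem
agent.sinks.hdfs-out.hdfs.path = /apis/logs/%Y/%m
agent.sinks.hdfs-out.hdfs.fileType = DataStream
agent.sinks.hdfs-out.hdfs.useLocalTimeStamp = true
agent.sinks.hdfs-out.hdfs.rollInterval = 300
```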
CMake's externals handling might be a much cleaner and more reliable way to handle building the various sub-projects. It also looks like we might be able to use cpack to create our binary packages (rather than needing fpm).
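As a hedged sketch of that approach (the dependency name, URL, and commands are placeholders), CMake's ExternalProject module can drive a bundled sub-project's configure/build/install steps:

```cmake
include(ExternalProject)

# Hypothetical sub-project built from a source tarball; <SOURCE_DIR> and
# <INSTALL_DIR> are ExternalProject's own placeholder tokens.
ExternalProject_Add(
  some_dependency
  URL https://example.com/some_dependency.tar.gz
  PREFIX ${CMAKE_BINARY_DIR}/externals/some_dependency
  CONFIGURE_COMMAND <SOURCE_DIR>/configure --prefix=<INSTALL_DIR>
  BUILD_COMMAND make
  INSTALL_COMMAND make install
)
```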
The various admin searches were passed to Mongo as regex searches. However, we only wanted to perform a "contains" search, not actually expose raw regexes. Exposing the raw regexes caused searches with various special characters to fail (since they weren't escaped). Allowing raw regexes also opened the door to potential regex-based denial of service by untrusted admins. Brakeman pointed this out, so we're also adding brakeman and bundle-audit to our CI runs for better security checks.
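The underlying fix amounts to escaping the user's input before it reaches the regex engine. A minimal sketch (field name and helper are illustrative):

```python
import re

def contains_filter(field, term):
    """Build a Mongo query doing a case-insensitive "contains" match,
    escaping any regex metacharacters in the user-supplied term."""
    return {field: {"$regex": re.escape(term), "$options": "i"}}

print(contains_filter("email", "user+test@example.com"))
# {'email': {'$regex': 'user\\+test@example\\.com', '$options': 'i'}}
```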
Hitting the publish API without any selected changes now performs no action. The publish button is also now disabled in the UI until the user checks at least one checkbox. See 18F/api.data.gov#307
The listing of website backends didn't properly account for admin permissions in some cases. But those admins couldn't edit or display the individual website backends, so the issue was limited to the listing page. See: 18F/api.data.gov#261 This also adds a "none" scope for Mongoid, which a couple of our policies were already (erroneously) relying on, and which was resulting in errors being thrown in certain unexpected admin permission circumstances.
There were some small shifts in some of the geoip results in the latest data file. Fix things so we're not quite as sensitive to these small changes in the future (we'll allow results within 0.02, since we don't actually care that much about the detailed precision). Also fix one of the foreign tests with a special character, since that IP no longer geocodes to the city; we'll use Bogotá instead.
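The looser assertion boils down to an absolute-difference check; a minimal sketch with example coordinates:

```python
def assert_close(actual, expected, tolerance=0.02):
    """Pass if the geoip coordinate is within the tolerance of expected."""
    assert abs(actual - expected) <= tolerance, (actual, expected)

assert_close(38.9072, 38.8951)  # passes: difference is ~0.0121
```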
If the server was being run in development mode, bundler-audit was causing require issues, since it expected ENV["HOME"] to be set (and it's not in the server context).
This makes it easier to browse the groups and figure out who the group members are.
I'm going to go ahead and merge this into master. There are still a few things to sort out with the Kylin implementation (surrounding segment merging after certain time periods), but it's largely functional, and the default Elasticsearch functionality has proven to be backwards compatible (so this shouldn't impact other users). I'd also still like to implement a basic SQL analytics adapter (using Postgres), but that can wait. Getting this into master will also make it easier to work on the next 0.12 release without mixing other things up in this branch. The build changes will be useful to merge in (since they change a number of files), and we've also been fixing some other bugs while staging this branch which I don't want to get mixed up in this pull request (there are unfortunately a few of those small fixes already mixed up in here, but I'd like to avoid any further branch messiness).
This adds the ability to switch the analytics system to utilize Kylin instead of Elasticsearch.
For more context on this functionality, see 18F/api.data.gov#235
This also includes a switch in the build system to CMake. See #226 for why that functionality got merged into this branch (basically, it will help in packaging all this up).
Instead of doing a lot of rambling in this pull request, I've tried to consolidate my rambling into some new documentation pages:
There are still some TODOs scattered throughout the docs, and those are the primary things remaining to complete:
In terms of release planning and packaging, here's what I'm thinking:
@cmc333333: I think there are two primary differences since we last discussed things: