
Kylin-based analytics option #227

Merged: 197 commits merged into master from analytics-revamp on May 22, 2016

Conversation

@GUI (Member) commented Mar 25, 2016

This adds the ability to switch the analytics system to utilize Kylin instead of Elasticsearch.

For more context on this functionality, see 18F/api.data.gov#235

This also includes a switch in the build system to CMake. See #226 for why that functionality got merged into this branch (basically, it will help in packaging all this up).

Instead of doing a lot of rambling in this pull request, I've tried to consolidate my rambling into some new documentation pages:

  • Analytics Storage Adapters: This describes what a system admin maintaining an API Umbrella installation would need to know about picking the different storage engines.
  • Analytics Architecture: This describes the overall analytics architecture in more detail.

There are still some TODOs scattered throughout the docs, but here are the primary things remaining to complete:

  • For feature complete:
    • Implement Kylin refresher background task.
    • Add an option to log to Elasticsearch and Kylin at the same time for a transition period.
  • For merging and shipping this:
    • Fix test suite broken by CMake changes.
    • Fix the analytics tests by switching the test suite back to Elasticsearch as the default analytics adapter.
    • Implement a PostgreSQL storage adapter leveraging the same SQL Kylin uses. This should also give us a way to more easily test the primary SQL functionality without having to be connected to a Hadoop setup (there's a rough sketch of this idea after this list).
    • Fix binary packaging under CMake setup.
    • Create a separate binary package for the optional Kylin analytics dependencies.
    • Ideally, figure out some way to run our test suite against Kylin & Hadoop, but that might be pretty difficult to set up in a CI environment given all the Hadoop setup required.
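
To make the "same SQL" point in the PostgreSQL adapter item a bit more concrete, here's a minimal, purely illustrative sketch (not code from this branch): the adapter idea is that identical SQL can be issued against either backend. The table, column, host, and credential values are placeholders, and it assumes Kylin's REST query endpoint (`POST /kylin/api/query`).

```python
import psycopg2
import requests

# Placeholder query; the real schema/column names differ.
SQL = """
    SELECT request_host, COUNT(*) AS hits
    FROM api_umbrella_logs
    WHERE timestamp_utc >= %(start)s AND timestamp_utc < %(end)s
    GROUP BY request_host
    ORDER BY hits DESC
"""

def query_postgres(dsn, params):
    # Plain PostgreSQL: lets us exercise the same SQL without a Hadoop cluster.
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(SQL, params)
            return cur.fetchall()

def query_kylin(base_url, project, params, auth=("ADMIN", "KYLIN")):
    # Kylin answers SQL over its REST query API; no bind parameters here, so
    # the (integer epoch) values are interpolated directly into the statement.
    resp = requests.post(
        base_url + "/kylin/api/query",
        json={"sql": SQL % params, "project": project, "offset": 0, "limit": 50000},
        auth=auth,
    )
    resp.raise_for_status()
    return resp.json()["results"]  # rows come back in a "results" array
```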

In terms of release planning and packaging, here's what I'm thinking:

  • I'd like to get out at least one more v0.11.x release with some bug fixes that have already happened on master, along with maybe a couple other small bug fixes.
  • All of this analytics stuff gets merged in for v0.12 release. Elasticsearch will still be the default analytics storage adapter for this release.
  • v0.13 is when we switch to PostgreSQL as the default storage adapter and deprecate the Elasticsearch adapter (unless there's interest in maintaining it from others).

@cmc333333: I think there are two primary differences since we last discussed things:

  1. Heka's been swapped out for rsyslog. They both basically do the same thing with regards to buffering log data and then forwarding it on to other places (possibly multiple places concurrently). The reason for this switch was that I ran into various issues with Heka's ability to send messages to Kafka (which is how the new setup eventually gets things into Hadoop & Kylin). rsyslog's Kafka output worked, and it also offers a built-in PostgreSQL output (versus a third-party Heka plugin), along with other SQL outputs, so it seemed desirable for that functionality as well.
  2. How we're getting logs from nginx into Hadoop and ultimately the ORC Hive tables has changed a bit. It's unfortunately complicated, but I've tried to explain it in the docs.
    • Last we talked, I think I was struggling with Flume. You had suggested trying our own code to utilize the streaming Hive table support, which I did implement earlier on this branch. In some ways it did work better than Flume's streaming Hive table support (less memory usage and crashing), but the more I tinkered, the more I realized there were a lot of rather nasty edge conditions around getting data into that system reliably without possibly missing chunks of it. I also discovered a number of bugs with Hive streaming in the version of Hive we have to use for Kylin compatibility (I think it was the first version to support Hive streaming). While it's possible we could have worked around those bugs, in combination with tackling the various edge conditions, it seemed like quite a bit of work, so I ended up not using the Hive streaming table support.
    • Instead, I'm back to Flume, with basic HDFS TSV files. It's not particularly elegant, and in some ways it's more code and hops on our end, but it's all using basic, well-established HDFS and Hive technologies (there's a rough sketch of the load step after this list).
    • As noted on the docs page, though, if Hive streaming and Kylin streaming both come together in the future, then it's definitely possible all this could be greatly simplified. But in the meantime, I'm hoping this solution should be okay.
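
Since the Flume → HDFS TSV → ORC hop is easier to see in code than in prose, here's a rough sketch of the kind of load step involved. This is illustrative only: the table names, columns, and paths are made up, and the branch's actual scripts handle this in their own way.

```python
import subprocess

def hive(sql):
    # Run a HiveQL statement via the hive CLI (hive -e "...").
    subprocess.check_call(["hive", "-e", sql])

def load_day(day):  # day like "2016-05-22"
    # Flume has been writing tab-delimited files under per-day HDFS directories.
    hdfs_dir = "/apps/logs_tsv/%s/%s/%s" % (day[0:4], day[5:7], day)

    # Register that directory as a partition of a TEXTFILE staging table.
    hive("""
        ALTER TABLE logs_tsv_staging
        ADD IF NOT EXISTS PARTITION (day='%s') LOCATION '%s'
    """ % (day, hdfs_dir))

    # Copy the day's rows into the partitioned ORC table Kylin builds cubes from.
    hive("""
        INSERT OVERWRITE TABLE logs_orc PARTITION (day='%s')
        SELECT timestamp_utc, request_host, request_path, response_status
        FROM logs_tsv_staging WHERE day='%s'
    """ % (day, day))

if __name__ == "__main__":
    load_day("2016-05-22")
```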

GUI added 30 commits February 14, 2016 18:25
Some of our data from elasticsearch comes back as integers, others as
strings, depending on how it was originally inserted.
Place files in /YYYY/MM directories (instead of /YYYY-MM)
Roll back the parquet version so it's compatible with the version of Hive we're loading into.

Also place the parquet files in the directories expected by hive
partitioning.

This also includes some unused code related to testing TSV and Avro
outputs.
Tests seem to indicate ORC has the best performance for us. It also
offers the best compatibility with the older version of Hive we're using
for HDP 2.2/Kylin compatibility (allowing us to support things like
timestamps, which aren't supported on the older version of Parquet).

This cleans up our dependencies, removes all the file outputs except
ORC, and tunes a few of the output fields to be specific to ORC files
(using SMALLINTs, etc).
Kylin doesn't support MAX() on timestamp fields, so we need to store the
timestamp as a raw integer value.
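
(Purely as illustration of that workaround, with a hypothetical column name: the timestamp is stored as an epoch-milliseconds integer so MAX() works as a plain integer aggregate, and the result is converted back to a time in app code.)

```python
from datetime import datetime, timezone

# e.g. the value returned by: SELECT MAX(timestamp_utc) FROM ...
row_max_ms = 1463925960000
last_request_at = datetime.fromtimestamp(row_max_ms / 1000.0, tz=timezone.utc)
print(last_request_at.isoformat())  # 2016-05-22T14:06:00+00:00
```
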
This solution isn't exactly pretty or optimized, but for this first
pass, I think we'll just try to emulate the ElasticSearch output
verbatim, so we don't have to update any other parts of the app. But
this could be cleaned up and optimized if we eventually just have to
worry about supporting SQL.
This is still a work in progress, but a few notes based on what we've
figured out so far.
This shifts our caching location of the city data into MongoDB. This
might not be the best final location for it (we may want to move it into
SQL land), but for a first pass, this will let the same basic approach
work regardless of where the raw log data is stored (elasticsearch or
kylin).
It's pretty hacky and terribly inefficient for now, but it loads.
- The drilldown queries were erroring at the top host-level.
- Some of the filter logs queries were bombing because we were
  inadvertently performing other group by and aggregations on unrelated
  queries.
- Optimize some odd slowness with the request_ip IS NULL queries.
This is to answer queries that cannot be satisfied by the Kylin data
cubes.
The basic idea is that we'll still use heka to log to possibly different
sources (elasticsearch, postgres, or hdfs). For HDFS (for use with Hive
and Kylin), we'll perform logging from Heka to Flume, which has better
integration with writing directly to HDFS.

This updates the various logged field names for better consistency with
the SQL design of the table.
CMake's externals handling might be a much cleaner and more reliable way
to handle building the various sub-projects. It also looks like we might
be able to use cpack to create our binary packages (rather than needing
fpm).
GUI added 10 commits April 28, 2016 22:35
The various admin searches were passed to mongo as a regex search. However, we only wanted to perform a "contains" search, and not actually expose the raw regexes. Exposing the raw regexes caused searches with various special characters to fail (since they weren't escaped).

Allowing raw regexes also allowed for potential regex-based denial of service by untrusted admins. Brakeman pointed this out, so we're also adding brakeman and bundle-audit to our CI runs for better security checks.
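
(The admin app itself is Ruby, but the gist of the fix is just escaping the user's term before handing it to Mongo as a regex, so it behaves as a literal "contains" match. A small sketch in Python for illustration; the field and collection names are made up.)

```python
import re

def contains_filter(field, user_input):
    # Escape special characters so the search term matches literally,
    # giving a case-insensitive substring match rather than a raw regex.
    return {field: {"$regex": re.escape(user_input), "$options": "i"}}

# e.g. admins.find(contains_filter("email", "foo+bar@example.com"))
```
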
Hitting the publish API without any selected changes now performs no
action. The publish button is also now disabled in the UI until the user
checks at least one checkbox.

See 18F/api.data.gov#307
The listing of website backends didn't properly account for admin
permissions in some cases. But the admins couldn't edit or display the
individual website backends, so the issue was limited only to the
listing page.

See: 18F/api.data.gov#261

This also adds a "none" scope for mongoid, which a couple of our
policies were already erroneously using (which was resulting in errors
being thrown in certain unexpected admin permission circumstances).
@GUI force-pushed the analytics-revamp branch from cb2be83 to c909dc0 on May 4, 2016 01:55
GUI added 14 commits May 3, 2016 22:27
There were some small shifts in some of the geoip results in the latest data file. Fix things so we're not quite as sensitive to these small changes in the future (we'll allow results within 0.02, since we don't actually care that much about the detailed precision).

Also fix one of the foreign tests with a special character, since that
IP no longer geocodes to the city. We'll use Bogotá instead.
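
(Roughly, the looser assertion looks like the following; the coordinates here are invented examples, not the actual test fixtures.)

```python
def assert_close(actual, expected, tolerance=0.02):
    # Compare geocoded coordinates within a tolerance instead of exactly, so
    # small shifts in newer GeoIP data files don't break the tests.
    assert abs(actual - expected) <= tolerance, \
        "%r not within %r of %r" % (actual, tolerance, expected)

assert_close(4.6097, 4.60971)      # e.g. latitude
assert_close(-74.0818, -74.08175)  # e.g. longitude
```
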
If the server was being run in development mode, bundler-audit was
causing require issues, since it expected ENV["HOME"] to be set (and
it's not in the server context).
This makes it easier to browse the groups and figure out who the group members are.
@GUI (Member, Author) commented May 22, 2016

I'm going to go ahead and merge this into master. There are still a few things to sort out with the Kylin implementation (surrounding segment merging after certain time periods), but it's largely functional, and the default Elasticsearch functionality has proven to be backwards compatible (so this shouldn't impact other users). I'd also still like to implement a basic SQL analytics adapter (using postgres), but that can wait.

This will also be handy to get into master, so we can more easily work on the next 0.12 release without mixing other things up in this branch. The build changes will be useful to merge in (since they change a number of files), and we've also been fixing some other bugs while staging this branch that I don't want to get mixed up in this pull request (there are unfortunately a few of those small fixes already mixed up in here, but I'd like to avoid any further branch messiness).

@GUI merged commit 08259aa into master on May 22, 2016
@GUI changed the title from "Kylin-based analytics option (In Progress - Do Not Merge)" to "Kylin-based analytics option" on May 22, 2016
@GUI deleted the analytics-revamp branch on May 22, 2016 14:26
@GUI added this to the v0.12 milestone on Jun 30, 2016