
Kylin-based analytics option #227

Merged: 197 commits merged into master from analytics-revamp on May 22, 2016

Conversation

@GUI (Member) commented Mar 25, 2016

This adds the ability to switch the analytics system to utilize Kylin instead of Elasticsearch.

For more context on this functionality, see 18F/api.data.gov#235

This also includes a switch in the build system to CMake. See #226 for why that functionality got merged into this branch (basically, it will help in packaging all this up).

Instead of doing a lot of rambling in this pull request, I've tried to consolidate my rambling into some new documentation pages:

  • Analytics Storage Adapters: This describes what a system admin maintaining an API Umbrella installation would need to know about picking the different storage engines.
  • Analytics Architecture: This describes the overall analytics architecture in more detail.

There are still some TODOs scattered throughout the docs, but here are the primary things remaining to complete:

  • For feature complete:
    • Implement Kylin refresher background task.
    • Add an option to log to Elasticsearch and Kylin at the same time for a transition period.
  • For merging and shipping this:
    • Fix test suite broken by CMake changes.
    • Fix the analytics tests by switching the test suite back to Elasticsearch as the default analytics adapter.
    • Implement a PostgreSQL storage adapter leveraging the same SQL Kylin uses. This should also give us a way to more easily test the primary SQL functionality without having to be connected to a Hadoop setup (there's a rough sketch of this idea after this list).
    • Fix binary packaging under CMake setup.
    • Create a separate binary package for the optional Kylin analytics dependencies.
    • Ideally, figure out some way to run our test suite against Kylin & Hadoop, but that might be pretty difficult to set up in a CI environment given all the Hadoop setup required.
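
To make the "same SQL" point in the PostgreSQL adapter item a bit more concrete, here's a minimal, purely illustrative sketch (not code from this branch): the adapter idea is that identical SQL can be issued against either backend. The table, column, host, and credential values are placeholders, and it assumes Kylin's REST query endpoint (`POST /kylin/api/query`).

```python
import psycopg2
import requests

# Placeholder query; the real schema/column names differ.
SQL = """
    SELECT request_host, COUNT(*) AS hits
    FROM api_umbrella_logs
    WHERE timestamp_utc >= %(start)s AND timestamp_utc < %(end)s
    GROUP BY request_host
    ORDER BY hits DESC
"""

def query_postgres(dsn, params):
    # Plain PostgreSQL: lets us exercise the same SQL without a Hadoop cluster.
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(SQL, params)
            return cur.fetchall()

def query_kylin(base_url, project, params, auth=("ADMIN", "KYLIN")):
    # Kylin answers SQL over its REST query API; no bind parameters here, so
    # the (integer epoch) values are interpolated directly into the statement.
    resp = requests.post(
        base_url + "/kylin/api/query",
        json={"sql": SQL % params, "project": project, "offset": 0, "limit": 50000},
        auth=auth,
    )
    resp.raise_for_status()
    return resp.json()["results"]  # rows come back in a "results" array
```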

In terms of release planning and packaging, here's what I'm thinking:

  • I'd like to get out at least one more v0.11.x release with some bug fixes that have already happened on master, along with maybe a couple other small bug fixes.
  • All of this analytics stuff gets merged in for v0.12 release. Elasticsearch will still be the default analytics storage adapter for this release.
  • v0.13 is when we switch to PostgreSQL as the default storage adapter and deprecate the Elasticsearch adapter (unless there's interest in maintaining it from others).

@cmc333333: I think there are two primary differences since we last discussed things:

  1. Heka's been swapped out for rsyslog. They both basically do the same thing with regards to buffering log data and then forwarding it on to other places (possibly multiple places concurrently). The reason for this switch was that I ran into various issues with Heka's ability to send messages to Kafka (which is how the new setup eventually gets things into Hadoop & Kylin). rsyslog's Kafka output worked, and it also offers a built-in PostgreSQL output (versus a third-party Heka plugin), along with other SQL outputs, so it seemed desirable for that functionality as well.
  2. How we're getting logs from nginx into Hadoop and ultimately the ORC Hive tables has changed a bit. It's unfortunately complicated, but I've tried to explain it in the docs.
    • Last we talked, I think I was struggling with Flume. You had suggested trying our own code to utilize the streaming Hive table support, which I did implement earlier on this branch. In some ways it did work better than Flume's streaming Hive table support (less memory usage and crashing), but the more I tinkered, the more I realized there were a lot of rather nasty edge conditions around getting data into that system reliably without possibly missing chunks of it. I also discovered a number of bugs with Hive streaming in the version of Hive we have to use for Kylin compatibility (I think it was the first version to support Hive streaming). While it's possible we could have worked around those bugs, in combination with tackling the various edge conditions, it seemed like quite a bit of work, so I ended up not using the Hive streaming table support.
    • Instead, I'm back to Flume, with basic HDFS TSV files. It's not particularly elegant, and in some ways it's more code and hops on our end, but it's all using basic, well-established HDFS and Hive technologies (there's a rough sketch of the load step after this list).
    • As noted on the docs page, though, if Hive streaming and Kylin streaming both come together in the future, then it's definitely possible all this could be greatly simplified. But in the meantime, I'm hoping this solution should be okay.
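
Since the Flume → HDFS TSV → ORC hop is easier to see in code than in prose, here's a rough sketch of the kind of load step involved. This is illustrative only: the table names, columns, and paths are made up, and the branch's actual scripts handle this in their own way.

```python
import subprocess

def hive(sql):
    # Run a HiveQL statement via the hive CLI (hive -e "...").
    subprocess.check_call(["hive", "-e", sql])

def load_day(day):  # day like "2016-05-22"
    # Flume has been writing tab-delimited files under per-day HDFS directories.
    hdfs_dir = "/apps/logs_tsv/%s/%s/%s" % (day[0:4], day[5:7], day)

    # Register that directory as a partition of a TEXTFILE staging table.
    hive("""
        ALTER TABLE logs_tsv_staging
        ADD IF NOT EXISTS PARTITION (day='%s') LOCATION '%s'
    """ % (day, hdfs_dir))

    # Copy the day's rows into the partitioned ORC table Kylin builds cubes from.
    hive("""
        INSERT OVERWRITE TABLE logs_orc PARTITION (day='%s')
        SELECT timestamp_utc, request_host, request_path, response_status
        FROM logs_tsv_staging WHERE day='%s'
    """ % (day, day))

if __name__ == "__main__":
    load_day("2016-05-22")
```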

GUI added 30 commits February 14, 2016 18:25
Some of our data from elasticsearch comes back as integers, others as
strings, depending on how it was originally inserted.
Place files in /YYYY/MM directories (instead of /YYYY-MM)
Roll back the parquet version so it's compatible with the version of Hive we're loading into.

Also place the parquet files in the directories expected by hive
partitioning.

This also includes some unused code related to testing TSV and Avro
outputs.
Tests seem to indicate ORC has the best performance for us. It also
offers the best compatibility with the older version of Hive we're using
for HDP 2.2/Kylin compatibility (allowing us to support things like
timestamps, which aren't supported on the older version of Parquet).

This cleans up our dependencies, removes all the file outputs except
ORC, and tunes a few of the output fields to be specific to ORC files
(using SMALLINTs, etc).
Kylin doesn't support MAX() on timestamp fields, so we need to store the
timestamp as a raw integer value.
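
(Purely as illustration of that workaround, with a hypothetical column name: the timestamp is stored as an epoch-milliseconds integer so MAX() works as a plain integer aggregate, and the result is converted back to a time in app code.)

```python
from datetime import datetime, timezone

# e.g. the value returned by: SELECT MAX(timestamp_utc) FROM ...
row_max_ms = 1463925960000
last_request_at = datetime.fromtimestamp(row_max_ms / 1000.0, tz=timezone.utc)
print(last_request_at.isoformat())  # 2016-05-22T14:06:00+00:00
```
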
This solution isn't exactly pretty or optimized, but for this first
pass, I think we'll just try to emulate the ElasticSearch output
verbatim, so we don't have to update any other parts of the app. But
this could be cleaned up and optimized if we eventually just have to
worry about supporting SQL.
This is still a work in progress, but a few notes based on what we've
figured out so far.
This shifts our caching location of the city data into MongoDB. This
might not be the best final location for it (we may want to move it into
SQL land), but for a first pass, this will let the same basic approach
work regardless of where the raw log data is stored (elasticsearch or
kylin).
It's pretty hacky and terribly inefficient for now, but it loads.
- The drilldown queries were erroring at the top host-level.
- Some of the filter logs queries were bombing because we were
  inadvertently performing other group by and aggregations on unrelated
  queries.
- Optimize some odd slowness with the request_ip IS NULL queries.
This is to answer queries that cannot be satisfied by the Kylin data
cubes.
The basic idea is that we'll still use heka to log to possibly different
sources (elasticsearch, postgres, or hdfs). For HDFS (for use with Hive
and Kylin), we'll perform logging from Heka to Flume, which has better
integration with writing directly to HDFS.

This updates the various logged field names for better consistency with
the SQL design of the table.
CMake's externals handling might be a much cleaner and more reliable way
to handle building the various sub-projects. It also looks like we might
be able to use cpack to create our binary packages (rather than needing
fpm).
GUI added 10 commits April 28, 2016 22:35
The various admin searches were passed to mongo as a regex search. However, we only wanted to perform a "contains" search, and not actually expose the raw regexes. Exposing the raw regexes caused searches with various special characters to fail (since they weren't escaped).

Allowing raw regexes also allowed for potential regex-based denial of service by untrusted admins. Brakeman pointed this out, so we're also adding brakeman and bundle-audit to our CI runs for better security checks.
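
(The admin app itself is Ruby, but the gist of the fix is just escaping the user's term before handing it to Mongo as a regex, so it behaves as a literal "contains" match. A small sketch in Python for illustration; the field and collection names are made up.)

```python
import re

def contains_filter(field, user_input):
    # Escape special characters so the search term matches literally,
    # giving a case-insensitive substring match rather than a raw regex.
    return {field: {"$regex": re.escape(user_input), "$options": "i"}}

# e.g. admins.find(contains_filter("email", "foo+bar@example.com"))
```
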
Hitting the publish API without any selected changes now performs no
action. The publish button is also now disabled in the UI until the user
checks at least one checkbox.

See 18F/api.data.gov#307
The listing of website backends didn't properly account for admin
permissions in some cases. But the admins couldn't edit or display the
individual website backends, so the issue was limited only to the
listing page.

See: 18F/api.data.gov#261

This also adds a "none" scope for mongoid, which a couple of our
policies were already erroneously using (which was resulting in errors
being thrown in certain unexpected admin permission circumstances).
@GUI force-pushed the analytics-revamp branch from cb2be83 to c909dc0 on May 4, 2016 01:55
GUI added 14 commits May 3, 2016 22:27
There were some small shifts in some of the geoip results in the latest data file. Fix things so we're not quite as sensitive to these small changes in the future (we'll allow results within 0.02, since we don't actually care that much about the detailed precision).

Also fix one of the foreign tests with a special character, since that
IP no longer geocodes to the city. We'll use Bogotá instead.
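
(Roughly, the looser assertion looks like the following; the coordinates here are invented examples, not the actual test fixtures.)

```python
def assert_close(actual, expected, tolerance=0.02):
    # Compare geocoded coordinates within a tolerance instead of exactly, so
    # small shifts in newer GeoIP data files don't break the tests.
    assert abs(actual - expected) <= tolerance, \
        "%r not within %r of %r" % (actual, tolerance, expected)

assert_close(4.6097, 4.60971)      # e.g. latitude
assert_close(-74.0818, -74.08175)  # e.g. longitude
```
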
If the server was being run in development mode, bundler-audit was
causing require issues, since it expected ENV["HOME"] to be set (and
it's not in the server context).
This makes it easier to browse the groups and figure out who the group members are.
@GUI (Member, Author) commented May 22, 2016

I'm going to go ahead and merge this into master. There are still a few things to sort out with the Kylin implementation (surrounding segment merging after certain time periods), but it's largely functional, and the default Elasticsearch functionality has proven to be backwards compatible (so this shouldn't impact other users). I'd also still like to implement a basic SQL analytics adapter (using postgres), but that can wait.

This will also be handy to get into master, so we can more easily work on the next 0.12 release without mixing other things up in this branch. The build changes will be useful to merge in (since they change a number of files), and we've also been fixing some other bugs while staging this branch that I don't want to get mixed up in this pull request (there are unfortunately a few of those small fixes already mixed up in here, but I'd like to avoid any further branch messiness).

@GUI merged commit 08259aa into master on May 22, 2016
@GUI changed the title from "Kylin-based analytics option (In Progress - Do Not Merge)" to "Kylin-based analytics option" on May 22, 2016
@GUI deleted the analytics-revamp branch on May 22, 2016 14:26
@GUI added this to the v0.12 milestone on Jun 30, 2016