Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pkg/ottl] Add ParseMySQL converter to handle parsing/normalizing MySQL logs #34610

Open
TylerHelmuth opened this issue Aug 12, 2024 · 5 comments
Labels
discussion needed Community discussion needed enhancement New feature or request pkg/ottl

Comments

@TylerHelmuth
Copy link
Member

TylerHelmuth commented Aug 12, 2024

Component(s)

pkg/ottl

Is your feature request related to a problem? Please describe.

I came across an issue recently that I could not easily solve using OTTL's existing ExtractPatterns and replace_pattern functions.

I was experimenting with ingesting MySQL slow query logs and found that normalization of the query (replacing the parts of the slow query that are unique such as input parameters) was very hard to do with regex.

Consider a slow query log such as

# Time: 140905  6:33:11
# User@Host: dbuser[dbname] @ hostname [1.2.3.4]
# Query_time: 0.116250  Lock_time: 0.000035 Rows_sent: 0  Rows_examined: 20878
SET timestamp=1723472310;
SELECT
    UGI.dept_id,
    UGI.dept_name,
    COUNT(UGI.id) AS totalInnovators,
    SUM(UGI.totalIdeas) AS totalIdeas,
    SUM(UGI.points) AS totalPoints
FROM
    (
        SELECT
            UPI.id,
            UPI.dept_id,
            DI.name as dept_name,
            UPI.points,
            COUNT(UI.idea_id) AS totalIdeas
        FROM
            user_info AS UPI
            INNER JOIN dept_info AS DI ON UPI.dept_id = DI.id
            LEFT JOIN user_idea AS UI ON UPI.id = UI.user_id
        GROUP BY
            UPI.id
    ) AS UGI
GROUP BY
    UGI.dept_id
HAVING
    (
        totalPoints > 0
        AND totalInnovators > 0
    )
ORDER BY
    SUM(UGI.points) / COUNT(UGI.id) DESC
LIMIT
    5

Assuming this string is in the body of a log, it would be nice to extract:

  • Time
  • User
  • Host
  • Query time
  • Lock time
  • Rows sent
  • Rows examined
  • The slow query, normalized

Getting all these parts using the ExtractPattern converter is pretty straight forward (although it is a long regex). But normalizing the query was very hard. We were were able to come up with a regex that was close, but it isn't perfect and is quite complex.

  transform:
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - 'merge_maps(attributes, ExtractPatterns(body, "^# Time: (?P<time>[^ ]+)Z\\n# User@Host: (?P<user>[^#]+) @ (?P<client>[^#]+?) (Id:[^#]+)?\\n# Query_time: (?P<query_time>[0-9.]+) *Lock_time: (?P<lock_time>[0-9.]+) *Rows_sent: (?P<rows_sent>[0-9]+) *Rows_examined: (?P<rows_examined>[0-9]+)( *Rows_affected: (?P<rows_affected>[0-9]+))?(?P<normalized_query>(.|\\s)*)"), "upsert")'
          - set(attributes["query"], attributes["normalized_query"])
          - replace_pattern(attributes["normalized_query"], "(=\\s*((\"[^\"]*\")|('[^']*')|([0-9][^,;)\\s]*)))", "= ?")
          - replace_pattern(attributes["normalized_query"], "(([rR][oO][wW])?\\((((\"[^\"]*\")|('[^']*')|([0-9][^,;)\\s]*))[,\\s]*)+)", "(?")

Describe the solution you'd like

Normalizing the query in an efficient, consistent, and guaranteed manner requires a parser that understands SQL.

With a converter that knows how to interpret these logs, we could simplify (and probably speed up) this process. Instead of the several complex regex statements above, we could do something like

  transform:
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - merge_maps(attributes, ParseMySQL(body), "upsert")

and the ParseMySQL converter would return a map like:

{
    "time": "40905  6:33:11",
    "user": "dbuser[dbname]",
    "host": "hostname [1.2.3.4]",
    "query_time": "0.116250",
    "lock_time": "0.000035",
    "rows_sent": "0",
    "rows_examined": "20878",
    "normalized_query":"<seen formatted normalized query>"
}
SET
    timestamp = ?;

SELECT
    UGI.dept_id,
    UGI.dept_name,
    COUNT(UGI.id) AS totalInnovators,
    SUM(UGI.totalIdeas) AS totalIdeas,
    SUM(UGI.points) AS totalPoints
FROM
    (
        SELECT
            UPI.id,
            UPI.dept_id,
            DI.name as dept_name,
            UPI.points,
            COUNT(UI.idea_id) AS totalIdeas
        FROM
            user_info AS UPI
            INNER JOIN dept_info AS DI ON UPI.dept_id = DI.id
            LEFT JOIN user_idea AS UI ON UPI.id = UI.user_id
        GROUP BY
            UPI.id
    ) AS UGI
GROUP BY
    UGI.dept_id
HAVING
    (
        totalPoints > ? # notice this is now a ? instead of a unique value
        AND totalInnovators > ? # notice this is now a ? instead of a unique value
    )
ORDER BY
    SUM(UGI.points) / COUNT(UGI.id) DESC
LIMIT
    ? # notice this is now a ? instead of a unique value

Describe alternatives you've considered

Regex, lots of regex.

Additional context

Honeycomb has a sql normalizer that we use today that we could use in OTTL to do the normalization: https://github.com/honeycombio/mysqltools/tree/main

@TylerHelmuth TylerHelmuth added enhancement New feature or request discussion needed Community discussion needed pkg/ottl labels Aug 12, 2024
Copy link
Contributor

Pinging code owners for pkg/ottl: @TylerHelmuth @kentquirk @bogdandrutu @evan-bradley. See Adding Labels via Comments if you do not have permissions to add labels yourself.

Copy link
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@kentquirk
Copy link
Member

This is not a problem that can be solved generally by regex, as the SQL language is both complex and recursively defined, so a tool like this would be helpful.

But we should recognize that MySql is not the only flavor of SQL, and name the function generically, and probably give the function a parameter that could specify a SQL variant (even if only one is supported initially).

Copy link
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

Copy link
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion needed Community discussion needed enhancement New feature or request pkg/ottl
Projects
None yet
Development

No branches or pull requests

2 participants