Feature request: Add support for Schema ingestion in Java - Avro, JSON-schema, etc. #11947

Open
shirshanka opened this issue Nov 25, 2024 · 0 comments

Describe the bug
The DataHub Java SDK lacks schema ingestion capabilities for Avro and JSON Schema, while the Python SDK supports both. Java does have Protobuf support in a standalone module, but it needs equivalent capabilities for Avro and JSON Schema so that the two SDKs stay consistent.

To Reproduce

  1. Create a complex nested Avro or JSON Schema with:
    • Nested record types
    • Arrays of complex types
    • Maps with complex value types
    • Union types (for Avro)
  2. Attempt to generate a DataHub schema using the Java SDK
  3. Observe that no built-in conversion utilities exist, unlike the Python SDK's support for Avro (avro_schema_to_mce_fields()) and JSON Schema

Expected behavior
The Java SDK should provide the same schema ingestion capabilities as the Python SDK:

  1. Avro and JSON Schema conversion utilities that match the Python SDK's capabilities
  2. Automatic handling of nested types and complex schema structures
  3. Parity with the Python SDK's schema handling features, while keeping the existing Protobuf support in the standalone module
  4. Helper methods to extract schema metadata (e.g., field descriptions, annotations, meta_mapping)
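
To make item 2 concrete, here is a minimal, dependency-free sketch of the kind of recursive walk such a utility would need. Everything in it is hypothetical: `SchemaFlattenSketch`, `Node`, `Kind`, and the plain dotted field paths are illustration only, not an existing DataHub or Avro API, and a real implementation would presumably parse actual Avro or JSON Schema documents and emit DataHub schema fields instead of strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: flattens a nested schema into "path: type" lines.
// None of these types exist in the DataHub Java SDK or in Apache Avro.
final class SchemaFlattenSketch {

    enum Kind { RECORD, ARRAY, MAP, UNION, PRIMITIVE }

    static final class Node {
        final Kind kind;
        final String typeName;                                   // primitive or record name
        final Map<String, Node> fields = new LinkedHashMap<>();  // RECORD only, in declaration order
        final List<Node> children = new ArrayList<>();           // ARRAY/MAP: one entry; UNION: branches

        private Node(Kind kind, String typeName, Node... children) {
            this.kind = kind;
            this.typeName = typeName;
            this.children.addAll(Arrays.asList(children));
        }

        static Node primitive(String name) { return new Node(Kind.PRIMITIVE, name); }
        static Node record(String name) { return new Node(Kind.RECORD, name); }
        static Node array(Node items) { return new Node(Kind.ARRAY, "array", items); }
        static Node map(Node values) { return new Node(Kind.MAP, "map", values); }
        static Node union(Node... branches) { return new Node(Kind.UNION, "union", branches); }

        // Adds a field to a record node and returns the node for chaining.
        Node field(String name, Node type) {
            fields.put(name, type);
            return this;
        }
    }

    // Returns one "path: type" entry per field, visiting nested records depth-first.
    static List<String> flatten(Node root) {
        List<String> out = new ArrayList<>();
        walk(root, "", out);
        return out;
    }

    private static void walk(Node node, String path, List<String> out) {
        switch (node.kind) {
            case RECORD:
                for (Map.Entry<String, Node> e : node.fields.entrySet()) {
                    String child = path.isEmpty() ? e.getKey() : path + "." + e.getKey();
                    out.add(child + ": " + describe(e.getValue()));
                    walk(e.getValue(), child, out);
                }
                break;
            case ARRAY:
            case MAP:
            case UNION:
                // Containers add no path segment here; descend into element, value, or branch types.
                for (Node child : node.children) {
                    walk(child, path, out);
                }
                break;
            default:
                break; // primitives have no nested fields
        }
    }

    private static String describe(Node node) {
        switch (node.kind) {
            case ARRAY: return "array<" + describe(node.children.get(0)) + ">";
            case MAP:   return "map<" + describe(node.children.get(0)) + ">";
            case UNION: {
                List<String> parts = new ArrayList<>();
                for (Node branch : node.children) {
                    parts.add(describe(branch));
                }
                return "union<" + String.join(" | ", parts) + ">";
            }
            default:    return node.typeName;
        }
    }

    // Sample covering the cases from "To Reproduce": nested record, array, map, union.
    static Node sample() {
        Node address = Node.record("Address").field("street", Node.primitive("string"));
        Node order = Node.record("Order").field("total", Node.primitive("double"));
        Node attr = Node.record("Attr").field("value", Node.primitive("string"));
        return Node.record("User")
                .field("id", Node.primitive("string"))
                .field("address", Node.union(Node.primitive("null"), address))
                .field("orders", Node.array(order))
                .field("attributes", Node.map(attr));
    }

    public static void main(String[] args) {
        for (String line : flatten(sample())) {
            System.out.println(line);
        }
    }
}
```

The sketch deliberately ignores nullability, documentation strings, and custom properties; those are the metadata that item 4 asks helper methods to expose.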

Additional context

  • Python SDK currently handles both Avro and JSON Schema through datahub.ingestion.extractor.schema_util and datahub.ingestion.extractor.json_schema_util
  • Java SDK has Protobuf support but in a standalone module
  • Neither SDK currently supports Thrift schema ingestion
  • Manual schema mapping for Avro and JSON Schema in Java is error-prone and time-consuming
  • This feature would provide a consistent experience across both SDKs
  • Consider implementing schema inference capabilities similar to those in the Python SDK's existing schema utilities
  • Many organizations use a mix of Python and Java, so having consistent support across both SDKs is crucial