Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PIP-193] [feature][connectors] Add support for a transform Function in Sinks #16740

Merged
merged 26 commits into from
Aug 31, 2022

Conversation

cbornet
Copy link
Contributor

@cbornet cbornet commented Jul 22, 2022

Motivation

For PIP 193 : Sink preprocessing Function (#16739).
Implements changes to support the Sink transform function except the LocalRunner support.

Modifications

See #16739 Implementation section

Verifying this change

  • Make sure that the change passes the CI checks.

This change added tests and can be verified as follows:

TODO

Does this pull request potentially affect one of the following parts:

  • The public API: yes
  • The rest endpoints: yes
  • The admin cli options: yes

Documentation

Check the box below or label this PR directly.

Need to update docs?

  • doc-required
    (Your PR needs to update docs and you will update later)

  • doc-not-needed
    (Please explain why)

  • doc
    (Your PR contains doc changes)

  • doc-complete
    (Docs have been already added)

@cbornet cbornet marked this pull request as draft July 22, 2022 10:07
@cbornet cbornet marked this pull request as ready for review July 26, 2022 18:32
Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -384,7 +386,12 @@ private void sendOutputMessage(Record srcRecord, Object output) throws Exception
try {
this.sink.write(sinkRecord);
} catch (Exception e) {
log.info("Encountered exception in sink write: ", e);
if (e instanceof ClassCastException && functionClassLoader != componentClassLoader) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do this check at sink create or start?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can't because each record can be different

in Java even if you declare a class with some Generic types, you can always force the system to return something wrong.
so this check must stay here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also a question about the type of input for user-defined functions. In the current implementation, the consumer in the sink sets the schema according to the generic definition of the sink. usually, we use GenericObject, If the user-defined function does not match the generic, the current logic will go to the following logic instead of being sent to the sink, is this the expected behavior?

if (result.getUserException() != null) {
Exception t = result.getUserException();
log.warn("Encountered exception when processing message {}",
srcRecord, t);
stats.incrUserExceptions(t);
srcRecord.fail();
} else {

Maybe we need to validate the input parameter types of the user-defined function at the sink start?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you have a Function with the Sink we use the input type of the Function to configure the PulsarSource instead of the Sink type. See here.
Does it answer your concerns ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry. I miss this.

However, if the user wants to use it, the input parameter type may need to be carefully defined. In general, users can only use GenericObject, which triggers the registration of auto_consumer type schema, because only then can multiple topic inputs be processed.

Am I right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use the same input type as if you had a regular Sink or Function. If you want to consume any type of Schema then yes, you should use GenericObject or byte[]. If you know that the Schema will always be String, you can use String as input.

@eolivelli eolivelli changed the title [feature][connectors] Add support for a preprocess Function in Sinks [PIP193] [feature][connectors] Add support for a preprocess Function in Sinks Aug 24, 2022
}
AbstractSinkRecord<?> sinkRecord;
if (output instanceof Record) {
sinkRecord = new OutputRecordSinkRecord<>(srcRecord, (Record) output);
Record record = (Record) output;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about moving this block to a method ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

finalSchema = schema;
}

sinkRecord = new OutputRecordSinkRecord<>(srcRecord, record) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about creating a named class ? it will help Sink developers when they dump the "Record" object (and especially us when we will debug problems about this feature :-) )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the code to use distinct constructors. It does not introduce a new class though.

@@ -881,4 +963,98 @@ private void setupOutput(ContextImpl contextImpl) throws Exception {
Thread.currentThread().setContextClassLoader(this.instanceClassLoader);
}
}

private static <T> Schema<T> getSinkSchema(Record<?> record, Class<T> clazz) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could move these 3 methods to some other class that does similar things, like TopicSchema.
but I have no strong opinion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're doing these methods bc we don't have a topic. So it feels weird to put these in a class named TopicSchema

import org.apache.pulsar.common.protocol.schema.SchemaVersion;
import org.apache.pulsar.common.schema.SchemaInfo;

class SinkSchemaInfoProvider implements SchemaInfoProvider {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about adding a javadoc here ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good

class SinkSchemaInfoProvider implements SchemaInfoProvider {

AtomicLong latestVersion = new AtomicLong(0);
ConcurrentHashMap<SchemaVersion, SchemaInfo> schemaInfos = new ConcurrentHashMap<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is not very clear to me (by reading the code) who is populating these maps.
maybe we could add explicit accessors ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cut the code a little. Is it clearer now ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good

Copy link
Contributor

@nicoloboschi nicoloboschi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cbornet cbornet closed this Aug 24, 2022
@cbornet cbornet reopened this Aug 24, 2022
@cbornet cbornet requested a review from eolivelli August 25, 2022 13:36
Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@eolivelli eolivelli changed the title [PIP193] [feature][connectors] Add support for a preprocess Function in Sinks [PIP-193] [feature][connectors] Add support for a preprocess Function in Sinks Aug 26, 2022
@github-actions github-actions bot added the doc-required Your PR changes impact docs and you will update later. label Aug 26, 2022
@cbornet cbornet changed the title [PIP-193] [feature][connectors] Add support for a preprocess Function in Sinks [PIP-193] [feature][connectors] Add support for a transform Function in Sinks Aug 26, 2022
@cbornet
Copy link
Contributor Author

cbornet commented Aug 26, 2022

Following feedback, I renamed preprocess function to transform function.
cc @shibd @eolivelli @dave2wave @mattisonchao @nlu90 @nicoloboschi @HQebupt @lhotari @michaeljmarshall

@cbornet cbornet force-pushed the function-sink branch 5 times, most recently from da621e0 to 84cf65c Compare August 29, 2022 15:36
@cbornet cbornet closed this Aug 29, 2022
@cbornet cbornet reopened this Aug 29, 2022
@cbornet cbornet closed this Aug 30, 2022
@cbornet cbornet reopened this Aug 30, 2022
@eolivelli eolivelli merged commit dab0d1f into apache:master Aug 31, 2022
@cbornet cbornet deleted the function-sink branch August 31, 2022 12:35
cbornet added a commit to cbornet/pulsar that referenced this pull request Aug 31, 2022
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Sep 1, 2022
@momo-jun
Copy link
Contributor

@cbornet It will help users a lot if you can add the docs for this feature. Do you have any plans on that?

@Technoboy- Technoboy- added this to the 2.12.0 milestone Sep 23, 2022
@Technoboy- Technoboy- modified the milestones: 2.12.0, 2.11.0 Nov 14, 2022
@cbornet
Copy link
Contributor Author

cbornet commented Nov 23, 2022

Transform functions are not part of 2.11 so we need to wait for the release before doing the doc...

@momo-jun
Copy link
Contributor

@cbornet the milestone tag is still 2.11.
@Technoboy- just a double-check, is this feature going to be included in 2.11?

@Technoboy- Technoboy- modified the milestones: 2.11.0, 2.12.0 Dec 1, 2022
@Technoboy-
Copy link
Contributor

@cbornet the milestone tag is still 2.11. @Technoboy- just a double-check, is this feature going to be included in 2.11?

Yes, not included in 2.11.

@momo-jun
Copy link
Contributor

Transform functions are not part of 2.11 so we need to wait for the release before doing the doc...

Hi @cbornet, Since 2.11 has been released, now it's good timing to start adding the docs for this feature to the next version of docs. Feel free to share your plans and anything I can help with. Thank you.

@Anonymitaet Anonymitaet added doc-complete Your PR changes impact docs and the related docs have been already added. and removed doc-required Your PR changes impact docs and you will update later. labels May 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/connector doc-complete Your PR changes impact docs and the related docs have been already added.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants