DFC Connector Data Capture Feature & Store #24

Open
RaggedStaff opened this issue Jun 17, 2024 · 8 comments

@RaggedStaff

Discussed in https://github.com/orgs/datafoodconsortium/discussions/30

Originally posted by jgaehring March 25, 2024

Objective

Enable remote data capture functionality in the DFC connector, as requested by
the FDC Governance Circle, so that data may be captured within the DFC
Network and relayed to an independent triple store that will act as a Data
Commons.

Proposal

While we discussed the possible necessity of incorporating the data capture
mechanism into the code generator's templates, I've realized that may not ever
be necessary or even desirable. In all three implementations of the connector,
the core request/response logic can be found within the main Connector class
or its modules (such as the JsonldStream importer and exporter in the case of
the TypeScript implementation), which are all contained within the static code
directories and not produced through code generation. Because these
import/export methods are indirectly invoked by all semantic object subclasses'
getters, setters, adders and removers, it would be the ideal place to inject
optional hooks that could extend the import/export behavior.

A good model for this kind of extension might be the axios library's
interceptor pattern:

// Add a request interceptor
axios.interceptors.request.use(function (config) {
    // Do something before request is sent
    return config;
  }, function (error) {
    // Do something with request error
    return Promise.reject(error);
  });

// Add a response interceptor
axios.interceptors.response.use(function (response) {
    // Any status code that lies within the range of 2xx causes this function to trigger
    // Do something with response data
    return response;
  }, function (error) {
    // Any status code that falls outside the range of 2xx causes this function to trigger
    // Do something with response error
    return Promise.reject(error);
  });

Internally, the axios interceptors are stored in an InterceptorManager, with a
separate instance for each of the request and response cycles. The
interceptors can also be "ejected":

const myInterceptor = axios.interceptors.request.use(function () {/*...*/});
axios.interceptors.request.eject(myInterceptor);

Some consideration should be given to the API for the connector and the
corresponding getters and setters that will actually invoke the capturing logic.
The getters and setters can differ in behavior, with some being synchronous and
others asynchronous, while the capturing behavior will always be asynchronous.
But we could generally take an approach such as the following:

const loggerRef = connector.interceptors.import.use(logger);

Where logger could be a function (or two functions, to handle both success and
error results), or an instance of a Logger class with a wider variety of
configurable options, or both.
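
To make the shape of that API more concrete, here is a minimal sketch of an interceptor manager in TypeScript, loosely modeled on the axios pattern above. The names InterceptorManager, Handler, SketchConnector and run() are placeholders for illustration and do not exist in the connector today, and the import() body is only a stand-in for the real import logic.

```typescript
// Hypothetical sketch: none of these names exist in the connector today.
type Handler<T> = (payload: T) => void | Promise<void>;

class InterceptorManager<T> {
  // Ejected slots are set to null so that previously returned ids stay valid.
  private handlers: Array<Handler<T> | null> = [];

  // Register a handler and return an id that can later be passed to eject().
  use(handler: Handler<T>): number {
    this.handlers.push(handler);
    return this.handlers.length - 1;
  }

  // Remove a previously registered handler; unknown ids are ignored.
  eject(id: number): void {
    if (id >= 0 && id < this.handlers.length) {
      this.handlers[id] = null;
    }
  }

  // Call every active handler with the payload. Synchronous and asynchronous
  // handlers are both accepted, and a failing handler never rejects the
  // returned promise, so capture errors cannot break an import or export.
  run(payload: T): Promise<void> {
    const pending: Array<Promise<void>> = [];
    for (const handler of this.handlers) {
      if (handler === null) continue;
      try {
        pending.push(Promise.resolve(handler(payload)).catch(() => undefined));
      } catch {
        // A synchronous throw is swallowed for the same reason.
      }
    }
    return Promise.all(pending).then(() => undefined);
  }
}

class SketchConnector {
  readonly interceptors = {
    import: new InterceptorManager<string>(),
    export: new InterceptorManager<string>(),
  };

  // Stand-in for the real import: passes the raw JSON-LD to the interceptors.
  async import(jsonld: string): Promise<string> {
    await this.interceptors.import.run(jsonld);
    return jsonld;
  }
}

const connector = new SketchConnector();
const captured: string[] = [];
const logger = (jsonld: string): void => {
  captured.push(jsonld);
};
const loggerRef = connector.interceptors.import.use(logger);
```

In this sketch the id returned by use() plays the same role as the axios interceptor reference, so connector.interceptors.import.eject(loggerRef) would stop the logger from receiving further imports.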

As for the triple store, where logs will be sent, there are a lot of options. To
begin, the DFC prototype could be used for running integration tests in the
local development environment. If that achieves most of the desired outcomes, a
fork of it could be prepared for deployment. A more customized solution could be
built with SemApps, but might require more development. Another complicating
factor is the degree to which OFN's stakeholders would like this store to be
integrated with OFN's core software and regional server instances.

The right choice among these options depends upon how tightly OFN's
stakeholders wish the store to be integrated with the core OFN software and
server instances, as opposed to a totally independent server that core OFN
knows nothing about. Until those decisions are made in a more detailed
conversation, it will be difficult to judge with much accuracy the cost and time
required to stand up a maintainable instance of the triple store. In any
case, however, the proposed logging interceptor should work just the same, since
the only parameter it will strictly require should be a location to send the
logs to. Different logging interceptors can be adapted to different behaviors as
desired, and even combined, since the pattern allows multiple interceptors to be
registered at once. The flexibility of the interceptor pattern may in fact allow
for more incremental development of the triple store and how it is deployed to
production.
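
As a rough illustration of a logging interceptor whose only required parameter is a destination, here is a small TypeScript sketch. The names Send and createCaptureLogger and the example.org URLs are assumptions made for the example; the transport is injected so that a mock triple store can stand in for a real HTTP POST.

```typescript
// Hypothetical sketch: Send and createCaptureLogger are illustrative names.
// The transport is injected so tests can swap in a mock triple store.
type Send = (url: string, jsonld: string) => Promise<void>;

function createCaptureLogger(url: string, send: Send) {
  return async (jsonld: string): Promise<void> => {
    try {
      await send(url, jsonld);
    } catch (error) {
      // Capture is best effort: report the failure, never rethrow.
      console.warn(`Data capture to ${url} failed:`, error);
    }
  };
}

// A mock transport that records what would have been sent.
const sent: Array<{ url: string; jsonld: string }> = [];
const mockSend: Send = async (url, jsonld) => {
  sent.push({ url, jsonld });
};

// Two loggers with different (made-up) destinations can be combined freely.
const loggers = [
  createCaptureLogger("https://commons.example.org/inbox", mockSend),
  createCaptureLogger("https://staging.example.org/inbox", mockSend),
];
for (const log of loggers) {
  void log('{"@id": "https://example.org/orders/1"}');
}
```

Because each logger only closes over its own URL, the store behind that URL can change from a local container to a staging or production server without touching the connector.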

Requirements

  • Implement the import.use() and export.use() methods, a general interface
    for the function or Interceptor class they would each accept as arguments,
    and the implementations of those functions or classes as the actual
    ImportLogger and ExportLogger. Obviously, the names for all these classes
    and methods can be decided upon later. These will first be implemented in
    TypeScript.
  • Write appropriate unit tests for these interceptors and the data capture
    implementation(s), extending the existing TypeScript connector tests as
    appropriate. These will only mock the intended triple store behavior.
  • Pending further discussion, develop integration tests that can run against a
    local instance of a triple store, possibly based on the DFC prototype or
    SemApps, that can receive and store JSON-LD logs. Preferably this local
    instance will be containerized so it's easy to replicate on a staging server,
    or perhaps as the basis for a store that can eventually go into production for
    the data commons.

Milestones

  1. TypeScript connector's import.use() and export.use() methods, interfaces,
    classes, and corresponding unit tests.
  2. Local triple store and integration tests of the connector's interceptor API
    and the data capture interceptors specifically.
  3. Staging server and/or production deployment of the triple store.

Estimated Time and Cost

The import and export halves of milestone 1 will each require roughly 15 hours
of development time, and their order is more or less interchangeable. Depending
on decisions about how best to develop, test, and deploy the triple store,
milestone 2 could vary widely, potentially as little as 6-12 development hours,
or over 30 dev hours, if more customization is required beyond simply running an
off-the-shelf solution. Similarly, milestone 3 is difficult to assess at this
time, but would require at least the same number of dev hours, possibly more.

#  Description                     Dev Hrs  Est. Cost       Duration
1  Connector features              24 - 30  $2520 - $3150   1 - 2 wks
2  Integration testing             6 - 30   $630 - $3150    1 - 3 wks
3  Staging/production deployments  12 - 60  $1260 - $6300   2 - 6 wks

The contingencies in milestones 2 and 3 make this a very imprecise estimate,
costing anywhere from $4,410 to $12,600 and taking 1 to 3 months to
complete. We can speak in further detail on the expectations for the triple
store as we go ahead with the connector features, or wait until a clearer set of
requirements can be determined for all 3 milestones.
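
For anyone who wants to check the totals, the arithmetic can be reproduced with a few lines of TypeScript. The flat rate of $105 per hour is not stated anywhere above; it is inferred from the table ($2520 / 24 hours) and should be treated as an assumption.

```typescript
// Assumption: a flat rate of $105/hour, inferred from $2520 / 24 hours.
const HOURLY_RATE_USD = 105;

// Hour ranges copied from the estimate table.
const milestones = [
  { name: "Connector features", minHours: 24, maxHours: 30 },
  { name: "Integration testing", minHours: 6, maxHours: 30 },
  { name: "Staging/production deployments", minHours: 12, maxHours: 60 },
];

const totals = milestones.reduce(
  (sum, m) => ({
    minHours: sum.minHours + m.minHours,
    maxHours: sum.maxHours + m.maxHours,
  }),
  { minHours: 0, maxHours: 0 },
);

const minCost = totals.minHours * HOURLY_RATE_USD; // 42 hours  -> $4,410
const maxCost = totals.maxHours * HOURLY_RATE_USD; // 120 hours -> $12,600
console.log(`$${minCost} to $${maxCost}`);
```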

Further discussions have highlighted that the Semantizer libraries are being upgraded to support mixins. This is a dependency for this work: the Data Capture functionality will be included as a mixin.

@RaggedStaff RaggedStaff added the enhancement New feature or request label Jun 17, 2024
@RaggedStaff RaggedStaff transferred this issue from datafoodconsortium/.github Jul 3, 2024
@jgaehring

I think to move forward there are two main blockers for now.

  1. I need to consult with Maxime to understand better how mixins work in assemblee-virtuelle/semantizer-typescript, so that something compatible can be included into the TS connector.
  2. An understanding of the production requirements for FDC Governance Circle, such as, where the triple store should be hosted, to what extent should it be integrated with the OFN UK instance, etc. This will help to narrow down the estimates for time and cost on Milestones 2 & 3 listed in the table above.

@RaggedStaff

RaggedStaff commented Jul 16, 2024

2. An understanding of the production requirements for FDC Governance Circle, such as, where the triple store should be hosted, to what extent should it be integrated with the OFN UK instance, etc. This will help to narrow down the estimates for time and cost on Milestones 2 & 3 listed in the table above.

@jgaehring The triple store will be separate from all participating platforms. I'd prefer to stand something up on Infomaniak's Jelastic Cloud instance, which we're using to host the Shopify apps.

At this stage we aren't trying to integrate with anything... just (securely) store the data somewhere, so it can be managed, by the members, as their data commons in the future.

Let's have a quick chat about what might work... are you around tomorrow? I'm free 1-2pm or 4-4:30 (UK).

On the other blocker - @lecoqlibre is on vacation this week, but I think he's around next week... maybe we should all talk together then?

@RachL RachL moved this from Todo to In Progress in Tech meeting board Jul 29, 2024
@jgaehring

  1. I need to consult with Maxime to understand better how mixins work in assemblee-virtuelle/semantizer-typescript, so that something compatible can be included into the TS connector.

For my own sake, I'm just noting the snapshot of the semantizer's mixin implementation as it stands right now, although it is considered unstable:

https://github.com/assemblee-virtuelle/semantizer-typescript/blob/61c5ddbcde51fbc7469ac315169ac8b42a74d194/src/test/src/index.ts

@RaggedStaff

@jgaehring Notes from our call:

We agreed to modify the export function(s) in the static area of connector-codegen (for TS, Ruby & PHP) to check a parameter and, if TRUE, POST the exported JSON-LD to our triple store.

We'll start with the PHP version (Big Barn), then TS, then Ruby.

@jgaehring

As discussed in today's tech call, this is the relevant part of the TypeScript codegen implementation (pending merge of PR #20) where the call to semantizer's .export() method will be wrapped with the Data Capture logic, which basically just needs to call .export() again with the new destination:

public async export(objects: Array<Semanticable>, options?: IConnectorExportOptions): Promise<string> {
  const exporter = options?.exporter ? options.exporter : this.exporter;
  return exporter.export(objects, {
    inputContext: options?.inputContext,
    outputContext: options?.outputContext
  });
}

That "wrapper" can be moved lower down the stack to the internals of the semantizer once it reaches its next stable release, and that later change shouldn't require breaking changes to either the connector's or the semantizer's APIs. Costs prohibit upgrading the semantizer in the near future regardless, so I believe the data capture feature can be implemented with the existing alpha version without incurring significant tech debt once the stable release becomes available.
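
To illustrate the kind of wrapper discussed here, a minimal TypeScript sketch follows. The types are simplified stand-ins rather than the real connector or semantizer interfaces, CapturingConnector and CaptureOptions are hypothetical names, and for brevity the sketch forwards the already exported string instead of calling .export() a second time.

```typescript
// Hypothetical sketch: these types are simplified stand-ins, not the real
// connector or semantizer interfaces.
type Semanticable = object;

interface Exporter {
  export(objects: Semanticable[]): Promise<string>;
}

interface CaptureOptions {
  enabled: boolean;
  url: string;
  send: (url: string, jsonld: string) => Promise<void>;
}

class CapturingConnector {
  private readonly exporter: Exporter;
  private readonly capture?: CaptureOptions;

  constructor(exporter: Exporter, capture?: CaptureOptions) {
    this.exporter = exporter;
    this.capture = capture;
  }

  async export(objects: Semanticable[]): Promise<string> {
    const jsonld = await this.exporter.export(objects);
    const capture = this.capture;
    if (capture?.enabled) {
      // Fire and forget: a failed capture must not fail the export itself.
      capture.send(capture.url, jsonld).catch(() => undefined);
    }
    return jsonld;
  }
}

// Usage with a fake exporter and a mock transport (the URL is made up).
const sentDocs: string[] = [];
const demo = new CapturingConnector(
  { export: async (objects) => JSON.stringify(objects) },
  {
    enabled: true,
    url: "https://commons.example.org/inbox",
    send: async (_url, jsonld) => {
      sentDocs.push(jsonld);
    },
  },
);
```

The caller still receives exactly what the exporter produced, which is why moving the wrapper deeper into the stack later should not need to change the public signature.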

@lecoqlibre

lecoqlibre commented Oct 1, 2024

What do you think about using the observer pattern to decouple the data-capture feature from the connector itself?

We would have a method to register a new observer for the export method like connector.registerCallbackForExport(callback: (exported: string) => void).

Each time the connector.export() method is called, every registered callback will be triggered. This mechanism can be used for any other export-related feature.

In the client code you want to capture data from, you will just have to register a handler of your choice (which can be implemented in a separate package, even a DFC-related one such as @datafoodconsortium/connector-data-capture, if you want).

You can also export a pre-configured Connector class from this package so your clients can just import it without configuring it:

import { Connector } from "@datafoodconsortium/connector-data-capture";

const connector = new Connector();

connector.export(...); // this will trigger the data-capture handler
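
A minimal sketch of that registration mechanism might look like the following. ObservableConnector and createCapturingConnector are hypothetical names, JSON.stringify stands in for the real JSON-LD serialization, and the factory only approximates what a separate data-capture package could export.

```typescript
// Hypothetical sketch: ObservableConnector and createCapturingConnector are
// illustrative names, and JSON.stringify stands in for JSON-LD serialization.
type ExportCallback = (exported: string) => void;

class ObservableConnector {
  private exportCallbacks: ExportCallback[] = [];

  registerCallbackForExport(callback: ExportCallback): void {
    this.exportCallbacks.push(callback);
  }

  async export(objects: object[]): Promise<string> {
    const exported = JSON.stringify(objects);
    for (const callback of this.exportCallbacks) {
      try {
        callback(exported);
      } catch {
        // An observer failure should not break the export.
      }
    }
    return exported;
  }
}

// What a separate data-capture package might export: a factory that returns
// a connector with the capture handler already registered.
function createCapturingConnector(handler: ExportCallback): ObservableConnector {
  const connector = new ObservableConnector();
  connector.registerCallbackForExport(handler);
  return connector;
}

const received: string[] = [];
const preconfigured = createCapturingConnector((exported) => received.push(exported));
void preconfigured.export([{ name: "carrot" }]);
```

The connector itself knows nothing about data capture here; it only knows that it has callbacks to notify, which is the decoupling the observer pattern is meant to provide.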

@jgaehring @RaggedStaff

@RaggedStaff

@jgaehring queried whether it's best to use Composer or PHAR for unit testing. @lecoqlibre will confirm here.

@jgaehring

So my working assumption has been that an instance of the DataCapture plugin/mixin/whatever will be hardcoded into an early version of the 2.0 release of the Connector itself – perhaps only an alpha or RC version like 2.0.0-rc.1 or 2.0.0-alpha.1 – along with some hardcoded references to some magic environment variables. That way, the only thing early adopters of the library need to do is update the version to 2.0.0-rc.1 with their relevant package manager,

// composer.json (for PHP)
{
  "require": {
    "datafoodconsortium/connector": "^v2.0.0-rc.1"
  }
}

then add something like this to their .env file:

# .env file
EXPERIMENTAL_DATA_CAPTURE_ENABLED=true
EXPERIMENTAL_DATA_CAPTURE_EXPORT_URL=https://api.example.com/json-ld/

and that's it, a very minimal contract. To activate the data capture functionality, consumers only have to modify configuration files without the need to update their application code, which is the principal objective for this pre-release API.
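
One detail worth keeping in mind with this contract is that values read from a .env file arrive as strings, so a literal "false" is a non-empty string and therefore truthy in both PHP and JavaScript. Here is a small TypeScript sketch of parsing the two variables explicitly; readDataCaptureConfig and the accepted "on" values are assumptions for illustration, not part of any existing connector.

```typescript
// Hypothetical sketch: readDataCaptureConfig is an illustrative helper.
type Env = Record<string, string | undefined>;

interface DataCaptureConfig {
  enabled: boolean;
  exportUrl?: string;
}

function readDataCaptureConfig(env: Env): DataCaptureConfig {
  const flag = (env["EXPERIMENTAL_DATA_CAPTURE_ENABLED"] ?? "").trim().toLowerCase();
  const exportUrl = (env["EXPERIMENTAL_DATA_CAPTURE_EXPORT_URL"] ?? "").trim();
  // Only explicit "on" values enable capture, and only when a URL is present.
  const enabled = ["1", "true", "yes", "on"].includes(flag) && exportUrl !== "";
  return enabled ? { enabled, exportUrl } : { enabled: false };
}

// In Node this would typically be called as readDataCaptureConfig(process.env).
const exampleConfig = readDataCaptureConfig({
  EXPERIMENTAL_DATA_CAPTURE_ENABLED: "true",
  EXPERIMENTAL_DATA_CAPTURE_EXPORT_URL: "https://api.example.com/json-ld/",
});
```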

Although there is a simpler path to achieve that, I have taken your recommendation, @lecoqlibre, of employing the observer pattern for this plugin/mixin/whatchamacallit. The Connector itself implements the subject or observable, while a separate DataCapture plugin fills the role of the subscriber or observer. Ultimately, the idea is that the consumer will instantiate the Connector and DataCapture objects separately, then just call .attach() or .subscribe() on the former, passing in the latter as its first parameter. Here's what that would look like in PHP:

$connector = new Connector();
$observer = new DataCapture("https://api.example.com/json-ld/");
$connector->attach($observer);

But for the aforementioned pre-release version, in order to eliminate the need for consumers to modify their application code, the DataCapture plugin will instead be instantiated automatically within the Connector's constructor.

Here's what that looks like in my current PHP implementation:

/**
 * If DataCapture is enabled in the environment variables, attach the
 * observer automatically. This should eventually be externalized rather
 * than calling it here in the constructor.
 */
if ($_ENV["EXPERIMENTAL_DATA_CAPTURE_ENABLED"]) {
    $observer = new DataCapture($_ENV["EXPERIMENTAL_DATA_CAPTURE_EXPORT_URL"]);
    $this->attach($observer);
}

That entire if { ... } block will be deleted for the general release of 2.0.0, so environment variables alone will not be sufficient to activate data capture for anyone who consumes the stable version. Instead, they'll need to import DataCapture themselves, either as a separate module or independent package altogether. Then they can make their own determination whether or not to use environment variables and what to name them. So after instantiating the Connector, they could still do something like,

$ofnCapture = new DataCapture($_ENV["OFN_CAPTURE_URL"]);
$connector->attach($ofnCapture);

but that's up to them. A separate option parameter could be included in the Connector's constructor, but as of now, I'm not doing that for the PHP version, since .attach() seems to be more idiomatic and is automatically covered by the standard library's SplSubject interface.

The Connector will still need to implement the SplSubject interface, but I've elected to overload the .attach() method with an optional string parameter, $event, which defaults to "*" as a wildcard for all methods. That way the Connector's implementation of SplSubject neither has to be limited to the .export() method, nor does it necessarily get applied to all methods no matter what:

public function attach(\SplObserver $observer, string $event = self::EVENT_WILDCARD): void {
    $isValid = $this->initEventGroup($event);
    if ($isValid) $this->observers[$event][] = $observer;
}

The exact methods that will support the attachment of observers are declared explicitly and exposed as public constants:

/**
 * Observers that will be notified whenever certain methods are called.
 * Those methods are treated as events and explicitly made visible as public
 * constants below, so that observers may reference them.
 */
private $observers = [];
public const EVENT_WILDCARD = "*";
public const EVENT_EXPORT = "export";
public const EVENT_IMPORT = "import";

The private $observers member is an array of object stores, one for each method, so that each can keep track of its own observers. Currently, I've only added support for .import() and .export(), but if additional public constants are specified using the same "EVENT_" prefix, they'll be added automatically by the constructor via PHP reflection:

// Create an empty store for each supported method/event declared above.
$reflector = new \ReflectionClass($this);
foreach ($reflector->getConstants() as $key => $c) {
    if (str_starts_with($key, "EVENT_")) {
        $this->observers[$c] = new \SplObjectStorage();
    }
}

Any observer can then limit its event scope either by passing the correct string, accessing one of the constants, or omitting the parameter entirely so it defaults to all events:

// Class constants are read with the :: operator rather than ->.
// These 2 are equivalent:
$connector->attach($observer, "export");
$connector->attach($observer, Connector::EVENT_EXPORT);

// These 3 are equivalent:
$connector->attach($observer);
$connector->attach($observer, "*");
$connector->attach($observer, Connector::EVENT_WILDCARD);

To my mind, that's adequately decoupled from the data capture functionality, or any future plugins that someone wishes to develop.
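
Since the TypeScript connector is next in line after PHP, here is a rough sketch of how the same event-scoped attachment might translate. EventedConnector and ConnectorObserver are hypothetical names, JSON.stringify stands in for the real export, and unlike the PHP version the event groups are declared by hand rather than discovered through reflection.

```typescript
// Hypothetical sketch mirroring the PHP design above in TypeScript.
type ConnectorObserver = (event: string, payload: string) => void;

class EventedConnector {
  static readonly EVENT_WILDCARD = "*";
  static readonly EVENT_EXPORT = "export";
  static readonly EVENT_IMPORT = "import";

  // One observer list per supported event, plus the wildcard group.
  private observers: Record<string, ConnectorObserver[]> = {
    [EventedConnector.EVENT_WILDCARD]: [],
    [EventedConnector.EVENT_EXPORT]: [],
    [EventedConnector.EVENT_IMPORT]: [],
  };

  // Returns false (and attaches nothing) for unsupported event names.
  attach(observer: ConnectorObserver, event: string = EventedConnector.EVENT_WILDCARD): boolean {
    const group = this.observers[event];
    if (group === undefined) return false;
    if (!group.includes(observer)) group.push(observer);
    return true;
  }

  async export(objects: object[]): Promise<string> {
    const exported = JSON.stringify(objects); // stand-in for the JSON-LD export
    this.notify(EventedConnector.EVENT_EXPORT, exported);
    return exported;
  }

  private notify(event: string, payload: string): void {
    const targets = [
      ...(this.observers[EventedConnector.EVENT_WILDCARD] ?? []),
      ...(this.observers[event] ?? []),
    ];
    // An observer attached to both groups is only notified once.
    targets
      .filter((observer, index) => targets.indexOf(observer) === index)
      .forEach((observer) => observer(event, payload));
  }
}

const evented = new EventedConnector();
const calls: string[] = [];
evented.attach((event) => calls.push(`all:${event}`));
evented.attach((event) => calls.push(`export:${event}`), EventedConnector.EVENT_EXPORT);
evented.attach((event) => calls.push(`import:${event}`), "import");
const unsupported = evented.attach(() => undefined, "delete");
void evented.export([{ name: "carrot" }]);
```

Only the wildcard and export observers are notified by the export() call in this usage, and the attempt to attach to the undeclared "delete" event is rejected.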

The Burning Question 🔥 🤔

The burning question, which I raised with @RaggedStaff earlier today, is this: Do we want to roll that out with a temporary pre-release config option like I described above, where the environment variables are hardcoded and, if detected, the observer will be attached automatically in the constructor? Or do we jump straight to the intended stable API, where none of that's hardcoded, but the library's consumers must commit some minor changes to their application code in order to get the data capture plugin to work?
