Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] OTA Job notification isn't received #117

Open
Fienberber opened this issue Sep 20, 2024 · 22 comments
Open

[BUG] OTA Job notification isn't received #117

Fienberber opened this issue Sep 20, 2024 · 22 comments

Comments

@Fienberber
Copy link

Fienberber commented Sep 20, 2024

I'm migrating from to the esp-aws-iot [202406.01-LTS] from [202210.01-LTS].
The previous integration was working properly so I know this is not an AWS issue.

Using the OTA example, I can start OTA's using polling. Using the requestJobDocumentHandler(); by sending an event at OtaAgentEventRequestJobDocument.

But previously I was subscribed to $aws/things/+/jobs/#, which notified when OTA where available.
I don't want to use polling to start an OTA. But I don't achieve subscription to the jobs topic.

I think I'm missing a critical part of the example. Because it seems like it doesn't subscribe to the $aws/things/+/jobs/# topic.

@aggarg
Copy link
Member

aggarg commented Sep 23, 2024

I think I'm missing a critical part of the example. Because it seems like it doesn't subscribe to the $aws/things/+/jobs/# topic.

Are you saying that your subscribe calls fails? Or the subscribe succeeds and you do not get a notification?

@Fienberber
Copy link
Author

Fienberber commented Sep 25, 2024

Thank you for your response,

Actually I thought I was subscribed but since I don't get the notification I tried to add the subscription manually.
At this moment I get disconnected from the broker.
In the example ota_over_mqtt_demo I don't see where the subscription take place.
I see the request during the initialization, but afterward I don't see any pulling either.

That's why I think I'm missing something.

@AniruddhaKanhere
Copy link
Member

@Fienberber, the demo doesn't actively subscribe to the MQTT topic. See the jobs docs on AWS. Clients are sent the messages from the broker even when they are not subscribed to it.

To that end, the demos 'inject' an OTA handler which handles unsolicited publishes. Thus, any topics not handled by coreMQTT-Agent (or manager) will be passed on to the OTA handler which verifies whether the job topic matches with what is expected and processes it.

You should not need to process the OTA events yourself. The OTA task should do that on its own by calling processOTAEvents. Can you elaborate on what is happening? Is the task not starting? Are you using the unmodified version of the demo?

Thanks

@Fienberber
Copy link
Author

Hi @AniruddhaKanhere thank you for you response.
I use a modified version of the demo in order to comply with infrastructure. However I get the OTA to work while polling for new job. I could have missed something when modifying the demo...

I enabled every level of logs for MQTT in the menuconfig. But I see no publish when I create a remote job on the AWS control panel. The OTA handler doesn't get anything either.
That's why I tried to do I myself, in order to see if the remote server actually publishes something.

The task seem to running since it works properly at startup. Or am I mistaken ?

@aggarg
Copy link
Member

aggarg commented Oct 3, 2024

Can you try to add a log here to dump pxPublishInfo->pTopicName and pxPublishInfo->pPayload. This will confirm if the OTA notification is forwarded to the OTA handler.

@fjuliofontes
Copy link

Hi @Fienberber ,
Did you find the solution? I'm also suffering from the same issue. Only when calling requestJobDocumentHandler() the OTA starts.

@AniruddhaKanhere
Copy link
Member

@fjuliofontes why do you need to call the requestJobDocumentHandler function manually? These lines should take care of the same.

Can you please add a breakpoint there and check whether that line is hit and the step through to figure out why isn't that message sent/processed?

@AniruddhaKanhere
Copy link
Member

Also, can you try creating the job before running the code on the device?
I suggest, create the job on the console - wait for 30s, and then run the code on the device. That will verify whether there is some timing issue in the code in the repository.

@Fienberber and @fjuliofontes can you both please try the above?

@fjuliofontes
Copy link

@fjuliofontes why do you need to call the requestJobDocumentHandler function manually? These lines should take care of the same.

Can you please add a breakpoint there and check whether that line is hit and the step through to figure out why isn't that message sent/processed?

I'm not actually calling, it's during bootup that the function get's called. I was just pointing that it's the only way for the job to start, rebooting.

@fjuliofontes
Copy link

Also, can you try creating the job before running the code on the device? I suggest, create the job on the console - wait for 30s, and then run the code on the device. That will verify whether there is some timing issue in the code in the repository.

@Fienberber and @fjuliofontes can you both please try the above?

Thank you, as I described above the OTA will start at bootup if the job is already created.

By the way, the compiler was giving a warning about a falling-through on this line and I've added a break there.

Best regards,

@Fienberber
Copy link
Author

Hello @AniruddhaKanhere thank you for your response.
Sorry for the delay, I still have the issue. I added a task that pulls jobs in the meantime.

Also, can you try creating the job before running the code on the device? I suggest, create the job on the console - wait for 30s, and then run the code on the device. That will verify whether there is some timing issue in the code in the repository.

@Fienberber and @fjuliofontes can you both please try the above?

As @fjuliofontes said, if the Job was created before the software start, it will work. At init the OTA is polled once. The issue happen when the software is running and a job is send from AWS.

Can you try to add a log here to dump pxPublishInfo->pTopicName and pxPublishInfo->pPayload. This will confirm if the OTA notification is forwarded to the OTA handler.

I've added the log, I've already done it, and didn't saw anything anormal.

Software startup
E (7155) ota_over_mqtt: vOTAProcessMessage Topis "$aws/things/DEVICE-ID/jobs/start-next/accepted", for Payload "{"clientToken":"test","timestamp":1729752877}"

  • OTA Job creation
  • Nothing even ater 10min
  • Device REBOOT
E (7176) ota_over_mqtt: vOTAProcessMessage Topis "$aws/things/DEVICE-ID/jobs/start-next/accepted", for Payload "{"clientToken":"test","timestamp":1729754684,"execution":{"jobId":"AFR_OTA-****","status":"IN_PROGRESS","queuedAt":1729753816,"startedAt":1729754684,"lastUpdatedAt":1729754684,"versionNumber":2,"executionNumber":1,"jobDocument":{"afr_ota":{"protocols":["MQTT"],"streamname":"AFR_OTA-****","files":[{"filepath":"/device/updates","filesize":1367568,"fileid":0,"certfile":"/certificates/authcert.pem","sig-sha256-ecdsa":"MEQ...0Q=="}]}}}}"
E (14776) ota_over_mqtt: vOTAProcessMessage Topis "$aws/things/DEVICE-ID/streams/AFR_OTA-****/data/json", for Payload "{"f":0,"i":0,"l":4096,"p":"6QU...RQ=="}"
E (24727) ota_over_mqtt: vOTAProcessMessage Topis "$aws/things/DEVICE-ID/jobs/start-next/accepted", for Payload "{"clientToken":"test","timestamp":1729755015,"execution":{"jobId":"AFR_OTA-****","status":"IN_PROGRESS","statusDetails":{"self_test":"ready"},"queuedAt":1729753816,"startedAt":1729754684,"lastUpdatedAt":1729754989,"versionNumber":3,"executionNumber":1,"jobDocument":{"afr_ota":{"protocols":["MQTT"],"streamname":"AFR_OTA-****","files":[{"filepath":"/device/updates","filesize":1367568,"fileid":0,"certfile":"/certificates/authcert.pem","sig-sha256-ecdsa":"MEQ...0Q=="}]}}}}"
E (25737) ota_over_mqtt: vOTAProcessMessage Topis "$aws/things/DEVICE-ID/jobs/AFR_OTA-****/update/accepted", for Payload "{"timestamp":1729755016}"
E (30777) ota_over_mqtt: vOTAProcessMessage Topis "$aws/things/DEVICE-ID/jobs/start-next/accepted", for Payload "{"clientToken":"test","timestamp":1729755021}"

I can try other things if necessary.

@AniruddhaKanhere
Copy link
Member

@Fienberber, thank you for that debug log.

Can you try to add the following line here:

    switch( recvEventId )
    {
+       case OtaAgentEventStart:
        case OtaAgentEventRequestJobDocument:
            ESP_LOGI( TAG, "Request Job Document event Received \n" );

            requestJobDocumentHandler();
            otaAgentState = OtaAgentStateRequestingJob;
            break;

I think that would fix the issue as the OTA task calls the requestJobDocumentHandler repeatedly.

@Fienberber
Copy link
Author

@AniruddhaKanhere thank you for your response.

Unfortunately, it doesn't change anything. The processOTAEvents ins't even called after the Job creation...
I think the issue comes from the topic subscription.

@AniruddhaKanhere
Copy link
Member

@Fienberber, I am not sure why doesn't these lines do not call processOTAEvents.

Can you please put a breakpoint there? The prvOTADemoTask should call it repeatedly.

As far as topic subscription is considered, the Jobs topic is a special topic to which IoT core publishes even without subscription as I highlighted above here.

Thanks for your patience.

@Fienberber
Copy link
Author

@AniruddhaKanhere tank you for your response.

I've put a break point, actually it loops only 3 times, It it waiting for an event here.

The two previous events are : OtaAgentEventRequestJobDocument and then OtaAgentEventReceivedJobDocument as expected when the firmware starts.

But then it is just waiting for an event it never gets...

I don't know if it's expected but when OtaAgentEventReceivedJobDocument is received I run into this case (OtaPalJobDocFileCreateFailed).
The message received is on topic **/jobs/start-next/accepted with the content {"clientToken":"test","timestamp":1730280274}

@AniruddhaKanhere
Copy link
Member

Ah! I see the issue now. Can you please make the following change here?

/* Temp buffer.*/
    uint8_t buff[ sizeof( OtaEventMsg_t ) ];

-   retVal = xQueueReceive( otaEventQueue, &buff, portMAX_DELAY );
+   retVal = xQueueReceive( otaEventQueue, &buff, 10000 );

    if( retVal == pdTRUE )

The above should allow the loop to break in absence of any event. That way, when the OTA job is started, the device will ping IoT core.

Thank you for your patience as we try to figure out the core issue. Thanks!

@Fienberber
Copy link
Author

Fienberber commented Nov 5, 2024

Hi @AniruddhaKanhere, thank you, It did the trick !

I want to keep using the 202406.01-LTS-release so I made some quick modification to the example
If you have suggestion I'd be happy to improve what I've done.

    static OtaEvent_t lastRecvEventIdBeforeSuspend = OtaAgentEventStart;
    OtaEventMsg_t nextEvent = { 0 };

-    OtaReceiveEvent_FreeRTOS( &recvEvent );

+    {
+        EventGroupHandle_t eventGroup = xEventGroupCreate();
+        BlockingTaskParams taskParams = {
+            .eventGroup = eventGroup,
+            .recvEvent = &recvEvent
+        };
+        TaskHandle_t blockingTaskHandle;
+        // Create the blocking task
+        xTaskCreate(OtaReceiveEvent_FreeRTOS_InTask, "OtaReceiveEvent_FreeRTOS_InTask", otaconfigOTA_TASK_STACK_SIZE, &taskParams, otaconfigOTA_TASK_PRIORITY, &blockingTaskHandle);
+
+        // Wait for completion or timeout
+        EventBits_t result = xEventGroupWaitBits(eventGroup, TASK_COMPLETED_BIT, pdTRUE, pdFALSE, pdMS_TO_TICKS(TIMEOUT_MS));
+
+        if (result & TASK_COMPLETED_BIT) {
+            // Blocking function completed in time
+        } else {
+            // Timeout occurred, handle error
+            ESP_LOGI( TAG, "No events during timeout, ping IoT core." );
+            vTaskDelete(blockingTaskHandle);
+        }
+
+        // Ensure cleanup of the event group after usage
+        vEventGroupDelete(eventGroup);
+    }

    recvEventId = recvEvent.eventId;

In addition to

// Structure to pass multiple parameters to the blocking task
typedef struct {
    EventGroupHandle_t eventGroup;
    OtaEventMsg_t * recvEvent;
} BlockingTaskParams;

void OtaReceiveEvent_FreeRTOS_InTask(void *params) {
    BlockingTaskParams *taskParams = (BlockingTaskParams *)params;
    EventGroupHandle_t eventGroup = taskParams->eventGroup;
    // Call of the blocking when no events function
    OtaReceiveEvent_FreeRTOS( taskParams->recvEvent );
    
    // If function returns before timeout, signal completion
    xEventGroupSetBits(eventGroup, TASK_COMPLETED_BIT);
    vTaskDelete(NULL);  // Delete this task when done
}

@AniruddhaKanhere
Copy link
Member

@Fienberber, your solution works indeed.

But do note that creating a deleting a task is very resource intensive. That is why I suggested that you make the change xQueueReceive() call instead. We are working on adding this change in the upstream repo.

Thank you for your patience.

@jvasanthan
Copy link

@AniruddhaKanhere The changes you mentioned did not work. The timeout prints the following message.
E (105926) AWS_OTA: Failed to receive event from OTA Event Queue: xQueueReceive returned error: OtaOsStatus_t=130
but OTA is not triggered. I had to reset ESP for OTA to take place.
What am I missing here.

@AniruddhaKanhere
Copy link
Member

@jvasanthan, my above suggested changes are to be used with the previous changes mentioned here.

When both of these changes are made, then only OTA works. Let me know if that helps

@jvasanthan
Copy link

@AniruddhaKanhere after making those 2 changes, it works. Thankyou.

@RomanKhudobei
Copy link

I ended up subscribing to notify-next topic

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants