Newer 2477D and 2334-222 intermittently respond to wrong All-Link command


  • Newer 2477D and 2334-222 intermittently respond to wrong All-Link command

    Hi fellow Insteon hackers! Apologies for writing a novella here, but I am hoping to get some input on what I believe is very odd behavior.

    Quick intro: I am a ~15 year Insteon user and hacker at my own home. I've been developing a PHP-based controller application called Footprint which I hope to make public at some point in the future. My installation at my rural home consists of ~100 active Insteon devices and ~40 Insteon groups/scenes. My application uses a 2413S PLM to interface with the devices. Overall, I'd describe my setup as mature and working very well. I consider myself to have an advanced understanding of the PLM and the Insteon hardware, including ALDB commands and concepts. My application manages all links and scenes programmatically. And almost all of my devices today are I2CS but my application can programmatically manage I1 and I2 style link databases accurately and efficiently. With all the device management and scene links, my PLM currently hosts ~500 total links for the installation.

    Last year, we finished a remodel where we added a new master bedroom to the house, which involved installing a half dozen new 2477D and 2334-222 devices, with some new scenes (e.g. SwitchLinc Dimmers linked to Keypad buttons). Since adding these latest devices and scenes, I have been chasing a very weird problem where the "Master Bed Closet" scene (group 19) intermittently responds to a group 22 ON command sent by the PLM. It only happens once every other day or so, but unfortunately when it does, it's early in the morning and right next to my wife's bedside -- not good in the WAF department! The issue tends to occur in the morning because the "Ceiling Fans" scene (group 22) is usually triggered in the morning hours by the HVAC system calling for heat. Also, I should add that group 22 cycles on at least a dozen times a day since it is triggered by our thermostat calling for heat.

    Here are the two most recent examples of the devices belonging to Insteon group 19 (0x13) responding to an All-Link ON command sent to Insteon group 22 (0x16):

    Dec 26 05:19:00 <user.info> lion php: [footprint] transportInsteon: Tx ALL-Link command 0X11|0X00 to group 22 (0261161100) (ACK, 125 ms)
    Dec 26 05:19:02 <user.info> lion php: [footprint] transportInsteon: Rx Standard Group Cleanup Direct Message 0X11|0X13 from Master Bedroom - Entry (ACK, 1/4 hops, 1941 ms)
    Dec 26 05:19:03 <user.info> lion php: [footprint] transportInsteon: Rx Standard Group Cleanup Direct Message 0X11|0X13 from Master Bedroom - Closet (ACK, 1/4 hops, 1949 ms)


    Dec 28 04:27:00 <user.info> lion php: [footprint] transportInsteon: Tx ALL-Link command 0X11|0X00 to group 22 (0261161100) (ACK, 125 ms)
    Dec 28 04:27:01 <user.info> lion php: [footprint] transportInsteon: Rx Standard Group Cleanup Direct Message 0X11|0X16 from Master Bedroom - Ceiling Fan (ACK, 1/4 hops, 1271 ms)
    Dec 28 04:27:02 <user.info> lion php: [footprint] transportInsteon: Rx Standard Group Cleanup Direct Message 0X11|0X16 from Bedroom - Ceiling Fan (ACK, 1/3 hops, 884 ms)
    Dec 28 04:27:03 <user.info> lion php: [footprint] transportInsteon: Rx Standard Group Cleanup Direct Message 0X11|0X16 from Office - Ceiling Fan (ACK, 1/3 hops, 2004 ms)
    Dec 28 04:27:04 <user.info> lion php: [footprint] transportInsteon: Rx Standard Group Cleanup Direct Message 0X11|0X16 from Living - Ceiling Fan (ACK, 1/3 hops, 2009 ms)
    Dec 28 04:27:05 <user.info> lion php: [footprint] transportInsteon: Rx Standard Group Cleanup Direct Message 0X11|0X13 from Master Bedroom - Entry (ACK, 1/4 hops, 3100 ms)
    Dec 28 04:27:07 <user.info> lion php: [footprint] transportInsteon: Rx Standard Group Cleanup Direct Message 0X11|0X13 from Master Bedroom - Closet (ACK, 1/4 hops, 3110 ms)


    In the first example, none of the actual members of group 22 responded to the command. In the second example, a mixture of the intended and unintended recipients responded to the command. Not sure if this means anything or not.
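
    A minimal sketch of how the log lines above can be read, assuming the frame layout 0x02, 0x61, group, command 1, command 2 for the PLM's ALL-Link command, and assuming the second value in each logged pair (e.g. 0X11|0X13) is the group number carried by the cleanup message. The function names are hypothetical and are not taken from Footprint.

```python
def parse_all_link_command(hex_string):
    """Split a 5-byte PLM ALL-Link command frame into its fields."""
    raw = bytes.fromhex(hex_string)
    # Assumed layout: 0x02 (start), 0x61 (send ALL-Link), group, cmd1, cmd2
    if len(raw) != 5 or raw[0] != 0x02 or raw[1] != 0x61:
        raise ValueError("not a 0x02 0x61 ALL-Link command frame")
    return {"group": raw[2], "cmd1": raw[3], "cmd2": raw[4]}


def cleanup_matches(sent, cleanup_cmd1, cleanup_group):
    """True when a cleanup message names the same command and group that were sent."""
    return sent["cmd1"] == cleanup_cmd1 and sent["group"] == cleanup_group


sent = parse_all_link_command("0261161100")
print(sent["group"])                      # 22
print(cleanup_matches(sent, 0x11, 0x16))  # True  (ceiling fan devices)
print(cleanup_matches(sent, 0x11, 0x13))  # False (Master Bedroom - Entry / Closet)
```

    Read this way, every 0X11|0X13 line in the two examples is a cleanup for group 19 arriving after a command addressed to group 22.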

    As you fellow hackers might understand, when faced with this challenge I threw all assumptions of stability to the wind and embraced it as an opportunity to find a bug in my own code. I've completed an exhaustive audit of all levels and layers in my application. I looked for erroneous ALDB records, missing bin2hex/hex2bin translations, PLM communication timing issues, etc. and have come up absolutely empty-handed. I've used HouseLinc2 to independently check the links of both the PLM and the devices, thinking something was missing. I've done physical device factory resets and reprograms. While I have found and fixed a handful of interesting bugs in my code, the situation remains unchanged and I have not been able to find anything responsible for this behavior. And what I have absolutely confirmed at this point is that my PLM is sending one and only one 0261161100 command at the time that these group 19 members choose to intermittently respond. So I am reaching out to the community here in hopes that someone can either (a) share similar experiences that were hardware related or (b) help me find a stone I have left unturned.

    To recap, here are the variables as I see them:

    1) The PLM and device ALDB maintain 2 independent groups (22, 19) and the correct controller and responder links are in the right places.
    2) The PLM receives a 0261161100 from the controller application.
    3) Group 22 devices reliably respond to the command (e.g. >99% of the time).
    4) Group 19 devices intermittently respond to the same command (e.g. <10% of the time).
    5) This "interplay" does not occur anywhere else on the property between any other devices or groups.

    Thank you in advance for your time and assistance here...

    Best,
    Daniel

  • #2
    I am going to keep updating this thread as I further troubleshoot this issue. Still hoping for some community feedback though.

    Last night, I decided to update my PLM to see if that might help. I bought a new 2413U (rev 2.3) on Amazon and replaced my 2413S (rev 2.1) with it. Unfortunately, this morning I saw the same issue had occurred again. Because of the configuration of my server rack in the garage where my controller is installed, I needed to use a 16' USB cable (Amazon Basics, double shielded, gold connectors). Everything seems to work just fine; however, I did notice that the quick start guide says to keep the USB cable length under 10', and ideally 6'. Could this possibly have something to do with the issue? I realize in hindsight that even with my previous PLM (2413S), I had extended the serial cable to around the same length using a quality female-female CAT5 coupler for the same reason...



    • #3
      They are using the RS-232 protocol, so distance does matter; however, at a 19200 baud rate you should be able to go ~50 ft (~15 m). If they are calling out <10 ft in the manual, it's probably best to follow the instructions.



      • #4
        Thanks SeanM. Everything seems to be working fine with the 16’ cable but I will get a 10’ and optimize soon.

        The original issue reported still occurs.



        • #5
          SeanM - I switched to a 10' USB cable a couple days after my last post and have been running on it for the last month. I have also spent a great deal of time going through every single aspect of my software related to programmatic ALDB link management. I've fixed numerous bugs, optimized timing issues, and performed a lot of manual testing and verification that things are working as I expect them to. Unfortunately, even after all this, I'm still seeing the oddball behavior originally reported. And FWIW, none of the software bugs I fixed had any effect on the original devices responding to the wrong group ON command.

          To boost the WAF at home, I've written a routine which handles these "rogue" group cleanup messages by comparing their group number to recently initiated group commands, and then reverting the device back to its last state (e.g. OFF) when it occurs. This at least quickly turns the light back off after it erroneously responds to the ON for the group it does not belong to. Not ideal, but a workaround at least.
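
          The workaround described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual Footprint routine: the class name, the 10-second window, and the `send_direct` callback are all hypothetical, and a cleanup is treated as "rogue" only when some group command was sent recently and the cleanup names a different group.

```python
import time


class RogueCleanupDetector:
    """Tracks recently sent group commands and flags cleanups for other groups."""

    def __init__(self, window_seconds=10.0, clock=time.monotonic):
        self.window_seconds = window_seconds  # assumed window, tune to taste
        self.clock = clock
        self.recent = {}  # group number -> time the group command was sent

    def record_group_command(self, group):
        self.recent[group] = self.clock()

    def is_rogue(self, cleanup_group):
        now = self.clock()
        # Forget group commands that fall outside the time window.
        self.recent = {g: t for g, t in self.recent.items()
                       if now - t <= self.window_seconds}
        if not self.recent:
            return False  # nothing sent recently, so nothing to compare against
        return cleanup_group not in self.recent


def handle_cleanup(detector, device, cleanup_group, last_states, send_direct):
    """Revert the device to its last known state if the cleanup looks rogue."""
    if detector.is_rogue(cleanup_group):
        send_direct(device, last_states.get(device, "OFF"))
        return True
    return False
```

          In use, `record_group_command(22)` would be called whenever the group 22 command is sent, and `handle_cleanup` would be called for each received cleanup message with the group number it carries.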

          One thought that occurred to me was the possible significance of ALDB group numbering which I may be overstepping somehow. According to the INSTEON documentation, valid group numbers can be 0x00-0xFF (dec 0-255) but with 0xFF representing all devices:
          An ALL-Link Group Number of 0xFF denotes all devices linked to a Controller. Responders interpret an ALL-Link Group Number of 0xFF in the To Address low byte as matching any stored ALL-Link Group Number.
          It was based on this information that I concluded that the PLM could effectively use any group number in that range to control other devices. As such, I settled on the following ALDB group numbering scheme for my software to manage:
          • 0x00: Device Registration links (e.g. PLM as controller, device as responder, to satisfy I2CS requirements; never used for any commands or scene control)
          • 0x0C-0xE6: Scene Control links (e.g. PLM as controller, devices as responder, for user-defined scenes in my software)
          • 0xFE: Device All On/Off Scene links (e.g. PLM as controller, device as responder if the device has been flagged in my software via a user interface control)
          Though theoretically not required, this scheme actually reserves groups 0x01-0x0B since those group numbers are commonly used in links between devices (e.g. corresponding to button numbers on SwitchLincs, Keypads, etc.). And as for the special group 0xFF discussed in the documentation above, my software is aware of this group by way of a defined constant, but does not actually use it at all since my software-defined group 0xFE serves a similar, but more precise purpose.
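
          For reference, the numbering scheme above can be written as a small lookup. This is only a sketch of the ranges as listed; the category labels are made up for illustration, and "unassigned" is an assumed label for 0xE7-0xFD, which the scheme does not mention.

```python
def classify_group(group):
    """Map an ALL-Link group number to its role in the scheme described above."""
    if not 0x00 <= group <= 0xFF:
        raise ValueError("group number must fit in one byte")
    if group == 0x00:
        return "registration"      # PLM-to-device link for I2CS, never commanded
    if 0x01 <= group <= 0x0B:
        return "reserved"          # left free for device-to-device button groups
    if 0x0C <= group <= 0xE6:
        return "scene"             # user-defined scenes
    if group == 0xFE:
        return "all-on-off"        # software-defined All On/Off scene
    if group == 0xFF:
        return "match-any"         # documented special group, not used
    return "unassigned"            # 0xE7-0xFD: not covered by the scheme


print(classify_group(19))  # scene
print(classify_group(22))  # scene
```

          Both groups involved in the problem (19 and 22) fall in the ordinary scene range, so neither touches the special values 0x00, 0xFE, or 0xFF.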

          Were my assumptions correct here? Any issues with the scheme I've laid out above?

