A large hospital had recently been experiencing issues with SCCM server replication and the SCCM Remote Console. It first noticed slow package replication, and the problems quickly expanded to the following list:
- SCCM 2007 Remote Console Loading Slowly
- SCCM 2007 Remote Console Not Connecting to Remote Clients
- SCCM 2007 Packages Taking Hours for Replication
- SCCM 2007 ‘My Computer’ freezing on server / drive not accessible
- SCCM 2007 Explorer Crashes on Central Server / Cannot View Inboxes\Auth\
The hospital, with over 12,000 computers, hosted a central server and two primary child sites (as shown above). All of these servers run Windows Server 2008 R2 and had been running normally since February 2012. Site_101 and Site_102 were both configured to replicate their record sets to the central server.
Please note that there is an additional WDS deployment server which is not shown in the above image as it is irrelevant to this discussion.
SCCM issue: Discovery Data Record (DDR) build-up
After a bit of research, it was determined that the issues above were caused by the configuration of the Discovery Methods. Site_100, Site_101, and Site_102 were each configured to perform a full Active Directory System Discovery, which generates discovery data records (DDRs), every 5 minutes against its own portion of the AD hierarchy.
In the hospital’s case, the DDRs were being created by the two child sites and pushed up to the Central Server. The Central Server had to (1) process its own DDRs and (2) process the DDRs from Site_101 and Site_102. It could not process the DDRs as fast as they were being generated, so the directory started to fill up with old discovery information. By the time the DDR build-up was discovered, the %drive%\Program Files (x86)\Microsoft Configuration Manager\inboxes\auth\ddm.box\ inbox held almost 22.1 million discovery data records.
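To make the backlog mechanics concrete, here is a rough Python sketch of a queue that receives more DDRs per discovery cycle than it can process. Only the 5-minute interval comes from the article; the per-cycle rates are hypothetical, since the number of DDRs produced per cycle and the Central Server's processing speed were not recorded.

```python
def backlog_after(cycles, ddrs_per_cycle, processed_per_cycle, start=0):
    """Return the number of unprocessed DDR files after `cycles` discovery cycles.

    Each cycle adds `ddrs_per_cycle` files and the server removes at most
    `processed_per_cycle` files; the backlog never drops below zero.
    """
    backlog = start
    for _ in range(cycles):
        backlog = max(0, backlog + ddrs_per_cycle - processed_per_cycle)
    return backlog


# 5-minute discovery interval (from the article) -> 288 cycles per day.
CYCLES_PER_DAY = 24 * 60 // 5

# Hypothetical rates, chosen only for illustration.
incoming = 12_000   # DDRs arriving at the Central Server per cycle (assumed)
capacity = 9_000    # DDRs the Central Server can process per cycle (assumed)

for day in (1, 7, 30):
    print(day, backlog_after(day * CYCLES_PER_DAY, incoming, capacity))
```

The point of the sketch is the shape, not the numbers: as long as the incoming rate exceeds the processing rate, the inbox grows by the difference on every cycle and never recovers on its own.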
The build-up of files had the following effects on the server:
- Disk Contention – Files were constantly being written and deleted (a delete is a disk write as well!), driving the disk queue to upwards of 200 operations. This crippled the RAID drive, which affected all SCCM components and rendered the Central Server useless.
- Windows Indexing Issues – Due to the sheer number of files (22.1 million), plus the need to keep track of files being added and removed, the indexing service could not keep up with the operations of the server. This caused Windows Explorer to hang and crash as it attempted to index the files.
To resolve the issue, the server was restarted in safe mode and the ddm.box directory was renamed to ddm.box.old. Ideally the files in ddm.box.old would have been deleted at that point, but because of the number of files, doing so would have extended the outage. To bring the SCCM Central Server back into operation, the server was restarted normally and the deletion was done outside of safe mode.
After the reboot completed, the Discovery Methods were immediately adjusted to the following:
- Central Server Site_100 Discovery Turned OFF – The Central Server is intended only for reporting and for configuring packages. The child sites provide the discovery data records to the Central Server.
- Site_101 Reconfigured – Full discovery of the site's Active Directory OUs was changed to run once a day at 12:00am (outside of the backup window). Delta discovery was enabled to run every 30 minutes.
- Site_102 Reconfigured – Full discovery of the site's Active Directory OUs was changed to run once a day at 11:00pm (outside of the backup window). Delta discovery was enabled to run every 30 minutes.
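A quick way to see the effect of the new schedule is to count discovery runs per day before and after the change. The Python sketch below uses only the intervals given in this post. It assumes that no delta discovery was scheduled before the change, which is not stated explicitly.

```python
MINUTES_PER_DAY = 24 * 60


def runs_per_day(interval_minutes):
    """Number of scheduled runs in 24 hours for a fixed interval."""
    return MINUTES_PER_DAY // interval_minutes


# (full interval, delta interval) in minutes; None means not scheduled.
before = {
    "Site_100": (5, None),
    "Site_101": (5, None),
    "Site_102": (5, None),   # delta assumed off before the change
}
after = {
    "Site_100": (None, None),            # discovery turned off
    "Site_101": (MINUTES_PER_DAY, 30),   # full once a day, delta every 30 minutes
    "Site_102": (MINUTES_PER_DAY, 30),
}


def totals(schedule):
    """Return (full runs per day, delta runs per day) across all sites."""
    full = sum(runs_per_day(f) for f, _ in schedule.values() if f)
    delta = sum(runs_per_day(d) for _, d in schedule.values() if d)
    return full, delta


print("before:", totals(before))
print("after: ", totals(after))
```

Under these assumptions the hierarchy goes from 864 full discoveries a day to 2, with 96 much lighter delta discoveries covering changes in between.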
To delete the files in the ddm.box.old directory, the following steps were performed:
- Disable AV Scanning of Inboxes – Symantec Endpoint Protection was scanning all files on the server. A directory exclusion for %drive%\Program Files (x86)\Microsoft Configuration Manager\inboxes\ was created to optimize operations on the server (and to help with the deletion process).
- Batch Deletion of Files – Unfortunately, Windows Explorer could not delete the files from the directory. Even with indexing turned off, Explorer still tries to index the files prior to deletion. Instead:
  - A new .bat file was created that changes the directory to %drive%\Program Files (x86)\Microsoft Configuration Manager\inboxes\auth\ddm.box.old\
  - The .bat file then issued the following command: del *.* /f /s
  - This command took 5 days to complete.
- Defragmentation of the Drive – After the deletion, the SCCM drive was 28% fragmented, so a disk defragmentation was run.
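For readers who would rather script the cleanup than use a .bat file, the same idea can be sketched in Python. This is not what was run at the hospital, and the function name is made up for illustration. It walks the directory lazily with os.scandir, so the multi-million-entry file list is never loaded into memory, and it prints a progress line periodically. Unlike del /f, it does not force deletion of read-only files.

```python
import os


def delete_files_streaming(root, progress_every=100_000):
    """Delete every regular file under `root` and return the number deleted.

    os.scandir yields entries lazily, so the full file list is never held
    in memory. Directories are kept; only regular files are removed.
    """
    deleted = 0
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)      # descend later, like del /s
                elif entry.is_file(follow_symlinks=False):
                    os.remove(entry.path)
                    deleted += 1
                    if deleted % progress_every == 0:
                        print(f"{deleted} files deleted")
    return deleted
```

Pointing it at the renamed ddm.box.old folder would be the equivalent of the batch job above. A script like this does not remove the underlying disk cost of millions of deletes, so a long run time should still be expected.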
Preventing and Monitoring the Issue
To catch any future file backlog on the server, a new PowerShell script was created to monitor the file counts. The script can be found here: PowerShell Script – SCCM 2007 Health Check
This script is designed to run in two modes:
- Morning General Health Status – To report warning level and critical level errors.
- Hourly Health Check – To report only critical level errors.
This script is run by Task Scheduler, which executes it at these intervals. Please note that for the hospital’s environment, it was determined that a critical error is raised if the file count exceeds 20,000 files or if a service is not running as expected.
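The linked script is PowerShell, but the threshold logic is simple enough to sketch here in Python. This is an illustration of the idea, not the actual health check: the 20,000-file critical threshold comes from this post, while the warning threshold and the example inbox names other than ddm.box are hypothetical. The service check is left out.

```python
import os

CRITICAL_FILES = 20_000  # critical threshold from the article
WARNING_FILES = 5_000    # hypothetical; the article gives no warning threshold


def count_files(path):
    """Count regular files directly inside `path` without building a list."""
    count = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                count += 1
    return count


def classify(file_count):
    """Map a file count to 'ok', 'warning', or 'critical'."""
    if file_count > CRITICAL_FILES:
        return "critical"
    if file_count > WARNING_FILES:
        return "warning"
    return "ok"


def report(inbox_counts, mode):
    """Return (inbox, level) pairs to report for the given mode.

    'morning' reports warnings and criticals; 'hourly' reports criticals only.
    """
    wanted = {"morning": {"warning", "critical"}, "hourly": {"critical"}}[mode]
    levels = {name: classify(n) for name, n in inbox_counts.items()}
    return [(name, level) for name, level in levels.items() if level in wanted]


# Example with made-up counts; in practice each value would come from count_files().
counts = {"ddm.box": 25_000, "example_a.box": 6_000, "example_b.box": 10}
print(report(counts, "morning"))
print(report(counts, "hourly"))
```

Keeping the hourly run limited to critical results means the hourly task only raises alerts that need immediate action, while the morning run gives the fuller picture.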