Best Practices
The following the recommended best practices for Escape Online. They are not requirements. We developed them based on Microsoft recommendations.
General
Update SQL Server 2005 to the latest service pack (SP3)
We strongly recommend that you put the database and backups on separate backup devices. Otherwise, if the device that contains the database fails, your backups will be unavailable. Also, putting the data and backups on separate devices optimizes the I/O performance for both the production use of the database and writing backups.
Data and Log Files
Data and log files should be placed on separate drives. Placing both data and log files on the same device can cause contention for that device and result in poor performance. Placing the files on separate drives allows the I/O activity to occur at the same time for both the data and log files.
When you create a new database, specify separate drives for the data and log. To move files after the database is created, the database must be taken offline. Move the files by using one of the following methods:
- Restore the database from backup by using the RESTORE DATABASE statement with the WITH MOVE option.
- Detach and then attach the database specifying separate locations for the data and log devices.
- Specify a new location by running the ALTER DATABASE statement with the MODIFY FILE option, and then restarting the instance of SQL Server.
Tempdb Database
When the tempdb database is heavily used, SQL Server may experience contention when it tries to allocate pages. Depending on the degree of contention, this may cause SQL Server to be unresponsive for short periods of time.
This rule checks to see if there are more than eight CPUs or cores. If there are, it checks to see how many data files are in the tempdb database filegroup. If any of the following are true, this rule is raised:
- Number of processors or cores is greater than or equal to 8 and the trace flag -T1118 is off.
- Number of tempdb data files is less than the number of processors.
- tempdb data file sizes are not equal.
Reducing I/O and Allocation Bottlenecks
Following best practices recommendations can reduce I/O bottlenecks and allocation bottlenecks.
- Increase the number of data files in tempdb to maximize disk bandwidth and to reduce contention in allocation structures. As a general guideline, create one data file per CPU or core. Each file should be set to the same size. This enables the proportional fill algorithm to distribute the allocation load uniformly with minimal contention.
- For instances of SQL Server 2000, upgrade to SP4, or apply the fixes described in Knowledge Base article 328551: Concurrency enhancements for the tempdb database.
- For instances of SQL Server 2005, consider implementing the best practice recommendations in Working with tempdb in SQL Server 2005.
If the previous steps do not significantly reduce the allocation contention and the contention is on SGAM pages, implement trace flag -T1118. Under this trace flag, SQL Server allocates full extents to each database object, thereby eliminating the contention on SGAM pages. Note that this trace flag affects every database on the instance of SQL Server. For information about how to determine whether the allocation contention is on SGAM pages, see "Monitoring contention caused by DML operations" in Working with tempdb in SQL Server 2005. To address immediate problems, use trace flag -T1118; however, we recommend that you consider implementing longer-term, scalable, solutions. For example, you might have to modify queries to limit the use of table variables, temp tables, or cursors.
tempdb Size and Placement Recommendations
To achieve optimal tempdb performance, we recommend the following configuration for tempdb in a production environment:
- Set the recovery model of tempdb to SIMPLE. This model automatically reclaims log space to keep space requirements small.
For more information, see ALTER DATABASE (Transact-SQL) or How to: View or Change the Recovery Model of a Database (SQL Server Management Studio).
- Allow for tempdb files to automatically grow as required. This allows for the file to grow until the disk is full.
TIP:If the production environment cannot tolerate the potential for application time-outs that may occur during autogrow operations, preallocate space to allow for the expected workload.
- Set the file growth increment to a reasonable size to avoid the tempdb database files from growing by too small a value. If the file growth is too small, compared to the amount of data that is being written to tempdb, tempdb may have to constantly expand. This will affect performance. We recommend the following general guidelines for setting the FILEGROWTH increment for tempdb files.
| tempdb file size |
FILEGROWTH increment |
| 0 to 200 MB |
100 MB |
| 200 MB or more |
200 MB |
You may have to adjust this percentage based on the speed of the I/O subsystem on which the tempdb files are located. To avoid potential latch time-outs, we recommend limiting the autogrow operation to approximately two minutes. For example, if the I/O subsystem can initialize a file at 50 MB per second, the FILEGROWTH increment should be set to a maximum of 6 GB, regardless of the tempdb file size. If possible, use instant database file initialization to improve the performance of autogrow operations.
Preallocate space for all tempdb files by setting the file size to a value large enough to accommodate the typical workload in the environment. This prevents tempdb from expanding too frequently, which can affect performance. The tempdb database should be set to autogrow, but this should be used to increase disk space for unplanned exceptions.
Create as many files as needed to maximize disk bandwidth. Using multiple files reduces tempdb storage contention and yields significantly better scalability. However, do not create too many files because this can reduce performance and increase management overhead. As a general guideline, create one data file for each CPU on the server (accounting for any affinity mask settings) and then adjust the number of files up or down as necessary. Note that a dual-core CPU is considered to be two CPUs.
Make each data file the same size; this allows for optimal proportional-fill performance.
Put the tempdb database on a fast I/O subsystem. Use disk striping if there are many directly attached disks.
Put the tempdb database on disks that differ from those that are used by user databases.
Event Viewer and SQL Logs
Search the error logs for problems, and take corrective action.
When you examine the event log, you should use a subtractive approach: remove messages that are know to be harmless, instead of searching for the known errors in the following list. We recommend the subtractive approach because the set of possible indicators of I/O issues is far larger than the following list, and also because SQL Server cannot know about third-party drivers that might log their own custom error messages. The following list contains information about some errors that are known to be more common than most.
| Source |
Error |
Notes |
<any> |
The device, \Device\Scsi\cpqcissm1, did not respond within the time-out period. |
For more information, see Troubleshooting event ID 9, 11, and 15 on Cluster Servers and How to troubleshoot event ID 9, event ID 11, and event ID 15 error messages. |
Disk |
The driver detected a controller error on \Device\Harddisk4\DR4. |
For more information, see Troubleshooting event ID 9, 11, and 15 on Cluster Servers and How to troubleshoot event ID 9, event ID 11, and event ID 15 error messages. |
Disk |
The device, \Device\Harddisk14\DR14, is not ready for access yet. |
For more information, see Troubleshooting event ID 9, 11, and 15 on Cluster Servers. |
SaveDump |
The computer has rebooted from a bugcheck. The bugcheck was: ... |
Operating system stops and a blue screen appears. Contact Microsoft Windows Support. |
Disk |
An error was detected on device \Device\Harddisk3\DR3 during a paging operation. This error indicates an I/O error has occurred during a hard page fault. |
For more information, see Support for multiple clusters attached to the same SAN device and Support for booting from a Storage Area Network (SAN). |
ClusSvc |
Cluster disk resource Disk J:: is corrupt. Running ChkDsk /F to repair problems. |
Can be the result of SCSI host adapter configuration issues or a malfunctioning device. For more information, see Troubleshooting event ID 9, 11, and 15 on Cluster Servers, and Troubleshooting Multiple Cluster Symptoms on the Same SAN. |
Ntfs |
The file system structure on the disk is corrupt and unusable. Please run the chkdsk utility on the volume F:. |
Be aware that there is at least one case in which this error is incorrectly raised. For more information, Chkdsk.exe Is Scheduled After You Run Disk Defragmenter on Your NTFS Partition. |
Disk |
Data was recovered using error correction code on device \Device\Harddisk5\DR5. |
Indicates hardware failure. |
EventLog |
The previous system shutdown at <time> on <date> was unexpected. |
Typically indicates a hard server cycle after the operating system stops responding or a blue screen appears. Could also indicate something more mundane, such as a power failure if the system is not protected by an uninterruptible power supply. |
Ftdisk |
{Lost Delayed-Write Data} The system was attempting to transfer file data from buffers to \Device\HarddiskVolume4. The write operation failed, and only some of the data may have been written to the file. |
Indicates a failed I/O request. This could be caused by anything from a firmware bug to faulty SCSI cables. For more information, see Troubleshooting Multiple Cluster Symptoms on the Same SAN and Support for multiple clusters attached to the same SAN device. |
WMIxWDM |
Any of the following:
- Machine check event reported is a fatal error
- Machine check event reported is a corrected error
- Machine Check Event reported is a fatal memory hierarchy error
- Machine Check Event reported is a fatal TLB error
- Machine Check Event reported is a fatal Bus or Interconnect error
- Machine Check Event reported is a fatal Bus or
- Interconnect timeout error Machine Check Event reported is a fatal internal watchdog timer error
- Machine Check Event reported is a fatal Microsoft ROM parity error
- Machine Check Event reported is a fatal condition A processor received an external signal that an unrecoverable error has occurred
- Machine Check Event reported is a fatal functional redundancy check error
|
Hardware fault. For help, contact your hardware manufacturer. For more information, see Event IDs 106 and 107 may be logged in the System log on a Windows Server 2003-based computer. |
|