Section 12.4. The Fault Tree | Using Samba: A File and Print Server for Linux, Unix & Mac OS X, 3rd Edition

12.4. The Fault Tree

The fault tree presented in this section is for diagnosing and fixing problems that occur when you're installing and reconfiguring Samba. Before you set out to troubleshoot any part of the Samba suite, you should know the following information:

Your client IP address (we use 192.168.236.10)
Your server IP address (we use 192.168.236.86)
The netmask for your network (typically 255.255.255.0)
Whether the systems are all on the same subnet (our example systems are)

For clarity, we've renamed the server in the following examples to server.example.com, and the client system to client.example.com.
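Whether two systems share a subnet follows directly from their addresses and the netmask. The following is a minimal sketch using Python's standard ipaddress module and the example addresses from the list above; the function name same_subnet is our own, not part of any Samba tool.

```python
import ipaddress

def same_subnet(addr_a, addr_b, netmask):
    """Return True if both addresses fall in the same network under netmask."""
    # strict=False lets us pass a host address rather than a network address.
    net_a = ipaddress.ip_network(f"{addr_a}/{netmask}", strict=False)
    net_b = ipaddress.ip_network(f"{addr_b}/{netmask}", strict=False)
    return net_a == net_b

# Client and server addresses used throughout this section.
print(same_subnet("192.168.236.10", "192.168.236.86", "255.255.255.0"))
```

If the function returns False for your own addresses, the multiple-network cases described later under "Testing connections with ping" are more likely to apply.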

12.4.1. How to Use the Fault Tree

Start the tests here, without skipping forward; it won't take long (about five minutes) and might actually save you time backtracking. Whenever a test succeeds, you will be given the name of a section to which you can safely skip.

12.4.2. Troubleshooting Low-Level IP

The first series of tests covers the low-level services that Samba needs in order to run. The tests in this section verify that:

The IP software works
The Ethernet hardware works
Basic name service is in place

Subsequent sections add the Samba daemons smbd and nmbd, host-based access control, authentication and per-user access control, file services, and browsing. The tests are described in considerable detail to make them understandable by both technically oriented end users and experienced systems and network administrators.

Beware of firewalls! The Windows XP SP2 firewall prevents the host from answering basic network requests such as ping. For these tests, consider disabling any firewall on both the client and server if possible.

12.4.2.1. Testing the networking software with ping

The first command to enter on both the server and the client is ping 127.0.0.1. This tries to send data to the loopback address and indicates whether any networking support is functioning. On both Windows and Unix, you can run ping 127.0.0.1 from a command shell and usually interrupt it after a few lines. Here is an example on a Linux server:

 $ ping 127.0.0.1
 PING localhost: 56 data bytes
 64 bytes from localhost (127.0.0.1): icmp-seq=0. time=1. ms
 64 bytes from localhost (127.0.0.1): icmp-seq=1. time=0. ms
 64 bytes from localhost (127.0.0.1): icmp-seq=2. time=1. ms
 ^C
 ----127.0.0.1 PING Statistics----
 3 packets transmitted, 3 packets received, 0% packet loss
 round-trip (ms) min/avg/max = 0/0/1

Some versions of ping let you set a limit on how many times it makes the round trip, so you don't have to interrupt the command manually. For instance, on Linux you could enter ping -c 5 127.0.0.1 to stop automatically after five transmissions.

If you get ping: no answer from _ ._ ._ ._ or 100% packet loss, IP networking is not working on the system. The address 127.0.0.1 is the internal loopback address and doesn't depend on the computer being physically connected to a network, so a failure here is a local problem: TCP/IP isn't installed, it's misconfigured, or a firewall is blocking ICMP packets. See your operating system documentation if it's a Unix server. If it's a Windows client, follow the instructions in Chapter 3 to install networking support.
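If you want to script this check, the packet-loss figure can be read out of ping's summary line. The sketch below assumes a summary containing the text "N% packet loss", as in the example output above; other ping implementations may word it differently. The function name packet_loss_percent is hypothetical.

```python
import re

def packet_loss_percent(ping_output):
    """Return the packet-loss percentage from ping's summary, or None if absent."""
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", ping_output)
    return float(match.group(1)) if match else None

# Summary lines copied from the example transcript in this section.
sample = """\
----127.0.0.1 PING Statistics----
3 packets transmitted, 3 packets received, 0% packet loss
round-trip (ms) min/avg/max = 0/0/1
"""

loss = packet_loss_percent(sample)
print("loopback OK" if loss == 0 else "loopback problem")
```

In practice you would feed the function the captured output of a bounded run such as ping -c 5 127.0.0.1 rather than a fixed string.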

If you're the network manager, some good references are TCP/IP Network Administration, by Craig Hunt, and Windows Server 2003 Network Administration, by Craig Hunt and Roberta Bragg, both published by O'Reilly. An excellent resource for understanding the TCP/IP protocol suite is Richard Stevens' TCP/IP Illustrated, Vol. 1 (Addison-Wesley).

12.4.2.2. Testing local name services with ping

Next, try to ping localhost from a shell on the Samba server. The name localhost is the conventional hostname for the 127.0.0.1 loopback interface, and it should resolve to that address. After typing ping localhost, you should see output similar to the following:

 $ ping localhost
 PING localhost: 56 data bytes
 64 bytes from localhost (127.0.0.1): icmp-seq=0. time=0. ms
 64 bytes from localhost (127.0.0.1): icmp-seq=1. time=0. ms
 64 bytes from localhost (127.0.0.1): icmp-seq=2. time=0. ms
 ^C

If this succeeds, try the same test on the client. Otherwise:

If you get unknown host: localhost, there is a problem resolving the hostname localhost to a valid IP address. (This might be as simple as a missing entry in a local hosts file.) From here, skip down to the section "Troubleshooting Name Services" later in this chapter.
If you get "ping: no answer," or "100% packet loss," but pinging 127.0.0.1 worked, the name service is resolving localhost to an incorrect address. Check the file or database (typically /etc/hosts on a Unix system) that the name service is using to resolve addresses to ensure that the entry is correct.
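Both failure cases come down to what the hosts file (or other database) says about a name. As a rough illustration, the following sketch looks a name up in hosts-format text. The sample entries are hypothetical, and the function addresses_for is our own helper; it only understands the simple "address name [alias...]" layout of /etc/hosts.

```python
def addresses_for(hosts_text, name):
    """Return every address that hosts-format text maps to the given name."""
    found = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        if name in names:
            found.append(address)
    return found

# Hypothetical hosts file contents for the example network.
sample = """\
127.0.0.1       localhost
192.168.236.86  server.example.com server
"""

print(addresses_for(sample, "localhost"))
```

An empty list corresponds to the unknown host case, and any address other than 127.0.0.1 for localhost corresponds to the incorrect-address case.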

12.4.2.3. Testing the networking hardware with ping

Next, ping the server's network IP address from itself. This should get you exactly the same results as pinging 127.0.0.1:

 $ ping 192.168.236.86
 PING 192.168.236.86: 56 data bytes
 64 bytes from 192.168.236.86 (192.168.236.86): icmp-seq=0. time=1. ms
 64 bytes from 192.168.236.86 (192.168.236.86): icmp-seq=1. time=0. ms
 64 bytes from 192.168.236.86 (192.168.236.86): icmp-seq=2. time=1. ms
 ^C
 ----192.168.236.86 PING Statistics----
 3 packets transmitted, 3 packets received, 0% packet loss
 round-trip (ms) min/avg/max = 0/0/1

If this test works on the server, repeat it for the client. Otherwise:

If ping network_ip fails on either the server or client, but ping 127.0.0.1 works on that system, you have a TCP/IP problem that is specific to the Ethernet network interface card on the computer. Check with the documentation for the network card or host operating system to determine how to configure it correctly. However, be aware that on some operating systems, the ping command appears to work even if the network is disconnected, so this test doesn't always diagnose all hardware problems.

12.4.2.4. Testing connections with ping

Now, ping the server by name (instead of its IP address), once from the server and once from the client. This is the general test to determine whether your network is working:

 $ ping server
 PING server.example.com: 56 data bytes
 64 bytes from server.example.com (192.168.236.86): icmp-seq=0. time=1. ms
 64 bytes from server.example.com (192.168.236.86): icmp-seq=1. time=0. ms
 64 bytes from server.example.com (192.168.236.86): icmp-seq=2. time=1. ms
 ^C
 ----server.example.com PING Statistics----
 3 packets transmitted, 3 packets received, 0% packet loss
 round-trip (ms) min/avg/max = 0/0/1

If successful, this test tells you four things:

The hostname (e.g., server) is being found by your local name server.
The hostname has been expanded to the full name (e.g., server.example.com).
The host's address is being returned (192.168.236.86).
The client and server can successfully send and receive packets to each other.

If this test isn't successful, one of several things can be wrong with the network:

First, if you get ping: no answer, or 100% packet loss, you're not connecting to the network, the other system isn't connecting or isn't responding, or one of the addresses is incorrect. Verify that the server does not have an active firewall preventing it from receiving or replying to ICMP packets. Also check the addresses that the ping command reports on each system, and ensure that they match the ones you set up initially.
If the addresses do not match, try entering the command arp -an, and see whether there is an entry for the other system. (The arp command stands for the Address Resolution Protocol. The arp -an command lists all the addresses known on the local system.) Here are some things to try:
- If you receive a message like 192.168.236.86 at (incomplete), the Ethernet address of 192.168.236.86 is unknown. This message indicates a complete lack of connectivity, and you're likely having a problem at the very bottom of the TCP/IP protocol stack: the Ethernet interface layer.
- If you receive a response similar to server (192.168.236.86) at 8:0:20:12:7c:94, the server has been reached at some time, or another system is answering on its behalf. However, this means that ping should have worked: because it hasn't, you may have an intermittent networking or ARP problem.
- If the IP address from ARP doesn't match the address you expected, investigate and correct the addresses manually.
- If each system can ping itself but not another, something is wrong on the network between them.
If you get ping: network unreachable or ICMP Host Unreachable, you're not receiving an answer, and more than one network is probably involved.
It is much simpler to deal with hosts on the same subnet. However, networking, like life, is not always ideal. At this point, it is time to rely on your (or a good friend's) TCP/IP administration skills. Check the default gateway settings on the host and verify that the router IP address is correct. Then try to ping the gateway. If this fails, go through the troubleshooting steps for failing to ping the server from the client or vice versa. This failure is at the network layer. If you can ping the router, but not the host on the other side of it, use the gateway's documentation to verify that packets are successfully being routed from one network to the other.
If possible though, try to test a server and client that are on the same network:
1. First, perform the tests for ping: no answer described earlier in this section. If these don't help you identify the problem, the remaining possibilities are that an address is wrong, your netmask is wrong, a network is down, or the packets have been stopped by a firewall.
2. Check both the address and the netmasks on source and destination systems to see whether something is obviously wrong. Assuming that both systems really are on the same network, they both should have the same netmasks, and ping should report the correct addresses. If the addresses are wrong, you'll need to correct them. If they are correct, the programs might be confused by an incorrect netmask. See the section "Netmasks" later in this chapter.
3. If the commands are still reporting that the network is unreachable and neither of the previous two conditions is in error, one network really might be unreachable from the other. This is a general networking issue; if you have a separate network manager at your site, he or she may have to investigate.
If you get ICMP Administratively Prohibited, you've struck a firewall of some sort or a misconfigured router. Again, if you have a separate network manager at your site, that person may have to investigate.
If you get ICMP Host redirect but ping reports that packets are getting through, this is generally harmless: you're simply being rerouted over the network.
If you get a host redirect and no ping responses, you are being redirected, but no one is responding. Treat this situation just like the Network unreachable response, and check your addresses and netmasks.
If you get ICMP Host Unreachable from gateway hostname, ping packets are being routed to another network, but the other system isn't responding and the router is reporting the problem on its behalf. Again, treat this like a Network unreachable response, and start checking addresses and netmasks.
If you get ping: unknown host hostname, your system's name is not known. This tends to indicate a name service problem, which didn't affect localhost. Have a look at "Troubleshooting Name Services," later in this chapter.
If you get partial success, with some pings failing but others succeeding, you have either an intermittent problem between the systems or an overloaded network. Ping a bit longer, and see whether more than about 3 percent of the packets fail. If so, take the necessary steps to reduce the network problem or contact a network manager if that role is fulfilled by someone other than you. However, if only a few fail, or if you happen to know some massive network program is running, don't worry. The TCP/IP suite of protocols is able to compensate for the occasional lost packets.
If you get a response such as smtsvr.antares.net is alive when you actually pinged server.example.com, either you're using someone else's address or the system has multiple names and addresses. If the address is wrong, the name service is clearly the culprit; you'll need to change the address in the name service database to refer to the correct system. This is discussed in "Troubleshooting Name Services," later in this chapter.
Servers are often multihomed, i.e., connected to more than one network, with different names on each net. If you are getting a response from an unexpected name on a multihomed server, look at the address and see whether it's on your network (see the section "Netmasks" later in this chapter). If so, you should use that address, rather than one on a different network, for both performance and reliability reasons.
Servers can also have multiple names for a single Ethernet address, especially if they are web servers. This is harmless, albeit startling. You should probably use the official (and permanent) name, rather than an alias that might change.
If everything works but the IP address reported is 127.0.0.1, you have a name service error. This error typically occurs when an operating system installation program generates an /etc/hosts line similar to 127.0.0.1 localhost hostname.domainname. The localhost line should say something similar to 127.0.0.1 localhost or 127.0.0.1 localhost.localdomain. Correct it, lest it cause failures in the negotiations over who is the master browse list holder and who is the master browser. It can also cause hard-to-diagnose errors in later tests.

If this command works from the server, repeat it from the client.
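The failure messages covered in this section can be summarized as a small lookup table. The sketch below is only a rough aid: the substrings are taken from the messages quoted above, the advice strings paraphrase this section, and the names RULES and next_step are hypothetical. Real ping implementations vary in their exact wording.

```python
# (substring, next step) pairs, checked in order against lowercased ping output.
RULES = [
    ("unknown host", "name service problem: see Troubleshooting Name Services"),
    ("administratively prohibited", "firewall or misconfigured router"),
    ("network unreachable", "check addresses, netmasks, and the default gateway"),
    ("host unreachable", "check addresses, netmasks, and the default gateway"),
    ("no answer", "check connectivity, addresses, firewalls, and arp -an"),
    ("100% packet loss", "check connectivity, addresses, firewalls, and arp -an"),
]

def next_step(ping_output):
    """Return the suggested next step for the first known message found."""
    text = ping_output.lower()
    for pattern, advice in RULES:
        if pattern in text:
            return advice
    return "no known failure message found"

print(next_step("ping: unknown host server"))
```

The table does not cover the cases that need human judgment, such as partial packet loss or a reply from an unexpected hostname.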

12.4.3. Troubleshooting Server Daemons

Once you've confirmed that basic networking is working properly, the next step is to make sure that the daemons are running on the server. This determination takes three separate tests, because no single one of the following tests can decisively prove that everything is functioning properly.

To be sure that the daemons are running, you need to find out whether they:

Have started
Are registered or bound to a TCP/IP port by the operating system
Are actually listening for incoming connections

12.4.3.1. Tracking daemon startup

First, check the Samba logs. If you've started the daemons, the message smbd version release started should appear. If it doesn't, you need to restart the Samba daemons.

If the daemon reports that it has indeed started, look out for bind failed on port XXX socket_addr=0 (Address already in use). This means another daemon has been started on port 139 or 445 (smbd). Also, nmbd will report a similar failure if it cannot bind to port 137 or 138. Either you've started a daemon twice, or the inetd server has tried to provide a daemon for you.^[*] If it's the latter, we'll diagnose that in a moment.

^[*] We use the name inetd for the TCP/IP metadaemon or superserver that runs on many Unix and Linux servers; the actual command is usually either inetd or the newer xinetd.

Another useful trick for locating a startup failure is to start the failing service from the command line and monitor its progress. All Samba daemons support the -i command-line option for just such a purpose. Combined with a high debug level dumping to standard output, this option should help you to locate the exact point of the startup failure. The following example illustrates the message displayed when you try to launch smbd when a previous instance was still running:

 $ smbd -d 10 -i
 ....
 ERROR: smbd is already running. File /var/run/smbd.pid exists
    and process id 31654 is running.
 talloc report on 'null_context' (total 453 bytes in 73 blocks)
         lp_talloc      contains      453 bytes in    72 blocks

From here, you can check the process listing to verify whether the existing process is in fact smbd. It is possible that a previous instance of smbd has exited but not cleaned up its pid file, and that another process exists with that same pid. Use the ps command on the server with the "long" option for your system type (commonly ps ax or ps -ef), and see whether smbd and nmbd are already running. This often looks like the following:

 $ ps ax | grep mbd
 31654 ?        Ss     0:00 smbd -D -d3
 31656 ?        Ss     0:02 nmbd -D -d3
 31657 ?        S      0:00 smbd -D -d3

This example illustrates that smbd and nmbd have already started as standalone daemons (the -D option) at log level 3 (-d3).
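To confirm that the process ID named in the pid file really belongs to smbd, you can compare it against the process listing. The following sketch assumes the five-column ps ax layout shown above (PID, TTY, STAT, TIME, COMMAND); ps -ef uses different columns and would need a different index. The function name daemon_pids is hypothetical.

```python
def daemon_pids(ps_output, names=("smbd", "nmbd")):
    """Map each daemon name to the PIDs whose command is that daemon."""
    pids = {name: [] for name in names}
    for line in ps_output.splitlines():
        fields = line.split()
        # Expect: PID TTY STAT TIME COMMAND [ARGS...]; skip headers and short lines.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        command = fields[4].rsplit("/", 1)[-1]   # strip any leading path
        if command in pids:
            pids[command].append(int(fields[0]))
    return pids

# Process listing copied from the example in this section.
sample = """\
31654 ?  Ss  0:00 smbd -D -d3
31656 ?  Ss  0:02 nmbd -D -d3
31657 ?  S   0:00 smbd -D -d3
"""

running = daemon_pids(sample)
print(running)
print(31654 in running["smbd"])   # is the PID from smbd.pid really smbd?
```

If the PID from the pid file is not in the smbd list, the pid file is probably stale and some unrelated process has reused the number.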

12.4.3.2. Looking for daemons bound to ports

Both smbd and nmbd have to register with the operating system so that they can get access to the necessary TCP/IP ports. The netstat command will tell you if this has been done. Run the command netstat -a on the server, and look for lines mentioning netbios-ns (137/udp), netbios-dgm (138/udp), netbios-ssn (139/tcp), or microsoft-ds (445/tcp). With netstat -an, as in the following example, the ports are shown as numbers rather than service names:

 $ netstat -an | egrep ':(137|138|139|445)'
 tcp        0      0 *:139        *:*        LISTEN
 tcp        0      0 *:445        *:*        LISTEN
 udp        0      0 *:137        *:*
 udp        0      0 *:138        *:*

Although you may see additional lines listed, there should be at least two UDP lines, one for the NetBIOS name service port (137) and one for the NetBIOS datagram service (138), indicating that the nmbd server is registered and (we hope) is waiting to answer requests. There should also be at least one TCP line for each of the values of the smb ports parameter in smb.conf. The default value includes both ports 139 and 445, so frequently you will see a TCP line for each one. Additionally, these ports should be in the LISTEN state. This means that smbd is up and waiting to accept connections.

There might be other TCP lines indicating connections from smbd to clients, one for each client. These are usually in the ESTABLISHED state. If there are smbd lines in the ESTABLISHED state, smbd is definitely running. If there is only one line in the LISTEN state, you can't be sure yet. If both of the lines are missing, a daemon has not succeeded in starting, so it's time to check the logs, and then go back to Chapter 2.

If there is a line for each client, it might be coming either from a Samba daemon or from the meta-daemon, inetd. It's quite possible that your inetd startup file contains lines that start Samba daemons without your realizing it; for instance, although such behavior is becoming increasingly rare, the lines might have been placed there if you installed Samba as part of a Linux distribution. The daemons started by inetd prevent ours from running. This problem typically produces log messages such as bind failed on port XXX socket addr=0 (Address already in use).

Check your /etc/inetd.conf file or /etc/xinetd.d/ directory; unless you're intentionally starting the daemons from there, any servers bound to the netbios-* or microsoft-ds ports should be disabled. Refer to Chapter 2 for details concerning Samba and inetd.
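The netstat check described in this section can also be scripted. The sketch below assumes Linux-style netstat -an output, with the local address in the fourth column, and the default smb ports value of 139 and 445; the names EXPECTED and bound_ports are our own.

```python
import re

# Ports nmbd (UDP) and smbd (TCP) are expected to hold with default settings.
EXPECTED = {("udp", 137), ("udp", 138), ("tcp", 139), ("tcp", 445)}

def bound_ports(netstat_output):
    """Return the expected (protocol, port) pairs that appear as bound or listening."""
    seen = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or not fields[0].startswith(("tcp", "udp")):
            continue
        proto = fields[0][:3]                         # treat tcp6/udp6 as tcp/udp
        match = re.search(r"[:.](\d+)$", fields[3])   # port at end of local address
        if not match:
            continue
        if proto == "tcp" and "LISTEN" not in fields:
            continue   # ESTABLISHED lines are client connections, not listeners
        pair = (proto, int(match.group(1)))
        if pair in EXPECTED:
            seen.add(pair)
    return seen

# Output copied from the example in this section.
sample = """\
tcp  0  0 *:139  *:*  LISTEN
tcp  0  0 *:445  *:*  LISTEN
udp  0  0 *:137  *:*
udp  0  0 *:138  *:*
"""

missing = EXPECTED - bound_ports(sample)
print(sorted(missing))
```

An empty result means all four ports are held by some process. As noted above, it does not tell you whether that process is a Samba daemon or inetd.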

12.4.3.3. Checking smbd with telnet

The easiest way to test that the smbd server is actually working is to send it a meaningless message and see if it is rejected. Try something such as the following:

 $ echo "hello" | telnet localhost 139
 Trying 192.168.236.86 ...
 Connected to server.
 Escape character is '^]'.
 Connection closed by foreign host.

This command sends an erroneous but harmless message to smbd. If you get a Connected message followed by a Connection closed message, the test was a success. You have an smbd daemon listening on the port and rejecting improper connection messages. On the other hand, if you get telnet: connect: Connection refused, most likely no daemon is present. A less likely explanation is that you have attempted to connect to the wrong port. Remember that the ports used by smbd are controlled by the smb ports option. Make sure you use one of these ports. If all else fails, check the logs and go back to Chapter 2.
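If telnet isn't installed, roughly the same probe can be made with a few lines of Python. This is a sketch rather than a replacement for the telnet test: it reports only whether the TCP connection was accepted, and the function name port_accepts is hypothetical.

```python
import socket

def port_accepts(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port is accepted."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return False   # refused, timed out, or the name could not be resolved
    with conn:
        try:
            conn.sendall(b"hello\n")   # meaningless to smbd, as in the telnet test
        except OSError:
            pass   # the peer may already have dropped the connection
    return True

if __name__ == "__main__":
    # Port 139 is one of the default smb ports values; adjust to match smb.conf.
    print(port_accepts("localhost", 139))
```

A result of False corresponds to the Connection refused case: most likely no daemon is listening on that port.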

Regrettably, there isn't an easy test for nmbd. If the telnet test and the netstat test both say that an smbd is running, there is a good chance that netstat will also be correct about nmbd running. nmbd is tested further later in this chapter when we troubleshoot network browsing problems.

12.4.3.4. Testing daemons with testparm

Once you know there's a daemon, you should always run testparm, in hopes of getting something such as the following:

 $ testparm
 Load smb config files from /usr/local/samba/lib/smb.conf
 Processing section "[homes]"
 Processing section "[printers]"
 ...
 Processing section "[tmp]"
 Loaded services file OK.
 ...

The testparm program normally reports the processing of a series of sections and responds with Loaded services file OK if it succeeds. If there is something wrong with the file, testparm reports one or more of the following messages, which also appear in the logs as noted:

WARNING: You have some share names that are longer than 12 characters.: This error is for anyone using Windows Me and older clients. They fail to connect to shares with long names.
WARNING: [name] service MUST be printable!: A printer share lacks a print ok = yes option.
WARNING: No path in service name - making it unavailable!: Current versions of Samba disable any service other than [homes] that does not have an explicit path set.
NOTE: name is flagged unavailable: Just a reminder that you have used the available = no option in a share.
Can't find include file [name]: A configuration file referred to by an include option did not exist. If you were including the file unconditionally, this is an error and probably a serious one: the share will not have the configuration you intended. If you were including it based on one of the % variables, such as %a (architecture), you must decide whether, for example, a missing Windows XP configuration file is a problem. It often isn't.
Can't copy service name, unable to copy to itself.: You tried to copy an smb.conf section into itself.
Unable to copy service - source not found: [name]: Indicates a missing or misspelled section in a copy = option.
Ignoring unknown parameter name.: Typically indicates an obsolete, misspelled, or unsupported option.
Global parameter name found in service section.: Indicates that a global-only parameter has been used in an individual share. Samba ignores the parameter.

After the first testparm test, repeat it with (exactly) three parameters: the name of your smb.conf file, the name of your client, and its IP address:

 $ testparm /usr/local/samba/lib/smb.conf client 192.168.236.10

This command runs one more test that checks the hostname and address against the hosts allow and hosts deny options, and might produce Allow connection from hostname to service and/or Deny connection from hostname to service messages for the client system. A Deny connection message indicates that the hosts allow and/or hosts deny options in your smb.conf prohibit access from the client system.

12.4.4. Troubleshooting SMB Connections

Now that you know the servers are up, you need to make sure that they're running properly. Start by placing a simple smb.conf file in the /usr/local/samba/lib directory.

12.4.4.1. A minimal smb.conf file

In the following tests, we assume that you have a [temp] share suitable for testing, plus at least one valid user account (we'll use one named rose). An smb.conf file that includes just these is as follows:

 [global]
     workgroup = EXAMPLE
     security = user
 [homes]
     read only = no
 [temp]
     path = /data/tmp
     read only = no

12.4.4.2. Testing locally with smbclient

The first test ensures that the server can list its own services (shares). Run the command smbclient -L localhost -N to anonymously connect to the server from itself. You should see the following:

 $ smbclient -L localhost -N
 Anonymous login successful
 Domain=[EXAMPLE] OS=[Unix] Server=[Samba 3.0.22]
     Sharename       Type      Comment
     ---------       ----      -------
     temp            Disk
     homes           Disk
     IPC$            IPC       IPC Service (Samba 3.0.22)
 ...

If you received this output or something similar, move on to the next section, "Testing connections with smbclient." On the other hand, if you receive an error, check the following:

If you get Connection to localhost failed, either you've spelled its name wrong or there actually is a problem (which should have been seen back in "Testing local name services with ping"). In the latter case, move on to the section "Troubleshooting Name Services," later in this chapter.
If you get Error connecting to xxx.xxx.xx.xx (Connection refused), the server was found, but it wasn't running an smbd daemon. Skip back to "Troubleshooting Server Daemons," earlier in this chapter, and retest the daemons.
If you're using inetd (or xinetd) instead of standalone daemons, be sure to check your /etc/inetd.conf (or xinetd configuration files) and /etc/services entries against their manpages for errors as well.
If you get the message NT_STATUS_ACCESS_DENIED, you aren't permitted access to the server. This could mean you have a hosts allow option that doesn't include the server or a hosts deny option that does. Recheck with the command testparm smb.conf your_hostname your_ip_address (see the section "Testing daemons with testparm"), and correct any unintended prohibitions. The error could also be caused by a restrict anonymous setting in smb.conf.

12.4.4.3. Testing connections with smbclient

Run the command smbclient //server/temp -U rose to connect to the server's [temp] share and to see if you can connect to a file service. We assume that a valid account for the user named rose has already been created. You should get the following response:

 $ smbclient //server/temp -U rose
 Password: <enter password>
 Domain=[EXAMPLE] OS=[Unix] Server=[Samba 3.0.22]
 smb: \> quit

If you get Get_Hostbyname: Unknown host name or Connect error: Connection refused, see the previous section, "Testing locally with smbclient," for the possible diagnoses.

Now, at the Password: prompt, provide the password for the account given as the -U argument value. If you then get an smb: \> prompt, the connection works. Enter quit and continue on to the next section, "Testing connections with net use."

A response of NT_STATUS_LOGON_FAILURE indicates either that you are using an invalid account name or that the password you used didn't match the credentials for the account. It is a good idea to verify that the account exists by running pdbedit --verbose rose.

An error message referring to NT_STATUS_BAD_NETWORK_NAME can be caused by any one of the following:

A wrong share name: you might have spelled it wrong, it might be too long, it might be in mixed case, or it might not be available. Check that it's what you expect with testparm (see the earlier section, "Testing daemons with testparm").
An invalid users or valid users option in your smb.conf file that doesn't allow your account to connect. Recheck using testparm smb.conf your_hostname your_ip_address (see the earlier section, "Testing daemons with testparm").
A hosts allow option that doesn't include the server, or a hosts deny option that does. Also test this with testparm.
There is one more reason for this failure that has nothing at all to do with passwords: the path parameter in your smb.conf file might point somewhere that doesn't exist. This will not be diagnosed by testparm. You will have to check it manually.
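The manual path check in the last item can be scripted. The following sketch reads smb.conf-style text with a deliberately simple parser and reports shares whose path is not an existing directory. It is an approximation under stated assumptions: it ignores include files, % variable substitution, line continuations, and the directory synonym for path. The function names share_paths and missing_paths are hypothetical.

```python
import os

def share_paths(conf_text):
    """Return {section: path} for each section that sets a path option."""
    paths, section = {}, None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":          # skip blanks and comments
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip()
        elif "=" in line and section is not None:
            key, value = (part.strip() for part in line.split("=", 1))
            if key.lower() == "path":
                paths[section] = value
    return paths

def missing_paths(conf_text):
    """Return the subset of share paths that are not existing directories."""
    return {s: p for s, p in share_paths(conf_text).items() if not os.path.isdir(p)}

# The minimal configuration used in this section.
sample = """\
[global]
    workgroup = EXAMPLE
    security = user
[homes]
    read only = no
[temp]
    path = /data/tmp
    read only = no
"""

print(share_paths(sample))
print(missing_paths(sample))   # lists [temp] unless /data/tmp exists on this host
```

Run it on the server itself, against the same smb.conf that testparm loads, since the paths are only meaningful there.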

Once you have connected to [temp] successfully, repeat the test, this time logging in to your home directory (e.g., connect to the network path //server/rose). If you have to change anything to get that to work, retest [temp] again afterward.

12.4.4.4. Testing connections with net use

Run the following command on the Windows client to see whether it can connect to the server:

 C:\> net use * \\server\temp /user:rose

You should be prompted for a password whether or not the password for rose on the Samba server differs from the one you used to log on to the Windows console. Once the correct password has been transmitted, you should see the response:

 The command was completed successfully.

If that worked, congratulations! You have completed all of these tests successfully, and your server should be ready to accept connections from users. Otherwise:

If you get The specified shared directory cannot be found, or Cannot locate specified share name, the directory name is either misspelled or not in the smb.conf file.
If you get The computer name specified in the network path cannot be located or Cannot locate specified computer, the server name has been misspelled, the name service has failed, there is a networking problem, or the hosts deny option includes your host.
- If it is not a spelling mistake, you need to double back at least to the section "Testing connections with smbclient" to investigate why it doesn't connect.
- If smbclient does work, there is a name service problem with the client name service, and you need to proceed to the section "Testing the server with nmblookup" and see whether you can look up both the client and server with nmblookup.
If you get The password is invalid for \\server\temp, verify that you are using the correct credentials. If you provide your password again and it still fails, your password is not being matched on the server, or possibly the configuration file has a valid users or invalid users list denying you permission.
You might have the NetBEUI protocol bound to the Microsoft client. This often produces long timeouts and erratic failures and is known to have caused failures to accept passwords in the past. Unless you absolutely need the NetBEUI protocol, remove it.