XNSIO
  About   Slides   Home  

 
Managed Chaos
Naresh Jain's Random Thoughts on Software Development and Adventure Sports
     
`
 
RSS Feed
Recent Thoughts
Tags
Recent Comments

Archive for the ‘Deployment’ Category

Upstream Connection Time Out Error in Nginx

Thursday, July 21st, 2011

Currently at Industrial Logic we use Nginx as a reverse proxy to our Tomcat web server cluster.

Today, while running a particular report with large dataset, we started getting timeouts errors. When we looked at the Nginx error.log, we found the following error:

[error] 26649#0: *9155803 upstream timed out (110: Connection timed out) 
while reading response header from upstream, 
client: xxx.xxx.xxx.xxx, server: elearning.industriallogic.com, request: 
"GET our_url HTTP/1.1", upstream: "internal_server_url", 
host: "elearning.industriallogic.com", referrer: "requested_url"

After digging around for a while, I discovered that our web server is taking more than 60 secs to respond. Nginx has a directive called proxy_read_timeout which defaults to 60 secs. It determines how long nginx will wait to get the response to a request.

In nginx.conf file, setting proxy_read_timeout to 120 secs solved our problem.

server {
    listen       80;
    server_name  elearning.industriallogic.com;
    server_name_in_redirect off;
    port_in_redirect        off;
 
    location / {
        proxy_set_header  X-Real-IP  $remote_addr;
        proxy_set_header  X-Real-Host  $host;
        proxy_read_timeout 120;
        ...
    }
    ...
}

Reverse DNS Lookup freaking out on Windows Server for Chinese IP Address

Sunday, July 10th, 2011

Recently an important client of Industrial Logic’s eLearning reported that access to our Agile eLearning website was extremely slow (23+ secs per page load.) This came as a shock; we’ve never seen such poor performance from any part of the world. Besides, a 23+ secs page load basically puts our eLearning in the category of “useless junk”.

From China From India

Notice that from China its taking 23.34 secs, while from any other country it takes less than 3 secs to load the page. Clearly the problem was when the request originated from China. We suspected network latency issues. So we tried a traceroute.

Sure enough, the traceroute does look suspicious. But then soon we realized the since traceroute and web access (http) uses different protocols, they could use completely different routes to reach the destination. (In fact, China has a law by which access to all public websites should go through the Chinese Firewall [The Great Wall]. VPN can only be used for internal server access.)

Ahh..The Great Wall! Could The Great Wall have something to do with this issue?

To nail the issue, we used a VPN from China to test our site. Great, with the VPN, we were getting 3 secs page load.

After cursing The Great Wall; just as we were exploring options for hosting our server inside The Great Wall, we noticed something strange. Certain pages were loading faster than others consistently. On further investigation, we realized that all pages served from our Windows servers were slower by at least 14 secs compared to pages served by our Linux servers.

Hmmm…somehow the content served by our Windows Server is triggering a check inside the Great Wall.

What keywords could the Great Wall be checking for?

Well, we don’t have any option other than brute forcing the keywords.

Wait a sec….we serve our content via HTTPS, could the Great Wall be looking for keywords inside a HTTPS stream? Hope not!

May be it has to do with some difference in the headers, since most firewalls look at header info to take decisions.

But after thinking a little more, it occurred to me that there cannot be any header difference (except one parameter in the URL and may be something in the Cookie.) That’s because we use Nginx as our reverse proxy. The actual content being served from Windows or Linux servers should be transparent to clients.

Just to be sure that something was not slipping by, we decided to do a small experiment. Have the exact same content served by both Windows and Linux box and see if it made any difference. Interestingly the exact same content served from Windows server is still slow by at least 14 secs.

Let’s look at the server response from the browser again:

Notice the 15 secs for the initial response to the submit request. This happens only when the request is served by the Windows Server.

We had to look deeper into where those 15 secs are coming from. So we decided to take a deeper look, by using some network analysis tool. And look what we found:

A 14+ sec response from our server side. However this happens only when the request is coming from China. Since our application does not have any country specific code, who else could be interfering with this? There are 3 possibilities:

  • Firewall settings on the Windows Server: It was easy to rule this out, since we had disabled the firewall for all requests coming from our Reverse Proxy Server.
  • Our Datacenter Network Settings: To prevent against DDOS Attacks from Chinese Hackers. A possibility.
  • Low level Windows Network Stack: God knows what…

We opened a ticket with our Datacenter. They responded back with their standard response (from a template) saying: “Please check with your client’s ISP.”

Just as I was loosing hope, I explained this problem to Devdas. When he heard 14 secs delay, he immediately told me that it sounds like a standard Reverse DNS Lookup timeout.

I was pretty sure we did not do any reverse DNS lookup. Besides if we did it in our code, both Windows and Linux Servers should have the same delay.

To verify this, we installed Wire Shark on our Windows servers to monitor Reverse DNS Lookup. Sure enough, nothing showed up.

I was loosing hope by the minute. Just out of curiosity, one night, I search our whole code base for any reverse DNS lookup code. Surprise! Surprise!

I found a piece of logging code, which was taking the User IP and trying to find its host name. That has to be the culprit. But then why don’t we see the same delay on Linux server?

On further investigation, I figured that our Windows Server did not have any DNS servers configured for the private Ethernet Interface we were using, while Linux had it.

Eliminated the useless logging code and configured the right DNS servers on our Windows Servers. And guess what, all request from Windows and Linux now are served in less than 2 secs. (better than before, because we eliminated a useless reverse DNS lookup, which was timing out for China.)

This was fun! Great learning experience.

Stop Over-Relying on your Beautifully Elegant Automated Tests

Tuesday, June 21st, 2011

Time and again, I find developers (including myself) over-relying on our automated tests. esp. unit tests which run fast and  reliably.

In the urge to save time today, we want to automate everything (which is good), but then we become blindfolded to this desire. Only later to realize that we’ve spent a lot more time debugging issues that our automated tests did not catch (leave the embarrassment caused because of this aside.)

I’ve come to believe:

A little bit of manual, sanity testing by the developer before checking in code can save hours of debugging and embarrassment.

Again this is contextual and needs personal judgement based on the nature of the change one makes to the code.

In addition to quickly doing some manual, sanity test on your machine before checking, it can be extremely important to do some exploratory testing as well. However, not always we can test things locally. In those cases, we can test them on a staging environment or on a live server. But its important for you to discover the issue much before your users encounter it.

P.S: Recently we faced Error CS0234: The type or namespace name ‘Specialized’ does not exist in the namespace ‘System.Collections’ issue, which prompted me to write this blog post.

Error CS0234: The type or namespace name ‘Specialized’ does not exist in the namespace ‘System.Collections’

Tuesday, June 21st, 2011

Industrial Logic’s eLearning has a feature where students can upload their programming exercise and get automated, personalized feedback. We do various server-side analysis (automated critique) of the student’s code to score them and to give them feedback about how well they performed in their programming exercises. To do this we need to compile their code on our server.

Recently we upgraded our .Net Compiler from version 2.0 to version 3.5. In version 2.0 we had to provide a reference to System.dll file (located in c:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\System.dll) for compiling. In version 3.5, they don’t have System.dll file. It was replaced by System.core.dll (located in c:\Program Files\Reference Assemblies\Microsoft\Framework\v3.5\System.Core.dll)

After making this change, when we ran the compiler using c:\WINDOWS\Microsoft.NET\Framework\v3.5\Csc.exe we got the following error:

Microsoft (R) Visual C# 2008 Compiler version 3.5.30729.1
for Microsoft (R) .NET Framework version 3.5
Copyright (C) Microsoft Corporation. All rights reserved.

SomeCSharpClass.cs: error CS0234: The type or namespace name ‘Specialized’ does not exist in the namespace ‘System.Collections’ (are you missing an assembly reference?)

It turns out that ‘System.collections.specialized’ namespace used to exist in System.dll in .net 2.0. In 3.5 System.Core.dll (which replaced the System.dll) does not contain it. Hence the compile time error.

We’ve fixed this issue by adding both System.Core.dll (v3.5) and System.dll (v2.0) to the compiler reference path. Not sure if this is the right thing to do. But it seems to work.

Presenting on “Continuous Deployment Demystified” at Bangalore Agile Group on May 5th

Friday, April 29th, 2011

Version Control Branching (extensively) Considered Harmful

Sunday, April 17th, 2011

Branching is a powerful tool in managing development and releases using a version control system. A branch produces a split in the code stream.

One recurring debate over the years is what goes on the trunk (main baseline) and what goes on the branch. How do you decide what work will happen on the trunk, and what will happen on the branch?

Here is my rule:

Branch rarely and branch as late as possible. .i.e. Branch only when its absolutely necessary.

You should branch whenever you cannot pursue and record two development efforts in one branch. The purpose of branching is to isolate development effort. Using a branch is always more involved than using the trunk, so the trunk should be used in majority of the cases, while the branch should be used in rare cases.

According to Ned, historically there were two branch types which solved most development problems:

  • Fixes Branch: while feature work continues on the trunk, a fixes branch is created to hold the fixes to the latest shipped version of the software. This allows you to fix problems without having to wait for the latest crop of features to be finished and stabilized.
  • Feature Branch: if a particular feature is disruptive enough or speculative enough that you don’t want the entire development team to have to suffer through its early stages, you can create a branch on which to do the work.

Continuous Delivery (and deployment) goes one step further and solves hard development problems and in the process avoiding the need for branching.

What are the problems with branching?

  • Effort, Time and Cognitive overhead taken to merge code can be actually significant. The later we merge the worse it gets.
  • Potential amount of rework and bugs introduced during the merging process is high
  • Cannot dynamically create a CI build for each branch, hence lack of feedback
  • Instead of making small, safe and testable changes to the code base, it encourages developers to make big bang, unsafe changes

P.S: I this blog, I’m mostly focusing on Client-Server (centralized) version control system. Distributed VCS (DVCS) have a very different working model.

Pharma Hack: Spammy Links visible to only Search Engine Bots in WordPress, CMS Made Simple and TikiWiki

Sunday, April 10th, 2011

Over the last 6 months, I’ve been blessed with various pharma hacks on almost all my site.

(http://agilefaqs.com, http://agileindia.org, http://sdtconf.com, http://freesetglobal.com, http://agilecoachcamp.org, to name a few.)

This is one of the most clever hacks I’ve seen. As a normal user, if you visit the site, you won’t see any difference. Except when search engine bots visit the page, the page shows up with a whole bunch of spammy links, either at the top of the page or in the footer. Sample below:

Clearly the hacker is after search engine ranking via backlinks. But in the process suddenly you’ve become a major pharma pimp.

There are many interesting things about this hack:

  • 1. It affects all php sites. WordPress tops the list. Others like CMS Made Simple and TikiWiki are also attacked by this hack.
  • 2. If you search for pharma keywords on your server (both files and database) you won’t find anything. The spammy content is first encoded with MIME base64 and then deflated using gzdeflate. And at run time the content is eval’ed in PHP.

This is how the hacked PHP code looks like:

If you inflate and decode this code it looks like:

  • 3. Well documented and mostly self descriptive code.
  • 4. Different PHP frameworks have been hacked using slightly different approach:
    • In WordPress, the hackers created a new file called wp-login.php inside the wp-includes folder containing some spammy code. They then modified the wp-config.php file to include(‘wp-includes/wp-login.php’). Inside the wp-login.php code they further include actually spammy links from a folder inside wp-content/themes/mytheme/images/out/’.$dir’
    • In TikiWiki, the hackers modified the /lib/structures/structlib.php to directly include the spammy code
    • In CMS Made Simple, the hackers created a new file called modules/mod-last_visitor.php to directly include the spammy code.
      Again the interesting part here is, when you do ls -al you see: 

      -rwxr-xr-x 1 username groupname 1551 2008-07-10 06:46 mod-last_tracker_items.php

      -rwxr-xr-x 1 username groupname 44357 1969-12-31 16:00 mod-last_visitor.php

      -rwxr-xr-x 1 username groupname 668 2008-03-30 13:06 mod-last_visitors.php

      In case of WordPress the newly created file had the same time stamp as the rest of the files in that folder

How do you find out if your site is hacked?

  • 1. After searching for your site in Google, check if the Cached version of your site contains anything unexpected.

  • 2. Using User Agent Switcher, a Firefox extension, you can view your site as it appears to Search Engine bot. Again look for anything suspicious.

  • 3. Set up Google Alerts on your site to get notification when something you don’t expect to show up on your site, shows up.

  • 4. Set up a cron job on your server to run the following commands at the top-level web directory every night and email you the results:
    • mysqldump your_db into a file and run
    • find . | xargs grep “eval(gzinflate(base64_decode(“

If the grep command finds a match, take the encoded content and check what it means using the following site: http://www.tareeinternet.com/scripts/decrypt.php

If it looks suspicious, clean up the file and all its references.

Also there are many other blogs explaining similar, but different attacks:

Hope you don’t have to deal with this mess.

Issues on Upgrading to Latest Version of CMS Made Simple

Saturday, April 2nd, 2011

I really like CMS Made Simple. Its pretty neat if you want to set up a website using a CMS software.

However, after being hacked last week, I just tried upgrading to the latest version (1.9.4) of CMS Made Simple. I was happy to see the upgrade script ran fine and everything seemed to work. However there were couple of things that were broken.

I’m documenting them here, hoping it might be useful to others.

1) The old config.php file contained the following property

#If you're using the internal pretty url mechanism or mod_rewrite, would you like to
#show urls in their hierarchy?  (ex. http://www.mysite.com/parent/parent/childpage)
$config['use_hierarchy'] = true;

However in the latest version, when they migrate the old version of config.php, this property is dropped.

Some plugins like Blogs Made Simple rely on this property for creating pretty URLs for RSS feeds.

2) We’ve written some code which looks up the current page id in the $gCms variable.

We had used the following code to figure out the page id:

$smarty->_tpl_vars['gCms']->variables['page_id'])

However in the latest version of CMSMS this does not work. Instead had to change it to:

$smarty->_tpl_vars['page_id'])

A Catalog of Papers about Large, Complex, Distributed Systems

Saturday, February 19th, 2011

My friend Srihari has been working on a great project called Systems We Make. His goal is to provide a consolidated catalog of various papers/articles about building large, complex, distributed system.

He personally searches, reads through and filters each the papers for you. This is awesome!

Please support his initiative.

Change WordPress Table Prefix using SQL Scripts

Tuesday, September 7th, 2010

WP Security Scan Plugin suggests that wordpress users should rename the default wordpress table prefix of wp_ to something else. When I try to do so, I get the following error:

Your User which is used to access your WordPress Tables/Database, hasn’t enough rights( is missing ALTER-right) to alter your Tablestructure. Please visit the plugin documentation for more information. If you believe you have alter rights, please contact the plugin author for assistance.

Even though the database user has all the required permissions, I was not successful.

Then I stumbled across this blog which shows how to manually update the table prefix.

Inspired by this blog I came up with the following steps to change wordpress table prefix using SQL Scripts.

1- Take a backup

You are about to change your WordPress table structure, it’s recommend you take a backup first.

mysqldump -uuser_name -ppassword -h host db_name > dbname_backup_date.sql

2- Edit your wp-config.php file and change

$table_prefix = ‘wp_’;

to something like

$table_prefix = ‘your_prefix_’;

3- Change all your WordPress table names

$mysql -uuser_name -ppassword -h host db_name
 
RENAME TABLE wp_blc_filters TO your_prefix_blc_filters;
RENAME TABLE wp_blc_instances TO your_prefix_blc_instances;
RENAME TABLE wp_blc_links TO your_prefix_blc_links;
RENAME TABLE wp_blc_synch TO your_prefix_blc_synch;
RENAME TABLE wp_captcha_keys TO your_prefix_captcha_keys ;
RENAME TABLE wp_commentmeta TO your_prefix_commentmeta;
RENAME TABLE wp_comments TO your_prefix_comments ;
RENAME TABLE wp_links TO your_prefix_links;
RENAME TABLE wp_options TO your_prefix_options;
RENAME TABLE wp_postmeta TO your_prefix_postmeta ;
RENAME TABLE wp_posts TO your_prefix_posts;
RENAME TABLE wp_shorturls TO your_prefix_shorturls;
RENAME TABLE wp_sk2_logs TO your_prefix_sk2_logs ;
RENAME TABLE wp_sk2_spams TO your_prefix_sk2_spams;
RENAME TABLE wp_term_relationships TO your_prefix_term_relationships ;
RENAME TABLE wp_term_taxonomy TO your_prefix_term_taxonomy;
RENAME TABLE wp_terms TO your_prefix_terms;
RENAME TABLE wp_ts_favorites TO your_prefix_ts_favorites ;
RENAME TABLE wp_ts_mine TO your_prefix_ts_mine;
RENAME TABLE wp_tweetbacks TO your_prefix_tweetbacks ;
RENAME TABLE wp_usermeta TO your_prefix_usermeta ;
RENAME TABLE wp_users TO your_prefix_users;
RENAME TABLE wp_yarpp_keyword_cache TO your_prefix_yarpp_keyword_cache;
RENAME TABLE wp_yarpp_related_cache TO your_prefix_yarpp_related_cache;

4- Edit wp_options table

UPDATE your_prefix_options SET option_name='your_prefix_user_roles' WHERE option_name='wp_user_roles';

5- Edit wp_usermeta

UPDATE your_prefix_usermeta SET meta_key='your_prefix_autosave_draft_ids' WHERE meta_key='wp_autosave_draft_ids';                 
UPDATE your_prefix_usermeta SET meta_key='your_prefix_capabilities' WHERE meta_key='wp_capabilities';                       
UPDATE your_prefix_usermeta SET meta_key='your_prefix_dashboard_quick_press_last_post_id' WHERE meta_key='wp_dashboard_quick_press_last_post_id'; 
UPDATE your_prefix_usermeta SET meta_key='your_prefix_user-settings' WHERE meta_key='wp_user-settings';                      
UPDATE your_prefix_usermeta SET meta_key='your_prefix_user-settings-time' WHERE meta_key='wp_user-settings-time';                 
UPDATE your_prefix_usermeta SET meta_key='your_prefix_usersettings' WHERE meta_key='wp_usersettings';                       
UPDATE your_prefix_usermeta SET meta_key='your_prefix_usersettingstime' WHERE meta_key='wp_usersettingstime';
    Licensed under
Creative Commons License